<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1202">
  <Title>Sense-Tagging Chinese Corpus</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Tagging, which adds lexical, syntactic, or semantic information to raw text, makes language materials more valuable. Research on part-of-speech (POS) tagging has a long history and has achieved very good results. Many POS-tagged corpora are available, and POS-tagging accuracy is in the range of 95% to 97% 1. In contrast, although research on word sense disambiguation (WSD) also began early (Kelly and Stone, 1975), large-scale sense-tagged corpora are relatively few. For English, only a few sense-tagged corpora are available, such as HECTOR (Atkins, 1993), DSO (Ng and Lee, 1996), SEMCOR (Fellbaum, 1997), and SENSEVAL (Kilgarriff, 1998).</Paragraph>
    <Paragraph position="1"> In evaluating WSD systems, the first SENSEVAL (Kilgarriff and Rosenzweig, 2000) reported that performance on a fine-grained word sense disambiguation task is around 75%.</Paragraph>
    <Paragraph position="2"> 1 The performance includes tagging unambiguous words. Marshall (1987) reported that the performance of the CLAWS tagger is 94%.</Paragraph>
    <Paragraph position="3"> Approximately 65% of words were tagged unambiguously, and the disambiguation program achieved better than 80% success on the ambiguous words.</Paragraph>
    <Paragraph position="4"> Tagging accuracy depends on several factors (Manning and Schutze, 1999), e.g., the amount of training data, the granularity of the tag set, the occurrence of unknown words, and so on.</Paragraph>
    <Paragraph position="5"> Three approaches have been proposed for WSD: the dictionary/thesaurus-based approach, supervised learning, and unsupervised learning. They differ mainly in the kinds of resources they use, i.e., dictionary versus text corpus, and sense-tagged corpus versus untagged corpus. For a good survey, see (Ide and Veronis, 1998). Unlike English, Chinese has no large-scale sense-tagged corpus. The widely available corpus is the Academia Sinica Balanced Corpus, abbreviated as ASBC hereafter (Huang and Chen, 1995), which is a POS-tagged corpus.</Paragraph>
    <Paragraph position="6"> Thus, a computer-aided tool for sense-tagging Chinese corpora is indispensable.</Paragraph>
    <Paragraph position="7"> This paper presents a sense tagger for Mandarin Chinese. It is organized as follows.</Paragraph>
    <Paragraph position="8"> Section 2 discusses the degree of polysemy in Mandarin Chinese from several viewpoints.</Paragraph>
    <Paragraph position="9"> Section 3 presents WSD algorithms for tagging ambiguous words and unknown words.</Paragraph>
    <Paragraph position="10"> Section 4 shows our experimental results.</Paragraph>
    <Paragraph position="11"> Finally, Section 5 presents concluding remarks.</Paragraph>
  </Section>
</Paper>