<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-2011">
  <Title>Semantic classification of Chinese unknown words</Title>
  <Section position="2" start_page="0" end_page="4" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The biggest problem in assigning semantic categories to words lies in the incompleteness of dictionaries. It is impractical to construct a dictionary that contains every word that may occur in a previously unseen corpus. This issue is particularly problematic for natural language processing (NLP) applications that work with Chinese texts. Specifically, for the Sinica Corpus, Bai, Chen and Chen (1998) found that, on average, 3.51% of the words in an article were not listed in the Chinese Electronic Dictionary of 80,000 words. Because novel words are created daily, it is impossible to collect them all. Furthermore, across most corpora, many of these newly coined words seem to be used only once, and thus they may not even be worth collecting. However, the occurrence of unknown words makes a number of NLP tasks, such as segmentation and word sense disambiguation, more difficult. Consequently, it would be valuable to have some means of automatically assigning meaning to unknown words. This paper describes a classifier that assigns semantic thesaurus categories to unknown Chinese words.</Paragraph>
    <Paragraph position="1"> Caraballo's (1999) system used contextual information to assign nouns to their hyponyms. Roark and Charniak (1998) used the co-occurrence of words as features to classify nouns. While context is clearly an important feature, this paper focuses on non-contextual features, which may play a key role for unknown words that occur only once and hence have limited context. (Note: The Sinica Corpus is a balanced corpus containing five million part-of-speech-tagged words in Mandarin Chinese.)</Paragraph>
    <Section position="1" start_page="2" end_page="4" type="sub_section">
      <SectionTitle>
Computational Linguistics Society of R.O.C.
</SectionTitle>
      <Paragraph position="0"> The feature I focus on, following Ciaramita (2002), is morphological similarity to words whose semantic category is known. Ciaramita (2002) augmented a lexical acquisition system with simple morphological rules and found a significant improvement. This finding suggests that a reliable source of semantic information lies in the morphology used to construct unknown words.</Paragraph>
      <Paragraph position="1"> In Chinese morphology, the two ways to generate new words are compounding and affixation.</Paragraph>
      <Paragraph position="2"> Orthographically, compounding and affixation are represented by combinations of characters. As a result, the character combinations and the morpho-syntactic relationships that link them can serve as clues for classification. Furthermore, my analysis of the Sinica Corpus (Table 1) indicates that only 49.68% of monosyllabic words have a single word class, whereas 91.67% of multisyllabic words do. Once characters are merged into a word, only 8.33% of words remain ambiguous. This implies that as characters are combined, the degree of ambiguity tends to decrease.</Paragraph>
      <Paragraph position="3"> Table 1. The ambiguity distribution of monosyllabic and multisyllabic words (surviving row: more than 4, 10.89%, 0.06%). The remainder of this paper is organized as follows: Section 2 introduces the CiLin thesaurus, Section 3 provides an analysis of unknown words in the Sinica Corpus, and Section 4 details the algorithm used for semantic classification and explains the results.</Paragraph>
      <Paragraph position="4"> A 'monosyllabic word' is a word consisting of a single character, and a 'multisyllabic word' is a word consisting of more than one character.</Paragraph>
    </Section>
  </Section>
</Paper>