<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1026">
  <Title>Automatic Semantic Classification for Chinese Unknown Compound Nouns</Title>
  <Section position="2" start_page="0" end_page="173" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The occurrence of unknown words causes difficulties in natural language processing. The word set of a natural language is open-ended. There is no way of collecting every word of a language, since new words are continually created to express new concepts and new inventions. Therefore, how to identify new words in a text is the most challenging task for natural language processing. This is especially true for Chinese. Each Chinese morpheme (usually a single character) carries meaning, and most are polysemous. New words are easily constructed by combining morphemes, and their meanings are the semantic composition of their morpheme components.</Paragraph>
    <Paragraph position="1"> Of course, there are exceptions in the form of semantically non-compositional compounds. In Chinese text, there are no blanks to mark word boundaries, and no inflectional or capitalization markers to denote the syntactic or semantic types of new words.</Paragraph>
    <Paragraph position="2"> Hence unknown word identification for Chinese has become one of the most difficult and demanding research topics.</Paragraph>
    <Paragraph position="3"> The syntactic and semantic categories of unknown words can in principle be determined from their content and contextual information. However, many difficult problems have to be solved. First, it is not possible to find a uniform representational schema and categorization algorithm to handle different types of unknown words, since each type has a very different morpho-syntactic structure. Second, the clues for identifying different types of unknown words are also different. For instance, the identification of Chinese names relies heavily on surnames, which form a limited set of characters.</Paragraph>
    <Paragraph position="4"> Statistical methods are commonly used for identifying proper names (Chang et al. 1994, Sun et al. 1994). The identification of general compounds relies more on the morphemes and the semantic relations between morphemes.</Paragraph>
    <Paragraph position="5"> There are co-occurrence restrictions between the morphemes of compounds, but their relations are irregular and mostly due to common-sense knowledge. The third difficulty is the problem of ambiguity, including structural, syntactic and semantic ambiguities.</Paragraph>
    <Paragraph position="6"> For instance, a morpheme character/word usually has multiple meanings and syntactic categories. Therefore ambiguity resolution has become one of the major tasks.</Paragraph>
    <Paragraph position="7"> Compound nouns are the most frequently occurring unknown words in Chinese text.</Paragraph>
    <Paragraph position="8"> According to an inspection of the Sinica corpus (Chen et al. 1996), 3.51% of the word tokens in the corpus are unknown, i.e. they are not listed in the CKIP lexicon, which contains about 80,000 entries. Among them, about 51% of the word types are compound nouns, 34% are compound verbs and 15% are proper names. In this paper we focus our attention on the identification of compound nouns. We propose a representation model that facilitates identifying, disambiguating and evaluating the structure of a compound noun. In fact, this model can also be extended to handle compound verbs.</Paragraph>
    <Section position="1" start_page="0" end_page="173" type="sub_section">
      <SectionTitle>
1.1 General properties of compounds and their identification strategy
</SectionTitle>
      <Paragraph position="0"> Semantic and syntactic categories are closely related, particularly at the coarse-grained level. For instance, nouns denote entities, active verbs denote events and stative verbs denote states. For fine-grained analysis, syntactic and semantic classifications use different criteria. In our model the coarse-grained analysis is performed first: the syntactic categories of an unknown word are predicted first, and its possible semantic categories are then identified according to its top-ranked syntactic categories. Different syntactic categories require different representational models and different fine-grained semantic classification methods.</Paragraph>
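The coarse-to-fine order described in this paragraph can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the score dictionary and the placeholder fine-grained classifiers are all hypothetical. Only the ordering (syntactic ranking first, then a category-specific semantic step) and the noun/entity, active verb/event, stative verb/state correspondence come from the text.

```python
# Toy sketch of a coarse-to-fine classification order.
# All names, scores and classifiers here are hypothetical illustrations.


def rank_syntactic_categories(scores, top_k=2):
    """Return the top_k syntactic categories, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [category for category, _ in ranked[:top_k]]


# Placeholder fine-grained classifiers, one per syntactic category.
# Each returns only the coarse semantic type named in the text.
def classify_noun(word):
    return "entity"


def classify_active_verb(word):
    return "event"


def classify_stative_verb(word):
    return "state"


# Different syntactic categories dispatch to different semantic methods.
FINE_GRAINED = {
    "noun": classify_noun,
    "active_verb": classify_active_verb,
    "stative_verb": classify_stative_verb,
}


def classify_unknown_word(word, syntactic_scores, top_k=2):
    """Predict syntactic categories first, then a semantic class for each."""
    results = []
    for category in rank_syntactic_categories(syntactic_scores, top_k):
        classifier = FINE_GRAINED[category]
        results.append((category, classifier(word)))
    return results
```

Dispatching through a table keyed by syntactic category mirrors the remark that each category needs its own representational model and fine-grained method. A real system would replace the placeholder functions with category-specific classifiers.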
      <Paragraph position="1"> The presupposition of automatic semantic classification for compounds is that the meaning of a compound is the semantic composition of its morphemic components, and that the head morpheme determines the major semantic class of the compound. There are many polysyllabic words for which semantic compositionality does not hold, for instance transliterated words; such words should be listed in the lexicon. Since the presupposition holds for the majority of compounds, the design of our semantic classification algorithm is based upon it. The process of identifying the semantic class of a compound therefore boils down to finding its head morpheme and determining that morpheme's semantic class. However, ambiguous morphological structures make it difficult to find the head morpheme. For instance, the compound in 1a) has two possible morphological structures, but only 1b) is the right interpretation.</Paragraph>
      <Paragraph position="3"> Once the morphological head is determined, sense resolution for the head morpheme is the next difficulty to be solved. About 51.5% of the 200 most productive morphemes are polysemous and, according to the Collocation Dictionary of Noun and Measure Words (CDNM), each ambiguous morpheme carries 3.5 different senses on average (Huang et al. 1997).</Paragraph>
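The two steps discussed in this subsection, finding the head morpheme and then listing its candidate senses, can be sketched as follows. The sketch rests on stated assumptions rather than on the paper's algorithm: the compound is taken as already segmented into morphemes, the head is assumed to be the final morpheme, and the sense inventory is a made-up toy table with romanized keys, not data from CDNM. It does not resolve structural ambiguity or choose among senses; it only shows why a polysemous head leaves several candidate classes.

```python
# Toy sketch: head-based candidate semantic classes for a compound.
# SENSES is a hypothetical inventory, not taken from any real dictionary.
SENSES = {
    "ji": ["machine", "opportunity", "aircraft"],  # polysemous head
    "dian": ["shop"],                              # unambiguous head
}


def head_morpheme(morphemes):
    """Return the head, assuming (for illustration) a head-final compound."""
    if not morphemes:
        raise ValueError("compound must contain at least one morpheme")
    return morphemes[-1]


def candidate_classes(morphemes, senses=SENSES):
    """List the semantic classes the compound could inherit from its head."""
    return list(senses.get(head_morpheme(morphemes), []))


def average_senses_of_ambiguous(senses):
    """Mean number of senses over morphemes that have more than one sense."""
    ambiguous = [s for s in senses.values() if len(s) > 1]
    if not ambiguous:
        return 0.0
    return sum(len(s) for s in ambiguous) / len(ambiguous)
```

For example, `candidate_classes(["xi", "yi", "ji"])` returns all three senses of the toy head `"ji"`, so a further disambiguation step would still be needed. `average_senses_of_ambiguous` computes the same kind of statistic as the 3.5 senses per ambiguous morpheme reported above, here over the toy table only.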
    </Section>
  </Section>
</Paper>