<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1106">
  <Title>Character-Sense Association and Compounding Template Similarity: Automatic Semantic Classification of Chinese Compounds</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Sense tagging is an important task in NLP. It is supposed to provide semantic information useful to application tasks such as IR and MT. As generally acknowledged, sense tagging assigns a certain sense to a word in a certain context by using a semantic lexicon (Yarowsky, 1992; Wilks and Stevenson, 1997). In addition to word sense disambiguation (WSD) for known words, sense determination for words unknown to the lexicon poses another challenge in sense tagging. This is especially the case in NLP of Chinese, a language rich in compound words. According to the data in Chen and Lin (2000), unknown words make up about 5.51% of the words encountered in their sense-tagging task on a Chinese corpus. Rather than proper names, which are cross-linguistically the most common type of unknown word, it is compound words that constitute the majority of unknown words in Chinese text. According to Chen and Chen (2000), the three most dominant types of Chinese unknown words are compound nouns (about 51%), compound verbs (about 34%), and proper names (about 15%). While the identification and classification of proper names is an issue already well discussed in Chinese NLP research, the sense determination of unknown compounds has received relatively little attention.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Shallow vs. Deep Classification
</SectionTitle>
      <Paragraph position="0"> While word sense may be conceptually vague, controversial in linguistics, and difficult to define (Manning and Schutze, 1999), a sense tag is more concrete and can be defined according to the specific needs of the NLP task in question. For example, in a semantic tagging or classification task, the sense tag can be the semantic class from a thesaurus. Alternatively, in a machine translation task, the equivalent foreign word from a bilingual dictionary can be chosen as the sense tag. In this paper, the term sense refers to a sense tag so defined.</Paragraph>
      <Paragraph position="1"> The notion of sense determination then refers to the assignment of a sense tag to a word without using contextual information. It is so called to distinguish it from sense tagging, which requires contextual information. Under this definition, semantic classification can be regarded as a case of sense determination that uses the taxonomy of a certain thesaurus, in which a semantic class is a sense tag.</Paragraph>
      <Paragraph position="2"> According to Wilks and Stevenson (1997), a task that assigns broad sense tags such as HUMAN or ANIMATE in WordNet is referred to as semantic tagging, as distinct from sense tagging, which assigns more specific sense tags. A similar distinction can be made for semantic classification according to the target level of the semantic classes in the taxonomy tree: a task aiming at the top-level classes can be called shallow semantic classification (like Lua, 1997), while a task aiming at the bottom-level classes can be called deep semantic classification (like Chen and Chen, 2000; see footnote 1). Since many top-level semantic classes, such as TIME, SPACE, QUALITY, and ACTION, are often already reflected in the syntactic information, shallow semantic classification does not actually provide much semantic information independent of syntactic tagging. It is therefore deep semantic classification that this paper is concerned with.</Paragraph>
      <Paragraph position="3"> 1 Take the word GE7G07 ('attack') for example. According to CILIN (a thesaurus widely used in Chinese semantic classification, see Section 3.1), it can be classified at shallow levels as major class H (ACTIVITY) or as medium class Hb (MILITARY ACTIVITY). It can also be classified at deep levels as small class Hb03 (specific military operations: ATTACK, RESIST, and COUNTERATTACK) or as subclass Hb031 (ATTACK).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Previous Research
</SectionTitle>
      <Paragraph position="0"> In previous research on automatic semantic classification of Chinese compounds, compounds are generally presupposed to be endocentric, composed of a head and a modifier. Determining the class of the head therefore amounts to determining the class of the target compound (Lua, 1997; Chen and Chen, 2000). This head-determination approach has two advantages: (1) it is simple and easy to implement; and (2) it works effectively for compound nouns, the dominant type of compounds, since most of them are head-final endocentric words (see footnote 2). However, there exist a considerable number of exocentric compounds, for which such a simple algorithm does not work. This is especially the case for compound verbs such as V-V compounds (see footnote 3).</Paragraph>
      <Paragraph position="1"> For example, G71G14 is a V-V compound meaning 'to kill by beating'. Obviously, neither the sense of G71 ('beat') nor that of G14 ('die') can appropriately be assigned to the compound G71G14, in the way that the sense of G42 ('car') can be assigned to G94G42 ('tram', literally 'electricity-car') as a general meaning.</Paragraph>
      <Paragraph position="2"> A second problem encountered in compound semantic classification is that a considerable number of morphemes are out of coverage, that is, not listed in the lexicon, as remarked in Chen and Chen (2000).</Paragraph>
      <Paragraph position="3"> Moreover, even when a morpheme is listed, the senses given are not necessarily appropriate to the task.</Paragraph>
      <Paragraph position="4"> For example, in the search for compound morphological rules in Chen and Chen (1998), some appropriate senses of morphemes had to be added manually to facilitate the task. Obviously, this poses great difficulty for an automatic task, especially for example-based models, which rely on measuring the similarity of the modifier morphemes to disambiguate the head senses (Chen and Chen, 1998, 2000). An alternative approach is thus needed to solve the problems of exocentric compounds and lexicon incompleteness.</Paragraph>
      <Paragraph position="5"> Therefore, in this paper I will present a non-head-oriented model of Chinese compound sense determination, in which lexicon incompleteness will be overcome by exploring the association between</Paragraph>
      <Paragraph position="6"> 2 Though a compound noun and its head are, strictly speaking, in a hyponym relation, they are usually categorized as members of the same class. For example, in CILIN, G42 ('car', 'vehicle') and most of the compounds X-G42 are put under the same class Bo21 (VEHICLES), where X can be a morpheme designating the energy source (like horse, cow, electricity) or the load content (like passenger, merchandise).</Paragraph>
    </Section>
  </Section>
</Paper>