<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-2011">
  <Title>Semantic classification of Chinese unknown words</Title>
  <Section position="3" start_page="4" end_page="9" type="metho">
    <SectionTitle>
2 The CiLin thesaurus
</SectionTitle>
    <Paragraph position="0"> The CiLin (Mei et al. 1986) is a thesaurus that contains 12 main categories: A-human, B-object, C-time and space, D-abstract, E-attribute, F-action, G-mental action, H-activity, I-state, J-association, K-auxiliary, and L-respect. The majority of words in the A-D categories are nouns, while the majority in the F-J categories are verbs. As shown in Figure 1, the main categories are further subdivided into more specific subcategories in a three-tier hierarchy.</Paragraph>
    <Section position="1" start_page="4" end_page="7" type="sub_section">
      <SectionTitle>
3.1 Definition of unknown words
</SectionTitle>
      <Paragraph position="0"> Unknown words are the Sinica Corpus lexicons that are listed in neither the Chinese Electronic Dictionary of 80,000 lexicons nor the CiLin. The 5-million-word Sinica Corpus contains 77,866 unknown words, consisting of 1.59% adjectives, 33.73% common nouns, 25.18% proper nouns, 12.48% location nouns, 2.98% time nouns, and 24.04% verbs, as shown in Table 2.</Paragraph>
      <Paragraph position="1"> The focus of most other Chinese unknown word research is on the identification of proper nouns such as proper names (Lee 1993), personal names (Lee, Lee and Chen 1994), abbreviations (Huang, Hong and Chen 1994), and organization names (Chen &amp; Chen 2000). Unknown words in categories outside the class of proper nouns are seldom mentioned.</Paragraph>
      <Paragraph position="2"> One of the few examples of multiple-class word prediction is the work of Chen, Bai and Chen (1997), which employs statistical methods based on prefix-category and suffix-category associations to predict the syntactic function of unknown words. Although proper nouns may carry much useful information in a sentence, the majority of unknown words in Chinese are lexical words, and consequently it is also important to classify lexical words. Otherwise, the remaining 70% of unknown words will be an obstacle to Chinese NLP; in particular, the 24.04% of unknown words that are verbs can be a major problem for parsers.</Paragraph>
      <Paragraph position="3"> Table 2: Unknown words and lexicons of the Sinica Corpus in 6 classes (columns: Class, Unknown words, Corpus lexicons).</Paragraph>
    </Section>
    <Section position="2" start_page="7" end_page="9" type="sub_section">
      <SectionTitle>
3.2 Types of unknown words
</SectionTitle>
      <Paragraph position="0"> In Chinese morphology, the two ways to generate new words are compounding and affixation.</Paragraph>
      <Paragraph position="1"> Compounds A compound is a word made up of other words. In general, Chinese compounds are made up of words that are linked together by morpho-syntactic relations such as modifier-head, verb-object, and so on (Chao 1968, Li and Thompson 1981).</Paragraph>
      <Paragraph position="2"> For example, Guang Huan Jue /guanghuanjue LIGHT-ILLUSION 'optical illusion' consists of Guang /guang 'light' and Huan Jue /huanjue 'illusion', and the relation is modifier-head. Guang Guo Min /guangguomin LIGHT-ALLERGY 'photosensitization' is made up of Guang /guang 'light' and Guo Min /guomin 'allergy', and the relation is also modifier-head.</Paragraph>
      <Paragraph position="3"> Notes: Some location nouns are still proper nouns, such as country names. The corpus lexicons contain both known and unknown words. Proper nouns comprise two classes: 1) formal names, such as personal names, races, titles of magazines and so on; 2) family names, such as Chen and Lee. Location nouns comprise 4 subclasses: 1) country names, such as China; 2) common location nouns, such as You Ju /youju 'post office' and Xue Xiao /xuexiao 'school'; 3) noun + position, such as Hai Wai /haiwai 'overseas'; 4) direction nouns, such as Shang /shang 'up' and Xia /xia 'down'. Time nouns comprise 3 classes: 1) historical events and recurring time nouns, such as Qing /qing 'Qing dynasty' and Yi Yue /yiyue 'January'; 2) noun + position, such as Wan Jian /wanjian 'in the evening'; 3) adverbial time nouns, such as Jiang Lai /jianglai 'in the future'.</Paragraph>
      <Paragraph position="4"> Affixation A word is formed by affixation when a stem is combined with a prefix or a suffix morpheme. For example, English suffixes such as -ian and -ist are used to create words referring to a person with a specialty, such as 'musician' and 'scientist'. Such suffixes can give very specific evidence for the semantic class of the word. Chinese has suffixes with meanings similar to -ian or -ist, such as the suffix -jia. But the Chinese affix is a much weaker cue to the semantic category of the word than English -ist or -ian, because it is more ambiguous. The suffix -jia covers three major concepts: 1) expert, such as Ke Xue Jia /kexuejia SCIENCE-EXPERT 'scientist' and Yin Yue Jia /yinyuejia MUSIC-EXPERT 'musician'; 2) family and home, such as Quan Jia /quanjia WHOLE-FAMILY 'whole family' and Fu Gui Jia /fuguijia RICH-FAMILY 'rich family'; 3) house, such as Ban Jia /banjia MOVE-HOUSE 'to move house'. In English, the meaning of an unknown word with the suffix -ian or -ist is clear, but in Chinese an unknown word with the suffix -jia could have multiple interpretations. Another ambiguous suffix, -xing, covers three main concepts: 1) gender, such as Nu Xing /nuxing FEMALE-SEX 'female'; 2) property, such as Yao Xing /yaoxing MEDICINE-PROPERTY 'property of a medicine'; 3) characteristic, such as Shi Sha Cheng Xing /shishachengxing LIKE-KILL-AS-HABIT 'a characteristic of being bloodthirsty'. Even though Chinese also has morphological suffixes that generate unknown words, they do not determine meaning and syntactic category as clearly as they do in English.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="9" end_page="10" type="metho">
    <SectionTitle>
4 Semantic classification
</SectionTitle>
    <Paragraph position="0"> For the task of classifying unknown words, two algorithms are evaluated. The first algorithm uses a simple heuristic in which the semantic category of an unknown word is determined by the head of the unknown word. The second algorithm adopts a more sophisticated nearest neighbor approach, in which the distance between an unknown word and examples from the CiLin thesaurus is computed based upon its morphological structure. The first algorithm serves as a baseline against which the performance of the second can be evaluated.</Paragraph>
    <Section position="1" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
4.1 Baseline
</SectionTitle>
      <Paragraph position="0"> The baseline method is to assign the semantic category of the morphological head to each word.</Paragraph>
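      <Paragraph> The baseline can be illustrated with a minimal Python sketch. The lookup table, the function name baseline_category and the choice of the last morpheme as head are assumptions made for illustration only; the category letters follow the CiLin main categories, but the individual mappings are not taken from the thesaurus itself.

```python
# Hypothetical toy data: head morpheme -> CiLin main category (A-L).
# These mappings are illustrative, not taken from the real thesaurus.
HEAD_CATEGORY = {
    "jia": "A",    # 'expert' sense, assumed to sit under A (human)
    "chang": "C",  # 'space' sense, assumed to sit under C (time and space)
}

def baseline_category(morphemes, head_index=-1, default=None):
    """Return the category of the morphological head of a segmented word.

    `morphemes` is the output of a morphological analyzer (not shown);
    `head_index` marks the head, assumed here to be the last morpheme,
    as in a modifier-head compound.
    """
    head = morphemes[head_index]
    return HEAD_CATEGORY.get(head, default)

print(baseline_category(["wudao", "jia"]))      # A
print(baseline_category(["yundong", "chang"]))  # C
```

      For verb-object words the head would be the first morpheme, so a caller would pass head_index=0 in that case.</Paragraph>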
    </Section>
    <Section position="2" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
4.2 An example-based semantic classification
</SectionTitle>
      <Paragraph position="0"> The algorithm for the nearest neighbor classifier is as follows:
1) An unknown word is parsed by a morphological analyzer (Tseng and Chen 2002). The analyzer a) segments a word into a sequence of morphemes, b) tags the syntactic categories of the morphemes, and c) predicts the morpho-syntactic relationships between morphemes, such as modifier-head, verb-object and resultative verbs, as shown in Table 3. For example, if Wu Dao Jia /wudaojia DANCE-EXPERT 'dancer' is an unknown word, the morphological segmentation is Wu Dao /wudao DANCE 'dance' and Jia /jia EXPERT 'expert', and the relation is modifier-head.
2) The CiLin thesaurus is then searched for entries (examples) that are similar to the unknown word. A list of words sharing at least one morpheme with the unknown word, in the same position, is constructed. In the case of Wu Dao Jia /wudaojia, such a list would include Ge Chang Jia /gechangjia, Hui Jia /huijia and Fu Gui Jia /fuguijia.
3) Examples whose morpho-syntactic relationship differs from that of the unknown word and whose shared morpheme is the unknown word's modifier are pruned away. If no examples are found, the system falls back to the baseline classification method.
4) The semantic similarity metric used to compute the distance between the unknown word and the selected examples from the CiLin thesaurus is based upon a method first proposed by Chen and Chen (1997).</Paragraph>
      <Paragraph position="1"> They assume that the similarity of two semantic categories is the information content of their parent node. For instance, the similarity of Ha Mi Gua /hamigua 'hami melon' (Bh07) and Fan Qie /fanqie 'tomato' (Bh06) is based on the information content of their least common ancestor node, Bh. The CiLin thesaurus can be viewed as an information system, and the information content (IC) of each semantic category is defined as $\mathrm{Entropy}(\mathit{System}) - \mathrm{Entropy}(\mathit{SemanticCategory})$. The similarity of two words is the information content of their least common ancestor, and hence the higher the information content is, the more similar the two words are. The information content is normalized by Entropy(System) in order to keep the similarity between 0 and 1. To simplify the computation, the probabilities of all leaf nodes are assumed equal. For example, the probability of Bh is .0064 and the information content of Bh is -log(.0064). Hence, the similarity between Ha Mi Gua /hamigua and Fan Qie /fanqie is .61. Footnote: There is still a very small number of coordinate-relation compounds, in which both morphemes are heads. Since either morpheme can carry the meaning of the whole compound, words with coordinate relations are categorized as modifier-head in order to simplify the system.</Paragraph>
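      <Paragraph> The normalized information content can be checked with a short Python sketch. It assumes that, with equally probable leaf nodes, the probability of a node is the share of leaves below it, so that Entropy(System) minus the entropy of the node equals the negative logarithm of that probability. The leaf counts (25 under Bh, 3925 in total) are hypothetical values chosen only so that the probability of Bh comes out near the .0064 quoted in the text; the function name similarity is likewise an illustration.

```python
import math

def similarity(leaves_under_lca, total_leaves):
    """Normalized information content of the least common ancestor.

    With equally probable leaves, P(node) = leaves_under_lca / total_leaves,
    IC(node) = -log2 P(node), and Entropy(system) = log2(total_leaves),
    so the result lies between 0 and 1. The ratio does not depend on the
    base of the logarithm.
    """
    probability = leaves_under_lca / total_leaves
    information_content = -math.log2(probability)
    return information_content / math.log2(total_leaves)

# Assumed counts (both hypothetical) giving P(Bh) of about .0064.
sim = similarity(25, 3925)
print(round(sim, 2))  # 0.61
```

      Under these assumptions, two words whose least common ancestor is the root get a similarity of 0, and two words in the same leaf category get a similarity of 1.</Paragraph>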
      <Paragraph position="2"> Resnik (1995, 1998 and 2000) and Lin (1998) also proposed information content algorithms for similarity measurement. The Chen and Chen (1997) algorithm simplifies the Resnik algorithm by assuming that the occurrence probability of each leaf node is equal.</Paragraph>
      <Paragraph position="3"> One problem for this algorithm is the insufficient coverage of the CiLin, which may not list all morphemes. The backup method is to run the classifier recursively to predict the possible categories of the unlisted morphemes. If a morpheme of an unknown word, or of one of its examples, is not listed in the CiLin, the system suspends the similarity measurement between the unknown word and the examples and first runs the classifier to predict the semantic category of the morpheme. Once the category of the morpheme is known, the classifier continues to measure the similarity between the unknown word and its examples. In the experiment, this backup method was invoked in about 3% of cases on average.</Paragraph>
      <Paragraph position="4"> Here is an example of the recursive semantic measurement. Pao Ma Tou /paomatou RUN-WHARF 'wharf-worker' is an example for the unknown word Pao Han Chuan /paohanchuan RUN-DRY BOAT 'folk activities'. The morphological analyzer breaks the two words into Pao Ma Tou /pao matou and Pao Han Chuan /pao hanchuan. The measurement function will compute the similarity between Ma Tou /matou and Han Chuan /hanchuan, but in this case Han Chuan /hanchuan is not listed in the CiLin. The next step is then to run the semantic classifier to guess the possible category of Han Chuan /hanchuan. Based on the predicted category, the system goes on to compute the similarity between Ma Tou /matou and Han Chuan /hanchuan. With this method, no word is left without a similarity measurement.</Paragraph>
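      <Paragraph> The backup step can be sketched in Python as follows. This is only an outline under stated assumptions: the thesaurus is reduced to a dictionary, the category codes X01 and X02 are placeholders rather than real CiLin codes, and toy_classifier and toy_metric are stand-ins for the full classifier and for the information content metric.

```python
def category_of(morpheme, thesaurus, classify):
    """Return the listed category, or predict one for an unlisted morpheme."""
    if morpheme in thesaurus:
        return thesaurus[morpheme]
    # Backup: suspend the similarity measurement and classify the morpheme first.
    return classify(morpheme)

def morpheme_similarity(m1, m2, thesaurus, classify, category_similarity):
    """Similarity of two morphemes, resolving unlisted ones via the classifier."""
    c1 = category_of(m1, thesaurus, classify)
    c2 = category_of(m2, thesaurus, classify)
    return category_similarity(c1, c2)

# Toy stand-ins (all hypothetical): category codes, classifier and metric.
thesaurus = {"matou": "X01"}
predicted = []

def toy_classifier(morpheme):
    predicted.append(morpheme)  # record which morphemes needed the backup
    return "X02"

def toy_metric(c1, c2):
    return 1.0 if c1 == c2 else 0.5

print(morpheme_similarity("matou", "hanchuan", thesaurus, toy_classifier, toy_metric))  # 0.5
print(predicted)  # ['hanchuan']
```

      In a full system the classify argument would be the nearest neighbor classifier itself, which is what makes the procedure recursive.</Paragraph>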
      <Paragraph position="5"> 5) After the distances from the unknown word to each of the selected examples from the CiLin thesaurus are determined, the average distance to the K nearest neighbors from each semantic category is computed. The category with the lowest distance is assigned to the unknown word.</Paragraph>
      <Paragraph position="6"> The similarity of Wu Dao /wudao and Ge Chang /gechang is .87, of Wu Dao /wudao and Hui /hui is .26, and of Wu Dao /wudao and Fu Gui /fugui is 0. Thus, Wu Dao Jia /wudaojia is more similar to Ge Chang Jia /gechangjia than to Hui Jia /huijia or Fu Gui Jia /fuguijia. The category of Wu Dao Jia /wudaojia is thus most likely to be that of Ge Chang Jia /gechangjia.</Paragraph>
      <Paragraph position="7"> The semantic category is predicted as the category that gets the highest score in formula (2). The lexical similarity and frequency of examples of each category are considered as the most important features to decide a category.</Paragraph>
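      <Paragraph position="8"> In formula (2), RankScore(C_i) is the ranking score of semantic category C_i, a weighted sum of a lexical similarity score and a frequency score: $\mathrm{RankScore}(C_i) = a \cdot \mathrm{SimilarityScore}(C_i) + (1-a) \cdot \mathrm{FrequencyScore}(C_i)$ (2)</Paragraph>
      <Paragraph position="10"> SimilarityScore(C_i) is a lexical similarity score, taken as the maximum of Similarity(W, W_j) over the examples W_j in the category C_i.</Paragraph>
      <Paragraph position="12"> FrequencyScore(C_i) is a frequency score that shows how many examples there are in a category. a and (1-a) are the respective weights of the lexical similarity score and the frequency score.</Paragraph>
      <Paragraph position="13"> Here C_i is one of the categories A...L of the CiLin taxonomy, and W_j is a word whose semantic category is defined in the CiLin.</Paragraph>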
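      <Paragraph> The ranking step can be sketched in Python as a weighted sum of a similarity score and a frequency score per category, with weights a and (1-a) as described for formula (2). The function name predict_category is hypothetical, the category labels are placeholders, and normalizing the frequency score as the share of examples in a category is an assumption, since the text only says that the score reflects how many examples a category has.

```python
from collections import defaultdict

def predict_category(examples, a=0.5):
    """Pick the category with the highest weighted rank score.

    `examples` is a list of (category, similarity) pairs, one per selected
    thesaurus example. The similarity score of a category is the maximum
    similarity among its examples; the frequency score is assumed here to
    be the share of examples in that category.
    """
    by_category = defaultdict(list)
    for category, sim in examples:
        by_category[category].append(sim)
    total = len(examples)
    scores = {
        category: a * max(sims) + (1 - a) * len(sims) / total
        for category, sims in by_category.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Similarities from the Wu Dao Jia example; category labels are placeholders.
examples = [("cat_gechangjia", 0.87), ("cat_huijia", 0.26), ("cat_fuguijia", 0.0)]
best, scores = predict_category(examples)
print(best)  # cat_gechangjia
```

      With a = 0.5 the two scores count equally; a larger a favors the single closest example, while a smaller a favors categories with many examples.</Paragraph>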
    </Section>
    <Section position="3" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
5.1 Data
</SectionTitle>
      <Paragraph position="0"> There are 56,830 words in the CiLin. For the experiments, the CiLin lexicons are divided into 3 sets: a training set of 80% of the CiLin words, a development set of 10%, and a test set of 10%. All words in the test set are assumed to be unknown, which means that the semantic categories of the words in both the development and test sets are treated as unknown. Nevertheless, the morphological structures of proper nouns are different from those of lexical words. Their identification methods are also different and are outside the scope of this paper. The correct category of an unknown word is its semantic category in the CiLin. If an unknown word is ambiguous, that is, if it belongs to more than one category, the system chooses only one possible category.</Paragraph>
      <Paragraph position="1"> In evaluation, any one of the categories of an ambiguous word is considered correct.</Paragraph>
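      <Paragraph> The evaluation rule for ambiguous words can be made concrete with a small Python sketch. The words w1 to w4 and their categories are invented for illustration; the only point shown is that a prediction counts as correct when it matches any one of the gold categories of a word.

```python
def accuracy(predictions, gold):
    """Share of words whose predicted category is among the gold categories."""
    correct = sum(1 for word, cat in predictions.items() if cat in gold[word])
    return correct / len(predictions)

# Hypothetical test items: w1 is ambiguous between categories C and D.
gold = {"w1": {"C", "D"}, "w2": {"B"}, "w3": {"A"}, "w4": {"F"}}
predictions = {"w1": "D", "w2": "B", "w3": "H", "w4": "F"}
print(accuracy(predictions, gold))  # 0.75
```
      </Paragraph>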
    </Section>
    <Section position="4" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
5.2 Result
</SectionTitle>
      <Paragraph position="0"> On the test set, the baseline predicts 53.50% of adjectives, 70.84% of nouns and 47.19% of verbs correctly. The classifier reaches 64.20% in adjectives, 71.77% in nouns and 53.47% in verbs, when a is 0.5 and K is five.</Paragraph>
      <Paragraph position="1"> (Table: performance of the baseline and the classifier.) Generally, nouns are easier to predict than the other categories, because their morpho-syntactic relations are not as complex as those of verbs and adjectives. The classifier improves on the baseline for adjectives and verbs, but only marginally for nouns (71.77% versus 70.84%). The near absence of a performance increase for nouns is most likely because nouns have only one kind of morpho-syntactic relation. The advantage of the classifier is that it filters out examples with different relations and finds the example that is most similar in both morphemes and morpho-syntactic relation. The classifier therefore predicts better than the baseline in word classes with multiple relations, such as adjectives and verbs.</Paragraph>
      <Paragraph position="2"> For example, Kai Kuai Che /kaikuaiche OPEN-FAST CAR 'drive fast' is a verb-object verb. The baseline predicted it wrongly because of the verb Kai /kai OPEN 'open'. The semantic classifier, however, assigned it to the category of its similar example, Kai Ye Che /kaiyeche OPEN-NIGHT CAR 'drive during the night'.</Paragraph>
    </Section>
    <Section position="5" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
5.3 Error analysis
</SectionTitle>
      <Paragraph position="0"> Error sources can be grouped into two types: data errors and classifier errors. The test data come from the CiLin. Some of the test data are not semantically transparent, such as idioms, metaphors, and slang. The meaning of such words differs from their literal meaning. For instance, the literal meaning of Kan Men Gou /kanmengou WATCH-DOOR-DOG is a door-watching dog, but in fact it refers to a person, in a belittling sense. Mu Lao Hu /mulaohu FEMALE-TIGER is literally a female tiger, but it refers to a mean woman. These words no longer carry the meaning of their head.</Paragraph>
      <Paragraph position="1"> An unknown word such as Kan Men Mao /kanmenmao WATCH-DOOR-CAT 'a door-watching cat' could be created, but it would not carry a figurative meaning similar to that of Kan Men Gou /kanmengou.</Paragraph>
      <Paragraph position="2"> The classifier errors are due primarily to three factors: a lack of examples, the precision of the similarity measurement, and the taxonomy of the CiLin.</Paragraph>
      <Paragraph position="3"> First, some errors occur when there are not enough examples in the training data. For example, Tie Lan Gan /tielangan IRON-POLE 'iron pole' does not have any similar examples after the classifier filters out examples whose relations are different and whose shared morphemes are not the head. Tie Lan Gan /tielangan is segmented as Tie /tie IRON 'iron' and Lan Gan /langan POLE 'pole'. There are examples for the first morpheme, Tie /tie, but no similar examples for the second, Lan Gan /langan. Since Tie Lan Gan /tielangan has a modifier-head relation and Lan Gan /langan is the head of the compound, the classifier filters out the examples of Tie /tie. Hence there are not enough examples. Examples with different structures are filtered out to make the remaining examples more similar, since the similarity measurement may not be able to distinguish slight differences. However, the cost of this filtering is that sometimes no examples are left. Second, the similarity measurement is sometimes not powerful enough. Yun Dong Chang /yundongchang SPORT-SPACE 'a sports ground' has a sufficient number of examples, but has problems with the similarity measurement. The head Chang /chang is ambiguous. Chang /chang has two senses and both mean space. One of them means abstract space and the other means physical space. Hence, in the CiLin thesaurus Chang /chang can be found in C (time and space) and D (abstract). Words in C include Shang Chang /shangchang BUSINESS-SPACE 'a market', Tu Zai Chang /tuzaichang BUTCHER-SPACE 'a slaughter house', and Hui Chang /huichang MEETING-SPACE 'the place of a meeting'; words in D include Qiu Chang /qiuchang BALL-SPACE 'a court' and Ti Yu Chang /tiyuchang PHYSICAL TRAINING-SPACE 'a stadium'. Yun Dong Chang /yundongchang should be more similar to Ti Yu Chang /tiyuchang than to other space nouns, but the similarity score does not show that they are related, and the C group has more examples. Thus, the system chooses C incorrectly.</Paragraph>
      <Paragraph position="4"> Third, the taxonomy of the thesaurus is ambiguous.</Paragraph>
      <Paragraph position="5"> For instance, Ti Cao Fang /ticaofang GYMNASTICS-ROOM 'gymnastics room' has similar examples in both B (object) and D (abstract). These two groups are very similar. Words in the B group include Xing Fang /xingfang PUNISHMENT-ROOM 'punishment room', Shu Fang /shufang BOOK-ROOM 'study room', An Fang /anfang DARK-ROOM 'dark room', and Chu Fang /chufang KITCHEN-ROOM 'kitchen'. Words in D include Lao Fang /laofang PRISON-ROOM 'a jail' and Dan Zi Fang /danzifang BILLIARD-ROOM 'a billiard room'. There are no obvious features that distinguish between these examples. According to the CiLin, Ti Cao Fang /ticaofang belongs to D, but the classifier assigns it to the B class, which does not actually differ much from D. Such problems may occur with any semantic taxonomy.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="10" end_page="10" type="metho">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> This paper presents an algorithm for classifying unknown words semantically. The classifier adopts a nearest neighbor approach in which the distance between an unknown word and examples from the CiLin thesaurus is computed based upon its morphological structure. The main contributions of the system are as follows. First, it is the first attempt at adding semantic knowledge to Chinese unknown words.</Paragraph>
    <Paragraph position="1"> Since over 70% of unknown words are lexical words, the inability to resolve their meaning is a major obstacle to Chinese NLP applications such as semantic parsers. Second, without contextual information, the system can still successfully classify 65.76% of adjectives, 71.39% of nouns and 52.84% of verbs.</Paragraph>
    <Paragraph position="2"> Future work will explore the use of the contextual information of the unknown words and the contextual information of the lexicons in the predicted category of the unknown words to boost predictive power.</Paragraph>
  </Section>
</Paper>