<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1106">
  <Title>Character-Sense Association and Compounding Template Similarity: Automatic Semantic Classification of Chinese Compounds</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 An introspection on the two-character verbs in CILIN shows
</SectionTitle>
    <Paragraph position="0"> that about 48% of them are semantically exocentric, which means that the semantic class of a compound X-Y in CILIN is equal neither to that of X nor to that of Y. As for the endocentric V1-V2 compounds, V1 and V2 are about equally likely to be the head of the compound verb, according to the same introspection.</Paragraph>
    <Paragraph position="1"> characters and senses in an MRD. The sense of an unknown compound can be approximated by the retrieved synonyms, and its sense tag can be assigned according to a given MRD. This model supports an automatic system of deep semantic classification for unknown compounds. In this paper, a system for V-V compounds is implemented and evaluated. The model can, however, be extended to handle general Chinese compounds, such as V-N and N-N, as well.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Compound Sense Determination
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Compounding Semantic Templates
</SectionTitle>
      <Paragraph position="0"> Most Chinese compounds are composed of two constituents, which can be bound morphemes of one character or free words of one or more characters. The two-character compound is the most representative type because its components can be bound morphemes as well as free words. The handling of two-character compounds is therefore the focus of this paper.</Paragraph>
      <Paragraph position="1"> As in Chinese compounding in general, a two-character compound is usually semantically compositional, with each character conveying a certain sense. The principle of semantic composition implies that under each compound lies a semantic pattern, which can be represented as the combination of the sense tags of the two component characters. This combination pattern is referred to as the compounding semantic template (denoted S-template) in this paper; compounds of the same S-template are then referred to as template-similar (denoted T-similar). Since T-similar compounds are alike in their semantic composition, they can be expected to possess roughly the same meaning and to fall under a considerably fine-grained semantic class. Take the compound verb G71G7E for example.</Paragraph>
      <Paragraph position="2"> This compound suggests the existence of an S-template HIT-BROKEN, as the senses of the two component characters G71 and G7E are respectively 'hit' and 'broken'. The S-template HIT-BROKEN refers to a complex event schema [to make something BROKEN by HITting]. This S-template can also be found in many other compounds with a similar meaning: G71GCE, G07GCE, G16G7E, G16GCE, etc. Obviously, such T-similar words can make a good set of examples for an example-based approach to sense determination, if an effective measure of word similarity is available for their retrieval.</Paragraph>
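The grouping of compounds by S-template can be sketched as follows. The romanized characters and the character-to-sense lookup are invented stand-ins, since the original glyphs are not recoverable from this copy:

```python
from collections import defaultdict

# Hypothetical character-to-sense lookup; tags are illustrative only.
char_sense = {"da": "HIT", "qiao": "HIT", "po": "BROKEN", "sui": "BROKEN"}

def s_template(compound):
    """The S-template of a two-character compound is the pair of
    sense tags of its two component characters."""
    c1, c2 = compound
    return (char_sense[c1], char_sense[c2])

# Compounds sharing an S-template are T-similar and form one group.
compounds = [("da", "po"), ("da", "sui"), ("qiao", "po"), ("qiao", "sui")]
groups = defaultdict(list)
for w in compounds:
    groups[s_template(w)].append(w)
```

All four toy compounds fall under the single template HIT-BROKEN, mirroring the example in the text.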
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Compound Similarity
</SectionTitle>
      <Paragraph position="0"> As a critical technique, word similarity is generally used in example-based models of semantic classification. Measures of word similarity fall into two major approaches: the taxonomy-based lexical approach (Resnik 1995, Lin 1998a, Chen and Chen 1998) and the context-based syntactic approach (Lin 1998b, Chen and You 2002); the latter is not the concern of this context-free model. However, two problems arise for the taxonomy-based lexical approach. First, such similarity measures risk failing to capture the similarity between some semantically highly related words, if those words happen to be put under classes distant from each other in a specific ontology4. Second, as mentioned, the appropriate senses of some characters simply cannot be found in the thesaurus. One major reason why dictionaries do not include certain character senses is that, for the senses in question, many such characters are used in contemporary Chinese only as bound morphemes, not as free words. However, such senses may still be kept in the compounds in the lexicon, so they are covert but not irrecoverable.</Paragraph>
      <Paragraph position="1"> To remedy the effects of such lexicon incompleteness, I propose an approach that retrieves the latent senses5 of characters and the latent synonymy among characters by exploring the association between characters and senses. The idea is that if a character C appears in a compound W, then, by semantic composition, the sense of C must somehow contribute to S, the sense of W.</Paragraph>
      <Paragraph position="2"> Therefore the association strength between character C and sense S in an MRD is supposed to reflect the potential of S to be a sense of C. By transitivity, such association between characters and senses makes it possible to capture association among characters. A new way to measure the word similarity of two compounds can thus be derived from the association strengths of the corresponding component characters. This measure actually reflects the S-template similarity between two compounds and can be used to retrieve, for a compound, its T-similar words, which are potentially synonymous.</Paragraph>
      <Paragraph position="3"> 4 Take an example from CILIN (a Chinese thesaurus, see 3.1). KILL(G17G14), BUTCHER(G8AGC3), and EXECUTE(GB7G03) are three concepts all meaning 'cause to die'. However, the words expressing these three ideas are put under the small classes Hn05, Hd28, and Hm10, respectively under the medium class Hn: Criminal Activities(G75G40), the class Hd: Economical Production Activities(G8AG59), and the class Hm: Security and Justice Activities(GDDGF3G2DG50G5B). It is doubtful that any measure based on that hierarchy could capture the similarity among the words in these three small classes, for those words share only a common major class H, denoting vaguely Activities, which includes 296 small classes and 836 subclasses.</Paragraph>
      <Paragraph position="4"> 5 Here the term latent is used only to mean 'hidden, potential, and waiting to be discovered'. It has nothing to do with the LSI techniques, though they both evoke the same meaning of latent.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Synonyms and Sense Approximation
</SectionTitle>
      <Paragraph position="0"> The acquisition of synonyms plays an important role in the sense determination of a word. When a native speaker is capable of giving synonyms for a word, he is considered to understand the meaning of that word. In fact, this way of capturing sense is also reflected in how the senses of words are explained in many dictionaries6. Moreover, as some studies propose, synonyms can be used to construct the semantic space of a given word (Ploux and Victorri, 1998; Ploux and Ji, 2003). In such a semantic space, each synonym, with its different nuance, occupies a certain area. As visually reflected in this approach, retrieving a proper set of synonyms for a word amounts to capturing its senses well. In fact, my model of automatic sense determination for a compound is built precisely upon the retrieval of its near synonyms, the T-similar compounds described above.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Model Representation
</SectionTitle>
      <Paragraph position="0"> With an S-template similarity measure, one can retrieve, for a given compound, its potentially synonymous T-similar compounds. The sense tags of the retrieved compounds can then be used to determine the sense tag of the target compound. The model of compound sense determination is thus composed of two modules, as illustrated in Fig. 1.</Paragraph>
      <Paragraph position="1"> Module-A (&lt;T-similar Word Retriever&gt;) retrieves the potential synonyms ({SW-set(X-Y)}) of a given compound (X-Y) by using association information provided by the dicos {dico1, dico2, ...}. Module-B (&lt;S-tag Determiner&gt;) obtains the most likely sense tags ({S-tag(X-Y)}) for the target word according to dicox, using the output of Module-A. The component Filter-C is optional: it passes only the T-similar words with the same syntactic category as the target compound, if that category is already known. In fact, a system of semantic classification can be created by choosing dico2 as dicox; the S-tag is then the semantic class in CILIN (as in section 4). 6 Especially in Chinese dictionaries, it is often the case that several synonymous words are given as the explanation of the meaning of a word, especially when it is a compound verb.</Paragraph>
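A minimal sketch of the two-module pipeline, under the assumption (consistent with formula (6) in section 3.4) that Module-B scores each sense tag by summing the T-similarity of the retrieved words that carry it; `retrieve`, `pos_of`, and the toy data are hypothetical:

```python
def classify(compound, retrieve, dico_x, pos_of=None, category=None, k=20):
    """Module-A (<TWR>) retrieves the k most T-similar words; the optional
    Filter-C keeps only words sharing the target's syntactic category;
    Module-B (<S-tag Determiner>) ranks the sense tags carried in dico_x,
    weighted by T-similarity (additive weighting is an assumption here)."""
    candidates = retrieve(compound, k)              # [(word, t_similarity), ...]
    if category is not None and pos_of is not None:
        candidates = [(w, s) for w, s in candidates if pos_of.get(w) == category]
    scores = {}
    for w, sim in candidates:
        for tag in dico_x.get(w, ()):               # only words known to dico_x count
            scores[tag] = scores.get(tag, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy run: three retrieved T-similar words, two of them known to dico_x
# under the same CILIN-style subclass.
retrieve = lambda w, k: [("w1", 0.9), ("w2", 0.6), ("w3", 0.4)]
dico_x = {"w1": ["Hm051"], "w2": ["Hm051"], "w3": ["Je121"]}
ranked = classify(("X", "Y"), retrieve, dico_x)
```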
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Character-Sense Association Network
</SectionTitle>
    <Paragraph position="0"> Before exploring the critical measurement of association between characters and senses needed in the model, I briefly present the lexical resources in use and define the idealized dictionary format adopted in this task.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Lexical Sources
</SectionTitle>
      <Paragraph position="0"> The lexical resources used to implement the system include: (1) Sinica Corpus: a balanced Chinese corpus with 5 million words, segmented and tagged with syntactic categories (Huang et al., 1995). (2) HowNet: an on-line Chinese-English bilingual lexical resource created by Dong, used in this paper as a Chinese-English dictionary registering about 51,600 Chinese words, each assigned its equivalent English words and its POS (http://www.keenage.com/). (3) CILIN: a Chinese thesaurus collecting about 53,200 words. CILIN classifies its lexicon in a four-level hierarchy according to different semantic granularities: 12 major classes (level-1), 95 medium classes (level-2), 1428 small classes (level-3), and 3924 subclasses (level-4). Words in the same small class can be regarded as semantically similar, but only words in the same subclass can surely be regarded as synonyms7 (Mei et al., 1984).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Idealized Dictionary Format (dico)
</SectionTitle>
      <Paragraph position="0"> The idealized dictionary, denoted as dico, is actually a formatted MRD defined as follows: A dico is a set of &lt;W-S&gt; correspondence pairs, where W is a word and S is a sense tag. (1) 7 Take the two verbs G9E ('to buy') and G5B ('to sell') as examples to illustrate the taxonomy of CILIN. Both verbs are grouped in the small class He03 (commercial trade), which is under the major class H (activities) and the medium class He (economic activities). However, the two antonyms are put under two different subclasses, He031 (buying) and He032 (selling) respectively.</Paragraph>
      <Paragraph position="1"> In the system implementation in this paper, two dicos are converted from HowNet and CILIN respectively, for the calculation of the association measures between characters and sense tags; each adopts a different type of sense tag. For HowNet, the English equivalent words are used as sense tags to form dico1. For CILIN, the subclasses are used as sense tags to form dico2.</Paragraph>
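The dico format in (1) reduces to a set of word-sense pairs. The entries below are invented stand-ins for HowNet-style (English tag) and CILIN-style (subclass tag) records, not the real file formats:

```python
# A dico is a set of <W-S> pairs: (word, sense_tag).
def make_dico(entries):
    return set(entries)

dico1 = make_dico([("lie-huo", "catch"), ("bu-zhuo", "arrest")])   # English tags (dico1)
dico2 = make_dico([("lie-huo", "Hm051"), ("bu-zhuo", "Hm051")])    # subclass tags (dico2)

def senses(dico, word):
    """Polysemy: all sense tags paired with a given word."""
    return {s for w, s in dico if w == word}

def words_of(dico, sense):
    """Synonymy: all words paired with a given sense tag."""
    return {w for w, s in dico if s == sense}
```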
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Character-Sense Association
</SectionTitle>
      <Paragraph position="0"> All the semantic information provided by a dico, as defined in (1), can in fact be represented as a network with links between two domains: the W domain (words) and the S domain (sense tags). From this viewpoint, polysemy is a one-to-many mapping from W to S, while synonymy is a one-to-many mapping from S to W. If we further link a component character C of a word W to one of the senses S linked to W, such a C-S link might intuitively reflect a potential sense S for the character C, probably a latent sense of C, as described in section 2.2. We can use a statistical association measure, such as MI or χ2, to extract such C-S links. The statistically extracted C-S associations can then lead to the discovery of latent senses for a character. The revelation of a latent character-sense association will further lead to the retrieval of new synonymy relations between characters. Symmetrically, it will also lead to the retrieval of the potential polysemy of a character. As illustrated in the Z-diagram below, suppose that C1 is already associated with S1 and C2 with S2; the retrieval of the latent sense S1 for C2 will meanwhile lead to the finding of an association between C1 and C2 (latent synonymy) and an association between S1 and S2 (latent polysemy).</Paragraph>
      <Paragraph position="2"> The directed association measure from a character to a sense, denoted as CS-asso(Ci,Sj), can be defined as follows:</Paragraph>
      <Paragraph position="4"> where freq(Ci,Sj) is the number of words in the MRD that contain character Ci and are tagged with sense Sj, while freq(Ci) is the number of words containing character Ci, and freq(Sj) the number of words tagged with sense Sj.8 Likewise, the directed association measure from a sense to a character, denoted as SC-asso(Si,Cj), can be defined as follows9:</Paragraph>
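The bodies of formulas (2) and (3) did not survive in this copy. Given footnote 8 (a simplified approximation to the χ2-test measure under the assumption that freq(C,S) is much smaller than freq(C) and freq(S)), one plausible reconstruction of their common core is:

```latex
% Hypothetical reconstruction; the paper's exact formulas (2) and (3) are lost.
\mathrm{CS\text{-}asso}(C_i, S_j) \;\propto\; \frac{freq(C_i, S_j)^2}{freq(C_i)\cdot freq(S_j)}
```

Footnote 9 states that CS-asso and SC-asso differ only in their normalization factors, which makes the two directed measures asymmetric; that normalization is not recoverable here, so the expression above should be read as the unnormalized core only.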
      <Paragraph position="6"> Consequently, via a Ci-Sj-Ck chain (a latent synonymy), the directed association measure from a character Ci to another character Ck is defined as a combination of two directed association measures, the maximal association measure CC-asso1(Ci,Ck) and the over-all association measure CC-asso2(Ci,Ck), with respective weights 1-o and o (the value o is by default set to 0.5).</Paragraph>
      <Paragraph position="8"/>
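The combination in (4) can be sketched as follows. How a single Ci-Sj-Ck chain is scored is not recoverable from this copy; multiplying CS-asso and SC-asso along the chain is an assumption, as are the max/sum choices for CC-asso1 and CC-asso2:

```python
def cc_asso(ci, ck, cs_asso, sc_asso, senses, o=0.5):
    """Directed character-to-character association through Ci-Sj-Ck chains.
    CC-asso1 takes the single best chain, CC-asso2 aggregates all chains;
    they are mixed with weights (1 - o) and o, with o = 0.5 by default."""
    chains = [cs_asso[(ci, s)] * sc_asso[(s, ck)]
              for s in senses
              if (ci, s) in cs_asso and (s, ck) in sc_asso]
    if not chains:
        return 0.0
    cc1 = max(chains)   # maximal association over a single shared sense
    cc2 = sum(chains)   # over-all association across all shared senses
    return (1 - o) * cc1 + o * cc2
```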
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 S-Template Similarity Measure
</SectionTitle>
      <Paragraph position="0"> Suppose that Wi(Ci1-Ci2) and Wj(Cj1-Cj2) are both two-character compounds. A measure of word-word directed association (denoted WW-asso) from Wi to Wj can be defined based on the CC-asso between their corresponding component characters:</Paragraph>
      <Paragraph position="2"> Since the corresponding characters of two T-similar compounds must share the same sense tags and thus have strong CC-asso, the measure WW-asso(Wi,Wj) indicates, in fact, how T-similar a compound Wj is to a target Wi, compared with other compounds.</Paragraph>
      <Paragraph position="3"> WW-asso(Wi,Wj) is therefore taken as the measure of S-template similarity (denoted as T-similarity).</Paragraph>
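A sketch of the WW-asso in (5); combining the two positional CC-asso values by multiplication is an assumption, since the body of formula (5) did not survive in this copy:

```python
def ww_asso(wi, wj, cc_asso):
    """Directed word-to-word association for two-character compounds
    Wi(Ci1-Ci2) and Wj(Cj1-Cj2), built from the character-to-character
    association of the corresponding (position-matched) characters."""
    (ci1, ci2), (cj1, cj2) = wi, wj
    return cc_asso(ci1, cj1) * cc_asso(ci2, cj2)

# Toy CC-asso table for two position-matched character pairs.
table = {("h1", "h2"): 0.7, ("t1", "t2"): 0.5}
cc = lambda a, b: table.get((a, b), 0.0)
score = ww_asso(("h1", "t1"), ("h2", "t2"), cc)
```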
      <Paragraph position="4"> Applying the S-template similarity measure in (5), the T-similar Word Retriever (&lt;TWR&gt;) can now give, for a compound X-Y, the list of its most T-similar compounds from the corpus together with their T-similarity scores. As for the &lt;S-tag Determiner&gt;, it receives as input the T-similar words output by &lt;TWR&gt;. Among these, the words known to dicox are picked out, and their sense tags (S-tags) together with their T-similarity scores (WW-asso) are used, as in formula (6), to calculate the likelihood score L for a compound V-Vi to possess a certain S-tagj. A set of ranked possible semantic classes for the compound X-Y can thus be given.</Paragraph>
      <Paragraph position="5"> 8 The formula in (2) is actually a simplified approximation to the χ2-test measure, obtained by supposing that freq(C,S) is much smaller than freq(C) and freq(S). In fact, MI (mutual information) is another association measure frequently used in Chinese NLP; for example, it was successfully used as a character-POS association measure in the task of syntactic classification of Chinese unknown words (Chen et al., 1997). However, a heuristic evaluation on some randomly picked examples suggests that it is outperformed by the χ2 measure in this task. 9 It must be noted that the directed association measures (2) and (3) are asymmetric: they give different values for the association from Ci to Sj and for the one from Sj to Ci, because their normalization factors are not the same. That is why the term directed is used here to point out the asymmetry.</Paragraph>
      <Paragraph position="7"/>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. System Implementation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Classification for V-V Compounds
</SectionTitle>
      <Paragraph position="0"> Based on the proposed model, a system of semantic classification can be implemented for two-character V-V compound verbs by using dico2 as the dicox in Module-B (the S-tag is now the semantic class in CILIN). V-V compounds are chosen as subjects for this system because this choice best distinguishes the present model from previous head-oriented approaches. Moreover, involving only V characters makes the training data homogeneous, which simplifies the association network and largely reduces the computational complexity. The partial system for V-V compounds can nevertheless easily be extended to handle V-N and N-N compounds as well, once the character-sense association network for N characters is established.</Paragraph>
      <Paragraph position="1"> Since only V characters are involved, a subset of the &lt;W-S&gt; pairs of dico1 (HowNet) and dico2 (CILIN) is extracted to calculate the association measures and then the T-similarity measure. The subset contains only the &lt;W-S&gt; pairs whose W is a one-character or two-character verb. In CILIN the verbs are put under the major classes E to J, designating the concepts of attributes (E), actions (F), mental activities (G), activities (H), physical states (I), and relations (J). By choosing only the words in these 6 major classes, the nominal senses of characters (A: human, B: concrete object, C: time and space, D: abstract object) are supposed to be excluded. Besides, the occurrence frequency of a character in a mono-character word is double-weighted, since in that case the word sense is surely contributed by that character alone.</Paragraph>
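The verbal-subset extraction and the double weighting of mono-character words can be sketched as follows; the assumption that a CILIN-style tag begins with its major-class letter, and the toy entries, are illustrative:

```python
from collections import Counter

def char_freqs(dico_pairs, verb_classes="EFGHIJ"):
    """Count character, sense, and character-sense frequencies over the
    verbal subset only (major classes E..J). A character occurring as a
    one-character word is counted twice, since it alone carries the sense."""
    char_freq, sense_freq, pair_freq = Counter(), Counter(), Counter()
    for word, tag in dico_pairs:
        if tag[0] not in verb_classes:
            continue                      # drop nominal classes A..D
        weight = 2 if len(word) == 1 else 1
        sense_freq[tag] += 1
        for ch in word:
            char_freq[ch] += weight
            pair_freq[(ch, tag)] += weight
    return char_freq, sense_freq, pair_freq

# Toy entries: "a" as a mono-character verb, "ab" as a compound verb,
# "xy" under a nominal class that must be excluded.
cf, sf, pf = char_freqs([("a", "Hm051"), ("ab", "He031"), ("xy", "Ab011")])
```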
      <Paragraph position="2"> Let us take the V-V compound GF9G3E ('to catch by hunting', literally 'hunt-catch') as an example of how the model operates. Based on the association network created from HowNet, the characters associated with GF9 and G3E are given in List 1 and List 2 (only the 10 top-ranked are shown), and the 20 top-ranked T-similar compounds of GF9G3E are given in List 3 with their similarity scores, syntactic categories, and semantic classes, where known in CILIN. Among the 20 T-similar compounds retrieved, 10 (the grayed ones) can be found in CILIN; 9 (the framed ones) can be considered good synonyms of GF9G3E, while another 7 (the starred ones) can be considered semantically very close. In this particular example, 80% (16/20) of the T-similar compounds can be considered at least nearly synonymous, while 50% (8/16) of them can actually be found in CILIN to serve the automatic semantic classification.</Paragraph>
      <Paragraph position="3"> Applying the formula for the likelihood score of semantic class determination in (6), we obtain the 4 top-ranked semantic classes for GF9G3E predicted by the system: (1) Hm051 (G5AGA2 'arrest'), (2) Je121 (G37G53 'acquire'), (3) Hb121 (G90GFE 'attack and occupy'), (4) Hb141 (G88GC5 'capture as war prisoner'). In this case, the standard answer, class Hm051, for the compound GF9G3E is ranked as the first candidate, while the second-ranked candidate class Je121 ('acquire') is also reasonable and might well be considered correct by human judgment. In fact, to a native speaker's intuition, the 4th-ranked candidate class Hb141 ('capture') is also quite suitable for the meaning of the verb GF9G3E, though that is not how it is classified in CILIN. However, to avoid the subjective interference of human judgment, and particularly to make the evaluation task automatic, the evaluation in the following sections is made by machine only, according to the standard classification in CILIN.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Experiment Results
</SectionTitle>
      <Paragraph position="0"> To evaluate the performance of the system, 500 V-V compounds are randomly picked from CILIN to form the test set. Two modes of evaluation experiments are carried out: both modes adopt dico2 (CILIN) in Module-B (dicox=dico2) to determine semantic classes, while the inside-test mode uses dico2 (CILIN) in Module-A and the outside-test mode uses dico1 (HowNet) in Module-A to obtain the association network and retrieve the T-similar words.</Paragraph>
      <Paragraph position="1"> To make the test compounds unknown to the model, the semantic classes of the test compounds have to be invisible to CILIN, while this invisibility should not undermine the training of the association network in Module-A. This is achieved by dynamically withdrawing each word from dico2 in Module-B while it is under test. Two levels of evaluation can be made: verifying the answer at the level of small classes (level-3) and at the level of subclasses (level-4). The accuracy is calculated by verifying whether the correct answer, or one of the correct answers (if the V-V is polysemous) according to CILIN, can be found in the first n ranked semantic classes predicted by the system. The performance of a random head-picking model is offered as the baseline: one of the semantic classes of X or Y is randomly chosen as the semantic class of the compound X-Y.</Paragraph>
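The evaluation protocol and the random head-picking baseline can be sketched as follows (function names and toy data are hypothetical):

```python
import random

def top_n_accuracy(predictions, gold, n):
    """Fraction of test compounds whose correct class (any of them, if the
    compound is polysemous in CILIN) appears in the first n ranked classes."""
    hits = sum(1 for w, ranked in predictions.items()
               if any(c in gold[w] for c in ranked[:n]))
    return hits / len(predictions)

def head_picking_baseline(compound, classes_of, rng=random):
    """Baseline: randomly pick one of the two component characters'
    semantic classes as the class of the whole compound X-Y."""
    x, y = compound
    return rng.choice(list(classes_of[x]) + list(classes_of[y]))
```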
      <Paragraph position="2"> The results in Table 1 show that the system achieves a precision rate of 60.40% for the inside test and 36.60% for the outside test in level-4 classification, against a baseline of 17.34%. Unsurprisingly, the performance of classification at level-3, a slightly shallower level, is slightly better: 61.60% for the inside test and 39.80% for the outside test. Table 1 also shows that the system achieves a correct inclusion rate of 59.8% (outside) and 80.80% (inside) for including the correct answer in the first 3 ranked candidate classes at level-4, and 64.40% (outside) and 83.80% (inside) at level-3, all much better than the baseline rates of 37.54% and 40.21%.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 A Pseudo-WSD Problem
</SectionTitle>
      <Paragraph position="0"> If the correct semantic class can be found in a limited number of candidates, context information can be used to help determine which candidate is more likely to be the proper one, just as in a WSD task. Take again the example of the compound GF9G3E in section 4.1, which the system classifies most likely as 'arrest', 'acquire', and 'attack-occupy'.</Paragraph>
      <Paragraph position="1"> Obviously the verbs in the three classes should take different stereotypical objects: a person, a thing, and a place, respectively. It is therefore not difficult to determine the correct semantic class of the verb in question by using context information, in this case the type of the object. Through this example, we can see that the high inclusion rate of the correct answer among the top-ranked classes is in fact highly significant: the ranking of the top candidates can be further adjusted and eventually improved by context information, so the task of class determination can become a pseudo-WSD problem, a domain in which various techniques are readily available (Manning and Schutze, 1999). The performance of the present non-contextual system of automatic semantic classification is thus expected to be improvable with the eventual help of a good context-sensitive WSD system, though that is beyond the scope of this paper. The correct inclusion rate of the top n ranked classes is therefore also a concern of this paper.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Endocentric vs. Exocentric Compounds
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the performance of the system on the endocentric compounds (with heads) and on the exocentric ones (without heads) at level-3. Among the 500 V-V compounds, the endocentric V-V compounds have much higher precision rates than the exocentric ones. But even for the exocentric compounds, the precision rate of the system is 49.28% in the inside test and 27.05% in the outside test, while the correct inclusion rate of the top 3 ranked classes reaches 74.64% in the inside test and 51.69% in the outside test. Such a performance is in fact rather encouraging, since it shows that this model has overcome the inherent difficulty met by head-oriented approaches.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Syntactic Category Filter
</SectionTitle>
      <Paragraph position="0"> To test the function of Filter-C in the model, two sets of 500 V-V compounds are randomly picked out, one from verbs of category VC and one from verbs of category VA in the Sinica Corpus10. Tables 3 and 4 show the performance of the system on the two kinds of verbs when evaluated at level-3. The results show that the system using the syntactic category filter (+SCF) performs slightly better than the system without the filter (-SCF) only in the precision of the first ranked class in the outside test. Apart from that, the use of the syntactic category filter generally undermines the performance of the system. Such a result might be explained by the fact that synonymous words in CILIN are not necessarily of the same syntactic category; it also suggests that, for the entire model, recall is perhaps more important than precision.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.6 Classification Errors
</SectionTitle>
      <Paragraph position="0"> An examination of the badly performing cases suggests three major sources of erroneous classification in the experiments. (1) Some test compounds are simply idiomatic or not semantically compositional; it is naturally very difficult, if not impossible, to predict their semantic classes correctly. (2) Some compounds come from unproductive S-templates, which causes example sparseness among the T-similar compounds. The scarcity of examples easily leads to a poor determination result, owing to a low tolerance for occasional bad examples. (3) Some classifications predicted by the system are reasonable to native speakers, but happen not to match the standard answers in CILIN.</Paragraph>
    </Section>
  </Section>
</Paper>