<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1113">
  <Title>Using Synonym Relations In Chinese Collocation Extraction</Title>
  <Section position="4" start_page="0" end_page="178" type="metho">
    <SectionTitle>
3 Our Approach
</SectionTitle>
    <Paragraph position="0"> Our method of extracting Chinese collocations consists of three steps.</Paragraph>
    <Paragraph position="1"> Step 1: Take the output of any lexical statistical algorithm which extracts word bi-gram collocations. The data is then sorted according to each headword , W</Paragraph>
    <Paragraph position="3"> used to extract bigrams, we acquire its synonyms based on a similarity function using HowNet. Any word in HowNet having similarity value over a threshold value is chosen as a synonym headword W s for additional extractions.</Paragraph>
    <Paragraph position="4"> Step 3: For each synonym headword, W  ) as a collocation if the pair co-occurs in the corpus by additional search to the corpus.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Structure of HowNet
</SectionTitle>
      <Paragraph position="0"> Different from WordNet or other synonyms dictionary, HowNet describes words as a set of concepts and each concept is described by a set of primitives . The following lists for the word , one of its corresponding concepts In the above record, DEF is where the primitives are specified. DEF contains up to four types of primitives: the basic independent primitive , the other independent primitive , the relation primitive , and the symbol primitive , where the basic independent primitive and the other independent primitive are used to indicate the semantics of a concept and the others are used to indicate syntactical relationships. The similarity model described in the next subsection will consider both of these relationships.</Paragraph>
      <Paragraph position="1"> The primitives are linked by a hierarchical tree to indicate the parent-child relationships of the primitives as shown in the following example: This hierarchical structure provides a way to link one concept with any other concept in HowNet, and the closeness of concepts can be simulated by the distance between two concepts.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="21" type="sub_section">
      <SectionTitle>
3.2 Similarity Model Based on HowNet
</SectionTitle>
      <Paragraph position="0"> Liu Qun (Liu 2002) defined word similarity as two words which can substitute each other in the same context and still maintain the sentence consistent syntactically and semantically. This is very close to our definition of synonyms. Thus we directly used their similarity function, which is stated as follows.</Paragraph>
      <Paragraph position="1"> A word in HowNet is defined as a set of concepts and each concept is represented by primitives. Thus, HowNet can be described by W, a collection of n words, as:</Paragraph>
      <Paragraph position="3"> is, in turn, described by a set of concepts S as:</Paragraph>
      <Paragraph position="5"> is, in turn, described by a set of primitives:</Paragraph>
      <Paragraph position="7"/>
      <Paragraph position="9"> where a is an adjustable parameter set to 1.6, and ),(  ppDis is the path length between p</Paragraph>
      <Paragraph position="11"> based on the semantic tree structure. The above formula where a is a constant does not indicate explicitly the fact that the depth of a pair of nodes in the tree affects their similarity. For two pairs of  ) with the same distance, the deeper the depth is, the more commonly shared ancestros they would have which should be semantically closer to each other. In following two tree structures, the pair of nodes (p</Paragraph>
      <Paragraph position="13"> ) is the depth of node p</Paragraph>
      <Paragraph position="15"> in the tree . The comparison of calculating the word similarity by applying the formula (2) and (2a) is shown in Section 4.4.</Paragraph>
      <Paragraph position="16"> Based on the DEF description in HowNet, different primitive types play different roles only some are directly related to semantics. To make use of both the semantic and syntactic information included in HowNet to describe a word, the similarity of two concepts should take into consideration of all primitive types with weighted considerations and thus the formula is defined as  . The distribution of the weighting factors is given for each concept a priori in HowNet to indicate the importance of primitive</Paragraph>
      <Paragraph position="18"> in defining the corresponding concept S.</Paragraph>
    </Section>
    <Section position="3" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
3.3 Collocation Extraction
</SectionTitle>
      <Paragraph position="0"> In order to extract collocations from a corpus, and to obtain result for Step 1 of our algorithm, we used the collocation extraction algorithm developed by the research group at the Hong Kong Polytechnic University(Lu et al. 2003). The extraction of bi-gram collocation is based on the English Xtract(Smaja 1993) with improvements.</Paragraph>
      <Paragraph position="1"> Based on the three Steps mentioned earlier, we will present the extractions in each step in the subsections.</Paragraph>
      <Paragraph position="2">  Based on the lexical statistical model proposed by Smadja in Xtract on extracting English collocations, an improved algorithm was developed for Chinese collocation by our research group and the system is called CXtract. For easy of understanding, we will explain the algorithm briefly here. According to Xtract, word cooccurence is denoted by a tripplet (w</Paragraph>
      <Paragraph position="4"> appeared in the corpus in a distance d within the window of [-5, 5]. The frequency f</Paragraph>
      <Paragraph position="6"> in the window of [-5, 5] is defined as:  Then, the average frequency f , and the standard deviation s are defined by</Paragraph>
      <Paragraph position="8"> To eliminate the bi-grams with unlikely cooccurrence, the following sets of threshold values is defined:</Paragraph>
      <Paragraph position="10"> However, the above statistical model given by Smadja fails to extract the bi-grams with a much higher frequency of w h but a relatively low frequency word of w</Paragraph>
      <Paragraph position="12"> For example, in the bi-gram , freq ( ) is much lower than the freq ( ). Therefore, we further defined a weighted mutual information to extract this kind of bi-grams:</Paragraph>
      <Paragraph position="14"> As a result, the system should return a list of</Paragraph>
      <Paragraph position="16"> collocations.</Paragraph>
      <Paragraph position="17">  For each given headword w  h , before taking it as an input to extract its bi-grams directly, we fist apply the similarity formula described in Equation (1) to generate a set of synonyms headwords W</Paragraph>
      <Paragraph position="19"> Where 0 &lt;th &lt;1 is an algorithm parameter which is adjusted based on experience. We set it as 0.85 from the experiment because we would like to balance the strength of the synonyms relationship and the coverage of the synonyms set. The setting of the parameter th &lt; 0.85 weaks the similarity strength of the extracted synonyms. For example, for a given collocation &amp;quot; &amp;quot;, that is unlikely to include the candidates &amp;quot; &amp;quot;, &amp;quot; &amp;quot;, &amp;quot; &amp;quot;. On the other hand, by setting the parameter th &gt; 0.85 will limit the coverage of the synonyms set and hence lose valuable synonyms.</Paragraph>
      <Paragraph position="20"> For example, for a given bi-gram &amp;quot; &amp;quot;, we hope to include the candidate synonymous collocations such as &amp;quot; &amp;quot;, &amp;quot; &amp;quot;, &amp;quot; &amp;quot;. We will show the test of th in the section 4.2.</Paragraph>
      <Paragraph position="21"> This synonyms headwords set provides the possibility to extract the synonymous collocation with the lower frequency that failed to be extracted by lexical statistic.</Paragraph>
      <Paragraph position="22">  A phenomenal among the collocations in natural language is that there are many synonymous collocations exist. For example, 'switch on light' and 'turn on light', &amp;quot; &amp;quot; and &amp;quot; &amp;quot;. Due to the domain specification of the corpus, some of the synonymous collocations may fail to be extracted by the lexical statistic model because of their lower frequency. Based on this observation, this paper takes a further step. The basic idea is for a bi-gram collocation (w  ) as a collocation if its occurrence is greater than 1 in the corpus. There are similar works discussed by Pearce (Pearce 2001). .</Paragraph>
      <Paragraph position="23"> For a given collocation (w  To evaluate the performance of our approach, we conducted a set of experiments based on 9 selected headwords. A baseline system using only lexical statistics given in 3.3.1 is used to get a set of baseline data called Set A. The output using our algorithm is called Set B. Results are checked by hand for validation on what is true collocation and what is not a true collocation.</Paragraph>
      <Paragraph position="24">  not true collocations Table 1 shows samples of extracted word bi-grams using our algorithm that are considered synonymous collocations for the headword &amp;quot; &amp;quot;. Table 2 shows extracted bi-grams by our algorithm that are not considered true collocations.</Paragraph>
    </Section>
    <Section position="4" start_page="21" end_page="178" type="sub_section">
      <SectionTitle>
4.1 Test Set
</SectionTitle>
      <Paragraph position="0"> Our experiment is based on a corpus of six months tagged People Daily with 11 millions number of words. For word bi-gram extractions, we consider only content words, thus headwords are selected from noun, verb and adjective only.</Paragraph>
      <Paragraph position="1"> For evaluation purpose, we selected randomly 3 nouns, 3 verbs and 3 adjectives with frequency of low, medium and high. Thus, in Step 1 of the algorithm, 9 headwords were used to extract bi-gram collocations from the corpus, and 253 pairs of collocations were extracted. Evaluation by hand has identified 77 true collocations in Set A test set. The overall precision rate is 30% (see Table 3).</Paragraph>
      <Paragraph position="2">  Using Step 2 of our algorithm, where th =0.85 is used, we have obtained 55 synonym headwords (include the 9 headwords). Out of these 55 synonyms, 614 bi-gram pairs were then extracted from the lexical statistics based algorithm, in which 179 are consider true collocations. Then, by applying Step 3 of our algorithm, we extracted an additional 201 bi-gram pairs, among them, 178 are considered true collocations. Therefore, using our algorithm, the overall precision rate has achieved 43%, an improvement of almost 50%. The data is summarized in Table 4.</Paragraph>
      <Paragraph position="3"> n., v, and adj.</Paragraph>
    </Section>
    <Section position="5" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
4.2 The choice of th
</SectionTitle>
      <Paragraph position="0"> We also conducted a set of experiments to choose the best value for the similarity function's threshold th . We tested the best value of th with both the precision rate and the estimated recall rate using the so called remainder bi-grams. The remainder bi-grams is the total number of bi-grams extracted by the algorithm. When precision goes up, the size of the result is smaller, which in a way is an indicator of less recalled collocations. Figure  From Figure 1, it is obvious that at th =0.85 the recall rate starts to drop more drastically without much incentive for precision.</Paragraph>
      <Paragraph position="1">  synonyms collocations, we have also conducted some experiments to see whether the parameters should be adjusted. Table 5 shows the statistics to test the value of (K  =14 is a good trade-off for the precision rate and the remainder Bi-grams. The basic meaning behind the result is reasonable. According to  , d), its co-occurrence in the position d is much higher than in other positions which leads to a peak in the co-occurrence distribution. Therefore, it is selected by the statistical algorithm based on the formula (10). Based on the physical meaning behind, one way to improve the precision rate is to increase the value of</Paragraph>
    </Section>
    <Section position="6" start_page="178" end_page="178" type="sub_section">
      <SectionTitle>
4.4 Comparison of similarity calculation based on formulas (2) and (2a)
</SectionTitle>
      <Paragraph position="0"> based on formula (2) and (2a) Table 6 shows the similarity value given by formula (2) where a is a constant given the value 1.6 and by formula (2a) where a is replaced by a function of the depths of the nodes. Results show that (2a) is more fine tuned and reflects the nature of the data better. For example, and are more similar than and .</Paragraph>
      <Paragraph position="1"> and are much similar but not the same.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML