<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1113">
  <Title>Using Synonym Relations In Chinese Collocation Extraction</Title>
  <Section position="3" start_page="0" end_page="0" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Methods have been proposed to extract collocations based on lexical statistics. Choueka (Choueka 1993) applied quantitative selection criteria based on a frequency threshold to extract adjacent n-grams (including bi-grams). Church and Hanks (Church and Hanks 1990) employed mutual information to extract both adjacent and distant bi-grams that tend to co-occur within a fixed-size window. However, their method was not extended to n-grams. Smadja (Smadja 1993) proposed a statistical model that measures the spread of the distribution of co-occurring word pairs with higher strength. This method successfully extracted both adjacent and distant bi-grams and n-grams. However, the method failed to extract bi-grams with lower frequency. The precision on bi-gram collocations is very low, only between the high 20% range and the low 30% range. Although it is difficult to measure the recall rate in collocation extraction (there are almost no reports on recall estimation), it is understood that low-occurrence collocations cannot be extracted.</Paragraph>
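The window-based mutual information scoring mentioned above can be illustrated with a minimal sketch. It assumes probabilities are estimated as simple relative frequencies and that a pair is counted whenever the second word falls within a fixed number of positions to the right of the first. The function name `window_pmi` and the `window` parameter are illustrative choices, not taken from Church and Hanks (1990).

```python
import math
from collections import Counter


def window_pmi(tokens, window=5):
    """Score ordered word pairs by pointwise mutual information.

    A pair (w1, w2) is counted each time w2 appears within `window`
    positions to the right of w1 (an assumed counting scheme).
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + 1 + window]:
            pair_counts[(w1, w2)] += 1

    n = len(tokens)
    scores = {}
    for (w1, w2), pair_freq in pair_counts.items():
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ); with relative
        # frequencies this reduces to log2(f(w1, w2) * n / (f(w1) * f(w2))).
        ratio = pair_freq * n / (word_counts[w1] * word_counts[w2])
        scores[(w1, w2)] = math.log2(ratio)
    return scores


if __name__ == "__main__":
    toy = "strong tea strong tea weak coffee".split()
    for pair, score in sorted(window_pmi(toy, window=1).items()):
        print(pair, round(score, 3))
```

On sparse data, pairs that occur only once can receive high scores from this measure, which is one reason purely statistical methods are usually combined with a frequency threshold.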
    <Paragraph position="1"> Our research group has further applied the Xtract system to Chinese (Lu et al. 2003) by adjusting the parameters to optimize the algorithm for Chinese. In addition, a new weighted algorithm based on mutual information was developed to acquire word bi-grams that consist of one higher-frequency word and one lower-frequency word. The result is an estimated 5% improvement in recall and a 15% improvement in precision compared to the Xtract system.</Paragraph>
    <Paragraph position="2"> None of the above techniques takes advantage of the wide range of lexical resources available, including synonym information. Pearce (Pearce 2001) presented a collocation extraction technique that relies on a mapping from a word to its synonyms for each of its senses. The underlying intuition is that if the occurrence counts of the two members of a synonym pair with respect to a particular word differ by at least two, this is deemed sufficient to consider the more frequent combination a collocation. To apply this approach, knowledge of word (concept) semantics and of relations to other words must be available, for example through WordNet. Dagan (Dagan 1997) applied a similarity-based smoothing method to solve the problem of data sparseness in statistical natural language processing. Experiments in his later work showed that this method achieved much better results than back-off smoothing methods in word sense disambiguation. Similarly, Hua Wu (Wu and Zhou 2003) applied synonym relationships between two different languages to automatically acquire English synonymous collocations. This was the first time that the concept of synonymous collocation was proposed. A side intuition raised here is that natural language is full of synonymous collocations. Because many of them have low occurrence counts, they cannot be retrieved by lexical statistical methods. Even though there are Chinese synonym dictionaries, such as Tong Yi Ci Lin, these dictionaries lack structured knowledge, and their synonyms are too loosely defined to be used for collocation extraction.</Paragraph>
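The count-difference criterion attributed to Pearce above can be sketched roughly as follows. This is only an illustration of the rule as summarized in this section, not Pearce's actual implementation: the function name, the data layout, and the toy counts are all hypothetical, and synonym sets are assumed to be supplied by some external resource.

```python
from itertools import permutations


def synonym_contrast_candidates(cooc_counts, synonym_sets, min_diff=2):
    """Flag (word, synonym) pairs that out-occur a sibling synonym.

    cooc_counts: dict mapping (word, partner) to a co-occurrence count
    synonym_sets: iterable of sets of mutually synonymous partner words
    min_diff: required count difference (two, following the text above)
    """
    words = {word for (word, _partner) in cooc_counts}
    candidates = set()
    for word in words:
        for syns in synonym_sets:
            # Compare every ordered pair of synonyms against this word.
            for a, b in permutations(sorted(syns), 2):
                count_a = cooc_counts.get((word, a), 0)
                count_b = cooc_counts.get((word, b), 0)
                if count_a - count_b >= min_diff:
                    candidates.add((word, a))
    return candidates


if __name__ == "__main__":
    # Hypothetical toy counts, invented purely for illustration.
    counts = {
        ("tea", "strong"): 5,
        ("tea", "powerful"): 0,
        ("computer", "powerful"): 4,
        ("computer", "strong"): 3,
    }
    print(synonym_contrast_candidates(counts, [{"strong", "powerful"}]))
```

In the toy data, only the pairing of "tea" with "strong" is flagged, because the two synonyms differ by one count for "computer", which is below the threshold.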
    <Paragraph position="3"> HowNet, developed by Dong et al. (Dong and Dong 1999), is the best publicly available resource on Chinese semantics. By making use of the semantic similarities of words, synonyms can be defined by the closeness of their related concepts, and this closeness can be calculated. In Section 3, we present our method to extract synonyms from HowNet and to use synonym relations to further extract collocations.</Paragraph>
    <Paragraph position="4"> Sun (Sun 1997) carried out a preliminary quantitative analysis of Chinese collocations based on their arbitrariness, recurrence and syntactic structure. The purpose of that study was to help determine whether a collocation is true or not according to these quantitative factors. By observing the existence of synonym information in natural language use, we consider it possible to identify different types of collocations using the additional semantic and syntactic information available. We discuss the basic ideas in Section 5.</Paragraph>
  </Section>
</Paper>