<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2001">
  <Title>Hybrid Methods for POS Guessing of Chinese Unknown Words</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Previous Approaches
</SectionTitle>
    <Paragraph position="0"> Previous studies all attempted to develop a unified statistical model for this task.</Paragraph>
    <Paragraph position="1"> Chen et al. (1997) examined all unknown nouns, verbs, and adjectives and reported a precision of 69.13%, using Dice metrics to measure the affix-category association strength and an affix-dependent entropy weighting scheme to determine the relative weights of prefix-category and suffix-category associations. This approach is blind to the type, length, and context of unknown words. Wu and Jiang (2000) calculated P(Cat,Pos,Len) for each character, where Cat is the POS of a word containing the character, Pos is the position of the character in that word, and Len is the length of that word. They then calculated the POS probabilities for each unknown word as the joint probabilities of the P(Cat,Pos,Len) of its component characters. This approach was applied to unknown nouns, verbs, and adjectives that are two to four characters long. They did not report results on unknown word tagging, but reported that the new word identification and tagging mechanism increased parser coverage. We will show that this approach suffers reduced recall for multisyllabic words if the training corpus is small. Goh (2003) reported a precision of 59.58% on all unknown words using Support Vector Machines.</Paragraph>
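    <Paragraph> The character-based scoring attributed to Wu and Jiang (2000) above can be illustrated with a minimal Python sketch. It assumes relative-frequency estimates normalized per character and no smoothing; neither detail is stated in this text, and the function names train_char_model and guess_pos are hypothetical.

```python
from collections import Counter, defaultdict


def train_char_model(tagged_words):
    """Estimate P(Cat, Pos, Len) for each character by relative frequency.

    tagged_words: iterable of (word, pos_tag) pairs from a training lexicon.
    Returns {char: {(cat, pos, length): probability}}.
    Assumption: probabilities are normalized per character.
    """
    counts = defaultdict(Counter)
    for word, cat in tagged_words:
        for pos, char in enumerate(word):
            counts[char][(cat, pos, len(word))] += 1
    model = {}
    for char, counter in counts.items():
        total = sum(counter.values())
        model[char] = {key: n / total for key, n in counter.items()}
    return model


def guess_pos(word, model, categories):
    """Rank categories by the product of the characters' probabilities.

    A category whose product is zero is dropped, so a word containing an
    unseen (character, position, length) combination gets no guess for it.
    """
    scores = {}
    for cat in categories:
        score = 1.0
        for pos, char in enumerate(word):
            score *= model.get(char, {}).get((cat, pos, len(word)), 0.0)
        if score > 0.0:
            scores[cat] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Because every component character must have been seen in the same position of a word of the same length, the product is often zero for longer words when the training data is small, which is consistent with the reduced recall noted above.</Paragraph>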
    <Paragraph position="2"> Several reasons were suggested for rejecting the rule-based approach. First, Chen et al. (1997) claimed that it does not work because the syntactic and semantic information for each character or morpheme is unavailable. This claim does not fully hold, as the POS information about the component words or morphemes of many unknown words is available in the training lexicon. Second, Wu and Jiang (2000) argued that assigning POS to Chinese unknown words on the basis of the internal structure of those words will "result in massive overgeneration" (p. 48). We will show that overgeneration can be controlled by additional constraints.</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="3" type="metho">
    <SectionTitle>
4 Proposed Approach
</SectionTitle>
    <Paragraph position="0"> We propose a hybrid model that combines the strengths of different models to arrive at better results for this task. The models we will consider are a rule-based model, the trigram model, and the statistical model developed by Wu and Jiang (2000).</Paragraph>
    <Paragraph position="1"> Combination of the three models will be based on the evaluation of their individual performances on the training data.</Paragraph>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
4.1 The Rule-Based Model
</SectionTitle>
      <Paragraph position="0"> The motivations for developing a set of rules for this task are twofold. First, the rule-based approach was dismissed without testing in previous studies. However, hybrid models that combine rule-based and statistical models outperform purely statistical models in many NLP tasks. Second, the rule-based model can incorporate information about the length, type, and internal structure of unknown words at the same time.</Paragraph>
      <Paragraph position="1"> Rule development involves knowledge of Chinese morphology and generalizations of the training data. Disyllabic words are harder to generalize than longer words, probably because their monosyllabic component morphemes are more fluid than the longer component morphemes of longer words.</Paragraph>
      <Paragraph position="2"> It is interesting to see if reduction in the degree of fluidity of its components makes a word more predictable. We therefore develop a separate set of rules for each of four word lengths: two, three, four, and five or more characters. The rules developed fall into the following four types: 1) reduplication rules (T1), which tag reduplicated unknown words based on knowledge about the reduplication process; 2) derivation rules (T2), which tag derived unknown words based on knowledge about the affixation process; 3) compounding rules (T3), which tag unknown compounds based on the POS information of their component words; and 4) rules based on generalizations about the training data (T4). Rules may come with additional constraints to avoid overgeneration. The number of rules in each set is listed in Table 1. The complete set of rules was developed over a period of two weeks.</Paragraph>
      <Paragraph position="3"> As will be shown below, the order in which the rules in each set are applied is crucial for dealing with ambiguous cases. To illustrate how rules work, we discuss the complete set of rules for disyllabic words here (see footnote 3). These are given in Figure 1, where A and B refer to the component morphemes of an unknown word AB. As rules for disyllabic words tend to overgenerate and as we prefer precision over recall for the rule-based model, most rules in this set are accompanied by additional constraints.</Paragraph>
      <Paragraph position="4"> In the first reduplication rule, the order of the three cases is crucial in that if A can be both a verb and a noun, AA is almost always a verb. The second rule tags a disyllabic unknown word formed by attaching the diminutive suffix er to a monosyllabic root as a noun. This may appear a hasty generalization, but examination of the data shows that er rarely attaches to monosyllabic verbs except for the few well-known cases. In the third rule, a categorizing suffix is one that attaches to other words to form a noun that refers to a category of people or objects, e.g., jiā '-ist'. The constraint "A is not a verb morpheme" excludes cases where B is polysemous and functions not as a categorizing suffix but as a noun morpheme. Thus, this rule tags bèng-yè 'water-pump industry' as a noun, but not lí-yè 'leave-job', i.e., 'resign'. The fourth rule tags words such as shā-xiāng 'sand-box' as nouns, but the constraints prevent verbs such as sōng-kòu 'loosen-button' from being tagged as nouns. Sōng can be both a noun and a verb, but it is used as a verb in this word. The last two rules make use of two lists of characters extracted from the list of disyllabic words in the training data, i.e., those that have only appeared in the verb-initial and noun-final positions respectively. This is done because in Chinese, disyllabic compound verbs tend to be head-initial, whereas disyllabic compound nouns tend to be head-final. The fifth rule tags words such as dīng-yǎo 'sting-bite' as verbs, and the additional constraints prevent nouns such as fú-xiàng 'lying-elephant' from being tagged as verbs. The last rule tags words such as xuě-bèi 'snow-quilt' as nouns, but not zhāi-shāo 'pick-tip', i.e., 'pick the tips'.

Figure 1 (rules for disyllabic words):
if A equals B
  if A is a verb morpheme, AB is a verb
  else if A is a noun morpheme, AB is a noun
  else if A is an adjective morpheme, AB is a stative adjective/adverb
else if B equals er, AB is a noun
else if B is a categorizing suffix AND A is not a verb morpheme, AB is a noun
else if A and B are both noun morphemes but not verb morphemes, AB is a noun
else if A occurs verb-initially only AND B is not a noun morpheme AND B does not occur noun-finally only, AB is a verb
else if B occurs noun-finally only AND A is not a verb morpheme AND A does not occur verb-initially only, AB is a noun

Footnote 3: Multisyllabic words can have various internal structures, e.g., a disyllabic noun can have a N-N, Adj-N, or V-N structure.</Paragraph>
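      <Paragraph> The ordered rules for disyllabic words can be sketched in Python as follows. This is an illustrative reading of Figure 1, not the authors' implementation: the function name tag_disyllabic, the tag strings "v", "n", and "a", and the representation of the lexicon as a mapping from morphemes to sets of tags are all assumptions. It also assumes that a reduplicated word matching none of the three reduplication cases is left untagged.

```python
def tag_disyllabic(a, b, lexicon, categorizing_suffixes,
                   verb_initial_only, noun_final_only):
    """Return a POS guess for the unknown word a+b, or None if no rule fires.

    lexicon maps a morpheme to the set of POS tags it can carry
    (hypothetical tag names: "v" verb, "n" noun, "a" adjective).
    """
    pos_a = lexicon.get(a, set())
    pos_b = lexicon.get(b, set())

    # Rule 1: reduplication. The verb reading is tried before the noun reading.
    if a == b:
        if "v" in pos_a:
            return "v"
        if "n" in pos_a:
            return "n"
        if "a" in pos_a:
            return "stative adjective/adverb"
        return None  # assumption: no fall-through to the later rules

    # Rule 2: diminutive suffix er.
    if b == "er":
        return "n"
    # Rule 3: categorizing suffix attached to a non-verb morpheme.
    if b in categorizing_suffixes and "v" not in pos_a:
        return "n"
    # Rule 4: both parts are noun morphemes and neither is a verb morpheme.
    if ("n" in pos_a and "n" in pos_b
            and "v" not in pos_a and "v" not in pos_b):
        return "n"
    # Rule 5: head-initial verb compounds.
    if (a in verb_initial_only and "n" not in pos_b
            and b not in noun_final_only):
        return "v"
    # Rule 6: head-final noun compounds.
    if (b in noun_final_only and "v" not in pos_a
            and a not in verb_initial_only):
        return "n"
    return None
```

Returning None when no rule fires reflects the stated preference for precision over recall: untagged words are left to the statistical models.</Paragraph>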
      <Paragraph position="5"> One derivation rule for trisyllabic words has a special status. Following the tagging guidelines of our training corpus, it tags a word ABC as verb/deverbal noun (v/vn) if C is the suffix huà '-ize'. Disambiguation is left to the statistical models.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
4.2 The Trigram Model
</SectionTitle>
      <Paragraph position="0"> The trigram model is used because it captures the information about the POS context of unknown words and returns a tag for each unknown word. We assume that the unknown POS depends on the previous two POS tags, and calculate the trigram probability P(t3|t1,t2), where t3 stands for the unknown POS, and t1 and t2 stand for the two previous POS tags. The POS tags for known words are taken from the tagged training corpus. Following Brants (2000), we first calculate the maximum likelihood probabilities ^P for unigrams, bigrams, and trigrams as in (1)-(3). To handle the sparse-data problem, we use the smoothing paradigm that Brants reported as delivering the best result for the TnT tagger, i.e., the context-independent variant of linear interpolation of unigrams, bigrams, and trigrams. A trigram probability is then calculated as in (4).</Paragraph>
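      <Paragraph> The numbered equations are not reproduced in this text. Following the formulation in Brants (2000), they can be written as below, where f denotes a frequency count in the training corpus and N the total number of tokens; the symbols f, N, and the lambda weights are notation assumed here rather than taken from the surviving text.

```latex
\begin{gather}
\hat{P}(t_3) = \frac{f(t_3)}{N} \tag{1} \\
\hat{P}(t_3 \mid t_2) = \frac{f(t_2, t_3)}{f(t_2)} \tag{2} \\
\hat{P}(t_3 \mid t_1, t_2) = \frac{f(t_1, t_2, t_3)}{f(t_1, t_2)} \tag{3} \\
P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2) \tag{4}
\end{gather}
```
</Paragraph>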
      <Paragraph position="2"> As in Brants (2000), λ1 + λ2 + λ3 = 1, and the values of λ1, λ2, and λ3 are estimated by deleted interpolation, following Brants' algorithm for calculating the weights for context-independent linear interpolation when the n-gram frequencies are known.</Paragraph>
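      <Paragraph> The weight estimation can be sketched in Python as follows, assuming the deleted-interpolation procedure as described by Brants (2000): for each observed trigram, the order whose leave-one-out relative frequency is largest has its weight incremented by the trigram count, and the weights are normalized at the end. The function names are hypothetical, n-gram counts are taken within each tag sequence, and ties are resolved in favor of the lower order, which is an assumption of this sketch.

```python
from collections import Counter


def count_ngrams(tag_sequences):
    """Count tag unigrams, bigrams, and trigrams within each sequence."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    return uni, bi, tri


def safe_div(num, den):
    """Division that is defined as 0 when the denominator is 0."""
    return num / den if den != 0 else 0.0


def deleted_interpolation(uni, bi, tri):
    """Estimate [lambda1, lambda2, lambda3] from the n-gram counts."""
    n = sum(uni.values())
    lam = [0.0, 0.0, 0.0]
    for (t1, t2, t3), f in tri.items():
        cases = [
            safe_div(uni[t3] - 1, n - 1),             # unigram evidence
            safe_div(bi[(t2, t3)] - 1, uni[t2] - 1),  # bigram evidence
            safe_div(f - 1, bi[(t1, t2)] - 1),        # trigram evidence
        ]
        lam[cases.index(max(cases))] += f
    total = sum(lam)
    return [x / total for x in lam] if total else lam


def trigram_prob(t1, t2, t3, uni, bi, tri, lam):
    """Linearly interpolated estimate of P(t3 | t1, t2)."""
    n = sum(uni.values())
    p1 = safe_div(uni[t3], n)
    p2 = safe_div(bi[(t2, t3)], uni[t2])
    p3 = safe_div(tri[(t1, t2, t3)], bi[(t1, t2)])
    return lam[0] * p1 + lam[1] * p2 + lam[2] * p3
```

Ranking the candidate tags for an unknown word by trigram_prob, given the two preceding tags, yields the list of guesses that the trigram model contributes.</Paragraph>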
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.3 Wu and Jiang's (2000) Statistical Model
</SectionTitle>
      <Paragraph position="0"> There are several reasons for integrating another statistical model into the hybrid model. The rule-based model is expected to yield high precision, as overgeneration is minimized, but it is bound to suffer low recall for disyllabic words. The trigram model covers all unknown words, but its precision needs to be boosted.</Paragraph>
      <Paragraph position="1"> Wu and Jiang's (2000) model provides a good complement for the two, because it achieves a higher recall than the rule-based model and a higher precision than the trigram model for disyllabic words.</Paragraph>
      <Paragraph position="2"> As our training corpus is relatively small, this model will suffer a low recall for longer words, but those are handled effectively by the rule-based model. In principle, other statistical models can also be used, but Wu and Jiang's model appears more appealing because of its relative simplicity and higher or comparable precision. It is used to handle disyllabic and trisyllabic unknown words only, as recall drops significantly for longer words.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.4 Combining Models
</SectionTitle>
      <Paragraph position="0"> To determine the best way to combine the three models, their individual performances are first evaluated on the training data to identify their strengths. Based on that evaluation, we come up with the algorithm in Figure 2. For each unknown word, if the trigram model returns exactly one POS tag, that tag is prioritized, because in the training data, such tags turn out to be always correct. Otherwise, the guess returned by the rule-based model is prioritized, followed by Wu and Jiang's model. If neither of them returns a guess, the guess returned by the trigram model is accepted. This order of priority is based on the precision of the individual models in the training data. If the rule-based model returns the "v/vn" guess, we first check which of the two tags ranks higher in the list of guesses returned by Wu and Jiang's model. If that list is empty, we then check which of them ranks higher in the list of guesses returned by the trigram model.

Figure 2 (algorithm for combining the models):
for each unknown word
  if the trigram model returns one single guess, take it
  else if the rule-based model returns a non-v/vn tag, take it
  else if the rule-based model returns a v/vn tag
    if W&amp;J's model returns a list of guesses
      eliminate non-v/vn tags on that list and return the rest of it
    else
      eliminate non-v/vn tags on the list returned by the trigram model and return the rest of it
  else if W&amp;J's model returns a list of guesses, take it
  else return the list of guesses returned by the trigram model</Paragraph>
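      <Paragraph> The priority order described above can be sketched as a single Python function. The function name combine, the tag strings, and the convention that each statistical model returns a ranked list while the rule-based model returns one tag or None are assumptions made for illustration.

```python
def combine(trigram_guesses, rule_guess, wj_guesses):
    """Return a ranked list of POS guesses for one unknown word.

    trigram_guesses: ranked list from the trigram model.
    rule_guess: a tag from the rule-based model, "v/vn", or None.
    wj_guesses: ranked list from Wu and Jiang's model (may be empty).
    """
    # A unique trigram guess takes priority over everything else.
    if len(trigram_guesses) == 1:
        return trigram_guesses
    # A definite rule-based tag comes next.
    if rule_guess is not None and rule_guess != "v/vn":
        return [rule_guess]
    # "v/vn" only narrows the search space for the statistical models.
    if rule_guess == "v/vn":
        source = wj_guesses if wj_guesses else trigram_guesses
        return [tag for tag in source if tag in ("v", "vn")]
    # No rule fired: fall back to Wu and Jiang's model, then the trigram model.
    if wj_guesses:
        return wj_guesses
    return trigram_guesses
```

In the "v/vn" branch, the first element of the filtered list is whichever of the two tags ranks higher in the chosen model's output.</Paragraph>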
    </Section>
  </Section>
  <Section position="6" start_page="3" end_page="4" type="metho">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
5.1 Experiment Setup
</SectionTitle>
      <Paragraph position="0"> The different models are trained and tested on a portion of the Contemporary Chinese Corpus of Peking University (Yu et al., 2002), which is segmented and POS tagged. This corpus uses a tagset consisting of 40 tags. We consider unknown words that are 1) two or more characters long, 2) formed through reduplication, derivation, or compounding, and 3) in one of the eight categories listed in Table 2. The corpus consists of all the news articles from People's Daily in January 1998. It has a total of 1,121,016 tokens, including 947,959 word tokens and 173,057 punctuation marks. 90% of the data are used for training, and the other 10% are reserved for testing. We downloaded a reference lexicon containing 119,791 entries. A word is considered unknown if it is in the wordlist extracted from the training or test data but is not in the reference lexicon. Given this definition, we first train and evaluate the individual models on the training data and then evaluate the final combined model on the test data. The distribution of unknown words is summarized in Table 3.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="4" end_page="4" type="metho">
    <SectionTitle>
5.2 Results for the Individual Models
</SectionTitle>
    <Paragraph position="0"> The results for the rule-based model are listed in Table 4. Recall (R) is defined as the number of correctly tagged unknown words divided by the total number of unknown words. Precision (P) is defined as the number of correctly tagged unknown words divided by the number of tagged unknown words.</Paragraph>
    <Paragraph position="1"> The small number of words tagged "v/vn" is excluded from the count of tagged unknown words for calculating precision, as this tag is not a final guess but is returned to reduce the search space for the statistical models. F-measure (F) is computed as</Paragraph>
  </Section>
  <Section position="8" start_page="4" end_page="4" type="metho">
    <SectionTitle>
2 × RP/(R + P). The rule-based model achieves
</SectionTitle>
    <Paragraph position="0"> very high precision, but recall for disyllabic words is low.</Paragraph>
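    <Paragraph> The recall, precision, and F-measure definitions used for the rule-based model can be sketched in Python as follows. The function name evaluate and the dictionary-based data layout are assumptions; a prediction of None stands for an untagged word.

```python
def evaluate(gold, predicted):
    """Compute (precision, recall, f_measure) for unknown-word tagging.

    gold: {word: correct_tag} for every unknown word.
    predicted: {word: tag}, where tag may be None (untagged) or "v/vn".
    Words tagged "v/vn" do not count as tagged, since that tag is not
    a final guess.
    """
    correct = sum(1 for word, tag in predicted.items()
                  if tag is not None and tag == gold.get(word))
    tagged = sum(1 for tag in predicted.values()
                 if tag is not None and tag != "v/vn")
    total = len(gold)
    recall = correct / total if total else 0.0
    precision = correct / tagged if tagged else 0.0
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return precision, recall, f_measure
```

Because untagged words lower recall but not precision, a cautious rule set can score very high on precision while recall stays low, which is the pattern reported for disyllabic words.</Paragraph>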
    <Paragraph position="1"> The results for the trigram model are listed in Table 5. Candidates are restricted to the eight POS categories listed in Table 2 for this model. Precision for the best guess in both datasets is about 62%.</Paragraph>
    <Paragraph position="2"> The results for Wu and Jiang's model are listed in Table 6. Recall for disyllabic words is much higher than that of the rule-based model. Precision for disyllabic words reaches the mid-70% range, higher than that of the trigram model. Precision for trisyllabic words is very high, but recall is low.</Paragraph>
  </Section>
</Paper>