<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2239">
  <Title>Word Association and MI-Trigger-based Language Modeling</Title>
  <Section position="6" start_page="1479" end_page="1479" type="concl">
    <SectionTitle>
5 PINYIN-to-Character Conversion
</SectionTitle>
    <Paragraph position="0"> As an application of MI-Trigger-based modeling, a PINYIN-to-Character Conversion (PYCC) system is constructed. PYCC has been one of the basic problems in Chinese processing and the subject of much research over the last decade. Current approaches include: * The longest word preference algorithm \[Chen+87\] with some usage learning methods \[Sakai+93\]. This approach is easy to implement, but the hitting accuracy is limited to 92% even with large word dictionaries.</Paragraph>
    <Paragraph position="1"> * The rule-based approach \[Hsieh+89\] \[Hsu94\]. This approach is able to solve the related lexical ambiguity problem efficiently and the hitting accuracy can be enhanced to 96%.</Paragraph>
    <Paragraph position="2"> * The statistical approach \[Sproat92\] \[Chen93\]. This approach uses a large corpus to compute the N-gram and then uses some statistical or mathematical models, e.g. HMM, to find the optimal path through the lattice of possible character transliterations. The hitting accuracy can be around 96%.</Paragraph>
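The statistical approach above can be sketched as a Viterbi search over the lattice of candidate characters for each PINYIN syllable. The following minimal Python sketch uses toy candidate sets and made-up log-probabilities (none of the data comes from the paper); the `-10.0` back-off score for unseen events is likewise an illustrative assumption.

```python
def viterbi_pycc(pinyin, candidates, bigram_logp, unigram_logp):
    """Return the most probable character sequence for a PINYIN syllable list."""
    first = pinyin[0]
    # best[c] = (log score, best path) over paths ending in character c
    best = {c: (unigram_logp.get(c, -10.0), [c]) for c in candidates[first]}
    for syll in pinyin[1:]:
        nxt = {}
        for c in candidates[syll]:
            # extend the best-scoring previous path; -10.0 backs off unseen bigrams
            score, prev_path = max(
                (s + bigram_logp.get((p, c), -10.0), pth)
                for p, (s, pth) in best.items()
            )
            nxt[c] = (score, prev_path + [c])
        best = nxt
    return max(best.values())[1]

# Toy lattice: "ma shang" should decode to 马上 ("immediately").
candidates = {"ma": ["妈", "马"], "shang": ["上", "商"]}
unigram = {"妈": -1.0, "马": -1.2}   # illustrative log-probabilities
bigram = {("马", "上"): -0.5}
print(viterbi_pycc(["ma", "shang"], candidates, bigram, unigram))  # ['马', '上']
```

The bigram score for 马→上 outweighs the slightly better unigram score of 妈, which is exactly the kind of ambiguity the statistical approach resolves.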
    <Paragraph position="3"> * The hybrid approach using both rules and statistical data \[Kuo96\]. The hitting accuracy can be close to 98%.</Paragraph>
    <Paragraph position="4"> In this section, we apply the MI-Trigger-based models to the PYCC application. For ease of comparison, the PINYIN counterparts of 600 Chinese sentences (6,104 Chinese characters) from Chinese school textbooks are used for testing.</Paragraph>
    <Paragraph position="5"> The PYCC recognition rates of different MI-Trigger models are shown in Table 4.</Paragraph>
    <Paragraph position="6"> Table 4 shows that the DD-MI-Trigger models have better performance than the DI-MI-Trigger models for the same window size. Therefore, the preferred relationships between words should be modeled in a DD way. It is also found that the PYCC recognition rate can reach up to 96.6%.</Paragraph>
    <Paragraph position="7"> As stated above, all the MI-Trigger models include only the best 1M trigger pairs. One may ask: how many trigger pairs should an MI-Trigger model reasonably include? Here, we examine the effect of the number of trigger pairs in an MI-Trigger model on the PINYIN-to-Character conversion rate. We use the DD-6-MI-Trigger model, and the result is shown in Table 5.</Paragraph>
    <Paragraph position="8"> We can see from Table 5 that the recognition rate rises quickly from 90.7% to 96.3% as the number of MI-Trigger pairs increases from 100,000 to 800,000, and then rises slowly from 96.6% to 97.7% as the number increases from 1,000,000 to 6,000,000. Therefore, at least the best 800,000 trigger pairs should be included in the DD-6-MI-Trigger model.</Paragraph>
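Keeping only the best trigger pairs amounts to ranking candidate word pairs by their mutual information and truncating the list. A toy sketch of that selection step follows; the counts and the Chinese words are illustrative, not taken from the paper's corpus.

```python
import math

def top_mi_triggers(pair_counts, word_counts, total, n):
    """Rank (trigger, triggered) word pairs by pointwise MI; keep the best n."""
    def pmi(pair):
        a, b = pair
        p_ab = pair_counts[pair] / total          # joint probability of the pair
        p_a = word_counts[a] / total              # marginal of the trigger word
        p_b = word_counts[b] / total              # marginal of the triggered word
        return math.log(p_ab / (p_a * p_b))
    return sorted(pair_counts, key=pmi, reverse=True)[:n]

# Toy counts: "股票" (stock) strongly triggers "上涨" (rise); "的" (a function
# word) co-occurs often but carries little information.
word_counts = {"股票": 50, "上涨": 40, "的": 500}
pair_counts = {("股票", "上涨"): 30, ("的", "上涨"): 20}
print(top_mi_triggers(pair_counts, word_counts, total=10000, n=1))
```

The frequent function-word pair is ranked below the contentful pair because PMI normalizes by the marginal frequencies, which is why a fixed-size MI-ranked list retains most of the model's useful pairs.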
    <Paragraph position="9"> In order to evaluate the efficiency of MI-Trigger-based language modeling, we compare it with word unigram and bigram models. Both the word unigram and word bigram models are trained on the XinHua corpus of 29M words. The result is shown in Table 6. Here the DD-6-MI-Trigger model with 5M trigger pairs is used. Table 6 shows that * The MI-Trigger model is superior to the word unigram and bigram models. The conditional perplexity of the DD-6-MI-Trigger model is less than that of the word bigram model and much less than that of the word unigram model.</Paragraph>
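For readers unfamiliar with the metric in Table 6, conditional perplexity is the exponential of the negative average per-word log-probability. A small sketch, with made-up probabilities:

```python
import math

def perplexity(word_logprobs):
    """Conditional perplexity = exp(-average log-probability per word)."""
    return math.exp(-sum(word_logprobs) / len(word_logprobs))

# If every word is assigned probability 1/4, the perplexity is 4:
# the model is, on average, as uncertain as a uniform 4-way choice.
print(perplexity([math.log(0.25)] * 4))
```

A lower perplexity means the model assigns higher probability to the test text, which is the sense in which the MI-Trigger model beats the bigram model in Table 6.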
    <Paragraph position="10"> * The parameter number of the MI-Trigger model is much less than that of word bigram model.</Paragraph>
    <Paragraph position="11"> One of the most powerful abilities of a person is to properly combine different kinds of knowledge. This also applies to PYCC. The word bigram model and the MI-Trigger model are merged by linear interpolation as follows: log P_MERGED(S) = (1 - a) * log P_Bigram(S) + a * log P_MI-Trigger(S) (7), where S = w_1^n = w_1 w_2 ... w_n and a is the weight of the MI-Trigger model. Here the DD-6-MI-Trigger model with 5M trigger pairs is applied. The result is shown in Table 7.</Paragraph>
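Equation (7) is a one-line computation once both models have scored a sentence. A minimal sketch, with placeholder log-probabilities standing in for the two models' actual sentence scores:

```python
def merged_logp(logp_bigram, logp_mi_trigger, alpha):
    """Eq. (7): linearly interpolate two sentence log-probabilities.

    alpha weights the MI-Trigger model; (1 - alpha) weights the word bigram model.
    """
    return (1.0 - alpha) * logp_bigram + alpha * logp_mi_trigger

# With an N-gram weight of 0.3 (i.e. alpha = 0.7), as used in Table 7:
# 0.3 * (-10.0) + 0.7 * (-8.0) = -8.6
print(merged_logp(-10.0, -8.0, 0.7))
```

In a full PYCC decoder this merged score would replace the single-model score at each step of the lattice search.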
    <Paragraph position="12"> Table 7 shows that the recognition rate reaches up to 98.7% when the N-gram weight is 0.3 and the MI-Trigger weight is 0.7. Through the experiments, it has been shown that the merged model has better results than both the word bigram and MI-Trigger models. Compared to the pure word bigram model, the merged model also captures the long-distance dependency of word pairs using the concept of mutual information. Compared to the MI-Trigger model, which only captures highly correlated word pairs, the merged model also captures poorly correlated word pairs within a short distance by using the word bigram model.</Paragraph>
    <Paragraph position="13"> Conclusion: This paper proposes a new MI-Trigger-based modeling approach that captures the preferred relationships between words by using the concept of the trigger pair. Both distance-independent (DI) and distance-dependent (DD) MI-Trigger-based models are constructed within a window. It is found that * The long-distance dependency is useful for language disambiguation and should be modeled properly in natural language processing.</Paragraph>
    <Paragraph position="14">  * The DD MI-Trigger models have better performance than the DI MI-Trigger models for the same window size.</Paragraph>
    <Paragraph position="15"> * The number of trigger pairs in an MI-Trigger model can be kept to a reasonable size without losing too much modeling power. * MI-Trigger-based language modeling has better performance than the word bigram model, while the number of parameters in the MI-Trigger model is much smaller than in the word bigram model. The PINYIN-to-Character conversion rate reaches up to 97.7% using the MI-Trigger model, and further reaches 98.7% with proper word bigram and MI-Trigger merging.</Paragraph>
  </Section>
</Paper>