<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1032">
  <Title>Exploiting Headword Dependency and Predictive Clustering for Language Modeling</Title>
  <Section position="8" start_page="7" end_page="7" type="evalu">
    <SectionTitle>
6 Results and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.1 Impact of headword dependency and predictive clustering
</SectionTitle>
      <Paragraph position="0"> predictive clustering We applied a series of language models proposed in this paper to the Japanese Kana-Kanji conversion task in order to test the effectiveness of our techniques. The results are shown in Table 2. The baseline result was obtained by using a conventional word trigram model. HTM stands for the headword trigram model of Equation (6) and (7) without permutation (i.e., l  =1), while PHTM is the model with permutation. The T- and U-prefixes refer to the models using trigram (Equation (6)) or unigram (Equation (7)) estimate for word category probability. The C-prefix, as in C-PHTM, refers to PHTM with predictive clustering (Equation (10)). For comparison, we also include in Table 2 the results of using the predictive clustering model without taking word category into account, referred to as predictive clustering trigram model (PCTM). In PCTM, the probability for all words is estimated  In Table 2, we find that for both PHTM and HTM, models U-HTM and U-PHTM achieve better performance than models T-HTM and T-PHTM.</Paragraph>
      <Paragraph position="1"> Therefore, only models using unigram for category probability estimation are used for further experiments, including the models with predictive clustering.</Paragraph>
      <Paragraph position="2"> By comparing U-HTM with the baseline model, we can see that the headword trigram contributes greatly to the CER reduction: U-HTM outperformed the baseline model by about 8.6% in error rate reduction. HTM with headword permutation (U-PHTM) achieves further improvements of 10.5% CER reduction against the baseline. The contribution of predictive clustering is also very encouraging. Using predictive clustering alone (PCTM), we reduced the error rate by 7.8%. What is particularly noteworthy is that the combination of both techniques leads to even larger improvements: for both HTM and PHTM, predictive clustering (C-HTM and C-PHTM) brings consistent improvements over the models without clustering, achieving the CER reduction of 13.4% and 15.0% respectively against the baseline model, or 4.8% and 4.5% against the models without clustering.</Paragraph>
      <Paragraph position="3"> In sum, considering the good performance of our baseline system and the upper bound on performance improvement due to the 100-best list as shown in Table 1, the improvements we obtained are very promising. These results demonstrate that the simple method of using headword trigrams and predictive clustering can be used to effectively improve the performance of word trigram models.</Paragraph>
    </Section>
    <Section position="2" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.2 Comparison with other models
</SectionTitle>
      <Paragraph position="0"> In this subsection, we present a comparison of our models with some of the previously proposed models, including the higher-order n-gram models, skipping models, and the ATR models.</Paragraph>
      <Paragraph position="1"> Higher-order n-gram models refer to those n-gram models in which n&gt;3. Although most of the previous research showed little improvement, Goodman (2001) showed recently that, with a large amount of training data and sophisticated smoothing techniques, higher-order n-gram models could be superior to trigram models.</Paragraph>
      <Paragraph position="2"> The headword trigram model proposed in this paper can be thought of as a variation of a higher order n-gram model, in that the headword trigrams capture longer distance dependencies than trigram models. In order to see how far the dependency goes within our headword trigram models, we plotted the distribution of headword trigrams (y-axis) against the n of the word n-gram were it to be captured by the word n-gram (x-axis) in Figure 2. For example, given a word sequence w  are headwords, then the headword trigram  against the n of word n-gram From Figure 2, we can see that approximately 95% of the headword trigrams can be captured by the higher-order n-gram model with the value of n smaller than 7. Based on this observation, we built word n-gram models with the values of n=4, 5 and 6. For all n-gram models, we used the interpolated modified absolute discount smoothing method (Gao et al., 2001), which, in our experiments, achieved the best performance among the state-of-the-art smoothing techniques. Results showed that the performance of the higher-order word n-gram models becomes saturated quickly as n grows: the best performance was achieved by the word 5-gram model, with the CER of 3.71%. Following Goodman (2001), we suspect that the poor performance of these models is due to the data sparseness problem.</Paragraph>
      <Paragraph position="3"> Skipping models are an extension of an n-gram model in that they predict words based on n conditioning words, except that these conditioning words may not be adjacent to the predicted word. For instance, instead of computing P(w  ). Goodman (2001) performed experiments of interpolating various kinds of higher-order n-gram skipping models, and obtained a very limited gain. Our results confirm his results and suggest that simply extending the context window by brute-force can achieve little improvement, while the use of even the most modest form of structural information such as the identification of headwords and automatic clustering can help improve the performance.</Paragraph>
      <Paragraph position="4"> We also compared our models with the trigram version of the ATR models discussed in Section 4, in which the probability of a word is conditioned by the preceding content and function word pair. We performed experiments using the ATR models as described in Isotani and Matsunaga (1994). The results show that the CER of the ATR model alone is much higher than that of the baseline model, but when interpolated with a word trigram model, the CER is slightly reduced by 1.6% from 3.73% to 3.67%. These results are consistent with those reported in previous work. The difference between the ATR model and our models indicates that the predictions of headwords and function words can better be done separately, as they play different semantic and syntactic roles capturing different dependency structure.</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
6.3 Discussion
</SectionTitle>
      <Paragraph position="0"> In order to better understand the effect of the headword trigram, we have manually inspected the actual improvements given by PHTM. As expected, many of the improvements seem to be due to the use of larger context: for example, the headword trigramXiao Fei ~Zhi Chu ~Jian Shao (shouhi 'consume' ~ shishutsu 'expense' ~ genshou 'decrease') contributed to the correct conversion of the phonetic string gensiyou genshou into Jian Shao genshou 'decrease' rather than Xian Xiang genshou 'phenomenon' in the context of Xiao Fei Zhi Chu Chu metenoJian Shao shouhi shishutsu hajimete no genshou 'consumer spending decreases for the first time'.</Paragraph>
      <Paragraph position="1"> On the other hand, the use of headword trigrams and predictive clustering is not without side effects. The overall gain in CER was 15% as we have seen above, but a closer inspection of the conversion results reveals that while C-PHTM corrected the conversion errors of the baseline model in 389 sentences (8%), it also introduced new conversion errors in 201 sentences (4.1%). Among the newly introduced errors, one type of error is particularly worth noting: these are the errors where the candidate conversion preferred by the HTM is grammatically impossible or unlikely. For example, Mi Guo niQin Gong dekiru(beikoku-ni shinkou-dekiru, USA-to invade-can 'can invade USA') was misconverted as Mi Guo niXin Xing dekiru(beikoku-ni shinkou-dekiru, USA-to new-can), even though Qin Gong shinkou 'invade' is far more likely to be preceded by the morpheme ni ni 'to', and Xin Xing shinkou 'new' practically does not precede dekiru dekiru 'can'.</Paragraph>
      <Paragraph position="2"> The HTM does not take these function words into account, leading to a grammatically impossible or implausible conversion. Finding the types of errors introduced by particular modeling assumptions in this manner and addressing them individually will be the next step for further improvements in the conversion task.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>