<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1048"> <Title>Word Sense Disambiguation vs. Statistical Machine Translation</Title> <Section position="3" start_page="387" end_page="388" type="metho"> <SectionTitle> 2 Word sense disambiguation vs. </SectionTitle> <Paragraph position="0"> statistical machine translation We begin by examining those respective strengths and weaknesses of dedicated WSD models versus full SMT models that could be expected to be relevant to improving lexical choice.</Paragraph> <Section position="1" start_page="387" end_page="387" type="sub_section"> <SectionTitle> 2.1 Features Unique to WSD </SectionTitle> <Paragraph position="0"> Dedicated WSD is typically cast as a classification task with a predefined sense inventory. Sense distinctions and granularity are often manually predefined, which means that they can be adapted to the task at hand, but also that the translation candidates are limited to an existing set.</Paragraph> <Paragraph position="1"> To improve accuracy, dedicated WSD models typically employ features that are not limited to the local context, and that include more linguistic information than the surface form of words. This often requires several stages of preprocessing, such as part-of-speech tagging and/or parsing. (Preprocessor domain can be an issue, since WSD accuracy may suffer from domain mismatches between the data the preprocessors were trained on and the data they are applied to.) For example, a typical dedicated WSD model might employ features as described by Yarowsky and Florian (2002) in their &quot;feature-enhanced naive Bayes model&quot;, with position-sensitive, syntactic, and local collocational features. The feature set made available to the WSD model to predict lexical choices is therefore much richer than that used by a statistical MT model.</Paragraph> <Paragraph position="2"> Also, dedicated WSD models can be supervised, which yields significantly higher accuracies than unsupervised models. 
For the experiments described in this study, we employed supervised training, exploiting the annotated corpus that was produced for the Senseval-3 evaluation.</Paragraph> </Section> <Section position="2" start_page="387" end_page="388" type="sub_section"> <SectionTitle> 2.2 Features Unique to SMT </SectionTitle> <Paragraph position="0"> Unlike lexical sample WSD models, SMT models simultaneously translate complete sentences rather than isolated target words. The lexical choices are made in a way that heavily prefers phrasal cohesion in the output target sentence, as scored by the language model. That is, the predictions benefit from the sentential context of the target language. This has the general effect of improving translation fluency. The WSD accuracy of the SMT model depends critically on the phrasal cohesion of the target language. As we shall see, this phrasal cohesion property has strong implications for the utility of WSD. In other work (forthcoming), we investigated the inverse question of evaluating the Chinese-to-English SMT model on word sense disambiguation performance, using standard WSD evaluation methodology and datasets from the Senseval-3 Chinese lexical sample task. We showed the accuracy of the SMT model to be significantly lower than that of all the dedicated WSD models considered, even after adding the lexical sample data to the training set for SMT to allow for a fair comparison. 
These results highlight the relative strength, and the potential hoped-for advantage, of dedicated supervised WSD models.</Paragraph> </Section> </Section> <Section position="4" start_page="388" end_page="388" type="metho"> <SectionTitle> 3 The WSD system </SectionTitle> <Paragraph position="0"> The WSD system used for the experiments is based on the model that achieved the best performance, by a large margin, on the Senseval-3 Chinese lexical sample task (Carpuat et al., 2004).</Paragraph> <Section position="1" start_page="388" end_page="388" type="sub_section"> <SectionTitle> 3.1 Classification model </SectionTitle> <Paragraph position="0"> The model consists of an ensemble of four voting models combined by majority vote.</Paragraph> <Paragraph position="1"> The first voting model is a naive Bayes model, since Yarowsky and Florian (2002) found this model to be the most accurate classifier in a comparative study on a subset of Senseval-2 English lexical sample data.</Paragraph> <Paragraph position="2"> The second voting model is a maximum entropy model (Jaynes, 1978), since Klein and Manning (2002) found that this model yielded higher accuracy than naive Bayes in a subsequent comparison of WSD performance. (Note, however, that a different subset of either Senseval-1 or Senseval-2 English lexical sample data was used for their comparison.) The third voting model is a boosting model (Freund and Schapire, 1997), since boosting has consistently turned in very competitive scores on related tasks such as named entity classification (Carreras et al., 2002). Specifically, an AdaBoost.MH model was used (Schapire and Singer, 2000), which is a multi-class generalization of the original boosting algorithm, with boosting on top of decision stump classifiers (i.e., decision trees of depth one).</Paragraph> <Paragraph position="3"> The fourth voting model is a Kernel PCA-based model (Wu et al., 2004). 
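The majority-vote combination described above can be sketched in a few lines. This is an illustrative sketch only (hypothetical sense IDs; the paper does not specify a tie-breaking rule, so ties here default to the first-listed classifier's vote):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier sense predictions by majority vote.

    predictions: one predicted sense label per voting model.
    Tie-breaking in favor of the earliest-listed classifier is our
    assumption; the paper does not specify how ties are resolved.
    """
    counts = Counter(predictions)
    best, best_count = predictions[0], counts[predictions[0]]
    for label, count in counts.items():
        if count > best_count:
            best, best_count = label, count
    return best

# Four hypothetical voters (naive Bayes, maxent, boosting, KPCA)
# predicting a HowNet sense ID for one test instance:
print(majority_vote(["56525", "56525", "56530", "56525"]))  # → 56525
```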
Kernel Principal Component Analysis (KPCA) is a nonlinear kernel method for extracting nonlinear principal components from vector sets where, conceptually, the n-dimensional input vectors are nonlinearly mapped from their original space R^n to a high-dimensional feature space F where linear PCA is performed, yielding a transform by which the input vectors can be mapped nonlinearly to a new set of vectors (Schölkopf et al., 1998). WSD can be performed by a nearest neighbor classifier in the high-dimensional KPCA feature space. Carpuat et al. (2004) showed that KPCA-based WSD models achieve accuracies close to those of the best individual WSD models, while having a significantly different bias.</Paragraph> <Paragraph position="4"> All these classifiers have the ability to handle large numbers of sparse features, many of which may be irrelevant. Moreover, the maximum entropy and boosting models are known to be well suited to handling features that are highly interdependent.</Paragraph> <Paragraph position="5"> The feature set used consists of position-sensitive, syntactic, and local collocational features, as described by Yarowsky and Florian (2002).</Paragraph> </Section> <Section position="2" start_page="388" end_page="388" type="sub_section"> <SectionTitle> 3.2 Lexical choice mapping model </SectionTitle> <Paragraph position="0"> Ideally, we would like the WSD model to predict English translations given Chinese target words in context. Such a model requires Chinese training data annotated with English senses, but such data is not available. Instead, the WSD system was trained using the Senseval-3 Chinese lexical sample task data.</Paragraph> <Paragraph position="1"> (This is suboptimal, but reflects the difficulties that arise when considering a real translation task; we cannot assume that sense-annotated data will always be available for all language pairs.) The Chinese lexical sample task includes 20 target words. 
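Returning briefly to the KPCA-based classifier of Section 3.1, the following NumPy sketch illustrates the idea on toy vectors. It is a simplification under stated assumptions (an RBF kernel, toy data, and no re-centering of the test kernel matrix), not the authors' implementation:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kpca_fit(X, n_components=2, gamma=1.0):
    # Center the training kernel matrix in feature space, then
    # eigendecompose it to obtain the nonlinear principal directions.
    n = len(X)
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    return alphas, X, gamma

def kpca_transform(model, Y):
    # Project new vectors onto the KPCA components
    # (test-kernel re-centering omitted for brevity).
    alphas, X, gamma = model
    return rbf_kernel(Y, X, gamma) @ alphas

def nn_predict(train_Z, train_labels, test_Z):
    # 1-nearest-neighbor classification in the KPCA feature space.
    return [train_labels[int(np.argmin(((train_Z - z) ** 2).sum(-1)))]
            for z in test_Z]
```

On two well-separated toy clusters labeled with two senses, fitting KPCA on the training vectors and classifying held-out points by nearest neighbor in the projected space recovers the cluster labels.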
For each word, several senses are defined using the HowNet knowledge base. There are on average 3.95 senses per target word type, ranging from 2 to 8. Only about 37 training instances per target word are available.</Paragraph> <Paragraph position="2"> For the purpose of Chinese to English translation, the WSD model should predict English translations instead of HowNet senses. Fortunately, HowNet provides English glosses. This allows us to map each HowNet sense candidate to a set of English translations, converting the monolingual Chinese WSD system into a translation lexical choice model.</Paragraph> <Paragraph position="3"> We further extended the mapping to include any significant translation choice considered by the SMT system but not in HowNet.</Paragraph> </Section> </Section> <Section position="5" start_page="388" end_page="389" type="metho"> <SectionTitle> 4 The SMT system </SectionTitle> <Paragraph position="0"> To build a representative baseline statistical machine translation system, we restricted ourselves to making use of freely available tools, since the potential contribution of WSD should be easier to see against this baseline. Note that our focus here is not on the SMT model itself; our aim is to evaluate the impact of WSD on a real Chinese to English statistical machine translation task.</Paragraph> <Paragraph position="1"> Example sense-to-translation mapping (HowNet sense IDs | HowNet glosses | expanded translation set):
56525, 56526, 56527, 56528 | path, road, route, way | path, road, route, way, circuit, roads
56530, 56531, 56532 | line, means, sequence | line, means, sequence, lines
56533, 56534 | district, region | district, region</Paragraph> <Section position="1" start_page="389" end_page="389" type="sub_section"> <SectionTitle> 4.1 Alignment model </SectionTitle> <Paragraph position="0"> The alignment model was trained with GIZA++ (Och and Ney, 2003), which implements the most typical IBM and HMM alignment models. 
Translation quality could be improved using more advanced hybrid phrasal or tree models, but this would interfere with the questions being investigated here. The alignment model used is IBM-4, as required by our decoder. The training scheme consists of IBM-1, HMM, IBM-3 and IBM-4, following Och and Ney (2003).</Paragraph> <Paragraph position="1"> The training corpus consists of about 1 million sentences from the United Nations Chinese-English parallel corpus from LDC. This corpus was automatically sentence-aligned, so the training data does not require as much manual annotation as for the WSD model.</Paragraph> </Section> <Section position="2" start_page="389" end_page="389" type="sub_section"> <SectionTitle> 4.2 Language model </SectionTitle> <Paragraph position="0"> The English language model is a trigram model trained on the Gigaword newswire data and on the English side of the UN and Xinhua parallel corpora.</Paragraph> <Paragraph position="1"> The language model is also trained using publicly available software, the CMU-Cambridge Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997).</Paragraph> </Section> <Section position="3" start_page="389" end_page="389" type="sub_section"> <SectionTitle> 4.3 Decoding </SectionTitle> <Paragraph position="0"> The ISI ReWrite decoder (Germann, 2003), which implements an efficient greedy decoding algorithm, is used to translate the Chinese sentences, using the alignment model and language model previously described. Notice that very little contextual information is available to the SMT models. 
Lexical choice during decoding essentially depends on the translation probabilities learned for the target word, and on the English language model scores.</Paragraph> </Section> </Section> <Section position="6" start_page="389" end_page="390" type="metho"> <SectionTitle> 5 Experimental method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="389" end_page="389" type="sub_section"> <SectionTitle> 5.1 Test set selection </SectionTitle> <Paragraph position="0"> We extracted the Chinese sentences from the NIST MTEval-04 test set that contain any of the 20 target words from the Senseval-3 Chinese lexical sample target set. For a couple of targets, no instances were available from the test set. The resulting test set contains a total of 175 sentences, which is smaller than typical MT evaluation test sets, but slightly larger than the one used for the Senseval Chinese lexical sample task.</Paragraph> </Section> <Section position="2" start_page="389" end_page="390" type="sub_section"> <SectionTitle> 5.2 Integrating the WSD system predictions </SectionTitle> <Paragraph position="0"> with the SMT model There are numerous possible ways to integrate the WSD system predictions with the SMT model. We choose two different straightforward approaches, which will help analyze the effect of the different components of the SMT system, as we will see in Section 6.5.</Paragraph> <Paragraph position="1"> 5.2.1 Using WSD predictions for decoding In the first approach, we use the WSD sense predictions to constrain the set of English sense candidates considered by the decoder for each of the target words. 
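The candidate restriction used in this first approach can be sketched as follows (hypothetical sense IDs, glosses, and probabilities; the fallback to the full candidate set when the intersection is empty is our own robustness assumption, not a detail from the paper):

```python
# Hypothetical mapping from HowNet sense IDs to their English glosses:
SENSE_GLOSSES = {
    "56525": {"path", "road", "route", "way"},
    "56530": {"line", "means", "sequence"},
}

def constrain_candidates(tm_candidates, predicted_sense):
    """Keep only translation-model candidates that belong to the gloss
    set of the WSD-predicted sense; fall back to the full candidate set
    if nothing survives, so that decoding can always proceed."""
    allowed = SENSE_GLOSSES.get(predicted_sense, set())
    constrained = {w: p for w, p in tm_candidates.items() if w in allowed}
    return constrained or dict(tm_candidates)

tm = {"road": 0.5, "line": 0.3, "circuit": 0.2}
print(constrain_candidates(tm, "56525"))  # → {'road': 0.5}
```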
Instead of allowing all the word translation candidates from the translation model, when we use the WSD predictions we override the translation model and force the decoder to choose the best translation from the predefined set of glosses that maps to the HowNet sense predicted by the WSD model.</Paragraph> <Paragraph position="2"> In the second approach, we use the WSD predictions to postprocess the output of the SMT system: in each output sentence, the translation of the target word chosen by the SMT model is directly replaced by the WSD prediction. When the WSD system predicts more than one candidate, a unique translation is randomly chosen among them. As discussed later, this approach can be used to analyze the effect of the language model on the output.</Paragraph> <Paragraph position="3"> It would also be interesting to use the gold standard or correct sense of the target words instead of the WSD model predictions in these experiments.</Paragraph> <Paragraph position="4"> This would give an upper bound on performance and would quantify the effect of WSD errors. However, we do not have a corpus which contains both sense annotation and multiple reference translations: the MT evaluation corpus is not annotated with the correct senses of Senseval target words, and the Senseval corpus does not include English translations of the sentences.</Paragraph> </Section> </Section> <Section position="7" start_page="390" end_page="392" type="metho"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> We first compare the translation quality obtained with and without the WSD model. 
Using our WSD model to constrain the translation candidates given to the decoder hurts translation quality, as measured by the automated BLEU metric (Papineni et al., 2002).</Paragraph> <Paragraph position="1"> Note that we are evaluating on only difficult sentences containing the problematic target words from the lexical sample task, so BLEU scores can be expected to be on the low side.</Paragraph> <Section position="1" start_page="390" end_page="390" type="sub_section"> <SectionTitle> 6.2 WSD still does not help BLEU score with </SectionTitle> <Paragraph position="0"> improved translation candidates One could argue that the translation candidates chosen by the WSD models do not help because they are only glosses obtained from the HowNet dictionary. They consist of the root form of words only, while the SMT model can learn many more translations for each target word, including inflected forms and synonyms.</Paragraph> <Paragraph position="1"> In order to avoid artificially penalizing the WSD system by limiting its translation candidates to the HowNet glosses, we expand the translation set using the bilexicon learned during translation model training. For each target word, we consider the English words that are given a high translation probability, and manually map each of these English words to the sense categories defined for the Senseval model. At decoding time, the set of translation candidates considered by the language model is therefore larger, and closer to that considered by the pure SMT system.</Paragraph> <Paragraph position="2"> The results in Table 2 show that the improved translation candidates do not help BLEU score. 
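The candidate-expansion step just described can be sketched like this (hypothetical words, probabilities, and threshold; in the paper the mapping from bilexicon entries to sense categories was done manually):

```python
def expand_sense_candidates(sense_glosses, bilexicon, word_to_sense, threshold=0.01):
    """Add high-probability words from the learned bilexicon to each
    sense's candidate set, using a (manually built) word-to-sense map.
    The probability threshold is an illustrative assumption."""
    expanded = {sense: set(glosses) for sense, glosses in sense_glosses.items()}
    for word, prob in bilexicon.items():
        if prob >= threshold and word in word_to_sense:
            expanded[word_to_sense[word]].add(word)
    return expanded

glosses = {"56525": {"path", "road"}}
bilex = {"roads": 0.05, "circuit": 0.02, "rarely": 0.001}  # p(word | target)
mapping = {"roads": "56525", "circuit": "56525"}
print(sorted(expand_sense_candidates(glosses, bilex, mapping)["56525"]))
# → ['circuit', 'path', 'road', 'roads']
```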
The translation quality obtained with SMT alone is still better than when the improved WSD model is used.</Paragraph> <Paragraph position="3"> The simpler approach of using WSD predictions in postprocessing yields a better BLEU score than the decoding approach, but still does not outperform the SMT model.</Paragraph> </Section> <Section position="2" start_page="390" end_page="391" type="sub_section"> <SectionTitle> 6.3 WSD helps translation quality for very few </SectionTitle> <Paragraph position="0"> target words If we break down the test set and evaluate the effect of the WSD per target word, we find that for all but two of the target words WSD either hurts the BLEU score or does not help it, which shows that the decrease in BLEU is not due only to a few isolated target words for which the Senseval sense distinctions are not helpful.</Paragraph> </Section> <Section position="3" start_page="391" end_page="391" type="sub_section"> <SectionTitle> 6.4 The &quot;language model effect&quot; </SectionTitle> <Paragraph position="0"> Error analysis revealed some surprising effects. One particularly dismaying effect is that even in cases where the WSD model predicts a better target word translation than the SMT model, using the better translation surprisingly often still leads to a lower BLEU score.</Paragraph> <Paragraph position="1"> The phrasal cohesion property helps explain this surprising effect. The translation chosen by the SMT model will tend to be more likely than the WSD prediction according to the language model; otherwise, it would also have been predicted by SMT. 
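A toy numeric illustration of this effect, using made-up bigram probabilities to stand in for the trigram language model (all numbers and the back-off score are our own assumptions):

```python
import math

# Toy bigram log-probabilities illustrating the "language model effect":
# the LM rewards a lexical choice it has actually seen inside phrases.
LOGPROB = {
    ("constitution", "shocks"): math.log(0.010),   # seen in training text
    ("constitution", "impact"): math.log(0.0001),  # adequate but rarely seen
}
UNSEEN = math.log(1e-6)  # back-off score for unseen bigrams (an assumption)

def lm_score(bigrams):
    """Sum the log-probabilities of a sentence's bigrams under the toy LM."""
    return sum(LOGPROB.get(bg, UNSEEN) for bg in bigrams)

# The fluent-but-less-adequate choice outscores the adequate-but-unseen one:
print(lm_score([("constitution", "shocks")]) >
      lm_score([("constitution", "impact")]))  # → True
```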
The translation with the higher language model probability influences the translation of its neighbors, thus potentially improving BLEU score, while the WSD prediction may not have been seen occurring within phrases often enough, thereby lowering BLEU score.</Paragraph> <Paragraph position="2"> For example, we observe that the WSD model sometimes correctly predicts &quot;impact&quot; as a better translation for &quot;冲击&quot; (chongji), where the SMT model selects &quot;shock&quot;. In these cases, some of the reference translations also use &quot;impact&quot;. However, even when the WSD model constrains the decoder to select &quot;impact&quot; rather than &quot;shock&quot;, the resulting sentence translation yields a lower BLEU score. This happens because the SMT model does not know how to use &quot;impact&quot; correctly (if it did, it would likely have chosen &quot;impact&quot; itself). Forcing the lexical choice &quot;impact&quot; simply causes the SMT model to generate phrases such as &quot;against Japan for peace constitution impact&quot; instead of &quot;against Japan for peace constitution shocks&quot;. This actually lowers the BLEU score because of n-gram effects.
This is indeed the case, but for very few examples only: for instance, the target word &quot;地方&quot; (difang) is better used in the integrated decoding output &quot;the place of local employment&quot; than in the postprocessing output &quot;the place employment situation&quot;. Instead, the majority of cases follow the pattern illustrated by the following example, where the target word is &quot;老&quot; (lao): the SMT system produces the best output (&quot;the newly elected President will still face old problems&quot;), the postprocessed output keeps the fluent sentence but with a different translation (&quot;the newly elected President will still face outdated problems&quot;), while the translation is not used correctly by the decoding approach (&quot;the newly elected President will face problems still to be outdated&quot;).</Paragraph> </Section> <Section position="5" start_page="391" end_page="392" type="sub_section"> <SectionTitle> 6.6 BLEU score bias </SectionTitle> <Paragraph position="0"> The &quot;language model effect&quot; highlights one of the potential weaknesses of the BLEU score. BLEU penalizes phrasal incoherence, which in the present study means that it can sometimes sacrifice adequacy for fluency.</Paragraph> <Paragraph position="1"> However, the characteristics of BLEU are by no means solely responsible for the problems with WSD that we observed. To double-check that n-gram effects were not unduly impacting our study, we also evaluated using BLEU-1, which gave largely similar results to the standard BLEU-4 scores reported above.</Paragraph> </Section> </Section> <Section position="8" start_page="392" end_page="392" type="metho"> <SectionTitle> 7 Related work </SectionTitle> <Paragraph position="0"> Most translation disambiguation tasks are defined similarly to the Senseval Multilingual lexical sample tasks. 
In Senseval-3, the English to Hindi translation disambiguation task was defined identically to the English lexical sample task, except that the WSD models are expected to predict Hindi translations instead of WordNet senses. This differs from our approach, which consists of producing the translation of complete sentences, and not only of a predefined set of target words.</Paragraph> <Paragraph position="1"> Brown et al. (1991) proposed a WSD algorithm to disambiguate English translations of French target words based on the single most informative context feature. In a pilot study, they found that using this WSD method in their French-English SMT system helped translation quality, manually evaluated using the number of acceptable translations. However, this study is limited to the unrealistic case of words that have exactly two senses in the other language.</Paragraph> <Paragraph position="2"> Most previous work has focused on the distinct problem of exploiting various bilingual resources (e.g., parallel or comparable corpora, or even MT systems) to help WSD. The goal is to achieve accurate WSD with minimal amounts of annotated data.</Paragraph> <Paragraph position="3"> Again, this differs from our objective, which consists of using WSD to improve performance on a full machine translation task, and is measured in terms of translation quality.</Paragraph> <Paragraph position="4"> For instance, Ng et al. (2003) showed that it is possible to use word-aligned parallel corpora to train accurate supervised WSD models. 
The objective is different; it is not possible for us to use this method to train our WSD model without undermining the question we aim to investigate: we would need to use the SMT model to word-align the parallel sentences, which could too strongly bias the predictions of the WSD model towards those of the SMT model, instead of combining predictive information from independent sources as we aim to study here.</Paragraph> <Paragraph position="5"> Other work includes Li and Li (2002), who proposed a bilingual bootstrapping method to learn a translation disambiguation WSD model, and Diab (2004), who exploited large amounts of automatically generated noisy parallel data to learn WSD models in an unsupervised bootstrapping scheme.</Paragraph> </Section> </Paper>