<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2021">
  <Title>Evaluating the Word Sense Disambiguation Performance of Statistical Machine Translation</Title>
  <Section position="3" start_page="120" end_page="120" type="metho">
    <SectionTitle>
2 Statistical machine translation vs. word sense disambiguation
</SectionTitle>
    <Paragraph position="0"> We begin by examining the respective strengths and weaknesses of full SMT models versus dedicated WSD models, which might be expected to be relevant to the WSD task.</Paragraph>
    <Section position="1" start_page="120" end_page="120" type="sub_section">
      <SectionTitle>
2.1 Features unique to SMT
</SectionTitle>
      <Paragraph position="0"> Unlike lexical sample WSD models, SMT models simultaneously translate complete sentences rather than isolated target words. The lexical choices are made in a way that heavily prefers phrasal cohesion in the output target sentence, as scored by the language model.</Paragraph>
      <Paragraph position="1"> That is, the predictions benefit from the sentential context of the target language. This has the general effect of improving translation fluency.</Paragraph>
      <Paragraph position="2"> Another major difference with most lexical sample WSD models is that SMT models are always unsupervised. SMT models learn from large sets of bisentences but the correct word alignment between the two sentences is unknown. SMT models cannot therefore directly exploit sense-annotated data, or at least not as easily as classification-based WSD models do.</Paragraph>
    </Section>
    <Section position="2" start_page="120" end_page="120" type="sub_section">
      <SectionTitle>
2.2 Features unique to WSD
</SectionTitle>
      <Paragraph position="0"> Dedicated WSD is typically cast as a classification task with a predefined sense inventory. Sense distinctions and granularity are often manually predefined, which means that they can be adapted to the task at hand, but also that the translation candidates are limited to an existing set.</Paragraph>
      <Paragraph position="1"> To improve accuracy, dedicated WSD models typically employ features that are not limited to the local context, and that include more linguistic information than the surface form of words. For example, a typical dedicated WSD model might employ features as described by Yarowsky and Florian (2002) in their &quot;feature-enhanced naive Bayes model&quot;, with position-sensitive, syntactic, and local collocational features. The feature set made available to the WSD model to predict lexical choices is therefore much richer than that used by a statistical MT model.</Paragraph>
      <Paragraph position="2"> Also, dedicated WSD models can be supervised, which yields significantly higher accuracies than unsupervised training. For the experiments described in this study we employed supervised training, exploiting the annotated corpus that was produced for the Senseval-3 evaluation.</Paragraph>
      <Paragraph position="3"> Again, this brief comparison shows that both models have important and very different strengths, which motivates our controlled empirical comparison of their WSD performance.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="120" end_page="121" type="metho">
    <SectionTitle>
3 The SMT system
</SectionTitle>
    <Paragraph position="0"> To build a representative baseline SMT system, we restricted ourselves to making use of freely available tools.</Paragraph>
    <Section position="1" start_page="120" end_page="121" type="sub_section">
      <SectionTitle>
3.1 Alignment model
</SectionTitle>
      <Paragraph position="0"> The alignment model was trained with GIZA++ (Och and Ney, 2003), which implements the most typical IBM and HMM alignment models. Translation quality could be improved using more advanced hybrid phrasal or tree models, but this would interfere with the questions being investigated here. The alignment model used is IBM-4, as required by our decoder. The training scheme is IBM-1, HMM, IBM-3 and IBM-4, as specified in (Och and Ney, 2003).</Paragraph>
      <Paragraph position="1"> The training corpus consists of about 1 million sentences from the United Nations Chinese-English parallel corpus from LDC. This corpus was automatically sentence-aligned, so the training data does not require as much manual annotation as for the WSD model.</Paragraph>
    </Section>
    <Section position="2" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
3.2 Language model
</SectionTitle>
      <Paragraph position="0"> The English language model is a trigram model trained on the Gigaword newswire data and on the English side of the UN and Xinhua parallel corpora. The language model is also trained using publicly available software, the CMU-Cambridge Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997).</Paragraph>
    </Section>
    <Section position="3" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
3.3 Decoding
</SectionTitle>
      <Paragraph position="0"> The ISI ReWrite decoder (Germann, 2003), which implements an efficient greedy decoding algorithm, is used to translate the Chinese sentences, using the alignment model and language model previously described. Notice that very little contextual information is available to the IBM SMT models. Lexical choice during decoding essentially depends on the translation probabilities learned for the target word, and on the English language model scores.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="121" end_page="121" type="metho">
    <SectionTitle>
4 The WSD system
</SectionTitle>
    <Paragraph position="0"> The WSD system used here is based on the model that achieved the best performance on the Senseval-3 Chinese lexical sample task, outperforming other systems by a large margin (Carpuat et al., 2004).</Paragraph>
    <Paragraph position="1"> The model consists of an ensemble of four highly accurate classifiers combined by majority vote: a naive Bayes classifier, a maximum entropy model (Jaynes, 1978), a boosting model (Freund and Schapire, 1997), and a Kernel PCA-based model (Wu et al., 2004), which has the advantage of having a significantly different bias. All these classifiers have the ability to handle large numbers of sparse features, many of which may be irrelevant. Moreover, the maximum entropy and boosting models are known to be well suited to handling features that are highly interdependent.</Paragraph>
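    <Paragraph> The majority-vote combination can be illustrated with a minimal sketch. The function name, the example votes, and the tie-breaking rule (prefer the earliest classifier in the list) are assumptions made for illustration; the text does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the sense predicted by the largest number of classifiers.

    Ties are broken in favor of the earliest classifier in the list.
    This tie-breaking rule is an assumption, not taken from the paper.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for sense in predictions:
        if counts[sense] == top:
            return sense


# Hypothetical outputs of the four component classifiers for one instance,
# in the order: naive Bayes, maximum entropy, boosting, Kernel PCA.
votes = ["act", "act", "move/exercise", "act"]
print(majority_vote(votes))  # prints "act"
```

    With four voters, two-against-two splits are possible, so any concrete implementation needs some explicit tie-breaking policy.</Paragraph>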
    <Paragraph position="2"> The feature set used consists of position-sensitive, syntactic, and local collocational features, as described by Yarowsky and Florian (2002).</Paragraph>
  </Section>
  <Section position="6" start_page="121" end_page="121" type="metho">
    <SectionTitle>
5 Experimental method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
5.1 Senseval-3 Chinese lexical sample task
</SectionTitle>
      <Paragraph position="0"> The Senseval-3 Chinese lexical sample task includes 20 target word types. For each word type, several senses are defined using the HowNet knowledge base.</Paragraph>
      <Paragraph position="1"> There are an average of 3.95 senses per target word type, ranging from 2 to 8. Only about 37 training instances per target word are available.</Paragraph>
      <Paragraph position="2"> The dedicated WSD models described in Section 4 are trained to predict HowNet senses for a set of new occurrences of the target word in context.</Paragraph>
      <Paragraph position="3"> We use the SMT system described in Section 3 to translate the Chinese sentences of the Senseval evaluation test set, and extract the translation chosen for each of the target word occurrences. In order to evaluate the predictions of the SMT model just like any WSD model, we need to map the English translations to HowNet senses. This mapping is done using HowNet, which provides English glosses for each of the senses of every Chinese word.</Paragraph>
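      <Paragraph> The mapping from an English translation to a sense can be sketched as a gloss lookup. The table below is a hypothetical stand-in for HowNet: the sense labels are invented, the glosses reuse the &quot;huodong&quot; example from Section 6.2, and returning None for a translation that matches no gloss is an assumption, since the text does not say how such cases are scored.

```python
# Hypothetical gloss table: for each Chinese target word, each sense label
# maps to the set of English glosses listed for that sense.
GLOSSES = {
    "huodong": {
        "sense_1": {"move", "exercise"},
        "sense_2": {"act"},
    },
}


def translation_to_sense(target_word, translation):
    """Return the sense whose gloss set contains the translation, else None."""
    senses = GLOSSES.get(target_word, {})
    for sense, glosses in senses.items():
        if translation.lower() in glosses:
            return sense
    return None


print(translation_to_sense("huodong", "exercise"))
```

      Once every extracted translation has been mapped in this way, the SMT output can be scored with the same accuracy metric as the dedicated WSD models.</Paragraph>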
      <Paragraph position="4"> Note that Senseval-3 also defined a translation or multilingual lexical sample task (Chklovski et al., 2004), which is just like the English lexical sample task, except that the WSD systems are expected to predict Hindi translations instead of WordNet senses.</Paragraph>
      <Paragraph position="5"> This translation task might seem to be a more natural evaluation framework for SMT than the monolingual Chinese lexical sample task. However, in practice, there is very little data available to train an English-to-Hindi SMT model, which would significantly hinder its performance and bias the study in favor of the dedicated WSD models.</Paragraph>
    </Section>
    <Section position="2" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
5.2 Allowing the SMT model to exploit the Senseval data
</SectionTitle>
      <Paragraph position="0"> Comparing the Senseval WSD models with a regular SMT model is not entirely fair, since, unlike the SMT model, the dedicated WSD models are trained and evaluated on similar data. We address this problem by adapting our SMT model to the lexical sample task domain in two ways.</Paragraph>
      <Paragraph position="1"> First, we augment the training set of the SMT model with the Senseval training data. Since the training set consists of sense-annotated Chinese sentences, and not of Chinese-English bisentences, we artificially create sentence pairs for each training instance, where the Chinese sentence consists of the Chinese target word, and the English sense is the English gloss given by HowNet for that particular target word and HowNet sense.</Paragraph>
      <Paragraph position="2"> Second, we restrict the translation candidates considered by the decoder for the target words to the set of all the English glosses given by HowNet for all the senses of the target word considered. With this modification, the degree of ambiguity faced by the SMT model is closer to that of the WSD model.</Paragraph>
      <Paragraph position="3"> Table 1 shows that each of these modifications helps the accuracy, overall yielding a 28.9% relative improvement over the regular SMT system.</Paragraph>
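      <Paragraph> The two adaptations described in this section can be sketched as follows. The function names, the sense labels, and the gloss table are hypothetical; the glosses reuse the &quot;cailiao&quot; example from Section 6.2, and the sketch assumes one gloss string per sense.

```python
def make_artificial_pairs(training_instances, glosses):
    """Build (Chinese, English) pairs from sense-annotated instances.

    Each instance is a (target_word, sense) tuple. The Chinese side of the
    pair is the target word alone, and the English side is the gloss given
    for that word and sense.
    """
    return [(word, glosses[word][sense]) for word, sense in training_instances]


def candidate_translations(target_word, glosses):
    """Return the glosses of all senses of the target word, without repeats."""
    return sorted(set(glosses[target_word].values()))


# Hypothetical gloss table with invented sense labels.
glosses = {"cailiao": {"sense_1": "data", "sense_2": "material"}}
instances = [("cailiao", "sense_1"), ("cailiao", "sense_2")]

print(make_artificial_pairs(instances, glosses))
print(candidate_translations("cailiao", glosses))
```

      The first function corresponds to augmenting the SMT training set, and the second to the restricted candidate list that the decoder considers for a target word.</Paragraph>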
    </Section>
  </Section>
  <Section position="7" start_page="121" end_page="123" type="metho">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> Table 2 summarizes the results of the SMT and WSD models on the Senseval-3 Chinese lexical sample task.</Paragraph>
    <Paragraph position="1"> Note that the accuracy of the most frequent sense baseline is extremely low, which shows that the evaluation set contains instances that are particularly difficult to disambiguate. All our SMT and WSD models significantly outperform this baseline.</Paragraph>
    <Paragraph position="2"> Table 2 clearly shows that even the best of the SMT models considered performs significantly worse than any of the dedicated WSD models considered. The accuracy of the best Senseval-3 system is double the accuracy of the best SMT model.</Paragraph>
    <Paragraph position="3"> Since the best Senseval-3 system is a classifier ensemble that benefits from the predictions of four individual WSD models which have very different biases, we also compare the performance of the SMT model with that of the individual WSD models. All the individual component WSD models, including the simplest naive Bayes model, also significantly outperform the SMT model.</Paragraph>
    <Section position="1" start_page="122" end_page="123" type="sub_section">
      <SectionTitle>
6.2 The SMT model prefers phrasal cohesion in the output sentences to WSD accuracy
</SectionTitle>
      <Paragraph position="0"> Inspection of the output reveals that the main cause of errors is that the SMT model tends to prefer phrasal cohesion to word translation adequacy: lexical choice is essentially performed by the English language model, therefore translations are primarily chosen to preserve phrasal cohesion in the output sentence, and only local context is used. We will give three different examples to illustrate this effect.</Paragraph>
      <Paragraph position="1"> The Chinese word &quot;活动&quot; (huodong) has the senses &quot;move/exercise&quot; vs. &quot;act&quot;. A Chinese sentence is incorrectly translated as &quot;the party leadership which develop the constitution and laws and in constitutional and legal framework exercise&quot;. Here, &quot;exercise&quot; is not the right translation; &quot;act&quot; should be used instead. However, the language model prefers the use of the phrase &quot;legal framework exercise&quot;, where the word &quot;exercise&quot; is used in a different sense than the one meant in the &quot;move/exercise&quot; category. Note that choosing the wrong translation for this word not only affects the adequacy, but also the grammaticality and fluency of the translated sentence.</Paragraph>
      <Paragraph position="2"> In one of the target sentences, the SMT model has to choose between two translations for the Chinese word &quot;材料&quot; (cailiao): &quot;data&quot; or &quot;material&quot;. The two closest left neighbors can be translated as &quot;provide proof&quot;, and the SMT model incorrectly picks the &quot;material&quot; sense, because the phrase &quot;provide proof of material...&quot; is more frequent than &quot;provide proof of data&quot;. In contrast, the WSD model has access to a wider context to correctly pick the &quot;data&quot; translation.</Paragraph>
      <Paragraph position="3"> Similarly, in a test sentence where the Chinese word &quot;分子&quot; (fenzi) is used in the sense &quot;element/component&quot; vs. &quot;member&quot;, the SMT system incorrectly chooses the &quot;member&quot; translation because the neighboring word translates to &quot;active&quot;, and the language model prefers the phrase &quot;active member&quot; to &quot;active element&quot; or &quot;active component&quot;.</Paragraph>
      <Paragraph position="4"> 6.3 WSD models are consistently better than SMT models for all target word types. Computing accuracies per target word type shows that the previous observations hold for each target word.</Paragraph>
      <Paragraph position="5"> Table 3 compares the accuracies of the best SMT vs. the best WSD system per target word type and shows that the WSD system always yields significantly higher scores for the set of target words considered.</Paragraph>
      <Paragraph position="6"> This breakdown also reveals, interestingly, that the most difficult words for the SMT model consist of a single character. Eight out of the 20 target words considered consist of a single character, and appear as such in the test set, while these characters were typically segmented within longer words in the parallel training corpus. However, this is of course not the only reason for the difference in accuracies, as the WSD system also significantly outperforms the SMT model on target words that consist of more than one character.</Paragraph>
    </Section>
    <Section position="2" start_page="123" end_page="123" type="sub_section">
      <SectionTitle>
6.4 A dedicated unsupervised WSD model also outperforms SMT
</SectionTitle>
      <Paragraph position="0"> One might speculate that the difference in performance obtained with SMT vs. WSD models can be explained by the fact that we are essentially comparing unsupervised models with fully supervised models.</Paragraph>
      <Paragraph position="1"> To address this we can again take advantage of the Senseval framework, and compare the performance of our SMT system with other published results on the same dataset. The system described in (Cabezas et al., 2004) is of particular interest as it uses an unsupervised approach. An unsupervised Chinese-English bilingual WSD model is learned from automatically word-aligned parallel corpora. In order to use this bilingual model, the Chinese lexical sample task is artificially converted into a bilingual task, by automatically translating the Chinese test sentences into English, using an alignment-template based SMT system.</Paragraph>
      <Paragraph position="2"> This unsupervised, but dedicated, WSD model yields an accuracy of 44.5%, thus outperforming all the SMT model variations. It yields a 35% relative improvement over the best SMT model, which remains relatively little compared to the best supervised dedicated WSD system, which doubles the accuracy score of the SMT model.</Paragraph>
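      <Paragraph> The relationship between the two figures quoted above can be checked with a short calculation. The sketch assumes that relative improvement means (new - baseline) / baseline. The baseline it recovers is a value implied by the 44.5% accuracy and the 35% relative improvement, not a number read from Table 2.

```python
def relative_improvement(new, baseline):
    """Relative improvement of new over baseline, as a fraction."""
    return (new - baseline) / baseline


def implied_baseline(new, relative):
    """Invert relative_improvement: the baseline implied by new and relative."""
    return new / (1.0 + relative)


unsupervised_wsd = 44.5  # accuracy (%) quoted in the text
best_smt = implied_baseline(unsupervised_wsd, 0.35)

print(round(best_smt, 1))  # prints 33.0
print(round(relative_improvement(unsupervised_wsd, best_smt), 2))  # prints 0.35
```

      Under this reading, the best SMT model sits at roughly 33% accuracy, which is consistent with the statement that the best supervised system doubles it.</Paragraph>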
    </Section>
  </Section>
  <Section position="8" start_page="123" end_page="124" type="metho">
    <SectionTitle>
7 Related work
</SectionTitle>
    <Paragraph position="0"> To the best of our knowledge, this is the first evaluation of SMT models on standard WSD performance metrics and datasets. One might argue that traditional MT evaluation metrics such as word error rate (WER) also evaluate the WSD performance of MT models. WER is defined as the percentage of words to be inserted, deleted or replaced in the translation in order to obtain the sentence of reference. However, WER does not isolate WSD performance since it also encompasses many other types of errors. Also, since the choice of a translation for a particular word affects the translation of other words in the sentence, the effect of WSD performance on WER is unclear. In contrast, the Senseval accuracy metric counts each incorrect translation choice only once.</Paragraph>
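    <Paragraph> The WER definition given above can be made concrete with a word-level edit distance. This is a minimal sketch: whitespace tokenization and normalization by the reference length are assumptions (the latter is the common convention), and the reference is assumed to be non-empty.

```python
def word_error_rate(hypothesis, reference):
    """Percentage of word insertions, deletions, and substitutions needed
    to turn the hypothesis into the reference, relative to reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    # prev[j] is the edit distance between the hypothesis words processed
    # so far and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("active member of the group", "active element of the group"))
```

    In this hypothetical pair, a single wrong lexical choice counts as one substitution among five reference words. WER would count a word-order or function-word error in exactly the same way, which is why it does not isolate WSD performance.</Paragraph>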
    <Paragraph position="1"> Apart from the voted WSD system described in section 4, and the unsupervised system (Cabezas et al., 2004) mentioned in section 6.4, systems built and optimized for the Senseval-3 Chinese lexical sample task include Niu et al. (2004). Many of the Senseval-type WSD systems are not language specific, and the presentations of the results in the English lexical sample task (Mihalcea et al., 2004), the English-Hindi multilingual task (Chklovski et al., 2004), or any of the lexical sample tasks defined in other languages give a good overview of the variety of approaches to WSD.</Paragraph>
    <Paragraph position="2"> Most previous work on multilingual WSD has focused on the different problem of exploiting bilingual resources (e.g., parallel or comparable corpora, or even full MT systems) to help WSD. For instance, Ng et al. (2003) showed that it is possible to use word-aligned parallel corpora to train accurate supervised WSD models. Other work includes Li and Li (2002), who propose a bilingual bootstrapping method to learn a translation disambiguation WSD model, and Diab (2004), who exploited large amounts of automatically generated noisy parallel data to learn WSD models in an unsupervised bootstrapping scheme. In all this work, the goal is to achieve accurate WSD with minimum amounts of annotated data. Again, this differs from our objective, which is to directly evaluate an SMT model as a WSD model.</Paragraph>
  </Section>
</Paper>