<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1003"> <Title>Improved Statistical Machine Translation Using Paraphrases</Title> <Section position="3" start_page="0" end_page="18" type="metho"> <SectionTitle> 2 The Problem of Coverage in SMT </SectionTitle> <Paragraph position="0"> Statistical machine translation made considerable advances in translation quality with the introduction of phrase-based translation (Marcu and Wong, 2002; Koehn et al., 2003; Och and Ney, 2004). By increasing the size of the basic unit of translation, phrase-based machine translation does away with many of the problems associated with the original word-based formulation of statistical machine translation (Brown et al., 1993). For instance, with multi-word units less re-ordering needs to occur, since local dependencies are frequently captured. For example, common adjective-noun alternations are memorized. However, since this linguistic information is not explicitly and generatively encoded in the model, unseen adjective-noun pairs may still be handled incorrectly.</Paragraph> <Paragraph position="1"> Thus, having observed phrases in the past dramatically increases the chances that they will be translated correctly in the future. However, for any given test set, a huge amount of training data has to be observed before translations are learned for a reasonable percentage of the test phrases. Figure 1 shows the extent of this problem. For a training corpus containing 10,000 words, translations will have been learned for only 10% of the unigrams (types, not tokens). For a training corpus containing 100,000 words this increases to 30%. It is not until nearly 10,000,000 words worth of training data have been analyzed that translations have been learned for more than 90% of the vocabulary items. This problem is obviously compounded for higher-order n-grams (longer phrases), and for morphologically richer languages.</Paragraph> <Paragraph position="2"> [Figure 1: n-grams (up to 4-grams) from the Europarl Spanish test sentences for which translations were learned in increasingly large training corpora.]</Paragraph> <Paragraph position="3"> Table 1: Paraphrases for the Spanish words encargarnos and usado, along with their English translations, which were automatically learned from the Europarl corpus:
encargarnos: to ensure, take care, ensure that
garantizar: guarantee, ensure, guaranteed, assure, provided
velar: ensure, ensuring, safeguard, making sure
procurar: ensure that, try to, ensure, endeavour to
asegurarnos: ensure, secure, make certain</Paragraph> <Section position="1" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 2.1 Handling unknown words </SectionTitle> <Paragraph position="0"> Currently most statistical machine translation systems are simply unable to handle unknown words.</Paragraph> <Paragraph position="1"> There are two strategies that are generally employed when an unknown source word is encountered. Either the source word is simply omitted when producing the translation, or alternatively it is passed through untranslated, which is a reasonable strategy if the unknown word happens to be a name (assuming that no transliteration need be done). Neither of these strategies is satisfying.</Paragraph> </Section> <Section position="2" start_page="17" end_page="18" type="sub_section"> <SectionTitle> 2.2 Using paraphrases in SMT </SectionTitle> <Paragraph position="0"> When a system is trained using 10,000 sentence pairs (roughly 200,000 words) there will be a number of words and phrases in a test sentence for which it has not learned a translation.</Paragraph>
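<Paragraph> To make the coverage measurements above concrete, the short sketch below counts the fraction of test-set n-gram types that appear in a phrase table. This is an illustrative sketch only, not the measurement code used for Figure 1; the input names (test_sentences, phrase_table) and the toy data are assumptions.
```python
# Illustrative sketch: fraction of test-set n-gram types for which a
# translation is available in a phrase table. Not the paper's code;
# "test_sentences" and "phrase_table" are assumed inputs.

def ngram_types(sentences, n):
    """Collect the set of n-gram types (as tuples) from tokenized sentences."""
    types = set()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            types.add(tuple(tokens[i:i + n]))
    return types

def coverage(test_sentences, phrase_table, n):
    """Fraction of test n-gram types that have an entry in the phrase table."""
    types = ngram_types(test_sentences, n)
    if not types:
        return 0.0
    covered = sum(1 for t in types if t in phrase_table)
    return covered / len(types)

# Toy usage: a tiny "test set" and the source sides of a tiny phrase table.
test_sentences = [["es", "positivo", "llegar", "a", "un", "acuerdo"],
                  ["debemos", "garantizar", "este", "sistema"]]
phrase_table = {("es",), ("un",), ("acuerdo",), ("llegar", "a")}
print(coverage(test_sentences, phrase_table, 1))  # unigram type coverage: 0.3
print(coverage(test_sentences, phrase_table, 2))  # bigram type coverage: 0.125
```
</Paragraph>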
<Paragraph position="1"> For example, the Spanish sentence Es positivo llegar a un acuerdo sobre los procedimientos, pero debemos encargarnos de que este sistema no sea susceptible de ser usado como arma política. may translate as It is good reach an agreement on procedures, but we must encargarnos that this system is not susceptible to be usado as political weapon.</Paragraph> <Paragraph position="2"> [Figure 2: Using a bilingual parallel corpus to extract paraphrases. Shown sentence pairs: what is more, the relevant cost dynamic is completely under control / im übrigen ist die diesbezügliche kostenentwicklung völlig unter kontrolle; we owe it to the taxpayers to keep the costs in check / wir sind es den steuerzahlern schuldig, die kosten unter kontrolle zu haben.]</Paragraph> <Paragraph position="3"> The strategy that we employ for dealing with unknown source language words is to substitute paraphrases of those words, and then translate the paraphrases. Table 1 gives examples of paraphrases and their translations. If we had learned a translation of garantizar we could translate it instead of encargarnos, and similarly for utilizado instead of usado.</Paragraph> </Section> </Section> <Section position="4" start_page="18" end_page="19" type="metho"> <SectionTitle> 3 Acquiring Paraphrases </SectionTitle> <Paragraph position="0"> Paraphrases are alternative ways of expressing the same information within one language. The automatic generation of paraphrases has recently been the focus of a significant amount of research. Many methods for extracting paraphrases (Barzilay and McKeown, 2001; Pang et al., 2003) make use of monolingual parallel corpora, such as multiple translations of classic French novels into English, or the multiple reference translations used by many automatic evaluation metrics for machine translation.</Paragraph> <Paragraph position="1"> Bannard and Callison-Burch (2005) use bilingual parallel corpora to generate paraphrases. Paraphrases are identified by pivoting through phrases in another language. The foreign language translations of an English phrase are identified, all occurrences of those foreign phrases are found, and all English phrases that they translate back to are treated as potential paraphrases of the original English phrase.</Paragraph> <Paragraph position="2"> Figure 2 illustrates how a German phrase can be used as a pivot for identifying English paraphrases in this way.</Paragraph> <Paragraph position="3"> The method defined in Bannard and Callison-Burch (2005) has several features that make it an ideal candidate for incorporation into a statistical machine translation system. Firstly, it can easily be applied to any language for which we have one or more parallel corpora. Secondly, it defines a paraphrase probability, p(e2|e1), which can be incorporated into the probabilistic framework of SMT.</Paragraph> <Section position="1" start_page="18" end_page="19" type="sub_section"> <SectionTitle> 3.1 Paraphrase probabilities </SectionTitle> <Paragraph position="0"> The paraphrase probability p(e2|e1) is defined in terms of two translation model probabilities: p(f|e1), the probability that the original English phrase e1 translates as a particular phrase f in the other language, and p(e2|f), the probability that the candidate paraphrase e2 translates as the foreign language phrase. Since e1 can translate as multiple foreign language phrases, we marginalize f out:
$$p(e_2 \mid e_1) = \sum_{f} p(f \mid e_1)\, p(e_2 \mid f) \qquad (1)$$
</Paragraph> <Paragraph position="1"> The translation model probabilities can be computed using any standard formulation from phrase-based machine translation.</Paragraph>
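<Paragraph> To make the pivoting computation in Equation 1 concrete, the following sketch derives paraphrase probabilities from two toy phrase translation tables. It is an illustration under assumed inputs (nested dictionaries named p_f_given_e and p_e_given_f with made-up probabilities), not the authors' implementation. In practice both tables would be estimated from word-aligned bilingual text, as discussed below.
```python
# Illustrative sketch of Equation 1: p(e2|e1) = sum over pivot phrases f of
# p(f|e1) * p(e2|f). The nested-dictionary tables and all probabilities are
# made up for illustration.

from collections import defaultdict

def paraphrase_probs(e1, p_f_given_e, p_e_given_f):
    """Return a dict mapping candidate paraphrases e2 to p(e2|e1)."""
    probs = defaultdict(float)
    for f, p_f in p_f_given_e.get(e1, {}).items():
        for e2, p_e2 in p_e_given_f.get(f, {}).items():
            if e2 != e1:  # exclude the original phrase itself
                probs[e2] += p_f * p_e2
    return dict(probs)

# Toy usage, pivoting through German as in Figure 2.
p_f_given_e = {"under control": {"unter kontrolle": 0.8, "im griff": 0.2}}
p_e_given_f = {"unter kontrolle": {"under control": 0.7, "in check": 0.3},
               "im griff": {"under control": 0.6, "in hand": 0.4}}
print(paraphrase_probs("under control", p_f_given_e, p_e_given_f))
# approximately {'in check': 0.24, 'in hand': 0.08}
```
</Paragraph>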
<Paragraph position="2"> For example, p(e2|f) can be calculated straightforwardly using maximum likelihood estimation by counting how often the phrases e2 and f were aligned in the parallel corpus:
$$p(e_2 \mid f) = \frac{count(f, e_2)}{\sum_{e} count(f, e)} \qquad (2)$$
</Paragraph> <Paragraph position="3"> There is nothing that limits us to estimating paraphrase probabilities from a single parallel corpus. We can extend the definition of the paraphrase probability to include multiple corpora, as follows:
$$p(e_2 \mid e_1) = \frac{\sum_{c \in C} \sum_{f \in c} p(f \mid e_1)\, p(e_2 \mid f)}{|C|} \qquad (3)$$
where c is a parallel corpus from a set of parallel corpora C. Thus multiple corpora may be used by summing over all paraphrase probabilities calculated from a single corpus (as in Equation 1) and normalizing by the number of parallel corpora.</Paragraph> </Section> </Section> <Section position="5" start_page="19" end_page="19" type="metho"> <SectionTitle> 4 Experimental Design </SectionTitle> <Paragraph position="0"> We examined the application of paraphrases to deal with unknown phrases when translating from Spanish and French into English. We used the publicly available Europarl multilingual parallel corpus (Koehn, 2005) to create six training corpora for the two language pairs, and used the standard Europarl development and test sets.</Paragraph> <Section position="1" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 4.1 Baseline </SectionTitle> <Paragraph position="0"> For a baseline system we produced a phrase-based statistical machine translation system based on the log-linear formulation described in (Och and Ney, 2002):
$$\hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)$$
</Paragraph> <Paragraph position="1"> The baseline model had a total of eight feature functions, h_m(e, f): a language model probability, a phrase translation probability, a reverse phrase translation probability, a lexical translation probability, a reverse lexical translation probability, a word penalty, a phrase penalty, and a distortion cost. To set the weights, λ_m, we performed minimum error rate training (Och, 2003) on the development set using Bleu (Papineni et al., 2002) as the objective function. The phrase translation probabilities were determined using maximum likelihood estimation over phrases induced from word-level alignments produced by performing Giza++ training on each of the three training corpora. We used the Pharaoh beam-search decoder (Koehn, 2004) to produce the translations after all of the model parameters had been set.</Paragraph> <Paragraph position="2"> When the baseline system encountered unknown words in the test set, its behavior was simply to reproduce the foreign word in the translated output. This is the default behavior for many systems, as noted in Section 2.1.</Paragraph> </Section> <Section position="2" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 4.2 Translation with paraphrases </SectionTitle> <Paragraph position="0"> We extracted all source language (Spanish and French) phrases up to length 10 from the test and development sets which did not have translations in the phrase tables generated from the three training corpora. For each of these phrases we generated a list of paraphrases using all of the parallel corpora from Europarl aside from the Spanish-English and French-English corpora. We used bitexts between Spanish and Danish, Dutch, Finnish, French, German, Italian, Portuguese, and Swedish to generate our Spanish paraphrases, and did similarly for the French paraphrases. We managed the parallel corpora with a suffix-array-based data structure (Callison-Burch et al., 2005).</Paragraph>
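<Paragraph> Continuing the illustration, the minimal sketch below shows the multi-corpus combination of Equation 3: per-corpus paraphrase distributions (for example, those produced by the paraphrase_probs sketch above) are summed and normalized by the number of pivot corpora. The inputs and numbers are made up for illustration.
```python
# Illustrative sketch of Equation 3: sum single-corpus paraphrase
# distributions over a set of pivot corpora C and normalize by |C|.
# The per-corpus estimates below are made-up numbers.

from collections import defaultdict

def combine_over_corpora(per_corpus_probs):
    """Average per-corpus p(e2|e1) dictionaries over the corpus set."""
    combined = defaultdict(float)
    for probs in per_corpus_probs:
        for e2, p in probs.items():
            combined[e2] += p
    return {e2: p / len(per_corpus_probs) for e2, p in combined.items()}

# Toy usage: estimates of p(e2 | "under control") from two pivot corpora,
# e.g. produced per corpus by the paraphrase_probs sketch shown earlier.
german_pivot = {"in check": 0.24, "in hand": 0.08}
french_pivot = {"in check": 0.10, "kept in line": 0.05}
print(combine_over_corpora([german_pivot, french_pivot]))
# approximately {'in check': 0.17, 'in hand': 0.04, 'kept in line': 0.025}
```
</Paragraph>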
<Paragraph position="1"> We calculated paraphrase probabilities using the Bannard and Callison-Burch (2005) method, summarized in Equation 3. Source language phrases that included names and numbers were not paraphrased.</Paragraph> <Paragraph position="2"> For each paraphrase that had translations in the phrase table, we added additional entries in the phrase table containing the original phrase and the paraphrase's translations. We augmented the baseline model by incorporating the paraphrase probability into an additional feature function which assigns values as follows:
$$h(e, f_1) = \begin{cases} p(f_2 \mid f_1) \text{ if the phrase table entry } (e, f_1) \text{ was generated by paraphrasing } f_1 \text{ as } f_2 \\ 1 \text{ otherwise} \end{cases}$$
</Paragraph> <Paragraph position="3"> Just as we did in the baseline system, we performed minimum error rate training to set the weights of the nine feature functions in our translation model that exploits paraphrases.</Paragraph> <Paragraph position="4"> We tested the usefulness of the paraphrase feature function by performing an additional experiment in which the phrase table was expanded but the paraphrase probability was omitted.</Paragraph> </Section> <Section position="3" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 4.3 Evaluation </SectionTitle> <Paragraph position="0"> We evaluated the efficacy of using paraphrases in three ways: by calculating the Bleu score for the translated output, by measuring the increase in coverage when including paraphrases, and through a targeted manual evaluation of the phrasal translations of unseen phrases to determine how many of the newly covered phrases were accurately translated.</Paragraph> <Paragraph position="1"> [Figure 3: test sentences were manually word-aligned, which allowed unseen phrases to be equated with their corresponding English phrases; in this case enumeradas with listed.]</Paragraph> <Paragraph position="2"> Although Bleu is currently the standard metric for MT evaluation, we believe that it may not meaningfully measure translation improvements in our setup. When we substitute a paraphrase for an unknown source phrase, there is a strong chance that its translation will also be a paraphrase of the equivalent target language phrase. Bleu relies on exact matches of n-grams in a reference translation. Thus if our translation is a paraphrase of the reference, Bleu will fail to score it correctly.</Paragraph> <Paragraph position="3"> Because Bleu is potentially insensitive to the type of changes that we were making to the translations, we additionally performed a focused manual evaluation (Callison-Burch et al., 2006). To do this, we had bilingual speakers create word-level alignments for the first 150 and 250 sentences in the Spanish-English and French-English test corpora, as shown in Figure 3. We were able to use these alignments to extract the translations of the Spanish and French words that we were applying our paraphrase method to.</Paragraph> <Paragraph position="4"> Knowing this correspondence between foreign phrases and their English counterparts allowed us to directly analyze whether translations that were being produced from paraphrases remained faithful to the meaning of the reference translation.</Paragraph>
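<Paragraph> As a rough illustration of how such manual word alignments can be used, the sketch below recovers the English words aligned to a given source-side span. The alignment representation (a set of source-target index pairs) and all names are assumptions for illustration, not the tooling used in the paper.
```python
# Illustrative sketch: recover the English words aligned to a source-side
# phrase span, given a manual word alignment. The alignment format (a set of
# (source_index, target_index) pairs) and the example are assumptions.

def aligned_target_phrase(alignment, target_tokens, src_start, src_end):
    """Return target words linked to source positions src_start..src_end."""
    positions = sorted({j for (i, j) in alignment
                        if i in range(src_start, src_end + 1)})
    return [target_tokens[j] for j in positions]

# Toy usage: the source word at position 3 (say, "enumeradas") is aligned
# to the target word "listed".
target_tokens = ["for", "the", "reasons", "listed", "therein"]
alignment = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)}
print(aligned_target_phrase(alignment, target_tokens, 3, 3))  # ['listed']
```
</Paragraph>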
<Paragraph position="5"> When producing our translations using the Pharaoh decoder we employed its &quot;trace&quot; facility, which reports which source sentence span each target phrase was derived from. This allowed us to identify which elements in the machine translated output corresponded to the paraphrased foreign phrase. We asked a monolingual judge whether the phrases in the machine translated output had the same meaning as the reference phrase. This is illustrated in Figure 4.</Paragraph> <Paragraph position="6"> [Figure 4: a reference translation (top), &quot;The article combats discrimination and inequality in the treatment of citizens for the reasons listed therein&quot;, shown with several machine translations of the same sentence; judges assessed whether the translated phrase retained the same meaning as the highlighted phrase in the reference translation.]</Paragraph> <Paragraph position="7"> In addition to judging the accuracy of 100 phrases for each of the translated sets, we measured how much our paraphrase method increased the coverage of the translation system. Because we focus on words that the system was previously unable to translate, the increase in coverage and the translation quality of the newly covered phrases are the two most relevant indicators of the efficacy of the method.</Paragraph> </Section> </Section> </Paper>