<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1006"> <Title>Phrase-Based Backoff Models for Machine Translation of Highly Inflected Languages</Title> <Section position="4" start_page="41" end_page="42" type="metho"> <SectionTitle> 3 Backoff Models </SectionTitle> <Paragraph position="0"> Generally speaking, backoff models exploit relationships between more general and more specific probability distributions. They specify under which conditions the more specific model is used and when the model &quot;backs off&quot; to the more general distribution. Backoff models have been used in a variety of ways in natural language processing, most notably in statistical language modeling. In language modeling, a higher-order n-gram distribution is used when it is deemed reliable (determined by the number of occurrences in the training data); otherwise, the model backs off to the next lower-order n-gram distribution. For the case of trigrams, this can be expressed as:</Paragraph> <Paragraph position="2"> where pML denotes the maximum-likelihood estimate, c denotes the count of the triple (wi,wi[?]1,wi[?]2)inthetraining data, t isthe count threshold above which the maximum-likelihood estimate is retained, and dN(wi,wi[?]1,wi[?]2) is a discounting factor (generally between 0 and 1) that is applied to the higher-order distribution. The normalization factor a(wi[?]1,wi[?]2) ensures that the distribution sums to one. In (Bilmes and Kirchhoff, 2003) this method was generalized to a back-off model with multiple paths, allowing the combination of different backed-off probability estimates. Hierarchical backoff schemes have also been used by (Zitouni et al., 2003) for language modeling and by (Gildea, 2001) for semantic role labeling. (Resnik et al., 2001) used backoff translation lexicons for cross-language information retrieval. More recently, (Xi and Hwa, 2005) have used backoff models for combining in-domain and out-of-domain data for the purpose of bootstrapping a part-of-speech tagger for Chinese, outperforming standard methods such as EM.</Paragraph> </Section> <Section position="5" start_page="42" end_page="43" type="metho"> <SectionTitle> 4 Backoff Models in MT </SectionTitle> <Paragraph position="0"> In order to handle unseen words in the test data we propose a hierarchical backoff model that uses morphological information. Several morphological operations, in particular stemming and compound splitting, are interleaved such that a more specific form (i.e. a form closer to the full word form) is chosen before a more general form (i.e. a form that has undergone morphological processing). The procedure is shown in Figure 1 and can be described as follows: First, a standard phrase table based on full word forms is trained. If an unknown word fi is encountered in the test data with context cfi = fi[?]n,...,fi[?]1,fi+1,...,fi+m, the word is first stemmed, i.e. fprimei = stem(fi).</Paragraph> <Paragraph position="1"> The phrase table entries for words sharing the same stem are then modified by replacing the respective words with their stems. If an entry can be found among these such that the source language side of the phrase pair consists of fi[?]n,...,fi[?]1,stem(fi),fi+1,...,fi+m, the corresponding translation is used (or, if several possible translations occur, the one with the highest probability is chosen). Note that the context may be empty, in which case a single-word phrase is used. 
<Paragraph position="2"> If this step fails, the model backs off to the next level and applies compound splitting to the unknown word (further described below), i.e. $(f''_{i1}, f''_{i2}) = \mathrm{split}(f_i)$. The match with the original word-based phrase table is then performed again. If this step fails for either of the two parts of $f''_i$, stemming is applied again: $f'''_{i1} = \mathrm{stem}(f''_{i1})$ and $f'''_{i2} = \mathrm{stem}(f''_{i2})$, and a match with the stemmed phrase table entries is carried out.</Paragraph> <Paragraph position="3"> Only if the attempted match fails at this level is the input passed on verbatim to the translation output.</Paragraph> <Paragraph position="4"> The backoff procedure could in principle be performed on demand by a specialized decoder; however, since we use an off-the-shelf decoder (Pharaoh (Koehn, 2004)), backoff is implicitly enforced by providing a phrase table that includes all required backoff levels and by preprocessing the test data accordingly. The phrase table will thus include entries for phrases based on full word forms as well as for their stemmed and/or split counterparts.</Paragraph> <Paragraph position="5"> For each entry with decomposed morphological forms, four probabilities need to be provided: two phrasal translation scores for the two translation directions, $p(\bar{e} \mid \bar{f})$ and $p(\bar{f} \mid \bar{e})$, and two corresponding lexical scores, which are computed as a product of the word-by-word translation probabilities under the given alignment $a$:</Paragraph> <Paragraph position="6"> $$ p_w(\bar{e} \mid \bar{f}, a) = \prod_{i=1}^{|\bar{e}|} \frac{1}{|\{j \mid (i,j) \in a\}|} \sum_{(i,j) \in a} w(e_i \mid f_j) $$ </Paragraph> <Paragraph position="7"> where $j$ ranges over the words in phrase $\bar{f}$ and $i$ ranges over the words in phrase $\bar{e}$. In the case of unknown words in the foreign language, we need the probabilities $p(\bar{e} \mid \mathrm{stem}(\bar{f}))$ and $p(\mathrm{stem}(\bar{f}) \mid \bar{e})$ (where the stemming operation $\mathrm{stem}(\bar{f})$ applies to the unknown words in the phrase), and their lexical equivalents. These are computed by relative frequency estimation, e.g.</Paragraph> <Paragraph position="8"> $$ p(\bar{e} \mid \mathrm{stem}(\bar{f})) = \frac{N(\mathrm{stem}(\bar{f}), \bar{e})}{\sum_{\bar{e}'} N(\mathrm{stem}(\bar{f}), \bar{e}')} $$ </Paragraph> <Paragraph position="9"> The other translation probabilities are computed analogously. Since normalization is performed over the entire phrase table, this procedure has the effect of discounting the original probability $p_{orig}(\bar{e} \mid \bar{f})$, since $\bar{e}$ may now have been generated by either $\bar{f}$ or $\mathrm{stem}(\bar{f})$. In the standard formulation of backoff models shown in Equation 3, this amounts to:</Paragraph> <Paragraph position="10"> $$ p_{BO}(\bar{e} \mid \bar{f}) = \begin{cases} d_{\bar{f}} \, p_{orig}(\bar{e} \mid \bar{f}) & \text{if } c(\bar{f}) > 0 \\ p(\bar{e} \mid \mathrm{stem}(\bar{f})) & \text{otherwise} \end{cases} $$ where the factor $d_{\bar{f}}$, which results from renormalizing over the augmented phrase table, is the amount by which the word-based phrase translation probability is discounted. Equivalent probability computations are carried out for the lexical translation probabilities. As with the backoff level that uses stemming, the translation probabilities need to be recomputed for the levels that use splitting and combined splitting/stemming.</Paragraph> <Paragraph position="11"> In order to derive the morphological decomposition we use existing tools. For stemming we use the TreeTagger (Schmid, 1994) for German and the Snowball stemmer for Finnish. A variety of approaches to compound splitting have been investigated in machine translation (Koehn, 2003).</Paragraph> <Paragraph position="12"> Here we use a simple technique that considers all possible ways of segmenting a word into two subparts (with a minimum-length constraint of three characters on each subpart). A segmentation is accepted if the subparts appear as individual items in the training data vocabulary. The only linguistic knowledge used in the segmentation process is the removal of a final &quot;s&quot; from the first part of the compound before trying to match it to an existing word.</Paragraph>
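As an illustration of the compound-splitting heuristic just described, the sketch below enumerates all two-way segmentations with at least three characters per part and accepts those whose parts occur in the training-data vocabulary; the function name and the way a stripped final "s" is validated are assumptions rather than the paper's exact implementation.

```python
# Illustrative sketch of the two-way compound-splitting heuristic described
# above; any detail beyond the text (e.g. how a stripped final "s" is
# validated against the vocabulary) is an assumption.

def split_compound(word, vocabulary, min_len=3):
    """Return all segmentations of `word` into two subparts, each at least
    `min_len` characters long, such that both subparts occur as individual
    items in the training-data vocabulary."""
    splits = []
    for i in range(min_len, len(word) - min_len + 1):
        first, second = word[:i], word[i:]
        # Only linguistic knowledge used: a final "s" (the German Fugen-s)
        # may be removed from the first part before matching.
        first_candidates = [first]
        if first.endswith("s"):
            first_candidates.append(first[:-1])
        for candidate in first_candidates:
            if candidate in vocabulary and second in vocabulary:
                splits.append((candidate, second))
    return splits
```

For instance, a hypothetical lowercased compound such as "staatsanwalt" could yield the split ("staat", "anwalt") via the stripped final "s", provided both parts occur in the training vocabulary; multiple candidate splits may be returned, mirroring the lack of a principled way of choosing among them noted in the text.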
<Paragraph position="13"> This character (the Fugen-s) is often inserted as &quot;glue&quot; when forming German compounds. Other glue characters were not considered for simplicity (but could be added in the future). The segmentation method is clearly not linguistically adequate: first, words may need to be split into more than two parts; second, the method may generate multiple possible segmentations without a principled way of choosing among them; third, it may generate invalid splits. However, a manual analysis of 300 unknown compounds in the German development set (see next section) showed that 95.3% of them were decomposed correctly: for the domain at hand, most compounds need not be split into more than two parts, and if one part is itself a compound it is usually frequent enough in the training data to have a translation. Furthermore, lexicalized compounds, whose decomposition would lead to wrong translations, are also typically frequent words and have an appropriate translation in the training data.</Paragraph> </Section> <Section position="6" start_page="43" end_page="43" type="metho"> <SectionTitle> 5 Data </SectionTitle> <Paragraph position="0"> Our data consists of the Europarl training, development and test definitions for German-English and Finnish-English from the 2005 ACL shared data task (Koehn and Monz, 2005). Both German and Finnish are morphologically rich languages: German has four cases and three genders and shows number, gender and case distinctions not only on verbs, nouns, and adjectives, but also on determiners. In addition, it has a notoriously large number of compounds. Finnish is a highly agglutinative language with a large number of inflectional paradigms (e.g. one for each of its 15 cases). Noun compounds are also frequent. On the 2005 ACL shared MT data task, Finnish-to-English translation showed the lowest average performance (17.9% BLEU) and German-to-English the second lowest (21.9%), while the average BLEU scores for French-to-English and Spanish-to-English were much higher (27.1% and 27.8%, respectively).</Paragraph> <Paragraph position="1"> The data was preprocessed by lowercasing and by filtering out sentence pairs whose length ratio (number of words in the source language divided by the number of words in the target language, or vice versa) was greater than 9. The development and test sets consist of 2000 sentences each. In order to study the effect of varying amounts of training data, we created several training partitions consisting of random selections of subsets of the full training set. The sizes of the partitions are shown in Table 1, together with the resulting percentage of out-of-vocabulary (OOV) words in the development and test sets (&quot;type&quot; refers to a unique word in the vocabulary, &quot;token&quot; to an instance in the actual text).</Paragraph> </Section> <Section position="7" start_page="43" end_page="44" type="metho"> <SectionTitle> 6 System </SectionTitle> <Paragraph position="0"> We use a two-pass phrase-based statistical MT system, with GIZA++ (Och and Ney, 2000) for word alignment and Pharaoh (Koehn, 2004) for phrase extraction and decoding. Word alignment is performed in both directions using the IBM-4 model. Phrases are then extracted from the word alignments using the method described in (Och and Ney, 2003). For first-pass decoding we use Pharaoh in n-best mode. The decoder uses a weighted combination of seven scores: four translation model scores (phrase-based and lexical scores for both directions), a trigram language model score, a distortion score, and a word penalty.</Paragraph>
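Since Pharaoh combines its feature scores log-linearly, the weighted combination of these seven scores can be pictured as in the sketch below; the feature names and the weight dictionary are illustrative placeholders and not taken from the paper (the actual weights are tuned by minimum error rate training, as described next).

```python
# Illustrative sketch of a log-linear combination of the seven decoder
# feature scores; the feature names and any weight values are placeholders,
# not values from the paper.

FEATURE_NAMES = [
    "phrase_e_given_f", "phrase_f_given_e",  # phrasal translation scores
    "lex_e_given_f", "lex_f_given_e",        # lexical translation scores
    "lm",                                    # trigram language model score
    "distortion",                            # distortion score
    "word_penalty",                          # word penalty
]

def combined_score(log_scores, weights):
    """Weighted sum of the log-domain feature scores for one hypothesis."""
    return sum(weights[name] * log_scores[name] for name in FEATURE_NAMES)
```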
<Paragraph position="1"> Non-monotonic decoding is used, with no limit on the number of moves. The score combination weights are trained by a minimum error rate training procedure similar to (Och and Ney, 2003). The trigram language model uses modified Kneser-Ney smoothing and interpolation of trigram and bigram estimates, and was trained on the English side of the bitext. In the first pass, 2000 hypotheses are generated per sentence. In the second pass, the seven scores described above are combined with 4-gram language model scores. The performance of the baseline system on the development and test sets is shown in Table 2. The BLEU scores obtained are state-of-the-art for this task.</Paragraph> </Section> </Paper>