<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3122"> <Title>Language Models and Reranking for Machine Translation</Title> <Section position="3" start_page="0" end_page="151" type="metho"> <SectionTitle> 2 System Description </SectionTitle> <Paragraph position="0"> For the WMT 2006 shared task, we developed a system that is trained on (a) a word-aligned bilingual corpus, (b) a large monolingual (English) corpus and (c) an English treebank, and that translates from a source language (German, Spanish or French) into English.</Paragraph> <Paragraph position="1"> Our system embeds Phramer (used for minimum error rate training, decoding, decoding tools), Pharaoh (Koehn, 2004) (decoding), Carmel (helper for Pharaoh in n-best generation), Charniak's parser (Charniak, 2001) (language model) and SRILM (n-gram LM construction).</Paragraph> <Section position="1" start_page="0" end_page="150" type="sub_section"> <SectionTitle> 2.1 Translation table construction </SectionTitle> <Paragraph position="0"> We developed a component that builds a translation table from a word-aligned parallel corpus. The component generates the translation table according to the process described in the Pharaoh training manual. It generates a vector of 5 numeric values for each phrase pair: * phrase translation probability:</Paragraph> <Paragraph position="2"/> <Paragraph position="4"/> </Section> <Section position="2" start_page="150" end_page="150" type="sub_section"> <SectionTitle> 2.2 Decoding </SectionTitle> <Paragraph position="0"> We used the Pharaoh decoder for both the Minimum Error Rate Training (Och, 2003) and test dataset decoding. 
Although Phramer provides decoding functionality equivalent to Pharaoh's, we preferred Pharaoh for this task because it is much faster than Phramer (between 2 and 15 times faster, depending on the configuration), and preliminary tests showed no noticeable difference between the outputs of the two decoders in terms of BLEU score (Papineni et al., 2002).</Paragraph> <Paragraph position="1"> The log-linear model uses 8 features: one distortion feature, one basic LM feature, 5 features from the translation table and one sentence length feature.</Paragraph> </Section> <Section position="3" start_page="150" end_page="150" type="sub_section"> <SectionTitle> 2.3 Minimum Error Rate Training </SectionTitle> <Paragraph position="0"> To determine the best coefficients of the log-linear model (λ) for both the initial-stage decoding and the second-stage reranking, we used the unsmoothed Minimum Error Rate Training (MERT) component present in the Phramer package. The MERT component is highly efficient: searching a set of 200,000 hypotheses takes less than 30 seconds per iteration (a search from a previous or random λ to a local maximum) on a 3 GHz P4 machine. We also used the distributed decoding component from Phramer to speed up the search process.</Paragraph> <Paragraph position="1"> We generated the n-best lists required for MERT using the Carmel toolkit. Pharaoh outputs a lattice for each input sentence, from which Carmel extracts a specific number of hypotheses. We used the europarl.en.srilm language model for decoding the n-best lists.</Paragraph> <Paragraph position="2"> The weighting vector is calculated individually for each subtask (pair of source and target languages). 
</Paragraph> </Section> <Section position="4" start_page="150" end_page="151" type="sub_section"> <SectionTitle> 2.4 Language Models for reranking </SectionTitle> <Paragraph position="0"> We employed both syntactic language models and n-gram based language models extracted from very large corpora to improve translation quality by reranking the n-best list. These language models add a total of 13 new features to the log-linear model.</Paragraph> <Paragraph position="1"> We created large-scale n-gram language models using English Gigaword Second Edition (EGW). We split the corpus into sentences, tokenized it, lowercased the sentences, replaced every digit with &quot;9&quot; to cluster different numbers into the same unigram entry, filtered out noisy sentences and collected n-gram counts (up to 4-grams). Table 1 presents the statistics related to this process.</Paragraph> <Paragraph position="2"> We pruned the unigrams that appeared fewer than 15 times in the corpus and all the n-grams that contain the pruned unigrams. We also pruned 3-grams and 4-grams that appear only once in the corpus.</Paragraph> <Paragraph position="3"> Based on these counts, we calculated 4 features for each sentence: the logarithm of the probability of the sentence based on unigrams, on bigrams, on 3-grams and on 4-grams. The probability of each word in the analyzed translation hypotheses was bounded below by 10^-5 (to avoid an overall zero probability for a sentence caused by zero counts).</Paragraph> <Paragraph position="4"> Based on the unpruned counts, we calculated 8 additional features: how many of the n-grams in the hypothesis appear in the EGW corpus and how many of the n-grams in the hypothesis do not appear in the EGW corpus (n = 1..4). 
The two types of counts will have different behavior only when they are used to discriminate between two hypotheses with different lengths.</Paragraph> <Paragraph position="5"> The number of n-grams in each of the two cases We used Charniak's parser as an additional LM (Charniak, 2001) in reranking. The parser provides one feature for our model: the log-grammar-probability of the sentence.</Paragraph> <Paragraph position="6"> We retrained the parser on lowercased Penn Treebank II (Marcus et al., 1993), to match the lowercased output of the MT decoder.</Paragraph> <Paragraph position="7"> Considering the huge number of hypotheses that needed to be parsed for this task, we set it to parse very fast (using the command-line parameter -T10; this is the parser's time factor, for which higher is better and the default is 210).</Paragraph> </Section> <Section position="5" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 2.5 Reranking and voting </SectionTitle> <Paragraph position="0"> A λ weight vector trained over the 8 basic features (λ1) is used to decode an n-best list. Then, a λ vector trained over all 21 features (λ2) is used to rerank the n-best list, potentially generating a new first-best hypothesis.</Paragraph> <Paragraph position="1"> To improve the results, we generated during training a set of distinct λ2 weight vectors (4-10 different weight vectors). Each λ2 picks a preferred hypothesis. The final hypothesis is chosen using a voting mechanism. The computational cost of the voting process is very low, since every λ2 is applied to the same set of hypotheses, generated by a single λ1.</Paragraph> </Section> <Section position="6" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 2.6 Preprocessing </SectionTitle> <Paragraph position="0"> The vocabulary of languages like English, French and Spanish is relatively small. 
Most of the new words that appear in a text but did not appear in a pre-defined large text (i.e., the translation table) are abbreviations and proper nouns, which usually don't change their form when they are translated into another language. Thus Pharaoh and Phramer deal with out-of-vocabulary (OOV) words, that is, words that don't appear in the translation table, by copying them into the output translation. German is a compounding language, so the German vocabulary is virtually infinite. In order to avoid OOV issues for new text, we applied a heuristic to improve the probability of properly translating compound words that are not present in the translation table. We extracted the German vocabulary from the translation table. Then, for each word in a text to be translated (development set or test set), we checked whether it is present in this dictionary. If it was not present, we checked whether it can be obtained by concatenating two words in the dictionary. If we found at least one way of splitting the unknown word, we altered the text by dividing the word into the corresponding pieces. If there were multiple ways of splitting, we took one at random. The minimum length for each generated word is 3 letters.</Paragraph> <Paragraph position="1"> In order to minimize the risk of inserting words that are not in the reference translation into the output translation, we applied an OOV pruning algorithm (Koehn et al., 2005): we removed every word in the text to be translated that appears neither in the foreign part of the parallel corpus used for training (so we know we cannot translate it) nor in the English Gigaword corpus (so we do not expect it to be present in an English text). 
This method was applied to all the input text that was automatically translated: development and test; German, French and Spanish.</Paragraph> <Paragraph position="2"> For the German-to-English translation, the compound word splitting algorithm was applied before the unknown word removal process.</Paragraph> </Section> </Section> <Section position="4" start_page="151" end_page="152" type="metho"> <SectionTitle> 3 Experimental Setup </SectionTitle> <Paragraph position="0"> We generated the translation tables for each pair of languages using the alignment provided for this shared task.</Paragraph> <Paragraph position="1"> We split the dev2006 files into two halves. The first half was used to determine λ1. Using λ1, we created a 500-best list for each sentence in the second half. We calculated the values of the enhanced features (EGW and Charniak) for each of these hypotheses. Over this set of almost 500K hypotheses, we computed 10 different λ2 vectors using MERT. The search process was seeded using λ1 padded with 0 for the 13 new features. We sorted the λ2s by the BLEU score estimated by the MERT algorithm. We manually pruned the λ2s that diverge too much from the overall set of λ2s (based on the observation that</Paragraph> <Paragraph position="2"> these weights are overfitting). We picked from the remaining set the best λ2 and a preferred subset of λ2s to be used in voting.</Paragraph> <Paragraph position="3"> The λ1 vector was also used to decode a 500-best list for each sentence in the devtest2006 and test2006 sets. After computing the values of the enhanced features for each of these hypotheses, we applied the reranking algorithm to pick a new first-best hypothesis, which is the output of our system.</Paragraph> <Paragraph position="4"> We used the following parameters for decoding: -dl 5 -b 0.0001 -ttable-limit 30 -s 200 for French and Spanish and -dl 9 -b 0.00001 -ttable-limit 30 -s 200 for German.</Paragraph> </Section> </Paper>