<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3116"> <Title>Mood at work: Ramses versus Pharaoh</Title> <Section position="3" start_page="126" end_page="126" type="metho"> <SectionTitle> 2 The MOOD Framework </SectionTitle> <Paragraph position="0"> A decoder must implement a specific combination of two elements: a model representation and a search space exploration strategy. MOOD is a framework designed precisely to allow such a combination, by clearly separating its two elements. The design of the framework is described in (Patry et al., 2006).</Paragraph> <Paragraph position="1"> MOOD is implemented with the C++ programming language and is licensed under the Gnu General Public License (GPL)2. This license grants the right to anybody to use, modify and distribute the program and its source code, provided that any modified version be licensed under the GPL as well.</Paragraph> <Paragraph position="2"> As explained in (Walker, 2005), this kind of license stimulates new ideas and research.</Paragraph> </Section> <Section position="4" start_page="126" end_page="126" type="metho"> <SectionTitle> 3 MOOD at work: RAMSES </SectionTitle> <Paragraph position="0"> As we said above, in order to test our design, we reproduced the most popular phrase-based decoder, PHARAOH (Koehn, 2004), by following as faithfully as possible its detailed user manual. The command-line syntax RAMSES recognizes is compatible with that of PHARAOH. The output produced by both decoders are compatible as well and RAMSES can also output its n-best lists in the same format as PHARAOH does, i.e. in a format that the CARMEL toolkit can parse (Knight and Al-Onaizan, 1999).</Paragraph> <Paragraph position="1"> Switching decoders is therefore straightforward.</Paragraph> </Section> <Section position="5" start_page="126" end_page="127" type="metho"> <SectionTitle> 4 RAMSES versus PHARAOH </SectionTitle> <Paragraph position="0"> To compare the translation performances of both decoders in a meaningful manner, RAMSES and PHARAOH were given the exact same language model and translation table for each translation experiment. Both models were produced with the scripts provided by the organizers. This means in practice that the language model was trained using the SRILM toolkit (Stolcke, 2002). The word alignment required to build the phrase table was produced with the GIZA++ package. A Viterbi alignment computed from an IBM model 4 (Brown et al., 1993) was computed for each translation direction.</Paragraph> <Paragraph position="1"> Both alignments were then combined in a heuristic way (Koehn et al., ). Each pair of phrases in the model is given 5 scores, described in the PHARAOH training manual.3 To tune the coefficients of the log-linear combination that both PHARAOH and RAMSES use when decoding, we used the organizers' minimum-error-rate-training.perl script. This tuning step was performed on the first 500 sentences of the dedicated development corpora. Inevitably, RAMSES differs slightly from PHARAOH, because of some undocumented embedded heuristics. Thus, we found appropriate to tune each decoder separately (although with the same material). 
Eventually, we obtained two optimal configurations (one for each decoder) with which we translated the TEST material.

We evaluated the translations produced by both decoders with the organizers' multi-bleu.perl script, which computes a BLEU score (and displays the n-gram precisions and brevity penalty used). We report the scores gathered on the test corpus of 2000 pairs of sentences in Table 1.

[Table 1: Results on the provided test set of 2000 pairs of sentences per language pair. P stands for PHARAOH, R for RAMSES. All scores are percentages; pn is the n-gram precision and BP the brevity penalty used when computing BLEU.]

Overall, both decoders offer similar performance, down to the n-gram precisions. To assess the statistical significance of the observed differences in BLEU, we used the bootstrapping technique described in (Zhang and Vogel, 2004), randomly selecting 500 sentences from each test set, 1000 times (a sketch of this test is given at the end of this section). Using a 95% confidence interval, we determined that the small differences between the two decoders are not statistically significant, except for two tests: in the English-to-French direction RAMSES outperforms PHARAOH, while in the German-to-English direction PHARAOH is better. Whenever one decoder is better than the other, Table 1 shows that this is attributable to higher n-gram precisions, not to the brevity penalty.

We further investigated these two cases by calculating BLEU for subsets of the test corpus sharing similar sentence lengths (Table 2). Both decoders perform similarly on short sentences, but can differ by as much as 1% in BLEU on longer ones. In contrast, on the Spanish-to-English translation direction, where the two decoders offer similar performance, the difference between BLEU scores never exceeds 0.23%.

As expected, Spanish and French are much easier to translate than German. This is because, in this study, we did not apply any pre-processing strategy known to improve performance, such as clause reordering or compound-word splitting (Collins et al., 2005; Langlais et al., 2005).

Table 2 also shows that it does not seem much more difficult to translate into English than from English. This is surprising: translating into a morphologically richer language should be more challenging. The opposite is true for German here: without doing anything specific for this language, it is much easier to translate from German into English than the other way around. This may be attributed in part to the language model: on the test corpus, the perplexity of the language models provided is 105.5 for German, compared to 59.7 for English.
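The significance test mentioned above can be sketched as follows. This is a minimal illustration under our own assumptions, not code from MOOD or from (Zhang and Vogel, 2004): per-sentence n-gram statistics (SentStats and all other names are hypothetical) are assumed to be precomputed for each system on the same test set, and resampling is done with replacement, as in a standard paired bootstrap.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Per-sentence sufficient statistics for corpus-level BLEU
    // (clipped n-gram matches and candidate n-gram counts, n = 1..4).
    struct SentStats {
        double match[4];
        double total[4];
        double hypLen;  // hypothesis length
        double refLen;  // reference length
    };

    // Corpus BLEU over the sentences selected by idx. A real implementation
    // would smooth the zero-match case, which makes log() diverge here.
    double corpusBleu(const std::vector<SentStats>& s,
                      const std::vector<std::size_t>& idx) {
        double m[4] = {0}, t[4] = {0}, hyp = 0, ref = 0;
        for (std::size_t i : idx) {
            for (int n = 0; n < 4; ++n) { m[n] += s[i].match[n]; t[n] += s[i].total[n]; }
            hyp += s[i].hypLen;
            ref += s[i].refLen;
        }
        double logPrec = 0;
        for (int n = 0; n < 4; ++n) logPrec += std::log(m[n] / t[n]) / 4.0;
        double bp = (hyp < ref) ? std::exp(1.0 - ref / hyp) : 1.0;  // brevity penalty
        return bp * std::exp(logPrec);
    }

    // Paired bootstrap: draw 500 sentence indices (with replacement) 1000 times,
    // score both systems on the same replicate, and check whether the 95%
    // interval of the BLEU differences excludes zero.
    bool significantAt95(const std::vector<SentStats>& sysA,
                         const std::vector<SentStats>& sysB) {
        std::mt19937 rng(42);  // fixed seed for reproducibility
        std::uniform_int_distribution<std::size_t> pick(0, sysA.size() - 1);
        std::vector<double> diffs;
        for (int rep = 0; rep < 1000; ++rep) {
            std::vector<std::size_t> idx(500);
            for (std::size_t& i : idx) i = pick(rng);  // same sentences for both systems
            diffs.push_back(corpusBleu(sysA, idx) - corpusBleu(sysB, idx));
        }
        std::sort(diffs.begin(), diffs.end());
        double lo = diffs[static_cast<std::size_t>(0.025 * diffs.size())];
        double hi = diffs[static_cast<std::size_t>(0.975 * diffs.size())];
        return lo > 0.0 || hi < 0.0;  // interval excludes zero => significant
    }

Because BLEU is a corpus-level metric, each replicate aggregates the n-gram counts before computing the score, rather than averaging per-sentence BLEU values.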
5 Human Evaluation

In an effort to correlate the objective metrics with human judgments, we undertook a blind evaluation of a sample of 100 pairwise translations for each of the three foreign-language-to-English translation tasks. The pairs were randomly selected from the 3064 translations produced by each engine; they had to differ between the two decoders and be no more than 25 words long. (A sketch of this selection protocol is given at the end of the section.)

Each evaluator was presented with a source sentence, its reference translation and the translation produced by each decoder. The last two were shown in random order, so the evaluator did not know which engine produced which translation. The evaluator's task was two-fold: (1) he decided whether one translation was better than the other; (2) if he replied 'yes' in test (1), he stated whether the best translation was satisfactory while the other was not. Two evaluators went through the 3 x 100 sentence pairs. Neither of them understands German; subject B understands Spanish, and both understand French and English.

The results of this informal, yet informative, exercise are reported in Table 3. Overall, in many cases (64% and 48% for subjects A and B respectively), the evaluators did not prefer one translation over the other. On the Spanish- and French-to-English tasks, both subjects slightly preferred the translations produced by RAMSES. In only about one fourth of the cases where one translation was preferred did the evaluators actually flag the selected translation as significantly better.
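For illustration, the selection protocol above could be implemented as follows. This is a minimal sketch under our own assumptions: the structure and function names are hypothetical, and the 25-word cap is applied here to both system outputs, since the text does not specify which side was measured.

    #include <algorithm>
    #include <random>
    #include <sstream>
    #include <string>
    #include <vector>

    // One test sentence with its reference and the two system outputs.
    // All field names are illustrative, not from any released tooling.
    struct TranslationPair {
        std::string source, reference, ramses, pharaoh;
    };

    // Count whitespace-separated tokens.
    static std::size_t wordCount(const std::string& s) {
        std::istringstream in(s);
        std::string tok;
        std::size_t n = 0;
        while (in >> tok) ++n;
        return n;
    }

    // Draw up to sampleSize evaluation pairs: the two system outputs must
    // differ and respect the length cap. The order in which the two outputs
    // are shown to the evaluator is randomized separately, per item.
    std::vector<TranslationPair> samplePairs(const std::vector<TranslationPair>& all,
                                             std::size_t sampleSize,
                                             std::size_t maxLen) {
        std::vector<TranslationPair> eligible;
        for (const TranslationPair& p : all)
            if (p.ramses != p.pharaoh &&
                wordCount(p.ramses) <= maxLen && wordCount(p.pharaoh) <= maxLen)
                eligible.push_back(p);
        std::mt19937 rng(std::random_device{}());
        std::shuffle(eligible.begin(), eligible.end(), rng);
        if (eligible.size() > sampleSize) eligible.resize(sampleSize);
        return eligible;
    }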