<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0831"> <Title>Novel Reordering Approaches in Phrase-Based Statistical Machine Translation</Title> <Section position="8" start_page="170" end_page="173" type="evalu"> <SectionTitle> 5 Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="170" end_page="170" type="sub_section"> <SectionTitle> 5.1 Corpus Statistics </SectionTitle> <Paragraph position="0"> The translation experiments were carried out on the Basic Travel Expression Corpus (BTEC), a multilingual speech corpus which contains tourism-related sentences of the kind usually found in travel phrase books.</Paragraph> <Paragraph position="1"> We tested our system on the so-called Chinese-to-English (CE) and Japanese-to-English (JE) Supplied Tasks, for which the corpora were provided during the International Workshop on Spoken Language Translation (IWSLT 2004) (Akiba et al., 2004). In addition, we performed experiments on the Italian-to-English (IE) task, for which a larger corpus was kindly provided to us by ITC/IRST. The corpus statistics for the three BTEC corpora are given in Tab. 1. The development corpus for the Italian-to-English translation had only one reference translation per Italian sentence. A set of 506 source sentences with 16 reference translations is used as a development corpus for the Chinese-to-English and Japanese-to-English tasks and as a test corpus for the Italian-to-English task. 
The 500-sentence Chinese and Japanese test sets of the IWSLT 2004 evaluation campaign were translated and automatically scored against 16 reference translations after the end of the campaign using the IWSLT evaluation server.</Paragraph> </Section> <Section position="2" start_page="170" end_page="171" type="sub_section"> <SectionTitle> 5.2 Evaluation Criteria </SectionTitle> <Paragraph position="0"> For the automatic evaluation, we used the criteria from the IWSLT evaluation campaign (Akiba et al., 2004), namely word error rate (WER), position-independent word error rate (PER), and the BLEU and NIST scores (Papineni et al., 2002; Doddington, 2002). The two scores measure accuracy, i.e. larger scores are better. The error rates and scores were computed with respect to multiple reference translations when they were available. To indicate this, we will label the error rate acronyms with an m. Both training and evaluation were performed using corpora and references in lowercase and without punctuation marks.</Paragraph> </Section> <Section position="3" start_page="171" end_page="173" type="sub_section"> <SectionTitle> 5.3 Experiments </SectionTitle> <Paragraph position="0"> We used reordering and alignment monotonization in training as described in Sec. 3. To estimate the matrices of local alignment costs for the sentence pairs in the training corpus, we used the state occupation probabilities of GIZA++ IBM-4 model training and interpolated the probabilities of the source-to-target and target-to-source training directions. After that, we estimated a smoothed 4-gram language model on the level of bilingual tuples (f_j, ẽ_j) and represented it as a finite-state transducer.</Paragraph> <Paragraph position="1"> When translating, we applied moderate beam pruning to the search automaton only when using reordering constraints with window sizes larger than 3. For very large window sizes, we also varied the pruning thresholds depending on the length of the input sentence. 
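The interpolation of the two GIZA++ training directions into local alignment costs, described above, can be sketched as follows. The function name, the matrix orientation, and the uniform interpolation weights are illustrative assumptions; the paper does not specify the interpolation weights.

```python
import math

def local_alignment_costs(p_src2trg, p_trg2src, eps=1e-10):
    # p_src2trg[j][i]: state occupation posterior that source position j
    # aligns to target position i (source-to-target direction);
    # p_trg2src holds the reverse direction, transposed to the same
    # orientation.  Both are assumed to be J x I lists of floats.
    J, I = len(p_src2trg), len(p_src2trg[0])
    # Interpolate the two directions (uniform 0.5/0.5 weights are an
    # assumption) and turn the averaged posterior into a negative-log cost.
    return [[-math.log(0.5 * (p_src2trg[j][i] + p_trg2src[j][i]) + eps)
             for i in range(I)]
            for j in range(J)]
```

A link that is strongly supported in both training directions then receives a low local cost.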
Pruning allowed for fast translations and reasonable memory consumption without a significant negative impact on performance.</Paragraph> <Paragraph position="2"> In our first experiments, we tested the four reordering constraints with various window sizes. We aimed at improving the translation results on the development corpora and compared the results with two baselines: reordering only the source training sentences and translating the unreordered test sentences; and the GIATI technique for creating bilingual tuples (f_j, ẽ_j) without reordering of the source sentences, neither in training nor during translation.</Paragraph> <Paragraph position="3"> Fig. 3 (left) shows the word error rate on the Japanese-to-English task as a function of the window size for different reordering constraints. For each of the constraints, good results are achieved using a window size of 9 or larger. This can be attributed to Japanese word order, which is very different from English and often follows a subject-object-verb structure. For small window sizes, the ITG or IBM constraints are better suited for this task; for larger window sizes, the inverse IBM constraints perform best. The local constraints perform worst and require very large window sizes to capture the main word order differences between Japanese and English. However, their computational complexity is low; for instance, a system with local constraints and a window size of 9 is as fast (25 words per second) as the same system with IBM constraints and a window size of 5. Using window sizes larger than 10 is computationally expensive and does not significantly improve the translation quality under any of the constraints.</Paragraph> <Paragraph position="4"> Tab. 2 presents the overall improvements in translation quality when using the best setting: inverse IBM constraints with a window size of 9. The baseline without reordering in training and testing failed completely for this task, producing empty translations for 37 % of the sentences. 
Most of the original alignments in training were non-monotonic, which resulted in almost all Japanese words being mapped to the empty word ε when only the GIATI monotonization technique was used. Thus, the proposed reordering methods are of crucial importance for this task.</Paragraph> <Paragraph position="5"> Tab. 3 compares the IWSLT evaluation results for the described system (WFST) with those of the best submitted system (AT).</Paragraph> <Paragraph position="6"> Further improvements were obtained with a rescoring procedure. For rescoring, we produced a k-best list of translation hypotheses and used the word penalty and deletion model features, the IBM Model 1 lexicon score, and target language n-gram models of order up to 9. The scaling factors for all features were optimized on the development corpus for the NIST score, as described in (Bender et al., 2004).</Paragraph> <Paragraph position="7"> Word order in Chinese and English is usually similar. However, a few word reorderings over quite large distances may be necessary. This is especially true in the case of questions, in which question words like &quot;where&quot; and &quot;when&quot; are placed at the end of a sentence in Chinese. The BTEC corpora contain many sentences with questions.</Paragraph> <Paragraph position="8"> The inverse IBM constraints are designed to perform this type of reordering (see Sec. 4.3). (Tab. 4 reports results for different reordering constraints and window sizes on the test corpus of the BTEC IE task; entries marked with * are optimized for WER.)</Paragraph> <Paragraph position="9"> As shown in Fig. 3, the system performs well under these constraints already with relatively small window sizes.</Paragraph> <Paragraph position="10"> Increasing the window size beyond 4 for these constraints only marginally improves the translation error measures for both short (under 8 words) and long sentences. 
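The window-based constraints compared in these experiments can be made concrete with a small permutation checker. The coverage-based semantics below follow the usual formulation of the IBM constraints and are an illustrative assumption, not the paper's exact implementation.

```python
def obeys_ibm_constraints(permutation, window):
    # A reordering of source positions 0..J-1 satisfies IBM-style
    # constraints with the given window if, at every step, the position
    # translated next is among the `window` leftmost not-yet-covered
    # positions.
    uncovered = list(range(len(permutation)))
    for pos in permutation:
        if pos not in uncovered[:window]:
            return False
        uncovered.remove(pos)
    return True
```

Under these semantics a window of 1 permits only the monotone order, while the long-range movement needed for Japanese requires a large window.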
Thus, a suitable language-pair-specific choice of reordering constraints can avoid the huge computational complexity required for permutations of long sentences.</Paragraph> <Paragraph position="11"> Tab. 2 includes error measures for the best setup with inverse IBM constraints and a window size of 7, as well as additional improvements obtained by k-best list rescoring.</Paragraph> <Paragraph position="12"> The best settings for reordering constraints and model scaling factors on the development corpora were then used to produce translations of the IWSLT Japanese and Chinese test corpora. These translations were evaluated against multiple references which were unknown to the authors. Our system (denoted WFST, see Tab. 3) produced results competitive with those of the best system at the evaluation campaign (denoted AT (Bender et al., 2004)) and, according to some of the error measures, even outperformed this system.</Paragraph> <Paragraph position="13"> Word order in Italian does not differ much from that in English. Therefore, the absolute translation error rates are quite low, and translating without reordering in training and search already results in relatively good performance. This is reflected in Tab. 4. However, even for this language pair it is possible to improve translation quality by performing reordering both in training and during translation. The best performance on the development corpus is obtained when we constrain the reordering with relatively small window sizes of 3 to 4 and use either the IBM or local reordering constraints.</Paragraph> <Paragraph position="14"> On the test corpus, as shown in Tab. 4, all error measures can be improved with these settings.</Paragraph> <Paragraph position="15"> Especially for languages with similar word order, it is important to use weighted reorderings (Sec. 4.6) in order to prefer the original word order. 
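Such a preference can be expressed by weighting the reordering paths. The sketch below assumes the unreordered path receives probability alpha and the remaining mass is split uniformly over the reordered alternatives, which is a simplification of the weighting scheme of Sec. 4.6.

```python
import math

def reordering_path_cost(is_monotone, num_reordered_paths, alpha=0.5):
    # Negative log-probability of taking a particular reordering path:
    # the original (monotone) order keeps probability alpha, and the
    # reordered alternatives share the remaining 1 - alpha uniformly
    # (the uniform split is an assumption made for this sketch).
    if is_monotone:
        return -math.log(alpha)
    return -math.log((1.0 - alpha) / num_reordered_paths)
```

With alpha = 0.5, any single reordered path is at least as expensive as the monotone one, so the original word order is preferred unless the models favor a reordering.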
Introducing reordering weights for this task notably improves most error measures with either the IBM or the local constraints. The optimal probability α for the unreordered path was determined on the development corpus as 0.5 for both of these constraints. The results on the test corpus with this setting are also given in Tab. 4.</Paragraph> </Section> </Section> </Paper>