<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3108">
  <Title>Discriminative Reordering Models for Statistical Machine Translation</Title>
  <Section position="6" start_page="57" end_page="60" type="evalu">
    <SectionTitle>
5 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="57" end_page="58" type="sub_section">
      <SectionTitle>
5.1 Statistics
</SectionTitle>
      <Paragraph position="0"> The experiments were carried out on the Basic Travel Expression Corpus (BTEC) task (Takezawa et al., 2002). This is a multilingual speech corpus which contains tourism-related sentences similar to those that are found in phrase books. We use the Arabic-English, the Chinese-English and the Japanese-English data. The corpus statistics are shown in Table 1.</Paragraph>
      <Paragraph position="1"> As the BTEC is a rather clean corpus, the preprocessing consisted mainly of tokenization, i.e., separating punctuation marks from words. Additionally, we replaced contractions such as it's or I'm in the English corpus and removed the case information. For Arabic, we removed the diacritics and split the common prefixes Al, w, f, b, l. There was no special preprocessing for the Chinese and Japanese training corpora.</Paragraph>
      <Paragraph position="2"> To train and evaluate the reordering model, we use the word-aligned bilingual training corpus. To evaluate the classification power of the reordering model, we partition the corpus into a training part and a test part. In our experiments, we use about 10% of the corpus for testing and the remaining part for training the feature weights of the reordering model with the GIS algorithm using YASMET (Och, 2001). The statistics of the training and test alignment links are shown in Table 2. The number of training events ranges from 119K for Japanese-English to 144K for Arabic-English.</Paragraph>
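The GIS training of the feature weights can be sketched as follows. This is a minimal stand-in for the YASMET implementation cited above, assuming binary features and the same number of active features per event (so the usual GIS correction feature can be omitted); it is not the actual tool.

```python
import math
from collections import defaultdict

def train_gis(events, num_iters=100):
    """Tiny Generalized Iterative Scaling (GIS) for a binary maximum-entropy
    classifier. events: list of (features, label), where features is a
    frozenset of active binary feature names and label is 0 or 1.

    Sketch only: every event is assumed to activate the same number of
    features, so the GIS correction feature is dropped.
    """
    C = max(len(f) for f, _ in events)  # GIS slack constant
    weights = defaultdict(float)        # (feature, label) -> lambda

    # Empirical counts of (feature, label) pairs in the training data.
    emp = defaultdict(float)
    for feats, label in events:
        for f in feats:
            emp[(f, label)] += 1.0

    def posterior(feats):
        scores = [math.exp(sum(weights[(f, c)] for f in feats)) for c in (0, 1)]
        z = scores[0] + scores[1]
        return [s / z for s in scores]

    for _ in range(num_iters):
        # Expected counts under the current conditional model.
        exp = defaultdict(float)
        for feats, _ in events:
            p = posterior(feats)
            for c in (0, 1):
                for f in feats:
                    exp[(f, c)] += p[c]
        # Multiplicative GIS update, written in log space.
        for key, e in emp.items():
            if exp[key] > 0:
                weights[key] += math.log(e / exp[key]) / C
    return weights, posterior
```

Features whose empirical count is zero keep weight zero; everything else is updated toward the maximum-likelihood solution.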
      <Paragraph position="3"> The word classes for the class-based features are trained using the mkcls tool (Och, 1999). In the experiments, we use 50 word classes. Alternatively, one could use part-of-speech information for this purpose.</Paragraph>
      <Paragraph position="4"> Additional experiments were carried out on the large data track of the Chinese-English NIST task. The corpus statistics of the bilingual training corpus are shown in Table 3. The language model was trained on the English part of the bilingual training corpus and additional monolingual English data from the GigaWord corpus. The total amount of language model training data was about 600M running words. We use a four-gram language model with modified Kneser-Ney smoothing as implemented in the SRILM toolkit (Stolcke, 2002). For the evaluation sets, the statistics are accumulated over the four English reference translations.</Paragraph>
    </Section>
    <Section position="2" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
5.2 Classification Results
</SectionTitle>
      <Paragraph position="0"> In this section, we present the classification results for the three language pairs. Table 4 shows the classification results for two orientation classes.</Paragraph>
      <Paragraph position="1"> As the baseline, we always choose the most frequent orientation class. For Arabic-English, the baseline error rate of 6.3% is already very low, which means that the word order in Arabic is very similar to the word order in English. For Chinese-English, the baseline error rate of 12.7% is about twice as large. The largest differences in word order occur for Japanese-English.</Paragraph>
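The two orientation classes and the majority-class baseline can be sketched as follows. The boundary handling in `orientation` is an illustrative assumption, not taken verbatim from the paper.

```python
from collections import Counter

def orientation(j_prev, j):
    """Two-class orientation of an alignment link: 'left' if the source
    position j aligned to the current target word lies to the left of the
    source position j_prev aligned to the previous target word, else
    'right'. (Sketch; the paper's exact boundary handling may differ.)"""
    return "left" if j < j_prev else "right"

def majority_baseline_error(labels):
    """Classification error rate of always predicting the most frequent
    orientation class, as used for the baseline numbers."""
    top = Counter(labels).most_common(1)[0][1]
    return 1.0 - top / len(labels)
```

For example, a corpus in which 93.7% of the links are 'right' yields a baseline error rate of 6.3%, matching the way the Arabic-English baseline is computed.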
      <Paragraph position="2"> This seems reasonable, as Japanese usually has a subject-object-verb sentence structure, compared to subject-verb-object in English. For each language pair, we present results for several combinations of features. The three columns per language pair indicate whether the features are based on the words (column label 'Words'), on the word classes (column label 'Classes'), or on both (column label 'W+C'). We also distinguish whether the features depend on the target sentence ('Tgt'), on the source sentence ('Src'), or on both ('Src+Tgt').</Paragraph>
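The feature sets compared in the table can be illustrated with a simple extraction routine. The feature-name layout, the `window` parameter, and the `word2class` mapping below are assumptions for illustration, not the paper's exact implementation.

```python
def reordering_features(src, tgt, j, i, window=1, word2class=None):
    """Binary features for the orientation decision of the alignment link
    (j, i): source words around position j ('Src'), target words around
    position i ('Tgt'), and optionally the corresponding word classes.
    The feature naming scheme is hypothetical."""
    feats = set()
    for d in range(-window, window + 1):
        if 0 <= j + d < len(src):
            feats.add("Src:%+d:%s" % (d, src[j + d]))
            if word2class:
                feats.add("SrcClass:%+d:%s" % (d, word2class[src[j + d]]))
        if 0 <= i + d < len(tgt):
            feats.add("Tgt:%+d:%s" % (d, tgt[i + d]))
            if word2class:
                feats.add("TgtClass:%+d:%s" % (d, word2class[tgt[i + d]]))
    return feats
```

A 'Src'-only configuration passes just the source side; 'W+C' additionally supplies the word-class mapping (e.g. the 50 mkcls classes mentioned above).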
      <Paragraph position="3"> For Arabic-English, using features based only on the words of the target sentence, the classification error rate can be reduced to 4.5%. If the features are based only on the source sentence words, a classification error rate of 2.9% is reached. Combining the features based on source and target sentence words, a classification error rate of 2.8% can be achieved.</Paragraph>
      <Paragraph position="4"> Adding the features based on word classes, the classification error rate can be further improved to 2.1%. For the other language pairs, the results are similar except that the absolute values of the classification error rates are higher.</Paragraph>
      <Paragraph position="5"> We observe the following: * The features based on the source sentence perform better than features based on the target sentence.</Paragraph>
      <Paragraph position="6"> * Combining source and target sentence features performs best.</Paragraph>
      <Paragraph position="7"> * Increasing the window always helps, i.e. additional context information is useful.</Paragraph>
      <Paragraph position="8">  * Often the word-class based features outperform the word-based features.</Paragraph>
      <Paragraph position="9"> * Combining word-based and word-class based features performs best.</Paragraph>
      <Paragraph position="10"> * In general, adding features does not hurt the  performance.</Paragraph>
      <Paragraph position="11"> These are desirable properties of an appropriate reordering model. The main point is that they hold not only on the training data but also on unseen test data; there seems to be no overfitting problem. In Table 5, we present the results for four orientation classes. The error rates are a factor of 2-4 larger than for two orientation classes. Nevertheless, we observe the same tendencies as for two orientation classes: again, using more features always improves performance.</Paragraph>
    </Section>
    <Section position="3" start_page="59" end_page="60" type="sub_section">
      <SectionTitle>
5.3 Translation Results
</SectionTitle>
      <Paragraph position="0"> For the translation experiments on the BTEC task, we report the two accuracy measures BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) as well as the two error rates: word error rate (WER) and position-independent word error rate (PER).</Paragraph>
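The two error rates can be sketched for the single-reference case as follows. The PER formulation below is one common bag-of-words variant; it is not necessarily the exact implementation used in these experiments.

```python
from collections import Counter

def wer(hyp, ref):
    """Word error rate: word-level Levenshtein distance between hypothesis
    and reference, divided by the reference length (single-reference case)."""
    d = list(range(len(ref) + 1))  # one DP row over reference positions
    for i, hw in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, rw in enumerate(ref, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (hw != rw))   # substitution or match
            prev, d[j] = d[j], cur
    return d[len(ref)] / len(ref)

def per(hyp, ref):
    """Position-independent word error rate: like WER but ignoring word
    order, i.e. comparing the bags of words of hypothesis and reference."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)
```

By construction PER never exceeds WER: a reordering of the reference has PER 0 but a positive WER.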
      <Paragraph position="1"> These criteria are computed with respect to 16 references. In Table 6, we show the translation results for the BTEC task. In these experiments, the reordering model uses two orientation classes, i.e. it predicts either a left or a right orientation. The features for the maximum-entropy based reordering model are based on the source and target language words within a window of one. The word-class based features are not used for the translation experiments. The maximum-entropy based reordering model achieves small but consistent improvements for all the evaluation criteria. Note that the baseline system, i.e. using the distance-based reordering, was among the best systems in the IWSLT 2005 evaluation campaign (Eck and Hori, 2005).</Paragraph>
      <Paragraph position="2"> Some translation examples are presented in Table 7. We observe that the system using the maximum-entropy based reordering model produces more fluent translations.</Paragraph>
      <Paragraph position="3"> Additional translation experiments were carried out on the large data track of the Chinese-English NIST task. For this task, we report only the BLEU and NIST scores. Both scores are computed case-insensitively with respect to four reference translations using the mteval-v11b tool.</Paragraph>
      <Paragraph position="4"> For the NIST task, we use the BLEU score as the primary criterion; it is optimized on the NIST 2002 evaluation set using the Downhill Simplex algorithm (Press et al., 2002). Note that only the eight or nine model scaling factors of Equation 2 are optimized with the Downhill Simplex algorithm; the feature weights of the reordering model are trained using the GIS algorithm as described in Section 4.4. We use a state-of-the-art baseline system which would have obtained a good rank in the last NIST evaluation (NIST, 2005).</Paragraph>
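The idea of tuning the scaling factors against a development-set score can be sketched with a simple coordinate search; the actual system uses the Downhill Simplex algorithm, and the objective, names, and parameters below are illustrative stand-ins (e.g. `objective` would be dev-set BLEU as a function of the scaling factors).

```python
def coordinate_search(objective, x0, step=0.1, iters=50):
    """Maximize objective(x) over the model scaling factors x by simple
    coordinate-wise hill climbing with step halving. A minimal stand-in
    for Downhill Simplex, not the algorithm itself."""
    x = list(x0)
    best = objective(x)
    for _ in range(iters):
        improved = False
        for k in range(len(x)):
            for delta in (step, -step):
                cand = list(x)
                cand[k] += delta
                v = objective(cand)
                if v > best:
                    x, best, improved = cand, v, True
        if not improved:
            step *= 0.5  # refine once no coordinate move helps
    return x, best
```

Like Downhill Simplex, this needs only function evaluations (decoder runs plus scoring), no gradients, which is why such derivative-free methods fit BLEU-based tuning.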
      <Paragraph position="5"> The translation results for the NIST task are presented in Table 8. We observe consistent improvements of the BLEU score on all evaluation sets. The overall improvement due to reordering ranges from 1.2% to 2.0% absolute. The contribution of the maximum-entropy based reordering model to this improvement is in the range of 25% to 58%, e.g. for the NIST 2003 evaluation set about 58% of the improvement using reordering can be attributed to the maximum-entropy based reordering model.</Paragraph>
      <Paragraph position="6"> We also measured the classification performance for the NIST task. The general tendencies are identical to the BTEC task.</Paragraph>
    </Section>
  </Section>
</Paper>