<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1027"> <Title>Refined Lexicon Models for Statistical Machine Translation using a Maximum Entropy Approach</Title> <Section position="8" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Experimental results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Training and test corpus </SectionTitle> <Paragraph position="0"> The &quot;Verbmobil Task&quot; is a speech translation task in the domain of appointment scheduling, travel planning, and hotel reservation. The task is difficult because it consists of spontaneous speech and the syntactic structures of the sentences are less restricted and highly variable. For the rescor- null To train the maximum entropy models we used the &quot;Ristad ME Toolkit&quot; described in (Ristad, 1997). We performed 100 iteration of the Improved Iterative Scaling algorithm (Pietra et al., 1997) using the corpus described in Table 6, which is a subset of the corpus shown in Table 5.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Training and test perplexities </SectionTitle> <Paragraph position="0"> In order to compute the training and test perplexities, we split the whole aligned training corpus in two parts as shown in Table 6. The training and test perplexities are shown in Table 7. As expected, the perplexity reduction in the test corpus is lower than in the training corpus, but in both cases better perplexities are obtained using the ME models. The best value is obtained when a threshold of 4 is used.</Paragraph> <Paragraph position="1"> We expected to observe strong overfitting effects when a too small cut-off for features gets used. Yet, for most words the best test corpus perplexity is observed when we use all features including those that occur only once.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Translation results </SectionTitle> <Paragraph position="0"> In order to make use of the ME models in a statistical translation system we implemented a rescoring algorithm. This algorithm take as input the standard lexicon model (not using maximum entropy) and the 348 models obtained with the ME training. For an hypothesis sentence a12 a13a7 and a corresponding alignment a33 a5a7 the algorithm modifies the score a15a17a16a19a18a26a4 a5a7a28a24 a33 a5a7 a20a12a25a13a7 a22 according to the refined maximum entropy lexicon model.</Paragraph> <Paragraph position="1"> We carried out some preliminary experiments with the a2 -best lists of hypotheses provided by the translation system in order to make a rescoring of each i-th hypothesis and reorder the list according to the new score computed with the refined lexicon model. Unfortunately, our a2 -best extraction algorithm is sub-optimal, i.e. not the true best a2 translations are extracted. In addition, so far we had to use a limit of only a87 a39 translations per sentence. Therefore, the results of the translation experiments are only preliminary.</Paragraph> <Paragraph position="2"> For the evaluation of the translation quality we use the automatically computable Word Error Rate (WER). The WER corresponds to the edit distance between the produced translation and one predefined reference translation. A shortcoming of the WER is the fact that it requires a perfect word order. 
<Paragraph position="5"> We use the top-10 lists of hypotheses provided by the translation system described in (Tillmann and Ney, 2000), rescore the hypotheses using the ME models, and sort them according to the new maximum entropy score. The translation results in terms of error rates are shown in Table 8.</Paragraph>
<Paragraph position="6"> We use Model 4 to perform the translation experiments because Model 4 typically gives better translation results than Model 5.</Paragraph>
<Paragraph position="7"> We see that the translation quality improves slightly with respect to both WER and PER. The translation quality improvements are so far quite small compared to the improvements in perplexity. We attribute this to the fact that the algorithm for computing the n-best lists is suboptimal.</Paragraph>
[Table 8 caption: Verbmobil Test-147 for different contextual information and different thresholds using the top-10 translations. The baseline translation results for Model 4 are WER=54.80 and PER=43.07.]
<Paragraph position="8"> Table 9 shows some examples where the translation obtained with the rescoring procedure is better than the best hypothesis provided by the translation system.</Paragraph>
</Section>
</Section>
</Paper>