<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1021"> <Title>Minimum Error Rate Training in Statistical Machine Translation</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Automatic Assessment of Translation Quality </SectionTitle>
<Paragraph position="0"> In recent years, various methods have been proposed to automatically evaluate machine translation quality by comparing hypothesis translations with reference translations. Examples of such methods are word error rate, position-independent word error rate (Tillmann et al., 1997), generation string accuracy (Bangalore et al., 2000), multi-reference word error rate (Niessen et al., 2000), BLEU score (Papineni et al., 2001), and NIST score (Doddington, 2002).</Paragraph>
<Paragraph position="1"> All these criteria try to approximate human assessment and often achieve an astonishing degree of correlation to human subjective evaluation of fluency and adequacy (Papineni et al., 2001; Doddington, 2002).</Paragraph>
<Paragraph position="2"> In this paper, we use the following methods:</Paragraph>
<Paragraph position="3"> - Multi-reference word error rate (mWER): the hypothesis translation is compared to several reference translations by computing the edit distance (minimum number of substitutions, insertions, and deletions) between the hypothesis and the closest of the given reference translations.</Paragraph>
<Paragraph position="4"> - Multi-reference position-independent error rate (mPER): this criterion ignores word order by treating a sentence as a bag of words and computing the minimum number of substitutions, insertions, and deletions needed to transform the hypothesis into the closest of the given reference translations.</Paragraph>
<Paragraph position="5"> - BLEU score: this criterion computes the geometric mean of the precisions of $n$-grams of various lengths between a hypothesis and a set of reference translations, multiplied by a factor $\mathrm{BP}(\cdot)$ that penalizes short sentences:
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \frac{1}{N} \sum_{n=1}^{N} \log p_n \Big)$$
Here, $p_n$ denotes the precision of $n$-grams in the hypothesis translation. We use $N = 4$.</Paragraph>
<Paragraph position="6"> - NIST score: this criterion computes a weighted precision of $n$-grams between a hypothesis and a set of reference translations, multiplied by a factor $\mathrm{BP}'(\cdot)$ that penalizes short sentences:
$$\mathrm{NIST} = \mathrm{BP}' \cdot \sum_{n=1}^{N} W_n$$
Here, $W_n$ denotes the weighted precision of $n$-grams in the translation. We use $N = 5$.</Paragraph>
<Paragraph position="7"> Both NIST and BLEU are accuracy measures; larger values therefore reflect better translation quality. Note that NIST and BLEU scores are not additive across sentences, i.e. the score of a document cannot be obtained by simply summing the scores of its individual sentences.</Paragraph>
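As an illustration of the two edit-distance-based criteria above, the following sketch shows how mWER and mPER can be computed for a hypothesis against a set of reference translations. This is our own minimal example, not code from the paper; the function names and the toy sentences are invented.

```python
from collections import Counter

def edit_distance(hyp, ref):
    """Minimum number of substitutions, insertions, and deletions (Levenshtein)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # drop a hypothesis word
                                   d[j - 1] + 1,      # drop a reference word
                                   prev + (h != r))   # substitution or match
    return d[-1]

def mwer(hyp, refs):
    """Multi-reference WER: edit distance to the closest reference."""
    return min(edit_distance(hyp, ref) for ref in refs)

def mper(hyp, refs):
    """Multi-reference PER: like mWER, but word order is ignored (bag of words)."""
    h = Counter(hyp)
    dists = []
    for ref in refs:
        r = Counter(ref)
        # unmatched words on either side: the overlap is covered by substitutions,
        # the surplus must be inserted or deleted
        dists.append(max(sum((h - r).values()), sum((r - h).values())))
    return min(dists)

# Toy usage: one hypothesis, two references, given as token lists.
hyp = "the house is small".split()
refs = ["the house is very small".split(), "the home is small".split()]
print(mwer(hyp, refs), mper(hyp, refs))   # -> 1 1
```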
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4 Training Criteria for Minimum Error Rate Training </SectionTitle>
<Paragraph position="0"> In the following, we assume that we can measure the number of errors in a sentence $e$ by comparing it with a reference sentence $r$ using a function $E(r, e)$. However, the following exposition can easily be adapted to accuracy metrics and to metrics that make use of multiple references.</Paragraph>
<Paragraph position="1"> We assume that the number of errors for a set of sentences $e_1^S$ is obtained by summing the errors of the individual sentences: $E(r_1^S, e_1^S) = \sum_{s=1}^{S} E(r_s, e_s)$.</Paragraph>
<Paragraph position="2"> Our goal is to obtain a minimal error count on a representative corpus $f_1^S$ with given reference translations $\hat{e}_1^S$ and a set of $K$ different candidate translations $C_s = \{ e_{s,1}, \ldots, e_{s,K} \}$ for each input sentence $f_s$:
$$\hat{\lambda}_1^M = \operatorname*{argmin}_{\lambda_1^M} \Big\{ \sum_{s=1}^{S} E\big(r_s, \hat{e}(f_s; \lambda_1^M)\big) \Big\} \quad (5)$$
$$\phantom{\hat{\lambda}_1^M} = \operatorname*{argmin}_{\lambda_1^M} \Big\{ \sum_{s=1}^{S} \sum_{k=1}^{K} E(r_s, e_{s,k}) \, \delta\big(\hat{e}(f_s; \lambda_1^M), e_{s,k}\big) \Big\} \quad (6)$$
with
$$\hat{e}(f_s; \lambda_1^M) = \operatorname*{argmax}_{e \in C_s} \Big\{ \sum_{m=1}^{M} \lambda_m h_m(e \mid f_s) \Big\}$$</Paragraph>
<Paragraph position="3"> The optimization criterion stated above is not easy to handle:
- It includes an argmax operation (Eq. 6). Therefore, it is not possible to compute a gradient, and we cannot use gradient descent methods to perform the optimization.
- The objective function has many different local optima. The optimization algorithm must handle this.
In addition, even if we manage to solve the optimization problem, we might face the problem of overfitting the training data. In Section 5, we describe an efficient optimization algorithm.</Paragraph>
<Paragraph position="4"> To be able to compute a gradient and to make the objective function smoother, we can use the following error criterion, which is essentially a smoothed error count with a parameter $\alpha$ that adjusts the smoothness:
$$\hat{\lambda}_1^M = \operatorname*{argmin}_{\lambda_1^M} \Big\{ \sum_{s,k} E(r_s, e_{s,k}) \, \frac{p_{\lambda_1^M}(e_{s,k} \mid f_s)^{\alpha}}{\sum_{k'} p_{\lambda_1^M}(e_{s,k'} \mid f_s)^{\alpha}} \Big\} \quad (7)$$
In the extreme case $\alpha \to \infty$, Eq. 7 converges to the unsmoothed criterion of Eq. 5 (except in the case of ties). Note that the resulting objective function might still have local optima, which makes the optimization hard compared to using the objective function of Eq. 4, which does not have different local optima. The use of this type of smoothed error count is a common approach in the speech community (Juang et al., 1995; Schlüter and Ney, 2001).</Paragraph>
<Paragraph position="5"> Figure 1 shows the actual shape of the smoothed and the unsmoothed error count for two parameters of our translation system. We see that the unsmoothed error count has many different local optima and is very unstable, whereas the smoothed error count is much more stable and has fewer local optima. However, as we show in Section 7, the performance on our task obtained with the smoothed error count does not differ significantly from that obtained with the unsmoothed error count.</Paragraph>
<Paragraph position="6"> [Figure 1: Unsmoothed and smoothed error count, computed on the development corpus (see Section 7, Table 1) using 200 alternatives per source sentence; the smoothed error count was computed with smoothing parameter $\alpha = 3$.]</Paragraph>
</Section> </Section>
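To make the difference between the unsmoothed criterion (Eqs. 5/6) and the smoothed criterion (Eq. 7) concrete, the sketch below evaluates both corpus error counts for a given parameter vector. It is our own illustration, not code from the paper: it assumes per-sentence n-best lists in which every candidate carries a precomputed feature vector $h(e \mid f)$ and an error count $E(r, e)$. As `alpha` grows, the smoothed count approaches the unsmoothed one.

```python
import math

def unsmoothed_error(lam, nbest):
    """Eqs. 5/6: sum the error counts of the highest-scoring candidates.
    nbest[s] is a list of (feature_vector, error_count) pairs for sentence s."""
    total = 0.0
    for candidates in nbest:
        scores = [sum(l * h for l, h in zip(lam, feats)) for feats, _ in candidates]
        total += candidates[scores.index(max(scores))][1]
    return total

def smoothed_error(lam, nbest, alpha=3.0):
    """Eq. 7: error counts weighted by scaled posteriors p(e|f)^alpha, normalized
    over the n-best list; alpha -> infinity recovers the unsmoothed criterion."""
    total = 0.0
    for candidates in nbest:
        scores = [sum(l * h for l, h in zip(lam, feats)) for feats, _ in candidates]
        m = max(scores)                                   # for numerical stability
        weights = [math.exp(alpha * (s - m)) for s in scores]
        z = sum(weights)
        total += sum(w / z * err for w, (_, err) in zip(weights, candidates))
    return total
```

Because the log-linear posterior is an exponential of the model score, raising it to the power `alpha` amounts to multiplying the scores by `alpha` before normalization, which is exactly what the sketch does.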
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Optimization Algorithm for Unsmoothed Error Count </SectionTitle>
<Paragraph position="0"> A standard algorithm for the optimization of the unsmoothed error count (Eq. 5) is Powell's algorithm combined with a grid-based line optimization method (Press et al., 2002). We start at a random point in the $M$-dimensional parameter space and try to find a better scoring point by performing a one-dimensional line minimization along the directions given by optimizing one parameter while keeping all other parameters fixed. To avoid ending up in a poor local optimum, we start from different initial parameter values. A major problem with this standard approach is that grid-based line optimization is hard to adjust such that both good performance and efficient search are guaranteed: if a fine-grained grid is used, the algorithm is slow; if a coarse grid is used, the optimal solution might be missed.</Paragraph>
<Paragraph position="1"> In the following, we describe a new algorithm for efficient line optimization of the unsmoothed error count (Eq. 5) using a log-linear model (Eq. 3) which is guaranteed to find the optimal solution. The new algorithm is much faster and more stable than the grid-based line optimization method.</Paragraph>
<Paragraph position="2"> Computing the most probable sentence out of a set of candidate translations $C = \{ e_1, \ldots, e_K \}$ along a line $\lambda_1^M + \gamma \cdot d_1^M$ results in an optimization problem of the following functional form:
$$\hat{e}(f; \gamma) = \operatorname*{argmax}_{e \in C} \big\{ (\lambda_1^M + \gamma \cdot d_1^M)^{\top} h_1^M(e, f) \big\} = \operatorname*{argmax}_{e \in C} \big\{ t(e, f) + \gamma \cdot m(e, f) \big\}$$
Here, $t(\cdot)$ and $m(\cdot)$ are constants with respect to $\gamma$. Hence, every candidate translation in $C$ corresponds to a line. The function
$$f(\gamma; f) = \max_{e \in C} \big( t(e, f) + \gamma \cdot m(e, f) \big)$$
is piecewise linear (Papineni, 1999). This allows us to compute an efficient exhaustive representation of that function.</Paragraph>
<Paragraph position="3"> In the following, we sketch the new algorithm to optimize Eq. 5: we compute the ordered sequence of linear intervals constituting $f(\gamma; f)$ for every sentence $f$, together with the incremental change in error count from one interval to the next. Hence, for every sentence $f$ we obtain a sequence of interval boundaries $\gamma_1^f < \gamma_2^f < \cdots < \gamma_{N_f}^f$ and a corresponding sequence of error count changes $\Delta E_1^f, \Delta E_2^f, \ldots, \Delta E_{N_f}^f$, where $\Delta E_n^f$ denotes the change in the error count when $\gamma$ crosses the boundary $\gamma_n^f$. By merging these sequences for all sentences of our corpus, we obtain the complete set of interval boundaries and error count changes on the whole corpus. The optimal $\gamma$ can then be computed easily by traversing the sequence of interval boundaries while updating an error count; a sketch is given below.</Paragraph>
<Paragraph position="4"> It is straightforward to refine this algorithm to also handle the BLEU and NIST scores instead of sentence-level error counts by accumulating the relevant statistics for computing these scores (n-gram precisions, translation length, and reference length).</Paragraph>
</Section>
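The following sketch is our own compact rendering of the line optimization just described, not the authors' code. It assumes each candidate has already been reduced to a triple (slope, offset, error count) along the chosen search direction, i.e. slope $= \sum_m d_m h_m(e \mid f)$ and offset $= \sum_m \lambda_m h_m(e \mid f)$. Per sentence, it traces the upper envelope of the candidate lines, records where the best candidate (and hence the error count) changes, merges these boundaries over the corpus, and returns a $\gamma$ from the lowest-error interval.

```python
def envelope(cands):
    """Upper envelope of the lines score(gamma) = offset + gamma * slope.
    cands: list of (slope, offset, error_count) triples.
    Returns segments [(gamma_from, error_count)] ordered by increasing gamma."""
    cands = sorted(cands, key=lambda c: (c[0], c[1]))
    hull = []                                   # (slope, offset, err, gamma_from)
    for m, t, err in cands:
        g = float("-inf")
        while hull:
            m0, t0, _, g0 = hull[-1]
            if m == m0:                         # same slope, lower or equal offset: replace
                hull.pop()
                continue
            g = (t0 - t) / (m - m0)             # crossing point with the last hull line
            if g <= g0:                         # last hull line is never strictly best
                hull.pop()
                continue
            break
        if not hull:
            g = float("-inf")
        hull.append((m, t, err, g))
    return [(g, err) for _, _, err, g in hull]

def line_search(per_sentence_cands):
    """Exact minimization of the corpus error count along one search direction."""
    events, base_err = [], 0
    for cands in per_sentence_cands:
        segs = envelope(cands)
        base_err += segs[0][1]                  # error count for gamma -> -infinity
        for (g_prev, e_prev), (g, e) in zip(segs, segs[1:]):
            events.append((g, e - e_prev))      # error count change at boundary g
    events.sort()
    best_err = err = base_err
    best_gamma = events[0][0] - 1.0 if events else 0.0
    for i, (g, delta) in enumerate(events):
        err += delta
        g_next = events[i + 1][0] if i + 1 < len(events) else g + 1.0
        if err < best_err and g_next > g:       # take the midpoint of the best interval
            best_err, best_gamma = err, (g + g_next) / 2.0
    return best_gamma, best_err
```

Embedded in a Powell-style outer loop over search directions, this replaces the grid-based line minimization and finds the exact optimum along each direction.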
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Baseline Translation Approach </SectionTitle>
<Paragraph position="0"> The basic feature functions of our model are identical to the alignment template approach (Och and Ney, 2002). In this translation model, a sentence is translated by segmenting the input sentence into phrases, translating these phrases, and reordering the translations in the target language. In addition to the feature functions described in (Och and Ney, 2002), our system includes a phrase penalty (the number of alignment templates used) and special alignment features. Altogether, the log-linear model includes $M = 8$ different features.</Paragraph>
<Paragraph position="1"> Note that many of the feature functions used are derived from probabilistic models: the feature function is defined as the negative logarithm of the corresponding probabilistic model. Therefore, the feature functions are much more 'informative' than, for instance, the binary feature functions used in standard maximum entropy models in natural language processing.</Paragraph>
<Paragraph position="2"> For search, we use a dynamic programming beam-search algorithm to explore a subset of all possible translations (Och et al., 1999) and extract $n$-best candidate translations using A* search (Ueffing et al., 2002).</Paragraph>
<Paragraph position="3"> Using an $n$-best approximation, we might face the problem that the trained parameters are good for the list of $n$ translations used, but yield worse translation results when they are used in the dynamic programming search. Hence, it is possible that the new search produces translations with more errors on the training corpus. This can happen because, with the modified model scaling factors, the $n$-best list can change significantly and can include sentences that were not in the existing $n$-best list. To avoid this problem, we adopt the following solution: first, we perform search (using a manually defined set of parameter values), compute an $n$-best list, and use this $n$-best list to train the model parameters. Second, we use the new model parameters in a new search and compute a new $n$-best list, which is combined with the existing $n$-best list. Third, using this extended $n$-best list, new model parameters are computed. This is iterated until the resulting $n$-best list does not change (a schematic sketch of this loop is given after this section). Convergence is guaranteed because, in the limit, the $n$-best list will contain all possible translations. In our experiments, we compute about 200 alternative translations in every iteration. In practice, the algorithm converges after about five to seven iterations. As a result, the error rate cannot increase on the training corpus.</Paragraph>
<Paragraph position="4"> A major problem in applying the MMI criterion is that the reference translations need to be part of the provided $n$-best list. Quite often, none of the given reference translations is part of the $n$-best list, because the search algorithm performs pruning, which in principle limits the possible translations that can be produced for a given input sentence. To solve this problem, we define new pseudo-references for MMI training by selecting from the $n$-best list all sentences that have a minimal number of word errors with respect to any of the true references. Note that, due to this selection approach, the results of the MMI criterion might be biased toward the mWER criterion. It is a major advantage of minimum error rate training that it does not require choosing pseudo-references.</Paragraph>
</Section> </Paper>
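As referenced above, the iterative n-best training procedure can be summarized in a few lines. This is our own schematic sketch under stated assumptions, not the authors' implementation: `decode_nbest` stands in for the beam search with A*-based n-best extraction, `optimize_parameters` for the minimum error rate training step (e.g. the line search of Section 5), and candidates are assumed to be hashable objects such as translation strings; all of these names are hypothetical.

```python
def iterative_training(corpus, references, lam, decode_nbest, optimize_parameters,
                       n=200, max_iterations=10):
    """Alternate between decoding n-best lists and re-estimating the model
    scaling factors until the accumulated n-best lists no longer change."""
    pool = [set() for _ in corpus]                     # accumulated candidates per sentence
    for _ in range(max_iterations):
        changed = False
        for s, source in enumerate(corpus):
            for hyp in decode_nbest(source, lam, n):   # new candidates under current model
                if hyp not in pool[s]:
                    pool[s].add(hyp)
                    changed = True
        if not changed:                                # n-best lists are stable: converged
            break
        lam = optimize_parameters(pool, references, lam)
    return lam
```

Because every iteration only adds candidates to the pool, the optimizer can always fall back on the previous parameters, which mirrors the guarantee above that the error rate on the training corpus cannot increase.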