<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0836">
  <Title>Training and Evaluating Error Minimization Rules for Statistical Machine Translation</Title>
  <Section position="3" start_page="208" end_page="209" type="metho">
    <SectionTitle>
2 Addressing Evaluation Metrics
</SectionTitle>
    <Paragraph position="0"> We now describe competing strategies to address the problem of modeling the evaluation metric within the decoding and rescoring process, and introduce our contribution towards training non-tractable error surfaces. The methods discussed below make use of Gen(f), the approximation to the complete candidate translation space E, referred to as an n-best list. Details regarding n-best list generation from decoder output can be found in (Ueffing et al., 2002).</Paragraph>
    <Section position="1" start_page="208" end_page="209" type="sub_section">
      <SectionTitle>
2.1 Minimum Error Rate Training
</SectionTitle>
      <Paragraph position="0"> The predominant approach to reconciling the mis-match between the MAP decision rule and the evaluation metric has been to train the parameters l of the exponential model to correlate the MAP choice with the maximum score as indicated by the evaluation metric on a development set with known references (Och, 2003). We differentiate between the decision rule</Paragraph>
      <Paragraph position="2"> where the Loss function returns an evaluation result quantifying the difference between the English candidate translation transll(f) and its corresponding reference r for a source sentence f. We indicate that this loss function is operating on a sequence of sentences with the vector notation. To avoid overfitting, and since MT researchers are generally blessed with an abundance of data, these sentences are from a separate development set.</Paragraph>
      <Paragraph position="3"> The optimization problem (3b) is hard since the argmax of (3a) causes the error surface to change in steps in Rm, precluding the use of gradient based optimization methods. Smoothed error counts can be used to approximate the argmax operator, but the resulting function still contains local minima. Grid-based line search approaches like Powell's algorithm could be applied but we can expect difficultly when choosing the appropriate grid size and starting parameters. In the following, we summarize the optimization algorithm for the unsmoothed error counts presented in (Och, 2003) and the implementation detailed in (Venugopal and Vogel, 2005).</Paragraph>
      <Paragraph position="4"> * Regard Loss(transll(vectorf),vectorr) as defined in (3b) as a function of the parameter vector l to optimize and take the argmax to compute transll(vectorf) over the translations Gen(f) according to the n-best list generated with an initial estimate l0.</Paragraph>
      <Paragraph position="5"> * The error surface defined by Loss (as a function of l) is piecewise linear with respect to a single model parameter lk, hence we can determine exactly where it would be useful (values that change the result of the argmax) to evaluate lk for a given sentence using a simple line intersection method.</Paragraph>
      <Paragraph position="6"> * Merge the list of useful evaluation points for lk and evaluate the corpus level Loss(transll(vectorf),vectorr) at each one.</Paragraph>
      <Paragraph position="7"> * Select the model parameter that represents the lowest Loss as k varies, set lk and consider the parameter lj for another dimension j.</Paragraph>
      <Paragraph position="8"> This training algorithm, referred to as minimum error rate (MER) training, is a greedy search in each dimension of l, made efficient by realizing that within each dimension, we can compute the points at which changes in l actually have an impact on Loss. The appropriate considerations for termination and initial starting points relevant to any greedy search procedure must be accounted for. From the  nature of the training procedure and the MAP decision rule, we can expect that the parameters selected by MER training will strongly favor a few translations in the n-best list, namely for each source sentence the one resulting in the best score, moving most of the probability mass towards the translation that it believes should be selected. This is due to the decision rule, rather than the training procedure, as we will see when we consider alternative decision rules.</Paragraph>
    </Section>
    <Section position="2" start_page="209" end_page="209" type="sub_section">
      <SectionTitle>
2.2 The Minimum Bayes Risk Decision Rule
</SectionTitle>
      <Paragraph position="0"> The Minimum Bayes Risk Decision Rule as proposed by (Mangu et al., 2000) for the Word Error Rate Metric in speech recognition, and (Kumar and Byrne, 2004) when applied to translation, changes the decision rule in (2) to select the translation that has the lowest expected loss E[Loss(e,r)], which can be estimated by considering a weighted Loss between e and the elements of the n-best list, the approximation to E, as described in (Mangu et al., 2000). The resulting decision rule is:</Paragraph>
      <Paragraph position="2"> (Kumar and Byrne, 2004) explicitly consider selecting both e and a, an alignment between the English and French sentences. Under a phrase based translation model (Koehn et al., 2003; Marcu and Wong, 2002), this distinction is important and will be discussed in more detail. The representation of the evaluation metric or the Loss function is in the decision rule, rather than in the training criterion for the exponential model. This criterion is hard to optimize for the same reason as the criterion in (3b): the objective function is not continuous in l. To make things worse, it is more expensive to evaluate the function at a given l, since the decision rule involves a sum over all translations.</Paragraph>
    </Section>
    <Section position="3" start_page="209" end_page="209" type="sub_section">
      <SectionTitle>
2.3 MBR and the Exponential Model
</SectionTitle>
      <Paragraph position="0"> Previous work has reported the success of the MBR decision rule with fixed parameters relating independent underlying models, typically including only the language model and the translation model as features in the exponential model.</Paragraph>
      <Paragraph position="1"> We extend the MBR approach by developing a training method to optimize the parameters l in the exponential model as an explicit form for the conditional distribution in equation (1). The training task under the MBR criterion is</Paragraph>
      <Paragraph position="3"> We begin with several observations about this optimization criterion.</Paragraph>
      <Paragraph position="4"> * The MAP optimal l[?] are not the optimal parameters for this training criterion.</Paragraph>
      <Paragraph position="5"> * We can expect the error surface of the MBR training criterion to contain larger sections of similar altitude, since the decision rule emphasizes consensus.</Paragraph>
      <Paragraph position="6"> * The piecewise linearity observation made in (Papineni et al., 2002) is no longer applicable since we cannot move the log operation into the expected value.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="209" end_page="211" type="metho">
    <SectionTitle>
3 Score Sampling
</SectionTitle>
    <Paragraph position="0"> Motivated by the challenges that the MBR training criterion presents, we present a training method that is based on the assumption that the error surface is locally non-smooth but consists of local regions of similar Loss values. We would like to focus the search within regions of the parameter space that result in low Loss values, simulating the effect that the MER training process achieves when it determines the merged error boundaries across a set of sentences.</Paragraph>
    <Paragraph position="1"> Let Score(l) be some function of Loss(transll(vectorf),vectorr) that is greater or equal zero, decreases monotonically with Loss, and for which integraltext(Score(l) [?] minlprime Score(lprime))dl is finite; e.g., 1 [?] Loss(transll(vectorf),vectorr) for the word-error rate (WER) loss and a bounded parameter space.</Paragraph>
    <Paragraph position="2"> While sampling parameter vectors l and estimating Loss in these points, we will constantly refine our estimate of the error surface and thereby of the Score function. The main idea in our score  sampling algorithm is to make use of this Score estimate by constructing a probability distribution over the parameter space that depends on the Score estimate in the current iteration step i and sample the parameter vector li+1 for the next iteration from that distribution. More precisely, let hatwiderSc(i) be the estimate of Score in iteration i (we will explain how to obtain this estimate below). Then the probability distribution from which we sample the parameter vector to test in the next iteration is given by:</Paragraph>
    <Paragraph position="4"> This distribution produces a sequence l1,...,ln of parameter vectors that are more concentrated in areas that result in a high Score. We can select the value from this sequence that generates the highest Score, just as in the MER training process.</Paragraph>
    <Paragraph position="5"> The exact method of obtaining the Score estimate hatwiderSc is crucial: If we are not careful enough and guess too low values of hatwiderSc(l) for parameter regions that are still unknown to us, the resulting sampling distribution p might be zero in those regions and thus potentially optimal parameters might never be sampled. Rather than aiming for a consistent estimator of Score (i.e., an estimator that converges to Score when the sample size goes to infinity), we design hatwiderSc with regard to yielding a suitable sampling distribution p.</Paragraph>
    <Paragraph position="6"> Assume that the parameter space is bounded such that mink [?] lk [?] maxk for each dimension k, We then define a set of pivots P, forming a grid of points in Rm that are evenly spaced between mink and maxk for each dimension k. Each pivot represents a region of the parameter space where we expect generally consistent values of Score. We do not restrict the values of lm to be at these pivot points as a grid search would do, rather we treat the pivots as landmarks within the search space.</Paragraph>
    <Paragraph position="7"> We approximate the distribution p(l) with the discrete distribution p(l [?] P), leaving the problem of estimating |P |parameters. Initially, we set p to be uniform, i.e., p(0)(l) = 1/|P|. For subsequent iterations, we now need an estimate of Score(l) for each pivot l [?] P in the discrete version of equation  (6) to obtain the new sampling distribution p. Each iteration i proceeds as follows.</Paragraph>
    <Paragraph position="8"> * Sample tildewideli from the discrete distribution p(i[?]1)(l [?] P) obtained by the previous iteration. null * Sample the new parameter vector li by choos null ing for each k [?] {1,...,m}, lik := tildewiderlik + ek, where ek is sampled uniformly from the interval ([?]dk/2,dk/2) and dk is the distance between neighboring pivot points along dimension k. Thus, li is sampled from a region around the sampled pivot.</Paragraph>
    <Paragraph position="9"> * Evaluate Score(li) and distribute this score to obtain new estimates hatwiderSc(i)(l) for all pivots l [?] P as described below.</Paragraph>
    <Paragraph position="10"> * Use the updated estimates hatwiderSc(i) to generate the sampling distribution p(i) for the next iteration according to</Paragraph>
    <Paragraph position="12"> The score Score(li) of the currently evaluated parameter vector does not only influence the score estimate at the pivot point of the respective region, but the estimates at all pivot points. The closest pivots are influenced most strongly. More precisely, for each pivot l [?] P, hatwiderSc(i)(l) is a weighted average of Score(l1),...,Score(li), where the weights w(i)(l) are chosen according to</Paragraph>
    <Paragraph position="14"> Here, mvnpdf(x,u,S) denotes the m-dimensional multivariate-normal probability density function with mean u and covariance matrix S, evaluated at point x. We chose the covariance matrix S = diag(d21,...,d2m), where again dk is the distance between neighboring grid points along dimension k.</Paragraph>
    <Paragraph position="15"> The term infl(i)(l) quantifies the influence of the evaluated point li on the pivot l, while corr(i)(l) is a correction term for the bias introduced by having sampled li from p(i[?]1).</Paragraph>
    <Paragraph position="16">  Smoothing uncertain regions In the beginning of the optimization process, there will be pivot regions that have not yet been sampled from and for which not even close-by regions have been sampled yet.</Paragraph>
    <Paragraph position="17"> This will be reflected in the low sum of influence</Paragraph>
    <Paragraph position="19"> pivot points l. It is therefore advisable to discount some probability mass from p(i) and distribute it over pivots with low influence sums (reflecting low confidence in the respective score estimates) according to some smoothing procedure.</Paragraph>
    <Paragraph position="20"> 4 N-Best lists in Phrase Based Decoding The methods described above make extensive use of n-best lists to approximate the search space of candidate translations. In phrase based decoding we often interpret the MAP decision rule to select the top scoring path in the translation lattice. Selecting a particular path means in fact selecting the pair &lt;e,s&gt; , where s is a segmentation of the the source sentence f into phrases and alignments onto their translations in e. Kumar and Byrne (2004) represent this decision explicitly, since the Loss metrics considered in their work evaluate alignment information as well as lexical (word) level output. When considering lexical scores as we do here, the decision rule minimizing 0/1 loss actually needs to take the sum over all potential segmentations that can generate the same word sequence. In practice, we only consider the high probability segmentation decisions, namely the ones that were found in the n-best list. This gives the 0/1 loss criterion shown below.</Paragraph>
    <Paragraph position="22"> The 0/1 loss criterion favors translations that are supported by several segmentation decisions. In the context of phrase-based translations, this is a useful criterion, since a given lexical target word sequence can be correctly segmented in several different ways, all of which would be scored equally by an evaluation metric that only considers the word sequence.</Paragraph>
  </Section>
  <Section position="5" start_page="211" end_page="212" type="metho">
    <SectionTitle>
5 Experimental Framework
</SectionTitle>
    <Paragraph position="0"> Our goal is to evaluate the impact of the three decision rules discussed above on a large scale translation task that takes advantage of multidimensional features in the exponential model. In this section we describe the experimental framework used in this evaluation.</Paragraph>
    <Section position="1" start_page="211" end_page="212" type="sub_section">
      <SectionTitle>
5.1 Data Sets and Resources
</SectionTitle>
      <Paragraph position="0"> We perform our analysis on the data provided by the 2005 ACL Workshop in Exploiting Parallel Texts for Statistical Machine Translation, working with the French-English Europarl corpus. This corpus consists of 688031 sentence pairs, with approximately 156 million words on the French side, and 138 million words on the English side. We use the data as provided by the workshop and run lower casing as our only preprocessing step. We use the 15.5 million entry phrase translation table as provided for the shared workshop task for the French-English data set. Each translation pair has a set of 5 associated phrase translation scores that represent the maximum likelihood estimate of the phrase as well as internal alignment probabilities. We also use the English language model as provided for the shared task.</Paragraph>
      <Paragraph position="1"> Since each of these decision rules has its respective training process, we split the workshop test set of 2000 sentences into a development and test set using random splitting. We tried two decoders for translating these sets. The first system is the Pharaoh decoder provided by (Koehn et al., 2003) for the shared data task. The Pharaoh decoder has support for multiple translation and language model scores as well as simple phrase distortion and word length models.</Paragraph>
      <Paragraph position="2"> The pruning and distortion limit parameters remain the same as in the provided initialization scripts, i.e., DistortionLimit = 4,BeamThreshold = 0.1,Stack = 100. For further information on these parameter settings, confer (Koehn et al., 2003).</Paragraph>
      <Paragraph position="3"> Pharaoh is interesting for our optimization task because its eight different models lead to a search space with seven free parameters. Here, a principled optimization procedure is crucial. The second decoder we tried is the CMU Statistical Translation System (Vogel et al., 2003) augmented with the four translation models provided by the Pharaoh system, in the following called CMU-Pharaoh. This system also leads to a search space with seven free parameters. null</Paragraph>
    </Section>
    <Section position="2" start_page="212" end_page="212" type="sub_section">
      <SectionTitle>
5.2 N-Best lists
</SectionTitle>
      <Paragraph position="0"> As mentioned earlier, the model parameters l play a large role in the search space explored by a pruning beam search decoder. These parameters affect the histogram and beam pruning as well as the future cost estimation used in the Pharaoh and CMU decoders. The initial parameter file for Pharaoh provided by the workshop provided a very poor estimate of l, resulting in an n-best list of limited potential. To account for this condition, we ran Minimum Error Rate training on the development data to determine scaling factors that can generate a n-best list with high quality translations. We realize that this step biases the n-best list towards the MAP criteria, since its parameters will likely cause more aggressive pruning. However, since we have chosen a large N=1000, and retrain the MBR, MAP, and 0/1 loss parameters separately, we do not feel that the bias has a strong impact on the evaluation.</Paragraph>
    </Section>
    <Section position="3" start_page="212" end_page="212" type="sub_section">
      <SectionTitle>
5.3 Evaluation Metric
</SectionTitle>
      <Paragraph position="0"> This paper focuses on the BLEU metric as presented in (Papineni et al., 2002). The BLEU metric is defined on a corpus level as follows.</Paragraph>
      <Paragraph position="2"> where pn represent the precision of n-grams suggested in vectore and BP is a brevity penalty measuring the relative shortness of vectore over the whole corpus. To use the BLEU metric in the candidate pair-wise loss calculation in (4), we need to make a decision regarding cases where higher order n-grams matches are not found between two candidates. Kumar and Byrne (2004) suggest that if any n-grams are not matched then the pairwise BLEU score is set to zero. As an alternative we first estimate corpus-wide n-gram counts on the development set. When the pairwise counts are collected between sentences pairs, they are added onto the baseline corpus counts to and scored by BLEU. This scoring simulates the process of scoring additional sentences after seeing a whole corpus.</Paragraph>
    </Section>
    <Section position="4" start_page="212" end_page="212" type="sub_section">
      <SectionTitle>
5.4 Training Environment
</SectionTitle>
      <Paragraph position="0"> It is important to separate the impact of the decision rule from the success of the training procedure. To appropriately compare the MAP, 0/1 loss and MBR decisions rules, they must all be trained with the same training method, here we use the Score Sampling training method described above. We also report MAP scores using the MER training described above to determine the impact of the training algorithm for MAP. Note that the MER training approach cannot be performed on the MBR decision rule, as explained in Section 2.3. MER training is initialized at random values of l and run (successive greedy search over the parameters) until there is no change in the error for three complete cycles through the parameter set. This process is repeated with new starting parameters as well as permutations of the parameter search order to ensure that there is no bias in the search towards a particular parameter. To improve efficiency, pairwise scores are cached across requests for the score at different values of l, and for MBR only the E[Loss(e,r)] for the top twenty hypotheses as ranked by the model are computed.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML