<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1096">
  <Title>An End-to-End Discriminative Approach to Machine Translation</Title>
  <Section position="4" start_page="761" end_page="761" type="intro">
    <SectionTitle>
2 Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="761" end_page="761" type="sub_section">
      <SectionTitle>
2.1 Translation as structured classification
</SectionTitle>
      <Paragraph position="0"> Machine translation can be seen as a structured classification task, in which the goal is to learn a mapping from an input (French) sentence x to an output (English) sentence y. Given this setup, discriminative methods allow us to define a broad class of features Ph that operate on (x,y). For example, some features would measure the fluency of y and others would measure the faithfulness of y as a translation of x.</Paragraph>
      <Paragraph position="1"> However, the translation task in this framework differs from traditional applications of discriminative structured classification such as POS tagging and parsing in a fundamental way. Whereas in POS tagging, there is a one-to-one correspondence between the words xand the tags y, the correspondence between x and y in machine translation is not only much more complex, but is in fact unknown. Therefore, we introduce a hidden correspondence structure h and work with the feature vector Ph(x,y,h).</Paragraph>
      <Paragraph position="2"> The phrase-based model of Koehn et al. (2003) is an instance of this framework. In their model, the correspondence h consists of (1) the segmentation of the input sentence into phrases, (2) the segmentation of the output sentence into the same number of phrases, and (3) a bijection between the input and output phrases. The feature vector Ph(x,y,h) contains four components: the log probability of the output sentence y under a language model, the score of translating x into y based on a phrase table, a distortion score, and a length penalty.1 In Section 6, we vastly increase the number of features to take advantage of the full power of discriminative training.</Paragraph>
      <Paragraph position="3"> Another example of this framework is the hierarchical model of Chiang (2005). In this model the correspondence h is a synchronous parse tree 1More components can be added to the feature vector if additional language models or phrase tables are available. over input and output sentences, and features include the scores of various productions used in the tree.</Paragraph>
      <Paragraph position="4"> Given features Ph and a corresponding set of parameters w, a standard classification rule f is to return the highest scoring output sentence y, maximizing over correspondences h:</Paragraph>
      <Paragraph position="6"> In the phrase-based model, computing the argmax exactly is intractable, so we approximate f with beam decoding.</Paragraph>
    </Section>
    <Section position="2" start_page="761" end_page="761" type="sub_section">
      <SectionTitle>
2.2 Perceptron-based training
</SectionTitle>
      <Paragraph position="0"> To tune the parameters w of the model, we use the averaged perceptron algorithm (Collins, 2002) because of its efficiency and past success on various NLP tasks (Collins and Roark, 2004; Roark et al., 2004). In principle, w could have been tuned by maximizing conditional probability or maximizing margin. However, these two options require either marginalization or numerical optimization, neither of which is tractable over the space of output sentences y and correspondences h. In contrast, the perceptron algorithm requires only a decoder that computes f(x;w).</Paragraph>
      <Paragraph position="1"> Recall the traditional perceptron update rule on an example (xi,yi) is</Paragraph>
      <Paragraph position="3"> tion using the current parameters w.</Paragraph>
      <Paragraph position="4"> We adapt this update rule to work with hidden variables as follows:</Paragraph>
      <Paragraph position="6"> where (yp,hp) is the argmax computation in Equation 1, and (yt,ht) is the target that we update towards. If (yt,ht) is the same argmax computation with the additional constraint that yt = yi, then Equation 3 can be interpreted as a Viterbi approximation to the stochastic gradient EP(h|xi,yi;w)Ph(xi,yi,h)[?]EP(y,h|xi;w)Ph(xi,y,h) for the following conditional likelihood objective:</Paragraph>
      <Paragraph position="8"> are two possible updates, local (b) and bold (c).</Paragraph>
      <Paragraph position="9"> Although the bold update (c) reaches the reference translation, a bad correspondence is used. The local update (b) does not reach the reference, but is more reasonable than (c).</Paragraph>
      <Paragraph position="10"> Discriminative training with hidden variables has been handled in this probabilistic framework (Quattoni et al., 2004; Koo and Collins, 2005), but we choose Equation 3 for efficiency.</Paragraph>
      <Paragraph position="11"> It turns out that using the Viterbi approximation (which we call bold updating) is not always the best strategy. To appreciate the difficulty, consider the example in Figure 1. Suppose we make the prediction (a) with the current set of parameters.</Paragraph>
      <Paragraph position="12"> There are often several acceptable output translations y, for example, (b) and (c). Since (c)'s output matches the reference translation, should we update towards (c)? In this case, the answer is negative. The problem with (c) is that the correspondence h contains an incorrect alignment (', a). However, since h is unobserved, the training procedure has no way of knowing this. While the output in (b) is farther from the reference, its correspondence h is much more reasonable. In short, it does not suffice for yt to be good; both yt and ht need to be good. A major challenge in using the perceptron algorithm for machine translation is determining the target (yt,ht) in Equation 3.</Paragraph>
      <Paragraph position="13"> Section 5 discusses possible targets to update towards. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>