Discriminative Reordering Models for Statistical Machine Translation

3 Baseline System

In statistical machine translation, we are given a source language sentence $f_1^J = f_1 \ldots f_j \ldots f_J$, which is to be translated into a target language sentence $e_1^I = e_1 \ldots e_i \ldots e_I$. Among all possible target language sentences, we choose the sentence with the highest probability:

$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \; Pr(e_1^I \mid f_1^J) \qquad (1)$$

The posterior probability $Pr(e_1^I \mid f_1^J)$ is modeled directly using a log-linear combination of several models (Och and Ney, 2002):

$$Pr(e_1^I \mid f_1^J) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J)\right)}{\sum_{e'^I_1} \exp\left(\sum_{m=1}^{M} \lambda_m h_m(e'^I_1, f_1^J)\right)} \qquad (2)$$

The denominator is a normalization factor that depends only on the source sentence $f_1^J$. Therefore, we can omit it during the search process. As a decision rule, we obtain:

$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right\} \qquad (3)$$

This approach is a generalization of the source-channel approach (Brown et al., 1990). It has the advantage that additional models $h_m(\cdot)$ can be easily integrated into the overall system. The model scaling factors $\lambda_1^M$ are trained with respect to the final translation quality, measured by an error criterion (Och, 2003).

We use a state-of-the-art phrase-based translation system (Zens and Ney, 2004; Zens et al., 2005) that includes the following models: an n-gram language model, a phrase translation model, and a word-based lexicon model. The latter two models are used in both directions: $p(f|e)$ and $p(e|f)$. Additionally, we use a word penalty and a phrase penalty. The reordering model of the baseline system is distance-based, i.e. it assigns costs based on the distance from the end position of one phrase to the start position of the next phrase. This very simple reordering model is widely used, for instance in (Och et al., 1999; Koehn, 2004; Zens et al., 2005).
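To make the baseline concrete, the following is a minimal Python sketch of the log-linear decision rule (Equation 3) and a distance-based reordering cost of the kind described above. The function names (`log_linear_score`, `distance_reordering_cost`, `best_hypothesis`) and the exact form of the distance penalty are illustrative assumptions, not taken from the paper's implementation; the feature functions are assumed to return log-scores.

```python
from typing import Callable, List

def log_linear_score(hypothesis: str,
                     source: str,
                     features: List[Callable[[str, str], float]],
                     lambdas: List[float]) -> float:
    """Unnormalized log-linear score: sum_m lambda_m * h_m(e, f).

    Because the normalization in Equation 2 depends only on the source
    sentence, it can be dropped during search; the argmax over hypotheses
    is taken directly over this score (Equation 3).
    """
    return sum(lam * h(hypothesis, source) for lam, h in zip(lambdas, features))

def distance_reordering_cost(phrase_end: int, next_phrase_start: int) -> float:
    """Distance-based reordering cost of the baseline system.

    One common formulation (assumed here): the cost grows with the jump
    distance between the end position of the current phrase and the start
    position of the next phrase; monotone continuation costs nothing.
    """
    return abs(next_phrase_start - phrase_end - 1)

def best_hypothesis(candidates: List[str], source: str,
                    features: List[Callable[[str, str], float]],
                    lambdas: List[float]) -> str:
    """Decision rule: pick the candidate with the highest log-linear score."""
    return max(candidates, key=lambda e: log_linear_score(e, source, features, lambdas))
```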
4 The Reordering Model

4.1 Idea

In this section, we describe the proposed discriminative reordering model.

To make use of word-level information, we need the word alignment within the phrase pairs. This can easily be stored during the extraction of the phrase pairs from the bilingual training corpus. If there are multiple possible alignments for a phrase pair, we use the most frequent one.

The notation is introduced using the illustration in Figure 1, which shows an example of a left and a right phrase orientation. We assume that we have already produced the three-word phrase in the lower part. Now, the model has to predict whether the start position of the next phrase $j'$ is to the left or to the right of the current phrase. The reordering model is applied only at the phrase boundaries; we assume that the reordering within the phrases is correct.

In the remainder of this section, we describe the details of this reordering model. The classes our model predicts are defined in Section 4.2, the feature functions in Section 4.3, and the training criterion and the training events of the maximum entropy model in Section 4.4.

4.2 Class Definition

Ideally, this model would predict the start position of the next phrase. But as predicting the exact position is rather difficult, we group the possible start positions into classes. In the simplest case, we use only two classes: one class for the positions to the left and one class for the positions to the right. As a refinement, we can use four classes instead of two: 1) exactly one position to the left, 2) more than one position to the left, 3) exactly one position to the right, 4) more than one position to the right.

In general, we use a parameter $D$ to specify $2 \cdot D$ classes of the following types:

* exactly $d$ positions to the left, $d = 1, \ldots, D-1$
* at least $D$ positions to the left
* exactly $d$ positions to the right, $d = 1, \ldots, D-1$
* at least $D$ positions to the right

Let $c_{j,j'}$ denote the orientation class for a movement from source position $j$ to source position $j'$, as illustrated in Figure 1. In the case of two orientation classes, $c_{j,j'}$ is defined as:

$$c_{j,j'} = \begin{cases} \text{left}, & \text{if } j' < j \\ \text{right}, & \text{if } j' > j \end{cases} \qquad (4)$$

Then, the reordering model has the form $p(c_{j,j'} \mid f_1^J, e_1^I, i, j)$.

A well-founded framework for directly modeling the probability $p(c_{j,j'} \mid f_1^J, e_1^I, i, j)$ is maximum entropy (Berger et al., 1996). In this framework, we have a set of $N$ feature functions $h_n(f_1^J, e_1^I, i, j, c_{j,j'})$, $n = 1, \ldots, N$. Each feature function $h_n$ is weighted with a factor $\lambda_n$. The resulting model is:

$$p_{\lambda_1^N}(c_{j,j'} \mid f_1^J, e_1^I, i, j) = \frac{\exp\left(\sum_{n=1}^{N} \lambda_n h_n(f_1^J, e_1^I, i, j, c_{j,j'})\right)}{\sum_{c'} \exp\left(\sum_{n=1}^{N} \lambda_n h_n(f_1^J, e_1^I, i, j, c')\right)} \qquad (5)$$

The functional form is identical to Equation 2, but here we use a large number of binary features, whereas in Equation 2 usually only a very small number of real-valued features is used. More precisely, the resulting reordering model $p_{\lambda_1^N}(c_{j,j'} \mid f_1^J, e_1^I, i, j)$ is used as an additional component in the log-linear combination of Equation 2.
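The mapping from a jump distance to one of the $2 \cdot D$ orientation classes can be written down directly. Below is a minimal Python sketch assuming classes are labeled by capped signed distances; the function name `orientation_class` and the class labels are illustrative, not from the paper.

```python
def orientation_class(j: int, j_prime: int, D: int = 1) -> str:
    """Map a movement from source position j to j' onto one of 2*D
    orientation classes (Section 4.2).

    For D = 1 this reduces to the two-class case of Equation 4:
    'left' if j' < j, 'right' if j' > j.  For larger D, jumps of exactly
    d positions (d = 1, ..., D-1) each get their own class, and jumps of
    at least D positions are collapsed into a single class per side.
    """
    assert j_prime != j, "the next phrase cannot start at the current position"
    distance = abs(j_prime - j)
    side = "left" if j_prime < j else "right"
    if D == 1:
        return side                      # two-class case, Equation 4
    if distance < D:
        return f"{side}_{distance}"      # exactly d positions, d = 1, ..., D-1
    return f"{side}_ge{D}"               # at least D positions

# Example: with D = 2 there are four classes, matching the refinement
# described above: left_1, left_ge2, right_1, right_ge2.
```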
4.3 Feature Definition

The feature functions of the reordering model depend on the last alignment link $(j, i)$ of a phrase. Note that the source position $j$ is not necessarily the end position of the source phrase; we use the source position $j$ that is aligned to the last word of the target phrase in target position $i$. The illustration in Figure 1 contains such an example.

To introduce generalization capabilities, some of the features depend on word classes or part-of-speech information. Let $F_1^J$ denote the word class sequence that corresponds to the source language sentence $f_1^J$, and let $E_1^I$ denote the target word class sequence that corresponds to the target language sentence $e_1^I$. Then, the feature functions are of the form $h_n(f_1^J, e_1^I, F_1^J, E_1^I, i, j, j')$. We consider the following binary features:

1. source words within a window around the current source position $j$:
   $$h_{f,d,c}(f_1^J, e_1^I, F_1^J, E_1^I, i, j, j') = \delta(f_{j+d}, f) \cdot \delta(c_{j,j'}, c)$$

2. target words within a window around the current target position $i$:
   $$h_{e,d,c}(f_1^J, e_1^I, F_1^J, E_1^I, i, j, j') = \delta(e_{i+d}, e) \cdot \delta(c_{j,j'}, c)$$

3. word classes or part-of-speech tags within a window around the current source position $j$:
   $$h_{F,d,c}(f_1^J, e_1^I, F_1^J, E_1^I, i, j, j') = \delta(F_{j+d}, F) \cdot \delta(c_{j,j'}, c)$$

4. word classes or part-of-speech tags within a window around the current target position $i$:
   $$h_{E,d,c}(f_1^J, e_1^I, F_1^J, E_1^I, i, j, j') = \delta(E_{i+d}, E) \cdot \delta(c_{j,j'}, c)$$

Here, $\delta(\cdot, \cdot)$ denotes the Kronecker function. In the experiments, we use $d \in \{-1, 0, 1\}$. Many other feature functions are imaginable, e.g. combinations of the described feature functions, n-gram or multi-word features, or joint source and target language feature functions.

4.4 Training

As training criterion, we use the maximum class posterior probability. This corresponds to maximizing the likelihood of the maximum entropy model. Since the optimization criterion is convex, there is only a single optimum and no convergence problems occur. To train the model parameters $\lambda_1^N$, we use the Generalized Iterative Scaling (GIS) algorithm (Darroch and Ratcliff, 1972).

In practice, the training procedure tends to result in an overfitted model. To avoid overfitting, Chen and Rosenfeld (1999) have suggested a smoothing method in which a Gaussian prior distribution over the parameters is assumed. This method avoids very large lambda values and prevents features that occur only once for a specific class from receiving a weight of infinity.

We train IBM Model 4 with GIZA++ (Och and Ney, 2003) in both translation directions. Then the alignments are symmetrized using the refined heuristic described in (Och and Ney, 2003). This word-aligned bilingual corpus is used to train the reordering model parameters, i.e. the feature weights $\lambda_1^N$. Each alignment link defines an event for the maximum entropy training. One exception is the one-to-many alignments, i.e. cases where one source word is aligned to multiple target words; here, only the top-most alignment link is considered, because the other ones cannot occur at a phrase boundary. Many-to-one and many-to-many alignments are handled in a similar way.
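As a rough illustration of how training events and binary features could be assembled from a word-aligned corpus, here is a Python sketch. It assumes a simple list-of-links alignment representation and generates events for the two-class case of Equation 4; the handling of one-to-many links follows one possible reading of the "top-most link only" rule above (keeping the link with the largest target position). Names such as `extract_training_events` and the feature string format are illustrative, not from the paper's toolkit.

```python
from typing import Dict, List, Tuple

Alignment = List[Tuple[int, int]]  # (source position j, target position i), 1-based

def keep_topmost_links(alignment: Alignment) -> Alignment:
    """For one-to-many alignments (one source word aligned to several target
    words), keep only one link per source word -- here the one with the
    largest target position -- as a stand-in for the paper's top-most link."""
    best: Dict[int, int] = {}
    for j, i in alignment:
        if j not in best or i > best[j]:
            best[j] = i
    # iterate links in target order so consecutive links reflect the reordering
    return sorted(best.items(), key=lambda link: link[1])

def window_features(f: List[str], e: List[str], F: List[str], E: List[str],
                    j: int, i: int, window=(-1, 0, 1)) -> List[str]:
    """Binary features of Section 4.3: source/target words and word classes
    (or POS tags) in a window around the current positions j and i."""
    feats = []
    for d in window:
        if 0 <= j + d - 1 < len(f):
            feats.append(f"src_word_{d}={f[j + d - 1]}")
            feats.append(f"src_class_{d}={F[j + d - 1]}")
        if 0 <= i + d - 1 < len(e):
            feats.append(f"tgt_word_{d}={e[i + d - 1]}")
            feats.append(f"tgt_class_{d}={E[i + d - 1]}")
    return feats

def extract_training_events(f, e, F, E, alignment: Alignment):
    """Each cleaned alignment link defines one maximum-entropy training event:
    (active features around the link, orientation class of the jump to the
    source position of the next link in target order)."""
    links = keep_topmost_links(alignment)
    events = []
    for (j, i), (j_next, _) in zip(links, links[1:]):
        label = "left" if j_next < j else "right"   # two-class case, Equation 4
        events.append((window_features(f, e, F, E, j, i), label))
    return events
```

The resulting (feature list, class label) pairs can then be fed to any standard maximum-entropy trainer, e.g. a GIS implementation with Gaussian-prior smoothing as discussed in Section 4.4.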