<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3227"> <Title>Phrase Pair Rescoring with Term Weightings for Statistical Machine Translation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Phrase-based Machine Translation </SectionTitle> <Paragraph position="0"> In this section, the phrase-based machine translation system used in the experiments is briefly described: the phrase-based translation models and the decoding algorithm, which allows for local word reordering.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Translation Model </SectionTitle> <Paragraph position="0"> Phrase-based statistical translation systems use not only word-to-word translations extracted from bilingual data, but also phrase-to-phrase translations. Different types of extraction approaches have been described in the literature: syntax-based, word-alignment-based, and genuine phrase alignment models. The syntax-based approach has the advantage of modeling grammatical structure, using models of more or less structural richness, such as the syntax-based alignment model in (Yamada and Knight, 2001) or the Bilingual Bracketing in (Wu, 1997). Popular word-alignment-based approaches usually rely on initial word alignments from the IBM and HMM alignment models (Och and Ney, 2000), from which the phrase pairs are then extracted.</Paragraph> <Paragraph position="1"> (Marcu and Wong, 2002) and (Zhang et al., 2003) do not rely on word alignment but model the phrase alignment directly.</Paragraph> <Paragraph position="2"> Because all statistical machine translation systems search for a globally optimal translation using the language and translation models, a translation probability has to be assigned to each phrase translation pair. This score should be meaningful in that better translations are assigned a higher probability, and it should be balanced with respect to word translations: bad phrase translations should not win over better word-for-word translations merely because they are phrases.</Paragraph> <Paragraph position="3"> Our focus here is not phrase extraction, but how to estimate a reasonable probability (or score) that better represents the translation quality of the extracted phrase pairs. One major problem is that most phrase pairs are seen only a few times, even in a very large corpus. A reliable and effective estimation approach is explained in Section 3, and the proposed models are introduced in Section 4.</Paragraph> <Paragraph position="4"> In our system, a collection of phrase translations is called a transducer. Different phrase extraction methods result in different transducers.</Paragraph> <Paragraph position="5"> A manual dictionary can be added to the system as just another transducer. Typically, one source phrase is aligned with several candidate target phrases, with a score attached to each candidate representing the translation quality.</Paragraph> </Section>
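To make the transducer representation concrete, the following is a minimal Python sketch of a phrase transducer as described above: a mapping from a source phrase to its candidate target phrases with attached quality scores. The data structure, function names, and example phrase pairs are illustrative assumptions of this sketch, not details taken from the paper's implementation.

```python
from collections import defaultdict

def build_transducer(phrase_pairs):
    """Build a transducer: source phrase -> list of (target phrase, score).

    phrase_pairs: iterable of (source_tokens, target_tokens, score).
    The phrase pairs and scores are assumed to come from some extraction
    method, or from a manual dictionary treated as just another transducer.
    """
    transducer = defaultdict(list)
    for src, tgt, score in phrase_pairs:
        transducer[tuple(src)].append((tuple(tgt), score))
    return transducer

# Illustrative example: an extracted transducer and a manual dictionary,
# both queried the same way when building the translation lattice.
extracted = build_transducer([
    (["maison", "bleue"], ["blue", "house"], 0.62),
    (["maison", "bleue"], ["blue", "home"], 0.21),
])
manual_dict = build_transducer([
    (["maison"], ["house"], 1.0),
])

for src, candidates in extracted.items():
    for tgt, score in candidates:
        print(" ".join(src), "->", " ".join(tgt), score)
```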
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Decoding Algorithm </SectionTitle> <Paragraph position="0"> Given a set of transducers as the translation model (i.e. phrase translation pairs together with scores for their translation quality), decoding is divided into several steps.</Paragraph> <Paragraph position="1"> The first step is to build a lattice by applying the transducers to the input source sentence. We start from a lattice which has as its only path the source sentence. Then, for each word or sequence of words in the source sentence for which we have an entry in a transducer, new edges are generated and inserted into the lattice, spanning the source phrase. One new edge is created for each translation candidate, and the translation score is assigned to this edge. The resulting lattice then contains all the information available from the translation model.</Paragraph> <Paragraph position="2"> The second step is to search for the best path through this lattice, based not only on the translation model scores but also on the language model. We start with a special sentence-begin hypothesis at the first node in the lattice. Hypotheses are then expanded over the edges, applying the language model to the partial translations attached to the edges. The following algorithm summarizes the decoding process when word reordering is not considered:

  Current node n, previous node n'; edge e
  Language model state L, L'
  Hypothesis h, h'

  foreach node n in the lattice
    foreach incoming edge e into n
      phrase = word sequence at e
      n' = start node of e
      foreach hypothesis h with LM state L at n'
        LMcost = 0.0
        foreach word w in phrase
          update LM state to L' and add the LM cost of w to LMcost
        cost(h') = cost(h) + cost(e) + LMcost
        store h' in Hypotheses(n, L')

The updated hypothesis h' at the current node stores a pointer to the previous hypothesis and to the edge (labeled with the target phrase) over which it was expanded. Thus, at the final step, one can trace back to get the path associated with the minimum cost, i.e. the best hypothesis.</Paragraph> <Paragraph position="7"> Other operators such as local word reordering are incorporated into this dynamic programming search (Vogel, 2003).</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Phrase Pair Translation Probability </SectionTitle> <Paragraph position="0"> As stated in the previous section, one of the major problems is how to assign a reasonable probability to each extracted phrase pair so as to represent its translation quality.</Paragraph> <Paragraph position="1"> Most of the phrase pairs are seen only once or twice in the training data. This is especially true for longer phrases. Therefore, phrase pair co-occurrence counts collected from the training corpus are not reliable and have little discriminative power. In (Vogel et al., 2003) a different estimation approach was proposed. Similar to the IBM models, it is assumed that each source word $s_i$ in the source phrase $\tilde{s} = (s_1, \ldots, s_I)$ can be aligned to every target word $t_j$ in the target phrase $\tilde{t} = (t_1, \ldots, t_J)$. The total phrase translation probability is then calculated according to the following generative model:

\Pr(\tilde{s} \mid \tilde{t}) = \prod_{i=1}^{I} \sum_{j=1}^{J} p(s_i \mid t_j)    (1)

This is essentially the lexical probability as calculated in the IBM1 alignment model, without considering position alignment probabilities.</Paragraph> <Paragraph position="6"> Any statistical translation lexicon can be used in (1) to calculate the phrase translation probability.</Paragraph> <Paragraph position="7"> However, in our experiments we typically see no significant difference in translation results when using lexicons trained from different alignment models.</Paragraph> <Paragraph position="8"> Equation (1) was also confirmed to be robust and effective in parallel sentence mining from a very large and noisy comparable corpus (Zhao and Vogel, 2002).</Paragraph>
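As a reading aid, a small Python sketch of the scoring in Equation (1). It assumes a statistical lexicon is available as a dictionary mapping (source word, target word) to p(source word | target word); the lexicon format and the small flooring constant are assumptions of this sketch, not details from the paper.

```python
def phrase_translation_prob(src_phrase, tgt_phrase, lexicon, floor=1e-7):
    """Equation (1): Pr(s~ | t~) = prod_i sum_j p(s_i | t_j).

    src_phrase, tgt_phrase: lists of tokens.
    lexicon: dict mapping (source_word, target_word) -> p(source_word | target_word).
    floor: small constant so an unseen word pair does not zero out the whole
           product (an implementation choice of this sketch).
    """
    prob = 1.0
    for s_i in src_phrase:
        inner = sum(lexicon.get((s_i, t_j), 0.0) for t_j in tgt_phrase)
        prob *= max(inner, floor)
    return prob

# Illustrative usage with a toy lexicon.
lexicon = {("maison", "house"): 0.7, ("bleue", "blue"): 0.8}
print(phrase_translation_prob(["maison", "bleue"], ["blue", "house"], lexicon))
```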
<Paragraph position="9"> Equation (1) does not explicitly discriminate content words from non-content words. As non-content words such as high-frequency function words tend to occur in nearly every parallel sentence pair, they co-occur with most of the source words in the vocabulary with non-trivial translation probabilities. This noise propagates via (1) into the phrase translation probabilities, increasing the chance that non-optimal phrase translation candidates get high probabilities, so that better translations are often not in the top ranks.</Paragraph> <Paragraph position="10"> We propose a vector model to better distinguish content words from non-content words, with the goal of emphasizing content words in the translation. This model will be used to rescore the phrase translation pairs and to obtain a normalized score representing the translation probability.</Paragraph> </Section> <Section position="5" start_page="0" end_page="31" type="metho"> <SectionTitle> 4 Vector Model for Phrase Translation </SectionTitle> <Paragraph position="0"> Term weighting models such as tf.idf are applied successfully in information retrieval. The duality of term frequency (tf) and inverse document frequency (idf), in the document space and the collection space respectively, can smoothly predict the probability of terms being informative (Roelleke, 2003). Naturally, tf.idf is suitable for modeling content words, as these words in general have large tf.idf weights.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Phrase Pair as Bag-of-Words </SectionTitle> <Paragraph position="0"> Our translation model (a transducer, as defined in 2.1) is a collection of phrase translation pairs together with scores representing the translation quality. Each phrase translation pair, which can be represented as a triple $\{\tilde{s}, \tilde{t}, p\}$, is now converted into a "bag of words" $D$ consisting of both the source and the target words appearing in the phrase pair, as shown in (2):

D(\tilde{s}, \tilde{t}) = \{ s_1, \ldots, s_I, t_1, \ldots, t_J \}    (2)

Given each phrase pair as one document, the whole transducer is a collection of such documents. We can calculate tf.idf for each word in $D$ and represent the source and target phrases by the vectors

\tilde{v}_s = (w_{s_1}, \ldots, w_{s_I}), \qquad \tilde{v}_t = (w_{t_1}, \ldots, w_{t_J})    (3)

respectively, where $w$ denotes a word's tf.idf term weight.</Paragraph> <Paragraph position="5"> This vector representation can be justified by word co-occurrence considerations. As the phrase translation pairs are extracted from parallel sentences, the source words $s_i$ and the target words $t_j$ in the source and target phrases must co-occur in the training data. Co-occurring words should share similar term frequency and document frequency statistics. Therefore, the vectors $\tilde{v}_s$ and $\tilde{v}_t$ have similar term weight contours corresponding to the co-occurring word pairs. So the vector representations of a phrase translation pair can reflect the translation quality. In addition, content words and non-content words are modeled explicitly through their term weights. An over-simplified example would be that a rare word in the source language usually translates into a rare word in the target language.</Paragraph> </Section>
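A minimal sketch of the bag-of-words view in Equations (2) and (3): each phrase pair becomes one "document", and tf and df statistics are collected over the whole transducer. The function names and the exact bookkeeping are assumptions of this sketch.

```python
from collections import Counter

def phrase_pair_to_bag(src_phrase, tgt_phrase):
    """Equation (2): the document D collects the source and target words."""
    return list(src_phrase) + list(tgt_phrase)

def collect_tf_df(phrase_pairs):
    """Collect per-document term frequencies and collection-wide document
    frequencies over all phrase-pair documents in the transducer."""
    doc_tfs = []
    df = Counter()
    for src, tgt in phrase_pairs:
        bag = phrase_pair_to_bag(src, tgt)
        doc_tfs.append(Counter(bag))
        df.update(set(bag))  # each word counted at most once per document
    return doc_tfs, df

# Illustrative usage: two toy phrase pairs form a two-document collection.
pairs = [(["maison", "bleue"], ["blue", "house"]),
         (["maison"], ["house"])]
doc_tfs, df = collect_tf_df(pairs)
print(df["maison"], df["bleue"])  # 2 1
```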
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Term Weighting Schemes </SectionTitle> <Paragraph position="0"> Given the transducer, it is straightforward to calculate term weights for the source and target words. There are several versions of tf.idf; the smoothed ones are preferred, because phrase translation pairs are rare events collected from the training data.</Paragraph> <Paragraph position="1"> The idf model selected is as in Equation (4):

idf = \frac{\log\left( (N + 0.5) / df \right)}{\log(N + 1)}    (4)

where N is the total number of documents in the transducer, i.e. the total number of translation pairs, and df is the document frequency, i.e. the number of phrase pairs in which a given word occurs. The constant 0.5 acts as smoothing.</Paragraph> <Paragraph position="2"> Because most of the phrases are short, typically 2 to 8 words, the term frequency in the bag-of-words representation is usually 1, and sometimes 2. This, in general, does not bring much discrimination in representing translation quality. The following version of tf is chosen, so that longer target phrases with more words than average are slightly down-weighted:

w_{tf} = \frac{tf}{tf + 0.5 + 1.5 \cdot len(v) / \overline{len(v)}}    (5)

where tf is the term frequency, len(v) is the length of the source or target phrase, and \overline{len(v)} is the average length of the source or target phrases calculated from the transducer. Again, the values 0.5 and 1.5 are constants used in IR tasks, acting as smoothing.</Paragraph> <Paragraph position="5"> Thus, after a transducer is extracted from a parallel corpus, tf and df are counted from the collection of the "bag-of-words" phrase alignment representations. For each word in a phrase translation pair its tf.idf weight is assigned, and the source and target phrases are transformed into vectors as shown in Equation (3). These vectors preserve the translation quality information and also model the content and non-content words through the term weighting model of tf.idf.</Paragraph> </Section> <Section position="3" start_page="0" end_page="31" type="sub_section"> <SectionTitle> 4.3 Vector Space Alignment </SectionTitle> <Paragraph position="0"> Given the vector representations in Equation (3), a similarity between the two vectors cannot be calculated directly, as the dimensions I and J are not guaranteed to be the same. The goal is to transform the source vector into a vector having the same dimensions as the target vector, i.e. to map the source vector into the space of the target vector, so that a similarity distance can be calculated. Using the same reasoning as used to motivate Equation (1), it is assumed that every source word can be aligned to every target word, so that the transformed weight at each target dimension is

w_{a_j} = \sum_{i=1}^{I} w_{s_i} \, p(s_i \mid t_j), \qquad j = 1, \ldots, J    (6)

giving the transformed source vector

\tilde{v}_a = (w_{a_1}, \ldots, w_{a_J})    (7)

where I and J are the lengths of the source and target phrases, $w_{s_i}$ and $w_{t_j}$ are the term weights of the source and target words, and $w_{a_j}$ is the transformed weight mapped from all source words onto the target dimension at word $t_j$. The similarity between $\tilde{v}_a$ and $\tilde{v}_t$ can then be computed, for instance as the cosine distance:

cos(\tilde{v}_a, \tilde{v}_t) = \frac{\sum_{j=1}^{J} w_{a_j} w_{t_j}}{\sqrt{\sum_{j=1}^{J} w_{a_j}^2} \, \sqrt{\sum_{j=1}^{J} w_{t_j}^2}}    (8)

</Paragraph> <Paragraph position="8"> TREC tests show that bm25 (Robertson and Walker, 1997) is one of the best known ranking schemes. This distance metric is given in Equation (9), with the constants $k_1$, $b$, and $k_3$ set to 1, 1, and 1000 respectively:

bm25(\tilde{v}_a, \tilde{v}_t) = \sum_{j=1}^{J} \frac{(k_1 + 1) \, w_{a_j}}{K + w_{a_j}} \cdot \frac{(k_3 + 1) \, w_{t_j}}{k_3 + w_{t_j}}, \qquad K = k_1 \left( (1 - b) + b \cdot \frac{l}{avg(l)} \right)    (9)

where l is the target phrase length and avg(l) is the average target phrase length in words given the same source phrase. Our experiments confirmed that the bm25 distance is slightly better than the cosine distance, though the difference is not significant. One advantage of the bm25 distance is that the set of free parameters $k_1$, $b$, $k_3$ can be tuned to get better performance, e.g. via n-fold cross-validation.</Paragraph> </Section>
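The following Python sketch ties Equations (4) through (9) together under the formulations and constants given above: the smoothed idf and length-normalized tf weights, the mapping of source weights into the target dimensions via the lexicon, and the cosine and bm25 similarities. The helper names, the unnormalized form of the mapping, and the handling of zero vectors are assumptions of this sketch.

```python
import math

def idf(df_w, n_docs):
    # Equation (4): smoothed idf with the 0.5 constant.
    return math.log((n_docs + 0.5) / df_w) / math.log(n_docs + 1.0)

def tf_weight(tf, length, avg_length):
    # Equation (5): length-normalized tf with the constants 0.5 and 1.5.
    return tf / (tf + 0.5 + 1.5 * length / avg_length)

def tfidf_vector(phrase, tf_counts, df, n_docs, avg_length):
    # Equation (3): the phrase as a vector of tf.idf term weights.
    return [tf_weight(tf_counts[w], len(phrase), avg_length) * idf(df[w], n_docs)
            for w in phrase]

def map_source_to_target(src_phrase, tgt_phrase, w_src, lexicon):
    # Equations (6)-(7): the weight at target position j pools all source
    # weights, weighted by the lexicon probabilities p(s_i | t_j).
    return [sum(w_src[i] * lexicon.get((s_i, t_j), 0.0)
                for i, s_i in enumerate(src_phrase))
            for t_j in tgt_phrase]

def cosine(v_a, v_t):
    # Equation (8): cosine similarity between the mapped source vector and
    # the target vector (both of dimension J).
    dot = sum(a * t for a, t in zip(v_a, v_t))
    norm = math.sqrt(sum(a * a for a in v_a)) * math.sqrt(sum(t * t for t in v_t))
    return dot / norm if norm > 0.0 else 0.0

def bm25(v_a, v_t, tgt_length, avg_tgt_length, k1=1.0, b=1.0, k3=1000.0):
    # Equation (9): bm25 combination with k1 = 1, b = 1, k3 = 1000.
    K = k1 * ((1.0 - b) + b * tgt_length / avg_tgt_length)
    return sum(((k1 + 1.0) * a / (K + a)) * ((k3 + 1.0) * t / (k3 + t))
               for a, t in zip(v_a, v_t))
```

In this sketch the same lexicon probabilities used for Equation (1) also drive the vector space alignment, so no additional model has to be trained beyond the tf and df counts over the transducer.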
<Section position="4" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 4.5 Integrated Translation Score </SectionTitle> <Paragraph position="0"> Our goal is to rescore the phrase translation pairs using additional evidence of the translation quality in the vector space.</Paragraph> <Paragraph position="1"> The vector-based scores (8) and (9) provide a distinct view of the translation quality in the vector space, while Equation (1) provides evidence of the translation quality based on the word alignment probabilities, which can be assumed to be different from the evidence in the vector space. Thus, a natural way of integrating them is the geometric interpolation shown in (10), or equivalently a linear interpolation in the log domain:

Score(\tilde{s}, \tilde{t}) = \Pr(\tilde{s} \mid \tilde{t})^{\,b} \cdot sim(\tilde{v}_a, \tilde{v}_t)^{\,1-b}    (10)

The parameter b can be tuned using held-out data. In our cross-validation experiments b = 0.5 gave the best performance in most cases. Therefore, Equation (10) can be simplified into:

Score(\tilde{s}, \tilde{t}) = \sqrt{\Pr(\tilde{s} \mid \tilde{t}) \cdot sim(\tilde{v}_a, \tilde{v}_t)}    (11)

The phrase translation score functions in (1) and (11) are non-symmetric, because the statistical lexicon Pr(s|t) is non-symmetric. One can easily rewrite all the distances using Pr(t|s), but in our experiments this reverse direction made only a trivial difference. So in all the experimental results reported in this paper, the distances defined in (1) and (11) are used.</Paragraph> </Section> </Section> <Section position="6" start_page="31" end_page="31" type="metho"> <SectionTitle> 5 Length Regularization </SectionTitle> <Paragraph position="0"> Phrase pair extraction does not work perfectly, and sometimes a short source phrase is aligned to a long target phrase or vice versa. Length regularization can be applied to penalize candidate translations that are too long or too short. Similar to the sentence alignment work in (Gale and Church, 1991), the phrase length ratio is assumed to follow a Gaussian distribution, as given in Equation (12):

\Pr\bigl( l(t) \mid l(s) \bigr) \propto \exp\left( -\frac{ \left( l(t)/l(s) - \mu \right)^2 }{ 2\sigma^2 } \right)    (12)

where l(s) and l(t) are the source and target phrase lengths. The mean \mu and variance \sigma^2 can be estimated from a parallel corpus using a maximum likelihood criterion.</Paragraph> <Paragraph position="1"> The regularized score is the product of (11) and (12).</Paragraph> </Section> </Paper>
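Finally, a short sketch combining the integrated score of Equation (11) with the length regularization of Equation (12), under the reading above that the Gaussian is placed on the target/source length ratio. Here mu and sigma are assumed to have been estimated from a parallel corpus; the default values are placeholders, not values reported in the paper.

```python
import math

def integrated_score(prob_eq1, sim_vec):
    # Equation (11): geometric mean of the Equation (1) probability and the
    # vector-space similarity (the b = 0.5 case of Equation (10)).
    return math.sqrt(prob_eq1 * sim_vec)

def length_regularizer(src_len, tgt_len, mu, sigma):
    # Equation (12): Gaussian penalty on the target/source length ratio,
    # with mean mu and standard deviation sigma estimated by maximum likelihood.
    ratio = tgt_len / src_len
    return math.exp(-((ratio - mu) ** 2) / (2.0 * sigma ** 2))

def regularized_score(prob_eq1, sim_vec, src_len, tgt_len, mu=1.0, sigma=0.5):
    # The regularized score is the product of (11) and (12); mu and sigma
    # defaults are placeholders for corpus-estimated values.
    return integrated_score(prob_eq1, sim_vec) * length_regularizer(src_len, tgt_len, mu, sigma)

# Illustrative usage: rescore one candidate phrase pair.
print(regularized_score(prob_eq1=0.01, sim_vec=0.4, src_len=2, tgt_len=2))
```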