<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1090"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics A Clustered Global Phrase Reordering Model for Statistical Machine Translation</Title> <Section position="4" start_page="713" end_page="713" type="metho"> <SectionTitle> 2 Baseline Translation Model </SectionTitle> <Paragraph position="0"> In statistical machine translation, the translation of a source (foreign) sentence a0 is formulated as the search for a target (English) sentence a1a2 that maximizes the conditional probability a3a5a4 a2a7a6a0a9a8 , which can be rewritten using the Bayes rule as, where a3a27a4 a0 a6a2 a8 is a translation model and a3a27a4 a2 a8 is a target language model.</Paragraph> <Paragraph position="1"> In phrase-based statistical machine translation, the source sentence a0 is segmented into a sequence of a28 phrases a29a0a31a30a32 , and each source phrase reordered. The translation model used in (Koehn et al., 2003) is the product of translation probabil-</Paragraph> <Paragraph position="3"> where a38 a33 denotes the start position of the source phrase translated into the a54 -th target phrase, and a42 a33a53a45 a32 denotes the end position of the source phrase translated into the a4a53a54 a40a56a55 a8 -th target phrase. The translation probability is calculated from the relative frequency as,</Paragraph> <Paragraph position="5"> a8 is the frequency of alignments between the source phrase</Paragraph> <Paragraph position="7"> (Koehn et al., 2003) used the following distortion model, which simply penalizes non-monotonic phrase alignments based on the word distance of successively translated source phrases with an appropriate value for the parameter a71 ,</Paragraph> </Section> <Section position="5" start_page="713" end_page="714" type="metho"> <SectionTitle> 3 The Global Phrase Reordering Model </SectionTitle> <Paragraph position="0"> Figure 1 shows an example of Japanese-English phrase alignment that consists of four phrase pairs.</Paragraph> <Paragraph position="1"> Note that the Japanese verb phrase &quot;a92 a93a95a94 &quot; at the the end of the sentence is aligned to the English verb &quot;is&quot; at the beginning of the sentence just after the subject. Such reordering is typical in Japanese-English translations.</Paragraph> <Paragraph position="2"> Motivated by the three-valued orientation for local reordering in (Tillmann and Zhang, 2005), we define the following four types of reordering patterns, as shown in Figure 2, a96 monotone adjacent (MA): The two source phrases are adjacent, and are in the same order as the two target phrases.</Paragraph> <Paragraph position="3"> a96 monotone gap (MG): The two source phrases are not adjacent, but are in the same order as the two target phrases.</Paragraph> <Paragraph position="4"> a96 reverse adjacent (RA): The two source phrases are adjacent, but are in the reverse order of the two target phrases.</Paragraph> <Paragraph position="5"> a96 reverse gap (RG): The two source phrases are not adjacent, and are in the reverse order as the two target phrases.</Paragraph> <Paragraph position="6"> For the global reordering model, we only consider the cases in which the two target phrases are adjacent because, in decoding, the target sentence is generated from left to right and phrase by phrase. 
<Paragraph> Table 1 shows the percentage of each reordering pattern that appeared in the N-best phrase alignments of the training bilingual sentences for the IWSLT 2005 Japanese-English and Chinese-English translation tasks (Eck and Hori, 2005). Since non-local reorderings such as monotone gap and reverse gap are more frequent in Japanese-to-English translations, they are worth modeling explicitly in this reordering model.</Paragraph>
<Paragraph> Since the probability of the reordering pattern $d$ (intended to stand for "distortion") is conditioned on the current and previous blocks, the global phrase reordering model is formalized as follows:
$$\Pr(d \mid \bar{f}_{i-1}, \bar{f}_i, \bar{e}_{i-1}, \bar{e}_i). \quad (4)$$</Paragraph>
<Paragraph> We can replace the conventional word distance-based distortion probability $d(a_i - b_{i-1})$ in Equation (1) with the global phrase reordering model in Equation (4) with minimal modification of the underlying phrase-based decoding algorithm.</Paragraph>
</Section>
<Section position="6" start_page="714" end_page="716" type="metho"> <SectionTitle> 4 Parameter Estimation Method </SectionTitle>
<Paragraph> In principle, the parameters of the global phrase reordering model in Equation (4) can be estimated from the relative frequencies of the respective events in the Viterbi phrase alignment of the training bilingual sentences. This straightforward estimation method, however, often suffers from a sparse data problem. To cope with this sparseness, we used N-best phrase alignments and bilingual phrase clustering.</Paragraph>
<Section position="1" start_page="714" end_page="714" type="sub_section"> <SectionTitle> 4.1 N-best Phrase Alignment </SectionTitle>
<Paragraph> In order to obtain the Viterbi phrase alignment of a bilingual sentence pair, we search for the phrase segmentation and phrase alignment that maximize the product of the phrase translation probabilities,
$$\max \prod_{i=1}^{I} P(\bar{f}_i \mid \bar{e}_i), \quad (5)$$
where each phrase translation probability is approximated from word translation probabilities,
$$P(\bar{f} \mid \bar{e}) \approx \prod_{f_j \in \bar{f}} \sum_{e_i \in \bar{e}} t(f_j \mid e_i) \cdot \prod_{e_i \in \bar{e}} \sum_{f_j \in \bar{f}} t(e_i \mid f_j), \quad (6)$$
where $e_i$ and $f_j$ are words in the target and source phrases, respectively.</Paragraph>
<Paragraph> The phrase alignment based on Equation (5) can be thought of as an extension of word alignment based on the IBM Model 1 to phrase alignment. Note that bilingual phrase segmentation (phrase extraction) is also done using the same criteria. The approximation in Equation (6) is motivated by (Vogel et al., 2003); here, we added the second factor, which translates in the reverse direction. The word translation probabilities are estimated using GIZA++ (Och and Ney, 2003).</Paragraph>
<Paragraph> The above search is implemented in the following way (a sketch of steps 1-5 is given below):
1. All source word and target word pairs are considered to be initial phrase pairs.
2. If the phrase translation probability of a phrase pair is less than the threshold, it is deleted.
3. Each phrase pair is expanded toward the eight neighboring directions, as shown in Figure 3.
4. If the phrase translation probability of an expanded phrase pair is less than the threshold, it is deleted.
5. The process of expansion and deletion is repeated until no further expansion is possible.
6. The consistent N-best phrase alignments are searched for among all combinations of the above phrase pairs.</Paragraph>
<Paragraph> The search for consistent Viterbi phrase alignments can be implemented as a phrase-based decoder using a beam search whose outputs are constrained only to the target sentence.</Paragraph>
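<Paragraph> The following is a minimal sketch of the expansion-and-pruning loop of steps 1-5 (our illustration, not the authors' implementation; the phrase_prob callback, the default threshold, and the span representation are assumptions). Step 6, the search for consistent N-best alignments over the surviving pairs, is carried out separately by the constrained decoder described above. </Paragraph>

```python
def grow_phrase_pairs(src_words, trg_words, phrase_prob, threshold=1e-4):
    """Enumerate candidate phrase pairs by expansion and threshold pruning.

    A phrase pair is represented by inclusive spans (i1, i2, j1, j2) over the
    source and target sentences.  `phrase_prob(src_phrase, trg_phrase)` is an
    assumed callback that returns the phrase translation probability of
    Equation (6); pairs scoring below `threshold` are pruned.
    """
    # Steps 1-2: all single source word / target word pairs that survive pruning.
    pairs = set()
    for i in range(len(src_words)):
        for j in range(len(trg_words)):
            if phrase_prob(src_words[i:i + 1], trg_words[j:j + 1]) >= threshold:
                pairs.add((i, i, j, j))

    # Steps 3-5: expand every surviving pair toward the eight neighboring
    # directions (extend the source span left, right, or not at all, and
    # likewise the target span, excluding the no-change case) until no new
    # pair survives the threshold.
    frontier = set(pairs)
    while frontier:
        new_pairs = set()
        for i1, i2, j1, j2 in frontier:
            for di1, di2 in ((-1, 0), (0, 1), (0, 0)):
                for dj1, dj2 in ((-1, 0), (0, 1), (0, 0)):
                    if (di1, di2, dj1, dj2) == (0, 0, 0, 0):
                        continue
                    n1, n2, n3, n4 = i1 + di1, i2 + di2, j1 + dj1, j2 + dj2
                    if n1 < 0 or n3 < 0 or n2 >= len(src_words) or n4 >= len(trg_words):
                        continue
                    candidate = (n1, n2, n3, n4)
                    if candidate in pairs:
                        continue
                    if phrase_prob(src_words[n1:n2 + 1], trg_words[n3:n4 + 1]) >= threshold:
                        new_pairs.add(candidate)
        pairs |= new_pairs
        frontier = new_pairs
    return pairs
```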
<Paragraph> The consistent N-best phrase alignments can be obtained by using A* search, as described in (Ueffing et al., 2002). We did not use any reordering constraints, such as the IBM constraint or the ITG constraint, in the search for the N-best phrase alignments (Zens et al., 2004). The thresholds used in the search are the following: the minimum phrase translation probability is 0.0001, the maximum number of translation candidates for each phrase is 20, the beam width is 1e-10, and the stack size (for each target candidate word length) is 1000. We found that, compared with the decoding of sentence translation, we have to search a significantly larger space for the N-best phrase alignment.</Paragraph>
<Paragraph> Figure 3 shows an example of phrase pair expansion toward the eight neighbors. If the current phrase pair consists of a single Japanese word aligned to "of", the eight expanded phrase pairs combine the English phrases "means of", "of", and "of communication" with the Japanese phrase extended by one word to the left, left unchanged, or extended by one word to the right, excluding the combination in which neither side changes.</Paragraph>
<Paragraph> Figure 4 shows an example of the best three phrase alignments for a Japanese-English bilingual sentence. For the estimation of the global phrase reordering model, preliminary tests have shown that the appropriate N-best number is 20. In counting the events for the relative frequency estimation, we treat all N-best phrase alignments equally.</Paragraph>
<Paragraph> For comparison, we also implemented a different N-best phrase alignment method, where phrase pairs are extracted using the standard phrase extraction method described in (Koehn et al., 2003). We call this conventional phrase extraction method "grow-diag-final", and the proposed phrase extraction method "ppicker" (intended to stand for "phrase picker").</Paragraph>
</Section>
<Section position="2" start_page="714" end_page="716" type="sub_section"> <SectionTitle> 4.2 Bilingual Phrase Clustering </SectionTitle>
<Paragraph> The second approach to cope with the sparseness in Equation (4) is to group the phrases into equivalence classes. We used a bilingual word clustering tool, mkcls (Och et al., 1999), for this purpose. It forms partitions of the vocabularies of the two languages so as to maximize the joint probability of the training bilingual corpus.</Paragraph>
<Paragraph> In order to perform bilingual phrase clustering, all words in a phrase are concatenated with an underscore '_' to form a pseudo word. We then use the modified bilingual sentences as the input to mkcls. We treat all N-best phrase alignments equally. Thus, the three phrase alignments in Figure 4 are each converted to a bilingual sentence pair of pseudo words; for example, the English side of one alignment becomes "the_light was red". Preliminary tests have shown that the appropriate number of classes for the estimation of the global phrase reordering model is 20.</Paragraph>
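<Paragraph> As a concrete illustration of the pseudo-word construction, the following minimal sketch (ours; the function name and the list-of-phrases representation are assumptions) produces one line of mkcls input from one side of a phrase-aligned sentence pair. </Paragraph>

```python
def to_pseudo_words(phrases):
    """Join the words of each phrase with '_' so that mkcls sees one token per phrase.

    `phrases` is the assumed phrase segmentation of one side of a sentence pair:
    a list of phrases, each phrase a list of words, in sentence order.
    """
    return ' '.join('_'.join(phrase) for phrase in phrases)


# English side of one of the phrase alignments in Figure 4:
print(to_pseudo_words([['the', 'light'], ['was'], ['red']]))  # -> the_light was red
```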
<Paragraph> As a comparison, we also tried two phrase classification methods based on the part of speech of the head word (Ohashi et al., 2005). We defined (arguably) the first word of each English phrase and the last word of each Japanese phrase as the head word, and used the part of speech of the head word as the phrase class. We call this method "1pos". Since we are not sure whether it is appropriate to introduce asymmetry in head word selection, we also tried a "2pos" method, where the parts of speech of both the first and the last words are used for phrase classification.</Paragraph>
</Section>
<Section position="3" start_page="716" end_page="716" type="sub_section"> <SectionTitle> 4.3 Conditioning Factor of Reordering </SectionTitle>
<Paragraph> The third approach to cope with the sparseness in Equation (4) is to approximate the equation by reducing the conditioning factors.</Paragraph>
<Paragraph> Other than the baseline word distance-based reordering model and Equation (4) itself, we tried eight different approximations of Equation (4), as shown in Table 2, where the symbol in the left column is the shorthand for the reordering model in the right column (for example, "e[0]" denotes conditioning on the current target phrase only, while "e[-1,0]f[-1,0]" denotes the full model of Equation (4)).</Paragraph>
<Paragraph> The approximations are designed based on two intuitions; for instance, the current block $(\bar{f}_i, \bar{e}_i)$ is expected to be more informative than the previous block. The most appropriate form of the global phrase reordering model is decided through experimentation.</Paragraph>
</Section> </Section>
<Section position="7" start_page="716" end_page="717" type="metho"> <SectionTitle> 5 Experiments </SectionTitle>
<Section position="1" start_page="716" end_page="717" type="sub_section"> <SectionTitle> 5.1 Corpus and Tools </SectionTitle>
<Paragraph> We used the IWSLT-2005 Japanese-English translation task (Eck and Hori, 2005) for evaluating the proposed global phrase reordering model, and we report results using the well-known automatic evaluation metric BLEU. IWSLT (International Workshop on Spoken Language Translation) 2005 is an evaluation campaign for spoken language translation; its task domain encompasses basic travel conversations. 20,000 bilingual sentences are provided for training. Table 3 shows the number of words and the size of the vocabulary of the training data. The average sentence length is 9.9 words for Japanese and 9.2 words for English.</Paragraph>
<Paragraph> Two development sets, each containing 500 source sentences, are also provided, and each development sentence comes with 16 reference translations. We used the second development set (devset2) for the experiments described in this paper. This 20,000-sentence corpus allows for fast experimentation and enables us to study different aspects of the proposed global phrase reordering model.</Paragraph>
<Paragraph> Japanese word segmentation was done using ChaSen, and English tokenization was done using a tool provided by the LDC. For the phrase classification based on the part of speech of the head word, we used the first two layers of the ChaSen part-of-speech tags for Japanese. For English part-of-speech tagging, we used MXPOST.</Paragraph>
<Paragraph> Word translation probabilities are obtained using GIZA++ (Och and Ney, 2003). For training, all English words are lowercased. We used a back-off word trigram model as the language model; it is trained from the lowercased English side of the training corpus using a statistical language modeling toolkit, Palmkit.</Paragraph>
<Paragraph> We implemented our own decoder based on the algorithm described in (Ueffing et al., 2002). For decoding, we used the phrase translation probability, lexical translation probability, word penalty, and distortion (phrase reordering) probability.</Paragraph>
<Paragraph> Minimum error rate training was not used for weight optimization. The thresholds used in the decoding are the following: the minimum phrase translation probability is 0.01, the maximum number of translation candidates for each phrase is 10, the beam width is 1e-5, and the stack size (for each target candidate word length) is 100.</Paragraph>
</Section>
<Section position="2" start_page="717" end_page="717" type="sub_section"> <SectionTitle> 5.2 Clustered and Lexicalized Model </SectionTitle>
<Paragraph> Figure 5 shows the BLEU scores of the clustered and lexical reordering models with different conditioning factors. Here, "class" shows the score when the identity of each phrase is represented by its class, which is obtained by the bilingual phrase clustering, while "lex" shows the score when the identity of each phrase is represented by its lexical form.</Paragraph>
<Paragraph> The clustered reordering model "class" is generally better than the lexicalized reordering model "lex". The score of "lex" drops rapidly as the number of conditioning factors increases. The reordering models that use the part of speech of the head word for phrase classification, such as "1pos" and "2pos", are somewhere in between. The best score is achieved by the clustered model when the phrase reordering pattern is conditioned on either the current target phrase $\bar{e}_i$ or the current block $(\bar{f}_i, \bar{e}_i)$. Both are significantly better than the baseline word distance-based reordering model.</Paragraph>
</Section>
<Section position="3" start_page="717" end_page="717" type="sub_section"> <SectionTitle> 5.3 Interaction between Phrase Extraction and Phrase Alignment </SectionTitle>
<Paragraph> Table 4 shows the BLEU scores of the reordering models with different phrase extraction methods. Here, "ppicker" shows the score when phrases are extracted by using the N-best phrase alignment method described in Section 4.1, while "grow-diag-final" shows the score when phrases are extracted using the standard phrase extraction algorithm described in (Koehn et al., 2003).</Paragraph>
<Paragraph> It is obvious that, for building the global phrase reordering model, our phrase extraction method is significantly better than the conventional phrase extraction method. We assume this is because the proposed N-best phrase alignment method optimizes the combination of phrase extraction (segmentation) and phrase alignment over the whole sentence.</Paragraph>
</Section>
<Section position="4" start_page="717" end_page="717" type="sub_section"> <SectionTitle> 5.4 Global and Local Reordering Model </SectionTitle>
<Paragraph> In order to show the advantages of explicitly modeling global phrase reordering, we implemented a different reordering model in which the reordering pattern is classified into three values: monotone adjacent, reverse adjacent, and neutral. By collapsing monotone gap and reverse gap into neutral, it can be thought of as a local reordering model similar to the block orientation bigram (Tillmann and Zhang, 2005).</Paragraph>
<Paragraph> Figure 6 shows the BLEU scores of the local and global reordering models. Here, "class3" and "lex3" represent the three-valued local reordering model, while "class4" and "lex4" represent the four-valued global reordering model.
"class" and "lex" represent the clustered and lexical models, respectively. We used "grow-diag-final" for phrase extraction in this experiment.</Paragraph>
<Paragraph> It is obvious that the four-valued global reordering model consistently outperformed the three-valued local reordering model under various conditioning factors.</Paragraph>
</Section> </Section>
<Section position="8" start_page="717" end_page="718" type="metho"> <SectionTitle> 6 Discussion </SectionTitle>
<Paragraph> As shown in Figure 5, the reordering model of Equation (4) (indicated as e[-1,0]f[-1,0] in shorthand) suffers from a sparse data problem even if phrase clustering is used. The empirically justifiable global reordering model seems to be the following, conditioned on the classes of the current source and target phrases:
$$\Pr(d \mid \mathrm{class}(\bar{f}_i), \mathrm{class}(\bar{e}_i)),$$
which is similar to the block orientation bigram (Tillmann and Zhang, 2005). We should note, however, that the block orientation bigram is a joint probability model for the sequence of blocks (source and target phrases) as well as their orientations (reordering patterns), whose purpose is very different from that of our global phrase reordering model.</Paragraph>
<Paragraph> The advantage of our reordering model is that it can better model global phrase reordering using a four-valued reordering pattern, and it can be easily incorporated into a standard phrase-based translation decoder.</Paragraph>
<Paragraph> The problem with the global phrase reordering model is the cost of parameter estimation. In particular, the N-best phrase alignment described in Section 4.1 is computationally expensive. We must devise a more efficient phrase alignment algorithm that can globally optimize both phrase segmentation (phrase extraction) and phrase alignment.</Paragraph>
</Section> </Paper>