<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1022">
  <Title>HMM Word and Phrase Alignment for Statistical Machine Translation</Title>
  <Section position="3" start_page="0" end_page="170" type="metho">
    <SectionTitle>
2 HMM Word and Phrase Alignment
</SectionTitle>
    <Paragraph position="0"> Our goal is to develop a generative probabilistic model of Word-to-Phrase (WtoP) alignment. We start with an l-word source sentence e = el1, and an  m-word target sentence f = fm1 , which is realized as a sequence of K phrases: f = vK1 .</Paragraph>
    <Paragraph position="1"> Each phrase is generated as a translation of one source word, which is determined by the alignment sequence aK1 : eak - vk . The length of each phrase is specified by the process phK1 , which is constrained so that summationtextKk=1 phk = m.</Paragraph>
    <Paragraph position="2"> We also allow target phrases to be inserted, i.e. to be generated by a NULL source word. For this, we define a binary hallucination sequence hK1 : if hk = 0, then NULL - vk ; if hk = 1 then eak - vk.</Paragraph>
    <Paragraph position="3"> With all these quantities gathered into an alignment a = (phK1 ,aK1 ,hK1 ,K), the modeling objective is to realize the conditional distribution P(f,a|e). With the assumption that P(f,a|e) = 0 if f negationslash= vK1 ,</Paragraph>
    <Paragraph position="5"> We now describe the component distributions.</Paragraph>
    <Paragraph position="6"> Sentence Length o(m|l) determines the target sentence length. It is not needed during alignment, where sentence lengths are known, and is ignored.</Paragraph>
    <Paragraph position="7"> Phrase Count P(K|m,e) specifies the number of target phrases. We use a simple, single parameter distribution, with e = 8.0 throughout</Paragraph>
    <Paragraph position="9"> Word-to-Phrase Alignment Alignment is a Markov process that specifies the lengths of phrases and their alignment with source words</Paragraph>
    <Paragraph position="11"> The actual word-to-phrase alignment (ak) is a first-order Markov process, as in HMM-based word-to-word alignment (Vogel et al., 1996). It necessarily depends on the hallucination variable</Paragraph>
    <Paragraph position="13"> This formulation allows target phrases to be inserted without disrupting the Markov dependencies of phrases aligned to actual source words.</Paragraph>
    <Paragraph position="14"> The phrase length model n(ph;e) gives the probability that a word e produces a phrase with ph words in the target language; n(ph;e) is defined for ph = 1,*** ,N. The hallucination process is a simple i.i.d. process, where d(0) = p0, and d(1) = 1[?]p0.</Paragraph>
    <Paragraph position="15"> Word-to-Phrase Translation The translation of words to phrases is given as</Paragraph>
    <Paragraph position="17"> We introduce the notation vk = vk[1],...,vk[phk] and a dummy variable xk (for phrase insertion) :</Paragraph>
    <Paragraph position="19"> We define two models of word-to-phrase translation.</Paragraph>
    <Paragraph position="20"> This simplest is based on context-independent word-</Paragraph>
    <Paragraph position="22"> We also define a model that captures foreign word context with bigram translation probabilities</Paragraph>
    <Paragraph position="24"> Here, t(f|e) is the usual context independent word-to-word translation probability. The bigram translation probability t2(f|f',e) specifies the likelihood that target word f is to follow f' in a phrase generated by source word e.</Paragraph>
    <Section position="1" start_page="170" end_page="170" type="sub_section">
      <SectionTitle>
2.1 Properties of the Model and Prior Work
</SectionTitle>
      <Paragraph position="0"> The formulation of the WtoP alignment model was motivated by both the HMM word alignment model (Vogel et al., 1996) and IBM Model-4 with the goal of building on the strengths of each.</Paragraph>
      <Paragraph position="1"> The relationship with the word-to-word HMM alignment model is straightforward. For example, constraining the phrase length component n(ph;e) to permit only phrases of one word would give a word-to-word HMM alignment model. The extensions introduced are the phrase count, and the phrase length models, and the bigram translation distribution. The hallucination process is motivated by the use of NULL alignments into Markov alignment models as done by (Och and Ney, 2003).</Paragraph>
      <Paragraph position="2"> The phrase length model is motivated by Toutanova et al. (2002) who introduced 'stay' probabilities in HMM alignment as an alternative to word fertility. By comparison, Word-to-Phrase HMM alignment models contain detailed models of state occupancy, motivated by the IBM fertility model, which are more powerful than a single staying parameter. In fact, the WtoP model is a segmental Hidden Markov Model (Ostendorf et al., 1996), in which states emit observation sequences.</Paragraph>
      <Paragraph position="3"> Comparison with Model-4 is less straightforward.</Paragraph>
      <Paragraph position="4"> The main features of Model-4 are NULL source words, source word fertility, and the distortion model. The WtoP alignment model includes the first two of these. However distortion, which allows hypothesized words to be distributed throughout the target sentence, is difficult to incorporate into a model that supports efficient DP-based search. We preserve efficiency in the WtoP model by insisting that target words form connected phrases; this is not as general as Model-4 distortion. This weakness is somewhat offset by a more powerful (Markov) alignment process as well as by the phrase count distribution. Despite these differences, the WtoP alignment model and Model-4 allow similar alignments. For example, in Fig. 1, Model-4 would allow</Paragraph>
      <Paragraph position="6"> f1, f3, and f4 to be generated by e1 with a fertility of 3. Under the WtoP model, e1 could generate f1 and f3f4 with phrase lengths 1 and 2, respectively: source words can generate more than one phrase.</Paragraph>
      <Paragraph position="7"> This alignment could also be generated via four single word foreign phrases. The balance between word-to-word and word-to-phrase alignments is set by the phrase count distribution parameter e. As e increases, alignments with shorter phrases are favored, and for very large e the model allows only word-to-word alignments (see Fig. 2). Although the WtoP alignment model is more complex than the word-to-word HMM alignment model, the Baum-Welch and Viterbi algorithms can still be used. Word-to-word alignments are generated by the Viterbi algorithm: ^a = argmaxa P(f,a|e); if eak - vk , eak is linked to all the words in vk.</Paragraph>
      <Paragraph position="8"> The bigram translation probability relies on word context, known to be helpful in translation (Berger et al., 1996), to improve the identification of target phrases. As an example, f is the Chinese word for &amp;quot;world trade center&amp;quot;. Table 1 shows how the likelihood of the correct English phrase is improved with bigram translation probabilities; this example is from the C-E, N=4 system of Table 2.</Paragraph>
      <Paragraph position="9">  There are of course much prior work in translation that incorporates phrases. Sumita et al. (2004) develop a model of phrase-to-phrase alignment, which while based on HMM alignment process, appears to be deficient. Marcu and Wong (2002) propose a model to learn lexical correspondences at the phrase level. To our knowledge, ours is the first non-syntactic model of bitext alignment (as opposed to translation) that links words and phrases.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="170" end_page="171" type="metho">
    <SectionTitle>
3 Embedded Alignment Model Estimation
</SectionTitle>
    <Paragraph position="0"> We now discuss estimation of the WtoP model parameters by the EM algorithm. Since the WtoP model can be treated as an HMM with a very complex state space, it is straightforward to apply Baum- null Welch parameter estimation. We show the forward recursion as an example.</Paragraph>
    <Paragraph position="1"> Given a sentence pair (el1,fm1 ), the forward probability aj(i,ph) is defined as the probability of generating the first j target words with the added condition that the target words fjj[?]ph+1 form a phrase aligned to source word ei. It can be calculated recursively (omitting the hallucination process, for simplicity) as</Paragraph>
    <Paragraph position="3"> This recursion is over a trellis of l(N + 1)m nodes.</Paragraph>
    <Paragraph position="4"> Models are trained from a flat-start. We begin with 10 iterations of EM to train Model-1, followed by 5 EM iterations to train Model-2 (Brown and others, 1993). We initialize the parameters of the word-to-word HMM alignment model by collecting word alignment counts from the Model-2 Viterbi alignments, and refine the word-to-word HMM alignment model by 5 iterations of the Baum-Welch algorithm.</Paragraph>
    <Paragraph position="5"> We increase the order of the WtoP model (N) from 2 to the final value in increments of 1, by performing 5 Baum Welch iterations at each step. At the final value of N, we introduce the bigram translation probability; we use Witten-Bell smoothing (1991) as a backoff strategy for t2, and other strategies are possible.</Paragraph>
  </Section>
  <Section position="5" start_page="171" end_page="172" type="metho">
    <SectionTitle>
4 Bitext Word Alignment
</SectionTitle>
    <Paragraph position="0"> We now investigate bitext word alignment performance. We start with the FBIS Chinese/English parallel corpus which consists of approx. 10M English/7.5M Chinese words. The Chinese side of the corpus is segmented into words by the LDC segmenter1. The alignment test set consists of 124 sentences from the NIST 2001 dry-run MT-eval2 set that are manually word aligned.</Paragraph>
    <Paragraph position="1"> We first analyze the distribution of word links within these manual alignments. Of the Chinese words which are aligned to more than one English words, 82% of these words align with consecutive  English words (phrases). In the other direction, among all English words which are aligned to multiple Chinese words, 88% of these align to Chinese phrases. In this collection, at least, word-to-phrase alignments are plentiful.</Paragraph>
    <Paragraph position="2"> Alignment performance is measured by the</Paragraph>
    <Paragraph position="4"> where B is a set reference word links, and B' are the word links generated automatically.</Paragraph>
    <Paragraph position="5"> AER gives a general measure of word alignment quality. We are also interested in how the model performs over the word-to-word and word-to-phrase alignments it supports. We split the reference alignments into two subsets: B1[?]1 contains word-to-word reference links (e.g. 1-1 in Fig 1); and B1[?]N contains word-to-phrase reference links (e.g.</Paragraph>
    <Paragraph position="6"> 1-3, 1-4 in Fig 1); The automatic alignment B' is partitioned similarly. We define additional AERs: AER1[?]1 = AER(B1[?]1,B'1[?]1), and AER1[?]N = AER(B1[?]N,B'1[?]N), which measure word-to-word and word-to-phrase alignment, separately.</Paragraph>
    <Paragraph position="7"> Table 2 presents the three AER measurements for  the WtoP alignment models trained as described in Section 3. GIZA++ Model 4 alignment performance is also presented for comparison. We note first that the word-to-word HMM (N=1) alignment model is worse than Model 4, as expected. For the WtoP models in the C-E direction, we see reduced AER for phrases lengths up to 4, although in the E-C direction, AER is reduced only for phrases of length 2; performance for N &gt; 2 is not reported.</Paragraph>
    <Paragraph position="8"> In introducing the bigram phrase translation (the bigram t-table), there is a tradeoff between word-to-word and word-to-phrase alignment quality. As mentioned, the bigram t-table increases the likelihood of word-to-phrase alignments. In both translation directions, this reduces the AER1[?]N. However, it also causes increases in AER1[?]1, primarily due to a drop in recall: fewer word-to-word alignments are produced. For C-E, this is not severe enough to cause an overall AER increase; however, in E-C, AER does increase.</Paragraph>
    <Paragraph position="9"> Fig. 2 (C-E, N=4) shows how the 1-1 and 1-N alignment behavior is balanced by the phrase count parameter. As e increases, the model favors alignments with more word-to-word links and fewer word-to-phrase links; the overall Alignment Error Rate (AER) suggests a good balance at e = 8.0.</Paragraph>
    <Paragraph position="10"> After observing that the WtoP model performs as well as Model-4 over the FBIS C-E bitext, we investigated performance over these large bitexts :  - &amp;quot;NEWS&amp;quot; containing non-UN parallel Chinese/English corpora from LDC (mainly FBIS, Xinhua, Hong Kong, Sinorama, and Chinese Treebank).</Paragraph>
    <Paragraph position="11"> - &amp;quot;NEWS+UN01-02&amp;quot; also including UN parallel corpora from the years 2001 and 2002.</Paragraph>
    <Paragraph position="12"> - &amp;quot;ALL C-E&amp;quot; refers to all the C-E bitext available  from LDC as of his submission; this consists of the NEWS corpora with the UN bitext from all years.</Paragraph>
    <Paragraph position="13"> Over all these collections, WtoP alignment performance (Table 3) is comparable to that of Model4. We do note a small degradation in the E-C WtoP alignments. It is quite possible that this one-to-many model suffers slightly with English as the source and Chinese as the target, since English sentences tend to be longer. Notably, simply increasing the amount of bitext used in training need not improve AER. However, larger aligned bitexts can give improved phrase pair coverage of the test set.</Paragraph>
    <Paragraph position="14"> One of the desirable features of HMMs is that the  Forward-Backward steps can be run in parallel: bi-text is partitioned; the Forward-Backward algorithm is run over the subsets on different CPUs; statistics are merged to reestimate model parameters. Partitioning the bitext also reduces the memory usage, since different cooccurrence tables can be kept for each partition. With the &amp;quot;ALL C-E&amp;quot; bitext collection, a single set of WtoP models (C-E, N=4, bi-gram t-table) can be trained over 200M words of Chinese-English bitext by splitting training over 40 CPUs; each Forward-Backward process takes less than 2GB of memory and the training run finishes in five days. By contrast, the 96M English word NEWS+UN01-02 is about the largest C-E bitext over which we can train Model-4 with our GIZA++ configuration and computing infrastructure.</Paragraph>
    <Paragraph position="15"> Based on these and other experiments, in this paper we set a maximum value of N = 4 for F-E; in E-F, we set N=2 and omit the bigram phrase translation probability; e is set to 8.0. We do not claim that this is optimal, however.</Paragraph>
  </Section>
  <Section position="6" start_page="172" end_page="173" type="metho">
    <SectionTitle>
5 Phrase Pair Induction
</SectionTitle>
    <Paragraph position="0"> A common approach to phrase-based translation is to extract an inventory of phrase pairs (PPI) from bi-text (Koehn et al., 2003), For example, in the phrase-extract algorithm (Och, 2002), a word alignment ^am1 is generated over the bitext, and all word sub-sequences ei2i1 and fj2j1 are found that satisfy :</Paragraph>
    <Paragraph position="2"> The PPI comprises all such phrase pairs (ei2i1,fj2j1 ).</Paragraph>
    <Paragraph position="3"> The process can be stated slightly differently.</Paragraph>
    <Paragraph position="4"> First, we define a set of alignments :</Paragraph>
    <Paragraph position="6"> phrase pair.</Paragraph>
    <Paragraph position="7"> Viewed in this way, there are many possible alignments under which phrases might be paired, and  the selection of phrase pairs need not be based on a single alignment. Rather than simply accepting a phrase pair (ei2i1,fj2j1 ) if the unique MAP alignment satisfies Equation 1, we can assign a probability to phrases occurring as translation pairs :</Paragraph>
    <Paragraph position="9"> For a fixed set of indices i1,i2,j1,j2, the quantity P(f, A(i1,i2;j1,j2 )|e) can be computed efficiently using a modified Forward algorithm. Since P(f|e) can also be computed by the Forward algorithm, the phrase-to-phrase posterior distribution P(A(i1,i2;j1,j2 )|f,e) is easily found.</Paragraph>
    <Paragraph position="10"> PPI Induction Strategies In the phrase-extract algorithm (Och, 2002), the alignment ^a is generated as follows: Model-4 is trained in both directions (e.g. F-E and E-F); two sets of word alignments are generated by the Viterbi algorithm for each set of models; and the two alignments are merged. This forms a static aligned bitext. Next, all foreign word sequences up to a given length (here, 5 words) are extracted from the test set. For each of these, a phrase pair is added to the PPI if the foreign phrase can be found aligned to an English phrase under Eq 1. We refer to the result as the Model-4 Viterbi Phrase-Extract PPI.</Paragraph>
    <Paragraph position="11"> Constructed in this way, the PPI is limited to phrase pairs which can be found in the Viterbi alignments. Some foreign phrases which do appear in the training bitext will not be included in the PPI because suitable English phrases cannot be found.</Paragraph>
    <Paragraph position="12"> To add these to the PPI we can use the phrase-to-phrase posterior distribution to find English phrases as candidate translations. This adds phrases to the Viterbi Phrase-Extract PPI and increase the test set coverage. A somewhat ad hoc PPI Augmentation algorithm is given to the right.</Paragraph>
    <Paragraph position="13"> Condition (A) extracts phrase pairs based on the geometric mean of the E-F and F-E posteriors (Tg = 0.01 throughout). The threshold Tp selects additional phrase pairs under a more forgiving criterion: as Tp decreases, more phrase pairs are added and PPI coverage increases. Note that this algorithm is constructed specifically to improve a Viterbi PPI; it is certainly not the only way to extract phrase pairs under the phrase-to-phrase posterior distribution.</Paragraph>
    <Paragraph position="14"> Once the PPI phrase pairs are set, the phrase translation probabilities are set based on the number of times each phrase pair is extracted from a sentence pair, i.e. from relative frequencies.</Paragraph>
    <Paragraph position="15"> For each foreign phrase v not in the Viterbi PPI : For all pairs (fm1 ,el1) and j1,j2 s.t. fj2j1 = v : For 1 [?] i1 [?] i2 [?] l, find</Paragraph>
    <Paragraph position="17"> HMM-based models are often used if posterior distributions are needed. Model-1 can also be used in this way (Venugopal et al., 2003), although it is a relatively weak alignment model. By comparison, finding posterior distributions under Model-4 is difficult. The Word-to-Phrase alignment model appears not to suffer this tradeoff: it is a good model of word alignment under which statistics such as the phrase-to-phrase posterior can be calculated.</Paragraph>
  </Section>
  <Section position="7" start_page="173" end_page="175" type="metho">
    <SectionTitle>
6 Translation Experiments
</SectionTitle>
    <Paragraph position="0"> We evaluate the quality of phrase pairs extracted from the bitext through the translation performance of the Translation Template Model (TTM) (Kumar et al., 2005), which is a phrase-based translation system implemented using weighted finite state transducers. Performance is measured by BLEU (Papineni and others, 2001).</Paragraph>
    <Paragraph position="1"> Chinese-English Translation We report performance on the NIST Chinese/English 2002, 2003 and 2004 (News only) MT evaluation sets. These consist of 878, 919, and 901 sentences, respectively. Each Chinese sentence has 4 reference translations.</Paragraph>
    <Paragraph position="2"> We evaluate two C-E translation systems. The smaller system is built on the FBIS C-E bitext collection. The language model used for this system is a trigram word language model estimated with 21M  words taken from the English side of the bitext; all language models are built with the SRILM toolkit using Kneser-Ney smoothing (Stolcke, 2002).</Paragraph>
    <Paragraph position="3"> The larger system is based on alignments generated over all available C-E bitext (the &amp;quot;ALL C-E&amp;quot; collection of Section 4). The language model is an equal-weight interpolated trigram model trained over 373M English words taken from the English side of the bitext and the LDC Gigaword corpus.</Paragraph>
    <Paragraph position="4"> Arabic-English Translation We also evaluate our WtoP alignment models in Arabic-English translation. We report results on a small and a large system. In each, Arabic text is tokenized by the Buckwalter analyzer provided by LDC. We test our models on NIST Arabic/English 2002, 2003 and 2004 (News only) MT evaluation sets that consists of 1043, 663 and 707 Arabic sentences, respectively. Each Arabic sentence has 4 reference translations.</Paragraph>
    <Paragraph position="5"> In the small system, the training bitext is from A-E News parallel text, with [?]3.5M words on the English side. We follow the same training procedure and configurations as in Chinese/English system in both translation directions. The language model is an equal-weight interpolated trigram built over [?]400M words from the English side of the bitext, including UN text, and the LDC English Gigaword collection. The large Arabic/English system employs the same language model. Alignments are generated over all A-E bitext available from LDC as of this submission; this consists of approx. 130M words on the English side.</Paragraph>
    <Paragraph position="6"> WtoP Model and Model-4 Comparison We first look at translation performance of the small A-E and C-E systems, where alignment models are trained over the smaller bitext collections. The base-line systems (Table 4, line 1) are based on Model-4 Viterbi Phrase-Extract PPIs.</Paragraph>
    <Paragraph position="7"> We compare WtoP alignments directly to Model-4 alignments by extracting PPIs from the WtoP alignments using the Viterbi Phrase-Extract procedure (Table 4, line 3). In C-E translation, performance is comparable to that of Model-4; in A-E translation, performance lags slightly. As we add phrase pairs to the WtoP Viterbi Phrase-Extract PPI via the Phrase-Posterior Augmentation procedure (Table 4, lines 4-7), we obtain a [?]1% improvement in BLEU; the value of Tp = 0.7 gives improvements across all sets. In C-E translation, this yields good gains relative to Model-4, while in A-E we match or improve the Model-4 performance.</Paragraph>
    <Paragraph position="8"> The performance gains through PPI augmentation are consistent with increased PPI coverage of the test set. We tabulate the percentage of test set phrases that appear in each of the PPIs (the 'cvg' values in Table 4). The augmentation scheme is designed specifically to increase coverage, and we find that BLEU score improvements track the phrase coverage of the test set. This is further confirmed by the experiment of Table 4, line 2 in which we take the PPI extracted from Model-4 Viterbi alignments, and add phrase pairs to it using the Phrase-Posterior augmentation scheme with Tp = 0.7. We find that the augmentation scheme under the WtoP models can be used to improve the Model-4 PPI itself.</Paragraph>
    <Paragraph position="9"> We also investigate C-E and A-E translation performance with PPIs extracted from large bitexts.</Paragraph>
    <Paragraph position="10">  Performance of systems based on Model-4 Viterbi Phrase-Extract PPIs is shown in Table 4, line 8.</Paragraph>
    <Paragraph position="11"> To train Model-4 using GIZA++, we split the bi-texts into two (A-E) or three (C-E) partitions, and train models for each division separately; we find that memory usage is otherwise too great. These serve as a single set of alignments for the bitext, as if they had been generated under a single alignment model. When we translate with Viterbi Phrase-Extract PPIs taken from WtoP alignments created over all available bitext, we find comparable performance to the Model-4 baseline (Table 4, line 9). Using the Phrase-Posterior augmentation scheme with Tp = 0.7 yields further improvement (Table 4, line 10). Pooling the sets to form two large C-E and A-E test sets, the A-E system improvements are significant at a 95% level (Och, 2003); the C-E systems are only equivalent.</Paragraph>
  </Section>
class="xml-element"></Paper>