Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 169–176, Vancouver, October 2005. ©2005 Association for Computational Linguistics
HMM Word and Phrase Alignment for Statistical Machine Translation
Yonggang Deng¹, William Byrne¹,²
¹ Center for Language and Speech Processing, Johns Hopkins University
Baltimore, MD 21210, USA
² Machine Intelligence Lab, Cambridge University Engineering Department
Trumpington Street, Cambridge CB2 1PZ, UK
dengyg@jhu.edu, wjb31@cam.ac.uk
Abstract
HMM-based models are developed for the
alignment of words and phrases in bitext.
The models are formulated so that align-
ment and parameter estimation can be per-
formed efficiently. We find that Chinese-
English word alignment performance is
comparable to that of IBM Model-4 even
over large training bitexts. Phrase pairs
extracted from word alignments generated
under the model can also be used for
phrase-based translation, and in Chinese
to English and Arabic to English transla-
tion, performance is comparable to sys-
tems based on Model-4 alignments. Di-
rect phrase pair induction under the model
is described and shown to improve trans-
lation performance.
1 Introduction
Describing word alignment is one of the fundamen-
tal goals of Statistical Machine Translation (SMT).
Alignment specifies how word order changes when
a sentence is translated into another language, and
given a sentence and its translation, alignment spec-
ifies translation at the word level. It is straightfor-
ward to extend word alignment to phrase alignment:
two phrases align if their words align.
Deriving phrase pairs from word alignments is
now widely used in phrase-based SMT. Parameters
of a statistical word alignment model are estimated
from bitext, and the model is used to generate word
alignments over the same bitext. Phrase pairs are ex-
tracted from the aligned bitext and used in the SMT
system. With this approach the quality of the under-
lying word alignments can have a strong influence
on phrase-based SMT system performance. The
common practice therefore is to extract phrase pairs
from the best attainable word alignments. Currently,
Model-4 alignments (Brown et al., 1993) as
produced by GIZA++ (Och and Ney, 2000) are often
the best that can be obtained, especially with large
bitexts.
Despite its modeling power and widespread use,
Model-4 has shortcomings. Its formulation is such
that maximum likelihood parameter estimation and
bitext alignment are implemented by approximate,
hill-climbing, methods. Consequently parameter es-
timation can be slow, memory intensive, and diffi-
cult to parallelize. It is also difficult to compute
statistics under Model-4. This limits its usefulness
for modeling tasks other than the generation of word
alignments.
We describe an HMM alignment model devel-
oped as an alternative to Model-4. In the word align-
ment and phrase-based translation experiments to
be presented, its performance is comparable or im-
proved relative to Model-4. Practically, we can train
the model by the Forward-Backward algorithm, and
by parallelizing estimation, we can control memory
usage, reduce the time needed for training, and in-
crease the bitext used for training. We can also com-
pute statistics under the model in ways not practical
with Model-4, and we show the value of this in the
extraction of phrase pairs from bitext.
2 HMM Word and Phrase Alignment
Our goal is to develop a generative probabilistic
model of Word-to-Phrase (WtoP) alignment. We
start with an l-word source sentence $e = e_1^l$, and an
m-word target sentence $f = f_1^m$, which is realized
as a sequence of K phrases: $f = v_1^K$.
Each phrase is generated as a translation of one
source word, which is determined by the alignment
sequence $a_1^K$: $e_{a_k} \to v_k$. The length of each phrase
is specified by the process $\phi_1^K$, which is constrained
so that $\sum_{k=1}^{K} \phi_k = m$.
We also allow target phrases to be inserted, i.e. to
be generated by a NULL source word. For this, we
define a binary hallucination sequence $h_1^K$: if $h_k = 0$,
then NULL $\to v_k$; if $h_k = 1$ then $e_{a_k} \to v_k$.
With all these quantities gathered into an alignment
$a = (\phi_1^K, a_1^K, h_1^K, K)$, the modeling objective
is to realize the conditional distribution $P(f,a|e)$.
With the assumption that $P(f,a|e) = 0$ if $f \neq v_1^K$,
we write $P(f,a|e) = P(v_1^K, K, a_1^K, h_1^K, \phi_1^K | e)$ and
$$
\begin{aligned}
P(v_1^K, K, a_1^K, h_1^K, \phi_1^K \mid e)
= {} & \epsilon(m \mid l) \times P(K \mid m, e) \\
& \times P(a_1^K, \phi_1^K, h_1^K \mid K, m, e) \\
& \times P(v_1^K \mid a_1^K, h_1^K, \phi_1^K, K, m, e)
\end{aligned}
$$
We now describe the component distributions.
Sentence Length $\epsilon(m|l)$ determines the target
sentence length. It is not needed during alignment,
where sentence lengths are known, and is ignored.
Phrase Count $P(K|m,e)$ specifies the number of
target phrases. We use a simple, single-parameter
distribution, with $\eta = 8.0$ throughout:
$$P(K \mid m, e) = P(K \mid m, l) \propto \eta^K$$
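As an illustration of how the single parameter η shapes this distribution, the following minimal sketch normalizes η^K explicitly. It assumes that K ranges over 1, ..., m (every phrase has at least one word); the function name `phrase_count_distribution` is hypothetical and not taken from the paper.

```python
def phrase_count_distribution(m, eta):
    """Return [P(K=1), ..., P(K=m)] with P(K) proportional to eta**K.

    Assumption (not stated in the paper): the support of K is 1..m,
    since each of the K phrases covers at least one of the m target words.
    """
    weights = [eta ** k for k in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # With eta > 1, probability mass shifts toward many short phrases.
    for eta in (1.0, 2.0, 8.0):
        dist = phrase_count_distribution(5, eta)
        print(eta, [round(p, 4) for p in dist])
```

With η = 1 the distribution is uniform over K, while larger η concentrates mass on K close to m, i.e. on alignments made mostly of single-word phrases.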
Word-to-Phrase Alignment Alignment is a
Markov process that specifies the lengths of phrases
and their alignment with source words:
$$
\begin{aligned}
P(a_1^K, h_1^K, \phi_1^K \mid K, m, e)
&= \prod_{k=1}^{K} P(a_k, h_k, \phi_k \mid a_{k-1}, \phi_{k-1}, e) \\
&= \prod_{k=1}^{K} p(a_k \mid a_{k-1}, h_k; l)\, d(h_k)\, n(\phi_k; e_{a_k})
\end{aligned}
$$
The actual word-to-phrase alignment ($a_k$) is a first-
order Markov process, as in HMM-based word-to-
word alignment (Vogel et al., 1996). It necessarily
depends on the hallucination variable:
$$
p(a_j \mid a_{j-1}, h_j; l) =
\begin{cases}
1 & a_j = a_{j-1},\ h_j = 0 \\
0 & a_j \neq a_{j-1},\ h_j = 0 \\
a(a_j \mid a_{j-1}; l) & h_j = 1
\end{cases}
$$
This formulation allows target phrases to be in-
serted without disrupting the Markov dependencies
of phrases aligned to actual source words.
The phrase length model $n(\phi; e)$ gives the probability
that a word e produces a phrase with $\phi$ words
in the target language; $n(\phi; e)$ is defined for
$\phi = 1, \cdots, N$. The hallucination process is a simple
i.i.d. process, where $d(0) = p_0$, and $d(1) = 1 - p_0$.
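The per-phrase factor of the alignment process, p(a_k | a_{k-1}, h_k; l) d(h_k) n(φ_k; e_{a_k}), can be sketched as follows. This is only an illustration under stated assumptions: source positions are 0-based, and the table layouts (`a_table`, `n_table`) and all function names are hypothetical choices made here, not the authors' implementation.

```python
def transition(a_k, a_prev, h_k, a_table):
    """p(a_k | a_prev, h_k; l): an inserted phrase (h_k == 0) keeps the
    previous state, so the Markov chain over real source words is undisturbed."""
    if h_k == 0:
        return 1.0 if a_k == a_prev else 0.0
    return a_table[a_prev][a_k]


def hallucination(h_k, p0):
    """d(h_k): i.i.d. choice between NULL-generated (0) and word-generated (1)."""
    return p0 if h_k == 0 else 1.0 - p0


def alignment_step(a_k, a_prev, h_k, phi_k, src, a_table, p0, n_table):
    """One factor of the product over phrases k = 1..K.

    n_table[word][phi] is an assumed layout for the phrase length model n(phi; e).
    """
    return (transition(a_k, a_prev, h_k, a_table)
            * hallucination(h_k, p0)
            * n_table[src[a_k]][phi_k])
```

For example, with `h_k = 0` any step that changes the source position has probability zero, which is how inserted phrases avoid disrupting the Markov dependencies.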
Word-to-Phrase Translation The translation of
words to phrases is given as
$$P(v_1^K \mid a_1^K, h_1^K, \phi_1^K, K, m, e) = \prod_{k=1}^{K} p(v_k \mid e_{a_k}, h_k, \phi_k)$$
We introduce the notation $v_k = v_k[1], \ldots, v_k[\phi_k]$
and a dummy variable $x_k$ (for phrase insertion):
$$
x_k =
\begin{cases}
e_{a_k} & h_k = 1 \\
\text{NULL} & h_k = 0
\end{cases}
$$
We define two models of word-to-phrase translation.
The simplest is based on context-independent word-
to-word translation:
$$p(v_k \mid e_{a_k}, h_k, \phi_k) = \prod_{j=1}^{\phi_k} t(v_k[j] \mid x_k)$$
We also define a model that captures foreign word
context with bigram translation probabilities:
$$p(v_k \mid e_{a_k}, h_k, \phi_k) = t(v_k[1] \mid x_k) \prod_{j=2}^{\phi_k} t_2(v_k[j] \mid v_k[j-1], x_k)$$
Here, $t(f|e)$ is the usual context-independent word-
to-word translation probability. The bigram translation
probability $t_2(f|f',e)$ specifies the likelihood
that target word f is to follow f′ in a phrase generated
by source word e.
2.1 Properties of the Model and Prior Work
The formulation of the WtoP alignment model
was motivated by both the HMM word alignment
model (Vogel et al., 1996) and IBM Model-4 with
the goal of building on the strengths of each.
The relationship with the word-to-word HMM
alignment model is straightforward. For example,
constraining the phrase length component n(φ;e)
to permit only phrases of one word would give a
word-to-word HMM alignment model. The extensions
introduced are the phrase count and phrase
length models and the bigram translation distribution.
The hallucination process is motivated by the
incorporation of NULL alignments into Markov alignment
models, as done by Och and Ney (2003).
The phrase length model is motivated by
Toutanova et al. (2002) who introduced ‘stay’ prob-
abilities in HMM alignment as an alternative to word
fertility. By comparison, Word-to-Phrase HMM
alignment models contain detailed models of state
occupancy, motivated by the IBM fertility model,
which are more powerful than a single staying pa-
rameter. In fact, the WtoP model is a segmental
Hidden Markov Model (Ostendorf et al., 1996), in
which states emit observation sequences.
Comparison with Model-4 is less straightforward.
The main features of Model-4 are NULL source
words, source word fertility, and the distortion
model. The WtoP alignment model includes the
first two of these. However distortion, which al-
lows hypothesized words to be distributed through-
out the target sentence, is difficult to incorporate into
a model that supports efficient DP-based search. We
preserve efficiency in the WtoP model by insisting
that target words form connected phrases; this is not
as general as Model-4 distortion. This weakness
is somewhat offset by a more powerful (Markov)
alignment process as well as by the phrase count
distribution. Despite these differences, the WtoP
alignment model and Model-4 allow similar alignments.
For example, in Fig. 1, Model-4 would allow
$f_1$, $f_3$, and $f_4$ to be generated by $e_1$ with a fertility
of 3. Under the WtoP model, $e_1$ could generate $f_1$
and $f_3 f_4$ with phrase lengths 1 and 2, respectively:
source words can generate more than one phrase.
[Figure 1: Word-to-Word and Word-to-Phrase Links]
This alignment could also be generated via four
single word foreign phrases. The balance between
word-to-word and word-to-phrase alignments is set
by the phrase count distribution parameter η. As
η increases, alignments with shorter phrases are
favored, and for very large η the model allows
only word-to-word alignments (see Fig. 2). Al-
though the WtoP alignment model is more com-
plex than the word-to-word HMM alignment model,
the Baum-Welch and Viterbi algorithms can still be
used. Word-to-word alignments are generated by
the Viterbi algorithm: $\hat{a} = \arg\max_a P(f,a|e)$; if
$e_{a_k} \to v_k$, $e_{a_k}$ is linked to all the words in $v_k$.
The bigram translation probability relies on word
context, known to be helpful in translation (Berger
et al., 1996), to improve the identification of tar-
get phrases. As an example, f is the Chinese word
for “world trade center”. Table 1 shows how the
likelihood of the correct English phrase is improved
with bigram translation probabilities; this example
is from the C→E, N=4 system of Table 2.
Model                          unigram   bigram
P(world | f)                   0.06      0.06
P(trade | world, f)            0.06      0.99
P(center | trade, f)           0.06      0.99
P(world trade center | f, 3)   0.0002    0.0588
Table 1: Context in Bigram Phrase Translation.
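The last row of Table 1 follows from the two phrase translation models by simple multiplication. The sketch below reproduces that arithmetic; the dictionary layout and function names are illustrative assumptions, and the source word f is left implicit because all probabilities in the table are conditioned on the same f.

```python
from math import prod


def phrase_prob_unigram(phrase, t):
    """Context-independent model: product of t(word | f) over the phrase."""
    return prod(t[w] for w in phrase)


def phrase_prob_bigram(phrase, t, t2):
    """Bigram model: t(first | f) times t2(word | previous word, f)."""
    p = t[phrase[0]]
    for prev, cur in zip(phrase, phrase[1:]):
        p *= t2[(prev, cur)]
    return p


# Values taken from Table 1.
t = {"world": 0.06, "trade": 0.06, "center": 0.06}
t2 = {("world", "trade"): 0.99, ("trade", "center"): 0.99}
phrase = ["world", "trade", "center"]

unigram = phrase_prob_unigram(phrase, t)   # 0.06 * 0.06 * 0.06
bigram = phrase_prob_bigram(phrase, t, t2)  # 0.06 * 0.99 * 0.99
```

Rounded to four decimal places these give 0.0002 and 0.0588, matching the table.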
There is of course much prior work in translation
that incorporates phrases. Sumita et al. (2004) develop
a model of phrase-to-phrase alignment, which,
while based on an HMM alignment process, appears
to be deficient. Marcu and Wong (2002) propose a
model to learn lexical correspondences at the phrase
level. To our knowledge, ours is the first non-
syntactic model of bitext alignment (as opposed to
translation) that links words and phrases.
3 Embedded Alignment Model Estimation
We now discuss estimation of the WtoP model pa-
rameters by the EM algorithm. Since the WtoP
model can be treated as an HMM with a very complex
state space, it is straightforward to apply Baum-Welch
parameter estimation. We show the forward
recursion as an example.
Given a sentence pair $(e_1^l, f_1^m)$, the forward probability
$\alpha_j(i,\phi)$ is defined as the probability of generating
the first j target words with the added condition
that the target words $f_{j-\phi+1}^{j}$ form a phrase
aligned to source word $e_i$. It can be calculated recursively
(omitting the hallucination process, for simplicity) as
$$
\alpha_j(i,\phi) = \Big\{ \sum_{i',\phi'} \alpha_{j-\phi}(i',\phi')\, a(i \mid i'; l) \Big\}
\cdot \eta \cdot n(\phi; e_i) \cdot t(f_{j-\phi+1} \mid e_i)
\cdot \prod_{j'=j-\phi+2}^{j} t_2(f_{j'} \mid f_{j'-1}, e_i)
$$
This recursion is over a trellis of $l(N + 1)m$ nodes.
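The forward recursion can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: like the recursion above it omits the hallucination process, and it additionally assumes an initial distribution `init` over source positions for the first phrase (the paper does not spell out the boundary condition). All function and parameter names are hypothetical, and the model components are passed in as plain Python callables and lists.

```python
def forward(src, tgt, N, eta, init, a, n, t, t2):
    """Forward probabilities alpha[j][(i, phi)] for a toy WtoP model.

    src, tgt         : lists of source / target words
    N                : maximum phrase length
    eta              : phrase count parameter
    init[i]          : assumed probability that the first phrase aligns to src[i]
    a[i_prev][i]     : transition probability a(i | i_prev; l)
    n(phi, e)        : phrase length probability
    t(f, e)          : word-to-word translation probability
    t2(f, f_prev, e) : bigram translation probability
    """
    l, m = len(src), len(tgt)
    alpha = [dict() for _ in range(m + 1)]  # alpha[0] stays empty
    for j in range(1, m + 1):               # phrase ends at target word j
        for phi in range(1, min(N, j) + 1):
            start = j - phi                 # number of target words before the phrase
            for i in range(l):
                if start == 0:
                    incoming = init[i]
                else:
                    incoming = sum(p * a[i_prev][i]
                                   for (i_prev, _), p in alpha[start].items())
                # First word by t, remaining words by the bigram table t2.
                emit = t(tgt[start], src[i])
                for jp in range(start + 1, j):
                    emit *= t2(tgt[jp], tgt[jp - 1], src[i])
                alpha[j][(i, phi)] = incoming * eta * n(phi, src[i]) * emit
    return alpha


def sentence_score(alpha):
    """Sum of the final column: total (eta-weighted) score of all alignments."""
    return sum(alpha[-1].values())
```

Because each phrase contributes one factor of η, the value returned by `sentence_score` is the sum over all segmentations and alignments of the unnormalized score, which is what Baum-Welch needs for computing posteriors.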
Models are trained from a flat-start. We begin
with 10 iterations of EM to train Model-1, followed
by 5 EM iterations to train Model-2 (Brown et al.,
1993). We initialize the parameters of the word-
to-word HMM alignment model by collecting word
alignment counts from the Model-2 Viterbi align-
ments, and refine the word-to-word HMM alignment
model by 5 iterations of the Baum-Welch algorithm.
We increase the order of the WtoP model (N) from
2 to the final value in increments of 1, performing
5 Baum-Welch iterations at each step. At the final
value of N, we introduce the bigram translation
probability; we use Witten-Bell smoothing (1991)
as a backoff strategy for $t_2$, although other strategies
are possible.
4 Bitext Word Alignment
We now investigate bitext word alignment perfor-
mance. We start with the FBIS Chinese/English
parallel corpus which consists of approx. 10M En-
glish/7.5M Chinese words. The Chinese side of the
corpus is segmented into words by the LDC segmenter¹.
The alignment test set consists of 124 sentences
from the NIST 2001 dry-run MT-eval² set that
are manually word aligned.
We first analyze the distribution of word links
within these manual alignments. Of the Chinese
words which are aligned to more than one English
word, 82% align with consecutive
English words (phrases). In the other direction,
among all English words which are aligned to multiple
Chinese words, 88% align to Chinese
phrases. In this collection, at least, word-to-phrase
alignments are plentiful.
¹ http://www.ldc.upenn.edu/Projects/Chinese
² http://www.nist.gov/speech/tests/mt
Model               AER1−1   AER1−N   AER
C→E
Model-4             37.9     68.3     37.3
HMM, N=1            42.8     72.9     42.0
HMM, N=2            38.3     71.2     38.1
HMM, N=3            37.4     69.5     37.8
HMM, N=4            37.1     69.1     37.8
+ bigram t-table    37.5     65.8     37.1
E→C
Model-4             42.3     87.2     45.0
HMM, N=1            45.0     90.6     47.2
HMM, N=2            42.7     87.5     44.5
+ bigram t-table    44.2     85.5     45.1
Table 2: FBIS Bitext Alignment Error Rate.
[Figure 2: Balancing Word and Phrase Alignments. The plot shows the number of hypothesized links (1−1 Links, 1−N Links, Total Links) and the Overall AER as functions of η.]
Alignment performance is measured by the
Alignment Error Rate (AER) (Och and Ney, 2003)
$$\mathrm{AER}(B; B') = 1 - 2 \times |B \cap B'| / (|B'| + |B|)$$
where B is a set of reference word links, and B′ is the
set of word links generated automatically.
AER gives a general measure of word alignment
quality. We are also interested in how the model
performs over the word-to-word and word-to-phrase
alignments it supports. We split the reference alignments
into two subsets: $B_{1-1}$ contains word-to-word
reference links (e.g. 1→1 in Fig. 1), and
$B_{1-N}$ contains word-to-phrase reference links (e.g.
1→3, 1→4 in Fig. 1). The automatic alignment B′
is partitioned similarly. We define additional AERs:
$\mathrm{AER}_{1-1} = \mathrm{AER}(B_{1-1}, B'_{1-1})$ and
$\mathrm{AER}_{1-N} = \mathrm{AER}(B_{1-N}, B'_{1-N})$, which measure word-to-word
and word-to-phrase alignment, separately.
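The three measures can be computed as in the following sketch. The AER formula is the one given above; the partitioning rule is an assumption made here based on the Fig. 1 example: for each source word, maximal runs of consecutive target positions of length one are treated as word-to-word links, and longer runs as word-to-phrase links. Links are (source index, target index) pairs, and the function names are hypothetical.

```python
def aer(ref, hyp):
    """AER(B; B') = 1 - 2|B ∩ B'| / (|B'| + |B|)."""
    ref, hyp = set(ref), set(hyp)
    if not ref and not hyp:
        return 0.0  # assumed convention; the formula is undefined here
    return 1.0 - 2.0 * len(ref & hyp) / (len(ref) + len(hyp))


def split_links(links):
    """Partition links into (1-1 links, 1-N links) by runs of consecutive
    target positions per source word (assumed reading of the Fig. 1 example)."""
    by_src = {}
    for i, j in links:
        by_src.setdefault(i, set()).add(j)
    one_to_one, one_to_n = set(), set()
    for i, js in by_src.items():
        js = sorted(js)
        run = [js[0]]
        for j in js[1:] + [None]:          # None flushes the final run
            if j is not None and j == run[-1] + 1:
                run.append(j)
            else:
                bucket = one_to_one if len(run) == 1 else one_to_n
                bucket.update((i, x) for x in run)
                run = [j]
    return one_to_one, one_to_n


def aer_breakdown(ref, hyp):
    """Return (AER_1-1, AER_1-N, AER) for reference and hypothesized links."""
    ref_11, ref_1n = split_links(ref)
    hyp_11, hyp_1n = split_links(hyp)
    return aer(ref_11, hyp_11), aer(ref_1n, hyp_1n), aer(ref, hyp)
```

Applied to the links of source word 1 in Fig. 1, `split_links({(1, 1), (1, 3), (1, 4)})` places 1→1 in the word-to-word subset and 1→3, 1→4 in the word-to-phrase subset.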
Table 2 presents the three AER measurements for
the WtoP alignment models trained as described in
Section 3. GIZA++ Model 4 alignment performance
is also presented for comparison. We note first that
the word-to-word HMM (N=1) alignment model is
worse than Model 4, as expected. For the WtoP
models in the C→E direction, we see reduced AER
for phrase lengths up to 4, although in the E→C direction,
AER is reduced only for phrases of length
2; performance for N > 2 is not reported.
In introducing the bigram phrase translation (the
bigram t-table), there is a tradeoff between word-
to-word and word-to-phrase alignment quality. As
mentioned, the bigram t-table increases the likeli-
hood of word-to-phrase alignments. In both transla-
tion directions, this reduces the AER1−N. However,
it also causes increases in AER1−1, primarily due to
a drop in recall: fewer word-to-word alignments are
produced. For C→E, this is not severe enough to
cause an overall AER increase; however, in E→C,
AER does increase.
Fig. 2 (C→E, N=4) shows how the 1-1 and 1-
N alignment behavior is balanced by the phrase
count parameter. As η increases, the model favors
alignments with more word-to-word links and fewer
word-to-phrase links; the overall Alignment Error
Rate (AER) suggests a good balance at η = 8.0.
After observing that the WtoP model performs as
well as Model-4 over the FBIS C-E bitext, we inves-
tigated performance over these large bitexts :
- “NEWS” containing non-UN parallel Chi-
nese/English corpora from LDC (mainly FBIS, Xin-
hua, Hong Kong, Sinorama, and Chinese Treebank).
- “NEWS+UN01-02” also including UN parallel
corpora from the years 2001 and 2002.
- “ALL C-E” refers to all the C-E bitext available
from LDC as of this submission; this consists of the
NEWS corpora with the UN bitext from all years.
Over all these collections, WtoP alignment per-
formance (Table 3) is comparable to that of Model-
4. We do note a small degradation in the E→C WtoP
alignments. It is quite possible that this one-to-many
model suffers slightly with English as the source and
Chinese as the target, since English sentences tend to
be longer. Notably, simply increasing the amount of
bitext used in training need not improve AER. How-
ever, larger aligned bitexts can give improved phrase
pair coverage of the test set.
Bitext          English Words   Model   C→E    E→C
NEWS            71M             M-4     37.1   45.3
                                WtoP    36.1   44.8
NEWS+UN01-02    96M             M-4     36.1   43.4
                                WtoP    36.4   44.2
ALL C-E         200M            WtoP    36.8   44.7
Table 3: AER Over Large C-E Bitexts.
One of the desirable features of HMMs is that the
Forward-Backward steps can be run in parallel: bitext
is partitioned; the Forward-Backward algorithm
is run over the subsets on different CPUs; statistics
are merged to reestimate model parameters. Parti-
tioning the bitext also reduces the memory usage,
since different cooccurrence tables can be kept for
each partition. With the “ALL C-E” bitext collec-
tion, a single set of WtoP models (C→E, N=4, bi-
gram t-table) can be trained over 200M words of
Chinese-English bitext by splitting training over 40
CPUs; each Forward-Backward process takes less
than 2GB of memory and the training run finishes
in five days. By contrast, the 96M English word
NEWS+UN01-02 is about the largest C-E bitext
over which we can train Model-4 with our GIZA++
configuration and computing infrastructure.
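The partition, accumulate, and merge pattern described above can be sketched as follows. To keep the example short, a Model-1-style word posterior stands in for the Forward-Backward computation of the WtoP model, and the partitions are processed in a plain loop rather than on separate CPUs; both are simplifications assumed here, and all function names are hypothetical.

```python
from collections import Counter


def partition(bitext, n_parts):
    """Split the list of sentence pairs into n_parts roughly equal subsets."""
    return [bitext[k::n_parts] for k in range(n_parts)]


def expected_counts(pairs, t, default=1.0):
    """E-step over one partition: expected (source word, target word) counts.

    Stand-in for Forward-Backward: each target word is distributed over the
    source words of its sentence in proportion to t[(e, f)].
    """
    counts = Counter()
    for src, tgt in pairs:
        for f in tgt:
            z = sum(t.get((e, f), default) for e in src)
            for e in src:
                counts[(e, f)] += t.get((e, f), default) / z
    return counts


def merge_and_reestimate(count_list):
    """Merge per-partition statistics, then renormalize per source word."""
    total = Counter()
    for c in count_list:
        total.update(c)  # Counter.update adds counts
    norm = Counter()
    for (e, f), c in total.items():
        norm[e] += c
    return {(e, f): c / norm[e] for (e, f), c in total.items()}


bitext = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]
# Each call to expected_counts could run in a separate process.
stats = [expected_counts(part, {}) for part in partition(bitext, 2)]
t_new = merge_and_reestimate(stats)
```

Since the expected counts are additive over sentence pairs, the merged estimate is the same as the one obtained from a single pass over the whole bitext, while each worker only needs the cooccurrence table for its own partition.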
Based on these and other experiments, in this pa-
per we set a maximum value of N = 4 for F→E; in
E→F, we set N=2 and omit the bigram phrase trans-
lation probability; η is set to 8.0. We do not claim
that this is optimal, however.
5 Phrase Pair Induction
A common approach to phrase-based translation is
to extract an inventory of phrase pairs (PPI) from bitext
(Koehn et al., 2003). For example, in the phrase-
extract algorithm (Och, 2002), a word alignment
$\hat{a}_1^m$ is generated over the bitext, and all word subsequences
$e_{i_1}^{i_2}$ and $f_{j_1}^{j_2}$ are found that satisfy:
$$\hat{a}_1^m : \hat{a}_j \in [i_1, i_2] \text{ iff } j \in [j_1, j_2] . \tag{1}$$
The PPI comprises all such phrase pairs $(e_{i_1}^{i_2}, f_{j_1}^{j_2})$.
The process can be stated slightly differently.
First, we define a set of alignments:
$$A(i_1,i_2;j_1,j_2) = \{ a_1^m : a_j \in [i_1,i_2] \text{ iff } j \in [j_1,j_2] \} .$$
If $\hat{a}_1^m \in A(i_1,i_2;j_1,j_2)$ then $(e_{i_1}^{i_2}, f_{j_1}^{j_2})$ form a
phrase pair.
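The membership test for A(i1,i2;j1,j2) translates directly into code. The sketch below checks Equation 1 for a single alignment and enumerates all spans that satisfy it by exhaustive search; it assumes the alignment is given as a list whose entry j holds the 1-based source position of target word j, and the function names are hypothetical. It is meant to make the definition concrete, not to be an efficient extractor.

```python
def in_alignment_set(a, i1, i2, j1, j2):
    """True if alignment a belongs to A(i1, i2; j1, j2).

    a[j - 1] is the (1-based) source position aligned to target word j.
    The condition is: a_j in [i1, i2] if and only if j in [j1, j2].
    """
    return all((i1 <= a_j <= i2) == (j1 <= j <= j2)
               for j, a_j in enumerate(a, start=1))


def extract_phrase_pairs(a, l):
    """All spans (i1, i2, j1, j2) for which the alignment satisfies Eq. 1.

    l is the source sentence length; the target length is len(a).
    """
    m = len(a)
    pairs = []
    for j1 in range(1, m + 1):
        for j2 in range(j1, m + 1):
            for i1 in range(1, l + 1):
                for i2 in range(i1, l + 1):
                    if in_alignment_set(a, i1, i2, j1, j2):
                        pairs.append((i1, i2, j1, j2))
    return pairs
```

For a two-word source sentence and the alignment `[1, 2, 2]` (target word 1 aligned to source word 1, target words 2 and 3 aligned to source word 2), the spans found are (1,1,1,1), (2,2,2,3), and the whole-sentence pair (1,2,1,3).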
Viewed in this way, there are many possible alignments
under which phrases might be paired, and
the selection of phrase pairs need not be based on
a single alignment. Rather than simply accepting a
phrase pair $(e_{i_1}^{i_2}, f_{j_1}^{j_2})$ if the unique MAP alignment
satisfies Equation 1, we can assign a probability to
phrases occurring as translation pairs:
$$P(f, A(i_1,i_2;j_1,j_2) \mid e) = \sum_{a : a_1^m \in A(i_1,i_2;j_1,j_2)} P(f, a \mid e)$$
For a fixed set of indices $i_1, i_2, j_1, j_2$, the quantity
$P(f, A(i_1,i_2;j_1,j_2) \mid e)$ can be computed efficiently
using a modified Forward algorithm. Since
$P(f|e)$ can also be computed by the Forward algorithm,
the phrase-to-phrase posterior distribution
$P(A(i_1,i_2;j_1,j_2) \mid f, e)$ is easily found.
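To make the posterior concrete, the sketch below computes it by brute-force enumeration for very short sentences. Two simplifications are assumed and should be read as such: a word-to-word HMM (one source position per target word, with an assumed initial distribution `init`) stands in for the full WtoP model, and explicit enumeration over all alignments replaces the modified Forward algorithm, so this is only usable on toy examples. All names are hypothetical.

```python
from itertools import product


def joint(f, a, e, init, trans, t):
    """P(f, a | e) under a toy word-to-word HMM (0-based positions in a)."""
    p = 1.0
    prev = None
    for f_j, a_j in zip(f, a):
        p *= init[a_j] if prev is None else trans[prev][a_j]
        p *= t[(f_j, e[a_j])]
        prev = a_j
    return p


def phrase_posterior(f, e, init, trans, t, i1, i2, j1, j2):
    """P(A(i1,i2;j1,j2) | f, e), with 1-based span indices as in the text."""
    numerator = denominator = 0.0
    for a in product(range(len(e)), repeat=len(f)):
        p = joint(f, a, e, init, trans, t)
        denominator += p  # accumulates P(f | e)
        if all((i1 <= a_j + 1 <= i2) == (j1 <= j <= j2)
               for j, a_j in enumerate(a, start=1)):
            numerator += p  # accumulates P(f, A | e)
    return numerator / denominator
```

A useful sanity check is that the span covering both whole sentences always has posterior 1, since every alignment satisfies the condition trivially.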
PPI Induction Strategies In the phrase-extract
algorithm (Och, 2002), the alignment ˆa is gener-
ated as follows: Model-4 is trained in both directions
(e.g. F→E and E→F); two sets of word alignments
are generated by the Viterbi algorithm for each set
of models; and the two alignments are merged. This
forms a static aligned bitext. Next, all foreign word
sequences up to a given length (here, 5 words) are
extracted from the test set. For each of these, a
phrase pair is added to the PPI if the foreign phrase
can be found aligned to an English phrase under
Eq 1. We refer to the result as the Model-4 Viterbi
Phrase-Extract PPI.
Constructed in this way, the PPI is limited to
phrase pairs which can be found in the Viterbi align-
ments. Some foreign phrases which do appear in
the training bitext will not be included in the PPI
because suitable English phrases cannot be found.
To add these to the PPI we can use the phrase-to-
phrase posterior distribution to find English phrases
as candidate translations. This adds phrases to the
Viterbi Phrase-Extract PPI and increases the test set
coverage. A somewhat ad hoc PPI Augmentation
algorithm is given below.
Condition (A) extracts phrase pairs based on the
geometric mean of the E→F and F→E posteriors
(Tg = 0.01 throughout). The threshold Tp selects
additional phrase pairs under a more forgiving crite-
rion: as Tp decreases, more phrase pairs are added
and PPI coverage increases. Note that this algorithm
is constructed specifically to improve a Viterbi PPI;
it is certainly not the only way to extract phrase pairs
under the phrase-to-phrase posterior distribution.
Once the PPI phrase pairs are set, the phrase trans-
lation probabilities are set based on the number of
times each phrase pair is extracted from a sentence
pair, i.e. from relative frequencies.
For each foreign phrase v not in the Viterbi PPI:
  For all pairs $(f_1^m, e_1^l)$ and $j_1, j_2$ s.t. $f_{j_1}^{j_2} = v$:
    For $1 \le i_1 \le i_2 \le l$, find
      $f(i_1,i_2) = P_{F \to E}(A(i_1,i_2;j_1,j_2) \mid e_1^l, f_1^m)$
      $b(i_1,i_2) = P_{E \to F}(A(i_1,i_2;j_1,j_2) \mid e_1^l, f_1^m)$
      $g(i_1,i_2) = \sqrt{f(i_1,i_2)\, b(i_1,i_2)}$
    $(\hat{i}_1, \hat{i}_2) = \arg\max_{1 \le i_1, i_2 \le l} g(i_1,i_2)$, and set $u = e_{\hat{i}_1}^{\hat{i}_2}$
    Add (u,v) to the PPI if any of A, B, or C hold:
      $b(\hat{i}_1,\hat{i}_2) \ge T_g$ and $f(\hat{i}_1,\hat{i}_2) \ge T_g$   (A)
      $b(\hat{i}_1,\hat{i}_2) < T_g$ and $f(\hat{i}_1,\hat{i}_2) > T_p$   (B)
      $f(\hat{i}_1,\hat{i}_2) < T_g$ and $b(\hat{i}_1,\hat{i}_2) > T_p$   (C)
PPI Augmentation via Phrase-Posterior Induction
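The selection and acceptance steps of the augmentation procedure can be sketched as follows for a single occurrence of a foreign phrase. The posteriors f(i1,i2) and b(i1,i2) are assumed to have been computed already and are passed in as dictionaries keyed by source span; the function names and this data layout are hypothetical.

```python
from math import sqrt


def best_span(f_post, b_post):
    """Return the span (i1, i2) maximizing g = sqrt(f * b).

    f_post, b_post: dicts mapping (i1, i2) to the F->E and E->F posteriors.
    """
    return max(f_post, key=lambda span: sqrt(f_post[span] * b_post[span]))


def accept(f_val, b_val, Tg, Tp):
    """Conditions (A), (B), (C) from the augmentation procedure."""
    cond_a = b_val >= Tg and f_val >= Tg
    cond_b = b_val < Tg and f_val > Tp
    cond_c = f_val < Tg and b_val > Tp
    return cond_a or cond_b or cond_c


def propose_pair(f_post, b_post, Tg=0.01, Tp=0.7):
    """Return the selected span if it passes (A), (B) or (C), else None.

    Tg = 0.01 follows the text; Tp = 0.7 is one of the values explored.
    """
    span = best_span(f_post, b_post)
    if accept(f_post[span], b_post[span], Tg, Tp):
        return span
    return None
```

Condition (A) requires both directions to support the pair at least weakly, while (B) and (C) admit a pair that one direction rejects only if the other direction is confident; lowering `Tp` therefore admits more pairs.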
HMM-based models are often used if posterior
distributions are needed. Model-1 can also be used
in this way (Venugopal et al., 2003), although it is
a relatively weak alignment model. By comparison,
finding posterior distributions under Model-4 is dif-
ficult. The Word-to-Phrase alignment model appears
not to suffer this tradeoff: it is a good model of word
alignment under which statistics such as the phrase-
to-phrase posterior can be calculated.
6 Translation Experiments
We evaluate the quality of phrase pairs extracted
from the bitext through the translation performance
of the Translation Template Model (TTM) (Kumar
et al., 2005), which is a phrase-based translation system
implemented using weighted finite state transducers.
Performance is measured by BLEU (Papineni
et al., 2001).
Chinese→English Translation We report perfor-
mance on the NIST Chinese/English 2002, 2003 and
2004 (News only) MT evaluation sets. These consist
of 878, 919, and 901 sentences, respectively. Each
Chinese sentence has 4 reference translations.
We evaluate two C→E translation systems. The
smaller system is built on the FBIS C-E bitext col-
lection. The language model used for this system is
a trigram word language model estimated with 21M
words taken from the English side of the bitext; all
language models are built with the SRILM toolkit
using Kneser-Ney smoothing (Stolcke, 2002).
                     FBIS C→E System                               News A→E System
     V-PE    WtoP    eval02        eval03        eval04            eval02        eval03        eval04
     Model   Tp      cvg    BLEU   cvg    BLEU   cvg    BLEU       cvg    BLEU   cvg    BLEU   cvg    BLEU
1    M-4     -       20.1   23.8   17.7   22.8   20.2   23.0       19.5   36.9   21.5   39.1   18.5   40.0
2            0.7     24.6   24.6   21.4   23.7   24.6   23.7       23.8   37.6   26.6   40.2   22.4   40.3
3    WtoP    -       19.7   23.9   17.4   23.3   19.8   23.3       18.4   36.2   20.6   38.6   17.4   39.2
4            1.0     23.1   24.0   20.0   23.7   23.2   23.5       21.8   36.7   24.3   39.3   20.4   39.7
5            0.9     24.0   24.8   20.9   23.9   24.0   23.8       23.2   37.2   25.8   39.7   21.8   40.1
6            0.7     24.6   24.9   21.3   24.0   24.7   23.9       23.7   37.2   26.5   39.7   22.4   39.9
7            0.5     24.9   24.9   21.6   24.1   24.8   23.9       24.0   37.2   26.9   39.7   22.7   39.8
                     Large C→E System                              Large A→E System
8    M-4     -       32.5   27.7   29.3   27.1   32.5   26.6       26.4   38.1   28.1   40.1   28.2   39.9
9    WtoP    -       30.6   27.9   27.5   27.0   30.6   26.4       24.8   38.1   26.6   40.1   26.7   40.6
10           0.7     38.2   28.2   32.3   27.3   37.1   26.8       30.7   39.3   32.9   41.6   32.5   41.9
Table 4: Translation Analysis and Performance of PPI Extraction Procedures
The larger system is based on alignments gener-
ated over all available C-E bitext (the “ALL C-E”
collection of Section 4). The language model is
an equal-weight interpolated trigram model trained
over 373M English words taken from the English
side of the bitext and the LDC Gigaword corpus.
Arabic→English Translation We also evaluate our
WtoP alignment models in Arabic-English transla-
tion. We report results on a small and a large system.
In each, Arabic text is tokenized by the Buckwalter
analyzer provided by LDC. We test our models on
NIST Arabic/English 2002, 2003 and 2004 (News
only) MT evaluation sets that consist of 1043, 663
and 707 Arabic sentences, respectively. Each Arabic
sentence has 4 reference translations.
In the small system, the training bitext is from
A-E News parallel text, with ∼3.5M words on the
English side. We follow the same training procedure
and configurations as in the Chinese/English system
in both translation directions. The language
model is an equal-weight interpolated trigram built
over ∼400M words from the English side of the bi-
text, including UN text, and the LDC English Gi-
gaword collection. The large Arabic/English system
employs the same language model. Alignments are
generated over all A-E bitext available from LDC as
of this submission; this consists of approx. 130M
words on the English side.
WtoP Model and Model-4 Comparison We first
look at translation performance of the small A→E
and C→E systems, where alignment models are
trained over the smaller bitext collections. The base-
line systems (Table 4, line 1) are based on Model-4
Viterbi Phrase-Extract PPIs.
We compare WtoP alignments directly to Model-
4 alignments by extracting PPIs from the WtoP
alignments using the Viterbi Phrase-Extract proce-
dure (Table 4, line 3). In C→E translation, perfor-
mance is comparable to that of Model-4; in A→E
translation, performance lags slightly. As we add
phrase pairs to the WtoP Viterbi Phrase-Extract PPI
via the Phrase-Posterior Augmentation procedure
(Table 4, lines 4-7), we obtain a ∼1% improvement
in BLEU; the value of Tp = 0.7 gives improvements
across all sets. In C→E translation, this yields good
gains relative to Model-4, while in A→E we match
or improve the Model-4 performance.
The performance gains through PPI augmentation
are consistent with increased PPI coverage of the test
set. We tabulate the percentage of test set phrases
that appear in each of the PPIs (the ‘cvg’ values
in Table 4). The augmentation scheme is designed
specifically to increase coverage, and we find that
BLEU score improvements track the phrase cover-
age of the test set. This is further confirmed by the
experiment of Table 4, line 2 in which we take the
PPI extracted from Model-4 Viterbi alignments, and
add phrase pairs to it using the Phrase-Posterior aug-
mentation scheme with Tp = 0.7. We find that the
augmentation scheme under the WtoP models can
be used to improve the Model-4 PPI itself.
We also investigate C→E and A→E translation
performance with PPIs extracted from large bitexts.
Performance of systems based on Model-4 Viterbi
Phrase-Extract PPIs is shown in Table 4, line 8.
To train Model-4 using GIZA++, we split the bi-
texts into two (A-E) or three (C-E) partitions, and
train models for each division separately; we find
that memory usage is otherwise too great. These
serve as a single set of alignments for the bitext,
as if they had been generated under a single align-
ment model. When we translate with Viterbi Phrase-
Extract PPIs taken from WtoP alignments created
over all available bitext, we find comparable perfor-
mance to the Model-4 baseline (Table 4, line 9). Us-
ing the Phrase-Posterior augmentation scheme with
Tp = 0.7 yields further improvement (Table 4, line
10). Pooling the sets to form two large C→E and
A→E test sets, the A→E system improvements are
significant at a 95% level (Och, 2003); the C→E sys-
tems are only equivalent.
7 Conclusion
We have described word-to-phrase alignment mod-
els capable of good quality bitext word alignment.
In Arabic-English and Chinese-English translation
and alignment they compare well to Model-4, even
with large bitexts. The model architecture was in-
spired by features of Model-4, such as fertility and
distortion, but care was taken to ensure that dy-
namic programming procedures, such as EM and
Viterbi alignment, could still be performed. There
is practical value in this: training and alignment
are easily parallelized. Working with HMMs also
makes it straightforward to explore new modeling
approaches. We show an augmentation scheme that
adds to phrases extracted from Viterbi alignments;
this improves translation with both the WtoP and the
Model-4 phrase pairs, even though it would be infea-
sible to implement the scheme under Model-4 itself.
We note that these models are still relatively simple,
and we anticipate further alignment and translation
improvement as the models are refined.
Acknowledgments The TTM translation system was provided
by Shankar Kumar. This work was funded by ONR MURI
Grant N00014-01-1-0685.
References
A. L. Berger, S. Della Pietra, and V. J. Della Pietra. 1996.
A maximum entropy approach to natural language pro-
cessing. Computational Linguistics, 22(1):39–71.
P. F. Brown et al. 1993. The mathematics of machine
translation: Parameter estimation. Computational Lin-
guistics, 19:263–312.
P. Koehn, F. Och, and D. Marcu. 2003. Statistical phrase-
based translation. In Proc. of HLT-NAACL.
S. Kumar, Y. Deng, and W. Byrne. 2005. A weighted fi-
nite state transducer translation template model for sta-
tistical machine translation. Journal of Natural Lan-
guage Engineering, 11(3).
D. Marcu and W. Wong. 2002. A phrase-based, joint
probability model for statistical machine translation.
In Proc. of EMNLP.
F. Och and H. Ney. 2000. Improved statistical alignment
models. In Proc. of ACL, Hong Kong, China.
F. J. Och and H. Ney. 2003. A systematic comparison of
various statistical alignment models. Computational
Linguistics, 29(1):19–51.
F. Och. 2002. Statistical Machine Translation: From
Single Word Models to Alignment Templates. Ph.D.
thesis, RWTH Aachen, Germany.
F. Och. 2003. Minimum error rate training in statistical
machine translation. In Proc. of ACL.
M. Ostendorf, V. Digalakis, and O. Kimball. 1996. From
HMMs to segment models: a unified view of stochas-
tic modeling for speech recognition. IEEE Trans.
Acoustics, Speech and Signal Processing, 4:360–378.
K. Papineni et al. 2001. BLEU: a method for automatic
evaluation of machine translation. Technical Report
RC22176 (W0109-022), IBM Research Division.
A. Stolcke. 2002. SRILM – an extensible language mod-
eling toolkit. In Proc. ICSLP.
E. Sumita et al. 2004. EBMT, SMT, Hybrid and More:
ATR spoken language translation system. In Proc.
of the International Workshop on Spoken Language
Translation, Kyoto, Japan.
K. Toutanova, H. T. Ilhan, and C. Manning. 2002. Extensions
to HMM-based statistical word alignment models.
In Proc. of EMNLP.
A. Venugopal, S. Vogel, and A. Waibel. 2003. Effective
phrase translation extraction from alignment models.
In Proc. of ACL.
S. Vogel, H. Ney, and C. Tillmann. 1996. HMM based
word alignment in statistical translation. In Proc. of
the COLING.
I. H. Witten and T. C. Bell. 1991. The zero-frequency
problem: Estimating the probabilities of novel events
in adaptive text compression. In IEEE Trans. Inform
Theory, volume 37, pages 1085–1094, July.
