Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 177–184, Vancouver, October 2005. c©2005 Association for Computational Linguistics
Inner-Outer Bracket Models for Word Alignment
using Hidden Blocks
Bing Zhao
School of Computer Science
Carnegie Mellon University
{bzhao}@cs.cmu.edu
Niyu Ge and Kishore Papineni
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
{niyuge, papineni}@us.ibm.com
Abstract
Most statistical translation systems are
based on phrase translation pairs, or
“blocks”, which are obtained mainly from
word alignment. We use blocks to infer
better word alignment and improved word
alignment which, in turn, leads to better
inference of blocks. We propose two new
probabilistic models based on the inner-
outersegmentationsanduseEMalgorithms
for estimating the models’ parameters. The
first model recovers IBM Model-1 as a spe-
cial case. Both models outperform bi-
directional IBM Model-4 in terms of word
alignment accuracy by 10% absolute on the
F-measure. Using blocks obtained from
the models in actual translation systems
yields statistically significant improvements
in Chinese-English SMT evaluation.
1 Introduction
Today’s statistical machine translation systems rely
on high quality phrase translation pairs to acquire
state-of-the-art performance, see (Koehn et al., 2003;
Zens and Ney, 2004; Och and Ney, 2003). Here,
phrase pairs, or “blocks” are obtained automati-
cally from parallel sentence pairs via the underlying
word alignments. Word alignments traditionally are
based on IBM Models 1-5 (Brown et al., 1993) or on
HMMs (Vogel et al., 1996). Automatic word align-
ment is challenging in that its accuracy is not yet
close to inter-annotator agreement in some language
pairs: for Chinese-English, inter-annotator agree-
ment exceeds 90 on F-measure whereas IBM Model-
4 or HMM accuracy is typically below 80s. HMMs
assume that words “close-in-source” are aligned to
words “close-in-target”. While this locality assump-
tion is generally sound, HMMs do have limitations:
the self-transition probability of a state (word) is the
only control on the duration in the state, the length
of the phrase aligned to the word. Also there is no
natural way to control repeated non-contiguous vis-
its to a state. Despite these problems, HMMs remain
attractive for their speed and reasonable accuracy.
We propose a new method for localizing word
alignments. We use blocks to achieve locality in the
following manner: a block in a sentence pair is a
source phrase aligned to a target phrase. We assume
that words inside the source phrase cannot align to
words outside the target phrase and that words out-
side the source phrase cannot align to words inside
the target phrase. Furthermore, a block divides the
sentence pair into two smaller regions: the inner
part of the block, which corresponds to the source
and target phrase in the block, and the outer part of
the block, which corresponds to the remaining source
and target words in the parallel sentence pair. The
two regions are non-overlapping; and each of them is
shorter than the original parallel sentence pair. The
regions are thus easier to align than the original sen-
tence pairs (e.g., using IBM Model-1). While the
model uses a single block to split the sentence pair
into two independent regions, it is not clear which
block we should select for this purpose. Therefore,
we treat the splitting block as a hidden variable.
This proposed approach is far simpler than treat-
ing the entire sentence as a sequence of non-
overlapping phrases (or chunks) and considering such
sequential segmentation either explicitly or implic-
itly. For example, (Marcu and Wong, 2002) for a
joint phrase based model, (Huang et al., 2003) for
a translation memory system; and (Watanabe et
al., 2003) for a complex model of insertion, deletion
and head-word driven chunk reordering. Other ap-
proaches including (Watanabe et al., 2002) treat ex-
tractedphrase-pairsasnewparalleldatawithlimited
success. Typically, they share a similar architecture
of phrase level segmentation, reordering, translation
as in (Och and Ney, 2002; Koehn and Knight, 2002;
Yamada and Knight, 2001). The phrase level inter-
action has to be taken care of for the non-overlapping
sequential segmentation in a complicated way. Our
models model such interactions in a soft way. The
hidden blocks are allowed to overlap with each other,
177
while each block induced two non-overlapping re-
gions, i.e. the model brackets the sentence pair
into two independent parts which are generated syn-
chronously. In this respect, it resembles bilingual
bracketing (Wu, 1997), but our model has more lex-
ical items in the blocks with many-to-many word
alignment freedom in both inner and outer parts.
We present our localization constraints using
blocks for word alignment in Section 2; we detail our
two new probabilistic models and their EM train-
ing algorithms in Section 3; our baseline system, a
maximum-posterior inference for word alignment, is
explained in Section 4; experimental results of align-
ments and translations are in Section 5; and Section
6 contains discussion and conclusions.
2 Segmentation by a Block
We use the following notation in the remainder of
this paper: e and f denote the English and foreign
sentences with sentence lengthes of I and J, respec-
tively. ei is an English word at position i in e; fj is
a foreign word at position j in f. a is the alignment
vector with aj mapping the position of the English
word eajto which fj connects. Therefore, we have
the standard limitation that one foreign word can-
not be connected to more than one English word. A
block δ[] is defined as a pair of brackets as follows:
δ[] = (δe,δf) = ([il,ir],[jl,jr]), (1)
where δe = [il,ir] is a bracket in English sentence de-
fined by a pair of indices: the left position il and the
right position ir, corresponding to a English phrase
eiril . Similar notations are for δf = [jl,jr], which is
one possible projection of δe inf. The subscript l and
r are abbreviations of left and right, respectively.
δe segments e into two parts: (δe,e) = (δe∈,δe/∈).
The inner part δe∈ = {ei,i ∈ [il,ir]} and the outer
part δe/∈ = {ei,i /∈ [il,ir]}; δf segments f similarly.
Thus, the block δ[] splits the parallel sentence pair
into two non-overlapping regions: the Inner δ[]∈ and
Outer δ[]/∈ parts (see Figure 1). With this segmen-
tation, we assume the words in the inner part are
aligned to inner part only: δ[]∈ = δe∈ ↔ δf∈ : {ei,i ∈
[il,ir]} ↔ {fj,j ∈ [jl,jr]}; and words in the outer
part are aligned to outer part only: δ[]/∈ = δe/∈ ↔ δf/∈ :
{ei,i /∈ [il,ir]} ↔ {fj,j /∈ [jl,jr]}. We do not allow
alignments to cross block boundaries. Words inside
a block δ[] can be aligned using a variety of models
(IBM models 1-5, HMM, etc). We choose Model1 for
simplicity. If the block boundaries are accurate, we
can expect high quality word alignment. This is our
proposed new localization method.
Outer
Inner
li ri
rj
lj
eδ
fδ
Figure 1: Segmentation by a Block
3 Inner-Outer Bracket Models
We treat the constraining block as a hidden variable
in a generative model shown in Eqn. 2.
P(f|e) =
summationdisplay
{δ[]}
P(f,δ[]|e)
=
summationdisplay
{δe}
summationdisplay
{δf}
P(f,δf|δe,e)P(δe|e), (2)
where δ[] = (δe,δf) is the hidden block. In the gen-
erative process, the model first generates a bracket
δe for e with a monolingual bracketing model of
P(δe|e). It then uses the segmentation of the En-
glish (δe,e) to generate the projected bracket δf of f
using a generative translation model P(f,δf|δe,e) =
P(δf/∈,δf∈|δe/∈,δe∈) — the key model to implement our
proposed inner-outer constraints. With the hidden
block δ[] inferred, the model then generates word
alignments within the inner and outer parts sepa-
rately. We present two generating processes for the
inner and outer parts induced by δ[] and correspond-
ing two models of P(f,δf|δe,e). These models are
described in the following secions.
3.1 Inner-Outer Bracket Model-A
The first model assumes that the inner part and the
outer part are generated independently. By the for-
mal equivalence of (f,δf) with (δf∈,δf/∈), Eqn. 2 can
be approximated as:
P(f|e)≈
summationdisplay
{δe}
summationdisplay
{δf}
P(δf∈|δe∈)P(δf/∈|δe/∈)P(δe|e)P(δf|δe),
(3)
where P(δf∈|δe∈) and P(δf/∈|δe/∈) are two independent
generative models for inner and outer parts, respec-
178
tively and are futher decompsed into:
P(δf∈|δe∈) =
summationdisplay
{aj∈δe∈}
productdisplay
fj∈δf∈
P(fj|eaj)P(eaj|δe∈)
P(δf/∈|δe/∈) =
summationdisplay
{aj∈δe/∈}
productdisplay
fj∈δf/∈
P(fj|eaj)P(eaj|δe/∈), (4)
where {aJ1} is the word alignment vector. Given the
block segmentation and word alignment, the genera-
tive process first randomly selects a ei according to
either P(ei|δe∈) or P(ei|δe/∈); and then generates fj in-
dexed by word alignment aj with i = aj according to
a word level lexicon P(fj|eaj). This generative pro-
cess using the two models of P(δf∈|δe∈) and P(δf/∈|δe/∈)
mustsatisfytheconstraintsofsegmentationsinduced
by the hidden block δ[] = (δe,δf). The English
wordsδe∈ insidetheblockcanonlygeneratethewords
in δf∈ and nothing else; likewise δe/∈ only generates
δf/∈. Overall, the combination of P(δf∈|δe∈)P(δf/∈|δe/∈)
in Eqn. 3 collaborates each other quite well in prac-
tice. For a particular observation δf∈, if δe∈ is too
small (i.e., missing translations), P(δf∈|δe∈) will suf-
fer; and if δe∈ is too big (i.e., robbing useful words
from δe/∈), P(δf/∈|δe/∈) will suffer. Therefore, our pro-
posed model in Eqn. 3 combines the two costs and
requires both inner and outer parts to be explained
well at the same time.
Because the model in Eqn. 3 is essentially a two-
level (δ[] and a) mixture model similar to IBM Mod-
els, the EM algorithm is quite straight forward as
in IBM models. Shown in the following are several
key E-step computations of the posteriors. The M-
step (optimization) is simply the normalization of
the fractional counts collected using the posteriors
through the inference results from E-step:
Pδ[]
∈
(aj|δf∈,δe∈) = P(fj|eaj)summationtext
ek∈δe∈ P(fj|ek)
Pδ[]
/∈
(aj|δf/∈,δe/∈) = P(fj|eaj)summationtext
ek∈δe/∈ P(fj|ek)
(5)
The posterior probability of P(aJ1|f,δf,δe,e) =producttext
J
j=1 P(aj|f,δ
f,δe,e), where P(aj|f,δf,δe,e) is ei-
ther Pδ[]
∈
(aj|δf∈,δe∈) when (fj,eaj) ∈ δ[]∈, or oth-
erwise Pδ[]
/∈
(aj|δf/∈,δe/∈) when (fj,eaj) ∈ δ[]/∈. As-
suming P(δe|e) to be a uniform distribution, the
posterior of selecting a hidden block given ob-
servations: P(δ[] = (δe,δf)|e,f) is proportional
to block level relative frequency Prel(δf∈|δe∈) up-
dated in each iteration; and can be smoothed
with P(δf|δe,f,e) = P(δf∈|δe∈)P(δf/∈|δe/∈)/summationtext{δprimef}
P(δprimef∈|δe∈)P(δprimef/∈|δe/∈) assuming Model-1 alignment in
the inner and outer parts independently to reduce
the risks of data sparseness in estimations.
In principle, δe can be a bracket of any length
not exceeding the sentence length. If we restrict the
bracket length to that of the sentence length, we re-
cover IBM Model-1. Figure 2 summarizes the gener-
ation process for Inner-Outer Bracket Model-A.
f1 f2  f3  f4e
1 e2  e3
[e1] e2  e3 e1 [e2] e3 [e1 e2] e3 e1 [e2 e3]
….
f1 f4
e1 e3
f2  f3
e2
f1 f3 f4
e1 e3
f2
e2
……
]3,2[=fδ ]2,2[=fδ [.,.]=fδ
]2,2[=eδ]1,1[=eδ ]2,1[=eδ ]3,2[=eδ
innerouter innerouter
Figure 2: Illustration of generative Bracket Model-A
3.2 Inner-Outer Bracket Model-B
A block δ[] invokes both the inner and outer gener-
ations simultaneously in Bracket Model A (BM-A).
However, the generative process is usually more ef-
fective in the inner part as δ[] is generally small and
accurate. We can build a model focusing on gener-
ating only the inner part with careful inferences to
avoid errors from noisy blocks. To ensure that all
fJ1 are generated, we need to propose enough blocks
to cover each observation fj. This constraint can be
met by treating the whole sentence pair as one block.
The generative process is as follows: First the
model generates an English bracket δe as before. The
model then generates a projection δf in f to local-
ize all aj’s for the given δe according to P(δf|δe,e).
δe and δf forms a hidden block δ[]. Given δ[], the
model then generates only the inner part fj ∈ δf∈ via
P(f|δf,δe,e) similarequal P(δf∈|δf,δe,e). Eqn. 6 summarizes
this by rewriting P(f,δf|δe,e):
P(f,δf|δe,e) = P(f|δf,δe,e)P(δf|δe,e) (6)
= P(f|δf,δe,e)P([jl,jr]|δe,e)
similarequal P(δf∈|δf,δe,e)P([jl,jr]|δe,e).
P(δf∈|δf,δe,e) is a bracket level emission proba-
bilistic model which generates a bag of contiguous
words fj ∈ δf∈ under the constraints from the given
hidden block δ[] = (δf,δe). The model is simplified
in Eqn. 7 with the assumption of bag-of-words’ inde-
pendence within the bracket δf:
P(δf∈|δf,δe,e) =summationtext
aJ1
producttext
j∈δf∈ P(fj|eaj)P(eaj|δ
f,δe,e). (7)
179
The P([jl,jr]|δe,e) in Eqn. 6 is a localization prob-
abilisticmodel, whichhas resemblancesto an HMM’s
transition probability, P(aj|aj−1), implementing the
assumption “close-in-source” is aligned to “close-in-
target”. However, instead of using the simple po-
sition variable aj, P([jl,jr]|δe,e) is more expressive
with word identities to localize words {fj} aligned to
δe∈. To model P([jl,jr]|δe,e) reliably, δf = [jl,jr] is
equivalently defined as the center and width of the
bracket δf: (circledotδf,wδf). To simplify it further, we
assume that wδf and circledotδf can be predicted indepen-
dently.
The width model, P(wδf|δe,e), depends on the
length of the English bracket and the fertilities of En-
glish words in it. To simplify M-step computations,
we can compute the expected width as in Eqn. 8.
E{wδf|δe,e}similarequal γ ·|ir −il +1|, (8)
where γ is the expected bracket length ratio and
is approximated by the average sentence length ra-
tio computed using the whole parallel corpus. For
Chinese-English, γ = 1/1.3 = 0.77. In practice, this
estimation is quite reliable.
The center model P(circledotδf|δe,e) is harder. It is
conditioned on the translational equivalence between
the English bracket and its projection. We compute
the expectedcircledotδf by averaging the weighted expected
centers from all the English words in δe as in Eqn. 9.
E{circledotδf|δe,e} = summationtextJj=0 j ·P(j|δe,e)
similarequalsummationtextJj=0 j ·
P
i∈δe P(fj|ei)P
J
jprime=0
P
i∈δe P(fjprime|ei)
. (9)
The expectations of (circledotδf,wδf) from Eqn. 8 and
Eqn. 9 give a reliable starting point for a local search
for the optimal estimation of (ˆcircledotδf, ˆwδf) as in Eqn 10:
(ˆcircledotδf, ˆwδf) = argmax
{(circledotδf ,wδf )}
P(δf∈|δe∈)P(δf/∈|δe/∈), (10)
where the score functions of P(δf∈|δe∈)P(δf/∈|δe/∈) are
in Eqn. 4 with the word alignment explicitly given
from the previous iteration. For the very first itera-
tion, full alignment isassumed; thismeansthatevery
word pair is connected in the parallel sentences. Dur-
ing the local search in Eqn. 10, one can choose the
top-1 (Viterbi) (ˆcircledotδf, ˆwδf) or top-N candidates and
normalize over these candidates to obtain the poste-
riors. Except for the local search of (ˆcircledotδf, ˆwδf), the
remainder EM steps are similar to Bracket Model-A,
though with different interpretations.
By performing local search in Eqn. 10, Model-
B localizes hidden blocks more accurately than the
scheme of the smoothed relative frequency in Model-
A’s EM iterations. The model is also more focused
on the predictions in the inner part. Figure 3 sum-
marizes the generative process of Model-B (BM-B).
f1 f2  f3  f4e
1 e2  e3
[e1] e2  e3 e1 [e2] e3 [e1 e2] e3 e1 [e2 e3]
….
f2  f3 f4
e1 e2
f2  f3
e2
f2
e2 ……
)ˆ,ˆ( ff wδδΘ
]2,2[=eδ]1,1[=eδ ]2,1[=eδ ]3,2[=eδ
inner inner
)ˆ,ˆ( ff wδδΘ
……
inner
Figure 3: Generative Bracket Model-B
3.3 A Null Word Model
The null word model allows words to be aligned
to nothing. In the traditional IBM models, there
is a universal null word which is attached to every
sentence pair to compete with word generators. In
our inner-outer bracket models, we use two context-
specific null word models which use both the left
and right context as competitors in the generative
process for each observation fj: P(fj|fj−1,e) and
P(fj|fj+1,e). This is similar to the approach in
(Toutanova et al., 2002), in which the null word
model is part of an extended HMM using left context
only. With two null word models, we can associate fj
with its left or right context (i.e., a null link) when
the null word models are very strong, or when the
word’s alignment is too far from the expected center
ˆcircledotδf in Eqn. 9.
4 A Max-Posterior for Word
Alignment
In the HMM framework, (Ge, 2004) proposed a
maximum-posterior method which worked much bet-
ter than Viterbi for Arabic to English transla-
tions. The difference between maximum-posterior
and Viterbi, in a nutshell, is that while Viterbi com-
putes the best state sequence given the observation,
the maximum-posterior computes the best state one
at a time.
In the terminology of HMM, let the states be the
words in the foreign sentence fJ1 and observations
be the words in the English sentence eT1 . We use
the subscript t to note the fact that et is observed
(or emitted) at time step t. The posterior probabil-
ities P(fj|et) (state given observation) are obtained
after the forward-backward training. The maximum-
posterior word alignments are obtained by first com-
180
puting a pair (j,t)∗:
(j,t)∗ = argmax
(j,t)
P(fj|et), (11)
that is, the point at which the posterior is maximum.
The pair (j,t) defines a word pair (fj,et) which is
then aligned. The procedure continues to find the
next maximum in the posterior matrix. Contrast
this with Viterbi alignment where one computes
ˆfT1 = argmax
{fT1 }
P(f1,f2,··· ,fT|eT1 ), (12)
We observe, in parallel corpora, that when one
word translates into multiple words in another lan-
guage, it usually translates into a contiguous se-
quence of words. Therefore, we impose a conti-
guity constraint on word alignments. When one
word fj aligns to multiple English words, the En-
glish words must be contiguous in e and vice versa.
The algorithm to find word alignments using max-
posterior with contiguity constraint is illustrated in
Algorithm 1.
Algorithm 1 A maximum-posterior algorithm with
contiguity constraint
1: while (j,t) = (j,t)∗ (as computed in Eqn. 11)
do
2: if (fj,et) is not yet aligned then
3: align(fj, et);
4: else if (et is contiguous to what fj is aligned)
or(fj iscontiguoustowhatet isaligned)then
5: align(fj, et);
6: end if
7: end while
The algorithm terminates when there isn’t any
’next’ posterior maximum to be found. By defi-
nition, there are at most JxT ’next’ maximums in
the posterior matrix. And because of the contiguity
constraint, not all (fj,et) pairs are valid alignments.
The algorithm is sure to terminate. The algorithm
is, in a sense, directionless, for one fj can align to
multiple et’s and vise versa as long as the multiple
connections are contiguous. Viterbi, however, is di-
rectional in which one state can emit multiple obser-
vations but one observation can only come from one
state.
5 Experiments
We evaluate the performances of our proposed mod-
els in terms of word alignment accuracy and trans-
lation quality. For word alignment, we have 260
hand-aligned sentence pairs with a total of 4676 word
pair links. The 260 sentence pairs are randomly
selected from the CTTP1 corpus. They were then
word aligned by eight bilingual speakers. In this set,
we have one-to-one, one-to-many and many-to-many
alignment links. If a link has one target functional
word, it is considered to be a functional link (Ex-
amples of funbctional words are prepositions, deter-
miners, etc. There are in total 87 such functional
words in our experiments). We report the overall F-
measures as well as F-measures for both content and
functional word links. Our significance test shows
an overall interval of ±1.56% F-measure at a 95%
confidence level.
For training data, the small training set has 5000
sentence pairs selected from XinHua news stories
with a total of 131K English words and 125K Chi-
nese words. The large training set has 181K sentence
pairs (5k+176K); and the additional 176K sentence
pairs are from FBIS and Sinorama, which has in to-
tal 6.7 million English words and 5.8 million Chinese
words.
5.1 Baseline Systems
The baseline is our implementation of HMM with
the maximum-posterior algorithm introduced in sec-
tion 4. The HMMs are trained unidirectionally. IBM
Model-4 is trained with GIZA++ using the best re-
ported settings in (Och and Ney, 2003). A few pa-
rameters, especially the maximum fertility, are tuned
for GIZA++’s optimal performance. We collect bi-
directional (bi) refined word alignment by growing
the intersection of Chinese-to-English (CE) align-
ments and English-to-Chinese (EC) alignments with
the neighboring unaligned word pairs which appear
in the union similar to the “final-and” approaches
(Koehn, 2003; Och and Ney, 2003; Tillmann, 2003).
Table 1 summarizes our baseline with different set-
tings. Table 1 shows that HMM EC-P gives the
F-measure(%) Func Cont Both
Small
HMM EC-P 54.69 69.99 64.78
HMM EC-V 31.38 53.56 55.59
HMM CE-P 51.44 69.35 62.69
HMM CE-V 31.43 63.84 55.45
Large
HMM EC-P 60.08 78.01 71.92
HMM EC-V 32.80 74.10 64.26
HMM CE-P 58.45 79.44 71.84
HMM CE-V 35.41 79.12 68.33
Small GIZA MH-bi 45.63 69.48 60.08GIZA M4-bi 48.80 73.68 63.75
Large GIZA MH-bi 49.13 76.51 65.67GIZA M4-bi 52.88 81.76 70.24
- Fully-Align 2 5.10 15.84 9.28
Table 1: Baseline: V: Viterbi; P: Max-Posterior
1LDC2002E17
181
best baseline, better than bidirectional refined word
alignments from GIZA M4 and the HMM Viterbi
aligners.
5.2 Inner-Outer Bracket Models
We trained HMM lexicon P(f|e) to initialize the
inner-outer Bracket models. Afterwards, up to 15–
20 EM iterations are carried out. Iteration starts
from the fully aligned2 sentence pairs, which give an
F-measure of 9.28% at iteration one.
5.2.1 Small Data Track
Figure 4 shows the performance of Model-A (BM-
A) trained on the small data set. For each English
bracket, Top-1 means only the fractional counts from
the Top-1 projection are collected, Top-all means
counts from all possible projections are collected. In-
side means the fractional counts are collected from
the inner part of the block only; and outside means
they are collected from the outer parts only. Using
theTop-1projectionfrom theinnerpartsofthe block
(top-1-inside) gives the best performance: an F-
measure of 72.29%, or a 7.5% absolute improvement
over the best baseline at iteration 5. Figure 5 shows
BM-A with different settings on small dataset
62
64
66
68
70
72
74
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16EM  Iterations
F
-
m
e
a
s
u
r
e top-1 inside
top-all insidetop all
top-1top-1 outside
top-all outside
Figure 4: BM-A with different settings on small data
the performance of Inner-Outer Bracket Model-B
(BM-B) over EM iterations. smoothing means when
collecting the fractional counts, we reweigh the up-
dated fractional count by 0.95 and give the remain-
ing 0.05 weight to original fractional count from the
links, which were aligned in the previous iteration.
w/null means we applied the proposed Null word
model in section 3.3 to infer null links. We also pre-
defined a list of 15 English function words, for which
there might be no corresponding Chinese words as
translations. These 15 English words are “a, an, the,
of, to, for, by, up, be, been, being, does, do, did, -”.
In the drop-null experiments, the links containing
these predefined function words are simply dropped
2Every possible word pair is aligned
in the final word alignment (this means they are left
unaligned).
BM-B with different settings on small dataset
65
67
69
71
73
75
77
1 2 3 4 5 6 7 8EM  Iterations
F
-
m
e
a
s
u
r
e
top-1 smooth dropnulltop-1 smooth w/null
top-1 smoothtop 1
top all
Figure 5: BM-B with different settings on small data
Empirically we found that doing more than 5 it-
erations lead to overfitting. The peak performance
in our model is usually achieved around iteration
4∼5. At iteration 5, setting “BM-B Top-1” gives an
F-measure of 73.93% which is better than BM-A’s
best performance (72.29%). This is because Model
B leverages a local search for less noisy blocks and
hence the inner part is more accurately generated
(which in turn means the outer part is also more
accurate). From this point on, all of our experi-
ments are using Model B. With smoothing, BM-B
improves to 74.46%. After applying the null word
model, we get 75.20%. By simply dropping links
containing the 15 English functional words, we get
76.24%, which is significantly better than our best
baseline obtained from even the large training set
(HMM EC-P: 71.92%).
BM-B with different settings on large data set
69
71
73
75
77
79
81
83
1 2 3 4 5 6 7 8EM  Iterations
F
-
m
e
a
s
u
r
e
top-1 smooth dropnull
top-1 smooth w/null
top-1 smooth
Figure 6: BM-B with different settings on large data
5.2.2 Large Data Track
Figure 6 shows performance pictures of model
BM-B on the large training set. Without dropping
English functional words, the best performance is
182
80.38% at iteration 4 using the Top-1 projection to-
gether with the null word models. By additionally
dropping the links containing the 15 functional En-
glish words, we get 81.47%. These results are all
significantly better than our strongest baseline sys-
tem: 71.92% F-measure using HMM EC-P (70.24%
using bidirectional Model-4 for comparisons).
On this data set, we experimented with different
maximum bracket length limits, from one word (un-
igram) to nine-gram. Results show that a maximum
bracket length of four is already optimal (79.3% with
top-1 projection), increased from 62.4% when maxi-
mum length is limited to one. No improvements are
observed using longer than five-gram.
5.3 Evaluate Blocks in the EM Iterations
Our intuition was that good blocks can improve word
alignment and, in turn, good word alignment can
lead to better block selection. The experimental re-
sults above support the first claim. Now we consider
the second claim that good word alignment leads to
better block selection.
Given reference human word alignment, we extract
reference blocks up to five-gram phrases on Chinese.
The block extraction procedure is based on the pro-
cedures in (Tillmann, 2003).
During EM, we output all the hidden blocks actu-
ally inferred at each iteration, then we evaluate the
precision, recall and F-measure of the hidden blocks
according to the extracted reference blocks. The re-
sults are shown in Figure 7. Because we extract all
10%
15%
20%
25%
30%
35%
40%
45%
F
-
m
e
a
s
u
r
e
s
1 2 3 4 5 6 7 8EM  Iterations
A Direct Eval of blocks' accuracyin'BM-B top-1 smooth w/null'
F-measureRecall
Precision
Figure 7: A Direct Eval. of Blocks in BM-B
possible n-grams at each position in e, the precision
is low and the recall is relatively high as shown by
Figure 7. It also shows that blocks do improve, pre-
sumably benefiting from better word alignments.
Table 2 summarizes word alignment performances
of Inner-Outer BM-B in different settings. Overall,
without the handcrafted function word list, BM-B
gives about 8% absolute improvement in F-measure
on the large training set and 9% for the small set
F-measure(%) Func Cont Both
Small
Baseline 54.69 69.99 64.78
BM-B-drop 62.76 82.99 76.24
BM-B w/null 61.24 82.54 75.19
BM-B smooth 59.61 82.99 74.46
Large
Baseline 60.08 78.01 71.92
BM-B-drop 63.95 90.09 81.47
BM-B w/null 62.24 89.99 80.38
BM-B smooth 60.49 90.09 79.31
Table 2: BM-B with different settings
with a confidence interval of ±1.56%.
5.4 Translation Quality Evaluations
Wealsocarriedoutthetranslation experimentsusing
the best settings for Inner-Outer BM-B (i.e. BM-B-
drop) on the TIDES Chinese-English 2003 test set.
We trained our models on 354,252 test-specific sen-
tence pairs drawn from LDC-supplied parallel cor-
pora. On this training data, we ran 5 iterations of
EM using BM-B to infer word alignments. A mono-
tone decoder similar to (Tillmann and Ney, 2003)
with a trigram language model3 is set up for trans-
lations. We report case sensitive Bleu (Papineni et
al., 2002)scoreBleuCforallexperiments. Thebase-
line system (HMM) used phrase pairs built from the
HMM-EC-P maximum posterior word alignment and
thecorrespondinglexicons. ThebaselineBleuCscore
is 0.2276 ± 0.015. If we use the phrase pairs built
from the bracket model instead (but keep the HMM
trained lexicons), we get case sensitive BleuC score
0.2526. The improvement is statistically significant.
If on the other hand, we use baseline phrase pairs
with bracket model lexicons, we get a BleuC score
0.2325, which is only a marginal improvement. If we
use both phrase pairs and lexicons from the bracket
model, we get a case sensitive BleuC score 0.2750,
which is a statistically significant improvement. The
results are summarized in Table 3.
Settings BleuC
Baseline (HMM phrases and lexicon) 0.2276
Bracket phrases and HMM lexicon 0.2526
Bracket lexicon and HMM phrases 0.2325
Bracket (phrases and lexicon) 0.2750
Table 3: Improved case sensitive BleuC using BM-B
Overall, using Model-B, we improve translation
quality from 0.2276 to 0.2750 in case sensitive BleuC
score.
3Trained on 1-billion-word ViaVoice English data; the
same data is used to build our True Caser.
183
6 Conclusion
Our main contributions are two novel Inner-Outer
Bracket models based on segmentations induced by
hidden blocks. Modeling the Inner-Outer hidden seg-
mentations, wegetsignificantlyimprovedwordalign-
ments for both the small training set and the large
training set over the widely-practiced bidirectional
IBM Model-4 alignment. We also show significant
improvements in translation quality using our pro-
posed bracket models. Robustness to noisy blocks
merits further investigation.
7 Acknowledgement
This work is supported by DARPA under contract
number N66001-99-28916.
References
P.F. Brown, Stephen A. Della Pietra, Vincent. J.
Della Pietra, and Robert L. Mercer. 1993. The
mathematics of statistical machine translation:
Parameter estimation. In Computational Linguis-
tics, volume 19(2), pages 263–331.
Niyu Ge. 2004. A maximum posterior method
for word alignment. In Presentation given at
DARPA/TIDES MT workshop.
J.X.Huang, W.Wang, andM.Zhou. 2003. Aunified
statistical model for generalized translation mem-
ory system. In Machine Translation Summit IX,
pages 173–180, New Orleans, USA, September 23-
27.
Philipp Koehn and Kevin Knight. 2002. Chunkmt:
Statistical machine translation with richer linguis-
tic knowledge. Draft, Unpublished.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical phrase-based machine transla-
tion. In Proc. of HLT-NAACL 2003, pages 48–54,
Edmonton, Canada, May-June.
Philipp Koehn. 2003. Noun phrase translation. In
Ph.D. Thesis, University of Southern California,
ISI.
Daniel Marcu and William Wong. 2002. A phrase-
based, joint probability model for statistical ma-
chine translation. In Proc. of the Conference on
Empirical Methods in Natural Language Process-
ing, pages 133–139, Philadelphia, PA, July 6-7.
Franz J. Och and Hermann Ney. 2002. Discrimi-
native training and maximum entropy models for
statistical machine translation. In Proceedings of
the 40th Annual Meeting of ACL, pages 440–447.
Franz J. Och and Hermann Ney. 2003. A system-
atic comparison of various statistical alignment
models. In Computational Linguistics, volume 29,
pages 19–51.
Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. Bleu: a method for auto-
matic evaluation of machine translation. In Proc.
of the 40th Annual Conf. of the ACL (ACL 02),
pages 311–318, Philadelphia, PA, July.
Christoph Tillmann and Hermann Ney. 2003. Word
reordering and a dp beam search algorithm for
statistical machine translation. In Computational
Linguistics, volume 29(1), pages 97–133.
Christoph Tillmann. 2003. A projection extension
algorithm for statistical machine translation. In
Proc. of the Conference on Empirical Methods in
Natural Language Processing.
Kristina Toutanova, H. Tolga Ilhan, and Christo-
pher D. Manning. 2002. Extensions to hmm-based
statistical word alignment models. In Proc. of the
Conference on Empirical Methods in Natural Lan-
guage Processing, Philadelphia, PA, July 6-7.
S. Vogel, Hermann Ney, and C. Tillmann. 1996.
Hmm based word alignment in statistical machine
translation. In Proc. The 16th Int. Conf. on Com-
putational Lingustics, (Coling’96), pages 836–841,
Copenhagen, Denmark.
Taro Watanabe, Kenji Imamura, and Eiichiro
Sumita. 2002. Statistical machine translation
based on hierarchical phrases. In 9th International
Conference on Theoretical and Methodological Is-
sues, pages 188–198, Keihanna, Japan, March.
Taro Watanabe, Eiichiro Sumita, and Hiroshi G.
Okuno. 2003. Chunk-based statistical transla-
tion. In In 41st Annual Meeting of the ACL (ACL
2003), pages 303–310, Sapporo, Japan.
Dekai Wu. 1997. Stochastic inversion transduction
grammars and bilingual parsing of parallel cor-
pora. In Computational Linguistics, volume 23(3),
pages 377–403.
K. Yamada and Kevin. Knight. 2001. Syntax-based
statistical translation model. In Proceedings of the
Conference of the Association for Computational
Linguistics (ACL-2001).
R. Zens and H. Ney. 2004. Improvements in phrase-
based statistical machine translation. In Pro-
ceedings of the Human Language Technology Con-
ference (HLT-NAACL)s, pages 257–264, Boston,
MA, May.
184
