Proceedings of the Workshop on Statistical Machine Translation, pages 31–38,
New York City, June 2006. ©2006 Association for Computational Linguistics
Why Generative Phrase Models Underperform Surface Heuristics
John DeNero, Dan Gillick, James Zhang, Dan Klein
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94705
{denero, dgillick, jyzhang, klein}@eecs.berkeley.edu
Abstract
We investigate why weights from generative mod-
els underperform heuristic estimates in phrase-
based machine translation. We first propose a sim-
ple generative, phrase-based model and verify that
its estimates are inferior to those given by surface
statistics. The performance gap stems primarily
from the addition of a hidden segmentation vari-
able, which increases the capacity for overfitting
during maximum likelihood training with EM. In
particular, while word level models benefit greatly
from re-estimation, phrase-level models do not: the
crucial difference is that distinct word alignments
cannot all be correct, while distinct segmentations
can. Alternate segmentations rather than alternate
alignments compete, resulting in increased deter-
minization of the phrase table, decreased general-
ization, and decreased final BLEU score. We also
show that interpolation of the two methods can re-
sult in a modest increase in BLEU score.
1 Introduction
At the core of a phrase-based statistical machine
translation system is a phrase table containing
pairs of source and target language phrases, each
weighted by a conditional translation probability.
Koehn et al. (2003a) showed that translation qual-
ity is very sensitive to how this table is extracted
from the training data. One particularly surprising
result is that a simple heuristic extraction algorithm
based on surface statistics of a word-aligned training
set outperformed the phrase-based generative model
proposed by Marcu and Wong (2002).
This result is surprising in light of the reverse sit-
uation for word-based statistical translation. Specif-
ically, in the task of word alignment, heuristic ap-
proaches such as the Dice coefficient consistently
underperform their re-estimated counterparts, such
as the IBM word alignment models (Brown et al.,
1993). This well-known result is unsurprising: re-
estimation introduces an element of competition into
the learning process. The key virtue of competition
in word alignment is that, to a first approximation,
only one source word should generate each target
word. If a good alignment for a word token is found,
other plausible alignments are explained away and
should be discounted as incorrect for that token.
As we show in this paper, this effect does not pre-
vail for phrase-level alignments. The central differ-
ence is that phrase-based models, such as the ones
presented in section 2 or Marcu and Wong (2002),
contain an element of segmentation. That is, they do
not merely learn correspondences between phrases,
but also segmentations of the source and target sen-
tences. However, while it is reasonable to sup-
pose that if one alignment is right, others must be
wrong, the situation is more complex for segmenta-
tions. For example, if one segmentation subsumes
another, they are not necessarily incompatible: both
may be equally valid. While in some cases, such
as idiomatic vs. literal translations, two segmenta-
tions may be in true competition, we show that the
most common result is for different segmentations
to be recruited for different examples, overfitting the
training data and overly determinizing the phrase
translation estimates.
In this work, we first define a novel (but not rad-
ical) generative phrase-based model analogous to
IBM Model 3. While its exact training is intractable,
we describe a training regime which uses word-
level alignments to constrain the space of feasible
segmentations down to a manageable number. We
demonstrate that the phrase analogue of the Dice co-
efficient is superior to our generative model (a re-
sult also echoing previous work). In the primary
contribution of the paper, we present a series of ex-
periments designed to elucidate what re-estimation
learns in this context. We show that estimates are
overly determinized because segmentations are used
in unintuitive ways for the sake of data likelihood.
We comment on both the beneficial instances of seg-
ment competition (idioms) as well as the harmful
ones (most everything else). Finally, we demon-
strate that interpolation of the two estimates can
provide a modest increase in BLEU score over the
heuristic baseline.
2 Approach and Evaluation Methodology
The generative model defined below is evaluated
based on the BLEU score it produces in an end-
to-end machine translation system from English to
French. The top-performing diag-and extraction
heuristic (Zens et al., 2002) serves as the baseline for
evaluation.1 Each approach – the generative model
and heuristic baseline – produces an estimated con-
ditional distribution of English phrases given French
phrases. We will refer to the distribution derived
from the baseline heuristic as φH. The distribution
learned via the generative model, denoted φEM, is
described in detail below.
2.1 A Generative Phrase Model
While our model for computing φEM is novel, it
is meant to exemplify a class of models that are
not only clear extensions to generative word align-
ment models, but also compatible with the statistical
framework assumed during phrase-based decoding.
The generative process we modeled produces a
phrase-aligned English sentence from a French sen-
tence where the former is a translation of the lat-
ter. Note that this generative process is opposite to
the translation direction of the larger system because
of the standard noisy-channel decomposition. The
learned parameters from this model will be used to
translate sentences from English to French. The gen-
erative process modeled has four steps:2
1. Begin with a French sentence $\mathbf{f}$.
2. Segment $\mathbf{f}$ into a sequence of I multi-word
phrases that span the sentence, $\bar{f}_1^I$.
3. For each phrase $\bar{f}_i \in \bar{f}_1^I$, choose a corresponding
position j in the English sentence and establish the
alignment $a_j = i$, then generate exactly one English
phrase $\bar{e}_j$ from $\bar{f}_i$.
4. The sequence $\bar{e}_j$ ordered by a describes an English
sentence $\mathbf{e}$.
1This well-known heuristic extracts phrases from a sentence
pair by computing a word-level alignment for the sentence and
then enumerating all phrases compatible with that alignment.
The word alignment is computed by first intersecting the
directional alignments produced by a generative IBM model (e.g.,
model 4 with minor enhancements) in each translation direction,
then adding certain alignments from the union of the
directional alignments based on local growth rules.
2Our notation matches the literature for phrase-based
translation: $e$ is an English word, $\bar{e}$ is an English phrase,
$\bar{e}_1^I$ is a sequence of I English phrases, and $\mathbf{e}$ is an
English sentence.
The corresponding probabilistic model for this gen-
erative process is:
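$$
\begin{aligned}
P(\mathbf{e}|\mathbf{f}) &= \sum_{\bar{f}_1^I, \bar{e}_1^I, a} P(\mathbf{e}, \bar{f}_1^I, \bar{e}_1^I, a|\mathbf{f}) \\
&= \sum_{\bar{f}_1^I, \bar{e}_1^I, a} \sigma(\bar{f}_1^I|\mathbf{f}) \prod_{\bar{f}_i \in \bar{f}_1^I} \phi(\bar{e}_j|\bar{f}_i)\, d(a_j = i|\mathbf{f})
\end{aligned}
$$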
where $P(\mathbf{e}, \bar{f}_1^I, \bar{e}_1^I, a|\mathbf{f})$ factors into a segmentation
model σ, a translation model φ and a distortion
model d. The parameters for each component of this
model are estimated differently:
• The segmentation model $\sigma(\bar{f}_1^I|\mathbf{f})$ is assumed to
be uniform over all possible segmentations for
a sentence.3
• The phrase translation model $\phi(\bar{e}_j|\bar{f}_i)$ is
parameterized by a large table of phrase translation
probabilities.
• The distortion model $d(a_j = i|\mathbf{f})$ is a discounting
function based on absolute sentence position
akin to the one used in IBM model 3.
While similar to the joint model in Marcu and Wong
(2002), our model takes a conditional form com-
patible with the statistical assumptions used by the
Pharaoh decoder. Thus, after training, the param-
eters of the phrase translation model φEM can be
used directly for decoding.
2.2 Training
Significant approximation and pruning are required
to train a generative phrase model and table, such
as φEM, with hidden segmentation and alignment
variables using the expectation maximization algorithm
(EM). Computing the likelihood of the data
for a set of parameters (the e-step) involves summing
over exponentially many possible segmentations for
each training sentence. Unlike previous attempts to
train a similar model (Marcu and Wong, 2002), we
allow information from a word-alignment model to
inform our approximation. This approach allowed
us to directly estimate translation probabilities even
for rare phrase pairs, which were estimated heuristically
in previous work.
3This segmentation model is deficient given a maximum
phrase length: many segmentations are disallowed in practice.
In each iteration of EM, we re-estimate each
phrase translation probability by summing fractional
phrase counts (soft counts) from the data given the
current model parameters.
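$$
\phi_{new}(\bar{e}_j|\bar{f}_i) = \frac{c(\bar{f}_i, \bar{e}_j)}{c(\bar{f}_i)}
= \sum_{(\mathbf{f},\mathbf{e})} \frac{\sum_{\bar{f}_1^I : \bar{f}_i \in \bar{f}_1^I} \sum_{\bar{e}_1^I : \bar{e}_j \in \bar{e}_1^I} \sum_{a : a_j = i} P(\mathbf{e}, \bar{f}_1^I, \bar{e}_1^I, a|\mathbf{f})}{\sum_{\bar{f}_1^I : \bar{f}_i \in \bar{f}_1^I} \sum_{\bar{e}_1^I} \sum_{a} P(\mathbf{e}, \bar{f}_1^I, \bar{e}_1^I, a|\mathbf{f})}
$$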
This training loop necessitates approximation because
summing over all possible segmentations and
alignments for each sentence is intractable, requiring
time exponential in the length of the sentences.
Additionally, the set of possible phrase pairs grows too
large to fit in memory. Using word alignments, we
can address both problems.4 In particular, we can
determine for any aligned segmentation $(\bar{f}_1^I, \bar{e}_1^I, a)$
whether it is compatible with the word-level alignment
for the sentence pair. We define a phrase pair
to be compatible with a word-alignment if no word
in either phrase is aligned with a word outside the
other phrase (Zens et al., 2002). Then, $(\bar{f}_1^I, \bar{e}_1^I, a)$
is compatible with the word-alignment if each of its
aligned phrases is a compatible phrase pair.
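The compatibility test defined above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the half-open span convention, and the toy word alignment are all assumptions introduced here. It implements only the stated condition (no link may connect a word inside one phrase to a word outside the other) and does not add further requirements that some extraction heuristics impose.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # half-open [start, end) word indices (assumed convention)


def phrase_pair_compatible(links: Iterable[Tuple[int, int]],
                           f_span: Span, e_span: Span) -> bool:
    """True if no link joins a word inside one phrase to a word outside the other."""
    for f_idx, e_idx in links:
        f_inside = f_span[0] <= f_idx < f_span[1]
        e_inside = e_span[0] <= e_idx < e_span[1]
        if f_inside != e_inside:
            return False
    return True


def aligned_segmentation_compatible(links, phrase_pairs) -> bool:
    """An aligned segmentation is compatible when every aligned phrase pair is."""
    links = list(links)
    return all(phrase_pair_compatible(links, f, e) for f, e in phrase_pairs)


# Hypothetical word alignment for "carte sur la table" / "map on the table",
# linked word for word as (french_index, english_index).
links = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(phrase_pair_compatible(links, (0, 2), (0, 2)))  # carte sur / map on
print(phrase_pair_compatible(links, (0, 2), (0, 1)))  # carte sur / map
```

The first pair is compatible; the second is not, because sur is linked to on, which lies outside the English phrase.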
The training process is then constrained such that,
when evaluating the above sum, only compatible
aligned segmentations are considered. That is, we
allow $P(\mathbf{e}, \bar{f}_1^I, \bar{e}_1^I, a|\mathbf{f}) > 0$ only for aligned
segmentations $(\bar{f}_1^I, \bar{e}_1^I, a)$ such that a provides a
one-to-one mapping from $\bar{f}_1^I$ to $\bar{e}_1^I$ where all phrase pairs
$(\bar{f}_{a_j}, \bar{e}_j)$ are compatible with the word alignment.
This constraint has two important effects. First,
we force $P(\bar{e}_j|\bar{f}_i) = 0$ for all phrase pairs not
compatible with the word-level alignment for some sentence
pair. This restriction successfully reduced the
total legal phrase pair types from approximately 250
million to 17 million for 100,000 training sentences.
However, some desirable phrases were eliminated
because of errors in the word alignments.
4The word alignments used in approximating the e-step
were the same as those used to create the heuristic diag-and
baseline.
Second, the time to compute the e-step is reduced.
While in principle it is still intractable, in practice
we can compute most sentence pairs’ contributions
in under a second each. However, some spurious
word alignments can disallow all segmentations for
a sentence pair, rendering it unusable for training.
Several factors including errors in the word-level
alignments, sparse word alignments and non-literal
translations cause our constraint to rule out approx-
imately 54% of the training set. Thus, the reduced
size of the usable training set accounts for some of
the degraded performance of φEM relative to φH.
However, the results in figure 1 of the following sec-
tion show that φEM trained on twice as much data
as φH still underperforms the heuristic, indicating a
larger issue than decreased training set size.
2.3 Experimental Design
To test the relative performance of φEM and φH,
we evaluated each using an end-to-end translation
system from English to French. We chose this non-
standard translation direction so that the examples
in this paper would be more accessible to a primar-
ily English-speaking audience. All training and test
data were drawn from the French/English section of
the Europarl sentence-aligned corpus. We tested on
the first 1,000 unique sentences of length 5 to 15 in
the corpus and trained on sentences of length 1 to 60
starting after the first 10,000.
The system follows the structure proposed in
the documentation for the Pharaoh decoder and
uses many publicly available components (Koehn,
2003b). The language model was generated from
the Europarl corpus using the SRI Language Model-
ing Toolkit (Stolcke, 2002). Pharaoh performed de-
coding using a set of default parameters for weight-
ing the relative influence of the language, translation
and distortion models (Koehn, 2003b). A maximum
phrase length of three was used for all experiments.
To properly compare φEM to φH, all aspects of
the translation pipeline were held constant except for
the parameters of the phrase translation table. In
particular, we did not tune the decoding hyperparameters
for the different phrase tables.
[Figure 1: line chart of BLEU (y-axis, 0.36 to 0.40) against the number of training sentences (x-axis: 25k, 50k, 100k), with one series each for the heuristic, EM iteration 1, and EM iteration 3.]
Figure 1: Statistical re-estimation using a generative
phrase model degrades BLEU score relative to its
heuristic initialization.
3 Results
Having generated φH heuristically and φEM with
EM, we now compare their performance. While the
model and training regimen for φEM differ from the
model from Marcu and Wong (2002), we achieved
results similar to Koehn et al. (2003a): φEM slightly
underperformed φH. Figure 1 compares the BLEU
scores using each estimate. Note that the expecta-
tion maximization algorithm for training φEM was
initialized with the heuristic parameters φH, so the
heuristic curve can be equivalently labeled as itera-
tion 0.
Thus, the first iteration of EM increases the ob-
served likelihood of the training sentences while si-
multaneously degrading translation performance on
the test set. As training proceeds, performance on
the test set levels off after three iterations of EM. The
system never achieves the performance of its initial-
ization parameters. The pruning of our training regi-
men accounts for part of this degradation, but not all;
augmenting φEM by adding back in all phrase pairs
that were dropped during training does not close the
performance gap between φEM and φH.
3.1 Analysis
Learning φEM degrades translation quality in large
part because EM learns overly determinized seg-
mentations and translation parameters, overfitting
the training data and failing to generalize. The pri-
mary increase in richness from generative word-
level models to generative phrase-level models is
due to the additional latent segmentation variable.
Although we impose a uniform distribution over
segmentations, it nonetheless plays a crucial role
during training. We will characterize this phe-
nomenon through aggregate statistics and transla-
tion examples shortly, but begin by demonstrating
the model’s capacity to overfit the training data.
Let us first return to the motivation behind in-
troducing and learning phrases in machine transla-
tion. For any language pair, there are contiguous
strings of words whose collocational translation is
non-compositional; that is, they translate together
differently than they would in isolation. For in-
stance, chat in French generally translates to cat in
English, but appeler un chat un chat is an idiom
which translates to call a spade a spade. Introduc-
ing phrases allows us to translate chat un chat atom-
ically to spade a spade and vice versa.
While introducing phrases and parameterizing
their translation probabilities with a surface heuris-
tic allows for this possibility, statistical re-estimation
would be required to learn that chat should never be
translated to spade in isolation. Hence, translating I
have a spade with φH could yield an error.
But enforcing competition among segmentations
introduces a new problem: true translation ambigu-
ity can also be spuriously explained by the segmen-
tation. Consider the French fragment carte sur la
table, which could translate to map on the table or
notice on the chart. Using these two sentence pairs
as training, one would hope to capture the ambiguity
in the parameter table as:
French       English      φ(e|f)
carte        map          0.5
carte        notice       0.5
carte sur    map on       0.5
carte sur    notice on    0.5
sur          on           1.0
...          ...          ...
table        table        0.5
table        chart        0.5
Assuming we only allow non-degenerate segmentations
and disallow non-monotonic alignments,
this parameter table yields a marginal likelihood
P(e|f) = 0.25 for both sentence pairs, the intuitive
result given two independent lexical ambiguities.
However, the following table yields a likelihood
of 0.28 for both sentences:5
French          English         φ(e|f)
carte           map             1.0
carte sur       notice on       1.0
carte sur la    notice on the   1.0
sur             on              1.0
sur la table    on the table    1.0
la              the             1.0
la table        the table       1.0
table           chart           1.0
Hence, a higher likelihood can be achieved by al-
locating some phrases to certain translations while
reserving overlapping phrases for others, thereby
failing to model the real ambiguity that exists across
the language pair. Also, notice that the phrase sur
la can take on an arbitrary distribution over any
English phrases without affecting the likelihood of
either sentence pair. Not only does this counterintuitive
parameterization give a high data likelihood,
but it is also a fixed point of the EM algorithm.
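The two likelihoods in this example can be checked numerically. The sketch below is illustrative only and rests on several assumptions that are not in the text: the rows elided from the first table are filled in with 0.5 for any phrase touching carte or table and 1.0 otherwise; "non-degenerate" is read as excluding the single whole-sentence phrase, which leaves seven segmentations of the four-word fragment; and monotone alignment pairs each French span with the English words at the same positions. All function and variable names are hypothetical.

```python
from itertools import combinations

FRENCH = "carte sur la table"
ENGLISH = ["map on the table", "notice on the chart"]


def segmentations(n):
    """Yield every split of n words into contiguous spans, except the one-span split."""
    for k in range(1, n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield list(zip(bounds, bounds[1:]))


def likelihood(french, english, phi):
    """Marginal P(e|f) with a uniform segmentation model and monotone alignment."""
    f, e = french.split(), english.split()
    segs = list(segmentations(len(f)))
    total = 0.0
    for seg in segs:
        p = 1.0
        for i, j in seg:
            p *= phi.get((" ".join(f[i:j]), " ".join(e[i:j])), 0.0)
        total += p
    return total / len(segs)


def ambiguous_table():
    """First table; the elided rows are filled in by assumption (see lead-in)."""
    f = FRENCH.split()
    phi = {}
    for i in range(len(f)):
        for j in range(i + 1, len(f) + 1):
            if (i, j) == (0, len(f)):
                continue  # the whole-sentence phrase is treated as degenerate
            prob = 0.5 if i == 0 or j == len(f) else 1.0
            for eng in ENGLISH:
                phi[(" ".join(f[i:j]), " ".join(eng.split()[i:j]))] = prob
    return phi


# Second table, copied from the text; unlisted pairs get probability 0.
DETERMINIZED = {
    ("carte", "map"): 1.0,
    ("carte sur", "notice on"): 1.0,
    ("carte sur la", "notice on the"): 1.0,
    ("sur", "on"): 1.0,
    ("sur la table", "on the table"): 1.0,
    ("la", "the"): 1.0,
    ("la table", "the table"): 1.0,
    ("table", "chart"): 1.0,
}

for eng in ENGLISH:
    print(eng,
          likelihood(FRENCH, eng, ambiguous_table()),
          likelihood(FRENCH, eng, DETERMINIZED))
```

Under these assumptions each sentence pair scores 0.25 with the first table and 2/7 (about 0.286, reported in the text as 0.28) with the second, since exactly two of the seven segmentations survive for each pair.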
The phenomenon demonstrated above poses a
problem for generative phrase models in general.
The ambiguous process of translation can be mod-
eled either by the latent segmentation variable or the
phrase translation probabilities. In some cases, opti-
mizing the likelihood of the training corpus adjusts
for the former when we would prefer the latter. We
next investigate how this problem manifests in φEM
and its effect on translation quality.
3.2 Learned parameters
The parameters of φEM differ from the heuristically
extracted parameters φH in that the conditional dis-
tributions over English translations for some French
words are sharply peaked for φEM compared to flat-
ter distributions generated by φH. This determinism
– predicted by the previous section’s example – is
not atypical of EM training for other tasks.
5For example, summing over the first translation expands to
(1/7)(φ(map | carte)φ(on the table | sur la table)
+ φ(map | carte)φ(on | sur)φ(the table | la table)).
[Figure 2: histogram of the percentage of phrase translations (0 to 40) falling in each entropy range (0 - .01, .01 - .5, .5 - 1, 1 - 1.5, 1.5 - 2, > 2), for the learned and heuristic tables.]
Figure 2: Many more French phrases have very low
entropy under the learned parameterization.
To quantify the notion of peaked distributions
over phrase translations, we compute the entropy of
the distribution for each French phrase according to
the standard definition.
$$
H(\phi(\bar{e}|\bar{f})) = -\sum_{\bar{e}} \phi(\bar{e}|\bar{f}) \log_2 \phi(\bar{e}|\bar{f})
$$
The average entropy, weighted by frequency, for the
most common 10,000 phrases in the learned table
was 1.55, compared with 3.76 for the heuristic table.
The difference between the tables becomes much
more striking when we consider the histogram of
entropies for phrases in figure 2. In particular, the
learned table has many more phrases with entropy
near zero. The most pronounced entropy differences
often appear for common phrases. Ten of the most
common phrases in the French corpus are shown in
figure 3.
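The entropy statistic used in this section can be sketched as follows. The code is a minimal illustration, assuming each table row is stored as a dictionary from English phrase to probability; the function names and the toy distributions are hypothetical and were chosen so the values are easy to verify by hand.

```python
import math


def entropy_bits(dist):
    """Entropy in bits of one conditional distribution {english: prob}."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0.0)


def weighted_average_entropy(table, freq):
    """Frequency-weighted mean entropy over the French phrases in `table`."""
    total = sum(freq[f] for f in table)
    return sum(freq[f] * entropy_bits(dist) for f, dist in table.items()) / total


# Toy distributions (hypothetical values, not taken from the phrase tables).
flat = {"our": 0.25, "that": 0.25, "is": 0.25, "we": 0.25}
peaked = {"the": 1.0}
table = {"'": flat, "de": peaked}
freq = {"'": 3, "de": 1}

print(entropy_bits(flat))                     # 2.0
print(entropy_bits(peaked))                   # 0.0
print(weighted_average_entropy(table, freq))  # 1.5
```

A uniform distribution over four translations has two bits of entropy, while a deterministic entry has none, which is the contrast the histogram summarizes.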
As more probability mass is reserved for fewer
translations, many of the alternative translations un-
der φH are assigned prohibitively small probabili-
ties. In translating 1,000 test sentences, for example,
no phrase translation with $\phi(\bar{e}|\bar{f})$ less than $10^{-5}$ was
used by the decoder. Given this empirical threshold,
nearly 60% of entries in φEM are unusable, com-
pared with 1% in φH.
3.3 Effects on Translation
While this determinism of φEM may be desirable
in some circumstances, we found that the ambiguity
in φH is often preferable at decoding time.
In particular, the pattern of translation-ambiguous
phrases receiving spuriously peaked distributions (as
described in section 3.1) introduces new translation
errors relative to the baseline. We now investigate
both positive and negative effects of the learning
process.
[Figure 3: bar chart of entropy on a log scale (1E-04 to 1E+02) under the learned and heuristic tables for the common French phrases ' , . l, l ', n ', que, qui, plus, and l ' union.]
Figure 3: Entropy of 10 common French phrases.
Several learned distributions have very low entropy.
The issue that motivated training a generative
model is sometimes resolved correctly: for a word
that translates differently alone than in the context
of an idiom, the translation probabilities can more
accurately reflect this. Returning to the previous ex-
ample, the phrase table for chat has been corrected
through the learning process. The heuristic process
gives the incorrect translation spade with 61% prob-
ability, while the statistical learning approach gives
cat with 95% probability.
While such examples of improvement are en-
couraging, the trend of spurious determinism over-
whelms this benefit by introducing errors in four re-
lated ways, each of which will be explored in turn.
1. Useful phrase pairs can be assigned very low
probabilities and therefore become unusable.
2. A proper translation for a phrase can be over-
ridden by another translation with spuriously
high probability.
3. Error-prone, common, ambiguous phrases be-
come active during decoding.
4. The language model cannot distinguish be-
tween different translation options as effec-
tively due to deterministic translation model
distributions.
The first effect follows from our observation in
section 3.2 that many phrase pairs are unusable due
to vanishingly small probabilities. Some of the en-
tries that are made unusable by re-estimation are
helpful at decoding time, evidenced by the fact
that pruning the set of φEM’s low-scoring learned
phrases from the original heuristic table reduces
BLEU score by 0.02 for 25k training sentences (be-
low the score for φEM).
The second effect is more subtle. Consider the
sentence in figure 4, which to a first approximation
can be translated as a series of cognates, as
demonstrated by the decoding that follows from the
heuristic parameterization φH.6 Notice also that the
translation probabilities from heuristic extraction are
non-deterministic. On the other hand, the translation
system makes a significant lexical error on this simple
sentence when parameterized by φEM: the use
of caractérise in this context is incorrect. This error
arises from a sharply peaked distribution over English
phrases for caractérise.
This example illustrates a recurring problem: errors
do not necessarily arise because a correct translation
is not available. Notice that a preferable translation
of degree as degré is available under both
parameterizations. Degré is not used, however, because
of the peaked distribution of a competing
translation candidate. In this way, very high probability
translations can effectively block the use of
more appropriate translations at decoding time.
What is furthermore surprising and noteworthy in
this example is that the learned, near-deterministic
translation for caractérise is not a common translation
for the word. Not only does the statistical
learning process yield low-entropy translation
distributions, but occasionally the translation with
undesirably high conditional probability does not have
a strong surface correlation with the source phrase.
This example is not unique; during different initializations
of the EM algorithm, we noticed such patterns
even for common French phrases such as de
and ne.
6While there is some agreement error and awkwardness, the
heuristic translation is comprehensible to native speakers. The
learned translation incorrectly translates degree, degrading the
translation quality.
[Figure 4: the sentence "the situation varies to an enormous degree" decoded under each phrase table, with the table entries for caractérise and degré.]
Heuristically Extracted Phrase Table
Translation: la situation varie d ' une immense degré
caractérise
English          φ(e|f)
characterises    0.49
characterised    0.21
permeate         0.05
features         0.05
typifies         0.05
degré
English          φ(e|f)
degree           0.49
level            0.38
extent           0.02
amount           0.02
how              0.01
Learned Phrase Table
Translation: la situation varie d ' une immense caractérise
caractérise
English          φ(e|f)
degree           0.998
characterises    0.001
characterised    0.001
degré
English          φ(e|f)
degree           0.64
level            0.26
extent           0.10
Figure 4: Spurious determinism in the learned phrase parameters degrades translation quality.
The third source of errors is closely related: com-
mon phrases that translate in many ways depending
on the context can introduce errors if they have a
spuriously peaked distribution. For instance, con-
sider the lone apostrophe, which is treated as a sin-
gle token in our data set (figure 5). The shape of
the heuristic translation distribution for the phrase is
intuitively appealing, showing a relatively flat dis-
tribution among many possible translations. Such
a distribution has very high entropy. On the other
hand, the learned table translates the apostrophe to
the with probability very near 1.
Heuristic
English    φH(e|f)
our        0.10
that       0.09
is         0.06
we         0.05
next       0.05
Learned
English    φEM(e|f)
the        0.99
,          4.1·10^-3
is         6.5·10^-4
to         6.3·10^-4
in         5.3·10^-4
Figure 5: Translation probabilities for an apostrophe,
the most common French phrase. The learned
table contains a highly peaked distribution.
Such common phrases whose translation depends
highly on the context are ripe for producing translation
errors. The flatness of the distribution of φH ensures
that the single apostrophe will rarely be used
during decoding because no one phrase table entry
has high enough probability to promote its use. On
the other hand, using the peaked entry φEM(the | ')
incurs virtually no cost to the score of a translation.
The final kind of errors stems from interactions
between the language and translation models. The
selection among translation choices via a language
model, a key virtue of the noisy channel framework,
is hindered by the determinism of the translation
model. This effect appears to be less significant
than the previous three. We should note, however,
that adjusting the language and translation model
weights during decoding does not close the perfor-
mance gap between φH and φEM.
3.4 Improvements
In light of the low entropy of φEM, we could hope to
improve translations by retaining entropy. There are
several strategies we have considered to achieve this.
Broadly, we have tried two approaches: combining
φEM and φH via heuristic interpolation methods
and modifying the training loop to limit determinism.
The simplest strategy to increase entropy is to
interpolate the heuristic and learned phrase tables.
Varying the weight of interpolation showed an
improvement over the heuristic of up to 0.01 for 100k
sentences. A more modest improvement of 0.003 for
25k training sentences appears in table 1.
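One way to picture table interpolation is as a per-entry linear mix. The text does not specify the interpolation scheme, so the linear form, the function name, and the weight of 0.5 below are assumptions made for illustration. The example values are the top entries for caractérise listed in figure 4; because those lists are truncated, they do not sum to one.

```python
def interpolate(phi_h, phi_em, weight):
    """Linear mix per entry: weight * phi_h + (1 - weight) * phi_em."""
    mixed = {}
    for f in set(phi_h) | set(phi_em):
        h, em = phi_h.get(f, {}), phi_em.get(f, {})
        mixed[f] = {e: weight * h.get(e, 0.0) + (1.0 - weight) * em.get(e, 0.0)
                    for e in set(h) | set(em)}
    return mixed


# Truncated entries for caractérise as listed in figure 4.
phi_h = {"caractérise": {"characterises": 0.49, "characterised": 0.21}}
phi_em = {"caractérise": {"degree": 0.998,
                          "characterises": 0.001,
                          "characterised": 0.001}}

mixed = interpolate(phi_h, phi_em, weight=0.5)
print(mixed["caractérise"])
```

With equal weights, the spurious entry degree falls from 0.998 to 0.499 while characterises rises from 0.001 to about 0.25, so the correct translation becomes usable again at decoding time. A phrase that is missing from one table simply keeps a scaled-down copy of the other table's distribution in this sketch.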
In another experiment, we interpolated the out-
put of each iteration of EM with its input, thereby
maintaining some entropy from the initialization pa-
rameters. BLEU score increased to a maximum of
0.394 using this technique with 100k training sen-
tences, outperforming the heuristic by a slim margin
of 0.005.
We might address the determinization in φEM
without resorting to interpolation by modifying the
training procedure to retain entropy. By imposing a
non-uniform segmentation model that favors shorter
phrases over longer ones, we hope to prevent the
error-causing effects of EM training outlined above.
In principle, this change will encourage EM to explain
training sentences with shorter phrases. In
practice, however, this approach has not led to an
improvement in BLEU.
Another approach to maintaining entropy during
the training process is to smooth the probabilities
generated by EM. In particular, we can use the fol-
lowing smoothed update equation during the train-
ing loop, which reserves a portion of probability
mass for unseen translations.
$$
\phi_{new}(\bar{e}_j|\bar{f}_i) = \frac{c(\bar{f}_i, \bar{e}_j)}{c(\bar{f}_i) + k^{l-1}}
$$
In the equation above, l is the length of the French
phrase and k is a tuning parameter. This formula-
tion not only serves to reduce very spiked probabili-
ties in φEM, but also boosts the probability of short
phrases to encourage their use. With k = 2.5, this
smoothing approach improves BLEU by .007 using
25k training sentences, nearly equaling the heuristic
(table 1).
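The smoothed update can be sketched as below. This is an illustration under stated assumptions, not the authors' code: it reads the smoothing term as k raised to the power l - 1 (so that longer French phrases are discounted more, matching the remark that short phrases are boosted), it measures l by whitespace-separated words, and the soft counts are invented for the example.

```python
from collections import defaultdict


def smoothed_update(pair_counts, k=2.5):
    """phi_new(e|f) = c(f, e) / (c(f) + k ** (l - 1)), with l = words in f."""
    phrase_counts = defaultdict(float)
    for (f, _e), c in pair_counts.items():
        phrase_counts[f] += c
    return {(f, e): c / (phrase_counts[f] + k ** (len(f.split()) - 1))
            for (f, e), c in pair_counts.items()}


# Hypothetical soft counts from one e-step.
counts = {
    ("table", "table"): 3.0,
    ("table", "chart"): 1.0,
    ("la table", "the table"): 4.0,
}
phi = smoothed_update(counts)
for pair, prob in phi.items():
    print(pair, round(prob, 3))
```

Here the one-word phrase table loses 1/5 of its mass to unseen translations, while the two-word phrase la table loses 2.5/6.5, so even a phrase with a single observed translation no longer receives probability 1.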
4 Conclusion
Re-estimating phrase translation probabilities using
a generative model holds the promise of improving
upon heuristic techniques. However, the combina-
torial properties of a phrase-based generative model
have unfortunate side effects. In cases of true ambi-
guity in the language pair to be translated, parameter
estimates that explain the ambiguity using segmen-
tation variables can in some cases yield higher data
likelihoods by determinizing phrase translation esti-
mates. However, this behavior in turn leads to errors
at decoding time.
We have also shown that some modest benefit can
be obtained from re-estimation through the blunt in-
strument of interpolation. A remaining challenge is
to design more appropriate statistical models which
tie segmentations together unless sufficient evidence
of true non-compositionality is present; perhaps
such models could properly combine the benefits of
both current approaches.
Estimate                                       BLEU
φH                                             0.385
φH phrase pairs that also appear in φEM        0.365
φEM                                            0.374
φEM with a non-uniform segmentation model      0.374
φEM with smoothing                             0.381
φEM with gaps filled in by φH                  0.374
φEM interpolated with φH                       0.388
Table 1: BLEU results for 25k training sentences.
5 Acknowledgments
We would like to thank the anonymous reviewers for
their valuable feedback on this paper.

References
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della
Pietra, and Robert L. Mercer. The mathematics of
statistical machine translation: Parameter estimation.
Computational Linguistics, 19(2), 1993.
Philipp Koehn. Europarl: A Multilingual Corpus for
Evaluation of Machine Translation. USC Information
Sciences Institute, 2002.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. Sta-
tistical phrase-based translation. HLT-NAACL, 2003a.
Philipp Koehn. Pharaoh: A Beam Search Decoder for
Phrase-Based Statistical Machine Translation Models.
USC Information Sciences Institute, 2003b.
Daniel Marcu and William Wong. A phrase-based, joint
probability model for statistical machine translation.
Conference on Empirical Methods in Natural Language
Processing, 2002.
Franz Josef Och and Hermann Ney. A systematic com-
parison of various statistical alignment models. Com-
putational Linguistics, 29(1):19–51, 2003.
Franz Josef Och, Christoph Tillmann, and Hermann Ney.
Improved alignment models for statistical machine
translation. ACL Workshops, 1999.
Andreas Stolcke. SRILM – an extensible language model-
ing toolkit. Proceedings of the International Confer-
ence on Spoken Language Processing, 2002.
Richard Zens, Franz Josef Och and Hermann Ney.
Phrase-Based Statistical Machine Translation. Annual
German Conference on AI, 2002.
