Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 53–61,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Phrasetable Smoothing for Statistical Machine Translation
George Foster and Roland Kuhn and Howard Johnson
National Research Council Canada
Ottawa, Ontario, Canada
firstname.lastname@nrc.gc.ca
Abstract
We discuss different strategies for smooth-
ing the phrasetable in Statistical MT, and
give results over a range of translation set-
tings. We show that any type of smooth-
ing is a better idea than the relative-
frequency estimates that are often used.
The best smoothing techniques yield con-
sistent gains of approximately 1% (abso-
lute) according to the BLEU metric.
1 Introduction
Smoothing is an important technique in statistical
NLP, used to deal with perennial data sparseness
and empirical distributions that overfit the training
corpus. Surprisingly, however, it is rarely men-
tioned in statistical Machine Translation. In par-
ticular, state-of-the-art phrase-based SMT relies
on a phrasetable—a large set of ngram pairs over
the source and target languages, along with their
translation probabilities. This table, which may
contain tens of millions of entries, and phrases of
up to ten words or more, is an excellent candidate
for smoothing. Yet very few publications describe
phrasetable smoothing techniques in detail.
In this paper, we provide the first system-
atic study of smoothing methods for phrase-based
SMT. Although we introduce a few new ideas,
most methods described here were devised by oth-
ers; the main purpose of this paper is not to in-
vent new methods, but to compare methods. In
experiments over many language pairs, we show
that smoothing yields small but consistent gains in
translation performance. We feel that this paper
only scratches the surface: many other combina-
tions of phrasetable smoothing techniques remain
to be tested.
We define a phrasetable as a set of source
phrases (ngrams) ˜s and their translations ˜t, along
with associated translation probabilities p(˜s|˜t) and
p(˜t|˜s). These conditional distributions are derived
from the joint frequencies c(˜s,˜t) of source/target
phrase pairs observed in a word-aligned parallel
corpus.
Traditionally, maximum-likelihood estimation
from relative frequencies is used to obtain con-
ditional probabilities (Koehn et al., 2003), eg,
p(˜s|˜t) = c(˜s,˜t)/summationtext˜s c(˜s,˜t) (since the estimation
problems for p(˜s|˜t) and p(˜t|˜s) are symmetrical,
we will usually refer only to p(˜s|˜t) for brevity).
The most obvious example of the overfitting this
causes can be seen in phrase pairs whose con-
stituent phrases occur only once in the corpus.
These are assigned conditional probabilities of 1,
higher than the estimated probabilities of pairs for
which much more evidence exists, in the typical
case where the latter have constituents that co-
occur occasionally with other phrases. During de-
coding, overlapping phrase pairs are in direct com-
petition, so estimation biases such as this one in
favour of infrequent pairs have the potential to sig-
nificantly degrade translation quality.
An excellent discussion of smoothing tech-
niques developed for ngram language models
(LMs) may be found in (Chen and Goodman,
1998; Goodman, 2001). Phrasetable smoothing
differs from ngram LM smoothing in the follow-
ing ways:
• Probabilities of individual unseen events are
not important. Because the decoder only
proposes phrase translations that are in the
phrasetable (ie, that have non-zero count), it
never requires estimates for pairs ˜s,˜t having
53
c(˜s,˜t) = 0.1 However, probability mass is
reserved for the set of unseen translations,
implying that probability mass is subtracted
from the seen translations.
• There is no obvious lower-order distribution
for backoff. One of the most important tech-
niques in ngram LM smoothing is to com-
bine estimates made using the previous n−1
words with those using only the previous n−i
words, for i = 2...n. This relies on the
fact that closer words are more informative,
which has no direct analog in phrasetable
smoothing.
• The predicted objects are word sequences
(in another language). This contrasts to LM
smoothing where they are single words, and
are thus less amenable to decomposition for
smoothing purposes.
We propose various ways of dealing with these
special features of the phrasetable smoothing
problem, and give evaluations of their perfor-
mance within a phrase-based SMT system.
The paper is structured as follows: section 2
gives a brief description of our phrase-based SMT
system; section 3 presents the smoothing tech-
niques used; section 4 reviews previous work; sec-
tion 5 gives experimental results; and section 6
concludes and discusses future work.
2 Phrase-based Statistical MT
Given a source sentence s, our phrase-based SMT
system tries to find the target sentence ˆt that is
the most likely translation of s. To make search
more efficient, we use the Viterbi approximation
and seek the most likely combination of t and its
alignment a with s, rather than just the most likely
t:
ˆt = argmax
t
p(t|s) ≈ argmax
t,a
p(t,a|s),
where a = (˜s1,˜t1,j1),...,(˜sK,˜tK,jK); ˜tk are tar-
get phrases such that t = ˜t1 ...˜tK; ˜sk are source
phrases such that s = ˜sj1 ... ˜sjK; and ˜sk is the
translation of the kth target phrase ˜tk.
1This is a first approximation; exceptions occur when dif-
ferent phrasetables are used in parallel, and when rules are
used to translate certain classes of entities.
To model p(t,a|s), we use a standard loglinear
approach:
p(t,a|s) ∝ exp
bracketleftBiggsummationdisplay
i
λifi(s,t,a)
bracketrightBigg
where each fi(s,t,a) is a feature function, and
weights λi are set using Och’s algorithm (Och,
2003) to maximize the system’s BLEU score (Pa-
pineni et al., 2001) on a development corpus. The
features used in this study are: the length of t;
a single-parameter distortion penalty on phrase
reordering in a, as described in (Koehn et al.,
2003); phrase translation model probabilities; and
trigram language model probabilities logp(t), us-
ing Kneser-Ney smoothing as implemented in the
SRILM toolkit (Stolcke, 2002).
Phrase translation model probabilities are fea-
tures of the form:
logp(s|t,a) ≈
Ksummationdisplay
k=1
logp(˜sk|˜tk)
ie, we assume that the phrases ˜sk specified by a
are conditionally independent, and depend only on
their aligned phrases ˜tk. The “forward” phrase
probabilities p(˜t|˜s) are not used as features, but
only as a filter on the set of possible translations:
for each source phrase ˜s that matches some ngram
in s, only the 30 top-ranked translations ˜t accord-
ing to p(˜t|˜s) are retained.
To derive the joint counts c(˜s,˜t) from which
p(˜s|˜t) and p(˜t|˜s) are estimated, we use the phrase
induction algorithm described in (Koehn et al.,
2003), with symmetrized word alignments gener-
ated using IBM model 2 (Brown et al., 1993).
3 Smoothing Techniques
Smoothing involves some recipe for modifying
conditional distributions away from pure relative-
frequency estimates made from joint counts, in or-
der to compensate for data sparsity. In the spirit of
((Hastie et al., 2001), figure 2.11, pg. 38) smooth-
ing can be seen as a way of combining the relative-
frequency estimate, which is a model with high
complexity, high variance, and low bias, with an-
other model with lower complexity, lower vari-
ance, and high bias, in the hope of obtaining bet-
ter performance on new data. There are two main
ingredients in all such recipes: some probability
distribution that is smoother than relative frequen-
cies (ie, that has fewer parameters and is thus less
54
complex) and some technique for combining that
distribution with relative frequency estimates. We
will now discuss both these choices: the distribu-
tion for carrying out smoothing and the combina-
tion technique. In this discussion, we use ˜p() to
denote relative frequency distributions.
Choice of Smoothing Distribution
One can distinguish between two approaches to
smoothing phrase tables. Black-box techniques do
not look inside phrases but instead treat them as
atomic objects: that is, both the ˜s and the ˜t in the
expression p(˜s|˜t) are treated as units about which
nothing is known except their counts. In contrast,
glass-box methods break phrases down into their
component words.
The black-box approach, which is the sim-
pler of the two, has received little attention in
the SMT literature. An interesting aspect of
this approach is that it allows one to implement
phrasetable smoothing techniques that are analo-
gous to LM smoothing techniques, by treating the
problem of estimating p(˜s|˜t) as if it were the prob-
lem of estimating a bigram conditional probabil-
ity. In this paper, we give experimental results
for phrasetable smoothing techniques analogous
to Good-Turing, Fixed-Discount, Kneser-Ney, and
Modified Kneser-Ney LM smoothing.
Glass-box methods for phrasetable smoothing
have been described by other authors: see sec-
tion 3.3. These authors decompose p(˜s|˜t) into a
set of lexical distributions p(s|˜t) by making inde-
pendence assumptions about the words s in ˜s. The
other possibility, which is similar in spirit to ngram
LM lower-order estimates, is to combine estimates
made by replacing words in ˜t with wildcards, as
proposed in section 3.4.
Choice of Combination Technique
Although we explored a variety of black-box and
glass-box smoothing distributions, we only tried
two combination techniques: linear interpolation,
which we used for black-box smoothing, and log-
linear interpolation, which we used for glass-box
smoothing.
For black-box smoothing, we could have used a
backoff scheme or an interpolation scheme. Back-
off schemes have the form:
p(˜s|˜t) =
braceleftbigg p
h(˜s|˜t), c(˜s,˜t) ≥ τ
pb(˜s|˜t), else
where ph(˜s|˜t) is a higher-order distribution,
pb(˜s|˜t) is a smooth backoff distribution, and τ is
a threshold above which counts are considered re-
liable. Typically, τ = 1 and ph(˜s|˜t) is version of
˜p(˜s|˜t) modified to reserve some probability mass
for unseen events.
Interpolation schemes have the general form:
p(˜s|˜t) = α(˜s,˜t)˜p(˜s|˜t) + β(˜s,˜t)pb(˜s|˜t), (1)
where α and β are combining coefficients. As
noted in (Chen and Goodman, 1998), a key
difference between interpolation and backoff is
that the former approach uses information from
the smoothing distribution to modify ˜p(˜s|˜t) for
higher-frequency events, whereas the latter uses
it only for low-frequency events (most often 0-
frequency events). Since for phrasetable smooth-
ing, better prediction of unseen (zero-count)
events has no direct impact—only seen events are
represented in the phrasetable, and thus hypoth-
esized during decoding—interpolation seemed a
more suitable approach.
For combining relative-frequency estimates
with glass-box smoothing distributions, we em-
ployed loglinear interpolation. This is the tradi-
tional approach for glass-box smoothing (Koehn
et al., 2003; Zens and Ney, 2004). To illustrate the
difference between linear and loglinear interpola-
tion, consider combining two Bernoulli distribu-
tions p1(x) and p2(x) using each method:
plinear(x) = αp1(x) + (1−α)p2(x)
ploglin(x) = p1(x)
αp2(x)
p1(x)αp2(x) + q1(x)αq2(x)
where qi(x) = 1 − pi(x). Setting p2(x) = 0.5
to simulate uniform smoothing gives ploglin(x) =
p1(x)α/(p1(x)α + q1(x)α). This is actually less
smooth than the original distribution p1(x): it pre-
serves extreme values 0 and 1, and makes inter-
mediate values more extreme. On the other hand,
plinear(x) = αp1(x) + (1 − α)/2, which has the
opposite properties: it moderates extreme values
and tends to preserve intermediate values.
An advantage of loglinear interpolation is that
we can tune loglinear weights so as to maximize
the true objective function, for instance BLEU; re-
call that our translation model is itself loglinear,
with weights set to minimize errors. In fact, a lim-
itation of the experiments described in this paper
is that the loglinear weights for the glass-box tech-
niques were optimized for BLEU using Och’s al-
gorithm (Och, 2003), while the linear weights for
55
black-box techniques were set heuristically. Ob-
viously, this gives the glass-box techniques an ad-
vantage when the different smoothing techniques
are compared using BLEU! Implementing an al-
gorithm for optimizing linear weights according to
BLEU is high on our list of priorities.
The preceding discussion implicitly assumes a
single set of counts c(˜s,˜t) from which conditional
distributions are derived. But, as phrases of differ-
ent lengths are likely to have different statistical
properties, it might be worthwhile to break down
the global phrasetable into separate phrasetables
for each value of |˜t| for the purposes of smooth-
ing. Any similar strategy that does not split up
{˜s|c(˜s,˜t) > 0} for any fixed ˜t can be applied to
any smoothing scheme. This is another idea we
are eager to try soon.
We now describe the individual smoothing
schemes we have implemented. Four of them
are black-box techniques: Good-Turing and three
fixed-discount techniques (fixed-discount inter-
polated with unigram distribution, Kneser-Ney
fixed-discount, and modified Kneser-Ney fixed-
discount). Two of them are glass-box techniques:
Zens-Ney “noisy-or” and Koehn-Och-Marcu IBM
smoothing. Our experiments tested not only these
individual schemes, but also some loglinear com-
binations of a black-box technique with a glass-
box technique.
3.1 Good-Turing
Good-Turing smoothing is a well-known tech-
nique (Church and Gale, 1991) in which observed
counts c are modified according to the formula:
cg = (c + 1)nc+1/nc (2)
where cg is a modified count value used to replace
c in subsequent relative-frequency estimates, and
nc is the number of events having count c. An
intuitive motivation for this formula is that it ap-
proximates relative-frequency estimates made by
successively leaving out each event in the corpus,
and then averaging the results (N´adas, 1985).
A practical difficulty in implementing Good-
Turing smoothing is that the nc are noisy for large
c. For instance, there may be only one phrase
pair that occurs exactly c = 347,623 times in a
large corpus, and no pair that occurs c = 347,624
times, leading to cg(347,623) = 0, clearly not
what is intended. Our solution to this problem
is based on the technique described in (Church
and Gale, 1991). We first take the log of the ob-
served (c,nc) values, and then use a linear least
squares fit to lognc as a function of logc. To en-
sure that the result stays close to the reliable values
of nc for large c, error terms are weighted by c, ie:
c(lognc−lognprimec)2, where nprimec are the fitted values.
Our implementation pools all counts c(˜s,˜t) to-
gether to obtain nprimec (we have not yet tried separate
counts based on length of ˜t as discussed above). It
follows directly from (2) that the total count mass
assigned to unseen phrase pairs is cg(0)n0 = n1,
which we approximate by nprime1. This mass is dis-
tributed among contexts ˜t in proportion to c(˜t),
giving final estimates:
p(˜s|˜t) = cg(˜s,˜t)summationtext
s cg(˜s,˜t) + p(˜t)nprime1
,
where p(˜t) = c(˜t)/summationtext˜t c(˜t).
3.2 Fixed-Discount Methods
Fixed-discount methods subtract a fixed discount
D from all non-zero counts, and distribute the re-
sulting probability mass according to a smoothing
distribution (Kneser and Ney, 1995). We use an
interpolated version of fixed-discount proposed by
(Chen and Goodman, 1998) rather than the origi-
nal backoff version. For phrase pairs with non-
zero counts, this distribution has the general form:
p(˜s|˜t) = c(˜s,˜t)−Dsummationtext
˜s c(˜s,˜t)
+ α(˜t)pb(˜s|˜t), (3)
where pb(˜s|˜t) is the smoothing distribution. Nor-
malization constraints fix the value of α(˜t):
α(˜t) = D n1+(∗,˜t)/
summationdisplay
˜s
c(˜s,˜t),
where n1+(∗,˜t) is the number of phrases ˜s for
which c(˜s,˜t) > 0.
We experimented with two choices for the
smoothing distribution pb(˜s|˜t). The first is a plain
unigram p(˜s), and the second is the Kneser-Ney
lower-order distribution:
pb(˜s) = n1+(˜s,∗)/
summationdisplay
˜s
n1+(˜s,∗),
ie, the proportion of unique target phrases that ˜s is
associated with, where n1+(˜s,∗) is defined anal-
ogously to n1+(∗,˜t). Intuitively, the idea is that
source phrases that co-occur with many different
56
target phrases are more likely to appear in new
contexts.
For both unigram and Kneser-Ney smoothing
distributions, we used a discounting coefficient de-
rived by (Ney et al., 1994) on the basis of a leave-
one-out analysis: D = n1/(n1 + 2n2). For the
Kneser-Ney smoothing distribution, we also tested
the “Modified Kneser-Ney” extension suggested
in (Chen and Goodman, 1998), in which specific
coefficients Dc are used for small count values
c up to a maximum of three (ie D3 is used for
c ≥ 3). For c = 2 and c = 3, we used formu-
las given in that paper.
3.3 Lexical Decomposition
The two glass-box techniques that we considered
involve decomposing source phrases with inde-
pendence assumptions. The simplest approach as-
sumes that all source words are conditionally in-
dependent, so that:
p(˜s|˜t) =
˜Jproductdisplay
j=1
p(sj|˜t)
We implemented two variants for p(sj|˜t) that
are described in previous work. (Zens and Ney,
2004) describe a “noisy-or” combination:
p(sj|˜t) = 1−p(¯sj|˜t)
≈ 1−
˜Iproductdisplay
i=1
(1−p(sj|ti))
where ¯sj is the probability that sj is not in the
translation of ˜t, and p(sj|ti) is a lexical proba-
bility. (Zens and Ney, 2004) obtain p(sj|ti) from
smoothed relative-frequency estimates in a word-
aligned corpus. Our implementation simply uses
IBM1 probabilities, which obviate further smooth-
ing.
The noisy-or combination stipulates that sj
should not appear in ˜s if it is not the translation
of any of the words in ˜t. The complement of this,
proposed in (Koehn et al., 2005), to say that sj
should appear in ˜s if it is the translation of at least
one of the words in ˜t:
p(sj|˜t) =
summationdisplay
i∈Aj
p(sj|ti)/|Aj|
where Aj is a set of likely alignment connections
for sj. In our implementation of this method,
we assumed that Aj = {1,..., ˜I}, ie the set of
all connections, and used IBM1 probabilities for
p(s|t).
3.4 Lower-Order Combinations
We mentioned earlier that LM ngrams have a
naturally-ordered sequence of smoothing distribu-
tions, obtained by successively dropping the last
word in the context. For phrasetable smoothing,
because no word in ˜t is a priori less informative
than any others, there is no exact parallel to this
technique. However, it is clear that estimates made
by replacing particular target (conditioning) words
with wildcards will be smoother than the original
relative frequencies. A simple scheme for combin-
ing them is just to average:
p(˜s|˜t) =
summationdisplay
i=˜I
c∗i(˜s,˜t)summationtext
˜s c∗i(˜s,˜t)
/˜I
where:
c∗i(˜s,˜t) =
summationdisplay
ti
c(˜s,t1 ...ti ...t˜I).
One might also consider progressively replacing
the least informative remaining word in the target
phrase (using tf-idf or a similar measure).
The same idea could be applied in reverse, by
replacing particular source (conditioned) words
with wildcards. We have not yet implemented
this new glass-box smoothing technique, but it has
considerable appeal. The idea is similar in spirit to
Collins’ backoff method for prepositional phrase
attachment (Collins and Brooks, 1995).
4 Related Work
As mentioned previously, (Chen and Goodman,
1998) give a comprehensive survey and evalua-
tion of smoothing techniques for language mod-
eling. As also mentioned previously, there is
relatively little published work on smoothing for
statistical MT. For the IBM models, alignment
probabilities need to be smoothed for combina-
tions of sentence lengths and positions not encoun-
tered in training data (Garc´ıa-Varea et al., 1998).
Moore (2004) has found that smoothing to cor-
rect overestimated IBM1 lexical probabilities for
rare words can improve word-alignment perfor-
mance. Langlais (2005) reports negative results
for synonym-based smoothing of IBM2 lexical
probabilities prior to extracting phrases for phrase-
based SMT.
For phrase-based SMT, the use of smoothing to
avoid zero probabilities during phrase induction is
reported in (Marcu and Wong, 2002), but no de-
tails are given. As described above, (Zens and
57
Ney, 2004) and (Koehn et al., 2005) use two dif-
ferent variants of glass-box smoothing (which they
call “lexical smoothing”) over the phrasetable, and
combine the resulting estimates with pure relative-
frequency ones in a loglinear model. Finally, (Cet-
tollo et al., 2005) describes the use of Witten-Bell
smoothing (a black-box technique) for phrasetable
counts, but does not give a comparison to other
methods. As Witten-Bell is reported by (Chen and
Goodman, 1998) to be significantly worse than
Kneser-Ney smoothing, we have not yet tested this
method.
5 Experiments
We carried out experiments in two different set-
tings: broad-coverage ones across six European
language pairs using selected smoothing tech-
niques and relatively small training corpora; and
Chinese to English experiments using all im-
plemented smoothing techniques and large train-
ing corpora. For the black-box techniques,
the smoothed phrase table replaced the original
relative-frequency (RF) phrase table. For the
glass-box techniques, a phrase table (either the
original RF phrase table or its replacement after
black-box smoothing) was interpolated in loglin-
ear fashion with the smoothing glass-box distribu-
tion, with weights set to maximize BLEU on a de-
velopment corpus.
To estimate the significance of the results across
different methods, we used 1000-fold pairwise
bootstrap resampling at the 95% confidence level.
5.1 Broad-Coverage Experiments
In order to measure the benefit of phrasetable
smoothing for relatively small corpora, we used
the data made available for the WMT06 shared
task (WMT, 2006). This exercise is conducted
openly with access to all needed resources and
is thus ideal for benchmarking statistical phrase-
based translation systems on a number of language
pairs.
The WMT06 corpus is based on sentences ex-
tracted from the proceedings of the European Par-
liament. Separate sentence-aligned parallel cor-
pora of about 700,000 sentences (about 150MB)
are provided for the three language pairs hav-
ing one of French, Spanish and German with En-
glish. SRILM language models based on the same
source are also provided for each of the four lan-
guages. We used the provided 2000-sentence dev-
sets for tuning loglinear parameters, and tested on
the 3064-sentence test sets.
Results are shown in table 1 for relative-
frequency (RF), Good-Turing (GT), Kneser-Ney
with 1 (KN1) and 3 (KN3) discount coefficients;
and loglinear combinations of both RF and KN3
phrasetables with Zens-Ney-IBM1 (ZN-IBM1)
smoothed phrasetables (these combinations are
denoted RF+ZN-IBM1 and KN3+ZN-IBM1).
It is apparent from table 1 that any kind of
phrase table smoothing is better than using none;
the minimum improvement is 0.45 BLEU, and
the difference between RF and all other meth-
ods is statistically significant. Also, Kneser-
Ney smoothing gives a statistically significant im-
provement over GT smoothing, with a minimum
gain of 0.30 BLEU. Using more discounting co-
efficients does not appear to help. Smoothing
relative frequencies with an additional Zens-Ney
phrasetable gives about the same gain as Kneser-
Ney smoothing on its own. However, combining
Kneser-Ney with Zens-Ney gives a clear gain over
any other method (statistically significant for all
language pairs except en→es and en→de) demon-
strating that these approaches are complementary.
5.2 Chinese-English Experiments
To test the effects of smoothing with larger
corpora, we ran a set of experiments for
Chinese-English translation using the corpora
distributed for the NIST MT05 evaluation
(www.nist.gov/speech/tests/mt). These are sum-
marized in table 2. Due to the large size of
the out-of-domain UN corpus, we trained one
phrasetable on it, and another on all other parallel
corpora (smoothing was applied to both). We also
used a subset of the English Gigaword corpus to
augment the LM training material.
corpus use sentences
non-UN phrasetable1 + LM 3,164,180
UN phrasetable2 + LM 4,979,345
Gigaword LM 11,681,852
multi-p3 dev 993
eval-04 test 1788
Table 2: Chinese-English Corpora
Table 3 contains results for the Chinese-English
experiments, including fixed-discount with uni-
gram smoothing (FDU), and Koehn-Och-Marcu
smoothing with the IBM1 model (KOM-IBM1)
58
smoothing method fr −→ en es −→ en de −→ en en −→ fr en −→ es en −→ de
RF 25.35 27.25 20.46 27.20 27.18 14.60
GT 25.95 28.07 21.06 27.85 27.96 15.05
KN1 26.83 28.66 21.36 28.62 28.71 15.42
KN3 26.84 28.69 21.53 28.64 28.70 15.40
RF+ZN-IBM1 26.84 28.63 21.32 28.84 28.45 15.44
KN3+ZN-IBM1 27.25 29.30 21.77 29.00 28.86 15.49
Table 1: Broad-coverage results
as described in section 3.3. As with the
broad-coverage experiments, all of the black-box
smoothing techniques do significantly better than
the RF baseline. However, GT appears to work
better in the large-corpus setting: it is statistically
indistinguishable from KN3, and both these meth-
ods are significantly better than all other fixed-
discount variants, among which there is little dif-
ference.
Not surprisingly, the two glass-box methods,
ZN-IBM1 and KOM-IBM1, do poorly when used
on their own. However, in combination with an-
other phrasetable, they yield the best results, ob-
tained by RF+ZN-IBM1 and GT+KOM-IBM1,
which are statistically indistinguishable. In con-
strast to the situation in the broad-coverage set-
ting, these are not significantly better than the
best black-box method (GT) on its own, although
RF+ZN-IBM1 is better than all other glass-box
combinations.
smoothing method BLEU score
RF 29.85
GT 30.66
FDU 30.23
KN1 30.29
KN2 30.13
KN3 30.54
ZN-IBM1 29.55
KOM-IBM1 28.09
RF+ZN-IBM1 30.95
RF+KOM-IBM1 30.10
GT+ZN-IBM1 30.45
GT+KOM-IBM1 30.81
KN3+ZN-IBM1 30.66
Table 3: Chinese-English Results
A striking difference between the broad-
coverage setting and the Chinese-English setting
is that in the former it appears to be beneficial
to apply KN3 smoothing to the phrasetable that
gets combined with the best glass-box phrasetable
(ZN), whereas in the latter setting it does not. To
test whether this was due to corpus size (as the
broad-coverage corpora are around 10% of those
for Chinese-English), we calculated Chinese-
English learning curves for the RF+ZN-IBM1 and
KN3-ZN-IBM1 methods, shown in figure 1. The
results are somewhat inconclusive: although the
KN3+ZN-IBM1 curve is perhaps slightly flatter,
the most obvious characteristic is that this method
appears to be highly sensitive to the particular cor-
pus sample used.
 0.25
 0.255
 0.26
 0.265
 0.27
 0.275
 0.28
 0.285
 0.29
 0.295
 0.3
 0  10  20  30  40  50  60  70  80
BLEU
proportion of corpus
Learning curves for smoothing methods
RF+ZN-IBM1
KN3+ZN-IBM1
Figure 1: Learning curves for two glass-box com-
binations.
6 Conclusion and Future Work
We tested different phrasetable smoothing tech-
niques in two different translation settings: Eu-
ropean language pairs with relatively small cor-
pora, and Chinese to English translation with large
corpora. The smoothing techniques fall into two
59
categories: black-box methods that work only on
phrase-pair counts; and glass-box methods that de-
compose phrase probabilities into lexical proba-
bilities. In our implementation, black-box tech-
niques use linear interpolation to combine relative
frequency estimates with smoothing distributions,
while glass-box techniques are combined in log-
linear fashion with either relative-frequencies or
black-box estimates.
All smoothing techniques tested gave statisti-
cally significant gains over pure relative-frequency
estimates. In the small-corpus setting, the best
technique is a loglinear combination of Kneser-
Ney count smoothing with Zens-Ney glass-box
smoothing; this yields an average gain of 1.6
BLEU points over relative frequencies. In the
large-corpus setting, the best technique is a log-
linear combination of relative-frequency estimates
with Zens-Ney smoothing, with a gain of 1.1
BLEU points. Of the two glass-box smoothing
methods tested, Zens-Ney appears to have a slight
advantage over Koehn-Och-Marcu. Of the black-
box methods tested, Kneser-Ney is clearly bet-
ter for small corpora, but is equivalent to Good-
Turing for larger corpora.
The paper describes several smoothing alterna-
tives which we intend to test in future work:
• Linear versus loglinear combinations (in our
current work, these coincide with the black-
box versus glass-box distinction, making it
impossible to draw conclusions).
• Lower-order distributions as described in sec-
tion 3.4.
• Separate count-smoothing bins based on
phrase length.
7 Acknowledgements
The authors would like to thank their colleague
Michel Simard for stimulating discussions. The
first author would like to thank all his colleagues
for encouraging him to taste a delicacy that was
new to him (shredded paper with maple syrup).
This material is based upon work supported by
the Defense Advanced Research Projects Agency
(DARPA) under Contract No. HR0011-06-C-
0023. Any opinions, findings and conclusions or
recommendations expressed in this material are
those of the author(s) and do not necessarily re-
flect the views of the Defense Advanced Research
Projects Agency (DARPA).

References
Peter F. Brown, Stephen A. Della Pietra, Vincent
Della J. Pietra, and Robert L. Mercer. 1993. The
mathematics of Machine Translation: Parameter es-
timation. Computational Linguistics, 19(2):263–
312, June.
M. Cettollo, M. Federico, N. Bertoldi, R. Cattoni, and
B. Chen. 2005. A look inside the ITC-irst SMT
system. In Proceedings of MT Summit X, Phuket,
Thailand, September. International Association for
Machine Translation.
Stanley F. Chen and Joshua T. Goodman. 1998. An
empirical study of smoothing techniques for lan-
guage modeling. Technical Report TR-10-98, Com-
puter Science Group, Harvard University.
K. Church and W. Gale. 1991. A comparison of the
enhanced Good-Turing and deleted estimation meth-
ods for estimating probabilities of English bigrams.
Computer speech and language, 5(1):19–54.
M. Collins and J. Brooks. 1995. Prepositional phrase
attachment through a backed-off model. In Proceed-
ings of the 3rd ACL Workshop on Very Large Cor-
pora (WVLC), Cambridge, Massachusetts.
Ismael Garc´ıa-Varea, Francisco Casacuberta, and Her-
mann Ney. 1998. An iterative, DP-based search al-
gorithm for statistical machine translation. In Pro-
ceedings of the 5th International Conference on Spo-
ken Language Processing (ICSLP) 1998, volume 4,
pages 1135–1138, Sydney, Australia, December.
Joshua Goodman. 2001. A bit of progress in language
modeling. Computer Speech and Language.
Trevor Hastie, Robert Tibshirani, and Jerome Fried-
man. 2001. The Elements of Statistical Learning.
Springer.
Reinhard Kneser and Hermann Ney. 1995. Improved
backing-off for m-gram language modeling. In Pro-
ceedings of the International Conference on Acous-
tics, Speech, and Signal Processing (ICASSP) 1995,
pages 181–184, Detroit, Michigan. IEEE.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical phrase-based translation. In Ed-
uard Hovy, editor, Proceedings of the Human Lan-
guage Technology Conference of the North Ameri-
can Chapter of the Association for Computational
Linguistics, pages 127–133, Edmonton, Alberta,
Canada, May. NAACL.
P. Koehn, A. Axelrod, A. B. Mayne, C. Callison-Burch,
M. Osborne, D. Talbot, and M. White. 2005. Ed-
inburgh system description for the 2005 NIST MT
evaluation. In Proceedings of Machine Translation
Evaluation Workshop.
Philippe Langlais, Guihong Cao, and Fabrizio Gotti.
2005. RALI: SMT shared task system description.
In Proceedings of the 2nd ACL workshop on Build-
ing and Using Parallel Texts, pages 137–140, Uni-
versity of Michigan, Ann Arbor, June.
Daniel Marcu and William Wong. 2002. A phrase-
based, joint probability model for statistical machine
translation. In Proceedings of the 2002 Conference
on Empirical Methods in Natural Language Pro-
cessing (EMNLP), Philadelphia, PA.
Robert C. Moore. 2004. Improving IBM word-
alignment model 1. In Proceedings of the 42th An-
nual Meeting of the Association for Computational
Linguistics (ACL), Barcelona, July.
Hermann Ney, Ute Essen, and Reinhard Kneser.
1994. On structuring probabilistic dependencies in
stochastic language modelling. Computer Speech
and Language, 10:1–38.
Arthur N´adas. 1985. On Turing’s formula for
word probabilities. IEEE Transactions on Acous-
tics, Speech and Signal Processing (ASSP), ASSP-
33(6):1415–1417, December.
Franz Josef Och. 2003. Minimum error rate training
for statistical machine translation. In Proceedings
of the 41th Annual Meeting of the Association for
Computational Linguistics (ACL), Sapporo, July.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2001. BLEU: A method for automatic
evaluation of Machine Translation. Technical Re-
port RC22176, IBM, September.
Andreas Stolcke. 2002. SRILM - an extensi-
ble language modeling toolkit. In Proceedings of
the 7th International Conference on Spoken Lan-
guage Processing (ICSLP) 2002, Denver, Colorado,
September.
WMT. 2006. The NAACL Workshop on Statistical
Machine Translation (www.statmt.org/wmt06), New
York, June.
Richard Zens and Hermann Ney. 2004. Improve-
ments in phrase-based statistical machine transla-
tion. In Proceedings of Human Language Technol-
ogy Conference / North American Chapter of the
ACL, Boston, May.
