An Empirical Study of Smoothing Techniques for Language 
Modeling 
Stanley F. Chen 
Harvard University 
Aiken Computation Laboratory 
33 Oxford St. 
Cambridge, MA 02138 
sfc©eecs, harvard, edu 
Joshua Goodman 
Harvard University 
Aiken Computation Laboratory 
33 Oxford St. 
Cambridge, MA 02138 
goodma.n~eecs, harvard, edu 
Abstract 
We present an extensive empirical com- 
parison of several smoothing techniques in 
the domain of language modeling, includ- 
ing those described by Jelinek and Mer- 
cer (1980), Katz (1987), and Church and 
Gale (1991). We investigate for the first 
time how factors such as training data 
size, corpus (e.g., Brown versus Wall Street 
Journal), and n-gram order (bigram versus 
trigram) affect the relative performance of 
these methods, which we measure through 
the cross-entropy of test data. In addition, 
we introduce two novel smoothing tech- 
niques, one a variation of Jelinek-Mercer 
smoothing and one a very simple linear in- 
terpolation technique, both of which out- 
perform existing methods. 
1 Introduction 
Smoothing is a technique essential in the construc- 
tion of n-gram language models, a staple in speech 
recognition (Bahl, Jelinek, and Mercer, 1983) as well 
as many other domains (Church, 1988; Brown et al., 
1990; Kernighan, Church, and Gale, 1990). A lan- 
guage model is a probability distribution over strings 
P(s) that attempts to reflect the frequency with 
which each string s occurs as a sentence in natu- 
ral text. Language models are used in speech recog- 
nition to resolve acoustically ambiguous utterances. 
For example, if we have that P(it takes two) >> 
P(it takes too), then we know ceteris paribus to pre- 
fer the former transcription over the latter. 
While smoothing is a central issue in language 
modeling, the literature lacks a definitive compar- 
ison between the many existing techniques. Previ- 
ous studies (Nadas, 1984; Katz, 1987; Church and 
Gale, 1991; MacKay and Peto, 1995) only compare 
a small number of methods (typically two) on a sin- 
gle corpus and using a single training data size. As 
a result, it is currently difficult for a researcher to 
intelligently choose between smoothing schemes. 
In this work, we carry out an extensive 
empirical comparison of the most widely used 
smoothing techniques, including those described 
by 3elinek and Mercer (1980), Katz (1987), and 
Church and Gale (1991). We carry out experiments 
over many training data sizes on varied corpora us- 
ing both bigram and trigram models. We demon- 
strate that the relative performance of techniques 
depends greatly on training data size and n-gram 
order. For example, for bigram models produced 
from large training sets Church-Gale smoothing has 
superior performance, while Katz smoothing per- 
forms best on bigram models produced from smaller 
data. For the methods with parameters that can 
be tuned to improve performance, we perform an 
automated search for optimal values and show that 
sub-optimal parameter selection can significantly de- 
crease performance. To our knowledge, this is the 
first smoothing work that systematically investigates 
any of these issues. 
In addition, we introduce two novel smooth- 
ing techniques: the first belonging to the class of 
smoothing models described by 3elinek and Mer- 
cer, the second a very simple linear interpolation 
method. While being relatively simple to imple- 
ment, we show that these methods yield good perfor- 
mance in bigram models and superior performance 
in trigram models. 
We take the performance of a method m to be its 
cross-entropy on test data 
1 IT 
IvT - log  Pro(t,) 
i=1 
where Pm(ti) denotes the language model produced 
with method m and where the test data T is com- 
posed of sentences (tl,...,tzr) and contains a total 
of NT words. The entropy is inversely related to 
the average probability a model assigns to sentences 
in the test data, and it is generally assumed that 
lower entropy correlates with better performance in 
applications. 
310 
1.1 Smoothing n-gram Models 
In n-gram language modeling, the probability of a 
string P(s) is expressed as the product of the prob- 
abilities of the words that compose the string, with 
each word probability conditional on the identity of 
the last n - 1 words, i.e., ifs = wl-..wt we have 
l 1 
P(s) = H P(wi\[w{-1) ~ 1-~ P i-1 (1) 
i=1 i=1 
where w i j denotes the words wi • •. wj. Typically, n is 
taken to be two or three, corresponding to a bigram 
or trigram model, respectively. 1 
Consider the case n = 2. To estimate the proba- 
bilities P(wilwi-,) in equation (1), one can acquire 
a large corpus of text, which we refer to as training 
data, and take 
P(Wi-lWi) PML(Wil i-1) -- 
P(wi-1) 
c(wi-lWi)/Ns 
e(wi-1)/Ns 
c(wi_ w ) 
where c(c 0 denotes the number of times the string 
c~ occurs in the text and Ns denotes the total num- 
ber of words. This is called the maximum likelihood 
(ML) estimate for P(wilwi_l). 
While intuitive, the maximum likelihood estimate 
is a poor one when the amount of training data is 
small compared to the size of the model being built, 
as is generally the case in language modeling. For ex- 
ample, consider the situation where a pair of words, 
or bigram, say burnish the, doesn't occur in the 
training data. Then, we have PML(the Iburnish) = O, 
which is clearly inaccurate as this probability should 
be larger than zero. A zero bigram probability can 
lead to errors in speech recognition, as it disallows 
the bigram regardless of how informative the acous- 
tic signal is. The term smoothing describes tech- 
niques for adjusting the maximum likelihood esti- 
mate to hopefully produce more accurate probabili- 
ties. 
As an example, one simple smoothing technique is 
to pretend each bigram occurs once more than it ac- 
tually did (Lidstone, 1920; Johnson, 1932; Jeffreys, 
1948), yielding 
C(Wi-lWi) "\[- 1 
= + IVl 
where V is the vocabulary, the set of all words be- 
ing considered. This has the desirable quality of 
1To make the term P(wdw\[Z~,,+~) meaningful for 
i < n, one can pad the beginning of the string with 
a distinguished token. In this work, we assume there are 
n - 1 such distinguished tokens preceding each sentence. 
preventing zero bigram probabilities. However, this 
scheme has the flaw of assigning the same probabil- 
ity to say, burnish the and burnish thou (assuming 
neither occurred in the training data), even though 
intuitively the former seems more likely because the 
word the is much more common than thou. 
To address this, another smoothing technique is to 
interpolate the bigram model with a unigram model 
PML(Wi) = c(wi)/Ns, a model that reflects how of- 
ten each word occurs in the training data. For ex- 
ample, we can take 
Pinto p( i J i-1) = APM (w  pW _l) + (1 - 
getting the behavior that bigrams involving common 
words are assigned higher probabilities (Jelinek and 
Mercer, 1980). 
2 Previous Work 
The simplest type of smoothing used in practice is 
additive smoothing (Lidstone, 1920; Johnson, 1932; 
aeffreys, 1948), where we take 
i w i-1 e(wi_,,+l) + 
= + elVl (2) 
and where Lidstone and Jeffreys advocate /i = 1. 
Gale and Church (1990; 1994) have argued that this 
method generally performs poorly. 
The Good-Turing estimate (Good, 1953) is cen- 
tral to many smoothing techniques. It is not used 
directly for n-gram smoothing because, like additive 
smoothing, it does not perform the interpolation of 
lower- and higher-order models essential for good 
performance. Good-Turing states that an n-gram 
that occurs r times should be treated as if it had 
occurred r* times, where 
r* = (r + 1)n~+l 
and where n~ is the number of n-grams that. occur 
exactly r times in the training data. 
Katz smoothing (1987) extends the intuitions of 
Good-Turing by adding the interpolation of higher- 
order models with lower-order models. It is perhaps 
the most widely used smoothing technique in speech 
recognition. 
Church and Gale (1991) describe a smoothing 
method that combines the Good-Turing estimate 
with bucketing, the technique of partitioning a set, 
of n-grams into disjoint groups, where each group 
is characterized independently through a set of pa- 
rameters. Like Katz, models are defined recursively 
in terms of lower-order models. Each n-gram is as- 
signed to one of several buckets based on its fre- 
quency predicted from lower-order models. Each 
bucket is treated as a separate distribution and 
Good-Turing estimation is performed within each, 
giving corrected counts that are normalized to yield 
probabilities. 
311 
Nd bucketing 
2 ° * ~°% ° 
o °$ o • . 
° .~ °e o *°*° * ° 
• ** o~,~L.s °o . • o 
oO o ~o ° *b 
;.°*~a-:.. • . ° 
• % at ...,~;e.T¢: 
° . .. : ° 
° % o% **° ~ - ° 
~°~ ° o o 
° °• °~ ° 
o* 
° o 
o o 
, , ,i , , ,i , , ,i , " .... 0 
lo 100 1000 10000 100000 0.o01 r~rn~¢ 
of counts in distN~t\]on 
new bucketing 
., . . ., 
oeW~ o 
. 6'V, 
*°Na, o 
* * I * , , I , , * I , , * I , * 
0.01 0.1 1 10 
average r~n-zem count in dis~bution r~nus One 
Figure 1: )~ values for old and new bucketing schemes for Jelinek-Mercer smoothing; each point represents a 
single bucket 
The other smoothing technique besides Katz 
smoothing widely used in speech recognition is due 
to Jelinek and Mercer (1980). They present a class 
of smoothing models that involve linear interpola- 
tion, e.g., Brown et al. (1992) take 
i--1 
PML(Wi IWi-n+l) "Iv ~Wi__ 1 i--1 
i-- n-\]-I 
P~ /W i-1 , (1-- )~to~-~ ) inte~pt i wi_n+2) (3) 
i-- u-I-1 
That is, the maximum likelihood estimate is inter- 
polated with the smoothed lower-order distribution, 
which is defined analogously. Training a distinct 
I ~-1 for each wi_,~+li-1 is not generally felicitous; 
Wi--n-{-1 
Bahl, Jelinek, and Mercer (1983) suggest partition- 
i-1 ing the 1~,~-~ into buckets according to c(wi_~+l), 
i-- n-l-1 
where all )~w~-~ in the same bucket are constrained 
i-- n-l-1 
to have the same value. 
To yield meaningful results, the data used to esti- 
mate the A~!-, need to be disjoint from the data 
~-- n"l-1 
used to calculate PML .2 In held-out interpolation, 
one reserves a section of the training data for this 
purpose. Alternatively, aelinek and Mercer describe 
a technique called deleted interpolation where differ- 
ent parts of the training data rotate in training either 
PML or the A,o!-' ; the results are then averaged. 
z-- n-\[-I Several smoothing techniques are motivated 
within a Bayesian framework, including work by 
Nadas (1984) and MacKay and Peto (1995). 
3 Novel Smoothing Techniques 
Of the great many novel methods that we have tried, 
two techniques have performed especially well. 
2When the same data is used to estimate both, setting 
all )~ ~-~ to one yields the optimal result. 
Wl-- n-l-1 
3.1 Method average-count 
This scheme is an instance of Jelinek-Mercer 
smoothing. Referring to equation (3), recall that 
Bahl et al. suggest bucketing the A~!-I according 
i--1 to c(Wi_n+l). We have found that partitioning the 
~!-~ according to the average number of counts 
*--~+1 
per non-zero element ~(~--~"+1) yields better Iwi:~(~:_.+~)>01 
results. 
Intuitively, the less sparse the data for estimat- 
ing i-1 PML(WilWi_n+l), the larger A~,~-~ should be. 
*-- ~-t-1 
While larger i-1 c(wi_n+l) generally correspond to less 
sparse distributions, this quantity ignores the allo- 
cation of counts between words. For example, we 
would consider a distribution with ten counts dis- 
tributed evenly among ten words to be much more 
sparse than a distribution with ten counts all on a 
single word. The average number of counts per word 
seems to more directly express the concept of sparse- 
ness, 
In Figure 1, we graph the value of ~ assigned to 
each bucket under the original and new bucketing 
schemes on identical data. Notice that the new buck- 
eting scheme results in a much tighter plot, indicat- 
ing that it is better at grouping together distribu- 
tions with similar behavior. 
3.2 Method one-count 
This technique combines two intuitions. First, 
MacKay and Peto (1995) argue that a reasonable 
form for a smoothed distribution is 
• i-1 Pone(W i i-1 c(wL, +l) +  Po,,e(wilw _ +9 
IWi--nq-1) = i--1 c(wi_n+l) + 
The parameter a can be thought of as the num- 
ber of counts being added to the given distribution, 
312 
where the new counts are distributed as in the lower- 
order distribution. Secondly, the Good-Turing esti- 
mate can be interpreted as stating that the number 
of these extra counts should be proportional to the 
number of words with exactly one count in the given 
distribution. We have found that taking 
i-1 O~ = "y \[nl(Wi_n+l) -~- ~\] (4) 
works well, where 
i-i i 
is the number of words with one count, and where/3 
and 7 are constants. 
4 Experimental Methodology 
4.1 Data 
We used the Penn treebauk and TIPSTER cor- 
pora distributed by the Linguistic Data Consor- 
tium. From the treebank, we extracted text from 
the tagged Brown corpus, yielding about one mil- 
lion words. From TIPSTER, we used the Associ- 
ated Press (AP), Wall Street Journal (WSJ), and 
San Jose Mercury News (SJM) data, yielding 123, 
84, and 43 million words respectively. We created 
two distinct vocabularies, one for the Brown corpus 
and one for the TIPSTER data. The former vocab- 
ulary contains all 53,850 words occurring in Brown; 
the latter vocabulary consists of the 65,173 words 
occurring at least 70 times in TIPSTER. 
For each experiment, we selected three segments 
of held-out data along with the segment of train- 
ing data. One held-out segment was used as the 
test data for performance evaluation, and the other 
two were used as development test data for opti- 
mizing the parameters of each smoothing method. 
Each piece of held-out data was chosen to be roughly 
50,000 words. This decision does not reflect practice 
very well, as when the training data size is less than 
50,000 words it is not realistic to have so much devel- 
opment test data available. However, we made this 
decision to prevent us having to optimize the train- 
ing versus held-out data tradeoff for each data size. 
In addition, the development test data is used to op- 
timize typically very few parameters, so in practice 
small held-out sets are generally adequate, and per- 
haps can be avoided altogether with techniques such 
as deleted estimation. 
4.2 Smoothing Implementations 
In this section, we discuss the details of our imple- 
mentations of various smoothing techniques. Due 
to space limitations, these descriptions are not com- 
prehensive; a more complete discussion is presented 
in Chen (1996). The titles of the following sections 
include the mnemonic we use to refer to the imple- 
mentations in later sections. Unless otherwise speci- 
fied, for those smoothing models defined recursively 
in terms of lower-order models, we end the recursion 
by taking the n = 0 distribution to be the uniform 
distribution Punif(wi) = l/IV\[. For each method, we 
highlight the parameters (e.g., Am and 5 below) that 
can be tuned to optimize performance. Parameter 
values are determined through training on held-out 
data. 
4.2.1 Baseline Smoothing (interp-baseline) 
For our baseline smoothing method, we use an 
instance of Jelinek-Mercer smoothing where we con- 
strain all A,~!-I to be equal to a single value A,~ for ,- n-hi 
each n, i.e., 
i--1 i-1 Pb so(wilw _ +i) = A,, + 
(I Am) -- Pbase(WilWi_n+2) 
4.2.2 Additive Smoothing (plus-one and 
plus-delta) 
We consider two versions of additive smoothing. 
Referring to equation (2), we fix 5 = 1 in plus-one 
smoothing. In plus-delta, we consider any 6. 
4.2.3 Katz Smoothing (katz) 
While the original paper (Katz, 1987) uses a single 
parameter k, we instead use a different k for each 
n > 1, k,~. We smooth the unigram distribution 
using additive smoothing with parameter 5. 
4.2.4 Church-Gale Smoothing 
(church-gale) 
To smooth the counts n~ needed for the Good- 
Turing estimate, we use the technique described by 
Gale and Sampson (1995). We smooth the unigram 
distribution using Good-tiering without any bucket- 
ing. 
Instead of the bucketing scheme described in the 
original paper, we use a scheme analogous to the 
one described by Bahl, Jelinek, and Mercer (1983). 
We make the assumption that whether a bucket is 
large enough for accurate Good-Turing estimation 
depends on how many n-grams with non-zero counts 
occur in it. Thus, instead of partitioning the space 
of P(wi-JP(wi) values in some uniform way as was 
done by Church and Gale, we partition the space 
so that at least Cmi n non-zero n-grams fall in each 
bucket. 
Finally, the original paper describes only bigram 
smoothing in detail; extending this method to tri- 
gram smoothing is ambiguous. In particular, it is 
unclear whether to bucket trigrams according to 
i-1 i--1 P(wi_JP(w d or P(wi_JP(wilwi-1). We chose the 
former; while the latter may yield better perfor- 
mance, our belief is that it is much more difficult 
to implement and that it requires a great deal more 
computation. 
4.2.5 Jelinek-Mercer Smoothing 
(interp-held-out and interp-del-int) 
We implemented two versions of Jelinek-Mercer 
smoothing differing only in what data is used to 
313 
train the A's. We bucket the A ~-1 according to Wi--n-bl 
i-1 C(Wi_~+I) as suggested by Bahl et al. Similar to our 
Church-Gale implementation, we choose buckets to 
ensure that at least Cmi n words in the data used to 
train the A's fall in each bucket. 
In interp-held-out, the A's are trained using 
held-out interpolation on one of the development 
test sets. In interp-del-int, the A's are trained 
using the relaxed deleted interpolation technique de- 
scribed by Jelinek and Mercer, where one word is 
deleted at a time. In interp-del-int, we bucket 
an n-gram according to its count before deletion, as 
this turned out to significantly improve performance. 
4.2.6 Novel Smoothing Methods 
(new-avg-count and new-one-count) 
The implementation new-avg-count, correspond- 
ing to smoothing method average-count, is identical 
to interp-held-out except that we use the novel 
bucketing scheme described in section 3.1. In the 
implementation new-one-count, we have different 
parameters j3~ and 7~ in equation (4) for each n. 
5 Results 
In Figure 2, we display the performance of the 
interp-baseline method for bigram and trigram 
models on TIPSTER, Brown, and the WSJ subset 
of TIPSTER. In Figures 3-6, we display the relative 
performance of various smoothing techniques with 
respect to the baseline method on these corpora, as 
measured by difference in entropy. In the graphs 
on the left of Figures 2-4, each point represents an 
average over ten runs; the error bars represent the 
empirical standard deviation over these runs. Due 
to resource limitations, we only performed multiple 
runs for data sets of 50,000 sentences or less. Each 
point on the graphs on the right represents a sin- 
gle run, but we consider sizes up to the amount of 
data available. The graphs on the bottom of Fig- 
ures 3-4 are close-ups of the graphs above, focusing 
on those algorithms that perform better than the 
baseline. To give an idea of how these cross-entropy 
differences translate to perplexity, each 0.014 bits 
correspond roughly to a 1% change in perplexity. 
In each run except as noted below, optimal val- 
ues for the parameters of the given technique were 
searched for using Powell's search algorithm as real- 
ized in Numerical Recipes in C (Press et al., 1988, 
pp. 309-317). Parameters were chosen to optimize 
the cross-entropy of one of the development test sets 
associated with the given training set. To constrain 
the search, we searched only those parameters that 
were found to affect performance significantly, as 
verified through preliminary experiments over sev- 
eral data sizes. For katz and church-gale, we did 
not perform the parameter search for training sets 
over 50,000 sentences due to resource constraints, 
and instead manually extrapolated parameter val- 
Method Lines 
interp-baseline ~ 400 
plus-one 40 
plus-delta 40 
katz 300 
church-gale i000 
±nterp-held-out 400 
interp-del-int 400 
new-avg-count 400 
new-one-count 50 
Table 1: Implementation difficulty of various meth- 
ods in terms of lines of C++ code 
ues from optimal values found on smaller data sizes. 
We ran interp-del-int only on sizes up to 50,000 
sentences due to time constraints. 
From these graphs, we see that additive smooth- 
ing performs poorly and that methods katz and 
interp-held-out consistently perform well. Our 
implementation church-gale performs poorly ex- 
cept on large bigram training sets, where it performs 
the best. The novel methods new-avg-count and 
new-one-count perform well uniformly across train- 
ing data sizes, and are superior for trigram models. 
Notice that while performance is relatively consis- 
tent across corpora, it varies widely with respect to 
training set size and n-gram order. 
The method interp-del-int performs signifi- 
cantly worse than interp-held-out, though they 
differ only in the data used to train the A's. However, 
we delete one word at a time in interp-del-int; we 
hypothesize that deleting larger chunks would lead 
to more similar performance. 
In Figure 7, we show how the values of the pa- 
rameters 6 and Cmin affect the performance of meth- 
ods katz and new-avg-count, respectively, over sev- 
eral training data sizes. Notice that poor parameter 
setting can lead to very significant losses in perfor- 
mance, and that optimal parameter settings depend 
on training set size. 
To give an informal estimate of the difficulty of 
implementation of each method, in Table 1 we dis- 
play the number of lines of C++ code in each imple- 
mentation excluding the core code common across 
techniques. 
6 Discussion 
To our knowledge, this is the first empirical compari- 
son of smoothing techniques in language modeling of 
such scope: no other study has used multiple train- 
ing data sizes, corpora, or has performed parameter 
optimization. We show that in order to completely 
3To implement the baseline method, we just used the 
interp-held-out code as it is a special case. Written 
anew, it probably would have been about 50 lines. 
314 
11.5 
10.5 
10 
9.5 
0 
a.5 
average over ten runs at each size, up to 50,0OO sentences 
"-~:: :. TIPSTER bigram 
"-. "'~:-WS.J bigrarn 
1000 10000 sentences 
of training data (-25 words~sentence) 
tl.5 
11 
10.5 
tO 
9.5 
0 
8.5 
8 
7.5 
7 
6.5 
tOO 
single run at each size 
",.. 
~io~n t rigrarn 
-.~ ~... 
"',., "'"~'.:"-~. TIPSTER bigram ...... ,.. :=::::; .......... 
V~SJ b~gram 
. TIPSTER tdgra~ 
tO00 1O000 100000 le+06 )e+07 
sentences of training data (-25 words/sentence) 
Figure 2: Baseline cross-entropy on test data; graph on left displays averages over ten runs for training sets 
up to 50,000 sentences, graph on right displays single runs for training sets up to 10,000,000 sentences 
average over ten runs at each size, up to 50,000 sentences 
7 ........ , ........ ) • . . 
l~Us-one ........... ~ ............ ~ ............ = ................ 
6 ....... ~ ..... 
c~ ..... plu s=dsita .......... I .... 
4 ........... .. ! 
._c 
-t ........ ~ ........ ~ 1000 10000 
sentences of training data (-25 wordS/sentence) 
single run at each size, up to 10,000,000 sentences 
. ., . ., . . ., . . ., . . 
+ ......... +....--~ plus~ne 
..... ..~...y~ "'"'~" -..+.,..,,., 
.,-" .... o.-..-~'""~.......~ '-.. . 
--.. ~, 2=,j ........ 
....... 
1 J .church-gata ""* 
ks/z, interp-held-out, ~nterpdel-int, new-avg-count, new-one-count (see below) 
-1 , • ........ ' ' ' "' " 
100 1000 10000 100000 le+06 le+07 
sentences of training data (-25 words~sentence) 
average over ten runs at each size, up to 50,000 sentences single run at each size, Up to 10,000,000 sentences 
0.04 . . , • - - , • - , - - , • 
0 ........................................................................................................................................................... 
-0,02 "~"-.. interp-delqnt 
-0.00 
~.1 n ...... 
..... ~t 1 ............ ~ ........... ~ ........... 
.............. ~ ................ :::::::::::::::::::::::::::::::::::::::::::::: 
-0.16 ........ J ........... 
too 1000 10000 
sentences of training data (-25 words~sentence} 
o.02 o .,';'~o~o,-~nt t 
-0.02 
-0.04 ~.-..._Z .~ ...... ~.-. katz d I 
" / :*" -'-"(" '" "'m. 
...~.,\]\[F ,.a ........ a""~ ', 
JO.O8 " inteq)qle)d-out ..o'"" 
~3.1 .~" .~----~. new-one-count c/" x...-~ 
-0.12 ..... "t~""" " ......... ~-.....-" ./ new-svg-count 
-0.14 k ................. x _~_.._~...i" "~"~ 
"0.1610 o lO0O 1OOOO 10oooo le+06 le+07 
sentences of training data (-25 words/sentonce) 
Figure 3: Trigram model on TIPSTER data; relative performance of various methods with respect to baseline; 
graphs on left display averages over ten runs for training sets up to 50,000 sentences, graphs on right display 
single runs for training sets up to 10,000,000 sentences; top graphs show all algorithms, bottom graphs zoom 
in on those methods that perform better than the baseline method 
315 
E =o 
average over ten runs at each size, up to 50,000 senlences 
5 ........ , ........ , • • - 
4.6 .... ~- ................. "~'" plus-o'n~ ............. ~ .... 
4 ........... 
3.S " 
3 ....... t~.,, " "-~ ........... ~ plus4el~ 
2.5 ........... * 
1. I 
church*gale ' 
0.5 
0 ..... 
-05 100 1000 10000 
sentences of training data (-26 words/sentence) 
average over t~ runs at each size, up to 50,000 sentences 
0.02 ....... , ....... , . . • 
-0.02 "" "~... humh*gale 
~- ..... ......~ ............ {. .............. 
-0.00 
..... ~ ............. -'~-'~" :=L::~" T n:w:n2ount /~ ........ 1 
.0.14 
100 1000 10000 
sentences of training data (-26 words/sentence) 
single run at each size, up to 10,000,000 sentences 
5 , • ., . • .. ., . • ., . • 
4 " .. 
3.5 "~"'~"... I~US*one 
2.5 t''°'""'~'""'o.........." ""o. "*. 
1 f--,~ " "~" " p us-de ta church-gale "'~... 
O.5 "~-.. --._o. 
~3.6 T , ,k~tz, taterp-he~-out., interp~, el~tat, ,~ew,~zvg~ou ' ...... ~ne.~ount ! ...... ,ow), l 
100 1000 10000 100000 le+06 le+07 
sentences of training data (-26 words/sentence) 
stagle r~n at each size, up to 10,OCO,O00 sentences 
0.02 • • •, . , • ., ., . . . 
o I -0.02 church*gale ...~:,'~ -0.04 ~ " " ,¢ " .. 
". n erp-he d-out . " ~,~ -~ Interprdel-mt ..- .* ~'" 
........ ~ ..~ .~./.- 
~0~ 
-0.1 new-one-count ..~D -...B'" .~.. 
• . -..~ j.~.~..:.::;$., ~Om 1 2 
ew-avg-count 
.0.14 ' , , I , , , i , , , m , . , I , , 
10o 10oo 1ooo0 1oo000 le+06 le+07 
sentences of training data (-25 wo~ds/sentence) 
Figure 4: Bigram model on TIPSTER data; relative performance of various methods with respect to baseline; 
graphs on left display averages over ten runs for training sets up to 50,000 sentences, graphs on right display 
single runs for training sets up to 10,000,000 sentences; top graphs show all algorithms, bottom graphs zoom 
in on those methods that perform better than the baseline method 
bigram model 
0.02 ...... , ...... , 
0 
i -0.02 church-gale 
interprdel-int 
.0,04 ..-~, ............... ~. 
.0.00 
.0.o8 ...z" ......... ~*.. 
-0.12 "= . ~inte~p~held-out 
lew-a-~n~t...~--~272~::'z--.. "n~:orpe-~t a .......... D .......... 
-0.16 
-0.18 ........ i ........ i 
100 1000 IOQO0 
sentences of training data (-21 words~sentence) 
tzigram model 
0 ....................................................................................................................................................................... 
.0.02 
-0.06 katz .--~" "'~'"'""'" 
• ..-:::.. ......... .,,<.:..-" 
-0.06 .-" ........ ...i r ~ t e t p..<1 el-~ip_t .......... 
.0.12 :::.-.. "'~=.... ......... ~ ........... Q.. interp*held-out 
............ 7~.=: =-P~::.... " ......... e ............... o ........... e ........... ~ ............... 
.0.14 - ~ - =':::=*~-~_.___ ~ new-one-count 
-0.1600 1000 10000 
sentences of traJelng data (-21 words/sentence) 
Figure 5: Bigram and trigram models on Brown corpus; relative performance of various methods with respect 
to baseline 
316 
bigram model tdgram model 
"(, ~ .......................... o 0 i ~ 0.02 
-0.02 hurch-gale 
~ '--nt erp~J el-int "~ -0.04 .. .... ~ -0.02 
inte rp-d el-int ".. ~ inte rpheld~out ... ~"" ~ "'-~ 
E -0.06 • " ., ~,~. \] ~ - .- ~,. "'A. 
-- -0.06 .-'-'katz ""-~ "-.= .. 
-0°3 ' :i: ,.. ........ ::.>~,.- .~. ..... .. ~ "'- -::::.., y -oo  i -. .............. 
-0.14 • """ "~ " • " -k-atz 
-0.13 ~ -018 ........ = ' ' , 1oo 1000 10oo0 100000 le+06 10o 1000 10000 100000 le+0o 
sentences of training data (-25 words/sentence) sentences of t relelr~g data (-25 words/sentence) 
Figure 6: Bigram and trigram models on Wall Street Journal corpus; relative performance of various methods 
with respect to baseline 
z 
~C 
== 
.=_ 
performance el katz with respect to delta 1.6 .... , • .., .., . .., • .., . . 
10O senl 
1.4 
1.2 
1 
10,0O0 sent 
0.8 1,0O0 sent ..a 
0.6 /' .,.~/" 
0.4 / .-" .ED,O00 sent)< 
0.2 
"'"'d:. / .." ~'" 
e .... I , , ,i , , ,r , , ,I , , ,i , , , 
0.0Ol o.01 0.1 1 lO 10o 1000 
delta 
-0.0O 
-0.07 
==- 
-O.08 
2 -0.O3 
-0.1 
-0.11 
-0.12 
-0.13 
performance of new-avg-c~nt with respect to c-min . . ., . . ., . . 
x\ 
\ / 
~'\ lO.000,000 sent /" / 
/" 
x\, , .o 
",\ // ,,," 
.... / 
'"6. ..'"' 2 l 
"""'u, 1 OO3,0OO sent " / 
j/10,0O0 sent 
10 100 tO00 10(00 100000 
minimum number of counts per bucket 
Figure 7: Performance of katz and new-avg-count with respect to parameters ~ and Cmin, respectively 
characterize the relative performance of two tech- 
niques, it is necessary to consider multiple training 
set sizes and to try both bigram and trigram mod- 
els. Multiple runs should be performed whenever 
possible to discover whether any calculated differ- 
ences are statistically significant. Furthermore, we 
show that sub-optimM parameter selection can also 
significantly affect relative performance. 
We find that the two most widely used techniques, 
Katz smoothing and Jelinek-Mercer smoothing, per- 
form consistently well across training set sizes for 
both bigram and trigram models, with Katz smooth- 
ing performing better on trigram models produced 
from large training sets and on bigram models in 
general. These results question the generality of the 
previous reference result concerning Katz smooth- 
ing: Katz (1987) reported that his method slightly 
outperforms an unspecified version of Jelinek-Mercer 
smoothing on a single training set of 750,000 words. 
Furthermore, we show that Church-Gale smooth- 
ing, which previously had not been compared with 
common smoothing techniques, outperforms all ex- 
isting methods on bigram models produced from 
large training sets. Finally, we find that our novel 
methods average-count and one-count are superior 
to existing methods for trigram models and perform 
well on bigram models; method one-count yields 
marginally worse performance but is extremely easy 
to implement. 
In this study, we measure performance solely 
through the cross-entropy of test data; it would 
be interesting to see how these cross-entropy differ- 
ences correlate with performance in end applications 
such as speech recognition. In addition, it would be 
interesting to see whether these results extend to 
fields other than language modeling where smooth- 
ing is used, such as prepositional phrase attachment 
(Collins and Brooks, 1995), part-of-speech tagging 
(Church, 1988), and stochastic parsing (Magerman, 1994). 
317 
Acknowledgements 
The authors would like to thank Stuart Shieber and 
the anonymous reviewers for their comments on pre- 
vious versions of this paper. We would also like to 
thank William Gale and Geoffrey Sampson for sup- 
plying us with code for "Good-Turing frequency esti- 
mation without tears." This research was supported 
by the National Science Foundation under Grant No. 
IRI-93-50192 and Grant No. CDA-94-01024. The 
second author was also supported by a National Sci- 
ence Foundation Graduate Student Fellowship. 

References 
Bahl, Lalit R., Frederick Jelinek, and Robert L. 
Mercer. 1983. A maximum likelihood approach 
to continuous speech recognition. IEEE Trans- 
actions on Pattern Analysis and Machine Intelli- 
gence, PAMI-5(2):179-190, March. 
Brown, Peter F., John Cocke, Stephen A. DellaPi- 
etra, Vincent J. DellaPietra, Frederick Jelinek, 
John D. Lafferty, Robert L. Mercer, and Paul S. 
Roossin. 1990. A statistical approach to machine 
translation. Computational Linguistics, 16(2):79- 
85, June. 
Brown, Peter F., Stephen A. DellaPietra, Vincent J. 
DellaPietra, Jennifer C. Lai, and Robert L. Mer- 
cer. 1992. An estimate of an upper bound for 
the entropy of English. Computational Linguis- 
tics, 18(1):31-40, March. 
Chen, Stanley F. 1996. Building Probabilistic Mod- 
els for Natural Language. Ph.D. thesis, Harvard 
University. In preparation. 
Church, Kenneth. 1988. A stochastic parts program 
and noun phrase parser for unrestricted text. In 
Proceedings of the Second Conference on Applied 
Natural Language Processing, pages 136-143. 
Church, Kenneth W. and William A. Gale. 1991. 
A comparison of the enhanced Good-Turing and 
deleted estimation methods for estimating proba- 
bilities of English bigrams. Computer Speech and 
Language, 5:19-54. 
Collins, Michael and James Brooks. 1995. Prepo- 
sitional phrase attachment through a backed-off 
model. In David Yarowsky and Kenneth Church, 
editors, Proceedings of the Third Workshop on 
Very Large Corpora, pages 27-38, Cambridge, 
MA, June. 
Gale, William A. and Kenneth W. Church. 1990. 
Estimation procedures for language context: poor 
estimates are worse than none. In COMP- 
STAT, Proceedings in Computational Statistics, 
9th Symposium, pages 69-74, Dubrovnik, Yu- 
goslavia, September. 
Gale, William A. and Kenneth W. Church. 1994. 
What's wrong with adding one? In N. Oostdijk 
and P. de Haan, editors, Corpus-Based Research 
into Language. Rodolpi, Amsterdam. 
Gale, William A. and Geoffrey Sampson. 1995. 
Good-Turing frequency estimation without tears. 
Journal of Quantitative Linguistics, 2(3). To ap- 
pear. 
Good, I.J. 1953. The population frequencies of 
species and the estimation of population parame- 
ters. Biometrika, 40(3 and 4):237-264. 
Jeffreys, H. 1948. Theory of Probability. Clarendon 
Press, Oxford, second edition. 
Jelinek, Frederick and Robert L. Mercer. 1980. In- 
terpolated estimation of Markov source parame- 
ters from sparse data. In Proceedings of the Work- 
shop on Pattern Recognition in Practice, Amster- 
dam, The Netherlands: North-Holland, May. 
Johnson, W.E. 1932. Probability: deductive and 
inductive problems. Mind, 41:421-423. 
Katz, Slava M. 1987. Estimation of probabilities 
from sparse data for the language model com- 
ponent of a speech recognizer. IEEE Transac- 
tions on Acoustics, Speech and Signal Processing, 
ASSP-35(3):400-401, March. 
Kernighan, M.D., K.W. Church, and W.A. Gale. 
1990. A spelling correction program based on 
a noisy channel model. In Proceedings of the 
Thirteenth International Conference on Compu- 
tational Linguistics, pages 205-210. 
Lidstone, G.J. 1920. Note on the general case of the 
Bayes-Laplace formula for inductive or a posteri- 
ori probabilities. Transactions of the Faculty of 
Actuaries, 8:182-192. 
MacKay, David J. C. and Linda C. Peto. 1995. A hi- 
erarchical Dirichlet language model. Natural Lan- 
guage Engineering, 1(3):1-19. 
Magerman, David M. 1994. Natural Language Pars- 
ing as Statistical Pattern Recognition. Ph.D. the- 
sis, Stanford University, February. 
Nadas, Arthur. 1984. Estimation of probabilities in 
the language model of the IBM speech recognition 
system. IEEE Transactions on Acoustics, Speech 
and Signal Processing, ASSP-32(4):859-861, Au- 
gust. 
Press, W.H., B.P. Flannery, S.A. Teukolsky, and 
W.T. Vetterling. 1988. Numerical Recipes in C. 
Cambridge University Press, Cambridge. 
