Similarity-Based Estimation of Word Cooccurrence 
Probabilities 
Ido Dagan Fernando Pereira 
AT&T Bell Laboratories 
600 Mountain Ave. 
Murray Hill, NJ 07974, USA 
dagan©research, att. com 
pereira©research, att. com 
Abstract 
In many applications of natural language processing it 
is necessary to determine the likelihood of a given word 
combination. For example, a speech recognizer may 
need to determine which of the two word combinations 
"eat a peach" and "eat a beach" is more likely. Statis- 
tical NLP methods determine the likelihood of a word 
combination according to its frequency in a training cor- 
pus. However, the nature of language is such that many 
word combinations are infrequent and do not occur in a 
given corpus. In this work we propose a method for es- 
timating the probability of such previously unseen word 
combinations using available information on "most sim- 
ilar" words. 
We describe a probabilistic word association model 
based on distributional word similarity, and apply it 
to improving probability estimates for unseen word bi- 
grams in a variant of Katz's back-off model. The 
similarity-based method yields a 20% perplexity im- 
provement in the prediction of unseen bigrams and sta- 
tistically significant reductions in speech-recognition er- 
ror. 
Introduction 
Data sparseness is an inherent problem in statistical 
methods for natural language processing. Such meth- 
ods use statistics on the relative frequencies of config- 
urations of elements in a training corpus to evaluate 
alternative analyses or interpretations of new samples 
of text or speech. The most likely analysis will be taken 
to be the one that contains the most frequent config- 
urations. The problem of data sparseness arises when 
analyses contain configurations that never occurred in 
the training corpus. Then it is not possible to estimate 
probabilities from observed frequencies, andsome other 
estimation scheme has to be used. 
We focus here on a particular kind of configuration, 
word cooccurrence. Examples of such cooccurrences 
include relationships between head words in syntactic 
constructions (verb-object or adjective-noun, for exam- 
ple) and word sequences (n-grams). In commonly used 
models, the probability estimate for a previously un- 
seen cooccurrence is a function of the probability esti- 
Lillian Lee 
Division of Applied Sciences 
Harvard University 
33 Oxford St. Cambridge MA 02138, USA 
llee©das, harvard, edu 
mates for the words in the cooccurrence. For example, 
in the bigram models that we study here, the probabil- 
ity P(w21wl) of a conditioned word w2 that has never 
occurred in training following the conditioning word wl 
is calculated from the probability of w~, as estimated 
by w2's frequency in the corpus (Jelinek, Mercer, and 
Roukos, 1992; Katz, 1987). This method depends on 
an independence assumption on the cooccurrence of Wl 
and w2: the more frequent w2 is, the higher will be the 
estimate of P(w2\[wl), regardless of Wl. 
Class-based and similarity-based models provide an 
alternative to the independence assumption. In those 
models, the relationship between given words is mod- 
eled by analogy with other words that are in some sense 
similar to the given ones. 
Brown et a\]. (1992) suggest a class-based n-gram 
model in which words with similar cooccurrence distri- 
butions are clustered in word classes. The cooccurrence 
probability of a given pair of words then is estimated ac- 
cording to an averaged cooccurrence probability of the 
two corresponding classes. Pereira, Tishby, and Lee 
(1993) propose a "soft" clustering scheme for certain 
grammatical cooccurrences in which membership of a 
word in a class is probabilistic. Cooccurrence probabil- 
ities of words are then modeled by averaged cooccur- 
rence probabilities of word clusters. 
Dagan, Markus, and Markovitch (1993) argue that 
reduction to a relatively small number of predetermined 
word classes or clusters may cause a substantial loss of 
information. Their similarity-based model avoids clus- 
tering altogether. Instead, each word is modeled by its 
own specific class, a set of words which are most simi- 
lar to it (as in k-nearest neighbor approaches in pattern 
recognition). Using this scheme, they predict which 
unobserved cooccurrences are more likely than others. 
Their model, however, is not probabilistic, that is, it 
does not provide a probability estimate for unobserved 
cooccurrences. It cannot therefore be used in a com- 
plete probabilistic framework, such as n-gram language 
models or probabilistic lexicalized grammars (Schabes, 
1992; Lafferty, Sleator, and Temperley, 1992). 
We now give a similarity-based method for estimating 
the probabilities of cooccurrences unseen in training. 
272 
Similarity-based estimation was first used for language 
modeling in the cooccurrence smoothing method of Es- 
sen and Steinbiss (1992), derived from work on acous- 
tic model smoothing by Sugawara et al. (1985). We 
present a different method that takes as starting point 
the back-off scheme of Katz (1987). We first allocate an 
appropriate probability mass for unseen cooccurrences 
following the back-off method. Then we redistribute 
that mass to unseen cooccurrences according to an av- 
eraged cooccurrence distribution of a set of most similar 
conditioning words, using relative entropy as our sim- 
ilarity measure. This second step replaces the use of 
the independence assumption in the original back-off 
model. 
We applied our method to estimate unseen bigram 
probabilities for Wall Street Journal text and compared 
it to the standard back-off model. Testing on a held-out 
sample, the similarity model achieved a 20% reduction 
in perplexity for unseen bigrams. These constituted 
just 10.6% of the test sample, leading to an overall re- 
duction in test-set perplexity of 2.4%. We also exper- 
imented with an application to language modeling for 
speech recognition, which yielded a statistically signifi- 
cant reduction in recognition error. 
The remainder of the discussion is presented in terms 
of bigrams, but it is valid for other types of word cooc- 
currence as well. 
Discounting and Redistribution 
Many low-probability bigrams will be missing from any 
finite sample. Yet, the aggregate probability of all these 
unseen bigrams is fairly high; any new sample is very 
likely to contain some. 
Because of data sparseness, we cannot reliably use a 
maximum likelihood estimator (MLE) for bigram prob- 
abilities. The MLE for the probability of a bigram 
(wi, we) is simply: 
PML(Wi, we) -- c(w , we) N , (1) 
where c(wi, we) is the frequency of (wi, we) in the train- 
ing corpus and N is the total number of bigrams. How- 
ever, this estimates the probability of any unseen hi- 
gram to be zero, which is clearly undesirable. 
Previous proposals to circumvent the above problem 
(Good, 1953; Jelinek, Mercer, and Roukos, 1992; Katz, 
1987; Church and Gale, 1991) take the MLE as an ini- 
tial estimate and adjust it so that the total probability 
of seen bigrams is less than one, leaving some probabil- 
ity mass for unseen bigrams. Typically, the adjustment 
involves either interpolation, in which the new estimator 
is a weighted combination of the MLE and an estimator 
that is guaranteed to be nonzero for unseen bigrams, or 
discounting, in which the MLE is decreased according to 
a model of the unreliability of small frequency counts, 
leaving some probability mass for unseen bigrams. 
The back-off model of Katz (1987) provides a clear 
separation between frequent events, for which observed 
frequencies are reliable probability estimators, and low- 
frequency events, whose prediction must involve addi- 
tional information sources. In addition, the back-off 
model does not require complex estimations for inter- 
polation parameters. 
A hack-off model requires methods for (a) discounting 
the estimates of previously observed events to leave out 
some positive probability mass for unseen events, and 
(b) redistributing among the unseen events the probabil- 
ity mass freed by discounting. For bigrams the resulting 
estimator has the general form 
fPd(w21wl) if c(wi,w2) > 0 
D(w21wt) = ~.a(Wl)Pr(w2\]wt) otherwise , (2) 
where Pd represents the discounted estimate for seen 
bigrams, P~ the model for probability redistribution 
among the unseen bigrams, and a(w) is a normalization 
factor. Since the overall mass left for unseen bigrams 
starting with wi is given by 
~, P,~(welwi) , 
w~:c(wi ,w~)>0 
~(wi) = 1 - 
the normalization 
Ew2 P(w2\[ wl) : 1 is 
= 
factor required to ensure 
 (wl) 
1 - ~:c(~i,w2)>0 Pr(we\[wi) 
The second formulation of the normalization is compu- 
tationally preferable because the total number of pos- 
sible bigram types far exceeds the number of observed 
types. Equation (2) modifies slightly Katz's presenta- 
tion to include the placeholder Pr for alternative models 
of the distribution of unseen bigrams. 
Katz uses the Good-Turing formula to replace the 
actual frequency c(wi, w2) of a bigram (or an event, in 
general) with a discounted frequency, c*(wi,w2), de- 
fined by 
c*(wi, w2) = (C(Wl, w2) + 1)nc(wl'~)+i , (3) nc(wl,w2) 
where nc is the number of different bigrams in the cor- 
pus that have frequency c. He then uses the discounted 
frequency in the conditional probability calculation for 
a bigram: 
c* (wi, w2) (4) Pa(w21wt) - C(Wl) 
In the original Good-Turing method (Good, 1953) 
the free probability mass is redistributed uniformly 
among all unseen events. Instead, Katz's back-off 
scheme redistributes the free probability mass non- 
uniformly in proportion to the frequency of w2, by set- 
ting 
Pr(weJwi) = P(w~) (5) 
273 
Katz thus assumes that for a given conditioning word 
wl the probability of an unseen following word w2 is 
proportional to its unconditional probability. However, 
the overall form of the model (2) does not depend on 
this assumption, and we will next investigate an esti- 
mate for P~(w21wl) derived by averaging estimates for 
the conditional probabilities that w2 follows words that 
are distributionally similar to wl. 
The Similarity Model 
Our scheme is based on the assumption that words that 
are "similar" to wl can provide good predictions for 
the distribution of wl in unseen bigrams. Let S(Wl) 
denote a set of words which are most similar to wl, 
as determined by some similarity metric. We define 
PsiM(W21Wl), the similarity-based model for the condi- 
tional distribution of wl, as a weighted average of the 
conditional distributions of the words in S(Wl): 
PsiM(W21wl) = 
-, • '- ' ~ w(~i,~') (6) ZWleS(Wl) 2\[--~'(~\]~l'\['/fll)~"~ 
W/w, ~j ) ' 
where W(W~l, wl) is the (unnormalized) weight given to 
w~, determined by its degree of similarity to wl. Ac- 
cording to this scheme, w2 is more likely to follow wl if 
it tends to follow words that are most similar to wl. To 
complete the scheme, it is necessary to define the simi- 
larity metric and, accordingly, S(wl) and W(w~, Wl). 
Following Pereira, Tishby, and Lee (1993), we 
measure word similarity by the relative entropy, or 
Kullback-Leibler (KL) distance, between the corre- 
sponding conditional distributions 
D(w~ II w~) = Z P(w2\]wl) log P(w2Iwl) (7) ~ P(w2lw~) " 
The KL distance is 0 when wl = w~, and it increases 
as the two distribution are less similar. 
To compute (6) and (7) we must have nonzero esti- 
mates of P(w21wl) whenever necessary for (7) to be de- 
fined. We use the estimates given by the standard back- 
off model, which satisfy that requirement. Thus our 
application of the similarity model averages together 
standard back-off estimates for a set of similar condi- 
tioning words. 
We define S(wl) as the set of at most k nearest 
words to wl (excluding wl itself), that also satisfy 
D(Wl II w~) < t. k and t are parameters that control 
the contents of $(wl) and are tuned experimentally, as 
we will see below. 
W(w~, wl) is defined as 
W(w~, Wl) --- exp -/3D(Wl II ~i) 
The weight is larger for words that are more similar 
(closer) to wl. The parameter fl controls the relative 
contribution of words in different distances from wl: as 
the value of fl increases, the nearest words to Wl get rel- 
atively more weight. As fl decreases, remote words get 
a larger effect. Like k and t,/3 is tuned experimentally. 
Having a definition for PSIM(W2\[Wl), we could use it 
directly as Pr(w2\[wl) in the back-off scheme (2). We 
found that it is better to smooth PsiM(W~\[Wl) by inter- 
polating it with the unigram probability P(w2) (recall 
that Katz used P(w2) as Pr(w2\[wl)). Using linear in- 
terpolation we get 
P,(w2\[wl) = 7P(w2) + (1 - 7)PsiM(W2lWl) , (8) 
where "f is an experimentally-determined interpolation 
parameter. This smoothing appears to compensate 
for inaccuracies in Pslu(w2\]wl), mainly for infrequent 
conditioning words. However, as the evaluation be- 
low shows, good values for 7 are small, that is, the 
similarity-based model plays a stronger role than the 
independence assumption. 
To summarize, we construct a similarity-based model 
for P(w2\[wl) and then interpolate it with P(w2). The 
interpolated model (8) is used in the back-off scheme 
as Pr(w2\[wl), to obtain better estimates for unseen bi- 
grams. Four parameters, to be tuned experimentally, 
are relevant for this process: k and t, which determine 
the set of similar words to be considered,/3, which deter- 
mines the relative effect of these words, and 7, which de- 
termines the overall importance of the similarity-based 
model. 
Evaluation 
We evaluated our method by comparing its perplexity 1 
and effect on speech-recognition accuracy with the base- 
line bigram back-off model developed by MIT Lincoln 
Laboratories for the Wall Streel Journal (WSJ) text 
and dictation corpora provided by ARPA's HLT pro- 
grain (Paul, 1991). 2 The baseline back-off model follows 
closely the Katz design, except that for compactness all 
frequency one bigrams are ignored. The counts used ill 
this model and in ours were obtained from 40.5 million 
words of WSJ text from the years 1987-89. 
For perplexity evaluation, we tuned the similarity 
model parameters by minimizing perplexity on an ad- 
ditional sample of 57.5 thousand words of WSJ text, 
drawn from the ARPA HLT development test set. The 
best parameter values found were k = 60, t = 2.5,/3 = 4 
and 7 = 0.15. For these values, the improvement in 
perplexity for unseen bigrams in a held-out 18 thou- 
sand word sample, in which 10.6% of the bigrams are 
unseen, is just over 20%. This improvement on unseen 
1The perplexity of a conditional bigram probability 
model /5 with respect to the true bigram distribution is 
an information-theoretic measure of model quality (Jelinek, 
Mercer, and Roukos, 1992) that can be empirically esti- 
mated by exp - -~ ~-~i log P(w, tu, i_l ) for a test set of length 
N. Intuitively, the lower the perplexity of a model the more 
likely the model is to assign high probability to bigrams that 
actually occur. In our task, lower perplexity will indicate 
better prediction of unseen bigrams. 
2The ARPA WSJ development corpora come in two ver- 
sions, one with verbalized punctuation and the other with- 
out. We used the latter in all our experiments. 
274 
k t ~ 7 training reduction (%) test reduction (%) 
60 2.5 4 0.15 18.4 20.51 
50 2.5 4 0.15 18.38 20.45 
40 2.5 4 0.2 18.34 20.03 
30 2.5 4 0.25 18.33 19.76 
70 2.5 4 0.1 18.3 20.53 
80 2.5 4.5 0.1 18.25 20.55 
100 2.5 4.5 0.1 18.23 20.54 
90 2.5 4.5 0.1 18.23 20.59 
20 1.5 4 0.3 18.04 18.7 
10 1.5 3.5 0.3 16.64 16.94 
Table 1: Perplexity Reduction on Unseen Bigrams for Different Model Parameters 
bigrams corresponds to an overall test set perplexity 
improvement of 2.4% (from 237.4 to 231.7). Table 1 
shows reductions in training and test perplexity, sorted 
by training reduction, for different choices in the num- 
ber k of closest neighbors used. The values of f~, 7 and 
t are the best ones found for each k. 3 
From equation (6), it is clear that the computational 
cost of applying the similarity model to an unseen bi- 
gram is O(k). Therefore, lower values for k (and also 
for t) are computationally preferable. From the table, 
we can see that reducing k to 30 incurs a penalty of less 
than 1% in the perplexity improvement, so relatively 
low values of k appear to be sufficient to achieve most 
of the benefit of the similarity model. As the table also 
shows, the best value of 7 increases as k decreases, that 
is, for lower k a greater weight is given to the condi- 
tioned word's frequency. This suggests that the predic- 
tive power of neighbors beyond the closest 30 or so can 
be modeled fairly well by the overall frequency of the 
conditioned word. 
The bigram similarity model was also tested as a lan- 
guage model in speech recognition. The test data for 
this experiment were pruned word lattices for 403 WSJ 
closed-vocabulary test sentences. Arc scores in those 
lattices are sums of an acoustic score (negative log like- 
lihood) and a language-model score, in this case the 
negative log probability provided by the baseline bi- 
gram model. 
From the given lattices, we constructed new lattices 
in which the arc scores were modified to use the similar- 
ity model instead of the baseline model. We compared 
the best sentence hypothesis in each original lattice and 
in the modified one, and counted the word disagree- 
ments in which one of the hypotheses is correct. There 
were a total of 96 such disagreements. The similarity 
model was correct in 64 cases, and the back-off model in 
32. This advantage for the similarity model is statisti- 
cally significant at the 0.01 level. The overall reduction 
in error rate is small, from 21.4% to 20.9%, because 
the number of disagreements is small compared with 
3Values of fl and t refer to base 10 logarithms and expo- 
nentials in all calculations. 
the overall number of errors in our current recognition 
setup. 
Table 2 shows some examples of speech recognition 
disagreements between the two models. The hypotheses 
are labeled 'B' for back-off and 'S' for similarity, and the 
bold-face words are errors. The similarity model seems 
to be able to model better regularities such as semantic 
parallelism in lists and avoiding a past tense form after 
"to." On the other hand, the similarity model makes 
several mistakes in which a function word is inserted in 
a place where punctuation would be found in written 
text. 
Related Work 
The cooccurrence smooihing technique (Essen and 
Steinbiss, 1992), based on earlier stochastic speech 
modeling work by Sugawara et al. (1985), is the main 
previous attempt to use similarity to estimate the prob- 
ability of unseen events in language modeling. In addi- 
tion to its original use in language modeling for speech 
recognition, Grishman and Sterling (1993) applied the 
cooccurrence smoothing technique to estimate the like- 
lihood of selectional patterns. We will outline here 
the main parallels and differences between our method 
and cooccurrence smoothing. A more detailed analy- 
sis would require an empirical comparison of the two 
methods on the same corpus and task. 
In cooccurrence smoothing, as in our method, a base- 
line model is combined with a similarity-based model 
that refines some of its probability estimates. The sim- 
ilarity model in cooccurrence smoothing is based on 
the intuition that the similarity between two words w 
and w' can be measured by the confusion probability 
Pc(w'lw ) that w' can be substituted for w in an arbi- 
trary context in the training corpus. Given a baseline 
probability model P, which is taken to be the MLE, the 
confusion probability Pc(w~lwl) between conditioning 
words w~ and wl is defined as 
l Pc(wllwl) 
-- 1 (9) 
P( l) p(wllw2)p(wl 1 2)P( 2) ' 
the probability that wl is followed by the same context 
words as w~. Then the bigram estimate derived by 
275 
B commitments ... from leaders felt the three point six billion dollars 
S \] commitments ... from leaders fell to three point six billion dollars 
B I followed bv France the US agreed in ltalv ,y France the US agreed in Italy 
S \[ followed by France the US Greece ...Italy 
B \[ he whispers to made a 
S \[ he whispers to an aide 
B the necessity for change exist 
S \[ the necessity for change exists 
B \] without ...additional reserves Centrust would have reported 
S \[ without ...additional reserves of Centrust would have reported 
B \] in the darkness past the church 
S in the darkness passed the church 
Table 2: Speech Recognition Disagreements between Models 
cooccurrence smoothing is given by 
Ps(w21wl) = ~ P(w~lw'l)Pc(w'llwO 
Notice that this formula has the same form as our sim- 
ilarity model (6), except that it uses confusion proba- 
bilities where we use normalized weights. 4 In addition, 
we restrict the summation to sufficiently similar words, 
whereas the cooccurrence smoothing method sums over 
all words in the lexicon. 
The similarity measure (9) is symmetric in the sense 
that Pc(w'lw) and Pc(w\[w') are identical up to fre- Pc(w'l w) _ P(w) 
quency normalization, that is Pc(wlw') - P(w,)" In 
contrast, D(w H w') (7) is asymmetric in that it weighs 
each context in proportion to its probability of occur- 
rence with w, but not with wq In this way, if w and 
w' have comparable frequencies but w' has a sharper 
context distribution than w, then D(w' I\[ w) is greater 
than D(w \[\[ w'). Therefore, in our similarity model 
w' will play a stronger role in estimating w than vice 
versa. These properties motivated our choice of relative 
entropy for similarity measure, because of the intuition 
that words with sharper distributions are more infor- 
mative about other words than words with flat distri- 
butions. 
4This presentation corresponds to model 2-B in Essen 
and Steinbiss (1992). Their presentation follows the equiv- 
alent model l-A, which averages over similar conditioned 
words, with the similarity defined with the preceding word 
as context. In fact, these equivalent models are symmetric 
in their treatment of conditioning and conditioned word, as 
they can both be rewritten as 
Ps(w2lwl) ,~, , , , , P(w2\[Wl)P(Wl = Iw~)P(w21wl) 
They also consider other definitions of confusion probabil- 
ity and smoothed probability estimate, but the one above 
yielded the best experimental results. 
Finally, while we have used our similarity model only 
for missing bigrams in a back-off scheme, Essen and 
Steinbiss (1992) used linear interpolation for all bi- 
grams to combine the cooccurrence smoothing model 
with MLE models of bigrams and unigrams. Notice, 
however, that the choice of back-off or interpolation is 
independent from the similarity model used. 
Further Research 
Our model provides a basic scheme for probabilistic 
similarity-based estimation that can be developed in 
several directions. First, variations of (6) may be tried, 
such as different similarity metrics and different weight- 
ing schemes. Also, some simplification of the current 
model parameters may be possible, especially with re- 
spect to the parameters t and k used to select the near- 
est neighbors of a word. A more substantial variation 
would be to base the model on similarity between con- 
ditioned words rather than on similarity between con- 
ditioning words. 
Other evidence may be combined with the similarity- 
based estimate. For instance, it may be advantageous 
to weigh those estimates by some measure of the re- 
liability of the similarity metric and of the neighbor 
distributions. A second possibility is to take into ac- 
count negative evidence: if Wl is frequent, but w2 never 
followed it, there may be enough statistical evidence 
to put an upper bound on the estimate of P(w21wl). 
This may require an adjustment of the similarity based 
estimate, possibly along the lines of (Rosenfeld and 
Huang, 1992). Third, the similarity-based estimate can 
be used to smooth the naaximum likelihood estimate 
for small nonzero frequencies. If the similarity-based 
estimate is relatively high, a bigram would receive a 
higher estimate than predicted by the uniform discount- 
ing method. 
Finally, the similarity-based model may be applied 
to configurations other than bigrams. For trigrams, 
it is necessary to measure similarity between differ- 
ent conditioning bigrams. This can be done directly, 
276 
by measuring the distance between distributions of the 
form P(w31wl, w2), corresponding to different bigrams 
(wl, w~). Alternatively, and more practically, it would 
be possible to define a similarity measure between bi- 
grams as a function of similarities between correspond- 
ing words in them. Other types of conditional cooccur- 
rence probabilities have been used in probabilistic pars- 
ing (Black et al., 1993). If the configuration in question 
includes only two words, such as P(objectlverb), then it 
is possible to use the model we have used for bigrams. 
If the configuration includes more elements, it is nec- 
essary to adjust the method, along the lines discussed 
above for trigrams. 
Conclusions 
Similarity-based models suggest an appealing approach 
for dealing with data sparseness. Based on corpus 
statistics, they provide analogies between words that of- 
ten agree with our linguistic and domain intuitions. In 
this paper we presented a new model that implements 
the similarity-based approach to provide estimates for 
the conditional probabilities of unseen word cooccur- 
fences. 
Our method combines similarity-based estimates 
with Katz's back-off scheme, which is widely used for 
language modeling in speech recognition. Although the 
scheme was originally proposed as a preferred way of 
implementing the independence assumption, we suggest 
that it is also appropriate for implementing similarity- 
based models, as well as class-based models. It enables 
us to rely on direct maximum likelihood estimates when 
reliable statistics are available, and only otherwise re- 
sort to the estimates of an "indirect" model. 
The improvement we achieved for a bigram model is 
statistically significant, though modest in its overall ef- 
fect because of the small proportion of unseen events. 
While we have used bigrams as an easily-accessible plat- 
form to develop and test the model, more substantial 
improvements might be obtainable for more informa- 
tive configurations. An obvious case is that of tri- 
grams, for which the sparse data problem is much more 
severe. ~ Our longer-term goal, however, is to apply 
similarity techniques to linguistically motivated word 
cooccurrence configurations, as suggested by lexical- 
ized approaches to parsing (Schabes, 1992; Lafferty, 
Sleator, and Temperley, 1992). In configurations like 
verb-object and adjective-noun, there is some evidence 
(Pereira, Tishby, and Lee, 1993) that sharper word 
cooccurrence distributions are obtainable, leading to 
improved predictions by similarity techniques. 
Acknowledgments 
We thank Slava Katz for discussions on the topic of this 
paper, Doug McIlroy for detailed comments, Doug Paul 
5For WSJ trigrams, only 58.6% of test set trigrams 
occur in 40M of words of training (Doug Paul, personal 
communication). 
for help with his baseline back-off model, and Andre 
Ljolje and Michael Riley for providing the word lattices 
for our experiments. 

References 
Black, Ezra, Fred Jelinek, John Lafferty, David M. 
Magerman, David Mercer, and Salim Roukos. 1993. 
Towards history-based grammars: Using richer mod- 
els for probabilistic parsing. In 30th Annual Meet- 
ing of the Association for Computational Linguistics, 
pages 31-37, Columbus, Ohio. Ohio State University, 
Association for Computational Linguistics, Morris- 
town, New Jersey. 
Brown, Peter F., Vincent J. Della Pietra, Peter V. 
deSouza, Jenifer C. Lai, and Robert L. Mercer. 
1992. Class-based n-gram models of natural lan- 
guage. Computational Linguistics, 18(4):467-479. 
Church, Kenneth W. and William A. Gale. 1991. A 
comparison of the enhanced Good-Turing and deleted 
estimation methods for estimating probabilities of 
English bigrams. Computer Speech and Language, 
5:19-54. 
Dagan, Ido, Shaul Markus, and Shaul Markovitch. 
1993. Contextual word similarity and estimation 
from sparse data. In 30th Annual Meeting of the As- 
sociation for Computational Linguistics, pages 164- 
171, Columbus, Ohio. Ohio State University, Asso- 
ciation for Computational Linguistics, Morristown, 
New Jersey. 
Essen, Ute and Volker Steinbiss. 1992. Coocurrence 
smoothing for stochastic language modeling. In Pro- 
ceedings of ICASSP, volume I, pages 161-164. IEEE. 
Good, I.J. 1953. The population frequencies of 
species and the estimation of population parameters. 
Biometrika, 40(3):237-264. 
Grishman, Ralph and John Sterling. 1993. Smoothing 
of automatically generated selectional constraints. In 
Human Language Technology, pages 254-259, San 
Francisco, California. Advanced Research Projects 
Agency, Software and Intelligent Systems Technology 
Office, Morgan Kaufmann. 
Jelinek, Frederick, Robert L. Mercer, and Salim 
Roukos. 1992. Principles of lexical language mod- 
eling for speech recognition. In Sadaoki Furui and 
M. Mohan Sondhi, editors, Advances in Speech Sig- 
nal Processing. Mercer Dekker, Inc., pages 651-699. 
Katz, Slava M. 1987. Estimation of probabilities from 
sparse data for the language model component of a 
speech recognizer. IEEE Transactions on Acoustics, 
Speeech and Signal Processing, 35(3):400-401. 
Lafferty, John, Daniel Sleator, and Davey Temperley. 
1992. Grammatical trigrams: aa probabilistic model 
of link grammar. In Robert Goldman, editor, AAAI 
Fall Symposium on Probabilistic Approaches to Natu- 
ral Language Processing, Cambridge, Massachusetts. 
American Association for Artificial Intelligence. 
Paul, Douglas B. 1991. Experience with a stack 
decoder-based HMM CSR and back-off n-gram lan- 
guage models. In Proceedings of the Speech and Nat- 
ural Language Workshop, pages 284-288, Palo Alto, 
California, February. Defense Advanced Research 
Projects Agency, Information Science and Technol- 
ogy Office, Morgan Kaufmann. 
Pereira, Fernando C. N., Naftali Z. Tishby, and Lil- 
lian Lee. 1993. Distributional clustering of English 
words. In $Oth Annual Meeting of the Association for 
Computational Linguistics, pages 183-190, Co\]urn- 
bus, Ohio. Ohio State University, Association for 
Computational Linguistics, Morristown, New Jersey. 
Rosenfeld, Ronald and Xuedong Huang. 1992. Im- 
provements in stochastic language modeling. In 
DARPA Speech and Natural Language Workshop, 
pages 107-111, Harriman, New York, February. Mor- 
gan Kaufmann, San Mateo, California. 
Sehabes, Yves. 1992. Stochastic lexiealized tree- 
adjoining grammars. In Proceeedings of the 14th 
International Conference on Computational Linguis- 
tics, Nantes, France. 
Sugawara, K., M. Nishimura, K. Toshioka, M. Okoehi, 
and T. Kaneko. 1985. Isolated word recognition 
using hidden Markov models. In Proceedings of 
ICASSP, pages 1-4, Tampa, Florida. IEEE. 
