Poor Estimates of Context 
are Worse than None 
William A. Gale 
Kenneth W. Church 
AT&T Bell Laboratories 
Murray Hill, N.J. 07974 
Abstract 
It is difficult to estimate the probability of a word's 
context because of sparse data problems. If appropriate 
care is taken, we find that it is possible to make useful 
estimates of contextual probabilities that improve perfor- 
mance in a spelling correction application. In contrast, 
less careful estimates are found to be useless. Specifi- 
cally, we will show that the Good-Turing method makes 
the use of contextual information practical for a spelling 
corrector, while attempts to use the maximum likelihood 
estimator (MLE) or expected Idcellhood estimator (ELE) 
fail. Spelling correction was selected as an application 
domain because it is analogous to many important recogni- 
tion applications based on a noisy channel model (such as 
speech recognition), though somewhat simpler and there- 
fore possibly more amenable to detailed statistical 
analysis. 
Background 
Statistical language models were quite popular in the 
Previous work [5] led to the spelling correction pro- 
gram, correct. In the course of that work, we observed 
that human judges were reluctant to decide between alter- 
native candidate corrections given only as much informa- 
tion as was available to the program, the typo and the can- 
didate corrections. We also observed that the judges felt 
much more confident when they could see a line or two of 
context around the typo. This suggests that there is con- 
siderable information in the context. 
However, it is difficult to measure contextual probabili- 
ties. Suppose, for example, that we consider just the pre- 
vious word, I. (A more adequate model would need to 
look at considerably more than just the previous word, but 
even this radically over-simplified model illustrdtes the 
problem.) Then we need to measure the conditional pro- 
babilities, Pr(llw), for all w and 1 in the vocabulary V. 
The problem is that v2 is generally much larger than the 
size of the corpus, N. V is at least lo5, so V' is at least 
10". The largest currently available corpora are about 
10' (100 million words). Thus, we have at least 100 
times more parameters than data. In fact, the problem is 
much worse because the data are not uniformly distri- 
buted. ' 
1950s, but faded 'ather sudded~ when ChOmsk~ 
Correct is reviewed in the next as it provides 
[I] argued quite successfully that statistics should not play 
the framework which this study is done. The set- 
a role in his competence model. With the recent availabil- 
also estimation techniques, Nles for 
iry of large text corpora (of 100 milson words or more), 
bi.ing sources of evidence, and evaluation pro- 
there has been a resurgence of interest in empirical 
cedures. 
methods, especially in recognition applications such as 
speech recognition (e.g., [2] ), but also in many other areas 
of natural language research including machine translation 
Correct 
(PI ). The sheer size of the available corpus data is 
The takes a of misspelled words 
largely responsible for the revival of these techniques. 
(typos) as input (as might be produced by the Unix@ spell 
Nevertheless, there is never enough data, and conse- 
program), and outputs a set of candidate corrections for 
quently, it is important to study the statistical estimation 
each typo, along with a probability. These probability 
issues very carefully. Specifically, we will show that the 
scores distinguish correct from other spelling correction 
Good-Turing (GT) method[4] for estimating bigram proba- 
programs, that output a (long) list of candidiate 
bilities makes the use of contextual information ~ractical 
r 
in our spelling corrector application, while attempts to use 
the maximum likelihood estimator (MLE) or expected 
2. One might think that the sparse data problem could be solved 
by collecting larger corpora, but ironically, the problem only 
likelihood estimator (ELE) fd. 
gets worse as we look at more data. The vocabulary is not 
fixed both N and V grow as we look at more data. The rate 
1. We would like to acknowledge Mark Kernighan's work on 
of growth is still a matter of debate, but the evidence clearly 
correcf, which laid the groundwork for this study of context 
shows that V > 0(fi), and therefore, the sparse data prob- 
modelling. We thank Jill Burstein for help with the judging. 
lems only get worse as we look at more and more data. 
corrections, many of which are often extremely implausi- 
ble. 
Here is some sample output: 
negotations negotiations 
notcampaigning 1 ??? 
The entry ??? indicates that no correction was found. 
progession 
The first stage of correct finds candidate corrections, c, 
that differ from the typo t by a single insertion, deletion, 
substitution or reversal. For example, given the input 
typo, acress, the first stage generates candidate corrections 
in the table below. Thus, the correction actress could be 
transformed by the noisy channel into the typo acress by 
replacing the t with nothing, @, at position 2. (The sym- 
bols @ and # represent nulls in the typo and correction, 
respectively. The transformations are named from the 
point of view of the correction, not the typo.) This unusu- 
ally d~fficult example was selected to illustrate the four 
transformations; most typos have just a few possible 
corrections, and there is rarely more than one plausible 
correction. 
progression (94%) procession 
(4%) profession (2%) 
Typo Correction 
Transformation 
acress actress @ t 2 deletion 
acress cress a # 0 insertion 
acress caress ac ca 0 reversal 
acress access r c 2 substitution 
acres across e o 3 substitution 
acress acres s # 4 insertion 
acress acres s # 5 insertion 
Each candidate correction is scored by the Bayesian 
combination rule Pr(c) ~r(tl c), and then normalized by 
the sum of the scores for all proposed candidates. Care 
must be taken in estimating the prior because of sparse 
data problems. It is possible (and even likely) that a pro- 
posed correction might not have appeared in the training 
set. Some methods of estimating the prior would produce 
undesirable results in this case. For example, the max- 
imum likelihood estimate (MLE) would estimate 
Pr(c) = 0, and consequately, many candidate corrections 
would be rejected just because they did not happen to 
appear in the training set. We will encounter even more 
severe forms of the sparse data problem when we consider 
context. 
frequency r*, where r* is a function of r. Once r* has 
been determined, then p is estimated as p = r*lN*. 
N* = C r* N, where N, is the frequency of frequency 
r, assuring that the estimated probabilities add to one. The 
maximum likelihood estimator (MLE) sets r* = r. The 
MLE estimate is particularly poor when r = 0, since the 
true probabilities are almost certainly greater than 0. 
Following Box and Tiao [6] , we can assume an unin- 
formative prior and reach a posterior distribution for p. 
Using the expectation of this distribution amounts to using 
r* = r + .5. We call this the expected likelihood estimate 
(ELE). This method is often used in practice because it is 
easy to implement, though it does have some serious 
weaknesses. The third method is the minimax (MM) 
method [7] , which sets r* = r + .5fi. Its derivation is 
based on a risk analysis; it minimizes the maximum qua- 
dratic loss. The fourth method is the Good-Turing (GT) 
method [4] , which sets r* = (r+l) N,+,IN,. Unlike 
the MLE, all three other methods assign nonzero probabili- 
ties, even when r = 0. This is probably a desirable pro- 
perty. 
We use the ELE for the probabilities of single words as 
they are frequent enough not to require elaborate treat- 
ment. The channel probabilities, Pr(t 1 c), are computed 
from four confusion matrices: (1) del[x,y], the number of 
times that the characters xy (in the correct word) were 
typed as x in the training set, (2), add[x,y], the number of 
times that x was typed as xy, (3) sub[x,y], the number of 
times that y was typed as x, and (4) rev[x,y], the number 
of times that xy was typed as yx. Probabilities are 
estimated from these matrices by using chars[x,y] and 
chars[x], the number of times that xy and x appeared in 
the training set, respectively, as the total number of obser- 
vations appropriate to some cell of a matrix. The proba- 
bilities are estimated using the Good-Turing method [4] , 
with the cells of the matrices as the types. 
Returning to the acress example, the seven proposed 
transformations are scored by multipling the prior proba- 
bility (which is proportial to 0.5 + column 4 in the table 
below) and the channel probability (column 5) to form a 
raw score (column 3), which are normalized to produce 
probabilities (column 2). The final results is: acres (45%), 
actress (37%), across (18%), access (O%), caress (O%), 
cress (0%). This example is very hard; in fact, the second 
choice is probably right, as can be seen from the context: 
... was called a "stellar and versatile acress whose combi- 
nation of sass and ghmour has defined her .... The pro- 
gram would need a much better prior model in order to 
handle this case. The next section shows how the context 
can be used to take advantage of the fact that that actress 
is considerably more plausible than acres as an antecedent 
We will consider four estimation methods for dealing 
with the sparse data problems. All of these methods 
attempt to estimate a set of probabilities, p, from observed 
frequencies, r. It is assumed that the observed frequencies 
are generated by a binomial process with N total observa- 
tions. The estimation methods generate an adjusted 
for whose. 
c % Raw 
freq(c) Pr(t Ic) 
actress 37% .I57 1343 55./470,000 
cress 0% .OW 0 46./32,000,000 
caress 0% .OW 4 .95/580,000 
access 0% .000 2280 .98/4,700,000 
across 18% .077 8436 93.110,000,000 
acres 21% .092 2879 417./13,000,000 
acres 23% .098 2879 205./6,000,000 
Many typos such as absorbant have just one candidate 
correction, but others such as adusted are more difficult 
and have multiple corrections. (For the purposes of this 
experiment, a typo is defined to be a lowercase word 
rejected by the UnixO spell program.) The table below 
shows examples of typos with candidate corrections sorted 
by their scores. The second column shows the number of 
typos in a seven month sample of the AP newswire, bro- 
ken out by the number of candidate corrections. For 
example, there were 1562 typos with exactly two correc- 
tions proposed by correct. Most typos have relatively few 
candidate corrections. There is a general trend for fewer 
choices, though the 0-choice case is special. 
# Freq Typo Corrections 
0 3937 adrnininistration 
1 6993 absorbant 
2 1562 adusted 
3 639 ambitios 
4 367 compatabili?y 
5 221 afte 
6 157 dialy 
7 94 poice 
8 82 piots 
9 77 spash 
absorbent 
adjusted dusted 
ambitious ambitions ambition 
compatibility compactability 
comparability computability 
after fate aft ate ante 
daily diary dials dial 
dimly dilly 
police price voice poise 
pice ponce poire 
pilots pivots riots plots pits 
pots pints pious 
splash smash slash spasm stash 
swash sash pash spas 
We decided to look at the 2-candidate case in more 
detail in order to test how often the top scoring candidate 
agreed with a panel of three judges. The judges were 
given 564 triples (e.g., absurb, absorb, absurd) and a con- 
cordance line (e.g., ...$? nancial community. "It is absurb 
and probably obscene for any person so engaged ,to...). 
The first word of the triple was a spell reject, followed by 
two candidates in alphabetical order. The judges were 
given a 5-way forced choice. They could circle any one 
of the three words, if they thought that was what the 
author had intended. In addition, they could say "other" 
if they thought that some other word was intended, or "?" 
if they were not sure what was intended. We decided to 
consider only those cases where at least two judges circled 
one of the two candidate corrections, and they agreed with 
each other. This left only 329 triples, mainly because the 
the judges often circled the first word, indicating that they 
thought it had been incorrectly rejected by spell. 
The following table shows that correct agrees with the 
majority of the judges in 87% of the 329 cases of interest. 
In order to help calibrate this result, three inferior methods 
are also evaluated. The channel-only method ignores the 
prior probability. The prior-only method ignores the chan- 
nel probability. Finally, the neither method ignores both 
probabilities and selects the first candidate in all cases. As 
the following table shows, correct is significantly better 
than the three alternative methods. The table also evalu- 
ates the three judges. Judges were only scored on triples 
for which they selected one of the proposed altematives, 
and for which the other two judges agreed on one of the 
proposed altematives. A triple was scored "correct" for 
one judge if that judge agreed with the other two and 
"incorrect" if that judge disagreed with the other two. 
The table shows that the judges significantly out-perform 
correct, indicating that there is room for improvement. 
Method 
correct 
Context 
Discrimination % 
2861329 87 * 1.9 
channel-only 
prior-onl y 
chance 
Judge 1 
Judge 2 
Judge 3 
As previously noted, the judges were extremely reluc- 
tant to cast a vote without more information than correct 
uses, and they were much more comfortable when they 
could see a concordance line or two. This suggests that 
contextual clues might help improve performance. How- 
ever, it is important to estimate the context carefully; we 
have found that poor measures of context are worse than 
none. 
2631329 80 f 2.2 
2471329 75 f 2.4 
1721329 52 f 2.8 
27 11273 99 * 0.5 
27 11275 99 f 0.7 
2711281 96 f 1.1 
In this work, we use a simple n-gram model of context, 
based on just the word to the left of the typo, I, and the 
word to the right of the typo, r. Although n-gram 
methods are much too simple (compared with much more 
sophisticated methods used in A1 and natural language 
processing), even these simple methods illustrate the prob- 
lem that poor estimates of contextual probabilities are 
worse than none. The same estimation issues are probably 
even more critical when the simple n-gram models of con- 
text are replaced by more sophisticated A1 models. 
The variables 1 and r are introduced into the Baysian 
scoring function by changing the formula from 
Pr(c) Pr(t1c) to Pr(c)~r(t,rc) which can be 
approximated as Pr(c)Pr(tlc)Pr(llc)Pr(rlc), under 
appropriate independence assumptions. The issue, then, is 
how to estimate the two new factors: Pr(llc) and Pr(rlc). 
We have four proposals: MLE, ELE, MM and GT. Let us 
consider one way of using the ELE method first. It is 
straightforward and similar to our best method, but hope- 
lessly wrong. 
Pr(llc ) = Pr(lc) 
Pr(c) 
(freq( lc ) + O.5 )/ d l 
(freq(c)+O.5)/d2 
freq(Ic)+0.5 
o¢: freq(c)+0.5 
where dl = N + V2/2 and d2 = N + V/2. We can ignore 
the constant d2/dl and use the proportion to score candi- 
date corrections. Similarly, we use the relation 
Pr(rlc ) o~ (freq(cr)+O.5)/(freq(c)+0.5) for the right 
context. When these estimates for Pr(llc) and Pr(rlc) 
are substituted in the formula, 
Pr(c)Pr(t\[c)Pr(llc)Pr(rlc), we have: 
Pr(t\[c) (freq(lc)+0.5) (freq(cr)+0.5) 
(freq(c)+0.5) E/E 
This new formula produces the desired results for the 
acress example, as illustrated in the following table. (The 
column labeled raw is 106 times the formula E/E, as only 
proportionalities matter.) Note that actress is now pre- 
fered over acres mostly because actress whose is more 
common than acres whose (8 to 0). Presumably the 
difference in frequencies reflects the fact that actress is a 
better antecedent of whose. Note also though, that cress is 
now considered a plausible rival because of errors intro- 
duced by the ELE method. The high score of cress is due 
to the fact that it was not observed in the corpus, and 
therefore the ELE estimates Pr(llc)= Pr(rlc)= 1, 
which is clearly biased high. 
c % Raw freq(c) Pr(tlc ) freq(Ic) freq(cr) 
actress 69% 1.85 1343 555470,000 2 8 
cress 27% .719 0 46./32,000,000 0 0 
caress 3% .091 4 .95/580,000 0 0 
access 0% .000 2280 .98/4,700,000 2 0 
across 0% .011 8436 93./10,000,000 0 20 
acres 0% .003 2879 417./13,000,000 0 0 
acres 0% .003 2879 205./6,000,000 0 0 
We will consider 
The method just 
five methods for estimating Pr(llc). 
described is called the E\[E method, 
286 
because both Pr(lc) and Pr(c) are estimated with the ELE 
method. The M/E method uses the MLE estimate for 
Pr(lc) and the ELE estimate for Pr(c). The E method 
takes Pr(lc) proportional to the ELE estimate 
(freq(lc)+0.5), but the denominator is adjusted so that 
XPr(llc) = 1. The MM method adjusts the minimax 
c 
suggestion in \[7\] in the same way. The G/E method uses 
the enhanced Good-Turing (GT) method for Pr(Ic) and 
the ELE estimate for Pr(c). 
Pr(llc ) _ Pr(Ic) _ freq(lc)+0.5 
P(c) freq(c)+0.5 E/E 
Pr(llc ) = Pr(lc) _ freq(lc) MiE 
P(c) freq(c)+0.5 
Pr(llc ) _ freq(tc)+0.5 
freq(c)+V/2 E 
Pr(llc ) : freq(lc)+O.5 f~req(c) 
freq(c) + O. 5 V~freq(c) 
MM 
gr+l (r+l) 
Nr er(llc) = G/E 
freq(c)+0.5 
The first two methods are useless, as shown by the perfor- 
mance of the context alone: 
Poor Estimates of Context Offer Little or No Help 
chance M/E E/E 
wrong 164.5 15 169 
uninformative 0 136 4 
right 164.5 178 156 
The other three are better. The performance of G/E is 
significantly better than the other four. 
Better Estimates of Context Exist 
E MM G/E 
wrong 62 59 45 
uninformative 0 0 4 
right 267 270 280 
For the Good-Turing estimates, we use an enhanced 
version of the Good-Turing estimator. The basic estimator 
is applied to subgroups of the bigrams. The subgroups 
have similar values of Npxpy, where Px and py are the 
probabilities for the individual words. The grouping vari- 
able is the expected frequency of the bigrarn if the words 
occurred independently. Its use is discussed in detail by 
\[8\] It results in about 1400 significantly different estimates 
for bigrams not seen in the training text, and in about 150 
different estimates for words seen once. 
When combined with the prior and channel, G/E is the 
only one of the five estimation methods that improves sig- 
nificantly 3 on the performance of correct. The following 
table shows correct in column 1, followed by the two 
disastrous measures M/E and E/E, then the two useless 
measures E and MM, and finally the one useful measure 
G/E. 
Context is Useless Unless Carefully Measured 
disastrous useless useful 
wrong 
useless 
right 
% 
±~ 
no 
context 
43 
0 
286 
86.9 
1.9 
+M/E +E/E 
context context 
11 61 
136 0 
182 268 
55.3 81.5 
2.7 2.1 
+E +MM 
context context 
39 40 
0 0 
290 289 
88.1 87.8 
1.8 1.8 
+G/E 
context 
34 
0 
295 
89.7 
1.7 
Conclusions 
We have studied the problem of incorporating context 
into a spelling correction program, and found that the esti- 
mation issues need to be addressed very carefully. Poor 
estimates of context are useless. It is better to ignore con- 
text than to model it badly. Fortunately, there are good 
methods such as G/E that provide a significant improve- 
ment in performance. However, even the G/E method 
does not achieve human performance, indicating that there 
is considerable room for improvement. One way to 
improve performance might be to add more interesting 
sources of knowledge than simple n-gram models, e.g., 
semantic networks, thesaurus relations, morphological 
decomposition, parse trees. Alternatively, one might try 
more sophisticated statistical approaches. For example, we 
have only considered the simplest Baysian combination 
rules. One might try to fit a log linear model, as one of 
many possibilities. In short, it should be taken as a chal- 
lenge to researchers in computational linguistics and statis- 
tics to find ways to improve performance to be more com- 
petitive with human judges. 
Jelinek, F., Mercer, R., and Pietra, P., "A Statistical 
Approach to French/English Translation," m Proceed- 
ings RIA088 Conference on User-oriented Content- 
based Text and Image Handling, RIAO, Cambridge, 
Massachusetts (March 21-24, 1988). 
4. Good, I. J., "The population frequencies of species and 
the estimation of population parameters," Biometrika 
40 pp. 237-264 (1953). 
5. Kernighan, M. D, Church, K. W., and Gale, W. A., "A 
Spelling Corrector Based on Error Frequencies," in 
Proceedings of the Thirteenth International Conference 
on Computational Linguistics, (1990). 
6. Box, G. E. P. and Tiao, G. C., Bayesian Inference in 
Statistical Analysis, Addison-Wesley, Reading, Mas- 
sachusetts (1973). 
7. Steinhaus, H., "The problem of estimation," Annals of 
Mathematical Statistics 28 pp. 633-648 (1957). 
8. Church, K. W. and Gale, W. A., "Enhanced Good- 
Turing and Cat-Cal: Two New Methods for Estimating 
Probabilities of English Bigrams," Computer, Speech, 
and Language, (1991). 
References 
1. Chomsky, N., Syntactic Structures, Mouton & Co, The 
Hague (1957). 
2. Nadas, A., "Estimation of probabilities m the language 
model of the IBM speech recognition system," IEEE 
Transactions on Acoustics, Speech, and Signal Process- 
ing ASSP-32 pp. 859-861 (1984). 
3. Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., 
The GT method changes the program's preference in 25 of 
the 329 cases; 17 of the changes are right and 8 of them are 
wrong. The probability of 17 or more right out of 25, assum- 
ing equal probability of two alternatives, is .04. Thus, we 
conclude that the improvement is significant. 
287 
