Similarity-Based Methods For Word Sense Disambiguation

Ido Dagan
Dept. of Mathematics and Computer Science
Bar Ilan University
Ramat Gan 52900, Israel
dagan@macs.biu.ac.il

Lillian Lee
Div. of Engineering and Applied Sciences
Harvard University
Cambridge, MA 01238, USA
llee@eecs.harvard.edu

Fernando Pereira
AT&T Labs - Research
600 Mountain Ave.
Murray Hill, NJ 07974, USA
pereira@research.att.com

Abstract 
We compare four similarity-based esti- 
mation methods against back-off and 
maximum-likelihood estimation meth- 
ods on a pseudo-word sense disam- 
biguation task in which we controlled 
for both unigram and bigram fre- 
quency. The similarity-based meth- 
ods perform up to 40% better on this 
particular task. We also conclude
that events that occur only once in
the training set have a major impact on
similarity-based estimates.
1 Introduction 
The problem of data sparseness affects all sta- 
tistical methods for natural language process- 
ing. Even large training sets tend to misrep- 
resent low-probability events, since rare events 
may not appear in the training corpus at all. 
We concentrate here on the problem of es- 
timating the probability of unseen word pairs, 
that is, pairs that do not occur in the train- 
ing set. Katz's back-off scheme (Katz, 1987), 
widely used in bigram language modeling, esti- 
mates the probability of an unseen bigram by 
utilizing unigram estimates. This has the un- 
desirable result of assigning unseen bigrams the 
same probability if they are made up of uni- 
grams of the same frequency. 
Class-based methods (Brown et al., 1992; 
Pereira, Tishby, and Lee, 1993; Resnik, 1992) 
cluster words into classes of similar words, so 
that one can base the estimate of a word pair's 
probability on the averaged cooccurrence prob- 
ability of the classes to which the two words be- 
long. However, a word is therefore modeled by 
the average behavior of many words, which may 
cause the given word's idiosyncrasies to be ig- 
nored. For instance, the word "red" might well 
act like a generic color word in most cases, but 
it has distinctive cooccurrence patterns with re- 
spect to words like "apple," "banana," and so 
on. 
We therefore consider similarity-based esti- 
mation schemes that do not require building 
general word classes. Instead, estimates for 
the most similar words to a word w are com- 
bined; the evidence provided by word w' is 
weighted by a function of its similarity to w. 
Dagan, Markus, and Markovitch (1993) pro- 
pose such a scheme for predicting which un- 
seen cooccurrences are more likely than others. 
However, their scheme does not assign probabil- 
ities. In what follows, we focus on probabilistic 
similarity-based estimation methods. 
We compared several such methods, including that of Dagan, Pereira, and Lee (1994) and the cooccurrence smoothing method of Essen and Steinbiss (1992), against classical estimation methods, including that of Katz, in a decision task involving unseen pairs of direct objects and verbs, with unigram frequency eliminated as a factor. We found that all the similarity-based schemes performed almost 40% better than back-off, which is expected to yield about 50% accuracy in our experimental setting. Furthermore, a scheme based on the total divergence of empirical distributions to their average¹ yielded statistically significant improvement in error rate over cooccurrence smoothing.
We also investigated the effect of removing 
extremely low-frequency events from the train- 
ing set. We found that, in contrast to back- 
off smoothing, where such events are often dis- 
carded from training with little discernible ef- 
fect, similarity-based smoothing methods suf- 
fer noticeable performance degradation when 
singletons (events that occur exactly once) are 
omitted. 
2 Distributional Similarity Models 
We wish to model conditional probability distributions arising from the cooccurrence of linguistic objects, typically words, in certain configurations. We thus consider pairs $(w_1, w_2) \in V_1 \times V_2$ for appropriate sets $V_1$ and $V_2$, not necessarily disjoint. In what follows, we use subscript $i$ for the $i$th element of a pair; thus $P(w_2 \mid w_1)$ is the conditional probability (or rather, some empirical estimate, the true probability being unknown) that a pair has second element $w_2$ given that its first element is $w_1$; and $P(w_1 \mid w_2)$ denotes the probability estimate, according to the base language model, that $w_1$ is the first word of a pair given that the second word is $w_2$. $P(w)$ denotes the base estimate for the unigram probability of word $w$.
A similarity-based language model consists 
of three parts: a scheme for deciding which 
word pairs require a similarity-based estimate, 
a method for combining information from simi- 
lar words, and, of course, a function measuring 
the similarity between words. We give the de- 
tails of each of these three parts in the following 
three sections. We will only be concerned with 
similarity between words in V1. 
¹ To the best of our knowledge, this is the first use of this particular distribution dissimilarity function in statistical language processing. The function itself is implicit in earlier work on distributional clustering (Pereira, Tishby, and Lee, 1993), has been used by Tishby (p.c.) in other distributional similarity work, and, as suggested by Yoav Freund (p.c.), it is related to results of Hoeffding (1965) on the probability that a given sample was drawn from a given joint distribution.
2.1 Discounting and Redistribution 
Data sparseness makes the maximum likelihood estimate (MLE) for word pair probabilities unreliable. The MLE for the probability of a word pair $(w_1, w_2)$, conditional on the appearance of word $w_1$, is simply
$$P_{ML}(w_2 \mid w_1) = \frac{c(w_1, w_2)}{c(w_1)} \tag{1}$$
where $c(w_1, w_2)$ is the frequency of $(w_1, w_2)$ in the training corpus and $c(w_1)$ is the frequency of $w_1$. However, $P_{ML}$ is zero for any unseen word pair, which leads to extremely inaccurate estimates for word pair probabilities.
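To make the zero-probability problem concrete, the following minimal Python sketch computes the MLE of equation (1) from a list of $(w_1, w_2)$ pairs. The function name `mle_conditional` and the toy noun-verb pairs are illustrative assumptions, not the data or code used in our experiments.

```python
from collections import Counter

def mle_conditional(pairs):
    """Return p_ml(w2, w1) = c(w1, w2) / c(w1), estimated from (w1, w2) pairs."""
    pair_counts = Counter(pairs)
    w1_counts = Counter(w1 for w1, _ in pairs)

    def p_ml(w2, w1):
        if w1_counts[w1] == 0:
            raise KeyError(f"{w1!r} never occurs as a first element")
        # Counter returns 0 for unseen pairs, so unseen pairs get probability 0.
        return pair_counts[(w1, w2)] / w1_counts[w1]

    return p_ml

# Hypothetical toy corpus of (noun, verb) pairs.
toy_pairs = [("plan", "make"), ("plan", "make"), ("plan", "change"), ("action", "take")]
p_ml = mle_conditional(toy_pairs)
seen = p_ml("make", "plan")     # c(plan, make) / c(plan) = 2 / 3
unseen = p_ml("take", "plan")   # the pair never occurs, so the estimate is 0
```

Any decision procedure that compares two unseen pairs under this estimate sees two zeros and cannot prefer either.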
Previous proposals for remedying the above problem (Good, 1953; Jelinek, Mercer, and Roukos, 1992; Katz, 1987; Church and Gale, 1991) adjust the MLE so that the total probability of seen word pairs is less than one, leaving some probability mass to be redistributed among the unseen pairs. In general, the adjustment involves either interpolation, in which the MLE is used in linear combination with an estimator guaranteed to be nonzero for unseen word pairs, or discounting, in which a reduced MLE is used for seen word pairs, with the probability mass left over from this reduction used to model unseen pairs.
The discounting approach is the one adopted by Katz (1987):
$$\hat{P}(w_2 \mid w_1) = \begin{cases} P_d(w_2 \mid w_1) & \text{if } c(w_1, w_2) > 0 \\ \alpha(w_1) P_r(w_2 \mid w_1) & \text{otherwise} \end{cases} \tag{2}$$
where $P_d$ represents the Good-Turing discounted estimate (Katz, 1987) for seen word pairs, and $P_r$ denotes the model for probability redistribution among the unseen word pairs. $\alpha(w_1)$ is a normalization factor.
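The role of the normalization factor in equation (2) can be sketched in Python as follows. This is only an illustration under stated assumptions: a constant (absolute) discount stands in for the Good-Turing estimate, the names `backoff_model`, `discount`, and `vocab2` are hypothetical, and the redistribution model is the unigram probability, which recovers Katz's original choice.

```python
from collections import Counter

def backoff_model(pairs, vocab2, p_r, discount=0.5):
    """Sketch of equation (2); a constant discount replaces Good-Turing (assumption)."""
    pair_counts = Counter(pairs)
    w1_counts = Counter(w1 for w1, _ in pairs)

    def p_d(w2, w1):
        # Stand-in discounted estimate for seen pairs, NOT the Good-Turing formula.
        return (pair_counts[(w1, w2)] - discount) / w1_counts[w1]

    def alpha(w1):
        seen = [w2 for w2 in vocab2 if pair_counts[(w1, w2)] > 0]
        unseen = [w2 for w2 in vocab2 if pair_counts[(w1, w2)] == 0]
        left_over = 1.0 - sum(p_d(w2, w1) for w2 in seen)
        # Scale p_r so that the unseen pairs share exactly the left-over mass.
        return left_over / sum(p_r(w2, w1) for w2 in unseen)

    def p_hat(w2, w1):
        if pair_counts[(w1, w2)] > 0:
            return p_d(w2, w1)
        return alpha(w1) * p_r(w2, w1)

    return p_hat

# Hypothetical toy data; p_r is the unigram estimate P(w2).
toy_pairs = [("plan", "make"), ("plan", "make"), ("plan", "change"), ("action", "take")]
verbs = ["make", "change", "take"]
verb_counts = Counter(w2 for _, w2 in toy_pairs)
unigram = lambda w2, w1: verb_counts[w2] / len(toy_pairs)
p_hat = backoff_model(toy_pairs, verbs, unigram)
total = sum(p_hat(v, "plan") for v in verbs)  # sums to 1 up to rounding
```

The sketch assumes every $w_1$ has at least one unseen $w_2$ with nonzero $P_r$; otherwise the division in `alpha` is undefined.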
Following Dagan, Pereira, and Lee (1994), we modify Katz's formulation by writing $P_r(w_2 \mid w_1)$ instead of $P(w_2)$, enabling us to use similarity-based estimates for unseen word pairs instead of basing the estimate for the pair on unigram frequency $P(w_2)$. Observe that similarity estimates are used for unseen word pairs only.
We next investigate estimates for $P_r(w_2 \mid w_1)$ derived by averaging information from words that are distributionally similar to $w_1$.
2.2 Combining Evidence 
Similarity-based models assume that if word $w_1'$ is "similar" to word $w_1$, then $w_1'$ can yield information about the probability of unseen word pairs involving $w_1$. We use a weighted average of the evidence provided by similar words, where the weight given to a particular word $w_1'$ depends on its similarity to $w_1$.
More precisely, let $W(w_1, w_1')$ denote an increasing function of the similarity between $w_1$ and $w_1'$, and let $S(w_1)$ denote the set of words most similar to $w_1$. Then the general form of similarity model we consider is a $W$-weighted linear combination of predictions of similar words:
$$P_{SIM}(w_2 \mid w_1) = \sum_{w_1' \in S(w_1)} \frac{W(w_1, w_1')}{\mathrm{norm}(w_1)} P(w_2 \mid w_1') \tag{3}$$
where $\mathrm{norm}(w_1) = \sum_{w_1' \in S(w_1)} W(w_1, w_1')$ is a normalization factor. According to this formula, $w_2$ is more likely to occur with $w_1$ if it tends to occur with the words that are most similar to $w_1$.
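The weighted average of equation (3) can be written in a few lines of Python. This is a minimal sketch: the function `p_sim`, the toy conditional distributions, and the hand-picked weights are all hypothetical and serve only to show how the neighbors' predictions are combined.

```python
def p_sim(w2, w1, neighbors, weight, p_cond):
    """Equation (3): W-weighted average of P(w2 | w1') over w1' in S(w1)."""
    weights = {n: weight(w1, n) for n in neighbors}
    norm = sum(weights.values())  # the normalization factor in equation (3)
    return sum(weights[n] * p_cond(w2, n) for n in neighbors) / norm

# Hypothetical base-model distributions P(. | w1') for two neighbors of "action".
cond = {
    "decision": {"make": 0.5, "take": 0.5},
    "step": {"take": 1.0},
}
# Hypothetical similarity weights W("action", w1').
toy_weight = {("action", "decision"): 3.0, ("action", "step"): 1.0}

estimate = p_sim(
    "take", "action",
    neighbors=["decision", "step"],
    weight=lambda a, b: toy_weight[(a, b)],
    p_cond=lambda w2, w1: cond[w1].get(w2, 0.0),
)
print(estimate)  # (3.0 * 0.5 + 1.0 * 1.0) / 4.0 = 0.625
```

Because `weight` and `neighbors` are passed in as parameters, the same function covers every similarity measure and every choice of $S(w_1)$.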
Considerable latitude is allowed in defining the set $S(w_1)$, as is evidenced by previous work that can be put in the above form. Essen and Steinbiss (1992) and Karov and Edelman (1996) (implicitly) set $S(w_1) = V_1$. However, it may be desirable to restrict $S(w_1)$ in some fashion, especially if $V_1$ is large. For instance, Dagan, Pereira, and Lee (1994) use the closest $k$ or fewer words $w_1'$ such that the dissimilarity between $w_1$ and $w_1'$ is less than a threshold value $t$; $k$ and $t$ are tuned experimentally.
Now, we could directly replace $P_r(w_2 \mid w_1)$ in the back-off equation (2) with $P_{SIM}(w_2 \mid w_1)$. However, other variations are possible, such as interpolating with the unigram probability $P(w_2)$:
$$P_r(w_2 \mid w_1) = \gamma P(w_2) + (1 - \gamma) P_{SIM}(w_2 \mid w_1),$$
where $\gamma$ is determined experimentally (Dagan, Pereira, and Lee, 1994). This represents, in effect, a linear combination of the similarity estimate and the back-off estimate: if $\gamma = 1$, then we have exactly Katz's back-off scheme. As we focus in this paper on alternatives for $P_{SIM}$, we will not consider this approach here; that is, for the rest of this paper, $P_r(w_2 \mid w_1) = P_{SIM}(w_2 \mid w_1)$.
2.3 Measures of Similarity 
We now consider several word similarity func- 
tions that can be derived automatically from 
the statistics of a training corpus, as opposed 
to functions derived from manually-constructed 
word classes (Resnik, 1992). All the similarity functions we describe below depend just on the base language model $P(\cdot \mid \cdot)$, not the discounted model $\hat{P}(\cdot \mid \cdot)$ from Section 2.1 above.
2.3.1 KL divergence 
Kullback-Leibler (KL) divergence is a standard information-theoretic measure of the dissimilarity between two probability mass functions (Cover and Thomas, 1991). We can apply it to the conditional distribution $P(\cdot \mid w_1)$ induced by $w_1$ on words in $V_2$:
$$D(w_1 \,\|\, w_1') = \sum_{w_2} P(w_2 \mid w_1) \log \frac{P(w_2 \mid w_1)}{P(w_2 \mid w_1')} \tag{4}$$
For $D(w_1 \,\|\, w_1')$ to be defined it must be the case that $P(w_2 \mid w_1') > 0$ whenever $P(w_2 \mid w_1) > 0$. Unfortunately, this will not in general be the case for MLEs based on samples, so we would need smoothed estimates of $P(w_2 \mid w_1')$ that redistribute some probability mass to zero-frequency events. However, using smoothed estimates for $P(w_2 \mid w_1)$ as well requires a sum over all $w_2 \in V_2$, which is expensive for the large vocabularies under consideration. Given the smoothed denominator distribution, we set
$$W(w_1, w_1') = 10^{-\beta D(w_1 \,\|\, w_1')},$$
where $\beta$ is a free parameter.
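The following Python sketch shows the KL divergence of equation (4) and the associated weight on distributions stored as dictionaries. The function names are hypothetical, the natural logarithm is an assumption (the base is not fixed above), and $\beta$ is assumed strictly positive so that an infinite divergence maps to weight zero.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x) / q(x)); infinite if q(x) = 0 where p(x) > 0."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # the unsmoothed case in which D is not finite
        total += px * math.log(px / qx)
    return total

def kl_weight(p, q, beta):
    """W = 10 ** (-beta * D(p || q)), assuming beta > 0."""
    return 10.0 ** (-beta * kl_divergence(p, q))

# Hypothetical distributions over verbs.
p = {"make": 0.5, "take": 0.5}
q_smoothed = {"make": 0.25, "take": 0.75}
q_sparse = {"make": 1.0}  # assigns zero probability to "take"
finite = kl_divergence(p, q_smoothed)
infinite = kl_divergence(p, q_sparse)
```

The second call illustrates why unsmoothed estimates are unusable here: a single zero in the denominator distribution makes the divergence infinite and the weight zero.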
2.3.2 Total divergence to the average 
A related measure is based on the total KL divergence to the average of the two distributions:
$$A(w_1, w_1') = D\left(w_1 \,\Big\|\, \frac{w_1 + w_1'}{2}\right) + D\left(w_1' \,\Big\|\, \frac{w_1 + w_1'}{2}\right)$$
where $(w_1 + w_1')/2$ is shorthand for the distribution $\frac{1}{2}\left(P(\cdot \mid w_1) + P(\cdot \mid w_1')\right)$. Since $D(\cdot \,\|\, \cdot) \ge 0$, $A(w_1, w_1') \ge 0$. Furthermore, letting $p(w_2) = P(w_2 \mid w_1)$, $p'(w_2) = P(w_2 \mid w_1')$ and $C = \{w_2 : p(w_2) > 0, p'(w_2) > 0\}$, it is straightforward to show by grouping terms appropriately that
$$A(w_1, w_1') = \sum_{w_2 \in C} \left\{ H(p(w_2) + p'(w_2)) - H(p(w_2)) - H(p'(w_2)) \right\} + 2 \log 2,$$
where $H(x) = -x \log x$. Therefore, $A(w_1, w_1')$ is bounded, ranging between 0 and $2 \log 2$, and smoothed estimates are not required because probability ratios are not involved. In addition, the calculation of $A(w_1, w_1')$ requires summing only over those $w_2$ for which $P(w_2 \mid w_1)$ and $P(w_2 \mid w_1')$ are both non-zero, which, for sparse data, makes the computation quite fast.
As in the KL divergence case, we set $W(w_1, w_1')$ to be $10^{-\beta A(w_1, w_1')}$.
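As a check on the grouping of terms, the sketch below computes the total divergence to the average in two ways: directly from its definition as a sum of two KL divergences, and from the form that sums only over the common support $C$. All function names are hypothetical, the natural logarithm is assumed, and both input distributions are assumed to sum to one.

```python
import math

def _kl_to_average(p, avg):
    # avg[x] > 0 wherever p[x] > 0, so every term is finite.
    return sum(px * math.log(px / avg[x]) for x, px in p.items() if px > 0)

def total_divergence(p, q):
    """A(p, q) from the definition: D(p || avg) + D(q || avg)."""
    keys = set(p) | set(q)
    avg = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return _kl_to_average(p, avg) + _kl_to_average(q, avg)

def total_divergence_common(p, q):
    """A(p, q) summing only over words with nonzero probability under both p and q."""
    h = lambda x: -x * math.log(x)
    common = [x for x in p if p[x] > 0 and q.get(x, 0.0) > 0]
    return sum(h(p[x] + q[x]) - h(p[x]) - h(q[x]) for x in common) + 2 * math.log(2)

def a_weight(p, q, beta):
    """W = 10 ** (-beta * A(p, q))."""
    return 10.0 ** (-beta * total_divergence_common(p, q))

# Hypothetical distributions over verbs with partially overlapping support.
p = {"make": 0.75, "change": 0.25}
q = {"make": 0.5, "take": 0.5}
direct = total_divergence(p, q)
grouped = total_divergence_common(p, q)
```

For distributions with disjoint support the sum over $C$ is empty and both functions return $2 \log 2$, the upper bound.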
2.3.3 L1 norm
The $L_1$ norm is defined as
$$L(w_1, w_1') = \sum_{w_2} \left| P(w_2 \mid w_1) - P(w_2 \mid w_1') \right| \tag{6}$$
By grouping terms as before, we can express $L(w_1, w_1')$ in a form depending only on the "common" $w_2$:
$$L(w_1, w_1') = 2 - \sum_{w_2 \in C} p(w_2) - \sum_{w_2 \in C} p'(w_2) + \sum_{w_2 \in C} \left| p(w_2) - p'(w_2) \right|.$$
This last form makes it clear that $0 \le L(w_1, w_1') \le 2$, with $L(w_1, w_1') = 2$ if and only if there are no words $w_2$ such that both $P(w_2 \mid w_1)$ and $P(w_2 \mid w_1')$ are strictly positive.
Since we require a weighting scheme that is decreasing in $L$, we set
$$W(w_1, w_1') = \left(2 - L(w_1, w_1')\right)^{\beta}$$
with $\beta$ again free.
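The two forms of the $L_1$ norm and the corresponding weight can be sketched in Python as follows. The function names are hypothetical, and the common-support form assumes that each distribution sums to one. The `max(0.0, ...)` guard is an implementation detail added here so that rounding error cannot produce a negative base for the exponent.

```python
def l1_distance(p, q):
    """L(p, q) from the definition: sum of |p(x) - q(x)| over the union of supports."""
    keys = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

def l1_distance_common(p, q):
    """Same quantity using only the common support (assumes p and q each sum to 1)."""
    common = [x for x in p if p[x] > 0 and q.get(x, 0.0) > 0]
    return (2.0
            - sum(p[x] for x in common)
            - sum(q[x] for x in common)
            + sum(abs(p[x] - q[x]) for x in common))

def l1_weight(p, q, beta):
    """W = (2 - L) ** beta, clamped at zero against rounding error."""
    return max(0.0, 2.0 - l1_distance(p, q)) ** beta

# Hypothetical distributions over verbs.
p = {"make": 0.75, "change": 0.25}
q = {"make": 0.5, "take": 0.5}
direct = l1_distance(p, q)          # 0.25 + 0.25 + 0.5
grouped = l1_distance_common(p, q)  # 2 - 0.75 - 0.5 + 0.25
```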
2.3.4 Confusion probability 
Essen and Steinbiss (1992) introduced confusion probability², which estimates the probability that word $w_1'$ can be substituted for word $w_1$:
$$P_C(w_1' \mid w_1) = W(w_1, w_1') = \sum_{w_2} \frac{P(w_1 \mid w_2) P(w_1' \mid w_2) P(w_2)}{P(w_1)}$$
Unlike with the measures described above, $w_1$ may not necessarily be the "closest" word to itself, that is, there may exist a word $w_1'$ such that $P_C(w_1' \mid w_1) > P_C(w_1 \mid w_1)$.
The confusion probability can be computed from empirical estimates provided all unigram estimates are nonzero (as we assume throughout). In fact, the use of smoothed estimates like those of Katz's back-off scheme is problematic, because those estimates typically do not preserve consistency with respect to marginal estimates and Bayes's rule. However, using consistent estimates (such as the MLE), we can rewrite $P_C$ as follows:
$$P_C(w_1' \mid w_1) = \sum_{w_2} \frac{P(w_2 \mid w_1)}{P(w_2)} \cdot P(w_2 \mid w_1') P(w_1').$$
This form reveals another important difference between the confusion probability and the functions $D$, $A$, and $L$ described in the previous sections. Those functions rate $w_1'$ as similar to $w_1$ if, roughly, $P(w_2 \mid w_1')$ is high when $P(w_2 \mid w_1)$ is. $P_C(w_1' \mid w_1)$, however, is greater for those $w_1'$ for which $P(w_1', w_2)$ is large when $P(w_2 \mid w_1)/P(w_2)$ is. When the ratio $P(w_2 \mid w_1)/P(w_2)$ is large, we may think of $w_2$ as being exceptional, since if $w_2$ is infrequent, we do not expect $P(w_2 \mid w_1)$ to be large.
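The equivalence of the two forms of the confusion probability under MLE estimates can be checked numerically. The sketch below is illustrative only: the function names and the toy noun-verb pairs are hypothetical, and all probabilities are maximum likelihood estimates from the same counts, which is what makes them consistent with Bayes's rule.

```python
from collections import Counter

def confusion_models(pairs):
    """Return both forms of P_C(w1' | w1), using MLE estimates from (w1, w2) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    c1 = Counter(w1 for w1, _ in pairs)
    c2 = Counter(w2 for _, w2 in pairs)

    def first_form(w1_prime, w1):
        # sum over w2 of P(w1 | w2) P(w1' | w2) P(w2), divided by P(w1)
        total = sum(
            (joint[(w1, w2)] / c2[w2]) * (joint[(w1_prime, w2)] / c2[w2]) * (c2[w2] / n)
            for w2 in c2
        )
        return total / (c1[w1] / n)

    def second_form(w1_prime, w1):
        # sum over w2 of [P(w2 | w1) / P(w2)] * P(w2 | w1') * P(w1')
        return sum(
            (joint[(w1, w2)] / c1[w1]) / (c2[w2] / n)
            * (joint[(w1_prime, w2)] / c1[w1_prime]) * (c1[w1_prime] / n)
            for w2 in c2
        )

    return first_form, second_form

# Hypothetical toy corpus of (noun, verb) pairs.
toy_pairs = [
    ("plan", "make"), ("plan", "make"), ("plan", "change"),
    ("decision", "make"), ("decision", "take"), ("action", "take"),
]
first_form, second_form = confusion_models(toy_pairs)
nouns = ["plan", "decision", "action"]
row_sum = sum(first_form(w, "plan") for w in nouns)  # P_C(. | plan) sums to 1
```

On this toy data `first_form("decision", "plan")` and `second_form("decision", "plan")` agree, and the values $P_C(\cdot \mid w_1)$ sum to one over $w_1'$, as expected of a probability.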
2.3.5 Summary 
Several features of the measures of similarity listed above are summarized in Table 1. "Base LM constraints" are conditions that must be satisfied by the probability estimates of the base language model. The last column indicates whether the weight $W(w_1, w_1')$ associated with each similarity function depends on a parameter that needs to be tuned experimentally.
² Actually, they present two alternative definitions. We use their model 2-B, which they found yielded the best experimental results.
3 Experimental Results 
We evaluated the similarity measures listed 
above on a word sense disambiguation task, in 
which each method is presented with a noun and 
two verbs, and decides which verb is more likely 
to have the noun as a direct object. Thus, we do 
not measure the absolute quality of the assign- 
ment of probabilities, as would be the case in 
a perplexity evaluation, but rather the relative 
quality. We are therefore able to ignore constant 
factors, and so we neither normalize the similar- 
ity measures nor calculate the denominator in 
equation (3). 
3.1 Task: Pseudo-word Sense 
Disambiguation 
In the usual word sense disambiguation prob- 
lem, the method to be tested is presented with 
an ambiguous word in some context, and is 
asked to identify the correct sense of the word 
from the context. For example, a test instance 
might be the sentence fragment "robbed the 
bank"; the disambiguation method must decide 
whether "bank" refers to a river bank, a savings 
bank, or perhaps some other alternative. 
While sense disambiguation is clearly an im- 
portant task, it presents numerous experimen- 
tal difficulties. First, the very notion of "sense" 
is not clearly defined; for instance, dictionaries 
may provide sense distinctions that are too fine 
or too coarse for the data at hand. Also, one 
needs to have training data for which the cor- 
rect senses have been assigned, which can re- 
quire considerable human effort. 
To circumvent these and other difficulties, we set up a pseudo-word disambiguation experiment (Schütze, 1992; Gale, Church, and Yarowsky, 1992) the general format of which is as follows. We first construct a list of pseudo-words, each of which is the combination of two different words in $V_2$. Each word in $V_2$ contributes to exactly one pseudo-word. Then, we replace each $w_2$ in the test set with its corresponding pseudo-word. For example, if we choose to create a pseudo-word out of the words "make" and "take", we would change the test data like this:
    make plans  ⇒  {make, take} plans
    take action ⇒  {make, take} action
The method being tested must choose between the two words that make up the pseudo-word.
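The construction just described can be sketched in Python. The function names and the representation of a test instance as a (noun, pseudo-word, correct verb) triple are illustrative assumptions rather than a description of the tools we actually used.

```python
def build_pseudo_words(verb_pairs):
    """Map each verb to the pseudo-word (a 2-tuple of verbs) it belongs to."""
    mapping = {}
    for a, b in verb_pairs:
        # Each verb must contribute to exactly one pseudo-word of two different verbs.
        if a == b or a in mapping or b in mapping:
            raise ValueError(f"invalid pseudo-word: {a!r}, {b!r}")
        mapping[a] = mapping[b] = (a, b)
    return mapping

def make_test_instances(test_pairs, mapping):
    """Turn (noun, verb) test pairs into (noun, pseudo-word, correct verb) triples."""
    return [(noun, mapping[verb], verb) for noun, verb in test_pairs]

mapping = build_pseudo_words([("make", "take")])
instances = make_test_instances([("plans", "make"), ("action", "take")], mapping)
print(instances)
# [('plans', ('make', 'take'), 'make'), ('action', ('make', 'take'), 'take')]
```

Keeping the original verb in each triple gives the correct answer for free, which is what removes the need for hand-labeled senses.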
3.2 Data 
We used a statistical part-of-speech tagger 
(Church, 1988) and pattern matching and con- 
cordancing tools (due to David Yarowsky) to 
identify transitive main verbs and head nouns 
of the corresponding direct objects in 44 mil- 
lion words of 1988 Associated Press newswire. 
We selected the noun-verb pairs for the 1000 
most frequent nouns in the corpus. These pairs 
are undoubtedly somewhat noisy given the er- 
rors inherent in the part-of-speech tagging and 
pattern matching. 
We used 80%, or 587833, of the pairs so derived, for building base bigram language models, reserving 20% for testing purposes. As some, but not all, of the similarity measures require smoothed language models, we calculated both a Katz back-off language model ($P = \hat{P}$ (equation (2)), with $P_r(w_2 \mid w_1) = P(w_2)$), and a maximum-likelihood model ($P = P_{ML}$). Furthermore, we wished to investigate Katz's claim that one can delete singletons, word pairs that occur only once, from the training set without affecting model performance (Katz, 1987); our training set contained 82407 singletons. We therefore built four base language models, summarized in Table 2.

          with singletons    no singletons
          (587833 pairs)     (505426 pairs)
MLE       MLE-1              MLE-o1
Katz      BO-1               BO-o1
Table 2: Base Language Models

name    range                                         base LM constraints                                          tune?
$D$     $[0, \infty]$                                 $P(w_2 \mid w_1') \ne 0$ if $P(w_2 \mid w_1) \ne 0$          yes
$A$     $[0, 2 \log 2]$                               none                                                         yes
$L$     $[0, 2]$                                      none                                                         yes
$P_C$   $[0, \frac{1}{2} \max_{w_2} P(w_2)]$          Bayes consistency                                            no
Table 1: Summary of similarity function properties

Since we wished to test the effectiveness of using similarity for unseen word cooccurrences, we removed from the test set any verb-object pairs that occurred in the training set; this resulted in 17152 unseen pairs (some occurred multiple times). The unseen pairs were further divided into five equal-sized parts, $T_1$ through $T_5$, which formed the basis for fivefold cross-validation: in each of five runs, one of the $T_i$ was used as a performance test set, with the other 4 sets combined into one set used for tuning parameters (if necessary) via a simple grid search. Finally, test pseudo-words were created from pairs of verbs with similar frequencies, so as to control for word frequency in the decision task. We use error rate as our performance metric, defined as
$$\frac{1}{N}\left(\#\text{ of incorrect choices} + \frac{\#\text{ of ties}}{2}\right)$$
where $N$ was the size of the test corpus. A tie occurs when the two words making up a pseudo-word are deemed equally likely.
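The error rate with ties counted as half an error can be computed as in the sketch below. The instance format (noun, pair of candidate verbs, correct verb) and the `score(verb, noun)` callback are hypothetical conventions chosen for illustration.

```python
def error_rate(instances, score):
    """(# incorrect choices + (# ties) / 2) / N over pseudo-word test instances."""
    incorrect = 0
    ties = 0
    for noun, (verb_a, verb_b), correct in instances:
        score_a = score(verb_a, noun)
        score_b = score(verb_b, noun)
        if score_a == score_b:
            ties += 1  # the two verbs are deemed equally likely
        elif (verb_a if score_a > score_b else verb_b) != correct:
            incorrect += 1
    return (incorrect + ties / 2) / len(instances)

# Hypothetical test instances.
instances = [
    ("plans", ("make", "take"), "make"),
    ("action", ("make", "take"), "take"),
]
always_zero = lambda verb, noun: 0.0
print(error_rate(instances, always_zero))  # 0.5: every instance is a tie
```

A scorer that assigns zero to every unseen pair, as the MLE does, is tied on every instance and therefore has an error rate of exactly 0.5.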
3.3 Baseline Experiments 
The performances of the four base language models are shown in Table 3. MLE-1 and MLE-o1 both have error rates of exactly .5 because the test sets consist of unseen bigrams, which are all assigned a probability of 0 by maximum-likelihood estimates, and thus are all ties for this method. The back-off models BO-1 and BO-o1 also perform similarly.

          T1       T2       T3       T4       T5
MLE-1     .5       .5       .5       .5       .5
MLE-o1    .5       .5       .5       .5       .5
BO-1      0.517    0.520    0.512    0.513    0.516
BO-o1     0.517    0.520    0.512    0.513    0.516
Table 3: Base Language Model Error Rates

Since the back-off models consistently performed worse than the MLE models, we chose to use only the MLE models in our subsequent experiments. Therefore, we only ran comparisons between the measures that could utilize unsmoothed data, namely, the $L_1$ norm, $L(w_1, w_1')$; the total divergence to the average, $A(w_1, w_1')$; and the confusion probability, $P_C(w_1' \mid w_1)$.³ In the full paper, we give detailed examples showing the different neighborhoods induced by the different measures, which we omit here for reasons of space.
3.4 Performance of Similarity-Based 
Methods 
Figure 1 shows the results on the five test sets, using MLE-1 as the base language model. The parameter $\beta$ was always set to the optimal value for the corresponding training set. RAND, which is shown for comparison purposes, simply chooses the weights $W(w_1, w_1')$ randomly. $S(w_1)$ was set equal to $V_1$ in all cases.
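Tuning a free parameter on the four non-test parts can be sketched as a grid search. Everything below is an illustrative assumption: the helper names, the interleaved split, and the `error_for(beta, tuning_set)` callback, which stands for evaluating a similarity-based model with a given parameter value on the tuning data.

```python
def split_folds(items, k=5):
    """Split items into k nearly equal-sized parts by interleaving."""
    return [items[i::k] for i in range(k)]

def tune_beta(folds, test_index, candidate_betas, error_for):
    """Return the candidate with the lowest error on the union of the non-test folds."""
    tuning_set = [x for i, fold in enumerate(folds) if i != test_index for x in fold]
    return min(candidate_betas, key=lambda beta: error_for(beta, tuning_set))

# Hypothetical usage: ten dummy instances and a made-up error curve with its minimum at 4.0.
folds = split_folds(list(range(10)), k=5)
toy_error = lambda beta, data: abs(beta - 4.0) / 10.0
best = tune_beta(folds, test_index=0, candidate_betas=[2.0, 3.0, 4.0, 5.0], error_for=toy_error)
```

The held-out fold never enters `tuning_set`, so the reported test error is not affected by the choice of parameter.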
The similarity-based methods consistently 
outperform the MLE method (which, recall, al- 
ways has an error rate of .5) and Katz's back- 
off method (which always had an error rate of 
about .51) by a huge margin; therefore, we con- 
clude that information from other word pairs is 
very useful for unseen pairs where unigram fre- 
quency is not informative. The similarity-based 
methods also do much better than RAND, 
which indicates that it is not enough to simply 
combine information from other words arbitrar- 
ily: it is quite important to take word similarity 
into account. In all cases, A edged out the other 
methods. The average improvement in using A 
instead of Pc is .0082; this difference is signifi- 
cant to the .1 level (p < .085), according to the 
paired t-test. 
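For readers who wish to reproduce this kind of comparison, the paired t statistic over per-fold error rates can be computed as sketched below. The function name is hypothetical and the numbers in the usage example are placeholders, not our results; obtaining a p-value additionally requires the t distribution with $n - 1$ degrees of freedom, which this sketch does not include.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    """t = mean(d) / (s_d / sqrt(n)) for paired differences d = a - b."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # stdev is the sample standard deviation (n - 1 in the denominator);
    # it is zero, and the statistic undefined, if all differences are equal.
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Placeholder inputs chosen for easy arithmetic, not taken from any experiment.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [0.0, 2.0, 2.0, 4.0, 4.0]
t_value = paired_t_statistic(method_a, method_b)
```

With five test sets the statistic has four degrees of freedom.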
³ It should be noted, however, that on BO-1 data, KL divergence performed slightly better than the $L_1$ norm.
[Plot: Error Rates on Test Sets, Base Language Model MLE-1; x-axis: test sets T1 through T5; series: RANDMLE1, CONFMLE1, LMLE1, AMLE1]
Figure 1: Error rates for each test set, where the base language model was MLE-1. The methods, going from left to right, are RAND, $P_C$, $L$, and $A$. The performances shown are for settings of $\beta$ that were optimal for the corresponding training set. $\beta$ ranged from 4.0 to 4.5 for $L$ and from 10 to 13 for $A$.
The results for the MLE-o1 case are depicted
in Figure 2. Again, we see the similarity-based
methods achieving far lower error rates than the 
MLE, back-off, and RAND methods, and again, 
A always performed the best. However, with 
singletons omitted the difference between A and 
Pc is even greater, the average difference being 
.024, which is significant to the .01 level (paired 
t-test). 
An important observation is that all meth- 
ods, including RAND, were much more effective 
if singletons were included in the base language 
model; thus, in the case of unseen word pairs, 
Katz's claim that singletons can be safely ig- 
nored in the back-off model does not hold for 
similarity-based models. 
[Plot: Error Rates on Test Sets, Base Language Model MLE-o1; x-axis: test sets T1 through T5; series: RANDMLEo1, CONFMLEo1, LMLEo1, AMLEo1]
Figure 2: Error rates for each test set, where the base language model was MLE-o1. $\beta$ ranged from 6 to 11 for $L$ and from 21 to 22 for $A$.
4 Conclusions
Similarity-based language models provide an appealing approach for dealing with data sparseness. We have described and compared the performance of four such models against two classical estimation methods, the MLE method and Katz's back-off scheme, on a pseudo-word disambiguation task. We observed that the similarity-based methods perform much better on unseen word pairs, with the measure based on the KL divergence to the average being the best overall.
We also investigated Katz's claim that one 
can discard singletons in the training data, re- 
sulting in a more compact language model, 
without significant loss of performance. Our re- 
sults indicate that for similarity-based language 
modeling, singletons are quite important; their 
omission leads to significant degradation of per- 
formance. 
Acknowledgments 
We thank Hiyan Alshawi, Joshua Goodman, Rebecca Hwa, Stuart Shieber, and Yoram Singer for many helpful comments and discussions. Part of this work was done while the first and second authors were visiting AT&T Labs.
This material is based upon work supported in 
part by the National Science Foundation under 
Grant No. IRI-9350192. The second author 
also gratefully acknowledges support from a Na- 
tional Science Foundation Graduate Fellowship 
and an AT&T GRPW/ALFP grant. 

References 
Brown, Peter F., Vincent J. DellaPietra, Peter V. 
deSouza, Jennifer C. Lai, and Robert L. Mercer. 
1992. Class-based n-gram models of natural lan- 
guage. Computational Linguistics, 18(4):467-479, 
December. 
Church, Kenneth. 1988. A stochastic parts program 
and noun phrase parser for unrestricted text. In 
Proceedings of the Second Conference on Applied 
Natural Language Processing, pages 136-143. 
Church, Kenneth W. and William A. Gale. 1991.
A comparison of the enhanced Good-Turing and
deleted estimation methods for estimating
probabilities of English bigrams. Computer Speech and
Language, 5:19-54.
Cover, Thomas M. and Joy A. Thomas. 1991. Ele- 
ments of Information Theory. John Wiley. 
Dagan, Ido, Fernando Pereira, and Lillian Lee. 1994. 
Similarity-based estimation of word cooccurrence 
probabilities. In Proceedings of the 32nd Annual 
Meeting of the ACL, pages 272-278, Las Cruces, 
NM. 
Essen, Ute and Volker Steinbiss. 1992. Co- 
occurrence smoothing for stochastic language 
modeling. In Proceedings of ICASSP, volume 1, 
pages 161-164. 
Gale, William, Kenneth Church, and David
Yarowsky. 1992. Work on statistical methods for
word sense disambiguation. In Working Notes,
AAAI Fall Symposium Series, Probabilistic
Approaches to Natural Language, pages 54-60.
Good, I.J. 1953. The population frequencies of 
species and the estimation of population parame- 
ters. Biometrika, 40(3 and 4):237-264. 
Hoeffding, Wassily. 1965. Asymptotically optimal
tests for multinomial distributions. Annals of
Mathematical Statistics, pages 369-401.
Jelinek, Frederick, Robert L. Mercer, and Salim
Roukos. 1992. Principles of lexical language
modeling for speech recognition. In Sadaoki
Furui and M. Mohan Sondhi, editors, Advances
in Speech Signal Processing. Marcel Dekker, Inc.,
pages 651-699.
Karov, Yael and Shimon Edelman. 1996. Learning
similarity-based word sense disambiguation from
sparse data. In 4th Workshop on Very Large
Corpora.
Katz, Slava M. 1987. Estimation of probabilities
from sparse data for the language model
component of a speech recognizer. IEEE Transactions
on Acoustics, Speech and Signal Processing,
ASSP-35(3):400-401, March.
Pereira, Fernando, Naftali Tishby, and Lillian Lee. 
1993. Distributional clustering of English words. 
In Proceedings of the 31st Annual Meeting of the 
ACL, pages 183-190, Columbus, OH. 
Resnik, Philip. 1992. Wordnet and distributional 
analysis: A class-based approach to lexical discov- 
ery. AAAI Workshop on Statistically-based Natu- 
ral Language Processing Techniques, pages 56-64, 
July. 
Schütze, Hinrich. 1992. Context space. In Working
Notes, AAAI Fall Symposium on Probabilistic
Approaches to Natural Language.
