Distinguishing Word Senses in Untagged Text 
Ted Pedersen and Rebecca Bruce 
Department of Computer Science and Engineering 
Southern Methodist University 
Dallas, TX 75275-0112 
{pedersen,rbruce}@seas.smu.edu
Abstract 
This paper describes an experimental com- 
parison of three unsupervised learning al- 
gorithms that distinguish the sense of 
an ambiguous word in untagged text. 
The methods described in this paper, 
McQuitty's similarity analysis, Ward's 
minimum-variance method, and the EM 
algorithm, assign each instance of an am- 
biguous word to a known sense definition 
based solely on the values of automatically 
identifiable features in text. These methods and
feature sets are found to be more successful in
disambiguating nouns than adjectives or verbs.
Overall, the most
accurate of these procedures is McQuitty's 
similarity analysis in combination with a 
high dimensional feature set. 
1 Introduction 
Statistical methods for natural language process- 
ing are often dependent on the availability of costly 
knowledge sources such as manually annotated text 
or semantic networks. This limits the applicability
of such approaches to domains where this
hard-to-acquire knowledge is already available. This paper
presents three unsupervised learning algorithms that 
are able to distinguish among the known senses (i.e., 
as defined in some dictionary) of a word, based only 
on features that can be automatically extracted from 
untagged text. 
The object of unsupervised learning is to determine
the class membership of each observation (i.e.,
each object to be classified) in a sample without
using training examples of correct classifications. We
discuss three algorithms, McQuitty's similarity anal- 
ysis (McQuitty, 1966), Ward's minimum-variance 
method (Ward, 1963) and the EM algorithm (Demp- 
ster, Laird, and Rubin, 1977), that can be used to 
distinguish among the known senses of an ambigu- 
ous word without the aid of disambiguated exam- 
ples. The EM algorithm produces maximum likeli- 
hood estimates of the parameters of a probabilistic 
model, where that model has been specified in ad- 
vance. Both Ward's and McQuitty's methods are ag- 
glomerative clustering algorithms that form classes 
of unlabeled observations that minimize their respec- 
tive distance measures between class members. 
The rest of this paper is organized as follows. 
First, we present introductions to Ward's and Mc- 
Quitty's methods (Section 2) and the EM algorithm 
(Section 3). We discuss the thirteen words (Section 
4) and the three feature sets (Section 5) used in our 
experiments. We present our experimental results 
(Section 6) and close with a discussion of related 
work (Section 7). 
2 Agglomerative Clustering 
In general, clustering methods rely on the assump- 
tion that classes occupy distinct regions in the fea- 
ture space. The distance between two points in a 
multi-dimensional space can be measured using any 
of a wide variety of metrics (see, e.g. (Devijver 
and Kittler, 1982)). Observations are grouped in 
the manner that minimizes the distance between the 
members of each class. 
Ward's and McQuitty's methods are agglomerative
clustering algorithms that differ primarily in how 
they compute the distance between clusters. All 
such algorithms begin by placing each observation 
in a unique cluster, i.e. a cluster of one. The two 
closest clusters are merged to form a new cluster 
that replaces the two merged clusters. Merging of 
the two closest clusters continues until only some 
specified number of clusters remain. 
However, our data does not immediately lend it- 
self to a distance-based interpretation. Our features 
represent part-of-speech (POS) tags, morphological 
characteristics, and word co-occurrence; such fea- 
tures are nominal and their values do not have scale. 
Given a POS feature, for example, we could choose 
noun = 1, verb = 2, adjective = 3, and adverb = 
4. That adverb is represented by a larger number 
than noun is purely coincidental and implies nothing 
about the relationship between nouns and adverbs. 
Thus, before we employ either clustering algorithm,
we represent our data sample in terms of a
dissimilarity matrix. Suppose that we have N
observations in a sample where each observation has q
features. This data is represented in an N x N
dissimilarity matrix such that the value in cell (i,j),
where i represents the row number and j represents
the column, is equal to the number of features in
observations i and j that do not match.

10 2 5
 1 2 1
 3 2 5
10 2 5
Figure 1: Matrix of Feature Values

0 2 1 0
2 0 2 2
1 2 0 1
0 2 1 0
Figure 2: Dissimilarity Matrix

For example, in Figure 1 we have four observa- 
tions. We record the values of three nominal fea- 
tures for each observation. This sample can be rep- 
resented by the 4 x 4 dissimilarity matrix shown in 
Figure 2. In the dissimilarity matrix, cells (1, 2) and 
(2, 1) have the value 2, indicating that the first and 
second observations in Figure 1 have different values 
for two of the three features. A value of 0 indicates 
that observations i and j are identical. 
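
The construction of the dissimilarity matrix can be sketched in a few lines of Python. This is only an illustration of the mismatch count described above; the function name `dissimilarity_matrix` and the variable `figure_1` are hypothetical, and the data are the feature values of Figure 1.

```python
def dissimilarity_matrix(observations):
    """Return the N x N matrix of feature-mismatch counts."""
    n = len(observations)
    return [
        [sum(x != y for x, y in zip(observations[i], observations[j]))
         for j in range(n)]
        for i in range(n)
    ]

# Feature values from Figure 1, one tuple per observation.
figure_1 = [
    (10, 2, 5),
    (1, 2, 1),
    (3, 2, 5),
    (10, 2, 5),
]

for row in dissimilarity_matrix(figure_1):
    print(row)
```

Because features are only compared for equality, the numeric codes chosen for nominal values play no role in the result.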
When clustering our data, each observation is rep- 
resented by its corresponding row (or column) in the 
dissimilarity matrix. Using this representation, ob- 
servations that fall close together in feature space are 
likely to belong to the same class and are grouped 
together into clusters. In this paper, we use Ward's 
and McQuitty's methods to form clusters of obser- 
vations, where each observation is represented by a 
row in a dissimilarity matrix. 
2.1 Ward's minimum-variance method 
In Ward's method, the internal variance of a cluster 
is the sum of squared distances between each obser- 
vation in the cluster and the mean observation for 
that cluster (i.e., the average of all the observations 
in the cluster). At each step in Ward's method, a 
new cluster, CKL, with the smallest possible inter- 
nal variance, is created by merging the two clusters, 
CK and CL, that have the minimum variance between
them. The variance between CK and CL is
computed as follows:

$$V_{KL} = \frac{\lVert \bar{x}_K - \bar{x}_L \rVert^2}{\frac{1}{N_K} + \frac{1}{N_L}} \qquad (1)$$

where $\bar{x}_K$ is the mean observation for cluster CK,
$N_K$ is the number of observations in CK, and $\bar{x}_L$
and $N_L$ are defined similarly for CL.
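
As a rough illustration of Equation 1, the following sketch computes the variance between two clusters whose members are rows of a dissimilarity matrix. The helper names `mean_observation` and `ward_distance` are hypothetical, and the example clusters are taken from the rows of Figure 2; this is not the implementation used in the experiments.

```python
def mean_observation(cluster):
    """Average the observations (equal-length numeric rows) in a cluster."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]

def ward_distance(cluster_k, cluster_l):
    """Squared distance between cluster means divided by (1/N_K + 1/N_L)."""
    mean_k = mean_observation(cluster_k)
    mean_l = mean_observation(cluster_l)
    squared = sum((a - b) ** 2 for a, b in zip(mean_k, mean_l))
    return squared / (1.0 / len(cluster_k) + 1.0 / len(cluster_l))

# Rows 1 and 4 of Figure 2 form one cluster; row 3 forms another.
cluster_k = [[0, 2, 1, 0], [0, 2, 1, 0]]
cluster_l = [[1, 2, 0, 1]]
print(ward_distance(cluster_k, cluster_l))
```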
Implicit in Ward's method is the assumption that 
the sample comes from a mixture of normal distri- 
butions. While NLP data is typically not well char- 
acterized by a normal distribution (see, e.g. (Zipf, 
1935), (Pedersen, Kayaalp, and Bruce, 1996)), there 
is evidence that our data, when represented by a dis- 
similarity matrix, can be adequately characterized 
by a normal distribution. However, we will continue 
to investigate the appropriateness of this assump- 
tion. 
2.2 McQuitty's similarity analysis 
In McQuitty's method, clusters are based on a sim- 
ple averaging of the feature mismatch counts found 
in the dissimilarity matrix. 
At each step in McQuitty's method, a new cluster,
CKL, is formed by merging the clusters CK and CL
that have the fewest number of dissimilar features
between them. The clusters to be merged, CK and
CL, are identified by finding the cell (l, k) (or (k, l)),
where k ≠ l, that has the minimum value in the
dissimilarity matrix.
Once the new cluster CKL is created, the dissimilarity
matrix is updated to reflect the number of
dissimilar features between CKL and all other existing
clusters. The dissimilarity between any existing
cluster CI and CKL is computed as:

$$D_{KL-I} = \frac{D_{KI} + D_{LI}}{2} \qquad (2)$$

where $D_{KI}$ is the number of dissimilar features between
clusters CK and CI and $D_{LI}$ is similarly defined
for clusters CL and CI. This is simply the
average number of mismatches between each component
of the new cluster and the existing cluster.
Unlike Ward's method, McQuitty's method makes 
no assumptions concerning the distribution of the 
data sample. 
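
The merge-and-update loop of McQuitty's method can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `mcquitty_cluster` is hypothetical, the input is a symmetric dissimilarity matrix such as the one in Figure 2, and ties are broken deterministically by iteration order, whereas the experiments reported here select randomly among equally distant groups.

```python
def mcquitty_cluster(dissim, num_clusters):
    """Merge the closest clusters until num_clusters remain (Equation 2 update)."""
    # Each active cluster id maps to the observation indices it contains.
    members = {i: [i] for i in range(len(dissim))}
    # dist[a][b] is the current dissimilarity between active clusters a and b.
    dist = {i: {j: float(dissim[i][j]) for j in members if j != i}
            for i in members}
    next_id = len(dissim)
    while len(members) > num_clusters:
        # Closest pair of active clusters (first minimum wins on ties).
        k, l = min(((a, b) for a in dist for b in dist[a] if a < b),
                   key=lambda pair: dist[pair[0]][pair[1]])
        merged = members.pop(k) + members.pop(l)
        # Equation 2: average the dissimilarities of the two merged clusters.
        new_row = {i: (dist[k][i] + dist[l][i]) / 2.0 for i in members}
        del dist[k], dist[l]
        for i in members:
            del dist[i][k], dist[i][l]
            dist[i][next_id] = new_row[i]
        dist[next_id] = new_row
        members[next_id] = merged
        next_id += 1
    return sorted(sorted(m) for m in members.values())

figure_2 = [
    [0, 2, 1, 0],
    [2, 0, 2, 2],
    [1, 2, 0, 1],
    [0, 2, 1, 0],
]
print(mcquitty_cluster(figure_2, 2))
```

On the Figure 2 matrix, observations 1 and 4 (indices 0 and 3) merge first because their dissimilarity is 0, and observation 3 joins them next.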
3 EM Algorithm 
The expectation maximization algorithm (Demp- 
ster, Laird, and Rubin, 1977), commonly known as 
the EM algorithm, is an iterative estimation proce- 
dure in which a problem with missing data is recast 
to make use of complete data estimation techniques. 
In our work, the sense of an ambiguous word is rep- 
resented by a feature whose value is missing. 
In order to use the EM algorithm, the paramet- 
ric form of the model representing the data must 
be known. In these experiments, we assume that 
the model form is the Naive Bayes (Duda and 
Hart, 1973). In this model, all features are con- 
ditionally independent given the value of the clas- 
sification feature, i.e., the sense of the ambigu- 
ous word. This assumption is based on the suc- 
cess of the Naive Bayes model when applied to su- 
pervised word-sense disambiguation (e.g. (Gale, 
Church, and Yarowsky, 1992), (Leacock, Towell, and 
Voorhees, 1993), (Mooney, 1996), (Pedersen, Bruce, 
and Wiebe, 1997), (Pedersen and Bruce, 1997a)). 
There are two potential problems when using the 
EM algorithm. First, it is computationally expen- 
sive and convergence can be slow for problems with 
large numbers of model parameters. Unfortunately 
there is little to be done in this case other than re- 
ducing the dimensionality of the problem so that 
fewer parameters are estimated. Second, if the
likelihood function is very irregular, it may always
converge to a local maximum and not find the global
maximum. In this case, an alternative is to use the more
computationally expensive method of Gibbs Sam- 
pling (Geman and Geman, 1984). 
3.1 Description 
At the heart of the EM Algorithm lies the Q- 
function. This is the expected value of the log- 
likelihood function for the complete data D = (Y, S), 
where Y is the observed data and S is the missing 
sense value: 

$$Q(\theta^i \mid \theta) = E[\ln p(Y, S \mid \theta^i) \mid \theta, Y] \qquad (3)$$

Here, $\theta$ is the current value of the maximum
likelihood estimates of the model parameters and $\theta^i$ is the
improved estimate that we are seeking; $p(Y, S \mid \theta^i)$ is
the likelihood of observing the complete data given
the improved estimate of the model parameters.
When approximating the maximum of the likelihood
function, the EM algorithm starts from a randomly
generated initial estimate of $\theta$ and then replaces
$\theta$ by the $\theta^i$ which maximizes $Q(\theta^i \mid \theta)$. This
process is broken down into two steps: expecta- 
tion (the E-step), and maximization (the M-step). 
The E-step finds the expected values of the sufficient 
statistics of the complete model using the current es- 
timates of the model parameters. The M-step makes 
maximum likelihood estimates of the model param- 
eters using the sufficient statistics from the E-step. 
These steps iterate until the parameter estimates $\theta$
and $\theta^i$ converge.
The M-step is usually easy, assuming it is easy 
for the complete data problem; the E-step is not 
necessarily so. However, for decomposable models, 
such as the Naive Bayes, the E-step simplifies to the 
calculation of the expected counts in the marginal 
distributions of interdependent features, where the
expectation is with respect to $\theta$. The M-step
simplifies to the calculation of new parameter estimates
from these counts. Further, these expected counts
can be calculated by multiplying the sample size N
by the probability of the complete data within each
marginal distribution given $\theta$ and the observed data
within each marginal $Y_m$. This simplifies to:

$$\mathrm{count}^i(S_m, Y_m) = P(S_m \mid Y_m) \times \mathrm{count}(Y_m)$$

where $\mathrm{count}^i$ is the current estimate of the expected
count and $P(S_m \mid Y_m)$ is formulated using $\theta$.
3.2 Example 
For the Naive Bayes model with 3 observable features
A, B, C and an unobservable classification feature
S, where $\theta = \{P(a, s), P(b, s), P(c, s), P(s)\}$,
the E and M-steps are:
1. E-step: The expected values of the sufficient
statistics are computed as follows:

$$\mathrm{count}^i(s, a) = P(s \mid a) \times \mathrm{count}(a)$$
$$\mathrm{count}^i(s, b) = P(s \mid b) \times \mathrm{count}(b)$$
$$\mathrm{count}^i(s, c) = P(s \mid c) \times \mathrm{count}(c)$$
$$\mathrm{count}^i(s) = \sum_{a,b,c} \{ P(s \mid a, b, c) \times \mathrm{count}(a, b, c) \}$$

where:

$$P(s \mid a) = \sum_{b,c} P(s \mid a, b, c)$$
$$P(s \mid a, b, c) = \frac{P(s, a, b, c)}{P(a, b, c)}$$
$$P(s, a, b, c) = \frac{P(s, a) \times P(s, b) \times P(s, c)}{P(s)^2}$$
$$P(a, b, c) = \sum_{s} \frac{P(s, a) \times P(s, b) \times P(s, c)}{P(s)^2}$$

2. M-step: The sufficient statistics from the E-step
are used to re-estimate the model parameters $\theta^i$:

$$P^i(s, a) = \frac{\mathrm{count}^i(s, a)}{N}$$
$$P^i(s, b) = \frac{\mathrm{count}^i(s, b)}{N}$$
$$P^i(s, c) = \frac{\mathrm{count}^i(s, c)}{N}$$
$$P^i(s) = \frac{\mathrm{count}^i(s)}{N}$$

where s, a, b, and c denote specific values of S, A, B,
and C respectively, and $P(s \mid b)$ and $P(s \mid c)$ are
defined analogously to $P(s \mid a)$.
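
The alternation of the two steps can be sketched in Python for a Naive Bayes model with a missing sense feature. This is a generic EM sketch under stated assumptions, not the implementation used in the experiments: the function name `em_naive_bayes`, its parameters, and the toy data are hypothetical; the expected counts are accumulated per observation from the posterior P(s | a, b, c); the model is parameterized by P(s) and P(value | s) rather than by the joint marginals; random initial posteriors stand in for the random initial parameter estimate; and a tiny pseudo-count `alpha` is added purely as a numerical safeguard.

```python
import random

def em_naive_bayes(data, num_senses, iterations=50, alpha=1e-6, seed=0):
    """Fit Naive Bayes with an unobserved sense S; returns (P(s), P(v|s), posteriors)."""
    rng = random.Random(seed)
    n, q = len(data), len(data[0])
    values = [list({obs[f] for obs in data}) for f in range(q)]
    # Random initial posteriors P(s | observation).
    post = []
    for _ in range(n):
        w = [rng.random() + 1e-3 for _ in range(num_senses)]
        post.append([x / sum(w) for x in w])
    for _ in range(iterations):
        # M-step: re-estimate parameters from the expected counts.
        prior = [(sum(p[s] for p in post) + alpha) / (n + alpha * num_senses)
                 for s in range(num_senses)]
        cond = [[{v: alpha for v in values[f]} for f in range(q)]
                for _ in range(num_senses)]
        for obs, p in zip(data, post):
            for s in range(num_senses):
                for f in range(q):
                    cond[s][f][obs[f]] += p[s]   # expected count of (s, value)
        for s in range(num_senses):
            for f in range(q):
                total = sum(cond[s][f].values())
                for v in cond[s][f]:
                    cond[s][f][v] /= total       # P(value | s)
        # E-step: posterior of each sense given the observed features.
        post = []
        for obs in data:
            joint = []
            for s in range(num_senses):
                p = prior[s]
                for f in range(q):
                    p *= cond[s][f][obs[f]]
                joint.append(p)
            total = sum(joint)
            post.append([j / total for j in joint])
    return prior, cond, post

# Toy sample with two clearly separated groups of observations.
data = [("x", "p", "m")] * 10 + [("y", "q", "n")] * 10
prior, cond, post = em_naive_bayes(data, num_senses=2)
labels = [max(range(2), key=lambda s: p[s]) for p in post]
print(labels)
```

Which sense index ends up attached to which group depends on the random start, which is one reason the discovered classes must afterwards be mapped to dictionary senses.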
4 Experimental Procedure 
Experiments were conducted to disambiguate 13 dif- 
ferent words using 3 different feature sets. In these 
experiments, each of the 3 unsupervised disambigua- 
tion methods is applied to each of the 13 words using 
each of the 3 feature sets; this defines a total of 117 
different experiments. In addition, each experiment 
was repeated 25 times in order to study the variance 
introduced by randomly selecting initial parameter 
estimates, in the case of the EM algorithm, and ran- 
domly selecting among equally distant groups when 
clustering using Ward's and McQuitty's methods. 
In order to evaluate the unsupervised learning al- 
gorithms we use sense-tagged text in these exper- 
iments. However, this text is only used to evalu- 
ate the accuracy of our methods. The classes dis- 
covered by the unsupervised learning algorithms are 
mapped to dictionary senses in a manner that max- 
imizes their agreement with the sense-tagged text. 
If the sense-tagged text were not available, as would 
often be the case in an unsupervised experiment, this 
mapping would have to be performed manually. 
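
One way to carry out such a mapping automatically is to try every one-to-one assignment of discovered classes to senses and keep the one with the most agreement. The sketch below assumes that the number of discovered classes equals the number of senses; the function name `best_sense_mapping` and the layout of the count table are hypothetical, and the counts are the EM results for concern (Figure 6) with the columns deliberately left unlabeled and swapped.

```python
from itertools import permutations

def best_sense_mapping(counts, senses):
    """counts[i][j]: instances with actual sense i placed in discovered class j."""
    best = None
    for perm in permutations(range(len(senses))):
        # perm[j] is the index of the sense assigned to discovered class j.
        correct = sum(counts[perm[j]][j] for j in range(len(senses)))
        if best is None or correct > best[0]:
            best = (correct, {j: senses[perm[j]] for j in range(len(senses))})
    return best

senses = ["worry", "business"]
# Rows: actual worry, actual business. Columns: unlabeled classes 0 and 1.
counts = [[63, 384],
          [656, 132]]
print(best_sense_mapping(counts, senses))
```

Exhaustive search over permutations is practical here because each word has only two or three senses.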
The words disambiguated and their sense distri- 
butions are shown in Figure 3. All data, with the ex- 
ception of the data for line, come from the ACL/DCI 
Wall Street Journal corpus (Marcus, Santorini, and 
Marcinkiewicz, 1993). With the exception of line, 
each ambiguous word is tagged with a single sense 
defined in the Longman Dictionary of Contempo- 
rary English (LDOCE) (Procter, 1978). The data 
for the 12 words tagged using LDOCE senses are 
described in more detail in (Bruce, Wiebe, and Ped- 
ersen, 1996). 
The line data comes from both the ACL/DCI 
WSJ corpus and the American Printing House for 
the Blind corpus. Each occurrence of line is tagged 
with a single sense defined in WordNet (Miller, 
1995). This data is described in more detail in (Lea- 
cock, Towell, and Voorhees, 1993). 
Every experiment utilizes all of the sentences 
available for each word. The number of sentences 
available per word is shown as "total count" in Fig- 
ure 3. We have reduced the sense inventory of these 
words so that only the two or three most frequent 
senses are included in the text being disambiguated. 
For several of the words, there are minority senses 
that form a very small percentage (i.e., < 5%) of 
the total sample. Such minority classes are not yet 
well handled by unsupervised techniques; therefore 
we do not consider them in this study. 
5 Feature Sets 
We define three different feature sets for use in these 
experiments. Our objective is to evaluate the effect 
that different types of features have on the accuracy 
of unsupervised learning algorithms such as those 
discussed here. We are particularly interested in the 
impact of the overall dimensionality of the feature 
space, and in determining how indicative different 
feature types are of word senses. Our feature sets are 
composed of various combinations of the following 
five types of features. 
Morphology The feature M represents the mor- 
phology of the ambiguous word. For nouns, M is 
binary indicating singular or plural. For verbs, the 
value of M indicates the tense of the verb and can 
have up to 7 possible values. This feature is not used 
for adjectives. 
Adjective Senses
chief: (total count: 1048)
    highest in rank: 86%
    most important; main: 14%
common: (total count: 1060)
    as in the phrase 'common stock': 84%
    belonging to or shared by 2 or more: 8%
    happening often; usual: 8%
last: (total count: 3004)
    on the occasion nearest in the past: 94%
    after all others: 6%
public: (total count: 715)
    concerning people in general: 68%
    concerning the government and people: 19%
    not secret or private: 13%
Noun Senses
bill: (total count: 1341)
    a proposed law under consideration: 68%
    a piece of paper money or treasury bill: 22%
    a list of things bought and their price: 10%
concern: (total count: 1235)
    a business; firm: 64%
    worry; anxiety: 36%
drug: (total count: 1127)
    a medicine; used to make medicine: 57%
    a habit-forming substance: 43%
interest: (total count: 2113)
    money paid for the use of money: 59%
    a share in a company or business: 24%
    readiness to give attention: 17%
line: (total count: 1149)
    a wire connecting telephones: 37%
    a cord; cable: 32%
    an orderly series: 30%
Verb Senses
agree: (total count: 1109)
    to concede after disagreement: 74%
    to share the same opinion: 26%
close: (total count: 1354)
    to (cause to) end: 77%
    to (cause to) stop operation: 23%
help: (total count: 1267)
    to enhance - inanimate object: 78%
    to assist - human object: 22%
include: (total count: 1526)
    to contain in addition to other parts: 91%
    to be a part of - human subject: 9%
Figure 3: Distribution of Senses
word      C1        C2         C3
chief     officer   executive  president
common    share     million    stock
last      year      week       million
public    offering  million    company
bill      treasury  billion    house
concern   million   company    market
drug      fda       company    generic
interest  rate      million    company
line      he        it         telephone
agree     million   company    pay
close     trading   exchange   stock
help      it        say        he
include   million   company    year
Figure 4: Co-occurrence Features
Part-of-Speech Features of the form PLi repre- 
sent the part-of-speech (POS) of the word i posi- 
tions to the left of the ambiguous word. PRi repre- 
sents the POS of the word i positions to the right. 
In these experiments, we used 4 POS features, PL1, 
PL2, PR1, and PR2 to record the POS of the words 
1 and 2 positions to the left and right of the am- 
biguous word. Each POS feature can have one of 
5 possible values: noun, verb, adjective, adverb or 
other. 
Co-occurrences Features of the form Ci are binary
co-occurrence features. They indicate the presence
or absence of a particular content word in the
same sentence as the ambiguous word. We use 3
binary co-occurrence features, C1, C2, and C3, to
represent the presence or absence of each of the three
most frequent content words, C1 being the most
frequent content word, C2 the second most frequent
and C3 the third. Only sentences containing the
ambiguous word were used to establish word frequencies.
Frequency-based features like this one contain little
information about low-frequency classes. For
words with skewed sense distributions, it is likely that
the most frequent content words will be associated
only with the dominant sense.
As an example, consider the 3 most frequent content
words occurring in the sentences that contain
chief: officer, executive and president. Chief has a
majority class distribution of 86% and, not surprisingly,
these three content words are all indicative of
the dominant sense, which is "highest in rank".
The set of content words used in formulating the
co-occurrence features is shown in Figure 4. Note
that million and company occur frequently. These
are not likely to be indicative of a particular sense;
rather, they reflect the general nature of the Wall Street
Journal corpus.
Unrestricted Collocations Features of the form 
ULi and URi indicate the word occurring in the po- 
sition i places to the left or right, respectively, of the 
ambiguous word. All features of this form have 21 
possible values. Nineteen correspond to the 19 most 
frequent words that occur in that fixed position in 
all of the sentences that contain the particular am- 
biguous word. There is also a value, (none), that 
indicates when the position i to the left or right is 
occupied by a word that is not among the 19 most 
frequent, and a value, (null), indicating that the po- 
sition i to the left or right falls outside of the sentence 
boundary. 
In these experiments we use 4 unrestricted collocation
features, UL2, UL1, UR1, and UR2. As an
example, the values of these features for concern are
as follows:
• UL2: and, the, a, of, to, financial, have, because,
an, 's, real, cause, calif., york, u.s., other,
mass., german, (null), (none)
• UL1: the, services, of, products, banking, 's,
pharmaceutical, energy, their, expressed, electronics,
some, biotechnology, aerospace, environmental,
such, japanese, gas, investment,
(null), (none)
• UR1: about, said, that, over, 's, in, with, had,
are, based, and, is, has, was, to, for, among,
will, did, (null), (none)
• UR2: the, said, a, it, in, that, to, n't, is, which,
by, and, was, has, its, possible, net, but, annual,
(null), (none)
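
The way such a collocation feature takes its value can be sketched as follows. The function names `collocation_values` and `collocation_feature` and the toy sentences are hypothetical, and the sketch assumes that each sentence is already tokenized and normalized into a list of lowercase words.

```python
from collections import Counter

def collocation_values(sentences, target, offset, top_n=19):
    """Most frequent words found at a fixed offset from the target word."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target and 0 <= i + offset < len(tokens):
                counts[tokens[i + offset]] += 1
    return [word for word, _ in counts.most_common(top_n)]

def collocation_feature(tokens, position, offset, frequent):
    """Value of ULi/URi for the ambiguous word at tokens[position]."""
    j = position + offset
    if j < 0 or j >= len(tokens):
        return "(null)"      # position falls outside the sentence boundary
    if tokens[j] in frequent:
        return tokens[j]
    return "(none)"          # word is not among the most frequent ones

sentences = [
    ["the", "concern", "said"],
    ["a", "concern", "said"],
    ["the", "concern", "about"],
]
ur1_values = collocation_values(sentences, "concern", offset=1, top_n=1)
print(ur1_values)
print(collocation_feature(["the", "concern", "about"], 1, 1, set(ur1_values)))
```

A negative offset gives the UL features and a positive offset the UR features.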
Content Collocations Features of the form CL1 
and CR1 indicate the content word occurring in the 
position 1 place to the left or right, respectively, of 
the ambiguous word. The values of these features 
are defined much like the unrestricted collocations 
above, except that these are restricted to the 19 most 
frequent content words that occur only one position 
to the left or right of the ambiguous word. 
To contrast this set of features with the unre- 
stricted collocations, consider concern again. The 
values of the features representing the 19 most fre- 
quent content words 1 position to the left and right 
are as follows: 
• CL1: services, products, banking, pharmaceutical,
energy, expressed, electronics, biotechnology,
aerospace, environmental, japanese, gas,
investment, food, chemical, broadcasting, u.s.,
industrial, growing, (null), (none)
• CR1: said, had, are, based, has, was, did,
owned, were, regarding, have, declined, expressed,
currently, controlled, bought, announced,
reported, posted, (null), (none)
Feature Sets A, B and C The 3 feature sets
used in these experiments are designated A, B and
C and are formulated as follows:
• A: M, PL2, PL1, PR1, PR2, C1, C2, C3
Dimensionality: 5,000 - 35,000
• B: M, UL2, UL1, UR1, UR2
Dimensionality: 194,481 - 1,361,367
• C: M, PL2, PL1, PR1, PR2, CL1, CR1
Dimensionality: 275,625 - 1,929,375
The dimensionality is the number of possible com- 
binations of feature values and thus the size of the 
feature space. These values vary since the number of 
possible values for M varies with the part-of-speech 
of the ambiguous word. The lower number is asso- 
ciated with adjectives and the higher with verbs. 
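
The dimensionality ranges follow directly from the number of values each feature can take: 5 for each POS feature, 2 for each binary co-occurrence feature, and 21 for each collocation feature, with M contributing no values for adjectives and up to 7 for verbs. The short sketch below reproduces the arithmetic; the names `dimensionality` and `feature_sets` are hypothetical.

```python
from math import prod

def dimensionality(value_counts):
    """Number of possible combinations of feature values."""
    return prod(value_counts)

POS, COOC, COLLOC = 5, 2, 21   # possible values per feature type
feature_sets = {
    "A": [POS] * 4 + [COOC] * 3,
    "B": [COLLOC] * 4,
    "C": [POS] * 4 + [COLLOC] * 2,
}

for name, counts in feature_sets.items():
    low = dimensionality(counts)          # adjectives: M is not used
    high = dimensionality(counts + [7])   # verbs: M has up to 7 values
    print(name, low, high)
```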
To get a feeling for the adequacy of these feature 
sets, we performed supervised learning experiments 
with the interest data using the Naive Bayes model. 
We disambiguated 3 senses using a 10:1 training-to- 
test ratio. The average accuracies for each feature 
set over 100 random trials were as follows: A 80.9%, 
B 87.7%, and C 82.7%. 
The window size, the number of values for the 
POS features, and the number of words considered 
in the collocation features are kept deliberately small 
in order to control the dimensionality of the prob- 
lem. In future work, we will expand all of the above 
types of features and employ techniques to reduce 
dimensionality along the lines suggested in (Duda 
and Hart, 1973) and (Gale, Church, and Yarowsky, 
1995). 
6 Experimental Results 
Figure 5 shows the average accuracy and standard 
deviation of disambiguation over 25 random trials 
for each combination of word, feature set and learn- 
ing algorithm. Those cases where the average accu- 
racy of one algorithm for a particular feature set 
is significantly higher than another algorithm, as 
judged by the t-test (p=.01), are shown in bold face. 
For each word, the most accurate overall experiment 
(i.e., algorithm/feature set combination), and those 
that are not significantly less accurate are under- 
lined. Also included in Figure 5 is the percentage of 
each sample that is composed of the majority sense. 
This is the accuracy that can be obtained by a
majority classifier, a simple classifier that assigns each
ambiguous word to the most frequent sense in a sam- 
ple. However, bear in mind that in unsupervised ex- 
periments the distribution of senses is not generally 
known. 
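
The majority-classifier baseline is simply the relative frequency of the most common sense. As a small illustration, using the sense counts for concern that appear as row totals in Figure 6 (the function name `majority_accuracy` is hypothetical):

```python
def majority_accuracy(sense_counts):
    """Accuracy obtained by always choosing the most frequent sense."""
    return max(sense_counts.values()) / sum(sense_counts.values())

concern = {"business": 788, "worry": 447}
print(round(majority_accuracy(concern), 3))
```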
Perhaps the most striking aspect of these results 
is that, across all experiments, only the nouns are 
disambiguated with accuracy greater than that of 
the majority classifier. This is at least partially ex- 
plained by the fact that, as a class, the nouns have 
the most uniform distribution of senses. This point 
will be elaborated on in Section 6.1. While the choice 
of feature set impacts accuracy, overall it is only to 
a small degree. We return to this point in Section 
6.2. The final result, to be discussed in Section 6.3, 
is that the differences in the accuracy of these three 
algorithms are statistically significant both on aver- 
age and for individual words. 
6.1 Distribution of Classes 
Extremely skewed distributions pose a challenging 
learning problem since the sample contains precious 
little information regarding minority classes. This 
makes it difficult to learn their distributions with- 
out prior knowledge. For unsupervised approaches, 
this problem is exacerbated by the difficulty in
distinguishing the characteristics of the minority classes
from noise. 
In this study, the accuracy of the unsupervised al- 
gorithms was less than that of the majority classifier 
in every case where the percentage of the majority 
sense exceeded 68%. However, in the cases where 
the performance of these algorithms was less than 
that of the majority classifier, they were often still 
providing high accuracy disambiguation (e.g., 91% 
accuracy for last). Clearly, the distribution of classes 
is not the only factor affecting disambiguation accu- 
racy; compare the performance of these algorithms 
on bill and public which have roughly the same class 
distributions. 
It is difficult to quantify the effect of the distri- 
bution of classes on a learning algorithm particu- 
larly when using naturally occurring data. In previ- 
ous unsupervised experiments with interest, using a 
modified version of Feature Set A, we were able to 
achieve an increase of 36 percentage points over the 
accuracy of the majority classifier when the 3 classes 
were evenly distributed in the sample (Pedersen and 
Bruce, 1997b). Here, our best performance using a 
larger sample with a natural distribution of senses 
is only an increase of 20 percentage points over the 
accuracy of the majority classifier. 
Because skewed distributions are common in lexi- 
cal work (Zipf, 1935), they are an important consid- 
eration in formulating disambiguation experiments. 
In future work, we will investigate procedures for 
feature selection that are more sensitive to minor- 
ity classes. Reliance on frequency based features, as 
used in this work, means that the more skewed the 
sample is, the more likely it is that the features will 
be indicative of only the majority class. 
6.2 Feature Set 
Despite varying the feature sets, the relative accu- 
racy of the three algorithms remains rather consis- 
tent. For 6 of the 13 words there was a single al- 
gorithm that was always significantly more accurate 
than the other two across all features. 
The EM algorithm was most accurate for last and 
line with all three feature sets. McQuitty's method 
was significantly more accurate for chief, common, 
public, and help regardless of the feature set. 
                    Feature Set A                   Feature Set B                   Feature Set C
word        Maj.    McQuitty  Ward      EM        McQuitty  Ward      EM        McQuitty  Ward      EM
chief       .861    .844±.05  .721±.01  .729±.06  .831±.06  .611±.01  .646±.01  .856±.00  .673±.03  .697±.06
common      .842    .648±.12  .513±.08  .521±.00  .797±.04  .444±.04  .464±.06  .799±.06  .561±.05  .543±.09
last        .940    .791±.12  .598±.09  .903±.00  .541±.11  .659±.03  .909±.00  .636±.07  .601±.08  .874±.07
public      .683    .560±.08  .450±.05  .473±.03  .558±.07  .461±.03  .411±.03  .628±.05  .488±.04  .507±.03
adjectives  .832    .711±.15  .571±.12  .657±.18  .682±.15  .544±.10  .608±.20  .730±.11  .581±.08  .655±.16
bill        .681    .669±.08  .647±.11  .537±.05  .753±.05  .600±.04  .624±.08  .561±.10  .515±.04  .569±.04
concern     .638    .629±.07  .741±.04  .842±.00  .679±.04  .697±.02  .840±.02  .614±.08  .758±.04  .758±.09
drug        .567    .530±.03  .557±.06  .658±.03  .521±.01  .528±.00  .551±.05  .573±.06  .632±.06  .652±.04
interest    .593    .601±.04  .619±.04  .616±.06  .653±.06  .552±.06  .615±.05  .651±.02  .615±.04  .649±.09
line        .373    .420±.03  .441±.03  .457±.01  .403±.02  .428±.03  .474±.03  .410±.02  .427±.02  .458±.01
nouns       .570    .570±.10  .601±.12  .622±.14  .602±.11  .561±.10  .621±.13  .562±.10  .589±.12  .617±.12
agree       .740    .610±.08  .547±.03  .631±.08  .678±.08  .613±.04  .683±.14  .685±.07  .601±.00  .685±.14
close       .771    .616±.09  .531±.02  .560±.08  .667±.07  .664±.0?  .672±.06  .720±.11  .645±.04  .648±.05
help        .780    .713±.05  .591±.05  .586±.05  .636±.11  .519±.01  .526±.00  .566±.06  .570±.03  .602±.03
include     .910    .880±.06  .707±.08  .725±.02  .767±.09  .770±.0?  .783±.07  .???±.17  .558±.04  .535±.00
verbs       .800    .705±.13  .594±.08  .626±.09  .687±.10  .642±.10  .666±.12  .718±.11  .593±.05  .618±.09
overall     .734    .655±.?4  .589±.??  .634±.?4  .653±.?2  .58?±.?1  .63?±.?6  .662±.?3  .588±.?9  .629±.?3
(? marks a digit that is illegible in the source.)
Figure 5: Experimental Results - accuracy ± standard deviation
Despite this consistency, there were some observ- 
able trends associated with changes in feature set. 
For example, McQuitty's method was significantly
more accurate overall in combination with Feature
Set C, while the EM algorithm was more accurate
with Feature Set A, and the accuracy of Ward's
method was the least favorable with Feature Set B.
For the nouns, there was no significant differ- 
ence between Feature Sets A and B when using 
the EM algorithm. For the verbs there was no 
significant difference between the three feature sets 
when using McQuitty's method. The adjectives were
disambiguated significantly more accurately when
using McQuitty's method and Feature Set C.
One possible explanation for the consistency of 
results as feature sets varied is that perhaps the fea- 
tures most indicative of word senses are included in 
all the sets due to the selection methods and the 
commonality of feature types. These common fea- 
tures may be sufficient for the level of disambigua- 
tion achieved here. This explanation seems more 
plausible for the EM algorithm, where features are 
weighted, but less so for McQuitty's and Ward's 
which use a representation that does not allow fea- 
ture weighting. 
6.3 Disambiguation Algorithm 
Based on the average accuracy over part-of-speech 
categories, the EM algorithm performs with the 
highest accuracy for nouns while McQuitty's method 
performs most accurately for verbs and adjectives. 
This is true regardless of the feature set employed. 
The standard deviations give an indication of the
effect of ties on the clustering algorithms and the
effect of the random initialization on the EM
algorithm. In few cases is the standard deviation very
small. For the clustering algorithms, a high standard
deviation indicates that ties are having some effect
on the cluster analysis. This is undesirable and may
point to a need to expand the feature set in order to
reduce ties. For the EM algorithm, a high standard
deviation means that the algorithm is not settling on
any particular maximum. Results may become more
consistent if the number of parameters that must be
estimated were reduced.
Figures 6, 7 and 8 show the confusion matrices 
associated with the disambiguation of concern, in- 
terest, and help, using Feature Sets A, B, and C, 
respectively. A confusion matrix shows the number 
of cases where the sense discovered by the algorithm 
agrees with the manually assigned sense along the 
main diagonal; disagreements are shown in the rest 
of the matrix. 
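As a minimal illustration of how such a matrix is read, the sketch below sums the main diagonal of the McQuitty matrix for concern (Figure 6) and divides by the total number of instances. The function names are hypothetical and are not part of the original experiments.

```python
# Confusion matrix for "concern" under McQuitty's method (Figure 6).
# Rows are manually assigned (actual) senses, columns are discovered senses.
SENSES = ["worry", "business"]
MCQUITTY = [
    [166, 281],  # actual: worry
    [181, 607],  # actual: business
]

def correct_count(matrix):
    """Number of agreements: the sum of the main diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix)))

def total_count(matrix):
    """Total number of instances in the matrix."""
    return sum(sum(row) for row in matrix)

def accuracy(matrix):
    """Fraction of instances whose discovered sense matches the actual sense."""
    return correct_count(matrix) / total_count(matrix)

print(correct_count(MCQUITTY), total_count(MCQUITTY))  # 773 1235
print(round(accuracy(MCQUITTY), 3))                    # 0.626
```

The off-diagonal cells (281 and 181 here) are the disagreements described above.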
In general, these matrices reveal that both the EM
algorithm and Ward's method are more biased toward
balanced distributions of senses than is McQuitty's
method. This may explain the better performance of
McQuitty's method in disambiguating those words with
the most skewed sense distributions, the adjectives
and verbs. It is possible to adjust the EM algorithm
away from this tendency towards discovering balanced
distributions by providing prior knowledge of the
expected sense distribution. This will be explored in
future work.
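One way to make this tendency visible is to measure how evenly each method spreads its assignments over the senses. The sketch below does so for interest, using the discovered-sense totals from Figure 7. Shannon entropy is used as the balance measure purely for illustration; it is an assumption of this sketch, not a measure reported in the experiments.

```python
import math

# Number of instances assigned to each discovered sense of "interest"
# with Feature Set B (column totals from Figure 7).
DISCOVERED = {
    "McQuitty": [219, 197, 1697],
    "Ward": [1079, 200, 834],
    "EM": [581, 718, 814],
}
ACTUAL = [361, 500, 1252]  # row totals: manually assigned senses

def entropy_bits(counts):
    """Shannon entropy of a count vector; higher means more balanced."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

for name, counts in DISCOVERED.items():
    print(f"{name:9s} {entropy_bits(counts):.3f} bits")
print(f"{'actual':9s} {entropy_bits(ACTUAL):.3f} bits")
print(f"{'uniform':9s} {math.log2(len(ACTUAL)):.3f} bits")
```

Under this measure the Ward and EM assignments for interest are more evenly spread than the McQuitty assignments, which is the pattern described above.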
Actual \ Discovered     worry  business     Total
worry                     166       281       447
business                  181       607       788
Total                     347       888      1235
McQuitty - 773 correct

Actual \ Discovered     worry  business     Total
worry                     288       159       447
business                  155       633       788
Total                     443       792      1235
Ward - 921 correct

Actual \ Discovered     worry  business     Total
worry                     384        63       447
business                  132       656       788
Total                     516       719      1235
EM - 1040 correct

Figure 6: concern - Feature Set A
Actual \ Discovered  attention      share      money      Total
attention                   53          6        302        361
share                       58        187        255        500
money                      108          4       1140       1252
Total                      219        197       1697       2113
McQuitty - 1380 correct

Actual \ Discovered  attention      share      money      Total
attention                  280          3         78        361
share                      240        197         63        500
money                      559          0        693       1252
Total                     1079        200        834       2113
Ward - 1170 correct

Actual \ Discovered  attention      share      money      Total
attention                  127        230          4        361
share                      134        364          2        500
money                      320        124        808       1252
Total                      581        718        814       2113
EM - 1299 correct

Figure 7: interest - Feature Set B
Actual \ Discovered    assist   enhance     Total
assist                     45       234       279
enhance                   146       842       988
Total                     191      1076      1267
McQuitty - 887 correct

Actual \ Discovered    assist   enhance     Total
assist                     88       191       279
enhance                   354       634       988
Total                     442       825      1267
Ward - 722 correct

Actual \ Discovered    assist   enhance     Total
assist                    119       160       279
enhance                   344       644       988
Total                     463       804      1267
EM - 763 correct

Figure 8: help - Feature Set C
7 Related Work 
Word-sense disambiguation has more commonly 
been cast as a problem in supervised learning (e.g., 
(Black, 1988), (Yarowsky, 1992), (Yarowsky, 1993), 
(Leacock, Towell, and Voorhees, 1993), (Bruce and 
Wiebe, 1994), (Mooney, 1996), (Ng and Lee, 1996), 
(Pedersen, Bruce, and Wiebe, 1997), (Pedersen and 
Bruce, 1997a)). However, all of these methods require
that manually sense-tagged text be available to train
the algorithm. For most domains such text is not
available and is expensive to create. It seems more
reasonable to assume that such text will not usually
be available and to pursue unsupervised approaches
that rely only on the features in a text that can be
automatically identified.
7.1 Bootstrapping 
Bootstrapping approaches require a small amount 
of disambiguated text in order to initialize the un- 
supervised learning algorithm. An early example of 
such an approach is described in (Hearst, 1991). A 
supervised learning algorithm is trained with a small 
amount of manually sense tagged text and applied 
to a held out test set. Those examples in the test set 
that are most confidently disambiguated are added 
to the training sample. 
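The loop just described can be sketched as a single bootstrapping round. The word-count classifier, the margin-based confidence score, and the toy data below are all assumptions made for illustration; they are not the learner or the data used in (Hearst, 1991).

```python
from collections import Counter

def train(labeled):
    """Count context words per sense; `labeled` is a list of (words, sense)."""
    counts = {}
    for words, sense in labeled:
        counts.setdefault(sense, Counter()).update(words)
    return counts

def classify(counts, words):
    """Return (best_sense, margin), where margin is the gap between
    the two highest-scoring senses (a crude confidence estimate)."""
    scores = sorted(
        ((sum(c[w] for w in words), sense) for sense, c in counts.items()),
        reverse=True,
    )
    best_score, best_sense = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    return best_sense, best_score - runner_up

def bootstrap_round(labeled, unlabeled, min_margin=1):
    """Add confidently classified examples to the training sample."""
    counts = train(labeled)
    grown, remaining = list(labeled), []
    for words in unlabeled:
        sense, margin = classify(counts, words)
        if margin >= min_margin:
            grown.append((words, sense))
        else:
            remaining.append(words)
    return grown, remaining

# Hypothetical toy data: context words around an ambiguous word.
LABELED = [(["river", "water"], "shore"), (["money", "loan"], "finance")]
UNLABELED = [["water", "boat"], ["loan", "interest"], ["walk", "park"]]

grown, remaining = bootstrap_round(LABELED, UNLABELED)
print(len(grown), remaining)
```

In practice the round would be repeated, retraining on the enlarged sample each time, until no further examples clear the confidence threshold.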
A more recent bootstrapping approach is de- 
scribed in (Yarowsky, 1995). This algorithm requires 
a small number of training examples to serve as a 
seed. There are a variety of options discussed for 
automatically selecting seeds; one is to identify col- 
locations that uniquely distinguish between senses. 
For plant, the collocations manufacturing plant and 
living plant make such a distinction. Based on 106 
examples of manufacturing plant and 82 examples of 
living plant this algorithm is able to distinguish be- 
tween two senses of plant for 7,350 examples with 97 
percent accuracy. Experiments with 11 other words 
using collocation seeds result in an average accuracy 
of 96 percent. 
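Collocation seeding of this kind can be sketched as follows. The two collocations come from the plant example above; the sense labels, helper names, and example sentences are hypothetical, and the sketch covers only the initial seeding step, not the full algorithm of (Yarowsky, 1995).

```python
import re

# Seed collocations from the text; the sense labels are hypothetical names.
SEEDS = {
    ("manufacturing", "plant"): "factory",
    ("living", "plant"): "organism",
}

def bigrams(sentence):
    """Return the set of adjacent lowercase word pairs in a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return set(zip(tokens, tokens[1:]))

def seed_sense(sentence):
    """Return a sense if exactly one seed sense matches, else None."""
    found = bigrams(sentence)
    senses = {sense for pair, sense in SEEDS.items() if pair in found}
    return senses.pop() if len(senses) == 1 else None

def split_by_seed(sentences):
    """Partition sentences into seeded (sentence, sense) pairs and the rest."""
    seeded, unlabeled = [], []
    for s in sentences:
        sense = seed_sense(s)
        if sense is None:
            unlabeled.append(s)
        else:
            seeded.append((s, sense))
    return seeded, unlabeled

EXAMPLES = [
    "The manufacturing plant closed last year.",
    "A living plant needs light and water.",
    "The plant was sold to a competitor.",
]
seeded, unlabeled = split_by_seed(EXAMPLES)
print(seeded)
print(unlabeled)
```

Only the first two sentences receive a seed label; the third contains neither collocation and is left for the learning algorithm to resolve.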
While (Yarowsky, 1995) does not discuss distin- 
guishing more than 2 senses of a word, there is no 
immediate reason to doubt that the "one sense per 
collocation" rule (Yarowsky, 1993) would still hold 
for a larger number of senses. In future work we 
will evaluate using the "one sense per collocation" 
rule to seed our various methods. This may help 
in dealing with very skewed distributions of senses 
since we currently select collocations based simply 
on frequency. 
7.2 Clustering 
Clustering has most often been applied in natural
language processing as a method for inducing
syntactically or semantically related groupings of
words (e.g., (Rosenfeld, Huang, and Schneider, 1969),
(Kiss, 1973), (Ritter and Kohonen, 1989), (Pereira,
Tishby, and Lee, 1993), (Schütze, 1993), (Resnik,
1995a)).
An early application of clustering to word-sense
disambiguation is described in (Schütze, 1992).
There, words are represented in terms of the
co-occurrence statistics of four letter sequences.
This representation uses 97 features to characterize
a word, where each feature is a linear combination of
letter four-grams formulated by a singular value
decomposition of a 5000 by 5000 matrix of letter
four-gram co-occurrence frequencies. The weight
associated with each feature reflects all usages of
the word in the sample. A context vector is formed
for each occurrence of an ambiguous word by summing
the vectors of the contextual words (the number of
contextual words considered in the sum is
unspecified). The set of context vectors for the word
to be disambiguated is then clustered, and the
clusters are manually sense tagged.
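The summation step can be illustrated with a small sketch. The three-dimensional word vectors below are made-up toy values standing in for the 97-feature representation, and the choice to skip words without a vector is an assumption of this sketch.

```python
def context_vector(context_words, word_vectors, dim):
    """Sum the vectors of the contextual words, skipping unknown words."""
    total = [0.0] * dim
    for word in context_words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # no vector available for this word
        total = [t + v for t, v in zip(total, vec)]
    return total

# Hypothetical toy word vectors (3 dimensions instead of 97).
WORD_VECTORS = {
    "river": [1.0, 0.0, 2.0],
    "water": [0.5, 1.0, 0.0],
}

vec = context_vector(["river", "water", "unknown"], WORD_VECTORS, dim=3)
print(vec)  # [1.5, 1.0, 2.0]
```

One such vector is produced per occurrence of the ambiguous word, and the resulting set of vectors is what gets clustered.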
The features used in this work are complex and
difficult to interpret, and it is not clear that this
complexity is required. (Yarowsky, 1995) compares his
method to (Schütze, 1992) and shows that for four
words the former performs significantly better in
distinguishing between two senses.
Other clustering approaches to word-sense disam- 
biguation have been based on measures of semantic 
distance defined with respect to a semantic network 
such as WordNet. Measures of semantic distance 
are based on the path length between concepts in a 
network and are used to group semantically similar 
concepts (e.g. (Li, Szpakowicz, and Matwin, 1995)). 
(Resnik, 1995b) provides an information theoretic 
definition of semantic distance based on WordNet. 
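A path-length measure of the kind described first can be sketched with a breadth-first search over a small concept graph. The graph below is a hypothetical toy network, not WordNet, and the sketch illustrates only the path-length idea, not the information theoretic definition of (Resnik, 1995b).

```python
from collections import deque

# Toy links between concepts; a hypothetical stand-in for a semantic network.
EDGES = [
    ("dog", "canine"), ("canine", "carnivore"),
    ("cat", "feline"), ("feline", "carnivore"),
    ("carnivore", "animal"), ("tree", "plant"),
]

def build_graph(edges):
    """Build an undirected adjacency map from a list of links."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def path_length(graph, start, goal):
    """Shortest number of links between two concepts, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

GRAPH = build_graph(EDGES)
print(path_length(GRAPH, "dog", "cat"))   # 4
print(path_length(GRAPH, "dog", "tree"))  # None
```

Concepts separated by fewer links are treated as semantically closer and are therefore candidates for the same group.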
(McDonald et al., 1990) apply another cluster- 
ing approach to word-sense disambiguation (also 
see (Wilks et al., 1990)). They use co-occurrence 
data gathered from the machine-readable version of 
LDOCE to define neighborhoods of related words. 
Conceptually, the neighborhood of a word is a type 
of equivalence class. It is composed of all other words 
that co-occur with the designated word a significant 
number of times in the LDOCE sense definitions. 
These neighborhoods are used to increase the num- 
ber of words in the LDOCE sense definitions, while 
still maintaining some measure of lexical cohesion. 
The "expanded" sense definitions are then compared 
to the context of an ambiguous word, and the sense- 
definition with the greatest number of word over- 
laps with the context is selected as correct. (Guthrie 
et al., 1991) propose that neighborhoods be subject
dependent. They suggest that a word should potentially
have different neighborhoods corresponding to the
different LDOCE subject codes. Subject-specific
neighborhoods are composed of words having at least
one sense marked with that subject code.
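The overlap-based selection used with expanded sense definitions can be sketched as follows. The sense definitions, neighborhoods, and context are hypothetical toy data rather than LDOCE entries, and the sketch does not model subject codes.

```python
def expand(definition_words, neighborhoods):
    """Add the neighborhood of every definition word to the definition."""
    expanded = set(definition_words)
    for word in definition_words:
        expanded |= neighborhoods.get(word, set())
    return expanded

def choose_sense(context_words, sense_definitions, neighborhoods):
    """Pick the sense whose expanded definition shares most words with the context."""
    context = set(context_words)
    overlaps = {
        sense: len(context & expand(words, neighborhoods))
        for sense, words in sense_definitions.items()
    }
    return max(overlaps, key=overlaps.get), overlaps

# Hypothetical toy data for an ambiguous word with two senses.
SENSE_DEFINITIONS = {
    "finance": {"money", "institution"},
    "shore": {"river", "land"},
}
NEIGHBORHOODS = {
    "money": {"loan", "deposit"},
    "river": {"water", "stream"},
}
CONTEXT = ["she", "took", "a", "loan", "and", "made", "a", "deposit"]

best, overlaps = choose_sense(CONTEXT, SENSE_DEFINITIONS, NEIGHBORHOODS)
print(best, overlaps)
```

Without the expansion step neither definition would overlap with this context, which is the motivation for enlarging the definitions with neighborhoods.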
7.3 EM algorithm 
The only other application of the EM algorithm 
to word-sense disambiguation is described in (Gale, 
Church, and Yarowsky, 1995). There the EM algo- 
rithm is used as part of a supervised learning algo- 
rithm to distinguish city names from people's names. 
A narrow window of context, one or two words to 
either side, was found to perform better than wider 
windows. The results presented are preliminary but 
show an accuracy percentage in the mid-nineties 
when applied to Dixon, a name found to be quite 
ambiguous. 
It should be noted that the EM algorithm relates 
to a large body of work in speech processing. The 
Baum-Welch forward-backward algorithm (Baum, 
1972) is a specialized form of the EM algorithm 
that assumes the underlying parametric model is a 
hidden Markov model. The Baum-Welch forward-backward
algorithm has been used extensively in speech
recognition (e.g., (Levinson, Rabiner, and Sondhi,
1983), (Kupiec, 1992), (Jelinek, 1990)).
8 Conclusions 
Supervised learning approaches to word-sense dis- 
ambiguation fall victim to the knowledge acquisi- 
tion bottleneck. The creation of sense tagged text 
sufficient to serve as a training sample is expensive 
and time consuming. This bottleneck is eliminated 
through the use of unsupervised learning approaches 
which distinguish the sense of a word based only on 
features that can be automatically identified. 
In this study, we evaluated the performance of 
three unsupervised learning algorithms on the dis- 
ambiguation of 13 words in naturally occurring text. 
The algorithms are McQuitty's similarity analysis, 
Ward's minimum-variance method, and the EM al- 
gorithm. Our findings show that each of these al- 
gorithms is negatively impacted by highly skewed 
sense distributions. Our methods and feature sets
were found to be more successful in disambiguating
nouns than adjectives or verbs. Overall, the
most successful of our procedures was McQuitty's 
similarity analysis in combination with a high di- 
mensional feature set. In future work, we will inves- 
tigate modifications of these algorithms and feature 
set selection that are more effective on highly skewed 
sense distributions. 
9 Acknowledgments 
This research was supported by the Office of Naval 
Research under grant number N00014-95-1-0776. 

References 
Baum, L. 1972. An inequality and associated max- 
imization technique in statistical estimation for 
probabilistic functions of a Markov process. In 
O. Shisha, editor, Inequalities, volume 3. Aca- 
demic Press, New York, NY, pages 1-8. 
Black, E. 1988. An experiment in computational 
discrimination of English word senses. IBM Jour- 
nal of Research and Development, 32(2):185-194. 
Bruce, R. and J. Wiebe. 1994. Word-sense disambiguation
using decomposable models. In Proceedings of the 32nd
Annual Meeting of the Association for Computational
Linguistics, pages 139-146.
Bruce, R., J. Wiebe, and T. Pedersen. 1996. The 
measure of a model. In Proceedings of the Confer- 
ence on Empirical Methods in Natural Language 
Processing, pages 101-112. 
Dempster, A., N. Laird, and D. Rubin. 1977. Maxi- 
mum likelihood from incomplete data via the EM 
algorithm. Journal of the Royal Statistical Society 
B, 39:1-38. 
Devijver, P. and J. Kittler. 1982. Pattern Classi- 
fication: A Statistical Approach. Prentice Hall, 
Englewood Cliffs, NJ. 
Duda, R. and P. Hart. 1973. Pattern Classification 
and Scene Analysis. Wiley, New York, NY. 
Gale, W., K. Church, and D. Yarowsky. 1992. A 
method for disambiguating word senses in a large 
corpus. Computers and the Humanities, 26:415- 
439. 
Gale, W., K. Church, and D. Yarowsky. 1995.
Discrimination decisions for 100,000 dimensional
spaces. Journal of Operations Research, 55:323-344.
Geman, S. and D. Geman. 1984. Stochastic re- 
laxation, Gibbs distributions and the Bayesian 
restoration of images. IEEE Transactions on Pat- 
tern Analysis and Machine Intelligence, 6:721- 
741. 
Guthrie, J., L. Guthrie, Y. Wilks, and H. Aidine- 
jad. 1991. Subject-dependent co-occurrence and 
word sense disambiguation. In Proceedings of 
the 29th Meeting of the Association for Computa- 
tional Linguistics, pages 146-152, Berkeley, CA, 
June. 
Hearst, M. 1991. Noun homograph disambiguation 
using local context in large text corpora. In Pro- 
ceedings of the 7th Annual Conference of the UW 
Centre for the New OED and Text Research: Us- 
ing Corpora, Oxford. 
Jelinek, F. 1990. Self-organized language model- 
ing for speech recognition. In Waibel and Lee, 
editors, Readings in Speech Recognition. Morgan 
Kaufmann, San Mateo, CA. 
Kiss, G. 1973. Grammatical word classes: A learn- 
ing process and its simulation. Psychology of 
Learning and Motivation, 7:1-41. 
Kupiec, J. 1992. Robust part-of-speech tagging us- 
ing a hidden Markov model. Computer Speech and 
Language, 6:225-243. 
Leacock, C., G. Towell, and E. Voorhees. 1993. 
Corpus-based statistical sense resolution. In Pro- 
ceedings of the ARPA Workshop on Human Lan- 
guage Technology, pages 260-265, March. 
Levinson, S., L. Rabiner, and M. Sondhi. 1983. An 
introduction to the application of the theory of 
probabilistic functions of a Markov process to au- 
tomatic speech recognition. Bell System Technical 
Journal, 62:1035-1074. 
Li, X., S. Szpakowicz, and S. Matwin. 1995. A 
WordNet-based algorithm for word sense disam- 
biguation. In Proceedings of the 14th Interna- 
tional Joint Conference on Artificial Intelligence, 
Montreal, August. 
Marcus, M., B. Santorini, and M. Marcinkiewicz. 
1993. Building a large annotated corpus of En- 
glish: The Penn Treebank. Computational Lin- 
guistics, 19(2):313-330. 
McDonald, J., T. Plate, and R. Schvaneveldt. 1990.
Using pathfinder to extract semantic information 
from text. In R. Schvaneveldt, editor, Pathfinder 
Associative Networks: Studies in Knowledge Or- 
ganization. Ablex, Norwood, NJ. 
McQuitty, L. 1966. Similarity analysis by recipro- 
cal pairs for discrete and continuous data. Edu- 
cational and Psychological Measurement, 26:825- 
831. 
Miller, G. 1995. WordNet: A lexical database. 
Communications of the ACM, 38(11):39-41, 
November. 
Mooney, R. 1996. Comparative experiments on dis- 
ambiguating word senses: An illustration of the 
role of bias in machine learning. In Proceedings of 
the Conference on Empirical Methods in Natural 
Language Processing, pages 82-91, May. 
Ng, H.T. and H.B. Lee. 1996. Integrating multiple
knowledge sources to disambiguate word sense: An
exemplar-based approach. In Proceedings of the 34th
Annual Meeting of the Association for Computational
Linguistics, pages 40-47.
Pedersen, T. and R. Bruce. 1997a. A new super- 
vised learning algorithm for word sense disam- 
biguation. In Proceedings of the Fourteenth Na- 
tional Conference on Artificial Intelligence, Prov- 
idence, RI, July. 
Pedersen, T. and R. Bruce. 1997b. Unsupervised 
text mining. Technical Report 97-CSE-9, South- 
ern Methodist University, June. 
Pedersen, T., R. Bruce, and J. Wiebe. 1997. Se- 
quential model selection for word sense disam- 
biguation. In Proceedings of the Fifth Conference 
on Applied Natural Language Processing, pages 
388-395, Washington, DC, April. 
Pedersen, T., M. Kayaalp, and R. Bruce. 1996. Sig- 
nificant lexical relationships. In Proceedings of the 
Thirteenth National Conference on Artificial In- 
telligence, pages 455-460, Portland, OR, August. 
Pereira, F., N. Tishby, and L. Lee. 1993. Distri- 
butional clustering of English words. In Proceed- 
ings of the 31st Annual Meeting of the Associ- 
ation for Computational Linguistics, pages 183- 
190, Columbus, OH. 
Procter, P., editor. 1978. Longman Dictionary of 
Contemporary English. Longman Group Ltd., Es- 
sex, UK. 
Resnik, P. 1995a. Disambiguating noun groupings 
with respect to WordNet senses. In Proceedings of 
the Third Workshop on Very Large Corpora, MIT, 
June. 
Resnik, P. 1995b. Using information content to eval- 
uate semantic similarity in a taxonomy. In Pro- 
ceedings of the 14th International Joint Confer- 
ence on Artificial Intelligence, Montreal, August. 
Ritter, H. and T. Kohonen. 1989. Self-organizing 
semantic maps. Biological Cybernetics, 62:241- 
254. 
Rosenfeld, A., H. Huang, and V. Schneider. 1969. 
An application of cluster detection to text and 
picture processing. IEEE Transactions on Infor- 
mation Theory, 15:672-681. 
Schütze, H. 1992. Dimensions of meaning. In
Proceedings of Supercomputing '92, pages 787-796,
Minneapolis, MN.
Schütze, H. 1993. Word space. In S. Hanson,
J. Cowan, and C. Giles, editors, Advances in 
Neural Information Processing Systems 5. Morgan 
Kaufmann Publishers. 
Ward, J. 1963. Hierarchical grouping to optimize 
an objective function. Journal of the American 
Statistical Association, 58:236-244. 
Wilks, Y., D. Fuss, C. Guo, J. McDonald, T. Plate, 
and B. Slator. 1990. Providing machine tractable 
dictionary tools. In J. Pustejovsky, editor, The- 
oretical and Computational Issues in Lexical Se- 
mantics. MIT Press, Cambridge, MA. 
Yarowsky, D. 1992. Word-sense disambiguation us- 
ing statistical models of Roget's categories trained 
on large corpora. In Proceedings of the 14th 
International Conference on Computational Lin- 
guistics (COLING-92), pages 454-460, Nantes, 
France, July. 
Yarowsky, D. 1993. One sense per collocation. In 
Proceedings of the ARPA Workshop on Human 
Language Technology, pages 266-271. 
Yarowsky, D. 1995. Unsupervised word sense dis- 
ambiguation rivaling supervised methods. In Pro- 
ceedings of the 33rd Annual Meeting of the Asso- 
ciation for Computational Linguistics, pages 189- 
196, Cambridge, MA. 
Zipf, G. 1935. The Psycho-Biology of Language. 
Houghton Mifflin, Boston, MA. 
