Using Syntactic Dependency as Local Context to Resolve Word 
Sense Ambiguity 
Dekang Lin 
Department of Computer Science 
University of Manitoba 
Winnipeg, Manitoba, Canada R3T 2N2 
lindek@cs.umanitoba.ca 
Abstract 
Most previous corpus-based algorithms dis- 
ambiguate a word with a classifier trained 
from previous usages of the same word. 
Separate classifiers have to be trained for 
different words. We present an algorithm 
that uses the same knowledge sources to 
disambiguate different words. The algo- 
rithm does not require a sense-tagged cor- 
pus and exploits the fact that two different 
words are likely to have similar meanings if 
they occur in identical local contexts. 
1 Introduction 
Given a word, its context and its possible meanings, 
the problem of word sense disambiguation (WSD) is 
to determine the meaning of the word in that con- 
text. WSD is useful in many natural language tasks, 
such as choosing the correct word in machine trans- 
lation and coreference resolution. 
In several recent proposals (Hearst, 1991; Bruce 
and Wiebe, 1994; Leacock, Towwell, and Voorhees, 
1996; Ng and Lee, 1996; Yarowsky, 1992; Yarowsky, 
1994), statistical and machine learning techniques 
were used to extract classifiers from hand-tagged
corpora. Yarowsky (1995) proposed an
unsupervised method that used heuristics to obtain
seed classifications and expanded the results to the
other parts of the corpus, thus avoiding the need to
hand-annotate any examples.
Most previous corpus-based WSD algorithms de- 
termine the meanings of polysemous words by ex- 
ploiting their local contexts. A basic intuition that 
underlies those algorithms is the following: 
(1) Two occurrences of the same word have
identical meanings if they have similar local
contexts.
In other words, most previous corpus-based WSD al- 
gorithms learn to disambiguate a polysemous word 
from previous usages of the same word. This has sev- 
eral undesirable consequences. Firstly, a word must 
occur thousands of times before a good classifier can 
be learned. In Yarowsky's experiment (Yarowsky, 
1995), an average of 3936 examples were used to 
disambiguate between two senses. In Ng and Lee's 
experiment, 192,800 occurrences of 191 words were 
used as training examples. There are thousands of 
polysemous words, e.g., there are 11,562 polysemous 
nouns in WordNet. For every polysemous word to
occur thousands of times, the corpus must contain
billions of words. Secondly, learning to disambiguate
a word from the previous usages of the same
word means that whatever was learned for one word
is not used on other words, which fails to capture
generalizations that hold across words in natural
languages. Thirdly, these algorithms
cannot deal with words for which classifiers
have not been learned.
In this paper, we present a WSD algorithm that 
relies on a different intuition: 
(2) Two different words are likely to have similar 
meanings if they occur in identical local 
contexts. 
Consider the sentence: 
(3) The new facility will employ 500 of the 
existing 600 employees 
The word "facility" has 5 possible meanings in 
WordNet 1.5 (Miller, 1990): (a) installation, (b) 
proficiency/technique, (c) adeptness, (d) readiness, 
(e) toilet/bathroom. To disambiguate the word, we 
consider other words that appeared in an identical 
local context as "facility" in (3). Table 1 is a list 
of words that have also been used as the subject of 
"employ" in a 25-million-word Wall Street Journal 
corpus. The "freq" column gives the number of times
each word was used as the subject of "employ".
Table 1: Subjects of "employ" with highest likelihood ratio

word                freq   log λ
ORG*                  64   50.4
plant                 14   31.0
company               27   28.6
operation              8   23.0
industry               9   14.6
firm                   8   13.5
pirate                 2   12.1
unit                   9   9.32
shift                  3   8.48
postal service         2   7.73
machine                3   6.56
corporation            3   6.47
manufacturer           3   6.21
insurance company      2   6.06
aerospace              2   5.81
memory device          1   5.79
department             3   5.55
foreign office         1   5.41
enterprise             2   5.39
pilot                  2   5.37

* ORG includes all proper names recognized as organizations
The log λ column gives their likelihood ratios (Dunning,
1993). The meaning of "facility" in (3) can
be determined by choosing the one of its 5 senses that
is most similar¹ to the meanings of the words in Table
1. This way, a polysemous word is disambiguated
with past usages of other words. Whether or not it
appears in the corpus is irrelevant.
Our approach offers several advantages: 
• The same knowledge sources are used for all 
words, as opposed to using a separate classifier 
for each individual word. 
• It requires a much smaller corpus, which need not
be sense-tagged.
• It is able to deal with words that are infrequent 
or do not even appear in the corpus. 
• The same mechanism can also be used to infer 
the semantic categories of unknown words. 
The required resources of the algorithm include 
the following: (a) an untagged text corpus, (b) a 
broad-coverage parser, (c) a concept hierarchy, such 
as WordNet (Miller, 1990) or Roget's Thesaurus,
and (d) a similarity measure between concepts. 
In the next section, we introduce our definition of 
local contexts and the database of local contexts. A 
description of the disambiguation algorithm is pre- 
sented in Section 3. Section 4 discusses the evalua- 
tion results. 
2 Local Context 
Psychological experiments show that humans are
able to resolve word sense ambiguities given a narrow
window of surrounding words (Choueka and Lusignan,
1985). Most WSD algorithms take as input
a polysemous word and its local context. Different
systems have different definitions of local contexts.
In (Leacock, Towwell, and Voorhees, 1996), the local
context of a word is an unordered set of words in
the sentence containing the word and the preceding
sentence. In (Ng and Lee, 1996), a local context of a
word consists of an ordered sequence of 6 surrounding
part-of-speech tags, its morphological features,
and a set of collocations.
¹ To be defined in Section 3.1.
In our approach, a local context of a word is de- 
fined in terms of the syntactic dependencies between 
the word and other words in the same sentence. 
A dependency relationship (Hudson, 1984;
Mel'čuk, 1987) is an asymmetric binary relationship
between a word called the head (or governor, parent)
and another word called the modifier (or dependent,
daughter). Dependency grammars represent
sentence structures as a set of dependency relation- 
ships. Normally the dependency relationships form 
a tree that connects all the words in a sentence. An 
example dependency structure is shown in (4). 
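(4) the boy chased a brown dog
    Dependency arcs (modifier → head): spec(the → boy), subj(boy → chased),
    spec(a → dog), adjn(brown → dog), compl(dog → chased)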
The local context of a word W is a triple that 
corresponds to a dependency relationship in which 
W is the head or the modifier: 
(type word position) 
where type is the type of the dependency relationship,
such as subj (subject), adjn (adjunct), compl
(first complement), etc.; word is the word related to
W via the dependency relationship; and position
can either be head or mod. The position indicates
whether word is the head or the modifier in the
dependency relation. Since a word may be involved in several
dependency relationships, each occurrence of a
word may have multiple local contexts.
The local contexts of the two nouns "boy" and 
"dog" in (4) are as follows (the dependency relations 
between nouns and their determiners are ignored): 
(5)
Word   Local Contexts
boy    (subj chase head)
dog    (adjn brown mod) (compl chase head)
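The mapping from dependency relations to local contexts can be sketched in Python as follows. This is only an illustration, not the parser interface used in the paper: the `(type, head, modifier)` tuple layout, the lemmatized word forms, and the function name `local_contexts` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical parser output for "the boy chased a brown dog":
# each relation is (type, head, modifier), with words already lemmatized.
RELATIONS = [
    ("spec", "boy", "the"),
    ("subj", "chase", "boy"),
    ("spec", "dog", "a"),
    ("adjn", "dog", "brown"),
    ("compl", "chase", "dog"),
]

def local_contexts(relations, ignored_types=("spec",)):
    """Map each word to its local contexts (type, other_word, position)."""
    contexts = defaultdict(list)
    for rel_type, head, modifier in relations:
        if rel_type in ignored_types:
            continue  # noun-determiner relations are ignored, as in (5)
        # "position" names the role of the *other* word in the relation.
        contexts[modifier].append((rel_type, head, "head"))
        contexts[head].append((rel_type, modifier, "mod"))
    return dict(contexts)

contexts = local_contexts(RELATIONS)
```

Under these assumptions, `contexts["boy"]` and `contexts["dog"]` contain the triples listed in (5).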
Using a broad-coverage parser to parse a corpus,
we construct a Local Context Database. An entry
in the database is a pair:
(6) (lc, C(lc))
where lc is a local context and C(lc) is a set of (word
frequency likelihood) triples. Each triple specifies
how often word occurred in lc and the likelihood
ratio of lc and word. The likelihood ratio is obtained
by treating word and lc as a bigram and is computed
with the formula in (Dunning, 1993). The database
entry corresponding to Table 1 is as follows:
C(lc) = ((ORG 64 50.4) (plant 14 31.0)
...... (pilot 2 5.37))
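A minimal sketch of how such a database could be assembled is given below. It assumes the input is a list of (local context, word) pairs and uses the standard form of Dunning's statistic, G² = -2 log λ = 2 Σ O ln(O/E), over the 2×2 contingency table of the pair. The function names and the dictionary layout are hypothetical, and the paper does not state whether its tables report this quantity or a rescaled variant.

```python
import math
from collections import Counter, defaultdict

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 = -2 log(lambda) for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    total = 0.0
    for i, j, observed in ((0, 0, k11), (0, 1, k12), (1, 0, k21), (1, 1, k22)):
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = rows[i] * cols[j] / n
            total += observed * math.log(observed / expected)
    return 2.0 * total

def build_database(pairs):
    """pairs: iterable of (lc, word). Returns {lc: [(word, freq, ratio), ...]}."""
    pairs = list(pairs)
    pair_count = Counter(pairs)
    lc_count = Counter(lc for lc, _ in pairs)
    word_count = Counter(word for _, word in pairs)
    n = len(pairs)
    database = defaultdict(list)
    for (lc, word), k11 in pair_count.items():
        k12 = lc_count[lc] - k11      # lc with some other word
        k21 = word_count[word] - k11  # word in some other lc
        k22 = n - k11 - k12 - k21     # neither
        ratio = log_likelihood_ratio(k11, k12, k21, k22)
        database[lc].append((word, k11, ratio))
    for entries in database.values():
        entries.sort(key=lambda entry: entry[2], reverse=True)
    return dict(database)
```

Filtering the entries by a minimum ratio, or keeping only the highest-ranked words per context, could then be applied to the lists this function returns.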
3 The Approach 
The polysemous words in the input text are disam- 
biguated in the following steps: 
Step A. Parse the input text and extract the local contexts
of each word. Let $LC_w$ denote the set of
local contexts of all occurrences of w in the input
text.
Step B. Search the local context database and find
words that appeared in the same local contexts
as w. They are called the selectors of w:
$\mathit{Selectors}_w = \left(\bigcup_{lc \in LC_w} C(lc)\right) - \{w\}$.
Step C. Select a sense s of w that maximizes the
similarity between w and $\mathit{Selectors}_w$.
Step D. The sense s is assigned to all occurrences
of w in the input text. This implements the
"one sense per discourse" heuristic advocated
in (Gale, Church, and Yarowsky, 1992).
Step C needs further explanation. In the next subsection,
we define the similarity between two word
senses (or concepts). We then explain how the similarity
between a word and its selectors is maximized.
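Step B amounts to a set union followed by removal of the target word. The sketch below assumes the database is a dictionary from local contexts to lists of (word, frequency, likelihood ratio) triples; that layout and the name `selectors` are illustrative choices, and the sample entries are short excerpts from Tables 1 and 2.

```python
def selectors(word, word_contexts, database):
    """Union of the words found in each local context of `word`, minus `word`."""
    found = set()
    for lc in word_contexts:
        for selector, _freq, _ratio in database.get(lc, ()):
            found.add(selector)
    found.discard(word)  # a word is never its own selector
    return found

# Excerpts of the entries behind Tables 1 and 2.
DATABASE = {
    ("subj", "employ", "head"): [("ORG", 64, 50.4), ("plant", 14, 31.0), ("pilot", 2, 5.37)],
    ("adjn", "new", "mod"): [("post", 432, 952.9), ("issue", 805, 902.8)],
}
facility_contexts = [("subj", "employ", "head"), ("adjn", "new", "mod")]
facility_selectors = selectors("facility", facility_contexts, DATABASE)
```

Because the target word is removed rather than required, the lookup works the same way whether or not the word itself ever occurred in the corpus.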
3.1 Similarity between Two Concepts 
There have been several proposed measures for similarity
between two concepts (Lee, Kim, and Lee,
1989; Rada et al., 1989; Resnik, 1995b; Wu and
Palmer, 1994). All of those similarity measures
are defined directly by a formula. We use instead 
an information-theoretic definition of similarity that 
can be derived from the following assumptions: 
Assumption 1: The commonality between A and
B is measured by
$I(\mathit{common}(A, B))$
where $\mathit{common}(A, B)$ is a proposition that states the
commonalities between A and B; $I(s)$ is the amount
of information contained in the proposition s.
Assumption 2: The difference between A and B
is measured by
$I(\mathit{describe}(A, B)) - I(\mathit{common}(A, B))$
where $\mathit{describe}(A, B)$ is a proposition that describes
what A and B are.
Assumption 3: The similarity between A and B,
$\mathit{sim}(A, B)$, is a function of their commonality and
differences. That is,
$\mathit{sim}(A, B) = f(I(\mathit{common}(A, B)), I(\mathit{describe}(A, B)))$
The domain of $f(x, y)$ is $\{(x, y) \mid x \ge 0, y > 0, y \ge x\}$.
Assumption 4: Similarity is independent of the
unit used in the information measure.
According to Information Theory (Cover and
Thomas, 1991), $I(s) = -\log_b P(s)$, where $P(s)$ is
the probability of s and b is the unit. When b = 2,
$I(s)$ is the number of bits needed to encode s. Since
$\log_b x = \frac{\log_{b'} x}{\log_{b'} b}$, Assumption 4 means that the function
f must satisfy the following condition:
$\forall c > 0, f(x, y) = f(cx, cy)$
Assumption 5: Similarity is additive with respect
to commonality.
If $\mathit{common}(A, B)$ consists of two independent
parts, then $\mathit{sim}(A, B)$ is the sum of the similarities
computed when each part of the commonality
is considered. In other words: $f(x_1 + x_2, y) = f(x_1, y) + f(x_2, y)$.
A corollary of Assumption 5 is that $\forall y, f(0, y) = f(x + 0, y) - f(x, y) = 0$,
which means that when
there is no commonality between A and B, their
similarity is 0, no matter how different they are.
For example, the similarity between "depth-first 
search" and "leather sofa" is neither higher nor lower 
than the similarity between "rectangle" and "inter- 
est rate". 
Assumption 6: The similarity between a pair of
identical objects is 1.
When A and B are identical, knowing their
commonalities means knowing what they are, i.e.,
$I(\mathit{common}(A, B)) = I(\mathit{describe}(A, B))$. Therefore,
the function f must have the following property:
$\forall x, f(x, x) = 1$.
Assumption 7: The function $f(x, y)$ is continuous.
Similarity Theorem: The similarity between A
and B is measured by the ratio between the amount
of information needed to state the commonality of A
and B and the information needed to fully describe
what A and B are:
$$\mathit{sim}(A, B) = \frac{\log P(\mathit{common}(A, B))}{\log P(\mathit{describe}(A, B))}$$
Proof: To prove the theorem, we need to show
$f(x, y) = \frac{x}{y}$. Since $f(x, y) = f(\frac{x}{y}, 1)$ (due to Assumption 4),
we only need to show that when $\frac{x}{y}$ is a
rational number, $f(x, y) = \frac{x}{y}$. The result can be generalized
to all real numbers because f is continuous
and for any real number, there are rational numbers
that are arbitrarily close to it.
Suppose m and n are positive integers.
$f(nx, y) = f((n - 1)x, y) + f(x, y) = n f(x, y)$
(due to Assumption 5). Thus, $f(x, y) = \frac{1}{n} f(nx, y)$.
Substituting $\frac{x}{n}$ for x in this equation:
$f(\frac{x}{n}, y) = \frac{1}{n} f(x, y)$
Since $\frac{x}{y}$ is rational, there exist m and n such that
$\frac{x}{y} = \frac{n}{m}$. Therefore,
$f(x, y) = f(\frac{n}{m} y, y) = n f(\frac{y}{m}, y) = \frac{n}{m} f(y, y) = \frac{n}{m} = \frac{x}{y}$
Q.E.D.
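As a sanity check on the direction of the theorem that is easy to verify, the following sketch confirms with exact rational arithmetic that f(x, y) = x/y satisfies Assumptions 4, 5 and 6 and the zero-commonality corollary at a few sample points. The sample values are arbitrary; the check illustrates the assumptions and does not replace the proof.

```python
from fractions import Fraction

def f(x, y):
    """Candidate similarity function from the theorem: f(x, y) = x / y."""
    return Fraction(x) / Fraction(y)

# Arbitrary sample points inside the domain (0 <= x <= y, y > 0).
x1, x2, y, c = Fraction(1, 3), Fraction(1, 4), Fraction(2), Fraction(7, 5)

scale_invariant = f(c * x1, c * y) == f(x1, y)        # Assumption 4
additive = f(x1 + x2, y) == f(x1, y) + f(x2, y)       # Assumption 5
zero_commonality = f(0, y) == 0                       # corollary of Assumption 5
identity = f(y, y) == 1                               # Assumption 6
```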
For example, Figure 1 is a fragment of WordNet.
The nodes are concepts (or synsets, as they are
called in WordNet). The links represent IS-A
relationships. The number attached to a node C is 
the probability P(C) that a randomly selected noun 
refers to an instance of C. The probabilities are 
estimated by the frequency of concepts in SemCor 
(Miller et al., 1994), a sense-tagged subset of the 
Brown corpus. 
If x is a Hill and y is a Coast, the commonality
between x and y is that "x is a GeoForm and y
is a GeoForm". The information contained in this
statement is $-2 \times \log P(\mathit{GeoForm})$. The similarity
between the concepts Hill and Coast is:
$$\mathit{sim}(\mathit{Hill}, \mathit{Coast}) = \frac{2 \times \log P(\mathit{GeoForm})}{\log P(\mathit{Hill}) + \log P(\mathit{Coast})} = 0.59$$

entity 0.395
  inanimate-object 0.167
    natural-object 0.0163
      geological-formation 0.00176
        natural-elevation 0.000113
          hill 0.0000189
        shore 0.0000836
          coast 0.0000216
Figure 1: A fragment of WordNet (indentation denotes IS-A links)
Generally speaking,
(7) $$\mathit{sim}(C, C') = \frac{2 \times \log P(\bigcap_i C_i)}{\log P(C) + \log P(C')}$$
where $P(\bigcap_i C_i)$ is the probability that an object
belongs to all the maximally specific superclasses
($C_i$'s) of both C and C'.
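The Hill/Coast computation can be reproduced with a few lines of Python. The sketch below hard-codes the IS-A links and probabilities as read from Figure 1 and handles only the single-parent case, where the maximally specific common superclass is the lowest common ancestor; the general intersection in (7) is not covered. The names `PARENT`, `PROB`, `ancestors` and `sim` are illustrative.

```python
import math

# IS-A links and node probabilities as read from Figure 1.
PARENT = {
    "hill": "natural-elevation",
    "coast": "shore",
    "natural-elevation": "geological-formation",
    "shore": "geological-formation",
    "geological-formation": "natural-object",
    "natural-object": "inanimate-object",
    "inanimate-object": "entity",
}
PROB = {
    "entity": 0.395,
    "inanimate-object": 0.167,
    "natural-object": 0.0163,
    "geological-formation": 0.00176,
    "natural-elevation": 0.000113,
    "shore": 0.0000836,
    "hill": 0.0000189,
    "coast": 0.0000216,
}

def ancestors(concept):
    """The concept itself followed by its superclasses, most specific first."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def sim(c1, c2):
    """Equation (7) for a tree-shaped hierarchy (single common superclass)."""
    seen = set(ancestors(c1))
    common = next((a for a in ancestors(c2) if a in seen), None)
    if common is None:
        return 0.0  # no shared superclass: no commonality
    return 2 * math.log(PROB[common]) / (math.log(PROB[c1]) + math.log(PROB[c2]))

hill_coast = sim("hill", "coast")
```

Rounded to two decimals, `hill_coast` agrees with the 0.59 reported in the text, and a concept compared with itself scores 1.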
3.2 Disambiguation by Maximizing 
Similarity 
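We now provide the details of Step C in our algorithm.
The input to this step consists of a polysemous
word $W_0$ and its selectors $\{W_1, W_2, \ldots, W_k\}$.
The word $W_i$ has $n_i$ senses: $\{s_{i1}, \ldots, s_{in_i}\}$.
Step C.1: Construct a similarity matrix (8). The
rows and columns represent word senses. The
matrix is divided into (k + 1) × (k + 1) blocks.
The blocks on the diagonal are all 0s. The elements
in block $S_{ij}$ are the similarity measures
between the senses of $W_i$ and the senses of $W_j$.
Similarity measures lower than a threshold $\theta$ are
considered to be noise and are ignored. In our
experiments, $\theta = 0.2$ was used.
$$S_{ij}(l, m) = \begin{cases} \mathit{sim}(s_{il}, s_{jm}) & \text{if } i \ne j \text{ and } \mathit{sim}(s_{il}, s_{jm}) \ge \theta \\ 0 & \text{otherwise} \end{cases}$$
(8)
$$\begin{array}{c|cccc}
 & s_{01} \cdots s_{0n_0} & s_{11} \cdots s_{1n_1} & \cdots & s_{k1} \cdots s_{kn_k} \\ \hline
s_{01} \cdots s_{0n_0} & 0 & S_{01} & \cdots & S_{0k} \\
s_{11} \cdots s_{1n_1} & S_{10} & 0 & \cdots & S_{1k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
s_{k1} \cdots s_{kn_k} & S_{k0} & S_{k1} & \cdots & 0
\end{array}$$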
Step C.2: Let A be the set of polysemous words in $\{W_0, \ldots, W_k\}$:
$A = \{W_i \mid n_i > 1\}$
Step C.3: Find a sense of words in A that gets the
highest total support from other words. Call
this sense $s_{i_{max} l_{max}}$:
$$s_{i_{max} l_{max}} = \arg\max_{s_{il}} \sum_{j=0}^{k} \mathit{support}(s_{il}, W_j)$$
where $s_{il}$ is a word sense such that $W_i \in A$ and
$l \in [1, n_i]$, and $\mathit{support}(s_{il}, W_j)$ is the support
$s_{il}$ gets from $W_j$:
$$\mathit{support}(s_{il}, W_j) = \max_{m \in [1, n_j]} S_{ij}(l, m)$$
Step C.4: The sense of $W_{i_{max}}$ is chosen to be
$s_{i_{max} l_{max}}$. Remove $W_{i_{max}}$ from A:
$A \leftarrow A - \{W_{i_{max}}\}$
Step C.5: Modify the similarity matrix to remove
the similarity values between other senses of
$W_{i_{max}}$ and senses of other words. For all l, j,
m such that $l \in [1, n_{i_{max}}]$ and $l \ne l_{max}$ and
$j \ne i_{max}$ and $m \in [1, n_j]$:
$S_{i_{max} j}(l, m) \leftarrow 0$
Step C.6: Repeat from Step C.3 unless $i_{max} = 0$.
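Steps C.1 to C.6 can be sketched as a greedy loop. The code below is an illustrative reading of the steps, not the original implementation. It assumes each word is given as a list of candidate senses with the target word first, that `sim` is any similarity function over senses, and that the matrix is symmetric, so Step C.5 zeroes both orientations of each removed entry. The toy similarity function and the sense labels in the usage lines are invented for the example.

```python
def disambiguate(senses, sim, theta=0.2):
    """senses[i] lists the candidate senses of W_i; W_0 is the target word.
    Returns {word index: chosen sense} for the words resolved before stopping."""
    k = len(senses)
    # Step C.1: thresholded similarities between senses of different words.
    S = {}
    for i in range(k):
        for j in range(k):
            if i == j:
                continue  # diagonal blocks are all zeros
            for l, a in enumerate(senses[i]):
                for m, b in enumerate(senses[j]):
                    value = sim(a, b)
                    S[i, l, j, m] = value if value >= theta else 0.0

    def support(i, l, j):
        return max((S[i, l, j, m] for m in range(len(senses[j]))), default=0.0)

    def total_support(candidate):
        i, l = candidate
        return sum(support(i, l, j) for j in range(k) if j != i)

    A = {i for i in range(k) if len(senses[i]) > 1}  # Step C.2
    chosen = {}
    while A:
        # Step C.3: the sense with the highest total support.
        candidates = [(i, l) for i in sorted(A) for l in range(len(senses[i]))]
        i_max, l_max = max(candidates, key=total_support)
        chosen[i_max] = senses[i_max][l_max]  # Step C.4
        A.remove(i_max)
        # Step C.5: the rejected senses of W_imax no longer give or get support.
        for l in range(len(senses[i_max])):
            if l == l_max:
                continue
            for j in range(k):
                if j == i_max:
                    continue
                for m in range(len(senses[j])):
                    S[i_max, l, j, m] = 0.0
                    S[j, m, i_max, l] = 0.0
        if i_max == 0:  # Step C.6: stop once the target word is resolved
            break
    return chosen

# Toy usage: senses are "word/category" labels; same category means similar.
def toy_sim(a, b):
    return 0.8 if a.split("/")[1] == b.split("/")[1] else 0.1

result = disambiguate(
    [["facility/state", "facility/artifact"], ["plant/artifact"], ["company/group"]],
    toy_sim,
)
```

In the toy run only the target word is polysemous, so the loop stops after one iteration with the artifact sense chosen. The loop also ends when A becomes empty, which covers a target word that has only one sense.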
3.3 Walk Through Examples 
Let's consider again the word "facility" in (3). It
has two local contexts: subject of "employ" (subj
employ head) and modifiee of "new" (adjn new
mod). Table 1 lists words that appeared in the first
local context. Table 2 lists words that appeared in
the second local context. Only the 20 words with the
highest likelihood ratios in each context were used in our experiments.
The two groups of words are merged and used as
the selectors of "facility". The word "facility" has
5 senses in WordNet.
Table 2: Modifiees of "new" with the highest likelihood ratios

word          freq   log λ
post           432   952.9
issue          805   902.8
product        675   888.6
rule           459   875.8
law            356   541.5
technology     237   382.7
generation     150   323.2
model          207   319.3
job            260   269.2
system         318   251.8
bonds          223   245.4
capital        178   241.8
order          228   236.5
version        158   223.7
position       236   207.3
high           152   201.2
contract       279   198.1
bill           208   194.9
venture        123   193.7
program        283   183.8

1. something created to provide a particular ser- 
vice; 
2. proficiency, technique; 
3. adeptness, deftness, quickness; 
4. readiness, effortlessness; 
5. toilet, lavatory. 
Senses 1 and 5 are subclasses of artifact. Senses 2
and 3 are kinds of state. Sense 4 is a kind of abstraction.
Many of the selectors in Tables 1 and 2
have artifact senses, such as "post", "product",
"system", "unit", "memory device", "machine",
"plant", "model", "program", etc. Therefore,
Senses 1 and 5 of "facility" received much
more support, 5.37 and 2.42 respectively, than the other
senses. Sense 1 is selected.
Consider another example that involves an un- 
known proper name: 
(9) DreamLand employed 20 programmers. 
We treat an unknown proper noun as a polysemous
word that could refer to a person, an organization,
or a location. Since "DreamLand" is the subject of
"employed", its meaning is determined by maximizing
the similarity between one of {person, organization,
location} and the words in Table 1. Since Table
1 contains many "organization" words, the support
for the "organization" sense is much higher than that for the
others.
4 Evaluation 
We used a subset of the SemCor (Miller et al., 1994) 
to evaluate our algorithm. 
4.1 Evaluation Criteria 
General-purpose lexical resources, such as Word- 
Net, Longman Dictionary of Contemporary English 
(LDOCE), and Roget's Thesaurus, strive to achieve 
completeness. They often make subtle distinctions 
between word senses. As a result, when the WSD 
task is defined as choosing a sense out of a list of 
senses in a general-purpose lexical resource, even hu- 
mans may frequently disagree with one another on 
what the correct sense should be. 
The subtle distinctions between different word
senses are often unnecessary. Therefore, we relaxed
the correctness criterion. A selected sense $s_{answer}$
is correct if it is "similar enough" to the sense tag
$s_{key}$ in SemCor. We experimented with three
interpretations of "similar enough". The strictest
interpretation is $\mathit{sim}(s_{answer}, s_{key}) = 1$, which is true
only when $s_{answer} = s_{key}$. The most relaxed
interpretation is $\mathit{sim}(s_{answer}, s_{key}) > 0$, which is true if
$s_{answer}$ and $s_{key}$ are descendants of the same
top-level concept in WordNet (e.g., entity, group,
location, etc.). A compromise between these two is
$\mathit{sim}(s_{answer}, s_{key}) \ge 0.27$, where 0.27 is the average
similarity of 50,000 randomly generated pairs (w, w')
in which w and w' belong to the same Roget's category.
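The three criteria differ only in the threshold applied to the similarity between the selected sense and the key sense. A minimal sketch, in which the function name and the criterion labels are hypothetical:

```python
def is_correct(similarity, criterion):
    """Judge an answer from sim(s_answer, s_key) under one of three criteria."""
    if criterion == "strict":
        return similarity == 1.0    # s_answer must equal s_key
    if criterion == "compromise":
        return similarity >= 0.27   # average same-Roget-category similarity
    if criterion == "relaxed":
        return similarity > 0.0     # same top-level WordNet concept
    raise ValueError(f"unknown criterion: {criterion}")
```

For instance, an answer whose similarity to the key is 0.59 counts as correct under the compromise and relaxed criteria but not under the strict one, while a similarity of 0.26 passes only the relaxed criterion.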
We use three words, "duty", "interest" and "line",
as examples to provide a rough idea of what
$\mathit{sim}(s_{answer}, s_{key}) \ge 0.27$ means.
The word "duty" has three senses in WordNet 1.5. 
The similarities between the three senses are all below
0.27, although the similarity between Senses 1 (responsibility)
and 2 (assignment, chore) is very close
(0.26) to the threshold.
The word "interest" has 8 senses. Senses 1 (sake,
benefit) and 7 (interestingness) are merged.² Senses
3 (fixed charge for borrowing money), 4 (a right or
legal share of something), and 5 (financial interest
in something) are merged. The word "interest" is
reduced to a 5-way ambiguous word. The other
three senses are 2 (curiosity), 6 (interest group) and
8 (pastime, hobby).
² The similarities between senses of the same word are
computed during scoring. We do not actually change the
WordNet hierarchy.
The word "line" has 27 senses. The similarity
threshold 0.27 reduces the number of senses to 14.
The reduced senses are
• Senses 1, 5, 17 and 24: something that is communicated
between people or groups.
1: a mark that is long relative to its width
5: a linear string of words expressing some
idea
17: a mark indicating positions or bounds of
the playing area
24: as in "drop me a line when you get there"
• Senses 2, 3, 9, 14, 18: group 
2: a formation of people or things beside one 
another 
3: a formation of people or things one after 
another 
9: a connected series of events or actions or 
developments 
14: the descendants of one individual 
18: common carrier 
• Sense 4: a single frequency (or very narrow 
band) of radiation in a spectrum 
• Senses 6 and 25: cognitive process 
6: line of reasoning 
25: a conceptual separation or demarcation 
• Senses 7, 15, and 26: instrumentation 
7: electrical cable 
15: telephone line 
26: assembly line 
• Senses 8 and 10: shape 
8: a length (straight or curved) without 
breadth or thickness 
10: wrinkle, furrow, crease, crinkle, seam, line 
• Senses 11 and 16: any road or path affording 
passage from one place to another; 
11: pipeline 
16: railway 
• Sense 12: location, a spatial location defined by 
a real or imaginary unidimensional extent; 
• Senses 13 and 27: human action 
13: acting in conformity 
27: occupation, line of work; 
• Sense 19: something long and thin and flexible 
• Sense 20: product line, line of products 
• Sense 21: space for one line of print (one col- 
umn wide and 1/14 inch deep) used to measure 
advertising 
• Sense 22: credit line, line of credit 
• Sense 23: a succession of notes forming a distinctive
sequence
where each group is a reduced sense and the numbers 
are original WordNet sense numbers. 
4.2 Results 
We used a 25-million-word Wall Street Journal corpus
(part of the LDC/DCI³ CD-ROM) to construct the
local context database. The text was parsed in
126 hours on a SPARC-Ultra 1/140 with 96MB 
of memory. We then extracted from the parse 
trees 8,665,362 dependency relationships in which 
the head or the modifier is a noun. We then fil- 
tered out (lc, word) pairs with a likelihood ratio 
lower than 5 (an arbitrary threshold). The resulting 
database contains 354,670 local contexts with a to- 
tal of 1,067,451 words in them (Table 1 is counted 
as one local context with 20 words in it). 
Since the local context database is constructed
from the WSJ corpus, which consists mostly of business news,
we only used the "press reportage" part of SemCor,
which consists of 7 files with about 2000 words
each. Furthermore, we only applied our algorithm
to nouns. Table 3 shows the results on 2,832 polysemous
nouns in SemCor. This number also includes
proper nouns that do not contain simple markers
(e.g., Mr., Inc.) to indicate their category. Such a
proper noun is treated as a 3-way ambiguous word:
person, organization, or location. We also show
as a baseline the performance of the simple strategy
of always choosing the first sense of a word in
WordNet. Since the WordNet senses are ordered according
to their frequency in SemCor, choosing the
first sense is roughly the same as choosing the sense
with the highest prior probability, except that we are
not using all the files in SemCor.
It can be seen from Table 3 that our algorithm
performed slightly worse than the baseline when
the strictest correctness criterion is used. However,
when the condition is relaxed, its performance gain
is much larger than that of the baseline. This means that
when the algorithm makes mistakes, the mistakes
tend to be close to the correct answer.
5 Discussion 
5.1 Related Work 
Step C in Section 3.2 is similar to Resnik's noun
group disambiguation (Resnik, 1995a), although he
did not address the question of how noun
groups are created.
The earlier work on WSD that is most similar to 
ours is (Li, Szpakowicz, and Matwin, 1995). They 
proposed a set of heuristic rules that are based on 
the idea that objects of the same or similar verbs are 
similar. 
³ http://www.ldc.upenn.edu/
5.2 Weak Contexts 
Our algorithm treats all local contexts equally in 
its decision-making. However, some local contexts 
hardly provide any constraint on the meaning of a 
word. For example, the object of "get" can be practically
anything. Contexts of this type should be
filtered out or discounted in decision-making.
5.3 Idiomatic Usages 
Our assumption that similar words appear in identical
contexts does not always hold. For example,
(10) ... the condition in which the heart beats 
between 150 and 200 beats a minute 
The most frequent subjects of "beat" (according to 
our local context database) are the following: 
(11) PER, badge, bidder, bunch, challenger,
democrat, Dewey, grass, mummification, pimp,
police, return, semi, and soldier.
where PER refers to proper names recognized as per- 
sons. None of these is similar to the "body part" 
meaning of "heart". In fact, "heart" is the only body 
part that beats. 
6 Conclusion 
We have presented a new algorithm for word sense
disambiguation. Unlike most previous corpus-based
WSD algorithms, in which separate classifiers are
trained for different words, we use the same local
context database and a concept hierarchy as
the knowledge sources for disambiguating all words.
This allows our algorithm to deal with infrequent 
words or unknown proper nouns. 
Unnecessarily subtle distinctions between word
senses are a well-known problem for evaluating WSD
algorithms with general-purpose lexical resources.
Our use of a similarity measure to relax the correctness
criterion provides a possible solution to this
problem.
Acknowledgement 
This research has also been partially supported by 
NSERC Research Grant 0GP121338 and by the In- 
stitute for Robotics and Intelligent Systems. 

References 
Bruce, Rebecca and Janyce Wiebe. 1994. Word-sense
disambiguation using decomposable models.
In Proceedings of the 32nd Annual Meeting of the
Association for Computational Linguistics, pages
139-145, Las Cruces, New Mexico.
Choueka, Y. and S. Lusignan. 1985. Disambiguation
by short contexts. Computers and the Humanities,
19:147-157.
Cover, Thomas M. and Joy A. Thomas. 1991. El- 
ements of information theory. Wiley series in 
telecommunications. Wiley, New York. 
Dunning, Ted. 1993. Accurate methods for the 
statistics of surprise and coincidence. Computa- 
tional Linguistics, 19(1):61-74, March. 
Gale, W., K. Church, and D. Yarowsky. 1992. A
method for disambiguating word senses in a large
corpus. Computers and the Humanities, 26:415-439.
Hearst, Marti. 1991. Noun homograph disambiguation
using local context in large text corpora. In
Conference on Research and Development in Information
Retrieval ACM/SIGIR, pages 36-47,
Pittsburgh, PA.
Hudson, Richard. 1984. Word Grammar. Basil 
Blackwell Publishers Limited., Oxford, England. 
Leacock, Claudia, Geoffrey Towwell, and Ellen M.
Voorhees. 1996. Towards building contextual representations
of word senses using statistical models.
In Corpus Processing for Lexical Acquisition.
The MIT Press, chapter 6, pages 97-113.
Lee, Joon Ho, Myoung Ho Kim, and Yoon Joon Lee. 
1989. Information retrieval based on conceptual 
distance in is-a hierarchies. Journal of Documen- 
tation, 49(2):188-207, June. 
Li, Xiaobin, Stan Szpakowicz, and Stan Matwin. 
1995. A wordnet-based algorithm for word sense 
disambiguation. In Proceedings of IJCAI-95, 
pages 1368-1374, Montreal, Canada, August. 
Mel'čuk, Igor A. 1987. Dependency syntax: theory
and practice. State University of New York Press,
Albany.
Miller, George A. 1990. WordNet: An on-line lexi- 
cal database. International Journal of Lexicogra- 
phy, 3(4):235-312. 
Miller, George A., Martin Chodorow, Shari Landes,
Claudia Leacock, and Robert G. Thomas. 1994.
Using a semantic concordance for sense identification.
In Proceedings of the ARPA Human Language
Technology Workshop.
Ng, Hwee Tou and Hian Beng Lee. 1996. Integrating
multiple knowledge sources to disambiguate
word sense: An exemplar-based approach. In Proceedings
of 34th Annual Meeting of the Association
for Computational Linguistics, pages 40-47,
Santa Cruz, California.
Rada, Roy, Hafedh Mili, Ellen Bicknell, and Maria
Blettner. 1989. Development and application
of a metric on semantic nets. IEEE Transactions
on Systems, Man, and Cybernetics, 19(1):17-30,
February.
Resnik, Philip. 1995a. Disambiguating noun group- 
ings with respect to wordnet senses. In Third 
Workshop on Very Large Corpora. Association for 
Computational Linguistics. 
Resnik, Philip. 1995b. Using information content 
to evaluate semantic similarity in a taxonomy. 
In Proceedings of IJCAI-95, pages 448-453, Mon- 
treal, Canada, August. 
Wu, Zhibiao and Martha Palmer. 1994. Verb semantics
and lexical selection. In Proceedings of
the 32nd Annual Meeting of the Association for
Computational Linguistics, pages 133-138, Las
Cruces, New Mexico.
Yarowsky, David. 1992. Word-sense disambigua- 
tion using statistical models of Roget's cate- 
gories trained on large corpora. In Proceedings 
of COLING-92, Nantes, France. 
Yarowsky, David. 1994. Decision lists for lexical ambiguity
resolution: Application to accent restoration
in Spanish and French. In Proceedings of 32nd
Annual Meeting of the Association for Computational
Linguistics, pages 88-95, Las Cruces, NM,
June.
Yarowsky, David. 1995. Unsupervised word sense 
disambiguation rivaling supervised methods. In 
Proceedings of 33rd Annual Meeting of the Asso- 
ciation for Computational Linguistics, pages 189- 
196, Cambridge, Massachusetts, June. 
