Class-based Collocations for Word-Sense Disambiguation
Tom O’Hara
Department of Computer Science
New Mexico State University
Las Cruces, NM 88003-8001
tomohara@cs.nmsu.edu
Rebecca Bruce
Department of Computer Science
University of North Carolina at Asheville
Asheville, NC 28804-3299
bruce@cs.unca.edu
Jeﬀ Donner
Department of Computer Science
New Mexico State University
Las Cruces, NM 88003-8001
jdonner@cs.nmsu.edu
Janyce Wiebe
Department of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260-4034
wiebe@cs.pitt.edu
Abstract
This paper describes the NMSU-Pitt-UNCA
word-sense disambiguation system participat-
ing in the Senseval-3 English lexical sample
task. The focus of the work is on using seman-
tic class-based collocations to augment tradi-
tional word-based collocations. Three separate
sources of word relatedness are used for these
collocations: 1) WordNet hypernym relations;
2) cluster-based word similarity classes; and 3)
dictionary definition analysis.
1 Introduction
Supervised systems for word-sense disambigua-
tion (WSD) often rely upon word collocations
(i.e., sense-specific keywords) to provide clues
on the most likely sense for a word given the
context. In the second Senseval competition,
these features figured predominantly among the
feature sets for the leading systems (Mihalcea,
2002; Yarowsky et al., 2001; Seo et al., 2001).
A limitation of such features is that the words
selected must occur in the test data in order for
the features to apply. To alleviate this problem,
class-based approaches augment word-level fea-
tures with category-level ones (Ide and V´eronis,
1998; Jurafsky and Martin, 2000). When ap-
plied to collocational features, this approach ef-
fectively uses class labels rather than wordforms
in deriving the collocational features.
This research focuses on the determination
of class-based collocations to improve word-
sense disambiguation. We do not address refine-
ment of existing algorithms for machine learn-
ing. Therefore, a commonly used decision tree
algorithm is employed to combine the various
features when performing classification.
This paper describes the NMSU-Pitt-
UNCA system we developed for the third
Senseval competition. Section 2 presents an
overview of the feature set used in the system.
Section 3 describes how the class-based colloca-
tions are derived. Section 4 shows the results
over the Senseval-3 data and includes detailed
analysis of the performance of the various col-
locational features.
2SystemOverview
We use a decision tree algorithm for word-sense
disambiguation that combines features from the
local context of the target word with other lex-
ical features representing the broader context.
Figure 1 presents the features that are used
in this application. In the first Senseval com-
petition, we used the first two groups of fea-
tures, Local-context features and Collocational
features, with competitive results (O’Hara et al.,
2000).
Five of the local-context features represent
the part of speech (POS) of words immediately
surrounding the target word. These five fea-
tures are POS±i for i from -2 to +2), where
POS+1, for example, represents the POS of the
word immediately following the target word.
Five other local-context features represent
the word tokens immediately surrounding the
target word (Word±i for i from −2to+2).
Each Word±i feature is multi-valued; its values
correspond to all possible word tokens.
There is a collocation feature WordColl
s
de-
fined for each sense s of the target word. It
is a binary feature, representing the absence or
presence of any word in a set specifically chosen
for s.Awordw that occurs more than once in
the training data is included in the collocation
set for sense s if the relative percent gain in the
conditional probability over the prior probabil-
                                             Association for Computational Linguistics
                        for the Semantic Analysis of Text, Barcelona, Spain, July 2004
                 SENSEVAL-3: Third International Workshop on the Evaluation of Systems
Local-context features
POS: part-of-speech of target word
POS±i: part-of-speech of word at oﬀset i
WordForm: target wordform
Word±i: stem of word at oﬀset i
Collocational features
WordColl
s
: word collocation for sense s
WordColl
∗
wordform of non-sense-specific
collocation (enumerated)
Class-based collocational features
HyperColl
s
: hypernym collocation for s
HyperColl
∗,i
:non-sense-specific hypernym collo-
cation
SimilarColl
s
: similarity collocation for s
DictColl
s
: dictionary collocation for s
Figure 1: Features for word-sense disambigua-
tion. All collocational features are binary indi-
cators for sense s, except for WordColl
∗
.
ity is 20% or higher:
(P(s|w) −P(s))
P(s)
≥ 0.20.
This threshold was determined to be eﬀective
via an optimization search over the Senseval-2
data. WordColl
∗
represents a set of non-sense-
specific collocations (i.e., not necessarily indica-
tive of any one sense), chosen via the G
2
criteria
(Wiebe et al., 1998). In contrast to WordColl
s
,
each of which is a separate binary feature, the
words contained in the set WordColl
∗
serve as
values in a single enumerated feature.
These features are augmented with class-
based collocational features that represent in-
formation about word relationships derived
from three separate sources: 1) WordNet
(Miller, 1990) hypernym relations (HyperColl);
2) cluster-based word similarity classes (Simi-
larColl); and 3) relatedness inferred from dictio-
nary definition analysis (DictColl). The infor-
mation inherent in the sources from which these
class-based features are derived allows words
that do not occur in the training data context
to be considered as collocations during classifi-
cation.
3 Class-based Collocations
The HyperColl features are intended to capture
a portion of the information in the WordNet hy-
pernyms links (i.e., is-a relations). Hypernym-
based collocations are formulated by replacing
each word in the context of the target word (e.g.,
in the same sentence as the target word) with
its complete hypernym ancestry from WordNet.
Since context words are not sense-tagged, each
synset representing a diﬀerent sense of a context
word is included in the set of hypernyms replac-
ing that word. Likewise, in the case of multiple
inheritance, each parent synset is included.
The collocation variable HyperColl
s
for each
sense s is binary, corresponding to the absence
or presence of any hypernym in the set chosen
for s. This set of hypernyms is chosen using the
ratio of conditional probability to prior prob-
ability as described for the WordColl
s
feature
above. In contrast, HyperColl
∗,i
selects non-
sense-specific hypernym collocations: 10 sepa-
rate binary features are used based on the G
2
selection criteria. (More of these features could
be used, but they are limited for tractability.)
For more details on hypernym collocations, see
(O’Hara, forthcoming).
Word-similarity classes (Lin, 1998) derived
from clustering are also used to expand the
pool of potential collocations; this type of se-
mantic relatedness among words is expressed in
the SimilarColl feature. For the DictColl fea-
tures, definition analysis (O’Hara, forthcoming)
is used to determine the semantic relatedness of
the defining words. Diﬀerences between these
two sources of word relations are illustrated by
looking at the information they provide for ‘bal-
lerina’:
word-clusters:
dancer:0.115 baryshnikov:0.072
pianist:0.056 choreographer:0.049
... [18 other words]
nicole:0.041 wrestler:0.040
tibetans:0.040 clown:0.040
definition words:
dancer:0.0013 female:0.0013 ballet:0.0004
This shows that word clusters capture a wider
range of relatedness than the dictionary def-
initions at the expense of incidental associa-
tions (e.g., ‘nicole’). Again, because context
words are not disambiguated, the relations for
all senses of a context word are conflated. For
details on the extraction of word clusters, see
(Lin, 1998); and, for details on the definition
analysis, see (O’Hara, forthcoming).
When formulating the features SimilarColl
and DictColl, the words related to each con-
text word are considered as potential colloca-
tions (Wiebe et al., 1998). Co-occurrence fre-
Sense Distinctions Precision Recall
Fine-grained .566 .565
Course-grained .660 .658
Table 1: Results for Senseval-3 test data.
99.72% of the answers were attempted. All fea-
tures from Figure 1 were used.
quencies f(s,w) are used in estimating the con-
ditional probability P(s|w) required by the rel-
ative conditional probability selection scheme
noted earlier. However, instead of using a unit
weight for each co-occurrence, the relatedness
weight is used (e.g., 0.056 for ‘pianist’); and,
because a given related-word might occur with
more than one context word for the same target-
word sense, the relatedness weights are added.
The conditional probability of the sense given
the relatedness collocation is estimated by di-
viding the weighted frequency by the sum of all
such weighted co-occurrence frequencies for the
word:
P(s|w)≈
wf (s,w)
summationtext
s
prime wf (s
prime
,w)
Here wf(s, w) stands for the weighted co-
occurrence frequency of the related-word collo-
cation w and target sense s.
The relatedness collocations are less reliable
than word collocations given the level of indi-
rection involved in their extraction. Therefore,
tighter constraints are used in order to filter out
extraneous potential collocations. In particular,
the relative percent gain in the conditional ver-
sus prior probability must be 80% or higher, a
threshold again determined via an optimization
search over the Senseval-2 data. In addition,
the context words that they are related to must
occur more than four times in the training data.
4 Results and Discussion
Disambiguation is performed via a decision tree
formulated using Weka’s J4.8 classifier (Witten
and Frank, 1999). For the system used in the
competition, the decision tree was learned over
the entire Senseval-3 training data and then ap-
plied to the test data. Table 1 shows the results
of our system in the Senseval-3 competition.
Table 2 shows the results of 10-fold cross-
validation just over the Senseval-3 training data
(using Naive Bayes rather than decision trees.)
To illustrate the contribution of the three types
Experiment Precision
−Local +Local
Local - .593
WordColl .490 .599
HyperColl .525 .590
DictColl .532 .570
SimilarColl .534 .586
HyperColl+WordColl .525 .611
DictColl+WordColl .501 .606
SimilarColl+WordColl .518 .596
All Collocations .543 .608
#Words: 57 Avg. Entropy: 1.641
Avg. #Senses: 5.3 Baseline: 0.544
Table 2: Results for Senseval-3 training data.
All values are averages, except #Words,which
is the number of distinct word types classified.
Baseline always uses the most-frequent sense.
of class-based collocations, the table shows re-
sults separately for systems developed using a
single feature type, as well as for all features in
combination. In addition, the performance of
these systems are shown with and without the
use of the local features (Local), as well as with
and without the use of standard word colloca-
tions (WordColl). As can be seen, the related-
word and definition collocations perform better
than hypernym collocations when used alone.
However, hypernym collocations perform bet-
ter when combined with other features. Fu-
ture work will investigate ways of ameliorat-
ing such interactions. The best overall system
(HyperColl+WordColl+Local)usesthecom-
bination of local-context features, word colloca-
tions, and hypernym collocations. The perfor-
mance of this system compared to a more typi-
cal system for WSD (WordColl+Local)issta-
tistically significant at p<.05, using a paired
t-test.
We analyzed the contributions of the various
collocation types to determine their eﬀective-
ness. Table 3 shows performance statistics for
each collocation type taken individually over the
training data. Precision is based on the num-
ber of correct positive indicators versus the to-
tal number of positive indicators, whereas recall
is the number correct over the total number of
training instances (7706). This shows that hy-
pernym collocations are nearly as eﬀective as
word collocations. We also analyzed the occur-
rence of unique positive indicators provided by
the collocation types over the training data. Ta-
Total Total
Feature #Corr. #Pos. Recall Prec.
DictColl 273 592 .035 .461
HyperColl 2932 6479 .380 .453
SimilarColl 528 1535 .069 .344
WordColl 3707 7718 .481 .480
Table 3: Collocation performance statistics.
Total #Pos. is number of positive indicators for
the collocation in the training data, and Total
#Corr. is the number of these that are correct.
Unique Unique
Feature #Corr. #Pos. Prec.
DictColl 110 181 .608
HyperColl 992 1795 .553
SimilarColl 198 464 .427
DictColl 1244 2085 .597
Table 4: Analysis of unique positive indicators.
Unique #Pos. is number of training instances
with the feature as the only positive indicator,
and Unique #Corr. is number of these correct.
ble 4 shows how often each feature type is pos-
itive for a particular sense when all other fea-
tures for the sense are negative. This occurs
fairly often, suggesting that the diﬀerent types
of collocations are complementary and thus gen-
erally useful when combined for word-sense dis-
ambiguation. Both tables illustrate coverage
problems for the definition and related word
collocations, which will be addressed in future
work.

References
Nancy Ide and Jean V´eronis. 1998. Introduc-
tion to the special issue on word sense dis-
ambiguation: the state of the art. Computa-
tional Linguistics, 24(1):1–40.
Daniel Jurafsky and James H. Martin. 2000.
Speech and Language Processing.Prentice
Hall, Upper Saddle River, New Jersey.
Dekang Lin. 1998. Automatic retrieval
and clustering of similar words. In Proc.
COLING-ACL 98, pages 768–764, Montreal.
August 10-14.
Rada Mihalcea. 2002. Instance based learning
with automatic feature selection applied to
word sense disambiguation. In Proceedings of
the 19th International Conference on Com-
putational Linguistics (COLING 2002),Tai-
wan. August 26-30.
George Miller. 1990. Introduction. Interna-
tional Journal of Lexicography, 3(4): Special
Issue on WordNet.
Tom O’Hara, Janyce Wiebe, and Rebecca F.
Bruce. 2000. Selecting decomposable models
for word-sense disambiguation: The grling-
sdm system. Computers and the Humanities,
34(1-2):159–164.
Thomas P. O’Hara. forthcoming. Empirical ac-
quisition of conceptual distinctions via dictio-
nary definitions. Ph.D. thesis, Department of
Computer Science, New Mexico State Univer-
sity.
Hee-Cheol Seo, Sang-Zoo Lee, Hae-Chang
Rim, and Ho Lee. 2001. KUNLP sys-
tem using classification information model at
SENSEVAL-2. In Proceedings of the Second
International Workshop on Evaluating Word
Sense Disambiguation Systems (SENSEVAL-
2), pages 147–150, Toulouse. July 5-6.
Janyce Wiebe, Kenneth McKeever, and Re-
becca F. Bruce. 1998. Mapping collocational
properties into machine learning features. In
Proc. 6th Workshop on Very Large Corpora
(WVLC-98), pages 225–233, Montreal, Que-
bec, Canada. Association for Computational
Linguistics. SIGDAT.
Ian H. Witten and Eibe Frank. 1999. Data
Mining: Practical Machine Learning Tools
and Techniques with Java Implementations.
Morgan Kaufmann, San Francisco, CA.
David Yarowsky, Silviu Cucerzan, Radu Flo-
rian, Charles Schafer, and Richard Wicen-
towski. 2001. The Johns Hopkins SENSE-
VAL2 system descriptions. In Proceedings of
the Second International Workshop on Eval-
uating Word Sense Disambiguation Systems
(SENSEVAL-2), pages 163–166, Toulouse.
July 5-6.
