An Empirical Evaluation of Knowledge Sources and
Learning Algorithms for Word Sense Disambiguation
Yoong Keok Lee and Hwee Tou Ng
Department of Computer Science
School of Computing
National University of Singapore
3 Science Drive 2, Singapore 117543
{leeyoong, nght}@comp.nus.edu.sg
Abstract
In this paper, we evaluate a vari-
ety of knowledge sources and super-
vised learning algorithms for word sense
disambiguation on SENSEVAL-2 and
SENSEVAL-1 data. Our knowledge
sources include the part-of-speech of
neighboring words, single words in the
surrounding context, local collocations,
and syntactic relations. The learning al-
gorithms evaluated include Support Vec-
tor Machines (SVM), Naive Bayes, Ad-
aBoost, and decision tree algorithms. We
present empirical results showing the rela-
tive contribution of the component knowl-
edge sources and the different learning
algorithms. In particular, using all of
these knowledge sources and SVM (i.e.,
a single learning algorithm) achieves ac-
curacy higher than the best official scores
on both SENSEVAL-2 and SENSEVAL-1
test data.
1 Introduction
Natural language is inherently ambiguous. A word
can have multiple meanings (or senses). Given an
occurrence of a word $w$ in a natural language text,
the task of word sense disambiguation (WSD) is to
determine the correct sense of $w$ in that context.
WSD is a fundamental problem of natural language
processing. For example, effective WSD is crucial
for high quality machine translation.
One could envisage building a WSD system us-
ing handcrafted rules or knowledge obtained from
linguists. Such an approach would be highly labor-intensive,
with questionable scalability. Another approach uses a
dictionary or thesaurus to perform WSD.
In this paper, we focus on a corpus-based, supervised
learning approach. In this approach, to disambiguate
a word $w$, we first collect training texts in
which instances of $w$ occur. Each occurrence of $w$
is manually tagged with the correct sense. We then
train a WSD classifier based on these sample texts,
such that the trained classifier is able to assign the
sense of $w$ in a new context.
Two WSD evaluation exercises, SENSEVAL-1
(Kilgarriff and Palmer, 2000) and SENSEVAL-2
(Edmonds and Cotton, 2001), were conducted in
1998 and 2001, respectively. The lexical sample
task in these two SENSEVALs focuses on evalu-
ating WSD systems in disambiguating a subset of
nouns, verbs, and adjectives, for which manually
sense-tagged training data have been collected.
In this paper, we conduct a systematic evaluation
of the various knowledge sources and supervised
learning algorithms on the English lexical sample
data sets of both SENSEVALs.
2 Related Work
There is a large body of prior research on WSD. Due
to space constraints, we will only highlight prior re-
search efforts that have investigated (1) contribution
of various knowledge sources, or (2) relative perfor-
mance of different learning algorithms.
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 41-48. Association for Computational Linguistics.
Early research efforts on comparing different
learning algorithms (Mooney, 1996; Pedersen and
Bruce, 1997) tend to base their comparison on only
one word or at most a dozen words. Ng (1997) com-
pared two learning algorithms, k-nearest neighbor
and Naive Bayes, on the DSO corpus (191 words).
Escudero et al. (2000) evaluated k-nearest neighbor,
Naive Bayes, Winnow-based, and LazyBoosting al-
gorithms on the DSO corpus. The recent work of
Pedersen (2001a) and Zavrel et al. (2000) evaluated
a variety of learning algorithms on the SENSEVAL-
1 data set. However, all of these research efforts con-
centrate only on evaluating different learning algo-
rithms, without systematically considering their in-
teraction with knowledge sources.
Ng and Lee (1996) reported the relative contribu-
tion of different knowledge sources, but on only one
word “interest”. Stevenson and Wilks (2001) inves-
tigated the interaction of knowledge sources, such as
part-of-speech, dictionary definition, subject codes,
etc. on WSD. However, they do not evaluate their
method on a common benchmark data set, and there
is no exploration on the interaction of knowledge
sources with different learning algorithms.
Participating systems at SENSEVAL-1 and
SENSEVAL-2 tend to report accuracy using a par-
ticular set of knowledge sources and some partic-
ular learning algorithm, without investigating the
effect of varying knowledge sources and learning
algorithms. In SENSEVAL-2, the various Duluth
systems (Pedersen, 2001b) attempted to investigate
whether features or learning algorithms are more im-
portant. However, relative contribution of knowl-
edge sources was not reported and only two main
types of algorithms (Naive Bayes and decision tree)
were tested.
In contrast, in this paper, we systematically vary
both knowledge sources and learning algorithms,
and investigate the interaction between them. We
also base our evaluation on both SENSEVAL-2 and
SENSEVAL-1 official test data sets, and compare
with the official scores of participating systems.
3 Knowledge Sources
To disambiguate a word occurrence $w$, we consider
four knowledge sources listed below. Each training
(or test) context of $w$ generates one training (or test)
feature vector.
3.1 Part-of-Speech (POS) of Neighboring
Words
We use 7 features to encode this knowledge source:
$P_{-3}, P_{-2}, P_{-1}, P_0, P_1, P_2, P_3$, where $P_{-i}$ ($P_i$) is the
POS of the $i$th token to the left (right) of $w$, and $P_0$
is the POS of $w$. A token can be a word or a punctuation
symbol, and each of these neighboring tokens
must be in the same sentence as $w$. We use a
sentence segmentation program (Reynar and Ratnaparkhi,
1997) and a POS tagger (Ratnaparkhi, 1996)
to segment the tokens surrounding $w$ into sentences
and assign POS tags to these tokens.
For example, to disambiguate the word
bars in the POS-tagged sentence “Reid/NNP
saw/VBD me/PRP looking/VBG at/IN the/DT
iron/NN bars/NNS ./.”, the POS feature vector is
<IN, DT, NN, NNS, ., ε, ε>, where ε denotes
the POS tag of a null token.
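The POS feature extraction described above can be sketched as follows. This is a minimal illustration, assuming the sentence has already been segmented and tagged; the function name `pos_features` and the string used for the null token are hypothetical choices, not part of the original system.

```python
NULL = "ε"  # stand-in for the POS tag of a null token (spelling is an assumption)

def pos_features(tagged_sentence, target_index, window=3):
    """Return [P_-3, ..., P_0, ..., P_3] for the token at target_index.

    tagged_sentence holds (token, pos_tag) pairs for ONE sentence, so any
    position outside the list is a null token and never crosses a boundary.
    """
    features = []
    for offset in range(-window, window + 1):
        position = target_index + offset
        if 0 <= position < len(tagged_sentence):
            features.append(tagged_sentence[position][1])
        else:
            features.append(NULL)
    return features

sentence = [("Reid", "NNP"), ("saw", "VBD"), ("me", "PRP"), ("looking", "VBG"),
            ("at", "IN"), ("the", "DT"), ("iron", "NN"), ("bars", "NNS"), (".", ".")]
bars_pos = pos_features(sentence, 7)  # index 7 is the target word "bars"
```

Because the input is restricted to a single sentence, padding with the null tag handles the sentence-boundary constraint without extra logic.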
3.2 Single Words in the Surrounding Context
For this knowledge source, we consider all single
words (unigrams) in the surrounding context of $w$,
and these words can be in a different sentence from
$w$. For each training or test example, the SENSEVAL
data sets provide up to a few sentences as the
surrounding context. In the results reported in this
paper, we consider all words in the provided context.
Specifically, all tokens in the surrounding context
of $w$ are converted to lower case and replaced by
their morphological root forms. Tokens present in
a list of stop words or tokens that do not contain
at least one alphabetic character (such as numbers and
punctuation symbols) are removed. All remaining
tokens from all training contexts provided for $w$ are
gathered. Each remaining token $t$ contributes one
feature. In a training (or test) example, the feature
corresponding to $t$ is set to 1 iff the context of $w$ in
that training (or test) example contains $t$.
We attempted a simple feature selection method
to investigate whether a learning algorithm performs better
with or without feature selection. The feature selection
method employed has one parameter: $M_2$. A
feature $t$ is selected if $t$ occurs with some sense of $w$
at least $M_2$ times in the training data. This parameter
is also used by Ng and Lee (1996). We have
tried $M_2 = 3$ and $M_2 = 0$ (i.e., no feature selection)
in the results reported in this paper.
For example, if $w$ is the word bars and the set
of selected unigrams is {chocolate, iron, beer}, the
feature vector for the sentence “Reid saw me looking
at the iron bars .” is <0, 1, 0>.
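A minimal sketch of the surrounding-word features and the $M_2$ selection rule follows. Several things here are assumptions made for illustration: the toy stop list, the omission of the morphological root step, and the choice to count a token once per training context. The function names are hypothetical.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "at", "me", "a", "of"}  # toy stop list (assumption)

def content_tokens(tokens):
    """Lower-case tokens; drop stop words and tokens with no letter.
    Reduction to morphological roots is omitted in this sketch."""
    kept = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS or not any(ch.isalpha() for ch in token):
            continue
        kept.append(token)
    return kept

def select_unigrams(training_examples, m2):
    """Keep token t if it occurs with some sense in at least m2 contexts."""
    counts = defaultdict(Counter)  # sense -> token -> number of contexts
    for tokens, sense in training_examples:
        counts[sense].update(set(content_tokens(tokens)))
    selected = set()
    for per_sense in counts.values():
        selected.update(t for t, c in per_sense.items() if c >= m2)
    return sorted(selected)

def unigram_vector(tokens, selected):
    """Binary vector: 1 iff the selected token appears in the context."""
    present = set(content_tokens(tokens))
    return [1 if t in present else 0 for t in selected]

context = "Reid saw me looking at the iron bars .".split()
vector = unigram_vector(context, ["chocolate", "iron", "beer"])

# Hypothetical sense-tagged contexts, used only to exercise the selection rule.
toy_training = [(["iron", "bars"], "s1"), (["iron", "gate"], "s1"),
                (["chocolate", "bars"], "s2")]
kept_m2_2 = select_unigrams(toy_training, 2)
kept_m2_0 = select_unigrams(toy_training, 0)
```

With `m2 = 0` every token seen in training is kept, which corresponds to running without feature selection.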
3.3 Local Collocations
A local collocation $C_{i,j}$ refers to the ordered sequence
of tokens in the local, narrow context of $w$.
Offsets $i$ and $j$ denote the starting and ending position
(relative to $w$) of the sequence, where a negative
(positive) offset refers to a token to its left
(right). For example, let $w$ be the word bars in
the sentence “Reid saw me looking at the iron bars
.” Then $C_{-2,-1}$ is the iron and $C_{-1,2}$ is iron . ε,
where ε denotes a null token. Like POS, a collocation
does not cross a sentence boundary. To represent
this knowledge source of local collocations, we extracted
11 features corresponding to the following
collocations: $C_{-1,-1}$, $C_{1,1}$, $C_{-2,-2}$, $C_{2,2}$, $C_{-2,-1}$,
$C_{-1,1}$, $C_{1,2}$, $C_{-3,-1}$, $C_{-2,1}$, $C_{-1,2}$, and $C_{1,3}$. This
set of 11 features is the union of the collocation features
used in Ng and Lee (1996) and Ng (1997).
To extract the feature values of the collocation
feature $C_{i,j}$, we first collect all possible collocation
strings (converted into lower case) corresponding to
$C_{i,j}$ in all training contexts of $w$. Unlike the case for
surrounding words, we do not remove stop words,
numbers, or punctuation symbols. Each collocation
string is a possible feature value. Feature value selection
using $M_2$, analogous to that used to select
surrounding words, can be optionally applied. If a
training (or test) context of $w$ has collocation $c$, and
$c$ is a selected feature value, then the $C_{i,j}$ feature of
$w$ has value $c$. Otherwise, it has the value ∅, denoting
the null string.
Note that each collocation $C_{i,j}$ is represented by
one feature that can have many possible feature values
(the local collocation strings), whereas each distinct
surrounding word is represented by one feature
that takes binary values (indicating presence or absence
of that word). For example, if $w$ is the word
bars and the set of selected collocations for
$C_{-2,-1}$ is {a chocolate, the wine, the iron}, then
the feature value for collocation $C_{-2,-1}$ in the sentence
“Reid saw me looking at the iron bars .” is
the iron.
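The collocation features can be sketched along the following lines. This is an illustrative reading of the description above, not the authors' code: the function names, the strings chosen for the null token and null string, and the dictionary layout of `selected` are all assumptions.

```python
NULL = "ε"   # null token (position outside the sentence)
EMPTY = "∅"  # null string (collocation not among the selected values)

# The 11 (i, j) offset pairs listed in the text.
OFFSETS = [(-1, -1), (1, 1), (-2, -2), (2, 2), (-2, -1), (-1, 1),
           (1, 2), (-3, -1), (-2, 1), (-1, 2), (1, 3)]

def collocation(tokens, target_index, i, j):
    """Return C_{i,j}: lower-cased tokens at offsets i..j, skipping the target."""
    parts = []
    for offset in range(i, j + 1):
        if offset == 0:
            continue  # the target word itself is not part of the string
        position = target_index + offset
        if 0 <= position < len(tokens):
            parts.append(tokens[position].lower())
        else:
            parts.append(NULL)
    return " ".join(parts)

def collocation_features(tokens, target_index, selected=None):
    """One nominal feature per offset pair. `selected` maps (i, j) to the
    set of selected strings; None means no feature value selection."""
    values = []
    for i, j in OFFSETS:
        c = collocation(tokens, target_index, i, j)
        if selected is not None and c not in selected.get((i, j), set()):
            c = EMPTY
        values.append(c)
    return values

tokens = "Reid saw me looking at the iron bars .".split()
selected = {(-2, -1): {"a chocolate", "the wine", "the iron"}}
features = collocation_features(tokens, 7, selected)
```

Each feature here is a single nominal value, in contrast to the binary surrounding-word features.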
1(a) attention (noun)
1(b) He turned his attention to the workbench .
1(c) <turned, VBD, active, left>
2(a) turned (verb)
2(b) He turned his attention to the workbench .
2(c) <he, attention, PRP, NN, VBD, active>
3(a) green (adj)
3(b) The modern tram is a green machine .
3(c) <machine, NN>
Table 1: Examples of syntactic relations (assuming
no feature selection)
3.4 Syntactic Relations
We first parse the sentence containing $w$ with a statistical
parser (Charniak, 2000). The constituent tree
structure generated by Charniak’s parser is then con-
verted into a dependency tree in which every word
points to a parent headword. For example, in the
sentence “Reid saw me looking at the iron bars .”,
the word Reid points to the parent headword saw.
Similarly, the word me also points to the parent
headword saw.
We use different types of syntactic relations, depending
on the POS of $w$. If $w$ is a noun, we use
four features: its parent headword $h$, the POS of $h$,
the voice of $h$ (active, passive, or ∅ if $h$ is not a verb),
and the relative position of $h$ from $w$ (whether $h$ is
to the left or right of $w$). If $w$ is a verb, we use six
features: the nearest word $l$ to the left of $w$ such that
$w$ is the parent headword of $l$, the nearest word $r$ to
the right of $w$ such that $w$ is the parent headword of
$r$, the POS of $l$, the POS of $r$, the POS of $w$, and
the voice of $w$. If $w$ is an adjective, we use two features:
its parent headword $h$ and the POS of $h$. We
also investigated the effect of feature selection on
syntactic-relation features that are words (i.e., POS,
voice, and relative position are excluded).
Some examples are shown in Table 1. Each POS
(noun, verb, or adjective) is illustrated by one example.
For each example, (a) shows $w$ and its POS; (b)
shows the sentence where $w$ occurs; and (c) shows
the feature vector corresponding to syntactic relations.
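As a rough sketch, the noun and verb features can be read off a dependency tree stored as a list of head indices. The tree, the POS tags, and the voice labels below are supplied by hand for the sentence in Table 1; how voice is detected and what value a missing dependent receives are not specified in the text, so both are assumptions here, as are the function names.

```python
EMPTY = "∅"  # value used when a feature is not applicable (assumption)

def noun_relation_features(tokens, pos_tags, heads, voices, w):
    """<parent headword h, POS of h, voice of h, position of h relative to w>."""
    h = heads[w]
    is_verb = pos_tags[h].startswith("VB")
    voice = voices.get(h, EMPTY) if is_verb else EMPTY
    side = "left" if h < w else "right"
    return [tokens[h].lower(), pos_tags[h], voice, side]

def verb_relation_features(tokens, pos_tags, heads, voices, w):
    """<l, r, POS of l, POS of r, POS of w, voice of w>, where l (r) is the
    nearest dependent of w on its left (right)."""
    left = [k for k in range(w) if heads[k] == w]
    right = [k for k in range(w + 1, len(tokens)) if heads[k] == w]
    l = left[-1] if left else None
    r = right[0] if right else None

    def word(k):
        return tokens[k].lower() if k is not None else EMPTY

    def pos(k):
        return pos_tags[k] if k is not None else EMPTY

    return [word(l), word(r), pos(l), pos(r), pos_tags[w], voices.get(w, EMPTY)]

tokens = ["He", "turned", "his", "attention", "to", "the", "workbench", "."]
pos_tags = ["PRP", "VBD", "PRP$", "NN", "TO", "DT", "NN", "."]
heads = [1, None, 3, 1, 1, 6, 4, 1]  # hand-built tree; None marks the root
voices = {1: "active"}               # hand-labeled voice of the verb

noun_features = noun_relation_features(tokens, pos_tags, heads, voices, 3)
verb_features = verb_relation_features(tokens, pos_tags, heads, voices, 1)
```

The two results correspond to rows 1(c) and 2(c) of Table 1.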
4 Learning Algorithms
We evaluated four supervised learning algorithms:
Support Vector Machines (SVM), AdaBoost with
decision stumps (AdB), Naive Bayes (NB), and de-
cision trees (DT). All the experimental results re-
ported in this paper are obtained using the imple-
mentation of these algorithms in WEKA (Witten and
Frank, 2000). All learning parameters use the de-
fault values in WEKA unless otherwise stated.
4.1 Support Vector Machines
The SVM (Vapnik, 1995) performs optimization to
find a hyperplane with the largest margin that separates
training examples into two classes. A test
example is classified depending on the side of the
hyperplane it lies on. Input features can be mapped
into a high dimensional space before performing the
optimization and classification. A kernel function
(linear by default) can be used to reduce the computational
cost of training and testing in the high dimensional
space. If the training examples are nonseparable,
a regularization parameter $C$ ($= 1$ by default)
can be used to control the trade-off between
achieving a large margin and a low training error.
In WEKA’s implementation of SVM, each nominal
feature with $n$ possible values is converted into $n$
binary (0 or 1) features. If a nominal feature takes
the $i$th feature value, then the $i$th binary feature is
set to 1 and all the other binary features are set to 0.
We tried higher order polynomial kernels, but they
gave poorer results. Our reported results in this pa-
per used the linear kernel.
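The nominal-to-binary conversion described above amounts to a one-hot encoding. The sketch below illustrates the idea only; it is not WEKA's code, and the function name and the toy value list are assumptions.

```python
def one_hot(value, possible_values):
    """Map a nominal value to n binary features, one per possible value."""
    return [1 if value == v else 0 for v in possible_values]

def encode_example(values, value_lists):
    """Concatenate the one-hot encodings of all nominal features."""
    encoded = []
    for value, possible in zip(values, value_lists):
        encoded.extend(one_hot(value, possible))
    return encoded

# Hypothetical value sets for one collocation feature and one POS feature.
colloc_values = ["a chocolate", "the wine", "the iron"]
pos_values = ["NN", "NNS", "VBD"]
encoded = encode_example(["the iron", "NNS"], [colloc_values, pos_values])
```

A collocation feature with many distinct strings therefore expands into many binary inputs, which is why the linear kernel operates on a large, sparse vector.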
4.2 AdaBoost
AdaBoost (Freund and Schapire, 1996) is a method
of training an ensemble of weak learners such that
the performance of the whole ensemble is higher
than that of its constituents. The basic idea of boosting is
to give more weight to misclassified training examples,
forcing the new classifier to concentrate on
these hard-to-classify examples. A test example is
classified by a weighted vote of all trained classi-
fiers. We use the decision stump (decision tree with
only the root node) as the weak learner in AdaBoost.
WEKA implements AdaBoost.M1. We used 100 it-
erations in AdaBoost as it gives higher accuracy than
the default number of iterations in WEKA (10).
4.3 Naive Bayes
The Naive Bayes classifier (Duda and Hart, 1973)
assumes the features are independent given the class.
During classification, it chooses the class with the
highest posterior probability. The default setting
uses Laplace (“add one”) smoothing.
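For illustration, a Naive Bayes classifier over nominal features with add-one smoothing might look like the sketch below. It is not WEKA's implementation; the class name, the unsmoothed class prior, and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSketch:
    """Naive Bayes for nominal features with Laplace (add-one) smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # Distinct values seen in training for each feature.
        self.values = [set(row[k] for row in rows) for k in range(n_features)]
        self.counts = defaultdict(int)  # (class, feature index, value) -> count
        for row, label in zip(rows, labels):
            for k, v in enumerate(row):
                self.counts[(label, k, v)] += 1
        return self

    def log_posterior(self, row, label):
        """Unnormalized log P(label | row) under the independence assumption."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for k, v in enumerate(row):
            numerator = self.counts.get((label, k, v), 0) + 1
            denominator = self.class_counts[label] + len(self.values[k])
            score += math.log(numerator / denominator)
        return score

    def predict(self, row):
        return max(self.classes, key=lambda c: self.log_posterior(row, c))

# Hypothetical training data: [POS of headword, C_{-2,-1}] -> sense label.
rows = [["NN", "the iron"], ["NN", "the iron"], ["JJ", "a chocolate"]]
labels = ["metal", "metal", "candy"]
model = NaiveBayesSketch().fit(rows, labels)
```

Smoothing keeps a feature value that was never seen with a given sense from forcing that sense's probability to zero.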
4.4 Decision Trees
The decision tree algorithm (Quinlan, 1993) parti-
tions the training examples using the feature with the
highest information gain. It repeats this process re-
cursively for each partition until all examples in each
partition belong to one class. A test example is clas-
sified by traversing the learned decision tree. WEKA
implements Quinlan’s C4.5 decision tree algorithm,
with pruning by default.
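The splitting criterion mentioned above can be sketched as follows. The code computes plain information gain as described in the text; C4.5 itself normalizes this quantity into a gain ratio, which is not shown. The function names and toy data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by partitioning on one nominal feature."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical examples: feature 0 separates the senses, feature 1 does not.
rows = [["NN", "x"], ["NN", "y"], ["VB", "x"], ["VB", "y"]]
labels = ["a", "a", "b", "b"]
gain_0 = information_gain(rows, labels, 0)
gain_1 = information_gain(rows, labels, 1)
```

The tree learner would split on feature 0 first here, since it yields pure partitions.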
5 Evaluation Data Sets
In the SENSEVAL-2 English lexical sample task,
participating systems are required to disambiguate
73 words that have their POS predetermined. There
are 8,611 training instances and 4,328 test instances
tagged with WORDNET senses. Our evaluation is
based on all the official training and test data of
SENSEVAL-2.
For SENSEVAL-1, we used the 36 trainable
words for our evaluation. There are 13,845 training
instances (see footnote 1) for these trainable words, and 7,446
test instances. For SENSEVAL-1, 4 trainable words
belong to the indeterminate category, i.e., the POS is
not provided. For these words, we first used a POS
tagger (Ratnaparkhi, 1996) to determine the correct
POS.
For a word $w$ that may occur in phrasal word
form (e.g., the verb “turn” and the phrasal form
“turn down”), we train a separate classifier for each
phrasal word form. During testing, if $w$ appears in
a phrasal word form, the classifier for that phrasal
word form is used. Otherwise, the classifier for $w$ is
used.
6 Empirical Results
We ran the different learning algorithms using various
knowledge sources. Table 2 (Table 3) shows
the accuracy figures for the different combinations
of knowledge sources and learning algorithms for
the SENSEVAL-2 (SENSEVAL-1) data set. The
nine columns correspond to: (i) using only POS
of neighboring words; (ii) using only single words
in the surrounding context with feature selection
($M_2 = 3$); (iii) same as (ii) but without feature selection
($M_2 = 0$); (iv) using only local collocations
with feature selection ($M_2 = 3$); (v) same as (iv) but
without feature selection ($M_2 = 0$); (vi) using only
syntactic relations with feature selection on words
($M_2 = 3$); (vii) same as (vi) but without feature selection
($M_2 = 0$); (viii) combining all four knowledge
sources with feature selection; (ix) combining
all four knowledge sources without feature selection.
1 We included 718 training instances from the HECTOR dictionary
used in SENSEVAL-1, together with 13,127 training instances
from the training corpus supplied.
Algorithm            POS    Surrounding     Collocations    Syntactic       Combined
                            Words                           Relations
                     (i)    (ii)    (iii)   (iv)    (v)     (vi)    (vii)   (viii)  (ix)
                            M2=3    M2=0    M2=3    M2=0    M2=3    M2=0
SVM
- 1-per-class        54.7   51.6    57.7    52.8    60.5    49.1    54.5    61.5    65.4
AdB
- normal             53.0   51.9    52.5    52.5    53.2    52.4    51.2    54.6    53.6
- 1-per-class        55.9   53.9    55.4    55.7    59.3    53.5    52.4    62.4    62.8
NB
- normal             58.0   55.8    52.5    54.5    39.5    54.1    54.0    61.6    53.4
- 1-per-class        57.6   56.2    51.5    55.8    37.9    54.0    54.2    62.7    52.7
DT
- normal             55.3   50.9    49.1    57.2    52.4    54.2    53.7    56.8    52.6
- 1-per-class        54.9   49.7    48.1    54.3    51.3    52.7    51.5    52.2    50.0
Combined: (viii) = i+ii+iv+vi, (ix) = i+iii+v+vii
Table 2: Contribution of knowledge sources on SENSEVAL-2 data set (micro-averaged recall on all words)
Algorithm            POS    Surrounding     Collocations    Syntactic       Combined
                            Words                           Relations
                     (i)    (ii)    (iii)   (iv)    (v)     (vi)    (vii)   (viii)  (ix)
                            M2=3    M2=0    M2=3    M2=0    M2=3    M2=0
SVM
- 1-per-class        70.3   65.5    70.3    69.5    74.0    65.1    69.8    76.3    79.2
AdB
- normal             67.2   63.5    64.4    64.2    65.2    65.7    65.6    68.2    68.4
- 1-per-class        71.6   67.0    68.9    69.7    71.2    69.4    68.3    77.7    78.0
NB
- normal             71.5   66.6    63.5    69.1    53.9    69.4    69.6    75.7    67.2
- 1-per-class        71.6   67.3    64.1    70.3    53.0    69.8    70.4    76.3    68.2
DT
- normal             69.2   66.2    65.0    70.2    67.9    68.9    68.6    73.4    70.2
- 1-per-class        68.7   66.6    65.4    67.0    64.4    67.6    64.8    71.4    67.8
Combined: (viii) = i+ii+iv+vi, (ix) = i+iii+v+vii
Table 3: Contribution of knowledge sources on SENSEVAL-1 data set (micro-averaged recall on all words)
POS    SVM    AdB    NB     DT     S1     S2     S3
noun   68.8   69.2   66.4   60.0   68.2   69.5   66.8
verb   61.1   56.1   56.6   51.8   56.6   56.3   57.6
adj    68.0   64.3   68.4   63.8   73.2   68.8   66.8
all    65.4   62.8   62.7   57.2   64.2   63.8   62.9
(a) SENSEVAL-2 data set
POS    SVM    AdB    NB     DT     s1     s2     s3
noun   85.2   84.9   82.3   81.3   84.9   80.6   80.8
verb   77.0   74.4   73.3   69.5   70.5   70.9   68.7
adj    75.8   74.6   74.5   70.9   76.1   74.3   73.5
indet  76.9   76.8   74.3   70.2   77.6   76.9   76.6
all    79.2   78.0   76.3   73.4   77.1   75.5   74.6
(b) SENSEVAL-1 data set
Table 4: Best micro-averaged recall accuracies for
each algorithm evaluated and official scores of the
top 3 participating systems of SENSEVAL-2 and
SENSEVAL-1
       SVM            AdB            NB             DT
POS    S1   S2   S3   S1   S2   S3   S1   S2   S3   S1   S2   S3
noun   ∼    ∼    >    ∼    ∼    >    ∼    ≪    ∼    ≪    ≪    ≪
verb   ≫    ≫    ≫    ∼    ∼    ∼    ∼    ∼    ∼    ≪    ≪    ≪
adj    ≪    ∼    ∼    ≪    <    ∼    ≪    ∼    ∼    ≪    ≪    <
all    >    >    ≫    <    ∼    ∼    <    ∼    ∼    ≪    ≪    ≪
(a) SENSEVAL-2 data set (using micro-averaged recall)
       SVM            AdB            NB             DT
POS    s1   s2   s3   s1   s2   s3   s1   s2   s3   s1   s2   s3
noun   ∼    ≫    ≫    ∼    ∼    ≫    <    ∼    >    <    ∼    ∼
verb   ≫    ≫    ≫    >    ∼    ≫    >    ∼    ≫    ∼    ∼    ∼
adj    ∼    ∼    ∼    ∼    ∼    ∼    <    ∼    ∼    ∼    ∼    ∼
indet  ∼    ∼    ∼    ∼    ∼    ∼    ∼    ∼    ∼    ∼    ∼    ∼
all    >    ≫    ≫    ∼    >    ≫    ∼    ∼    ≫    ≪    ∼    ∼
(b) SENSEVAL-1 data set (using macro-averaged recall)
Table 5: Paired t-test on SENSEVAL-2 and SENSEVAL-1 data sets: “∼”, (“>” and “<”), and (“≫” and
“≪”) correspond to the p-value > 0.05, (0.01, 0.05], and ≤ 0.01 respectively. “>” or “≫” means our
algorithm is significantly better.
SVM is only capable of handling binary class
problems. The usual practice to deal with multi-class
problems is to build one binary classifier per
output class (denoted “1-per-class”). The original
AdaBoost, Naive Bayes, and decision tree algorithms
can already handle multi-class problems, and
we denote runs using the original AdB, NB, and DT
algorithms as “normal” in Table 2 and Table 3.
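The 1-per-class scheme can be sketched with any binary learner. In the sketch below a toy perceptron stands in for the SVM, and the rule of picking the sense whose classifier gives the highest score is an assumption, since the text does not say how the binary outputs are combined. All names are hypothetical.

```python
def train_perceptron(X, y, epochs=20):
    """Toy binary learner (a stand-in, NOT an SVM). Labels in y are +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * score <= 0:  # misclassified or on the boundary: update
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b

def train_one_per_class(X, labels, train_binary=train_perceptron):
    """Train one binary scorer per sense: that sense versus all the others."""
    scorers = {}
    for sense in sorted(set(labels)):
        targets = [1 if label == sense else -1 for label in labels]
        scorers[sense] = train_binary(X, targets)
    return scorers

def predict_sense(scorers, x):
    """Pick the sense whose binary scorer is most confident (assumption)."""
    return max(scorers, key=lambda sense: scorers[sense](x))

# Hypothetical one-hot feature vectors, one training example per sense.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
labels = ["s1", "s2", "s3"]
scorers = train_one_per_class(X, labels)
predictions = [predict_sense(scorers, x) for x in X]
```

The same wrapper applies unchanged to AdaBoost, Naive Bayes, or decision trees, which is how the “1-per-class” rows for those algorithms can be understood.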
Accuracy for each word task $w$ can be measured
by recall ($r$) or precision ($p$), defined by:
$$r = \frac{\text{no. of test instances correctly labeled}}{\text{no. of test instances in word task } w}$$
$$p = \frac{\text{no. of test instances correctly labeled}}{\text{no. of test instances output in word task } w}$$
Recall is very close (but not always identical) to pre-
cision for the top SENSEVAL participating systems.
In this paper, our reported results are based on the
official fine-grained scoring method.
To compute an average recall figure over a set of
words, we can either adopt micro-averaging (mi) or
macro-averaging (ma), defined by:
$$\text{mi} = \frac{\text{total no. of test instances correctly labeled}}{\text{total no. of test instances in all word tasks}}$$
$$\text{ma} = \frac{1}{N} \sum_{i \in \text{word tasks}} \text{recall for word task } i, \qquad N = |\text{word tasks}|$$
That is, micro-averaging treats each test instance
equally, so that a word task with many test instances
will dominate the micro-averaged recall. On the
other hand, macro-averaging treats each word task
equally.
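The difference between the two averages is easy to see on a small worked example. The counts below are hypothetical and chosen only to show how a large word task dominates the micro-average; the function names are assumptions.

```python
def micro_average(tasks):
    """Total correct over total test instances across all word tasks."""
    total_correct = sum(correct for correct, _ in tasks.values())
    total_instances = sum(n for _, n in tasks.values())
    return total_correct / total_instances

def macro_average(tasks):
    """Mean of the per-task recall figures, each task weighted equally."""
    return sum(correct / n for correct, n in tasks.values()) / len(tasks)

# word task -> (test instances correctly labeled, test instances in the task)
toy_tasks = {"bar": (90, 100), "turn": (5, 10)}
mi = micro_average(toy_tasks)  # 95 / 110
ma = macro_average(toy_tasks)  # (0.9 + 0.5) / 2
```

Here the micro-average stays close to the recall of the large task, while the macro-average is pulled down by the small, harder one.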
As shown in Table 2 and Table 3, the best micro-
averaged recall for SENSEVAL-2 (SENSEVAL-1)
is 65.4% (79.2%), obtained by combining all knowl-
edge sources (without feature selection) and using
SVM as the learning algorithm.
In Table 4, we tabulate the best micro-averaged
recall for each learning algorithm, broken down ac-
cording to nouns, verbs, adjectives, indeterminates
(for SENSEVAL-1), and all words. We also tabu-
late analogous figures for the top three participating
systems for both SENSEVALs. The top three sys-
tems for SENSEVAL-2 are: JHU (S1) (Yarowsky
et al., 2001), SMUls (S2) (Mihalcea and Moldovan,
2001), and KUNLP (S3) (Seo et al., 2001). The
top three systems for SENSEVAL-1 are: hopkins
(s1) (Yarowsky, 2000), ets-pu (s2) (Chodorow et
al., 2000), and tilburg (s3) (Veenstra et al., 2000).
As shown in Table 4, SVM with all four knowledge
sources achieves accuracy higher than the best offi-
cial scores of both SENSEVALs.
We also conducted a paired t-test to see if one
system is significantly better than another. The
t statistic of the difference between each pair of
recall figures (between each test instance pair for
micro-averaging and between each word task pair
for macro-averaging) is computed, giving rise to a
p-value. A large p-value indicates that the two systems
are not significantly different from each other.
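A minimal sketch of the paired t statistic is given below, using per-word-task recall figures as in the macro-averaged comparison. The recall values are hypothetical, the function name is an assumption, and converting the statistic to a p-value (via the t distribution with the returned degrees of freedom) is left out.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev uses n - 1
    return mean(diffs) / standard_error, n - 1

# Hypothetical per-word-task recall for two systems on the same four tasks.
system_a = [0.80, 0.75, 0.90, 0.65]
system_b = [0.70, 0.70, 0.85, 0.60]
t_value, dof = paired_t_statistic(system_a, system_b)
```

Pairing matters because both systems are scored on the same items, so the test looks at the per-item differences rather than at the two averages in isolation.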
The comparison between our learning algorithms
and the top three participating systems is given in
Table 5. Note that we can only compare macro-
averaged recall for SENSEVAL-1 systems, since
the sense of each individual test instance output by
the SENSEVAL-1 participating systems is not avail-
able. The comparison indicates that our SVM sys-
tem is better than the best official SENSEVAL-2 and
SENSEVAL-1 systems at the level of significance
0.05.
Note that we are able to obtain state-of-the-art re-
sults using a single learning algorithm (SVM), with-
out resorting to combining multiple learning algo-
rithms. Several top SENSEVAL-2 participating sys-
tems have attempted the combination of classifiers
using different learning algorithms.
In SENSEVAL-2, JHU used a combination of
various learning algorithms (decision lists, cosine-
based vector models, and Bayesian models) with
various knowledge sources such as surrounding
words, local collocations, syntactic relations, and
morphological information. SMUls used a k-nearest
neighbor algorithm with features such as keywords,
collocations, POS, and named entities. KUNLP used
the Classification Information Model, an entropy-based
learning algorithm, with local, topical, and bigram
contexts and their POS.
In SENSEVAL-1, hopkins used hierarchical de-
cision lists with features similar to those used by
JHU in SENSEVAL-2. ets-pu used a Naive Bayes
classifier with topical and local words and their POS.
tilburg used a k-nearest neighbor algorithm with features
similar to those used by Ng and Lee (1996).
tilburg also used dictionary examples as additional
training data.
7 Discussions
Based on our experimental results, there appears to
be no single, universally best knowledge source. In-
stead, knowledge sources and learning algorithms
interact and influence each other. For example, lo-
cal collocations contribute the most for SVM, while
parts-of-speech (POS) contribute the most for NB.
NB even outperforms SVM if only POS is used. In
addition, different learning algorithms benefit differently
from feature selection. SVM performs best
without feature selection, whereas NB performs best
with some feature selection ($M_2 = 3$). We will investigate
the effect of more elaborate feature selection
schemes on the performance of different learning
algorithms for WSD in future work.
Also, using the combination of four knowledge
sources gives better performance than using any
single knowledge source for most algorithms.
On the SENSEVAL-2 test set, SVM
achieves 65.4% (all 4 knowledge sources), 64.8%
(remove syntactic relations), 61.8% (further remove
POS), and 60.5% (only collocations) as knowledge
sources are removed one at a time.
Before concluding, we note that the SENSEVAL-
2 participating system UMD-SST (Cabezas et al.,
2001) also used SVM, with surrounding words and
local collocations as features. However, they re-
ported recall of only 56.8%. In contrast, our im-
plementation of SVM using the two knowledge
sources of surrounding words and local collocations
achieves recall of 61.8%. Following the description
in Cabezas et al. (2001), our own re-implementation
of UMD-SST gives a recall of 58.6%, close to their
reported figure of 56.8%. The performance drop
from 61.8% may be due to the different collocations
used in the two systems.

References
Clara Cabezas, Philip Resnik, and Jessica Stevens. 2001.
Supervised sense tagging using support vector ma-
chines. In Proceedings of the Second International
Workshop on Evaluating Word Sense Disambiguation
Systems (SENSEVAL-2), pages 59–62.
Eugene Charniak. 2000. A maximum-entropy-inspired
parser. In Proceedings of the 1st Meeting of the North
American Chapter of the Association for Computa-
tional Linguistics, pages 132–139.
Martin Chodorow, Claudia Leacock, and George A.
Miller. 2000. A topical/local classifier for word sense
identification. Computers and the Humanities, 34(1–
2):115–120.
Richard O. Duda and Peter E. Hart. 1973. Pattern Clas-
sification and Scene Analysis. Wiley, New York.
Philip Edmonds and Scott Cotton. 2001. SENSEVAL-2:
Overview. In Proceedings of the Second International
Workshop on Evaluating Word Sense Disambiguation
Systems (SENSEVAL-2), pages 1–5.
Gerard Escudero, Lluís Màrquez, and German Rigau.
2000. An empirical study of the domain dependence
of supervised word sense disambiguation systems. In
Proceedings of the Joint SIGDAT Conference on Em-
pirical Methods in Natural Language Processing and
Very Large Corpora, pages 172–180.
Yoav Freund and Robert E. Schapire. 1996. Experi-
ments with a new boosting algorithm. In Proceedings
of the Thirteenth International Conference on Machine
Learning, pages 148–156.
Adam Kilgarriff and Martha Palmer. 2000. Introduction
to the special issue on SENSEVAL. Computers and
the Humanities, 34(1–2):1–13.
Rada F. Mihalcea and Dan I. Moldovan. 2001. Pattern
learning and active feature selection for word sense
disambiguation. In Proceedings of the Second Inter-
national Workshop on Evaluating Word Sense Disam-
biguation Systems (SENSEVAL-2), pages 127–130.
Raymond J. Mooney. 1996. Comparative experiments
on disambiguating word senses: An illustration of the
role of bias in machine learning. In Proceedings of
the Conference on Empirical Methods in Natural Lan-
guage Processing, pages 82–91.
Hwee Tou Ng and Hian Beng Lee. 1996. Integrat-
ing multiple knowledge sources to disambiguate word
sense: An exemplar-based approach. In Proceedings
of the 34th Annual Meeting of the Association for
Computational Linguistics, pages 40–47.
Hwee Tou Ng. 1997. Exemplar-based word sense dis-
ambiguation: Some recent improvements. In Proceed-
ings of the Second Conference on Empirical Methods
in Natural Language Processing, pages 208–213.
Ted Pedersen and Rebecca Bruce. 1997. A new super-
vised learning algorithm for word sense disambigua-
tion. In Proceedings of the 14th National Conference
on Artificial Intelligence, pages 604–609.
Ted Pedersen. 2001a. A decision tree of bigrams is an
accurate predictor of word sense. In Proceedings of
the 2nd Meeting of the North American Chapter of the
Association for Computational Linguistics, pages 79–
86.
Ted Pedersen. 2001b. Machine learning with lexical fea-
tures: The Duluth approach to Senseval-2. In Proceed-
ings of the Second International Workshop on Evaluat-
ing Word Sense Disambiguation Systems (SENSEVAL-
2), pages 139–142.
J. Ross Quinlan. 1993. C4.5: Programs for Machine
Learning. Morgan Kaufmann, San Francisco.
Adwait Ratnaparkhi. 1996. A maximum entropy model
for part-of-speech tagging. In Proceedings of the Con-
ference on Empirical Methods in Natural Language
Processing, pages 133–142.
Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A
maximum entropy approach to identifying sentence
boundaries. In Proceedings of the Fifth Conference on
Applied Natural Language Processing, pages 16–19.
Hee-Cheol Seo, Sang-Zoo Lee, Hae-Chang Rim, and
Ho Lee. 2001. KUNLP system using classification
information model at SENSEVAL-2. In Proceedings
of the Second International Workshop on Evaluating
Word Sense Disambiguation Systems (SENSEVAL-2),
pages 147–150.
Mark Stevenson and Yorick Wilks. 2001. The interaction
of knowledge sources in word sense disambiguation.
Computational Linguistics, 27(3):321–349.
Vladimir N. Vapnik. 1995. The Nature of Statistical
Learning Theory. Springer-Verlag, New York.
Jorn Veenstra, Antal van den Bosch, Sabine Buchholz,
Walter Daelemans, and Jakub Zavrel. 2000. Memory-
based word sense disambiguation. Computers and the
Humanities, 34(1–2):171–177.
Ian H. Witten and Eibe Frank. 2000. Data Mining: Prac-
tical Machine Learning Tools and Techniques with
Java Implementations. Morgan Kaufmann, San Fran-
cisco.
David Yarowsky, Silviu Cucerzan, Radu Florian, Charles
Schafer, and Richard Wicentowski. 2001. The
Johns Hopkins SENSEVAL2 system descriptions. In
Proceedings of the Second International Workshop
on Evaluating Word Sense Disambiguation Systems
(SENSEVAL-2), pages 163–166.
David Yarowsky. 2000. Hierarchical decision lists for
word sense disambiguation. Computers and the Hu-
manities, 34(1–2):179–186.
Jakub Zavrel, Sven Degroeve, Anne Kool, Walter Daele-
mans, and Kristiina Jokinen. 2000. Diverse classifiers
for NLP disambiguation tasks: Comparison, optimiza-
tion, combination, and evolution. In TWLT 18. Learn-
ing to Behave, pages 201–221.
