Supervised Word Sense Disambiguation with
Support Vector Machines and Multiple Knowledge Sources
Yoong Keok Lee and Hwee Tou Ng and Tee Kiah Chia
Department of Computer Science
National University of Singapore
3 Science Drive 2, Singapore 117543
y.k.lee@alumni.nus.edu.sg
nght@comp.nus.edu.sg
chiateek@comp.nus.edu.sg
Abstract
We participated in the SENSEVAL-3 English lexi-
cal sample task and multilingual lexical sample task.
We adopted a supervised learning approach with
Support Vector Machines, using only the official
training data provided. No other external resources
were used. The knowledge sources used were part-
of-speech of neighboring words, single words in the
surrounding context, local collocations, and syntac-
tic relations. For the translation and sense subtask
of the multilingual lexical sample task, the English
sense given for the target word was also used as
an additional knowledge source. For the English
lexical sample task, we obtained fine-grained and
coarse-grained score (for both recall and precision)
of 0.724 and 0.788 respectively. For the multilin-
gual lexical sample task, we obtained recall (and
precision) of 0.634 for the translation subtask, and
0.673 for the translation and sense subtask.
1 Introduction
This paper describes the approach adopted by our
systems which participated in the English lexical
sample task and the multilingual lexical sample task
of SENSEVAL-3. The goal of the English lexical
sample task is to predict the correct sense of an am-
biguous English word a0 , while that of the multi-
lingual lexical sample task is to predict the correct
Hindi (target language) translation of an ambiguous
English (source language) worda0 .
The multilingual lexical sample task is further
subdivided into two subtasks: the translation sub-
task, as well as the translation and sense subtask.
The distinction is that for the translation and sense
subtask, the English sense of the target ambiguous
word a0 is also provided (for both training and test
data).
In all, we submitted 3 systems: system nusels
for the English lexical sample task, system nusmlst
for the translation subtask, and system nusmlsts for
the translation and sense subtask.
All systems were based on the supervised word
sense disambiguation (WSD) system of Lee and Ng
(2002), and used Support Vector Machines (SVM)
learning. Only the training examples provided in the
official training corpus were used to train the sys-
tems, and no other external resources were used. In
particular, we did not use any external dictionary or
the sample sentences in the provided dictionary.
The knowledge sources used included part-of-
speech (POS) of neighboring words, single words in
the surrounding context, local collocations, and syn-
tactic relations, as described in Lee and Ng (2002).
For the translation and sense subtask of the multi-
lingual lexical sample task, the English sense given
for the target word was also used as an additional
knowledge source. All features encoding these
knowledge sources were used, without any feature
selection.
We next describe SVM learning and the com-
bined knowledge sources adopted. Much of the de-
scription follows that of Lee and Ng (2002).
2 Support Vector Machines (SVM)
The SVM (Vapnik, 1995) performs optimization to
find a hyperplane with the largest margin that sepa-
rates training examples into two classes. A test ex-
ample is classified depending on the side of the hy-
perplane it lies in. Input features can be mapped into
high dimensional space before performing the opti-
mization and classification. A kernel function can
be used to reduce the computational cost of training
and testing in high dimensional space. If the train-
ing examples are nonseparable, a regularization pa-
rameter a1 (a2a4a3 by default) can be used to control
the trade-off between achieving a large margin and
a low training error. We used the implementation
of SVM in WEKA (Witten and Frank, 2000), where
each nominal feature witha5 possible values is con-
verted intoa5 binary (0 or 1) features. If a nominal
feature takes the a6 th value, then the a6 th binary fea-
ture is set to 1 and all the other binary features are
set to 0. The default linear kernel is used. Since
SVM only handles binary (2-class) classification,
we built one binary classifier for each sense class.
                                             Association for Computational Linguistics
                        for the Semantic Analysis of Text, Barcelona, Spain, July 2004
                 SENSEVAL-3: Third International Workshop on the Evaluation of Systems
Note that our supervised learning approach made
use of a single learning algorithm, without combin-
ing multiple learning algorithms as adopted in other
research (such as (Florian et al., 2002)).
3 Multiple Knowledge Sources
To disambiguate a word occurrence a0 , systems
nusels and nusmlst used the first four knowledge
sources listed below. System nusmlsts used the
English sense given for the target ambiguous word
a0 as an additional knowledge source. Previous re-
search (Ng and Lee, 1996; Stevenson and Wilks,
2001; Florian et al., 2002; Lee and Ng, 2002) has
shown that a combination of knowledge sources im-
proves WSD accuracy.
Our experiments on the provided training data
of the SENSEVAL-3 translation and sense subtask
also indicated that the additional knowledge source
of the English sense of the target word further im-
proved accuracy (See Section 4.3 for details).
We did not attempt feature selection since our
previous research (Lee and Ng, 2002) indicated that
SVM performs better without feature selection.
3.1 Part-of-Speech (POS) of Neighboring
Words
We use 7 features to encode this knowledge source:
a0a2a1a4a3a6a5a7a0a8a1a10a9a11a5a7a0a8a1a13a12a14a5a15a0a17a16a6a5a15a0a2a12a18a5a7a0a19a9a20a5a15a0a17a3 , where a0a8a1 a21 (a0 a21 ) is
the POS of thea6 th token to the left (right) ofa0 , and
a0a17a16 is the POS of
a0 . A token can be a word or a
punctuation symbol, and each of these neighboring
tokens must be in the same sentence asa0 . We use a
sentence segmentation program (Reynar and Ratna-
parkhi, 1997) and a POS tagger (Ratnaparkhi, 1996)
to segment the tokens surroundinga0 into sentences
and assign POS tags to these tokens.
For example, to disambiguate the word
bars in the POS-tagged sentence “Reid/NNP
saw/VBD me/PRP looking/VBG at/IN the/DT
iron/NN bars/NNS ./.”, the POS feature vector is
a22a24a23a26a25
a5 a27 a28a29a5
a25a30a25
a5
a25a31a25a30a32
a5a14a33a34a5a36a35a14a5a36a35a2a37 where a35 denotes
the POS tag of a null token.
3.2 Single Words in the Surrounding Context
For this knowledge source, we consider all sin-
gle words (unigrams) in the surrounding context of
a0 , and these words can be in a different sentence
from a0 . For each training or test example, the
SENSEVAL-3 official data set provides a few sen-
tences as the surrounding context. In the results re-
ported here, we consider all words in the provided
context.
Specifically, all tokens in the surrounding context
of a0 are converted to lower case and replaced by
their morphological root forms. Tokens present in
a list of stop words or tokens that do not contain
at least an alphabet character (such as numbers and
punctuation symbols) are removed. All remaining
tokens from all training contexts provided fora0 are
gathered. Each remaining token a38 contributes one
feature. In a training (or test) example, the feature
corresponding to a38 is set to 1 iff the context ofa0 in
that training (or test) example contains a38 .
For example, if a0 is the word bars and the set
of selected unigrams is a39 chocolate, iron, beera40 , the
feature vector for the sentence “Reid saw me look-
ing at the iron bars .” is a22 0, 1, 0a37 .
3.3 Local Collocations
A local collocation a1 a21 a41a42 refers to the ordered se-
quence of tokens in the local, narrow context of a0 .
Offsetsa6 and a43 denote the starting and ending posi-
tion (relative to a0 ) of the sequence, where a neg-
ative (positive) offset refers to a token to its left
(right). For example, let a0 be the word bars in
the sentence “Reid saw me looking at the iron bars
.” Then a1 a1a10a9a7a41a44a1a13a12 is the iron and a1 a1a13a12a36a41a9 is iron . a35 ,
where a35 denotes a null token. Like POS, a colloca-
tion does not cross sentence boundary. To represent
this knowledge source of local collocations, we ex-
tracted 11 features corresponding to the following
collocations: a1 a1a45a12a46a41a44a1a45a12 , a1 a12a46a41a44a12 , a1 a1a10a9a7a41a44a1a10a9 , a1 a9a7a41a9 , a1 a1a10a9a7a41a44a1a13a12 ,
a1 a1a13a12a36a41a44a12 , a1 a12a36a41a9 , a1 a1a4a3a47a41a44a1a45a12 , a1 a1a48a9a18a41a44a12 , a1 a1a13a12a36a41a9 , and a1 a12a36a41a3 . This
set of 11 features is the union of the collocation fea-
tures used in Ng and Lee (1996) and Ng (1997).
Note that each collocationa1 a21 a41a42 is represented by
one feature that can have many possible feature val-
ues (the local collocation strings), whereas each dis-
tinct surrounding word is represented by one feature
that takes binary values (indicating presence or ab-
sence of that word). For example, if a0 is the word
bars and suppose the set of collocations fora1 a1a10a9a7a41a44a1a13a12
is a39 a chocolate, the wine, the irona40 , then the fea-
ture value for collocation a1 a1a48a9a18a41a44a1a45a12 in the sentence
“Reid saw me looking at the iron bars .” is the iron.
3.4 Syntactic Relations
We first parse the sentence containinga0 with a sta-
tistical parser (Charniak, 2000). The constituent
tree structure generated by Charniak's parser is then
converted into a dependency tree in which every
word points to a parent headword. For example,
in the sentence “Reid saw me looking at the iron
bars .”, the word Reid points to the parent headword
saw. Similarly, the word me also points to the parent
headword saw.
We use different types of syntactic relations, de-
pending on the POS ofa0 . Ifa0 is a noun, we use four
features: its parent headword a49 , the POS of a49 , the
voice of a49 (active, passive, or a50 if a49 is not a verb),
1(a) attention (noun)
1(b) He turned his attention to the workbench .
1(c) a22 turned, VBD, active, lefta37
2(a) turned (verb)
2(b) He turned his attention to the workbench .
2(c) a22 he, attention, PRP, NN, VBD, activea37
3(a) green (adj)
3(b) The modern tram is a green machine .
3(c) a22 machine, NNa37
Table 1: Examples of syntactic relations
and the relative position of a49 from a0 (whether a49 is
to the left or right of a0 ). If a0 is a verb, we use six
features: the nearest word a0 to the left ofa0 such that
a0 is the parent headword of
a0 , the nearest word a1 to
the right ofa0 such thata0 is the parent headword of
a1 , the POS of a0 , the POS of a1 , the POS of
a0 , and the
voice ofa0 . Ifa0 is an adjective, we use two features:
its parent headword a49 and the POS of a49 .
Headwords are obtained from a parse tree with
the script used for the CoNLL-2000 shared task
(Tjong Kim Sang and Buchholz, 2000).1
Some examples are shown in Table 1. Each POS
noun, verb, or adjective is illustrated by one exam-
ple. For each example, (a) showsa0 and its POS; (b)
shows the sentence where a0 occurs; and (c) shows
the feature vector corresponding to syntactic rela-
tions.
3.5 Source Language (English) Sense
For the translation and sense subtask of the multilin-
gual lexical sample task, the sense of an ambiguous
worda0 in the source language (English) is provided
for most of the training and test examples. An ex-
ample with unknown English sense is denoted with
question mark (“?”) in the corpus. We treat “?” as
another “sense” ofa0 (just like any other valid sense
ofa0 ).
We compile the set of English senses of a word
a0 encountered in the whole training corpus. For
each sense a2 in this set, a binary feature is generated
for each training and test example. If an example
has a2 as the English sense of a0 , this binary feature
(corresponding to a2 ) is set to 1, otherwise it is set to
0.
4 Evaluation
Since our WSD system always outputs exactly one
prediction for each test example, its recall is always
the same as precision. We report below the micro-
averaged recall over all test words.
1Available at http://ilk.uvt.nl/
a3 sabine/chunklink/chunklink 2-
2-2000 for conll.pl
Evaluation data Recall
SE-2 0.656
SE-1 (with dictionary examples) 0.796
SE-1 (without dictionary examples) 0.776
Table 2: Micro-averaged, fine-grained recall on
SENSEVAL-2 and SENSEVAL-1 test data
System Recall
nusels 0.724 (fine-grained)
0.788 (coarse-grained)
nusmlst 0.634
nusmlsts 0.673
Table 3: Micro-averaged recall on SENSEVAL-3
test data
4.1 Evaluation on SENSEVAL-2 and
SENSEVAL-1 Data
Before participating in SENSEVAL-3, we evaluated
our WSD system on the English lexical sample task
of SENSEVAL-2 and SENSEVAL-1. The micro-
averaged, fine-grained recall over all SENSEVAL-
2 test words and all SENSEVAL-1 test words are
given in Table 2.
In SENSEVAL-1, some example sentences are
provided with the dictionary entries of the words
used in the evaluation. We provide the recall on
SENSEVAL-1 test data with and without the use of
such additional dictionary examples in training.
On both SENSEVAL-2 and SENSEVAL-1 test
data, the accuracy figures we obtained, as reported
in Table 2, are higher than the best official test
scores reported on both evaluation data sets.
4.2 Official SENSEVAL-3 Scores
We participated in the SENSEVAL-3 English lexi-
cal sample task, and both subtasks of the multilin-
gual lexical sample task. The official SENSEVAL-
3 scores are shown in Table 3. Each score is the
micro-averaged recall (which is the same as preci-
sion) over all test words.
According to the task organizers, the fine-grained
(coarse-grained) recall of the best participating sys-
tem in the English lexical sample task is 0.729
(0.795). As such, the performance of our system
nusels compares favorably with the best participat-
ing system.
We are not able to fully assess the performance
of our multilingual lexical sample task systems
nusmlst and nusmlsts at the time of writing this
paper, since performance figures of the best partici-
pating system in this task have not been released by
the task organizers.
4.3 Utility of English Sense as an Additional
Knowledge Source
To determine if using the English sense as an addi-
tional knowledge source improved the accuracy of
the translation and sense subtask, we conducted a
five-fold cross validation experiment. We randomly
divided the training data of the translation and sense
subtask for each word into 5 portions, using 4 por-
tions for training and 1 portion for test. We then re-
peated the process by selecting a different portion as
the test data each time and training on the remaining
portions.
Our investigation revealed that adding the En-
glish sense to the four existing knowledge sources
improved the micro-averaged recall from 0.628 to
0.638 on the training data. As such, we decided to
use the English sense as an additional knowledge
source for our system nusmlsts.
After the official SENSEVAL-3 evaluation
ended, we evaluated a variant of our system nusml-
sts without using the English sense as an additional
knowledge source. Based on the official test keys
released, the micro-averaged recall drops to 0.643,
which seems to suggest that the English sense is
a helpful knowledge source for the translation and
sense subtask.
5 Conclusion
In this paper, we described our participating systems
in the SENSEVAL-3 English lexical sample task
and multilingual lexical sample task. Our WSD sys-
tems used SVM learning and multiple knowledge
sources. Evaluation results on the English lexical
sample task indicate that our method achieves good
accuracy on this task.
6 Acknowledgements
This research is partially supported by a research
grant R252-000-125-112 from National University
of Singapore Academic Research Fund.

References
Eugene Charniak. 2000. A maximum-entropy-
inspired parser. In Proceedings of the 1st Meeting
of the North American Chapter of the Association
for Computational Linguistics, pages 132–139.
Radu Florian, Silviu Cucerzan, Charles Schafer, and
David Yarowsky. 2002. Combining classifiers
for word sense disambiguation. Natural Lan-
guage Engineering, 8(4):327–341.
Yoong Keok Lee and Hwee Tou Ng. 2002. An
empirical evaluation of knowledge sources and
learning algorithms for word sense disambigua-
tion. In Proceedings of the Conference on Em-
pirical Methods in Natural Language Processing
(EMNLP), pages 41–48.
Hwee Tou Ng and Hian Beng Lee. 1996. Integrat-
ing multiple knowledge sources to disambiguate
word sense: An exemplar-based approach. In
Proceedings of the 34th Annual Meeting of the
Association for Computational Linguistics, pages
40–47.
Hwee Tou Ng. 1997. Exemplar-based word sense
disambiguation: Some recent improvements. In
Proceedings of the Second Conference on Em-
pirical Methods in Natural Language Processing,
pages 208–213.
Adwait Ratnaparkhi. 1996. A maximum entropy
model for part-of-speech tagging. In Proceedings
of the Conference on Empirical Methods in Nat-
ural Language Processing, pages 133–142.
Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997.
A maximum entropy approach to identifying sen-
tence boundaries. In Proceedings of the Fifth
Conference on Applied Natural Language Pro-
cessing, pages 16–19.
Mark Stevenson and Yorick Wilks. 2001. The
interaction of knowledge sources in word
sense disambiguation. Computational Linguis-
tics, 27(3):321–349.
Erik F. Tjong Kim Sang and Sabine Buchholz.
2000. Introduction to the CoNLL-2000 shared
task: Chunking. In Proceedings of the CoNLL-
2000 and LLL-2000, pages 127–132.
Vladimir N. Vapnik. 1995. The Nature of Sta-
tistical Learning Theory. Springer-Verlag, New
York.
Ian H. Witten and Eibe Frank. 2000. Data Min-
ing: Practical Machine Learning Tools and
Techniques with Java Implementations. Morgan
Kaufmann, San Francisco.
