Multi-Component Word Sense Disambiguation
Massimiliano Ciaramita Mark Johnson
Brown University
Department of Cognitive and Linguistic Sciences
Providence, RI 02912
massi@brown.edu, mark_johnson@brown.edu
Abstract
This paper describes the system MC-WSD pre-
sented for the English Lexical Sample task. The
system is based on a multicomponent architecture.
It consists of one classifier with two components.
One is trained on the data provided for the task. The
second is trained on this data and, additionally, on
an external training set extracted from the Wordnet
glosses. The goal of the additional component is to
lessen sparse data problems by exploiting the infor-
mation encoded in the ontology.
1 Introduction
One of the main difficulties in word sense classification tasks stems from the fact that word senses, such as Wordnet's synsets (Fellbaum, 1998), define very specific classes¹. As a consequence, training instances are often too few in number to capture extremely fine-grained semantic distinctions.
Word senses, however, are not just independent entities but are connected by several semantic relations; e.g., the is-a relation, which specifies inclusion among classes, as in "car is-a vehicle". Based on the is-a relation, Wordnet defines large and complex hierarchies for nouns and verbs.
These hierarchical structures encode potentially
useful world-knowledge that can be exploited for
word sense classification purposes, by providing
means for generalizing beyond the narrowest synset
level. To disambiguate an instance of a noun like
“bat” a system might be more successful if, in-
stead of limiting itself to applying what it knows
about the concepts “bat-mammal” and “bat-sport-
implement”, it could use additional knowledge
about other “animals” and “artifacts”.
Our system implements this intuition in two
steps. First, for each sense of an ambiguous word
we generate an additional set of training instances
We would like to thank Thomas Hofmann and our colleagues
in the Brown Laboratory for Linguistic Information Processing
(BLLIP).
¹ 51% of the noun synsets in Wordnet contain only 1 word.
from the Wordnet glosses. This data is not limited to the specific synset representing one of the senses of the word, but also includes other synsets that are semantically similar to it, i.e., close to it in the hierarchy. Then, we integrate the task-specific and
the external training data with a multicomponent
classifier that simplifies the system for hierarchical
word sense disambiguation presented in (Ciaramita
et al., 2003). The classifier consists of two com-
ponents based on the averaged multiclass percep-
tron (Collins, 2002; Crammer and Singer, 2003).
The first component is trained on the task-specific
data while the second is trained on the former and
on the external training data. When predicting a la-
bel for an instance the classifier combines the pre-
dictions of the two components. Cross-validation
experiments on the training data show the advan-
tages of the multicomponent architecture.
In the following section we describe the features
used by our system. In Section 3 we explain how we
generated the additional training set. In Section 4
we describe the architecture of the classifier and in
Section 5 we discuss the specifics of the final system
and some experimental results.
2 Features
We used a set of features similar to those extensively described and evaluated in (Yoong
and Hwee, 2002). The sentence with POS annota-
tion “A-DT newspaper-NN and-CC now-RB a-DT
bank-NN have-AUX since-RB taken-VBN over-
RB” serves as an example to illustrate them. The
word to disambiguate is bank (or activate for (7)).
1. part of speech of neighboring words $p_i$, $i \in \{-3, -2, -1, 0, +1, +2, +3\}$; e.g., $p_{-1}$=DT, $p_{0}$=NN, $p_{+1}$=AUX, ...

2. words in the same sentence WS or passage WC; e.g., WS=newspaper, WS=have, WS=over, ...

3. n-grams:

- $ng_{i}$, $i \in \{-2, -1, +1, +2\}$; e.g., $ng_{-2}$=now, $ng_{-1}$=a, $ng_{+1}$=have

- $ng_{i,j}$, $(i,j) \in \{(-2,-1), (-1,+1), (+1,+2)\}$; e.g., $ng_{-2,-1}$=now_a, $ng_{+1,+2}$=have_since
4. syntactically governing elements under a phrase $g_{1}$; e.g., $g_{1}$=taken_VP

5. syntactically governed elements under a phrase $g_{2}$; e.g., $g_{2}$=a_NP, $g_{2}$=now_NP

6. coordinates $cc$; e.g., $cc$=newspaper

7. features for verbs, e.g., "... activate the pressure":

- number of arguments $na$; e.g., $na$=1

- syntactic type of arguments $ta$; e.g., $ta$=NP

8. morphology/spelling:

- prefixes/suffixes of up to 4 characters $pre_{j}$/$suf_{j}$; e.g., $suf_{1}$=k, $suf_{2}$=nk, $pre_{2}$=ba, $pre_{3}$=ban

- uppercase characters $uc$; e.g., $uc$=0

- number/type of a word's components $nc$/$tc$; e.g., $nc$=1, $tc$=word
The same features were extracted from the given training and test data and from the additional dataset. POS and other syntactic features were extracted from parse trees. Training and test data, and the Wordnet glosses, were parsed with Charniak's parser (Charniak, 2000). Open class words were morphologically simplified with the "morph" function from the Wordnet library "wn.h". When it was not possible to identify the noun or verb in a gloss², we extracted only a limited set of features: WS, WC, and the morphological features. Each gloss provides one training instance per synset. Overall we found approximately 200,000 features.
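As an illustration of the surface features above, the sketch below computes items 1-3 and the affix features of item 8 for the example sentence. This is our code, not the authors' implementation; the helper name extract_features is hypothetical, and the parse-based features (items 4-6) are omitted since they require the parser's output.

# Sketch of the surface features (items 1-3 and the affixes of item 8).
# Hypothetical helper; parse-based features (items 4-6) are omitted.
def extract_features(tokens, pos_tags, target):
    feats = []
    # 1. POS of neighboring words in a +/-3 window around the target
    for i in range(-3, 4):
        j = target + i
        if 0 <= j < len(tokens):
            feats.append("p%+d=%s" % (i, pos_tags[j]))
    # 2. all words in the same sentence (WS); WC would use the passage
    feats += ["WS=%s" % w.lower() for w in tokens]
    # 3. unigrams and bigrams in a +/-2 window
    for i in (-2, -1, 1, 2):
        j = target + i
        if 0 <= j < len(tokens):
            feats.append("ng%+d=%s" % (i, tokens[j].lower()))
    for i, j in ((-2, -1), (-1, 1), (1, 2)):
        a, b = target + i, target + j
        if 0 <= a and b < len(tokens):
            feats.append("ng%+d,%+d=%s_%s" % (i, j, tokens[a].lower(), tokens[b].lower()))
    # 8. prefixes and suffixes of up to 4 characters
    w = tokens[target].lower()
    for n in range(1, min(5, len(w))):
        feats.append("pre%d=%s" % (n, w[:n]))
        feats.append("suf%d=%s" % (n, w[-n:]))
    return feats

tokens = "A newspaper and now a bank have since taken over".split()
tags = ["DT", "NN", "CC", "RB", "DT", "NN", "AUX", "RB", "VBN", "RB"]
print(extract_features(tokens, tags, tokens.index("bank")))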
3 External training data
There are 57 different ambiguous words in the task: 32 verbs, 20 nouns, and 5 adjectives. For each word $w$ a training set of pairs $(x_i, s_i)_{i=1}^{n}$, $s_i \in S(w)$, is generated from the task-specific data; $x_i$ is a vector of features and $S(w)$ is the set of possible senses for $w$. Nouns are labeled with Wordnet 1.71 synset labels, while verbs and adjectives are annotated with the Wordsmyth dictionary's labels. For nouns and verbs we used the hierarchies of Wordnet to generate the additional training data. We used the given sense map to map Wordsmyth senses to Wordnet synsets. For adjectives we simply used the task-specific data and a standard flat classifier.³
For each noun or verb synset we generated a fixed number $k$ of other semantically similar synsets.
² E.g., the example sentence for the noun synset relegation is "He has been relegated to a post in Siberia".
³ We used Wordnet 2.0 in our experiments, using the Wordnet sense map files to map synsets from 1.71 to 2.0.
Algorithm 1 Find k Closest Neighbors
1: input $Q = \{s\}$, $N_s = \emptyset$, $k$
2: repeat
3: $v \leftarrow \mathrm{FRONT}(Q)$
4: $DESC \leftarrow \mathrm{closest\_descendants}(v, k)$
5: for each $d \in DESC$ do
6: if $|N_s| < k$ then
7: $N_s \leftarrow N_s \cup \{d\}$
8: end if
9: end for
10: for each $u$ : $u$ is a parent of $v$ do
11: ENQUE($Q$, $u$)
12: end for
13: DEQUE($Q$)
14: until $|N_s| = k$ or $Q = \emptyset$
For each sense we start collecting synsets among the descendants of the sense itself and work our way up the hierarchy, following the paths from the sense to the top, until we find $k$ synsets. At each level we look for the closest $k$ descendants of the current synset as follows (this is the "closest_descendants()" function of Algorithm 1 above). If there are $k$ or fewer descendants we collect them all. Otherwise, we take the closest $k$ around the synset, exploiting the fact that, when ordered using the synset IDs as keys, similar synsets tend to be close to each other⁴. For example, synsets around "Rhode Islander" refer to the names of other American states' inhabitants:
Synset ID     Nouns
  109127828   Pennsylvanian
→ 109127914   Rhode Islander
  109128001   South Carolinian
Algorithm 1 presents a schematic description of the procedure. For each sense $s$ of a noun or verb, we produced a set $N_s$ of $k = 100$ similar neighbor synsets of $s$. We label this set with $\hat{s}$; thus for each set of labels $S(w)$ we induce a set of pseudo-labels $\hat{S}(w)$. For each synset in $N_s$ we compiled a training instance from the Wordnet glosses. At the end of this process, for each noun or verb, there is an additional training set $(x_j, \hat{s}_j)_{j=1}^{m}$.
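To make the neighbor-generation step concrete, here is a sketch using NLTK's WordNet interface (our illustration, not the authors' code; it assumes the nltk package and its WordNet data are installed, approximates "descendants" by direct hyponyms, and implements the ID-ordering heuristic by sorting on synset offsets):

from collections import deque
from nltk.corpus import wordnet as wn

def closest_descendants(v, around, k):
    # direct hyponyms of v, ordered by distance in synset-ID (offset) space
    hypos = sorted(v.hyponyms(), key=lambda s: abs(s.offset() - around.offset()))
    return hypos[:k]

def neighbor_synsets(sense, k=100):
    # breadth-first walk from the sense up the hierarchy, collecting
    # up to k nearby synsets (the set N_s of Algorithm 1)
    neighbors, queue = [], deque([sense])
    while queue and len(neighbors) < k:
        v = queue.popleft()
        for d in closest_descendants(v, sense, k):
            if d != sense and d not in neighbors and len(neighbors) < k:
                neighbors.append(d)
        queue.extend(v.hypernyms())  # the parents of v
    return neighbors

# each neighbor's gloss then yields one extra training instance
for syn in neighbor_synsets(wn.synset('bank.n.01'), k=5):
    print(syn.name(), '-', syn.definition())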
4 Classifier
4.1 Multiclass averaged perceptron
Our base classifier is the multiclass averaged per-
ceptron. The multiclass perceptron (Crammer and Singer, 2003) is an on-line learning algorithm that extends the standard perceptron to the multiclass case.
⁴ This likely depends on the fact that the IDs encode the location in the hierarchy, even though we do not know how the IDs are generated.
Algorithm 2 Multiclass Perceptron
1: input training data $(x_i, s_i)_{i=1}^{n}$, $W = 0$
2: repeat
3: for $i = 1, \ldots, n$ do
4: $E_i = \{r : \langle w_r, x_i \rangle \geq \langle w_{s_i}, x_i \rangle\}$
5: if $|E_i| > 0$ then
6: $w_{s_i} \leftarrow w_{s_i} + x_i$
7: for $r \in E_i$ do
8: $w_r \leftarrow w_r - \frac{1}{|E_i|} x_i$
9: end for
10: end if
11: end for
12: until no more mistakes
It takes as input a training set $(x_i, s_i)_{i=1}^{n}$, with $x_i \in \mathbb{R}^d$ and $s_i \in S(w)$. In the multiclass perceptron one introduces a weight vector $w_s \in \mathbb{R}^d$ for every $s \in S(w)$ and defines the classifier $H$ by the so-called winner-take-all rule

$H(x; W) = \arg\max_{s \in S(w)} \langle w_s, x \rangle$.   (1)
Here $W \in \mathbb{R}^{d \times |S(w)|}$ refers to the matrix of weights, with every column corresponding to one of the weight vectors $w_s$. The algorithm is summarized
in Algorithm 2. Training patterns are presented one at a time. Whenever $H(x_i; W) \neq s_i$ an update step is performed; otherwise the weight vectors remain unchanged. To perform the update, one first computes the error set $E_i$ containing those class labels that have received a higher score than the correct class:

$E_i = \{r : \langle w_r, x_i \rangle \geq \langle w_{s_i}, x_i \rangle\}$   (2)

We use the simplest case of uniform update weights, demoting each $r \in E_i$ by $\frac{1}{|E_i|} x_i$.
The perceptron algorithm defines a sequence of weight matrices $W^{(0)}, \ldots, W^{(n)}$, where $W^{(i)}$ is the weight matrix after the first $i$ training items have been processed. In the standard perceptron, the weight matrix $W = W^{(n)}$ is used to classify the unlabeled test examples. However, a variety of methods can be used for regularization or smoothing in order to reduce the effect of overtraining. Here we used the averaged perceptron (Collins, 2002), where the weight matrix used to classify the test data is the average of all the matrices posited during training, i.e., $W = \frac{1}{n} \sum_{i=1}^{n} W^{(i)}$.
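As an illustration of this training loop, the sketch below implements the multiclass perceptron with uniform updates and weight averaging. It is our reading of Algorithm 2, not the authors' implementation, and it assumes dense NumPy feature vectors rather than the sparse indicator features actually used:

import numpy as np

def train_averaged_perceptron(X, y, n_classes, epochs=50):
    # X: (n, d) feature matrix; y: (n,) gold class indices
    n, d = X.shape
    W = np.zeros((d, n_classes))   # one weight column w_s per class
    W_sum = np.zeros_like(W)       # running sum of posited matrices
    seen = 0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            scores = X[i] @ W
            # error set E_i: wrong classes scoring at least as high
            # as the correct class, cf. Eq. (2)
            E = [r for r in range(n_classes)
                 if r != y[i] and scores[r] >= scores[y[i]]]
            if E:
                mistakes += 1
                W[:, y[i]] += X[i]            # promote the correct class
                for r in E:
                    W[:, r] -= X[i] / len(E)  # uniform demotion
            W_sum += W
            seen += 1
        if mistakes == 0:
            break
    return W_sum / seen  # averaged weights (Collins, 2002)

def predict(W, x):
    return int(np.argmax(x @ W))  # winner-take-all rule, Eq. (1)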
4.2 Multicomponent architecture
Task specific and external training data are inte-
grated with a two-component perceptron. The dis-
Algorithm 3 Multicomponent Perceptron
1: input $(x_i, s_i)_{i=1}^{n}$, $V = 0$, $(x_j, \hat{s}_j)_{j=1}^{m}$, $M = 0$
2: for $t = 1, \ldots, T$ do
3: train $M$ on $(x_j, \hat{s}_j)_{j=1}^{m}$ and $(x_i, s_i)_{i=1}^{n}$
4: train $V$ on $(x_i, s_i)_{i=1}^{n}$
5: end for
The discriminant function is defined as:

$H(x; V, M) = \arg\max_{s \in S(w)} \lambda_s \langle v_s, x \rangle + \lambda_{\hat{s}} \langle m_{\hat{s}}, x \rangle$
The first component is trained on the task-specific data. The second component learns a separate weight matrix $M$, where each column vector $m_{\hat{s}}$ represents a set label $\hat{s}$, and is trained on both the task-specific and the additional training sets. Each component is weighted by a parameter $\lambda$; here $\lambda_{\hat{s}}$ is simply equal to $1 - \lambda_s$. We experimented with two values for $\lambda_s$, namely 1 and 0.5. In the former case only the first component is used; in the latter both components are used and their contributions are equally weighted.
The training procedure for the multicomponent classifier is described in Algorithm 3. This is a simplification of the algorithm presented in (Ciaramita et al., 2003). The two algorithms are similar, except that convergence, if the data is separable, is obvious here because the two components are trained individually with the standard multiclass perceptron procedure. Convergence is typically achieved in fewer than 50 iterations, but the value of $T$ used for evaluation on the unseen test data was chosen by cross-validation. With this version of the algorithm the implementation is simpler, especially if several components are included.
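Prediction with the two components can then be sketched as follows (our illustration; V and M are weight matrices trained as in Algorithm 3, and shat_of, a hypothetical mapping from each sense to the column of its pseudo-label, is an assumption of this sketch):

import numpy as np

def predict_multicomponent(V, M, shat_of, x, lam_s=0.5):
    # V: (d, n_senses) task-specific weights (first component)
    # M: (d, n_pseudo) pseudo-label weights (second component)
    # lam_s = 1 uses only the first component; 0.5 weighs them equally
    best, best_score = 0, -np.inf
    for s in range(V.shape[1]):
        score = lam_s * (x @ V[:, s]) + (1.0 - lam_s) * (x @ M[:, shat_of[s]])
        if score > best_score:
            best, best_score = s, score
    return best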
4.3 Multilabel cases
Often, several senses of an ambiguous word are very
close in the hierarchy. Thus it can happen that a
synset belongs to the neighbor set of more than one
sense of the ambiguous word. When this is the case
the training instance for that synset is treated as a
multilabeled instance; i.e., $\hat{s}_i$ is actually a set of labels for $x_i$, that is, $\hat{s}_i \subseteq \hat{S}(w)$. Several methods can be used to deal with multilabeled instances; here we use a simple generalization of Algorithm 2. The error set for a multilabel training instance is defined as:

$E_i = \{r : \exists s \in \hat{s}_i, \; \langle w_r, x_i \rangle \geq \langle w_s, x_i \rangle\}$   (3)

which is equivalent to the definition in Equation 2 when $|\hat{s}_i| = 1$. The positive update of Algorithm 2 (line 6) is also redefined. The update concerns the set of labels $S_i \subseteq \hat{s}_i$ for which some incorrect label achieved a better score; i.e., $S_i = \{s \in \hat{s}_i : \exists r \notin \hat{s}_i, \; \langle w_r, x_i \rangle \geq \langle w_s, x_i \rangle\}$. For each $s \in S_i$ the update is $+\frac{1}{|S_i|} x_i$, which, again, reduces to the former case when $|S_i| = 1$.

word     λ_s=1  λ_s=0.5    word        λ_s=1  λ_s=0.5    word       λ_s=1  λ_s=0.5
appear    86.1   85.5      audience     84.8   86.8      encounter   72.9   75.0
arm       85.9   87.5      bank         82.9   82.1      watch       77.1   77.9
ask       61.9   62.7      begin        57.0   61.5      hear        65.6   68.7
lose      53.1   52.5      eat          85.7   85.0      party       77.1   79.0
expect    76.6   75.9      mean         76.5   77.5      image       66.3   67.8
note      59.6   60.4      difficulty   49.2   54.2      write       68.3   65.0
plan      77.2   78.3      disc         72.1   74.1      paper       56.3   57.7

Table 1. Results on several words from the cross-validation experiments on the training data. Accuracies are reported for the best value of $T$, which is then chosen as the value for the final system, together with the value of $\lambda_s$ that performed better. On most words the multicomponent model outperforms the flat one.
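As a sketch, this multilabel generalization changes only the error set and the positive update in the inner loop of the perceptron sketch of Section 4.1 (again our code, not the authors'; gold stands for the label set ŝ_i of the instance):

import numpy as np

def multilabel_update(W, x, gold):
    # gold: set of gold class indices for this instance (the set s_i-hat)
    scores = x @ W
    wrong = [r for r in range(W.shape[1]) if r not in gold]
    # error set of Eq. (3): wrong classes outscoring some gold label
    E = [r for r in wrong if any(scores[r] >= scores[s] for s in gold)]
    # gold labels outscored by some wrong class (the set S_i)
    S = [s for s in gold if any(scores[r] >= scores[s] for r in wrong)]
    if E:
        for s in S:
            W[:, s] += x / len(S)   # uniform positive update over S_i
        for r in E:
            W[:, r] -= x / len(E)   # uniform demotion, as in Algorithm 2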
5 Results
Table 1 presents results from a set of experiments performed by cross-validation on the training data, for several nouns and verbs. For 37 nouns and verbs out of 52, the two-component model was more accurate than the flat model⁵. We used the results from these experiments to set, separately for each word, the parameters $T$, which was equal to 13.9 on average, and $\lambda_s$. For adjectives we only set the parameter $T$ and used the standard "flat" perceptron. For each word in the task we separately trained one classifier. The system's accuracy on the unseen test set is summarized in the following table:
Measure              Precision  Recall
Fine, all POS        71.1%      71.1%
Coarse, all POS      78.1%      78.1%
Fine, verbs          72.5%      72.5%
Coarse, verbs        80.0%      80.0%
Fine, nouns          71.3%      71.3%
Coarse, nouns        77.4%      77.4%
Fine, adjectives     49.7%      49.7%
Coarse, adjectives   63.5%      63.5%
Overall, the system has the following advantages over that of (Ciaramita et al., 2003). Selecting the external training data based on the $k$ most similar synsets has the advantage, over using supersenses, of generating an equivalent amount of additional data for each word sense. The additional data for each synset is also more homogeneous, so the model should have less variance. The multicomponent architecture is simpler and has an obvious convergence proof. Convergence is faster and training is efficient: it takes less than one hour to build and train all the final systems and generate the complete test results. We used the averaged version of the perceptron and introduced an adjustable parameter $\lambda$ to weigh each component's contribution separately.

⁵ Since $\lambda_s$ is an adjustable parameter, it is possible that, with different values of $\lambda_s$, the multicomponent model would achieve even better performance.

References
E. Charniak. 2000. A Maximum-Entropy-Inspired
Parser. In Proceedings of the 38th Annual Meet-
ing of the Association for Computational Linguis-
tics (ACL 2000).
M. Ciaramita, T. Hofmann, and M. Johnson.
2003. Hierarchical Semantic Classification:
Word Sense Disambiguation with World Knowl-
edge. In Proceedings of the 18th International
Joint Conference on Artificial Intelligence (IJCAI
2003).
M. Collins. 2002. Discriminative Training Meth-
ods for Hidden Markov Models: Theory and Ex-
periments with Perceptron Algorithms. In Pro-
ceedings of the Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP
2002), pages 1–8.
K. Crammer and Y. Singer. 2003. Ultraconserva-
tive Online Algorithms for Multiclass Problems.
Journal of Machine Learning Research, 3.
C. Fellbaum. 1998. WordNet: An Electronic Lexi-
cal Database. MIT Press, Cambridge, MA.
K.L. Yoong and T.N. Hwee. 2002. An Empirical
Evaluation of Knowledge Sources and Learning
Algorithms for Word Sense Disambiguation. In
Proceedings of the 2002 Conference on Empir-
ical Methods in Natural Language Processing
(EMNLP 2002).
