Learning a Robust Word Sense Disambiguation Model using
Hypernyms in Definition Sentences
Kiyoaki Shirai, Tsunekazu Yagi
School of Information Science, Japan Advanced Institute of Science and Technology
1-1, Asahidai, Tatsunokuchi, 923-1292, Ishikawa, Japan
{kshirai,t-yagi}@jaist.ac.jp
Abstract
This paper proposes a method to improve
the robustness of a word sense disambigua-
tion (WSD) system for Japanese. Two
WSD classifiers are trained from a word
sense-tagged corpus: one is a classifier ob-
tained by supervised learning, the other is
a classifier using hypernyms extracted from
definition sentences in a dictionary. The for-
mer will be suitable for the disambiguation
of high frequency words, while the latter is
appropriate for low frequency words. A ro-
bust WSD system will be constructed by
combining these two classifiers. In our ex-
periments, the F-measure and applicability
of our proposed method were 3.4% and 10%
greater, respectively, compared with a single
classifier obtained by supervised learning.
1 Introduction
Word sense disambiguation (WSD) is the pro-
cess of selecting the appropriate meaning or
sense for a given word in a document. Obvi-
ously, WSD is one of the fundamental and im-
portant processes needed for many natural lan-
guage processing (NLP) applications. Over the
past decade, many studies have been made on
WSD of Japanese. Most current research uses
machine learning techniques (Li and Takeuchi,
1997; Murata et al., 2001; Takamura et al.,
2001), has achieved good performance. How-
ever, as supervised learning methods require
word sense-tagged corpora, they often suﬀer
from data sparseness, i.e., words which do not
occur frequently in a training corpus can not be
disambiguated. Therefore, we cannot use su-
pervised learning algorithms alone in practical
NLP applications, especially when it is neces-
sary to disambiguate both high frequency and
low frequency words.
To tackle this problem, this paper proposes a
method to combine two WSD classifiers. One
is a classifier obtained by supervised learning.
The learning algorithm used for this classifier is
the Support Vector Machine (SVM); this clas-
sifier will work well for the disambiguation of
high frequency words. The second classifier is
the Naive Bayes model, which will work well
for the disambiguation of low frequency words.
In this model, hypernyms extracted from defi-
nition sentences in a dictionary are considered
in order to overcome data sparseness.
The details of the SVM classifier are de-
scribed in Section 2, and the Naive Bayes model
in Section 3. The combination of these two clas-
sifiers is described in Section 4. The experi-
mental evaluation of the proposed method is re-
ported in Section 5. We mention some related
works in Section 6, and conclude the paper in
Section 7.
2 SVM Classifier
The first classifier is the SVM classifier. Since
SVM is a supervised learning algorithm, a word
sense-tagged corpus is required as training data,
and the classifier can not be used to disam-
biguate words which do not occur frequently
in the data. However, as the eﬀectiveness of
SVM has been widely reported for a variety of
NLP tasks including WSD (Murata et al., 2001;
Takamura et al., 2001), we know that it will
work well for disambiguation of high frequency
words.
When training the SVM classifier, each train-
ing instance should be represented by a feature
vector. We used the following features, which
are typical for WSD.
• S(0),S(−1),S(−2),S(+1),S(+2)
Surface forms of a target word and words
just before or after a target word. A num-
ber in parentheses indicates the position of
a word from a target word.
• P(−1),P(−2),P(+1),P(+2)
Parts-of-speech (POSs) of words just before
or after a target word.
• S(−2)·S(−1), S(+1)·S(+2), S(−1)·S(+1)
Pairs of surface forms of words surrounding
a target word.
• P(−2)·P(−1), P(+1)·P(+2), P(−1)·P(+1)
Pairs of POSs of words surrounding a tar-
get word.
• B
sent
Base forms of content words in a sentence
1
.
• C
sent
Semantic classes of content words in a
sentence. Semantic classes used here
are derived from the Japanese the-
saurus “Nihongo-Goi-Taikei” (Ikehara et
al., 1997).
• B
head
, B
mod
Base forms of the head (B
head
)ormodifiers
(B
mod
) of a target word.
• (B
case
;B
noun
)
A pair of the base forms of a case marker
(B
case
) and a case filler noun (B
noun
)when
the target word is a verb.
• (B
case
;C
noun
)
A pair of the base form of a case marker
(B
case
) and the semantic class of a case
filler noun (C
noun
) when the target word
is a verb.
• (B
case
;B
verb
)
A pair of the base forms of a case marker
(B
case
) and a head verb (B
verb
) when the
target word is a case filler noun of a certain
verb.
We used the LIBSVM package
2
for train-
ing the SVM classifier. The SVM model is
ν−SVM (Sch¨olkopf, 2000) with a linear kernel,
where the parameter ν =0.0001. The pairwise
method is used to apply SVM to multi classifi-
cation.
3 Naive Bayes Classifier using
Hypernyms in Definition
Sentences
In this section, we will describe the details of
the WSD classifier using hypernyms of words
1
We tried using the special symbol “NUM” as a fea-
ture for any numbers in a sentence, but the performance
was slightly worse in our experiment. We thank the
anonymous reviewer who gave us the comment about
this.
2
http://www.csie.ntu.edu.tw/%7Ecjlin/
libsvm/
CID Definition sentence
3c5631\lZMs��b�3(a comic
story-telling entertainment)
1f66e3q�q��sM�(a rambling story)
Figure 1: Sense Set of “��”
extracted from definition sentences in a dictio-
nary.
3.1 Overview
Let us explain the basic idea of the model by
considering the case in which the word “��”
(mandan; comic chat) in the following example
sentence (A) should be disambiguated:
(A)T�^�x��w
H�t��z...
(Mr. Sakano was initiated into the
world of comic chat ...)
In this paper, word senses are defined accord-
ing to the EDR concept dictionary (EDR, 1995).
Figure 1 illustrates two meanings for “��”
(comic chat) in the EDR concept dictionary.
“CID” indicates a concept ID, an identification
number of a sense.
One of the ways to disambiguate the senses
of “��” (comic chat) is to train the WSD clas-
sifier from the sense-tagged corpus, as with the
SVM classifier. However, when “��”(comic
chat) occurs infrequently or not at all in the
training corpus, we can not train any reliable
classifiers.
To train the WSD classifier for low frequency
words, we looked at hypernyms of senses in def-
inition sentences. For Japanese, in most cases
the last word in a definition sentence is a hy-
pernym. For example, the hypernym of sense
3c5631 in Figure 1 is the last underlined word
“3”(engei; entertainment), while the hyper-
nym of 1f66e3 is “�”(hanashi; story).
In the EDR concept dictionary, there are
senses whose hypernyms are also “3”(en-
tertainment) or “�” (story). For example, as
shown in Figure 2, 10d9a4, 3c3fbb and 3c5ab3
are senses whose hypernyms are “3”(enter-
tainment), while the hypernym of 3cf737, 0f73c1
and 3c3071 is “�” (story). If these senses oc-
cur in the training corpus, we can train a clas-
sifier that determines whether the hypernym of
“��”(comicchat)is“3” (entertainment)
or “�” (story). If we can determine the correct
hypernym, we can also determine which is the
correct sense, 3c5631 or 1f66e3. Notice that we
can train such a model even when “��”(comic
�X��10d9a4\lZMs��Z,7�tXj�mZ�/
n3(a monologue-style,
comic story-telling entertainment always ending with a punch line)
���3c3fbb�qMO�
Hw�	:3(a type of medieval folk entertainment of Japan,
called ‘Sarugaku’)
��
~�3c5ab3��
~�HMoM�M�w�^�3(entertainment of cutting shapes
out of paper)
�;
��3cf737
rT��t��;Q��h�(a story passed down among people since
ancient times)
����0f73c1�MtKlh�pw�(a true story)
����101156���Zt^��h�(a story for children)
Figure 2: Examples of Senses whose Hypernyms are “3” (entertainment) or “�”(story)
0efb60�U/�/��/w/s
:/�/�b/�
(a word representing the number of
competitions or contests)
Figure 3: Definition Sentence of the sense
0efb60 of the word “���”
chat) itself does not occur in the training cor-
pus.
As described later, we train the probabilistic
model that predicts a hypernym of a given word,
instead of a word sense. Much more training
data will be available to train the model pre-
dicting hypernyms rather than the model pre-
dicting senses, because there are fewer types
of hypernyms than of senses. Figure 2 illus-
trates this fact clearly: all words labeled with
10d9a4, 3c3fbb and 3c5ab3 in the training data
can be used as the data labeled with the hy-
pernym “3” (entertainment). In this way,
we can train a reliable WSD classifier for low
frequency words. Furthermore, hypernyms will
be automatically extracted from definition sen-
tences, as described in Subsection 3.2, so that
the model can be automatically trained without
human intervention.
3.2 Extraction of Hypernyms
In this subsection, we will describe how to ex-
tract hypernyms from definition sentences in a
dictionary. In principle, we assume that the hy-
pernym of a sense is the last word of a defi-
nition sentence. For example, in the definition
sentence of sense 3c5631 of “��” (comic chat),
the last word “3” (entertainment) is the hy-
pernym, as shown in Figure 1. However, we can-
not always regard the last word as a hypernym.
Let us consider the definition of sense 0eb70d of
the word “���”(gˆemu; game). In the EDR
concept dictionary, the expression “A/�/��
b/�”(awordrepresentingA) often appears in
definition sentences. In this case, the hypernym
of the sense is not the last word but A.Thus
the hypernym of 0efb60 is not the last word
“�”(go; word) but “s
:”(kaisuu;number)
in Figure 3.
When we extract a hypernym from a defini-
tion sentence, the definition sentence is first an-
alyzed morphologically (word segmentation and
POS tagging) by ChaSen
3
.Thenahypernym
in a definition sentence is identified by pattern
matching. An example of patterns used here is
the rule extracting A when the expression “A/
�/��b/�” is found in a definition sentence.
We made 64 similar patterns manually in order
to extract hypernyms appropriately.
Out of the 194,303 senses of content words in
the EDR concept dictionary, hypernyms were
extracted for 191,742 senses (98.7%) by our
pattern matching algorithm. Furthermore, we
chose 100 hypernyms randomly and checked
their validity, and found that 96% of the hyper-
nyms were appropriate. Therefore, our method
for extracting hypernyms worked well. The ma-
jor reasons why acquisition of hypernyms failed
were lack of patterns and faults in the morpho-
logical analysis of definition sentences.
3.3 Naive Bayes Model
We will describe the details of our probabilis-
tic model that considers hypernyms in defini-
tion sentences. First of all, let us consider the
following probability:
P(s,c|F)(1)
In (1), s is a sense of a target word, c is a hy-
pernym extracted from the definition sentence
of s,andF is the set of features representing an
input sentence including a target word.
3
ChaSen is the Japanese morphological analyzer.
http://chasen.aist-nara.ac.jp/hiki/ChaSen/
Next, we approximate Equation (1) as (2):
P(s,c|F)=P(s|c,F)P(c|F) similarequal P(s|c)P(c|F)
(2)
The first term, P(s|c,F), is the probabilistic
model that predicts a sense s given a feature
set F (and c). It is similar to the ordinary
Naive Bayes model for WSD (Pedersen, 2000).
However, we assume that this model can not be
trained for low frequency words due to a lack
of training data. Therefore, we approximate
P(s|c,F)toP(s|c).
Using Bayes’ rule, Equation (2) can be com-
puted as follows:
P(s|c)P(c|F)=
P(s)P(c|s)
P(c)
P(c)P(F|c)
P(F)
(3)
=
P(s)P(F|c)
P(F)
(4)
Notice that P(c|s) in (3) is equal to 1, because
ahypernymc of a sense s is uniquely extracted
by pattern matching (Subsection 3.2).
As all we want to do is to choose an s
prime
which
maximizes (4), P(F) can be eliminated:
s
prime
=argmax
s
P(s)P(F|c)
P(F)
(5)
=argmax
s
P(s)P(F|c)(6)
Finally, by the Naive Bayes assumption, that is
all features in F are conditionally independent,
Equation (6) can be approximated as follows:
s
prime
=argmax
s
P(s)
productdisplay
f
i
∈F
P(f
i
|c)(7)
In (7), P(s) is the prior probability of a sense
s which reflects statistics of the appearance of
senses, while P(f
i
|c) is the posterior probability
which reflects collocation statistics between an
individual feature f
i
and a hypernym c.The
parameters of these probabilistic models can be
estimated from the word sense-tagged corpus.
We estimated P(s) by Expected Likelihood Es-
timation and P(f
i
|c) by linear interpolation.
Feature Set
ThefeaturesusedintheNaiveBayesmodelare
almost same as ones used in the SVM classifier
except for the following features:
[Features not used in the Naive Bayes model]
• S(−2),S(+2),P(−2),P(+2)
• S(−2)·S(−1), S(+1)·S(+2), S(−1)·S(+1)
• P(−2)·P(−1), P(+1)·P(+2), P(−1)·P(+1)
• C
sent
,(B
case
;C
noun
)
According to the preliminary experiment, the
accuracy of the Naive Bayes model slightly de-
creased when all features in the SVM classifier
were used. This was the reason why we did not
use the above features.
3.4 Discussion
The following discussion examines our method
for extracting hypernyms from definition sen-
tences.
Multiple Hypernyms
In general, two or more hypernyms can be ex-
tracted from a definition sentence, when the def-
inition of a sense consists of several sentences
or a definition sentence contains a coordinate
structure. However, for this work we extracted
only one hypernym for a sense, because defini-
tions of all senses in the EDR concept dictionary
are described by a single sentence, and most of
them contain no coordinate structure.
In order to apply our model for multiple hy-
pernyms, we must consider the probabilistic
model P(s,C|F) instead of Equation (1), where
C is a set of hypernyms. Unfortunately, the es-
timation of P(s,C|F) is not obvious, so inves-
tigation of this will be done in future.
Ambiguity of hypernyms
The fact that hypernyms may have several
meanings does not appear to be a major prob-
lem, because most hypernyms in definition sen-
tences of a certain dictionary have a single
meaningaccordingtoourroughobservation. So
for this work we ignored the possible ambiguity
of hypernyms.
Using other dictionaries
As described in Subsection 3.2, hypernyms are
extracted by pattern matching. We would have
to rebuild these patterns when we use other dic-
tionaries, but we do not expect to require too
much labor. Generally, in Japanese the last
word in a definition sentence can be regarded
as a hypernym. Furthermore, many extraction
patterns for the EDR concept dictionary may
also be applicable for other dictionaries. We
are already building patterns to extract hyper-
nyms from the other major Japanese dictionary,
the Iwanami Kokugo Jiten, and developing the
WSD system that will use them.
4 Combined Model
The details of two WSD classifiers are described
in the previous two sections: one is the SVM
classifier for high frequency words, and the
other is the Naive Bayes classifier for low fre-
quency words. These two classifiers are com-
bined to construct the robust WSD system. We
developed two kinds of combined models, de-
scribed below in subsections 4.1 and 4.2.
4.1 Simple Ensemble
In this model, the process combining the two
classifiers is quite simple. When only one of
classifiers, SVM or Naive Bayes, outputs senses
for a given word, the combined model outputs
senses provided by that classifier. When both
classifiers output senses, the ones provided by
the SVM classifier are always chosen for the final
output.
In the experiment in Section 5, SVM clas-
sifiers were trained for words which occur more
than 20 times in the training corpus. Therefore,
the simple ensemble described here is summa-
rized as follows: we use the SVM classifier for
high frequency words those which occur more
than 20 times and the Naive Bayes classifier for
the low frequency words.
4.2 Ensemble using Validation Data
First, we prepare validation data, which is a
sense-tagged corpus, as common test data for
the classifiers. The performance of the classi-
fiers for a word w is evaluated by correctness
C
w
, defined by (8).
C
w
=
# of words in which one of the senses
selected by a classifier is correct
# of words for which a classifier selects
one or more senses
(8)
The main reason for combining two classifiers
is to improve the recall and applicability of the
WSD system. Note that a classifier which often
outputs a correct sense would achieve high cor-
rectness C
w
, even though it also outputs wrong
senses. Thus, the higher the C
w
of a classifier,
the more it improves the recall of the combined
model.
Next, the correctness C
w
of each classifier for
each word w is measured on the validation data.
When two classifiers output senses for a given
word, their C
w
scores are compared. Then, the
word senses provided by the better classifier are
selected as the final outputs.
When the number of words in the validation
data is small, comparison of the classifiers’ C
w
is unreliable. For that reason, when the number
of words in the validation data is less that a cer-
tain threshold O
h
, a sense output by the SVM
classifier is chosen for the final output. This is
because the correctness for all words in the vali-
dation data is higher for the SVM classifier than
for the Naive Bayes classifier. In the experiment
in Section 5, we set O
h
to 10.
5 Experiment
In this section, we will describe the experiment
to evaluate our proposed method. We used the
EDR corpus (EDR, 1995) in the experiment. It
is made up of about 200,000 Japanese sentences
extracted from newspaper articles and maga-
zines. In the EDR corpus, each word was an-
notated with a sense ID (CID). We used 20,000
sentences in the EDR corpus as the test data,
20,000 sentence as the validation data, and
the remaining 161,332 sentences as the training
data. The training data was used to train the
SVM classifier and the Naive Bayes classifier,
while the validation data was used for the com-
bined model described in Subsection 4.2. The
target instances used for evaluation were all am-
biguous content words in the test data; the num-
ber of target instances was 91,986.
We evaluated three single WSD classifiers and
two combined models:
• BL
The baseline model. This is the WSD clas-
sifier which always selects the most fre-
quently used sense. When there is more
than one sense with equally high frequency,
the classifier chooses all those senses.
• NB
The Naive Bayes classifier (Section 3).
• SVM
The SVM classifier (Section 2).
• SVM+NB(simple)
The combined model by simple ensemble
(Subsection 4.1).
• SVM+NB(valid)
The combined model using the validation
data (Subsection 4.2).
Table 1 reveals the precision(P), recall(R), F-
measure(F)
4
, applicability(A) and number of
word types(T) of these five classifiers on the test
4 2PR
P+R
where P and R represent the precision and re-
call, respectively.
Table 1: Results of WSD Classifiers
RPFA T
1) .6047 .6036 .6042 .9962 10,310
2) .6274 .6543 .6406 .9568 10,501
3) .6366 .7080 .6704 .8992 4,575
4) .7016 .7010 .7013 .9993 10,592
5) .7050 .7043 .7046 .9993 10,592
1)=BL, 2)=NB, 3)=SVM,
4)=SVM+NB(simple), 5)=SVM+NB(valid)
data. A(applicability) indicates the ratio of the
number of instances disambiguated by a classi-
fier to the total number of target instances; T
indicates the number of word types which could
be disambiguated by a classifier.
The two combined models outperformed the
SVM classifier, for all criteria except precision.
The gains in recall and applicability were espe-
cially remarkable. Notice the figures in column
“T” in Table 1: the SVM classifiers could be ap-
plied only to 4,575 words, while the Naive Bayes
classifiers were applicable to 10,501 words, in-
cluding low frequency words. Thus, the ensem-
ble of these two classifiers would significantly
improve applicability and recall with little loss
of precision.
Comparing the performance of the two com-
bined models, “SVM+NB(validation)” slightly
outperformed “SVM+NB(simple)”, but there
was no significant diﬀerence between them. The
correctness, C
w
, of the SVM classifier on the
validation data was usually greater than that of
the Naive Bayes classifier, so the SVM classifier
was preferred when both were applicable. This
was the almost same strategy for the simple en-
semble, and we think this was the reason why
the performance of two combined models were
almost the same. In the rest of this section, we
will show the results for the combined model
using the validation data only.
Our goal was to improve the robustness of the
WSD system. The naive way to construct a ro-
bust WSD system is to create an ensemble of a
supervised learned classifier and a baseline clas-
sifier. So, we compared our proposed method
(SVM+NB) with the combined model of the
SVM and baseline classifier (SVM+BL). The re-
sults are shown in Table 2 and Figure 4. Table 2
shows the same criteria as in Table 1, indicating
that “SVM+NB” outperformed “SVM+BL” for
all criteria. Figure 4 shows the relation between
the F-measure of the classifiers and word fre-
quency in the training data. The horizontal axis
Table 2: Results of the Combined Models (1)
RPFA T
5) .7050 .7043 .7046 .9993 10,592
6) .6977 .6976 .6977 .9962 10,310
5)=SVM+NB, 6)=SVM+BL
0.2
0.3
0.4
0.5
0.6
0.7
0.8
-0.5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5
0
100
200
300
400
500
600
700
800
F (SVM+NB) F (SVM+BL) NF(o) N(o)
log   o
10
Figure 4: Results of the Combined Models (2)
indicates the occurrence of words in the train-
ingdata(o) in log scale. Squares and triangles
with lines indicates the F(o)ofthe“SVM+NB”
and “SVM+BL”, respectively, where F(o)isthe
macro average of F-measures for words which
occur o times in the training data. The broken
line indicates N(o), the number of word types
which occur o times
5
. For convenience, when
o =0,weplotF(0) and N(0) at x = −0.5
instead of −∞(= log
10
0). As shown in Fig-
ure 4, “SVM+NB” significantly outperformed
“SVM+BL” for low frequency words, and the
number of word types (N(o)) became obviously
greater when o was small. In other words, the
Naive Bayes classifier proposed here could prob-
ably handle many more low frequency words
than the baseline classifier. Therefore, it was
more eﬀective to combine the Naive Bayes clas-
sifier with the SVM classifier rather than the
baseline classifier in order to improve the ro-
bustness of the overall WSD system.
Finally, we constructed a combined model of
all three classifiers, the SVM, Naive Bayes and
baseline classifiers. As shown in Table 3, this
model slightly outperformed the two-classifier
combined models shown in Table 2.
5
To be more accurate, F(o)andN(o) are the figures
for words which occurred more than or equal o times and
less than o+t times, where o+t is the next point at the
horizontal axis. t was chosen as the smallest integer so
that N(o) would be more than 100.
Table 3: Results of SVM+NB+BL
RPFAT
7) .7079 .7066 .7072 1 10,636
7)=SVM+NB+BL
6 Related Work
As described in Section 1, the goal of this
project was to improve robustness of the WSD
system. One of the promising ways to construct
a robust WSD system is unsupervised learn-
ing as with the EM algorithm (Manning and
Sch¨utze, 1999), i.e. training a WSD classifier
from an unlabeled data set. On the other hand,
our approach is to use a machine readable dic-
tionary in addition to a corpus as knowledge
resources for WSD. Notice that we used hyper-
nyms of definition sentences in a dictionary to
train the Naive Bayes classifier, and this pro-
cess worked well for words which did not oc-
cur frequently in the corpus. However, we did
not compare our method and the unsupervised
learning method empirically. This will be one
of our future projects.
Using hypernyms of definition sentences is
similar to using semantic classes derived from a
thesaurus. One of the advantages of our method
is that a thesaurus is not obligatory when word
senses are defined according to a machine read-
able dictionary. Furthermore, our method is
to train the probabilistic model that predicts
a hypernym of a word, while most previous ap-
proachesusesemanticclassesasfeatures(i.e.,
the condition of the posterior probability in
the case of the Naive Bayes model). In facts,
we also use features associated with semantic
classes derived from the thesaurus, C
sent
and
(B
case
;C
noun
), as described in Section 2.
Several previous studies have used both a
corpus and a machine readable dictionary for
WSD (Litkowski, 2002; Rigau et al., 1997;
Stevenson and Wilks, 2001). The diﬀerence
between those methods and ours is the way
we use information derived from the dictionary
for WSD. Training the probabilistic model that
predicts a hypernym in a dictionary is our own
approach. However, these various methods are
not in competition with our method. In fact,
the robustness of the WSD system would be
even more improved by combining these meth-
ods with that described in this paper.
7Conclusion
This paper has proposed a method to develop
a robust WSD system. We combined a WSD
classifier obtained by supervised learning for
high frequency words and a classifier using hy-
pernyms in definition sentences in a dictionary
for low frequency words. Experimental results
showed that both recall and applicability were
remarkably improved with our method. In fu-
ture, we plan to investigate the optimum way to
combine these two classifiers or to train a single
probabilistic model using hypernyms in defini-
tion sentences, which is suitable for both high
and low frequency words.

References

EDR. 1995. EDR electronic dictionary technical
guide (second edition). Technical Report TR–045,
Japan Electronic Dictionary Research Institute.

Satoshi Ikehara et al. 1997. Nihongo Goi Taikei (in
Japanese). Iwanami Shoten, Publishers.

Hand Li and Jun-ichi Takeuchi. 1997. Using evi-
dence that is both strong and reliable in Japanese
homograph disambiguation. In SIG-NL, Informa-
tion Processing Society of Japan, pages 53–59.

Kenneth C. Litkowski. 2002. Sense information
for disambiguation: Confluence of supervised
and unsupervised methods. In Proceedings of the
SIGLEX/SENSEVAL Workshop on Word Sense
Disambiguation, pages 47–53.

Christopher D. Manning and Hinrich Sch¨utze, 1999.
Foundations of Statistical Natural Language Pro-
cessing, chapter 7. MIT Press.

Masaki Murata, Masao Utiyama, Kiyotaka Uchi-
moto, Qing Ma, and Hitoshi Isahara. 2001.
Japanese word sense disambiguation using the
simple bayes and support vector machine meth-
ods. In Proceedings of the SENSEVAL-2, pages
135–138.

Ted Pedersen. 2000. A simple approach to build-
ing ensembles of naive baysian classifiers for
word sense disambiguation. In Proceedings of the
NAACL, pages 63–69.

German Rigau, Jordi Atserias, and Eneko Agirre.
1997. Combining unsupervised lexical knowledge
methods for word sense disambiguation. In Pro-
ceedings of the ACL, pages 48–55.

Bernhard Sch¨olkopf. 2000. New support vector al-
gorithms. Neural Computation, 12:1083–1121.

Mark Stevenson and Yorick Wilks. 2001. The inter-
action of knowledge sources in word sense disam-
biguation. Computational Linguistics, 27(3):321–349.

Hiroya Takamura, Hiroyasu Yamada, Taku Kudoh,
Kaoru Yamamoto, and Yuji Matsumoto. 2001.
Ensembling based on feature space restructuring
with application to WSD. In Proceedings of the
NLPRS, pages 41–48.
