An Empirical Study of the Domain Dependence of Supervised 
Word Sense Disambiguation Systems* 
Gerard Escudero, Lluís Màrquez and German Rigau 
TALP Research Center. LSI Department. Universitat Politècnica de Catalunya (UPC) 
Jordi Girona Salgado 1-3. E-08034 Barcelona. Catalonia 
{escudero,lluism,g.rigau}@lsi.upc.es 
Abstract 
This paper describes a set of experiments car- 
ried out to explore the domain dependence 
of alternative supervised Word Sense Disam- 
biguation algorithms. The aim of the work is 
threefold: studying the performance of these 
algorithms when tested on a corpus different 
from the one they were trained on; explor- 
ing their ability to tune to new domains; 
and demonstrating empirically that the Lazy- 
Boosting algorithm outperforms state-of-the- 
art supervised WSD algorithms in both previ- 
ous situations. 
Keywords: Cross-corpus evaluation of NLP 
systems, Word Sense Disambiguation, Super- 
vised Machine Learning 
1 Introduction 
Word Sense Disambiguation (WSD) is the 
problem of assigning the appropriate meaning 
(sense) to a given word in a text or discourse. 
Resolving the ambiguity of words is a central 
problem for large-scale language understand- 
ing applications and their associated tasks (Ide 
and Véronis, 1998), e.g., machine transla- 
tion, information retrieval, reference resolu- 
tion, parsing, etc. 
WSD is one of the most important open 
problems in NLP. Despite the wide range of 
approaches investigated and the large effort 
devoted to tackle this problem, to date, no 
large-scale broad-coverage and highly accu- 
rate WSD system has been built --see the 
main conclusions of the first edition of Sen- 
sEval (Kilgarriff and Rosenzweig, 2000). 
One of the most successful current lines 
of research is the corpus-based approach in 
* This research has been partially funded by the Span- 
ish Research Department (CICYT's project TIC98- 
0423-C06). by the EU Commission (NAMIC IST- 
1999-12392), and by the Catalan Research Depart- 
ment (CIRIT's consolidated research group 1999SGR- 
150 and CIRIT's grant 1999FI 00773). 
which statistical or Machine Learning (ML) al- 
gorithms are applied to learn statistical mod- 
els or classifiers from corpora in order to per- 
form WSD. Generally, supervised approaches 1 
have obtained better results than unsuper- 
vised methods on small sets of selected am- 
biguous words, or artificial pseudo-words. 
Many standard ML algorithms for supervised 
learning have been applied, such as: Decision 
Lists (Yarowsky, 1994; Agirre and Martinez, 
2000), Neural Networks (Towell and Voorhees, 
1998), Bayesian learning (Bruce and Wiebe, 
1999), Exemplar-Based learning (Ng, 1997a; 
Fujii et al., 1998), Boosting (Escudero et al., 
2000a), etc. Unfortunately, there have been 
very few direct comparisons between alterna- 
tive methods for WSD. 
In general, supervised learning presumes 
that the training examples are somehow re- 
flective of the task that will be performed by 
the trainee on other data. Consequently, the 
performance of such systems is commonly es- 
timated by testing the algorithm on a separate 
part of the set of training examples (say 10- 
20% of them), or by N-fold cross-validation, 
in which the set of examples is partitioned into 
N disjoint sets (or folds), and the training- 
test procedure is repeated N times using all 
combinations of N-1 folds for training and 1 
fold for testing. In both cases, test examples 
are different from those used for training, but 
they belong to the same corpus, and, there- 
fore, they are expected to be quite similar. 
Although this methodology could be valid 
for certain NLP problems, such as English 
Part-of-Speech tagging, we think that there 
exists reasonable evidence to say that, in 
WSD, accuracy results cannot be simply ex- 
trapolated to other domains (contrary to the 
opinion of other authors (Ng, 1997b)): On the 
1 Supervised approaches, also known as data-driven 
or corpus-driven, are those that learn from a previ- 
ously semantically annotated corpus. 
one hand, WSD is very dependent on the do- 
main of application (Gale et al., 1992b) --see 
also (Ng and Lee, 1996; Ng, 1997a), in which 
quite different accuracy figures are obtained 
when testing an exemplar-based WSD classi- 
fier on two different corpora. On the other 
hand, it does not seem reasonable to think 
that the training material is large and repre- 
sentative enough to cover "all" potential types 
of examples. 
To date, a thorough study of the domain 
dependence of WSD --in the style of other 
studies devoted to parsing (Sekine, 1997)-- 
has not been carried out. We think that such 
a study is needed to assess the validity of 
the supervised approach, and to determine to 
what extent a tuning process is necessary to 
make real WSD systems portable. In order 
to corroborate the previous hypotheses, this 
paper explores the portability and tuning of 
four different ML algorithms (previously ap- 
plied to WSD) by training and testing them 
on different corpora. 
Additionally, supervised methods suffer 
from the "knowledge acquisition bottle- 
neck" (Gale et al., 1992a). (Ng, 1997b) esti- 
mates that the manual annotation effort nec- 
essary to build a broad coverage semantically 
annotated English corpus is about 16 person- 
years. This overhead for supervision could be 
much greater if a costly tuning procedure is 
required before applying any existing system 
to each new domain. 
Due to this fact, recent works have focused 
on reducing the acquisition cost as well as the 
need for supervision in corpus-based methods. 
It is our belief that the research by (Leacock et 
al., 1998; Mihalcea and Moldovan, 1999)2 pro- 
vides enough evidence towards the "opening" 
of the bottleneck in the near future. For that 
reason, it is worth further investigating the 
robustness and portability of existing super- 
vised ML methods to better resolve the WSD 
problem. 
It is important to note that the focus of 
this work will be on the empirical cross- 
corpus evaluation of several ML supervised al- 
gorithms. Other important issues, such as: 
selecting the best attribute set, discussing an 
appropriate definition of senses for the task, 
etc., are not addressed in this paper. 
2 In the line of using lexical resources and search en- 
gines to automatically collect training examples from 
large text collections or the Internet. 
This paper is organized as follows: Section 2 
presents the four ML algorithms compared. 
In section 3 the setting is presented in de- 
tail, including the corpora and the experimen- 
tal methodology used. Section 4 reports the 
experiments carried out and the results ob- 
tained. Finally, section 5 concludes and out- 
lines some lines for further research. 
2 Learning Algorithms Tested 
2.1 Naive-Bayes (NB) 
Naive Bayes is intended as a simple represen- 
tative of statistical learning methods. It has 
been used in its most classical setting (Duda 
and Hart, 1973). That is, assuming indepen- 
dence of features, it classifies a new example 
by assigning the class that maximizes the con- 
ditional probability of the class given the ob- 
served sequence of features of that example. 
Model probabilities are estimated during the 
training process using relative frequencies. To 
avoid the effect of zero counts when esti- 
mating probabilities, a very simple smooth- 
ing technique has been used, which was pro- 
posed in (Ng, 1997a). Despite its simplicity, 
Naive Bayes is claimed to obtain state-of-the- 
art accuracy on supervised WSD in many pa- 
pers (Mooney, 1996; Ng, 1997a; Leacock et 
al., 1998). 
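The classifier just described can be sketched as follows. This is an illustrative implementation, not the authors' code: the class name, the feature representation (a list of attribute-value strings), and the exact smoothing constant are assumptions; only the relative-frequency estimation and the idea of substituting a small value for zero counts (following Ng, 1997a) come from the text.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Classical Naive Bayes for WSD: choose the sense that maximizes
    P(sense) * prod_i P(feature_i | sense), with probabilities estimated
    from relative frequencies over the training examples."""

    def __init__(self, smoothing=0.1):
        # Small constant substituted for zero counts; the exact scheme
        # of (Ng, 1997a) is not reproduced here.
        self.smoothing = smoothing

    def train(self, examples):
        # examples: list of (features, sense) pairs
        self.sense_counts = Counter(sense for _, sense in examples)
        self.feature_counts = defaultdict(Counter)
        for features, sense in examples:
            for f in features:
                self.feature_counts[sense][f] += 1
        self.total = len(examples)

    def classify(self, features):
        best_sense, best_logprob = None, float("-inf")
        for sense, n in self.sense_counts.items():
            logprob = math.log(n / self.total)          # prior P(sense)
            for f in features:
                count = self.feature_counts[sense][f]
                # relative frequency, smoothed when the count is zero
                logprob += math.log((count or self.smoothing) / n)
            if logprob > best_logprob:
                best_sense, best_logprob = sense, logprob
        return best_sense
```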
2.2 Exemplar-based Classifier (EB) 
In Exemplar-based learning (Aha et al., 1991) 
no generalization of training examples is per- 
formed. Instead, the examples are stored 
in memory and the classification of new ex- 
amples is based on the classes of the most 
similar stored examples. In our implemen- 
tation, all examples are kept in memory and 
the classification of a new example is based 
on a k-NN (Nearest-Neighbours) algorithm 
using Hamming distance 3 to measure close- 
ness (in doing so, all examples are examined). 
For k's greater than 1, the resulting sense is 
the weighted majority sense of the k near- 
est neighbours --where each example votes its 
sense with a strength proportional to its close- 
ness to the test example. 
In the experiments explained in section 4, 
the EB algorithm is run several times using 
different number of nearest neighbours (1, 3, 
3 Although the use of the MVDM metric (Cost and 
Salzberg, 1993) could lead to better results, current 
implementations have prohibitive computational over- 
heads (Escudero et al., 2000b). 
5, 7, 10, 15, 20 and 25) and the results corre- 
sponding to the best choice are reported 4. 
Exemplar-based learning is said to be the 
best option for WSD (Ng, 1997a). Other au- 
thors (Daelemans et al., 1999) point out that 
exemplar-based methods tend to be superior 
in language learning problems because they 
do not forget exceptions. 
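A minimal sketch of this classification rule, assuming fixed-length symbolic feature vectors; the tie-breaking and the exact closeness weighting are illustrative choices, since the text only states that each neighbour votes with a strength proportional to its closeness:

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length feature vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def knn_classify(train, example, k=3):
    """Weighted k-NN over stored examples. `train` is a list of
    (feature_vector, sense) pairs; all of them are examined, as in the
    paper's implementation. Each of the k nearest neighbours votes for
    its sense with a weight that decreases with its Hamming distance
    (an assumed weighting function)."""
    neighbours = sorted(train,
                        key=lambda ex: hamming_distance(ex[0], example))[:k]
    votes = {}
    for feats, sense in neighbours:
        weight = 1.0 / (1.0 + hamming_distance(feats, example))
        votes[sense] = votes.get(sense, 0.0) + weight
    # resulting sense = weighted majority among the k neighbours
    return max(votes, key=votes.get)
```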
2.3 Snow: A Winnow-based Classifier 
Snow stands for Sparse Network Of Winnows, 
and it is intended as a representative of on- 
line learning algorithms. 
The basic component is the Winnow al- 
gorithm (Littlestone, 1988). It consists of a 
linear threshold algorithm with multiplicative 
weight updating for 2-class problems, which 
learns very fast in the presence of many bi- 
nary input features. 
In the Snow architecture there is a winnow 
node for each class, which learns to separate 
that class from all the rest. During training, 
each example is considered a positive example 
for the winnow node associated with its class and 
a negative example for all the rest. A key 
point that allows a fast learning is that the 
winnow nodes are not connected to all features 
but only to those that are "relevant" for their 
class. When classifying a new example, Snow 
is similar to a neural network which takes the 
input features and outputs the class with the 
highest activation. 
Snow has been shown to perform very well in 
high dimensional domains, where both the 
training examples and the target function re- 
side very sparsely in the feature space (Roth, 
1998), e.g: text categorization, context- 
sensitive spelling correction, WSD, etc. 
In this paper, our approach to WSD using 
Snow follows that of (Escudero et al., 2000c). 
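The architecture can be sketched schematically as below: one Winnow node per sense, with weights created lazily so that a node is only "connected" to features it has actually seen. The threshold and the promotion/demotion factors are conventional illustrative values, not the parameters of the actual Snow system.

```python
class WinnowNode:
    """One Winnow linear-threshold unit (Littlestone, 1988) for a single
    sense: multiplicative weight updates, promoting active features on
    false negatives and demoting them on false positives."""

    def __init__(self, threshold=1.0, promote=2.0, demote=0.5):
        self.threshold = threshold
        self.promote = promote
        self.demote = demote
        self.weights = {}                 # sparse: feature -> weight

    def activation(self, features):
        # Features never connected to this node contribute nothing.
        return sum(self.weights.get(f, 0.0) for f in features)

    def update(self, features, is_positive):
        predicted_positive = self.activation(features) > self.threshold
        if is_positive and not predicted_positive:
            for f in features:            # promotion (weights start at 1)
                self.weights[f] = self.weights.get(f, 1.0) * self.promote
        elif not is_positive and predicted_positive:
            for f in features:            # demotion
                self.weights[f] = self.weights.get(f, 1.0) * self.demote

def snow_classify(nodes, features):
    """Snow-style decision: output the sense whose node is most activated."""
    return max(nodes, key=lambda sense: nodes[sense].activation(features))
```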
2.4 LazyBoosting (LB) 
The main idea of boosting algorithms is to 
combine many simple and moderately accu- 
rate hypotheses (called weak classifiers) into 
a single, highly accurate classifier. The weak 
classifiers are trained sequentially and, con- 
ceptually, each of them is trained on the ex- 
amples which were most difficult to classify 
by the preceding weak classifiers. These weak 
4 In order to construct a real EB-based system for 
WSD, the k parameter should be estimated by cross- 
validation using only the training set (Ng, 1997a); 
however, in our case, this cross-validation inside the 
cross-validation involved in the testing process would 
generate a prohibitive overhead. 
hypotheses are then linearly combined into a 
single rule called the combined hypothesis. 
More particularly, Schapire and Singer's 
real AdaBoost.MH algorithm for multi- 
class multi-label classification (Schapire and 
Singer, to appear) has been used. As in that 
paper, very simple weak hypotheses are used. 
They test the value of a boolean predicate and 
make a real-valued prediction based on that 
value. The predicates used, which are the bi- 
narization of the attributes described in sec- 
tion 3.2, are of the form "f = v", where f is a 
feature and v is a value (e.g., "previous_word 
= hospital"). Each weak rule uses a single 
feature, and, therefore, they can be seen as 
simple decision trees with one internal node 
(testing the value of a binary feature) and two 
leaves corresponding to the yes/no answers to 
that test. 
LazyBoosting (Escudero et al., 2000a), is a 
simple modification of the AdaBoost.MH al- 
gorithm, which consists of reducing the fea- 
ture space that is explored when learning each 
weak classifier. More specifically, a small pro- 
portion p of attributes are randomly selected 
and the best weak rule is selected only among 
them. The idea behind this method is that 
if the proportion p is not too small, probably 
a sufficiently good rule can be found at each 
iteration. Besides, the chance for a good rule 
to appear in the whole learning process is very 
high. Another important characteristic is that 
no attribute needs to be discarded and, thus, 
the risk of eliminating relevant attributes is 
avoided. The method seems to work quite well 
since no important degradation is observed in 
performance for values of p greater than or 
equal to 5% (this may indicate that there are many 
irrelevant or highly dependent attributes in 
the WSD domain). Therefore, this modifica- 
tion significantly increases the efficiency of the 
learning process (empirically, up to 7 times 
faster) with no loss in accuracy. 
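The feature-sampling idea can be illustrated with a deliberately simplified variant. The paper uses Schapire and Singer's real-valued AdaBoost.MH; the sketch below instead uses discrete two-class AdaBoost with one-feature stumps, keeping only the "lazy" step of examining a random proportion p of the features at each round. All names, the set-of-features input encoding, and the {-1, +1} label encoding are our own assumptions.

```python
import math
import random

def lazy_boost_train(X, y, n_rounds=20, p=0.2, seed=0):
    """Simplified LazyBoosting sketch. X: list of sets of active boolean
    features; y: labels in {-1, +1}. Each round, only a random fraction
    `p` of the feature space is examined when choosing the weak rule."""
    rng = random.Random(seed)
    feats = sorted(set().union(*X))
    n = len(X)
    w = [1.0 / n] * n                       # distribution over examples
    ensemble = []                           # list of (feature, alpha)
    for _ in range(n_rounds):
        sample = rng.sample(feats, max(1, int(p * len(feats))))
        # weighted error of the stump "predict +1 iff f is active"
        def err(f):
            return sum(wi for wi, xi, yi in zip(w, X, y)
                       if (1 if f in xi else -1) != yi)
        # pick the sampled stump whose error is farthest from 1/2
        f = max(sample, key=lambda g: abs(0.5 - err(g)))
        e = min(max(err(f), 1e-6), 1 - 1e-6)
        alpha = 0.5 * math.log((1 - e) / e)  # negative if anti-correlated
        ensemble.append((f, alpha))
        # re-weight: difficult (misclassified) examples gain weight
        w = [wi * math.exp(-alpha * yi * (1 if f in xi else -1))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def lazy_boost_classify(ensemble, x):
    """Combined hypothesis: sign of the weighted sum of weak rules."""
    score = sum(a * (1 if f in x else -1) for f, a in ensemble)
    return 1 if score >= 0 else -1
```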
3 Setting 
3.1 The DSO Corpus 
The DSO corpus is a semantically annotated 
corpus containing 192,800 occurrences of 121 
nouns and 70 verbs, corresponding to the most 
frequent and ambiguous English words. This 
corpus was collected by Ng and colleagues (Ng 
and Lee, 1996) and it is available from the 
Linguistic Data Consortium (LDC) 5. 
5 LDC address: http://www.ldc.upenn.edu/ 
The DSO corpus contains sentences from 
two different corpora, namely Wall Street 
Journal (WSJ) and Brown Corpus (BC). 
Therefore, it is easy to perform experiments 
about the portability of alternative systems 
by training them on the WSJ part and testing 
them on the BC part, or vice-versa. Here- 
inafter, the WSJ part of DSO will be referred 
to as corpus A, and the BC part to as corpus B. 
At the word level, we force the number of exam- 
ples of corpora A and B to be the same6 in order 
to have symmetry and allow the comparison 
in both directions. 
From these corpora, a group of 21 words 
which frequently appear in the WSD litera- 
ture has been selected to perform the com- 
parative experiments (each word is treated 
as a different classification problem). These 
words are 13 nouns (age, art, body, car, child, 
cost, head, interest, line, point, state, thing, 
work) and 8 verbs (become, fall, grow, lose, 
set, speak, strike, tell). Table 1 contains in- 
formation about the number of examples, the 
number of senses, and the percentage of the 
most frequent sense (MFS) of these reference 
words, grouped by nouns, verbs, and all 21 
words. 
3.2 Attributes 
Two kinds of information are used to perform 
disambiguation: local and topical context. 
Let "... w-3 w-2 w-1 w w+1 w+2 w+3 ..." 
be the context of consecutive words around 
the word w to be disambiguated, and p±i 
(-3 ≤ i ≤ 3) be the part-of-speech tag 
of word w±i. Attributes referring to local 
context are the following 15: p-3, p-2, 
p-1, p+1, p+2, p+3, w-1, w+1, (w-2, w-1), 
(w-1, w+1), (w+1, w+2), (w-3, w-2, w-1), 
(w-2, w-1, w+1), (w-1, w+1, w+2), and 
(w+1, w+2, w+3), where the last seven cor- 
respond to collocations of two and three 
consecutive words. 
The topical context is formed by Cl,..., Cm, 
which stand for the unordered set of open class 
words appearing in the sentence 7. 
The four methods tested translate this 
information into features in different ways. 
Snow and LB algorithms require binary fea- 
6 This is achieved by randomly reducing the size of 
the largest corpus to the size of the smallest. 
7 The set of attributes described above contains 
those used in (Ng and Lee, 1996), with the 
exception of the morphology of the target word and 
the verb-object syntactic relation. 
tures. Therefore, local context attributes have 
to be binarized in a preprocess, while the top- 
ical context attributes remain as binary tests 
about the presence/absence of a concrete word 
in the sentence. As a result, the number of 
attributes is expanded to several thousands 
(from 1,764 to 9,900 depending on the partic- 
ular word). 
The binary representation of attributes is 
not appropriate for NB and EB algorithms. 
Therefore, the 15 local-context attributes are 
taken straightforwardly. Regarding the binary 
topical-context attributes, we have used the 
variants described in (Escudero et al., 2000b). 
For EB, the topical information is codified as 
a single set-valued attribute (containing all 
words appearing in the sentence) and the cal- 
culation of closeness is modified so as to han- 
dle this type of attribute. For NB, the top- 
ical context is conserved as binary features, 
but when classifying new examples only the 
information of words appearing in the exam- 
ple (positive information) is taken into ac- 
count. In that paper, these variants are called 
positive Exemplar-based (PEB) and positive 
Naive Bayes (PNB), respectively. PNB and 
PEB algorithms are empirically proven to per- 
form much better in terms of accuracy and 
efficiency in the WSD task. 
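As a concrete illustration, the attribute set of this section and its binarization for Snow and LB might be sketched as follows. The out-of-range marker "_", the attribute names, and the open-class test on POS-tag initials are representational assumptions, not details given in the paper.

```python
def extract_attributes(tokens, tags, i):
    """Build the local and topical context attributes of section 3.2
    for the target word at position `i`. `tokens` and `tags` hold the
    words and POS tags of the sentence; positions falling outside the
    sentence yield the (assumed) marker "_"."""
    def w(j):
        return tokens[i + j] if 0 <= i + j < len(tokens) else "_"
    def p(j):
        return tags[i + j] if 0 <= i + j < len(tags) else "_"
    local = {
        "p-3": p(-3), "p-2": p(-2), "p-1": p(-1),
        "p+1": p(1), "p+2": p(2), "p+3": p(3),
        "w-1": w(-1), "w+1": w(1),
        # collocations of two and three consecutive words
        "coll(-2,-1)": (w(-2), w(-1)), "coll(-1,+1)": (w(-1), w(1)),
        "coll(+1,+2)": (w(1), w(2)),
        "coll(-3,-2,-1)": (w(-3), w(-2), w(-1)),
        "coll(-2,-1,+1)": (w(-2), w(-1), w(1)),
        "coll(-1,+1,+2)": (w(-1), w(1), w(2)),
        "coll(+1,+2,+3)": (w(1), w(2), w(3)),
    }
    # Topical context: unordered set of open-class words in the sentence
    # (approximated here by Penn-style tags starting with N, V, J, or R).
    topical = {tok for tok, tag in zip(tokens, tags)
               if tag and tag[0] in "NVJR" and tok != tokens[i]}
    return local, topical

def binarize(local, topical):
    """Binary feature set for Snow/LazyBoosting: one 'attribute=value'
    test per local attribute plus one presence test per topical word."""
    feats = {f"{name}={value}" for name, value in local.items()}
    feats |= {f"topic={word}" for word in topical}
    return feats
```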
3.3 Experimental Methodology 
The comparison of algorithms has been per- 
formed in series of controlled experiments us- 
ing exactly the same training and test sets. 
There are 7 combinations of training-test sets 
called: A+B-A+B, A+B-A, A+B-B, A-A, B- 
B, A-B, and B-A, respectively. In this nota- 
tion, the training set is placed at the left hand 
side of symbol "-", while the test set is at the 
right hand side. For instance, A-B means that 
the training set is corpus A and the test set 
is corpus B. The symbol "+" stands for set 
union, therefore A+B-B means that the train- 
ing set is A union B and the test set is B. 
When comparing the performance of two al- 
gorithms, two different statistical tests of sig- 
nificance have been applied depending on the 
case. A-B and B-A combinations represent a 
single training-test experiment. In these cases, 
the McNemar's test of significance is used 
(with a confidence value of χ²1,0.95 = 3.842), 
which is proven to be more robust than a sim- 
ple test for the difference of two proportions. 
In the other combinations, a 10-fold cross- 
validation was performed in order to prevent 
               examples           corpus A: senses   corpus A: MFS (%)   corpus B: senses   corpus B: MFS (%)
          min   max   avg     min  max  avg      min   max   avg      min  max  avg      min   max   avg
nouns     122   714   420      2    24   7.7     37.9  90.7  59.8      3    24   8.8     21.0  87.7  45.3
verbs     101   741   369      4    13   8.9     20.8  81.6  49.3      4    14  11.4     28.0  71.7  46.3
all       101   741   401      2    24   8.1     20.8  90.7  56.1      3    24   9.8     21.0  87.7  45.6

Table 1: Information about the set of 21 words of reference. 
testing on the same material used for training. 
In these cases, accuracy/error rate figures re- 
ported in section 4 are averaged over the re- 
sults of the 10 folds. The associated statistical 
test of significance is a paired Student's t-test 
with a confidence value of t9,0.975 = 2.262. 
Information about both statistical tests can 
be found at (Dietterich, 1998). 
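The McNemar statistic used for the single train-test combinations can be computed as below. The continuity-corrected form follows the test discussed in (Dietterich, 1998); the function name and the boolean-list interface are our own.

```python
def mcnemar(correct_a, correct_b, threshold=3.842):
    """McNemar's test for comparing two classifiers on the same test set.
    `correct_a` and `correct_b` are parallel lists of booleans saying
    whether each system classified example i correctly. The statistic
    (with the standard continuity correction) is compared against
    chi-square(1, 0.95) = 3.842, the threshold used in the paper."""
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    if n01 + n10 == 0:
        return 0.0, False         # the two systems never disagree
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return stat, stat > threshold
```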
4 Experiments 
4.1 First Experiment 
Table 2 shows the accuracy figures of the four 
methods in all combinations of training and 
test sets8. Standard deviation numbers are 
supplied in all cases involving cross-valida- 
tion. MFC stands for a Most-Frequent-sense 
Classifier, that is, a naive classifier that learns 
the most frequent sense of the training set 
and uses it to classify all examples of the test 
set. Averaged results are presented for nouns, 
verbs, and overall, and the best results for 
each case are printed in boldface. 
The following conclusions can be drawn: 
• LB outperforms all other methods in 
all cases. Additionally, this superiority 
is statistically significant, except when 
comparing LB to the PEB approach in the 
cases marked with an asterisk. 
• Surprisingly, LB in A+B-A (or A+B-B) 
does not achieve substantial improvement 
over the results of A-A (or B-B). In fact, 
the first variation is not statistically sig- 
nificant and the second is only slightly 
significant. That is, the addition of extra 
examples from another domain does not 
necessarily contribute to improve the re- 
sults on the original corpus. This effect is 
also observed in the other methods, espe- 
cially in some cases (e.g., Snow in A+B-A 
vs. A-A) in which the joining of both 
training corpora is even counterproduc- 
tive. 
8 The second and third columns correspond to the 
train and test sets used by (Ng and Lee, 1996; Ng, 
1997a). 
• Regarding the portability of the systems, 
very disappointing results are obtained. 
Restricting to LB results, we observe that 
the accuracy obtained in A-B is 47.1% 
while the accuracy in B-B (which can 
be considered an upper bound for LB in 
B corpus) is 59.0%, that is, a drop of 
12 points. Furthermore, 47.1% is only 
slightly better than the most frequent 
sense in corpus B, 45.5%. The compari- 
son in the reverse direction is even worse: 
a drop from 71.3% (A-A) to 52.0% (B- 
A), which is lower than the most frequent 
sense of corpus A, 55.9%. 
4.2 Second Experiment 
The previous experiment shows that classi- 
fiers trained on the A corpus do not work well 
on the B corpus, and vice-versa. Therefore, 
it seems that some kind of tuning process is 
necessary to adapt supervised systems to each 
new domain. 
This experiment explores the effect of a sim- 
ple tuning process consisting of adding to the 
original training set a relatively small sam- 
ple of manually sense-tagged examples of the 
new domain. The size of this supervised por- 
tion varies from 10% to 50% of the available 
corpus in steps of 10% (the remaining 50% is 
kept for testing). This set of experiments will 
be referred to as A+%B-B or, conversely, as 
B+%A-A. 
In order to determine to what extent the 
original training set contributes to accurately 
disambiguate in the new domain, we also cal- 
culate the results for %A-A (and %B-B), that 
is, using only the tuning corpus for training. 
Figure 1 graphically presents the results ob- 
tained by all methods. Each plot contains the 
X+%Y-Y and %Y-Y curves, and the straight 
lines corresponding to the lower bound MFC, 
and to the upper bounds Y-Y and X+Y-Y. 
As expected, the accuracy of all methods 
grows (towards the upper bound) as more tun- 
ing corpus is added to the training set. How- 
ever, the relation between X+%Y-Y and %Y- 
Y reveals some interesting facts. In plots 2a, 
                        Accuracy (%)
Method   POS     A+B-A+B       A+B-A         A+B-B         A-A           B-B           A-B     B-A
MFC      nouns   46.59±1.08    56.68±2.79    36.49±2.41    59.77±1.44    45.28±1.81    33.97   39.46
         verbs   46.49±1.37    48.74±1.98    44.23±2.67    48.85±2.09    45.96±2.60    40.91   37.31
         total   46.55±0.71    53.90±2.01    39.21±1.90    55.94±1.10    45.52±1.27    36.40   38.71
PNB      nouns   62.29±1.25    68.89±0.93    55.69±1.94    66.93±1.44    56.17±1.60    36.62   45.99
         verbs   60.18±1.64    64.21±2.26    56.14±2.79    63.87±1.80    57.97±2.86    50.20   50.75
         total   61.55±1.04    67.25±1.07    55.85±1.81    65.86±1.11    56.80±1.12    41.38   47.66
PEB      nouns   62.66±0.87    69.45±1.51    56.09±1.12    69.38±1.24    56.17±1.80    42.15   50.53
         verbs   63.67±1.94    68.39±3.25    58.58±2.40    68.25±2.84    59.57±2.86    51.19   52.24
         total   63.01±0.93    69.08±1.66    56.97±1.22    68.98±1.06    57.36±1.68    45.32   51.13
Snow     nouns   61.24±1.14    66.36±1.57    56.11±1.45    68.85±1.36    56.55±1.31    42.13   49.96
         verbs   60.35±1.57    64.11±2.76    56.58±2.45    63.91±1.51    55.36±3.27    47.66   49.39
         total   60.92±1.09    65.57±1.33    56.28±1.10    67.12±1.16    56.13±1.23    44.07   49.76
LB       nouns   66.00±1.47    72.09±1.61    59.92±1.93    71.69±1.54    58.33±2.26    43.92   51.28*
         verbs   66.91±2.25    71.23±2.99    62.58±2.93    70.45±2.14*   60.14±3.43*   52.99   53.29*
         total   66.32±1.34    71.79±1.51    60.85±1.81    71.26±1.15    58.96±1.86    47.10   51.99*

Table 2: Accuracy results (± standard deviation) of the methods on all training-test combina- 
tions 
3a, and 1b the contribution of the original 
training corpus is null. Furthermore, in plots 
1a, 2b, and 3b a degradation in accuracy 
is observed. Summarizing, these 
six plots show that for the Naive Bayes, Exemplar- 
Based, and Snow methods it is not worth keep- 
ing the original training examples. Instead, a 
better (but disappointing) strategy would be 
simply using the tuning corpus. 
However, this is not the situation for Lazy- 
Boosting (plots 4a and 4b), for which a mod- 
erate (but consistent) improvement in accu- 
racy is observed when retaining the original 
training set. Therefore, LazyBoosting shows 
again a better behaviour than its competi- 
tors when moving from one domain to an- 
other. 
4.3 Third Experiment 
The bad results regarding portability could be 
explained by, at least, two reasons: 1) Corpora 
A and B have a very different distribution of 
senses and, therefore, different a-priori bi- 
ases; 2) Examples of corpora A and B con- 
tain different information and, therefore, the 
learning algorithms acquire different (and non- 
interchangeable) classification cues from both 
corpora. 
The first hypothesis is confirmed by observ- 
ing the bar plots of figure 2, which contain the 
distribution of the four most frequent senses 
of some sample words in the corpora A and 
B. respectively. In order to check the second 
hypothesis, two new sense-balanced corpora 
have been generated from the DSO corpus, by 
equilibrating the number of examples of each 
sense between the A and B parts. In this way, the 
first difficulty is artificially overridden and the 
algorithms should be portable if examples of 
both parts are quite similar. 
Table 3 shows the results obtained by Lazy- 
Boosting on these new corpora. 
Regarding portability, we observe a signifi- 
cant accuracy decrease of 7 and 5 points from 
A-A to B-A, and from B-B to A-B, respec- 
tively9. That is, even when the same distri- 
bution of senses is conserved between training 
and test examples, the portability of the su- 
pervised WSD systems is not guaranteed. 
These results imply that examples have to 
be largely different from one corpus to an- 
other. By studying the weak rules generated 
by LazyBoosting in both cases, we could cor- 
roborate this fact. On the one hand, the type 
of features used in the rules were significantly 
different between corpora, and, additionally, 
there were very few rules that apply to both 
sets; On the other hand, the sign of the pre- 
diction of many of these common rules was 
somewhat contradictory between corpora. 
9 This loss in accuracy is not as important as in the 
first experiment, due to the simplification provided by 
the balancing of sense distributions. 
[Figure 1 appears here: eight accuracy plots, one per method (1: Naive Bayes, 2: Exemplar-Based, 
3: Snow, 4: LazyBoosting), testing on corpus B (plots 1a-4a) and on corpus A (plots 1b-4b). 
Each plot shows accuracy against the percentage of tuning corpus added (5% to 50%), with the 
X+%Y-Y and %Y-Y curves and the MFC, Y-Y, and X+Y-Y reference lines.] 

Figure 1: Results of the tuning experiment 
5 Conclusions and Further Work 
This work has pointed out some difficulties 
regarding the portability of supervised WSD 
systems, a very important issue that has been 
paid little attention up to the present. 
According to our experiments, it seems that 
the performance of supervised sense taggers is 
not guaranteed when moving from one domain 
to another (e.g. from a balanced corpus, such 
as BC, to an economic domain, such as WSJ). 
These results imply that some kind of adap- 
tation is required for cross-corpus application. 
[Figure 2 appears here: bar plots of sense distributions.] 

Figure 2: Distribution of the four most frequent senses for two nouns (head, interest) and two 
verbs (fall, grow). Black bars = A corpus; grey bars = B corpus 
                        Accuracy (%)
Method   POS     A+B-A+B       A+B-A         A+B-B         A-A           B-B           A-B     B-A
MFC      nouns   48.75±0.91    48.90±1.69    48.61±0.96    48.87±1.68    48.61±0.96    48.99   48.99
         verbs   48.22±1.68    48.22±1.90    48.22±3.06    48.22±1.90    48.22±3.06    48.22   48.22
         total   48.55±1.16    48.64±1.04    48.46±1.21    48.62±1.09    48.46±1.21    48.70   48.70
LB       nouns   62.82±1.43    64.26±2.07    61.38±2.08    63.19±1.65    60.65±1.01    53.45   55.27
         verbs   66.82±1.53    69.33±2.92    64.32±3.27    68.51±2.45    63.49±2.27    60.44   62.55
         total   64.35±1.16    66.20±2.12    62.50±1.47    65.22±1.50    61.74±1.18    56.12   58.05

Table 3: Accuracy results (± standard deviation) of LazyBoosting on the sense-balanced corpora 
Furthermore, these results are in contradic- 
tion with the idea of "robust broad-coverage 
WSD" introduced by (Ng, 1997b), in which a 
supervised system trained on a large enough 
corpus (say a thousand examples per word) 
should provide accurate disambiguation on 
any corpus (or, at least, significantly better 
than MFS). 
Consequently, it is our belief that a number 
of issues regarding portability, tuning, knowl- 
edge acquisition, etc., should be thoroughly 
studied before stating that the supervised ML 
paradigm is able to resolve a realistic WSD 
problem. 
Regarding the ML algorithms tested, the 
contribution of this work consists of empiri- 
cally demonstrating that the LazyBoosting al- 
gorithm outperforms three other state-of-the- 
art supervised ML methods for WSD. Further- 
more, this algorithm is proven to have better 
properties when applied to new domains. 
Further work is planned to be done in the 
following directions: 
• Extensively evaluate LazyBoosting on the 
WSD task. This would include tak- 
ing into account additional/alternative 
attributes and testing the algorithm in 
other corpora --especially on sense-tagged 
corpora automatically obtained from the In- 
ternet or large text collections using non- 
supervised methods (Leacock et al., 1998; 
Mihalcea and Moldovan, 1999). 
• Since most of the knowledge learned from 
a domain is not useful when changing 
to a new domain, further investigation is 
needed on tuning strategies, especially on 
those using non-supervised algorithms. 
• It is known that mislabelled examples re- 
sulting from annotation errors tend to be 
hard examples to classify correctly, and, 
therefore, tend to have large weights in 
the final distribution. This observation 
allows both identifying the noisy exam- 
ples and using LazyBoosting as a way to 
improve data quality. Preliminary exper- 
iments have been already carried out in 
this direction on the DSO corpus. 
• Moreover, the inspection of the rules 
learned by LazyBoosting could provide 
evidence about similar behaviours of a- 
priori different senses. This type of 
knowledge could be useful to perform 
clustering of too fine-grained or artificial 
senses. 

References 
E. Agirre and D. Martinez. 2000. Decision Lists 
and Automatic Word Sense Disambiguation. In 
Proceedings of the COLING Workshop on Se- 
mantic Annotation and Intelligent Content. 
D. Aha, D. Kibler, and M. Albert. 1991. Instance- 
based Learning Algorithms. Machine Learning, 
7:37-66. 
R. F. Bruce and J. M. Wiebe. 1999. Decompos- 
able Modeling in Natural Language Processing. 
Computational Linguistics, 25(2):195-207. 
S. Cost and S. Salzberg. 1993. A weighted nearest 
neighbor algorithm for learning with symbolic 
features. Machine Learning, 10(1):57-78. 
W. Daelemans, A. van den Bosch, and J. Zavrel. 
1999. Forgetting Exceptions is Harmful in Lan- 
guage Learning. Machine Learning, 34:11-41. 
T. G. Dietterich. 1998. Approximate Statisti- 
cal Tests for Comparing Supervised Classifi- 
cation Learning Algorithms. Neural Computa- 
tion, 10(7). 
R. O. Duda and P. E. Hart. 1973. Pattern Classification and Scene Analysis. Wiley. 
G. Escudero, L. Màrquez, and G. Rigau. 2000a. 
Boosting Applied to Word Sense Disam- 
biguation. In Proceedings of the 12th Euro- 
pean Conference on Machine Learning, ECML, 
Barcelona, Spain. 
G. Escudero, L. Màrquez, and G. Rigau. 2000b. Naive Bayes and Exemplar-Based Approaches to Word Sense Disambiguation Revisited. To appear in Proceedings of the 14th European Conference on Artificial Intelligence, ECAI. 
G. Escudero, L. Màrquez, and G. Rigau. 2000c. On the Portability and Tuning of Supervised Word Sense Disambiguation Systems. Research Report LSI-00-30-R, Software Department (LSI), Technical University of Catalonia (UPC). 
A. Fujii, K. Inui, T. Tokunaga, and H. Tanaka. 1998. Selective Sampling for Example-based Word Sense Disambiguation. Computational Linguistics, 24(4):573-598. 
W. Gale, K. W. Church, and D. Yarowsky. 1992a. 
A Method for Disambiguating Word Senses in a 
Large Corpus. Computers and the Humanities, 
26:415-439. 
W. Gale, K. W. Church, and D. Yarowsky. 1992b. 
Estimating Upper and Lower Bounds on the 
Performance of Word Sense Disambiguation. 
In Proceedings of the 30th Annual Meeting of 
the Association for Computational Linguistics. 
ACL. 
N. Ide and J. Véronis. 1998. Introduction to the 
Special Issue on Word Sense Disambiguation: 
The State of the Art. Computational Linguis- 
tics, 24(1):1-40. 
A. Kilgarriff and J. Rosenzweig. 2000. English 
SENSEVAL: Report and Results. In Proceed- 
ings of the 2nd International Conference on 
Language Resources and Evaluation, LREC, 
Athens, Greece. 
C. Leacock, M. Chodorow, and G. A. Miller. 1998. 
Using Corpus Statistics and WordNet Relations 
for Sense Identification. Computational Linguistics, 24(1):147-166. 
N. Littlestone. 1988. Learning Quickly when Irrel- 
evant Attributes Abound. Machine Learning, 
2:285-318. 
R. Mihalcea and D. Moldovan. 1999. An Automatic Method for Generating Sense Tagged 
Corpora. In Proceedings of the 16th National 
Conference on Artificial Intelligence. AAAI 
Press. 
R. J. Mooney. 1996. Comparative Experiments 
on Disambiguating Word Senses: An Illustra- 
tion of the Role of Bias in Machine Learning. 
In Proceedings of the 1st Conference on Empirical Methods in Natural Language Processing, 
EMNLP. 
H. T. Ng and H. B. Lee. 1996. Integrating Multi- 
ple Knowledge Sources to Disambiguate Word 
Sense: An Exemplar-based Approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics. ACL. 
H. T. Ng. 1997a. Exemplar-Based Word Sense Disambiguation: Some Recent Improvements. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, EMNLP. 
H. T. Ng. 1997b. Getting Serious about Word 
Sense Disambiguation. In Proceedings of the 
ACL SIGLEX Workshop: Tagging Text with 
Lexical Semantics: Why, what and how?, Wash- 
ington, USA. 
D. Roth. 1998. Learning to Resolve Natural Lan- 
guage Ambiguities: A Unified Approach. In 
Proceedings of the National Conference on Artificial Intelligence, AAAI'98, July. 
R. E. Schapire and Y. Singer. to appear. Improved 
Boosting Algorithms Using Confidence-rated 
Predictions. Machine Learning. Also appearing 
in Proceedings of the 11th Annual Conference on 
Computational Learning Theory, 1998. 
S. Sekine. 1997. The Domain Dependence of Parsing. In Proceedings of the 5th Conference on 
Applied Natural Language Processing, ANLP, 
Washington DC. ACL. 
G. Towell and E. M. Voorhees. 1998. Disambiguating Highly Ambiguous Words. Computational Linguistics, 24(1):125-146. 
D. Yarowsky. 1994. Decision Lists for Lexical 
Ambiguity Resolution: Application to Accent 
Restoration in Spanish and French. In Proceed- 
ings of the 32nd Annual Meeting of the Associ- 
ation for Computational Linguistics, pages 88- 
95, Las Cruces, NM. ACL. 
