In: Proceedings of CoNLL-2000 and LLL-2000, pages 31-36, Lisbon, Portugal, 2000. 
A Comparison between Supervised Learning Algorithms for Word 
Sense Disambiguation* 
Gerard Escudero and Lluis Mhrquez and German Rigau 
TALP Research Center. LSI Department. Universitat Polit~cnica de Catalunya (UPC) 
Jordi Girona Salgado 1-3. E-08034 Barcelona. Catalonia {escudero, lluism, g.rigau}@Isi.upc.es 
Abstract 
This paper describes a set of comparative exper- 
iments, including cross-corpus evaluation, be- 
tween five alternative algorithms for supervised 
Word Sense Disambiguation (WSD), namely 
Naive Bayes, Exemplar-based learning, SNOW, 
Decision Lists, and Boosting. Two main conclu- 
sions can be drawn: 1) The LazyBoosting algo- 
rithm outperforms the other four state-of-the- 
art algorithms in terms of accuracy and ability 
to tune to new domains; 2) The domain depen- 
dence of WSD systems seems very strong and 
suggests that some kind of adaptation or tun- 
ing is required for cross-corpus application. 
1 Introduction 
Word Sense Disambiguation (WSD) is the prob- 
lem of assigning the appropriate meaning (or 
sense) to a given word in a text or discourse. 
Resolving the ambiguity of words is a central 
problem for large scale  understanding 
applications and their associate tasks (Ide and 
V4ronis, 1998). Besides, WSD is one of the most 
important open problems in NLP. Despite the 
wide range of approaches investigated (Kilgar- 
rift and Rosenzweig, 2000) and the large effort 
devoted to tackling this problem, to date, no 
large-scale broad-coverage and highly accurate 
WSD system has been built. 
One of the most successful current lines of 
research is the corpus-based approach, in which 
statistical or Machine Learning (M L) algorithms 
have been applied to learn statistical models 
or classifiers from corpora in order to per- 
* This research has been partially funded by the Spanish 
Research Department (CICYT's project TIC98-0423- 
C06), by the EU Commission (NAMIC I8T-1999-12392), 
and by the Catalan Research Department (CIRIT's 
consolidated research group 1999SGR-150 and CIRIT's 
grant 1999FI 00773). 
form WSD. Generally, supervised approaches 
(those that learn from previously semantically 
annotated corpora) have obtained better results 
than unsupervised methods on small sets of se- 
lected ambiguous words, or artificial pseudo- 
words. Many standard M L algorithms for su- 
pervised learning have been applied, such as: 
Decision Lists (Yarowsky, 1994; Agirre and 
Martinez, 2000), Neural Networks (Towell and 
Voorhees, 1998), Bayesian learning (Bruce and 
Wiebe, 1999), Exemplar-based learning (Ng, 
1997), Boosting (Escudero et al., 2000a), etc. 
Further, in (Mooney, 1996) some of the previ- 
ous methods are compared jointly with Decision 
Trees and Rule Induction algorithms, on a very 
restricted domain. 
Although some published works include the 
comparison between some alternative algo- 
rithms (Mooney, 1996; Ng, 1997; Escudero et 
al., 2000a; Escudero et al., 2000b), none of 
them addresses the issue of the portability of 
supervised ML algorithms for WSD, i.e., testing 
whether the accuracy of a system trained on 
a certain corpus can be extrapolated to other 
corpora or not. We think that the study of the 
domain dependence of WSD --in the style of 
other studies devoted to parsing (Sekine, 1997; 
Ratnaparkhi, 1999)-- is needed to assure the 
validity of the supervised approach, and to de- 
termine to which extent a tuning pre-process is 
necessary to make real WSD systems portable. 
In this direction, this work compares five differ- 
ent M L algorithms and explores their portability 
and tuning ability by training and testing them 
on different corpora. 
2 Learning Algorithms Tested 
Naive-Bayes (NB). Naive Bayes is intended 
as a simple representative of statistical learning 
methods. It has been used in its most classi- 
31 
cal setting (Duda and Hart, 1973). That is, 
assuming the independence of features, it clas- 
sifies a new example by assigning the class that 
maximizes the conditional probability of the 
class given the observed sequence of features 
of that example. Model probabilities are esti- 
mated during the training process using relative 
frequencies. To avoid the effect of zero counts, a 
very simple smoothing technique has been used, 
which was proposed in (Ng, 1997). 
Despite its simplicity, Naive Bayes is claimed 
to obtain state-of-the-art accuracy on super- 
vised WSD in many papers (Mooney, 1996; Ng, 
1997; Leacock et al., 1998). 
Exemplar-based Classifier (EB). In exem- 
plar, instance, or memory-based learning (Aha 
et al., 1991) no generalization of training ex- 
amples is performed. Instead, the examples are 
simply stored in memory and the classification 
of new examples is based on the most similar 
stored exemplars. In our implementation, all 
examples are kept in memory and the classifica- 
tion is based on a k-NN (Nearest-Neighbours) 
algorithm using Hamming distance to measure 
closeness. For k's greater than 1, the resulting 
sense is the weighted majority sense of the k 
nearest neighbours --where each example votes 
its sense with a strength proportional to its 
closeness to the test example. 
Exemplar-based learning is said to be the 
best option for WSD (Ng, 1997). Other au- 
thors (Daelemans et al., 1999) point out that 
exemplar-based methods tend to be superior in 
 learning problems because they do not 
forget exceptions. 
The SNoW Architecture (SN). SNoWis a 
Sparse Network of linear separators which uti- 
lizes the Winnow learning algorithm 1. In the 
SNo W architecture there is a winnow node for 
each class, which learns to separate that class 
from all the rest. During training, which is per- 
formed in an on-line fashion, each example is 
considered a positive example for the winnow 
node associated to its class and a negative ex- 
ample for all the others. A key point that allows 
a fast learning is that the winnow nodes are not 
connected to all features but only to those that 
1The Winnow algorithm (Littlestone, 1988) consists 
of a linear threshold algorithm with multiplicative weight 
updating for 2-class problems. 
are "relevant" for their class. When classify- 
ing a new example, SNo W is similar to a neural 
network which takes the input features and out- 
puts the class with the highest activation. Our 
implementation of SNo W for WSD is explained 
in (Escudero et al., 2000c). 
SNoW is proven to perform very well in 
high dimensional NLP problems, where both the 
training examples and the target function reside 
very sparsely in the feature space (Roth, 1998), 
e.g: context-sensitive spelling correction, POS 
tagging, PP-attachment disambiguation, etc. 
Decision Lists (DL). In this setting, a Deci- 
sion List is a list of features extracted from the 
training examples and sorted by a log-likelihood 
measure. This measure estimates how strong a 
particular feature is as an indicator of a specific 
sense (Yarowsky, 1994). When testing, the deci- 
sion list is checked in order and the feature with 
the highest weight that matches the test exam- 
ple is used to select the winning word sense. 
Thus, only the single most reliable piece of ev- 
idence is used to perform disambiguation. Re- 
garding the details of implementation (smooth- 
ing, pruning of the decision list, etc.) we have 
followed (Agirre and Martinez, 2000). 
Decision Lists were one of the most success- 
ful systems on the 1st Senseval competition for 
WSD (Kilgarriff and Rosenzweig, 2000). 
LazyBoosting (LB). The main idea of boost- 
ing algorithms is to combine many simple and 
moderately accurate hypotheses (weak classi- 
fiers) into a single, highly accurate classifier. 
The weak classifiers are trained sequentially 
and, conceptually, each of them is trained on the 
examples which were most difficult to classify by 
the preceding weak classifiers. These weak hy- 
potheses are then linearly combined into a single 
rule called the combined hypothesis. 
Schapire and Singer's real AdaBoost.MH al- 
gorithm for multiclass multi-label classifica- 
tion (Schapire and Singer, 1999) has been used. 
It constructs a combination of very simple 
weak hypotheses that test the value of a single 
boolean predicate and make a real-valued pre- 
diction based on that value. LazyBoosting (Es- 
cudero et al., 2000a) is a simple modification 
of the AdaBoost.MH algorithm, which consists 
in reducing the feature space that is explored 
when learning each weak classifier. This mod- 
ification significantly increases the efficiency of 
32 
the learning process with no loss in accuracy. 
3 Setting 
A number of comparative experiments has been 
carried out on a subset of 21 highly ambiguous 
words of the DSO corpus, which is a semanti- 
cally annotated English corpus collected by Ng 
and colleagues (Ng and Lee, 1996). Each word 
is treated as a different classification problem. 
The 21 words comprise 13 nouns (age, art, body, 
car, child, cost, head, interest, line, point, state, 
thing, work) and 8 verbs (become, fall, grow, lose, 
set, speak, strike, tell), which frequently appear 
in the WSD literature. The average number of 
senses per word is close to 10 and the number 
of training examples is around 1,000. 
The DSO corpus contains sentences from two 
different corpora, namely Wall Street Journal 
(WSJ) and Brown Corpus (BC). Therefore, it is 
easy to perform experiments about the porta- 
bility of systems by training them on the WSJ 
part (A part, hereinafter) and testing them on 
the BC part (B part, hereinafter), or vice-versa. 
Two kinds of information are used to train 
classifiers: local and topical context. Let 
... " be ~ W-3 W--2 W--1 W W-i_ 1 W+2 W+3... 
the context of consecutive words around the 
word w to be disambiguated, and P±i (-3 < 
i < 3) be the part-of-speech tag of word 
w±i. Attributes referring to local context 
are the following 15: P-3, P-2, P-l, P+I, 
P+2, P+3, w-l, W-t-1 , (W-2, W-1), (W-i,W+I), 
(W+I,W+2), (W-3, W--2, W--1), (W-2, W-i,W+I), 
(W--l, W+i , W+2), and (W+l, w+2, w+3), where 
the last seven correspond to collocations of two 
and three consecutive words. The topical con- 
text is formed by cl,..., Cm, which stand for the 
unordered set of open class words appearing in 
the sentence 2. Details about how the different 
algorithms translate this information into fea- 
tures can be found in (Escudero et al., 2000c). 
4 Comparing the five approaches 
The five algorithms, jointly with a naive Most- 
Frequent-sense Classifier (MFC), have been 
tested, by 10-fold cross validation, on 7 different 
combinations of training-test sets 3. Accuracy 
2This set of attributes corresponds to that used in (Ng 
and Lee, 1996), with the exception of the morphology of 
the target word and the verb-object syntactic relation. 
3The combinations of training-test sets are called: 
A+B-A+B, A-I-B-A, A+B-B, A-A, B-B, A-B, and B-A, 
figures, micro-averaged over the 21 words and 
over the ten folds, are reported in table 1. The 
comparison leads to the following conclusions: 
As expected, the five algorithms significantly 
outperform the baseline M FC classifier. Among 
them, three groups can be observed: Ni3, DL, 
and SN perform similarly; LB outperforms all 
the other algorithms in all experiments; and EB 
is somewhere in between. The difference be- 
tween \[B and the rest is statistically significant 
in all cases except when comparing \[B to the EB 
approach in the case marked with an asterisk 4. 
Extremely poor results are observed when 
testing the portability of the systems. Restrict- 
ing to LB results, it can be observed that the 
accuracy obtained in A-B is 47.1%, while the 
accuracy in B-B (which can be considered an 
upper bound for LB in B corpus) is 59.0%, that 
is, that there is a difference of 12 points. Fur- 
thermore, 47.1% is only slightly better than the 
most frequent sense in corpus B, 45.5%. 
Apart from accuracy figures, the comparison 
between the predictions made by the five meth- 
ods on the test sets provides interesting infor- 
mation about the relative behaviour of the algo- 
rithms. Table 2 shows the agreement rates and 
the Kappa statistics 5 between all pairs of meth- 
ods in the A+B-A+B experiment. Note that 
'DSO' stands for the annotation of DSO corpus, 
which is taken as the correct one. 
It can be observed that N B obtains the most 
similar results with regard to M FC in agreement 
and Kappa values. The agreement ratio is 74%, 
that is, almost 3 out of 4 times it predicts the 
most frequent sense. On the other extreme, LB 
obtains the most similar results with regard to 
DSO in agreement and Kappa values, and it has 
the least similar with regard to M FC, suggesting 
respectively. In this notation, the training set is placed 
on the left hand side of symbol "-", while the test set 
is on the right hand side. For instance, A-B means that 
the training set is corpus A and the test set is corpus B. 
The symbol "+" stands for set union. 
4Statistical tests of significance applied: McNemar's 
test and 10-fold cross-validation paired Student's t-test 
at a confidence value of 95% (Dietterich, 1998). 
~The Kappa statistic (Cohen, 1960) is a better mea- 
sure of inter-annotator agreement which reduces the ef- 
fect of chance agreement. It has been used for measur- 
ing inter-annotator agreement during the construction 
of semantic annotated corpora (V~ronis, 1998; Ng et al., 
1999). A Kappa value of 1 indicates perfect agreement, 
while 0.8 is considered as indicating good agreement. 
33 
Accuracy (%) 
LazyBoosting 
A+B-A+B A+B-A A+B-B A-A B-B 
MFC 46.55±o.71 53.90±2.ol 39.21±i.9o 55.94±Mo 45.52±1.27 
Naive Bayes 61.55±1.o4 67.25±1.o7 55.85±1.81 65.86±1.11 56.80±1.12 
Decision Lists 61.58±o.98 67.64±0.94 55.53±1.85 67.57±1.44 56.56±1.59 
SNoW 60.92±1.o9 65.57±1.33 56.28±1.1o 67.12±1.16 56.13±1.23 
Exemplar-based 63.01±o.93 69.08±1.66 56.97±1.22 68.98±1.o6 57.36±1.68 
66.32±1.34 71.79±1.51 60.85~L81 71,26±1.i5 58.96±1.86 
A-B B-A 
36.40 38.71 
41.38 47.66 
43.01 48.83 
44.07 49.76 
45.32 51.13 
47.10 51.99" 
Table 1: Accuracy results (=h standard deviation) of the methods on all training-test combinations 
A+B-A+B 
DSO MFC NB EB SN DL LB 
DSO -- 46.6 61.6 63.0 60.9 61.6 66.3 
MFC -0.19 -- 73.9 60.0 55.9 64.9 54.9 
NB 0.24 -0.09 -- 76.3 74.5 76.8 71.4 
EB 0.36 -0.15 0.44 -- 69.6 70.7 72.5 
SN 0.36 -0.17 0.44 0.44 -- 67.5 69.0 
DL 0.32 -0.13 0.40 0.41 0.38 -- 69.9 
LB 0.44 -0.17 0.37 0.50 0.46 0.42 -- 
Table 2: Kappa statistic (below diagonal) and 
% of agreement (above diagonal) between all 
methods in the A+B-A+B experiment 
that LB is the algorithm that better learns the 
behaviour of the DSO examples. 
In absolute terms, the Kappa values are very 
low. But, as it is suggested in (Vdronis, 1998), 
evaluation measures should be computed rela- 
tive to the agreement between the human an- 
notators of the corpus and not to a theoreti- 
cal 100%. It seems pointless to expect more 
agreement between the system and the refer- 
ence corpus than between the annotators them- 
selves. Contrary to the intuition that the agree- 
ment between human annotators should be very 
high in the WSD task, some papers report sur- 
prisingly low figures. For instance, (Ng et al., 
1999) reports an accuracy rate of 56.7% and a 
Kappa value of 0.317 when comparing the anno- 
tation of a subset of the DSO corpus performed 
by two independent research groups. From this 
perspective, the Kappa value of 0.44 achieved 
by LB in A+B-A+B could be considered an ex- 
cellent result. Unfortunately, the subset of the 
\[:)SO corpus studied by (Ng et al., 1999) and 
that used in this report are not the same and, 
thus, a direct comparison is not possible. 
4.1 About the tuning to new domains 
This experiment explores the effect of a sim- 
ple tuning process consisting in adding to the 
original training set A a relatively small sample 
of manually sense-tagged examples of the new 
domain B. The size of this supervised portion 
varies from 10% to 50% of the available corpus 
in steps of 10% (the remaining 50% is kept for 
testing) 6. This experiment will be referred to 
as A+%B-B T. In order to determine to which 
extent the original training set contributes to 
accurately disambiguating in the new domain, 
we also calculate the results for %B-B, that is, 
using only the tuning corpus for training. 
Figure 1 graphically presents the results ob- 
tained by all methods. Each plot contains the 
A+%B-B and %B-B curves, and the straight 
lines corresponding to the lower bound MFC, 
and to the upper bounds B-B and A+B-B. 
As expected, the accuracy of all methods 
grows (towards the upper bound) as more tun- 
ing corpus is added to the training set. How- 
ever, the relation between A+%B-B and %B-B 
reveals some interesting facts. In plots (c) and 
(d), the contribution of the original training cor- 
pus is null, while in plots (a) and (b), a degrada- 
tion onthe accuracy is observed. Summarizing, 
these results suggest that for NB, DL, SN, and 
EB methods it is not worth keeping the original 
training examples. Instead, a better (but dis- 
appointing) strategy would be simply using the 
tuning corpus. However, this is not the situa- 
tion of LB --plot (d)-- for which a moderate, 
but consistent, improvement of accuracy is ob- 
served when retaining the original training set. 
6Tuning examples can be weighted more highly than 
the training examples to force the learning algorithm to 
adapt more quickly to the new corpus. Some experi- 
ments in this direction revealed that slightly better re- 
sults can be obtained, though the improvement was not 
statistically significant. 
7The converse experiment B-F%A-A is not reported 
in this paper due to space limitations. Results can be 
found in (Escudero et al., 2000c). 
34 
58 
56 
54 
g 52 
al 
~ 50 
~ 46 
44 
42 
40 
(a) Naive Bayes 
........................................................... M~S o 
........................................................... ~ .................. B...B ..... 
A+B-B ........ 
A+%B-B ~- 
%B-B ..... .. 
56 
~ 54 
o 52 
~ 48 
46 
44 
58 
56 
54 
16 
44 
42 
(b) Decision Lists 
................... G ................... o ................... ~ ............ AYS:B::i::: .... 
A+%B-B 
%B-B ..... ' 
, , , , , 
5 10 15 20 25 30 35 40 45 50 5 10 15 
(d) SNoW 
58 
E::E::E.~:E.~:E.~:EZ::Z~:E.~:Z.Z..~YE::LL:::Y.:~.'S::47724:L:; 
B-B ..... Zoi~:~ '°~ 
5 10 15 20 25 30 35 40 45 50 
0 20 25 30 35 40 45 50 
62 
60 
58 
~ 56 
~ 52 
48 
46 
44 
(c) Exemplar Based 
58 
MFS 
56 =::=:==::=Q:~=:::=:==:::a=:====::==~=::=:::=:=:Q-~,~=--~---~ A+B-B ........ 
A+%B-B 
54 %B-B ....... 
52 
50 ..... ~ 
46 /o , , , 
' ' ' 44 ' ' ' ' ' ' ' ' ' 
5 10 15 20 25 30 35 40 45 50 
(e) Lazygoosting 
.................. o ................... ~ ................... ~ ................ MF$--~-- 
B-B ...... 
...................................................... A~'B-B ....... 
A+%B-B ~-- 
%B-B ..... 
/ / 
i / 
5 10 15 20 25 30 35 40 45 50 
Figure 1: Results of the tuning experiment 
We observed that part of the poor results 
obtained is explained by: 1) corpus A and 
B have a very different distribution of senses, 
and, therefore, different a-priori biases; further- 
more, 2) examples of corpus A and B contain 
different information and, therefore, the learn- 
ing algorithms acquire different (and non inter- 
changeable) classification clues from both cor- 
pora. The study of the rules acquired by Lazy- 
Boosting from WSJ and BC helped understand- 
ing the differences between corpora. On the one 
hand, the type of features used in the rules was 
significantly different between corpora and, ad- 
ditionally, there were very few rules that applied 
to both sets. On the other hand, the sign of the 
prediction of many of these common rules was 
somewhat contradictory between corpora. See 
(Escudero et al., 2000c) for details. 
4.2 About the training data quality 
The observation of the rules acquired by Lazy- 
Boosting could also help improving data quality 
in a semi-supervised fashion. It is known that 
mislabelled examples resulting from annotation 
errors tend to be hard examples to classify cor- 
rectly and, therefore, tend to have large weights 
in the final distribution. This observation al- 
lows both to identify the noisy examples and 
use LazyBoosting as a way to improve the train- 
ing corpus. 
A preliminary experiment has been carried 
out in this direction by studying the rules ac- 
quired by LazyBoosting from the training ex- 
amples of the word state. The manual revi- 
sion, by four different people, of the 50 high- 
est scored rules, allowed us to identify 28 noisy 
training examples. 11 of them were clear tag- 
ging errors, and the remaining 17 were not co- 
herently tagged and very difficult to judge, since 
the four annotators achieved systematic dis- 
agreement (probably due to the extremely fine 
grained sense definitions involved in these ex- 
amples). 
5 Conclusions 
This work reports a comparative study of five 
ML algorithms for WSD, and provides some re- 
sults on cross corpora evaluation and domain 
re-tuning. 
Regarding portability, it seems that the per- 
formance of supervised sense taggers is not 
guaranteed when moving from one domain to 
another (e.g. from a balanced corpus, such 
as BC, to an economic domain, such as WSJ). 
35 
These results imply that some kind of adap- 
tation is required for cross-corpus application. 
Consequently, it is our belief that a number of 
issues regarding portability, tuning, knowledge 
acquisition, etc., should be thoroughly studied 
before stating that the supervised M k paradigm 
is able to resolve a realistic WSD problem. 
Regarding the ML algorithms tested, kazy- 
Boosting emerges as the best option, since 
it outperforms the other four state-of-the-art 
methods in all experiments. Furthermore, this 
algorithm shows better properties when tuned 
to new domains. Future work is planned for 
an extensive evaluation of kazyBoosting on the 
WSD task. This would include taking into ac- 
count additional/alternative attributes, learn- 
ing curves, testing the algorithm on other cor- 
pora, etc. 

References 
E. Agirre and D. Martinez. 2000. Decision Lists and 
Automatic Word Sense Disambiguation. In Pro- 
ceedings of the COLING Workshop on Semantic 
Annotation and Intelligent Content. 
D. Aha, D. Kibler, and M. Albert. 1991. Instance- 
based Learning Algorithms. Machine Learning, 
7:37-66. 
R. F. Bruce and J. M. Wiebe. 1999. Decomposable 
Modeling in Natural Language Processing. Com- 
putational Linguistics, 25(2):195-207. 
J. Cohen. 1960. A Coefficient of Agreement for 
Nominal Scales. Journal of Educational and Psy- 
chological Measurement, 20:37-46. 
W. Daelemans, A. van den Bosch, and J. Zavrel. 
1999. Forgetting Exceptions is Harmful in Lan- 
guage Learning. Machine Learning, 34:11-41. 
T. G. Dietterich. 1998. Approximate Statistical 
Tests for Comparing Supervised Classification 
Learning Algorithms. Neural Computation, 10(7). 
R. O. Duda and P. E. Hart. 1973. Pattern Classifi- 
cation and Scene Analysis. Wiley ~: Sons. 
G. Escudero, L. M~rquez, and G. Rigau. 2000a. 
Boosting Applied to Word Sense Disambiguation. 
In Proceedings of the 12th European Conference 
on Machine Learning, ECML. 
G. Escudero, L. M~rquez, and G. Rigau. 2000b. 
Naive Bayes and Exemplar-Based Approaches to 
Word Sense Disambiguation Revisited. In Pro- 
ceedings of the 14th European Conference on Ar- 
tificial Intelligence, ECAL 
G. Escudero, L. M~rquez, and G. Rigau. 2000c. On 
the Portability and Tuning of Supervised Word 
Sense Disambiguation Systems. Research Report 
LSI-00-30-R, Software Department (LSI). Techni- 
cal University of Catalonia (UPC). 
N. Ide and J. V@ronis. 1998. Introduction to the 
Special Issue on Word Sense Disambiguation: 
The State of the Art. Computational Linguistics, 
24(1):1-40. 
A. Kilgarriff and J. Rosenzweig. 2000. English SEN- 
SEVAL: Report and Results. In Proceedings of the 
2nd International Conference on Language Re- 
sources and Evaluation, LREC. 
C. Leacock, M. Chodorow, and G. A. Miller. 1998. 
Using Corpus Statistics and WordNet Relations 
for Sense Identification. Computational Linguis- 
tics, 24(1):147-166. 
N. Littlestone. 1988. Learning Quickly when Ir- 
relevant Attributes Abound. Machine Learning, 
2:285-318. 
R. J. Mooney. 1996. Comparative Experiments on 
Disambiguating Word Senses: An Illustration of 
the Role of Bias in Machine Learning. In Proceed- 
ings of the 1st Conference on Empirical Methods 
in Natural Language Processing, EMNLP. 
H. T. Ng and H. B. Lee. 1996. Integrating Multiple 
Knowledge Sources to Disambiguate Word Senses: 
An Exemplar-based Approach. In Proceedings of 
the 3~th Annual Meeting of the ACL. 
H. T. Ng, C. Lim, and S. Foo. 1999. A Case Study 
on Inter-Annotator Agreement for Word Sense 
Disambiguation. In Procs. of the ACL SIGLEX 
Workshop: Standardizing Lexical Resources. 
H. T. Ng. 1997. Exemplar-Base Word Sense Disam- 
biguation: Some Recent Improvements. In Procs. 
of the 2nd Conference on Empirical Methods in 
Natural Language Processing, EMNLP. 
A. Ratnaparkhi. 1999. Learning to Parse Natural 
Language with Maximum Entropy Models. Ma- 
chine Learning, 34:151-175. 
D. Roth. 1998. Learning to Resolve Natural Lan- 
guage Ambiguities: A Unified Approach. In Pro- 
ceedings of the National Conference on Artificial 
Intelligence, AAAI '98. 
R. E. Schapire and Y. Singer. 1999. Improved 
Boosting Algorithms Using Confidence-rated Pre- 
dictions. Machine Learning, 37(3):297-336. 
S. Sekine. 1997. The Domain Dependence of Pars- 
ing. In Proceedings of the 5th Conference on Ap- 
plied Natural Language Processing, ANLP. 
G. Towell and E. M. Voorhees. 1998. Disambiguat- 
ing Highly Ambiguous Words. Computational 
Linguistics, 24(1):125-146. 
J. V@ronis. 1998. A study of polysemy judgements 
and inter-annotator agreement. In Programme 
and advanced papers of the Senseval workshop, 
Herstmonceux Castle, England. 
D. Yarowsky. 1994. Decision Lists for Lexical Ambi- 
guity Resolution: Application to Accent Restora- 
tion in Spanish and French. In Proceedings of the 
32nd Annual Meeting of the ACL. 
