Augmenting Ensemble Classification for Word Sense Disambiguation with a
Kernel PCA Model
Marine CARPUAT Weifeng SU Dekai WU1
marine@cs.ust.hk weifeng@cs.ust.hk dekai@cs.ust.hk
Human Language Technology Center
HKUST
Department of Computer Science
University of Science and Technology, Clear Water Bay, Hong Kong
Abstract
The HKUST word sense disambiguation systems
benefit from a new nonlinear Kernel Principal
Component Analysis (KPCA) based disambigua-
tion technique. We discuss and analyze results
from the Senseval-3 English, Chinese, and Multi-
lingual Lexical Sample data sets. Among an en-
semble of four different kinds of voted models, the
KPCA-based model, along with the maximum en-
tropy model, outperforms the boosting model and
na¨ıve Bayes model. Interestingly, while the KPCA-
based model typically achieves close or better ac-
curacy than the maximum entropy model, neverthe-
less a comparison of predicted classifications shows
that it has a significantly different bias. This char-
acteristic makes it an excellent voter, as confirmed
by results showing that removing the KPCA-based
model from the ensemble generally degrades per-
formance.
1 Introduction
Classifier combination has become a standard ar-
chitecture for shared task evaluations in word sense
disambiguation (WSD), named entity recognition,
and similar problems that can naturally be cast as
classification problems. Voting is the most com-
mon method of combination, having proven to be
remarkably effective yet simple.
A key problem in improving the accuracy of such
ensemble classification systems is to find new vot-
ing models that (1) exhibit significantly different
prediction biases from the models already voting,
and yet (2) attain stand-alone classification accura-
cies that are as good or better. When either of these
conditions is not met, adding the new voting model
typically degrades the accuracy of the ensemble in-
stead of helping it.
In this work, we investigate the potential of one
promising new disambiguation model with respect
1The author would like to thank the Hong Kong Research
Grants Council (RGC) for supporting this research in part
through research grants RGC6083/99E, RGC6256/00E, and
DAG03/04.EG09.
to augmenting our existing ensemble combining a
maximum entropy model, a boosting model, and
a na¨ıve Bayes model—a combination representing
some of the best stand-alone WSD models cur-
rently known. The new WSD model, proposed
by Wu et al. (2004), is a method for disambiguat-
ing word senses that exploits a nonlinear Kernel
Principal Component Analysis (KPCA) technique.
That the KPCA-based model could potentially be
a good candidate for a new voting model is sug-
gested by Wu et al.’s empirical results showing that
it yielded higher accuracies on Senseval-2 data sets
than other models that included maximum entropy,
na¨ıve Bayes, and SVM based models.
In the following sections, we begin with a de-
scription of the experimental setup, which utilizes
a number of individual classifiers in a voting en-
semble. We then describe the KPCA-based model
to be added to the baseline ensemble. The accuracy
results of the three submitted models are examined,
and also the individual voting models are compared.
Subsequently, we analyze the degree of difference
in voting bias of the KPCA-based model from the
others, and finally show that this does indeed usu-
ally lead to accuracy gains in the voting ensemble.
2 Experimental setup
2.1 Tasks evaluated
We performed experiments on the following lexical
sample tasks from Senseval-3:
English (fine). The English lexical sample task
includes 57 target words (32 verbs, 20 nouns and
5 adjectives). For each word, training and test in-
stances tagged with WordNet senses are provided.
There are an average of 8.5 senses per target word
type, ranging from 3 to 23. On average, 138 training
instances per target word are available.
English (coarse). This modified evaluation of the
preceding task employs a sense map that groups
fine-grained sense distinctions into the same coarse-
grained sense.
Chinese. The Chinese lexical sample task in-
cludes 21 target words. For each word, several
                                             Association for Computational Linguistics
                        for the Semantic Analysis of Text, Barcelona, Spain, July 2004
                 SENSEVAL-3: Third International Workshop on the Evaluation of Systems
senses are defined using the HowNet knowledge
base. There are an average of 3.95 senses per tar-
get word type, ranging from 2 to 8. Only about 37
training instances per target word are available.
Multilingual (t). The Multilingual (t) task is de-
fined similarly to the English lexical sample task,
except that the word senses are the translations into
Hindi, rather than WordNet senses. The Multilin-
gual (t) task requires finding the Hindi sense for 31
English target word types. There are an average of
7.54 senses per target word type, ranging from 3 to
16. A relatively large training set is provided (more
than 260 training instances per word on average).
Multilingual (ts). The Multilingual (ts) task uses
a different data set of 10 target words and provides
the correct English sense of the target word for both
training and testing. There are an average of 6.2
senses per target word type, ranging from 3 to 11.
The training set for this subtask was smaller, with
about 150 training instances per target word.
2.2 Ensemble classification
The WSD models presented here consist of ensem-
bles utilizing various combinations of four voting
models, as follows. Some of these component mod-
els were also evaluated on other Senseval-3 tasks:
the Basque, Catalan, Italian, and Romanian Lexical
Sample tasks (Wicentowski et al., 2004), as well as
Semantic Role Labeling (Ngai et al., 2004).
The first voting model, a na¨ıve Bayes model, was
built as Yarowsky and Florian (2002) found this
model to be the most accurate classifier in a compar-
ative study on a subset of Senseval-2 English lexical
sample data.
The second voting model, a maximum entropy
model (Jaynes, 1978), was built as Klein and Man-
ning (2002) found that it yielded higher accuracy
than na¨ıve Bayes in a subsequent comparison of
WSD performance. However, note that a different
subset of either Senseval-1 or Senseval-2 English
lexical sample data was used.
The third voting model, a boosting model (Fre-
und and Schapire, 1997), was built as boosting has
consistently turned in very competitive scores on re-
lated tasks such as named entity classification (Car-
reras et al., 2002)(Wu et al., 2002). Specifically, we
employed an AdaBoost.MH model (Schapire and
Singer, 2000), which is a multi-class generalization
of the original boosting algorithm, with boosting on
top of decision stump classifiers (decision trees of
depth one).
The fourth voting model, the KPCA-based
model, is described below.
All classifier models were selected for their abil-
ity to able to handle large numbers of sparse fea-
tures, many of which may be irrelevant. More-
over, the maximum entropy and boosting models are
known to be well suited to handling features that are
highly interdependent.
2.3 Controlled feature set
In order to facilitate a controlled comparison across
the individual voting models, the same feature set
was employed for all classifiers. The features are
as described by Yarowsky and Florian (2002) in
their “feature-enhanced na¨ıve Bayes model”, with
position-sensitive, syntactic, and local collocational
features.
2.4 The KPCA-based WSD model
We briefly summarize the KPCA-based model here;
for full details including illustrative examples and
graphical interpretation, please refer to Wu et al.
(2004).
Kernel PCA Kernel Principal Component Analy-
sis is a nonlinear kernel method for extracting non-
linear principal components from vector sets where,
conceptually, the n-dimensional input vectors are
nonlinearly mapped from their original space Rn
to a high-dimensional feature space F where linear
PCA is performed, yielding a transform by which
the input vectors can be mapped nonlinearly to a
new set of vectors (Sch¨olkopf et al., 1998).
As with other kernel methods, a major advantage
of KPCA over other common analysis techniques is
that it can inherently take combinations of predic-
tive features into account when optimizing dimen-
sionality reduction. For WSD and indeed many nat-
ural language tasks, significant accuracy gains can
often be achieved by generalizing over relevant fea-
ture combinations (see, e.g., Kudo and Matsumoto
(2003)). A further advantage of KPCA in the con-
text of the WSD problem is that the dimensionality
of the input data is generally very large, a condition
where kernel methods excel.
Nonlinear principal components (Diamantaras
and Kung, 1996) are defined as follows. Suppose
we are given a training set of M pairs (xt;ct) where
the observed vectors xt 2 Rn in an n-dimensional
input space X represent the context of the target
word being disambiguated, and the correct class ct
represents the sense of the word, for t = 1;::;M.
Suppose  is a nonlinear mapping from the input
space Rn to the feature space F. Without loss of
generality we assume the M vectors are centered
vectors in the feature space, i.e., PMt=1  (xt) = 0;
uncentered vectors can easily be converted to cen-
tered vectors (Sch¨olkopf et al., 1998). We wish to
diagonalize the covariance matrix in F:
C = 1M
MX
j=1
 (xj)  T (xj) (1)
To do this requires solving the equation  v =
Cv for eigenvalues   0 and eigenvectors v 2
Rnnf0g. Because
Cv = 1M
MX
j=1
( (xj)  v) (xj) (2)
we can derive the following two useful results. First,
 ( (xt)  v) =  (xt)  Cv (3)
for t = 1;::;M. Second, there exist  i for i =
1;:::;M such that
v =
MX
i=1
 i (xi) (4)
Combining (1), (3), and (4), we obtain
M 
MX
i=1
 i ( (xt)   (xi ))
=
MX
i=1
 i( (xt)  
MX
j=1
 (xj))( (xj)   (xi ))
for t = 1;::;M. Let ^K be the M  M matrix such
that ^
Kij =  (xi)   (xj) (5)
and let ^ 1  ^ 2  :::  ^ M denote the eigenval-
ues of ^K and ^ 1 ,..., ^ M denote the corresponding
complete set of normalized eigenvectors, such that
^ t(^ t ^ t) = 1 when ^ t > 0. Then the lth nonlinear
principal component of any test vector xt is defined
as
ylt =
MX
i=1
^ li ( (xi)   (xt )) (6)
where ^ li is the lth element of ^ l .
See Wu et al. (2004) for a possible geometric in-
terpretation of the power of the nonlinearity.
WSD using KPCA In order to extract nonlin-
ear principal components efficiently, first note that
in both Equations (5) and (6) the explicit form of
 (xi) is required only in the form of ( (xi)  
 (xj)), i.e., the dot product of vectors in F. This
means that we can calculate the nonlinear princi-
pal components by substituting a kernel function
k(xi;xj) for ( (xi)   (xj )) in Equations (5) and
(6) without knowing the mapping  explicitly; in-
stead, the mapping  is implicitly defined by the
kernel function. It is always possible to construct
a mapping into a space where k acts as a dot prod-
uct so long as k is a continuous kernel of a positive
integral operator (Sch¨olkopf et al., 1998).
Thus we train the KPCA model using the follow-
ing algorithm:
1. Compute an M  M matrix ^K such that
^Kij = k(xi;xj) (7)
2. Compute the eigenvalues and eigenvectors of
matrix ^K and normalize the eigenvectors. Let
^ 1  ^ 2  :::  ^ M denote the eigenvalues
and ^ 1,..., ^ M denote the corresponding com-
plete set of normalized eigenvectors.
To obtain the sense predictions for test instances,
we need only transform the corresponding vectors
using the trained KPCA model and classify the re-
sultant vectors using nearest neighbors. For a given
test instance vector x, its lth nonlinear principal
component is
ylt =
MX
i=1
^ lik(xi;xt) (8)
where ^ li is the ith element of ^ l.
For our disambiguation experiments we employ a
polynomial kernel function of the form k(xi;xj) =
(xi  xj)d, although other kernel functions such as
gaussians could be used as well. Note that the de-
generate case of d = 1 yields the dot product kernel
k(xi;xj) = (xi xj) which covers linear PCA as a
special case, which may explain why KPCA always
outperforms PCA.
3 Results and discussion
3.1 Accuracy
Table 1 summarizes the results of the submitted sys-
tems along with the individual voting models. Since
our models attempted to disambiguate all test in-
stances, we report accuracy (precision and recall be-
ing equal). Earlier experiments on Senseval-2 data
showed that the KPCA-based model significantly
outperformed both na¨ıve Bayes and maximum en-
tropy models (Wu et al., 2004). On the Senseval-
3 data, the maximum entropy model fares slightly
better: it remains significantly worse on the Multi-
lingual (ts) task, but achieves statistically the same
accuracy on the English (fine) task and is slightly
Table 1: Comparison of accuracy results for various HKUST ensemble and individual models on Senseval-
3 Lexical Sample tasks, confirming the high accuracy of the KPCA-based model. All test instances were
attempted. (Bold model names were the systems entered.)
English
(fine)
English
(coarse)
Chinese Multilingual
(t)
Multilingual
(ts)
HKUST comb2 (me, boost, nb, kpca) 71.4 78.6 66.2 62.0 63.8
HKUST comb (me, boost, kpca) 70.9 78.1 66.5 61.4 63.8
HKUST me 69.3 76.4 64.4 60.6 60.8
HKUST kpca 69.2 - 63.6 60.0 63.3
HKUST boost 67.0 - 64.1 57.3 60.3
HKUST nb 64.3 - 60.4 57.3 56.8
Table 2: Confusion matrices showing that the KPCA-based model votes very differently from the other
models on the Senseval-3 Lexical Sample tasks. Percentages representing disagreement between KPCA and
other voting models are shown in bold.
kpca vs: me boost nb
task incorrect correct incorrect correct incorrect correct
English incorrect 24.14% 6.62% 21.60% 9.15% 21.04% 9.71%
(fine) correct 6.59% 62.65% 11.38% 57.86% 14.63% 54.61%
Chinese incorrect 24.01% 12.40% 22.96% 13.46% 26.65% 9.76%
correct 11.61% 51.98% 12.93% 50.66% 12.93% 50.66%
Multilingual incorrect 32.71% 7.33% 32.04% 8.01% 30.54% 9.51%
(t) correct 6.74% 53.22% 10.63% 49.33% 12.20% 47.75%
Multilingual incorrect 33.17% 3.52% 31.66% 5.03% 30.15% 6.53%
(ts) correct 6.03% 57.29% 8.04% 55.28% 13.07% 50.25%
more accurate on the Multilingual (t) task. For un-
known reasons—possibly the very small number of
training instances per Chinese target word, as men-
tioned earlier—there is an exception on the Chinese
task, where boosting outperforms the KPCA-based
model. We are investigating the possible causes.
The na¨ıve Bayes model remains significantly worse
under all conditions.
3.2 Differentiated voting bias
For a new voting model to raise the accuracy of an
existing classifier ensemble, it is not only important
that the new voting model achieve accuracy compa-
rable to the other voters, as shown above, but also
that it provides a significantly differentiated predic-
tion bias than the other voters. Otherwise, the accu-
racy is typically hurt rather than helped by the new
voting model.
To examine whether the KPCA-based model sat-
isfies this requirement, we compared its predictions
against each of the other classifiers (for those tasks
where we have been given the answer key). Table 2
shows nine confusion matrices revealing the per-
centage of instances where the KPCA-based model
votes differently from one of the other voters. The
disagreement between KPCA and the other voting
models ranges from 6.03% to 14.63%, as shown
by the bold entries in the confusion matrices. Note
that where there is disagreement, the KPCA-based
model predicts the correct sense with significantly
higher accuracy, in nearly all cases.
3.3 Voting effectiveness
The KPCA-based model exhibits the accuracy and
differentiation characteristics requisite for an effec-
tive additional voter, as shown in the foregoing sec-
Table 3: Comparison of the accuracies for the voting ensembles with and without the KPCA voter, confirm-
ing that adding the KPCA-based model to the voting ensemble always helps on Senseval-3 Lexical Sample
tasks.
English
(fine)
English
(coarse)
Chinese Multilingual
(t)
Multilingual
(ts)
HKUST comb3 (me, boost, nb) 71.2 - 67.5 60.6 60.8
HKUST comb2 (me, boost, nb, kpca) 71.4 78.6 66.2 62.0 63.8
tions. To verify that adding the KPCA-based model
to the voting ensemble indeed improves accuracy,
we compared our voting ensemble’s accuracies to
that obtained with KPCA removed. The results,
shown in Table 3, confirm that the KPCA-based
model generally helps on Senseval-3 Lexical Sam-
ple tasks. The only exception is on Chinese, due
to the aforementioned anomaly of boosting outper-
forming KPCA on that task. In the Multilingual (t)
and (ts) cases, the improvement in accuracy is sig-
nificant.
4 Conclusion
We have described our word sense disambiguation
system and its performance on the Senseval-3 En-
glish, Chinese, and Multilingual Lexical Sample
tasks. The system consists of an ensemble clas-
sifier utilizing combinations of maximum entropy,
boosting, na¨ıve Bayes, and a new Kernel PCA based
model.
We have demonstrated that our new model based
on Kernel PCA is, along with maximum entropy,
one of the most accurate stand-alone models vot-
ing in the ensemble, as evaluated under carefully
controlled to ensure the same optimized feature set
across all models being compared. Moreover, we
have shown that the KPCA model exhibits a signif-
icantly different classification bias, a characteristic
that makes it a valuable voter in an ensemble. The
results confirm that accuracy is generally improved
by the addition of the KPCA-based model.

References
Xavier Carreras, Llu´ıs M`arques, and Llu´ıs Padr´o. Named entity
extraction using AdaBoost. In Dan Roth and Antal van den
Bosch, editors, Proceedings of CoNLL-2002, pages 167–
170, Taipei, Taiwan, 2002.
Konstantinos I. Diamantaras and Sun Yuan Kung. Principal
Component Neural Networks. Wiley, New York, 1996.
Yoram Freund and Robert E. Schapire. A decision-theoretic
generalization of on-line learning and an application to
boosting. In Journal of Computer and System Sciences,
55(1), pages 119–139, 1997.
E.T. Jaynes. Where do we Stand on Maximum Entropy? MIT
Press, Cambridge MA, 1978.
Dan Klein and Christopher D. Manning. Conditional struc-
ture versus conditional estimation in NLP models. In Pro-
ceedings of EMNLP-2002, Conference on Empirical Meth-
ods in Natural Language Processing, pages 9–16, Philadel-
phia, July 2002. SIGDAT, Association for Computational
Linguistics.
Taku Kudo and Yuji Matsumoto. Fast methods for kernel-based
text analysis. In Proceedings of the 41st Annual Meeting of
the Association for Computational Linguistics, pages 24–31,
2003.
Grace Ngai, Dekai Wu, Marine Carpuat, Chi-Shing Wang,
and Chi-Yung Wang. Semantic role labeling with boost-
ing, SVMs, maximum entropy, SNOW, and decision lists.
In Proceedings of Senseval-3, Third International Work-
shop on Evaluating Word Sense Disambiguation Systems,
Barcelona, July 2004. SIGLEX, Association for Computa-
tional Linguistics.
Robert E. Schapire and Yoram Singer. Boostexter: A boosting-
based system for text categorization. In Machine Learning,
39(2/3), pages 135–168, 2000.
Bernhard Sch¨olkopf, Alexander Smola, and Klaus-Rober
M¨uller. Nonlinear component analysis as a kernel eigen-
value problem. Neural Computation, 10(5), 1998.
Richard Wicentowski, Grace Ngai, Dekai Wu, Marine Carpuat,
Emily Thomforde, and Adrian Packel. Joining forces to
resolve lexical ambiguity: East meets West in Barcelona.
In Proceedings of Senseval-3, Third International Work-
shop on Evaluating Word Sense Disambiguation Systems,
Barcelona, July 2004. SIGLEX, Association for Computa-
tional Linguistics.
Dekai Wu, Grace Ngai, Marine Carpuat, Jeppe Larsen, and
Yongsheng Yang. Boosting for named entity recognition.
In Dan Roth and Antal van den Bosch, editors, Proceedings
of CoNLL-2002, pages 195–198. Taipei, Taiwan, 2002.
Dekai Wu, Weifeng Su, and Marine Carpuat. A Kernel PCA
method for superior word sense disambiguation. In Pro-
ceedings of the 42nd Annual Meeting of the Association for
Computational Linguistics, Barcelona, July 2004.
David Yarowsky and Radu Florian. Evaluating sense disam-
biguation across diverse parameter spaces. Natural Lan-
guage Engineering, 8(4):293–310, 2002.
