Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X),
pages 77–84, New York City, June 2006. ©2006 Association for Computational Linguistics
Applying Alternating Structure Optimization
to Word Sense Disambiguation
Rie Kubota Ando
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598, U.S.A.
rie1@us.ibm.com
Abstract
This paper presents a new application of
the recently proposed machine learning
method Alternating Structure Optimiza-
tion (ASO), to word sense disambiguation
(WSD). Given a set of WSD problems
and their respective labeled examples, we
seek to improve overall performance on
that set by using all the labeled exam-
ples (irrespective of target words) for the
entire set in learning a disambiguator for
each individual problem. Thus, in effect,
on each individual problem (e.g., disam-
biguation of “art”) we benefit from train-
ing examples for other problems (e.g.,
disambiguation of “bar”, “canal”, and so
forth). We empirically study the effective
use of ASO for this purpose in the multi-
task and semi-supervised learning config-
urations. Our performance results rival
or exceed those of the previous best sys-
tems on several Senseval lexical sample
task data sets.
1 Introduction
Word sense disambiguation (WSD) is the task of
assigning pre-defined senses to words occurring in
some context. An example is to disambiguate an oc-
currence of “bank” between the “money bank” sense
and the “river bank” sense. Previous studies (e.g.,
Lee and Ng, 2002; Florian and Yarowsky, 2002)
have applied supervised learning techniques to WSD
with success.
A practical issue that arises in supervised WSD is the paucity of labeled
examples (sense-annotated data) available for training. For example, the
training set of the Senseval-2¹ English lexical sample task has only 10
labeled training examples per sense on average, in contrast to nearly 6K
training examples per name class (on average) used for the CoNLL-2003 named
entity chunking shared task². There are so many words and so many senses
that it is hard to make available a sufficient number of labeled training
examples for each of a large number of target words.
¹ http://www.cs.unt.edu/~rada/senseval/. WSD systems have been evaluated in the series of Senseval workshops.
² http://www.cnts.ua.ac.be/conll2003/ner/
On the other hand, this indicates that the total number of available labeled
examples (irrespective of target words) can be relatively large. A natural
question to ask is whether we can effectively use all the labeled examples
(irrespective of target words) for learning on each individual WSD problem.
Based on these observations, we study a new application of Alternating
Structure Optimization (ASO) (Ando and Zhang, 2005a; Ando and Zhang, 2005b)
to WSD. ASO is a recently proposed machine learning method for learning
predictive structure (i.e., information useful for predictions) shared by
multiple prediction problems via joint empirical risk minimization. It has
been shown that on several tasks, performance can be significantly improved
by a semi-supervised application of ASO, which obtains useful information
from unlabeled data by learning automatically created prediction problems.
In addition to such semi-supervised learning, this paper explores ASO
multi-task learning, which learns a number of WSD problems simultaneously to
exploit the inherent predictive structure shared by these WSD problems.
Thus, in effect, each individual problem (e.g., disambiguation of “art”)
benefits from labeled training examples for other problems (e.g.,
disambiguation of “bar”, disambiguation of “canal”, and so forth).
The notion of benefiting from training data for other word senses is not new
by itself. For instance,
on the WSD task with respect to WordNet synsets,
Kohomban and Lee (2005) trained classifiers for the
top-level synsets of the WordNet semantic hierar-
chy, consolidating labeled examples associated with
the WordNet sub-trees. To disambiguate test in-
stances, these coarse-grained classifiers are first ap-
plied, and then fine-grained senses are determined
using a heuristic mapping. By contrast, our ap-
proach does not require pre-defined relations among
senses such as the WordNet hierarchy. Rather, we
let the machine learning algorithm ASO automati-
cally and implicitly find relations with respect to the
disambiguation problems (i.e., finding shared pre-
dictive structure). Interestingly, in our experiments,
seemingly unrelated or only loosely related word-
sense pairs help to improve performance.
This paper makes two contributions. First, we
present a new application of ASO to WSD. We em-
pirically study the effective use of ASO and show
that labeled examples of all the words can be effec-
tively exploited in learning each individual disam-
biguator. Second, we report performance results that
rival or exceed the state-of-the-art systems on Sen-
seval lexical sample tasks.
2 Alternating structure optimization
This section gives a brief summary of ASO. We first
introduce a standard linear prediction model for a
single task and then extend it to a joint linear model
used by ASO.
2.1 Standard linear prediction models
In the standard formulation of supervised learning, we seek a predictor that
maps an input vector (or feature vector) $x \in \mathcal{X}$ to the
corresponding output $y \in \mathcal{Y}$. For NLP tasks, binary features are
often used – for example, if the word to the left is “money”, set the
corresponding entry of $x$ to 1; otherwise, set it to 0. A $k$-way
classification problem can be cast as $k$ binary classification problems,
regarding output $y = +1$ and $y = -1$ as “in-class” and “out-of-class”,
respectively.
Predictors based on linear prediction models take the form
$f(x) = w^T x$, where $w$ is called a weight vector. A common method to
obtain a predictor $\hat{f}$ is regularized empirical risk minimization,
which minimizes an empirical loss of the predictor (with regularization) on
the $n$ labeled training examples $\{(X_i, Y_i)\}$:
$$\hat{f} = \arg\min_{f} \left( \sum_{i=1}^{n} L(f(X_i), Y_i) + r(f) \right). \qquad (1)$$
A loss function $L(\cdot)$ quantifies the difference between the prediction
$f(X_i)$ and the true output $Y_i$, and $r(\cdot)$ is a regularization term
to control the model complexity.
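As an illustration of Eq. (1) for a linear predictor, the following Python sketch evaluates the regularized empirical risk and shows the one-vs-rest relabeling described above. The squared loss, the squared-norm regularizer with weight `lam`, the toy data, and all function names are assumptions made for this example only; they are not the choices used in the experiments.

```python
def dot(w, x):
    """Inner product w^T x of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def one_vs_rest_labels(senses, target):
    """Cast a k-way problem as a binary one: +1 for `target`, -1 otherwise."""
    return [1 if s == target else -1 for s in senses]

def squared_loss(p, y):
    # Placeholder loss L(p, y); any other loss could be substituted here.
    return (p - y) ** 2

def regularized_risk(w, examples, lam):
    """Objective of Eq. (1) for f(x) = w^T x, assuming r(f) = lam * ||w||^2."""
    empirical = sum(squared_loss(dot(w, x), y) for x, y in examples)
    return empirical + lam * dot(w, w)

# Toy data: feature 0 fires when "money" is to the left, feature 1 for "river".
senses = ["money", "river", "money"]
X = [[1, 0], [0, 1], [1, 0]]
Y = one_vs_rest_labels(senses, "money")
risk = regularized_risk([1.0, -1.0], list(zip(X, Y)), lam=0.1)
```

With the weight vector chosen here every toy example is predicted exactly, so the objective reduces to the regularization term alone.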
2.2 Joint linear models for ASO
Consider $m$ prediction problems indexed by $\ell \in \{1, \ldots, m\}$,
each with $n_\ell$ samples $(X_i^\ell, Y_i^\ell)$ for
$i \in \{1, \ldots, n_\ell\}$, and assume that there exists a
low-dimensional predictive structure shared by these $m$ problems. Ando and
Zhang (2005a) extend the above traditional linear model to a joint linear
model so that a predictor for problem $\ell$ is in the form:
$$f_\ell(\Theta, x) = w_\ell^T x + v_\ell^T \Theta x, \qquad \Theta \Theta^T = I, \qquad (2)$$
where $I$ is the identity matrix. $w_\ell$ and $v_\ell$ are weight vectors
specific to each problem $\ell$. Predictive structure is parameterized by
the structure matrix $\Theta$ shared by all the $m$ predictors. The goal of
this model can also be regarded as learning a common good feature map
$\Theta x$ used for all the $m$ problems.
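The joint linear model (2) can be sketched in a few lines of NumPy. The dimensions, the numeric values, and the function name `joint_predict` are hypothetical and serve only to show how the problem-specific weights and the shared structure matrix combine.

```python
import numpy as np

def joint_predict(w, v, theta, x):
    """Score of Eq. (2): w^T x + v^T (Theta x)."""
    return float(w @ x + v @ (theta @ x))

# Hypothetical sizes: 3 original features, 2 shared structure dimensions.
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])   # rows are orthonormal: Theta Theta^T = I
w = np.array([0.5, 0.0, -1.0])        # problem-specific weights on x
v = np.array([2.0, 1.0])              # problem-specific weights on Theta x
x = np.array([1.0, 0.0, 1.0])

score = joint_predict(w, v, theta, x)
```

The same score is obtained from the single weight vector u = w + Theta^T v, which is the substitution used in Section 2.3.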
2.3 ASO algorithm
Analogous to (1), we compute $\Theta$ and predictors so that they minimize
the empirical risk summed over all the problems:
$$[\hat{\Theta}, \{\hat{f}_\ell\}] = \arg\min_{\Theta, \{f_\ell\}} \sum_{\ell=1}^{m} \left( \sum_{i=1}^{n_\ell} \frac{L(f_\ell(\Theta, X_i^\ell), Y_i^\ell)}{n_\ell} + r(f_\ell) \right). \qquad (3)$$
It has been shown in (Ando and Zhang, 2005a) that the optimization problem
(3) has a simple solution using singular value decomposition (SVD) when we
choose square regularization: $r(f_\ell) = \lambda \|w_\ell\|_2^2$, where
$\lambda$ is a regularization parameter. Let
$u_\ell = w_\ell + \Theta^T v_\ell$. Then (3) becomes the minimization of
the joint empirical risk written as:
$$\sum_{\ell=1}^{m} \left( \sum_{i=1}^{n_\ell} \frac{L(u_\ell^T X_i^\ell, Y_i^\ell)}{n_\ell} + \lambda \| u_\ell - \Theta^T v_\ell \|_2^2 \right). \qquad (4)$$
This minimization can be approximately solved by repeating the following
alternating optimization procedure until a convergence criterion is met:
1. Fix $(\Theta, \{v_\ell\})$, and find $m$ predictors $\{u_\ell\}$ that
minimize the joint empirical risk (4).
2. Fix $m$ predictors $\{u_\ell\}$, and find $(\Theta, \{v_\ell\})$ that
minimize the joint empirical risk (4).
Nouns: art, authority, bar, bum, chair, channel, child, church, circuit, day, detention, dyke, facility, fatigue, feeling,
grip, hearth, holiday, lady, material, mouth, nation, nature, post, restraint, sense, spade, stress, yew
Verbs: begin, call, carry, collaborate, develop, draw, dress, drift, drive, face, ferret, find, keep, leave, live, match,
play, pull, replace, see, serve, strike, train, treat, turn, use, wander, wash, work
Adjectives: blind, colourless, cool, faithful, fine, fit, free, graceful, green, local, natural, oblique, simple, solemn, vital
Figure 1: Words to be disambiguated; Senseval-2 English lexical sample task.
The first step is equivalent to training $m$ predictors independently. The
second step, which couples all the predictors, can be done by setting the
rows of $\Theta$ to the most significant left singular vectors of the
predictor (weight) matrix $U = [u_1, \ldots, u_m]$, and setting
$v_\ell = \Theta u_\ell$. That is, the structure matrix $\Theta$ is
computed so that the projection of the predictor matrix $U$ onto the
subspace spanned by $\Theta$’s rows gives the best approximation (in the
least squares sense) of $U$ for the given row-dimension of $\Theta$. Thus,
intuitively, $\Theta$ captures the commonality of the $m$ predictors.
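The second step of the alternating procedure can be sketched with NumPy's SVD routine. This is a minimal illustration, assuming the predictors $u_\ell$ are stored as the columns of a dense array `U`; the function name and the argument `h` (the row-dimension of $\Theta$) are hypothetical.

```python
import numpy as np

def structure_from_predictors(U, h):
    """SVD step: U is a (features x problems) matrix whose columns are u_l.

    Returns Theta (h x features), whose rows are the h most significant left
    singular vectors of U, and V (h x problems), whose column l is Theta u_l.
    """
    left, _, _ = np.linalg.svd(U, full_matrices=False)
    theta = left[:, :h].T
    V = theta @ U
    return theta, V
```

Projecting `U` onto the row space of `theta`, that is `theta.T @ V`, gives the rank-`h` truncated SVD of `U`, which is the least-squares approximation property stated above.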
ASO has been shown to be useful in its semi-
supervised learning configuration, where the above
algorithm is applied to a number of auxiliary prob-
lems that are automatically created from the unla-
beled data. By contrast, the focus of this paper is the
multi-task learning configuration, where the ASO
algorithm is applied to a number of real problems
with the goal of improving overall performance on
these problems.
3 Effective use of ASO on word sense
disambiguation
The essence of ASO is to learn information useful
for prediction (predictive structure) shared by mul-
tiple tasks, assuming the existence of such shared
structure. From this viewpoint, consider the target
words of the Senseval-2 lexical sample task, shown
in Figure 1. Here we have multiple disambiguation
tasks; however, at first glance, it is not entirely
clear whether these tasks share predictive structure
(or are related to each other). There is no direct se-
mantic relationship (such as synonym or hyponym
relations) among these words.
Local context: word uni-grams in 5-word window; word bi- and tri-grams of
$(w_{-2}, w_{-1})$, $(w_{+1}, w_{+2})$, $(w_{-1}, w_{+1})$,
$(w_{-3}, w_{-2}, w_{-1})$, $(w_{+1}, w_{+2}, w_{+3})$,
$(w_{-2}, w_{-1}, w_{+1})$, $(w_{-1}, w_{+1}, w_{+2})$.
Syntactic: full parser output; see Section 3 for detail.
Global: all the words excluding stopwords.
POS: uni-, bi-, and tri-grams in 5-word window.
Figure 2: Features. $w_i$ stands for the word at position $i$
relative to the word to be disambiguated. The 5-word window is
$[-2, +2]$. Local context and POS features are position-sensitive.
Global context features are position insensitive (a bag of words).
The goal of this section is to empirically study
the effective use of ASO for improving overall per-
formance on these seemingly unrelated disambigua-
tion problems. Below we first describe the task set-
ting, features, and algorithms used in our imple-
mentation, and then experiment with the Senseval-
2 English lexical sample data set (with the offi-
cial training / test split) for the development of our
methods. We will then evaluate the methods de-
veloped on the Senseval-2 data set by carrying out
the Senseval-3 tasks, i.e., training on the Senseval-3
training data and then evaluating the results on the
(unseen) Senseval-3 test sets in Section 4.
Task setting In this work, we focus on the Sense-
val lexical sample task. We are given a set of target
words, each of which is associated with several pos-
sible senses, and their labeled instances for training.
Each instance contains an occurrence of one of the
target words and its surrounding words, typically a
few sentences. The task is to assign a sense to each
test instance.
Features We adopt the feature design used by Lee and Ng (2002), which
consists of the following four types: (1) Local context: $n$-grams of nearby
words (position sensitive); (2) Global context: all the words (excluding
stopwords) in the given context (position-insensitive; a bag of words);
(3) POS: parts-of-speech $n$-grams of nearby words; (4) Syntactic relations:
syntactic information obtained from parser output. To generate syntactic
relation features, we use the Slot Grammar-based full parser ESG (McCord,
1990). We use as features syntactic relation types (e.g., subject-of,
object-of, and noun modifier), participants of syntactic relations, and
bi-grams of syntactic relations / participants. Details of
the other three types are shown in Figure 2.
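The position-sensitive local context features of Figure 2 can be generated as in the following sketch. The feature-name strings, the `<pad>` token for positions outside the context, and the inclusion of offset 0 in the 5-word window are assumptions of this example; the source specifies only the offsets.

```python
def local_context_features(tokens, t):
    """Local context features for the target word at index t (cf. Figure 2)."""
    def w(offset):
        i = t + offset
        return tokens[i] if 0 <= i < len(tokens) else "<pad>"

    feats = set()
    # Word uni-grams in the 5-word window [-2, +2].
    for off in range(-2, 3):
        feats.add(f"w[{off:+d}]={w(off)}")
    # Word bi- and tri-grams at the offsets listed in Figure 2.
    ngrams = [(-2, -1), (1, 2), (-1, 1),
              (-3, -2, -1), (1, 2, 3), (-2, -1, 1), (-1, 1, 2)]
    for offs in ngrams:
        name = ",".join(f"{o:+d}" for o in offs)
        feats.add(f"w[{name}]=" + "_".join(w(o) for o in offs))
    return feats

tokens = ["he", "sat", "on", "the", "bank", "of", "the", "river"]
feats = local_context_features(tokens, 4)
```

Because the offsets are part of each feature name, "the" at position -1 and "the" at position +2 yield distinct binary features, which is what makes this group position-sensitive.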
Implementation Our implementation follows Ando and Zhang (2005a). We use a
modification of Huber’s robust loss for regression:
$L(p, y) = (\max(0, 1 - py))^2$ if $py \ge -1$, and $-4py$ otherwise; with
square regularization ($\lambda = 10^{-4}$), and perform empirical risk
minimization by stochastic gradient descent (SGD) (see e.g., Zhang
(2004)). We perform one ASO iteration.
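The modified Huber loss and one SGD update on it can be written as follows. The loss and the value $\lambda = 10^{-4}$ come from the text; the learning rate `lr`, the per-example form of the update, and the function names are assumptions of this sketch rather than details of the original implementation.

```python
def modified_huber(p, y):
    """L(p, y) = max(0, 1 - py)^2 if py >= -1, and -4py otherwise."""
    py = p * y
    if py >= -1.0:
        return max(0.0, 1.0 - py) ** 2
    return -4.0 * py

def modified_huber_grad(p, y):
    """Derivative of L(p, y) with respect to the prediction p."""
    py = p * y
    if py >= 1.0:
        return 0.0
    if py >= -1.0:
        return -2.0 * y * (1.0 - py)
    return -4.0 * y

def sgd_step(w, x, y, lam=1e-4, lr=0.1):
    """One SGD update on L(w^T x, y) + lam * ||w||^2 (lr is an assumption)."""
    p = sum(wi * xi for wi, xi in zip(w, x))
    g = modified_huber_grad(p, y)
    return [wi - lr * (g * xi + 2.0 * lam * wi) for wi, xi in zip(w, x)]
```

The quadratic and linear pieces agree in both value and slope at $py = -1$, so the loss is continuously differentiable while growing only linearly for badly misclassified examples.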
3.1 Exploring the multi-task learning
configuration
The goal is to effectively apply ASO to the set of
word disambiguation problems so that overall per-
formance is improved. We consider two factors: fea-
ture split and partitioning of prediction problems.
3.1.1 Feature split and problem partitioning
Our features described above inherently consist of four feature groups:
local context (LC), global context (GC), syntactic relation (SR), and POS
features. To exploit such a natural feature split, we explore the following
extension of the joint linear model:
$$f_\ell(\{\Theta_j\}, x) = w_\ell^T x + \sum_{j \in \mathcal{F}} {v_\ell^{(j)}}^T \Theta_j x^{(j)}, \qquad (5)$$
where $\Theta_j \Theta_j^T = I$ for $j \in \mathcal{F}$, $\mathcal{F}$ is a
set of disjoint feature groups, and $x^{(j)}$ (or $v_\ell^{(j)}$) is a
portion of the feature vector $x$ (or the weight vector $v_\ell$)
corresponding to the feature group $j$, respectively. This is a slight
modification of the extension presented in (Ando and Zhang, 2005a). Using
this model, ASO computes the structure matrix $\Theta_j$ for each feature
group separately. That is, SVD is applied to the sub-matrix of the predictor
(weight) matrix corresponding to each feature group $j$, which results in
more focused dimension reduction of the predictor matrix. For example,
suppose that $\mathcal{F} = \{\mathrm{SR}\}$. Then, we compute the structure
matrix $\Theta_{\mathrm{SR}}$ from the corresponding sub-matrix of the
predictor matrix $U$, which is the gray region of Figure 3 (a). The
structure matrices $\Theta_j$ for $j \notin \mathcal{F}$ (associated with
the white regions in the figure) should be regarded as being fixed to the
zero matrices. Similarly, it is possible to compute a structure matrix from
a subset of the predictors (such as noun disambiguators only), as in Figure
3 (b). In this example, we apply the extension of ASO with
$\mathcal{F} = \{\mathrm{SR}\}$ to three sets of problems (disambiguation of
nouns, verbs, and adjectives, respectively) separately.
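The feature split and problem partitioning amount to applying the SVD step to a sub-matrix of $U$. The sketch below assumes that feature groups are given as lists of row indices and problem sets as lists of column indices; this layout and all names are hypothetical.

```python
import numpy as np

def structures_by_group(U, groups, F, problems, h):
    """Compute one structure matrix Theta_j per feature group j in F.

    U        : (features x problems) predictor matrix.
    groups   : dict mapping a group name to the row indices of its features.
    F        : feature groups to use; groups outside F get no structure matrix.
    problems : column indices of the predictors to use (e.g. nouns only).
    h        : requested number of rows of each Theta_j.
    """
    thetas = {}
    for j in F:
        sub = U[np.ix_(groups[j], problems)]
        left, _, _ = np.linalg.svd(sub, full_matrices=False)
        thetas[j] = left[:, :h].T   # has at most min(sub.shape) rows
    return thetas
```

Calling the function once with all columns corresponds to Figure 3 (a), and calling it once per part of speech corresponds to Figure 3 (b).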
[Figure 3 (image not reproduced). Two panels, each showing the predictor matrix U with rows grouped into LC, GC, SR, and POS features. (a) Partitioned by features, F = {SR}: Θ_SR is computed from the SR rows of all m predictors. (b) Partitioned by F = {SR} and problem types: Θ_SR,Noun, Θ_SR,Verb, and Θ_SR,Adj are computed from the SR rows of the predictors for nouns, verbs, and adjectives, respectively.]
Figure 3: Examples of feature split and problem partitioning.
To see why such partitioning may be useful for
our WSD problems, consider the disambiguation of
“bank” and the disambiguation of “save”. Since a
“bank” as in “money bank” and a “save” as in “sav-
ing money” may occur in similar global contexts,
certain global context features effective for recog-
nizing the “money bank” sense may be also effective
for disambiguating “save”, and vice versa. However,
with respect to the position-sensitive local context
features, these two disambiguation problems may
not have much in common since, for instance, we
sometimes say “the bank announced”, but we rarely
say “the save announced”. That is, whether prob-
lems share predictive structure may depend on fea-
ture types, and in that case, seeking predictive struc-
ture for each feature group separately may be more
effective. Hence, we experiment with the configu-
rations with and without various feature splits using
the extension of ASO.
Our target words are nouns, verbs, and adjec-
tives. As in the above example of “bank” (noun)
and “save” (verb), the predictive structure of global
context features may be shared by the problems ir-
respective of the parts of speech of the target words.
However, the other types of features may be more
dependent on the target word part of speech. Therefore, we explore two
types of configuration. One applies ASO to all the disambiguation problems
at once. The other applies ASO separately to each of the three sets of
disambiguation problems (noun disambiguation problems, verb disambiguation
problems, and adjective disambiguation problems) and uses the structure
matrix $\Theta_j$ obtained from the noun disambiguation problems only for
disambiguating nouns, and so forth.
Thus, we explore combinations of two parameters. One is the set of feature
groups $\mathcal{F}$ in the model (5). The other is the partitioning of
disambiguation problems.
3.1.2 Empirical results
[Figure 4 (plot not reproduced). y-axis: F-measure, 64.5 to 68. x-axis: feature group set F, with configurations Baseline, {LC}, {GC}, {SR}, {POS}, {LC,SR,GC}, and {LC+SR+GC} (no feature split). Series (problem partitioning): all problems at once; nouns, verbs, adjectives, separately.]
Figure 4: F-measure on Senseval-2 English test set. Multi-task
configurations varying feature group set $\mathcal{F}$ and problem
partitioning. Performance at the best dimensionality of $\Theta_j$ (in
$\{10, 25, 50, 100, \ldots\}$) is shown.
In Figure 4, we compare performance on the Senseval-2 test set produced by
training on the Senseval-2 training set using the various configurations
discussed above. As the evaluation metric, we use the F-measure
(micro-averaged)³ returned by the official Senseval scorer. Our baseline is
the standard single-task configuration using the same loss function
(modified Huber) and the same training algorithm (SGD).
³ Our precision and recall are always the same since our systems assign exactly one sense to each instance. That is, our F-measure is the same as ‘micro-averaged recall’ or ‘accuracy’ used in some of the previous studies we will compare with.
The results are in line with our expectation. To learn the shared predictive
structure of local context (LC) and syntactic relations (SR), it is more
advantageous to apply ASO to each of the three sets of problems
(disambiguation of nouns, verbs, and adjectives, respectively), separately.
By contrast, global context features (GC) can be more effectively exploited
when ASO is applied to all the disambiguation problems at once. It turned
out that the configuration $\mathcal{F} = \{\mathrm{POS}\}$ does not improve
the performance over the baseline. Therefore, we exclude POS from the
feature group set $\mathcal{F}$ in the rest of our experiments. Comparison
of $\mathcal{F} = \{\mathrm{LC{+}SR{+}GC}\}$ (treating the features of these
three types as one group) and
$\mathcal{F} = \{\mathrm{LC}, \mathrm{SR}, \mathrm{GC}\}$ indicates that use
of this feature split indeed improves performance. Among the configurations
shown in Figure 4, the best performance (67.8%) is obtained by applying ASO
to the three sets of problems (corresponding to nouns, verbs, and
adjectives) separately, with the feature split
$\mathcal{F} = \{\mathrm{LC}, \mathrm{SR}, \mathrm{GC}\}$.
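The remark that micro-averaged precision, recall, and F-measure coincide when exactly one sense is assigned to every instance can be checked with a short computation. The scorer below is a simplified stand-in written for this illustration, not the official Senseval scorer; `None` marks an instance left unanswered.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, and F-measure over instances."""
    attempted = sum(p is not None for p in predicted)
    correct = sum(p == g for p, g in zip(predicted, gold) if p is not None)
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(gold)
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = ["a", "b", "a", "c"]
full = micro_prf(gold, ["a", "b", "c", "c"])      # every instance answered
partial = micro_prf(gold, ["a", "b", None, "c"])  # one instance unanswered
```

When every instance is answered, the three values are equal to the accuracy; they diverge only when a system abstains on some instances.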
ASO has one parameter, the dimensionality of the structure matrix
$\Theta_j$ (i.e., the number of left singular vectors to compute). The
performance shown in Figure 4 is the ceiling performance obtained at the
best dimensionality (in $\{10, 25, 50, 100, 150, \ldots\}$). In Figure 5, we
show the performance dependency on $\Theta_j$’s dimensionality when ASO is
applied to all the problems at once (Figure 5 left), and when ASO is applied
to the set of the noun disambiguation problems (Figure 5 right). In the left
figure, the configuration $\mathcal{F} = \{\mathrm{GC}\}$ (global context)
produces better performance at a relatively low dimensionality. In the other
configurations shown in these two figures, performance is relatively stable
as long as the dimensionality is not too low.
[Figure 5 (plots not reproduced). Left panel: y-axis 64.5 to 67.5, x-axis dimensionality 0 to 500. Right panel: y-axis 69 to 74, x-axis dimensionality 0 to 300. Series: {LC,GC,SR}, {LC+GC+SR}, {LC}, {GC}, {SR}, baseline.]
Figure 5: Left: Applying ASO to all the WSD problems at once. Right:
Applying ASO to noun disambiguation problems only and testing on the noun
disambiguation problems only. $x$-axis: dimensionality of $\Theta_j$.
3.2 Multi-task learning procedure for WSD
Based on the above results on the Senseval-2 test set, we develop the
following procedure using the feature split and problem partitioning shown
in Figure 6. Let $N$, $V$, and $A$ be sets of disambiguation problems whose
target words are nouns, verbs, and adjectives, respectively. We write
$\Theta_{(j,s)}$ for the structure matrix associated with the feature group
$j$ and computed from a problem set $s$. That is, we replace $\Theta_j$ in
(5) with $\Theta_{(j,s)}$.
[Figure 6 (image not reproduced). The predictor matrix U, with columns grouped into predictors for nouns, verbs, and adjectives and rows grouped into LC, GC, SR, and POS features. We compute seven structure matrices Θ_(j,s), each from one of the seven shaded regions of the predictor matrix U.]
Figure 6: Effective feature split and problem partitioning.
• Apply ASO to the three sets of disambiguation problems (corresponding to
nouns, verbs, and adjectives), separately, using the extended model (5) with
$\mathcal{F} = \{\mathrm{LC}, \mathrm{SR}\}$. As a result, we obtain
$\Theta_{(j,s)}$ for every
$(j,s) \in \{\mathrm{LC}, \mathrm{SR}\} \times \{N, V, A\}$.
• Apply ASO to all the disambiguation problems at once using the extended
model (5) with $\mathcal{F} = \{\mathrm{GC}\}$ to obtain
$\Theta_{(\mathrm{GC}, N \cup V \cup A)}$.
• For a problem $\ell \in P \in \{N, V, A\}$, our final predictor is based
on the model:
$$f_\ell(x) = w_\ell^T x + \sum_{(j,s) \in T} {v_\ell^{(j,s)}}^T \Theta_{(j,s)} x^{(j)},$$
where $T = \{(\mathrm{LC}, P), (\mathrm{SR}, P), (\mathrm{GC}, N \cup V \cup A)\}$.
We obtain predictor $\hat{f}_\ell$ by minimizing the regularized empirical
risk with respect to $w_\ell$ and $v_\ell$.
We fix the dimension of the structure matrix corresponding to global context
features to 50. The dimensions of the other structure matrices are set to
0.9 times the maximum possible rank to ensure relatively high
dimensionality. This procedure produces 68.1% on the Senseval-2 English
lexical sample test set.
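With the structure matrices fixed, the final predictor is linear in the original features stacked with the projected features $\Theta_{(j,s)} x^{(j)}$. The following sketch builds that stacked vector and encodes the dimensionality rule. Reading "maximum possible rank" as the smaller of the number of features in the group and the number of problems in the set is an assumption of this example, as are the dictionary layout, the key `"all"` for $N \cup V \cup A$, and the function names.

```python
import numpy as np

def structure_dim(group, n_features, n_problems):
    """Dimensionality rule: 50 for GC, else 0.9 times the maximum rank.

    The maximum rank is taken to be min(n_features, n_problems) (assumption).
    """
    if group == "GC":
        return 50
    return int(0.9 * min(n_features, n_problems))

def augmented_features(x, groups, thetas, pos):
    """Stack x with Theta_(j,s) x^(j) for (j,s) in T.

    T = {(LC, P), (SR, P), (GC, all)}, where P is the part of speech `pos`
    of the target word and "all" stands for the union of N, V, and A.
    """
    T = [("LC", pos), ("SR", pos), ("GC", "all")]
    parts = [x] + [thetas[(j, s)] @ x[groups[j]] for j, s in T]
    return np.concatenate(parts)
```

A linear model trained on this stacked vector has weights that split into $w_\ell$ and the blocks $v_\ell^{(j,s)}$; under the square regularization of Section 2.3 only the $w_\ell$ block would be penalized.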
3.3 Previous systems on Senseval-2 data set
Figure 7 compares our performance with those of previous best systems on the
Senseval-2 English lexical sample test set. Since we used this test set for
the development of our method above, our performance should be understood as
the potential performance. (In Section 4, we will present evaluation results
on the unseen Senseval-3 test sets.) Nevertheless, it is worth noting that
our potential performance (68.1%) exceeds those of the previous best
systems.
ASO multi-task learning (optimum config.)    68.1
classifier combination [FY02]                66.5
polynomial KPCA [WSC04]                      65.8
SVM [LN02]                                   65.4
Our single-task baseline                     65.3
Senseval-2 (2001) best participant           64.2
Figure 7: Performance comparison with previous best systems on Senseval-2
English lexical sample test set. FY02 (Florian and Yarowsky, 2002), WSC04
(Wu et al., 2004), LN02 (Lee and Ng, 2002)
Our single-task baseline performance is almost
the same as LN02 (Lee and Ng, 2002), which
uses SVM. This is consistent with the fact that we
adopted LN02’s feature design. FY02 (Florian and
Yarowsky, 2002) combines classifiers by linear av-
erage stacking. The best system of the Senseval-2
competition was an early version of FY02. WSC04
used a polynomial kernel via the kernel Principal
Component Analysis (KPCA) method (Schölkopf et
al., 1998) with nearest neighbor classifiers.
4 Evaluation on Senseval-3 tasks
In this section, we evaluate the methods developed
on the Senseval-2 data set above on the standard
Senseval-3 lexical sample tasks.
4.1 Our methods in multi-task and
semi-supervised configurations
In addition to the multi-task configuration described
in Section 3.2, we test the following semi-supervised
application of ASO. We first create auxiliary problems following Ando and
Zhang (2005a)’s partially-supervised strategy (Figure 8) with distinct
feature maps $\Psi_1$ and $\Psi_2$, each of which uses one of
$\{\mathrm{LC}, \mathrm{GC}, \mathrm{SR}\}$. Then, we apply ASO to these
auxiliary problems using the feature split and the problem partitioning
described in Section 3.2.
Note that the difference between the multi-task
and semi-supervised configurations is the source of
information. The multi-task configuration utilizes
the label information of the training examples that
are labeled for the rest of the multiple tasks, and
the semi-supervised learning configuration exploits
a large amount of unlabeled data.
1. Train a classifier $C_1$ only using feature map $\Psi_1$ on the
labeled data for the target task.
2. Auxiliary problems are to predict the labels assigned by $C_1$ to the
unlabeled data, using the other feature map $\Psi_2$.
3. Apply ASO to the auxiliary problems to obtain $\Theta$.
4. Using the joint linear model (2), train the final predictor by
minimizing the empirical risk for fixed $\Theta$ on the labeled data for
the target task.
Figure 8: Ando and Zhang (2005a)’s ASO semi-supervised learning method using
partially-supervised procedure for creating relevant auxiliary problems.
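Steps 1 to 3 of Figure 8 can be sketched as follows. To keep the example short, a least-squares fit stands in for the SGD-trained classifiers, which is a substitution made here and not the method used in the experiments. The array names are hypothetical: `X1_*` and `X2_*` hold the $\Psi_1$ and $\Psi_2$ features, and `Y_lab` is a one-vs-rest label matrix with entries +1 and -1.

```python
import numpy as np

def fit_linear(X, Y):
    """Least-squares stand-in for a trained linear classifier (assumption)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W   # shape: (features, classes)

def semi_supervised_structure(X1_lab, Y_lab, X1_unl, X2_unl, h):
    # Step 1: train C_1 on the labeled data using feature map Psi_1 only.
    W1 = fit_linear(X1_lab, Y_lab)
    # Step 2: labels that C_1 assigns to the unlabeled data define the
    # auxiliary problems, which are predicted from the other map Psi_2.
    proposed = np.argmax(X1_unl @ W1, axis=1)
    k = Y_lab.shape[1]
    Y_aux = 2.0 * np.eye(k)[proposed] - 1.0   # one +1/-1 problem per label
    U_aux = fit_linear(X2_unl, Y_aux)         # columns: auxiliary predictors
    # Step 3: the SVD step of ASO applied to the auxiliary predictors.
    left, _, _ = np.linalg.svd(U_aux, full_matrices=False)
    return left[:, :h].T
```

Step 4, training the final predictor with the returned structure matrix held fixed, is omitted here.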
4.2 Data and evaluation metric
We conduct evaluations on four Senseval-3 lexical
sample tasks (English, Catalan, Italian, and Spanish)
using the official training / test splits. Data statis-
tics are shown in Figure 9. On the Spanish, Cata-
lan, and Italian data sets, we use part-of-speech in-
formation (as features) and unlabeled examples (for
semi-supervised learning) provided by the organizer.
Since the English data set was not provided with
these additional resources, we use an in-house POS
tagger trained with the Penn Treebank corpus, and
extract 100K unlabeled examples from the Reuters-RCV1
corpus. On each language, the number of unlabeled
examples is 5–15 times larger than that of the
labeled training examples. We use syntactic relation
features only for the English data set. As in Section 3,
we report micro-averaged F-measure.
4.3 Baseline methods
In addition to the standard single-task supervised
configuration as in Section 3, we test the following
method as an additional baseline.
Output-based method The goal of our multi-task
learning configuration is to benefit from having the
labeled training examples of a number of words. An
alternative to ASO for this purpose is to use directly
as features the output values of classifiers trained
for disambiguating the other words, which we call
‘output-based method’ (cf. Florian et al. (2003)).
We explore several variations similarly to Section
3.1 and report the ceiling performance.
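One simple variation of the output-based method appends the raw scores of the other words' classifiers to the feature vector. The sketch below assumes linear classifiers stored one per row of `other_W`; the layout and the function name are hypothetical, and the other variations explored are not specified in the text.

```python
import numpy as np

def output_based_features(x, other_W):
    """Append to x the output values of classifiers trained for other words."""
    return np.concatenate([x, other_W @ x])

x = np.array([1.0, 0.0, 2.0])
other_W = np.array([[1.0, 1.0, 1.0],    # classifier for another word's sense
                    [0.0, 0.0, 1.0]])   # classifier for a second sense
z = output_based_features(x, other_W)
```

In contrast to ASO, each other classifier contributes only a single number per instance in this scheme.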
4.4 Evaluation results
Figure 10 shows F-measure results on the four Senseval-3 data sets using the
official training / test splits. Both ASO multi-task learning and
semi-supervised learning improve performance over the single-task baseline
on all the data sets. The best performance is achieved when we combine
multi-task learning and semi-supervised learning by using all the
corresponding structure matrices $\Theta_{(j,s)}$ produced by both
multi-task and semi-supervised learning, in the final predictors. This
combined configuration outperforms the single-task supervised baseline by up
to 5.7%.
                       #words  #train  avg #sense  avg #train
                                       per word    per sense
Senseval-2 data set
  English              73      8611    10.7        10.0
Senseval-3 data sets
  English              57      7860    6.5         21.3
  Catalan              27      4469    3.1         53.2
  Italian              45      5145    6.2         18.4
  Spanish              46      8430    3.3         55.5
Figure 9: Data statistics of Senseval-2 English lexical sample data set
(first row) and Senseval-3 data sets. On each data set, # of test instances
is about one half of that of training instances.
Performance improvements over the supervised
baseline are relatively small on English and Span-
ish. We conjecture that this is because the supervised
performance is already close to the highest perfor-
mance that automatic methods could achieve. On
these two languages, our (and previous) systems out-
perform inter-human agreement, which is unusual
but can be regarded as an indication that these tasks
are difficult.
The performance of the output-based method
(baseline) is relatively low. This indicates that out-
put values or proposed labels are not expressive
enough to integrate information from other predic-
tors effectively on this task. We conjecture that for
this method to be effective, the problems are re-
quired to be more closely related to each other as
in Florian et al. (2003)’s named entity experiments.
A practical advantage of ASO multi-task learning
over ASO semi-supervised learning is that shorter
computation time is required to produce similar
performance. On this English data set, training
for multi-task learning and semi-supervised learning
takes 15 minutes and 92 minutes, respectively, using
a Pentium-4 3.20GHz computer. The computation
time mostly depends on the amount of the data on
which auxiliary predictors are learned. Since our ex-
periments use unlabeled data 5–15 times larger than
labeled training data, semi-supervised learning takes
longer, accordingly.
           methods                          English      Catalan       Italian       Spanish
ASO        multi-task learning              73.8 (+0.8)  89.5 (+1.5)   63.2 (+4.9)   89.0 (+1.0)
           semi-supervised learning         73.5 (+0.5)  88.6 (+0.6)   62.4 (+4.1)   88.9 (+0.9)
           multi-task+semi-supervised       74.1 (+1.1)  89.9 (+1.9)   64.0 (+5.7)   89.5 (+1.5)
baselines  output-based                     73.0 (0.0)   88.3 (+0.3)   58.0 (-0.3)   88.2 (+0.2)
           single-task supervised learning  73.0         88.0          58.3          88.0
previous   SVM with LSA kernel [GGS05]      73.3         89.0          61.3          88.2
systems    Senseval-3 (2004) best systems   72.9 [G04]   85.2 [SGG04]  53.1 [SGG04]  84.2 [SGG04]
inter-annotator agreement                   67.3         93.1          89.0          85.3
Figure 10: Performance results on the Senseval-3 lexical sample test sets. Numbers in the parentheses are performance gains
compared with the single-task supervised baseline (italicized). [G04] Grozea (2004); [SGG04] Strapparava et al. (2004).
GGS05 combined various kernels, including
the LSA kernel that exploits unlabeled data with
global context features. Our implementation of the
LSA kernel with our classifier (and our other fea-
tures) also produced performance similar to that of
GGS05. While the LSA kernel is closely related
to a special case of the semi-supervised application
of ASO (see the discussion of PCA in Ando and
Zhang (2005a)), our approach here is more general
in that we exploit not only unlabeled data and global
context features but also the labeled examples of
other target words and other types of features. G04
achieved high performance on English using regu-
larized least squares with compensation for skewed
class distributions. SGG04 is an early version of
GGS05. Our methods rival or exceed these state-
of-the-art systems on all the data sets.
5 Conclusion
With the goal of achieving higher WSD perfor-
mance by exploiting all the currently available re-
sources, our focus was the new application of the
ASO algorithm in the multi-task learning configu-
ration, which improves performance by learning a
number of WSD problems simultaneously instead of
training for each individual problem independently.
A key finding is that using ASO with appropriate
feature / problem partitioning, labeled examples of
seemingly unrelated words can be effectively ex-
ploited. Combining ASO multi-task learning with
ASO semi-supervised learning results in further im-
provements. The fact that performance improve-
ments were obtained consistently across several lan-
guages / sense inventories demonstrates that our ap-
proach has broad applicability and hence practical
significance.

References
Rie Kubota Ando and Tong Zhang. 2005a. A framework
for learning predictive structures from multiple tasks and
unlabeled data. Journal of Machine Learning Research,
6(Nov):1817–1853. An early version was published as IBM
Research Report (2004).
Rie Kubota Ando and Tong Zhang. 2005b. High performance
semi-supervised learning for text chunking. In Proceedings
of ACL-2005.
Radu Florian and David Yarowsky. 2002. Modeling consensus:
Classifier combination for word sense disambiguation. In
Proceedings of EMNLP-2002.
Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang.
2003. Named entity recognition through classifier combina-
tion. In Proceedings of CoNLL-2003.
Cristian Grozea. 2004. Finding optimal parameter settings for
high performance word sense disambiguation. In Proceedings
of Senseval-3 Workshop.
Upali S. Kohomban and Wee Sun Lee. 2005. Learning seman-
tic classes for word sense disambiguation. In Proceedings of
ACL-2005.
Yoong Keok Lee and Hwee Tou Ng. 2002. An empirical evalu-
ation of knowledge sources and learning algorithms for word
sense disambiguation. In Proceedings of EMNLP-2002.
Michael C. McCord. 1990. Slot Grammar: A system for
simpler construction of practical natural language grammars.
Natural Language and Logic: International Scientific Sym-
posium, Lecture Notes in Computer Science, pages 118–145.
Bernhard Schölkopf, Alexander Smola, and Klaus-Robert
Müller. 1998. Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10(5).
Carlo Strapparava, Alfio Gliozzo, and Claudio Giuliano. 2004.
Pattern abstraction and term similarity for word sense disam-
biguation: IRST at Senseval-3. In Proceedings of Senseval-3
Workshop.
Dekai Wu, Weifeng Su, and Marine Carpuat. 2004. A kernel
PCA method for superior word sense disambiguation. In
Proceedings of ACL-2004.
Tong Zhang. 2004. Solving large scale linear prediction prob-
lems using stochastic gradient descent algorithms. In ICML
04, pages 919–926.
