Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 415–422,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Partially Supervised Sense Disambiguation by Learning Sense Number
from Tagged and Untagged Corpora
Zheng-Yu Niu, Dong-Hong Ji
Institute for Infocomm Research
21 Heng Mui Keng Terrace
119613 Singapore
{zniu, dhji}@i2r.a-star.edu.sg
Chew Lim Tan
Department of Computer Science
National University of Singapore
3 Science Drive 2
117543 Singapore
tancl@comp.nus.edu.sg
Abstract
Supervised and semi-supervised sense disambiguation methods will mis-tag the
instances of a target word if the senses of these instances are not defined in
sense inventories or if there are no tagged instances for these senses in the
training data. Here we use a model order identification method to avoid
misclassifying instances with undefined senses by discovering new senses from
mixed data (tagged and untagged corpora). The algorithm tries to obtain a
natural partition of the mixed data by maximizing a stability criterion, defined
on the classification result from an extended label propagation algorithm, over
all possible values of the number of senses (also called the sense number or
model order). Experimental results on SENSEVAL-3 data indicate that it
outperforms SVM, a one-class partially supervised classification algorithm, and
a clustering based model order identification algorithm when the tagged data is
incomplete.
1 Introduction
In this paper, we address the problem of partially supervised word sense
disambiguation, which is to disambiguate the senses of occurrences of a target
word in untagged texts when given an incomplete tagged corpus [1].
[1] "Incomplete tagged corpus" means that the tagged corpus does not include
instances of some senses of the target word, while these senses may occur in
untagged texts.
Word sense disambiguation can be defined as associating a target word in a text
or discourse with a definition or meaning. Many corpus based methods have been
proposed to deal with the sense disambiguation problem when given a definition
for each possible sense of a target word or a tagged corpus with instances of
each possible sense, e.g., supervised sense disambiguation (Leacock et al.,
1998) and semi-supervised sense disambiguation (Yarowsky, 1995).
Supervised methods usually rely on information from previously sense tagged
corpora to determine the senses of words in unseen texts. Semi-supervised
methods for WSD are characterized by exploiting unlabeled data in the learning
procedure, while still requiring predefined sense inventories for target words.
The information for semi-supervised sense disambiguation is usually obtained
from bilingual corpora (e.g., parallel corpora or untagged monolingual corpora
in two languages) (Brown et al., 1991; Dagan and Itai, 1994), or from
sense-tagged seed examples (Yarowsky, 1995).
Some observations can be made on the previous supervised and semi-supervised
methods. They always rely on hand-crafted lexicons (e.g., WordNet) as sense
inventories. But these resources may miss domain-specific senses, which leads
to an incomplete sense tagged corpus. Therefore, sense taggers trained on the
incomplete tagged corpus will misclassify some instances if the senses of these
instances are not defined in sense inventories. For example, suppose one
performs WSD on information technology related texts using WordNet [2] as the
sense inventory. When disambiguating the word "boot" in the phrase "boot
sector", the sense tagger will assign this instance one of the senses of "boot"
listed in WordNet. But the correct sense, "loading operating system into
memory", is not included in WordNet. Therefore, this instance will be
associated with an incorrect sense.
[2] The online version of WordNet is available at
http://wordnet.princeton.edu/cgi-bin/webwn2.0
So, in this work, we study the problem of partially supervised sense
disambiguation with an incomplete sense tagged corpus. Specifically, given an
incomplete sense-tagged corpus and a large amount of untagged examples for a
target word [3], we are interested in (1) labeling the instances in the
untagged corpus with sense tags occurring in the tagged corpus; (2) trying to
find undefined senses (or new senses) of the target word [4] from the untagged
corpus, which will be represented by instances from the untagged corpus.
[3] Untagged data usually includes occurrences of all the possible senses of
the target word.
[4] "Undefined senses" are the senses that do not appear in the tagged corpus.
We propose an automatic method to estimate the number of senses (or sense
number, model order) of a target word in mixed data (tagged corpus + untagged
corpus) by maximizing a stability criterion defined on the classification
result over all the possible values of the sense number. At the same time, we
can obtain a classification of the mixed data with the optimal number of
groups. If the estimated sense number in the mixed data is equal to the sense
number of the target word in the tagged corpus, then there is no new sense in
the untagged corpus. Otherwise new senses will be represented by groups in
which there is no instance from the tagged corpus.
This partially supervised sense disambiguation algorithm may help enrich
manually compiled lexicons by inducing new senses from untagged corpora.
This paper is organized as follows. First, a model order identification
algorithm for partially supervised sense disambiguation is presented in
section 2. Section 3 provides experimental results of this algorithm for sense
disambiguation on SENSEVAL-3 data. Related work on partially supervised
classification is then summarized in section 4. Finally, we conclude our work
and suggest possible improvements in section 5.
2 Partially Supervised Word Sense Disambiguation
The partially supervised sense disambiguation problem can be generalized as a
model order identification problem. We try to estimate the sense number of a
target word in mixed data (tagged corpus + untagged corpus) by maximizing a
stability criterion defined on classification results over all the possible
values of the sense number. If the estimated sense number in the mixed data is
equal to the sense number in the tagged corpus, then there is no new sense in
the untagged corpus. Otherwise new senses will be represented by clusters in
which there is no instance from the tagged corpus. The stability criterion
assesses the agreement between classification results on the full mixed data
and on sampled mixed data. Before the stability assessment, a partially
supervised classification algorithm, presented in section 2.1, is used to
classify the full or sampled mixed data into a given number of classes. The
details of the model order identification procedure are then provided in
section 2.2.
2.1 An Extended Label Propagation Algorithm
Table 1: Extended label propagation algorithm.
Function: ELP($D_L$, $D_U$, $k$, $Y^0_{D_L+D_U}$)
Input: labeled examples $D_L$, unlabeled examples $D_U$, model order $k$,
       initial labeling matrix $Y^0_{D_L+D_U}$;
Output: the labeling matrix $Y_{D_U}$ on $D_U$;
1 If $k < k_{X_L}$ then
      $Y_{D_U}$ = NULL;
2 Else if $k = k_{X_L}$ then
      Run plain label propagation algorithm on $D_U$ with $Y_{D_U}$ as output;
3 Else then
  3.1 Estimate the size of tagged data set of new classes;
  3.2 Generate tagged examples from $D_U$ for the $(k_{X_L}+1)$-th to $k$-th
      new classes;
  3.3 Run plain label propagation algorithm on $D_U$ with augmented tagged
      dataset as labeled data;
  3.4 $Y_{D_U}$ is the output from plain label propagation algorithm;
  End if
4 Return $Y_{D_U}$;
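The control flow of Table 1 can be sketched in Python as follows. This is only an illustrative outline under stated assumptions: the callables `plain_lp` and `select_seeds` are hypothetical placeholders for plain label propagation and for the seed generation of steps 3.1 and 3.2, class ids are assumed to be the integers 0 to k_XL - 1, seeds are drawn in one batch per new class, and the toy stand-ins at the bottom are not the classifiers used in the paper.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Item = Hashable
Tagged = List[Tuple[Item, int]]  # (example, class id)


def elp(labeled: Tagged, unlabeled: List[Item], k: int,
        plain_lp: Callable[[Tagged, List[Item]], Dict[Item, int]],
        select_seeds: Callable[[Tagged, List[Item], int], List[Item]],
        ) -> Optional[Dict[Item, int]]:
    """Control flow of Table 1; both callables are hypothetical placeholders."""
    k_known = len({c for _, c in labeled})        # k_{X_L}
    if k < k_known:                               # step 1: no valid labeling
        return None
    if k == k_known:                              # step 2: plain LP
        return plain_lp(labeled, unlabeled)
    augmented, remaining = list(labeled), list(unlabeled)
    result: Dict[Item, int] = {}
    for new_class in range(k_known, k):           # steps 3.1-3.2: seed new classes
        for x in select_seeds(augmented, remaining, new_class):
            augmented.append((x, new_class))
            remaining.remove(x)
            result[x] = new_class
    result.update(plain_lp(augmented, remaining))  # steps 3.3-3.4
    return result


# Toy stand-ins on 1-D points (assumptions made for this example only).
def nearest_label_lp(tagged, items):
    return {x: min(tagged, key=lambda t: abs(t[0] - x))[1] for x in items}


def farthest_point_seed(tagged, items, new_class):
    return [max(items, key=lambda x: min(abs(t[0] - x) for t in tagged))]


labeled = [(0.0, 0), (1.0, 1)]
unlabeled = [0.1, 0.9, 5.0, 5.2]
print(elp(labeled, unlabeled, 3, nearest_label_lp, farthest_point_seed))
```

With k = 3 the two distant points end up in the new class 2, while k = 2 reduces to the plain propagation stand-in.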
Let $X_{L+U} = \{x_i\}_{i=1}^{n}$ be a set of contexts of occurrences of an
ambiguous word $w$, where $x_i$ represents the context of the $i$-th
occurrence, and $n$ is the total number of this word's occurrences. Let
$S_L = \{s_j\}_{j=1}^{c}$ denote the sense tag set of $w$ in $X_L$, where $X_L$
denotes the first $l$ examples $x_g$ ($1 \le g \le l$) that are labeled as
$y_g$ ($y_g \in S_L$). Let $X_U$ denote the other $u$ ($l + u = n$) examples
$x_h$ ($l + 1 \le h \le n$) that are unlabeled.
Let $Y^0_{X_{L+U}} \in \mathbb{N}^{|X_{L+U}| \times |S_L|}$ represent initial
soft labels attached to tagged instances, where $Y^0_{X_{L+U},ij} = 1$ if $y_i$
is $s_j$ and 0 otherwise. Let $Y^0_{X_L}$ be the top $l$ rows of
$Y^0_{X_{L+U}}$ and $Y^0_{X_U}$ be the remaining $u$ rows. $Y^0_{X_L}$ is
consistent with the labeling in labeled data, and the initialization of
$Y^0_{X_U}$ can be arbitrary.
Let $k$ denote the possible value of the number of senses in mixed data
$X_{L+U}$, and $k_{X_L}$ be the number of senses in initial tagged data $X_L$.
Note that $k_{X_L} = |S_L|$, and $k \ge k_{X_L}$.
The classification algorithm in the order identification process should be able
to accept labeled data $D_L$ [5], unlabeled data $D_U$ [6] and model order $k$
as input, and assign a class label or a cluster index to each instance in $D_U$
as output. Previous supervised or semi-supervised algorithms (e.g., SVM, label
propagation algorithm (Zhu and Ghahramani, 2002)) cannot classify the examples
in $D_U$ into $k$ groups if $k > k_{X_L}$. The semi-supervised k-means
clustering algorithm (Wagstaff et al., 2001) may be used to perform clustering
analysis on mixed data, but its efficiency is a problem for clustering analysis
on a very large dataset, since multiple restarts are usually required to avoid
local optima and multiple iterations are run in each clustering process to
optimize a clustering solution.
In this work, we propose an alternative method, an extended label propagation
algorithm (ELP), which can classify the examples in $D_U$ into $k$ groups. If
the value of $k$ is equal to $k_{X_L}$, then ELP is identical to the plain
label propagation algorithm (LP) (Zhu and Ghahramani, 2002). Otherwise, if the
value of $k$ is greater than $k_{X_L}$, we perform classification by the
following steps:
(1) estimate the dataset size of each new class as $size_{new\_class}$ by
identifying the examples of new classes using the "Spy" technique [7] and
assuming that new classes are equally distributed;
(2) $D'_L = D_L$, $D'_U = D_U$;
(3) remove tagged examples of the $m$-th new class ($k_{X_L} + 1 \le m \le k$)
from $D'_L$ [8] and train a classifier on this labeled dataset without the
$m$-th class;
(4) the classifier is then used to classify the examples in $D'_U$;
(5) the least confidently classified unlabeled point $x_{class\_m} \in D'_U$,
together with its label $m$, is added to the labeled data:
$D'_L = D'_L + x_{class\_m}$, and $D'_U = D'_U - x_{class\_m}$;
(6) steps (3) to (5) are repeated for each new class till the augmented tagged
data set is large enough (here we try to select $size_{new\_class}/4$ examples
with their sense tags as tagged data for each new class);
(7) use plain LP algorithm to classify remaining unlabeled data $D'_U$ with
$D'_L$ as labeled data.
[5] $D_L$ may be the dataset $X_L$ or a subset sampled from $X_L$.
[6] $D_U$ may be the dataset $X_U$ or a subset sampled from $X_U$.
[7] The "Spy" technique was proposed in (Liu et al., 2003). Our
re-implementation of this technique consists of three steps: (1) sample a small
subset $D^s_L$ with the size $15\% \times |D_L|$ from $D_L$; (2) train a
classifier with tagged data $D_L - D^s_L$; (3) classify $D_U$ and $D^s_L$, and
then select as the dataset of new classes those examples from $D_U$ whose
classification confidence is less than the average confidence in $D^s_L$. The
classification confidence of the example $x_i$ is defined as the absolute value
of the difference between the two maximum values in the $i$-th row of the
labeling matrix.
[8] Initially there are no tagged examples for the $m$-th class in $D'_L$.
Therefore we do not need to remove tagged examples for this new class, and we
directly train a classifier with $D'_L$.
Table 1 shows this extended label propagation algorithm.
Next we will provide the details of the plain label propagation algorithm.
Define $W_{ij} = \exp(-d_{ij}^2 / \sigma^2)$ if $i \ne j$ and $W_{ii} = 0$
($1 \le i, j \le |D_L + D_U|$), where $d_{ij}$ is the distance (e.g., Euclidean
distance) between the examples $x_i$ and $x_j$, and $\sigma$ is used to control
the weight $W_{ij}$.
Define the $|D_L + D_U| \times |D_L + D_U|$ probability transition matrix
$T_{ij} = P(j \to i) = W_{ij} / \sum_{k=1}^{n} W_{kj}$, where $T_{ij}$ is the
probability to jump from example $x_j$ to example $x_i$.
Compute the row-normalized matrix $\bar{T}$ by
$\bar{T}_{ij} = T_{ij} / \sum_{k=1}^{n} T_{ik}$.
The classification solution is obtained by
$Y_{D_U} = (I - \bar{T}_{uu})^{-1} \bar{T}_{ul} Y^0_{D_L}$. $I$ is the
$|D_U| \times |D_U|$ identity matrix. $\bar{T}_{uu}$ and $\bar{T}_{ul}$ are
acquired by splitting matrix $\bar{T}$ after the $|D_L|$-th row and the
$|D_L|$-th column into 4 sub-matrices.
2.2 Model Order Identification Procedure
To achieve the model order identification (or sense number estimation) ability,
we use a cluster validation based criterion (Levine and Domany, 2001) to infer
the optimal number of senses of $w$ in $X_{L+U}$.
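Since every ELP call in the procedure of this section relies on plain label propagation, it may help to see the solution of Section 2.1 in code first. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a dense graph, a hypothetical function name `label_propagation`, and a fixed-point iteration with the labeled rows clamped instead of forming the matrix inverse. When every unlabeled example can reach a labeled one, this iteration converges to the closed-form solution of Section 2.1 (Zhu and Ghahramani, 2002).

```python
import math
from typing import List

Matrix = List[List[float]]


def label_propagation(dist: Matrix, y_labeled: Matrix,
                      sigma: float = 1.0, iters: int = 200) -> Matrix:
    """Soft labels for the unlabeled examples.

    dist: n x n distance matrix, with the l labeled examples first.
    y_labeled: l x c matrix of 0/1 labels (Y^0 restricted to D_L).
    """
    n, l, c = len(dist), len(y_labeled), len(y_labeled[0])
    # W_ij = exp(-d_ij^2 / sigma^2) for i != j, W_ii = 0
    w = [[0.0 if i == j else math.exp(-dist[i][j] ** 2 / sigma ** 2)
          for j in range(n)] for i in range(n)]
    # T_ij = W_ij / sum_k W_kj (column normalization)
    col_sums = [sum(w[k][j] for k in range(n)) for j in range(n)]
    t = [[w[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    # Row-normalize T to obtain T-bar
    t_bar = [[v / sum(row) for v in row] for row in t]
    y_u = [[0.0] * c for _ in range(n - l)]   # arbitrary initialization
    for _ in range(iters):
        y = y_labeled + y_u                    # labeled rows stay clamped
        y_u = [[sum(t_bar[i][j] * y[j][s] for j in range(n)) for s in range(c)]
               for i in range(l, n)]
    return y_u


# Toy data: two labeled points (one per class) and two unlabeled points.
points = [0.0, 4.0, 0.5, 3.5]
dist = [[abs(a - b) for b in points] for a in points]
soft = label_propagation(dist, [[1.0, 0.0], [0.0, 1.0]])
hard = [row.index(max(row)) for row in soft]
print(hard)
```

Each unlabeled point takes the class of the labeled point it sits next to; on large graphs a sparse solver for the linear system would be the more practical choice.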
Table 2: Model order evaluation algorithm.
Function: CV($X_{L+U}$, $k$, $q$, $Y^0_{X_{L+U}}$)
Input: data set $X_{L+U}$, model order $k$, and sampling frequency $q$;
Output: the score of the merit of $k$;
1 Run the extended label propagation algorithm with $X_L$, $X_U$, $k$ and
  $Y^0_{X_{L+U}}$;
2 Construct connectivity matrix $C_k$ based on above classification solution
  on $X_U$;
3 Use a random predictor $\rho_k$ to assign uniformly drawn labels to each
  vector in $X_U$;
4 Construct connectivity matrix $C_{\rho_k}$ using above classification
  solution on $X_U$;
5 For $\mu = 1$ to $q$ do
  5.1 Randomly sample a subset $X^\mu_{L+U}$ with the size $\alpha |X_{L+U}|$
      from $X_{L+U}$, $0 < \alpha < 1$;
  5.2 Run the extended label propagation algorithm with $X^\mu_L$, $X^\mu_U$,
      $k$ and $Y^{0\mu}$;
  5.3 Construct connectivity matrix $C^\mu_k$ using above classification
      solution on $X^\mu_U$;
  5.4 Use $\rho_k$ to assign uniformly drawn labels to each vector in
      $X^\mu_U$;
  5.5 Construct connectivity matrix $C^\mu_{\rho_k}$ using above
      classification solution on $X^\mu_U$;
  Endfor
6 Evaluate the merit of $k$ using following formula:
  $M_k = \frac{1}{q} \sum_{\mu} \left( M(C^\mu_k, C_k) - M(C^\mu_{\rho_k}, C_{\rho_k}) \right)$,
  where $M(C^\mu, C)$ is given by equation (2);
7 Return $M_k$;
Then this model order identification procedure can be formulated as:
$$\hat{k}_{X_{L+U}} = \arg\max_{K_{min} \le k \le K_{max}} \{ CV(X_{L+U}, k, q, Y^0_{X_{L+U}}) \}. \tag{1}$$
$\hat{k}_{X_{L+U}}$ is the estimated sense number in $X_{L+U}$, $K_{min}$ (or
$K_{max}$) is the minimum (or maximum) value of sense number, and $k$ is the
possible value of sense number in $X_{L+U}$. Note that $k \ge k_{X_L}$. Then we
set $K_{min} = k_{X_L}$. $K_{max}$ may be set as a value greater than the
possible ground-truth value. CV is a cluster validation based evaluation
function. Table 2 shows the details of this function. We set $q$, the
resampling frequency for estimation of stability score, as 20. $\alpha$ is set
as 0.90. The random predictor assigns uniformly distributed class labels to
each instance in a given dataset. We run this CV procedure for each value of
$k$. The value of $k$ that maximizes this function will be selected as the
estimation of sense number. At the same time, we can obtain a partition of
$X_{L+U}$ with $\hat{k}_{X_{L+U}}$ groups.
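Equation (1) amounts to a simple search over candidate model orders. A minimal sketch, assuming a hypothetical callable `cv` that returns the merit score of Table 2 for a given k, and using the offset m = 4 for K_max that the experiment section reports; the toy scores are invented for illustration:

```python
from typing import Callable


def estimate_sense_number(cv: Callable[[int], float], k_known: int, m: int = 4) -> int:
    """argmax of cv(k) over K_min = k_known <= k <= K_max = k_known + m."""
    k_min, k_max = k_known, k_known + m
    # max() keeps the first maximum, so ties are resolved toward the smaller k.
    return max(range(k_min, k_max + 1), key=cv)


# Invented merit scores for a word with 3 senses in the tagged data.
toy_scores = {3: 0.21, 4: 0.35, 5: 0.30, 6: 0.18, 7: 0.10}
k_hat = estimate_sense_number(toy_scores.get, k_known=3)
print(k_hat, "-> new senses found" if k_hat > 3 else "-> no new sense")
```

The tie-breaking rule is a choice of this sketch; the paper does not state how ties are handled.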
The function $M(C^\mu, C)$ in Table 2 is given by (Levine and Domany, 2001):
$$M(C^\mu, C) = \frac{\sum_{i,j} 1\{C^\mu_{i,j} = C_{i,j} = 1,\ x_i, x_j \in X^\mu_U\}}{\sum_{i,j} 1\{C_{i,j} = 1,\ x_i, x_j \in X^\mu_U\}}, \tag{2}$$
where $X^\mu_U$ is the untagged data in $X^\mu_{L+U}$, $X^\mu_{L+U}$ is a
subset with the size $\alpha |X_{L+U}|$ ($0 < \alpha < 1$) sampled from
$X_{L+U}$, and $C$ or $C^\mu$ is the $|X_U| \times |X_U|$ or
$|X^\mu_U| \times |X^\mu_U|$ connectivity matrix based on classification
solutions computed on $X_U$ or $X^\mu_U$ respectively. The connectivity matrix
$C$ is defined as: $C_{i,j} = 1$ if $x_i$ and $x_j$ belong to the same cluster,
otherwise $C_{i,j} = 0$. $C^\mu$ is calculated in the same way.
$M(C^\mu, C)$ measures the proportion of example pairs in each group computed
on $X_U$ that are also assigned into the same group by the classification
solution on $X^\mu_U$. Clearly, $0 \le M \le 1$. Intuitively, if the value of
$k$ is identical with the true value of sense number, then classification
results on the different subsets generated by sampling should be similar to
that on the full dataset. In other words, the classification solution with the
true model order as parameter is robust against resampling, which gives rise to
a local optimum of $M(C^\mu, C)$.
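The connectivity matrices and the figure of merit of equation (2) can be computed directly from two labelings. The sketch below is an illustration with hypothetical helper names; it assumes that the sum runs over ordered pairs of distinct examples (i != j), which the text does not state explicitly, and that the sampled examples are identified by their positions in X_U.

```python
from typing import List, Sequence


def connectivity(labels: Sequence[int]) -> List[List[int]]:
    """C[i][j] = 1 if examples i and j share a cluster, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def merit(c_sub: List[List[int]], c_full: List[List[int]],
          sub_index: Sequence[int]) -> float:
    """M(C^mu, C) of equation (2).

    c_full is built on X_U, c_sub on the sample X_U^mu; sub_index[p] is the
    position in X_U of the p-th sampled example.
    """
    together_full = together_both = 0
    for p, i in enumerate(sub_index):
        for r, j in enumerate(sub_index):
            if i != j and c_full[i][j] == 1:      # pair grouped on the full data
                together_full += 1
                together_both += c_sub[p][r]      # still grouped on the sample?
    return together_both / together_full if together_full else 0.0


full_labels = [0, 0, 0, 1, 1]          # solution on X_U
sample = [0, 1, 2, 3]                  # positions of the sampled examples
sample_labels = [5, 5, 7, 7]           # solution on the sample (ids need not match)
m_value = merit(connectivity(sample_labels), connectivity(full_labels), sample)
print(round(m_value, 3))
```

Only co-membership matters, so cluster ids from the two runs never need to be aligned.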
In this algorithm, we normalize $M(C^\mu_k, C_k)$ by the equation in step 6 of
Table 2, which makes our objective function different from the figure of merit
(equation (2)) proposed in (Levine and Domany, 2001). The reason to normalize
$M(C^\mu_k, C_k)$ is that $M(C^\mu_k, C_k)$ tends to decrease when increasing
the value of $k$ (Lange et al., 2002). Therefore, to avoid the bias toward
selecting a smaller value of $k$ as the model order, we use the cluster
validity of a random predictor to normalize $M(C^\mu_k, C_k)$.
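A compact sketch of the normalized score from step 6 of Table 2 is given below. It rests on several simplifying assumptions that are not in the paper: the classifier is passed in as a hypothetical callable `classify(items, k)`, resampling is done over the unlabeled examples only, pairs are unordered and distinct, and the toy classifier simply groups numbers by their integer part.

```python
import random
from itertools import combinations


def merit(full, sub):
    """M(C^mu, C) from two labelings given as dicts (example -> cluster)."""
    pairs = [(i, j) for i, j in combinations(sub, 2) if full[i] == full[j]]
    if not pairs:
        return 0.0
    return sum(sub[i] == sub[j] for i, j in pairs) / len(pairs)


def normalized_merit(items, k, classify, q=20, alpha=0.9, seed=0):
    """M_k: mean over q resamples of classifier merit minus random-predictor merit."""
    rng = random.Random(seed)

    def rand_labels(xs):                      # random predictor rho_k
        return {x: rng.randrange(k) for x in xs}

    full, full_rand = classify(items, k), rand_labels(items)
    total = 0.0
    for _ in range(q):
        sample = rng.sample(items, int(alpha * len(items)))
        total += merit(full, classify(sample, k)) - merit(full_rand, rand_labels(sample))
    return total / q


items = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5]
by_integer_part = lambda xs, k: {x: min(int(x), k - 1) for x in xs}
score = normalized_merit(items, 2, by_integer_part)
print(score)
```

Subtracting the random baseline is what keeps a small k from being favored merely because fewer groups are easier to reproduce.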
If $\hat{k}_{X_{L+U}}$ is equal to $k_{X_L}$, then there is no new sense in
$X_U$. Otherwise ($\hat{k}_{X_{L+U}} > k_{X_L}$) new senses of $w$ may be
represented by the groups in which there is no instance from $X_L$.
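Reading new senses off the final partition is a matter of finding the groups that contain no tagged instance. A small sketch, with hypothetical instance ids and a hypothetical function name:

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set


def new_sense_groups(group_of: Dict[Hashable, int],
                     tagged: Set[Hashable]) -> Dict[int, List[Hashable]]:
    """Return the groups of the partition that hold no instance from X_L."""
    members = defaultdict(list)
    for instance, group in group_of.items():
        members[group].append(instance)
    return {g: insts for g, insts in members.items()
            if not tagged.intersection(insts)}


# Partition of X_{L+U} into 3 groups; t1 and t2 are the tagged instances.
partition = {"t1": 0, "t2": 1, "u1": 0, "u2": 1, "u3": 2, "u4": 2}
print(new_sense_groups(partition, tagged={"t1", "t2"}))
```

An empty result corresponds to the case where the estimated sense number equals the sense number in the tagged data.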
3 Experiments and Results
3.1 Experiment Design
We evaluated the ELP based model order identification algorithm on the data in
the English lexical sample task of SENSEVAL-3 (including all the 57 English
words) [9], and further empirically compared it with other state of the art
classification methods, including SVM [10] (the state of the art method for
supervised word sense disambiguation (Mihalcea et al., 2004)), a one-class
partially supervised classification algorithm (Liu et al., 2003) [11], and a
semi-supervised k-means clustering based model order identification algorithm.
[9] Available at http://www.senseval.org/senseval3
[10] We used a linear SVMlight, available at http://svmlight.joachims.org/.
[11] Available at http://www.cs.uic.edu/~liub/LPU/LPU-download.html
The data for the English lexical sample task in SENSEVAL-3 consists of 7860
examples as official training data, and 3944 examples as official test data for
57 English words. The number of senses of each English word varies from 3 to
11.
We evaluated these four algorithms with different sizes of incomplete tagged
data. Given official training data of the word $w$, we constructed incomplete
tagged data $X_L$ by removing all the tagged instances from official training
data that have sense tags from $S_{subset}$, where $S_{subset}$ is a subset of
the ground-truth sense set $S$ for $w$, and $S$ consists of the sense tags in
the official training set for $w$. The removed training data and official test
data of $w$ were used as $X_U$. Note that $S_L = S - S_{subset}$. Then we ran
these four algorithms for each target word $w$ with $X_L$ as tagged data and
$X_U$ as untagged data, and evaluated their performance using the accuracy on
official test data of all the 57 words. We conducted six experiments for each
target word $w$ by setting $S_{subset}$ as $\{s_1\}$, $\{s_2\}$, $\{s_3\}$,
$\{s_1, s_2\}$, $\{s_1, s_3\}$, or $\{s_2, s_3\}$, where $s_i$ is the $i$-th
most frequent sense of $w$. $S_{subset}$ cannot be set as $\{s_4\}$ since some
words have only three senses. Table 3 lists the percentage of official training
data used as tagged data (the number of examples in incomplete tagged data
divided by the number of examples in official training data) when we removed
the instances with sense tags from $S_{subset}$ for all the 57 words. If
$S_{subset} = \{s_3\}$, then most of the sense tagged examples are still
included in tagged data. If $S_{subset} = \{s_1, s_2\}$, then there are very
few tagged examples in tagged data. If no instances are removed from official
training data, then the value of percentage is 100%.
Table 3: The percentage of official training data used as tagged data when
instances with different sense sets are removed from official training data.
$S_{subset}$        Percentage of official training data used as tagged data
$\{s_1\}$           42.8%
$\{s_2\}$           76.7%
$\{s_3\}$           89.1%
$\{s_1, s_2\}$      19.6%
$\{s_1, s_3\}$      32.0%
$\{s_2, s_3\}$      65.9%
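The construction of the incomplete tagged data can be sketched as follows. The function name, the data layout (a list of instance and sense pairs) and the toy instances are assumptions made for illustration; senses are ranked by their frequency in the training data, as in the definition of s_i above.

```python
from collections import Counter


def build_incomplete_data(train, test, removed_ranks):
    """Split data into X_L and X_U for one target word.

    train: list of (instance, sense) pairs; test: list of instances;
    removed_ranks: frequency ranks to remove, e.g. {1, 2} for {s1, s2}.
    Returns (X_L, X_U, percentage of training data kept as tagged data).
    """
    ranked = [s for s, _ in Counter(s for _, s in train).most_common()]
    s_subset = {ranked[r - 1] for r in removed_ranks}
    x_l = [(x, s) for x, s in train if s not in s_subset]
    x_u = [x for x, s in train if s in s_subset] + list(test)
    return x_l, x_u, 100.0 * len(x_l) / len(train)


train = [("a", "A"), ("b", "A"), ("c", "A"), ("d", "B"), ("e", "B"), ("f", "C")]
x_l, x_u, pct = build_incomplete_data(train, ["g", "h"], {2})
print(len(x_l), x_u, round(pct, 1))
```

Here the second most frequent sense ("B") is removed, so its two training instances join the test instances in X_U.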
Given an incomplete tagged corpus for a target word, SVM does not have the
ability to find new senses from the untagged corpus. Therefore it labels all
the instances in the untagged corpus with sense tags from $S_L$.
Given a set of positive examples for a class and a set of unlabeled examples,
the one-class partially supervised classification algorithm, LPU (Learning from
Positive and Unlabeled examples) (Liu et al., 2003), learns a classifier in
four steps:
Step 1: Identify a small set of reliable negative examples from unlabeled
examples by the use of a classifier.
Step 2: Build a classifier using positive examples and automatically selected
negative examples.
Step 3: Iteratively run previous two steps until no unlabeled examples are
classified as negative ones or the unlabeled set is null.
Step 4: Select a good classifier from the set of classifiers constructed above.
For comparison, LPU [12] was run to perform classification on $X_U$ for each
class in $X_L$. The label of each instance in $X_U$ was determined by
maximizing the classification score from LPU output for each class. If the
maximum score of an instance is negative, then this instance will be labeled as
a new class. Note that LPU classifies $X_{L+U}$ into $k_{X_L} + 1$ groups in
most cases.
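The rule for combining the per-class LPU outputs can be written in a few lines. The sketch assumes that the scores for one instance are available as a dictionary from sense tag to real-valued score; the function name and the "NEW" tag are hypothetical, and nothing here calls the LPU software itself.

```python
from typing import Dict


def combine_lpu_scores(scores: Dict[str, float], new_class: str = "NEW") -> str:
    """Pick the best-scoring known sense, or the new class if every score is negative."""
    best = max(scores, key=scores.get)
    return new_class if scores[best] < 0 else best


print(combine_lpu_scores({"s1": -0.4, "s2": 0.7}))
print(combine_lpu_scores({"s1": -0.4, "s2": -0.1}))
```

Because all rejected instances share one tag, this rule yields at most one extra group, which matches the remark that LPU produces k_XL + 1 groups in most cases.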
[12] The three parameters in LPU were set as follows: "-s1 spy -s2 svm -c 1".
It means that we used the spy technique for step 1 in LPU, the SVM algorithm
for step 2, and selected the first or the last classifier as the final
classifier. It is identical with the algorithm "Spy+SVM IS" in Liu et al.
(2003).
The clustering based partially supervised sense disambiguation algorithm was
implemented by replacing ELP with a semi-supervised k-means clustering
algorithm (Wagstaff et al., 2001) in the model order identification procedure.
The label information in labeled data was used to guide the semi-supervised
clustering on $X_{L+U}$. Firstly, the labeled data may be used to determine
initial cluster centroids. If the cluster number is greater than $k_{X_L}$, the
initial centroids of clusters for new classes will be assigned as randomly
selected instances. Secondly, in the clustering process, the instances with the
same class label will stay in the same cluster, while the instances with
different class labels will belong to different clusters. For a better
clustering solution, this clustering process will be restarted three times. The
clustering process will be terminated when the clustering solution converges or
the number of iteration steps is more than 30. $K_{min} = k_{X_L} = |S_L|$,
$K_{max} = K_{min} + m$. $m$ is set as 4.
We used Jensen-Shannon (JS) divergence (Lin, 1991) as distance measure for
semi-supervised clustering and ELP, since plain LP with JS divergence achieves
better performance than that with cosine similarity on SENSEVAL-3 data (Niu et
al., 2005).
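For reference, the Jensen-Shannon divergence between two discrete distributions can be computed as below. The sketch assumes that both inputs are already normalized probability vectors over the same features and uses base-2 logarithms, so the value lies in [0, 1]; how the feature vectors of the paper are turned into distributions is not specified here.

```python
import math
from typing import Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence of two distributions (log base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i = 0 contribute nothing; y_i > 0 wherever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


print(js_divergence([1.0, 0.0], [0.0, 1.0]))   # disjoint support
print(js_divergence([0.5, 0.5], [0.5, 0.5]))   # identical distributions
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric and stays finite when one distribution has zeros, which makes it usable as the distance in the weight matrix.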
For the LP process in the ELP algorithm, we constructed connected graphs as
follows: two instances $u$, $v$ will be connected by an edge if $u$ is among
$v$'s 10 nearest neighbors, or if $v$ is among $u$'s 10 nearest neighbors as
measured by cosine or JS distance measure (following (Zhu and Ghahramani,
2002)).
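The graph construction rule can be sketched as follows, assuming a precomputed distance matrix; the function name is hypothetical and the toy example uses k = 1 instead of 10 to keep the data small.

```python
from typing import List, Set, Tuple


def knn_graph(dist: List[List[float]], k: int = 10) -> Set[Tuple[int, int]]:
    """Undirected edges (u, v), u < v, where u is among v's k nearest
    neighbors or v is among u's k nearest neighbors."""
    n = len(dist)
    nearest = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        nearest.append(set(others[:k]))
    return {(u, v) for u in range(n) for v in range(u + 1, n)
            if v in nearest[u] or u in nearest[v]}


points = [0.0, 1.0, 3.0]
dist = [[abs(a - b) for b in points] for a in points]
print(sorted(knn_graph(dist, k=1)))
```

The "or" in the rule makes the graph symmetric even though the nearest-neighbor relation itself is not: point 2 is linked to point 1 although point 1's own nearest neighbor is point 0.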
We used three types of features to capture the information in all the
contextual sentences of target words in SENSEVAL-3 data for all the four
algorithms: part-of-speech of neighboring words with position information,
words in topical context without position information (after removing stop
words), and local collocations (the same as the feature set used in (Lee and
Ng, 2002), except that we did not use syntactic relations). We removed the
features with occurrence frequency (counted in both training set and test set)
less than 3 times.
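The frequency cut-off can be illustrated with a short sketch. The feature strings below are invented examples of the three feature types, and the function name is hypothetical; the instances passed in are assumed to cover both the training set and the test set.

```python
from collections import Counter
from typing import List


def prune_rare_features(instances: List[List[str]], min_count: int = 3) -> List[List[str]]:
    """Drop every feature that occurs fewer than min_count times overall."""
    counts = Counter(f for feats in instances for f in feats)
    return [[f for f in feats if counts[f] >= min_count] for feats in instances]


instances = [
    ["pos-1=DT", "topic=disk", "colloc=boot_sector"],
    ["pos-1=DT", "topic=disk"],
    ["pos-1=DT", "topic=shoe"],
]
print(prune_rare_features(instances))
```

Only "pos-1=DT" survives here, since it is the single feature that occurs three times.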
If the estimated sense number is more than the sense number in the initial
tagged corpus $X_L$, then the results from order identification based methods
will include instances from clusters of unknown classes. When assessing the
agreement between these classification results and the known results on the
official test set, we encounter the problem that there is no sense tag for the
instances in unknown classes. Slonim and Tishby (2000) proposed to assign
documents in each cluster with the most dominant class label in that cluster,
and then conducted evaluation on these labeled documents. Here we follow their
method for assigning sense tags to unknown classes from LPU, the clustering
based order identification process, and the ELP based order identification
process. We assigned the instances from unknown classes the dominant sense tag
in their cluster. The result from LPU always includes only one cluster of the
unknown class, and we likewise assigned its instances the dominant sense tag in
that cluster. When all instances have their sense tags, we evaluated the
results using the accuracy on the official test set.
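The evaluation convention for unknown classes can be sketched as below. The dictionaries, instance ids and sense tags are hypothetical toy data; `tag_unknown_clusters` gives every instance of an unknown cluster the gold sense that dominates that cluster, and `accuracy` then scores the result.

```python
from collections import Counter
from typing import Dict, Hashable


def tag_unknown_clusters(cluster_of: Dict[Hashable, int],
                         gold: Dict[Hashable, str]) -> Dict[Hashable, str]:
    """Label each unknown-class cluster with its dominant gold sense tag."""
    dominant = {}
    for c in set(cluster_of.values()):
        votes = Counter(gold[i] for i, ci in cluster_of.items() if ci == c)
        dominant[c] = votes.most_common(1)[0][0]
    return {i: dominant[c] for i, c in cluster_of.items()}


def accuracy(pred: Dict[Hashable, str], gold: Dict[Hashable, str]) -> float:
    return sum(pred[i] == gold[i] for i in pred) / len(pred)


gold = {"u1": "s3", "u2": "s3", "u3": "s1", "u4": "s2"}
clusters = {"u1": 7, "u2": 7, "u3": 7, "u4": 8}   # clusters of unknown classes
pred = tag_unknown_clusters(clusters, gold)
print(pred, accuracy(pred, gold))
```

Note that this convention uses the gold labels of the test instances to name the clusters, so it measures cluster purity for the unknown classes rather than blind prediction.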
3.2 Results on Sense Disambiguation
Table 4 summarizes the accuracy of SVM, LPU, the semi-supervised k-means
clustering algorithm with correct sense number $|S|$ or estimated sense number
$\hat{k}_{X_{L+U}}$ as input, and the ELP algorithm with correct sense number
$|S|$ or estimated sense number $\hat{k}_{X_{L+U}}$ as input using various
incomplete tagged data. The last row in Table 4 lists the average accuracy of
each algorithm over the six experimental settings. Using $|S|$ as input means
that we do not perform the order identification procedure, while using
$\hat{k}_{X_{L+U}}$ as input is to perform order identification and obtain the
classification results on $X_U$ at the same time.
We can see that the ELP based method outperforms the clustering based method in
terms of average accuracy under the same experiment setting, and these two
methods outperform SVM and LPU. Moreover, using the correct sense number as
input helps to improve the overall performance of both the clustering based
method and the ELP based method.
Comparing the performance of the same system with different sizes of tagged
data (from the first experiment to the third experiment, and from the fourth
experiment to the sixth experiment), we can see that the performance was
improved when given more labeled data. Furthermore, the ELP based method
outperforms other methods in terms of accuracy when rare senses (e.g., $s_3$)
are missing in the tagged data. It seems that the ELP based method has the
ability to find rare senses with the use of tagged and untagged corpora.
The LPU algorithm can deal only with the one-class classification problem.
Therefore the labeled data of other classes cannot be used when determining the
positive labeled data for the current class. ELP can use the labeled data of
all the known classes to determine the seeds of unknown classes. This may
explain why LPU's performance is worse than ELP based sense disambiguation,
although LPU can correctly estimate the sense number in $X_{L+U}$ when only one
sense is missing in $X_L$.
When very few labeled examples are available, the noise in labeled data makes
it difficult to learn the classification score (each entry in $Y_{D_U}$).
Therefore using the classification confidence criterion may lead to poor
performance of seed selection for unknown classes if the classification score
is not accurate. This may explain why the ELP based method does not outperform
the clustering based method with small labeled data (e.g.,
$S_{subset} = \{s_1\}$).
Table 4: Accuracy of SVM, LPU, the semi-supervised k-means clustering algorithm
with correct sense number $|S|$ or estimated sense number $\hat{k}_{X_{L+U}}$
as input, and the ELP algorithm with correct sense number $|S|$ or estimated
sense number $\hat{k}_{X_{L+U}}$ as input, on the official test data of the ELS
task in SENSEVAL-3 when given various incomplete tagged corpora.
$S_{subset}$       SVM     LPU     Clustering,   ELP,          Clustering,                 ELP,
                                   $|S|$ input   $|S|$ input   $\hat{k}_{X_{L+U}}$ input   $\hat{k}_{X_{L+U}}$ input
$\{s_1\}$          30.6%   22.3%   43.9%         47.8%         40.0%                       38.7%
$\{s_2\}$          59.7%   54.6%   44.0%         62.4%         48.5%                       62.6%
$\{s_3\}$          67.0%   53.4%   48.7%         67.2%         52.4%                       69.1%
$\{s_1, s_2\}$     14.6%   13.1%   44.4%         40.2%         35.6%                       33.0%
$\{s_1, s_3\}$     25.7%   21.1%   48.5%         37.9%         39.8%                       31.0%
$\{s_2, s_3\}$     56.2%   53.1%   47.3%         59.4%         46.6%                       58.7%
Average accuracy   42.3%   36.3%   46.1%         52.5%         43.8%                       48.9%
Table 5: Mean and standard deviation of the absolute values of the difference
between ground-truth results $|S|$ and sense numbers estimated by the
clustering or ELP based order identification procedure respectively.
$S_{subset}$       Clustering based method   ELP based method
$\{s_1\}$          1.3±1.1                   2.2±1.1
$\{s_2\}$          2.4±0.9                   2.4±0.9
$\{s_3\}$          2.6±0.7                   2.6±0.7
$\{s_1, s_2\}$     1.2±0.6                   1.6±0.5
$\{s_1, s_3\}$     1.4±0.6                   1.8±0.4
$\{s_2, s_3\}$     1.8±0.5                   1.8±0.5
3.3 Results on Sense Number Estimation
Table 5 provides the mean and standard deviation of absolute difference values
between ground-truth results $|S|$ and sense numbers estimated by clustering or
ELP based order identification procedures respectively. For example, if the
ground truth sense number of the word $w$ is $k_w$, and the estimated value is
$\hat{k}_w$, then the absolute value of the difference between these two values
is $|k_w - \hat{k}_w|$. We obtain this value for each word, and then calculate
the mean and standard deviation over this array of absolute values. LPU does
not have the order identification capability since it always assumes that there
is at least one new class in unlabeled data, and does not further differentiate
the instances from these new classes. Therefore we do not provide the order
identification results of LPU.
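The statistic reported in Table 5 can be computed as in the following sketch. The words and sense counts are invented, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or sample form was used.

```python
from statistics import mean, pstdev
from typing import Dict, Tuple


def order_error_stats(true_k: Dict[str, int], est_k: Dict[str, int]) -> Tuple[float, float]:
    """Mean and standard deviation of |k_w - k_hat_w| over all words w."""
    diffs = [abs(true_k[w] - est_k[w]) for w in true_k]
    return mean(diffs), pstdev(diffs)


true_k = {"bank": 3, "plant": 5, "line": 4}   # hypothetical ground truth |S|
est_k = {"bank": 4, "plant": 4, "line": 6}    # hypothetical estimates
mu, sd = order_error_stats(true_k, est_k)
print(round(mu, 2), round(sd, 2))
```

Because absolute values are taken, over- and under-estimates of the sense number are penalized equally and do not cancel out.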
From the results in Table 5, we can see that estimated sense numbers are closer
to ground truth results when given less labeled data for clustering or ELP
based methods. Moreover, the clustering based method performs better than the
ELP based method in terms of order identification when given less labeled data
(e.g., $S_{subset} = \{s_1\}$). It seems that ELP is not robust to the noise in
small labeled data, compared with the semi-supervised k-means clustering
algorithm.
4 Related Work
The work closest to ours is partially supervised classification, or building
classifiers using positive examples and unlabeled examples, which has been
studied in the machine learning community (Denis et al., 2002; Liu et al.,
2003; Manevitz and Yousef, 2001; Yu et al., 2002). However, these methods
cannot group negative examples into meaningful clusters. In contrast, our
algorithm can find the occurrence of negative examples and further group these
negative examples into a "natural" number of clusters. Semi-supervised
clustering (Wagstaff et al., 2001) may be used to perform classification by the
use of labeled and unlabeled examples, but it encounters the same problem as
partially supervised classification: the model order cannot be automatically
estimated.
Levine and Domany (2001) and Lange et al.
(2002) proposed cluster validation based criteria
for cluster number estimation. However, they
showed the application of the cluster validation
method only for unsupervised learning. Our work
can be considered as an extension of their methods
in the setting of partially supervised learning.
In the natural language processing community, the work that is closely related
to ours is word sense discrimination, which can induce senses by grouping
occurrences of a word into clusters (Schütze, 1998). If it is considered as an
unsupervised method to solve the sense disambiguation problem, then our method
employs a partially supervised learning technique to deal with the sense
disambiguation problem by use of tagged and untagged texts.
5 Conclusions
In this paper, we present an order identification based partially supervised
classification algorithm and investigate its application to the partially
supervised word sense disambiguation problem. Experimental results on
SENSEVAL-3 data indicate that our ELP based model order identification
algorithm achieves better performance than other state of the art
classification algorithms, e.g., SVM, a one-class partially supervised
algorithm (LPU), and a semi-supervised k-means clustering based model order
identification algorithm.

References
Brown, P., Della Pietra, S., Della Pietra, V., & Mercer, R. 1991. Word Sense
Disambiguation Using Statistical Methods. Proceedings of ACL.
Dagan, I. & Itai, A. 1994. Word Sense Disambiguation Using A Second Language
Monolingual Corpus. Computational Linguistics, Vol. 20(4), pp. 563-596.
Denis, F., Gilleron, R., & Tommasi, M. 2002. Text Classification from Positive
and Unlabeled Examples. Proceedings of the 9th International Conference on
Information Processing and Management of Uncertainty in Knowledge-Based
Systems.
Lange, T., Braun, M., Roth, V., & Buhmann, J. M. 2002. Stability-Based Model
Selection. NIPS 15.
Leacock, C., Miller, G.A. & Chodorow, M. 1998. Using Corpus Statistics and
WordNet Relations for Sense Identification. Computational Linguistics, 24:1,
147-165.
Lee, Y.K. & Ng, H.T. 2002. An Empirical Evaluation of Knowledge Sources and
Learning Algorithms for Word Sense Disambiguation. Proceedings of EMNLP,
pp. 41-48.
Levine, E., & Domany, E. 2001. Resampling Method for Unsupervised Estimation of
Cluster Validity. Neural Computation, Vol. 13, 2573-2593.
Lin, J. 1991. Divergence Measures Based on the Shannon Entropy. IEEE
Transactions on Information Theory, 37:1, 145-150.
Liu, B., Dai, Y., Li, X., Lee, W.S., & Yu, P. 2003. Building Text Classifiers
Using Positive and Unlabeled Examples. Proceedings of IEEE ICDM.
Manevitz, L.M., & Yousef, M. 2001. One Class SVMs for Document Classification.
Journal of Machine Learning Research, 2, 139-154.
Mihalcea, R., Chklovski, T., & Kilgarriff, A. 2004. The SENSEVAL-3 English
Lexical Sample Task. SENSEVAL-2004.
Niu, Z.Y., Ji, D.H., & Tan, C.L. 2005. Word Sense Disambiguation Using Label
Propagation Based Semi-Supervised Learning. Proceedings of ACL.
Schütze, H. 1998. Automatic Word Sense Discrimination. Computational
Linguistics, 24:1, 97-123.
Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. 2001. Constrained K-Means
Clustering with Background Knowledge. Proceedings of ICML.
Yarowsky, D. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised
Methods. Proceedings of ACL.
Yu, H., Han, J., & Chang, K. C.-C. 2002. PEBL: Positive example based learning
for web page classification using SVM. Proceedings of ACM SIGKDD.
Zhu, X. & Ghahramani, Z. 2002. Learning from Labeled and Unlabeled Data with
Label Propagation. CMU CALD tech report CMU-CALD-02-107.
