Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 907–914, Vancouver, October 2005. c©2005 Association for Computational Linguistics
A Semi-Supervised Feature Clustering Algorithm
with Application to Word Sense Disambiguation
Zheng-Yu Niu, Dong-Hong Ji
Institute for Infocomm Research
21 Heng Mui Keng Terrace
119613 Singapore
{zniu, dhji}@i2r.a-star.edu.sg
Chew Lim Tan
Department of Computer Science
National University of Singapore
3 Science Drive 2
117543 Singapore
tancl@comp.nus.edu.sg
Abstract
In this paper we investigate an applica-
tion of feature clustering for word sense
disambiguation, and propose a semi-
supervised feature clustering algorithm.
Compared with other feature clustering
methods (ex. supervised feature cluster-
ing), it can infer the distribution of class
labels over (unseen) features unavailable
in training data (labeled data) by the use of
the distribution of class labels over (seen)
features available in training data. Thus,
it can deal with both seen and unseen fea-
tures in feature clustering process. Our ex-
perimental results show that feature clus-
tering can aggressively reduce the dimen-
sionality of feature space, while still main-
taining state of the art sense disambigua-
tion accuracy. Furthermore, when com-
bined with a semi-supervised WSD algo-
rithm, semi-supervised feature clustering
outperforms other dimensionality reduc-
tion techniques, which indicates that using
unlabeled data in learning process helps to
improve the performance of feature clus-
tering and sense disambiguation.
1 Introduction
This paper deals with word sense disambiguation
(WSD) problem, which is to assign an appropriate
sense to an occurrence of a word in a given context.
Many corpus based statistical methods have been
proposed to solve this problem, including supervised
learning algorithms (Leacock et al., 1998; Towel and
Voorheest, 1998), weakly supervised learning algo-
rithms (Dagan and Itai, 1994; Li and Li, 2004; Mi-
halcea, 2004; Niu et al., 2005; Park et al., 2000;
Yarowsky, 1995), unsupervised learning algorithms
(or word sense discrimination) (Pedersen and Bruce,
1997; Sch¨utze, 1998), and knowledge based algo-
rithms (Lesk, 1986; McCarthy et al., 2004).
In general, the most common approaches start by
evaluating the co-occurrence matrix of features ver-
sus contexts of instances of ambiguous word, given
sense-tagged training data for this target word. As
a result, contexts are usually represented in a high-
dimensional sparse feature space, which is far from
optimal for many classification algorithms. Further-
more, processing data lying in high-dimensional fea-
ture space requires large amount of memory and
CPU time, which limits the scalability of WSD
model to very large datasets or incorporation of
WSD model into natural language processing sys-
tems.
Standard dimentionality reduction techniques in-
clude (1) supervised feature selection and super-
vised feature clustering when given labeled data, (2)
unsupervised feature selection, latent semantic in-
dexing, and unsupervised feature clustering when
only unlabeled data is available. Supervised fea-
ture selection improves the performance of an ex-
amplar based learning algorithm over SENSEVAL-
2 data (Mihalcea, 2002), Naive Bayes and deci-
sion tree over SENSEVAL-1 and SENSEVAL-2 data
(Lee and Ng, 2002), but feature selection does not
improve SVM and Adaboost over SENSEVAL-1
and SENSEVAL-2 data (Lee and Ng, 2002) for
word sense disambiguation. Latent semantic in-
dexing (LSI) studied in (Sch¨utze, 1998) improves
the performance of sense discrimination, while un-
supervised feature selection also improves the per-
formance of word sense discrimination (Niu et al.,
2004). But little work is done on using feature clus-
tering to conduct dimensionality reduction for WSD.
This paper will describe an application of feature
907
clustering technique to WSD task.
Feature clustering has been extensively studied
for the benefit of text categorization and document
clustering. In the context of text categorization, su-
pervised feature clustering algorithms (Baker and
McCallum, 1998; Bekkerman et al., 2003; Slonim
and Tishby, 2001) usually cluster words into groups
based on the distribution of class labels over fea-
tures, which can compress the feature space much
more aggressively while still maintaining state of
the art classification accuracy. In the context of
document clustering, unsupervised feature cluster-
ing algorithms (Dhillon, 2001; Dhillon et al., 2002;
Dhillon et al., 2003; El-Yaniv and Souroujon, 2001;
Slonim and Tishby, 2000) perform word clustering
by the use of word-document co-occurrence matrix,
which can improve the performance of document
clustering by clustering documents over word clus-
ters.
Supervised feature clustering algorithm groups
features into clusters based on the distribution of
class labels over features. But it can not group un-
seen features (features that do not occur in labeled
data) into meaningful clusters since there are no
class labels associated with these unseen features.
On the other hand, while given labeled data, un-
supervised feature clustering method can not uti-
lize class label information to guide feature cluster-
ing procedure. While, as a promising classification
strategy, semi-supervised learning methods (Zhou et
al., 2003; Zhu and Ghahramani, 2002; Zhu et al.,
2003) usually utilize all the features occurring in la-
beled data and unlabeled data. So in this paper we
propose a semi-supervised feature clustering algo-
rithm to overcome this problem. Firstly, we try to
induce class labels for unseen features based on the
similarity among seen features and unseen features.
Then all the features (including seen features and
unseen features) are clustered based on the distrib-
ution of class labels over them.
This paper is organized as follows. First, we
will formulate a feature clustering based WSD prob-
lem in section 2. Then in section 3 we will de-
scribe a semi-supervised feature clustering algo-
rithm. Section 4 will provide experimental results
of various dimensionality reduction techniques with
combination of state of the art WSD algorithms on
SENSEVAL-3 data. Section 5 will provide a review
of related work on feature clustering. Finally we will
conclude our work and suggest possible improve-
ment in section 6.
2 Problem Setup
Let X = {xi}ni=1 be a set of contexts of occur-
rences of an ambiguous word w, where xi repre-
sents the context of the i-th occurrence, and n is
the total number of this word’s occurrences. Let
S = {sj}cj=1 denote the sense tag set of w. The first
l examples xg(1 ≤ g ≤ l) are labeled as yg (yg ∈ S)
and other u (l+u = n) examples xh(l+1 ≤ h ≤ n)
are unlabeled. The goal is to predict the sense of w
in context xh by the use of label information of xg
and similarity information among examples in X.
We use tildewideF to represent feature clustering result
into NtildewideF clusters when F is a set of features. After
feature clustering, any context xi in X can be repre-
sented as a vector over feature clusters tildewideF. Then we
can use supervised methods (ex. SVM) (Lee and
Ng, 2002) or semi-supervised methods (ex. label
propagation algorithm) (Niu et al., 2005) to perform
sense disambiguation on unlabeled instances of tar-
get word.
3 Semi-Supervised Feature Clustering
Algorithm
In supervised feature clustering process, F consists
of features occurring in the first l labeled examples,
which can be denoted as FL. But in the setting of
transductive learning, semi-supervised learning al-
gorithms will utilize not only the features in labeled
examples (FL), but also unseen features in unlabeled
examples (denoted as FL). FL consists of the fea-
tures that occur in unlabeled data, but never appear
in labeled data.
Supervised feature clustering algorithm usually
performs clustering analysis over feature-class ma-
trix, where each entry (i,j) in this matrix is the num-
ber of times of the i-th feature co-occurring with the
j-th class. Therefore it can not group features in FL
into meaningful clusters since there are no class la-
bels associated with these features. We overcome
this problem by firstly inducing class labels for un-
seen features based on the similarity among features
in FL and FL, then clustering all the features (in-
cluding FL and FL) based on the distribution of class
908
labels over them.
This semi-supervised feature clustering algorithm
is defined as follows:
Input:
Feature set F = FL uniontextFL (the first |FL| features
in F belong to FL, and the remaining |FL| features
belong to FL), context set X, the label information
of xg(1 ≤ g ≤ l), NtildewideF (the number of clusters in tildewideF);
Output:
Clustering solution tildewideF;
Algorithm:
1. Construct |F| × |X| feature-example matrix
MF,X, where entry MF,Xi,j is the number of times of
fi co-occurring with example xj (1 ≤ j ≤ n).
2. Form |F|×|F| affinity matrix W defined by
Wij = exp(−d
2
ij
σ2 ) if i negationslash= j and Wii = 0 (1 ≤i,j ≤ |F|), where d
ij is the distance (ex. Euclid-
ean distance) between fi (the i-th row in MF,X) and
fj (the j-th row in MF,X), and σ is used to control
the weight Wij.
3. Construct |FL| × |S| feature-class matrix
Y FL,S, where the entry Y FL,Si,j is the number of
times of feature fi (fi ∈ FL) co-occurring with
sense sj.
4. Obtain hard label matrix for features in FL
(denoted as Y FL,Shard ) based on Y FL,S, where entry
Y F,Shard i,j = 1 if the hard label of fi is sj, otherwise
zero. Obtain hard labels for features in FL using
a classifier based on W and Y FL,Shard . In this paper
we use label propagation (LP) algorithm (Zhu and
Ghahramani, 2002) to get hard labels for FL.
5. Construct |F|×|S| feature-class matrix Y F,Shard,
where entry Y F,Shard i,j = 1 if the hard label of fi is
sj, otherwise zero.
6. Construct the matrix L = D−1/2WD−1/2 in
which D is a diagonal matrix with its (i,i)-element
equal to the sum of the i-th row of W.
7. Label each feature in F as soft label hatwideY F,Si , the
i-th row of hatwideY F,S, where hatwideY F,S = (I −αL)−1Y F,Shard.
8. Obtain the feature clustering solution tildewideF by
clustering the rows of hatwideY F,Si into NtildewideF groups. In
this paper we use sequential information bottleneck
(sIB) algorithm (Slonim and Tishby, 2000) to per-
form clustering analysis.
End
Step 3 ∼ 5 are the process to obtain hard la-
bels for features in F, while the operation in step 6
and 7 is a local and global consistency based semi-
supervised learning (LGC) algorithm (Zhou et al.,
2003) that smooth the classification result of LP al-
gorithm to acquire a soft label for each feature.
At first sight, this semi-supervised feature cluster-
ing algorithm seems to make little sense. Since we
run feature clustering in step 8, why not use LP algo-
rithm to obtain soft label matrix Y FL,S for features
in FL by the use of Y FL,S and W, then just apply
sIB directly to soft label matrix hatwideY F,S (constructed
by catenating Y FL,S and Y FL,S)?
The reason for using LGC algorithm to acquire
soft labels for features in F is that in the context
of transductive learning, the size of labeled data is
rather small, which is much less than that of un-
labeled data. This makes it difficult to obtain re-
liable estimation of class label’s distribution over
features from only labeled data. This motivates
us to use raw information (hard labels of features
in FL) from labeled data to estimate hard labels
of features in FL. Then LGC algorithm is used
to smooth the classification result of LP algorithm
based on the assumption that a good classification
should change slowly on the coherent structure ag-
gregated by a large amount of unlabeled data. This
operation makes our algorithm more robust to the
noise in feature-class matrix Y FL,S that is estimated
from labeled data.
In this paper, σ is set as the average distance be-
tween labeled examples from different classes, and
NtildewideF = |F|/10. Latent semantic indexing technique
(LSI) is used to perform factor analysis in MF,X be-
fore calculating the distance between features in step
2.
4 Experiments and Results
4.1 Experiment Design
For empirical study of dimensionality reduction
techniques on WSD task, we evaluated five dimen-
sionality reduction algorithms on the data in English
lexical sample (ELS) task of SENSEVAL-3 (Mihal-
cea et al., 2004)(including all the 57 English words
) 1: supervised feature clustering (SuFC) (Baker and
McCallum, 1998; Bekkerman et al., 2003; Slonim
1Available at http://www.senseval.org/senseval3
909
and Tishby, 2001), iterative double clustering (IDC)
(El-Yaniv and Souroujon, 2001), semi-supervised
feature clustering (SemiFC) (our algorithm), super-
vised feature selection (SuFS) (Forman, 2003), and
latent semantic indexing (LSI) (Deerwester et. al.,
1990) 2.
We used sIB algorithm 3 to cluster features in
FL into groups based on the distribution of class la-
bels associated with each feature. This procedure
can be considered as our re-implementation of su-
pervised feature clustering. After feature clustering,
examples can be represented as vectors over feature
clusters.
IDC is an extension of double clustering method
(DC) (Slonim and Tishby, 2000), which performs it-
erations of DC. In the transductive version of IDC,
they cluster features in F as distributions over class
labels (given by the labeled data) during the first
stage of the IDC first iteration. This phase results in
feature clusters tildewideF. Then they continue as usual; that
is, in the second phase of the first IDC iteration they
group X into NtildewideX clusters, where X is represented
as distribution over tildewideF. Subsequent IDC iterations
use all the unlabeled data. This IDC algorithm can
result in two clustering solutions: tildewideF and tildewideX. Follow-
ing (El-Yaniv and Souroujon, 2001), the number of
iterations is set as 15, and NtildewideX = |S| (the number of
senses of target word) in our re-implementation of
IDC. After performing IDC, examples can be repre-
sented as vectors over feature clusters tildewideF.
Supervised feature selection has been extensively
studied for text categorization task (Forman, 2003).
Information gain (IG) is one of state of the art cri-
teria for feature selection, which measures the de-
crease in entropy when the feature is given vs. ab-
sent. In this paper, we calculate IG score for each
feature in FL, then select top |F|/10 features with
highest scores to form reduced feature set. Then
examples can be represented as vectors over the re-
duced feature set.
LSI is an unsupervised factor analysis technique
based on Singular Value Decomposition of a |X|×
|F| example-feature matrix. The underlying tech-
nique for LSI is to find an orthogonal basis for the
2Following (Baker and McCallum, 1998), we use LSI as a
representative method for unsupervised dimensionality reduc-
tion.
3Available at http://www.cs.huji.ac.il/∼noamm/
feature-example space for which the axes lie along
the dimensions of maximum variance. After using
LSI on the example-feature matrix, we can get vec-
tor representation for each example in X in reduced
feature space.
For each ambiguous word in ELS task of
SENSEVAL-3, we used three types of features to
capture contextual information: part-of-speech of
neighboring words with position information, un-
ordered single words in topical context, and local
collocations (as same as the feature set used in (Lee
and Ng, 2002) except that we did not use syntactic
relations). We removed the features with occurrence
frequency (counted in both training set and test set)
less than 3 times.
We ran these five algorithms for each ambiguous
word to reduce the dimensionality of feature space
from |F| to |F|/10 no matter which training data is
used (ex. full SENSEVAL-3 training data or sam-
pled SENSEVAL-3 training data). Then we can ob-
tain new vector representation of X in new feature
space acquired by SuFC, IDC, SemiFC, and LSI or
reduced feature set by SuFS.
Then we used SVM 4 and LP algorithm to per-
form sense disambiguation on vectors in dimension-
ality reduced feature space. SVM and LP were eval-
uated using accuracy 5 (fine-grained score) on test
set of SENSEVAL-3. For LP algorithm, the test set
in SENSEVAL-3 data was also used as unlabeled
data in tranductive learning process.
We investigated two distance measures for LP: co-
sine similarity and Jensen-Shannon (JS) divergence
(Lin, 1991). Cosine similarity measures the angle
between two feature vectors, while JS divergence
measures the distance between two probability dis-
tributions if each feature vector is considered as
probability distribution over features.
For sense disambiguation on SENSEVAL-3 data,
we constructed connected graphs for LP algorithm
following (Niu et al., 2005): two instances u,v will
be connected by an edge if u is among v’s k nearest
neighbors, or if v is among u’s k nearest neighbors
4We used SVMlight with linear kernel function, available
at http://svmlight.joachims.org/.
5If there are multiple sense tags for an instance in training
set or test set, then only the first tag is considered as correct
answer. Furthermore, if the answer of the instance in test set is
“U”, then this instance will be removed from test set.
910
as measured by cosine or JS distance measure. k is
5 in later experiments.
4.2 Experiments on Full SENSEVAL-3 Data
In this experiment, we took the training set in
SENSEVAL-3 as labeled data, and the test set as un-
labeled data. In other words, all of dimensionality
reduction methods and classifiers can use the label
information in training set, but can not access the
label information in test set. We evaluated differ-
ent sense disambiguation processes using test set in
SENSEVAL-3.
We use features with occurrence frequency no less
than 3 in training set and test set as feature set F for
each ambiguous word. F consists of two disjoint
subsets: FL and FL. FL consists of features occur-
ring in training set of target word in SENSEVAL-3,
while FL consists of features that occur in test set,
but never appear in training set.
Table 1 lists accuracies of SVM and LP
without or with dimensionality reduction on full
SENSEVAL-3 data. From this table, we have some
findings as follows:
(1) If without dimensionality reduction, the best
performance of sense disambiguation is 70.3%
(LPJS), while if using dimensionality reduction,
the best two systems can achieve 69.8% (SuFS +
LPJS) and 69.0% (SemiFC + LPJS) accuracies.
It seems that feature selection and feature clustering
can significantly reduce the dimensionality of fea-
ture space while losing only about 1.0% accuracy.
(2) Furthermore, LPJS algorithm performs bet-
ter than SVM when combined with the same dimen-
sionality reduction technique (except IDC). Notice
that LP algorithm uses unlabelled data during its dis-
ambiguation phase while SVM doesn’t. This indi-
cates that using unlabeled data helps to improve the
performance of sense disambiguation.
(3) When using LP algorithm for sense disam-
biguation, SemiFC performs better than other fea-
ture clustering algorithms, such as SuFC, IDC.
This indicates that clustering seen and unseen fea-
tures can satisfy the requirement of semi-supervised
learning algorithm, which does help the classifica-
tion process.
(4) When using SuFC, IDC, SuFS, or SemiFC for
dimensionality reduction, the performance of sense
disambiguation is always better than that using LSI
as dimensionality reduction method. SuFC, IDC,
SuFS, and SemiFC use label information to guide
feature clustering or feature selection, while LSI is
an unsupervised factor analysis method that can con-
duct dimensionality reduction without the use of la-
bel information from labeled data. This indicates
that using label information in dimensionality re-
duction procedure can cluster features into better
groups or select better feature subsets, which results
in better representation of contexts in reduced fea-
ture space.
4.3 Additional Experiments on Sampled
SENSEVAL-3 Data
For investigating the performance of various dimen-
sionality reduction techniques with very small train-
ing data, we ran them with only lw examples from
training set of each word in SENSEVAL-3 as la-
beled data. The remaining training examples and
all the test examples were used as unlabeled data
for SemiFC or LP algorithm. Finally we evaluated
different sense disambiguation processes using test
set in SENSEVAL-3. For each labeled set size lw,
we performed 20 trials. In each trial, we randomly
sampled lw labeled examples for each word from
training set. If any sense was absent from the sam-
pled labeled set, we redid the sampling. lw is set as
Nw,train × 10%, where Nw,train is the number of
examples in training set of word w. Other settings
of this experiment is as same as that of previous one
in section 4.2.
In this experiment, feature set F is as same as that
in section 4.2. FL consists of features occurring in
sampled training set of target word in SENSEVAL-
3, while FL consists of features that occur in unla-
beled data (including unselected training data and all
the test set), but never appear in labeled data (sam-
pled training set).
Table 2 lists accuracies of SVM and LP with-
out or with dimensionality reduction on sampled
SENSEVAL-3 training data 6. From this table, we
have some findings as follows:
(1) If without dimensionality reduction, the best
performance of sense disambiguation is 54.9%
(LPJS), while if using dimensionality reduction, the
6We can not obtain the results of IDC over 20 trials since it
costs about 50 hours for each trial (Pentium 1.4 GHz CPU/1.0
GB memory).
911
Table 1: This table lists the accuracies of SVM and LP without or with dimensionality reduction on full
SENSEVAL-3 data. There is no result for LSI + LPJS, since the vectors obtained by LSI may contain
negative values, which prohibits the application of JS divergence for measuring the distance between these
vectors.
Without With various dimensionality
dimensionality reduction techniques
Classifier reduction SuFC IDC SuFS LSI SemiFC
SVM 69.7% 66.4% 65.1% 65.2% 59.1% 64.0%
LPcosine 68.4% 66.7% 64.9% 66.0% 60.7% 67.6%
LPJS 70.3% 67.2% 64.0% 69.8% - 69.0%
Table 2: This table lists the accuracies of SVM and LP without or with dimensionality reduction on sam-
pled SENSEVAL-3 training data. For each classifier, we performed paired t-test between the system using
SemiFC for dimensionality reduction and any other system with or without dimensionality reduction. ≫ (or
≪) means p-value ≤ 0.01, while > (or <) means p-value falling into (0.01,0.05]. Both ≫ (or ≪) and >
(or <) indicate that the performance of current WSD system is significantly better (or worse) than that using
SemiFC for dimensionality reduction, when given same classifier.
Without With various dimensionality
dimensionality reduction techniques
Classifier reduction SuFC SuFS LSI SemiFC
SVM 53.4±1.1% (≫) 50.4±1.1% (≪) 52.2±1.2% (>) 49.8±0.8% (≪) 51.5±1.0%
LPcosine 54.4±1.2% (≫) 49.5±1.1% (≪) 51.1±1.0% (≪) 49.8±1.0% (≪) 52.9±1.0%
LPJS 54.9±1.1% (≫) 52.0±0.9% (≪) 52.5±1.0% (≪) - 54.1±1.2%
best performance of sense disambiguation is 54.1%
(SemiFC + LPJS). Feature clustering can signif-
icantly reduce the dimensionality of feature space
while losing only 0.8% accuracy.
(2) LPJS algorithm performs better than SVM
when combined with most of dimensionality reduc-
tion techniques. This result confirmed our previous
conclusion that using unlabeled data can improve
the sense disambiguation process. Furthermore,
SemiFC performs significantly better than SuFC and
SuFS when using LP as the classifier for sense dis-
ambiguation. The reason is that when given very
few labeled examples, the distribution of class labels
over features can not be reliably estimated, which
deteriorates the performance of SuFC or SuFS. But
SemiFC uses only raw label information (hard label
of each feature) estimated from labeled data, which
makes it robust to the noise in very small labeled
data.
(3) SuFC, SuFS and SemiFC perform better than
LSI no matter which classifier is used for sense dis-
ambiguation. This observation confirmed our previ-
ous conclusion that using label information to guide
dimensionality reduction process can result in bet-
ter representation of contexts in feature subspace,
which further improves the results of sense disam-
biguation.
5 Related Work
Feature clustering has been extensively studied for
the benefit of text categorization and document clus-
tering, which can be categorized as supervised fea-
ture clustering, semi-supervised feature clustering,
and unsupervised feature clustering.
Supervised feature clustering algorithms (Baker
and McCallum, 1998; Bekkerman et al., 2003;
Slonim and Tishby, 2001) usually cluster words into
groups based on the distribution of class labels over
features. Baker and McCallum (1998) apply super-
vised feature clustering based on distributional clus-
tering for text categorization, which can compress
the feature space much more aggressively while still
912
maintaining state of the art classification accuracy.
Slonim and Tishby (2001) and Bekkerman et. al.
(2003) apply information bottleneck method to find
word clusters. They present similar results with the
work by Baker and McCallum (1998). Slonim and
Tishby (2001) goes further to show that when the
training sample is small, word clusters can yield sig-
nificant improvement in classification accuracy.
Unsupervised feature clustering algorithms
(Dhillon, 2001; Dhillon et al., 2002; Dhillon et al.,
2003; El-Yaniv and Souroujon, 2001; Slonim and
Tishby, 2000) perform word clustering by the use
of word-document co-occurrence matrix, which do
not utilize class labels to guide clustering process.
Slonim and Tishby (2000), El-Yaniv and Souroujon
(2001) and Dhillon et. al. (2003) show that word
clusters can improve the performance of document
clustering.
El-Yaniv and Souroujon (2001) present an itera-
tive double clustering (IDC) algorithm, which per-
forms iterations of double clustering (Slonim and
Tishby, 2000). Furthermore, they extend IDC algo-
rithm for semi-supervised learning when given both
labeled and unlabeled data.
Our algorithm belongs to the family of semi-
supervised feature clustering techniques, which can
utilize both labeled and unlabeled data to perform
feature clustering.
Supervised feature clustering can not group un-
seen features (features that do not occur in labeled
data) into meaningful clusters since there are no
class labels associated with these unseen features.
Our algorithm can overcome this problem by induc-
ing class labels for unseen features based on the sim-
ilarity among seen features and unseen features, then
clustering all the features (including both seen fea-
tures and unseen features) based on the distribution
of class labels over them.
Compared with the semi-supervised version of
IDC algorithm, our algorithm is more efficient, since
we perform feature clustering without iterations.
The difference between our algorithm and unsu-
pervised feature clustering is that our algorithm de-
pends on both labeled and unlabeled data, but unsu-
pervised feature clustering requires only unlabeled
data.
O’Hara et. al. (2004) use semantic class-
based collocations to augment traditional word-
based collocations for supervised WSD. Three sep-
arate sources of word relatedness are used for
these collocations: 1) WordNet hypernym rela-
tions; 2) cluster-based word similarity classes; and
3) dictionary definition analysis. Their system
achieved 56.6% fine-grained score on ELS task of
SENSEVAL-3. In contrast with their work, our data-
driven method for feature clustering based WSD
does not require external knowledge resource. Fur-
thermore, our SemiFC+LPJS method can achieve
69.0% fine-grained score on the same dataset, which
shows the effectiveness of our method.
6 Conclusion
In this paper we have investigated feature clustering
techniques for WSD, which usually group features
into clusters based on the distribution of class labels
over features. We propose a semi-supervised fea-
ture clustering algorithm to satisfy the requirement
of semi-supervised classification algorithms for di-
mensionality reduction in feature space. Our ex-
perimental results on SENSEVAL-3 data show that
feature clustering can aggressively reduce the di-
mensionality of feature space while still maintaining
state of the art sense disambiguation accuracy. Fur-
thermore, when combined with a semi-supervised
WSD algorithm, semi-supervised feature cluster-
ing outperforms supervised feature clustering and
other dimensionality reduction techniques. Our ad-
ditional experiments on sampled SENSEVAL-3 data
indicate that our semi-supervised feature clustering
method is robust to the noise in small labeled data,
which achieves better performance than supervised
feature clustering.
In the future, we may extend our work by using
more datasets to empirically evaluate this feature
clustering algorithm. This semi-supervised feature
clustering framework is quite general, which can be
applied to other NLP tasks, ex. text categorization.
Acknowledgements We would like to thank
anonymous reviewers for their helpful comments.
Z.Y. Niu is supported by A*STAR Graduate Schol-
arship.
References
Baker L. & McCallum A.. 1998. Distributional Clus-
tering of Words for Text Classification. ACM SIGIR
913
1998.
Bekkerman, R., El-Yaniv, R., Tishby, N., & Winter, Y..
2003. Distributional Word Clusters vs. Words for
Text Categorization. Journal of Machine Learning Re-
search, Vol. 3: 1183-1208.
Dagan, I. & Itai A.. 1994. Word Sense Disambigua-
tion Using A Second Language Monolingual Corpus.
Computational Linguistics, Vol. 20(4), pp. 563-596.
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas,
G.W., & Harshman, R.A.. 1990. Indexing by Latent
Semantic Analysis. Journal of the American Society
of Information Science, Vol. 41(6), pp. 391-407.
Dhillon I.. 2001. Co-Clustering Documents and Words
Using Bipartite Spectral Graph Partitioning. ACM
SIGKDD 2001.
Dhillon I., Mallela S., & Kumar R.. 2002. Enhanced
Word Clustering for Hierarchical Text Classification.
ACM SIGKDD 2002.
Dhillon I., Mallela S., & Modha, D.. 2003. Information-
Theoretic Co-Clustering. ACM SIGKDD 2003.
El-Yaniv, R., & Souroujon, O.. 2001. Iterative Dou-
ble Clustering for Unsupervised and Semi-Supervised
Learning. NIPS 2001.
Forman, G.. 2003. An Extensive Empirical Study of Fea-
ture Selection Metrics for Text Classification. Journal
of Machine Learning Research 3(Mar):1289–1305.
Leacock, C., Miller, G.A. & Chodorow, M.. 1998. Us-
ing Corpus Statistics and WordNet Relations for Sense
Identification. Computational Linguistics, 24:1, 147–
165.
Lee, Y.K. & Ng, H.T.. 2002. An Empirical Evaluation
of Knowledge Sources and Learning Algorithms for
Word Sense Disambiguation. EMNLP 2002, (pp. 41-
48).
Lesk M.. 1986. Automated Word Sense Disambiguation
Using Machine Readable Dictionaries: How to Tell a
Pine Cone from an Ice Cream Cone. ACM SIGDOC
1986.
Li, H. & Li, C.. 2004. Word Translation Disambiguation
Using Bilingual Bootstrapping. Computational Lin-
guistics, 30(1), 1-22.
Lin, J. 1991. Divergence Measures Based on the Shan-
non Entropy. IEEE Transactions on Information The-
ory, 37:1, 145–150.
McCarthy, D., Koeling, R., Weeds, J., & Carroll, J..
2004. Finding Predominant Word Senses in Untagged
Text. ACL 2004.
Mihalcea R.. 2002. Instance Based Learning with Au-
tomatic Feature Selection Applied to Word Sense Dis-
ambiguation. COLING 2002.
Mihalcea R.. 2004. Co-Training and Self-Training for
Word Sense Disambiguation. CoNLL 2004.
Mihalcea R., Chklovski, T., & Kilgariff, A.. 2004. The
SENSEVAL-3 English Lexical Sample Task. SENSE-
VAL 2004.
Niu, Z.Y., Ji, D.H., & Tan, C.L.. 2004. Learning Word
Senses With Feature Selection and Order Identification
Capabilities. ACL 2004.
Niu, Z.Y., Ji, D.H., & Tan, C.L.. 2005. Word Sense
Disambiguation Using Label Propagation Based Semi-
Supervised Learning. ACL 2005.
O’Hara, T., Bruce, R., Donner, J., & Wiebe, J..
2004. Class-Based Collocations for Word-Sense Dis-
ambiguation. SENSEVAL 2004.
Park, S.B., Zhang, B.T., & Kim, Y.T.. 2000. Word
Sense Disambiguation by Learning from Unlabeled
Data. ACL 2000.
Pedersen. T., & Bruce, R.. 1997. Distinguishing Word
Senses in Untagged Text. EMNLP 1997.
Sch¨utze, H.. 1998. Automatic Word Sense Discrimina-
tion. Computational Linguistics, 24:1, 97–123.
Slonim, N. & Tishby, N.. 2000. Document Clustering
Using Word Clusters via the Information Bottleneck
Method. ACM SIGIR 2000.
Slonim, N. & Tishby, N.. 2001. The Power of Word
Clusters for Text Classification. The 23rd European
Colloquium on Information Retrieval Research.
Towel, G. & Voorheest, E.M.. 1998. Disambiguating
Highly Ambiguous Words. Computational Linguis-
tics, 24:1, 125–145.
Yarowsky, D.. 1995. Unsupervised Word Sense Disam-
biguation Rivaling Supervised Methods. ACL 1995,
pp. 189-196.
Zhou D., Bousquet, O., Lal, T.N., Weston, J., &
Sch¨olkopf, B.. 2003. Learning with Local and Global
Consistency. NIPS 16,pp. 321-328.
Zhu, X. & Ghahramani, Z.. 2002. Learning from La-
beled and Unlabeled Data with Label Propagation.
CMU CALD tech report CMU-CALD-02-107.
Zhu, X., Ghahramani, Z., & Lafferty, J.. 2003. Semi-
Supervised Learning Using Gaussian Fields and Har-
monic Functions. ICML 2003.
914
