Optimizing Feature Set for Chinese Word Sense Disambiguation
Zheng-Yu Niu, Dong-Hong Ji
Institute for Infocomm Research
21 Heng Mui Keng Terrace
119613 Singapore
fzniu, dhjig@i2r.a-star.edu.sg
Chew-Lim Tan
Department of Computer Science
National University of Singapore
3 Science Drive 2
117543 Singapore
tancl@comp.nus.edu.sg
Abstract
This article describes the implementation of I2R
word sense disambiguation system (I2R ¡WSD)
that participated in one senseval3 task: Chinese lex-
ical sample task. Our core algorithm is a supervised
Naive Bayes classifier. This classifier utilizes an op-
timal feature set, which is determined by maximiz-
ing the cross validated accuracy of NB classifier on
training data. The optimal feature set includes part-
of-speech with position information in local con-
text, and bag of words in topical context.
1 Introduction
Word sense disambiguation (WSD) is to assign ap-
propriate meaning to a given ambiguous word in
a text. Corpus based method is one of the suc-
cessful lines of research on WSD. Many supervised
learning algorithms have been applied for WSD,
ex. Bayesian learning (Leacock et al., 1998), ex-
emplar based learning (Ng and Lee, 1996), decision
list (Yarowsky, 2000), neural network (Towel and
Voorheest, 1998), maximum entropy method (Dang
et al., 2002), etc.. In this paper, we employ Naive
Bayes classifier to perform WSD.
Resolving the ambiguity of words usually relies
on the contexts of their occurrences. The feature
set used for context representation consists of lo-
cal and topical features. Local features include part
of speech tags of words within local context, mor-
phological information of target word, local collo-
cations, and syntactic relations between contextual
words and target word, etc.. Topical features are
bag of words occurred within topical context. Con-
textual features play an important role in providing
discrimination information for classifiers in WSD.
In other words, an informative feature set will help
classifiers to accurately disambiguate word senses,
but an uninformative feature set will deteriorate the
performance of classifiers. In this paper, we opti-
mize feature set by maximizing the cross validated
accuracy of Naive Bayes classifier on sense tagged
training data.
2 Naive Bayes Classifier
Let C = fc1; c2; :::; cLg represent class labels,
F = ff1; f2; :::; fMg be a set of features. The
value of fj, 1 • j • M, is 1 if fj is present in
the context of target word, otherwise 0. In classi-
fication process, the Naive Bayes classifier tries to
find the class that maximizes P(cijF), the proba-
bility of class ci given feature set F, 1 • i • L.
Assuming the independence between features, the
classification procedure can be formulated as:
ˆi = arg max
1•i•L
p(ci)QMj=1 p(fjjci)
QM
j=1 p(fj)
; (1)
where p(ci), p(fjjci) and p(fj) are estimated using
maximum likelihood method. To avoid the effects
of zero counts when estimating p(fjjci), the zero
counts of p(fjjci) are replaced with p(ci)=N, where
N is the number of training examples.
3 Feature Set
For Chinese WSD, there are two strategies to extract
contextual information. One is based on Chinese
characters, the other is to utilize Chinese words and
related morphological or syntactic information. In
our system, context representation is based on Chi-
nese words, since words are less ambiguous than
characters.
We use two types of features for Chinese WSD:
local features and topical features. All of these fea-
tures are acquired from data at senseval3 without
utilization of any other knowledge resource.
3.1 Local features
Two sets of local features are investigated, which
are represented by LocalA and LocalB. Let nl de-
note the local context window size.
LocalA contains only part of speech tags
with position information: POS¡nl; ...,
POS¡1; POS0; POS+1; ..., POS+nl, where
POS¡i (POS+i) is the part of speech (POS) of the
i-th words to the left (right) of target word w, and
POS0 is the POS of w.
                                             Association for Computational Linguistics
                        for the Semantic Analysis of Text, Barcelona, Spain, July 2004
                 SENSEVAL-3: Third International Workshop on the Evaluation of Systems
LocalB enriches the local context by including
the following features: local words with position in-
formation (W¡nl, ..., W¡1, W+1, ..., W+nl), bigram
templates ((W¡nl; W¡(nl¡1)), ..., (W¡1; W+1),
..., (W+(nl¡1); W+nl)), local words with POS tags
(W POS) (position information is not considered),
and part of speech tags with position information.
All of these POS tags, words, and bigrams are
gathered and each of them contributed as one fea-
ture. For a training or test example, the value of
some feature is 1 if it occurred in local context, oth-
erwise it is 0. In this paper, we investigate two val-
ues of nl for LocalA and LocalB, 1 and 2, which
results in four feature sets.
3.2 Topical features
We consider all Chinese words within a context
window size nt as topical features. For each training
or test example, senseval3 data provides one sen-
tence as the context of ambiguous word. In sense-
val3 Chinese training data, all contextual sentences
are segmented into words and tagged with part of
speech.
Words which contain non-Chinese character are
removed, and remaining words occurred within
context window size nt are gathered. Each remain-
ing word is considered as one feature. The value of
topical feature is 1 if it occurred within window size
nt, otherwise it is 0.
In later experiment, we set different values for nt,
ex. 1, 2, 3, 4, 5, 10, 20, 30, 40, 50. Our experimen-
tal result indicated that the accuracy of sense dis-
ambiguation is related to the value of nt. For differ-
ent ambiguous words, the value of nt which yields
best disambiguation accuracy is different. It is de-
sirable to determine an optimal value, ˆnt, for each
ambiguous word by maximizing the cross validated
accuracy.
4 Data Set
In Chinese lexical sample task, training data con-
sists of 793 sense-tagged examples for 20 ambigu-
ous Chinese words. Test data consists of 380 un-
tagged examples for the same 20 target words. Ta-
ble 1 shows the details of training data and test data.
5 Criterion for Evaluation of Feature Sets
In this paper, five fold cross validation method was
employed to estimate the accuracy of our classi-
fier, which was the criterion for evaluation of fea-
ture sets. All of the sense tagged examples of some
target word in senseval3 training data were shuf-
fled and divided into five equal folds. We used four
folds as training set and the remaining fold as test
set. This procedure was repeated five times under
different division between training set and test set.
The average accuracy over five runs is defined as the
accuracy of our classifier.
6 Evaluation of Feature Sets
Four feature sets were investigated:
FEATUREA1: LocalA with nl = 1, and topical
feature within optimal context window size ˆnt;
FEATUREA2: LocalA with nl = 2, and topical
feature within optimal context window size ˆnt;
FEATUREB1: LocalB with nl = 1, and topical
feature within optimal context window size ˆnt;
FEATUREB2: LocalB with nl = 2, and topical
feature within optimal context window size ˆnt.
We performed training and test procedure using
exactly same training and test set for each feature
set. For each word, the optimal value of topical con-
text window size ˆnt was determined by selecting a
minimal value of nt which maximized the cross val-
idated accuracy.
Table 2 summarizes the results of Naive Bayes
classifier using four feature sets evaluated on sen-
seval3 Chinese training data. Figure 1 shows the
accuracy of Naive Bayes classifier as a function of
topical context window size on four nouns and three
verbs. Several results should be noted specifically:
If overall accuracy over 20 Chinese charac-
ters is used as evaluation criterion for feature
set, the four feature sets can be sorted as fol-
lows: FEATUREA1 > FEATUREA2 …
FEATUREB1 > FEATUREB2. This indi-
cated that simply increasing local window size or
enriching feature set by incorporating bigram tem-
plates, local word with position information, and lo-
cal words with POS tags did not improve the perfor-
mance of sense disambiguation.
In table 2, it showed that with FEATUREA1, the
optimal topical context window size was less than
10 words for 13 out of 20 target words. Figure
1 showed that for most of nouns and verbs, Naive
Bayes classifier achieved best disambiguation accu-
racy with small topical context window size (<10
words). This gives the evidence that for most of
Chinese words, including nouns and verbs, the near
distance context is more important than the long dis-
tance context for sense disambiguation.
7 Experimental Result
The empirical study in section 6 showed that FEA-
TUREA1 performed best among all the feature sets.
A Naive Bayes classifier with FEATUREA1 as fea-
ture set was learned from all the senseval3 Chinese
training data for each target word. Then we used
Table 1: Details of training data and test data in Chinese lexical sample task.
POS occurred # senses occurred
Ambiguous word in training data # training examples in training data # test examples
ba3wo4 n v vn 31 4 15
bao1 n nr q v 76 8 36
cai2liao4 n 20 2 10
chong1ji1 v vn 28 3 13
chuan1 v 28 3 14
di4fang1 b n 36 4 17
fen1zi3 n 36 2 16
huo2dong4 a v vn 36 5 16
lao3 Ng a an d j 57 6 26
lu4 n nr q 57 6 28
mei2you3 d v 30 3 15
qi3lai2 v 40 4 20
qian2 n nr 40 4 20
ri4zi5 n 48 3 21
shao3 Ng a ad j v 42 5 20
tu1chu1 a ad v 30 3 15
yan2jiu1 n v vn 30 3 15
yun4dong4 n nz v vn 54 3 27
zou3 v vn 49 5 24
zuo4 v 25 3 12
this classifier to determine the senses of occurrences
of target words in test data. The official result of
I2R¡WSD system in Chinese lexical sample task
is listed below:
Precision: 60.40% (229.00 correct of 379.00 at-
tempted).
Recall: 60.40% (229.00 correct of 379.00 in to-
tal).
Attempted: 100.00% (379.00 attempted of
379.00 in total).
8 Conclusion
In this paper, we described the implementation of
I2R ¡ WSD system that participated in one sen-
seval3 task: Chinese lexical sample task. An op-
timal feature set was selected by maximizing the
cross validated accuracy of supervised Naive Bayes
classifier on sense-tagged data. The senses of occur-
rences of target words in test data were determined
using Naive Bayes classifier with optimal feature
set learned from training data. Our system achieved
60.40% precision and recall in Chinese lexical sam-
ple task.

References
Dang, H. T., Chia, C. Y., Palmer M., & Chiou, F.D.
(2002) Simple Features for Chinese Word Sense
Disambiguation. In Proc. of COLING.
Leacock, C., Chodorow, M., & Miller G. A. (1998)
Using Corpus Statistics and WordNet Relations
for Sense Identification. Computational Linguis-
tics, 24:1, 147–165.
Mooney, R. J. (1996) Comparative Experiments on
Disambiguating Word Senses: An Illustration of
the Role of Bias in Machine Learning. In Proc.
of EMNLP, pp. 82-91, Philadelphia, PA.
Ng, H. T., & Lee H. B. (1996) Integrating Multi-
ple Knowledge Sources to Disambiguate Word
Sense: An Exemplar-Based Approach. In Proc.
of ACL, pp. 40-47.
Pedersen, T. (2001) A Decision Tree of Bigrams is
an Accurate Predictor of Word Sense. In Proc. of
NAACL.
Towel, G., & Voorheest, E. M. (1998) Disambiguat-
ing Highly Ambiguous Words. Computational
Linguistics, 24:1, 125–146.
Yarowsky, D. (2000) Hierarchical Decision Lists
for Word Sense Disambiguation. Computers and
the Humanities, 34(1-2), 179–186.
