Word Sense Disambiguation Using Automatically Translated
Sense Examples
Xinglong Wang
School of Informatics
University of Edinburgh
2 Buccleuch Place, Edinburgh
EH8 9LW, UK
xwang@inf.ed.ac.uk
David Martinez
Department of Computer Science
University of Sheffield
Sheffield, S1 4DP, UK
davidm@dcs.shef.ac.uk
Abstract
We present an unsupervised approach to
Word Sense Disambiguation (WSD). We
automatically acquire English sense exam-
ples using an English-Chinese bilingual
dictionary, Chinese monolingual corpora
and Chinese-English machine translation
software. We then train machine learn-
ing classifiers on these sense examples
and test them on two gold standard En-
glish WSD datasets, one for binary and
the other for fine-grained sense identifica-
tion. On binary disambiguation, perfor-
mance of our unsupervised system has ap-
proached that of the state-of-the-art super-
vised ones. On multi-way disambiguation,
it has achieved a very good result that is
competitive to other state-of-the-art unsu-
pervised systems. Given the fact that our
approach does not rely on manually anno-
tated resources, such as sense-tagged data
or parallel corpora, the results are very
promising.
1 Introduction
Results from recent Senseval workshops have
shown that supervised Word Sense Disambigua-
tion (WSD) systems tend to outperform their unsu-
pervised counterparts. However, supervised sys-
tems rely on large amounts of accurately sense-
annotated data to yield good results and such re-
sources are very costly to produce. It is difficult
for supervised WSD systems to perform well and
reliably on words that do not have enough sense-
tagged training data. This is the so-called knowl-
edge acquisition bottleneck.
To overcome this bottleneck, unsupervised
WSD approaches have been proposed. Among
them, systems under the multilingual paradigm
have shown great promise (Gale et al., 1992; Da-
gan and Itai, 1994; Diab and Resnik, 2002; Ng et
al., 2003; Li and Li, 2004; Chan and Ng, 2005;
Wang and Carroll, 2005). The underlying hy-
pothesis is that mappings between word forms
and meanings can be different from language to
language. Much work have been done on ex-
tracting sense examples from parallel corpora for
WSD. For example, Ng et al. (2003) proposed
to train a classifier on sense examples acquired
from word-aligned English-Chinese parallel cor-
pora. They grouped senses that share the same
Chinese translation, and then the occurrences of
the word on the English side of the parallel corpora
were considered to have been disambiguated and
“sense tagged” by the appropriate Chinese trans-
lations. Their system was evaluated on the nouns
in Senseval-2 English lexical sample dataset, with
promising results. Their follow-up work (Chan
and Ng, 2005) has successfully scaled up the ap-
proach and achieved very good performance on
the Senseval-2 English all-word task.
Despite the promising results, there are prob-
lems with relying on parallel corpora. For exam-
ple, there is a lack of matching occurrences for
some Chinese translations to English senses. Thus
gathering training examples for them might be dif-
ficult, as reported in (Chan and Ng, 2005). Also,
parallel corpora themselves are rare resources and
not available for many language pairs.
Some researchers seek approaches using mono-
lingual resources in a second language and then
try to map the two languages using bilingual dic-
tionaries. For example, Dagan and Itai (1994) car-
ried out WSD experiments using monolingual cor-
pora, a bilingual lexicon and a parser for the source
language. One problem of this method is that
45
for many languages, accurate parsers do not exist.
Wang and Carroll (2005) proposed to use mono-
lingual corpora and bilingual dictionaries to auto-
matically acquire sense examples. Their system
was unsupervised and achieved very promising
results on the Senseval-2 lexical sample dataset.
Their system also has better portability, i.e., it runs
on any language pair as long as a bilingual dictio-
nary is available. However, sense examples ac-
quired using the dictionary-based word-by-word
translation can only provide “bag-of-words” fea-
tures. Many other features useful for machine
learning (ML) algorithms, such as the ordering of
words, part-of-speech (POS), bigrams, etc., have
been lost. It could be more interesting to translate
Chinese text snippets using machine translation
(MT) software, which would provide richer con-
textual information that might be useful for WSD
learners. Although MT systems themselves are
expensive to build, once they are available, they
can be used repeatedly to automatically generate
as much data as we want. This is an advantage
over relying on other expensive resources such as
manually sense-tagged data and parallel copora,
which are limited in size and producing additional
data normally involves further costly investments.
We carried out experiments on acquiring sense
examples using both MT software and a bilingual
dictionary. When we had the two sets of sense ex-
amples ready, we trained a ML classifier on them
and then tested them on coarse-grained and fine-
grained gold standard WSD datasets, respectively.
We found that on both test datasets the classi-
fier using MT translated sense examples outper-
formed the one using those translated by a dictio-
nary, given the same amount of training examples
used on each word sense. This confirms our as-
sumption that a richer feature set, although from
a noisy data source, such as machine translated
text, might help ML algorithms. In addition, both
systems performed very well comparing to other
state-of-the-art WSD systems. As we expected,
our system is particularly good on coarse-grained
disambiguation. Being an unsupervised approach,
it achieved a performance competitive to state-of-
the-art supervised systems.
This paper is organised as follows: Section 2
revisits the process of acquiring sense examples
proposed in (Wang and Carroll, 2005) and then
describes our adapted approach. Section 3 out-
lines resources, the ML algorithm and evaluation
metrics that we used. Section 4 and Section 5 de-
tail experiments we carried out on gold standard
datasets. We also report our results and error anal-
ysis. Finally, Section 6 concludes the paper and
draws future directions.
2 Acquisition of Sense Examples
Wang and Carroll (2005) proposed an automatic
approach to acquire sense examples from large
amount of Chinese text and English-Chinese and
Chinese-English dictionaries. The acquisition pro-
cess is summarised as follows:
1. Translate an English ambiguous word
a0
to Chinese,
using an English-Chinese lexicon. Given the assump-
tion that mappings between words and senses are dif-
ferent between English and Chinese, each sense a1 a2 of
a0
maps to a distinct Chinese word. At the end of this
step, we have produced a set
a4
, which consists of Chi-
nese words a5 a6 a8 a10 a6 a12 a10 a13 a13 a13 a10 a6 a18 a20 ,wherea6 a2 is the translation
corresponding to sense a1 a2 of
a0
,anda22 is the number of
senses that
a0
has.
2. Query large Chinese corpora or/and a search engine us-
ing each element in
a4
. For each a6 a2 in
a4
, we collect the
text snippets retrieved and construct a Chinese corpus.
3. Word-segment these Chinese text snippets.
4. Use an electronic Chinese-English lexicon to translate
the Chinese corpora constructed word by word to En-
glish.
This process can be completely automatic and
unsupervised. However, in order to compare
the performance against other WSD systems, one
needs to map senses in the bilingual dictionary to
those used by gold standard datasets, which are
often from WordNet (Fellbaum, 1998). This step
is inevitable unless we use senses in the bilingual
dictionary as gold standard. Fortunately, the map-
ping process only takes a very short time
1
,com-
paring to the effort that it would take to manually
sense annotate training examples. At the end of
the acquisition process, for each sense a23 a24 of an am-
biguous word a25 , we have a large set of English
contexts. Note that a context is represented by a
bag of words only. We mimicked this process and
built a set of sense examples.
To obtain a richer set of features, we adapted the
above process and carried out another acquisition
experiment. When translating Chinese text snip-
pets to English in the 4th step, we used MT soft-
ware instead of a bilingual dictionary. The intu-
ition is that although machine translated text con-
tains noise, features like word ordering, POS tags
1
A similar process took 15 minutes per noun as reported
in (Chan and Ng, 2005), and about an hour for 20 nouns as
reported in (Wang and Carroll, 2005).
46
English ambiguous word w
Sense 1 of w Sense 2 of w
Chinese translation of
sense 2
Chinese translation of
sense 1
English-Chinese
Lexicon
Chinese text snippet 1
Chinese text snippet 2
... ...
Search
Chinese
Corpora
Machine
Translation
Software
Chinese text snippet 1
Chinese text snippet 2
... ...
{English sense example 1
for sense 1 of w}
{English sense example 2
for sense 1 of w}
... ...
{English sense example 1
for sense 2 of w}
{English sense example 2
for sense 2 of w}
... ...
Figure 1:Adapted process of automatic acquisition of sense
examples. For simplicity, assume
a0
has two senses.
and bigrams/trigrams may still be of some use for
ML classifiers. In this approach, the 3rd step can
be omitted, since MT software should be able to
take care of segmentation. Figure 1 illustrates our
adapted acquisition process.
As described above, we prepared two sets of
training examples for each English word sense
to disambiguate: one set was translated word-by-
word by looking up a bilingual dictionary, as pro-
posed in (Wang and Carroll, 2005), and the other
translated using MT software. In detail, we first
mapped senses of ambiguous words, as defined
in the gold-standard TWA (Mihalcea, 2003) and
Senseval-3 lexical sample (Mihalcea et al., 2004)
datasets (which we use for evaluation) onto their
corresponding Chinese translations. We did this
by looking up an English-Chinese dictionary Pow-
erWord 2002
2
. This mapping process involved
human intervention, but it only took an annota-
tor (fluent speaker in both Chinese and English)
4 hours. Since some Chinese translations are
also ambiguous, which may affect WSD perfor-
mance, the annotator was asked to select the Chi-
nese words that are relatively unambiguous (or
ideally monosemous) in Chinese for the target
word senses, when it was possible. Sometimes
multiple senses of an English word can map to
the same Chinese word, according to the English-
Chinese dictionary. In such cases, the annotator
was advised to try to capture the subtle difference
between these English word senses and then to
2
PowerWord is a commercial electronic dictio-
nary application. There is a free online version at:
http://cb.kingsoft.com.
select different Chinese translations for them, us-
ing his knowledge on the languages. Then, using
the translations as queries, we retrieved as many
text snippets as possible from the Chinese Giga-
word Corpus. For efficiency purposes, we ran-
domly chose maximumly
a0 a1 a1
text snippets for
each sense, when acquiring data for nouns and
adjectives from Senseval-3 lexical sample dataset.
The length of the snippets was set to a4
a1 a1
Chinese
characters.
From here we prepared two sets of sense exam-
ples differently. For the approach of dictionary-
based translation, we segmented all text snippets,
using the application ICTCLAS
3
. After the seg-
mentor marked all word boundaries, the system
automatically translated the text snippets word by
word using the electronic LDC Mandarin-English
Translation Lexicon 3.0. All possible translations
of each word were included. As expected, the lex-
icon does not cover all Chinese words. We simply
discarded those Chinese words that do not have an
entry in this lexicon. We also discarded those Chi-
nese words with multiword English translations.
Finally we got a set of sense examples for each
sense. Note that a sense example produced here is
simply a bag of words without ordering.
We prepared the other set of sense examples by
translating text snippets with the MT software Sys-
tran a5 a6
a1
Standard, where each example contains
much richer features that potentially can be ex-
ploited by ML algorithms.
3 Experimental Settings
3.1 Training
We applied the Vector Space Model (VSM) algo-
rithm on the two different kinds of sense examples
(i.e., dictionary translated ones vs. MT software
translated ones), as it has been shown to perform
well with the features described below (Agirre and
Martinez, 2004a). In VSM, we represent each
context as a vector, where each feature has an 1
or 0 value to indicate its occurrence or absence.
For each sense in training, a centroid vector is ob-
tained, and these centroids are compared to the
vectors that represent test examples, by means of
the cosine similarity function. The closest centroid
assigns its sense to the test example.
For the sense examples translated by MT soft-
ware, we analysed the sentences using different
3
See: http://mtgroup.ict.ac.cn/a8 zhp/ICTCLAS
47
tools and extracted relevant features. We ap-
plied stemming and POS tagging, using the fnTBL
toolkit (Ngai and Florian, 2001), as well as shal-
low parsing
4
. Then we extracted the following
types of topical and domain features
5
,whichwere
then fed to the VSM machine learner:
a0
Topical features: we extracted lemmas of the
content words in two windows around the tar-
get word: the whole context and a
a1
4word
window. We also obtained salient bigrams in
the context, with the methods and the soft-
ware described in (Pedersen, 2001). We in-
cluded another feature type, which match the
closest words (for each POS and in both
directions) to the target word (e.g. LEFT
NOUN “dog” or LEFT VERB “eat”).
a0
Domain features: The “WordNet Domains”
resource was used to identify the most rel-
evant domains in the context. Following
the relevance formula presented in (Magnini
and Cavagli´a, 2000), we defined two feature
types: (1) the most relevant domain, and (2)
a list of domains above a threshold
6
.
For the dictionary-translated sense examples,
we simply used bags of words as features.
3.2 Evaluation
We evaluated our WSD classifier on both
coarse-grained and fine-grained datasets. For
coarse-grained WSD evaluation, we used TWA
dataset (Mihalcea, 2003), which is a binarily
sense-tagged corpus drawn from the British Na-
tional Corpus (BNC), for 6 nouns. For fine-
grained evaluation, we used Senseval-3 English
lexical sample dataset (Mihalcea et al., 2004),
which comprises 7,860 sense-tagged instances for
training and 3,944 for testing, on 57 words (nouns,
verbs and adjectives). The examples were mainly
drawn from BNC. WordNet a2 a6 a3 a6 a2
7
was used as
sense inventory for nouns and adjectives, and
Wordsmyth
8
for verbs. We only evaluated our
WSD systems on nouns and adjectives.
4
This software was kindly provided by David Yarowsky’s
group at Johns Hopkins University.
5
Preliminary experiments using local features (bigrams
and trigrams) showed low performance, which was expected
because of noise in the automatically acquired data.
6
This software was kindly provided by Gerard Escudero’s
group at Universitat Politecnica de Catalunya. The threshold
was set in previous work.
7
http://wordnet.princeton.edu
8
http://www.wordsmyth.net
We also used the SemCor corpus (Miller et al.,
1993) for tuning our relative-threshold heuristic. It
contains a number of texts, mainly from the Brown
Corpus, comprising about 200,000 words, where
all content words have been manually tagged with
senses from WordNet.
Throughout the paper we will use the concepts
of precision and recall to measure the performance
of WSD systems, where precision refers to the ra-
tio of correct answers to the total number of an-
swers given by the system, and recall indicates the
ratio of correct answers to the total number of in-
stances. Our ML systems attempt every instance
and always give a unique answer, and hence preci-
sion equals to recall. When comparing with other
systems that participated in Senseval-3 in Table 7,
both recall and precision are shown. When POS
and overall averages are given, they are calculated
by micro-averaging the number of examples per
word.
4 Experiments on TWA dataset
First we trained a VSM classifier on the sense
examples translated with the Systran MT soft-
ware (we use notion “MT-based approach” to re-
fer to this process), and then tested it on the TWA
test dataset. We tried two combinations of fea-
tures: one only used topical features and the other
used the whole feature set (i.e., topical and do-
main features). Table 1 summarises the sizes of
the training/test data, the Most Frequent Sense
(MFS) baseline and performances when apply-
ing the two different feature combinations. We
can see that best results were obtained when us-
ing all the features. It also shows that both our
systems achieved a significant improvement over
the MFS baseline. Therefore, in the subsequent
WSD experiments following the MT-based ap-
proach, we decided to use the entire feature set.
To compare the machine-translated sense exam-
ples with the ones translated word-by-word, we
then trained the same VSM classifier on the ex-
amples translated with a bilingual dictionary (we
use notion “dictionary-based approach” to refer
to this process) and evaluated it on the same test
dataset. Table 2 shows results of the dictionary-
based approach and the MT-based approach. For
comparison, we include results from another sys-
tem (Mihalcea, 2003), which uses monosemous
relatives to automatically acquire sense examples.
The right-most column shows results of a 10-fold
48
Word Train ex. Test ex. MFS Topical All
bass 3,201 107 90.7 92.5 93.5
crane 3,656 95 74.7 84.2 83.2
motion 2,821 201 70.1 78.6 84.6
palm 1,220 201 71.1 82.6 85.1
plant 4,183 188 54.4 76.6 76.6
tank 3,898 201 62.7 79.1 77.1
Overall 18,979 993 70.6 81.1 82.5
Table 1:Recall(%) of the VSM classifier trained on the MT-
translated sense examples, with different sets of features. The
MFS baseline(%) and the number of training and test exam-
ples are also shown.
(Mihalcea, Dictionary- MT- Hand-
Word
2003) based based tagged
bass 92.5 91.6 93.5 90.7
crane 71.6 74.5 83.2 81.1
motion 75.6 72.6 84.6 93.0
palm 80.6 81.1 85.1 87.6
plant 69.1 51.6 76.6 87.2
tank 63.7 66.7 77.1 84.1
Overall 76.6 71.3 82.5 87.6
Table 2:Recall(%) on TWA dataset for 3 unsupervised sys-
tems and a supervised cross-validation on test data.
cross-validation on the TWA data, which indicates
the score that a supervised system would attain,
taking additional advantage that the examples for
training and test are drawn from the same corpus.
We can see that our MT-based approach has
achieved significantly better recall than the other
two automatic methods. Besides, the results of
our unsupervised system are approaching the per-
formance achieved with hand-tagged data. It is
worth mentioning that Mihalcea (2003) applied a
similar supervised cross-validation method on this
dataset that scored 83.35%, very close to our unsu-
pervised system
9
. Thus, we can conclude that the
MT-based system is able to reach the best perfor-
mance reported on this dataset for an unsupervised
system.
5 Experiments on Senseval-3
In this section we describe the experiments carried
out on the Senseval-3 lexical sample dataset. First,
we introduce a heuristic method to deal with the
problem of fine-grainedness of WordNet senses.
The remaining two subsections will be devoted
to the experiments of the baseline system and the
contribution of the heuristic to the final system.
9
The main difference to our hand-tagged evaluation, apart
from the ML algorithm, is that we did not remove the bias
from the “one sense per discourse” factor, as she did.
Remove Remove Sn.-Tk.
Threshold
Senses Tokens ratio
4 7,669 (40.6) 11,154 (15.9) 2.55
5 9,759 (51.6) 15,516 (22.1) 2.34
6 11,341 (60.0) 18,827 (26.8) 2.24
7 12,569 (66.5) 21,775 (31.0) 2.14
8 13,553 (71.7) 24,224 (34.5) 2.08
9 14,376 (76.0) 27,332 (38.9) 1.95
10 14,914 (78.9) 29,418 (41.9) 1.88
Table 3:Sense filtering by relative-threshold on SemCor. For
each threshold the number of removed senses/tokens and am-
biguity are shown.
5.1 Unsupervised methods on fine-grained
senses
When applying unsupervised WSD algorithms to
fine-grained word senses, senses that rarely occur
in texts often cause problems, as these cases are
difficult to detect without relying on hand-tagged
data. This is why many WSD systems use sense-
tagged corpora such as SemCor to discard or pe-
nalise low-frequency senses.
For our work, we did not want to rely on hand-
tagged corpora, and we devised a method to detect
low-frequency senses and to remove them before
using our translation-based approach. The method
is based on the hypothesis that word senses that
have few close relatives (synonyms, hypernyms,
and hyponyms) tend to have low frequency in cor-
pora. We collected all the close relatives to the
target senses, according to WordNet, and then re-
moved all the senses that did not have a number of
relatives above a given threshold. We used this
method on nouns, as the WordNet hierarchy is
more developed for them.
First, we observed the effect of sense removal
in the SemCor corpus. For all the polysemous
nouns, we applied different thresholds (4-10 rel-
atives) and measured the percentage of senses and
SemCor tokens that were removed. Our goal was
to remove as many senses as we could, while keep-
ing as many tokens as possible. Table 3 shows
the results of the process on all a5
a0
a4
a1 a2
polysemous
nouns in SemCor for a total of 18,912 senses and
70,238 tokens. The average number of senses per
token initially is
a1
a6 a4 a3 .
For the lowest threshold (4) we can see that
we are able to remove a large number of senses
from consideration (40%), keeping 85% of the to-
kens in SemCor. Higher thresholds can remove
more senses, but it forces us to discard more valid
tokens. In Table 3, the best ratios are given by
lower thresholds, suggesting that conservative ap-
49
proaches would be better. However, we have to
take into account that unsupervised state-of-the-
art WSD methods on fine-grained senses perform
below 50% recall on this dataset
10
, and therefore
an approach that is more aggressive may be worth
trying.
We applied this heuristic method in our exper-
iments and decided to measure the effect of the
threshold parameter by relying on SemCor and the
Senseval-3 training data. Thus, we tested the MT-
based system for different threshold values, re-
moving the senses for consideration when the rel-
ative number was below the threshold. The results
of the experiments using this technique will be de-
scribed in Section 5.3.
5.2 Baseline system
We performed experiments on Senseval-3 test
data with both MT-based and dictionary-based ap-
proaches. We show the results for nouns and ad-
jectives in Table 4, together with the MFS base-
line (obtained from the Senseval-3 lexical sam-
ple training data). We can see that the results are
similar for nouns, while for adjectives the MT-
based system achieves significantly better recall.
Overall, the performance was much lower than our
previous 2-way disambiguation. The system also
ranks below the MFS baseline.
One of the main reasons for the low perfor-
mance was that senses with few examples in the
test data are over-represented in training. This is
because we trained the classifiers on equal num-
ber of maximumly 200 sense examples for every
sense, no matter how rarely a sense actually oc-
curs in real text. As we explained in the previ-
ous section, this problem could be alleviated for
nouns by using the relative-based heuristics. We
only implemented the MT-based approach for the
rest of the experiments, as it performed better than
the dictionary-based one.
5.3 Relative threshold
In this section we explored the contribution of
the relative-based threshold to the system. We
tested the system only on nouns. In order to
tune the threshold parameter, we first applied the
method on SemCor and the Senseval-3 training
data. We used hand-tagged corpora from two
different sources to see whether the method was
10
Best score in Senseval-3 for nouns without SemCor
or hand-tagged data: 47.5% recall (figure obtained from
http://www.senseval.org).
Test Dictionary- MT-
Word
Ex.
MFS
based based
Nouns 1807 54.23 40.07 40.73
Adjs 159 49.69 15.74 23.29
Overall 1966 53.86 38.10 39.32
Table 4:Averaged recall(%) for the dictionary-based and MT-
based methods in Senseval-3 lexical-sample data. The MFS
baseline(%) and the number of testing examples are also
shown.
Avg. test
Threshold ambiguity Senseval-3 SemCor
0 5.80 40.68 30.11
4 3.60 40.15 32.99
5 3.32 39.43 32.82
6 2.76 40.53 34.18
7 2.52 43.89 35.94
8 2.36 46.90 39.15
9 2.08 45.37 38.98
10 1.88 48.62 46.16
11 1.80 48.59 47.68
12 1.68 48.34 43.63
13 1.40 47.23 45.31
14 1.28 44.32 42.05
Table 5:Average ambiguity and recall(%) for the relative-
based threshold on Senseval-3 training data and SemCor (for
nouns only). Best results shown in bold.
generic enough to be applied on unseen test data.
Note also that we used this experiment to define a
general threshold for the heuristic, instead of opti-
mising it for different words. Once the threshold
is fixed, it will be used for all target words.
The results of the MT-based system applying
threshold values from 4 to 14 are given in Table 5.
We can see clearly that the algorithm benefits from
the heuristic, specially when ambiguity is reduced
to around 2 senses in average. Also observe that
the contribution of the threshold is quite similar
for SemCor and Senseval-3 training data. From
this table, we chose 11 as threshold value for the
test data, as it obtained the best performance on
SemCor.
Thus, we performed a single run of the algo-
rithm on the test data applying the chosen thresh-
old. The performance for all nouns is given in
Table 6. We can see that the recall has increased
significantly, and is now closer to the MFS base-
line, which is a very hard baseline for unsuper-
vised systems (McCarthy et al., 2004). Still, the
performance is significantly lower than the score
achieved by supervised systems, which can reach
above 72% recall (Mihalcea et al., 2004). Some of
the reasons for the gap are the following:
a0
The acquisition process: problems can arise
50
Word Test Ex. MFS Our System
argument 111 51.40 45.90
arm 133 82.00 85.70
atmosphere 81 66.70 35.80
audience 100 67.00 67.00
bank 132 67.40 67.40
degree 128 60.90 60.90
difference 114 40.40 40.40
difficulty 23 17.40 39.10
disc 100 38.00 27.00
image 74 36.50 17.60
interest 93 41.90 11.80
judgment 32 28.10 40.60
organization 56 73.20 19.60
paper 117 25.60 37.60
party 116 62.10 52.60
performance 87 26.40 26.40
plan 84 82.10 82.10
shelter 98 44.90 39.80
sort 96 65.60 65.60
source 32 65.60 65.60
Overall 1807 54.23 48.58
Table 6:Final results(%) for all nouns in Senseval-3 test data.
Together with the number of test examples and MFS base-
line(%).
from ambiguous Chinese words, and the ac-
quired examples can contain noise generated
by the MT software.
a0
Distribution of fine-grained senses: As we
have seen, it is difficult to detect rare senses
for unsupervised methods, while supervised
systems can simply rely on frequency of
senses.
a0
Lack of local context: Our system does
not benefit from local bigrams and trigrams,
which for supervised systems are one of the
best sources of knowledge.
5.4 Comparison with Senseval-3
unsupervised systems
Finally, we compared the performance of our
system with other unsupervised systems in the
Senseval-3 lexical-sample competition. We eval-
uated these systems for nouns, using the out-
puts provided by the organisation
11
, and focusing
on the systems that are considered unsupervised.
However, we noticed that most of these systems
used the information of SemCor frequency, or
even Senseval-3 examples in their models. Thus,
we classified the systems depending on whether
they used SemCor frequencies (Sc), Senseval-3
examples (S-3), or did not (Unsup.). This is an
11
http://www.senseval.org
System Type Prec. Recall
wsdiit S-3 67.96 67.96
Cymfony S-3 57.94 57.94
Prob0 S-3 55.01 54.13
clr04 Sc 48.86 48.75
upv-unige-CIAOSENSO Sc 53.95 48.70
MT-based Unsup. 48.58 48.58
duluth-senserelate Unsup. 47.48 47.48
DFA-Unsup-LS Sc 46.71 46.71
KUNLP.eng.ls Sc 45.10 45.10
DLSI-UA-ls-eng-nosu. Unsup. 20.01 16.05
Table 7:Comparison of unsupervised S3 systems for nouns
(sorted by recall(%)). Our system given in bold.
important distinction, as simply knowing the most
frequent sense in hand-tagged data is a big advan-
tage for unsupervised systems (applying the MFS
heuristic for nouns in Senseval-3 would achieve
54.2% precision, and 53.0% recall when using
SemCor). At this point, we would like to remark
that, unlike other systems using Semcor, we have
applied it to the minimum extent. Its only contri-
bution has been to indirectly set the threshold for
our general heuristic based on WordNet relatives.
We are exploring better ways to integrate the rela-
tive information in the model.
The results of the Senseval-3 systems are given
in Table 7. There are only 2 systems that do not re-
quire any hand-tagged data, and our method is able
to improve both when using the relative-threshold.
The best systems in Senseval-3 benefited from the
training examples from the training data, particu-
larly the top-scoring system, which is clearly su-
pervised. The 2nd ranked system requires 10%
of the training examples in Senseval-3 to map the
clusters that it discovers automatically, and the 3rd
simply applies the MFS heuristic.
The remaining systems introduce bias of the
SemCor distribution in their models, which clearly
helped their performance for each word. Our sys-
tem is able to obtain a similar performance to the
best of those systems without relying on hand-
tagged data. We also evaluated the systems on
the coarse-grained sense groups provided by the
Senseval-3 organisers. The results in Table 8 show
that our system is comparatively better on this
coarse-grained disambiguation task.
6 Conclusions and Future Work
We automatically acquired English sense exam-
ples for WSD using large Chinese corpora and MT
software. We compared our sense examples with
those reported in previous work (Wang and Car-
51
System Type Prec. Recall
wsdiit S-3 75.3 75.3
Cymfony S-3 66.6 66.6
Prob0 S-3 61.9 61.9
MT-based Unsup. 57.9 57.9
clr04 Sc. 57.6 57.6
duluth-senserelate Unsup. 56.1 56.1
KUNLP-eng-ls Sc. 55.6 55.6
upv-unige-CIAOSENSO- Sc. 61.3 55.3
DFA-Unsup-LS Sc. 54.5 54.5
DLSI-UA-ls-eng-nosu. Unsup. 27.6 27.6
Table 8:Coarse-grained evaluation of unsupervised S3 sys-
tems for nouns (sorted by recall(%)). Our system given in
bold.
roll, 2005), by training a ML classifier on them
and then testing the classifiers on both coarse-
grained and fine-grained English gold standard
datasets. On both datasets, our MT-based sense
examples outperformed dictionary-based ones. In
addition, evaluations show our unsupervised WSD
system is competitive to the state-of-the-art super-
vised systems on binary disambiguation, and un-
supervised systems on fine-grained disambigua-
tion.
In the future, we would like to combine our ap-
proach with other systems based on automatic ac-
quisition of sense examples that can provide lo-
cal context (Agirre and Martinez, 2004b). The
goal would be to construct a collection of exam-
ples automatically obtained from different sources
and to apply ML algorithms on them. Each exam-
ple would have a different weight depending on
the acquisition method used.
Regarding the influence of sense distribution
in the training data, we will explore the poten-
tial of using a weighting scheme on the “relative
threshold” algorithm. Also, we would like to anal-
yse if automatically obtained information on sense
distribution (McCarthy et al., 2004) can improve
WSD performance. We may also try other MT
systems and possibly see if our WSD can in turn
help MT, which can be viewed as a bootstrapping
learning process. Another interesting direction is
automatically selecting the most informative sense
examples as training data for ML classifiers.

References
E. Agirre and D. Martinez. 2004a. The Basque Country Uni-
versity system: English and Basque tasks. In Proceedings
of the 3rd ACL workshop on the Evaluation of Systems
for the Semantic Analysis of Text (SENSEVAL), Barcelona,
Spain.
E. Agirre and D. Martinez. 2004b. Unsupervised wsd
based on automatically retrieved examples: The impor-
tance of bias. In Proceedings of the Conference on Empir-
ical Methods in Natural Language Processing (EMNLP),
Barcelona, Spain.
Y. S. Chan and H. T. Ng. 2005. Scaling up word sense disam-
biguation via parallel texts. In Proceedings of the 20th Na-
tional Conference on Artificial Intelligence (AAAI 2005),
Pittsburgh, Pennsylvania, USA.
I. Dagan and A. Itai. 1994. Word sense disambiguation using
a second language monolingual corpus. Computational
Linguistics, 20(4):563–596.
M. Diab and P. Resnik. 2002. An unsupervised method for
word sense tagging using parallel corpora. In Proceedings
of the Anniversary Meeting of the Association for
Computational Linguistics (ACL-02). Philadelphia, USA.
C. Fellbaum. 1998. WordNet: An Electronic Lexical
Database. MIT Press.
W. A. Gale, K. W. Church, and D. Yarowsky. 1992. Us-
ing bilingual materials to develop word sense disambigua-
tion methods. In Proceedings of the International Con-
ference on Theoretical and Methodological Issues in Ma-
chine Translation, pages 101–112.
H. Li and C. Li. 2004. Word translation disambiguation us-
ing bilingual bootstrapping. Computational Linguistics,
20(4):563–596.
B. Magnini and G. Cavagli´a. 2000. Integrating subject field
codes into WordNet. In Proceedings of the Second Inter-
national LREC Conference, Athens, Greece.
D. McCarthy, R. Koeling, J. Weeds, and J. Carroll. 2004.
Finding Predominant Word Senses in Untagged Text. In
Proceedings of the 42nd Annual Meeting of the Asso-
ciation for Computational Linguistics (ACL), Barcelona,
Spain.
R. Mihalcea, T. Chklovski, and Adam Killgariff. 2004. The
Senseval-3 English lexical sample task. In Proceedings of
the 3rd ACL workshop on the Evaluation of Systems for the
Semantic Analysis of Text (SENSEVAL), Barcelona, Spain.
R. Mihalcea. 2003. The role of non-ambiguous words in
natural language disambiguation. In Proceedings of the
Conference on Recent Advances in Natural Language Pro-
cessing, RANLP.
G. A. Miller, C. Leacock, R. Tengi, and R. Bunker. 1993.
A Semantic Concordance. In Proceedings of the ARPA
Human Language Technology Workshop, pages 303–308,
Princeton, NJ, March. distributed as Human Language
Technology by San Mateo, CA: Morgan Kaufmann Pub-
lishers.
H. T. Ng, B. Wang, and Y. S. Chan. 2003. Exploiting parallel
texts for word sense disambiguation: an empirical study.
In Proceedings of the 41st Annual Meeting of the Associ-
ation for Computational Linguistics.
G. Ngai and R. Florian. 2001. Transformation-based learn-
inginthefastlane. Proceedings of the Second Confer-
ence of the North American Chapter of the Association
for Computational Linguistics, pages 40-47, Pittsburgh,
PA, USA.
T. Pedersen. 2001. A decision tree of bigrams is an accu-
rate predictor of word sense. Proceedings of the Second
Meeting of the NAACL, Pittsburgh, PA.
X. Wang and J. Carroll. 2005. Word sense disambiguation
using sense examples automatically acquired from a sec-
ond language. In Proceedings of HLT/EMNLP, Vancou-
ver, Canada.
