Relieving The Data Acquisition Bottleneck In Word Sense Disambiguation
Mona Diab
Linguistics Department
Stanford University
mdiab@stanford.edu
Abstract
Supervised learning methods for WSD yield better
performance than unsupervised methods. Yet the
availability of clean training data for the former is
still a severe challenge. In this paper, we present
an unsupervised bootstrapping approach for WSD
which exploits huge amounts of automatically gen-
erated noisy data for training within a supervised
learning framework. The method is evaluated using
the 29 nouns in the English Lexical Sample task of
SENSEVAL2. Our algorithm does as well as super-
vised algorithms on 31% of this test set, which is an
improvement of 11% (absolute) over state-of-the-art
bootstrapping WSD algorithms. We identify seven
different factors that impact the performance of our
system.
1 Introduction
Supervised Word Sense Disambiguation (WSD)
systems perform better than unsupervised systems.
But lack of training data is a severe bottleneck
for supervised systems due to the extensive la-
bor and cost involved. Indeed, one of the main
goals of the SENSEVAL exercises is to create large
amounts of sense-annotated data for supervised sys-
tems (Kilgarriff & Rosenzweig, 2000). The problem
is even more challenging for languages which pos-
sess scarce computer readable knowledge resources.
In this paper, we investigate the role of large
amounts of noisily sense annotated data obtained
using an unsupervised approach in relieving the data
acquisition bottleneck for the WSD task. We boot-
strap a supervised learning WSD system with an un-
supervised seed set. We use the sense annotated data
produced by Diab’s unsupervised system SALAAM
(Diab & Resnik, 2002; Diab, 2003). SALAAM is a
WSD system that exploits parallel corpora for sense
disambiguation of words in running text. To date,
SALAAM yields the best scores for an unsupervised
system on the SENSEVAL2 English All-Words task
(Diab, 2003). SALAAM is an appealing approach
as it provides automatically sense annotated data in
two languages simultaneously, thereby providing a
multilingual framework for solving the data acquisi-
tion problem. For instance, SALAAM has been used
to bootstrap the WSD process for Arabic as illus-
trated in (Diab, 2004).
In a supervised learning setting, WSD is cast as
a classification problem, where a predefined set of
sense tags constitutes the classes. The ambigu-
ous words in text are assigned one or more of
these classes by a machine learning algorithm based
on some extracted features. This algorithm learns
parameters from explicit associations between the
class and the features, or combination of features,
that characterize it. Therefore, such systems are
very sensitive to the training data, and those data
are, generally, assumed to be as clean as possible.
In this paper, we question that assumption. Can
large amounts of noisily annotated data used in
training be useful within such a learning paradigm
for WSD? What is the nature of the quality-quantity
trade-off in addressing this problem?
2 Related Work
To our knowledge, the earliest study of bootstrapping a WSD system with noisy data is that of Gale et al. (Gale et al., 1992). Their investigation was limited in scale to six data items with two senses each and a bounded number of examples per test item.
Two more recent investigations are by Yarowsky (Yarowsky, 1995) and, later, Mihalcea (Mihalcea, 2002). Each of these studies addresses the issue of data quantity while maintaining good quality training examples. Both present algorithms for bootstrapping supervised WSD systems using clean data based on a dictionary or an ontological resource. The general idea is to start with a clean initial seed and iteratively increase the seed size to cover more data.
Yarowsky starts with a few tagged instances to
train a decision list approach. The initial seed is
manually tagged with the correct senses based on
entries in Roget’s Thesaurus. The approach
yields very successful results — 95% — on a hand-
ful of data items.
Mihalcea, on the other hand, bases the bootstrap-
ping approach on a generation algorithm, GenCor
(Mihalcea & Moldovan, 1999). GenCor creates
seeds from monosemous words in WordNet, Sem-
cor data, sense tagged examples from the glosses
of polysemous words in WordNet, and other hand
tagged data if available. This initial seed set is used
for querying the Web for more examples and the re-
trieved contexts are added to the seed corpus. The
words in the contexts of the seed words retrieved are
then disambiguated. The disambiguated contexts
are then used for querying the Web for yet more
examples, and so on. It is an iterative algorithm
that incrementally generates large amounts of sense
tagged data. The words found are restricted to being either parts of noun compounds or internal arguments of verbs. Mihalcea's supervised learning system is an
instance-based-learning algorithm. In the study, Mi-
halcea compares results yielded by the supervised
learning system trained on the automatically gener-
ated data, GenCor, against the same system trained
on manually annotated data. She reports successful
results on six of the data items tested.
3 Empirical Layout
Similar to Mihalcea’s approach, we compare results
obtained by a supervised WSD system for English
using manually sense annotated training examples
against results obtained by the same WSD system
trained on SALAAM sense tagged examples. The
test data is the same, namely, the SENSEVAL 2
English Lexical Sample test set. The supervised
WSD system chosen here is the University of Maryland System for SENSEVAL 2 Tagging (UMSST) (Cabezas et al., 2002).
3.1 UMSST
The learning approach adopted by UMSST is based on Support Vector Machines (SVM). UMSST uses SVM-light by Joachims (Joachims, 1998).1
For each target word, where a target word is a test item, a family of classifiers is constructed, one for each of the target word's senses. All the positive examples for a sense s_i are taken as the negative examples of s_j, where j ≠ i (Allwein et al., 2000). In UMSST, each target word is considered an independent classification problem.
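This one-vs-rest construction can be sketched as follows. This is an illustrative sketch rather than UMSST's actual code, and the toy senses and feature sets are invented for the example:

```python
from collections import defaultdict

def one_vs_rest_splits(examples):
    """Build per-sense training sets: each sense's examples are its
    positives, and all other senses' examples are its negatives."""
    by_sense = defaultdict(list)
    for features, sense in examples:
        by_sense[sense].append(features)
    splits = {}
    for sense, positives in by_sense.items():
        negatives = [f for s, fs in by_sense.items() if s != sense for f in fs]
        splits[sense] = (positives, negatives)
    return splits

# Toy examples for one target word ("bank") with two hypothetical senses.
examples = [({"money", "loan"}, "finance"),
            ({"river", "shore"}, "geo"),
            ({"deposit", "money"}, "finance")]
splits = one_vs_rest_splits(examples)
```

Each (positives, negatives) pair would then train one binary SVM, giving one classifier per sense of the target word.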
The features used for UMSST are mainly contextual features, with a weight value associated with each feature. The features are space-delimited units (tokens) extracted from the immediate context of the target word. Three types of features are extracted:
1http://www.ai.cs.uni.dortmund.de/svmlight
- Wide Context Features: all the tokens in the paragraph where the target word occurs.

- Narrow Context Features: the tokens that collocate with the target word in the surrounding context, to the left and right, within a fixed window size of 3.

- Grammatical Features: syntactic tuples such as verb-obj, subj-verb, etc., extracted from the context of the target word using a dependency parser, MINIPAR (Lin, 1998).
Each feature extracted is associated with a weight
value. The weight calculation is a variant on the In-
verse Document Frequency (IDF) measure in Infor-
mation Retrieval. The weighting, in this case, is an
Inverse Category Frequency (ICF) measure where
each token is weighted by the inverse of its fre-
quency of occurrence in the specified context of the
target word.
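The exact ICF formula is not spelled out in the text, but an IDF-style variant consistent with the description, with each token weighted by the log inverse of its frequency across the target word's contexts, can be sketched as follows (the formula and toy contexts are assumptions for illustration):

```python
import math
from collections import Counter

def icf_weights(contexts):
    """Weight each token by the log inverse of the fraction of the target
    word's contexts it appears in (an assumed IDF-style formulation)."""
    df = Counter()
    for tokens in contexts:
        df.update(set(tokens))  # count each token once per context
    n = len(contexts)
    return {tok: math.log(n / df[tok]) for tok in df}

# Toy contexts of one target word ("bank").
contexts = [["the", "bank", "lends", "money"],
            ["the", "river", "bank"],
            ["money", "in", "the", "bank"]]
w = icf_weights(contexts)
```

Tokens that occur in every context of the target word ("the", "bank") receive weight 0, while rarer context tokens are weighted up, which matches the inverse-frequency intuition described above.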
3.1.1 Manually Annotated Training Data
The manually annotated training data is the SENSEVAL2 Lexical Sample training data for the English task (SV2LS Train).2 This training corpus comprises 44856 lines and 917740 tokens. There is a close affinity between the test data and the manually annotated training data: the Pearson (r) correlation between the sense distributions of the test data and of the manually annotated training data ranges between 0.94 and 1 across the individual test items.3
3.2 SALAAM
SALAAM exploits parallel corpora for sense annotation. The key intuition behind SALAAM is that when
words in one language, L1, are translated into the
same word in a second language, L2, then those
L1 words are semantically similar. For example,
when the English — L1 — words bank, brokerage,
mortgage-lender translate into the French — L2 —
word banque in a parallel corpus, where bank is polysemous, SALAAM discovers that the intended sense
for bank is the financial institution sense, not the
geological formation sense, based on the fact that
it is grouped with brokerage and mortgage-lender.
SALAAM’s algorithm is as follows:
- SALAAM expects a word-aligned parallel corpus as input;
2http://www.senseval.org
3The correlation is measured between two frequency distributions. Throughout this paper, we opt for the parametric Pearson r correlation rather than KL distance in order to test statistical significance.
- L1 words that translate into the same L2 word are grouped into clusters;

- SALAAM identifies the appropriate senses for the words in those clusters based on the proximity of the words' senses in WordNet. Word sense proximity is measured in information-theoretic terms based on an algorithm by Resnik (Resnik, 1999);

- A sense selection criterion is applied to choose the appropriate sense label, or set of sense labels, for each word in the cluster;

- The chosen sense tags for the words in the cluster are propagated back to their respective contexts in the parallel text. Simultaneously, SALAAM projects the propagated sense tags for L1 words onto their L2 corresponding translations.
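The first two steps, collecting word-aligned pairs and grouping L1 words that share an L2 translation, can be sketched as below. The WordNet-based sense selection and propagation steps are omitted, and the aligned pairs are invented for illustration:

```python
from collections import defaultdict

def cluster_by_translation(aligned_pairs):
    """Group L1 (English) words that translate into the same L2 word,
    following SALAAM's clustering step on a word-aligned corpus."""
    clusters = defaultdict(set)
    for l1_word, l2_word in aligned_pairs:
        clusters[l2_word].add(l1_word)
    return clusters

# Invented alignment pairs echoing the bank/banque example in the text.
pairs = [("bank", "banque"), ("brokerage", "banque"),
         ("mortgage-lender", "banque"), ("bank", "rive")]
clusters = cluster_by_translation(pairs)
```

On this toy input, the cluster {bank, brokerage, mortgage-lender} is exactly the semantically similar group that SALAAM would then label with the financial-institution sense via WordNet proximity.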
3.2.1 Automatically Generated SALAAM
Training Data
Three sets of SALAAM tagged training corpora are
created:
- SV2LS TR: English SENSEVAL2 Lexical Sample trial and training corpora with no manual annotations. It comprises 61879 lines and 1084064 tokens.

- MT: the English Brown Corpus, the SENSEVAL1 corpora (trial, training and test), the Wall Street Journal corpus, and the SENSEVAL 2 All Words corpus, which together comprise 151762 lines and 37945517 tokens.

- HT: the UN English corpus, which comprises 71672 lines and 1734001 tokens.
The SALAAM-tagged corpora are rendered in a
format similar to that of the manually annotated
training data. The automatic sense tagging for the MT and SV2LS TR training data uses SALAAM with machine-translated parallel corpora; the HT training corpus is sense tagged using SALAAM with the naturally occurring English-Spanish UN parallel corpus.
3.3 Experimental Conditions
Experimental conditions are created based on three
of SALAAM’s tagging factors, Corpus, Language
and Threshold:
- Corpus: there are 4 different combinations of the training corpora: MT+SV2LS TR; MT+HT+SV2LS TR; HT+SV2LS TR; or SV2LS TR alone.

- Language: the context language of the parallel corpus used by SALAAM to obtain the sense tags for the English training corpus. There are three options: French (FR), Spanish (SP), or Merged Languages (ML), where the results are obtained by merging the English output of FR and SP.

- Threshold: the sense selection criterion in SALAAM, set to either MAX (M) or THRESH (T).
These factors result in 39 conditions.4
3.4 Test Data
The test data are the 29 noun test items for the SEN-
SEVAL 2 English Lexical Sample task, (SV2LS-
Test). The data is tagged with WordNet 1.7pre senses (Fellbaum, 1998; Cotton et al., 2001). The average
perplexity for the test items is 3.47 (see Section 5.3),
the average number of senses is 7.93, and the total
number of contexts for all senses of all test items is
1773.
4 Evaluation
In this evaluation, UMSST_S is the UMSST system trained with SALAAM-tagged data and UMSST_H is the UMSST system trained with manually annotated data. Since we do not expect UMSST_S to outperform human tagging, the results yielded by UMSST_H are the upper bound for the purposes of this study. It is important to note that UMSST_S is always trained with SV2LS TR as part of the training set in order to guarantee genre congruence between the training and test sets. The scores are calculated using scorer2.5
The average precision score over all the items for UMSST_H is 65.3% at 100% coverage.
4.1 Metrics
We report results using two metrics: the harmonic mean of precision and recall, the F(β=1) score, and the Performance Ratio (PR), which we define as the ratio between two precision scores on the same test data, where precision is rendered using scorer2. PR is measured as follows:

PR = Precision(UMSST_S) / Precision(UMSST_H)    (1)
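Both metrics are straightforward to compute; a minimal sketch of the definitions above, where the example precision value of 62% is invented:

```python
def performance_ratio(prec_salaam, prec_human):
    """PR (Eq. 1): precision of the SALAAM-trained system divided by
    precision of the manually-trained system on the same test data."""
    return prec_salaam / prec_human

def f_beta1(precision, recall):
    """F(beta=1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# With UMSST_H's average precision of 65.3%, a hypothetical condition
# scoring 62% precision would give PR = 0.62 / 0.653, roughly 0.95.
pr = performance_ratio(0.62, 0.653)
```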
4Originally, there are 48 conditions, 9 of which are excluded
due to extreme sparseness in training contexts.
5From http://www.senseval.org; all scorer2 results are reported in fine-grain mode.
4.2 Results
Table 1 shows the F(β=1) scores for the upper bound UMSST_H. UMSST_Sb is the condition in UMSST_S that yields the highest overall F(β=1) score over all noun items. UMSST_Sm is the maximum F(β=1) score achievable if we know which condition yields the best performance per test item; it is therefore an oracle condition.6 Since our approach is unsupervised, we also report the results of other unsupervised systems on this test set. Accordingly, the last seven row entries in Table 1 present the performance of state-of-the-art SENSEVAL2 unsupervised systems on this test set.7
System      F(β=1)
UMSST_H     65.3
UMSST_Sb    36.02
UMSST_Sm    45.1
ITRI        45
UNED-LS-U   40.1
CLRes       29.3
IIT2(R)     24.4
IIT1(R)     23.9
IIT2        23.2
IIT1        22
Table 1: F(β=1) scores on SV2LS Test for UMSST_H, UMSST_Sb, UMSST_Sm, and state-of-the-art unsupervised systems participating in the SENSEVAL2 English Lexical Sample task.
All of the unsupervised methods, including UMSST_Sb and UMSST_Sm, are significantly below the supervised method, UMSST_H. UMSST_Sb ranks third among the unsupervised methods. It is worth noting that the average F(β=1) score across the 39 conditions is 33.14, and the lowest is 31.1. The five best conditions for UMSST_S, those that yield the highest average F(β=1) across all test items, use the HT corpus in the training data; four of them result from merged languages in SALAAM, indicating that simultaneous evidence from different languages is desirable. UMSST_Sm is the maximum potential among all unsupervised approaches if the best of all the conditions are combined. One of our goals is to automatically determine which condition or set of conditions yields the best results for each test item.
Of central interest in this paper is the performance ratio (PR) for the individual nouns. Table 2 illustrates the PR of the different nouns yielded by UMSST_Sb and UMSST_Sm, sorted in descending order by the UMSST_Sm PR scores. A PR of 1.00 indicates performance equivalent to that of UMSST_H. The highest PR values are highlighted in bold.
6The different conditions are considered independent taggers and there is no interaction across target nouns.
7http://www.senseval.org
Nouns #Ss UMH% UMSb UMSm
detention 4 65.6 1.00 1.05
chair 7 83.3 1.02 1.02
bum 4 85 0.14 1.00
dyke 2 89.3 1.00 1.00
fatigue 6 80.5 1.00 1.00
hearth 3 75 1.00 1.00
spade 6 75 1.00 1.00
stress 6 50 0.05 1.00
yew 3 78.6 1.00 1.00
art 17 47.9 0.98 0.98
child 7 58.7 0.93 0.97
material 16 55.9 0.81 0.92
church 6 73.4 0.75 0.77
mouth 10 55.9 0 0.73
authority 9 62 0.60 0.70
post 12 57.6 0.66 0.66
nation 4 78.4 0.34 0.59
feeling 5 56.9 0.33 0.59
restraint 8 60 0.2 0.56
channel 7 62 0.52 0.52
facility 5 54.4 0.32 0.51
circuit 13 62.7 0.44 0.44
nature 7 45.7 0.43 0.43
bar 19 60.9 0.20 0.30
grip 6 58.8 0.27 0.27
sense 8 39.6 0.24 0.24
lady 8 72.7 0.09 0.16
day 16 62.5 0.06 0.08
holiday 6 86.7 0.08 0.08
Table 2: The number of senses per item (column #Ss), UMSST_H precision per item (column UMH), and PR scores for UMSST_Sb (column UMSb) and UMSST_Sm (column UMSm), on SV2LS Test.
UMSST_Sm yields PR scores ≥ 0.91 for the top 12 test items listed in Table 2; that is, our algorithm does as well as the supervised algorithm, UMSST_H, on 41.6% of this test set. In UMSST_Sb, 31% of the test items (the 9 nouns that yield PR scores ≥ 0.92) do as well as UMSST_H. This is an improvement of 11% (absolute) over the state-of-the-art bootstrapping WSD algorithm of Mihalcea (Mihalcea, 2002), who reports high PR scores for only six test items: art, chair, channel, church, detention, nation. It is worth highlighting that her bootstrapping approach is partially supervised, since it depends mainly on hand-labelled data as a seed for the training data.
Interestingly, two nouns, detention and chair, yield better performance than UMSST_H, as indicated by their PRs of 1.05 and 1.02, respectively. This is attributed to the fact that SALAAM produces far more correctly annotated training data for these two words than is provided in the manually annotated training data for UMSST_H.
Some nouns yield very poor PR values, mainly due to a lack of training contexts (the case for mouth in UMSST_Sb, for example), a lack of coverage of all the senses in the test data (as for bar and day), or simply errors in the annotation of the SALAAM-tagged training data.
If we were to include only the nouns that achieve acceptable PR scores of ≥ 0.65 — the first 16 nouns in Table 2 for UMSST_Sm — the overall potential precision of UMSST_S is significantly increased to 63.8%, and the overall precision of UMSST_H is increased to 68.4%.8
These results support the idea that we could replace hand tagging with SALAAM's unsupervised tagging if we did so for those items that yield an acceptable PR score. But the question remains: how do we predict which training/test items will yield acceptable PR scores?
5 Factors Affecting Performance Ratio
In an attempt to address this question, we analyze several different factors for their impact on the performance of UMSST_S, quantified as PR. In order to effectively alleviate the sense annotation acquisition bottleneck, it is crucial to predict which items can be reliably annotated automatically using UMSST_S. Accordingly, in the rest of this paper, we explore seven different factors by examining the PR values yielded in UMSST_Sm.
5.1 Number of Senses
The test items that possess many senses, such as art (17 senses), material (16 senses), mouth (10 senses) and post (12 senses), exhibit PRs of 0.98, 0.92, 0.73 and 0.66, respectively. Overall, the correlation between the number of senses per noun and its PR score is an insignificant r = −0.31 (p > 0.1). Though it is a weak negative correlation, it does suggest that as the number of senses increases, PR tends to decrease.
5.2 Number of Training Examples
This is a characteristic of the training data. We examine the correlation between the PR and the number of training examples available to UMSST_S for each noun in the training data. The correlation between the number of training examples and PR is insignificant, at r = −0.15 (p > 0.4). More interestingly, however, spade, with only 5 training examples, yields a PR score of 1.0. This contrasts with nation, which has more than 4200 training examples but yields a low PR score of 0.59. Accordingly, the number of training examples alone does not seem to have a direct impact on PR.
8A PR of 0.65 is considered acceptable since UMSST_H achieves an overall F(β=1) score of 65.3 in the WSD task.
5.3 Sense Perplexity
This factor is a characteristic of the training data.
Perplexity is 2^Entropy. Entropy is measured as follows:

H(S) = − Σ_{s∈S} p(s) log p(s)    (2)

where s is a sense of a polysemous noun and S is the set of all its senses.
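Equation 2, together with the perplexity definition, can be computed directly from a noun's sense-tag counts; a minimal sketch, with the counts invented for illustration:

```python
import math

def sense_entropy(counts):
    """Entropy (Eq. 2) of a noun's sense distribution, from raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def sense_perplexity(counts):
    """Perplexity = 2 ** entropy of the sense distribution."""
    return 2 ** sense_entropy(counts)

# A uniform distribution over 4 senses gives the maximum perplexity, 4;
# a heavily skewed distribution approaches the minimum, 1.
uniform = sense_perplexity([25, 25, 25, 25])
skewed = sense_perplexity([97, 1, 1, 1])
```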
Entropy is a measure of confusability in the senses' context distributions; when the distribution is relatively uniform, entropy is high. A skew in the senses' context distributions indicates low entropy and, accordingly, low perplexity. The lowest possible perplexity is 1, corresponding to 0 entropy. A low sense perplexity is desirable, since it facilitates the discrimination of senses by the learner, therefore leading to better classification. In the SALAAM-tagged training data, for example, bar has the highest perplexity value, 9.85 over its 19 senses, while day, with 16 senses, has a much lower perplexity of 1.3.
Surprisingly, we observe nouns with high perplexity, such as bum (sense perplexity of 3.03), achieving PR scores of 1.0, while nouns with relatively low perplexity values, such as grip (sense perplexity of 0.53), yield low PR scores (0.27 for grip). Moreover, nouns with the same perplexity and a similar number of senses yield very different PR scores. For example, holiday and child both have a perplexity of 2.14, and their numbers of senses are close, 6 and 7 respectively; however, their PR scores are very different: holiday yields a PR of 0.08, while child achieves a PR of 0.97. Furthermore, nature and art have the same perplexity of 2.29; art has 17 senses while nature has only 7, yet art yields a much higher PR score (0.98) than nature (0.43).
These observations are further solidified by the insignificant correlation of r = −0.12 (p > 0.5) between sense perplexity and PR. At first blush, one is inclined to hypothesize that the combination of low perplexity and a large number of senses — as an indication of high skew in the distribution — is a good indicator of high PR, but this hypothesis is dispelled by day, which has 16 senses and a sense perplexity of 1.3, yet yields a low PR score of 0.08.
5.4 Semantic Translation Entropy
Semantic translation entropy (STE) (Melamed,
1997) is a special characteristic of the SALAAM-
tagged training data, since the source of evidence
for SALAAM tagging is multilingual translations.
STE measures the amount of translational variation for an L1 word in L2 in a parallel corpus. It is a variant on the entropy measure, expressed as follows:

H(T|w) = − Σ_{t∈T} p(t|w) log p(t|w)    (3)

where t is a translation in the set of possible translations T in L2, and w is the L1 word. The probability of a translation t is estimated directly from the alignments of the test nouns with their corresponding translations via maximum likelihood.
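Equation 3, with maximum-likelihood estimates taken from alignment counts, can be sketched as below; the alignment lists are invented for illustration:

```python
import math
from collections import Counter

def semantic_translation_entropy(translations):
    """STE for one L1 word: entropy of the distribution over its observed
    L2 translations, with p(t|w) estimated by maximum likelihood from
    alignment counts (Eq. 3)."""
    counts = Counter(translations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A word always aligned to a single translation has STE 0;
# translational variation raises it.
low = semantic_translation_entropy(["banque"] * 10)
high = semantic_translation_entropy(["banque"] * 5 + ["rive"] * 5)
```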
Variation in translation is beneficial for SALAAM tagging; therefore, high STE is a desirable feature. The correlation between the automatic tagging precision and STE is expected to be high if SALAAM has good quality translations and good quality alignments. However, this correlation is a low r = 0.33. Consequently, we observe a low correlation between STE and PR, r = 0.22 (p > 0.2).
Examining the data, the nouns bum, detention, dyke, stress, and yew exhibit both high STE and high PR; moreover, several nouns exhibit both low STE and low PR. The intriguing items are those that are inconsistent, for instance child and holiday. child has an STE of 0.08 and comprises 7 senses at a low sense perplexity of 1.59, yet yields a high PR of 0.97. As mentioned earlier, low STE indicates a lack of translational variation. In this specific experimental condition, child is translated as {enfant, enfantile, niño, niño-pequeño}, words that preserve the ambiguity in both French and Spanish. On the other hand, holiday has a relatively high STE value of 0.55, yet results in the lowest PR of 0.08. Consequently, we conclude that STE alone is not a good direct indicator of PR.
5.5 Perplexity Difference
Perplexity difference (PerpDiff) is a measure of the absolute difference in sense perplexity between the test data items and the training data items. For the manually annotated training data items, the overall correlation between the perplexity measures is a significant r = 0.97, which contrasts with a low overall correlation of r = 0.43 between the SALAAM-tagged training data items and the test data items. Across the nouns in this study, the correlation between PerpDiff and PR is r = −0.4. It is advantageous for the training data to be as similar as possible to the test data, to guarantee good classification results within a supervised framework; therefore, a low PerpDiff is desirable. Yet we observe cases with a low PerpDiff, such as holiday (PerpDiff of 0.05), where the PR is nonetheless a low 0.08. On the other hand, items such as art have a relatively high PerpDiff of 2.2 but achieve a high PR of 0.98. Accordingly, PerpDiff alone is not a good indicator of PR.
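PerpDiff itself is just the absolute difference between the two perplexities defined in Section 5.3; trivially, and with the example perplexities invented:

```python
def perplexity_difference(train_perplexity, test_perplexity):
    """PerpDiff: absolute difference between a noun's training-data and
    test-data sense perplexities."""
    return abs(train_perplexity - test_perplexity)

# A training distribution at perplexity 1.3 against a test distribution
# at 1.5 gives a small PerpDiff of 0.2.
diff = perplexity_difference(1.3, 1.5)
```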
5.6 Sense Distributional Correlation
Sense Distributional Correlation (SDC) results from comparing the sense distributions of the test data items with those of the SALAAM-tagged training data items. It is worth noting that the correlation between the SDC of the manually annotated training data and that of the test data ranges from r = 0.9 to 1.0. A strong, significant correlation of r = 0.87 (p < 0.0001) between SDC and PR exists for the SALAAM-tagged training data and the test data. Overall, nouns that yield high PR have high SDC values. However, there are some instances where this strong correlation is not exhibited. For example, circuit and post have relatively high SDC values, 0.79 and 0.85, respectively, in UMSST_Sm, but they score lower PR values than detention, which has a comparatively lower SDC value of 0.77. The fact that both circuit and post have many senses, 13 and 12, respectively, while detention has only 4 senses is noteworthy; detention, however, has a higher STE and a lower sense perplexity than either of them. Overall, the data suggest that SDC is a very good direct indicator of PR.
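SDC is a Pearson correlation between two sense frequency distributions; a self-contained sketch, with the sense counts invented for illustration:

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def sdc(train_sense_counts, test_sense_counts):
    """Sense Distributional Correlation: Pearson r between the sense
    frequency distributions of the training and test data for one noun."""
    return pearson_r(train_sense_counts, test_sense_counts)

# Training data whose sense distribution mirrors the test distribution
# scores near 1, which the analysis above ties to high PR.
matched = sdc([50, 30, 20], [48, 33, 19])
```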
5.7 Sense Context Confusability
A situation of sense context confusability (SCC)
arises when two senses of a noun are very similar
and are highly uniformly represented in the train-
ing examples. This is an artifact of the fine gran-
ularity of senses in WordNet 1.7pre. Highly simi-
lar senses typically lead to similar usages, therefore
similar contexts, which in a learning framework de-
tract from the learning algorithm’s discriminatory
power.
Upon examining the 29 polysemous nouns in the
training and test sets, we observe that a significant
number of the words have similar senses according
to a manual grouping provided by Palmer in 2002.9
For example, senses 2 and 3 of nature, meaning trait
and quality, respectively, are considered similar by
the manual grouping. The manual grouping does
not provide total coverage of all the noun senses
in this test set. For instance, it only considers the
homonymic senses 1, 2 and 3 of spade, yet, in the
current test set, spade has 6 senses, due to the exis-
tence of sub senses.
26 of the 29 test items exhibit multiple groupings under the manual grouping. Only three nouns, detention, dyke and spade, have no sense groupings, and all three achieve high PR scores of 1.0.
There are several nouns that have relatively high SDC values yet low performance ratios, such as post, nation, channel and circuit. For instance, nation has a very high SDC value of 0.96, a low sense perplexity of 1.3 — relatively close to the 1.5 sense perplexity of the test data — and a sufficient number of contexts (4350), yet it yields a PR of 0.59. According to the manual sense grouping, senses 1 and 3 of nation are similar, and indeed, upon inspecting the context distributions, we find that the bulk of the senses' example instances in the SALAAM-tagged training data, for the condition that yields this PR in UMSST_Sm, are annotated with either sense 1 or sense 3, thereby creating confusable contexts for the learning algorithm. All the nouns that achieve high PR and possess sense groups have no SCC in the training data, which strongly suggests that SCC is an important factor to consider when predicting the PR of a system.
5.8 Discussion
We conclude from the above exploration that SDC and SCC affect PR scores directly, while PerpDiff, STE, sense perplexity, the number of senses and the number of contexts seem to have no noticeable direct impact on the PR.
Based on this observation, we calculate the SDC values for all the training data used in our experimental conditions for the 29 test items. Table 3 illustrates the items with the highest SDC values, in descending order, as yielded from any of the SALAAM conditions. We use an empirical cut-off value of 0.75 for SDC. The SCC values are reported as boolean Y/N values, where a Y indicates the presence of a sense-confusable context. As shown, a high SDC can serve as a means of auto-
9http://www.senseval.org/sense-groups. The manual sense
grouping comprises 400 polysemous nouns including the 29
nouns in this evaluation.
Noun SDC SCC PR
dyke 1 N 1.00
bum 1 N 1.00
fatigue 1 N 1.00
hearth 1 N 1.00
yew 1 N 1.00
chair 0.99 N 1.02
child 0.99 N 0.95
detention 0.98 N 1.0
spade 0.97 N 1.00
mouth 0.96 Y 0.73
nation 0.96 N 0.59
material 0.92 N 0.92
post 0.90 Y 0.63
authority 0.86 Y 0.70
art 0.83 N 0.98
church 0.80 N 0.77
circuit 0.79 N 0.44
stress 0.77 N 1.00
Table 3: Highest SDC values for the test items, associated with their respective SCC and PR values.
matically predicting a high PR, but it is not suffi-
cient. If we eliminate the items where an SCC ex-
ists, namely, mouth, post, and authority, we are still
left with nation and circuit, where both yield very
low PR scores. nation has a desirably low PerpDiff.
However, the sense annotation tagging precision for nation
in the condition that yields its highest SDC, which uses
Spanish UN data for training, is low, as is its STE value.
This is because both French and Spanish preserve ambiguity
in ways similar to English, which makes nation a poor
target word for disambiguation within the SALAAM
framework, given these two languages as sources of
evidence. Accordingly, in this case, the low STE coupled
with the noisy tagging could have resulted in the low PR.
For circuit, however, the STE value for its respective
condition is high, yet we observe a relatively high PerpDiff
compared to the PerpDiff of 0 for the manually annotated
data.
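PerpDiff and sense perplexity are not defined in this excerpt. Assuming sense perplexity is the standard exponentiated entropy of a noun's sense distribution, and PerpDiff the absolute difference in perplexity between the SALAAM-tagged and manually tagged data, a minimal sketch:

```python
from math import log2

def sense_perplexity(dist):
    """Perplexity (2**entropy) of a sense probability distribution,
    given as a dict mapping sense -> probability.
    Assumption: this is the paper's 'Sense Perplexity'."""
    h = -sum(p * log2(p) for p in dist.values() if p > 0)
    return 2 ** h

def perp_diff(noisy_dist, gold_dist):
    """Assumed definition of PerpDiff: absolute difference in sense
    perplexity between noisy and gold annotations of the same noun."""
    return abs(sense_perplexity(noisy_dist) - sense_perplexity(gold_dist))
```

On this reading, comparing the manually annotated data with itself gives a PerpDiff of 0, consistent with the baseline mentioned above.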
Therefore, a combination of a high SDC and a nonexistent
SCC can reliably predict a good PR, but the other factors
still have a role to play in achieving an accurate prediction.
It is worth emphasizing that two of the identified factors,
SDC and PerpDiff, depend on manually tagged test data in
this study. One solution to this problem is to estimate SDC
and PerpDiff using a held-out data set that is hand tagged.
Such a held-out set would be considerably smaller than the
manually tagged training data required by a classical
supervised WSD system. Hence, SALAAM-tagged training
data offers a viable solution to the annotation acquisition
bottleneck.
6 Conclusion and Future Directions
In this paper, we applied an unsupervised approach,
SALAAM, within a supervised learning framework for the
sense annotation of large amounts of data. The ultimate
goal is to alleviate the data labelling bottleneck by trading
off quality against quantity of training data. The resulting
system is competitive with state-of-the-art unsupervised
systems evaluated on the same test set from SENSEVAL2.
Moreover, it yields superior results to those obtained by the
only comparable bootstrapping approach when tested on
the same data set. Furthermore, we explored, in depth, the
different factors that directly and indirectly affect the
system's performance, quantified as a performance ratio,
PR. Sense Distribution Correlation (SDC) and Sense
Context Confusability (SCC) have the highest direct impact
on PR. However, evidence suggests that a confluence of all
the different factors leads to the best prediction of an
acceptable PR value. Investigating the feasibility of
combining these different factors with the different
attributes of the SALAAM experimental conditions, in
order to automatically predict when the noisy training data
can reliably replace manually annotated data, is a matter of
future work.
7 Acknowledgements
I would like to thank Philip Resnik for his guidance and
insights, which contributed tremendously to this paper. I
would also like to acknowledge Daniel Jurafsky and Kadri
Hacioglu for their helpful comments, and the three
anonymous reviewers for their detailed reviews. This work
has been supported, in part, by NSF Award #IIS-0325646.
References
Erin L. Allwein, Robert E. Schapire, and Yoram Singer.
2000. Reducing multiclass to binary: A unifying ap-
proach for margin classifiers. Journal of Machine
Learning Research, 1:113-141.
Clara Cabezas, Philip Resnik, and Jessica Stevens.
2002. Supervised Sense Tagging using Support Vector
Machines. Proceedings of the Second International
Workshop on Evaluating Word Sense Disambiguation
Systems (SENSEVAL-2). Toulouse, France.
Scott Cotton, Phil Edmonds, Adam Kilgarriff, and
Martha Palmer, ed. 2001. SENSEVAL-2: Second
International Workshop on Evaluating Word Sense
Disambiguation Systems. ACL SIGLEX, Toulouse,
France.
Mona Diab. 2004. An Unsupervised Approach for Boot-
strapping Arabic Word Sense Tagging. Proceedings
of Arabic Based Script Languages, COLING 2004.
Geneva, Switzerland.
Mona Diab and Philip Resnik. 2002. An Unsupervised
Method for Word Sense Tagging Using Parallel Cor-
pora. Proceedings of the 40th Annual Meeting of ACL.
Pennsylvania, USA.
Mona Diab. 2003. Word Sense Disambiguation Within a
Multilingual Framework. PhD Thesis. University of
Maryland College Park, USA.
Christiane Fellbaum. 1998. WordNet: An Electronic
Lexical Database. MIT Press.
William A. Gale, Kenneth W. Church, and David
Yarowsky. 1992. Using Bilingual Materials to De-
velop Word Sense Disambiguation Methods. Proceed-
ings of the Fourth International Conference on Theo-
retical and Methodological Issues in Machine Trans-
lation. Montréal, Canada.
Thorsten Joachims. 1998. Text Categorization with Sup-
port Vector Machines: Learning with Many Relevant
Features. Proceedings of the European Conference on
Machine Learning. Springer.
A. Kilgarriff and J. Rosenzweig. 2000. Framework and
Results for English SENSEVAL. Computers and the
Humanities, 34(1-2):15-48.
Dekang Lin. 1998. Dependency-Based Evaluation of
MINIPAR. Proceedings of the Workshop on the
Evaluation of Parsing Systems, First International
Conference on Language Resources and Evaluation.
Granada, Spain.
Dan I. Melamed. 1997. Measuring Semantic Entropy.
ACL SIGLEX, Washington, DC.
Rada Mihalcea and Dan Moldovan. 1999. A method for
Word Sense Disambiguation of unrestricted text. Pro-
ceedings of the 37th Annual Meeting of ACL. Mary-
land, USA.
Rada Mihalcea. 2002. Bootstrapping Large sense
tagged corpora. Proceedings of the 3rd International
Conference on Languages Resources and Evaluations
(LREC). Las Palmas, Canary Islands, Spain.
Philip Resnik. 1999. Semantic Similarity in a Taxon-
omy: An Information-Based Measure and its Applica-
tion to Problems of Ambiguity in Natural Language.
Journal of Artificial Intelligence Research, 11:95-130.
David Yarowsky. 1995. Unsupervised Word Sense Dis-
ambiguation Rivaling Supervised Methods. Proceed-
ings of the 33rd Annual Meeting of ACL. Cambridge,
MA.
