Proceedings of EACL '99 
95% Replicability for Manual Word Sense Tagging 
Adam Kilgarriff 
ITRI, University of Brighton, Lewes Road, Brighton UK 
email: adam@itri.bton.ac.uk 
People have been writing programs for auto- 
matic Word Sense Disambiguation (WSD) for 
forty years now, yet the validity of the task has 
remained in doubt. At a first pass, the task is 
simply defined: a word like bank can mean 'river 
bank' or 'money bank' and the task-is to deter- 
mine which of these applies in a context in which 
the word bank appears. The problems arise be- 
cause most sense distinctions are not as clear as 
the distinction between 'river bank' and 'money 
b.~nk', so it is not always straightforward for a 
person to say what the correct answer is. Thus 
we do not always know what it would mean to 
say that a computer program got the right an- 
swer. The issue is discussed in detail by (Gale 
et al., 1992) who identify the problem as one 
of identifying the 'upper bound' for the perfor- 
mance of a WSD program. If people can only 
agree on the correct answer x% of the time, a 
claim that a program achieves more than x% ac- 
curacy is hard to interpret, and x% is the upper 
bound for what the program can (meaningfully) 
achieve. 
There have been some discussions as to what 
this upper bound might be. Gale et al. re- 
view a psycholinguistic study (Jorgensen, 1990) 
in which the level of agreement averaged 68%. 
But an upper bound of 68% is disastrous for 
the enterprise, since it implies that the best a 
program could possibly do is still not remotely 
good enough for any practical purpose. 
Even worse news comes from (Ng and Lee, 
1996), who re-tagged parts of the manually 
tagged SEMCOR corpus (Fellbaum, 1998). The 
taggings matched only 57% of the time. 
If these represent as high a level of inter- 
tagger agreement as one could ever expect, 
WSD is a doomed enterprise. However, neither 
study set out to identify an upper bound for 
WSD and it is far from ideal to use their results 
in this way. In this paper we report on a study 
which did aim specifically at achieving as high 
a level of replicability as possible. 
The study took place within the context of 
SENSEVAL, an evaluation exercise for WSD 
programs. 1 It was, clearly, critical to the va- 
lidity of SENSEVAL as a whole to establish the 
integrity of the 'gold standard' corpus against 
which WSD programs would be judged. 
Measures taken to maximise the agreement 
level were: 
humans: whereas other tagging exercises 
had mostly used students, SENSEVAL 
used professional lexicographers 
dictionary: the dictionary that provided 
the sense inventory had lengthy entries, 
with substantial numbers of examples 
task definition: in cases where none, or 
more than one, of the senses applied, the 
lexicographer was encouraged to tag the in- 
stance as "unassignable" or with multiple 
tags 2 
The exercise is chronicled at http://vn~.itri.bton.ac.uk/events/senseval 
and in 
(Kilgarriff and Palmer, Forthcoming), where a fuller ac- 
count of all matters covered in the poster can be found. 
2The scoring algorithm simply treated "unassignable" 
as another tag. (Less than 1% of instnaces were tagged 
"unassignable".) Where there were multiple tags and 
a partial match between taggings, partial credit was 
assigned. 
277 
Proceedings of EACL '99 
• arbitration: first, two or three lexicogra- 
phers provided taggings. Then, any in- 
stances where these taggings were not iden- 
tical were forwarded to a third lexicogra- 
pher for arbitration. 
The data for SENSEVAL comprised around 
200 corpus instances for each of 35 words, mak- 
ing a total of 8455 instances. A scoring scheme 
was developed which assigned partial credit 
where more than one sense had been assigned 
to an instance. This was developed primarily 
for scoring the WSD systems, but was also used 
for scoring the lexicographers' taggings. 
At the time of the SENSEVAL workshop, 
the tagging procedure (including arbitration) 
had been undertaken once for each corpus in- 
stance. We scored lexicographers' initial pre- 
arbitration results against the post-arbitration 
results. The scores ranged between 88% to 
100%, with just five out of 122 results for 
<lexicographer, word> pairs falling below 95%. 
To determine the replicability of the whole 
process in a thoroughgoing way, we repeated it 
for a sample of four of the words. The words 
were selected to reflect the spread of difficulty: 
we took the word which had given rise to the 
lowest inter-tagger agreement in the previous 
round, (generous, 6 senses), the word that had 
given rise to the highest, (sack, 12 senses), and 
two words from the middle of the range (onion, 
5, and shake, 36). The 1057 corpus instances 
for the four words were tagged by two lexicog- 
raphers who had not seen the data before; the 
non-identical taggings were forwarded to a third 
for arbitration. These taggings were then com- 
pared with the ones produced previously. 
The table shows, for each word, the number of 
corpus instances (Inst), the number of multiply- 
tagged instances in each of the two sets of tag- 
gings (A and B), and the level of agreement be- 
tween the two sets (Agr). 
There were 240 partial mismatches, with par- 
tial credit assigned, in contrast to just 7 com- 
plete mismatches. 
A instance on which the taggings disagreed 
was: 
Give plants generous root space. 
Word Inst A B Agr% 
generous 
onion 
sack 
shake 
227 76 68 88.7 
214 10 11 98.9 
260 0 3 99.4 
356 35 49 95.1 
ALL 1057 121 131 95.5 
Sense 4 of generous is defined as simply "abun- 
dant; copious", and sense 5 as "(of a room or 
building) large in size; spacious". One tag- 
ging selected each. In general, taggings failed 
to match where the definitions were vague and 
overlapping, and where, as in sense 5, some part 
of a defintion matches a corpus instance well 
("spacious") but another part does not ("of a 
room or building"). 
Conclusion 
The upper bound for WSD is around 95%, and 
Gale et al.'s worries about the integrity of the 
task can be laid to rest. In order for manually 
tagged test corpora to achieve 95% replicability, 
it is critical to take care over the task definition, 
to employ suitably qualified individuals, and to 
double-tag and include an arbitration phase. 
References 
Christiane Fellbaum, editor. 1998. WordNet: An 
Electronic Lexical Database. MIT Press, Cam- 
bridge, Mass. 
William Gale, Kenneth Church, and David 
Yarowsky. 1992. Estimating upper and lower 
bounds on the performance of word-sense disam- 
biguation programs. In Proceedings, 30th A CL, 
pages 249-156. 
Julia C. Jorgensen. 1990. The psychological reality 
of word senses. Journal of Psycholinguistic Re- 
search, 19(3):167-190. 
Adam Kilgarriff and Martha Palmer. Forthcom- 
ing. Guest editors, Special Issue on SENSE- 
VAL: Evaluating Word Sense Disambiguation Pro- 
grams. Computers and the Humanities. 
Hwee Tou Ng and Hian Beng Lee. 1996. Integrat- 
ing multiple knowledge sources to disambiguate 
word sense: An exemplar-based approach. In 
A CL Proceedings, pages 40-47, Technical Univer- 
sity, Berlin, Santa Cruz, California. 
278 
