A Case Study on Inter-Annotator Agreement for 
Word Sense Disambiguation 
Hwee Tou Ng 
Chung Yong Lim 
Shou King Foo 
DSO Natmnal Laboratories 
20 Scmnce Park Drive, Singapore 118230 
{nhweetou, Ichungyo, fshoukln}@dso org. sg 
Abstract 
There is a general concern within the field of word 
sense dusamb~guatmn about the rater-annotator 
agreement between human annotators. In thus pa- 
per, we examine th~s msue by comparing the agree- 
ment rate on a large corpus of more than 30,000 
sense-tagged instances Thin corpus us the mtersec- 
tmn of the WORDNET Semcor corpus and the DSO 
corpus, which has been independently tagged by two 
separate groups of human annotators The contri- 
bution of this paper us two-fold First, ~t presents a 
greedy search algorithm that can automatically de- 
rive coarser sense classes based on the sense tags 
assigned by two human annotators The resulting 
derived coarse sense classes achmve a h~gher agree- 
ment rate but we stfl!mamtam as many of the orig- 
inal sense classes as posmble Second, the coarse 
sense grouping derived by the algorithm, upon veri- 
fication by human, can potentially serve as a better 
sense inventory for evaluating automated word sense 
d~samb~guatmn algorithms Moreover, we examined 
the derived coarse sense classes and found some in- 
teresting groupings of word senses that correspond 
to human mtmtlve judgment of sense granularity 
1 Introduction . 
It us widely acknowledged that word sense d~sam- 
blguatmn (WSD) us a central problem m natural 
language processing In order for computers to be 
able to understand and process natural language be- 
yond simple keyword matching, the problem of d~s- 
amblguatmg word sense, or dlscermng the meamng 
of a word m context, must be effectively dealt with 
Advances in WSD v, ill have slgmficant Impact on 
apphcatlons hke information retrieval and machine 
translation 
For natural language subtasks hke part-of-speech 
tagging or s)ntactm parsing, there are relatlvely well 
defined and agreed-upon cnterm of what it means to 
have the "correct" part of speech or syntactic struc- 
ture assigned to a word or sentence For instance, 
the Penn Treebank corpus (Marcus et al, 1993) pro- 
~ide~ ,t large repo.~tory of texts annotated w~th part- 
of-speech and s}ntactm structure mformatlon Tv.o 
independent human annotators can achieve a high 
rate of agreement on assigning part-of-speech tags 
to words m a g~ven sentence 
Unfortunately, th~s us not the case for word sense 
assignment F~rstly, it is rarely the case that any two 
dictionaries will have the same set of sense defim- 
tmns for a g~ven word Different d~ctlonanes tend to 
carve up the "semantic space" m a different way, so 
to speak Secondly, the hst of senses for a word m a 
typical dmtmnar~ tend to be rather refined and com- 
prehensive This is especmlly so for the commonly 
used words which have a large number of senses The 
sense dustmctmn between the different senses for a 
commonly used word m a d~ctmnary hke WoRDNET 
(Miller, 1990) tend to be rather fine Hence, two 
human annotators may genuinely dusagree m their 
sense assignment to a word m context 
The agreement rate between human annotators on 
word sense assignment us an Important concern for 
the evaluatmn of WSD algorithms One would pre- 
fer to define a dusamblguatlon task for which there 
us reasonably hlgh agreement between human an- 
notators The agreement rate between human an- 
notators will then form the upper ceiling against 
whmh to compare the performance of WSD algo- 
rithms For instance, the SENSEVAL exerclse has 
performed a detaded study to find out the rater- 
annotator agreement among ~ts lexicographers tag- 
grog the word senses (Kllgamff, 1998c, Kllgarnff, 
1998a, Kflgarrlff, 1998b) 
2 A Case Study 
In this-paper, we examine the ~ssue of rater- 
annotator agreement by comparing the agreement 
rate of human annotators on a large sense-tagged 
corpus of more than 30,000 instances of the most fre- 
quently occurring nouns and verbs of Enghsh This 
corpus is the intersection of the WORDNET Semcor 
corpus (Miller et al, 1993) and the DSO corpus (Ng 
and Lee, 1996, Ng, 1997), which has been indepen- 
dently tagged wlth the refined senses of WORDNET 
by two separate groups of human annotators 
The Semcor corpus us a subset of the Brown corpus 
tagged with ~VoRDNET senses, and consists of more 
than 670,000 words from 352 text files Sense tag- 
gmg was done on the content words (nouns, ~erbs, 
adjectives and adverbs) m this subset 
The DSO corpus consists of sentences drawn from 
the Brown corpus and the Wall Street Journal For 
each word w from a hst of 191 frequently occur- 
ring words of Enghsh (121 nouns and 70 verbs), sen- 
tences containing w (m singular or plural form, and 
m its various reflectional verb form) are selected and 
each word occurrence w ~s tagged w~th a sense from 
WoRDNET There ~s a total of about 192,800 sen- 
tences in the DSO corpus m which one word occur- 
rence has been sense-tagged m each sentence 
The intersection of the Semcor corpus and the 
DSO corpus thus consists of Brown corpus sentences 
m which a word occurrence w is sense-tagged m each 
sentence, where w Is one of.the 191 frequently oc- 
,currmg English nouns or verbs Since this common 
pomon has been sense-tagged by two independent 
groups of human annotators, ~t serves as our data set 
for investigating inter-annotator agreement in this 
paper 
3 Sentence Matching 
To determine the extent of inter-annotator agree- 
ment, the first step ~s to match each sentence m 
Semcor to its corresponding counterpart In the DSO 
corpus This step ~s comphcated by the following 
factors 
1 Although the intersected portion of both cor- 
pora came from Brown corpus, they adopted 
different tokemzatmn convention, and segmen- 
tartan into sentences differed sometimes 
2 The latest versmn of Semcor makes use of the 
senses from WORDNET 1 6, whereas the senses 
used m the DSO corpus were from WoRDNET 
15 1 
To match the sentences, we first converted the 
senses m the DSO corpus to those of WORDNET 
1 6 We ignored all sentences m the DSO corpus m 
which a word is tagged with sense 0 or -1 (A word is 
tagged with sense 0 or -1 ff none of the given senses 
m WoRDNFT applies ) 
4, sentence from Semcor is considered to match 
one from the DSO corpus ff both sentences are ex- 
actl) ldent~cal or ff the~ differ only m the pre~ence 
or absence of the characters " (permd) or -' (hy- 
phen) 
For each remaining Semcor sentence, taking into 
account word ordering, ff 75% or more of the words 
m the sentence match those in a DSO corpus sen- 
tence, then a potential match ~s recorded These 
i -kctua\[ly, the WORD~q'ET senses used m the DSO corpus 
were from a shght variant of the official WORDNE'I 1 5 release 
Th~s ssas brought to our attention after the pubhc release of 
the DSO corpus 
potential matches are then manually verffied to en- 
sure that they are true matches and to ~eed out any 
false matches 
Using this method of matching, a total of 
13,188 sentence-palrs contasnmg nouns and 17,127 
sentence-pa~rs containing verbs are found to match 
from both corpora, ymldmg 30,315 sentences which 
form the intersected corpus used m our present 
study 
4 The Kappa Statistic 
Suppose there are N sentences m our corpus where 
each sentence contains the word w Assume that 
w has M senses Let 4 be the number of sentences 
which are assigned identical sense b~ two human an- 
notators Then a simple measure to quantify the 
agreement rate between two human annotators Is 
Pc, where Pc, = A/N 
The drawback of this simple measure is that it 
does not take into account chance agreement be- 
tween two annotators The Kappa statistic a (Co- 
hen, 1960) is a better measure of rater-annotator 
agreement which takes into account the effect of 
chance agreement It has been used recently 
w~thm computatmnal hngu~stlcs to measure rater- 
annotator agreement (Bruce and Wmbe, 1998, Car- 
letta, 1996, Veroms, 1998) 
Let Cj be the sum of the number of sentences 
which have been assigned sense 3 by annotator 1 and 
the number of sentences whmh have been assigned 
sense 3 by annotator 2 Then 
P~-P~ 
1-P~ 
where 
M 
j=l 
and Pe measures the chance agreement between two 
annotators A Kappa ~alue of 0 indicates that 
the agreement is purely due to chance agreement, 
whereas a Kappa ~alue of 1 indicates perfect agree- 
ment A Kappa ~alue of 0 8 and above is considered 
as mdmatmg good agreement (Carletta, 1996) 
Table 1 summarizes the inter-annotator agree- 
ment on the mtersected corpus The first (becond) 
row denotes agreement on the nouns (xerbs), wh~le 
the lass row denotes agreement on all words com- 
bined The a~erage ~ reported m the table is a s~m- 
pie average of the individual ~ value of each word 
The agreement rate on the 30,315 sentences as 
measured by P= is 57% This tallies with the fig- 
ure reported ~n our earlier paper (Ng and Lee, 1996) 
where we performed a quick test on a subset of 5,317 
sentences ,n the intersection of both the Semcor cor- 
pus and the DSO corpus 
10 
\[\] 
mm 
m 
m 
m 
m 
m 
mm 
m 
m 
m 
m 
mm 
m 
m 
m 
Type Num of v, ords A N \[ P~ Avg 
Nouns 121 7,676 13,188 I 0 582 0 300 
Verbs 70 9,520 17,127 I 0 555 0 347 
All I 191 I 17,196 30,315 I 056T 0317 
Table 1 Raw inter-annotator agreement 
5 Algorithm 
Since the rater-annotator agreement on the inter- 
sected corpus is not high, we would like to find out 
how the agreement rate would be affected if different 
sense classes were in use 
In this section, we present a greedy search algo- 
rithm that can automatmalb derive coarser sense 
classes based on the sense tags assigned by two hu- 
man annotators The resulting derived coarse sense 
classes achmve a higher agreement rate but we still 
maintain as many of the original sense classes as 
possible The algorithm is given m Figure 1 
The algorithm operates on a set of sentences where 
each sentence contains an occurrence of the word w 
whmh has been sense-tagged by two human anno- 
tators - At each Iteration of the algorithm, tt finds 
the pair of sense classes Ct and Cj such that merg- 
ing these two sense classes results in the highest t~ 
value for the resulting merged group of sense classes 
It then proceeds to merge Cz and C~ Thin process 
Is repeated until the ~ value reaches a satisfactory 
value ~,~t,~, which we set as 0 8 
Note that this algorithm is also applicable to de- 
riving any coarser set of classes from a refined set 
for any NLP tasks in which prior human agreement 
rate may not be high enough Such NLP tasks could 
be discourse tagging, speech-act categorization, etc 
6 Results 
For each word w from the list of 121 nouns and 
70 verbs, ~e applied the greedy search algorithm to 
each set of sentences in the intersected corpus con- 
taming w For a subset of 95 words (53 nouns and 42 
verbs), the algorithm was able to derive a coarser set 
of 2 or more senses for each of these 95 words such 
that the resulting Kappa ~alue reaches 0 8 or higher 
For the other 96 words, m order for the Kappa value 
to reach 0 8 or higher, the algorithm collapses all 
senses of the ~ord to a single (trivial) class Table 2 
and 3 summarizes the results for the set of 53 nouns 
and 42 ~erbs, respectively 
Table 2 md~cates that before the collapse of sense 
classes, these 53 nouns have an average of 7 6 senses 
per noun There is a total of 5,339 sentences in the 
intersected corpus containing these nouns, of which 
3,387 sentences were assigned the same sense by 
the two groups of human annotators The average 
Kappa statistic (computed as a simple average of the 
Kappa statistic of ~he mdlwdual nouns) is 0 463 
After the collapse of sense classes by the greedy 
search algorithm, the average number of senses per 
noun for these 53 nouns drops to 40 Howe~er, 
the number of sentences which have been asmgned 
the same coarse sense by the annotators increases to 
5,033 That is, about 94 3% of the sentences have 
been assigned the same coarse sense, and that the 
average Kappa statistic has improved to 0 862, mgm- 
fymg high rater-annotator agreement on the derived 
coarse senses Table3 gl~es the analogous figures for 
the 42 verbs, agmn mdmatmg that high agreement 
is achieved on the coarse sense classes den~ed for 
verbs 
7 Discussion 
Our findings on rater-annotator agreement for word 
sense tagging indicate that for average language 
users, it is quite dl~cult to achieve high agreement 
when they are asked to assign refned sense tags 
(such as those found in WORDNET) given only the 
scanty definition entries m the WORDNET dlctio- 
nary and a few or no example sentences for the 
usage of each word sense Thin observation agrees 
wlth that obtmned m a recent study done by (Vero- 
ms, 1998), where the agreement on sense-tagging by 
naive users was also not hlgh Thus It appears that 
an average language user is able to process language 
wlthout needing to perform the task of dlsamblguat- 
mg word sense to a very fine-grained resolutmn as 
formulated m a tradltlonal dmtlonary 
In contrast, expert lexicographers tagged the ~ ord 
sense in the sentences used m the SENSEVAL exer- 
clse, where high rater-annotator agreement was re- 
ported There are also fuller dlctlonary entries m 
the HECTOR dlctlonary used and more e<amples 
showing the usage of each word sense m HECTOR 
These factors are likely to have contributed to the 
difference in rater-annotator agreement observed m 
the three studies conducted 
We also examined the coarse sense classes derived 
by the greedy search algorithm Vv'e found some in- 
teresting groupings of coarse senses for nouns which 
~e hst in Table 4 
From Table 4, it is apparent that the greedy search 
algorithm can derive interesting groupings of word 
senses that correspond to human mtmtwe judgment 
of sense graz}.ulanty It Is clear that some of the dis- 
agreement between the two groups of human anno- 
tators can be attributed solely to the overly refined 
senses of WoRDNET As an example, there is a total 
Ii 
loop: let Ct, , C M denote the current M sense classes 
~* +-- -oo 
for all z,3 such that 1 <, < 3 < M 
let C\[, ,C~w_ 1 denote the resulting M - 1 sense classes by mergmg C, and C 3 
compute ~(C\[, , C~/_t) 
ff ~(C{, , C~4_x) > ~* then 
~" +- ~(C~, ,C~_t), z* +- ~, ~* +- 
end for 
merge the sense class C,. and C~. 
M+--M-1 
If E* < ~rn,n gore loop 
Table 2 
F~gure 1 ~ greedy search algorithm 
Type Avg Num of senses A N P~ ~'~g ~ I 
Before 76 3,387 5,339 0 634 0463 \] 
After 40 5,033 5,339 0 943 0862 j 
Inter-annotator agreement for 53 nouns before and after the collapse of senses 
Sense 1 change, alteratmn, modfficatlon - (an event 
that occurs when something passes from one state 
or phase to another "the change was intended to 
increase sales", thLs storm ~s certaanly a change for 
the worse") 
Sense 2 change - (a relaUonal &fference between 
state.% esp between states before and after some 
event "he attributed the change to their marnage") 
Sense 3 change - (the act of changing something, 
'the change of government had no ~mpact on the 
economy", "hLs change on abortmn cost h~m the elec- 
tion" ) 
Sense 4 change - (the result of alterauon or modlfi- 
carton, there were marked changes m the hmng of 
the lungs", "there had been no change m the moun- 
tains" ) 
Sense 5 change - (the balance of money recmved 
~hen the amount you tender is greater than the 
amount due, 'I paid with a twenty and pocketed 
the change") 
Sense 6 change - (a thing that Is different, 'he re- 
spected several changes before selecting one") 
Sense 8 change - (corns of small denommatlon re- 
garded collecuvel~, 'he had a pocketful of change") 
Figure 2 Seven senses of the noun "change" used 
b~ the human annotators 
of III sentences m the intersected corpus containing 
the noun with root word form 'change" They are 
assigned one of the seven senses hsted in F~gure 2 by 
the two groups of human annotators 
Based on the imtml word senses assigned, Pa = 
0 38 and ~ = -0 09 (~ is negative when there Is sys- 
~ematlc dlsagreement ) Hovve~er, the greed? search 
algorithm collapses sense I, 2, 3, 4 and 6 into one 
coarse sense and sense 5 and 8 into another coarse 
sense As a result, Pa = ~ = 1, mdmaung perfect 
agreement when the senses are collapsed m the man- 
ner found This corresponds to our mtum~e judg- 
ment of the relauve closeness of the various senses 
here 
Similarly, some of the 96 words for whmh the 
greedy search algorithm collapses into one single 
sense are such that the various senses are too close to 
be rehably dmtmgmshed In short, we believe that 
the coarse sense classes derived by the greedy search 
algorithm, upon verlficauon by human, can poten- 
ually serve as a better sense inventory for evaluating 
automated word sense d~samb~guatlon algorithms 
8 Related Work 
Recently, both Bruce and Wmbe (1998) and Veroms 
(1998) have looked into algorithms to automancallv 
generate better sense classes m a corpus-based, data- 
driven manner However, the algorithms they used 
differ from ours Bruce and Wlebe (1998) made use 
of an EM algorithm ~la a latent class model to de- 
nse better sense classes Veroms (1998) performed 
a Muluple Correspondence Analyms on the table of 
annotatmns (a triple composed of a context, a judge 
and a sense) to reduce dlmensmnaht? follo~ed b? 
tree-clustering In contrast, our greedb search al- 
gorithm ~s a rumple but effecuve method that makes 
use of the Kappa statlsUC to search the space of pos- 
sible sense groupings dlrecdy 
9 Conclusion 
In th~s paper, we examined the tssue of rater- 
annotator agreement on word sense tagging and pre- 
sented a greedy search algorithm capable of gener- 
ating coarse sense classes based on the sense tags 
12 
Table 3 
Type Avg Num of senses A N P. Avg n 
Before 12 8 5,115 8,602 0 595 0 441 
After 5 6 8,042 8,602 0 935 0.852 
Inter-annotator agreement for 42 verbs before and after the collapse of senses 
Noun 
alr 
board 
body 
change 
Coarse senses 
wind/gas vs aura/atmosphere 
committee vs plank 
physmal/natural object vs group/coIlectLon 
modfficatmn vs corns 
country natron vs region/countryside 
course class vs action vs dlrectmn 
field land vs subject 
foot human body part ~s umt vs lower part/support 
force strength vs personnel 
hght 
matter 
party 
Table 4 
lumlnatlon vs perspectlve 
concern/~ssue vs substance 
pohtlcal party vs socml gathering vs group 
Coarse senses derived by the greedy search algorlthm 
assigned by two human annotators We found inter- 
esting groupings of word senses that correspond to 
human lntumve judgment of sense granularity 

References 
Rebecca Bruce and Janyce Wmbe 1998 Word- 
sense dlstmgmshabdlty and inter-coder agree- 
ment In Proceedings of the Thtrd Conference on 
Empzmcal Methods m Natural Language Process- mg 

Jean Carletta 1996 Assessing agreement on clas- 
sfficat~on tasks The kappa statlstm Computa- 
tzonal Lmgutstzcs, 22(2) 249-254 

J Cohen 1960 A coeffictent of agreement for nom- 
inal scales Educatwnal and Psychological Mea- 
surement, 20 37-46 

~.dam Kllgarnff 1998a Gold standard datasets for 
evaluating word sense d~sambiguatlon programs 
Computer Speech and Language 

Adam Kflgamff 1998b Inter-tagger agreement In 
Advanced Papers of the SENSEVAL Workshop 

Adam Kllgarrtff 1998c SENSEVAL An exercise m 
evaluating word sense d~samblguatlon programs 
In Proceedings o/LREC 

Mitchell P Marcus, Beatrice Santorml, and 
Mary Ann Marcmkmwlcz 1993 Building a large 
annotated corpus of enghsh The Penn Treebank 
Computatzonal Lmgutst~cs, 19(2) 313-330 

George A MlUer, Claudia Lea¢ock, Randee Tengl, 
and Ross T Bunker 1993 A semantm concor- 
dance In Proceedings o/the ARPA Human Lan- 
guage Technology Workshop, pages 303-308 

George A Mllter 1990 Wordnet An on-hne lexlcal 
database International Journal o/Lexicography, 
3(4) 235-312 

Hwee Tou Ng and Hlan Beng Lee 1996 Integrating 
multiple knowledge sources to dlsamblguate word 
sense An exemplar-based approach In Proceed- 
rags of the 34th Annual Meeting of the Assocza- 
twn for Computatzonal Lmguzstscs (ACL), pages 
40--47 

Hwee Tou Ng 1997 Exemplar-based word sense 
dlsambiguatlon Some recent ~mprovements In 
Proceedings o/ the Second Conference on Em- 
pw~cal Methods m Natural Language Processing, 
pages 208-213 

Jean Veroms 1998 ~ study of polysemy judge- 
meats and rater-annotator agreement In Ad- 
vanced Papers o/the SENSEVAL Workshop 
