Proceedings of EACL '99 
An experiment on the upper bound of interjudge agreement: 
the case of tagging 
Atro Voutilainen 
Research Unit for Multilingual Language Technology 
P.O. Box 4 
FIN-00014 University of Helsinki 
Finland 
Atro.Voutilainen@ling.Helsinki.FI 
Abstract 
We investigate the controversial issue 
about the upper bound of interjudge 
agreement in the use of a low-level 
grammatical representation. Pessimistic 
views suggest that several percent of 
words in running text are undecidable in 
terms of part-of-speech categories. Our 
experiments with 55kW data give rea- 
son for optimism: linguists with only 30 
hours' training apply the EngCG-2 mor- 
phological tags with almost 100% inter- 
judge agreement. 
1 Orientation 
Linguistic analysers are developed for assign- 
ing linguistic descriptions to linguistic utterances. 
Linguistic descriptions are based on a fixed inven- 
tory of descriptors plus their usage principles: in 
short, a grammatical representation specified by 
linguists for the specific kind of analysis - e.g. 
morphological analysis, tagging, syntax, discourse 
structure - that the program should perform. 
Because automatic linguistic analysis generally 
is a very di~cult problem, various methods for 
evaluating their success have been used. One such 
is based on the degree of correctness of the analysis 
provided, e.g. the percentage of linguistic tokens 
in the text analysed that receives the appropriate 
description relative to analyses provided indepen- 
dently of the program by competent linguists ide- 
ally not involved in the development of the anal- 
yser itself. 
Now use of benchmark corpora like this turns 
out to be problematic because arguments have 
been made to the effect that linguists themselves 
make erroneous and inconsistent analyses. Unin- 
tentional mistakes due e.g. to slips of attention 
are obviously unavoidable, but these errors can 
largely be identified by the double-blind method: 
first by having two (or more) linguists analyse the 
same text independently by using the same gram- 
matical representation, and then identifying dif- 
ferences of analysis by automatically comparing 
the analysed text versions with each other and fi- 
nally having the linguists discuss the di~erences 
and modify the resulting benchmark corpus ac- 
cordingly. Clerical errors should be easily (i.e. 
consensuaUy) identified as such, hut, perhaps sur- 
prisingly, many attested differences do not belong 
to this category. Opinions may genuinely differ 
about which of the competing analyses is the cor- 
rect one, i.e. sometimes the grammatical repre- 
sentation is used inconsistently. In short, linguis- 
tic 'truth' seems to be uncertain in many cases. 
Evaluating - or even developing - linguistic anal- 
ysers seems to be on uncertain ground if the goal 
of these analysers cannot be satisfactorily speci- 
fied. 
Arguments concerning the magnitude of this 
problem have been made especially in relation to 
tagging, the attempt to automatically assign lex- 
ically and contextually correct morphological de- 
scriptors (tags) to words. A pessimistic view is 
taken by Church (1992) who argues that even af- 
ter negotiations of the kind described above, no 
consensus can be reached about the correct anal- 
ysis of several percent of all word tokens in the 
text. A more mixed view on the matter is taken 
by Marcus et al. (1993) who on the one hand note 
that in one experiment moderately trained human 
text annotators made different analyses even after 
negotiations in over 3% of all words, and on the 
other hand argue that an expert can do much bet- 
ter. 
An optimistic view on the matter has been pre- 
sented by Eyes and Leech (1993). Empirical ev- 
idence for a high agreement rate is reported by 
Voutilainen and J~rvinen (1995). Their results 
suggest that at least with one grammatical repre- 
sentation, namely the ENGCG tag set (cf. Karls- 
son et al., eds., 1995), a 100% consistency can be 
204 
Proceedings of EACL '99 
reached after negotiations at the level of parts of 
speech (or morphology in this case). In short, rea- 
sonable evidence has been given for the position 
that at least some tag sets can be applied consis- 
tently, i.e. earlier observations about potentially 
more problematic tag sets should not be taken as 
predictions about all tag sets. 
1.1 Open questions 
Admittedly Voutilainen and J~xvinen's experi- 
ment provides evidence for the possibility that 
two highly experienced linguists, one of them a 
developer of the ENGCG tag set, can apply the 
tag set consistently, at least when compared with 
each others' performance. However, the practical 
significance of their result seems questionable, for 
two reasons. 
Firstly, large-scale corpus annotation by hand 
is generally a work that is carried out by less ex- 
perienced linguists, quite typically advanced stu- 
dents hired as project workers. Voutilainen and 
Jiirvinen's experiment leaves open the question, 
how consistently the ENGCG tag set can be ap- 
plied by a less experienced annotator. 
Secondly, consider the question of tagger evalu- 
ation. Because tagger developers presumably tend 
to learn, perhaps partly subconsciously, much 
about the behaviour, desired or otherwise, of the 
tagger, it may well be that if the developers also 
annotate the benchmark corpus used for evaluat- 
ing the tagger, some of the tagger's misanalyses 
remain undetected because the tagger developers, 
due to their subconscious mimicking of their tag- 
ger, make the same misanalyses when annotating 
the benchmark corpus. So 100% tagging consis- 
tency in the benchmark corpus alone does not nec- 
essarily suffice for getting an objective view of the 
tagger's performance. Subconscious 'bad' habits 
of this type need to be factored out. One way to do 
this is having the benchmark corpus consistently 
(i.e. with approximately 100% consensus about 
the correct analysis) analysed by people with no 
familiarity with the tagger's behaviour in differ- 
ent situations - provided this is possible in the 
first place. 
Another two minor questions left open by Vou- 
tilainen and Jiirvinen concern the (i) typology of 
the differences and (ii) the reliability of their ex- 
periment. 
Concerning the typology of the differences: in 
Voutilainen and J~irvinen's experiment the lin- 
guists negotiated about an initial difference, al- 
most one per cent of all words in the texts. 
Though they finally agreed about the correct anal- 
ysis in almost all these differences, with a slight 
improvement in the experimental setting a clear 
categorisation of the initial differences into un- 
intentional mistakes and other, more interesting 
types, could have been made. 
Secondly, the texts used in Voutilainen and 
J~vinen's experiment comprised only about 6,000 
words. This is probably enough to give a general 
indication of the nature of the analysis task with 
the ENGCG tag set, but a larger data would in- 
crease the reliability of the experiment. 
In this paper, we address all these three clues-' 
tions. Two young linguists 1 with no background 
in ENGCG tagging were hired for making an elab- 
orated version of the Voutilainen and J~vinen ex- 
periment with a considerably larger corpus. 
The rest of this paper is structured as follows. 
Next, the ENGCG tag set is described in outline. 
Then the training of the new linguists is described, 
as well as the test data and experimental setting. 
Finally, the results are presented. 
2 ENGCG tag set 
Descriptions of the morphological tags used by 
the English Constraint Grammar tagger are avail- 
able in several publications. Brief descriptions can 
be found in several recent ACL conference pro- 
ceedings by Voutilainen and his colleagues (e.g. 
EACL93, ANLP94, EACL95, ANLP97, ACL- 
EACL97). An in-depth description is given in 
Karlsson et al., eds., 1995 (chapters 3-6). Here, 
only a brief sample is given. 
ENGCG tagging is a two-phase process. First, 
a lexical analyser assigns one or more alternative 
analyses to each word. The following is a mor- 
phological analysis of the sentence The raids were 
coordinated under a recently expanded federal pro- 
gram: 
"<The>" 
"the" <Def> DET CENTRAL ART SG/PL 
"<raids>" 
"raid" <Count> N NOM PL 
"raid" <SVO> V PRES SG3 
"<were>" 
"be" <SVC/A> <SVC/N> V PAST 
"<coordinated>" 
"coordinate" <SVO> EN 
"coordinate" <SVO> V PAST 
"<under>" 
"under" ADV ADVL 
"under" PREP 
"under" <Attr> A ABS 
"<a>" 
"a" ABBR NOM SG 
"a" <Indef> DET CENTP~L ART SG 
1Ms. Pirkko Paljakl~ and Mr. Markku Lappalainen 
205 
Proceedings of EACL '99 
"<re cent ly>" 
"recent" <DER:Iy> ADV 
"<expanded>" 
"expand" <SV0> <P/on> EN 
"expand" <SV0> <P/on> V PAST 
"<f ede ral>" 
"federal" A ABS 
- <program>. 
"program" N N0M SG 
"program" <SV0> V PRES -SG3 
"program" <SV0> V INF 
"program" <SV0> V IMP 
"program" <SV0> V SUBJUNCTIVE 
,,<. >. 
Each indented line constitutes one morphologi- 
cal analysis. Thus program is five-ways ambiguous 
after ENGCG morphology. The disambiguation 
part of the ENGCG tagger ~ then removes those 
alternative analyses that are contextually illegit- 
imate according to the tagger's hand-coded con- 
straint rules (Voutilainen 1995). The remai-~ng 
analyses constitute the output of the tagger, in 
this case: 
"<The >" 
"the" <Def> DET CENTRAL ART SG/PL 
"<raids>" 
"raid" <Count> N N0M PL 
"<were>" 
"be" <SYC/A> <SVC/N> Y PAST 
"<coordinated>" 
"coordinate" <SV0> EN 
"<under>" 
"under" PREP 
"<a>" 
"a" <Indef> DET CENTRAL ART SG 
"<recently>" 
"recent" <DER:Iy> ADV 
"<expanded>" 
"expand" <SV0> <P/on> EN 
"<federal>" 
"federal" A ABS 
"<program>" 
"program" N N0M SG. 
.<. >,, 
Overall, this tag set represents about 180 differ- 
ent analyses when certain optional auxiliary tags 
(e.g. verb subcategorisation tags) are ignored. 
3 Preparations for the experiment 
3.1 Experimental setting 
The experiment was conducted as follows. 
2A new version of the tagger, known as EngCG-2, 
can be studied and tested at http://www.conexor.fi. 
1. The text was morphologically analysed us- 
ing the ENGCG morphological analyser. For 
the analysis of unrecognlsed words, we used 
a rule-based heuristic component that assigns 
morphological analyses, one or more, to each 
word not represented in the lexicon of the sys- 
tem. Of the analysed text, two identical ver- 
sions were made, one for each linguist. 
2. Two linguists trained to disambiguate the 
ENGCG morphological representation (see 
the subsection on training below) indepen- 
dently marked the correct alternative anal- 
yses in the ambiguous input, using mainly 
structural, but in some structurally unresolv- 
able cases also higher-level, information. The 
corpora consisted of continuous text rather 
than isolated sentences; this made the use 
of textual knowledge possible in the selection 
of the correct alternative. In the rare cases 
where two analyses were regarded as equally 
legitimate, both could be marked. The judges 
were encouraged to consult the documenta- 
tion of the grammatical representation. In 
addition, both linguists were provided with a 
checking program to be used after the text 
was analysed. The program identifies words 
left without an analysis, in which case the 
linguist was to provide the m~.~sing analysis. 
3. These analysed versions of the same text were 
compared to each other using the Unix sdiff 
program. For each corpus version, words with 
a different analysis were marked with a "RE- 
CONSIDER" symbol. The "RECONSIDER" 
symbol was also added to a number of other 
ambiguous words in the corpus. These addi- 
tional words were marked in order to 'force' 
each linguist to think independently about 
the correct analysis, i.e. to prevent the emer- 
gence of the situation where one linguist con- 
siders the other to be always right (or wrong) 
and so 'reconsiders' only in terms of the ex- 
isting analysis. The linguists were told that 
some of the words marked with the "RECON- 
SIDER" symbol were analysed differently by 
them. 
4. Statistics were generated about the num- 
ber of differing analyses (number of "RE- 
CONSIDER" symbols) in the corpus versions 
("diffl" in the following table). 
5. The reanalysed versions were automatically 
compared to each other. To words with a 
different analysis, a "NEGOTIATE" symbol 
was added. 
206 
Proceedings of EACL '99 
6. Statistics were generated about the num- 
ber of differing analyses (number of "NE- 
GOTIATE" symbols) in the corpus versions 
("diff2" in the following table). 
7. The remaining differences in the analyses 
were jointly examined by the linguists in or- 
der to see whether they were due to (i) inat- 
tention on the part of one linguist (as a result 
of which a correct unique analysis was jointly 
agreed upon), (ii) joint uncertainty about the 
correct analysis (both linguists feel unsure 
about the correct analysis), or (iii) conflict- 
ing opinions about the correct analysis (both 
linguists have a strong but different opinion 
about the correct analysis). 
8. Statistics were generated about the number 
of conflicting opinions ("dill3" below) and 
joint uncertainty ("unsure" below). 
This routine was successively applied to each 
text. 
3.2 Training 
Two people were hired for the experiment. One 
had recently completed a Master's degree from 
English Philology. The other was an advanced un- 
dergraduate student majoring in English Philol- 
ogy. Neither of them were familiar with the 
ENGCG tagger. 
All available documentation about the linguistic 
representation used by ENGCG was made avail- 
able to them. The chief source was chapters 3-6 
in Karlsson et al. (eds., 1995). Because the lin- 
guistic solutions in ENGCG are largely based on 
the comprehensive descriptive grammar by Quirk 
et al. (1985), also that work was made available 
to them, as well as a number of modern English 
dictionaries. 
The training was based on the disambiguation 
of ten smallish text extracts. Each of the extracts 
was first analysed by the ENGCG morphological 
analyser, and then each trainee was to indepen- 
dently perform Step 3 (see the previous subsec- 
tion) on it. The disambiguated text was then au- 
tomatically compared to another version of the 
same extract that was disambiguated by an expert 
on ENGCG. The ENGCG expert then discussed 
the analytic differences with the trainee who had 
also disambiguated the text and explained why 
the expert's analysis was correct (almost always 
by identifying a relevant section in the available 
ENGCG documentation; in very rare cases where 
the documentation was underspecific, new docu- 
mentation was created for future use in the exper- 
iments). 
After analysis and subsequent consultation with 
the ENGCG expert, the trainee processed the fob 
lowing sample. 
The training lasted about 30 hours. It was con- 
cluded by familiarising the linguists with the rou- 
tine used in the experiment. 
3.3 Test corpus 
Four texts were used in the experiment, to- 
tailing 55724 words and 102527 morphologi- 
cal analyses (an average of 1.84 analyses per 
word). One was an article about Japanese 
culture ('Pop'); one concerned patents ('Pat'); 
one contained excerpts from the law of Cali- 
fornia; one was a medical text ('Med'). None 
of them had been used in the development of 
the ENGCG grammatical representation or other 
parts of the system. By mid-June 1999, a sam- 
ple of this data will be available for inspection 
at http://www.ling.helsinki.fi/ voutilai/eac199- 
data.html. 
4 Results and discussion 
The following table presents the main findings. 
Figure 1: .Results from a human annotation task. 
\[ ,oo,,as\[ aiffl 
Pop 14861 188/1.3% 
Pat 13183 92/.7% 
Law 15495 107/.7% 
ivied 12185 126/1.0% 
ALL 55724 513/.9% 
I diff~ I diff3 
~'ml\[ 11/.1% 2/.0% 
'u.nsu're 
4/.0% 
1/.o% 
18/.1% 10/.1% 0 
39/.3% 1/.0% 9/.1% 
112/.2% 13/.0% 14/.0% 
It is interesting to note how high the agree- 
ment between the linguists is even before the first 
negotiations (99.80% of all words are analysed 
identically). Of the remaining differences, most, 
somewhat disappointingly, turned out to be clas- 
sifted as 'slips of attention'; upon inspection they 
seemed to contain little linguistic interest. Espe- 
cially one of the linguists admitted that most of 
the job seemed too much of a routine to keep one 
mentally alert enough. The number of genuine 
conflicts of opinion were much in line with obser- 
vations by Voutilainen and J~irvinen. However, 
the negotiations were not altogether easy, consid- 
ering that in all they took almost nine hours. Pre- 
sumably uncertain analyses and conflicts of opin- 
ion were not easily passed by. 
The main finding of this experiment is that 
basically Voutilainen and J~vinen's observations 
about the high specifiability and consistent usabil- 
ity of the ENGCG morphological tag set seem to 
be extendable to new users of the tag set. In 
207 
Proceedings of EACL '99 
other words, the reputedly surface-syntactic tag 
set seems to be learnable as well. Overall, the ex- 
periment reported here provides evidence for the 
optimistic position about the specifiability of at 
least certain kinds of linguistic representations. 
It remains for future research, perhaps as a col- 
laboration between teams working with different 
tag sets, to find out, what exactly are the prop- 
erties that make some linguistic representations 
consistently learnable and usable, and others less 
SO. 
Acknowledgments 
I am grateful to anonymous EACL99 referees for 
useful comments. 
References 
Kenneth W. Church 1992. Current Practice in 
Part of Speech Tagging and Suggestions for the 
Future. In Simmons (ed.), Sbornik praci: In 
Honor of Henry Kucera, Michigan Slavic Studies. 
Michigan. 13-48. 
Elizabeth Eyes and Geoffrey Leech 1993. Syn- 
tactic Annotation: Linguistic Aspects of Gram- 
matical Tagging and Skeleton Parsing. In Ezra 
Black, Roger Garside and Geoffrey Leech (eds.) 
1993. Statistically-Driven Computer Grammars 
of English: The IBM/Lancaster Approach. Am- 
sterdam and Atlanta: Rodopi. 36-61. 
Fred Karlsson, Atro Voutilainen, Juha Heil~kil~i 
and A.rto Anttila (eds.) 1995. Constraint Gram- 
mar. A Language-Independent System for Pars- 
ing Unrestricted Tezt. Berlin and New York: 
Mouton de Gruyter. 
Mitchell Marcus, Beatrice Santorini and Mary 
Ann Marcinkiewicz 1993. Building a Large An- 
notated Corpus of English: The Penn Treebank. 
Computational Linguistics 19:2. 313-330. 
Randolph Quirk, Sidney Greenbaum, Jan 
Svartvik and Geoffrey Leech 1985. A Comprehen- 
sive Grammar of the English Language. Longman. 
Atro Voutilainen 1995. Morphological disam- 
biguation. In Karlsson et al., eds. 
Atro Voutilainen and Timo J~vinen 1995. 
Specifying a shallow grammatical representation 
for parsing purposes. In Proceedings of the Sev- 
enth Conference of the European Chapter of the 
Association for Computational Linguistics. ACL. 
208 
