On the Evaluation and Comparison of Taggers: the Effect of 
Noise in Testing Corpora. 
Lluís Padró and Lluís Màrquez
Dep. LSI. Technical University of Catalonia
c/ Jordi Girona 1-3. 08034 Barcelona
{padro,lluism}@lsi.upc.es
Abstract 
This paper addresses the issue of POS tagger 
evaluation. Such evaluation is usually per- 
formed by comparing the tagger output with 
a reference test corpus, which is assumed to be 
error-free. Currently used corpora contain noise, which causes the obtained performance to be a distortion of the real value. We analyze to what extent this distortion may invalidate the comparison between taggers or the measure of the improvement given by a new system. The main conclusion is that a more rigorous experimental setting is needed to reliably evaluate and compare tagger accuracies.
1 Introduction and Motivation 
Part of Speech (POS) Tagging is a quite well 
defined NLP problem, which consists of assign- 
ing to each word in a text the proper mor- 
phosyntactic tag for the given context. Al- 
though many words are ambiguous regarding 
their POS, in most cases they can be completely 
disambiguated taking into account an adequate 
context. Successful taggers have been built us- 
ing several approaches, such as statistical tech- 
niques, symbolic machine learning techniques, 
neural networks, etc. The accuracy reported by 
most current taggers ranges from 96-97% to al- 
most 100% in the linguistically-motivated Con- 
straint Grammar environment. 
Unfortunately, there have been very few di- 
rect comparisons of alternative taggers 1 on iden- 
tical test data. However, in most current papers 
it is argued that the performance of some tag- 
gers is better than others as a result of some 
kind of indirect comparisons between them. We think that there are a number of insufficiently controlled or considered factors that make these conclusions dubious in most cases.

1 One of the exceptions is the work by (Samuelsson and Voutilainen, 1997), in which a very strict comparison between taggers is performed.
In this direction, the present paper aims to 
point out some of the difficulties arising when 
evaluating and comparing tagger performances 
against a reference test corpus, and to criticize some common practices followed by NLP researchers in this respect.
The above mentioned factors can affect ei- 
ther the evaluation or the comparison process. 
Factors affecting the evaluation process are: (1) training and test experiments are usually performed over noisy corpora, which distorts the obtained results; (2) performance figures are too often calculated from only a single trial or a very small number of trials, though average results from multiple trials are crucial to obtain reliable estimations of accuracy (Mooney, 1996); (3) testing experiments are usually done on corpora with the same characteristics as the training data -usually a small fresh portion of the training corpus- but no serious attempts have been made to determine the reliability of the results when moving from one domain to another (Krovetz, 1997); and (4) no figures about computational effort -space/time complexity- are usually reported, even from an empirical per-
spective. A factor affecting the comparison process is that comparisons between taggers are
often indirect, while they should be compared 
under the same conditions in a multiple-trial 
experiment with statistical tests of significance. 
For these reasons, this paper calls for a discussion on POS tagger evaluation, aiming to establish a more rigorous experimental testing setting, indispensable for extracting reliable conclusions. As a starting point, we will
focus only on how the noise in the test corpus 
can affect the obtained results. 
2 Noise in the testing corpus 
From a machine learning perspective, the rele- 
vant noise in the corpus is that of non-systematically mistagged words (i.e. different annotations for words appearing in the same syntactic/semantic contexts).
Commonly used annotated corpora have 
noise. See, for instance, the following examples 
from the Wall Street Journal (WSJ) corpus:
Verb participle forms are sometimes tagged as 
such (VBN) and also as adjectives (JJ) in other 
sentences with no structural differences: 
1a) ... failing_VBG to_TO voluntarily_RB
submit_VB the_DT requested_VBN
information_NN ...
1b) ... a_DT large_JJ sample_NN of_IN
married_JJ women_NNS with_IN at_IN
least_JJS one_CD child_NN ...
Another structure not coherently tagged is noun chains in which the nouns (NN) are ambiguous and can also be adjectives (JJ):
2a) ... Mr._NNP Hahn_NNP ,_, the_DT
62-year-old_JJ chairman_NN and_CC
chief_NN executive_JJ officer_NN of_IN
Georgia-Pacific_NNP Corp._NNP ...
2b) ... Burger_NNP King_NNP
's_POS chief_JJ executive_NN officer_NN ,_,
Barry_NNP Gibbons_NNP ,_, stars_VBZ
in_IN ads_NNS saying_VBG ...
The noise in the test set produces a wrong estimation of the accuracy, since correct answers are counted as wrong and vice versa. In the following sections we will show how this uncertainty in the evaluation may be, in some cases, larger than the reported improvements from one system to another, thus invalidating the conclusions of the comparison.
3 Model Setting 
To study the appropriateness of the choices 
made by a POS tagger, a reference tagging must 
be selected and assumed to be correct in or- 
der to compare it with the tagger output. This 
is usually done by assuming that the disambiguated test corpus being used contains the right POS disambiguation. This approach is acceptable when the tagger error rate is much larger than the test corpus error rate; nevertheless, current POS taggers have reached a performance level that invalidates this assumption, since the tagger error rate is getting too close to the error rate of the test corpus.
Since we want to study the relationship be- 
tween the tagger error rate and the test corpus 
error rate, we have to establish an absolute ref- 
erence point. Although (Church, 1992) questions the concept of correct analysis, (Samuelsson and Voutilainen, 1997) establish that there exists a -statistically significant- absolute correct disambiguation, with respect to which the error rates of either the tagger or the test corpus can be computed. What we will focus on is how distorted the tagger error rate is when a noisy test corpus is used as a reference.
The cases we can find when evaluating the performance of a certain tagger are presented in table 1. OK/¬OK stand for a right/wrong tag (with respect to the absolute correct disambiguation). When both the tagger and the test corpus have the correct tag, the tag is correctly evaluated as right. When the test corpus has the correct tag and the tagger gets it wrong, the occurrence is correctly evaluated as wrong. But problems arise when the test corpus has a wrong tag: if the tagger gets it right, it is evaluated as wrong when it should be right (false negative); if the tagger gets it wrong, it will be rightly evaluated as wrong if the error committed by the tagger is other than the error in the test corpus, but wrongly evaluated as right (false positive) if the error is the same.
Table 1 shows the computation of the percentages of each case. The meanings of the variables used are:

  corpus   tagger   eval: right   eval: wrong
  OK_c     OK_t     (1-C)t        -
  OK_c     ¬OK_t    -             (1-C)(1-t)
  ¬OK_c    OK_t     -             Cu
  ¬OK_c    ¬OK_t    C(1-u)p       C(1-u)(1-p)

Table 1: Possible cases when evaluating a tagger.
C: Test corpus error rate. Usually an estima- 
tion is supplied with the corpus. 
t: Tagger performance rate on words rightly tagged in the test corpus. It can be seen as P(OK_t | OK_c).

u: Tagger performance rate on words wrongly tagged in the test corpus. It can be seen as P(OK_t | ¬OK_c).
p: Probability that the tagger makes the same 
error as the test corpus, given that both get 
a wrong tag. 
x: Real performance of the tagger, i.e. what 
would be obtained on an error-free test set. 
K: Observed performance of the tagger, com- 
puted on the noisy test corpus. 
For simplicity, we will consider only perfor- 
mance on ambiguous words. Considering unam- 
biguous words will make the analysis more com- 
plex, since it should be taken into account that 
neither the behaviour of the tagger (given by u, t, p) nor the errors in the test corpus (given by C) are the same on ambiguous and unambiguous
words. Nevertheless, this is an issue that must 
be further addressed. 
If we knew each of the above proportions, we would be able to compute the real performance of our tagger (x) by adding up the OK_t rows from table 1, i.e. the cases in which the tagger got the right disambiguation independently of the tagging of the test set:
  x = (1-C)t + Cu    (1)
The equation of the observed performance 
can also be extracted from table 1, adding up 
what is evaluated as right: 
  K = (1-C)t + C(1-u)p    (2)
The relationship between the real and the ob- 
served performance is derived from 1 and 2: 
  x = K - C(1-u)p + Cu
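
As a purely illustrative aid (a minimal Python sketch of our own, with assumed function names, not part of the original formulation), equations 1 and 2 and the relation above can be checked numerically:

    # Illustrative sketch: real vs. observed performance (eqs. 1 and 2).
    def real_performance(t, u, C):
        # Equation 1: x = (1-C)t + Cu
        return (1 - C) * t + C * u

    def observed_performance(t, u, p, C):
        # Equation 2: K = (1-C)t + C(1-u)p
        return (1 - C) * t + C * (1 - u) * p

    # Hypothetical values: 3% corpus noise, t = 0.95, u = 0.4, p = 0.5.
    C, t, u, p = 0.03, 0.95, 0.40, 0.50
    K = observed_performance(t, u, p, C)
    x = real_performance(t, u, C)
    # The derived relation x = K - C(1-u)p + Cu must agree with eq. 1:
    assert abs(x - (K - C * (1 - u) * p + C * u)) < 1e-12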
Since only K and C are known (or approximately estimated), we cannot compute the real performance of the tagger. All we can do is establish some reasonable bounds for t, u and p, and see in which range x lies.
Since all variables are probabilities, they are bounded in [0, 1]. We can also assume2 that K > C. We can use these constraints and the above equations to bound the values of all variables. From 2, we obtain:

  u = 1 - [K - t(1-C)] / (Cp),   p = [K - t(1-C)] / [C(1-u)],   t = [K - C(1-u)p] / (1-C)

2 In the cases we are interested in -that is, current systems- the tagger observed performance, K, is over 90%, while the corpus error rate, C, is below 10%.
Thus, u will be maximum when p and t are 
maximum (i.e. 1). This gives an upper bound 
for u of (1-K)/C. When t = 0, u will range in (-∞, 1-K/C] depending on the value of p. Since we are assuming K > C, the most informative lower bound for u remains zero. Similarly, p is minimum when t = 1 and u = 0. When t = 0 the value for p will range in [K/C, +∞) depending on u. Since K > C, the most informative upper bound for p is still 1. Finally, t will be maximum when u = 1 and p = 0, and minimum when u = 0 and p = 1. Summarizing:

  0 ≤ u ≤ min{1, (1-K)/C}    (3)

  max{0, (K+C-1)/C} ≤ p ≤ 1    (4)

  (K-C)/(1-C) ≤ t ≤ min{1, K/(1-C)}    (5)
Since the values of the variables are mutually constrained, it is not possible that, for instance, u and t simultaneously reach their upper bound values (if (1-K)/C < 1 then K/(1-C) > 1 and vice versa). Any bound which is out of [0, 1] is not informative, and the appropriate boundary, 0 or 1, is then used. Note that the lower bound for t will never be negative under the assumption K > C.
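
For illustration, these bounds are straightforward to compute; the following sketch (ours, with assumed names) implements equations 3-5, with uninformative values clipped to [0, 1] as just described:

    # Illustrative sketch: bounds for u, p and t (eqs. 3-5), given the
    # observed performance K and the corpus error rate C (assumes K > C).
    def parameter_bounds(K, C):
        u_bounds = (0.0, min(1.0, (1 - K) / C))
        p_bounds = (max(0.0, (K + C - 1) / C), 1.0)
        t_bounds = ((K - C) / (1 - C), min(1.0, K / (1 - C)))
        return u_bounds, p_bounds, t_bounds

    # With C = 0.03 and K = 0.93: u in [0, 1], p in [0, 1],
    # and t in [0.928, 0.959] (the figures used in section 4).
    print(parameter_bounds(0.93, 0.03))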
Once we have established these bounds, we can use equation 1 to compute the range for the real performance value of our tagger: x will be minimum when u and t are minimum, which produces the following bounds:

  x_min = K - Cp    (6)

  x_max = K + C           if K ≤ 1-C
  x_max = 1 - (K+C-1)p    if K > 1-C    (7)
As an example, let us suppose we evaluate a tagger on a test corpus which is known to contain about 3% of errors (C = 0.03), and obtain a reported performance of 93%3 (K = 0.93). In this case, equations 6 and 7 yield a range for the real performance x that varies from [0.93, 0.96] when p = 0 to [0.90, 0.96] when p = 1.
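
The same computation, as an illustrative sketch of our own (names assumed), implementing equations 6 and 7:

    # Illustrative sketch: range of the real performance x (eqs. 6 and 7).
    def real_performance_range(K, C, p):
        x_min = K - C * p                                     # equation 6
        x_max = K + C if K <= 1 - C else 1 - (K + C - 1) * p  # equation 7
        return x_min, x_max

    print(real_performance_range(0.93, 0.03, p=0.0))  # -> (0.93, 0.96)
    print(real_performance_range(0.93, 0.03, p=1.0))  # -> (0.90, 0.96)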
These results suggest that although we observe a performance of K, we cannot be sure of how well our tagger is performing without taking into account the values of t, u and p.
It is also obvious that the intervals in the 
above example are too wide, since they con- 
sider all the possible parameter values, even 
when they correspond to very unlikely param- 
~This is a realistic case obtained by (M£rquez and 
Padr6 , 1997) tagger. Note that 93% is the accuracy on 
ambiguous words (the equivalent overall accuracy was 
about 97%). 
999 
eter combinations 4. In section 4 we will try to 
narrow those intervals, limiting the possibilities 
to reasonable cases. 
4 Reasonable Bounds for the Basic 
Parameters 
In real cases, not all parameter combinations 
will be equally likely. In addition, the bounds 
for the values of t, u and p are closely related 
to the similarities between the training and test 
corpora. That is, if the training and test sets are 
extracted from the same corpus, they will prob- 
ably contain the same kind of errors in the same 
kind of situations. This may cause the training 
procedure to learn the errors -especially if they 
are systematic- and thus the resulting tagger 
will tend to make the same errors that appear 
in the test set. On the contrary, if the train- 
ing and test sets come from different sources 
-sharing only the tag set- the behaviour of the 
resulting tagger will not depend on the right or 
wrong tagging of the test set. 
We can try to establish narrower bounds for 
the parameters than those obtained in section 3. 
First of all, the value of t is already constrained enough, due to its high contribution (1-C) to the value of K, which forces t to take a value close to K. For instance, applying the boundaries in equation 5 to the case C = 0.03 and K = 0.93, we obtain that t belongs to [0.928, 0.959].
The range for u can be slightly narrowed considering the following: in the case of independent test and training corpora, u will tend to be equal to t. Otherwise, the more biased towards the corpus errors the language model is, the lower u will be. Note that u > t would mean that the tagger disambiguates the noisy cases better than the correct ones. Concerning the lower bound, only in the case that all the errors in the training and test corpus were systematic (and thus could be learned) could u reach zero. However, not only is this an unlikely situation, but it would also require a perfect-learning tagger. It seems more reasonable that, in normal cases, errors will be random, and the tagger will behave randomly on the noisy occurrences. This yields a lower bound for u of 1/a, where a is the average ambiguity ratio for ambiguous words.

The reasonable bounds for u are thus

  1/a ≤ u ≤ min{t, (1-K)/C}
Finally, the value of p has similar constraints to those of u. If the test and training corpora are independent, the probability of making the same error, given that both are wrong, will be the random 1/(a-1). If the corpora are not independent, the errors that can be learned by the tagger will cause p to rise up to (potentially) 1. Again, only in the case that all errors were systematic could p reach 1.

Then, the reasonable bounds for p are:

  max{1/(a-1), (K+C-1)/C} ≤ p ≤ 1
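
Again as a sketch of our own (names assumed; the upper bound of t from equation 5 is used as a computable proxy for the unknown t), the reasonable bounds can be written as:

    # Illustrative sketch: "reasonable" bounds for u and p, given K, C and
    # the average ambiguity ratio a of ambiguous words (random-error view).
    def reasonable_parameter_bounds(K, C, a):
        t_max = min(1.0, K / (1 - C))        # upper bound of t (eq. 5)
        u_bounds = (1 / a, min(t_max, (1 - K) / C))
        p_bounds = (max(1 / (a - 1), (K + C - 1) / C), 1.0)
        return u_bounds, p_bounds

    # With C = 0.03, K = 0.93 and a = 2.5:
    # u in [0.4, 0.959] and p in [0.667, 1.0].
    print(reasonable_parameter_bounds(0.93, 0.03, a=2.5))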
5 On Comparing Tagger
Performances 
As stated above, knowing the reasonable limits for the u, p and t parameters enables us to compute the range in which the real performance of the tagger can vary.
So, given two different taggers T1 and T2, and provided we know the values of the test corpus error rates and the observed performances in both cases (C1, C2, K1, K2), we can compare them by matching the reasonable intervals for the respective real performances x1 and x2.

From a conservative position, we cannot strongly state that one of the taggers performs better than the other when the two intervals overlap, since this implies a chance that the real performances of both taggers are the same.
The following real example has been extracted from (Màrquez and Padró, 1997): the tagger T1 uses only bigram information and has an observed performance on ambiguous words of K1 = 0.9135 (96.86% overall). The tagger T2 uses trigrams and automatically acquired context constraints and has an accuracy of K2 = 0.9282 (97.39% overall). Both taggers have been evaluated on a corpus (WSJ) with an estimated error rate5 C1 = C2 = 0.03. The average ambiguity ratio of the ambiguous words in the corpus is a = 2.5 tags/word.

5 The WSJ corpus error rate is estimated over all words. We are assuming that the errors distribute uniformly among all words, although ambiguous words probably have a higher error rate. Nevertheless, a higher value for C would cause the intervals to be wider and to overlap even more.
These data yield the following reasonable intervals for the real performance of the taggers:

  for p_i = 1/a = 0.4:   x1 ∈ [91.35, 94.05]   x2 ∈ [92.82, 95.60]
  for p_i = 1:           x1 ∈ [90.75, 93.99]   x2 ∈ [92.22, 95.55]
The same information is shown in figure 1, which presents the reasonable accuracy intervals for both taggers, for p ranging from 1/a = 0.4 to 1 (the shaded part corresponds to the overlapping region between intervals).
[Figure 1: Reasonable intervals for both taggers (x-axis: % accuracy, 90 to 95)]
The obtained intervals have a large overlap region, which implies that there are reasonable parameter combinations that could cause the taggers to produce different observed performances even though their real accuracies were very similar. From this conservative approach, we would not be able to conclude that the tagger T2 is better than T1, even though the 95% confidence intervals for the observed performances did allow us to do so.
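
To make the overlap criterion concrete, here is a sketch of our own (not the authors' code; the interval construction is an assumption that reproduces the p_i = 1 row above by fixing u at its reasonable minimum 1/a for the lower bound, and setting u = t, whence x = t, for the upper bound):

    # Illustrative sketch: reasonable real-performance intervals for two
    # taggers and the conservative overlap test of section 5.
    def reasonable_interval(K, C, a, p):
        u_min = 1 / a                          # random behaviour on noise
        x_min = K - C * (1 - u_min) * p + C * u_min
        x_max = (K - C * p) / (1 - C - C * p)  # from eq. 2 with u = t
        return x_min, x_max

    def intervals_overlap(i1, i2):
        return max(i1[0], i2[0]) <= min(i1[1], i2[1])

    x1 = reasonable_interval(K=0.9135, C=0.03, a=2.5, p=1.0)  # ~(0.9075, 0.9399)
    x2 = reasonable_interval(K=0.9282, C=0.03, a=2.5, p=1.0)  # ~(0.9222, 0.9555)
    print(intervals_overlap(x1, x2))  # True: no conservative conclusion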
6 Discussion 
The presented analysis of the effects of noise in the test corpus on the evaluation of POS taggers leads us to conclude that when a tagger is evaluated as better than another using a noisy test corpus, there are reasonable chances that they are in fact very similar, but one of them is just adapting better than the other to the noise in the corpus.
We believe that the widespread practice of 
evaluating taggers against a noisy test corpus 
has reached its limit, since the performance of 
current taggers is getting too close to the error 
rate usually found in test corpora. 
An obvious solution -and maybe not as costly 
as one might think, since small test sets properly 
used may yield enough statistical evidence- is 
using only error-free test corpora. Another pos- 
sibility is to further study the influence of noise 
in order to establish a criterion -e.g. a thresh- 
old depending on the amount of overlapping be- 
tween intervals- to decide whether a given tag- 
ger can be considered better than another. 
There is still much to be done in this direc- 
tion. This paper does not intend to establish 
a new evaluation method for POS tagging, but 
to point out that there are some issues -such as the noise in the test corpus- that have received little attention and are more important than they seem.
Some of the issues that should be further con- 
sidered are: the effect of noise on unambigu-
ous words; the reasonable intervals for overall 
real performance; the -probably- different val- 
ues of C, p, u and t for ambiguous/unambiguous 
words; how to estimate the parameter values of 
the evaluated tagger in order to constrain as 
much as possible the intervals; the statistical 
significance of the interval overlappings; a more 
informed (and less conservative) criterion to re- 
ject/accept the hypothesis that both taggers are 
different, etc. 

References 
Church, K.W. 1992. Current Practice in Part of Speech Tagging and Suggestions for the Future. In Simmons (ed.), Sbornik praci: In Honor of Henry Kučera. Michigan Slavic Studies.
Krovetz, R. 1997. Homonymy and Polysemy in Information Retrieval. In Proceedings of the joint ACL/EACL meeting.
Màrquez, L. and Padró, L. 1997. A Flexible POS Tagger Using an Automatically Acquired Language Model. In Proceedings of the joint ACL/EACL meeting.
Mooney, R.J. 1996. Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning. In Proceedings of the EMNLP'96 conference.
Samuelsson, C. and Voutilainen, A. 1997. Comparing a Linguistic and a Stochastic Tagger. In Proceedings of the joint ACL/EACL meeting.
