Handling noisy training and testing data
Don Blaheta
Department of Computer Science
Brown University
dpb@cs.brown.edu
Abstract
In the  eld of empirical natural language
processing, researchers constantly deal with
large amounts of marked-up data; whether
the markup is done by the researcher or
someone else, human nature dictates that it
will have errors in it. This paper will more
fully characterise the problem and discuss
whether and when (and how) to correct the
errors. The discussion is illustrated with
speci c examples involving function tagging
in the Penn treebank.
1 Introduction: Errors
Nobody’s perfect. A clich e, but in the  eld of empir-
ical natural language processing, we know it to be
true: on a daily basis, we work with large corpora
created by, and often marked up by, humans. Falli-
ble as ever, these humans have made errors. For the
errors in content, be they spelling, syntax, or some-
thing else, we can hope to build more robust systems
that will be able to handle them. But what of the
errors in markup?
In this paper, we propose a system for cataloguing
corpus errors, and discuss some strategies for dealing
with them as a research community. Finally, we will
present an example (function tagging) that demon-
strates the appropriateness of our methods.
2 An error taxonomy
2.1 Type A: Detectable errors
The easiest errors, which we have dubbed \Type A",
are those that can be automatically detected and
 xed. These typically come up when there would
be multiple reasonable ways of tagging a certain in-
teresting situation: the markup guidelines arbitrarily
choose one, and the human annotator unthinkingly
uses the other.
 
 hh
h
 
 a13
 
 X
X
 
 X
X
 
 9
the dealersby
NP
...should go here
This LGS tag...
VBN PP-LGS
INstarted
VP
Figure 1: A function tag error of Type A
The canonical example of this sort of thing is the
treebank’s LGS tag, representing the \logical sub-
ject" of a passive construction. It makes a great
deal of sense to put this tag on the NP object of
the ‘by’ construction; it makes almost as much sense
to tag the PP itself, especially since (given a choice)
most other function tags are put there. The tree-
bank guidelines speci cally choose the former: \It
attaches to the NP object of by and not to the PP
node itself." (Bies et al., 1995) Nevertheless, in sev-
eral cases the annotators put the tag on the PP,as
shown in Figure 1. We can automatically correct this
error by algorithmically removing the LGS tag from
any such PP and adding it to the object thereof.
The unifying feature of all Type A errors is that
the annotator’s intent is still clear. In the LGS case,
the annotator managed to clearly indicate the pres-
ence of a passive construction and its logical subject.
Since the transformation from what was marked to
what ought to have been marked is straightforward
and algorithmic, we can easily apply this correction
to all data.
2.2 Type B: Fixable errors
Next, we come to the Type B errors, those which
are  xable but require human intervention at some
point in the process. In theory, this category could
include errors that could be found automatically but
require a human to  x; this doesn’t happen in prac-
tice, because if an error is su ciently systematic that
                                            Association for Computational Linguistics.
                    Language Processing (EMNLP), Philadelphia, July 2002, pp. 111-116.
                         Proceedings of the Conference on Empirical Methods in Natural
 
 
 ‘
‘
‘
‘
(
(
(h
h
h
 
 
 
 
 P
P
VP
ADVP
hard
NP
company
NP
Mistag|should be VBN
VBD
hit
PP
by ...
Figure 2: A part-of-speech error of Type B
1
an algorithm can detect it and be certain that it is
in fact an error, it can usually be corrected with cer-
tainty as well. In practice, the instances of this class
of error are all cases where the computer can’t detect
the error for certain. However, for all Type B errors,
once detected, the correction that needs to be made
is clear, at least to a human observer with access to
the annotation guidelines.
Certain Type B errors are moderately easy to
 nd. When annotators misunderstand a complicated
markup guideline, they mismark in a somewhat pre-
dictable way. While not being totally systematically
detectable, an algorithm can leverage these patterns
to extract a list of tags or parses that might be incor-
rect, which a human can then examine. Some errors
of this type (henceforth \Type B
1
") include:
 VBD / VBN. Often the past tense form of a verb
(VBD) and its past participle (VBN)havethe
same form, and thus annotators sometimes mis-
take one for the other, as in Figure 2. Some such
cases are not detectable, which is why this is not
Type A.
1
 IN / RB / RP. There are speci c tests and guide-
lines for telling these three things apart, but fre-
quently a preposition (IN) is marked when an
adverb (RB) or particle (PRT) would be more
appropriate. If an IN is occurring somewhere
other than under a PP, it is likely to be a mistag.
Occasionally, an extracted list of maybe-errors will
be \perfect", containing only instances that are ac-
tually corpus errors. This happens when the pat-
tern is a very good heuristic, though not necessarily
valid (which is why the errors are Type B
1
, and not
Type A). When  ling corrections for these, it is still
best to annotate them individually, as the correc-
tions may later be applied to an expanded or modi-
1
There is a subclass of this error which is Type A:
when we  nd a VBD whose grandparent is a VP headed
by a form of ‘have’, we can deterministically retag it as
VBN.
 ed data set, for which the heuristic would no longer
be perfect.
Other  xable errors are pretty much isolated.
Within section 24 of the treebank, for instance, we
have:
 the word ‘long’ tagged as an adjective (JJ)when
clearly used as a verb (VB)
 the word ‘that’ parsed into a noun phrase instead
of heading a subordinate clause, as in Figure 3
 a phrase headed by ‘about’,asin‘think about’,
tagged as a location (LOC)
These isolated errors (resulting, presumably, from
a typo or a moment of inattention on the part of
the annotator) are not in any way predictable, and
can be found essentially only by examining the out-
put of one’s algorithm, analysing the \errors", and
noticing that the treebank was incorrect, rather than
(or in addition to) the algorithm. We will call these
Type B
2
.
2.3 Type C: Systematic inconsistency
Sometimes, there is a construction that the markup
guidelines writers didn’t think about, didn’t write
up, or weren’t clear about. In these cases, annota-
tors are left to rely on their own separate intuitions.
This leaves us with markup that is inconsistent and
therefore clearly partially in error, but with no obvi-
ous correction. There is really very little to be done
about these, aside from noting them and perhaps
controlling for them in the evaluation.
Some Type C errors in the treebank include:
 ‘ago’. English’s sole postposition seems to have
given annotators some di culty. Lacking a
postposition tag, many tagged such occurrences
of ‘ago’ as a preposition (IN); others used the
adverb tag (RB) exclusively.
2
Since some occur-
rences really are adverbs, this just makes a big
mess.
 ADVP-MNR.TheMNR tag is meant to be ap-
plied to constituents denoting manner or instru-
ment. Some annotators (but not all) seemed
to decide that any adverbial phrase (ADVP)
headed by an ‘-ly’ word must get a MNR tag, ap-
plying it to words like ‘suddenly’, ‘signi cantly’,
and ‘clearly’.
2
In particular, the annotators of sections 05, 09, 12,
17, 20, and 24 used IN sometimes, while the others tagged
all occurrences of ‘ago’ as adverbs.
 
  ‘
‘
‘
 
 hh
h
 
 P
P
 
 P
P
*NONE*
0
S
NP VP
were ...
SBAR
DT NNS
that subsidies
should be
 
  ‘
‘
‘
 
 hh
h
 
 P
P
IN S
NP VP
were ...
SBAR
that
NNS
subsidies
Figure 3: A parse error of Type B
2
The hallmark of a Type C error is that even what
ought to be correct isn’t always clear, and as a result,
any plan to correct a group of Type C errors will
have to  rst include discussion on what the correct
markup guideline should be.
3 tsed
In order to e ect these changes in some communi-
cable way, we have implemented a program called
tsed, by analogy with and inspired by the already
prevalent tgrep search program.
3
It takes a search
pattern and a replacement pattern, and after  nd-
ing the constituent(s) that match the search pattern,
modi es them and prints the result. For those al-
ready familiar with tgrep search syntax, this should
be moderately intuitive.
To the basic pattern-matching syntax of tgrep,we
have added a few extra restriction patterns (for spec-
ifying sentence number and head word), as well as a
way of marking nodes for later reference in the re-
placement pattern (by simply wrapping a constituent
in square brackets instead of parentheses).
The replacement syntax is somewhat more com-
plicated, because wherever possible we want to be
able to construct the new trees by reference to the
old tree, in order to preserve modi ers and structure
we may not know about when we write the pattern.
For full details of the program’s abilities, consult the
program documentation, but here are the main ones:
 Relabelling. Constituents can be relabelled with
no change to any of their modi ers or children.
 Tagging. A tag can be added to or removed from
a constituent, without changing any modi ers or
children.
 Reference. Constituents in the search pattern
can be included by reference in the replacement
pattern.
3
tgrep was written by Richard Pito of the University
of Pennsylvania, and comes with the treebank.
 Construction. New structure can be built by
specifying it in the usual S-expression format,
e.g. (NP (NN snork)). Usually used in combi-
nation with Reference patterns.
Along with tsed itself, we distribute a Perl pro-
gram wsjsed to process treebank change scripts like
the following:
{2429#0-b}<<EOF
NP $ [ADJP] > (VP / keep) (S \0 \1)
NP <<, markets - SBJ
EOF
This script would make a batch modi cation to the
zeroth sentence of the 29th  le in section 24. The
batch includes two corrections: the  rst matches
a noun phrase (NP) whose sister is an ADJP and
whose parent is a VP headed by the word ‘keep’.The
matched NP node is replaced by a (created) S node
whose children will be that very NP and its sister
ADJP. The second correction then  nds an NP that
ends in the word ‘markets’ and marks it with the SBJ
function tag.
Distributing changes in this form is important for
two reasons. First of all, by giving changes in their
minimal, most general forms, they are small and easy
to transmit, and easy to merge. Perhaps more im-
portantly, since corpora are usually copyrighted and
can only be used by paying a fee to the controlling
body (usually LDC or ELDA), we need a way to dis-
tribute only the changes, in a form that is useless
without having bought the original corpus. Scripts
for tsed,orforwsjsed, serve this purpose.
These programs are available from our website.
4
4 When to correct
Now that we have analysed the di erent types of
errors that can occur and how to correct them, we
can discuss when and whether to do so.
4
http://www.cs.brown.edu/~dpb/tsed/
4.1 Training
In virtually all empirical NLP work, the training set
is going to encompass the vast majority of the data.
As such, it is usually impractical for a human (or
even a whole lab of humans) to sit down and revise
the training. Type A errors can be corrected easily
enough, as can some Type B
1
errors whose heuristics
have a high yield. Purely on grounds of practicality,
though, it would be di cult to e ect signi cant cor-
rection on a training set of any signi cant size (such
as for the treebank).
Practicality aside, correcting the training set is a
bad idea anyway. After expending an enormous ef-
fort to perfect one training set, the net result is just
one correct training set. While it might make certain
things easier and probably will improve the results
of most algorithms, those improved results will not
be valid for those same algorithms trained on other,
non-perfect data; the vast majority of corpora will
still be noisy. If a user of an algorithm, e.g. an ap-
plication developer, chooses to perfect a training set
to improve the results, that would be helpful, but it
is important that researchers report results that are
likely to be applicable more generally, to more than
one training set. Furthermore, robustness to errors
in the training, via smoothing or some other mech-
anism, will also make an algorithm robust to sparse
data, the ever-present spectre that haunts nearly ev-
ery problem in the  eld; thus eliminating all errors
in the training ought not to have as much of an e ect
on a strong algorithm.
4.2 Testing
Testing data is another story, however. In terms of
practicality, it is more feasible, as the test set is usu-
ally at least an order of magnitude smaller than the
training. More important, though, is the issue of
fairness. We need to continue using noisy training
data in order to better model real-world use, but it
is unfair and unreasonable to have noise in the gold
standard
5
, which causes an algorithm to be penalised
where it is more correct than the human annotation.
As performance on various tasks improves, it be-
comes ever more important to be able to correct the
testing data. A ‘mere’ 1% improvement on a result of
75% is not impressive, as it represents just a 4% re-
duction in apparent error, but the same 1% improve-
ment on a result of 95% represents a 20% reduction
in apparent error! In the end, a noisy gold standard
sets an upper bound of less than 100% on perfor-
mance, which is if nothing else counterintuitive.
5
Sometimes more of a pyrite standard, really.
4.3 Ethical considerations
Of course, we cannot simply go about changing the
corpus willy-nilly. We refer the reader to chapter 7
of David Magerman’s thesis (1994) for a cogent dis-
cussion of why changing either the training or the
testing data is a bad idea. However, we believe that
there are now some changed circumstances that war-
rant a modi cation of this ethical dictum.
First, we are not allowed to look at testing data.
How to correct it, then? An initial reaction might
be to \promise" to forget everything seen while cor-
recting the test corpus; this is not reasonable.
Another solution exists, however, which is nearly
as good and doesn’t raise any ethical questions.
Many research groups already use yet another sec-
tion, separate from both the training and testing,
as a sort of development corpus.
6
When developing
an algorithm, we must look at some output for de-
bugging, preliminary evaluation, and parameter esti-
mation; so this development section is used for test-
ing until a piece of work is ready for publication, at
which point the \true" test set is used. Since we are
all reading this development output already anyway,
there is no harm in reading it to perform corrections
thereon. In publication, then, one can publish the
results of an algorithm on both the unaltered and
corrected versions of the development section, in ad-
dition to the results on the unaltered test section.
We can then presume that a corrected version of the
test corpus would result in a perceived error reduc-
tion comparable to that on the development corpus.
Another problem mentioned in that chapter is of a
researcher quietly correcting a test corpus, and pub-
lishing results on the modi ed data (without even
noting that it was modi ed). The solution to this
is simple: any results on modi ed data will need to
acknowledge that the data is modi ed (to be hon-
est), and those modi cations need to be made public
(to facilitate comparisons by later researchers). For
Type A errors  xed by a simple rule, it may be rea-
sonable to publish them directly in the paper that
gives the results.
7
For Type B errors, it would be
more reasonable to simply publish them on a web-
site, since there are bound to be a large number of
them.
8
6
In the treebank, this is usually section 24.
7
Theruleweusedto xtheLGS problem noted in
section 2.1 is as follows:
f24*-bgg<<EOF
NP !- LGS > (PP - LGS) - LGS
PP - LGS ! LGS
EOF
8
The 235 corrections made to section 24 are available
at http://www.cs.brown.edu/~dpb/tbfix/.
Finally, we would like to note that one of the rea-
sons Magerman was ready to dismiss error in the
testing was that the test data had \a consistency
rate much higher than the accuracy rate of state-of-
the-art parsers". This is no longer true.
4.4 Practical considerations
As multiple researchers each begin to impose their
own corrections, there are several new issues that
will come up. First of all, even should everyone pub-
lish their own corrections, and post comparisons to
previous researchers’ corrected results, there is some
danger that a variety of di erent correction sets will
exist concurrently. To some extent this can be mit-
igated if each researcher posted both their own cor-
rections by themselves, and a full list of all correc-
tions they used (including their own). Even so, from
time to time these varied correction sets will need to
be collected and merged for the whole community to
use.
More di cult to deal with is the fact that, in-
evitably, there will be disputes as to what is correct.
Sometimes these will be between the treebank ver-
sion and a proposed correction; there will probably
also be cases where multiple competing corrections
are suggested. There really is no good systematic
policy for dealing with this. Disputes will have to
be handled on a case-by-case basis, and researchers
should probably note any disputes to their correc-
tions that they know of when publishing results, but
beyond that it will have to be up to each researcher’s
personal sense of ethics.
In all cases, a search-and-replace pattern should
be made as general as possible (without being too
general, of course), so that it interacts well with
other modi cations. Various researchers are already
working with (deterministically) di erent versions of
corpora|with new tags added, or empty nodes re-
moved, or some tags collapsed, for instance, not to
mention other corrections already performed|and it
would be a bad idea to distribute corrections that are
speci c to one version of these. When in doubt, one
should favour the original form of the corpus, natu-
rally.
The  nal issue is not a practical problem, but an
Algorithm error 44%
Parse error 20%
Treebank error 18%
Type C error 13%
Dubious 6%
Table 1: Analysis of reported errors
observation: once a researcher publishes a correction
set, any further corrections by other researchers are
likely to decrease the results of the  rst researcher’s
algorithm, at least somewhat. This is due to the fact
that that researcher is usually not going to notice
corpus errors when the algorithm errs in the same
way. This unfortunate consequence is inevitable, and
hopefully will prove minor.
5 Experimental results
As a sort of case study in the meta-algorithms pre-
sented in the previous sections, we will look at
the problem of function tagging in the treebank.
Blaheta and Charniak (2000) describe an algorithm
for marking sentence constituents with function tags
such as SBJ (for sentence subjects) and TMP (for
temporal phrases). We trained this algorithm on sec-
tions 02{21 of the treebank and ran it on section 24
(the development corpus), then analysed the output.
First, we printed out every constituent with a func-
tion tag error. We then examined the sentence in
which each occurred, and determined whether the
error was in the algorithm or in the treebank, or
elsewhere, as reported in Table 1. Of the errors we
examined, less than half were due solely to an algo-
rithmic failure in the function tagger itself. The next
largest category was parse error: this function tag-
ging algorithm requires parsed input, and in these
cases, that input was incorrect and led the function
tagger astray; had the tagger received the treebank
parse, it would have given correct output. In just
under a  fth of the reported \errors", the algorithm
was correct and the treebank was de nitely wrong.
The remainder of cases we have identi ed either as
Type C errors|wherein the tagger agreed with many
training examples, but the \correct" tag agreed with
many others|or at least \dubious", in the cases that
weren’t common enough to be systematic inconsis-
tencies but where the guidelines did not clearly pre-
fer the treebank tag over the tagger output, or vice
versa.
Next, we compiled all the noted treebank errors
and their corrections. The most common correc-
tion involved simply adding, removing, or changing
a function tag to what the algorithm output (with a
net e ect of improving our score). However, it should
be noted that when classifying reported errors, we
examined their contexts, and in so doing discovered
other sorts of treebank error. Mistags and misparses
did not directly a ect us; some function tag correc-
tions actually decreased our score. All corrections
were applied anyway, in the hope of cleaner evalua-
tions for future researchers. In total, we made 235
corrections, including about 130 simple retags.
Grammatical tags Form/function tags Topicalisation tags
PRFPRFPRF
Treebank 96.37% 95.04% 95.70% 81.61% 76.44% 78.94% 96.74% 94.68% 95.70%
Fixed 97.08% 95.27% 96.16% 85.75% 77.51% 81.42% 97.85% 95.79% 96.81%
False error 19.56% 4.64% 10.70% 22.51% 4.54% 11.78% 34.05% 20.86% 25.81%
Table 2: Function tagging results, adjusted for treebank error
Finally, we re-evaluated the algorithm’s output on
the corrected development corpus. Table 2 shows
the resulting improvements.
9
Precision, recall, and
F-measure are calculated as in (Blaheta and Char-
niak, 2000). The false error rate is simply the percent
by which the error is reduced; in terms of the per-
formance on the treebank version (t)andthe xed
version (f),
False error =
f − t
1:0− t
 100%
This is the percentage of the reported errors that are
due to treebank error.
The topicalisation result is nice, but since the TPC
tag is fairly rare (121 occurrences in section 24), these
numbers may not be robust. It is interesting, though,
that the false error rate on the two major tag groups
is so similar|roughly 20% in precision and 5% in
recall for each, leading to 10% in F-measure. First
of all, this parallelism strengthens our assertion that
the false error rate, though calculated on a devel-
opment corpus, can be presumed to apply equally
to the test corpus, since it indicates that the hu-
man missed tag and mistag rates may be roughly
constant. Second, the much higher improvement on
precision indicates that the majority of treebank er-
ror (at least in the realm of function tagging) is due
to human annotators forgetting a tag.
6 Conclusion
In this paper, we have given a new characterisation
of the sorts of noise one  nds in empirical NLP, and
a roadmap for dealing with it in the future. For
many of the problems in the  eld, the state of the
art is now su ciently advanced that evaluation error
is becoming a signi cant factor in reported results;
we show that it is correctable within the constraints
of practicality and ethics.
Although our examples all came from the Penn
treebank, the taxonomy presented is applicable to
9
We did not run corrections on, nor do we show re-
sults for, Blaheta and Charniak’s \misc" grouping, both
because there were very many of them in the reported
error list and because they are very frequently wrong in
the treebank.
any corpus annotation project. As long as there are
typographical errors, there will be Type B errors;
and unclear or counterintuitive guidelines will forever
engender Type A and Type C errors. Furthermore,
we expect that the experimental improvement shown
in Section 5 will be reflected in projects on other an-
notated corpora|perhaps to a lesser or greater de-
gree, depending on the di culty of the annotation
task and the prior performance of the computer sys-
tem.
An e ect of the continuing improvement of the
state of the art is that researchers will begin (or
have begun) concentrating on speci c subproblems,
and will naturally report results on those subprob-
lems. These subproblems are likely to involve the
complicated cases, which are presumably also more
subject to annotator error, and are certain to involve
smaller test sets, thus increasing the performance ef-
fect of each individual misannotation. As the sizes
of the subproblems decrease and their complexity in-
creases, the ability to correct the evaluation corpus
will become increasingly important.

References
Ann Bies, Mark Ferguson, Karen Katz, and Robert
MacIntyre, 1995. Bracketing Guidelines for Tree-
bank II Style Penn Treebank Project, January.
Don Blaheta and Eugene Charniak. 2000. Assign-
ing function tags to parsed text. In Proceedings
of the 1st Annual Meeting of the North American
Chapter of the Association for Computational Lin-
guistics, pages 234{240.
David M. Magerman. 1994. Natural language pars-
ing as statistical pattern recognition. Ph.D. thesis,
Stanford University, February.
