Towards a More Careful Evaluation of Broad Coverage 
Parsing Systems 
Wide It. Hogenhout and Yuji Matsumoto 
Nara Institute of Science and Technology 
8916-5 Takayama, Ikoma 
Nara 630-01, Japan 
{ marc- h, matsu } @is. aist- nara. ac.jp 
Abstract 
Since treebanks have become available 
to researchers a wide variety of tech- 
niques has been used to make broad cov- 
erage parsing systems. This makes quan- 
titative evaluation very important, but 
the current evaluation methods have a 
number of drawbacks such as arbitrary 
choices in the treebank and the difficulty 
in measuring statistical significance. We 
suggest a more detailed method for test- 
ing a parsing system using constituent 
boundaries, with a number of measures 
that give more information than current 
measures, and evaluate the quality of the 
test. We also show that statistical signif- 
icance cannot be calculated in a straight- 
forward way, and suggest a calculation 
method for the case of Bracket Recall. 
1 Introduction 
During the last few years large treebanks have be- 
come available to many researchers, which has re- 
sulted in researches applying a range of new tech- 
niques for parsing systems. Most of the meth- 
ods that are being suggested include some kind 
of Machine Learning, such as history based gram- 
mars and decision tree models (Black et al., 1993; 
Magerman, 1995), training or inducing statisti- 
cal grammars (Black, Garside and Leech, 1993; 
Pereira and Schabes, 1992; Schabes et al., 1993), 
or other techniques (Bod, 1993). 
Consequently, syntactical analysis has become 
an area with a wide variety of (a) algorithms and 
methods for learning and parsing, and (b) type of 
information used for learning and parsing (some- 
times referred to as feature set). These meth- 
ods only could become popular through evalua- 
tion methods for parsing systems, such as Bracket 
Accuracy, Bracket Recall, Sentence Accuracy and 
Viterbi Score. Some of them were introduced in 
(Black et al., 1991; Harrison et M., 1991). 
These evaluation metrics have a number of 
problems, and in this paper we argue that they 
need to be reconsidered, and give a number of 
suggestions either to overcome those problems or 
to gain a better understanding of those prob- 
lems. Particular problems we look at are arbi- 
trary choices in the treebank, errors in the tree- 
bank, types of errors made by parsers, and the 
statistical significance of differences in test scores 
by parsers. 
2 Problems with Evaluation 
Metrics 
Until now a number of problems with evaluation 
have been pointed out. One well known prob- 
lem is that measures based only on the absence 
of crossing errors on sentence level, such as Sen- 
tence Accuracy and Viterbi Consistency, are not 
usable for parsing systems that apply a partial 
bracketing, since a sparse bracketing improves the 
score. For example (Lin, 1995) discusses some 
other problems, but suggests an alternative that 
is difficult to apply. It is based on transferring 
constituency trees to dependency trees, but that 
introduces many ad hoc choices, and treebanks 
with dependency trees are hardly available. 
Also, a treebank usually contains arbitrary 
choices (besides errors) made by humans, in cases 
where it was not clear what brackets correctly re- 
flect the syntactical structure of the sentence. 
We also mention some less discussed problems. 
First of all, given a test result such as Bracket 
Accuracy, it is necessary to know the confidance 
interval. In other words, if a parsing system scores 
81.2% on a test, in what range should we assume 
the estimate to be? Basically the same prob- 
lem arises with the statistical significance of tile 
difference between the test score of two different 
parsers. If one scores 81.2% and the other 82.5%, 
should we conclude the second one is really doing 
better? 
This is particularly important when developing 
a parsing system by trying various modifications, 
and choosing the one that performs the best on a 
test set. If the differences between scores become 
too small in relation to the test set, one will just 
562 
be making a parser for the test set and the per- 
formance will drop as soon as other data is used. 
There are several problems for deciding signif- 
icance for Bracket Accuracy and Bracket Recall. 
There is a strong variation between brackets, be- 
cause some brackets are very easy and SOlIle are 
very hard. Also one mistake may lead other mis- 
takes, making them not independent. As an ex- 
ample of the last problem, think of the indicated 
bracket pair in the sentence "The dog waited for 
/his master on the bridge\]." This would probably 
produce a crossing error, since the treebank would 
probably contain the pair "The dog/waited for his 
maste~\] on the bridge." The parser is now almost 
certain to make a second mistake, namely "The 
dog waited \[for his master on the bridge\]." Conse- 
quently two crossing errors are counted, whereas 
correcting one would imply correcting the other. 
In this article we will show that this makes it im- 
possible to calculate the significance in a straight- 
forward way and suggest two solutions. 
Another 1)roblem is that we only get a very gen- 
eral picture, whereas it would be interesting to 
know much more details. For example, how many 
of the bracket-pairs that constituted a crossing er- 
ror when compared to the treebank would be ac- 
ceptable to a human? (In other words, how often 
do arbitrary choices influence the result?) And, 
how many brackets that the parser produces are 
not in the treebank nor constitute a crossing er- 
ror, and how many of those are not acceptable to 
humans? 
Bracket Accuracy is often lower than it should 
be when the treebank does not indicate all brack- 
ets (so-called skeleton parsing). This may also 
make Bracket Recall seem too low. 
In this paper we suggest giving more specific 
information about test results, and develop meth- 
ods to estimate the statistical significance for test 
scores. 
3 More Careful Measures 
The data resulting from the test may be (a) gen- 
eral data fl'om all bracket pairs, or (b) data on spe- 
cific structures (i.e. prepositional phrases). The 
measures we give can be applied to either one. 
We suggest perforlning two types of tests: regu- 
lar tests and tests with a human check. The regu- 
lar test should include a nuinber of figures that we 
describe below, which are much more informative 
than the usual Bracket Recall or Bracket Preci- 
sion. The more elaborate one includes a human 
check on certain items, which not only gives more 
exact information on the test result, but in partic- 
ular shows the quality of the regular test. This is 
particularly useful if the parsing system was made 
independently from the treebank. 
The items for the regular test are listed here. 
The last four items only apply to a comparison 
of two parsing systems (for example two modifi- 
cations of the same system), here referred to as A 
and B. 
* TTB: Total Treebank Brackets, number of 
brackets in the treebank. 
. TPB: Total Parse Brackets, number of brack- 
ets produced by the parsing system. 
. EM: Exact Match, the nmnber of bracket- 
pairs produced by the parsing system that 
are equal to a pair in the treebank. 
. CE: Crossing Error, the number of bracket- 
pairs produced by the parsing system that 
constitute a crossing error against the tree- 
bank. 
. SP: Spurious, number of bracket pairs pro- 
dated by the parsing system that were not 
in the treebank but also do not constitute a 
crossing error. 
* PINH: Parse-error Inherited, the number of 
bracket-pairs produced by the parsing system 
that constitute a crossing error and have a di- 
rect parent bracket-pair that also constitutes 
a crossing error. 
. PNINII: Parse-error Non-Inherited, the num- 
ber of bracket-pairs produced by the parsing 
system that constitute a crossing error, but 
were not counted for PINH. 
• TINH: Treebank Inherited, the number of 
bracket-pairs in the treebank that were repro- 
duced by the parsing system and have a direct 
parent bracket-pair in the treebank that was 
also reproduced. 
• TNINH: Treebank Non-Inherited, the nmn- 
ber of bracket-pairs in the treebank that were 
reproduced by the parsing system but were 
not counted for TINH. 
. YY: Number of brackets in the treebank that 
were reproduced by A and B. 
• YN: Number of brackets in the treebank that 
were reproduced by A but not by B. 
• NY: Number of brackets in the treebank that 
were reproduced by B but not by A. 
* NN: Number of brackets in the treebank that 
were not reproduced by both A and B. 
As an example, we take this 2 sentence test: 
Treebank: 
\[He \[walks to \[the house\[\]\] 
\[\[The president\] \[gave \[a long speech\]\]\] 
Parser: \[He \[walks \[to \[the housel\]\]\] 
\[The \[\[president gavel \[a \[long speech\]\]l\] 
The number of exactly matching brackets (EM) 
is 3+2 = 5. The number of crossing errors (CE) is 
563 
2, both in the second sentence. The rest, 1-1-1 = 2 
is spurious (SP). Further, TTB is 7, TPB is 9, 
PINH is 1 and PNINH is 1, TINH is 1 and TNINH 
is 4. 
This already gives more detailed information, 
but we can take things a step further by having a 
human evaluate the most important brackets. If 
the test set is large, it would be undesirable or 
impossible to have a human evaluate every single 
bracket, but we can seriously reduce the workload 
by not considering the exact matching bracket 
pairs; they are simply marked as 'accepted.' The 
only result of evaluating these brackets would be 
a few errors in the treebank, which is often not 
really worth the trouble (unless the treebank is 
suspected to contain many errors). This leaves 
only the crossing errors and spurious brackets to 
be evaluated. 
This leaves a much smaller amount of work, es- 
pecially if there are many exact matches. Never- 
theless we suggest doing a human check only on 
important tests, such as final evaluations. 
In the human evaluation, crossing error and 
spurious bracket pairs are to be counted as 'ac- 
ceptable' if they would fit into the correct interpre- 
tation using the style of bracketing that the pars- 
ing system aims at, ignoring the style of bracket- 
ing of the treebank. 
The result of this process is that EM, CE and 
SP will be divided in accepted and rejected, giving 
six groups. We will refer to them as EMA, EMR, 
CEA, CER, SPA and SPR. If the check on EM is 
not performed, as we suggest, EMR will be 0. 
If YN and NY are both relatively high, this 
shows that there are structures on which A is bet- 
ter than B and vice versa (the systems 'comple- 
ment' each other). In that case we would rec- 
ommend testing on (more) specific structures, be- 
cause otherwise the general result will be mislead- 
ing. 
4 A Practical Example 
To show the difference between the usual evalu- 
ation and our evaluation method we give the re- 
sults for two parsing systems we evaluated in the 
course of our research. We do not intend to make 
any particular claims about these parsing systems, 
nor about the treebank we used (the test was not 
designed to draw conclusions about the treebank), 
we only use it to discuss the issues involved in eval- 
uation. 
The treebank we used was the EDR corpus 
(EDR, 1995), a Japanese treebank with mainly 
newspaper sentences. We compared two versions 
of a grammar based parsing system developed at 
our laboratory, using a stochastical grammar to 
select one parse for every sentence. Having two 
variations of the same parser, we were interested 
in the difference between them. We performed a 
test on 600 sentences from the corpus (which were 
not used for training). 
Our evaluation was as follows: 
1. Unrelevant elements such as punctuation are 
eliminated fl'om both the treebank tree and 
the parse tree. 
2. Next, all (resulting) empty bracket-pairs 
are removed. This was done recursively, 
therefore, if removing an empty bracket-pair 
caused its parent to become empty, the par- 
ent is also removed. 
3. Double bracket-pairs are removed. For exam- 
ple "The l/old man\]\]" is turned into "The \[old 
magi'. 
4. The crossing error bracket-pairs and spurious 
bracket-pairs were evaluated by hand. This 
took about three person-hours. 
In this process one step is missing, we namely 
wanted to remove trivial brackets before evaluat- 
ing. In English there is a simple strategy for this: 
remove all brackets that enclose only one word. 
In Japanese this is not so easy. Since Japanese is 
an agglutinating language and words are not sep- 
arated, it is difficult to say what the 'words' are 
in the first place. We decided on a certain level to 
permit brackets, and the tree from the treebank 
also stopped at some level so that remaining, more 
precise bracket-pairs were amongst those counted 
as spurious. 
The resulting figures are in table 1 and table 2 
gives the comparative items. 
Table 1: Sample Test Results 
Item System A System B 
TTB 
TPB 
EMA 
EMR 
CEA 
CER 
SPA 
SPR 
P\[NH 
PNINH 
TINH 
TNINH 
11400 
8671 
6748 (77.8%) 
O (assumed) 
204 (2.4%) 
690 (8.0%) 
956 (11.0%) 
73 (0.8%) 
523 
371 
5212 
1536 
11400 
8771 
68 8 (78.2%) 
0 (assumed) 
182 (2.1%) 
611 (7.0%) 
1049 (12.0%) 
71 (0.8%) 
470 
323 
5426 
1432 
Table 2: Comparative Measure Results 
YY 6516 (57.2%) 
YN 232 (2.0%) 
NY 343 (3.1%) 
NN 4309 (37.8%) 
564 
5 New Measures 
We claim that the items listed in the previous 
paragraph allows a nmre flexible framework for 
ewduatim~. In this paragraph we will show some 
examt)les of measures that can be used. They can 
be calculated with these items so there is no need 
to discuss every one of them all the time. Table 3 
gives the measures and table 4 gives the results 
in percentages. ~l?he measures in the lower part 
of this tabh> are more directed at the test than at 
the parsers. 
'\['able 3: Measures 
Measnre 
Generation Rate 
Recall-hard 
Recall-soft 
Precision-hm-d 
Precision-soft 
Spuriousness 
Spurious Reject 
False Error 
Test Noise 
t)roblem Rate 
P-inheritance 
'l?-inheritance 
Calculation 
EMA / T"I'B 
( EMA 4- CEA -I-SPA ) ~ TTB 
EMA / TPB 
( EMA-t CEA-I-SPA ) / TPB 
(.gPA q-SPR) / fI't'B 
SPR?(SPA q-SPR) 
PINH/ ( PIN H + I'NINIt) 
Table 4: Results for Measures 
Measure A B 
Generation Rate 
P~ecall-hard 
Recall- soft 
I'recision- hard 
Precision-soft 
Spuriousness 
Spurious Reject 
False Error 
'Pest Noise 
Problem Rate 
l Lin heritance 
T-inheritance 
59.2% !60.2% 
69.4% 71.0% 
77.8% , 78.2% 
91.2% !92.2% 
11.9% 12.8% 
22.8% 23.0% 
14.2% 114.8% 
3.2% ~ 2.9% 
58.5% 59.3?/0 
77.2% 179.1% 
The generation rate shows that both systems 
arc rather modest in producing brackets. 
We give two types of recall. We suggest using 
recall-lmrd, but when the treebank does not indi- 
cate all brackets recall-soft may give an indication 
of the proper recall. 
We also present two types of precision. B scores 
better on precision-soft, but there is not much dif- 
ference for precision-hard. This shows that B is 
better at teeM1 but also generates nmre spnrious 
brackets. The spuriousness also indicates this. 
The other measures tell us more about the test 
itself. A would have been treated slightly fa- 
vorable without a human cheek, since relatively 
more errors go 'undetected.' False Error shows 
that almost I out of 4 crossing errors is not re- 
ally wrong, which indicates there is much differ- 
ence in bracketing-style between the treebank and 
the parsing system. 'rest Noise shows how many 
bracket-pairs were not tested properly. Problem 
Rate shows the real 'myopia' of the test. 
3'he inheritance data shows that in our test; 
crossing errors are often related (P-inheritance). 
Also, reproducing a particular bracket-pair from 
the treebank increases the chances on reproducing 
its parent (T-inheritance). 
6 Significance 
Things would be easy if we could assmne that the 
chance of apl)lying a bracket is correctly modeled 
as a binomial experiment. We begin by mention- 
ing two reasons why that is not possible. 
• Errors that are related, sneh a~s one wrong 
attachment that causes a number of cross- 
ing errors, as was shown in our test by P- 
inheritance. 
• For a binomial process we mast assume that 
the chance on success is the same for every 
bracket pair. It is not, in fact there are both 
very easy and very hard bracket pairs, with 
chances w~rying from very small to very high. 
The significance levels of all differences are 
worth knowing, but our main interest is the dis 
ference between A and B in recall and precision. 
Because of space limitations we only discuss a 
strategy for estimating the significance level of the 
measure recall-hard. 
Significance for Recall-Hard First we will 
check whether the distribution can be modeled 
properly with a binomial experiment. We do this 
by looking at the comparative items YY, YN, NY 
and NN. 
From these values the problem is intuitively 
clear: there are many easy bracket pairs that both 
always produce correctly, and many that both al- 
most never produce because they are too hard, or 
the parsing systems simply never produce a cer- 
tain type of bracket pair. Also, we have tested 
two rather similar parsing systems often giving the 
same answer, after all that is often just what one 
is interested in because one wants to measnre im- 
provement. We will use statistical distributions to 
confirm this problem occurs, and to find a solution 
to the significance problem. 
We do not have tile space to go into tile details 
of the relations between the distributions, but if A 
and B would behave like a binomial variable with 
test size N, with Pa and P~ as respective chance 
on success, the distribution of YY should again 
be a binomial variable for test size N, with chance 
Pry = PaPb. The expected value and variance of 
YY would be 
565 
E(YY) = NPyy = NPaPb 
V(YY) = NPw(1 - Pyy) = N(PaPb)(1 - P..Pb) 
For NN the distril~ution is the same with the 
opposite probabilities, a binomial variable for test 
size Nand P,~, = (1- Pa)(1- Pb). If we take 
Pa = l - Pa and Pb = 1 -- Pb, the expected value 
and variance of NN become 
E(NN) = NP,~. = Nfi~g 
V(NN) -- NP~,~(1 - Pnn) = N(Pa-Pb)(1 - P.Pb) 
We will later put this to more use, but for now 
we just use it to conclude that YY is expected to 
be around 4063, and NN is expected to be around 
1851. Using the variation we find that the ob- 
served values are both extremely rare, so we can 
reject the hypothesis that we are comparing two 
binomial variables. 
Our strategy to solve this problem is assuming 
there are three types of brackets, namely brack- 
ets that are ahnost always reproduced, those that 
are almost never reproduced, and those that are 
sometimes reproduced and therefore constitute 
the 'real test' between the two parsing systems. 
Note that the first two types do not tell us any- 
thing about the difference between the parsing 
systems. By assuming the rest is similar to a bi- 
nomial distribution, we can calculate the signif- 
icance. Of conrse this assumption simplifies the 
situation, but it is closer to the truth than assum- 
ing the whole test can be modeled by a binominal 
distribution. And, if this assumption is not jus- 
tified the whole test is not appropriate without 
testing on more specific phenomena. 
Guessing the Real Test Size The idea behind 
this method is that some brackets are almost al- 
ways produced, and some are never, and those 
should be discarded so the real test remains. Ig- 
noring certain bracket pairs corresponds with the 
fact that some constituents relate to little and 
some to much ambiguity, making some suitable for 
comparison and others not. We look at the num- 
ber of equal answers to estimate the number of 
bracket-pairs that were not too easy or too hard. 
This is a theoretical operation, thus there is no 
need to do this in practice. We only need to es- 
timate two parameters: M1 being the number of 
bracket-pairs that is discarded because they are 
always reproduced, and M2 being the number of 
bracket-pairs discarded because they are not re- 
produced. We reduce YY by M1, and NN by M2 
(the test size is thus reduced by M1 4- M2). This 
indicates an imaginary real test, namely the part 
of the test that really served to compare the pars- 
ing systems. 
We calculate these quantities by a~suming a bi- 
nomial distribution for the real test, and making 
sure that the corrected values for YY and NN be- 
come equal to their expected value. Let 
observed YY in real test = E(YY) in real test = 
real test size x Pa in real test x Pb in real test 
then we get 
YY-M1 -- (YY + YN - M1)(YY + NY - M1) 
TTB - M1 
We do not give the derivation, but when doing 
the same for NN and combining the equations the 
following relation between M 1 and M2 holds: 
M1 = NY×TTB+M2×YY-(NY++NN)(VY+NY~ 
M2-NN 
There are usually rnany values for M1 and M2 
that satisfy this condition. In practice M1 and 
M2 have to be discrete values, so they often are 
not satisfying the condition exactly, but are close 
enough. 
It may seem logical to find the proper values 
for M1 and M2 as a next step, in other words 
deciding how many brackets were 'too easy' and 
how many were 'too hard.' But our experience 
is that there is no need to do that, because we 
are only interested in the significance level of the 
difference between A and B, and the significance 
level is practically the same for all values of M1 
and M2 that satisfy the condition. 
As for our test, M1 and M2 can be, for exam- 
ple, 6234 and 4027 respectively. Whatever value 
we take, the significance level o{" the difference be- 
tween A and B corresponds to being 4.7 standard 
variations away from the expected value. This 
means that we can safely conclude that B re- 
ally performs better than A. The real test is a 
lot smMler, only 1139 bracket pairs, but that is 
still enough to be meaningful. (If the nmnber of 
eqnM answers would be extremely high, the real 
test size ruay become too small, indicating the test 
is meaningless.) 
7 Conclusion 
We have pointed out that the measures which are 
currently in use have a number of weaknesses. To 
list the most important ones, a number of aspects 
of parsing systems are not measured, treebanks 
contain arbitrary choices, some errors are not de- 
tected and discovering statistical significance is 
difficult. 
The test items and measures we have suggested 
give a better picture of the specific behaviors of 
parsing systems. Although not solving the prob- 
lem of arbitrary choices in the treebank, we can at 
least find out how much influence this has on the 
test results by using a human check on important 
tests. The same goes for other problems, such as 
errors that are not detected by comparison with 
the treebank. 
566 
We suggest giving the regular items on every 
test, and sometimes doing a hmnan check to dis- 
cover the quality of the regular test. The amount 
of work in the hmnan check can be made small 
when the recall of the parsing systems is high, by 
assuming the exact matches are correct. 
We have also given a strategy for calculating 
the significance level of differences in scores on 
one particular measure, namely recall-hard. This 
strategy makes it possible to calculate the signif- 
icance level right away from the test items, not 
requiring a haman check. 
This discussion will certainly not be the last on 
this subject. We have not mentioned some quanti- 
ties such as the number of sentences with 1 cross- 
ing error, 2 crossing errors, and those with many 
crossing errors. These are of course also useful 
tools for evaluation. We have also not mentioned 
the Parse Base (calculated as the geometric mean 
over all sentences of ~, where n is the number of 
words in the sentence and p the number of parses 
tbr the sentence), because that relates to a gram- 
mar rather than a parsing system. Nevertheless 
we feel this will help to improve the evaluation of 
broad coverage parsing systems. 
Language Processing Systems, Association for 
Computational Linguistics. 
Japan Electronic Dictionary Research Institute, 
Ltd. 1995. EDR Electronic Dictionary 7~chni- 
cal Guide. 
D. Lin. 1995. A dependency-based method for 
evaluating broad-coverage parsers. In Proceed- 
ings of the 14th International Joint Conference 
on Artificial Intelligence, pages 1420 1425. 
D. M. Magerman. 1995. Statistical decision-tree 
models for parsing, in Proceedings of the 33d 
Annual Meeting of the Association for Compu- 
tational Linguistics, pages 276-283. 
F. Pereira and Y. Schabes. 1992. Inside Outside 
reestimation from partially bracketed corpora. 
In Proceedings of the 30th Annual Meeting of 
the Association for Computational Linguistics, 
pages 128-\]35. 
Y. Sehabes, M. Roth, and R. Osborne. 1993. 
Parsing the wall street journal with the inside- 
outside algorithm. In Proceedings of the Sixth 
Conference of the European Chapter of the As- 
sociation for Computational Linguistics, pages 
34\]-347. 
References 
E. Black, S. Abney, D. Flickenger, C. Gdaniec, 
R. Grishman, P. Harrison, D. Hindle, R. Ingria, 
F. Jelinek, J. Klavans, M. I,iberman, M. Mar- 
cus, S. t{oukos, B. Santorini, and T. Strza- 
lkowski. 1991. A procedure for quantitatively 
comparing the syntactic coverage of English 
grammars. In Proceedings of the Workshop 
on Spcech. and Natural Language, Defense Ad- 
vanced Research Projects Agency, U.S. Govt., 
pages 306-311. 
E. Black, R. Garside, and G. Leech. 1993. Statis- 
tically Driven Computer Grammars of English: 
The IBM/Lancaster Approach. Rodopi. 
E. Black, F. Jelinek, J. Lafferty, and D. M. Mager- 
man. 1993. Towards history-based grammars: 
Using richer models for probabilistic parsing. 
In Proceedings of the 31st Annual Meeting of 
the Association for Computational Linguistics, 
pages 31--37. 
R. Bod. 1993. Using an annotated corpus as a 
stochastic grammar. In Proceedings of the Sixth 
Conference of the European Chapter of the As- 
sociation for Computational Linguistics, pages 
37-44. 
P. Harrison, S. Abney, E. Black, D. Flickenger, 
C. Gdanicc, R. Grishman, D. Hindle, R. In- 
gria, M. Marcus, B. Santorini, and T. Strza- 
lkowski. 1991. Evaluating syntax performance 
of parser/grammars of English. In Proceed- 
ings of the Workshop on Evaluating Natural 
567 
