More accurate tests for the statistical significance of result differences*

Alexander Yeh
Mitre Corp.
202 Burlington Rd.
Bedford, MA 01730
USA
asy@mitre.org

Abstract
Statistical significance testing of differences in values of metrics like recall, precision and balanced F-score is a necessary part of empirical natural language processing. Unfortunately, we find in a set of experiments that many commonly used tests often underestimate the significance and so are less likely to detect differences that exist between different techniques. This underestimation comes from an independence assumption that is often violated. We point out some useful tests that do not make this assumption, including computationally-intensive randomization tests.
1 Introduction
In empirical natural language processing, one is often testing whether some new technique produces improved results (as measured by one or more metrics) on some test data set when compared to some current (baseline) technique. When the results are better with the new technique, a question arises as to whether these result differences are due to the new technique actually being better or just due to chance. Unfortunately, one usually cannot directly answer the question "what is the probability that the new technique is better given the results on the test data set":

P(new technique is better | test set results)

But with statistics, one can answer the following proxy question: if the new technique was actually no different than the old technique (the null hypothesis), what is the probability that the results on the test set would be at least this skewed in the new technique's favor (Box et al., 1978, Sec. 2.3)? That is, what is

P(test set results at least this skewed in the new technique's favor | new technique is no different than the old)

If the probability is small enough (5% is often used as the threshold), then one will reject the null hypothesis and say that the differences in the results are "statistically significant" at that threshold level.

*This paper reports on work performed at the MITRE Corporation under the support of the MITRE Sponsored Research program. Warren Greiff, Lynette Hirschman, Christine Doran, John Henderson, Kenneth Church, Ted Dunning, Wessel Kraaij, Mitch Marcus and an anonymous reviewer provided helpful suggestions. Copyright ©2000 The MITRE Corporation. All rights reserved.
This paper examines some of the possible methods for trying to detect statistically significant differences in three commonly used metrics: recall, precision and balanced F-score. Many of these methods are found to be problematic in a set of experiments that we performed. These methods have a tendency to underestimate the significance of the results, which tends to make one believe that some new technique is no better than the current technique even when it is.

This underestimate comes from these methods assuming that the techniques being compared produce independent results when, in our experiments, the techniques being compared tend to produce positively correlated results. To handle this problem, we point out some statistical tests, like the matched-pair t, sign and Wilcoxon tests (Harnett, 1982, Sec. 8.7 and 15.5), which do not make this assumption. One can use these tests on the recall metric, but the precision and balanced F-score metrics have too complex a form for these tests. For such complex metrics, we use a compute-intensive randomization test (Cohen, 1995, Sec. 5.3), which also avoids this independence assumption.
The next section describes many of the standard tests used and their problem of assuming certain forms of independence. The first subsection describes tests where this assumption appears in estimating the standard deviation of the difference between the techniques' results. The second subsection describes using contingency tables and the χ² test. Following this is a section on methods that do not make this independence assumption. Subsections in turn describe some analytical tests, how they can apply to recall but not precision or the F-score, and how to use randomization tests to test precision and F-score. We conclude with a discussion of dependencies within a test set's instances, a topic that we have yet to deal with.
2 Tests that assume independence 
between compared results 
2.1 Finding and using the variance of a 
result difference 
For each metric, after determining how well a new and current technique performs on some test set according to that metric, one takes the difference between those results and asks "is that difference significant?"

A way to test this is to expect no difference in the results (the null hypothesis) and to ask, assuming this expectation, how unusual are these results? One way to answer this question is to assume that the difference has a normal or t distribution (Box et al., 1978, Sec. 2.4). Then one calculates the following:
(d − E[d]) / s_d = d / s_d        (1)
where d = x_1 − x_2 is the difference found between x_1 and x_2, the results for the new and current techniques, respectively. E[d] is the expected difference (which is 0 under the null hypothesis) and s_d is an estimate of the standard deviation of d. Standard deviation is the square root of the variance, a measure of how much a random variable is expected to vary. The results of equation 1 are compared to tables (cf. Box et al. (1978, Appendix)) to find out what the chances are of equaling or exceeding the equation 1 results if the null hypothesis were true. The larger the equation 1 results, the more unusual it would be under the null hypothesis.
A complication of using equation 1 is that one usually does not have s_d, but only s_1 and s_2, where s_1 is the estimate for x_1's standard deviation and similarly for s_2. How does one get the former from the latter? It turns out that (Box et al., 1978, Ch. 3)
σ_d² = σ_1² + σ_2² − 2 ρ_12 σ_1 σ_2
where σ_i is the true standard deviation (instead of the estimate s_i) and ρ_12 is the correlation coefficient between x_1 and x_2. Analogously, it turns out that

s_d² = s_1² + s_2² − 2 r_12 s_1 s_2        (2)
where r_12 is an estimate for ρ_12. So not only does σ_d (and s_d) depend on the properties of x_1 and x_2 in isolation, it also depends on how x_1 and x_2 interact, as measured by ρ_12 (and r_12). When x_1 and x_2 are independent, ρ_12 = 0, and then σ_d = sqrt(σ_1² + σ_2²) and, analogously, s_d = sqrt(s_1² + s_2²). When ρ_12 is positive, x_1 and x_2 are positively correlated: a rise in x_1 or x_2 tends to be accompanied by a rise in the other result. When ρ_12 is negative, x_1 and x_2 are negatively correlated: a rise in x_1 or x_2 tends to be accompanied by a decline in the other result. −1 ≤ ρ_12 ≤ 1 (Larsen and Marx, 1986, Sec. 10.2).
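As an illustration of equation 2 (with toy numbers of our own, not from the experiments), the following Python fragment computes s_d for two positively correlated result sequences and compares it with the value that the independence assumption would give:

import numpy as np

# Hypothetical per-subset scores for two techniques evaluated on the same test set.
x1 = np.array([0.60, 0.72, 0.55, 0.80, 0.67])
x2 = np.array([0.58, 0.70, 0.50, 0.78, 0.66])

s1, s2 = x1.std(ddof=1), x2.std(ddof=1)   # standard deviation estimates
r12 = np.corrcoef(x1, x2)[0, 1]           # estimate of the correlation rho_12

s_d = np.sqrt(s1**2 + s2**2 - 2 * r12 * s1 * s2)   # equation (2)
s_d_ind = np.sqrt(s1**2 + s2**2)                   # what assuming independence (r12 = 0) gives
print(r12, s_d, s_d_ind)   # for positively correlated results, s_d is the smaller of the two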
The assumption of independence is often used in formulas to determine the statistical significance of the difference d = x_1 − x_2. But how accurate is this assumption? One might expect some positive correlation from both results coming from the same test set. One may also expect some positive correlation when either both techniques are just variations of each other¹ or both techniques are trained on the same set of training data (and so are missing the same examples relative to the test set).

This assumption was tested during some experiments for finding grammatical relations (subject, object, various types of modifiers, etc.). The metric used was the fraction of the relations of interest in the test set that were recalled (found) by some technique. The relations of interest were various subsets of the 748 relation instances in that test set. An example subset is all the modifier relations. Another subset is just that of all the time modifier relations.

¹These variations are often designed to usually behave in the same way and only differ in just a few cases.
First, two different techniques, one memory-based and the other transformation-rule based, were trained on the same training set, and then both tested on that test set. Recall comparisons were made for ten subsets of the relations and the r_12 was found for each comparison. From Box et al. (1978, Ch. 3)

r_12 = Σ_k (y_1k − ȳ_1)(y_2k − ȳ_2) / (s_1 s_2 (n − 1))

where y_ik = 1 if the ith technique recalls the kth relation and 0 if not. n is the number of relations in the subset. ȳ_i and s_i are the mean and standard deviation estimates (based on the y_ik's), respectively, for the ith technique.

For the ten subsets, only one comparison had an r_12 close to 0 (it was −0.05). The other nine comparisons had r_12's between 0.29 and 0.53. The median value over the ten comparisons was 0.38.
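A minimal sketch (ours) of this r_12 computation on made-up recall indicator vectors follows; np.corrcoef returns the same value.

import numpy as np

# y_ik = 1 if technique i recalls relation k of the subset, 0 if not (toy data).
y1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
y2 = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 0])

n = len(y1)
s1, s2 = y1.std(ddof=1), y2.std(ddof=1)
r12 = ((y1 - y1.mean()) * (y2 - y2.mean())).sum() / (s1 * s2 * (n - 1))
print(r12)   # about 0.36 for these vectors; np.corrcoef(y1, y2)[0, 1] agrees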
Next, the transformation-rule based technique was run with different sets of starting conditions and/or different, but overlapping, subsets of the training set. Recall comparisons were made on the same test data set between the different variations. Many of the comparisons were of how well two variations recalled a particular subset of the relations. A total of 40 comparisons were made. The r_12's on all 40 were positive. 3 of the r_12's were in the 0.20-0.30 range. 24 of the r_12's were in the 0.50-0.79 range. 13 of the r_12's were in the 0.80-1.00 range.
So in our experiments, we were usually comparing positively correlated results. How much error is introduced by assuming independence? An easy-to-analyze case is when the standard deviations for the results being compared are the same.² Then equation 2 reduces to s_d = s sqrt(2(1 − r_12)), where s = s_1 = s_2. If one assumes the results are independent (assume r_12 = 0), then s_d = s sqrt(2). Call this value s_d-ind. As r_12 increases in value, s_d decreases:

r_12    s_d                 factor increase
0.38    0.787 (s_d-ind)     1.27
0.50    0.707 (s_d-ind)     1.41
0.80    0.447 (s_d-ind)     2.24

The rightmost column above indicates the magnitude by which erroneously assuming independence (using s_d-ind in place of s_d) will increase the standard deviation estimate. In equation 1, s_d forms the denominator of the ratio d/s_d. So erroneously assuming independence will mean that the numerator d, the difference between the two results, will need to increase by that same factor in order for equation 1 to have the same value as without the independence assumption. Since the value of that equation indicates the statistical significance of d, assuming independence will mean that d will have to be larger than without the assumption to achieve the same apparent level of statistical significance. From the table above, when r_12 = 0.50, d will need to be about 41% larger. Another way to look at this is that assuming independence will make the same value of d appear less statistically significant.

²This is actually roughly true in the comparisons made, and is assumed to be true in many of the standard tests for statistical significance.
The common tests of statistical significance use this assumption. The test known as the t (Box et al., 1978, Sec. 4.1) or two-sample t (Harnett, 1982, Sec. 8.7) test does. This test uses equation 1 and then compares the resulting value against the t distribution tables. This test has a complicated form for s_d because:

1. x_1 and x_2 can be based on differing numbers of samples. Call these numbers n_1 and n_2 respectively.

2. In this test, the x_i's are each an average over n_i samples of another variable (call it y_i). This is important because the s_i's in this test are standard deviation estimates for the y_i's, not the x_i's. The relationship between them is that the standard deviation estimate for x_i is s_i / sqrt(n_i).

3. The test itself assumes that y_1 and y_2 have the same standard deviation (call this common value s). The denominator estimates s using a weighted average of s_1 and s_2. The weighting is based on n_1 and n_2.
From Harnett (1982, Sec. 8.7), the denominator

s_d = sqrt( [((n_1 − 1) s_1² + (n_2 − 1) s_2²) / (n_1 + n_2 − 2)] × [(n_1 + n_2) / (n_1 n_2)] )

When n_1 = n_2 (call this common value n), s_1 and s_2 will be given equal weight, and s_d simplifies to sqrt((s_1² + s_2²) / n). Making the substitution described above of s_i sqrt(n) for s_i (so that the s_i's again denote the estimates for the x_i's) leads to an s_d of the form sqrt(s_1² + s_2²), the form we had earlier using the independence assumption.
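As a small check (ours) on the formula above, the pooled denominator and its equal-sample-size simplification can be computed directly:

from math import sqrt

def two_sample_s_d(s1, n1, s2, n2):
    """Denominator of equation (1) for the two-sample t test (pooled form from Harnett)."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return sqrt(pooled_var * (n1 + n2) / (n1 * n2))

# With n1 = n2 = n, the value reduces to sqrt((s1**2 + s2**2) / n).
s1, s2, n = 0.4, 0.5, 50
print(two_sample_s_d(s1, n, s2, n), sqrt((s1**2 + s2**2) / n))   # the two agree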
Another test that both makes this assumption and uses a form of equation 1 is a test for binomial data (Harnett, 1982, Sec. 8.1.1) which uses the "fact" that binomial distributions tend to approximate normal distributions. In this test, the x_i's being compared are the fraction of the items of interest that are recovered by the ith technique. In this test, the denominator s_d of equation 1 also has a complicated form, both due to the reasons mentioned for the t test above and to the fact that with a binomial distribution, the standard deviation is a function of the number of samples and the mean value.
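As an illustrative sketch (ours, written in the standard pooled two-proportion form rather than Harnett's exact presentation), two recall fractions can be compared with this normal approximation; the counts below echo the example in section 3.3.

from math import sqrt
from scipy.stats import norm

def two_proportion_z(r1, n1, r2, n2):
    """Normal-approximation test for the difference of two binomial proportions.
    Note that this form assumes the two proportions are independent."""
    p1, p2 = r1 / n1, r2 / n2
    p_pool = (r1 + r2) / (n1 + n2)                         # pooled proportion under the null
    s_d = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # denominator of equation (1)
    z = (p1 - p2) / s_d
    return z, 2 * norm.sf(abs(z))                          # statistic and two-sided p-value

# Technique 1 recalls 47 of 103 items of interest, technique 2 recalls 25 of 103.
print(two_proportion_z(47, 103, 25, 103))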
2.2 Using contingency tables and χ² to test precision
A test that does not use equation 1 but still makes an assumption of independence between x_1 and x_2 is that of using contingency tables with the chi-squared (χ²) distribution (Box et al., 1978, Sec. 5.7). When the assumption is valid, this test is good for comparing differences in the precision metric. Precision is the fraction of the items "found" by some technique that are actually of interest. Precision = R/(R + S), where R is the number of items that are of interest and are Recalled (found) by the technique, and S is the number of items that are found by the technique that turn out to be Spurious (not of interest). One can test whether the precision results from two techniques are different by using a 2 × 2 contingency table to test whether the ratio R/S is different for the two techniques. One makes the latter test by seeing if the assumption that the ratios for the two techniques are the same (the null hypothesis) leads to a statistically significant result when using a χ² distribution with one degree of freedom. A 2 × 2 table has 4 cells. The top 2 cells are filled with the R and S of one technique and the bottom 2 cells get the R and S of the other technique. In this test, the value in each cell is assumed to have a Poisson distribution. When the cell values are not too small, these Poisson distributions are approximately Normal (Gaussian). As a result, when the cell values are independent, summing the normalized squares of the difference between each cell and its expected value leads to a χ² distribution (Box et al., 1978, Sec. 2.5-2.6).
How well does this test work in our experiments? Precision is a non-linear function of two random variables R and S, so we did not try to estimate the correlation coefficient for precision. However, we can easily estimate the correlation coefficients for the R's. They are the r_12's found in section 2.1. As that section mentions, the r_12's found are just about always positive. So at least in our experiments, the R's are not independent, but are positively correlated, which violates the assumptions of the test.

An example of how this test behaves is the following comparison of the precision of two different methods at finding the modifier relations using the same training and test set. The correlation coefficient estimate for R is 0.35 and the data is

Method    R     S     Precision
1         47    48    49%
2         25    14    64%
Placing the R and S values into a 2 × 2 table leads to a χ² value of 2.38.³ At 1 degree of freedom, the χ² tables indicate that if the null hypothesis were true, there would be a 10% to 20% chance of producing a χ² value at least this large. So according to this test, this much of an observed difference in precision would not be unusual if no actual difference in the precision exists between the two methods.
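To make the computation concrete, here is a sketch (ours) that reproduces the χ² value for this 2 × 2 table without the Yates adjustment (see footnote 3); scipy's chi2_contingency with correction=False gives the same statistic.

import numpy as np
from scipy.stats import chi2

def chi_square_2x2(table):
    """Pearson chi-squared statistic (no Yates correction) for a 2 x 2 table of counts."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    stat = ((table - expected) ** 2 / expected).sum()
    return stat, chi2.sf(stat, df=1)   # statistic and p-value at 1 degree of freedom

# Rows are methods 1 and 2; columns are R (of interest) and S (spurious).
print(chi_square_2x2([[47, 48], [25, 14]]))   # about (2.37, 0.12), consistent with the value reported above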
This test assumes independence between the R values. When we use a 2^20 (=1048576) trial approximate randomization test (section 3.3), which makes no such assumptions, then we find that this latter test indicates that under the null hypothesis, there is less than a 4% chance of producing a difference in precision results as large as the one observed. So this latter test indicates that this much of an observed difference in precision would be unusual if no actual difference in the precision exists between the two methods.

It should be mentioned that the manner of testing here is slightly different than the manner in the rest of this paper. The χ² test looks at the square of the difference of two results, and rejects the null hypothesis (the compared techniques are the same) when this square is large, whether the largeness is caused by the new technique producing a much better result than the current technique or vice-versa. So to be fair, we compared the χ² results with a two-sided version of the randomization test: estimate the likelihood that the observed magnitude of the result difference would be matched or exceeded (regardless of which technique produced the better result) under the null hypothesis. A one-sided version of the test, which is comparable to what we use in the rest of the paper, estimates the likelihood of a different outcome under the null hypothesis: that of matching or exceeding the difference of how much better the new (possibly better) technique's observed result is than the current technique's observed result. In the above scenario, a one-sided test produces a 3% figure instead of a 4% figure.

³We do not use Yates' adjustment to compensate for the numbers in the table being integers. Doing so would have made the results even worse.
3 Tests without that independence 
assumption 
3.1 Tests for matched pairs
At this point, one may wonder if all statistical tests make such an independence assumption. The answer is no, but those tests that do not measure how much two techniques interact do need to make some assumption about that interaction, and typically that assumption is independence. Those tests that notice in some way how much two techniques interact can use those observations instead of relying on assumptions.
One way to measure how two techniques interact is to compare how similarly the two techniques react to various parts of the test set. This is done in the matched-pair t test (Harnett, 1982, Sec. 8.7). This test finds the difference between how techniques 1 and 2 perform on each test set sample. The t distribution and a form of equation 1 are used. The null hypothesis is still that the numerator d has a 0 mean, but d is now the sum of these difference values (divided by the number of samples), instead of being x_1 − x_2. Similarly, the denominator s_d is now estimating the standard deviation of these difference values, instead of being a function of s_1 and s_2. This means, for example, that even if the values from techniques 1 and 2 vary on different test samples, s_d will now be 0 if on each test sample, technique 1 produces a value that is the same constant amount more than the value from technique 2.
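A small sketch (ours, with invented per-sample scores) of the matched-pair t computation; scipy's ttest_rel performs the same calculation.

import numpy as np
from scipy.stats import ttest_rel

# Invented per-sample scores for techniques 1 and 2 on the same six test samples.
x1 = np.array([0.90, 0.70, 0.80, 0.60, 0.75, 0.85])
x2 = np.array([0.85, 0.55, 0.75, 0.45, 0.70, 0.70])

diff = x1 - x2
d = diff.mean()                              # numerator of equation (1): the mean difference
s_d = diff.std(ddof=1) / np.sqrt(len(diff))  # estimated standard deviation of that mean
print(d / s_d)                               # matched-pair t statistic
print(ttest_rel(x1, x2))                     # the same statistic via scipy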
Two other tests for comparing two techniques by looking at how well they perform on each test sample are the sign and Wilcoxon tests (Harnett, 1982, Sec. 15.5). Unlike the matched-pair t test, neither of these two tests assumes that the sum of the differences has a normal (Gaussian) distribution. The two tests are so-called nonparametric tests, which do not make assumptions about how the results are distributed (Harnett, 1982, Ch. 15).

The sign test is the simpler of the two. It uses a binomial distribution to examine the number of test samples where technique 1 performs better than technique 2 versus the number where the opposite occurs. The null hypothesis is that the two techniques perform equally well.

Unlike the sign test, the Wilcoxon test also uses information on how large a difference exists between the two techniques' results on each of the test samples.
3.2 Using the tests for matched-pairs 
All three of the matched-pair t, sign and Wilcoxon tests can be applied to the recall metric, which is the fraction of the items of interest in the test set that a technique recalls (finds). Each item of interest in the test data serves as a test sample. We use the sign test because it makes fewer assumptions than the matched-pair t test and is simpler than the Wilcoxon test. In addition, the fact that the sign test ignores the size of the result difference on each test sample does not matter here. With the recall metric, each sample of interest is either found or not by a technique. There are no intermediate values.
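A sketch (ours) of the sign test applied to recall, using counts like those in the example of section 3.3 (28 items of interest recalled only by one technique, 6 only by the other); scipy's binomtest supplies the binomial tail probability.

from scipy.stats import binomtest

# Among the items of interest, count the test samples where exactly one technique finds the item.
n_only_1 = 28   # recalled by technique 1 but not technique 2
n_only_2 = 6    # recalled by technique 2 but not technique 1
# Samples recalled by both or by neither carry no sign and are dropped.

# Under the null hypothesis, each such sample is equally likely to favor either technique.
result = binomtest(n_only_1, n_only_1 + n_only_2, p=0.5, alternative='greater')
print(result.pvalue)   # one-sided significance level for technique 1 being better (about 1e-4 here)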
While the three tests described in section 3.1 can be used on the recall metric, they cannot be straightforwardly used on either the precision or balanced F-score metrics. This is because both precision and F-score are more complicated non-linear functions of random variables than recall. In fact, both can be thought of as non-linear functions involving recall. As described in Section 2.2, precision = R/(R + S), where R is the number of items of interest that are recalled by a technique and S is the number of items (found by a technique) that are not of interest. The balanced F-score = 2ab/(a + b), where a is recall and b is precision.
3.3 Using randomization for precision
and F-score 
A class of techniques that can handle all kinds of functions of random variables without the above problems is the computationally-intensive randomization tests (Noreen, 1989, Ch. 2) (Cohen, 1995, Sec. 5.3). These tests have previously been used on such functions during the "message understanding" (MUC) evaluations (Chinchor et al., 1993). The randomization test we use is like a randomization version of the paired sample (matched-pair) t test (Cohen, 1995, Sec. 5.3.2). This is a type of stratified shuffling (Noreen, 1989, Sec. 2.7). When comparing two techniques, we gather up all the responses (whether actually of interest or not) produced by one of the two techniques when examining the test data, but not by both techniques. Under the null hypothesis, the two techniques are not really different, so any response produced by one of the techniques could have just as likely come from the other. So we shuffle these responses, reassign each response to one of the two techniques (equally likely to either technique) and see how likely such a shuffle produces a difference (new technique minus old technique) in the metric(s) of interest (in our case, precision and F-score) that is at least as large as the difference observed when using the two techniques on the test data.
n responses to shuffle and assign⁴ leads to 2^n different ways to shuffle and assign those responses. So when n is small, one can try each of the different shuffles once and produce an exact randomization. When n gets large, the number of different shuffles gets too large to be exhaustively evaluated. Then one performs an approximate randomization where each shuffle is performed with random assignments.

For us, when n ≤ 20 (2^n ≤ 1048576), we use an exact randomization. For n > 20, we use an approximate randomization with 1048576 shuffles. Because an approximate randomization uses random numbers, which both lead to occasional unusual results and may involve using a not-so-good pseudo-random number generator⁵, we perform the following checks:

• We run the 1048576 shuffles a second time and compare the two sets of results.

• We also use the same shuffles to calculate the statistical significance for the recall metric, and compare this significance value with the significance value found for recall analytically by the sign test.

⁴Note that responses produced by both or neither technique do not need to be shuffled and assigned.

⁵One example is the RANDU routine on the IBM 360 (Forsythe et al., 1977, Sec. 10.1).
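The following is a minimal sketch (ours) of the stratified shuffling just described. The responses produced by only one of the two techniques are encoded as booleans (True if the response is actually of interest); the function and variable names are our own, and the counts in the usage lines come from the example that follows.

import random

def precision(r, s):
    return r / (r + s) if r + s else 0.0

def f_score(r, s, n_int):
    rec = r / n_int
    prec = precision(r, s)
    return 2 * rec * prec / (rec + prec) if rec + prec else 0.0

def randomization_test(only_a, only_b, both, n_int, metric, trials=1048576, seed=0):
    """Approximate randomization for the difference (technique A minus technique B) in `metric`.
    only_a, only_b: booleans for the responses produced by exactly one technique,
    True if the response is of interest.  both = (R, S) counts for responses produced
    by both techniques, which never get shuffled.  Returns the upper bound
    (hits + 1) / (trials + 1) on the significance level (Noreen, 1989)."""
    rng = random.Random(seed)

    def value(resps):
        r = both[0] + sum(resps)
        s = both[1] + len(resps) - sum(resps)
        return metric(r, s, n_int)

    observed = value(only_a) - value(only_b)
    swing = only_a + only_b
    hits = 0
    for _ in range(trials):
        a, b = [], []
        for resp in swing:   # reassign each response to either technique with probability 1/2
            (a if rng.random() < 0.5 else b).append(resp)
        hits += (value(a) - value(b)) >= observed
    return (hits + 1) / (trials + 1)

# Method I only: 28 of-interest and 43 spurious responses; method II only: 6 and 9;
# both methods: 19 of-interest and 5 spurious; 103 items of interest in all.
only_I = [True] * 28 + [False] * 43
only_II = [True] * 6 + [False] * 9
print(randomization_test(only_I, only_II, both=(19, 5), n_int=103,
                         metric=f_score, trials=20000))
# For precision, where method II looks better, swap the two lists and pass
# metric=lambda r, s, n: precision(r, s).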
An example of using randomization is to compare two different methods on finding modifier relations in the same test set. The results on the test set are:

Method    Recall    Precision    F-score
I         45.6%     49.5%        47.5%
II        24.3%     64.1%        35.2%
Two questions being tested are whether the apparent improvement in recall and F-score from using method I is significant. Also being tested is whether the apparent improvement in precision from using method II is significant.

In this example, there are 103 relations that should be found (are of interest). Of these, 19 are recalled by both methods, 28 are recalled by method I but not II, and 6 are recalled by II but not I. The correlation coefficient estimate between the methods' recalls is 0.35. In addition, 5 spurious (not of interest) relations are found by both methods, with method I finding an additional 43 spurious relationships (not found by method II) and method II finding an additional 9 relationships.
There are a total of 28+6+43+9=86 relations that are found (whether of interest or not) by one method, but not the other. This is too many to perform an exact randomization, so a 1048576 trial approximate randomization is performed.

In 96 of these trials, method I's recall is greater than method II's recall by at least (45.6%−24.3%). Similarly, in 14794 of the trials, the F-score difference is at least (47.5%−35.2%). In 25770 of the trials, method II's precision is greater than method I's precision by at least (64.1%−49.5%). From (Noreen, 1989, Sec. 3A.3), the significance level (probability under the null hypothesis) is at most (n_c + 1)/(n_t + 1), where n_c is the number of trials that meet the criterion and n_t is the number of trials. So for recall, the significance level is at most (96+1)/(1048576+1) = 0.00009.
Similarly, for F-score, the significance level is at most 0.014 and for precision, the level is at most 0.025. A second run of 1048576 trials produces similar results, as does a sign test on recall. Thus, we see that all three differences are statistically significant.
4 The future: handling inter-sample dependencies
An assumption made by all the methods mentioned in this paper is that the members of the test set are all independent of one another. That is, knowing how a method performs on one test set sample should not give any information on how that method performs on other test set samples. This assumption is not always true. Church and Mercer (1993) give some examples of dependence between test set instances in natural language. One type of dependence is that of a lexeme's part of speech on the parts of speech of neighboring lexemes (their section 2.1). Similar is the concept of collocation, where the probability of a lexeme's appearance is influenced by the lexemes appearing in nearby positions (their section 3). A type of dependence that is less local is that often, a content word's appearance in a piece of text greatly increases the chances of that same word appearing later in that piece of text (their section 2.3).
What are the effects when some dependency exists? The expected (average) value of the instance results will stay the same. However, the chances of getting an unusual result can change. As an example, take five flips of a fair coin. When no dependencies exist between the flips, the chance of the extreme result that all the flips land on a particular side is fairly small ((1/2)^5 = 1/32). When the flips are positively correlated, these chances increase. When the first flip lands on that side, the chances of the other four flips doing the same are now each greater than 1/2.

Since statistical significance testing involves finding the chances of getting an unusual (skewed) result under some null hypothesis, one needs to determine those dependencies in order to accurately determine those chances. Determining the effect of these dependencies is something that is yet to be done.
5 Conclusions 
In empirical natural language processing, one is often comparing differences in values of metrics like recall, precision and balanced F-score. Many of the statistical tests commonly used to make such comparisons assume independence between the results being compared. We ran a set of natural language processing experiments and found that this assumption is often violated in such a way as to understate the statistical significance of the differences between the results. We point out some analytical statistical tests, like the matched-pair t, sign and Wilcoxon tests, which do not make this assumption, and show that they can be used on a metric like recall. For more complicated metrics like precision and balanced F-score, we use a compute-intensive randomization test, which also avoids this assumption. A next topic to address is that of possible dependencies between test set samples.

References 

G. Box, W. Hunter, and J. Hunter. 1978. Statistics for Experimenters. John Wiley and Sons.

N. Chinchor, L. Hirschman, and D. Lewis. 1993. Evaluating message understanding systems: an analysis of the third message understanding conference (MUC-3). Computational Linguistics, 19(3).

K. Church and R. Mercer. 1993. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1-24.

P. Cohen. 1995. Empirical Methods for Artificial Intelligence. MIT Press, MA, USA.

G. Forsythe, M. Malcolm, and C. Moler. 1977. Computer Methods for Mathematical Computations. Prentice-Hall, NJ, USA.

D. Harnett. 1982. Statistical Methods. Addison-Wesley Publishing Co., 3rd edition.

R. Larsen and M. Marx. 1986. An Introduction to Mathematical Statistics and Its Applications. Prentice-Hall, NJ, USA, 2nd edition.

E. Noreen. 1989. Computer-intensive Methods for Testing Hypotheses: An Introduction. John Wiley and Sons, Inc.
