THE STATISTICAL SIGNIFICANCE OF THE MUC-4 RESULTS 
Nancy Chinchor, Ph.D. 
Science Applications International Corporation 
10260 Campus Point Drive, M/S A2-F 
San Diego, CA 92121 
chinchor@esosun.css.gov 
(619) 458-2614 
INTRODUCTION 
The MUC-4 scores of recall, precision, and the F-measures are used to measure the performance of the par- 
ticipating systems. The differences in the scores between any two systems may be due to chance or may be due to a 
significant difference between the two systems. To rule out the possibility that the difference is due to chance, statisti- 
cal hypothesis testing is used. The method of hypothesis testing used is a computationally-intensive method known as 
approximate randomization. The method and the statistical significance of the results for the two MUC-4 test sets, 
TST3 and TST4, will be discussed in this paper. 
In our hypothesis testing, our objective was to determine whether a system is characteristically different 
from another system. This was achieved by comparing two systems to see if their actual difference in performance 
stands out in comparison with the results for random combinations of their scores. If their actual difference stands 
out, then we know that this difference could not have arisen by chance. 
ELEMENTS OF HYPOTHESIS TESTING 
A key element in hypothesis testing is obviously the hypothesis. Statistics can be used to reject a hypothesis. 
In statistical hypothesis testing, it is important to compose the hypothesis in such a way as to be able to use statistics 
to reject it and to thereby be able to conclude that something of interest is true. The hypothesis formulated to be 
rejected is called the null hypothesis. In Table 1, some elementary terms are defined, including the null hypothesis. 
We axe interested in determining whether two systems are significantly different in their performance on a 
MUC-4 test set. To conclude that two systems are significantly different, we would formulate null hypotheses of the 
following form to be tested for each test set: 
The absolute value of the difference between System X's overall recall (precision, F-measure) score 
for the data extraction task and System Y's overall recall (precision, F-measure) score for the data 
extraction task is approximately equal to zero. 
If this null hypothesis can be rejected, then we cart conclude that the systems are significantly different. 
Another key element in hypothesis testing found within the null hypothesis is the test statistic. A test statistic 
is a function which can be applied to a set of sample data to produce a single numerical value. A simple example of a 
test statistic is recall, which is a function of the number correct, the number partially correct, and the number of pos- 
sible fills. The test statistic we used in our hypothesis testing is the absolute value of the difference in recall, preci- 
sion, or F-measure of systems. Observations are instances of a set of random variables. An example of an observation 
is the four-tuple for a MUC-4 message consisting of the number possible, actual, correct, and partially correct. This 
four-tuple plays an important role in our application of the approximate randomization method. 
30 
O Null hypothesis The hypothesis that a relationship of interest is not present. 
Examples (informal): 
1) system X and system Y do not differ in recall 
2) system X and system Y do not differ in precision 
3) system X and system Y do not differ in F-measure for 
equal weighting of recall and precision 
0 Test statistic A function which can be applied to a set of sample data to produce 
a single numerical value. 
Examples: recall, precision, F-measure, difference in recall 
O Observations Instances of values of a set of random variables. 
Example: number possible, actual, correct ,partially correct 
O Significance level The prebabiHty that a test statistic that is as extreme or more 
extreme than the actual value could have arisen by chance, given 
the null hypothesis. 
The lower the significance level, the less probable it is that the null 
hypothesis holds. In our case, the lower the significance level, the 
more likely it is that the two systems are significantly different. 
Table 1: Definition of Terms 
The final key element in hypothesis testing is some means of generating the probability distribution of the 
test statistic under the assumption that the null hypothesis is true. Instead of assuming a distribution, such as the Nor- 
real distribution, we empirically generate the distribution as illustrated in Figure 1. As shown, the significance level is 
the area under the distribution curve bounded on the lower end by the actual value of the test statistic.The significance 
level is the probability that a test statistic that is as extreme or more extreme than the actual value could have arisen 
by chance, given the null hypothesis. Thus, the lower the significance level, the more likely it is that the two systems 
are significantly differenL 
Relative 
frequency   
Empirical distribution 
iiM ................ 
actual 
Absolute value of the 
y difference In recall 
Figure 1: Histogram of the Absolute Value of the Difference in Recall Scores 
31 
RANDOMIZATION TESTING 
Traditional statistical analysis requires knowledge of or an assumption about the distribution of the data in 
the sample. We do not know the distribution for our sample ~d prefer to not make an assumption about it. Computa- 
tionally-intensive methods empirically generate the distribution. Exact randomization testing generates all of the log- 
ically possible outcomes and compares the actual outcome to these. This amount of generation is often impractical 
because of the large number of data points involved, so approximate randomization is used instead. A confidence 
level is calculated to indicate how close the approximate randomization is to the exact randomization. 
The method of approximate randomization involves random shuffling of the data to generate the dislribu- 
tion. We used the approximate randomization method described by Noreen in \[1\] with stratified shuffling to control 
for categorical variables that are not of primary interest in the hypothesis test. 
The first step in approximate randomization is to select a test statistic. Our formulation of the null hypothesis 
indicates that the test statistic is the absolute value of the difference in recall, precision, or F-measure. The next step is 
to input the data. The data for each test set consists of the four-tuples of number possible, actual, correct, and partially 
correct for each message for each system. The actual statistic: is calculated for the pair of systems as they come under 
scrutiny. The desired number of shuffles is set to 9,999 because it takes about eight hours to run the test for each test 
set and the confidence levels can be easily looked up in a table given in \[1\]. We have arbitrarily chosen the cutoff con- 
fidence level to be a conservative 99%. In a test of how the confidence levels were affected by the number of shuffÊes, 
9,999 shuffles produced slightly higher confidence levels than 999 and were worth the 16-fold increase in computing 
time. Once the desired number of shuffles is set, the counters for the number of shuffles, ns, and the number of times 
the pseudostatistic is greater than or equal to the actual statistic, nge, are set to 0. A loop then increments ns until it 
has exceeded 9,999, the desired number of shuffles. The first step in this loop is to shuffle the data, which is the first 
major operation that occurs during approximate randomization. Table 2 contains an outline of the major operations 
involved in approximate randomization. Figure 2 illustrates the stratified shuffling used for the analysis of the MUC- 
4 results. 
i 
APPROXIMATE RANDOMIZATION 
1. Shuffle ns times (ns is 9,999 in our case). 
2. Count the number of times (number greater than or equal, nge) that 
Istat....~.- atat...,,~..I ~ latatA- mt.I 
(stat can be recall, precision, or F.measure in our case). 
3. The estimate of the significance level is (nge + 1) / (ns + 1) 
(the l's are added to ensure the test is valid). 
4. The confidence level is found by calculation or table lookup. 
Table 2: Operations in Approximate Randomization 
32 
SHUFFLING 
SYSTEM A \] 
(pos.zA aCt zA corlA parle 
(pOSlooA aCllooA cor.zooA parloo~ 
rec A preA F-meas A 
I SYSTEM B I 
(posiB actiB COrZB pariB) 
(pOS.zooB act.zooB cOr.ZOOB par.zOOB) 
re% pres F-measB 
Coin FHp (100 times) 
, heads tails , PSEUDO SYSTEMA I ~ PSEUDO SYSTEMB I 
(pOSnx 7nx COrnx Pa ~ ~pOSny aCtny COrny 
reCpseudoA prepseudoA F'meaSPseudoA reCpseudoB prepseudoB F'meaSpseudoB 
Figure 2: Shuffling for MUC-4 
33 
In Figure 2, data is shuffled by exchange of the systems' message scores depending on the outcome of a 
computer-simulated coin flip. After 100 coin flips, one per naessage, the absolute value of the difference in the metrics 
of the resulting pseudosystems can be compared to the corresponding absolute value of the difference in the metrics 
of the actual systems. The inequality for the comparison is shown in operation 2 of Table 2. The value ofnge is incre- 
mented every time a randomized pair of pseudosystems salLisfies this inequality. The significance level is calculated 
according to the formula in operation 3 of Table 2. The corresponding confidence level is then found in the appropri- 
ate table in \[1 \]. 
According to Noreen, 
Randomization is used to test the generic null hypothesis that one variable (or group of 
variables) is unrelated to another variable (or group of variables). Significance is assessed 
by shuffling one variable (or set of variables) relative to another variable (or set of vari- 
ables). Shuffling ensures that there is in fact no relationship between the variables. If the 
variables are related, then the value of the test statistic for the original unshuffled data 
should be unusual relative to the values of the test statistic that are obtained after shuf- 
fling. 1 
In our case, the four-tuple of data associated with each message for each system is the dependent set of vari- 
ables that is shuffled. The explanatory variable is the system. Shuffling ensures that there is no relationship between 
the differences in the scores and the systems that produced them, i.e., that the differences were achieved by chance. If 
they were not achieved by chance, then the value of the actual test statistic will be far enough out on the tail of the 
empirically-generated distribution to be significant. The area under the distribution curve with the lower bound of the 
actual test statistic will be smaller than the cutoff in this case. We have arbitrarily chosen the cutoff to be 0.1 because 
we are able to distinguish reasonable groupings of systems at this level. At lower cutoffs, too many systems are not 
significantly different. The choice of a cutoff has traditionally been based on the practical impfications of the choice. 
We need more data before we will know more of the implications of the cutoff given the sample sizes that we now 
have. 
The stratified shuffling that was limited to exchange at the message level was the most conservative 
approach to shuffling because it eliminated the effect of the varying number of slots filled per message in the key. We 
did not want to introduce the effect of this nuisance variable. Instead, we wanted the simplest, most straightforward 
method to be used for the initial introduction of approximate randomization in this application. Further experimenta- 
tion with the parameters of stratification, the cutoff for significance level, and the cutoff for confidence level may 
occur in the future. 
Examples 
These examples were obtained from \[3\] and are meant to give an intuitive sense of the results of the approx- 
imate randomization method. In general, ff two systems hawe. a large recall difference, they are likely to produce more 
homogeneous pseudo systems; nge will be small and the significance level will be low. On the other hand, if two sys- 
tems have similar recall scores, they are likely to produce a high nge and a larger significance level, giving a lower 
likefihood that the differences are significant. 
A pair of messages that show consistent differences across a range of messages will be more likely to be sta- 
tistically significantly different in their score than two systems which have a small number of differences appearing in 
just a few messages. Consider the following systems reporting precision results for a set of 100 messages. System A 
has a score of 15/20 on each of 50 relevant messages and no spurious templates, for a total of 750/1000 = 75% preci- 
sion. System B has the identical score on each of the relevant messages except for one, where its score is 0/20. Sys- 
tem B has a precision of 735/1000 - 73.5%. In the random shuffle, one of the pseudo systems will have the "0 
precision" template. The pseudostafistic will always be the same as the measured absolute difference between the sys- 
tems, i.e., 1.5%, so the significance level is 1.0 and the difference is not statistically significant. 
1. From page 9 of \[1\]. 
34 
Let us suppose that there is a System C with a precision score of 18/20 on each of 50 relevant messages and 
with no spurious templates. Any random shuffle of Systems A and C is likely to produce a smaller difference than the 
absolute value of the difference between A and C. The significance level will be extremely close to zero, indicating 
with virtual certainty that the two systems are significantly different. 
WHAT WE CANNOT CONCLUDE FROM APPROXIMATE RANDOMIZATION 
It is the systems themselves that set the conditions for the shultle in the pairwise comparison. The same dif- 
ference in scores between pairs of systems may not equate to the same significance level when different systems are 
involved in the pairwise comparison. The statistical test measures the difference from what would have occtm'ed ran- 
domly. Each pair of systems provides a different set of conditions for the stratified shuffle. Thus, we cannot conclude 
from the results of the approximate randomization tests that the evaluation scores are generally significant within a 
certain range of percentage points. This cautionary note has further consequences which are discussed below in the 
results for TST3. 
We also cannot conclude from the approximate randomization test whether our sample is sufficiently repre- 
sentative of newswire reports on activities concerning Latin America. In addition, we do not know whether the size of 
the sample is large enough for all of our purposes. The test sets were carefully collected in such a way as to reduce 
bias by selecting the number of articles concerning certain countries based on the percentage of articles about those 
countries in the entire corpus for the associated time period. In addition, the test articles were selected based solely on 
the order in which they appeared in the larger corpus and not on their contents. The size of the test sets was deter- 
mined by the practical limits of time allotted for composing answer keys, running the participating systems on the test 
sets, and scoring the results. Even with these precautions, however, we still do not know whether the random sample 
of test messages is representative of the population from which they were drawn. Our statistical testing is not aimed at 
answering this question. 
WHAT WE CAN CONCLUDE FROM APPROXIMATE RANDOMIZATION 
TST3 
The results 'for TST3 of MUC-4 are presented in this section. TST3 is the official test set for MUC-4. The 
systems otticially report scores for recall, precision, and the F-measures for three weightings of recall and precision 
based on the ALL TEMPLATES row in the summary score report. 2 Approximate randomization was applied to each 
pair of participating systems. The approximate randomization method applied to the TST3 results randomly shuffles 
the message-by-message scores for the two systems being compared and checks to see how many times the test statis- 
tic is as extreme as or more extreme than the actual test statistic. The lower the number of times the pseudo test staffs- 
tic is greater than or equal to the actual test statistic, the lower the significance level and the more likely the actual test 
statistic did not occur by chance. 
The results of this computer-intensive testing are reported in two formats. The first format shows the signifi- 
cance levels for the pairwise comparison for each of the scores reported (Figures 3 - 6). The second and more infor- 
mative format shows the significance groups or clusters based on the cutoffs at 0.I for significance level and 0.99 for 
confidence level (Figures 7 - 11). The significance groupings represent groups of systems which were not considered 
to be significantly different from each other according to these cutoff criteria. Please note that the F-measures are cal- 
culated using floating point arithmetic throughout and differ slightly from the F-measures in the official score reports, 
which were calculated based on the integer values of recall and precision. These more accurate F-measures depicted 
in Figures 9 - 11 appear in tabular format in Appendix G. 
2. For further information on the score report, see the paper "MUC-4 Evaluation Metrics" in these proceed- 
ings. 
35 
iii;ii~i~i~i I . . . ...... ooo o ~ 
:i:: ii:i:i:::i:! """""'" I 
............. 
!ii!:li~ili * • * * .... * * * " ° * .... c~ ' 
ii::i.~ii::i: 1 I ~?:~ ............ 
ii;~ii ............ ~ .... c- -- o:o:o:-- - o- g_ 
::::;:::::::::::: i """"""" ' ' 
~iii~\]i::i::i::i::iii I .......... - - i,~ ~.! - - - , ............... ,'iiii oo-o 
oo i?~:m:~:~: . • : ~ ° o ,~ ~ o o • • • • • • , • . • • 
..... a 
~,~,~ii',',iili~,i'~""':" .':. • ...... • • • " • .... • • • " ....... ~'i :'°~o°° ......... o ~ o ~ o~ o°~ o ~ o ~ o g o ~ o ~ 
_ o." o~ o~"- ~i~ ..... 
i " • " " " • .... ~ _oi ~?i: .:\[:: 0 0 
:! .... ~ .... I ..... c~ I ...... i::::ii::::i::::::ii::iiiii ! o o 
.......< ;;;: .......... 
~',~',;',~i~i~i~ ..... ........ 
....... i 
"''"""" • • • . . . • • • • ~rO O0 O,q" OC, I O0 ~D u'),~r 00 
ii!iii!i~ii!iiiii i ...... • • • - ." ." ." ." ." ." ." ." ." I*~00 0¢'~10 ¢~ 0000 ~ro ~ u') O0 O00000 ~ 0 tD . 0000 
!iii|iiii 
i:!:ilJ,l~!:!:! ,,i,i~i,li . . ooo o ooo o ooo o ooo o oO __ ~ 
! Ioo oo oo oo o ~.-~ 
.................................... ~; ~ ~o o~ ~ ~ o~o; Io;o~ o~o ~ ~ oOO~ ~ ~ ...... "''' . • , , . . • * * !:!:!:!:!:!:~:!:~:i ~ = m 
O0 ~0 O0 O0 O0 ~U~ O00~ O0 O0 
::::::::: ::::::::::::::::::: ~\[~ ~. ::::::::::::::::: ??;?:;::: 
ioo o° "'i ........... °! 
• .:.:.;-;-;.:-;-: I 
i:i~i~i::i:~i ..... iiii!iiiiiiiiiiiiiiii ~'° ~ o o o ~. ~ ~ o ~- . o., o o o o,- o o. o o ~ !!iiiiiiiiii!ii!i!!ii!!iiii ~ o o o o o o o ~ o o ~ ~ o ,, o o o o ~ o o o o o. 
Oe~ 00i00 00 O0 O0 ~'0 'ql'O 00 04e~ O0 00 0¢~ 
i:i:i:~:!:i 
::~ii~:i::i~ ~°-9 oo oo oo oo -- ~o oo oo ~o oo ~,~ ~o oo ~i:! :~,~i:i:i~ - - i 0 0 0 0 0 0 0 0 0 0 ~ 0 0 0 0 0 ~ O) 0 0 ~ ~ 0 0 0 0 
!!i~ii~ii~!ii ......... oo oo c~d do oo oo o¢~ ¢~d od do ¢~c~ oo do do 
" rrr ...... 
iiiiiiiiiiiiiill o o o o ~ - o o o ~ o~ ~ o o 
00 O0 ¢'~ O0 0 ~ii~ : o- o oo o ~oo eo o o o oo o o o~ ~ o °°o o °o ° 
ii!iiii,i~iiiiiiiiii o°: :: :o °. :: o°: :: :: o°o °. :: :: o°: ooo ° oO~ :: :o °. 
...............- . .-........ .-......... ........... :::::::::.:~i~::::: !::-?'.~:~: : :::liT::: 
,~!- "!!:::::,:~!i 
0 
.o 
t= 
°~ 
\[,,T., 
36 
i:i:i:i:!:!:!: i ::::::::::::::::: 
iiiiiii~ I ................. 
,~,~,~, . - : " iiiiiiiiiiiiiiiii 
iiiiii~i " ~ ' ................ • : :!:!:!:!:~:~:~:~: 
iiiiiii~ : ° ° ~ 
* ~ 0 0 
0 0 0 0 0: 
0 0 0 0 
" iiiiiiiiiiiiiiiii ~ ~ o o ~ ~ o 
• iiiiiiiiii!ii!i!i o o o o o o o o 
~| o o o ~i 0 0 0 0 0 o 0 0 0 0 
ooooog 
iiiiiiii ~ 
O: 0 0 0 0 0 "~ 0 0 O~ 0 0 
iii~ :ii!iiiiii!il ~ o . o--o o- 
0 0 0 0 0 0 O! 0 i 0 0 
3? 
I+++IIIIIIIIIIIIIIII ............. +: .... : : : : : : +++++++++++++++ +++: : ::''2" "" ++++++++++++++++ I++I l+l+i+ I+ i+ I+ i+l ° I+I l t+l+l++ I 
 B/iHBHBHBI IBBiB/H WHHnHHiHBI IBHBmHHH 
 HiiHHBHB/I IHH|HHHH  RHHHHHHBHliHUHHHH 
 IBRBBHHHBHIHHHHH WHHBHIHIBIHHHHHHH 
 IHHHHHHHUHHHHHHH UHHHHHHH/HHHHHHHH 
EHBBBHHUHHHHHHHHH  BHHHB/HHHHHHHHHH 
|HHBIUHHHHHHHHHHH UHHHIHHHHHHHHHHHHH 
UHIIHHHHHHHHHHHHHH |HmHHHHHHHHHHHHHHH 
UIHHHHHHHHHHHHHHHH 
"S 
C~ 
38 
o ~. . 
V V 
2 W ~ 
~o~ 
40 '1~ 
C ~ 
~.C ~._~ 
i 
 UHHHHHHHHHHHHHH| | RH IN  HH n n H  
t= 
0 
"S 
"S 
39 
10o 
Recall 
90 
80 
70 
60 
50 
40 
30 
20 
10 
0 
® 
   
i i i i i 1 i i i i I I f I 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 
System 
Figure 7: Significance Groupings for Recall at the 0.10 Level with 0.99 Confidence for TST3 
100- 
Precision 
90 
O0 
70 
60 
50 
40 
30 
20 
10 
0 
Q- 
s I e i s | i 0 1 2 3 ' ; 15 ; 8 ; 
10 11 12 13 1; 15 16 1; 
System 
Figure 8: Significance Groupings for Precision at the 0.10 Level with 0.99 Confidence for TST3 
40 
F-Equal 
100 
go 
80 
70 
60 
50 
40 
30 
20 
10 
0 
® 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 
System 
Figure 9: Significance Groupings for F-Measures (P&R) at the 0.10 Level with 0.99 Confidence for TST3 
lOO 
F-PTwice 
go 
80 
70 
60 
50 
40 
30 
20 
10 J 
i , i i i i , | . | i | w e ! ! | 
0 1 2 3 4 5 6 77 8 9 10 11 12 13 14 15 16 17 
System 
Figure 10: Significance Groupings for F-Measures (2P&R) at the 0.10 Level with 0.99 Confidence for TST3 
41 
F.PHalf 
100 
gO 
SO 
70 
SO 
50 
40 
30 
20 
10 
0 
® 
®0 
! i | i ! i | i • | i • i • i • i 
0 1 2 3 4 5 6 7 $ 9 10 11 12 13 14 15 16 17 
System 
Figure 11: Significance Groupings for F-Measures (P&2R) at the 0.10 Level with 0.99 Confidence for TST3 
Although most of the information in the figures is self-explanatory, there are a few items of interest to be 
pointed out. The first concerns the kinds of point spreads that occur within and between significance groupings. For 
example, in Figure 7, the UMBC-Conquest team has a recall score of 2.5, which differs from the next higher recall 
score of 6.6 for USC by 4.1 percentage points. Whereas, the first large group of four systems, MDC (20.4), NMSU- 
Brandeis (21.9), LSI (23.0), and SRA (26.9), has a spread of 6.5 percentage points. This difference between what sep- 
arates two significantly different systems and what separates systems which are not significantly different illustrates 
the cautionary note that this method is unable to determine that the evaluation scores are generally significant within 
a certain range of percentage points. 
Another anomaly in the results is in the precision scores. In Figure 8, the UMBC- Conquest team has a pre- 
cision score of 20.8. This score is not significantly different ~)m seven other sites, although those seven sites divide 
into significance groupings of their own. The reason for this is the low number of actuals generated by the system. 
Figure 9 illustrates two more examples of how the conditions set up by the systems influence the outcome of 
the statistical significance testing more than the actual test statistics themselves. The first example involves MDC, 
Paramax, and SRA and the second involves GE, GE-CMU, and UMass. 
For the first example, the F-measures are as follows: lVIDC 24.33, Paramax 29.03, and SRA 29.33. Paramax 
and SRA are not significantly different at the 0.1 level with at least a confidence of 0.99; MDC and SRA are not, but 
MDC and Paramax are significantly different. If the cautionary note did not hold, then these results would be anoma- 
lous, because SRA's F-measure is higher than Paramax's. Therefore, we would expect that MDC and Paramax would 
not be significantly different because MDC and SRA are not. However, the F-measures are not the only factors affect- 
ing the results of the significance testing. The conditions set up by the systems during the shuffling have more influ- 
ence than the raw results. This case illustrates that we cannot conclude from an approximate randomization test that 
the evaluation scores are generally significant within a certain range of percentage points. 
In the case of GE, GE-CMU, and UMASS, we see a similar illustration of the cautionary note. The F-mea- 
42 
sures are: GE 56.01, GE-CMU 51.98, and UMASS 51.61. The significance levels are GE and GE-CMU: 0.0415, GE 
and UMASS: 0.0994, and GE-CMU and UMASS: 0.8918. GE and GE-CMU are significantly different at the 0.05 
level and GE-CMU and UMASS are not significantly different even at the 0.1 level. The significance levels show that 
GE is significantly different from GE-CMU and UMASS at the 0.1 level. However, the confidence level for the sig- 
nificance test for GE and UMASS does not meet the 0.99 cutoff because it is only 0.635. So the pair cannot be consid- 
ered significantly different by our criteria. The cautionary note explains why this situation could arise even though we 
would expect that the significance level would have been higher for GE and GE-CMU than for GE and UMASS. 
TST4 
The results for TST4 of MUC-4 are presented in this section. TST4 is a second test set chosen from an ear- 
lier time period. It contains more straightforward messages concerning terrorism with fewer irrelevant messages than 
TST3. The significance results are presented in the same formats as those for TST3. The first format is a matrix show- 
ing the significance levels for the pairwise comparison for each of the scores reported (Figures 12 - 15). The second 
and more informative format is a scatterplot showing the significance groups or clusters based on the cutoffs at 0.1 for 
significance level and 0.99 for confidence level (Figures 16 - 20). The significance groupings represent groups of sys- 
tems which were not considered to be significantly different from each other according to these cutoff criteria. 
SUMMARY 
The results of the statistical analysis of the MUC-4 scores on TST3 and TST4 have been presented with an 
explanation of the computationally-intensive method of determining the statistical significance of the differences 
between systems. The method of approximate randomization was used to generate the probability distribution of the 
differences between systems. The statistical significance of the actual results can be ascertained by calculating the 
significance level during the generation of the distribution and looking up the confidence level in a published table. 
Knowing the statistical significance of the MUC-4 results allows us to draw valid conclusions concerning the perfor- 
mance of the participating systems. 
ACKNOWLEDGEMENTS 
I would like to thank Christopher Cooper for our discussions concerning the adaptation of approximate ran- 
domization to the MUC-4 problem and Lynette Hirschman and David Lewis for their help in explicating the statisti- 
cal method for the computational linguistics community. 
REFERENCES 
\[1\] Noreen, E. W. (1989) Computer Intensive Methods for Testing Hypotheses: An Introduction. New York." John 
Wiley & Sons. 
\[2\] Efron, B. and R. Tibshirani (1991) "Statistical Data Analysis in the Computer Age" Science, vol. 253, pp. 390 
- 395. 
\[3\] Chinchor, N., L. Hirschman, and D. Lewis (1991) "Evaluating Message Understanding Systems: An Analysis 
of the Third Message Understanding Conference (MUC-3)" submitted to Computational Linguistics. 
43 
I:iiiiiii!!iiiiiiil ........ .:: .::. 
:!:: :: -- . 
~:lEi~!~ * I * * * • 
liiii~iii .......... . ...... 
:iiiiiiiiiiiii ......... 
I ~ Q g 
0 
: . :: ::: 
ii!ii!iii!iiiiiiiii 
::::::::::::::::::: ::::::::::::::::::: 
iliiii|iiiiii c ......... 
I 
iiiiii~!iiiii ................... * • ~ ~':" • " • i• " 
~!~!iiiiiiiiiii~ii\[ 
i!!!ii~i!iii!., c: • :!:~:~ 
::~: • 
:. .... 
Q ~ ~ ~ g I Q 0 
• • • • • • • • • • i • * • ........... 
• ii!!iiii!ili!iMil :~:~:~:~:~:~:~:~:~:~ 
I i:!:ii~:iii~i:i:i:!: 
.. ::.: :: :- • ~ 
c~ 
• ". ~'~ 
+:.:.:.:.:.:+:. 
..... ' 
iiiiiiii ..... 
........ oi~ ,~o oo 
. . . o~ ~ ooOooO ......... o.I ...... 
° i ::::::::::::::::::: iiiii 
• • • • * * ~. 'qrO Or,,,. ¢~/o ~-t~J or,,,. 
• • • . . . • • ~0 O0 oo f.,.,~l;" O~ 
." • • :• ~ ~ ~o~ 
• . . o ~oo°'~oooo oo 
• . iiiiiii!iiii!i!iiil o o ~ o ~ ~ ~ ; ." ~ ~ ~ ~ - ii ~ ooo. ooO ~.o ~.~ ~.~ Ooo~ " ooO.~. 
• . .~. . . .~. oO. ~o oO.oO. : ...... !.. 
OO oo O0 O0 
~ 0 ~ ~ o~ ° o° ~ ~ ~ ° ~ ° ~ ~ o~ ~ ° ~ oOO ° ~ ~i: ~" 
• " " iiiiiiiiiiiiiiiiili ................ ~ " :'::: :::::: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
iiiiiiiii|!iiiiii - o ~ o o_ o o ° ~ _ o o o ~ o ~ o o,. ~ 
• • . 0 f~,.O O0 O0 ~'0 0 . O0 O0 O0 oar 0~: 
i:!:!:~:~:::: • • . . • • • . :::::::::: ~ 0 ~ ~ ~ ~ 0 0 0 0 0 0 r-. 0 0 0 0 0 0 0 0 0 : 0 0 -- 0 ::::::~,~.t:::!:~ . • • • • • • • ::::::::::::::::::::: . o : ¢~ 'q" ~,o o o 0 0 0 0 0 0 0 0 0 0 o 0 0 o o 0 
Oo oo c~o c~o o ~ c;o oo oo oo oc~ oc~ oo 
..... o~ oolo ° # ~" ,~ ;~ . ~o 
oo Oo~ o ~ ~' o°o o.o. ................ !m.~ "o o 
oo oo oo oo gO oo !o .:.:.:.:.:.:.:.:.:.: 0 0 0 0 0 0 0 0 0 , 0 0 
"" ' ' . "" "''" I 
......... li!i!!~!~i~iiiiiiiii i ii 0 ~" ~' 
::iii::::~::::~::~:: • " o o o ° o ° ~ o ° o ° - '~ ~ o ° ~: o ~ o ° ~ ~ o ° o ° ~ 
oo o ....... ~ ........... ~. "1 ° ~ ~ o°o °- ooo o oo °. ~o ° c¢i~ o°.o °. ~ o°~ oo°. ° 
I !!i!iii~ii!~i!ii.i ::::::::::::::::::: i I 
i~i~ii~i,.. ,,,,,,,,,,,~,,,,,~, oo ~ o o o o o o o o o o o ~ ; ~ ; e ;; ; e ;; ~; oo oo oo oo oo oc~ oo oo oo o~ oo oo~ 
........... ::::::::::::::::::::::::::::::::: co o o o o o o o o o o o o o o o o o o o co o o o o io 
i! . ~ ......... = . ; .oo.-oo oo°° oo oo oo 
U~J 
E- 
¢) 
44 
|gii/li//iii/iiii|  gili|ili/lliii|| 
 lilHiiiililili/| WHHBHHHHHHHBHUHH 
 E/lillliliilUHH| mn/oinnnnulugHgH 
EBBBBHBBHHH|HHHHHH  HBBHHBHHH|HHHHHHH 
 |H|HHHHH|HHHHHHHH  BHHHHBHIHHHHHHHHH 
 HHHBHH|HHHHHHHHHH  HH||H HHHHHHHHHHH 
|B/HHUHHHHHHHHHHH  HHHHHHHHHHHHHHHH 
 NI|HHHDHHHHHHHHHH |H|HHHHHHHHHHHHHEH 
 UHHHHHHHHHHHHHHH 
(.,,~ 
~3 
"S 
M 
L.., 
45 
|iiilililililiiii| WHHHi/iBBHI/iHi|i 
U2 0 
V V 
..~ .~ V 
~o~ 
• I ,ll 
W 4m ~ 
~'~ 
., 
\[.. 
J~ 
.o 
r4 
L~ 
{/) 
r J, 
E-, 
46 
iiiiiiiiij • • • : ......... :::::::w:: 41 e • • 4s • • e e 41 • • 41 
ii~iiiiii~i 41 .is e .... e • • • • 
0 
....... i 
! " . i~i,i~i,,ii,i,i~ ~iiiii~ ° ° ii!! iii 
.............. I ° ° iiiiiill ~ iiii!iii~ . . ~ ........ o 
! .............. ~iiiii!i!!!!~!!!i i 
iiiiii!iiiiiiii i i!iiiiiill ~ 
!iiiiiiiiE :::~:~:::~:~ o r,- " j . :~::::~: o .j 
iiiiiiiiiiiiiii! o o 
• ~::::::::! ~ 0 0 0 ; 
• .:-:-:.:.:.:.:-:: 0 0 0 0 1 
iiiiiiiii~ ~ "! - 
• :::;;::: . ~ 0 O 0 iliiii?!i~ ......... o 
........... : ..... o o o c~ 
i!i! iiii!!!iiiiiiiii o o o o o 
i; ~ ~ ° ° ° ° . i 
~iii\[ii o o ~ o o o o 
iiiiiii" ° ° ° ° 
ii ..... 
o N ~ o o o o ~I 
........ ~ 0 0 0 0 0 0 0 0 I~ ,~,~,~,~ . . . . ~ ~ o o o o o o o o 
:.:::::'.. . " 0 0 0 0 0 0 0 0 ~,1 :':':':':':':" ~ • • 0 0 • 
:.:.:+:.:.:.: 0 0 0 0 0 0 0 0 0 0 ~ !!!!!:! ................... 
• • 0 ~, 0 0 ~f" 0 0 0 ,C~ 0 
0 0 0 0 0 0 ~ ~ 0 0 0 0 0 
• * * o c~ o o o o o o o o o o o 
0 0 0 0 0 0 ~ 0 ~,1 O 0 
~i~"i 0 0 0 . 0 0 0 0 ~ 0 0 
!1ii~ " i " o o o o o o o o o o o o 
~ :" ................. ~ o o o o o o o o o o o o o iiiiiiiiiiiiiiiii - 
o o o o o o o o o o o o o o 0 0 0 0 0 0 0 0 0 0 0 0 0 O 
":':':':':':': :~:i:~:~:~:~:~:~: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 ~ 0 0 0 0 ~0 ~ ~ eO 0 ~ ~ 0 
0 0 0 0 0 0 0 0 ~ 0 ~ ~ 0 ~ ~ 0 
• ...iiii!!~!~i~! iiii!i!iiiiiiiiil o o o o o o o o o o o o o o o o 
E- 
c~ 
0~ 
L~ 
~L 
rcj 
r~ 
L~ 
4'7 
100 
Recall 
90 
80 
70 
60 
50 
40 
30 
20 
10 
0 
® 
® 
® 
• • • • • • • • i • • • • • • • J • 
0 1 2 $ 4 6 6 77 B 9 10 11 12 13 14 16 16 17 
System 
Figurel6: Significance Groupings for Recall at the 0,,10 Level with 0.99 Confidence for TST4 
100 
Precision 
20 
80 
70 
20 
60 
40 
30 
20 
lO 
o 
z o 
System 
Figure 17: Significance Groupings for Precision at the 0.10 Level with 0.99 Confidence for TST4 
48 
lOO 
F-Equal 
90 
80 
70 
60 
50 
40 
30 
20 
10 
0 0 | w g i i i i i ~ • • i • w i w | 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 
z o 
System 
Fignrel8: Significance Groupings for F-Measures (P&R) at the 0.10 Level with 0.99 Confidence for TST4 
lOO 
F-Ptwice 
90 
50 
70 
60 
50 
40 
30 
20 
10 
S 
0 • • 0) 
0 1 2 3 4 S 6 7 8 9 10 11 12 13 14 15 16 17 
System 
Figure 19: Significance Groupings for F-Measures (2P&R) at the 0.10 Level with 0.99 Confidence for TST4 
49 
F-Phalf 
100 
90 
80 
70 
60 
SO 
40 
30 
20 
10 
0 
® 
® 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 18 16 17 
System 
Figure 20: Significance Groupings for F-Measures (P&2R) at the 0.10 Level with 0.99 Confidence for TST4 
50 
