MUC-3 EVALUATION METRIC S
Nancy Chinchor, Ph .D .
Science Applications International Corporatio n
10260 Campus Point Drive, M/S 1 2
San Diego, CA 9212 1
(619) 458-2728
INTRODUCTION
Purpos e
The MUC-3 evaluation metrics are measures of performance for the MUC- 3
template fill task. Obtaining summary measures of performance necessitates the los s
of information about many details of performance . The utility of summary measure s
for comparison of performance over time and across systems should outweigh thi s
loss of detail . The template fill task is complex because of the varying nature of th e
fills for each slot and the interdependencies of the slots . The evaluation metrics used
in MUC-3 were adapted from traditional measures in information retrieval and signa l
procesing and were still evolving to fit the more complex data extraction task of MUC-
3 when the evaluation was performed . The scoring of the template fill task and th e
calculation of the metrics used in MUC-3 will be described here . This description i s
meant to assist in the analysis of the MUC-3 results and in the further evolution of
the evaluation metrics .
Metric s
The measures of performance chosen for use in MUC-3 were recall, precision ,
fallout, and overgeneration .
	
Recall, precision, and fallout were adapted based o n
their use in information retrieval . Overgeneration was developed as a measure fo r
MUC-3 . Recall is a measure of the completeness of the template fill . Precision is a
measure of the accuracy of the fill . Fallout is a measure of the false alarm rate fo r
the slots which can be filled from finite sets of slot fillers .
	
Overgeneration is a
measure of spurious generation .
	
These measures will be described in greater detai l
below .
SCORE REPORT
A semi-automated scoring system was developed for MUC-3 .
	
The scoring
system displayed the answer key templates, the response templates, and the message s
using a flexibly customized emacs interface . During scoring, the user was asked to
enter the score for displayed mismatches between the key and the respons e
templates. Fills could generally be scored as matches, partial matches, or mismatches .
Depending on the type of slot fill, the scoring system may or may not have allowe d
full credit to be given .
	
The interactive scoring was carried out following well -
defined scoring guidelines .
	
Depending on the scoring guidelines, full, partial, or n o
credit may have been allowed for each mismatch .
	
After the interactive scoring wa s
complete, the scoring system produced an official score report containing template
by template score reports and a summary score report for the official record . A
sample summary score report produced for human comparison against the ke y
appears in Figure 1 . The following sections discuss the contents of the score report .
17
* * * TOTAL SLOT SCORES * * *
SLOT
	
POS ACTICOR PAR INCIICR IPAISPU MIS NONIREC PRE OVG FAL
template-id
	
118 1151114
	
0
	
01 0
	
01 1
	
4 391 97 99
	
1incident-date
	
114 1101 90 10 101 .31 101 0
	
4
	
41 83 86
	
0incident-type
	
118 1141112
	
1
	
11 0
	
11 0
	
4
	
01 95 99
	
0
	
0
category
	
90 1091 88
	
0
	
01 0
	
01 21
	
2
	
71 98 81 19 14
indiv-perps
	
106 611 59
	
0
	
21 10
	
01 0 45 501 56 97
	
0org-perps
	
71 681 58
	
0
	
11 15
	
01 9 12 481 82 85 13
perp-confidence
	
71 681 56
	
1
	
21 12
	
11 9 12 481 80 83 13
	
2
phys-target-ids
	
59 571 54
	
3
	
01 14
	
31 0
	
2 771 94 97
	
0phys-target-num
	
41 411 39
	
0
	
21 0
	
01 0
	
0 771 95 95
	
0phys-target-types
	
59 571 52
	
4
	
11 11
	
41 0
	
2 771 92 95
	
0
	
0human-target-ids
	
145 1311129
	
2
	
01 33
	
21 2 14 231 90 99
	
2
human-target-num
	
94 881 79
	
6
	
21 0
	
61 1
	
7 231 87 93
	
1
human-target-types
	
145 1311126
	
2
	
31 24
	
21 2 14 231 88 97
	
2
	
0target-nationality
	
35 191 17
	
2
	
01 3
	
21 0 16 1031 51 95
	
0
	
0
instrument-types
	
25 221 16
	
1
	
01 0
	
01 5
	
8 881 66 75 23
	
0incident-location
	
118 1131 88 24
	
11 0
	
11 0
	
5
	
01 85 88
	
0phys-effects
	
41 441 37
	
3
	
01 8
	
31 4
	
1 891 94 88
	
9
	
0human-effects
	
56 541 43
	
2
	
21 10
	
21 8
	
9 811' 78 81 15
	
1
MATCHED ONLY
	
1464 140211257 61 271171 371 62 119 8261 88 92
	
4
MATCHED/MISSING
	
1506 140211257 61 271171 371 62 161 8571 85 92
	
4
ALL TEMPLATES
	
1506 142011257 61 271171 371 80 161 8611 85 91
	
6SET
FILLS ONLY
	
640 6181547 16
	
91 68 151 49 68 5161 87 90
	
8
	
0
Figure 1 : Summary Score Report
Scoring Categorie s
Individual slot fills in the response were scored as correct, partially correct ,
incorrect, noncommittal, spurious, or missing. A response was correct if it was th e
same as the key, partially correct if it partially approximated the key, and incorrec t
if it was not the same as the key . If the key and response were both blank, th e
response was scored as noncommittal . If the key was blank but the slot was filled, th e
response was scored as spurious . If the response was blank and the key was not, th e
response was scored as missing. Figure 2 summarizes the scoring categories relating
them to the corresponding columns in the score report .
Category Criteria Colum n
Correct response = key COR
Partial response
	
= key PAR
Incorrect response ~ key INC
Noncommittal key
	
and response are both blank NON
Spurious key is blank and response is not SPU
Missing response is blank and key is not MIS
Figure 2 :
	
Scoring Categories
1 8
The summary score report rows show the totals for each of the categories ove r
all templates. The slots are listed on the left hand side and the totals for each slot ove r
all templates are given in the labeled columns .
	
For example, the total number o f
physical targets correctly identified was 54 . The number appears in the phys-target-
ids row and the COR column of the summary score report . Note that the bottom fou r
rows of the score report are not slot scores but rather global summary rows describe d
in a later section .
During scoring, the scoring system automatically scored matches as correc t
and some partially matching hierarchically organized items as partially correct .
However, many of the mismatches were interactively scored by the user . To reflect
the number of items interactively scored as correct or partially correct, two column s
labeled ICR and IPA were provided .
The first two columns in the score report contain the number of possible slo t
fills (POS) and the actual number of slot fills (ACT) . The number of possible slot fill s
is the number of slots fills in the key plus the number of optional slot fills in the ke y
that were matched in the response . The number of possible slot fills for each system
differs depending on the optional fills given by the system . The number of actua l
fills given is the number of slot fillers in the response . The numbers in the possible
and actual columns are used to calculate the metrics .
Calculation of Metric s
The metrics were calculated for each slot and for the summary rows . The
calculations were based on information in the columns of the score report as well a s
on some tallies kept internally by the scoring system . The first three metrics show n
in the score report are recall, precision, and overgeneration .
	
These were calculate d
for each slot and were based on information contained in the score report .
Recall is a measure of completeness and was calculated as follows .
correct + (partial x 0.5)
possible
For example, recall for the human-target-ids slot was calculated as follows .
REC COR +(PAR x 0 .5)POS
_ 129 + (2 x 0 .5 )
145
130
145
= 0.90
recall =
1 9
Precision is a measure of the accuracy of the attempted fills and was calculate d
as follows.
correct + ( partial x 0 .5 )precision =
	
actual
For example, the precision for the phys-target-num slot was calculated a s
follows .
PRE=COR + ( PAR X 0 .5)ACT
39 + ( 0 x 0 .5)
4 1
3 9
4 1
= 0.95
Overgeneration is a measure of spurious generation and was calculated a s
follows .
overgeneration = spuriou s
For example, the amount of overgeneration in the category slot was calculate d
as follows.
OVG = ACT'
2 1
109
= 0 .1 9
Fallout is a measure of the false alarm rate . The number of false alarms could
only be measured for slots for which we knew the number of possible incorrect
responses .
	
A subset of the slots in the template fill task were filled from finite sets .
The rest of the slots are filled from possibly infinite sets .
	
Fallout measures were
calculated for the finite set fill slots as follows .
fallout = Incorrect + spuriouspossible incorrec t
where "possible incorrect" is the number of possible incorrect answers which coul d
be given in the response . The number of possible incorrect is not shown in the scor e
report but a tally is kept internally by the scoring system . The method for keepin g
this tally of possible incorrect has evolved during the course of the evaluation .
actua l
20
In order to describe this evolution, a simple calculation of fallout for a singl e
slot in a single template will be given .
	
The instrument type slot has 16 possibl e
fillers.
	
If the key contains the filler GUN and the response contains the fille r
GRENADE, then fallout would b e
FAL =	 INC +SPUpossible incorrec t
1
1 5
= 0 .07
The number of possible incorrect is the cardinality of the set of possibl e
answers minus the number correct in the key which is 16 - 1, or 15 .
In phase one, the fallout measure assumed that the system was essentiall y
choosing a subset of the finite set of possible fills when it gave a response . For
example, if the key for the instrument type slot contained GUN and GRENADE and the
response contained BOMB, GRENADE, and CUTTING DEVICE, the phase one fallout woul d
be
FAL =	 INC + SPUpossible incorrec t
1 +1
16 - 2
2
1 4
= 0.14
The number of possible incorrect was the cardinality of the set minus the total
number of slot fills given in the key .
During phase two, it was noticed that this simple approach to fallout was in
fact erroneous for several reasons. Some finite set slots allowed multiple uses of set
members due to cross-referencing requirements . For example, the slot fill CIVILIA N
might be used multiple times in specifying the human target type for differen t
human targets .
CIVILIAN: "MARIO FLORES "
CIVILIAN : "JOSE RODRIGUEZ "
Further complications arose when alternatives were given in the key for eac h
such slot fill. In order to solve all of these problems, the calculation of the possible
correct for the slot fills was revised to coincide more closely with the calculation use d
in information retrieval .
	
Each separate slot fill item is now thought of as bein g
chosen from the entire finite set of possible fill items .
21
In general, the number of possible incorrect is given by the followin g
formula.
	
E
	
(IUI - Ikeyvall)
keyva l
where keyval stands for each of the key values including blanks, IUI is th e
cardinality of the finite set U of possible slot fillers, and Ikeyvall is the number of key
values corresponding to the response . If there are alternative key values for a
response, then Ikeyvall > 1 . If the key is blank, then there are no corresponding key
values and the contribution to the number of possible incorrect is the cardinality o f
the finite set .
Returning to our example of instrument types with the key containing GU N
and GRENADE and the response containing BOMB, GRENADE, and CUTTING DEVICE ,
fallout will be recalculated using the new method of determining the possibl e
incorrect . The number of possible incorrect is calculated by summing over the slot
fills . For GUN, the number of possible incorrect is the cardinality of the set, which i s
16, minus the number of slot fill alternatives given in the key, which in this case i s
1 . For GRENADE, the number of possible incorrect is also 15 . So the number of
possible incorrect for this slot is 15 + 15, or 30 . Since there is 1 incorrect and 1
spurious response, fallout is 2/30, or 7% . In phase one, fallout was 14% for this same
example.
If there are alternatives to a single slot fill in the key, the contribution to the
number of possible incorrect by that slot fill is the cardinality of the finite set minu s
the number of alternatives given . For example, if the key is GUN/GRENADE, the
number of possible incorrect is 16 - 2, or 14 .
If the key is blank, the number of possible incorrect is the cardinality of the
finite set . For example, if the instrument type slot is blank in the key and th e
response is GUN and GRENADE, then the fallout i s
	
FAL =	 INC + SPUpossible incorrec t
0 +2
1 6
2
1 6
= 0 .1 3
Notice that if the number of spurious responses is great enough, fallout can be
more than 100% .
Meaning of Metric s
Recall is a measure of completeness in the sense that it measures the amount o f
relevant data extracted relative to the total available . It is the true positive rate . A
mnemonic for recall can be constructed by imagining that you have been asked t o
read the entire answer key, then fill in templates with all that you hav e
22
"remembered" or "recalled ." Your score would be the total correctly recalled out o f
the total possible.
Precision is the accuracy with which a system extracts data. It is the amount
of relevant data relative to the total put in by the system . A mnemonic for precision
is to imagine that each time a system fills a slot it is throwing a dart at a dartboard .
All of the bull's-eyes are correct . Precision is a measure of the number of bull's-eyes
relative to the number of darts thrown . Precision can also be described as th e
tendency of a system to avoid assigning bad fillers as it assigns more good fillers .
Fallout is a measure of the false positive rate . It is the tendency of the system
to assign incorrect fillers as the number of potential incorrect fillers increases . So,
for a mnemonic, if you are imagining the dartboard again, fallout measures th e
number of darts that "fall outside" of the bull's-eye relative to the size of the are a
outside the bull's-eye .
	
Fallout can only be assigned for slots with a calculabl e
number of possible incorrect .
	
Only some of the slots have a finite set of slot fill s
associated with them .
	
The others have fills that come from potentially infinite set s
and hence cannot be assigned a fallout score .
Overgeneration is a measure of spurious generation .
	
It is the amount of
spurious fillers assigned in relation to the total assigned . Overgeneration wa s
calculated to deter overgeneration as an approach to higher scores . A mnemonic -for
overgeneration can be constructed by imagining that required fills and extra fill s
are in a box. Overgeneration is represented by the area that the extra fills take up i n
relation to the total area .
Summary Score s
The last four rows of the score report in Figure 1 are summary score rows . In
phase one, there was one summary score row that represented the totals of th e
columns for the scoring categories including possible and actual . The metrics were
then calculated based on those totals and appeared in the appropriate columns in th e
lower righthand portion of the chart . The summary metrics are always calculate d
from the items in the summary totals and are never the result of averaging th e
metrics for the slots .
In phase two, it was decided that the scoring system should keep the interna l
tallies needed to supply several summary score rows, only one of which would be the
total of the slot scores shown in the columns of the score report . The scoring of slot s
in the missing and spurious templates was the issue which gave rise to multipl e
summary rows.
	
In phase one, spurious templates were scored as spurious in th e
template id slot only. The spurious slot fillers aside from the template id slot fille r
were not scored as spurious . Missing templates, however, were scored in the templat e
id slot and in the individual missing slots . This method of scoring did not penalize a s
much for overpopulating the database as it did for underpopulating it .
In phase two, we wanted to find out how the systems scored if overpopulatin g
and underpopulating the database were treated equally . Two summary rows were
added, one of which scored spurious and missing in the template id only and th e
other of which scored spurious and missing templates for all of the spurious and
missing slot fills. The official scores were still taken from the same summary row a s
in phase one, but the other two rows were there for analysis .
23
The global summary rows are listed on the score report in order of strictnes s
based on the scoring of missing and spurious templates . The MATCHED ONLY row has
missing and spurious templates only scored in the template id slot . This row contains
the least strict of the scores for the system . The MATCHED/MISSING row contains the
official test results . The missing template slots are scored as missing . The spuriou s
templates are scored only in the template id slot . The totals in this row are the total s
of the tallies in the columns as shown. The ALL TEMPLATES row has missing templat e
slots scored as missing and spurious template slots scored as spurious . This row
contains the strictest scores for the system .
A fourth summary row was added to allow analysis of system performance o n
only the set fill slots . The SET FILLS ONLY row contains totals for slots with finite se t
fills only . A global fallout score is calculated for these slots and given in the fallou t
column of this row .
CONCLUSIONS AND FURTHER RESEARCH
The evaluation metrics for MUC-3 had utility for system development and fo r
the reporting and analysis of the results of the evaluation. The metrics were adapte d
from simpler task models and were still evolving when the evaluation wa s
performed . There has been consistent agreement on the necessity of basi c
measurements of completeness, accuracy, false alarm rate, and overgeneration .
These measurements were accomplished through the metrics of recall, precision ,
fallout, and overgeneration as defined for MUC-3 .
	
The global summary score s
provide several different views of system performance . However, further analysis o f
the current results is possible based on the information in the official score reports .
The template by template scores are officially reported and can be used as a basis for
further analysis .
	
For example, performance at the message level can be calculate d
from the template by template scores for the systems .
While the metrics of recall, precision, fallout, and overgeneration have bee n
defined for MUC-3, further research into the metrics and their implementation need s
to be done. Additional measurements may be required . More refined definitions of
the current measurements are probably needed . The complexities of optional fills ,
alternatives in the key, partial credit, and distribution of partial credit over ke y
values, to name a few, still need to be examined more closely with consideration
given to their effects on the metrics.
	
These complexities have made it difficult t o
fully test the scoring system software and require more attention to be paid t o
detecting and isolating subtle errors .
	
A different treatment of the slots will need to
be attempted. For example, the template id slot is unique among the slots and will b e
kept separate when the summary measures are calculated in the future.
	
A single
overall measure of performance may be possible in the future once the roles o f
recall and precision are more fully determined. All of these avenues of furthe r
research have been opened up by the definition of a set of metrics for MUC-3 and th e
development of a scoring system embodying those metrics .
ACKNOWLEDGEMENT S
Many of the scoring issues were debated and resolved in consultation wit h
members of the Program Committee . David Lewis provided technical guidance . Pete
Halverson developed the scoring system . The participants provided feedback. Beth
Sundheim was a sounding board for many of the issues that arose during th e
evolution of the MUC-3 scoring .
24
