OVERVIEW OF THE THIRD MESSAGE UNDERSTANDING
EVALUATION AND CONFERENCE
Beth M. Sundheim
Naval Ocean Systems Center
Code 444
Decision Support and AI Technology Branch
San Diego, CA 92152-5000
sundheim@nosc.mil
INTRODUCTION
The Naval Ocean Systems Center (NOSC) has conducted the third in a series of
evaluations of English text analysis systems. These evaluations are intended to
advance our understanding of the merits of current text analysis techniques, as
applied to the performance of a realistic information extraction task. The latest
one is also intended to provide insight into information retrieval technology
(document retrieval and categorization) used instead of or in concert with
language understanding technology. The inputs to the analysis/extraction process
consist of naturally-occurring texts that were obtained in the form of electronic
messages. The outputs of the process are a set of templates or semantic frames
resembling the contents of a partially formatted database.
The premise on which these evaluations are based is that task-oriented tests
enable straightforward comparisons among systems and provide useful
quantitative data on the state of the art in text understanding. The tests are
designed to treat the systems under evaluation as black boxes and to point up
system performance on discrete aspects of the task as well as on the task overall.
These quantitative data can be interpreted in light of information known about
each system's text analysis techniques in order to yield qualitative insights into the
relative validity of those techniques as applied to the general problem of
information extraction.
The process of conducting these evaluations has presented great opportunities
for examining and improving on the evaluation methodology itself. Although still
far from perfect, the MUC-3 evaluation was markedly better than the previous one,
especially with respect to the way scoring was done and the degree to which the
test set was representative of the training set. Much of the credit for improvement
goes to the evaluation participants themselves, who have been actively involved in
nearly every aspect of the evaluation. The previous MUC, known as MUCK-II (the
naming convention has since been stripped down), proved that systems existed
that could do a reasonable job of extracting data from ill-formed paragraph-length
texts in a narrow domain (naval messages about encounters with hostile forces)
and that measuring performance on such a task was feasible. However, the use of
a very small test set (just 5 texts) and an extremely unsophisticated scoring
procedure combined to make it inadvisable to publicize the results. (Results
obtained in experiments conducted on one MUCK-II system after the evaluation
was completed are discussed in [1].)
The MUC-3 evaluation was significantly broader in scope than previous ones in
most respects, including text characteristics, task specifications, performance
measures, and range of text understanding and information extraction techniques.
MUC-3 presented a significantly more challenging task than MUCK-II, which was
held in June of 1989. The results show that MUC-3 was not an unreasonable
challenge to 1991 technologies. The means used to measure performance have
evolved far enough that we no longer hesitate to present the system scores, and
work on the evaluation methodology is planned that will take the next step to
determine the statistical significance of the results.
In another effort to determine their significance, some work has already been
undertaken by Hirschman [2] to measure the difference in complexity of MUC-like
evaluation tasks so that the results can be used to quantify progress in the field of
text understanding. This objective, however, brings up another critical area of
improvement for future evaluations, namely refining the evaluation methodology
in such a way as to better isolate the systems' text analysis capabilities from their
data extraction capabilities. This will be done, since the MUC-3 corpus and task are
sufficiently challenging that they can be used again (with a new test set) in a
future evaluation. That evaluation will seek to examine more closely the text
analysis capabilities of the systems, to measure improvements in performance by
MUC-3 systems, and to establish performance baselines for any new systems.
This paper covers most of the basics of the MUC-3 evaluation, which were
presented during a tutorial session and in an overview presentation at the start of
the regular sessions. This paper is also an overview of the conference
proceedings, which include papers contributed by the sites that participated in
the evaluation and by individuals who were involved in the evaluation in other
ways. Parts I, II, and III of the proceedings are organized in the order in which
the sessions were held, but the ordering of papers within Parts II and III is
alphabetical by site and does not necessarily correspond with the order in which
the presentations were made during the conference. The proceedings also
include a number of appendices containing materials pertinent to the evaluation.
OVERVIEW OF MUC-3
The planning for MUC-3 began while MUCK-II was still in progress, with
suggestions from MUCK-II participants for improvements. A MUC-3 program
committee was formed from among those MUCK-II participants who provided
significant feedback on the MUCK-II effort. The MUC-3 program committee
included Laura Blumer Balcom (Advanced Decision Systems), Ralph Grishman (New
York University), Jerry Hobbs (SRI International), Lisa Rau (General Electric), and
Carl Weir (Unisys Center for Advanced Information Technology). Since one of the
suggestions for MUC-3 was to add an element of document filtering to the task of
data extraction, David Lewis (then at the University of Massachusetts and now at
the University of Chicago) was invited to join the committee as a representative of
the information retrieval community.
NOSC began looking for a suitable corpus in late 1989 and obtained assistance
from other government agencies to acquire it during the summer of 1990. At that
time, a call for participation was sent to academic, industrial, and commercial
organizations in the United States that were known to be engaged in system design
or development in the area of text analysis or information retrieval. Participation
on the part of many of the respondents was contingent upon receiving outside
financial support; approximately two-thirds of the sites were awarded financial
support by the Defense Advanced Research Projects Agency (DARPA). These
awards were modest, some sites having requested funds only to pay travel expenses
and others having requested funds to cover up to half of the total cost of
participating. The total cost was typically estimated to be approximately equivalent
to one person-year of effort.
The evaluation was officially launched in October 1990, with a three-month
phase dedicated to compiling the "answer key" templates for the texts in the
training set (see next section), refining the task definition, and developing the
initial MUC-3 version of the data extraction systems. These systems underwent a
dry-run test in February 1991, after which a meeting was held to discuss the
results and hammer out some of the remaining evaluation issues. Twelve sites
participated in the dry run. One site (TRW) dropped out after the dry run, and four
new sites entered, three of which had already been involved to some extent (BBN
Systems and Technologies, McDonnell Douglas Electronic Systems Company, and
Synchronetics, Inc.) and one of which had not (Hughes Research Laboratories).
The second phase began in mid-February and, while system development
continued at each of the participating sites, updates were made to the scoring
program, the task definition, and the answer key templates for the training set.
Final testing was carried out in May 1991, concluding with the Third Message
Understanding Conference (MUC-3), which was attended by representatives of the
participating sites and interested government organizations. During the
conference, the evaluation participants decided that the test results should be
validated by having the system-generated templates rescored by a single party.
Two of the participants were selected to work as a team to carry out this task, and
the results of their effort are the official test scores presented in this volume.
Pure and hybrid systems based on a wide range of text interpretation
techniques (e.g., statistical, key-word, template-driven, pattern-matching,
in-depth natural language processing) were represented in the MUC-3 evaluation.
The fifteen sites that completed the evaluation are Advanced Decision Systems
(Mountain View, CA), BBN Systems and Technologies (Cambridge, MA), General
Electric (Schenectady, NY), General Telephone and Electronics (Mountain View,
CA), Intelligent Text Processing, Inc. (Santa Monica, CA), Hughes Research
Laboratories (Malibu, CA), Language Systems, Inc. (Woodland Hills, CA), McDonnell
Douglas Electronic Systems (Santa Ana, CA), New York University (New York City,
NY), PRC, Inc. (McLean, VA), SRI International (Menlo Park, CA), Synchronetics,
Inc. together with the University of Maryland (Baltimore, MD), Unisys Center for
Advanced Information Technology (Paoli, PA), the University of Massachusetts
(Amherst, MA), and the University of Nebraska (Lincoln, NE) in association with
the University of Southwest Louisiana (Lafayette, LA). Parts II and III of this
volume include papers by each of these sites. In addition, an experimental
prototype of a probabilistic text categorization system was developed by David
Lewis, who is now at the University of Chicago, and was tested along with the other
systems. That work is described in a paper in Part IV.
CORPUS AND TASK
The corpus was formed via a keyword query (footnote 1) to an electronic database
containing articles in message format from open sources worldwide. These articles
had been gathered, translated (if necessary), edited, and disseminated by the
Foreign Broadcast Information Service (FBIS) of the U.S. Government. A training
set of 1300 texts was identified, and additional texts were set aside for use as test
data (footnote 2). The message headers were used to create or augment a dateline
and the text type information appearing at the front of the article; the original
message headers and routing information were removed. The layout was modified
slightly to improve readability (e.g., by double-spacing between paragraphs), and
problems that arose with certain characters when the data was downloaded were
rectified (e.g., square brackets were missing and had to be reinserted). The body of
the text was modified minimally and with the sole purpose of eliminating some
idiosyncratic features that were well beyond the scope of interest of MUC-3
(footnote 3).
The corpus presents realistic challenges in terms of overall size (over 2.5
megabytes), length of the individual articles (approximately a half-page each on
average), variety of text types (newspaper articles, TV and radio news, speech and
interview transcripts, rebel communiques, etc.), range of linguistic phenomena
represented (both well-formed and ill-formed), and open-endedness of the
vocabulary (especially with respect to proper nouns). The texts used in MUCK-I
and MUCK-II originated as teletype messages and thus were all upper case; the
MUC-3 texts are also all upper case, but only as a consequence of downloading from
the source database, where the texts appear in mixed upper and lower case.
The task was to extract information on terrorist incidents (incident type, date,
location, perpetrator, target, instrument, outcome, etc.) from the relevant texts in
a blind test on 100 previously unseen texts. Approximately half the articles were
irrelevant to the task as defined. In some cases the terrorism keywords in the
query used to form the corpus (see footnote 1) were used in irrelevant senses, e.g.,
"explosion" in the phrase "social explosion". In other cases, an entity of one of the
nine countries of interest -- the second necessary condition for a hit -- was
mentioned, but the entity did not play a significant role in the terrorist incident.
Other articles were irrelevant for reasons that were harder to formulate. For
example, some articles concerned common criminal activity or guerrilla warfare
(or other military conflict). Rules were developed to challenge the systems to
discriminate among various kinds of violent acts and to generate templates only
for those that would be of interest to a terrorism news analyst. The real-life
scenario also required that only timely, substantive information be extracted; thus,
rules were formulated that defined relevance in terms of whether the news was
recent and whether it at least mentioned who/what the target was. Other
relevance criteria were developed as well, again with the intent of simulating a
real-life task. The relevance criteria are described in the first part of appendix A,
which is the principal documentation of the MUC-3 task. Appendix D contains some
representative samples of relevant and irrelevant articles.
1 The query specified a hit as a message containing both a country/nationality name (e.g.,
Honduras or Honduran) for one of the nine countries of interest (Argentina, Bolivia, Chile,
Colombia, Ecuador, El Salvador, Guatemala, Honduras, Peru) and some inflectional form of a
common word associated with terrorist acts (abduct, abduction, ambush, arson, assassinate,
assassination, assault, blow [up], bomb, bombing, explode, explosion, hijack, hijacking, kidnap,
kidnapping, kill, killing, murder, rob, shoot, shooting, steal, terrorist). Some of the articles in
the MUC-3 corpus may no longer satisfy this query, since the message headers (including the
subject line) were removed after the retrieval was done.
2 Over 300 articles were set aside from the overall corpus to be used as test data. The
composition of the test sets was intentionally controlled with respect to the frequency with
which incidents concerning any given country are represented; otherwise, the selection was
done simply by taking every nth article about that country.
3 For example, transcriptions of radio and TV broadcasts sometimes contained sentences in
which words were enclosed in parentheses to indicate that the transcriber could not be certain
of them, e.g., "They are trying to implicate the (Ochaski Company) with narcoterrorism." (This
quote is from article number PA1807130691 of the Latin America volume of the Foreign
Broadcast Information Service Daily Reports.) In cases such as this, where the text is
parenthetical in form but not in function, the parentheses were deleted.
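The corpus-forming query (a country or nationality name together with an
inflected terrorism keyword) can be approximated with a short sketch. This is an
illustration only, not the query actually run against the source database: the
function name is_hit is hypothetical, the suffix pattern is a crude stand-in for
"some inflectional form", and only the one nationality form given in the text
(Honduran) is included.

```python
import re

# The nine countries of interest, plus the single nationality form that the
# text gives as an example. Other nationality forms are omitted, not guessed.
COUNTRY_TERMS = ["Argentina", "Bolivia", "Chile", "Colombia", "Ecuador",
                 "El Salvador", "Guatemala", "Honduras", "Honduran", "Peru"]

# Keywords as listed in the text ("blow [up]" is reduced to "blow").
KEYWORDS = ["abduct", "abduction", "ambush", "arson", "assassinate",
            "assassination", "assault", "blow", "bomb", "bombing", "explode",
            "explosion", "hijack", "hijacking", "kidnap", "kidnapping", "kill",
            "killing", "murder", "rob", "shoot", "shooting", "steal", "terrorist"]

COUNTRY_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in COUNTRY_TERMS) + r")\b",
    re.IGNORECASE)

# Assumption: a few regular suffixes approximate inflection; irregular or
# consonant-doubling forms (e.g. "robbed", "shot") are not covered.
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(KEYWORDS) + r")(?:s|es|d|ed|ing)?\b",
    re.IGNORECASE)

def is_hit(text: str) -> bool:
    """True if the text names a country of interest and uses a keyword."""
    return bool(COUNTRY_RE.search(text)) and bool(KEYWORD_RE.search(text))
```

Under this sketch, a text such as "A SOCIAL EXPLOSION IN PERU" counts as a hit
even though it is irrelevant to the task, which is the kind of false hit that the
relevance criteria were meant to screen out.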
It can be seen that the relevance criteria are extensive and would sometimes be
difficult to state, let alone implement. It was learned that greater allowances
needed to be made for the fact that this was an evaluation task and not a real-life one.
Systems that generated generally correct internal data structures for a relevant
incident, only to filter out that data structure by making a single mistake on one of
the relevance criteria, were penalized for having missed the incident entirely
rather than being penalized for getting just one aspect of the incident description
wrong. Some allowance was made in the answer key for the fact that incidents or
facts about incidents might be of questionable relevance, given the vagueness of
some texts and gaps in the statement of the relevance criteria; the template
notation allowed for optionality, and systems were not penalized if they failed to
generate an optional template or an optional filler in a required template.
If an article was determined to be relevant, there was then the task of
determining how many distinct relevant incidents were being reported. The
information on these incidents had to be correctly disentangled and represented in
separate templates. The extracted information was to be represented in the
template in one of several ways, according to the data format requirements of each
slot. (See appendix A.) Some slot fills were required to be categories from a
predefined set of possibilities called a "set list" (e.g., for the various types of
terrorist incidents such as BOMBING, ATTEMPTED BOMBING, BOMB THREAT);
others were required to be canonicalized forms (e.g., for dates) or numbers; still
others were to be in the form of strings (e.g., for person names).
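The distinction among set fills, canonicalized forms, numbers, and strings can be
made concrete with a small format check. This is a sketch under stated
assumptions rather than the MUC-3 specification: the function name and the four
kind labels are hypothetical, the set list is deliberately partial, the date
pattern is modeled on the single "03 APR 90" fill in the sample template, and
string fills are assumed to be double-quoted as they are in that template.

```python
import re

# Partial, illustrative set list: only incident types named in the text.
INCIDENT_TYPES = {"BOMBING", "ATTEMPTED BOMBING", "BOMB THREAT", "KIDNAPPING"}

# Assumed canonical date form (two-digit day, month abbreviation, two-digit year).
DATE_RE = re.compile(
    r"\d{2} (?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) \d{2}")

def is_well_formed(kind: str, fill: str) -> bool:
    """Check a fill against the format assumed for its slot kind."""
    if kind == "set":
        return fill in INCIDENT_TYPES
    if kind == "date":
        return DATE_RE.fullmatch(fill) is not None
    if kind == "number":
        return fill.isdigit()
    if kind == "string":
        return len(fill) > 2 and fill.startswith('"') and fill.endswith('"')
    raise ValueError(f"unknown slot kind: {kind}")
```

A check of this kind only validates the form of a fill; whether the fill is
correct for a given text is a separate question answered by the answer key.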
A relatively simple article and corresponding answer key template from the
dry-run test set (labeled TST1) are shown in Figures 1 and 2. Note that the text in
Figure 1 is all upper case, that the dateline includes the source of the article
("Inravision Television Cadena 1") and that the article is a news report by Jorge
Alonso Sierra Valencia. In Figure 2, the left-hand column contains the slot labels,
and the right-hand column contains the correct answers as defined by NOSC.
Slashes mark alternative correct responses (systems are to generate just one of the
possibilities), an asterisk marks slots that are inapplicable to the incident type
being reported, a hyphen marks a slot for which the text provides no fill, and a
colon introduces the cross-reference portion of a fill (except for slot 16, where the
colon is used as a separator between more general and more specific place names).
More information on the template notation can be found in appendix A, and
further examples of texts and templates can be found in appendices D and E.
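The notation just described (slash, asterisk, hyphen, colon) can be read
mechanically in the simple cases. The following is a minimal sketch, not the
parser used by the MUC-3 scoring program: the names KeyFill and parse_key_fill
are hypothetical, and the sketch assumes that slashes and colons do not occur
inside quoted strings. It does not handle the slot 16 exception or the
parenthesized alternatives that appear in slot 11 of the sample template.

```python
from dataclasses import dataclass, field

@dataclass
class KeyFill:
    status: str                                        # "inapplicable", "no_fill", or "filled"
    alternatives: list = field(default_factory=list)   # any one of these is correct
    cross_ref: list = field(default_factory=list)      # alternatives after the colon

def _split_alternatives(text: str) -> list:
    """Split on slashes, dropping surrounding whitespace and empty pieces."""
    return [part.strip() for part in text.split("/") if part.strip()]

def parse_key_fill(raw: str) -> KeyFill:
    """Interpret one right-hand-column entry of an answer key template."""
    raw = raw.strip()
    if raw == "*":
        return KeyFill("inapplicable")
    if raw == "-":
        return KeyFill("no_fill")
    head, colon, tail = raw.partition(":")
    return KeyFill("filled",
                   _split_alternatives(head),
                   _split_alternatives(tail) if colon else [])
```

For example, parsing the slot 7 entry of Figure 2 would yield the single set fill
CLAIMED OR ADMITTED with two alternative cross-referenced strings.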
TST1-MUC3-0080
BOGOTA, 3 APR 90 (INRAVISION TELEVISION CADENA 1) -- [REPORT] [JORGE ALONSO
SIERRA VALENCIA] [TEXT] LIBERAL SENATOR FEDERICO ESTRADA VELEZ WAS
KIDNAPPED ON 3 APRIL AT THE CORNER OF 60TH AND 48TH STREETS IN WESTERN
MEDELLIN, ONLY 100 METERS FROM A METROPOLITAN POLICE CAI [IMMEDIATE
ATTENTION CENTER]. THE ANTIOQUIA DEPARTMENT LIBERAL PARTY LEADER HAD
LEFT HIS HOUSE WITHOUT ANY BODYGUARDS ONLY MINUTES EARLIER. AS HE WAITED
FOR THE TRAFFIC LIGHT TO CHANGE, THREE HEAVILY ARMED MEN FORCED HIM TO GET
OUT OF HIS CAR AND GET INTO A BLUE RENAULT.
HOURS LATER, THROUGH ANONYMOUS TELEPHONE CALLS TO THE METROPOLITAN
POLICE AND TO THE MEDIA, THE EXTRADITABLES CLAIMED RESPONSIBILITY FOR THE
KIDNAPPING. IN THE CALLS, THEY ANNOUNCED THAT THEY WILL RELEASE THE
SENATOR WITH A NEW MESSAGE FOR THE NATIONAL GOVERNMENT.
LAST WEEK, FEDERICO ESTRADA VELEZ HAD REJECTED TALKS BETWEEN THE
GOVERNMENT AND THE DRUG TRAFFICKERS.
Figure 1. Article from MUC-3 Corpus (footnote 4)
4 This article has serial number PA0404072690 in the Latin America volume of the FBIS Daily
Reports, which are the secondary source for all the texts in the MUC-3 corpus.
0. MESSAGE ID                       TST1-MUC3-0080
1. TEMPLATE ID                      1
2. DATE OF INCIDENT                 03 APR 90
3. TYPE OF INCIDENT                 KIDNAPPING
4. CATEGORY OF INCIDENT             TERRORIST ACT
5. PERPETRATOR: ID OF INDIV(S)      "THREE HEAVILY ARMED MEN"
6. PERPETRATOR: ID OF ORG(S)        "THE EXTRADITABLES" / "EXTRADITABLES"
7. PERPETRATOR: CONFIDENCE          CLAIMED OR ADMITTED: "THE EXTRADITABLES" /
                                    "EXTRADITABLES"
8. PHYSICAL TARGET: ID(S)           *
9. PHYSICAL TARGET: TOTAL NUM       *
10. PHYSICAL TARGET: TYPE(S)        *
11. HUMAN TARGET: ID(S)             "FEDERICO ESTRADA VELEZ" ("LIBERAL SENATOR" /
                                    "ANTIOQUIA DEPARTMENT LIBERAL PARTY LEADER"
                                    / "SENATOR" / "LIBERAL PARTY LEADER" / "PARTY
                                    LEADER")
12. HUMAN TARGET: TOTAL NUM         1
13. HUMAN TARGET: TYPE(S)           GOVERNMENT OFFICIAL / POLITICAL FIGURE:
                                    "FEDERICO ESTRADA VELEZ"
14. TARGET: FOREIGN NATION(S)       -
15. INSTRUMENT: TYPE(S)             *
16. LOCATION OF INCIDENT            COLOMBIA: MEDELLIN (CITY)
17. EFFECT ON PHYSICAL TARGET(S)    *
18. EFFECT ON HUMAN TARGET(S)       -
Figure 2. Answer Key Template
The participants collectively created the answer key for the training set, each
site manually filling in templates for a partially overlapping subset of the texts.
This task was carried out at the start of the evaluation; it therefore provided
participants with good training on the task requirements and provided NOSC with
good early feedback. Generating and cross-checking the templates required an
investment of at least two person-weeks of effort per site. These answer keys were
updated a number of times to reduce errors and to maintain currency with
changing template fill specifications. In addition to generating answer key
templates, sites were also responsible for compiling a list of the place names that
appeared in their set of texts; NOSC then merged these lists to create the set lists for
the TARGET: FOREIGN NATION slot and LOCATION OF INCIDENT slot.
MEASURES OF PERFORMANCE
All systems were evaluated on the basis of performance on the information
extraction task in a blind test at the end of each phase of the evaluation. It was
expected that the degree of success achieved by the different techniques in May
would depend on such factors as whether the number of possible slot fillers was
small, finite, or open-ended and whether the slot could typically be filled by fairly
straightforward extraction or not. System characteristics such as amount of
domain coverage, degree of robustness, and general ability to make proper use of
information found in novel input were also expected to be major factors. The
dry-run test results were not assumed to provide a good basis for estimating
performance on the final test in May, but the expectation was that most, if not all,
of the systems that participated in the dry run would show dramatic improvements
in performance. The test results show that some of these expectations were borne
out, while others were not or were less significant than expected.
A semi-automated scoring program was developed under contract for MUC-3 to
enable the calculation of the various measures of performance. It was distributed
to participants early on during the evaluation and proved invaluable in providing
them with the performance feedback necessary to prioritize and reprioritize their
development efforts as they went along. The scoring program can be set up to
score all the templates that the system generates or to score subsets of
templates/slots. User interaction is required only to determine whether a
mismatch between the system-generated templates and the answer key templates
should be judged completely or partially correct. (A partially correct filler for slot
11 in Figure 2 might be "VELEZ" ("LEADER"), and a partially correct filler for
slot 16 would be simply COLOMBIA.) An extensive set of interactive scoring
guidelines was developed to standardize the interactive scoring. These guidelines
are contained in appendix C. The scoring program maintains a log of interactions
that can be used in later scoring runs and augmented by the user as the system is
updated and the system-generated templates change.
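The interplay between automatic matching, interactive judgment, and the reusable
log of interactions can be sketched as follows. This is a hypothetical
illustration and not the MUC-3 scoring program itself: the function name, the
log layout, and the ask callback are all assumptions, and the possibility of
judging a mismatch "incorrect" is inferred rather than stated in the text. The
point values (1 for a correct response, .5 for a good partial match) are the
ones given in the discussion of the scoring metrics.

```python
# Points per judgment; the "incorrect" entry is an assumption of this sketch.
POINTS = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}

def score_fill(slot, key_alternatives, response, log, ask):
    """Return the points earned by one system response for one slot.

    log maps (slot, key alternatives, response) to an earlier human judgment;
    ask is a callback that obtains a judgment from the interactive scorer.
    """
    if response in key_alternatives:
        return POINTS["correct"]           # exact match: no interaction needed
    entry = (slot, tuple(key_alternatives), response)
    if entry not in log:                   # first time this mismatch is seen
        log[entry] = ask(slot, key_alternatives, response)
    return POINTS[log[entry]]              # later runs reuse the logged judgment
```

Because judgments are stored by slot, key, and response, rerunning the scorer
after a system update only prompts the user for mismatches not seen before.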
The two primary measures of performance were completeness (recall) and
accuracy (precision). There were two additional measures, one to isolate the
amount of spurious data generated (overgeneration) and the other to determine
the rate of incorrect generation as a function of the number of opportunities to
incorrectly generate (fallout). The labels "recall," "precision," and "fallout" were
borrowed from the field of information retrieval, but the definitions of those terms
had to be substantially modified to suit the template-generation task. The
overgeneration metric has no correlate in the information retrieval field, since a
MUC-3 system can generate indefinitely more data than is actually called for, but
an information retrieval system cannot retrieve more than the total number of
items (e.g., documents) that are actually present in the corpus.
Fallout can be calculated only for those slots whose fillers form a closed set.
Scores for the other three measures were calculated for the test set overall, with
breakdowns by template slot. Figure 3 presents somewhat simplified definitions.
MEASURE           DEFINITION
RECALL            #correct fills generated / #fills in key
PRECISION         #correct fills generated / #fills generated
OVERGENERATION    #spurious fills generated / #fills generated
FALLOUT           #incorrect+spurious generated / #possible incorrect fills
Figure 3. MUC-3 Scoring Metrics
The most significant thing that this table does not show is that precision and recall
are actually calculated on the basis of points -- the term "correct" includes system
responses that matched the key exactly (earning 1 point each) and system
responses that were judged to be a good partial match (earning .5 point each). It
should also be noted that overgeneration is not only a measure in its own right but
is also a component of precision, where it acts as a penalty by contributing to the
denominator. Overgeneration also figures in fallout by contributing to the
numerator. Further information on the MUC-3 evaluation metrics and scoring
methods, including information on three different ways penalties for missing and
spurious data were assigned, can be found elsewhere in this volume in the paper
on evaluation metrics by Nancy Chinchor [3].
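The simplified definitions, together with the point-based reading of "correct",
can be worked through in a short sketch. It is an illustration under stated
assumptions, not the official scoring code: the function name and its arguments
are hypothetical, and the sketch assumes that the number of fills in the key is
the sum of correct, partial, incorrect, and missing fills, and that the number of
fills generated is the sum of correct, partial, incorrect, and spurious fills. The
three penalty schemes described in [3] are not modeled.

```python
def muc3_scores(correct, partial, incorrect, spurious, missing,
                possible_incorrect=None):
    """Compute the four measures from raw counts (simplified definitions)."""
    points = correct + 0.5 * partial                      # 1 point exact, .5 partial
    in_key = correct + partial + incorrect + missing      # assumed decomposition
    generated = correct + partial + incorrect + spurious  # assumed decomposition
    scores = {
        "recall": points / in_key if in_key else None,
        "precision": points / generated if generated else None,
        "overgeneration": spurious / generated if generated else None,
        "fallout": None,
    }
    # Fallout is defined only for slots whose fillers form a closed set, so the
    # caller supplies the number of possible incorrect fills when it is known.
    if possible_incorrect:
        scores["fallout"] = (incorrect + spurious) / possible_incorrect
    return scores
```

With 8 correct, 2 partial, 1 incorrect, 1 spurious, and 1 missing fill, the
sketch gives 9 points over 12 key fills and 12 generated fills, so recall and
precision are both 0.75. The spurious fill lowers precision by enlarging its
denominator, which is the penalty role of overgeneration noted above.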
TEST PROCEDURE
Final testing was done on a test set of 100 previously unseen texts that were
representative of the corpus as a whole. Participants were asked to copy the test
package electronically to their own sites when they were ready to begin testing.
Appendix B contains a copy of the test procedure. The testing had to be conducted
and the results submitted within a week of the date when the test package was made
available for electronic transfer. Each site submitted its system-generated
templates, the outputs of the scoring program (score reports and the interactive
scoring history file), and a trace of the system's processing (whatever type of trace
the system normally produces that could serve to help validate the system's
outputs). Initial scoring was done at the individual sites, with someone designated
as interactive scorer who preferably had not been part of the system development
team. After the conference, the system-generated templates for all sites were
labeled anonymously and rescored by two volunteers in order to ensure that the
official scores were obtained as consistently as possible.
The system at each site was to be frozen before the test package was
transferred; no updates were permitted to the system until testing and scoring
were completed. Furthermore, no backing up was permitted during testing in the
event of a system error. In such a situation, processing was to be aborted and
restarted with the next text. A few sites encountered unforeseen system problems
that were easily pinpointed and fixed. They reported unofficial, revised test results
at the conference that were generally similar to the official test results and do not
alter the overall picture of the current state of the art.
The basic test called for systems to be set up to generate templates that
produced the "maximum tradeoff" between recall and precision, i.e., templates that
achieved scores as high as possible and as similar as possible on both recall and
precision. This was the normal mode of operation for most systems and for many
was the only mode of operation that the developers had tried. Those sites that could
offer alternative tradeoffs were invited to do so, provided they notified NOSC in
advance of the particular setups they intended to test on.
In addition to the scores obtained for these metrics on the basic
template-generation task, scores were obtained of system performance on the linguistic
phenomenon of apposition, as measured by the template fills generated by the
systems in particular sets of instances. That is, sentences exemplifying apposition
were marked for separate scoring if successful handling of the phenomenon
seemed to be required in order to fill one or more template slots correctly for that
sentence. This test was conducted as an experiment and is described in the paper
by Nancy Chinchor on linguistic phenomena testing [4].
TEST RESULTS AND DISCUSSION
The summary score reports produced for the tested systems by the scoring
program are found in appendix F; scatter plots for selected portions of the final test
results are shown in appendix G. Most of the figures in appendix G plot recall
versus precision; a couple plot recall versus overgeneration, since the generation of
spurious data is a significant element of precision with respect to a template
generation task. The plots facilitate consideration of questions such as the
following:
* On which aspect of the task (slot in the template) were the systems as a
group most successful?
* How well did the systems handle time expressions (DATE OF INCIDENT
slot)?
* How did the front-running systems on the overall measures differ with
respect to individual slot performance?
* To what extent do the different ways of computing the scores
(Matched/Missing, Matched Only, All Templates, and Set Fills Only) change the
picture?
* To what extent was generation of spurious data taking place?
* To what extent did the individual systems' recall and precision represent
tradeoffs?
Not included in the appendices are the detailed score reports produced by the
scoring program for each of the system-generated templates. These reports permit
consideration of other interesting questions such as how systems performed from
one terrorist incident type to another and how they performed when a message
contained more than one relevant incident report. It is also possible to use them
together with the corresponding texts to answer questions such as how well
systems handled newspaper articles versus TV and radio news reports and how well
they handled incident reports that were spread out over a paragraph or across
paragraphs rather than being completely described in a single sentence.
The appendices also do not include the results of a minor study of human
performance on the MUC-3 final test. This study was conducted using two MUC-3
evaluators as subjects and measuring their performance individually compared to
the official answer key, which was created by merging and correcting their
individual draft keys. The evaluator with the lower scores for Matched/Missing
had 87% recall, 91% precision, and 5% overgeneration. Needless to say, since these
subjects were responsible for preparing the official answer key, their
performance on the draft keys was undoubtedly higher than could be expected
even from other highly trained persons. Another reason the scores are higher than
would be obtained in a different study is that the two evaluators prepared the draft
keys in two stages and reconciled most of the differences that arose in the first
stage before starting the second stage. In the first stage, the evaluators identified
which articles were relevant, how many templates would be generated for the
relevant ones, and which incident types would be represented in each of the
templates. In the second stage, the evaluators filled in the templates, with the
assistance of an interactive software tool that provides some integrity checking,
automatic fill-in, etc.
The plots in appendix G present an interesting picture of the MUC-3 results as a
whole, but the significance of the numbers for each of the tested systems needs to
be assessed on the basis of a careful reading of the papers in this volume that were
submitted by each of the sites. To facilitate interpretation of the test results, the
sites were asked to focus on the test scores and the evaluation experience in the
first of those papers and to elaborate in their second paper on how the system -- as
it was actually implemented for MUC-3 -- works in general and how it is designed to
handle the kinds of phenomena found in the MUC-3 corpus.
The level of effort that could be afforded by each of the sites varied
considerably, as did the maturity of the systems at the start of the evaluation. All
sites were operating under time constraints imposed by the evaluation schedule.
In addition, the evaluation demands were a consequence of the intricacies of the
task and of general corpus characteristics such as the following:
* The texts that are relevant to the MUC-3 task (comprising approximately
50% of the total corpus) are likely to contain more than one relevant incident.
* The information on a relevant incident may be dispersed throughout the
text and may be intertwined with accounts of other (relevant or irrelevant)
incidents.
* The corpus includes a mixture of material (newspaper articles, TV news,
speeches, interviews, propaganda, etc.) with varying text structures and styles.
The scoring program produces four sets of overall scores, three of which are
based on different means of assessing penalties for missing and spurious data.
These sets of scores appear in the rows at the bottom of the score reports. The set
called Matched/Missing is a compromise between Matched Only (which is more
lenient than Matched/Missing) and All Templates (more stringent) and is used as
the official one for reporting purposes. Figure G1 is based on the Matched/Missing
method of assessing penalties. The fourth method does the scoring only for those
slots that require set fills, i.e., fills that come from predefined sets of categories.
Figure G4 is based on that method of scoring. The various methods are described
more fully in [3].
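
The difference among the first three methods can be illustrated with a small
sketch. It assumes the metric definitions given in [3], namely recall =
(correct + 0.5 x partial) / possible and precision = (correct + 0.5 x partial) /
actual. It further assumes that Matched Only drops the slot penalties for both
missing and spurious templates, that Matched/Missing keeps only the penalties
for missing templates, and that All Templates keeps both. The names Tally and
overall_scores and all of the counts are hypothetical and chosen only for
illustration; they are not MUC-3 results.

```python
from dataclasses import dataclass

METHODS = ("matched_only", "matched_missing", "all_templates")


@dataclass
class Tally:
    """Slot-fill counts for one test run (all values hypothetical)."""
    correct: int
    partial: int
    incorrect: int
    missing_in_matched: int      # key fills absent from matched templates
    spurious_in_matched: int     # extra fills inside matched templates
    missing_in_unmatched: int    # fills of key templates that have no system match
    spurious_in_unmatched: int   # fills of system templates that have no key match


def overall_scores(t, method):
    """Return (recall, precision) under one penalty-assessment method."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    missing = t.missing_in_matched
    spurious = t.spurious_in_matched
    # Assumption: Matched/Missing and All Templates both penalize missing templates.
    if method in ("matched_missing", "all_templates"):
        missing += t.missing_in_unmatched
    # Assumption: only All Templates penalizes the fills of spurious templates.
    if method == "all_templates":
        spurious += t.spurious_in_unmatched
    credit = t.correct + 0.5 * t.partial
    possible = t.correct + t.partial + t.incorrect + missing   # recall denominator
    actual = t.correct + t.partial + t.incorrect + spurious    # precision denominator
    return credit / possible, credit / actual


# Toy counts, not taken from any score report.
run = Tally(correct=40, partial=10, incorrect=10,
            missing_in_matched=20, spurious_in_matched=10,
            missing_in_unmatched=20, spurious_in_unmatched=30)
for name in METHODS:
    recall, precision = overall_scores(run, name)
    print(f"{name:16s} recall={recall:.2f} precision={precision:.2f}")
```

With these toy counts, recall drops from Matched Only to Matched/Missing while
precision is unchanged, and precision drops further under All Templates, which
mirrors the lenient-to-stringent ordering described above.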
The remainder of this section is a discussion of just a few of the figures in
appendix G. (The data points in appendix G are labeled with abbreviated names of
the 15 sites, and optional test runs are marked with the site's name and an "O"
extension.) Figure G1 gives the most general picture of the results of MUC-3 final
testing. It shows that precision always exceeds recall and that the systems with
relatively high recall are also the ones that have relatively high precision. The
latter fact inspires an optimistic attitude toward the promise of at least some of the
techniques employed by today's systems -- further efforts to enhance existing
techniques and extend the systems' domain coverage may lead to significantly
improved performance on both measures. However, since all systems show better
precision than recall, it appears that it will be a bigger challenge to obtain very
high recall than it will be to achieve higher precision at recall levels that are
similar to those achievable today. This observation holds true even for Figure G2
(Matched Only), where recall is substantially greater for most systems compared to
G1 (see footnote 5).
The distribution of data points tentatively supports at least one general
observation about the technologies underlying today's systems: those systems that
use purely stochastic techniques or handcrafted pattern-matching techniques
were not able to achieve the same level of performance for MUC-3 as some of the
systems that used parsing techniques. The "non-parsing" systems are ADS, HU,
MDC, UNI, UNL, UNL-01, and UNL-02, and the "parsing" systems are BBN, BBN-O, GE,
GTE, ITP, LSI, NYU, NYU-01, NYU-02, PRC, SRI, SYN, UMA, and UMA-O.
Further support for this observation can be found in Figure G4, where the
scores are computed for all slots requiring set fills, and in Figure G9, which shows
the scores for just one of those set-fill slots, the TYPE OF INCIDENT. In these
cases, one might expect the non-parsing systems to compare more favorably with
the parsing systems, since the fill options are restricted to a fairly small,
predefined set of possibilities (see footnote 6). However, none of the non-parsing
systems appears at the leading edge in Figure G4, and the only non-parsing system
in the cluster at the leading edge in Figure G9 is ADS (which shares a data point
with NYU-02), although a few non-parsing systems have extremely high precision
scores (UNI, UNL, UNL-01, and UNL-02).
5 Recall is greater in G2 because Matched Only differs from Matched/Missing in that the "total
possible," i.e., the recall denominator, does not include penalties for missing templates.
6 The results in G4 are somewhat contaminated because some of the set-fill slots
require that the fillers be cross-referenced to fillers of string-fill slots (see, for example, the
fillers of slots 7, 11, and 13 in Figure 2 earlier in this paper). The scoring of the set-fill slots
is affected by these cross-reference tags. However, the TYPE OF INCIDENT results (G9) are
not contaminated in this way.
On the other hand, there is quite a range in performance even among the
systems in the parsing group, all of which had to cope with having limited
coverage of the domain. One thing that is apparent from the sites' system
descriptions (see Part III of this proceedings) is that the ones on the leading edge
in Figure G1 have the ability to make good use of partial sentence parses when
complete parses cannot be obtained. Level of effort is also an indicator of
performance success, though not a completely reliable one: GE, NYU, and UMass all
reported investing more than one person-year of effort in MUC-3, but several
other sites with lower overall performance also reported just under or over one
person-year of effort.
It must be said that there were some extremely immature systems in the
non-parsing group and the parsing group alike, so any general conclusions must be
taken as tentative and should certainly not be used to form opinions about the
relative validity of isolated techniques employed by the individual systems in each
group. It could be that the relatively low-performing systems use extremely
effective techniques that, if supplemented by other known techniques or
supported by more extensive domain coverage, would put those systems well out in
front. Neither should one assume that the systems at the leading edge are similar
kinds of systems. In fact, those systems have quite different architectures and
have varying sizes of lexicons, kinds of parsers and semantic interpreters, etc.
Figures G7 through G24 show how system performance varied from one slot to
another. Figures G7, G9, and G17 are useful as examples of the way spurious data
generation combines with incorrect data generation to affect the precision scores
in different kinds of slots. Figure G7 is for the TEMPLATE ID slot. The fillers of
this slot are arbitrary numbers that uniquely identify the templates for a given
message. The scoring program disregards the actual values and finds the best
match between the system-generated templates and the answer key templates for a
given message based on the degree of match in fillers of other slots in the
template. Since there is no such thing as an incorrect template ID, only a spurious
or missing template ID, and since missing data plays no role at all in computing
precision, the only penalty to precision for the TEMPLATE ID slot is due to
spurious data generation. In contrast to the TEMPLATE ID slot, the TYPE OF
INCIDENT slot (Figure G9) shows no influence of spurious data on precision at all.
This is because the TYPE OF INCIDENT slot permits only one filler. The HUMAN
TARGET: ID(S) slot (Figure G17) can be filled with indefinitely many fillers and
thus shows the impact of both incorrect and spurious data on precision.
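
The contrast among these three slots can be sketched in a few lines of code.
The sketch assumes the precision definition from [3], (correct + 0.5 x partial) /
actual, where actual = correct + partial + incorrect + spurious. The function
name slot_precision and all of the counts are hypothetical; they are not taken
from the MUC-3 score reports.

```python
def slot_precision(correct, partial, incorrect, spurious):
    """Precision for one slot: (correct + 0.5 * partial) / actual."""
    actual = correct + partial + incorrect + spurious
    if actual == 0:
        return None  # nothing was generated, so precision is undefined
    return (correct + 0.5 * partial) / actual


# Hypothetical counts for three kinds of slots.
slots = {
    # There is no incorrect template ID, so only spurious fills lower precision.
    "TEMPLATE ID": dict(correct=60, partial=0, incorrect=0, spurious=40),
    # Only one filler is permitted, so spurious fills do not arise here.
    "TYPE OF INCIDENT": dict(correct=50, partial=0, incorrect=10, spurious=0),
    # Indefinitely many fillers: incorrect and spurious fills can both occur.
    "HUMAN TARGET: ID(S)": dict(correct=40, partial=10, incorrect=10, spurious=20),
}

for name, counts in slots.items():
    print(f"{name:20s} precision={slot_precision(**counts):.2f}")
```

Missing fills do not appear among the arguments at all, which reflects the
point made above that missing data plays no role in computing precision.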
Four sites submitted results for the optional test runs that were alluded to in the
previous section -- BBN Systems and Technologies (BBN-O), New York University
(NYU-01 and NYU-02), the University of Massachusetts (UMA-O), and the
University of Nebraska/University of Southwestern Louisiana (UNL-01 and
UNL-02). These sites conducted radically different experiments to generate templates
more conservatively. The BBN-O experiment largely involved doing a narrower
search in the text for the template-filling information; the NYU-01 and NYU-02
experiments involved throwing out templates in which certain key slots were
either unfilled or were filled with information that indicated an irrelevant
incident with good probability; the UMA-O experiment bypassed a case-based
reasoning component of the system; and the UNL-01 and UNL-02 experiments
involved the usage of different thresholds in their connectionist framework. The
experiments resulted in predicted differences in the Matched/Missing scores
compared to the basic test. In almost all cases the experiments had the overall
effect of lowering recall; in all cases they lowered overgeneration and thereby
raised precision. Figure G7 shows the marked difference the experiments made in
spurious template generation; Figure G1 shows the much smaller difference they
made in overall recall and precision.
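
The general effect of generating templates more conservatively can be sketched
with a toy filter. This is not a reconstruction of any site's experiment: the
confidence values, the threshold parameter, and the names Candidate and
template_scores are hypothetical, and recall, precision, and overgeneration are
simplified here to the template level (matched templates over key templates,
matched templates over generated templates, and spurious templates over
generated templates, respectively).

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    confidence: float   # hypothetical score the system attaches to a template
    matches_key: bool   # True if the template matches an answer-key template


def template_scores(candidates, key_total, threshold):
    """Return (recall, precision, overgeneration) after dropping low-confidence templates."""
    kept = [c for c in candidates if c.confidence >= threshold]
    matched = sum(c.matches_key for c in kept)
    spurious = len(kept) - matched
    recall = matched / key_total
    # Precision and overgeneration are undefined when nothing is generated.
    precision = matched / len(kept) if kept else None
    overgeneration = spurious / len(kept) if kept else None
    return recall, precision, overgeneration


# Toy data: ten generated templates scored against a key of ten templates.
candidates = [
    Candidate(0.95, True), Candidate(0.90, True), Candidate(0.85, True),
    Candidate(0.80, False), Candidate(0.70, True), Candidate(0.60, True),
    Candidate(0.50, False), Candidate(0.40, True), Candidate(0.30, False),
    Candidate(0.20, False),
]

for threshold in (0.0, 0.55):
    print(threshold, template_scores(candidates, key_total=10, threshold=threshold))
```

With this toy data, raising the threshold from 0.0 to 0.55 lowers recall from
0.60 to 0.50, lowers overgeneration from 0.40 to about 0.17, and raises
precision from 0.60 to about 0.83, the same direction of change reported for
the optional runs.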
CONCLUSIONS
The MUC-3 evaluation established a solid set of performance benchmarks for
systems with diverse approaches to text analysis and information extraction. The
MUC-3 task was extremely challenging, and the results show what can be done with
today's technologies after only a modest domain- and task-specific development
effort (on the order of one person-year). On a task this difficult, the systems that
cluster at the leading edge were able to generate in the neighborhood of 40-50% of
the expected data and to do it with 55-65% accuracy. Breakdowns of performance
by slot show that performance was best on identifying the type of incident --
70-80% recall (completeness) and 80-85% precision (accuracy) were achieved, and
precision figures in the 90-100% range were possible with some sacrifice in recall.
All of the MUC-3 system developers are optimistic about the prospects for
seeing steady improvements in system performance for the foreseeable future.
This feeling is based variously on such evidence as the amount of improvement
achieved between the dry-run test and the final test, the slope of improvement
recorded on internal tests conducted at intervals during development, and the
developers' own awareness of significant components of the system that they had
not had time to adapt to the MUC-3 task. The final test results are consistent with
the claim that most systems, if not all, may well still be on a steep slope of
improvement. However, they also show that performance on recall is not as good
as performance on precision, and they lend support to the possibility that this
discrepancy will persist. It appears that systems cannot be built today that are
capable of obtaining high overall recall, even at the expense of outrageously high
overgeneration. Systems can, however, be built that will do a good job at
potentially useful subtasks such as identifying terrorist incidents of various kinds.
The results give at least a tentative indication that systems incorporating
robust parsing techniques show more long-term promise of high performance
than non-parsing systems. However, there are great differences in techniques
among the systems in the parsing and non-parsing groups and even among those
robust parsing systems that did the best in maximizing recall and precision and
minimizing the tradeoff between them. Further variety was evident in the
optional test runs conducted by some of the sites. Those runs show promise for the
development of systems that can be "tuned" in various ways to generate data more
aggressively or more conservatively, yielding tradeoffs between recall and
precision that respond to differences in emphasis in real-life applications.
Some conclusions can be drawn regarding the evaluation setup itself that will
influence future work. First, the evaluation corpus and task were sufficiently
challenging that they can be used again in a future evaluation (with a refined task
definition and a new test set). Second, the information extraction task needs
modification in order to focus as much as possible on language processing
capabilities separate from information extraction capabilities, and new ideas for
designing tests related to specific linguistic phenomena are needed. Finally, more
work is needed to ensure that the statistical significance of the results is known,
and a serious study of human performance on the task is needed in order to define
concrete performance goals for the systems.
ACKNOWLEDGEMENTS
This work was funded by DARPA under ARPA order 6359. The author is
indebted to all the evaluation participants, whose collaboration on MUC-3 deserves
the highest praise. The author would especially like to thank those individuals
who served in special capacities and contributed extra time and energy to ensure
the success of the evaluation and the publication of the proceedings, among whom
are Laura Blumer Balcom, Nancy Chinchor, Ralph Grishman, Pete Halverson,
Lynette Hirschman, Jerry Hobbs, Cheryl Kariya, George Krupka, David Lewis, Lisa
Rau, Eric Scott, John Sterling, Charles Wayne, and Carl Weir.
REFERENCES
[1] Grishman, R., and Sterling, J., Preference Semantics for Message
Understanding, in Proceedings of the Speech and Natural Language Workshop,
October, 1989, Morgan Kaufmann, pp. 71-74.
[2] Hirschman, L., Comparing MUCK-II and MUC-3: Assessing the Difficulty of
Different Tasks (in this volume).
[3] Chinchor, N., MUC-3 Evaluation Metrics (in this volume).
[4] Chinchor, N., MUC-3 Linguistic Phenomena Test Experiment (in this volume).
