Message Understanding Conference - 6:
A Brief History

Ralph Grishman
Dept. of Computer Science
New York University
715 Broadway, 7th Floor
New York, NY 10003, USA
grishman@cs.nyu.edu

Beth Sundheim
Naval Command, Control and Ocean Surveillance Center
Research, Development, Test and Evaluation Division (NRaD)
Code 44208
53140 Gatchell Road
San Diego, California 92152-7420
sundheim@pojke.nosc.mil
Abstract 
We have recently completed the sixth 
in a series of "Message Understanding 
Conferences" which are designed to pro- 
mote and evaluate research in informa- 
tion extraction. MUC-6 introduced sev- 
eral innovations over prior MUCs, most 
notably in the range of different tasks for 
which evaluations were conducted. We 
describe some of the motivations for the 
new format and briefly discuss some of 
the results of the evaluations. 
1 The MUC Evaluations 
We have just completed the sixth in a series of 
Message Understanding Conferences, which have 
been organized by NRaD, the RDT&E division of
the Naval Command, Control and Ocean Surveil- 
lance Center (formerly NOSC, the Naval Ocean 
Systems Center) with the support of DARPA, 
the Defense Advanced Research Projects Agency. 
This paper looks briefly at the history of these 
Conferences and then examines the considerations 
which led to the structure of MUC-6.¹
The Message Understanding Conferences were 
initiated by NOSC to assess and to foster research 
on the automated analysis of military messages 
containing textual information. Although called 
"conferences", the distinguishing characteristic of 
the MUCs are not the conferences themselves, 
but the evaluations to which participants must 
submit in order to be permitted to attend the 
conference. For each MUC, participating groups 
have been given sample messages and instructions 
on the type of information to be extracted, and 
have developed a system to process such messages. 
Then, shortly before the conference, participants are given a set of test messages
to be run through their system (without making any changes to the system); the
output of each participant's system is then evaluated against a manually-prepared
answer key.

¹The full proceedings of the conference are to be distributed by Morgan Kaufmann
Publishers, San Mateo, California; earlier MUC proceedings, for MUC-3, 4, and 5,
are also available from Morgan Kaufmann.
The MUCs are remarkable in part because of the degree to which these evaluations
have defined a program of research and development. DARPA has a number of
information science and technology programs which are driven in large part by
regular evaluations. The MUCs are notable, however, in that they have in large part
shaped the research program in information extraction and brought it to its
current state.²

²There were, however, a number of individual research efforts in information
extraction underway before the first MUC, including the work on information
formatting of medical narrative by Sager at New York University; the formatting of
naval equipment failure reports by Marsh at the Naval Research Laboratory; and the
DBG work by Logicon for RADC.
2 Early History 
MUC-1 (1987) was basically exploratory; each 
group designed its own format for recording the 
information in the document, and there was no 
formal evaluation. By MUC-2 (1989), the task had crystallized as one of template
filling: one receives a description of a class of events to be identified in the
text, and for each of these events one must fill a template with information about
the event.
The template has slots for information about the 
event, such as the type of event, the agent, the 
time and place, the effect, etc. For MUC-2, the 
template had 10 slots. Both MUC-1 and MUC- 
2 involved sanitized forms of military messages 
about naval sightings and engagements. 
The second MUC also worked out the details of 
the primary evaluation measures, recall and pre- 
cision. To present it in simplest terms, suppose the answer key has N_key filled
slots, and that a system fills N_correct slots correctly and N_incorrect
incorrectly (with some other slots possibly left unfilled). Then

    recall = N_correct / N_key
    precision = N_correct / (N_correct + N_incorrect)
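In code the computation is immediate; the sketch below (our own illustration, not
part of any MUC scoring software, and with names of our choosing) makes the
definitions concrete:

    def muc_scores(n_key, n_correct, n_incorrect):
        # n_key:       filled slots in the manually-prepared answer key
        # n_correct:   slots the system filled correctly
        # n_incorrect: slots the system filled incorrectly
        # Slots the system leaves unfilled hurt recall but not precision.
        recall = n_correct / n_key
        precision = n_correct / (n_correct + n_incorrect)
        return recall, precision

    # A system filling 60 of 100 key slots correctly and 20 incorrectly:
    r, p = muc_scores(100, 60, 20)
    print(r, p)   # 0.6 0.75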
For MUC-3 (1991), the task shifted to reports of terrorist events in Central and
South America, as reported in articles provided by the Foreign Broadcast
Information Service, and the template became somewhat more complex (18 slots).
This same task was used for MUC-4 (1992), with a 
further small increase in template complexity (24 
slots). 
MUC-5 (1993), which was conducted as part of the Tipster program,³ represented a
substantial further jump in task complexity. Two tasks were involved, international
joint ventures and electronic circuit fabrication, in two languages, English and
Japanese. The joint venture task required 11 templates with a total of 47 slots for
the output (double the number of slots defined for MUC-4), and the task
documentation was over 40 pages long.
One innovation of MUC-5 was the use of a nested template structure. In earlier
MUCs, each event had been represented as a single template, in effect a single
record in a database with a large number of attributes. This format proved awkward
when an event had several participants (e.g., several victims of a terrorist
attack) and one wanted to record a set of facts about each participant. This sort
of information could be much more easily recorded in the hierarchical structure
introduced for MUC-5, in which there was a single template for an event, which
pointed to a list of templates, one for each participant in the event.⁴
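To make the contrast concrete, here is a minimal sketch in Python (our own
illustration; the slot names and values are invented and do not follow the actual
MUC-5 definitions):

    # Flat (MUC-2 through MUC-4 style): one record per event, so facts
    # about several participants must share single attributes.
    flat_event = {
        "EVENT_TYPE": "ATTACK",
        "VICTIM": "Smith; Jones",   # two victims crammed into one slot
    }

    # Nested (MUC-5 style): the event template points to one sub-template
    # per participant, each carrying its own facts.
    nested_event = {
        "EVENT_TYPE": "ATTACK",
        "VICTIMS": [
            {"NAME": "Smith", "STATUS": "injured"},
            {"NAME": "Jones", "STATUS": "killed"},
        ],
    }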
3 MUC-6: initial goals 
DARPA convened a meeting of Tipster participants and government representatives in
December 1993 to define goals and tasks for MUC-6.⁵ Among the goals which were
identified were

• demonstrating task-independent component technologies of information extraction
  which would be immediately useful

• encouraging work to make information extraction systems more portable

• encouraging work on "deeper understanding"
³Tipster is a U.S. Government program of research and development in the areas of
information retrieval and information extraction.

⁴In fact the MUC-5 structure was much more complex, because there were separate
templates for products, time, activities of organizations, etc.
⁵The representatives of the research community were Jim Cowie, Ralph Grishman
(committee chair), Jerry Hobbs, Paul Jacobs, Len Schubert, Carl Weir, and Ralph
Weischedel. The government people attending were George Doddington, Donna Harman,
Boyan Onyshkevych, John Prange, Bill Schultheis, and Beth Sundheim.
Each of these can be seen in part as a reaction to the trends in the prior MUCs.
The MUC-5 tasks, in particular, had been quite complex, and a great effort had been
invested by the government in preparing the training and test data and by the
participants in adapting their systems for these tasks. Most participants worked on
the tasks for 6 months; a few (the Tipster contractors) had been at work on the
tasks for considerably longer. While the performance of some systems was quite
impressive (the best got 57% recall, 64% precision overall, with 73% recall and 74%
precision on the 4 "core" template types), the question naturally arose as to
whether there were many applications for which an investment of one or several
developers over half a year (or more) could be justified.
Furthermore, while so much effort had been expended, a large portion was specific
to the particular tasks. It wasn't clear whether much progress was being made on
the underlying technologies which would be needed for better understanding.
To address these goals, the meeting formulated 
an ambitious menu of tasks for MUC-6, with the 
idea that individual participants could choose a 
subset of these tasks. We consider the three goals 
in the three sections below, and describe the tasks 
which were developed to address each goal. 
4 Short-term subtasks 
The first goal was to identify, from the component technologies being developed for
information extraction, functions which would be of practical use, would be largely
domain independent, and could in the near term be performed automatically with high
accuracy. To meet this goal the committee developed the "named entity" task, which
basically involves identifying the names of all the people, organizations, and
geographic locations in a text.
The final task specification, which also involved 
time, currency, and percentage expressions, used 
SGML markup to identify the names in a text. 
Figure 1 shows a sample sentence with named en- 
tity annotations. The tag ENAMEX ("entity name expression") is used for both people
and organization names; the tag NUMEX ("numeric expression") is used for currency
and percentages.
5 Portability 
The second goal was 1;o focus on portability in 
the inibrmation extraction task the ability to 
rapidly retarget a system to extract; information 
about a different class of events. The comnfit- 
tee felt that it was important to demonstrate that 
useful extraction systems eouht be created in a 
few weeks. To meet this goal, we decided that 
the infbrmation extraction task for MUC-6 wouhl 
have to involve a relatively simple template, more 
like MUC-2 than MUC-5; this was duhbed "mini- 
Mr. <ENAMEX TYPE="PERSON">Dooner</ENAMEX> met with <ENAMEX TYPE="PERSON">Martin
Puris</ENAMEX>, president and chief executive officer of <ENAMEX
TYPE="ORGANIZATION">Ammirati & Puris</ENAMEX>, about <ENAMEX
TYPE="ORGANIZATION">McCann</ENAMEX>'s acquiring the agency with billings of <NUMEX
TYPE="MONEY">$400 million</NUMEX>, but nothing has materialized.

Figure 1: Sample named entity annotation.
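Markup of this kind is easy to process mechanically. The sketch below (our own
illustration, not one of the evaluated systems; it assumes the simple non-nested
tagging shown in Figure 1) pulls out the typed names:

    import re

    # Matches <ENAMEX TYPE="...">...</ENAMEX> and <NUMEX TYPE="...">...</NUMEX>;
    # the backreference \1 pairs each opening tag with its closing tag.
    TAG = re.compile(r'<(ENAMEX|NUMEX) TYPE="([^"]+)">(.*?)</\1>', re.S)

    def annotated_names(text):
        return [(m.group(2), m.group(3)) for m in TAG.finditer(text)]

    s = ('<ENAMEX TYPE="ORGANIZATION">McCann</ENAMEX> has billings of '
         '<NUMEX TYPE="MONEY">$400 million</NUMEX>')
    print(annotated_names(s))
    # [('ORGANIZATION', 'McCann'), ('MONEY', '$400 million')]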
MUC". In keeping with |;he hierarchical tem- 
plate structure introduced in MUC-5, it was envi- 
sioned |;hat the inini-MUC would have an event- 
level template pointing to templates representing 
|;he partieitmnts in the event (people, orgmfiza- 
tions, products, e.tc.), me(liated perhaps by a "re- 
lational" level template. 
To further increase portability, a proposal was made to standardize the
lowest-level templates (for people, organizations, etc.), since these basic classes
are involved in a wide variety of actions. In this way, MUC participants could
develop code for these low-level templates once, and then use them with many
different types of events. These low-level templates were named "template
elements".
As the specification finally developed, the template element for organizations had
six slots, for the maximal organization name, any aliases, the type, a descriptive
noun phrase, the locale (most specific location), and country. Slots are filled
only if information is explicitly given in the text (or, in the case of the
country, can be inferred from an explicit locale). The text
    "We are striving to have a strong renewed creative partnership with
    Coca-Cola," Mr. Dooner says. However, odds of that happening are slim since
    word from Coke headquarters in Atlanta is that...
would yield an organization template element with five of these six slots filled:

    <ORGANIZATION-9402240133-5> :=
        ORG_NAME: "Coca-Cola"
        ORG_ALIAS: "Coke"
        ORG_TYPE: COMPANY
        ORG_LOCALE: Atlanta CITY
        ORG_COUNTRY: United States

(the first line identifies this as organization template 5 from article
9402240133).
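Viewed as a data structure, a template element is simply a small record. The
following sketch (our own rendering as a Python dataclass; the class and field
names are ours, not official MUC-6 notation) shows the six organization slots:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class OrgElement:
        # The six MUC-6 organization slots; a slot stays None unless the
        # text supplies the information (or, for country, implies it).
        name: Optional[str] = None        # maximal organization name
        alias: Optional[str] = None       # any aliases
        org_type: Optional[str] = None    # e.g. "COMPANY"
        descriptor: Optional[str] = None  # descriptive noun phrase
        locale: Optional[str] = None      # most specific location
        country: Optional[str] = None     # inferred from the locale

    # The Coca-Cola example above: five of the six slots filled.
    coke = OrgElement(name="Coca-Cola", alias="Coke", org_type="COMPANY",
                      locale="Atlanta CITY", country="United States")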
Ever on the lookout for additional evaluation measures, the committee decided to
make the creation of template elements for all the people and organizations in a
text a separate MUC task. Like the named entity task, this was also seen as a
potential demonstration of the ability of systems to perform a useful, relatively
domain independent task with near-term extraction technology (although it was
recognized as being more difficult than named entity, since it required merging
information from several places in the text). The old-style MUC information
extraction task, based on a description of a particular class of events (a
"scenario"), was called the "scenario template" task. A sample scenario template is
shown in the appendix.
6 Measures of deep understanding 
Another concern which was noted about the MUCs is that the systems were tending
towards relatively shallow understanding techniques (based primarily on local
pattern matching), and that not enough work was being done to build up the
mechanisms needed for deeper understanding. Therefore, the committee, with strong
encouragement from DARPA, included three MUC tasks which were intended to measure
aspects of the internal processing of an information extraction or language
understanding system. These three tasks, which were collectively called SemEval
("Semantic Evaluation"), were:
• Coreference: the system would have to mark coreferential noun phrases (the
  initial specification envisioned marking set-subset and part-whole relations, in
  addition to identity relations)

• Word sense disambiguation: for each open class word (noun, verb, adjective,
  adverb) in the text, the system would have to determine its sense using the
  Wordnet classification (its "synset", in Wordnet terminology)

• Predicate-argument structure: the system would have to create a tree
  interrelating the constituents of the sentence, using some set of grammatical
  functional relations
The committee recognized that, in selecting such internal measures, it was making
some presumption regarding the structures and decisions which an analyzer should
make in understanding a document. Not everyone would share these presumptions, but
participants in the next MUC would be free to enter the information extraction
evaluation and skip some or all of these internal evaluations. Language
understanding technology might develop in ways very different from those imagined
by the committee, and these internal evaluations might turn out to be irrelevant
distractions. However, from the current perspective of most of the committee, these
seemed fairly basic aspects of understanding, and so an experiment in evaluating
them (and encouraging improvement in them) would be worthwhile.
7 Preparation process 
Round 1: Resolution of SemEval 
The committee had proposed a very ambitious program of evaluations. We now had to
reduce these proposals to detailed specifications. The first step was to do some
manual text annotation for the four tasks (named entity and the SemEval triad)
which were quite different from what had been tried before. Brief specifications
were prepared for each task, and in the spring of 1994 a group of volunteers
(mostly veterans of earlier MUCs) annotated a short newspaper article, using each
set of specifications.
Problems arose with each of the SemEval tasks.

• For coreference, there were problems identifying part-whole and set-subset
  relations, and distinguishing the two; a decision was later made to limit
  ourselves to identity relations.

• For sense tagging, the annotators found that in some cases Wordnet made very
  fine distinctions and that making these distinctions consistently in tagging was
  very difficult.

• For predicate-argument structure, practically every new construct beyond simple
  clauses and noun phrases raised new issues which had to be collectively
  resolved.
Beyond these individual problems, it was felt that the menu was simply too
ambitious, and that we would do better by concentrating on one element of the
SemEval triad for MUC-6; at a meeting held in June 1994, a decision was made to go
with coreference. In part, this reflected a feeling that the problems with the
coreference specification were the most amenable to solution. It also reflected a
conviction that coreference identification had been, and would remain, critical to
success in information extraction, and so it was important to encourage advances in
coreference. In contrast, most extraction systems did not build full
predicate-argument structures, and word-sense disambiguation played a relatively
small role in extraction (particularly since extraction systems operated in a
narrow domain).
The coreference task, like the named entity task, was annotated using SGML
notation. A COREF tag has an ID attribute which identifies the tagged noun phrase
or pronoun. It may also have an attribute of the form REF=n, which indicates that
this phrase is coreferential with the phrase with ID n. Figure 2 shows an excerpt
from an article annotated for coreference.⁶

⁶The TYPE and MIN attributes which appear in the actual annotation have been
omitted here for the sake of readability.
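Such markup can be reduced to coreference chains by following REF links back to a
common antecedent. The sketch below is our own illustration, not the official
scorer; it ignores the TYPE, MIN, and STATUS attributes and assumes tags do not
nest (the real annotation does nest them, as Figure 2 shows):

    import re

    COREF = re.compile(r'<COREF ID="(\d+)"(?: REF="(\d+)")?[^>]*>(.*?)</COREF>')

    def chains(text):
        # Collect each mention and its REF link, then group mentions by the
        # chain head reached by following REF links transitively.
        parent, mention = {}, {}
        for cid, ref, s in COREF.findall(text):
            mention[cid] = s
            if ref:
                parent[cid] = ref
        def head(c):
            while c in parent:
                c = parent[c]
            return c
        out = {}
        for cid, s in mention.items():
            out.setdefault(head(cid), []).append(s)
        return list(out.values())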
Round 2: annotation 
The next step was the preparation of a substantial training corpus for the two
novel tasks which remained (named entity and coreference). SRA Corporation kindly
provided tools which aided in the annotation process. Again a stalwart group of
volunteer annotators was assembled;⁷ each was provided with 25 articles from the
Wall Street Journal. There was some overlap between the articles assigned, so that
we could measure the consistency of annotation between sites. This annotation was
done in the winter of 1994-95.
A major role of the annotation process was to identify and resolve problems with
the task specifications. For named entities, this was relatively straightforward.
For coreference, it proved remarkably difficult to formulate guidelines which were
reasonably complete and consistent.⁸
Round 3: dry run
Once the task specifications seemed reasonably stable, NRaD organized a "dry run":
a full-scale rehearsal for MUC-6, but with all results reported anonymously. The
dry run took place in April 1995, with a scenario involving labor union contract
negotiations. Of the sites which were involved in the annotation process, ten
participated in the dry run. Results of the dry run were reported at the Tipster
Phase II 12-month meeting in May 1995.
8 The formal evaluation 
The MUC-6 formal evaluation was held in September 1995. The scenario definition was
distributed at the beginning of September; the test data was distributed four weeks
later, with results due by the end of the week. The scenario involved changes in
corporate executive management personnel. The evaluation met many of the goals
which had been set by the initial planning conference in December of 1993.

There were evaluations for four tasks: named entity, coreference, template element,
and scenario template. There were 16 participants; 15 participated in the named
entity task, 7 in coreference, 11 in template element, and 9 in scenario template.
Named entity was intended to be a simple task on which systems could demonstrate a
high level of performance, high enough for immediate use. Our success in this task
exceeded our expectations.

⁷The annotation groups were from BBN, Brandeis Univ., the Univ. of Durham,
Lockheed-Martin, New Mexico State Univ., NRaD, New York Univ., PRC, the Univ. of
Pennsylvania, SAIC (San Diego), SRA, SRI, the Univ. of Sheffield, Southern
Methodist Univ., and Ultisys.

⁸As experienced computational linguists, we probably should have known better than
to think this was an easy task.
Maybe <COREF ID="136" REF="134">he</COREF>'ll even leave something from <COREF
ID="138" REF="139"><COREF ID="137" REF="136">his</COREF> office</COREF> for <COREF
ID="140" REF="91">Mr. Dooner</COREF>. Perhaps <COREF ID="144">a framed page from
the New York Times, dated Dec. 8, 1987, showing a year-end chart of the stock
market crash earlier that year</COREF>. <COREF ID="141" REF="137">Mr.
James</COREF> says <COREF ID="142" REF="141">he</COREF> framed <COREF ID="143"
REF="144" STATUS="OPT">it</COREF> and kept <COREF ID="145" REF="144">it</COREF> by
<COREF ID="146" REF="142">his</COREF> desk as a "personal reminder. It can all be
gone like that."

Figure 2: Sample coreference annotation.
The majority of sites had recall
and precision over 90%; the highest-scoring sys- 
tem had a recall of 96% and a precision of 97%. 
Although one must keep in mind the somewhat 
limited range of texts in the test set (all are from 
the Wall Street Journal, in particular), the re- 
sults are excellent. A couple of these systems have 
been commercialized, and several are being incor- 
porated into government text-processing systems. 
Given this level of performance, there is probably 
little point in repeating this task with the same 
ground rules in a future MUC (although there 
might be interest in processing monocase text and in performing comparable tasks on
a more varied corpus and for languages other than English).
The template element task, while superficially similar to named entities (it is
also based on identifying people and organizations), is significantly more
difficult. One has to identify de-
scriptions of entities ("a distributor of kumquats") 
as well as names. If an entity is mentioned sev- 
eral times, possibly using descriptions or differ- 
ent forms of a name, these need to be identified 
together; there should be only one template ele- 
ment for each entity in an article. Consequently, 
the scores were appreciably lower, ranging across most systems from 65% to 75% in
recall and from 75% to 85% in precision. The top-scoring sys-
tem had 75% recall, 86% precision. Systems did 
particularly poorly in identifying descriptions; the 
highest-scoring system had 38% recall and 51% 
precision for descriptions. 
There seemed general agreement that having 
prepared code for template elements in advance 
did make it easier to port a system to a new scenario in a few weeks. This factor,
and the room
that exists for improvement in performance, sug- 
gest that including this task in a future MUC may 
be worthwhile. 
The goal for scenario templates (mini-MUC) was to demonstrate that effective
information extraction systems could be created in a few weeks. This too was
successful. Although it is
difficult to meaningfully compare results on differ- 
ent scenarios, the scores obtained by most systems 
after a few weeks (40% to 50% recall, 60% to 70% 
precision) were comparable to the best scores ob- 
tained in prior MUCs. The highest performance 
overall was 47% recall and 70% precision. 
One can observe an increasing convergence of methods for information extraction.
Most of the systems participating in MUC-6 employed a cascade of finite-state
pattern recognizers, with the earlier pattern sets recognizing entities and the
later sets recognizing scenario-specific patterns. This convergence may be one
reason for the bunching of scores for this task: most systems fell in a rather
narrow range in both recall and precision.
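To suggest the flavor of this architecture, here is a toy two-stage cascade (our
own sketch; the patterns are grossly simplified and belong to no participant's
system):

    import re

    # Stage 1: entity patterns (grossly simplified).
    PERSON = re.compile(r'Mr\. [A-Z][a-z]+')
    ORG = re.compile(r'[A-Z][A-Za-z&]* (?:Inc|Corp|Group)\b')

    def stage1(text):
        # Wrap recognized entities so later stages can match typed units.
        text = PERSON.sub(lambda m: '<PER>' + m.group(0) + '</PER>', text)
        return ORG.sub(lambda m: '<ORG>' + m.group(0) + '</ORG>', text)

    # Stage 2: a scenario-specific pattern over the stage-1 output.
    HIRE = re.compile(r'<PER>(.*?)</PER> was hired (?:from|by) <ORG>(.*?)</ORG>')

    def stage2(marked):
        return [{'person': p, 'org': o} for p, o in HIRE.findall(marked)]

    print(stage2(stage1('Mr. Kim was hired by WPP Group last September.')))
    # [{'person': 'Mr. Kim', 'org': 'WPP Group'}]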
The results of this MUC provide valuable positive testimony on behalf of
information extraction, but further improvement in both portability and performance
is needed for many applications. With respect to portability, customers would like
to have systems which can be ported in a few hours, or at most a few days, by
someone with less expertise than a system developer. How this might be tested in
the context of a MUC is not entirely clear. For one thing, most sites spent several
days just studying the scenario description and annotated corpus, in order to
understand the scenario definition, before coding began. Perhaps a micro-MUC,⁹ with
an even simpler template structure, is needed to push the limits of portability.
Getting systems which can be customized by others is also a tall order, given the
complexity and variety of knowledge sources needed for a typical MUC information
extraction task.
With respect to performance, the bunching of
scores suggests that many sites were able to solve a 
common set of "easy" problems, but were stymied 
in processing messages which involved "hard" 
problems. Whether this is true, and just what 
the hard problems are, will require more extensive 
analysis of the results of MUC-6. Are the short- 
comings due primarily to a lack of coverage in the 
basic patterns, to a lack of background knowledge 
in the domain, to failures in coreference, or some- 
thing else? We may hope that the failings are
primarily in one area, so that we may concentrate 
our energies there, but more likely the failings will 
be in many areas, and broad improvements in ex- 
traction engines will be needed to improve perfor- 
mance. 
⁹A term suggested by George Krupka.
Pushing improvements in the underlying technology was one of the goals of SemEval
and its current survivor, coreference. Much of the energy for the current round,
however, went into honing the definition of the task. Philosophers of language have
been arguing over reference and coreference for centuries, so we should not have
been surprised that it would be so hard to prepare a precise and consistent
definition. Additional work on the definition will be necessary, and it may be
necessary to narrow the task further. Despite these distractions, a few interesting
early results were obtained regarding coreference methods; we may hope that, once
the task specification settles down, the availability of coreference-annotated
corpora and the chance for glory in further evaluations will encourage more work in
this area.
Appendix: Sample Scenario 
Template 
Shown below is a set of templates for the MUC-6 scenario template task. The
scenario involved changes in corporate executive management personnel. For the
text:

    McCann has initiated a new so-called global collaborative system, composed of
    world-wide account directors paired with creative partners. In addition, Peter
    Kim was hired from WPP Group's J. Walter Thompson last September as vice
    chairman, chief strategy officer, world-wide.
the following templates were to be generated: 
<SUCCESSION_EVENT-9402240133-3> :=
    SUCCESSION_ORG: <ORGANIZATION-9402240133-1>
    POST: "vice chairman, chief strategy officer, world-wide"
    IN_AND_OUT: <IN_AND_OUT-9402240133-5>
    VACANCY_REASON: OTH_UNK

<IN_AND_OUT-9402240133-5> :=
    IO_PERSON: <PERSON-9402240133-5>
    NEW_STATUS: IN
    ON_THE_JOB: YES
    OTHER_ORG: <ORGANIZATION-9402240133-8>
    REL_OTHER_ORG: OUTSIDE_ORG

<ORGANIZATION-9402240133-1> :=
    ORG_NAME: "McCann"
    ORG_TYPE: COMPANY

<ORGANIZATION-9402240133-8> :=
    ORG_NAME: "J. Walter Thompson"
    ORG_TYPE: COMPANY

<PERSON-9402240133-5> :=
    PER_NAME: "Peter Kim"
Although we cannot explain all the details of the template here, a few highlights
should be noted. For each executive post, one generates a SUCCESSION_EVENT
template, which contains references to the ORGANIZATION template for the
organization involved, and the IN_AND_OUT template for the activity involving that
post (if an article describes a person leaving and a person starting the same job,
there will be two IN_AND_OUT templates). The IN_AND_OUT template contains
references to the templates for the PERSON and for the ORGANIZATION from which the
person came (if he/she is starting a new job). The PERSON and ORGANIZATION
templates are the "template element" templates, which are invariant across
scenarios.
