LANGUAGE SYSTEMS, INC .
MUC-3 TEST RESULTS AND ANALYSIS 1
Christine A. Montgomery
Bonnie Glover Stalls
Robert S. Belvin
Robert E. Stumberger
Language Systems, Inc .
6269 Variel Avenue, Suite F
Woodland Hills, CA 91367
lsiO01%lsi-la .uucp@ism.isc.com
1-818-703-503 4
RESULTS
Table 1 summarizes LSI 's scores for Phase II of MUC-3. In evaluating these scores, which place LSI about
two/thirds of the way down in a results-ranked list of MUC-3 participants, it should be noted that our MUC- 3
system reflects a major redevelopment of the key components of the DBG message understanding system, whic h
is currently in process . Specifically, innovative development of a parser based on government-binding principle s
is under way, 2 with associated revisions of the lexicon, functional parse (recovery of the predicate/argumen t
functions representing the underlying semantic structure of the sentence), and DBG template 3 generation and
frame hierarchy components (the areas indicated by the heavy lines in the system flow chart shown in Figure 1).
This innovative development is described more fully in the system summary paper. For the purposes of thi s
site report, it is obvious that the "under construction" status of the DBG system had considerable impact upo n
our ability to achieve a respectable score. Had we chosen instead to go with the fairly robust previous version o f
the DBG system (described in [2] and [3], recently evaluated for Rome Laboratory by KSC, Inc .), our MUC-3
scores would certainly have been substantially better, because all components of the DBG system would have
been fully functional (see the discussion on functionality of the DBG version currently under developmen t
below).
However, we felt strongly that the time had come to replace our chart parser with weighted rules by a mor e
powerful and generic model that would provide a better foundation for current work, including automated trans-
lation and the integration of speech processing with the DBG system, as well as for the complex MUC-3 mes-
sages. Once the decision was made to embark upon this major re-development of the key DBG system com-
ponents, it would have been unproductive to carry out the MUC-3 development in parallel on the older versio n
of the DBG system (as well as infeasible given LSI's available resources for MUC-3).
ALLOCATION OF EFFOR T
For the reasons described above, the bulk of our development effort was concentrated on the parser (approxi-
mately 60% of the total MUC-3 development effort) and the lexicon (approximately 30% of the total effort),
1. The work reported in this paper was supported in part by the Defense Advanced Research Projects Agency, Information Systems and
Technology Office, under Contract No. MDA903-89-C-0041 (Subcontract MC 1350)
2. Unlike chart parsers, which are based on a well-understood model that has been in existence for almost 30 years, very few attempts hav e
been made at implementing GB parsers (see [1]), and LSI's implementation approach differs from these in several important respects .
3. It is important to note that the term "template " in the DBG system is a label for the generic message level semantic and pragmatic
representational units, not an application oriented structure like the MUC-3 templates . It is the glass box output or internal
representational output, as opposed to the MUC-3 templates, which are black box outputs mapped to the external representation require d
by a given application : currently, extraction of specific data elements for describing terrorist incidents in 9 Latin American countries .
84
1
11!1!111111 .1 .1! .
	 1i1,,~~11liil}~fll
- - r
U. A
s
	
E .
I : —.
iN:ullili!
~,111~!tt~ll ll~~tl .z~!l1~,1111i, i
!IIIi ii ~tlii!1l1~i1liil1 l
;t
	
t
	
t
	
t
	
t
_!ri
	
i te a
t
	
' l .g1= -!~=--11;-;111~1;11111
	
l4 11'1~1Fl~t iltt11 ' 1~j 1lr' `tt~lti ili t~lklu1i11 !liil~i ii1i111u!il lt, ti
	
.
	
.l	 -a	
t
	
;li
	
— --
	
'
	
'01
11111;1
	
(
	
1l
.
	
h
uT:
	
l
	
+1
	
!s1? ~ : 1
	
E
, ;lti~u ;iii; 1x .1 1, ' 1 t.	 	 ` -
I hi:
II
	
11;1li i
il i1
1
llt't
I!iili ,t :; i
li ii1li
	
1;
llh!i1
II
I
Figure 1 : Functional Flow of the System
85
SLOT
	
REC PRE OVG FAL
MATCHED ONLY
	
32 48 22
MATCHED/MISSING
	
16 48 22
ALL TEMPLATES
	
16 33 46
SET FILLS ONLY
	
15 45 32 0
TABLE 1: SUMMARY OF SCORES
FALSE POSITIVE
	
FALSE NEGATIVE
Message 01 Message 4 6
Message 08 Message 54
Message 16 Message 56
Message 18 Message 5 7
Message 19 Message 6 8
Message 24 Message 74
Message 28 Message 7 8
Message 31 Message 80
Message 32 Message 90
Message 36 Message 9 3
Message 39 Message 95
Message 4 2
TABLE 2: INCORRECTLY CLASSII hD MESSAGES (RELEVANT VS . IRRELEVANT)
Message 07
Message 10
Message 1 1
Message 1 6
Message 23
Message 3 0
Message 67
Message 77
Message 86
Message 91
Message 9 9
TABLE 3: UNPROCESSED MESSAGE S
Message 30
Message 38
Message 66
Message 83
Message 84
Message 85
Message 88
Message 92
Message 96
86
since our earlier system of categories and features had to be substantially revised and enhanced to provide all
the subcategorization and selection information required by the GB parser. The remaining 10% of the effort was
devoted to the higher level components for semantic and pragmatic interpretation at the sentence and messag e
level (i.e., the functional parse, (DBG) template generation and frames component, and back-end applicatio n
component for extracting data for filling MUC-3 templates .
Since December, approximately 10,000 new lines of commented Prolog code have been written, and are in
the process of being debugged. The lexicon developed for MUC-3 comprised almost 10,000 entries includin g
inflected forms. In comparison, the unclassified lexicon for the air activities messages recently completed fo r
Rome Laboratory [3] included only about 1500 entries, and the most recent lexicon for the Army maneuver dia-
logs is estimated at about 3,000 entries . The MUC-3 frames component includes 3485 frames and 257 classe s
of items. However, much of this information was not exploited properly since, as mentioned above, very littl e
of the MUC-3 development effort could be devoted to the higher level DBG components which utilize fram e
and slot information.
LIMITING FACTORS
In LSI's case, these included all of the items on the NOSC list of possible limiting factors, i .e., time, people ,
cpu cycles, and knowledge (interpreted as limitations on system use of available knowledge), as discussed
below.
Time/People
There were not enough person hours available to LSI under the given resource limitations to carry out th e
full effort required for MUC-3 in the time allotted, as well as to perform on other in-house contracts . Although
particular sentence types represented in the MUC-3 texts were no more complex than in previous projects suc h
as the air activities work carried out for Rome Laboratory, there was a great deal more variety in the sentenc e
types in the MUC-3 texts, and the time and resources available were substantially less than the 18 months an d
$225K dedicated to the Rome effort .
Cpu Cycle s
The main problem here is that we were essentially developing and debugging a substantial number of line s
of new code, requiring detailed tracing facilities to identify and fix bugs . In addition, a key component for
MUC3 texts is LUX (Lexical Unexpected Inputs) and its associated module WAM (Word Acquisition Module )
which deal with items not present in the DBG lexicon, either because they are misspellings of existing entries ,
or because they are new words . LUX in particular consumes a lot of cpu cycles, but is absolutely critical fo r
processing of texts containing many words which are unknown to the DBG lexicon .
In order to complete Test 2 in some finite time frame, it was therefore necessary to limit the input to th e
DBG message understanding system by utilizing some fairly draconian measures . Discourse analysis of the
development corpus revealed that detecting the transitions from descriptions of one event to another in the text s
was too complicated to attempt within the limited resources, so all messages labeled as relevant were arbitraril y
truncated to 10 sentences . Since the intent was to exclude all but event reports, which typically describe th e
event(s) of interest within the 10 sentence segment, it did not appear that much information loss would resul t
from this measure. No attempt was made to determine the number of MUC-3 templates represented in the trun-
cated text that began at the 11th sentence of all messages in the critical event directory . (For further discussion ,
see LSI's section in the paper on discourse processing within the MUC-3 context included in this proceedings .)
The selection of relevant messages, which was performed using Logicon's Message Dissemination Syste m
(LMDS), 4 was thus executed to exclude all potentially windy messages such as reports of political speeches ,
interviews, and clandestine radio broadcasts of political propaganda, as well as military attack events and dru g
traffic related events not involving terrorist acts . Critical event and irrelevant message criteria were defined i n
terms of LMDS profiles and used to filter the test message set into the groups shown in Table 2 . The LMDS
4. LMDS does shallow text analysis based on boolean combinations of key words/phrases with proximity and other criteria, which operat e
as search logic specifications within particular user-defined zones of messages (e .g ., mediasource zone, with typical contents such as
"Radio Venceremos", "Television Peruana", etc .).
	
87
filtering was performed as a pre-processing run which took less than a minute for the total test set of 100 mes-
sages. In Table 2, false positives (in terms of the MUC-3 test key) in the critical event directory and false nega-
tives in the "nohit" d irectory are indicated by Xs.
Table 3 contains numbers of the messages which were partially processed, but produced no template outpu t
because processing hung up in the parser. Per the instructions for running the test procedure, processing was res -
tarted with the next relevant message following these parser failures .
Knowledge Availability vs. System Functionalit y
The DBG system was fairly well primed with knowledge, as can be seen from the size of the lexicon and th e
frames data bases cited previously. However, because the the GB parser was not completely functional (in fact,
was still undergoing extensive debugging, as mentioned above), many attachments were not being made, result-
ing in a large number of partial trees which could not labeled with their thematic roles (i .e., agent, patient, etc.) .
The consequences of these attachment failures propagated throughout the remainder of the processing com-
ponents, resulting in predicate/argument functions unlabeled, unindexed, or missing in the functional parse, s o
that the DBG templates were extremely sparse, as were the MUC-3 application templates. Figures 2 and 3 show
partial output for Test 2 Message 100, which illustrates this point. Essentially because of the limited functional-
ity of the GB parser at this stage of development, a great deal of the knowledge represented in the frame hierar-
chy and associated rules, as well as knowledge represented in the rules for filling the MUC-3 templates, wa s
never exploited by the system .
TRAINING
It goes without saying that the development corpus was extremely useful in lexical, syntactic, semantic, an d
discourse analysis for system development . We also found the treebank analysis of MUC-3 messages very use-
ful for identifying the multitude of possible variations on a single syntactic theme . Due to the partially func-
tional status of the evolving GB parser, we were unable to fully exploit the 1300 messages in testing for this
phase of MUC-3, but were limited to a few messages that we used for regression testing. The MUC-3 corpus is
a very valuable archive that we intend to utilize more fully in the next few months, as our parser stabilizes an d
we can take advantage of the variety of texts represented in the MUC-3 collection .
MODULE MOST OVERDUE FOR REWRITIN G
Most of the components of the DBG system have been at least revised and extended, and in most cases ,
completely replaced, as part of our evolutionary design philosophy (see the system summary paper for a discus-
sion).
As noted previously, however, one of the oldest modules in the system is the Lexical Unexpected Inputs
(LUX) module, and its associated Word Acquisition Module (WAM) . These modules attempt to determine
whether an unexpected lexical input (i .e., one which is not present in the lexicon) is erroneous (e .g., a misspel-
ling of an entry actually present in the lexicon), or entirely new. In the first case, LUX goes through an ela-
borate procedure to determine whether a spelling error exists, which is corrected if a reasonable hypothesis fo r
an association with a word in the lexicon can be found (e .g., in test 1, the form "kidapped" was corrected to
"kidnapped"). If no correction can be made, the form is determined to be new, which requires WAM to provide
a temporary grammatical category assignment so that the sentence containing the new word can be parsed .
As noted previously, our lexicon included approximately 10,000 words ; however, the vocabulary in the
MUC-3 development corpus is estimated at 20,000 words (see Hirschman paper in this proceedings) . Clearly,
the LUX/WAM components were of inestimable value to us in processing the test sets ; it would have been
impossible to run without them . On the other hand, because of the many new words encountered in the MUC- 3
texts, LUX and WAM had to be used many times on every message, and because these procedures are non -
optimized at present, the amount of time devoted to autonomous LUX/WAM processing was substantial .
Another module that should be rewritten for higher efficiency is LXI, which handles lexical lookup and mor-
phological processing; however, LUX/WAM are first on the list.
88
SYNTACTIC PARSE OUTPUT
Transmission 1
	
Paragraph 1
	
Sentence 1
'Dmaxl'+(1 .0) :
Dmax(Dbar(D([the) :det)
Amax(Abar(A([brazilian) :adj) ,
Nmax (Nbar (N ( [embassy) :noun) ) ) ) ) ) ) .
'Pmaxl'+(1 .3) :
Pmax (Pbar (Pbar (P ( [in) :prep) ,
Nmax(Nbar(N((colombia] :noun)))) ,
Nmax(Nbar(N([colombia] :noun))))) .
'Imaxl' .
Imax(Ibar(I)) .
'Vmaxl' :
Vmax(Vbar(V([has] :third_pres))) .
' Vmax3' + (1.5) :
Vmax(Vbar(V((confirmed) :past) ,
Dmax(Dbar(D([the] :det) ,
Nmax (Nbar (N ([release) :noun) ,
Genmax (Genbar (Gen ((of) :of) ,
Amax (Abar(A([red] :adj) ,
Nmax (Nbar (NC (globo] :noun) ,
Nmax(Nbar(N( [journalist) :noun) ,
Nmax(Nbar(N([carlos) :noun_name) ,
Nmax (Nbar (N ( [marcelo] :noun) ) )))))))))l))))))) .
'Dmax3'+(1 .0) :
Dmax(Dbar(D([who] :pronoun))) .
'Vicemaxl'+(1 .0) :
Vicemax(Vicebar(Aux ((was) :aux) ,
Vmax(Vbar(Vbar(V([kidnapped] :pastpart) ,
Argmax(Argbar('Arg' :'*empty*'))) ,
Pmax(Pbar(P([by] :prep) ,
Amax (Abar(A([colombian] :adj) ,
Nmax(Nbar(N(['army of national liberation'] :noun) ,
Nmax(Nbar (N ( (guerrillas] :plural) ) ) ) ) ) ) ) ) ) ) ) ) .
***************************************************************************** *
SENTENCE-LEVEL SEMANTIC INTERPRETATION
functional-parse-1 :
'MAINPRED'('1 .0') = ' INDEX' (' 1 . 1' )
'DETERMINER' ( . 1 .1' ) = the
'ARG'('1.1') = brazilian
'FOREIGN _ GOVT_ FACILITY' (. 1.1' ) = embassy
' DESCRIPTION' (' 1 . 1' ) = ' INDEX' (' 1 .2' )
'COUNTRY' ( . 1.2' ) = colombia
'MAINPRED' (' 1 .2' ) = ' INDEX' (' 1 .3' )
'MAINPRED' (. 1.3' ) = 'INDEX'('1 .4' )
'PRED' (' 1 .4' ) = have
'MAINPRED' (. 1.4' ) = ' INDEX' (' 1 .5' )
'PRED'('1 .5') = confirm
'DETERMINER' ('1 .5') = the
'MATERIAL_ACT' ('1.5') = 'release of red globo journalist carlos marcelo'
'MAINPRED' (' 1 .5' ) = ' INDEX' (' 1 .6, )
'MAINPRED' (' 1 . 6' ) = who
'MAINPRED' (' 1 .6' ) _ ' INDEX' (' 1.7, )
'EVENT' (' 1 .7' ) = kidnap
'AGENT' ('1 .7')
	
'colombian army of national liberation '
'ORG'('1 .7')
	
'army of national liberation'
'AGENT' (' 1 .7' ) = guerrillas
'MAINPRED' (' 1 . 7' ) = 'INDEX'('1 .8' )
Figure 2 : Partial Syntactic and Semantic Parse Output for Message 100 (TST2)
89
MESSAGE-LEVEL SEMANTIC INTERPRETATIO N
DBG TEMPLATES
Report
	
muc3 [1]
date:
	
26 may
event :
	
[1 .1]
[1 .2]
[1 .3]
Action
	
kidnap [1 .1]
agent :
	
colombian army of national liberation
agent_org:
	
army of national liberation
Action
	
kidnap [1 .2]
Action
	
abduct [1 .3]
agent :
	
the guerrillas
patient :
	
he
***************************************************************************** *
APPLICATION OUTPUT
MUC TEMPLATES
0. MESSAGE ID
1. TEMPLATE ID
2. DATE OF INCIDENT
3. TYPE OF INCIDENT
4. CATEGORY OF INCIDENT
5. PERPETRATOR : ID OF INDIV(S )
6. PERPETRATOR : ID OF ORG(S)
7. PERPETRATOR : CONFIDENCE
8. PHYSICAL TARGET : ID(S)
9. PHYSICAL TARGET : TOTAL NUM
10. PHYSICAL TARGET : TYPE(S)
11. HUMAN TARGET: ID(S)
12. HUMAN TARGET: TOTAL NUM
13. HUMAN TARGET : TYPES)
14. TARGET: FOREIGN NATION(S)
15. INSTRUMENT : TYPE(S)
16. LOCATION OF INCIDENT
17. EFFECT ON PHYSICAL TARGET(S )
18. EFFECT ON HUMAN TARGET(S)
0 . MESSAGE ID
1. TEMPLATE ID
2. DATE OF INCIDENT
3. TYPE OF INCIDENT
4. CATEGORY OF INCIDENT
5. PERPETRATOR : ID OF INDIV(S)
6. PERPETRATOR: ID OF ORG(S)
7. PERPETRATOR: CONFIDENCE
8. PHYSICAL TARGET : ID(S)
9. PHYSICAL TARGET : TOTAL NUM
10. PHYSICAL TARGET : TYPE(S)
11. HUMAN TARGET: ID(S)
12. HUMAN TARGET: TOTAL NUM
13. HUMAN TARGET: TYPE(S)
14. TARGET: FOREIGN NATION(S )
15. INSTRUMENT : TYPE(S)
16. LOCATION OF INCIDENT
17. EFFECT ON PHYSICAL TARGET(S)
18. EFFECT ON HUMAN TARGET(S)
TST2-MUC3-0100
1
- 26 MAY
KIDNAPPING
TERRORIST ACT
"COLOMBIAN ARMY OF NATIONAL LIBERATION"
ARMY OF NATIONAL LIBERATION
REPORTED AS FACT*
*
*
TST2-MUC3-0100
2
- 26 MAY
KIDNAPPING
TERRORIST ACT
"THE GUERRILLAS"
R-EPORTED AS FACT*
*
*
"HE"
1
*
*
Figure 3 : DBG-Templates and MUC-3-Templates for Message 100 (TST2)
90
REUSABILITY
The DBG system developed for the MUC-3 application is completely reusable on other applications, with th e
exception of the rules for deriving the output MUC-3 templates from the DBG templates, which is the backen d
especially tailored for the MUC-3 application . Other than that, there are a few features such as an attribute i n
the frame system entitled "critical event", which would not be useful in another application, but there are ver y
few of these (another such feature does not even come to mind at this point).
LESSONS LEARNED
Since we have performed several MUC-like tasks (i.e., data extraction) as described in the system summary ,
as well as evaluations, the main lesson learned was not to postpone further the acquisition of an on-line diction-
ary such as Longmans or the OED . In any case, had we made such an acquisition for MUC-3, time an d
resources would have been insufficient to integrate it with the other system components and exploit it within th e
MUC-3 context.
With respect to evaluation, the evaluations performed by LSI on the systems described in [2] and [3] bot h
included competitive testing in the template-filling task against a human user or simulated user of the type o f
information in the given domain. This type of evaluation is perhaps more difficult in the MUC case, but, base d
on our experience, is extremely significant for users, because it is more believable to them than a series of finely
tuned scores .
REFERENCES
[1] Berwick, R . C., Principle-Based Parsing, AI TR No . 972, June, 1987.
[2] Montgomery, C. A., Burge, J., Holmback, H., Kuhns, J . L., Stalls, B . G., Stumberger, R ., Russel, R . L.,
The DBG Message Understanding System, in Proceedings of the Annual At Systems in Governmen t
Conference (1989), IEEE Computer Society Press, 1989 .
[3] Stalls, B ., R. Stumberger, and C . A. Montgomery (1990) . Long Range Air (LRA) Data Base Generator
(DBG). RADC-TR-89-366.
91
