SRA SOLOMON:
MUC-4 TEST RESULTS AND ANALYSIS
Chinatsu Aone, Doug McKee, Sandy Shinn, Hatte Blejer
Systems Research and Applications (SRA)
2000 15th Street North
Arlington, VA 22201
aonec@sra.com
INTRODUCTION
In this paper, we report SRA's results on the MUC-4 task and describe how we trained our natural language
processing system for MUC-4. We also report on what worked, what didn't work, and lessons learned.
Our MUC-4 system embeds the SOLOMON knowledge-based NLP shell, which is designed for both domain
independence and language independence. We are currently using SOLOMON for a Spanish and Japanese
text understanding project in a different domain. Although this was our first year participating in MUC, we
have built and are currently building other data extraction systems.
RESULTS
Our TST3 and TST4 results are shown in Figures 1 and 2. The similarity of these scores, as well as their
similarity to SRA-internal testing results, reflects the portability of SRA's MUC-4 system. In fact, our score
on the TST4 texts was better than our score on TST3, even though the TST4 texts covered a different time period
from that of the training texts and TST3.
Our matched-only precision and recall for both test sets were very high (precision/recall of 68/47 on TST3
and 73/49 on TST4). When SOLOMON recognized a MUC event, it did a very accurate and complete job of filling
the requisite templates.
SOLOMON's performance was tuned to bring the all-templates recall and precision as close together as possible,
in order to maximize the F-Measure. As shown in Figure 3, our F-Measure steadily increased over time. The fact
that this slope has not yet leveled off shows SOLOMON's potential for improvement.
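The F-Measures in Figures 1 and 2 can be recomputed from the all-templates precision and recall. The following is a minimal sketch, assuming the usual MUC-4 definition F = (b^2 + 1)PR / (b^2 P + R), where b weights recall relative to precision (b = 1 for P&R, b = 0.5 for 2P&R, b = 2 for P&2R). The function and variable names are illustrative and are not part of SOLOMON or the MUC-4 scoring program.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Combine precision and recall; beta weights recall relative to precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing was scored
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


# All-templates scores from Figures 1 and 2, as (precision, recall).
SCORES = {"TST3": (32, 27), "TST4": (31, 38)}

for name, (p, r) in SCORES.items():
    print(
        name,
        round(f_measure(p, r), 2),            # P&R: equal weight
        round(f_measure(p, r, beta=0.5), 2),  # 2P&R: precision weighted more
        round(f_measure(p, r, beta=2.0), 2),  # P&2R: recall weighted more
    )
```

Because the measure penalizes imbalance between the two inputs, bringing recall and precision closer together at a similar overall level tends to raise it, which is consistent with the tuning strategy described above.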
EFFORT SPENT
We spent a total of 9 staff months on MUC-4, from January 1, 1992 through May 31, 1992. A task-specific
breakdown of effort is shown in Figure 4. The bulk of the work was spent porting SOLOMON to
a new domain with new vocabulary, concepts, template-output format, and fill rules. Approximately 72%
of the effort was domain-dependent. However, about 63% of the total effort was language-independent, i.e.,
it would be directly applicable to understanding texts about terrorism in any language. We expect that
our English MUC-4 system could be ported to a new language in about 3 months, given a basic grammar,
lexicon, and preprocessing data similar to those that existed for English. We partially demonstrated this
                   REC  PRE  OVG  FAL
MATCHED/MISSING     27   68    8
MATCHED/SPURIOUS    47   32   57
MATCHED ONLY        47   68    8
ALL TEMPLATES       27   32   57
TEXT FILTERING      71   85   15   23

F-MEASURES   P&R: 29.29   2P&R: 30.86   P&2R: 27.87
Figure 1: TST3 Results
                   REC  PRE  OVG  FAL
MATCHED/MISSING     38   73    4
MATCHED/SPURIOUS    49   31   59
MATCHED ONLY        49   73    4
ALL TEMPLATES       38   31   59
TEXT FILTERING      91   75   25   35

F-MEASURES   P&R: 34.14   2P&R: 32.19   P&2R: 36.36
Figure 2: TST4 Results
claim at the MUC-4 demonstration session by showing our MUC-4 system processing English, Japanese, and Spanish
newspaper articles about the murder of Jesuit priests. We spent less than 2 weeks after the
final test adding MUC-specific words to the Spanish and Japanese lexicons and extending the grammars of the
two languages.
Data
40% of the total effort building MUC data was spent on lexicon and KB entry acquisition. Much of this data
was acquired automatically. We used the supplied geographical data to automatically build location lexicons
and KBs. Using the development templates, we acquired lexical and KB entries for classes of domain terms
such as human and physical targets and terrorist organizations. We automatically derived subcategorization
information for the domain verbs from the development texts (cf. [1]). These automatically acquired lexicons
and KBs did require some manual cleanup and correction.
Certain multi-word phenomena that occur frequently in texts but are unsuitable for general parsing were
handled by pattern matching during Preprocessing. For example, we created patterns for Spanish phrases,
complex location phrases, relative times, and names of political, military, and terrorist organizations.
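As an illustration of this kind of pattern-based preprocessing, the sketch below tags relative-time expressions and organization names with regular expressions so that each can be passed on as a single unit. The patterns, tag names, and the `find_phrases` function are hypothetical; the paper does not list SOLOMON's actual patterns or their formalism.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for two of the phrase classes mentioned above.
PATTERNS = [
    ("REL_TIME", re.compile(
        r"\b(?:yesterday|today|last night|\w+ (?:day|week|month)s? ago)\b",
        re.IGNORECASE)),
    ("ORG", re.compile(r"\b(?:Shining Path|FMLN)\b", re.IGNORECASE)),
]


def find_phrases(text: str) -> List[Tuple[str, str]]:
    """Return (tag, phrase) pairs for non-overlapping pattern matches, left to right."""
    spans = []
    for tag, pattern in PATTERNS:
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), tag, m.group(0)))
    # Leftmost match first; on a tie, prefer the longer match.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    result, last_end = [], 0
    for start, end, tag, phrase in spans:
        if start >= last_end:  # skip matches that overlap one already kept
            result.append((tag, phrase))
            last_end = end
    return result


print(find_phrases("The FMLN attacked the embassy two days ago."))
```

Handling such phrases before parsing keeps idiosyncratic multi-word constructions out of the general grammar, which is the motivation given above.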
Modifications to SOLOMON's broad-coverage English grammar included adding more semantic restrictions,
extending some phrase-structure rules, and improving general robustness.
Based on our knowledge engineering effort, we built a set of commonsense reasoning rules that are
described in detail in our system description. Our EXTRACT module recognizes MUC-relevant events in
the output of SOLOMON and translates them into MUC-4 filled templates. We implemented all the domain-specific
information as mapping rules or simple conversion functions (e.g., a numeric value like "at least 5"
is converted to "5-"). This data is stored in the knowledge base and is completely language-independent.
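A conversion function of the kind described could look like the sketch below. Only the mapping from "at least 5" to "5-" comes from the text; the function name, the pass-through of exact numbers, and the error on unrecognized input are assumptions made for illustration.

```python
import re

_AT_LEAST = re.compile(r"^\s*at least\s+(\d+)\s*$", re.IGNORECASE)
_EXACT = re.compile(r"^\s*(\d+)\s*$")


def convert_number_fill(phrase: str) -> str:
    """Map a numeric phrase to a template fill string."""
    m = _AT_LEAST.match(phrase)
    if m:
        return m.group(1) + "-"  # "at least 5" becomes "5-", as in the text
    m = _EXACT.match(phrase)
    if m:
        return m.group(1)  # assumption: an exact number is passed through unchanged
    raise ValueError(f"no conversion rule for {phrase!r}")


print(convert_number_fill("at least 5"))
```

Keeping such rules as small functions over an interpreted value, separate from the parser, is in keeping with the statement above that the mapping data is stored in the knowledge base and is language-independent.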
[Plot: F-Measure versus Hours of Effort (0 to 1500 hours, January 1 through May 31), with points for TST2, TST3, and TST4.]
Figure 3: Tracking SOLOMON Performance
Task Category                  % of Total Effort
DATA                                  71
  Knowledge Engineering               13
  Data Acquisition                    30
  Grammar                              7
  Pragmatic Inference Rules           11
  Extract Data                        10
PROCESSING                            29
  Message Zoning                       3
  Extract Extensions                   7
  Testing                             10
  Misc. Bug Fixing                    10
Figure 4: Breakdown of Effort Spent for MUC-4
Processing
We spent 1 week porting our existing Message Zoner to deal with message headers in MUC messages. The
Message Zoner could already recognize more general message structures such as paragraphs and sentences.
We extended EXTRACT while maintaining the domain and language independence of the module. Features
added included event merging and handling of flat MUC templates instead of the more object-oriented
database records that SOLOMON is accustomed to. Our time spent on fixing bugs was distributed throughout
the system, but problems in Debris Parsing and Debris Semantics received the most attention.
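The idea behind event merging can be sketched as follows. This is not EXTRACT's algorithm, which the paper does not describe; the slot names and the compatibility rule (two events merge when no slot filled in both has conflicting values) are assumptions chosen to keep the example small.

```python
from typing import Dict, List, Optional

Event = Dict[str, Optional[str]]  # hypothetical flat record: slot name -> fill


def compatible(a: Event, b: Event) -> bool:
    """True if no slot that is filled in both events has different values."""
    return all(
        a[k] == b[k]
        for k in a.keys() & b.keys()
        if a[k] is not None and b[k] is not None
    )


def merge(a: Event, b: Event) -> Event:
    """Copy a, then fill its empty or missing slots from b."""
    merged = dict(a)
    for slot, fill in b.items():
        if merged.get(slot) is None:
            merged[slot] = fill
    return merged


def merge_events(events: List[Event]) -> List[Event]:
    """Greedily fold each event into the first compatible earlier event."""
    result: List[Event] = []
    for event in events:
        for i, existing in enumerate(result):
            if compatible(existing, event):
                result[i] = merge(existing, event)
                break
        else:
            result.append(dict(event))
    return result


events = [
    {"type": "BOMBING", "location": "LIMA", "target": None},
    {"type": "BOMBING", "location": None, "target": "EMBASSY"},
    {"type": "KIDNAPPING", "location": "LIMA"},
]
print(merge_events(events))
```

A rule this permissive would merge distinct events that merely fail to contradict each other, so a practical system would need stricter evidence, for example from discourse analysis.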
SYSTEM TRAINING
We used the TST2 texts for blind testing and all 1300 development texts as both testing and training
material. The development set was crucial to both our automated data acquisition and our knowledge
engineering task. We performed frequent testing to track and direct our progress. To raise recall, we
focused on data acquisition; to raise precision, we focused on stricter definitions of "legal" MUC events.
To improve overall performance, we focused on more robust syntactic and semantic analysis and more
reliable event merging.
LIMITING FACTORS
The two main limiting factors were the number of development texts and templates and the amount of time
allotted for the MUC-4 effort. With more texts, we could have applied other, more data-intensive automated
acquisition techniques and had more examples of phenomena to draw upon. With more time, we would add
more domain-dependent lexical knowledge and additional pragmatic inference rules. We also need to tune
our EXTRACT mapping rules more finely and improve our discourse module for both NP reference and
event reference resolution. Integration of existing on-line resources such as machine-readable dictionaries,
the World Factbook, or WordNet would also improve system performance. A more extensive testing and
evaluation strategy at both the blackbox and glassbox levels would help direct progress, but was not feasible
in the amount of time we had.
WHAT WAS OR WAS NOT SUCCESSFUL
There were several areas where hybrid solutions worked very well. Automated knowledge acquisition
was quite successful when supplemented by manual checking and editing of domain-crucial information. Similarly,
augmenting a pure bottom-up parser with "simulated top-down parsing" (see SRA's MUC-4 System
Description) worked well. Improved Debris Semantics and significantly extended Pragmatic Inferencing were
also important contributors to the system's performance.
REUSABILITY
SRA's SOLOMON NLP system has been designed for portability and has proven to be highly reusable. This
includes portability to other domains, other languages, and other applications. As shown in Figure 5, a large
[Diagram: the MUC-4 system built on SOLOMON, distinguishing reusable components from MUC-specific ones.]
Figure 5: MUC NLP System Reusability
part of SOLOMON's data and almost all of the processing modules are completely reusable for NLP in other
domains or languages.
Currently, our Spanish and Japanese data extraction project MURASAKI is using, without modification,
the same processing modules and the core knowledge base as those used for MUC-4. The MURASAKI
system processes Spanish and Japanese language newspaper and journal articles as well as TV transcripts.
This project's domain is AIDS. Thus, the only difference between our MUC-4 system and the
MURASAKI system is that the latter uses Spanish and Japanese lexicons, patterns, and grammars, and
MURASAKI domain-dependent knowledge bases. SOLOMON has also been embedded in several English
message understanding systems: ALEXIS (operational) and WARBUCKS.
LESSONS LEARNED AND REAFFIRMED BY MUC-4
We have learned and reaffirmed the following points as the most crucial aspects of successful text understanding
for data extraction.
Overcoming the Knowledge Acquisition Bottleneck: We must develop techniques and tools for acquiring
timely, complete, and proven system data.
Solving the Parsing Problem: We need more robust, semantically constrained syntactic analysis. Grammars
must be broad-coverage and highly accurate on complex input.
Developing Sophisticated Discourse Analysis: We must handle real-world discourse phenomena found
in actual texts. The discourse architecture must be flexible enough to accommodate particular discourse
phenomena which are crucial in particular domains or languages.
MUC-4 has reaffirmed our knowledge of what is involved in porting an NLP system to a new domain.
Nine staff months is a bare minimum for such an effort. Improved knowledge acquisition tools as well as
on-line resources are desirable. To ensure good results, it is necessary to have sufficient time for knowledge
engineering, testing, and evaluation. Our experience underscores the fact that natural language understanding
is a highly data-driven problem. The system's performance is often proportional to the level of understanding
of the input and output. The MUC-4 development texts and templates were extremely helpful in this regard.
References
[1] Doug McKee and John Maloney. Using Statistics Gained from Corpora in a Knowledge-Based NLP
System. In Proceedings of The AAAI Workshop on Statistically-Based NLP Techniques, 1992.
