GE-CMU: DESCRIPTION OF THE SHOGUN SYSTEM USED FOR MUC-5
Paul S. Jacobs, George Krupka, and Lisa Rau
Information Technology Laboratory
GE Research and Development
Schenectady, NY 12301 USA
Michael L. Mauldin, Teruko Mitamura, and Tsuyoshi Kitani
Center for Machine Translation
Carnegie Mellon University
Pittsburgh, PA 15213 USA
Ira Sider and Lois Childs
Management Data Systems
Martin Marietta
Philadelphia, PA 19101 USA
Abstract
This paper describes the GE-CMU TIPSTER/SHOGUN system as configured for the TIPSTER 24-month (MUC-5) benchmark, and gives details of the system's performance on the selected Japanese and English texts. The SHOGUN system is a distillation of some of the key ideas that emerged from previous benchmarks and experiments, emphasizing a simple architecture in which the focus is on detailed corpus-based knowledge. This design allowed the project to meet its goal of achieving advances in coverage and accuracy while showing consistently good performance across languages and domains.
INTRODUCTION
The GE-CMU TIPSTER/SHOGUN system is the result of a two-year research effort, part of the ARPA-sponsored TIPSTER data extraction program. The project's main goals were: (1) to develop algorithms that would advance the state of the art in coverage and accuracy in data extraction, and (2) to demonstrate high performance across languages and domains and to develop methods for easing the adaptation of the system to new languages and domains.
The system as used in MUC-5 represents a considerable shift from those used in earlier stages of the program and in previous MUCs. The original SHOGUN design integrated several different approaches by combining different knowledge sources, such as syntax, semantics, phrasal rules, and domain knowledge, at run-time. This allowed the system to achieve a good level of performance very quickly, and made it easy to test different modules and methods; however, it proved very difficult to make all the changes necessary to improve the system, especially across languages, when system knowledge was so distributed at run-time.
As a result, the team adopted a new approach, relying heavily on finite-state approximation. This method combines several earlier lines of work, including Pereira's research on grammar approximation [4], some of the original ideas on parser compilation from Tomita [5], and GE's representation of the dynamic lexicon [3, 1]. Like Pereira's model, the system uses a finite-state grammar as a loose version of a context-free
1 This research was sponsored (in part) by the Advanced Research Projects Agency (DOD) and other government agencies. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the US Government.
grammar, under the assumption that the finite-state grammar will cover all the inputs that the general grammar would recognize, while perhaps being more tolerant. However, the system also includes methods for compiling different knowledge sources into the finite-state model, particularly emphasizing lexical knowledge and domain knowledge as reflected in a corpus.
This model, in which knowledge is combined at development time to be used by a finite-state pattern matching engine at run-time, makes it easier to tune the system to a new language or domain without sacrificing the benefit of having general linguistic and conceptual knowledge in the system.
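As an illustration of this development-time compilation, the sketch below (in Python, with invented pattern and class names; SHOGUN's actual rule notation is shown later in this paper) merges a lexical knowledge source (company suffixes) and a domain knowledge source (tie-up verbs) into one finite-state pattern that a run-time matcher applies to raw text:

```python
import re

# "Development time": separate knowledge sources are compiled into a
# single finite-state pattern. All names and patterns here are invented
# for illustration only.
COMPANY_SUFFIXES = r"(?:CO\.|CORP\.|INC\.|LTD\.)"   # lexical knowledge
TIEUP_VERBS = r"(?:SET UP|FORMED|ESTABLISHED)"      # domain knowledge

TIEUP_PATTERN = re.compile(
    rf"(?P<partner>[A-Z][A-Z ]+? {COMPANY_SUFFIXES}).*?"
    rf"(?P<verb>{TIEUP_VERBS}) A JOINT VENTURE"
)

def annotate(sentence: str):
    """Run-time: a single finite-state pass yields role annotations."""
    m = TIEUP_PATTERN.search(sentence)
    if not m:
        return None
    return {"PARTNER": m.group("partner"), "TIE-UP-ACTIVITY": m.group("verb")}

annotate("BRIDGESTONE SPORTS CO. SAID FRIDAY IT HAS SET UP A JOINT VENTURE")
```

The point of the design is that the run-time engine stays trivial; all tuning for a new language or domain happens in the compiled knowledge.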
While the GE systems, and more recently the GE-CMU systems, have done well in all the MUC evaluations, our rate of progress has never been as great as in the period before MUC-5. This is in spite of the fact that the team's diagnostic and debugging efforts had to be divided across languages and domains (handling Japanese, for example, presented a significant overhead in simply being able to follow the rules and analyze the results). We attribute this progress to the current focus on facilitating and automating the knowledge acquisition process, especially on the use of a corpus.
This paper will give a very brief overview of the configuration of the system, followed by the analysis of the examples, and some conclusions about the results.
SYSTEM OVERVIEW
The TIPSTER/SHOGUN system as configured for the 24-month/MUC-5 benchmark has roughly the same components as earlier versions of the system, but the system now performs linguistic analysis entirely using a finite-state pattern matcher, instead of LR parsing or chart-style parsing, both of which were part of the configuration in MUC-4.
Figure 1 shows the basic components of the SHOGUN system, using our own names for modules, where applicable, along with the labels used in Jerry Hobbs' paper "The Generic Information Extraction System". The core components of SHOGUN are a subset of the modules that Hobbs describes. However, the system differs from other current extraction systems in the use of the finite-state analyzer and the way that corpus-based knowledge is integrated into the lexico-syntactic rules.
[Figure 1: SHOGUN configuration in MUC-5. The diagram shows the shared components (text structure or "zoner", NLlex or "preprocessor", a statistical filter, TRUMPET as "fragment combiner", "semantic interpreter", "discourse processing", and "template generator", plus the core lexicons and grammars), with the finite-state sentence analysis of the MUC-5 system (PM1 "filter", PM2 "preparser", PM3 "parser" and "lexical disambiguation") replacing the syntactic parsing path of the MUC-4 system (TRUMP, LR parser, post-processing).]
Because many of the MUC-5 systems now perform much the same type of pre-processing, name recognition, and post-processing that SHOGUN has, we will concentrate here on linguistic analysis, including parsing and lexical disambiguation, which were the main research areas of our work on SHOGUN.
About half of the MUC-5 systems still use linguistic analysis driven by "traditional" phrase structure rules, traditional in the sense that there is a clearly separable syntactic component whose knowledge consists mainly of rules for recognizing grammatical constituents based on word categories (like noun, verb) and word order. SHOGUN differs from all these systems in that it no longer has any purely syntactic component, and uses finite-state rules in place of phrase structure rules.
The remaining systems divide roughly into those that emphasize pattern matching and those that emphasize fragment parsing. The fragment parsing systems, notably BBN's, work fairly close to the way our MUC-4 system did, taking advantage of partial parses by using a combination of syntactic and domain knowledge to guide the combination of syntactic chunks. The difference between this approach and SHOGUN's current processing is that fragment parsing is still a largely syntax-first method, while pattern matching tends to introduce specialized domain and corpus knowledge by combining this knowledge with syntactic knowledge in the system's declarative representation.
By this coarse characterization, the "pattern matching" group of systems includes, for example, SRI and Unisys as well as GE-CMU. We also consider UMass to be in this category, because their linguistic analysis emphasizes lexical and conceptual knowledge rather than constituent structure.
Among these approaches, we believe the main differentiator is not in the basic processing algorithms but in the way that knowledge ends up getting assigned to various system components. If there is one noteworthy trend among the MUC systems as they have evolved over time, it is that they have become more knowledge-based, especially emphasizing more corpus-based and lexical knowledge as well as automated knowledge acquisition methods. Within the emerging "generic" model, the main difference among systems is thus in the content of their knowledge bases. Here, the distinguishing characteristic of SHOGUN is probably the degree to which the system still includes sentence-level knowledge, assigning linguistic and conceptual roles much the way the TRUMP/TRUMPET combination did but using more detailed, lexically-driven knowledge. Many of the sentence-level rules, for example, include groupings like start a facility and organization noun phrase, which combine traditional syntactic phrases with lexical or domain knowledge.
As systems continue to become still broader in scope and more accurate, it is likely that the way knowledge is acquired will become the main differentiator.
The rest of this paper will discuss the overall results of SHOGUN on MUC-5 and describe how the system handles some of the system walkthrough examples. The analysis of the examples will highlight some of these characteristics and demonstrate the system's actions in various stages of processing.
OVERALL RESULTS
The SHOGUN system did very well on MUC-5. The team's specific goals were to achieve results on the MUC-5/TIPSTER tasks that were above the level of the simpler MUC-4 task, to attain comparable performance across languages and domains, and to reduce customization time as much as possible. In addition, the aim was to produce near-human accuracy at a throughput orders of magnitude faster than human beings. These goals seemed rather ambitious, but SHOGUN reached all of them.
The following is a summary of SHOGUN's performance on all the official metrics. We put error rate first and F-measure last in this table because these are the only ones that can be used for overall system comparison (the goal being low error rate and high F-measure).
      Error  UND  OVG  SUB  Min-err  Max-err  Text   Rec  Pre  F-meas
EJV     61    30   39   19   0.8784   0.9026  96/92   57   49   52.8
JJV     54    36   27   12   0.6624   0.6794  99/98   57   64   60.1
EME     65    37   41   19   0.8354   0.8724  95/81   50   48   49.2
JME     58    30   38   14   0.7756   0.8152  97/86   60   53   56.3

Figure 2: SHOGUN Scores for MUC-5
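For reference, the F-measure column combines the recall and precision columns; a minimal sketch of the standard formula (beta = 1 weights recall and precision equally; small differences from the table come from the rounding of the recall and precision values shown):

```python
def f_measure(recall: float, precision: float, beta: float = 1.0) -> float:
    """van Rijsbergen F-measure; beta = 1 is the balanced form used
    for the F-meas column above."""
    if recall == 0 and precision == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# e.g. recall 57, precision 49 gives a value close to the EJV row
round(f_measure(57, 49), 1)
```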
The overall results here are better, on average, than SHOGUN's scores on the MUC-4 benchmark. While it is very difficult to compare results across domains and languages, it is clear that this shows substantial progress, as the MUC-5 tasks are certainly much harder and more detailed than MUC-4. In addition, the average improvement between the TIPSTER 18-month benchmark and the current point was over 20%, and there is certainly more room for further improvement. Thus we are confident that our current methods and algorithms support continued progress toward high accuracy.
While it seems that there is substantial variation among the scores on the different language-domain pairs, this variation is reasonable given the differences among the tasks and the variations in the test samples. The EME result is worse than the others, but the EME MUC-5 test set seemed to be a very difficult one for our system. In fact, on a blind test using the same configuration, the system scored 9 error rate points better in EME than on the test reported above. We are not sure what accounts for this variability in EME, which is much greater than on the other domain-language pairs.
With respect to achieving human performance, it is not clear where good human performance falls on these scales, but we are close. At the TIPSTER 12-month test, a study of trained human analysts placed individual analysts between 70 and 80 in F-measure. However, this test used a somewhat more generous scoring algorithm than the current one (there have been a number of important changes to the scoring since the 12-month point), and did not separate the analysts' work from the preparation of the "ideal" answers. It is important in a blind test that the human subject have no impact on the answer key, because there are many texts that involve fine-grained interpretation.
The results on Japanese are, on average, somewhat higher than the English results. This is consistent with our tests. We attribute this to the fact that the Japanese tests are considerably easier than the English (a factor that is somewhat difficult to weight, given that none of our system developers know Japanese). Some of the influences that make the Japanese easier are greater homogeneity in the text sources (for example, EME includes very different sources from EJV, while JJV and JME are quite consistent in style), shorter stories with fewer distinct events in Japanese, far fewer new joint venture companies in Japanese, and an emphasis in Japanese on research and sales rather than production (production activities are more difficult to assign to codes in the template design).
In addition to the SHOGUN system, the GE-CMU team ran the Japanese benchmarks only, using a system called TEXTRACT, which was developed in parallel to SHOGUN by Tsuyoshi Kitani, a visiting researcher at CMU from NTT Data. TEXTRACT, like SHOGUN, emphasizes lexically-driven pattern matching, and the two systems share a Japanese tagging/segmentation program from NTT Data, called MAJESTY. While there is little else that is directly shared between the two systems, additions to TEXTRACT's knowledge base were incrementally adapted, in functionality, to SHOGUN's knowledge base in JJV; thus it is not surprising that the systems had similar performance on this set. TEXTRACT generally had better performance on company name recognition than SHOGUN, and a somewhat more effective method of splitting events. SHOGUN had better coverage of industry types and products (based, we think, on the heavy use of statistically-based training), and had higher recall (but lower precision) in JME.
Figure 3 shows the results of both systems on the recall/precision scale on the various MUC-5 sets.
ANALYSIS OF WALKTHROUGH MESSAGES
Overview of Examples
The examples are in many ways typical of the TIPSTER-SHOGUN system. These are relatively easy messages, but the problems the system encountered are illustrative. In the English message, the system made a few minor mistakes, some of which may even have been matters of fine-grained interpretation, and had an error rate of 15 for EJV0592. This is much better than the average message; on the whole, the EJV performance is pulled down by "tangled tie-up" messages in which the system has a great deal of difficulty determining who is doing what with whom.
JJV0002 was much harder, because it requires information to be split across two tie-ups. The system correctly determined that there were two tie-ups (which it did not do when it ran this message at the 12-month point), but it failed to recognize "Toukyou kaijou" as an alias for "Toukyou kaijou kasai hoken", and as a result ended up getting a number of aliases and entity pointers wrong. In addition, SHOGUN made the very typical mistake of almost getting the product service information but losing most of the points anyway. In this case, the Japanese text says that the tie-up will be selling a new product called "hyu-man".
[Figure 3: GE-CMU Results for MUC-5/TIPSTER 24-month benchmark. The graph plots precision against recall for the GE-CMU SHOGUN system and the GE-CMU "optional" TEXTRACT system on the EJV, JJV, EME, and JME test sets.]
SHOGUN correctly spots this and assumes that whatever "hyu-man" is will be wholesale sales with code 50. The analyst infers from the context that "hyu-man" is an insurance product, so the actual industry type is "finance" rather than "sales". Finally, the answer key contains an error in the string fill, so SHOGUN gets scored completely wrong on this object.
We emphasize these minor mistakes because they help to show, for one thing, how hard it is to get extremely high accuracy, and, for another, the relative effects of easy and hard objects. SHOGUN was, by far, the most accurate system in determining industry information, probably because our efforts on automated knowledge acquisition used this object as a test case for both English and Japanese. However, the net effect of the industry object in SHOGUN was a reduction in error of .2 in English and 1.2 in Japanese over what the system would have produced by leaving the product service slot blank. This is because potentially spurious information on hard objects and slots dilutes the good scores produced on the easier objects and slots. Hence it is very difficult to show improvement by getting more information; the easiest improvements are to get higher and higher performance on the "critical" slots and objects.
In addition, the system made many technical errors with the location and alias slots, some of which are illustrated here. Often these were due to bugs, but there are many other problems. The location slot(s) proved much more difficult than expected, because many forms of subtle inference often affect location information, such as inferring that one site subsumes another or inferring location by process of elimination (especially in Japanese).
We will now show, very briefly, the results of each stage in SHOGUN's processing of the EJV and JJV examples.
Pre-processing
Pre-processing identifies names, dates, locations, and other special phrases, and handles certain morphological rules in Japanese. For example, the following gives some of the results of pre-processing on one sentence from each example:
EJV0592 Sentence 0:
[CNAME{1}: BRIDGESTONE SPORTS CO.] SAID FRIDAY IT HAS SET UP A JOINT VENTURE
[IGNORE{41}: IN TAIWAN] WITH A LOCAL CONCERN AND A JAPANESE TRADING HOUSE
TO PRODUCE GOLF CLUBS TO BE SHIPPED TO JAPAN.
JJV0002 Sentence 0:
[CNAME{24}: V _Ek ` M ] 4A 73, 6* U IE L [MORPH{8}: LZ ] L ` ~~t lie 1 [MORPH{4}: {eiJc ] flJpa f x — v j [MORPH{5}: EBL fc ]
Where a company name is marked in pre-processing, this means that the name is "learned" rather than recognized as a known name. In JJV0002, Daiwashouken is a known name, so it is not marked above.
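The known/learned distinction amounts to a lexicon lookup; a minimal sketch, with invented names (the English rendering of the known company is hypothetical):

```python
# Illustrative only: names found by pre-processing patterns but absent
# from the core lexicon are marked as "learned".
KNOWN_COMPANIES = {"DAIWA SECURITIES"}   # assumed core lexicon entry

def classify_name(name: str) -> str:
    """Return whether a recognized company name came from the lexicon
    or was newly learned from the text."""
    return "known" if name in KNOWN_COMPANIES else "learned"

classify_name("BRIDGESTONE SPORTS CO.")   # a new name is "learned"
classify_name("DAIWA SECURITIES")         # a lexicon entry is "known"
```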
Linguistic analysis
Linguistic analysis uses the same pattern matcher and the same knowledge base notation as pre-processing, but relies on a mixture of syntactic and lexical information to perform sentence-level interpretation. For example, the following is one rule for marking verb phrases with activity information in English:
44: :
; ; JV ACTIVITY-VP
; ; ACTIVITY
{< ?START-TIME=*date* * >}
[ $jventure
?ENTITY=(and org-name (not *partner* *venture*) )
< ?VENTURE=*venture* {< (member *apostrophe-s* *apostrophe*) rights >} >
< (and *venture-org-np* (not $ventureterm)) {$loc} >
it
$facilityphr ]
{$np-postmod}
{which}
{$helperphr}
$verb-premod*
{to}
[ ?ACTIVITY-TEXT=
< ?TIE-UP-ACTIVITY=$actverb
< *comma* ?TIE-UP-ACTIVITY=$actverb >*
{< {*comma*} and ?TIE-UP-ACTIVITY=$actverb>}
$ps-text-list >
< ?ACTIVITY-TEXT=< ?TIE-UP-ACTIVITY=$actverb $ps-text-list >
?ACTIVITY-TEXT=< *comma* ?TIE-UP-ACTIVITY=$actverb $ps-text-list >*
{*comma*} and
?ACTIVITY-TEXT=< ?TIE-UP-ACTIVITY=$actverb $ps-text-list > > ]
{< (not *date*)* $loc >}
{< (not *date*)* ?START-TIME=*date* {$loc} >}
<=> (mark-jv-activator c-joint-venture-template) ;
In linguistic analysis, the pattern matcher annotates the text, much as it does during pre-processing, but these annotations can be very close to the roles that portions of text will play in the template. For example, where pre-processing finds company names and organization descriptions, sentence analysis will often find partners and ventures.
The following are examples of this analysis from the walkthrough messages:
EJV0592 Sentence 0:
[C-JOINT-VENTURE-TEMPLATE{45,44,16,2,0} ?CONJ=<
?ENTITY=?PARTNER=?HEAD=BRIDGESTONE SPORTS CO. SAID FRIDAY IT ?HEAD=HAS
?HEAD=SET UP [C-JOINT-VENTURE-TEMPLATE{36,13} ?HEAD=A ?HEAD=JOINT
VENTURE IN ?LOCATION=TAIWAN WITH A LOCAL ?PARTNER=CONCERN AND A
JAPANESE ?ACTIVITY-TEXT=< ?TIE-UP-ACTIVITY=TRADING
?TIE-UP-PRODSERV=?PS-TEXT=?PARTNER=HOUSE >=?ACTIVITY-TEXT >=?CONJ
{45}] ?ACTIVITY-TEXT=< TO ?ACTIVITY-TEXT=<
?TIE-UP-ACTIVITY=?HEAD=PRODUCE ?PS-TEXT=< GOLF
?TIE-UP-PRODSERV=?TIE-UP-ACTIVITY=CLUBS >=?PS-TEXT >=?ACTIVITY-TEXT
{44,36,16,13,2,0}] TO BE SHIPPED TO JAPAN.
JJV0002 Sentence 1:
[C-JOINT-VENTURE-TEMPLATE{12,0} ?PARTNER=?HEAD= pgk( E , ?PARTNER= MsTtI XiL orME ?PARTNER=MIv"C
[C-JOINT-VENTURE-TEMPLATE{9} ?PS-TEXT=< ?ACTIVITY-TEXT=< ?HEAD= *mil spa >=?PS-TEXT
< ?TIE-UP-ACTIVITY= ESYj 6 >=?ACTIVITY-TEXT {12,9}] MAW.
[C-JOINT-VENTURE-TEMPLATE{1} ?PARTNER=?HEAD=* o {1,0}]
[C-JOINT-VENTURE-TEMPLATE{8} ?HEAD=Onn 5 < It ?PS-TEXT=—a {8}]
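The role-assigning behavior of such annotations can be approximated with a toy finite-state pattern. The sketch below (Python regular expression with invented lexical classes; SHOGUN's real rules, like the one shown earlier, are far richer) marks a location and a partner directly in the match:

```python
import re

# Invented lexical classes standing in for rule categories like $loc:
COUNTRIES = r"(?:TAIWAN|JAPAN|KOREA)"
VENTURE_NP = r"A JOINT VENTURE"

# A single finite-state rule whose named groups are template roles.
RULE = re.compile(
    rf"{VENTURE_NP}(?: IN (?P<location>{COUNTRIES}))?"
    rf"(?: WITH A LOCAL (?P<partner>[A-Z]+))?"
)

m = RULE.search("A JOINT VENTURE IN TAIWAN WITH A LOCAL CONCERN")
annotations = {"LOCATION": m.group("location"), "PARTNER": m.group("partner")}
```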
Each set of annotations from sentence-level analysis goes through semantic interpretation, top-down analysis (using TRUMPET), and discourse processing, just as full parses and fragment parses were used in TRUMP and the LR parser. The input to TRUMPET now, however, is a set of annotations instead of full or partial syntactic trees.
Calling Trumpet with SENSE Interpretation:
(C-JOINT-VENTURE-TEMPLATE (R-TIE-UP-ACTIVITY (PRODUCE))
 (R-LOCATION (TAIWAN (R-NAME TAIWAN)))
 (R-PARTNER
  (CNAME_BRIDGESTONE-SPORTS-001 (R-NAME BRIDGESTONE-SPORTS-CO)
   (R-PART (C-ENTITY))))
 (R-PARTNER (CONCERN)) (R-PARTNER (HOUSE)))

Calling Trumpet with SENSE Interpretation:
(C-CAP-TEMPLATE
 (R-CAP
  (C-MONEY (R-QUANTITY (C-NUMBER (R-VALUE 1200000001))) (R-UNIT (DOLLAR))))
 (R-OWNED
  (CNAME_BRIDGESTONE-SPORTS-TAIWAN-001 (R-NAME BRIDGESTONE-SPORTS-TAIWAN-CO)
   (R-PART (C-ENTITY)))))
;; Top-down processing
Linking (special) C-CAP-TEMPLATE as filler for R-OWNERSHIP of C-JOINT-VENTURE-TEMPLATE
Creating objects in sentence 3 for C-OWN-PERCENT-TEMPLATE marker{17} with
(?OWNER ?PERCENT) variables

Calling Trumpet with SENSE Interpretation:
(C-OWN-PERCENT-TEMPLATE
 (R-OWNER (CNAME_TAGA-CO1 (R-NAME TAGA-CO) (R-PART (C-ENTITY))))
 (R-PERCENT (REMAINDER)))
TRUMPET then takes these pieces of semantic interpretation and tries to map them onto a final template, applying domain constraints, reference resolution, and heuristics for merging and splitting information from multiple sentences and paragraphs.
Discourse Processing
Before producing the final template, SHOGUN must take all the references to objects and events and try to resolve them. Often the resolution of object references affects the resolution of event references, because the objects become the only tie-in from one description of an event to the next.
The discourse processing knowledge of the system is considerably more developed in English than in Japanese. This is a case where it was difficult to do all the experiments we would have liked, because our developers were not bilingual, and discourse cues in Japanese are often fairly subtle.
In EJV0592, the system correctly resolves most of the event and object references, but still does badly on the location and activity site slots because it assumes that the location of the joint venture company is the location of the production activity, and it fails to guess that "Kaohsiung" is in Taiwan. In addition, there is a very subtle inference here that the production of clubs in Japan is not an additional location for the production of clubs by the Taiwan unit; SHOGUN treats both Japan and Taiwan as production bases.
;; Removing nations (TAIWAN) which conflict with the organizations
;; Replacing R-VENTURE references (COMPANY) with (CNAME_BRIDGESTONE-SPORTS-TAIWAN-001)
;; Removing references (CONCERN HOUSE) from R-PARTNER (CNAME_TAGA-CO1
   CNAME_UNION-PRECISION-CASTING-CO1 CNAME_BRIDGESTONE-SPORTS-001)
Creating ACTIVITY template with 1 industries
Resolving "THE TAIWAN UNIT" to PARTNER CNAME_UNION-PRECISION-CASTING-CO1
  for nationality "Taiwan (COUNTRY)" using location
Resolving "THE JAPANESE SPORTS GOODS MAKER" to PARTNER
  CNAME_BRIDGESTONE-SPORTS-CO1 for nationality "Japan (COUNTRY)"
TEXTRACT "OPTIONAL" SYSTEM
In order to process Japanese, the SHOGUN system uses a morphological analyzer called MAJESTY developed at NTT Data. As part of our early efforts in the Joint Venture domain, Tsuyoshi Kitani of NTT Data (who was then a visiting scientist at Carnegie Mellon) wrote several AWK scripts to identify Japanese company names in the segmented output. Later, rules for identifying other kinds of text fields, including proper names, locations, numbers, and times, were added. This year, he has extended this set of finite-state rules and augmented it with other modules to perform the entire TIPSTER task on Japanese texts. For the MUC-5 evaluation, we have submitted TEXTRACT's results on the JJV and JME texts as optional scores. These were officially scored by the government, and the results appear in the table.
      Error  UND  OVG  SUB  Min-err  Max-err  Text   Rec  Pre  F-meas
JJV   49.99   32   23   12   0.5877   0.6028  99/99   60   68   63.84
JME   58.64   43   28   12   0.6728   0.7072  96/85   51   63   56.35

Figure 4: Official TEXTRACT Scores for MUC-5
TEXTRACT: Overview
TEXTRACT comprises four major components: preprocessing, pattern matching, discourse processing, and template generation. Although only the first of these modules is shared with the SHOGUN system, both systems share the basic method of using finite-state pattern matching instead of full natural language parsing.
In the preprocessor, Japanese text is segmented into primitive words, which are tagged with parts of speech by a Japanese segmenter called MAJESTY. Then, proper nouns and monetary, numeric, and temporal expressions are identified by the proper noun recognizer. The resulting segments are grouped together to provide meaningful units to the succeeding processes [2]. The pattern matcher searches each sentence for all expressions of interest that match defined patterns, such as tie-up relationships and economic activities. In the discourse processor, company names are identified uniquely throughout a text, allowing recognition of company relationships and correct merging of information within a text. Finally, the template generator puts the extracted information together to create the required template format.
The JJV configuration of TEXTRACT has been under development since the Spring, and during the TIPSTER 18-month evaluation it achieved a recall of 29 and a precision of 70 (for an F-measure of 40.9). With 5 months of additional work, TEXTRACT now has a recall of 60 and a precision of 68, giving an F-measure of 63.8.
The JJV TEXTRACT system was ported to the microelectronics domain in three weeks by one person. This was possible because most of the system's modules were shared across both domains (and because identifying company names is a key element of performance in both domains). Most of the development time was spent identifying key expressions from the corpus. The JME configuration of the TEXTRACT system performed about the same as the base SHOGUN system on JME, but had higher precision compared to the higher recall of SHOGUN.
Our experience with TEXTRACT confirms that finite-state pattern matching allows very rapid development of high-performance text extraction for new domains.
TEXTRACT: Company name identification throughout a text
Unifying multiple references to the same company throughout a text is key to achieving high performance in the joint venture template structure. A notion of "topic companies," the companies that are the main concern in a sentence, was introduced. Topic companies are identified where subject case markers such as "が" (ga) and "は" (wa) appear. When a subject is missing in a sentence, which is often the case in Japanese, the subject is assumed to be the topic companies carried over from the previous sentence.
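A minimal sketch of this topic-carrying behavior, assuming each sentence has already been reduced to (company, particle) pairs by the preprocessor (company names here are invented stand-ins):

```python
# Japanese subject case markers (ga, wa).
SUBJECT_MARKERS = ("が", "は")

def resolve_topics(sentences):
    """sentences: per sentence, a list of (company, particle) pairs.
    A sentence with explicitly marked subjects sets the topic; a
    subjectless sentence inherits the previous topic companies."""
    topics, current = [], []
    for pairs in sentences:
        subjects = [c for c, p in pairs if p in SUBJECT_MARKERS]
        if subjects:
            current = subjects
        topics.append(list(current))
    return topics

# The empty (subjectless) second sentence inherits the first topic.
resolve_topics([[("TokioMarine", "は")], [],
                [("NisshinFire", "が"), ("SomePartner", "と")]])
```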
Company aliases are identified by applying a substring matching algorithm called the longest common subsequence (LCS) method. References of three kinds of company name pronouns, "同社" (dousha; the company), "自社" (jisha; the company itself), and "両社" (ryousha; both companies), are also identified using the topic companies and some heuristic rules.
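The LCS computation itself is standard dynamic programming; a sketch of alias detection under an assumed threshold on the fraction of the shorter name that must appear, in order, within the full name:

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def is_alias(short: str, full: str, threshold: float = 0.9) -> bool:
    """Treat `short` as an alias of `full` when nearly all of its
    characters appear in order in the full name (threshold assumed
    for illustration; TEXTRACT's actual criterion may differ)."""
    return short != full and lcs_len(short, full) / len(short) >= threshold

# Romanized stand-ins for the walkthrough's company names:
is_alias("toukyou kaijou", "toukyou kaijou kasai hoken")
```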
Every company name in the text, including company aliases and pronouns, is given a unique number by the discourse process. Using the topic companies and the unique numbers, individual pieces of information identified by the preprocessor and the pattern matcher are merged together to generate the relevant template structure.
TEXTRACT: Analysis of a Walkthrough Message
In JJV0002, all five entities were correctly identified by the preprocessor. The pattern matcher also recognized the two tie-ups correctly, although the pattern selected from four matched patterns was incorrect in Sentence 2, as shown in the traces below. TEXTRACT found only one tie-up in Sentence 2, because the current design cannot identify multiple tie-ups in a single sentence.
Sentence no. = 1
@CNAME_PARTNER_SUBJ: 0.97: = _~~~kkt
defined = ti
@CNAME_PARTNER_WITH: 0.94: = 36MM*
defined =
@SKIP =
defined = J , LZ L moo) >G i'ifi A.tcgOal re .—Q'j wLt~ o
Sentence no. = 2
@CNAME_PARTNER_WITH1: 0.50: = t-cOAX _ E , n Aj$ _ o')f J$:E6 IJI—u1
defined =
@CNAME_PARTNER_WITH2: 0.50: = Iv t`' ~ fl inp < , Mrct Z-c . A 5goUlfA#±
defined =
defined = 0)
@SKIP =
defined = Oar-i< h f — X Lf i~~i Z1nxa o
0002 ryosha = rJfkk#_H, category = 1, distance = 3
0002 ryosha = Qgfkk~, category = 2, distance = 6
An alias "東京海上" (Toukyou kaijou) was found by the LCS method as a substring of the entity name "東京海上火災保険" (Toukyou Kaijou Kasai Hoken). References of "両社" (ryousha; both companies) were correctly resolved as "日新火災海上保険" (Nisshin Kasai Kaijou Hoken) and "同和火災海上保険" (Douwa Kasai Kaijou Hoken). After the discourse processing, entities were given unique numbers (unique_id) as follows:
gid = 1, unique_id = 1, partner 1, string = WTP-:0±AXfOl
gid = 4, unique_id = 4, partner 1, string = [1'u
gid = 5, unique_id = 5, partner 2, string = bIVAMN.E
gid = 7, unique_id = 7, partner 2, string = n AXitE
gid = 9, unique_id = 9, partner 2, string = W—Mlf
gid = 12, unique_id = 1, partner 0, string = A-A E
gid = 14, unique_id = 1, partner 0, string = W. E
gid = 15, unique_id = 4, partner 0, string = Mug*
The industry objects and the product service slot were completely wrong for the following reasons: (1) TEXTRACT did not find the Product/Service1 string, and (2) although it did spot the Product/Service2 string, it gave a wrong pointer to Activity1 due to a system bug. Another observation regarding the industry object was that TEXTRACT gave the industry type "sales" with SIC 50 to Product/Service2, as the SHOGUN system did.
COMBINING SYSTEMS: SHOGUN + TEXTRACT
For the Japanese Microelectronics domain, the SHOGUN system scored the highest recall, while the TEXTRACT system scored the highest precision. The F-measure and error scores were almost exactly the same. We developed a statistical technique to combine these systems in a way that improves the F-measure, and as a by-product we determined the theoretical limits of combining the output of the two systems.
The combining algorithm works as follows: both SHOGUN and TEXTRACT are run on an input text, and the output templates are given as input to the combiner. The following methods were examined:
SHOGUN this row just shows the scores for the SHOGUN system.
TEXTRACT this row shows the scores for the TEXTRACT system.
Theoretical max this row shows the scores for a system which chooses perfectly whether SHOGUN or TEXTRACT has the better answer for a particular text.
Entity weight D=T this row shows the results of using total entity weight to select the output template, using TEXTRACT output in case of ties.
Entity weight D=S same as above, but uses SHOGUN output to break ties.
Most names D=S this method chooses the output template with the most entity names.
Avg Entity weight D=T similar to entity weight, but the average is used instead of the total weight.
SHO + TEX this method uses SHOGUN's output unless it is empty, in which case TEXTRACT's output is used.
TEX + SHO this method uses TEXTRACT's output unless it is empty, in which case SHOGUN's output is used.
Avg Entity weight D=S average entity weight with SHOGUN output used in case of a tie.
Single capability D=T this method chooses the output with the number of capabilities closest to one, and chooses TEXTRACT's output in case of a tie.
Method                   Recall    Precision  F-Measure
SHOGUN                   60.0306   53.0254    56.3110
TEXTRACT                 50.6498   63.4988    56.3511
Theoretical max          61.0330   63.4371    62.2118
Entity weight D=T        56.0208   58.9768    57.4608
Entity weight D=S        60.1467   53.1031    56.4058
Most names D=S           61.7665   51.5824    56.2170
Avg Entity weight D=T    53.8203   58.7784    56.1902
SHO + TEX                60.7034   51.4782    55.7115
TEX + SHO                52.3476   58.6007    55.2979
Avg Entity weight D=S    55.2294   55.0946    55.1619
Single capability D=T    53.0257   57.0724    54.9747

Figure 5: Combining Two MUC-5 Systems: Table
Figure 5 gives the numeric values for the various combining methods, and Figure 6 shows the recall-precision performance of each method graphically.

[Figure 6: Combining Two MUC-5 Systems: Graph. The graph plots precision against recall for SHOGUN, TEXTRACT, the theoretical maximum, and each of the combining methods listed above.]
Note that the best performing method was the total entity weight, which used statistics from the development corpus for the entity-name slot to determine which output template had more commonly found company names. Intuitively, if the output template had more companies that were associated with correct keys from the development corpus, that template is more likely to be correct. Note also that no knowledge-free combining method gave a better F-measure than either of the two systems alone.
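A sketch of the total-entity-weight selection (the "Entity weight D=T" variant), with invented names and weights standing in for the development-corpus statistics:

```python
# Hypothetical corpus-derived weights: how often each company name was
# associated with a correct key in the development corpus.
CORPUS_WEIGHTS = {"NisshinFire": 3.0, "DouwaFire": 2.0, "UnknownCo": 0.5}

def entity_weight(template_names):
    """Total weight of the company names in one output template."""
    return sum(CORPUS_WEIGHTS.get(n, 0.0) for n in template_names)

def combine(shogun_names, textract_names):
    """Pick the system whose template carries more corpus-supported
    entity weight; ties default to TEXTRACT (the D=T variant)."""
    s, t = entity_weight(shogun_names), entity_weight(textract_names)
    return "SHOGUN" if s > t else "TEXTRACT"

combine(["NisshinFire", "UnknownCo"], ["NisshinFire", "DouwaFire"])
```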
SUMMARY AND CONCLUSION
The examples and the analysis here are illustrative of the performance of the TIPSTER/SHOGUN system on MUC-5. While the system has done well and continued to improve significantly, there are still quite a number of problems that could be fixed to achieve better accuracy. On the other hand, the steady improvement of the system and the high performance across languages are very gratifying, and the fact that we already seem close to human performance seems to bode well for the deployment of this technology.
While research up to this point has emphasized interpretation and control issues, we see corpus analysis and knowledge acquisition algorithms as being the key topics for further research and further progress. In this way, MUC-5 may represent a turning point from matters of structure to matters of scale, with most of the necessary work on this type of task being broadening scope and scale. At the same time, we expect that simple but very challenging tasks will emerge that test some of the key algorithms that are required for data extraction.
References
[1] P. S. Jacobs and L. F. Rau. Innovations in text interpretation. Artificial Intelligence (Special Issue on Natural Language Processing), 48, To Appear 1993.
[2] T. Kitani and T. Mitamura. A Japanese Preprocessor for Syntactic and Semantic Parsing. In Ninth IEEE Conference on Artificial Intelligence for Applications. IEEE, 1993.
[3] Susan McRoy. Using multiple knowledge sources for word sense discrimination. Computational Linguistics, 18(1), March 1992.
[4] Fernando Pereira. Finite-state approximations of grammars. In DARPA Speech and Natural Language Workshop, pages 20-25, Hidden Valley, PA, 1990.
[5] M. Tomita. Efficient Parsing for Natural Language. Kluwer Academic Publishers, Hingham, Massachusetts, 1986.
