PATTERN MATCHING IN THE TEXTRACT 
INFORMATION EXTRACTION SYSTEM 
Tsuyoshi Kitani t Yoshio \]:',riguchi t* Masami Ilara)* 
Center for Ma.chine Translation 
Ca.rnegie Mellon University 
Pittsburgh, PA 15213 
Abstract 
In information extraction systems, pattern 
marchers are widely used to identi~q infof 
mat|on of interest in a scntcncc. In this 
paper, pattern matching in the Tt:XTRACT 
information extraction system is described. 
It comprises a conccpt search which |dent|- 
tics key words representing a concept, and a 
template pattern search which identifies pat- 
terns of words and phrases. TI'JXTI~,A(;T 
using thc matcher performed wcll in the 
:I'IPSTER/MUC-5 evahtation. Thc pattern 
matching architecture is also suitable \]br rapid 
system development across different domains 
of the same language. 
1 INTRODUCTION 
In information extraction systems, finite- 
state pattern matchers are becoming popular 
as a means of identifying individu~d pieces of 
information in a sentence. Pattern matching 
systems for English texts are reported to be 
suits,hie tor achieving a high level of per\[br- 
mance with less effort, compared to full pars- 
ing architectures\[Hobbs et al. 92\]. Among 
seventeen systems presented in the Fifth 
Message Understanding Conference (MUC-5), 
three systems used a pattern marcher a.s the 
main component for identifying patterns to be 
extracted \[MUC-5 93\]. A pattern matching ar- 
chitecture is appropriate for information ex- 
traction fi-om texts in narrow domains since 
identifying informatkm does not necessarily re- 
quire full understa.nding of the text. The pat- 
tern matcher can extract information of inter- 
tVisiting rese;trcher fron/ NTT Data Communica- 
tions Systems Corp., email: tkitani~.rd.nttdM.a.jp 
HNTT Data Communiclttions Systems Corp. 
est by locating speciIic expressions defined a.s 
key words and phrasal patterns obtained by 
coq)us analysis. 
This paper describes a pattern matching 
method that first identifies concepts in a seu- 
tence and then links critical pieces of informa- 
tion that map to a p~ttern. The first step in 
pattern ln~tching is a concept searvh applied 
in the TI';XTRACT system of the TIPSTER 
Japanese microelectronics and corporate .joint 
ventures domains{aacobs 93a\], \[aacobs 93b\]. 
In this step, key words representing a concept 
are searched for within a sentence. The second 
step is a. template pattern sea~rh applied in the 
TEXTRACT joint ventures system. A com- 
plex pattern to be searched for usually con- 
sists of a few words and phrases, inste~d of just 
one word, as in the concept search. The tem- 
plate pattern search recognizes relationships 
between matched objects in the defined pat- 
tern a.s well a.s recognizing the. concept itself. 
l,'rom the viewpoints of system perfof 
mance and portalfility across domains, the 
TIPS'I.'I~;II/MUC-5 evaluatioll J'esults suggest 
that pattern nta.tching described in this paper 
is all appropriate architecture for information 
extraction from ,lapanese texts. 
2 TIPSTER/MUC-5 OVERVIEW 
The goal of the TI1)STER/MUC-5 project 
sponsored by ARPA is to capture informa- 
tion of interest from English and .lal)aJmse 
newspal)er artMes about microelectronics and 
corpora.re joint ventures. 1 A system must 
fill a. generic template with information taken 
1Several Al{PA-sponsored sites f¢)rmed tile TIP- 
ST\]'21/informal|on extraclion project. "\['he TIPSTER 
sites and other non-sponsored organizations partici- 
pated in MUC-5. 
1064 
fronl the text its. a fully automated fa.sh- 
ion. The template is composed of several 
objects, each containing severM slots. Slots 
may have pointers as va.lues, where pointers 
link related ot)jects. Extracted information 
is expected to be stored in an object-oriented 
database \[TIPSTER 92\]. 
In the microelectronics domain, information 
about four specific processes in seiniconduc- 
tot manufacturing for microchip fabrication 
is captured. They are layering, lithography, 
etching, aaM packaging processes. \],ayering, 
lithography, and etching a.re wafer fa.brical:ion 
processes; packaging is part of tile last stage of 
manufacturing. Entities such as manufactu rer, 
distributor, and user, in addition to detailed 
manufacturing information such a.s materials 
used and the microchip specifications such as 
wafer size and device speed are also extra.cted 
in each process. 
The joint ventures domain focuses on ex 
tracting entities, i.e. organizations, forming 
or dissolving joint venture relationshil)s. The 
information to l)e extracted includes entity in- 
formation such as location, na.tionality, per- 
sonnel, and facilities, and joint venture infor- 
mation such as rela.tionshii)s, 1)usiness a.ctivi- 
ties, capital, and estimated revenue of the joint 
ventttre. 
3 TEXTRACT ARCHITECTURE 
TEXTI/ACT is an informati(m extraction 
system developed as an optiona.l system of 
the GE-CMU SHOGUN system \[Jacobs 93a\], 
\[aacobs93b\]. it processes the TIPSTFI{. 
Japanese domains of microelectronics and col 
porate joint ventures. The :I'I';XTllA(VI ~ mi- 
croelectronics system comprises three major 
components: prel)rocessing ~ conceltt search, 
and template generatiol|. In ad(lition to (:on. 
celtt search, the "FI!;XrI'IIACT joint ventures 
system perfbrms a templ~te pattern search. \[t 
is also equipped with a discourse processor, as 
shown in Fig. 1. 
In the preprocessor, Japanese text is seg- 
mented into primitive: words tagged with their 
t)arts of speech by a Japanese segmentor 
called MAJESTY\[Kitani and Mitamura 93\], 
\[Kitani 91\]. Then, proper norms, along with 
monetary, nulneric, and temporal expressions 
Pre- I.... ~ processing 
morphological 
analysis 
• name 
recognition 
Lt 
........ Ra'te m ....... I 
i ,::::.:: :' ~:": :r~:: :: ::::::::::::::::::::::: ::::::::::: 
i :search:: :: i 
- concept - concept l 
identification ldentificqtion I 
- information l 
merging in a I 
sentence / 
Discourse ~ Template processing I \[ generation 
-information . output 
merging generation 
In :t text 
Fig. 1: Architecture of the q'EXTI{.A(~T joint 
ventures sysl, eln 
are. identified I)y the name recognition module. 
Tim segments are g,'Oul)e(l into units which 
a.re meaningfill in (.It(; l)attern ma.tching pro- 
cess\[Kitani and Mitamura 94\]. Most strings 
to be extracted dire.ctly Dora the text at'(.' iden- 
tiffed by MAJESTY and the name recognizer 
in the l)reprocessor. 
The con(;ept search and template pattern 
search rood u les both identi\['31 concepts in a set,- 
tence. The template pattern sear¢:h also rec- 
ognizes relationshil)s within the identified in- 
f'ornuttion in the matched pattern. Details of 
the l)attern matching process are described in 
the next section. 
The discourse processor links information 
identified a.t different stages o\[" processing. 
l"irst, implicit subjects, often use<\[ in Japanese 
sentences, are inherited fronl previous sen- 
tences, and set'oil(l, company ltatlles are givell 
iltliqlte ltunlbers necessary to accurately rec- 
ogMze company relationships throughout tile 
text\[Kitani 94\]. Concepts identified during 
tile pattern matching process are used to se- 
lect an approt)ria.te string and filler' to go into ~ 
slot. \]?inally, l.he template generation pro(:ess 
assembles the extracted information necessary 
to creat(.~ the OUtl)nl; descril)ed in Secl, iou 2. 
1065 
4 PATTERN MATCHING IN 
TEXTRACT 
4.1 Concept search 
Key words representing the same concept 
are grouped into a list and used to recognize 
the concept in a sentence. The list is written in 
a simple format: (concept-name wordl word2 
...). For example, key words tbr recognizing a 
dissolved joint vent u re con cept can be written 
in the following way: 
(DISSOLVED ~-~j~-~ ~-.~: ~) 
or 
(DISSOLVED dissolve terminate cancel). 
The concept search module recognizes the con- 
cept when a word in the fist exists in tile sen- 
tence. Using such a simple word list some- 
tilnes generates an incorrect concept. For ex- 
ample, a dissolved concept is erroneously iden- 
tiffed fl'om an expression "cancel a hotel reser- 
vation". IIowever, when processing text in a. 
narrow <lomain, concepts are often i<lentiffe<t 
correctly fi'om the simple list, since key words 
are usually used in a particular meaning of in- 
terest in the domain. 
During the Ja.panese segmentation process 
in the preprocessor, a key word in the text 
tends to be divided into a few separate words 
by MAJESTY, when the word is not stored 
in the dictionary, For example, the compound 
noun "~jJ'~f~" consists of two words, "i~3 '' 
(joint venture) and ":r#~b1" (dissolve). It is seg- 
mented into the two individual nouns using the 
current MAJESTY dictionary. Thus, when 
the compound word "~.-~jlf(t'-fb\] '' is searched for 
in the segmented sentence, the concept search 
fails to identify it. To avoid this segmentation 
problein, adjacent nouns are automatically put 
together during the concept search process. 
This process al\]ows, by defanlt, partial word 
matching between a key word and a word in 
the text. Therefore, "~.-~j" and "~05 {~J" 
both meaning "a .joint venture" can be identi- 
tied by a single key word "~.'I~,Y'. Ilowever, 
due to the nature of partial matching, the 
key word "-.:/~ = :/" (Silicon) matches "-iL~. 
4\[:: "5+ ~ = :/" (Silicon dioxide), which is a dif- 
ferent type of ffhn reported in the microelec- 
tronies domain. This undesirable behavior can 
be avoided by attaching ">" to the begin- 
ning or "<" to the end of key words. Thus, 
,,> .3.1) :~ :/ <,, tells the matcher that it re- 
quires an exact word matching against a word 
in the text. 
4.2 Template pattern search 
4.2.1 Template pattern matcher 
The teml>late pattern matcher identifies 
typical expressions to be extracted from the 
text that frequently aplmar in the corpus. The 
patterns are defined as pa.ttern matching rules 
using regular expressions. 
The pattern matcher is a ffnite-state au- 
tomaton sinfilar to the pattern recognizer use.d 
in the MUC-4 FASTUS system developed at 
SRI \[I\[obbs et al. 92\]. /n TEXTRACT, state 
transitions arc driven by segmented words or 
grouped units fi'om the prei)rocessor. The 
matcher identifies all possible patterns of in- 
terest in the. text that match defined l>atterns. 
It must ignore unnecessary words in the pat- 
tern to perform successfifl pattern matching 
for various expressions. 
4.2.2 Pattern matching rules 
Fig. 2 shows a defined pattern in 
which an arhitrary string is represented 
as "g~string" along with its correspond- 
ing English pattern. 2 Specilica.lly, a vari- 
able starting with "@CNAMI:;" is ca.lled the 
COulpally-name varial)le, used where a com- 
pany nanm is exi)ected to apl)ear. For exain- 
pie, "{}CNAME_I'AI{TNER_SUBJ" matches 
any string that likely includes at least; one com- 
pany name acting a.s a joint venture partner 
and functioning as a subject in tile sentence. 
The pattern "~ I h{:stri(:t:P" tells the pat- 
tern matcher to identify the word, where 
"~" or "z)<" are grammatical particles that 
serve as sul)je(-t case markers. The (te- 
fault type "strict" requires a.n exact string 
match, whereas "loose" allows a partial string 
match. Partial string matching is useful 
when compound words must be matched to 
a defined pattern. A joint venture, "~.-~j: 
loose:VN", whose l>art of speech is verbal nom- 
inal, matches compound words such as "~..~ 
~'~--~j" (corporate joint venture) a.s well ,%s "N 
~,j" (joint venture). 
2'\]'his \],'mglish pattern is used to capture expressions 
such as "XYZ Corp. created a joint venture with PQR 
1066 
(JointVenturel 6 
@CNAME_PARTNER_SUBJ 
~ \[ ~:strict:P 
@CNAME_PARTNERWITH 
:strict:P 
@SKIP 
~.-~:loose:VS) 
(a) A matching pattern for Japanese 
(JointVenturel 3 
@CNAME_PARTNER_SUBJ 
create::V 
a joint venture:NP 
with::P 
@CNAME_PARTNER_WITH) 
(b) A matching pattern for English 
Fig. 2: A re;etching pattern for (a) ,laf)anese 
a,n(l (b) English 
The ill-st field in a i)atteril is the pattern 
nalile followed by the patterll iiuliiber. The 
tia, ttern nunlber is used to deride whether or 
llOt a, search within a, given strhlg is IleCess.>t£y, 
To assure ell\]ciency with the pal, l;ern marcher, 
the fiehl designated by the lllunber sliould in- 
clude tlie leant frequent word in the entire pat- 
terll (,,~t~,, for aa, l)anese and :'a joint velil, urc" 
for English in this case). 
4.2.3 Pattern selection 
Approxiiila.tely 150 pa.tterns were used to 
extract various concepts in the Japanese joint 
ventures domain. Several patterns usually 
tllatch a. single sentence. Moreover, siuce pal;- 
terns are often searched using case ma.rkers 
slic\]i as "~", "7~<", and " ~ ", which frequently 
apt)ea.r ill .\]al)a.nese texts, even a sitigie l)a.t - 
1;eri/ Call l\[latch the Sellt,ence ill n\]oi'e thrill Olle 
w{I,y whell severa, I of" tile same ca,se lliarkers ex- 
ist ill a sentence. However, since the template 
gtmerator aCCellts only the best lnat(;he(I pal;- 
tern~ ehoosilig a corre(:tly nla,tehed i)atl,eril is 
imlJortant. The selection in (lone by applying 
three heuristic rules in the following or<let: 
• sele(:t l)a,tterns th~tt in(:hide the IlK)st 
I/lli\[lbei' of' nla, tched COlll\])al/y-ll311K) Val'i 
Inc." 
a bles in which there is at least one com- 
pany llaDle~ 
, sele(:t l)atterns tha.t COllSili\[le tilt fewest 
input seglnents (the shortest string 
match), and 
• select patterns that include the lnost 
llliiiil)er of variables and defined words. 
Another important feature of the pa.t:tern 
Iilat(:hor is tha,t rules can be groupe(1 accord- 
ilig to their COilCel)t. A rule lla.iile "JohltVeli- 
turel" iii Fig. 2, for example, represents a 
concel)t "JointVelitllre'. Ushlg this group- 
big, the best nlatched pattern can be selected 
fl'on-i nlatched patterns of a particular concept 
group instea.d of choosing from all the matched 
patterns. This feature enables the discourse 
and template g(meration processes to look at 
the, best infortnation necessary whet, tilling in 
a particular slot. 
5 EXAMPLE OF THE INFORMA- 
TION EXTRACTION PROCESS 
This sectkm describes how concepts ~tnd 
patterns identifh;d by tile matcher are used for 
tenll)late filling. Concepts are often useful to 
fill in the "set fill" (choice fi'om a. given set) 
sh)ts. An entity type slot, for examl)le , ha,s 
four giw, i choices: COMPANY, I)EI{SON, 
GOVERNM/i;NT, ~tnd OTIIt{;R. The matcher 
assigns concepts related to each entity type ex- 
cept ()TflEIL Thus, from the given set, the 
output generator chooses an entity type corre- 
sponding to the identified concept. There axe 
ca.ses when discourse processing is necessa.ry 
to link identified concepts and patterns. The 
f.ollowing text;: "X Inc. created a joint w;ntm'e 
with Y Corp. last yea.r. X announced yester- 
day that it. terminated the venture." in used 
to describe the extraction process illustrated 
in Fig. 3. 
In the preprocessing, two company na.tnes in 
the first sentence "X hic." and "Y Corp." are 
identified either I)y MAJESTY or the name 
recognizer. In the first sentence, the tem- 
pla.te pa.tteru search locates t he ,I ointgenture 1 
pattern shown in Fig. 2. Now, the ,/OINT- 
V I';NT \[J 1{1'3 con cepl; I)etweeii "X inc." and "Y 
C'ol'l)." is recognized. In tim second Sellt(~.it(X~., 
7067 
"X Inc. created a "X announced 
joint venture with yesterday that it 
Y Corp. last year."i terminated the 
venture." I 
Pre- company1: "X Inc." "X" 
company2: Ip rocesslng company3: 
"Y Corp." 
:::::::::::::::::::::: 
D,ssoLveD" 
IVENTURE* &LCorp" 
Discourse "X Inc."(companyl=company3) 
process ng } JOINT-VENTURE* y DISSOLVED* 
"Y Corp." * concept name 
Fig. 3: FxamI)le of the inforntatiou extraction 
process 
the company name "X" is also identitied by 
the preprocessor, a Next, the concept "DIS- 
SOLVEi)" is recognized by the key word ter- 
minate in the concept search. (The key word 
list is shown in Section 4.11.) After sentence- 
level processing, discourse processing recog- 
nizes that "X" in the second sentence is a ref- 
erence to "X Inc." found in the first sentence. 
Thus, the "DISSOLVI';I)" concept is joined to 
the, joint venture relationship between "X inc." 
and "Y Corp.". in this way, TI!iXTRACT rec- 
ognizes that the two companies dissolved the 
,joint venture. 
SupI)ose that the second sentence is re- 
place(t with another sentence: "Shortly after, 
X terminated a contract to supply rice to Z 
Corp.". Although it does not mentiot~ the 
dissolved relationship nor anything a hottt "Y 
Corp.", the system incorrectly recognizes the 
dissolved joint ventttre rela.tionship between 
"X In('." and "V Corp." due to the existence 
of the word terminate. When this undesir- 
able matching is often seen, more complicated 
template patterns must be used instead of tile 
simple word list. A dissolved concept, lbr ex- 
ample, could he identified using the fo\]\[owing 
template pattern: 
aWhen it is an unknown word to tim prepro(x~ssor, 
the discourse processor idcnti\[ies it IM.er. 
(Dissolvedl 2 
OCNAMEPARTNER_SUBJ 
dissolve I terminate l cancel::V 
@skip 
venture::N 
@CNAME_PARTNER_WITH). 
Then, discourse processing must check if com+ 
panies identified in this pattern are the same 
as the current joint venture comi)anies in order 
to recognize their dissolved relationship. 
6 OVERALL SYSTEM PERFOR- 
MANCE 
A total of 250 newspaper articles, 100 
about Japanese mi('roelectronics and 150 
about Japanese corl)orate joint ventures 
were provided by ARI)A fl)r nse in the 
Tll)SrI'FA/./MU(L5 system evalua.tion. Five 
microelectronics and six joint ventures sys- 
tems were presented in the Japanese, system 
eva.luation a.t MUC-5. 4 Scoring was done in 
a, semi-atttomatie nta,nner, rl'he scoring pro- 
gram automatically compared the system out- 
put with answer tetnl)lates created by hu- 
mat~ analysts, then, when a Mman decision 
was necessa.ry, analysts instructed the scof 
ing progratu whether tlt(,, two strings in cora- 
l)arisen were completely matched, pa.rtially 
matched, or unumtched. Finally, it calculated 
an overall score combined from all the news- 
paper article scores. Although various eval- 
nat;ion tnetrics were tneasured in the evalua- 
tiou \[C\[lillchor and Sun(lheim 93\], only the fol- 
lowing error and reeall-l)recision metrics are 
discussed in this pa.per. The ha.sic scoring cat- 
egories use(l are: correct (CO R), partially cor- 
rect (PAR), itlcorrect (INC), ntissing (MIS), 
and spurious (SPIJ), counted as the tmml)er 
of i)ieces of inl'orma.tion in the system output 
eompa.red to tile possil)le (answer) informa- 
tion. 
(l) 1';1'1;o1' metrics 
* l';rror 1)el . resl)onse fill (I~l{II): 
wrong _ INC+ ILdR/2 + \]I'IIS + HPU 
lotal COR+ t)AR+ INC+ MLS'+,5'PU 
4These numbers include TI;;X'FRACT, aA| optional 
sb'stmn of (;I'LCMU SIIO(;UN. 
7068 
TaJ)h, 1: Scores of "rI';XTItACT and two other l.Ol>ra.nking of\[iciaJ sysl,e,l,S iit '.I'II>STHL/M U(7-5 
\[ 'I'I"XTRACT (,i Mi,;) System A (JMI';) 
System B (JME) 
I 'rI,;x:rRACT (J Jr) 
System A (J,IV) 
System 1} (,\],IV) 
_.58 :it) 38 i4 60 53 n 
s. C ,Io us ISO.4 
67 ,52 1 _(J:l .... 51_ .... '2:t ..... IC_!_'l _L(irk .... _ 
TEXTI{A(/r's scores sul)inil.l.ed to MU(: 5 were unollicial. 
J M t';: J a.liiHl(+,,S(~ Ill icroelectr()lfics (IoIIt ai It 
JJV: ,/a.p,+,.lleSe ('o,'l)Ol'al;(~ johi t. v0litil ,'eS (\[outai II 
,,, Ulldergeneration (lIND): 
M l D' a41 ,¢ 
possible COle l- PAIL I IN(: q kl I£' 
® Overgelmration (OV(',): 
,91)U ,'dl'I\] 
acutual C'Ol~q PAIL4. IN(,'4 ,5'I~1 ! 
® Subsl, itutiol, (SUB): 
INU-F l'AI:/2 
COle,+ I'AI¢, t INC 
(2) l{ecali-preciMon ntetrics 
• l/.eca.II (1{,1';(\]): 
COI~ t I'ANI2 
possible 
.. Precision (I'I\[E): 
C01¢, -F I'A R/2 
acutetal 
+ l)~l\[ l:'-mea.slu'e (l)&l{,): 
'2 * IU')C * l' tUq 
IU'\]C I \]'l~l: 
'.l'he error per l'eSl)oltse lill (I';RI{.) was the of- 
IMal Inea.s,l,'e of M U(\]--5 systel,i pel\[(iriilatit:(*. 
~(;cotidaJ'y e, va.hia.tiolt ,heLlits we, re tlllfi0rgOll- 
eration (UNI)), overget,eratio,t (OV(',), ;,.td 
S,I I)sl;\]l;lll;ioll (SU II). The recall, l,l'eciSiOll, ~+tll (1 
I:'-Iilea.sti re l\]iel+rics we,'e used a.~; u nofli(:ia.I lnet. 
tics for MUC-5. 
Ta, ble I shows scores of TI:,X'\['I{A(:T and 
1,wo other tot>ra.nkilig oflicia.l .'-; y ,'-; l, elil+q r, l.akeu 
s'I+EXTIIA(YI' processed only Japane>~e text, 
wherea.s the two other sysl,enis l~rocessed I)olh I'\]nglish 
D, II(\[ Jit|)a.nese text. 
from the '\['\[F'STEI{/M U(\]+5 syst,:utl cwduatiotl 
' : I I,X IRA , I perforn+e<l re,quits \[M U(,-+> 93\]. u " ' C 
e,lually with the tol)-ra.nMn<e; systems it\] the 
\[,WO ,\] ~t I)~l IH!,qe (1()II13i IIS. 
Since the TI,;XTI{A(Yr nlicroe\]ectronics 
system did not, ia(th.le a, l,elnli\]a,te pa, l, tern 
s;earch or d iscou rse processor to help dillk.'en ti 
a.lo bOI, W(R~II n~tdt, ilfle, sellliCOlt(ltl(:Lor proce,c;ses 
of the sa,me kiIld, it reported ouly oile ol)jecl, 
for each kiud of ulanufacturing; process, even 
wheu multil)le ol)ject<; of tile sa,mc kind existed 
in the artMe. This resulted in the lower scores 
in the nfi(:roele(:tronics (\]Olll,~liF£ I;\]I;11, l;hose of 
t he .joitJ t vetltu res dotna.i N. 
Thi.~ \[);tt, t, erlt ma,t(:hi,g architecture is highly 
l)ortal>Ic~ ;tcros:-; dilreretlt domains (>\[' the s;,.me 
laliguage. The TI:,XTRAC/r nii,::r(+,+lectrolii(:s 
system was dew~loped in only three weeks by 
one person by simply replacing joint venture 
coilce+l)t,~ and key words witll representative 
Ilti('rof!l(~(~|.l'Oli\](~s COlIC(':\])l;.q ~,.,1(1 key words. 
7 CON(JL(JSION 
lit the ,\];'q)alle,qe lilicro(qeC:tl'OlliCB ai\]d cor- 
porate .\](>itll, vent, u res <lonul,i us, TEXT RA CT 
perf'<)rmed equally with t,h,<.! t, ol>rankiNg oNi- 
cial s,ystems ;it, the 'I'IILqTER/M U(\]-5 system 
eva\]ua,t.ioN. Alt, hough l)erl'oruna,nce of F,.a,IJ,(;rll 
lua.tching~ tnust be ewdua.ted, Ihe high I)erfor- 
ma.nce or TEXTI~A(/r suggests that the paL- 
tern n.atcher worked well in e×tra,cting infof 
ma,tion froul t\]ie l, ext;. '\]'tie p:<i,l,t(;rn nlal, cher 
("'I'I'~X'I'IIA(/I"s scores submitie(I to MU(~-5 were 
u,official. \[I was scored ollicially M'ler the confelX~llCe, 
The official scores showed slight dilfcrences from ulmf 
ficiaJ onus. 
1069 
has not been tested to languages other than 
Japanese. It is expected to work to other lan- 
guages with some minor modifications given 
that the input is segmented into primitive 
words tagged with their parts of speech. 
The TEXTRACT Japanese microelectron- 
ics system was developed in only three weeks 
by one person. In spite of its simplicity, it 
showed the high performance. This result 
also suggests that the pattern matching archi- 
tecture is highly portable across similar do- 
mains of the same la.nguage, thus facilitating 
rapid system development. Developing and 
maintaining TEXTRACT's pattern matching 
based architecture is easier and less complex 
than that of a full parsing system, as experi- 
enced in the early stage of StIOGUN system 
development \[aacobs 93b\]. 
Corpus analysis took about half of the de- 
velopment time, since only a KWIC (Key 
Word In Context) list and a word fi'equency 
tool were used to acquire the concept-word 
lists and the template patterns. Using good 
statistical corpus analysis tools will shorten 
the development time and promise a high per- 
formance. The tools should not only collect 
patterns of interest with context, but also give 
statistical data to show how well deft ned pat- 
terns are working when they are applied in the 
system. 
At MUC-5 meeting, P&R F-measure of one 
of the top-ranking systems was claimed to be 
close to the human perfornlance \[Jacobs 93b\]. r 
To match the system performance of a pattern 
matching system to human performallce, the 
preprocessor must recognize expressions to be 
extracted at nearly 100% accuracy given that 
other components simply merge information 
and generate ontput. 
Acknowledgements 
The authors wish to express their apprecia- 
tion to aaime Carbonell, who provided the op- 
portunity to pursue this research a.t the Cen- 
ter for Machine Translation, Carnegie Mellon 
University. Thanks are also due to 3bxu ko Mi- 
tamura and Michael Mauldin for their many 
helpful suggestions. 
r'l'he human performance was estirmtted to be recall 
and precision of about seventy to eighty. 
References 
\[Chinchor and Sundheim 93\] Chinchor, N. 
and Sundheim, B. (1993). MUC-5 Evalua- 
tion Metrics. Notebook of the Fifth Message 
Undersianding Cor@renee (MUC-5). 
\[IIobbs et al. 92\] lIobbs, J., Appelt, 1)., et al. 
(1992). FASTUS: A System for Extracting 
hfformation from Natural-Language Text. 
SRI International, Technical Note No. 519. 
\[Jacobs 93a\] Jacobs, 
P. (1993). TIPSTER/SIIOGUN 18-Month 
Progress Report. Notebook of the TIPSTER 
18-Month Meeting. 
\[Ja.cobs 931)\] Jacobs, P. (1993). GE-CMU: l)e- 
scription of the Shogun System Used for 
MUC-5. Notebook of the Fifth Message Un- 
derstanding Conference (M U6'-5). 
\[Kitani 91\] Kitani, T. (:199:l). An ()CR 
Post-processing Method for Handwritten 
Japanese Documents. In proceedings of Nat- 
ural Language Processing PaciJic Rim ,gym- 
posium, pp. 38-45. 
\[Kitani and Mitamura 93\] Kitani, T. and Mi- 
tamura, T. (1993). A Japanese Preproces- 
sor for Syntactic and Scmalltic Parsing. In 
proceedings of Ninth 11¢1¢E Conferencc on 
Artificial Intelligence for Applications, pp. 
86-92. 
\[Kitani and Mitamnra 94\] Kitani, T. and Mi- 
tam ura, T. (199d). A n Accurate Morpholog- 
ical Analysis and Proper Name hlentifica- 
tion for Japanese Text Processing. Journal 
of Information Processing Society of Japan, 
Vol. 35, No. 3, 1)P. 404 - 413. 
\[Kitani 94\] Kitani, T. (1994). Merging hffor- 
marion by I)iscourse Processing for lnfof 
mation Extraction. In proceedings of 7~nth 
IEIH¢ Conference on Artificial Intclligence 
for Applications, pp. 4112-418. 
\[MUC-5 93\] (:1993). System 
Descriptions. Notebook of thc l"iflh Mess'age 
Undcrstandin.g Cbnfcrcnec (AIUC-5). 
\[TIt)STER 92\] (1992). Joint Ventnre Tem- 
plate Fill Rules. Plenary Session Notebook 
of the TII%5'TER, 12-Month Mcctin 9. 
1070 
