FASTUS: A System for Extracting Information from Text* 
Jerry R. Hobbs, Douglas Appelt, John Bear, 
David Israel, Megumi Kameyama, and Mabry Tyson
SRI International 
333 Ravenswood Avenue
Menlo Park, CA 94025
INTRODUCTION 
FASTUS is a (slightly permuted) acronym for Finite
State Automaton Text Understanding System. It is a
system for extracting information from free text in
English (Japanese is under development), for entry into
a database, and potentially for other applications. It
works essentially as a set of cascaded, nondeterministic
finite state automata.
FASTUS is most appropriate for information extraction
tasks, rather than full text understanding. That is, it is
most effective for text-scanning tasks where
• Only a fraction of the text is relevant.
• There is a pre-defined, relatively simple, rigid target
representation that the information is mapped into.
• The subtle nuances of meaning and the writer's
goals in writing the text are of no interest.
THE STRUCTURE OF THE MUC-4 
FASTUS SYSTEM 
The operation of FASTUS comprises four steps.
1. Triggering: Sentences are scanned for key words to
determine whether they should be processed further.
2. Recognizing Phrases: Sentences are segmented into
noun groups, verb groups, and particles.
3. Recognizing Patterns: The sequence of phrases
produced in Step 2 is scanned for patterns of interest,
and when they are found, corresponding "incident
structures" are built.
4. Merging Incidents: Incident structures from different
parts of the text are merged if they provide
information about the same incident.
*This research was supported in part by the Defense Advanced
Research Projects Agency under Contract ONR N00014-90-C-0220
with the Office of Naval Research, in part by NTT Data, and in
part by an SRI internal research and development grant. The
views and conclusions contained in this document are those of the
authors and should not be interpreted as necessarily representing
the official policies, either expressed or implied, of the Defense
Advanced Research Projects Agency of the U.S. Government.
Many systems have been built to do pattern matching on 
strings of words. One crucial innovation in the FASTUS 
system has been separating that process into the two 
steps of recognizing phrases and recognizing patterns. 
Phrases can be recognized reliably with purely syntac- 
tic information, and they provide precisely the elements 
that are required for stating the patterns of interest. 
The system is implemented in CommonLisp and runs on
both Sun and Symbolics machines.
AN EXAMPLE 
The task in the MUC-3 and MUC-4 (Message Understanding
Conference) evaluations of text processing systems
was to scan news reports and extract information
about terrorist incidents, in particular, who did what to
whom. The following sentence occurred in one report:
Salvadoran President-elect Alfredo Cris- 
tiani condemned the terrorist killing of Attor- 
ney General Roberto Garcia Alvarado and ac- 
cused the Farabundo Marti National Liberation 
Front (FMLN) of the crime. 
1. Triggering: This sentence is triggered because it has
a number of key words, including "terrorist", "killing",
and "FMLN".
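The triggering step can be illustrated with a minimal sketch. It assumes a trigger list containing only the three words named above; the actual MUC-4 keyword list is not given here, and the names TRIGGER_WORDS and triggers are hypothetical.

```python
import re

# Hypothetical trigger list: only the three key words named in the text.
TRIGGER_WORDS = {"terrorist", "killing", "fmln"}

def triggers(sentence):
    """Return the trigger words found in the sentence, lowercased and sorted."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return sorted(set(words) & TRIGGER_WORDS)

sentence = ("Salvadoran President-elect Alfredo Cristiani condemned the "
            "terrorist killing of Attorney General Roberto Garcia Alvarado "
            "and accused the Farabundo Marti National Liberation Front "
            "(FMLN) of the crime.")

# A sentence is passed on to later steps only if at least one trigger fires.
print(triggers(sentence))
```

A sentence with no trigger words yields an empty list and would be skipped.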
2. Recognizing Phrases: Step 2 segments the sentence
into the following phrases:
Noun Group:    Salvadoran President-elect
Name:          Alfredo Cristiani
Verb Group:    condemned
Noun Group:    the terrorist killing
Preposition:   of
Noun Group:    Attorney General
Name:          Roberto Garcia Alvarado
Conjunction:   and
Verb Group:    accused
Noun Group:    the Farabundo Marti National
               Liberation Front (FMLN)
Preposition:   of
Noun Group:    the crime
The phrases that are recognized are names, the noun
group, or the noun phrase up through the head noun, the
verb group, or the verb together with its auxiliaries and
any trapped adverbs, and various particles, including
prepositions, conjunctions, relative pronouns, the word
"ago", and the word "that", which is treated specially
because of the ambiguities it gives rise to. Essentially the
full complexity of English noun groups and verb groups
is accommodated.
This phase of the processing gives very reliable results -
better than 96% accuracy on the data we have examined.
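As a rough illustration of finite-state phrase recognition, the sketch below groups hand-tagged words into noun groups and verb groups with a regular expression over the tag string. The one-letter tag set, the CHUNK expression, and the sample sentence are assumptions made for this example; they are far simpler than the grammar FASTUS uses.

```python
import re

# One-letter tags (hypothetical tag set): D=determiner, J=adjective, N=noun,
# A=auxiliary, R=adverb, V=verb, P=preposition, C=conjunction.
# A noun group ends at its head noun; a verb group is the verb with its
# auxiliaries and any adverbs trapped between them.
CHUNK = re.compile(r"(?P<NG>D?J*N+)|(?P<VG>A*R*V)|(?P<PREP>P)|(?P<CONJ>C)")

def chunk(tagged):
    """Group (word, tag) pairs into phrases using a regex over the tag string."""
    tags = "".join(tag for _, tag in tagged)
    phrases = []
    for m in CHUNK.finditer(tags):
        words = " ".join(w for w, _ in tagged[m.start():m.end()])
        phrases.append((m.lastgroup, words))
    return phrases

tagged = [("the", "D"), ("mayor", "N"), ("was", "A"), ("recently", "R"),
          ("kidnapped", "V"), ("by", "P"), ("armed", "J"), ("men", "N")]
print(chunk(tagged))
```

Tags that fit none of the alternatives are simply skipped by the scan, which is one plausible way to tolerate unanalyzed material.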
3. Recognizing Patterns: In the example, two patterns
are recognized in the sequence of phrases:
<Perpetrator> <Killing> of <HumanTarget>
and
<GovtOfficial> accused <PerpOrg> of
<Incident>
Two corresponding incident structures are constructed: 
Incident: KILLING 
Perpetrator: "terrorist" 
Confidence: 
Human Target: "Roberto Garcia Alvarado"
and 
Incident: INCIDENT 
Perpetrator: FMLN 
Confidence: Suspected or Accused by 
Authorities 
Human Target: -- 
Altogether for the MUC-4 application, about one hun- 
dred patterns were recognized. 
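The second pattern above can be sketched as a match over a sequence of phrases. The sketch assumes each phrase already carries a hand-assigned semantic class, and it places the subject directly before the verb for simplicity; the names PATTERN and match_accusation are hypothetical, and normalizing the organization name to "FMLN" is omitted.

```python
# Each phrase is (kind, text, semantic_class). The semantic classes are
# hand-assigned here; this is an assumption of the sketch.
phrases = [
    ("Name", "Alfredo Cristiani", "GovtOfficial"),
    ("VerbGroup", "accused", "accuse"),
    ("NounGroup", "the Farabundo Marti National Liberation Front (FMLN)",
     "PerpOrg"),
    ("Preposition", "of", "of"),
    ("NounGroup", "the crime", "Incident"),
]

# <GovtOfficial> accused <PerpOrg> of <Incident>
PATTERN = ["GovtOfficial", "accuse", "PerpOrg", "of", "Incident"]

def match_accusation(phrases):
    """Return an incident structure if the pattern occurs contiguously, else None."""
    classes = [cls for _, _, cls in phrases]
    n = len(PATTERN)
    for i in range(len(classes) - n + 1):
        if classes[i:i + n] == PATTERN:
            return {
                "Incident": "INCIDENT",
                "Perpetrator": phrases[i + 2][1],
                "Confidence": "Suspected or Accused by Authorities",
                "Human Target": None,
            }
    return None

print(match_accusation(phrases))
```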
4. Merging Incidents: These two incident structures
are merged into a single incident structure, containing
the most specific information from each.
Incident:      KILLING
Perpetrator:   FMLN
Confidence:    Suspected or Accused by
               Authorities
Human Target:  "Roberto Garcia Alvarado"
In the MUC-4 system, there are fairly elaborate rules for
merging the noun groups that appear in the Perpetrator,
Physical Target, and Human Target slots. A name
can be merged with a description, as "Garcia" with
"attorney general", provided the description is consistent
with the other descriptions for that name. A precise
description can be merged with a vague description, such as
"person", with the precise description as the result. Two
precise descriptions can be merged if they are
semantically compatible. The descriptions "priest" and
"Jesuit" are compatible, while "priest" and "peasant" are
not. When precise descriptions are merged, the longest
string is taken as the result. If merging is impossible,
both noun groups are listed in the slot.
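The description-merging rules just stated can be summarized in a short sketch. The VAGUE set and the COMPATIBLE table are hypothetical stand-ins for whatever semantic knowledge the system consults; only the "priest", "Jesuit", "peasant", and "person" examples come from the text.

```python
# Hypothetical knowledge tables for the sketch.
VAGUE = {"person"}
COMPATIBLE = {
    frozenset({"priest", "Jesuit"}),
    frozenset({"priest", "Jesuit priest"}),
}

def merge_descriptions(a, b):
    """Merge two descriptions and return the resulting slot contents as a list."""
    if a == b:
        return [a]
    if a in VAGUE:                      # vague + precise -> precise
        return [b]
    if b in VAGUE:
        return [a]
    if frozenset({a, b}) in COMPATIBLE:
        return [max(a, b, key=len)]     # longest string wins; ties keep a
    return [a, b]                       # merging impossible: list both

print(merge_descriptions("person", "priest"))
print(merge_descriptions("priest", "Jesuit priest"))
print(merge_descriptions("priest", "peasant"))
```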
SKIPPING COMPLEMENTS 
Pattern-matching approaches have often been tried in
the past, without much success. We believe that our
success was due to two key ideas. The first, as stated above,
is the use of cascaded finite-state automata, dividing the
task at the noun group and verb group level. The second
is our approach to skipping over complements.
One significant problem in pattern-matching approaches
is linking up arguments with their predicates when they
are distant in the sentence, for example, linking up the
subject noun group with the main verb when the subject
has a number of nominal complements. One technique
that has been tried is to skip over up to some number of
words, say, five, in looking for the subject's verb. One
trouble with this is that there are often more than five
words in the subject's nominal complement. Another
trouble is that in a sentence like
The police reported that terrorists bombed the 
Parliament today. 
this technique would find "the police" as the subject of
"bombed".
Our approach is to implement knowledge of the grammar
of nominal complements directly into the finite-state
pattern recognizer. The material between the end of the
subject noun group and the beginning of the main verb
group must be read over. There are patterns to
accomplish this. Two of them are as follows:
Subject {Preposition NounGroup}*
    VerbGroup

Subject Relpro {NounGroup | Other}*
    VerbGroup {NounGroup | Other}*
    VerbGroup
The first of these patterns reads over prepositional
phrases. The second reads over relative clauses. The verb
group at the end of these patterns takes the subject noun
group as its subject. There is another pattern for
capturing the content encoded in relative clauses:
Subject Relpro {NounGroup | Other}*
    VerbGroup
Since the finite-state mechanism is nondeterministic, the
full content can be extracted from the sentence
The mayor, who was kidnapped yesterday, was
found dead today.
One branch discovers the incident encoded in the relative
clause. Another branch marks time through the relative
clause and then discovers the incident in the main clause.
These incidents are then merged.
A similar device is used for conjoined verb phrases. The
pattern
Subject VerbGroup {NounGroup | Other}*
    Conjunction VerbGroup
allows the machine to nondeterministically skip over the
first conjunct and associate the subject with the verb
group in the second conjunct. This is how, in the above
example, we were able to recognize Cristiani as the one
who was accusing the FMLN of the crime.
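The skipping patterns in this section can be approximated with regular expressions over a string of phrase codes. In the sketch below the one-letter codes, the treatment of prepositions as "Other" inside a conjunct, and the function verbs_for_subject are assumptions made for illustration. Nondeterminism is simulated by testing every prefix that ends in a verb group, so that each accepting branch is reported.

```python
import re

# One-letter codes (hypothetical): S=subject noun group, N=noun group,
# V=verb group, P=preposition, R=relative pronoun, C=conjunction, O=other.
PATTERNS = [
    re.compile(r"S(?:PN)*V"),        # read over prepositional phrases
    re.compile(r"SR[NO]*V[NO]*V"),   # read over a relative clause
    re.compile(r"SR[NO]*V"),         # capture the relative clause itself
    re.compile(r"SV[NPO]*CV"),       # read over the first conjunct
]

def verbs_for_subject(tags):
    """Return indices of verb groups that some pattern links to the subject at index 0."""
    linked = []
    for end, tag in enumerate(tags):
        if tag == "V" and any(p.fullmatch(tags[:end + 1]) for p in PATTERNS):
            linked.append(end)
    return linked

# "The mayor, who was kidnapped yesterday, was found dead today."
print(verbs_for_subject("SRVOVOO"))
# "Cristiani condemned the killing of Garcia and accused the FMLN of the crime."
print(verbs_for_subject("SVNPNCVNPN"))
# "The police reported that terrorists bombed the Parliament today."
print(verbs_for_subject("SVONVNO"))
```

In the third sentence only "reported" is linked to "the police", because no pattern licenses reading over "that terrorists" to reach "bombed".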
THE PERFORMANCE OF FASTUS 
On the MUC-4 evaluation in June 1992, FASTUS was
among the top few systems, even though it had only been
under development for five months. On the TST3 set of
one hundred messages, FASTUS achieved a recall of 44%
and a precision of 55%. The full results of the MUC-4
evaluation can be found in Sundheim (1992).
Moreover, FASTUS is an order of magnitude faster than
any other comparable system. In the MUC-4 evaluation
it was able to process the entire test set of 100 messages,
ranging from a third of a page to two pages in length, in
11.8 minutes of CPU time on a Sun SPARC-2 processor.
The elapsed real time was 15.9 minutes. In more
concrete terms, FASTUS can read 2,375 words per minute.
It can analyze one text in an average of 9.6 seconds. This
translates into 9,000 texts per day.
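The daily throughput figure follows from the per-text average, assuming the system runs around the clock (an assumption of this sketch; the text does not say how the day is counted).

```python
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400 seconds
seconds_per_text = 9.6              # average analysis time reported above

texts_per_day = SECONDS_PER_DAY / seconds_per_text
print(round(texts_per_day))         # 9000
```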
This fast run time translates directly into fast
development time. FASTUS became operational on May 6,
1992, and we did a run on a set of messages that we had
not trained on, obtaining a score of 8% recall and 42%
precision. At that point we began to train the system on
1300 development texts, adding patterns and doing
periodic runs on the fair test to monitor our progress. This
effort culminated three and a half weeks later on June
1 in a score of 44% recall and 57% precision. (Recall is
percent of the possible answers the system got correct;
precision is percent of the system's answers that were
correct.) Thus, in less than a month, recall went up 36
points and precision 15 points.
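The two measures, as defined in the parenthesis above, can be written out directly. The counts in the usage line are hypothetical and chosen only to exercise the function; the last lines recompute the reported gains from the May 6 and June 1 scores.

```python
def recall_precision(correct, possible, actual):
    """Return (recall, precision) in percent.

    recall    = correct answers / possible answers
    precision = correct answers / answers the system gave
    """
    return 100.0 * correct / possible, 100.0 * correct / actual

# Hypothetical counts: 3 correct out of 4 possible, 6 answers given.
print(recall_precision(3, 4, 6))    # (75.0, 50.0)

# Gains between the first run and the final run reported in the text.
recall_gain = 44 - 8                # 36 points
precision_gain = 57 - 42            # 15 points
print(recall_gain, precision_gain)
```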
A more complete description of FASTUS and its
performance is given in Hobbs et al. (1992).
RECENT EXTENSIONS 
We are currently extending the FASTUS system in three
ways:
• We are developing a convenient interface that will
allow users to define patterns more easily.
• We are implementing a Japanese language version
of FASTUS.
• We are applying the system to a new domain:
extracting information about joint ventures from
news articles.
The last of these will be the subject of our MUC-5 paper.
The other two are described here.
THE INTERFACE 
The original version of FASTUS has been augmented
with a convenient graphical user interface for
implementing or extending an application, employing SRI's
Grasper system (Karp et al., 1993). We expect this to
speed up development time for a new application by a
factor of three or four. Moreover, whereas before now
only a system developer could implement a new
application, now virtually anyone should be able to.
In a specification interface for FASTUS, there needs to
be convenient means for performing four tasks:
1. Defining target structures.
2. Defining word classes.
3. Defining state transitions.
4. Defining merge conditions.
We have done nothing yet in the first two areas, since
everyone currently working with the system is fluent in
Lisp. Target structures are defined with defstruct, word
classes with defvar. As we acquire users who are not
programmers, it will be straightforward to implement
convenient means for these tasks.
The Grasper-based graphical interface provides a
convenient means for creating, examining, editing, and
destroying nodes and links in the graphs representing the
finite-state automata. Each link is labelled with the
tokens that cause that transition to take place. Nodes have
associated with them sequences of instructions that are
executed when that node is reached. These instructions
typically fill slots in the target structures, and they can
be conditionalized on what link the node was reached
from, allowing greater economy in the finite-state
machines.
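The graph representation described here, with token-labelled links and instructions attached to nodes, might be modelled as follows. This is a sketch only: the Node class, the run function, and the sample verbs are hypothetical, the traversal is deterministic for brevity, and nothing here reflects the actual Grasper data structures.

```python
class Node:
    def __init__(self, name, actions=None):
        self.name = name
        self.links = {}               # token label -> destination Node
        self.actions = actions or []  # instructions run when the node is reached

    def link(self, label, dest):
        self.links[label] = dest

def run(start, tokens, target):
    """Follow links for each token, running node actions; False if stuck."""
    node = start
    for tok in tokens:
        nxt = node.links.get(tok)
        if nxt is None:
            return False
        for action in nxt.actions:
            action(target, tok)       # tok identifies the incoming link
        node = nxt
    return True

def set_incident(target, label):
    # Conditionalized on the incoming link, so one node serves both verbs.
    target["Incident"] = {"bombed": "BOMBING", "killed": "KILLING"}[label]

start = Node("start")
verb = Node("verb", actions=[set_incident])
start.link("bombed", verb)
start.link("killed", verb)

target = {}
run(start, ["killed"], target)
print(target)
```

Sharing one node between the two verbs is the kind of economy that conditionalizing on the incoming link allows.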
In addition, the interface allows the graphs at each level
to be modularized in whatever fashion the user desires,
so that at any given time, the user can focus on only a
small portion of the total graph. There are also
convenient means for saving and compiling the graphs after
changes have been made.
Perhaps the hardest problem in the information
extraction task is defining when two target structures can be
merged. This is, after all, the coreference problem in
discourse, well known to be "AI-complete". We have
developed a kind of algebra on the target structures. The user
can define abstract data types, including number ranges,
date ranges, locations, and strings. Comparison
operations can then be defined for each of these data types,
returning values of Equal, Subsumes, Inconsistent, and
Incomparable. Combination operations can also be
defined. For example, the combination of two number or
date ranges is the more restrictive range. For strings,
the combination depends on the semantic categories of
the heads of the strings. If one is more specific than the
other, the more specific term is the result of the
combination. There are three types of actions that can be performed
after doing a comparison. The items can be merged or
combined. If they are incomparable and if the slot in the
target structure admits compound entries, the two can
simply be added together. Or the unification of the two
items can be rejected.
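For number ranges, the comparison and combination operations could look like the sketch below. The four result values come from the text, but how each is assigned is an assumption of this sketch: here Subsumes means that one range contains the other, disjoint ranges are Inconsistent, and partially overlapping ranges are treated as Incomparable.

```python
from enum import Enum

class Cmp(Enum):
    EQUAL = "Equal"
    SUBSUMES = "Subsumes"
    INCONSISTENT = "Inconsistent"
    INCOMPARABLE = "Incomparable"

def compare_ranges(a, b):
    """Compare two closed ranges given as (lo, hi) tuples."""
    if a == b:
        return Cmp.EQUAL
    a_in_b = b[0] <= a[0] and a[1] <= b[1]
    b_in_a = a[0] <= b[0] and b[1] <= a[1]
    if a_in_b or b_in_a:
        return Cmp.SUBSUMES           # one range contains the other
    if a[1] < b[0] or b[1] < a[0]:
        return Cmp.INCONSISTENT       # no value satisfies both
    return Cmp.INCOMPARABLE           # partial overlap (sketch's assumption)

def combine_ranges(a, b):
    """Return the more restrictive range (the intersection), or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

print(compare_ranges((1, 10), (3, 4)), combine_ranges((1, 10), (3, 4)))
```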
This algebra of target structures gives us a very clean
treatment of what in the MUC-4 system was often very
ad hoc.
FASTUS has been restructured somewhat as well since
MUC-4. A Tokenizer Phase has been added. Its input
consists of ASCII characters and its output is tokens,
usually words, numerals, and punctuation marks. This
phase gives the user control over the lowest level of input,
so that special rules can be encoded for abbreviations,
numbers with radix other than 10, and other such
phenomena. The most common tokenizations are, of course,
already implemented.
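A tokenizer of the kind described, turning characters into words, numerals, and punctuation marks, can be sketched with one regular expression. The TOKEN expression is an assumption of this example and handles only ASCII letters, decimal numerals, and single punctuation characters; special rules for abbreviations or other radixes are not modelled.

```python
import re

# Numerals (with internal commas or periods), words (with internal hyphens
# or apostrophes), or any single non-space punctuation character.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|[A-Za-z]+(?:[-'][A-Za-z]+)*|[^\sA-Za-z\d]")

def tokenize(text):
    """Split ASCII text into word, numeral, and punctuation tokens."""
    return TOKEN.findall(text)

print(tokenize("FASTUS can read 2,375 words per minute."))
```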
A Preprocessor Phase has also been added. This
incorporates the multiword handling that was done in the
Phrase Recognition phase of the first version of
FASTUS. It also allows the user to customize automata for
dealing, for example, with names that have a different
given-name family-name order and with names of
non-human entities that have internal structure significant to
the domain, such as company names.
The treatment of appositives, conjunctions, and "of"
prepositional phrases was originally done in the Pattern
Recognition phase. This has now been separated out
into a Combining Phase for a treatment that is more
perspicuous and hence more convenient for the user.
JAPANESE FASTUS 
We are also developing a Japanese version of FASTUS.
The initial application is for extracting a summary of
spoken dialogues, input in Roman characters, in the
domain of conference room reservations. Summarizing
goal-oriented dialogues can be achieved by filling a
pre-defined summary template, and any digressions in the
dialogue content can be ignored. Summarization is thus
an example of expectation-driven information extraction
performed by FASTUS.
Despite the dissimilarity between the English and
Japanese languages, the basic FASTUS architecture
consisting of four phases can be applied to the
processing of Japanese. The phrase recognition phase (phase
II) recognizes noun groups, verb groups, and particles.
The phrase combination phase (phase III) recognizes
the "NG no NG" phrases (similar to the English "of"
phrases) and NG conjunctions that are of interest to the
given domain. The incident recognition phase (phase
IV) recognizes those utterance patterns that contain key
information relevant to the summary template.
Because the input is spontaneous dialogues rather than
written news reports, we will have a dialogue managing
module after the incident recognition phase in order
to combine information contained in successive dialogue
turns, for instance, question-answer pairs and
request-confirmation pairs. We have implemented phases II and
III, and phase IV will be in place shortly.
The main complexity of summarization in this room
reservation domain is in the use of temporal
expressions and in the dynamics of negotiation between the
two speakers. Written news reports typically report past
events whose resulting states are already known. Spoken
dialogues, however, progress through a sequence of
negotiations where the speakers express their desires,
possibilities, impossibilities, concessions, acceptances, and so
forth. This is a considerable challenge to the structure
merging routine of FASTUS.
For the MUC-5 participation, the Japanese FASTUS
system will be extended for the new domain of joint
ventures and the new input type of written news reports in
Japanese characters.
SUMMARY 
The advantages of the FASTUS system are as follows:
• It is conceptually simple. It is a set of cascaded
finite-state automata.
• The basic system is relatively small, although the
dictionary and other lists are potentially very large.
• It is effective. It was among the top few systems in
the MUC-4 evaluation.
• It has very fast run time. The average time for
analyzing one message is less than 10 seconds. This is
nearly an order of magnitude faster than
comparable systems.
• In part because of the fast run time, it has a very
fast development time. This is also true because the
system provides a very direct link between the texts
being analyzed and the data being extracted.
We believe that the FASTUS technology can achieve a
level of 60% recall and 60% precision on information
extraction tasks like that of MUC-4. Human coders do not
agree on this task more than 80% of the time. Hence,
a system working ten times as fast as humans do can
achieve 75% of human performance. We believe that
combining this system with a good user interface could
increase the productivity of analysts by a factor of five
or ten in this task.
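The 75% figure is the ratio of the expected system score to the human agreement ceiling. The variable names below are hypothetical; the two percentages are the ones stated above.

```python
system_score = 60       # percent recall and precision expected of FASTUS
human_agreement = 80    # percent ceiling on agreement among human coders

fraction_of_human = system_score / human_agreement
print(f"{fraction_of_human:.0%}")   # 75%
```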
This of course raises the question about the final 25%.
How can we achieve that? We believe this will not
be achieved until we make substantial progress on the
long-term problem of full text understanding. This
cannot happen until there is a long-term commitment that
makes resources available for innovative research on the
problem, research that will almost surely not produce
striking results on large bodies of text in the near
future.
Absent such an environment, our immediate plans are
to spend about two months bringing our MUC-5
system to and beyond the level of our MUC-4 system, and
then to explore the important research question of how
much of full text understanding can be approximated by
the finite-state approach. The following observations are
very suggestive in this regard.
We believe that the most promising approach for full
text understanding is the "Interpretation as Abduction"
approach elaborated in Hobbs et al. (1993). There are
three basic operations in this approach, and each of them
can be approximated in FASTUS technology. First, the
syntactic structure is recognized and a logical form is
produced. The corresponding operation in FASTUS is
the recognition of phrases, that part of syntax that can
be done reliably. Second, the logical form is proven
abductively by back-chaining on axioms of the form
$(\forall a, b)\; Y(a, b) \supset X(a, b)$
This can be approximated by adding further patterns:
In addition to having a pattern for 
A X'ed B 
we would also have a pattern for 
A Y'ed B 
Third, redundancies are spotted and merged to solve the 
coreference problem. As pointed out above, this is ap- 
proximated in FASTUS by the operation of merging in- 
cidents. 
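The second approximation above, replacing back-chaining on an axiom Y(a,b) ⊃ X(a,b) with an extra pattern for "A Y'ed B", can be sketched as a one-step expansion of the verbs a pattern accepts. The AXIOMS table and its verbs are hypothetical examples, and chains of axioms are not followed.

```python
# Hypothetical axioms of the form Y(a, b) implies X(a, b), stored as Y -> X.
AXIOMS = {"assassinated": "killed", "murdered": "killed"}

def expand(pattern_verb):
    """Return X together with every Y that has an axiom Y implies X."""
    return sorted({pattern_verb} | {y for y, x in AXIOMS.items() if x == pattern_verb})

# A pattern written for "A killed B" is duplicated for each implying verb.
print(expand("killed"))
```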
However, it must be realized that much of the success
of the FASTUS approach is in the clever ways it ignores
much of the irrelevant information in the text. As we
deal with texts in which more and more of the
information is relevant, this advantage could well be lost, and a
genuine, full text-understanding system will be required.
REFERENCES 
1. Hobbs, Jerry R., Douglas E. Appelt, John Bear, David
Israel, and Mabry Tyson, 1992. "FASTUS: A System for
Extracting Information from Natural-Language Text",
SRI Technical Note 519, SRI International, Menlo Park,
California, November 1992.
2. Hobbs, Jerry R., Mark Stickel, Douglas Appelt, and
Paul Martin, 1993. "Interpretation as Abduction", to
appear in Artificial Intelligence Journal. Also published
as SRI Technical Note 499, SRI International, Menlo
Park, California, December 1990.
3. Karp, Peter D., John D. Lowrance, Thomas M. Strat,
David E. Wilkins, 1993. "The Grasper-CL Graph
Management System", Technical Note No. 521, Artificial
Intelligence Center, SRI International, January 1993.
4. Sundheim, Beth, ed., 1992. Proceedings, Fourth Message
Understanding Conference (MUC-4), McLean, Virginia,
June 1992. Distributed by Morgan Kaufmann
Publishers, Inc., San Mateo, California.
