Automatic Acquisition of Domain Knowledge for Information 
Extraction 
Roman Yangarber, Ralph Grishman Past Tapanainen 
Courant Institute of Conexor oy 
Mathematical Sciences Helsinki, Finland 
New York University 
{roman \[ grishman}@cs, nyu. edu Pasi. Tapanainen@conexor. fi 
Si!ja Ituttunen 
University of Helsinki 
Finland 
sihuttun@ling.helsinki.fi 
Abstract 
In developing an Infbrmation Extraction tIE) 
system tbr a new class of events or relations, one 
of the major tasks is identifying the many ways 
in which these events or relations may be ex- 
pressed in text. This has generally involved the 
manual analysis and, in some cases, the anno- 
tation of large quantities of text involving these 
events. This paper presents an alternative ap- 
proach, based on an automatic discovery pro- 
cedure, ExDIsCO, which identifies a set; of rele- 
wmt documents and a set of event patterns from 
un-annotated text, starting from a small set of 
"seed patterns." We evaluate ExDIScO by com- 
paring the pertbrmance of discovered patterns 
against that of manually constructed systems 
on actual extraction tasks. 
0 Introduction 
Intbrmation Extraction is the selective extrac- 
tion of specified types of intbrmation from nat- 
ural  text. The intbrmation to be 
extracted may consist of particular semantic 
classes of objects (entities), relationships among 
these entities, and events in which these entities 
participate. The extraction system places this 
intbrmation into a data base tbr retrieval and 
subsequent processing. 
In this paper we shall be concerned primar- 
ily with the extraction of intbrmation about 
events. In the terminology which has evolved 
ti'om the Message Understanding Conferences 
(muc, 1995; muc, 1993), we shall use the term 
subject domain to refer to a broad class of texts, 
such as business news, and tile term scenario to 
refer to tile specification of tile particular events 
to be extracted. For example, the "Manage- 
ment Succession" scenario for MUC-6, which we 
shall refer to throughout this paper, involves in- 
formation about corporate executives starting 
and leaving positions. 
The fundamental problem we face in port- 
ing an extraction system to a new scenario is 
to identify the many ways in which intbrmation 
about a type of event may be expressed in the 
text;. Typically, there will be a few common 
tbrms of expression which will quickly come to 
nfind when a system is being developed. How- 
ever, the beauty of natural  (and the 
challenge tbr computational linguists) is that 
there are many variants which an imaginative 
writer cast use, and which the system needs to 
capture. Finding these variants may involve 
studying very large amounts of text; in the sub- 
ject domain. This has been a major impediment 
to the portability and performance of event ex- 
traction systems. 
We present; in this paper a new approach 
to finding these variants automatically fl'om a 
large corpus, without the need to read or amLo- 
tate the corpus. This approach has been evalu- 
ated on actual event extraction scenarios. 
In the next section we outline the strncture of 
our extraction system, and describe the discov- 
ery task in the context of this system. Sections 
2 and 3 describe our algorithm for pattern dis- 
covery; section 4 describes our experimental re- 
sults. This is tbllowed by comparison with prior 
work and discussion in section 5. 
1 The Extraction System 
In the simplest terms, an extraction system 
identifies patterns within the text, and then 
mat)s some constituents of these patterns into 
data base entries. (This very simple descrip- 
lion ignores the problems of anaphora and in- 
tersentential inference, which must be addressed 
by any general event extraction system.) AI- 
though these l)atterns could in principle be 
stated in terms of individual words, it is much 
940 
easier to state them in terms of larger SylltaC- 
tic constituents, such as noun phrases and verb 
groups. Consequently, extraction normally con- 
sists of an analysis of the l;e.xt in terms of general 
linguistic structures and dolnain-specifio con- 
structs, tbllowed by a search for the scenario- 
specific patterns. 
It is possible to build these constituent struc- 
tures through a flfll syntactic analysis of the 
text, and the discovery procedure we describe 
below woul(1 be applicable to such an architec- 
ture. Howe, ver, for re&sellS of slme,(t , coverage, 
and system rolmstness, the more (:ommon ap- 
t)roa(:h at present is to peribrni a t)artial syn- 
tactic analysis using a cascade of finite-state 
transducers. This is the at)t)roa(:h used by our 
e.xtraction system (Grishman, 1995; Yangarber 
and Grishman, 1998). 
At; the heart of our syslx'an is a regular ex- 
pression pattern matcher which is Cal)al)le of 
matching a set of regular exl)ressions against 
a partially-analyzed text and producing addi- 
tional annotations on the text. This core draws 
on a set of knowledge bases of w~rying degrees 
of domain- and task-specificity. The lexicon in- 
cludes both a general English dictionary and 
definitions of domain and scenario terms. The 
concept base arranges the domain terms into 
a semantic hierarchy. The predicate base. de- 
s('ribes the, logical structure of I;he events to be 
extracl;od. 'Fire pattern \])ase consists of sets of 
patterns (with associated actions), whi(;h make 
r(;ferollCO to information Kern the other knowl- 
e(lge bases. Some t)attorn sots, su(:h as those for 
n(mn and verb groups, are broadly apl)licable , 
wlfile other sets are spe(:ifio to the scenario. 
V~Ze, have previously (Yangarl)er and Grish- 
man, 1.997) (lescrit)ed a user interface which 
supt)orts the rapid cust;omization of the extrac- 
tion system to a new scenario. This interface 
allows the user to provide examples of role- 
wmt events, which are automatically converted 
into the appropriate patterns and generalized to 
cover syntactic variants (passive, relative clause, 
etc.). Through this internee, the user can also 
generalize l;he pattern semanti('ally (to (:over a 
broader class of words) and modify the concet)t 
base and lexicon as needed. Given an appro- 
priate set; of examples, thereibre, it; has become 
possible to adapt the extraction system quite 
ral)idly. 
However, the burden is still on the user to 
find the appropriate set of examples, which may 
require a painstaldng and expensive search of a 
large corpus. Reducing this cost is essential for 
enhanced system portability; this is the problem 
addressed by the current research. 
Ilow can we automatically discover a suitable 
set; of candidate patterns or examples (patterns 
which at least have a high likelihood of being 
relevant to the scenario)? The basic idea is to 
look for linguistic patterns which apt)ear with 
relatively high frequency in relevant documents. 
While there has been prior research oll idea|i- 
lying the primary lexical t)atterns of a sublan- 
guage or cortms (Orishman et al, 1986; Riloff, 
1996), the task here is more complex, since we 
are tyt)ically not provided in advance with a 
sub-corpus of relevmlt passages; these passages 
must themselves be tbund as part of t;t1(; discov- 
ery i)rocedure. The difficulty is that one of the 
l)est imlic~tions of the relevance of the passages 
is t)recisely the t)resence of these constructs. Bo- 
(:ause of this (:ircularity, we l)ropose to a(:quire. 
the constructs and t)assagos in tandem. 
2 ExDISCO: the Discovery Procedure 
We tirst outline ExDIsco, our procedure for 
discovery of oxl,raction patterns; details of some 
of the stops arc l)rcse, nted in the section which 
follows, and an earlier t)~q)er on our at)l)roach 
(Yang~u:bcr ot al., 2000). ExDIscO is mi ml- 
supervised 1)rocedure: the training (:ortms does 
not need to t)e amlotated with the specific event 
intbrmatkm to be. e.xtracted, or oven with infor- 
mation as to whi(;h documents in the ('orpus are 
relevant to the scenario. 'i7tlo only intbrmation 
the user must provide, as described below, is a 
small set of seed patterns regarding the s(:enario. 
Starting with this seed, the system automati- 
(:ally pertbnns a repeated, automatic expansion 
of the pattern set. This is analogous to the pro- 
cess of automatic t;enn expansion used in s()me 
information retrieval systems, where, the terlns 
Dora the most relewmt doculncnts are added 
to the user query and then a new retriewfl is 
imrformed. However, by expanding in terms of 
1)atl;erns rather than individual terms, a more 
precise expansion is possit)le. This process pro- 
coeds as tbllows: 
0. We stm:t with a large, corlms of documents 
in the domain (which have not been anne- 
941 
tared or classified in any way) and an initial 
"seed" of scenario patterns selected by the 
user -- a small set of patterns whose pres- 
ence reliably indicates thai; the document 
is relevant to the scenario. 
. The pattern set is used to divide the cor- 
tins U into a set of relewmt documents, R 
(which contain at; least one instance of one 
of the patterns), and a set of non-relevant 
documents R = U - R. 
2. Search tbr new candidate patterns: 
• automatically convert each document 
in the eorIms into a set of candidate 
patterns, one for each clause 
• rank patterns by the degree to which 
their distribution is correlated with 
docmnent relevance (i.e., appears with 
higher frequency in relevant docu- 
ments than in non-relewmt ones). 
3. Add the highest ranking pattern to the pat- 
tern set. (Optionally, at this point, we may 
present the pattern to the user for review.) 
4. Use the new pattern set; to induce a new 
split of the corpus into relevant and non- 
relevant documents. More precisely, docu- 
ments will now be given a relevance confi- 
dence measure; documents containing one 
of the initial seed patterns will be given 
a score of 1, while documents which arc 
added to the relevant cortms through newly 
discovered patterns will be given a lower 
score. I/,epeat the procedure (from step 1) 
until some iteration limit is reached, or no 
more patterns can be added. 
3 Methodology 
3.1 Pre-processing: Syntactic Analysis 
Before at)plying ExDIsco, we pre-proeessed 
the cortms using a general-purpose dependency 
parser of English. The parser is based on 
the FDG tbrmalism (Tapanainen and Jgrvi- 
hen, 1997) and developed by the Research Unit 
for Multilingual Language Technology at the 
University of Helsinki, and Conexor Oy. The 
parser is used ibr reducing each clause or noun 
phrase to a tuple, consisting of the central ar- 
guments, ms described in detail in (Yangarber 
et al., 2000). We used a corlms of 9,224 articles 
from the Wall Street; Journal. The parsed arti- 
cles yielded a total of 440,000 clausal tuples, of 
which 215,000 were distinct. 
3.2 Normalization 
We applied a name recognition module prior to 
parsing, and replaced each name with a token 
describing its (:lass, e.g. C-Person, C-Company, 
etc. We collapsed together all numeric expres- 
sions, currency wflues, dates, etc., using a single 
token to designate each of these classes. Lastly, 
the parser performed syntactic normalization to 
transtbrm such variants ms the various passive 
and relative clauses into a common tbrm. 
3.3 Generalization and Concept Classes 
Because tuples may not repeat with sufficient 
frequency to obtain reliable statistics, each tu- 
ple is reduced to a set of pints: e.g., a verb- 
object pair, a subject-object pair, etc. Each pair 
is used as a generalized pattern during the can- 
didate selection stage. Once we have identitied 
pairs which are relevant to the scenario, we use 
them to gather the set; of words for the miss- 
ing role(s) (tbr example, a class of verbs which 
occur with a relevant subject-ot@ct pair: "com- 
pany {hire/fire/expel...} person"). 
3.4 Pattern Discovery 
We (-onducte(1 exi)eriments in several scenarios 
within news domains such as changes in cor- 
porate ownership, and natural disasters. Itere 
we present results on the "Man~geme.nt Suc- 
cession" and "Mergers/Acquisitions" scenarios. 
ExDIsco was seeded with lninimal pattern sets, 
namely: 
Subject Verb Direct Object 
C-Company C-At)point C-Person 
C-Person C-Resign 
ibr the Mmmgement task, and 
Subject Verb Direct Object 
* C-Buy C-Conlt)any 
C-Company merge * 
for Acquisitions. Here C-Company and C- 
Person denote semantic classes containing 
named entities of the corresponding types. C- 
Appoint denotes the list of verbs { appoint, elect, 
promote, name, nominate}, C-Resign = { re- 
sign, depart, quit }, and C-Buy = { buy , pur- 
chase }. 
942 
\])uring ~ single iter~tion, we conqmt(; the 
score, See're(p), for each cm~(lidate 1)attern p, 
using (;he fornmla~: 
S,:o',',@) = IH n l~l 
IHI - 1,,~ IHn ~.1 (:t) 
where 12. (Icnotes (;h(', l'clewmt subsc(; of docu- 
ments, mid It= It(p) the, (locmnents imttching 
p, as above; the Iirst (;erm a(:(:ounts for the con- 
(lition~fl t)robability of relev;m('e oil p~ and |;11(; 
second tbr its support. We further impose two 
support criteria: we distrust such frequent pat- 
(;,~.,-.~ w\]le,:e I1~ n UI > ,~IUI, ,~ uninforn,,~tive, 
mid rare patte.rns \['or which I1\] r-I \]~.1 < fl as 
noise. 2 At the end ot' (.aeh il;eratiol~, the sysl;em 
selects the pal;tern with the highest Sco'/'d(p)~ 
and adds it (;o (;lie seed scl;. The (to(:un~enl;s 
which t;he winning t)~(;t;ern hits are added (;o 
t;111(; relevant set. The t)al;l;(;rn s(;areh is then 
r(;sl;m:l;(;d. 
3.5 Document Re-ranking 
Th(: above is a simt)lifi(';~l;ion of (;he a,(:tual pro- 
cedlll'(}~ in severa\] r(',st)e('(;s. 
Only generalized t)ntl;erns are (:onsidered fi)r 
(:audi(t~my, with one or mot(', slol;s fill(:(1 wi(;h 
wihl-cm'ds. In computing the score of th(', ge, n- 
(;raliz(:d \]);tttern, w(: do not take into ('onsi(h:r- 
;i,1;i()11 all possible va,hw, s of the, wil(1-('m:d role. 
\¥e instea.d (:()llS(;raJll (;he wild-(:ar(l to thos(~ wd- 
u(:s wlli(:h l;ht',llls(;lv(;s ill (;llrH \]l;tV(: high scores. 
Th(:se v~du(:s l;lw, n |)e(:on~e lllClll\])(;l'S of }/. II(:W 
(:lass, whi(:h is l)rOdu(:ed in (;:tlldClll with the 
wimfing 1)att(:rn. 
\])o('umel~tS reh:wm('e is s(-ored (m ~ s(;ah: l)e- 
(;ween 0 and 1. Tlm seed t)atterns a.re a.(:cet)ted 
~,s trut\]~; the do('mlw, nts (;hey mat(:\]1 hnve rcle- 
vmme 1. On i(;er;~tion i + 1, e~mh t)a(;tern p is 
assigned a precision measure, t)ase(l on the rel- 
(':Vall(;e of |;11(; (locllnlelfl;s i|; 111a, l;(;ll(',,q: 
~ ".d~(d) (~) 
ff,,:d +~ (v) -- IH(v) l ,~.(,,) 
where l~,eli(d) is the re, levmlce of' 1;11(: doeunmn(; 
fi'om t;t1(', previous iteration, ~md lI(p) is the set 
of documents where p matched, in general, if K 
is a classifier (:onsisting of ~ set of l)al;terns, w(', 
define H (K) as the st:l; of documents where all 
~similar to that used in (liiloff, 1996) 
~W(: used ,:-- 0.1 and fl = 2. 
of t)~d;terns p C K m~l;(:h, mid the "cunmlative" 
precision of K as 
1 ~ 1~4~(a,) (3) P~.~d +~(1() = IH U()I <.(K) 
Once the wimfing pa,l;l;ern is accepted, the rel- 
ewmee of the documents is re-adjusted. For 
(;~mh document d which is matched by some 
subset of l;he currently accet)t('d pntterns, we 
can view thai; sul)s(',t; of l)~tterns as ~ classitier 
Kd = {pj}. These patterns (tel;ermilm the new 
reh;wmce score of the document as 
J~, "~l,~ "(,0 : 111~x (:tc,,.1,*(,O,v,.,;, .~" (K,)) (~:) 
This ensures tha.(; l;he rclewmce score grows 
monotonically, and only when there is sufliei(mt 
positive evidence, as (;he i)ntterns in etl'e(:I; vote 
"conjmmtively" on the (loculncnl;s. 
We also tried an alternative, ::disjun(:tive" 
voting scheme, with weights wlfich accounts tbr 
vm:intion in support of the p~ttterns, 
J,.,.1, (d) .... ~ "~ II (1 - ~',.~,.c~(p))"",' (5) 
~c K(d) 
where t;11(', weights ,wp arc (tetint;d using the tel- 
ewm(:(: of the (loeuments, a,s the total SUl)l)or(; 
which the pa, I;I;ern p receives: 
% = log ~ l;.d,(d) 
dE 11 (p) 
and ;,7 is (;11(' largest weight. The r(',cursive for- 
nmb~s ('apl;m:e (;he mul;u~fl dependency of t)~t- 
terns ~md documents; this re-computation ~md 
growing of precision and relevmlce rmlks is the 
core of the t)rocedure. :~ 
4 Results 
4.1 Event Extraction 
'l'he, most nal;m'a.l measm'e of efl'ecl;iveness of our 
discovery procedure is the performmme of ml ex- 
traction systmn using the, discovered t)~tterns. 
However, il; is not 1)ossil)le to apply this reel;- 
rio direei;ly because the discovered t)al;terns lack 
some of the information required tbr entries ill 
:{\V('. did not el)serve a significam; difl'erencc in 1)crfi)r- 
lIiHl\[CO, bet, ween the two tormulas 4 alt(t 5 in o111" experi- 
in(mrs; the results whit:h tbllow use 5. 
943 
the pattern base: information about the event 
type (predicate) associated with the pattern, 
and the mapping from pattern elements to pred- 
icate arguments. We have evaluated ExDIsco 
by manually incorporating the discovered pat- 
terns into the Proteus knowledge bases and run- 
ning a full MUC-style evaluation. 
We started with our extraction system, Pro- 
tens, which was used in MUC-6 in 1995, and 
has undergone continual improvements since 
the MUC evaluation. We removed all the 
scenario-specific clause and nominalization pat- 
terns. 4 We then reviewed all the patterns which 
were generated by the ExDIsco, deleting those 
which were not relewmt to the task, or which 
did not correspond directly to a predicate al- 
ready implemented tbr this task) The remain- 
ing pat;terns were augmented with intbnnation 
about the corresponding predicate, and the re- 
lation between the pattern and the predicate 
al'guments, a The resulting variants of Proteus 
were applied to the formal training corpus and 
the (hidden) formal test corpus for MUC-6, and 
the output evaluated with the MUC scorer. 
The results on the training corpus are: 
Pattern Base Recall Precision 
Seed 38 83 
ExI)Isco 62 80 
Union 69 __79 
Manual-MUC ~ 71 L~1.9~ 
Manual-NOW 6(3~ 79 L7!~z\[)_t_j 
and on the test cortms: 
4There are also a few noun phrase patterns which can 
give rise to scenario events. For example, "Mr Smith, 
former president of IBM", may produce an event record 
where l%ed Smith left IBM. These patterns were left in 
Proteus for all the runs, and they make some contribu- 
tion to the relatively high baseline scores obtained using 
just the seed event patterns. 
~ExD~sco found patterns which were relevant to the 
task lint could not be easily aceomodated in Proteus. 
For instance "X remained as president" could be rele- 
vant, particularly in the case of a merger creating a new 
corporate entity, but Proteus was not equipped to trun- 
dle such iIfformation, and has not yet been extended to 
incorporate such patterns. 
6As with all clause-level patterns in Proteus, these 
patterns m-e automatically generalized to handle syntac- 
tic wn'iants such as passive, relative clause, etc. 
Pattern Base Recall Precision F 
Seed 27 74 39.58 
ExDIsco 52 72 60.16 
Union 57 73 63.56 
Manual-NOW -- 56 75 6404. 
The tables show the recall and precision mea- 
sures for the patterns, with F-measure being 
the harmonic mean of the two. The Seed pat- 
tern base consists of just the initial pattern set, 
given in the table on the previous page. ~ib this 
we added the patterns which the system discov- 
ered automatically after about 100 iterations, 
producing the pattern set called ExDIsco. For 
comparison, M anual-MUC is the pattern base 
lnanually develot)ed on the MUC-6 training 
corpus-1)repared over the course of 1 month 
of full-time work by at least one computational 
linguist (during which the 100-document train- 
ing corpus was studied in detail). The last row, 
Manual-now, shows the current pertbrmance of 
the Proteus system. The base called Ultiolt con- 
tains the union of ExDIScO and Manual-No'w. 
We find these results very encouraging: Pro- 
teus performs better with the patterns discov- 
ered by ExI)IscO than it did after one month 
of manual tinting and development; in fact, this 
perfi)rmance is close to current levels, which 
are the result of substantial additional devel- 
opmeut. These results umst be interpreted, 
however, with several caveats. First, Proteus 
performance depends on many fimtors besides 
the event patterns, such as the quality of name 
re, cognition, syntactic mmlysis, anaphora reso~ 
lution, inferencing, etc. Several of these were 
improved since the MUC formal evaluation, so 
some of the gain over the MUC formal evalua- 
tion score is attritmtable to these factors. How~ 
ever, all of the other scores are comparable in 
these regards. Second, as we noted above, the 
patterns were reviewed and augmented manu- 
ally, so the overall procedure is not entirely au- 
tomatic. However, the review and augmenta- 
tion process took little time, as compared to 
the manual corpus analysis and development of 
the pattern base. 
4.2 Text filtering 
We can obtain a second measure of pertbr- 
mance by noting that, in addition to growing 
the tmttern set, ExDIsco also grows the rele- 
944 
0.9 
0.8 
0.7 
0.6 
0.5 
_ . r-~H ................. r ....... T~~::~ T 
;!\ >. g~t : - ' il 
%i 
\[!\] 
7 
\[!J 
Legend: 
Management/Test • .-{~ ...... 
ManagemenVl-raie - :*: -- 
MUC-6 • 
0.2 0.4 0.6 0.8 
Recall 
Figure l: Management Suc('cssion 
0.9 
0.8 
0.7 
0.6 
0.5 
L_~/r 
Legend: 
Acquisition 
0.2 0.4 0.6 
Recall 
0.8 
Figme 2: Mergers/A(:quisitions 
vance rankings of documents. The latter cnn be 
evahlated directly, wil;hollt human intervention. 
We tested Exl)IsC, o ~tgainst two cor\])orn: th(; 
100 documents from MUC-6 tbrmal training, 
a:nd the 100 documents from the MUC-6 for- 
mal test (both are contained anlong the 10,000 
ExDIsoO training set) r. Figure 1 shows recall 
t)\]otted against precision on the two corpora, 
over 100 iterations, starting with the seed pat- 
te, nls in section 3.d. This view on the discovery 
procedure is closely related to the MUC %ext- 
till;ering" task, in which the systems are jlulged 
at the \]evel of doc,wm, e,'nt.s rather thmt event slots. 
It; is interesting to (:omt)m:e Exl)IsCO's results 
with how other MUC-6 part\]tit)ants performed 
on the MUC-b' test cortms , shown anonymously. 
ExDIscO attains values within the range of 
the MUC participald;S, all of which were either 
heavily-supervised or m~mually coded systems. 
II; is important to bear in mind that ExI)Isco 
had no benefit of training material, or any in- 
tbrmation beyond the seed pattern set. 
Figure 2 shows the 1)ertbrmance, of text fil- 
tering on the Acquisition task, again, given the 
seed in section 3.4. ExDisco was trained on 
|;lie same WSJ eorlms, and tested against a set 
of 200 documents. We retrieved this set using 
keyword-based IR, search, and judged their rel- 
evance by halId. 
rThesc judgements constituted the truth which was 
used only for evaluation, not visible to ExDISCO 
5 Discussion 
The development of a w~riety of information 
extra(:tion systems over the last decade has 
demonstrated their feasibility but also the lim- 
itations on their portability and t)erformance. 
Prcl)aring good t)atterns tbr these syste, ms re- 
quires (:onsiderable skill, and achieving good 
(:overage requires |;lie analysis of a large amount 
of text. These t)rol)lems h~ve t)een impedinmnts 
to the -wide\].' use of extraction systenls. 
These dit\[iculties have stimulate.d resear('h on 
1)attel.'n a(:(luisition. Solne of this work has enl- 
i)hasized il\]teractive tools to (:onvert examples 
to extractioi~ t)atterlls (Yangarber and Grish- 
man, 1997); nmch ot:' the re, search has focused on 
methods for automatically converting a cortms 
annotated with extraction examples into pat- 
terns (Lehnert et al., 1992; Fisher et al., 1995; 
Miller el; al., 1998). These techniques may re- 
duce the level of systeln expertise required to 
develop a new extraction N)plieation, but they 
do not lessen the lmrden of studying a large cor- 
lms in order to .find relevant candidates. 
The prior work most closely related to our 
own is that of (R.ilotf, 1996), who also seeks to 
lmild pattenls automatically without the need 
to annotate a corpus with the information to 
be extracted. Itowever, her work ditfers t'rom 
01217 own in several ilnportant respects. First, 
her patterns identit~y phrases that fill individual 
slots in the template, without specifying how 
these slots may be combined at a later stage 
into complete templates. In contrast, our pro- 
cedure discovers complete, multi-slot event pat- 
945 
terns. Second, her procedure relies on a cort)us 
in which |;tie documents have been classified for 
relevance by hand (it was applied to the MUC-3 
task, tbr which over 1500 classified documents 
are available), whereas ExDIsco requires no 
manual relevance judgements. While classify- 
ing documents tbr relevance is much easier than 
annotating docunlents with the information to 
be extracted, it; is still a significant task, and 
places a limit on |:tie size of the training corpus 
that can be effectively used. 
Our research has demonstrated that for the 
studied scenarios automatic pattern discovery 
Call yield extraction perfi)rmance colnt)arabh~ to 
that obtained through extensive corpus anal- 
ysis. There are many directions in which the 
work reported here needs to be extended: 
• nsing larger training corpora, in order to 
find less frequent exanlplcs, and in that way 
hopefully exceeding the i)erfornlancc of our 
best hand-trained system 
• cat)luring the word classes which are gen- 
erated as a by-product of our pattern dis- 
covery 1)rocedure (in a manner similar to 
(Riloff and ,Jones, 1999)) and using them 
to discover less frequent t)atterns in subse- 
quent iterations 
- evaluating the effectiveness of the discov- 
cry procedure on other scenarios. In par- 
titular, we need to be able to identi\[y top- 
its which cast be most effbctively charac- 
terized by clause-level patterns (as was the 
case tbr the business domain), and topics 
which can be better characterized by other 
means. We. wouM also like to understand 
how the topic clusters (of documents and 
patterns) which are developed by our pro- 
cedure line up with pre-specified scenarios. 

References 

David Fisher, Stephen Soderland, Joseph Mc- 
Carthy, Fangfang Feng, and Wendy Lelmert. 
1995. Description of the UMass system as 
used fbr MUC-6. In Prec. Sixth Message Un- 
dcrstandin9 Conf. (MUC-6), Columbia, MD, 
November. Morgan Kauflnann. 

R.alph Grishman. 1995. The NYU systenl tbr 
MUC-6, or where's the syntax? Ill Prec. 
Sixth Message Understanding Conf. (MUC- 
6), pages 167 176, Columl)ia, MD, Novem- 
ber. Morgan Kauflnann. 

W. Lehnert, C. Cardie, D. Fisher, J. McCarthy, 
E. Riloff, and S. Soderland. 1992. Univer- 
sity of nlassachusetts: MUC-4 test results 
and analysis. Ill P,'oe. Fourth Message Un- 
der.standing Con.\[., McLean, VA, June. Mor- 
gan Kauflnaml. 

Scott Miller, Michael Crystal, Heidi Fox, 
Lance II,amshaw, R,ichard Schwartz, Rebecca 
Stone, Rall)h Weischedel, and the Annota- 
tion Group. 1998. Algorithms that learn to 
extract intbrmation; BBN: Description of the 
SIFT systenl as used for MUC-7. In PTve. 7th 
Mc.ssagc Understanding Co~:f., FMrfax, VA. 

1993. Proceedings of the F'~ifth Message UTz.- 
derstanding Confer(race (MUC-5), Baltimore, 
MD, August. Morgan Kauflnann. 

1995. PTveeedings of the Sixth Message U~I,- 
derstav, ding Conference (MUC-6), Colmnt)ia, 
MD, November. Morgan Kauflnaml. 

Ellen Rilotf and Rosie Jones. 1999. Learn- 
ing dictionaries for infbrmation extraction by 
multi-level bootstrat)ping. In Prec. 16th Nat'l 
Cord'erenee on Art'~i\[icial Intelli9enee (AAA I 
99), Orlando, Florida. 

Ellen Riloff. 1996. Automatically generating 
extraction patterns from m~tagged text. In 
Prec. I3th Nat'l Co~~:f. on Art~ificial Intel- 
ligence (AAAI-96). The AAAI Press/MIT 
Press. 

l?asi '\])~panainen and Time .J/h:vinen. 1997. A 
non-t)rojectivc dependency parser. In P'mc. 
5th Conf. on Applied Nat'aral Language P~v- 
cessiu9, pages 64-71, Washington, D.C. ACL. 

Roman Yangarber and RalI)h Grishman. 1997. 
Customization of intbrmation extraction sys- 
tems. In Paola Velardi, editor, I~tt'l Work- 
shop on Lexically Driven I~7:forrnation Extrac- 
tion, Frascati, Italy. Universith di Roma. 

Roman Yangarl)er and Ralph Grishman. 1998. 
NYU: Description of thc Protens/PET sys- 
tem as used tbr MUC-7 ST. In 7th Message 
Understanding Conference, Columbia, MD. 

Roman Yangarl)er, Ralph Grishman, Past 
Tapanainen, and Silja Huttunen. 2000. Un- 
supervised discovery of scenario-level pat- 
terns tbr information extraction. Ill PTve. 
Co~@ on Applied Nat'aral Lang'aage Pr'ocess- 
tug (ANLP-NAACL), Seattle, WA. 
