<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2136"> <Title>Automatic Acquisition of Domain Knowledge for Information Extraction</Title> <Section position="3" start_page="0" end_page="940" type="metho"> <SectionTitle> 1 The Extraction System </SectionTitle> <Paragraph position="0"> In the simplest terms, an extraction system identifies patterns within the text, and then maps some constituents of these patterns into database entries. (This very simple description ignores the problems of anaphora and intersentential inference, which must be addressed by any general event extraction system.) Although these patterns could in principle be stated in terms of individual words, it is much easier to state them in terms of larger syntactic constituents, such as noun phrases and verb groups. Consequently, extraction normally consists of an analysis of the text in terms of general linguistic structures and domain-specific constructs, followed by a search for the scenario-specific patterns.</Paragraph> <Paragraph position="1"> It is possible to build these constituent structures through a full syntactic analysis of the text, and the discovery procedure we describe below would be applicable to such an architecture. However, for reasons of speed, coverage, and system robustness, the more common approach at present is to perform a partial syntactic analysis using a cascade of finite-state transducers. This is the approach used by our extraction system (Grishman, 1995; Yangarber and Grishman, 1998).</Paragraph> <Paragraph position="2"> At the heart of our system is a regular expression pattern matcher which is capable of matching a set of regular expressions against a partially-analyzed text and producing additional annotations on the text. This core draws on a set of knowledge bases of varying degrees of domain- and task-specificity. The lexicon includes both a general English dictionary and definitions of domain and scenario terms.
The concept base arranges the domain terms into a semantic hierarchy. The predicate base describes the logical structure of the events to be extracted. The pattern base consists of sets of patterns (with associated actions), which make reference to information from the other knowledge bases. Some pattern sets, such as those for noun and verb groups, are broadly applicable, while other sets are specific to the scenario.</Paragraph> <Paragraph position="3"> We have previously (Yangarber and Grishman, 1997) described a user interface which supports the rapid customization of the extraction system to a new scenario. This interface allows the user to provide examples of relevant events, which are automatically converted into the appropriate patterns and generalized to cover syntactic variants (passive, relative clause, etc.). Through this interface, the user can also generalize the pattern semantically (to cover a broader class of words) and modify the concept base and lexicon as needed. Given an appropriate set of examples, therefore, it has become possible to adapt the extraction system quite rapidly.</Paragraph> <Paragraph position="4"> However, the burden is still on the user to find the appropriate set of examples, which may require a painstaking and expensive search of a large corpus. Reducing this cost is essential for enhanced system portability; this is the problem addressed by the current research.</Paragraph> <Paragraph position="5"> How can we automatically discover a suitable set of candidate patterns or examples (patterns which at least have a high likelihood of being relevant to the scenario)?
The basic idea is to look for linguistic patterns which appear with relatively high frequency in relevant documents.</Paragraph> <Paragraph position="6"> While there has been prior research on identifying the primary lexical patterns of a sublanguage or corpus (Grishman et al., 1986; Riloff, 1996), the task here is more complex, since we are typically not provided in advance with a sub-corpus of relevant passages; these passages must themselves be found as part of the discovery procedure. The difficulty is that one of the best indications of the relevance of the passages is precisely the presence of these constructs. Because of this circularity, we propose to acquire the constructs and passages in tandem.</Paragraph> </Section> <Section position="4" start_page="940" end_page="941" type="metho"> <SectionTitle> 2 ExDISCO: the Discovery Procedure </SectionTitle> <Paragraph position="0"> We first outline ExDISCO, our procedure for discovery of extraction patterns; details of some of the steps are presented in the section which follows, and in an earlier paper on our approach (Yangarber et al., 2000). ExDISCO is an unsupervised procedure: the training corpus does not need to be annotated with the specific event information to be extracted, or even with information as to which documents in the corpus are relevant to the scenario. The only information the user must provide, as described below, is a small set of seed patterns regarding the scenario.</Paragraph> <Paragraph position="1"> Starting with this seed, the system automatically performs a repeated, automatic expansion of the pattern set. This is analogous to the process of automatic term expansion used in some information retrieval systems, where the terms from the most relevant documents are added to the user query and then a new retrieval is performed. However, by expanding in terms of patterns rather than individual terms, a more precise expansion is possible.
This process proceeds as follows: 0. We start with a large corpus of documents in the domain (which have not been annotated or classified in any way) and an initial &quot;seed&quot; of scenario patterns selected by the user -- a small set of patterns whose presence reliably indicates that the document is relevant to the scenario.</Paragraph> <Paragraph position="2"> 1. The pattern set is used to divide the corpus U into a set of relevant documents, R (which contain at least one instance of one of the patterns), and a set of non-relevant documents, R̄ = U - R.</Paragraph> <Paragraph position="3"> 2. Search for new candidate patterns: * automatically convert each document in the corpus into a set of candidate patterns, one for each clause * rank patterns by the degree to which their distribution is correlated with document relevance (i.e., appears with higher frequency in relevant documents than in non-relevant ones).</Paragraph> <Paragraph position="4"> 3. Add the highest ranking pattern to the pattern set. (Optionally, at this point, we may present the pattern to the user for review.) 4. Use the new pattern set to induce a new split of the corpus into relevant and non-relevant documents. More precisely, documents will now be given a relevance confidence measure; documents containing one of the initial seed patterns will be given a score of 1, while documents which are added to the relevant corpus through newly discovered patterns will be given a lower score.
5. Repeat the procedure (from step 1) until some iteration limit is reached, or no more patterns can be added.</Paragraph> </Section> <Section position="5" start_page="941" end_page="942" type="metho"> <SectionTitle> 3 Methodology </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="941" end_page="941" type="sub_section"> <SectionTitle> 3.1 Pre-processing: Syntactic Analysis </SectionTitle> <Paragraph position="0"> Before applying ExDISCO, we pre-processed the corpus using a general-purpose dependency parser of English. The parser is based on the FDG formalism (Tapanainen and Järvinen, 1997) and developed by the Research Unit for Multilingual Language Technology at the University of Helsinki, and Conexor Oy. The parser is used for reducing each clause or noun phrase to a tuple, consisting of the central arguments, as described in detail in (Yangarber et al., 2000). We used a corpus of 9,224 articles from the Wall Street Journal. The parsed articles yielded a total of 440,000 clausal tuples, of which 215,000 were distinct.</Paragraph> </Section> <Section position="2" start_page="941" end_page="941" type="sub_section"> <SectionTitle> 3.2 Normalization </SectionTitle> <Paragraph position="0"> We applied a name recognition module prior to parsing, and replaced each name with a token describing its class, e.g. C-Person, C-Company, etc. We collapsed together all numeric expressions, currency values, dates, etc., using a single token to designate each of these classes.
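This class-token normalization can be sketched roughly as follows. The tiny gazetteer and the regular expressions here are invented placeholders standing in for the actual name recognition module, which the paper does not specify:

```python
import re

# Illustrative sketch only: recognized entity mentions and numeric/currency
# expressions are collapsed into class tokens (C-Person, C-Company, ...)
# before parsing. The gazetteer entries below are hypothetical examples.
GAZETTEER = {
    "John Smith": "C-Person",
    "Acme Corp": "C-Company",
}

def normalize(sentence: str) -> str:
    # Replace known names with their class tokens.
    for name, cls in GAZETTEER.items():
        sentence = sentence.replace(name, cls)
    # Collapse currency values first, then bare numeric expressions.
    sentence = re.sub(r"\$\s?\d[\d,.]*(?:\s(?:million|billion))?", "C-Currency", sentence)
    sentence = re.sub(r"\b\d[\d,.]*\b", "C-Number", sentence)
    return sentence

print(normalize("Acme Corp paid $2.5 million and hired John Smith in 1998."))
# → C-Company paid C-Currency and hired C-Person in C-Number.
```

The point of the collapsing is purely statistical: distinct surface names would fragment the tuple counts that the discovery procedure relies on.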
Lastly, the parser performed syntactic normalization to transform such variants as the various passive and relative clauses into a common form.</Paragraph> </Section> <Section position="3" start_page="941" end_page="941" type="sub_section"> <SectionTitle> 3.3 Generalization and Concept Classes </SectionTitle> <Paragraph position="0"> Because tuples may not repeat with sufficient frequency to obtain reliable statistics, each tuple is reduced to a set of pairs: e.g., a verb-object pair, a subject-object pair, etc. Each pair is used as a generalized pattern during the candidate selection stage. Once we have identified pairs which are relevant to the scenario, we use them to gather the set of words for the missing role(s) (for example, a class of verbs which occur with a relevant subject-object pair: &quot;company {hire/fire/expel...} person&quot;).</Paragraph> </Section> <Section position="4" start_page="941" end_page="942" type="sub_section"> <SectionTitle> 3.4 Pattern Discovery </SectionTitle> <Paragraph position="0"> We conducted experiments in several scenarios within news domains such as changes in corporate ownership, and natural disasters. Here we present results on the &quot;Management Succession&quot; and &quot;Mergers/Acquisitions&quot; scenarios. [Figure: seed patterns for the Succession and Acquisitions scenarios.] Here C-Company and C-Person denote semantic classes containing named entities of the corresponding types. C-Appoint denotes the list of verbs { appoint, elect, promote, name, nominate }, C-Resign = { resign, depart, quit }, and C-Buy = { buy, purchase }.</Paragraph> <Paragraph position="1"> During a single iteration, we compute the score, Score(p), for each candidate pattern p, using the formula: Score(p) = |H ∩ R| / |H| · log |H ∩ R|</Paragraph> <Paragraph position="3"> where R denotes the relevant subset of documents, and H = H(p) the documents matching p, as above; the first term accounts for the conditional probability of relevance on p, and the second for its support.
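A minimal sketch of this scoring formula, representing each document set simply as a set of document ids (a representation assumed here for illustration, not taken from the paper):

```python
from math import log

def score(matched: set, relevant: set) -> float:
    """Score(p) = |H ∩ R| / |H| * log|H ∩ R|, where `matched` is H(p),
    the documents pattern p matches, and `relevant` is R."""
    overlap = len(matched & relevant)
    if overlap == 0:
        return 0.0  # no relevant hits, no evidence for the pattern
    return overlap / len(matched) * log(overlap)

# A pattern matching 8 documents, 6 of which are currently relevant:
H = set(range(8))
R = set(range(6)) | {100, 101}
print(round(score(H, R), 3))
# → 1.344
```

The log factor rewards patterns with broad support: a pattern that is 75% precise over 6 relevant hits outranks one that is 100% precise over a single hit.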
We further impose two support criteria: we distrust frequent patterns, for which |H ∩ U| > α|U|, as uninformative, and rare patterns, for which |H ∩ R| < β, as noise. At the end of each iteration, the system selects the pattern with the highest Score(p), and adds it to the seed set. The documents which the winning pattern hits are added to the relevant set. The pattern search is then restarted.</Paragraph> </Section> <Section position="5" start_page="942" end_page="942" type="sub_section"> <SectionTitle> 3.5 Document Re-ranking </SectionTitle> <Paragraph position="0"> The above is a simplification of the actual procedure, in several respects.</Paragraph> <Paragraph position="1"> Only generalized patterns are considered for candidacy, with one or more slots filled with wild-cards. In computing the score of the generalized pattern, we do not take into consideration all possible values of the wild-card role. We instead constrain the wild-card to those values which themselves in turn have high scores. These values then become members of a new class, which is produced in tandem with the winning pattern.</Paragraph> <Paragraph position="2"> Document relevance is scored on a scale between 0 and 1. The seed patterns are accepted as truth; the documents they match have relevance 1.
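The per-iteration selection just described (filter out over-frequent and rare candidates, accept the highest-scoring pattern, and grow the relevant set with its matches) can be sketched as follows. The pattern names and document ids are hypothetical, and the α and β values are those the paper reports using:

```python
from math import log

ALPHA, BETA = 0.1, 2  # thresholds reported in the paper

def select_winner(candidates: dict, relevant: set, corpus_size: int):
    """candidates maps pattern name -> set of matching document ids H(p).
    Returns the winning pattern (or None) and the expanded relevant set."""
    best, best_score = None, 0.0
    for pattern, matched in candidates.items():
        overlap = len(matched & relevant)
        # Support criteria: discard over-frequent and rare patterns.
        if len(matched) > ALPHA * corpus_size or overlap < BETA:
            continue
        s = overlap / len(matched) * log(overlap)  # Score(p)
        if s > best_score:
            best, best_score = pattern, s
    if best is not None:
        relevant = relevant | candidates[best]  # winner's hits become relevant
    return best, relevant

cands = {
    "company-appoint-person": {1, 2, 3, 4},
    "company-say": set(range(60)),  # too frequent in a 100-document corpus
    "person-resign": {2, 5, 6},
}
winner, new_rel = select_winner(cands, relevant={1, 2, 3, 5}, corpus_size=100)
print(winner, sorted(new_rel))
```

In the full procedure the newly relevant documents then feed the next scoring round, which is what makes the acquisition of patterns and passages proceed in tandem.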
On iteration i + 1, each pattern p is assigned a precision measure, based on the relevance of the documents it matches: Prec_{i+1}(p) = (1 / |H(p)|) · Σ_{d ∈ H(p)} Rel_i(d)</Paragraph> <Paragraph position="4"> where Rel_i(d) is the relevance of the document from the previous iteration, and H(p) is the set of documents where p matched. In general, if K is a classifier consisting of a set of patterns, we define H(K) as the set of documents where all of the patterns p ∈ K match, and the &quot;cumulative&quot; precision of K as Prec_{i+1}(K) = (1 / |H(K)|) · Σ_{d ∈ H(K)} Rel_i(d). (Footnote 1: similar to that used in (Riloff, 1996). Footnote 2: We used α = 0.1 and β = 2.)</Paragraph> <Paragraph position="7"> Once the winning pattern is accepted, the relevance of the documents is re-adjusted. For each document d which is matched by some subset of the currently accepted patterns, we can view that subset of patterns as a classifier K_d = {p_j}. These patterns determine the new relevance score of the document as Rel_{i+1}(d) = max(Rel_i(d), Prec_{i+1}(K_d)). This ensures that the relevance score grows monotonically, and only when there is sufficient positive evidence, as the patterns in effect vote &quot;conjunctively&quot; on the documents.</Paragraph> <Paragraph position="8"> We also tried an alternative, &quot;disjunctive&quot; voting scheme, with weights which account for variation in support of the patterns: Rel_{i+1}(d) = 1 - Π_{p ∈ K_d} (1 - Prec_{i+1}(p))^{w_p / W}</Paragraph> <Paragraph position="10"> where the weights w_p are defined using the relevance of the documents, as the total support which the pattern p receives: w_p = Σ_{d ∈ H(p)} Rel_i(d)</Paragraph> <Paragraph position="12"> and W is the largest weight. The recursive formulas capture the mutual dependency of patterns and documents; this re-computation and growing of precision and relevance ranks is the core of the procedure.</Paragraph> </Section> </Section> </Paper>