<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2173">
  <Title>PATTERN MATCHING IN THE TEXTRACT INFORMATION EXTRACTION SYSTEM</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
NTT Data Communications Systems Corp.
</SectionTitle>
    <Paragraph position="0"> est by locating specific expressions defined as key words and phrasal patterns obtained by corpus analysis.</Paragraph>
    <Paragraph position="1"> This paper describes a pattern matching method that first identifies concepts in a sentence and then links critical pieces of information that map to a pattern. The first step in pattern matching is a concept search applied in the TEXTRACT system of the TIPSTER Japanese microelectronics and corporate joint ventures domains [Jacobs 93a], [Jacobs 93b].</Paragraph>
    <Paragraph position="2"> In this step, key words representing a concept are searched for within a sentence. The second step is a template pattern search applied in the TEXTRACT joint ventures system. A complex pattern to be searched for usually consists of a few words and phrases, instead of just one word, as in the concept search. The template pattern search recognizes relationships between matched objects in the defined pattern as well as recognizing the concept itself. From the viewpoints of system performance and portability across domains, the TIPSTER/MUC-5 evaluation results suggest that the pattern matching described in this paper is an appropriate architecture for information extraction from Japanese texts.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="1064" type="metho">
    <SectionTitle>
2 TIPSTER/MUC-5 OVERVIEW
</SectionTitle>
    <Paragraph position="0"> The goal of the TIPSTER/MUC-5 project sponsored by ARPA is to capture information of interest from English and Japanese newspaper articles about microelectronics and corporate joint ventures. 1 A system must fill a generic template with information taken from the text in a fully automated fashion. The template is composed of several objects, each containing several slots. Slots may have pointers as values, where pointers link related objects. Extracted information is expected to be stored in an object-oriented database [TIPSTER 92].</Paragraph>
    <Paragraph position="1"> 1 Several ARPA-sponsored sites formed the TIPSTER information extraction project. The TIPSTER sites and other non-sponsored organizations participated in MUC-5.</Paragraph>
    <Paragraph position="2"> In the microelectronics domain, information about four specific processes in semiconductor manufacturing for microchip fabrication is captured. They are layering, lithography, etching, and packaging processes. Layering, lithography, and etching are wafer fabrication processes; packaging is part of the last stage of manufacturing. Entities such as manufacturer, distributor, and user, in addition to detailed manufacturing information such as materials used and microchip specifications such as wafer size and device speed, are also extracted in each process.</Paragraph>
    <Paragraph position="3"> The joint ventures domain focuses on extracting entities, i.e. organizations, forming or dissolving joint venture relationships. The information to be extracted includes entity information such as location, nationality, personnel, and facilities, and joint venture information such as relationships, business activities, capital, and estimated revenue of the joint venture.</Paragraph>
  </Section>
  <Section position="6" start_page="1064" end_page="1065" type="metho">
    <SectionTitle>
3 TEXTRACT ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> TEXTRACT is an information extraction system developed as an optional system of the GE-CMU SHOGUN system [Jacobs 93a], [Jacobs 93b]. It processes the TIPSTER Japanese domains of microelectronics and corporate joint ventures.</Paragraph>
    <Paragraph position="1"> The TEXTRACT microelectronics system comprises three major components: preprocessing, concept search, and template generation.</Paragraph>
    <Paragraph position="2"> In addition to concept search, the TEXTRACT joint ventures system performs a template pattern search. It is also equipped with a discourse processor, as shown in Fig. 1.</Paragraph>
    <Paragraph position="3"> In the preprocessor, Japanese text is segmented into primitive words tagged with their parts of speech by a Japanese segmentor called MAJESTY [Kitani and Mitamura 93], [Kitani 91]. Then, proper nouns, along with monetary, numeric, and temporal expressions, are identified by the name recognition module. The segments are grouped into units which are meaningful in the pattern matching process [Kitani and Mitamura 94]. Most strings to be extracted directly from the text are identified by MAJESTY and the name recognizer in the preprocessor.</Paragraph>
    <Paragraph position="4"> The concept search and template pattern search modules both identify concepts in a sentence. The template pattern search also recognizes relationships within the identified information in the matched pattern. Details of the pattern matching process are described in the next section.</Paragraph>
    <Paragraph position="5"> The discourse processor links information identified at different stages of processing. First, implicit subjects, often used in Japanese sentences, are inherited from previous sentences, and second, company names are given unique numbers necessary to accurately recognize company relationships throughout the text [Kitani 94]. Concepts identified during the pattern matching process are used to select an appropriate string and filler to go into a slot. Finally, the template generation process assembles the extracted information necessary to create the output described in Section 2.</Paragraph>
  </Section>
  <Section position="7" start_page="1065" end_page="1066" type="metho">
    <SectionTitle>
4 PATTERN MATCHING IN
TEXTRACT
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1065" end_page="1065" type="sub_section">
      <SectionTitle>
4.1 Concept search
</SectionTitle>
      <Paragraph position="0"> Key words representing the same concept are grouped into a list and used to recognize the concept in a sentence. The list is written in a simple format: (concept-name word1 word2 ...). For example, key words for recognizing a dissolved joint venture concept can be written in the following way:</Paragraph>
      <Paragraph position="2"> (DISSOLVED dissolve terminate cancel).</Paragraph>
      <Paragraph position="3"> The concept search module recognizes the concept when a word in the list exists in the sentence. Using such a simple word list sometimes generates an incorrect concept. For example, a dissolved concept is erroneously identified from the expression &quot;cancel a hotel reservation&quot;. However, when processing text in a narrow domain, concepts are often identified correctly from the simple list, since key words are usually used in a particular meaning of interest in the domain.</Paragraph>
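      <Paragraph> The key-word lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TEXTRACT implementation: the names CONCEPT_LISTS and concept_search are hypothetical, and English base-form tokens stand in for the segmented Japanese words that the system actually processes.</Paragraph>
      <Paragraph><![CDATA[
```python
# Hypothetical sketch of a key-word-list concept search.
# CONCEPT_LISTS mirrors the (concept-name word1 word2 ...) format in the text;
# the dictionary layout and function name are assumptions for illustration.
CONCEPT_LISTS = {
    "DISSOLVED": ["dissolve", "terminate", "cancel"],
}


def concept_search(words, concept_lists=CONCEPT_LISTS):
    """Return the names of concepts that have a key word in the sentence."""
    found = []
    for concept, key_words in concept_lists.items():
        # A concept is recognized as soon as any one of its key words occurs.
        if any(word in key_words for word in words):
            found.append(concept)
    return found


in_domain = concept_search(["X", "terminate", "the", "venture"])
out_of_domain = concept_search(["cancel", "a", "hotel", "reservation"])
```
]]></Paragraph>
      <Paragraph> In this sketch both in_domain and out_of_domain come back as a list containing DISSOLVED, which reproduces the false positive on the hotel reservation expression: the list alone cannot tell the two uses of the key word apart.</Paragraph>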
      <Paragraph position="4"> During the Japanese segmentation process in the preprocessor, a key word in the text tends to be divided into a few separate words by MAJESTY when the word is not stored in the dictionary. For example, the compound noun &quot;~jJ'~f~&quot; consists of two words, &quot;i~3&quot; (joint venture) and &quot;:r#~b1&quot; (dissolve). It is segmented into the two individual nouns using the current MAJESTY dictionary. Thus, when the compound word &quot;~.-~jlf(t'-fb\]&quot; is searched for in the segmented sentence, the concept search fails to identify it. To avoid this segmentation problem, adjacent nouns are automatically put together during the concept search process.</Paragraph>
      <Paragraph position="5"> This process allows, by default, partial word matching between a key word and a word in the text. Therefore, &quot;~.-~j&quot; and &quot;~05 {~J&quot;, both meaning &quot;a joint venture&quot;, can be identified by a single key word &quot;~.'I~,Y&quot;. However, due to the nature of partial matching, the key word &quot;シリコン&quot; (silicon) matches &quot;二酸化シリコン&quot; (silicon dioxide), which is a different type of film reported in the microelectronics domain. This undesirable behavior can be avoided by attaching &quot;&gt;&quot; to the beginning or &quot;&lt;&quot; to the end of key words. Thus, &quot;&gt;シリコン&lt;&quot; tells the matcher that it requires an exact word match against a word in the text.</Paragraph>
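      <Paragraph> The two mechanisms in this subsection, joining adjacent nouns and choosing between partial and exact key word matching, can be sketched as below. The function names, the (word, part-of-speech) tuple layout, and the tag N for nouns are assumptions made for illustration. The text only states that a key word carrying both markers requires an exact match; treating a single marker as anchoring only that end of the word is an assumption of this sketch. Romanized stand-ins are used instead of Japanese strings.</Paragraph>
      <Paragraph><![CDATA[
```python
def merge_adjacent_nouns(segments):
    """Join runs of adjacent nouns into one unit.

    segments is a list of (word, pos) pairs; "N" marks a noun (assumed tag).
    """
    merged = []
    for word, pos in segments:
        if pos == "N" and merged and merged[-1][1] == "N":
            # Concatenate with the previous noun, as in a Japanese compound.
            previous_word = merged[-1][0]
            merged[-1] = (previous_word + word, "N")
        else:
            merged.append((word, pos))
    return merged


def key_word_matches(key_word, text_word):
    """Partial match by default; ">" at the start and "<" at the end anchor it."""
    anchor_start = key_word.startswith(">")
    anchor_end = key_word.endswith("<")
    core = key_word[1:] if anchor_start else key_word
    core = core[:-1] if anchor_end else core
    if anchor_start and anchor_end:
        return text_word == core            # exact word match
    if anchor_start:
        return text_word.startswith(core)   # assumed meaning of ">" alone
    if anchor_end:
        return text_word.endswith(core)     # assumed meaning of "<" alone
    return core in text_word                # default partial match
```
]]></Paragraph>
      <Paragraph> With these definitions, the plain key word silicon matches the text word silicon dioxide, whereas the key word written with both markers matches only the text word silicon itself.</Paragraph>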
    </Section>
    <Section position="2" start_page="1065" end_page="1066" type="sub_section">
      <SectionTitle>
4.2 Template pattern search
</SectionTitle>
      <Paragraph position="0"> The template pattern matcher identifies typical expressions to be extracted from the text that frequently appear in the corpus. The patterns are defined as pattern matching rules using regular expressions.</Paragraph>
      <Paragraph position="1"> The pattern matcher is a finite-state automaton similar to the pattern recognizer used in the MUC-4 FASTUS system developed at SRI [Hobbs et al. 92]. In TEXTRACT, state transitions are driven by segmented words or grouped units from the preprocessor. The matcher identifies all possible patterns of interest in the text that match defined patterns. It must ignore unnecessary words in the pattern to perform successful pattern matching for various expressions.</Paragraph>
      <Paragraph position="2">  Fig. 2 shows a defined pattern in which an arbitrary string is represented as &quot;@string&quot; along with its corresponding English pattern. 2 Specifically, a variable starting with &quot;@CNAME&quot; is called the company-name variable, used where a company name is expected to appear. For example, &quot;@CNAME_PARTNER_SUBJ&quot; matches any string that likely includes at least one company name acting as a joint venture partner and functioning as a subject in the sentence.</Paragraph>
      <Paragraph position="3"> The pattern &quot;は|が:strict:P&quot; tells the pattern matcher to identify the word, where &quot;は&quot; or &quot;が&quot; are grammatical particles that serve as subject case markers. The default type &quot;strict&quot; requires an exact string match, whereas &quot;loose&quot; allows a partial string match. Partial string matching is useful when compound words must be matched to a defined pattern. A joint venture, &quot;~.-~j: loose:VN&quot;, whose part of speech is verbal nominal, matches compound words such as &quot;~..~ ~'~--~j&quot; (corporate joint venture) as well as &quot;N ~,j&quot; (joint venture).</Paragraph>
      <Paragraph position="4"> 2 This English pattern is used to capture expressions such as &quot;XYZ Corp. created a joint venture with PQR  The first field in a pattern is the pattern name followed by the pattern number. The pattern number is used to decide whether or not a search within a given string is necessary. To assure efficiency with the pattern matcher, the field designated by the number should include the least frequent word in the entire pattern (&quot;~t~&quot; for Japanese and &quot;a joint venture&quot; for English in this case).</Paragraph>
      <Paragraph position="5">  Approximately 150 patterns were used to extract various concepts in the Japanese joint ventures domain. Several patterns usually match a single sentence. Moreover, since patterns are often searched using case markers such as &quot;は&quot;, &quot;が&quot;, and &quot; ~ &quot;, which frequently appear in Japanese texts, even a single pattern can match the sentence in more than one way when several of the same case markers exist in a sentence. However, since the template generator accepts only the best matched pattern, choosing a correctly matched pattern is important. The selection is done by applying three heuristic rules in the following order:  input segments (the shortest string match), and * select patterns that include the largest number of variables and defined words.</Paragraph>
      <Paragraph position="6"> Another important feature of the pattern matcher is that rules can be grouped according to their concept. A rule name &quot;JointVenture1&quot; in Fig. 2, for example, represents a concept &quot;JointVenture&quot;. Using this grouping, the best matched pattern can be selected from matched patterns of a particular concept group instead of choosing from all the matched patterns. This feature enables the discourse and template generation processes to look at the best information necessary when filling in a particular slot.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="1066" end_page="7067" type="metho">
    <SectionTitle>
5 EXAMPLE OF THE INFORMATION EXTRACTION PROCESS
</SectionTitle>
    <Paragraph position="0"> This section describes how concepts and patterns identified by the matcher are used for template filling. Concepts are often useful to fill in the &quot;set fill&quot; (choice from a given set) slots. An entity type slot, for example, has four given choices: COMPANY, PERSON, GOVERNMENT, and OTHER. The matcher assigns concepts related to each entity type except OTHER. Thus, from the given set, the output generator chooses an entity type corresponding to the identified concept. There are cases when discourse processing is necessary to link identified concepts and patterns. The following text: &quot;X Inc. created a joint venture with Y Corp. last year. X announced yesterday that it terminated the venture.&quot; is used to describe the extraction process illustrated in Fig. 3.</Paragraph>
    <Paragraph position="1"> In the preprocessing, two company names in the first sentence, &quot;X Inc.&quot; and &quot;Y Corp.&quot;, are identified either by MAJESTY or the name recognizer. In the first sentence, the template pattern search locates the JointVenture1 pattern shown in Fig. 2. Now, the JOINT-VENTURE concept between &quot;X Inc.&quot; and &quot;Y Corp.&quot; is recognized. In the second sentence, the company name &quot;X&quot; is also identified by the preprocessor. 3 Next, the concept &quot;DISSOLVED&quot; is recognized by the key word terminate in the concept search. (The key word list is shown in Section 4.1.) After sentence-level processing, discourse processing recognizes that &quot;X&quot; in the second sentence is a reference to &quot;X Inc.&quot; found in the first sentence. Thus, the &quot;DISSOLVED&quot; concept is joined to the joint venture relationship between &quot;X Inc.&quot; and &quot;Y Corp.&quot;. In this way, TEXTRACT recognizes that the two companies dissolved the joint venture.</Paragraph>
    <Paragraph position="2"> 3 When it is an unknown word to the preprocessor, the discourse processor identifies it later.  Suppose that the second sentence is replaced with another sentence: &quot;Shortly after, X terminated a contract to supply rice to Z Corp.&quot;. Although it mentions neither the dissolved relationship nor anything about &quot;Y Corp.&quot;, the system incorrectly recognizes the dissolved joint venture relationship between &quot;X Inc.&quot; and &quot;Y Corp.&quot; due to the existence of the word terminate. When this undesirable matching is often seen, more complicated template patterns must be used instead of the simple word list. A dissolved concept, for example, could be identified using the following template pattern:</Paragraph>
    <Paragraph position="3">  @CNAME_PARTNER_WITH).</Paragraph>
    <Paragraph position="4"> Then, discourse processing must check if companies identified in this pattern are the same as the current joint venture companies in order to recognize their dissolved relationship.</Paragraph>
  </Section>
  <Section position="9" start_page="7067" end_page="7067" type="metho">
    <SectionTitle>
6 OVERALL SYSTEM PERFORMANCE
</SectionTitle>
    <Paragraph position="0"> A total of 250 newspaper articles, 100 about Japanese microelectronics and 150 about Japanese corporate joint ventures, were provided by ARPA for use in the TIPSTER/MUC-5 system evaluation. Five microelectronics and six joint ventures systems were presented in the Japanese system evaluation at MUC-5. 4 Scoring was done in a semi-automatic manner. The scoring program automatically compared the system output with answer templates created by human analysts; then, when a human decision was necessary, analysts instructed the scoring program whether the two strings in comparison were completely matched, partially matched, or unmatched. Finally, it calculated an overall score combined from all the newspaper article scores. Although various evaluation metrics were measured in the evaluation [Chinchor and Sundheim 93], only the following error and recall-precision metrics are discussed in this paper. The basic scoring categories used are: correct (COR), partially correct (PAR), incorrect (INC), missing (MIS), and spurious (SPU), counted as the number of pieces of information in the system output compared to the possible (answer) information.  4 TEXTRACT's scores submitted to MUC-5 were unofficial. JME: Japanese microelectronics domain. JJV: Japanese corporate joint ventures domain.  5 TEXTRACT processed only Japanese text, whereas the two other systems processed both English and Japanese text.</Paragraph>
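    <Paragraph> The metric definitions themselves did not survive in this copy of the text. The sketch below shows how an error rate, recall, and precision are commonly derived from the five scoring categories in MUC-style evaluation, with a partially correct fill counted as half. Both the formulas and the function name muc_scores are assumptions here, not taken from this paper, and the counts in the example are invented for illustration and are not TEXTRACT results.</Paragraph>
    <Paragraph><![CDATA[
```python
def muc_scores(cor, par, inc, mis, spu):
    """Compute error, recall, and precision from MUC-style category counts.

    Assumed definitions (not recovered from the source text):
      possible = COR + PAR + INC + MIS   (fills in the answer templates)
      actual   = COR + PAR + INC + SPU   (fills in the system output)
    """
    def ratio(numerator, denominator):
        # Guard against empty answer keys or empty system output.
        return numerator / denominator if denominator else 0.0

    possible = cor + par + inc + mis
    actual = cor + par + inc + spu
    credit = cor + 0.5 * par  # a partial match earns half credit
    return {
        "error": ratio(inc + 0.5 * par + mis + spu, cor + par + inc + mis + spu),
        "recall": ratio(credit, possible),
        "precision": ratio(credit, actual),
    }


# Invented counts, for illustration only.
example = muc_scores(cor=50, par=10, inc=10, mis=30, spu=20)
```
]]></Paragraph>
    <Paragraph> Under these assumed definitions, missing fills lower recall, spurious fills lower precision, and both kinds raise the error rate.</Paragraph>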
    <Paragraph position="1"> from the TIPSTER/MUC-5 system evaluation results [MUC-5 93]. TEXTRACT performed equally with the top-ranking systems in the two Japanese domains.</Paragraph>
    <Paragraph position="2"> Since the TEXTRACT microelectronics system did not include a template pattern search or discourse processor to help differentiate between multiple semiconductor processes of the same kind, it reported only one object for each kind of manufacturing process, even when multiple objects of the same kind existed in the article. This resulted in lower scores in the microelectronics domain than in the joint ventures domain.</Paragraph>
    <Paragraph position="3"> This pattern matching architecture is highly portable across different domains of the same language. The TEXTRACT microelectronics system was developed in only three weeks by one person by simply replacing joint venture concepts and key words with representative microelectronics concepts and key words.</Paragraph>
  </Section>
</Paper>