<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2157">
  <Title>A Self-Learning Universal Concept Spotter</Title>
  <Section position="4" start_page="931" end_page="932" type="metho">
    <SectionTitle>
3 From seeds to spotters
</SectionTitle>
    <Paragraph position="0"> The seed should identit~y the sought-after entities with a high precision (thougil not; necessarily 100%), however its recall is assumed to be low, or else we would already have a good spotter. Our task is now to iucrease tile recall while maintaining (or ('.veil increase if possible) the precision. We proceed by examining the lexical context in which tlle seed entities occur. In the silnplest instance of this process we consider a context to coilsist of N words to the left of the seed and N words to the right of tile seed, as well as the words ill the seed itself. Each piece of significant contextual evidence is then weighted against its distribution in the balance of the training corpus. This in turn leads to selection of some contexts to serve as indicators of relevant entities, in other words, they become the initial rules of the emerging spotter.</Paragraph>
    <Paragraph position="1"> As an exami)le, let's consider building a spotter for company names, starting with seeds as illustrated in the tbllowing fragments (with seed cont, exts highlighted): ... HENRY KAUFMAN is president of Henry Kaufmau C~ Co., a ... Gabelli, chairman of Gabelli l%nds Inc.; Claude N. Rosenberg ... is named president of  Slmndinaviska Enskilda Banken ... become viee chairman of the state-owned electronics giant Thomson S.A .... banking group, said the formal merger of Sl~anska Banken into ... water maker Source Perrier S.A., according to French stock ...</Paragraph>
    <Paragraph position="2">  ltaving &amp;quot;Co.&amp;quot; &amp;quot;htc.&amp;quot; to pick out &amp;quot;Henry Kauf mmn &amp; Co.&amp;quot; rand &amp;quot;Gabelli IAmds Inc.&amp;quot; as seeds, we proceed to find new evidence in the training corlms , using an unsul)ervised lemrning process, mnd discover thmt &amp;quot;chmirman of&amp;quot; rand &amp;quot;t)residcnt of&amp;quot; rare very likely to precede, cOral)any nalnes. We expand our initial set of rules, which tallows us to spot more COml)anies: ... ltENI{Y KAUFMAN is president of lh;nry Kaufm.an ~'4 Co., a ... Gabclli, chairman of Gabclli \[,}mds Inc.; Clmude N. \]{osenl)erg ... is nmmed president of Skandi'naviska Enskilda Bankcn ... be, come vice ('hairntan of lhe. state-o'wncd electronics giant Thomson S.A .... banldng groul) , said dw, forreal merger of Skansl~ l{anken into ...</Paragraph>
    <Paragraph position="3"> winter inaker Sotnce Perrier S.A., according to French stock ...</Paragraph>
    <Paragraph position="4"> This evidence discovery (:an be relmated in m bool;strmpl)ing process l)y ret)la(:ing the initiml set; of seeds with the new set; of entities obtained froln the lmst itermtion. In t|~e mbove examt)le, we now have. &amp;quot;Slamdinaviskm Fmskihla Bank(m&amp;quot; and &amp;quot;l;hc stmte-owned electronics giant '\]'homson S.A.&amp;quot; in mddition to the initiml two names. A flu'ther it(wation ma,y mdd &amp;quot;S.A.&amp;quot; rand &amp;quot;Bmnken&amp;quot; {;o l;hc set of contcxtuml rules, and so forth, in generml, (ml;ities can 1)e both added mnd deh;ted from the evolving s(;t of examples, det)ending on how uxmctly the cvid(;n(:e is weighted and combin(;d. The details are exl)lained in the following sections.</Paragraph>
  </Section>
  <Section position="5" start_page="932" end_page="932" type="metho">
    <SectionTitle>
4 Text preparation
</SectionTitle>
    <Paragraph position="0"> In ill()S~, (;asc, s l;he text needs to t)e preprocessed to isolmte 1)asic lexi(:al tok(',ns (words, ml)l)r(!viations, symbols, mnnol;a|;ions, el;(:), and sl;ru(:turml units (sections, pmragrat)hs , sentences) wh(mever api)licmt)le. In addition, t)mrt-of-speech tmgging ix usuml\]y desirmble, in which case tim tagger mmy need l;o be re-trained on a text saml)le 1;o ol)l;ilnize its performance (Brill, 1993), (Mercer, Schwartz &amp; W(;ischedcl, 1{)91). Finmlly, a limited amount of lexicml normalization, or stemming, Inay be f)erlormed. null The entities we rare looking for inay be exl)ressed |)y certain tyt)es of phrases. For example, people nmmes m'e usually sequences of i)rot)er nouns, while equipment nmmes rare contained within noun phrmses, e.g., 'forwmrd looking int&gt;m'ed radar'. We use 1)art of speech information to delinemte those se(lllelt(;es of lexicml l;okens t;hat arc likely to (:ont;mill (Olll &amp;quot;~ enl;itics. \]~'l'()in l;h(',ll Oil we restrict tony further t)rocessing on these sequences, and their contexts.</Paragraph>
    <Paragraph position="1"> These preparatory steps are desirable since they reduce the amount of noise through which the lemrning process needs to plow, but they mre not, strictly st)eaking, ne(:essary. Further experiments rare required to deterlnint~ the level of preprocessing required I;o optinfize the t)erforlnanee of the \[hfiversal Sl)otl;er.</Paragraph>
  </Section>
  <Section position="6" start_page="932" end_page="933" type="metho">
    <SectionTitle>
5 Evidence items
</SectionTitle>
    <Paragraph position="0"> The smnmnl;i(: categorization problem described here displmys some pmrmllcls to the word sense dis ambigumdon problem where hoInonylll words ileed to be mssigned to one of several possible senses, (Yarowsky, 19!)5), (Gale, Chm'ch &amp; Yarowsky, lt)92), (Brown, Pictra, Pietra &amp; Mercer, \]991).</Paragraph>
    <Paragraph position="1"> 'Fhcre mrc two itnportant difl'erenc(',s, however.</Paragraph>
    <Paragraph position="2"> First, in the semantic cat, cgorizal;ion l)ro|)lem, t, here is al; lemsl, one Olmn-ended catc, gory serving as m grml) 1)rag for roll things non-relevant;. This c, mte, gory Inay be hard, if not impossible, to describe by any finit(; set of rules. Second, unlike the word sense disambigumtion where the it;eros 1;o be clmssitied arc known apriori, we attempt to acconqflish two things at the smnm time:  1. discover l;he items Lo be (:onsidcred for c, mtegorization; null 2. acl;ually decide if an item 1)elongs to a given  category, or falls outside of il;.</Paragraph>
    <Paragraph position="3"> '\]'hc cmtcgorization of a lexical token its belonging l,o m p;ivell selnalltic, clmss is based llpOtt t,}l(': information provided by the words occurriug in 1,he token itself, ms well as the words thmL l)re cede mM follow it; in t(~xl;. Ill addition, i)ositionml relal;ionshil)s among l;hes(; words mmy be of importaalce. ~lb capture l;his informal;ion, we define the notion of an e.'videncc set lbr a lexicml unil; W,//V2...IA&lt;,,. (m phrase, or an N-gram) its follows. Let .... W.., .... W .I W~...W,,W, , W+.2...W, , .... be m string of subsequellt, tokens (e.g., words) in text, such Lhat W~ W~....I/Km is a unit of interesl, (e.g., a noun phrase) rand n is the maximum size of the context window on either side of the unit. The mt:\[;ual window size, mmy l)e limited by boundaries of strllcturml mfit, s sm;h its sentences or parmgraphs. For each unit W1 Wu...l/g,,~, a se~ of evidence, ilcms is colh;cted as a set union of the following four sel;s:  1. Pmirs of (word, position), where position {p,s, f} indicates whethex word is fount\[ ill the context preceding (p) the central refit, following (t) it,, or whe|;her il; come, s flom I;he centra.1 unil;</Paragraph>
    <Paragraph position="5"> 2. Pairs of (bi-gram, position) to capture word se, quence informmtion. E2 = {(W ..... W--(,~- l)), p) ... ((1/V._:~, W__ t), p) } ((w,, w~), .~) ... ((w,,~ _,, w,,&amp; ~) ((w+l, w+~),f) ... ((w+/,~_l), w+,o, f)  3. 3-tuples (word, position, distance), where distance indicates how far word is located relative to W1 or I/V,~. Ea = {</Paragraph>
    <Paragraph position="7"> ((Wl, W2), s, 7D, - 1) ...... ((W .... 1, Win), s, 1) ((w+l, w+~), f, 0...((w+(,~_ ~), w+~), f, n - 1) For example, ill the fl'agment below, tile central phrase the door has the context window of size 2: ... boys kicked the door with rage ...</Paragraph>
    <Paragraph position="8"> The set of evidence items generated for this fl'aginent, i.e., E1 UE2 UEaUE4, contains the following elements: (boys, p), (kicked, p), (the, s), (door, s), (with, f), (rage , f), ((boys, kicked), p), ((the, door)), s), ((with, ,'age), f), (boys, p, 2), (ki&amp;ed, p, 1), (the, s, 2), (door, s, 1), (with, f, 1), (rage, f, 2), ((boys, kicked), p, 1), ((the, door)), s, 1), ((with, ,'age), f, 1) Items in evidence sets are assigned significance weights (SW) to indicate how strongly they point towards or against the hyphothesis that the central unit belongs to the semantic category of interest to the spotter. The significance weights are acquired through corpus-based training.</Paragraph>
  </Section>
  <Section position="7" start_page="933" end_page="933" type="metho">
    <SectionTitle>
6 Training
</SectionTitle>
    <Paragraph position="0"> Evidence items for all candidate phrases in the training corpus, for those selected by tile initial used-supplied seed, as well as for those added by a training iteration, are divided into two groups.</Paragraph>
    <Paragraph position="1"> Group A items are collected from the candidate phrases that are accepted by tile spotter; group R items come from the candidate phrases that are rejected. Note that A and 1% may contain repeated elements.</Paragraph>
    <Paragraph position="2"> For each evidence item t, its significance weight is computed as:</Paragraph>
    <Paragraph position="4"> where f(t, X) is the fl'equency of t in group X, and s is a constant used to filter the noise of very low frequency items.</Paragraph>
    <Paragraph position="5"> As defined SW(t) takes values from -1 to 1 interval. SW(t) close to 1.0 means that t appears imarly exclusively with the candidates that have been accepted by tile spotter, and thus provides the strongest positive evidence. Conversely, SW(t) close to -1.0 means that t is a strong negative indicator since it occurs nearly always with the rejected candidates. SW(t) close to 0 indicates neutral evidence, which is of little or no consequeuce to the spotter. In general, we take SW(t) &gt; e &gt; 0 as a piece of positive evidence, and SW(t) &lt; -e as a piece of negative evidence, as provided by item t. Weights of evidence items within an evidence set are then combined to arrive at the compound context weight which is used to accept or reject candidate phrase.</Paragraph>
    <Paragraph position="6"> At this time, we make no claim as to whether  (1) is an optimal fornmla for cah:ulating evidence weights. An alternative method we considered was to estimate certain conditional probabilities, similarly to the formula used in (Yarowsky, 1995):</Paragraph>
    <Paragraph position="8"> Here f(A) is (an estimate of) the probability that any given candidate phrase will be accepted by the spotter, and f(R) is the probability that this phrase is rejected, i.e., f(R) = l-f (A). Thus fin' our experinmnts show that (1) produces better results than (2). We continue investigating other weighting schemes as well.</Paragraph>
  </Section>
  <Section position="8" start_page="933" end_page="934" type="metho">
    <SectionTitle>
7 Combining evidence weights to classify phrases
</SectionTitle>
    <Paragraph position="0"> classify phrases In order to classify a candidate phrase, all evidence items need to be collected from its coiltext and their SW weights are combined. When the combined weight exceeds a threshold value, the candidate is accepted and the i)hrase becomes available for tagging by the spotter. Otherwise, the ('andidate is reje(:te(l, although it may be reevaluated in a fllture iteration.</Paragraph>
    <Paragraph position="1"> There are many ways to combine evidence weights. In our experiments we tried the following</Paragraph>
    <Paragraph position="3"> both x and y are positive, and it is less than both x and y for negative x and y. In all cases, x 0) y remains within \[-1, +1\] interval.</Paragraph>
    <Paragraph position="4"> In (4) only the dominating evidence is considered. This formula is more noise resistant than (3), but produces generally less recall.</Paragraph>
  </Section>
  <Section position="9" start_page="934" end_page="934" type="metho">
    <SectionTitle>
8 Bootstrapping
</SectionTitle>
    <Paragraph position="0"> The eviden{:e, training and candidate sele{:tion (:ycle forms a l)ootstrapI}ing t}rocess, as folh)ws:  crease recall of the spotter. This is possible thanks to overall redundancy and rep(;titiveness of information, particularly local {:ontext information, in large bodies of text. For exanq}le,, in our three,sectional contexl, ret)resent, ation (t}re(:eding, self, following), if one section contains strong evidence that the candidate t)hrase is selectat}le, eviden(:e f(mnd in other se,{:tions will t}e considere, d in tile next training cy{:le, in order to sele(:t additional candidates.</Paragraph>
    <Paragraph position="1"> An imi}ortmlt consideration he, re is to mainlain all overall precision level throughout the elltire process. AMmugh, it; may t)e possible to rec(}ver fl'om some miselassiti{:ation errors (e.g., (Ym'owsky, 1995)), (:a.re shouhl 1)e taken when adjusting the process l}arameters so that 1)r{;eision does not deteriorate too rapidly. For insl;ance, a(:(;el}tan(;e thresholds of evide, nce weights, initially set, higll, can be gradually decreased to allow more recall while keeping l}recision at a reasonable level. In additioil, (Yarowsky, 1995), (Gale, Church &amp;; Yarowsky, 1992) point ou{; that there is a st, rent tenden(:y for words 1;O occur in (}Ile sense within any given dis{:ourse (&amp;quot;one sense pe, r dis{:ourse&amp;quot;). Th(; same seems to at)ply to (:oncel)t sele(:l;ion, thai, is, Inultil}le o(:(:m'ren(:es of a (:an{lidate 1}hrase within ~t disc{}urse should all 1}e eithe\]' a(:eel)te{l or reje,(:t;(;{t \[)y the Sl}Ol,te\]'. This in turn allows f{}r t}ootstrat}t)ing pr(}cess to gather more contextual evideal{:c more quickly, and thus to (:onwuge faster t)rodu{:ing, better results.</Paragraph>
  </Section>
  <Section position="10" start_page="934" end_page="934" type="metho">
    <SectionTitle>
9 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> We, used the Universal St)ot;ter to find organizations an{1 products in a 7 MBytes cortms consisting of al'ti(:les fl'om i;ll(', Wall Street Journal. l,'irst, we l}re-t)rocess{~d the l;ext with a l}arl;-of-sl}{~ech tagger and |dent|tied all simple noun groups to l)e used as {:and|date 1}hrases. 10 artMes were set, aside and ha.rid l,agged as key for evalual;ion.</Paragraph>
    <Paragraph position="1"> Subsequently, seeds were construct, ed ma.nually in forln of contextual rule, s. l~i)r orgmfizati{}ns, these |nit|a.1 rules hall a 98% i)\]'e{;ision and 4{}%) recall; for products, the corresl}onding numbers were 97% and 42%}. (4) is used to combine evidences. No lexi{:on veriti{:ation (see later) has been used in order l;o show m()re clearly the behavior the learning nmthod itself ( the l}erformance can  be enhanced by lexicon verification). Also note that the quality of the ,~eeds affects the per'formalice of the final sI)otl;er since they define what type of {;()I1(',(;1)1; the system is supt)osed to look for. The seeds that we used in our exlmrimenl;s are quit(; simple, perhaps too simple, lletter seeds may be neede.d (possibly developed through all inl;eraC/'tion with the user) t;o obtain str(mg r{~stllts for some (:~l, cgories of concci}l;s.</Paragraph>
    <Paragraph position="2"> For orgmdzation tagging, the recall and precision results obtained after the tirst mid the follrth t}ootstrat)t)ing eyt'.le are given in Figm'e 1.</Paragraph>
    <Paragraph position="3"> The poinl; with the inaximmn precision*recall in the ftmrth rllll is 950/{) pre(:ision and 90% recall. Examples of extracted organizations in{:lude: &amp;quot;l,h,e State Statistical btstit, ntc, lst,,,t,&amp;quot;, &amp;quot;We.rl, heim Sch, roder #4 Co&amp;quot;, &amp;quot;Skandi'naviska Enskilda Ha'nken&amp;quot;, &amp;quot;Statistics Canada&amp;quot;.</Paragraph>
    <Paragraph position="4"> The results for products tagging are given in Figure 2 on the next page. Examph~s of extracted products include: &amp;quot;the Mercury Grand Marquis and Ford Crown Victoria cars&amp;quot;, &amp;quot;(_~tevrolet Prizm&amp;quot;, &amp;quot;Pump shoe&amp;quot;, 'MS/doe&amp;quot;. The efl'ect of bootstrapping is clearly visible in both charts: it improves the recall while, mainraining or even iinproving the pre,(:ision. We may also nol;ice that some misclassifications due to all iml)ext'e,t:t seed (e.g., see the first dip in t)re(:ision ()11 the 1}tOdllt;l;s chart) (:all ill t'aet t)e corrected in further t}ootstrapping loops. The generally lower performance levels for the product; spotl;er is prol)ably due to the. fact t;hat the (;oncel)t of produ(;t, is harder to eirt'.mnscril)e.</Paragraph>
  </Section>
  <Section position="11" start_page="934" end_page="935" type="metho">
    <SectionTitle>
10 Further options
10.1 Lexicon verification
</SectionTitle>
    <Paragraph position="0"> The itenlS identified in the second step can be further wflidated fl)r their broad semantic classification using on-line lexical (lat~J)asc8 such as Corn- null lex or Longman Dictionary, or Princeton's Word-Net; (Miller, 1990) For example, &amp;quot;gas turbine&amp;quot; is an acceptable equipment/machinery name since 'turbine' is listed as &amp;quot;machine&amp;quot; or &amp;quot;device&amp;quot; in WordNet hierarchy. More complex validation may involve other words in the phrase (e.g., &amp;quot;circuit breaker&amp;quot;) or words in the immediate context.</Paragraph>
  </Section>
  <Section position="12" start_page="935" end_page="935" type="metho">
    <SectionTitle>
10.2 Conjunctions
</SectionTitle>
    <Paragraph position="0"> The current program cannot deal with conjunction. The difficulty with conjunction is not with classification of the conjoined noun phrases (it is easier, as a matter of fact, because they carry more evidences) but with identification of the phrase itself because of the structural ambiguities it typically involves that cannot be dealt with easily on lexical or even syntactic level.</Paragraph>
  </Section>
class="xml-element"></Paper>