<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2157">
  <Title>A Self-Learning Universal Concept Spotter</Title>
  <Section position="3" start_page="0" end_page="931" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> hlentifying concepts in natural language text is an important intbrmation extraction task. Depending upon the current information needs one may be interested in finding all references to people, locations, dates, organizations, companies, products, equipment, and so on. These concepts, along with their classification, can be used to index any given text for search or categorization purposes, to generate suimnaries, or to populate database records. However, automating the process of concept identification in untbrmatted text has not been an easy task. Various single-Imrpose spotters have been developed for specific types of conce.pts, including people mm~es, com'pa.ny n&amp;ines, location names, dates, etc. })lit; those were usually either hand crafted for particular applications or domains, or were heavily relying on apriori lexical clues, such as keywords (e.g., 'Co.'), case (e.g., 'John K. Big'), predicatable format; (e.g., 123 Maple Street), or a combination of thereof. This makes treat, ion and extension of stleh spotters an arduous mamml job. Other, less s;tlient entities, such as products, equipnmilt, foodstuff', or generic refcrenc.es of any kind (e.g., 'a ,lapanese automaker') could only be i(lentifled if a sut\[iciently detailed domain model was available. Domain-model driven extraction wits used in ARPA-sponsored Message Understanding Colltc1'eilc(!s (MUC); a detailed overview of current research can be found in the procecdil~gs ot7 MUC-5 (nmcS, 1993) and the recently concluded MUC-6, as well as Tipster Project meetings, or ARPA's Human Language q&gt;chnology workshops (tipsterl, 1993), (hltw, 1994).</Paragraph>
    <Paragraph position="1"> We take a somewh~t different approach to identify various types of text entities, both generic and specific, without a (let, ailed underst, anding of the text domain, and relying instead on a comlfination of shallow linguistic processing (to identi(y candidate lexical entities), statistical knowledge acquisition, unsupervised learning techniques, and t)ossibly broa(1 (mfiversal but often shallow) knowledge, sources, such as on-line dictionaries (e.g., WordNet, Comlex, ()ALl), etc.). Our method IllOVeS t)eytmd the traditional name si)otters and towards a universal spotter where, the requirements on what to spot can be specified as input paraineters, and a specific-purpose spotter c.ouht be generated automatically. In this paper, we describe a method of creating spotters for entities of a specified category given only initial seed examples, and using an unsupervised learning t)rocess to discover rules for finding more instances of the eoncet)t. At this time we place no limit on what kind of things one may want to build a spotter for, al@lough our extmriments thus far concentrated on entities customarily re- null ferred to with noun phrases, e.g., equipment (e.g., &amp;quot;gas turbine assembly&amp;quot;), tools (e.g., &amp;quot;adjustable wrench&amp;quot;), products (e.g., &amp;quot;canned soup&amp;quot;, &amp;quot;Arm  ton), and so on. We view the semantic categorization problem as a case of disambiguation, where for each lexical entity considered (words, phrases, N-grams), a binary decision has to be made whether or not it is an instance of the semantic type we are interested in. The problem of semantic tagging is thus reduced to the problem of partitioning the space of lexical entities into those that are used in the desired sense, and those that are not. We should note here that it is acceptable for homonym entities to have different classification depending upon the context in which they are used. Just as the word &amp;quot;bank&amp;quot; can be assigned different senses in different contexts, so can &amp;quot;Boeing 777 jet&amp;quot; be once a product, and another time an equipment and not a product, depending upon the context. Other entities may be less context dependent (e.g., company nan'ms) if their definitions are based on internal context (e.g., &amp;quot;ends with Co.&amp;quot;) as opposed to external context (e.g., &amp;quot;followed by mauufactures&amp;quot;), or if they lack negative contexts. The user provides the initial information (seed) about what kind of things he wishes to identify in text. This infortnation should be in a form of a typical lexical context in which tile entities to be spotted occur, e.g., &amp;quot;the name ends with Co.&amp;quot;, or &amp;quot;to the right of produced or made&amp;quot;, or &amp;quot;to the right of maker of', and so forth, or simply by listing or highlighting a number of examples in text.</Paragraph>
    <Paragraph position="2"> In addition, negative examples can be given, if known, to eliminate certain 'obvious' exceptions, e.g., &amp;quot;not to the right of made foal', &amp;quot;not toothbrushes&amp;quot;. Given a sufficiently large training corpus, an unsupervised learning process is initiated in which the system will: (1) generate initial context rules from the seed examples; (2) find further instances of tile sought-after concept using the initial context while maximizing recall and precision; (3) find additional contexts in which these entities occur; and (4) expand the current context rules based on selected new contexts to find even more entities.</Paragraph>
    <Paragraph position="3"> In the rest of tlle paper we discuss the specifies of our system. We present and evaluate preliminary results of creating spotters for organizations and products.</Paragraph>
    <Paragraph position="4"> 2 What do you want to find: seed selection If we want to identify some things in a stream of text, we first need to learn how to distinguish them from other items. For example, company names are usually capitalized and often end with 'Co.', 'Corp.', 'Inc.' and so forth. Place names, such as cities, are nonmflly capitalized, sometimes are followed by a state abbreviation (as in Albauy, NY), and may be preceded by locative prepositions (e.g., in, at, from, to). Products may have no distinctive lexical appearance, but they tend to be associated with verbs such as 'produce', 'manufacture', 'make', 'sell', etc., which in turn may involve a company name. Other concepl;s, such as equipment or materials, have R~'w if any ot)vious associati(ms with the surrounding text, and on(; may prefer just to iioint them out directly to the learning prograin. There are texts, e.g., technical manuals, where such specialized entities occur more often than elsewhere, and it may be adwmtagous to use these texts to derive spotters.</Paragraph>
    <Paragraph position="5"> The seed can be obtained either by hand tagging some text or using a naive spotter that has high precision but presumably low recall. A naive spotter may contain simple contextual rules such as those mentioned above, e.g., for organizations: a noun phrases ending with &amp;quot;Co.&amp;quot; or &amp;quot;Inc.&amp;quot;; for products: a noun phrase following &amp;quot;manufacturer of&amp;quot;, &amp;quot;producer of&amp;quot;, or &amp;quot;retailer of&amp;quot;. When such naive spotter is ditlicult to come by, one may resort to hand tagging.</Paragraph>
  </Section>
class="xml-element"></Paper>