<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1024">
  <Title>Tagged Sentences Tagger Lisp-readable sentences CompletedChart Template Generator, Target Templates Input Text System Knowledge Bases</Title>
  <Section position="2" start_page="0" end_page="295" type="intro">
    <SectionTitle>
FLOW OF CONTROL
</SectionTitle>
    <Paragraph position="0"> In the spectrum of information extraction approaches represented in MUC-5, LINK tends toward computing a complete syntactic and semantic analysis of each sentence. The main module of the system is a unification-based chart parser. Relatively little preprocessing is performed on individual sentences before they are passed to the parser. A complete analysis of each sentence is attempted, although partial parses are utilized if a complete parse cannot be produced.</Paragraph>
    <Paragraph position="1"> The overall system consists of the modules shown in Figure 1. One sentence at a time passes through the modules in the order shown in the figure. Each module's function is described below.</Paragraph>
    <Paragraph position="2"> The Tokenizer. The tokenizer produces LISP-readable files from a 100-article source file. Each file consists of header information followed by the sentences of the article represented as lists of tokens. Tokens that have special meaning in LISP, such as single and double quotes, commas, and periods, are modified to be readable by the main parsing engine.</Paragraph>
    <Paragraph position="3"> Sentence boundaries are hypothesized whenever a period is seen. The one exception is a period that follows a known abbreviation and is not followed by a capitalized token; such a period is not treated as the end of the sentence.</Paragraph>
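    <Paragraph> The sentence-boundary heuristic just described can be illustrated with a minimal Python sketch. It is not the original LISP code: the function name, the flat token-list input, and the abbreviation list are assumptions made for the example.
```python
# Hypothetical abbreviation list; the system's actual list is not given in the text.
KNOWN_ABBREVIATIONS = {"corp", "co", "inc", "mr", "dr"}


def split_sentences(tokens):
    """Split a flat list of tokens into sentences at periods.

    A period does not end a sentence when it follows a known abbreviation
    and the next token is not capitalized.
    """
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token != ".":
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        following = tokens[i + 1] if i + 1 != len(tokens) else ""
        after_abbreviation = previous in KNOWN_ABBREVIATIONS
        next_is_capitalized = following[:1].isupper()
        if after_abbreviation and not next_is_capitalized:
            continue  # the period belongs to the abbreviation
        sentences.append(current)
        current = []
    if current:
        sentences.append(current)  # keep any trailing tokens as a final sentence
    return sentences
```
    Under these assumptions, a period after "Corp" followed by a lowercase word is kept inside the sentence, while a period after "Corp" followed by a capitalized word still ends it.</Paragraph>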
    <Paragraph position="4"> Double quotes are simply removed, as are single quotes that are used as quotation marks. Contractions are expanded and possessives are made into separate tokens (e.g., "Nikon's" becomes "Nikon *'S*"). Other special LISP symbols are converted to LISP-readable symbols. The Tokenizer checks the case of each word, and puts sequences of capitalized words inside strings for the use of the Tagger, as described below. It also breaks apart hyphenated tokens if the first half is a number (e.g., "25-Mhz"), to allow the grammar access to the units. The Tokenizer also performs some filtering tasks. Names of locations at the beginning of the text or abstract are removed, as are author name lines and COMLINE tag lines. Sentences that are too short to be interesting are removed. [Figure 1: Modules of the MUC-5 LINK system]</Paragraph>
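    <Paragraph> As a rough illustration of two of the token-level rewrites described above (possessive splitting and breaking apart number-first hyphenated tokens), here is a short Python sketch. The function names are hypothetical, whitespace splitting stands in for the real tokenization, and the sketch does not distinguish contractions such as "it's" from possessives, which the actual Tokenizer handles separately. Dropping the hyphen is also an assumption.
```python
import re


def split_token(token):
    """Return the list of tokens that one raw token is broken into."""
    # Possessive: "Nikon's" -> ["Nikon", "*'S*"], following the example in the text.
    if token.lower().endswith("'s") and len(token) > 2:
        return [token[:-2], "*'S*"]
    # Hyphenated token whose first half is a number: "25-Mhz" -> ["25", "Mhz"].
    match = re.fullmatch(r"(\d+(?:\.\d+)?)-(.+)", token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


def tokenize(text):
    """Split on whitespace, drop double quotes, and apply split_token."""
    tokens = []
    for raw in text.replace('"', "").split():
        tokens.extend(split_token(raw))
    return tokens
```
    Hyphenated tokens that do not start with a number, such as "state-of-the-art", pass through unchanged in this sketch.</Paragraph>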
    <Paragraph position="5"> The Tagger. Because the input is mixed case in this domain, and because many of the proper names that would normally be unknown to the system lexicon are capitalized, the MUC-5 LINK system uses a pre-parse tagger to process and attempt to identify capitalized words, which are passed as strings from the Tokenizer. The Tagger uses heuristics (aka hacks) to break apart strings in several different ways. Some of the tags that are used include :COMP-NAME for things that seem to be obviously company names, :LOCATION for city/state pairs, :PERSON-NAME for names of people (if preceded by a title such as Mr, Mrs, VP, or Dr), and :NAME for other names.</Paragraph>
    <Paragraph position="6"> Some example rules that the tagger uses are: 1. If a word is a known acronym (e.g., DRAM) or an abbreviation that is normally capitalized (e.g., "Mbit"), then just pass the word as a regular lexeme.</Paragraph>
    <Paragraph position="7"> 2. If the string ends in a word like "Corp" or "Co", tag the string as a company name. 3. If a string is followed by a word like "President" or "Spokesman" and then another string, make the first part a company name and the rest a person name.</Paragraph>
    <Paragraph position="8"> 4. If a string is followed by a comma and then a state name, tag it as a city/state pair. The Filter. Our filtering mechanism allows the system to ignore all sentences that have no useful meaning. Each sentence in an article is checked to see if it contains at least one word whose meaning is relevant to the domain; if so, the sentence is passed on to the parser. Words with meanings relevant to this domain included verbs indicating the development or purchase of a microelectronics capability (e.g., "transfer" or "use"); names of companies or people; and various nouns of interest (e.g., "device", "hydrofluoric", "temperature" and "DRAM"). The LINK parser. LINK is a unification-based chart parser, which parses one sentence at a time. The LINK parser applies unification grammar rules to a sentence to generate a syntactic and semantic representation. A set of principled heuristics for grammar rule application selects which grammar rules to apply. If these heuristics fail, we revert to bottom-up chart parsing. We will outline the format of the grammar and then describe our parsing strategy.</Paragraph>
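    <Paragraph> The four example Tagger rules listed earlier in this section can be sketched in Python as follows. This is an illustration only: the word lists are hypothetical, the order in which the rules are tried is an assumption, the title word in rule 3 is assumed to sit inside the capitalized run and is dropped from the person name, and rule 1 is represented by a tag of None because the text gives no tag for plain lexemes.
```python
# Hypothetical word lists; the text gives only a few examples of each kind.
ACRONYMS = {"DRAM", "Mbit"}
COMPANY_SUFFIXES = {"Corp", "Co", "Inc"}
TITLES = {"President", "Spokesman"}
STATES = {"California", "Texas", "Michigan"}


def tag_string(words, next_token=None, token_after_next=None):
    """Tag one non-empty run of capitalized words as a list of (tag, text) pairs."""
    text = " ".join(words)
    # Rule 1: known acronyms and capitalized abbreviations pass through untagged.
    if len(words) == 1 and words[0] in ACRONYMS:
        return [(None, text)]
    # Rule 2: a trailing word such as "Corp" or "Co" marks a company name.
    if words[-1].rstrip(".") in COMPANY_SUFFIXES:
        return [(":COMP-NAME", text)]
    # Rule 3: a title word in the middle splits the run into company and person.
    for i in range(1, len(words) - 1):
        if words[i] in TITLES:
            company = " ".join(words[:i])
            person = " ".join(words[i + 1:])
            return [(":COMP-NAME", company), (":PERSON-NAME", person)]
    # Rule 4: a run followed by a comma and a state name is a city/state pair.
    if next_token == "," and token_after_next in STATES:
        return [(":LOCATION", text + ", " + token_after_next)]
    # Anything else that is capitalized is treated as some other name.
    return [(":NAME", text)]
```
    For instance, under these assumptions the run "LSI Logic Corp" is tagged :COMP-NAME by rule 2, and "Austin" followed by a comma and "Texas" is tagged :LOCATION by rule 4.</Paragraph>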
    <Paragraph position="9"> The LINK grammar. LINK's grammar rules are quite similar in form to those used in PATR-II (Shieber, 1986). Semantic information resides mainly in the lexicon, along the lines of HPSG (Pollard and Sag, 1987). This organization improves the portability of the system, since the vast majority of the grammar should be applicable to other domains, while the lexicon contains most of the domain-specific information.</Paragraph>
    <Paragraph position="10"> The integration of syntactic and semantic knowledge into the same grammar formalism is crucial to our system's ability to process large texts in a reasonable length of time and to produce the semantic analysis used to generate templates.</Paragraph>
    <Paragraph position="11"> Edges are placed in the chart to represent constituents that the parser identifies. Each edge carries both syntactic and semantic information, represented in the form of a directed acyclic graph (DAG). The DAGs correspond to the information in the set of grammar rules used to build a constituent.</Paragraph>
    <Paragraph position="12"> The MUC-5 LINK parsing strategy. LINK is a bottom-up chart parser that does not use top-down constraints, so that as many partial parses as possible can be generated. Because unrestricted bottom-up chart parsing can be (and is, in our system) very inefficient, LINK uses heuristics to decide on the next edge to be entered into the chart. Many of the heuristics we use are taken from those suggested in psycholinguistic work (e.g., Ford, Bresnan, and Kaplan, 1982), although we found the need to embellish these with additional heuristics of our own (see Huyck and Lytinen, 1993, for details).</Paragraph>
    <Paragraph position="13"> The heuristics are encoded in a rule-based system. The rules are invoked each time a new edge is to be entered into the chart and more than one edge could be entered next. Each rule specifies a set of conditions under which a grammar rule should be preferred or dispreferred. Rules may specify several different types of preference levels, similar to the preferences that are used in SOAR (Laird, Rosenbloom, and Newell, 1987). Heuristics may state that one grammar rule is preferable to another under some set of circumstances (i.e., if it is possible to apply both rule a and rule b at this point, then rule a should be applied), that a rule is a good candidate, that it is a bad candidate, or that it is the best candidate (i.e., under these conditions, definitely apply this grammar rule).</Paragraph>
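    <Paragraph> One way the four preference types could be combined into a single decision is sketched below in Python. The text does not say how LINK resolves competing preferences, so the resolution order used here (best first, then eliminate bad and dominated candidates, then favor good ones) is purely an assumption, as are the function name and the tuple encoding of preferences.
```python
def choose_rule(candidates, preferences):
    """Pick one grammar rule name from candidates, or None if undecided.

    preferences is a list of tuples emitted by the heuristic rules:
    ("best", r), ("good", r), ("bad", r), or ("better", r1, r2),
    where ("better", r1, r2) means r1 should be applied rather than r2.
    """
    candidates = list(candidates)
    # A "best" preference decides immediately.
    for pref in preferences:
        if pref[0] == "best" and pref[1] in candidates:
            return pref[1]
    # Drop "bad" candidates and any candidate beaten by an applicable rival.
    rejected = set()
    for pref in preferences:
        if pref[0] == "bad":
            rejected.add(pref[1])
        elif pref[0] == "better" and pref[1] in candidates:
            rejected.add(pref[2])
    survivors = [r for r in candidates if r not in rejected]
    # Among the survivors, favor those marked "good".
    good = [r for r in survivors if ("good", r) in preferences]
    pool = good or survivors
    # Only a unique winner counts as a suggestion.
    return pool[0] if len(pool) == 1 else None
```
    Returning None when no unique winner remains models the case where the heuristics cannot suggest which rule to apply next.</Paragraph>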
    <Paragraph position="14"> Because the heuristics are incomplete, it is often the case that, at some point during the parse, they are not able to suggest which rule to apply next. When this occurs, the system performs regular undirected bottom-up parsing. This continues until a complete parse of the sentence is found, no more rules can be applied, or a maximum time limit is exceeded. If a complete parse is not found, one or more partial parses are passed on for further processing. No attempt is made to "patch" together a complete interpretation of the sentence if it is not parsed successfully.</Paragraph>
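    <Paragraph> The control flow described above (heuristic choice, undirected fallback, and the three stopping conditions) can be sketched as a small Python loop. This is a schematic illustration under stated assumptions, not the LINK implementation: the agenda representation, the callback names, the default time limit, and the choice of the first agenda item as the undirected fallback are all hypothetical.
```python
import time


def parse(agenda, is_complete, suggest, apply_edge, time_limit=5.0):
    """Run the edge-selection loop and return (status, chart).

    agenda      -- initial list of candidate edges
    is_complete -- returns True if an edge spans the whole sentence
    suggest     -- heuristic choice of the next edge, or None if undecided
    apply_edge  -- returns new candidate edges licensed by adding an edge
    """
    agenda = list(agenda)  # do not modify the caller's list
    chart = []
    deadline = time.monotonic() + time_limit
    while agenda:
        if time.monotonic() > deadline:
            return "timeout", chart  # time limit exceeded; chart holds partial parses
        edge = suggest(agenda)
        if edge is None:
            edge = agenda[0]  # heuristics undecided: undirected bottom-up step
        agenda.remove(edge)
        chart.append(edge)
        agenda.extend(apply_edge(edge, chart))
        if is_complete(edge):
            return "complete", chart
    return "partial", chart  # no more rules can be applied
```
    In both the "timeout" and "partial" cases the chart contents are what would be handed on as partial parses.</Paragraph>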
    <Paragraph position="15"> The Postprocessor. The postprocessor is responsible for assembling the semantic representations of individual sentences into a coherent representation of the entire article, and for generating the response template(s) from this overall representation. Our MUC-5 postprocessor is a two-stage, rule-based system. In the first stage, the rules transform representations produced by the parser into a canonical form. Irrelevant portions of the representation are also discarded in this first stage. In the second stage, another set of rules transforms these representations into a form that much more closely resembles the form of the response templates.</Paragraph>
    <Paragraph position="16"> A rule consists of a left-hand side (lhs), which must match (i.e., unify with) the semantic output from the parser. If the lhs matches, the representation is converted to the form specified in the right-hand side (rhs).</Paragraph>
    <Paragraph position="17"> Here are some example rules from the first stage of postprocessing:</Paragraph>
    <Paragraph position="19"> The first rule converts the representation produced for sentences such as "It was reported that ...". If the main predicate representing the sentence is REPORT, and the reported object is an ACTION, then this rule discards the REPORT predicate and replaces it with the ACTION.</Paragraph>
    <Paragraph position="20"> If the ACTION has no actor, the actor of the REPORT is filled in as its actor. Thus, the transformed representation of the sentence "LSI Logic Corp. reported that they developed ..." becomes DEVELOP, with the actor filled in as "LSI Logic Corp." The second rule transforms the representation of a sentence such as "The customer was Hampshire Instruments." Whenever the main predicate is EQUIV (our semantic representation of "to be"), and the subject (or actor) of this action is "customer", this rule converts the representation to the predicate TRANSFER-TO-CUSTOMER, the recipient of which is the complement (object) of "to be". Together, these two rules transform the representation of a sentence like "The customer is reported to be LSI Logic Corp" to the predicate TRANSFER-TO-CUSTOMER, the recipient of which is "LSI Logic Corp." The postprocessor also merges representations from separate sentences into a single template when appropriate. After the transformation rules are run, the representations of two sentences are merged if they can unify. The resulting single representation is simply the result of the unification. If the representations of sentences cannot be unified, they may produce separate templates in the response.</Paragraph>
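    <Paragraph> The merging step can be illustrated with a minimal Python sketch of unification over nested dictionaries. This is a simplification made for the example: LINK's representations are DAGs, which may share substructure, whereas plain nested dictionaries cannot. The function names, the use of None to signal failure, and the greedy left-to-right merging order are all assumptions.
```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None if the two structures conflict.
    None is therefore not usable as a leaf value in this sketch.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            if key in merged:
                sub = unify(merged[key], value)
                if sub is None:
                    return None  # conflicting values under the same feature
                merged[key] = sub
            else:
                merged[key] = value
        return merged
    return a if a == b else None


def merge_representations(representations):
    """Merge each representation into the first earlier one it unifies with."""
    templates = []
    for rep in representations:
        for i, existing in enumerate(templates):
            merged = unify(existing, rep)
            if merged is not None:
                templates[i] = merged
                break
        else:
            templates.append(rep)  # unifies with nothing so far: separate template
    return templates
```
    With this sketch, two DEVELOP representations that supply different slots collapse into one structure, while a TRANSFER-TO-CUSTOMER representation stays separate because its predicate conflicts.</Paragraph>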
  </Section>
</Paper>