<?xml version="1.0" standalone="yes"?>
<Paper uid="M95-1020">
  <Title>STERLING SOFTWARE : AN NLTOOLSET-BASED SYSTEM FOR MUC-6</Title>
  <Section position="3" start_page="0" end_page="251" type="metho">
    <SectionTitle>
SYSTEM DESIGN
</SectionTitle>
    <Paragraph position="0"> Our MUC-6 system (Figure 1) consists of 5 major components, applied in sequence: Lexical Analysis, Reduction, Extraction, Merging, Postprocessing. It was designed to share as much of the processin g sequence between tasks as possible . The processing for NE followed the identical sequence of step s (Lexical Analysis, and Reduction) as was followed for the TE and ST tasks, then diverged to its ow n Postprocessing component to write the NE file . The Reduction steps taken to identify portions of text for marking in NE also filled the slots with the appropriate text for the TE task. The processing specific to ST diverged after all the phrase-level Reductions for NE and TE had been performed .</Paragraph>
    <Paragraph position="1"> TE expectations expectationsExtraction Merging tokerReduction ~ sequen 1Extraction IST expect  The heart of the system is a sophisticated pattern-matcher, which is used repeatedly in the course o f processing to identify text for Reduction or Extraction. While the NLToolset also provides a parser, afte r some initial development we abandoned it on ATS, and did not use it on MUC-6.</Paragraph>
    <Section position="1" start_page="249" end_page="250" type="sub_section">
      <SectionTitle>
Lexical Analysis
</SectionTitle>
      <Paragraph position="0"> The Lexical Analysis component has several subcomponents. First, a tokenizer converts the input string for the entire article into a sequence of tokens . We modified the NLToolset-supplied tokenizer to try to prevent it from reordering or dropping text in ways that made it difficult to map back to the original text when writing the NE output file; we also modified it to preserve upper- vs lower-cas e information.</Paragraph>
      <Paragraph position="1"> The second step in Lexical Analysis is the actual lexicon lookup, which attaches information from th e lexicon to the tokens. This includes morphological analysis, which was useful primarily for determinin g the root form of nationalities, such as &amp;quot;Canadian&amp;quot; -&gt; CANADA. It also includes finding multi-token lexicon entries, such as &amp;quot;New York&amp;quot; and &amp;quot;Coca-Cola&amp;quot; . Since we weren't using the parser, the part-of-speech obtained by a lexical lookup was of interest mainly if it was something like city-name or orgname; we did also try to prevent the inappropriate inclusion of verbs, prepositions, etc in names, wit h mixed results .</Paragraph>
      <Paragraph position="2"> The third step in Lexical Analysis is the insertion of special marker tokens to indicate capitalize d words. This was needed to be able to usethat information in name recognition, since there did not appea r to be any good way to get the pattern matcher to use the capitalization information contained in th e original tokens .</Paragraph>
      <Paragraph position="3"> Finally, Lexical Analysis splits the token sequence into sentences, including one each for headline, dateline, and date.</Paragraph>
      <Paragraph position="4"> Reduction The Reduction components each consist of one or more stages of applying the NLToolset's pattern matcher to phrases. Any phrase matched is &amp;quot;reduced&amp;quot;, usually but not always to a single multi-token, o r &amp;quot;mtoken&amp;quot;. In each stage, all the patterns appropriate to that stage are tried on each sentence in turn. The very first reduction stage is a &amp;quot;junk&amp;quot; reduction to delete tables so they are not seen by subsequent reduction stages.</Paragraph>
      <Paragraph position="5"> Each subsequent reduction has two useful side-effects : 1) identifying which tokens form the heart of the reduction and therefore should be marked for the NE task, and 2) filling the slots of the mtokens wit h appropriate pieces of the text that was reduced, for the TE task . Note that these two purposes often conflict -- for example, city, state references and date ranges were supposed to have pieces marke d separately, but were reduced to single mtokens with one set of slot fillers . This called for some carefu l engineering.</Paragraph>
      <Paragraph position="6"> The applications of reduction patterns are done in sequence rather than all at once for a number o f reasons: First, some references to a person, organization, or location may not be recognizable b y themselves, but other references to the same thing may be easier to spot . Therefore, every new thing reduced is added to a temporary lexicon, and another reduction step is applied to look for othe r references (with certain allowed variations) to those same things ; for example, relatively easy-torecognize references to &amp;quot;Mr. Jones&amp;quot; or &amp;quot;Robert L . James&amp;quot; would enable later recognition of the more problematic &amp;quot;Barnaby Jones&amp;quot; and &amp;quot;James&amp;quot; . And when adding to this lexicon, appropriate variations in a n (organization) name are included so that they would be recognized if they occured ; for example:  When such a &amp;quot;secondary&amp;quot; organization reference is reduced, the text is put in the org_alias slot; the full form is pulled from the lexicon and put in the org_name slot to ensure proper merging (see below) of the two referents.</Paragraph>
      <Paragraph position="7"> Second, the results of reductions can be used to provide additional context for later reductions; for example, person reduction is done after organization, so a reduced organization can help the patter n matcher recognize a person, as in the token sequence [ARTIE MCDONALD , *ORG* 'S PRESIDENT] , where *ORG* is the mtoken produced by the earlier reduction . A reduction can also involve multiple previously-reduced mtokens, filling the slots of one with information from another ; for example, the reduction of the token sequence [*ORG* , A *LOC* - BASED MANUFACTURER] includes filling th e org_descriptor, org_locale, and org_country slots of *ORG* with the descriptive phrase and th e information from *LOC*.</Paragraph>
    </Section>
    <Section position="2" start_page="250" end_page="250" type="sub_section">
      <SectionTitle>
Extraction
</SectionTitle>
      <Paragraph position="0"> An Extraction component uses the results of a pattern match to generate an "expectation" and fill its slots with pieces of the text matched. For ST, a typical expectation represents an event, with the person, organization, date, etc. mtokens in the clause that was matched being used to fill its slots. For TE, each expectation is a trivial one containing one person or organization.</Paragraph>
    </Section>
    <Section position="3" start_page="250" end_page="251" type="sub_section">
      <SectionTitle>
Merging
</SectionTitle>
      <Paragraph position="0"> The NLToolset provides a merging tool, which merges expectations of the same type (person, organization, etc) as long as the fillers of their corresponding slots do not conflict; a conflict occurs if both have a filler, the fillers are different, and the slot is not allowed to have multiple fillers . Obviously, the org_alias and org_descriptor slots were allowed to have multiple fillers and org_name was not .</Paragraph>
      <Paragraph position="1"> During reduction, our system actually splits a person's name across slots called given_name , family_name, and suffix_name, so that the expectations for, say, &amp;quot;Harry L . James, Jr.&amp;quot; and &amp;quot;Mr. James&amp;quot; would be merged. It also carefully fills slots such as org_type and a few others added just for this purpose so as to prevent improper merges; for example, it reduces the token sequence [THE *ORG* UNIT] to two *ORG* mtokens, one old and one new, with slots filled so that they could not merge wit h each other .</Paragraph>
      <Paragraph position="2"> Initially, we relied on this merging tool to bring together separated org names and descriptors, such as &amp;quot;NEC Corp. ... the giant Japanese computer manufacturer&amp;quot;. We soon found, however, that even with careful use of slot fillers to prevent descriptors for commercial organizations from merging with, say, th e name of a government organization or a library, too many merges were incorrect . We therefore devised a separate stack mechanism which keeps track of the org mtokens for each sentence; when an or g descriptor is reduced in the final TE reduction stage, the stack is searched starting at the current sentence , to find the closest suitable referent that precedes the descriptor, and to add the descriptor text to th e mtoken for that referent. This approach worked quite well.</Paragraph>
    </Section>
    <Section position="4" start_page="251" end_page="251" type="sub_section">
      <SectionTitle>
Postprocessing
</SectionTitle>
      <Paragraph position="0"> For the NE task, the postprocessing step consists of traversing the token sequences in parallel with the original text, writing the original text and inserting markers as the reduction results attached to eac h token indicated . We had to go back to original text to include those portions of the article header whic h were not processed, and to recover from cases where the tokenizer had dropped characters despite our modifications.</Paragraph>
      <Paragraph position="1"> For the TE task, the postprocessing step consists of traversing the list of expectations and writing a template for each, performing final clean-ups like removing duplicate aliases, combining the person_name pieces, skipping slots used only to control merging, etc .</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="251" end_page="259" type="metho">
    <SectionTitle>
KNOWLEDGE ENGINEERING
</SectionTitle>
    <Paragraph position="0"> The bulk of the time spent in knowledge engineering was spent developing the patterns for all th e Reduction and Extraction stages . These patterns were devised to take advantage of all the loca l contextual clues we could come up with, including upper- vs lower-case information and descriptive appositives. Our results show that this approach works well ; and the modularity of the patterns makes i t easy to add coverage as we discover additional clues (such as those we discuss in the walkthrough with respect to organizations).</Paragraph>
    <Paragraph position="1"> The reliance on case information meant that headlines were a bit of a problem ; despite giving them somewhat special treatment, our error rate was higher there than elsewhere:  There was some lexicon work, as well. This included entries for all the countries, with alternat e phrases (such as &amp;quot;West Germany&amp;quot; for &amp;quot;Federal Republic of Germany&amp;quot;) and irregular derivations (such a s &amp;quot;Dutch&amp;quot; for &amp;quot;Netherlands&amp;quot;), and entries for major cities and geographical regions, with their countr y information included . For organizations, we limited it to a few dozen major ones that have no reliabl e internal clues and often occur without any contextual clues (such as &amp;quot;White House&amp;quot;, &amp;quot;Fannie Mae&amp;quot;, &amp;quot;Bi g Board&amp;quot;, &amp;quot;Coca-Cola&amp;quot; and &amp;quot;Coke&amp;quot;, &amp;quot;Macy's&amp;quot;, &amp;quot;Exxon&amp;quot;, etc).</Paragraph>
    <Paragraph position="2">  The results on the walkthrough article (see Table 1) compared to our overall results show that this wa s indeed a relatively difficult article. They show three issues worth discussing.</Paragraph>
    <Paragraph position="3"> First, we had low precision on timex . Two out of the three &amp;quot;spurious&amp;quot; dates are due to our apparentl y mistaken belief that &amp;quot;yesterday&amp;quot; and &amp;quot;tomorrow&amp;quot; were supposed to be marked . This knowledge engineering error led to the worst recall or precision number on our overall NE results, a precision o n timex of 84; avoiding that error would have raised it to 94 .</Paragraph>
    <Paragraph position="4"> Second, recall and precision on organizations was a bit low . The system missed both &amp;quot;Fallon McElligott&amp;quot; and &amp;quot;McCann-Erickson&amp;quot; . On the former, a phrase like &amp;quot;ad agency Fallon McElligott&amp;quot; woul d have caused it to be found, but the actual phrase &amp;quot;other ad agencies, such as Fallon McElligott&amp;quot; did not . On the latter, not having a pattern to cover things like &amp;quot;chief executive officer of McCann-Erickson&amp;quot; wa s an omission on our part.</Paragraph>
    <Paragraph position="5"> 25 3 Other organization errors were : getting &amp;quot;New York Times&amp;quot;, which in this article is incorrect ; missing the two descriptors for &amp;quot;Ammirati &amp; Puris&amp;quot; and the locale for &amp;quot;Coca-Cola&amp;quot; . The locale error points ou t another major cause of poor results -- a next-to-last-minute change in the final TE pattern for picking u p combination of organization name plus location and/or descriptor, inadequately tested, led t o inadvertantly dropping coverage of the most basic of combinations : [*ORG* $lprep *LOC*], where $lprep is a macro for: &amp;quot;one of ',' 'in&amp;quot;of&amp;quot;. This unfortunate error had the following effect on total locale slot scor e  Third, problems with persons . The system decided &amp;quot;McCann&amp;quot; was a person, based on &amp;quot;the McCan n family&amp;quot;; since it did not recognize &amp;quot;McCann-Erickson&amp;quot; as a company, every reference to &amp;quot;McCann&amp;quot; wa s therefore marked as a person. Due to inadequate restrictions on our use of capitalization, the system als o decided &amp;quot;While McCann&amp;quot; and &amp;quot;One McCann&amp;quot; were distinct persons . It decided that &amp;quot;John J. Dooner, Jr.&amp;quot; and &amp;quot;John Dooner&amp;quot; were distinct persons ; the &amp;quot;Jr.&amp;quot; would not have caused it to make that decision, but th e &amp;quot;J.&amp;quot; did.</Paragraph>
    <Paragraph position="6"> Now, the walkthrough .</Paragraph>
    <Paragraph position="7">  After the Lexical Analysis, the input string has been converted into a list of 52 sentences, each sentenc e containing a list of tokens; this list includes *CAP* tokens inserted in front of every capitalized token . Attached to each token is the result of the lexical lookup .</Paragraph>
    <Paragraph position="8"> Note that at this point lexical lookup has replaced the surface representation of &amp;quot;Coke&amp;quot; and &amp;quot;CEO&amp;quot; wit h their &amp;quot;canonical&amp;quot; forms. Every token contains its original string, so we can still recover it for use in filling slots.</Paragraph>
    <Paragraph position="9"> The lookup on &amp;quot;Atlanta&amp;quot; has provided the information that it is a city and that its country is the US.  Figure 3 : After Entity Reduction s The initial Reduction stages take care of money, percent, date, time, and location, then &amp;quot;secondary &amp;quot; references to location . The only things worth noting here are the &amp;quot;yesterday&amp;quot; errors already discussed , that the system decided &amp;quot;60 pounds&amp;quot; was a reference to money, and that the information in the lexica l entry for &amp;quot;Atlanta&amp;quot; was used to fill the slots of the *LOC* mtoken .</Paragraph>
    <Paragraph position="10"> The next Reduction stages take care of &amp;quot;primary&amp;quot; then &amp;quot;secondary&amp;quot; references to organizations . The primary stage picks up &amp;quot;Interpublic Group&amp;quot;, &amp;quot;PaineWebber&amp;quot;, &amp;quot;Coca-Cola&amp;quot;, &amp;quot;Coke&amp;quot;, &amp;quot;Creative Artist s Agency&amp;quot;, 'WPP Group&amp;quot;, &amp;quot;Ammirati &amp; Puris&amp;quot;, &amp;quot;New York Yacht Club&amp;quot; and &amp;quot;New York Times&amp;quot;. It misses &amp;quot;Fallon McBride&amp;quot; and &amp;quot;McCann-Erickson&amp;quot; for reasons already noted . The only reason it gets &amp;quot;PaineWebber&amp;quot;, &amp;quot;Coca-Cola&amp;quot;, and &amp;quot;Coke&amp;quot; is because they are in the lexicon ; the others are all picked up by match various patterns.</Paragraph>
    <Paragraph position="11"> In this article, the only secondary reference is &amp;quot;CAA&amp;quot; as a reference to &amp;quot;Creative Artists Agency&amp;quot; . While the system does manufacture acronyms as potential secondary references when certainpattems match , the pattern which enabled it to determine that &amp;quot;Creative Artists Agency&amp;quot; was a commercial organizatio n was unfortunately not one of them .</Paragraph>
    <Paragraph position="12">  The next Reduction stages take care of &amp;quot;primary&amp;quot; then &amp;quot;secondary&amp;quot; references to persons . The primary stage picks up &amp;quot;James&amp;quot;, &amp;quot;John Dooner&amp;quot;, &amp;quot;Kevin Goldman&amp;quot;, &amp;quot;Robert L . James&amp;quot;, &amp;quot;John J. Dooner, Jr .&amp;quot;, &amp;quot;Mr . James&amp;quot;, &amp;quot;Mr. Dooner&amp;quot;, &amp;quot;Alan Gottesman&amp;quot;, &amp;quot;Peter Kim&amp;quot;, &amp;quot;Walter Thompson&amp;quot;, &amp;quot;Marti n Puris&amp;quot;, and (alas) &amp;quot;McCann&amp;quot; . These are found on the strength of titles like &amp;quot;Mr .&amp;quot; and &amp;quot;Sen.&amp;quot;, known first names, and contextual clues such as known occupations like &amp;quot;president&amp;quot;, &amp;quot;analyst&amp;quot;, etc . &amp;quot;James&amp;quot; in th e headline is found because it follows &amp;quot;succeed&amp;quot;; &amp;quot;McCann&amp;quot; is found because of &amp;quot;McCann family&amp;quot; . The secondary stage picks up all remaining references to &amp;quot;McCann&amp;quot; . Since &amp;quot;McCann-Erickson&amp;quot; was not recognized as an organization, all those occurrences are picked up, too . And since we failed to make adverbs off-limits as new first names in this stage, it decides that &amp;quot;While McCann&amp;quot; and &amp;quot;One McCann&amp;quot; (note the capitalization) are distinct persons .</Paragraph>
    <Paragraph position="13">  Now, NE and TE processing diverge. For NE, the system uses the original text of the article to write a copy. It traverses the token sequences in parallel with the original text, using the fact that each token contains information on all the reductions it was involved in to determine where to insert begin and en d brackets. It only pays attention to the final reduction except in the case of locations inside money, wher e brackets are inserted for both.</Paragraph>
    <Paragraph position="14">  For TE, there is one final Reduction stage to take care of organization descriptors and locations . Here, the system finds descriptors &amp;quot;the big Hollywood talent agency&amp;quot; and &amp;quot;a hot agency&amp;quot;, but not &amp;quot;a qualit y operation&amp;quot; and &amp;quot;the agency with billings of $400 million&amp;quot; . The former omission was deliberate, due to too many spurious matches when it was included; the latter was a construct we did not think to include . In  cases where the descriptor is an appositive, the referenced organization is included in the pattern match ; otherwise, if the appositive is a definite reference, the stack of organization references is searched for th e putative antecedant. In either case, the descriptor and locale information (if any) is inserted into slots o f the organization mtoken . In retrospect, including indefinite references that are not appositives appears to have been the wrong thing to do.</Paragraph>
    <Paragraph position="15"> Then there is the trivial Extraction step which turns the organization and person mtokens int o &amp;quot;expectations&amp;quot;. This is followed by the Merging step which merges expectations together wherever possible. This includes merging the expectations for &amp;quot;James&amp;quot;, &amp;quot;Robert L . James&amp;quot;, &amp;quot;Mr. James&amp;quot; (several occurences); &amp;quot;Coca-Cola&amp;quot;, &amp;quot;Coke&amp;quot; ; etc.</Paragraph>
    <Paragraph position="16">  Finally, the Postprocessing step writes each expectation to the TE result file, making final adjustments to the slot fillers as needed .</Paragraph>
    <Paragraph position="18"/>
    <Paragraph position="20"> ORG_DESCRIPTOR: &amp;quot;the big Hollywood talent agency&amp;quot;</Paragraph>
    <Paragraph position="22"/>
  </Section>
class="xml-element"></Paper>