<?xml version="1.0" standalone="yes"?>
<Paper uid="M92-1027">
  <Title>HUGHES RESEARCH LABORATORIES: DESCRIPTION OF THE TRAINABLE TEXT SKIMMER USED FOR MUC-4</Title>
  <Section position="3" start_page="0" end_page="189" type="metho">
    <SectionTitle>
THE TTS APPROACH
</SectionTitle>
    <Paragraph position="0"> There are two aspects of TTS. The first is system training, the process of deriving or identifying the phrases which are used in the training corpus. The second is text skimming, the process of skimming a text to identify whether it falls into a specific category and, if so, extracting particular pieces of information from it. Figure 1 illustrates the key components of both aspects of the Hughes Trainable Text Skimmer. Training involves deriving the set of phrases used to generate the templates associated with a particular corpus of texts, and generating training features which map features to the actual text in the stories.</Paragraph>
    <Paragraph position="1"> Text skimming involves the Text Database, databases containing derived phrases and training features, a Phrasal Parser, the Classifier, and a Feature-to-Template Process Model.</Paragraph>
  </Section>
  <Section position="4" start_page="189" end_page="189" type="metho">
    <SectionTitle>
TTS Module Descriptions
</SectionTitle>
    <Paragraph position="0"> The training processor uses the provided templates to derive the phrase lexicon and build the features used by the pattern classifier. Phrases are generated from the fillers for the template slots. Features are generated from the sentences that provided the fillers. The word lexicon is generated by performing a word frequency analysis on the raw text. In TTS-MUC4, all words that occur between 10 and 105 times are included in the lexicon.</Paragraph>
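The frequency-band selection of the word lexicon can be sketched as follows. This is only an illustrative sketch, not the TTS-MUC4 implementation: the function name, the tokenization rule, and the reading of "between 10 and 105" as an inclusive range are all assumptions.

```python
import re
from collections import Counter


def build_word_lexicon(texts, min_count=10, max_count=105):
    """Return the words whose corpus frequency lies in [min_count, max_count].

    The inclusive bounds and the tokenizer are assumptions made for
    illustration; the source only gives the band 10 to 105.
    """
    counts = Counter()
    for text in texts:
        # Hypothetical tokenization: runs of letters, upper-cased,
        # since the MUC-4 stories are in upper case.
        counts.update(re.findall(r"[A-Z]+", text.upper()))
    return {w for w, n in counts.items() if n >= min_count and max_count >= n}


if __name__ == "__main__":
    corpus = ["BOMB " * 12 + "THE " * 200 + "RARE"]
    print(sorted(build_word_lexicon(corpus)))
```

Dropping both very rare and very frequent words keeps the lexicon small while excluding function words, which is one plausible motivation for a two-sided band.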
    <Paragraph position="1"> The text database contains: (1) the database of training stories, (2) the database of testing stories, (3) the database of training templates for user browsing, and (4) the database of parsed templates for use during training. It supports retrieval of a single fragment of text from a large collection of texts that may be spread over multiple disk files. The values retrieved from the text database fall into three categories. A raw text string can be retrieved for processing a story or browsing templates. A recursive token structure, representing an entire story, with individual</Paragraph>
  </Section>
  <Section position="5" start_page="189" end_page="192" type="metho">
    <SectionTitle>
[Figure 1 component labels: SENTENCE CLASSIFICATION, TOPIC GROUPING, DATE PROCESSING, LOCATION PROCESSING, TEMPLATE GENERATION, TEMPLATES, SENTENCE FEATURES, RELEVANT, PREVIOUS FEATURES, PATTERN CLASSIFIER, ASSOCIATIVE LOOKUP, FEATURE MATCHING, BAYESIAN CLASSIFIERS]
</SectionTitle>
    <Paragraph position="0"> words at the leaf nodes, can also be returned. In the third case, an s-expression representing a parsed template can be retrieved for further use by the pattern classifier.</Paragraph>
    <Paragraph position="1"> The phrasal parser is a fast, shallow, conceptual parser. The parser accepts a token structure, a lexical hierarchy, and a phrase-pattern set. The parser returns an ordered list of text features. A text feature includes: (1) a member of the concept hierarchy, (2) the string covered by the phrase, and (3) a recursive token structure spanning the tokens covered by the phrase.</Paragraph>
    <Paragraph position="2"> Lexicon entries are created by adding word stems to a concept hierarchy as follows:
(ks:isa h-lex &quot;PRIEST&quot; :religious-individual-w)
(ks:isa h-lex &quot;MISSIONARY&quot; :religious-individual-w)
(ks:isa h-lex &quot;CONFERENCE&quot; :conference-w)
(ks:isa h-lex &quot;SUMMIT&quot; :conference-w)
(ks:isa h-lex &quot;RECEPTION&quot; :conference-w)
Phrasal pattern definitions have three parts: the discrimination net to which the pattern belongs, a list of the pattern components, and the pattern descriptor. Phrasal patterns may reference either elements of the concept hierarchy or specific words. Pattern definitions have the syntax illustrated in the following examples:</Paragraph>
    <Paragraph position="4"> The features are extracted using a depth-first search of the patterns, with a preference for patterns that have specific words over those which have only concept names, as well as a preference for longer patterns.</Paragraph>
    <Paragraph position="5"> The pattern classifier consists of a separate classifier for each template slot to be filled.</Paragraph>
    <Paragraph position="6"> Each such classifier takes an ordered list of text features and returns the probability of a set fill or a string fill, based upon all previously used fills for that particular slot.</Paragraph>
    <Paragraph position="7"> The feature-to-template process model has the task of identifying which features of the test text are relevant to each template slot. It also has the task of generating a completed story template for each relevant topic identified in the story.</Paragraph>
    <Paragraph position="8"> Flow of Control. Once an initial training phase has been completed to initialize the pattern classifiers, the feature-to-template process model performs its task in four phases: (1) pattern classification, (2) topic grouping, (3) slot filling, and (4) template generation. During the first phase, TTS-MUC4 iterates over all sentences in the text and collects potential topics. The second phase consists of determining which topics are relevant, and eliminating from further consideration the sentences having no bearing on the chosen topics. In phase 3, the values to be supplied in the completed template(s) are extracted and/or computed from the relevant sentences, based upon the focus of each selected topic. Lastly, the standard MUC-4 templates are generated.</Paragraph>
    <Section position="1" start_page="190" end_page="190" type="sub_section">
      <SectionTitle>
Pattern Classification
</SectionTitle>
      <Paragraph position="0"> For each sentence of a story, a set of Bayesian classifiers is used. For set fills, the classifiers compute the probabilities of the different potential set fills. For string fills, the classifiers compute the probability that a particular phrase is, for example, a human target. (See the Site Report section of this volume for a detailed description of TTS-MUC4, including details on the Bayesian classifiers.)</Paragraph>
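As a rough illustration of how a per-slot Bayesian classifier might turn sentence features into set-fill probabilities, the sketch below uses a Bernoulli naive Bayes model with add-one smoothing. That model choice is an assumption: the details of the TTS-MUC4 classifiers are given only in the Site Report. The class name and the feature :explosive-w are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSlotClassifier:
    """Bernoulli naive Bayes over binary sentence features (assumed model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant (assumption)
        self.class_counts = Counter()           # label -> number of training sentences
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocabulary = set()

    def train(self, examples):
        """examples: iterable of (list of features, set-fill label)."""
        for features, label in examples:
            self.class_counts[label] += 1
            for f in set(features):
                self.feature_counts[label][f] += 1
                self.vocabulary.add(f)

    def posteriors(self, features):
        """Return a dict label -> P(label | features), normalized to sum to 1."""
        total = sum(self.class_counts.values())
        present = set(features)
        log_scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for f in self.vocabulary:
                p = (self.feature_counts[label][f] + self.alpha) / (n + 2 * self.alpha)
                score += math.log(p if f in present else 1.0 - p)
            log_scores[label] = score
        top = max(log_scores.values())
        unnormalized = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(unnormalized.values())
        return {k: v / z for k, v in unnormalized.items()}


if __name__ == "__main__":
    clf = NaiveBayesSlotClassifier()
    clf.train([
        ([":death-w", ":terrorist-act-org"], "ATTACK"),
        ([":death-w"], "ATTACK"),
        ([":explosive-w"], "BOMBING"),
        ([":explosive-w", ":death-w"], "BOMBING"),
    ])
    print(clf.posteriors([":death-w", ":terrorist-act-org"]))
```

One such classifier would be trained per template slot, matching the one-classifier-per-slot arrangement described for the pattern classifier.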
    </Section>
    <Section position="2" start_page="190" end_page="190" type="sub_section">
      <SectionTitle>
Topic Grouping and Relevance Assessment
</SectionTitle>
      <Paragraph position="0"> Topic grouping (analogous to discourse processing) is based on the INCIDENT: TYPE slot. The weight for each type of incident is computed for every sentence. The weights are then passed through a competitive filter, resulting in binary signals. The competitive filter first normalizes the topic weights using a Gaussian mask on a sentence-by-sentence basis, then computes the best topic. A topic is a set of contiguous sentences with the same computed value for INCIDENT: TYPE.</Paragraph>
      <Paragraph position="1"> Figure 2 shows the inputs and outputs to the topic grouping process. Note that moderately high evidence of kidnapping throughout the story is suppressed in favor of the bombing interpretation, which turns out to be correct. The filter used in topic grouping is designed to pick out signals that are high but that &quot;drop out&quot; from time to time, as one can see in the smoothing over the arson signal.</Paragraph>
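One possible reading of the competitive filter is sketched below: each incident type's per-sentence weights are smoothed with a normalized Gaussian mask, the highest smoothed weight wins each sentence, and contiguous runs of the same winner form topics. The mask radius, sigma, zero padding at the story boundaries, and all function names are assumptions, not details taken from TTS-MUC4.

```python
import math


def gaussian_mask(radius=2, sigma=1.0):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, mask):
    """Convolve a per-sentence signal with the mask (zero padding at the ends)."""
    radius = len(mask) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(mask):
            j = i + k - radius
            if j >= 0 and len(signal) > j:
                acc += w * signal[j]
        out.append(acc)
    return out


def competitive_filter(weights_by_type, radius=2, sigma=1.0):
    """Return the winning incident type for every sentence."""
    mask = gaussian_mask(radius, sigma)
    smoothed = {t: smooth(sig, mask) for t, sig in weights_by_type.items()}
    n = len(next(iter(smoothed.values())))
    return [max(smoothed, key=lambda t: smoothed[t][i]) for i in range(n)]


def group_topics(winners):
    """Collapse contiguous equal winners into (type, first, last) triples."""
    topics = []
    for i, t in enumerate(winners):
        if topics and topics[-1][0] == t:
            topics[-1] = (t, topics[-1][1], i)
        else:
            topics.append((t, i, i))
    return topics


if __name__ == "__main__":
    weights = {
        "BOMBING": [0.9, 0.8, 0.1, 0.9, 0.8],     # drops out at sentence 2
        "KIDNAPPING": [0.4, 0.4, 0.4, 0.4, 0.4],  # moderate throughout
    }
    print(group_topics(competitive_filter(weights)))
```

In the toy input, an unsmoothed per-sentence maximum would hand sentence 2 to KIDNAPPING; the smoothing bridges the drop-out so the whole story stays one BOMBING topic.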
    </Section>
    <Section position="3" start_page="190" end_page="192" type="sub_section">
      <SectionTitle>
Slot Filling
</SectionTitle>
      <Paragraph position="0"> Slot filling consists of five parts: (1) pure set fills, (2) string fills, (3) cross-referenced slots, (4) date extraction, and (5) location extraction. The first three parts consider only relevant sentences. A relevant sentence shares the same topic with the previous sentence or contains no competing topic. There are two distinct types of processing for slot filling. Two slots, date and location, are filled by domain-specific procedures. The remaining slots are filled using hypotheses returned by the pattern classifier. The slots filled from the pattern classifier fall into three categories:  1. Set fills--Pure set fills are computed by averaging the weights over all sentences for a given topic, and picking the highest score.</Paragraph>
      <Paragraph position="1"> 2. String fills--String fills are computed in a similar manner to set fills. They differ in that the suggestions returned by the classifier are subject to a threshold on the weights. For the official run of TTS-MUC4 the string fill threshold was set at 0.75.</Paragraph>
      <Paragraph position="2"> 3. Cross reference generation--Cross reference generation is performed by choosing the most likely tag (as suggested by the classifier) for the sentence that contains the string fill.</Paragraph>
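The set-fill and string-fill rules above can be sketched as follows. The function names and input shapes are hypothetical, and treating a weight equal to the threshold as passing is an assumption; only the averaging rule and the 0.75 threshold come from the text.

```python
def pure_set_fill(sentence_weights):
    """Average each candidate's weight over the topic's sentences; return the best.

    sentence_weights: list of dicts (candidate -> weight), one per relevant
    sentence. A candidate missing from a sentence counts as weight 0.
    """
    if not sentence_weights:
        return None
    totals = {}
    for weights in sentence_weights:
        for candidate, w in weights.items():
            totals[candidate] = totals.get(candidate, 0.0) + w
    n = len(sentence_weights)
    averages = {c: t / n for c, t in totals.items()}
    return max(averages, key=averages.get)


def string_fills(sentence_suggestions, threshold=0.75):
    """Keep suggested phrases whose weight reaches the threshold, in order seen."""
    kept = []
    for suggestions in sentence_suggestions:
        for phrase, w in suggestions.items():
            if w >= threshold and phrase not in kept:
                kept.append(phrase)
    return kept


if __name__ == "__main__":
    print(pure_set_fill([{"ATTACK": 0.8, "BOMBING": 0.1},
                         {"ATTACK": 0.4, "BOMBING": 0.5}]))
    print(string_fills([{"ROBERTO GARCIA ALVARADO": 0.9,
                         "JOSE NAPOLEON DUARTE": 0.6},
                        {"VEHICLE": 0.75}]))
```

Lowering the threshold admits more string fills at the cost of more spurious ones, which is the trade-off behind the overgeneration discussed for TST-MUC4-0048.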
      <Paragraph position="3">  For date extraction, all sentences within a topic are scanned for absolute or relative date references. Absolute date references are combined into a range. Absolute dates are preferred over relative dates within a given sentence. Relative date references are interpreted with respect to either the current date specification for a story (if one has been found) or the story date line.</Paragraph>
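A minimal sketch of resolving relative date references against a reference date, and of combining absolute dates into a range, is given below. The handful of recognized phrases, the function names, and the reference date in the example are illustrative assumptions; the actual TTS-MUC4 date grammar is not described here.

```python
import re
from datetime import date, timedelta


def resolve_relative_date(phrase, reference):
    """Resolve a relative reference against the story's reference date.

    Only a few hypothetical phrase forms are handled; anything else
    yields None.
    """
    text = phrase.upper().strip()
    if text == "TODAY":
        return reference
    if text == "YESTERDAY":
        return reference - timedelta(days=1)
    match = re.fullmatch(r"(\d+) DAYS? AGO", text)
    if match:
        return reference - timedelta(days=int(match.group(1)))
    return None


def absolute_range(dates):
    """Combine absolute dates into an (earliest, latest) range."""
    return (min(dates), max(dates)) if dates else None


if __name__ == "__main__":
    dateline = date(1989, 4, 19)  # hypothetical story date line
    print(resolve_relative_date("5 DAYS AGO", dateline))
    print(absolute_range([date(1989, 4, 14), date(1989, 4, 12)]))
```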
      <Paragraph position="4">  For location extraction, all sentences within a topic are scanned for known location names. The resulting list of location names is then searched for a maximal, legal, location containment chain.</Paragraph>
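The search for a maximal containment chain might look like the sketch below. It assumes a gazetteer that maps each place to its immediate container and treats a chain as legal only when every link is itself a mentioned name; both the gazetteer format and that definition of legality are assumptions made for illustration.

```python
def longest_containment_chain(mentioned, parent_of):
    """Return the longest chain of mentioned places, outermost first.

    mentioned: location names found in the topic's sentences.
    parent_of: hypothetical gazetteer mapping a place to its container.
    """
    names = set(mentioned)
    best = []
    for name in sorted(names):  # sorted for a deterministic result on ties
        chain = [name]
        parent = parent_of.get(name)
        # Climb while the container was also mentioned (guarding against cycles).
        while parent in names and parent not in chain:
            chain.append(parent)
            parent = parent_of.get(parent)
        if len(chain) > len(best):
            best = chain
    return list(reversed(best))


if __name__ == "__main__":
    gazetteer = {"SAN SALVADOR": "EL SALVADOR", "LIMA": "PERU"}
    found = ["SAN SALVADOR", "EL SALVADOR", "LIMA"]
    print(": ".join(longest_containment_chain(found, gazetteer)))
```

Joining the chain with colons gives the outermost-first layout used in the INCIDENT: LOCATION slot of the MUC-4 templates.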
    </Section>
  </Section>
  <Section position="6" start_page="192" end_page="192" type="metho">
    <SectionTitle>
TST-MUC4-0048 ANALYSIS
</SectionTitle>
    <Paragraph position="0"> A detailed look at the processing of TST-MUC4-0048 provides insight into TTS-MUC4. Upon reading the first sentence of TST-MUC4-0048:</Paragraph>
  </Section>
  <Section position="7" start_page="192" end_page="193" type="metho">
    <SectionTitle>
SALVADORAN PRESIDENT-ELECT ALFREDO CRISTIANI CONDEMNED THE
TERRORIST KILLING OF ATTORNEY GENERAL ROBERTO GARCIA
ALVARADO AND ACCUSED THE FARABUNDO MARTI NATIONAL LIBERATION
</SectionTitle>
    <Paragraph position="0"> TTS-MUC4 identifies each potentially relevant template slot, and hypothesizes possible values, based upon similar sentences which TTS-MUC4 has retrieved from case memory. In this case, for example, the system chooses :attack as the most likely :incident-type, based upon the following calculated likelihoods for each topic associated with the first sentence of -0048:</Paragraph>
    <Paragraph position="2"> Similar likelihoods are calculated for each sentence.</Paragraph>
    <Paragraph position="3"> Based on semantic features such as :death-w, :government-official-or-legal-or-judicial-descr, and :terrorist-act-org, the Bayesian classifier for INCIDENT-TYPE computes the probabilities above. Likewise, using features such as :death-w and :terrorist-act-org, another Bayesian classifier computes the probability that &quot;ROBERTO GARCIA ALVARADO&quot; is a human target.</Paragraph>
    <Paragraph position="4">  Processing proceeds in a like manner for the next 6 sentences of the story, and the template shown in figure 3 is produced.</Paragraph>
    <Paragraph position="5">
0. MESSAGE: ID TST2-MUC4-0048
1. MESSAGE: TEMPLATE 1
2. INCIDENT: DATE 01 JUNE 1988
3. INCIDENT: LOCATION EL SALVADOR: SAN SALVADOR (DEPARTMENT)
4. INCIDENT: TYPE ATTACK
5. INCIDENT: STAGE OF EXECUTION ACCOMPLISHED
6. INCIDENT: INSTRUMENT ID *
7. INCIDENT: INSTRUMENT TYPE *
8. PERP: INCIDENT CATEGORY TERRORIST ACT
9. PERP: INDIVIDUAL ID &quot;URBAN GUERRILLAS&quot;
10. PERP: ORGANIZATION ID &quot;NATIONALIST REPUBLICAN ALLIANCE&quot;
11. PERP: ORGANIZATION CONFIDENCE SUSPECTED OR ACCUSED: &quot;NATIONALIST REPUBLICAN ALLIANCE&quot;
12. PHYS TGT: ID &quot;VEHICLE&quot;
13. PHYS TGT: TYPE OTHER: &quot;VEHICLE&quot;
14. PHYS TGT: NUMBER 1: &quot;VEHICLE&quot;
15. PHYS TGT: FOREIGN NATION *
16. PHYS TGT: EFFECT OF INCIDENT DESTROYED: &quot;VEHICLE&quot;
17. PHYS TGT: TOTAL NUMBER 1
18. HUM TGT: NAME &quot;GARCIA ALVARADO&quot;
19. HUM TGT: DESCRIPTION &quot;DEMOCRAT&quot;: &quot;JOSE NAPOLEON DUARTE&quot;
    &quot;ATTORNEY GENERAL&quot;: &quot;ROBERTO GARCIA ALVARADO&quot;
20. HUM TGT: TYPE CIVILIAN: &quot;ROBERTO GARCIA ALVARADO&quot;
The most notable feature of this template fill is overgeneration. TTS-MUC4 correctly identifies one human target and one physical target, but one other person, JOSE NAPOLEON DUARTE, and several coreferents are also generated.</Paragraph>
    <Paragraph position="6"> Sentences 11 through 13 of TST-MUC4-0048:</Paragraph>
  </Section>
  <Section position="8" start_page="193" end_page="194" type="metho">
    <SectionTitle>
GUERRILLAS ATTACKED MERINO'S HOME IN SAN SALVADOR 5 DAYS AGO
WITH EXPLOSIVES. THERE WERE SEVEN CHILDREN, INCLUDING FOUR
OF THE VICE PRESIDENT'S CHILDREN, IN THE HOME AT THE TIME.
</SectionTitle>
    <Paragraph position="0"> A 15-YEAR-OLD NIECE OF MERINO'S WAS INJURED.</Paragraph>
    <Paragraph position="1"> along with the preceding 2 sentences and the subsequent sentence 14, produce the template shown in figure 4.  The most notable feature of these template fills is the fragmentation of the string fills. Many of the correct features are tagged, but, because TTS only has a shallow parser, the phrases &quot;CHILDREN&quot; and &quot;VICE PRESIDENT'S&quot; are never combined into the correct answer &quot;VICE PRESIDENT'S CHILDREN.&quot; In addition to the two templates above, TTS-MUC4 also produces a spurious template based on sentences 21 through 22:</Paragraph>
  </Section>
</Paper>