<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1033">
  <Title>UNIVERSITY OF MASSACHUSETTS: DESCRIPTION OF THE CIRCUS SYSTEM AS USED FOR MUC-3</Title>
  <Section position="3" start_page="223" end_page="223" type="metho">
    <SectionTitle>
SYSTEM COMPONENTS
</SectionTitle>
    <Paragraph position="0"> Although CIRCUS was the primary workhorse underlying our MUC-3 effort, it was necessary to augment CIRCUS with a separate component that would receive CIRCUS output and massage that output into the final target template instantiations required for MUC-3. This phase of our processing came to be known as consolidation, although it corresponds more generally to what many people would call discourse analysis. We will describe both CIRCUS and our consolidation processing with examples from TST1-MUC3-0099. (Please consult Appendix H for the complete text of TST1-MUC3-0099.) A flow chart of our complete system is given in Figure 1.</Paragraph>
    <Section position="1" start_page="223" end_page="223" type="sub_section">
      <SectionTitle>
Sentence Preprocessing
</SectionTitle>
      <Paragraph position="0"> To begin, each sentence is given to our preprocessor where a number of domain-specific modifications are made. (1) Dates are analyzed and translated into a canonical form. (2) Words associated with our phrasal lexicon are connected via underscoring. (3) Punctuation marks are translated into atoms more agreeable to LISP. For example, the first sentence (S1) reads:</Paragraph>
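      <Paragraph> The following Common Lisp fragment is a minimal sketch of steps (2) and (3), assuming a token-list representation of the sentence; the names *PHRASAL-LEXICON*, JOIN-PHRASES, TRANSLATE-PUNCTUATION, and PREPROCESS-SENTENCE are illustrative and are not the actual CIRCUS preprocessor:
(defparameter *phrasal-lexicon*
  '(("SOVIET" "UNION"))   ; stand-in for the 831 real entries
  "Multi-word proper names to be joined by underscores.")

(defun phrase-match-p (phrase tokens)
  "True when TOKENS begins with the multi-word PHRASE."
  (and (&lt;= (length phrase) (length tokens))
       (every #'string= phrase tokens)))   ; EVERY stops at the shorter list

(defun join-phrases (tokens)
  "Replace each known phrase with a single underscored token."
  (loop while tokens
        collect (let ((match (find-if (lambda (p) (phrase-match-p p tokens))
                                      *phrasal-lexicon*)))
                  (if match
                      (prog1 (format nil "~{~A~^_~}" match)
                        (setf tokens (nthcdr (length match) tokens)))
                      (pop tokens)))))

(defun translate-punctuation (token)
  "Map punctuation marks onto atoms that are friendlier to LISP."
  (cond ((string= token ",") "&gt;CO")
        ((string= token ".") "&gt;PE")
        (t token)))

(defun preprocess-sentence (tokens)
  "Phrases first, then punctuation; date canonicalization is omitted here."
  (mapcar #'translate-punctuation (join-phrases tokens)))

;; (preprocess-sentence '("THE" "SOVIET" "UNION" "."))
;;   =&gt; ("THE" "SOVIET_UNION" "&gt;PE")
</Paragraph>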
    </Section>
  </Section>
  <Section position="4" start_page="223" end_page="223" type="metho">
    <SectionTitle>
S1: POLICE HAVE REPORTED THAT TERRORISTS TONIGHT BOMBED THE EMBASSIES OF THE
PRC AND THE SOVIET UNION.
</SectionTitle>
    <Paragraph position="0"> After preprocessing, we have:</Paragraph>
  </Section>
  <Section position="5" start_page="223" end_page="224" type="metho">
    <SectionTitle>
S1: (POLICE HAVE REPORTED THAT TERRORISTS ON OCT_25_89 &gt;CO TONIGHT BOMBED THE
EMBASSIES OF THE PRC AND THE SOVIET UNION &gt;PE)
</SectionTitle>
    <Paragraph position="0"> The canonical date was derived from "tonight" and the dateline of the article, "25 OCT 89." Most of our phrasal lexicon is devoted to proper names describing locations and terrorist organizations (831 entries). Another 25 proper names are recognized as well, but not via the phrasal lexicon.</Paragraph>
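    <Paragraph> A minimal sketch of this kind of dateline-relative resolution, assuming a symbolic month and numeric day and year; CANONICAL-DATE is an invented name and only a few relative words are handled:
(defun canonical-date (relative-word dateline-month dateline-day dateline-year)
  "Return a canonical date atom such as OCT_25_89, or NIL if unhandled."
  (let ((day (case relative-word
               ((tonight today) dateline-day)
               (yesterday (1- dateline-day))   ; naive: ignores month boundaries
               (t nil))))
    (when day
      (intern (format nil "~A_~D_~D" dateline-month day dateline-year)))))

;; "tonight" with dateline "25 OCT 89":
;; (canonical-date 'tonight 'oct 25 89)  =&gt;  OCT_25_89
</Paragraph>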
    <Section position="1" start_page="224" end_page="224" type="sub_section">
      <SectionTitle>
Lexical Analysis
</SectionTitle>
      <Paragraph position="0"> At this point we are ready to hand the sentence to CIRCUS for lexical processing. This is where we search our dictionary and apply morphological analysis in an effort to recognize words in the sentence.</Paragraph>
      <Paragraph position="1"> Any words that are not recognized receive a default tag reserved for proper nouns in case we need to make sense out of unknown words later. In order for any semantic analysis to take place, we need to recognize a word that operates as a trigger for a concept node definition. If a sentence contains no concept node triggers, it is ignored by the semantic component. This is one way that irrelevant texts can be identified: texts that trigger no concept nodes are deemed irrelevant. Words in our dictionary are associated with a syntactic part of speech, a position or positions within a semantic feature hierarchy, possible concept node definitions if the item operates as a concept node trigger, and syntactic complement predictions. Concept nodes and syntactic complement patterns will be described in the next section. An example of a dictionary entry with all four entry types is our definition for "dead" as seen in Figure 2.</Paragraph>
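      <Paragraph> The following sketch illustrates lookup with a proper-noun default, assuming a hash-table dictionary of property lists; *DICTIONARY* and LOOKUP-WORD are invented names, and the crude -ED stripping stands in for the real morphological analyzer:
(defparameter *dictionary* (make-hash-table :test #'equal))

(setf (gethash "BOMB" *dictionary*)
      '(:syntactic-type verb))          ; toy entry for the example

(defun lookup-word (word)
  "Find WORD's entry, trying an -ED morphology strip before giving
unknown words the default proper-noun tag."
  (or (gethash word *dictionary*)
      (let ((len (length word)))
        (when (and (&gt; len 3) (string= word "ED" :start1 (- len 2)))
          (gethash (subseq word 0 (- len 2)) *dictionary*)))
      '(:syntactic-type proper-noun)))   ; default tag for unknown words

;; (lookup-word "BOMBED")   =&gt; (:SYNTACTIC-TYPE VERB)
;; (lookup-word "ZAMORA")   =&gt; (:SYNTACTIC-TYPE PROPER-NOUN)
</Paragraph>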
    </Section>
  </Section>
  <Section position="6" start_page="224" end_page="227" type="metho">
    <SectionTitle>
(D-WORD DEAD
  :SYNTACTIC-TYPE SPECIAL-ADJECTIVE
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> When morphological routines are used to strip an inflected or conjugated form back to its root, the root-form dictionary definition is dynamically modified to reflect the morphological information. For example, the root definition for "bomb" will pick up a :VERB-FORM slot with PAST filling it when the lexical item "bombed" is encountered.</Paragraph>
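    <Paragraph> A minimal sketch of this dynamic modification, assuming property-list dictionary entries; ADD-VERB-FORM is an invented helper:
(defun add-verb-form (root-entry form)
  "Return ROOT-ENTRY augmented with a :VERB-FORM slot."
  (list* :verb-form form root-entry))

;; Seeing "bombed" strips back to the root "bomb" and marks it PAST:
;; (add-verb-form '(:syntactic-type verb) 'past)
;;   =&gt; (:VERB-FORM PAST :SYNTACTIC-TYPE VERB)
</Paragraph>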
    <Paragraph position="3"> Semantic and Syntactic Predictions
Words associated with concept nodes activate both syntactic and semantic predictions. In S1 the verb "bombed" activates semantic predictions in the form of a concept node designed to describe a bombing. Each concept node describes a semantic case frame with variable slots that expect to be filled by specific syntactic constituents. The concept node definition activated by "bombed" in S1 is given in Figure 3.</Paragraph>
    <Paragraph position="4"> We can see from this definition that a case frame with variable slots for the actor and target is predicted. The actor slot expects to be filled by an organization, the name of a recognized terrorist or generic terrorist referent, a proper name, or any reference to a person. The target slot expects to be filled by a physical target. We also expect to locate the actor in the subject of the sentence, and the target should appear as either a direct object or the object of a prepositional phrase containing the preposition "in." None of these predictions will be activated unless the current sentence is in the active voice. The developer comments attached to this definition read:
;; X bombed/dynamited/blew_up
;; the bomb blew up in the building (tst1-0040) -emr
;; (if this causes trouble we can create a new cn for blew_up)
Syntactic complement predictions are managed by a separate mechanism that operates independently of the concept nodes. The syntactic predictions fill syntactic constituent buffers with appropriate sentence fragments that can be used to instantiate various concept node case frames. Syntactic predictions are organized in decision trees using test-action pairs under a stack-based control structure [6]. Although syntactic complements are commonly associated with verbs (verb complements), we have found that nouns should be used to trigger syntactic complement predictions with equal frequency. Indeed, any part of speech can trigger a concept node and associated complement predictions as needed. As we saw in the previous section, the adjective "dead" is associated with syntactic complement predictions to facilitate noun phrase analysis. Figure 4 shows the syntactic complement pattern predicted by "bombed" once morphology recognizes the root verb "bomb."
(((test (second-verb-or-infinitive?))   ;; all verbs call this function
  (assign *part-of-speech* 'verb
          ;; just to be sure...reset some buffers for noun phrase collection
          *np-flag* nil *noun-group* nil *predicates* nil
          *entire-noun-group* nil *determiners* nil
          *appositive* nil *gerund* nil</Paragraph>
    <Paragraph position="6">
;; next noun phrase should be the direct object
((test (equal *part-of-speech* 'noun-phrase))
 (assign *DO* *cd-form*))
;; don't predict *DO* if conjunction follows the verb,
;; e.g., in "X was damaged and Y was destroyed",
;; Y should NOT be *DO* of "damaged"
((test (equal *part-of-speech* 'conjunction))))))
Remarkably, Figure 4 displays all the syntactic knowledge CIRCUS needs to know about verbs. Every verb in our dictionary references this same prediction pattern. In particular, this means that we have found no need to distinguish transitive verbs from intransitive verbs, since this one piece of code handles both (if the prediction for a direct object fails, the *DO* buffer remains empty). Once semantic and syntactic predictions have interacted to produce a set of case frame slot fillers, we then create a frame instantiation which CIRCUS outputs in response to the input sentence. In general, CIRCUS can produce an arbitrary number of case frame instantiations for a single sentence. No effort is made to integrate these into a larger structure. The concept node instantiation created by $BOMBING-3$ in response to S1 is given in Figure 5.</Paragraph>
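    <Paragraph> As a rough sketch of the interaction just described, the fragment below shows how fillers drawn from syntactic buffers could instantiate a case frame; the structures and names here are assumptions for illustration, not the real $BOMBING-3$ machinery:
(defstruct concept-node
  name
  slots)   ; list of (slot-name semantic-constraint buffer-name)

(defparameter *bombing-cn*
  (make-concept-node
   :name '$bombing-3$
   :slots '((actor  ws-terrorist       *subject*)
            (target ws-physical-target *do*))))

(defun instantiate-frame (cn buffers)
  "Fill CN's variable slots from BUFFERS, an alist of constituent buffers."
  (cons (concept-node-name cn)
        (loop for (slot constraint buffer) in (concept-node-slots cn)
              for filler = (cdr (assoc buffer buffers))
              when filler
                collect (list slot filler constraint))))

;; (instantiate-frame *bombing-cn*
;;                    '((*subject* . terrorists) (*do* . embassies)))
;;   =&gt; ($BOMBING-3$ (ACTOR TERRORISTS WS-TERRORIST)
;;                   (TARGET EMBASSIES WS-PHYSICAL-TARGET))
</Paragraph>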
    <Paragraph position="7"> Some case frame slots are not predicted by the concept node definition but are inserted into the frame in a bottom-up fashion. Slots describing time specifications and locations are all filled by a mechanism for bottom-up slot insertion (e.g., the REL-LINK slot in Figure 5 was created in this way). Although the listing in Figure 5 shows only the head noun "embassies" in the target noun group slot, the full phrase "embassies of the PRC and the Soviet Union" has been recognized as a noun phrase and can be recovered from this case frame instantiation. The target value "ws-diplomat-office-or-residence" is a semantic feature retrieved from our dictionary definition for "embassy." No additional output is produced by CIRCUS in response to S1.</Paragraph>
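    <Paragraph> A minimal sketch of bottom-up insertion, reusing the illustrative frame shape from the previous sketch; INSERT-BOTTOM-UP is an invented name:
(defun insert-bottom-up (frame slot filler)
  "Return FRAME with an unpredicted (SLOT FILLER) pair added;
times and locations attach without being asked for by the concept node."
  (cons (first frame)
        (cons (list slot filler) (rest frame))))

;; Attaching the canonical date the preprocessor produced:
;; (insert-bottom-up '($bombing-3$ (target embassies)) 'rel-link 'oct_25_89)
;;   =&gt; ($BOMBING-3$ (REL-LINK OCT_25_89) (TARGET EMBASSIES))
</Paragraph>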
    <Paragraph position="8">  It is important to understand that CIRCUS uses no sentence grammar and does not produce a full syntactic analysis for any sentence it processes. Syntactic constituents are utilized only when a concept node definition asks for them. Our method of syntactic analysis operates locally, and syntactic predictions are indexed by lexical items. We believe that this approach to syntax is highly advantageous when dictionary coverage is sparse and large sentence fragments can be ignored without adverse consequences. This allows us to minimize our dictionaries as well as the amount of processing needed to handle selective concept extraction from open-ended texts.</Paragraph>
    <Paragraph position="9"> Some concept nodes are very simple and may contain no variable slots at all. For example, CIRCUS generates two simple frames in response to S2, neither of which contains variable slot fillers. Note that the output generated by CIRCUS for S2 as shown in Figure 6 is incomplete. There should be a representation for the damage. This omission is the only CIRCUS failure for TST1-MUC3-0099, and it results from a noun/verb disambiguation failure. The 14 sentences in TST1-MUC3-0099 resulted in a total of 27 concept node instantiations describing bombings, weapons, injuries, attacks, destruction, perpetrators, murders, arson, and new event markers.</Paragraph>
    <Paragraph position="10"> Special mechanisms are devoted to handling specific syntactic constructs, including appositives and conjunctions. We will illustrate our handling of conjunctions by examining two instances of "and" in S5:
S5: Police said the attacks were carried out almost simultaneously and(1)
that the bombs broke windows and(2) destroyed the two vehicles.</Paragraph>
    <Paragraph position="11"> We recognize that and(1) is not part of a noun phrase conjunction, but do nothing else with it. A new control kernel begins after "that" and reinitializes the state of the parser. and(2) is initially recognized as potentially joining two noun phrases -- "windows" and whatever noun phrase follows. However, when the verb "destroyed" appears before any conjoining noun phrase is recognized, the LICK mechanism determines that the conjunction actually joins two verbs and begins a new clause. As a result, the subject of "broke" (i.e., "the bombs") correctly becomes the subject of "destroyed" as well.</Paragraph>
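    <Paragraph> The following is a highly simplified sketch of that decision; the real LICK mechanism is richer than this three-way dispatch, and RESOLVE-CONJUNCTION is an invented name:
(defun resolve-conjunction (pending-np following-constituents subject)
  "Decide whether AND joins noun phrases or clauses by watching what
arrives next; a clause-joining AND inherits SUBJECT, as \"the bombs\"
does for \"destroyed\" in S5."
  (case (find-if (lambda (c) (member c '(verb noun-phrase)))
                 following-constituents)
    (noun-phrase (list :np-conjunction pending-np))
    (verb        (list :new-clause :subject subject))
    (t           (list :unresolved pending-np))))

;; (resolve-conjunction 'windows '(verb noun-phrase) 'the-bombs)
;;   =&gt; (:NEW-CLAUSE :SUBJECT THE-BOMBS)
</Paragraph>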
    <Section position="1" start_page="227" end_page="227" type="sub_section">
      <SectionTitle>
Rule-Based Consolidation
</SectionTitle>
      <Paragraph position="0"> When an entire text has been processed by CIRCUS, the list of the resulting case frame instantiations is passed to consolidation. A rule base of consolidation heuristics then attempts to merge associated case frames and create target template instantiations that are consistent with MUC-3 encoding guidelines. It is possible for CIRCUS output to be thrown away at this point if consolidation does not see enough information to justify a target template instantiation. If consolidation is not satisfied that the output produced by CIRCUS describes bona fide terrorist incidents, consolidation can declare the text irrelevant. A great deal of domain knowledge is needed by consolidation in order to make these determinations. For example, semantic features associated with entities such as perpetrators, targets, and dates are checked to see which events are consistent with encoding guidelines. In this way, consolidation operates as a strong filter for output from CIRCUS, allowing us to concisely implement encoding guidelines independently of our dictionary definitions.</Paragraph>
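      <Paragraph> As a toy illustration of consolidation acting as a filter, assuming frames represented as property lists; the relevance tests here are invented stand-ins for the actual encoding guidelines:
(defun relevant-frame-p (frame)
  "Keep only frames whose type and date pass the guideline checks."
  (and (member (getf frame :type) '(bombing attack murder arson kidnapping))
       (not (eq (getf frame :date) 'dec_31_80))))  ; out-of-bounds date flag

(defun consolidate (frames)
  "Discard unjustified CIRCUS output; with no survivors the whole text
is declared irrelevant."
  (or (remove-if-not #'relevant-frame-p frames)
      :irrelevant))

;; (consolidate '((:type bombing :date oct_25_89)
;;                (:type bombing :date dec_31_80)))
;;   =&gt; ((:TYPE BOMBING :DATE OCT_25_89))
</Paragraph>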
      <Paragraph position="1"> A number of discourse-level decisions are made during consolidation, including pronoun resolution and reference resolution. Some references are resolved by frame merging rules. For example, CIRCUS output from S1, S2, and S3 is merged during consolidation to produce the target template instantiation found in Figure 7.</Paragraph>
      <Paragraph position="2"> The CIRCUS output from S1 triggers a rule called create-bombing which generates a template instantiation that eventually becomes the one in Figure 7. But to arrive at the final template, we must first execute three more consolidation rules that combine the preliminary template with output from S2 and S3. Pseudo-code for two of these three rules is given in Figure 8.</Paragraph>
      <Paragraph position="3">
and the weapon is an explosive
and a BOMBING or ATTEMPTED-BOMBING template is on the stack in the current family
and dates are compatible
and locations are compatible

and a BOMBING template is on the stack in the current family
and dates are compatible
and locations are compatible</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="227" end_page="229" type="metho">
    <SectionTitle>
THEN
</SectionTitle>
    <Paragraph position="0">
merge perpetrators, human targets, physical targets, instruments, dates, and locations
and also ...</Paragraph>
    <Paragraph position="1">
if there is a MURDER template with compatible victim (on the stack in the same family)
   with no instruments or the instruments are explosives
then merge perpetrators, human targets, instruments, dates, and locations with the MURDER template
[Figure 7 lists the slots of the resulting target template: 0. MESSAGE ID; 1. TEMPLATE ID; 2. DATE OF INCIDENT; 3. TYPE OF INCIDENT; 4. CATEGORY OF INCIDENT; 5. PERPETRATOR: ID OF INDIV(S); 6. PERPETRATOR: ID OF ORG(S); 7. PERPETRATOR: CONFIDENCE; 8. PHYSICAL TARGET: ID(S); 9. PHYSICAL TARGET: TOTAL NUM; ...]
Note also that the location of the incident was merged into this frame from S3, which triggers another bombing node in response to the verb "exploded" as shown in Figure 9. Once again, the top-level REL-LINK for a location is printing out only a portion of the complete noun phrase that was captured.</Paragraph>
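    <Paragraph> A hedged Common Lisp rendering of a merge rule in the spirit of Figure 8; the property-list accessors and COMPATIBLE-P are assumptions rather than the actual rule base:
(defun compatible-p (a b)
  "Unknown (NIL) values are compatible with anything."
  (or (null a) (null b) (equal a b)))

(defun merge-weapon-into-bombing (weapon template)
  "If WEAPON is an explosive and TEMPLATE is a (possibly attempted)
bombing with compatible date and location, fold the instruments in;
otherwise return TEMPLATE unchanged."
  (if (and (eq (getf weapon :class) 'explosive)
           (member (getf template :type) '(bombing attempted-bombing))
           (compatible-p (getf template :date) (getf weapon :date))
           (compatible-p (getf template :location) (getf weapon :location)))
      (list* :instruments
             (union (getf weapon :instruments)
                    (getf template :instruments))
             template)
      template))
</Paragraph>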
    <Paragraph position="2"> Although we would say that the referent to "bombs" in S2 is effectively resolved during consolidation, our methods are not of the type normally associated with linguistic discourse analysis. When consolidation examines these case frames, we are manipulating information on a conceptual rather than linguistic level. We need to know when two case frame descriptions are providing information about the same event, but we aren't worried about referents for specific noun phrases per se. We did reasonably well on this story. Three templates of the correct event types were generated and no spurious templates were created by the rule base. Sentences S9 through S13 might have generated spurious templates if we didn't pay attention to the dates and victims. Here is how the preprocessor handled S12:</Paragraph>
  </Section>
  <Section position="8" start_page="229" end_page="229" type="metho">
    <SectionTitle>
EMBASSY COMPOUND &gt;PE)
</SectionTitle>
    <Paragraph position="0"> Whenever the preprocessor recognizes a date specification that is "out of bounds" (at least two months prior to the dateline), it inserts -DEC_31_80 as a flag to indicate that the events associated with this date are irrelevant. This date specification will then be picked up by any concept node instantiations that are triggered "close" to the date description. In this case, the event is irrelevant both because of the date and because of the victim (murdered militants aren't usually relevant).</Paragraph>
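    <Paragraph> A minimal sketch of the out-of-bounds test, flattening dates to a single month index (year*12 + month); MAYBE-FLAG-DATE and this encoding are invented:
(defun maybe-flag-date (event-month-index dateline-month-index)
  "Return the irrelevance flag -DEC_31_80 for dates at least two months
before the dateline, and the date itself otherwise."
  (if (&lt;= event-month-index (- dateline-month-index 2))
      '-dec_31_80
      event-month-index))

;; With the dateline OCT 89 encoded as 89*12+10 = 1078, a JUL 89 event
;; (1075) is flagged:  (maybe-flag-date 1075 1078)  =&gt;  -DEC_31_80
</Paragraph>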
    <Paragraph position="1"> Despite the fact that S10 and S13 contain no date descriptions, the case frames generated for these sentences are merged with other frames that do carry disqualifying dates, and are therefore handled at a higher level of consolidation. In the end, the two murders (S11 and S12) are discarded because of disqualifications on their victim slot fillers, while the bombing (S9) was discarded because of the date specification. The injuries described by S10 are correctly merged with output from S9, and therefore discarded because of the date disqualifier. Likewise, the dynamite from S13 is correctly merged with the murder of the militant, and the dynamite is subsequently discarded along with the rest of that template.</Paragraph>
    <Section position="1" start_page="229" end_page="229" type="sub_section">
      <SectionTitle>
Case-Based Consolidation
</SectionTitle>
      <Paragraph position="0"> The CBR component of consolidation is an optional part of our system, designed to increase recall rates by generating additional templates to augment the output of rule-based consolidation. These extra templates are generated on the basis of correlations between CIRCUS output for a given text and the key target templates for similarly indexed texts. The CBR component uses a case base which draws from 283 texts in the development corpus and the 100 texts from TST1, for a total of 383 texts. We experimented with a larger case base but found no improvement in performance. The case base contains 254 template type patterns based on CIRCUS output for the 383 texts in the case base.</Paragraph>
      <Paragraph position="1"> Each case in the case base associates a set of concept nodes with a template containing slot fillers from those concept nodes. The concept nodes are generated by CIRCUS when it analyzes the original source text. A case has two parts: (1) an incident type, and (2) a set of sentence/slot name patterns. For example, suppose a story describes a bombing such that the perpetrator and the target were mentioned in one sentence, and the target was mentioned again three sentences later. The resulting case would be generated in response to this text:</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="229" end_page="231" type="metho">
    <SectionTitle>
BOMBING
0: (PERP TARGET)
3: (TARGET)
</SectionTitle>
    <Paragraph position="0"> The numerical indices are relative sentence positions. The same pattern could apply no matter where the two sentences occurred in the text, as long as they were three sentences apart.</Paragraph>
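    <Paragraph> A minimal sketch of this position-relative matching over cases shaped like the BOMBING example above; the function names are invented:
(defparameter *example-case*
  '(bombing ((0 (perp target)) (3 (target)))))

(defun normalize-pattern (pattern)
  "Shift sentence indices so the earliest becomes 0; only relative
offsets matter."
  (let ((base (reduce #'min pattern :key #'first)))
    (mapcar (lambda (entry) (cons (- (first entry) base) (rest entry)))
            pattern)))

(defun match-case-p (case probe-pattern)
  "A probe retrieves CASE when their normalized patterns are equal."
  (equal (second case) (normalize-pattern probe-pattern)))

;; A probe built from sentences 7 and 10 of some new text still matches:
;; (match-case-p *example-case* '((7 (perp target)) (10 (target))))  =&gt; T
</Paragraph>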
    <Paragraph position="1"> Cases are used to determine when a set of concept nodes all contribute to the same output template .</Paragraph>
    <Paragraph position="2"> When a new text is analyzed, a probe is used to retrieve cases from the case base. Retrieval probes are new sentence/slot name patterns extracted from the current CIRCUS output. If the sentence/slot name pattern of a probe matches the sentence/slot name pattern of a case in the case base, that case is retrieved, the probe has succeeded, and no further cases are considered.</Paragraph>
    <Paragraph position="3"> Maximal probes are constructed by grouping CIRCUS output into maximal clusters that yield successful probes. In this way, we attempt to identify large groups of consecutive concept nodes that all contribute to the same output template. Once a maximal probe has been identified, the incident type of the retrieved case forms the basis for a new CBR-generated output template whose slots are filled by concept node slot fillers according to appropriate mappings between concept nodes and output templates.</Paragraph>
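    <Paragraph> A sketch of the greedy construction of a maximal probe, building on MATCH-CASE-P from the previous sketch; treating "maximal" as the longest matching prefix is an assumption:
(defun maximal-probe (pattern-entries case-base)
  "Return the longest prefix of PATTERN-ENTRIES that retrieves a case,
plus that case's incident type; NIL if no prefix succeeds."
  (loop for len from (length pattern-entries) downto 1
        for probe = (subseq pattern-entries 0 len)
        for case = (find-if (lambda (c) (match-case-p c probe)) case-base)
        when case
          return (values probe (first case))))
</Paragraph>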
    <Paragraph position="4"> In the case of TST1-MUC3-0099, case-based consolidation proposes hypothetical templates corresponding to 3 bombings, 2 murders, 1 attack, and 1 arson incident. Two of the bombings and the arson are discarded because they were already generated by rule-based consolidation. The two murders are discarded because of victim and target constraints, while the third bombing is discarded because of a date constraint. The only surviving template is the attack incident, which turns out to be spurious. It is interesting to note that for this text, the CBR component regenerates each of the templates created by rule-based consolidation, and then discards them for the same reasons they were discarded earlier, or because they were recognized to be redundant against the rule-based output. We have not run any experiments to see how consistently the CBR component duplicates the efforts of rule-based consolidation. While such a study would be very interesting, we should note that the CBR templates are generally more limited in the number of slot fillers present, and would therefore be hard pressed to duplicate the overall performance of rule-based consolidation.</Paragraph>
  </Section>
</Paper>