<?xml version="1.0" standalone="yes"?> <Paper uid="M91-1032"> <Title>UNISYS : DESCRIPTION OF THE UNISYS SYSTEM USED FOR MUC-3</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> UNISYS : DESCRIPTION OF THE UNISYS SYSTEM USED FOR MUC-3 </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> This paper describes the Unisys MUC-3 text understanding system, a system based upon a three-tiered approach to text processing in which a powerful knowledge-based form of information retrieval plays a central role. This knowledge-based form of information retrieval makes it possible to define an effective level of text analysis that falls somewhere between what is possible with standard keyword-based information retrieval techniques and deep linguistic analysis.</Paragraph> <Paragraph position="1"> The Unisys Center for Advanced Information Technology (CAIT) has a long-standing commitment to NLP research and development, with the Pundit NLP system developed at CAIT serving as the Center's primary research vehicle [3]. The Unisys MUC-3 system, however, consists primarily of components that are less than 7 months old and still in a developmental stage. Although the three-tiered processing approach that the MUC-3 system's architecture is based upon includes Pundit as its third level of (linguistic) analysis, the incorporation of Pundit into the MUC-3 system was not achieved in time for the final MUC-3 test in May 1991. A decision was made to focus on the development of a knowledge-based information retrieval component, and this precluded the integration of Pundit into the prototype. The Unisys MUC-3 system without its linguistic analysis component is depicted in Figure 1.
This is the version of the system that was actually used in the MUC-3 test.</Paragraph> </Section> <Section position="3" start_page="0" end_page="212" type="metho"> <SectionTitle> APPROACH AND SYSTEM DESCRIPTION </SectionTitle> <Paragraph position="0"> The Unisys MUC-3 system's architecture consists of five main processing components, three of which represent levels of text understanding. An initial preprocessing component transforms texts into an appropriate format for the text understanding components to manipulate. The three text understanding components engage in (1) standard keyword-based information retrieval, (2) knowledge-based information retrieval, and (3) linguistic analysis. A final, template generation component gathers together all the facts extracted from a given text and builds template data structures out of them. These five components are described in more detail below.</Paragraph> <Paragraph position="1"> A Message Pre-processing Component The Unisys MUC-3 system's message pre-processing component is a special, low-level processor which parses texts into their component parts and generates output in a form compatible with the KBIRD rule processing system (i.e., as a set of Prolog terms). This processor is a special C program which was generated using an Application Specific Language called MFPL (Message Format Processing Language) [6]. MFPL was specifically designed as a high-level language for processing the formatted portions of electronic messages.
In addition to producing a representation of the text in Prolog terms, this module identifies and encodes sentence boundaries, paragraph boundaries, and the standard formatted portions of the text (e.g., date, time, location, etc.).</Paragraph> <Paragraph position="2"> The third level of text understanding in the three-tiered approach to text processing described in this paper (linguistic analysis provided by the Pundit NLP system) was not incorporated in time for the test, and is therefore not represented in the diagram.</Paragraph> <Paragraph position="3"> A Keyword-Based Information Retrieval Component The keyword analysis component of the Unisys MUC-3 system predicts when various types of terrorist acts (bombings, murders, kidnappings, and so forth) have been referred to in a text. The probability of an act of a given type having occurred is determined by a search for words, word stems, pairs of words, and pairs of word stems that are associated with types of acts. The probability of such a word (or word stem, word pair, or stem pair) occurring in a text with which an act of a given type is associated is determined as follows.</Paragraph> <Paragraph position="4"> The frequency of presence for a given word W (or word stem ...) in texts for which a terrorist act of a given type T occurs is computed (f(W,T)), as is the presence of the word in any text at all in the complete corpus (f(W,C)). The probability of the word appearing in texts for which a terrorist act of a given type occurs, f(W,T)/|C|,</Paragraph> <Paragraph position="6"> and the probability of the word occurring in any text,</Paragraph> </Section> <Section position="4" start_page="212" end_page="212" type="metho"> <SectionTitle> f(W,C) / |C| </SectionTitle> <Paragraph position="0"> are calculated, and these two values are used to determine the conditional probability of the word (or word stem ...)
predicting the given type of terrorist act.</Paragraph> </Section> <Section position="5" start_page="212" end_page="215" type="metho"> <SectionTitle> P(W,T) = f(W,T) / f(W,C) </SectionTitle> <Paragraph position="0"> Only words with relatively high probabilities of predicting a given type of terrorist act are searched for in a text, and words that do not occur frequently enough in the text corpus, based on some empirically derived threshold, are not used.</Paragraph> <Paragraph position="1"> Training the keyword-based analysis component. A database of key words, two-word phrases, word stems, and two-stem phrases was compiled from the DEV corpus using a collection of GAWK scripts. After some experimentation, we decided not to use the word stem and stem-pair data in the final test, because it was not making any positive difference in the system's event detection performance. Currently, an event class, T, is predicted for a text if it contains any single word (or two-word phrase), W, where P(W,T) > .65, or if it contains two words (or two two-word phrases) W1 and W2 where P(W1,T) > .55 and P(W2,T) > .55. Further experimental variation of the scoring algorithm should result in continued enhancements to this component's event detection capabilities.</Paragraph> <Paragraph position="2"> A Knowledge-based Information Retrieval Component (KBIRD) Once a set of terrorist acts has been predicted, the task of generating templates describing those acts falls to the knowledge-based information retrieval component called KBIRD.</Paragraph> <Paragraph position="3"> KBIRD is a rule-based system for concept-spotting in free text [2, 7]. KBIRD rules are forward-chaining Horn clauses whose antecedents are constituents discovered and recorded in a chart data structure and whose consequents are newly inferred constituents--concepts (or facts)--to be added to the chart.
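The chart-based forward chaining just described can be sketched in miniature. The following is illustrative Python of our own devising, not the system's actual Prolog/Pfc implementation; the fact representation, word offsets, and sentence-bound test are all assumptions. It mimics a rule of the form terrorist_event(E) .. potential_victim(V) ==> victim(E,V), with the inferred fact spanning the maximal cumulative region of its antecedents.

```python
# Illustrative sketch of KBIRD-style forward chaining over a chart:
# every fact carries a text region, and an inferred fact's region is
# the maximal cumulative span of the antecedent regions.
from typing import NamedTuple

class Fact(NamedTuple):
    name: str   # concept name, e.g. "terrorist_event"
    arg: str    # the text realizing the concept
    start: int  # word offset where the region begins
    end: int    # word offset where the region ends

def same_sentence(a, b, sentence_bounds):
    """True if both facts fall inside one recorded sentence span."""
    return any(a.start >= s and e >= a.end and b.start >= s and e >= b.end
               for s, e in sentence_bounds)

def forward_chain(chart, sentence_bounds):
    """One hypothetical rule: terrorist_event(E) .. potential_victim(V) ==> victim(E,V)."""
    inferred = []
    for ev in [f for f in chart if f.name == "terrorist_event"]:
        for v in [f for f in chart if f.name == "potential_victim"]:
            if same_sentence(ev, v, sentence_bounds):
                # the new fact spans both antecedent regions
                inferred.append(Fact("victim", ev.arg + "/" + v.arg,
                                     min(ev.start, v.start), max(ev.end, v.end)))
    return inferred

chart = [Fact("terrorist_event", "murder", 3, 3),
         Fact("potential_victim", "mayor", 7, 7),
         Fact("potential_victim", "priest", 20, 20)]
new_facts = forward_chain(chart, sentence_bounds=[(0, 10), (11, 25)])
# only the victim in the same sentence as the event is inferred
```

A real KBIRD rulebase compiles such rules into Prolog rather than interpreting them this way; the sketch only shows how region-tagged facts license new region-tagged facts.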
The antecedents and consequents of KBIRD rules can include arbitrary Prolog goals, just as in Definite Clause Grammars [5].</Paragraph> <Paragraph position="4"> It is tempting to think of a set of KBIRD rules as implementing a kind of bottom-up chart parser, but there are several interesting differences. One distinctive feature is that the concepts that KBIRD rules infer are associated with a specific region of text, a region which is the maximal cumulative span of the regions of text associated with each expression in a given rule's antecedent. Moreover, these regions can be explicitly reasoned about by subsequent KBIRD rules.</Paragraph> <Paragraph position="5"> In typical natural language parsers, there is an implicit constraint that adjacent constituents in a rule must be realized by contiguous strings of text in the input. KBIRD allows one to write rules which specify other constraints on the relative positions of the strings which realize rule constituents. The antecedent of a KBIRD rule may consist of several facts (words or concepts) that are the arguments of operators illustrated below. New operators are easy to define.</Paragraph> <Paragraph position="6"> A is in the same region as B.</Paragraph> <Paragraph position="7"> KBIRD rules are compiled into a combination of Prolog backward chaining rules and forward chaining rules in Pfc [1]. A simple optimizer is applied to the output of this compilation process to improve performance. KBIRD has many additional features which are inherited from the Pfc rule language, such as the ability to write non-monotonic rules which specify that no occurrence of a certain constituent or concept be found in a given region.</Paragraph> <Paragraph position="8"> Some examples of KBIRD rules are shown below. The first rule states that if the word stem &quot;MURDER*&quot; has been found in the text, then a fact should be added to the factbase stating that a potential murder event has been found.
The second rule illustrates KBIRD's ability to recognize phrases, asserting that if the string &quot;ARMY OF NATIONAL LIBERATION&quot; is discovered, a fact should be added to the factbase stating that a terrorist organization exists in the text at the same location as the string. The third rule illustrates the use of operations on concepts derived from the text, asserting that if a terrorist event E is found in the same sentence as a potential victim V, then a fact should be added to the factbase indicating that V is the actual victim of E.</Paragraph> <Paragraph position="9"> 1. &quot;MURDER*&quot; ==> potential_murder_event.</Paragraph> <Paragraph position="10"> 2. &quot;ARMY&quot; &quot;OF&quot; &quot;NATIONAL&quot; &quot;LIBERATION&quot; ==> terrorist_organization. 3. terrorist_event(E) .. potential_victim(V) ==> victim(E,V).</Paragraph> <Paragraph position="11"> Several additional features of the KBIRD rule language should be mentioned, all of which appear in the following, more complex rule used to infer individual perpetrators: generic_perpetrator(A)@P,</Paragraph> <Paragraph position="13"> In the first clause of the antecedent of this rule, the text location index associated with the concept generic_perpetrator(A) is bound to the logic variable P with the @ operator. This allows the location to be explicitly constrained later in the rule. If a clause is enclosed in square brackets, as is the case for the second clause of the antecedent, then its location is ignored. This condition also shows the use of the tilde (~) as a negation operator. Thus, this second clause specifies that it is not the case that Name has been determined to be an &quot;unlikely perpetrator&quot; anywhere else in the text.
The final clause of the antecedent in this rule is enclosed in curly brackets, which indicates that it is a Prolog constraint which must be met--this clause is used to extract the actual text associated with the concept bound to the logic variable P.</Paragraph> <Paragraph position="14"> A Template Generator The Template Generator has three tasks: to select the actual templates to be produced as output, to choose between candidate slot fillers if more than one has been found, and to print the template in the proper format.</Paragraph> <Paragraph position="15"> Template Selection. The process of determining which template structures to build out of the facts inferred by KBIRD begins by determining if any events at all have been predicted. If no event has been predicted, then an &quot;irrelevant template&quot; is created. If several events of the same type have been created, the template generator will attempt to merge them using a set of general heuristics which hypothesize that two event descriptions refer to the same event. Slot Filler Selection. After merging events, the template generator must select the final slot filler values. The KBIRD rules which propose slot fillers attach a score (an integer between 0 and 100) to each candidate which represents the system's confidence in that value. If multiple candidate fillers exist for a given template, several general heuristics are used to select among them: * Candidate slot values with scores below a given threshold are dropped from consideration.
* Synonymous expressions are dropped in favor of their canonical expression.</Paragraph> <Paragraph position="16"> * If one candidate expression is a substring of another, then the shorter one is dropped.</Paragraph> <Paragraph position="17"> * A generic description (e.g., vehicles) is dropped in favor of one or more subsumed ones (e.g., ambulance, truck).</Paragraph> <Paragraph position="18"> * If a slot can only take a single value, then the candidate receiving the highest score is selected. A Linguistic Analysis Component (Pundit) The Pundit natural language processing system has been under development at Unisys for the last five years and is capable of performing a detailed linguistic analysis of an input text. Unlike KBIRD, Pundit abstracts away from the actual strings used to convey information in a text at the very beginning of its analysis process by determining to which syntactic properties and domain concepts the lexical items in the text correspond. These syntactic properties and domain concepts are then processed without much attention being paid to their physical location in the text. In KBIRD, on the other hand, everything that is manipulated, even a concept that has been asserted, is explicitly associated with a region of text. A key capability that the deeper linguistic processing of Pundit can provide is the determination of the grammatical and thematic roles of expressions in a text. Thus, it can determine that in the sentence &quot;Castellar is the second mayor that has been murdered in Colombia in the last 3 days&quot; Castellar is the subject of the copular verb in the matrix clause, and that Castellar should inherit properties asserted of the predicate nominal argument. It can also recognize the passive voice of the relative/subordinate clause headed by that, and thus that it is Castellar that has been murdered (as the second mayor) in Colombia.
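The slot-filler selection heuristics listed above reduce to a short filtering pipeline. This sketch is our own illustrative Python (the actual system is Prolog); the candidate format, threshold value, and function name are assumptions, and only the single-valued-slot case is shown.

```python
# Illustrative sketch of the slot-filler selection heuristics described
# above: threshold filtering, substring preference, highest score wins.
# Candidate format and names are ours, not the actual implementation.

def select_slot_filler(candidates, threshold=50):
    """candidates: list of (value, score); returns the single best value."""
    # drop candidates whose confidence score falls below the threshold
    kept = [(v, s) for v, s in candidates if s >= threshold]
    # if one candidate is a substring of another, drop the shorter one
    kept = [(v, s) for v, s in kept
            if not any(v != w and v in w for w, _ in kept)]
    # for a single-valued slot, the highest-scoring survivor is selected
    return max(kept, key=lambda vs: vs[1])[0] if kept else None

cands = [("SHINING PATH", 85), ("PATH", 70), ("MILITARY", 35)]
select_slot_filler(cands)  # "SHINING PATH": "MILITARY" fails the threshold, "PATH" is a substring
```

The generic-description and synonym heuristics would slot into the same pipeline as further filtering passes.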
It would be possible to build a KBIRD rulebase that performs the sort of detailed linguistic analysis now being performed by Pundit. Merging KBIRD and Pundit in this way would minimize the complications of integrating the text analyses that they perform. However, such a merger would very likely reduce the modularity of the three-tiered approach to text processing that we have been following.</Paragraph> </Section> <Section position="6" start_page="215" end_page="221" type="metho"> <SectionTitle> AN EXTENDED EXAMPLE </SectionTitle> <Paragraph position="0"> In this section, we illustrate in a more concrete fashion how the Unisys MUC-3 system goes about processing messages by examining in more detail what happens during the processing of a specific text, message TST1-MUC3-0099 in the MUC-3 corpus (see Figure 2). Our discussion will proceed through the various processing phases that have been identified.</Paragraph> <Paragraph position="1"> Phase One: Message Pre-processing In this phase, the message is parsed (by a special low-level processor) into its components and output in a form compatible with the KBIRD rule processing system. This processor is a special C program generated by MFPL, the ASL mentioned earlier in this paper. This phase produces text input of the following sort to the Prolog portion of the system, including default (header) information about the date and location.</Paragraph> <Paragraph position="3"> Phase Two: Keyword analysis In the second phase, the keyword analysis component predicts three event classes--bombings with a probability of 87%, attacks with a probability of 66%, and murders with a probability of 63%. Figure 3 shows the particular words and word pairs which gave rise to these predicted event types. The last column in this table contains triples consisting of a probability, a word or two-word phrase, and its location in the text.
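These per-word probabilities feed the thresholding rule given earlier: an event class survives if some word or phrase has P(W,T) > .65, or if two distinct words or phrases each have P(W,T) > .55. The following Python sketch is ours, with hypothetical probabilities standing in for the Figure 3 values.

```python
# Illustrative sketch of the keyword thresholding rule: predict an event
# class if a single word scores above .65, or if two distinct words
# (or two-word phrases) each score above .55.

def predict_event(word_probs):
    """word_probs: dict mapping word or phrase to P(W, T) for one class T."""
    strong = [w for w, p in word_probs.items() if p > 0.65]
    weak = [w for w, p in word_probs.items() if p > 0.55]
    return bool(strong) or len(weak) >= 2

# hypothetical per-word probabilities for a weak murder prediction
murder_words = {"MURDER": 0.63, "KILLED": 0.40}
predict_event(murder_words)  # False: a single word at .63 clears neither threshold
```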
Given our current thresholds, the murder prediction was judged to be too weak for further consideration.</Paragraph> <Paragraph position="4"> Phase Three: KBIRD processing KBIRD examines the text word by word and applies forward chaining rules whenever their pre-conditions are met. KBIRD's task is to take the event classes predicted by the keyword analysis stage and try to predict additional event classes as well as instantiate the predicted types with individual events. Event instances are associated with particular regions within the text. When an event instance is created, additional rules will be triggered to look for values to fill each of the instance's slots.</Paragraph> <Paragraph position="5"> Combining an event type predicted by the keyword-based analysis component with words or inferred concepts that have been detected in a text will allow KBIRD to infer additional event types. For example, the following KBIRD rule, which was triggered in the processing of message 0099, asserts that the occurrence of &quot;BURNED&quot; in the active voice in a message for which an instance of a bombing event has been discovered is enough to predict the likely occurrence of an arson event.</Paragraph> <Paragraph position="7"> Locating Events. The process of instantiating event types, or locating events, is initiated in KBIRD through a class of locator rules, which attempt to find &quot;hot spots&quot; in the text which seem to be discussing events of the predicted type.
The following locator rules were used to detect bombing, attack, and arson instances in this message:</Paragraph> <Paragraph position="9"> voice (no preceding &quot;be&quot; word) with a potential physical target to its right in the same sentence, then infer an instance of a bombing event.</Paragraph> <Paragraph position="11"> out&quot; occurs (or a variant with some other &quot;be&quot; word), and in the same sentence somewhere a bomb device is mentioned, then infer an instance of a bombing event.</Paragraph> <Paragraph position="13"> occurs (or a variant with some other &quot;be&quot; word), and no mention is made of a bomb device in the same sentence, then infer an instance of an attack event.</Paragraph> <Paragraph position="15"> voice (with a &quot;be&quot; word to its left) with a mention of a potential physical target somewhere to the left in the same sentence, then infer an instance of an arson event.</Paragraph> <Paragraph position="16"> Although the rule above for detecting an instance of an attack event will initially fire as the words in the message are examined sequentially by KBIRD and the phrase &quot;THE ATTACK WAS CARRIED OUT&quot; is encountered, the attack event instance that has been created will eventually be retracted when, in the same sentence, the description of a bomb device is encountered (&quot;THE BOMBS&quot;). On the other hand, the second rule for inferring instances of bombing events will suddenly have all of its antecedent constraints met when this latter phrase is encountered, and so it will fire to create a new instance of a bombing.</Paragraph> <Paragraph position="17"> Locating perpetrator ids and orgs. The following two rules are triggered when, in the first sentence of 0099, the word &quot;TERRORISTS&quot; is encountered.
The latter rule licenses the inference that &quot;TERRORISTS&quot; describes a potential perpetrator.</Paragraph> <Paragraph position="19"> Later, in the fourth paragraph of the text, the following rules are used to infer that the known guerrilla organizations &quot;SHINING PATH&quot; and &quot;TUPAC AMARU REVOLUTIONARY MOVEMENT&quot; have been</Paragraph> <Paragraph position="21"> Locating a Physical Target. In processing the first three paragraphs of the text, a number of rules fire to trigger the recognition of potential physical targets. Embassies and vehicles are frequent physical targets, and so the following inference rules have been written to capture essential information about</Paragraph> <Paragraph position="23"> Detecting Event Instances, Revisited. The discovery of a physical target satisfies the last of the antecedent constraints for the arson and the first bombing event locator rules mentioned earlier, and so actual events (event instances) can now be inferred by them. Actual events are represented in the chart as facts of the following sort: for inferring bombing instances can be satisfied. It will be the job of the template generator to detect and merge references to the same event.</Paragraph> <Paragraph position="24"> Generating Slot Values. Once an event instance has been asserted, KBIRD will begin to infer tmp clauses, which will later be written to a file to serve as input to the template generator for filling template slots. Each clause has as one of its parameters a score that indicates how likely it is to be an appropriate slot value.
The following rules illustrate how a perpetrator that is a terrorist is favored in a bombing</Paragraph> <Paragraph position="26"> Similarly, the following rules illustrate how, in templates representing bombing events, organizations that have been identified as guerrilla groups are favored over drug cartels and military groups as likely values for the perpetrator ORG slot.</Paragraph> <Paragraph position="27"> actual_event(_,ID,bombing) .. organization(G, 'GUERRILLA') ==> tmp(ID, slot06, [G,'GUERRILLA'], kbird, 85).</Paragraph> <Paragraph position="28"> actual_event(_,ID,bombing) .. organization(G, 'DRUGGIES') ==> tmp(ID, slot06, [G,'REBELS'], kbird, 77).</Paragraph> <Paragraph position="29"> actual_event(_,ID,bombing) .. organization(G, 'MILITARY') ==> tmp(ID, slot06, [G,'MILITARY'], kbird, 35).</Paragraph> <Paragraph position="30"> The arson template generated by the system was almost completely correct. The only problem was that the perpetrator confidence reported for &quot;SHINING PATH&quot; was CLAIMED OR ADMITTED and not REPORTED AS FACT. In the bombing template generated by the system, the date was incorrectly identified as being a span of time in July instead of a span of time in October. The July inference was based on information in the fifth paragraph. The system also failed to report the TUPAC AMARU group as a perpetrator ORG value, even though the group was identified in the text. An uninteresting bug in the template generator caused this error. Finally, rules for inferring that the physical targets belonged to foreign nations were not sensitive enough to be activated.</Paragraph> </Section> </Paper>