SRI INTERNATIONAL: DESCRIPTION OF THE FASTUS SYSTEM USED FOR MUC-4

FASTUS is a system for extracting information from free text in English, and potentially other languages as well, for entry into a database, and potentially for other applications. It works essentially as a cascaded, nondeterministic finite state automaton.

It is an information extraction system, rather than a text understanding system. This distinction is important. In information extraction, only a fraction of the text is relevant; in the case of the MUC-4 terrorist reports, probably only about 10% of the text is relevant. There is a predefined, relatively simple, rigid target representation that the information is mapped into. The subtle nuances of meaning and the writer's goals in writing the text are of no interest. This contrasts with text understanding, where the aim is to make sense of the entire text, where the target representation must accommodate the full complexities of language, and where we want to recognize the nuances of meaning and the writer's goals.

The MUC evaluations are information extraction tasks, not text understanding tasks. The TACITUS system that was used for MUC-3 in 1991 is a text-understanding system [1]. Using it for the information extraction task gave us high precision, the highest of any of the sites. However, our recall was mediocre, and the system was extremely slow. Our motivation in building the FASTUS system was to have a system more appropriate to the information extraction task.

The inspiration for FASTUS was threefold. First, we were struck by the strong performance that the group at the University of Massachusetts got out of a fairly simple system [2]. It was clear they were not doing anything like the depth of preprocessing, syntactic analysis, or pragmatics that was being done by the systems at SRI, General Electric, or New York University. They were not doing a lot of processing; they were doing the right processing.

The second source of inspiration was Pereira's work on finite-state approximations of grammars [3], especially the speed of the implementation.

Speed was the third source. It was simply too embarrassing to have to report at the MUC-3 conference that it took TACITUS 36 hours to process 100 messages. FASTUS has brought that time down to 11 minutes.

The operation of FASTUS comprises four steps, described in the next four sections:

1. Triggering
2. Recognizing Phrases
3. Recognizing Patterns
4. Merging Incidents

The system is implemented in CommonLisp and runs on both Suns and Symbolics machines. A schematic sketch of the cascade appears below.
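To make the four-pass structure concrete, the following is a minimal sketch in Python rather than the CommonLisp of the actual system; every name and all of the toy logic below are hypothetical stand-ins for the real passes. Each pass consumes the output of the previous one, which is what makes the cascade organization natural for a finite-state system.

    # A minimal sketch of the four-pass cascade, in Python rather than the
    # CommonLisp of the actual system; all names and toy logic are
    # hypothetical stand-ins for the real passes.
    TRIGGERS = {"hostage", "kidnapped", "bomb"}          # toy Pass 1 vocabulary

    def is_triggered(sentence):
        # Pass 1: cheap scan for trigger words.
        return any(w.strip(".,").lower() in TRIGGERS for w in sentence.split())

    def recognize_phrases(sentence):
        # Pass 2: would segment the sentence into noun groups, verb groups,
        # and critical word classes; a bare token list stands in here.
        return [w.strip(".,").lower() for w in sentence.split()]

    def recognize_patterns(phrases):
        # Pass 3: would run the domain patterns over the phrase sequence.
        return [{"incident": "KIDNAPPING"}] if "kidnapped" in phrases else []

    def merge_incidents(so_far, new):
        # Pass 4: would merge compatible incidents; here it just accumulates.
        return so_far + new

    def process_message(sentences):
        incidents = []
        for sentence in sentences:
            if is_triggered(sentence):        # untriggered sentences are skipped
                phrases = recognize_phrases(sentence)
                incidents = merge_incidents(incidents, recognize_patterns(phrases))
        return incidents

    print(process_message(["Several men kidnapped the mayor today."]))
    # -> [{'incident': 'KIDNAPPING'}]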
TRIGGERING

In the first pass over a sentence, trigger words are searched for. There is at least one trigger word for each pattern of interest that has been defined. Generally, these are the least frequent words required by the pattern. For example, in the pattern

    take <HumanTarget> hostage

"hostage" rather than "take" is the trigger word. There are at present 253 trigger words.

In addition, the names of people identified in previous sentences as victims are also treated, for the remainder of the text, as trigger words. This allows us, for example, to pick up occupations of victims when they occur in sentences with no other triggers, as in

    Hector Oqueli and Gilda Flores were assassinated yesterday.
    Gilda Flores was a member of the Democratic Socialist Party (PSD) of Guatemala.

Finally, on this pass, full names are searched for, so that subsequent references to surnames can be linked to the corresponding full names. Thus, if one sentence refers to "Ricardo Alfonso Castellar" but does not mention his kidnapping, while the next sentence mentions the kidnapping but only uses his surname, we can enter Castellar's full name into the template.

In Message 48 of TST2, 21 of 30 sentences were triggered in this fashion. Thirteen of the 21 triggered sentences were relevant. There is very little penalty for passing irrelevant sentences on to further processing, since the system is so fast, especially on irrelevant sentences.

Eight of the nine nontriggered sentences were irrelevant. The one relevant, nontriggered sentence was

    There were seven children, including four of the vice president's children, in the home at the time.

It does not help to recognize this sentence as relevant, as we do not have a pattern that would match it. The missing pattern is

    <HumanTarget> be in <PhysicalTarget>

which would pick up human targets who were in known physical targets. In order to have this sentence triggered, we would have to take the head nouns of known physical targets to be temporary triggers for the remainder of the text, as we do with named human targets.
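As a rough illustration of this pass, the sketch below treats victim names found earlier in a text as additional triggers. Names and data are hypothetical, and find_victims stands in for the later passes that actually identify victims.

    # Sketch of the triggering pass; names and data are hypothetical, and
    # find_victims stands in for the later passes that actually identify
    # victims.  Victim names become triggers for the rest of the text.
    STATIC_TRIGGERS = {"hostage", "assassinated", "bomb"}    # 253 in FASTUS

    def tokenize(sentence):
        return [w.strip('.,;:"()').lower() for w in sentence.split()]

    def triggered(sentences, find_victims):
        victim_names = set()                   # dynamic triggers seen so far
        for sentence in sentences:
            words = set(tokenize(sentence))
            if words & (STATIC_TRIGGERS | victim_names):
                yield sentence
                victim_names |= {v.lower() for v in find_victims(sentence)}

    demo_victims = lambda s: {"Flores"} if "assassinated" in s else set()
    text = ["Hector Oqueli and Gilda Flores were assassinated yesterday.",
            "Gilda Flores was a member of the Democratic Socialist Party."]
    print(list(triggered(text, demo_victims)))    # both sentences triggered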
RECOGNIZING PHRASES

The problem of syntactic ambiguity is AI-complete. That is, we will not have systems that reliably parse English sentences correctly until we have encoded much of the real-world knowledge that people bring to bear in their language comprehension. For example, noun phrases cannot be reliably identified, because of the prepositional phrase attachment problem. However, certain syntactic constructs can be reliably identified. One of these is the noun group, that is, the noun phrase up to the head noun. Another is what we are calling the "verb group", that is, the verb together with its auxiliaries and embedded adverbs. Moreover, an analysis that identifies these elements gives us exactly the units we most need for recognizing patterns of interest.

Pass Two in FASTUS identifies noun groups, verb groups, and several critical word classes, including prepositions, conjunctions, relative pronouns, and the words "ago" and "that". Phrases that are subsumed by larger phrases are discarded. Overlapping phrases are rare, but where they occur they are kept; this sometimes compensates for incorrect analysis in Pass Two.

Noun groups are recognized by a 37-state nondeterministic finite state automaton. This encompasses most of the complexity that can occur in English noun groups, including numbers, numerical modifiers like "approximately", other quantifiers and determiners, participles in adjectival position, comparative and superlative adjectives, conjoined adjectives, and arbitrary orderings and conjunctions of prenominal nouns and noun-like adjectives. Thus, among the noun groups recognized are

    approximately 5 kg
    more than 30 peasants
    the newly elected president
    the largest leftist political force
    a government and military reaction

Verb groups are recognized by an 18-state nondeterministic finite state machine. They are tagged as Active, Passive, Gerund, or Infinitive. Verbs that are locally ambiguous between active and passive senses, such as the verb "kidnapped" in the two sentences

    Several men kidnapped the mayor today.
    Several men kidnapped yesterday were released today.

are tagged as Active/Passive, and Pass Three resolves the ambiguity if necessary.

Certain relevant predicate adjectives, such as "dead" and "responsible", are recognized, as are certain adverbs, such as "apparently" in "apparently by". However, most adverbs and predicate adjectives and many other classes of words are ignored altogether. Unknown words are ignored unless they occur in a context that could indicate they are surnames.

Lexical information is read at compile time, and a hash table associating words with their transitions in the finite-state machines is constructed. There is a hash table entry for every morphological variant of the words; altogether there are 43,000 words in the hash table. During the actual running of the system on the texts, only the state transitions are accessed.

In the output of the second pass for the first sentence of Message 48 of TST2, the verb groups "condemned" and "accused" are labelled Active/Passive. The word "killing" is incorrectly identified as a verb group and labelled as a Gerund. This mistake is common enough that we have implemented patterns to get around it in Pass Three.

On Message 48 of TST2, 243 of 252 phrases, or 96.4%, were correctly recognized. Of the 9 mistakes, 5 were due to nouns being misidentified as verbs or verbs as nouns. Three were due to a dumb bug in the code for recognizing dates that crept into the system a day before the official run and meant that no explicit dates were recognized except in the header. (This resulted in the loss of 1% in recall in the official run of TST3.) One mistake was due to bit rot.

We implemented and considered using a part-of-speech tagger to help in this phase, but there was no clear improvement, and it would have doubled the time the system took to process a message.
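The flavor of the noun group automaton can be conveyed by a toy version over a drastically simplified grammar, (Det) (Num) Adj* Noun+. The lexicon, states, and transitions below are invented for illustration; the real machine covers far more.

    # Toy recognizer for a drastically simplified noun group grammar,
    # (Det) (Num) Adj* Noun+; the real automaton has 37 states.  The
    # lexicon, states, and transitions are invented for illustration.
    LEX = {"the": "Det", "approximately": "Num", "5": "Num",
           "newly": "Adj", "elected": "Adj", "largest": "Adj",
           "president": "Noun", "peasants": "Noun", "kg": "Noun"}

    TRANS = {0: {"Det": 1, "Num": 2, "Adj": 2, "Noun": 3},
             1: {"Num": 2, "Adj": 2, "Noun": 3},
             2: {"Num": 2, "Adj": 2, "Noun": 3},
             3: {"Noun": 3}}
    ACCEPT = {3}

    def is_noun_group(words):
        # Set-of-states execution, the usual way to run a nondeterministic
        # machine (this toy happens to be deterministic).
        states = {0}
        for w in words:
            cat = LEX.get(w.lower())
            states = {TRANS[s][cat] for s in states if cat in TRANS[s]}
            if not states:
                return False
        return bool(states & ACCEPT)

    print(is_noun_group("approximately 5 kg".split()))           # True
    print(is_noun_group("the newly elected president".split()))  # True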
RECOGNIZING PATTERNS

The input to the third pass of FASTUS is a list of phrases in the order in which they occur. Anything that is not included in a phrase in the second pass is ignored in the third pass. The state transitions are driven off the head words in the phrases. In addition, some nonhead words can trigger state transitions. For example, "bomb blast" is recognized as a bombing.

We implemented 95 patterns for the MUC-4 application. Among them are patterns relevant to Message 48 of TST2, including one in which the incident type is an attack or a bombing, depending on the Device. There was a bug in this pattern that caused the system to miss picking up the explosives as the instrument. In addition, it is disputable whether Merino should be listed as a human target. In the official key template for this message, he is not. But it seems to us that if someone's home is attacked, it is an attack on him.

A certain amount of pseudo-syntax is done while patterns are being recognized. In the first place, the material between the end of the subject noun group and the main verb group must be read over. There are patterns to accomplish this. Two of them are as follows:

    Subject {Preposition NounGroup}* VerbGroup
    Subject Relpro {NounGroup | Other}* VerbGroup {NounGroup | Other}* VerbGroup

The first of these patterns reads over prepositional phrases, the second over relative clauses. The verb group at the end of these patterns takes the subject noun group as its subject. There is another pattern for capturing the content encoded in relative clauses:

    Subject Relpro {NounGroup | Other}* VerbGroup

Since the finite-state mechanism is nondeterministic, the full content can be extracted from the sentence

    The mayor, who was kidnapped yesterday, was found dead today.

One branch discovers the incident encoded in the relative clause. Another branch marks time through the relative clause and then discovers the incident in the main clause. These incidents are then merged.

A similar device is used for conjoined verb phrases. The pattern

    Subject VerbGroup {NounGroup | Other}* Conjunction VerbGroup

allows the machine to nondeterministically skip over the first conjunct and associate the subject with the verb group in the second conjunct. Thus, in the first sentence of Message 48 of TST2, in which Cristiani condemned the killing of Garcia Alvarado and accused the FMLN, one branch will recognize the killing of Garcia and another the fact that Cristiani accused the FMLN.
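The branching behavior for conjoined verb phrases can be sketched as follows; the phrase categories and the pairing logic are a toy rendering, not the actual pattern compiler.

    # Toy rendering of the conjoined-verb-phrase pattern
    #   Subject VerbGroup {NounGroup | Other}* Conjunction VerbGroup
    # over the Pass Two phrase sequence.  The categories and pairing
    # logic are illustrative, not the actual pattern compiler.
    def subject_verb_pairs(phrases):
        pairs = []
        subject = None
        awaiting_second = False
        for cat, text in phrases:
            if cat == "NG" and subject is None:
                subject = text                 # leftmost noun group as subject
            elif cat == "VG" and subject is not None:
                if not pairs or awaiting_second:
                    pairs.append((subject, text))
                    awaiting_second = False
            elif cat == "Conj":
                awaiting_second = True         # skip to the second conjunct
        return pairs

    sentence = [("NG", "Cristiani"), ("VG", "condemned"), ("NG", "the killing"),
                ("Conj", "and"), ("VG", "accused"), ("NG", "the FMLN")]
    print(subject_verb_pairs(sentence))
    # -> [('Cristiani', 'condemned'), ('Cristiani', 'accused')]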
The second sort of pseudo-syntax that is done while recognizing patterns is attaching genitives, "of" complements, and appositives to their heads, and recognizing noun group conjunctions. Thus, in

    seven children, including four of the vice-president's children

the genitive "vice-president's" will be attached to "children", the "of" complement will be attached to "four", and since "including" is treated as a conjunction, the entire phrase will be recognized as conjoined noun groups.

In Message 48 of TST2, there were 18 relevant patterns. FASTUS recognized 12 of them completely. Because of bugs in implemented patterns, 3 more patterns were recognized only partially. One implemented pattern failed completely because of a bug. Specifically, in the sentence

    A niece of Merino's was injured.

the genitive marker took the system into a state in which it was not expecting a verb group. Two more patterns were missing entirely. The pattern

    <HumanTarget1> <VerbGroup> with <HumanTarget2>

would have matched

    . . . the attorney general was traveling with two bodyguards . . .

and consequently would have recognized the two bodyguards as human targets along with the attorney general. The second is the pattern

    <HumanTarget> be in <PhysicalTarget>

mentioned above.

A rudimentary sort of pronoun resolution is done by FASTUS. If (and only if) a pronoun appears in a Human Target slot, an antecedent is sought. First the noun groups of the current sentence are searched from left to right, up to four phrases before the pronoun. Then the previous sentences are searched similarly for an acceptable noun group in a left-to-right fashion, the most recent first. This continues until the last paragraph break; if nothing is found by then, the system gives up. A noun group is an acceptable antecedent if it is a possible human target and agrees with the pronoun in number. This algorithm worked in 100% of the relevant cases in the first 200 messages of the development set. However, in its one application in Message 48 of TST2, it failed. The example is

    According to the police and Garcia Alvarado's driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.

The algorithm incorrectly identifies "them" as "the police".
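A sketch of this resolution procedure follows; is_human_target and agrees_in_number are stand-ins for the system's semantic and number checks, and all names are hypothetical.

    # Sketch of the pronoun resolution algorithm described above; the
    # predicates is_human_target and agrees_in_number are stand-ins for
    # the system's semantic and number checks.
    def resolve(pronoun, current_ngs, pronoun_index, prior_sentences,
                is_human_target, agrees_in_number):
        # prior_sentences holds the noun groups of earlier sentences back
        # to the last paragraph break only, so the search stops there.
        window = current_ngs[max(0, pronoun_index - 4):pronoun_index]
        for ng in window:                           # current sentence, left to right
            if is_human_target(ng) and agrees_in_number(pronoun, ng):
                return ng
        for sentence in reversed(prior_sentences):  # most recent sentence first
            for ng in sentence:                     # left to right within it
                if is_human_target(ng) and agrees_in_number(pronoun, ng):
                    return ng
        return None                                 # give up

    human = lambda ng: ng in {"the police", "two bodyguards", "the driver"}
    plural = lambda p, ng: ng in {"the police", "two bodyguards"}
    prior = [["the police", "Garcia Alvarado's driver",
              "the attorney general", "two bodyguards"]]
    print(resolve("them", ["One"], 1, prior, human, plural))
    # -> 'the police', the failure discussed above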
MERGING INCIDENTS

As incidents are found they are merged with other incidents found in the same sentence. Those remaining at the end of the processing of the sentence are then merged, if possible, with the incidents found in previous sentences. For example, in the first sentence of Message 48 of TST2, the incident generated for the killing of Attorney General Roberto Garcia Alvarado is merged with the incident generated by the accusation against the FMLN.

There are fairly elaborate rules for merging the noun groups that appear in the Perpetrator, Physical Target, and Human Target slots. A name can be merged with a precise description, as "Garcia" with "attorney general", provided the description is consistent with the other descriptions for that name. A precise description can be merged with a vague description, such as "person", with the precise description as the result. Two precise descriptions can be merged if they are semantically compatible. The descriptions "priest" and "Jesuit" are compatible, while "priest" and "peasant" are not. When precise descriptions are merged, the longest string is taken as the result. If merging is impossible, both noun groups are listed in the slot.

We experimented with a further heuristic for when to merge incidents. If the incidents include named human targets, we do not merge them unless there is an overlap in the names. This heuristic results in about a 1% increase in recall. In Message 48 of TST2, the heuristic prevents the bombing of Garcia Alvarado's car from being merged with the bombing of Merino's home.

There were 13 merges altogether in processing Message 48 of TST2. Of these, 11 were valid. One of the two bad merges was particularly unfortunate. The phrase

    . . . Garcia Alvarado's driver, who escaped unscathed, . . .

correctly generated an attack incident with no injury to the human target, the driver:

    Incident: ATTACK
    Perp:
    PTarg:
    HTarg: "Garcia Alvarado's driver"
    HEffect: No Injury

This was merged with the attack on Merino's home:

    Incident: BOMBING
    Perp:

That is, it was assumed that Merino was the driver. The reason for this mistake was that while a certain amount of consistency checking is done before merging victims, and while the system knows that drivers and vice presidents-elect are disjoint sets, the fact that Merino was the vice president-elect was recorded only in a table of titles, and consistency checking did not consult that table.
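Schematically, the description-merging rules might look as follows; is_name, is_vague, and compatible are hypothetical stubs for the system's consistency and compatibility checks.

    # Schematic version of the description-merging rules; is_name,
    # is_vague, and compatible are hypothetical stubs for the system's
    # consistency and compatibility checks.
    def merge_descriptions(a, b, is_name, is_vague, compatible):
        # Returns the merged description, or None when merging is
        # impossible (both noun groups are then listed in the slot).
        if is_name(a) or is_name(b):
            name, desc = (a, b) if is_name(a) else (b, a)
            # a name absorbs a consistent description, as "Garcia" with
            # "attorney general"
            return name if compatible(name, desc) else None
        if is_vague(a):
            return b                       # precise beats vague ("person")
        if is_vague(b):
            return a
        if compatible(a, b):               # two precise descriptions
            return max(a, b, key=len)      # the longest string is the result
        return None

    compatible = lambda x, y: {x, y} != {"priest", "peasant"}   # toy check
    is_vague = lambda x: x == "person"
    is_name = lambda x: x[:1].isupper()
    print(merge_descriptions("priest", "person", is_name, is_vague, compatible))
    # -> 'priest'
    print(merge_descriptions("priest", "peasant", is_name, is_vague, compatible))
    # -> None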
ERROR ANALYSIS

FASTUS made 25 errors on Message 48 of TST2, where a wrong answer, a missing answer, and a spurious answer are all counted as errors. (There is in principle no limit to the number of possible errors, since arbitrarily many spurious entries could be given. In practice, however, the number of possible errors is around 80. If no entries are made in the templates, that counts as 55 errors. If all the entries are made and are correct, but combined into a single template, that counts as 48 errors: the 24 missing entries in the smaller template and the 24 spurious entries in the larger.)

The sources of the errors are as follows. Because of the missing patterns, we failed to find the children and the bodyguards as human targets. The bad merges resulted in the driver being put into the wrong template. The armored car was found as a physical target in the attack against Garcia Alvarado, but armored cars are viewed as military, and military targets are filtered out just before the templates are generated. The disputable answer is Merino as a human target in the bombing of his home.

We do not know to what extent this pattern of causes of errors is representative of the performance of the system on the corpus as a whole.

FUTURE DIRECTIONS

If we had had one more month to work on the MUC-4 task, we would have spent the first week developing a rudimentary pattern specification language. We believe that with about two months of work we could develop a language that would allow a novice user to begin to specify patterns in a new domain within hours of being introduced to the system. The pattern specification language would allow the user to define structures, to specify patterns as regular expressions interrupted by assignments to fields of the structures, and to define a sort hierarchy to control the merging of structures.

We would also like to apply the system to a new domain. Our experience with the MUC-4 task leads us to believe we could achieve reasonable performance on the new domain within two months.

Finally, it would be interesting to try to convert FASTUS to a new language. There is not much linguistic knowledge built into the system; what there is probably amounted to no more than two weeks of coding. For this reason, we believe it would require no more than one or two months to convert the system to another language. This is true even for a language as seemingly dissimilar to English as Japanese. In fact, our approach to recognizing phrases was inspired in part by the bunsetsu analysis of Japanese.

SUMMARY

Among the advantages of the FASTUS system is that it provides a very direct link between the texts being analyzed and the data being extracted. FASTUS is not a text understanding system; it is an information extraction system. But for information extraction tasks, it is perhaps the most convenient and most effective system that has been developed.

ACKNOWLEDGEMENTS

The research was funded by the Defense Advanced Research Projects Agency under Office of Naval Research contract N00014-90-C-0220, and by an internal research and development grant from SRI International.