<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1019"> <Title>SRI INTERNATIONAL FASTUS SYSTEM MUC-4 TEST RESULTS AND ANALYSIS</Title> <Section position="3" start_page="0" end_page="143" type="metho"> <SectionTitle> OVERVIEW OF THE FASTUS ARCHITECTUR E </SectionTitle> <Paragraph position="0"> The architecture of the FASTUS system is described in detail in the associated system summary . It can be summarized as a three-phase process . The first phase consists of scanning the text to identify prope r names, correcting spelling, and similar preprocessing tasks to ensure that the text is in a standardized forma t for the remainder of the processing .</Paragraph> <Paragraph position="1"> The second phase consists of a finite-state machine that accepts the sequence of words from the text, an d produces as output a sequence of linguistic consituents -- noun groups consisting of determiners, prenominals and head noun, verb groups consisting of auxilliaries plus the main verb together with any intervenin g adverbs, and particles, which is a catch-all category including prepositions, conjunctions, and genitive markers. The output of the second pass is filtered to include only the longest consitutents spanning any give n portion of the sentence .</Paragraph> <Paragraph position="2"> The linguistic consituents from the second phase are given as input to another finite-state machine . The transitions of this third-phase machine are based on the head of each constituent, and each transition build s some piece of an &quot;incident.&quot; structure, which can be thought of as a &quot;proto-template .&quot; When a final state of the machine is reached, the incident, structure that has been produced through that point is saved, an d merged with all other incident structures produced by the same sentence . (There may be several, because the machines are non-deterministic) . These incident structures are then merged with incident structure s from the rest of the text according to a set of merging heuristics . The incident structures are converted to the format of MUC-4 templates in a post-processing phase .</Paragraph> </Section> <Section position="4" start_page="143" end_page="144" type="metho"> <SectionTitle> CONTROLLING THE FASTUS SYSTE M </SectionTitle> <Paragraph position="0"> In the course of designing the system, we paramaterized a number of characteristics of the system's operation because we believed that the parameterized behavior would reflect tradeoffs in recall versus precision .</Paragraph> <Paragraph position="1"> Subsequent testing revealed that many of these parameters result in both higher recall and higher precision when in one state or the other, and therefore we left them permanently in their most advantageous state .</Paragraph> <Paragraph position="2"> Those parameters that seemed to affect recall the the expense of precision were set to produce a test ru n in which we attempted to maximize the system's recall . The effect of these parameters could be described in general as distrusting the system's filters' ability to eliminate templates corresponding to stale dates , uninteresting countries, and military incidents . We observed a small but measurable increase in recall at the expense of precision by distrusting our filters .</Paragraph> <Paragraph position="3"> The following parameters were implemented and tested on 300 texts before arriving at the decisions for the settings on the final run .</Paragraph> <Paragraph position="4"> * Conservative Merging. 
</Section> <Section position="4" start_page="143" end_page="144" type="metho"> <SectionTitle> CONTROLLING THE FASTUS SYSTEM </SectionTitle> <Paragraph position="0"> In the course of designing the system, we parameterized a number of characteristics of the system's operation, because we believed that the parameterized behavior would reflect tradeoffs in recall versus precision.</Paragraph> <Paragraph position="1"> Subsequent testing revealed that many of these parameters produce both higher recall and higher precision when in one state or the other, and we therefore left them permanently in their most advantageous state.</Paragraph> <Paragraph position="2"> Those parameters that seemed to gain recall at the expense of precision were set to produce a test run in which we attempted to maximize the system's recall. The effect of these parameters can be described in general as distrusting the ability of the system's filters to eliminate templates corresponding to stale dates, uninteresting countries, and military incidents. We observed a small but measurable increase in recall, at the expense of precision, by distrusting our filters.</Paragraph> <Paragraph position="3"> The following parameters were implemented and tested on 300 texts before we arrived at the settings for the final run.</Paragraph> <Paragraph position="4"> * Conservative Merging. When this option is selected, the system does not merge incidents that have non-overlapping proper-name targets. When it is not selected, any merges consistent with the incident types are permitted. Testing revealed that merging should always be conservative.</Paragraph> <Paragraph position="5"> * Civilian Target Requirement. This filter rejects any template that does not have at least one non-military target, including templates that identify a perpetrator but no physical or human target at all. This option appears to produce a recall-precision tradeoff of about one or two points.</Paragraph> <Paragraph position="6"> * Subjectless Verb Groups. This parameter allows the system to generate an incident structure from a verb together with its object, even if its subject cannot be determined. Although early tests showed a recall-precision tradeoff, subsequent and more thorough testing indicated that this should always be done.</Paragraph> <Paragraph position="7"> * Filter Many-Target Templates. This filter disallows any template that has more than 100 targets, on the supposition that such templates often result from vague or general, and hence irrelevant, descriptions. This turns out to be a correct heuristic, but only if the number of targets is evenly divisible by 100. (An airline bombing with 307 victims is certainly interesting, while &quot;70,000 peasants have been killed&quot; is probably vague.) A sketch of this rule, together with the stale-date filter below, follows this list.</Paragraph> <Paragraph position="8"> * Military Filtering. This heuristic causes the system to eliminate all military targets from templates, on the belief that we may have incorrectly merged a military incident with a civilian incident and incorrectly reported the union of the two. Tests show that this filtering improves precision slightly. * Liberal Perpetrator Org. Setting this parameter causes the system to pick any likely perpetrator organization out of the text, ignoring whatever the text actually says. Testing showed that this parameter had no effect, which was such a surprising result that we distrust it and regard our testing as inconclusive.</Paragraph> <Paragraph position="9"> * Spelling Correction. This parameter controls how much spelling correction the system does. Our experiments indicated that spelling correction hurts, primarily because novel proper names get corrected to other words and are hence lost. We tried a weaker version of spelling correction that corrects only misspelled words that do not occur on a large list of proper names we had assembled. This showed an improvement, but spelling correction still had a small negative effect. This was also a surprising result; we were not willing to abandon spelling correction, so we ran all tests with weak spelling correction enabled. To some extent, a complete lack of spelling correction is compensated for by the presence of common misspellings of important domain words like &quot;guerrilla&quot; and &quot;assassinate&quot; in the lexicon.</Paragraph> <Paragraph position="10"> * Stale Date Filtering. This parameter causes filtering of any template that has a date earlier than two months before the date of the article. Eliminating this filtering produces an increase in recall at the expense of precision, the magnitude of which depends on how well our date detection currently works. We would expect about a one-point tradeoff.</Paragraph> <Paragraph position="11"> * Weak Location Filtering. If the system's location detection finds that the location of an incident is impossible according to the system's location database, it eliminates the template. If this flag is set, the template is instead produced using only the country as the location. Testing shows that this is always desirable.</Paragraph>
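As a concrete illustration of two of the filters above, here is a minimal sketch of the many-target rule (in its refined, divisible-by-100 form) and the stale-date rule. The function names, the date representation, and the 61-day approximation of the two-month window are hypothetical choices of ours, not FASTUS internals.

```python
from datetime import date, timedelta

def passes_many_target_filter(num_targets: int) -> bool:
    """Refined heuristic from testing: reject a template only when it has
    more than 100 targets AND the count is evenly divisible by 100, since
    round counts tend to come from vague, general descriptions."""
    return not (num_targets > 100 and num_targets % 100 == 0)

def passes_stale_date_filter(incident_date: date, article_date: date) -> bool:
    """Reject a template whose incident date is earlier than two months
    (approximated here as 61 days) before the date of the article."""
    return incident_date >= article_date - timedelta(days=61)

assert passes_many_target_filter(307)         # airline bombing: kept
assert not passes_many_target_filter(70_000)  # "70,000 peasants": filtered
assert not passes_stale_date_filter(date(1989, 1, 3), date(1989, 4, 10))
```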
</Section> <Section position="5" start_page="144" end_page="144" type="metho"> <SectionTitle> THE RESULTS ON TST3 AND TST4 </SectionTitle> <Paragraph position="0"> On TST3, we achieved a recall of 44% with a precision of 55% in the all-templates row, for an F-score of 48.9. On TST4, we achieved an identical recall score of 44%; however, our precision fell to 52%, for an F-score of 47.7. It was reassuring to see that there was very little degradation in performance in moving to a time period over which the system had not been trained. We also submitted a run in which we attempted to maximize the system's recall by not filtering military targets and by allowing incidents with stale dates. On TST3, this led to a two-point increase in recall at the expense of one point of precision. On TST4, our recall did not increase, but our precision fell by a point, giving us a lower F-score on this run. These results were consistent with our observations during testing, although our failure to produce even a small increase in recall on TST4 was somewhat disappointing.</Paragraph> <Paragraph position="1"> The runtime for the entire TST3 message set on a SPARC-2 processor was 11.8 minutes (about 16 minutes of elapsed real time with our configuration of memory and disk). These times are quite consistent with our runs over the development sets. During the course of development, the overall run time for 100 messages increased by approximately 50%, but we attribute this increase to the decision to treat more sentences as relevant. It appears possible to increase the coverage of the system without an unacceptable increase in processing time.</Paragraph>
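For reference, the F-scores quoted above are consistent with the balanced (beta = 1) F-measure, the harmonic mean of recall and precision; a quick check under that assumption:

```python
def f_measure(recall: float, precision: float) -> float:
    """Balanced (beta = 1) F-measure: the harmonic mean of recall and
    precision, here applied to the all-templates scores."""
    return 2 * recall * precision / (recall + precision)

print(round(f_measure(44, 55), 1))  # TST3: 48.9
print(round(f_measure(44, 52), 1))  # TST4: 47.7
```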
</Section> <Section position="6" start_page="144" end_page="144" type="metho"> <SectionTitle> DEVELOPMENT HISTORY </SectionTitle> <Paragraph position="0"> During December of 1991 we decided to implement a preprocessor for the TACITUS system, at which point the FASTUS architecture was born. The system was originally conceived as a preprocessor for TACITUS that could be run in a stand-alone mode. Considerably later in our development we decided that the performance of FASTUS on the MUC-4 task was so high that we could make FASTUS our complete system.</Paragraph> <Paragraph position="1"> Most of the design work for the FASTUS system took place during January. The ideas were tested out on finding incident locations in February, and with some initial favorable results in hand, we proceeded with the implementation of the system in March. The implementation of the second phase of processing was completed in March, and the general outline of phase three was completed by the end of April. On May 6, we ran the first test of the FASTUS system on TST2, which had been withheld as a fair test, and obtained a score of 8% recall and 42% precision. At that point we began a fairly intensive effort to hill-climb on all 1300 development texts, doing periodic runs on the fair test to monitor our progress, culminating in a score of 44% recall and 57% precision in the wee hours of June 1, when we decided to run the official test. As the chart in Figure 1 shows, the rate of progress was rapid enough that even a few hours of work could have a noticeable impact on the score. Our scarcest resource was time, and our supply of it was eventually exhausted well before the point of diminishing returns.</Paragraph> </Section> </Paper>