File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/m92-1007_metho.xml
Size: 11,780 bytes
Last Modified: 2025-10-06 14:13:13
<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1007"> <Title>Relevant Messages Irrelevant Messages Marginal Messages Required Templates</Title>
<Section position="2" start_page="0" end_page="88" type="metho"> <SectionTitle> MEASURING SUCCESS IN ACHIEVING OUR SHORT-TERM GOALS </SectionTitle>
<Paragraph position="0"> PLUM had demonstrated quite high recall in MUC-3 and scored among the top systems. We chose to focus on the following goals in MUC-4: Increasing precision and reducing overgeneration, without hurting recall.</Paragraph>
<Paragraph position="1"> Demonstrating a broad range of tradeoff in recall and precision.</Paragraph>
<Paragraph position="2"> Goal 1: Increasing precision and reducing overgeneration, without hurting recall. As the graph in Figure 1 shows, we doubled our precision in MUC-4 (compared to MUC-3) and reduced our overgeneration by roughly one third. The overall impact was to increase PLUM's F-measure by 50%. Ideally, one would base this measurement on the new test sets for MUC-4 (TST3, TST4); however, between MUC-3 and MUC-4 both the definition of the templates to be produced and the evaluation function changed dramatically, so there was no easy way to run the MUC-3 version of PLUM on TST3 and TST4 to produce results comparable to those of the MUC-4 version of the system. However, since the Government had converted our MUC-3 TST2 templates to the MUC-4 format, and since we had never examined the corpus of messages or the answer key of TST2, we could easily use it as a basis for comparison.</Paragraph>
<Paragraph position="3"> Goal 2: Demonstrating a broad range of tradeoff in recall and precision. As Figure 2 illustrates, the user can select from a broad range of system performance, emphasizing either recall or precision to various degrees. No system had displayed such a span favoring recall versus favoring precision in MUC-3. Only one other system, GE's, demonstrated a broad range; at a cost of 17 points of recall, GE's system could achieve an increase of roughly 8 points of precision. For PLUM, the tradeoff of recall for precision was far more balanced.</Paragraph>
<Paragraph position="4"> Two independent parameters primarily contributed to this: a discrete parameter controlling how aggressively or conservatively two descriptions are fused into a single view of the same event, and a continuously variable threshold on a classification algorithm predicting whether a paragraph is relevant or irrelevant (with respect to reporting any terrorist incident). Together, these two parameters offer a user the ability to turn a knob to emphasize recall or precision based on their application preference.</Paragraph>
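To make the comparison of such operating points concrete, the following sketch scores a few hypothetical (threshold, recall, precision) settings with the F-measure and keeps the best one. It assumes the conventional van Rijsbergen formula; the numeric operating points, the beta parameter, and the function names are purely illustrative and are not PLUM's actual scores or interface.

# A small sketch, assuming the conventional van Rijsbergen F-measure; the operating
# points and the beta parameter below are illustrative, not PLUM's actual settings.
def f_measure(recall, precision, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R); beta > 1 weights recall more heavily."""
    if precision == 0 and recall == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

# Hypothetical (threshold, recall, precision) points from sweeping the relevance
# threshold; pick the setting that maximizes the balanced F-measure.
operating_points = [(-2.0, 55.0, 40.0), (0.0, 48.0, 52.0), (2.0, 40.0, 60.0)]
best_threshold, best_recall, best_precision = max(
    operating_points, key=lambda pt: f_measure(pt[1], pt[2]))

Setting beta above 1 weights recall more heavily and below 1 weights precision, mirroring the knob described above.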
</Section>
<Section position="3" start_page="88" end_page="88" type="metho"> <SectionTitle> KEY SYSTEM FEATURES </SectionTitle>
<Paragraph position="0"> Two design features stand out in our minds: partial understanding and statistical language modeling. By partial understanding we mean that the parser and grammar are designed to find analyses for a non-overlapping sequence of fragments. When cases of permanent, predictable ambiguity arise, such as a prepositional phrase that can be attached in multiple ways, or most conjoined phrases, the parser finishes the analysis of the current fragment and begins the analysis of a new fragment. Therefore, the entities mentioned and some relations between them are processed in every sentence, whether syntactically ill-formed, complex, novel, or straightforward. Furthermore, this parsing is done using essentially domain-independent syntactic information. The semantic interpreter and the rest of the system in turn do not assume complete understanding.</Paragraph>
<Paragraph position="1"> The second key feature is the use of statistical algorithms to guide processing. Determining the part of speech of highly ambiguous words is done by well-known Markov modeling techniques. To improve the recognition of Latin American names, we employed a statistically derived five-gram (five-letter) model of words of Spanish origin and a similar five-gram model of English words. This model was integrated into the part-of-speech tagger.</Paragraph>
<Paragraph position="2"> Another use of statistical algorithms was an induction algorithm that learns case frames for verbs from examples. This saved substantial effort compared to building the case frames by hand. The algorithm and empirical results are described in [3].</Paragraph>
<Paragraph position="3"> Figure 3: Impact of the paragraph classifier on recall and precision in the ALL TEMPLATES row.</Paragraph>
<Paragraph position="4"> The statistical methods mentioned above were already available and used in the MUC-3 version of PLUM. A new statistical algorithm employed in MUC-4 is a classification algorithm that automatically learns features to discriminate among classes. Given a list of relevant paragraphs and a list of irrelevant ones (made available by New Mexico State University), we employed a chi-square measurement to determine word stems (though other features could be used as well) whose presence or absence in text is significantly correlated with the text being relevant (or irrelevant). Given that ordering, the user must select how many features to use. At runtime, the classifier sums the logarithm of the odds that the paragraph is relevant given the presence of the features. If the sum exceeds a user-specified threshold, the paragraph is considered relevant. If the classifier predicts that the paragraph is relevant, then events found in the paragraph can be used to generate templates; if not, terrorist events that would otherwise have been produced from that paragraph are blocked. The performance of the overall system, given various thresholds of the text classifier, is shown in Figure 3.</Paragraph>
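The following is a minimal sketch of this style of classifier: a 2x2 chi-square score ranks word stems, the top-ranked stems receive smoothed log-odds weights, and a paragraph is judged relevant when the summed log-odds of the stems it contains exceeds the threshold. The whitespace tokenization, add-one smoothing, and function names are our own simplifications, not details of PLUM's implementation.

# A minimal sketch of a chi-square-selected, log-odds-summing relevance classifier;
# tokenization and smoothing choices here are assumptions, not PLUM's.
import math

def chi_square(rel_with, rel_without, irr_with, irr_without):
    """2x2 chi-square statistic relating presence of a stem to paragraph relevance."""
    table = [[rel_with, rel_without], [irr_with, irr_without]]
    total = sum(map(sum, table))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / total
            if expected:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def train(relevant_paras, irrelevant_paras, n_features):
    """Rank stems by chi-square, keep the top n_features, and give each a log-odds weight."""
    rel = [set(p.lower().split()) for p in relevant_paras]
    irr = [set(p.lower().split()) for p in irrelevant_paras]
    vocab = set().union(*rel, *irr)
    ranked = sorted(
        vocab,
        key=lambda s: chi_square(sum(s in d for d in rel), len(rel) - sum(s in d for d in rel),
                                 sum(s in d for d in irr), len(irr) - sum(s in d for d in irr)),
        reverse=True)
    weights = {}
    for stem in ranked[:n_features]:
        p_rel = (sum(stem in d for d in rel) + 1) / (len(rel) + 2)   # add-one smoothing
        p_irr = (sum(stem in d for d in irr) + 1) / (len(irr) + 2)
        weights[stem] = math.log(p_rel / p_irr)
    return weights

def is_relevant(paragraph, weights, threshold):
    """Sum the log-odds of the selected stems present; relevant if the sum exceeds the threshold."""
    stems = set(paragraph.lower().split())
    return sum(w for stem, w in weights.items() if stem in stems) > threshold

Raising the threshold blocks more paragraphs and favors precision; lowering it admits more and favors recall, which is the continuously variable parameter referred to in the first section.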
<Paragraph position="5"> A more detailed description of the system components, their individual outputs, and their knowledge bases is presented in Ayuso et al. [1]. We expect the particular implementations to change and improve substantially during the next three years of research and development.</Paragraph>
</Section>
<Section position="4" start_page="88" end_page="90" type="metho"> <SectionTitle> RESULTS </SectionTitle>
<Paragraph position="0"> Appendix G lists detailed test scores. A number of systems performed better on TST4 than on TST3, and some performed significantly worse on TST4 than TST3. The results on the two test sets were so disparate for PLUM that we decided to look into the causes of its abnormally low recall on TST3. As Table 1 shows, the following properties of TST3 stand in stark contrast with TST4, TST1, and the 1300-message development corpus: representative of the MUC-4 domain than TST3.</Paragraph>
<Paragraph position="1"> Taken together, the first two observations above suggest the following: Systems tuned to overgenerate (i.e., produce a high percentage of templates, what Hirschman labels the &quot;lazy merge problem&quot; in this volume) should perform significantly better on TST3 than TST4.</Paragraph>
<Paragraph position="2"> The observations above suggest, in part, why PLUM's recall for TST3 was abnormally low: (1) known, temporary grammar problems, and (2) a challenge for discourse processing to collect targets across sentences.</Paragraph>
<Paragraph position="3"> * A bug in the official scoring program was encountered for TST3, but not for TST4. If this bug is corrected, we estimate it would improve PLUM's scores by at least one point in recall, at least one point in precision, and at least two points in overgeneration. (That clearly is not sufficient to fully account for the discrepancy in performance on TST3 and TST4.) One other point confirming the normalcy of TST4, contrasted with the abnormal characteristics of TST3, can be seen in PLUM's performance under various settings. Prior to the test, we ran PLUM with numerous parameter settings on TST1, TST2, and one set of 100 messages from the development set. This predicted the setting that would maximize the F-measure, or come indistinguishably close to the maximum F-measure. That prediction proved correct (consistent) for TST4, but was 2 points under the maximum actually achieved for TST3 via one of our optional runs.</Paragraph>
<Paragraph position="4"> is maximized, and where F is maximized. It also lists the required run for TST4. In addition, since TST3 and TST4 were so disparate in character, we computed the score of PLUM if TST3 and TST4 together constituted the test.</Paragraph>
</Section>
<Section position="5" start_page="90" end_page="90" type="metho"> <SectionTitle> EFFORT SPENT </SectionTitle>
<Paragraph position="0"> We estimate that 4 person months specific to MUC-4 went into our effort. These were spent approximately as follows: domain-dependent lexical additions, 0.5 person months; grammar, 0.5 person months; semantic rules, 0.75 person months; discourse, 1.0 person months; backend, 0.75 person months; and overhead (evaluation, fulfilling requirements, etc.), 0.5 person months.</Paragraph>
</Section>
<Section position="6" start_page="90" end_page="90" type="metho"> <SectionTitle> TRAINING DATA AND TECHNIQUES </SectionTitle>
<Paragraph position="0"> The 1300 messages of the development corpus were used at various levels as training data. PLUM was run over all 1300 messages to detect, debug, and correct any causes of system breaks. The perpetrator organization slot for all 1300 messages was used to quickly add names to the domain-dependent lexicon. After running our part-of-speech tagger (POST) over the development corpus, the statistical algorithm for predicting words of Spanish origin was run over the list of previously unknown words. Those predicted as Spanish in origin were then reviewed manually to add Spanish names to the lexicon.</Paragraph>
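The Spanish-origin predictor can be pictured as two competing character five-gram models, as in the sketch below: one trained on a list of Spanish words and one on English words, with an unknown word flagged when the Spanish model assigns it the higher probability. The padding scheme, add-one smoothing, and training word lists are illustrative assumptions rather than the statistically derived models actually used in PLUM.

# A sketch of two competing character five-gram models; padding, smoothing, and
# training lists are illustrative assumptions, not PLUM's actual models.
import math
from collections import Counter, defaultdict

class CharFiveGramModel:
    def __init__(self, words, n=5):
        self.n = n
        self.counts = defaultdict(Counter)   # (n-1)-character context -> next-character counts
        for word in words:
            padded = "^" * (n - 1) + word.lower() + "$"
            for i in range(len(padded) - n + 1):
                self.counts[padded[i:i + n - 1]][padded[i + n - 1]] += 1

    def log_prob(self, word):
        """Add-one smoothed log probability of generating the word character by character."""
        padded = "^" * (self.n - 1) + word.lower() + "$"
        logp = 0.0
        for i in range(len(padded) - self.n + 1):
            context, ch = padded[i:i + self.n - 1], padded[i + self.n - 1]
            seen = self.counts.get(context, Counter())
            logp += math.log((seen[ch] + 1) / (sum(seen.values()) + 30))  # ~30 = rough alphabet size
        return logp

def predicted_spanish(word, spanish_model, english_model):
    """True when the Spanish-origin model assigns the word higher probability than the English one."""
    return spanish_model.log_prob(word) > english_model.log_prob(word)

Words flagged in this way would then go to manual review before any names are added to the lexicon, as described above.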
<Paragraph position="1"> A subset of the development set was used more intensively as training data. Approximately 95,000 words of text (about 20% of the development corpus) were tagged by the University of Pennsylvania as to part of speech and labelled as to syntactic structure as part of the DARPA-funded TREEBANK project. The bracketed text first provided us with a frequency-ranked list of head verbs, head nouns, and nominal compounds. For each of these we added a pointer to the domain model element that is the most specific super-concept containing all things denoted by the verb, noun, or nominal compound. As mentioned earlier, the TREEBANK data was then used with the lexical relation to the domain model to hypothesize case frames for verbs. The automatically hypothesized verb case frames were then reviewed manually and added to the lexicon. This is detailed in [3].</Paragraph>
<Paragraph position="2"> The 100 messages of TST1 and TST2 were used as blind test sets to measure our progress at least once a week.</Paragraph>
<Paragraph position="3"> Throughout, we only looked at the summary output from the scoring procedure, rather than adding to the lexicon or debugging the system based on particular messages.</Paragraph>
<Paragraph position="4"> The training mentioned above had already been used in preparing for MUC-3, with the obvious exception that TST2 was not available in preparation for MUC-3. What we added in MUC-4 was training regarding the relevance/irrelevance of paragraphs. We tried training at the article level; however, the fact that an article could be mostly irrelevant except for a single paragraph mentioning a terrorist incident made the training much less effective than training based on labelling individual paragraphs as irrelevant.</Paragraph>
</Section>
</Paper>