<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1009"> <Title>TIPSTER SHOGUN SYSTEM (JOINT GE-CMU): MUC-4 TEST RESULTS AND ANALYSIS 1</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> RESULTS </SectionTitle> <Paragraph position="0"> Our overall results on both TST3 and TST4 were very good in relation to other systems. Figure 1 summarizes our results on these tests. On both TST3 and TST4, the GE-CMU system was 9-10 recall points behind the GE system and about 4 F-measure points behind. The entire difference is due to the difference between the GE TRUMP parser, which had been developed for text applications and thoroughly tested in MUC-3, and the CMU generalized LR parser, which was developed for machine translation and has only just begun to be tested on this sort of task.</Paragraph> <Paragraph position="1"> In addition to these core results, Figure 2 summarizes our performance on the adjunct test.</Paragraph> </Section> <Section position="5" start_page="0" end_page="101" type="metho"> <SectionTitle> GE - GE-CMU SCORE COMPARISON </SectionTitle> <Paragraph position="0"> While the performance of the GE-CMU system was not as good as that of the GE system, we view these results in a very positive light. First, the system with the CMU parser was reasonably close to the GE system, and it was rapidly catching up at the end of the preparations for MUC-4 (the GE-CMU system improved about 30 recall points between the interim test in March and the final test in May). Second, the integrated system proved that software modules and fundamental results could be shared across sites, systems, and methodologies, with some effort. The success of both the parser integration and the MUC test was thus encouraging for our TIPSTER effort as a whole. [Footnote 1: This research was sponsored (in part) by the Defense Advanced Research Projects Agency (DOD) and other government agencies. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.]</Paragraph> </Section> <Section position="6" start_page="101" end_page="101" type="metho"> <SectionTitle> EFFORT </SectionTitle> <Paragraph position="0"> We spent overall approximately 2 person-months of effort on the GE-CMU system, compared with 10 person-months on the GE system. Half of this effort was from CMU, where participants had not been through MUC before and thus faced a &quot;learning curve&quot; in getting used to the test and testing procedures. The amount of development represented here is thus very small, and it went almost entirely into improving the robustness and recovery of the parser.</Paragraph> <Paragraph position="1"> The entire difference in scores is due to the difference between the two parsers, since the rest of the system components are the same. Successful parses covered 34% less of the input with the LR parser (we failed to get an accurate count of successful parses because of a bug in the calculation), and unsuccessful parses produced 50% more fragments.
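(For reference, the F-measure figures cited here and in the RESULTS section combine precision P and recall R via the standard van Rijsbergen formula; the balanced case beta = 1, shown on the right, is simply their harmonic mean. The beta = 1 form is given only as the most common setting; the exact setting behind the figures above is not stated in this section.)

```latex
F_{\beta} = \frac{(\beta^{2} + 1)\, P\, R}{\beta^{2} P + R},
\qquad
F_{1} = \frac{2\, P\, R}{P + R}
```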
Our rough analysis is that about half of the recall difference between the two systems is due to more parser failures in the GE-CMU system, and the remaining half is due to recovery problems.</Paragraph> <Paragraph position="2"> Although the same recovery strategies were used in both cases, the behavior of the parsers on failure was substantially different, so more work is needed on LR parser recovery.</Paragraph> <Paragraph position="3"> Some problems we have noted that cause parser failure are: inability to cope with spurious and missing punctuation as well as nested comma clauses, catastrophic failure on missing or spurious determiners, and problems with long sentences, for which we set a time limit.</Paragraph> <Paragraph position="4"> Problems with recovery include a difference between the way the two systems handle &quot;chain rules&quot; in the grammar, a bug that prevented the LR system from building complete noun phrases on failure, and the lack of certain phrase reductions at the end of a sentence (the LR parser would often leave a long trail of fragments at the end of a sentence, instead of failing early and trying to recover).</Paragraph> <Paragraph position="5"> Most of these problems are relatively minor, but they take time to fix. Our system had no real parser recovery mechanism until a few weeks before MUC-4.</Paragraph> <Paragraph position="6"> In addition to producing somewhat lower scores, the system with the LR parser was 7-9 times slower than the GE system using TRUMP. This difference arose because the LR parser was designed to follow many more parse paths, putting a large burden on the semantic interpreter to sort out the best interpretation from a huge &quot;forest&quot; of parses. We believe that this gap can be quickly closed with the installation of a new control module for the parser.</Paragraph> </Section> <Section position="7" start_page="101" end_page="101" type="metho"> <SectionTitle> RETROSPECTIVE ON THE TASK AND RESULTS </SectionTitle> <Paragraph position="0"> The GE-CMU system in MUC-4 broke new ground in resource sharing. To our knowledge, two substantial natural language systems had never before been successfully integrated, and the way the two systems were combined is a bit remarkable. We converted GE's grammar and lexicon to work with CMU's parser, used CMU's parser compiler on the GE grammar, and developed a separable data structure for parser recovery and control that could work with two vastly different parsers. This integration was so complete that, in at least one case, a bug turned up in the LR system's infant recovery module that led to a bug fix to both parsers.</Paragraph> <Paragraph position="1"> The result is that grammars and lexicons developed for one parser can be used with another, control strategies can be tested with either parser, and the two parsers can be interchanged and compared at the &quot;flip of a switch&quot; within the context of the MUC or TIPSTER systems. This suggests that sharing of resources in a test like MUC may be considerably more practical than we had once believed.</Paragraph> </Section> <Section position="8" start_page="101" end_page="102" type="metho"> <SectionTitle> LESSONS LEARNED </SectionTitle> <Paragraph position="0"> The MUC task gave the SHOGUN system a good test drive, providing a testbed for efficiently working the kinks out of the combined system.
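As a rough illustration of the &quot;flip of a switch&quot; interchange described above, and of the side-by-side comparison discussed next, the following sketch shows how two parsers might sit behind a common interface. It is hypothetical: the class and function names are invented for this example, the parse logic is a trivial stand-in, and none of it reflects the actual SHOGUN code.

```python
# Hypothetical sketch of a common parser interface that lets two parsers be
# swapped "at the flip of a switch" and run side by side on the same input.
# All names here are invented for illustration, not taken from SHOGUN.
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class ParseResult:
    success: bool
    fragments: List[str] = field(default_factory=list)  # pieces left when a full parse fails


class Parser(Protocol):
    def parse(self, sentence: str) -> ParseResult: ...


class TrumpLikeParser:
    """Stand-in for a chart parser tuned for text applications."""

    def parse(self, sentence: str) -> ParseResult:
        words = sentence.split()
        if len(words) <= 25:           # pretend short sentences parse fully
            return ParseResult(success=True)
        return ParseResult(success=False, fragments=sentence.split(","))


class GLRLikeParser:
    """Stand-in for a generalized LR parser that explores many parse paths."""

    def parse(self, sentence: str) -> ParseResult:
        words = sentence.split()
        if len(words) <= 20:           # fails earlier, leaving word-level fragments
            return ParseResult(success=True)
        return ParseResult(success=False, fragments=words)


def compare(parsers, sentence):
    """Run the same sentence through every parser and report disagreements:
    the kind of side-by-side diagnostic described in the text."""
    results = {name: p.parse(sentence) for name, p in parsers.items()}
    if len({r.success for r in results.values()}) > 1:
        print(f"Parsers disagree on: {sentence!r}")
        for name, r in results.items():
            print(f"  {name}: success={r.success}, fragments={len(r.fragments)}")


if __name__ == "__main__":
    parsers = {"TRUMP-like": TrumpLikeParser(), "GLR-like": GLRLikeParser()}
    compare(parsers, "The attack occurred near the embassy.")
    compare(parsers, "According to the police report the attack on the embassy "
                     "occurred late at night and two guards were injured by the explosion")
```

With an interface of this kind, the downstream semantic interpreter and control strategy can stay fixed while only the parser is swapped, which is roughly what makes such a comparison cheap to run.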
This benefit came not only from having a real task but, surprisingly, from being able to compare the results of two interchangeable modules. If both parsers produced the wrong result, we would follow one line of response, while we would follow a completely different path if only one of the two parsers failed. Since much of the effort in MUC is determining what went wrong where, the interchangeable parsers turned out to be a useful diagnostic aid.</Paragraph> <Paragraph position="1"> The amount of effort spent on pulling together the SHOGUN system for MUC, barely two person-months, proved that one system can really benefit from the work that goes into another. This is a mixed blessing, because we did pay double the price in the work required to fill templates, run tests, and run the scoring program on the two systems (two people did nothing but run tests and score!). The overhead of testing means that we wouldn't want to be doing this too often, but it seems a worthwhile investment for getting a relative analysis of two parsers.</Paragraph> </Section> </Paper>