<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-3001">
  <Title>Evaluating Message Understanding Systems: An Analysis of the Third Message Understanding Conference (MUC-3)</Title>
  <Section position="16" start_page="446" end_page="447" type="concl">
    <SectionTitle>
8. Conclusion
</SectionTitle>
    <Paragraph position="0"> In this paper, we have sketched the evaluation techniques that were applied to 15 text processing systems during MUC-3. In addition to the raw results, we have introduced a method of computing significance for the results across systems using approximate randomization. The results showed that the systems fell into a number of distinct clusters for both precision and recall. In many cases, the systems performing well in precision also performed well in recall. We can conclude that the evaluation methodology used in MUC-3 can indeed discriminate among systems and that there were significant differences in effectiveness among the systems fielded for MUC-3.</Paragraph>
    <Paragraph position="1"> We have also been able to draw several other conclusions. The first is that all systems performed worse in recall than in precision; furthermore, none of the linguistically based systems had adjustable parameters to increase recall at the expense of precision (although several systems could be run in several configurations to produce slight changes in overall precision and recall). Achievement of high recall scores (over 60%) is a problem for the current text-understanding systems. However, several systems have shown steady improvement with time, and their performance may show further improvement with continued development. Even some of the high-performing systems may not yet have reached peak performance.</Paragraph>
    <Paragraph position="2"> This raises the issue of portability--most systems spent approximately one person year preparing their systems to run for MUC-3. If porting takes 12 months of time for highly trained system developers, portability will be a serious stumbling block both to building real systems and to changing the evaluation paradigm. At the end of MUC-3, the participating system developers did not want to spend another year porting their system to yet another evaluation application. This underscores the need for serious research on portable systems.</Paragraph>
    <Paragraph position="3"> Generality of linguistic coverage is an important part both of system portability and overall system performance. In order to evaluate the linguistic coverage, we devised a successful method for isolating specific linguistic phenomena and measuring system performance on them within the black-box framework, even though specifics of the performance of systems varied and made these tests more difficult to interpret.</Paragraph>
    <Paragraph position="4"> Development of a suite of such tests with adequate linguistic coverage would provide insight into how the handling of certain common linguistic phenomena relates to overall system performance. This insight would be similar to that obtainable through glass-box testing, but the method of testing is still technically a black-box method because it looks only at the output given certain input.</Paragraph>
    <Paragraph position="5">  Computational Linguistics Volume 19, Number 3 We had hoped to draw conclusions concerning the relative effectiveness of the various language processing techniques used by the participating systems. However, the results of MUC-3, even with their statistical significance now known, do not support the recommendation of one approach over another. We did notice, however, that pattern-matching and information retrieval techniques proved successful only when combined with linguistic techniques. Used in isolation, these techniques were not powerful enough to extract the wide range of information needed in the MUC-3 task. We also noted that robust processing techniques (filtering, skimming, robust or partial parsing) were important for good performance.</Paragraph>
    <Paragraph position="6"> Overall, we believe that the MUC conferences have made a major contribution to evaluation methodology. They have both benefited from and inspired other work on evaluation, as evidenced by the growing body of research on evaluation techniques for natural language understanding including two workshops on the subject: Palmer and Finin, 1990 and Neal and Walter, 1991. In glass-box evaluation, researchers have developed a parse evaluation methodology (Black et al. 1991) that is now in use. There is active research in the evaluation of spoken language understanding: Hirschman et al., 1992 and Price et al., 1992. In addition, there is a new effort in evaluation of machine translation systems based on quality assessment measures and use of a multiple choice reading comprehension test. This work is beginning to provide the evaluation techniques that will enable the natural language community to assess its progress, to understand its results, and to focus future research towards robust, high-performance systems capable of handling real-world applications.</Paragraph>
  </Section>
class="xml-element"></Paper>