<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1001">
  <Title>OVERVIEW OF THE THIRD MESSAGE UNDERSTANDING EVALUATION AND CONFERENCE</Title>
  <Section position="8" start_page="14" end_page="15" type="concl">
    <SectionTitle>
CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> The MUC-3 evaluation established a solid set of performance benchmarks for systems with diverse approaches to text analysis and information extraction. The MUC-3 task was extremely challenging, and the results show what can be done with today's technologies after only a modest domain- and task-specific development effort (on the order of one person-year). On a task this difficult, the systems that cluster at the leading edge were able to generate in the neighborhood of 40-50% of the expected data and to do it with 55-65% accuracy. Breakdowns of performance by slot show that performance was best on identifying the type of incident -- 70-80% recall (completeness) and 80-85% precision (accuracy) were achieved, and precision figures in the 90-100% range were possible with some sacrifice in recall.</Paragraph>
    <Paragraph position="1"> All of the MUC-3 system developers are optimistic about the prospects for seeing steady improvements in system performance for the foreseeable future.</Paragraph>
    <Paragraph position="2"> This feeling is based variously on such evidence as the amount of improvement achieved between the dry-run test and the final test, the slope of improvement recorded on internal tests conducted at intervals during development, and the developers' own awareness of significant components of the system that they had not had time to adapt to the MUC-3 task. The final test results are consistent with the claim that most systems, if not all, may well be still on a steep slope of improvement. However, they also show that performance on recall is not as good as performance on precision, and they lend support to the possibility that this discrepancy will persist.</Paragraph>
    <Paragraph position="3"> It appears that systems cannot be built today that are capable of obtaining high overall recall, even at the expense of outrageously high overgeneration.</Paragraph>
    <Paragraph position="4"> Systems can, however, be built that will do a good job at potentially useful subtasks such as identifying terrorist incidents of various kinds. The results give at least a tentative indication that systems incorporating robust parsing techniques show more long-term promise of high performance than non-parsing systems. However, there are great differences in techniques among the systems in the parsing and non-parsing groups and even among those robust parsing systems that did the best in maximizing recall and precision and minimizing the tradeoff between them. Further variety was evident in the optional test runs conducted by some of the sites. Those runs show promise for the development of systems that can be &quot;tuned&quot; in various ways to generate data more aggressively or more conservatively, yielding tradeoffs between recall and precision that respond to differences in emphasis in real-life applications.</Paragraph>
    <Paragraph position="5"> Some conclusions can be drawn regarding the evaluation setup itself that will influence future work. First, the evaluation corpus and task were sufficiently challenging that they can be used again in a future evaluation (with a refined task definition and a new test set).</Paragraph>
    <Paragraph position="6"> Second, the information extraction task needs modification in order to focus as much as possible on language processing capabilities separate from information extraction capabilities, and new ideas for designing tests related to specific linguistic phenomena are needed. Finally, more work is needed to ensure that the statistical significance of the results is known, and a serious study of human performance on the task is needed in order to define concrete performance goals for the systems.</Paragraph>
  </Section>
</Paper>