<?xml version="1.0" standalone="yes"?>
<Paper uid="X93-1016">
  <Title>INFORMATION EXTRACTION SYSTEM EVALUATION</Title>
  <Section position="9" start_page="159" end_page="162" type="concl">
    <SectionTitle>
SUMMARY
</SectionTitle>
    <Paragraph position="0"> The evaluations conducted during Phase 1 of the Tipster extraction program have measured the completeness and accuracy of systems and have used an examination of the role of missing, spurious and 25However, these four core slots are more frequently filled than many of the non-core slots. Of the 30 non-core slots, 24 account for less than 3% each of the total fills (13 account for less than 1% each, and 11 account for 1-2% each); only six of the non-core slots account for a sizeable proportion of the total fills (four account for 3-4% each, and only two account for 5-10% each).</Paragraph>
    <Paragraph position="1"> otherwise erroneous output as a means of diagnosing the state of the art. Viewed as a set of performance benchmarks for the state of the art in information extraction, the MUC-5 evaluation yielded EJV results that are at least as good as the MUC-4 level of performance. This comparison takes into account some of the measurable differences in difficulty between the EJV task and the MUC-3 and MUC-4 terrorism task.</Paragraph>
    <Paragraph position="2"> However, even a superficial comparison of task difficulty is hard to make because of the change from the fiat-format design of the earlier MUC templates to the object-oriented design of the MUC-5 templates. Comparison is also made difficult by the many changes that have been made to the alignment and scoring processes and to the performance metrics. Therefore, it is more useful to view performance of the MUC-5 systems on their own terms rather than in comparison to previous MUC evaluations.</Paragraph>
    <Paragraph position="3"> From this independent vantage point, MUC-5 yielded very impressive results for some systems on some tasks. Error per response fill scores as low as 34 (GE/CMU optional test run using the CMU TEXTRACT system) and 39 (GE/CMU Shogun system) were obtained on the JJV core-template test.</Paragraph>
    <Paragraph position="4"> The only other error per response fill scores in the 30-40 range were achieved by humans, who were tested on the EME task; however, machine performance on that EME test was only half as good as human performance. Thus, while the JJV core-template test results show that machine performance on a constrained test can be quite high, the EME results show that a similar level of machine performance on a more extensive task could not be achieved, at least not in the relatively short development period allowed for ME.</Paragraph>
    <Paragraph position="5"> Not only do results such as those cited for the JJV core-template test show how well some approaches to information extraction work for some tasks, they also show how manageable languages other than English can be. A cross-language comparison of results showed fairly consistent advantage in favor of Japanese over English.</Paragraph>
    <Paragraph position="6"> Comparison of results across domains does not show an advantage in favor of one domain over the other, and it is quite likely that differences in the nature of the texts, the nature and evolution of the extraction tasks, and the amount of time allowed for development all had an impact on the results.</Paragraph>
    <Paragraph position="7"> The quantity and variety of material on which systems were trained and tested presented challenges far beyond those posed by earlier MUC evaluations.</Paragraph>
    <Paragraph position="8">  The scope of the evaluations was broad enough to cause most MUC-5 sites to skip parts of the extraction task, especially types of information that appear relatively rarely in the corpus. Since no type of information is weighted in the scoring more heavily than any other, the biases that exist in the evaluation reflect the distribution of relevant information in the text corpus and result in a natural emphasis on handling the most frequently-occurring slot-tilting tasks. These tasks turn out to be the ones that are less idiosyncratic and therefore more important to the development of generally useful technology.</Paragraph>
    <Paragraph position="9"> Examination of the slot-level results in the appendix to this volume shows which systems are filling which slots and how aggressively they are generating fills. For those slots where a system is generating a substantial number of fills, analysis at the level of the individual templates and corresponding texts would provide insight into the particular circumstances under which the system extracted corr~ or incorrect information. In other words, the quantitative performance measures may yield information on aspects of performance that deserve further analysis, but a deeper investigation needs to include examination of the actual fills and the actual texts. The discussion in this paper of slot-level performance on the JV core-template task does not go as far as that; the discussion is based only on frequency of slot fill and on the slot definitions. Some of the deeper analysis can be carried out only by the authors of the systems.</Paragraph>
    <Paragraph position="10"> Such an analysis would relate the circumstances under which correct or incorrect system behavior was seen with the strengths and weaknessses of particular algorithms and modules of the system.</Paragraph>
  </Section>
class="xml-element"></Paper>