<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1004"> <Title>TIPSTER/MUC-5 INFORMATION EXTRACTION SYSTEM EVALUATION</Title> <Section position="9" start_page="42" end_page="43" type="concl"> <SectionTitle> SUMMARY </SectionTitle> <Paragraph position="0"> The evaluations conducted during Phase 1 of the Tipster extraction program have measured the completeness and accuracy of systems and have used an examination of the role of missing, spurious and otherwise erroneous output as a means of diagnosing the state of the art. Viewed as a set of performance benchmarks for the state of the art in information extraction, the MUC-5 evaluation yielded EJV results that are at least as good as the MUC-4 level of performance. This comparison takes into account some of the measurable differences in difficulty between the EJV task and the MUC-3 and MUC-4 terrorism task.</Paragraph> <Paragraph position="1"> However, even a superficial comparison of task difficulty is hard to make because of the change from the flat-format design of the earlier MUC templates to the object-oriented design of the MUC-5 templates.</Paragraph> <Paragraph position="2"> Comparison is also made difficult by the many changes that have been made to the alignment and scoring processes and to the performance metrics. Therefore, it is more useful to view performance of the MUC-5 systems on their own terms rather than in comparison to previous MUC evaluations.</Paragraph> <Paragraph position="3"> From this independent vantage point, MUC-5 yielded very impressive results for some systems on some tasks. Error per response fill scores as low as 34 (GE/CMU optional test run using the CMU TEXTRACT system) and 39 (GE/CMU Shogun system) were obtained on the JJV core-template test. The only other error per response fill scores in the 30-40 range were achieved by humans, who were tested on the EME task; however, machine performance on that EME test was only half as good as human performance.
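The "error per response fill" figures cited above are on MUC-5's error-based scoring scale, where 0 is perfect and lower is better. As a rough illustration only (the half-error weighting of partially correct fills is an assumption here, following the usual MUC-era convention, and all counts are invented), the metric can be sketched as:

```python
def error_per_response_fill(correct, partial, incorrect, missing, spurious):
    """Percent of response fills in error (0 is perfect; lower is better).

    Assumption: a partially correct fill counts as half an error; all
    counts passed in below are hypothetical, not MUC-5 data.
    """
    total = correct + partial + incorrect + missing + spurious
    wrong = incorrect + 0.5 * partial + missing + spurious
    return 100.0 * wrong / total

# Hypothetical system output: mostly correct fills, with some wrong,
# partially correct, missing, and spurious fills mixed in.
score = error_per_response_fill(600, 40, 50, 150, 80)
print(round(score, 1))  # roughly 32.6, i.e., in the low-30s range
```

A perfect run (all fills correct, nothing missing or spurious) would score 0, while a system that produced nothing but errors would score 100; the best JJV core-template scores of 34 and 39 thus sit well toward the strong end of this scale.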
Thus, while the JJV core-template test results show that machine performance on a constrained test can be quite high, the EME results show that a similar level of machine performance on a more extensive task could not be achieved, at least not in the relatively short development period allowed for ME.</Paragraph> <Paragraph position="4"> Not only do results such as those cited for the JJV core-template test show how well some approaches to information extraction work for some tasks, they also show how manageable languages other than English can be. A cross-language comparison of results showed a fairly consistent advantage in favor of Japanese over English. Comparison of results across domains does not show an advantage in favor of one domain over the other, and it is quite likely that differences in the nature of the texts, the nature and evolution of the extraction tasks, and the amount of time allowed for development all had an impact on the results.</Paragraph> <Paragraph position="5"> The quantity and variety of material on which systems were trained and tested presented challenges far beyond those posed by earlier MUC evaluations. The scope of the evaluations was broad enough to cause most MUC-5 sites to skip parts of the extraction task, especially types of information that appear relatively rarely in the corpus. Since no type of information is weighted in the scoring more heavily than any other, the biases that exist in the evaluation reflect the distribution of relevant information in the text corpus and result in a natural emphasis on handling the most frequently occurring slot-filling tasks. These tasks turn out to be the ones that are less idiosyncratic and therefore more important to the development of generally useful technology.</Paragraph> <Paragraph position="6"> However, these four core slots are more frequently filled than many of the non-core slots.
Of the 30 non-core slots, 24 account for less than 3% each of the total fills (13 account for less than 1% each, and 11 account for 1-2% each); only six of the non-core slots account for a sizeable proportion of the total fills (four account for 3-4% each, and only two account for 5-10% each).</Paragraph> <Paragraph position="7"> Examination of the slot-level results in the appendix to this volume shows which systems are filling which slots and how aggressively they are generating fills. For those slots where a system is generating a substantial number of fills, analysis at the level of the individual templates and corresponding texts would provide insight into the particular circumstances under which the system extracted correct or incorrect information. In other words, the quantitative performance measures may yield information on aspects of performance that deserve further analysis, but a deeper investigation needs to include examination of the actual fills and the actual texts. The discussion in this paper of slot-level performance on the JJV core-template task does not go as far as that; the discussion is based only on frequency of slot fill and on the slot definitions. Some of the deeper analysis can be carried out only by the authors of the systems. Such an analysis would relate the circumstances under which correct or incorrect system behavior was seen to the strengths and weaknesses of particular algorithms and modules of the system.</Paragraph> </Section></Paper>