File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/93/m93-1006_intro.xml
Size: 6,156 bytes
Last Modified: 2025-10-06 14:05:28
<?xml version="1.0" standalone="yes"?> <Paper uid="M93-1006"> <Title>COMPARING HUMAN AND MACHINE PERFORMANCE FOR NATURAL LANGUAGE INFORMATION EXTRACTION: Results for English Microelectronics from the MUC-5 Evaluation</Title> <Section position="2" start_page="0" end_page="53" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> In evaluating the state of technology for extracting information from natural language text by machine, it is valuable to compare the performance of machine extraction systems with that achieved by humans performing the same task. The purpose of this paper is to present some results from a comparative study of human and machine performance for one of the information extraction tasks used in the MUC-5/Tipster evaluation that can help assess the maturity and applicability of the technology.</Paragraph> <Paragraph position="1"> The Tipster program, through the Institute for Defense Analyses (IDA) and several collaborating U.S. government agencies, produced a corpus of filled &quot;templates&quot; -- structured information extracted from text. This corpus was used both in the development of machine extraction systems by contractors and in the evaluation of the developed systems. Production of templates was performed by human analysts extracting the data from the text and structuring it, using a set of structuring rules for &quot;filling&quot; the templates and computer software that made it easier for analysts to organize information.
Because of this rather extensive effort by analysts to create these templates, it was possible to study the performance of humans for this task in some detail and to develop methods for comparing this performance with that of machines participating in the MUC-5/Tipster evaluation.</Paragraph> <Paragraph position="2"> The texts that the templates were filled from were newspaper and technical magazine articles concerned either with joint business ventures or microelectronics fabrication technology. Each topic domain used text in two languages, English and Japanese. This paper discusses preparation of templates and presents results for human and machine performance for English Microelectronics; a companion paper [1] presents additional experimental results.</Paragraph> <Paragraph position="3"> The primary motivation for this study was to provide reliable data that would allow machine extraction performance to be compared with that of humans. The MUC and Tipster programs have included extensive efforts to develop measurements that can objectively evaluate the performance of the different machine systems. However, although these measures are capable of reliably discriminating between the performance of different machine systems, they are not very useful by themselves in evaluating how near the technology is to providing reliable performance and the extent to which it is ready to be used in applications. Sundheim [2] initiated human performance study by providing estimates of human performance for the task used in the MUC-4 evaluation; the present study provides human data for the MUC-5/Tipster evaluation that was produced under relatively controlled conditions and with methods and statistical measures that assess the reliability of the data. A second motivation for the study was for its value in helping produce better quality templates so as to allow high-quality system development and reliable evaluation.
We monitored the quality and consistency of the templates being produced as analysts were trained and gained experience, and made particular efforts to identify the causes of errors and inconsistency so as to develop strategies for reducing error and increasing consistency.</Paragraph> <Paragraph position="4"> A third motivation for studying human performance was to better understand the nature of the extraction task and the relative performance of humans compared with machines on different aspects of the task. Such an understanding can particularly help in the construction of human-machine integrated systems that are designed to make the best use of what are at the present time rather different abilities of humans and machines [3].</Paragraph> <Paragraph position="5"> This paper is organized as follows: The paper begins with a discussion of how the templates were prepared, with particular emphasis on the strategies that were used that served to minimize errors and maximize consistency, including detailed fill rules, having more than one analyst code a given template, and the use of software tools with error detection capabilities.</Paragraph> <Paragraph position="6"> The paper next describes the results of an investigation into the extent to which template codings made by analysts that are playing different roles in the production of a particular template influence the resulting key, which provides clues to the effectiveness of the quality control strategies used in the template preparation process.</Paragraph> <Paragraph position="7"> The results of an experimental test of different methods of scoring human performance are then presented, with the goal of selecting a method that is statistically reliable, minimizes bias, and has other desirable characteristics.
Data that indicate overall levels of human performance on the task, variability among analysts, and reliability of the data are then presented.</Paragraph> <Paragraph position="8"> The results of an investigation into the development of analyst skill are then presented, with the significant question being the need to understand whether the performance levels being measured truly reflect analysts who have a high level of skill.</Paragraph> <Paragraph position="9"> The performance of humans for information extraction is then compared with that of machine systems, in terms of both errors and metrics that attempt to separate out two different aspects of performance, recall and precision. A final section of the paper discusses implications of the results for assessing the maturity and applicability of extraction technology.</Paragraph> </Section> </Paper>