<?xml version="1.0" standalone="yes"?>
<Paper uid="X93-1018">
  <Title>COMPARING HUMAN AND MACHINE PERFORMANCE FOR NATURAL LANGUAGE INFORMATION EXTRACTION: Results from the Tipster Text Evaluation</Title>
  <Section position="3" start_page="0" end_page="179" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> In evaluating the state of technology for extracting information from natural language text by machine, it is valuable to compare the performance of machine extraction systems with that achieved by humans performing the same task.</Paragraph>
    <Paragraph position="1"> The purpose of this paper is to present some results from a comparative study of human and machine performance for one of the information extraction tasks used in the Tipster/MUC-5 evaluation. These results can help assess the maturity and applicability of the technology.</Paragraph>
    <Paragraph position="2"> The Tipster program, through the Institute for Defense Analyses (IDA) and several collaborating U.S. government agencies, produced a corpus of filled &quot;templates&quot; (structured information extracted from text). This corpus was used both in the development of machine extraction systems by contractors and in the evaluation of the developed systems.</Paragraph>
    <Paragraph position="3"> Production of templates was performed by human analysts extracting the data from the text and structuring it, using a set of structuring rules for &quot;filling&quot; the templates and computer software that made it easier for analysts to organize information. Because of this rather extensive effort by analysts to create these templates, it was possible to study the performance of humans for this task in some detail and to develop methods for comparing this performance with that of machines participating in the Tipster/MUC-5 evaluation.</Paragraph>
    <Paragraph position="4"> The texts from which the templates were filled were newspaper and technical magazine articles concerned either with joint business ventures or microelectronics fabrication technology. Each topic domain used text in two languages, English and Japanese. This paper discusses preparation of templates and presents detailed results for human and machine performance; a shorter paper [1] discusses preparation of templates and basic results.</Paragraph>
    <Paragraph position="5"> The primary motivation for this study was to provide reliable data that would allow machine extraction performance to be compared with that of humans. The MUC and Tipster programs have included extensive efforts to develop measurements that can objectively evaluate the performance of the different machine systems. However, although these measures are capable of reliably discriminating between the performance of different machine systems, they are not very useful by themselves in evaluating how near the technology is to providing reliable performance and the extent to which it is ready to be used in applications. Sundheim [2] initiated the study of human performance for extraction by providing estimates of human performance for the task used in the MUC-4 evaluation; the present study provides human data for the Tipster/MUC-5 evaluation that was produced under relatively controlled conditions and with methods and statistical measures that assess the reliability of the data.</Paragraph>
    <Paragraph position="6"> A second motivation for the study was for its value in helping produce better quality templates so as to allow high-quality system development and reliable evaluation. The quality and consistency of the templates being produced were monitored as analysts were trained and gained experience, and particular efforts were made to identify the causes of errors and inconsistency so as to develop strategies for reducing error and increasing consistency.</Paragraph>
    <Paragraph position="7"> A third motivation for studying human performance was to better understand the nature of the extraction task and the relative performance of humans compared with machines on different aspects of the task. Such an understanding can particularly help in the construction of human-machine integrated systems that are designed to make the best use of what are at the present time rather different abilities of humans and machines [3].</Paragraph>
    <Paragraph position="8"> This paper is organized as follows: The paper begins with a discussion of how the templates were prepared, with particular emphasis on the strategies that were used that served to minimize errors and maximize consistency, including detailed fill rules, having more than one analyst code a given template, and the use of software tools with error detection capabilities.</Paragraph>
    <Paragraph position="9"> The paper next describes the results of an investigation into the extent to which template codings made by analysts playing different roles in the production of a particular template influence the resulting key. This provides clues to the effectiveness of the quality control strategies used in the template preparation process.</Paragraph>
    <Paragraph position="10"> The results of an experimental test of different methods of scoring human performance are then presented, with the goal of selecting a method that is statistically reliable, minimizes bias, and has other desirable characteristics. Data that indicate overall levels of human performance on the task, variability among analysts, and reliability of the data are then presented.</Paragraph>
    <Paragraph position="11"> The results of an investigation into the development of analyst skill are then presented, the key question being whether the performance levels being measured truly reflect analysts who have a high level of skill.</Paragraph>
    <Paragraph position="12"> The performance of humans for information extraction is then compared with that of machine systems, in terms of both errors and metrics that attempt to separate out two different aspects of performance, recall and precision. The results of a study comparing the effect of key preparation on the evaluation of machine performance are then presented. This is particularly relevant to the question of how keys should be prepared for future MUC and Tipster evaluations.</Paragraph>
    <Paragraph position="13"> A study is then presented of the extent to which machines and humans agree on the relative difficulty of particular templates. The results of a pilot study are then presented in which the performance of humans and machines is compared for particular kinds of information, to see which kinds of information machines extract comparatively worse or better than humans.</Paragraph>
    <Paragraph position="14"> A final section of the paper draws some general conclusions about the results and their implications for assessing the maturity and applicability of extraction technology.</Paragraph>
  </Section>
</Paper>