<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1030">
  <Title>MUC/MET Evaluation Trends</Title>
  <Section position="3" start_page="235" end_page="236" type="metho">
    <SectionTitle>
Scenario Template
</SectionTitle>
    <Paragraph position="0"> an event in which entities participated. The scenario provided the domain of the dataset and allowed for relevancy judgments of high accuracy by systems.</Paragraph>
    <Paragraph position="1">  The task definition for ST required relevancy and fill rules. The choice of the domain was dependent to some extent on the evaluation epoch. The structure of the template and the task definition tended to be dependent on the author of the task, but the richness of the templates also served to illustrated the utility of information extraction to users most effectively.</Paragraph>
    <Paragraph position="2"> The filling of the slots in the scenario template was generally a difficult task for systems and a relatively large effort was required to produce ground truth.</Paragraph>
    <Paragraph position="3"> Reasonable agreement(&gt;80%) between annotators was possible, but required sometimes ornate refinement of the task definition based on the data encountered.</Paragraph>
    <Section position="1" start_page="236" end_page="236" type="sub_section">
      <SectionTitle>
Task Definitions
</SectionTitle>
      <Paragraph position="0"> As experience was gained in defining tasks for information extraction, certain principles became invaluable. It was important for the utility of the task to be apparent to end users. The lower level tasks needed to dovetail into the higher level tasks requiring relatively more processing for each higher level.</Paragraph>
      <Paragraph position="1"> It was important for the task definitions to allow the achievement of an 80 - 99% threshold in interannotator agreement depending on how well systems were performing. Also, the ability to annotate text rapidly and with ease was critical to the end product: guidelines and datasets of high quality for research and development.</Paragraph>
      <Paragraph position="2"> The process of refining the task definition required concentration on several cycles of independent annotation, analysis of annotator agreement, and detailed note-taking on which examples required updates to the task definition. The production of consistent datasets was always the goal.</Paragraph>
    </Section>
    <Section position="2" start_page="236" end_page="236" type="sub_section">
      <SectionTitle>
Datasets
</SectionTitle>
      <Paragraph position="0"> To be sure that systems could work on newswire from different sources, the different evaluations utilized material from various sources.  multiple news organizations in a uniform SGML format. Typically, there were 100 texts per dataset. The texts were chosen using mixtures of keywords associated with the domains (terrorism, joint ventures, microelectronics, labor relations, management succession, air crashes, launch events). A pre-defined relevancy ratio was used for each dataset, usually 65% of the texts were relevant. Datasets in MUC-7 were provided for general training, dry run training and test, and formal run training and test data.</Paragraph>
      <Paragraph position="1"> In all of the MUC/MET evaluations the annotation accuracy was at least 80%, or higher whenever system performance was closer to human performance. We coordinated adjudication after the evaluation results were reported but before the final package was released for research use by the community at large.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>