<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-2003">
  <Title>The Need for Accurate Alignment in Natural Language System Evaluation</Title>
  <Section position="3" start_page="232" end_page="235" type="intro">
    <SectionTitle>
2. Information Extraction, MUC, and the F-score Metric
</SectionTitle>
    <Paragraph position="0"> IE systems process streams of natural language input and produce representations of the information relevant to a particular task, typically in the form of database templates. In accordance with the aforementioned array of roles served by evaluation methods, the MUCs have been very influential, being the primary driving force behind IE research in the past decade: The MUCs have helped to define a program of research and development .... The MUCs are notable ... in that they have substantially shaped the research program in information extraction and brought it to its current state. (Grishman and Sundheim 1995, 1-2) There have been seven MUCs, starting with MUC-1 in 1987, and ending with MUC-7 in 1997. The metrics used--precision, recall, and F-score--are probably the most exhaustively used metrics for any natural language understanding application; precision and recall have been in use since MUC-2 in 1989 (Grishman and Sundheim 1995), and F-score since MUC-4 in 1992 (Hirschman 1998). IE evaluation has thus been extensively thought out, revised, and experimented with: For the natural language processing community in the United States, the pre-eminent evaluation activity has been the series of Message Understanding Conferences (MUCs) .... the MUC conferences provide us with over a decade of experience in evaluating language understanding. (Hirschman 1998, 282) The MUCs therefore provide a rich and established basis for the study of the effects of evaluation with respect to its roles noted above. As previously indicated, we will focus in this paper on MUC-6, held in 1995, and exclusively on the procedure used for aligning system-generated responses with those in an answer key.</Paragraph>
    <Paragraph position="1">  Computational Linguistics Volume 27, Number 2</Paragraph>
    <Paragraph position="3"> Output template for MUC-6.</Paragraph>
    <Paragraph position="4"> Marketing &amp; Media: Star TV Chief Steps Down After News Corp. Takeover In the wake of a takeover by News Corp., the chief executive officer of Star TV resigned after less than six months in that post, industry executives said.</Paragraph>
    <Paragraph position="5"> Last week, News Corp. bought 63.6% of the satellite broadcaster, which serves the entire Asian region, for $525 million in cash and stock from Hutchison Whampoa Ltd. and the family of Li Ka-shing. At the time of the purchase, News Corp. executives said they would like the executive, Julian Mounter, to stay. However, Star's chief balked at the prospect of not reporting directly to Rupert Murdoch, News Corp.'s chairman and chief executive officer, people close to Mr. Mounter said. Both Mr. Mounter and a Star spokesman declined to comment.</Paragraph>
    <Paragraph position="6"> It is likely that Star's new chief executive will report to either Sam Chisholm, chief executive of News Corp.'s U.K.-based satellite broadcaster, British Sky Broadcasting, or Chase Carey, chief operating officer of News Corp.'s Fox Inc. film and television unit.</Paragraph>
    <Paragraph position="7"> Mr. Mounter's departure is expected to be formally announced this week.</Paragraph>
    <Paragraph position="8"> Although there are no obvious successors, it is expected that Mr. Murdoch will choose someone from either British Sky Broadcasting or Fox to run Star, said a person close to News Corp. Figure 2 Example text from MUC-6 development set (9308040024).</Paragraph>
    <Section position="1" start_page="233" end_page="234" type="sub_section">
      <SectionTitle>
2.1 Task Definition
</SectionTitle>
      <Paragraph position="0"> The MUC-6 task was, roughly speaking, to identify information in business news that describes executives moving in and out of high-level positions within companies (Grishman and Sundheim 1995). The template structure that MUC-6 systems populated is shown in Figure 1. There are three types of values used to fill template slots. String fills, shown italicized, are simply strings taken from the text, such as CEO for the POST slot in a SUCCESSION_EVENT. Set fills are chosen from a fixed set of values, such as DEPART_WORKFORCE for the VACANCY_REASON slot in a SUCCESSION_EVENT.</Paragraph>
      <Paragraph position="1"> Finally, Pointer fills, shown in angle brackets, hold the identifier of another template structure, e.g., the SUCCESSION_ORE slot in a SUCCESSION_EVENT will contain a pointer to an ORGANIZATION template.</Paragraph>
      <Paragraph position="2"> Figure 2 displays a text from the MUC-6 development corpus. When a participating system encounters this passage, it should extract the information that Julian  Kehler, Bear, and Appelt Accurate Alignment in Evaluation  Mounter is &amp;quot;out&amp;quot; of the position of CEO of company Star TV, along with other information associated with the event. The correct results for this passage, as encoded in a human-annotated answer key, are shown in the middle column of Figure 3.</Paragraph>
    </Section>
    <Section position="2" start_page="234" end_page="235" type="sub_section">
      <SectionTitle>
2.2 Evaluation: Alignment
</SectionTitle>
      <Paragraph position="0"> The rightmost column of Figure 3 shows hypothetical output of an IE system. The first step of the evaluation algorithm, alignment, determines which templates in the system's output correspond to which ones in the key. Generally speaking, there can be any number of templates in the key and system response for a given document; all, some, or no pairs of which may be descriptions of the same event. The alignment algorithm must thus determine the correct template pairing.</Paragraph>
      <Paragraph position="1"> This process is not necessarily straightforward, since there may be no slot in a template that uniquely identifies the event or object which it describes. In response to this problem, the MUC community decided to adopt a relatively lax alignment criterion, leaving it to the alignment algorithm to find the alignment that optimizes the resulting score. The procedure has two major steps. First, it determines which pairs of templates are possible candidates for alignment; the criterion for candidacy was only that the templates share a common value for at least one slot. (Pointer fills share a common value if they point to objects that are aligned by the algorithm.) This criterion often results in alignment ambiguities--a key template will often share a common slot value with several templates in the system's output, and vice versa-and thus a method for selecting among the alternative mappings is necessary. The candidate pairs are rank ordered by a mapping score, which simply counts the number of slot values the templates have in common. 1 The scoring algorithm then considers 1 The scoring software provided for MUC-6 allows for alternative scoring configurations based on slot content, including the ability to assign different weights to slots, but this was the configuration used for development and evaluation in MUC-6.</Paragraph>
      <Paragraph position="2">  Computational Linguistics Volume 27, Number 2 key and response template pairs according to this order, aligning them when neither member has already been mapped to another template. Ties between pairs with the same number of common slot values are broken arbitrarily. Because this algorithm is heuristic--with many combinations of alignments never being considered--the result may not be the globally optimal alignment in terms of score assigned.</Paragraph>
      <Paragraph position="3"> In our example, there is only one template of each type in each response, and thus there are no mapping ambiguities. The only requirement is that each pair share a common slot value, which is the case, and so the algorithm aligns each as shown in</Paragraph>
    </Section>
    <Section position="3" start_page="235" end_page="235" type="sub_section">
      <SectionTitle>
2.3 Evaluation: Scoring
</SectionTitle>
      <Paragraph position="0"> Once the templates are aligned, the scoring algorithm performs slot-by-slot comparisons to determine errors. The leftmost column in Figure 3 shows examples of the three types of errors that the algorithm will mark. First, while our hypothetical system recognized that the correct PERSON and ORGANIZATION are Julian Mounter and Star TV, respectively, it missed the ORG_ALIAS, ORG_DESCRIPTOR, PER_ALIAS, and PER_TITLE values that appear later in the passage, resulting in four missing slot fills (denoted by mis in the left hand column). Next, it also erroneously assigned a value to the ORG_LOCALE and ORG_COUNTRY slots in the ORGANIZATION, resulting in two spurious slot fills (denoted by spu). Finally, the system got three of the set fill slots wrong--the VACANCY_REASON slot in the SUCCESSION_EVENT, and the NEW_STATUS and ON_THE_JOB slots of the IN_AND_OuT template--resulting in three incorrect slot fills (denoted by inc). The remainder of the slot fills are correct (denoted by cor). 2 Again, pointer slots are scored as correct when they point to templates that have been aligned.</Paragraph>
      <Paragraph position="1"> The possible fills are those in the key which contribute to the final score, and the actual fills are those in the system's response which contribute to the final score:</Paragraph>
      <Paragraph position="3"> Three metrics are computed from these results: precision, recall, and F-score. Precision is the number of correct fills divided by the total number generated by the system, and recall is the number of correct fills divided by the the total number in the key:</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML