<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-2003">
  <Title>The Need for Accurate Alignment in Natural Language System Evaluation</Title>
  <Section position="4" start_page="235" end_page="236" type="metho">
    <SectionTitle>
PRE = \frac{COR}{ACT} \qquad REC = \frac{COR}{POS}
</SectionTitle>
    <Paragraph position="0"> In the example, there are 8 correct fills, 13 generated fills, and 15 possible fills in the key, resulting in a precision of 0.615 and a recall of 0.533. Note that the four missing slot fills do not figure into the precision computation, and the two spurious fills do not figure into the recall computation. F-score is determined by a harmonic mean of precision and recall (van Rijsbergen 1979):</Paragraph>
    <Paragraph position="2"> 2 There is also the possibility of getting partial credit, which we will not discuss further nor include in the equations below.</Paragraph>
    <Paragraph position="3">  Kehler, Bear, and Appelt Accurate Alignment in Evaluation In the standard computation of F-score, fl is set to one (indicating equal weight for precision and recall), resulting in:</Paragraph>
    <Paragraph position="5"/>
    <Section position="1" start_page="236" end_page="236" type="sub_section">
      <SectionTitle>
2.4 Focus of the Paper
</SectionTitle>
      <Paragraph position="0"> Before proceeding to the analysis, we take care to note that our aim is neither to provide a thorough analysis of all problematic aspects of NLP evaluation, nor to provide a criticism of the MUCs in particular. Problematic aspects of the scoring procedures used in a variety of NLP evaluations are well attested, including the existence of side effects of scoring procedures which reward behavior that may not be perfectly consistent with the goals of the evaluations that these scoring procedures serve. For instance, word error metrics used in evaluations of speech recognition technology have been criticized for the fact that they assign the same degree of credit for the recognition of all words, a policy that rewards a focus on recognizing frequent but less important words (e.g., urn) over more important but less frequent content words. Similarly, evaluations of syntactic annotation systems that use locally oriented crossing brackets and labeled precision/recall metrics can assign high degrees of credit to cases in which the assigned structures are fairly inaccurate when viewed more globally. Likewise, there are aspects of the scoring procedure used for MUC-6 that one could question, such as the choice of slots included in each template and their corresponding definitions, the decision to weight all slots equally without regard to perceived importance, and the choice to define the templates to have hierarchical structure and give credit for slots that merely contain pointers to lower-level templates. We believe that any mechanical evaluation is likely to have such issues, and while they are very worthy of study and debate, a more detailed discussion of them would take us too far afield from the main purpose of this paper.</Paragraph>
      <Paragraph position="1"> The focus of this paper is purposefully more narrow, being concerned only with the effects of alignment criteria on the goals of evaluation. While it is unclear as of this writing whether there will be future MUCs, evaluation-driven efforts in IE continue to be sponsored, and future evaluations of interpretation tasks of equal or greater complexity are not only likely but crucial if the field is to progress while maintaining its current focus on quantitative evaluation. Because alignment questions will almost certainly become exacerbated as the interpretation problems addressed become more complex, and because of the aforementioned pervasiveness of evaluation in the entire technology development process, the payoff in avoiding potential pitfalls in such evaluations is high. Thus, our aim is to bring lessons learned from the MUC experience to the fore so that they can inform future evaluations that, like the MUCs, are likely to be principal driving forces for research over extended periods of time.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="236" end_page="237" type="metho">
    <SectionTitle>
3. The Impact of Inaccurate Alignment
</SectionTitle>
    <Paragraph position="0"> In the introduction, we described several roles that the MUC-6 evaluation has played in bringing IE technology to its current state: task definition, system development, and cross-system evaluation. We focus here on its role in the system development process, where its influence has been substantial, as it has provided system developers the ability to obtain rapid feedback with which to iteratively gauge their progress on a training  Target and actual outputs for the example text.</Paragraph>
    <Paragraph position="1"> corpus. In essence, a developer can incorporate a new processing strategy or data modification, evaluate the system with the change, keep it if end-to-end performance (i.e., F-score) improves, and withdraw it if it does not. Often changes have unanticipated results, and thus a formal evaluation method is required to affirm or deny developers' (often misleading) intuitions. Likewise, the method has an analogous role for supporting systems that learn rules automatically. In a typical learning scenario, an automated procedure iteratively proposes rule modifications and adopts them only in those cases in which an objective function--that is, the evaluation method--indicates that performance has improved. Thus, it is absolutely crucial that both positive and negative system changes are reflected as such in the feedback provided by the scoring mechanism.</Paragraph>
    <Paragraph position="2"> As we illustrate with the passage shown in Figure 2, the weak alignment criterion used in MUC-6 causes the scoring mechanism to not respect this requirement. 3 The answer key in Figure 4 is as it was in Figure 3, representing the event of Julian Mounter leaving the position of CEO at Star TV. The results of an IE system (in this case, a slightly modified version of what SRI's FASTUS \[Appelt et al. 1995\] extracted for this example) are shown in the righthand column. The IE system made two significant mistakes on this text: It failed to extract any information relating to the correct event, and it overgenerated an incorrect &amp;quot;event,&amp;quot; specifically Rupert Murdoch's becoming CEO of News Corp. 4 Thus, the system's precision should be o = 0, since it generated</Paragraph>
  </Section>
  <Section position="6" start_page="237" end_page="238" type="metho">
    <SectionTitle>
3 In this section we will be arguing our point primarily on the basis of a single illustrative example,
</SectionTitle>
    <Paragraph position="0"> where in fact the MUC development process utilized a 100-text corpus. The arguments extend to this broader setting, of course, a topic to which we return in Section 4.</Paragraph>
    <Paragraph position="1"> 4 An anonymous reviewer points out that because no slot in the template structure can be guaranteed to identify an object, intuitions alone are not enough (despite how strong they might be in a case such as this) to establish that the system output in Figure 4 actually represents a different event, rather than the event described in the output in Figure 3 corrupted by the selection of wrong names for certain slots.</Paragraph>
    <Paragraph position="2"> The FASTUS system produces byte offsets to tie information in templates to the places in the text from which they were created, and from this we can confirm that the event was indeed created solely from textual material unrelated to the event described in the key. See also footnote 7.</Paragraph>
    <Paragraph position="3">  generated none of the 15 slots associated with the correct event. 5 In actuality, however, the scoring algorithm aligned these templates as shown in Figure 4, since each template pair shares at least one slot value. With this alignment, the scoring procedure identifies 8 correct slot values, 5 incorrect values, and 2 missing values, resulting in a recall of 8 = 0.533, a precision of 8 = 0.615, and an F-score of 0.571. These are the same scores received for the output in Figure 3, in which a partial but mostly accurate description of the correct event was extracted.</Paragraph>
    <Paragraph position="4"> The fact that the wholly incorrect output in Figure 4 and the largely correct output in Figure 3 receive equally good scores demonstrates a serious flaw with the evaluation method. As shown in Table 1, the distributions of values for several slots in the MUC-6 training data have relatively low entropy, and thus alignments based on a fortuitous overlap in these values are not rare. 6 For instance, the fact that the ORG_TYPE slot in ORGANIZATION templates has the value COMPANY in 115 out of 117 instances almost ensures that any ORGANIZATION template produced by a system can get matched to an arbitrary one in the key for a given document. From this, in turn, the inaccurate alignments may then cascade: Two unrelated SUCCESSION_EVENT templates can be aligned</Paragraph>
  </Section>
  <Section position="7" start_page="238" end_page="241" type="metho">
    <SectionTitle>
5 Its F-score should be undefined, since the denominator in the F-score equation will be zero. For all
</SectionTitle>
    <Paragraph position="0"> intents and purposes, however, the F-score can be considered to be zero, in the sense that its overall contribution to the F-score assigned to the results over a larger corpus will be zero. With this in mind, for simplicity we may speak of such cases as having an F-score of zero.</Paragraph>
    <Paragraph position="1"> 6 In some cases, the sum of the counts in the rightmost column is greater than the total template count.</Paragraph>
    <Paragraph position="2"> This is because in some cases a key entry allowed for alternative slot values; matching any of the alternatives was sufficient for both alignment and scoring. Likewise, instances of optional slot fills were also included.</Paragraph>
    <Paragraph position="3">  Computational Linguistics Volume 27, Number 2 on the basis of sharing pointers to two unrelated (but nonetheless aligned) ORGANIZATION templates. Likewise, two unrelated IN_AND_OUT templates can be aligned on the basis of sharing pointers to two unrelated PERSON templates that share the value Mr. for the TITLE slot. Between this prospect and the three set fills for IN_AND_OUT templates in Table 1, it is highly likely that an arbitrary IN_AND_OUT template produced by a system will overlap in at least one slot value with an arbitrary one in the key, which may in turn allow the SUCCESSION_EVENTS that point to them to be incorrectly aligned.</Paragraph>
    <Paragraph position="4"> Although the fact that the scoring algorithm is capable of assigning undeserved credit due to overly optimistic alignments is not unknown to the MUC community (see, for instance, Aberdeen et al. \[1995, 153, Table 4\] for a reference to &amp;quot;enthusiastic scoring mapping&amp;quot; for the Template Element task of MUC-6), we are unaware of any previous acknowledgement of, or similar example which demonstrates, the potential severity of the problem to the extent that Figure 4 does. This notwithstanding, one might still be tempted to view this behavior as relatively benign--perhaps there is no real harm done by giving systems the benefit of the doubt, along with a little undeserved credit that goes with it. While this will perhaps result in somewhat artificially inflated scores, there may be no reason to think that it would benefit one system or approach more than another, and this concession might seem reasonable considering the fact that there is likely to be no completely foolproof way to perform alignments.</Paragraph>
    <Paragraph position="5"> However, the potential harm that this behavior manifests in terms of the technology development process--which, to our knowledge, has never been brought to light--is that it creates a situation in which uncontroversially positive changes in system output may result in a dramatically worse score, and likewise negative changes may result in a dramatically better score. Consider a (common) development scenario in which, starting from a state in which the system produces the output in Figure 4 for the text in Figure 2, it is technically too difficult to modify the system to extract the correct (i.e., Julian Mounter) event, but in which a change can nonetheless be made to block the overgenerated (i.e., Rupert Murdoch) event. After such a modification, one would expect no change in recall, since no correct output is created or removed, and an improvement in precision, since an overgenerated event is removed.</Paragraph>
    <Paragraph position="6"> What actually happens in this example is that recall drops from 0.533 (8) to zero (o), and precision goes from 0.615 (8) to undefined (0). To circumvent comparisons with undefined values, we can suppose that there was another, independent event extracted from a different part of the text that was aligned correctly against its corresponding event in the answer key. For simplicity, we will assume that this event receives the same score as the overgenerated event: a precision of 8 and a recall of 8 ig&amp;quot; With the overgenerated event left in, the same scores as before are obtained: 16 ~___ 0.615 16 ~-- 0.533 recall, resulting in an F-score of 0.571. With the overgenerated precision and ~6 event removed, we obtain 8 = 0.615 precision and 8 = .267 recall, resulting in an F-score of 0.372. Instead of no change in recall and an improvement in precision, the reward for eliminating the overgenerated event is the same precision, a 50% reduction in recall, and a 20-point reduction in F-score. Our clear &amp;quot;improvement&amp;quot; thus has the effect of reducing performance dramatically, implicitly instructing the developer to reintroduce the rules responsible for producing the overgenerated event.</Paragraph>
    <Paragraph position="7"> The converse scenario yields an analogous problem. Consider a situation in which a system developer can add a rule to extract at least some of the information in the correct event--producing the output shown in Figure 5, for instance--but for whatever reason cannot make a change to block the overgenerated event. We would expect this change to result in a marked increase in both recall and precision, since unlike before, information for a relevant event is now being produced. Indeed, the  alignment algorithm will correctly align this event with the key and leave the Rupert Murdoch event unaligned, resulting in a precision of 9 = .600, a recall of 9 = .360, and an F-score of 0.450. This is the anticipated result, and would constitute a large increase over the zero F-score that the overgenerated event should have received when standing alone. However, this is a substantial reduction from the F-score of 0.571 that the overgenerated event actually receives, a change which implicitly instructs the developer to remove the rules responsible for extracting the correct event.</Paragraph>
    <Paragraph position="8"> In these two scenarios, positive changes to system output resulted in a dramatically reduced score. The opposite situation can also occur, in which a change that reduces the quality of the system's response nonetheless receives an increased score.</Paragraph>
    <Paragraph position="9"> One can merely reverse the scenarios. For instance, in a situation in which no output is being created for the Julian Mounter event and a developer considers adding a rule that produces the Rupert Murdoch event, the rise in F-score will indicate that this rule should be kept. Likewise, starting with the incorrect output in Figure 4 together with the correct output in Figure 5, a developer might consider removing the rule responsible for creating the correct output. This would cause the F-score to rise from 0.450 to 0.571, implicitly instructing the developer to keep it removed.</Paragraph>
    <Paragraph position="10"> Thus, in all of these scenarios, the feedback provided by the evaluation method may steer our system developer off of the path to the optimal system state. Likewise, the same effect would occur when employing automatic learning methods that use F-score as an objective function. Starting from the state of producing the output in Figure 4, for example, suppose the learning procedure could in fact propose each change necessary to get to the desired output, that is, to (i) eliminate the rules producing the erroneous output, and (ii) add rules for producing the output shown in Figure 5. These changes would result in a precision of 9 = .60, a recall of 9 = .75, and an F-score of 0.667, which is an improvement over both the zero result that the current output should receive, and the (artificially inflated) score of 0.571 it actually does receive.</Paragraph>
    <Paragraph position="11"> However, the type of incremental search process that efficiency concerns generally necessitate--one that can only perform one of steps (i) or (ii) in a single iteration and will only adopt the proposed change if it improves on its objective function--will not find this system state, since as we have seen, either move taken first would actually reduce the F-score.</Paragraph>
    <Paragraph position="12"> To sum, the fortuitous alignments allowed by the MUC-6 evaluation method create a situation in which both positive and negative system changes may not be reflected as such in the evaluation results. While there are other properties of the evaluation that conspire to help produce these anomalous results--including the choice to score all slot fills equally without respect to importance or entropy of their distribution of values, and to score slots which contain only pointers to other templates--these only  Computational Linguistics Volume 27, Number 2 serve to make the effects more or less dramatic than they might otherwise be. The root cause of this behavior is the alignment process: None of the foregoing behaviors would occur if the alignment criterion was such that the templates in Figure 4 were not alignable, thus producing an F-score of zero.</Paragraph>
  </Section>
</Paper>