<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1059">
  <Title>THIRD MESSAGE UNDERSTANDING EVALUATION AND CONFERENCE (MUC-3): PHASE 1 STATUS REPORT</Title>
  <Section position="6" start_page="302" end_page="304" type="concl">
    <SectionTitle>
MEASURES OF PERFORMANCE
</SectionTitle>
    <Paragraph position="0"> All systems are being evaluated on the basis of performance on the information extraction task in a blind test at the end of each phase of the evaluation. It is expected that the degree of success achieved by the different techniques in May will depend on such factors as whether the number of possible slot fillers is small, finite, or open-ended and whether the slot can typically be filled by fairly straightforward extraction or not. System characteristics such as amount of domain coverage, degree of robustness, and general ability to make proper use of information found in novel input will also be major factors. The dry-run test results cannot be assumed to provide a good basis for estimating performance on the official test in May.</Paragraph>
    <Paragraph position="1"> An excellent, semi-automated scoring program has been developed and distributed to all participants to enable the calculation of the various measures of performance. The two primary measures are completeness (recall) and accuracy (precision). There are two additional measures, one to isolate the amount of spurious data generated (overgeneration) and the other to determine the rate of incorrect generation as a function of the number of opportunities to incorrectly generate (fallout). Fallout can be calculated only for those slots whose fillers form a closed set. Scores for the other three measures are calculated for the test set overall, with breakdowns by template slot. Figure 3 presents a somewhat simplified set of definitions for the measures.</Paragraph>
    <Paragraph position="2"> The most significant thing to note is that precision and recall are actually calculated on the basis of points: the term &quot;correct&quot; includes system responses that matched the key exactly (earning 1 point each) and system responses that were judged to be a good partial match (earning 0.5 point each). It should also be noted that overgeneration figures in precision by contributing to the denominator in addition to being isolated as a measure in its own right. Overgeneration also figures in fallout by contributing to the numerator. This fact will come up again in the next section in the discussion of the phase 1 results.</Paragraph>
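The scoring arithmetic described above can be sketched in a few lines of Python. This is an illustration only, not the MUC-3 scoring program. The point scheme (1 for an exact match, 0.5 for a good partial match), the presence of spurious fills in the precision denominator, and their presence in the fallout numerator are taken from the text. The class and function names, the exact composition of the denominators, and all counts are assumptions, since the definitions of Figure 3 are not reproduced in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotTally:
    """Hypothetical response counts for one slot (or for the whole test set)."""
    correct: int    # exact matches with the key, 1 point each
    partial: int    # good partial matches, 0.5 point each
    incorrect: int  # fills that disagree with the key
    spurious: int   # fills generated where the key has none
    missing: int    # key fills the system failed to produce

    @property
    def points(self) -> float:
        # Points, not raw counts, go into recall and precision.
        return self.correct + 0.5 * self.partial

    @property
    def possible(self) -> int:
        # Assumption: every fill in the key is matched, mismatched, or missed.
        return self.correct + self.partial + self.incorrect + self.missing

    @property
    def actual(self) -> int:
        # Spurious fills are counted here, so overgeneration lowers precision.
        return self.correct + self.partial + self.incorrect + self.spurious


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return None if denominator == 0 else numerator / denominator


def recall(t: SlotTally) -> Optional[float]:
    return ratio(t.points, t.possible)


def precision(t: SlotTally) -> Optional[float]:
    return ratio(t.points, t.actual)


def overgeneration(t: SlotTally) -> Optional[float]:
    return ratio(t.spurious, t.actual)


def fallout(t: SlotTally, possible_incorrect: int) -> Optional[float]:
    # possible_incorrect is the assumed number of opportunities to generate
    # incorrectly; it is only meaningful for slots with a closed set of fillers.
    return ratio(t.incorrect + t.spurious, possible_incorrect)


# Invented counts, chosen only to exercise the functions.
tally = SlotTally(correct=40, partial=10, incorrect=10, spurious=20, missing=40)
print(recall(tally), precision(tally), overgeneration(tally), fallout(tally, 300))
```

Returning None for a zero denominator is a design choice of this sketch; it keeps an unfilled slot distinguishable from a slot that scored zero.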
    <Paragraph position="3"> In addition to the official measures, unofficial measures will be obtained in May of performance on particular linguistic phenomena (e.g., conjunction), as measured by the database fills generated by the systems in particular sets of instances. That is, sentences exemplifying a selected phenomenon will be marked for separate scoring if successful handling of the phenomenon seems to be required in order to fill one or more template slots correctly for that sentence. An experiment involving several phenomena tests was conducted as part of the dry run. The tests concerned the interpretation of active versus passive clauses, main versus embedded clauses, conjunction of noun phrases, and negation. The results for the dry run were extremely inconclusive, given the lack of basic domain coverage of the systems and, for several sites, the exclusive use of nonlinguistic processing components. In addition, the utility of this means of judging linguistic coverage was eroded by the fact that most systems had multiple points of failure; some may have handled the linguistic phenomena correctly in the early stages of analysis, but failed to fill the slots correctly due to subsequent processing failure.</Paragraph>
    <Paragraph position="4"> PHASE 1 RESULTS The results obtained in the first phase of the evaluation are unofficial and will therefore not be presented in their entirety. To give readers an idea of the current top level of performance of the participating systems, scores from two systems are presented anonymously. Table 1 presents a summary of the scores obtained on recall, precision, and overgeneration for the system that scored highest overall on recall and the system that scored highest overall on precision (with recall above a threshold of 10%). The results for the fallout measure cannot be calculated for the test overall (because the fillers for some slots do not form closed sets) and are therefore not included in Table 1.</Paragraph>
    <Paragraph position="5"> S1 has nearly four times greater recall than S2, and so it is not surprising that its overgeneration score is significantly worse than S2's. In this regard, it should be noted that generating a spurious template incurs a penalty that affects only slot 1, the template ID slot. Thus, although the precision of S1 is lower than S2's as expected, the difference is not nearly as marked as it would be if the penalty for generating a spurious template affected all slots rather than just the template ID slot.</Paragraph>
    <Paragraph position="6"> The recall columns in Table 2 suggest to what extent S1 and S2 have been developed to fill data in the various template slots. S2 has zero percent recall for several of the slots. In the particular case of S2, a system based on thorough syntactic and semantic analysis, the reason for the zero recall is that system development simply has not focused yet on filling those slots. Only one (slot 4) requires a string fill; the other three take a set fill.</Paragraph>
    <Paragraph position="7"> However, in the case of systems based on text categorization techniques (not represented in Tables 1 and 2), zero recall is more likely to appear consistently in the slots whose fillers do not form a closed set, reflecting an inherent limitation in the approach. In order to obtain measures that give a fair appraisal of all systems in terms of their ability to select proper categories of responses, it has been suggested that a second set of &quot;overall&quot; measures be calculated that includes only those slots for which the fillers form a closed set.</Paragraph>
    <Paragraph position="8"> As defined for MUC-3, the numerator for fallout includes both the number of spurious slot fillers and the number of incorrect slot fillers. The inclusion of the spurious fillers in the numerator changes the intended meaning of the measure, as seen in the results for slot 4 in Table 2. That slot can be filled with one of only two possible set fills, either STATE-SPONSORED VIOLENCE or TERRORIST ACT, or it sometimes is intended to be null (represented as a hyphen in the notation). All other slots for which fallout can be computed have significantly more options, i.e., &quot;opportunities to incorrectly generate.&quot; If the fallout score were computed without including spurious fillers, the scores for the CATEGORY OF INCIDENT slot should be relatively low compared to the other slot scores for fallout. Instead, the scores for fallout on that slot are higher than for any of the others, probably showing that the systems frequently filled that slot when it was supposed to be null.</Paragraph>
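The effect of counting spurious fillers in the fallout numerator can be illustrated with a small Python sketch. The function name, its parameters, and every count below are invented for illustration; they are not MUC-3 results, and the way the number of opportunities is obtained is an assumption. The sketch only shows how a slot with few options and many fills that should have been null can end up with the highest fallout once spurious fillers are included.

```python
def fallout_rate(incorrect: int, spurious: int, opportunities: int,
                 include_spurious: bool = True) -> float:
    """Fallout for one closed-set slot; all parameter names are hypothetical."""
    if opportunities == 0:
        raise ValueError("fallout is undefined without opportunities to err")
    # As defined for MUC-3, spurious fillers are added to incorrect ones.
    numerator = incorrect + (spurious if include_spurious else 0)
    return numerator / opportunities


# Invented counts: a slot with two set fills that is often filled when the
# key is null, and a slot with many set fills and fewer spurious responses.
slots = {
    "two-option slot": dict(incorrect=5, spurious=30, opportunities=100),
    "many-option slot": dict(incorrect=20, spurious=10, opportunities=1000),
}

for name, counts in slots.items():
    as_defined = fallout_rate(**counts)
    without_spurious = fallout_rate(**counts, include_spurious=False)
    print(f"{name}: as defined {as_defined:.3f}, "
          f"without spurious {without_spurious:.3f}")
```

With these made-up numbers, the gap between the two variants is much larger for the two-option slot, which is the pattern the paragraph above attributes to slots being filled when they were supposed to be null.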
  </Section>
</Paper>