<?xml version="1.0" standalone="yes"?>
<Paper uid="M93-1007">
  <Title>MUC-5 EVALUATION METRICS</Title>
  <Section position="2" start_page="0" end_page="286" type="metho">
    <SectionTitle>
SCORE REPORTS
</SectionTitle>
    <Paragraph position="0"> The MUC-5 Scoring System is evaluation software that aligns and scores the templates produced by th e information extraction systems under evaluation in comparison to an &amp;quot;answer key&amp;quot; created by humans . The Scoring System produces comprehensive summary reports showing the overall scores for the templates in the test set ; these may be supplemented by detailed score reports showing scores for each template individually. Figure 1 shows a sample summary score report in the joint ventures task domain for the error metrics ; Figure 2 shows a corresponding summary score report for the recall-precision metrics .</Paragraph>
    <Section position="1" start_page="0" end_page="286" type="sub_section">
      <SectionTitle>
Scoring Categories
</SectionTitle>
      <Paragraph position="0"> The basic scoring categories are found in the score report under the column headings COR, PAR, INC , XCR, XPA, XIC, MIS, SPU, and NON. These categories have not fundamentally changed since the MUC-4 evaluation. The rows in the body of the score report are for the various slots and objects in the template ; various totals appear at the bottom .</Paragraph>
      <Paragraph position="1"> For the MUC-5 evaluation, alignment of system responses (i.e., templates, objects, and slot-fillers generated by the system under evaluation) with the answer key was done fully automatically, and scoring was don e interactively. In interactive scoring mode, the evaluator is queried for a scoring decision only under certain circumstances; under most circumstances, the scoring decisions are made automatically . The meaning of each of th e scoring categories is described below and summarized in Table 1 .</Paragraph>
      <Paragraph position="2"> * If the response and the key are deemed to be equivalent, the category is correct (COR); if interactively assigned, a tally appears in both the COR and XCR (interactive correct) columns.</Paragraph>
      <Paragraph position="3"> * If the response and the key are judged to be a near match, the category is partial (PAR) ; if interactively assigned, a tally appears in both the PAR and XPA (interactive partial) columns .</Paragraph>
      <Paragraph position="4">  * If the key and response do not match, the category is incorrect (INC) ; if interactively assigned, a tall y appears in both the INC and XIC (interactive incorrect) columns.</Paragraph>
      <Paragraph position="5"> * If the key has a fill and the response has no corresponding fill, the category is missing (MIS) . * If the response has a fill which has no corresponding fill in the key, the category is spurious (SPU) . * If the key and response are both left blank, then the category is noncommittal (NON) .</Paragraph>
      <Paragraph position="6">  The columns in Figures 1 and 2 labelled possible (POS) and actual (ACT) contain the tallies of the numbe r of slot fillers that should be generated and the number of fillers that the system under evaluation actually generated , respectively. Possible is the sum of the correct, partial, incorrect, and missing . Actual is the sum of the correct, partial, incorrect, and spurious . These tallies are used in the computation of some of the evaluation metrics . The total possibl e is system-dependent and is therefore computed by summing the tallies assigned to the system responses rather tha n by simply summing the slot fillers to be found in the key template . In contrast, a system-independent metric will be explained in a later section .</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="286" end_page="286" type="metho">
    <SectionTitle>
ALL OBJECTS
MATCHED ONLY
TEXT FILTERING
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="286" end_page="286" type="sub_section">
      <SectionTitle>
Summary Rows
</SectionTitle>
      <Paragraph position="0"> The two summary rows in the score report labelled &amp;quot;ALL OBJECTS&amp;quot; and &amp;quot;MATCHED ONLY&amp;quot; show th e accumulated tallies obtained by scoring spurious and missing objects in different manners . Templates may contain  key is blank and response is no t response is blank and key is no t key and response are both blank 7 1 more than one instance of a kind of object, e .g., more than one &lt;entity&gt; object. The keys and responses may not agree in the number of objects generated. These cases lead to spurious and/or missing objects. Opinions as to how muc h systems should be penalized for spurious or missing objects differ depending upon the requirements of th e application in mind . These differing views have lead us to provide the two ways of scoring spurious and missin g information as outlined in Table 2 .</Paragraph>
      <Paragraph position="1"> The MATCHED ONLY manner of scoring penalizes the least for missing and spurious objects by scorin g them only in the object ID slot. This object ID score does not impact the overall score because the object ID slot is no t included in the summary tallies; the tallies include only the individual slots . ALL OBJECTS is a stricter manner o f scoring because it penalizes for both the slot fills missing in the missing objects and the slots filled in the spuriou s object. The metrics calculated based on the scores in the ALL OBJECTS row of the error score report are the officia l</Paragraph>
    </Section>
    <Section position="2" start_page="286" end_page="286" type="sub_section">
      <SectionTitle>
Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> The rightmost four columns in both the error score report and the recall-precision score report contain th e scores for the evaluation metrics. These are computed for each object and slot in the template, and overall scores ar e shown at the bottom.</Paragraph>
      <Paragraph position="1"> The primary evaluation metrics for MUC-5 have been changed from those used in previous MU C evaluations. The reasoning behind this change will be described in a later section . First, the formulas used to calculat e the evaluation metrics on the score reports will be given .</Paragraph>
    </Section>
    <Section position="3" start_page="286" end_page="286" type="sub_section">
      <SectionTitle>
Error Metrics
</SectionTitle>
      <Paragraph position="0"> The error per response fill (ERR) is the official measure of MUC-5 system performance . This measure is calculated as the number wrong divided by the total (possible plus spurious) as shown in Table 3 . It is dependent on the system because tallies change according to the amount of spurious data generated and according to how th e system tilled slots that have optional or alternate fills in the key. (See the discussion below on richness-normalize d error metric.) Table 3 also shows the computation of three secondary metrics -- undergeneration, overgeneration, an d substitution -- which isolate the three elements constituting overall error . Undergeneration and overgeneration were i n use for MUC-4 as well, and this is why they appear in both the error score report and the recall-precision score report . Those metrics are computed the same way for both reports . The substitution metric is new for MUC-5 and is foun d only in the error score report . The metric is not isolated in the recall-precision view on information extraction ; this is because it is a (negative) factor in both recall and precision ; in the error-based view, on the other hand, it is isolated a s a distinct type of error. The reader should note that the denominator in each of the secondary metrics is differen t because each metric offers a distinct perspective on the errors that a system can make .</Paragraph>
      <Paragraph position="1">  The error per response fill has been chosen as the primary measure reported for a system for this evaluatio n because developers now need to focus on the sources of errors, explain them, and remedy them to push the state o f the art. For example, if System A has the raw scores shown in Figure 3, its error per response fill is calculated a s follows:</Paragraph>
      <Paragraph position="3"> While the error per response fill metric and the undergeneration, overgeneration, and substitution metrics ar e designed to suit the system developers' need for performance diagnostics, a different measure that is as independen t of the system and the text sample as possible may be more useful in some other circumstances . The richness normalized error measure is designed to measure errors relative to the amount of information to be extracted from th e texts. This metric is shown in one of the summary rows at the bottom of the error score report.</Paragraph>
      <Paragraph position="4">  Richness-normalized error is calculated by dividing the number of errors per word by the number of key fill s per word . This calculation reduces to the number of errors divided by the fill-count . If a program manager i s considering use of a system on a distinct class of documents from the ones the system was tested on, this measure wil l predict the number of errors the system will make, given the richness of the new set of documents .</Paragraph>
      <Paragraph position="5"> Due to the optional and alternate fills in the key, there will be a range of fill-counts from the minimu m number of fills required to the maximum number of fills allowed . The difference between the two numbers represen t &amp;quot;discretionary&amp;quot; fills, i .e., ones that represent the ambiguity inherent in the text) The formaulas for calculating the minimum and maximum richness-normalized error appear in Table 4 .</Paragraph>
      <Paragraph position="6"> 1 . For further information on the variability inherent in the key templates, please refer to the published ver sion of the proceedings, which will contain a paper about the text and template corpora.</Paragraph>
      <Paragraph position="7">  For example, if system B has the raw scores in Figure 4 and if the key is filled as in Figure 5, the fill-coun t will range from the minimum required fills, which is a sum of Required Fills + Minimum Alternate Discretionar y Fills (20+ 10), to the maximum allowed fills, which is the sum of Required Fills + Optional Discretionary Fills + Maximum Alternate Discretionary Fills (20 + 10 + 30) . For this system, the richness-normalized error will range from 40/60 to 40/30 or 0.67 to 1 .33.</Paragraph>
      <Paragraph position="8"> Note that the maximum richness-normalized error can be greater than 1 .00 because the fill-count in the key can he less than the number wrong for a system that overgenerates . Note also that the minimum richness-normalized error can he less than the error per response fill because the (system-independent) fill-count in the key can be greate r than the (system-dependent) total used in the denominator in error per response fill .</Paragraph>
      <Paragraph position="9"> The error score report also contains a row called &amp;quot;Error Rate per Word,&amp;quot; but it should be noted that thi s metric is not comparable between the Japanese and the English and is not highly accurate for Japanese .</Paragraph>
    </Section>
    <Section position="4" start_page="286" end_page="286" type="sub_section">
      <SectionTitle>
Recall-Precision Metrics
</SectionTitle>
      <Paragraph position="0"> We have designated the recall, precision, and F-measure metrics that were used for MUC-4 as unofficia l secondary metrics for MUC-5 in order to maintain continuity with previous MUCs . They can be used to explain current performance in comparison to past performance. Further analysis is still necessary to determine thei r contribution to the evaluation of data extraction systems as compared to the error-based metrics .</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="286" end_page="286" type="metho">
    <SectionTitle>
Richness-Normalized Error
</SectionTitle>
    <Paragraph position="0"> The recall-precision evaluation metrics were adapted from the field of Information Retrieval (IR) an d extended for the MUC evaluations . They measure four different aspects of performance and an overall, combine d view of performance . The four evaluation metrics of recall, precision, undergeneration, and overgeneration ar c calculated for the slots and in the summary score rows (see Table 5) . The fifth metric, the F-measure, is a combined score for the entire system and is listed at the bottom of the report .</Paragraph>
    <Paragraph position="1"> Recall (REC) is the percentage of possible answers which were correct . Precision (PRE) is the percentage of actual answers given which were correct . A system has a high recall score if it does well relative to the number of slo t fills in the key. A system has a high precision score if it does well relative to the number of slot fills it attempted : In IR, a common way of representing the characteristic performance of systems is in a precision-recal l graph. Normally, as recall goes up, precision tends to go down and vice versa [I ] . To directly measure underpopulation or overpopulation of the template database by the information extraction systems, we introduced th e measures of undergeneration and overgeneration .</Paragraph>
    <Paragraph position="2">  Methods have been developed for combining the measures of recall and precision to get a single measure . In MUC-4, we used van Rijsbergen's F-measure [1, 2] for this purpose . The F-measure provides a way of combinin g recall and precision to get a single measure which falls between recall and precision . Recall and precision can hav e relative weights in the calculation of the F-measure, giving it the flexibility to be useful in the context of differen t application requirements . The formula for calculating the F-measure is :</Paragraph>
    <Paragraph position="3"> F = ((β² + 1) P R) / (β² P + R)</Paragraph>
    <Paragraph position="4"> where P is precision, R is recall, and β is the relative importance given to recall over precision. If recall and precision are of equal weight, β = 1.0. This value is shown in the score report under the heading "P&amp;R." The heading "2P&amp;R" is for recall half as important as precision (β = 0.5). The heading "P&amp;2R" is for recall twice as important as precision (β = 2.0). The F-measure is calculated from the recall and precision values in the ALL OBJECTS row.</Paragraph>
    <Paragraph position="5"> Note that the F-measure is higher if the values of recall and precision are more towards the center of th e precision-recall graph than at the extremes and their sums are the same . So, for R = 1 .0, a system which has recall o f 50% and precision of 50% has a higher F-measure than a system which has recall of 20% and precision of 80%. This behavior is what we wanted from this single measure, which we expected would encourage developers to pus h overall performance and, at the same time, to minimize the trade-off between the competing requirements fo r minimal missing, spurious, and substitution types of error .</Paragraph>
    <Paragraph position="6"> F=  An example showing the new metrics and the old (along with the pertinent scoring categories) for thre e theoretical systems is given in Figures 6 and 7 . In this example, the error per response fill is the same for each of th e three systems even though the F-measures are different. However, the secondary metrics of undergeneration , overgeneration, and substitution serve to distinguish the three systems . This hypothetical example points out th e important role that the secondary metrics could play in system analysis as well as the analysis of the quality of the  Also appearing in the recall-precision score report is a row called &amp;quot;Text Filtering.&amp;quot; The purpose of this row i s to report how well systems distinguish relevant articles from irrelevant articles . The scoring program keeps track of how many times each of the situations in the contingency table arises for a system (see Table 6) . It then uses those values to calculate the entries in the Text Filtering row . The evaluation metrics are calculated for the row as indicate d by the formulas at the bottom of Table 6.</Paragraph>
    <Paragraph position="7"> The Role of the Noncommittal Scoring Categor y The reader will have noticed that the category of &amp;quot;noncommittal&amp;quot; responses has been omitted from the metrics. Although this may not seem reasonable from an applications perspective, from a research perspective w e believe that the exclusion of noncommittal responses results in a much less distorted cross-system view o f performance. The question comes down to whether systems normally leave a slot blank out of knowledge or whethe r they do so out of a lack of knowledge . Highly immature systems tend either to overgenerate to an extreme, leavin g few blanks, or to undergenerate to an extreme, leaving many blanks . The latter type of immature system is more common and may benefit unfairly from a metric that considers a noncommittal response to be a correct response, especially if there are relatively many blanks in the key templates .</Paragraph>
    <Paragraph position="8"> If, for example, noncommittals were considered correct responses and included in the denominator of the error per response fill measure, the rankings of all 17 MUC-4 systems on TST3 (the name of one of the two test set s used in the evaluation) would change. The most radical changes would be for immature systems whose number of noncommittals greatly outweighs all other categories of response . Since there are a lot of immature systems evaluate d for MUC-5 (as there were for MUC-4) and since the average number of fills in the answer-key templates for MUC- 5 is only about half of what it was for MUC-4, the distortions of the results for MUC-5 have the potential to be eve n greater than they were for MUC-4 . However, the potential effect on the MUC-5 evaluation is damped somewhat b y the fact that the MUC-5 template consists of objects that are aligned separately ; response objects that contain an insufficient amount of slot-fillers to warrant an alignment with a key object are not scored against a key object at th e slot level. Nonetheless, we believe that omitting noncommittals from the metrics provides a better basis fo r comparison across the full range of MUC-5 (and MUC-4) systems and provides a more accurate assessment of the state of the art.</Paragraph>
  </Section>
class="xml-element"></Paper>