<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1020">
  <Title>Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics</Title>
  <Section position="3" start_page="1" end_page="7" type="metho">
    <SectionTitle>
2 Document Understanding Conference
</SectionTitle>
    <Paragraph position="0">  included the follow two main tasks: * Fully automatic single-document summarization: given a document, participants were required to create a generic 100-word summary. The training set comprised 30 sets of approximately 10 documents each, together with their 100-word human written summaries. The test set comprised 30 unseen documents.</Paragraph>
    <Paragraph position="1"> * Fully automatic multi-document summarization: given a set of documents about a single subject, participants were required to create 4 generic summaries of the entire set, containing 50, 100, 200, and 400 words respectively. The document sets were of four types: a single natural disaster event; a  Multiple judgments occur when more than one performance score is given to the same system (or human) and human summary pairs by the same human judge.</Paragraph>
    <Paragraph position="2">  DUC 2001 and DUC 2002 have similar tasks, but summaries of 10, 50, 100, and 200 words are requested in the multi-document task in DUC 2002.</Paragraph>
    <Paragraph position="3">  single event; multiple instances of a type of event; and information about an individual. The training set comprised 30 sets of approximately 10 documents, each provided with their 50, 100, 200, and 400-word human written summaries. The test set comprised 30 unseen sets.</Paragraph>
    <Paragraph position="4"> A total of 11 systems participated in the single-document summarization task and 12 systems participated in the multi-document task.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.1 Evaluation Materials
</SectionTitle>
      <Paragraph position="0"> For each document or document set, one human summary was created as the ideal model summary at each specified length. Two other human summaries were also created at each length. In addition, baseline summaries were created automatically for each length as reference points. For the multi-document summarization task, one baseline, lead baseline, took the first 50, 100, 200, and 400 words in the last document in the collection. A second baseline, coverage baseline, took the first sentence in the first document, the first sentence in the second document and so on until it had a summary of 50, 100, 200, or 400 words. Only one baseline (baseline1) was created for the single document summarization task.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="6" type="sub_section">
      <SectionTitle>
2.2 Summary Evaluation Environment
</SectionTitle>
      <Paragraph position="0"> To evaluate system performance NIST assessors who created the ideal written summaries did pairwise comparisons of their summaries to the system-generated summaries, other assessors summaries, and baseline summaries. They used the Summary Evaluation Environment (SEE) 2.0 developed by (Lin 2001) to support the process. Using SEE, the assessors compared the systems text (the peer text) to the ideal (the model text). As shown in Figure 1, each text was decomposed into a list of units and displayed in separate windows.</Paragraph>
      <Paragraph position="1"> SEE 2.0 provides interfaces for assessors to judge both the content and the quality of summaries. To measure content, assessors step through each model unit, mark all system units sharing content with the current model unit (green/dark gray highlight in the model summary window), and specify that the marked system units express all, most, some, or hardly any of the content of the  at five different levels: all, most, some, hardly any, or none  . For example, as shown in Figure 1, an assessor marked system units 1.1 and 10.4 (red/dark underlines in the left pane) as sharing some content with the current model unit 2.2 (highlighted green/dark gray in the right).</Paragraph>
    </Section>
    <Section position="3" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
2.3 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> Recall at different compression ratios has been used in summarization research to measure how well an automatic system retains important content of original documents (Mani et al. 1998). However, the simple sentence recall measure cannot differentiate system performance appropriately, as is pointed out by Donaway et al. (2000). Therefore, instead of pure sentence recall score, we use coverage score C. We define it as follows null</Paragraph>
      <Paragraph position="2"> E, the ratio of completeness, ranges from 1 to 0: 1 for all, 3/4 for most, 1/2 for some, 1/4 for hardly any, and 0 for none. If we ignore E (set it to 1), we obtain simple sentence recall score. We use average coverage scores derived from human judgments as the references to evaluate various automatic scoring methods in the following sections.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="7" end_page="8" type="metho">
    <SectionTitle>
3 BLEU and N-gram Co-Occurrence
</SectionTitle>
    <Paragraph position="0"> To automatically evaluate machine translations the machine translation community recently adopted an n-gram co-occurrence scoring procedure BLEU (Papineni et al.</Paragraph>
    <Paragraph position="1"> 2001). The NIST (NIST 2002) scoring metric is based on BLEU. The main idea of BLEU is to measure the translation closeness between a candidate translation and a set of reference translations with a numerical metric. To achieve this goal, they used a weighted average of variable length n-gram matches between system translations and a set of human reference translations and showed that a weighted average metric, i.e. BLEU, correlating highly with human assessments.</Paragraph>
    <Paragraph position="2"> Similarly, following the BLEU idea, we assume that the closer an automatic summary to a professional human  Does the summary observe English grammatical rules independent of its content?  Do sentences in the summary fit in with their surrounding sentences?  Is the content of the summary expressed and organized in an effective way?  These category labels are changed to numerical values of 100%, 80%, 60%, 40%, 20%, and 0% in DUC 2002.  DUC 2002 uses a length adjusted version of coverage metric C, where C = a*C + (1-a)*B. B is the brevity and a is a parameter reflecting relative importance (DUC 2002). summary, the better it is. The question is: Can we apply BLEU directly without any modifications to evaluate summaries as well?. We first ran IBMs BLEU evaluation script unmodified over the DUC 2001 model and peer summary set. The resulting Spearman rank order correlation coefficient (r) between BLEU and the human assessment for the single document task is 0.66 using one reference summary and 0.82 using three reference summaries; while Spearman r for the multi-document task is 0.67 using one reference and 0.70 using three. These numbers indicate that they positively correlate at a = 0.01  . Therefore, BLEU seems a promising automatic scoring metric for summary evaluation. According to Papineni et al. (2001), BLEU is essentially a precision metric. It measures how well a machine translation overlaps with multiple human translations using n-gram co-occurrence statistics. N-gram precision in BLEU is computed as follows:  (n-gram) is the maximum number of n-grams co-occurring in a candidate translation and a reference translation, and Count(n-gram) is the number of n-grams in the candidate translation. To prevent very short translations that try to maximize their precision scores, BLEU adds a brevity penalty, BP, to the for- null Where |c |is the length of the candidate translation and |r |is the length of the reference translation. The BLEU formula is then written as follows:</Paragraph>
    <Paragraph position="4"> N is set at 4 and w n , the weighting factor, is set at 1/N. For summaries by analogy, we can express equation (1) in terms of n-gram matches following equation (2):  (n-gram) is the maximum number of n-grams co-occurring in a peer summary and a model unit and Count(n-gram) is the number of n-grams in the model unit. Notice that the average n-gram coverage score, C n , as shown in equation 5 is a recall metric  The number of instances is 14 (11 systems, 2 humans, and 1 baseline) for the single document task and is 16 (12 systems, 2 humans, and 2 baselines) for the multi-document task. instead of a precision one as p n . Since the denominator of equation 5 is the total sum of the number of n-grams occurring at the model summary side instead of the peer side and only one model summary is used for each evaluation; while there could be multiple references used in BLEU and Count clip (n-gram) could come from matching different reference translations. Furthermore, instead of a brevity penalty that punishes overly short translations, a brevity bonus, BB, should be awarded to shorter summaries that contain equivalent content. In fact, a length adjusted average coverage score was used as an alternative performance metric in DUC 2002. However, we set the brevity bonus (or penalty) to 1 for all our experiments in this paper. In summary, the n-gram co-occurrence statistics we use in the following sections are based on the following formula:</Paragraph>
    <Paragraph position="6"> i+1). Ngram(1, 4) is a weighted variable length n-gram match score similar to the IBM BLEU score; while Ngram(k, k), i.e. i = j = k, is simply the average k-gram</Paragraph>
    <Paragraph position="8"> With these formulas, we describe how to evaluate them in the next section.</Paragraph>
  </Section>
  <Section position="5" start_page="8" end_page="400" type="metho">
    <SectionTitle>
4 Evaluations of N-gram Co-Occurrence
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
Metrics
</SectionTitle>
      <Paragraph position="0"> In order to evaluate the effectiveness of automatic evaluation metrics, we propose two criteria:  1. Automatic evaluations should correlate highly, positively, and consistently with human assessments. null 2. The statistical significance of automatic evaluations  should be a good predictor of the statistical significance of human assessments with high reliability. The first criterion ensures whenever a human recognizes a good summary/translation/system, an automatic evaluation will do the same with high probability. This enables us to use an automatic evaluation procedure in place of human assessments to compare system performance, as in the NIST MT evaluations (NIST 2002). The second criterion is critical in interpreting the significance of automatic evaluation results. For example, if an automatic evaluation shows there is a significant difference between run A and run B at a = 0.05 using the z-test (t-test or bootstrap resampling), how does this translate to real significance, i.e. the statistical significance in a human assessment of run A and run B? Ideally, we would like there to be a positive correlation between them. If this can be asserted with strong reliability (high recall and precision), then we can use the automatic evaluation to assist system development and to be reasonably sure that we have made progress.</Paragraph>
    </Section>
    <Section position="2" start_page="8" end_page="400" type="sub_section">
      <SectionTitle>
4.1 Correlation with Human Assessments
</SectionTitle>
      <Paragraph position="0"> As stated in Section 3, direct application of BLEU on the DUC 2001 data showed promising results. However, BLEU is a precision-based metric while the human evaluation protocol in DUC is essentially recall-based.</Paragraph>
      <Paragraph position="1"> We therefore prefer the metric given by equation 6 and use it in all our experiments. Using DUC 2001 data, we compute average Ngram(1,4) scores for each peer system at different summary sizes and rank systems according to their scores. We then compare the Ngram(1,4) ranking with the human ranking. Figure 2 shows the result of DUC 2001 multi-document data.</Paragraph>
      <Paragraph position="2"> Stopwords are ignored during the computation of Ngram(1,4) scores and words are stemmed using a Porter stemmer (Porter 1980). The x-axis is the human ranking and the y-axis gives the corresponding Ngram(1,4) rankings for summaries of difference sizes.</Paragraph>
      <Paragraph position="3"> The straight line marked by AvgC is the ranking given by human assessment. For example, a system at (5,8) Table 1. Spearman rank order correlation coefficients of different DUC 2001 data between</Paragraph>
      <Paragraph position="5"> rankings and human rankings including (S) and excluding (SX) stopwords. SD-100 is for single document summaries of 100 words and MD-50, 100, 200, and 400 are for multi-document summaries of 50, 100, 200, and 400 words. MD-All averages results from summaries of all sizes.</Paragraph>
      <Paragraph position="6">  ranks it at the 8 th . If an automatic ranking fully matches the human ranking, its plot will coincide with the heavy diagonal. A line with less deviation from the heavy diagonal line indicates better correlation with the human assessment.</Paragraph>
      <Paragraph position="7"> To quantify the correlation, we compute the Spearman rank order correlation coefficient (r) for each N-</Paragraph>
      <Paragraph position="9"> run at different summary sizes (n). We also test the effect of inclusion or exclusion of stopwords. The results are summarized in Table 1.</Paragraph>
      <Paragraph position="10"> Although these results are statistically significant (a = 0.025) and are comparable to IBM BLEUs correlation figures shown in Section 3, they are not consistent across summary sizes and tasks. For example, the correlations of the single document task are at the 60% level; while they range from 50% to 80% for the multi-document task. The inclusion or exclusion of stopwords also shows mixed results. In order to meet the requirement of the first criterion stated in Section 3, we need better results.</Paragraph>
      <Paragraph position="12"> score is a weighted average of variable length n-gram matches. By taking a log sum of the n-gram matches, the Ngram(1,4) n favors match of longer n-grams. For example, if United States of America occurs in a reference summary, while one peer summary, A, uses United States and another summary, B, uses the full phrase United States of America, summary B gets more contribution to its overall score simply due to the longer version of the name. However, intuitively one should prefer a short version of the name in summarization. Therefore, we need to change the weighting scheme to not penalize or even reward shorter equivalents. We conduct experiments to understand the effect of individual n-gram co-occurrence scores in approximating human assessments. Tables 2 and 3 show the results of these runs without and with stopwords respectively.</Paragraph>
      <Paragraph position="13"> For each set of DUC 2001 data, single document 100-word summarization task, multi-document 50, 100, 200, and 400 -word summarization tasks, we compute 4 different correlation statistics: Spearman rank order correlation coefficient (Spearman r), linear regression t-test</Paragraph>
      <Paragraph position="15"> , 11 degree of freedom for single document task and 13 degree of freedom for multi-document task), Pearson product moment coefficient of correlation (Pearson r), and coefficient of determination (CD) for each Ngram(i,j) evaluation metric. Among them Spearman r is a nonparametric test, a higher number indicates higher correlation; while the other three tests are parametric tests. Higher LR  and multiple document tasks when stopwords are ignored. Importantly, unigram performs especially well with Spearman r ranging from 0.88 to 0.99 that is better than the best case in which weighted average of variable length n-gram matches is used and is consistent across different data sets.</Paragraph>
      <Paragraph position="16"> (2) The performance of weighted average n-gram scores is in the range between bi-gram and tri-gram co-occurrence scores. This might suggest some summaries are over-penalized by the weighted average metric due to the lack of longer n-gram matches. For example, given a model string United States, Japan, and Taiwan, a candidate  for 4 different statistics (without stopwords): Spearman rank order coefficient correlation (Spearman r),  string United States, Taiwan, and Japan has a unigram score of 1, bi-gram score of 0.5, and tri-gram and 4-gram scores of 0 when the stopword and is ignored. The weighted average n-gram score for the candidate string is 0.</Paragraph>
      <Paragraph position="17"> (3) Excluding stopwords in computing n-gram co-occurrence statistics generally achieves better correlation than including stopwords.</Paragraph>
    </Section>
    <Section position="3" start_page="400" end_page="400" type="sub_section">
      <SectionTitle>
4.2 Statistical Significance of N-gram Co-Occurrence Scores versus Human Assessments
</SectionTitle>
      <Paragraph position="0"> We have shown that simple unigram, Ngram(1,1), or bigram, Ngram(2,2), co-occurrence statistics based on equation (6) outperform the weighted average of n-gram matches, Ngram(1,4), in the previous section. To examine how well the statistical significance of the automatic Ngram(i,j) metrics translates to real significance when human assessments are involved, we set up the following test procedure: (1) Compute a pairwise statistical significance test, such as a z-test or a t-test, for a system pair (X, Y) at a certain α level, for example α = 0.05, using both the automatic metric scores and the human-assigned scores.</Paragraph>
      <Paragraph position="1"> (2) Count the number of cases a z-test indicates there is a significant difference between X and Y based on the automatic metric. Call this number N As  .</Paragraph>
      <Paragraph position="2"> (3) Count the number of cases a z-test indicates there is a significant difference between X and Y based on the human assessment. Call this number N Hs .</Paragraph>
      <Paragraph position="3"> (4) Count the cases when an automatic metric predicts  a significant difference and the human assessment also does. Call this N hit . For example, if a z-test indicates system X is significantly different from Y with a = 0.05 based on the automatic metric scores and the corresponding z-test also suggests the same based on the human agreement, then we have a hit.  (5) Compute the recall and precision using the following formulas:</Paragraph>
      <Paragraph position="5"> A good automatic metric should have high recall and precision. This implies that if a statistical test indicates a significant difference between two runs using the automatic metric then very probably there is also a significant difference in the manual evaluation. This would be very useful during the system development cycle to gauge if an improvement is really significant or not.</Paragraph>
      <Paragraph position="6"> Figure 3 shows the recall and precision curves for the DUC 2001 single document task at different a levels and Figure 4 is for the multi-document task with different summary sizes. Both of them exclude stopwords.</Paragraph>
      <Paragraph position="7"> We use z-test in all the significance tests with a level at 0.10, 0.05, 0.25, 0.01, and 0.005.</Paragraph>
      <Paragraph position="8"> From Figures 3 and 4, we can see Ngram(1,1) and Ngram(2,2) reside on the upper right corner of the recall and precision graphs. Ngram(1,1) has the best overall behavior. These graphs confirm Ngram(1,1) (simple  assessment for DUC 2001 single document task. The 5 points on each curve represent values for the 5 a levels.</Paragraph>
      <Paragraph position="9"> Figure 4. Recall and precision curves of N-gram co-occurrence statistics versus human assessment for DUC 2001 multi-document task. Dark (black) solid lines are for average of all summary sizes, light (red) solid lines are for 50-word summaries, dashed (green) lines are for 100-word summaries, dash-dot lines (blue) are for 200-word summaries, and dotted (magenta) lines are for 400-word summaries.</Paragraph>
      <Paragraph position="10"> unigram) is a good automatic scoring metric with good statistical significance prediction power.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>