<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1003">
  <Title>The Effects of Human Variation in DUC Summarization Evaluation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Initial Design - DUC-2001
</SectionTitle>
    <Paragraph position="0"> Since the roadmap specified testing in DUC-2001 of both single and multi-document summarization, the data sets and tasks were designed as follows.</Paragraph>
    <Paragraph position="1"> Sixty sets of approximately 10 documents each were provided as system input for this task. Given such a set of documents, the systems were to automatically create a 100-word generic summary for each document. Additionally they were to create a generic summary of the entire set, one summary at each of four target lengths (approximately 400, 200,  age 100, and 50 words).</Paragraph>
    <Paragraph position="2"> The sets of documents were assembled at NIST by 10 retired information analysts. Each person selected six document sets, and then created a 100-word manual abstract for each document, and for the entire document set at the 400, 200, 100 and 50 word lengths. Thirty of the sets (documents and manual abstracts) were distributed as training data and the remaining thirty sets of documents (without abstracts) were distributed as test data.</Paragraph>
    <Paragraph position="3"> Fifteen groups participated in DUC-2001, with 11 of them doing single document summarization and 12 of them doing the multi-document task.</Paragraph>
    <Paragraph position="4"> The evaluation plan as specified in the roadmap was for NIST to concentrate on manual comparison of the system results with the manually-constructed abstracts. To this end a new tool was developed by Chin-Yew Lin at the Information Sciences Institute, University of Southern California (http: //www.isi.edu/~cyl/SEE/). This tool allows a summary to be rated in isolation as well as compared to another summary for content overlap.</Paragraph>
    <Paragraph position="5"> Figure 1 shows one example of this interface. Human evaluation was done at NIST using the same personnel who created the manual abstracts (called model summaries).</Paragraph>
    <Paragraph position="6"> One type of evaluation supported by SEE was coverage, i.e., how well did the peer summaries (i.e., those being evaluated) cover the content of the documents (as expressed by the model summary).</Paragraph>
    <Paragraph position="7"> A pairwise summary comparison was used in this part of the evaluation and judges were asked to do detailed coverage comparisons. SEE allowed the judges to step through predefined units of the model summary (elementary discourse units/EDUs) (Soricut and Marcu, 2003) and for each unit of that summary, mark the sentences in the peer summary that expressed [all(4), most(3), some(2), hardly any(1) or none(0)] of the content in the current model summary unit. The resulting ordered category scale[04] is treated as an interval scale in the coverage score based on feedback from the judges on how it was used. The coverage score for a given peer summary is the mean of its scores against the EDUs of the associated model ( 4 EDUs per summary for the 50-word model summaries). This process is much more complex than doing a simple overall comparison using the entire summary but past evaluation experiences indicated that judges had more difficulty making an overall decision than they did making decisions at each EDU.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 DUC-2001 Results - Effect of Variability in
Models
</SectionTitle>
      <Paragraph position="0"> Recall that there are two very different sources of human variation in DUC-2001, as in all the DUC evaluations. The first is the disagreement among judges as to how well a system summary covers the model summary. This is similar to what is seen in relevance assessment for IR evaluations. To the extent that different judges are consistently more lenient or strict, this problem has been handled in DUC by having the same judge look at all summaries for a given document set so that all peer summaries are affected equally and by having enough document sets to allow averaging over judges to mitigate the effect of very strict or very lenient judges. If a judge's leniency varies inconsistently in a way dependent on which system is being judged (i.e., if there is an interaction between the judge and the system), then other strategies are needed. (Data was collected and analyzed in DUC-2002 to assess the size of these interactions.) Summarization has a second source of disagreement and that is the model summaries themselves. People write models that vary not only in writing style, but also in focus, i.e., what is important to summarize in a document or document set.</Paragraph>
      <Paragraph position="1"> To shed light on variability in creation of models and their use, each of the 30 document sets in the test set (plus the 300 individual documents) were summarized independently by three summarizers the one who had selected the documents plus two others. These extra summaries were used as additional peer human summaries in the main evaluation and also in a special study of the model effects on evaluation.</Paragraph>
      <Paragraph position="2"> This special study worked with a random subset of 20 document sets (out of 30). Each peer was judged twice more by a single person who had not done the original judgment. This person used the two extra models, neither of which had been created by the person doing the judgments. There was only time to do this for the multi-document summaries at lengths 50 and 200.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Model Differences
</SectionTitle>
      <Paragraph position="0"> A first question is how much did the two models differ. One way of measuring this is by a simple n-gram overlap of the terms. This was done based on software in the MEAD toolkit (http://www.summarization.com), without omitting the commonwords, nor doing any stemming, and the n-grams were allowed to span sentence boundaries. The average unigram overlap (the number of unique unigrams in the intersection/the number of unique unigrams in the union) for the two extra 50-word model summaries was 0.151 and there were only 6 out of the 20 sets that had any tri-gram overlap at all. For the 200-word summaries, the average unigram overlap was 0.197, with 16 out of the 20 sets having tri-gram overlaps.</Paragraph>
      <Paragraph position="1"> These numbers seem surprisingly low, but an examination of the summaries illustrates some of the reasons. What follows are the two model pairs with the greatest and least unigram overlap in the two extra 50-word document set group.</Paragraph>
      <Paragraph position="2"> Document set 32, Judge G &amp;quot;In March 1989, an Exxon oil tanker crashed on a reef near Valdez, Alaska, spilling 8.4 million gallons of oil into Prince William Sound seriously damaging the environment. The cleanup was slow and Exxon was subject to severe compensation costs and indictment by a federal jury on five criminal charges.&amp;quot; Document set 32, Judge I &amp;quot;On March 24, 1989, the Exxon Valdez spilled 11.3 million gallons of crude oil in Prince William Sound, Alaska. Cleanup of the disaster continued until September and cost almost $2 billion, but 117 miles of beach remained oily.</Paragraph>
      <Paragraph position="3"> Exxon announced an earnings- drop in January 1990 and was ordered to resume cleaning on May 1.&amp;quot; Document set 14, Judge B &amp;quot;U.S. military aircraft crashes occur throughout the world more often than one might suspect. They are normally reported in the press; however, only those involving major damage or loss of life attract extensive media coverage. Investigations are always conducted.</Paragraph>
      <Paragraph position="4">  set for the two extra 50-word models Flight safety records and statistics are kept for all aircraft models.&amp;quot; Document set 14, Judge H &amp;quot;1988 crashes included four F-16s, two F-14s, three A-10s, two B-52s, two B-1Bs, and one tanker. In 1989 one T-2 trainer crashed. 1990 crashes included one F-16, one F-111, one F-4, one C-5A, and 17 helicopters.</Paragraph>
      <Paragraph position="5"> Other plane crashes occurred in 1975 (C-5B), 1984 (B-52), 1987 (F-16), and 1994 (F-15).&amp;quot; For document set 32, the two model creators are covering basically the same content, but are including slightly different details (and therefore words). But for document set 14, the two models are written at very different levels of granularity, with one person writing a very high-level analysis whereas the other one gives only details. Note that these are only examples of the variation seen across the models; many other types of variations exist.</Paragraph>
      <Paragraph position="6"> Additionally there is a wide variation in overlap across the 20 document sets (see Figure 2). This document set variation is confounded with the human variation in creating the models since there were 6 different humans involved for the 20 document sets.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Effects of Model Differences on Document
Set Coverage Scores
</SectionTitle>
      <Paragraph position="0"> Figure 3 shows the absolute value of the coverage score differences between the two extra models for each of the 20 document sets for the 50-word summaries. The middle bar shows the median, the black  by document set for the two extra 50-word models dot the average, and the box comprises the middle 2 quartiles. The open circles are outliers.</Paragraph>
      <Paragraph position="1"> There is a large variation across document sets, with some sets having much wider ranges in coverage score differences based on the two different models. Looking across all 20 document sets, the average absolute coverage difference is 0.437 or 47.8% of the highest scoring model for the 50-word summaries and 0.318 (42.5%) for the 200-word summaries. This large difference in scores is coming solely from the model difference since judgment is being made by the same person (although some self-inconsistency is involved (Lin and Hovy, 2002)).</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Relationship between Model Differences
and Coverage Scores
</SectionTitle>
      <Paragraph position="0"> Does a small unigram overlap in terms for the models in a given document set predict a wide difference in coverage scores for peers judged against the models in that document set? Comparing Figures 2 and 3, or indeed graphing overlap against coverage (Figure 4) shows that there is little correlation between these two. One suspects that the humans are able to compensate for different word choice and that the coverage differences shown in Figure 3 represent differences in content in the models.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.5 Effects of Model Differences on per System
Coverage Scores
</SectionTitle>
      <Paragraph position="0"> How does the choice of model for each document set affect the absolute and relative coverage score for each system averaged across all document sets? Figure 5 shows the median coverage scores (50 null using extra model sets (50-word summaries) word summaries) for the 12 systems using each of the two extra model sets. The points for the coverage scores are connected within a given model to make changes in rank with neighbors more obvious. It can be seen that the scores are close to each other in absolute value and that the two lines track each other in general. (The same type of graph could be shown for the 200-word summaries, but here there were even smaller differences between system rankings.) null What is being suggested (but not proven) by Figure 5 is that the large differences seen in the model overlap are not reflected in the absolute or relative system results for the DUC-2001 data examined. Most of the systems judged better against one set of models are still better using different models. The correlation (Pearson's) between median coverage scores for the systems using the two extra model sets is 0.641 (p &lt; 0.05). This surprising stability of system rankings clearly needs further analysis beyond this paper, but implies that the use of enough instances (document sets in this case) allows an averaging effect to stablize rankings.</Paragraph>
      <Paragraph position="1"> There are many system judgments going into these averages, basically 20 document sets times the average number of model units judged per document set ( 4). These 80 measurements should make the means of the extra scorings better estimates of the &amp;quot;true&amp;quot; coverage and hence more alike. More importantly, Figure 5 suggests that there is minimal model/system interaction. Although no analysis of variance (ANOVA) was run in DUC-2001, the ANOVAs for DUCs 2002 and 2003 verify this lack of interaction.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 DUC-2002
</SectionTitle>
    <Paragraph position="0"> DUC-2002 was designed and evaluated in much the same manner as DUC-2001 to allow continuity of research and evaluation. There were 60 more document sets with manual abstracts created in the same way as the first 60 sets. The target lengths of the summaries were shortened to eliminate the 400-word summary and to include a headline length summary. The SEE GUI was modified to replace the five-point intervals [All, most, some, hardly any, or none] with percentages [0, 20, 40, 60, 80, 100] to reflect their perception by judges and treatment by researchers as a ratio scale. Seventeen groups that took part in DUC-2002, with 13 of them tackling the single document summary task (at 100 words) and 8 of them working on the multi-document task.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 DUC-2002 Results - Effect of Variability in
Judges
</SectionTitle>
      <Paragraph position="0"> Beyond the main evaluation, it was decided to measure the variability of the coverage judgments, this time holding the models constant. For six of the document sets, each peer was judged three additional times, each time by a different judge but using the same model (not a model created by any of the judges). Whereas the judgment effect does not change the relative ranking of systems in the TREC information retrieval task (Voorhees, 1998), the task in coverage evaluation is much more cognitively difficult and needed further exploration. In DUC the question being asked involves finding a  using extra judgment sets (50-word summaries) shared meaning between the content in each model summary unit and in the peer summary sentence, and determining how much meaning is shared - a very subjective judgment.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Differences in the Coverage Judgments
Using the Same Model
</SectionTitle>
      <Paragraph position="0"> The average absolute coverage score difference between the highest and lowest of the three extra scorings of each peer summary for the 50-word summaries was 0.079, which is a 47.6% difference (0.070 for the 200-word, or 37.1%). This is about the same percentage differences seen for the coverage differences based on using different models in DUC-2001.</Paragraph>
      <Paragraph position="1"> Once again, there is a wide variation across the six document sets (see Figure 6). Even though the median is similar across these sets, the variation is much larger for two of the document sets, and much smaller for two others. The variation in coverage score for the 200-word summaries is much less, similar to what was found in DUC-2001.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Effects of Judgment Differences on per
System Coverage Scores
</SectionTitle>
      <Paragraph position="0"> lines plotted are similar to those shown for the DUC-2001 model variations, one line for each set of extra judgments. The scores again are very close together in absolute value and in general the systems are ranked similarly. In this case, the pairwise correlations (Pearson's) were 0.840, 0.723, and 0.801 (p &lt; 0.05). With only six document sets involved in the averaging, versus the 20 used in DUC-2001, it is surprising that there is still so little effect.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 ANOVA Results
</SectionTitle>
      <Paragraph position="0"> The extra three judgments per peer allowed for analysis of variance (ANOVA) and estimates of the sizes of the various main effects and interactions. While the main effects (the judge, system, and document set) can be large, they are by definition equally distributed across all systems. Although still significant, the three interactions modeled - judge/system, judge/docset, and system/docset, are much smaller (on the order of the noise, i.e., residuals) and so are not likely to introduce a bias into the evaluation.</Paragraph>
      <Paragraph position="1"> Due to lack of space, only the ANOVA for DUC-2003 is included (see Table 1).</Paragraph>
      <Paragraph position="2"> 4 DUC-2003 For DUC-2003 it was decided to change the tasks somewhat. In an effort to get the human summaries closer to a common focus, each of the multi-document summary tasks had some constraining factor. There were four different tasks for summarization, one very short &amp;quot;headline&amp;quot; task for single documents (300 single documents in the test set), and three different multi-document summary tasks (each task had 30 document sets used in testing). There were 21 groups that participated in DUC-2003, with 13 of them doing task 1, 16 doing task 2, 11 doing task 3 and only 9 trying task</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 DUC-2003 Results - Effect of Variability in
Judges and Models
</SectionTitle>
      <Paragraph position="0"> Beyond the main evaluation it was decided to do further investigation into the effects of model and judgment variation, in particular to focus on task 4 (create short summaries of 10 documents that were relevant to a given question). Each of the 30 document sets in task 4 had four different model summaries built by four different people, and four judgments made where the judge in each case was the model creator. The two types of variations were deliberately confounded for several reasons. The first was that the variations had already been investigated separately and it was important to investigate the combined effect. The second related issue is that this confounding mimics the way the evaluation is being run, i.e. the judges are normally using their own model, not someone else's model. The third reason was to provide input to the proposed automatic evaluation (ROUGE) to be used in DUC-2004 in which multiple models would be used but with no human judgments.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Differences in Model/Judgment Sets
</SectionTitle>
      <Paragraph position="0"> The n-gram overlap for the 30 document sets is shown in Figure 8 with six possible pairwise comparisons for each set of four model summaries. The average unigram overlap is 0.200, but again a wide variation in overlap across the different document sets.</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Effects of Model/Judgment Differences
</SectionTitle>
      <Paragraph position="0"> Looking only at the maximum and minimum score in each set of four, the coverage score differences  using extra judgment sets (100-word summaries) are still high, with an average absolute coverage difference of 0.139 or 69.1% difference. Again there is a wide variation across document set/judge pair (see Figure 9).</Paragraph>
      <Paragraph position="1"> Figure 10 shows the absolute coverage scores for each system for each of the four model/judgment pairs. The difference in absolute scores is small, and the relative ranking of the systems is mostly unchanged. For DUC-2003, the pairwise correlations (Pearson's) are 0.899, 0.894, 0.837, 0.827, 0.794, and 0.889 (p &lt; 0.05). Additionally the scores are lower and closer than in earlier DUCs; this is proba- null bly because task 4 was a new task and systems were in a learning curve.</Paragraph>
    </Section>
    <Section position="8" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 ANOVA Results
</SectionTitle>
      <Paragraph position="0"> An analysis of variance was also run on the DUC-2003 task 4 multiple models and judgments study, and results are presented in Table 1. The abbreviations for the column headings are as follows: Df (degrees of freedom), SS (sum of squares), MS (mean square), F (F value), Pr(F) (probability of F under the null hypothesis). The judge, system, and document set effects predominate as expected. Although still significant, the three interactions modeled - judge/system (jud/sys), judge/docset (jud/ds) and system/docset (sys/ds) are smaller than any of the main effects.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>