File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/04/w04-1003_concl.xml
Size: 2,782 bytes
Last Modified: 2025-10-06 13:54:14
<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1003">
  <Title>The Effects of Human Variation in DUC Summarization Evaluation</Title>
  <Section position="5" start_page="0" end_page="0" type="concl">
    <SectionTitle>5 Conclusions</SectionTitle>
    <Paragraph position="0">The secondary experiments described in this paper were by necessity small in scope and so are not conclusive. Still, they consistently suggest that the SEE-based coverage results reported in the first three DUCs are stable: despite large variations in the human-generated model summaries and large variations in human judgments of single-model coverage, the ranking of the systems remained comparatively constant when averaged over dozens of document sets, dozens of peer summaries, and roughly ten judges.</Paragraph>
    <Paragraph position="1">Note that this stability holds only on average: individual document sets still show variation, and the scoring cannot be used reliably at that level. However, variation in human summaries reflects the real application, and improved summarization methodology can only aim at better performance on average.</Paragraph>
    <Paragraph position="2">Attempts to reduce or incorporate variability in summarization evaluation will and should continue, e.g., by using &quot;factoids&quot; (van Halteren and Teufel, 2003) or &quot;summarization content units&quot; (Passonneau and Nenkova, 2004) as smaller units for generating model summaries. Constraining factors such as those used in DUC-2003 are helpful, but only in some cases, since many types of summaries have no natural constraints. Variability issues will likely have to be dealt with for some time and from a number of points of view.</Paragraph>
    <Paragraph position="3">In manual evaluations, the results of this study need to be confirmed on other data. In ROUGE-like automatic evaluations, which avoid variability in judgments and exploit variation in models, the question of how the number of models and their variability affects the quality of the ROUGE scoring needs study.</Paragraph>
    <Paragraph position="4">Beyond laboratory-style evaluations, system builders need to attend to variability. The averages hide variations that need to be analysed; systems that do well on average still need failure and success analysis on individual test cases in order to improve. The variations in human performance also need to be studied to understand better why they occur and what this implies about the acceptability of automatic text summarization for real end-users. Finally, the effect of variability in training data on the machine learning algorithms used in constructing many summarization systems must be understood.</Paragraph>
  </Section>
</Paper>
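The ranking stability noted in the conclusions rests on averaging per-summary coverage judgments over document sets and judges. As a minimal sketch, assuming invented system names, document-set ids, judge ids, and scores (this is not the DUC/SEE scoring pipeline), the following Python code shows that kind of averaging and ranking:

    # Illustrative sketch only: averages hypothetical per-summary coverage
    # judgments over document sets and judges, then ranks systems.
    # All identifiers and scores below are invented for illustration.
    from collections import defaultdict
    from statistics import mean

    # scores[(system, docset, judge)] = coverage score in [0, 1] (hypothetical)
    scores = {
        ("sysA", "d01", "j1"): 0.42, ("sysA", "d01", "j2"): 0.55,
        ("sysA", "d02", "j1"): 0.38, ("sysA", "d02", "j2"): 0.47,
        ("sysB", "d01", "j1"): 0.31, ("sysB", "d01", "j2"): 0.52,
        ("sysB", "d02", "j1"): 0.29, ("sysB", "d02", "j2"): 0.35,
    }

    def rank_systems(scores):
        """Average each system's coverage over all (docset, judge) pairs
        and return systems sorted from best to worst average."""
        per_system = defaultdict(list)
        for (system, _docset, _judge), value in scores.items():
            per_system[system].append(value)
        averages = {s: mean(vals) for s, vals in per_system.items()}
        return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

    print(rank_systems(scores))

Individual document sets may rank the systems differently from one another, but the averaged ranking is the quantity the paper reports as comparatively stable.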
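The question of how the number and variability of model summaries affect ROUGE-like scoring can be made concrete with a toy computation. The sketch below is a rough approximation of a ROUGE-1-style unigram recall averaged over several model summaries; it is not the official ROUGE implementation, and all texts are invented:

    # Rough, illustrative approximation of a ROUGE-1-style unigram recall
    # against multiple model (reference) summaries. This is not the official
    # ROUGE package; it only shows how the set of models enters the score.
    from collections import Counter

    def unigram_recall(peer: str, model: str) -> float:
        """Fraction of the model's unigrams (with multiplicity) covered by the peer."""
        peer_counts = Counter(peer.lower().split())
        model_counts = Counter(model.lower().split())
        overlap = sum(min(count, peer_counts[tok]) for tok, count in model_counts.items())
        total = sum(model_counts.values())
        return overlap / total if total else 0.0

    def multi_model_score(peer: str, models: list) -> float:
        """Average the per-model recall over all available model summaries."""
        return sum(unigram_recall(peer, m) for m in models) / len(models)

    peer = "the storm caused major flooding in the city"          # hypothetical peer summary
    models = [
        "severe flooding hit the city after the storm",           # hypothetical model 1
        "a storm led to widespread flooding",                     # hypothetical model 2
    ]
    print(multi_model_score(peer, models))

Adding or dropping a model summary changes the averaged recall, which is the sensitivity to the number and variability of models that the authors say needs further study.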