<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1057">
<Title>ParaEval: Using Paraphrases to Evaluate Summaries Automatically</Title>
<Section position="7" start_page="451" end_page="453" type="evalu">
<SectionTitle> 6 Evaluation of ParaEval </SectionTitle>
<Paragraph position="0"> To evaluate and validate the effectiveness of an automatic evaluation metric, it is necessary to show that automatic evaluations correlate with human assessments highly, positively, and consistently (Lin and Hovy, 2003). In other words, an automatic evaluation procedure should be able to distinguish good from bad summarization systems by assigning scores that closely resemble human assessments.</Paragraph>
<Section position="1" start_page="451" end_page="452" type="sub_section">
<SectionTitle> 6.1 Document Understanding Conference </SectionTitle>
<Paragraph position="0"> The Document Understanding Conference has provided large-scale evaluations of both human-created and system-generated summaries annually.</Paragraph>
<Paragraph position="1"> Research teams are invited to participate in solving summarization problems with their systems. System-generated summaries are then assessed by humans and/or automatic evaluation procedures.</Paragraph>
<Paragraph position="2"> The collection of human judgments on systems and their summaries has provided a test-bed for developing and validating automated summary grading methods (Lin and Hovy, 2003; Hovy et al., 2005).</Paragraph>
<Paragraph position="3"> The correlations reported for ROUGE and BE show that these two metrics correlate with DUC human evaluations much more highly on single-document summarization tasks than on multi-document tasks. One possible explanation is that when summarizing from only one source (text), both human- and system-generated summaries are mostly extractive. Humans take phrases (or even whole sentences) verbatim because there is less motivation to abstract when the input is not highly redundant, in contrast to the input for multi-document summarization tasks, which we speculate invites more abstraction. ROUGE and BE both perform lexical n-gram matching and hence achieve very high correlations. Since our baseline matching strategy is lexically based when paraphrase matching is not activated, validation on single-document summarization results is not repeated in our experiment.</Paragraph>
</Section>
<Section position="2" start_page="452" end_page="453" type="sub_section">
<SectionTitle> 6.2 Validation and Discussion </SectionTitle>
<Paragraph position="0"> We use summary judgments from DUC2003's multi-document summarization (MDS) task to evaluate ParaEval. During DUC2003, participating systems created short summaries (~100 words) for 30 document sets. For each set, one assessor-written summary was used as the reference against which peer summaries created by 18 automatic systems (including baselines) and 3 other human-written summaries were compared. A system ranking was produced by averaging each system's performance over all of its summaries. This evaluation process is replicated in our validation setup for ParaEval. In all, 630 summary pairs were compared. Pearson's correlation coefficient is computed for the validation tests, using DUC2003 assessors' results as the gold standard.</Paragraph>
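As an editorial illustration of the validation setup just described (not the authors' code), the Python sketch below averages per-summary metric scores into one score per system and then correlates the resulting system-level scores with the corresponding human scores using Pearson's r. The system names, document-set IDs, and score values are hypothetical; the actual validation uses the DUC2003 judgments for 21 peers over 30 document sets.

# Sketch of system-level correlation used to validate an automatic metric.
# All scores below are hypothetical toy values.
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def system_level_scores(per_summary_scores):
    """Average each system's per-summary scores into one score per system."""
    by_system = defaultdict(list)
    for (system, _docset), score in per_summary_scores.items():
        by_system[system].append(score)
    return {s: sum(v) / len(v) for s, v in by_system.items()}

# Hypothetical per-summary scores keyed by (system, docset).
metric_scores = {("sysA", "d01"): 0.42, ("sysA", "d02"): 0.38,
                 ("sysB", "d01"): 0.31, ("sysB", "d02"): 0.29}
human_scores  = {("sysA", "d01"): 0.50, ("sysA", "d02"): 0.44,
                 ("sysB", "d01"): 0.35, ("sysB", "d02"): 0.30}

metric_by_sys = system_level_scores(metric_scores)
human_by_sys = system_level_scores(human_scores)
systems = sorted(metric_by_sys)
r = pearson([metric_by_sys[s] for s in systems],
            [human_by_sys[s] for s in systems])
print(f"Pearson's r between metric and human system scores: {r:.3f}")

On real data, scipy.stats.pearsonr would give the same coefficient along with a significance value.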
<Paragraph position="1"> Table 1 illustrates the correlation figures from the DUC2003 test set. ParaEval-para_only shows the correlation result when using only paraphrase and synonym matching, without the baseline unigram matching. ParaEval-2 uses multi-word paraphrase matching and unigram matching, omitting the greedy synonym-matching phase. ParaEval-3 incorporates matching at all three granularity levels. We see that the current implementation of ParaEval closely resembles the way ROUGE-1 differentiates system-generated summaries. We believe this is due to the identical calculation of recall scores. The score that a peer summary receives from ParaEval depends on the number of reference-summary words covered by its paraphrase, synonym, and unigram matches. Counting individual reference words in this way reflects a ROUGE-1-style design in grading. However, a detailed examination of individual reference-peer comparisons shows that paraphrase and synonym matching, in addition to lexical n-gram matching, does measure a higher level of content coverage. This is demonstrated in Figures 6a and 6b. Strict unigram matching places the content retained by a peer summary mostly in the 0.2-0.4 recall range, shown as dark-colored dots in the graphs. Allowing paraphrase and synonym matching raises the detected peer coverage to the 0.3-0.5 range, shown as light-colored dots.</Paragraph>
<Paragraph position="2"> We conducted a manual evaluation to further examine the paraphrases being matched. Using 10 summaries from the Pyramid data, we asked three human subjects to judge the validity of 128 randomly selected paraphrase pairs extracted and identified by ParaEval. Each pair of paraphrases was coupled with its respective sentences as context. All paraphrases judged were multi-word.</Paragraph>
<Paragraph position="3"> ParaEval received an average precision of 68.0%.</Paragraph>
<Paragraph position="4"> Complete agreement among the judges is 0.582 according to the Kappa coefficient (Cohen, 1960).</Paragraph>
<Paragraph position="5"> In Figure 7, we show two examples that the human judges considered to be good paraphrases produced and matched by ParaEval. Judges voiced difficulties in determining &quot;semantic equivalence.&quot; There were cases where paraphrases would be generally interchangeable but could not be counted as matches because they were not semantically equivalent in their particular contexts. And there were paraphrases that were judged as matches but, taken out of context, would not be direct replacements for each other. These two situations are where the judges mostly disagreed.</Paragraph>
</Section>
</Section>
</Paper>
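Editorial note: the discussion of recall in Section 6.2 describes a ROUGE-1-style computation in which a peer summary is credited for every reference word it covers, whether through a multi-word paraphrase match, a synonym match, or a plain unigram match. The minimal Python sketch below is not the ParaEval implementation; the paraphrase table and the two summaries are hypothetical toy data, and the greedy synonym-matching step is omitted for brevity.

# Minimal sketch of a ROUGE-1-style recall that also credits paraphrase matches.
# The paraphrase table and summaries are hypothetical toy data.

# Hypothetical multi-word paraphrase table: phrase in peer -> phrase in reference.
PARAPHRASES = {
    "passed away": "lost his life",
    "in charge of": "responsible for",
}

def tokenize(text):
    return text.lower().split()

def recall(reference, peer):
    ref_tokens = tokenize(reference)
    covered = [False] * len(ref_tokens)
    peer_text = " ".join(tokenize(peer))

    # 1) Multi-word paraphrase matches: if the peer contains a paraphrase,
    #    mark the words of the corresponding reference phrase as covered.
    #    (A substring test keeps the sketch short; a real matcher would align
    #    token spans instead.)
    for peer_phrase, ref_phrase in PARAPHRASES.items():
        if peer_phrase in peer_text:
            phrase_tokens = ref_phrase.split()
            for i in range(len(ref_tokens) - len(phrase_tokens) + 1):
                if ref_tokens[i:i + len(phrase_tokens)] == phrase_tokens:
                    for j in range(i, i + len(phrase_tokens)):
                        covered[j] = True

    # 2) Unigram matches for the remaining reference words.
    peer_vocab = set(peer_text.split())
    for i, tok in enumerate(ref_tokens):
        if not covered[i] and tok in peer_vocab:
            covered[i] = True

    # Recall = covered reference words / total reference words (ROUGE-1 style).
    return sum(covered) / len(ref_tokens)

reference = "the chairman lost his life and was responsible for the failed merger"
peer = "the chairman passed away and was in charge of the merger"
print(f"paraphrase-aware recall: {recall(reference, peer):.2f}")

On this toy pair, the paraphrase matches lift recall well above what strict unigram matching alone would report, which is the effect illustrated by the light-colored dots in Figures 6a and 6b.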