<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1057">
  <Title>ParaEval: Using Paraphrases to Evaluate Summaries Automatically</Title>
  <Section position="3" start_page="447" end_page="447" type="intro">
    <SectionTitle>
2 Previous Work
</SectionTitle>
    <Paragraph position="0"> There has been considerable work in both manual and automatic summarization evaluations. Three most noticeable efforts in manual evaluation are SEE (Lin and Hovy, 2001), Factoid (Van Halteren and Teufel, 2003), and the Pyramid method (Nenkova and Passonneau, 2004).</Paragraph>
    <Paragraph position="1"> SEE provides a user-friendly environment in which human assessors evaluate the quality of system-produced peer summary by comparing it to a reference summary. Summaries are represented by a list of summary units (sentences, clauses, etc.). Assessors can assign full or partial content coverage score to peer summary units in comparison to the corresponding reference summary units. Grammaticality can also be graded unit-wise.</Paragraph>
    <Paragraph position="2"> The goal of the Factoid work is to compare the information content of different summaries of the same text and determine the minimum number of summaries, which was shown through experimentation to be 20-30, needed to achieve stable consensus among 50 human-written summaries.</Paragraph>
    <Paragraph position="3"> The Pyramid method uses identified consensus--a pyramid of phrases created by annotators--from multiple reference summaries as the gold-standard reference summary. Summary comparisons are performed on Summarization Content Units (SCUs) that are approximately of clause length.</Paragraph>
    <Paragraph position="4"> To facilitate fast summarization system designevaluation cycles, ROUGE was created (Lin and Hovy, 2003). It is an automatic evaluation package that measures a number of n-gram co-occurrence statistics between peer and reference summary pairs. ROUGE was inspired by BLEU (Papineni et al., 2001) which was adopted by the machine translation (MT) community for automatic MT evaluation. A problem with ROUGE is that the summary units used in automatic comparison are of fixed length. A more desirable design is to have summary units of variable size. This idea was implemented in the Basic Elements (BE) framework (Hovy et al., 2005) which has not been completed due to its lack of support for paraphrase matching. Both ROUGE and BE have been shown to correlate well with past DUC human summary judgments, despite incorporating only lexical matching on summary units (Lin and Hovy, 2003; Hovy et al., 2005).</Paragraph>
  </Section>
class="xml-element"></Paper>