<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1025">
  <Title>SUMMARIZATION: (1) USING MMR FOR DIVERSITY-BASED RERANKING AND (2) EVALUATING SUMMARIES</Title>
  <Section position="2" start_page="0" end_page="181" type="abstr">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> With the continuing growth of online information, it has become increasingly important to provide improved mechanisms for finding information quickly. Conventional IR systems rank and assimilate documents based on maximizing relevance to the user query [1, 8, 6, 12, 13]. In cases where relevant documents are few, or where very high recall is necessary, pure relevance ranking is entirely appropriate. But where there is a vast sea of potentially relevant documents, highly redundant with each other or (in the extreme) containing partially or fully duplicative information, we must look beyond pure relevance for document ranking.</Paragraph>
    <Paragraph position="1"> In order to better illustrate the need to combine relevance and anti-redundancy, consider a reporter or a student using a newswire archive collection to research accounts of airline disasters. (This research was performed as part of Carnegie Group Inc.'s Tipster III Summarization Project under the direction of Mark Borger and Alex Kott.)</Paragraph>
    <Paragraph position="2"> He composes a well-thought-out query including "airline crash", "FAA investigation", "passenger deaths", "fire", "airplane accidents", and so on. The IR engine returns a ranked list of the top 100 documents (more if requested), and the user examines the top-ranked document. It's about the suspicious TWA-800 crash near Long Island. Very relevant and useful. The next document is also about TWA-800, so is the next, and so are the following 30 documents. Relevant? Yes. Useful? Decreasingly so.</Paragraph>
    <Paragraph position="3"> Most "new" documents merely repeat information already contained in previously offered ones, and the user may tire long before reaching the first non-TWA-800 air-disaster document. Perfect precision, therefore, may prove insufficient to meet user needs.</Paragraph>
    <Paragraph position="4"> A better document ranking method for this user is one where each document in the ranked list is selected according to a combined criterion of query relevance and novelty of information. The latter measures the degree of dissimilarity between the document being considered and previously selected ones already in the ranked list. Of course, some users may prefer to drill down on a narrow topic, while others may prefer a panoramic sampling of documents bearing relevance to the query. Best is a user-tunable method that can focus the search anywhere from a narrow beam to a floodlight. Maximal Marginal Relevance (MMR) provides precisely such functionality, as discussed below.</Paragraph>
    <Paragraph position="5"> If we consider document summarization by relevant-passage extraction, we must again consider anti-redundancy as well as relevance. Both query-free summaries and query-relevant summaries need to avoid redundancy, as it defeats the purpose of summarization.</Paragraph>
    <Paragraph position="6"> For instance, scholarly articles often state their thesis in the introduction, elaborate upon it in the body, and reiterate it in the conclusion. Including all three versions in the summary, however, leaves little room for other useful information. If we move beyond single-document summarization to document cluster summarization, where the summary must pool passages from different but possibly overlapping documents, reducing redundancy becomes an even more significant problem.</Paragraph>
    <Paragraph position="7"> Automated document summarization dates back to Luhn's work at IBM in the 1950s [12], and evolved through several efforts including Tait [24] and Paice in the 1980s [17, 18]. Much early work focused on the structure of the document to select information. In the 1990s several approaches to summarization blossomed, including trainable methods [10], linguistic approaches [8, 15], and our information-centric method [2], the first to focus on query-relevant summaries and anti-redundancy measures. As part of the TIPSTER program [25], new investigations have started into summary creation using a variety of strategies. These new efforts address query-relevant as well as "generic" summaries and employ a variety of approaches, including co-reference chains (from the University of Pennsylvania) [25], a combination of statistical and linguistic approaches (Smart and Empire, from SaBir Research, Cornell University, and GE R&amp;D Labs), topic identification and interpretation (from ISI), and template-based summarization (from New Mexico State University) [25].</Paragraph>
    <Paragraph position="8"> In this paper, we discuss the Maximal Marginal Relevance method (Section 2), its use for document reranking (Section 3), our approach to query-based single-document summarization (Section 4), and our approaches to long-document (Section 6) and multi-document summarization (Section 6). We also discuss our efforts to evaluate single-document summarization (Sections 7-8) and our preliminary results (Section 9).</Paragraph>
  </Section>
</Paper>