<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0402">
  <Title>Selecting Sentences for Multidocument Summaries using Randomized Local Search</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Improving the intellibility of multidocument summaries remains a significant challenge.</Paragraph>
    <Paragraph position="1"> While most previous approaches to multidocument summarization have addressed the problem of reducing repetition, less attention has been paid to problems of coherence and cohesion. In a typical extractive system (e.g.</Paragraph>
    <Paragraph position="2"> Goldstein et al. (2000)), sentences are selected for inclusion in the summary one at a time, with later choices sensitive to their similarity to earlier ones; the selected sentences are then ordered either chronologically or by relevance.</Paragraph>
    <Paragraph position="3"> The resulting summaries often jump incoherently from topic to topic, and contain broken cohesive links, such as dangling anaphors or unmet presuppositions.</Paragraph>
    <Paragraph position="4"> Barzilay et al. (2001) present an improved method of ordering sentences in the context of MultiGen, a multidocument summarizer that identifies sets of similar sentences, termed themes, and reformulates their common phrases as new text. In their approach, topically related themes are identified and kept together in the resulting summary, in order to help improve cohesion and reduce topic switching.</Paragraph>
    <Paragraph position="5"> In this paper, we pursue a related but simpler idea in an extractive context, namely to favor the selection of blocks of adjacent sentences in constructing a multidocument summary. Here, the challenge is to improve intelligibility without unduly sacrificing informativeness; for example, selecting the beginning of the most recent article in a document set will usually produce a highly intelligible text, but one that is not very representative of the document set as a whole.</Paragraph>
    <Paragraph position="6"> To manage this tradeoff, we have developed a randomized local search procedure (cf. Selman and Kautz (1994)) to select the highest ranking set of sentences for the summary, where the inclusion of adjacent sentences is favored and the selection of repetitive material is penalized. The method involves greedily searching for the best combination of sentences to swap in and out of the current summary until no more improvements are possible; noise strategies include occasionally adding a sentence to the current summary, regardless of its score, and restarting the local search from random starting points for a fixed number of iterations. In determining sentence similarity, we have used surface-oriented similarity measures obtained from Columbia's SimFinder tool (Hatzivassiloglou et al., 2001), as well as semantic groups obtained from merging the output templates of an information extraction (IE) subsystem.</Paragraph>
    <Paragraph position="7"> In related work, Marcu (2001) describes an approach to balancing informativeness and intelligibility that also involves searching through Philadelphia, July 2002, pp. 9-18. Association for Computational Linguistics. Proceedings of the Workshop on Automatic Summarization (including DUC 2002), sets of sentences to select. In contrast to our approach, Marcu employs a beam search through possible summaries of progressively greater length, which seems less amenable to an anytime formulation; this may be an important practical consideration, since Marcu reports search times in hours, whereas we have found that less than a minute of searching is usually effective. In other related work, Lin and Hovy (2002) suggest pairing extracted sentences with their corresponding lead sentences; we have not directly compared our search-based approach to Lin and Hovy's simpler method.</Paragraph>
    <Paragraph position="8"> In order to evaluate our approach, we compared 200-word summaries generated by our system to those of two baselines that are similar to those used in DUC 2001 (Harman, 2001), and to three simpler versions of the system, where a simple marginal relevance selection procedure was used instead of the selection search, and/or the IE groups were ignored. In general, we found that our randomized local search method provided substantial improvements in both content and intelligibility over the DUC-like baselines and the simplest variant of our system, which used marginal relevance selection and no IE groups (with the exception that the last article baseline was always ranked first in intelligibility). The use of the IE groups also appeared to contribute a small further improvement in content when used with our selection search.</Paragraph>
    <Paragraph position="9"> We discuss these results in greater detail in the final section of the paper.</Paragraph>
  </Section>
class="xml-element"></Paper>