<?xml version="1.0" standalone="yes"?> <Paper uid="J05-3002"> <Title>Sentence Fusion for Multidocument News Summarization</Title> <Section position="2" start_page="0" end_page="298" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> Redundancy in large text collections, such as the Web, creates both problems and opportunities for natural language systems. On the one hand, the presence of numerous sources conveying the same information causes difficulties for end users of search engines and news providers, who must read the same information over and over again.</Paragraph>
<Paragraph position="1"> On the other hand, redundancy can be exploited to identify important and accurate information for applications such as summarization and question answering (Mani and Bloedorn 1997; Radev and McKeown 1998; Radev, Prager, and Samn 2000; Clarke, Cormack, and Lynam 2001; Dumais et al. 2002; Chu-Carroll et al. 2003). Clearly, it would be highly desirable to have a mechanism that could identify common information among multiple related documents and fuse it into a coherent text. In this article, we present a method for sentence fusion that exploits redundancy to achieve this task in the context of multidocument summarization.</Paragraph>
<Paragraph position="2"> A straightforward approach to approximating sentence fusion can be found in the use of sentence extraction for multidocument summarization (Carbonell and Goldstein 1998; Radev, Jing, and Budzikowska 2000; Marcu and Gerber 2001; Lin and Hovy 2002). Once a system finds a set of sentences that convey similar information (e.g., by clustering), one of these sentences is selected to represent the set (a toy sketch of this selection step is given below). This is a robust approach that is guaranteed to output a grammatical sentence. However, extraction is only a coarse approximation of fusion. An extracted sentence may include not only the common information but also information specific to the article from which it came, leading to source bias and aggravating fluency problems in the extracted summary. Attempting to solve this problem by including more sentences to restore the original context might lead to a verbose and repetitive summary.</Paragraph>
<Paragraph position="7"> Instead, we want a fine-grained approach that can identify only those pieces of the sentences that are common. Language generation offers an appealing approach to the problem, but the use of generation in this context raises significant research challenges. In particular, generation for sentence fusion must operate in a domain-independent fashion and scale to handle a large variety of input documents with varying degrees of overlap. In the past, generation systems were developed for limited domains and required a rich semantic representation as input. In contrast, this task requires text-to-text generation: the ability to produce a new text given a set of related texts as input. If language generation can be scaled to take fully formed text as input, without semantic interpretation, and to select content and produce well-formed English sentences as output, then generation has a large potential payoff.</Paragraph>
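As a point of reference for the extraction-based approximation described above, the following is a minimal sketch of the selection step it involves: given a cluster of sentences judged to convey similar information, choose the sentence that is most similar, on average, to the rest. The bag-of-words cosine similarity and the centroid-style selection rule are illustrative assumptions for this sketch, not the method of MultiGen or of the extraction systems cited above.

# Toy sketch of extraction-style "fusion": from a cluster of similar
# sentences, select the one closest on average to the others.
# The similarity measure and selection rule are illustrative assumptions.
from collections import Counter
from math import sqrt

def bag_of_words(sentence):
    """Lower-cased bag-of-words vector for a sentence."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a).intersection(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_representative(cluster):
    """Index of the sentence with the highest average similarity to the rest."""
    vectors = [bag_of_words(s) for s in cluster]
    def average_similarity(i):
        scores = [cosine(vectors[i], vectors[j]) for j in range(len(vectors)) if j != i]
        return sum(scores) / len(scores)
    return max(range(len(vectors)), key=average_similarity)

cluster = [
    "The storm hit the coast on Monday and caused widespread flooding.",
    "Widespread flooding followed the storm that hit the coast on Monday.",
    "A powerful storm struck the coastline on Monday, flooding several towns.",
]
print(cluster[select_representative(cluster)])

Whichever sentence is selected in this way still carries material specific to its source article; that is exactly the coarseness that motivates the fine-grained, generation-based approach developed in this article.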
<Paragraph position="8"> In this article, we present the concept of sentence fusion, a novel text-to-text generation technique which, given a set of similar sentences, produces a new sentence containing the information common to most sentences in the set. The research challenges in developing such an algorithm lie in two areas: identifying the fragments that convey the common information and combining those fragments into a sentence.</Paragraph>
<Paragraph position="9"> To identify common information, we have developed a method for aligning the syntactic trees of the input sentences, incorporating paraphrasing information. Our alignment problem poses unique challenges: we want to match only a subset of the subtrees in each sentence, and we are given few constraints on permissible alignments (e.g., arising from constituent ordering or start and end points). Our algorithm meets these challenges through bottom-up local multisequence alignment, using words and paraphrases as anchors. Combination of fragments is addressed through the construction of a fusion lattice encompassing the resulting alignment and the linearization of the lattice into a sentence using a language model. Our approach to sentence fusion thus integrates robust statistical techniques, such as local multisequence alignment and language modeling, with linguistic representations automatically derived from the input documents. Sentence fusion is a significant first step toward the generation of abstracts, as opposed to extracts (Borko and Bernier 1975), for multidocument summarization. Unlike extraction methods (used by the vast majority of summarization researchers), sentence fusion allows for the true synthesis of information from a set of input documents. It has been shown that combining information from several sources is a natural strategy for multidocument summarization: analysis of human-written summaries reveals that most of their sentences combine information drawn from multiple documents (Banko and Vanderwende 2004). Sentence fusion achieves this goal automatically. Our evaluation shows that our approach is promising, with sentence fusion outperforming sentence extraction on the task of content selection.</Paragraph>
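As a rough illustration of the final stage just described, the sketch below linearizes a tiny word lattice by enumerating its paths and keeping the one with the highest score under a toy bigram language model. The lattice, the probability table, and the node labels are invented for the example; the actual fusion lattice is built from aligned syntactic trees, and the language model used for linearization is far richer than this toy.

# Toy sketch of lattice linearization with a language model: enumerate the
# paths through a small word lattice and keep the one with the highest
# bigram log-probability. All values below are invented for illustration.
from math import log

START, END = "BOS", "EOS"

# Lattice as an adjacency list over (node_id, word) vertices; the branch at
# node 1 represents competing realizations contributed by different inputs.
lattice = {
    (0, START): [(1, "the")],
    (1, "the"): [(2, "storm"), (3, "hurricane")],
    (2, "storm"): [(4, "flooded")],
    (3, "hurricane"): [(4, "flooded")],
    (4, "flooded"): [(5, "the")],
    (5, "the"): [(6, "coast")],
    (6, "coast"): [(7, END)],
}

# Toy bigram probabilities; unseen bigrams receive a small floor value.
bigram_prob = {
    (START, "the"): 0.4, ("the", "storm"): 0.05, ("the", "hurricane"): 0.01,
    ("storm", "flooded"): 0.1, ("hurricane", "flooded"): 0.1,
    ("flooded", "the"): 0.3, ("the", "coast"): 0.04, ("coast", END): 0.5,
}
FLOOR = 1e-6

def score(words):
    """Bigram log-probability of a word sequence, including boundary symbols."""
    return sum(log(bigram_prob.get(pair, FLOOR)) for pair in zip(words, words[1:]))

def best_path(node=(0, START)):
    """Return the highest-scoring word sequence from this node to the end."""
    if node[1] == END:
        return [END]
    candidates = [[node[1]] + best_path(successor) for successor in lattice[node]]
    return max(candidates, key=score)

words = best_path()
print(" ".join(words[1:-1]))  # -> "the storm flooded the coast"

The point of the example is the division of labor described above: alternative realizations contributed by different input sentences become competing branches in the lattice, and the language model arbitrates among them when the lattice is linearized into a single output sentence.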
<Paragraph position="10"> This article focuses on the implementation and evaluation of the sentence fusion method within the multidocument summarization system MultiGen, which summarizes multiple news articles on the same event daily as part of Columbia's news browsing system Newsblaster (http://newsblaster.cs.columbia.edu/). In the next section, we provide an overview of MultiGen, focusing on the components that produce the input to, or operate on the output of, sentence fusion. In Section 3, we give an overview of our fusion algorithm and detail its main steps: identification of common information (Section 3.1), fusion lattice computation (Section 3.2), and lattice linearization (Section 3.3). Evaluation results and their analysis are presented in Section 4. Analysis of the system's output reveals the capabilities and weaknesses of our text-to-text generation method and identifies interesting challenges that will require new insights. An overview of related work and a discussion of future directions conclude the article.</Paragraph>
</Section> </Paper>