<?xml version="1.0" standalone="yes"?>
<Paper uid="J98-3005">
  <Title>Generating Natural Language Summaries from Multiple On-Line Sources</Title>
  <Section position="4" start_page="471" end_page="475" type="relat">
    <SectionTitle>
2. Related Work
</SectionTitle>
    <Paragraph position="0"> Previous work related to summarization falls into three main categories. In the first, full text is accepted as input and some percentage of the text is produced as output.</Paragraph>
    <Paragraph position="1"> Typically, statistical approaches, augmented with keyword or phrase matching, are used to lift from the article full sentences that can serve as a summary. Most of the work in this category produces a summary for a single article, although there are a few exceptions. The other two categories correspond to the two stages of processing that would have to be carried out if sentence extraction were not used: analysis of the input document to identify information that should appear in a summary and  Radev and McKeown Generating Natural Language Summaries generation of a textual summary from a set of facts that are to be included. In this section, we first present work on sentence extraction, next turn to work on identifying information in an article that should appear in a summary, and conclude with work on generation of summaries from data, showing how this task differs from the more general language generation task.</Paragraph>
    <Paragraph position="2"> This is a systems-oriented perspective on summarization-related work focusing on techniques that have been implemented for the task. There is also a large body of work on the nature of abstracting from a library science point of view (Borko 1975). This work distinguishes between different types of abstracts, most notably, indicative abstracts that tell what an article is about, and informative abstracts, that include major results from the article and can be read in place of it. SUMMONS generates summaries that are informative in nature. Research in psychology and education also focuses on how to teach people to write summaries (e.g., Endres-Niggemeyer 1993; Rothkegel 1993). This type of work can aid the development of summarization systems by providing insights into the human process of summarization that could be simulated in systems.</Paragraph>
    <Section position="1" start_page="472" end_page="473" type="sub_section">
      <SectionTitle>
2.1 Summarization through Sentence Extraction
</SectionTitle>
      <Paragraph position="0"> To allow summarization in arbitrary domains, researchers have traditionally applied statistical techniques (Luhn 1958; Paice 1990; Preston and Williams 1994; Rau, Brandow, and Mitze 1994). This approach can be better termed extraction rather than summarization, since it attempts to identify and extract key sentences from an article using statistical techniques that locate important phrases using various statistical measures.</Paragraph>
      <Paragraph position="1"> This has been successful in different domains (Preston and Williams 1994) and is, in fact, the approach used in recent commercial summarizers (Apple \[Boguraev and Kennedy 1997\], Microsoft, and inXight). Rau, Brandow, and Mitze (1994) report that statistical summaries of individual news articles were rated lower by evaluators than summaries formed by simply using the lead sentence or two from the article. This follows the principle of the &amp;quot;inverted pyramid&amp;quot; in news writing, which puts the most salient information in the beginning of the article and leaves elaborations for later paragraphs, allowing editors to cut from the end of the text without compromising the readability of the remaining text.</Paragraph>
      <Paragraph position="2"> Paice (1990) also notes that problems for this approach center around the fluency of the resulting summary. For example, extracted sentences may accidentally include pronouns that have no previous reference in the extracted text or, in the case of extracting several sentences, may result in incoherent text when the extracted sentences are not consecutive in the original text and do not naturally follow one another. Paice describes techniques for modifying the extracted text to replace unresolved references.</Paragraph>
      <Paragraph position="3"> Summaries that consist of sentences plucked from texts have been shown to be useful indicators of content, but they are often judged to be highly unreadable (Brandow, Mitze, and Rau 1990).</Paragraph>
      <Paragraph position="4"> A more recent approach (Kupiec, Pedersen, and Chen 1995) uses a corpus of articles with summaries to train a statistical summarization system. During training, the system uses abstracts of existing articles to identify the features of sentences that are typically included in abstracts. In order to avoid problems noted by Paice, the system produces an itemized list of sentences from the article thus eliminating the implication that these sentences function together coherently as a full paragraph. As with the other statistical approaches, this work is aimed at summarization of single articles.</Paragraph>
      <Paragraph position="5"> Work presented at the 1997 ACL Workshop on Intelligent Scalable Text Summarization primarily focused on the use of sentence extraction. Alternatives to the use  Computational Linguistics Volume 24, Number 3 of frequency of key phrases included the identification and representation of lexical chains (Halliday and Hasan 1976) to find the major themes of an article followed by the extraction of one or two sentences per chain (Barzilay and Elhadad 1997), training over the position of summary sentences in the full article (Hovy and Lin 1997), and the construction of a graph of important topics to identify paragraphs that should be extracted (Mitra, Singhal, and Buckley 1997).</Paragraph>
      <Paragraph position="6"> While most of the work in this category focuses on summarization of single articles, early work is beginning to emerge on summarization across multiple documents. In ongoing work at Carnegie Mellon, Carbonell (personal communication) is developing statistical techniques to identify similar sentences and phrases across articles. The aim is to identify sentences that are representative of more than one article.</Paragraph>
      <Paragraph position="7"> Mani and Bloedorn (1997) link similar words and phrases from a pair of articles using WordNet (Miller et al. 1990) semantic relations. They show extracted sentences from the two articles side by side in the output.</Paragraph>
      <Paragraph position="8"> While useful in general sentence extraction approaches cannot handle the task that we address, aggregate summarization across multiple documents, since this requires reasoning about similarities and differences across documents to produce generalizations or contradictions at a conceptual level.</Paragraph>
    </Section>
    <Section position="2" start_page="473" end_page="474" type="sub_section">
      <SectionTitle>
2.2 Identifying Information in Input Articles
</SectionTitle>
      <Paragraph position="0"> Work in summarization using symbolic techniques has tended to focus more on identifying information in text that can serve as a summary (Young and Hayes 1985; Rau 1988; Hahn 1990) than on generating the summar~ and often relies heavily on domain-dependent scripts (DeJong 1979; Tait 1983). The DARPA message understanding systems (MUC 1992), which process news articles in specific domains to extract specified types of information, also fall within this category. As output, work of this type produces templates that identify important pieces of information in the text, rep:resenting them as attribute-value pairs that could be part of a database entry. The :message understanding systems, in particular, have been developed over a long period, have undergone repeated evaluation and development, including moves to new domains, and as a result, are quite robust. They are impressive in their ability to handle large quantities of free-form text as input. As stand-alone systems, however, they do not address the task of summarization since they do not combine and rephrase extracted information as part of a textual summary.</Paragraph>
      <Paragraph position="1"> A recent approach to symbolic summarization is being carried out at Cambridge University on identifying strategies for summarization (Sparck Jones 1993). This work studies how various discourse processing techniques (e.g., rhetorical structure relations) can be used to both identify important information and form the actual summary. While promising, this work does not involve an implementation as of yet, but provides a framework and strategies for future work. Marcu (1997) uses a rhetorical parser to build rhetorical structure trees for arbitrary texts and produces a summary by extracting sentences that span the major rhetorical nodes of the tree.</Paragraph>
      <Paragraph position="2"> In addition to domain-specific information extraction systems, there has also been a large body of work on identifying people and organizations in text through proper noun extraction. These are domain-independent techniques that can also be used to extract information for a summary. Techniques for proper noun extraction include the use of regular grammars to delimit and identify proper nouns (Mani et al. 1993; Paik et al. 1994), the use of extensive name lists, place names, titles and &amp;quot;gazetteers&amp;quot; in conjunction with partial grammars in order to recognize proper nouns as unknown words in close proximity to known words (Cowie et al. 1992; Aberdeen et al. 1992), statistical training to learn, for example, Spanish names, from on-line corpora (Ayuso  Radev and McKeown Generating Natural Language Summaries et al. 1992), and the use of concept-based pattern matchers that use semantic concepts as pattern categories as well as part-of-speech information (Weischedel et al. 1993; Lehnert et al. 1993). In addition, some researchers have explored the use of both local context surrounding the hypothesized proper nouns (McDonald 1993; Coates-Stephens 1991) and the larger discourse context (Mani et al. 1993) to improve the accuracy of proper noun extraction when large known-word lists are not available. In a way similar to this research, our work also aims at extracting proper nouns without the aid of large word lists. We use a regular grammar encoding part-of-speech categories to extract certain text patterns (descriptions) and we use WordNet (Miller et al. 1990) to provide semantic filtering.</Paragraph>
      <Paragraph position="3"> Another system, called MURAX (Kupiec 1993), is similar to ours from a different perspective. MURAX also extracts information from a text to serve directly in response to a user question. MURAX uses lexicosyntactic patterns, collocational analysis, along with information retrieval statistics, to find the string of words in a text that is most likely to serve as ,an answer to a user's wh-query. Ultimately, this approach could be used to extract information on items of interest in a user profile, where each question may represent a different point of interest. In our work, we also reuse strings (i.e., descriptions) as part of the summar34 but the string that is extracted may be merged, or regenerated, as part of a larger textual summary.</Paragraph>
    </Section>
    <Section position="3" start_page="474" end_page="475" type="sub_section">
      <SectionTitle>
2.3 Summary Generation
</SectionTitle>
      <Paragraph position="0"> Summarization of data using symbolic techniques has met with more success than summarization of text. Summary generation is distinguished from the more traditional language generation problem by the fact that summarization is concerned with conveying the maximal amount of information within minimal space. This goal is achieved through two distinct subprocesses, conceptual and linguistic summarization. Conceptual summarization is a form of content selection. It must determine which concepts out of a large number of concepts in the input should be included in the summary.</Paragraph>
      <Paragraph position="1"> Linguistic summarization is concerned with expressing that information in the most concise way possible.</Paragraph>
      <Paragraph position="2"> We have worked on the problem of summarization of data within the context of three separate systems. STREAK (Robin and McKeown 1993; Robin 1994; Robin and McKeown 1995) generates summaries of basketball games, using a revision-based approach to summarization. It builds a first draft using fixed information that must appear in the summary (e.g., in basketball summaries, the score and who won and lost is always present). In a second pass, it uses revision rules to opportunistically add in information, as allowed by the form of the existing text. Using this approach, information that might otherwise appear as separate sentences gets added in as modifiers of the existing sentences, or new words that can simultaneously convey both pieces of information are selected. PLANDoc (McKeown, Kukich, and Shaw 1994a; McKeown, Robin, and Kukich 1995; Shaw 1995) generates summaries of the activities of telephone planning engineers, using linguistic summarization both to order its input messages and to combine them into single sentences. Focus has been on the combined use of conjunction, ellipsis, and paraphrase to result in concise, yet fluent reports (Shaw 1995). ZEDDoc (Passonneau et al. 1997; Kukich et al. 1997) generates Web traffic summaries for advertisement management software. It makes use of an ontology over the domain to combine information at the conceptual level.</Paragraph>
      <Paragraph position="3"> All of these systems take tabular data as input. The research focus has been on linguistic summarization. SUMMONS, on the other hand, focuses on conceptual summarization of both structured and full-text data.</Paragraph>
      <Paragraph position="4"> At least four previous systems developed elsewhere use natural language to sum- null Computational Linguistics Volume 24, Number 3 marize quantitative data, including ANA (Kukich 1983), SEMTEX (R6sner 1987), FOG (Bourbeau et al. 1990), and LFS (Iordanskaja et al. 1994). All of these use some forms of conceptual and linguistic summarization and the techniques can be adapted for our current work on summarization of multiple articles. In related work, Dalianis and Hovy (1993) have also looked at the problem of summarization, identifying eight aggregation operators (e.g., conjunction around noun phrases) that apply during generation to create more concise text.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>