<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1057">
  <Title>A Noisy-Channel Model for Document Compression</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Single-document summarization systems proposed to date fall into one of three classes. Extractive summarizers simply select and present to the user the most important sentences in a text -- see (Mani and Maybury, 1999; Marcu, 2000; Mani, 2001) for comprehensive overviews of the methods and algorithms used to accomplish this.</Paragraph>
    <Paragraph position="1"> Headline generators are noisy-channel probabilistic systems that are trained on large corpora of ⟨Headline, Text⟩ pairs (Banko et al., 2000; Berger and Mittal, 2000). These systems produce short sequences of words that are indicative of the content of the input text. Sentence simplification systems (Chandrasekar et al., 1996; Mahesh, 1997; Carroll et al., 1998; Grefenstette, 1998; Jing, 2000; Knight and Marcu, 2000) compress long sentences by deleting unimportant words and phrases.</Paragraph>
    <Paragraph position="2"> Extraction-based summarizers often produce outputs that contain non-important sentence fragments. For example, the hypothetical extractive summary of Text (1), which is shown in Table 1, can be compacted further by deleting the clause &amp;quot;which is already almost enough to win&amp;quot;. Headline-based summaries, such as that shown in Table 1, are usually indicative of a text's content, but they are not informative, grammatical, or coherent. By repeatedly applying a sentence-simplification algorithm one sentence at a time, one can compress a text; yet, the outputs generated in this way are likely to be incoherent and to contain unimportant information. When summarizing text, some sentences should be dropped altogether.</Paragraph>
    <Paragraph position="3"> Ideally, we would like to build systems that combine the strengths of all three classes of approaches. The &amp;quot;Document Compression&amp;quot; entry in Table 1 shows a grammatical, coherent summary of Text (1), generated by a hypothetical document compression system that preserves the most important information in a text while deleting sentences, phrases, and words that are subsidiary to its main message. Obviously, generating coherent, grammatical summaries such as that produced by the hypothetical document compression system in Table 1 is not trivial because of many conflicting goals.</Paragraph>
    <Paragraph position="4"> Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 449-456.
Table 1. Hypothetical outputs of the four classes of summarizers for Text (1). A ✓ marks the properties each output satisfies; the column placement of the marks is reconstructed from the surrounding prose.
Type of summarizer | Hypothetical output | Only important info | Coherent | Grammatical
Extractive summarizer | John Doe has already secured the vote of most democrats in his constituency, which is already almost enough to win. But without the support of the governor, he is still on shaky ground. | | | ✓
Headline generator | mayor vote constituency governor | ✓ | |
Sentence simplifier | The mayor is now looking for re-election. John Doe has already secured the vote of most democrats in his constituency. He is still on shaky ground. | | | ✓
Document compressor | John Doe has secured the vote of most democrats. But he is still on shaky ground. | ✓ | ✓ | ✓</Paragraph>
    <Paragraph position="5"> The deletion of certain sentences may result in incoherence and information loss. The deletion of certain words and phrases may also lead to ungrammaticality and information loss.</Paragraph>
    <Paragraph position="6"> Text (1): The mayor is now looking for re-election. John Doe has already secured the vote of most democrats in his constituency, which is already almost enough to win.</Paragraph>
    <Paragraph position="7"> But without the support of the governor, he is still on shaky ground.</Paragraph>
    <Paragraph position="9"> In this paper, we present a document compression system that uses hierarchical models of discourse and syntax to manage all these conflicting goals simultaneously. Our compression system first automatically derives the syntactic structure of each sentence and the overall discourse structure of the input text. The system then uses a statistical hierarchical model of text production to drop unimportant syntactic and discourse units, generating coherent, grammatical document compressions of arbitrary length. The system outperforms both a baseline and a sentence-based compression system that operates by sequentially simplifying all sentences in a text.</Paragraph>
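The core idea described above -- represent the document as a hierarchy of discourse and syntactic units, then drop the unimportant ones -- can be illustrated with a toy sketch. The tree shape, the fixed importance scores, and the threshold below are all invented for illustration; the paper's actual system learns which units to drop from data via a trained noisy-channel model rather than hand-assigned scores.

```python
# Toy sketch only: compress a document by deleting subtrees of a
# hierarchical (discourse/syntactic) structure whose importance score
# falls below a threshold. Scores here are hand-assigned assumptions;
# the real system estimates them with a statistical model.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Unit:
    text: str                      # words covered by this unit ("" for internal nodes)
    importance: float              # assumed score; learned in the actual model
    children: List["Unit"] = field(default_factory=list)


def compress(unit: Unit, threshold: float) -> str:
    """Recursively drop any subtree whose importance is below the threshold."""
    if unit.importance < threshold:
        return ""                  # delete this whole discourse/syntactic unit
    if not unit.children:
        return unit.text           # keep the leaf's words
    parts = [compress(child, threshold) for child in unit.children]
    return " ".join(p for p in parts if p)


# A toy discourse tree loosely mirroring Text (1) from the introduction.
doc = Unit("", 1.0, [
    Unit("The mayor is now looking for re-election.", 0.3),
    Unit("", 0.9, [
        Unit("John Doe has secured the vote of most democrats.", 0.9),
        Unit("which is already almost enough to win.", 0.2),
    ]),
    Unit("But he is still on shaky ground.", 0.8),
])

print(compress(doc, 0.5))
# -> John Doe has secured the vote of most democrats. But he is still on shaky ground.
```

Because the deletion decisions are made over a tree rather than sentence by sentence, a single pass can remove whole sentences (discourse units) and embedded clauses (syntactic units) at the same time, which is what distinguishes this approach from the sentence-at-a-time simplifiers discussed earlier.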
  </Section>
</Paper>