
<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-4004">
  <Title>Induction of Word and Phrase Alignments for Automatic Document Summarization</Title>
  <Section position="2" start_page="0" end_page="508" type="abstr">
    <SectionTitle>
1. Introduction and Motivation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="506" type="sub_section">
      <SectionTitle>
1.1 Motivation
</SectionTitle>
      <Paragraph position="0"> We believe that future success in automatic document summarization will be made possible by the combination of complex, linguistically motivated models and effective leveraging of data. Current research in summarization makes a choice between these two: one either develops sophisticated, domain-specific models that are subsequently hand-tuned without the aid of data, or one develops na&amp;quot;ive general models that can [?] 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292. Email: {hdaume,marcu}@isi.edu.</Paragraph>
      <Paragraph position="1"> Submission received: 12 January 2005; revised submission received: 3 May 2005; accepted for publication: 27 May 2005.</Paragraph>
      <Paragraph position="2"> (c) 2006 Association for Computational Linguistics Computational Linguistics Volume 31, Number 4 Figure 1 Example alignment of a single abstract sentence with two document sentences. be trained on large amounts of data (in the form of corpora of document/extract or document/headline pairs). One reason for this is that currently available technologies are only able to extract very coarse and superficial information that is inadequate for training complex models. In this article, we propose a method to overcome this problem: automatically generating word-to-word and phrase-to-phrase alignments between documents and their human-written abstracts.</Paragraph>
      <Paragraph position="3">  To facilitate discussion and to motivate the problem, we show in Figure 1 a relatively simple alignment between a document fragment and its corresponding abstract fragment from our corpus.</Paragraph>
      <Paragraph position="4">  In this example, a single abstract sentence (shown along the top of the figure) corresponds to exactly two document sentences (shown along the bottom of the figure). If we are able to automatically generate such alignments, one can envision the development of models of summarization that take into account effects of word choice, phrasal and sentence reordering, and content selection. Such models could be simultaneously linguistically motivated and data-driven. Furthermore, such alignments are potentially useful for current-day summarization techniques, including sentence extraction, headline generation, and document compression.</Paragraph>
      <Paragraph position="5"> A close examination of the alignment shown in Figure 1 leads us to three observations about the nature of the relationship between a document and its abstract, and hence about the alignment itself: a114 Alignments can occur at the granularity of words and of phrases.</Paragraph>
      <Paragraph position="6"> a114 The ordering of phrases in an abstract can be different from the ordering of phrases in the document.</Paragraph>
      <Paragraph position="7"> a114 Some abstract words do not have direct correspondents in the document, and many document words are never used in an abstract.</Paragraph>
      <Paragraph position="8"> In order to develop an alignment model that could recreate such an alignment, we need our model to be able to operate both at the word level and at the phrase level, we need it to be able to allow arbitrary reorderings, and we need it to be able to account for words on both the document and abstract side that have no direct correspondence. In this paper, we develop an alignment model that is capable of learning all these aspects of the alignment problem in a completely unsupervised fashion.</Paragraph>
    </Section>
    <Section position="2" start_page="506" end_page="506" type="sub_section">
      <SectionTitle>
1.2 Shortcomings of Current Summarization Models
</SectionTitle>
      <Paragraph position="0"> Current state-of-the-art automatic single-document summarization systems employ one of three techniques: sentence extraction, bag-of-words headline generation, or document compression. Sentence extraction systems take full sentences from a document and concatenate them to form a summary. Research in sentence extraction can be traced back to work in the mid 1950s and late 1960s by Luhn (1956) and Edmundson (1969). Recent techniques are startlingly not terribly divergent from these original methods; see Mani and Maybury (1999); Marcu (2000); Mani (2001) for a comprehensive overview. Headline generation systems, on the other hand, typically extract individual words from a document to produce a very short headline-style summary; see Banko, Mittal, and Witbrock (2000); Berger and Mittal (2000); Schwartz, Zajic, and Dorr (2002) for representative examples. Between these two extremes, there has been a relatively modest amount of work in sentence simplification (Chandrasekar, Doran, and Bangalore 1996; Mahesh 1997; Carroll et al. 1998; Grefenstette 1998; Jing 2000; Knight and Marcu 2002) and document compression (Daum'e III and Marcu 2002; Daum'e III and Marcu 2004; Zajic, Dorr, and Schwartz 2004) in which words, phrases, and sentences are selected in an extraction process.</Paragraph>
      <Paragraph position="1"> While such approaches have enjoyed some success, they all suffer from modeling shortcomings. Sentence extraction systems and document compression models make unrealistic assumptions about the summarization task (namely, that extraction is sufficient and that sentences are the appropriate level of granularity). Headline generation systems employ very weak models that make an incorrect bag-of-words assumption.</Paragraph>
      <Paragraph position="2"> This assumption allows such systems to learn limited transformations to produce headlines from arbitrary documents, but such transformations are not nearly complex enough to adequately model anything beyond indicative summaries at a length of around 10 words. Bag-of-words models can learn what the most important words to keep in a headline are, but say nothing about how to structure them in a well-formed, grammatical headline.</Paragraph>
      <Paragraph position="3"> In our own work on document compression models (Daum'e III and Marcu 2002; Daum'e III and Marcu 2004), both of which extend the sentence compression model of Knight and Marcu (2002), we assume that sentences and documents can be summarized exclusively through deletion of contiguous text segments. In Knight and Marcu's data, we found that from a corpus of 39,060 abstract sentences, only 1,067 were created from corresponding document sentences via deletion of contiguous segments. In other words, only 2.7% of the sentences in real &lt;document, abstract&gt; pairs can be explained by the model proposed by Knight and Marcu (2002). Such document compression models do not explain the rich set of linguistic devices employed, for example, in</Paragraph>
    </Section>
    <Section position="3" start_page="506" end_page="507" type="sub_section">
      <SectionTitle>
1.3 Prior Work on Alignments
</SectionTitle>
      <Paragraph position="0"> In the sentence extraction community, there exists a wide variety of techniques for (essentially) creating alignments between document sentences and abstract sentences (Kupiec, Pedersen, and Chen 1995; Teufel and Moens 1997; Marcu 1999); see also Barzilay and Elhadad (2003); Quirk, Brockett, and Dolan (2004) for work describing alignments for the monolingual paraphrasing task. These techniques typically take into account information such as lexical overlap, synonymy, ordering, length, discourse structure, and so forth. The sentence alignment problem is a comparatively simple problem to solve, and current approaches work quite well. Unfortunately, these  Computational Linguistics Volume 31, Number 4 alignments are the least useful, because they can only be used to train sentence extraction systems.</Paragraph>
      <Paragraph position="1"> In the context of headline generation, simple statistical models are used for aligning documents and headlines (Banko, Mittal, and Witbrock 2000; Berger and Mittal 2000; Schwartz, Zajic, and Dorr 2002), based on IBM Model 1 (Brown et al. 1993). These models treat documents and headlines as simple bags of words and learn probabilistic word-based mappings between the words in the documents and the words in the headlines. Such mappings can be considered word-to-word alignments, but as our results show (see Section 5), these models are too weak for capturing the sophisticated operations that are employed by humans in summarizing texts.</Paragraph>
      <Paragraph position="2"> To date, there has been very little work on the word alignment task in the context of summarization. The most relevant work is that of Jing (2002), in which a hidden Markov alignment model is applied to the task of identifying word and phrase-level correspondences between documents and abstracts. Unfortunately, this model is only able to align words that are identical up to their stems, and thus suffers from a problem of recall. This also makes it ill-suited to the task of learning how to perform abstraction, in which one would desire to know how words get changed. For example, Jing's model cannot identify any of the following alignments from Figure 1: (Connecting Point Connecting Point Systems), (Mac - Macintosh), (retailer - seller), (Macintosh - Apple Macintosh systems)and(January 1989 - last January).</Paragraph>
      <Paragraph position="3"> Word alignment (and, to a lesser degree, phrase alignment) has been an active topic of research in the machine translation community. Based on these efforts, one might be initially tempted to use readily available alignment models developed in the context of machine translation, such as GIZA++ (Och and Ney 2003), to obtain word-level alignments in &lt;document, abstract&gt; corpora. However, as we will show (Section 5), the alignments produced by such a system are inadequate for the &lt;document, abstract&gt; alignment task.</Paragraph>
    </Section>
    <Section position="4" start_page="507" end_page="508" type="sub_section">
      <SectionTitle>
1.4 Article Structure
</SectionTitle>
      <Paragraph position="0"> In this article, we describe a novel, general model for automatically inducing wordand phrase-level alignments between documents and their human-written abstracts.</Paragraph>
      <Paragraph position="1"> Beginning in Section 2, we will describe the results of human annotation of such alignments. Based on this annotation, we will investigate the empirical linguistic properties of such alignments, including lexical transformations and movement. In Section 3, we will introduce the statistical model we use for deriving such alignments automatically.</Paragraph>
      <Paragraph position="2"> The inference techniques are based on those of semi-Markov models, extensions of hidden Markov models that allow for multiple simultaneous observations.</Paragraph>
      <Paragraph position="3"> After our discussion of the model structure and algorithms, we discuss the various parameterizations we employ in Section 4. In particular, we discuss three distinct models of movement, two of which are well-known in the machine translation alignment literature, and a third one that exploits syntax in a novel, &amp;quot;light&amp;quot; manner. We also discuss several models of lexical rewriting, based on identities, stems, WordNet synonymy, and automatically induced lexical replacements. In Section 5, we present experimental results that confirm that our model is able to learn the hidden structure in our corpus of &lt;document, abstract&gt; pairs. We compare our model against well-known alignment models designed for machine translation as well as a state-of-the-art alignment model specifically designed for summarization (Jing 2002). Additionally, we discuss errors that the model currently makes, supported by some relevant examples and statistics. We conclude with some directions for future research (Section 6).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML