<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1015">
  <Title>Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The development and application of computational models of text structure is a central concern in natural language processing. Document-level analysis of text structure is an important instance of such work. Previous research has sought to characterize texts in terms of domain-independent rhetorical elements, such as schema items (McKeown, 1985) or rhetorical relations (Mann and Thompson, 1988; Marcu, 1997). The focus of our work, however, is on an equally fundamental but domain-dependent dimension of the structure of text: content.</Paragraph>
    <Paragraph position="1"> Our use of the term &amp;quot;content&amp;quot; corresponds roughly to the notions of topic and topic change. We desire models that can specify, for example, that articles about earthquakes typically contain information about quake strength, location, and casualties, and that descriptions of casualties usually precede those of rescue efforts. But rather than manually determine the topics for a given domain, we take a distributional view, learning them directly from un-annotated texts via analysis of word distribution patterns. This idea dates back at least to Harris (1982), who claimed that &amp;quot;various types of [word] recurrence patterns seem to characterize various types of discourse&amp;quot;. Advantages of a distributional perspective include both drastic reduction in human effort and recognition of &amp;quot;topics&amp;quot; that might not occur to a human expert and yet, when explicitly modeled, aid in applications.</Paragraph>
    <Paragraph position="2"> Of course, the success of the distributional approach depends on the existence of recurrent patterns. In arbitrary document collections, such patterns might be too variable to be easily detected by statistical means. However, research has shown that texts from the same domain tend to exhibit high similarity (Wray, 2002). Cognitive psychologists have long posited that this similarity is not accidental, arguing that formulaic text structure facilitates readers' comprehension and recall (Bartlett, 1932).1 In this paper, we investigate the utility of domain-specific content models for representing topics and topic shifts. Content models are Hidden Markov Models (HMMs) wherein states correspond to types of information characteristic to the domain of interest (e.g., earthquake magnitude or previous earthquake occurrences), and state transitions capture possible information-presentation orderings within that domain.</Paragraph>
    <Paragraph position="3"> We first describe an efficient, knowledge-lean method for learning both a set of topics and the relations between topics directly from un-annotated documents. Our technique incorporates a novel adaptation of the standard HMM induction algorithm that is tailored to the task of modeling content.</Paragraph>
    <Paragraph position="4"> Then, we apply techniques based on content models to two complex text-processing tasks. First, we consider information ordering, that is, choosing a sequence in which to present a pre-selected set of items; this is an essential step in concept-to-text generation, multi-document summarization, and other text-synthesis problems. In our experiments, content models outperform Lapata's (2003) state-of-the-art ordering method by a wide margin -- for one domain and performance metric, the gap was 78 percentage points. Second, we consider extractive summa- null so automated approaches still offer advantages over manual techniques, especially if one needs to model several domains. rization: the compression of a document by choosing a subsequence of its sentences. For this task, we develop a new content-model-based learning algorithm for sentence selection. The resulting summaries yield 88% match with human-written output, which compares favorably to the 69% achieved by the standard &amp;quot;leading a0 sentences&amp;quot; baseline.</Paragraph>
    <Paragraph position="5"> The success of content models in these two complementary tasks demonstrates their flexibility and effectiveness, and indicates that they are sufficiently expressive to represent important text properties. These observations, taken together with the fact that content models are conceptually intuitive and efficiently learnable from raw document collections, suggest that the formalism can prove useful in an even broader range of applications than we have considered here; exploring the options is an appealing line of future research.</Paragraph>
  </Section>
class="xml-element"></Paper>