<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1015">
  <Title>Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization</Title>
  <Section position="3" start_page="0" end_page="0" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Knowledge-rich methods Models employing manual crafting of (typically complex) representations of content have generally captured one of three types of knowledge (Rambow, 1990; Kittredge et al., 1991): domain knowledge [e.g., that earthquakes have magnitudes], domain-independent communication knowledge [e.g., that describing an event usually entails specifying its location]; and domain communication knowledge [e.g., that Reuters earthquake reports often conclude by listing previous quakes2]. Formalisms exemplifying each of these knowledge types are DeJong's (1982) scripts, McKeown's (1985) schemas, and Rambow's (1990) domain-specific schemas, respectively.</Paragraph>
    <Paragraph position="1"> In contrast, because our models are based on a distributional view of content, they will freely incorporate information from all three categories as long as such information is manifested as a recurrent pattern. Also, in comparison to the formalisms mentioned above, content models constitute a relatively impoverished representation; but this actually contributes to the ease with which they can be learned, and our empirical results show that they are quite effective despite their simplicity.</Paragraph>
    <Paragraph position="2"> In recent work, Duboue and McKeown (2003) propose a method for learning a content planner from a collection of texts together with a domain-specific knowledge base, but our method applies to domains in which no such knowledge base has been supplied.</Paragraph>
    <Paragraph position="3"> Knowledge-lean approaches Distributional models of content have appeared with some frequency in research on text segmentation and topic-based language modeling (Hearst, 1994; Beeferman et al., 1997; Chen et al., 1998; Florian and Yarowsky, 1999; Gildea and Hofmann, 1999; 2This does not qualify as domain knowledge because it is not about earthquakes per se.</Paragraph>
    <Paragraph position="4"> Iyer and Ostendorf, 1996; Wu and Khudanpur, 2002). In fact, the methods we employ for learning content models are quite closely related to techniques proposed in that literature (see Section 3 for more details).</Paragraph>
    <Paragraph position="5"> However, language-modeling research -- whose goal is to predict text probabilities -- tends to treat topic as a useful auxiliary variable rather than a central concern; for example, topic-based distributional information is generally interpolated with standard, non-topic-based a0 -gram models to improve probability estimates. Our work, in contrast, treats content as a primary entity. In particular, our induction algorithms are designed with the explicit goal of modeling document content, which is why they differ from the standard Baum-Welch (or EM) algorithm for learning Hidden Markov Models even though content models are instances of HMMs.</Paragraph>
  </Section>
</Paper>