<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1015">
  <Title>Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Model Construction
</SectionTitle>
    <Paragraph position="0"> We employ an iterative re-estimation procedure that alternates between (1) creating clusters of text spans with similar word distributions to serve as representatives of within-document topics, and (2) computing models of word distributions and topic changes from the clusters so derived.3 Formalism preliminaries We treat texts as sequences of pre-defined text spans, each presumed to convey information about a single topic. Specifying text-span length thus defines the granularity of the induced topics. For concreteness, in what follows we will refer to &amp;quot;sentences&amp;quot; rather than &amp;quot;text spans&amp;quot; since that is what we used in our experiments, but paragraphs or clauses could potentially have been employed instead.</Paragraph>
    <Paragraph position="1"> Our working assumption is that all texts from a given domain are generated by a single content model. A content model is an HMM in which each state a1 corresponds to a distinct topic and generates sentences relevant to that topic according to a state-specific language model a2a4a3 -note that standard a0 -gram language models can therefore be considered to be degenerate (single-state) content models. State transition probabilities give the probability of changing from a given topic to another, thereby capturing constraints on topic shifts. We can use the forward algorithm to efficiently compute the generation probability assigned to a document by a content model and the Viterbi algorithm to quickly find the most likely content-model state sequence to have generated a given document; see Rabiner (1989) for details.</Paragraph>
    <Paragraph position="2"> In our implementation, we use bigram language models, so that the probability of an a0 -word sentence a5a7a6 a8a10a9a11a8a13a12a15a14a16a14a17a14a18a8a20a19 being generated by a state  of dummy initial and final states. Section 5.2 describes how the free parameters a31 , a32 , a33a22a34 , and a33a36a35 are chosen. The Athens seismological institute said the temblor's epicenter was located 380 kilometers (238 miles) south of the capital.</Paragraph>
    <Paragraph position="3"> Seismologists in Pakistan's Northwest Frontier Province said the temblor's epicenter was about 250 kilometers (155 miles) north of the provincial capital Peshawar.</Paragraph>
    <Paragraph position="4"> The temblor was centered 60 kilometers (35 miles) north-west of the provincial capital of Kunming, about 2,200 kilometers (1,300 miles) southwest of Beijing, a bureau  cluster, corresponding to descriptions of location.</Paragraph>
    <Paragraph position="6"> a24 is described below.</Paragraph>
    <Paragraph position="7"> Initial topic induction As in previous work (Florian and Yarowsky, 1999; Iyer and Ostendorf, 1996; Wu and Khudanpur, 2002), we initialize the set of &amp;quot;topics&amp;quot;, distributionally construed, by partitioning all of the sentences from the documents in a given domain-specific collection into clusters. First, we create a10 clusters via complete-link clustering, measuring sentence similarity by the cosine metric using word bigrams as features (Figure 1 shows example output).4 Then, given our knowledge that documents may sometimes discuss new and/or irrelevant content as well, we create an &amp;quot;etcetera&amp;quot; cluster by merging together all clusters containing fewer than a11 sentences, on the assumption that such clusters consist of &amp;quot;outlier&amp;quot; sentences. We use a12 to denote the number of clusters that results.</Paragraph>
    <Paragraph position="8"> Determining states, emission probabilities, and transition probabilities Given a set a13 a9a15a14 a13 a12a16a14a18a17a19a17a18a17a19a14 a13a21a20 of a12 clusters, where a13a18a20 is the &amp;quot;etcetera&amp;quot; cluster, we construct a content model with corresponding states a1</Paragraph>
    <Paragraph position="10"> induce the state's sentence-emission probabilities) are estimated using smoothed counts from the corresponding cluster:</Paragraph>
    <Paragraph position="12"> vocabulary. But because we want the insertion state a1a16a20 to model digressions or unseen topics, we take the novel step of forcing its language model to be complementary to those of the other states by setting</Paragraph>
    <Paragraph position="14"> bers and dates are (temporarily) replaced with generic tokens to help ensure that clusters contain sentences describing the same event type, rather than same actual event.</Paragraph>
    <Paragraph position="15"> Note that the contents of the &amp;quot;etcetera&amp;quot; cluster are ignored at this stage.</Paragraph>
    <Paragraph position="16"> Our state-transition probability estimates arise from considering how sentences from the same article are distributed across the clusters. More specifically, for two clusters a13 and a13 a8 , let a61 a21a62a13 a14 a13 a8 a24 be the number of documents in which a sentence from a13 immediately precedes one from a13 a8 , and let a61 a21a50a13a16a24 be the number of documents containing sentences from a13 . Then, for any two states a1</Paragraph>
    <Paragraph position="18"> a12 , we use the following smoothed estimate of the probability of transitioning from a1</Paragraph>
    <Paragraph position="20"> Viterbi re-estimation Our initial clustering ignores sentence order; however, contextual clues may indicate that sentences with high lexical similarity are actually on different &amp;quot;topics&amp;quot;. For instance, Reuters articles about earthquakes frequently finish by mentioning previous quakes. This means that while the sentence &amp;quot;The temblor injured dozens&amp;quot; at the beginning of a report is probably highly salient and should be included in a summary of it, the same sentence at the end of the piece probably refers to a different event, and so should be omitted.</Paragraph>
    <Paragraph position="21"> A natural way to incorporate ordering information is iterative re-estimation of the model parameters, since the content model itself provides such information through its transition structure. We take an EM-like Viterbi approach (Iyer and Ostendorf, 1996): we re-cluster the sentences by placing each one in the (new) cluster a13</Paragraph>
    <Paragraph position="23"> that corresponds to the state a1  most likely to have generated it according to the Viterbi decoding of the training data. We then use this new clustering as the input to the procedure for estimating HMM parameters described above. The cluster/estimate cycle is repeated until the clusterings stabilize or we reach a predefined number of iterations.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Evaluation Tasks
</SectionTitle>
    <Paragraph position="0"> We apply the techniques just described to two tasks that stand to benefit from models of content and changes in topic: information ordering for text generation and information selection for single-document summarization.</Paragraph>
    <Paragraph position="1"> These are two complementary tasks that rely on disjoint model functionalities: the ability to order a set of pre-selected information-bearing items, and the ability to do the selection itself, extracting from an ordered sequence of information-bearingitems a representative subsequence. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Information Ordering
</SectionTitle>
      <Paragraph position="0"> The information-ordering task is essential to many text-synthesis applications, including concept-to-text generation and multi-document summarization; While accounting for the full range of discourse and stylistic factors that influence the ordering process is infeasible in many domains, probabilistic content models provide a means for handling important aspects of this problem. We demonstrate this point by utilizing content models to select appropriate sentence orderings: we simply use a content model trained on documents from the domain of interest, selecting the ordering among all the presented candidates that the content model assigns the highest probability to.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Extractive Summarization
</SectionTitle>
      <Paragraph position="0"> Content models can also be used for single-document summarization. Because ordering is not an issue in this application5, this task tests the ability of content models to adequately represent domain topics independently of whether they do well at ordering these topics.</Paragraph>
      <Paragraph position="1"> The usual strategy employed by domain-specific summarizers is for humans to determine a priori what types of information from the originating documents should be included (e.g., in stories about earthquakes, the number of victims) (Radev and McKeown, 1998; White et al., 2001). Some systems avoid the need for manual analysis by learning content-selection rules from a collection of articles paired with human-authored summaries, but their learning algorithms typically focus on within-sentence features or very coarse structural features (such as position within a paragraph) (Kupiec et al., 1999).</Paragraph>
      <Paragraph position="2"> Our content-model-based summarization algorithm combines the advantages of both approaches; on the one hand, it learns all required information from un-annotated document-summary pairs; on the other hand, it operates on a more abstract and global level, making use of the topical structure of the entire document.</Paragraph>
      <Paragraph position="3"> Our algorithm is trained as follows. Given a content model acquired from the full articles using the method described in Section 3, we need to learn which topics (represented by the content model's states) should appear in our summaries. Our first step is to employ the Viterbi algorithm to tag all of the summary sentences and all of the sentences from the original articles with a Viterbi topic label, or V-topic -- the name of the state most likely to have generated them. Next, for each state a1 such that at least three full training-set articles contained V-topic a1 , we compute the probability that the state generates sentences that should appear in a summary. This probability is estimated by simply (1) counting the number of document-summary pairs in the parallel training data such that both the originating document and the summary contain sentences assigned V-topic a1 , and then (2) normalizing this count by the number of full articles containing sentences with V-topic a1 .</Paragraph>
      <Paragraph position="4"> 5Typically, sentences in a single-document summary follow the order of appearance in the original document.</Paragraph>
      <Paragraph position="5">  cabulary size and type/token ratio are computed after replacement of proper names, numbers and dates.</Paragraph>
      <Paragraph position="6"> To produce a length-a0 summary of a new article, the algorithm first uses the content model and Viterbi decoding to assign each of the article's sentences a V-topic. Next, the algorithm selects those a0 states, chosen from among those that appear as the V-topic of one of the article's sentences, that have the highest probability of generating a summary sentence, as estimated above. Sentences from the input article corresponding to these states are placed in the output summary.6</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Data
</SectionTitle>
      <Paragraph position="0"> For evaluation purposes, we created corpora from five domains: earthquakes, clashes between armies and rebel groups, drug-related criminal offenses, financial reports, and summaries of aviation accidents.7 Specifically, the first four collections consist of AP articles from the North American News Corpus gathered via a TDT-style document clustering system. The fifth consists of narratives from the National Transportation Safety Board's database previously employed by Jones and Thompson (2003) for event-identification experiments. For each such set, 100 articles were used for training a content model, 100 articles for testing, and 20 for the development set used for parameter tuning. Table 1 presents information about article length (measured in sentences, as determined by the sentence separator of Reynar and Ratnaparkhi (1997)), vocabulary size, and token/type ratio for each domain.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> Our training algorithm has four free parameters: two that indirectly control the number of states in the induced content model, and two parameters for smoothing bigram probabilities. All were tuned separately for each domain on the corresponding held-out development set using Powell's grid search (Press et al., 1997). The parameter values were selected to optimize system performance  on the information-ordering task8. We found that across all domains, the optimal models were based on &amp;quot;sharper&amp;quot; language models (e.g., a32 a9 a23a1a0 a17a0a2a0a3a0a2a0a3a0a2a0 a39 ). The optimal number of states ranged from 32 to 95.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Ordering Experiments
5.3.1 Metrics
</SectionTitle>
      <Paragraph position="0"> The intent behind our ordering experiments is to test whether content models assign high probability to acceptable sentence arrangements. However, one stumbling block to performing this kind of evaluation is that we do not have data on ordering quality: the set of sentences from an a4 -sentence document can be sequenced in a4a6a5 different ways, which even for a single text of moderate length is too many to ask humans to evaluate. Fortunately, we do know that at least the original sentence order (OSO) in the source document must be acceptable, and so we should prefer algorithms that assign it high probability relative to the bulk of all the other possible permutations. This observation motivates our first evaluation metric: the rank received by the OSO when all permutations of a given document's sentences are sorted by the probabilities that the model under consideration assigns to them. The best possible rank is 0, and the worst is a4a6a5 a40 a39 .</Paragraph>
      <Paragraph position="1"> An additional difficulty we encountered in setting up our evaluation is that while we wanted to compare our algorithms against Lapata's (2003) state-of-the-art system, her method doesn't consider all permutations (see below), and so the rank metric cannot be computed for it.</Paragraph>
      <Paragraph position="2"> To compensate, we report the OSO prediction rate, which measures the percentage of test cases in which the model under consideration gives highest probability to the OSO from among all possible permutations; we expect that a good model should predict the OSO a fair fraction of the time. Furthermore, to provide some assessment of the quality of the predicted orderings themselves, we follow Lapata (2003) in employing Kendall's a7 , which is a measure of how much an ordering differs from the OSO-the underlying assumption is that most reasonable sentence orderings should be fairly similar to it. Specifically, for a permutation a8 of the sentences in an a4 -sentence document, a7 a21a9a8 a24 is computed as</Paragraph>
      <Paragraph position="4"> where a13a13a21a9a8 a24 is the number of swaps of adjacent sentences necessary to re-arrangea8 into the OSO. The metric ranges from -1 (inverse orders) to 1 (identical orders).</Paragraph>
      <Paragraph position="5"> 8See Section 5.5 for discussion of the relation between the ordering and the summarization task.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>