<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1069">
  <Title>Probabilistic Text Structuring: Experiments with Sentence Ordering</Title>
  <Section position="3" start_page="0" end_page="3" type="metho">
    <SectionTitle>
2 Learning to Order
</SectionTitle>
    <Paragraph position="0"> Given a collection of texts from a particular domain, our task is to learn constraints on the ordering of their sentences. In the training phase our model will learn these constraints from adjacent sentences represented by a set of informative features. In the testing phase, given a set of unseen sentences, we will rely on our prior experience of how sentences are usually ordered for choosing the most likely ordering. null</Paragraph>
    <Section position="1" start_page="0" end_page="3" type="sub_section">
      <SectionTitle>
2.1 The Model
</SectionTitle>
      <Paragraph position="0"> We express the probability of a text made up of sen- null as shown in (1). According to (1), the task of predicting the next sentence is dependent on its n[?]i previous sentences.</Paragraph>
      <Paragraph position="2"> We will simplify (1) by assuming that the probability of any given sentence is determined only by its previous sentence:</Paragraph>
      <Paragraph position="4"> (2) This is a somewhat simplistic attempt at capturing Marcu's (1997) local coherence constraints as well as Barzilay et al.'s (2002) observations about topical relatedness. While this is clearly a naive view of text coherence, our model has some notion of the types of sentences that typically go together, even though it is agnostic about the specific rhetorical relations that glue sentences into a coherent text. Also note that the simplification in (2) will make the estimation of the probabilities P(S</Paragraph>
      <Paragraph position="6"> ) more reliable in the face of sparse data. Of course estimat-</Paragraph>
      <Paragraph position="8"> were actual sentences. It is unlikely to find the exact same sentence repeated several times in a corpus.</Paragraph>
      <Paragraph position="9"> What we can find and count is the number of times a given structure or word appears in the corpus. We will therefore estimate P(S</Paragraph>
      <Paragraph position="11"> ) from features that express its structure and content (these features are described in detail in Section 3):</Paragraph>
      <Paragraph position="13"> . We will assume that these features are independent and that P(S</Paragraph>
      <Paragraph position="15"> from the pairs in the Cartesian product defined over the features expressing sentences S</Paragraph>
      <Paragraph position="17"> Assuming that the features are independent again makes parameter estimation easier. The Cartesian product over the features in S</Paragraph>
      <Paragraph position="19"> is an attempt to capture inter-sentential dependencies. Since  we don't know a priori what the important feature combinations are, we are considering all possible combinations over two sentences. This will admittedly introduce some noise, given that some dependencies will be spurious, but the model can be easily retrained for different domains for which different feature combinations will be important. The proba- null ) is the number of times feature a hi;ji is preceded by feature a hi[?]1;ki in the corpus. The denominator expresses the number of times a hi[?]1;ki is attested in the corpus (preceded by any feature). The probabilities P(a</Paragraph>
      <Paragraph position="21"> ) are small, and undefined in cases where the feature combinations are unattested in the corpus. We therefore smooth the observed frequencies using back-off smoothing (Katz, 1987).</Paragraph>
      <Paragraph position="22"> To illustrate with an example consider the text in Figure 1 which has three sentences S  each represented by their respective features denoted by letters. The probability P(S</Paragraph>
      <Paragraph position="24"> ) will be calculated by taking the product of P(hje), P(hjf ), P(hjg), P(ije), P(ijf ),andP(ijg). To obtain P(hje), we need f (h;e) and f (e) which can be estimated in Figure 1 by counting the number of edges connecting e and h and the number of edges starting from e, respectively. So, P(hje) will be 0.16 given that f (h;e) is one and f (e) is six (see the normalization in (5)).</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
2.2 Determining an Order
</SectionTitle>
      <Paragraph position="0"> Once we have collected the counts for our features we can determine the order for a new text that we haven't encountered before, since some of the features representing its sentences will be familiar.</Paragraph>
      <Paragraph position="1"> Given a text with N sentences there are N! possible orders. The set of orders can be represented as a complete graph, where the set of vertices V is equal to the set of sentences S and each edge u ! v has a weight, the probability P(ujv). Cohen et al. (1999)  show that the problem of finding an optimal ordering through a directed weighted graph is NP-complete.</Paragraph>
      <Paragraph position="2"> Fortunately, they propose a simple greedy algorithm that provides an approximate solution which can be easily modified for our task (see also Barzilay et al. 2002).</Paragraph>
      <Paragraph position="3"> The algorithm starts by assigning each vertex v 2 V a probability. Recall that in our case vertices are sentences and their probabilities can be calculated by taking the product of the probabilities of their features. The greedy algorithm then picks the node with the highest probability and orders it ahead of the other nodes. The selected node and its incident edges are deleted from the graph. Each remaining node is now assigned the conditional probability of seeing this node given the previously selected node (see (4)). The node which yields the highest conditional probability is selected and ordered ahead. The process is repeated until the graph is empty.</Paragraph>
      <Paragraph position="4"> As an example consider again a three sentence text. We illustrate the search for a path through the graph in Figure 2. First we calculate which of the  is most likely to start the text (during training we record which sentences appear in the beginning of each text). Assuming that  . As can be seen in Figure 2 for each vertex we keep track of the most probable edge that ends in that vertex, thus setting th beam search width to one. Note, that equation (4) would assign lower and lower probabilities to sentences with large numbers of features. Since we need to compare sentence pairs with varied numbers of features, we will normalize the conditional probabilities P(S</Paragraph>
      <Paragraph position="6"> ) by the number feature of pairs that form the Cartesian product</Paragraph>
      <Paragraph position="8"> 1. Laidlaw Transportation Ltd. said shareholders will be asked at its Dec. 7 annual meeting to approve a change of name to Laidlaw Inc.</Paragraph>
      <Paragraph position="9"> 2. The company said its existing name hasn't represented its businesses since the 1984 sale of its trucking operations. 3. Laidlaw is a waste management and school-bus operator, in which Canadian Pacific Ltd. has a 47% voting interest.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="3" end_page="3" type="metho">
    <SectionTitle>
3 Parameter Estimation
</SectionTitle>
    <Paragraph position="0"> The model in Section 2.1 was trained on the BLLIP corpus (30 M words), a collection of texts from the Wall Street Journal (years 1987-89). The corpus contains 98,732 stories. The average story length is 19.2 sentences. 71.30% of the texts in the corpus are less than 50 sentences long. An example of the texts in this newswire corpus is shown in Figure 3.</Paragraph>
    <Paragraph position="1"> The corpus is distributed in a Treebank-style machine-parsed version which was produced with Charniak's (2000) parser. The parser is a &amp;quot;maximum-entropy inspired&amp;quot; probabilistic generative model. It achieves 90.1% average precision/recall for sentences with maximum length 40 and 89.5% for sentences with maximum length 100 when trained and tested on the standard sections of the Wall Street Journal Treebank (Marcus et al., 1993).</Paragraph>
    <Paragraph position="2"> We also obtained a dependency-style version of the corpus using MINIPAR (Lin, 1998) a broad coverage parser for English which employs a manually constructed grammar and a lexicon derived from WordNet with an additional dictionary of proper names (130,000 entries in total). The grammar is represented as a network of 35 nodes (i.e., grammatical categories) and 59 edges (i.e., types of syntactic (dependency) relations). The output of MINIPAR is a dependency graph which represents the dependency relations between words in a sentence (see Table 1 for an example). Lin (1998) evaluated the parser on the SUSANNE corpus (Sampson, 1996), a domain independent corpus of British English, and achieved a recall of 79% and precision of 89% on the dependency relations.</Paragraph>
    <Paragraph position="3"> From the two different parsed versions of the BLLIP corpus the following features were extracted: Verbs. Investigations into the interpretation of narrative discourse (Asher and Lascarides, 2003) have shown that specific lexical information (e.g., verbs, adjectives) plays an important role in determining the discourse relations between propositions. Although we don't have an explicit model of rhetorical relations and their effects on sentence ordering, we capture the lexical inter-dependencies between sentences by focusing on verbs and their precedence relationships in the corpus.</Paragraph>
    <Paragraph position="4"> From the Treebank parses we extracted the verbs contained in each sentence. We obtained two versions of this feature: (a) a lemmatized version where verbs were reduced to their base forms and (b) a non-lemmatized version which preserved tense-related information; more specifically, verbal complexes (e.g., I will have been going) were identified from the parse trees heuristically by devising a set of 30 patterns that search for sequences of modals, auxiliaries and verbs. This is an attempt at capturing temporal coherence by encoding sequences of events and their morphology which indirectly indicates their tense.</Paragraph>
    <Paragraph position="5"> To give an example consider the text in Figure 3. For the lemmatized version, sentence (1) will be represented by say, will, be, ask,andapprove;for the tensed version, the relevant features will be said, will be asked,andto approve.</Paragraph>
    <Paragraph position="6"> Nouns. Centering Theory (CT, Grosz et al. 1995) is an entity-based theory of local coherence, which claims that certain entities mentioned in an utterance are more central than others and that this property constrains a speaker's use of certain referring expressions. The principles underlying CT (e.g., continuity, salience) are of interest to concept-to-text generation as they offer an entity-based model of text and sentence planning which is particularly suited for descriptional genres (Kibble and Power, 2000).</Paragraph>
    <Paragraph position="7"> We operationalize entity-based coherence for text-to-text generation by simply keeping track of the nouns attested in a sentence without however taking personal pronouns into account. This simplification is reasonable if one has text-to-text generation mind. In multidocument summarization for example, sentences are extracted from different documents; the referents of the pronouns attested in these sentences are typically not known and in some cases identical pronouns may refer to different entities. So making use of noun-pronoun or pronoun-pronoun co-occurrences will be uninformative or in fact misleading. null We extracted nouns from a lemmatized version of the Treebank-style parsed corpus. In cases of noun compounds, only the compound head (i.e., rightmost noun) was taken into account. A small set of rules was used to identify organizations (e.g., United Laboratories Inc.), person names (e.g., Jose Y. Campos), and locations (e.g., New England) spanning more than one word. These were grouped together and were also given the general categories person, organization,andlocation. The model backs off to these categories when unknown person names, locations, and organizations are encountered. Dates, years, months and numbers were substituted by the categories date, year, month,andnumber.</Paragraph>
    <Paragraph position="8"> In sentence (1) (see Figure 3) we identify the nouns Laidlaw Transportation Ltd., shareholder, Dec 7, meeting, change, name and Laidlaw Inc.In sentence (2) the relevant nouns are company, name, business, 1984, sale,andoperation.</Paragraph>
    <Paragraph position="9"> Dependencies. Note that the noun and verb features do not capture the structure of the sentences to be ordered. This is important for our domain, as texts seem to be rather formulaic and similar syntactic structures are often used (e.g., direct and indirect speech, restrictive relative clauses, predicative structures). In this domain companies typically say things, and texts often begin with a statement of what a company or an individual has said (see sentence (1) in Figure 3). Furthermore, companies and individuals are described with certain attributes (persons can be presidents or governors, companies are bankrupt or manufacturers, etc.) that can give clues for inferring coherence.</Paragraph>
    <Paragraph position="10"> The dependencies were obtained from the output of MINIPAR. Some of the dependencies for sentence (2) from Figure 3 are shown in Table 1. The dependencies capture structural as well lexical information. They are represented as triples, consisting of a head (leftmost element, e.g., say, name), a modifier (rightmost element, e.g., company, its)andarelation (e.g., subject (V:subj:N), object (V:obj:N), modifier (N:mod:A)).</Paragraph>
    <Paragraph position="11"> For efficiency reasons we focused on triples whose dependency relations (e.g., V:subj:N)were attested in the corpus with frequency larger than one per million. We further looked at how individual types of relations contribute to the ordering task. More specifically we experimented with dependencies relating to verbs (49 types), nouns (52 types), verbs and nouns (101 types) (see Table 1 for examples). We also ran a version of our model with all types of relations, including adjectives, adverbs and  prepositions (147 types in total).</Paragraph>
  </Section>
  <Section position="5" start_page="3" end_page="3" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> In this section we describe our experiments with the model and the features introduced in the previous sections. We first evaluate the model by attempting to reproduce the structure of unseen texts from the BLLIP corpus, i.e., the corpus on which the model is trained on. We next obtain an upper bound for the task by conducting a sentence ordering experiment with humans and comparing the model against the human data. Finally, we assess whether this model can be used for multi-document summarization using data from Barzilay et al. (2002). But before we outline the details of our experiments we discuss our choice of metric for comparing different orders.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Evaluation Metric
</SectionTitle>
      <Paragraph position="0"> Our task is to produce an ordering for the sentences of a given text. We can think of the sentences as objects for which a ranking must be produced. Table 2 gives an example of a text containing 10 sentences (A-J) and the orders (i.e., rankings) produced by three hypothetical models.</Paragraph>
      <Paragraph position="1"> A number of metrics can be used to measure the distance between two rankings such as Spearman's correlation coefficient for ranked data, Cayley distance, or Kendall's t (see Lebanon and Lafferty 2002 for details). Kendall's t is based on the number of inversions in the rankings and is defined in (6):</Paragraph>
      <Paragraph position="3"> where N is the number of objects (i.e., sentences) being ranked and inversions are the number of interchanges of consecutive elements necessary to arrange them in their natural order. If we think in terms of permutations, then t can be interpreted as the minimum number of adjacent transpositions needed to bring one order to the other. In Table 2 the number of inversions can be calculated by counting the number of intersections of the lines. The metric ranges from [?]1 (inverse ranks) to 1 (identical ranks). The t for Model 1 and Model 2 in Table 2 is .822.</Paragraph>
      <Paragraph position="4"> Kendall's t seems particularly appropriate for the tasks considered in this paper. The metric is sensitive to the fact that some sentences may be always ordered next to each other even though their absolute orders might differ. It also penalizes inverse rankings. Comparison between Model 1 and Model 3 would give a t of 0.244 even though the orders between the two models are identical modulo the beginning and the end. This seems appropriate given that flipping the introduction in a document with the conclusions seriously disrupts coherence.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Experiment 1: Ordering Newswire Texts
</SectionTitle>
      <Paragraph position="0"> The model from Section 2.1 was trained on the BLLIP corpus and tested on 20 held-out randomly selected unseen texts (average length 15.3). We also used 20 randomly chosen texts (disjoint from the test data) for development purposes (average length 16.2). All our results are reported on the test set.</Paragraph>
      <Paragraph position="1"> The input to the the greedy algorithm (see Section 2.2) was a text with a randomized sentence ordering. The ordered output was compared against the original authored text using t. Table 3 gives the average t (T ) for all 20 test texts when the following features are used: lemmatized verbs (V  encapsulates notions of entity-based coherence, which is relatively important for our domain. A lot of texts are about a particular entity (company or individual) and their properties. The feature V</Paragraph>
      <Paragraph position="3"> subsumes several other features and does expectedly better: it captures entity-based coherence, the inter-relations among verbs, the structure of sentences and also preserves information about argument structure (who is doing what to whom). The distance between the orders produced by the model and the original texts increases when all types of dependencies are Feature T StdDev Min Max  and model generated variants taken into account. The feature space becomes too big, there are too many spurious feature pairs, and the model can't distinguish informative from non-informative features.</Paragraph>
      <Paragraph position="4"> We carried out a one-way Analysis of Variance (ANOVA) to examine the effect of different feature types. The ANOVA revealed a reliable effect of feature type (F(9;171)=3:31; p &lt; 0:01). We performed Post-hoc Tukey tests to further examine whether there are any significant differences among the different features and between our model and the baseline. We found out that N  are not significantly different from each other. However, they are significantly better than all other features (a = 0:05).</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.3 Experiment 2: Human Evaluation
</SectionTitle>
      <Paragraph position="0"> In this experiment we compare our model's performance against human judges. Twelve texts were randomly selected from the 20 texts in our test data. The texts were presented to subjects with the order of their sentences scrambled. Participants were asked to reorder the sentences so as to produce a coherent text. Each participant saw three texts randomly chosen from the pool of 12 texts. A random order of sentences was generated for every text the participants saw. Sentences were presented verbatim, pronouns and connectives were retained in order to make ordering feasible. Notice that this information is absent from the features the model takes into account.</Paragraph>
      <Paragraph position="1"> The study was conducted remotely over the Internet using a variant of Barzilay et al.'s (2002) software. Subjects first saw a set of instructions that explained the task, and had to fill in a short questionnaire including basic demographic information. The experiment was completed by 137 volunteers (approximately 33 per text), all native speakers of English. Subjects were recruited via postings to local Feature T StdDev Min Max  humans and the model on multidocument summaries Usenet newsgroups.</Paragraph>
      <Paragraph position="2"> Table 4 reports pairwise t averaged over 12 texts for all participants (H H ) and the average t between the model and each of the subjects for all features used in Experiment 1. The average distance in the orderings produced by our subjects is .58. The distance between the humans and the best features  are not significantly different from H H (a = 0:01). This is in agreement with Experiment 1 and points to the importance of lexical and structural information for the ordering task.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.4 Experiment 3: Summarization
</SectionTitle>
      <Paragraph position="0"> Barzilay et al. (2002) collected a corpus of multiple orderings in order to study what makes an order cohesive. Their goal was to improve the ordering strategy of MULTIGEN (McKeown et al., 1999) a multidocument summarization system that operates on news articles describing the same event. MULTIGEN identifies text units that convey similar information across documents and clusters them into themes.</Paragraph>
      <Paragraph position="1"> Each theme is next syntactically analysed into predicate argument structures; the structures that are repeated often enough are chosen to be included into the summary. A language generation system outputs a sentence (per theme) from the selected predicate argument structures.</Paragraph>
      <Paragraph position="2"> Barzilay et al. (2002) collected ten sets of articles each consisting of two to three articles reporting the same event and simulated MULTIGEN by manually selecting the sentences to be included in the final summary. This way they ensured that orderings were not influenced by mistakes their system could have made. Explicit references and connectives were removed from the sentences so as not to reveal clues about the sentence ordering. Ten subjects provided orders for each summary which had an average length of 8.8.</Paragraph>
      <Paragraph position="3"> We simulated the participants' task by using the model from Section 2.1 to produce an order for each candidate summary  . We then compared the differences in the orderings generated by the model and participants using the best performing features from  ). Note that the model was trained on the BLLIP corpus, whereas the sentences to be ordered were taken from news articles describing the same event. Not only were the news articles unseen but also their syntactic structure was unfamiliar to the model. The results are shown in table 5, again average pairwise t is reported. We also give the naive baseline of choosing a random order (B R ). The average distance in the orderings produced by Barzilay et al.'s (2002) participants is .60. The distance between the humans and N  and the humans is .56. An ANOVA yielded a significant effect of feature type (F(3;27)=15:25; p &lt; 0:01). Post-hoc Tukey tests showed that V  ments 1 and 2, it failed to outperform the baseline in the summarization task. This may be due to the fact that entity-based coherence is not as important as temporal coherence for the news articles summaries. Recall that the summaries describe events across documents. This information is captured more ad- null that only keeps a record of the entities in the sentence.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>