<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1015"> <Title>Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization</Title>
<Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5.3.2 Results </SectionTitle>
<Paragraph position="0"> For each of the 500 unseen test texts, we exhaustively enumerated all sentence permutations and ranked them using a content model from the corresponding domain.</Paragraph>
<Paragraph position="1"> We compared our results against those of a bigram language model (the baseline) and an improved version of the state-of-the-art probabilistic ordering method of Lapata (2003), both trained on the same data we used.</Paragraph>
<Paragraph position="2"> Lapata's method first learns a set of pairwise sentence-ordering preferences based on features such as noun-verb dependencies. Given a new set of sentences, the latest version of her method applies a Viterbi-style approximation algorithm to choose a permutation satisfying many preferences (Lapata, personal communication).9 Table 2 gives the results of our ordering-test comparison experiments. Content models outperform the alternatives almost universally, and often by a very wide margin. We conjecture that this difference in performance stems from the ability of content models to capture global document structure. In contrast, the other two algorithms are local, taking into account only the relationships between adjacent word pairs and adjacent sentence pairs, respectively. It is interesting to observe that our method achieves better results despite not having access to the linguistic information incorporated by Lapata's method. To be fair, though, her techniques were designed for a larger corpus than ours, which may aggravate data sparseness problems for such a feature-rich method.</Paragraph>
<Paragraph position="3"> Table 3 gives further details on the rank results for our content models, showing how the rank scores were distributed; for instance, we see that on the Earthquakes domain, the OSO was one of the top five permutations in 95% of the test documents. Even in Drugs and Accidents -- the domains that proved relatively challenging to our method -- in more than 55% of the cases the OSO's rank did not exceed ten. Given that the maximal possible rank in these domains exceeds three million, we believe that our model has done a good job in the ordering task.</Paragraph>
<Paragraph position="4"> We also computed learning curves for the different domains; these are shown in Figure 2. Not surprisingly, performance improves with the size of the training set for all domains. The figure also shows that the relative difficulty (from the content-model point of view) of the different domains remains mostly constant across varying training-set sizes. Interestingly, the two easiest domains, Finance and Earthquakes, can be thought of as being more formulaic or at least more redundant, in that they have the highest token/type ratios (see Table 1) -- that is, in these domains, words are repeated much more frequently on average.</Paragraph>
<Paragraph position="5"> Figure 2: OSO prediction rate, as a function of the number of documents in the training set.</Paragraph>
<Paragraph position="6"> Table 3: Percentage of test cases in which the model assigned to the OSO a rank within a given range.</Paragraph>
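To make the ordering evaluation above concrete, here is a minimal sketch (not the authors' code) of how an OSO rank can be computed: enumerate all permutations of a test document's sentences, score each with a scoring function that stands in for the content model's log-likelihood, and report the rank of the original sentence order. The names `oso_rank` and `toy_log_prob`, and the toy scorer itself, are illustrative assumptions.

```python
from itertools import permutations

def oso_rank(sentences, log_prob):
    """Rank of the original sentence order (OSO) among all permutations.

    `sentences` is the document in its original order; `log_prob` is any
    function that scores a sequence of sentences (here a stand-in for the
    content model's log-likelihood). Rank 1 means no permutation scores
    strictly higher than the OSO.
    """
    oso_score = log_prob(list(sentences))
    better = sum(
        1
        for perm in permutations(sentences)
        if list(perm) != list(sentences) and log_prob(list(perm)) > oso_score
    )
    return better + 1

def toy_log_prob(seq):
    """Toy scorer (illustrative only): prefer sentences in 'topic order'."""
    topic_order = {"intro": 0, "magnitude": 1, "damage": 2, "relief": 3}
    return -sum(abs(topic_order[s] - i) for i, s in enumerate(seq))

if __name__ == "__main__":
    doc = ["intro", "magnitude", "damage", "relief"]
    print(oso_rank(doc, toy_log_prob))  # -> 1: the toy scorer prefers the OSO
```

On a natural reading, the OSO prediction rate used later in this section is the fraction of test documents for which this rank is 1. Exhaustive enumeration is only feasible for short documents: a maximal rank above three million corresponds to roughly ten sentences, since 10! = 3,628,800.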
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Summarization Experiments </SectionTitle>
<Paragraph position="0"> The evaluation of our summarization algorithm was driven by two questions: (1) Are the summaries produced of acceptable quality, in terms of selected content? and (2) Does the content-model representation provide additional advantages over more locally-focused methods? To address the first question, we compare summaries created by our system against the &quot;lead&quot; baseline, which extracts the first ℓ sentences of the original text -- despite its simplicity, the results from the annual Document Understanding Conference (DUC) evaluation suggest that most single-document summarization systems cannot beat this baseline. To address question (2), we consider a summarization system that learns extraction rules directly from a parallel corpus of full texts and their summaries (Kupiec et al., 1995). In this system, summarization is framed as a sentence-level binary classification problem: each sentence is labeled by the publicly-available BoosTexter system (Schapire and Singer, 2000) as being either &quot;in&quot; or &quot;out&quot; of the summary. The features considered for each sentence are its unigrams and its location within the text, namely beginning third, middle third and end third.10 Hence, relationships between sentences are not explicitly modeled, making this system a good basis for comparison.</Paragraph>
<Paragraph position="1"> We evaluated our summarization system on the Earthquakes domain, since for some of the texts in this domain there is a condensed version written by AP journalists.</Paragraph>
<Paragraph position="2"> These summaries are mostly extractive11; consequently, they can be easily aligned with sentences in the original articles. From sixty document-summary pairs, half were randomly selected to be used for training and the other half for testing. (While thirty documents may not seem like a large number, it is comparable to the size of the training corpora used in the competitive summarization-system evaluations mentioned above.) The average number of sentences in the full texts and summaries was 15 and 6, respectively, for a total of 450 sentences in each of the test and (full documents of the) training sets.</Paragraph>
<Paragraph position="3"> At runtime, we provided the systems with a full document and the desired output length, namely, the length in sentences of the corresponding shortened version. The resulting summaries were judged as a whole by the fraction of their component sentences that appeared in the human-written summary of the input text.</Paragraph>
<Paragraph position="4"> The results in Table 4 confirm our hypothesis about the benefits of content models for text summarization -- our model outperforms both the sentence-level, locally-focused classifier and the &quot;lead&quot; baseline. Furthermore, as the learning curves shown in Figure 3 indicate, our method achieves good performance on a small subset of parallel training data: in fact, the accuracy of our method on one third of the training data is higher than that of the sentence-level classifier on the full training set. Clearly, this performance gain demonstrates the effectiveness of content models for the summarization task.</Paragraph>
<Paragraph position="5"> 10This feature set yielded the best results among the several possibilities we tried.</Paragraph>
<Paragraph position="6"> 11Occasionally, one or two phrases or, more rarely, a clause were dropped.</Paragraph>
<Paragraph position="7"> Figure 3: Extraction accuracy on Earthquakes as a function of training-set size.</Paragraph>
</Section>
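As an illustration of the evaluation protocol just described, the following sketch (with hypothetical names, not the authors' implementation) implements the "lead" baseline and the score used to judge summaries: the fraction of a system's extracted sentences that appear in the mostly extractive human-written summary.

```python
def lead_baseline(document_sentences, k):
    """'Lead' baseline: extract the first k sentences of the document."""
    return document_sentences[:k]

def summary_accuracy(extracted, human_summary):
    """Fraction of extracted sentences that also appear in the
    (mostly extractive) human-written summary of the same document."""
    if not extracted:
        return 0.0
    reference = set(human_summary)
    return sum(1 for s in extracted if s in reference) / len(extracted)

if __name__ == "__main__":
    # Toy usage: the target length is the length in sentences of the
    # corresponding human summary, as in the setup described above.
    doc = ["s1", "s2", "s3", "s4", "s5"]
    human = ["s1", "s3"]
    extracted = lead_baseline(doc, k=len(human))
    print(summary_accuracy(extracted, human))  # -> 0.5
```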
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.5 Relation Between Ordering and Summarization Methods </SectionTitle>
<Paragraph position="0"> Since we used two somewhat orthogonal tasks, ordering and summarization, to evaluate the quality of the content-model paradigm, it is interesting to ask whether the same parameterization of the model does well in both cases.</Paragraph>
<Paragraph position="1"> Specifically, we looked at the results for different model topologies, induced by varying the number of content-model states. For these tests, we experimented with the Earthquakes data (the only domain for which we could evaluate summarization performance), and exerted direct control over the number of states, rather than utilizing the cluster-size threshold; that is, in order to create exactly m states for a specific value of m, we merged the smallest clusters until m clusters remained.</Paragraph>
<Paragraph position="2"> Table 5 shows the performance of the different-sized content models with respect to the summarization task and the ordering task (using OSO prediction rate). While the ordering results seem to be more sensitive to the number of states, both metrics induce a similar ranking on the models. In fact, the same-size model yields top performance on both tasks. While our experiments are limited to only one domain, the correlation in results is encouraging: optimizing parameters on one task promises to yield good performance on the other. These findings provide support for the hypothesis that content models are not only helpful for specific tasks, but can serve as effective representations of text structure in general.</Paragraph>
<Paragraph position="3"> Table 5: Performance as a function of model size. Ordering: OSO prediction rate; Summarization: extraction accuracy.</Paragraph>
</Section> </Section> </Paper>
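As a concrete reading of the state-count control described in Section 5.5, the sketch below reduces a clustering to exactly m clusters by repeatedly merging the two smallest clusters. The text does not spell out the exact merging rule, so this rule, the function name, and the list-of-sentences cluster representation are assumptions for illustration only.

```python
def merge_to_m_clusters(clusters, m):
    """Reduce a clustering to exactly m clusters by repeatedly merging
    the two smallest clusters. This is one plausible reading of
    'we merged the smallest clusters until m clusters remained';
    the paper does not specify the precise rule.

    `clusters` is a list of clusters, each a list of sentences.
    """
    clusters = [list(c) for c in clusters]   # work on a copy
    while len(clusters) > m:
        clusters.sort(key=len)               # smallest clusters first
        merged = clusters[0] + clusters[1]   # merge the two smallest
        clusters = [merged] + clusters[2:]
    return clusters

if __name__ == "__main__":
    toy = [["a"], ["b", "c"], ["d"], ["e", "f", "g"], ["h"], ["i", "j"]]
    print([len(c) for c in merge_to_m_clusters(toy, 4)])  # -> [3, 2, 2, 3]
```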