<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1057"> <Title>A Noisy-Channel Model for Document Compression</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> For testing, we began with two sets of data. The first set is drawn from the Wall Street Journal (WSJ) portion of the Penn Treebank and consists of 16 documents, each containing between 41 and 87 words. The second set is drawn from a collection of student compositions and consists of 5 documents, each containing between 64 and 91 words. We call this set the MITRE corpus (Hirschman et al., 1999).</Paragraph> <Paragraph position="1"> [3] This tends to be the case for very short documents, as the compressions never get sufficiently long for the length normalization to have an effect.</Paragraph> <Paragraph position="2"> We would have liked to run evaluations on longer documents. Unfortunately, the forests generated even for relatively small documents are huge. Because the number of summaries that can be generated for any given text is exponential,[4] the decoder runs out of memory on longer documents; we therefore selected shorter subtexts from the original documents.</Paragraph> <Paragraph position="3"> We used both the WSJ and MITRE data for evaluation because we wanted to see whether the performance of our system varies with text genre. The MITRE data consists mostly of short sentences (the average MITRE document is 6 sentences long), quite in contrast to the typically long sentences in the Wall Street Journal articles (the average WSJ document is 2.5 sentences long).</Paragraph> <Paragraph position="4"> For purposes of comparison, the MITRE data was compressed using five systems:</Paragraph> <Paragraph position="5"> Random: Drops words at random; each word has a 50% chance of being dropped (the baseline).</Paragraph> <Paragraph position="6"> Hand: Compressions produced by a human.</Paragraph> <Paragraph position="7"> Concat: Each sentence is compressed individually and the results are concatenated; we use Knight and Marcu's (2000) system here for comparison.</Paragraph> <Paragraph position="8"> EDU: The system described in this paper.</Paragraph> <Paragraph position="9"> Sent: Because syntactic parsers tend not to work well when parsing isolated clauses, this system first merges together the leaves of the discourse tree that belong to the same sentence, and then proceeds as described in this paper.</Paragraph> <Paragraph position="10"> The Wall Street Journal data was evaluated on the five systems above as well as two additions. Since the correct discourse trees were known for these data, we thought it wise to also test the systems using these human-built discourse trees instead of the automatically derived ones. The two additional systems were:</Paragraph> <Paragraph position="11"> PD-EDU: Same as EDU, except using the perfect discourse trees available from the RST corpus (Carlson et al., 2001).</Paragraph> <Paragraph position="12"> PD-Sent: Same as Sent, except using the perfect discourse trees.</Paragraph> <Paragraph position="13"> [4] In theory, a text of n words has 2^n possible compressions.</Paragraph>
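<Paragraph position="14"> Figure 4 below makes this search problem concrete by listing, for several target lengths, the best compression found for a sample text. The following sketch illustrates that selection step only; it is not our decoder: it enumerates candidates by brute force (hence the 2^n blow-up of footnote [4]), and log_prob is a hypothetical stand-in for the noisy-channel score, normalized by length so that longer candidates can compete.

import itertools

def all_compressions(words):
    # Each word is either kept or dropped, so a text of n words yields
    # 2**n candidate compressions -- the reason exhaustive search runs
    # out of memory on longer documents.
    for r in range(len(words) + 1):
        for kept in itertools.combinations(words, r):  # order-preserving
            yield list(kept)

def best_per_length(candidates, log_prob):
    # For each candidate length, keep the candidate with the highest
    # length-normalized log-probability. log_prob is an assumed scorer,
    # not the model of this paper.
    best = {}
    for cand in candidates:
        n = len(cand)
        if n == 0:
            continue
        score = log_prob(cand) / n  # normalization matters little for very short texts
        if score > best.get(n, (float("-inf"), None))[0]:
            best[n] = (score, cand)
    return best
</Paragraph>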
<Paragraph position="15"> Figure 4: the best compression found at several target lengths (len counts word and punctuation tokens):
len 8: Mayor is now looking which is enough.
len 13: The mayor is now looking which is already almost enough to win.
len 16: The mayor is now looking but without support, he is still on shaky ground.
len 18: Mayor is now looking but without the support of governor, he is still on shaky ground.
len 22: The mayor is now looking for re-election but without the support of the governor, he is still on shaky ground.
len 28: The mayor is now looking which is already almost enough to win. But without the support of the governor, he is still on shaky ground.</Paragraph> <Paragraph position="16"> Six human evaluators rated the systems according to three metrics. The first two, presented together to the evaluators, were grammaticality and coherence; the third, presented separately, was summary quality. Grammaticality was a judgment of how good the English of the compressions was; coherence captured how well the compression flowed (for instance, an anaphor lacking an antecedent lowers coherence). Summary quality, on the other hand, was a judgment of how well the compression retained the meaning of the original document. Each measure was rated on a scale from 1 (worst) to 5 (best).</Paragraph> <Paragraph position="17"> We can draw several conclusions from the evaluation results shown in Table 2, which also reports each system's average compression rate (Cmp, the length of the compressed document divided by the length of the original).[5] First, it is clear that genre influences the results.</Paragraph> <Paragraph position="18"> Because the MITRE data contained mostly short sentences, the syntax and discourse parsers made fewer errors, which allowed better compressions to be generated. For the MITRE corpus, compressions obtained starting from discourse trees built above the sentence level were better than compressions obtained starting from discourse trees built above the EDU level. For the WSJ corpus, compressions obtained starting from discourse trees built above the sentence level were more grammatical, but less coherent, than compressions obtained starting from discourse trees built above the EDU level. The manner in which the discourse and syntactic representations of texts are mixed should therefore be chosen according to the genre of the texts one wants to compress.</Paragraph> <Paragraph position="19"> [5] We did not run the system on the MITRE data with perfect discourse trees because we did not have hand-built discourse trees for this corpus.</Paragraph> <Paragraph position="20"> The compressions obtained starting from perfectly derived discourse trees indicate that perfect discourse structures greatly improve the coherence and grammaticality of the generated summaries. It was surprising to see that summary quality was affected negatively by the use of perfect discourse structures (although the difference was not statistically significant). We believe this happened because the text fragments we summarized were extracted from longer documents: had the discourse structures been built specifically for these short text snippets, they would likely have been different. Moreover, no component of our system was designed to handle cohesion, so it is to be expected that many compressions contain dangling references.</Paragraph> <Paragraph position="21"> Overall, all our systems outperformed both the Random baseline and the Concat system, which shows empirically that discourse structure plays an important role in document summarization.</Paragraph>
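<Paragraph position="22"> As a minimal sketch of the bookkeeping behind Table 2 and the significance tests reported below (the data layout, the use of scipy, and the choice of a paired test are illustrative assumptions, not our analysis scripts), the compression rate and a 95%-level comparison of two systems can be computed as follows:

import numpy as np
from scipy import stats

def compression_rate(original_len, compressed_len):
    # Cmp: length of the compressed document divided by the length of
    # the original document.
    return compressed_len / original_len

def compare_systems(ratings_a, ratings_b, alpha=0.05):
    # ratings_a and ratings_b are aligned 1-5 judgments, one entry per
    # (evaluator, document) pair, for two systems. A paired t test
    # checks significance at the 95% level (alpha = 0.05).
    a = np.asarray(ratings_a, dtype=float)
    b = np.asarray(ratings_b, dtype=float)
    t_stat, p_value = stats.ttest_rel(a, b)
    significant = bool(alpha > p_value)  # i.e., p_value falls below alpha
    return a.mean(), b.mean(), significant
</Paragraph>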
<Paragraph position="23"> We performed t tests on the results and found that on the Wall Street Journal data, the differences in score between the Concat and Sent systems for grammaticality and coherence were statistically significant at the 95% level, but the difference in score for summary quality was not. For the MITRE data, the differences in score between the Concat and Sent systems for grammaticality and summary quality were statistically significant at the 95% level, but the difference in score for coherence was not. The score differences for grammaticality, coherence, and summary quality between our systems and the baselines were all statistically significant at the 95% level.</Paragraph> <Paragraph position="24"> The results in Table 2, which can also be assessed by inspecting the compressions in Figure 4, show that, in spite of our success, we are still far from human performance levels. An error our system often makes is dropping complements that cannot be dropped, such as the phrase &quot;for re-election&quot;, which is the complement of &quot;is looking&quot;. We are currently experimenting with lexicalized models of syntax that would prevent our compression system from dropping required verb arguments. We are also considering methods for scaling up the decoder to handle documents of more realistic length.</Paragraph> </Section> </Paper>