<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1012">
  <Title>The effects of analysing cohesion on document summarisation</Title>
  <Section position="6" start_page="79" end_page="80" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> For evaluating the effect of various strategies upon summarizer output quality, we used as a baseline an evaluation corpus of full-length articles and their 'digests' from The New York Times. There are advantages and disadvantages to this approach. Setting aside whether task-based evaluation is appropriate for strictly testing the effect of one technology on another (see Section 4.1 below), such a decision ties us to a particular set of data. On the positive side, this offers a realistic baseline against which to compare strategies and heuristics; on the negative side, if a certain type of data is missing from the evaluation corpus, there is little hard evidence for judging the effects of strategies and heuristics on such data.</Paragraph>
    <Paragraph position="1"> The remainder of this section describes our evaluation environment, and then looks at the results for small-to-average size documents (the collection comprises just over 800 texts, fewer than half of which are over 10K and virtually none over 20K; the byte count includes HTML markup tags; in terms of number of sentences per document, very few of the longer documents run over 100 sentences).</Paragraph>
    <Section position="1" start_page="79" end_page="80" type="sub_section">
      <SectionTitle>
4.1 Summarization evaluation testbed
</SectionTitle>
      <Paragraph position="0"> Evaluating summarization results is not trivial, not least because there is no such thing as the best, or correct, summary, especially when the summary is constructed as an extract. The purposes of such extracts vary; so do human extractors. Sentence extraction systems may be evaluated by comparing the extract with sentences selected by human subjects (Edmundson, 1969). This is a (superficial) objective measure that clearly ignores the possibility of multiple right answers. Another objective measure compares summaries with pre-existing abstracts, using a suitable method for mapping a sentence in the abstract to its counterpart in the document. Subjective measures, even though still less satisfying, can also be devised: for instance, summary acceptability has been proposed as one such measure. Other evaluation protocols share the primary feature of being task-based, even though details may vary. Thus performance may be measured by comparing browsing and search time as summary abstracts and full-length originals are being used (Miike et al., 1994); other measures look at recall and precision in document retrieval (Brandow et al., 1995), or recall, precision, and time required in document categorization (i.e.</Paragraph>
      <Paragraph position="1"> assessing whether a document has been correctly judged to be relevant or not, on the basis of its summary alone) (Mani et al., 1999).</Paragraph>
      <Paragraph position="2"> We built an environment for baseline summarizer evaluation as part of its development/training cycle. This was also used in analyzing the impact of discourse segmentation on the summarizer's performance. Background collection vocabulary statistics were derived from analyzing 2334 New York Times news stories. Sentences in digests for 808 stories and feature articles were automatically matched with their corresponding sentences in the full-length documents. Digests range in length from 1 to 4 sentences. Since we were particularly interested in longer stories, as well as stories in which the first sentence of the document did not appear in the digest, their representation in the test set, 38%, is larger than their distribution in the newspaper.</Paragraph>
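The matching of digest sentences to their full-text counterparts is automatic; a minimal sketch, assuming a simple word-overlap criterion (an assumption on our part, not necessarily the procedure actually used), might look like this:

def match_digest_to_document(digest_sentences, document_sentences, threshold=0.6):
    """Map each digest sentence to the most word-overlapping document sentence.

    Returns (digest_index, document_index) pairs for matches whose overlap ratio
    meets the threshold; unmatched digest sentences are skipped.
    """
    matches = []
    for i, dsent in enumerate(digest_sentences):
        dwords = set(dsent.lower().split())
        if not dwords:
            continue
        best_j, best_score = None, 0.0
        for j, sent in enumerate(document_sentences):
            words = set(sent.lower().split())
            score = len(dwords.intersection(words)) / len(dwords)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            matches.append((i, best_j))
    return matches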
      <Paragraph position="3"> Since digests are inherently short, this evaluation strategy is somewhat limited in its capability of fully assessing segmentation effects on summarization of long documents. Nonetheless, a number of comparative analyses can be carried out against this baseline collection, which are indicative of the interplay of the various control options, environment settings, and linguistic filters used.</Paragraph>
      <Paragraph position="4"> One parameter in particular is quite instrumental in tuning the summarizer's performance, to a large extent because it is directly related to the length of the original document: the size of the summary, expressed either as a number of sentences or as a percentage of the full length of the original. In addition to a clear intuition (namely that the size of the summary ought to be related to the size of the original), varying the length of the summary offers both the ability to measure the summarizer's performance against baseline summaries (i.e. our collection of digests), and the potential of dynamically adjusting the derived summary size to optimally represent the full document content, depending on the size of that document.</Paragraph>
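A small sketch of the two ways of specifying the target summary length (our own illustration; the function and parameter names are assumptions):

def target_summary_length(num_document_sentences, fixed_sentences=None, percentage=None):
    """Return the number of sentences a summary should contain.

    Either a fixed sentence count (e.g. the digest size, or 4 sentences) or a
    compaction percentage of the original (e.g. 10% or 20%) may be given.
    """
    if fixed_sentences is not None:
        return min(fixed_sentences, num_document_sentences)
    if percentage is not None:
        # At least one sentence, never more than the document itself.
        return max(1, min(num_document_sentences,
                          round(num_document_sentences * percentage / 100.0)))
    raise ValueError("specify either fixed_sentences or percentage")

# Example: a 10% summary of a 60-sentence article is 6 sentences long.
print(target_summary_length(60, percentage=10))  # 6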
      <Paragraph position="5"> Our experiments vary the granularity of summary size. In principle, the performance of a system which does absolute sentence ranking, and systematically picks the N 'best' sentences for the summary, should not depend on the summary size. In our case, the additional heuristics for improving the coherence, readability, and representativeness of the summary (see Section 2.2) introduce variations in overall summary quality, depending on the compaction factor applied to the original document size. A representative spectrum for the test corpus we use is given by data points at: digest size (i.e. a summary of exactly the size, expressed as a number of sentences, of the digest); 4 sentences; 10% of the size of the full-length document; and 20% of the document. Not surprisingly (for a salience-based system), the summarization function alone, without discourse segmentation, benefits from larger summary size. Although the recall rate is higher still for longer summaries, it is not a measure of the overall quality of the summary because of the inherently short length of the digest.</Paragraph>
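These four data points could be swept, for instance, as in the sketch below (purely illustrative; summarize stands in for the salience-based extractor of Section 2.2 and is assumed to return the indices of the selected sentences):

def evaluate_size_regimes(document_sentences, digest_matched_ids, summarize):
    """Recall of extracts at the four summary-size data points described above."""
    n = len(document_sentences)
    regimes = {
        "digest size": max(1, len(digest_matched_ids)),
        "4 sentences": 4,
        "10 percent": max(1, round(n * 0.10)),
        "20 percent": max(1, round(n * 0.20)),
    }
    reference = set(digest_matched_ids)
    results = {}
    for label, size in regimes.items():
        selected = set(summarize(document_sentences, size))
        results[label] = len(selected.intersection(reference)) / max(1, len(reference))
    return results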
    </Section>
    <Section position="2" start_page="80" end_page="80" type="sub_section">
      <SectionTitle>
4.2 Segmentation effects on summarization
</SectionTitle>
      <Paragraph position="0"> Our experiments compare the base summarization procedure, which calculates object salience with respect to a background document collection (Section 2.2), with enhanced procedures incorporating different strategies using the notions of discourse segments and topic shifts.</Paragraph>
      <Paragraph position="1"> These elaborate the intuitions underlying our approach to leveraging lexical cohesion effects (see Section 1.2). The experiments fall into two categories. In an environment where a background collection, and its statistics, cannot be assumed, a summarization procedure was defined to take selected (typically initial) sentences from each segment; this appeals to the intuition that segment-initial sentences would be good topic indicators for their respective segments. The other category of experiment focused on enriching the base summarization procedure with a sentence selection mechanism informed by segment boundary identification and topic shift detection.</Paragraph>
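A minimal sketch of the first category, summarizing from segment-initial sentences alone (segment boundaries are assumed to be given as sentence indices; this is our illustration, not the paper's implementation):

def summarize_by_segments(document_sentences, segment_start_ids, size):
    """Take segment-initial sentences, in document order, up to the target size."""
    chosen = sorted(set(segment_start_ids))[:size]
    return [document_sentences[i] for i in chosen]

# Example: three segments starting at sentences 0, 9, and 17; a 2-sentence summary
# consists of sentences 0 and 9.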
      <Paragraph position="2"> In combining different sentence selection mechanisms, several variables need adjustment to account for the relative contributions of the different document analysis methods, especially where summaries can be specified to be of different lengths. Given the additional sentence selection factors interacting with absolute sentence ranking, we again set the granularity of summary size at three discrete steps, mirroring the evaluation of the original summarizer: summaries can be requested to be precisely 4 sentences long, or to reflect a source compaction factor of 10% or 20% (Section 4.1).</Paragraph>
      <Paragraph position="3"> We experimented with two broad strategies for incorporating topical information into the summary. One approach aimed to bring 'topic openers' into the summary by adding segment-initial sentences to those already selected via salience calculation. The other was to exert finer control over the number of sentences selected via salience, and 'pad' the summary to its requested size with sentences selected from segments by invoking the 'empty segment' (aka 'empty section', see Section 2.2) rule. Special provisions accounted for the fact that segmentation would naturally always select the document-initial sentence.</Paragraph>
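The two strategies might be sketched as follows (illustrative only; the salience ranking and the 'empty segment' rule are treated as given inputs rather than reimplemented, and the salience_quota parameter is an assumption):

def add_topic_openers(salient_ids, segment_start_ids, size):
    """Strategy 1: add segment-initial 'topic openers' to the salience-selected
    sentences, keeping the summary within the requested size."""
    chosen = list(salient_ids)
    for i in sorted(segment_start_ids):
        if len(chosen) >= size:
            break
        if i not in chosen:
            chosen.append(i)
    return sorted(chosen[:size])

def pad_with_segment_sentences(salient_ids, segment_start_ids, size, salience_quota):
    """Strategy 2: take only salience_quota sentences by salience, then 'pad' the
    summary to the requested size with segment-initial sentences (standing in for
    the 'empty segment' rule, which is not reimplemented here)."""
    chosen = list(salient_ids)[:salience_quota]
    for i in sorted(segment_start_ids):
        if len(chosen) >= size:
            break
        if i not in chosen:
            chosen.append(i)
    return sorted(chosen)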
      <Paragraph position="4"> It turns out that the differences between a range of realisations of the above two strategies are not statistically significant over our test corpus; we thus use the label "SUM+SEG" to denote a 'composite' strategy and to represent the whole family of variations. In contrast, "SUM" refers to the base summarization component, and "SEG" represents summarization by segmentation alone. Table 1 below shows the recall rates for the three major summarization regimes defined by different summary granularities. Since segmentation effects are clearly very different across different sizes of source document, our experiments additionally sampled the document collection at different sizes of the originals: the corpus was split into four sections, grouping together documents less than 7.5K characters long, 7.5-10K, 10-19K, and over 19K; for brevity, the table encapsulates a 'composite' result (denoted by the label "All documents"). To get a better sense of the effects of different strategy mixes, we also show results for the same summarization regimes on subsets of the test corpus. "All documents with > 1 digest sentence" represents documents whose digests are longer than a single sentence; "All documents whose 1st sentence is not in the target digest" extracts a document set for which a baseline strategy of automatically picking a representative sentence for inclusion in the summary would be inappropriate. These subset selection criteria explain the deterioration of overall results; however, what is more interesting to observe in the table is the relative performance of the three summarization regimes.</Paragraph>
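The size-based breakdown of the corpus can be reproduced with a simple bucketing step (the character thresholds follow the text above; the document representation, a dict with a "text" field, is an assumption):

def bucket_by_size(documents):
    """Group documents into the four size bands used in Table 1 (character counts)."""
    buckets = {"under 7.5K": [], "7.5-10K": [], "10-19K": [], "over 19K": []}
    for doc in documents:
        n = len(doc["text"])  # character count, including any HTML markup
        if n >= 19000:
            buckets["over 19K"].append(doc)
        elif n >= 10000:
            buckets["10-19K"].append(doc)
        elif n >= 7500:
            buckets["7.5-10K"].append(doc)
        else:
            buckets["under 7.5K"].append(doc)
    return buckets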
      <Paragraph position="5"> Overall, leveraging some of the segmentation analysis is positively beneficial to summarization; the effects are particularly strong where short summaries are required.</Paragraph>
      <Paragraph position="6"> In addition, summarization driven by segmentation data alone shows recall rates comparable to, and in certain situations even higher than, the baseline: this suggests that such a procedure is certainly usable in situations where background collection-based salience calculation is impossible, or impractical.</Paragraph>
      <Paragraph position="7"> Finally, we emphasise a note of particular interest here: the complete set of data from these experiments makes it possible, for any given document, to select dynamically the summarization strategy appropriate to its size, in order to get an optimal summary for it, in any given information compaction regime.</Paragraph>
    </Section>
  </Section>
</Paper>