<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1012"> <Title>The effects of analysing cohesion on document summarisation</Title> <Section position="3" start_page="0" end_page="76" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> This paper addresses a particular class of problems inherent to summaries derived by sentence extraction, namely the related issues of coherence degradation, readability deterioration, and topical under-representation. Fundamentally, these problems arise from the unconstrained deletion of arbitrary amounts of source material between two sentences which end up adjacent in the summary; this has unpredictable effects on the amount of potentially essential information which may be lost in that deletion. Examples like 'dangling' anaphors (with lost antecedents) have been cited often enough, and strategies like including the immediately preceding sentence in the summary have some effect. While intuitively plausible, these are still simple strategies, prone to misfiring; moreover, other effects like the reversal of a core premise in an argument, or the introduction, and subsequent elaboration, of a new topic, are not easily handled by similar heuristics.</Paragraph>
<Paragraph position="1"> We seek to leverage a mechanism for assessing the degree of cohesion between individual sentences in the source document, as well as having a notion of how these map onto the underlying themes in the document. Informally, cohesion--and lexical cohesion in particular--is manifest in the ways in which the words, or word patterns, of a sentence connect that sentence to certain of its predecessors and successors. The intuition is that identifying, and preserving, some of these connections in the summary would improve its coherence.</Paragraph>
<Section position="1" start_page="0" end_page="76" type="sub_section"> <SectionTitle> 1.1 Lexical cohesion and summarization </SectionTitle>
<Paragraph position="0"> Documents are coherent because of the continuity of their discourse. A number of rhetorical devices help achieve cohesion between related document fragments.</Paragraph>
<Paragraph position="1"> Analysing such devices--or at the very least being sensitive to their manifestation and interplay--can bring a moderately refined degree of discourse awareness into the summarization process. In the absence of deep text understanding, this boils down to making extensive use of a formalized notion of lexical cohesion.</Paragraph>
<Paragraph position="2"> Linguists have studied extensively how various cohesive devices operate, and interact, in order to account for certain properties of the overall organization of a text discourse. For (Halliday and Hasan, 1976), the organization of text derives from a variety of relationships (cohesive ties) among discourse entities. More recently, (Winter, 1979) has focused on the devices that enforce lexical relationships and connect a discourse fragment with other discourse fragments. The underlying theme here is that cohesion can be best explained in terms of how repetition is manifested across pairs of sentences. Repetition carries informational value--it provides a reference point for interpreting what has changed, and thus, what is at the focus of attention of the discourse--and thus clearly goes well beyond the simple notion that discourse fragments with shared content will also share vocabulary. As (Phillips, 1985) points out, the lexical inventory of a text is tightly organized in terms of collocation; this makes it possible to get a handle on the overall organization of text, in general, and on the identification of topic introduction and topic closure, in particular.</Paragraph>
<Paragraph position="3"> A variety of linguistic devices act as vehicles for repetition: viewed at the level of interplay between words and phrases in the text, these include lexical repetition, textual substitution and the use of a range of lexical relations, co-reference and ellipsis, paraphrasing, conjunction, and so forth. Analysing these would enable the identification of strong cohesive ties pulling together a chain of sentences which focus on (aspects of) the same discourse entity or event; this would require carrying out, for instance, in-depth co-reference and ellipsis resolution, as well as lexical relation determination.</Paragraph>
<Paragraph position="4"> At the other end of the spectrum, just a lexical chaining procedure (like the one described in (Morris and Hirst, 1991)) could be used to determine the degree of cohesion between adjacent pairs of sentences. Indeed, this has been the basis for an operational definition of linear discourse segmentation, where segments in a document are defined to be contiguous blocks of text, roughly 'about the same thing', with segment boundaries indicative of topic shifts.</Paragraph>
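By way of illustration, the sketch below shows one minimal way a repetition-based cohesion measure could drive such a linear segmentation: adjacent sentences are scored by content-word overlap, and a boundary is posited wherever cohesion drops. The overlap measure, the stoplist, and the threshold are all assumptions made for this example, not the procedure used by the system described here.

```python
# A minimal sketch (not this paper's actual procedure): approximate the
# cohesion of adjacent sentence pairs by simple lexical repetition, then
# start a new segment wherever the cohesion score falls below a threshold.

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "that", "this"}

def content_words(sentence):
    """Crude tokenization: lowercase alphabetic words minus a tiny stoplist."""
    return {w for w in sentence.lower().split() if w.isalpha() and w not in STOPWORDS}

def cohesion(s1, s2):
    """Degree of lexical repetition between two sentences (Jaccard overlap)."""
    w1, w2 = content_words(s1), content_words(s2)
    return len(w1 & w2) / len(w1 | w2) if (w1 | w2) else 0.0

def segment(sentences, threshold=0.1):
    """Place a segment boundary at each low-cohesion valley, taken here
    as an indication of a possible topic shift."""
    segments, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cohesion(prev, cur) < threshold:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments
```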
<Paragraph position="5"> The research reported here is just one aspect of a larger study into the recognition and use of cohesive devices for content characterisation tasks. It presupposes fine-grained methods for the identification of cohesive ties between (sentence) units in a text; describing the computational basis for developing such methods is outside of the scope of this paper (however, see (Kennedy and Boguraev, 1996), (Fellbaum, 1999), (Keller, 1994)), as is the complete framework for lexical cohesion analysis we have developed. Instead, in focusing on the effects of lexical cohesion on summarization, we limit ourselves here to the phenomenon of simple lexical repetition; it turns out that even this can be beneficially applied to enhancing summarization quality.</Paragraph>
<Paragraph position="6"> Recent work (Barzilay and Elhadad, 1999) makes this intuition explicit. &quot;Lexical chains&quot; are constructed by grouping together items related by repetition and certain lexical relations derived via the WORDNET lexical database (Fellbaum, 1999). A sequence of items in a chain highlights a discussion focused on a topic related to (an) item(s) in the chain; a metric for scoring chains picks topically prominent ones; these are then taken as the basis of sentence extraction heuristics. A positive result of that work is that in an intrinsic evaluation against human-constructed summaries, the system outperformed at least one commercial summarizer. This highlights the potential of a purely lexical chains-based approach; still, Barzilay and Elhadad remain frustrated by the high degree of polysemy in WORDNET (not to mention its limited coverage with respect to more specialized domains); fortunately, this does not concern us here.</Paragraph>
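The sketch below caricatures this chaining idea using repetition alone. Barzilay and Elhadad's chains additionally follow WORDNET relations, and their scoring metric differs; the degenerate chain representation and the prominence score used here are simplifying assumptions made for illustration.

```python
# A simplified, repetition-only caricature of lexical chaining. The
# scoring metric (chain length weighted by its span over the text) is
# an illustrative assumption, not Barzilay and Elhadad's published formula.

from collections import defaultdict

def build_chains(sentences):
    """Map each repeated word to the sentence indices in which it occurs;
    each such index list serves as a degenerate 'lexical chain'."""
    occurrences = defaultdict(list)
    for i, sentence in enumerate(sentences):
        for word in set(sentence.lower().split()):
            occurrences[word].append(i)
    # Keep only words that actually repeat across sentences.
    return {w: idxs for w, idxs in occurrences.items() if len(idxs) > 1}

def score(chain):
    """Toy prominence score: longer, more spread-out chains rank higher."""
    return len(chain) * (max(chain) - min(chain) + 1)

def extract(sentences, n=3):
    """Pick the first sentence touched by each of the top-scoring chains,
    in the spirit of the extraction heuristics sketched above."""
    chains = sorted(build_chains(sentences).values(), key=score, reverse=True)
    picked = sorted({chain[0] for chain in chains[:n]})
    return [sentences[i] for i in picked]
```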
</Section> <Section position="2" start_page="76" end_page="76" type="sub_section"> <SectionTitle> 1.2 Discourse segmentation and summarization </SectionTitle>
<Paragraph position="0"> Unlike Barzilay and Elhadad, we start with a sentence-based summarizer, and are specifically seeking to improve upon what is already (by some measure; see Section 4.1 below) a good performance, judged in a discipline-wide evaluation initiative (Mani et al., 1999).</Paragraph>
<Paragraph position="1"> This places certain constraints on how lexical cohesion analysis results, and in particular the identification of topically coherent segments, can be incorporated in the existing strategies and mechanisms for sentence selection, already deployed by the summarizer. Making certain that a summary incorporates sentences from each segment intuitively seeks to ensure uniform representation of all sub-stories in a document; the notion here is to avoid having inordinately large gaps between adjacent summary sentences, which would tend to lose essential information. Moreover, a mechanism which would pick the sentence(s) in a segment most representative of its main topic would also carry over into the summary 'traces' of all the main topics in the original document.</Paragraph>
<Paragraph position="2"> This is more than just an intuition. In the process of developing, and training, our base summarizer (see Section 2.2 below), an analysis was carried out to determine the causes of a certain class of failure. It turns out that 30.7% of the failures could be prevented by a heuristic sensitive to the logical structure of documents, which would enforce that each (topical) section gets represented in the summary. An additional 15.2% of failures could also be avoided if the summarizer were capable of detecting sub-stories within a single section, leading/trailing noise (see below), and so forth. Thus almost half of the errors (in a certain summarization regime, at least) could have been avoided by using a segmentation component.</Paragraph>
<Paragraph position="3"> This exemplifies how a document-wide analysis of a single lexical cohesion factor (simple repetition) can improve upon an existing sentence selection strategy--even if such a strategy has been devised without prior knowledge of additional enhancements to come. The specific approaches to being sensitive to foci of attention within a segment, and topic shifts between segments, may vary; as we discuss below (see Section 3.1), these will depend on other environment settings for the summarizer.</Paragraph>
<Paragraph position="4"> Still, in the right operational environment even very simple heuristics--take the first sentence from each segment, for instance--have a remarkably noticeable impact.</Paragraph>
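The sketch below illustrates how such segment-aware heuristics might sit on top of an existing sentence scorer. Here base_score stands in for whatever the underlying summarizer computes; it is an assumption of the example, not the actual scoring function of the system described in this paper.

```python
# A minimal sketch of the segment-aware heuristics discussed above,
# layered over an existing sentence scorer. 'base_score' is a placeholder
# supplied by the caller, not this system's actual scoring function.

def summarize(segments, base_score, budget):
    """Take the first sentence of every segment (so each sub-story leaves
    a trace in the summary), then spend any remaining budget on the
    highest-scoring of the leftover sentences."""
    summary = [seg[0] for seg in segments if seg]
    leftovers = [s for seg in segments for s in seg[1:]]
    leftovers.sort(key=base_score, reverse=True)
    summary.extend(leftovers[: max(0, budget - len(summary))])
    return summary  # restoring document order is omitted for brevity

# A variant closer to 'most representative of the segment's main topic'
# would replace seg[0] with max(seg, key=base_score).
```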
<Paragraph position="5"> We thus argue that a lexical repetition-based model of linear segmentation offers effective schemes for deriving sentence-based summaries with certain discourse properties, enhancing their quality.</Paragraph>
<Paragraph position="6"> What follows is organized in three main sections. We outline some linguistic functions of the summarizer, and give details of the summarization and segmentation components. We focus specifically on how higher level content analysis uses lower level shallow linguistic processing, both to obtain a richer model of the document domain, and to leverage cohesion analysis for sub-story identification. Next we discuss some strategies for optimal use of discourse segments and topic shifts for summarization. We sketch our evaluation testbed environment, and present experimental results comparing the performance of summarization alone to segmentation-enhanced summarization. We conclude with an assessment of the overall utility of 'cheap' approximations to lexical cohesion measures, specifically from the point of view of enhancing a fully operational summarizer.</Paragraph> </Section> </Section> </Paper>