<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1044"> <Title>Veins Theory: A Model of Global Discourse Cohesion and Coherence</Title> <Section position="4" start_page="282" end_page="282" type="metho"> <SectionTitle> 3 Global coherence </SectionTitle> <Paragraph position="0"> This section shows how VT can predict the inference load for processing global discourse, thus providing an account of discourse coherence.</Paragraph> <Paragraph position="1"> A corollary of Conjecture C1 is that CT can be applied along the accessibility domains defined by the veins of the discourse structure, rather than to sequentially placed units within a single discourse segment. Therefore, in VT, reference domains for any node may include units that are sequentially distant in the text stream, and thus long-distance references (including those requiring &quot;return-pops&quot; (Fox, 1987) over segments that contain syntactically feasible referents) can be accounted for. Thus our model provides a description of global discourse cohesion, which significantly extends the model of local cohesion provided by CT.</Paragraph> <Paragraph position="2"> CT defines a set of transition types for discourse (Grosz, Joshi, and Weinstein, 1995; Brennan, Friedman, and Pollard, 1987). A smoothness score for a discourse segment can be computed by attaching an elementary score to each transition between sequential units according to Table 2, summing the scores for all transitions in the segment, and dividing the result by the number of transitions in the segment. This provides an index of the overall coherence of the segment.</Paragraph> <Paragraph position="3"> A global CT smoothness score can be computed by adding up the scores for the sequence of units making up the whole discourse, and dividing the result by the total number of transitions (number of units minus one). 
In general, this score will be slightly higher than the average of the scores for the individual segments, since accidental transitions at segment boundaries might also occur. Analogously, a global VT smoothness score can be computed along the accessibility domains rather than over the sequential order of units.</Paragraph> <Paragraph position="5"> Conjecture C2: The global smoothness score of a discourse when computed following VT is at least as high as the score computed following CT.</Paragraph> <Paragraph position="6"> That is, we claim that long-distance transitions computed using VT are systematically smoother than accidental transitions at segment boundaries. Note that this conjecture is consistent with results reported by authors such as Passonneau (1995) and Walker (1996), and provides an explanation for their results.</Paragraph> <Paragraph position="7"> We can also consider anaphora resolution using Cb's computed using accessibility domains.</Paragraph> <Paragraph position="8"> Because a unit can simultaneously occur in several accessibility domains, unification can be applied using the Cf list of one unit and those of possibly several subsequent (although not necessarily adjacent) units. A graph of Cb-unifications can be derived, in which each edge of the graph represents a Cb computation and therefore a unification process.</Paragraph> </Section> <Section position="5" start_page="282" end_page="283" type="metho"> <SectionTitle> 4 Minimal text </SectionTitle> <Paragraph position="0"> The notion that text summaries can be created by extracting the nuclei from RST trees is well known in the literature (Mann and Thompson, 1988). Most recently, Marcu (1997) has described a method for text summarization based on nuclearity and selective retention of hierarchical fragments. Because his salient units correspond to heads in VT, his results are predicted in our model. 
That is, the union of heads at a given level in the tree provides a summary of the text at a degree of detail dependent on the depth of that level.</Paragraph> <Paragraph position="1"> In addition to summarizing entire texts, VT can be used to summarize a given unit or sub-tree of that text. In effect, we reverse the problem addressed by text summarization efforts so far: instead of attempting to summarize an entire discourse at a given level of detail, we select a single span of text and abstract the minimal text required to understand this span alone when considered in the context of the entire discourse. This provides a kind of focused abstraction, enabling the extraction of sub-texts from larger documents. Because vein expressions for each node include all of the nodes in the discourse within its domain of reference, they identify exactly which parts of the discourse tree are required in order to understand and resolve references for the unit or sub-tree below that node.</Paragraph> <Paragraph position="2"> [Table columns: Source; No. of units; Total no.]</Paragraph> </Section> <Section position="6" start_page="283" end_page="283" type="metho"> <SectionTitle> 5 Corpus analysis </SectionTitle> <Paragraph position="0"> Because of the lack of large-scale corpora annotated for discourse, our study currently involves only a small corpus of English, Romanian, and French texts. The corpus was prepared using an encoding scheme for discourse structure (Cristea, Ide, and Romary, 1998) based on the Corpus Encoding Standard (CES) (Ide, 1998). The following texts were included in our analysis: * three short English texts, RST-analyzed by experts and subsequently annotated for reference and Cf lists by the authors; * a fragment from de Balzac's «Le Père Goriot» (French), previously annotated for co-reference (Bruneseaux and Romary, 1997); RST and Cf list annotation made by the authors; * a fragment from Alexandru Mitru's «Legendele Olimpului» 4 
(Romanian); structure, reference, and Cf lists annotated by one of the authors.</Paragraph> <Paragraph position="1"> The encoding marks referring expressions, links between referring expressions (co-reference or functional), units, relations between units (if known), nuclearity, and the units' Cf lists in terms of referring expressions. We have developed a program 5 that does the following: builds the tree structure of units and relations between them, adds to each referring expression the index of the unit it occurs in, computes the heads and veins for all nodes in the structure, determines the accessibility domains of the terminal nodes (units), and counts the number of direct and indirect references.</Paragraph> <Paragraph position="2"> Hand-analysis was then applied to determine which references are inferential and therefore do not conform to Conjecture C1, as summarized in Table 5. Among the 318 references in the texts, only three references not conforming to Conjecture C1 were found (all of them appear in one of the English texts). However, if the BACKGROUND relation is treated as bi-nuclear, 6 all three of these references become direct.</Paragraph> <Paragraph position="3"> To verify Conjecture C2, Cb's and transitions were first marked following the sequential order of the units (according to classical CT), and a smoothness score was computed. Then, following VT, accessibility domains were used to determine maximal chains of accessibility strings, Cb's and transitions were re-computed following these strings, and a VT smoothness score was similarly computed. The results are summarized in Table 6. They show that the score for VT is better than that for CT in all cases, thus validating Conjecture C2.</Paragraph> <Paragraph position="4"> An investigation of the number of long-distance resolutions yielded the results shown in Table 7. 
Such resolutions could not have been predicted</Paragraph> </Section> <Section position="7" start_page="283" end_page="284" type="metho"> <SectionTitle> 4 &quot;The Legends of Olympus&quot; 5 Written in Java. 6 Other bi-nuclear relations are JOIN and SEQUENCE. </SectionTitle> <Paragraph position="0"/> </Section> </Paper>