File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/97/p97-1013_intro.xml
Size: 8,941 bytes
Last Modified: 2025-10-06 14:06:14
<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1013"> <Title>The Rhetorical Parsing of Natural Language Texts</Title> <Section position="3" start_page="0" end_page="97" type="intro"> <SectionTitle> 2 Foundation </SectionTitle> <Paragraph position="0"> The mathematical foundations of the rhetorical parsing algorithm rely on a first-order formalization of valid text structures (Marcu, 1997). The assumptions of the formalization are the following. 1. The elementary units of complex text structures are non-overlapping spans of text. 2. Rhetorical, coherence, and cohesive relations hold between textual units of various sizes. 3. Relations can be partitioned into two classes: paratactic and hypotactic. Paratactic relations are those that hold between spans of equal importance. Hypotactic relations are those that hold between a span that is essential for the writer's purpose, i.e., a nucleus, and a span that increases the understanding of the nucleus but is not essential for the writer's purpose, i.e., a satellite. 4. The abstract structure of most texts is a binary, tree-like structure.</Paragraph> <Paragraph position="1"> 5. If a relation holds between two textual spans of the tree structure of a text, that relation also holds between the most important units of the constituent subspans. 
The most important units of a textual span are determined recursively: they correspond to the most important units of the immediate subspans when the relation that holds between these subspans is paratactic, and to the most important units of the nucleus subspan when the relation that holds between the immediate subspans is hypotactic.</Paragraph> <Paragraph position="2"> In our previous work (Marcu, 1996), we presented a complete axiomatization of these principles in the context of Rhetorical Structure Theory (Mann and Thompson, 1988) and we described an algorithm that, starting from the set of textual units that make up a text and the set of elementary rhetorical relations that hold between these units, can derive all the valid discourse trees of that text. Consequently, if one is to build discourse trees for unrestricted texts, the problems that remain to be solved are the automatic determination of the textual units and the rhetorical relations that hold between them. In this paper, we show how one can find and exploit approximate solutions for both of these problems by capitalizing on the occurrences of certain lexicogrammatical constructs. Such constructs can include tense and aspect (Moens and Steedman, 1988; Webber, 1988; Lascarides and Asher, 1993), certain patterns of pronominalization and anaphoric usages (Sidner, 1981; Grosz and Sidner, 1986; Sumita et al., 1992; Grosz, Joshi, and Weinstein, 1995), it-clefts (Delin and Oberlander, 1992), and discourse markers or cue phrases (Ballard, Conrad, and Longacre, 1971; Halliday and Hasan, 1976; Van Dijk, 1979; Longacre, 1983; Grosz and Sidner, 1986; Schiffrin, 1987; Cohen, 1987; Redeker, 1990; Sanders, Spooren, and Noordman, 1992; Hirschberg and Litman, 1993; Knott, 1995; Fraser, 1996; Moser and Moore, 1997). 
In the work described here, we investigate how far we can get by focusing our attention only on discourse markers and lexicogrammatical constructs that can be detected by a shallow analysis of natural language texts.</Paragraph> <Paragraph position="3"> The intuition behind our choice relies on the following facts: * Psycholinguistic and other empirical research (Kintsch, 1977; Schiffrin, 1987; Segal, Duchan, and Scott, 1991; Cahn, 1992; Sanders, Spooren, and Noordman, 1992; Hirschberg and Litman, 1993; Knott, 1995; Costermans and Fayol, 1997) has shown that discourse markers are consistently used by human subjects both as cohesive ties between adjacent clauses and as &quot;macroconnectors&quot; between larger textual units.</Paragraph> <Paragraph position="4"> Therefore, we can use them as rhetorical indicators at any of the following levels: clause, sentence, paragraph, and text.</Paragraph> <Paragraph position="5"> * The number of discourse markers in a typical text -- approximately one marker for every two clauses (Redeker, 1990) -- is sufficiently large to enable the derivation of rich rhetorical structures for texts.</Paragraph> <Paragraph position="6"> * Discourse markers are used in a manner that is consistent with the semantics and pragmatics of the discourse segments that they relate. In other words, we assume that the texts that we process are well-formed from a discourse perspective, much as researchers in sentence parsing assume that they are well-formed from a syntactic perspective. 
As a consequence, we assume that one can bootstrap the full syntactic, semantic, and pragmatic analysis of the clauses that make up a text and still end up with a reliable discourse structure for that text.</Paragraph> <Paragraph position="7"> Given the above discussion, the immediate objection that one can raise is that discourse markers are doubly ambiguous: in some cases, their use is only sentential, i.e., they make a semantic contribution to the interpretation of a clause; and even in the cases where markers have a discourse usage, they are ambiguous with respect to the rhetorical relations that they mark and the sizes of the textual spans that they connect. We now address each of these objections in turn.</Paragraph> <Paragraph position="8"> Sentential and discourse usages of cue phrases.</Paragraph> <Paragraph position="9"> Empirical studies on the disambiguation of cue phrases (Hirschberg and Litman, 1993) have shown that just by considering the orthographic environment in which a discourse marker occurs, one can distinguish between sentential and discourse usages in about 80% of cases. We have taken Hirschberg and Litman's research one step further and designed a comprehensive corpus analysis that enabled us to improve their results and coverage. The method, procedure, and results of our corpus analysis are discussed in section 3.</Paragraph> <Paragraph position="10"> Discourse markers are ambiguous with respect to the rhetorical relations that they mark and the sizes of the units that they connect. When we began this research, no empirical data supported the extent to which this ambiguity characterizes natural language texts. To better understand this problem, the corpus analysis described in section 3 was designed so as to also provide information about the types of rhetorical relations, rhetorical statuses (nucleus or satellite), and sizes of textual spans that each marker can indicate. 
We knew from the beginning that it would be impossible to predict exactly the types of relations and the sizes of the spans that a given cue marks. However, given that the structure that we are trying to build is highly constrained, such a prediction proved to be unnecessary: the overall constraints on the structure of discourse that we enumerated at the beginning of this section cancel out most of the configurations of elementary constraints that do not yield correct discourse trees.</Paragraph> <Paragraph position="11"> Consider, for example, the following text: (1) [Although discourse markers are ambiguous, 1] [one can use them to build discourse trees for unrestricted texts: 2] [this will lead to many new applications in natural language processing. 3] For the sake of argument, assume that we are able to break text (1) into textual units as labelled above and that we are now interested in finding rhetorical relations between these units. Assume now that we can infer that Although marks a CONCESSIVE relation between satellite 1 and nucleus either 2 or 3, and the colon, an ELABORATION between satellite 3 and nucleus either 1 or 2. If we use the convention that hypotactic relations are represented as first-order predicates having the form rhet_rel(NAME, satellite, nucleus) and that paratactic relations are represented as predicates having the form rhet_rel(NAME, nucleus1, nucleus2), a correct representation for text (1) is then the set of two disjunctions given in (2): (2) rhet_rel(CONCESSION, 1, 2) ∨ rhet_rel(CONCESSION, 1, 3); rhet_rel(ELABORATION, 3, 1) ∨ rhet_rel(ELABORATION, 3, 2). Despite the ambiguity of the relations, the overall rhetorical structure constraints will associate only one discourse tree with text (1), namely the tree given in figure 1: any discourse tree configuration that uses relations rhet_rel(CONCESSION, 1, 3) and rhet_rel(ELABORATION, 3, 1) will be ruled out. 
For example, relation rhet_rel(ELABORATION, 3, 1) will be ruled out because unit 1 is not an important unit for span [1,2] and, as mentioned at the beginning of this section, a rhetorical relation that holds between two spans of a valid text structure must also hold between their most important units: the important unit of span [1,2] is unit 2, i.e., the nucleus of the relation rhet_rel(CONCESSION, 1, 2).</Paragraph> </Section> </Paper>
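The ruling-out step in the example above can be sketched in Python. This is a minimal illustration under stated assumptions, not the axiomatization of (Marcu, 1996): the names Span, leaf, hypotactic, paratactic, important_units, and can_join are hypothetical, and the sketch checks only the constraint that a relation between two spans must hold between their most important units. It does not enumerate discourse trees or enforce the other structural constraints.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Span:
    """A discourse-tree node (hypothetical helper type, not from the paper)."""
    units: FrozenSet[int]            # elementary units covered by this span
    nuclei: Tuple["Span", ...] = ()  # empty for a leaf; one child for a
                                     # hypotactic span; both for a paratactic one


def leaf(unit: int) -> Span:
    return Span(frozenset([unit]))


def hypotactic(satellite: Span, nucleus: Span) -> Span:
    # Only the nucleus subspan passes its important units upward.
    return Span(satellite.units | nucleus.units, (nucleus,))


def paratactic(first: Span, second: Span) -> Span:
    # Both subspans are equally important.
    return Span(first.units | second.units, (first, second))


def important_units(span: Span) -> FrozenSet[int]:
    """Recursively collect the most important units of a span."""
    if not span.nuclei:
        return span.units
    collected: FrozenSet[int] = frozenset()
    for child in span.nuclei:
        collected = collected | important_units(child)
    return collected


def can_join(sat_unit: int, nuc_unit: int, sat_span: Span, nuc_span: Span) -> bool:
    """Check rhet_rel(NAME, sat_unit, nuc_unit) against two candidate spans."""
    return (sat_unit in important_units(sat_span)
            and nuc_unit in important_units(nuc_span))


# Span [1,2] built from rhet_rel(CONCESSION, 1, 2): unit 1 is the satellite.
span_12 = hypotactic(satellite=leaf(1), nucleus=leaf(2))

print(sorted(important_units(span_12)))   # [2]
print(can_join(3, 1, leaf(3), span_12))   # False: unit 1 is not important in [1,2]
print(can_join(3, 2, leaf(3), span_12))   # True: unit 2 is the nucleus of [1,2]
```

Storing only the nucleus children on each node keeps the recursion uniform: a paratactic span lists both subspans as nuclei, so the same loop covers both relation classes.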