<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1605">
  <Title>Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory</Title>
  <Section position="7" start_page="19" end_page="19" type="concl">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> A growing number of groups have developed or are developing discourse-annotated corpora for text. These can be characterized both by the kinds of features annotated and by the scope of the annotation. Features may include specific discourse cues or markers, coreference links, identification of rhetorical relations, etc.</Paragraph>
    <Paragraph position="1"> The scope of the annotation refers to the levels of analysis within the document, and can be characterized as follows: * sentential: annotation of features at the intra-sentential or inter-sentential level, at a single level of depth (Sundheim, 1995; Tsou et al., 2000; Nomoto and Matsumoto, 1999; Rebeyrolle, 2000).</Paragraph>
    <Paragraph position="2"> * hierarchical: annotation of features at multiple levels, building upon lower levels of analysis at the clause or sentence level (Moser and Moore, 1995; Marcu et al., 1999).</Paragraph>
    <Paragraph position="3"> * document-level: broad characterization of document structure such as identification of topical segments (Hearst, 1997), linking of large text segments via specific relations (Ferrari, 1998; Rebeyrolle, 2000), or defining text objects with a text architecture (Pery-Woodley and Rebeyrolle, 1998).</Paragraph>
    <Paragraph position="4"> Developing corpora with these kinds of rich annotation is a labor-intensive effort. Building the RST Corpus involved more than a dozen people on a full- or part-time basis over a one-year time frame (Jan. - Dec. 2000). Annotation of a single document could take anywhere from 30 minutes to several hours, depending on the length and topic. Re-tagging a large number of documents after major enhancements to the annotation guidelines was also time-consuming.</Paragraph>
    <Paragraph position="5"> In addition, limitations of the theoretical approach became more apparent over time.</Paragraph>
    <Paragraph position="6"> Because RST does not differentiate between different levels of the tree structure, a fairly fine-grained set of relations operates between EDUs and EDU clusters at the macro-level. The procedural knowledge available at the EDU level is likely to need further refinement for higher-level text spans, along the lines of other work that posits a few macro-level relations for text segments, such as Ferrari (1998) or Meyer (1985). Moreover, under the RST approach, the resultant tree structure, like a traditional outline, imposed constraints that other discourse representations (e.g., graphs) would not. In combination with the tree structure, the concept of nuclearity also guided an annotator to capture one of a number of possible stylistic interpretations. We ourselves are eager to explore these aspects of RST, and expect new insights to appear through analysis of the corpus.</Paragraph>
    <Paragraph position="7"> We anticipate that the RST Corpus will be multifunctional and support a wide range of language engineering applications. The added value of multiple layers of overt linguistic phenomena enhancing the Penn Treebank information can be exploited to advance the study of discourse, to enhance language technologies such as text summarization, machine translation or information retrieval, or to be a testbed for new and creative natural language processing techniques.</Paragraph>
  </Section>
</Paper>