<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1605">
  <Title>Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory</Title>
  <Section position="4" start_page="0" end_page="19" type="metho">
    <SectionTitle>
3 Discourse Annotation Task
</SectionTitle>
    <Paragraph position="0"> Our methodology for annotating the RST Corpus builds on prior corpus work in the Rhetorical Structure Theory framework by Marcu et al. (1999). Because the goal of this effort was to build a high-quality, consistently annotated reference corpus, the task required that we employ people as annotators whose primary professional experience was in the area of language analysis and reporting, provide extensive annotator training, and specify a rigorous set of annotation guidelines.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Annotator Profile and Training
</SectionTitle>
      <Paragraph position="0"> The annotators hired to build the corpus were all professional language analysts with prior experience in other types of data annotation.</Paragraph>
      <Paragraph position="1"> They underwent extensive hands-on training, which took place roughly in three phases.</Paragraph>
      <Paragraph position="2"> During the orientation phase, the annotators were introduced to the principles of Rhetorical Structure Theory and the discourse-tagging tool used for the project (Marcu et al., 1999). The tool enables an annotator to segment a text into units, and then build up a hierarchical structure of the discourse. In this stage of the training, the focus was on segmenting hard copy texts into EDUs, and learning the mechanics of the tool.</Paragraph>
      <Paragraph position="3"> In the second phase, annotators began to explore interpretations of discourse structure, by independently tagging a short document, based on an initial set of tagging guidelines, and then meeting as a group to compare results. The initial focus was on resolving segmentation differences, but over time this shifted to addressing issues of relations and nuclearity.</Paragraph>
      <Paragraph position="4"> These exploratory sessions led to enhancements in the tagging guidelines. To reinforce new rules, annotators re-tagged the document.</Paragraph>
      <Paragraph position="5"> During this process, we regularly tracked inter-annotator agreement (see Section 4.2). In the final phase, the annotation team concentrated on ways to reduce differences by adopting some heuristics for handling higher levels of the discourse structure. Wiebe et al. (1999) present a method for automatically formulating a single best tag when multiple judges disagree on selecting between binary features. Because our annotators had to select among multiple choices at each stage of the discourse annotation process, and because decisions made at one stage influenced the decisions made during subsequent stages, we could not apply Wiebe et al.'s method. Our methodology for determining the &amp;quot;best&amp;quot; guidelines was much more of a consensus-building process, taking into consideration multiple factors at each step. The final tagging manual, over 80 pages in length, contains extensive examples from the corpus to illustrate text segmentation, nuclearity, selection of relations, and discourse cues. The manual can be downloaded from the following web site: http://www.isi.edu/~marcu/discourse.</Paragraph>
      <Paragraph position="6"> The actual tagging of the corpus progressed in three developmental phases. During the initial phase of about four months, the team created a preliminary corpus of 100 tagged documents.</Paragraph>
      <Paragraph position="7"> This was followed by a one-month reassessment phase, during which we measured consistency across the group on a select set of documents, and refined the annotation rules. At this point, we decided to proceed by pre-segmenting all of the texts on hard copy, to ensure a higher overall quality to the final corpus. Each text was pre-segmented by two annotators; discrepancies were resolved by the author of the tagging guidelines. In the final phase (about six months) all 100 documents were re-tagged with the new approach and guidelines. The remainder of the corpus was tagged in this manner.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="19" type="sub_section">
      <SectionTitle>
3.2 Tagging Strategies
</SectionTitle>
      <Paragraph position="0"> Annotators developed different strategies for analyzing a document and building up the corresponding discourse tree. There were two basic orientations for document analysis - hard copy or graphical visualization with the tool.</Paragraph>
      <Paragraph position="1"> Hard copy analysis ranged from jotting of notes in the margins to marking up the document into discourse segments. Those who preferred a graphical orientation performed their analysis simultaneously with building the discourse structure, and were more likely to build the discourse tree in chunks, rather than incrementally.</Paragraph>
      <Paragraph position="2"> We observed a variety of annotation styles for the actual building of a discourse tree. Two of the more representative styles are illustrated below.</Paragraph>
      <Paragraph position="3"> 1. The annotator segments the text one unit at a time, then incrementally builds up the  discourse tree by immediately attaching the current node to a previous node. When building the tree in this fashion, the annotator must anticipate the upcoming discourse structure, possibly for a large span. Yet, often an appropriate choice of relation for an unseen segment may not be obvious from the current (rightmost) unit that needs to be attached. That is why annotators typically used this approach on short documents, but resorted to other strategies for longer documents.</Paragraph>
      <Paragraph position="4"> 2. The annotator segments multiple units at a time, then builds discourse sub-trees for each sentence. Adjacent sentences are then linked, and larger sub-trees begin to emerge. The final tree is produced by linking major chunks of the discourse structure. This strategy allows the annotator to see the emerging discourse structure more globally; thus, it was the preferred approach for longer documents.</Paragraph>
      <Paragraph position="5"> Consider the text fragment below, consisting of four sentences, and 11 EDUs: [Still, analysts don't expect the buy-back to significantly affect per-share earnings in the short term.]  [The impact won't be that great,]  [of having to average the number of shares  The discourse sub-tree for this text fragment is given in Figure 1. Using Style 1 the annotator, upon segmenting unit [17], must anticipate the upcoming example relation, which spans units [17-26]. However, even if the annotator selects an incorrect relation at that point, the tool allows great flexibility in changing the structure of the tree later on.</Paragraph>
      <Paragraph position="6"> Using Style 2, the annotator segments each sentence, and builds up corresponding sub-trees for spans [16], [17-18], [19-21] and [22-26]. The second and third sub-trees are then linked via an explanation-argumentative relation, after which, the fourth sub-tree is linked via an elaborationadditional relation. The resulting span [17-26] is finally attached to node [16] as an example satellite.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="19" end_page="19" type="metho">
    <SectionTitle>
4 Quality Assurance
</SectionTitle>
    <Paragraph position="0"> A number of steps were taken to ensure the quality of the final discourse corpus. These involved two types of tasks: checking the validity of the trees and tracking inter-annotator consistency.</Paragraph>
    <Section position="1" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
4.1 Tree Validation Procedures
</SectionTitle>
      <Paragraph position="0"> Annotators reviewed each tree for syntactic and semantic validity. Syntactic checking involved ensuring that the tree had a single root node and comparing the tree to the document to check for missing sentences or fragments from the end of the text. Semantic checking involved reviewing nuclearity assignments, as well as choice of relation and level of attachment in the tree. All trees were checked with a discourse parser and tree traversal program which often identified errors undetected by the manual validation process. In the end, all of the trees worked successfully with these programs.</Paragraph>
    </Section>
    <Section position="2" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
4.2 Measuring Consistency
</SectionTitle>
      <Paragraph position="0"> We tracked inter-annotator agreement during each phase of the project, using a method developed by Marcu et al. (1999) for computing kappa statistics over hierarchical structures. The kappa coefficient (Siegel and Castellan, 1988) has been used extensively in previous empirical studies of discourse (Carletta et al., 1997; Flammia and Zue, 1995; Passonneau and Litman, 1997). It measures pairwise agreement among a set of coders who make category judgments, correcting for chance expected agreement. The method described in Marcu et al. (1999) maps hierarchical structures into sets of units that are labeled with categorial judgments. The strengths and shortcomings of the approach are also discussed in detail there.</Paragraph>
      <Paragraph position="1"> Researchers in content analysis (Krippendorff, 1980) suggest that values of kappa &gt; 0.8 reflect very high agreement, while values between 0.6 and 0.8 reflect good agreement.</Paragraph>
      <Paragraph position="2"> Table 1 shows average kappa statistics reflecting the agreement of three annotators at various stages of the tasks on selected documents. Different sets of documents were chosen for each stage, with no overlap in documents. The statistics measure annotation reliability at four levels: elementary discourse units, hierarchical spans, hierarchical nuclearity and hierarchical relation assignments.</Paragraph>
      <Paragraph position="3"> At the unit level, the initial (April 00) scores and final (January 01) scores represent agreement on blind segmentation, and are shown in boldface. The interim June and November scores represent agreement on hard copy pre-segmented texts. Notice that even with pre-segmenting, the agreement on units is not 100% perfect, because of human errors that occur in segmenting with the tool. As Table 1 shows, all levels demonstrate a marked improvement from April to November (when the final corpus was completed), ranging from about 0.77 to 0.92 at the span level, from 0.70 to 0.88 at the nuclearity level, and from 0.60 to 0.79 at the relation level. In particular, when relations are combined into the 16 rhetoricallyrelated classes discussed in Section 2.2, the November results of the annotation process are extremely good. The Fewer-Relations column shows the improvement in scores on assigning  relations when they are grouped in this manner, with November results ranging from 0.78 to 0.82. In order to see how much of the improvement had to do with pre-segmenting, we asked the same three annotators to annotate five previously unseen documents in January, without reference to a pre-segmented document.</Paragraph>
      <Paragraph position="4"> The results of this experiment are given in the last row of Table 1, and they reflect only a small overall decline in performance from the November results. These scores reflect very strong agreement and represent a significant improvement over previously reported results on annotating multiple texts in the RST framework (Marcu et al., 1999).</Paragraph>
      <Paragraph position="5"> Table 2 reports final results for all pairs of taggers who double-annotated four or more documents, representing 30 out of the 53 documents that were double-tagged. Results are based on pre-segmented documents.</Paragraph>
      <Paragraph position="6"> Our team was able to reach a significant level of consistency, even though they faced a number of challenges which reflect differences in the agreement scores at the various levels.</Paragraph>
      <Paragraph position="7"> While operating under the constraints typical of any theoretical approach in an applied environment, the annotators faced a task in which the complexity increased as support from the guidelines tended to decrease. Thus, while rules for segmenting were fairly precise, annotators relied on heuristics requiring more human judgment to assign relations and nuclearity. Another factor is that the cognitive challenge of the task increases as the tree takes shape. It is relatively straightforward for the annotator to make a decision on assignment of nuclearity and relation at the inter-clausal level, but this becomes more complex at the inter-sentential level, and extremely difficult when linking large segments.</Paragraph>
      <Paragraph position="8"> This tension between task complexity and guideline under-specification resulted from the practical application of a theoretical model on a broad scale. While other discourse theoretical approaches posit distinctly different treatments for various levels of the discourse (Van Dijk and Kintsch, 1983; Meyer, 1985), RST relies on a standard methodology to analyze the document at all levels. The RST relation set is rich and the concept of nuclearity, somewhat interpretive.</Paragraph>
      <Paragraph position="9"> This gave our annotators more leeway in interpreting the higher levels of the discourse structure, thus introducing some stylistic differences, which may prove an interesting avenue of future research.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="19" end_page="19" type="metho">
    <SectionTitle>
5 Corpus Details
</SectionTitle>
    <Paragraph position="0"> The RST Corpus consists of 385 Wall Street Journal articles from the Penn Treebank, representing over 176,000 words of text. In order to measure inter-annotator consistency, 53 of the documents (13.8%) were double-tagged.</Paragraph>
    <Paragraph position="1"> The documents range in size from 31 to 2124 words, with an average of 458.14 words per document. The final tagged corpus contains 21,789 EDUs with an average of 56.59 EDUs per document. The average number of words per EDU is 8.1.</Paragraph>
    <Paragraph position="2"> The articles range over a variety of topics, including financial reports, general interest stories, business-related news, cultural reviews, editorials, and letters to the editor. In selecting these documents, we partnered with the Linguistic Data Consortium to select Penn Treebank texts for which the syntactic bracketing was known to be of high caliber.</Paragraph>
    <Paragraph position="3"> Thus, the RST Corpus provides an additional level of linguistic annotation to supplement existing annotated resources.</Paragraph>
    <Paragraph position="4">  For details on obtaining the corpus, annotation software, tagging guidelines, and related documentation and resources, see: http://www.isi.edu/~marcu/discourse.</Paragraph>
  </Section>
class="xml-element"></Paper>