File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/w99-0307_metho.xml
Size: 30,393 bytes
Last Modified: 2025-10-06 14:15:24
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0307"> <Title>Experiments in Constructing a Corpus of Discourse Trees</Title> <Section position="3" start_page="0" end_page="50" type="metho"> <SectionTitle> 2 The experiment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="49" type="sub_section"> <SectionTitle> 2.1 Tools </SectionTitle> <Paragraph position="0"> We used O'Donnell's (1997) discourse annotation tool as our starting point and improved it significantly. The original tool constrains human judges to construct rhetorical structures in a bottom-up fashion: as a first step, judges determine the elementary discourse units (edus) of a text; subsequently, they recursively assemble the units into discourse trees. As texts get larger, this annotation process becomes impractical.</Paragraph> <Paragraph position="1"> We modified O'Donnell's tool in order to enable annotators to construct discourse structures in an incremental fashion as well. At any time t during the annotation process, annotators have access to two panels (see figure 1 for an example): * The upper panel displays, in the style of Mann and Thompson (1988), the discourse structure built up to time t. [Figure 1 example text: Mars<1> With its distant orbit<p> - 50 percent farther from the sun than Earth </p>and slim atmospheric blanket,<2> Mars experiences frigid weather conditions.<3> Surface temperatures typically average about -60 degrees Celsius<p> (-76 degrees Fahrenheit) </p>at the equator<4> and can dip to -123 degrees C near the poles.<5> Only the midday sun at tropical latitudes is warm enough<6> to thaw ice on occasion,<7> but any liquid water formed in this way would evaporate almost instantly<8> because of the low atmospheric pressure.<9> Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop, most Martian weather involves blowing dust or carbon dioxide.]</Paragraph> <Paragraph position="2"> The discourse structure is a tree whose leaves correspond to edus and whose internal nodes correspond to contiguous text spans. Each internal node is characterized by a rhetorical relation, which is a relation that holds between two non-overlapping text spans called NUCLEUS and SATELLITE. (There are a few exceptions to this rule: some relations, such as the LIST relation that holds between units 4 and 5 and the CONTRAST relation that holds between spans [6,7] and [8,9] in figure 1, are multinuclear.) The distinction between nuclei and satellites comes from the empirical observation that the nucleus expresses what is more essential to the writer's purpose/intention than the satellite, and that the nucleus of a rhetorical relation is comprehensible independent of the satellite, but not vice versa.
Some edus may contain parenthetical units, i.e., embedded units whose deletion does not affect the understanding of the edu to which they belong.</Paragraph> <Paragraph position="3"> For example, the unit shown in italics in (1) is parenthetical.</Paragraph> <Paragraph position="4"> (1) This book, which I have received from John, is the best book that I have read in a while.</Paragraph> <Paragraph position="5"> * The lower panel displays the text read by the annotator up to time t and only the first sentence that immediately follows the labeled edus.</Paragraph> <Paragraph position="6"> Annotators can create elementary and parenthetical units by clicking on their boundaries; immediately add a newly created unit to a partial discourse structure using operations specific to tree-adjoining and bottom-up parsers; postpone the construction of a partial discourse structure until their understanding of the text enables them to do so; take discourse structures apart and re-connect them; change relation names and nuclearity assignments; undo any number of steps; etc. In other words, annotators have complete control over the discourse construction strategy that they employ.</Paragraph> <Paragraph position="7"> All actions taken by annotators are automatically logged.</Paragraph> </Section> <Section position="2" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 2.2 Annotation protocol </SectionTitle> <Paragraph position="0"> One of us initially prepared a manual that contained instructions pertaining to the functionality of the tool, definitions of edus and rhetorical relations, and a protocol that was supposed to be followed during the annotation process (Marcu, 1998).</Paragraph> <Paragraph position="1"> Edus were defined functionally as clauses or clause-like units that are unequivocally the NUCLEUS or SATELLITE of a rhetorical relation that adds some significant information to the text. For example, "because of the low atmospheric pressure" in text (2) is not a fully fledged clause. However, since it is the SATELLITE of an EXPLANATION relation, it should be treated as elementary.</Paragraph> <Paragraph position="2"> (2) [Only the midday sun at tropical latitudes is warm enough] [to thaw ice on occasion,] [but any liquid water formed in this way would evaporate almost instantly] [because of the low atmospheric pressure.] A total of 70 rhetorical relations were partitioned into clusters, each cluster containing a subset of relations that shared some rhetorical meaning. For example, one cluster contained the contrast-like rhetorical relations of ANTITHESIS, CONTRAST, and CONCESSION. Another cluster contained REASON, EVIDENCE, and EXPLANATION. Each relation was paired with an informal definition given in the style of Mann and Thompson (1988) and Moser and Moore (1997) and one or more examples. No explicit distinction was made between intentional and informational relations. In addition, we marked two constituency relations that were ubiquitous in our corpora and that often subsumed complex rhetorical constituents, and one textual relation. The constituency relations were ATTRIBUTION, which was used to label the relation between a reporting and a reported clause, and APPOSITION. The textual relation was TEXTUAL-ORGANIZATION; it was used to connect in an RST-like manner the textual spans that corresponded to the title, author, and textual body of each document in the corpus.
We also enabled the annotators to use the label OTHER-RELATION whenever they felt that no relation in the manual captured sufficiently well the meaning of a rhetorical relation that held between two text spans.</Paragraph> <Paragraph position="3"> In an attempt to manage the inherent rhetorical ambiguity of texts, we also devised a protocol that listed the clusters of relations in decreasing order of specificity. Hence, the relations at the beginning of the protocol were more specific than the relations at the end of the protocol. The protocol specified that, in assigning rhetorical relations, judges should choose the first relation in the protocol whose definition was consistent with the case under consideration. For example, it is often the case that when an EVIDENCE relation holds between two segments, an ELABORATION relation holds as well. Because EVIDENCE is more specific than ELABORATION, it comes first in the protocol, and hence, whenever both of these relations hold, only EVIDENCE is supposed to be used for tagging.</Paragraph> <Paragraph position="4"> The protocol specified that the rhetorical tagging should be performed incrementally. That is, if an annotator created an edu at time t and if she knew how to attach that edu to the discourse structure, she was supposed to do so at time t + 1. If the text read up to time t did not warrant such a decision, the annotator was supposed to determine the edus of the subsequent text and complete the construction of the discourse structure as soon as sufficient information became available.</Paragraph> <Paragraph position="5"> 2.3 Materials and method. Since we were aware of no previous study that thoroughly investigated the coverage of any set of rhetorical relations or any protocol, we felt it necessary to divide the experiment into a training stage and an annotation stage. During the training stage, each of us built the discourse structures of 10 texts that varied in size from 162 to 1924 words. The texts belonged to the news story, editorial, and scientific genres. We had extensive discussions after the tagging of each text. During these discussions, we refined the definition of edu, the definitions and number of rhetorical relations that we used, and the order of the relations in the protocol. Eventually, our protocol comprised 50 mononuclear relations and 23 multinuclear relations. All relations were divided into 23 clusters of rhetorical similarity (see (Marcu, 1998) for the complete list of rhetorical relations and the protocol).</Paragraph> <Paragraph position="8"> During the annotation stage, we independently built the discourse structures of 90 texts by following the instructions in the protocol; 30 texts were taken from the MUC7 co-reference corpus, 30 texts from the Brown-Learned corpus, and 30 texts from the Wall Street Journal (WSJ) corpus. The MUC corpus contained news stories about changes in corporate executive management personnel; the Brown corpus contained long, highly elaborate scientific articles; and the WSJ corpus contained editorials.</Paragraph> <Paragraph position="9"> The average number of words for each text was 405 in the MUC corpus, 2029 in the Brown corpus, and 878 in the WSJ corpus. The average number of edus in each text was 52 in the MUC corpus, 170 in the Brown corpus, and 95 in the WSJ corpus. Each of the MUC texts was tagged by all three of us; each of the Brown and WSJ texts was tagged by only two of us.
Table 1 shows the 15 relations that were used most frequently by annotators in each of the three corpora; the associated percentages reflect averages computed over all annotators. The table also shows the percentage of cases in which the annotators used the label OTHER-RELATION.</Paragraph> <Paragraph position="10"> Problems with the method. It has been argued that the reliability of a coding schema can be assessed only on the basis of judgments made by naive coders (Carletta, 1996). Although we agree with this, we believe that more experiments of the kind reported here will have to be carried out before we can produce a tagging manual that is usable by naive coders. In our experiment, it is not clear how much of the agreement came from the manual and how much from the common understanding that we reached during the training stage. For our annotation task, we felt that it was more important to arrive at a common understanding than to tightly control how this understanding was reached. This position has been taken by other computational linguists as well (Carletta et al., 1997, p. 25).</Paragraph> </Section> </Section> <Section position="4" start_page="50" end_page="54" type="metho"> <SectionTitle> 3 Computing agreement among judges </SectionTitle> <Paragraph position="0"> We computed agreement figures with respect to the way we set up edu boundaries and the way we built the hierarchical discourse structures of texts.</Paragraph> <Section position="1" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 3.1 Reliability of tagging the edu and parenthetical unit boundaries </SectionTitle> <Paragraph position="0"> In order to compute how well we agreed on determining the edu and parenthetical unit boundaries, we used the kappa coefficient k (Siegel and Castellan, 1988), a statistic used extensively in previous empirical studies of discourse. The kappa coefficient measures pairwise agreement among a set of coders who make category judgements, correcting for chance expected agreement (see equation (3) below, where P(A) is the proportion of times a set of coders agree and P(E) is the proportion of times a set of coders are expected to agree by chance).</Paragraph> <Paragraph position="1"> k = (P(A) - P(E)) / (1 - P(E)) (3) </Paragraph> <Paragraph position="2"> Carletta (1996) suggests that the units over which the kappa statistic is computed affect the outcome.</Paragraph> <Paragraph position="3"> To account for this, we computed the kappa statistics in two ways: 1. The first statistic, kw, reflects inter-annotator agreement under the assumption that edu and parenthetical unit boundaries can be inserted after any word in a text. Because many of the words occur within units and not at their boundaries, the chance agreement is very high, and therefore, kw tends to be higher than the statistic discussed below.</Paragraph> <Paragraph position="4"> 2. The second statistic, ku, reflects inter-annotator agreement under the assumption that edu and parenthetical unit boundaries can occur only at locations judged to be boundaries by at least one annotator. This statistic offers the most conservative measure of agreement.</Paragraph>
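To make equation (3) concrete, here is a minimal sketch (ours, not the annotation tool's code) of a two-coder kappa computation in the spirit of the kw variant, where a boundary/no-boundary judgment is recorded after every word; the text length, the boundary positions, and the helper name are invented for illustration. The ku variant would apply the same function over a more restricted set of positions.

```python
def kappa(judgments_a, judgments_b):
    """Equation (3): k = (P(A) - P(E)) / (1 - P(E)), for two coders making
    categorial judgments over the same set of units."""
    assert len(judgments_a) == len(judgments_b)
    n = len(judgments_a)
    p_a = sum(x == y for x, y in zip(judgments_a, judgments_b)) / n
    pooled = judgments_a + judgments_b
    # chance agreement estimated from the pooled category proportions
    p_e = sum((pooled.count(c) / (2 * n)) ** 2 for c in set(pooled))
    return (p_a - p_e) / (1 - p_e)

# Toy kw-style computation: a 20-word text in which each position after a
# word is judged "boundary" or "none" by each of two annotators.
n_words = 20
bounds_1 = {4, 9, 15}   # invented boundary positions for annotator 1
bounds_2 = {4, 10, 15}  # invented boundary positions for annotator 2
labels_1 = ["boundary" if i in bounds_1 else "none" for i in range(n_words)]
labels_2 = ["boundary" if i in bounds_2 else "none" for i in range(n_words)]
print(round(kappa(labels_1, labels_2), 2))  # 0.61 for this toy example
```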
</Section> <Section position="2" start_page="51" end_page="53" type="sub_section"> <SectionTitle> 3.2 Reliability of tagging the discourse structure of texts </SectionTitle> <Paragraph position="0"> We are aware of only one proposal for computing agreement with respect to the way human judges construct hierarchical structures, that of Flammia and Zue (1995). This proposal appears to be adequate for computing the observed agreement, but it provides only a lower bound on the chance agreement, and hence, only an upper bound on the kappa coefficient. With the exception of Flammia and Zue, other researchers relied primarily on cascaded schemata for computing agreement among hierarchical structures. For example, Carletta et al. (1997) computed agreement on a coarse segmentation level that was constructed on top of finer segments, by determining how well coders agreed on where the coarse segments started and, for agreed starts, by computing how well coders agreed on where the coarse segments ended. Moser and Moore (1997) first determined the kappa coefficient with respect to the way judges assigned boundaries at the highest level of segmentation. Then judges met and agreed on a particular segmentation. Each high-level segment was then independently broken into smaller segments, and the process was repeated recursively until the elementary unit level was reached. Although Moser and Moore's approach readily accommodates the traditional computation of kappa, it is impractical for large texts. In addition, since judges meet and agree on every level, it is likely that the agreement at finer levels of detail is influenced by the judges' interaction.</Paragraph> <Paragraph position="1"> In order to compute the kappa statistics, we devised a new method whose core idea is to map hierarchical structures into sets of units that are labeled with categorial judgments (see (Marcu and Hovy, 1999) for details). Consider, for example, the two hierarchical structures shown in figure 2.a, in which, for simplicity, we focus only on the nuclear status of each segment (Nucleus or Satellite). In order to enable the computation of the kappa agreement, we take as elementary all textual units found between two consecutive textual boundaries, independent of whether one or multiple judges chose those boundaries. Hence, for the segmentations in figure 2.a we consider that the text is made of 7 units; judge 1 took as elementary the segments [0,1], [2,2], [3,3], [4,5], and [6,6], while judge 2 took as elementary the segments [0,0], [1,1], [2,2], [3,3], [4,4], and [5,6]. The mapping between a hierarchical structure and a set of units labeled with categorial judgments is straightforward if we consider not only the segments that play an active role in the structure (the nuclei and the satellites) but also the segments that are not explicitly represented. For example, for segmentation 1, there is no active segment across units [2,4], [2,5], and [2,6]. Similarly, for segmentation 2, there is no active segment across units [4,5] and [6,6]. By associating the label NONE with the textual units that do not play an active role in a hierarchical representation, each discourse structure can be mapped into a set that explicitly enumerates all possible spans that range over the units in the text. For a text of n units there are n spans of length 1, n - 1 spans of length 2, ..., and 1 span of length n. Hence, each hierarchical structure of n units can be mapped into a set of n + (n - 1) + ... + 1 = n(n + 1)/2 units, each labeled with a categorial judgment. Computing the kappa statistic for such sets is then a problem with a textbook solution (Siegel and Castellan, 1988).</Paragraph> <Paragraph position="2"> In the example in figure 2, we therefore compute the kappa statistics between the two hierarchies by computing the kappa statistics between the two sets that are represented in figure 2.b.</Paragraph>
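The mapping just described can be made concrete with a short sketch (our illustration, not the implementation from (Marcu and Hovy, 1999)): each annotation is given as the set of spans that actually occur in the tree, every one of the n(n+1)/2 possible spans receives a label (NONE if it is not active), and kappa is computed over the resulting label sequences. The two toy structures below only mirror the elementary segments listed above for figure 2.a; their internal spans and all nuclearity labels are invented.

```python
def span_labels(active_spans, n):
    """Map a hierarchical structure, given as {(i, j): label} for the spans
    that occur in the tree, to judgments over all n(n+1)/2 possible spans;
    spans that play no active role receive the label NONE."""
    return [active_spans.get((i, j), "NONE")
            for i in range(n) for j in range(i, n)]

def kappa(a, b):
    """k = (P(A) - P(E)) / (1 - P(E)), as in the previous sketch."""
    n = len(a)
    p_a = sum(x == y for x, y in zip(a, b)) / n
    pooled = a + b
    p_e = sum((pooled.count(c) / (2 * n)) ** 2 for c in set(pooled))
    return (p_a - p_e) / (1 - p_e)

# Invented nuclearity annotations over a 7-unit text (N = Nucleus,
# S = Satellite); the leaves match the elementary segments of each judge,
# but the internal spans and all labels are made up for illustration.
judge_1 = {(0, 6): "N", (0, 3): "N", (4, 6): "S", (0, 1): "S", (2, 3): "N",
           (2, 2): "N", (3, 3): "S", (4, 5): "N", (6, 6): "S"}
judge_2 = {(0, 6): "N", (0, 3): "N", (4, 6): "S", (0, 1): "S", (2, 3): "N",
           (0, 0): "N", (1, 1): "S", (2, 2): "N", (3, 3): "S",
           (4, 4): "N", (5, 6): "S"}

k_n = kappa(span_labels(judge_1, 7), span_labels(judge_2, 7))
print(f"hierarchical nuclearity agreement k_n = {k_n:.2f}")
```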
<Paragraph position="3"> The hierarchical structures in figure 2 correspond to nuclearity judgments. However, the schema we use here is general, since it can accommodate the computation of the kappa statistic for judgments at the segmentation and rhetorical levels as well. In fact, the schema can be applied to any discourse, syntactic, or semantic hierarchical labeling.</Paragraph> <Paragraph position="4"> In our experiment, we computed the kappa statistic with respect to four categorial judgments: 1. ks reflects the agreement with respect to the hierarchical segmentation of the texts; 2. kn reflects the agreement with respect to the hierarchical assignment of nuclearity; 3. kr reflects the agreement with respect to the hierarchical assignment of rhetorical relations; 4. krr reflects the agreement with respect to the hierarchical assignment of rhetorical relations under the assumption that rhetorical relations are chosen from a reduced set of only 18 relations, each relation representing one or more clusters of rhetorical similarity. (Some relations were used so rarely in our corpus that we decided to cluster them into only one group in spite of their not belonging to the same cluster of rhetorical similarity.) We chose to compute krr in order to estimate whether the confusion in assigning rhetorical relations lay within the clusters or across them.</Paragraph> <Paragraph position="5"> Problems with the proposed method. In interpreting the statistical significance of the results reported in this paper, readers should bear in mind that the method we propose and use here is not perfect. The biggest problem we perceive concerns the violation of the independence assumption between the categorial judgments that is usually associated with the computation of the kappa statistic. Obviously, in our proposal, since the decisions taken at one level in the tree affect decisions taken at other levels in the tree, the data points over which the kappa coefficient is computed are not independent.</Paragraph> <Paragraph position="6"> However, it is not clear what effect this interdependence has on kappa: if two judges agree, for example, on the labels associated with two large spans, they automatically agree on many other spans that do not play an active role in the representation. When this happens, it is likely that the value of kappa increases. However, if two judges disagree on two high-level spans, they automatically disagree on other spans that play an active role in the representation. When this happens, it is likely that the value of kappa decreases. Therefore, it is not very clear how much the final value of kappa is skewed in one direction or another. Most likely, if two judges agree significantly, the kappa coefficient will be skewed to higher values; if two judges disagree significantly, the kappa coefficient will be skewed to smaller values. Another problem concerns the effect of NONE agreements on the computation of kappa. Although the kappa statistic makes corrections for chance agreement, it is likely that the kappa coefficient is "artificially" high because of the large number of non-active spans that are labelled NONE.
Typically, a hierarchical structure with n leaves will be mapped into n(n + 1)/2 categorial judgments, of which only 2n - 1 have values different from NONE. (For the 7-unit text in figure 2, for instance, each structure yields 28 judgments, of which at most 13 are labeled with something other than NONE.)</Paragraph> <Paragraph position="7"> Hence, it is possible for the kappa coefficient to be "artificially" high because of many agreements on non-active spans. However, the interdependence effect discussed above may equally well "artificially" decrease the value of the kappa coefficient. One may imagine variants of our method in which all NONE-NONE agreements are eliminated, or in which only 2n - 1 of them are preserved. The first variant may be infelicitous because its adoption may artificially prevent judges from agreeing on NONE labels. Adopting the second variant is problematic because we don't know exactly how many NONE labels to keep in the mapped representation.</Paragraph> <Paragraph position="8"> Another potential problem stems from assigning the same importance to agreements at all levels in the hierarchy. For some classes of problems, one may argue that achieving agreement at higher levels in the hierarchy should be more important than achieving agreement at lower levels. Obviously, the method we described here does not enable such an intuition to be properly accounted for. However, for the discourse annotation task, we are quite ambivalent about this intuition. It is not clear to us whether we should consider annotations that have high agreement with respect to large textual segments and low agreement with respect to small segments better than annotations that have low agreement with respect to large textual segments and high agreement with respect to small segments.</Paragraph> <Paragraph position="9"> The first group of annotations would correspond to an ability to deal properly with global discourse phenomena but not with local discourse phenomena. The second group would correspond to an ability to deal properly with local discourse phenomena but not with global discourse phenomena. Which one is "better"? The method we propose treats all spans equally. It is similar in this respect to the labeled recall and precision measures used to evaluate parsers, which also do not consider it more important to agree on high-level constituents than on low-level constituents.</Paragraph> <Paragraph position="10"> The method we propose does not enable one to assess agreement at different levels of granularity; it produces one number, which cannot be used to diagnose where the disagreements are coming from.</Paragraph> <Paragraph position="11"> Although we believe that the cascade techniques that were used to measure agreement between hierarchies (Moser and Moore, 1997; Carletta et al., 1997) are more adequate for diagnosing problems in the annotation, we found these techniques difficult to apply to our data.
Some of our trees have more than 200 elementary units, and carrying out and interpreting a cascade analysis at potentially 200 levels of granularity is not straightforward either.</Paragraph> <Paragraph position="12"> Another choice for computing the agreement of hierarchical annotations would be to devise a method similar to that used for Kendall's tau statistic, in which one computes the minimal number of operations that can map one annotation into another.</Paragraph> <Paragraph position="13"> Since the problem of finding the minimal number of operations that rewrite a tree into another tree is NP-complete, devising an operational method for computing agreement does not seem computationally feasible. After all, the number of possible trees that can be built for a text with 200 units is a number larger than 1 followed by 110 zeroes.</Paragraph>
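A quick sanity check on that figure (our own calculation, not from the paper): even counting only unlabeled binary trees, a text of 200 elementary units already admits the 199th Catalan number of bracketings, which has 117 decimal digits.

```python
from math import comb

def catalan(n):
    """Number of distinct unlabeled binary trees with n + 1 leaves."""
    return comb(2 * n, n) // (n + 1)

n_trees = catalan(199)        # binary trees over 200 elementary units
print(len(str(n_trees)))      # 117 decimal digits
print(n_trees > 10 ** 110)    # True, consistent with the figure in the text
```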
</Section> <Section position="3" start_page="53" end_page="54" type="sub_section"> <SectionTitle> 3.3 Tagging consistency </SectionTitle> <Paragraph position="0"> For each corpus, table 2 displays the number of coders that annotated each text in the corpus (#c) and the average number of data points (Nw and Nu) over which the kappas were computed for each text in the corpus. In the first three rows, the table also shows the average kappa statistics computed for each text in the corpus with respect to the judges' ability to agree on elementary discourse boundaries (kw and ku) and the average value of the corresponding z statistics (zw and zu) that were computed to test the significance of kappa (Siegel and Castellan, 1988, p. 289). The last three rows show the same statistics computed over all data points in each corpus.</Paragraph> <Paragraph position="1"> The field of content analysis suggests that values of k higher than 0.6 indicate good agreement. Values of z higher than 2.32 correspond to significance at the α = 0.01 level. The results in table 2 indicate that high, statistically significant agreement was achieved for all three corpora with respect to the task of determining the elementary discourse units.</Paragraph> <Paragraph position="2"> For each corpus, table 3 displays the number of coders (#c) that annotated the texts and the average number of points (N) over which the agreements were computed for each text in the corpus. In the first three rows, the table displays the average kappa statistics with respect to the judges' ability to agree on each text on discourse segmentation, ks, nuclearity assignments, kn, and rhetorical relation assignments, kr and krr. In the last three rows, the table displays the kappa statistics computed over all the data points in each corpus. If the statistical method we proposed does not skew the values of k -- a fact that we have not demonstrated -- the data in table 3 suggest that reliable agreement is obtained across all three corpora with respect to the assignment of discourse segments and nuclear statuses. Reliable agreement is obtained with respect to the rhetorical labeling only for the MUC corpus. The results in table 3 also show that a significant reduction in the size of the taxonomy of relations may not have a significant impact on agreement (krr is only about 4% higher than kr).</Paragraph> <Paragraph position="3"> This suggests that choosing one relation from a set of rhetorically similar relations produces some, but not too much, confusion. However, it may also suggest that it is more difficult to assess where to attach an edu in a discourse tree than what relation to use.</Paragraph> <Paragraph position="4"> The results in tables 2 and 3 also show that the agreement figures vary significantly from one corpus to another: the news story genre of the MUC texts yields higher agreement figures than the editorial genre of the WSJ texts, which yields higher agreement figures than the scientific genre of the Brown texts. One possible explanation is that some of the Brown texts, which dealt with advanced topics in mathematics, physics, and chemistry, were difficult to understand.</Paragraph> <Paragraph position="5"> Overall, if our method for computing the kappa statistic is not skewed towards higher values, our experiment suggests that even simple, intuitive definitions of rhetorical relations, textual saliency, and discourse structure can lead to reliable annotation schemata. However, the results do not exclude the possibility that better definitions of edus, parenthetical units, and rhetorical relations could lead to significant improvements in the reliability scores.</Paragraph> </Section> </Section> <Section position="5" start_page="54" end_page="55" type="metho"> <SectionTitle> 4 Tagging style </SectionTitle> <Paragraph position="0"> The vast majority of computational approaches to discourse parsing rely on models that implicitly or explicitly assume that parsing is incremental (Polanyi, 1988; Lascarides and Asher, 1993; Gardent, 1997; Schilder, 1997; van den Berg, 1996; Cristea and Webber, 1997). That is, as edus are processed, they are immediately added to one partial discourse structure that subsumes all previous text. However, the logs of our experiment show that, quite often, annotators are unable to decide where to attach a newly created edu. The annotation style varies significantly among annotators; nevertheless, even the most aggressive annotator still needs to postpone the decision of where to attach a newly created edu 9.2% of the time (see table 4). Note that this percentage does not reflect UNDO steps, which may also correlate with attachment decisions that are eventually proven to be incorrect. We noticed that managing multiple partial discourse trees during the annotation process is the norm rather than the exception. In fact, it is not that edus are attached incrementally to one partial discourse structure, although the annotators were asked to do so, but rather that multiple partial discourse structures are created and then assembled using a rich variety of operations, which are specific to tree-adjoining and bottom-up parsers. Moreover, even this strategy proves to be somewhat inadequate, since annotators need from time to time to change rhetorical relation labels (2-3% of the operations) and to completely restructure the discourse (1-2% of the operations).</Paragraph> <Paragraph position="1"> This data suggests that it is unlikely that we will be able to build perfect discourse parsers that can incrementally derive discourse trees without applying any form of backtracking.
If humans are unable to decide incrementally, in 100% of the cases, where to attach the edus, it is unlikely that we can build computer programs that are able to do so.</Paragraph> <Paragraph position="2"> Note.* Estibaliz Amorrortu and Magdalena Romera contributed equally to this paper.</Paragraph> <Paragraph position="3"> Note.** The tool described in this paper can be obtained by emailing the first author or by downloading it from http://www.isi.edu/~marcu/.</Paragraph> <Paragraph position="4"> Acknowledgements. We are grateful to Mick O'Donnell for making his discourse annotation tool publicly available and to Benjamin Liberman and Ulrich Germann for contributing to the development of the annotation tool described in this paper. We also thank Eduard Hovy, Kevin Knight, and three anonymous reviewers for extensive comments on a previous version of this paper.</Paragraph> </Section> </Paper>