<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2323"> <Title>Unifying Annotated Discourse Hierarchies to Create a Gold Standard</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Methods for Creating a Gold Standard </SectionTitle> <Paragraph position="0"> It is likely that there is no perfect way to find and evaluate a gold standard, and in some cases there may be multiple segmentations that are equally likely to serve as a gold standard. In the BDC corpus, unlike the Penn Treebank, there are multiple annotations for each discourse which were not manually combined into one gold standard annotation. In this paper, we explore automatic methods to create a gold standard for the BDC corpus. These methods could also be used on other corpora with non-unified annotations. Next, we present several automatic methods to combine multiple human-annotated discourse segmentations into one gold standard.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Flat vs. Hierarchical Approaches </SectionTitle> <Paragraph position="0"> Most previous work that has combined multiple annotations has used linear segmentations, i.e. discourse segmentations without hierarchies (Hirschberg and Nakatani, 1996). In general, the hierarchical nature of discourse structure has not been considered when computing labeler inter-reliability and in evaluations of agreement with automatic methods. Since computational discourse theory relies on the hierarchy of its segments, we will consider it in this paper. For each approach that follows, we consider both a &quot;flat&quot; version, which does not consider level of embeddedness, and a &quot;full&quot; approach, which does. We analyze the differences between the flat and full versions for each approach.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Segment Definition </SectionTitle> <Paragraph position="0"> A discourse is made up of a sequence of utterances, a0a2a1a4a3a5a0a7a6a8a3a9a0a11a10a12a3a14a13a15a13a16a13a15a3a5a0a7a17 . In this paper, we define a segment as a triple a18a20a19 a3a22a21a23a3a9a24a22a25 , where a0a7a26 is the first utterance in the segment, a0a28a27 is the last utterance in the segment, and a24 is the segment's level of embeddedness.2 We will sometimes refer to a0a7a26 and a0a28a27 as boundary utterances. Lastly, when we are not interested in level of embeddedness, we will sometimes refer to a segment as a18a20a19 a3a22a21a29a25 , without the a24 value.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 The Consensus Approach </SectionTitle> <Paragraph position="0"> A conservative way to combine segmentations into a gold standard is the Consensus (CNS) (or raw agreement) ap2The levels are numbered from top to bottom; hence, 1 is the level of the largest segment, 2 is the level below that, and so on. would create the same gold standard. The segments in the annotations that are marked in bold are those selected by the gold standard.</Paragraph> <Paragraph position="1"> have been included by every annotator. In the &quot;full&quot; version of CNS (FullCNS), the annotators need to agree upon the embedded level of the segment along with the segment boundaries. 
<Paragraph position="2"> Figure 1 shows an example of performing FullCNS on three annotations. In the figure, all three annotators agree on only the largest segment (segments A, E, and H). Hence, the gold standard includes only that single segment. FlatCNS gives the same gold standard in this example, as there are no two segments with the same boundaries but different levels of embeddedness.</Paragraph> <Paragraph position="3"> In Figure 2, we see an example where the gold standards created by FlatCNS and FullCNS differ. Aside from the largest segment, FullCNS contains only the segment representing the agreement of segments D, G, and M. FlatCNS includes that segment as well, plus two more: one from the agreement of segments B, H, and K, and one from the agreement of segments C, I, and L. FullCNS does not include those two segments because they occur at differing levels of embeddedness.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Majority Consensus </SectionTitle> <Paragraph position="0"> A straightforward extension to the CNS approach is to relax the requirement of full agreement and include those segments on which a majority of the annotators agreed (Grosz and Hirschberg, 1992). Other thresholds of agreement could be used as well, but in this paper we include a segment whenever a majority of the annotators agree on it. We call this the Majority Consensus (MCNS) approach. As with CNS, we can have both a &quot;full&quot; version (FullMCNS) and a &quot;flat&quot; version (FlatMCNS), which do and do not consider the level of embeddedness, respectively.</Paragraph> <Paragraph position="1"> Figure 3 shows an example of performing FullMCNS on the same three annotations we saw in Figure 1. Here, we again include the largest segment because it is agreed upon by all, but now we also include the two segments agreed upon by annotators 1 and 3, because two out of three annotators, a majority, have selected them. These two segments correspond to segments B and C for annotator 1 and segments I and J for annotator 3.</Paragraph> <Paragraph position="2"> MCNS is less strict than CNS, as it includes segments agreed upon by most annotators rather than requiring full agreement, but both methods are affected by a potential flaw. Note that in Figure 3, segment D could very well be in some notion of agreement with annotation 3, but MCNS does not capture this near-miss; D is left out of the gold standard. The next approach we discuss can handle this sort of situation.</Paragraph> </Section>
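A sketch of MCNS under the same set-of-triples encoding (again our illustration, not the authors' code); a strict majority threshold reproduces the behavior described above, and other agreement thresholds can be substituted by changing the comparison:

```python
from collections import Counter
from typing import List, Set, Tuple

Segment = Tuple[int, int, int]

def majority_consensus(annotations: List[Set[Segment]]) -> Set[Segment]:
    """FullMCNS: keep every segment marked by a strict majority of annotators.
    For FlatMCNS, drop the depth component of each triple before counting."""
    counts = Counter(seg for ann in annotations for seg in ann)
    return {seg for seg, n in counts.items() if n > len(annotations) / 2}
```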
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Conflict-Free Union </SectionTitle> <Paragraph position="0"> The Conflict-Free Union (CFU) approach combines the annotations of all of the annotators and removes those segments that conflict with each other to get a conflict-free gold standard. There are usually multiple ways to construct a conflict-free gold standard; the CFU approach finds one with the fewest segments removed.</Paragraph> <Paragraph position="1"> Figure 4 shows the use of CFU on the three example annotations. Notice that the only segments not included in the gold standard are F and G, which conflict with B, C, D, I, and J. Resolving the conflicts here required removing two segments; the other way to resolve them would have been to remove C and J, which would have been equally good.</Paragraph> <Paragraph position="2"> CFU captures as many conflict-free segments from the annotations as possible, without discrimination: even if only one annotator chose a segment, CFU will include it if it does not create new conflicts. Hence, it is likely that CFU could construct gold standards with too much structure. However, in our example it is better at capturing the similarity of structure between annotators 1 and 3. Due to its ability to capture structure, we expected that CFU would perform better in recall than the previously mentioned approaches.</Paragraph> <Paragraph position="3"> The consensus and majority approaches are straightforward to compute, but CFU presents an optimization problem: we must find the greatest number of segments that can be combined without any internal conflicts. Brute-force methods, such as trying every possible set of segments and picking the largest conflict-free set, grow exponentially in the total number of segments contained in the annotations. We present a dynamic programming algorithm that computes the CFU in $O(n^3)$ time, where $n$ is the number of utterances in the discourse.</Paragraph> <Paragraph position="4"> First, we say that a segment $\langle x, y \rangle$ straddles an utterance $u_k$ if $x \le k \le y$ and $x \ne y$. Let $c^k_{a,b}$ represent the number of segments between utterances $u_a$ and $u_b$, inclusive, that straddle $u_k$; that is, $c^k_{a,b}$ is the number of unique segments of the form $\langle x, y \rangle$ with $a \le x \le k \le y \le b$ and $x \ne y$. We use $c^k_{a,b}$ to compute $m_{a,b}$, the index of the utterance $u_k$ that, if considered a new boundary utterance, would minimize the number of conflicting segments within $u_a$ and $u_b$, and $v_{a,b}$, that minimum number of segments, via a set of recurrence equations. [The recurrence equations could not be recovered from the source.]</Paragraph> <Paragraph position="5"> We can then generate a binary tree with $\langle 1, n \rangle$ as the value of the root node, representing the segment over all utterances. The left child has the value $\langle 1, m_{1,n} \rangle$, and the right child has the value $\langle m_{1,n}+1, n \rangle$. We compute the rest of the tree similarly, with $\langle 1, m_{1,m_{1,n}} \rangle$ as the left child of the left child, and so on. For each segment included by an annotator, we include it in the gold standard if it is represented by a segment in the constructed tree. Note that we store only the boundary utterances in the tree, so the gold standard we construct will not include levels of embeddedness.</Paragraph> </Section>
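Since the recurrence equations did not survive extraction, the sketch below substitutes the exponential brute-force baseline that the paper's $O(n^3)$ dynamic program improves upon (our illustration, not the authors' algorithm). conflicts() encodes the crossing condition, overlap without nesting, and conflict_free_union simplifies by counting each distinct span once:

```python
from itertools import combinations
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (first utterance, last utterance), ignoring depth

def conflicts(a: Span, b: Span) -> bool:
    """Two spans conflict iff they overlap but neither is nested in the other."""
    (i1, j1), (i2, j2) = a, b
    return (i1 < i2 <= j1 < j2) or (i2 < i1 <= j2 < j1)

def conflict_free_union(annotations: List[Set[Span]]) -> Set[Span]:
    """Largest conflict-free subset of the pooled segments, by brute force.
    Ties (e.g., removing F and G vs. C and J in Figure 4) resolve arbitrarily."""
    pool = list(set().union(*annotations))
    for r in range(len(pool), 0, -1):
        for subset in combinations(pool, r):
            if not any(conflicts(a, b) for a, b in combinations(subset, 2)):
                return set(subset)
    return set()
```

This version is only practical for small pools; the point of the paper's dynamic program is to obtain the same conflict-free maximum in polynomial time.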
<Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.6 Union </SectionTitle> <Paragraph position="0"> The methods for finding a gold standard in Sections 2.1-2.4 produce segmentations that contain no internal conflicts. (MCNS avoids conflicts because any two segments that a majority of annotators agree upon will always both be included by at least one annotator, and we assume that individual annotations are always internally consistent.) However, since we evaluate a gold standard by its similarity to the original annotations, it makes sense to define an approach that is capable of constructing a unified segmentation that includes conflicts, which we call the Union approach (UNI). UNI simply includes every segment from every annotator. The flat version ignores hierarchies (FlatUNI), and the full version includes them (FullUNI).</Paragraph> <Paragraph position="1"> An example of an application of FullUNI is given in Figure 5. We see that every segment chosen by annotators 1, 2, and 3 has been included in the gold standard, creating some internal conflicts. We certainly would not expect to use this construction as a prediction of the actual gold standard, but we include it for comparison with CFU when evaluating the importance of avoiding internal conflicts with respect to our metrics.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.7 Best Annotator </SectionTitle> <Paragraph position="0"> The final approach we considered chooses the &quot;best&quot; annotation and considers it to be the gold standard. We select as the &quot;best&quot; annotation the one with the highest inter-labeler reliability with all the other annotations, using the pairwise $\kappa$ metric. We discuss this metric and its uses in Section 3.</Paragraph> </Section> </Section>
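Both remaining combination schemes are simple to state in the same encoding. This sketch is ours; the pairwise agreement function is left as a parameter because the paper plugs in the $\kappa$ metric defined in Section 3:

```python
from typing import Callable, List, Set, Tuple

Segment = Tuple[int, int, int]

def union(annotations: List[Set[Segment]]) -> Set[Segment]:
    """UNI: every segment from every annotator, internal conflicts included."""
    return set().union(*annotations)

def best_annotator(annotations: List[Set[Segment]],
                   agreement: Callable[[Set[Segment], Set[Segment]], float]
                   ) -> Set[Segment]:
    """Return the annotation with the highest mean pairwise agreement score
    (the paper uses the kappa metric of Section 3) with all the others."""
    def mean_score(k: int) -> float:
        others = [a for i, a in enumerate(annotations) if i != k]
        return sum(agreement(annotations[k], o) for o in others) / len(others)
    return annotations[max(range(len(annotations)), key=mean_score)]
```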
<Section position="4" start_page="0" end_page="21" type="metho"> <SectionTitle> 3 Measures of Evaluation </SectionTitle> <Paragraph position="0"> There are several ways of evaluating an algorithm for creating a gold standard, just as there are several ways of evaluating any segmentation algorithm. Ideally, we would like to compare to some objectively true gold standard, but it is impossible to determine whether there are one or more true standards, or even whether one exists. Instead, we can compare a gold standard against each annotator's individual structuring, or against that of several human annotators collectively. Alternatively, we could compare gold standards with each other in terms of how they affect the outcome of some computational task which considers discourse structure, such as anaphora resolution. This last approach is probably the best when the purpose of the gold standard is known in advance, but in this paper we consider only task-independent metrics.</Paragraph> <Paragraph position="1"> For the sake of scientific validity, we did not compare a gold standard with a segmentation of our own. Instead, we chose to evaluate gold standards by averaging their similarity to the original segmentations made by human annotators. For each approach presented earlier, we report an average similarity score over all original segmentations and the gold standard, based on several different quantitative measures of inter-reliability.</Paragraph> <Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 3.1 Pairwise Agreement Scores </SectionTitle> <Paragraph position="0"> For linear segmentation, pairwise agreement between annotators is computed by dividing the number of utterances on which both annotators agree by the total number of utterances. In contrast, a hierarchical segmentation of a sequence of utterances in a discourse is analogous to a parse tree for a sequence of words, and it requires a different metric for pairwise agreement, one that considers the hierarchy.</Paragraph> <Paragraph position="1"> Following Flammia and Zue (1995), we define a general symmetric metric $P_A$ for observed agreement between two segmentations that accounts both for deletions and for insertions of segments in a hierarchy. Intuitively, we want different sub-trees that vary only in hierarchical structure but share the same boundaries to be considered similar. For example, in Figure 6, there is good reason to consider both annotations to be similar, even though no segment pair in either spans the same utterances.</Paragraph> <Paragraph position="2"> Formally, let $S_1$ and $S_2$ be two possible segmentations. A segment $\langle i, j \rangle$ in $S_1$ matches with segmentation $S_2$ if there exists some segment $\langle i, y \rangle$ or $\langle y, i-1 \rangle$ in $S_2$, and there exists some segment $\langle y, j \rangle$ or $\langle j+1, y \rangle$ in $S_2$. In other words, a segment in $S_1$ matches a segmentation $S_2$ if the utterances that constitute its boundaries also constitute boundaries for some segment in $S_2$. For example, in Figure 5, we consider that the segments H, I, and J in annotation 3 match the segments A, B, C, and D in annotation 1.</Paragraph> <Paragraph position="3"> Flammia and Zue then let $m_1$ be the number of segments in $S_1$ that match with segments in $S_2$, and let $m_2$ be the number of segments in $S_2$ that match with segments in $S_1$. $n_1$ and $n_2$ are the numbers of segments in $S_1$ and $S_2$, respectively. Following Bakeman and Gottman (1986), they define the observed agreement to be $$P_A = \frac{m_1 + m_2}{n_1 + n_2}.$$</Paragraph> <Paragraph position="4"> For the metric to be valid, they also take into account the probability of chance agreement between annotators. For example, if the distribution underlying the segmentation is skewed such that the structure is very sparse, most segmentations will include very few constituents, and $P_A$ will be unnaturally inflated.</Paragraph> <Paragraph position="5"> The kappa coefficient ($\kappa$) is used for correcting the observed agreement by subtracting the probability $P_E$ that two segments in $S_1$ and $S_2$, chosen at random, happen to agree. The $\kappa$ coefficient is computed as follows: $$\kappa = \frac{P_A - P_E}{1 - P_E}.$$ Carletta (1996) reports that content analysis researchers generally think of $\kappa > 0.8$ as &quot;good reliability,&quot; with $0.67 < \kappa < 0.8$ allowing &quot;tentative conclusions to be drawn.&quot;</Paragraph> <Paragraph position="6"> All that remains is to define the chance agreement probability $P_E$. Let $P_b(S)$ and $P_e(S)$ be the fractions of utterances that begin and end one or more segments in segmentation $S$, respectively. Flammia and Zue compute an upper bound on $P_E$ from these quantities. [The upper-bound formula could not be recovered from the source.]</Paragraph> </Section>
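A sketch of the match relation and the resulting $\kappa$ score under the definitions above (our reconstruction, not the authors' code; the chance-agreement probability p_e is passed in, since the upper-bound formula is not recoverable from the source):

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (first utterance index, last utterance index)

def matches(seg: Span, other: Set[Span]) -> bool:
    """True if both boundary utterances of `seg` are also segment boundaries
    somewhere in the segmentation `other` (Flammia and Zue's match relation)."""
    i, j = seg
    starts = {a for a, _ in other}
    ends = {b for _, b in other}
    return (i in starts or (i - 1) in ends) and (j in ends or (j + 1) in starts)

def kappa(s1: Set[Span], s2: Set[Span], p_e: float) -> float:
    """Observed agreement P_A = (m1 + m2) / (n1 + n2), corrected for the
    chance-agreement probability p_e, which must be estimated separately."""
    m1 = sum(matches(seg, s2) for seg in s1)
    m2 = sum(matches(seg, s1) for seg in s2)
    p_a = (m1 + m2) / (len(s1) + len(s2))
    return (p_a - p_e) / (1.0 - p_e)
```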
<Section position="2" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 3.2 Precision and Recall </SectionTitle> <Paragraph position="0"> We use standard evaluation metrics from information retrieval to measure pairwise agreement between gold standards and annotations. We say that a segment $s = \langle i, j, d \rangle$ in some segmentation flatly agrees with segmentation $S$ if there exists a segment $t = \langle i, j, d' \rangle$ in $S$, which spans exactly the same utterances as $s$. We say that a segment $s = \langle i, j, d \rangle$ in some segmentation fully agrees with segmentation $S$ if $s$ flatly agrees with $S$ and the segment that fits it is also of the same depth as $s$; i.e., there exists a segment $t = \langle i, j, d \rangle$ in $S$.</Paragraph> <Paragraph position="1"> We define the number of relevant segments in a segmentation $S$ to be the total number of segments in $S$ that agree with the gold standard for that particular discourse. For gold standard types that consider embeddedness, such as Full Consensus and Full Majority Consensus, we check for full agreement. For gold standard types that do not, such as Flat Consensus and Conflict-Free Union, we check for flat agreement.</Paragraph> <Paragraph position="2"> We define recall as the number of relevant segments in $S$ divided by the total number of segments in $S$. We define precision as the number of relevant segments in $S$ divided by the total number of segments in the gold standard. Intuitively, if a gold standard has low agreement with the original segmentation, recall will be low. If a gold standard's structure is denser than the original segmentation's, precision will be low.</Paragraph> </Section> <Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 3.3 Non-Crossing-Brackets </SectionTitle> <Paragraph position="0"> The non-crossing-brackets measure is a common performance metric used in syntactic parsing for measuring hierarchical structure similarity. A segment constituent $c = \langle i, j \rangle$ in some segmentation crosses brackets with segmentation $S$ if $c$ overlaps some segment in $S$ but neither segment contains the other.</Paragraph> <Paragraph position="1"> For each segmentation $S$, we define the number of non-crossing-brackets as the number of segments in $S$ that do not exhibit crossing brackets with the appropriate gold standard. For each segmentation, we compute a non-crossing-brackets percentage by dividing the number of non-crossing-brackets by the total number of bracket pairs.</Paragraph> </Section> </Section>
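The two remaining metrics reduce to a few lines under the flat encoding. This sketch is ours and implements flat agreement only; full agreement would compare depth-bearing triples instead:

```python
from typing import Set, Tuple

Span = Tuple[int, int]

def precision_recall(annotation: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Flat agreement: a segment is relevant if the gold standard contains a
    segment spanning exactly the same utterances."""
    relevant = len(annotation & gold)
    return relevant / len(gold), relevant / len(annotation)

def crosses(a: Span, b: Span) -> bool:
    """Two spans cross iff they overlap but neither contains the other."""
    (i1, j1), (i2, j2) = a, b
    return (i1 < i2 <= j1 < j2) or (i2 < i1 <= j2 < j1)

def non_crossing_percentage(annotation: Set[Span], gold: Set[Span]) -> float:
    """Fraction of the annotation's brackets that cross no gold-standard bracket."""
    ok = sum(not any(crosses(seg, g) for g in gold) for seg in annotation)
    return ok / len(annotation)
```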
<Section position="5" start_page="21" end_page="21" type="metho"> <SectionTitle> 4 Empirical Methodology </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.1 Boston Directions Corpus </SectionTitle> <Paragraph position="0"> For our empirical analysis of the different gold standard approaches, we used the Boston Directions Corpus (BDC). The BDC corpus contains transcribed monologues by speakers who were instructed to perform a series of direction-giving tasks. The monologues were subsequently annotated by a group of subjects according to the Grosz and Sidner (1986) theory of discourse structure, which provides a foundation for hierarchical segmentation of discourses into constituent parts. Some of the subjects were experts in discourse theory and others were naive annotators. In the experiments here, we consider only the annotations from experts.</Paragraph> </Section> <Section position="2" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.2 Experimental Design </SectionTitle> <Paragraph position="0"> Our experiments were run on 12 discourses in the spontaneous speech component of the BDC. The lengths of the discourses ranged from 15 to 150 intonational phrases. Each discourse was segmented by three different annotators, resulting in 36 separate annotations. For each discourse, we combined the three annotations into a gold standard according to each technique described in Section 2. We then computed the similarity between the gold standard and each of the original annotations using the pairwise evaluation metrics described in Section 3.</Paragraph> </Section> <Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.3 Results </SectionTitle> <Paragraph position="0"> We report results for each gold standard averaged over all 36 annotations. Table 1 presents precision/recall percentages for pairwise agreement scores, as well as $\kappa$ values and non-crossing-brackets (NCB) percentages. Figure 7 plots the pairwise agreement precision/recall values on a graph, with error bars indicating one standard deviation from the mean. Recall that the gold standards we are comparing are Full Consensus (FullCNS), Flat Consensus (FlatCNS), Full Majority Consensus (FullMCNS), Flat Majority Consensus (FlatMCNS), Conflict-Free Union (CFU), Full Union (FullUNI), Flat Union (FlatUNI), and Best Annotator.</Paragraph> [Table 1: for each gold standard, the average (ave.) and standard deviation (sd.) of $\kappa$, agreement recall, agreement precision, and NCB; the numeric entries were not recovered from the source.] <Paragraph position="1"> Our results show that CFU, FullUNI, and FlatUNI all achieved high $\kappa$ scores with low variance. Both Full and Flat Consensus scored the worst. This pattern was also apparent in the agreement between the gold standard and the annotations: again, CFU, FullUNI, and FlatUNI achieved the best recall, and FullCNS and FlatCNS the worst. It is interesting to point out that since any segmentation proposed by an annotator will always be included in the FullUNI gold standard, its agreement recall will always be 1.</Paragraph> <Paragraph position="2"> We see a reversal of this trend with regard to precision between the gold standard and the annotations. Here, FullCNS and FlatCNS achieved very high precision, while FullUNI and FlatUNI achieved low precision; CFU's precision was slightly better. With respect to the non-crossing-brackets metric, the gold standards based on consensus (FullCNS, FlatCNS) did not clash at all with any annotation, since any segment in those gold standards is present in each of the annotations. Of the remaining methods, FlatMCNS (0.84) and FullMCNS (0.81) had the highest percentages of non-crossing-brackets, while the union-based approaches, FullUNI (0.47) and FlatUNI (0.54), had the lowest, because their gold standards are densely structured and internally include conflicts.</Paragraph> <Paragraph position="3"> Looking at each gold standard separately, we do not identify a single gold standard that does well across the board. CFU, FullUNI, and FlatUNI have high $\kappa$ and agreement recall values, but they all have low agreement precision values. FullCNS and FlatCNS have low $\kappa$ and recall values, but better agreement precision values. FullMCNS and FlatMCNS average out the best across all metrics, but they do not achieve the best performance in any of the metrics.
Note that &quot;full&quot; type methods require agreement in hierarchy; they are held to a higher standard of evaluation than &quot;flat&quot; type methods.</Paragraph> </Section> </Section> <Section position="6" start_page="21" end_page="21" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> From the results, we see that, in general, the consensus-type approaches (CNS and MCNS) perform very well on the precision metric, and the union approaches (CFU, UNI) perform well on the recall and $\kappa$ metrics. Precision measures the percentage of the gold standard that was agreed upon by the annotators, and since the consensus approaches tend to include only those segments labeled by everyone, they have high precision. Specifically, FullCNS performs perfectly in precision because it contains only those segments explicitly included by everyone, while the majority consensus methods perform slightly worse because an annotator is occasionally in the minority.</Paragraph> <Paragraph position="1"> Recall measures the percentage of the annotator's segments captured by the gold standard. Since the union approaches include every or almost every segment, depending on whether they are &quot;full&quot; or &quot;flat,&quot; respectively, an annotator's segment is almost always included in the gold standard, yielding high recall for these methods. The difference between precision and recall highlights two different approaches: precision encourages a bottom-up approach, in which the segments most likely to belong in the gold standard are added from scratch; recall encourages a top-down approach, in which all possible segments are added and the segments least likely to belong in the gold standard are pruned. The $\kappa$ metric attempts to balance these two approaches by rewarding agreement yet penalizing extra structure. Nevertheless, even the naive union method (UNI) performs well on $\kappa$, indicating that it favors agreement far more than it punishes extra structure.</Paragraph> <Paragraph position="2"> Based on these observations, we believe there is good reason to prefer CFU as a gold standard over FullUNI and FlatUNI. Although they all have the same $\kappa$ and similar precision/recall values, the CFU gold standard corresponds to a true segmentation -- it does not exhibit internal conflicts.</Paragraph> <Paragraph position="3"> However, if a conservative but accurate gold standard is desired, then the MCNS approaches are the best all-around consensus approaches to use, as they perform fairly well on $\kappa$ as well as on precision and recall. These approaches construct fairly conservative gold standards, but not nearly as strict as the full consensus approaches. Hence, as seen from the high precision value, a gold standard constructed by an MCNS method will contain mostly relevant segments but will be missing the more controversial segments.</Paragraph> <Paragraph position="4"> The Best Annotator approach performed very well on $\kappa$, but not as well with respect to precision and recall. Its performance was completely dominated by the MCNS approaches in all metrics except $\kappa$. In general, $\kappa$ is at its highest when minor boundary disagreements are infrequent, because it is not sensitive to the exact type of matching boundaries. This phenomenon is shown in Figure 6: there, we see two segmentations that are clearly different but are considered the same by $\kappa$.
Precision and recall, however, would not consider the second-level segments to be in agreement.</Paragraph> <Paragraph position="5"> The consistently good results on the non-crossing-brackets metric for MCNS and CFU indicate that there are few cases in which the expert BDC annotators create segments whose boundaries cross. Again, this effect is probably a result of the well-structured nature of the tasks in the BDC discourses. Since there are few crossing boundaries, the $\kappa$ metric performs well for the Union and Best Annotator methods, as almost every boundary is represented. If the annotations had exhibited more discrepancy, the non-crossing-brackets and $\kappa$ metrics would probably differentiate more among these approaches.</Paragraph> <Paragraph position="6"> Lastly, we note that the differences between &quot;full&quot; and &quot;flat&quot; versions of the same approach were insignificant, although with the consensus approaches the &quot;flat&quot; versions performed slightly better than their &quot;full&quot; counterparts, most likely because the &quot;full&quot; approaches were too conservative in demanding level agreement. Thus, if we care to have hierarchical structure in our gold standard, the &quot;full&quot; approaches should be used to find it, as they produce more richly structured segmentations. In addition, a gold standard with labeled embeddedness might be necessary for post-segmentation processing, such as anaphora resolution. However, if the gold standard is being used for purely evaluative purposes, the &quot;flat&quot; approaches should be used, as they perform slightly better.</Paragraph> </Section> <Section position="7" start_page="21" end_page="21" type="metho"> <SectionTitle> 6 Future Work </SectionTitle> <Paragraph position="0"> One problem with the measures of evaluation that we have explored in this paper is that they tell us how similar a gold standard is to the original annotations but say nothing about how effective the gold standard would be when used for further discourse processing. One suggestion for future studies would be to evaluate the gold standards with respect to possible post-segmentation tasks, such as anaphora resolution or summarization. Such an approach would be a better measure of the objective goodness of a gold standard and could also be a way to monitor the skills of the annotators. Specific metrics might also be more relevant for a specific discourse task; for instance, perhaps non-crossing-brackets is the more useful metric when segmenting as a preprocessing step for anaphora resolution.</Paragraph> <Paragraph position="1"> It would also be interesting to further explore the Conflict-Free Union approach, as it performed well but suffered from including extra structure. Its top-down processing could be enhanced by removing those segments deemed least likely to belong in the gold standard, perhaps based on features such as depth. For example, a segment that is at a deep level and appears in only a few annotations could be removed, while larger segments would remain regardless, or vice versa. With a few good features, it seems quite possible to increase the precision of CFU. A similar approach could be taken to add new segments to those picked by the Majority Consensus approach.</Paragraph> <Paragraph position="2"> Finally, it is worth exploring whether it is a good idea to have multiple annotations for a given corpus in the first place.
Some corpora, such as the Penn Treebank, require their annotators to meet whenever there is a conflict so that it can be resolved before the corpus is publicly released. Penn has now begun a Discourse Treebank as well (Creswell et al., 2003). Wiebe et al. (1999) use statistical methods to automatically correct biases in annotations of speaker subjectivity; the corrections are then used as a basis for further conflict resolution. Carlson et al. (2001) also used conflict resolution when creating their discourse-tagged corpus. One interesting area of research would be to compare how annotators choose to resolve their conflicts with the different automatic approaches to finding a gold standard. It is possible that the compromises made by the annotators cannot be captured by any computational method, in which case it may be worth having all conflicts resolved manually.</Paragraph> </Section> </Paper>