<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2323"> <Title>Unifying Annotated Discourse Hierarchies to Create a Gold Standard</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The linguistic structure of a discourse is composed of utterances that exhibit meaningful hierarchical relationships (Grosz and Sidner, 1986). Automatic segmentation of discourse forms the basis for many applications, from information retrieval and text summarization to anaphora resolution (Hearst, 1997). These automatic methods, usually based on supervised machine learning techniques, require a manually annotated corpus of data for training. The creation of these corpora often involves multiple judges annotating the same discourses, so as to avoid bias from using a single judge's annotations as ground truth.</Paragraph> <Paragraph position="1"> Usually, for a particular discourse, these multiple annotations are unified into a single annotation, either manually, through discussion among the annotators, or automatically. However, annotation unification approaches have not been formally evaluated, and although manual unification might be the best approach, it can be time-consuming. Indeed, much of the work on automatic recognition of discourse structure has focused on linear, rather than hierarchical, segmentation (Hearst, 1997; Hirschberg and Nakatani, 1996), because of the difficulties of obtaining consistent hierarchical annotations. 
In addition, those approaches that do handle hierarchical segmentation do not address automatic unification methods (Carlson et al., 2001; Marcu, 2000).</Paragraph> <Paragraph position="2"> There are several reasons for the prevailing emphasis on linear annotation and the lack of work on automatic methods for unifying hierarchical discourse annotations.</Paragraph> <Paragraph position="3"> First, initial attempts to create annotated hierarchical corpora of discourse structure using naive annotators have met with difficulties. Rotondo (1984) reported that &quot;hierarchical segmentation is impractical for naive subjects in discourses longer than 200 words.&quot; Passonneau and Litman (1993) conducted a pilot study in which subjects found it &quot;difficult and time-consuming&quot; to identify hierarchical relations in discourse. Other attempts have had more success using improved annotation tools and more precise instructions (Grosz and Hirschberg, 1992; Hirschberg and Nakatani, 1996). Second, hierarchical segmentation of discourse is subjective. While agreement among annotators regarding linear segmentation has been found to be higher than 80% (Hearst, 1997), with respect to hierarchical segmentation it has been observed to be as low as 60% (Flammia and Zue, 1995). Moreover, the precise definition of &quot;agreement&quot; with respect to hierarchical segmentation is unclear, complicating evaluation.</Paragraph> <Paragraph position="4"> It is natural to consider two segments in separate annotations to agree if they both span precisely the same utterances and agree on the level of embeddedness. However, it is less clear how to handle segments that share the same utterances but differ with respect to the level of embeddedness. 
In this paper, we show that despite these difficulties it is possible to automatically combine a set of multi-level discourse annotations into a single gold standard, a segmentation that best captures the underlying linguistic structure of the discourse. We aspire to create corpora analogous to the Penn Treebank, in which each sentence has a unique parse tree that all agree conveys the &quot;correct&quot; parse of the sentence. However, whereas the Penn Treebank parses are determined through a time-consuming negotiation between labelers, we aim to derive gold standard segmentations automatically. There are several potential benefits of having a unifying standard for discourse corpora. First, it can constitute a unique segmentation of the discourse that is deemed the nearest approximation of the true objective structure, assuming one exists. Second, it can be used as a single unified version with which to train and evaluate algorithms for automatic discourse segmentation. Third, it can be used as a preprocessing step for computational tasks that require discourse structure, such as anaphora resolution and summarization.</Paragraph> <Paragraph position="5"> In this work, we describe and evaluate several approaches for unifying multiple hierarchical discourse segmentations into one gold standard. Some of our approaches measure the agreement between annotations by taking into account the level of embeddedness, and others ignore the hierarchy. We also introduce a novel method, called the Conflict-Free Union, that minimizes the number of conflicts between annotations. For our experiments, we used the Boston Directions Corpus (BDC). Ideally, each technique would be evaluated with respect to a single unified segmentation of the BDC that was deemed &quot;true&quot; by annotators who are experts in discourse theory, but we did not have the resources to attempt this task. 
Instead, we evaluate each technique by the average similarity between its gold standard and the original annotations used to create it. Our similarity metrics measure both hierarchical and linear segment agreement using precision/recall metrics, inter-annotator reliability using the kappa (κ) metric, and the percentage of non-crossing brackets.</Paragraph> <Paragraph position="6"> We found that no single approach does best on every metric. However, the Conflict-Free Union approach outperforms the other methods for the κ and precision metrics. Also, techniques that accept majority agreement among annotators have better recall than techniques that demand full consensus among annotators. We also uncovered some flaws in each technique; for example, we found that gold standards that include dense structure are over-penalized by some of the metrics.</Paragraph> </Section> </Paper>