Unifying Annotated Discourse Hierarchies to Create a Gold Standard
Marco Carbone, Ya’akov Gal, Stuart Shieber, and Barbara Grosz
Division of Engineering and Applied Sciences
Harvard University
33 Oxford St.
Cambridge, MA 02138
mcarbone,gal,shieber,grosz@eecs.harvard.edu
Abstract
Human annotation of discourse corpora typi-
cally results in segmentation hierarchies that
vary in their degree of agreement. This paper
presents several techniques for unifying multi-
ple discourse annotations into a single hierar-
chy, deemed a “gold standard” — the segmen-
tation that best captures the underlying linguis-
tic structure of the discourse. It proposes and
analyzes methods that consider the level of em-
beddedness of a segmentation as well as meth-
ods that do not. A corpus containing annotated
hierarchical discourses, the Boston Directions
Corpus, was used to evaluate the “goodness”
of each technique, by comparing the similar-
ity of the segmentation it derives to the original
annotations in the corpus. Several metrics of
similarity between hierarchical segmentations
are computed: precision/recall of matching ut-
terances, pairwise inter-reliability scores (a0 ),
and non-crossing-brackets. A novel method for
unification that minimizes conflicts among an-
notators outperforms methods that require con-
sensus among a majority for the a0 and precision
metrics, while capturing much of the structure
of the discourse. When high recall is preferred,
methods requiring a majority are preferable to
those that demand full consensus among anno-
tators.
1 Introduction
The linguistic structure of a discourse is composed of
utterances that exhibit meaningful hierarchical relation-
ships (Grosz and Sidner, 1986). Automatic segmentation
of discourse forms the basis for many applications, from
information retrieval and text summarization to anaphora
resolution (Hearst, 1997). These automatic methods, usu-
ally based on supervised machine learning techniques,
require a manually annotated corpus of data for train-
ing. The creation of these corpora often involves multiple
judges annotating the same discourses, so as to avoid bias
from using a single judge’s annotations as ground truth.
Usually, for a particular discourse, these multiple annota-
tions are unified into a single annotation, either manually
by the annotators’ discussions or automatically. How-
ever, annotation unification approaches have not been for-
mally evaluated, and although manual unification might
be the best approach, it can be time-consuming. In-
deed, much of the work on automatic recognition of dis-
course structure has focused on linear, rather than hi-
erarchical segmentation (Hearst, 1997; Hirschberg and
Nakatani, 1996), because of the difficulties of obtain-
ing consistent hierarchical annotations. In addition, those
approaches that do handle hierarchical segmentation do
not address automatic unification methods (Carlson et al.,
2001; Marcu, 2000).
There are several reasons for the prevailing emphasis
on linear annotation and the lack of work on automatic
methods for unifying hierarchical discourse annotations.
First, initial attempts to create annotated hierarchical cor-
pora of discourse structure using naive annotators have
met with difficulties. Rotondo (1984) reported that “hi-
erarchical segmentation is impractical for naive subjects
in discourses longer than 200 words.” Passonneau and
Litman (1993) conducted a pilot study in which subjects
found it “difficult and time-consuming” to identify hi-
erarchical relations in discourse. Other attempts have
had more success using improved annotation tools and
more precise instructions (Grosz and Hirschberg, 1992;
Hirschberg and Nakatani, 1996). Second, hierarchical
segmentation of discourse is subjective. While agreement
among annotators regarding linear segmentation has been
found to be higher than 80% (Hearst, 1997), with respect
to hierarchical segmentation it has been observed to be
as low as 60% (Flammia and Zue, 1995). Moreover, the
precise definition of “agreement” with respect to hierar-
chical segmentation is unclear, complicating evaluation.
It is natural to consider two segments in separate annota-
tions to agree if they both span precisely the same utter-
ances and agree on the level of embeddedness. However,
it is less clear how to handle segments that share the same
utterances but differ with respect to the level of embed-
dedness.
In this paper, we show that despite these difficulties it is
possible to automatically combine a set of multi-level dis-
course annotations together into a single gold standard, a
segmentation that best captures the underlying linguis-
tic structure of the discourse. We aspire to create cor-
pora analogous to the Penn Treebank in which a unique
parse tree exists for each sentence that is agreed upon by
all to convey the “correct” parse of the sentence. How-
ever, whereas the Penn Treebank parses are determined
through a time-consuming negotiation between labelers,
we aim to derive gold standard segmentations automati-
cally.
There are several potential benefits for having a unify-
ing standard for discourse corpora. First, it can constitute
a unique segmentation of the discourse that is deemed the
nearest approximation of the true objective structure, as-
suming one exists. Second, it can be used as a single uni-
fied version with which to train and evaluate algorithms
for automatic discourse segmentation. Third, it can be
used as a preprocessing step for computational tasks that
require discourse structure, such as anaphora resolution
and summarization.
In this work, we describe and evaluate several ap-
proaches for unifying multiple hierarchical discourse seg-
mentations into one gold standard. Some of our ap-
proaches measure the agreement between annotations by
taking into account the level of embeddedness and others
ignore the hierarchy. We also introduce a novel method,
called the Conflict-Free Union, that minimizes the num-
ber of conflicts between annotations. For our experi-
ments, we used the Boston Directions Corpus (BDC).1
Ideally, each technique would be evaluated with re-
spect to a single unified segmentation of the BDC that
was deemed “true” by annotators who are experts in dis-
course theory, but we did not have the resources to at-
tempt this task. Instead, we measure each technique by
comparing the average similarity between its gold stan-
dard and the original annotations used to create it. Our
similarity metrics measure both hierarchical and linear
segment agreement using precision/recall metrics, inter-
reliability similarities among annotations using the (a0 )
metric, and percentage of non-crossing-brackets.
We found that there is no single approach that does
1The Boston Directions Corpus was designed and collected
by Barbara Grosz, Julia Hirschberg, and Christine H. Nakatani.
well with respect to all of the similarity metrics. How-
ever, the Conflict-Free Union approach outperforms the
other methods for the a0 and precision metrics. Also, tech-
niques that include majority agreements of annotators
have better recall than techniques which demanded full
consensus among annotators. We also uncovered some
flaws in each technique; for example, we found that gold
standards that include dense structure are over-penalized
by some of the metrics.
2 Methods for Creating a Gold Standard
It is likely that there is no perfect way to find and evaluate
a gold standard, and in some cases there may be multiple
segmentations that are equally likely to serve as a gold
standard. In the BDC corpus, unlike the Penn Treebank,
there are multiple annotations for each discourse which
were not manually combined into one gold standard an-
notation. In this paper, we explore automatic methods to
create a gold standard for the BDC corpus. These meth-
ods could also be used on other corpora with non-unified
annotations. Next, we present several automatic methods
to combine multiple human-annotated discourse segmen-
tations into one gold standard.
2.1 Flat vs. Hierarchical Approaches
Most previous work that has combined multiple an-
notations has used linear segmentations, i.e. dis-
course segmentations without hierarchies (Hirschberg
and Nakatani, 1996). In general, the hierarchical nature
of discourse structure has not been considered when com-
puting labeler inter-reliability and in evaluations of agree-
ment with automatic methods. Since computational dis-
course theory relies on the hierarchy of its segments, we
will consider it in this paper. For each approach that fol-
lows, we consider both a “flat” version, which does not
consider level of embeddedness, and a “full” approach,
which does. We analyze the differences between the flat
and full versions for each approach.
2.2 Segment Definition
A discourse is made up of a sequence of utterances,
a0a2a1a4a3a5a0a7a6a8a3a9a0a11a10a12a3a14a13a15a13a16a13a15a3a5a0a7a17 . In this paper, we define a segment as
a triple a18a20a19 a3a22a21a23a3a9a24a22a25 , where a0a7a26 is the first utterance in the seg-
ment, a0a28a27 is the last utterance in the segment, and a24 is the
segment’s level of embeddedness.2 We will sometimes
refer to a0a7a26 and a0a28a27 as boundary utterances. Lastly, when
we are not interested in level of embeddedness, we will
sometimes refer to a segment as a18a20a19 a3a22a21a29a25 , without the a24 value.
2.3 The Consensus Approach
A conservative way to combine segmentations into a gold
standard is the Consensus (CNS) (or raw agreement) ap-
2The levels are numbered from top to bottom; hence, 1 is the
level of the largest segment, 2 is the level below that, and so on.
C
EA B
D
F
G
H I
J
1 2 3 GS
Figure 1: An example of the FullCNS approach. FlatCNS
would create the same gold standard. The segments in the
annotations that are marked in bold are those selected by
the gold standard.
F H
I
J KA B
C
D
E
G M
L
Flat GS1 2 3 Full GS
Figure 2: An example where FullCNS and FlatCNS cre-
ate different gold standards..
proach (Hirschberg and Nakatani, 1996). CNS constructs
a gold standard by including only those segments which
have been included by every annotator. In the “full” ver-
sion of CNS (FullCNS), the annotators need to agree
upon the embedded level of the segment along with the
segment boundaries. The “flat” version (FlatCNS) ig-
nores hierarchy and considers only the segment bound-
aries when determining agreement.
Figure 1 shows an example of performing FullCNS
on three annotations. In the figure, all three annotators
agree on only the largest segment (segments A, E, and
H). Hence, the gold standard includes only that single
segment. FlatCNS gives the same gold standard in this
example as there are no two segments with the same
boundaries but with different levels of embeddedness.
In Figure 2, we see an example where the gold stan-
dards created by FlatCNS and FullCNS differ. Aside
from the largest segment, FullCNS contains only the seg-
ment representing the agreement of segments D, G, and
M. FlatCNS includes that segment as well, in addition
to two more from the agreement of segments B, H, and K
and segments C, I, and L. FullCNS does not include those
segments because the segments occur at differing levels
of embeddedness.
2.4 Majority Consensus
A straightforward extension to the CNS approach is
to relax the need for full agreement and include
those segments on which a majority of the annotators
agreed (Grosz and Hirschberg, 1992). Other thresholds
of agreement could be used as well, but in this paper we
C
EA B
D
F
G
H I
J
1 2 3 GS
Figure 3: An example of the FullMCNS approach.
only consider it an agreement when more than 50% of
annotators agree on a segment. We call this the Major-
ity Consensus (MCNS) approach. As with CNS, we can
have both a “full” version (FullMCNS) and a “flat” ver-
sion (FlatMCNS), which do and do not consider the level
of embeddedness, respectively.
Figure 3 shows an example of performing FullMCNS
on the same three annotations we saw in Figure 1. Here,
we again include the largest segment because it is agreed
upon by all, but now we also include the two segments
agreed upon by annotators 1 and 3 because two out of
three annotators, a majority, have selected them. These
two segments correspond to segments B and C for anno-
tator 1 and segments I and J for annotator 3.
MCNS is less strict than CNS as it includes segments
agreed upon by most annotators and does not require full
agreement, but both methods are affected by a potential
flaw. Note that in Figure 3, segment D could very well
be in some notion of agreement with annotation 3, but
MCNS does not capture this near-miss; D is left out of the
gold standard. The next approach we discuss can handle
this sort of situation.
2.5 Conflict-Free Union
The Conflict-Free Union (CFU) approach combines the
annotations of all of the annotators and removes those
segments that conflict with each other to get a conflict-
free gold standard. There are usually multiple ways to
construct a conflict-free gold standard. The CFU ap-
proach finds the one with the fewest segments removed.
Figure 4 shows the use of CFU on the three example
annotations. Notice that the only segments not included
in the gold standard are F and G, which conflict with B,
C, D, I, and J. Resolving the conflicts here required re-
moving two segments; the other way to resolve the con-
flict would have been to remove C and J, which would
be equally as good. CFU captures as many conflict-free
segments from the annotations as possible without dis-
crimination. Even if only one annotator chose a segment,
CFU would include it if it did not create more conflicts.
Hence, it is likely that CFU could construct gold stan-
dards with too much structure. However, in our example
it is better at capturing the similarity of structure between
annotators 1 and 3. Due to its ability to capture structure,
C
EA B
D
F
G
H I
J
1 2 3 GS
Figure 4: An example of the CFU approach.
we expected that CFU would perform better in recall than
the previously mentioned approaches.
2.5.1 CFU Algorithm
The consensus and majority approaches are straight-
forward to compute, but CFU presents an optimization
problem in which the greatest number of segments that
can be combined without any internal conflicts must be
found. Brute force methods, such as trying every possible
set of segments and picking the largest conflict-free set,
grow exponentially in the total number of segments con-
tained in the annotations. We present a dynamic program-
ming algorithm that computes the CFU in a0a2a1a4a3a6a5a8a7 time,
where a3 is the number of utterances in the discourse.
First, we say that segment a9a4a10a12a11a14a13a16a15 straddles an utterance
a17a19a18 when
a10a21a20a23a22a24a20a25a13 . Let a26a19a27a29a28
a18 represent the number
of segments between utterances a17 a27 and a17 a28 , inclusive, that
straddle a17a19a18 . That is, a26a19a27a29a28 a18 represents the number of unique
segments with the form a9a4a30a6a11a32a31a33a15 , where a10a34a20a35a30a36a20a37a22a38a20a35a31a39a20a40a13
and a30a36a41a42a31 . We use a26a43a27a29a28 a18 to compute a44a38a27a29a28 , the index repre-
senting the utterance a17 a18 that, if considered a new bound-
ary utterance, would minimize the number of conflicting
segments within a17 a27 and a17 a28 , and a45 a27a29a28 , that minimum num-
ber of segments. Then we can solve the following recur-
rence equations:
a45a19a27a46a27a48a47a37a49
a45a50a27a29a28a51a47
a28
a52a2a53a46a54
a18a56a55
a27
a1a4a45a19a27
a18a58a57
a45a60a59
a18a56a61a60a62a64a63
a28
a57
a26a19a27a65a28
a18
a7 where a13a39a66a67a10
a57a37a68
a44a39a27a29a28a51a47a37a69a71a70a73a72
a28
a52a2a53a74a54
a18a56a55
a27
a1a4a45a19a27
a18a75a57
a45a60a59
a18a56a61a48a62a76a63
a28
a57
a26a19a27a29a28
a18
a7
We can generate a binary tree with a9 a68 a11a78a77a79a15 as the value
of the root node, representing the segment over all utter-
ances. The left child, then, has the value a9 a68 a11a73a44 a62a76a80 a15 , and
the right child has value a9a81a44 a62a64a80a35a57a82a68 a11a78a77a79a15 . We compute the
rest of the tree similarly, with a9 a68 a11a78a44 a62a76a83a85a84a4a86 a15 as the left child
of the left child, and so on. For each segment included by
an annotator, we include it in the gold standard if it is
represented by a segment in the constructed tree. Note
that we only store the boundary utterances in the tree, so
the gold standard we construct will not include level of
embeddedness.
C
EA B
D
F
G
H I
J
1 2 3 GS
Figure 5: An example of the FullUNI approach.
2.6 Union
The methods for finding a gold standard in Sections 2.1-
2.4 produce segmentations that contain no internal con-
flicts.3 However, since we evaluate a gold standard by its
similarity with the original annotations, it makes sense to
define an approach that is capable of constructing a uni-
fied segmentation that includes conflicts, which we call
the Union approach (UNI). UNI simply includes every
segment from every annotator. The flat version ignores
hierarchies (FlatUNI), and the full version includes them
(FullUNI).
An example of an application of FullUNI is given in
Figure 5. We see that every segment chosen by annotators
1, 2, and 3 have been included in the gold standard, cre-
ating some internal conflicts. We certainly would not ex-
pect to use this construction as a prediction of the actual
gold standard, but we include it for comparison with CFU
in evaluating the importance of avoiding internal conflicts
with respect to our metrics.
2.7 Best Annotator
The final approach we considered chooses the “best” an-
notation and considers it to be the gold standard. We se-
lect the annotation with the highest inter-labeler reliabil-
ity with all the other annotations to be the “best” annota-
tion, using the pairwise a87 metric. We discuss this metric
and its uses in Section 3.
3 Measures of Evaluation
There are several ways of evaluating an algorithm for
creating a gold standard, just as there are several ways
of evaluating any segmentation algorithm. Ideally, we
would like to compare to some objectively true gold stan-
dard, but it is impossible to determine if there are one or
more true standards, or even if one exists. Instead, we
can compare a gold standard against each annotator’s in-
dividual structuring, or against that of several human an-
notators collectively. Also, we could compare gold stan-
dards with each other in terms of how they affect the out-
3MCNS avoids conflicts because any two segments that a
majority of annotators agree upon will always both be included
by at least one annotator, and we assume that individual anno-
tations are always internally consistent.
21
Figure 6: The a0 metric would consider these two segmen-
tations in agreement.
come of some computational task which considers dis-
course structure, such as anaphora resolution. This last
approach is probably the best when the purpose of the
gold standard is known in advance, but in this paper we
consider only task-independent metrics.
For the sake of scientific validity, we did not compare
a gold standard with a segmentation of our own. Instead,
we chose to evaluate gold standards by averaging their
similarity to the original segmentations made by human
annotators. For each approach we presented earlier, we
report an average similarity score over all original seg-
mentations and the gold standard, based on several dif-
ferent quantitative measures of inter-reliability.
3.1 Pairwise Agreement Scores
For linear segmentation, pairwise agreement between an-
notators is computed by dividing the number of utter-
ances for which both annotators agree by the total number
of utterances. In contrast, a hierarchical segmentation for
a sequence of utterance in a discourse is analogous to a
parse tree for a sequence of words. It requires a different
metric for pairwise agreement that considers the hierar-
chy.
Following Flammia and Zue (1995), we define a gen-
eral symmetric metric a1a3a2 for observed agreement be-
tween two segments that accounts both for deletions and
for insertions of segments in a hierarchy. Intuitively,
we want different sub-trees that vary only in hierarchical
structure but share the same boundaries to0 be considered
similar. For example, in Figure 6, there is good reason to
consider both annotations to be similar, even though no
segment pair in either spans the same utterances.
Formally, let a26 a62 and a26a5a4 be two possible segmentations.
A segment a9a4a10a12a11 a13 a15 in a26 a62 matches with segmentation a26a5a4 if
there exists some segment a9a4a10a12a11a7a6a71a15 or a9a8a6a16a11a78a10a10a9 a68 a15 in a26a11a4 and
there exists some segment a9a12a6a16a11 a13 a15 or a9a46a13 a57 a68 a11a7a6 a15 in a26 a4 . In
other words, a segment in a26 a62 matches a segmentation a26 a4
if the utterances that constitute its boundaries also con-
stitute boundaries for some segment in a26 a4 . For example,
in Figure 5, we consider that the segments a13 a11a15a14 a11 and a16
in annotation 3 match the segments a17 a11a19a18a39a11a21a20a51a11 and a22 in
annotation 1.
Flammia and Zue then let a0a24a23 a84 be the number of seg-
ments in a26 a62 that match with segments in a26a11a4 and let a0a25a23a27a26
be the number of segments in a26 a4 that match with seg-
ments in a26 a62 . a77 a23 a84 and a77 a23 a26 are the number of segments in
a26
a62 and
a26 a4 respectively. Following Bakeman and Gottman
(1986), they define the observed agreement to be
a1 a2
a47
a0a28a23
a84 a57
a0a25a23a29a26
a77 a23
a84 a57
a77 a23 a26
For the metric to be valid, they also take into account
the probability of chance agreement between annotators.
For example, if the distribution underlying the segmenta-
tion is skewed such that the structure is very sparse, most
segmentations will include very few constituents, and a1a10a2
will be unnaturally deflated.
The kappa coefficient (a0 ) is used for correcting the ob-
served agreement by subtracting the probability a1a31a30 that
two segments in a26 a62 and a26a5a4 , chosen at random, happen to
agree. The a0 coefficient is computed as follows:
a0 a47
a1a32a2
a9
a1 a30
a68
a9
a1a32a30
Carletta (1996) reports that content analysis researchers
generally think of a0a34a33 a49a36a35a37 as “good reliability,” with
a49a36a35a38a40a39a37a41 a0 a41a25a49a36a35a37 allowing “tentative conclusions to be
drawn.”
All that remains is to define the chance agreement
probability a1 a30 . Let a1a32a41 a1 a30 a7 and a1a32a42 a1 a30 a7 be the fraction of
utterances that begin or end one or more segments in seg-
mentation a30 respectively. Flammia and Zue compute an
upper bound on a1a3a30 as
a1a43a30
a47
a77a44a23
a84
a77 a23
a84 a57
a77 a23 a26
a1 a41
a1 a10a64a7
a1 a42
a1a4a10a64a7
a1a32a45
a1a74a13a16a7
a57
a77 a23 a26
a77a44a23
a84 a57
a77a44a23a29a26
a1 a41
a1a46a13 a7
a1 a42
a1a74a13 a7
a1a32a45
a1a4a10a64a7
where a1 a45 a1 a30 a7 a47 a1 a1a32a41 a1a4a30a50a7 a57 a1a32a42 a1 a30 a7a3a9 a1a32a41 a1 a30 a7 a1a32a42 a1 a30 a7a78a7 a4 .
3.2 Precision and Recall
We use standard evaluation metrics from information re-
trieval to measure pairwise agreement between gold stan-
dards and annotations. We say that segment a46 a47 a9 a10a73a11a14a13 a11a7a6 a15
in some segmentation flatly agrees with segmentation a26
if there exists a segment a47 a47 a9 a10a73a11a14a13 a11a7a6 a15 in a26 , which spans
exactly the same utterances as a46 . We say that segment
a46 a47 a9 a10a12a11 a13 a11a19a48 a15 in some segmentation fully agrees with seg-
mentation a26 if a46 flatly agrees with a26 , and the segment
that fits it is also of the same depth as a46 ; i.e., there exists
a segment a47 a47 a9 a10a12a11 a13 a11a19a48 a15 in a26 .
We define the number of relevant segments in a seg-
mentation a26 to be the total number of segments in a26
that agree with a gold standard for that particular dis-
course. For gold standard types that consider embed-
dedness, such as Full Consensus and Full Majority Con-
sensus, we check for full agreement. For gold standard
types that do not, such as Flat Consensus and Conflict-
Free Union, we consider flat agreement.
We define recall as the number of relevant segments
in a0 divided by the total number of segments in a0 . We
define precision as the number of relevant segments in a0
divided by the total number of segments in the gold stan-
dard. Intuitively, if a gold standard has low agreement
with the original segmentation, recall will be low. If a
gold standard’s structure is more dense then the original
segmentation, precision will be low.
3.3 Non-Crossing-Brackets
The non-crossing-bracket measure is a common perfor-
mance metric used in syntactic parsing for measuring
hierarchical structure similarity. A segment constituent
a1a3a2
a18a20a19
a3a22a21a29a25 in some segmentation crosses brackets with
segmentationa0 ifa1 spans at least one boundary utterance
in a0 .
For each segmentation a0 , we define the number of
non-crossing-brackets as the number of segments in a0
that do not exhibit crossing brackets with the appropri-
ate gold standard. For each segmentation, we compute a
non-crossing-bracket percentage by dividing the number
of non-crossing-brackets by the total number of bracket
pairs.
4 Empirical Methodology
4.1 Boston Directions Corpus
For our empirical analysis of different gold standard ap-
proaches, we used the Boston Directions Corpus (BDC).
The BDC corpus contains transcribed monologues by
speakers who were instructed to perform a series of
direction-giving tasks. The monologues were subse-
quently annotated by a group of subjects according to the
Grosz and Sidner (1986) theory of discourse structure.
This theory provides a foundation for hierarchical seg-
mentation of discourses into constituent parts. Some of
the subjects were experts in discourse theory and others
were naive annotators. In our experiments here, we only
consider the annotations from experts.
4.2 Experimental Design
Our experiments were run on 12 discourses in the sponta-
neous speech component of the BDC. The lengths of the
discourses ranged from 15 to 150 intonational phrases.
Each discourse was segmented by three different annota-
tors, resulting in 36 separate annotations. For each dis-
course, we combined the three annotations into a gold
standard according to each technique described in Sec-
tion 2. We then proceeded to compute the similarity be-
tween the gold standard and each of the original annota-
tions by using the pairwise evaluation metrics described
in Section 3.
Figure 7: Results — Pairwise Agreement Scores
4.3 Results
We report results for each gold standard averaged over
all 36 annotations. Table 1 presents precision/recall per-
centages for pairwise agreement scores, as well as a0 val-
ues and non-crossing brackets (NCB) percentages. Fig-
ure 7 plots the pairwise agreement precision/recall values
on a graph, with error bars indicating one standard de-
viation from the mean. Recall that the gold standards
we are comparing are Full Consensus (FullCNS), Flat
Consensus (FlatCNS), Full Majority Consensus (FullM-
CNS), Flat Majority Consensus (FlatMCNS), Conflict
Free Union (CFU), Full Union (FullUNI), Flat Union
(FlatUNI) and Best Annotator.
Our results show that CFU, FullUNI and FlatUNI all
achieved high a0 scores and low variance. Both Full
and Flat Consensus scored the worst. This pattern was
also apparent with regard to agreement between the gold
standard and the annotations. Again, CFU, FullUNI
and FlatUNI achieved the best recall, and FullCNS and
FlatCNS scored the worst recall. It is interesting to point
out that since any segmentation proposed by an evaluator
will always be included in the FullUNI gold standard, its
agreement recall will always be 1.
We see a change in trend with regard to precision be-
tween gold standard and the annotations. Here, FullCNS
and FlatCNS achieved very high precision, while Full-
UNI and FlatUNI achieved low precision. CFU’s preci-
sion was slightly better. With respect to the non-crossing-
brackets metrics, the gold standards based on consensus
(FullCNS, FlatCNS) did not clash at all with any anno-
tation, since any segment in the gold standard is present
in each of the annotations. Of the remaining methods,
FlatMCNS (0.84) and FullMCNS (0.81) had the high-
est percentage of non-crossing-brackets, while the union
based approaches, FullUNI (0.47) and FlatUNI (0.54)
a0 Agreement Rec. Agreement Pre. NCB
ave. sd. ave. sd. ave. sd. ave. sd
FullCNS .25 .63 .15 .34 1 0 1 0
FlatCNS .48 .42 .27 .32 .81 .60 1 0
FullMCNS .67 .32 .43 .39 .75 .25 .81 .32
FlatMCNS .79 .21 .59 .20 .71 .89 .84 .12
CFU .84 .08 .91 .89 .50 .11 .78 .19
FullUNI .84 .08 1 0 .44 .16 .53 .20
FlatUNI .84 .09 .98 .01 .49 .06 .46 .09
BestAnnotator .85 .15 .55 .45 .52 .47 .62 .33
Table 1: Experimental results.
had the lowest, because their gold standards are densely
structured and internally include conflicts.
Looking at each gold standard separately, we do not
identify a single gold standard that does well across the
board. CFU, FullUNI and FlatUNI have high a0 and
agreement recall values, but they all have low agree-
ment precision values. FullCNS and FlatCNS have low a0
and recall values, but better agreement precision values.
FullMCNS and FlatMCNS average out the best across all
metrics, but they do not achieve the best performance in
any of the metrics. Note that “full” type methods require
agreement in hierarchy; they are held to a higher standard
of evaluation than “flat” type methods.
5 Discussion
From the results, we see that generally the consensus-
type approaches (CNS and MCNS) perform very well
with the precision metric and the union approaches (CFU,
UNI) perform well with the recall and a0 metrics. Preci-
sion measures the percentage of the gold standard that
was agreed upon by the annotators, and since the con-
sensus approaches tend to include only those segments
labeled by everyone, they have high precision. Specifi-
cally, FullCNS performs perfectly in precision because it
contains only those segments explicitly included by ev-
eryone, while the majority consensus methods perform
slightly worse because an annotator is occasionally in the
minority.
Recall measures the percentage of the annotator’s seg-
ments captured by the gold standard. Since the union
approaches include every or almost every segment, de-
pending on whether it is “flat” or “full,” respectively, an
annotator’s segment is almost always included in the gold
standard, yielding high recall for these methods. The dif-
ference between precision and recall highlights two dif-
ferent approaches: precision encourages a bottom-up ap-
proach where the most likely segments to be included in
the gold standard are added from scratch; recall encour-
ages a top-down approach where all possible segments
are added and the least likely segments to be included in
the gold standard are pruned. The a0 metric attempts to
balance these two approaches by rewarding agreements
yet penalizing extra structure. Nevertheless, even the
naive union methods (UNI) performs well with a0 , indi-
cating that it favors agreement far more than it punishes
extra structure.
Based on these observations, we believe that there is
good reason to prefer to use CFU as a gold standard over
FullUNI and FlatUNI. Although they all have the same
a0 and similar precision/recall values, the CFU gold stan-
dard corresponds to a true segmentation — it does not
exhibit internal conflicts.
However, if a conservative but accurate gold standard
is desired, then the MCNS approaches are the best all-
around consensus approaches to use, as they perform
fairly well with a0 as well as with precision and recall.
These approaches construct fairly conservative gold stan-
dards, but not nearly as strict as the full consensus ap-
proaches. Hence, as seen by the high precision value,
a gold standard constructed by an MCNS method will
contain mostly relevant segments but will be missing the
more controversial segments.
The Best Annotator approach performed very well
with a0 , but not as well with respect to precision and re-
call. Its performance was completely dominated by the
MCNS approaches in all metrics, except for a0 . In gen-
eral, a0 is at its highest when minor boundary disagree-
ments are infrequent, because it is not sensitive to the
exact type of matching boundaries. This phenomenon
is shown in Figure 6. There, we see two segmentations
that are clearly different but are considered the same by
a0 . Precision and recall, however, would not consider the
second level segments in agreement.
The consistently good results of the non-crossing-
brackets metric for MCNS and CFU indicate that there
are few cases in which the expert BDC annotators cre-
ate segments whose boundaries cross. Again, this effect
is probably a result of the well-structured nature of the
tasks in the BDC discourses. Since there are few cross-
ing boundaries, the a0 metric performs well for the Union
and Best Annotator methods since almost every bound-
ary is represented. If annotations had exhibited more dis-
crepancy, the non-crossing-brackets and a0 metrics would
probably differentiate more among these approaches.
Lastly, we note that the difference between “full” and
“flat” metrics of the same type were insignificant, but
with the consensus approaches, the “flat” approaches per-
formed slightly better than their “full” counterparts, most
likely because the “full” approaches were too conserva-
tive in demanding level agreement. Thus, if we care not to
have conflicts in our gold standard, the “full” approaches
should be used to find the gold standard, as they produce
more structured segmentations. In addition, a gold stan-
dard with labeled embeddedness might be necessary for
post-segmentation processing, such as anaphora resolu-
tion. However, if the gold standard is being used for
purely evaluative reasons, the “flat” approaches should
be used as they perform slightly better.
6 Future Work
One problem with the measures of evaluation that we
have explored in this paper is that they tell us how similar
a gold standard is to the original annotations but say noth-
ing about how effective the gold standard would be when
used for further discourse processing. One suggestion
for future studies would be to evaluate the gold standards
with respect to possible post-segmentation tasks, such as
anaphora resolution or summarization. Such an approach
would be a better measure of the objective goodness of
the gold standard and could also be a way to monitor
the skills of the annotators. Specific metrics might also
be more relevant for a specific discourse task. For in-
stance, perhaps non-crossing-brackets is a more useful
metric to consider when segmenting as a preprocessing
step for anaphora resolution.
It would also be interesting to further explore the
Conflict-Free Union approach, as it performed well but
suffered from including extra structure. The top-down
processing could be enhanced by removing those seg-
ments which are deemed the least probable to be in the
gold standard, perhaps based on some features, such as
depth. For example, perhaps a segment that is at a deep
level and is only in a few annotations could get removed,
while larger segments would remain regardless, or vice
versa. With a few good features, it seems quite possi-
ble to increase the precision of CFU. A similar approach
could be taken to add new segments to those picked in the
Majority Consensus approach.
Finally, it is worth exploring whether it is a good idea
to have multiple annotations for a given corpus in the first
place. Some corpora, such as the Penn Treebank, require
its annotators to meet whenever there is a conflict so that
the conflict can be resolved before the corpus is publicly
released. Penn has now begun a Discourse Treebank as
well (Creswell et al., 2003). Wiebe et al. (1999) use sta-
tistical methods to automatically correct the biases in an-
notations of speaker subjectivity. The corrections are then
used as a basis for further conflict resolution. Carlson
et al. (2001) also used conflict resolution when creating
their discourse-tagged corpus. One interesting area of re-
search would be to compare how annotators choose to re-
solve their conflicts compared to the different automatic
approaches of finding a gold standard. It is possible that
the compromises made by the annotators cannot be cap-
tured by any computational method, in which case it may
be worth having all conflicts resolved manually.
Acknowledgments
We would like to thank Jill Nickerson for comments on
an earlier draft of this paper. This work is supported in
part by the National Science Foundation under Grant No.
IIS-0222892, the GK-12 Fellowship (NSF 04-533), and
NSF Career Award IIS-0091815.

References
R. Bakeman and J.M. Gottman. 1986. Observing inter-
actions: an introduction to sequential analysis. Cam-
bridge University Press.
J. Carletta. 1996. Assessing agreement on classification
tasks: The kappa statistic. Computational Linguistics,
22(2).
L. Carlson, D. Marcu, and M. E. Okurowski. 2001.
Building a discourse-tagged corpus in the framework
of rhetorical structure theory. In Proc. of the 2nd
SIGDIAL Workshop on Discourse and Dialogue, Eu-
rospeech 2001, Denmark, September.
C. Creswell, K. Forbes, E. Miltsakaki, R. Prasad,
A. Joshi, and B. Webber. 2003. Penn discourse tree-
bank: Building a large scale annotated corpus encod-
ing dltag-based discourse structure and discourse rela-
tions. Manuscript [fix this].
G. Flammia and V. Zue. 1995. Empirical evalua-
tion of human performance and agreement in parsing
discourse constituents in spoken dialogue. In Proc.
Euroospeech-95, volume 3, pages 1965–1968, Madrid,
Spain.
B.J. Grosz and J. Hirschberg. 1992. Some intona-
tional characteristics of disourse structure. In Proc. of
ICSLP-92, volume 1.
B.J. Grosz and C.L. Sidner. 1986. Attention, intentions
and the structure of discourse. Computational Linguis-
tics, 12:175–204.
M. Hearst. 1997. Texttiling: Segmenting text into multi-
paragraph subtopic passages. Computational Linguis-
tics, 23:33–64.
J. Hirschberg and C. Nakatani. 1996. A prosodic anal-
ysis of discourse segments in direction-giving mono-
logues. In Proc. of ACL-1996, Santa Cruz.
D. Marcu. 2000. The Theory and Practice of Discourse
Parsing and Summarization. The MIT Press, Novem-
ber.
R.J. Passonneau and D.J. Litman. 1993. Intention-based
segmentation: Human reliability and correlation with
linguistic cues. In Meeting of the Association for Com-
putational Linguistics, pages 148–155.
J.A. Rotondo. 1984. Clustering analysis of subject parti-
tions of text. Discourse Processes, 7:69–88.
J. Wiebe, R. Bruce, and T. O’Hara. 1999. Development
and use of a gold-standard data set for subjectivity clas-
sifications. In Proc. 37th Annual Meeting of the Assoc.
for Computational Linguistics (ACL-99), pages 246–
253, University of Maryland, June.
