Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, pages 126–133,
Sydney, July 2006. © 2006 Association for Computational Linguistics
Measuring annotator agreement in a complex hierarchical dialogue act
annotation scheme
Jeroen Geertzen and Harry Bunt
Language and Information Science
Tilburg University, P.O. Box 90153
NL-5000 LE Tilburg, The Netherlands
{j.geertzen,h.bunt}@uvt.nl
Abstract
We present a first analysis of inter-
annotator agreement for the DIT++ tagset
of dialogue acts, a comprehensive, lay-
ered, multidimensional set of 86 tags.
Within a dimension or a layer, subsets of
tags are often hierarchically organised. We
argue that especially for such highly struc-
tured annotation schemes the well-known
kappa statistic is not an adequate measure
of inter-annotator agreement. Instead, we
propose a statistic that takes the structural
properties of the tagset into account, and
we discuss the application of this statistic
in an annotation experiment. The exper-
iment shows promising agreement scores
for most dimensions in the tagset and pro-
vides useful insights into the usability of
the annotation scheme, but also indicates
that several additional factors influence
annotator agreement. We finally suggest
that the proposed approach for measuring
agreement per dimension can be a good
basis for measuring annotator agreement
over the dimensions of a multidimensional
annotation scheme.
1 Introduction
The DIT++ tagset (Bunt, 2005) was designed to
combine in one comprehensive annotation scheme
the communicative functions of dialogue acts dis-
tinguished in Dynamic Interpretation Theory (DIT,
(Bunt, 2000; Bunt and Girard, 2005)), and many
of those in DAMSL (Allen and Core, 1997) and in
other annotation schemes. An important differ-
ence between the DIT++ and DAMSL schemes is the
more elaborate and fine-grained set of functions
for feedback and other aspects of dialogue control
that is available in DIT, partly inspired by the work
of Allwood (Allwood et al., 1993). As it is often
thought that more elaborate and fine-grained anno-
tation schemes are difficult for annotators to apply
consistently, we decided to address this issue in an
annotation experiment on which we report in this
paper. A frequently used way of evaluating hu-
man dialogue act classification is inter-annotator
agreement. Agreement is sometimes measured as
percentage of the cases on which the annotators
agree, but more often expected agreement is taken
into account in using the kappa statistic (Cohen,
1960; Carletta, 1996), which is given by:
\[ \kappa = \frac{p_o - p_e}{1 - p_e} \qquad (1) \]
where po is the observed proportion of agreement
and pe is the proportion of agreement expected by
chance. Ever since its introduction in general (Co-
hen, 1960) and in computational linguistics (Car-
letta, 1996), many researchers have pointed out
that using κ raises a number of problems (e.g.
Di Eugenio and Glass, 2004), one of which is
the discrepancy between po and κ for skewed class
distributions.
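Equation 1 and the skew problem can be illustrated with a short Python sketch. This is a minimal illustration rather than the implementation used in the experiment; the function name cohen_kappa and the toy label sequences are assumptions introduced here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa (Equation 1) for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement p_o: share of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement p_e, using each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when p_e = 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical, heavily skewed toy data: 18 of 20 items agree (p_o = 0.9),
# but almost all labels are INFORM, so p_e is also high and kappa is near zero.
ann_1 = ["INFORM"] * 18 + ["WHQ", "INFORM"]
ann_2 = ["INFORM"] * 18 + ["INFORM", "WHQ"]
print(cohen_kappa(ann_1, ann_2))
```

The toy data show how a high observed agreement can coexist with a very low κ when one class dominates.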
Another is that the degree of disagreement is
not taken into account, which is relevant for any
non-nominal scale. To address this problem, a
weighted κ has been proposed (Cohen, 1968) that
penalizes disagreements according to their degree
rather than treating all disagreements equally. It
can be argued that, in a similar way, the characteristics
of dialogue acts in a particular taxonomy
and possible pragmatic similarity between them
should be taken into account to express annotator
agreement. For dialogue act taxonomies which are
structured in a meaningful way, such as those that
express hierarchical relations between concepts in
the taxonomy, the taxonomic structure can be ex-
ploited to express how much annotators disagree
when they choose different concepts that are di-
rectly or indirectly related. Recent work that ac-
counts for some of these aspects is a metric for
automatic dialogue act classification (Lesch et al.,
2005) that uses distance in a hierarchical structure
of multidimensional labels.
In the following sections of this paper, we will
first briefly consider the dimensions in the DIT++
scheme and highlight the taxonomic characteristics
that will turn out to be relevant at a later stage.
We will then introduce a variant of weighted κ for
inter-annotator agreement called κtw that adopts
a taxonomy-dependent weighting, and discuss its
use.
2 Annotation using DIT
DIT is a context-change (or information-state up-
date) approach to the analysis of dialogue, which
describes utterance meaning in terms of context
update operations called ‘dialogue acts’. A dia-
logue act in DIT has two components: (1) the se-
mantic content, being the objects, events, proper-
ties, relations, etc. that are considered; and (2)
the communicative function, that describes how
the addressee is intended to use the semantic con-
tent for updating his context model when he un-
derstands the utterance correctly. DIT takes a mul-
tidimensional view on dialogue in the sense that
speakers may use utterances to address several as-
pects of the communication simultaneously, as re-
flected in the multifunctionality of utterances. One
such aspect is the performance of the task or ac-
tivity for which the dialogue takes place; another
is the monitoring of each other’s attention, under-
standing and uptake through feedback acts; others
include for instance the turn-taking process and
the timing of communicative actions, and finally
yet another aspect is formed by the social obli-
gations that may arise such as greeting, apologis-
ing, or thanking. The various aspects of commu-
nication that can be addressed independently are
called dimensions (Bunt and Girard, 2005; Bunt,
2006). The DIT++ tagset distinguishes 11 dimen-
sions, which all contain a number of communica-
tive functions that are specific to that dimension,
such as TURN GIVING, PAUSING, and APOLOGY.
Besides dimension-specific communicative
functions, DIT also distinguishes a layer of
communicative functions that are not specific to
any particular dimension but that can be used
to address any aspect of communication. These
functions, which include questions, answers,
statements, and commissive as well as directive
acts, are called general purpose functions. A
dialogue act falls within a specific dimension
if it has a communicative function specific for
that dimension or if it has a general-purpose
function and a semantic content relating to that
dimension. Dialogue utterances can in principle
have a function (but never more than one) in each
of the dimensions, so annotators using the DIT++
scheme can assign at most one tag for each of the
11 dimensions to any given utterance.
Both within the set of general-purpose com-
municative function tags and within the sets of
dimension-specific tags, tags can be hierarchically
related in such a way that a label lower in a hier-
archy is more specific than a label higher in the
same hierarchy. Tag F1 is more specific than tag
F2 if F1 defines a context update operation that in-
cludes the update operation corresponding to F2.
For instance, consider a part of the taxonomy for
general purpose functions (Figure 1).
INFO.SEEKING
    IND-YNQ
        YNQ
            CHECK
                POSI
                NEGA
    IND-WHQ
        WHQ
            ...
Figure 1: Two hierarchies in the information seek-
ing general purpose functions.
For an utterance to be assigned a YN-QUESTION,
we assume the speaker believes that the addressee
knows the truth value of the proposition presented.
For an utterance to be assigned a CHECK, we as-
sume the speaker additionally has a weak be-
lief that the proposition that forms the seman-
tic content is true. And for a POSI-CHECK, there
is the additional assumption that the speaker be-
lieves (weakly) that the hearer also believes that
the proposition is true.1
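The specificity relation can be made concrete by storing the fragment of Figure 1 as parent links. The sketch below is only an illustration: the dictionary PARENT, the helper names, and the choice to treat IND-YNQ and IND-WHQ as the roots of two separate hierarchies are assumptions, not part of the DIT++ definition.

```python
# Parent links for the fragment in Figure 1 (assumed layout).
# A root has parent None; INFO.SEEKING is treated as a group label only.
PARENT = {
    "IND-YNQ": None,
    "YNQ": "IND-YNQ",
    "CHECK": "YNQ",
    "POSI": "CHECK",
    "NEGA": "CHECK",
    "IND-WHQ": None,
    "WHQ": "IND-WHQ",
}


def ancestors(tag):
    """Return the tags above `tag` in its hierarchy, nearest first."""
    chain = []
    node = PARENT[tag]
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def is_more_specific(f1, f2):
    """True if f1 lies below f2, i.e. f1 includes the update operation of f2."""
    return f2 in ancestors(f1)


print(ancestors("POSI"))
print(is_more_specific("CHECK", "YNQ"), is_more_specific("WHQ", "YNQ"))
```

Under this reading, a POSI-CHECK inherits the conditions of CHECK, YNQ and IND-YNQ, whereas WHQ and YNQ are unrelated.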
Similar to the hierarchical relations between
YN-QUESTION, CHECK, and POSI-CHECK, other parts
of the annotation scheme contain hierarchically related
functions.
1 For a formal description of each function in the DIT++
tagset see http://ls0143.uvt.nl/dit/
The following example illustrates the use of
DIT++ communicative functions for a very simple
(translated) dialogue fragment2.
1 S at what time do you want to travel today?
TASK = WH-Q, TURN-MANAGEMENT = GIVE
2 U at ten.
TASK = WH-A, TURN-MANAGEMENT = GIVE
3 S so you want to leave at ten in the morning?
TASK = POSI-CHECK, TURN-MANAGEMENT = GIVE
4 U yes that is right.
TASK = CONFIRM, TURN-MANAGEMENT = GIVE
3 Agreement using κ
3.1 Related work
Inter-annotator agreements have been calculated
with the purpose of qualitatively evaluating tagsets
and individual tags. For DAMSL, the first agree-
ment results were presented in (Core and Allen,
1997), based on the analysis of TRAINS 91-
93 dialogues (Gross et al., 1993; Heeman and
Allen, 1995). In this analysis, 604 utterances
were tagged by mostly two annotators. Following
the suggestions in (Carletta, 1996), Core and Allen
consider kappa scores above 0.67 to indicate
significant agreement and scores above 0.8
reliable agreement. A more recent analysis
was performed for 8 dialogues of the MONROE
corpus (Stent, 2000), comprising 2897 utterances
in total, processed by two annotators for 13
DAMSL dimensions. Other analyses apply DAMSL
derived schemes (such as SWITCHBOARD-DAMSL)
to various corpora (e.g. Di Eugenio et al., 1998;
Shriberg et al., 2004). For the comprehensive
DIT++ taxonomy, the work reported here repre-
sents the first investigation of annotator agree-
ment.
3.2 Experiment outline
As noted, existing work on annotator agreement
analysis has mostly involved only two annotators.
It may be argued that, especially for the annotation of
concepts that are rather complex, an odd number
of annotators is desirable. First, it allows for
majority agreement unless all annotators choose
entirely different tags. Second, it makes it easier to deal
with the undesirable situation in which one annotator
chooses quite differently from the others. The
agreement scores reported in this paper are all calculated
on the basis of the annotations of three
annotators, using the method proposed in (Davies
and Fleiss, 1982).
2 Drawn from the OVIS corpus (Strik et al., 1997):
OVIS2:104/001/001:008-011
The dialogues that were annotated are task-
oriented and are all in Dutch. To account for
different complexities of interaction, both human-
machine and human-human dialogues are consid-
ered. Moreover, the dialogues analyzed are drawn
from different corpora: OVIS (Strik et al., 1997),
DIAMOND (Geertzen et al., 2004), and a collec-
tion of Map Task dialogues (Caspers, 2000); see
Table 1, where the number of annotated utterances
is also indicated.
corpus     domain                                          type  #utt
OVIS       TRAINS-like interactions on train connections   H-M    193
DIAMOND1   interactions on how to operate a fax device     H-M    131
DIAMOND2   interactions on how to operate a fax device     H-H    114
MAPTASK    HCRC Map Task-like interaction                  H-H    120
total                                                             558
Table 1: Characteristics of the utterances consid-
ered
Six undergraduate students annotated the se-
lected dialogue material. They had been intro-
duced to the DIT++ annotation scheme and the un-
derlying theory while participating in a course on
pragmatics. During this course they were exposed
to approximately four hours of lecturing and a few
small annotation exercises. For all dialogues, the
audio recordings were transcribed and the annota-
tors annotated presegmented utterances for which
full agreement was established on segmentation
level beforehand. During the annotation sessions
the annotators had — apart from the transcribed
speech — access to the audio recordings, to the
on-line definitions of the communicative functions
in the scheme and to a very brief, 1-page set of an-
notation guidelines3. The task was facilitated by
the use of an annotation tool that had been built
for this occasion; this tool allowed the subjects to
assign each utterance one DIT++ tag for each di-
mension without any further constraints. In total
1,674 utterances were annotated.
3 See http://ls0143.uvt.nl/dit
3.3 Problems with standard κ
If we were to apply the standard κ statistic to
DIT++ annotations, we would not do justice to an
important aspect of the annotation scheme concerning
the differences between alternative tags,
and hence the possible differences in the disagreement
between annotators using alternative
tags. An aspect in which the DIT++ scheme dif-
fers from other taxonomies for dialogue acts is
that, as noted in Section 2, communicative func-
tions (CFs) within a dimension as well as general-
purpose CFs are often structured into hierarchies
in which a difference in level represents a relation
of specificity. When annotators differ in that they
assign tags which both belong to the same hier-
archy, they may differ in the degree of specificity
that they want to express, but they agree to the ex-
tent that these tags inherit the same elements from
tags higher in the hierarchy. Inter-annotator disagreement
is in such a case much smaller than if they
had chosen two unrelated tags. This is obvious, for instance,
in the following example of the annotations
of two utterances by two annotators:
1 S what do you want to know? WHQ YNQ
2 U can I print now? YNQ CHECK
With utterance 1, the annotators should be said
simply to disagree (in fact, annotator 2 incorrectly
assigns a YNQ function). Concerning utterance 2
the annotators also disagree, but Figure 1 and the
definitions given in Section 2 tell us that the dis-
agreement in this case is quite small, as a CHECK in-
herits the properties of a YNQ. We therefore should
not use a black-and-white measure of agreement,
like the standard κ, but we should have a measure
for partial annotator agreement.
In order to measure partial (dis-)agreement be-
tween annotators in an adequate way, we should
not just take into account whether two tags are hi-
erarchically related or not, but also how far they
are apart in the hierarchy, to reflect that two tags
which are only one level apart are semantically
more closely related than tags that are several lev-
els apart. We will take this additional requirement
into account when designing a weighted disagree-
ment statistic in the next section.
4 Agreement based on structural
taxonomic properties
The agreement coefficient we are looking for
should in the first place be weighted in the sense
that it takes into account the magnitude of dis-
agreement. Two such coefficients are weighted
kappa (κw, (Cohen, 1968)) and alpha (Krippen-
dorff, 1980). For our purposes, we adopt κw for
its property to take into account a probability dis-
tribution typical for each annotator, generalize it to
the case for multiple annotators by taking the aver-
age over the scores of annotator pairs, and define
a function to be used as distance metric.
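The generalization to multiple annotators by averaging over annotator pairs can be sketched as follows. This is a hedged illustration: the names mean_pairwise and raw_agreement and the toy annotations are assumptions, and raw_agreement is only a stand-in for whichever two-annotator coefficient is plugged in.

```python
from itertools import combinations
from statistics import mean


def mean_pairwise(annotations, pair_score):
    """Average a two-annotator score over all unordered annotator pairs."""
    names = sorted(annotations)
    if len(names) < 2:
        raise ValueError("need at least two annotators")
    return mean(
        pair_score(annotations[x], annotations[y])
        for x, y in combinations(names, 2)
    )


def raw_agreement(labels_a, labels_b):
    """Stand-in pair score: proportion of items with identical labels."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Hypothetical annotations of four utterances by three annotators.
toy = {
    "A1": ["YNQ", "WHQ", "CHECK", "YNQ"],
    "A2": ["YNQ", "WHQ", "YNQ", "YNQ"],
    "A3": ["CHECK", "WHQ", "CHECK", "YNQ"],
}
print(mean_pairwise(toy, raw_agreement))
```

With three annotators there are three pairs; any function that scores two label sequences, such as a weighted κ, can be passed as pair_score.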
4.1 Cohen’s weighted κ
Assuming the case of two annotators, let pij de-
note the proportion of utterances for which the first
and second annotator assigned categories i and j,
respectively. Then Cohen defines κw in terms of
disagreement rather than agreement where qo =
1 − po and qe = 1 − pe such that Equation 1 can
be rewritten to:
\[ \kappa = 1 - \frac{q_o}{q_e} \qquad (2) \]
To arrive at κw, the proportions qo and qe in Equa-
tion 2 are replaced by weighted functions over all
possible category pairs:
\[ \kappa_w = 1 - \frac{\sum v_{ij} \cdot p_{o_{ij}}}{\sum v_{ij} \cdot p_{e_{ij}}} \qquad (3) \]
where vij denotes the disagreement weight. To
calculate this weight we need to specify a distance
function as metric.
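Equation 3 can be sketched in Python as follows. The function names and toy data are assumptions made for illustration; the expected proportions peij are taken here as the product of the two annotators' marginal proportions, as in Cohen's definition.

```python
from collections import Counter


def weighted_kappa(labels_a, labels_b, weight):
    """Weighted kappa (Equation 3); weight(i, j) is the disagreement weight v_ij."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two equally long, non-empty label sequences")
    joint = Counter(zip(labels_a, labels_b))
    marg_a, marg_b = Counter(labels_a), Counter(labels_b)
    # Weighted observed disagreement: sum of v_ij * p_o_ij.
    q_o = sum(weight(i, j) * count for (i, j), count in joint.items()) / n
    # Weighted expected disagreement: sum of v_ij * p_e_ij.
    q_e = sum(
        weight(i, j) * marg_a[i] * marg_b[j] for i in marg_a for j in marg_b
    ) / (n * n)
    if q_e == 0:
        raise ValueError("undefined: expected weighted disagreement is zero")
    return 1 - q_o / q_e


def nominal(i, j):
    """All-or-nothing weights; these reduce Equation 3 to ordinary kappa."""
    return 0.0 if i == j else 1.0


print(weighted_kappa(["a", "a", "b", "b"], ["a", "b", "b", "b"], nominal))
```

With the all-or-nothing weights the result coincides with unweighted κ, which gives a simple sanity check before a graded weight function is substituted.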
4.2 A taxonomic metric
The task of defining a function in order to calcu-
late the difference between a pair of categories re-
quires us to determine semantic-pragmatic related-
ness between the CFs in the taxonomy. For any an-
notation scheme, whether it is hierarchically struc-
tured or not, we could assign for each possible pair
of categories a value that expresses the semantic-
pragmatic relatedness between the two categories
compared to all other possible pairs. However, it
seems quite difficult to find universal characteris-
tics for CFs to be used to express relatedness on a
rational scale. When we consider a taxonomy that
is structured in a meaningful way, in this case one
that expresses hierarchical relations between CFs
based on their effect on information states, the tax-
onomic structure can be exploited to express in a
systematic fashion how much annotators disagree
when they choose different concepts that are di-
rectly or indirectly related.
The assignment of different CFs to a specific ut-
terance by two annotators represents full disagree-
ment in the following cases:
1. the two CFs belong to different dimensions;
2. one of the two CFs is general-purpose; the
other is dimension-specific;4
3. the two CFs belong to the same dimension
but not to the same hierarchy;
4. the two CFs belong to the same hierarchy
but are not located in the same branch. Two
CFs are said to be located in the same branch
when one of the two CFs is an ancestor of the
other.
If, by contrast, the two CFs take part in a parent-
child relation within a hierarchy (either within a
dimension or among the general-purpose CFs),
then the CFs are related and this assignment repre-
sents partial disagreement. A distance metric that
measures this disagreement, which we denote as
δ, should have the following properties:
1. δ should be a real number normalized in the
range [0...1];
2. Let C be the (unordered) set of CFs.5 For ev-
ery two CFs c1,c2 ∈ C, δ(c1,c2) = 0 when
c1 and c2 are not related;
3. Let C be the (unordered) set of CFs. For ev-
ery communicative function c ∈ C, δ(c,c) =
1;
4. Let C be the (unordered) set of CFs. For
every two CFs c1,c2 ∈ C, δ(c1,c2) =
δ(c2,c1).
Furthermore, when c1 and c2 are related, we
should specify how distance between them in the
hierarchy should be expressed in terms of partial
disagreement. For this, we should take the follow-
ing aspects into account:
1. The distance in levels between c1 and c2 in
the hierarchy is proportional to the magnitude
of the disagreement;
2. The magnitude of disagreement between c1
and c2 located at two levels of
depth n and n + 1 might be considered to be
larger than that between two levels of
depth n + 1 and n + 2. If this were the
case, then the deeper two levels are located in the
tree, the smaller the differences between the
nodes on those levels. For the hierarchies in
DIT, we keep the magnitude of disagreement
linear in the difference in levels, and independent
of level depth.
[Figure 2 (tree diagram): Auto Feedback, with one hierarchy over Perc−, Int−, Eval−, Exec− and one over Perc+, Int+, Eval+, Exec+]
Figure 2: Hierarchical structures in the auto feedback
dimension.
4 This is in fact a simplification. For instance, an INFORM
act of which the semantic content conveys that the speaker
did not understand the previous utterance forms an act in the
Auto-Feedback dimension (see Note 6), and a tagging to this
effect should perhaps not be considered to express full disagreement
with the assignment of the dimension-specific tag
AUTO-FEEDBACK-Int−. See also the next footnote.
5 Strictly speaking, in DIT a dialogue act annotation tag is
either (a) the name of a dimension-specific function, or (b) a
pair consisting of the name of a general-purpose function and
the name of a dimension. However, in view of the simplification
mentioned in the previous note, for the sake of this paper
we may as well consider tags containing a general-purpose
function as simply consisting of that function.
Given the considerations above, we propose the
following metric:
\[ \delta(c_i, c_j) = a^{\Delta(c_i, c_j)} \cdot b^{\Gamma(c_i, c_j)} \qquad (4) \]
where:
• a is a constant for which 0 < a < 1, express-
ing how much distance there is between two
adjacent levels in the hierarchy; a plausible
value for a could be 0.75;
• ∆ is a function that returns the difference in
depth between the levels of ci and cj;
• b is a constant for which 0 < b ≤ 1, expressing
at what rate differences should become
smaller as the depth in the hierarchy gets
larger. If there is no reason to assume that
differences at a greater depth in the hierarchy
are of smaller magnitude than differences at a
lesser depth, then b = 1;
• Γ(ci,cj) is a function that returns the mini-
mal depth of ci and cj.
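The metric in Equation 4 can be sketched in Python for the fragment of Figure 1. Several points are assumptions made for this illustration and are not fixed by the text: IND-YNQ and IND-WHQ are treated as hierarchy roots, a root is given depth 0, and identical tags are returned as 1 directly so that the third property of δ holds for any b.

```python
# Fragment of the general-purpose hierarchy in Figure 1 (assumed layout).
PARENT = {
    "IND-YNQ": None, "YNQ": "IND-YNQ", "CHECK": "YNQ",
    "POSI": "CHECK", "NEGA": "CHECK",
    "IND-WHQ": None, "WHQ": "IND-WHQ",
}


def path_to_root(tag):
    """Return [tag, parent, grandparent, ...] up to the hierarchy root."""
    path = [tag]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(tag):
    """Depth of a tag, counting a hierarchy root as depth 0 (assumption)."""
    return len(path_to_root(tag)) - 1


def delta(c1, c2, a=0.75, b=1.0):
    """Equation 4: a**Delta * b**Gamma for tags in one branch, otherwise 0."""
    if c1 == c2:
        return 1.0  # identical tags: delta(c, c) = 1
    same_branch = c1 in path_to_root(c2) or c2 in path_to_root(c1)
    if not same_branch:
        return 0.0  # unrelated tags
    level_gap = abs(depth(c1) - depth(c2))      # Delta(c1, c2)
    min_depth = min(depth(c1), depth(c2))       # Gamma(c1, c2)
    return (a ** level_gap) * (b ** min_depth)


for pair in [("IND-YNQ", "CHECK"), ("YNQ", "CHECK"), ("POSI", "NEGA")]:
    print(pair, delta(*pair))
```

With a = 0.75 and b = 1 the choice of depth origin has no effect, since b raised to any power is 1; it only matters once b is set below 1.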
To provide some examples of how δ would be
calculated, let us consider the general purpose
functions in Figure 1. Consider also Figure 2,
which represents two hierarchies of CFs in the auto
feedback dimension6, and let us assume for the
various parameters the values suggested
above. We then get the following calculations:
δ(IND-YNQ, CHECK) = 0.75^2 · 1 = 0.563
δ(YNQ, CHECK) = 0.75^1 · 1 = 0.75
δ(Perc+, Perc+) = 0.75^0 · 1 = 1
δ(Perc+, Eval+) = 0.75^2 · 1 = 0.563
δ(Int−, Int+) = 0
δ(POSI, NEGA) = 0
To conclude, we can simply take δ to be the
weighting in Cohen’s κw and come to a coefficient
which we will call taxonomically weighted kappa,
denoted by κtw:
\[ \kappa_{tw} = 1 - \frac{\sum (1 - \delta(i,j)) \cdot p_{o_{ij}}}{\sum (1 - \delta(i,j)) \cdot p_{e_{ij}}} \qquad (5) \]
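Equation 5 can be sketched in Python by using 1 − δ(i, j) as the disagreement weight. This is a minimal illustration under stated assumptions: kappa_tw, toy_delta, identity_delta and the toy annotations are names and data introduced here, and toy_delta hard-codes only the YNQ/CHECK value of 0.75 from the worked examples instead of deriving it from a taxonomy.

```python
from collections import Counter


def kappa_tw(labels_a, labels_b, delta):
    """Taxonomically weighted kappa (Equation 5) for two annotators."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two equally long, non-empty label sequences")
    joint = Counter(zip(labels_a, labels_b))
    marg_a, marg_b = Counter(labels_a), Counter(labels_b)
    # Observed disagreement, each pair weighted by 1 - delta(i, j).
    q_o = sum((1 - delta(i, j)) * c for (i, j), c in joint.items()) / n
    # Expected disagreement from the product of the annotators' marginals.
    q_e = sum(
        (1 - delta(i, j)) * marg_a[i] * marg_b[j]
        for i in marg_a for j in marg_b
    ) / (n * n)
    if q_e == 0:
        raise ValueError("undefined: expected weighted disagreement is zero")
    return 1 - q_o / q_e


# Hypothetical relatedness table: only YNQ and CHECK are related (0.75).
RELATED = {frozenset(["YNQ", "CHECK"]): 0.75}


def toy_delta(i, j):
    return 1.0 if i == j else RELATED.get(frozenset([i, j]), 0.0)


def identity_delta(i, j):
    """All-or-nothing delta; reduces Equation 5 to ordinary kappa."""
    return 1.0 if i == j else 0.0


ann_1 = ["WHQ", "WHQ", "YNQ", "YNQ", "CHECK", "CHECK"]
ann_2 = ["WHQ", "WHQ", "YNQ", "CHECK", "CHECK", "YNQ"]
print(kappa_tw(ann_1, ann_2, identity_delta), kappa_tw(ann_1, ann_2, toy_delta))
```

On these toy data the two YNQ/CHECK confusions count as full disagreements under ordinary κ (0.5) but only as quarter-weight disagreements under κtw (about 0.83).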
4.3 κtw statistics for DIT
Considering the DIT++ taxonomy, it may be argued
that due to the many hierarchies in the topology
of the general-purpose functions, this is the part
where most is to be gained by employing κtw.
Table 2 shows the statistics for each dimension,
averaged over all annotation pairs. By an annotation
pair we mean the pair of assignments that
an utterance received from two annotators for a particular
dimension. The figures in the table are
based on those cases in which both annotators as-
signed a function to a specific utterance for a spe-
cific dimension. Cases where either one annotator
does not assign a function while the other does,
or where both annotators do not assign a function,
are not considered. Scores for standard κ and κtw
can be found in the first two columns. The column
#pairs indicates on how many annotation pairs the
statistics are based. The last column shows the
ap-ratio. This figure indicates which fraction of
all annotated functions in that dimension are rep-
resented by annotation pairs. When #ap denotes
the number of annotation pairs and #pa denotes
the number of partial annotations (annotations in
which one annotator assigned a function and the
other did not), then the ap-ratio is calculated as
#ap/(#pa + #ap). We can observe that due to
the use of the taxonomic weighting both feedback
dimensions and the task dimension gained sub-
stantially in annotator agreement.
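The ap-ratio defined above can be sketched for one dimension and one pair of annotators as follows. The function name ap_ratio, the use of None for "no function assigned", and the placeholder tags are assumptions introduced for this illustration.

```python
def ap_ratio(tags_a, tags_b):
    """Return (#ap, #pa, ap-ratio) for one dimension and one annotator pair.

    Each list holds one entry per utterance; None means that the annotator
    assigned no function in this dimension.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must cover the same utterances")
    pairs = list(zip(tags_a, tags_b))
    # Annotation pairs: both annotators assigned a function.
    n_ap = sum(a is not None and b is not None for a, b in pairs)
    # Partial annotations: exactly one annotator assigned a function.
    n_pa = sum((a is None) != (b is None) for a, b in pairs)
    if n_ap + n_pa == 0:
        raise ValueError("no annotations in this dimension")
    return n_ap, n_pa, n_ap / (n_ap + n_pa)


# Hypothetical tags for five utterances in a single dimension.
tags_1 = ["X", None, "X", None, "Y"]
tags_2 = ["X", "X", None, None, "X"]
print(ap_ratio(tags_1, tags_2))
```

Utterances that neither annotator tagged in the dimension contribute to neither count, in line with the cases excluded from Table 2.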
6Auto-feedback: feedback on the processing (perception,
understanding, evaluation,..) of previous utterances by the
speaker. DIT also distinguishes allo-feedback, where the
speaker provides or elicits information about the addressee’s
processing.
Dimension                     κ      κtw    #pairs  ap-ratio
task                          0.47   0.71   848     0.87
task:action discussion        0.61   0.61   91      0.37
auto feedback                 0.21   0.57   127     0.34
allo feedback                 0.42   0.58   17      0.14
turn management               0.82   0.82   115     0.18
time management               0.58   0.58   68      0.72
contact management            1.00   1.00   8       0.17
topic management              nav    nav    2       0.08
own com. management           1.00   1.00   2       0.08
partner com. management       nav    nav    1       0.07
dialogue struct. management   0.74   0.74   15      0.31
social obl. management        1.00   1.00   61      0.80
Table 2: Scores for corrected κ and κtw per DIT
dimension.
When we look at the agreement statistics and
consider κ scores above 0.67 to be significant
and scores above 0.8 considerably reliable, as is
usual for κ statistics, we can find the dimensions
TURN-MANAGEMENT, CONTACT MANAGEMENT, and
SOCIAL-OBLIGATIONS-MANAGEMENT to be reliable
and DIALOGUE STRUCT. MANAGEMENT to be significant.
For some dimensions, the occurrences of
functions in these dimensions in the annotated dialogue
material were too few to draw conclusions.
When we also take the ap-ratio into account,
only the dimensions TASK, TIME MANAGEMENT,
and SOCIAL-OBLIGATIONS-MANAGEMENT combine
a fair agreement on functions with fair agreement
on whether or not to annotate in these dimensions.
Especially for the other dimensions, the question
should be raised for which cases and for what rea-
sons the ap-ratio is low. This question asks for
further qualitative analysis, which is beyond the
scope of this paper7.
5 Discussion
In the previous sections, we showed how the taxonomically
weighted κtw that we proposed can be
more suitable for taxonomies that contain hierarchical
structures, like the DIT++ taxonomy. However,
there are some specific and general issues
that deserve more attention.
A question that might be raised in using κtw as
opposed to ordinary κ is whether the
interpretations of κ proposed in the literature in terms
of reliability are also valid for κtw statistics. This
is ultimately an empirical issue, to be decided by
which κtw scores researchers find to correspond to
fair or near agreement between annotators.
Another point of discussion is the arbitrariness
of the values of the parameters that can be cho-
sen in δ. In this paper we proposed a = 0.75 and
β = 0.5. Choosing different values may change
the disagreement of two distinct CFs located in the
same hierarchy considerably. Still, we think that
by interpolating smoothly between the intuitively
clear cases at the two extreme ends of the scale,
it is possible to choose reasonable values for the
parameters that scale well, given the average hierarchy
depth.
7 See (Geertzen, 2006) for more details.
A more general problem, inherent in almost
any (dialogue act) annotation activity is that when
we consider the possible factors that influence the
agreement scores, we find that they can be nu-
merous. Starting with the tagset, unclear defini-
tions and vague concepts are a major source of
disagreement. Other factors are the quality and ex-
tensiveness of annotation instructions, and the ex-
perience of the annotators. These were kept con-
stant throughout the experiment reported in this
paper, but clearly the use of more experienced or
better trained annotators could have a great influ-
ence. Then there is the influence that the use of an
annotation tool can have. Does the tool give hints
on annotation consistency (e.g. an ANSWER should
be preceded by a QUESTION), does it enforce con-
sistency, or does it not consider annotation consis-
tency at all? Are the possible choices for anno-
tators presented in such a way that each choice is
equally well visible and accessible? Clearly, when
we do not control these factors sufficiently, we run
the risk that what we measure does not express
what we try to quantify: (dis)agreement among
annotators about the description of what happens
in a dialogue.
6 Conclusion and future work
In this paper we have presented agreement scores
for Cohen’s unweighted κ and claimed that for
annotation schemes with hierarchically related
tags, a weighted κ gives a better indication of
(dis)agreement than unweighted κ. The κ scores
for some dimensions seem not particularly spec-
tacular but become more interesting when look-
ing at semantic-pragmatic differences between di-
alogue acts or CFs. Even though there are some-
what arbitrary aspects in weighting, when parame-
ters are carefully chosen a weighted metric gives a
better representation of the inter-annotator agree-
ments. More generally, we propose that semantic-
pragmatic relatedness between taxonomic con-
cepts should be taken into account when calculat-
ing inter-annotator (dis)agreement. While we used
DIT++ as tagset, the weighting function we pro-
posed can be employed in any taxonomy contain-
ing hierarchically related concepts, since we only
used structural properties of the taxonomy.
We have also quantitatively8 evaluated the
DIT++ tagset per dimension, and obtained an in-
dication of its usability. We focussed on agree-
ment per dimension, but when we desire a global
indication of the difference in semantic-pragmatic
interpretation of a complete utterance it requires
us to consider other aspects. A truly multidimen-
sional study of inter-annotator agreement should
not only take intra-dimensional aspects into ac-
count but also relate the dimensions to each other.
In (Bunt and Girard, 2005; Bunt, 2006) it is argued
that dimensions should be orthogonal, meaning
that an utterance can have a function in one dimen-
sion independent of functions in other dimensions.
This is a somewhat utopian condition, since there
are some functions that show correlations and dependencies
across dimensions. For this reason
it makes sense to try to express the effect of the
presence of strong correlations, dependencies and
possible entailments in a multidimensional notion
of (dis)agreement. Additionally, it may be desir-
able to take into account the importance that a CF
can have. It is widely acknowledged that utter-
ances are often multifunctional, but it could be ar-
gued that in many cases an utterance has a primary
function and secondary functions; for instance, if
an utterance has both a task-related function and
one or more other functions, the task-related func-
tion is typically felt to be more important than the
other functions, and disagreement about the task-
related function is therefore felt to be more seri-
ous than disagreement about one of the other func-
tions. This might be taken into account by adding
a weighting function when combining agreement
measures over multiple dimensions.
Other future work we plan is more methodolog-
ical in nature, quantifying the relative effect of the
factors that may have influenced the scores that we
have found. This would create a situation in which
there is more insight in what exactly is evaluated.
As for evaluating the tagset, we plan, for instance,
to further analyze co-occurrence matrices to identify
frequent misannotations, and to have annotators
think aloud while performing the annotation
task.
8 Kappa statistics are indicative. To get a full understanding
of what the figures represent, qualitative analysis using
e.g. co-occurrence matrices is required, which is beyond the
scope of this paper.
Acknowledgements
The authors thank three anonymous reviewers for
their helpful comments on an earlier version of this
paper.

References
James Allen and Mark Core. 1997. Draft of DAMSL:
Dialog act markup in several layers. Unpublished
manuscript.
J. Allwood, J. Nivre, and E. Ahlsén. 1993. Manual for
coding interaction management. Technical report,
Göteborg University. Project report: Semantik och
talspråk.
Harry C. Bunt and Yann Girard. 2005. Designing an
open, multidimensional dialogue act taxonomy. In
Proceedings of the 9th Workshop on the Semantics
and Pragmatics of Dialogue (DIALOR 2005), pages
37–44, Nancy, France, June.
Harry C. Bunt. 2000. Dialogue pragmatics and con-
text specification. In Harry C. Bunt and William
Black, editors, Abduction, Belief and Context in Di-
alogue; Studies in Computational Pragmatics, pages
81–150. John Benjamins, Amsterdam, The Nether-
lands.
Harry C. Bunt. 2005. A framework for dialogue act
specification. In Joint ISO-ACL Workshop on the
Representation and Annotation of Semantic Infor-
mation, Tilburg, The Netherlands, January.
Harry C. Bunt. 2006. Dimensions in dialogue annota-
tion. In Proceedings of the 5th International Confer-
ence on Language Resources and Evaluation (LREC
2006), Genova, Italy, May.
Jean Carletta. 1996. Assessing agreement on classi-
fication tasks: The kappa statistic. Computational
Linguistics, 22(2):249–254.
Johanneke Caspers. 2000. Pitch accents, boundary
tones and turn-taking in Dutch Map Task dialogues.
In Proceedings of the 6th International Conference
on Spoken Language Processing (ICSLP 2000), vol-
ume 1, pages 565–568, Beijing, China.
Jacob Cohen. 1960. A coefficient of agreement for
nominal scales. Educational and Psychological Measurement,
20:37–46.
Jacob Cohen. 1968. Weighted kappa: Nominal scale
agreement with provision for scaled disagreement or
partial credit. Psychological Bulletin, 70:213–220.
Mark G. Core and James F. Allen. 1997. Coding di-
alogues with the DAMSL annotation scheme. In
David Traum, editor, Working Notes: AAAI Fall
Symposium on Communicative Action in Humans
and Machines, pages 28–35, Menlo Park, CA, USA.
American Association for Artificial Intelligence.
Mark Davies and J.L. Fleiss. 1982. Measuring agree-
ment for multinomial data. Biometrics, 38:1047–
1051.
Barbara Di Eugenio and Michael Glass. 2004. The
kappa statistic: a second look. Computational Lin-
guistics, 30(1):95–101.
Barbara Di Eugenio, Pamela W. Jordan, Johanna D.
Moore, and Richmond H. Thomason. 1998. An em-
pirical investigation of proposals in collaborative di-
alogues. In Proceedings of the 17th International
Conference on Computational Linguistics and the
36th Annual Meeting of the Association for Com-
putational Linguistics (COLING-ACL 1998), pages
325–329, Montreal, Canada.
Jeroen Geertzen, Yann Girard, Roser Morante, Ielka
van der Sluis, Hans Van Dam, Barbara Suijkerbuijk,
Rintse van der Werf, and Harry Bunt. 2004. The
DIAMOND project (poster, project description). In The
8th Workshop on the Semantics and Pragmatics of
Dialogue (Catalog’04). Barcelona, Spain.
Jeroen Geertzen. 2006. Inter-annotator agreement
within DIT++ dimensions. Technical report, Tilburg
University, Tilburg, The Netherlands.
Derek Gross, James F. Allen, and David R. Traum.
1993. The TRAINS 91 dialogues. Technical Re-
port TN92-1, University of Rochester, Rochester,
NY, USA.
Peter A. Heeman and James F. Allen. 1995. The
TRAINS 93 dialogues. Technical Report TN94-2,
University of Rochester, Rochester, NY, USA.
Klaus Krippendorff. 1980. Content Analysis: An In-
troduction to its Methodology. Sage Publications,
Beverly Hills, CA, USA.
Stephan Lesch, Thomas Kleinbauer, and Jan Alexan-
dersson. 2005. A new metric for the evaluation
of dialog act classification. In Proceedings of the
9th Workshop on the Semantics and Pragmatics of
Dialogue (DIALOR 2005), pages 143–146, Nancy,
France, June.
Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy
Ang, and Hannah Carvey. 2004. The ICSI meeting
recorder dialog act (MRDA) corpus. In Proceedings
of the 5th SIGdial Workshop on Discourse and Dia-
logue, pages 97–100, Boston, USA, April-May.
Amanda J. Stent. 2000. The MONROE corpus. Technical
Report TR728/TN99-2, University of Rochester,
Rochester, NY, USA.
Helmer Strik, Albert Russel, Henk van den Heuvel, Catia
Cucchiarini, and Lou Boves. 1997. A spoken dialog
system for the Dutch public transport information
service. International Journal of Speech Technology,
2(2):119–129.
