Identifying Agreement and Disagreement in Conversational Speech:
Use of Bayesian Networks to Model Pragmatic Dependencies
Michel Galleya0 , Kathleen McKeowna0 , Julia Hirschberga0 ,
a0 Columbia University
Computer Science Department
1214 Amsterdam Avenue
New York, NY 10027, USA
a1 galley,kathy,julia
a2 @cs.columbia.edu
and Elizabeth Shriberga3
a3 SRI International
Speech Technology and Research Laboratory
333 Ravenswood Avenue
Menlo Park, CA 94025, USA
ees@speech.sri.com
Abstract
We describe a statistical approach for modeling
agreements and disagreements in conversational in-
teraction. Our approach first identifies adjacency
pairs using maximum entropy ranking based on a
set of lexical, durational, and structural features that
look both forward and backward in the discourse.
We then classify utterances as agreement or dis-
agreement using these adjacency pairs and features
that represent various pragmatic influences of pre-
vious agreement or disagreement on the current ut-
terance. Our approach achieves 86.9% accuracy, a
4.9% increase over previous work.
1 Introduction
One of the main features of meetings is the occur-
rence of agreement and disagreement among par-
ticipants. Often meetings include long stretches
of controversial discussion before some consensus
decision is reached. Our ultimate goal is auto-
mated summarization of multi-participant meetings
and we hypothesize that the ability to automatically
identify agreement and disagreement between par-
ticipants will help us in the summarization task.
For example, a summary might resemble minutes of
meetings with major decisions reached (consensus)
along with highlighted points of the pros and cons
for each decision. In this paper, we present a method
to automatically classify utterances as agreement,
disagreement, or neither.
Previous work in automatic identification of
agreement/disagreement (Hillard et al., 2003)
demonstrates that this is a feasible task when var-
ious textual, durational, and acoustic features are
available. We build on their approach and show
that we can get an improvement in accuracy when
contextual information is taken into account. Our
approach first identifies adjacency pairs using maxi-
mum entropy ranking based on a set of lexical, dura-
tional and structural features that look both forward
and backward in the discourse. This allows us to ac-
quire, and subsequently process, knowledge about
who speaks to whom. We hypothesize that prag-
matic features that center around previous agree-
ment between speakers in the dialog will influence
the determination of agreement/disagreement. For
example, if a speaker disagrees with another per-
son once in the conversation, is he more likely to
disagree with him again? We model context using
Bayesian networks that allows capturing of these
pragmatic dependencies. Our accuracy for classify-
ing agreements and disagreements is 86.9%, which
is a 4.9% improvement over (Hillard et al., 2003).
In the following sections, we begin by describ-
ing the annotated corpus that we used for our ex-
periments. We then turn to our work on identify-
ing adjacency pairs. In the section on identification
of agreement/disagreement, we describe the contex-
tual features that we model and the implementation
of the classifier. We close with a discussion of future
work.
2 Corpus
The ICSI Meeting corpus (Janin et al., 2003) is
a collection of 75 meetings collected at the In-
ternational Computer Science Institute (ICSI), one
among the growing number of corpora of human-
to-human multi-party conversations. These are nat-
urally occurring, regular weekly meetings of vari-
ous ICSI research teams. Meetings in general run
just under an hour each; they have an average of 6.5
participants.
These meetings have been labeled with adja-
cency pairs (AP), which provide information about
speaker interaction. They reflect the structure of
conversations as paired utterances such as question-
answer and offer-acceptance, and their labeling is
used in our work to determine who are the ad-
dressees in agreements and disagreements. The an-
notation of the corpus with adjacency pairs is de-
scribed in (Shriberg et al., 2004; Dhillon et al.,
2004).
Seven of those meetings were segmented into
spurts, defined as periods of speech that have no
pauses greater than .5 second, and each spurt was
labeled with one of the four categories: agreement,
disagreement, backchannel, and other.1 We used
spurt segmentation as our unit of analysis instead of
sentence segmentation, because our ultimate goal is
to build a system that can be fully automated, and
in that respect, spurt segmentation is easy to ob-
tain. Backchannels (e.g. “uhhuh” and “okay”) were
treated as a separate category, since they are gener-
ally used by listeners to indicate they are following
along, while not necessarily indicating agreement.
The proportion of classes is the following: 11.9%
are agreements, 6.8% are disagreements, 23.2% are
backchannels, and 58.1% are others. Inter-labeler
reliability estimated on 500 spurts with 2 labelers
was considered quite acceptable, since the kappa
coefficient was .63 (Cohen, 1960).
3 Adjacency Pairs
3.1 Overview
Adjacency pairs (AP) are considered fundamental
units of conversational organization (Schegloff and
Sacks, 1973). Their identification is central to our
problem, since we need to know the identity of
addressees in agreements and disagreements, and
adjacency pairs provide a means of acquiring this
knowledge. An adjacency pair is said to consist of
two parts (later referred to as A and B) that are or-
dered, adjacent, and produced by different speakers.
The first part makes the second one immediately rel-
evant, as a question does with an answer, or an offer
does with an acceptance. Extensive work in con-
versational analysis uses a less restrictive definition
of adjacency pair that does not impose any actual
adjacency requirement; this requirement is prob-
lematic in many respects (Levinson, 1983). Even
when APs are not directly adjacent, the same con-
straints between pairs and mechanisms for select-
ing the next speaker remain in place (e.g. the case
of embedded question and answer pairs). This re-
laxation on a strict adjacency requirement is partic-
ularly important in interactions of multiple speak-
ers since other speakers have more opportunities to
insert utterances between the two elements of the
AP construction (e.g. interrupted, abandoned or ig-
nored utterances; backchannels; APs with multiple
second elements, e.g. a question followed by an-
swers of multiple speakers).2
Information provided by adjacency pairs can be
used to identify the target of an agreeing or dis-
agreeing utterance. We define the problem of AP
1Part of these annotated meetings were provided by the au-
thors of (Hillard et al., 2003).
2The percentage of APs labeled in our data that have non-
contiguous parts is about 21%.
identification as follows: given the second element
(B) of an adjacency pair, determine who is the
speaker of the first element (A). A quite effective
baseline algorithm is to select as speaker of utter-
ance A the most recent speaker before the occur-
rence of utterance B. This strategy selects the right
speaker in 79.8% of the cases in the 50 meetings that
were annotated with adjacency pairs. The next sub-
section describes the machine learning framework
used to significantly outperform this already quite
effective baseline algorithm.
3.2 Maximum Entropy Ranking
We view the problem as an instance of statisti-
cal ranking, a general machine learning paradigm
used for example in statistical parsing (Collins,
2000) and question answering (Ravichandran et al.,
2003).3 The problem is to select, given a set of a0
possible candidates a1a3a2a5a4a7a6a7a8a9a8a9a8a9a6a10a2a12a11a14a13 (in our case, po-
tential A speakers), the one candidate a2a16a15 that maxi-
mizes a given conditional probability distribution.
We use maximum entropy modeling (Berger et
al., 1996) to directly model the conditional proba-
bility a17a19a18a20a2a21a15a23a22a24a26a25 , where each a27a5a15 in a24a29a28a30a18a31a27a32a4a33a6a7a8a9a8a9a8a9a6a23a27a34a11a14a25 is
an observation associated with the corresponding
speaker a2 a15 . a27 a15 is represented here by only one vari-
able for notational ease, but it possibly represents
several lexical, durational, structural, and acoustic
observations. Given a35 feature functions a36a12a37a38a18a31a24a39a6a10a2a33a15a40a25
and a35 model parameters a41a42a28a43a18a45a44a46a4a33a6a7a8a9a8a9a8a9a6a47a44a49a48a50a25 , the prob-
ability of the maximum entropy model is defined as:
a17a49a51a52a18a20a2a21a15a53a22a24a54a25a55a28
a56
a57a59a58
a24a26a60a62a61a64a63a66a65
a67 a48
a68
a37a70a69a52a4
a44a5a37a3a36a7a37a38a18a31a24a39a6a10a2a33a15a45a25a72a71
The only role of the denominator a57 a18a31a24a54a25 is to ensure
that a17a49a51 is a proper probability distribution. It is
defined as:
a57 a58
a24 a60 a28
a11
a68
a15a74a73a75a69a52a4
a61a64a63a76a65
a67 a48
a68
a37a47a69a52a4
a44a34a37a21a36a7a37a38a18a31a24a39a6a10a2 a15a73 a25a72a71
To find the most probable speaker of part A, we use
the following decision rule:
a77
a2 a28 a78a16a79a23a80a39a81a82a78
a63
a83a40a84a31a85a38a86a87a83a53a88a87a89a91a90a91a90a91a90a91a89a83a72a92a46a93a46a94
a17a49a51a52a18a20a2 a15 a22a24a54a25a21a95
a28 a78a16a79a23a80a39a81a82a78
a63
a83a40a84a31a85a38a86a87a83a53a88a87a89a91a90a91a90a91a90a91a89a83a72a92a46a93a46a94
a61a64a63a76a65
a67 a48
a68
a37a47a69a52a4
a44 a37 a36 a37 a18a31a24a39a6a10a2 a15 a25a72a71a82a95
Note that we have also attempted to model the
problem as a binary classification problem where
3The approach is generally called re-ranking in cases where
candidates are assigned an initial rank beforehand.
each speaker is either classified as speaker A or
not, but we abandoned that approach, since it gives
much worse performance. This finding is consis-
tent with previous work (Ravichandran et al., 2003)
that compares maximum entropy classification and
re-ranking on a question answering task.
3.3 Features
We will now describe the features used to train the
maximum entropy model mentioned previously. To
rank all speakers (aside from the B speaker) and to
determine how likely each one is to be the A speaker
of the adjacency pair involving speaker B, we use
four categories of features: structural, durational,
lexical, and dialog act (DA) information. For the
remainder of this section, we will interchangeably
use A to designate either the potential A speaker or
the most recent utterance4 of that speaker, assuming
the distinction is generally unambiguous. We use
B to designate either the B speaker or the current
spurt for which we need to identify a corresponding
A part.
The feature sets are listed in Table 1. Struc-
tural features encode some helpful information re-
garding ordering and overlap of spurts. Note that
with only the first feature listed in the table, the
maximum entropy ranker matches exactly the per-
formance of the baseline algorithm (79.8% accu-
racy). Regarding lexical features, we used a count-
based feature selection algorithm to remove many
first-word and last-word features that occur infre-
quently and that are typically uninformative for the
task at hand. Remaining features essentially con-
tained function words, in particular sentence-initial
indicators of questions (“where”, “when”, and so
on).
Note that all features in Table 1 are “backward-
looking”, in the sense that they result from an anal-
ysis of context preceding B. For many of them, we
built equivalent “forward-looking” features that per-
tain to the closest utterance of the potential speaker
A that follows part B. The motivation for extracting
these features is that speaker A is generally expected
to react if he or she is addressed, and thus, to take
the floor soon after B is produced.
3.4 Results
We used the labeled adjacency pairs of 50 meetings
and selected 80% of the pairs for training. To train
the maximum entropy ranking model, we used the
generalized iterative scaling algorithm (Darroch and
Ratcliff, 1972) as implemented in YASMET.5
4We build features for both the entire speaker turn of A and
the most recent spurt of A.
5http://www.isi.edu/˜och/YASMET.html
Structural features:
a0 number of speakers taking the floor between A
and B
a0 number of spurts between A and B
a0 number of spurts of speaker B between A and B
a0 do A and B overlap?
Durational features:
a0 duration of A
a0 if A and B do not overlap: time separating A and
B
a0 if they do overlap: duration of overlap
a0 seconds of overlap with any other speaker
a0 speech rate in A
Lexical features:
a0 number of words in A
a0 number of content words in A
a0 ratio of words of A (respectively B) that are also
in B (respectively A)
a0 ratio of content words of A (respectively B) that
are also in B (respectively A)
a0 number of
a1 -grams present both in A and B (we
built 3 features for a1 ranging from 2 to 4)
a0 first and last word of A
a0 number of instances at any position of A of
each cue word listed in (Hirschberg and Litman,
1994)
a0 does A contain the first/last name of speaker B?
Table 1. Speaker ranking features
Feature sets Accuracy
Baseline 79.80%
Structural 83.97%
Durational 84.71%
Lexical 75.43%
Structural and durational 87.88%
All 89.38%
All (only backward looking) 86.99%
All (Gaussian smoothing, FS) 90.20%
Table 2. Speaker ranking accuracy
Table 2 summarizes the accuracy of our statistical
ranker on the test data with different feature sets: the
performance is 89.39% when using all feature sets,
and reaches 90.2% after applying Gaussian smooth-
ing and using incremental feature selection as de-
scribed in (Berger et al., 1996) and implemented in
the yasmetFS package.6 Note that restricting our-
selves to only backward looking features decreases
the performance significantly, as we can see in Ta-
ble 2.
We also wanted to determine if information about
6http://www.isi.edu/˜ravichan/YASMET.html
dialog acts (DA) helps the ranking task. If we
hypothesize that only a limited set of paired DAs
(e.g. offer-accept, question-answer, and apology-
downplay) can be realized as adjacency pairs, then
knowing the DA category of the B part and of all
potential A parts should help in finding the most
meaningful dialog act tag among all potential A
parts; for example, the question-accept pair is ad-
mittedly more likely to correspond to an AP than
e.g. backchannel-accept. We used the DA annota-
tion that we also had available, and used the DA tag
sequence of part A and B as a feature.7
When we add the DA feature set, the accuracy
reaches 91.34%, which is only slightly better than
our 90.20% accuracy, which indicates that lexical,
durational, and structural features capture most of
the informativeness provided by DAs. This im-
proved accuracy with DA information should of
course not be considered as the actual accuracy of
our system, since DA information is difficult to ac-
quire automatically (Stolcke et al., 2000).
4 Agreements and Disagreements
4.1 Overview
This section focusses on the use of contextual in-
formation, in particular the influence of previous
agreements and disagreements and detected adja-
cency pairs, to improve the classification of agree-
ments and disagreements. We first define the classi-
fication problem, then describe non-contextual fea-
tures, provide some empirical evidence justifying
our choice of contextual features, and finally eval-
uate the classifier.
4.2 Agreement/Disagreement Classification
We need to first introduce some notational con-
ventions and define the classification problem
with the agreement/disagreement tagset. In our
classification problem, each spurt a0a33a15 among the a1
spurts of a meeting must be assigned a tag a0a21a15a3a2
a1 AGREEa6 DISAGREEa6 BACKCHANNELa6 OTHERa13 .
To specify the speaker of the spurt (e.g. speaker
B), the notation will sometimes be augmented to
incorporate speaker information, as with a0 a4
a15
, and
to designate the addressee of B (e.g. listener A),
we will use the notation a0 a4a6a5a8a7
a15
. For example,
a0
a4a6a5a8a7
a15
a28 AGREE simply means that B agrees with
A in the spurt of index a9 . This notation makes
it obvious that we do not necessarily assume
that agreements and disagreements are reflexive
7The annotation of DA is particularly fine-grained with a
choice of many optional tags that can be associated with each
DA. To deal with this problem, we used various scaled-down
versions of the original tagset.
relations. We define:
a65
a79
a61
a10
a11a12a5a14a13
a58
a0
a4a6a5a8a7
a15
a60
as the tag of the most recent spurt before a0 a4a15a5a8a7
a15
that
is produced by Y and addresses X. This definition
will help our multi-party analyses of agreement and
disagreement behaviors.
4.3 Local Features
Many of the local features described in this subsec-
tion are similar in spirit to the ones used in the pre-
vious work of (Hillard et al., 2003). We did not use
acoustic features, since the main purpose of the cur-
rent work is to explore the use of contextual infor-
mation.
Table 3 lists the features that were found most
helpful at identifying agreements and disagree-
ments. Regarding lexical features, we selected a
list of lexical items we believed are instrumental
in the expression of agreements and disagreements:
agreement markers, e.g. “yes” and “right”, as listed
in (Cohen, 2002), general cue phrases, e.g. “but”
and “alright” (Hirschberg and Litman, 1994), and
adjectives with positive or negative polarity (Hatzi-
vassiloglou and McKeown, 1997). We incorpo-
rated a set of durational features that were described
in the literature as good predictors of agreements:
utterance length distinguishes agreement from dis-
agreement, the latter tending to be longer since the
speaker elaborates more on the reasons and circum-
stances of her disagreement than for an agreement
(Cohen, 2002). Duration is also a good predictor
of backchannels, since they tend to be quite short.
Finally, a fair amount of silence and filled pauses
is sometimes an indicator of disagreement, since it
is a dispreferred response in most social contexts
and can be associated with hesitation (Pomerantz,
1984).
4.4 Contextual Features: An Empirical Study
We first performed several empirical analyses in or-
der to determine to what extent contextual informa-
tion helps in discriminating between agreement and
disagreement. By integrating the interpretation of
the pragmatic function of an utterance into a wider
context, we aim to detect cases of mismatch be-
tween a correct pragmatic interpretation and the sur-
face form of the utterance, e.g. the case of weak or
“empty” agreement, which has some properties of
downright agreement (lexical items of positive po-
larity), but which is commonly considered to be a
disagreement (Pomerantz, 1984).
While the actual classification problem incorpo-
rates four classes, the BACKCHANNEL class is ig-
Structural features:
a0 is the previous/next spurt of the same speaker?
a0 is the previous/next spurt involving the same B
speaker?
Durational features:
a0 duration of the spurt
a0 seconds of overlap with any other speaker
a0 seconds of silence during the spurt
a0 speech rate in the spurt
Lexical features:
a0 number of words in the spurt
a0 number of content words in the spurt
a0 perplexity of the spurt with respect to four lan-
guage models, one for each class
a0 first and last word of the spurt
a0 number of instances of adjectives with positive
polarity (Hatzivassiloglou and McKeown, 1997)
a0 idem, with adjectives of negative polarity
a0 number of instances in the spurt of each cue
phrase and agreement/disagreement token listed
in (Hirschberg and Litman, 1994; Cohen, 2002)
Table 3. Local features for agreement and disagreement
classification
nored here to make the empirical study easier to in-
terpret. We assume in that study that accurate AP
labeling is available, but for the purpose of building
and testing a classifier, we use only automatically
extracted adjacency pair information. We tested the
validity of four pragmatic assumptions:
1. previous tag dependency: a tag a0a33a15 is influ-
enced by its predecessor a0a7a15a1a0 a4
2. same-interactants previous tag depen-
dency: a tag a0
a4a6a5 a7
a15
is influenced by
a65
a79
a61
a10
a4a6a5a8a7
a58
a0
a4a6a5a8a7
a15
a60 , the most recent tag of
the same speaker addressing the same listener;
for example, it might be reasonable to assume
that if speaker B disagrees with A, B is likely
to disagree with A in his or her next speech
addressing A.
3. reflexivity: a tag a0 a4a6a5a8a7
a15
is influenced by
a65
a79
a61
a10
a7 a5 a4
a58
a0
a4a6a5a8a7
a15
a60 ; the assumption is that a0
a4
a15
is influenced by the polarity (agreement or dis-
agreement) of what A said last to B.
4. transitivity: assuming there is a speaker a2
for which
a65
a79
a61
a10
a13 a5a8a7
a58
a65
a79
a61
a10
a4a15a5a14a13
a58
a0
a4a15a5a8a7
a15
a60a21a60
exists, then a tag a0 a4a6a5a8a7
a15
is influ-
enced by
a65
a79
a61
a10
a4a6a5a14a13
a58
a0
a4a6a5a8a7
a15
a60 and
a65
a79
a61
a10
a13 a5 a7
a58
a65
a79
a61
a10
a4a6a5a14a13
a58
a0
a4a6a5a8a7
a15
a60a21a60 ; an ex-
ample of such an influence is a case where
speaker a2 first agrees with a3 , then speaker a4
disagrees with a2 , from which one could possi-
bly conclude that a4 is actually in disagreement
with a3 .
Table 4 presents the results of our empirical eval-
uation of the first three assumptions. For compar-
ison, the distribution of classes is the following:
18.8% are agreements, 10.6% disagreements, and
70.6% other. The dependencies empirically eval-
uated in the two last columns are non-local; they
create dependencies between spurts separated by an
arbitrarily long time span. Such long range depen-
dencies are often undesirable, since the influence of
one spurt on the other is often weak or too diffi-
cult to capture with our model. Hence, we made a
Markov assumption by limiting context to an arbi-
trarily chosen value a0 . In this analysis subsection
and for all classification results presented thereafter,
we used a value of a0 a28 a56a6a5 .
The table yields some interesting results, show-
ing quite significant variations in class distribution
when it is conditioned on various types of contex-
tual information. We can see for example, that
the proportion of agreements and disagreements (re-
spectively 18.8% and 10.6%) changes to 13.9% and
20.9% respectively when we restrict the counts to
spurts that are preceded by a DISAGREE. Simi-
larly, that distribution changes to 21.3% and 7.3%
when the previous tag is an AGREE. The variable
is even more noticeable between probabilities a17a19a18 a0a21a15a40a25
and a17a26a18 a0 a4a6a5a8a7
a15
a22
a65
a79
a61
a10
a4a6a5 a7
a18 a0
a4a6a5a8a7
a15
a25a53a25 . In 26.1% of the
cases where a given speaker B disagrees with A, he
or she will continue to disagree in the next exchange
involving the same speaker and the same listener.
Similarly with the same probability distribution, a
tendency to agree is confirmed in 25% of the cases.
The results in the last column are quite different
from the two preceding ones. While agreements in
response to agreements (a17a19a18 AGREEa22AGREEa25 a28 a8
a56a8a7a10a9 )
are slightly less probable than agreements with-
out conditioning on any previous tag (a17a26a18 AGREEa25 a28
a8
a56a12a11a13a11 ), the probability of an agreement produced
in response to a disagreement is quite high (with
23.4%), even higher than the proportion of agree-
ments in the entire data (18.8%). This last result
would arguably be quite different with more quar-
relsome meeting participants.
Table 5 represents results concerning the fourth
pragmatic assumption. While none of the results
characterize any strong conditioning of a0a8a14 by a0a70a15
and a0a53a37 , we can nevertheless notice some interest-
ing phenomena. For example, there is a tendency
for agreements to be transitive, i.e. if X agrees with
A and B agrees with X within a limited segment of
speech, then agreement between B and A is con-
firmed in 22.5% of the cases, while the probabil-
ity of the agreement class is only 18.8%. The only
slightly surprising result appears in the last column
of the table, from which we cannot conclude that
disagreement with a disagreement is equivalent to
agreement. This might be explained by the fact that
these sequences of agreement and disagreement do
not necessarily concern the same propositional con-
tent.
The probability distributions presented here are
admittedly dependent on the meeting genre and par-
ticularly speaker personalities. Nonetheless, we be-
lieve this model can as well be used to capture
salient interactional patterns specific to meetings
with different social dynamics.
We will next discuss our choice of a statisti-
cal model to classify sequence data that can deal
with non-local label dependencies, such as the ones
tested in our empirical study.
4.5 Sequence Classification with Maximum
Entropy Models
Extensive research has targeted the problem of la-
beling sequence information to solve a variety of
problems in natural language processing. Hidden
Markov models (HMM) are widely used and con-
siderably well understood models for sequence la-
beling. Their drawback is that, as most genera-
tive models, they are generally computed to max-
imize the joint likelihood of the training data. In
order to define a probability distribution over the
sequences of observation and labels, it is necessary
to enumerate all possible sequences of observations.
Such enumeration is generally prohibitive when the
model incorporates many interacting features and
long-range dependencies (the reader can find a dis-
cussion of the problem in (McCallum et al., 2000)).
Conditional models address these concerns.
Conditional Markov models (CMM) (Ratnaparkhi,
1996; Klein and Manning, 2002) have been
successfully used in sequence labeling tasks incor-
porating rich feature sets. In a left-to-right CMM as
shown in Figure 1(a), the probability of a sequence
of L tags a0 a28a43a18 a0a12a4a33a6a7a8a9a8a9a8a9a6 a0a2a1 a25 is decomposed as:
a17a26a18a3a0a49a22a24a26a25 a28 a17a19a18 a0a5a4a21a25
a1
a6
a15 a69a52a4
a17a19a18 a0 a15 a22a0 a15 a0 a4 a6a23a27 a15 a25
a24 a28 a18a31a27a32a4a33a6a7a8a9a8a9a8a9a6a23a27a7a1a46a25 is the vector of observations and
each a9 is the index of a spurt. The probability dis-
tribution a17a19a18 a0a64a15a10a22a0a47a15a1a0 a4a33a6a23a27a5a15a45a25 associated with each state of
the Markov chain only depends on the preceding tag
a0a70a15a1a0 a4 and the local observation a27a34a15 . However, in order
to incorporate more than one label dependency and,
in particular, to take into account the four pragmatic
c 1 c 2 c 1 c 2 c 3 
(a) (b) 
d 1 d 2 d 1 d 2 d 3 
Figure 1. (a) Left-to-right CMM. (b) More complex
Bayesian network. Assuming for example that a8 a88a10a9 a8a12a11a14a13a16a15a88
and a8a18a17 a9 a8 a11a19a13a16a15a17 , there is then a direct dependency be-
tween a8 a88 and a8a18a17 , and the probability model becomes
a20a22a21a24a23a26a25a27a29a28 a9 a20a22a21
a8
a88 a25a30 a88 a28a31a20a22a21
a8a12a32
a25
a8
a88a12a33 a30
a32
a28a31a20a22a21
a8 a17
a25
a8a34a32
a33
a8
a88a34a33 a30
a17
a28 . This is a sim-
plifying example; in practice, each label is dependent on
a fixed number of other labels.
contextual dependencies discussed in the previous
subsection, we must augment the structure of our
model to obtain a more general one. Such a model
is shown in Figure 1(b), a Bayesian network model
that is well-understood and that has precisely de-
fined semantics.
To this Bayesian network representation, we ap-
ply maximum entropy modeling to define a proba-
bility distribution at each node (a0a33a15 ) dependent on the
observation variable a27a76a15 and the five contextual tags
used in the four pragmatic dependencies.8 For no-
tational simplicity, the contextual tags representing
these pragmatic dependencies are represented here
as a vector a35 (a0 a15a1a0 a4 ,
a65
a79
a61
a10
a4a6a5 a7
a18 a0
a4a6a5a8a7
a15
a25 , and so on).
Given a35 feature functions a36a21a37a38a18a3a35a39a6a23a27a38a15a53a6 a0a70a15a40a25 (both
local and contextual, like previous tag features)
and a35 model parameters a41 a28 a18a45a44 a4 a6a7a8a9a8a9a8a9a6a47a44 a48 a25 , the
probability of the model is defined as:
a17a49a51a52a18 a0a70a15a10a22a35a39a6a23a27a38a15a40a25a54a28
a56
a57 a58
a35a39a6a23a27 a15 a60 a61a64a63a66a65
a67 a48
a68
a37a70a69a52a4
a44a5a37a3a36a7a37a38a18a3a35a39a6a23a27a38a15a87a6 a0a70a15a45a25a72a71
Again, the only role of the denominator a57 a18a31a24a26a25 is to
ensure that a17 a51 sums to 1, and need not be computed
when searching for the most probable tags. Note
that in our case, the structure of the Bayesian net-
work is known and need not be inferred, since AP
identification is performed before the actual agree-
ment and disagreement classification. Since tag se-
quences are known during training, the inference of
a model for sequence labels is no more difficult than
inferring a model in a non-sequential case.
We compute the most probable sequence by
performing a left-to-right decoding using a beam
search. The algorithm is exactly the same as the one
described in (Ratnaparkhi, 1996) to find the most
probable part-of-speech sequence. We used a large
beam of size a0 =100, which is not computationally
prohibitive, since the tagset contains only four ele-
8The transitivity dependency is conditioned on two tags,
while all others on only one. These five contextual tags are de-
faulted to OTHER when dependency spans exceed the threshold
of a36 a9a38a37a34a39 .
a0a2a1a4a3a6a5a8a7 a3a6a5a4a9a11a10a13a12 a0a2a1a4a3a6a14a16a15a18a17
a5
a7a20a19a22a21a24a23a26a25
a14a2a15a27a17
a1a28a3a6a14a16a15a18a17
a5
a12a29a12 a0a30a1a4a3a31a14a2a15a18a17
a5
a7a32a19a33a21a24a23a26a25
a17a34a15a18a14
a1a4a3a6a14a16a15a18a17
a5
a12a29a12
a0a2a1 AGREEa7 AGREEa12 .213 .250 .175
a0a2a1 OTHERa7 AGREEa12 .713 .643 .737
a0a30a1 DISAGREEa7 AGREEa12 .073 .107 .088
a0a2a1 AGREEa7 OTHERa12 .187 .115 .177
a0a2a1 OTHERa7 OTHERa12 .714 .784 .710
a0a30a1 DISAGREEa7 OTHERa12 .098 .100 .113
a0a30a1 AGREEa7 DISAGREEa12 .139 .087 .234
a0a30a1 OTHERa7 DISAGREEa12 .651 .652 .638
a0a2a1 DISAGREEa7 DISAGREEa12 .209 .261 .128
Table 4. Contextual dependencies (previous tag, same-interactants previous tag, and reflexivity)
a0a2a1a4a3 a14a16a15a18a17
a35
a7 a3a31a5a29a36a8a3a8a37a26a12 , where a3a8a37a39a38a40a19a33a21a24a23a26a25
a14a2a15a27a41
a1a28a3 a14a2a15a27a17
a35
a12 and a3a31a5a42a38a43a19a33a21a8a23a13a25
a41a44a15a27a17
a1a28a3a8a37a26a12
a3 a5 a38 AGREE a3 a5 a38 AGREE a3 a5 a38 DISAGREE a3 a5 a38 DISAGREE
a3a8a37a39a38 AGREE a3a8a37a39a38 DISAGREE a3a45a37a46a38 AGREE a3a45a37a46a38 DISAGREE
a0a30a1 AGREEa7 a3a6a5a24a36a24a3a8a37a26a12 .225 .147 .131 .152
a0a30a1 OTHERa7 a3 a5 a36a24a3 a37 a12 .658 .677 .683 .668
a0a30a1 DISAGREEa7 a3a31a5a29a36a8a3a8a37a26a12 .117 .177 .186 .180
Table 5. Contextual dependencies (transitivity)
ments. Note however that this algorithm can lead to
search errors. An alternative would be to use a vari-
ant of the Viterbi algorithm, which was successfully
used in (McCallum et al., 2000) to decode the most
probable sequence in a CMM.
4.6 Results
We had 8135 spurts available for training and test-
ing, and performed two sets of experiments to evalu-
ate the performance of our system. The tools used to
perform the training are the same as those described
in section 3.4. In the first set of experiments, we re-
produced the experimental setting of (Hillard et al.,
2003), a three-way classification (BACKCHANNEL
and OTHER are merged) using hand-labeled data of
a single meeting as a test set and the remaining data
as training material; for this experiment, we used
the same training set as (Hillard et al., 2003). Per-
formance is reported in Table 6. In the second set
of experiments, we aimed at reducing the expected
variance of our experimental results and performed
N-fold cross-validation in a four-way classification
task, at each step retaining the hand-labeled data of
a meeting for testing and the rest of the data for
training. Table 7 summarizes the performance of
our classifier with the different feature sets in this
classification task, distinguishing the case where the
four label-dependency pragmatic features are avail-
able during decoding from the case where they are
not.
First, the analysis of our results shows that with
our three local feature sets only, we obtain substan-
tially better results than (Hillard et al., 2003). This
Feature sets Accuracy
(Hillard et al., 2003) 82%
Lexical 84.95%
Structural and durational 71.23%
All (no label dependencies) 85.62%
All (with label dependencies) 86.92%
Table 6. 3-way classification accuracy
Feature sets Label dep. No label dep.
Lexical 83.54% 82.62%
Structural, durational 62.10% 58.86%
All 84.07% 83.11%
Table 7. 4-way classification accuracy
might be due to some additional features the latter
work didn’t exploit (e.g. structural features and ad-
jective polarity), and to the fact that the learning al-
gorithm used in our experiments might be more ac-
curate than decision trees in the given task. Second,
the table corroborates the findings of (Hillard et al.,
2003) that lexical information make the most help-
ful local features. Finally, we observe that by in-
corporating label-dependency features representing
pragmatic influences, we further improve the perfor-
mance (about 1% in Table 7). This seems to indicate
that modeling label dependencies in our classifica-
tion problem is useful.
5 Conclusion
We have shown how identification of adjacency
pairs can help in designing features representing
pragmatic dependencies between agreement and
disagreement labels. These features are shown to
be informative and to help the classification task,
yielding a substantial improvement (1.3% to reach
a 86.9% accuracy in three-way classification).
We also believe that the present work may be use-
ful in other computational pragmatic research fo-
cusing on multi-party dialogs, such as dialog act
(DA) classification. Most previous work in that area
is limited to interaction between two speakers (e.g.
Switchboard, (Stolcke et al., 2000)). When more
than two speakers are involved, the question of who
is the addressee of an utterance is crucial, since it
generally determines what DAs are relevant after the
addressee’s last utterance. So, knowledge about ad-
jacency pairs is likely to help DA classification.
In future work, we plan to extend our inference
process to treat speaker ranking (i.e. AP identifica-
tion) and agreement/disagreement classification as
a single, joint inference problem. Contextual in-
formation about agreements and disagreements can
also provide useful cues regarding who is the ad-
dressee of a given utterance. We also plan to incor-
porate acoustic features to increase the robustness of
our procedure in the case where only speech recog-
nition output is available.
Acknowledgments
We are grateful to Mari Ostendorf and Dustin
Hillard for providing us with their agreement and
disagreement labeled data.
This material is based on research supported by
the National Science Foundation under Grant No.
IIS-012196. Any opinions, findings and conclu-
sions or recommendations expressed in this mate-
rial are those of the authors and do not necessarily
reflect the views of the National Science Founda-
tion.
References
A. Berger, S. Della Pietra, and V Della Pietra.
1996. A maximum entropy approach to natural
language processing. Computational Linguistics,
22(1):39–72.
J. Cohen. 1960. A coefficient of agreement for
nominal scales. Educational and Psychological
measurements, 20:37–46.
S. Cohen. 2002. A computerized scale for monitor-
ing levels of agreement during a conversation. In
Proc. of the 26th Penn Linguistics Colloquium.
M. Collins. 2000. Discriminative reranking for nat-
ural language parsing. In Proc. 17th Interna-
tional Conf. on Machine Learning, pages 175–
182.
J. N. Darroch and D. Ratcliff. 1972. Generalized
iterative scaling for log-linear models. Annals of
Mathematical Statistics, 43:1470–1480.
R. Dhillon, S. Bhagat, H. Carvey, and E. Shriberg.
2004. Meeting recorder project: Dialog act label-
ing guide. Technical Report TR-04-002, ICSI.
V. Hatzivassiloglou and K. McKeown. 1997. Pre-
dicting the semantic orientation of adjectives. In
Proc. of ACL.
D. Hillard, M. Ostendorf, and E Shriberg. 2003.
Detection of agreement vs. disagreement in meet-
ings: training with unlabeled data. In Proc. of
HLT/NAACL.
J. Hirschberg and D. Litman. 1994. Empirical stud-
ies on the disambiguation of cue phrases. Com-
putational Linguistics, 19(3):501–530.
A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gel-
bart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg,
A. Stolcke, and C. Wooters. 2003. The ICSI
meeting corpus. In Proc. of ICASSP-03, Hong
Kong.
D. Klein and C. D. Manning. 2002. Conditional
structure versus conditional estimation in NLP
models. Technical report.
S. Levinson. 1983. Pragmatics. Cambridge Uni-
versity Press.
A. McCallum, D. Freitag, and F. Pereira. 2000.
Maximum entropy markov models for informa-
tion extraction and segmentation. In Proc. of
ICML.
A. Pomerantz. 1984. Agreeing and disagree-
ing with assessments: some features of pre-
ferred/dispreferred turn shapes. In J.M. Atkinson
and J.C. Heritage, editors, Structures of Social
Action, pages 57–101.
A. Ratnaparkhi. 1996. A maximum entropy part-
of-speech tagger. In Proc. of EMNLP.
D. Ravichandran, E. Hovy, and F. J. Och. 2003.
Statistical QA - classifier vs re-ranker: What’s
the difference? In Proc. of the ACL Workshop
on Multilingual Summarization and Question An-
swering.
E. A. Schegloff and H Sacks. 1973. Opening up
closings. Semiotica, 7-4:289–327.
E. Shriberg, R. Dhillon, S. Bhagat, J. Ang, and
H. Carvey. 2004. The ICSI meeting recorder dia-
log act (MRDA) corpus. In SIGdial Workshop on
Discourse and Dialogue, pages 97–100.
A. Stolcke, K. Ries, N. Coccaro, E. Shriberg,
R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van
Ess-Dykema, and M. Meteer. 2000. Dialogue
act modeling for automatic tagging and recog-
nition of conversational speech. Computational
Linguistics, 26(3):339–373.
