Annotating Student Emotional States in Spoken Tutoring Dialogues
Diane J. Litman
University of Pittsburgh
Department of Computer Science
Learning Research and Development Center
Pittsburgh, PA 15260, USA
litman@cs.pitt.edu
Kate Forbes-Riley
University of Pittsburgh
Learning Research and Development Center
Pittsburgh, PA 15260, USA
forbesk@pitt.edu
Abstract
We present an annotation scheme for stu-
dent emotions in tutoring dialogues. Analy-
ses of our scheme with respect to interannota-
tor agreement and predictive accuracy indicate
that our scheme is reliable in our domain, and
that our emotion labels can be predicted with
a high degree of accuracy. We discuss issues
concerning the implementation of emotion pre-
diction and adaptation in the computer tutoring
dialogue system we are developing.
1 Introduction
This paper describes a coding scheme for annotating stu-
dent emotional states in spoken dialogue tutoring cor-
pora, and analyzes the scheme not only for its reliabil-
ity, but also for its utility in developing a spoken dia-
logue tutoring system that can model and respond to stu-
dent emotions. Motivation for this work comes from the
performance discrepancy between human tutors and cur-
rent machine tutors: typically, students tutored by hu-
man tutors achieve higher learning gains than students
tutored by computer tutors. The development of com-
putational tutorial dialogue systems (Rosé and Aleven,
2002) represents one method of closing this performance
gap, e.g. it is hypothesized that dialogue-based tutors al-
low greater adaptivity to students’ beliefs and misconcep-
tions. Another method for closing this performance gap
involves incorporating emotion prediction and adaptation
into computer tutors (Kort et al., 2001; Evens, 2002).
For example, (Aist et al., 2002) have shown that adding
human-provided emotional scaffolding to an automated
reading tutor increases student persistence. This suggests
that the success of computer dialogue tutors could be in-
creased by responding to both what a student says and
how s/he says it, e.g. with confidence or uncertainty.
To assess the impact of adding emotion modeling to
dialogue tutoring systems, we are building ITSPOKE
(Intelligent Tutoring SPOKEn dialogue system), a spo-
ken dialogue system that uses the Why2-Atlas concep-
tual physics tutoring system (VanLehn et al., 2002) as its
back-end.1 Our first step towards incorporating emo-
tion processing into ITSPOKE is to develop a reliable
annotation scheme for student emotions. Our next step
will be to use the data that has been annotated accord-
ing to this scheme to enhance ITSPOKE to dynamically
predict and adapt to student emotions. This places further
constraints on our annotation scheme beyond good
reliability, namely that our annotations are predictable by
ITSPOKE with a high degree of accuracy (automatically
and in real-time), and that they are expressive enough to
support the range of desired system adaptations.
In Section 2 we review previous work in emotion anno-
tation for spoken dialogue systems. In Section 3 we dis-
cuss our tutoring research project and corpora. In Section
4 we present an emotion annotation scheme for this do-
main. In Section 5 we analyze our scheme with respect to
interannotator agreement and predictive accuracy, using a
corpus of human tutoring dialogues. Our agreement indi-
cates that our scheme is reliable, while machine learning
experiments on annotated data indicate that our emotion
labels can be predicted with a high degree of accuracy.
In Section 6 we analyze more expressive versions of our
scheme, and discuss differences between annotating hu-
man and computer spoken tutoring dialogues.
2 Prior Research on Emotion
Developing a descriptive theory of emotion is a com-
plex research topic, viewed from either a theoretical or
an empirical standpoint (Cowie et al., 2001). Some re-
searchers have proposed a variety of "fundamental" hu-
man emotions, while others have argued that emotions
1We also use ITSPOKE to examine the utility of building
spoken dialogue tutors (e.g. (Litman and Forbes, 2003)).
are best represented componentially, in terms of multiple
dimensions. Despite this lack of a well-defined descrip-
tive framework, there has been great recent interest in
predicting emotional states, using information extracted
from a person’s text, speech, physiology, facial expres-
sions, eye gaze, etc. (Pantic and Rothkrantz, 2003).
In the area of emotional speech, most research has
used databases of speech read by actors or native speak-
ers as training data for developing emotion predic-
tors (Holzapfel et al., 2002; Liscombe et al., 2003). In
this work the set of emotions to be read is predefined be-
fore the utterance is spoken, rather than annotated after
the fact. One problem with this approach is that such
prototypical emotional speech does not necessarily re-
flect natural speech (Batliner et al., 2003), e.g. the way
one acts an emotion is not necessarily the same as the
way one naturally expresses an emotion. Moreover, ac-
tors repeatedly reading the same sentence are restricted
to conveying different emotions using only acoustic and
prosodic features, while in natural interactions a much
wider feature variety is available (e.g., lexical, dialogue).
As a result of these problems, researchers motivated by
spoken dialogue applications have instead started to train
emotion predictors using naturally-occurring speech that
has been hand-annotated for various emotions (Ang et al.,
2002; Batliner et al., 2003; Lee et al., 2001; Litman and
Forbes, 2003). However, this requires researchers to first
develop a scheme for annotating emotions in naturally-
occurring spoken dialogue corpora. Although emotion
annotation of natural corpora (typically at the turn or ut-
terance level) has been addressed in various domains, lit-
tle has yet been done in the educational setting. Although
not yet tested, (Evens, 2002) has hypothesized adaptive
strategies; for example, upon detecting frustration signaled
by hedges and self-deprecation, the system should respond
by supplying praise and restructuring the problem. A com-
parison of our annotation scheme and prior non-tutoring
schemes is presented in Section 4.4.
3 The ITSPOKE System and Corpora
In ITSPOKE, a student types an essay answering a qual-
itative physics problem. The ITSPOKE computer tutor
then engages the student in spoken dialogue to correct
misconceptions and elicit more complete explanations,
after which the student revises the essay, thereby ending
the tutoring or causing another round of tutoring/essay re-
vision. Student speech is digitized from microphone in-
put and sent to the Sphinx2 recognizer, whose most prob-
able "transcription" output is then sent to the Why2-Atlas
back-end for syntactic, semantic and dialogue analysis.
The text response produced by Why2-Atlas is sent to the
Cepstral text-to-speech system. A formal evaluation of
ITSPOKE began in November 2003; to date we have col-
lected 50 dialogues from 10 students. A corpus example
is shown in Figure 4, Appendix A. Corpus collection uses
the same experimental procedure as our human-human
tutoring corpus, described next.
Our Human-Human Spoken Dialogue Tutoring Corpus
contains spoken dialogues collected via a web interface
supplemented with a high-quality audio link, where the
human tutor performs the same task as ITSPOKE. The
experimental procedure for collecting both corpora is as
follows: 1) students are given a pre-test measuring their
physics knowledge, 2) students read through a small doc-
ument of background material, 3) students use the web
and voice interface to work through a set of training prob-
lems (dialogues) with the tutor, and 4) students are given
a post-test that is similar to the pre-test. Subjects are Uni-
versity of Pittsburgh students who have never taken col-
lege physics and who are native English speakers. One tu-
tor currently participates. To date we have collected 149
dialogues from 17 students. Annotated (see Section 4)
corpus examples are shown in Figure 1 and Figure 2 (Ap-
pendix A) (punctuation added for clarity).
. . . dialogue excerpt at 5.2 minutes into session. . .
TUTOR1: Suppose you apply equal force by pushing them.
Then uh what will happen to their motion?
STUDENT1: Um the one that’s heavier...uh, the acc- accelera-
tion won’t be as great. (NEGATIVE, UNCERTAIN)
TUTOR2: The one which is...
STUDENT2: Heavier? (NEGATIVE, UNCERTAIN)
TUTOR3: Mm, well, uh, is that your common-
STUDENT3: Er I’m sorry- I’m sorry- the one with most mass.
You- (POSITIVE, CONFIDENT)
TUTOR4: (lgh) Yeah, the one with more mass will- if you-
if the mass is more and force is the same then which one will
accelerate more?
STUDENT4: Which one will move more? (NEGATIVE,
CONFUSED)
TUTOR5: Mm, which one will accelerate more?
STUDENT5: The- the one with the least amount of mass?
(NEGATIVE, UNCERTAIN)
TUTOR6: Yeah, but what you said was different isn’t it? So
uh you are applying Newton’s law of uh second law of motion:
F is equal to M times A. And uh you apply equal force on both
the containers, then the one which is less massive will accelerate
more.
STUDENT6: Right. (WEAK POSITIVE, CONFIDENT)
Figure 1: Annotated Excerpt (Human Spoken Corpus)
4 Annotation Scheme
In our spoken dialogue tutoring corpora, student emo-
tional states can only be identified indirectly, via what a
student says and/or how s/he says it. Furthermore, such
evidence is not always obvious, unambiguous, or consis-
tent. For example, a student may express anger through
the use of swear words, or through a particular tone of
voice, or via a combination of signals, or not at all. More-
over, another student may present some of these same sig-
nals even when s/he does not feel anger.
Our objective is nevertheless to develop a reliable an-
notation scheme across annotators, for manually labeling
the student turns in our spoken tutoring dialogues for per-
ceived expressions of emotion.
4.1 Emotion Classes
In our current annotation scheme, perceived expressions
of emotion are viewed along a linear scale, as shown and
defined below: negative ↔ neutral ↔ positive
Negative: a student turn that strongly expresses emo-
tions such as confused, bored, irritated, uncertain, sad.
Examples in Figure 1 include STUDENT1 and STUDENT5.
Evidence2 for the negative emotions in these turns in-
cludes syntax (constructions such as questions), disflu-
encies, and acoustic-prosodic features.
Positive: a student turn that strongly expresses emo-
tions such as confident, enthusiastic. An example is
STUDENT3 in Figure 1, where evidence of a positive emo-
tion comes primarily from acoustic-prosodic features.
Neutral: a student turn not strongly expressing a neg-
ative or positive emotion.
In addition to these three main emotion classes, we
also distinguish three minor emotion classes:
Weak Negative: a student turn that weakly expresses
negative emotions.
Weak Positive: a student turn that weakly expresses
positive emotions. An example is STUDENT6 in Figure
1, where evidence is primarily lexical ("right").
Mixed: a student turn that strongly expresses both posi-
tive and negative emotions: Case 1) multi-utterance turns
where one utterance is judged positive and another, nega-
tive. Case 2) turns where the simultaneous strong expres-
sion of negative and positive emotions is perceived. Case
2 is often due to conflicting domains (Section 4.2), e.g.
boredom with tutoring but confidence about physics.
4.2 Relativity and Domains of Emotion Classes
Our emotion annotation is relative to both context and
task. By context-relative we mean that a student turn in
our tutoring dialogues is identified as expressing emotion
relative to the other student turns in that dialogue. By
task-relative we mean that a student turn perceived during
tutoring as expressing an emotion might not be perceived
as expressing the same emotion with the same strength
in another situation. For example, consider the context
of a tutoring session, where a student has been answer-
ing tutor questions with apparent ease. If the tutor then
asks another question, and the student responds slowly,
2Determined in post-annotation discussion (see Section 4.4).
saying "Um, now I’m confused", this turn would likely
be labeled negative. However, in the context of a heated
argument between two people, this same turn might be
labeled as a weak negative, or even weak positive.
We also annotate emotion with respect to multiple do-
mains. One focus of our annotation scheme is expres-
sions of emotion that pertain to the physics material be-
ing learned ("PHYS" domain). For example, a student
may express confusion or confidence about the physics
material. Another focus of our scheme is expressions of
emotion that pertain to the tutoring process, including at-
titudes towards the tutor, the dialogue, and/or being tu-
tored ("TUT" domain). For example, a student may ex-
press boredom or amusement with the tutoring.
4.3 Specific Annotation Instructions
Our annotation scheme is detailed in an online, audio-
enhanced emotion labeling manual. As shown in Figure 3
(Appendix A), the emotion annotation is performed using
(our customization of) Wavesurfer, an open source sound
visualization and manipulation tool. The "Tutor Speech"
and "Student Speech" panes show a portion of the tutor
and student speech files, while the "Tutor Text" and "Stu-
dent Text" panes show the associated transcriptions, where
vertical lines correspond to turn segmentations.3 There are
three additional panes for emotion annotation:
The EMOa pane records the annotator’s judgment of
the expressed emotion class for each turn, i.e. one of the
six emotion classes described in Section 4.1: negative, weak
negative, neutral, weak positive, positive, mixed. An-
notators are instructed to focus on expressed emotions in
the PHYS domain. If an additional expressed emotion in
the TUT domain is perceived, this is noted in the NOTES
pane (e.g. "amused/TUT"). If no expressed emotion is
perceived in the PHYS domain, any expressed emotion in
the TUT domain is labeled in the EMOa pane, and noted
(e.g. "TUT") in the NOTES pane. Domain indecision is
also noted (e.g. "TUT/PHYS?") in the NOTES pane.
The EMOb pane further specifies the annotations in
the EMOa pane, by recording a specific expressed emo-
tion for each turn. Our current list of specific emotions
contains those that we believe will be useful for trigger-
ing ITSPOKE adaptation. Specific negative emotions are:
uncertain, confused, sad, bored, irritated. Specific pos-
itive emotions are: confident, enthusiastic. Our manual
includes glosses for these specific emotions, formulated
using synonyms and/or hyponyms that are currently not
distinguished. For example, our gloss for enthusiastic in-
cludes interested, pleased, amused. There are also com-
plex labels combining multiple specific emotions within
a class (e.g. uncertain+sad, confident+enthusiastic). If
3Transcription and turn-segmentation of the human-human
dialogues were also done within Wavesurfer, by a paid tran-
scriber prior to emotion annotation.
the annotator judges a specific emotion that is not listed
(or lacks a close substitute), s/he selects the label other,
and lists the alternative(s) in the NOTES pane. If the an-
notator selected mixed (case 1) in the EMOa pane, s/he
subdivides the turn into utterances in the EMOb pane and
provides a specific emotion label for each utterance. If
the annotator selected mixed (case 2) in the EMOa pane,
s/he selects the label other in the EMOb pane, and com-
ments on the indecision in the NOTES pane.
The NOTES pane records any additional annotator
comments concerning their judgment, the annotation, etc.
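To make the three-pane output concrete, the sketch below shows one way a per-turn annotation record could be represented in Python; the field names and types are our own illustration, not the format used by our Wavesurfer customization.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TurnAnnotation:
    """Hypothetical per-turn record mirroring the three annotation panes."""
    turn_id: str                    # e.g. "student_5"
    emo_a: str                      # EMOa pane: negative, weak negative, neutral,
                                    # weak positive, positive, or mixed
    emo_b: List[str] = field(default_factory=list)  # EMOb pane: specific
                                    # emotion(s), e.g. ["uncertain", "sad"]
    notes: Optional[str] = None     # NOTES pane: free comments, e.g. "amused/TUT"

ann = TurnAnnotation(turn_id="student_5", emo_a="negative", emo_b=["uncertain"])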
Because our annotation is student-, context-, and task-
specific, our manual first instructs the annotator to listen
to each dialogue at least once before annotating, to de-
velop an intuition for how, and with what range, emotional
expression is displayed. S/he is also instructed to not as-
sume that all dialogues will begin with neutral student
turns. S/he is however reminded that it is not necessary
to assign a non-neutral label to every turn. Finally, s/he
is told to ignore correctness when annotating, because a
correct answer to a tutor question can express uncertainty,
and an incorrect answer can express confidence.
Our manual also describes two default conventions for
our annotation scheme, which can however be overridden
by the annotator’s intuitive judgment and/or other extenu-
ating considerations (e.g. irony), as described below:
1) By definition, a question expresses strong uncertainty
or confusion. Thus if a student turn consists only of a
question, its default label is negative. However:
a) If the turn consists of multiple utterances, one of
which is a question, and the other(s) expresses a positive
emotion, then the turn should be labeled mixed and sub-
divided (e.g. "What directions are the forces acting in?
Gravity is only acting in the down direction").
b) The domain must be considered. For example, de-
faults in one domain can be overridden if the turn ex-
presses a contrasting emotion in the other domain.
2) Many student turns in our dialogues are very short,
containing only grounding phrases such as "yeah", "ok",
"mm-hm", "uh-huh", etc. By default, such turns are la-
beled neutral, because groundings serve mainly to en-
courage another speaker to continue speaking. However:
a) Groundings may occasionally strongly express an
emotion (e.g. "yeah!", (sigh) "ok"), thereby overriding
the default label.
b) The semantics of certain groundings are associated
with weakly expressed understanding (e.g. "right" and
"sure"); such turns default to weak positive.
c) Certain phrases are associated with strongly ex-
pressed uncertainty or confusion (e.g. "um" (silence)),
and default to negative.
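As a rough illustration only, these default conventions can be read as a rule cascade; the sketch below encodes them in Python. The phrase lists and the question test are simplifications of the manual’s actual instructions, and in practice the annotator’s intuition always overrides these defaults.

# Illustrative encoding of the two default conventions (Section 4.3).
# Phrase lists are simplified examples; annotator judgment overrides all rules.
WEAK_POSITIVE_GROUNDINGS = {"right", "sure"}            # convention 2b
NEUTRAL_GROUNDINGS = {"yeah", "ok", "mm-hm", "uh-huh"}  # convention 2 default
UNCERTAIN_PHRASES = {"um"}                              # convention 2c

def default_label(turn: str) -> str:
    """Default EMOa label for a single-utterance turn, before any override."""
    text = turn.strip().lower()
    if text.endswith("?"):                  # convention 1: a lone question
        return "negative"                   # defaults to negative (uncertainty)
    word = text.rstrip("!.,")
    if word in WEAK_POSITIVE_GROUNDINGS:
        return "weak positive"
    if word in UNCERTAIN_PHRASES:
        return "negative"
    if word in NEUTRAL_GROUNDINGS:
        return "neutral"
    return "neutral"                        # no default evidence either way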
Our annotation manual concludes with 8 examples of
annotated student turns (as in Figure 1), with links to cor-
responding audio files. The variety exemplifies how dif-
ferent students express emotions differently at different
points in the dialogue, and covers all 6 emotion labels at
least once (there are 2 negatives and 2 positives). Also
provided is a lengthy audio-enhanced transcript from a
single student tutoring dialogue, to exemplify how stu-
dent emotion changes throughout a single tutoring ses-
sion. This transcript is shown in part in Figure 2, Ap-
pendix A. The transcript is organized in terms of tutor
and student turn start and end times. For each student
turn, the four Wavesurfer panes are shown.
4.4 Comparison with Prior Schemes
Studies of actor-read speech often make a large num-
ber of emotion distinctions, e.g. the LDC Emotional
Prosody corpus distinguishes 15 classes. Our work,
like other studies of naturally occurring dialogues, uses
a more restricted set of emotions, due to the need to
first manually annotate such emotions reliably across an-
notators. As discussed above, our annotation scheme
distinguishes negative, neutral, and positive emotions,
as well as "weak" and "mixed" classes. Other stud-
ies of naturally occurring data have annotated only two
emotion classes (e.g. emotional/non-emotional (Bat-
liner et al., 2000), negative/non-negative (Lee et al.,
2001)). The study of (Ang et al., 2002) annotates six
emotion classes, but collapses most of these for the
purposes of emotion prediction.4 In Section 5, we
will similarly explore the impact of collapsing some
of our 6 distinctions, to produce simpler 3-way (neg-
ative/positive/neutral) and 2-way (negative/non-negative
and emotional/non-emotional) schemes.
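For concreteness, these collapsing schemes can be written as simple label maps; the sketch below is our own encoding of the conflations explored in Section 5, not code from the annotation tool.

# Two ways of handling the minor classes, each yielding 3-way NPN labels
# from which the binary NnN and EnE collapses are then derived.
MINOR_TO_NEUTRAL = {"weak negative": "neutral", "weak positive": "neutral",
                    "mixed": "neutral"}
WEAK_TO_MAIN = {"weak negative": "negative", "weak positive": "positive",
                "mixed": "neutral"}

def npn(label: str, scheme: dict) -> str:
    """3-way negative/neutral/positive label under a conflation scheme."""
    return scheme.get(label, label)

def nnn(label: str, scheme: dict) -> str:
    """Binary negative/non-negative collapse, akin to (Lee et al., 2001)."""
    return "negative" if npn(label, scheme) == "negative" else "non-negative"

def ene(label: str, scheme: dict) -> str:
    """Binary emotional/non-emotional collapse, akin to (Batliner et al., 2000)."""
    return "non-emotional" if npn(label, scheme) == "neutral" else "emotional"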
In further contrast to (Lee et al., 2001), our annotations
are context- and task-relative, because like (Ang et al.,
2002; Batliner et al., 2003), we are interested in detect-
ing emotional changes across our dialogues. But unlike
(Batliner et al., 2003), we allow annotators to be guided
by their intuition rather than a set of expected features,
to avoid restricting or otherwise influencing their intu-
itive understanding of emotion expression, and because
such features are not used consistently or unambiguously
across speakers. Instead, our manual contains annotated
audio-enhanced corpus examples (as in Figures 1-2).
5 Analysis of the Annotation Scheme
Given our complete annotation scheme in Section 4, we
now explore both the reliability of the scheme at three
levels of granularity that have been proposed in prior
work, and the accuracy of automatically predicting these
variations. These analyses give insight into the tradeoff
4(Ang et al., 2002) also discusses the use of an "uncertainty"
label, although it did not improve inter-annotator agreement.
Our "weak" labels are more similar to an "intensity" dimension
found in studies of elicited speech (see (Cowie et al., 2001)).
between interannotator reliability, annotation granularity,
and predictive accuracy.
For the purposes of these analyses, we randomly se-
lected 10 transcribed and turn-annotated dialogues from
our human-human tutoring corpus (Section 3), yielding
453 student turns from 9 subjects. The turns were sep-
arately annotated by two annotators, using the emotion
annotation instructions in Section 4. For our machine-
learning experiments we follow the methodology in (Lit-
man and Forbes, 2003), instantiated with the learning
method (boosted decision trees) and feature set (acoustic-
prosodic, lexical, dialogue and contextual) that has given
us our best results in ongoing studies.
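The sketch below illustrates only the shape of these experiments, a boosted decision-tree learner evaluated with 10 runs of 10-fold cross-validation, using scikit-learn stand-ins; the actual learner and feature extraction follow (Litman and Forbes, 2003) and are not reproduced here.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def predictive_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """Mean accuracy over 10 x 10 cross-validation.

    X: one row of (acoustic-prosodic, lexical, dialogue, contextual)
       features per agreed student turn; y: the agreed emotion labels.
    """
    learner = AdaBoostClassifier(n_estimators=50)  # boosted decision stumps,
                                                   # a stand-in for the boosted
                                                   # decision trees used here
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    return cross_val_score(learner, X, y, cv=cv, scoring="accuracy").mean()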
5.1 Agreed Student Turns
Conflating Minor and Neutral Classes
For our first analysis, only our three main emotion
classes were distinguished: negative, neutral, positive.
Our three minor classes (weak negative, mixed, weak pos-
itive) were conflated with the neutral class. A confusion
matrix summarizing the resulting inter-annotator agree-
ment is shown in Table 1. The rows correspond to the
labels assigned by annotator 1, and the columns corre-
spond to the labels assigned by annotator 2. For example,
90 negatives were agreed upon by both annotators, while
6 negatives assigned by annotator 1 were labeled as neu-
tral by annotator 2. The two annotators agreed on the an-
notations of 385/453 turns, achieving 84.99% agreement
(Kappa = 0.68 (Carletta, 1996)). Such agreement is re-
spectable given the difficulty of the task, and it exceeds
that of prior studies of emotion annotation in naturally
occurring speech; (Ang et al., 2002), for example, achieved
agreement of 71% (Kappa 0.47), while (Lee et al., 2001)
averaged around 70% agreement.
As in (Lee et al., 2001), we next performed a machine
learning experiment on the 385 student turns where the
two annotators agreed on the emotion label. Our predic-
tive accuracy for this data was 84.75% (using 10 x 10
cross-validation as in (Litman and Forbes, 2003)). Com-
pared to a baseline accuracy of 72.74% achieved by al-
ways predicting the majority (neutral) class, our result
yields a relative improvement of 44.06%.5
negative neutral positive
negative 90 6 4
neutral 23 280 30
positive 0 5 15
Table 1: Confusion Matrix 1: Minor → Neutral
5Relative improvement of x over y = (error(y) − error(x)) / error(y),
where error(x) is 100 − %accuracy(x).
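The agreement and improvement figures above can be reproduced directly from Table 1; a short sketch follows, with the counts hard-coded from the table.

import numpy as np

# Confusion matrix from Table 1 (rows: annotator 1; columns: annotator 2).
m = np.array([[90, 6, 4], [23, 280, 30], [0, 5, 15]], dtype=float)
n = m.sum()                                      # 453 turns
p_o = np.trace(m) / n                            # observed agreement: 0.8499
p_e = (m.sum(axis=1) @ m.sum(axis=0)) / n ** 2   # chance agreement
kappa = (p_o - p_e) / (1 - p_e)                  # ~0.68

def relative_improvement(acc: float, base: float) -> float:
    """Relative improvement of the learner over the baseline (footnote 5)."""
    return ((100 - base) - (100 - acc)) / (100 - base)

print(relative_improvement(84.75, 72.74))        # ~0.4406, i.e. 44.06%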
Conflating Weak and Negative/Positive Classes
In a second analysis, we again distinguished only our
three main emotion classes; however, this time weak neg-
ative was conflated with negative, and weak positive was
conflated with positive. Our mixed class was again con-
flated with neutral. A confusion matrix summarizing the
resulting inter-annotator agreement is shown in Table 2.
As shown, although the number of agreed negative and
positive turns increased, overall interannotator agreement
decreased to 340/453 turns, or 75.06% (Kappa = 0.60).
We performed our machine learning experiment on
these 340 agreed student turns. The predictive accuracy
for this data decreased to 79.29%; however, baseline (ma-
jority class) accuracy also decreased to 53.24%; thus rel-
ative improvement in fact increased to 55.71%.
negative neutral positive
negative 112 9 9
neutral 31 181 53
positive 1 10 47
Table 2: Confusion Matrix 2: Weak → Neg/Pos
Negative/Non-Negative Classes
As Tables 1-2 indicate, our annotators found the pos-
itive class the most difficult to annotate and agree upon,
and the positive class was also the least frequent class
overall. Not surprisingly, our prior machine learning ex-
periments have also showed that the positive class is the
hardest to predict (Litman and Forbes, 2003). We thus
next explored a binary analysis where our positive and
neutral classes are conflated, yielding a negative/non-
negative distinction akin to (Lee et al., 2001). Again
however we experimented with conflating our minor
weak classes with either the neutral class or their main
class counterparts (e.g. weak negative → negative).
Two confusion matrices summarizing the resulting inter-
annotator agreements are shown in Tables 3 - 4.
In Table 3, our three minor classes are conflated with
the neutral class. Interannotator agreement in this case
rises sharply to 420/453 turns, or 92.72% (Kappa =
0.80). The predictive accuracy for this data increased to
86.83%; however, baseline (majority class) accuracy also
increased to 78.57%; thus relative improvement in fact
decreased to 38.54%.
negative non-negative
negative 90 10
non-negative 23 330
Table 3: Confusion Matrix 3: Pos/Neu → Non-Neg
In Table 4, our two weak classes are conflated with
their main class counterparts. Interannotator agreement
only rises to 403/453 turns, or 88.96% (Kappa = 0.74).
Predictive accuracy decreases to 82.94%. However, base-
line (majority class) accuracy also decreases to 72.21%;
thus relative improvement was comparable, at 38.61%.
negative non-negative
negative 112 18
non-negative 32 291
Table 4: Conf. Matrix 4: (Weak) Pos/Neu → Non-Neg
Emotional/Non-Emotional Classes
We also explored an alternative binary analysis that
conflated our positive and negative classes, yielding an
emotional/non-emotional distinction, akin to (Batliner
et al., 2000). Again we conflated our minor weak classes
with either the neutral class or their main class counter-
parts, as shown in Tables 5-6. In Table 5, our three mi-
nor classes are conflated with the neutral class, yielding
agreement on 389/453 turns, or 85.87% (Kappa = 0.67).
The predictive accuracy was high at 85.07%, while base-
line (majority) accuracy was 71.98%; thus relative im-
provement was 46.72%.
emotional non-emotional
emotional 109 11
non-emotional 53 280
Table 5: Confusion Matrix 5: Pos/Neg → Emotional
In Table 6, weak classes are conflated with their main
class counterparts. Interannotator agreement decreases to
350/453 turns, or 77.26% (Kappa = 0.55). Predictive ac-
curacy was high at 86.14%; moreover, baseline (majority)
accuracy was the lowest yet seen, 51.71%, and relative
improvement was the best yet seen, at 71.30%.
emotional non-emotional
emotional 169 19
non-emotional 84 181
Table 6: Confusion Matrix 6: (Weak) Pos/Neg → Emo
Summary
A summary of our results across analyses of agreed
student turns is shown in Table 7. NPN represents anal-
yses distinguishing negative, neutral and positive emo-
tions, NnN represents "negative/non-negative" analyses,
and EnE represents "emotional/non-emotional" analy-
ses. Column "K" shows Kappa for each analysis, "Acc"
shows the predictive accuracy achieved by machine learn-
ing, "Base" shows the baseline (majority class) accu-
racy, and "RI" shows the relative improvement achieved
by learning compared with this baseline.
As can be seen, there is no single optimal way to con-
flate the original 6 classes; optimality depends on whether
maximizing Kappa, predictive accuracy, or expressive-
ness is most important. For example, conflating minor
and neutral labels (the first three rows) yields better an-
notation reliability than conflating weak and main labels
(the last three rows); the reverse is true, however, for
machine learning performance (measured by relative
improvement over the majority class
baseline). With respect to expressiveness, only the 3-way
NPN distinction can explicitly distinguish positive emo-
tions. With respect to the binary distinctions, annotating
negative/non-negative (NnN) can be done most reliably,
while predicting emotional/non-emotional (EnE) yields a
better relative improvement.
K Acc Base RI
minor → neutral
NPN .68 84.75% 72.74% 44.06%
NnN .80 86.83% 78.57% 38.54%
EnE .67 85.07% 71.98% 46.72%
weak → main
NPN .60 79.29% 53.24% 55.71%
NnN .74 82.94% 72.21% 38.61%
EnE .55 86.14% 51.71% 71.30%
Table 7: Summary: Annotation and Learning Results
5.2 Consensus-Labeled Student Turns
Following (Ang et al., 2002), we also explored consensus
labeling, both to increase our usable data set for predic-
tion, and to include the more difficult annotation cases.
For consensus labeling, the original annotators revisited
each originally disagreed case, and through discussion,
sought a consensus label. Agreement thus rose across
all analyses, to 99.12%; we discarded 8/453 turns for
lack of consensus. A summary of the consensus label-
ing across all 6 analyses discussed above is shown in
Table 8. The row and column labels are as above, e.g.
the NPN row represents turns consensus-labeled as nega-
tive/neutral/positive, first when all three minor classes are
conflated with neutral, and second when the weak minor
classes are conflated with their main counterparts.
minor → neu          weak → main
neg neu pos neg neu pos
NPN 99 321 25 119 265 61
neg nonneg neg nonneg
NnN 99 346 119 326
emo nonemo emo nonemo
EnE 124 321 180 265
Table 8: Consensus Labeling over Analyses
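Procedurally, consensus labeling only touches the originally disagreed turns; a minimal sketch follows, where the hypothetical `resolved` mapping stands in for the outcome of the annotators’ discussion.

from typing import Dict, List, Optional, Tuple

def consensus(ann1: List[str], ann2: List[str],
              resolved: Dict[int, str]) -> List[Tuple[int, Optional[str]]]:
    """Keep agreed labels; use the discussed label otherwise.

    Turns absent from `resolved` reached no consensus (8/453 here)
    and are marked None for discarding.
    """
    out = []
    for i, (a, b) in enumerate(zip(ann1, ann2)):
        out.append((i, a if a == b else resolved.get(i)))
    return out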
We performed our machine learning experiment on the
consensus data for all 6 analyses. A summary of our
results is shown in Table 9. A comparison of Tables
7-9 shows that for all of our evaluation metrics, our re-
sults decrease across all analyses when using consensus
data; similar findings were observed in (Ang et al., 2002).
While increasing our data set with more difficult exam-
ples decreases predictive ability, note that our consensus
results are still an improvement over the baseline.
Acc Base RI
minor → neutral
NPN 79.97% 72.14% 28.10%
NnN 84.97% 77.75% 32.45%
EnE 80.78% 72.14% 31.01%
weak → main
NPN 73.14% 59.55% 33.60%
NnN 81.88% 73.26% 32.24%
EnE 75.75% 59.55% 40.05%
Table 9: Predicting Consensus Labels
6 Extensions to the Analyses
6.1 Minor Emotion Classes
Our analyses so far distinguished only our 3 main emo-
tion classes; our 3 minor classes were always conflated
with one or the other of the main classes. In part, this
is because our minor labels were consistently employed
only later in the development of our scheme; in early ver-
sions, annotators optionally labeled the minor classes (in
the NOTES pane), for the purpose of post-annotation dis-
cussion. At present, only the last 5 of our 10 annotated
dialogues are consistently labeled with minor classes. Ta-
ble 10 shows a confusion matrix for the annotation of all
6 emotion classes for these 5 dialogues. Interannotator
agreement is 142/211 turns, or 67.30% (Kappa = 0.54).
Compared to Section 5, we see that this higher level of
granularity yields a lower level of agreement. However,
most disagreements fall adjacent to the diagonal, indicat-
ing that they are mostly differences in strength rather than
differences in polarity. The analyses in Section 5 investi-
gated various means of resolving these differences.
neg w. neg neut w. pos pos mix
neg 48 2 0 0 0 2
w. neg 6 10 3 2 2 0
neut 2 11 70 22 3 3
w. pos 0 1 1 9 2 0
pos 0 0 1 1 1 0
mix 1 1 2 1 0 4
Table 10: Confusion Matrix: All 6 Emotion Classes
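The strength-versus-polarity observation can be checked mechanically by placing the five scale labels on either side of neutral; the breakdown below is our own post-hoc check, with "mixed" set aside since it carries both polarities.

# Classify each disagreement in Table 10 as a difference in strength
# (same side of neutral) or in polarity (opposite sides).
SIGN = {"negative": -1, "weak negative": -1, "neutral": 0,
        "weak positive": 1, "positive": 1}

def disagreement_type(a: str, b: str) -> str:
    if a == b:
        return "agreement"
    if "mixed" in (a, b):
        return "mixed"                 # off the linear scale; handled separately
    if SIGN[a] * SIGN[b] < 0:
        return "polarity"              # opposite poles: substantive disagreement
    return "strength"                  # differs only in strength (or vs neutral)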
6.2 Specific Emotions
Our analyses in Section 5 did not consider the specific
emotion annotations in our EMOb pane. This is in part
because, as with our minor labels, our specific emotion
labels were only consistently employed when annotating
the last 5 of our 10 dialogues. If we consider only the
66 turns where both annotators agreed that the turn was
negative (weak or strong), and view multiple emotion la-
bels which overlap with single emotions as agreed (e.g.
sad+bored agrees with a sad or bored label), interannota-
tor agreement is 45/66 turns, or 68.18% (Kappa = 0.41).
The same analysis for the 13 positive turns yields 100%
agreement (Kappa = 1).
The labels we’ve included so far are those we’ve en-
countered in our human-human tutoring dialogues; we
expect to see some differences in the human-computer di-
alogues, as discussed in Section 6.3, and continue to em-
ploy the "other" label. In part, the decision about which
specific emotions to ultimately recognize in our system
depends on what we want the system to adapt to. This
in turn requires some understanding of how human tutors
adapt to different emotions. For example, perhaps our
tutor responds differently to anger, uncertainty, boredom
and confusion, but responds the same to most positive
emotions. We are currently investigating this in our an-
notated human-human tutoring dialogues.
6.3 Human-Computer Corpus
We have just begun annotating our corpus of human-
computer spoken tutoring dialogues; to date we have an-
notated 5 dialogues from 5 different students.
We have applied the 6 reliability analyses in this paper
to these annotations, and have found again that most dis-
agreements are simply differences in strength rather than
differences in polarity. Our best interannotator reliability
was found using the NnN, weak → main analysis (con-
trary to the human-human findings), which gave agree-
ment of 96/115 turns, or 83.48% (Kappa = 0.67).
The corpus example in Figure 4 (Appendix A) high-
lights differences between our human-human and human-
computer tutoring dialogues that potentially might impact
emotion annotation. First, both the average student turn
length in words, and the average number of student turns
per dialogue, are much shorter in the human-computer
than in the human-human dialogues. This means that
there is less information in the human-computer dia-
logues to make use of when judging expressed emotions.
Second, errors in speech and natural language process-
ing can have a significant effect on the student's emotional
state in the human-computer tutoring dialogues. Such
emotions don’t concern either the PHYS domain or the
TUT domain, and suggest that we might want to add a
third NLP domain if we want the system to respond to
these emotions differently. Relatedly, we already see fre-
quency differences across the human-human and human-
computer dialogues with respect to specific emotions, for
example an increased use of "irritated" in the human-
computer data. Finally, computer tutors are far less flexi-
ble than human tutors. This alone can affect student emo-
tional state, and furthermore it can limit how students
express their own emotional states. For example, in the
human-human dialogues we see more student initiative,
groundings, and references to prior problems.
7 Conclusions and Current Directions
In this paper we presented and analyzed our scheme
for annotating student emotional states in spoken tu-
toring dialogues. Our scheme distinguishes three main
(negative, neutral and positive) and three minor (weak
negative, mixed, and weak positive) emotion classes.
Our inter-annotator agreement is on par with prior emo-
tion annotation in other types of corpora. We used
consensus-labeling to resolve disagreements and increase
our dataset. Through further annotation and the use of
other inter-annotation metrics (Gwet, 2001), we will in-
vestigate how systematic disagreements can yield revi-
sions to our annotation scheme that improve reliability.
Our machine learning experiments have shown that our
main emotion categories can be predicted with a high
degree of accuracy. Although not presented here, F-
Measures (2 × precision × recall / (precision + recall))
for our experiments on agreed data ranged from 67% to
86%; in future work we
will more closely examine the tradeoff between recall and
precision when predicting our annotations. Our experi-
ments have also highlighted tradeoffs that can be made
between coding reliability, predictive accuracy, and an-
notation scheme granularity.
Finally, we presented initial results in annotating our
ITSPOKE human-computer tutoring corpus, and dis-
cussed differences from our human-human annotations.
This research on emotion annotation and prediction is a
first step towards extending the ITSPOKE computer tu-
toring dialogue system to predict and adapt to student
emotional states. Our next goal is to label human tutor
reactions to emotional student turns, in order to formu-
late adaptive strategies for ITSPOKE, and to determine
which of our six prediction tasks best triggers adaptation.
Acknowledgments
This research is supported by NSF Grant Nos. 9720359
and 0328431. We thank Kurt VanLehn and the
Why2-Atlas team, and Scott Silliman of ITSPOKE, for
system development and data collection.

References
G. Aist, B. Kort, R. Reilly, J. Mostow, and R. Pi-
card. 2002. Experimentally augmenting an intelli-
gent tutoring system with human-supplied capabilities:
Adding human-provided emotional scaffolding to an
automated reading tutor that listens. In Proc. of ITS.
J. Ang, R. Dhillon, A. Krupski, E. Shriberg, and A. Stol-
cke. 2002. Prosody-based automatic detection of an-
noyance and frustration in human-computer dialog. In
Proc. of ICSLP.
A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth.
2000. Desperately seeking emotions: Actors, wizards,
and human beings. In ISCA Workshop on Speech and
Emotion.
A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth.
2003. How to find trouble in communication. Speech
Communication, 40.
J. Carletta. 1996. Assessing agreement on classification
tasks: the kappa statistic. Computational Linguistics,
22(2), June.
R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis,
S. Kollias, W. Fellenz, and J. Taylor. 2001. Emotion
recognition in human-computer interaction. IEEE Sig-
nal Processing Magazine, 18:32–80, January.
Martha Evens. 2002. New questions for Circsim-Tutor.
Presentation at the 2002 Symposium on Natural Lan-
guage Tutoring, University of Pittsburgh.
K. Gwet. 2001. Handbook of Inter-Rater Reliability.
STATAXIS Publishing Company.
H. Holzapfel, C. Fuegen, M. Denecke, and A. Waibel.
2002. Integrating emotional cues into a framework for
dialogue management. In Proc. of ICMI.
B. Kort, R. Reilly, and R. W. Picard. 2001. An affec-
tive model of interplay between emotions and learn-
ing: Reengineering educational pedagogy - building a
learning companion. In Proc. of ICALT.
C.M. Lee, S. Narayanan, and R. Pieraccini. 2001.
Recognition of negative emotions from the speech sig-
nal. In Proc. of ASRU.
J. Liscombe, J. Venditti, and J. Hirschberg. 2003. Classi-
fying subject ratings of emotional speech using acous-
tic features. In Proc. of EuroSpeech.
D. Litman and K. Forbes. 2003. Recognizing emotion
from student speech in tutoring dialogues. In Proc. of
ASRU.
M. Pantic and L. J. M. Rothkrantz. 2003. Toward an
affect-sensitive multimodal human-computer interac-
tion. Proc. of IEEE, 91(9):1370–1390.
C. P. Rosé and V. Aleven. 2002. Proceedings of the ITS
2002 workshop on empirical methods for tutorial dia-
logue systems, June.
K. VanLehn, P. W. Jordan, C. Rosé, D. Bhembe,
M. Böttner, A. Gaydos, M. Makatchev, U. Pap-
puswamy, M. Ringenberg, A. Roque, S. Siler, R. Sri-
vastava, and R. Wilson. 2002. The architecture of
Why2-Atlas: A coach for qualitative physics essay
writing. In Proc. of ITS.