Detecting Emotion in Speech: Experiments in Three Domains
Jackson Liscombe
Columbia University
jaxin@cs.columbia.edu
Abstract
The goal of my proposed dissertation work
is to help answer two fundamental questions:
(1) How is emotion communicated in speech?
and (2) Does emotion modeling improve spo-
ken dialogue applications? In this paper I de-
scribe feature extraction and emotion classification
experiments I have conducted and plan
to conduct on three different domains: EPSaT,
HMIHY, and ITSpoke. In addition, I plan to
implement emotion modeling capabilities into
ITSpoke and evaluate the effectiveness of do-
ing so.
1 Introduction
The focus of my work is the expression of emotion in
human speech. As normally-functioning people, we are
each capable of vocally expressing emotion and aurally
recognizing the emotions of others. How often have you been
put off by the tone in someone’s voice or tickled others
with the humorous telling of a good story? Though we
as everyday people are intimately familiar with emotion,
we as scientists do not actually know precisely how it is
that emotion is conveyed in human speech. This is of spe-
cial concern to us as engineers of natural language tech-
nology; in particular, spoken dialogue systems. Spoken
dialogue systems enable users to interact with computer
systems via natural dialogue, as they would with human
agents. In my view, a current deficiency of state-of-the-
art spoken dialogue systems is that the emotional state of
the user is not modeled. This results in non-human-like
and even inappropriate behavior on the part of the spoken
dialogue system.
There are two central questions I would like to at least
partially answer with my dissertation research: (1) How
is emotion communicated in speech? and (2) Does emo-
tion modeling improve spoken dialogue applications? In
an attempt to answer the first question, I have adopted
the research paradigm of extracting features that charac-
terize emotional speech and applying machine learning
algorithms to determine the prediction accuracy of each
feature. With regard to the second research question, I
plan to implement an emotion modeler (one that detects
and responds to uncertainty and frustration) into an In-
telligent Tutoring System.
2 Completed Work
This section describes my current research on emotion
classification in three domains and forms the foundation
of my dissertation. For each domain, I have adopted an
experimental design wherein each utterance in a corpus
is annotated with one or more emotion labels, features
are extracted from these utterances, and machine learn-
ing experiments are run to determine emotion prediction
accuracy.
2.1 EPSaT
The publicly-available Emotional Prosody Speech and
Transcription corpus1 (EPSaT) comprises recordings of
professional actors reading short (four syllables each)
dates and numbers (e.g., ‘two-thousand-four’) with dif-
ferent emotional states. I chose a subset of 44 utterances
from 4 speakers (2 male, 2 female) from this corpus and
conducted a web-based survey to subjectively label each
utterance for each of 10 emotions, divided evenly for va-
lence. These emotions included the positive emotion cat-
egories: confident, encouraging, friendly, happy, inter-
ested; and the negative emotion categories: angry, anx-
ious, bored, frustrated, sad.
Several features were extracted from each utterance in
this corpus, each one designed to capture emotional con-
tent. Global acoustic-prosodic information (e.g., speak-
ing rate and minimum, maximum, and mean pitch and in-
tensity) has been well known since the 1960s and 1970s
1 LDC Catalog No.: LDC2002S28.
to convey emotion to some extent (e.g., (Davitz, 1964;
Scherer et al., 1972)). In addition to these features, I also
included linguistically meaningful prosodic information
in the form of ToBI labels (Beckman et al., 2005), as well
as the spectral tilt of the vowel in each utterance bearing
the nuclear pitch accent.
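To make this feature set concrete, here is a minimal sketch of how such global acoustic-prosodic statistics might be extracted. It assumes the Parselmouth Python wrapper around Praat; the tooling actually used in these experiments is not specified here.

```python
# Minimal sketch: global acoustic-prosodic features for one utterance,
# assuming the Parselmouth (Praat) library. Illustrative only.
import parselmouth

def global_features(wav_path):
    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch().selected_array['frequency']
    f0 = f0[f0 > 0]                                  # keep voiced frames only
    intensity = snd.to_intensity().values.flatten()
    return {
        'f0_min': f0.min(), 'f0_max': f0.max(), 'f0_mean': f0.mean(),
        'int_min': intensity.min(), 'int_max': intensity.max(),
        'int_mean': intensity.mean(),
        # crude speaking-rate proxy: voiced frames per second
        'voiced_frames_per_sec': len(f0) / snd.duration,
    }
```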
In order to evaluate the predictive power of each fea-
ture extracted from the EPSaT utterances, I ran machine
learning experiments using RIPPER, a rule-learning al-
gorithm. The EPSaT corpus was divided into training
(90%) and testing (10%) sets. A binary classification
scheme was adopted based on the observed ranking dis-
tributions from the perception survey: ‘not at all’ was
considered to be the absence of emotion x; all other ranks
were recorded as the presence of emotion x. Performance
accuracy varied with respect to emotion, but on average I
observed 75% prediction accuracy for any given emotion,
representing an average 22% improvement over chance
performance. The most predictive features included the
global acoustic-prosodic features, but interesting novel
findings emerged as well; most notably, significant correlation
was observed between negative emotions and pitch con-
tours ending in a plateau boundary tone, whereas positive
emotions correlated with the standard declarative phrasal
ending (in ToBI, these would be labeled as /H-L%/ and
/L-L%/, respectively). Further discussion of such find-
ings can be found in (Liscombe et al., 2003).
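For concreteness, the following sketch reproduces the shape of these experiments: a 90/10 split and binary presence/absence prediction. RIPPER itself is not readily available in scikit-learn, so a decision tree stands in for the rule learner; numbers will not match those reported above.

```python
# Sketch of the binary emotion classification setup (90/10 split).
# A scikit-learn decision tree stands in for the RIPPER rule learner.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def binary_emotion_experiment(X, y):
    # y[i] = 1 if emotion x is present (any rank above 'not at all'), else 0
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10,
                                              random_state=0)
    clf = DecisionTreeClassifier().fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```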
2.2 HMIHY
‘How May I Help You(SM)’ (HMIHY) is a natural lan-
guage human-computer spoken dialogue system devel-
oped at AT&T Research Labs. The system enables AT&T
customers to interact verbally with an automated agent
over the phone. Callers can ask for their account bal-
ance, help with AT&T rates and calling plans, explana-
tions of certain bill charges, or identification of num-
bers. Speech data collected from the deployed system
has been assembled into a corpus of human-computer
dialogues. The HMIHY corpus contains 5,690 com-
plete human-computer dialogues that collectively con-
tain 20,013 caller turns. Each caller turn in the corpus
was annotated with one of seven emotional labels: posi-
tive/neutral, somewhat frustrated, very frustrated, some-
what angry, very angry, somewhat other negative2, very
other negative. However, the distribution of the labels
was so skewed (73.1% were labeled as positive/neutral)
that the emotions were collapsed to negative and non-
negative.
In addition to the set of automatic acoustic-prosodic
features found to be useful for emotional classification of
the EPSaT corpus, the features I examined in the HMIHY
corpus were designed to exploit the discourse information
2 ‘Other negative’ refers to any emotion that is perceived
negatively but is neither anger nor frustration.
available in the domain of spontaneous human-machine
conversation. Transcriptive features (lexical items, filled
pauses, and non-speech human noises) were recorded as
features, as were the dialogue acts of each caller turn.
In addition, I included contextual features that were de-
signed to track the history of the previously mentioned
features over the course of the dialogue. Specifically,
contextual information included the rate of change of the
acoustic-prosodic features of the previous two turns plus
the transcriptive and pragmatic features of the previous
two turns as well.
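A hypothetical sketch of these contextual features follows; the relative-change formulation and feature names are illustrative assumptions rather than the exact definitions used in the experiments.

```python
# Sketch: contextual features as the rate of change of per-turn
# acoustic-prosodic values over the previous two turns. Illustrative.
def contextual_features(turns, i):
    """turns: list of per-turn feature dicts; i: index of current turn."""
    context = {}
    for lag in (1, 2):
        if i - lag < 0:
            continue
        prev = turns[i - lag]
        for name, value in turns[i].items():
            if prev.get(name):                   # skip missing/zero values
                # relative change w.r.t. the turn `lag` positions back
                context[f'{name}_delta{lag}'] = (value - prev[name]) / abs(prev[name])
    return context
```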
The corpus was divided into training (75%) and testing
(25%) sets. The machine learning algorithm employed
was BOOSTEXTER, an algorithm that forms a hypothesis
by combining the results of several iterations of weak-
learner decisions. Classification accuracy using the auto-
matic acoustic-prosodic features was recorded to be ap-
proximately 75%. The majority class baseline (always
guessing non-negative) was 73%. By adding the other
feature-sets one by one, prediction accuracy was itera-
tively improved, as described more fully in (Liscombe et
al., 2005b). Using all the features combined (acoustic-
prosodic, lexical, pragmatic, and contextual), the result-
ing classification accuracy was 79%, a healthy 8% im-
provement over baseline performance and a 5% improve-
ment over the automatic acoustic-prosodic features alone.
2.3 ITSpoke
This section describes more recent research I have been
conducting with the University of Pittsburgh’s Intelli-
gent Tutoring Spoken Dialogue System (ITSpoke) (Lit-
man and Silliman, 2004). The goal of this research is to
wed spoken language technology with instructional tech-
nology in order to promote learning gains by enhanc-
ing communication richness. ITSpoke is built upon the
Why2-Atlas tutoring back-end (VanLehn et al., 2002), a
text-based Intelligent Tutoring System designed to tutor
students in the domain of qualitative physics using natural
language interaction. Several corpora have been recorded
for development of ITSpoke, though most of the work
presented here involves tutorial data between a student
and human tutor. To date, we have labeled the human-
human corpus for anger, frustration, and uncertainty.
As this work is an extension of previous work, I chose
to extract most of the same features I had extracted from
the EPSaT and HMIHY corpora. Specifically, I extracted
the same set of automatic acoustic-prosodic features, as
well as contextual features measuring the rate of change
of acoustic-prosodic features of past student turns. A
new feature set was introduced as well, which I refer
to as the breath-group feature set, and which is an auto-
matic method for segmenting utterances into intonation-
ally meaningful units by identifying pauses using back-
ground noise estimation. The breath group feature set
comprises the number of breath-groups in each turn, the
pause time, and global acoustic-prosodic features calcu-
lated for the first, last, and longest breath-group in each
student turn.
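The following is a rough sketch of such pause-based segmentation. The tenth-percentile noise floor, the energy margin, and the minimum pause length are all assumptions; the actual background noise estimation may differ.

```python
# Sketch: segment a turn into breath-groups by finding pauses, using a
# background-noise estimate to set the speech/silence threshold.
import numpy as np

def breath_groups(energy, frame_rate, min_pause=0.3):
    """energy: per-frame RMS energy; frame_rate: frames per second."""
    noise_floor = np.percentile(energy, 10)     # background-noise estimate
    speech = energy > 2.0 * noise_floor         # assumed margin over the floor
    min_gap = int(min_pause * frame_rate)
    groups, start, gap = [], None, 0
    for i, is_speech in enumerate(speech):
        if is_speech:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:                  # pause long enough: close group
                groups.append((start, i - gap))
                start, gap = None, 0
    if start is not None:
        groups.append((start, len(speech) - 1))
    return groups                               # (first, last) frame per group
```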
I used the WEKA machine learning software package
to classify whether a student answer was perceived to be
uncertain, certain, or neutral3 in the ITSpoke human-
human corpus. As a predictor, C4.5, a decision-tree
learner, was boosted with AdaBoost, a learning strategy
similar to the one presented in Section 2.2. The data
were randomly split into a training set (90%) and a test-
ing set (10%). The automatic acoustic-prosodic features
performed at 75% accuracy, a relative improvement of
13% over the baseline performance of always guessing
neutral. By adding additional feature-sets (contextual
and breath-group information), I observed an improved
prediction accuracy of 77%, indicating that breath-
group features are useful. I refer the reader to (Liscombe
et al., 2005a) for in-depth implications and further analy-
sis of these results. In the immediate future, I will extract
features previously mentioned in Section 2.2 as well as
the exploratory features I will discuss in the following
section.
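For concreteness, a sketch of this boosted decision-tree setup follows; scikit-learn's CART trees and AdaBoostClassifier stand in for C4.5 and the WEKA implementation, so results will not match those reported.

```python
# Sketch: AdaBoost over shallow decision trees (CART standing in for
# C4.5) with a 90/10 random split, mirroring the setup described above.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def boosted_tree_experiment(X, y):
    # y: 'uncertain', 'certain', or 'neutral' label per student turn
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10,
                                              random_state=0)
    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                             n_estimators=100).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```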
3 Work-in-progress
In this section I describe research I have begun to con-
duct and plan to complete in the coming year, as agreed
upon in February 2006 by my dissertation committee. I
will explore features that are not well studied in emotion
classification research, primarily pitch contour and voice
quality approximation. Furthermore, I will outline how I
plan to implement and evaluate an emotion detection and
response module into ITSpoke.
3.1 Pitch Contour Clustering
The global acoustic-prosodic features used in most emo-
tion prediction studies capture meaningful prosodic vari-
ation, but are not capable of describing the linguisti-
cally meaningful intonational behavior of an utterance.
Though phonological labeling methods exist, such as
ToBI, annotation of this sort is time-consuming and must
be done manually. Instead, I propose an automatic al-
gorithm that directly compares pitch contours and then
groups them into classes based on abstract form. Specif-
ically, I intend to use partition clustering to define a
disjoint set of similar prosodic contour types over our
data. I hypothesize that the resultant clusters will be the-
oretically meaningful and useful for emotion modeling.
The similarity metric used to compare two contours will
be edit distance, calculated using dynamic time warping
techniques. Essentially, the algorithm finds the best fit
between two contours by stretching and shrinking each
3 With respect to certainness.
contour as necessary. The score of a comparison is calcu-
lated as the sum of the normalized real-valued distances
between mapped points in the contours.
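A minimal sketch of this comparison follows; the absolute-difference local cost and the length normalization are assumptions about details not fixed above.

```python
# Sketch: dynamic-time-warping distance between two pitch contours.
# Contours are assumed to be 1-D arrays (e.g., z-normalized F0 tracks).
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local point distance
            D[i, j] = cost + min(D[i - 1, j],        # stretch contour b
                                 D[i, j - 1],        # stretch contour a
                                 D[i - 1, j - 1])    # match both points
    return D[n, m] / (n + m)                         # length-normalized score
```

The resulting pairwise distance matrix can then be handed to any partition clusterer that accepts precomputed distances (e.g., a k-medoids implementation).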
3.2 Voice Quality
Voice quality is a term used to describe a perceptual col-
oring of the acoustic speech signal and is generally be-
lieved to play an important role in the vocal communica-
tion of emotion. However, it has rarely been used in au-
tomatic classification experiments because the exact pa-
rameters defining each quality of voice (e.g., creaky and
breathy) are still largely unknown. Yet, some researchers
believe much of what constitutes voice quality can be
described using information about glottis excitation pro-
duced by the vocal folds, most commonly referred to
as the glottal pulse waveform. While there are ways of
directly measuring the glottal pulse waveform, such as
with an electroglottograph, these techniques are too inva-
sive for practical purposes. Therefore, the glottal pulse
waveform is usually approximated by inverse filtering of
the speech signal. I will derive glottal pulse waveforms
from the data using an algorithm that automatically iden-
tifies voiced regions of speech, obtains an estimate of the
glottal flow derivative, and then represents this using the
Liljencrants-Fant parametric model. The final result is a
glottal pulse waveform, from which features can be ex-
tracted that describe the shape of this waveform, such as
the Open and Skewing Quotients.
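As a rough illustration of the inverse-filtering step (stopping short of the Liljencrants-Fant fit), the sketch below uses linear predictive coding to approximate the vocal-tract filter; the LPC order and whole-utterance analysis are simplifying assumptions.

```python
# Sketch: crude glottal source estimate by LPC inverse filtering.
import librosa
import numpy as np
from scipy.signal import lfilter

def glottal_flow_estimate(wav_path, order=16):
    y, sr = librosa.load(wav_path, sr=16000)
    a = librosa.lpc(y, order=order)       # vocal-tract (all-pole) estimate
    residual = lfilter(a, [1.0], y)       # inverse-filter the speech signal
    # The residual approximates the glottal flow derivative; integrating
    # it gives a rough flow waveform from which shape features such as
    # the Open Quotient could be measured.
    return np.cumsum(residual), sr
```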
3.3 Implementation
The motivating force behind much of the research I have
presented herein is the common assumption in the re-
search community that emotion modeling will improve
spoken dialogue systems. However, there is little to no
empirical evidence testing this claim (see Pon-Barry et al.
(in publication) for a notable exception). For this rea-
son, I will implement functionality for detecting and re-
sponding to student emotion in ITSpoke (the Intelligent
Tutoring System described in Section 2.3) and analyze
the effect it has on student behavior, hopefully showing
(quantitatively) that doing so improves the system’s ef-
fectiveness.
Research has shown that frustrated students learn less
than non-frustrated students (Lewis and Williams, 1989)
and that human tutors respond differently in the face of
student uncertainty than they do when presented with cer-
tainty (Forbes-Riley and Litman, 2005). These findings
indicate that emotion plays an important role in Intelli-
gent Tutoring Systems. Though I do not have the ability
to alter the discourse flow of ITSpoke, I will insert active
listening prompts on the part of ITSpoke when the sys-
tem has detected either frustration or uncertainty. Active
listening is a technique that has been shown to defuse
negative emotion in general (Klein et al., 2002). I hy-
pothesize that defusing user frustration and uncertainty
will improve ITSpoke.
After collecting data from an emotion-enabled IT-
Spoke I will compare evaluation metrics with those of
a control study conducted with the original ITSpoke sys-
tem. One such metric will be learning gain, the differ-
ence between student pre- and post-test scores and the
standard metric for quantifying the effectiveness of edu-
cational devices. Since learning gain is a crude measure
of academic achievement and may overlook behavioral
and cognitive improvements, I will explore other metrics
as well, such as: the amount of time taken for the stu-
dent to produce a correct answer, the amount of negative
emotional states expressed, the quality and correctness of
answers, the willingness to continue, and subjective post-
tutoring assessments.
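For clarity, here is a small sketch of the learning gain computation; the normalized variant shown is a common correction for ceiling effects and is an assumption here, not necessarily the metric that will be used.

```python
# Sketch: learning gain from pre- and post-test scores.
def learning_gain(pre, post, max_score=100.0):
    raw = post - pre                      # the standard difference metric
    # normalized gain corrects for students who start near the ceiling
    normalized = raw / (max_score - pre) if pre < max_score else 0.0
    return raw, normalized
```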
4 Contributions
I see the contributions of my dissertation to be the extent
to which I have helped to answer the questions I posed at
the outset of this paper.
4.1 How is emotion communicated in speech?
The experimental design of extracting features from spo-
ken utterances and conducting machine learning experi-
ments to predict emotion classes identifies features im-
portant for the vocal communication of emotion. Most of
the features I have described here are well established in
the research community: statistical measurements of fun-
damental frequency and energy, for example. However, I
have also described more experimental features as a way
of improving upon the state-of-the-art in emotion mod-
eling. These exploratory features include breath-group
segmentation, contextual information, pitch contour clus-
tering, and voice quality estimation. In addition, explor-
ing three domains will allow me to comparatively ana-
lyze the results, with the ultimate goal of identifying uni-
versal qualities of spoken emotions as well as those that
may be particular to specific domains. The findings of such
a comparative analysis will be of practical benefit to fu-
ture system builders and to those attempting to define a
universal model of human emotion alike.
4.2 Does emotion modeling help?
By collecting data of students interacting with an
emotion-enabled ITSpoke, I will be able to report quan-
titatively the results of emotion modeling in a spoken di-
alogue system. Though this is the central motivation for
most researchers in this field, there is currently no defini-
tive evidence either supporting or refuting this claim.
References
M. E. Beckman, J. Hirschberg, and S. Shattuck-Hufnagel,
2005. Prosodic Typology: The Phonology of Intona-
tion and Phrasing, chapter 2: The original ToBI sys-
tem and the evolution of the ToBI framework. Oxford
University Press.
J. R. Davitz, 1964. The Communication of Emotional
Meaning, chapter 8: Auditory Correlates of Vocal Ex-
pression of Emotional Feeling, pages 101–112. New
York: McGraw-Hill.
Kate Forbes-Riley and Diane J. Litman. 2005. Using
bigrams to identify relationships between student cer-
tainness states and tutor responses in a spoken dialogue
corpus. In Proceedings of the 6th SIGdial Workshop on
Discourse and Dialogue, Lisbon, Portugal.
J. Klein, Y. Moon, and R. W. Picard. 2002. This com-
puter responds to user frustration: Theory, design, and
results. Interacting with Computers, 14(2):119–140,
February.
V. E. Lewis and R. N. Williams. 1989. Mood-congruent
vs. mood-state-dependent learning: Implications for a
view of emotion. D. Kuiken (Ed.), Mood and Mem-
ory: Theory, Research, and Applications, Special Is-
sue of the Journal of Social Behavior and Personality,
4(2):157–171.
Jackson Liscombe, Jennifer Venditti, and Julia
Hirschberg. 2003. Classifying subject ratings
of emotional speech using acoustic features. In
Proceedings of Eurospeech, Geneva, Switzerland.
Jackson Liscombe, Julia Hirschberg, and Jennifer Ven-
ditti. 2005a. Detecting certainness in spoken tutorial
dialogues. In Proceedings of Interspeech, Lisbon, Por-
tugal.
Jackson Liscombe, Giuseppe Riccardi, and Dilek
Hakkani-Tür. 2005b. Using context to improve emo-
tion detection in spoken dialogue systems. In Proceed-
ings of Interspeech, Lisbon, Portugal.
Diane Litman and Scott Silliman. 2004. ITSpoke: An in-
telligent tutoring spoken dialogue system. In Proceed-
ings of the 4th Meeting of HLT/NAACL (Companion
Proceedings), Boston, MA, May.
Heather Pon-Barry, Karl Schultz, Elizabeth Owen Bratt,
Brady Clark, and Stanley Peters. In publication. Re-
sponding to student uncertainty in spoken tutorial dia-
logue systems. International Journal of Artificial In-
telligence in Education (IJAIED).
K. R. Scherer, J. Koivumaki, and R. Rosenthal. 1972.
Minimal cues in the vocal communication of affect:
Judging emotions from content-masked speech. Jour-
nal of Psycholinguistic Research, 1:269–285.
K. VanLehn, P. Jordan, and C. P. Rose. 2002. The archi-
tecture of Why2-Atlas: A coach for qualitative physics
essay writing. In Proceedings of the Intelligent Tutor-
ing Systems Conference, Biarritz, France.
