Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 193–200,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Dependencies between Student State and Speech Recognition  
Problems in Spoken Tutoring Dialogues 
 
 
Mihai Rotaru 
University of Pittsburgh 
Pittsburgh, USA 
mrotaru@cs.pitt.edu 
Diane J. Litman 
University of Pittsburgh 
Pittsburgh, USA 
litman@cs.pitt.edu 
 
 
 
Abstract 
Speech recognition problems are a reality 
in current spoken dialogue systems. In 
order to better understand these phenom-
ena, we study dependencies between 
speech recognition problems and several 
higher level dialogue factors that define 
our notion of student state: frustra-
tion/anger, certainty and correctness. We 
apply Chi Square (χ²) analysis to a cor-
pus of speech-based computer tutoring
dialogues to discover these dependencies 
both within and across turns. Significant 
dependencies are combined to produce 
interesting insights regarding speech rec-
ognition problems and to propose new 
strategies for handling these problems. 
We also find that tutoring, as a new do-
main for speech applications, exhibits in-
teresting tradeoffs and new factors to 
consider for spoken dialogue design. 
1 Introduction 
Designing a spoken dialogue system involves 
many non-trivial decisions. One factor that the 
designer has to take into account is the presence 
of speech recognition problems (SRP). Previous 
work (Walker et al., 2000) has shown that the 
number of SRP is negatively correlated with 
overall user satisfaction. Given the negative im-
pact of SRP, there has been a lot of work in try-
ing to understand this phenomenon and its impli-
cations for building dialogue systems. Most of 
the previous work has focused on lower level 
details of SRP: identifying components responsi-
ble for SRP (acoustic model, language model, 
search algorithm (Chase, 1997)) or prosodic 
characterization of SRP (Hirschberg et al., 2004). 
We extend previous work by analyzing the re-
lationship between SRP and higher level dia-
logue factors. Recent work has shown that dia-
logue design can benefit from several higher 
level dialogue factors: dialogue acts (Frampton 
and Lemon, 2005; Walker et al., 2001), prag-
matic plausibility (Gabsdil and Lemon, 2004). 
Also, it is widely believed that user emotions, as 
another example of a higher level factor, interact
with SRP but, currently, there is little hard evi-
dence to support this intuition. We perform our 
analysis on three high level dialogue factors: 
frustration/anger, certainty and correctness. Frus-
tration and anger have been observed as the most 
frequent emotional class in many dialogue sys-
tems (Ang et al., 2002) and are associated with a 
higher word error rate (Bulyko et al., 2005). For 
this reason, we use the presence of emotions like 
frustration and anger as our first dialogue factor. 
Our other two factors are inspired by another 
contribution of our study: looking at speech-
based computer tutoring dialogues instead of 
more commonly used information retrieval dia-
logues. Implementing spoken dialogue systems 
in a new domain has shown that many practices 
do not port well to the new domain (e.g. confir-
mation of long prompts (Kearns et al., 2002)). 
Tutoring, as a new domain for speech applica-
tions (Litman and Forbes-Riley, 2004; Pon-Barry 
et al., 2004), brings forward new factors that can 
be important for spoken dialogue design. Here 
we focus on certainty and correctness. Both fac-
tors have been shown to play an important role in 
the tutoring process (Forbes-Riley and Litman, 
2005; Liscombe et al., 2005). 
A common practice in previous work on emo-
tion prediction (Ang et al., 2002; Litman and 
Forbes-Riley, 2004) is to transform an initial 
finer level emotion annotation (five or more la-
bels) into a coarser level annotation (2-3 labels). 
We wanted to understand if this practice can im-
pact the dependencies we observe from the data. 
To test this, we combine our two emotion¹ factors
(frustration/anger and certainty) into a binary
emotional/non-emotional annotation.
To understand the relationship between SRP 
and our three factors, we take a three-step ap-
proach. In the first step, dependencies between 
SRP and our three factors are discovered using 
the Chi Square (χ²) test. Similar analyses on hu-
man-human dialogues have yielded interesting
insights about human-human conversations 
(Forbes-Riley and Litman, 2005; Skantze, 2005). 
In the second step, significant dependencies are 
combined to produce interesting insights regard-
ing SRP and to propose strategies for handling 
SRP. Validating these strategies is the purpose of 
the third step. In this paper, we focus on the first 
two steps; the third step is left as future work.  
Our analysis produces several interesting in-
sights and strategies which confirm the utility of 
the proposed approach. With respect to insights, 
we show that user emotions interact with SRP. 
We also find that incorrect/uncertain student 
turns have more SRP than expected. In addition, 
we find that the emotion annotation level affects 
the interactions we observe from the data, with 
finer-level emotions yielding more interactions 
and insights. 
In terms of strategies, our data suggests that 
favoring misrecognitions over rejections (by 
lowering the rejection threshold) might be more 
beneficial for our tutoring task – at least in terms 
of reducing the number of emotional student 
turns. Also, as a general design practice in
spoken tutoring applications, we find an interest-
ing tradeoff between the pedagogical value of 
asking difficult questions and the system’s ability 
to recognize the student answer. 
2 Corpus 
The corpus analyzed in this paper consists of 95 
experimentally obtained spoken tutoring dia-
logues between 20 students and our system 
ITSPOKE (Litman and Forbes-Riley, 2004), a 
speech-enabled version of the text-based WHY2 
conceptual physics tutoring system (VanLehn et 
al., 2002). When interacting with ITSPOKE, stu-
dents first type an essay answering a qualitative 
physics problem using a graphical user interface. 
ITSPOKE then engages the student in spoken dia-
logue (using speech-based input and output) to 
correct misconceptions and elicit more complete 
                                                 
¹ We use the term “emotion” loosely to cover both affects
and attitudes that can impact student learning.
explanations, after which the student revises the 
essay, thereby ending the tutoring or causing an-
other round of tutoring/essay revision. For rec-
ognition, we use the Sphinx2 speech recognizer 
with stochastic language models. Because speech 
recognition is imperfect, after the data was col-
lected, each student utterance in our corpus was 
manually transcribed by a project staff member. 
An annotated excerpt from our corpus is shown 
in Figure 1 (punctuation added for clarity). The 
excerpt shows both what the student said (the
STD labels) and what ITSPOKE recognized (the 
ASR labels). The excerpt is also annotated with 
concepts that will be described next. 
2.1 Speech Recognition Problems (SRP) 
One form of SRP is the Rejection. Rejections 
occur when ITSPOKE is not confident enough in 
the recognition hypothesis and asks the student 
to repeat (Figure 1, STD₃,₄). For our χ² analysis,
we define the REJ variable with two values: Rej 
(a rejection occurred in the turn) and noRej (no 
rejection occurred in the turn). Not surprisingly, 
ITSPOKE also misrecognized some student turns. 
When ITSPOKE heard something different than 
what the student actually said but was confident 
in its hypothesis, we call this an ASR Misrecog-
nition (a binary version of the commonly used 
Word Error Rate) (Figure 1, STD₁,₂). Similarly,
we define the ASR MIS variable with two val-
ues: AsrMis and noAsrMis. 
Semantic accuracy is more relevant for dia-
logue evaluation, as it does not penalize for word 
errors that are unimportant to overall utterance 
interpretation. In the case of form-based informa-
tion access spoken dialogue systems, computing 
semantic accuracy is straightforward (i.e. con-
cept accuracy = percentage of correctly recog-
nized concepts). In contrast, in the tutoring do-
main there are no clear forms with slots to be 
filled. We base our semantic accuracy on the 
“correctness” measure of the student turn. For 
each student turn, ITSPOKE interprets it and la-
bels its correctness with regard to whether the 
student correctly answered the tutor question (see 
the labels between square brackets in Figure 1). 
We define Semantic Misrecognition as cases 
where ITSPOKE was confident in its recognition 
hypothesis and the correctness interpretation of 
the recognition hypothesis is different from the 
correctness interpretation of the manual tran-
script (Figure 1, STD₁). Similarly, we define the
SEM MIS variable with two values: SemMis 
and noSemMis. The top part of Table 1 lists the 
distribution for our three SRP variables. 
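The three SRP variables can be summarized as a small labeling routine. The following is a minimal sketch rather than the ITSPOKE implementation: the `Turn` record, its field names and the use of exact string comparison between transcript and hypothesis are assumptions made for illustration, and the correctness labels are taken as given.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Hypothetical record for one student turn (field names are assumptions)."""
    transcript: str        # manual transcript of what the student said
    hypothesis: str        # recognizer output
    rejected: bool         # True if the system rejected the hypothesis
    crct_transcript: str   # system correctness label for the transcript
    crct_hypothesis: str   # system correctness label for the hypothesis


def srp_labels(turn: Turn) -> dict:
    """Return the REJ, ASR MIS and SEM MIS values for one student turn."""
    confident = not turn.rejected
    # ASR misrecognition: the system was confident but heard something else.
    asr_mis = confident and turn.hypothesis != turn.transcript
    # Semantic misrecognition: the system was confident but the correctness
    # interpretation of the hypothesis differs from that of the transcript.
    sem_mis = confident and turn.crct_hypothesis != turn.crct_transcript
    return {
        "REJ": "Rej" if turn.rejected else "noRej",
        "ASR MIS": "AsrMis" if asr_mis else "noAsrMis",
        "SEM MIS": "SemMis" if sem_mis else "noSemMis",
    }


# Turns 1-3 of Figure 1. The correctness labels of the rejected turn are
# not shown in the figure, so a placeholder is used for them.
std1 = Turn("third of a one", "they're it but a one", False, "correct", "incorrect")
std2 = Turn("weight", "great", False, "incorrect", "incorrect")
std3 = Turn("newton's second law", "the cancel the fire", True, "n/a", "n/a")

for name, turn in [("STD1", std1), ("STD2", std2), ("STD3", std3)]:
    print(name, srp_labels(turn))
```

Under these assumptions the first turn is both an AsrMis and a SemMis, the second is an AsrMis only, and the third is a Rej, matching the annotations in Figure 1.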
3 Student State Annotation 
Our definition of student state includes annota-
tion on three dimensions (our three dialogue fac-
tors): Frustration/Anger/Hyperarticulation, Cer-
tainty and Correctness (see Table 1). The first 
two dimensions capture the student emotions and 
the last one the student correctness. 
ITSPOKE: Good. Apply the same force to an object of
    mass M3, which is three times the mass of M1.
    What is the magnitude M3's acceleration in terms of
    A1 [acceleration of the object M1]?
STD₁: third of a one (CERT=Uncertain, CRCT=C) [correct]
ASR₁: they're it but a one (AsrMis, SemMis) [incorrect]
ITSPOKE: The acceleration of M3 [...] would be one-
    third the acceleration of M1. [...] In general for the
    same force what remains constant?
STD₂: weight (CRCT=I) [incorrect]
ASR₂: great (AsrMis) [incorrect]
ITSPOKE: Well... The product of the mass of a body
    and its acceleration remains constant for a given
    force. What is the relationship between a force acting
    on the box, the mass of the box, and its acceleration?
STD₃: newton's second law (CERT=Uncertain, CRCT=PC)
ASR₃: the cancel the fire (Rej)
ITSPOKE: Could you please repeat that?
STD₄: newton's second law (FAH=FrAng, CRCT=PC)
ASR₄: newton second long (Rej)
Figure 1: Human-Computer Dialogue Excerpt
The Frustration/Anger/Hyperarticulation 
dimension captures the perceived negative stu-
dent emotional response to the interaction with 
the system. Three labels were used to annotate 
this dimension: frustration-anger, hyperarticula-
tion and neutral. Similar to (Ang et al., 2002), 
because frustration and anger can be difficult to 
distinguish reliably, they were collapsed into a 
single label: frustration-anger (Figure 1, STD₄).
Often, frustration and anger are prosodically
marked and in many cases the prosody used is
consistent with hyperarticulation (Ang et al., 
2002). For this reason we included in this dimen-
sion the hyperarticulation label (even though hy-
perarticulation is not an emotion but a state). We 
used the hyperarticulation label for turns that were
hyperarticulated but in which no frustration or
anger was perceived. For our interaction
experiments we define the FAH variable with 
three values: FrAng (frustration-anger), Hyp 
(hyperarticulation) and Neutral. 
The Certainty dimension captures the per-
ceived student reaction to the questions asked by 
our computer tutor and her overall reaction to the 
tutoring domain (Liscombe et al., 2005). 
Forbes-Riley and Litman (2005) show that student
certainty interacts with a human tutor’s dialogue
decision process (i.e. the choice of feedback).
Four labels were used for this dimension:
certain, uncertain (e.g. Figure 1, STD₁), mixed
and neutral. In a small number of turns, both cer-
tainty and uncertainty were expressed and these 
turns were labeled as mixed (e.g. the student was 
certain about a concept, but uncertain about an-
other concept needed to answer the tutor’s ques-
tion). For our interaction experiments we define 
the CERT variable with four values: Certain, 
Uncertain, Mixed and Neutral. 
 
Variable  Values      Student turns (2334)
Speech recognition problems
ASR MIS   AsrMis      25.4%
          noAsrMis    74.6%
SEM MIS   SemMis      5.7%
          noSemMis    94.3%
REJ       Rej         7.0%
          noRej       93.0%
Student state
FAH       FrAng       9.9%
          Hyp         2.1%
          Neutral     88.0%
CERT      Certain     41.3%
          Uncertain   19.1%
          Mixed       2.4%
          Neutral     37.3%
CRCT      C           63.3%
          I           23.3%
          PC          6.2%
          UA          7.1%
EnE       Emotional   64.8%
          Neutral     35.2%
Table 1: Variable distributions in our corpus.
To test the impact of the emotion annotation 
level, we define the Emotional/Non-Emotional 
annotation based on our two emotional dimen-
sions: neutral turns on both the FAH and the 
CERT dimension are labeled as neutral²; all other
turns were labeled as emotional. Consequently, 
we define the EnE variable with two values: 
Emotional and Neutral. 
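The collapsing rule just described is simple enough to state as a function. This is a minimal sketch; the function name `ene_label` and the string encoding of the labels are assumptions made for illustration.

```python
def ene_label(fah: str, cert: str) -> str:
    """Collapse the FAH and CERT labels of one turn into the binary EnE label.

    A turn is Neutral only if it is neutral on both dimensions; any other
    combination (including a hyperarticulated turn) counts as Emotional.
    """
    if fah == "Neutral" and cert == "Neutral":
        return "Neutral"
    return "Emotional"


print(ene_label("Neutral", "Neutral"))    # neutral on both dimensions
print(ene_label("Hyp", "Neutral"))        # hyperarticulated turn
print(ene_label("Neutral", "Uncertain"))  # neutral FAH, non-neutral CERT
```

Note that a turn labeled Certain on the CERT dimension is counted as Emotional under this rule, which is one reason the Emotional class is large in Table 1.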
Correctness is also an important factor of the 
student state. In addition to the correctness labels 
assigned by ITSPOKE (recall the definition of 
SEM MIS), each student turn was manually an-
notated by a project staff member in terms of
its physics-related correctness. Our annotator
used the human transcripts and his physics 
knowledge to label each student turn for various 
                                                 
² To be consistent with our previous work, we label hyperar-
ticulated turns as emotional even though hyperarticulation is
not an emotion.
degrees of correctness: correct, partially correct, 
incorrect and unable to answer. Our system can 
ask the student to provide multiple pieces of in-
formation in her answer (e.g. the question “Try 
to name the forces acting on the packet. Please, 
specify their directions.” asks for both the names 
of the forces and their direction). If the student
answer was correct and contained all pieces of
information, it was labeled as correct (e.g. “gravity,
down”). The partially correct label was used
for turns where part of the answer was correct
but the rest was either incorrect (e.g. “gravity,
up”) or omitted (e.g. “gravity”). Turns that were
completely incorrect (e.g. “no forces”) were la-
beled as incorrect. Turns where the students did 
not answer the computer tutor’s question were 
labeled as “unable to answer”. In these turns the 
student used either variants of “I don’t know” or 
simply did not say anything. For our interaction 
experiments we defined the CRCT variable with 
four values: C (correct), I (incorrect), PC (par-
tially correct) and UA (unable to answer). 
Please note that our definition of student state 
is from the tutor’s perspective. As we mentioned 
before, our emotion annotation is for perceived 
emotions. Similarly, the notion of correctness is 
from the tutor’s perspective. For example, the 
student might think she is correct but, in reality, 
her answer is incorrect. This correctness should 
be contrasted with the correctness used to define 
SEM MIS. The SEM MIS correctness uses 
ITSPOKE’s language understanding module ap-
plied to recognition hypothesis or the manual 
transcript, while the student state’s correctness 
uses our annotator’s language understanding. 
All our student state annotations are at the turn 
level and were performed manually by the same 
annotator. While an inter-annotator agreement 
study is the best way to test the reliability of our 
two emotional annotations (FAH and CERT), 
our experience with annotating student emotions 
(Litman and Forbes-Riley, 2004) has shown that 
this type of annotation can be performed reliably. 
Given the general importance of the student’s 
uncertainty for tutoring, a second annotator has 
been commissioned to annotate our corpus for 
the presence or absence of uncertainty. This an-
notation can be directly compared with a binary 
version of CERT: Uncertain+Mixed versus Cer-
tain+Neutral. The comparison yields an agree-
ment of 90% with a Kappa of 0.68. Moreover, if 
we rerun our study on the second annotation, we 
find similar dependencies. We are currently 
planning to perform a second annotation of the 
FAH dimension to validate its reliability. 
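The agreement comparison above can be sketched as follows: the four-valued CERT labels are first mapped to the binary version (Uncertain+Mixed versus Certain+Neutral), and then raw agreement and Cohen's Kappa are computed against the second annotation. The label sequences below are hypothetical toy data, not taken from our corpus, and the function names are illustrative.

```python
from collections import Counter


def binary_cert(label: str) -> str:
    """Map a CERT label to the binary uncertainty annotation."""
    return "uncertain" if label in {"Uncertain", "Mixed"} else "other"


def agreement_and_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's Kappa) for two label sequences."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently according to their own marginals.
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return observed, 1.0
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations for ten turns.
first = ["Certain", "Uncertain", "Mixed", "Neutral", "Uncertain",
         "Certain", "Neutral", "Certain", "Uncertain", "Neutral"]
second = ["other", "uncertain", "uncertain", "other", "other",
          "other", "other", "other", "uncertain", "other"]

agreement, kappa = agreement_and_kappa([binary_cert(x) for x in first], second)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Kappa discounts the agreement that would be expected by chance given each annotator's label distribution, which is why it is lower than the raw agreement when one class dominates.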
We believe that our correctness annotation 
(CRCT) is reliable due to the simplicity of the 
task: the annotator uses his language understand-
ing to match the human transcript to a list of cor-
rect/incorrect answers. When we compared this 
annotation with the correctness assigned by 
ITSPOKE on the human transcript, we found an 
agreement of 90% with a Kappa of 0.79. 
4 Identifying dependencies using χ²
To discover the dependencies between our variables,
we apply the χ² test. We illustrate our
analysis method on the interaction between certainty
(CERT) and rejection (REJ). The χ² value
assesses whether the differences between observed
and expected counts are large enough to
conclude a statistically significant dependency
between the two variables (Table 2, last column).
For Table 2, which has 3 degrees of freedom
((4-1)*(2-1)), the critical χ² value at p<0.05 is 7.81.
Since the observed χ² value (11.45) exceeds this
critical value, we conclude that there is a statistically
significant dependency between the student
certainty in a turn and the rejection of that turn.
Combination        Sign  Obs.  Exp.  χ²
CERT – REJ                           11.45
Certain – Rej      -     49    67    9.13
Uncertain – Rej    +     43    31    6.15
Table 2: CERT – REJ interaction.
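The significance decision for the overall CERT – REJ interaction can be checked numerically. The sketch below computes the degrees of freedom of a contingency table and the upper-tail probability of the χ² distribution using only the Python standard library; the function names are illustrative, and the closed-form recurrence is a standard identity rather than anything specific to our analysis.

```python
import math


def degrees_of_freedom(n_rows: int, n_cols: int) -> int:
    """Degrees of freedom of an n_rows x n_cols contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Uses the closed forms for df=1 and df=2 and the recurrence
    Q(x; k+2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        k, q = 1, math.erfc(math.sqrt(half))
    else:
        k, q = 2, math.exp(-half)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


df = degrees_of_freedom(4, 2)         # CERT has 4 values, REJ has 2
p_at_critical = chi2_sf(7.81, df)     # tail probability at the critical value
p_overall = chi2_sf(11.45, df)        # tail probability of the Table 2 value
print(df, round(p_at_critical, 3), p_overall < 0.05)
```

The tail probability at 7.81 comes out at roughly 0.05, and the value reported in Table 2 falls well below that threshold.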
If either of the two variables involved in a significant
dependency has more than 2 possible
values, we can look more deeply into this overall
interaction by investigating how particular values
interact with each other. To do that, we compute
a binary variable for each value of each variable
and study dependencies between these variables.
For example, for the value ‘Certain’ of variable 
CERT we create a binary variable with two val-
ues: ‘Certain’ and ‘Anything Else’ (in this case 
Uncertain, Mixed and Neutral). By studying the 
dependency between binary variables we can 
understand how the interaction works. 
Table 2 reports in rows 3 and 4 all significant 
interactions between the values of variables 
CERT and REJ. Each row shows: 1) the value 
for each original variable, 2) the sign of the dependency,
3) the observed counts, 4) the expected
counts and 5) the χ² value. For example,
in our data there are 49 rejected turns in which
the student was certain. This value is smaller
than the expected counts (67); the dependency
between Certain and Rej is significant with a χ²
value of 9.13. A comparison of the observed
counts and expected counts reveals the direction 
(sign) of the dependency. In our case we see that 
certain turns are rejected less than expected (row 
3), while uncertain turns are rejected more than 
expected (row 4). On the other hand, there is no 
interaction between neutral turns and rejections 
or between mixed turns and rejections. Thus, the 
CERT – REJ interaction is explained only by the 
interaction between Certain and Rej and the in-
teraction between Uncertain and Rej. 
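The value-level analysis can be illustrated on the Certain – Rej row of Table 2. The sketch below builds the 2x2 table for the binary variables ‘Certain’ versus ‘Anything Else’ and Rej versus noRej, then computes the expected count, the sign and the χ² value. The marginal totals are reconstructions and should be read as assumptions: the number of rejected turns is the sum of the observed counts in Table 3, and the number of certain turns is derived from the rounded percentage in Table 1. The uncorrected Pearson statistic is also an assumption about how the reported values were computed.

```python
def chi2_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns the expected count of cell a, the chi-square value and the
    sign of the dependency for cell a ('+' if observed exceeds expected).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected_a = row1 * col1 / n
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    sign = "+" if a > expected_a else "-"
    return expected_a, chi2, sign


TOTAL_TURNS = 2334                            # Table 1
REJ_TURNS = 34 + 16 + 113                     # sum of observed counts, Table 3
CERTAIN_TURNS = round(0.413 * TOTAL_TURNS)    # assumption: from rounded 41.3%
CERTAIN_REJ = 49                              # Table 2

a = CERTAIN_REJ                               # Certain and rejected
b = CERTAIN_TURNS - a                         # Certain and not rejected
c = REJ_TURNS - a                             # Anything Else and rejected
d = TOTAL_TURNS - CERTAIN_TURNS - c           # Anything Else and not rejected

expected_a, chi2, sign = chi2_2x2(a, b, c, d)
print(f"expected={expected_a:.1f} chi2={chi2:.2f} sign={sign}")
```

With these reconstructed margins the expected count and χ² value should land close to the 67 and 9.13 reported in Table 2, with a negative sign; small differences are possible because the percentage in Table 1 is rounded.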
5 Results - dependencies 
In this section we present all significant depend-
encies between SRP and student state both 
within and across turns. Within turn interactions 
analyze the contribution of the student state to 
the recognition of the turn. They were motivated 
by the widely believed intuition that emotion 
interacts with SRP. Across turn interactions look 
at the contribution of previous SRP to the current 
student state. Our previous work (Rotaru and 
Litman, 2005) has shown that certain SRP
correlate with emotional responses from the user.
We also study the impact of the emotion annota-
tion level (EnE versus FAH/CERT) on the inter-
actions we observe. The implications of these 
dependencies will be discussed in Section 6. 
5.1 Within turn interactions 
For the FAH dimension, we find only one sig-
nificant interaction: the interaction between the 
FAH student state and the rejection of the current 
turn (Table 3). By studying values’ interactions, 
we find that turns where the student is frustrated 
or angry are rejected more than expected (34 instead
of 16; Figure 1, STD₄ is one of them).
Similarly, turns where the student response is 
hyperarticulated are also rejected more than ex-
pected (similar to observations in (Soltau and 
Waibel, 2000)). In contrast, neutral turns in the 
FAH dimension are rejected less than expected. 
Surprisingly, FrAng does not interact with
AsrMis, unlike the association observed by
Bulyko et al. (2005); however, they use the full
word error rate measure instead of the binary
version used in this paper.
Combination        Sign  Obs.  Exp.  χ²
FAH – REJ                            77.92
FrAng – Rej        +     34    16    23.61
Hyp – Rej          +     16    3     50.76
Neutral – Rej      -     113   143   57.90
Table 3: FAH – REJ interaction.
Next we investigate how our second emotion 
annotation, CERT, interacts with SRP. All sig-
nificant dependencies are reported in Tables 2 
and 4. In contrast with the FAH dimension, here 
we see that the interaction direction depends on 
the valence. We find that ‘Certain’ turns have 
fewer SRP than expected (in terms of AsrMis and
Rej). In contrast, ‘Uncertain’ turns have more 
SRP both in terms of AsrMis and Rej. ‘Mixed’ 
turns interact only with AsrMis, allowing us to 
conclude that the presence of uncertainty in the 
student turn (partial or overall) will result in ASR 
problems more than expected. Interestingly, on 
this dimension, neutral turns do not interact with 
any of our three SRP. 
Combination          Sign  Obs.  Exp.  χ²
CERT – ASRMIS                          38.41
Certain – AsrMis     -     204   244   15.32
Uncertain – AsrMis   +     138   112   9.46
Mixed – AsrMis       +     29    13    22.27
Table 4: CERT – ASRMIS interaction.
Finally, we look at interactions between stu-
dent correctness and SRP. Here we find signifi-
cant dependencies with all types of SRP (see Ta-
ble 5). In general, correct student turns have 
fewer SRP while incorrect, partially correct or 
UA turns have more SRP than expected. Partially 
correct turns have more AsrMis and SemMis 
problems than expected, but are rejected less 
than expected. Interestingly, UA turns interact 
only with rejections: these turns are rejected 
more than expected. An analysis of our corpus 
reveals that in most rejected UA turns the student 
does not say anything; in these cases, the sys-
tem’s recognition module thought the student 
said something but the system correctly rejects 
the recognition hypothesis. 
Combination        Sign  Obs.  Exp.  χ²
CRCT – ASRMIS                        65.17
C – AsrMis         -     295   374   62.03
I – AsrMis         +     198   137   45.95
PC – AsrMis        +     50    37    5.9
CRCT – SEMMIS                        20.44
C – SemMis         +     100   84    7.83
I – SemMis         -     14    31    13.09
PC – SemMis        +     15    8     5.62
CRCT – REJ                           99.48
C – Rej            -     53    102   70.14
I – Rej            +     84    37    79.61
PC – Rej           -     4     10    4.39
UA – Rej           +     21    11    9.19
Table 5: Interactions between Correctness and SRP.
The only exception to the rule is SEM MIS. 
We believe that SEM MIS behavior is explained 
by the “catch-all” implementation in our system. 
In ITSPOKE, for each tutor question there is a list 
of anticipated answers. All other answers are 
treated as incorrect. Thus, it is less likely that a 
recognition problem in an incorrect turn will af-
fect the correctness interpretation (e.g. Figure 1, 
STD₂: it is very unlikely that the incorrect
“weight” is misrecognized as the anticipated
“the product of mass and acceleration”). In
contrast, in correct turns recognition problems
are more likely to alter the correctness
interpretation (e.g. misrecognizing
“gravity down” as “gravity sound”).
5.2 Across turn interactions 
Next we look at the contribution of previous SRP 
– variable name or value followed by (-1) – to the
current student state. Please note that there are 
two factors involved here: the presence of the 
SRP and the SRP handling strategy. In 
ITSPOKE, whenever a student turn is rejected, 
unless this is the third rejection in a row, the stu-
dent is asked to repeat using variations of “Could 
you please repeat that?”. In all other cases, 
ITSPOKE makes use of the available informa-
tion ignoring any potential ASR errors. 
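The (-1) variables used in this section pair the SRP value of the previous student turn with the state of the current student turn. The following is a minimal sketch of that pairing; the dictionary-based turn representation is an assumption, as is the choice not to pair turns across dialogue boundaries. The example encodes the four student turns of Figure 1, assuming that a dimension with no label shown in the figure is Neutral.

```python
from collections import Counter


def lagged_pairs(dialogues, srp_key, state_key):
    """Pair the SRP value of turn i-1 with the student state of turn i.

    `dialogues` is a list of dialogues, each a list of turn dictionaries in
    temporal order. Pairs never cross a dialogue boundary (an assumption).
    """
    pairs = []
    for turns in dialogues:
        for previous, current in zip(turns, turns[1:]):
            pairs.append((previous[srp_key], current[state_key]))
    return pairs


# Student turns STD1-STD4 of Figure 1.
figure1 = [
    {"REJ": "noRej", "FAH": "Neutral"},
    {"REJ": "noRej", "FAH": "Neutral"},
    {"REJ": "Rej", "FAH": "Neutral"},
    {"REJ": "Rej", "FAH": "FrAng"},
]

pairs = lagged_pairs([figure1], "REJ", "FAH")
counts = Counter(pairs)   # observed counts for a REJ(-1) – FAH table
print(counts)
```

The resulting counts are the observed cells of a REJ(-1) – FAH contingency table, to which the χ² test of Section 4 can be applied.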
Combination             Sign  Obs.  Exp.  χ²
ASRMIS(-1) – FAH                          7.64
AsrMis(-1) – FrAng      -t    46    58    3.73
AsrMis(-1) – Hyp        -t    7     12    3.52
AsrMis(-1) – Neutral    +     527   509   6.82
REJ(-1) – FAH                             409.31
Rej(-1) – FrAng         +     36    16    28.95
Rej(-1) – Hyp           +     38    3     369.03
Rej(-1) – Neutral       -     88    142   182.9
REJ(-1) – CRCT                            57.68
Rej(-1) – C             -     68    101   31.94
Rej(-1) – I             +     74    37    49.71
Rej(-1) – PC            -     3     10    6.25
Table 6: Interactions across turns (t – trend, p<0.1).
Here we find only 3 interactions (Table 6). We 
find that after a non-harmful SRP (AsrMis) the 
student is less frustrated and hyperarticulated 
than expected. This result is not surprising since 
an AsrMis does not have any effect on the nor-
mal dialogue flow. 
In contrast, after rejections we observe several 
negative events. We find a highly significant in-
teraction between a previous rejection and the 
student FAH state, with the student being more
frustrated and more hyperarticulated than expected
(e.g. Figure 1, STD₄). Not only does the system
elicit an emotional reaction from the student after 
a rejection, but her subsequent response to the 
repetition request suffers in terms of the correct-
ness. We find that after rejections student an-
swers are correct or partially correct less than 
expected and incorrect more than expected. The 
REJ(-1) – CRCT interaction might be explained
by the CRCT – REJ interaction (Table 5) if, in 
general, after a rejection the student repeats her 
previous turn. An annotation of responses to re-
jections as in (Swerts et al., 2000) (repeat, re-
phrase etc.) should  provide additional insights.  
We were surprised to see that a previous 
SemMis (more harmful than an AsrMis but less 
disruptive than a Rej) does not interact with the 
student state; also the student certainty does not 
interact with previous SRP. 
5.3 Emotion annotation level 
We also study the impact of the emotion annota-
tion level on the interactions we can observe 
from our corpus. In this section, we look at inter-
actions between SRP and our coarse-level emo-
tion annotation (EnE) both within and across 
turns. Our results are similar to the results of
our previous work (Rotaru and Litman, 2005) on
a smaller corpus and a similar annotation
scheme. We find again only one significant
interaction: rejections are followed by more
emotional turns than expected (Table 7). The strength
of the interaction is smaller than in previous
work, though the results cannot be compared
directly. No other dependencies are present. 
Combination            Sign  Obs.  Exp.  χ²
REJ(-1) – EnE                            6.19
Rej(-1) – Emotional    +     119   104   6.19
Table 7: REJ(-1) – EnE interaction.
We believe that the REJ(-1) – EnE interaction is
explained mainly by the FAH dimension. Not
only is there no interaction between REJ(-1) and
CERT, but the inclusion of the CERT dimension
in the EnE annotation decreases the strength of
the interaction between REJ(-1) and FAH (the χ²
value decreases from 409.31 for FAH to a mere
6.19 for EnE). Collapsing emotional classes also
prevents us from seeing any within turn interactions.
These observations suggest that what is
being counted as an emotion for a binary emotion
annotation is critical to its success. In our case,
if we look at affect (FAH) or attitude (CERT) in 
isolation we find many interactions; in contrast, 
combining them offers little insight.  
6 Results – insights & strategies 
Our results put a spotlight on several interesting 
observations which we discuss below. 
Emotions interact with SRP 
The dependencies between FAH/CERT and 
various SRP (Tables 2-4) provide evidence that 
user’s emotions interact with the system’s ability 
to recognize the current turn. This is a widely 
believed intuition with little empirical support so 
far. Thus, our notion of student state can be a 
useful higher level information source for SRP 
predictors. Similar to (Hirschberg et al., 2004), 
we believe that peculiarities in the acous-
tic/prosodic profile of specific student states are 
responsible for their SRP. Indeed, previous work 
has shown that the acoustic/prosodic information 
plays an important role in characterizing and 
predicting both FAH (Ang et al., 2002; Soltau 
and Waibel, 2000) and CERT (Liscombe et al., 
2005; Swerts and Krahmer, 2005). 
The impact of the emotion annotation level 
A comparison of the interactions yielded by 
various levels of emotion annotation shows the 
importance of the annotation level. When using a 
coarser level annotation (EnE) we find only one 
interaction. By using a finer level annotation, not
only can we understand this interaction better but
we also discover new interactions (five interactions
with FAH and CERT). Moreover, various
state annotations interact differently with SRP.
For example, non-neutral turns in the FAH dimension
(FrAng and Hyp) are always rejected
more than expected (Table 3); in contrast,
interactions between non-neutral turns in the 
CERT dimension and rejections depend on the 
valence (‘certain’ turns will be rejected less than 
expected while ‘uncertain’ will be rejected more 
than expected; recall Table 2). We also see that 
the neutral turns interact with SRP depending on 
the dimension that defines them: FAH neutral 
turns interact with SRP (Table 3) while CERT 
neutral turns do not (Tables 2 and 4). 
This insight suggests an interesting tradeoff 
between the practicality of collapsing emotional 
classes (Ang et al., 2002; Litman and Forbes-
Riley, 2004) and the ability to observe meaning-
ful interactions via finer level annotations. 
Rejections: impact and a handling strategy 
Our results indicate that rejections and 
ITSPOKE’s current rejection-handling strategy 
are problematic. We find that rejections are fol-
lowed by more emotional turns (Table 7). A 
similar effect was observed in our previous work 
(Rotaru and Litman, 2005). The fact that it
generalizes across annotation scheme and corpus
emphasizes its importance. When a finer level
annotation is used, we find that rejections are
followed more than expected by a frustrated,
angry and hyperarticulated user (Table 6). Moreover,
these subsequent turns can result in additional
rejections (Table 3). Asking to repeat after
a rejection does not help in terms of correctness
either: the subsequent student answer is actually
incorrect more than expected (Table 6).
These interactions suggest an interesting strat-
egy for our tutoring task: favoring misrecogni-
tions over rejections (by lowering the rejection 
threshold). First, since rejected turns are incorrect
more often than expected (Table 5), the
recognized hypothesis for such a turn is very
likely to be interpreted as incorrect. Thus,
accepting such a turn instead of rejecting it will
have the same outcome in terms of correctness: 
an incorrect answer. In this way, instead of at-
tempting to acquire the actual student answer by 
asking to repeat, the system can skip these extra 
turn(s) and use the current hypothesis. Second, 
the other two SRP are less taxing in terms of 
eliciting FAH emotions (recall Table 6; note that 
a SemMis might activate an unwarranted and 
lengthy knowledge remediation subdialogue). 
This suggests that continuing the conversation 
will be more beneficial even if the system mis-
understood the student. A similar behavior was 
observed in human-human conversations through 
a noisy speech channel (Skantze, 2005). 
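The effect of lowering the rejection threshold can be sketched as a simple accept/reject rule over recognizer confidence scores. Everything in this sketch is hypothetical: the confidence scale, the two threshold values and the scores themselves are invented for illustration and do not describe ITSPOKE's actual configuration.

```python
def handle_turn(confidence: float, threshold: float) -> str:
    """Accept the recognition hypothesis or reject it and ask to repeat."""
    return "accept" if confidence >= threshold else "reject"


def count_rejections(confidences, threshold: float) -> int:
    """Number of turns that would be rejected at a given threshold."""
    return sum(handle_turn(c, threshold) == "reject" for c in confidences)


# Hypothetical confidence scores for five student turns.
scores = [0.9, 0.45, 0.2, 0.7, 0.35]

at_default = count_rejections(scores, threshold=0.5)   # hypothetical default
at_lowered = count_rejections(scores, threshold=0.3)   # hypothetical lower value
print(at_default, at_lowered)
```

Turns that move from rejected to accepted under the lower threshold become potential misrecognitions instead, which is the tradeoff the proposed strategy accepts.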
Correctness/certainty–SRP interactions 
We also find an interesting interaction between 
correctness/certainty and the system’s ability to
recognize that turn. In general, correct/certain turns
have fewer SRP while incorrect/uncertain turns
have more SRP than expected. This observation 
suggests that the computer tutor should ask the 
right question (in terms of its difficulty) at the 
right time. Intuitively, asking a more complicated 
question when the student is not prepared to an-
swer it will increase the likelihood of an incor-
rect or uncertain answer. But our observations 
show that the computer tutor has more trouble
correctly recognizing these types of answers.
This suggests an interesting tradeoff between the 
tutor’s question difficulty and the system’s abil-
ity to recognize the student answer. This tradeoff 
is similar in spirit to the initiative-SRP tradeoff 
that is well known when designing information-
seeking systems (e.g. system initiative is often 
used instead of a more natural mixed initiative 
strategy, in order to minimize SRP). 
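To make the notion of "more than expected" concrete, the following sketch computes expected counts and the χ2 statistic for a contingency table of correctness against SRP. The counts are hypothetical and chosen only for illustration; they are not taken from our corpus, and the function name `chi_square` is likewise an assumption.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(observed, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return chi2, expected


# Hypothetical counts. Rows: incorrect, correct. Columns: SRP, no SRP.
observed = [[30, 70],
            [20, 180]]

chi2, expected = chi_square(observed)
print(round(chi2, 2))            # 19.2
print(round(expected[0][0], 2))  # 16.67
```

In this made-up table, incorrect turns with SRP occur 30 times against about 16.67 expected under independence, so they occur more than expected. The statistic (19.2) exceeds 3.84, the critical value for one degree of freedom at p < 0.05, so the dependency would count as significant.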
7 Conclusions 
In this paper we analyze the interactions between 
SRP and three higher level dialogue factors that 
define our notion of student state: frustra-
tion/anger/hyperarticulation, certainty and cor-
rectness. Our analysis produces several interest-
ing insights and strategies which confirm the 
utility of the proposed approach. We show that 
user emotions interact with SRP and that the 
emotion annotation level affects the interactions 
we observe from the data, with finer-level emo-
tions yielding more interactions and insights. 
We also find that tutoring, as a new domain 
for speech applications, brings forward new im-
portant factors for spoken dialogue design: cer-
tainty and correctness. Both factors interact with 
SRP, and these interactions highlight an interesting
design practice in spoken tutoring applications:
the tradeoff between the pedagogical
value of asking difficult questions and the sys-
tem’s ability to recognize the student answer (at 
least in our system). The particularities of the 
tutoring domain also suggest favoring misrecog-
nitions over rejections to reduce the negative im-
pact of asking to repeat after rejections. 
In our future work, we plan to move to the 
third step of our approach: testing the strategies 
suggested by our results. For example, we will 
implement a new version of ITSPOKE that never 
rejects the student turn. Next, the current version 
and the new version will be compared with re-
spect to users’ emotional response. Similarly, to 
test the tradeoff hypothesis, we will implement a 
version of ITSPOKE that asks difficult questions 
first and then falls back to simpler questions. A 
comparison of the two versions in terms of the 
number of SRP can be used for validation. 
While our results might be dependent on the 
tutoring system used in this experiment, we be-
lieve that our findings can be of interest to practi-
tioners building similar voice-based applications. 
Moreover, our approach can be applied easily to 
studying other systems. 
Acknowledgements 
This work is supported by NSF Grant No. 
0328431. We thank Dan Bohus, Kate Forbes-
Riley, Joel Tetreault and our anonymous review-
ers for their helpful comments. 
References 
J. Ang, R. Dhillon, A. Krupski, A. Shriberg and A. 
Stolcke. 2002. Prosody-based automatic detection 
of annoyance and frustration in human-computer 
dialog. In Proc. of ICSLP. 
I. Bulyko, K. Kirchhoff, M. Ostendorf and J. Gold-
berg. 2005. Error-correction detection and response 
generation in a spoken dialogue system. Speech 
Communication, 45(3). 
L. Chase. 1997. Blame Assignment for Errors Made 
by Large Vocabulary Speech Recognizers. In Proc. 
of Eurospeech. 
K. Forbes-Riley and D. J. Litman. 2005. Using Bi-
grams to Identify Relationships Between Student 
Certainness States and Tutor Responses in a Spo-
ken Dialogue Corpus. In Proc. of SIGdial. 
M. Frampton and O. Lemon. 2005. Reinforcement 
Learning of Dialogue Strategies using the User's 
Last Dialogue Act. In Proc. of IJCAI Workshop on 
Know. & Reasoning in Practical Dialogue Systems. 
M. Gabsdil and O. Lemon. 2004. Combining Acoustic 
and Pragmatic Features to Predict Recognition 
Performance in Spoken Dialogue Systems. In Proc. 
of ACL. 
J. Hirschberg, D. Litman and M. Swerts. 2004. Pro-
sodic and Other Cues to Speech Recognition Fail-
ures. Speech Communication, 43(1-2). 
M. Kearns, C. Isbell, S. Singh, D. Litman and J. 
Howe. 2002. CobotDS: A Spoken Dialogue System 
for Chat. In Proc. of National Conference on Arti-
ficial Intelligence (AAAI). 
J. Liscombe, J. Hirschberg and J. J. Venditti. 2005. 
Detecting Certainness in Spoken Tutorial Dia-
logues. In Proc. of Interspeech. 
D. Litman and K. Forbes-Riley. 2004. Annotating 
Student Emotional States in Spoken Tutoring Dia-
logues. In Proc. of SIGdial Workshop on Discourse 
and Dialogue (SIGdial). 
H. Pon-Barry, B. Clark, E. O. Bratt, K. Schultz and S. 
Peters. 2004. Evaluating the effectiveness of Scot: a 
spoken conversational tutor. In Proc. of ITS Work-
shop on Dialogue-based Intellig. Tutoring Systems. 
M. Rotaru and D. Litman. 2005. Interactions between 
Speech Recognition Problems and User Emotions. 
In Proc. of Eurospeech. 
G. Skantze. 2005. Exploring human error recovery 
strategies: Implications for spoken dialogue sys-
tems. Speech Communication, 45(3). 
H. Soltau and A. Waibel. 2000. Specialized acoustic 
models for hyperarticulated speech. In Proc. of 
ICASSP. 
M. Swerts and E. Krahmer. 2005. Audiovisual Pros-
ody and Feeling of Knowing. Journal of Memory 
and Language, 53. 
M. Swerts, D. Litman and J. Hirschberg. 2000. Cor-
rections in Spoken Dialogue Systems. In Proc. of 
ICSLP. 
K. VanLehn, P. W. Jordan, C. P. Rosé, et al. 2002. 
The Architecture of Why2-Atlas: A Coach for 
Qualitative Physics Essay Writing. In Proc. of In-
telligent Tutoring Systems (ITS). 
M. Walker, D. Litman, C. Kamm and A. Abella. 
2000. Towards Developing General Models of Us-
ability with PARADISE. Natural Language Engi-
neering. 
M. Walker, R. Passonneau and J. Boland. 2001. 
Quantitative and Qualitative Evaluation of Darpa 
Communicator Spoken Dialogue Systems. In Proc. 
of ACL. 
