ITSPOKE: An Intelligent Tutoring Spoken Dialogue System
Diane J. Litman
University of Pittsburgh
Department of Computer Science &
Learning Research and Development Center
Pittsburgh PA, 15260, USA
litman@cs.pitt.edu
Scott Silliman
University of Pittsburgh
Learning Research and Development Center
Pittsburgh PA, 15260, USA
scotts@pitt.edu
Abstract
ITSPOKE is a spoken dialogue system that
uses the Why2-Atlas text-based tutoring sys-
tem as its  back-end . A student  rst types a
natural language answer to a qualitative physics
problem. ITSPOKE then engages the student
in a spoken dialogue to provide feedback and
correct misconceptions, and to elicit more com-
plete explanations. We are using ITSPOKE
to generate an empirically-based understanding
of the rami cations of adding spoken language
capabilities to text-based dialogue tutors.
1 Introduction
The development of computational tutorial dialogue sys-
tems has become more and more prevalent (Aleven and
Rose, 2003), as one method of attempting to close the
performance gap between human and computer tutors.
While many such systems have yielded successful evalu-
ations with students, most are currently text-based (Evens
et al., 2001; Aleven et al., 2001; Zinn et al., 2002; Van-
Lehn et al., 2002). There is reason to believe that speech-
based tutorial dialogue systems could be even more ef-
fective. Spontaneous self-explanation by students im-
proves learning gains during human-human tutoring (Chi
et al., 1994), and spontaneous self-explanation occurs
more frequently in spoken tutoring than in text-based tu-
toring (Hausmann and Chi, 2002). In human-computer
tutoring, the use of an interactive pedagogical agent that
communicates using speech rather than text output im-
proves student learning, while the visual presence or ab-
sence of the agent does not impact performance (Moreno
et al., 2001). In addition, it has been hypothesized that
the success of computer tutors could be increased by rec-
ognizing and responding to student emotion. (Aist et al.,
2002) have shown that adding emotional processing to
a dialogue-based reading tutor increases student persis-
tence. Information in the speech signal such as prosody
has been shown to be a rich source of information for
predicting emotional states in other types of dialogue in-
teractions (Ang et al., 2002; Lee et al., 2002; Batliner et
al., 2003; Devillers et al., 2003; Shafran et al., 2003).
With advances in speech technology, several projects
have begun to incorporate basic spoken language capabil-
ities into their systems (Mostow and Aist, 2001; Fry et al.,
2001; Graesser et al., 2001; Rickel and Johnson, 2000).
However, to date there has been little examination of the
rami cations of using a spoken modality for dialogue tu-
toring. To assess the impact and evaluate the utility of
adding spoken language capabilities to dialogue tutoring
systems, we have built ITSPOKE (Intelligent Tutoring
SPOKEn dialogue system), a spoken dialogue system
that uses the Why2-Atlas conceptual physics tutoringsys-
tem (VanLehn et al., 2002) as its  back-end. We are
using ITSPOKE as a platform for examining whether
acoustic-prosodic information can be used to improve the
recognition of pedagogically useful information such as
student emotion (Forbes-Riley and Litman, 2004; Lit-
man and Forbes-Riley, 2004), and whether speech can
improve the performance evaluations of dialogue tutoring
systems (e.g., as measured by learning gains, ef ciency,
usability, etc.) (Ros·e et al., 2003).
2 Application Description
ITSPOKE is a speech-enabled version of the Why2-
Atlas (VanLehn et al., 2002) text-based dialogue tutoring
system. As in Why2-Atlas, a student  rst types a nat-
ural language answer to a qualitative physics problem.
In ITSPOKE, however, the system engages the student
in a spoken dialogue to correct misconceptions and elicit
more complete explanations.
Consider the screenshot shown in Figure 1. ITSPOKE
 rst poses conceptual physics problem 58 to the student,
as shown in the upper right of the  gure. Next, the
Figure 1: Screenshot during ITSPOKE Human-Computer Spoken Dialogue
user types in a natural language essay answer (as shown
in the essay box in the middle right of Figure 1), and
clicks  SUBMIT. ITSPOKE then analyzes the essay, af-
ter which the spoken dialogue with the student begins.
During the dialogue, the system and student discuss
a solution to the problem relative to the student’s es-
say explanation, using spoken English. At the time the
screenshot was generated, the student had just said  free
fall. After each system or student utterance, the system
prompt, or the system’s understanding of the student’s re-
sponse, respectively, are added to the dialogue history (as
shown in the dialogue box in the middle left of Figure 1).1
At some point later in the dialogue, the system will
eventually ask the student to edit the typed essay explana-
tion. The system will then either express satisfaction and
end the tutoring for the current problem, or continue with
another round of spoken dialogue interaction and typed
essay revision. Upon completing the problem, the stu-
dent is presented with an example of an ideal essay.
Another excerpt with ITSPOKE is shown in Figure 2.
1The  Tell Tutor box is used for typed student login only.
3 System Architecture
The architecture of ITSPOKE is shown in Figure 3. Stu-
dent speech is digitized from microphone input and sent
to the Sphinx2 automatic speech recognizer. Sphinx2’s
best hypothesis is then sent to Why2-Atlas for syntactic
and semantic analysis (Ros·e, 2000), discourse and do-
main processing (Jordan et al., 2003), and  nite-state di-
alogue management (Ros·e et al., 2001). The tutor’s text
output is sent to the Cepstral text-to-speech system, and
played through a speaker or headphone.
ITSPOKE is implemented in Python and acts as a
proxy server between the Why2-Atlas server and client.
ITSPOKE monitors the XML being sent between the two
and decides what text should be spoken and when to lis-
ten. After speech recognition, ITSPOKE submits what it
thinks the student said to Why2-Atlas (as if the student
had typed it). Sphinx2’s C API is tied into Python by us-
ing SWIG to generate a Python wrapper API. Cepstral’s
speech synthesis engine is currently run out of process
using a fork/exec, but we plan to use SWIG to tie its C
API into Python in the future. For speed, we have in-
ITSPOKE: I didn’t understand. What force acts on all objects
within close proximity to earth?
STUDENT: uh the force of gravity
ITSPOKE: Fine. Besides the Earth’s gravitational force, are
there any other forces acting on the person, keys, and elevator
after the release?
STUDENT: no
ITSPOKE: Yep. Thus, keep in mind that after the release, the
only force on the person, keys, and elevator is the force of grav-
ity. If gravity is the only force acting on an object, will it be
moving or staying still?
STUDENT: it will be moving
ITSPOKE: Fine. Not only are the person, keys, and elevator
moving, they have only gravitational forces on them. When an
object is falling and has only gravitational force on it, it is said
to be in what?
STUDENT: freefall
ITSPOKE: Good. Thus, after the release the only force on
the keys and man is the downward force of earth’s gravity, so
they are in freefall. Before the release, what is the relationship
between the man’s velocity and the keys’ velocity?
STUDENT: they’re equal
Figure 2: ITSPOKE Excerpt (3.5 minutes into session)
Figure 3: The Architecture of ITSPOKE
stalled Sphinx2 and Cepstral on the ITSPOKE machine.
The dashed line in the  gure re ects that Why2-Atlas can
be installed on a different machine, although we are cur-
rently running it on the same machine.
4 Performance Analysis
A formal evaluation comparing ITSPOKE and other tu-
toring methods began in November 2003, and is still on-
going. Subjects are University of Pittsburgh students who
have taken no college physics and are native speakers of
American English. Our experimental procedure, taking
roughly 4 hours/student, is as follows: students 1) read a
small document of background material, 2) take a pretest
measuring their physics knowledge, 3) use ITSPOKE to
work through 5 physics problems, and 4) take a post-test
similar to the pretest.
As of March 2004, we have collected 80 dialogues
from 16 students (21 total hours of speech, mean dialogue
time of 17 minutes). An average dialogue contains 21.3
student turns and 26.3 tutor turns. The mean student turn
length is 2.8 words (max=28, min=1).2
ITSPOKE uses 56 dialogue-state dependent language
models for speech recognition; 43 of these 56 models
have been used to process the data collected to date.3
These stochastic language models were initially trained
using 4551 typed student utterances from a 2002 eval-
uation of Why2-Atlas, then later enhanced with spo-
ken utterances obtained during ITSPOKE’s pilot testing.
For the 1600 student turns that we have collected, IT-
SPOKE’s current Word Error Rate is 31.2%. While this is
the traditional method of evaluating speech recognition,
semantic rather than transcription accuracy is more useful
for dialogue evaluation as it does not penalize for word
errors that are unimportant to overall utterance interpre-
tation. Semantic analysis based on speech recognition
is the same as based on perfect transcription 92% of the
time. An average dialogue contains 1.4 rejection prompts
(when ITSPOKE is not con dent of the speech recogni-
tion output, it asks the user to repeat the utterance), and
.8 timeout prompts (when the student doesn’t say any-
thing within a speci ed time frame, ITSPOKE repeats its
previous question).
5 Summary
The goal of ITSPOKE is to generate an empirically-based
understanding of the implications of using speech instead
of text-based dialogue tutoring, and to use these results
to build an improved version of ITSPOKE. We are cur-
rently analyzing our corpus of dialogues with ITSPOKE
to determine whether spoken dialogues yield increased
performance compared to text with respect to a variety
of evaluation metrics, and whether acoustic-prosodic fea-
tures only found in speech can be used to better predict
pedagogically useful information such as student emo-
tions. Our next step will be to modify the dialogue man-
ager inherited from Why2-Atlas to use new tutorial strate-
gies optimized for speech, and to enhance ITSPOKE to
predict and adapt to student emotion. In previous work
on adaptive (non-tutoring) dialogue systems (Litman and
Pan, 2002), adaptation to problematic dialogue situations
measurably improved system performance.
Acknowledgments
This research has been supported by NSF Grant Nos.
9720359, 0328431 and 0325054, and by the Of ce of
2Word count is estimated from speech recognition output.
3The remaining language models correspond to physics
problems that are not being tested in the current evaluation.
Naval Research. Thanks to Kurt VanLehn and the Why2-
Atlas team for the use and modi cation of their system.
References
G. Aist, B. Kort, R. Reilly, J. Mostow, and R. Picard.
2002. Experimentally augmenting an intelligent tutor-
ing system with human-supplied capabilities: Adding
Human-Provided Emotional Scaffolding to an Auto-
mated Reading Tutor that Listens. In Proc. of Intel-
ligent Tutoring Systems.
V. Aleven and C. P. Rose. 2003. Proc. of the AIED 2003
Workshop on Tutorial Dialogue Systems: With a View
toward the Classroom.
V. Aleven, O. Popescu, and K. Koedinger. 2001.
Towards tutorial dialog to support self-explanation:
Adding natural language understanding to a cognitive
tutor. In J. D. Moore, C. L. Red eld, and W. L. John-
son, editors, Proc. of Arti cial Intelligence in Educa-
tion, pages 246 255.
J. Ang, R. Dhillon, A. Krupski, E.Shriberg, and A. Stol-
cke. 2002. Prosody-based automatic detection of an-
noyance and frustration in human-computer dialog. In
Proc. of ICSLP.
A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Noth.
2003. How to  nd trouble in communication. Speech
Communication, 40:117 143.
M. Chi, N. De Leeuw, M. Chiu, and C. Lavancher.
1994. Eliciting self-explanations improves under-
standing. Cognitive Science, 18:439 477.
L. Devillers, L. Lamel, and I. Vasilescu. 2003. Emotion
detection in task-oriented spoken dialogs. In Proc. of
ICME.
M. Evens, S. Brandle, R. Chang, R. Freedman, M. Glass,
Y. Lee, L. Shim, C. Woo, Y. Zhang, Y. Zhou,
J. Michaeland, and Allen A. Rovick. 2001. Circsim-
tutor: An Intelligent Tutoring System Using Natural
Language Dialogue. In Proc. Midwest AI and Cogni-
tive Science Conference.
K. Forbes-Riley and D. Litman. 2004. Predicting
emotion in spoken dialogue from multiple knowledge
sources. In Proc. Human Language Technology Con-
ference and North American Chapter of the Associa-
tion for Computational Linguistics.
J. Fry, M. Ginzton, S. Peters, B. Clark, and H. Pon-Barry.
2001. Automated tutoring dialogues for training in
shipboard damage control. In Proc. SIGdial Workshop
on Discourse and Dialogue.
A. Graesser, N. Person, and D. Harter et al. 2001. Teach-
ing tactics and dialog in Autotutor. International Jour-
nal of Arti cial Intelligence in Education.
R. Hausmann and M. Chi. 2002. Can a computer inter-
face support self-explaining? The International Jour-
nal of Cognitive Technology, 7(1).
P. Jordan, M. Makatchev, and K. VanLehn. 2003. Ab-
ductive theorem proving for analyzing student expla-
nations. In Proc. Arti cial Intelligence in Education.
C.M. Lee, S. Narayanan, and R. Pieraccini. 2002. Com-
bining acoustic and language information for emotion
recognition. In Proc. of ICSLP.
D. J. Litman and K. Forbes-Riley. 2004. Annotating stu-
dent emotional states in spoken tutoring dialogues. In
Proc. SIGdial Workshop on Discourse and Dialogue.
D. J. Litman and S. Pan. 2002. Designing and evaluating
an adaptive spoken dialogue system. User Modeling
and User-Adapted Interaction, 12.
R. Moreno, R.E. Mayer, H. A. Spires, and J. C. Lester.
2001. The case for social agency in computer-based
teaching: Do students learn more deeply when they
interact with animated pedagogical agents. Cognition
and Instruction, 19(2):177 213.
J. Mostow and G. Aist. 2001. Evaluating tutors that lis-
ten: An overview of Project LISTEN. In K. Forbus and
P. Feltovich, editors, Smart Machines in Education.
J. Rickel and W. L. Johnson. 2000. Task-oriented col-
laboration with embodied agents in virtual worlds. In
J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, ed-
itors, Embodied Conversational Agents.
C. P. Ros·e, P. Jordan, M. Ringenberg, S. Siler, K. Van-
Lehn, and A. Weinstein. 2001. Interactive conceptual
tutoringin Atlas-Andes. In Proc. Arti cial Intelligence
in Education.
C. P. Ros·e, D. Litman, D. Bhembe, K. Forbes, S. Silli-
man, R. Srivastava, and K. VanLehn. 2003. A com-
parison of tutor and student behavior in speech versus
text based tutoring. In Proc. HLT/NAACL Workshop:
Building Educational Applications Using NLP.
C. P. Ros·e. 2000. A framework for robust sentence level
interpretation. In Proc. North American Chapter of the
Association for Computational Lingusitics.
I. Shafran, M. Riley, and M. Mohri. 2003. Voice signa-
tures. In Proc. of Automatic Speech Recognition and
Understanding.
K. VanLehn, P. W. Jordan, C. Ros·e, D. Bhembe,
M. Bcurrency1ottner, A. Gaydos, M. Makatchev, U. Pap-
puswamy, M. Ringenberg, A. Roque, S. Siler, R. Sri-
vastava, and R. Wilson. 2002. The architecture of
Why2-Atlas: A coach for qualitative physics essay
writing. In Proc. Intelligent Tutoring Systems.
C. Zinn, J. D. Moore, and M. G. Core. 2002. A 3-tier
planning architecture for managing tutorial dialogue.
In Proc. Intelligent Tutoring Systems.
