Proceedings of HLT/EMNLP 2005 Demonstration Abstracts, pages 24–25,
Vancouver, October 2005.
A Flexible Conversational Dialog System for MP3 Player
Fuliang Weng¹, Lawrence Cavedon², Badri Raghunathan¹, Danilo Mirkovic²,
Ben Bei¹, Heather Pon-Barry¹, Harry Bratt³, Hua Cheng², Hauke Schmidt¹,
Rohit Mishra⁴, Brian Lathrop⁴, Qi Zhang¹, Tobias Scheideck¹, Kui Xu¹,
Tess Hand-Bender¹, Sandra Upson¹, Stanley Peters², Liz Shriberg³,
Carsten Bergmann⁴
Research and Technology Center, Robert Bosch Corp., Palo Alto, California¹
Center for Study of Language and Information, Stanford University, Stanford, California²
Speech Technology and Research Lab, SRI International, Menlo Park, California³
Electronics Research Lab, Volkswagen of America, Palo Alto, California⁴
{Fuliang.weng,badri.raghunathan,hauke.Schmidt}@rtc.bosch.com
{lcavedon,huac,peters}@csli.Stanford.edu
{harry,ees}@speech.sri.com
{rohit.mishra,carsten.bergmann}@vw.com
1 Abstract
In recent years, an increasing number of new de-
vices have found their way into the cars we drive.
Speech-operated devices in particular provide a
great service to drivers by minimizing distraction,
so that they can keep their hands on the wheel and
their eyes on the road. With this goal in mind, this presentation
demonstrates our latest in-car dialog system for an MP3
player, developed in a joint research effort by
Bosch RTC, VW ERL, Stanford CSLI, and SRI STAR Lab
and funded by NIST ATP [Weng et al 2004].
This project has developed a number of new tech-
nologies, some of which are already incorporated
in the system. These include end-pointing with
prosodic cues, error identification and recovery
strategies, flexible multi-threaded, multi-device
dialog management, and content optimization and
organization strategies. A number of important
language phenomena are also covered in the sys-
tem’s functionality. For instance, one may use
words relying on context, such as ‘this,’ ‘that,’ ‘it,’
and ‘them,’ to reference items mentioned in par-
ticular use contexts. Different types of verbal revi-
sion are also permitted by the system, providing a
great convenience to its users. The system supports
multi-threaded dialogs so that users can diverge to
a different topic before the current one is finished
and still come back to the first after the second
topic is done. To lower the cognitive load on the
drivers, the content optimization component orga-
nizes any information given to users based on on-
tological structures, and may also refine users’
queries via various strategies. Domain knowledge
is represented using OWL, a web ontology lan-
guage recommended by W3C, which should
greatly facilitate its portability to new domains.
The spoken dialog system consists of a number of
components (see Fig. 1 for details). Instead of the
hub architecture employed by Communicator projects
[Seneff et al, 1998], the system is developed in Java and
uses a flexible event-based, message-oriented middleware.
This allows for dynamic registration of
new components. Among the component modules
in Figure 1, we use the Nuance speech recognition
engine with class-based ngrams and dynamic
grammars, and the Nuance Vocalizer as the TTS
engine. The Speech Enhancer removes noise and
echo. The Prosody module will provide additional
features to the Natural Language Understanding
(NLU) and Dialogue Manager (DM) modules to
improve their performance.
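The event-based, message-oriented design with dynamic registration can be illustrated with a minimal publish/subscribe sketch. This is not the project's middleware: the system itself is written in Java, Python is used here only for brevity, and the class `EventBus`, the topic names, and the two handler functions are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class EventBus:
    """Minimal publish/subscribe hub (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def register(self, topic: str, handler: Handler) -> None:
        """Components may register at any time, even after others are running."""
        self._handlers[topic].append(handler)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver the message to every handler on the topic; return the count."""
        handlers = list(self._handlers.get(topic, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = EventBus()
log: List[str] = []


def nlu_component(words: List[str]) -> None:
    # Stand-in for an NLU module: wraps the words and passes them on.
    bus.publish("semantic_frame", {"words": words})


def dm_component(frame: dict) -> None:
    # Stand-in for a dialogue manager: records what it received.
    log.append("DM got " + " ".join(frame["words"]))


bus.register("recognized_words", nlu_component)
# Registered later, without any change to the NLU stand-in.
bus.register("semantic_frame", dm_component)

bus.publish("recognized_words", ["play", "some", "jazz"])
```

Because components only share topic names, a new module can be attached by one more `register` call instead of a change to a central hub.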
The NLU module takes a sequence of recognized
words and tags, performs a deep linguistic analysis
with probabilistic models, and produces an XML-
based semantic feature structure representation.
Parallel to the deep analysis, a topic classifier assigns
the top n topics to the utterance, which are used
in cases where the dialog manager cannot make
any sense of the parsed structure. The NLU mod-
ule also supports dynamic updates of the knowl-
edge base.
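The fallback from the deep parse to the topic classifier might be sketched as follows. This is an assumption-laden illustration in Python (the system itself is in Java): the function `interpret`, the `"action"` key, and the `"ask_repeat"` default are hypothetical, and the real NLU output is an XML-based feature structure rather than a dictionary.

```python
from typing import Dict, List, Optional


def interpret(parse: Optional[Dict[str, str]], topics: List[str]) -> Dict[str, str]:
    """Prefer the deep parse; fall back to the best topic when it is unusable."""
    # Hypothetical usability test: a parse is usable if it names an action.
    if parse and "action" in parse:
        return {"source": "parse", **parse}
    # Otherwise use the highest-ranked topic from the classifier, if any.
    if topics:
        return {"source": "topic", "action": topics[0]}
    # Nothing usable at all: ask the user to repeat (assumed default).
    return {"source": "none", "action": "ask_repeat"}


good = interpret({"action": "play", "genre": "jazz"}, ["query", "play"])
fallback = interpret(None, ["query", "play"])
```

Here `good` is built from the parse, while `fallback` carries only the top-ranked topic.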
The CSLI DM module mediates and manages in-
teraction. It uses the dialogue-move approach to
maintain dialogue context, which is then used to
interpret incoming utterances (including fragments
and revisions), resolve NPs, construct salient re-
sponses, track issues, etc. Dialogue states can also
be used to bias SR expectation and improve SR
performance, as has been done in previous
applications of the DM. Detailed descriptions of
the DM can be found in [Lemon et al 2002; Mirk-
ovic & Cavedon 2005].
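One simple way to picture the multi-threaded behaviour, where a user leaves a topic and later returns to it, is a stack of open topics. The sketch below is only an illustration of that idea in Python and does not describe the CSLI DM's dialogue-move implementation; the class `DialogThreads` and the topic strings are hypothetical.

```python
from typing import List, Optional


class DialogThreads:
    """Keeps open topics on a stack; the most recently opened one is active."""

    def __init__(self) -> None:
        self._open: List[str] = []

    def start(self, topic: str) -> None:
        """Open a new topic, suspending whatever was active before."""
        self._open.append(topic)

    def finish(self) -> Optional[str]:
        """Close the active topic and return the topic that resumes, if any."""
        if self._open:
            self._open.pop()
        return self.active

    @property
    def active(self) -> Optional[str]:
        return self._open[-1] if self._open else None


threads = DialogThreads()
threads.start("choose a playlist")
threads.start("ask what is playing")  # user diverges before the first topic is done
resumed = threads.finish()            # second topic done, first one resumes
```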
The Knowledge Manager (KM) controls access to
knowledge base sources (such as domain knowl-
edge and device information) and their updates.
Domain knowledge is structured according to do-
main-dependent ontologies. The current KM
makes use of OWL, a W3C standard, to represent
the ontological relationships between domain enti-
ties. Protégé (http://protege.stanford.edu), a do-
main-independent ontology tool, is used to
maintain the ontology offline. In a typical interac-
tion, the DM converts a user’s query into a seman-
tic frame (i.e. a set of semantic constraints) and
sends this to the KM via the content optimizer.
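Treating a semantic frame as a set of constraints, a knowledge-base lookup can be sketched as a filter that keeps items satisfying every constraint. This is a toy illustration in Python (the system itself is in Java and stores domain knowledge in OWL); the catalogue entries, slot names, and the function `query` are made up for the example.

```python
from typing import Dict, List

Song = Dict[str, str]

# Hypothetical toy catalogue; the entries are invented for illustration.
CATALOGUE: List[Song] = [
    {"title": "Song A", "artist": "Artist X", "genre": "jazz"},
    {"title": "Song B", "artist": "Artist Y", "genre": "rock"},
    {"title": "Song C", "artist": "Artist X", "genre": "rock"},
]


def query(frame: Dict[str, str], catalogue: List[Song]) -> List[Song]:
    """Return every item that satisfies all slot/value constraints in the frame."""
    return [
        item
        for item in catalogue
        if all(item.get(slot) == value for slot, value in frame.items())
    ]


matches = query({"artist": "Artist X", "genre": "rock"}, CATALOGUE)
```

An empty frame imposes no constraints, so it matches the whole catalogue.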
The Content Optimization module acts as an in-
termediary between the dialogue management
module and the knowledge management module
during the query process. It receives semantic
frames from the DM, resolves possible ambigui-
ties, and queries the KM. Depending on the items
in the query result as well as the configurable
properties, the module selects and performs an ap-
propriate optimization strategy.
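The choice of strategy from the query result and configurable properties could look roughly like the sketch below. The strategy names, the `max_items` threshold, and the function `choose_strategy` are assumptions made for illustration in Python; the source does not specify which strategies the module implements or how they are selected.

```python
from typing import Dict, List


def choose_strategy(results: List[Dict[str, str]], max_items: int = 5) -> str:
    """Pick a (hypothetical) strategy from the result size and a configurable limit."""
    if not results:
        # Nothing matched: loosen the query rather than report failure outright.
        return "relax_constraints"
    if len(results) > max_items:
        # Too many items to read aloud to a driver: ask for a narrower query.
        return "ask_to_refine"
    # A short list can be presented directly.
    return "present_all"


few = choose_strategy([{"title": "Song A"}] * 3)
many = choose_strategy([{"title": "Song A"}] * 12)
none = choose_strategy([])
```

Keeping the threshold as a parameter mirrors the idea of configurable properties: the same result list can lead to a different strategy under a different setting.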
Early evaluation shows that the system has a
task completion rate of 80% on 11 tasks in the MP3
player domain, ranging from playing requests to
music database queries. Porting to a restaurant se-
lection domain is currently under way.
References
Seneff, Stephanie, Ed Hurley, Raymond Lau, Christine Pao,
Philipp Schmid, and Victor Zue, GALAXY-II: A Reference
Architecture for Conversational System Development, In-
ternational Conference on Spoken Language Processing
(ICSLP), Sydney, Australia, December 1998.
Lemon, Oliver, Alex Gruenstein, and Stanley Peters, Collabo-
rative activities and multi-tasking in dialogue systems,
Traitement Automatique des Langues (TAL), 43(2), 2002.
Mirkovic, Danilo, and Lawrence Cavedon, Practical Multi-
Domain, Multi-Device Dialogue Management, Submitted
for publication, April 2005.
Weng, Fuliang, Lawrence Cavedon, Badri Raghunathan, Hua
Cheng, Hauke Schmidt, Danilo Mirkovic, et al., Develop-
ing a conversational dialogue system for cognitively over-
loaded users, International Conference on Spoken
Language Processing (ICSLP), Jeju, Korea, October 2004.
