Turn off the Radio and Call Again: 
How Acoustic Clues can Improve Dialogue Management 
Christian Lieske 
\]~cole Polytechnique F~d@derale Lausanne (EPFL) 
e-mail: lieske@di.epfl.ch 
1 Introduction 
Traditionally, the most imPortant input to 
the dialogue management component of a 
natural language system are semantic repre- 
sentations (e.g., formulae of first-order pre- 
dicate calculus). Only recently, other kinds 
of information (e.g., phonetic transcriptions 
of unknown words) has been used. There 
is, however, room for the utilization of ad- 
ditional knowledge sources. In what fol- 
lows, we first explicate how information from 
the accoustic level (e.g., the presence of cer- 
tain kinds of background noise from a radio) 
enables better system performance. Our ar- 
gumentation is that this is possible since the 
dialogue manager can use acoustic clues, on 
the one hand, to establish better recognition 
conditions, and on the other hand to gene- 
rate more co-operative interactions. After 
this argumentation, we indicate how the aco- 
ustic cues for dialogue management compo- 
nents can be generated. 
2 State-of-the-art in Dialogue 
Management 
Today, many dialogue management compo- 
nents base their operations on a combination 
of dialogue history, a rule-based or statisti- 
Most of the ideas presented here were developed du- 
ring work at Forum Technology, Malvern. I am indebted 
to all of my colleagues there, especicially J6rg ~lberla 
and David Bijl for providing valuable discussions. I also 
owe thanks to Afzal Ballhn, and Yoshiki Mori for their 
comments on drafts of this poster. 
45 
cal dialogue model, rule-based or statistical 
dialogue act idenfication rules, and semantic 
representations of user utterances. For spo- 
ken language input, this repertoire of infor- 
mation ensues interactions like the (unsuc- 
cessful) dialogue in Figure 1 (from a train 
timetable enquiry system). 1 
What happens in a dialogue like this is 
the following: The dialogue manager tries 
to verify a value that is needed for a da- 
tabase query. This attempt fails several ti- 
mes, since the incoming semantic represen- 
tations are inconsistent due to recognition 
errors. After an arbitrary number of trials 
(two in the example), the dialogue manager 
surrenders. 
More advanced systems exploit additional 
information to generate more intelligent fe- 
edback for the user. The Verbmobil device 
which translates spoken utterances between 
German and English as well as between 
Japanese and English (see (Kay et al., 
1994)), e.g. exploits representations of 
unknown words and information about pho- 
netically similar words to generate clarifica- 
tion dialogues along the lines of Figure 2 (cf. 
(Maier et al., 1997)). 
1 Apologies for the fact that all examples are related 
to German. 
user 
user 
system 
system 
user 
user 
system 
system 
user 
user 
system 
system 
Ich mSchte gerne um zwei Uhr 
nach Hamburg fahren. 
I would like to travel to Hamburg 
at two o 'clock. 
Sie mSchten nach Homburg 
fahren? 
You would like to go to Homburg? 
Nein, nicht nach Homburg, son- 
dern nach Hamburg. 
No, not to Homburg but to Ham- 
burg. 
Sie mSchten nach Offenburg 
fahren? 
You would like to go to Often- 
burg? 
Nein, ich mSchte auch nicht nach 
Offenburg, sondern nach Ham- 
burg. 
No, I also don't want to go to Of- 
fenburg but to Hamburg. 
Es scheint ein Kommuni- 
kationsproblem zu geben. Bitte 
versuchen Sie es sp~ter noch ein- 
mal. 
We seem to have a communica- 
tions problem. Please try again 
later. 
Figure 1: Communication failure 
user 
user 
system 
system 
user 
user 
system 
system 
user 
user 
Mein Name ist Schwartz. Ich 
mSchte mit Ihnen einen Termin 
im Juli ausmachen. 
My name is Schwartz. I would 
like to arrange an appointment in 
July. 
Ist Schwartz ein Eigenname? 
Is Schwartz a proper name? 
Ja. 
Yes. 
Sagten Sie 'Juli' oder 'Juni'? 
Did you say 'July' or 'June'? 
Juli. 
July. 
Figure 2: Intelligent dialogue 
3 Dialogue Management with 
Acoustic Clues 
Still missing even from the more sophistica- 
ted systems like Verbmobil is a flow of in- 
formation all the way from the acoustic level 
up to the dialogue management component. 2 
This kind of coupling will enable more suc- 
cessful systems for two reasons: impro- 
ved speech recognition and more informative 
responses to the user. 
Improvement in speech recognition will 
stem from the dialogue manager acting as 
a kind of mediator between the speech re- 
cognizer and the user. In case of bad re- 
cognition rates (speech recognizers already 
deliver confidence scores), the dialogue ma- 
nager could ask for acoustic clues concerning 
the recognition conditions. If it then receives 
some clues about background noise (e.g., a 
radio), it might initiate a request to the user 
to establish a better acoustic environment. 
More specifically, the dialogue manager co- 
uld generate the concepts to ask Could you 
please turn off the radio? 
The quality of responses to users equally 
well can profit from information about the 
acoustic environment. To see this, imagine a 
situation where a police officer reports from 
the scene of an accident: In this situation, 
the acoustic conditiohs presumably are so 
adverse that recognition accuracy is inacce- 
ptable. Unlike in the scenario above, ho- 
wever~ little can be done to change the en- 
vironment. The appropriate action of the 
dialogue manager thus would be to make it 
clear to the officer that he is wasting his time 
in trying to get his message through. 
Thus, with information on the acoustic 
conditions/environment, the dialogue in Fi- 
gure 1 could become the one outlined in Fi- 
gure 3. 
2Take this as a metaphor. The architecture might in 
fact be black- or whiteboard-like. 
46 
user 
user 
system 
system 
user 
user 
Ich mSchte gerne um zwei Uhr 
nach Hamburg fahren. 
I would like to travel to Hamburg 
at two o'clock. Sie sind wegen der Musik im Hin- 
tergrund leider schwer zu verste- 
hen. W£re es mSglich, dat~ Sie 
nochmals anrufen, wenn Sie die 
Musik abgestellt haben? 
Unfortunately, I have got difficul- 
ties in understanding you due to 
the music in the background. Co- 
uld you call again after having 
turned it off? 
Ja; bis gleich. 
Yes; until later. 
Figure 3: Dialogue utilizing acoustic clues 
4 Techniques for Detecting Acoustic 
Clues 
Work on the acoustic level that is suited for 
integration into a dialogue framework like 
that depicted above has not advanced far 
enough, yet. Only recently, in the backgro- 
und of the DARPA evaluations for speech 
recognition systems, the importance of the 
type of noise tracking that is needed has been 
realized. A component that not only dete- 
cts but also classifies noise (as, e.g., music 
or street noise) has a good chance of beco- 
ming the first plug-and-play spoken language 
interface entity (SLIE), and seems to be re- 
alizable by well-mastered techniques for spe- 
ech recognition like Hidden Markov Models 
((Rabiner, 1989)). 
One approach for classifying acoustic 
conditions into different categories (e.g. 
background music) would be to use techni- 
ques like the ones used for non-word ba- 
sed topic spotting (see (Nowell and Moore, 
1995)). The different categories of noise 
would correspond to topics, and typical se- 
ctions of acoustic material from each cate- 
gory would correspond to keywords. Ba- 
sed on samples from each category/topic, ke- 
ywords which are most useful in identifying 
this topic would then be extracted automa- 
tically. An incoming signal could then be 
classified as belonging to one of the catego- 
ries, depending on which keywords appear 
most frequently. 
Another approach is to build a simple Hid- 
den Markov Model which gets trained for 
each category from the data in that category. 
An incoming signal could then be assigned 
to the category whose HMM gives the best 
match. 
Research is also needed in the realm of 
dialogue management. It remains to be in- 
vestigated in exactly which ways the acoustic 
information can be used. Obvious requests 
or follow-up questions like those exampli- 
fled above are one option; more clever qu- 
estions like Our communication may proceed 
more smoothly if the system adapts to your 
acoustic conditions; shall this be done? are 
another. 

References 
Martin Kay, Jean Mark Gawron, and Peter Nor- 
vig. 1994. Verbmobil: A Translation System for 
Face-to-Face Dialog. Number 33 in Lecture No- 
tes. CSLI, Stanford, CA. 
E. Maier, N. Reithinger, and Alexandersson J. 1997. 
Clarification dialogues as measures to increase ro- 
bustness in a spoken dialogue system. In Proce- 
edings of the ACL/EACL Workshop on Spoken 
Dialog Systems, Madrid. 
Peter Nowell and Roger Moore. 1995. The applica- 
tion of dynamic programming techniques to non- 
word based topic spotting. In Proceedings of the 
4th European Conference on Speech Communica- 
tion and Technology, Madrid. 
Lawrence R. Rabiner. 1989. A tutorial on hidden 
Markov models and selected applications in spe- 
ech recognition. In Proceedings of the IEEE, pa- 
ges 257 - 286. 
