Spoken Dialogue Control Based on a Turn-minimization
Criterion Depending on the Speech Recognition Accuracy
YASUDA Norihito and DOHSAKA Kohji and AIKAWA Kiyoaki
NTT Communication Science Laboratories
3-1 Morinosato-Wakamiya, Atsugi, Kanagawa, 243-0198 Japan
{yasuda, dohsaka}@atom.brl.ntt.co.jp, aik@idea.brl.ntt.co.jp
Abstract
This paper proposes a new dialogue control method for spoken dialogue systems. The method configures a dialogue plan so as to minimize the estimated number of turns needed to complete the dialogue. The number of turns is estimated from the current speech recognition accuracy and the probability distribution of the user's true request. The proposed method reduces the number of turns to complete the task at almost any recognition accuracy.
1 Introduction
A spoken dialogue system determines user requests from user utterances. Spoken dialogue systems, however, cannot determine a user's request from an initial utterance alone, because automatic speech recognition is limited and recognition errors are unavoidable. Thus, most spoken dialogue systems confirm a user's utterance or demand the missing information in order to determine the user's request. Such exchanges for confirmation or demand between the system and the user are called "confirmation dialogues". Long confirmation dialogues are annoying, so more efficient confirmation is desirable. To measure the efficiency of a dialogue, we use the number of turns (exchanges); naturally, fewer turns is better.
In practical applications, the system can accept multiple types of user requests, such as "making a new appointment", "changing a schedule", and "inquiring about a schedule". Different request types require different information to determine the request. Sometimes the user request type is ambiguous due to recognition errors, and several types of user request are possible. In such a case, it is important for the system to choose which type of user request to confirm first, since it is useless to confirm items that are required only for unlikely request types.
Recognition accuracy affects efficiency in other cases as well. For example, if there are multiple items to be confirmed, it intuitively seems efficient to confirm all of them at once. However, the system must then include candidates for all attributes in the recognition vocabulary, which causes more recognition errors. Moreover, even if only one of the confirmed items is misrecognized, the user might simply reply "No", leaving the system unable to tell which items are correct.
Several efficient dialogue control methods have been proposed (Niimi and Kobayashi, 1996; Litman et al., 2000). However, no previous work takes into account both multiple types of user request and the recognition accuracy during confirmation, which changes what should be confirmed, without domain-specific rules or training.
To prevent needlessly long confirmation dialogues even when the system accepts multiple types of user request, our method estimates the expected number of turns for each user request type and an approximate probability distribution over user request types. The expected number of turns can be derived from the vocabulary required for confirmation and the base recognition accuracy under a given vocabulary size.
2 Method
Overview First, we describe the kind of system to which we assume this method will be applied. The system has a belief state represented by a set of attributes, their values, and the certainty of those values. The certainty lies in [0, 1], and the certainty of a determined value is 1. That is, if the user replies "Yes" to a confirmation, the system changes the certainty of that value to 1. In practice, we can use the score from the recognition engine as this certainty. The system changes the recognition vocabulary according to the attributes to be confirmed at each confirmation. At any given time, the system either confirms or demands some attribute(s); it does not confirm and demand at the same time. Any values required to determine the user request are explicitly confirmed without exception. Words irrelevant to the present confirmation are excluded from the recognition vocabulary. The system knows the base recognition accuracy under a certain vocabulary size, which is used to estimate the recognition accuracy.
Our method can be divided roughly into five parts: the first three parts are used to obtain the expected number of turns, assuming that the user request type is already known; the fourth part is used to approximate the probability distribution of user request types; and the last part is used to decide the next action to be taken by the system.
The system needs to know only three sorts of information: 1) the vocabulary for each attribute; 2) the meaning constraints among words, such as "if the family name of the person is Yasuda, then his department must be accounting"; and 3) the information required for each type of user request, such as "to cancel an appointment, the day and the time are required". No other domain-specific rules or training are necessary.
Guessing the Recognition Accuracy Here we consider how to estimate the recognition accuracy during confirmation from the confirmation target. Once the attributes for confirmation are decided, the recognition vocabulary consists of the words accepted by those attributes plus the general words, such as "Yes" and "No", that are at least necessary to move the dialogue along. We call the recognition accuracy at this point the "attribute recognition accuracy".
We adopt the rule of thumb that the recognition error rate is proportional to the square root of the vocabulary size (Rosenfeld, 1996; Nakagawa and Ida, 1998). Thus, the approximate attribute recognition accuracy can be derived from the number of words accepted by the attributes.
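This rule of thumb is easy to turn into code. The sketch below assumes a hypothetical base accuracy of 0.9 at a 500-word vocabulary; both defaults are illustrative, not values from the paper:

```python
import math

def attribute_recognition_accuracy(vocab_size, base_accuracy=0.9,
                                   base_vocab=500):
    """Estimate the attribute recognition accuracy for a confirmation whose
    vocabulary has `vocab_size` words, assuming the error rate grows in
    proportion to the square root of the vocabulary size."""
    error = (1.0 - base_accuracy) * math.sqrt(vocab_size / base_vocab)
    return max(0.0, 1.0 - error)

# Quadrupling the vocabulary doubles the error rate (0.1 -> 0.2 here).
print(round(attribute_recognition_accuracy(500), 3))   # 0.9
print(round(attribute_recognition_accuracy(2000), 3))  # 0.8
```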
Note that the attribute recognition accuracy cannot be estimated beforehand, because the candidates for some attributes change dynamically as a result of the meaning constraints among words: once the value of one attribute is fixed, the candidates for other attributes are limited to values that satisfy the constraints. Moreover, the degree of limitation varies with the values. The relation between the user's family name and department is such an example.
Turn Estimation to Determine Some Attributes Next we consider how to estimate the expected number of turns for determining some attributes using the approximate attribute recognition accuracy.
We assume that the user's reply to a confirmation must contain an intention corresponding to "Yes" or "No", and that this intention is transmitted to the system without fail. Then, the expected number of turns to complete the confirmation of some attributes is the expected number of attempts until the confirmation is recognized correctly. Therefore, we can derive the expected number of turns to complete confirming ($T_c$) and demanding ($T_d$) some attributes by the following expressions:
$$T_c = \sum_{t=1}^{\infty} t\, r\, (1-r)^{t-1} = \frac{1}{r}, \qquad T_d = T_c + 1 = 1 + \frac{1}{r}$$

where $r$ denotes the attribute recognition accuracy for the attributes to be confirmed.
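The closed forms follow from the geometric distribution of the number of attempts until one confirmation succeeds. A small Python check (ours, not the paper's) compares a truncation of the series with $1/r$:

```python
def expected_confirmation_turns(r):
    """T_c: expected turns to get one confirmation through, when each
    attempt is recognized correctly with probability r."""
    return 1.0 / r

def expected_demand_turns(r):
    """T_d: demanding first costs one extra turn, then confirmation."""
    return 1.0 + expected_confirmation_turns(r)

# Truncating the series sum_t t * r * (1 - r)**(t - 1) reproduces 1 / r.
r = 0.8
series = sum(t * r * (1 - r) ** (t - 1) for t in range(1, 200))
print(round(series, 6), expected_confirmation_turns(r))  # 1.25 1.25
```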
Turn Estimation for a Certain User Request Type Here we estimate the expected number of turns, assuming the type of user request is already known.
If the user request type is fixed, the attributes required for that type are also fixed. By comparing the belief state with these attributes, we can represent the actions required to determine the user request as a set of pairs, each made up of an attribute and the action for that attribute (confirmation or demand). Once this set of pairs is given, we can choose the optimal plan, because we can estimate the expected turns of any permutation of any partition of this set. The expected number of turns for this optimal plan is used as the expected number of turns for the given user request type.
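As an illustration of this plan search, the sketch below enumerates the partitions of a set of attributes to be confirmed and scores each grouping with a hypothetical cost model: a group confirmed at once succeeds in a turn only if every attribute is recognized, at the accuracy implied by the combined vocabulary. All functions and constants here are our assumptions, not the paper's:

```python
import math

def accuracy(vocab_size, base_acc=0.9, base_vocab=500):
    # Error rate grows with the square root of the vocabulary size.
    return max(1e-6, 1.0 - (1.0 - base_acc) * math.sqrt(vocab_size / base_vocab))

def partitions(items):
    """Yield every partition of `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing group in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or into a new group of its own.
        yield [[first]] + part

def group_cost(group):
    # One confirmation turn succeeds only if every attribute in the group
    # is recognized, at the accuracy of the combined vocabulary.
    r = accuracy(sum(group)) ** len(group)
    return 1.0 / r

def best_plan(vocab_sizes):
    """Pick the grouping of attributes (given per-attribute vocabulary
    sizes) that minimizes the total expected confirmation turns."""
    return min(partitions(list(vocab_sizes)),
               key=lambda plan: sum(group_cost(g) for g in plan))

# Large vocabularies make one big confirmation risky: here the search
# groups the two smaller attributes and confirms the largest separately.
print(best_plan([1000, 2000, 3000]))  # [[1000, 2000], [3000]]
```

Under this simplified cost the order of the groups does not change the total, so only partitions are searched; in the paper, meaning constraints can make the order matter as well.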
Probability Distribution of User Request Types Here we consider how to estimate the relevance between the belief state and each user request type.
As it is hard to obtain the actual probability distribution, we define the degree of relevance between the belief state and each user request type as an approximation.
Let $a_i$, $v_i$, and $c_i$ be the $i$-th attribute, the value of $a_i$, and the certainty of $v_i$, respectively. We define the relevance $Rel(S, R_j)$ between the belief state $S$ and the user request type $R_j$ as

$$Rel(S, R_j) = \frac{1}{N_{R_j}} \sum_{v_i \,\text{accepted by}\, R_j} \frac{c_i}{M_{v_i}}$$

where $N_{R_j}$ denotes the number of required attributes in user request type $R_j$, and $M_{v_i}$ denotes the number of user request types that accept the value $v_i$.
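The relevance measure can be sketched as follows, with a toy encoding (ours) in which acceptance is decided per attribute rather than per value:

```python
def relevance(belief, request_types):
    """Approximate Rel(S, R_j): the mean over R_j's required attributes of
    c_i / M_{v_i}, where c_i is the certainty of the value held for the
    attribute and M_{v_i} counts the request types accepting it."""
    def accepting(attr):
        return sum(1 for attrs in request_types.values() if attr in attrs)
    return {
        name: sum(belief[a][1] / accepting(a) for a in attrs if a in belief)
              / len(attrs)
        for name, attrs in request_types.items()
    }

# belief: attribute -> (value, certainty); request type -> required attributes
belief = {"day": ("Friday", 0.9), "time": ("15:00", 0.6)}
request_types = {
    "cancel": {"day", "time"},
    "inquire": {"day", "time", "person"},
}
print(relevance(belief, request_types))  # {'cancel': 0.375, 'inquire': 0.25}
```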
Choosing the Next Action Even if one user request type is highly probable, choosing the confirmation plan for it is not always best if the expected number of turns for that request is very large. In such a case, it may be better to confirm another type of request that has medium probability but is easily confirmed.
We assume that when the user request type guessed by the system is not the real user request type, the number of turns required to discover that the guess is incorrect equals the number of turns needed to finish confirming the contents when the guess is correct.
Let $p_{R_i}$ be the probability of user request type $R_i$, and $t_{R_i}$ the expected number of turns for user request type $R_i$. From the permutations of request types, our method chooses the optimal order $a(1), a(2), \ldots, a(n)$ such that the expression

$$p_{R_{a(1)}} t_{R_{a(1)}} + p_{R_{a(2)}} \left(t_{R_{a(1)}} + t_{R_{a(2)}}\right) + \cdots + p_{R_{a(n)}} \left(t_{R_{a(1)}} + \cdots + t_{R_{a(n)}}\right)$$

is minimal. Then our method chooses as the next action the action that appears first in the optimal plan for request type $R_{a(1)}$.
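This minimization over orderings can be written directly as a brute-force search over permutations (a sketch with hypothetical names; for small numbers of request types this is cheap):

```python
from itertools import permutations

def best_request_order(p, t):
    """Choose the order a(1)..a(n) of request types minimizing
    sum_k p[a(k)] * (t[a(1)] + ... + t[a(k)])."""
    def expected_total(order):
        total, prefix = 0.0, 0.0
        for name in order:
            prefix += t[name]           # turns spent through this type
            total += p[name] * prefix   # weighted by its probability
        return total
    return min(permutations(p), key=expected_total)

# A cheap, fairly likely type can come before the most likely type when
# the latter is expensive to confirm (values below are made up).
p = {"change": 0.5, "inquire": 0.4, "cancel": 0.1}
t = {"change": 8.0, "inquire": 2.0, "cancel": 3.0}
print(best_request_order(p, t))  # ('inquire', 'change', 'cancel')
```

Here "change" is the most probable type, yet confirming "inquire" first gives a lower expected total number of turns.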
3 Experiments
We evaluated the proposed method by simulation, in which the system conversed with a simulated user program. Simulation with a simulated user enables rapid prototyping and evaluation (Eckert et al., 1998). The conversation was conducted not by exchanging spoken language but by exchanging attribute-value pairs.
Simulated User Program The simulated user program works in the following steps:
1. Select a request. The request never changes throughout the dialogue.
2. Tell the system the request or a subset of the request.
3. Respond Yes or No if the system confirms.
4. Give corrections at random if a confirmation contains errors.
5. Respond to demands from the system.
6. Tell the system that there is no information if the system refers to attributes with which the user is not concerned.
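The steps above can be sketched as a small Python class; all method names and the 50% correction rate are our illustrative choices, not specified in the paper:

```python
import random

class SimulatedUser:
    """A simulated user following the six steps above."""

    def __init__(self, request):
        self.request = dict(request)   # step 1: fixed for the whole dialogue

    def initial_utterance(self, max_items=2):
        # Step 2: tell the system the request or a random subset of it.
        keys = random.sample(list(self.request),
                             k=min(max_items, len(self.request)))
        return {k: self.request[k] for k in keys}

    def answer_confirmation(self, hypothesis):
        # Step 3: "yes" only if every confirmed item is correct.
        wrong = [k for k, v in hypothesis.items() if self.request.get(k) != v]
        if not wrong:
            return ("yes", {})
        # Step 4: sometimes volunteer a correction for one wrong item.
        fixable = [k for k in wrong if k in self.request]
        if fixable and random.random() < 0.5:
            k = random.choice(fixable)
            return ("no", {k: self.request[k]})
        return ("no", {})

    def answer_demand(self, attribute):
        # Steps 5-6: give the value, or say there is no information.
        return self.request.get(attribute, "no-information")

user = SimulatedUser({"day": "Friday", "time": "15:00"})
print(user.answer_confirmation({"day": "Friday"}))  # ('yes', {})
print(user.answer_demand("person"))                 # no-information
```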
Specification of the Test Task We prepared a fictitious task for the simulation. The task accepts six types of user request. There are six attributes, two of which have a meaning dependence like that between family name and department. The numbers of persons, family names, and departments are 3000, 1000, and 300, respectively.
[Figure 1: Average number of turns to complete a dialogue, plotted against the recognition rate under a 500-word vocabulary, for our method and the naive method.]
[Figure 2: Variance of the number of turns to complete a dialogue, plotted against the recognition rate under a 500-word vocabulary, for our method and the naive method.]
Comparison with a Naive Method For comparison, we prepared a naive confirmation dialogue control method with the following specification:
1. If the user request can be fixed uniquely and there are unbound attributes required for that request, demand those attributes one by one.
2. If there are values that have not been confirmed, confirm them one by one.
3. If the user request type cannot be fixed yet, demand a value for an attribute, in descending order of the number of user request types that require that attribute.
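This naive policy can be sketched as a single decision function, using a toy encoding (ours) in which the belief maps attribute -> (value, certainty):

```python
def naive_next_action(belief, candidate_types, request_types):
    """Naive policy: `candidate_types` is the set of request types still
    possible, and `request_types` maps type -> its required attributes."""
    # 1. Request fixed uniquely: demand its unbound attributes one by one.
    if len(candidate_types) == 1:
        required = request_types[next(iter(candidate_types))]
        missing = [a for a in required if a not in belief]
        if missing:
            return ("demand", missing[0])
    # 2. Unconfirmed values: confirm them one by one.
    unconfirmed = [a for a, (_, c) in belief.items() if c < 1.0]
    if unconfirmed:
        return ("confirm", unconfirmed[0])
    # 3. Type still ambiguous: demand the attribute required by the
    #    largest number of candidate request types.
    count = {}
    for name in candidate_types:
        for a in request_types[name]:
            if a not in belief:
                count[a] = count.get(a, 0) + 1
    return ("demand", max(count, key=count.get)) if count else ("done", None)

# "day" is required by both remaining types, so it is demanded first.
types = {"cancel": {"day", "time"}, "inquire": {"day", "person"}}
print(naive_next_action({}, {"cancel", "inquire"}, types))  # ('demand', 'day')
```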
Experimental Results Figures 1 and 2 show the average number of turns and its variance over 1000 dialogues. These figures show that our method completes dialogues in fewer turns than the naive method at various levels of recognition accuracy. In addition, the variance is small over almost the entire range, which illustrates the stability of our method.
4 Conclusion
A new dialogue control method has been proposed. The method takes into consideration the expected number of turns, based on the estimated recognition accuracy and the approximate probability distribution of user requests.
With this method we do not have to write domain-specific rules manually, so the system can easily be transferred to a new domain.
We evaluated our method by simulation. The results show that it completes dialogues in fewer turns than a conventional method at various recognition accuracies.
Acknowledgements
We thank Ken'ichiro Ishii, Norihiro Hagita, and all our colleagues in the Dialogue Understanding Research Group for useful discussions.

References
Wieland Eckert, Esther Levin, and Roberto Pieraccini. 1998. Automatic evaluation of spoken dialogue systems. In TWLT13: Formal Semantics and Pragmatics of Dialogue.
Diane J. Litman, Michael S. Kearns, and Marilyn A. Walker. 2000. Automatic optimization of dialogue management. In COLING.
Seiichi Nakagawa and Masaki Ida. 1998. A new measure of task complexity for continuous speech recognition. IEICE, J81-D-II(7):1491–1500 (in Japanese).
Yasuhisa Niimi and Yutaka Kobayashi. 1996. Dialog control strategy based on the reliability of speech recognition. In International Conference on Spoken Language Processing, pages 25–30.
R. Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Computer, Speech and Language, 10:187–228.
