Automatic Optimization of Dialogue Management 
Diane J. Litman, Michael S. Kearns, Satinder Singh, Marilyn A. Walker 
AT&T Labs- Research 
180 Park Avenue 
Florham Park, NJ 07932 USA 
{diane,mkearns,bavej a,walker} @research.art.corn 
Abstract 
Designing the dialogue strategy of a spoken dialogue 
system involves many nontrivial choices. This pa- 
per I)resents a reinforcement learning approach for 
automatically optimizing a dialogue strategy that 
addresses the technical challenges in applying re- 
inforcement learning to a working dialogue system 
with hulnan users. \¥e then show that our approach 
measurably improves performance in an experimen- 
tal system. 
1 Introduction 
Recent advances in spoken  understanding 
have made it 1)ossible to develop dialogue systems 
tbr many applications. The role of the dialogue man- 
ager in such systems is to interact in a naturM w~y 
to hel 1 ) the user complete the tasks that the system 
is designed to support. Tyl)ically, an expert designs 
a dialogue manager by hand, and has to make m~ny 
nontrivial design choices that can seriously impact 
system performance. This paper applies reintbrce- 
ment learning (RL) to automatically learn design 
choices that optimize system pertbrnmnee for a cho- 
seLL pertbrmance measure (Levin et al., 2000; Walker 
et al., 1998). 
Consider the spoken dialogue system named N J- 
Fun, wlfich we built to help users find fun places 
to go in New Jersey. A sample dialogue with N J- 
Fun is shown in Figure 1, with system utterances 
labeled Si and user utterances labeled Ui. When 
NJFun greets the user, it can provide an open greet- 
ing such as utterance S1, expecting that the user 
will take the initiative. Alternatively, it can pro- 
vide a directive greeting in which the system takes 
the initiative, as in "Welcome to NJFun. Please say 
an activity name or say 'list activities' for a list of 
activities I know about." After tile user's response 
U1, NJFun must decide whether it should explicitly 
confirm its understanding, as in utterances $2 and 
$3. NJFun can also simply continue on with the di- 
alogue, as when it does not explicitly confirm that 
the user wants to find out M)out wineries. In NJFun, 
as shown in more detail below, decisions about ini- 
tiative and confirmation strategies alone result in a 
SI: Welcome to N.llSm. How may I help you? 
UI: I'd like to find mn winetasting in Lambertville 
in the morning. (ASR output: I'd like to find 
out wirterics the in the Lambertvillc in the 
mornin.q. ) 
$2: Did you say you are interested in Lambertville? 
U2: Yes. 
$3: Did you say you want to go in the morning? 
U3: Yes. 
$4: I found a winery near l~ambertville that is 
open in the morning, it is \[...\] Please give 
lne feedback by saying 'good', ~so-so', or 'bad'. 
U4: Good. 
Figure 1: An example dialogue with NJFun. 
search space of 242 potential globnl dialogue strate- 
gies. Furthermore, the performance of a dialogue 
strategy depends on many other factors, such as 
the user population, the robustness of the automatic 
speech recognizer (ASR), and task difficulty (Kamm 
et al., 1998; DanMi and Gerbino, 1995). 
In the main, previous research has treated the 
specification of the dialogue management strategy 
as an iterative design problem: several versions of a 
system are created, dialogue corpora are collected 
with human users interacting with different ver- 
sions of tile system, a number of evaluation met- 
rics are collected ibr each dialogue, and the differ- 
ent versions are statistically compared (Danieli and 
Gerbino, 1995; Sanderman et al., 1998). Due to the 
costs of experimentation, only a few global strategies 
are typically explored in any one experiment. 
However, recent work has suggested that dialogue 
strategy can be designed using tile formalism of 
Markov decision processes (MDPs) and the algo- 
rithms of RL (Biermann and Long, 1996; Levin et 
al., 2000; Walker et nl., 1998; Singh et al., 1999). 
More specifically, the MDP formalism suggests a 
method for optimizing dialogue strategies from sam- 
ple dialogue data. The main advantage of this ap- 
proach is the 1)otential tbr computing an optilnal di- 
alogue strategy within a much larger search space, 
using a relatively small nmnber of training dialogues. 
This paper presents an application of RL to the 
502 
problem of oi)timizing dialogue strategy selection in 
the NJFnn system, and exl)erimentally demonstrates 
the utility of the ~l)proach. Section 2 exl)lahls how 
we apply RL to dialogue systems, then Se('tion 3 
describes t.he NJFun system in detail. Section 4 dee 
scribes how NJFun optimizes its dialogue strategy 
from experimentally obtained dialogue data. Sec- 
tion 5 reports results from testing the learned strat- 
egy demonstrating that our al)l)roach improves task 
coml)letion rates (our chosen measure for 1)erfor- 
mance optimization). A conll)alliOll paper provides 
only an al)brevi~tted system and dialogue manager 
description, but includes additional results not pre- 
sented here (Singh et al., 2000), such as analysis es- 
tablishing the veracity of the MDP we learn, and 
comparisons of our learned strategy to strategies 
hand-picked by dialogue experts. 
2 Reinforcement Learning for 
Dialogue 
Due to space limitations, we 1)resent only a 1)rief 
overview of how di~dogue strategy optimization can 
be viewed as an llL 1)roblem; for more details, 
see Singh ctal. (\]999), Walker el; a.1. (\]998), Levin 
et al. (2000). A dialogue strategy is a mapl)ing h'om 
a set ot! states (which summarize the entire dialogue 
so far) to a set of actions (such as the system's utter- 
mines and database queries). There are nmltil)l(~ rea- 
sonable action choices in each state; tyl)ically these 
choices are made by the system designer. Our RL- 
I)ased at)l)roach is to build a system that explores 
these choices in a systematic way through experi- 
ments with rel)resentative human us(!rs. A scalar 
i)erf()rmanee llleasllre, called a rewal'd, is t h(m (;al- 
eulated for each Cxl)erimental diMogue. (We dis- 
cuss various choices for this reward measure later, 
but in our experiments only terminal dialogue states 
ha,re nonzero rewi-l,rds, slid the reward lneasul'(}s arc 
quantities directly obtMnable from the experimental 
set-up, such as user satisfaction or task coml)letion. ) 
This experimental data is used to construct an MDP 
which models the users' intera(:tion with the system. 
The l)roblem of learning the best dialogue strategy 
from data is thus reduced to computing the optimal 
policy tbr choosing actions in an MDP - that is, the 
system's goal is to take actions so as to maximize 
expected reward. The comput~ttion of the ol)timal 
policy given the MDP can be done etficiently using 
stan&trd RL algorithms. 
How do we build the desired MDP from sample 
dialogues? Following Singh et al. (1999), we can 
view a dialogue as a trajectory in the chosen state 
space determined by the system actions and user 
resl) onses: 
S1 -~al,rl '5'2 --}a~,rs 83 "-~aa,ra "'" 
Here si -%,,,.~ si+l indicates that at the ith ex- 
change, the system was in state si, executed action 
ai, received reward ri, and then the state changed 
to si+~. Dialogue sequences obtained froln training 
data can be used to empirically estimate the transi- 
tion probabilities P(.s"la', a) (denoting the probabil- 
ity of a mmsition to state s', given that the system 
was in state .s and took ;ration a), and the reward 
function R(.s, (t). The estilnated transition 1)tel)abil- 
ities and rewi~rd flmction constitute an MDP model 
of the nser population's interaction with the system. 
Given this MDP, the exl)ected cumnlative reward 
(or Q-value) Q(s, a) of taking action a from state s 
can be calculated in terms of the Q-wdues of succes- 
sor states via the following recursive equation: 
Q(.% = ,) + r(,,'l,% ,)n,a:,: Q(,,", ,'). 
t; ! 
These Q-values can be estimated to within a desired 
threshold using the standard RL value iteration al- 
gorithm (Sutton, 1.991.), which iteratively updates 
the estimate of Q(s, a) based on the current Q-vahms 
of neighboring states. Once value iteration is con> 
pleted, the optima\] diah)gue strategy (according to 
our estimated model) is obtained by selecting the 
action with the maximum Q-value at each dia.logue 
state. 
While this apl)roach is theoretically appealing, the 
cost of obtaining sample human dialogues makes it 
crucial to limit the size of the state space, to mini- 
mize data sparsity problems, while retaining enough 
information in the state to learn an accurate model. 
Our approad~ is to work directly in a minimal but 
carefully designed stat;e space (Singh et al., 1999). 
The contribution of this paper is to eml)irically 
vMi(tate a practical methodology for using IlL to 
build a dialogue system that ol)timizes its behav- 
ior from dialogue data. Our methodology involves 
1) representing a dialogue strategy as a mapl)il~g 
fronl each state in the chosen state space S to a 
set of dialogue actions, 2) deploying an initial trah> 
ing system that generates exploratory training data 
with respect to S, 3) eonstrncting an MDP model 
from the obtained training data, 4) using value iter- 
ation to learn the optimal dialogue strategy in the 
learned MDP, and 4) redeploying the system using 
the learned state/~mtion real)ping. The next section 
details the use of this methodology to design the 
NJFun system. 
3 The NJFun System 
NJFnn is a real-time spoken dialogue system that 
provides users with intbrmation about things to do 
in New Jersey. NJFun is built using a general pur- 
pose 1)latt'ornl tbr spoken dialogue systems (Levin 
et al., 1.999), with support tbr modules tbr attto- 
rustic speech recognition (ASI/.), spoken  
503 
Action Prompt 
Greets Welcome to NJIqm. Please say an activity name or say 'list activities' for a list of activities I know 
about. 
GreetU \Velcome to NdPun. How may I help you? 
ReAsklS Iknow about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, 
wineries, and zoos. Please say an activity name from this list. 
ReAsklM Please tell me the activity type.You can also tell me the location and time. 
Ask2S Please say the name of the town or city that you are interested in. 
Ask2U Please give me more information. 
ReAsk2S Please teli me the name of the town or city that you are interested in. 
ReAsk2M Please tell me the location that you are interested in. You call also tell me the time. 
Figure 2: Sample initiative strategy choices. 
understanding, text-to-speech (TTS), database ac- 
cess, and dialogue management. NJFnn uses a 
speech recognizer with stochastic  and un- 
derstanding models trained from example user ut- 
terances, and a TTS system based on concatena- 
tive diphone synthesis. Its database is populated 
from the nj. online webpage to contain information 
about activities. NJFun indexes this database using 
three attributes: activity type, location, and time of 
day (which can assume values morning, afternoon, 
or evening). 
hffornmlly, the NJFun dialogue manager sequen- 
tially queries the user regarding the activity, loca- 
tion and time attributes, respectively. NJFun first 
asks the user for the current attribute (and 1)ossibly 
the other attributes, depending on the initiative). 
If the current attribute's value is not obtained, N.J- 
Fun asks for the attrilmte (and possibly the later 
attributes) again. If NJFun still does not obtain 
a value, NJFun moves on to the next attribute(s). 
Whenever NJFun successihlly obtains a value, it 
can confirm the vMue, or move on to the next at- 
tribute(s). When NJFun has finished acquiring at- 
tributes, it queries the database (using a wildcard 
for each unobtained attribute value). The length of 
NJFun dialogues ranges from 1 to 12 user utterances 
before the database query. Although the NJFun di- 
alogues are fairly short (since NJFun only asks for 
an attribute twice), the information access part of 
the dialogue is similar to more complex tasks. 
As discussed above, our methodology for using RL 
to optimize dialogue strategy requires that all poten- 
tim actions tbr each state be specified. Note that at 
some states it is easy for a human to make the cor- 
rect action choice. We made obvious dialogue strat- 
egy choices in advance, and used learIfing only to 
optimize the difficult choices (Walker et al., 1998). 
Ill NJFun, we restricted the action choices to 1) the 
type of initiative to use when asking or reasking for 
an attribute, and 2) whether to confirm an attribute 
value once obtained. The optimal actions may vary 
with dialogue state, and are subject to active debate 
in the literature. 
Tile examples in Figure 2 show that NJFun can 
ask the user about the first 2 attributes I using three 
types of initiative, based on the combination of tile 
wording of the system prompt (open versus direc- 
tive), and the type of grammar NJFun uses during 
ASR (restrictivc versus non-restrictive). If NJFun 
uses an open question with m~ unrestricted gram- 
mar, it is using v.scr initiative (e.g., Greet\[l). If N J- 
Fun instead uses a directive prompt with a restricted 
grammar, the system is using systcm initiative (e.g., 
GreetS). If NJFun uses a directive question with a 
non-restrictive granmlar, it is using mizcd initiative, 
because it allows the user to take the initiative by 
supplying extra intbrnlation (e.g., ReAsklM). 
NJFun can also vary the strategy used to confirm 
each attribute. If NJFun asks the user to explicitly 
verify an attribute, it is using czplicit confirmation 
(e.g., ExpConf2 for the location, exemplified by $2 
in Figure 1). If NJFun does not generate any COlt- 
firnmtion prompt, it is using no confirmation (the 
NoConf action). 
Solely tbr the purposes of controlling its operation 
(as opposed to the le~trning, which we discuss in a 
moment), NJNm internally maintains an opcratio'ns 
vector of 14 variables. 2 variables track whether the 
system has greeted the user, and which attribute 
the system is currently attempting to obtain. For 
each of the 3 attributes, 4 variables track whether 
the system has obtained the attribute's value, the 
systent's confidence in the value (if obtained), the 
number of times the system has asked the user about 
the attribute, and the type of ASR grammar most 
recently used to ask for the attribute. 
The formal state space S maintained by NJFun 
for tile purposes of learning is nmch silnt)ler than 
the operations vector, due to the data sparsity con- 
cerns already discussed. The dialogue state space 
$ contains only 7 variables, as summarized in Fig- 
sire 3. S is computed from the operations vector us- 
ing a hand-designed algorithm. The "greet" variable 
1 "Greet" is equivalent to asking for the first attribute. N J- 
Fun always uses system initiative for the third attribute, be- 
cause at that point the user can only provide the time of (lay. 
504 
greet attr conf val times gram hist \] 
0,1 1,2,3,4 0,1,2,3,4 0,1 0,1,2 0,1 0,1 \] 
Figure 3: State features and vahles. 
tracks whether tile system has greeted tile user or 
not (no=0, yes=l). "Attr ~: specifies which attrihute 
NJFun is ('urrently ~tttelnpting to obtain or ver- 
ify (activity=l, location=2, time=a, done with at- 
tributes=4). "Conf" tel)resents the confidence that 
NaFun has after obtaining a wdue for an attribute. 
The values 0, 1, and 2 represent the lowest, middle 
and highest ASR confidence vahms? The wdues 3 
and 4 are set when ASR hears "yes" or "no" after a 
confirmation question. "Val" tracks whether NaFun 
has obtained a value, for tile attribute (no=0, yes=l). 
"Times" tracks the number of times that N,lFun has 
aske(1 the user ~d)out the attribute. "(4ram" tracks 
the type of grammar most ree(mtly used to obtain 
the attribute (0=non-restrictive, 1=restrictive). Fi- 
nally, "hist" (history) represents whether Nalflm had 
troullle understanding the user ill the earlier p~trt of 
the conversation (bad=0, good=l). We omit the full 
detinition, but as a,n ex~unl>le , when N.lFun is work- 
ing on the secon(1 attribute (location), the history 
variable is set to 0 if NJFun does not have an ac- 
tivity, has an activity but has no confidence in the 
value, or needed two queries to obtain the activity. 
As mentioned above, the goal is to design a small 
state space that makes enough critical distin('tions to 
suPi)ort learning. The use of 6" redu(:es the mmfl)er 
of states to only 62, and SUl)l)orts the constru('tion of 
mt MI)P model that is not sparse with respect to ,g, 
even using limite(1 trMning (btta. :~ Tit(.' state sp~t(;e 
that we utilize here, although minimal, allows us 
to make initiative decisions based on the success of 
earlier ex(:ha.nges, and confirmation de(:isions based 
on ASR. confidence scores and gralnmars. 
'.Phe state/~t('tiol: real)ping r(-`l)resenting NaFun's 
initial dialogue strategy EIC (Explor:ttory for Ini- 
tiative and Confirmation) is in Figure 4. Only the 
exploratory portion of the strategy is shown, namely 
those states for which NaFun has an action choice. 
~klr each such state, we list tile two choices of actions 
available. (The action choices in boldfime are the 
ones eventually identified as el)ritual 1)y the learning 
process, an(1 are discussed in detail later.) The EIC 
strategy chooses random, ly between these two ac- 
21"or each uttermme, the ASH. outfmt includes 11o|, only the 
recognized string, but also aIl asso(:ia.ted acoustic (:onJld(mce 
score, iBased on data obtaintM dm'ing system deveJolmmnt , 
we defined a mapl)ing from raw confidence, values into 3 ap- 
proximately equally potmlated p~rtitions. 
362 refers to those states that can actually occur in a di- 
alogue. \[<)r example, greet=0 is only possible in the initial 
dialogue state "0 1 0 0 0 0 0". Thus, all other states beginning 
with 0 (e.g. "0 I 0 0 I 0 0") will never occur. 
g 
1 0 0 (} 0 0 
1 1 0 (} 1 0 0 
1 1 0 1 0 0 0 
1 1 0 1 0 1 0 
1 1 1 1 0 0 0 
1 1 1 1 0 \] 0 
1 1 2 1 0 0 0 
1 1 2 1 0 1 0 
1 1 4 0 0 0 0 
\] 1 4 (} 1 0 0 
St~tte Action Choices 
a c v t g it 
1 2 0 0 0 0 0 
1 2 0 0 0 0 1 
1 2 (} 0 1 (} 0 
1 2 0 (} 1 0 1 
1 2 0 1 0 (} 0 
1 2 0 \] 0 0 1 
1 2 0 1 0 1 0 
1 2 0 1 0 1 1 
1 2 1 1 0 (1 0 
1 2 1 \] 0 0 1 
1 2 1 1 0 1 0 
\] 2 1 1 0 \] 1 
1 2 2 1 0 0 0 
1 2 2 \] 0 0 1 
1 2 2 \] 0 1 0 
\] 2 2 1 0 1 1 
1 2 4 0 0 (1 0 
1 2 4 0 0 0 1 
1 2 4 0 1 0 0 
1 2 4 0 1 0 1 
1 3 () 1 () () 0 
\] 3 0 1 0 0 1 
1 3 (1 1 0 1 0 
1 3 \[) 1 0 \] \] 
1. 3 1 1 0 0 0 
1 3 1 1 0 0 1 
1 3 1 1. 0 1 0 
1 3 1 1 0 \] 1 
1 3 2 1 0 0 0 
1 3 2 1 0 0 1 
1 3 2 1 0 \] 0 
\] 3 2 1 0 1 1 
GreetS,GreetU 
ReAsklS,ReAsklM 
NoConf, ExpConfl 
NoConf, ExpColffl 
NoCont, Exp Confl 
NoConf, ExpConfl 
NoConf~ExpConfl 
NoConf, ExpConfl 
ReAsklS,ReAsklM 
ReAsklS,RcAsklM 
Ask2S,Ask2U 
Ask2S,Ask2U 
I{eAsk2S,ReAsk2M 
ReAsk2S,ReAsk2M 
NoConf, ExpCont'2 
NoConf, ExpConP2 
NoConf~ExpCont)2 
NoConf,ExpConf2 
NoCoiff, Exp C oaF2 
NoConf, ExpConf2 
NoConf, ExpCont2 
NoConf,ExpCong2 
NoConf~ExpConf2 
NoConf, ExpConf2 
NoConf, ExpConf2 
NoConf, ExpCon\[2 
ReAsk2S,RcAsk2M 
I/.eAsk2S,ReAsk2M 
RcAsk2S,ReAsk2M 
ReAsk2S,ReAsk2M 
NoConf, ExpCon\[3 
NoConf, Exl) Conf3 
NoConf, ExpConf3 
NoConf,Ext)ConF3 
NoConf, ExpCont~ 
NoConf, ExpConf3 
NoConf, ExpConf3 
NoConf,ExpConF3 
NoConf, ExpConi~J 
NoConf, ExpConf3 
NoColff, ExpConf3 
NoConf, ExpConf3 
Figure 4: Exploratory portion of EIC strategy. 
tions in the indicated state, to maximize exploration 
and minimize data sparseness when constructing our 
model. Since there are 42 states with 2 choices each, 
there is n search space of 242 potential global di- 
alogue strategies; the goal of RL is to identify an 
apparently optimal strategy fl'om this large search 
space. Note that due to the randomization of the 
EIC strategy, the prompts are designed to ensure 
the coherence of all possible action sequences. 
Figure 5 illustrates how the dialogue strategy in 
Figure 4 generates the diMogue in Figure 1. Each 
row indicates the state that NJFun is in, the ac- 
505 
State Action %Irn Reward 
gacvtgh 
0100000 GreetU $1 0 
1121000 NoConf 0 
1221001 ExpConf2 $2 0 
1 3 2 1 0 0 1 ExpConf3 $3 0 
1400000 Tell $4 1 
Figure 5: Generating the dialogue in Figure 1. 
tion executed in this state, the corresponding turn 
in Figure 1, and the reward received. The initial 
state represents that NaFun will first attempt to ob- 
tain attribute 1. NJFun executes GreetU (although 
as shown in Figure 4, GreetS is also possible), gen- 
erating the first utterance in Figure 1. Alter the 
user's response, the next state represents that N J- 
Fun has now greeted the user and obtained the ac- 
tivity value with high confidence, by using a non- 
restrictive grmnmar. NJFnn then chooses the No- 
Conf strategy, so it does not attempt to confirm 
the activity, which causes the state to change but 
no prompt to be generated. The third state repre- 
sents that NJFun is now working on the second at- 
tribute (location), that it already has this vahle with 
high confidence (location was obtained with activity 
after the user's first utterance), and that the dia- 
logue history is good. 4 This time NaFun chooses the 
ExpConf2 strategy, and confirms the attribute with 
the second NJFun utterance, and the state changes 
again. The processing of time is similar to that of lo- 
cation, which leads NJFun to the final state, where it 
performs the action "Tell" (corresponding to query- 
ing the database, presenting the results to the user, 
and asking the user to provide a reward). Note that 
in NJFun, the reward is always 0 except at the ter- 
minal state, as shown in the last column of Figure 5. 
4 Experimentally Optimizing a 
Strategy 
We collected experimental dialogues for both train- 
ing and testing our system. To obtain training di- 
alogues, we implemented NJFun using the EIC dia- 
logue strategy described in Section 3. We used these 
dialogues to build an empirical MDP, and then com- 
puted the optimal dialogue strategy in this MDP (as 
described in Section 2). In this section we describe 
our experimental design and the learned dialogue 
strategy. In the next section we present results from 
testing our learned strategy and show that it im- 
proves task completion rates, the performance mea- 
sure we chose to optimize. 
Experimental subjects were employees not associ- 
a, ted with the NJFun project. There were 54 sub- 
4Recall that only the current attribute's features are ill the 
state, lIowever, the operations vector contains information 
regarding previous attributes. 
jects for training and 21 for testing. Subjects were 
distributed so tile training and testing pools were 
balanced for gender, English as a first , and 
expertise with spoken dialogue systems. 
During both training and testing, subjects carried 
out free-form conversations with NJFun to complete 
six application tasks. For examl)le , the task exe- 
cuted by the user in Figure 1 was: "You feel thirsty 
and want to do some winetasting in the morning. 
Are there any wineries (;lose by your house in Lam- 
bertville?" Subjects read task descriptions on a web 
page, then called NJFun from their office phone. 
At the end of the task, NJFun asked for feedback 
on their experience (e.g., utterance $4 in Figure 1). 
Users then hung up the phone and filled out a user 
survey (Singh et al., 2000) on the web. 
The training phase of the experiment resulted in 
311 complete dialogues (not all subjects completed 
all tasks), for which NJFun logged the sequence 
of states and the corresponding executed actions. 
The number of samples per st~tte for the initi~fl ask 
choices are: 
0 1 0 0 0 0 0 GreetS=IS5 GreetU=156 
1 2 0 0 0 0 0 Ask2S=93 Ask2U=72 
1 2 0 0 0 0 1 Ask2S=36 Ask2U=48 
Such data illustrates that the random action choice 
strategy led to a fairly balanced action distribution 
per state. Similarly, the small state space, and the 
fact that we only allowed 2 action choices per state, 
prevented a data sparseness problem. The first state 
in Figure 4, the initial state for every dialogue, was 
the most frequently visited state (with 311 visits). 
Only 8 states that occur near the end of a dialogue 
were visited less tlmn 10 times. 
The logged data was then used to construct the 
empirical MDP. As we have mentioned, the measure 
we chose to optinfize is a binary reward flmction 
based on the strongest possible measure of task com- 
pletion, called StrongComp, that takes on value 
1 if NJFun queries the database using exactly the 
attributes specified in the task description, and 0 
otherwise. Then we eoml)uted the optimal dialogue 
strategy in this MDP using RL (cf. Section 2). The 
action choices constituting the learned strategy are 
in boldface in Figure 4. Note that no choice was 
fixed for several states, inealfing that the Q-values 
were identical after value iteration. Thus, even when 
using the learned strategy, NJFun still sometimes 
chooses randomly between certain action pairs. 
Intuitively, the learned strategy says that the op- 
timal use of initiative is to begin with user initia- 
tive, then back off to either mixed or system ini- 
tiative when reasking for an attribute. Note, how- 
ever, that the specific baekoff method differs with 
attribute (e.g., system initiative for attribute 1, but 
gcnerMly mixed initiative for attribute 2). With 
respect to confirmation, the optimal strategy is to 
506 
mainly contirm at lower contidenee -values. Again, 
however, the point where contirlnation becomes un- 
necessary difl'ers across attributes (e.g., confidence 
level 2 for attribute 1, but sometimes lower levels 
for attributes 2 and 3), and also dt!txmds on other 
features of the state besides confidence (e.g., gram- 
mar and history). This use (if ASP, (:ontidence. by the 
dialogue strategy is more Sol)hisli('ated than previ- 
ous al)proaches, e.g. (Niimi and Kot)ayashi, 1996; 
Lit\]nan and Pan, 2000). N.lI,'un ('an learn such line- 
grained distinctions l}ecause the el)ritual strategy is 
based on a eonll)arisoi) of 24~ l}ossible exl}h)ratory 
strategies. Both the initiative and confirmation re- 
suits sugge.sl that the begimfing of the dialogue was 
the most problenmtie for N.lli'un. Figure I ix an ex- 
ample dialogue using the Ol)tilnal strategy. 
5 Experimentally Evaluating the 
Strategy 
For the testing i)\]tase, NJFun was reilnplemented to 
use the learned strategy. 2:t test sul)je(;Is then per- 
formed the same 6 tasks used during training, re- 
sulling in 124 complete test dialogues. ()he of our 
main resull;s is that task completion its measured by 
StrongCom 11 increased front 52cX} in training 1o 64% 
in testing (p < .06)) 
There is also a signilicant in~twaction (!II'(~c.t 
between strategy nnd task (p<.01) for Strong- 
Colnl). \]'revious work has suggested l;hat novic(~ 
users l)erform (:Oml)arably to eXl)erts after only 2 
tasks (Kamm et ill., \] 9!18). Sill('e Ollr \]oarllt}d sl.rat- 
egy was based on 6 tasks with each user, one (?xpla- 
nation of the interaction eft'cot is that the learnc.d 
strategy is slightly optimized for expert users. ~lb 
explore this hyi)othesis, we divided our corpus into 
diah)gues with "novice" (tasks \] and 2) and "ex- 
pert" (tasks 3-6) users. We fOltltd that the learned 
strategy did in fact lc'a(l to a large an(1 significant 
improvement in StrongComp tbr (;Xl)erts (EIC=.d6, 
learned==.69, 11<.001), and a non-signilieant degra- 
dation for novices (1,31C=.66, learned=.55, 11<.3). 
An apparent limitation of these results is that EIC 
may not 1)e the best baseline strategy tbr coral)arisen 
to our learned strategy. A more standard alternative 
would be comparison to the very best hand-designed 
fixed strategy. However, there is no itgreement in the 
literature, nor amongst the authors, its to what the 
1)est hand-designed strategy might have been. There 
is agreement, however, that the best strategy ix sen- 
sitive to lnally unknown and unmodeled factors: the 
aThe ('.xlmrimental design (lescribed above Colmists of 2 
factors: the within-in groul) fa(:tor sl~ntefly aim the l)etween- 
groui)s facl;or task. \,Ve 11812, ~1, l,WO-~,g~l,y D.llO.ly,qiS of variance 
(ANOVA) to comtmte wlmtlmr main and int(!raction (!flk!cts 
of strategy are statistically signitica nt (t)<.05) or indicative 
of a statistical trend (p < .101. Main effe.cts of strategy are 
task-in(lel)endent , while interaction eIt'(!cts involving strat(%y 
are task-dependent. 
~4(~aSIlIX~ 
StrongComp 
\VcakComp 
ASR 
Fecdlmck 
UserSat 
EIC 
(n=:3111 
0.52 
1.75 
2.50 
0.18 
1.3.38 
v__ l~eatned p 
(n=124) 
0.64 
2.19 .02 
2.67 .04 
0.11 .d2 
13.29 .86 
Table 1: Main ett'ects of dialogue strategy. 
user 1)olmlation, the specitics of the, task, the 1)ar- 
ticular ASR used, etc. Furthernlore, \]P, IC was (:are- 
fully designed so that the random choices it makes 
never results in tm unnatural dialogue. Finally, a 
companion paper (Singh et al., 2000) shows that the 
1)erforntanee of the learned strategy is better thall 
several "stmtdard" fixed strategies (such as always 
use system-initiative and no-confirmation). 
Although many types of measures have been used 
to evaluate dialogue systems (e.g., task success, 
dialogue quality, ettit:ieney, usability (l)anieli and 
Gerbino, 1995; Kamm et al., 11998)), we optimized 
only tbr one task success measure, StrongConll). 
Ilowever, we also examined the 1)erl'ornmnee of the 
learned strategy using other ewduation measures 
(which t)ossibly could have llo011 used its our reward 
function). WeakComp is a relaxed version of task 
comt)letion that gives partial credit: if all attribute 
values are either correct or wihh:ards, the value is the 
sum of the correct munl)er of attrilmtes. ()tlmrwise, 
at least one attribute is wrong (e.g., the user says 
"Lanfl)ertvilhf' but the system hears "Morristown"), 
and the wdue is -1. ASR is a dialogue quality lllea- 
sure that itl)l)roxinmtes Sl)eech recognition act:uracy 
for tl,e datM)ase query, a.nd is computed 1:) 3, adding 
1 for each correct attribute value altd .5 for every 
wihtca.rd. Thus, if the task ix to go winetasting 
near Lambertville in the morning, and the systenl 
queries the database for an activity in New Jersey 
in the morning, StrongComp=0, \VeakComp=l, and 
ASR=2. In addition to the objective measures dis- 
cussed a,bove, we also COmlmted two subjective us- 
ability measures. Feedback is obtained front the 
dialogue (e.g. $4 in Figure 5), by mapping good, 
so-so, bad to 1, 0, m~d -1, respectively. User satis- 
faction (UserSat, ranging front 0-20) is obtained by 
summing the answers of the web-based user survey. 
Table I summarizes the diflhrence in performance 
of NJFun tbr our original reward flmction and the 
above alternative evaluation measures, from trail> 
ing (EIC) to test (learned strategy for StrongComp). 
For WeakComp, the average reward increased from 
1.75 to 2.19 (p < 0.02), while tbr ASll the average 
reward increased from 2.5 to 2.67 (p < 0.04). Again, 
these iml)rovements occur even though the learned 
strategy was not optilnized for these measures. 
The last two rows of the table show that for the 
507 
subjective measures, i)erformmme does not signifi- 
cantly differ for the EIC and learned strategies. In- 
terestingly, the distributions of the subjective mea- 
sures move to the middle from training to testing, 
i.e., test users reply to the survey using less extreme 
answers than training users. Explaining the subjec- 
tire results is an area for future work. 
6 Discussion 
This paper presents a practical methodology for ap- 
plying RL to optimizing dialogue strategies in spo- 
ken dialogue systems, and shows empirically that the 
method improves performance over the EIC strategy 
in NJFun. A companion paper (Singh et al., 2000) 
shows that the learned strategy is not only better 
than EIC, but also better than other fixed choices 
proposed in the literature. Our results demonstrate 
that the application of RL allows one to empirically 
optimize a system's dialogue strategy by searching 
through a much larger search space than can be ex- 
plored with more traditional lnethods (i.e. empiri- 
cally testing several versions of a systent). 
RL has been appled to dialogue systems in pre- 
vious work, but our approach ditlhrs from previous 
work in several respects. Biermann and Long (1996) 
did not test RL in an implemented system, and the 
experiments of Levin et 31. (2000) utilized a simu- 
lated user model. Walker et al. (1998)'s methodol- 
ogy is similar to that used here, in testing RL with 
an imt)lelnented system with human users. However 
that work only explored strategy choices at 13 states 
in the dialogue, which conceivably could have been 
explored with more traditional methods (~ts com- 
pared to the 42 choice states explored here). 
We also note that our learned strategy made di- 
alogue decisions based on ASR confidence in con- 
junction with other features, mid alto varied initia- 
tive and confirmation decisions at a finer grain than 
previous work; as such, our learned strategy is not; 
a standard strategy investigated in the dialogue sys- 
teln literature. For example, we would not have pre- 
dicted the complex and interesting back-off strategy 
with respect to initiative when reasking for an at- 
tribute. 
To see how our method scales, we are al)plying RL 
to dialogue systems for eustolner care and tbr travel 
planning, which are more complex task-oriented do- 
mains. As fllture work, we wish to understand 
the aforementioned results on the subjective reward 
measures, explore the potential difference between 
optimizing tbr expert users and novices, automate 
the choice of state space for dialogue systems, ilwes- 
tigate the use of a learned reward function (Walker 
et al., 1998), and explore the use of more informative 
non-terminal rewards. 
Acknowledgements 
The authors thank Fan Jiang for his substantial effort 
in implenmnting NJFun, Wieland Eckert, Esther Levin, 
Roberto Pieraccini, and Mazin R.ahinl for their technical 
help, Julia Hirsehberg for her comments on a draft of this 
paper, and David McAllester, I~ichard Sutton, Esther 
Levin and Roberto Pieraccini for hell)tiff conversations. 

References 

A. W. Biermann and P. M. Long. 1996. The composition 
of messages in sl)eeeh-graphies interactive systems. In 
Proe. of the International Symposium on Spoken Dia- 
logue, pages 97 100. 

M. Danieli and E. Gerbino. 1995. Metrics for evaluating 
dialogue strategies in a spoken  system. In 
P~vc. of the AAAI Spring Symposium on Empirical 
Methods in Discourse Interpretation and Generation, 
pages 34 39. 

C. Kamm, D. Litman, and M. A. Walker. 1998. From 
novice to expert: The effect of tutorials on user exl)er- 
tise with spoken dialogue systems. In P~vc. of the In- 
ternational Conference on Spolccn Language P~vccss- 
in.q, ICSLP98. 

E. Levin, R. Pieraccini, W. Eekere, G. Di Fabbrizio, and 
S. Narayanan. 1999. Spoken  dialogue: lh'om 
theory to practice. In Pwc. IEEE Workshop on Au- 
tomatic Speech R.ecognition and Understanding, AS- 
R U U99. 

E. Levin, R. Pieraccini, and W. Eckert. 2000. A stochas- 
tic model of human machine interaction for learning 
dialog strategies. IEEE TTnnsactions on Speech and 
Audio Processing, 8(1):11-23. 

D. J. Litman and S. Pan. 2000. Predicting and adapting 
to poor Sl)eech recognition in a spoken dialogue sys- 
tern. In Proc. of the Scv('ntccnth National Confl:rcncc 
on Artificial Intclligcncc, AAAI-2000. 

Y. Niimi and Y. Kobayashi. 1996. A dialog control strat- 
egy based on the reliability of speech recognition. In 
Proc. of the International Symposium on Spoken Dia- 
loguc, pages 157--160. 

A. Sanderman, J. Sturm, E. den Os, L. Boves, and 
A. Cremers. 1998. Evaluation of the dutchtrain 
timetable inibrmation system developed in the arise 
project. In Interactive Voice Technology for Tclccom- 
munications Applications, IVT2'A, pages 91-96. 

S. Singh, M. S. Kearns, D. J. Litman, and M. A. \Valker. 
1999. Reinforcement learning for spoken dialogue sys- 
tems. In Proc. NIPS99. 

S. B. Singh, M. S. Kearns, D. J. Litman, and 
M. A. Walker. 2000. Empirical evaluation of a rein- 
forccment learning spoken dialogue system. In Proc. 
of thc Scvcntccnth National Conference on Artificial 
Intelligence, AAAI-2000. 

R. S. Sutton. 1991. Plamfing by incremental dynamic 
programming. In Proc. Ninth Confcwztcc on Machine 
Learning, pages 353-357. 

M. A. Walker, J. C. Promer, and S. Narayanan. 1998. 
Learning optimal dialogue strategies: A ease study of 
a Sl)oken dialogue agent tbr email. In P~vc. of the 36th 
Annual Meeting of the Association of Computational 
Linguistics, COLING//ACL 98, pages 1345 1352. 
