Learning Optimal Dialogue Strategies: 
A Case Study of a Spoken Dialogue Agent for Email 
Marilyn A. Walker 
walker@research.att.com
ATT Labs Research 
180 Park Ave. 
Florham Park, NJ 07932 
Jeanne C. Fromer 
jeannie@ai.mit.edu 
MIT AI Lab 
545 Technology Square 
Cambridge, MA, 02139 
Shrikanth Narayanan 
shri@research.att.com
ATT Labs Research 
180 Park Ave. 
Florham Park, NJ 07932 
Abstract 
This paper describes a novel method by which a dia- 
logue agent can learn to choose an optimal dialogue 
strategy. While it is widely agreed that dialogue 
strategies should be formulated in terms of com- 
municative intentions, there has been little work on 
automatically optimizing an agent's choices when 
there are multiple ways to realize a communica- 
tive intention. Our method is based on a combina- 
tion of learning algorithms and empirical evaluation 
techniques. The learning component of our method 
is based on algorithms for reinforcement learning, 
such as dynamic programming and Q-learning. The 
empirical component uses the PARADISE evalua- 
tion framework (Walker et al., 1997) to identify the 
important performance factors and to provide the 
performance function needed by the learning algo- 
rithm. We illustrate our method with a dialogue 
agent named ELVIS (EmaiL Voice Interactive Sys- 
tem), that supports access to email over the phone. 
We show how ELVIS can learn to choose among 
alternate strategies for agent initiative, for reading 
messages, and for summarizing email folders. 
1 Introduction 
This paper describes a novel method by which a dia- 
logue agent can learn to choose an optimal dialogue 
strategy. The main problem for dialogue agents 
is deciding what information to communicate to a 
hearer and how and when to communicate it. For 
example, consider one of the strategy choices faced 
by a spoken dialogue agent that accesses email by 
phone. When multiple messages match the user's 
query, e.g. Read my messages from Kim, an email 
agent must choose among multiple response strate- 
gies. The agent might choose the Read-First strat-
egy in D1:
(D1) A: In the messages from Kim, there's 1 message
about "Interviewing Antonio" and 1 message
about "Meeting Today." The first message is
titled, "Interviewing Antonio." It says, "I'd like
to interview him. I could also go along to lunch.
Kim."
D1 involves summarizing all the messages from 
Kim, and then taking the initiative to read the first 
one. Alternate strategies are the Read-Summarize-
Only strategy in D2, where the agent provides infor-
mation that allows users to refine their selection cri- 
teria, and the Read-Choice-Prompt strategy in D3, 
where the agent explicitly tells the user what to say 
in order to refine the selection: 
(D2) A: In the messages from Kim, there's 1 message
about "Interviewing Antonio" and 1 message about
"Meeting Today."
(D3) A: In the messages from Kim, there's 1 message
about "Interviewing Antonio" and 1 message about
"Meeting Today." To hear the messages, say, "In-
terviewing Antonio" or "Meeting."
Decision theoretic planning can be applied to the 
problem of choosing among strategies, by associ- 
ating a utility U with each strategy (action) choice 
and by positing that agents should adhere to the 
Maximum Expected Utility Principle (Keeney and 
Raiffa, 1976; Russell and Norvig, 1995), 
Maximum Expected Utility Principle: 
An optimal action is one that maximizes 
the expected utility of outcome states. 
An agent acts optimally by choosing a strategy 
a in state Si that maximizes U(Si). But how are 
the utility values U(Si) for each dialogue state Si 
derived? 
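As a minimal sketch of the Maximum Expected Utility Principle, the Python fragment below picks the strategy whose outcome distribution has the highest expected utility. The outcome states, probabilities and utility values are invented for illustration and are not taken from the ELVIS data; the function names are hypothetical.

```python
def expected_utility(outcome_probs, utilities):
    """Sum of P(outcome) * U(outcome) over the possible outcome states."""
    return sum(p * utilities[state] for state, p in outcome_probs.items())


def best_strategy(strategies, utilities):
    """Return the strategy whose outcome distribution maximizes expected utility."""
    return max(strategies, key=lambda a: expected_utility(strategies[a], utilities))


# Hypothetical numbers: two outcome states and two candidate strategies.
utilities = {"task_done": 1.0, "task_failed": -1.0}
strategies = {
    "Read-First": {"task_done": 0.8, "task_failed": 0.2},
    "Read-Choice-Prompt": {"task_done": 0.6, "task_failed": 0.4},
}
choice = best_strategy(strategies, utilities)
```

With these made-up numbers the first strategy is selected; the open question raised in the text is where realistic utility values come from.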
Several reinforcement learning algorithms based 
on dynamic programming specify a way to calcu- 
late U(Si) in terms of the utility of a successor state 
Sj (Bellman, 1957; Watkins, 1989; Sutton, 1991; 
Barto et al., 1995). Thus if we know the utility for 
the final state of the dialogue, we can calculate the 
utilities for all the earlier states. However, until re-
cently there has been no way of determining a per-
formance function for assigning a utility to the final
state of a dialogue. 
This paper presents a method based on dynamic 
programming by which dialogue agents can learn 
to optimize their choice of dialogue strategies. We 
draw on the recently proposed PARADISE evalua- 
tion framework (Walker et al., 1997) to identify the 
important performance factors and to provide a per- 
formance function for calculating the utility of the 
final state of a dialogue. We illustrate our method 
with a dialogue agent named ELVIS (EmaiL Voice 
Interactive System), that supports access to email 
over the phone. We test alternate strategies for agent 
initiative, for reading messages, and for summariz- 
ing email folders. We report results from modeling 
a corpus of 232 spoken dialogues in which ELVIS 
conversed with human users to carry out a set of 
email tasks. 
2 Method for Learning to Optimize 
Dialogue Strategy Selection 
Our method for learning to optimize dialogue strat- 
egy selection combines the application of PAR- 
ADISE to empirical data (Walker et al., 1997), with 
algorithms for learning optimal strategy choices. 
PARADISE provides an empirical method for de- 
riving a performance function that calculates over- 
all agent performance as a linear combination of a 
number of simpler metrics. Our learning method 
consists of the following sequence of steps: 
• Implement a spoken dialogue agent for a particular 
domain. 
• Implement multiple dialogue strategies and design 
the agent so that strategies are selected randomly or 
under experimenter control. 
• Define a set of dialogue tasks for the domain, and 
their information exchange requirements. Repre- 
sent these tasks as attribute-value matrices to facil- 
itate calculating task success. 
• Collect experimental dialogues in which a number 
of human users converse with the agent to do the 
tasks. 
• For each experimental dialogue: 
- Log the history of the state-strategy choices 
for each dialogue. Use this to estimate a state 
transition model. 
- Log a range of quantitative and qualitative 
cost measures for each dialogue, either auto- 
matically or with hand-tagging. 
- Collect user satisfaction reports for each dia- 
logue. 
• Use multivariate linear regression with user satis- 
faction as the dependent variable and task success 
and the cost measures as independent variables to 
determine a performance equation. 
• Apply the derived performance equation to each di- 
alogue to determine the utility of the final state of 
the dialogue. 
• Use reinforcement learning to propagate the utility 
of the final state back to states Si where strategy 
choices were made to determine which action max- 
imizes U(Si). 
These steps consist of those for deriving a perfor- 
mance function (Section 3), and for using the de- 
rived performance function as feedback to the agent 
with a learning algorithm (Section 4). 
3 Using PARADISE to Derive a 
Performance Function 
3.1 ELVIS Spoken Dialogue System 
ELVIS is implemented using a general-purpose 
platform for spoken dialogue agents (Kamm et al., 
1997). The platform consists of a speech recognizer 
that supports barge-in so that the user can interrupt 
the agent when it is speaking. It also provides an 
audio server for both voice recordings and text-to-
speech (TTS), an interface between the computer
running ELVIS and the telephone network, a mod- 
ule for application specific functions, and modules 
for specifying the application grammars and the dia- 
logue manager. Our experiments are based on mod- 
ifications to the dialogue manager as described be- 
low. 
The dialogue manager is based on a state ma- 
chine. Each state specifies transitions to other states 
and the conditions that license these transitions, 
as well as a grammar for what the user can say. 
State definitions also include the specification of 
agent prompts in terms of templates, with variables 
that are instantiated each time the state is entered. 
Prompts include: (1) an initial prompt, which the 
agent says upon entering the state (this may include 
a response to the user's current request); (2) a help 
prompt which the agent says if the user says help; 
(3) multiple rejection prompts which the agent says 
if the speech recognizer confidence is too low to 
continue without more user input; (4) multiple time- 
out prompts which the agent produces if the user 
doesn't say anything. 
Each of these specifications is affected by the
agent's dialogue strategy. An agent's dialogue strat-
egy is implemented as a combination of the prompts
that are played to the user and the state transitions 
that the agent makes in response to the user's utter-
ance. In particular, alternative prompts can be spec-
ified for all types of prompts (initial, help, rejection
and timeout) to provide alternate dialogue strategies 
in each state. We implemented alternate strategies 
for reading messages, for initiative and for summa- 
rization. 
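The state definitions described above might be represented roughly as follows. This is only a sketch under stated assumptions: the class name, field layout and state names are hypothetical, the prompt templates are shortened paraphrases of D4 and D5, and the actual platform (Kamm et al., 1997) is not reproduced here.

```python
import random
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """One dialogue-manager state: prompt templates per strategy plus transitions."""
    name: str
    # strategy name -> prompt type ("initial", "help", ...) -> list of templates
    prompts: dict
    # recognized user input -> name of the next state
    transitions: dict = field(default_factory=dict)

    def prompt(self, strategy, kind, **variables):
        """Instantiate the first template of the given kind for this strategy."""
        template = self.prompts[strategy][kind][0]
        return template.format(**variables)

    def next_state(self, user_input):
        """Follow a licensed transition, or stay in this state if none matches."""
        return self.transitions.get(user_input, self.name)


def choose_strategy(state, rng=random):
    """Select a strategy at random, as when exploring during data collection."""
    return rng.choice(sorted(state.prompts))


# Hypothetical top-level state with one initial prompt per initiative strategy.
top = DialogueState(
    name="top",
    prompts={
        "SI": {"initial": ["Hi, Elvis here. You have {new} new messages. Say Read or Summarize."]},
        "MI": {"initial": ["Hi, Elvis here. I've got your mail."]},
    },
    transitions={"read": "select_messages", "summarize": "summarize_folder"},
)
```

Keeping the strategy as a key into the prompt table is one way to let the same state machine run under random or experimenter-controlled strategy selection.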
The alternate strategies for reading messages 
are Read-First, Read-Summarize-Only and Read- 
Choice-Prompt as in dialogues D1, D2, and D3. 
The alternate strategies for initiative are a system- 
initiative (SI) strategy and a mixed-initiative (MI) 
strategy. In the SI strategy, the system maintains 
control of the dialogue and the user must specify 
one query parameter at a time, as in D4. 
(D4) A: Hi, Elvis here. You have 5 new and 0 unread 
messages in your inbox. Say Read, Summarize, or 
say Help for more options. 
U: Read. 
A: Select by Sender, Subject or in order of arrival? 
U: Sender.
A: Which Sender? 
U: Kim. 
A: In the messages from Kim, there's 1 message 
about "Interviewing Antonio". ...... 
In the MI strategy, the system never tells the user 
what to do, and the user can take control of the dia- 
logue and specify exactly what they want to do in a 
single utterance, as in D5. 
(D5) A: Hi, Elvis here. I've got your mail.
U: Read me my messages from Kim. 
We also implemented three alternate summariza- 
tion strategies as responses to a user request to Sum- 
marize my messages. The basis for these alternates 
is ELVIS's decision about which message attributes 
to mention in the summary. The Summarize-Choice 
strategy in D6 asks the user to specify which of the 
relevant attributes to summarize by. 
(D6) A: Summarize by subject, by sender, or both? 
U: Subject. 
A: In your top level inbox, there's 1 message about
"Lunch," 1 message about "Interviewing Antonio,"
1 message about "Call Me Tomorrow," 1 message
about "Evaluation Group Meeting," and 1 message
about "Agent Personality."
The Summarize-Both strategy in D7 uses all at- 
tributes in the summary: 
(D7) A: In your top level inbox, from Kim, there's 1
message about "Lunch." From Michael, there's 1
message about "Evaluation group meeting." From
Noah, there's 1 message about "Call Me Tomor-
row" and 1 message about "Interviewing Antonio."
And from Owen, there's 1 message about "Agent
Personality."
In the Summarize-System strategy in D8, ELVIS 
summarizes by subject or by sender based on the 
current context. For instance, if the user is in the top 
level inbox, ELVIS will summarize by sender, but if 
the user is situated in a folder containing messages 
from Noah, ELVIS will summarize by subject, as a 
summary by sender would provide no new informa- 
tion. 
(D8) A: In your top level inbox, there's 1 message 
from Kim, 2 messages from Noah, 1 message from 
Michael, and 1 message from Owen. 
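The context-dependent choice made by Summarize-System can be sketched as below. The function names and the message representation are assumptions made for illustration; the inbox contents are taken from D7, and the counts match the summary in D8.

```python
def summarize_system_attribute(folder_sender=None):
    """Pick the attribute to summarize by, given the current folder context.

    folder_sender is None for the top level inbox, or the single sender whose
    messages the current folder contains.
    """
    return "sender" if folder_sender is None else "subject"


def summarize(messages, attribute):
    """Count messages per value of the chosen attribute, in first-seen order."""
    counts = {}
    for message in messages:
        counts[message[attribute]] = counts.get(message[attribute], 0) + 1
    return counts


inbox = [
    {"sender": "Kim", "subject": "Lunch"},
    {"sender": "Noah", "subject": "Call Me Tomorrow"},
    {"sender": "Noah", "subject": "Interviewing Antonio"},
    {"sender": "Michael", "subject": "Evaluation Group Meeting"},
    {"sender": "Owen", "subject": "Agent Personality"},
]
by_sender = summarize(inbox, summarize_system_attribute())
```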
Transitions between states are driven by the
user's conversational behavior, such as whether s/he
says anything and what s/he says, the semantic in-
terpretation of the user's utterances, and the settings
of the agent's dialogue strategy parameters.
3.2 Experimental Design 
Experimental dialogues were collected via two ex- 
periments in which users (AT&T summer interns 
and MIT graduate students) interacted with ELVIS 
to complete three representative application tasks 
that required them to access email messages in three 
different email inboxes. In the second experiment, 
users participated in a tutorial dialogue before do- 
ing the three tasks. The first experiment varied ini- 
tiative strategies and the second experiment varied 
the presentation strategies for reading messages and 
summarizing folders. In order to have adequate data 
for learning, the agent must explore the space of 
strategy combinations and collect enough samples 
of each combination. In the second experiment, we 
parameterized the agent so that each user interacted 
with three different versions of ELVIS, one for each 
task. These experiments resulted in a corpus of 108 
dialogues testing the initiative strategies, and a cor- 
pus of 124 dialogues testing the presentation strate- 
gies. 
Each of the three tasks was performed in se-
quence, and each task consisted of two scenarios.
Following PARADISE, the agent and the user had 
to exchange information about criteria for selecting 
messages and information within the message body 
in each scenario. Scenario 1.1 is typical. 
• 1.1: You are working at home in the morning and 
plan to go directly to a meeting when you go into 
work. Kim said she would send you a message 
telling you where and when the meeting is. Find 
out the Meeting Time and the Meeting Place. 
Scenario 1.1 is represented in terms of the at- 
tribute value matrix (AVM) in Table 1. Successful 
completion of a scenario requires that all attribute- 
values must be exchanged (Walker et al., 1997). The 
AVM representation for all six scenarios is similar 
to Table 1, and is independent of ELVIS's dialogue 
strategy. 
attribute            actual value
Selection Criteria   Kim ∨ Meeting
Email.att1           10:30
Email.att2           2D516
Table 1: Attribute value matrix instantiation, Key
for Email Scenario 1.1
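As a rough sketch of how an AVM key supports checking scenario completion, the Python below compares a user's reported values with the key. The dictionary layout and function names are assumptions; only the two information attributes of Table 1 are modeled, and the Selection Criteria row is left out.

```python
# Key for Scenario 1.1, with the information attributes copied from Table 1.
# The Selection Criteria row is left out of this sketch.
AVM_KEY = {"Email.att1": "10:30", "Email.att2": "2D516"}


def matched_attributes(reported, key=AVM_KEY):
    """Return the attributes whose reported value equals the key value."""
    return [name for name, value in key.items() if reported.get(name) == value]


def scenario_completed(reported, key=AVM_KEY):
    """A scenario succeeds only if every attribute value was exchanged."""
    return len(matched_attributes(reported, key)) == len(key)


# Hypothetical user reports: one with the meeting place missing, one complete.
partial = {"Email.att1": "10:30"}
complete = {"Email.att1": "10:30", "Email.att2": "2D516"}
```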
3.3 Data Collection 
Three different methods are used to collect the mea- 
sures for applying the PARADISE framework and 
the data for learning: (1) All of the dialogues are 
recorded; (2) The dialogue manager logs the agent's 
dialogue behavior and a number of other measures 
discussed below; (3) Users fill out web page forms 
after each task (task success and user satisfaction 
measures). Measures are in boldface below. 
The dialogue recordings are used to transcribe the 
user's utterances to derive performance measures 
for speech recognition, to check the timing of the in- 
teraction, to check whether users barged in on agent 
utterances (Barge In), and to calculate the elapsed 
time of the interaction (ET). 
For each state, the system logs which dialogue 
strategy the agent selects. In addition, the num- 
ber of timeout prompts (Timeout Prompts), Rec- 
ognizer Rejections, and the times the user said 
Help (Help Requests) are logged. The number of 
System Turns and the number of User Turns are 
calculated on the basis of this data. In addition, 
the recognition result for the user's utterance is ex- 
tracted from the recognizer and logged. The tran- 
scriptions are used in combination with the logged 
recognition result to calculate a concept accuracy 
measure for each utterance.¹ Mean concept accu-
racy is then calculated over the whole dialogue and
used as a Mean Recognition Score MRS for the di-
alogue.
¹ For example, the utterance Read my messages from Kim
contains two concepts, the read function, and the sender:kim
selection criterion. If the system understood only that the user
said Read, concept accuracy would be .5.
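The concept accuracy calculation can be illustrated with a short sketch that reproduces the "Read my messages from Kim" example. The representation of concepts as strings in a set and the function names are assumptions, as is the treatment of utterances without concepts.

```python
def concept_accuracy(reference, recognized):
    """Fraction of the reference concepts that the system understood."""
    if not reference:
        # Assumption: an utterance with no concepts counts as fully understood.
        return 1.0
    return len(set(reference) & set(recognized)) / len(set(reference))


def mean_recognition_score(utterances):
    """Mean concept accuracy over the (reference, recognized) pairs of a dialogue."""
    scores = [concept_accuracy(ref, rec) for ref, rec in utterances]
    return sum(scores) / len(scores)


# "Read my messages from Kim" carries two concepts; only "read" was understood.
reference = {"function:read", "sender:kim"}
recognized = {"function:read"}
score = concept_accuracy(reference, recognized)
```

Here `score` is .5; averaging such scores over all user utterances of a dialogue gives the MRS for that dialogue.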
The web page forms are the basis for calculat- 
ing Task Success and User Satisfaction measures. 
Users reported their perceptions as to whether they 
had completed the task (Comp),² and filled in an
AVM with the information that they had acquired 
from the agent, e.g. the values for Email.att1 and
Email.att2 in Table 1. The AVM matrix supports 
calculating Task Success objectively by using the 
Kappa statistic to compare the information in the 
AVM that the users filled in with an AVM key such 
as that in Table 1 (Walker et al., 1997). 
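A minimal sketch of a Kappa computation from a confusion matrix is given below. It assumes that rows are key values and columns are reported values, and that chance agreement is estimated from squared column proportions; that estimate is one common convention, and (Walker et al., 1997) should be consulted for the exact definition used in PARADISE. The example matrices are hypothetical.

```python
def kappa(confusion):
    """Kappa from a square confusion matrix (rows: key values, columns: reported)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # P(A): proportion of cases where the reported value matches the key.
    p_agree = sum(confusion[i][i] for i in range(size)) / total
    # P(E): chance agreement, estimated here from squared column proportions.
    column_sums = [sum(row[j] for row in confusion) for j in range(size)]
    p_chance = sum((t / total) ** 2 for t in column_sums)
    return (p_agree - p_chance) / (1 - p_chance)


# Hypothetical matrices: perfect agreement, and agreement in 8 of 10 cases.
perfect = kappa([[5, 0], [0, 5]])
mixed = kappa([[4, 1], [1, 4]])
```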
In order to calculate User Satisfaction, users were 
asked to evaluate the agent's performance with a 
user satisfaction survey. The data from the survey 
resulted in user satisfaction values that range from 0 
to 33. See (Walker et al., 1998) for more details. 
3.4 Deriving a Performance Function 
Overall, the results showed that users could success- 
fully complete the tasks with all versions of ELVIS. 
Most users completed each task in about 5 minutes
and the average κ (Kappa) over all subjects and tasks was .82.
However, there were differences between strategies; 
as an example see Table 2. 
Measure              SYSTEM (SI)   MIXED (MI)
Kappa                .81           .83
Comp                 .83           .78
User Turns           25.94         17.59
System Turns         28.18         21.74
Elapsed Time (ET)    328.59 s      289.43 s
MeanRecog (MRS)      .88           .72
Time Outs            2.24          4.15
Help Requests        .70           .94
Barge Ins            5.2           3.5
Recognizer Rejects   .98           1.67
User Satisfaction    26.6          23.7
Table 2: Performance measure means per dialogue
for Initiative Strategies
PARADISE provides a way to calculate dialogue 
agent performance as a linear combination of a 
number of simpler metrics that can be directly mea- 
sured such as those in Table 2. Performance for any 
(sub)dialogue D is defined by the following equa- 
tion: 
\[ \mathrm{Performance} = (\alpha \cdot \mathcal{N}(\kappa)) - \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i) \]
² Yes, No responses are converted to 1, 0.
where α is a weight on κ, ci are the cost functions,
which are weighted by wi, and N is a Z score nor-
malization function (Walker et al., 1997; Cohen,
1995). The Z score normalization function ensures
that, when the weights α and wi are solved for, the
magnitudes of the weights reflect the magnitude
of the contribution of each factor to performance.
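The following sketch shows one way to compute Z score normalized measures and combine them into a per-dialogue performance value. The function names, the use of the population standard deviation, and all numbers are assumptions for illustration; the weights are placeholders rather than fitted values.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Normalize values to zero mean and unit standard deviation.

    Assumes the values are not all identical (nonzero standard deviation).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def performance(kappas, costs, alpha, weights):
    """Per-dialogue performance: alpha * N(kappa) - sum_i w_i * N(c_i)."""
    norm_kappa = z_scores(kappas)
    norm_costs = {name: z_scores(vals) for name, vals in costs.items()}
    return [
        alpha * norm_kappa[d] - sum(weights[name] * norm_costs[name][d] for name in costs)
        for d in range(len(kappas))
    ]


# Hypothetical measures for three dialogues, with placeholder weights.
scores = performance(
    kappas=[0.6, 0.8, 1.0],
    costs={"elapsed_time": [300.0, 330.0, 270.0]},
    alpha=0.5,
    weights={"elapsed_time": 0.2},
)
```

Because every measure is normalized before weighting, the third dialogue (highest Kappa, shortest elapsed time) receives the highest score regardless of the original units.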
The performance function is derived through mul- 
tivariate linear regression with User Satisfaction as 
the dependent variable and all the other measures as 
independent variables (Walker et al., 1997). See Ta- 
ble 2. In the ELVIS data, an initial regression over 
the measures in Table 2 suggests that Comp, MRS 
and ET are the only significant contributors to User 
Satisfaction. A second regression including only 
these factors results in the following equation: 
\[ \mathrm{Performance} = .21 \cdot \mathrm{Comp} + .47 \cdot \mathrm{MRS} - .15 \cdot \mathrm{ET} \]
with Comp (t = 2.58, p = .01), MRS (t = 5.75, p
= .0001) and ET (t = -1.8, p = .07) significant predic-
tors, accounting for 38% of the variance in R-
Squared (F(3,104) = 21.2, p < .0001). The mag-
nitude of the coefficients in this equation demon- 
strates the performance of the speech recognizer 
(MRS) is the most important predictor, followed by 
users' perception of Task Success (Comp) and ef- 
ficiency (ET). In the next section, we show how to 
use this derived performance equation to compute 
the utility of the final state of the dialogue. 
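As a worked example of the derived equation, the sketch below applies the reported coefficients to one dialogue. The inputs are assumed to be already Z score normalized, and the input values and function name are hypothetical.

```python
def elvis_performance(comp_z, mrs_z, et_z):
    """Derived equation: .21 * Comp + .47 * MRS - .15 * ET (inputs Z-normalized)."""
    return 0.21 * comp_z + 0.47 * mrs_z - 0.15 * et_z


# Hypothetical dialogue: Comp half a standard deviation above the mean,
# recognition one standard deviation above it, elapsed time half a standard
# deviation above it.
p = elvis_performance(comp_z=0.5, mrs_z=1.0, et_z=0.5)
```

For these inputs the value is .105 + .47 - .075 = .5, which would serve as the utility of the final state of that dialogue.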
4 Applying Q-learning to ELVIS 
Experimental Data 
The basic idea is to apply the performance func- 
tion to the measures logged for each dialogue Di, 
thereby replacing a range of measures with a single 
performance value Pi. Given the performance val-
ues Pi, any of a number of automatic learning algo-
rithms can be used to determine which sequence of
action choices (dialogue strategies) maximizes util-
ity, by using Pi as the utility for the final state of the
dialogue Di. Possible algorithms include Genetic
Algorithms, Q-learning, TD-Learning, and Adap-
tive Dynamic Programming (Russell and Norvig,
1995). Here we use Q-learning to illustrate the 
method (Watkins, 1989). See (Fromer, 1998) for 
experiments using alternative algorithms. 
The utility of doing action a in state Si, U(a, Si)
(its Q-value), can be calculated in terms of the utility
of a successor state Sj, by obeying the following
recursive equation:
\[ U(a, S_i) = R(S_i) + \sum_{j} M^{a}_{ij} \max_{a'} U(a', S_j) \]
where R(Si) is a reward associated with being in
state Si, a is a strategy from a finite set of strate-
gies A that are admissible in state Si, and M^a_ij is
the probability of reaching state Sj if strategy a is
selected in state Si.
In the experiments reported here, the reward asso-
ciated with each state, R(Si), is zero.³ In addition,
since reliable a priori prediction of a user action in a
particular state is not possible (for example the user
may say Help or the speech recognizer may fail to
understand the user), the state transition model M^a_ij
is estimated from the logged state-strategy history
for the dialogue.
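One simple way to estimate a transition model from logs is by relative frequency, sketched below. The log format, state names and function name are assumptions, and relative-frequency counting is one plausible estimator rather than a description of the exact procedure used for ELVIS.

```python
from collections import Counter, defaultdict


def estimate_transition_model(history):
    """Estimate P(next_state | state, strategy) from logged triples.

    history is a list of (state, strategy, next_state) observations.
    """
    counts = defaultdict(Counter)
    for state, strategy, next_state in history:
        counts[(state, strategy)][next_state] += 1
    model = {}
    for key, successors in counts.items():
        total = sum(successors.values())
        model[key] = {s: n / total for s, n in successors.items()}
    return model


# Hypothetical log entries; state names are illustrative only.
log = [
    ("read", "Read-First", "done"),
    ("read", "Read-First", "done"),
    ("read", "Read-First", "help"),
    ("read", "Read-Choice-Prompt", "done"),
]
model = estimate_transition_model(log)
```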
The utility values can be estimated to within a de- 
sired threshold using Value Iteration, which updates 
the estimate of U(a, Si), based on updated utility 
estimates for neighboring states, so that the equa- 
tion above becomes: 
\[ U_{n+1}(a, S_i) = R(S_i) + \sum_{j} M^{a}_{ij} \max_{a'} U_n(a', S_j) \]
where Un(a, Si) is the utility estimate for doing 
a in state Si after n iterations. Value Iteration 
stops when the difference between Un(a, Si) and 
Un+l (a, Si) is below a threshold, and utility values 
have been associated with states where strategy se-
lections were made. After experimenting with var-
ious thresholds, we used a threshold of 5% of the
performance range of the dialogues.
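The update above can be sketched in Python as follows. This is an illustrative implementation under several assumptions: the transition model is a dictionary from (state, strategy) to successor probabilities, final states carry a fixed utility (the performance value), rewards default to zero, all estimates are updated synchronously, and the stopping threshold is an absolute value rather than 5% of the performance range. The states and numbers in the example are hypothetical.

```python
def value_iteration(model, final_utility, reward=None, threshold=1e-6):
    """Iterate U(a, Si) = R(Si) + sum_j M[a][i][j] * max_a' U(a', Sj).

    model maps (state, strategy) to {next_state: probability};
    final_utility maps each final state to its performance value.
    """
    reward = reward or {}
    q = {key: 0.0 for key in model}

    def state_value(state):
        # Final states keep their fixed utility; other states take the best Q-value.
        if state in final_utility:
            return final_utility[state]
        return max((v for (s, _), v in q.items() if s == state), default=0.0)

    while True:
        updated = {
            (s, a): reward.get(s, 0.0) + sum(p * state_value(nxt) for nxt, p in succ.items())
            for (s, a), succ in model.items()
        }
        change = max(abs(updated[k] - q[k]) for k in q)
        q = updated
        if change < threshold:
            return q


# Hypothetical two-step dialogue: a greeting state, then an initiative choice.
model = {
    ("greet", "any"): {"start": 1.0},
    ("start", "SI"): {"good": 0.8, "bad": 0.2},
    ("start", "MI"): {"good": 0.5, "bad": 0.5},
}
final_utility = {"good": 1.0, "bad": -1.0}
q = value_iteration(model, final_utility)
```

In this toy model the utility of the final states propagates back first to the initiative choice and then to the greeting state, which takes the value of the better strategy.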
The result of applying Q-learning to ELVIS data 
for the initiative strategies is illustrated in Figure 1. 
The figure plots utility estimates for SI and MI over 
time. It is clear that the SI strategy is better because 
it has a higher utility: at the end of 108 training 
sessions (dialogues), the utility of SI is estimated 
at .249 and the utility of MI is estimated at -0.174. 
TYPE        STRATEGY              UTILITY
Read        Read-First            .21
            Read-Choice-Prompt    .07
            Read-Summarize-Only   .08
Summarize   Summarize-System      .162
            Summarize-Choice      -0.03
            Summarize-Both        .09
Table 3: Utilities for Presentation Strategy Choices
after 124 Training Sessions
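Given learned utilities, the implied policy is to pick the highest-utility strategy of each type. The sketch below does this for the values in Table 3; the dictionary layout and function name are assumptions.

```python
# Utilities copied from Table 3 (after 124 training sessions).
TABLE_3 = {
    "Read": {
        "Read-First": 0.21,
        "Read-Choice-Prompt": 0.07,
        "Read-Summarize-Only": 0.08,
    },
    "Summarize": {
        "Summarize-System": 0.162,
        "Summarize-Choice": -0.03,
        "Summarize-Both": 0.09,
    },
}


def greedy_policy(utilities):
    """Map each strategy type to the strategy with the highest utility."""
    return {kind: max(options, key=options.get) for kind, options in utilities.items()}


policy = greedy_policy(TABLE_3)
```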
[Figure 1 plot: "Utilities for SI and MI over Training Sessions"; horizontal axis: Training Instances (Dialogues)]
Figure 1: Results of applying Q-learning to System-
Initiative (SI) and Mixed-Initiative (MI) Strategies
for 108 ELVIS Dialogues

³ See (Fromer, 1998) for experiments in which local rewards
are nonzero.

The SI and MI strategies affect the whole dia-
logue; the presentation strategies apply locally and
can be activated in different states of the dialogue. We
examined the variation in a strategy's utility at each
phase of the task, by representing the task as having
three phases: no scenarios completed, one scenario
completed and both scenarios completed. Table 3
reports utilities for the use of a strategy after one
scenario was completed. The policy implied by the
utilities at other phases of the task is the same. See
(Fromer, 1998) for more detail.
The Read-First strategy in D1 has the best per- 
formance of the read strategies. This strategy takes 
the initiative to read a message, which might re- 
sult in messages being read that the user wasn't in- 
terested in. However since the user can barge-in 
on system utterances, perhaps little is lost by tak- 
ing the initiative to start reading a message. After 
124 training sessions, the best summarize strategy 
is Summarize-System, which automatically selects 
which attributes to summarize by, and so does not 
incur the cost of asking the user to specify these at- 
tributes. However, the utilities for the Summarize- 
Choice strategy have not completely converged after 
124 trials. 
5 Conclusions and Future Work 
This paper illustrates a novel technique by which 
an agent can learn to choose an optimal dialogue 
strategy. We illustrate our technique with ELVIS, an 
agent that supports access to email by phone, with 
strategies for initiative, and for reading and sum- 
marizing messages. We show that ELVIS can learn 
that the System-Initiative strategy has higher utility 
than the Mixed-Initiative strategy, that Read-First is 
the best read strategy, and that Summarize-System 
is the best summary strategy. 
Here, our method was illustrated by evaluating 
strategies for managing initiative and for message 
presentation. However there are numerous dia- 
logue strategies that an agent might use, e.g. to 
gather information, handle errors, or manage the di- 
alogue interaction (Chu-Carroll and Carberry, 1995; 
Danieli and Gerbino, 1995; Hovy, 1993; McK- 
eown, 1985; Moore and Paris, 1989). Previous 
work in natural language generation has proposed 
heuristics to determine an agent's choice of dialogue 
strategy, based on factors such as discourse focus, 
medium, style, and the content of previous expla- 
nations (McKeown, 1985; Moore and Paris, 1989; 
Maybury, 1991; Hovy, 1993). It should be possi- 
ble to test experimentally whether an agent can au- 
tomatically learn these heuristics since the method- 
ology we propose is general, and could be applied 
to any dialogue strategy choice that an agent might 
make. 
Previous work has also proposed that an agent's 
choice of dialogue strategy can be treated as a 
stochastic optimization problem (Walker, 1993; 
Biermann and Long, 1996; Levin and Pieraccini, 
1997). However, to our knowledge, these meth- 
ods have not previously been applied to interactions 
with real users. The lack of an appropriate perfor- 
mance function has been a critical methodological 
limitation. 
We use the PARADISE framework (Walker et 
al., 1997) to derive an empirically motivated per- 
formance function, that combines both subjective 
user preferences and objective system performance 
measures into a single function. It would have been 
impossible to predict a priori which dialogue fac-
tors influence the usability of a dialogue agent, and
to what degree. Our performance equation shows 
that both dialogue quality and efficiency measures 
contribute to agent performance, but that dialogue 
quality measures have a greater influence. Further- 
more, in contrast to assuming an a priori model, we 
use the dialogues from real user-system interactions 
to provide realistic estimates of M^a_ij, the state tran-
sition model used by the learning algorithm. It is
impossible to predict a priori the transition frequen- 
cies, given the imperfect nature of spoken language 
understanding, and the unpredictability of user be- 
havior. 
The use of this method introduces several open 
issues. First, the results of the learning algorithm 
are dependent on the representation of the state 
space. In many reinforcement learning problems 
(e.g. backgammon), the state space is pre-defined. 
In spoken dialogue systems, the system designers 
construct the state space and decide what state vari- 
ables need to be monitored. Our initial results sug- 
gest that the state representation that the agent uses 
to interact with the user may not be the optimal state 
representation for learning. See (Fromer, 1998). 
Second, in advance of actually running learning ex- 
periments, it is not clear how much experience an 
agent will need to determine which strategy is bet- 
ter. Figure 1 shows that it took no more than 50 
dialogue samples for the algorithm to show the dif- 
ferences in convergence trends when learning about 
initiative strategies. However, it appears that more 
data is needed to learn to distinguish between the 
summarization strategies. Third, our experimental 
data is based on short-term interactions with novice 
users, but we might expect that users of an email 
agent would engage in many interactions with the 
same agent, and that preferences for agent interac- 
tion strategies could change over time with user ex- 
pertise. This means that the performance function 
might change over time. Finally, the learning algo- 
rithm that we report here is an off-line algorithm, 
i.e. the agent collects a set of dialogues and then de- 
cides on an optimal strategy as a result. In contrast, 
it should be possible for the agent to learn on-line, 
during the course of a dialogue, if the performance 
function could be automatically calculated (or ap- 
proximated). We are exploring these issues in on- 
going work. 
6 Acknowledgements 
G. Di Fabbrizio, D. Hindle, J. Hirschberg, C. 
Kamm, and D. Litman provided assistance with this 
research or paper. 

References 
A. G. Barto, S. J. Bradtke, and S. P. Singh. 1995. Learn-
ing to act using real-time dynamic programming. Ar-
tificial Intelligence Journal, 72(1-2):81-138.
R. E. Bellman. 1957. Dynamic Programming. Princeton 
University Press, Princeton, N.J. 
A. W. Biermann and Philip M. Long. 1996. The compo- 
sition of messages in speech-graphics interactive sys- 
tems. In Proc. of the 1996 International Symposium 
on Spoken Dialogue, pp. 97-100. 
J. Chu-Carroll and S. Carberry. 1995. Response genera- 
tion in collaborative negotiation. In Proc. of the 33rd 
Annual Meeting of the ACL, pp. 136-143. 
P. R. Cohen. 1995. Empirical Methods for Artificial In-
telligence. MIT Press, Boston.
M. Danieli and E. Gerbino. 1995. Metrics for evaluating 
dialogue strategies in a spoken language system. In 
Proc. of the 1995 AAAI Spring Symposium on Empir- 
ical Methods in Discourse, pages 34-39. 
J. C. Fromer. 1998. Learning optimal discourse strate- 
gies in a spoken dialogue system. Technical Report 
Forthcoming, MIT AI Lab M.S. Thesis. 
E. H. Hovy. 1993. Automated discourse generation 
using discourse structure relations. Artificial Intelli- 
gence Journal, 63:341-385. 
C. Kamm, S. Narayanan, D. Dutton, and R. Ritenour. 
1997. Evaluating spoken dialog systems for telecom- 
munication services. In EUROSPEECH 97. 
R. Keeney and H. Raiffa. 1976. Decisions with Multiple 
Objectives: Preferences and Value Tradeoffs. John 
Wiley and Sons. 
E. Levin and R. Pieraccini. 1997. A stochastic model 
of computer-human interaction for learning dialogue 
strategies. In EUROSPEECH 97. 
M.T. Maybury. 1991. Planning multi-media explana- 
tions using communicative acts. In Proc. of the Ninth 
National Conf. on Artificial Intelligence, pages 61-66. 
K. R. McKeown. 1985. Discourse strategies for gen-
erating natural language text. Artificial Intelligence,
27(1):1-42, September.
J. D. Moore and C. L. Paris. 1989. Planning text for 
advisory dialogues. In Proc. 27th Annual Meeting of 
the ACL.
S. Russell and P. Norvig. 1995. Artificial Intelligence: A
Modern Approach. Prentice Hall, N.J. 
R. S. Sutton. 1991. Planning by incremental dynamic 
programming. In Proc. Ninth Conf. on Machine 
Learning, pages 353-357. Morgan-Kaufmann. 
M. A. Walker, D. Litman, C. Kamm, and A. Abella. 
1997. PARADISE: A general framework for evalu- 
ating spoken dialogue agents. In Proc. of the 35th An- 
nual Meeting of the ACL, pp. 271-280. 
M. Walker, J. Fromer, G. Di Fabbrizio, C. Mestel, and D. 
Hindle. 1998. What can I say: Evaluating a spoken 
language interface to email. In Proc. of the Conf. on 
Computer Human Interaction (CHI98).
M. A. Walker. 1993. Informational Redundancy and Re-
source Bounds in Dialogue. Ph.D. thesis, University
of Pennsylvania. 
C. J. Watkins. 1989. Models of Delayed Reinforcement 
Learning. Ph.D. thesis, Cambridge University. 
