An ISU Dialogue System Exhibiting Reinforcement Learning of Dialogue
Policies: Generic Slot-filling in the TALK In-car System
Oliver Lemon, Kallirroi Georgila, and James Henderson
School of Informatics
University of Edinburgh
olemon@inf.ed.ac.uk
Matthew Stuttle
Dept. of Engineering
University of Cambridge
mns25@cam.ac.uk
Abstract
Wedemonstrateamultimodaldialoguesystem
using reinforcement learning for in-car sce-
narios, developed at EdinburghUniversity and
Cambridge University for the TALK project1.
This prototype is the first “Information State
Update”(ISU) dialoguesystemto exhibitrein-
forcement learning of dialogue strategies, and
also has a fragmentary clarification feature.
Thispaperdescribesthemaincomponentsand
functionality of the system, as well as the pur-
posesandfutureuseofthesystem,andsurveys
theresearchissuesinvolvedin itsconstruction.
Evaluation of this system (i.e. comparing the
baselinesystem with handcodedvs. learntdia-
logue policies) is ongoing, and the demonstra-
tion will show both.
1 Introduction
The in-car system described below has been con-
structed primarily in order to be able to collect data
for Reinforcement Learning (RL) approaches to mul-
timodal dialogue management, and also to test and fur-
ther develop learnt dialogue strategies in a realistic ap-
plication scenario. For these reasons we have built a
system which:
a0 containsan interfaceto a dialoguestrategy learner
module,
a0 covers a realistic domain of useful “in-car” con-
versation and a wide range of dialogue phenom-
ena (e.g. confirmation, initiative, clarification, in-
formation presentation),
a0 can be used to complete measurable tasks (i.e.
there is a measure of successful and unsuccessful
dialogues usable as a reward signal for Reinforce-
ment Learning),
a0 logs all interactions in the TALK data collection
format (Georgila et al., 2005).
1This research is supported by the TALK project (Euro-
pean Community IST project no. 507802), http://www.talk-
project.org
In this demonstration we will exhibit the software
system that we have developed to meet these require-
ments. First we describe the domain in which the di-
alogue system operates (an “in-car” information sys-
tem). Then we describe the major components of the
system and giveexamplesof their use. We then discuss
the important features of the system in respect to the
dialogue phenomena that they support.
1.1 A System Exhibiting Reinforcement Learning
The central motivation for building this dialogue sys-
tem is as a platform for Reinforcement Learning (RL)
experiments. The system exhibits RL in 2 ways:
a0 It can be run in online learning mode with real
users. Herethe RL agentis able to learn fromsuc-
cessfulandunsuccessfuldialogueswithrealusers.
Learningwill be much slowerthan with simulated
users, but can start from an already learnt policy,
and slowly improve upon that.
a0 It can be run using an already learnt policy (e.g.
the one reported in (Henderson et al., 2005;
Lemon et al., 2005), learnt from COMMUNICA-
TOR data (Georgila et al., 2005)). This mode can
be used to test the learnt policies in interactions
with real users.
Please see (Henderson et al., 2005) for an expla-
nation of the techniques developed for Reinforcement
Learning with ISU dialogue systems.
2 System Overview
The baseline dialogue system is built around the DIP-
PER dialogue manager (Bos et al., 2003). This sys-
tem is initially used to conduct information-seekingdi-
alogues with a user (e.g. find a particular hotel and
restaurant), using hand-coded dialogue strategies (e.g.
always use implicit confirmation, except when ASR
confidence is below 50%, then use explicit confirma-
tion). We have then modified the DIPPER dialogue
manager so that it can consult learnt strategies (for ex-
ample strategies learnt from the 2000 and 2001 COM-
MUNICATOR data (Lemon et al., 2005)), based on its
119
currentinformationstate, andthenexecutedialogueac-
tions from those strategies. This allows us to compare
hand-coded against learnt strategies within the same
system (i.e. the other components such as the speech-
synthesiser, recogniser, GUI, etc. all remain fixed).
2.1 Overview of System Features
The following features are currently implemented:
a0 use of Reinforcement Learning policies or dia-
logue plans,
a0 multiple tasks: information seeking for hotels,
bars, and restaurants,
a0 overanswering/ question accommodation/ user-
initiative,
a0 open speech recognition using n-grams,
a0 confirmations - explicit and implicit based on
ASR confidence,
a0 fragmentary clarifications based on word confi-
dence scores,
a0 multimodal output - highlighting and naming en-
tities on GUI,
a0 simple user commands (e.g. “Show me all the in-
dian restaurants”),
a0 dialogue context logging in ISU format (Georgila
et al., 2005).
3 Research Issues
Thework presentedhereexploresa numberof research
themes, in particular: using learnt dialogue policies,
learning dialogue policies in online interaction with
users, fragmentary clarification, and reconfigurability.
3.1 Moving between Domains:
COMMUNICATOR and In-car Dialogues
Thelearntpolicies in (Hendersonet al., 2005)focussed
on the COMMUNICATOR system for flight-booking di-
alogues. There we reportedlearning a promisinginitial
policy for COMMUNICATOR dialogues, but the issue
arises of how we could transfer this policy to new do-
mains – for example the in-car domain.
In the in-car scenarios the genre of “information
seeking” is central. For example the SACTI corpora
(Stuttle et al., 2004) have driver information requests
(e.g. searching for hotels) as a major component.
One question we address here is to what extent di-
alogue policies learnt from data gathered for one sys-
tem, or family of systems, can be re-used or adapted
for use in other systems. We conjecture that the slot-
filling policies learnt from our experiments with COM-
MUNICATOR will also be good policies for other slot-
filling tasks – that is, that we are learning “generic”
slot-filling or informationseeking dialoguepolicies. In
section 5 we describe how the dialogue policies learnt
for slot filling on the COMMUNICATOR data set can be
abstracted and used in the in-car scenarios.
3.2 Fragmentary Clarifications
Another research issue we have been able to explore
in constructing this system is the issue of generating
fragmentary clarifications. The system can be run with
this featureswitched on or off (off for comparison with
COMMUNICATOR systems). Instead of a system sim-
plysaying“Sorry,pleaserepeatthat”orsomesuchsim-
ilar simple clarification request when there is a speech
recognition failure, we were able to use the word con-
fidence scores output by the ATK speech recogniser to
generate more intelligent fragmentary clarification re-
quests such as “Did you say a cheap chinese restau-
rant?”. This works by obtaining an ASR confidence
score for each recognised word. We are then able to
try various techniques for clarifying the user utterance.
Many possibilities arise, for example: explicitly clarify
only the highest scoring content word below the rejec-
tion threshold, or, implicitly clarify all content words
and explicitly clarify the lowest scoring content word.
The current platform enables us to test alternative
strategies, and develop more complex ones.
4 The “In-car” Scenario
The scenario we have designed the system to cover is
that of information seeking about a town, for example
its hotels, restaurants, and bars. We imagine a driver
who is travelling towards this town, or is already there,
who wishes to accomplish relatively complex tasks,
such as finding an italian restaurant near their hotel,
or finding all the wine bars in town, and so on. The
driver/user should be able to specify queries using nat-
ural dialogue, and will receive system output that is a
mixtureof spokenand graphicalinformation(e.g.a de-
scription of an item and a map showing its location).
The example town is taken from the (Stuttle et al.,
2004) corpus collection materials, and contains a num-
ber of hotels, bars, restaurants, and tourist attractions.
Theusershouldbeableto getinformationonarange
of locations in the town, and the dialogue system will
be used to specify and refine the user queries, as well
as to present information to the user. See the example
dialogue in table 1.
We now describe the dialogue system components.
5 Component-level Description
This section describes the components of the baseline
in-car dialogue system. Communication between com-
ponents is handled by OAA’s asynchronous hub archi-
tecture (Cheyer and Martin, 2001). The major compo-
nents are:
a0 DIPPER (Bos et al., 2003) ISU dialoguemanager,
a0 Dialogue Policy Reinforcement Learner Agent2,
a0 ATK for speech recognition,
a0 Festival speech synthesiser,
2This is written in Python and has an OAA wrapper in C.
120
Utterance Feature
S How can I help you? Open initiative ini-
tial question
U I’m looking for a cheap chinese
restaurant.
User initiative
S Would you like something cen-
tral?
Intention recogni-
tion
U Yes.
S The Dragon Way is an af-
fordable cantonese restaurant
located on Main Street [dis-
played on map].
Multimodal presen-
tation
S Are you happy with those op-
tions?
U Yes, and I also want a double
room.
S Would you like something ex-
pensive, cheap, or reasonably
priced?
Intention recogni-
tion
U A luxury hotel, in the center of
town.
S There are 2 hotel options meet-
ing your requirements. The
Royal Hotel [displayed on
map] is a renovated nineteenth
century palace a0a1a0a1a0 The Hotel
Primus[displayedonmap]has
big, bright, clean rooms a0a1a0a2a0
Which hotel option would you
like?
Multimodal presen-
tation
U The second one please.
S The phone number of the Hotel
Primus is 2094-227.
S Would you like any further in-
formation?
U I want to find a jazz bar.
S I’m sorry, what kind of bar are
you looking for?
Intention recogni-
tion, Fragmentary
clarification
U A jazz bar.
S A jazz bar. Would you like
something central?
Implicit confirma-
tion
Table 1: Example dialogue, showing system features
a0 Multimodal Map interface (a java OAA agent),
a0 Database agent (java OAA wrapper to MySQL).
5.1 Dialogue Policy Learner Agent
This agent acts as an interface between the DIPPER
dialogue manager and the system simulation based on
RL. In particular it has the following solvable:
callRLsimulation(IS file name,
conversational domain, speech act, task,
result).
The first argument is the name of the file that contains
all information about the current information state,
which is required by the RL algorithm to produce
an action. The action returned by the RL agent is
a combination of conversational domain,
speech act, and task. The last argument shows
whether the learnt policy will continue to produce
more actions or release the turn. When run in online
learning mode the agent not only produces an action
when supplied with a state, but at the end of every
dialogue it uses the reward signal to update its learnt
policy. The reward signal is defined in the RL agent,
and is currently a linear combination of task success
metrics combined with a fixed penalty for dialogue
length (see (Henderson et al., 2005)).
This agent can be called whenever the system has
to decide on the next dialogue move. In the original
hand-coded system this decision is made by way of a
dialogue plan (using the “deliberate” solvable). The
RL agent can be used to drive the entire dialogue pol-
icy,orcan becalled onlyin certaincircumstances. This
makes it usable for whole dialogue strategies, but also,
if desired, it can be targetted only on specific dialogue
managementdecisions (e.g. implicit vs. explicit confir-
mation, as was done by (Litman et al., 2000)).
One important research issue is that of tranferring
learnt strategies between domains. We learnt a strat-
egy for the COMMUNICATOR flight booking dialogues
(Lemon et al., 2005; Henderson et al., 2005), but
this is generated by rather different scenarios than the
in-car dialogues. However, both are “slot-filling” or
information-seeking applications. We defined a map-
ping (described below) between the states and actions
of both systems, in order to construct an interface be-
tween the learnt policies for COMMUNICATOR and the
in-car baseline system.
5.2 Mapping between COMMUNICATOR and
the In-car Domains
There are 2 main problems to be dealt with here:
a0 mappingbetweenin-carsystem informationstates
and COMMUNICATOR information states,
a0 mapping between learnt COMMUNICATOR sys-
tem actions and in-car system actions.
The learnt COMMUNICATOR policy tells us, based
on a current IS, what the optimal system action
is (for example request info(dest city) or
acknowledgement). Obviously, in the in-car sce-
nario we have no use for task types such as “destina-
tion city” and “departure date”. Our method therefore
is to abstract away from the particular details of the
task type, but to maintain the information about dia-
loguemovesandtheslotnumbersthatareunderdiscus-
sion. That is, we construe the learnt COMMUNICATOR
policy as a policy concerning how to fill up to 4 (or-
dered) informational slots, and then access a database
and present results to the user. We also note that some
slots are more essential than others. For example, in
COMMUNICATOR it is essential to have a destination
city, otherwise no results can be found for the user.
Likewise, for the in-car tasks, we consider the food-
type, bar-type, and hotel-location to be more important
to fill than the other slots. This suggests a partial order-
ing on slots via their importance for an application.
In order to do this we define the mappings shown
in table 2 between COMMUNICATOR dialogue actions
and in-car dialogue actions, for each sub-task type of
the in-car system.
121
COMMUNICATOR action In-car action
dest-city food-type
depart-date food-price
depart-time food-location
dest-city hotel-location
depart-date room-type
depart-time hotel-price
dest-city bar-type
depart-date bar-price
depart-time bar-location
Table 2: Action mappings
Note that we treat each of the 3 in-car sub-tasks (ho-
tels, restaurants,bars) as a separateslot-filling dialogue
thread, governed by COMMUNICATOR actions. This
means that the very top level of the dialogue (“How
may I help you”) is not governed by the learnt policy.
Only when we are in a recognised task do we ask the
COMMUNICATOR policy for the next action. Since the
COMMUNICATOR policy is learnt for 4 slots, we “pre-
fill” a slot3 in the IS when we send it to the Dialogue
Policy Learner Agent in order to retrieve an action.
As for the state mappings, these follow the same
principles. That is, we abstract from the in-car states to
form states that are usable by COMMUNICATOR . This
means that, for example, an in-car state where food-
type and food-price are filled with high confidence is
mapped to a COMMUNICATOR state where dest-city
and depart-date are filled with high confidence, and
all other state information is identical (modulo the task
names). Note that in a future version of the in-car sys-
tem where task switching is allowed we will have to
maintain a separate view of the state for each task.
In terms of the integration of the learnt policies with
theDIPPERsystemupdaterules, wehaveasystemflag
which states whether or not to use a learnt policy. If
this flag is present, a different update rule fires when
the system determines what action to take next. For
example, instead of using the deliberate predicate
to access a dialogue plan, we instead call the Dialogue
PolicyLearnerAgentviaOAA, using thecurrentInfor-
mation State of the system. This will return a dialogue
action to the DIPPER update rule.
Incurrentworkweareevaluatinghowwellthelearnt
policies work for real users of the in-car system.
6 Conclusions and Future Work
This report has described work done in the TALK
project in building a software prototype baseline “In-
formation State Update” (ISU)-based dialogue system
in the in-car domain, with the ability to use dialogue
policies derivedfrom machinelearningand also to per-
form onlinelearningthroughinteraction. We described
the scenarios, gave a component level description of
the software, and a feature level description and exam-
3We choose “orig city” because it is the least important
and is already filled at the start of many COMMUNICATOR
dialogues.
ple dialogue.
Evaluation of this system (i.e. comparing the sys-
tem with hand-coded vs. learnt dialogue policies) is
ongoing. Initial evaluation of learnt dialogue policies
(Lemon et al., 2005; Henderson et al., 2005) suggests
that the learnt policy performs at least as well as a rea-
sonable hand-codedsystem (the TALK policy learnt for
COMMUNICATOR dialogue management outperforms
all the individual hand-coded COMMUNICATOR sys-
tems).
The main achievements made in designing and con-
structing this baseline system have been:
a0 Combining learnt dialogue policies with an ISU
dialogue manager. This has been done for online
learning, as well as for strategies learnt offline.
a0 Mapping learnt policies between domains, i.e.
mapping Information States and system actions
between DARPA COMMUNICATOR and in-carin-
formation seeking tasks.
a0 Fragmentaryclarification strategies: the combina-
tion of ATK word confidence scoring with ISU-
based dialoguemanagementrules allows us to ex-
plore word-based clarification techniques.
References
J. Bos, E. Klein, O. Lemon, and T. Oka. 2003.
DIPPER: Description and Formalisation of an
Information-State Update Dialogue System Archi-
tecture. In 4th SIGdial Workshop on Discourse and
Dialogue, Sapporo.
A. Cheyer and D. Martin. 2001. The open agent archi-
tecture. Journal of Autonomous Agents and Multi-
Agent Systems, 4(1):143–148.
K. Georgila, O. Lemon, and J. Henderson. 2005. Au-
tomatic annotation of COMMUNICATOR dialogue
data for learning dialogue strategies and user sim-
ulations. In Ninth Workshop on the Semantics and
Pragmatics of Dialogue (SEMDIAL), DIALOR’05.
J. Henderson, O. Lemon, and K. Georgila. 2005.
Hybrid Reinforcement/Supervised Learning for Di-
alogue Policies from COMMUNICATOR data. In
IJCAI workshop on Knowledge and Reasoning in
Practical Dialogue Systems.
O. Lemon, K. Georgila, J. Henderson, M. Gabsdil,
I. Meza-Ruiz, and S. Young. 2005. D4.1: Inte-
gration of Learning and Adaptivity with the ISU ap-
proach. Technical report, TALK Project.
D. Litman, M. Kearns, S. Singh, and M. Walker. 2000.
Automaticoptimizationofdialoguemanagement. In
Proc. COLING.
M. Stuttle, J. Williams, and S. Young. 2004. A frame-
work for dialog systems data collection using a sim-
ulated ASR channel. In ICSLP 2004, Jeju, Korea.
122
