Multi-tasking and Collaborative Activities in Dialogue Systems
Oliver Lemon, Alexander Gruenstein, Alexis Battle, and Stanley Peters
Center for the Study of Language and Information
Stanford University, CA 94305
lemon,alexgru,ajbattle,peters@csli.stanford.edu
Abstract
We explain dialogue management tech-
niques for collaborative activities with hu-
mans, involving multiple concurrent tasks.
Conversational context for multiple con-
current activities is represented using a
“Dialogue Move Tree” and an “Activity
Tree” which support multiple interleaved
threads of dialogue about different activi-
ties and their execution status. We also de-
scribe the incremental message selection,
aggregation, and generation method em-
ployed in the system.
1 Introduction
This paper describes implemented multi-modal dia-
logue systems1 which support collaboration with au-
tonomous devices in their execution of multiple con-
current tasks. We will focus on the particular mod-
elling and processing aspects which allow the sys-
tems to handle dialogues about multiple concurrent
tasks in a coherent and natural manner. Many con-
versations between humans have this property, and
dialogues between humans and semi-autonomous
devices will have this feature in as much as devices
are able to carry out activities concurrently. This
ability to easily interleave communication streams is
a very useful property of conversational interactions.
Humans are adept at carrying out conversations with
1This research was (partially) funded under the Wallenberg
laboratory for research on Information Technology and Au-
tonomous Systems (WITAS) Project, Link¨oping University, by
the Wallenberg Foundation, Sweden.
multiple threads, or topics, and this capability en-
ables fluid and efficient communication, and thus ef-
fective co-ordination of actions (see (Lemon et al.,
2002) for a more extensive discussion). We will
show how to endow a dialogue system with some
of these capabilities.
The main issues which we address in this paper
are:
a0 Representation of dialogue context such that
collaborative activities and multi-tasking are
supported.
a0 Dialogue management methods such that free
and natural communication over several con-
versational topics is supported.
a0 Natural generation of messages in multi-
tasking collaborative dialogues.
In Section 2 we discuss the demands of multi-
tasking and collaboration with autonomous devices.
Section 3 covers the robot with which our current
dialogue system interacts, and the architecture of
the dialogue system. In Section 4 we introduce the
“joint activities” and Activity Models which repre-
sent collaborative tasks and handle multi-tasking in
an interface layer between the dialogue system and
autonomous devices. Section 5 presents the dia-
logue modelling and management techniques used
to handle multiple topics and collaborative activi-
ties. Section 6 surveys the message selection, ag-
gregation, and generation component of the system,
in the context of multi-tasking.
     Philadelphia, July 2002, pp. 113-124.  Association for Computational Linguistics.
                  Proceedings of the Third SIGdial Workshop on Discourse and Dialogue,
2 Multi-tasking and Collaboration
A useful dialogue system for interaction with au-
tonomous devices will enable collaboration with hu-
mans in the planning and execution of tasks. Dia-
logue will be used to specify and clarify instructions
and goals for the device, to monitor its progress,
and also to jointly solve problems. Before we deal
with such issues in detail, we note that such devices
also have the following properties which are relevant
from the point of view of dialogue management:
a0 Devices exist within dynamic environments,
where new objects appear and are available for
discussion. Device sensors may give rise to
new information at any time, and this may need
to be communicated urgently.
a0 Devices may perform multiple concurrent ac-
tivities which may succeed, fail, become can-
celled, or be revised. These activities can be
topics of conversation.
(Allen et al., 2001) present a taxonomy of dia-
logue systems ranging from “finite-state script” di-
alogues for simple tasks (such as making a long-
distance call) to the most complex “agent-based
models” which cover dialogues where different pos-
sibilities, such as future plans, are discussed. Within
this taxonomy, a useful dialogue system for interac-
tion with autonomous devices must be located at or
near the “agent-based” point since we wish to com-
municate with devices about their possible actions,
their plans, and the tasks they are currently attempt-
ing. For these reasons we built a dialogue manager
that represents (possibly collaborative) activities and
their execution status, and tracks multiple threads of
dialogue about concurrent and planned activities.
For these sorts of reasons it is clear that form-
filling or data-base query style dialogues (e.g. the
CSLU Toolkit, (McTear, 1998)) will not suffice here
(see (Elio and Haddadi, 1999; Allen et al., 2001) for
similar arguments).
3 The WITAS Dialogue System
In our current application, the autonomous system
is the WITAS2 UAV (‘unmanned aerial vehicle’) –
a small robotic helicopter with on-board planning
2See http://www.ida.liu.se/ext/witas
and deliberative systems, and vision capabilities (for
details see e.g. (Doherty et al., 2000)). This robot
helicopter will ultimately be controlled by the dia-
logue system developed at CSLI, though at the mo-
ment we interact with a simulated3 UAV. Mission
goals are provided by a human operator, and an on-
board planning system then responds. While the he-
licopter is airborne, an on-board active vision sys-
tem interprets the scene or focus below to interpret
ongoing events, which may be reported (via NL gen-
eration) to the operator (see Section 6). The robot
can carry out various “activities” such as flying to a
location, or following a vehicle, or landing. These
activities are specified by the user during dialogue,
or can be initiated by the UAV’s on-board AI. In any
case, a major component of the dialogue, and a way
of maintaining its coherence, is tracking the state of
current or planned activities of the device.
A more interesting and problematic notion is that
of “joint-activities” between an autonomous system
and a human operator. These are activities which
the autonomous system cannot complete alone, but
which require some human intervention. In our cur-
rent scenarios, the UAV’s vision system is not good
enough to determine whether a particular vehicle is
the one sought-after, and only the human operator
has the authority to determine this, so that human
and robot must collaborate in order to find and track
a vehicle. The dialogue in Figure 2 shows how a
typical interaction works4 (other capabilities, such
as clarification subdialogues, are covered in (Lemon
et al., 2001)). Note here that the user is able to make
explicit queries about the robot’s activities (both cur-
rent and future), that there are concurrent activi-
ties, and that conversational initiative centers around
the joint activities currently being specified and ex-
ecuted.
4 Activity Models
The idea of Activity Modelling in our system is
the vision that dialogue systems can, in generality,
be built for ‘devices’ which carry out certain well-
3Our UAV simulator uses KIF statements under JTP (the
Java Theorem Prover) to represent and non-montonically up-
date UAV state information.
4The system runs on a laptop computer under Windows
2000. Video footage of the system can be found at http:
//www-csli.stanford.edu/semlab/witas/
Figure 2: A demonstration of the WITAS dialogue system (November 2001)
Multi-modal Utterances Dialogue Move
Operator (O): Our job is to look for a red car Command (Joint Activity)
UAV (U): Ok. I am looking for one. Report (Confirm Activity)
O: Fly here please [+click on map] Command (Deictic)
U: Okay. I will fly to waypoint one Report (Confirm Activity)
U: Now taking off and flying there. Report (Current Activity)
O: Stop that. Go to the tower instead. Command, Revision
U: I have cancelled flying to waypoint one. I will fly to the
tower.
Report (Activity status)
O: What are you doing? Wh-question (Current Activity)
U: I am searching for a red car and flying to the tower Answer (Current Activity)
O: What will you do next? Wh-question (Planned Activity)
U: I have nothing planned. Answer(Planned Activity)
U: I see a red car on main street [display on map, show video
images], Is this the right car?
Report, Yn-question (Activity)
O: Yes, that’s the right car Yn-answer (Positive)
U: Okay. I am following it . Report (Current activity)
facilitator
OAA2
Synthesizer
Generator
Gemini
Parser and 
Recognizer
Speech
Festival
Display
Interactive Map
NL
SR
TTS
DM
GUI
Activities
Model
Interface
       
Dialogue Move Tree (DMT)
Activity Tree (AT)
System Agenda (SA)
Pending List (PL)
Modality Buffer (MB)
ROBOT
Salience List (SL)
Speech
Nuance
DIALOGUE MANAGER
Figure 1: The WITAS dialogue system architecture
defined activities (e.g. switch lights on, record on
channel a0 , send email a1 to a2 , search for vehicle a3 ),
and that an important part of the dialogue context to
be modelled in such a system is the device’s planned
activities, current activities, and their execution sta-
tus5. We choose to focus on building this class of
dialogue systems because we share with (Allen et
al., 2001), a version of the the Practical Dialogue
5Compare this with the motivation behind the “Pragmatic
Adapter” idea of (LuperFoy et al., 1998).
Hypothesis:
“The conversational competence required
for practical dialogues, although still com-
plex, is significantly simpler to achieve
than general human conversational com-
petence.”
We also share with (Rich et al., 2001) the idea that
declarative descriptions of the goal decomposition
of activities (COLLAGEN’s “recipes”, our “Activ-
ity Models”) are a vital layer of representation, be-
tween a dialogue system and the device with which
it interacts.
In general we assume that a device is capable of
performing some “atomic” activities or actions (pos-
sibly simultaneously), which are the lowest-level ac-
tions that it can perform. Some devices will only
know how to carry out sequences of atomic activ-
ities, in which case it is the dialogue system’s job
to decompose linguistically specified high-level ac-
tivities (e.g. “record the film on channel 4 tonight”)
into a sequence of appropriate atomic actions for the
device. In this case the dialogue system is provided
with a declarative “Activities Model” (see e.g. Fig-
ure 3) for the device which states how high-level
linguistically-specified activities can be decomposed
into sequences of atomic actions. This model con-
tains traditional planning constraints such as precon-
ditions and postconditions of actions. In this way, a
relatively “stupid” device (i.e. with little or no plan-
ning capabilities) can be made into a more intelli-
gent device when it is dialogue-enabled.
At the other end of the spectrum, more intelli-
gent devices are able to plan their own sequences of
atomic actions, based on some higher level input. In
this case, it is the dialogue system’s role to translate
natural language into constraints (including tempo-
ral constraints) that the device’s planner recognizes.
The device itself then carries out planning, and in-
forms the dialogue manager of the sequence of ac-
tivities that it proposes. Dialogue can then be used
to re-specify constraints, revise activities, and mon-
itor the progress of tasks. We propose that the pro-
cess of decomposing a linguistically specified com-
mand (e.g. “vacuum in the main bedroom and the
lounge, and before that, the hall”) into an appropri-
ate sequence of constraints for the device’s on-board
planner, is an aspect of “conversational intelligence”
that can be added to devices by dialogue-enabling
them.
We are developing one representation and reason-
ing scheme to cover this spectrum of cases from de-
vices with no planning capabilities to some more
impressive on-board AI. Both dialogue manager
and robot/device have access to a single “Activity
Tree” which is a shared representation of current
and planned activities and their execution status, in-
volving temporal and hierarchical ordering (in fact,
one can think of the Activity Tree as a Hierarchical
Task Network for the device). This tree is built top-
down by processing verbal input from the user, and
its nodes are then expanded by the device’s planner
(if it has one). In cases where no planner exists, the
dialogue manager itself expands the whole tree (via
the Activity Model for the device) until only leaves
with atomic actions are left for the device to execute
in sequence. The device reports completion of activ-
ities that it is performing and any errors that occur
for an activity.
Note that because the device and dialogue system
share the same representation of the device’s activ-
ities, they are always properly coordinated. They
also share responsibility for different aspects of con-
structing and managing the whole Activity Tree.
Note also that some activities can themselves be
speech acts, and that this allows us to build collabo-
rative dialogue into the system. For example, in Fig-
ure 3 the ASK-COMPLETE activity is a speech act,
generating a yes-no question to be answered by the
user.
4.1 An example Activity Model
An example LOCATE activity model for the UAV
is shown in Figure 3. It is used when constructing
parts of the activity tree involving commands such
as “search for”, “look for” and so on. For instance,
if the user says “We’re looking for a truck”, that ut-
terance is parsed into a logical form involving the
structure (locate, np[det(a),truck]).
The dialogue manager then accesses the Activity
Model for LOCATE and adds a node to the Activ-
ity Tree describing it. The Activity Model speci-
fies what sub-activities should be invoked, and un-
der what conditions they should be invoked, what
the postconditions of the activity are. Activity Mod-
els are similar to the “recipes” of (Rich et al., 2001).
For example, in Figure 3 the Activity Model for LO-
CATE states that,
a0 it uses the camera resource (so that any other
activity using the camera must be suspended,
or a dialogue about resource conflict must be
initiated),
a0 that the preconditions of the activity are that the
UAV must be airborne, with fuel and engine in-
dicators satisfactory,
a0 that the whole activity can be skipped if the
UAV is already “locked-on” to the sought ob-
ject,
a0 that the postcondition of the activity is that the
UAV is “locked-on” to the sought object,
a0 that the activity breaks into three sequen-
tial sub-activities: WATCH-FOR, FOLLOW-OBJ,
and ASK-COMPLETE.
Nodes on the Activity Tree can be either: ac-
tive, complete, failed, suspended, or canceled. Any
change in the state of a node (typically because of
a report from the robot) is placed onto the System
Agenda (see Section 5) for possible verbal report to
the user, via the message selection and generation
module (see Section 6).
Figure 3: A “Locate” Activity Model for a UAV, exhibiting collaborative dialogue
Locate// locate is "find-by-type", collaborative activity.
// Breaks into subactivities: watch_for, follow, ask_complete.
{ResourcesUsed {camera;} // will be checked for conflicts.
PreConditions //check truth of KIF statements.
{(Status flight inair) (Status engine ok) (Status fuel ok);}
SkipConditions // skip this Activity if KIF condition true.
{(Status locked-on THIS.np);}
PostConditions// assert these KIF statements when completed.
{(Status locked-on THIS.np) ;}
Children SEQ //sequential sub-activities.
{TaskProperties
{command = "watch_for"; // basic robot action ---
np = THIS.np;} // set sensors to search.
TaskProperties
{command = "follow_obj"; //triggers complex activity --
np = THIS.np;} //following a candidate object.
TaskProperties //collaborative speech action:
{command = "ask_complete";//asks user whether this is
np = THIS.np; }}} //object we are looking for.
5 The Dialogue Context Model
Dialogue management falls into two parts – dialogue
modelling (representation), and dialogue control (al-
gorithm). In this section we focus on the representa-
tional aspects, and section 5.2 surveys the main al-
gorithms. As a representation of conversational con-
text, the dialogue manager uses the following data
structures which make up the dialogue Information
State (IS);
a0 Dialogue Move Tree (DMT)
a0 Activity Tree (AT)
a0 System Agenda (SA)
a0 Pending List (PL)
a0 Salience List (SL)
a0 Modality Buffer (MB)
Figure 4 shows how the Dialogue Move Tree re-
lates to other parts of the dialogue manager as a
whole. The solid arrows represent possible update
functions, and the dashed arrows represent query
functions. For example, the Dialogue Move Tree
can update Salience List, System Agenda, Pend-
ing List, and Activity Tree, while the Activity Tree
can update only the System Agenda and send ex-
ecution requests to the robot, and it can query the
Activity Model (when adding nodes). Likewise, the
Message Generation component queries the System
Agenda and the Pending List, and updates the Dia-
logue Move Tree whenever a synthesized utterance
is produced.
Figure 5 shows an example Information State
logged by the system, displaying the interpretation
of the system’s utterance “now taking off” as a re-
port about an ongoing “go to the tower” activity (the
Pending List and System Agenda are empty, and
thus are not shown).
5.1 The Dialogue Move Tree
Dialogue management uses a set of abstract dia-
logue move classes which are domain independent
(e.g. command, activity-query, wh-question, revi-
sion, a0a1a0a1a0 ). Any ongoing dialogue constructs a par-
ticular Dialogue Move Tree (DMT) representing the
current state of the conversation, whose nodes are
DIALOGUE
ACTIVITY
MOVE
TREE
AGENDA
SYSTEM TREE
Activities)
(NPs,
(Selection and Aggregation)
SALIENCE
ACTIVITY
LAYER
speechsynthesis
INFORMATION
INDEXICAL
(Active Node List)
MESSAGE 
GENERATION
ACTIVITY
MODEL
DEVICE
LIST
PENDING
LIST
MODALITY
BUFFER
Map Display Inputs
(parsed human speech)
(mouse clicks)
Conversational Move Inputs
Figure 4: Dialogue Manager Architecture (solid arrows denote possible updates, dashed arrows represent
possible queries)
instances of the dialogue move classes, and which
are linked to nodes on the Activity Tree where ap-
propriate, via an activity tag (see below).
Incoming logical forms (LFs) from the pars-
ing process are always tagged with a dialogue
move (see e.g. (Ginzburg et al., 2001)), which pre-
cedes more detailed information about an utter-
ance. For instance the logical form: command([go],
[param-list ([pp-loc(to, arg([np(det([def],the),
[n(tower,sg)])]))])])
corresponds to the utterance “go to the tower”,
which is flagged as a command.
A slightly more complex example is; re-
port(inform, agent([np([n(uav,sg)])]), compl-
activity([command([take-off])]))
which corresponds to “I have taken off” – a re-
port from the UAV about a completed ‘taking-off’
activity.
The first problem in dialogue management is
to figure out how these incoming “Conversational
Moves” relate to the current dialogue context. In
other words, what dialogue moves do they consti-
tute, and how do they relate to previous moves in
the conversation? In particular, given multi-tasking,
to which thread of the conversation does an incom-
ing utterance belong? We use the Dialogue Move
Tree to answer these questions:
1. A DMT is a history or “message board” of
dialogue contributions, organized by “thread”,
based on activities.
2. A DMT classifies which incoming utterances
can be interpreted in the current dialogue con-
text, and which cannot be. It thus delimits
a space of possible Information State update
functions.
3. A DMT has an Active Node List which con-
trols the order in which this function space is
searched 6.
4. A DMT classifies how incoming utterances are
to be interpreted in the current dialogue con-
text.
In general, then, we can think of the DMT as
representing a function space of dialogue Informa-
6It also defines an ordering on language models for speech
recognition.
tion State update functions. The details of any par-
ticular update function are determined by the node
type (e.g. command, question) and incoming dia-
logue move type and their contents, as well as the
values of Activity Tag and Agent.
Note that this notion of “Dialogue Move Tree” is
quite different from previous work on dialogue trees,
in that the DMT does not represent a “parse” of the
dialogue using a dialogue grammar (e.g. (Ahrenberg
et al., 1990)), but instead represents all the threads
in the dialogue, where a thread is the set of utter-
ances which serve a particular dialogue goal. In the
dialogue grammar approach, new dialogue moves
are attached to a node on the right frontier of the
tree, but in our approach, a new move can attach
to any thread, no matter where it appears in the
tree. This means that the system can flexibly in-
terpret user moves which are not directly related to
the current thread (e.g. a user can ignore a system
question, and give a new command, or ask their
own question). Finite-state representations of dia-
logue games have the restriction that the user is con-
strained by the dialogue state to follow a particular
dialogue path (e.g. state the destination, clarify, state
preferred time, a0a1a0a1a0 ). No such restriction exists with
DMTs, where dialogue participants can begin and
discontinue threads at any time.
We discuss this further below.
5.2 Interpretation and State Update
The central algorithm controlling dialogue manage-
ment has two main steps, Attachment, and Process
Node;
1. Attachment: Process incoming input conversa-
tional move a0 with respect to the current DMT
and Active Node List, and “attach” a new node
a1 interpreting
a0 to the tree if possible.
2. Process Node: process the new node a1 , if it
exists, with respect to the current information
state. Perform an Information State update us-
ing the dialogue move type and content of a1 .
When an update function a2 exists, its effects de-
pend on the details of the incoming input a0 (in par-
ticular, to the dialogue move type and the contents
of the logical form) and the DMT node to which it
attaches. The possible attachments can be thought
of as adjacency pairs, and each dialogue move class
contains information about which node types it can
attach. For instance the command node type can at-
tach confirmation, yn-question, wh-question, and re-
port nodes.
Examples of different attachments available in our
current system can be seen in Figure 67. For exam-
ple, the first entry in the table states that a command
node, generated by the user, with activity tag a3 , is
able to attach any system confirmation move with
the same activity tag, any system yes-no question
with that tag, any system wh- question with that tag,
or any system report with that activity tag. Similarly,
the rows for wh-question nodes state that:
a0 a wh-question by the system with activity tag
a3
can attach a user’s wh-answer (if it is a possible
answer for that activity)
a0 a user’s wh-question can attach a system wh-
answer, and no particular activity need be spec-
ified.
These possible attachments delimit the ways in
which dialogue move trees can grow, and thus clas-
sify the dialogue structures which can be captured in
the current system. As new dialogue move types are
added to the system, this table is being extended to
cover other conversation types (e.g. tutoring (Clark
et al., 2001)).
It is worth noting that the node type created af-
ter attachment may not be the same as the dialogue
move type of the incoming conversational move a0 .
Depending on the particular node which attaches the
new input, and the move type of that input, the cre-
ated node may be of a different type. For exam-
ple, if a wh-question node attaches an input which is
simply a command, the wh-question node may inter-
pret the input as an answer, and attach a wh-answer.
These interpretation rules are local to the node to
which the input is attached. In this way, the DMT
interprets new input in context, and the pragmatics
of each new input is contextually determined, rather
than completely specified via parsing using conver-
sational move types. Note that Figure 6 does not
state what move type new input is attached as, when
it is attached.
7Where Activity Tags are not specified, attachment does not
depend on sharing of Activity Tags.
In the current system, if the user produces an ut-
terance which can attach to several nodes on the
DMT, only the “most active” node (as defined by the
Active Node List) will attach the incoming move. It
would be interesting to explore such events as trig-
gers for clarification questions, in future work.
6 Message generation
Since the robot is potentially carrying out multiple
activities at once, a particular problem is how to de-
termine appropriate generation of utterances about
those activities, in a way which does not overload
the user with information, yet which establishes and
maintains appropriate context in a natural way.
Generation for dialogue systems in general is
problematic in that dialogue contributions arise in-
crementally, often in response to another partici-
pant’s utterances. For this reason, generation of
large pieces of text is not appropriate, especially
since the user is able to interrupt the system. Other
differences abound, for example that aggregation
rules must be sensitive to incremental aspects of
message generation.
As well as the general problems of message selec-
tion and aggregation in dialogue systems, this par-
ticular type of application domain presents specific
problems in comparison with, say, travel-planning
dialogue systems – e.g. (Seneff et al., 1991). An au-
tonomous device will, in general, need to communi-
cate about,
a0 its perceptions of a changing environment,
a0 progress towards user-specified goals,
a0 execution status of activities or tasks,
a0 its own internal state changes,
a0 the progress of the dialogue itself.
For these reasons, the message selection and gen-
eration component of such a system needs to be
of wider coverage and more flexible than template-
based approaches, while remaining in real, or near-
real, time (Stent, 1999). As well as this, the system
must potentially be able to deal with a large band-
width stream of communications from the robot,
and so must be able to intelligently filter them for
“relevance” so that the user is not overloaded with
unimportant information, or repetitious utterances.
In general, the system should appear as ‘natural’ as
possible from the user’s point of view – using the
same language as the user if possible (“echoing”),
using anaphoric referring expressions where possi-
ble, and aggregating utterances where appropriate.
A ‘natural’ system should also exhibit “variability”
in that it can convey the same content in a variety
of ways. A further desirable feature is that the sys-
tem’s generated utterances should be in the cover-
age of the dialogue system’s speech recognizer, so
that system-generated utterances effectively prime
the user to speak in-grammar.
Consequently we attempted to implement the fol-
lowing features in message selection and generation:
relevance filtering; recency filtering; echoing; vari-
ability; aggregation; symmetry; real-time genera-
tion.
Our general method is to take as inputs to the pro-
cess various communicative goals of the system, ex-
pressed as logical forms, and use them to construct a
single new logical form to be input to Gemini’s Se-
mantic Head-Driven Generation algorithm (Shieber
et al., 1990), which produces strings for Festival
speech synthesis. We now describe how to use com-
plex dialogue context to produce natural generation
in multitasking contexts.
6.1 Message selection - filtering
Inputs to the selection and generation module are
“concept” logical forms (LFs) describing the com-
municative goals of the system. These are struc-
tures consisting of context tags (e.g. activity identi-
fier, dialogue move tree node, turn tag) and a con-
tent logical form consisting of a Dialogue Move
(e.g. report, wh-question), a priority tag (e.g. warn
or inform), and some additional content tags (e.g.
for objects referred to). An example input logical
form is, “report(inform, agent(AgentID), cancel-
activity(ActivityID))”, which corresponds to the re-
port “I have cancelled flying to the tower” when
AgentID refers to the robot and ActivityID refers to
a “fly to the tower” task.
Items which the system will consider for genera-
tion are placed (either directly by the robot, or indi-
rectly by the Activity Tree) on the “System Agenda”
(SA), which is the part of the dialogue Information
State which stores communicative goals of the sys-
tem. Communicative goals may also exist on the
“Pending List” (PL) which is the part of the infor-
mation state which stores questions that the system
has asked, but which the user has not answered, so
that they may be re-raised by the system. Only ques-
tions previously asked by the system can exist on the
Pending List.
Due to multi-tasking, at any time there is a num-
ber of “Current Activities” which the user and sys-
tem are performing (e.g. fly to the tower, search for
a red car). These activities are topics of conversa-
tion (defining threads of the DMT) represented in
the dialogue information state, and the system’s re-
ports can be generated by them (in which case the
are tagged with that activity label) or can be rele-
vant to an activity in virtue of being about an object
which is in focus because it is involved in that activ-
ity.
Some system reports are more urgent that others
(e.g. “I am running out of fuel”) and these carry the
label warning. Warnings are always relevant, no
matter what activities are current – they always pass
the recency and relevance filters.
Echoing (for noun-phrases) is achieved by access-
ing the Salience List whenever generating referential
terms, and using whatever noun-phrase (if any) the
user has previously employed to refer to the object
in question. If the object is top of the salience list,
the generator will select an anaphoric expression.
The end result of our selection and aggregation
module (see section 6.2) is a fully specified logi-
cal form which is to be sent to the Semantic-Head-
Driven Generation component of Gemini (Shieber
et al., 1990). The bi-directionality of Gemini (i.e.
that we use the same grammar for both parsing and
generation) automatically confers a useful “symme-
try” property on the system – that it only utters sen-
tences which it can also understand. This means that
the user will not be misled by the system into em-
ploying out-of-vocabulary items, or out-of-grammar
constructions. Another side effect of this is that
the system utterances prime the user to make in-
grammar utterances, thus enhancing co-ordination
between user and system in the dialogues.
6.2 Incremental aggregation
Aggregation combines and compresses utterances to
make them more concise, avoid repetitious language
structure, and make the system’s speech more nat-
ural and understandable. In a dialogue system ag-
gregation should function incrementally because ut-
terances are generated on the fly. In dialogue sys-
tems, when constructing an utterance we often have
no information about the utterances that will follow
it, and thus the best we can do is to compress it
or “retro-aggregate” it with utterances that preceded
it. Only occasionally does the System Agenda con-
tain enough unsaid utterances to perform reasonable
“pre-aggregation”.
Each dialogue move type (e.g. report, wh-
question) has its own aggregation rules, stored in
the class for that LF type. In each type, rules spec-
ify which other dialogue move types can aggregate
with it, and exactly how aggregation works. The
rules note identical portions of LFs and unify them,
and then combine the non-identical portions appro-
priately.
For example, the LF that represents the phrase “I
will fly to the tower and I will land at the parking
lot”, will be converted to one representing “I will fly
to the tower and land at the parking lot” according
to the compression rules. Similarly, “I will fly to the
tower and fly to the hospital” gets converted to “I
will fly to the tower and the hospital”.
The “retro-aggregation” rules result in sequences
of system utterances such as, “I have cancelled fly-
ing to the school. And the tower. And landing at the
base.”
7 Summary
We explained the dialogue modelling techniques
which we implemented in order to build a real-
time multi-modal conversational interface to an au-
tonomous device. The novel issues tackled by the
system and its dialogue model are that it is able to
manage conversations about multiple tasks and col-
laborative activities in a robust and natural way.
We argued that in the case of dialogues with
devices, a dialogue management mechanism has
to be particularly robust and flexible, especially
in comparison with finite-state or frame-based di-
alogue managers which have been developed for
information-seeking dialogues, such as travel plan-
ning, where topics of conversation are predeter-
mined. Another challenge was that conversations
may have multiple open topics at any one time, and
this complicates utterance interpretation and gener-
ation.
We discussed the dialogue context model and al-
gorithms used to produce a system with the follow-
ing features:
a0 supports multi-tasking, multiple topics, and
collaboration,
a0 support of commands, questions, revisions, and
reports, over a dynamic environment,
a0 multi-modal, mixed-initiative, open-ended dia-
logues,
a0 echoic and variable message generation, fil-
tered for relevance and recency
a0 asynchronous, real-time operation.
An video demonstration of the system is avail-
able at www-csli.stanford.edu/semlab/
witas/.

References
Lars Ahrenberg, Arne Jonsson, and Nils Dalhbeck. 1990.
Discourse representation and discourse management
for natural language interfaces. In In Proceedings of
the Second Nordic Conference on Text Comprehension
in Man and machine.
James Allen, Donna Byron, Myroslva Dzikovska, George
Ferguson, Lucian Galescu, and Amanda Stent. 2001.
Toward conversational human-computer interaction.
AI Magazine, 22(4):27–37.
Brady Clark, John Fry, Matt Ginzton, Stanley Pe-
ters, Heather Pon-Barry, and Zachary Thomsen-Gray.
2001. Automated tutoring dialogues for training in
shipboard damage control. In Proceedings of SIGdial
2001.
Patrick Doherty, G¨osta Granlund, Krzystof Kuchcinski,
Erik Sandewall, Klas Nordberg, Erik Skarman, and Jo-
han Wiklund. 2000. The WITAS unmanned aerial
vehicle project. In European Conference on Artificial
Intelligence (ECAI 2000).
Renee Elio and Afsaneh Haddadi. 1999. On abstract
task models and conversation policies. In Workshop
on Specifying and Implementing Conversation Poli-
cies, Autonomous Agents’99, Seattle.
Jonathan Ginzburg, Ivan A. Sag, and Matthew Purver.
2001. Integrating Conversational Move Types in
the Grammar of Conversation. In Bi-Dialog 2001—
Proceedings of the 5th Workshop on Formal Semantics
and Pragmatics of Dialogue, pages 45–56.
Beth-Ann Hockey, Gregory Aist, Jim Hieronymous,
Oliver Lemon, and John Dowding. 2002. Targeted
help: Embedded training and methods for evaluation.
In Proceedings of Intelligent Tutoring Systems (ITS).
(to appear).
Oliver Lemon, Anne Bracy, Alexander Gruenstein, and
Stanley Peters. 2001. Information states in a multi-
modal dialogue system for human-robot conversation.
In Peter K¨uhnlein, Hans Reiser, and Henk Zeevat, edi-
tors, 5th Workshop on Formal Semantics and Pragmat-
ics of Dialogue (Bi-Dialog 2001), pages 57 – 67.
Oliver Lemon, Alexander Gruenstein, and Stanley Peters.
2002. Collaborative activities and multi-tasking in di-
alogue systems. Traitement Automatique des Langues
(TAL). Special Issue on Dialogue (to appear).
Susann LuperFoy, Dan Loehr, David Duff, Keith Miller,
Florence Reeder, and Lisa Harper. 1998. An architec-
ture for dialogue management, context tracking, and
pragmatic adaptation in spoken dialogue systems. In
COLING-ACL, pages 794 – 801.
Micheal McTear. 1998. Modelling spoken dialogues
with state transition diagrams: Experiences with the
CSLU toolkit. In Proc 5th International Conference
on Spoken Language Processing.
Charles Rich, Candace Sidner, and Neal Lesh. 2001.
Collagen: applying collaborative discourse theory to
human-computer interaction. AI Magazine, 22(4):15–
25.
S. Seneff, L. Hirschman, and V. W. Zue. 1991. Interac-
tive problem solving and dialogue in the ATIS domain.
In Proceedings of the Fourth DARPA Speech and Nat-
ural Language Workshop. Morgan Kaufmann.
Stuart M. Shieber, Gertjan van Noord, Fernando C. N.
Pereira, and Robert C. Moore. 1990. Semantic-
head-driven generation. Computational Linguistics,
16(1):30–42.
Amanda Stent. 1999. Content planning and generation
in continuous-speech spoken dialog systems. In Pro-
ceedings of KI’99 workshop ”May I Speak Freely?”.
