An Architecture for Dialogue Management, Context Tracking, 
and Pragmatic Adaptation in Spoken Dialogue Systems 
Susann LuperFoy, Dan Loehr, David Duff, Keith Miller, Florence Reeder, Lisa Harper 
The MITRE Corporation 
1820 Dolley Madison Boulevard, McLean VA 22102 USA 
{luperfoy, loehr, duff, keith, freeder, lisah}@mitre.org
Abstract 
This paper details a software architecture for 
discourse processing in spoken dialogue 
systems, where the three component tasks of 
discourse processing are (1) Dialogue Man- 
agement, (2) Context Tracking, and (3) 
Pragmatic Adaptation. We define these three 
component tasks and describe their roles in a 
complex, near-future scenario in which 
multiple humans interact with each other 
and with computers in multiple, simulta- 
neous dialogue exchanges. This paper 
reports on the software modules that accom- 
plish the three component tasks of discourse 
processing, and an architecture for the inter- 
action among these modules and with other 
modules of the spoken dialogue system. A 
motivation of this work is reusable discourse 
processing software for integration with 
non-discourse modules in spoken dialogue 
systems. We document the use of this ar- 
chitecture and its components in several 
prototypes, and also discuss its potential ap- 
plication to spoken dialogue systems defined 
in the near-future scenario. 
Introduction 
We present an architecture for spoken dialogue 
systems for both human-computer interaction 
and computer mediation or analysis of human 
dialogue. The architecture shares many compo- 
nents with those of existing spoken dialogue 
systems, such as CommandTalk (Moore et al. 
1997), Galaxy (Goddeau et al. 1994), TRAINS 
(Allen et al. 1995), Verbmobil (Wahlster 1993), 
Waxholm (Carlson 1996), and others. Our ar- 
chitecture is distinguished from these in its 
treatment of discourse-level processing. 
Most architectures, including ours, contain mod- 
ules for speech recognition and natural language 
interpretation (such as morphology, syntax, and 
sentential semantics). Many include a module 
for interfacing with the back-end application. If 
the dialogue is two-way, the architectures also 
include modules for natural language generation 
and speech synthesis. 
Architectures differ in how they handle dis- 
course. Some have a single, separate module 
labeled "discourse processor", "dialogue com- 
ponent", or perhaps "contextual interpretation". 
Others, including earlier versions of our system, 
bury discourse functions inside other modules, 
such as natural language interpretation or the 
back-end interface. 
An innovation of this work is the compartmen- 
talization of discourse processing into three gen- 
erically definable components--Dialogue Man- 
agement, Context Tracking, and Pragmatic Ad- 
aptation (described in Section 1 below)--and the 
software control structure for interaction be- 
tween these and other components of a spoken 
dialogue system (Section 2). 
In Section 3, we examine the dialogue process- 
ing requirement in a complex scenario involv- 
ing multiple users and multiple simultaneous 
dialogues of diverse types. We describe how 
our architecture supports implementations of 
such a scenario. Finally, we describe two im- 
plemented spoken dialogue systems that embody 
this architecture (Section 4). 
1 Component Tasks of Discourse 
Processing 
We divide discourse-level processing into three 
component tasks: Dialogue Management, Con- 
text Tracking, and Pragmatic Adaptation. 
1.1 Dialogue Management 
The Dialogue Manager is an oversight module 
whose purpose is to facilitate the interaction 
between dialogue participants. In a user-initiated 
system, the dialogue manager directs the proc- 
essing of an input utterance from one component 
to another through interpretation and back-end 
system response. In the process, it detects and 
handles dialogue trouble, invokes the context 
tracker when updates are necessary, generates 
system output, and so on. 
Our conception of the Dialogue Manager as
controller becomes increasingly relevant as the
software system moves away from the standard 
"NL pipeline" in order to deal with dialogue 
disfluencies. Its oversight perspective affords it 
(and the architecture) certain capabilities, which 
are listed in Table 1. 
1. Supports a mixed-initiative system by fielding spontaneous input from either participant and routing it to the appropriate components.
2. Supports non-linguistic dialogue "events" by accepting them and routing them to the Context Tracker (below).
3. Increases overall system performance. For example, awareness of system output allows the Dialogue Manager to predict user input, boosting speech recognition accuracy. Similarly, if the back-end introduces a new word into the discourse, the Dialogue Manager can request the speech recognizer to add it to its vocabulary for later recognition.
4. Supports meta-dialogues between the dialogue system itself and either participant. An example might be a participant's questions about the status of the dialogue system.
5. Acts as a central point for dialogue troubleshooting, after (Duff et al. 1996). If any component has insufficient input to perform its task, it can alert the Dialogue Manager, which can then reconsult a previously invoked component for different output.

Table 1. Dialogue Manager Capabilities
The Dialogue Manager is the primary locus of 
the dialogue agent's outward personality as a 
function of interaction style; its simple protocol 
specifies conditions for interrupting user speech,
for permitting interruption by the user, when to
initiate repair dialogues, and how often to
backchannel.
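
To make the controller role concrete, the following Python sketch shows a dialogue manager routing one utterance through the default pipeline and falling back to centralized troubleshooting on failure. All component interfaces (recognize, interpret, resolve_references, and so on) are hypothetical illustrations, not the implementation described in this paper.

    class RepairNeeded(Exception):
        """Raised by a component that has insufficient input for its task."""

    class DialogueManager:
        def __init__(self, recognizer, interpreter, tracker, adapter, backend):
            self.recognizer = recognizer
            self.interpreter = interpreter
            self.tracker = tracker
            self.adapter = adapter
            self.backend = backend

        def handle_utterance(self, waveform):
            """Route one user utterance through the default firing order."""
            try:
                text = self.recognizer.recognize(waveform)
                logical_form = self.interpreter.interpret(text)
                resolved = self.tracker.resolve_references(logical_form)
                command = self.adapter.to_backend_command(resolved)
                return self.backend.execute(command)
            except RepairNeeded as trouble:
                # Central troubleshooting: initiate a repair dialogue
                # rather than continuing the flow toward the back-end.
                return self.initiate_repair(trouble)

        def handle_event(self, event):
            """Non-linguistic dialogue events go to the Context Tracker."""
            self.tracker.record(event)

        def initiate_repair(self, trouble):
            # Placeholder: push a repair dialogue and generate output.
            return f"Clarification needed: {trouble}"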
1.2 Context Tracking 
The Context Tracker maintains a record of the 
discourse context which it and other components 
can consult in order to (a) resolve dependent 
forms that occur in input utterances and (b) gen- 
erate appropriate context-dependent forms for 
achieving natural output. Interpretation of defi- 
nite pronouns, demonstratives (this, those), in- 
dexicals (you, now, here, tomorrow), definite 
NPs (a car...the car), one-anaphora (the earlier 
one) and ellipsis (how about Seattle) all rely on 
stored context. 
The Context Tracker strives to record only those 
entities and events that could become eligible for 
reference. Context thus includes linguistic com- 
municative acts (verbalizations), non-linguistic 
communicative acts (gesture), and non- 
communicative events that are deemed salient. 
Since determining salience requires a judge- 
ment, our implementations rely on heuristic 
rules to decide which events and objects get 
entered into the context representation. For ex- 
ample, the disappearance of a simulated vehicle 
off the edge of a map display might be deemed 
salient relative to a particular user model, the 
discourse history, or the task structure. 
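
As an illustration, here is a minimal sketch of heuristic salience filtering and recency-based reference resolution. The event format, the salience rules, and the type-constraint matching are all invented for the example; this is not the algorithm of (LuperFoy 1992).

    class ContextTracker:
        def __init__(self):
            self.entities = []   # discourse entities, most recent last

        def is_salient(self, event):
            """Heuristic rules decide what enters the context representation."""
            # e.g., a vehicle leaving the visible map area, or a mention
            return event.get("visible_change") or event.get("mentioned")

        def record(self, event):
            if self.is_salient(event):
                self.entities.append(event["entity"])

        def resolve(self, dependent_form):
            """Resolve a dependent form (e.g., a pronoun) against context."""
            # Naive recency heuristic: most recent type-compatible entity.
            for entity in reversed(self.entities):
                if entity["type"] == dependent_form["type_constraint"]:
                    return entity
            return None   # alert the Dialogue Manager: repair may be needed

    tracker = ContextTracker()
    tracker.record({"mentioned": True,
                    "entity": {"type": "task", "name": "Project X"}})
    print(tracker.resolve({"type_constraint": "task"}))   # -> Project X entity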
1.3 Pragmatic Adaptation 
The Pragmatic Adaptation module serves as the 
boundary between language and action by de- 
termining what action to take given an inter- 
preted input utterance or a back-end response. 
This module's role is to "make sense" of a 
communicative act in the current linguistic and 
non-linguistic context. 
The Pragmatic Adapter receives an interpreta- 
tion of an input utterance with context- 
dependent forms resolved. It then proceeds to 
translate that utterance into a valid back-end 
command. It checks for violations of the Do- 
main Model, which contains information about 
the back-end system such as allowable parame- 
ter values for command arguments. It also 
checks for commands that are infelicitous given 
the current Back-end State (e.g., the referenced 
vehicle does not exist at the moment). The 
Pragmatic Adapter combines the result of these 
simple tests and a set of if-then heuristics to 
determine whether to send through the command 
or to intercept the utterance and notify the Dia- 
logue Manager to initiate a repair dialogue with 
the user. 
The Pragmatic Adapter receives output re- 
sponses from the back-end and adapts or "trans- 
lates" them into natural language communica- 
tions which get incorporated by the Context 
Tracker into the dialogue history. 
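
A minimal sketch of these felicity checks follows, assuming a toy Domain Model and Back-end State; the command names, parameter ranges, and dictionary formats are illustrative only.

    DOMAIN_MODEL = {"move": {"speed": range(0, 71)}}    # allowable arg values
    BACKEND_STATE = {"existing_vehicles": {"Bravo"}}    # current back-end state

    def adapt(utterance):
        """Return a back-end command, or None to signal a repair dialogue."""
        cmd, args = utterance["command"], utterance["args"]
        # Domain Model check: are the parameter values allowable?
        for name, value in args.items():
            allowed = DOMAIN_MODEL.get(cmd, {}).get(name)
            if allowed is not None and value not in allowed:
                return None   # infelicitous value: notify the Dialogue Manager
        # Back-end State check: does the referenced vehicle exist right now?
        vehicle = args.get("vehicle")
        if vehicle is not None and vehicle not in BACKEND_STATE["existing_vehicles"]:
            return None
        return (cmd, args)    # send the command through

    print(adapt({"command": "move", "args": {"speed": 40, "vehicle": "Bravo"}}))
    print(adapt({"command": "move", "args": {"speed": 90, "vehicle": "Bravo"}}))  # None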
2 An Architecture for Spoken 
Dialogue Systems 
Having introduced our three discourse compo- 
nents, we now present our overall architecture. 
It is laid out in Figure 1, and its components are 
described in Table 2, starting from the user and 
going clockwise. The discourse components are 
left in white, while non-discourse components 
have been shaded gray. 
[Figure 1: architecture diagram. The components Speech Recognition, Natural Language Interpretation, Context Tracking (on Input), Pragmatic Adaptation (on Input), Back-End, Pragmatic Adaptation (on Output), Context Tracking (on Output), Natural Language Generation, and Speech Synthesis are connected by communication links to the Dialogue Manager; arrows mark the default order of firing, changeable by the Dialogue Manager.]

Figure 1. An Architecture for Spoken Dialogue Systems, with Discourse Components in White
".ml~Oncnt (A,~,cnt) Bri~f l)cscription Pos.~'ihlc Inlml Pos.~ihh" ().tim 
Speech Reco\[nition:: 
'NL Interp~tation : 
Context Tracking 
on Input 
Pragmatic Adaptation 
on Input 
Convert waveform to Strin~ 6f words ::: 
Converi words to meaning representation 
Track discourse entities of input utterance, 
resolve dependent references 
Convert logical form to back-end command 
Waveform :: 
: Text striri\[ ~ ."~ 
Logical form (with 
dependent references) 
Logical form 
Text string: : :~: ,: : 
Lo~ic~ form, /~,,, 
Logical form (with 
dependent references 
replaced by their referents) 
Back-end command 
Pragmatic Adaptation 
on Output 
Context Tracking 
on Output 
Dialogue Manager 
Convert back-end response to logical form 
representation of communicative act 
Track discourse entities of output utterance, 
insert dependent references (if desired) 
High-level control, intelligently route 
information between all agents and partici- 
pants (see section 1.1) based on its own 
protocol for interaction. 
Back-end response 
Logical form (w/out 
dependent references) 
Various 
Logical form 
Logical form (conditioned 
by discourse context) 
Various 
796 
Table 2. Description of the Architecture Components, with Discourse Components in White 
Several items are of note in Figure 1 and Table 
2. First, although a default firing order is 
shown, this order is perturbed any time dialogue 
trouble arises. For example, a Speech Recogni-
tion (SR) error may be detected after Natural
Language Interpretation fails to parse the output 
of SR. Rather than continuing the flow on to- 
wards the back-end, the Dialogue Manager can 
re-consult SR for other hypotheses. Alterna- 
tively, the Dialogue Manager can fire Natural 
Language Generation with an output request for 
clarification. That request gets incorporated into 
the context representation by Context Tracking, 
the dialogue state is "pushed" in a repair dia- 
logue, and a string is ultimately sent to Speech 
Synthesis for delivery to the user's ear. The next 
utterance is then interpreted in the context of the 
repair dialogue. 
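
The re-consultation step might look like the following sketch, assuming a recognizer that can return an n-best list of hypotheses (a hypothetical interface, not a specific recognizer's API).

    def interpret_with_reconsult(recognizer, interpreter, waveform, max_tries=3):
        """Try successive recognition hypotheses until one parses."""
        for hypothesis in recognizer.nbest(waveform)[:max_tries]:
            logical_form = interpreter.try_parse(hypothesis)
            if logical_form is not None:
                return logical_form
        # No hypothesis parsed: fall back to a clarification request,
        # pushing a repair dialogue onto the dialogue state.
        return {"dialogue_act": "request_clarification"}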
Note also that Context Tracking and Pragmatic
Adaptation are called twice each: on "input"
(from the user), and on "output" (from the back- 
end). The logical Context Tracker may be im- 
plemented as one or as two related modules, 
together tracking both sides of that dialogue so 
that either user or system can make anaphoric 
mention of entities introduced earlier. 
3 A Near-Future Scenario of Spoken 
Dialogue Systems 
3.1 The Scenario 
We build on images from the popular science 
fiction series Star Trek as a rich source of dia- 
logue types in complex interrelations. These 
example dialogues have more primitive cousins 
under development today. 
Briefly, our example dialogue types are listed in 
Table 3. 
| Dialogue Type | Star Trek Example | Participants | Description |
|---|---|---|---|
| Dialogue with an Appliance | Food Replicator | Human <-> Food Replicator | The "Food Replicator" accepts a structured English command language such as "Tea. Earl Grey. Hot" and produces results in the physical world. |
| Dialogue with an Application | Ship's Computer | Human <-> Ship's Computer | The ship's computer is an advanced application which can understand natural language queries, and replies either via actions or via a multimodal interface. |
| Dialogue with an Intelligent Robot | Android "Data" | Human <-> Android "Data" | "Data" converses as a human while providing the information processing of a computer, and is capable of action in the physical world. |
| Computer Mediation of Human Dialogue | Universal Translator | Human <-> Human (mediated by the Universal Translator) | The "Universal Translator" is capable of automatically interpreting between any two humans. |
| Computer Analysis of Human Dialogue | Conversation Playback | Human <-> Human (analyzed by the Ship's Computer) | The ship's computer has the ability to retrieve, play back, and analyze previously-recorded conversations. In this sense, the dialogue becomes empirical data to be analyzed. |
| Dialogue between 2 Characters | Holodeck | Character <-> Character | The "Holodeck" creates simulated humans (or characters) as actors, for the entertainment or training of human viewers. |

Table 3. A Scenario of Dialogue Types
3.2 Application of the Architecture to 
the Scenario 
We now describe the role our architecture, and 
specifically our discourse components, play in 
these near-future examples. 
3.2.1 Dialogue with a Back-End Computer 
The first three examples illustrate dialogues in 
which a human is talking to a computer. One 
dimension distinguishing the three examples is 
the agent's intelligent use of context. In a dia- 
logue with an "appliance", simple, structured, 
unambiguous command language utterances are 
interpreted one at a time in isolation from prior 
dialogue history. The Pragmatic Adaptation 
facility can follow a simple scheme for mapping 
each utterance to one of a very few back-end 
commands. The Context Tracker has no cross- 
sentence dependent references to contend with, 
and finally, since the appliance provides no lin- 
guistic feedback, the Dialogue Manager fires 
none of the "output" components (from back- 
end to human). In a dialogue with a more so-
phisticated application or with a robot, the Dialogue
Manager, Context Tracker, and Pragmatic 
Adapter need greater functionality, to handle 
both linguistic and non-linguistic events in both 
directions. 
3.2.2 Computer-Mediated Dialogue 
The fourth example, that of the Universal 
Translator, is representative of a general dia- 
logue type we label Mediator, in which an agent 
plays a mediation role between humans. In ad- 
dition to interpretation, other roles of the me- 
diator might be (Table 4): 
1. A Genie, which is available for meta-dialogues with the system itself, instead of with the dialogue partner (much as a human might ask an interpreter to repeat the partner's last utterance).
2. A Moderator, which, in multi-party dialogues, enforces an agreed-upon interaction protocol, such as Robert's Rules of Order or a talk-show format (under control of the host).
3. A Bouncer, which decides who may join the dialogue based on current enrollment (first-come-first-served), clearance level, invitation list, etc., as well as permitting different types of participation, so that some may only listen while others may fully participate.
4. A Stenographer, which records the dialogue, and prepares a "visualization" of the dialogue structure.

Table 4. Roles of a Mediator Agent
Our architecture is applicable to mediated dia- 
logues as well. In fact, it was first developed for 
bilingual dialogue in a voice-to-voice machine 
translation application. In this application, the 
Dialogue Manager is available for meta- 
dialogues with either user (as in Could you re- 
peat her last utterance?), and the Context 
Tracker can use a single discourse representation 
structure to track the unfolding context in both 
languages. 
3.2.3 Computer-Analyzed Dialogue 
Our fifth example, a post-hoc analysis of a dia- 
logue, does not require real-time processing. It 
is, nonetheless, a dialogue which can be ana- 
lyzed using the components of our architecture, 
exactly as if it were real-time. The only differ- 
ence is that no generation will be required, only 
analysis; thus, the Dialogue Manager need only 
fire the "input" components on each utterance. 
3.2.4 Character-Character Dialogue 
Our last example concerns a simulated human 
dialogue between two computer characters, for 
the benefit of human viewers. Such character- 
character dialogues have been produced by sev- 
eral researchers, including (Kalra et al. 1998). 
Here, the architecture applies at two levels. 
First, the architecture can be internal to each 
agent, to implement that agent's conversational 
ability. Second, the architecture can be used 
externally to analyze the agents' dialogue, as 
discussed in the previous section. 
4 Implementations of the Architecture 
We have implemented two spoken dialogue 
systems using the architecture presented. The 
first is a telephone-based interface to a simulated 
employee Time Reporting System (TRS), as 
might be used at a large corporation. We then 
ported the system to a spoken interface to a bat- 
tlefield simulation (Modular Semi-Automated 
Forces, or ModSAF). 
In our implementation of this architecture, each 
component is a unique agent which may reside 
on its own platform and communicate over a 
network. The middleware our agents use to 
communicate is the Open Agent Architecture 
(OAA) (Moran et al. 1997) from SRI. The 
OAA's flexibility allowed us to easily hook up 
modules and experiment with the division of 
labor between the three discourse components 
we are studying. We treat the Dialogue Manager 
as a special OAA agent that insists on being 
called frequently so that it can monitor the pro- 
gress of communicative events through the sys- 
tem. 
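
The agent wrapping can be pictured with the sketch below, which uses a generic in-process message bus as a stand-in for the OAA middleware; it is not the OAA API itself.

    class Bus:
        """A toy message bus standing in for agent middleware."""
        def __init__(self):
            self.handlers = {}

        def register(self, capability, handler):
            self.handlers[capability] = handler

        def request(self, capability, payload):
            return self.handlers[capability](payload)

    bus = Bus()
    # Each component registers the capability it provides.
    bus.register("recognize", lambda wav: "charge 6 hours to it today")
    bus.register("interpret", lambda text: {"command": "charge", "raw": text})

    # The Dialogue Manager can monitor progress by mediating every request.
    text = bus.request("recognize", b"...waveform...")
    form = bus.request("interpret", text)
    print(form)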
4.1 The Time Reporting System (TRS) 
The architecture components in our TRS system 
are listed in Table 5, along with their specific 
implementations used. Each implemented mod- 
ule included a thin OAA agent layer, allowing it 
to communicate via the OAA. 
| Component | Implementation |
|---|---|
| NL Interpretation/Generation | Simulated |
| Back-End Interface | Simulated |
| Context Tracking | (LuperFoy 1992) |
| Pragmatic Adaptation | Currently simulated |
| Dialogue Manager | Current development |

Table 5. Components of TRS System, with
Discourse Components in White
Components not in our focus (shaded in gray) 
are either commercial or simulated software. For 
Context Tracking, we use an algorithm based on 
(LuperFoy 1992). For Dialogue Management, 
we developed a simple agent able to control a 
system-initiated dialogue, as well as handle non- 
linguistic events from the back-end. The third 
discourse component, Pragmatic Adaptation, 
awaits future research, and was simulated for 
this system. 
Figure 2 presents a sample TRS dialogue. 
System: Welcome. What is your employee number? 
User: 12345 
System: What is your password? 
User: 54321 
System: How can I help you? 
User: What's the first charge number? 
System: 123GK498J 
User: What's the name of that task? 
System: Project X 
User: Charge 6 hours to it today for me. 
System: 6 hours has been charged to Project X. 
Figure 2. Sample TRS Dialogue 
When the user logs in, the back-end system 
brings up a non-linguistic event--the list of 
tasks, with associated charge numbers, which 
belong to the user. The Dialogue Manager re- 
ceives this and passes it to the Context Tracker. 
The Context Tracker is then able to resolve the 
first charge number, as well as subsequent de- 
pendent references such as that task, it, and to- 
day. 
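
A sketch of how such an event could seed the context follows, with invented task data and an ordinal-resolution helper; this is not the TRS code itself.

    # The non-linguistic login event delivers the user's task list.
    tasks = [{"charge_number": "123GK498J", "name": "Project X"},
             {"charge_number": "456HL721M", "name": "Project Y"}]

    context = {"tasks": tasks, "focus": None}

    def resolve_ordinal(context, index):
        """Resolve e.g. 'the first charge number' against the event data."""
        task = context["tasks"][index]
        context["focus"] = task    # later, 'that task' and 'it' resolve here
        return task["charge_number"]

    print(resolve_ordinal(context, 0))    # -> 123GK498J
    print(context["focus"]["name"])       # -> Project X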
4.2 The ModSAF Interface 
We ported the TRS demo to a simulated battle- 
field back-end called ModSAF. We used the 
same components with the exception of the 
speech recognizer and the back-end interface. 
The Dialogue Manager was improved over the 
TRS demo in several ways. First, we added the 
capability of the Dialogue Manager to dynami- 
cally inform the speech recognizer of what input 
to expect, i.e., which language model to use. The 
Dialogue Manager could also add words to the 
speech recognizer's vocabulary on the fly. We 
chose Nuance (from Nuance Communications) 
as our speech recognition component specifi- 
cally because it supports such run-time updates. 
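
The run-time steering might look like the sketch below; the recognizer interface shown is hypothetical, and the actual vendor API for run-time grammar and vocabulary updates is not reproduced here.

    class Recognizer:
        """Hypothetical recognizer interface, not the Nuance API."""
        def __init__(self):
            self.active_grammar = "top_level"
            self.vocabulary = set()

        def set_grammar(self, name):
            self.active_grammar = name    # expect e.g. entity-creation input

        def add_word(self, word):
            self.vocabulary.add(word)     # word usable in later recognition

    recognizer = Recognizer()
    # Entering an entity-creation subdialogue: restrict expected input.
    recognizer.set_grammar("entity_creation")
    # Back-end introduces a new platoon name: extend the vocabulary.
    recognizer.add_word("Bravo")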
Figure 3 presents a sample ModSAF dialogue. 
Note that only the user speaks. 
• Create an M1A2 platoon.
• Name it Bravo. 
• Give it location 4 9 degrees 3 0 minutes north, 
1 1 degrees 4 5 minutes east. 
• Bravo, advance to checkpoint Charlie. 
(At this point, a new platoon appears on the screen, 
created by another player in the simulation) 
• Zoom in on that new platoon. 
• Bravo, change location and approach X. 
(Where X is the name of the new platoon.) 
Figure 3. Sample ModSAF Dialogue 
When the user asks to create an entity, the Dia- 
logue Manager detects the beginning of a sub- 
dialogue, and informs the speech recognizer to 
restrict its expected grammar to that of entity 
creation (name and location). Later, the back- 
end (ModSAF) sends the Dialogue Manager a 
non-linguistic event, in which a different platoon 
(created by another player in the simulation) 
appears. This event includes a name for the new 
platoon; the Dialogue Manager passes this to the 
speech recognizer, so that it may later recognize 
it. In addition, the event is passed to the Context 
Tracker, so that it may later resolve the reference 
that new platoon. 
To illustrate some advantages of our architec- 
ture, we briefly mention what we needed to 
change to port from TRS to ModSAF. First, the 
Context Tracker needed no change at 
all--operating on linguistic principles, it is do- 
main-independent. LuperFoy's framework does 
provide for a layer connected to a knowledge 
source, for external context--this would need to 
be changed when changing domains. The Dia- 
logue Manager also required little change to its 
core code, adding only the ability to influence 
the speech recognizer. The Pragmatic Adapta- 
tion Module, being dependent on the domain of 
the back-end, is where most changes are needed 
when switching domains. 
Conclusion 
We have presented a modular, flexible architec- 
ture for spoken dialogue systems which sepa- 
rates discourse processing into three component 
tasks with three corresponding software mod- 
ules: Dialogue Management, Context Tracking, 
and Pragmatic Adaptation. We discussed the 
roles of these components in a complex, near- 
future scenario portraying a variety of dialogue 
types. We closed by describing implementations 
of these dialogues using the architecture pre- 
sented, including development and porting of the 
first two discourse components. 
The architecture itself is derived from a standard 
blackboard control structure. This is appropriate 
for our current dialogue processing research in 
two ways. First, it does not require a prior full 
enumeration of all possible subroutine firing 
sequences. Rather, the possibilities emerge from 
local decisions made by modules that communi- 
cate with the blackboard, depositing data and 
consuming data from the blackboard. Second, 
as we learn categories of dialogue segment 
types, we can move away from the fully decen- 
tralized control structure, to one in which the 
central Dialogue Manager, as a blackboard 
module with special status, assumes increasing 
decision power for processing flow, in cases of 
dialogue segment type with which it is familiar. 
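
A toy version of this emergent firing behavior is sketched below, with invented module descriptions: a module fires whenever the data it consumes appears on the blackboard, so no firing sequence is enumerated in advance.

    blackboard = {}

    def ready(module):
        """A module fires when all the data it consumes is on the board."""
        return all(key in blackboard for key in module["consumes"])

    modules = [
        {"name": "recognizer",  "consumes": ["waveform"],     "produces": "text"},
        {"name": "interpreter", "consumes": ["text"],         "produces": "logical_form"},
        {"name": "adapter",     "consumes": ["logical_form"], "produces": "command"},
    ]

    blackboard["waveform"] = b"..."
    fired = True
    while fired:                      # keep firing until the board is quiescent
        fired = False
        for module in modules:
            if ready(module) and module["produces"] not in blackboard:
                blackboard[module["produces"]] = f"<{module['name']} output>"
                fired = True
    print(blackboard)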
The intended contribution of this work is thus in 
the generic definition of standard dialogue func- 
tions such as dynamic troubleshooting (repair), 
context updating, anaphora resolution, and 
translation of natural language interpretations 
into functional interface languages of back-end 
systems. 
Future work includes investigation of issues 
raised when a human is engaged in more than 
one of our scenario dialogues concurrently. For 
example, how does one speech-enabled dialogue
system among many determine when it is being
addressed by the user, and how can the system
judge whether the current utterance is human-
computer, i.e., to be fully interpreted and acted
upon by the system, as opposed to a human-
human utterance that is to be simply recorded,
transcribed, or translated without interpretation?

References 
Allen J., Schubert L., Ferguson G., Heeman P., 
Hwang C., Kato T., Light M., Martin N., Miller B., 
Poesio M., Traum D. (1995) The TRAINS Project: 
A case study in building a conversational planning 
agent. Journal of Experimental and Theoretical AI, 
7, pp. 7-48.
Carlson R. (1996) The Dialogue Component in the 
Waxholm System. Proc. Twente Workshop on Lan- 
guage Technology: Dialogue Management in Natu- 
ral Language Systems, University of Twente, the 
Netherlands. 
Duff D., Gates B., LuperFoy S. (1996) A Centralized 
Troubleshooting Mechanism for a Spoken Dia- 
logue Interface to a Simulation Application. Proc. 
International Conference on Spoken Language 
Processing. 
Goddeau D., Brill E., Glass J., Pao C., Phillips M., 
Polifroni J., Seneff S., Zue V. (1994) GALAXY: A 
Human-Language Interface to On-line Travel In- 
formation. Proc. International Conference on Spo- 
ken Language Processing. 
Kalra P., Thalmann N., Becheiraz P., Thalmann D.
(1998) Communication Between Synthetic Actors. 
In "Automated Spoken Dialogue Systems", S. Lu- 
perFoy, ed. MIT Press (forthcoming). 
LuperFoy S. (1992) The Representation of Multi-
modal User-Interface Dialogues Using Discourse 
Pegs. Proc. Annual Meeting of the Association for 
Computational Linguistics. 
Moore R., Dowding J., Bratt H., Gawron J., Gorfu 
Y., Cheyer, A. (1997) CommandTalk: A Spoken- 
Language Interface for Battlefield Simulations. 
Proc. Fifth Conference on Applied Natural Lan- 
guage Processing. 
Moran D., Cheyer A., Julia L., Martin D., Park S.
(1997) Multimodal User Interfaces in the Open 
Agent Architecture. Proc. International Conference 
on Intelligent User Interfaces. 
Wahlster W. (1993) Verbmobil: Translation of Face- 
To-Face Dialogues. In "Grundlagen und Anwendungen der Künstlichen Intelligenz", O. Herzog, T. Christaller, D. Schütt, eds., Springer.
