Speech-Graphics Dialogue Systems 
Alan W. Biermann, Michael S. Fulkerson, Greg A. Keim 
Duke University 
{awb, msf, ke J.m)@cs. duke. edu 
1 A Theory of Dialogue 
The central mechanism of a dialogue system must 
be a planner (Allen et al., 1994; Smith et al., 1995; 
Young et al., 1989) that seeks the dialogue goal and 
organizes all behaviors for that purpose. Our project 
uses a hybrid Prolog-like planner (Smith and Hipp, 
1994) which first attempts to prove the top-most 
goal and then initiates interactions with the user 
when the proof cannot easily be achieved. Specif- 
ically, it attempts to discover key missing axioms in 
the proof that prevent its completion and that may 
be attainable with the help of the user. The pur- 
poses of the interaction are to gather the missing 
information and to eventually achieve the top-most 
goal. 
Once the structure of the system is settled, a va- 
riety of desirable behaviors for realistic dialogue can 
be programmed. These include subdialogue behav- 
iors, variable initiative, the ability to account for a 
user model, the use of expectation for error correc- 
tion purposes, and the ability to handle multimedia 
input and output. Each of these is described in the 
following paragraphs. 
1.1 Subdialogue behaviors 
Traditional analyses of human-human dialogue de- 
compose sequences into segments which are locally 
coherent and which individually address their own 
subgoals in the overall dialogue structure. (Hobbs, 
1979; Reichman, 1985; Grosz and Sidner, 1986; 
Lochbaurn, 1991). Such a segment is opened for a 
specific purpose, may involve a series of interactions 
between participants, and may be closed having suc- 
cessfuUy achieved the target subgoal. Such a seg- 
ment may be interrupted for the purpose of achiev- 
ing a new, locally discovered, subgoal or for ap- 
proaching a different goal. It may also fail to achieve 
success and be abandoned. Typical dialogues involve 
repeatedly opening such segments, pursuing one sub- 
goal, jumping to another, returning to a previous 
subgoal and so forth until the highest level goal is 
achieved or abandoned. 
The Prolog-like proof tree enables this kind of be- 
havior because the dialogue segments can be built 
around the explicit subgoals of the proof tree. Con- 
trol for the search can be governed by the domain 
dependent characteristics of the subproofs. The or- 
dinary Prolog depth first search is not used and, 
instead, control can pass from subgoal to subgoal 
to match the segmental behavior that is normal for 
such dialogues. 
1.2 Variable initiative 
The primary facility needed for variable initiative is 
the ability either to control the movements between 
subgoals (dialogue segments) or to release control 
and to follow the user's movements (Guinn, 1995; 
Kitano and Ess-Dykema, 1991; Novick, 1988; Walker 
and Whittaker, 1990). Controlling the movement 
requires that the system have domain information 
available to guide decisions concerning which direc- 
tions may be good to take. Having made these deci- 
sions, the system then jumps to the associated subdi- 
alogs and follows its plan to completion. Releasing 
control to the other participant involves matcbing 
incoming utterances to expected interactions for the 
various available subgoals and following the user to 
subgoals where matches are found. This is called 
plan recognition in the literature and has been the 
object of much study (Allen and Perrault, 1980; Lit- 
man and Allen, 1987; Pollack, 1986; Carberry, 1988; 
Carberry, 1990). Mechanisms for both managing 
movement between subgoals and deciding when to 
release control to the other participant, including 
extensive analyses of their effectiveness, are given in 
(Smith and Hipp, 1994; Guinn, 1995; Guinn, 1996). 
1.3 Accounting for the user model 
Efficient dialogue requires that the knowledge and 
abilities of the other participant be accounted for 
(Kobsa and Wahlster, 1989). When the system pro- 
121 
rides information to the user, it is important that' 
it present the new information at the appropriate 
level. If the system describes details that are al- 
ready known to the user, he or she will become de- 
moralized. If the system fails to give needed infor- 
mation, the user will cease to function effectively. 
The Prolog theorem proving system provides a nat- 
ural means for encoding and using the user model 
without major additional mechanisms. The missing 
axiom discovery mechanism simply selects subdia- 
logues for interaction at the levels in the proof tree 
where the user has knowledge and these levels are 
where the interaction occurs (Smith et al., 1995). 
1.4 Expectation for the purposes of error 
correction 
Because all interactions in a given subdiaiogue are 
occurring in the context of the associated subgoal, 
the actual vocabulary and syntax that are locally ap- 
propriate may be anticipated (Young et al., 1989). 
Thus, for example, if the system has asked the user 
to measure a certain voltage, the Prolog theorem 
proving tree will include locally the possibilities that 
the user has responded successfully (as "I read six 
volts"), that the user has asked for clarification ("at 
which terminal?"), that the user needs instruction 
("how do I measure that?"), or that the user has 
failed to satisfy the request ("no"). The error cor- 
rection mechanism looks for an expected input that 
has a low Hamming distance (weighted) from the ac- 
tual recognized input and chooses the best match to 
the input that it will respond to. If the match does 
not exceed a specified threshold, the system could 
look for matches on other recent subdialogues to de- 
termine whether the user is attempting to move to 
another subject. If no match is found, the system 
could also ask for a repeat of the spoken input. In 
tests of the Circuit Fixit Shoppe, the Hamming dis- 
tance algorithm alone corrected an utterance level 
error rate from 50 percent down to 18.5 percent in a 
series of 141 dialogues (Hipp, 1992; Smith and Gor- 
don, 1996). 
1.5 Multimedia input and output 
capabilities 
The original version of this architecture envisioned 
only speech in and speech out as the communication 
media. Speech input (such as "The voltage is six 
volts") was translated to predicated form (such as 
answer(measure(1;17,t202,6))) and turned over 
to the theorem proving mechanism. Similarly, out- 
puts from Prolog were converted to strings of text 
that were enunciated by a speech synthesizer. In 
recent years, however, our project (Biermann and 
Long, 1996) has experimented with multimedia 
grammars that convert full multimedia communica- 
tion to and from the internal predicate form. The 
following sections describe our method. 
2 Multimedia Grammars 
We designed a multimedia grammar made up of a 
series of operators that relate media syntax and se- 
mantics. Each operator accounts specifically for a 
syntactic item and simultaneously executes code in 
the semantic world which is appropriate for that 
syntax. For example, in a programming domain 
where one might refer to the lines of code on the 
screen, a useful operator is llne which finds the set 
of all lines on the screen within the current region 
of focus. Other operators find other sets (associated 
with nouns), find subsets of sets (as with the adjec- 
tive "capitalized"), select out individuals (as with 
an ordinal), specify relationships (as with contain- 
ment), and call for changes on the screen (as with 
"delete"). An important characteristic of such op- 
erators is that their syntactic and semantic portions 
are specified by a general purpose language (C++) 
so that they can manipulate any media or semantic 
objects that the designer may address. 'While our 
prototype system has used only spoken and text En- 
glish and graphical pointing (highlighting or arrows), 
the approach could conceivably involve full graphi- 
cal capabilities, mechanical devices, or other input- 
output media. Our approach is in contrast with the 
methods of (Feiner and McKeown, 1993; Wahlster et 
al., 1993) where communications are split into sev- 
eral media and then those media are coordinated for 
presentation to the user. Other work on multimedia 
communication is surveyed in (Maybury, 1993). 
The multimedia grammar is demonstrated in the 
generation of the phrase "the fifth character in this 
line" with highlighting of a specified line as given 
in Figure 2. The domain is Pascal tutoring and the 
phrase specifies a particular character that the sys- 
tem wishes to comment on. An example of this type 
of reference from the actual system is shown in Fig- 
ure 1. 
Such a grammar can be used either for genera- 
tion or input. In the generation mode, the target 
meaning is known (it is a particular character "1" 
in the example) and a sequence of operators is to be 
found that can achieve the target meaning. The as- 
sociated syntax becomes the output to be presented 
to the user ("the fifth character in this line" (with 
pointer) in the example). In the parsing mode, the 
target syntax is known and a sequence of operators 
is desired that can account for the syntax. These 
operators will then compute th e meaning for the ut- 
122 
Operator 
llne 
Syntax 
line 
Semantics 
\[begin\] 
\[writln( 'Hello' ) ;\] 
\[end. \] 
Complexity 
this this line \[writln (' Hello' ) ; \] Ct~i,_poi,ter * log(n) 
(with pointer) (with pointer) 
in in this line writln( 'Hello ' ) ; Cm 
(with pointer) 
character character in this line \[w\] Jr\] \[i\] \[1;\] \[1\] In\] ... Cchar 
(with pointer) 
ordinal fifth character in this line \[i\] Cora * 5 
(with pointer) 
the the fifth character in this line 1 Cart 
(with pointer) 
Figure 2: The operator grammar generating syntax to select an item on the screen. 
Figure I: A screen from the Duke Programming Tu- 
tor: "There is an error at the fifth character in this 
line." 
terance. Our project uses this grammar for output 
only because we have separately invested major ef- 
forts in the error correction system that has not been 
merged with the multimedia grammar. 
This grammar, of course, has the ability to gen- 
erate a variety of outputs: "this character" (with 
pointer), "the tenth character", "the fifth character 
in the second line", "this character in the second 
line" (with pointer), etc. A mechanism needs to be 
devised that will select among these choices and that 
will also prune the search to avoid unnecessary com- 
putation. Our system uses complexity numbers as 
shown in Figure 2 and seeks a minimum complexity 
utterance. The methodology is experimental and as- 
signs a complexity constant to each operator. The 
pointer complexity is also multiplied by log(n) where 
n is the number of items that are being distinguished 
from. The intuition here is that a geometrical mech- 
anism centers on the highlighted item. The ordinal 
is multiplied by v, the value of the ordinal, on the 
intuition that the user may actually count out the 
number specified by the ordinal. The actual values 
of the constants are obtained by training continu- 
ously as the user operates the system. This is ex- 
plained in the next section. 
3 Learning User Preferences 
The complexity constants (Cline, OG,, etc.) for gen- 
eration are learned and continuously updated dur- 
ing normal operation. The system gathers feedback 
from the user via any means the designer may choose 
and seeks a set of generation constants that optimize 
user satisfaction. In a test of the system (Biermann 
and Long, 1996), the feedback mechanism was sim- 
ply the time required for the user to respond. A 
quick response was thus recorded as encouragement 
to continue the current type of generation and a long 
response acted to encourage the system to experi- 
ment with other values for the constants and seek 
a new optimum. The specifics of the learning algo- 
rithm employed are explained in (Long, 1996). 
The graph in Figure 3 illustrates this process by 
tracking two of these constants through a sample 
run of the system. The system begins with a bias 
towards highlighting, evidenced by its lower relative 
value as compared to that of using ordinals. How- 
ever,in the middle of the run, the user begins taking 
a long time to respond to the use of highlighting. 
Eventually, this drives the system to try ordinals, to 
which the user responds more quickly. This has the 
effect of lowering the constant for ordinals, thereby 
123 
Ordinals 
Highlighting 
00:45 
E 
I =" oo:3o 
O 00:'15 
D. 
(/) 
CI~ 00:00 
Learning: Adapting to the User's Preferences 
t n AAA| 
1 2 3 , S 8 7 12 ,5 
Utterance Number 
t2 
¢J 
0 
Figure 3: Adapting to the user's preferences. 
making it the prefered output mode. Note that the 
algorithm also has an exploration parameter, which 
is the probability that it will choose a mode other 
than what is currently prefered. This allows the 
algorithm to periodically test modes that it might 
otherwise avoid, and explains why the system used 
ordinals for the seventh response, despite the higher 
constant. 
4 Building Dialogue Systems 
Our most recent voice dialogue system incorporates 
many of the ideas outlined above. The Duke Pro- 
gramming Tutor allows students in the introductory 
Computer Science course to write and debug sim- 
ple programs, communicating with the system using 
voice, text and selection with the mouse. The sys- 
tem can respond with debugging or tutorial informa- 
tion, presented as a combination of speech, text and 
graphics. In the fall of 1996, 15 Duke undergradu- 
ates used the Duke Programming Tutor in place of 
their regular weekly lab. These sessions lasted about 
half an hour, and students received less than 3 min- 
utes of instruction about how to use the system. For 
most students, this was only the second or third time 
they had debugged a program. 
While constructing this system, we often wanted 
to add modules to explore new ideas: an animated 
face, the machine learning of the output mode pref- 
erences, a novel dialogue control algorithm, etc. As 
we struggled through integrating each of these new 
modules and discovering their dependencies on other 
parts of the existing system, we found ourselves 
wishing for a standardized framework--a communi- 
cation and architectural infrastructure for voice dia- 
logue systems. And while the CSLU Toolkit (Sutton 
et al., 1996) already promises rapid development of 
voice system applications, it and other commercial 
systems rely primarily on finite state models of dia- 
logne, which may be insufficient for modeling com- 
plex domains or posing some research questions. A 
complete set of dialogue application programmer in- 
teffaces (APIs) would reduce system development 
time, lead to increased resource sharing and allow 
more accurate system and component evaluation 
(Fulkerson and Kehn, 1997). 
The high level architecture we envision uses mes- 
sages to communicate content, and events to de- 
scribe meta and control information. For example, 
SPEECHIN, a speech recognition module, might gen- 
erate events such as SpeechStart or SpeechStop, and 
also produce a message containing a recognized ut- 
terance. This research effort will focus on under- 
standing and formalizing these communication lan- 
guages, so that they are not only powerful enough 
to capture the dialogue information we can obtain 
today, but also extensible enough to convey novel 
pieces of the human/machine interaction allowed by 
future developments. In order to test these ideas, 
124 
we are currently converting modules in our existing 
system to allow experimentation with various com- 
munication languages and architectures. 
Olalogu4l $yltam 
Ss'ttctl I~ 
\ A,,~la ~~Tl~t I. ~ N~\ DIALOGUE 
Figure 4: A multimodal dialogue system using mes- 
sages and events. 
~~~TltXTIN N N\ \ DIALOGUE 
Figure 5: Adding learning module to an existing 
system. 
Consider a hypothetical multimodal dialogue sys- 
tem that was constructed according to the above 
guidelines, as illustrated in Figure 4. The message 
output from dialogue processing contains a predi- 
cate form of some content to be communicated, and 
a set of modes in which this can be presented. The 
output generation module takes this message as in- 
put, and generates a response based on this infor- 
mation, choosing the mode randomly from what is 
allowed in the message. In many systems, adding a 
mechanism for learning the user's preferences might 
involve adding code to a number of modules. In the 
system we've just described however, the process is 
much easier. In Figure 5, we see that the learning 
algorithm can be inserted between the dialogue pro- 
cessing and output generation modules. It receives 
events generated by other modules, and uses tim- 
ings between output and input events to calculate 
the user's response time. It then modifies the mes- 
125 
sage from dialogue to allow only user's current preb 
ered modes, and passes it on to output generation. 
Note that the API would not only make develop- 
ment faster and easier, it would also allow multiple 
learning algorithms to be tested in a particular do- 
main. This type of component evaluation within the 
context of a system is currently much harder to ac- 
complish. 
5 Summary 
We have discussed a mechanism for building dia- 
logue systems, and how one might achieve useful 
behaviors, such as handling subdialogues, allowing 
variable initiative, accounting for user differences, 
correcting for errors, and' communicating in a vari- 
ety of modes. We discussed the Duke Programming 
Tutor, a system that demonstrates the integration 
of many of these ideas, which has been used by a 
number of students. Finally, we presented our on- 
going project to make designing, constructing and 
evaluating new dialogue systems faster and easier. 
6 Acknowledgements 
This research is supported by the Office of Naval Re- 
search grant N00014-94-1-0938, the National Science 
Foundation grant IRI-92-21842 and a grant from the 
Research Triangle Institute, which is funded in part 
by the Army Research Office. Other individuals who 
have contributed to the Duke Programming Tutor 
include Curry Guinn, Zheng Liang, Phil Long, Dou- 
glas Melamed and Krislman Rajagopalan. 

References 
J. F. Allen and C. R. Perrault. 1980. Analyz- 
ing intention in dialogues. Artificial Intelligence, 
15(3):143-178. 
James F. Allen, Lenhart K. Schubert, George Fergu- 
son, Peter Heeman, Chung Hee Hwang, Tsuneaki 
Kato, Marc Light, Nathaniel G. Martin, Brad- 
ford W. Miller, Massimo Poesio, and David R. 
Traum. 1994. The TRAINS project: A case study 
in building a conversational planning agent. Tech- 
nical Report TRAINS Technical Note 94-3, The 
University of Rochester, September. 
A. W. Biermann and P. M. Long. 1996. The com- 
position of messages in speech-graphics interac- 
tive systems. In Proceedings of the 1996 Interna- 
tional Symposium on Spoken Dialogue, pages 97- 
100, October. 
Sandra Carberry. 1988. Modeling the user's plans 
and goals. Computational Linguistics, 14(3):23- 
37. 
Sandra Carberry. 1990. Plan recognition in natural 
language dialogue. ACL-MIT Press series in nat- 
ural language processing. MIT Press, Cambridge, 
Massachusetts. 
S.K. Feiner and K.R. McKeown. 1993. Au- 
tomating the generation of coordinated multime- 
dia explanations. In M.T. Maybury, editor, In- 
telligent Multimedia Interfaces, pages 113-134. 
AAAI/MIT Press. 
Michael F. Fulkerson and Greg A. Keim. 1997. De- 
velopment of a component level API for voice di- 
alogue systems. In submission. 
Barbara J. Grosz and Candace L. Sidner. 1986. At- 
tention, intentions, and the structure of discourse. 
Computational Linguistics, 12(3):175-204, Sep. 
Curry I. Guinn. 1995. Meta-Dialogue Behav- 
iors: Improving the Efficiency of Human-Machine 
Dialogue-A Computational Model of Variable Ini- 
titive and Negotiation in Collaborative Problem- 
Solving. Ph.D. thesis, Duke University. 
C. I. Guinn. 1996. Mechanisms for mixed-initiative 
human-computer collaborative discource. In Pro- 
ceedings of the 34th Annual Meeting of the Asso- 
ciation for Computational Linguistics, pages 278- 
285. 
D. R. Hipp. 1992. A New Technique for Parsing 
ill-formed Spoken Natural-language Dilaog. Ph.D. 
thesis, Duke University. 
J. R. Hobbs. 1979. Coherence and coreference. Cog- 
nitive Science, 3:67-90. 
H. Kitano and C. Van Ess-Dykema. 1991. Toward 
a plan-based understanding model for mixed- 
initiative dialogues. In Proceedings of the 29th 
Annual Meeting of the Association for Computa- 
tional Linguistics, pages 25-32. 
Alfred Kobsa and Wolfgang Wahlster, editors. 1989. 
User Models in Dialog Systems. Springer-Verlag, 
Berlin. 
D. J. Litman and J. F. Allen. 1987. A plan recogni- 
tion model for subdialogues in conversations. Cog- 
nitive Science, 11(2):163-200. 
K.E. Lochbaum. 1991. An algorithm for plan recog- 
nition in collaborative discource. In Proceedings 
of the 29th Annual Meeting of the Association for 
Computational Linguistics, pages 33-38. 
P. M. Long. 1996. Improved bounds about on-line 
learning of smooth functions of a single variable. 
In Proceedings of the 1996 Workshop on Algorith- 
mic Learning Theory. 
M.T. Maybury, editor. 1993. Intelligent Multimedia 
Interfaces. AAAI/MIT Press. 
D. G. Novick. 1988. Control of Mixed-Initiative Dis- 
course Through Meta-Locutionary Acts: A Com- 
putational Model Ph.D. thesis, University of Ore- 
gon. 
M. E. Pollack. 1986. A model of plan inference that 
distinguishes between the beliefs of factors and ob- 
servers. In Proceedings of the 2~th Annual Meet- 
ing of the Association for Computational Linguis- 
tics, pages 207-214. 
R. Reichman. 1985. Getting computers to talk like 
you and me. The MIT Press, Cambridge, Mass. 
Ronnie W. Smith and Steven A. Gordon. 1996. 
Pragmatic issues in handling miscommunication: 
Observations of a spoken natural language dialog 
system. In AAAI Workshop on Detecting, Repair- 
ing, and Preventing Human-Machine Miscommu- 
nication in Portland, Oregan. 
Ronnie W. Smith and D. Richard Hipp. 1994. Spo- 
ken Natural Language Dialog Systems: A Practi- 
cal Approach. Oxford University Press. 
Ronnie W. Smith, D. Richard Hipp, and Alan W. 
Biermann. 1995. An arctitecture for voice dia- 
log systems based on prolog-style theorem prov- 
ing. Computational Linguistics, 21(3):281-320, 
September. 
Stephen Sutton, David G. Novick, Ronald Cole, 
Pieter Vermeulen, Jacques de Villiers, Johan 
Schalkwyk, and Mark Fanty. 1996. Building 
10,000 spoken dialogue systems. In Proceedings 
of the Fourth International Conference on Spoken 
Language Processing, pages 709-712, October. 
Wolfgang Wahlster, Elisabeeth Andrfi, Wolfgang 
Finkler, Hans-Jiirgen Profitlich, and Thomas Rist. 
1993. Plan-based integration of natural language 
and graphics generation. Artificial Intelligence, 
63:387---427. 
Marilyn Walker and Steve Whittaker. 1990. Mixed 
initiative in dialogue: An investigation into dis- 
course segmentation. In Proceedings, 28th An- 
nual Meeting of the Association for Computa- 
tional Linguistics, pages 70.--78. 
S. R. Young, A. G. Hauptmann, W. H. Ward, E. T. 
Smith, and P. Werner. 1989. High level knowl- 
edge sources in usable speech recognition systems. 
Communications of the ACM, pages 183-194, Au- 
gust. 
