Coordination and context-dependence in the generation of embodied 
conversation 
Justine Cassell*    Matthew Stone†    Hao Yan*
*Media Laboratory
MIT
E15-315
20 Ames, Cambridge MA
{justine, yanhao}@media.mit.edu
†Department of Computer Science &
Center for Cognitive Science
Rutgers University
110 Frelinghuysen, Piscataway NJ 08854-8019
mdstone@cs.rutgers.edu
Abstract 
We describe the generation of communicative ac- 
tions in an implemented embodied conversational 
agent. Our agent plans each utterance so that mul- 
tiple communicative goals may be realized oppor- 
tunistically by a composite action including not only 
speech but also coverbal gesture that fits the con- 
text and the ongoing speech in ways representative 
of natural human conversation. We accomplish this 
by reasoning from a grammar which describes ges- 
ture declaratively in terms of its discourse function, 
semantics and synchrony with speech. 
1 Introduction 
When we are face-to-face with another human, no 
matter what our cultural background or
age, we virtually all use our faces and hands as an in- 
tegral part of our dialogue with others. Research on 
embodied conversational agents aims to imbue in- 
teractive dialogue systems with the same nonverbal 
skills and behaviors (Cassell, 2000a). 
There is good reason to think that nonverbal be- 
havior will play an important role in evoking from 
users the kinds of communicative dialogue behav- 
iors they use with other humans, and thus allow 
them to use the computer with the same kind of ef- 
ficiency and smoothness that characterizes their di- 
alogues with other people. For example, (Cassell 
and Thórisson, 1999) show that humans are more
likely to consider computers lifelike, and to rate their 
language skills more highly, when those computers
display not only speech but appropriate nonverbal 
communicative behavior. This argument takes on 
particular importance given that users repeat them- 
selves needlessly, mistake when it is their turn to 
speak, and so forth when interacting with voice di- 
alogue systems (Oviatt, 1995); real-life, noisy situa-
tions like these provoke the non-verbal modalities to 
come into play (Rogers, 1978). 
In this paper, we describe the generation of com- 
municative actions in an implemented embodied 
conversational agent. Our generation framework 
adopts a goal-directed view of generation and casts 
knowledge about communicative action in the form 
of a grammar that specifies how forms combine, 
what interpretive effects they impart and in what 
contexts they are appropriate (Appelt, 1985; Moore, 
1994; Dale, 1992; Stone and Doran, 1997). We ex- 
pand this framework to take into account findings, 
by ourselves and others, on the relationship between 
spontaneous coverbal hand gestures and speech. In 
particular, our agent plans each utterance so that 
multiple communicative goals may be realized op- 
portunistically by a composite action including not 
only speech but also coverbal gesture. By describing 
gesture declaratively in terms of its discourse func- 
tion, semantics and synchrony with speech, we en- 
sure that coverbal gesture fits the context and the on- 
going speech in ways representative of natural hu- 
man conversation. The result is a streamlined imple- 
mentation that instantiates important theoretical in- 
sights into the relationship between speech and ges- 
ture in human-human conversation. 
2 Exploring the relationship between 
speech and gesture
To generate embodied communicative action re- 
quires an architecture for embodied conversation; 
ours is provided by the agent REA ("Real Estate 
Agent"), a computer-generated humanoid that has 
an articulated graphical body, can sense the user 
passively through cameras and audio input, and 
supports communicative actions realized in speech 
with intonation, facial display, and animated ges- 
ture. REA currently offers the reasoning and dis- 
play capabilities to act as a real estate agent showing 
users the features of various models of houses that
appear on-screen behind her. We use existing fea- 
tures of REA here as a research platform for imple-
menting models of the relationship between speech 
and spontaneous hand gestures during conversation. 
For more details about the functionality of REA see 
(Cassell, 2000a). 
Evidence from many sources suggests that this re- 
lationship is a close one. About three-quarters of all
clauses in narrative discourse are accompanied by 
gestures of one kind or another (McNeill, 1992), and 
within those clauses, the most effortful part of ges- 
tures tends to co-occur with or just before the phono- 
logically most prominent syllable of the accompany- 
ing speech (Kendon, 1974). 
Of course, communication is still possible with- 
out gesture. But it has been shown that when speech 
is ambiguous (Thompson and Massaro, 1986) or in 
a speech situation with some noise (Rogers, 1978), 
listeners do rely on gestural cues (and, the higher the 
noise-to-signal ratio, the more facilitation by ges- 
ture). Similarly, Cassell et al. (1999) established that 
listeners rely on information conveyed only in ges- 
ture as they try to comprehend a story. 
Most interesting in terms of building interactive 
dialogue systems is the semantic and pragmatic rela- 
tionship between gesture and speech. The two chan- 
nels do not always manifest the same information, 
but what they convey is virtually always compati- 
ble. Semantically, speech and gesture give a con- 
sistent view of an overall situation. For example, 
gesture may depict the way in which an action was 
carried out when this aspect of meaning is not de- 
picted in speech. Pragmatically, speech and ges- 
ture mark information about this meaning as advanc- 
ing the purposes of the conversation in a consistent 
way. Indeed, gesture often emphasizes information 
that is also focused pragmatically by mechanisms 
like prosody in speech (Cassell, 2000b). The seman- 
tic and pragmatic compatibility seen in the gesture- 
speech relationship recalls the interaction of words 
and graphics in multimodal presentations (Feiner 
and McKeown, 1991; Green et al., 1998; Wahlster 
et al., 1991). In fact, some suggest (McNeill, 1992)
that gesture and speech arise together from an under- 
lying representation that has both visual and linguis- 
tic aspects, and so the relationship between gesture 
and speech is essential to the production of meaning 
and to its comprehension. 
This theoretical perspective on speech and gesture 
involves two key claims with computational import: 
that gesture and speech reflect a common concep-
tual source; and that the content and form of a ges- 
ture is tuned to the communicative context and the 
actor's communicative intentions. We believe that 
these characteristics of the use of gesture are uni- 
versal, and see the key contribution of this work as 
providing a general framework for building dialogue 
systems in accord with them. However, a concrete 
implementation requires more than just generalities
behind its operation; we also need an understanding 
of the precise ways gesture and speech are used to- 
gether in a particular task and setting. 
To this end, we collected a sample of real-estate 
descriptions in line with what REA might be asked 
to provide. To elicit each description, we asked one 
subject to study a video and floor plan of a partic- 
ular house, and then to describe the house to a sec- 
ond subject (who did not know the house and had not 
seen the video). During the conversation, the video 
and floor plan were not available to either subject; 
the listener was free to interrupt and ask questions. 
The collected conversations were transcribed, 
yielding 328 utterances and 134 referential gestures, 
and coded to describe the general communicative 
goals of the speaker and the kinds of semantic fea- 
tures realized in speech and gesture. 
Analysis of the data revealed that for roughly 
50% of the gesture-accompanied utterances, gestu- 
ral content was redundant with speech; for the other 
50% gesture contributed content that was different, 
but complementary, to that contributed by speech. 
In addition, the relationship between content of ges- 
ture, content of speech and general communicative 
functions in house descriptions could be captured by 
a small number of rules; these rules are informed by
and accord with our two key claims about speech 
and gesture. For example, one rule describes di- 
alogue contributions whose general function was 
what we call presentation, to advance the descrip- 
tion of the house by introducing a single new ob- 
ject. These contributions tended to be made up of
a sentence that asserted the existence of an object 
of some type, accompanied by a non-redundant ges- 
ture that elaborated the shape or location of the ob-
ject. Our approach casts this extended description of 
a new entity, mediated by two compatible modali- 
ties, as the speaker's expression of one overall func- 
tion of presentation. 
(1) is a representative example.
(1) It has [a nice garden]. (right hand, held flat,
traces a circle, indicating location of the 
garden surrounding the house)
[Figure 1: Interacting with REA]
Six rules account for 60% of the gestures in the
transcriptions (recall) and apply with an accuracy of
96% (precision). These patterns provide a concrete 
specification for the main communicative strategies 
and communicative resources required for REA. A
full discussion of the experimental methods and 
analysis, and the resulting rules, can be found in 
(Yan, 2000). 
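To make such a rule concrete, the presentation rule
described above could be encoded roughly as follows
(a minimal sketch in Python; the field names, the rule
inventory and the helper function are our own
illustration, not REA's actual data structures):

    # Sketch: one speech-gesture rule recovered from the corpus study.
    # A rule pairs a communicative function with the semantic features
    # that tend to surface in speech and in the accompanying gesture.
    PRESENTATION_RULE = {
        "function": "presentation",                # introduce a single new object
        "speech_features": ["type", "quality"],    # e.g. "It has a nice garden"
        "gesture_features": ["shape", "location"]  # non-redundant elaboration
    }

    def plan_contribution(function, facts, rules=(PRESENTATION_RULE,)):
        """Split the facts about an entity between speech and gesture
        according to the first rule matching the communicative function."""
        for rule in rules:
            if rule["function"] == function:
                speech = {k: v for k, v in facts.items() if k in rule["speech_features"]}
                gesture = {k: v for k, v in facts.items() if k in rule["gesture_features"]}
                return speech, gesture
        return dict(facts), {}  # fall back: everything in speech

    # Example (1): "It has [a nice garden]" plus a circling gesture for location.
    speech, gesture = plan_contribution(
        "presentation",
        {"type": "garden", "quality": "nice", "location": "surrounds(house)"})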
3 Framing the generation problem 
In REA, requests for the generation of speech and 
gesture are formulated within the dialogue manage- 
ment module. REA'S utterances reflect a coordina- 
tion of multiple kinds of processing in the dialogue 
manager: the system recognizes that it has the floor,
derives the appropriate communicative context for 
a response and an appropriate set of communicative 
goals, triggers the generation process, and realizes 
the resulting speech and gesture. The dialogue man- 
ager is only one component in a multithreaded ar- 
chitecture that carries out hardwired reactions to in- 
put as well as deliberative processing. The diver- 
sity is required in order to exhibit appropriate inter- 
actional and propositional conversational behaviors 
at a range of time scales, from tracking the user's 
movements with gaze and providing nods and other 
feedback as the user speaks, to participating in rou- 
tine exchanges and generating principled responses 
to users' queries. See (Cassell, 2000a) for descrip-
tion and motivation of the architecture, as well as the 
conversational functions and behaviors it supports. 
REA'S design and capabilities reflect our research 
focus on allying conversational content with conver- 
sation management, and allying nonverbal modali- 
ties with speech: how can an embodied agent use all
its communicative modalities to contribute new con- 
tent when needed (propositional function), to signal 
the state of the dialogue, and to regulate the over- 
all process of conversation (interactional function)? 
Within this focus, REA's talk is firmly delimited. 
REA'S utterances take a question-answer format, in 
which the user asks about (and REA describes) a 
single house at a time. REA'S sentences are short;
generally, they contribute just a few new semantic 
features about particular rooms or features of the 
house (in speech and gesture), and flesh this contri- 
bution out with a handful of meaningful elements (in 
speech and gesture) that ground the contribution in 
shared context of the conversation. 
Despite the apparent simplicity, the dialogue 
manager must contribute a wealth of information 
about the domain and the conversation to represent 
the communicative context. This detail is needed for 
REA to achieve a theoretically-motivated realization
of the common patterns of speech and gesture we ob- 
served in human conversation. For example, a vari- 
ety of changing features determine whether marked 
forms in speech and gesture are appropriate in the 
context. REA'S dialogue manager tracks the chang- 
ing status of such features as: 
• Attentional prominence, represented (as usual
in natural language generation) by setting up a
context set for each entity (Dale, 1992). Our 
model of prominence is a simple local one sim- 
ilar to (Strube, 1998). 
• Cognitive status, including whether an entity is
hearer-old or hearer-new (Prince, 1992), and 
whether an entity is in-focus or not (Gundel 
et al., 1993). We can assume that houses and 
their rooms are hearer-new until REA describes 
them; and that just those entities mentioned in 
the prior sentence are in-focus. 
• Information structure, including the open
propositions or, following (Steedman, 1991),
themes, which describe the salient questions 
currently at issue in the discourse (Prince, 
1986). In REA'S dialogue, open questions are 
always general questions about some entity 
raised by a recent turn; although in principle 
such an open question ought to be formalized 
as theme(λP.Pe), REA can use the simpler
theme(e). 
In fact, both speech and gesture depend on the same 
kinds of features, and access them in the same way;
this specification of the dialogue state crosscuts dis- 
tinctions of communicative modality. 
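A minimal sketch of the kind of context record the
dialogue manager might hand to the generator, under
these assumptions (the class and field names are
illustrative, not REA's actual interface):

    from dataclasses import dataclass, field

    @dataclass
    class DiscourseContext:
        """Dialogue-state features consulted by speech and gesture alike."""
        context_sets: dict = field(default_factory=dict)  # entity -> competing entities
        hearer_old: set = field(default_factory=set)      # everything else is hearer-new
        in_focus: set = field(default_factory=set)        # entities from the prior sentence
        theme: str = None                                 # theme(e): open question about e

        def hearer_new(self, entity):
            return entity not in self.hearer_old

    # State after the user says "Tell me more about the house":
    ctx = DiscourseContext(
        context_sets={"house1": {"house1"}},
        hearer_old={"house1"},
        in_focus={"house1"},
        theme="house1")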
Another component of context is provided by a 
domain knowledge base, consisting of facts explic- 
itly labeled with the kind of information they repre- 
sent. This defines the common ground in the con- 
versation in terms of sources of information that 
speaker and hearer share. Modeling the discourse as
a shared source of information means that new se-
mantic features REA imparts are added to the com-
mon ground as the dialogue proceeds. Following re- 
sults from (Kelly et al., 1999) which show that infor- 
mation from both speech and gesture is used to pro- 
vide context for ongoing talk, our common ground 
may be updated by both speech and gesture. 
The structured domain knowledge also provides 
a resource for specifying communicative strategies. 
Recall that REA'S communicative strategies are for- 
mulated in terms of functions which are common 
in naturally-occurring dialogues (such as "presenta- 
tion") and which lead to distinctive bundles of con- 
tent in gesture and speech. The knowledge base's 
kinds of information provide a mechanism for spec- 
ifying and reasoning about such functions. The 
knowledge base is structured to describe the rela- 
tionship between the system's private information 
and the questions of interest that that information 
can be used to settle. Once the user's words have 
been interpreted, a layer of production rules con- 
structs obligations for response (Traum and Allen, 
1994); then, a second layer plans to meet these obli- 
gations by deciding to present a specified kind of 
information about a specified object. This deter- 
mines some concrete communicative goals--facts 
of this kind that a contribution to dialogue could 
make. Both speech and gesture can access the 
whole structured database in realizing these concrete 
communicative goals. For example, a variety of 
facts that bear on where a residence is--which city, 
which neighborhood or, if appropriate, where in a 
building--all provide the same kind of information, 
and would therefore fit the obligation to specify the 
location of a residence. Or, to implement the rule 
for presentation described in connection with (1), we
can associate an obligation of presentation with a 
cluster of facts describing an object's type, its loca- 
tion in a house, and its size, shape or quality. 
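The sketch below illustrates this organization as we
understand it: facts labeled with the kind of
information they provide, and an obligation that
selects a matching cluster of facts as concrete
communicative goals (all names are hypothetical):

    # Each fact in the knowledge base is labeled with the kind of
    # information it provides (first element of the triple).
    FACTS = [
        ("type",     "garden1", "garden"),
        ("quality",  "garden1", "nice"),
        ("location", "garden1", "surrounds(house1)"),
        ("location", "house1",  "in(Cambridge)"),
    ]

    # A communicative function names the kinds of information it bundles.
    FUNCTIONS = {
        "presentation":     ["type", "quality", "location"],
        "specify-location": ["location"],
    }

    def goals_for(function, entity, facts=FACTS):
        """Concrete communicative goals: facts of the requested kinds about entity."""
        kinds = FUNCTIONS[function]
        return [f for f in facts if f[0] in kinds and f[1] == entity]

    # Discharging an obligation to present the garden yields its type,
    # quality and location facts, realizable in speech and/or gesture.
    goals = goals_for("presentation", "garden1")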
The communicative context and concrete com- 
municative goals provide a common source for gen- 
erating speech and gesture in REA. The utterance 
generation problem in REA, then, is to construct a
complex communicative action, made up of speech
and coverbal gesture, that achieves a given constel-
lation of goals and tightly fits the context specified
by the dialogue manager.
4 Generation and linguistic representation
We model REA'S communicative actions as com-
posed of a collection of atomic elements, including
both lexical items in speech and clusters of seman-
tic features expressed as gestures; since we assume 
that any such item usually conveys a specific piece 
of content, we refer to these elements generally as 
lexicalized descriptors. The generation task in REA 
thus involves selecting a number of such lexical- 
ized descriptors and organizing them into a gram- 
matical whole that manifests the right semantic and 
pragmatic coordination between speech and gesture. 
The information conveyed must be enough that the 
hearer can identify the entity in each domain ref- 
erence from among its context set. Moreover, the 
descriptors must provide a source which allows the 
hearer to recover any needed new domain proposi- 
tion, either explicitly or by inference. 
We use the SPUD generator ("Sentence Planning 
Using Description") introduced in (Stone and Do- 
ran, 1997) to carry out this task for REA. SPUD 
builds the utterance element-by-element; at each 
stage of construction, SPUD'S representation of the 
current, incomplete utterance specifies its syntax, 
semantics, interpretation and fit to context. This rep- 
resentation both allows SPUD to determine which 
lexicalized descriptors are available at each stage to 
extend the utterance, and to assess the progress to- 
wards its communicative goals which each exten- 
sion would bring about. At each stage, then, SPUD 
selects the available option that offers the best im- 
mediate advance toward completing the utterance 
successfully. (We have developed a suite of guide- 
lines for the design of syntactic structures, seman- 
tic and pragmatic representations, and the interface 
between them so that SPUD'S greedy search, which 
is necessary for real-time performance, succeeds in 
finding concise and effective utterances described
by the grammar (Stone et al., 2000).) 
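In outline, that search can be pictured as the greedy
loop below (a simplification on our part; the grammar,
utterance and entry objects stand in for SPUD's actual
representations, which score interpretations as
described in (Stone et al., 2000)):

    def generate(goals, context, grammar, max_steps=25):
        """Greedy, element-by-element construction of a speech+gesture utterance."""
        utterance = grammar.initial_tree()   # provisional, incomplete utterance
        for _ in range(max_steps):
            if utterance.complete() and not goals:
                break
            options = [e for e in grammar.entries
                       if e.combines_with(utterance)     # LTAG substitution/adjunction fits
                       and e.semantics_holds(context)    # true of the entities described
                       and e.pragmatics_holds(context)]  # fits the discourse state
            if not options:
                break
            # Prefer the entry whose interpretation makes the most immediate
            # progress: goals conveyed, referents narrowed, specific fit to context.
            best = max(options, key=lambda e: e.progress(utterance, goals, context))
            utterance = utterance.extend_with(best)
            goals = [g for g in goals if not utterance.conveys(g)]
        return utterance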
As part of the development of REA, we have con- 
structed a new inventory of lexicalized descriptors. 
REA'S descriptors consist of entries that contribute 
to coverbal gestures, as well as revised entries for 
spoken words that allow for their coordination with 
gesture under appropriate discourse conditions. The 
organization of these entries assures that--using the
same mechanism as with speech--REA'S gestures 
draw on the single available conceptual representa- 
tion and that both REA'S gesture and the relation- 
ship between gesture and speech vary as a function
of pragmatic context in the same way as natural ges- 
tures and speech do. More abstractly, these entries 
enable SPUD to realize the concrete goals tied to 
common communicative functions with the same dis-
tribution of speech and gesture observed in natural
conversations.
To explain how these entries work, we need to 
consider SPUD's representation of lexicalized de- 
scriptors in more detail. Each entry is specified 
in three parts. The first part--the syntax of the 
element--sets out what words or other actions the
element contributes to its utterance. The syn- 
tax is a hierarchical structure, formalized using 
Feature-Based Lexicalized Tree Adjoining Gram- 
mar (LTAG) (Joshi et al., 1975; Schabes, 1990). 
Syntactic structures are also associated with referen- 
tial indices that specify the entities in the discourse 
that the entry refers to. For the entry to apply at a 
particular stage, its syntactic structure must combine 
by LTAG operations with the syntax of the ongoing 
utterance. 
REA'S syntactic entries combine typical phrase- 
structure analyses of linguistic constructions with 
annotations that describe the occurrence of gestures 
in coordination with linguistic phrases. Our device 
for this is a construction SYNC which pairs a descrip- 
tion of a gesture G with the syntactic structure of a 
spoken constituent c: 
(2)      SYNC
        /    \
       G      C
The temporal interpretation of (2) mirrors the rules 
for surface synchrony between speech and gesture 
presented in (Cassell et al., 1994). That is, the 
preparatory phase of gesture G is set to begin before 
the time constituent c begins; the stroke of gesture 
G (the most effortful part) co-occurs with the most 
phonologically prominent syllable in c; and, except 
in cases of coarticulation between successive ges- 
tures, by the time the constituent c is complete, the 
speaker must be relaxing and bringing the hands out 
of gesture space (while the generator specifies syn- 
chrony as described, in practice the synchronization 
of synthesized speech with graphics is an ongoing 
challenge in the REA project). In sum, the produc-
tion of gesture G is synchronized with the produc- 
tion of speech c. (Our representation of synchrony 
in a single tree conveniently allows modules down-
stream to describe embodied communicative actions 
as marked-up text.) 
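Under these synchrony rules, a downstream animation
module could derive gesture phase timings from the
timing of the coordinated constituent roughly as
follows (a sketch; the numeric offsets and the timing
interface are our assumptions, not REA's actual code):

    def schedule_gesture(constituent_start, constituent_end, stress_time,
                         prep_lead=0.25, coarticulates=False):
        """Map a SYNC node onto gesture phases following Cassell et al. (1994):
        preparation begins before the constituent, the stroke co-occurs with its
        most prominent syllable, and the hands relax by the constituent's end
        unless the gesture coarticulates with a following one."""
        schedule = {
            "preparation_start": constituent_start - prep_lead,  # start moving early
            "stroke_at": stress_time,                            # most effortful phase
        }
        if not coarticulates:
            schedule["retracted_by"] = constituent_end           # hands leave gesture space
        return schedule

    # "... [a nice GARden]": constituent spans 1.20s-1.90s, stressed syllable at 1.45s
    print(schedule_gesture(1.20, 1.90, 1.45))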
The syntactic description of the gesture itself in- 
dicates the choices the generator must make to pro- 
duce a gesture, but does not analyze a gesture lit-
erally as a hierarchy of separate motions. In-
stead, these choices specify independent semantic 
features which we can associate with aspects of a 
gesture (such as handshape and trajectory through 
space). Our current grammar does not undertake the 
final step of associating semantic features to choice 
of particular handshapes and movements, or gesture 
morphology; we reserve this problem for later in 
the research program. We allow gesture to accom- 
pany alternative constituents by introducing alterna- 
tive syntactic entries; these entries take on different 
pragmatic requirements (as described below) to cap- 
ture their respective discourse functions. 
So much for syntax. The second part--the seman- 
tics of the element--is a formula that specifies the 
content that the element carries. Before the entry 
can be used, SPUD must establish that the semantics 
holds of the entities the entry describes. If the se- 
mantics already follows from the common ground, 
SPUD assumes that the hearer can use it to help iden- 
tify the entities described. If the semantics is merely 
part of the system's private knowledge, SPUD treats 
it as new information for the hearer. 
Finally, the third part--the pragmatics of the 
element--is also a formula that SPUD looks to prove 
before using the entry. Unlike the semantics, how- 
ever, the pragmatics does not achieve specific com- 
municative goals like identifying referents. Instead, 
the pragmatics establishes a general fit between the 
entry and the context. 
The entry schematized in (3) illustrates these three 
components; the entry also suggests how these com- 
ponents can define coordinated actions of speech 
and gesture that respond coherently to the context. 
(3) a syntax:
              S
            /   \
         NP:o    VP
                /  \
               V    SYNC
               |    /  \
             have  G:x  NP:x
    b semantics: have(o, x)
    c pragmatics: hearer-new(x) ∧ theme(o)
(3) describes the use of have to introduce a new fea- 
ture of (a house) o. The feature, indicated through- 
out the entry by the variable x, is realized as the ob-
ject NP of the verb have, but x can also form the ba- 
sis of a gesture G coordinated with the noun phrase 
(as indicated by the SYNC constituent). The entry as- 
serts that o has x. 
(3) is a presentational construction; in other
words, it coordinates non-redundant paired speech 
and gesture in the same way as demonstrated by our 
house description data. To represent this constraint 
on its use, the entry carries two pragmatic require- 
ments: first, x must be new to the hearer; moreover, 
o must link up with the open question in the dis- 
course that the sentence responds to. 
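Schematically, an entry like (3) amounts to a single
record with the three parts just described (a sketch
in our own shorthand, not SPUD's concrete syntax):

    # Sketch of the lexicalized descriptor (3): "o has x", with a
    # coverbal gesture G:x synchronized with the object noun phrase.
    ENTRY_HAVE_PRESENTATIONAL = {
        "syntax": ("S", [("NP", "o"),
                         ("VP", [("V", "have"),
                                 ("SYNC", [("G", "x"), ("NP", "x")])])]),
        "semantics": ("have", "o", "x"),     # o has x
        "pragmatics": [("hearer-new", "x"),  # x must be new to the hearer
                       ("theme", "o")],      # o answers the open question
    }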
The pragmatic conditions of (3) help support 
our theory of the discourse function of gesture and 
speech. A similar kind of sentence could be used 
to address other open questions in the discourse-- 
for example, to answer which house has a garden? 
This would not be a presentational function, and 
(3) would be infelicitous here. In that case, gesture 
would naturally coordinate with and elaborate on the 
answering information--in this case the house. So 
the different information structure would activate a 
different entry, where the gesture would coordinate 
with the subject and describe o. 
Meanwhile, alternative entries like (4a) and 
(4b)---two entries that both convey (4c) and that 
both could combine with (3) by LTAG operations-- 
underlie our claim that our implementation allows 
gesture and speech to draw on a single conceptual 
source and fulfill similar communicative intentions. 
(4) a syntax:
              G:x
            /     \
  circular-trajectory  RS:x↓
    b syntax:
             NP
           /    \
        NP*:x    VP
                /  \
               V    NP:p↓
               |
          surrounding
    c semantics: surround(x, p)
(4a) provides a structure that could substitute for the 
G node in (3) to produce semantically and pragmat- 
ically coordinated speech and gesture. (4a) speci- 
fies a right hand gesture in which the hand traces
out a circular trajectory; a further decision must de- 
termine the correct handshape (node RS, as a func- 
tion of the entity x that the gesture describes). We 
pair (4a) with the semantics in (4c), and thereby 
model that the gesture indicates that one object, x, 
surrounds another, p. Since p cannot be further de- 
scribed, p must be identified by an additional pre- 
supposition of the gesture which picks up a reference
frame from the shared context.
Similarly, (4b) describes how we could modify 
the vP introduced by (3) (using the LTAG operation 
of adjunction), to produce an utterance such as It 
has a garden surrounding it. By pairing (4b) with 
the same semantics (4c), we ensure that SPUD will 
treat the communicative contribution of the alterna- 
tive constructions of (4) in a parallel fashion. Both 
are triggered by accessing background knowledge 
and both are recognized as directly communicating 
specified facts. 
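The parallelism between (4a) and (4b) thus comes down
to two entries that differ in their syntax (a hand
movement versus the verb surrounding) while sharing
one semantics; schematically, in the same shorthand
as above:

    SHARED_SEMANTICS = ("surround", "x", "p")   # x surrounds p

    ENTRY_4A_GESTURE = {   # hand movement substituting for G:x in (3)
        "syntax": ("G", "x", [("trajectory", "circular"), ("handshape", "RS", "x")]),
        "semantics": SHARED_SEMANTICS,
    }

    ENTRY_4B_SPEECH = {    # adjoins to an NP: "... surrounding it"
        "syntax": ("NP", [("NP*", "x"), ("VP", [("V", "surrounding"), ("NP", "p")])]),
        "semantics": SHARED_SEMANTICS,
    }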
5 Solving the generation problem 
We now sketch how entries such as these combine 
together to account for REA'S utterances. Our exam- 
ple is the dialogue in (5): 
(5) a User: Tell me more about the house. 
b REA: It has [a nice garden]. (right hand, held
flat, traces a circle)
REA's response indicates both that the house has a 
nice garden and that it surrounds the house. 
As we have seen, (5b) represents a common pat- 
tern of description; this particular example is moti- 
vated by an exchange two human subjects had in our 
study, cf. (1). (5b) represents a solution to a gen- 
eration problem that arises as follows within REA'S 
overall architecture. The user's directive is inter- 
preted and classified as a directive requiring a delib- 
erative response. The dialogue manager recognizes 
an obligation to respond to the directive, and con- 
cludes that to fulfill the function of presenting the 
garden would discharge this obligation. The presen- 
tational function grounds out in the communicative 
goal to convey a collection of facts about the garden 
(type, quality, location relative to the house). Along 
with these goals, the dialogue manager supplies its 
communicative context, which represents the cen- 
trality of the house in attentional prominence, cog- 
nitive status and information structure. 
In producing (5b) in response to this NLG prob- 
lem, SPUD both calculates the applicability of and 
determines a preference for the lexicalized descrip-
tors involved. Initially, (3) is applicable; the system 
knows the house has the garden, and represents the 
garden as new and the house as questioned. The en- 
try can be selected over potential alternatives based 
on its interpretation--it achieves a communicative 
goal, refers to a prominent entity, and makes a rel- 
atively specific connection to facts in the context. 
Similarly, in the second stage, SPUD evaluates and
selects (4a) because it communicates a needed fact
in a way that helps flesh out a concise, balanced 
communicative act by supplying a gesture that, by
using (3), SPUD has already realized belongs here.
Choices of remaining elements--the words garden 
and nice, the semantic features to represent the gar- 
den in the gesture--proceed similarly. Thus SPUD 
arrives at the response in (5b) just by reasoning from 
the declarative specification of the meaning and con- 
text of communicative actions. 
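Putting the pieces together, the processing behind
(5b) can be summarized by the following trace (purely
illustrative; the intermediate representations echo
the sketches above rather than REA's actual data
structures):

    # Trace of the generation of (5b), step by step:
    #   user input     : "Tell me more about the house."
    #   obligation     : respond; present a new feature of house1
    #   concrete goals : type(garden1, garden), quality(garden1, nice),
    #                    surround(garden1, house1)
    #   context        : house1 in focus and theme; garden1 hearer-new
    #   SPUD step 1    : entry (3) "o has x" -- hearer-new(x) and theme(o) hold
    #   SPUD step 2    : entry (4a) -- gesture G:x with circular trajectory (surround)
    #   SPUD steps 3+  : lexical entries "garden", "nice"; handshape features
    #   output         : speech "It has [a nice garden]." synchronized with a
    #                    right-hand circling gesture (stroke on "garden")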
6 Related Work 
The interpretation of speech and gesture has been 
investigated since the pioneering work of (Bolt, 
1980) on deictic gesture; recent work includes 
(Koons et al., 1993; Bolt and Herranz, 1992). Sys- 
tems have also attempted generation of gesture in 
conjunction with speech. Lester et al. (1998) gener- 
ate deictic gestures and choose referring expressions 
as a function of the potential ambiguity of objects re- 
ferred to, and their proximity to the animated agent. 
Rickel and Johnson (1999)'s pedagogical agent pro- 
duces a deictic gesture at the beginning of explana- 
tions about objects in the virtual world. André et
al. (1999) generate pointing gestures as a sub-action 
of the rhetorical action of labeling, in turn a sub- 
action of elaborating. 
Missing from these prior systems, however, is a 
representation of communicative action that treats 
the different modalities on a par. Such representa- 
tions have been explored in research on combining 
linguistic and graphical interaction. For example, 
multimodal managers have been described to allo- 
cate an underlying content representation for gen- 
eration of text and graphics (Wahlster et al., 1991; 
Green et al., 1998). Meanwhile, (Johnston et al., 
1997; Johnston, 1998) describe a formalism for 
tightly-coupled interpretation which uses a gram- 
mar and semantic constraints to analyze input from 
speech and pen. While many insights from these 
formalisms are relevant in embodied conversation, 
spontaneous gesture requires a distinct analysis with 
different emphasis. For example, we need some no-
tion of discourse pragmatics that would allow us to
predict where gesture occurs with respect to speech,
and what its role might be. Likewise, we need a
model of the communicative effects of spontaneous
coverbal gesture--one that allows us to reason nat-
urally about the multiple goals speakers have in pro-
ducing each utterance.
7 Conclusion
Research on the robustness of human conversation 
suggests that a dialogue agent capable of acting 
as a conversational partner would provide for effi- 
cient and natural collaborative dialogue. But human 
conversational partners display gestures that derive 
from the same underlying conceptual source as their 
speech, and which relate appropriately to their com- 
municative intent. In this paper, we have summa- 
rized the evidence for this view of human conver- 
sation, and shown how it informs the generation 
of communicative action in our artificial embodied 
conversational agent, REA. REA has a working im- 
plementation, which includes the modules described 
in this paper, and can engage in a variety of interac- 
tions including that in (5). Experiments are under- 
way to investigate the extent to which REA'S conver-
sational capacities share the strengths of the human 
capacities they are modeled on. 
Acknowledgments 
The research reported here was supported by NSF (award 
IIS-9618939), Deutsche Telekom, AT&T, and the other 
generous sponsors of the MIT Media Lab, and a postdoc- 
toral fellowship from RUCCS. Hannes Vilhjálmsson as-
sisted with the implementation of REA'S discourse man- 
ager. We thank Nancy Green, James Lester, Jeff Rickel,
Candy Sidner, and anonymous reviewers for comments 
on this and earlier drafts. 

References 
Elisabeth André, Thomas Rist, and Jochen Müller. 1999.
Employing AI methods to control the behavior of ani- 
mated interface agents. Applied Artificial Intelligence,
13:415-448. 
Douglas Appelt. 1985. Planning English Sentences. 
Cambridge University Press, Cambridge, England.
R. A. Bolt and E. Herranz. 1992. Two-handed gesture in 
multi-modal natural dialog. In UIST '92: Fifth Annual
Symposium on User Interface Software and Technol- 
ogy. 
R. A. Bolt. 1980. Put-that-there: voice and gesture at the 
graphics interface. Computer Graphics, 14(3):262-
270. 
J. Cassell and K. Thórisson. 1999. The power of a nod
and a glance: Envelope vs. emotional feedback in an-
imated conversational agents. Applied Artificial Intel-
ligence, 13(3).
Justine Cassell, Catherine Pelachaud, Norm Badler,
Mark Steedman, Brett Achorn, Tripp Becket, Brett 
Douville, Scott Prevost, and Matthew Stone. 1994. 
Animated conversation: Rule-based generation of fa- 
cial expression, gesture and spoken intonation for mul- 
tiple conversational agents. In SIGGRAPH, pages 
413-420. 
J. Cassell, D. McNeill, and K. E. McCullough. 1999. 
Speech-gesture mismatches: evidence for one under- 
lying representation of linguistic and nonlinguistic in- 
formation. Pragmatics and Cognition, 6(2). 
Justine Cassell. 2000a. Embodied conversational inter- 
face agents. Communications of the ACM, 43(4):70- 
78. 
Justine Cassell. 2000b. Nudge nudge wink wink: Ele- 
ments of face-to-face conversation for embodied con- 
versational agents. In J. Cassell, J. Sullivan, S. Pre- 
vost, and E. Churchill, editors, Embodied Conversa- 
tional Agents, pages 1-28. MIT Press, Cambridge, 
MA. 
Robert Dale. 1992. Generating Referring Expressions: 
Constructing Descriptions in a Domain of Objects and 
Processes. MIT Press, Cambridge MA. 
S. Feiner and K. McKeown. 1991. Automating the gen- 
eration of coordinated multimedia explanations. IEEE 
Computer, 24(10): 33-41. 
Nancy Green, Giuseppe Carenini, Stephan Kerpedjiev, 
Steven Roth, and Johanna Moore. 1998. A media- 
independent content language for integrated text and
graphics generation. In CVIR '98- Workshop on Con- 
tent Visualization and Intermedia Representations. 
Jeanette K. Gundel, Nancy Hedberg, and Ron Zacharski. 
1993. Cognitive status and the form of referring ex- 
pressions in discourse. Language, 69(2):274-307.
M. Johnston, P. R. Cohen, D. McGee, J. Pittman, S. L. 
Oviatt, and I. Smith. 1997. Unification-based multi- 
modal integration. In ACL/EACL 97: Proceedings of 
the Annual Meeting of the Association for Computa-
tional Linguistics. 
Michael Johnston. 1998. Unification-based multimodal
parsing. In COLING/ACL. 
Aravind K. Joshi, L. Levy, and M. Takahashi. 1975. Tree 
adjunct grammars. Journal of the Computer and Sys- 
tem Sciences, 10:136-163.
S. D. Kelly, J. D. Barr, R. B. Church, and K. Lynch. 1999. 
Offering a hand to pragmatic understanding: The role 
of speech and gesture in comprehension and memory. 
Journal of Memory and Language, 40:577-592. 
A. Kendon. 1974. Movement coordination in social in- 
teraction: some examples described. In S. Weitz, ed-
itor, Nonverbal Communication. Oxford University Press, New York.
D. B. Koons, C. J. Sparrell, and K. R. Thórisson. 1993.
Integrating simultaneous input from speech, gaze and 
hand gestures. In M. T. Maybury, editor, Intelligent
Multi-media Interfaces. MIT Press, Cambridge. 
James Lester, Stuart Towns, Charles Callaway, and
Patrick FitzGerald. 1998. Deictic and emotive com- 
munication in animated pedagogical agents. In Work- 
shop on Embodied Conversational Characters. 
David McNeill. 1992. Hand and Mind: What Gestures 
Reveal about Thought. University of Chicago Press, 
Chicago. 
Johanna Moore. 1994. Participating in Explanatory Di-
alogues. MIT Press, Cambridge MA. 
S. L. Oviatt. 1995. Predicting spoken language disflu-
encies during human-computer interaction. Computer 
Speech and Language, 9(1):19-35.
Ellen Prince. 1986. On the syntactic marking of pre- 
supposed open propositions. In Proceedings of the 
22nd Annual Meeting of the Chicago Linguistic Soci- 
ety, pages 208-222, Chicago. CLS. 
Ellen F. Prince. 1992. The ZPG letter: Subjects, definite- 
ness and information status. In William C. Mann and 
Sandra A. Thompson, editors, Discourse Description: 
Diverse Analyses of a Fund-raising Text, pages 295- 
325. John Benjamins, Philadelphia. 
Jeff Rickel and W. Lewis Johnson. 1999. Animated 
agents for procedural training in virtual reality: Per- 
ception, cognition and motor control. Applied Artifi- 
cial Intelligence, 13:343-382. 
W. T. Rogers. 1978. The contribution of kinesic illus- 
trators towards the comprehension of verbal behavior 
within utterances. Human Communication Research, 
5:54--62. 
Yves Schabes. 1990. Mathematical and Computational 
Aspects of Lexicalized Grammars. Ph.D. thesis, Com- 
puter Science Department, University of Pennsylva- 
nia. 
Mark Steedman. 1991. Structure and intonation. Lan- 
guage, 67:260-296. 
Matthew Stone and Christine Doran. 1997. Sentence 
planning as description using tree-adjoining grammar. 
In Proceedings of ACL, pages 198-205. 
Matthew Stone, Tonia Bleam, Christine Doran, and 
Martha Palmer. 2000. Lexicalized grammar and the 
description of motion events. In TAG+: Workshop on 
Tree-Adjoining Grammar and Related Formalisms.
Michael Strube. 1998. Never look back: An alternative 
to centering. In Proceedings of COLING-ACL. 
L. A. Thompson and D. W. Massaro. 1986. Evaluation 
and integration of speech and pointing gestures dur- 
ing referential understanding. Journal of Experimen- 
tal Child Psychology, 42:144-168. 
David R. Traum and James F. Allen. 1994. Discourse 
obligations in dialogue processing. In ACL, pages 1- 
8. 
W. Wahlster, E. André, W. Graf, and T. Rist. 1991.
Designing illustrated texts. In Proceedings of EACL, 
pages 8-14. 
Hao Yan. 2000. Paired speech and gesture generation in
embodied conversational agents. Master's thesis, Me-
dia Lab, MIT. 
