Planning Referential Acts 
for Animated Presentation Agents 
Elisabeth André, Thomas Rist 
German Research Center for Artificial Intelligence (DFKI) 
Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany 
Email: {andre,rist}@dfki.uni-sb.de 
Abstract 
Computer-based presentation systems enable the realization of effective and dynamic presentation styles that incorporate multiple media. In particular, they allow for the emulation of conversational styles known from personal human-human communication. In this paper, we argue that life-like characters are an effective means of encoding references to world objects in a presentation. We present a two-phase approach which first generates high-level referential acts and then transforms them into fine-grained animation sequences. 
1 Introduction 

A number of researchers have developed algorithms to discriminate referents from alternatives via linguistic means (cf. (RD92)). When moving from language discourse to multimedia discourse, referring expressions may be composed of several constituents in different media. Each constituent conveys some discriminating attributes which in sum allow for a proper identification of the referent. However, to ensure that a composed referring expression is intelligible, the system has to establish cohesive links between the single parts (cf. (AR94)). 

In this paper, we argue that life-like characters are particularly suitable for accomplishing referring tasks. For example, a life-like character can: 

• draw the viewer's attention to graphical object representations by pointing with body parts and additional devices such as a pointing stick; 

• make use of facial displays and head movements as an additional means of disambiguating discourse references; 

• effectively establish cross-references between presentation parts which are conveyed by different media, possibly displayed in different windows; 

• enable new forms of deixis by personalizing the system as a situated presenter. 

For illustration, let us have a look at two example presentations taken from the PPP system (Personalized Plan-Based Presenter, (RAM97)). In Fig. 1, a pointing gesture is combined with a graphical annotation technique using a kind of magnifying glass. 
Figure 1: Referring to Objects Using a Magnifying 
Glass 
The Persona provides an overview of interesting sites 
in the Saarland county by uttering their names and 
pointing to their location on a map. In addition, 
Persona annotates the map with a picture of each 
site before the user's eyes. The advantage of this 
method over static annotations is that the system 
can influence the temporal order in which the user 
processes an illustration. Furthermore, space problems are avoided since the illustration of the corresponding building disappears again after it has been described. The example also demonstrates how facial displays and head movements help to restrict the visual focus. By having the Persona look in the direction of the target object, the user's attention is directed to the target object. 

Figure 2: Establishing Cross-Media References 
Whereas in the last example, the pointing act of 
the Persona referred to a single graphical object, 
the scenario in Fig. 2 illustrates how cross-media 
links can be effectively built up between several il- 
lustrations. In this example, the Persona informs 
the user where the DFKI building is located. It ut- 
ters: "DFKI is located in Saarbrücken" and uses two 
pointing sticks to refer to two graphical depictions 
of DFKI on maps with different granularity. 
As shown above, life-like characters facilitate the 
disambiguation of referring expressions. On the 
other hand, a number of additional dependencies 
have to be handled since a referring act involves not 
only the coordination of document parts in different 
media, but also the coordination of locomotion, ges- 
tures and facial displays. To accomplish these tasks, 
we have chosen a two-phase approach which involves 
the following steps: 
(1) the creation of a script that specifies the tempo- 
ral behavior of the constituents of a referential 
act, such as speaking and pointing 
(2) the context-sensitive conversion of these con- 
stituents into animation sequences 
2 Representation of the Multimedia 
Discourse 
A few researchers have already addressed referring 
acts executed by life-like characters in a virtual 3D 
environment (cf. (CPB+94; LVTC97)). In this case, 
the character may refer to virtual objects in the 
same way as a human would in a real environment 
with direct access to the objects. A different situa- 
tion occurs when a character interacts with objects 
via their presentations as in the example scenarios 
above. Here, we have to explicitly distinguish be- 
tween domain objects and document objects. First, 
there may be more than one representative for one 
and the same world object in a presentation. For ex- 
ample, in Fig. 2, DFKI is represented by a schematic 
drawing and a colored polygon. Furthermore, it 
makes a difference whether a system refers to fea- 
tures of an object in the domain or in the presen- 
tation since these features may conflict with each 
other. To enable references to objects in a presenta- 
tion, we have to explicitly represent how the system 
has encoded information. For instance, to generate 
a cross-media reference as in Fig. 2, the system has 
to know which images are encodings for DFKI. In- 
spired by (Mac86), we use a relation tuple of the 
form: 
(Encodes carrier info context-space) 
to specify the semantic relationship between a presentation means and the information the means is to convey in a certain context space (cf. (AR94)). In our approach, the third argument refers to the context space to which the encoding relation corresponds, and not to a graphical language as in the original Mackinlay approach. This enables us 
to use one and the same presentation means differ- 
ently in different context spaces. For example, the 
zoom inset in Fig. 1 is used as a graphical encoding 
of the DFKI building in the current context space, 
but may serve in another context as a representative 
building of a certain architectural style. In addition, 
we not only specify encoding relations between in- 
dividual objects, but also specify encoding relations 
on a generic level (e.g., that the property of being a 
red polygon on a map encodes the property of being 
a built-up area in the world). 
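The role of the context-space argument can be sketched as a toy lookup structure. All identifiers below are invented for illustration; they are not the PPP system's actual data structures.

```python
# A minimal sketch of the (Encodes carrier info context-space)
# relation as a flat list of tuples. Names are hypothetical.

encodes = [
    # (carrier,        info,               context_space)
    ("zoom-inset-1",   "dfki-building",    "cs-site-overview"),
    ("red-polygon-3",  "dfki-building",    "cs-city-map"),
    ("zoom-inset-1",   "building-style-x", "cs-architecture"),
]

def carriers_for(info, context_space):
    """All presentation means encoding `info` in the given context
    space -- e.g. every image that stands for the DFKI building."""
    return [c for (c, i, cs) in encodes
            if i == info and cs == context_space]

def info_of(carrier, context_space):
    """What a presentation means encodes in a given context space."""
    return [i for (c, i, cs) in encodes
            if c == carrier and cs == context_space]
```

Note how the same carrier (`zoom-inset-1`) encodes different information in different context spaces, mirroring the zoom-inset example above.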
Furthermore, we have to explicitly represent the 
Persona's current state since it influences both the 
contents and the form of a referring expression. For 
instance, the applicability of deictic spatial expres- 
sions, such as "on my left", depends on the Persona's 
current position. 
3 High-Level Planning of Referential Acts 
Following a speech-act theoretic perspective, we 
consider referring as a goal-directed activity (cf. 
(AK87)). The goal underlying a referring expres- 
sion is to make the user activate appropriate mental 
representations in the sense of picking them out of a 
set of representations which are already available or 
which have to be built up (e.g., by localizing an ob- 
ject in a user's visual field). To plan referential acts 
which accomplish such goals, we build upon our pre- 
vious work on multimedia presentation design (cf. 
(AR96)). The main idea behind this approach was 
to formalize action sequences for designing presenta- 
tion scripts as operators of a planning system. Start- 
ing from a complex communicative goal, the planner 
tries to find a presentation strategy which matches 
this goal and generates a refinement-style plan in the 
form of a directed acyclic graph (DAG). This plan 
reflects not only the rhetorical structure, but also 
the temporal behavior of a presentation by means of 
qualitative and metric constraints. Qualitative con- 
straints are represented in an "Allen-style" fashion 
(cf. (All83)) which allows for the specification of 
thirteen temporal relationships between two named 
intervals, e.g. (Speak1 (During) PointP). Quantita- 
tive constraints appear as metric (in)equalities, e.g. 
(5 < Duration Point2). While the top of the pre- 
sentation plan is a more or less complex presentation 
goal (e.g., instructing the user in switching on a de- 
vice), the lowest level is formed by elementary pro- 
duction (e.g., to create an illustration or to encode a 
referring expression) and presentation acts (e.g., to 
display an illustration, to utter a verbal reference or 
to point to an object). 
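The two kinds of constraints can be sketched over intervals represented as (start, end) pairs; only the relations used in this paper are shown, and the concrete numbers are invented for illustration.

```python
# Sketch of qualitative ("Allen-style") and metric temporal
# constraints over named intervals. Values are hypothetical.

def during(a, b):
    """a (During) b: a lies strictly inside b."""
    return b[0] < a[0] and a[1] < b[1]

def meets(a, b):
    """a (Meets) b: a ends exactly where b starts."""
    return a[1] == b[0]

def duration(a):
    return a[1] - a[0]

speak1 = (0.0, 4.0)   # a speech act
point2 = (1.0, 3.5)   # a pointing act

# Qualitative: the pointing act falls within the speech act.
ok_qualitative = during(point2, speak1)

# Metric, in the style of (5 < Duration Point2):
ok_metric = 5 < duration(point2)
```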
If the presentation planner decides that a reference 
to an object should be made, it selects a strategy 
for activating a mental representation of this object. 
These strategies incorporate knowledge concerning: 
• the attributes to be selected for referent disam- 
biguation 
To discriminate objects from alternatives, the 
system may refer not only to features of an ob- 
ject in a scene, but also to features of the pre- 
sentation model, their interpretation and to the 
position of objects within a presentation, see 
also (Waz92). 
• the determination of an appropriate media com- 
bination 
Illustrations are used to discriminate an object from its alternatives through visual attributes, such as shape or surface, or through its location. 
Pointing gestures are planned to disambiguate 
or simplify a referring expression or to establish 
a coreferential relationship to other document 
parts. 
• the temporal coordination of the constituents of 
a referential act 
If a referring expression is composed of several 
constituents of different media, they have to be 
synchronized in an appropriate manner. For in- 
stance, a pointing gesture should be executed 
while the corresponding verbal part of the re- 
ferring expression is uttered. 
After the planning process is completed, the sys- 
tem builds up a schedule for the presentation which 
specifies the temporal behavior of all production and 
presentation acts. To accomplish this task, the sys- 
tem first builds up a temporal constraint network by 
collecting all temporal constraints on and between 
the actions. Some of these constraints are given by 
the applied plan operators. Others result from linearization constraints of the natural-language generator. 
For illustration, let's assume the presentation 
planner has built up the following speech and point- 
ing acts: 
AI: (S-Speak Persona User (type pushto 
modus (def imp tense pres number sg))) 
A2: (S-Speak Persona User 
(theagent (type individual 
thediscourserole 
(type discourserole value hearer) 
modus 
(def the ref pro number sg)))) 
A3: (S-Speak Persona User 
(theobject (type taskobject 
thetaskobject 
(type namedobject 
thename S-4 
theclass 
(type class value on-off-switch))))) 
A4: (S-Speak Persona User 
(thegoal (type dest 
thedest (type destloc value right)))) 
A5: (S-Point Persona 
User image-on-off-switch-1 window-3) 
At this time, decisions concerning word ordering have not yet been made. The only temporal constraints 
which have been set up by the planner are: (A5 (During) A3). That is, the Persona has to point to 
an object while the object's name and type is uttered 
verbally. 
The act specifications A1 to A4 are forwarded to 
the natural-language generation component where 
grammatical encoding, linearization and inflection 
takes place. This component generates: "Push the 
on/off switch to the right". That is, during text gen- 
eration we get the following additional constraints: 
(A1 (Meets) A3), (A3 (Meets) A4).1 
After collecting all constraints, the system de- 
termines the transitive closure over all qualitative 
constraints and computes numeric ranges over in- 
terval endpoints and their difference. Finally, a 
schedule is built up by resolving all disjunctions 
and computing a total temporal order (see (AR96)). 
Among other things, disjunctions may result from 
different correct word orderings, such as "Press 
the on/off switch now." versus "Now, press the 
on/off switch." In this case, the temporal con- 
straint network would contain the following con- 
straints: (Or (S-Speak-Now (Meets) S-Speak-Press) 
(S-Speak-Switch (Meets) S-Speak-Now)), (S-Speak- 
Press (Meets) S-Speak-Switch), (S-Point (During) 
S-Speak-Switch). For these constraints, the system 
would build up the following schedules: 
Schedule 1 
1: Start S-Speak-Now 
2: Start S-Speak-Press, End S-Speak-Now 
3: Start S-Speak-Switch, End S-Speak-Press 
4: Start S-Point 
5: End S-Point 
6: End S-Speak-Switch 
Schedule 2 
1: Start S-Speak-Press 
2: Start S-Speak-Switch, End S-Speak-Press 
3: Start S-Point 
4: End S-Point 
5: Start S-Speak-Now, End S-Speak-Switch 
6: End S-Speak-Now 
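The two schedules can be checked against the stated constraints with a small verifier. The event names abbreviate the speech and pointing acts, and the checking logic is an illustrative sketch, not the system's actual constraint solver.

```python
# A schedule is a total order over start/end events; events listed
# at the same step happen simultaneously. Names are hypothetical.

def event_times(schedule):
    """Map each (kind, act) event to its step in the total order."""
    t = {}
    for step, events in enumerate(schedule, 1):
        for e in events:
            t[e] = step
    return t

def meets(t, a, b):
    return t[("end", a)] == t[("start", b)]

def during(t, a, b):
    return (t[("start", b)] < t[("start", a)]
            and t[("end", a)] < t[("end", b)])

def valid(schedule):
    """Check the constraints from the word-ordering example."""
    t = event_times(schedule)
    return ((meets(t, "now", "press") or meets(t, "switch", "now"))
            and meets(t, "press", "switch")
            and during(t, "point", "switch"))

# Schedule 1: "Now, press the on/off switch."
schedule1 = [[("start", "now")],
             [("start", "press"), ("end", "now")],
             [("start", "switch"), ("end", "press")],
             [("start", "point")],
             [("end", "point")],
             [("end", "switch")]]

# Schedule 2: "Press the on/off switch now."
schedule2 = [[("start", "press")],
             [("start", "switch"), ("end", "press")],
             [("start", "point")],
             [("end", "point")],
             [("start", "now"), ("end", "switch")],
             [("end", "now")]]
```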
Since it is usually difficult to anticipate at design 
time the exact durations of speech acts, the system 
just builds up a partial schedule which reflects the 
ordering of the acts. This schedule is refined at pre- 
sentation display time by adding new metric con- 
straints concerning the duration of speech acts to 
the temporal constraint network. 
4 Context-sensitive Refinement of 
Referential Acts 
The presentation scripts generated by the presen- 
tation planner are forwarded to the Persona Server 
1Note that we don't get any temporal constraints for 
A2 since it is not realized on the surface level. 
which converts them into fine-grained animations. 
Since the basic actions the Persona has to perform 
depend on its current state, complex dependencies 
have to be considered when creating animation sequences. To choose among different start positions 
and courses of pointing gestures (see Fig. 3), we 
consider the following criteria: 
Figure 3: Different Pointing Gestures 
- the position o/the Persona relative to the target 
object; 
If the Persona is too far away from the target 
object, it has to walk to it or use a telescope 
pointing stick. In case the target object is lo- 
cated behind the Persona, the Persona has to 
turn around. To determine the direction of the 
pointing gesture, the system considers the ori- 
entation of the vector from the Persona to the 
target object. For example, if the target object 
is located on the right of the Persona's right 
foot, the Persona has to point down and to the 
right. 
- the set of adjacent objects and the size of the 
target object; 
To avoid ambiguities and occlusions, the Per- 
sona may have to use a pointing stick. On the 
other hand, it may point to isolated and large 
objects just with a hand. 
- the current screen layout; 
If there are regions which must not be occluded 
by the Persona, the Persona might not be able 
to move closer to the target object and may 
have to use a pointing stick instead. 
- the expected length of a verbal explanation that 
accompanies the pointing gesture; 
If the Persona intends to provide a longer verbal 
explanation, it should move to the target object 
and turn to the user (as in the upper row in 
Fig. 3). In case the verbal explanation is very 
short, the Persona should remain stationary if 
possible. 
- the remaining overall presentation time. 
While the default strategy is to move the Persona towards the target object, time shortage will make the Persona use a pointing stick instead. 

[Figure 4 depicts the decomposition in three layers: high-level Persona actions (s-point) are decomposed context-sensitively into uninterruptable basic postures (take-position/move-to, start-point, r-stick-point, end-point), which are rendered as frames (pixmaps).] 
Figure 4: Context-Sensitive Decomposition of a Pointing Gesture 
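The criteria above can be condensed into a single decision function. The thresholds and names below are invented for illustration; the Persona Server's real rules are more involved.

```python
# Hypothetical sketch of gesture selection: given the situation,
# decide whether the Persona walks and what it points with.

def choose_pointing_gesture(distance, target_is_small_or_crowded,
                            path_blocked, long_explanation, time_short):
    """Return (locomotion, instrument) for a pointing act."""
    if time_short or path_blocked:
        # No time, or protected screen regions in the way:
        # stay put and point from afar with the telescope stick.
        return ("stay", "telescope-stick")
    # Small or crowded targets need a stick to avoid
    # ambiguities and occlusions; isolated large ones take a hand.
    instrument = "stick" if target_is_small_or_crowded else "hand"
    if long_explanation or distance > 1.0:
        # Walk over, then turn to the user and point.
        return ("move-to-target", instrument)
    return ("stay", instrument)
```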
To support the definition of Persona actions, we 
have defined a declarative specification language and 
implemented a multi-pass compiler that enables the 
automated generation of finite-state automata from 
these declarations. These finite-state automata in 
turn are translated into efficient machine code (cf. 
(RAM97)). 
Fig. 4 shows a context-sensitive decomposition of 
a pointing act delivered by the presentation planner 
into an animation sequence. Since in our case the 
object the Persona has to point to is too far away, the 
Persona first has to perform a navigation act before 
the pointing gesture may start. We associate with each action a time interval in which the action takes place. For example, the act take-position has to be executed during (t1 t2). The same applies to the move-to act, the specialization of take-position. The intervals associated with the subactions of move-to are subintervals of (t1 t2) and form a sequence. That is, the Persona first has to turn to the right during (t1 t21), then take some steps during (t21 t22) and finally turn to the front during (t22 t2). Note that the exact length of all time intervals can only be determined at runtime. 
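The decomposition of move-to into contiguous subintervals can be sketched as follows. Since the actual durations are fixed only at runtime, the even split below is merely a stand-in; the function and act names are illustrative.

```python
# Hypothetical sketch: expand a high-level act over (t1, t2) into a
# sequence of basic postures whose subintervals partition the parent
# interval. Real durations would be supplied at presentation time.

def decompose(act, t1, t2, subacts):
    """Split (t1, t2) evenly over `subacts`, returning
    (name, start, end) triples that tile the parent interval."""
    step = (t2 - t1) / len(subacts)
    return [(sub, t1 + i * step, t1 + (i + 1) * step)
            for i, sub in enumerate(subacts)]

# move-to, the specialization of take-position, over (t1, t2):
sequence = decompose("move-to", 0.0, 3.0,
                     ["turn-right", "step", "turn-to-front"])
```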
5 Conclusion 
In this paper, we have argued that the use of life-like 
characters in the interface can essentially increase 
the effectiveness of referring expressions. We have 
presented an approach for the automated planning 
of referring expressions which may involve different 
media and dedicated body movements of the char- 
acter. While content selection and media choice are 
performed in a proactive planning phase, the trans- 
formation of referential acts into fine-grained anima- 
tion sequences is done reactively taking into account 
the current situation of the character at presentation 
runtime. 
The approach presented here provides a good 
starting point for further extensions. Possible di- 
rections include: 
• Extending the repertoire of pointing gestures 
Currently, the Persona only supports punctual 
pointing with a hand or a stick. In the fu- 
ture, we will investigate additional pointing ges- 
tures, such as encircling and underlining, by ex- 
ploiting the results from the XTRA project (cf. 
(Rei92)). 
• Spatial deixis 
The applicability of spatial prepositions, such 
as "on the left", depends on the orientation of 
the space which is either given by the intrinsic 
organization of the reference object or the loca- 
tion of the observer (see e.g. (Wun85)). While 
we assumed in our previous work on the seman- 
tics of spatial prepositions that the user's loca- 
tion coincides with the presenter's location (cf. 
(Waz92)), we now have to distinguish whether 
an object is localized from the user's point of 
view or the Persona's point of view as the situ- 
ated presenter. 
• Referring to moving target objects 
A still unsolved problem results from the dy- 
namic nature of online presentations. Since im- 
age attributes may change at any time, the vi- 
sual focus has to be updated continuously which 
may be very time-consuming. For instance, the 
Persona is currently not able to point to moving 
objects in an animation sequence since there is 
simply not enough time to determine an object's 
coordinates at presentation time. 
• Empirical evaluation of the Persona's pointing 
gestures 
We have argued that the use of a life-like character enables the realization of more effective 
referring expressions. To empirically validate 
this hypothesis, we are currently embarking on 
a study of the user's reference resolution pro- 
cesses with and without the Persona. 
Acknowledgments 
This work has been supported by the BMBF under 
the grants ITW 9400 7 and 9701 0. We would like 
to thank Jochen Müller for his work on the Persona 
server and the overall system integration. 

References 
D. Appelt and A. Kronfeld. A computational model 
of referring. In Proc. of the 10th IJCAI, pages 640-647, Milan, Italy, 1987. 
J. F. Allen. Maintaining Knowledge about Tem- 
poral Intervals. Communications of the ACM, 
26(11):832-843, 1983. 
E. André and T. Rist. Referring to World Objects with Text and Pictures. In Proc. of the 15th COLING, volume 1, pages 530-534, Kyoto, Japan, 1994. 
E. André and T. Rist. Coping with temporal constraints in multimedia presentation planning. In Proc. of AAAI-96, volume 1, pages 142-147, Portland, Oregon, 1996. 
J. Cassell, C. Pelachaud, N.I. Badler, M. Steedman, 
B. Achorn, T. Becket, B. Douville, S. Prevost, and 
M. Stone. Animated conversation: Rule-based 
generation of facial expression, gesture and spoken 
intonation for multiple conversational agents. In 
Proc. of Siggraph'94, Orlando, 1994. 
J. Lester, J.L. Voerman, S.G. Towns, and C.B. Call- 
away. Cosmo: A life-like animated pedagogi- 
cal agent with deictic believability. In Proc. of 
the IJCAI-97 Workshop on Animated Interface 
Agents: Making them Intelligent, Nagoya, 1997. 
J. Mackinlay. Automating the Design of Graphi- 
cal Presentations of Relational Information. ACM 
Transactions on Graphics, 5(2):110-141, April 
1986. 
T. Rist, E. André, and J. Müller. Adding Animated 
Presentation Agents to the Interface. In Proceed- 
ings of the 1997 International Conference on In- 
telligent User Interfaces, pages 79-86, Orlando, 
Florida, 1997. 
E. Reiter and R. Dale. A Fast Algorithm for the 
Generation of Referring Expressions. In Proc. 
of the 14th COLING, volume 1, pages 232-238, 
Nantes, France, 1992. 
N. Reithinger. The Performance of an Incremen- 
tal Generation Component for Multi-Modal Dia- 
log Contributions. In R. Dale, E. Hovy, D. Rösner, 
and O. Stock, editors, Aspects of Automated Nat- 
ural Language Generation: Proceedings of the 
6th International Workshop on Natural Language 
Generation, pages 263-276. Springer, Berlin, Hei- 
delberg, 1992. 
P. Wazinski. Generating Spatial Descriptions for 
Cross-Modal References. In Proceedings of the 
Third Conference on Applied Natural Language 
Processing, pages 56-63, Trento, Italy, 1992. 
D. Wunderlich. Raumkonzepte. Zur Semantik der lokalen Präpositionen. In T.T. Ballmer and R. Posner, editors, Nach-Chomskysche Linguistik, pages 340-351. de Gruyter, Berlin, New York, 
1985. 
