REFERRING TO WORLD OBJECTS WITH TEXT AND PICTURES 
Elisabeth André, Thomas Rist
German Research Center for Artificial Intelligence (DFKI)
D-66123 Saarbrücken, Germany, e-mail: {andre, rist}@dfki.uni-sb.de
ABSTRACT: It often makes sense to employ both text and pictures
when referring to world objects. In this paper, we present a model
for referring which is based on the assumption that concepts may be
activated not only by text, but also by pictures and text-picture
combinations. By means of a case study, we demonstrate that failure
and success of referring acts can be explained by the user's ability
to infer certain links between mental representations and object
descriptions. Finally, we show how the model has been incorporated
into a plan-based multimedia presentation system by defining
operators for concept activation.
1 INTRODUCTION 
From a speech act theoretical point of view, referring is a planned
action to achieve certain goals (Appelt and Kronfeld, 1987). Although
natural language may be the most conventional vehicle for referring,
it has been widely accepted that pictures can be used as well. For
example, Goodman (1969) points out that pictures can be employed to
refer both to an individual object and to the type of which an object
is an exemplar. Moreover, there are good reasons to include pictures
in referring acts. Pictures effectively convey discriminating object
properties such as surface attributes and shape. If an object can only
be discriminated against alternatives through its location, a picture
may provide the spatial context of the object. Since depictions are
explicit material representations of the world objects to which they
correspond, new attributes of the type 'being depicted as ...' are
introduced which, in turn, provide an additional source for object
discrimination (e.g., the knob which is represented by the black
circle ...). Last but not least, several graphical focusing techniques
can be applied to effectively constrain the set of alternatives (e.g.,
arrows, blinking). Unfortunately, there is also a dark side of the
picture. An obvious drawback is that pictures do not provide
syntactical devices to distinguish between a reference-specifying and
a predication-specifying part, since objects and their properties are
hardly separable once depicted. Another difficulty is that pictures
lack the means to distinguish definite from indefinite descriptions.
Thus, it may remain unclear whether a particular object or an
arbitrary exemplar of a class is depicted. The conclusion we can draw
from these considerations is that it often makes sense to employ both
text and pictures when referring to domain objects. Pictures may be
used in order to simplify verbal reference expressions. On the other
hand, ambiguities of pictures can be resolved by providing additional
information through text. When analyzing illustrated documents such as
assembly manuals and instructions for use, different kinds of
referring expression can be found:
Multimedia referring expressions refer to world objects via a
combination of at least two media. Each medium conveys some
discriminating attributes which in sum allow for a proper
identification of the intended object. Examples are NL expressions
that are accompanied by pointing gestures and text-picture
combinations where the picture provides information about the
appearance of an object and the text restricts the visual search space
as in "the switch on the frontside".
Anaphoric referring expressions refer to world objects in an
abbreviated form (Hirst, 1981), presuming that they are already
explicitly or implicitly introduced in the discourse. The presentation
part to which an anaphoric expression refers back is called the
antecedent of the referring expression. In a multimedia discourse, we
have to handle not only linguistic anaphora with linguistic
antecedents, but also linguistic anaphora with pictorial antecedents,
and pictorial anaphora with linguistic or pictorial antecedents.
Examples such as "the hatched switch" show that the boundary between
multimedia referring expressions and anaphora is indistinct. Here, we
have to consider whether the user is intended to employ all parts of a
presentation for object disambiguation or whether one wants him to
infer anaphoric relations between them.
Cross-media referring expressions do not refer to world objects, but
to document parts in other presentation media (Wahlster et al., 1991).
Examples of cross-media referring expressions are "the upper left
corner of the picture" or "Fig. x". In most cases, cross-media
referring expressions are part of a complex multimedia referring
expression where they serve to direct the reader's attention to parts
of a document that also have to be employed in order to find the
intended referent.
When viewing referring as a planned action, we have to specify which
goals underlie the use of different types of referring expressions.
Appelt and Kronfeld (1987) distinguish between the literal goal and
the discourse purpose of a reference act. Whereas the literal goal is
to establish mutual belief between a speaker and a hearer that a
particular object is being talked about, the discourse purpose is to
make the hearer recognize what kind of identification is appropriate
and to have him identify the referent accordingly. When addressing
illustrated documents, the question arises of what identification
means when domain objects are referred to via pictures (and text). As
with language, this varies from discourse to discourse. For example,
if the user is confronted with a picture showing how to insert the
filter of a coffee machine, he has to recognize whether
System believes:
    (Has-position r1_s p1_s)
    (Temperature-control r1_s)
    (Has-position r3_s p3_s)
    (On/off-switch r3_s)
    (Coref r1_s r1_u)
    (Coref r1_s r2_u)
    (Coref r3_s r3_u)
    (Coref r3_s r4_u)
System believes User believes:
    (Has-position r1_u p1_u)
    (Temperature-control r2_u)
    (Has-position r3_u p3_u)
    (On/off-switch r4_u)
    (Or (And (Coref r1_u r2_u)
             (Coref r3_u r4_u))
        (And (Coref r1_u r4_u)
             (Coref r3_u r2_u)))
Figure 1: Modelling Example: Different Knowledge Concerning the Identity of Objects
any object with the feature 'being a filter' can be inserted or
whether a particular object is meant. In the first case, he has to
identify the picture object as an exemplar of a certain class whereas,
in the second case, he has to look for something in the world which
fits the graphical depiction. In other situations, identification
involves establishing a kind of cohesive link between document parts.
If the user is confronted with a sequence of pictures showing an
object from different angles, he has to recognize that in all pictures
the same object is depicted (pictorial anaphor with pictorial
antecedent). When reading an utterance such as "the resistor in the
figure above," he has to recognize an anaphoric relationship between
the textual description and the graphical depiction (linguistic
anaphor with pictorial antecedent).
Previous work on the generation of referring expressions in a
multimedia environment has mainly concentrated on single reference
phenomena, such as references to pictorial material via natural
language and pointing gestures (Allgayer et al., 1989; Claassen, 1992;
Stock et al., 1993) and the generation of cross-media references from
text to graphics (McKeown et al., 1992; Wahlster et al., 1993). The
aim of this paper is, however, to provide a more general model that
explains which kinds of coreferential link between referring
expressions, objects of the world and objects of the multimedia
presentation have to be established to ensure the comprehensibility of
a referring expression.
2 A MODEL FOR REFERRING WITH TEXT AND PICTURES
When referring to domain objects a presentation system has to find
intelligible object descriptions which will activate appropriate
representations. We assume that representations can be activated in
the sense of picking them out of a set of representations which are
already available or which have to be built up (e.g., by localizing an
object in a user's visual field). Representations can be activated by
textual descriptions, by graphical descriptions or by mixed
descriptions. Whereas the order in which representations are activated
by a text is influenced by the discourse structure, it is less than
clear in which order a picture activates representations. If several
objects are depicted, the corresponding representations may be
activated simultaneously.
2.1 Representations of World Objects
To ensure the transferability of our approach, we don't presuppose a
certain knowledge representation language. However, an essential part
of the model concerns the distinction between the system's beliefs
about the world and the system's beliefs about the user's beliefs. We
represent these beliefs in different models. For example, the system
may classify a certain object as an espresso machine while it assumes
the user regards the object as a coffee machine. Furthermore, we have
to consider that the user's and the system's beliefs about the
identity of objects may differ. The system may believe that the user
has different representations for one and the same object without
knowing how they are related to each other. Conversely, it may happen
that the user is assumed to have only one representation for objects
which the system considers as distinct entities. As a consequence, our
models can contain different representations for one and the same
world object. We use the predicate
    (Coref rep1 rep2)
to express that rep1 and rep2 are representations of the same world
object.
Fig. 1 gives an example of how to use the concepts introduced above.
Let's start from the following situation taken from an espresso
machine domain: The system knows that there are two switches (the
temperature control and the on/off switch) and also knows where they
are located. Let r1_s and r3_s correspond to the system's internal
representations of the switches. The user is assumed to look at the
espresso machine and to see two switches. Let r1_u and r3_u correspond
to internal representations of the switches which the user builds up
when looking at the machine. We assume that the user also knows of the
existence of the on/off switch and the temperature control, but is not
able to localize them. Let r2_u and r4_u be the user's representations
for the temperature control and the on/off switch. The fact that he
only knows that one of the switches he sees must be the temperature
control and the other the on/off switch can be expressed by means of a
disjunction. Either a Coref relation holds between r1_u and r2_u and
between r3_u and r4_u or, conversely, between r1_u and r4_u and
between r3_u and r2_u. The connection between the system's
representations r1_s and r3_s and the representations the user is
assumed to have is also expressed by coreference relations.
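
The disjunctive user model of Fig. 1 can be made concrete with a small
sketch. The encoding below is an assumption made for illustration (the
paper deliberately does not commit to a representation language): each
alternative is a set of Coref pairs, the list of alternatives stands
for the disjunction, and the function names are hypothetical.

```python
# Hypothetical encoding of the user model in Fig. 1: a list of alternatives,
# each a set of Coref pairs. The list as a whole stands for the disjunction.
USER_ALTERNATIVES = [
    {("r1_u", "r2_u"), ("r3_u", "r4_u")},
    {("r1_u", "r4_u"), ("r3_u", "r2_u")},
]


def coref_classes(pairs):
    """Group representations into classes, treating Coref as symmetric and transitive."""
    classes = []
    for a, b in pairs:
        merged = {a, b}
        untouched = []
        for cls in classes:
            if cls & merged:
                merged |= cls  # the new pair connects to an existing class
            else:
                untouched.append(cls)
        classes = untouched + [merged]
    return classes


def corefer(a, b, pairs):
    """True if a and b denote the same world object according to the given pairs."""
    return a == b or any(a in cls and b in cls for cls in coref_classes(pairs))


def believed_coref(a, b, alternatives):
    """A Coref link counts as believed only if it holds in every open alternative."""
    return all(corefer(a, b, alt) for alt in alternatives)


def learn_coref(a, b, alternatives):
    """Keep only the alternatives that agree with a newly communicated Coref link."""
    return [alt for alt in alternatives if corefer(a, b, alt)]


if __name__ == "__main__":
    print(believed_coref("r1_u", "r2_u", USER_ALTERNATIVES))  # False
    after = learn_coref("r1_u", "r2_u", USER_ALTERNATIVES)
    print(believed_coref("r3_u", "r4_u", after))              # True
```

Once one link of the disjunction is communicated, the other link of the
same alternative follows, which is the kind of knowledge enrichment
discussed in Section 2.3.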
2.2 Representation of Descriptions
As mentioned in Section 1, descriptions can be composed of text,
graphics and further presentation media. To cope with such
descriptions, we associate with each syntactical unit (depictions,
noun phrases, etc.) the set of object representations which will be
activated by that particular part. The referent of the whole
description is then considered as a member of the intersection of all
sets resulting from partial descriptions.
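
As a minimal sketch of this intersection step, assuming each partial
description has already been mapped to the set of representations it
activates (the part names and activation sets below are invented for
illustration and are not taken from the paper):

```python
def referent_candidates(activations):
    """Intersect the representation sets activated by the parts of a description.

    `activations` maps each partial description (noun phrase, depiction, ...)
    to the set of representations that part activates.
    """
    sets = [set(reps) for reps in activations.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


if __name__ == "__main__":
    # Hypothetical activation sets for a mixed text-picture description.
    activations = {
        "noun phrase 'switch'": {"r1_u", "r3_u"},
        "adjective 'dark'": {"r1_u", "r5_u"},
        "depiction in the figure": {"r1_u", "r3_u"},
    }
    print(referent_candidates(activations))  # {'r1_u'}
```

A description identifies its referent when the intersection contains
exactly one representation; an empty or larger result signals that
further discriminating parts or cohesive links are needed.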
An important prerequisite of our approach is that the system
explicitly represents how it has encoded information in a
presentation. Inspired by (Mackinlay, 1986), we use a relation tuple
of the form:
    (Encodes means information context-space)
to specify the semantic relationship between a textual or graphical
means, and the information the means is to convey in a certain context
space. In our approach, the third argument refers to the context space
to which the encoding relation corresponds, and not to a graphical
language as in Mackinlay's approach. This enables us to use one and
the same presentation means differently in different context spaces.
For example, a depiction of an espresso machine may refer to an
individual machine in one context space, but may serve as a
prototypical representative of an espresso machine in another. In
addition, we not only specify encoding relations between individual
objects, but also specify encoding relations on a generic level (e.g.,
that the property of being red in a picture encodes the property of
being defective in the world).
While it can be assumed that a user reads a text in sequential order,
it is often not clear at which times a user looks at a picture.
Therefore, it does not always make sense to further distinguish
between an anaphor and its antecedent. Fortunately, our approach does
not require identifying parts of a presentation as anaphora and
antecedents. It suffices to recognize which parts of a description are
intended to encode a uniquely determined object. To express such
cohesive relationships between presentation parts p1 and p2, we define
the predicate:
    (EncodesSame p1 p2 c) :=
      (Exists w (And (Encodes p1 w c) (Encodes p2 w c)
        (Forall v (Implies (Or (Encodes p1 v c) (Encodes p2 v c))
                           (Coref w v)))))
The first part of this definition expresses that there exists an
object w that p1 and p2 encode in the context space c, while the
second part means that this object w is uniquely determined.
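
The definition can be checked mechanically over a finite set of
encoding facts. The sketch below assumes that Encodes facts are stored
as (means, information, context-space) triples and that Coref is
supplied as a reflexive, symmetric test; both the storage format and
the helper names are assumptions made for illustration.

```python
def make_coref(pairs):
    """Build a reflexive, symmetric Coref test from explicitly listed pairs."""
    known = set(pairs) | {(b, a) for a, b in pairs}
    return lambda w, v: w == v or (w, v) in known


def encodes_same(p1, p2, c, encodes, coref):
    """Evaluate (EncodesSame p1 p2 c) over a finite set of Encodes triples."""
    by_p1 = {w for means, w, ctx in encodes if means == p1 and ctx == c}
    by_p2 = {w for means, w, ctx in encodes if means == p2 and ctx == c}
    # Exists w encoded by both parts such that everything either part
    # encodes in c corefers with w.
    return any(
        all(coref(w, v) for v in by_p1 | by_p2)
        for w in by_p1 & by_p2
    )


if __name__ == "__main__":
    # Hypothetical facts: a noun phrase and a depiction in context space c1.
    facts = {("np-1", "r2", "c1"), ("pic-1", "r2", "c1"), ("pic-1", "r9", "c1")}
    print(encodes_same("np-1", "pic-1", "c1", facts, make_coref([])))              # False
    print(encodes_same("np-1", "pic-1", "c1", facts, make_coref([("r2", "r9")])))  # True
```

The first call fails because pic-1 also encodes r9, which is not known
to corefer with r2, so the shared object is not uniquely determined.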
2.3 Links between Representations and Descriptions 
In understanding a referring expression, the user has to recognize
certain links between activated mental representations, between
descriptions and mental representations, and between textual and
graphical parts of descriptions. Which links are present in a
description and which have to be inferred varies from situation to
situation. To illustrate this, let's have a look at a case study
carried out in our espresso machine domain where text-picture
combinations are used to explain how to operate an espresso machine.
We assume that the user is requested to turn the temperature control
of an espresso machine. In this case, identification means activating
a representation the user builds up when localizing the referent in
his visual field. Furthermore, we presume the user knowledge of the
espresso machine described in Section 2.1; i.e., the user knows of the
existence of the on/off switch and the temperature control, has visual
access to the two switches in the world but is not able to tell them
apart. In the diagrams below, we use the abbreviations ES, C and E for
the relations EncodesSame, Coref and Encodes respectively.
In the document fragment shown in Fig. 2, the textual reference
expression uniquely determines a referent, but activates a
representation (r2_u) which doesn't contain any information to
localize the referent. Conversely, the representations activated by
the picture contain locative information, but here we have the problem
that several object representations are activated to the same extent.
Since only the property of being a switch, but not the property of
being a temperature control is conveyed by the picture, both switch
depictions become possible as antecedents of the textual referring
expression.
[Diagram: text label "the temperature control"]
Figure 2: Missing Cohesive Link between Text and Picture
In Fig. 3, the verbal description discriminates the referent from its
alternatives by attributes of the world object, namely 'being a
switch' and 'being depicted in the figure', and an attribute of the
depiction, namely 'being dark'. But, in contrast to the previous
example, only one of the representations activated by the picture fits
the verbal description. Thus, the user should be able to discover the
anaphoric link between the verbal description and the graphical
depiction and activate an appropriate representation.
[Diagram: text label "the dark switch"; representation r2_u]
Figure 3: Establishing a Cohesive Link by Incorporating Picture Attributes in Verbal Descriptions
In the previous example, an anaphoric link between text and picture
has been established by including pictorial attributes in the verbal
description. An alternative is to apply graphical focusing techniques
as in Fig. 4. Here, it's very likely that the user will be able to
draw a link between text and picture because he will assume that the
pictorial and the textual focus coincide. This example also
illustrates how the user's knowledge of the identity of objects can be
enriched by means of a referring act. The verbal description without
the graphics and the graphical depiction without the text activate
different representations of the switch. When considering both text
and graphics, the user will conclude that they refer to the same
object. Thus, he is not only able to identify the switch as required,
he is also able to combine the different representations of the switch
into one. Note that this phenomenon can also be explained in terms of
centering theory (Grosz et al., 1983). In the example, the preferred
center of the picture would coincide with the backward looking center
of the text.
[Diagram: text "Turn the temperature control clockwise."; text label "the temperature control"; representation r2_u]
Figure 4: Establishing a Cohesive Link by Correlating Textual and Pictorial Focus
The example shown in Fig. 5 differs from the previous ones in that no
correspondence link between picture objects and real world objects can
be established. Although the user is able to draw an anaphoric link
between the verbal and the pictorial description, he is not able to
visually identify the intended referent.
[Diagram: text "Turn the temperature control clockwise."; text label "the temperature control"; representation r2_u]
Figure 5: Missing Correspondence between Picture and World
Summing up, it can be said that a referring act is only successful
when the description provides an access path to an appropriate
representation. The user has to infer such a path from encoding
relationships and cohesive links between the parts of a description.
As the examples show, the following cases occur: a) If the user does
not recognize which picture parts correspond to which world object,
the referring act either fails (cf. Fig. 5) or the picture contributes
nothing to its success. b) If the relationship between pictorial
depictions and verbal descriptions is unclear, the referent can either
not be found (cf. Fig. 2) or one of the media has no influence on
referent identification. c) If a graphical depiction and a verbal
description activate different representations of one and the same
object and the user recognizes not only these links, but also a link
between the two presentation parts, he is not only able to find the
referent, but also able to combine the different representations into
one (cf. Fig. 4).
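
One way to read "access path" operationally is as reachability in a
graph whose nodes are presentation parts and representations and whose
edges are Encodes, EncodesSame and Coref links. The sketch below is
only an illustration of that reading, not the inference procedure of
the paper; node names such as `text`, `pic-left` and `pic-right`, and
the choice of which depiction is focused, are hypothetical.

```python
from collections import deque


def access_path(start, is_goal, links):
    """Breadth-first search over undirected links; returns a node path or None."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if is_goal(path[-1]):
            return path
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Representations that carry locative information (built up by looking).
LOCALIZED = {"r1_u", "r3_u"}

# Fig. 2 style: text and picture each encode something, but nothing links them.
FIG2_LINKS = [("text", "r2_u"), ("pic-left", "r1_u"), ("pic-right", "r3_u")]

# Fig. 4 style: a focusing technique adds an EncodesSame link (assumed here
# to connect the text with the left depiction).
FIG4_LINKS = FIG2_LINKS + [("text", "pic-left")]

if __name__ == "__main__":
    print(access_path("text", LOCALIZED.__contains__, FIG2_LINKS))  # None
    print(access_path("text", LOCALIZED.__contains__, FIG4_LINKS))
    # ['text', 'pic-left', 'r1_u']
```

Under this reading, case b) corresponds to a missing edge between the
textual and the pictorial part, and case a) to missing edges between
depictions and localized representations.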
3 USING THE MODEL TO GENERATE REFERRING EXPRESSIONS
In the following, we will sketch how we have integrated the approach
into the multimedia presentation system WIP (Wahlster et al., 1993).
At the heart of the WIP system is a presentation planner that is
responsible for determining the contents and selecting an appropriate
medium combination. The presentation planner receives as input a
presentation goal (e.g., the user should know where a certain switch
is located). It then tries to find a presentation strategy which
matches this goal and generates a refinement-style plan in the form of
a directed acyclic graph (DAG). This DAG reflects the propositional
contents of the potential document parts, the intentional goals behind
the parts as well as the rhetorical relationships between them; for
details see (André and Rist, 1993). While the top of the presentation
plan is a more or less complex presentation goal (e.g., instructing
the user in switching on a device), the lowest level is formed by
specifications of elementary presentation tasks (e.g., formulating a
request or depicting an object). These elementary tasks are directly
forwarded to the medium-specific generators, currently for text
(Kilger, 1994) and graphics (Rist and André, 1992).
The content of referring expressions is determined by the presentation
planner, which also decides which representations should be activated
and which medium should be chosen for this. To be able to perform
these steps, we need presentation strategies for linking propositional
acts with activation acts. An example of such a strategy is [1].
[1] Header: (Request S U (Action ?action) Text)
    Effect: (BMB S U (Goal S (Done U ?action)))
    Applicability Conditions:
      (And (Goal S (Done U ?action))
           (Bel S (Complex-Operating-Action ?action))
           (Bel S (Agent ?agent ?action))
           (Bel S (Object ?object ?action)))
    Main Acts:
      (S-Request S U
        (?action-spec (Agent ?agent-spec) (Object ?object-spec)))
    Subsidiary Acts:
      (Activate S U (Action ?action) ?action-spec Text)
      (Activate S U (Agent ?agent) ?agent-spec Text)
      (Activate S U (Object ?object) ?object-spec Text)
This strategy can be used to request the user to perform an action. In
this strategy, two kinds of act occur: an elementary speech act
S(urface)-Request and three activation acts for specifying the action
and the semantic case roles associated with the action (Activate). The
strategy prescribes text for the subsidiary acts because the resulting
referring expressions (?action-spec, ?agent-spec and ?object-spec) are
obligatory case roles of an S-Request speech act which will be
conveyed by text. For optional case roles any medium can be taken. In
addition to strategies for linking propositional and activation acts,
we need strategies for different kinds of activation and for
establishing Coref- and EncodesSame-relationships. For example,
strategy [2] can be used to activate a representation ?r-1 by text and
to simultaneously enrich the user's knowledge about the identity of
objects. The strategy only applies if there already exists an image
?pic-obj which encodes ?r-1, if the system believes that ?r-1 and ?r-2
are representations of the same world object, and if the system's
model of the user's beliefs contains ?r-2. If the strategy is applied,
the system a) provides a unique description ?d for ?r-2 (main act) and
b) ensures that the user recognizes that this description and the
corresponding image specify the same object (subsidiary act).
[2] Header: (Activate S U (?case-role ?r-1) ?d Text)
    Effect: (BMB S U (Coref ?r-1 ?r-2))
    Applicability Conditions:
      (And (BMB S U (Encodes ?pic-obj ?r-1 ?c))
           (Bel S (Coref ?r-1 ?r-2))
           (Bel S (Bel U (Thing ?r-2))))
    Main Acts:
      (Provide-Unique-Description S U ?r-2 ?d Text)
    Subsidiary Acts:
      (Achieve S
        (BMB S U (EncodesSame ?d ?pic-obj ?c)) ?medium)
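
To illustrate how the applicability conditions of such a strategy
might be tested, the following sketch matches condition patterns with
?variables against a set of ground beliefs. It is not the WIP planner:
the tuple encoding of formulae, the belief base and the identifiers
`pic-switch-1` and `c1` are assumptions made for this example.

```python
def match(pattern, fact, bindings):
    """Match a pattern containing ?variables against a ground fact.

    Returns the extended bindings, or None if the match fails.
    """
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings if bindings[pattern] == fact else None
        return {**bindings, pattern: fact}
    if isinstance(pattern, tuple) and isinstance(fact, tuple):
        if len(pattern) != len(fact):
            return None
        for sub_pattern, sub_fact in zip(pattern, fact):
            bindings = match(sub_pattern, sub_fact, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == fact else None


def applicable(conditions, beliefs, bindings=None):
    """Yield every binding under which all conditions occur in the belief base."""
    bindings = {} if bindings is None else bindings
    if not conditions:
        yield bindings
        return
    first, rest = conditions[0], conditions[1:]
    for fact in beliefs:
        extended = match(first, fact, bindings)
        if extended is not None:
            yield from applicable(rest, beliefs, extended)


# Applicability conditions of strategy [2], written as nested tuples.
CONDITIONS = [
    ("BMB", "S", "U", ("Encodes", "?pic-obj", "?r-1", "?c")),
    ("Bel", "S", ("Coref", "?r-1", "?r-2")),
    ("Bel", "S", ("Bel", "U", ("Thing", "?r-2"))),
]

# Hypothetical belief base for the espresso machine example.
BELIEFS = [
    ("BMB", "S", "U", ("Encodes", "pic-switch-1", "r1_u", "c1")),
    ("Bel", "S", ("Coref", "r1_u", "r2_u")),
    ("Bel", "S", ("Bel", "U", ("Thing", "r2_u"))),
]

if __name__ == "__main__":
    for result in applicable(CONDITIONS, BELIEFS):
        print(result)
```

In a planner, the header would normally bind ?r-1 before the
conditions are checked; passing such initial bindings as the third
argument of `applicable` restricts the search accordingly.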
For a), we use a discrimination algorithm similar to the algorithm
presented in (Reiter and Dale, 1992). However, we have investigated
additional possibilities for distinguishing objects from their
alternatives. We can refer not only to features of an object in a
scene, but also to features of the graphical model, their
interpretation and to the position of picture objects within the
picture, see also (Wazinski, 1992). A detailed description of our
discrimination algorithm can be found in (Schneiderlöchner, 1994).
Task b) can be accomplished by correlating the visual and the textual
focus, by redundantly encoding object attributes, or by explicitly
informing the user about a Coref-relationship. Such a
Coref-relationship can be established by strategies for the generation
of cross-media referring expressions (as in "The left switch in the
figure is the temperature control") or by strategies for annotating
objects in a figure.
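
The general idea of attribute-based discrimination, extended with
attributes of the depiction, can be sketched as follows. This is a
simplified greedy procedure in the spirit of incremental attribute
selection, not the algorithm of (Schneiderlöchner, 1994) or of (Reiter
and Dale, 1992); the attribute names, their preference order and the
object identifiers are hypothetical.

```python
def discriminate(target, distractors, properties, preferred):
    """Select attribute values that set the target apart from all distractors.

    Attributes are tried in the given preference order; one is kept only if
    it rules out at least one remaining distractor. Returns a dict of chosen
    attribute values, or None if the target cannot be distinguished.
    """
    description = {}
    remaining = set(distractors)
    if not remaining:
        return description
    for attr in preferred:
        value = properties[target].get(attr)
        if value is None:
            continue
        ruled_out = {d for d in remaining if properties[d].get(attr) != value}
        if ruled_out:
            description[attr] = value
            remaining -= ruled_out
        if not remaining:
            return description
    return None


# Hypothetical knowledge: world attributes plus attributes of the depiction.
PROPERTIES = {
    "switch-a": {"type": "switch", "depicted-colour": "dark", "picture-position": "left"},
    "switch-b": {"type": "switch", "depicted-colour": "light", "picture-position": "right"},
}
PREFERRED = ["type", "depicted-colour", "picture-position"]

if __name__ == "__main__":
    print(discriminate("switch-a", ["switch-b"], PROPERTIES, PREFERRED))
    # {'depicted-colour': 'dark'}
```

Here the world attribute 'type' does not separate the two switches, so
the procedure falls back on an attribute of the depiction, which
corresponds to a description like "the dark switch" in Fig. 3.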
4 CONCLUSION 
We have presented a model of referring which is based on the following
assumptions: 1) Mental representations of objects may be activated not
only by textual, but also by graphical and mixed descriptions. 2)
Failure and success of referring acts can be explained by the user's
ability to recognize certain links between these mental
representations and the corresponding object descriptions. To
demonstrate that the model is of practical use for the generation of
references, we have defined presentation strategies for concept
activation which serve as operators in the plan-based presentation
system WIP. WIP is able to generate multimedia, anaphoric and
cross-media referring expressions.
ACKNOWLEDGEMENTS: This work is supported by the BMFT under grant
ITW8901 8. We would like to thank Doug Appelt for valuable discussions
and comments.
REFERENCES
Allgayer, J., Harbusch, K., Kobsa, A., Reddig, C., Reithinger, N. and
Schmauks, D. (1989). XTRA: A Natural-Language Access System to Expert
Systems. Intern. Journal of Man-Machine Studies, 31, pp. 161-195.
André, E., and Rist, T. (1993). The Design of Illustrated Documents as
a Planning Task. In M.T. Maybury Ed., Intelligent Multimedia
Interfaces, The MIT Press, Menlo Park, pp. 94-116.
Appelt, D., and Kronfeld, A. (1987). A Computational Model of
Referring. Proc. of IJCAI-87, pp. 640-647.
Claassen, W. (1992). Generating Referring Expressions in a Multimodal
Environment. In R. Dale, E. Hovy, D. Rösner and O. Stock Eds., Aspects
of Automated Natural Language Generation: Proc. of the 6th
International Workshop on Natural Language Generation. Springer,
Berlin, pp. 247-262.
Goodman, N. (1969). Languages of Art. Oxford University Press, Oxford.
Grosz, B., Joshi, A.K., and Weinstein, S. (1983). Providing a Unified
Account of Definite Noun Phrases in Discourse. Proc. of the 21st ACL,
pp. 44-50.
Hirst, G. (1981). Anaphora in Natural Language Understanding.
Springer, Berlin.
Kilger, A. (1994). Using UTAGs for Incremental and Parallel
Generation. Computational Intelligence, to appear.
Mackinlay, J. (1986). Automating the Design of Graphical Presentations
of Relational Information. ACM Transactions on Graphics, 5(2), pp.
110-141.
McKeown, K.R., Feiner, S.K., Robin, J., Seligmann, D.D. and
Tanenblatt, M. (1992). Generating Cross-References for Multimedia
Explanation. Proc. AAAI-92, pp. 9-16.
Reiter, E., and Dale, R. (1992). A Fast Algorithm for the Generation
of Referring Expressions. Proc. of COLING-92, 1, pp. 232-238.
Rist, T., and André, E. (1992). From Presentation Tasks to Pictures:
Towards an Approach to Automatic Graphics Design. Proc. of ECAI-92,
Vienna, Austria, pp. 764-768.
Schneiderlöchner, F. (1994). Generierung von Referenzausdrücken in
einem multimodalen Diskurs. Diploma Thesis, Universität des
Saarlandes, Germany, to appear.
Stock, O., and the ALFRESCO Project Team (1993). ALFRESCO: Enjoying
the Combination of Natural Language Processing and Hypermedia for
Information Exploration. In M.T. Maybury Ed., Intelligent Multimedia
Interfaces, The MIT Press, Menlo Park, pp. 197-224.
Wahlster, W., André, E., Graf, W., and Rist, T. (1991). Designing
Illustrated Texts: How Language Production Is Influenced by Graphics
Generation. Proc. of EACL-91, Berlin, pp. 8-14.
Wahlster, W., André, E., Finkler, W., Profitlich, H.-J., and Rist, T.
(1993). Plan-Based Integration of Natural Language and Graphics
Generation. AI Journal, 63, pp. 387-427.
Wazinski, P. (1992). Generating Spatial Descriptions for Cross-modal
References. Proc. of ANLP-92, Trento, Italy, pp. 56-63.
