Generating a Coherent Text Describing a
Traffic Scene

Hans-Joachim Novak
Fachbereich Informatik, Universität Hamburg
D-2000 Hamburg 13, West Germany
Abstract

If a system that embodies a reference semantics for motion verbs and prepositions is to generate a coherent text describing the recognized motions, it needs a decision procedure to select the events. In NAOS, event selection is done by use of a specialization hierarchy of motion verbs. The strategy of anticipated visualization is used for the selection of optional deep cases. The system exhibits low-level strategies, based on verb-inherent properties, which allow the generation of a coherent descriptive text.
1 Introduction
This contribution focuses on the verbalization component of the NAOS system (the acronym stands for NAtural language description of Object movements in a traffic Scene). NAOS is designed to explore the border area between computer vision and natural language processing, especially the realm of recognizing and verbalizing motion concepts in image sequences.
NAOS goes all the way from a representation of a real-world traffic scene to a natural language text describing the scene. The representation of the scene basically consists of its geometry (therefore called geometric scene description (GSD)). To give an impression of the representation, a GSD contains for each frame of the image sequence:

o instance of time
o visible objects
o viewpoint
o illumination
o 3D shape
o surface characteristics (color)
o class
o identity
o 3D position and orientation in each frame

(for a detailed description of the GSD see [16]).
For event recognition we use event models ([18], [19]) which define a reference semantics for motion verbs. In the current implementation of the NAOS system about 35 motion verbs and the prepositions beside, by, in-front-of, near, and on may be recognized by matching the event models against the representation of the scene.
In this paper we are neither concerned with the representation of the underlying scene data nor with the question of event recognition, as these issues have been published elsewhere (see [10], [17], [20]). Instead, we focus on the generation of a coherent text describing the image sequence.
In the next section we briefly describe the representation of the recognized events which form the initial data for the verbalization component. Then the overall strategy for composing a coherent description is discussed. The following section introduces a partial solution to the selection problem which is based on the strategy of anticipated visualization. Fourth, we show how some linguistic choices like passive, restrictive relative clauses, and negation are natural consequences of the task of generating unambiguous referring expressions. In the last section we relate our research to current work on language generation.

(Footnote: I thank B. Neumann who contributed several ideas to this article.)
2 Initial Data
Verbalization starts when event recognition has been achieved. Besides complex events like overtake and turn off, other predicates like in-front-of, besides, move, etc. are also instantiated. Below is a section of the database after event recognition has taken place (the original entries are in German).
1: (MOVE PERSON1 0 40)
2: (WALK PERSON1 0 40)
3: (RECEDE PERSON1 FB1 20 40)
4: (OVERTAKE BMW1 VW1 (10 12) (12 32))
5: (MOVE BMW1 10 40)
6: (IN-FRONT-OF VW1 TRAFFIC-LIGHT1 27 32)
The above entries are instantiations of event models containing symbolic identifiers for scene objects (e.g. BMW1). The last two elements of an instantiation denote the start and end time of the event.
We use the following notations to denote the event time:

1. (... Tb Te)
2. (... (Tb,early Tb,late) (Te,early Te,late))
3. (... (Tb,early Tb,late) Te)
4. (... Tb (Te,early Te,late))

Tb, Te denote start and end time of an event. The first notation is used for durative events (e.g. move). A durative event is also valid for each subinterval of (Tb Te).

The second notation is used for non-durative events (e.g. overtake). Start and end time of such an event are both restricted by lower and upper bounds. Note that non-durative events are not valid for each subinterval of the event boundaries.

The third notation is used for resultative events (e.g. stop). The start time of a resultative event lies within an interval whereas the end time is a time-point.

Finally, the last notation is used for inchoative events (e.g. start moving, corresponding to the German verb losfahren). Inchoative events have a well-defined start time whereas the end time lies within an interval.
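The four event-time notations can be sketched as follows. This is an illustrative Python transliteration of our own (the original system is not implemented in Python); the concrete times are invented for the example.

```python
# Hypothetical sketch of the four event-time notations described above.
# Times are integer frame indices.

def durative_valid(event_time, interval):
    """A durative event (Tb, Te) also holds on every subinterval."""
    tb, te = event_time
    b, e = interval
    return tb <= b and e <= te

# 1. durative:      (Tb, Te)                          e.g. move
move_time = (10, 40)
# 2. non-durative:  ((Tb-, Tb+), (Te-, Te+))          e.g. overtake
overtake_time = ((10, 12), (12, 32))
# 3. resultative:   ((Tb-, Tb+), Te)                  e.g. stop
stop_time = ((5, 8), 9)
# 4. inchoative:    (Tb, (Te-, Te+))                  e.g. losfahren (start moving)
start_time = (3, (4, 7))

assert durative_valid(move_time, (15, 20))     # subinterval: still valid
assert not durative_valid(move_time, (5, 20))  # extends before Tb: not valid
```

Only durative events support the subinterval inference; for the other three types the validity check would have to respect the interval bounds on Tb and/or Te.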
For the task of generating a coherent description of a traffic scene, NAOS first instantiates all event models and predicates which may be instantiated using the scene data. This leads to the well-known selection problem of natural language generation. For one object there may be many instantiations with different time intervals; hence it is the task of the verbalization component to choose what to say. In the next section we discuss the theoretical background on which our verbalization component is based.
3 Theoretical Background
In general, language is not generated per se but is always intended for a hearer. Furthermore, language is used to fulfil certain goals of the speaker, which may sometimes simply be to inform the hearer about certain facts.
In NAOS the generation of a description of the underlying image sequence aims at diminishing the discrepancy between the system's knowledge of the scene and the hearer's knowledge (the same motivation is used in Davey's program [6]). Concerning the hearer we make the following assumptions:

1. S/he knows the static background of the scene, i.e. the streets, houses, traffic lights, etc.

2. S/he did not utter specific interests except: Describe the scene!

A description may be the result of such diverse speech acts as INFORM, PROMISE, PERSUADE, or CONVINCE. NAOS only generates the speech act INFORM.
To inform a hearer about something means to tell her/him something s/he has not known before, something that is true and new. In NAOS the definition of true utterances builds on the situational semantics of Barwise and Perry [3]. They understand the meaning of an utterance as a relation between the utterance and the described situation. The interpretation of an utterance by a hearer usually consists of a set of possible situations with a meaning relation to the utterance. We now define an utterance to be true if the set of possible situations contains the actually occurred situation.
The requirement to generate true utterances has two consequences for our verbalization component. First, the verbalization process must take the hearer's meaning relations into account. This coincides with the communication rule to tune one's utterances to the hearer's comprehension ability. Second, assuming that the speaker has the same meaning relations as the hearer, the speaker can anticipate the hearer's interpretation of an utterance, i.e. the possible situations implied solely by the utterance can be generated without knowledge of the actual situation. In the case of scene descriptions these situations are equivalent to the hearer's visualization of an unknown scene.
An utterance must be new to the hearer in order to inform him. In the context of situational semantics we define an utterance to be new if its interpretation restricts the set of possible situations implied by previous utterances. Thus new information additionally specifies described situations.
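The definitions of true and new can be stated compactly in set terms. The following sketch is our own illustration, with situations reduced to opaque labels; it is not part of NAOS.

```python
# Illustrative sketch of "true" and "new" utterances, modeling an
# utterance's interpretation as a set of possible situations.

def is_true(possible_situations, actual_situation):
    # True: the set of possible situations contains the actually
    # occurred situation.
    return actual_situation in possible_situations

def is_new(possible_situations, prior_situations):
    # New: the interpretation properly restricts the set of situations
    # implied by the previous utterances.
    return possible_situations & prior_situations < prior_situations

actual = "car_on_east_side"
prior = {"car_on_east_side", "car_on_west_side"}  # after earlier utterances
interp = {"car_on_east_side"}                     # interpretation of a new utterance

assert is_true(interp, actual)
assert is_new(interp, prior)     # narrows two possibilities down to one
assert not is_new(prior, prior)  # repeating known information is not new
```

An utterance that is true but not new (it leaves the prior set unchanged) carries no information and, as described later in section 5, is suppressed by the generator.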
The task of a verbalization component is to choose utterances such that they inform in the above sense. Therefore it is necessary to anticipate the hearer's understanding in order to judge whether a planned utterance carries new information.

The general principle for hearer simulation is depicted in figure 1.
[Figure 1: Hearer simulation. Speaker side: the geometric scene description is mapped by the event models to events, by the deep case semantics to case frames, and finally to an utterance. Hearer side: the utterance is mapped back via case frames and events to a visualized scene description.]
On the side of the speaker, the event recognition process leads by use of event models to instantiated event models (called events in the figure). A first selection process chooses among the instantiations those which are to be verbalized. As event models are associated with verbs, the appropriate case frame of the verb is available. A second selection process now chooses among the optional deep cases of the verb. This is where the deep case semantics comes into play. If, for instance, it is decided that a locative expression should be generated, it is necessary to know how the location of an object may be expressed in natural language, as in the geometric scene description the location of an object is given by its x, y, and z coordinates. The deep case semantics also contains information about the prepositions which may be used for expressing a specific deep case.

Assuming that the hearer has the same meaning relations as the speaker, he basically can use the speaker's processes in reverse order, reconstruct the underlying case frame from the utterance, and thus build a visualized scene description.
Note, however, that we agree with Olson [21] that the verbalization of a visual event always leads to a loss of information. In our case, for instance, we cannot assume that the hearer knows the x, y, and z coordinates of an object when he hears the phrase in front of the department of computer science. Such a phrase generates a set of coordinates defining the region which corresponds to the preposition in-front-of. The actual location of the object which gave rise to the generation of the phrase lies somewhere within that region. Presently, hearer modeling stops at the level of case frames and the visualized scene is anticipated (see section 4.2).
As shown in figure 1, the case frame of a verb plays a central role in our verbalization component. We adopt the view of Fillmore, expressed in his scenes-and-frames semantics [7], that case frames relate scenes to natural language expressions.
4 The Selection Problem
Usually this problem is divided into the subtasks of deciding what to say and how to say it. As mentioned above, NAOS uses two selection processes. First, it selects among the instantiated events and second, it selects among the optional deep cases of the verb associated with the chosen event. The first selection process corresponds to deciding what to say and the second one largely determines how to say it, as will be shown later.

The selection processes are based on the representation of the case semantics of an event model and on a specialization hierarchy of the verbs. Below is the representation of the case semantics for the event model überholen (overtake).
Agent-Restr.  : VEHICLE
Deep-cases    : (VERB UEBERHOL)
                (UEBERHOLEN *OBJ1 *OBJ2 *T1 *T2)
Obligatory    : (AGENT AGT-EXP)
                (REF AGT-EXP *OBJ1)
                (TENSE TNS-EXP)
                (TIME-REF TNS-EXP *T1 *T2)
                (OBJECTIVE OBJ-EXP)
                (REF OBJ-EXP *OBJ2)
Optional      : (LOCATIVE LOC-EXP)
                (LOC-REF LOC-EXP *OBJ1 *T1 *T2)
Combinations  : NIL
Loc-preps     : (AN AUF BEI HINTER IN NEBEN
                 UEBER UNTER VOR ZWISCHEN)
The first slot specifies the agent restriction. The Deep-cases slot contains first the verb stem of überholen, as needed by the generation component, and second the formal notation for an instantiation. The obligatory cases must be generated but may be omitted in the surface string in case of elliptic utterances, whereas optional deep cases need not be generated at all. The Combinations slot represents which deep cases may be generated together (e.g. for the verb fahren (drive) it is not allowed to generate a single SOURCE; instead SOURCE and GOAL must be generated). The Loc-preps slot specifies the prepositions which may be used with the verb überholen to generate locative expressions.
The case descriptions in the Obligatory and Optional slots consist of two parts: a declaration of an identifier for the case expression on the language side, and a predicate (in general a list of predicates) relating the case expression to the scene data. The most important predicates are REF, TIME-REF, and LOC-REF.
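The case-semantics frame shown above can be transliterated into a dictionary. The slot names and entries follow the text; the concrete data structure and the derived check are our own sketch, not the system's representation.

```python
# Sketch of the case-frame representation for ueberholen (overtake),
# transliterated from the slots shown in the text.

UEBERHOLEN_FRAME = {
    "agent_restr": "VEHICLE",
    "deep_cases": [("VERB", "UEBERHOL"),
                   ("UEBERHOLEN", "*OBJ1", "*OBJ2", "*T1", "*T2")],
    "obligatory": [("AGENT", "AGT-EXP"), ("REF", "AGT-EXP", "*OBJ1"),
                   ("TENSE", "TNS-EXP"), ("TIME-REF", "TNS-EXP", "*T1", "*T2"),
                   ("OBJECTIVE", "OBJ-EXP"), ("REF", "OBJ-EXP", "*OBJ2")],
    "optional": [("LOCATIVE", "LOC-EXP"),
                 ("LOC-REF", "LOC-EXP", "*OBJ1", "*T1", "*T2")],
    "combinations": None,
    "loc_preps": ["AN", "AUF", "BEI", "HINTER", "IN", "NEBEN",
                  "UEBER", "UNTER", "VOR", "ZWISCHEN"],
}

# Deep-case declarations vs. the predicates (REF, TIME-REF, ...) that
# relate the case expressions to the scene data:
obligatory_cases = {c[0] for c in UEBERHOLEN_FRAME["obligatory"]
                    if c[0] not in ("REF", "TIME-REF")}
assert obligatory_cases == {"AGENT", "TENSE", "OBJECTIVE"}
```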
REF generates referring phrases for internal object descriptors like BMW1. TIME-REF generates the tense of the verb. As descriptions are usually given in present tense, presently TIME-REF only generates this tense. LOC-REF relates the abstract location of the object, as given by its coordinates, to a natural language expression for a reference object. Note that REF has to be used to generate a referring phrase for the reference object. Consider the sixth entry of the database in section 2. The instantiation only contains internal identifiers for objects, like traffic-light1, for which referring phrases have to be generated (see section 5.1 for further details on REF).
In NAOS we use a specialization hierarchy for motion verbs. This hierarchy is pragmatically motivated and is rooted in situational semantics. It is not a hierarchy of motion concepts like the one proposed in [23]. It connects general verbs with more special ones. A situation which may be described using a special verb implies the applicability of all more general verbs. Take for instance the verb überholen (overtake). It implies the use of the more general verbs vorueberfahren, vorbeifahren (drive past), passieren (pass), naehern-r (approach), entfernen-r (recede), fahren (drive, move), and bewegen-r (move).
It should be intuitively plausible that such a hierarchy is also used for event recognition. If, for instance, no naehern-r (approach) can be instantiated, the more special events need not be tested.
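Such a hierarchy can be sketched as a mapping from each verb to its direct generalizations. The exact edges below are our assumption (the text only lists the full set of verbs implied by überholen, not the individual links), and vorueberfahren is treated as a synonym of vorbeifahren and omitted.

```python
# Hypothetical reconstruction of part of the specialization hierarchy:
# verb -> set of direct generalizations.

GENERALIZATIONS = {
    "ueberholen":   {"vorbeifahren"},             # overtake -> drive past
    "vorbeifahren": {"passieren"},                # drive past -> pass
    "passieren":    {"naehern-r", "entfernen-r"}, # pass -> approach, recede
    "naehern-r":    {"fahren"},                   # approach -> drive
    "entfernen-r":  {"fahren"},                   # recede -> drive
    "fahren":       {"bewegen-r"},                # drive -> move
}

def implied_verbs(verb):
    """All more general verbs implied by using `verb`."""
    seen, stack = set(), [verb]
    while stack:
        for parent in GENERALIZATIONS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

assert implied_verbs("ueberholen") == {
    "vorbeifahren", "passieren", "naehern-r", "entfernen-r",
    "fahren", "bewegen-r"}
```

The same structure supports the recognition-time pruning mentioned above: if naehern-r fails, no verb having it as an ancestor needs to be tested.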
4.1 Event Selection

In NAOS the overall strategy for generating a descriptive text is as follows:

• Group all moving objects according to their class membership;

• For each object in each group, describe the motions of the object for the time interval during which it was visible in the scene.
Event selection for an object is done according to the following 
algorithm: 
1. Collect all events in the interval where the object was visible 
and where the object was the agent; 
2. determine for each timepoint during the object's visibility the 
most special event of the above collected ones; 
3. if two events have the same specificity, then either take the one which started earlier and has the same or longer duration as the other one, or take the one with the longer duration;
4. put the selected events on the verbalization list of the object 
in temporally consecutive order. 
Consider the following example. All events which were found for PERSON1 are:

(RECEDE PERSON1 FB1 20 40)   (ENTFERNEN-R PERSON1 FB1 20 40)
(WALK PERSON1 0 40)          (GEHEN PERSON1 0 40)
(MOVE PERSON1 0 40)          (BEWEGEN-R PERSON1 0 40)

The above algorithm leads, by use of the specialization hierarchy, to the following verbalization list for PERSON1:

(((WALK PERSON1 0 40) (0 20))
 ((RECEDE PERSON1 FB1 20 40) (20 40)))

(The last entry in parentheses of each selected event denotes the interval in which the event was the most special one.)
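Steps 1, 2, and 4 of the selection algorithm can be sketched as follows; this is our own reading, with an assumed specificity ranking standing in for the verb hierarchy, the German/English duplicates collapsed beforehand, and the tie-breaking of step 3 omitted.

```python
# Sketch of event selection for one object: per timepoint, pick the
# most special covering event, then merge runs into a verbalization list.

SPECIFICITY = {"MOVE": 0, "WALK": 1, "RECEDE": 2}  # assumed ranking

def select_events(events, visible):
    """events: list of (verb, tb, te); visible: (tb, te) of visibility."""
    selected = []
    for t in range(visible[0], visible[1]):
        covering = [e for e in events if e[1] <= t < e[2]]
        best = max(covering, key=lambda e: SPECIFICITY[e[0]])
        if selected and selected[-1][0] is best:
            selected[-1][1][1] = t + 1          # extend current interval
        else:
            selected.append([best, [t, t + 1]]) # start a new interval
    return [(e, tuple(iv)) for e, iv in selected]

events = [("RECEDE", 20, 40), ("WALK", 0, 40), ("MOVE", 0, 40)]
result = select_events(events, (0, 40))
assert result == [(("WALK", 0, 40), (0, 20)),
                  (("RECEDE", 20, 40), (20, 40))]
```

The assertion reproduces the PERSON1 verbalization list above: WALK is most special until RECEDE becomes instantiable at timepoint 20.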
4.2 Selection of Optional Deep Cases

This selection process is our first implementation of the strategy of anticipated visualization. The underlying question is: which optional deep cases should be selected to restrict the hearer's possibilities of placing the trajectory of an object in his internal model of the static background of the scene?

In NAOS the selection algorithm answering the above question is rather straightforward. It is based on the manner of action of the verb, the verbtype, and the hearer's knowledge. The algorithm is graphically represented in figure 2.
[Figure 2: Selection of Deep Cases. The table maps the event type (non-durative, inchoative, or durative, where durative events are further distinguished by whether the start and end times Tb, Te coincide with scene begin SB and scene end SE) and the verbtype (DIR, LOC, STAT, REO) to the optional deep cases to consider: LOCATIVE?, DIRECTION?, SOURCE?, GOAL?, or none (NIL). For example, an inchoative event of verbtype LOC selects DIRECTION? and LOCATIVE?, a durative event with Tb = SB and Te = SE of verbtype LOC selects DIRECTION? and LOCATIVE?, and a durative event with Tb ≠ SB and Te ≠ SE of verbtype LOC selects SOURCE? and GOAL?.]
The abbreviations denote: Tb, Te: start and end time of the event; SB, SE: scene begin and scene end; DIR, LOC, STAT, REO: directional (turn off, return), locomotion (walk, overtake), and static (stand, wait) verbs, and finally verbs whose recognition implies reference objects (reach s.th., arrive at).
The figure has to be read as follows. If an inchoative event like losfahren (start moving) which has the verbtype locomotion has to be verbalized, then choose direction? and locative? as deep cases. The question mark generally means: look into the partner model to see whether this deep case has already been generated for another event. If so, determine by use of the object's actual location (represented in the scene representation) whether it is still valid. If this is the case, don't generate a natural language expression for this deep case; otherwise do.
Presently the partner model contains information about the static background of the scene and about what has been said so far, in the same relational notation as was shown for instantiations in section 2. It is updated when an event is verbalized.
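The question-mark check against the partner model can be sketched as follows. The function and the relational entries are our own illustration of the rule just described, not the system's code.

```python
# Sketch of the "?" check: a deep case is only verbalized if the partner
# model does not already contain a still-valid entry for it.

def needs_verbalization(deep_case, obj, partner_model, still_valid):
    """partner_model: set of (case, object, value) entries;
    still_valid(value): check against the object's actual location."""
    for case, o, value in partner_model:
        if case == deep_case and o == obj and still_valid(value):
            return False  # hearer already knows it, and it still holds
    return True

partner_model = {("LOCATIVE", "BMW1", "SCHLUETERSTRASSE")}
on_schlueter = lambda street: street == "SCHLUETERSTRASSE"

# LOCATIVE was already said and is still valid -> suppress it;
# DIRECTION has not been said -> generate it.
assert not needs_verbalization("LOCATIVE", "BMW1", partner_model, on_schlueter)
assert needs_verbalization("DIRECTION", "BMW1", partner_model, on_schlueter)
```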
Note that for durative events the decision is based on whether the start and end time of the event coincide with the beginning or ending of the image sequence. Consider the first case for durative events as given in figure 2. Right from the beginning of the sequence there is a car moving along a street until the sequence ends. In such a case it is not possible to verbalize a source, as the object may have started its motion anywhere. To restrict the hearer's visualization, direction and locative cases are verbalized, leading to a sentence like: The car moves on Schlueterstreet in direction of Hallerplace.
Verbalizing a direction when the static background is known restricts the trajectory to being on one side of the road. Basically, our direction case is a goal or source case where only two prepositional phrases are allowed, the German phrases in Richtung and aus Richtung (in direction, from direction). These phrases do not imply that the motion ends at the goal location, as do most prepositional phrases in German which have to be in accusative surface case to denote a goal. The English language is in this respect inherently ambiguous. In the sentence The car moves behind the truck, the phrase behind the truck may denote a locative or goal deep case. In German these cases are distinguished at the surface. For locative the above sentence translates to Das Auto fährt hinter dem LKW; for the goal case, it translates to Das Auto fährt hinter den LKW.
We have to distinguish different verbtypes as e.g. the meaning of a directional phrase changes with the verbtype. Consider the sentences The car moves in direction of Hallerplace versus The car stands in direction of Hallerplace (in German both sentences are well formed). The first sentence denotes the direction of the motion whereas the second one denotes the orientation of the car. We thus distinguish between static (STAT) and locomotion (LOC) verbs. The third verbtype, directional (DIR), is used for verbs with a strong directional component like umkehren (return), abbiegen (turn off), etc. As they already imply a certain direction, the additional verbalization of a direction using a prepositional phrase does usually not lead to acceptable sentences. The fourth type (REO) is used for verbs like erreichen (reach s.th.) having an obligatory locative case.
The main result to note here is that the selection processes are low-level and verb-oriented. The only higher level goal is to inform the hearer and to convey as much information about an event as possible. In the next section we show, by different verbalizations of the same scene, how rather complex syntactic structures arise.
5 Generation

The general scheme for the generation process is as follows:

1. Sort the objects according to their class membership, vehicles first, then persons;

2. in the above partial order, sort the objects according to their time of occurrence in the scene, earliest first;

3. do for all elements in each verbalization list of each object:

(a) if the current event has a precedent and its event time is included in the precedent's, begin the sentence with dabei (approx. during this time); go to (c);

(b) if the current event has a precedent and its event time overlaps the precedent's, begin the sentence with unterdessen (approx. in the meantime); go to (c);

(c) determine the optional deep cases and build a simple declarative sentence by using all chosen deep cases and applying the deep case semantics.
Two temporally consecutive events are not verbalized using a temporal adverb, as is done in the cases of inclusion and overlapping. This is due to the fact that from the linear order of the sentences the hearer usually infers consecutivity.
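Steps (a) and (b) and the rule for consecutive events can be sketched as a single decision on interval relations; function and variable names are our own.

```python
# Sketch of the temporal-adverb choice from the relation between the
# current event time and its predecessor's.

def temporal_adverb(prev, cur):
    """prev, cur: (tb, te) event times of consecutive verbalization-list
    entries; prev is None for the first entry."""
    if prev is None:
        return None
    if prev[0] <= cur[0] and cur[1] <= prev[1]:
        return "dabei"        # inclusion in the precedent's event time
    if cur[0] < prev[1] < cur[1]:
        return "unterdessen"  # overlap with the precedent's event time
    return None               # consecutive events: no adverb, the hearer
                              # infers consecutivity from sentence order

assert temporal_adverb((10, 40), (12, 32)) == "dabei"
assert temporal_adverb((10, 25), (20, 32)) == "unterdessen"
assert temporal_adverb((0, 20), (20, 40)) is None
```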
The result of the above algorithm is a formal representation of the surface sentence which, roughly, contains the verb's stem, genus verbi, modality, and person, all deep cases in random order, and all stems of the lexical entries which appear in the surface sentence. This representation is taken as input by the system SUTRA (for further details on the formal representation and the SUTRA system see [4]), which then generates a correctly inflected German sentence.
Below is an example of the output of NAOS. 
18. ,ausgabe text

DIE SZENE ENTHAELT VIER BEWEGTE OBJEKTE: DREI PKWS UND EINEN FUSSGAENGER.
The scene consists of four moving objects: three vehicles and a pedestrian.

EIN GRUENER VW NAEHERT SICH DEM GROSSEN FUSSGAENGER AUS RICHTUNG HALLERPLATZ. ER FAEHRT AUF DER SCHLUETERSTRASSE.
A green VW approaches the tall pedestrian from the direction of Hallerplace. It drives on Schlueterstreet.

EIN GELBER VW FAEHRT VON DER ALTEN POST VOR DIE AMPEL. WAEHRENDDESSEN ENTFERNT ER SICH VON DEM GRUENEN VW.
A yellow VW drives from the old post office to the traffic light. In the meantime it recedes from the green VW.

EIN SCHWARZER BMW FAEHRT IN RICHTUNG HALLERPLATZ. DABEI UEBERHOLT ER DEN GELBEN VW VOR DEM FACHBEREICH INFORMATIK. DER SCHWARZE BMW ENTFERNT SICH VON DEM GRUENEN VW.
A black BMW drives in the direction of Hallerplace. During this time it overtakes the yellow VW in front of the department of computer science. The black BMW recedes from the green VW.

DER GROSSE FUSSGAENGER GEHT IN RICHTUNG DAMMTOR AUF DEM SUEDLICHEN FUSSWEG WESTLICH DER SCHLUETERSTRASSE. WAEHRENDDESSEN ENTFERNT ER SICH VON DEM FACHBEREICH INFORMATIK.
The tall pedestrian walks in the direction of Dammtor on the southern sidewalk west of Schlueterstreet. In the meantime he recedes from the department of computer science.

19. ,logout
The first sentence above is a standard one, having the same structure for all different scenes. The remaining four paragraphs are motion descriptions for the four moving objects.
We now discuss step (c) of the above algorithm in more detail as it covers some interesting phenomena.

Consider the third paragraph describing the motions of the yellow VW. The verbalization list for this object is:

(((DRIVE VW1 10 25) (10 25))
 ((RECEDE VW1 VW2 25 32) (25 32)))
The beginning (SB) and ending (SE) of the sequence lie at timepoints 0 and 40, respectively. According to the selection algorithm (figure 2), a SOURCE should be verbalized for a durative event with the above event time if the verbtype is LOC. The generation algorithm checks whether the chosen optional cases are allowed for the verb; if so, it is further checked whether the combinations are allowed. As a SOURCE may not be generated alone for a fahren (drive, move) event, SOURCE and GOAL are generated.
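The combinations check can be sketched as completing a chosen case set to an allowed combination. The table entries below are assumptions for illustration (the text only states that a lone SOURCE is forbidden for fahren).

```python
# Sketch of the combinations check: extend a chosen set of optional
# deep cases to the smallest allowed combination containing it.

COMBINATIONS = {
    # verb -> allowed optional-case combinations (assumed entries)
    "FAHREN": [{"SOURCE", "GOAL"}, {"GOAL"}, {"DIRECTION"},
               {"LOCATIVE"}, {"DIRECTION", "LOCATIVE"}],
}

def complete_cases(verb, chosen):
    allowed = COMBINATIONS.get(verb)
    if allowed is None:
        return chosen  # no restrictions recorded for this verb
    candidates = [c for c in allowed if chosen <= c]
    return min(candidates, key=len) if candidates else chosen

# A lone SOURCE for fahren is completed to SOURCE and GOAL:
assert complete_cases("FAHREN", {"SOURCE"}) == {"SOURCE", "GOAL"}
assert complete_cases("FAHREN", {"GOAL"}) == {"GOAL"}
```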
The fourth paragraph shows the outcome of a deep case selection in which the chosen case is not allowed for the verb. The verbalization list for the black BMW contains only überholen (overtake) and entfernen-r (recede).

(((OVERTAKE BMW1 VW1 (10 12) (12 32)) (10 32))
 ((RECEDE BMW1 VW2 20 40) (32 40)))
According to event- and verbtype, DIRECTION is chosen as the appropriate deep case. As this case may not be used with the verb overtake, two sentences are generated, one describing the direction of the motion and the other one describing the specific event. The second sentence begins with a temporal adverb specifying that both motions occur at the same time. In order to generate the two sentences, first the class membership of the agent of the verb which may not take the chosen deep case is determined. Then the specialization hierarchy is used to go up to either fahren (drive, move) or gehen (walk), as those verbs may take any deep case. Then the sentences are generated.
Consider the following verbalization list:

(((OVERTAKE BMW1 VW1 (0 8) (12 18)) (0 18))
 ((DRIVE BMW1 0 40) (18 40)))

Assuming the direction and location of the motion to be the same as before, the algorithm presented so far would generate: A black BMW drives in the direction of Hallerplace. During this time it overtakes the yellow VW in front of the department of computer science. The black BMW drives.

According to the deep case selection algorithm, a DIRECTION and LOCATIVE should be generated for the second event above. As both cases have already been generated with the first event and are still valid, the sentence The black BMW drives is not generated, because before generating a sentence it is checked whether the information is already known to the partner.
5.1 Referring Phrases

In this section some aspects of the referring phrase generator are discussed. As can be seen from the example text, objects are characterized by their properties, introduced with indefinite noun phrases when they are not single representatives of a class, and they may also be pronominalized to add to the coherence of the text. Therefore we use standard techniques as e.g. described in [8], [9].

We want to stress one aspect of our referring phrase generator, namely its capability to generate restrictive relative clauses with motion verbs. As it may easily be the case that a scene contains two objects with similar properties, the task arises to distinguish them and generate unequivocal referring expressions.
It is an interesting fact that we have several options to cope with this problem, each of which has its own consequences.
One option is to adopt McDonald's scheme of generation without precisely knowing what to say next [13]. According to this scheme, two similar objects are characterized in the following way in NAOS. When the first one is introduced it is characterized by its properties, e.g. a yellow VW. When the second one has to be introduced, REF notices that a yellow VW is already known to the partner and generates the phrase another yellow VW. It starts getting interesting in subsequent reference. The objects are then characterized by the events in which they were involved earlier, whether as agent or in another role. This leads to referring phrases like the yellow VW, which receded from the pedestrian or the yellow VW, which has been overtaken. Note how passive relative clauses arise naturally from the task of generating referring phrases in this paradigm. The same is also true for negation. Consider the case where the first yellow VW, say VW1, has passed an object and the second yellow VW, say VW2, has overtaken an object, and both events are already known to the partner. If REF has to generate again a referring phrase for VW1, it notices that pass is a more general verb than overtake and may thus also be applied to the overtake event. It therefore generates the phrase the yellow VW, which has not overtaken the other object to distinguish it unequivocally from VW2.
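The first- and second-mention behaviour of REF described above can be sketched as follows. Function names, data structures, and the English surface strings are our own simplification; the negation step for distinguishing objects via more general verbs is omitted.

```python
# Sketch of REF's strategy for similar objects: first mention by
# properties, second mention with "another", later mentions by a
# restrictive relative clause over events known to the partner.

def refer(obj, properties, known_objects, known_events):
    """known_objects: obj -> properties already introduced;
    known_events: list of (relative-clause string, agent)."""
    same = [o for o in known_objects if known_objects[o] == properties]
    if obj not in known_objects:
        known_objects[obj] = properties
        det = "another " if same else "a "
        return det + properties
    # subsequent reference: distinguish by an event the object took part in
    for clause, agent in known_events:
        if agent == obj:
            return f"the {properties}, which {clause}"
    return "the " + properties

known_objects = {}
known_events = [("receded from the pedestrian", "VW1")]
assert refer("VW1", "yellow VW", known_objects, known_events) == "a yellow VW"
assert refer("VW2", "yellow VW", known_objects, known_events) == "another yellow VW"
assert refer("VW1", "yellow VW", known_objects, known_events) \
    == "the yellow VW, which receded from the pedestrian"
```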
Below is an example of this strategy in a text for the same scene as above. The difference to the first scene is that we replaced the green VW by a yellow one.
10. ,ausgabe text

DIE SZENE ENTHAELT VIER BEWEGTE OBJEKTE: DREI PKWS UND EINEN FUSSGAENGER.
The scene consists of four moving objects: three vehicles and a pedestrian.

EIN GELBER VW NAEHERT SICH DEM GROSSEN FUSSGAENGER AUS RICHTUNG HALLERPLATZ. ER FAEHRT AUF DER SCHLUETERSTRASSE.
A yellow VW approaches the tall pedestrian from the direction of Hallerplace. It drives on Schlueterstreet.

EIN ANDERER GELBER VW FAEHRT VON DER ALTEN POST VOR DIE AMPEL. WAEHRENDDESSEN ENTFERNT ER SICH VON DEM GELBEN VW, DER SICH DEM GROSSEN FUSSGAENGER GENAEHERT HAT.
Another yellow VW drives from the old post office to the traffic light. In the meantime it recedes from the yellow VW which approached the tall pedestrian.

EIN SCHWARZER BMW FAEHRT IN RICHTUNG HALLERPLATZ. DABEI UEBERHOLT ER DEN ANDEREN GELBEN VW, DER SICH VON DEM GELBEN VW ENTFERNT HAT, VOR DEM FACHBEREICH INFORMATIK. DER SCHWARZE BMW ENTFERNT SICH VON DEM GELBEN VW, DER NICHT UEBERHOLT WORDEN IST.
A black BMW drives in direction of Hallerplace. During this time it overtakes the other yellow VW, which receded from the yellow VW, in front of the department of computer science. The black BMW recedes from the yellow VW which was not overtaken.

DER GROSSE FUSSGAENGER GEHT IN RICHTUNG DAMMTOR AUF DEM SUEDLICHEN FUSSWEG WESTLICH DER SCHLUETERSTRASSE. WAEHRENDDESSEN ENTFERNT ER SICH VON DEM FACHBEREICH INFORMATIK.
The tall pedestrian walks in direction of Dammtor on the southern sidewalk west of Schlueterstreet. In the meantime he recedes from the department of computer science.

11. ,logout
The consequences of this first option are rather complex syntactic structures which are not motivated by higher level stylistic choices.
Let us now look at a second option which has also been implemented. Experience with the above algorithm for different scenes showed that if more than two similar objects are in a scene, the restrictive relative clauses become hardly understandable. We thus determine how many similar objects there are in the scene before we start the generation process. If there are more than two, REF generates names for them and introduces them as e.g. the first yellow VW, the second yellow VW, and so on, and uses these phrases in subsequent references. An example of this strategy would look like the first example text where the different vehicles are named the first ..., the second .... The rest of the text would remain the same.
Taking this option implies leaving McDonald's scheme and approaching a planning paradigm.
It should be noted here that there is a third option which has hardly been investigated, namely to switch from contextual to co-textual reference, as in phrases like the VW I mentioned last. We need further research before we can use such techniques effectively.
6 Conclusion and Related Research

We have proposed the scheme of anticipated visualization to generate coherent texts describing real-world events (visual data). The selection algorithms are based on low-level, verb-inherent properties and on a pragmatically motivated verb hierarchy. Together with the verbalization component, the NAOS system is now fully operational from event recognition to text generation in the domain of traffic scenes. As this domain is rich enough to still pose a lot of problems, this opens up the opportunity to integrate higher level strategies for e.g. combining sentences, selecting events, generating deictic expressions, etc.
The main difference between NAOS and other systems for lan-
guage generation is that we approach the verbalization problem
from the visual side, and thus are led to use basic selection algo-
rithms. Other systems like TALESPIN [15], KDS [12], TEXT [14],
KAMP [1], and HAM-ANS [10] start their processing with language
whereas NAOS starts with images. In close connection to our re-
search is the work of [2], [24], [23], [22], and [5]. The first four
authors deal with questions of motion recognition and with a re-
ference semantic for motion verbs but are not concerned with text
generation. They showed that case frames can be used to generate
single utterances. Conklin and McDonald use the notion of salience
to deal with the selection problem in the task of describing a single
image of a natural outdoor scene.
TALESPIN exemplifies that plans and goals of an actor may
form the underlying structure of narratives and may thus be mo-
tivation for text generation. In KDS a representation of what to
do in case of fire alarm is transformed into a natural language
text. As the initial representation already contains lexical entries
and primitive propositions the task is to organize this information
anew so that it may be expressed in an English text. Mann and
Moore propose rules for combining propositions and re-edit the text
continuously to produce the final version. TEXT generates para-
graphs as answers to questions about database structure. McKeown
has identified discourse strategies for fulfilling three communicative
goals: define, compare, and describe. These strategies guide the ge-
neration process in deciding what to say next. McKeown uses the
question to determine the communicative goal that the text should
fulfil. Research of this kind is very important to clarify the relation
between the form of a text and its underlying goals.
One of the domains of HAM-ANS is the kind of traffic scene
which is also used in NAOS. In this domain HAM-ANS deals
primarily with answering questions about the motions of objects
and with overanswering yes/no questions [25]. The dialogue com-
ponent of HAM-ANS may be connected to NAOS to also allow
questions of the user if the generated text was not sufficient for his
understanding. An evaluation of the kind of question being asked
by a user may help in devising better generation strategies.
KAMP is a system for planning natural language utterances
in the domain of task-oriented dialogues. The planning algorithm
takes the knowledge and beliefs of the hearer into account. This sy-
stem shows how a priori beliefs of the hearer may also be integrated
in NAOS to generate appropriate referring phrases.
It would be interesting to use a phrasing component for NAOS
which would first determine all deep cases necessary to maximally
restrict the visualized trajectory of an object's motion sequence and
then try to distribute the cases to the different verbs used in the
description in order to generate smooth text.

References

[1] Appelt, D.E., Planning Natural-Language Utterances to Satisfy
Multiple Goals. SRI International, Technical Note 259, Menlo Park,
CA, 1982

[2] Badler, N.I., Temporal Scene Analysis: Conceptual Descriptions
of Object Movements. Report TR-80, Dept. of CS, University of
Toronto, 1975

[3] Barwise, J., Perry, J., Situations and Attitudes. Bradford Books,
MIT Press, 1983

[4] Busemann, S., Surface Transformations during the Generation
of Written German Sentences. In: Bolc, L. (ed.), Natural Language
Generation Systems. Springer, Berlin, 1984

[5] Conklin, E.J., McDonald, D.D., Salience: The Key to the Se-
lection Problem in Natural Language Generation. COLING-82,
129-135

[6] Davey, A., Discourse Production. A Computer Model of Some
Aspects of a Speaker. Edinburgh University Press, 1978

[7] Fillmore, C.J., Scenes-and-frames Semantics. In: Zampolli,
A. (ed.), Linguistic Structures Processing. North-Holland, Amsterdam,
1977, 55-81

[8] Goldman, N.M., Conceptual Generation. In: Schank, R.C. (ed.),
Conceptual Information Processing. North-Holland, 1975, 289-371

[9] von Hahn, W., Hoeppner, W., Jameson, A., Wahlster, W., The
Anatomy of the Natural Language Dialogue System HAM-
RPM. In: Bolc, L. (ed.), Natural Language Based Computer Systems.
Hanser/Macmillan, München, 1980, 119-253

[10] Hoeppner, W., Christaller, T., Marburger, H., Morik, K., Nebel,
B., O'Leary, M., Wahlster, W., Beyond Domain-Independence: Ex-
perience with the Development of a German Language Access
System to Highly Diverse Background Systems. IJCAI-83, 588-594

[11] Jameson, A., Wahlster, W., User Modelling in Anaphora Gener-
ation: Ellipsis and Definite Description. ECAI-82, 222-227

[12] Mann, W.C., Moore, J., Computer Generation of Multipara-
graph Text. AJCL 7(1), 1981, 17-29

[13] McDonald, D.D., Natural Language Generation as a Com-
putational Problem: an Introduction. In: Brady, M., Berwick,
R.C. (eds.), Computational Models of Discourse. MIT Press, Cambridge,
Mass., 1983, 209-265

[14] McKeown, K.R., Discourse Strategies for Generating Natural-
Language Text. Artificial Intelligence 27, 1985, 1-41

[15] Meehan, J., TALE-SPIN. In: Schank, R.C., Riesbeck, C.K. (eds.),
Inside Computer Understanding: Five Programs plus Miniatures. LEA,
Hillsdale, New Jersey, 1981, 197-258

[16] Neumann, B., Natural Language Description of Time-Varying
Scenes. In: Waltz, D. (ed.), Advances in Natural Language Processes.
Volume 1 (in press); also as FBI-HH-B-105/84, Fachbereich Informatik,
Universität Hamburg, 1984

[17] Neumann, B., On Natural Language Access to Image Se-
quences: Event Recognition and Verbalization. Proc. First Confer-
ence on Artificial Intelligence Applications (CAIA-84), Denver, Colorado,
1984

[18] Neumann, B., Novak, H.-J., Natural Language Oriented Event
Models for Image Sequence Interpretation: The Issues. CSRG
Techn. Note, University of Toronto, 1983

[19] Neumann, B., Novak, H.-J., Event Models for Recognition and
Natural Language Description of Events in Real-World Image
Sequences. IJCAI-83, 724-726

[20] Novak, H.-J., A Relational Matching Strategy for Temporal
Event Recognition. In: Laubsch, J. (ed.), GWAI-84. Informatik Fach-
berichte 103, Springer, 1985, 109-118

[21] Olson, D.R., Language Use for Communicating, Instructing
and Thinking. In: Freedle, R.O., Carroll, J.B. (eds.), Language Com-
prehension and the Acquisition of Knowledge. Washington, 1972

[22] Okada, N., Conceptual Taxonomy of Japanese Verbs for Un-
derstanding Natural Language and Picture Patterns. COLING-
80, 127-135

[23] Tsotsos, J.K., A Framework for Visual Motion Understanding.
CSRG TR-114, University of Toronto, 1980

[24] Tsuji, S., Kuroda, S., Morizono, A., Understanding a Simple
Cartoon Film by a Computer Vision System. IJCAI-77, 609-610

[25] Wahlster, W., Marburger, H., Jameson, A., Busemann, S., Overan-
swering Yes-No-Questions: Extended Responses in a NL Inter-
face to a Vision System. IJCAI-83, 643-646
