<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1086"> <Title>REFERRING TO WORLD OBJECTS WITH TEXT AND PICTURES</Title> <Section position="3" start_page="530" end_page="532" type="metho"> <SectionTitle> 2 A MODEL FOR REFERRING WITH TEXT AND PICTURES </SectionTitle> <Paragraph position="0"> When referring to domain objects, a presentation system has to find intelligible object descriptions which will activate appropriate representations. We assume that representations can be activated in the sense of picking them out of a set of representations which are already available or which have to be built up (e.g., by localizing an object in a user's visual field). Representations can be activated by textual descriptions, by graphical descriptions or by mixed descriptions. Whereas the order in which representations are activated by a text is influenced by the discourse structure, it is less than clear in which order a picture activates representations. If several objects are depicted, the corresponding representations may be activated simultaneously.</Paragraph> <Section position="1" start_page="530" end_page="530" type="sub_section"> <SectionTitle> 2.1 Representations of World Objects </SectionTitle> <Paragraph position="0"> To ensure the transferability of our approach, we don't presuppose a certain knowledge representation language.</Paragraph> <Paragraph position="1"> However, an essential part of the model concerns the distinction between the system's beliefs about the world and the system's beliefs about the user's beliefs. We represent these beliefs in different models. 
For example, the system may classify a certain object as an espresso machine while it assumes the user regards the object as a coffee machine.</Paragraph> <Paragraph position="2"> Furthermore, we have to consider that the user's and the system's beliefs about the identity of objects may differ.</Paragraph> <Paragraph position="3"> The system may believe that the user has different representations for one and the same object without knowing how they are related to each other. Conversely, it may happen that the user is assumed to have only one representation for objects which the system considers as distinct entities.</Paragraph> <Paragraph position="4"> As a consequence, our models can contain different representations for one and the same world object. We use the</Paragraph> <Paragraph position="6"> world object.</Paragraph> <Paragraph position="7"> Fig. 1 gives an example of how to use the concepts introduced above. Let's start from the following situation taken from an espresso machine domain: The system knows that there are two switches (the temperature control and the on/off switch) and also knows where they are located. Let r1_s and r3_s correspond to the system's internal representations of the switches. The user is assumed to look at the espresso machine and to see two switches. Let r1_u and r3_u correspond to internal representations of the switches which the user builds up when looking at the machine. We assume that the user also knows of the existence of the on/off switch and the temperature control, but is not able to localize them. Let r2_u and r4_u be the user's representations for the temperature control and the on/off switch. The fact that he only knows that one of the switches he sees must be the temperature control and the other the on/off switch can be expressed by means of a disjunction. 
Either a coreference relation holds between r1_u and r2_u and between r3_u and r4_u or, conversely, between r1_u and r4_u and between r3_u and r2_u. The connection between the system's representations r1_s and r3_s and the representations the user is assumed to have is also expressed by coreference relations.</Paragraph> </Section> <Section position="2" start_page="530" end_page="531" type="sub_section"> <SectionTitle> 2.2 Representation of Descriptions </SectionTitle> <Paragraph position="0"> As mentioned in section 1, descriptions can be composed of text, graphics and further presentation media. To cope with such descriptions, we associate with each syntactical unit (depictions, noun phrases, etc.) the set of object representations which will be activated by that particular part. The referent of the whole description is then considered as a member of the intersection of all sets resulting from partial descriptions.</Paragraph> <Paragraph position="1"> An important prerequisite of our approach is that the system explicitly represents how it has encoded information in a presentation. Inspired by (Mackinlay, 1986), we use a relation tuple of the form: (Encodes means information context-space) to specify the semantic relationship between a textual or graphical means and the information the means is to convey in a certain context space. In our approach, the third argument refers to the context space to which the encoding relation corresponds and not to a graphical language as in Mackinlay's approach. This enables us to use one and the same presentation means differently in different context spaces. For example, a depiction of an espresso machine may refer to an individual machine in one context space, but may serve as a prototypical representative of an espresso machine in another. 
In addition, we not only specify encoding relations between individual objects, but also specify encoding relations on a generic level (e.g., that the property of being red in a picture encodes the property of being defective in the world).</Paragraph> <Paragraph position="2"> While it can be assumed that a user reads a text in sequential order, it is often not clear at which times a user looks at a picture. Therefore, it does not always make sense to further distinguish between an anaphor and its antecedent.</Paragraph> <Paragraph position="3"> Fortunately, our approach does not require identifying parts of a presentation as anaphora and antecedents. It suffices to recognize which parts of a description are intended to encode a uniquely determined object. To express such cohesive relationships between presentation parts p1 and p2, we define the predicate:</Paragraph> </Section> <Section position="3" start_page="531" end_page="532" type="sub_section"> <SectionTitle> 2.3 Links between Representations and Descriptions </SectionTitle> <Paragraph position="0"> In understanding a referring expression, the user has to recognize certain links between activated mental representations, between descriptions and mental representations, and between textual and graphical parts of descriptions.</Paragraph> <Paragraph position="1"> Which links are present in a description and which have to be inferred varies from situation to situation. To illustrate this, let's have a look at a case study carried out in our espresso machine domain where text-picture combinations are used to explain how to operate an espresso machine. We assume that the user is requested to turn the temperature control of an espresso machine. In this case, identification means activating a representation the user builds up when localizing the referent in his visual field. 
Furthermore, we presume the same user knowledge of the espresso machine as in Section 2.1; i.e., the user knows of the existence of the on/off switch and the temperature control, has visual access to the two switches in the world but is not able to tell them apart. In the diagrams below, we use the abbreviations ES, C and E for the relations EncodesSame, Coref and Encodes respectively.</Paragraph> <Paragraph position="2"> In the document fragment shown in Fig. 2, the textual reference expression uniquely determines a referent, but activates a representation (r2_u) which doesn't contain any information to localize the referent. Conversely, the representations activated by the picture contain locative information, but here we have the problem that several object representations are activated to the same extent. Since only the property of being a switch, but not the property of being a temperature control, is conveyed by the picture, both switch depictions become possible as antecedents of the textual referring expression.</Paragraph> <Paragraph position="3"> In Fig. 3, the verbal description discriminates the referent from its alternatives by attributes of the world object, namely 'being a switch' and 'being depicted in the figure', and an attribute of the depiction, namely 'being dark'. But, in contrast to the previous example, only one of the representations activated by the picture fits the verbal description. Thus, the user should be able to discover the anaphoric link between the verbal description and the graphical depiction and activate an appropriate representation.</Paragraph> <Paragraph position="4"> Figure 3: Establishing a Cohesive Link by Incorporating Picture Attributes in Verbal Descriptions. In the previous example, an anaphoric link between text and picture has been established by including pictorial attributes in the verbal description. An alternative is to apply graphical focusing techniques, as in Fig. 4. 
Here, it's very likely that the user will be able to draw a link between text and picture because he will assume that the pictorial and the textual focus coincide. This example also illustrates how the user's knowledge of the identity of objects can be enriched by means of a referring act. The verbal description without the graphics and the graphical depiction without the text activate different representations of the switch. When considering both text and graphics, the user will conclude that they refer to the same object. Thus, he is not only able to identify the switch as required, he is also able to combine the different representations of the switch into one. Note that this phenomenon can also be explained in terms of centering theory (Grosz et al., 1983).</Paragraph> <Paragraph position="5"> In the example, the preferred center of the picture would coincide with the backward looking center of the text.</Paragraph> <Paragraph position="6"> The example shown in Fig. 5 differs from the previous ones in that no correspondence link between picture objects and real world objects can be established. Although the user is able to draw an anaphoric link between the verbal and the pictorial description, he is not able to visually identify the intended referent.</Paragraph> <Paragraph position="7"> Turn the temperature control clockwise.</Paragraph> </Section> <Section position="4" start_page="532" end_page="532" type="sub_section"> <SectionTitle> World </SectionTitle> <Paragraph position="0"> Summing up, it can be said that a referring act is only successful when the description provides an access path to an appropriate representation. The user has to infer such a path from encoding relationships and cohesive links</Paragraph> <Paragraph position="1"> between the parts of a description. 
As the examples show, the following cases occur: a) If the user does not recognize which picture parts correspond to which world object, the referring act either fails (cf. Fig. 5) or the picture contributes nothing to its success. b) If the relationship between pictorial depictions and verbal descriptions is unclear, the referent can either not be found (cf. Fig. 2) or one of the media has no influence on referent identification. c) If a graphical depiction and a verbal description activate different representations of one and the same object and the user recognizes not only these links, but also a link between the two presentation parts, he is not only able to find the referent, but also able to combine the different representations into one (cf. Fig. 4).</Paragraph> </Section> </Section> <Section position="4" start_page="532" end_page="532" type="metho"> <SectionTitle> 3 USING THE MODEL TO GENERATE REFERRING EXPRESSIONS </SectionTitle> <Paragraph position="0"> In the following, we will sketch how we have integrated the approach into the multimedia presentation system WIP (Wahlster et al., 1993). At the heart of the WIP system is a presentation planner that is responsible for determining the contents and selecting an appropriate medium combination.</Paragraph> <Paragraph position="1"> The presentation planner receives as input a presentation goal (e.g., the user should know where a certain switch is located). It then tries to find a presentation strategy which matches this goal and generates a refinement-style plan in the form of a directed acyclic graph (DAG). This DAG reflects the propositional contents of the potential document parts, the intentional goals behind the parts as well as the rhetorical relationships between them; for details see (André and Rist, 1993). 
While the top of the presentation plan is a more or less complex presentation goal (e.g., instructing the user in switching on a device), the lowest level is formed by specifications of elementary presentation tasks (e.g., formulating a request or depicting an object). These elementary tasks are directly forwarded to the medium-specific generators, currently for text (Kilger, 1994) and graphics (Rist and André, 1992).</Paragraph> <Paragraph position="2"> The content of referring expressions is determined by the presentation planner that also decides which representations should be activated and which medium should be chosen for this. To be able to perform these steps, we need presentation strategies for linking propositional acts with activation acts. An example of such a strategy is [1].</Paragraph> <Paragraph position="3"> This strategy can be used to request the user to perform an action. In this strategy, two kinds of act occur: an elementary speech act S(urface)-Request and three activation acts for specifying the action and the semantic case roles associated with the action (Activate). The strategy prescribes text for the subsidiary acts because the resulting referring expressions (?action-spec, ?agent-spec and ?object-spec) are obligatory case roles of an S-Request speech act which will be conveyed by text. For optional case roles any medium can be taken. In addition to strategies for linking propositional and activation acts, we need strategies for different kinds of activation and for establishing Coref- and EncodesSame-relationships. For example, strategy [2] can be used to activate a representation ?r-1 by text and to simultaneously enrich the user's knowledge about the identity of objects. 
The strategy only applies if there already exists an image ?pic-obj which encodes ?r-1, if the system believes that ?r-1 and ?r-2 are representations of the same world object, and if the system's model of the user's beliefs contains ?r-2. If the strategy is applied, the system a) provides a unique description ?d for ?r-2 (main act) and b) ensures that the user recognizes that this description and the corresponding image specify the same object (subsidiary act).</Paragraph> <Paragraph position="4"> [2] Header: (Activate S U (?case-role ?r-1) ?d Text) For a), we use a discrimination algorithm similar to the algorithm presented in (Reiter and Dale, 1992). However, we have investigated additional possibilities for distinguishing objects from their alternatives. We can refer not only to features of an object in a scene, but also to features of the graphical model, their interpretation and to the position of picture objects within the picture, see also (Wazinski, 1992). A detailed description of our discrimination algorithm can be found in (Schneiderlöchner, 1994). Task b) can be accomplished by correlating the visual and the textual focus, by redundantly encoding object attributes, or by explicitly informing the user about a Coref-relationship.</Paragraph> <Paragraph position="5"> Such a Coref-relationship can be established by strategies for the generation of cross-media referring expressions (as in &quot;The left switch in the figure is the temperature control&quot;) or by strategies for annotating objects in a figure.</Paragraph> </Section> </Paper>