File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/j00-2002_metho.xml
Size: 14,839 bytes
Last Modified: 2025-10-06 14:07:15
<?xml version="1.0" standalone="yes"?> <Paper uid="J00-2002"> <Title>Pineda and Garza Multimodal Reference Resolution Constant of G: C~</Title> <Section position="2" start_page="142" end_page="145" type="metho"> <SectionTitle> FL L W </SectionTitle> <Paragraph position="0"> Figure 3 Multimodal representational system for linguistic and graphical modalities. P stands for the set of graphical symbols constituting the graphical modality proper (i.e., the actual symbols on a piece of paper or on the screen). Note that two sets of expressions are considered for the graphical modality: the expressions in G belong to a formal language in which the geometry of pictures is represented and reasoned about, and P contains the overt graphical symbols that can be seen and drawn but cannot be manipulated directly. The functions PL-G and PC-L stand for the translation mappings between the languages L and G, and the functions PP-c and Pc-P stand for the corresponding translations between G and P. The translation function pP-c maps well-defined objects of the graphical modality into expressions of G where the interpretation process is performed. The translation Pc-P, on the other hand, maps geometrical expressions of G into pictures; for every well-defined term of G of a graphical type (e.g., dot, line, etc.) there is a graphical object or a graphical composition that can be drawn or highlighted with the application of geometrical algorithms associated to operators of G in a systematic fashion. The circle labeled W stands for the world and together with the functions FL and Fp constitutes a multimodal system of interpretation. The ordered pair (W, FL) defines the model ML for the natural language, and the ordered pair (W, Fp) defines the model Mp for the interpretation of drawings. The interpretation of expressions in G in relation to the world is defined either by the composition FLdegpc_L or, alternatively, by FpdegpG_p. The denotation of the word France in L, for instance, is the same as the denotation of the corresponding region of the map of Europe that denotes France, the country, since both refer to the same individual. The denotation of the symbol rl in G that is related to the word France in L through PG-L, and to a particular region in P through pG-P, is also France, as translation is a meaning-preserving relation between expressions. The interpretation functions FL and Fp relate basic expressions, either graphical or linguistic, to the objects or relations of the world that these expressions happen to represent, and the definition of a semantic algebra for computing the denotation of composite graphical and linguistic expressions is required.</Paragraph> <Paragraph position="1"> An important consideration for the scheme in Figure 3 is that the symbols of P have two roles: on the one hand, they are representational objects (e.g., a region of Computational Linguistics Volume 26, Number 2 the drawing represents a country), but on the other, they are also geometrical objects that can be talked about as geometrical entities. The geometrical region of the map representing France, for instance, is itself represented by the constant rl in G. In this second view, geometrical entities are individual objects in the world of geometry, and as such they have a number of geometrical properties that are independent of whether we think of graphical symbols as objects in themselves or as symbols representing something else. 
<Paragraph position="1"> An important consideration for the scheme in Figure 3 is that the symbols of P have two roles: on the one hand, they are representational objects (e.g., a region of the drawing represents a country), but on the other, they are also geometrical objects that can be talked about as geometrical entities. The geometrical region of the map representing France, for instance, is itself represented by the constant r1 in G. In this second view, geometrical entities are individual objects in the world of geometry, and as such they have a number of geometrical properties that are independent of whether we think of graphical symbols as objects in themselves or as symbols representing something else. The same duality can be stated from the point of view of the expressions of G, since the set of individual geometrical objects (i.e., P) constitutes a domain of interpretation for the language G. This is to say that expressions of G have two interpretations: they represent geometrical objects, properties, and relations directly, but they also represent the objects of the world (e.g., France, Germany, etc.) indirectly, through the translation relation and the interpretation of the symbols in P taken as a language (i.e., the composition F_P ∘ ρ_{G-P}). The ordered pair (P, F_G) defines the model M_G for the geometrical interpretation of G as geometrical objects; the geometrical interpretation function F_G assigns a denotation to every constant of G: the denotations of individual constants of G are the graphical symbols themselves, and the denotations of the operators and function symbols of G denoting graphical properties and relations are given by predefined geometrical algorithms commonly used in computational geometry and computer graphics--see, for instance, Shamos (1978). The semantic interpretation of composite expressions of G, on the other hand, is defined through a semantic algebra, as will be shown below in Section 2.3.2. The definition of this geometrical interpreter will allow us to perform inferences about the geometry of the drawing in a very effective fashion. Consider that stating explicitly all true and false geometrical statements about a drawing would be a very cumbersome task, as the number of statements that would have to be made even for small drawings would be very large. Note also that although a map can be an incomplete representation of the world (e.g., some cities might have been omitted), the geometrical algorithms associated with the operators of G will always provide complete information on the map as a geometrical object.</Paragraph>
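To make the last point concrete, the following sketch shows one way the denotation of a hypothetical operator of G could be supplied by a standard computational-geometry routine, so that geometrical facts about the drawing are computed on demand rather than stored explicitly. The operator name included_in and the coordinate encoding of regions are assumptions for the example; the paper does not commit to a particular algorithm or operator inventory.

```python
# A minimal sketch of how an operator of G could receive its denotation from a
# geometrical algorithm (in the spirit of the computational-geometry routines the
# paper alludes to). The predicate "included_in" and the polygon encoding are
# assumptions for this example.

from typing import List, Tuple

Point = Tuple[float, float]

def included_in(dot: Point, region: List[Point]) -> bool:
    """Denotation of a hypothetical G operator: ray-casting point-in-polygon test."""
    x, y = dot
    inside = False
    n = len(region)
    for i in range(n):
        x1, y1 = region[i]
        x2, y2 = region[(i + 1) % n]
        # Does the horizontal ray from (x, y) cross the edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Rather than storing every true or false geometrical statement, facts such as
# "this dot lies inside this region" are computed on demand from the drawing.
region_r1 = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]   # a rectangular region
print(included_in((1.0, 1.0), region_r1))   # True
print(included_in((5.0, 1.0), region_r1))   # False
```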
<Section position="1" start_page="143" end_page="145" type="sub_section"> <SectionTitle> 1.2 Multimodal interpretation </SectionTitle> <Paragraph position="0"> For the kind of problem exemplified in Figures 1 and 2, the objects in L, P, and G are given, and the function F_L establishes the relation between linguistic constants and the objects of the world that such constants happen to refer to. To interpret these multimodal messages, F_P must be made explicit. If one asks who is he? looking at Figure 1, for instance, the answer is found by computing ρ_{G-P}(ρ_{L-G}(he)), whose value is the picture of the man on the drawing. Once this computation is performed, the picture can be highlighted or signaled by other graphical means. However, in other kinds of situations knowledge of F_P might be available and the purpose of the interpretation process could be to identify F_L. If one points at the middle dot in Figure 2 at the time the question what is this? is asked, the answer can be found by applying the function ρ_{G-L} ∘ ρ_{P-G} to the dot indicated (i.e., ρ_{G-L}(ρ_{P-G}(o))), whose value would be the word Saarbrücken. A similar situation arises in the interpretation of multimodal referring expressions. Consider the following example--also from André and Rist (1994)--in which a multimodal message is constituted by a picture of an espresso machine that has two switches, and by the textual expression the temperature control. In this scenario, the denotation of the natural language expression can be found by the human interpreter if the corresponding switch is identified in the picture through visual inspection (e.g., if the switch is highlighted). In general, multimodal coreference can be established if ρ_{L-G} and ρ_{G-L} are defined, as F_P can be made explicit in terms of F_L and vice versa.</Paragraph> <Paragraph position="1"> In situations in which all the theoretical elements illustrated in Figure 3 are given, questions about multimodal scenarios can be answered through the evaluation of expressions of a given modality in terms of the interpreters of the languages involved and the translation functions. However, when one is instructed to interpret a multimodal message, like those in Figures 1 and 2, not all the information in the scheme of Figure 3 is available. In particular, the translation functions ρ_{L-G} and ρ_{G-L} for the graphical and linguistic individual constants mentioned in the texts and pictures of the multimodal messages are not known, and the crucial inference of the interpretation process has as its goal to find the definition of these functions (i.e., to establish the relations between names of L and G). It is important to emphasize that in order to find ρ_{L-G} and ρ_{G-L}, the information overtly provided in the multimodal message is usually not enough; to carry out such an interpretation process it is necessary to consider the grammatical structure of the languages involved, the definition of translation rules between the languages, and also conceptual knowledge stored in memory about the interpretation domain.</Paragraph> <Paragraph position="2"> An additional consideration regarding the scheme in Figure 3 is related to the problem of ambiguity in the interpretation of multimodal messages. In the literature on intelligent multimodal systems, ambiguity is commonly seen from the perspective of human users. A multimodal referring expression constituted by the text the temperature control and a drawing with two switches is said to be ambiguous, for instance, if the human user is not able to tell which one is the temperature control. A well-designed presentation should avoid this kind of ambiguity by providing additional information either in textual form (e.g., the temperature control is the switch on the left) or by a graphical focusing technique (e.g., highlighting the left switch).</Paragraph> <Paragraph position="3"> An important motivation in the design of intelligent presentation systems like WIP (Wahlster et al. 1993) and COMET (Feiner and McKeown 1993) is to generate graphical and linguistic explanations in which these kinds of ambiguities are avoided.2 Note, however, that such situations are better characterized as problems of underspecification, rather than as problems of ambiguity, since the expression the temperature control has only one syntactic structure and one meaning, and the referent can be identified in a given context if enough information is available.</Paragraph>
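The distinction between underspecification and ambiguity just drawn can be pictured with a toy filter over candidate referents. The Python fragment below is purely illustrative and is not the interpretation procedure developed in this paper: the graphical constants s1, s2, and b1 and their attribute encodings are invented for the espresso-machine example, and the filter simply shows how extra textual or graphical information narrows the candidate values of ρ_{L-G}(the temperature control) down to one.

```python
# Hedged sketch: one way to picture the search for a missing translation rho_{L-G}.
# All names and attributes below are hypothetical; the paper establishes the
# translation via grammar, translation rules, and conceptual knowledge, not this filter.

# Assumed conceptual knowledge about the graphical constants of G for the example.
graphical_constants = {
    "s1": {"type": "switch", "position": "left"},
    "s2": {"type": "switch", "position": "right"},
    "b1": {"type": "boiler", "position": "center"},
}

def candidate_translations(description):
    """Return the G constants compatible with what is known about the referent."""
    return [c for c, props in graphical_constants.items()
            if all(props.get(k) == v for k, v in description.items())]

# With only the text "the temperature control" the description is underspecified:
print(candidate_translations({"type": "switch"}))                      # ['s1', 's2']
# Extra information ("the switch on the left") or highlighting removes the
# underspecification and fixes rho_{L-G}(the temperature control):
print(candidate_translations({"type": "switch", "position": "left"}))  # ['s1']
```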
<Paragraph position="4"> Ambiguity in multimodal systems has also been related to the granularity of graphical pointing acts. A map, for instance, can be represented by an expression of G that translates into a graphical composition in P denoting a single individual (e.g., Europe) or by a number of expressions of G that refer to the minimal graphical partitions in P (e.g., the countries of Europe), depending on whether the focus of the interpretation process is the whole of the drawing or its constituent parts. This problem has also been addressed in a number of intelligent multimodal systems like XTRA (Wahlster 1991) and AlFresco (Stock et al. 1993), but the lack of a formalized notion of graphical language (and of a better understanding of indexical expressions) has prevented a deeper analysis of this kind of ambiguity.</Paragraph> <Paragraph position="5"> These notions of "ambiguity" in multimodal systems contrast with the traditional notion of ambiguity in natural language, in which an ambiguous expression has several interpretations. The formalization of graphical representations through the definition of graphical languages with well-defined syntax and semantics allows us to face the problem of ambiguity directly, in terms of the translation relation between natural and graphical languages and the semantics of the expressions of both modalities. An interesting question is whether the graphical context offers clues that the parser can use to resolve lexical and structural ambiguity. Although we have yet to explore this issue, there are some antecedents in this regard. In Steedman's theory of incremental interpretation in dialogue, for instance, the rules of syntax, semantics, and processing are very closely linked (Steedman 1986), and local ambiguities may be resolved by taking into account their appropriateness to the context, which can be graphical. Structural ambiguity in G can be appreciated, for instance, in relation to the granularity of graphical objects, as the same drawing will have different syntactic analyses depending on whether it is interpreted as a whole or as an aggregation of parts. It is likely that the resolution of this latter kind of ambiguity is also influenced by pragmatic factors concerning the purpose of the task, the interpretation domain, and the attentional state of the interpreter, but this investigation is also pending.</Paragraph> <Paragraph position="6"> We do, however, address issues of ambiguity related to the resolution of spatial indexical terms and anaphoric references in an integrated fashion. In Section 3, an incremental constraint satisfaction algorithm for resolving referential terms in relation to the graphical domain is presented. This algorithm relies on the spatial constraints of drawings and on general knowledge about the interpretation domain, and its computation is performed during the construction of multimodal discourse representation structures (MDRSs), which are extensions of the DRSs of DRT (Kamp and Reyle 1993), as illustrated in Section 4. In the same way that DRT makes no provision for ambiguity resolution and alternative DRSs are constructed for different readings of a sentence, several MDRSs would have to be constructed in our approach for ambiguous multimodal messages.3 However, as natural language terms in L in our simplified domain refer to graphical objects, indefinites are very unlikely to have specific readings (e.g., "a city" normally refers to any city), and a simple heuristic in which indefinites are within the scope of definite descriptions and proper names can be used to obtain the preferred reading of sentences such as the one in Figure 2. Nevertheless, even if only this reading is considered, and the interpreter knows that the drawing is a map and is aware of the interpretation conventions of this kind of graphical representation (i.e., countries are represented by regions, cities by dots, etc.), drawings can still be ambiguous. In Figure 2, for instance, there are four possible interpretations for the graphical symbols that are consistent with the text if no knowledge of the geography of Europe is assumed. Our algorithm is designed to resolve reference for spatial referential and anaphoric terms in the course of multimodal discourse interpretation, and the graphical ambiguity is resolved as part of this process, as will be shown in detail in Sections 3 and 4.</Paragraph>
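To give a rough feel for this kind of resolution, the sketch below is a deliberately simplified stand-in for the incremental constraint satisfaction algorithm of Section 3, not a reproduction of it: the candidate sets, the dot and region names, and the single "city lies inside country" constraint are all invented for illustration, and a genuinely incremental version would prune candidates clause by clause rather than enumerate assignments at the end.

```python
# Toy constraint-based resolution of referring terms against candidate graphical
# referents (hypothetical data; not the authors' algorithm).

from itertools import product

# Candidate graphical referents for each referring term.
candidates = {
    "Saarbruecken": ["dot_1", "dot_2", "dot_3"],   # a city term may denote any unresolved dot
    "Germany": ["region_a", "region_b"],           # a country term may denote any region
}

# Hypothetical spatial facts computed by the geometrical interpreter:
# which dots lie inside which regions.
inside = {("dot_2", "region_a"), ("dot_3", "region_b")}

def consistent(assignment):
    """Spatial constraint from the text: the city dot must lie inside the country region."""
    return (assignment["Saarbruecken"], assignment["Germany"]) in inside

# Enumerate and filter complete assignments; an incremental version would instead
# prune the candidate sets after each clause of the multimodal discourse.
solutions = [dict(zip(candidates, combo))
             for combo in product(*candidates.values())
             if consistent(dict(zip(candidates, combo)))]
print(solutions)
# [{'Saarbruecken': 'dot_2', 'Germany': 'region_a'},
#  {'Saarbruecken': 'dot_3', 'Germany': 'region_b'}]
```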
<Paragraph position="7"> To conclude this section, we believe that the formalization of the syntax and semantics of graphical representations in a form compatible with the syntax and semantics of natural language, as in the scheme of Figure 3, may be a point of departure for investigating how the graphical or visual context helps to resolve natural language ambiguities at different levels of representation and processing.</Paragraph> </Section> </Section> </Paper>