A Model for Multimodal Reference 
Resolution 
Luis Pineda* 
National Autonomous University of 
Mexico (UNAM) 
Gabriela Garza 
An important aspect of the interpretation of multimodal messages is the ability to identify when 
the same object in the world is the referent of symbols in different modalities. To understand the 
caption of a picture, for instance, one needs to identify the graphical symbols that are referred to 
by names and pronouns in the natural language text. One way to think of this problem is in terms
of the notion of anaphora; however, unlike linguistic anaphoric inference, in which antecedents 
for pronouns are selected from a linguistic context, in the interpretation of the textual part of 
multimodal messages the antecedents are selected from a graphical context. Under this view, 
resolving multimodal references is like resolving anaphora across modalities. Another way to 
see the same problem is to look at pronouns in texts about drawings as deictic. In this second 
view, the context of interpretation of a natural language term is defined as a set of expressions of
a graphical language with well-defined syntax and semantics. Natural language and graphical
terms are thought of as standing in a relation of translation similar to the translation relation that
holds between natural languages. In this paper a theory based on this second view is presented. In
this theory, the relations between multimodal representation and spatial deixis, on the one hand, 
and multimodal reasoning and deictic inference, on the other, are discussed. An integrated model 
of anaphoric and deictic resolution in the context of the interpretation of multimodal discourse is 
also advanced. 
1. Reference, Spatial Deixis, and Modality 
In this paper a model for the resolution of multimodal references is presented. This 
is the problem of finding the referent of a symbol in one modality using information 
present either in the same or in other modalities. A model of this kind can be useful 
both for implementing intelligent multimodal tools (e.g., authoring tools to input natural
language and graphics interactively for the automatic construction of tutorials or
manuals) and from the point of view of human-computer interaction (HCI) where it 
can help in the design of computer interfaces in which the interpretation constraints 
of multimodal messages should be taken into account. 
Consider Figure 1 (adapted from Rist [1996]) in which a message is expressed
through two different modalities, namely text and graphics. The figure illustrates a 
kind of reasoning required to understand multimodal presentations: in order to make 
sense of the message, the interpreter must realize what individuals are referred to by 
the pronouns he and it in the text. For the sake of argument, it is assumed that the 
graphical symbols in the figure are understood directly in terms of a graphical lexicon, 
in the same way that the words he, it, and washed are understood in terms of the textual 
* Department of Computer Science, Institute for Applied Mathematics and Systems (IIMAS), National
Autonomous University of Mexico (UNAM), Mexico. E-mail: luis@leibniz.iimas.unam.mx.
© 2000 Association for Computational Linguistics
Computational Linguistics Volume 26, Number 2 
"He washed it" 
Figure 1 
Instance of linguistic anaphor with pictorial antecedent. 
"Saarbrücken lies at the intersection between the border between
France and Germany and a line from Paris to Frankfurt."
Figure 2 
Instance of a pictorial anaphor with linguistic antecedent. 
lexicon. It can easily be seen that given the graphical context, he should resolve to the 
man, and it should resolve to the car. However, this inference is not valid since the 
information inferred is not contained in the overt graphical context and the meaning 
of the words involved. 
One way to look at this problem is as a case of anaphoric inference. Consider 
that the information provided by graphical means can also be expressed through the 
following piece of discourse: There is a man, a car, and a bucket. He washed it. With Kamp's 
discourse representation theory (DRT) (Kamp 1981; Kamp and Reyle 1993) a discourse 
representation structure (DRS) in which the reference to the pronoun he is constrained 
to be the man can be built. However, the pronoun it has two possible antecedents, 
and conceptual knowledge is required to select the appropriate one. In particular, 
the knowledge that a man can wash objects with water, and that water is carried in 
buckets, must be employed. If these concepts are included in the interpretation context 
like DRT conditions (which should be retrieved from memory rather than from the 
normal flow of discourse), the anaphora can be solved. By analogy, situations like the 
one illustrated in Figure 1 have been considered problems of anaphors with pictorial 
antecedents in which the interpretation context is built not from a preceding text but 
from a graphical representation that is introduced with the text (André and Rist 1994).
Consider now the converse situation shown in Figure 2 (adapted from Rist [1996]),
in which a drawing is interpreted as a map in the context of the preceding text. The 
dots and lines in the drawing, and their properties, do not have an interpretation 
and the picture in itself is meaningless. However, given the context introduced by the 
text, and also considering the common knowledge that Paris is a city in France, and 
Frankfurt a city in Germany, and that Germany lies to the east of France (to the right), 
Pineda and Garza Multimodal Reference Resolution 
it is possible to infer that the denotations of the dots to the left, middle, and right 
in the picture are Paris, Saarbrücken, and Frankfurt, respectively, and that the dotted
lines denote borders between countries, and in particular, the lower segment denotes 
the border between France and Germany. In this example, graphical symbols can be 
thought of as "variables" of the graphical representation or "graphical pronouns" that 
can be resolved in terms of the textual antecedent. Here again, the inference is not 
valid, as the graphical symbols could be given other interpretations or none at all. 
The situation in Figure 2 has been characterized as an instance of a pictorial 
anaphor with linguistic antecedent, and further related examples can be found in 
André and Rist (1994). This situation, however, cannot be modeled very easily in
terms of Kamp's DRT because the "pronouns" are not linguistic objects, and lacking 
a proper formalization of the graphical information, there is no straightforward way 
to express in a discourse representation structure that a dot representing "a variable" 
in the graphical domain has the same denotation as a natural language name or description
introduced from text in a DRS. Furthermore, the situation in Figure 1 can be
thought of as anaphoric only if we ignore the modality of the graphics, as was done 
above; but if the notion of modality is to be considered at all in the analysis, then 
the situation in Figure 1 poses the same kinds of problems as the one in Figure 2. In 
general, graphical objects, functioning as constant terms or as variables, introduced as 
antecedents or as pronouns, cannot be expressed in a DRS, since the rules constructing 
these structures are triggered by specific syntactic configurations of the natural lan- 
guage in which the information is expressed. However, this limitation can be overcome 
if graphical information can be expressed in a language with well-defined syntax and
semantics. 
An alternative is to look at these kinds of problems in terms of the traditional 
linguistic notion of deixis (Lyons 1968). Deixis has to do with the orientational features
of language, which are relative to the spatio-temporal situation of an utterance.
Under this view, and in connection with the notion of graphical anaphora discussed 
above, it is possible to mention the deictic category of demonstrative pronouns: words 
like this and that permit us to make reference to extralinguistic objects. In Figure 1, 
for instance, the pronouns he and it can be supported by overt pointing acts at the 
time the expression he washed it is uttered. Note that the purpose of the pointing act 
is to provide the referents for the pronouns directly, greatly simplifying the resolu- 
tion process. However, the deictic use of a pronoun does not necessarily have to be 
supported by a physical gesture, because deictic use is characterized, more generally, 
by the identification of the referent in a metalinguistic context. Ambiguity in such 
words is not unusual, as they can also function as anaphors if they are preceded by 
a linguistic context, and even as determiners with a deictic component (e.g., this car). 
Additionally, not only demonstratives and pronouns but also proper names, definite 
descriptions, and even indefinites can be used deictically. As a great variety of con- 
textual factors are conceivably involved in the interpretation of a deictic expression, 
gestures, although prominent, should be thought of only as one particular kind of 
contextual factor. In summary, the denotation of a deictic term is the individual that 
is picked out by the human interpreter in relation to the interpretation context. 1 Con- 
sider that in the same way that an anaphoric inference is required for identifying the 
antecedent of an anaphoric term, an inference process is required for interpreting a 
term used deictically. We refer to this process as a deictic inference. The inference by 
1 An operator called DTHAT for mapping deictic terms into their referents in an interpretation context is 
introduced in Kaplan's logic of demonstratives (Kaplan 1978). 
which one determines that he and it are the man and the car is, accordingly, a deictic 
inference. 
For our purposes, it is important to investigate the nature of the relation between 
the notions of deixis and modality, on the one hand, and multimodal reasoning and 
inference, either deictic or anaphoric, on the other. According to Kamp (1981, 283), the 
difference between deictic and anaphoric pronouns is that, 
deictic and anaphoric pronouns select their referents from certain sets 
of antecedently available entities. The two pronoun's uses differ with 
regard to the nature of these sets. In the case of a deictic pronoun 
the set contains entities that belong to the real world, whereas the 
selection set for an anaphoric pronoun is made up of constituents of 
the representation that has been constructed in response to antecedent 
discourse. 
Our concern here is how "the set of entities that belong to the real world" is ac- 
cessible to the interpreter. In normal deictic spatial situations the referent of a deictic 
term is perceived directly through the visual modality, and as a result of such a vi- 
sual interpretation process, the object is represented by the subject. The question is 
how the information can be expressed in this intermediate "visual" representation. A 
plausible answer is that there is a coding system and a medium associated with each 
particular modality. Our suggestion is that the notion of modality is a representational 
notion, and not a sensory one as normally assumed in psychological discussion. In 
our sense, a modality is a formal language, with a lexicon and well-defined syntactic
and semantic structures, with an associated medium in which the expressions of
the modality are written. Multimodal reasoning is a process involving information
expressed in the languages associated with different modalities, and is achieved with
the help of a translation relation similar to the relation of translation between natural
languages. Performing a multimodal reasoning process is possible if the translation
relation between expressions of different modalities is available. However, for particular
multimodal reasoning tasks, the translation relation between individual constants
lar multimodal reasoning tasks, the translation relation between individual constants 
of different modalities cannot be stated beforehand and has to be worked out dy- 
namically through a deictic inferential process, as will be argued in the rest of this 
paper. 
1.1 A Model for Multimodal Representation 
This view of multimodal representation and reasoning can be formalized in terms of 
Montague's general semiotic program (Dowty, Wall, and Peters 1985). Each modality 
in the system can be captured through a particular language, and relations between
expressions of different modalities can be modeled in terms of translation functions 
from basic and composite expressions of the source modality into expressions of the 
target modality. In a system of this kind, interpreting examples in Figures 1 and 2 
in relation to the linguistic modality is a matter of interpreting the information expressed
through natural language directly when enough information is available, and
completing the interpretation process by means of translating expressions of the graph- 
ical modality into the linguistic one, and vice versa. Consider Figure 3--developing 
from previous work (Pineda 1989, 1998; Klein and Pineda 1990; Santana 1999)--in 
which a multimodal representational system for linguistic and graphical modalities is 
illustrated. 
The circles labeled L and G in Figure 3 stand for sets of expressions of the natural
language (e.g., English) and the graphical language, respectively, and the circle labeled
Figure 3 
Multimodal representational system for linguistic and graphical modalities. 
P stands for the set of graphical symbols constituting the graphical modality proper 
(i.e., the actual symbols on a piece of paper or on the screen). Note that two sets of 
expressions are considered for the graphical modality: the expressions in G belong 
to a formal language in which the geometry of pictures is represented and reasoned
about, and P contains the overt graphical symbols that can be seen and drawn but
cannot be manipulated directly. The functions $\rho_{L-G}$ and $\rho_{G-L}$ stand for the translation
mappings between the languages L and G, and the functions $\rho_{P-G}$ and $\rho_{G-P}$ stand
for the corresponding translations between G and P. The translation function $\rho_{P-G}$
maps well-defined objects of the graphical modality into expressions of G where the
interpretation process is performed. The translation $\rho_{G-P}$, on the other hand, maps
geometrical expressions of G into pictures; for every well-defined term of G of a
graphical type (e.g., dot, line, etc.) there is a graphical object or a graphical composition
that can be drawn or highlighted with the application of geometrical algorithms
associated with operators of G in a systematic fashion. The circle labeled W stands for
the world and together with the functions $F_L$ and $F_P$ constitutes a multimodal system
of interpretation. The ordered pair $(W, F_L)$ defines the model $M_L$ for the natural
language, and the ordered pair $(W, F_P)$ defines the model $M_P$ for the interpretation of
drawings. The interpretation of expressions in G in relation to the world is defined
either by the composition $F_L \circ \rho_{G-L}$ or, alternatively, by $F_P \circ \rho_{G-P}$. The denotation of the
word France in L, for instance, is the same as the denotation of the corresponding
region of the map of Europe that denotes France, the country, since both refer to the
same individual. The denotation of the symbol r1 in G that is related to the word
France in L through $\rho_{G-L}$, and to a particular region in P through $\rho_{G-P}$, is also France,
as translation is a meaning-preserving relation between expressions. The interpretation
functions $F_L$ and $F_P$ relate basic expressions, either graphical or linguistic, to the
objects or relations of the world that these expressions happen to represent, and the 
definition of a semantic algebra for computing the denotation of composite graphical 
and linguistic expressions is required. 
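As an illustration, the agreement between the two routes from G to the world can be sketched in Python, assuming that the functions of Figure 3 are finite lookup tables over basic expressions. The constant r1 and the word France come from the text; the picture identifier `region_1`, the world individual `FRANCE`, and the helper `compose` are hypothetical names introduced only for this sketch.

```python
# Illustrative sketch only: finite tables stand in for the functions of Figure 3.
# "region_1" and "FRANCE" are hypothetical identifiers, not part of the paper.

rho_G_L = {"r1": "France"}      # translation from G into L
rho_G_P = {"r1": "region_1"}    # translation from G into P
F_L = {"France": "FRANCE"}      # interpretation of basic expressions of L
F_P = {"region_1": "FRANCE"}    # interpretation of basic symbols of P


def compose(outer, inner):
    """Return the table for outer o inner (apply inner first, then outer)."""
    return {key: outer[value] for key, value in inner.items()}


# Both routes from G to the world should agree, because translation
# is taken to be a meaning-preserving relation.
via_language = compose(F_L, rho_G_L)
via_picture = compose(F_P, rho_G_P)
print(via_language["r1"], via_picture["r1"])
```

The sketch covers basic expressions only; composite expressions would need the semantic algebra mentioned above.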
An important consideration for the scheme in Figure 3 is that the symbols of P 
have two roles: on the one hand, they are representational objects (e.g., a region of 
the drawing represents a country), but on the other, they are also geometrical ob- 
jects that can be talked about as geometrical entities. The geometrical region of the 
map representing France, for instance, is itself represented by the constant r1 in G. In
this second view, geometrical entities are individual objects in the world of geometry, 
and as such they have a number of geometrical properties that are independent of 
whether we think of graphical symbols as objects in themselves or as symbols rep- 
resenting something else. The same duality can be stated from the point of view of 
the expressions of G, since the set of individual geometrical objects (i.e., P) constitutes 
a domain of interpretation for the language G. This is to say that expressions of G
have two interpretations: they represent geometrical objects, properties, and relations 
directly, but they also represent the objects of the world (e.g., France, Germany, etc.) 
indirectly through the translation relation and interpretation of symbols in P taken as
a language (i.e., the composition $F_P \circ \rho_{G-P}$). The ordered pair $(P, F_G)$ defines the model
$M_G$ for the geometrical interpretation of G as geometrical objects; the geometrical interpretation
function $F_G$ assigns a denotation to every constant of G; the denotations
of individual constants of G are the graphical symbols themselves, and the denotation 
of operators and function symbols of G denoting graphical properties and relations 
will be given by predefined geometrical algorithms commonly used in computational 
geometry and computer graphics--see, for instance, Shamos (1978). The semantic in- 
terpretation of composite expressions of G, on the other hand, is defined through a 
semantic algebra, as will be shown below in Section 2.3.2. The definition of this ge- 
ometrical interpreter will allow us to perform inferences about the geometry of the 
drawing in a very effective fashion. Consider that to state explicitly all true and false 
geometrical statements about a drawing would be a very cumbersome task, as the 
number of statements that would have to be made even for small drawings would 
be very large. Note also that although a map can be an incomplete representation 
of the world (e.g., some cities might have been omitted), the geometrical algorithms 
associated with operators of G will always provide complete information on the map 
as a geometrical object. 
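The point that geometrical facts are computed on demand rather than listed can be illustrated with a small Python sketch. It assumes that dots denote coordinate pairs and that two operators, here called `left_of` and `between`, are evaluated by simple algorithms; the constants d1, d2, d3, their coordinates, and both operator names are hypothetical and are not the operators of G defined later in the paper.

```python
# Illustrative sketch only. F_G maps hypothetical constants of G to the
# graphical symbols they denote, modeled here as (x, y) coordinates.
F_G = {"d1": (1.0, 2.0), "d2": (4.0, 2.5), "d3": (7.0, 3.0)}


def left_of(a, b):
    """True if dot a has a smaller x coordinate than dot b."""
    return F_G[a][0] < F_G[b][0]


def between(a, b, c, tol=1e-9):
    """True if dot a lies on the straight segment joining dots b and c."""
    (ax, ay), (bx, by), (cx, cy) = F_G[a], F_G[b], F_G[c]
    # A zero cross product means the three dots are collinear.
    cross = (cx - bx) * (ay - by) - (cy - by) * (ax - bx)
    # The projection of a onto the segment must fall within its extent.
    dot = (ax - bx) * (cx - bx) + (ay - by) * (cy - by)
    length_sq = (cx - bx) ** 2 + (cy - by) ** 2
    return abs(cross) <= tol and 0 <= dot <= length_sq


# No statement about the drawing is stored; each one is computed when asked.
print(left_of("d1", "d3"), between("d2", "d1", "d3"), between("d1", "d2", "d3"))
```

Because every query is answered by an algorithm over the drawing itself, the truth value of any statement built from these operators is available without enumerating the statements beforehand.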
1.2 Multimodal Interpretation
For the kind of problem exemplified in Figures 1 and 2, the objects in L, P, and G 
are given, and the function $F_L$ establishes the relation between linguistic constants
and the objects of the world that such constants happen to refer to. To interpret these
multimodal messages, $F_P$ must be made explicit. If one asks who is he? looking at
Figure 1, for instance, the answer is found by computing $\rho_{G-P}(\rho_{L-G}(\mathit{he}))$, whose value
is the picture of the man on the drawing. Once this computation is performed, the
picture can be highlighted or signaled by other graphical means. However, in other
kinds of situations the knowledge of $F_P$ might be available and the purpose of the
interpretation process could be to identify $F_L$. If one points out the middle dot in
Figure 2 at the time the question what is this? is asked, the answer can be found
by applying the function $\rho_{G-L} \circ \rho_{P-G}$ to the dot indicated (i.e., $\rho_{G-L}(\rho_{P-G}(\bullet))$), whose
value would be the word Saarbrücken. A similar situation arises in the interpretation of
multimodal referring expressions. Consider the following example--also from André
and Rist (1994)--in which a multimodal message is constituted by a picture of an 
espresso machine that has two switches, and by the textual expression the temperature 
control. In this scenario, the denotation of the natural language expression can be
found by the human interpreter if the corresponding switch is identified in the picture 
through visual inspection (e.g., if the switch is highlighted). In general, multimodal 
coreference can be established if $\rho_{L-G}$ and $\rho_{G-L}$ are defined, as $F_P$ can be made explicit
in terms of $F_L$ and vice versa.
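The two questions discussed above, and the way one interpretation function can be recovered from the other, can be sketched in Python under the assumption that the translation functions are already known and finite. The constants m1, c1, and d2, the picture identifiers, and the world individual `SAARBRUECKEN` are hypothetical; in the setting of this paper the tables themselves would be the outcome of the inference process, not its input.

```python
# Illustrative sketch only: all identifiers below are hypothetical.
rho_L_G = {"he": "m1", "it": "c1"}                     # from L into G
rho_G_P = {"m1": "man_picture", "c1": "car_picture"}   # from G into P
rho_P_G = {"middle_dot": "d2"}                         # from P into G
rho_G_L = {"d2": "Saarbrücken"}                        # from G into L
F_L = {"Saarbrücken": "SAARBRUECKEN"}                  # interpretation of L


def who_is(word):
    """Answer 'who is he?' with the picture reached via G."""
    return rho_G_P[rho_L_G[word]]


def what_is(symbol):
    """Answer 'what is this?' with the word reached via G."""
    return rho_G_L[rho_P_G[symbol]]


def derive_F_P(F_L, rho_G_L, rho_P_G):
    """Make F_P explicit in terms of F_L and the translation tables."""
    return {
        symbol: F_L[rho_G_L[g]]
        for symbol, g in rho_P_G.items()
        if g in rho_G_L and rho_G_L[g] in F_L
    }


print(who_is("he"), what_is("middle_dot"), derive_F_P(F_L, rho_G_L, rho_P_G))
```

The converse derivation of the linguistic interpretation from the pictorial one would follow the same pattern with the tables reversed.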
In situations in which all theoretical elements illustrated in Figure 3 are given, 
questions about multimodal scenarios can be answered through the evaluation of expressions
of a given modality in terms of the interpreters of the languages involved
and the translation functions. However, when one is instructed to interpret a multimodal
message, like those in Figures 1 and 2, not all information in the scheme of Figure 3 is
available. In particular, the translation functions $\rho_{L-G}$ and $\rho_{G-L}$ of the graphical and
linguistic individual constants mentioned in the texts and the pictures of the multimodal
messages are not known, and the crucial inference of the interpretation process
has as its goal to find out the definition of these functions (i.e., to establish the relations
between names of L and G). It is important to emphasize that in order to find
out $\rho_{L-G}$ and $\rho_{G-L}$, the information overtly provided in the multimodal message is
usually not enough, and in order to carry out such an interpretation process it will be
necessary to consider the grammatical structure of the languages involved, the definition
of translation rules between languages, and also conceptual knowledge stored
in memory about the interpretation domain. 
An additional consideration regarding the scheme in Figure 3 is related to the 
problem of ambiguity in the interpretation of multimodal messages. In the literature 
of intelligent multimodal systems, ambiguity is commonly seen from the perspective of 
human users. A multimodal referring expression constituted by the text the temperature 
control and a drawing with two switches is said to be ambiguous, for instance, if the 
human user is not able to tell which one is the temperature control. A well-designed 
presentation should avoid this kind of ambiguity by providing additional information 
either in a textual form (e.g., the temperature control is the switch on the left) or 
by a graphical focusing technique (e.g., highlighting the left switch). An important 
motivation in the design of intelligent presentation systems like WIP (Wahlster et al. 
1993) and COMET (Feiner and McKeown 1993) is to generate graphical and linguistic 
explanations in which these kinds of ambiguities are avoided. 2 Note, however, that 
such situations are better characterized as problems of underspecification, rather than 
as problems of ambiguity, since the expression the temperature control has only one 
syntactic structure and one meaning, and the referent can be identified in a given 
context if enough information is available. 
Ambiguity in multimodal systems has also been related to the granularity of 
graphical pointing acts. A map, for instance, can be represented by an expression 
of G that translates into a graphical composition in P denoting a single individual 
(e.g., Europe) or by a number of expressions of G that refer to the minimal graphical 
partitions in P (e.g., the countries of Europe) depending on whether the focus of the 
interpretation process is the whole of the drawing or its constituent parts. This prob- 
lem has also been addressed in a number of intelligent multimodal systems like XTRA 
(Wahlster 1991) and AlFresco (Stock et al. 1993), but the lack of a formalized notion 
of graphical language (and also of a better understanding of indexical expressions) has
prevented a deeper analysis of this kind of ambiguity. 
These notions of "ambiguity" in multimodal systems contrast with the traditional 
notion of ambiguity in natural language in which an ambiguous expression has several
interpretations. The formalization of graphical representations through the definition
of graphical languages with well-defined syntax and semantics allows us to
face the problem of ambiguity directly in terms of the relation of translation between
natural and graphical languages, and the semantics of expressions of both modalities.
2 It is also worth noticing that systems like WIP and COMET do not interpret multimodal messages
input by human users through the interaction and, therefore, there is no ambiguity to be resolved.
An interesting question is whether the graphical context offers clues that the
parser can use to resolve lexical and structural ambiguity. Although we have yet to 
explore this issue, there are some antecedents in this regard. In Steedman's theory 
of incremental interpretation in dialogue, for instance, the rules of syntax, seman- 
tics, and processing are very closely linked (Steedman 1986) and local ambiguities 
may be resolved by taking into account their appropriateness to the context, which 
can be graphical. Structural ambiguity in G can be appreciated, for instance, in rela- 
tion to the granularity of graphical objects, as the same drawing will have different 
syntactic analysis depending on whether it is interpreted as a whole or as an aggre- 
gation of parts. It is likely that the resolution of this latter kind of ambiguity is also 
influenced by pragmatic factors concerning the purpose of the task, the interpreta- 
tion domain, and the attentional state of the interpreter, but this investigation is also 
pending. 
We do, however, address issues of ambiguity related to the resolution of spa- 
tial indexical terms and anaphoric references in an integrated fashion. In Section 
3, an incremental constraint satisfaction algorithm for resolving referential terms in 
relation to the graphical domain is presented. This algorithm relies on spatial con- 
straints of drawings and general knowledge about the interpretation domain, and 
its computation is performed during the construction of multimodal discourse rep- 
resentation structures (MDRSs), which are extensions of DRSs in DRT (Kamp and 
Reyle 1993) as illustrated in Section 4. In the same way that DRT makes no pro- 
vision for ambiguity resolution and alternative DRSs are constructed for different 
readings of a sentence, several MDRSs would have to be constructed in our approach
for ambiguous multimodal messages. 3 However, as natural language terms
in L in our simplified domain refer to graphical objects, indefinites are very unlikely
to have specific readings (e.g., "a city" normally refers to any city) and a simple
heuristic in which indefinites are within the scope of definite descriptions and
proper names can be used to obtain the preferred reading of sentences such as the
one in Figure 2. Nevertheless, even if only this reading is considered, and the interpreter
knows that the drawing is a map and is aware of the interpretation conventions
of this kind of graphical representation (i.e., countries are represented by
regions, cities by dots, etc.), drawings can still be ambiguous. In Figure 2, for instance, 
there are four possible interpretations for the graphical symbols that are consistent 
with the text if no knowledge of the geography of Europe is assumed. Our algo- 
rithm is designed to resolve reference for spatial referential and anaphoric terms in 
the course of the multimodal discourse interpretation, and the graphical ambiguity 
is resolved in the course of this process, as will be shown in detail in Sections 3 
and 4. 
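How textual and conceptual constraints narrow down the readings of a drawing can be illustrated with a brute-force Python sketch. It is not the incremental constraint satisfaction algorithm of Section 3: it simply enumerates assignments of city names to the three dots of Figure 2 and filters them. The dot coordinates and function names are hypothetical, and since regions and curves are left out, the number of candidate readings it produces is not meant to match the four interpretations mentioned above.

```python
from itertools import permutations

# Illustrative sketch only: hypothetical x coordinates for the three dots.
dots = {"left": 0.0, "middle": 1.0, "right": 2.0}
cities = ["Paris", "Saarbrücken", "Frankfurt"]


def positions(assignment):
    """Map each city name to the x coordinate of its assigned dot."""
    return {city: dots[dot] for city, dot in assignment.items()}


def consistent_with_text(assignment):
    """The text places Saarbrücken on a line from Paris to Frankfurt."""
    x = positions(assignment)
    low, high = sorted((x["Paris"], x["Frankfurt"]))
    return low < x["Saarbrücken"] < high


def consistent_with_geography(assignment):
    """Paris is in France, Frankfurt in Germany, and Germany lies to the east."""
    x = positions(assignment)
    return x["Paris"] < x["Frankfurt"]


candidates = [dict(zip(cities, p)) for p in permutations(dots)]
textual = [a for a in candidates if consistent_with_text(a)]
resolved = [a for a in textual if consistent_with_geography(a)]
print(len(candidates), len(textual), len(resolved))
print(resolved)
```

In this reduced setting the text alone leaves more than one reading for the dots, and the geographical knowledge selects the one in which the left, middle, and right dots are Paris, Saarbrücken, and Frankfurt.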
To conclude this section, we believe the formalization of the syntax and seman- 
tics of graphical representations in a form compatible with the syntax and semantics 
of natural language, as in the scheme in Figure 3, may be a point of departure for
investigating how the graphical or visual context helps to resolve natural language
ambiguities at different levels of representation and processing. 
3 A question for further research is whether our approach can be generalized to address problems of 
ambiguity by means of underspecified representations (e.g., van Deemter and Peters 1995). These 
representations result from the lexical and syntactic disambiguation process, but leave unspecified some 
information, like the interpretation of indexical references, the resolution of anaphoric expressions and the semantic scope of operators. A relevant antecedent related to our extension of multimodal DRSs is 
Poesio's extension of DRT into the so-called Conversational Representation Theory (Poesio 1994). 
1.3 Multimodal Generation 
An important motivation for the study of the interpretation of multimodal mes- 
sages is the definition of multimodal presentation or explanation systems in which 
users are able to identify the referent of graphical and linguistic expressions eas- 
ily. In WIP, for instance, a central concern is whether the human user is able to 
"activate" the relevant "representations" (presumably in his or her mind) and re- 
solve the graphical and linguistic ambiguities and anaphors (using WIP's terminol- 
ogy) present in multimodal messages. This is possible, in general, if the message 
conveys to the human user explicit interpretation paths from the information that 
is available overtly to the information that the user is expected to infer. The production
of multimodal referring expressions in this kind of system depends on the
use of presentation strategies defined in terms of rhetorical structures and intentional 
goals--e.g., along the lines of Rhetorical Structure Theory (RST) (Mann and Thomp- 
son 1988), and its computational implementation (Moore 1995). The use of a partic- 
ular presentation strategy in a multimodal explanation (e.g., in WIP) depends cru- 
cially on whether the expressions generated on the basis of such a strategy satisfy 
the conditions defined to activate the expected representations in the user's mind 
(an intentional goal). Furthermore, some rhetorical structures are designed explic- 
itly to provide additional information to activate the expected representations if the 
conditions for the identification of the referent of an expression are not met. Con- 
sider again the resolution of the "ambiguity" in the interpretation of the tempera- 
ture control example in WIP in which the presentation strategy provides the infor- 
mation required by the human user to identify the referent, either through the text 
the temperature control is the switch on the left or highlighting or pointing to the corre- 
sponding switch in the drawing. WIP is able to tell whether the presentation would 
be ambiguous for the human user if additional information were not provided be- 
cause it has a representation of the actual situation and a simple model of the user's 
beliefs. 
Although the main representation structure of multimodal presentation and expla- 
nation systems is defined at a rhetorical level, the use of presentation strategies relies 
on algorithms for the generation of graphical and linguistic referring expressions. For 
instance, the "activate" presentation strategy of WIP (André and Rist 1994), the purpose
of which is to establish a mutual belief between the human user and the system
about the identity of an object, employs an algorithm for the generation of referring 
expressions based on an incremental interpretation algorithm proposed by Reiter and 
Dale (1992). It is interesting to note that presentations generated by WIP and other 
multimodal explanation systems like COMET (Feiner and McKeown 1993), or TEX- 
PLAN (Maybury 1993), are limited to the production of definite descriptions only, 
even though the use of indefinite descriptions can be natural in multimodal commu- 
nication. However, this restriction can be overcome with a more solid representational 
framework such as the one illustrated in Figure 3. Consider that basic or composite 
expressions of the languages G and L can be translated to basic or composite expressions
of the other language, depending on the definition of the translation function.
So, to refer linguistically to a graphical configuration, for instance, it would only be 
necessary to find an expression of G that succinctly expresses the relevant graphical 
properties of the desired object, and then translate it to its corresponding expression 
in L. The resulting natural language expression could be used directly or embedded
in a larger natural language expression containing words that refer to abstract objects
or properties. The descriptions obtained through this strategy explicitly employ the 
concrete and graphical properties of the representation, since expressions of G are 
Computational Linguistics Volume 26, Number 2 
[Figure: the map of Figure 2 with its graphical objects labeled (e.g., regions r1 to r4, curves c1, c2, c3, and c6, dot d1).]
Figure 4
Labeling the graphical objects in Figure 2.
made up of constants and operators that directly describe the geometry of objects and 
configurations. 
Consider the natural language text: Saarbrücken lies at the intersection between the border
between France and Germany and a line from Paris to Frankfurt. This sentence contains
the definite description the intersection between the border between France and Germany
and a line from Paris to Frankfurt, which in turn contains the border between France and 
Germany and a line from Paris to Frankfurt. Finding the graphical referents of these ex- 
pressions requires the identification of a dot, a curve, and a line on the map (i.e., the 
corresponding graphical objects). These graphical objects can be referred to directly 
through language; however, there are additional graphical entities on the map in Figure 2
that have an interpretation but are not mentioned explicitly in the text of the
multimodal message. In Figure 4, for instance, Belgium is represented by the region 
r4, and the curve c6 represents the border between France and Belgium. Once a picture 
has been interpreted, one would be entitled to ask not only for graphical objects that 
have been mentioned in the textual part of the message, but also for any meaningful 
graphical object. So, if one points to the curve c6 in Figure 2 at the time the question 
What is this? is asked, the answer could be the border between France and Belgium, or 
alternatively, the indefinite a border. As some graphical objects named by constants of 
the graphical language do not have a proper name in natural language, the translation
function PG-L must associate a basic constant of G with a composite expression of L.
The process of inducing such a translation function is closely related to the process of
generating the corresponding natural language descriptions, and this relation will be
explored further in Section 3. 
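The pointing example above can be sketched in Python as a small lookup table. This is only an illustration under stated assumptions: the names RHO_G_L, translate, and answer_what_is_this are hypothetical, and so is the indefinite form "a country" for r4. Only the constants r4 and c6 and their interpretations (Belgium, the border between France and Belgium, a border) come from the text.

```python
# Hypothetical table for a translation function from G into L.
# Each basic constant of G is paired with a definite and an indefinite
# expression of L. Only r4 and c6 are taken from the text; the indefinite
# form for r4 is an assumption made for illustration.
RHO_G_L = {
    "r4": ("Belgium", "a country"),
    "c6": ("the border between France and Belgium", "a border"),
}


def translate(constant, definite=True):
    """Return an expression of L for a basic constant of G."""
    definite_form, indefinite_form = RHO_G_L[constant]
    return definite_form if definite else indefinite_form


def answer_what_is_this(pointed_constant, definite=True):
    """Answer 'What is this?' for the graphical object being pointed at."""
    return translate(pointed_constant, definite)
```

Note that c6 has no proper name in L, so its definite translation is a composite expression, whereas r4 translates into a basic one.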
In the rest of this paper, we discuss in more detail how the scheme for multi- 
modal representation and interpretation in Figure 3 can be carried out. In Section 2, 
we present a formalization of the languages L, P, and G with their corresponding
translation functions, along the lines of Montague's general semiotic program. The 
process of multimodal interpretation is explained, and the translation of expressions 
of one modality into expressions of another modality is illustrated. However, such 
a process can be carried out only if the translation functions are known, which is 
not normally the case in the interpretation of multimodal messages (as noted above). 
In Section 3, we offer an account of how such functions can be induced in terms 
of the message, constraints on the interpretation conventions of the modalities, and 
constraints on general knowledge of the domain. In this section we also illustrate 
the process of generating graphical and linguistic descriptions, which is associated 
with the induction of the translation functions. In Section 4, we discuss how to ex- 
Pineda and Garza Multimodal Reference Resolution
tend Kamp's DRS with multimodal structures. Finally, in Section 5, some concluding 
remarks and some directions for further work are presented. 
2. A Multimodal System of Representation 
In this section, we present the definition of the syntax and semantics of languages L,
P, and G, illustrating the theory with the multimodal message of Figure 2. Language L
is a segment of English designed to produce expressions useful for referring to objects,
properties, and relations commonly found in discourse about maps. In particular, the
natural language expression of Figure 2 can be constructed in a compositional fashion.
The syntactic structure of P, on the other hand, imposes a restriction on the possible
geometries of the family of drawings in the interpretation domain. Language G is a
logical language in which interpretation and reasoning about geometrical configurations
can be carried out. It is an interlingua representation for information expressed
in both of the modalities.
The definitions of L, P, and G closely follow the general guidelines of Montague's 
semiotic program. As a first step in the syntactic definition of a language, the set of
categories or types is stated. A number of constants--or basic expressions--for each
type is defined, and the combination rules for producing composite expressions are
stated. For each type of a source language, a corresponding type in the target language
is assigned. Basic expressions of the source language can be mapped either to
basic or to composite expressions of the corresponding type in the target language and
vice versa. For each syntactic rule of a source language, a translation rule for mapping
the expression formed by the rule into its translation in the target language is
defined.
2.1 Definition of Language L 
Language L contains the textual part of multimodal messages in the domain of maps. 
An expression of L is, for instance, Saarbrücken lies at the intersection between the border
between France and Germany and a line from Paris to Frankfurt, which is the natural language
part of Figure 2. Constants like France and Germany, and all subexpressions of the 
former sentence, like the border between France and Germany or a line from Paris to Frankfurt 
are also in L. In addition, L contains expressions like France is a country, Frankfurt is 
a city of Germany or Germany is to the east of France, which express general knowledge 
required in the interpretation of maps. 
2.1.1 Syntactic Definition of L. The set of syntactic categories of L is as follows: 
1. The basic syntactic categories of L are t, IV, ADJ, CN, and CN', where t is
the category of sentences, IV is the category of intransitive verbs, ADJ is
the category of adjectives, and CN and CN' are two categories of
common nouns.
2. If A and B are syntactic categories then A/B is a category. 4
Traditional syntactic categories of natural language like transitive verbs (TV), terms
(T), prepositional phrases (PP), and determiners (T/CN) can be derived from the basic
categories.
4 An expression of category A/B combines with an expression of category B to give an expression of 
category A. 
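The category system can be sketched in Python as follows. This is a minimal illustration, not part of the formal definition: the class names Basic and Slash and the functions show and combine are hypothetical, while the categories themselves and the rule that A/B combines with B to give A are taken from the definition above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Basic:
    name: str  # one of t, IV, ADJ, CN, CN'


@dataclass(frozen=True)
class Slash:
    result: "Category"    # A in A/B
    argument: "Category"  # B in A/B


Category = Union[Basic, Slash]


def show(category):
    """Print a category, parenthesizing derived categories inside A/B."""
    if isinstance(category, Basic):
        return category.name

    def wrap(part):
        return show(part) if isinstance(part, Basic) else "(" + show(part) + ")"

    return wrap(category.result) + "/" + wrap(category.argument)


def combine(functor, argument):
    """A/B combined with B gives A; any other combination is rejected."""
    if isinstance(functor, Slash) and functor.argument == argument:
        return functor.result
    raise TypeError(f"{show(functor)} cannot combine with {show(argument)}")


t, IV, ADJ, CN = Basic("t"), Basic("IV"), Basic("ADJ"), Basic("CN")
T = Slash(t, IV)     # terms: t/IV
TV = Slash(IV, T)    # transitive verbs: IV/(t/IV)
DET = Slash(T, CN)   # determiners: T/CN, that is, (t/IV)/CN
```

For instance, combine(DET, CN) yields the category of terms, and combine(T, IV) yields the category of sentences.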
Constant                                         Category name   Category definition
Paris, Frankfurt, Saarbrücken, France, Germany   T               t/IV
city, country, border, line, intersection        CN              CN
east                                             CN'             CN'
big                                              ADJ             ADJ
be, lie at, be to                                TV              IV/(t/IV)
be                                               IV/ADJ          IV/ADJ
a, the                                           T/CN            (t/IV)/CN
-                                                PP              CN/CN
-                                                PP'             CN/CN'
-                                                IV              IV
Figure 5
Constants of language L.
The table in Figure 5 illustrates the constants of L with their category names and 
category definitions. Common nouns are divided into CN and CN'. Expressions of
category CN translate into graphical predicates (sets of graphical objects) while ex- 
pressions of category CN' translate into abstract concepts. For instance, city translates 
into a set of dots representing cities, but east translates into a geometrical function 
from regions to zones (e.g., if the region representing France is the argument of this 
function, the zone to the right of that region is the function value). Prepositional 
phrases are divided into PP and PP' due to the classification of common nouns into
CN and CN'. There are no basic constants of categories PP, PP', and IV, as prepositional
words are introduced syncategorematically and intransitive verb phrases are
always composite expressions in this grammar. Transitive verbs are defined in a stan- 
dard fashion, and the constant be of category IV/ADJ is used to form attributive 
sentences. 
Next, the syntactic rules of L are presented. Each rule is shown in a separate item 
containing the purpose of the rule, the syntactic rule itself, and some examples of 
expressions that can be formed with the rule. Following Montague, syntactic rules 
and syntactic operations for combining symbols (for instance, $F_{L1}$) associated with each
rule are separated. In the following, $P_C$ is the set of expressions of category C.
SENTENCES
S1L. If $\alpha \in P_T$ and $\beta \in P_{IV}$, then $F_{L1}(\alpha, \beta) \in P_t$, where $F_{L1}(\alpha, \beta) = \alpha\beta^*$, and $\beta^*$
is the result of replacing the first verb in $\beta$ by its third person singular
present form.
Examples: -Paris is a city of France
-Germany is to the east of France
-a country is big
-Saarbrücken lies at the intersection between the border between
France and Germany and a line from Paris to Frankfurt
TRANSITIVE VERB PHRASES
S2L. If $\alpha \in P_{TV}$ and $\beta \in P_T$, then $F_{L2}(\alpha, \beta) \in P_{IV}$, where $F_{L2}(\alpha, \beta) = \alpha\beta$.
Examples: -be a city
-be to the east of France
ATTRIBUTIVE VERB PHRASES
S3L. If $\alpha \in P_{IV/ADJ}$ and $\beta \in P_{ADJ}$, then $F_{L2}(\alpha, \beta) \in P_{IV}$.
Examples: -be big
TERMS
S4L. If $\alpha \in P_{T/CN}$ and $\beta \in P_{CN}$ or $P_{CN'}$, then $F_{L3}(\alpha, \beta) \in P_T$, where
$F_{L3}(\alpha, \beta) = \alpha^*\beta$, and $\alpha^*$ is $\alpha$ except in the case where $\alpha$ is a and the first
word in $\beta$ begins with a vowel; here, $\alpha^*$ is an.
Examples: -a city
-a city of France
-the border between France and Germany
-a line from Paris to Frankfurt
-the east of France
COMMON NOUNS
S5L. If $\alpha \in P_{CN}$ and $\beta \in P_{PP}$, or $\alpha \in P_{CN'}$ and $\beta \in P_{PP'}$, then $F_{L2}(\alpha, \beta) \in P_{CN}$.
Examples: -city of France
-east of France
-border between France and Germany
-intersection between the border between France and Germany and a line
from Paris to Frankfurt
of PREPOSITIONAL PHRASES 5
S6L. If $\alpha \in P_T$, then $F_{L4}(\alpha) \in P_{PP}$ or $P_{PP'}$, where $F_{L4}(\alpha)$ = of $\alpha$.
Examples: -of France
-of Germany
-of a country
between PREPOSITIONAL PHRASES
S7L. If $\alpha, \beta \in P_T$, then $F_{L5}(\alpha, \beta) \in P_{PP}$, where $F_{L5}(\alpha, \beta)$ = between $\alpha$ and $\beta$.
Examples: -between France and Germany
-between France and a country
-between the border between France and Germany and a line from
Paris to Frankfurt
from-to PREPOSITIONAL PHRASES
S8L. If $\alpha, \beta \in P_T$, then $F_{L6}(\alpha, \beta) \in P_{PP}$, where $F_{L6}(\alpha, \beta)$ = from $\alpha$ to $\beta$.
Example: -from Paris to Frankfurt
5 Although of, between, and from have been introduced syncategorematically in L for simplicity, they 
could have been defined as constants of some category of L, and their translations into G would have 
been a composite expression of some graphical type. 
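The syntactic operations of L can be sketched in Python as plain string functions. This is an illustration under stated assumptions rather than the authors' implementation: the function names f_l1 to f_l6 and the lookup table THIRD_PERSON are hypothetical, and the sketch assumes that the first word of an intransitive verb phrase is its verb. Category checking is omitted, so the functions model only the operations, not the full rules.

```python
# Third person singular present forms of the verbs of L (hypothetical table).
THIRD_PERSON = {"be": "is", "lie": "lies"}


def f_l1(alpha, beta):
    """S1L: term + intransitive verb phrase; the first verb is inflected.

    Assumes the first word of beta is the verb.
    """
    first, _, rest = beta.partition(" ")
    return f"{alpha} {THIRD_PERSON[first]} {rest}".strip()


def f_l2(alpha, beta):
    """S2L, S3L, and S5L: plain concatenation."""
    return f"{alpha} {beta}"


def f_l3(alpha, beta):
    """S4L: determiner + common noun; 'a' becomes 'an' before a vowel."""
    if alpha == "a" and beta[0].lower() in "aeiou":
        alpha = "an"
    return f"{alpha} {beta}"


def f_l4(alpha):
    """S6L: of-prepositional phrase."""
    return f"of {alpha}"


def f_l5(alpha, beta):
    """S7L: between-prepositional phrase."""
    return f"between {alpha} and {beta}"


def f_l6(alpha, beta):
    """S8L: from-to prepositional phrase."""
    return f"from {alpha} to {beta}"


# Compositional construction of the natural language part of Figure 2.
border = f_l3("the", f_l2("border", f_l5("France", "Germany")))
line = f_l3("a", f_l2("line", f_l6("Paris", "Frankfurt")))
intersection = f_l3("the", f_l2("intersection", f_l5(border, line)))
sentence = f_l1("Saarbrücken", f_l2("lie at", intersection))
```

The value of sentence is the text of the multimodal message, built bottom-up from the constants of L.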
Constant $\alpha$                                 $F_L(\alpha)$
Paris, Frankfurt, Saarbrücken, France, Germany    Paris, Frankfurt, Saarbrücken, France, Germany
city                                              {Paris, Frankfurt, Saarbrücken, ...}
country                                           {France, Germany, ...}
border                                            {border between France and Germany, ...}
line                                              {line from Paris to Frankfurt, ...}
intersection                                      {intersection between the border between France and Germany and a line from Paris to Frankfurt, ...}
east                                              -
be, lie at, be to                                 -
a, the                                            -
Figure 6
Interpretation of constants of language L.
2.1.2 Semantic Definition of L. The semantics of L is given in a model-theoretic 
fashion as follows: The interpretation domain is the world W = {Paris, Saarbrücken,
Frankfurt, France, Germany, the border between France and Germany, ...}. Let $D_x$ be
the set of possible denotations for expressions of type x, and for any types A and B,
$D_{A/B} = D_A^{D_B}$ (i.e., the set of all functions from $D_B$ to $D_A$). Let $F_L$ be an interpretation
function that assigns to each constant of type A a member of $D_A$. For the example in
Figure 3, $F_L$ is defined as shown in Figure 6.
Not every constant of L has an interpretation assigned by FL; in particular, words 
like east, be, lie at, and be to have no interpretation defined directly in L. In principle 
the definition of these constants could be stated as an object of the appropriate se- 
mantic type but this is not a straightforward enterprise. Consider, for instance, that 
the constant east of category CN' is a basic object (a kind of predicate), but the individual
objects in its extension are not overtly defined in the interpretation domain.
Furthermore, it is more natural to talk about the interpretation of composite pred- 
icates, like east of France, of which east is a part. However, even the interpretation 
of such composite predicates is problematic, as they have a vague spatial mean- 
ing. For these reasons, the interpretation of these constants is not defined explic- 
itly as a part of the function FL, but in terms of their translation into G, where a 
spatial meaning can be formally defined, as will be shown below. A similar strat- 
egy is used for the interpretation of spatial prepositions; although of, between, and 
from-to were introduced syncategorematically in the syntax of L, they could have 
been defined as objects of an appropriate category and their semantics could have 
been given explicitly through FL or, alternatively, through their translation into in- 
tensional logic along the lines of PTQ. However, the semantic type of such objects is 
extraordinarily complex, and the actual definition of these constants is seldom seen 
in the literature. 6 In our system the interpretation of spatial prepositions will also be 
given in terms of the translation into G and the interpretation of P. Note also that 
no interpretation has been defined for the determiners a and the. One strategy for 
assigning a denotation would be to translate these constants into intensional logic, 
but this would be required only for a larger fragment of English in which reference 
6 In PTQ, prepositions--of category (IV/IV)/T--are treated semantically as functions that apply to sets
of properties to give functions from properties to properties, but no explicit example of the actual 
semantic value of any of these constants is provided. In our system it will be possible to compute the 
semantic value of spatial prepositional phrases in an effective manner, yet the approach is fully 
compatible with intensional logic. 
to space was not the focus of study. In our approach the determiners will be in- 
terpreted in terms of their translations into G in which high-order functions can be 
expressed. 
In summary, the semantics of some constants and all composite expressions of L 
will be given in terms of their translations into G and P. Note that according to the 
scheme in Figure 3, if the translations between L and G, and G and P are defined, and 
the semantic interpretation of P is overtly defined, the interpretation of the natural lan- 
guage expressions can be found. Although the semantics of L is not further discussed 
in this paper, we consider that the interpretation of linguistic expressions referring 
to spatial situations could be embedded in a larger fragment of English, and a full 
semantic interpretation would have to be given by translating English into intensional 
logic. In such a model the semantic value of spatial prepositions would be left unde- 
fined, expressions referring to spatial configurations would be translated into G, and 
the interpretation of expressions of G would be embedded within the interpretation 
of intensional logic. 
2.2 Definition of Language P 
In this section, the syntax and semantics of language P are formally defined. The
purpose of these definitions is to characterize the family of drawings that can be in- 
terpreted as maps, and to discriminate these drawings from other kinds of graphical 
configurations constituted by dots, curves, and regions. This notion of a multimodal 
system of representation in which objects in the graphical modality are formalized 
through a well-defined language is similar to the notion of graphical language introduced
by Mackinlay for the automatic design of graphical presentations (Mackinlay
1987), where a number of graphical languages (e.g., the languages of bar charts, area
and position graphs, scatter plots, etc.) are formally specified. In Mackinlay's work,
expressions of graphical languages are related to the objects of the world that they
represent through an encodes relation with three arguments: the graphical constant
or expression performing the representation, the object of the world that is represented
through the graphical expression, and the graphical language to which the
graphical expression belongs. 7 The formalization of P permits us to define a precise
statement of expressiveness of a graphical language, as follows: "a set of facts
is expressible in a language (graphical) if the language contains a sentence that encodes
every fact in the set and does not encode any additional facts" (Mackinlay 1987,
54). The formalization additionally allows empirical studies to determine how effectively
a human user can interpret expressions of a particular graphical language in
relation to another in which the same set of facts is encoded.
7 Incidentally, a similar encoding relation encodes is used in the WIP system to relate the representational
object to the object that it represents, but the third argument of this relation in WIP is a context space
that allows use of the same presentation in different perspectives (e.g., an espresso machine may refer
to an individual machine in a context space, or alternatively it can be seen as the prototype of espresso
machines in a different context space). The encodes relation in WIP and in Mackinlay is similar to the
translation relation between objects of P (or G) and L in our system, and we can think of a graphical
language as a language encoding the information that is intended to be communicated. However, it is
interesting to note that the status of the "linguistic" argument of the encodes relation is different in WIP
and in Mackinlay's system. In the former, it is an "internal representation"--a psychological
notion--while in the latter it stands for an object or a relation in the world itself--a semantic notion. In
our approach, on the other hand, there are no "internal representations" and the translation relates
graphical and linguistic expressions that are both "external" and that both refer to the world through a
well-defined semantics.
Constant               Type
d1, d2, d3, ...        dot
l1, l2, l3, ...        line
c1, c2, c3, ...        curve
r1, r2, r3, ...        region
z1, z2, z3, ...        zone
cr1, cr2, cr3, ...     composite_region
∅, ds1, ds2, ...       dot_set
∅, ls1, ls2, ...       line_set
m1, m2, m3, ...        map
Figure 7
Constants of language P.
Although all graphical languages studied by Mackinlay are conventional and have a precise geometrical
characterization, the notions of expressiveness and effectiveness of graphical
languages can be applied to more unruly graphical domains (e.g., maps are analogical
representations with a diagrammatic conventional component) as long as a
formalization for the family of drawings can be approximated. Here, the question 
of whether arbitrary families of graphical objects can be formalized through a well- 
defined syntax is left open, and although it is possible to think of many families 
of drawings with very arbitrary geometries, some important efforts have been made 
in the characterization of design and other kinds of objects--see, for instance, shape 
grammars (Stiny 1975). Another related issue that is relevant for the construction of 
multimodal interactive systems is whether it is possible and useful to input expres- 
sions of P directly, and to obtain their syntactic structure through graphical parsing 
techniques (Wittenburg 1998). In summary, the purpose of formalizing P is to be able 
to talk about maps as a modality, where a modality, in our sense, is a code sys- 
tem for the symbols expressed in a medium, and a multimodal system of represen- 
tation relates information expressed through different code systems in a systematic 
fashion. 
2.2.1 Syntactic Definition of P. The types of P are dot, line, curve, region, zone, composite_region,
dot_set, line_set, and map. Let $C_s$ be the set of constants of type s, and $E_s$ the
set of well-formed expressions of graphical type s. Although the constants of P are
the actual graphical marks on the screen or a piece of paper, a number of labels for 
facilitating the presentation are illustrated in Figure 7. 
For the syntactic definition of P we capitalize on the distinction introduced by 
Montague between syntactic rules and syntactic operations. This distinction is based 
on the observation that "syntactic rules can be thought of as comprising two parts: 
one which specifies under what conditions the rule is to be applied, and the other 
which specifies what operation to perform under those conditions" (Dowty, Wall, and 
Peters 1985, 254). While a syntactic rule comprises both parts and defines the syntactic 
structure of an expression, the syntactic operation is a rule that depends on--or at 
least takes into account--the shape of the symbols and the medium in which the 
symbols are substantially realized. For instance, the syntactic operation $F_{L5}$ in the rule
S7L (i.e., $F_{L5}(\alpha, \beta)$ = between $\alpha$ and $\beta$) combines the symbols between and and with
the arguments to form the linear string indicated by the operation. For the definition 
of syntactic operations of P we generalize the operations that manipulate strings of 
symbols into general geometrical operations on the shapes of the graphical symbols 
on the paper or the screen, and these manipulations are defined according to certain 
geometrical conditions. 
The definition of well-formed expressions of P is as follows: 
CONSTANT
S1p. If $\alpha \in C_s$ then $\alpha \in E_s$.
Examples: [graphical symbols not reproduced]
LINE
S2p. If $\alpha, \beta \in E_{dot}$ then $F_{P1}(\alpha, \beta) \in E_{line}$, where $F_{P1}(\alpha, \beta)$ is a line from $\alpha$ to $\beta$.
Example: [drawing not reproduced] (the resulting graphical expression is only the line)
CURVE
S3p. If $\alpha, \beta \in E_{region}$ such that $\alpha$ and $\beta$ are adjacent then $F_{P2}(\alpha, \beta) \in E_{curve}$,
where $F_{P2}(\alpha, \beta)$ is the curve between $\alpha$ and $\beta$.
Example: [drawing not reproduced] (the resulting graphical expression is only the curve)
INTERSECTION
S4p. If $\alpha \in E_{curve}$ and $\beta \in E_{line}$ then $F_{P3}(\alpha, \beta) \in E_{dot}$, where $F_{P3}(\alpha, \beta)$ is the dot
in the intersection between $\alpha$ and $\beta$.
Example: [drawing not reproduced] (the resulting graphical expression is only the dot)
RIGHT
S5p. If $\alpha \in E_{region}$ then $F_{P4}(\alpha) \in E_{zone}$, where $F_{P4}(\alpha)$ is the zone to the right of
the region $\alpha$ (the interpretation of "right" will be given below in the
semantics of language G).
Example: [drawing not reproduced] (the resulting graphical expression is only the gray zone)
DOT INSIDE A REGION
S6p. If $\alpha \in E_{region}$ then $F_{P5}(\alpha) \in E_{dot}$, where $F_{P5}(\alpha)$ is the drawing of a dot
inside $\alpha$.
Example: [drawing not reproduced] (the resulting graphical expression is only the dot)
COMPOSITE REGION (1) 8
S7p. If $\alpha, \beta \in C_{region}$ such that $\alpha$ and $\beta$ are adjacent then
$F_{P6}(\alpha, \beta) \in E_{composite\_region}$, where $F_{P6}(\alpha, \beta)$ is the drawing of $\alpha$ and $\beta$.
COMPOSITE REGION (2)
S8p. If $\alpha \in C_{region}$ and $\beta \in E_{composite\_region}$ such that $\alpha$ and $\beta$ are adjacent then
$F_{P6}(\alpha, \beta) \in E_{composite\_region}$.
SET OF DOTS
S9p. If $\alpha \in E_{dot\_set}$ and $\beta \in C_{dot}$ then $F_{P6}(\alpha, \beta) \in E_{dot\_set}$.
SET OF LINES
S10p. If $\alpha \in E_{line\_set}$ and $\beta \in C_{line}$ then $F_{P6}(\alpha, \beta) \in E_{line\_set}$.
MAP
S11p. If $\alpha \in E_{composite\_region}$, $\beta \in E_{dot\_set}$ and $\delta \in E_{line\_set}$ then $F_{P7}(\alpha, \beta, \delta) \in E_{map}$,
where $F_{P7}(\alpha, \beta, \delta)$ is the drawing of $\alpha$, $\beta$ and $\delta$.
With the help of this grammar it is possible to draw maps like the one illustrated 
in Figure 2. Note that the basic object in this particular graphical construction is the 
region. The idea is to successively construct a map from its constituting regions (i.e.,
as in a jigsaw puzzle) until the full map is produced. Once the map is constructed, 
other kinds of objects with conventional meanings, like dots and lines, can be drawn 
upon the assembly of regions. Consider Figure 8 in which the syntactic structure of 
the map in Figure 4 is shown. Note that the decision to use regions as basic objects in 
the graphical composition is not mandatory, and alternative constructions are possible; 
for instance, we could have designated curves as basic objects and obtained regions as 
compositions made out of curves. The set of graphical symbols included in a graphical 
syntactic tree of a map will be called the base. For instance, the base of the map 
in Figure 8 is the set {d1, d2, d3, l1, r1, r2, r3, r4}. The base is just the set of graphical
objects that are taken as the atoms of the graphical composition in each particular 
interpretation task, and different graphical grammars would select different types of 
graphical objects for the base. 
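The notion of base can be sketched in Python by representing a syntactic tree of P as nested tuples and collecting its leaves. This is an illustration under stated assumptions: the helper names p6, p7, and base are hypothetical, the order in which regions, dots, and lines are combined is one possible analysis rather than the one drawn in Figure 8, and the empty set that seeds dot sets and line sets is assumed not to count as part of the base.

```python
EMPTY = "empty"  # seed of dot sets and line sets; excluded from the base (assumption)


def p6(alpha, beta):
    """Node built by the operation used in rules S7p to S10p."""
    return ("P6", alpha, beta)


def p7(alpha, beta, delta):
    """Node built by the map operation of rule S11p."""
    return ("P7", alpha, beta, delta)


def base(expression):
    """Collect the graphical constants at the leaves of a syntactic tree."""
    if isinstance(expression, str):
        return set() if expression == EMPTY else {expression}
    leaves = set()
    for part in expression[1:]:   # skip the operation label
        leaves |= base(part)
    return leaves


# One possible syntactic analysis of the map (the order is an assumption).
regions = p6("r4", p6("r3", p6("r1", "r2")))        # S7p, then S8p twice
dots = p6(p6(p6(EMPTY, "d1"), "d2"), "d3")          # S9p three times
lines = p6(EMPTY, "l1")                             # S10p
the_map = p7(regions, dots, lines)                  # S11p
```

Under these assumptions, base(the_map) is the set of the eight constants d1, d2, d3, l1, r1, r2, r3, and r4 given in the text.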
The purpose of this grammar is illustrative; we make no claims about what con- 
stitutes a map. P imposes very few constraints on graphical expressions, and many 
configurations that can be produced with these rules might not count as maps; in ad- 
dition, P is not expressive enough to characterize a large number of objects that would 
be normally interpreted as maps. Another consideration is that graphical objects can 
be used either as basic building blocks of the construction, or as objects produced by 
graphical compositions (which we call emergent objects); for instance, in the grammar 
of P, regions are basic objects but curves are produced by graphical compositions. Ad- 
ditionally, in some contexts the interpretation of the graphical expression as a whole 
8 Examples for the rules S7p to S11p are included in the construction of the map in Figure 8, as 
explained below. 
[Figure: syntactic tree of the map, built by successive applications of the operations P6 and P7 to the regions, dots, and line of the base.]
Figure 8
Construction of a map.
may be required but in others only the interpretation of some of the parts may be rel- 
evant; for instance, although curves are not a part of the syntactic tree in Figure 8 they 
can be generated and translated into G when required through rules S3p and T3p-G 
as long as the composition is made out of regions included in the base of the map. 
Had the grammar allowed the generation of composite regions out of regions of the 
base, these emergent objects could also be used for the generation of curves. Another 
consideration is that expressions of type map are in general ambiguous as they have 
several syntactic analyses, but since this feature is harmless for the current discussion 
we do not pursue the issue further. A final remark is that alternative grammars could 
be defined for characterizing the same class of drawings with different consequences 
in the syntax and the semantics. One possibility, for instance, is to define a syntactic
operation that takes two adjacent regions and produces the union of the regions as one 
single emerging region, instead of the set of the two regions as currently defined. Such 
a rule would be similar to the rule that combines two regions to produce a curve, and 
it would be useful in applications like XTRA (Wahlster 1991), in which the ambiguity 
of pointing to a part or the whole is intended to be resolved. 
2.2.2 Semantic Definition of P. The semantics of P is given in a model-theoretic
fashion as follows: Let $W = A_{city} \cup A_{line} \cup A_{border} \cup A_{country} \cup A_{zone}$ be the world. Let $D_x$
Constant $\alpha$      $F_P(\alpha)$
d1, d2, d3, ...        Paris, Saarbrücken, Frankfurt, ...
l1, l2, l3, ...        line from Paris to Frankfurt, ...
c1, c2, c3, ...        border between France and Germany, ...
r1, r2, r3, ...        France, Germany, ...
z1, z2, z3, ...        east of France, east of Germany, ...
cr1, cr2, cr3, ...     region formed by France and Germany, ...
∅, ds1, ds2, ...       sets of cities
∅, ls1, ls2, ...       set of lines
m1, m2, m3, ...        maps
Figure 9
Semantics of constants of P.
be the set of possible denotations for expressions of type x, such that $D_{dot} = A_{city}$, $D_{line} = A_{line}$,
$D_{curve} = A_{border}$, $D_{region} = A_{country}$, $D_{zone} = A_{zone}$, and, for any types a and b,
$D_{\langle a,b \rangle} = D_b^{D_a}$ (i.e., the set of all functions from $D_a$ to $D_b$). Let $F_P$ be an interpretation function
that assigns to each constant of type a a member of $D_a$. The interpretations of the
constants are presented in Figure 9.
Following Montague, we adopt the notational convention by which the semantic 
value or denotation of an expression $\alpha$ with respect to a model M is expressed as
$[\![\alpha]\!]^M$. The semantic rules for interpreting language P are the following:
CONSTANT
M1p. If $\alpha \in C_s$ then $[\![\alpha]\!]^M = F_P(\alpha)$.
LINE
M2p. If $\alpha, \beta \in E_{dot}$ then $[\![F_{P1}(\alpha, \beta)]\!]^M$ is a line from $[\![\alpha]\!]^M$ to $[\![\beta]\!]^M$.
CURVE
M3p. If $\alpha, \beta \in E_{region}$ such that $\alpha$ and $\beta$ are adjacent then $[\![F_{P2}(\alpha, \beta)]\!]^M$ is the
border between $[\![\alpha]\!]^M$ and $[\![\beta]\!]^M$.
INTERSECTION
M4p. If $\alpha \in E_{curve}$ and $\beta \in E_{line}$ then $[\![F_{P3}(\alpha, \beta)]\!]^M$ is the intersection between
$[\![\alpha]\!]^M$ and $[\![\beta]\!]^M$.
RIGHT
M5p. If $\alpha \in E_{region}$ then $[\![F_{P4}(\alpha)]\!]^M$ is the east of $[\![\alpha]\!]^M$.
DOT INSIDE A REGION
M6p. If $\alpha \in E_{region}$ then $[\![F_{P5}(\alpha)]\!]^M$ is a city of $[\![\alpha]\!]^M$.
COMPOSITE REGION (1)
M7p. If $\alpha, \beta \in C_{region}$ such that $\alpha$ and $\beta$ are adjacent then $[\![F_{P6}(\alpha, \beta)]\!]^M$ is the
union of $\{[\![\alpha]\!]^M\}$ and $\{[\![\beta]\!]^M\}$.
COMPOSITE REGION (2)
M8p. If $\alpha \in C_{region}$ and $\beta \in E_{composite\_region}$ such that $\alpha$ and $\beta$ are adjacent then
$[\![F_{P6}(\alpha, \beta)]\!]^M$ is the union of the sets $\{[\![\alpha]\!]^M\}$ and $[\![\beta]\!]^M$.
SET OF DOTS
M9p. If $\alpha \in E_{dot\_set}$ and $\beta \in C_{dot}$ then $[\![F_{P6}(\alpha, \beta)]\!]^M$ is the union of the sets
$[\![\alpha]\!]^M$ and $\{[\![\beta]\!]^M\}$.
SET OF LINES
M10p. If $\alpha \in E_{line\_set}$ and $\beta \in C_{line}$ then $[\![F_{P6}(\alpha, \beta)]\!]^M$ is the union of the sets
$[\![\alpha]\!]^M$ and $\{[\![\beta]\!]^M\}$.
MAP
M11p. If $\alpha \in E_{composite\_region}$, $\beta \in E_{dot\_set}$ and $\delta \in E_{line\_set}$ then $[\![F_{P7}(\alpha, \beta, \delta)]\!]^M$ is the
union of the sets $[\![\alpha]\!]^M$, $[\![\beta]\!]^M$ and $[\![\delta]\!]^M$.
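The first four semantic rules can be sketched in Python as a recursive interpreter over expressions of P. This is an illustration under stated assumptions: expressions are written as nested tuples labeled with hypothetical operation names "P1" to "P3", objects of the world are represented simply by English labels, and the table F_P lists only a few of the constants of Figure 9.

```python
# Partial interpretation function for constants of P, following Figure 9.
# World objects are represented by string labels (an assumption of the sketch).
F_P = {
    "d1": "Paris", "d2": "Saarbrücken", "d3": "Frankfurt",
    "r1": "France", "r2": "Germany",
}


def denote(expression):
    """Return the denotation of an expression of P under rules M1p to M4p."""
    if isinstance(expression, str):                       # M1p: constants
        return F_P[expression]
    operation, *arguments = expression
    values = [denote(argument) for argument in arguments]
    if operation == "P1":                                 # M2p: line
        return f"line from {values[0]} to {values[1]}"
    if operation == "P2":                                 # M3p: curve
        return f"border between {values[0]} and {values[1]}"
    if operation == "P3":                                 # M4p: intersection
        return f"intersection between {values[0]} and {values[1]}"
    raise ValueError(f"no semantic rule for {operation}")


# The dot at the intersection of the curve between r1 and r2
# and the line from d1 to d3.
dot_expression = ("P3", ("P2", "r1", "r2"), ("P1", "d1", "d3"))
```

Here denote(dot_expression) labels the intersection between the border between France and Germany and the line from Paris to Frankfurt, which is the place where the text of the message locates Saarbrücken.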
2.3 Definition of Language G 
In this section the syntax and semantics of the graphical language G are formally
stated. G is defined along the lines of intensional logic, and it is expressive enough 
to refer to graphical symbols and configurations, on the one hand, and to express the 
translation of quantified expressions of L, on the other. 
2.3.1 Syntactic Definition of G. The types of the language G are as follows: 9
1. e is a type (graphical objects).
2. t is a type (truth values).
3. If a and b are any types, then $\langle a, b \rangle$ is a type. 10
4. Nothing else is a type.
Let Vs be the set of variables of type s, Cs the set of constants of type s, and 
Es the set of well-formed expressions of graphical type s. The constants of G are 
presented in Figure 10. Note that constants like right, curve_between, etc. have an 
9 A simplifying assumption rests on the consideration that the interpretations of all expressions included
in these languages depend only on the current graphical state and no intensional types are included in
the system. However, this analysis can be extended along the lines of intensional logic to be able to
deal with a more comprehensive fragment of English.
10 An expression of type $\langle a, b \rangle$ combines with an expression of type a to give an expression of type b.
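Constant                                                  Type
d1, d2, d3, r1, r2, r3, r4, l1                            $e$
dot, region, curve, line, intersection                    $\langle e,t \rangle$
right                                                     $\langle\langle\langle e,t \rangle,t \rangle,\langle e,t \rangle\rangle$
lie_at, be_in_zone, inside                                $\langle\langle\langle e,t \rangle,t \rangle,\langle e,t \rangle\rangle$
=                                                         $\langle e,\langle e,t \rangle\rangle$
$\wedge$, $\vee$, $\leftrightarrow$                       $\langle t,\langle t,t \rangle\rangle$
curve_between, intersection_between, line_from_to         $\langle\langle\langle e,t \rangle,t \rangle,\langle\langle\langle e,t \rangle,t \rangle,\langle e,t \rangle\rangle\rangle$
right*                                                    $\langle e,e \rangle$
lie_at*, be_in_zone*, inside*                             $\langle e,\langle e,t \rangle\rangle$
curve_between*, intersection_between*, line_from_to*      $\langle e,\langle e,e \rangle\rangle$
Figure 10
Constants of language G.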
associated right*, curve_between*, etc. The unsubscripted version of these constants
denotes a relation between sets of properties of graphical individuals and the
subscripted version denotes the corresponding geometrical relation between individuals;
the type-raised version is used for preserving quantification properties in the
translation process from L into G, while the subscripted version is used for computing
the geometry associated with the corresponding relation, as will be shown below in
Section 2.3.2.
G is a formal language with constants and variables for all types, functional
abstraction and application, and existential and universal quantification. The syntactic
rules of G are as follows:
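1. If $\alpha \in C_s$, then $\alpha \in E_s$.
2. If $\mu \in V_s$, then $\mu \in E_s$.
3. If $\alpha \in E_{\langle a,b\rangle}$ and $\beta \in E_a$, then $\alpha(\beta) \in E_b$.
4. If $\alpha \in E_a$ and $u \in V_b$, then $\lambda u[\alpha] \in E_{\langle b,a\rangle}$.
5. If $\mu \in V_s$ and $\beta \in E_t$ then $\exists\mu(\beta) \in E_t$.
6. If $\mu \in V_s$ and $\beta \in E_t$ then $\forall\mu(\beta) \in E_t$.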
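The type definition and the six formation rules can be read as a small type checker. The following Python sketch is only an illustration: the tuple encoding of types and the names `Const`, `Var`, `App`, `Lam`, `Quant`, and `type_of` are assumptions of this example, not part of the definition of G.

```python
from dataclasses import dataclass

# A type is "e", "t", or a pair (a, b) standing for the functional type <a, b>.

@dataclass(frozen=True)
class Const:
    name: str
    type: object

@dataclass(frozen=True)
class Var:
    name: str
    type: object

@dataclass(frozen=True)
class App:
    fn: object
    arg: object

@dataclass(frozen=True)
class Lam:
    var: Var
    body: object

@dataclass(frozen=True)
class Quant:
    kind: str  # "exists" or "forall"
    var: Var
    body: object

def type_of(expr):
    """Return the type of expr, or raise TypeError if it is not well formed."""
    if isinstance(expr, (Const, Var)):        # rules 1 and 2
        return expr.type
    if isinstance(expr, App):                 # rule 3
        fn_type, arg_type = type_of(expr.fn), type_of(expr.arg)
        if not (isinstance(fn_type, tuple) and fn_type[0] == arg_type):
            raise TypeError(f"cannot apply {fn_type} to {arg_type}")
        return fn_type[1]
    if isinstance(expr, Lam):                 # rule 4
        return (expr.var.type, type_of(expr.body))
    if isinstance(expr, Quant):               # rules 5 and 6
        if type_of(expr.body) != "t":
            raise TypeError("quantifier body must have type t")
        return "t"
    raise TypeError(f"unknown expression: {expr!r}")

ET = ("e", "t")
x = Var("x", "e")
P = Var("P", ET)
dot = Const("dot", ET)
d1 = Const("d1", "e")
D1 = Lam(P, App(P, d1))  # the type-raised term lambda P[P(d1)]
```

Under this encoding, `type_of(D1)` yields the pair encoding of $\langle\langle e,t\rangle,t\rangle$, and applying `d1` to `dot` (rather than the reverse) is rejected as ill formed.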
G is a very expressive language and not every well-formed expression has a
translation into L, as will be further discussed in Section 2.5. Useful translations are, for
instance, names and descriptions of geometrical objects and configurations. Next, the
definition of expressions of G that have a translation into L is presented. For clarity,
the abbreviations in Figure 11 are used.
Two geometrical interpretations are given for the spatial prepositions of and
between. Although the characterization of the meaning of these words is a very complex
problem that is beyond the scope of this paper, we allow that spatial prepositions can
be interpreted in more than one way, as long as each interpretation is stated in terms
of a geometrical algorithm explicitly defined in G. For instance, the spatial meaning
of of is different in city of France and east of France. In the former, of denotes a spatial
inclusion relation (OFa), but in the latter it denotes a relation of adjacency (OFb).
Similarly, the spatial meaning of between in border between France and Germany and its first
occurrence in intersection between the border between France and Germany and a line from
Paris to Frankfurt is different, as it denotes a curve in the first case (BETWEENa) and a
dot in the second (BETWEENb).
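Abbreviation   Formal expression   Type
A          $\lambda P\lambda Q\exists x[P(x) \wedge Q(x)]$   $\langle\langle e,t\rangle,\langle\langle e,t\rangle,t\rangle\rangle$
THE        $\lambda P\lambda Q\exists y[\forall x[P(x) \leftrightarrow x = y] \wedge Q(y)]$   $\langle\langle e,t\rangle,\langle\langle e,t\rangle,t\rangle\rangle$
BEa        $\lambda P\lambda x P(\lambda y[x = y])$   $\langle\langle\langle e,t\rangle,t\rangle,\langle e,t\rangle\rangle$
BEb        $\lambda P\lambda x P(x)$   $\langle\langle e,t\rangle,\langle e,t\rangle\rangle$
Di         $\lambda P[P(d_i)]$   $\langle\langle e,t\rangle,t\rangle$
Ri         $\lambda P[P(r_i)]$   $\langle\langle e,t\rangle,t\rangle$
OFa        $\lambda x_{\langle\langle e,t\rangle,t\rangle}\lambda y_{\langle e,t\rangle}\lambda z_e[y(z) \wedge \mathit{inside}(x)(z)]$   $\langle\langle\langle e,t\rangle,t\rangle,\langle\langle e,t\rangle,\langle e,t\rangle\rangle\rangle$
OFb        $\lambda x_{\langle\langle e,t\rangle,t\rangle}\lambda y_{\langle\langle\langle e,t\rangle,t\rangle,e\rangle}\lambda z_e[y(x)(z)]$   $\langle\langle\langle e,t\rangle,t\rangle,\langle\langle\langle\langle e,t\rangle,t\rangle,e\rangle,\langle e,t\rangle\rangle\rangle$
BETWEENa   $\lambda x_{\langle\langle e,t\rangle,t\rangle}\lambda y_{\langle\langle e,t\rangle,t\rangle}\lambda z_{\langle e,t\rangle}\lambda u_e[z(u) \wedge \mathit{curve\_between}(x)(y)(u)]$   $\langle\langle\langle e,t\rangle,t\rangle,\langle\langle\langle e,t\rangle,t\rangle,\langle\langle e,t\rangle,\langle e,t\rangle\rangle\rangle\rangle$
BETWEENb   $\lambda x_{\langle\langle e,t\rangle,t\rangle}\lambda y_{\langle\langle e,t\rangle,t\rangle}\lambda z_{\langle e,t\rangle}\lambda u_e[z(u) \wedge \mathit{intersection\_between}(x)(y)(u)]$   $\langle\langle\langle e,t\rangle,t\rangle,\langle\langle\langle e,t\rangle,t\rangle,\langle\langle e,t\rangle,\langle e,t\rangle\rangle\rangle\rangle$
FROM_TO    $\lambda x_{\langle\langle e,t\rangle,t\rangle}\lambda y_{\langle\langle e,t\rangle,t\rangle}\lambda z_{\langle e,t\rangle}\lambda u_e[z(u) \wedge \mathit{line\_from\_to}(x)(y)(u)]$   $\langle\langle\langle e,t\rangle,t\rangle,\langle\langle\langle e,t\rangle,t\rangle,\langle\langle e,t\rangle,\langle e,t\rangle\rangle\rangle\rangle$
Figure 11
Shorthand definitions.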
The restrictions for the expressions of G that can be translated into L are given 
below. In rules S6G to S8G, Q stands for either the quantifier A or THE. 
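SENTENCES
S1G. If $\alpha \in E_{\langle\langle e,t\rangle,t\rangle}$ and $\beta \in E_{\langle e,t\rangle}$, then $F_{G1}(\alpha, \beta) \in E_t$, where $F_{G1}(\alpha, \beta) = \alpha(\beta)$.
Examples: - D1 (BEa (A (OFa(R1) (dot))))
- R2 (be_in_zone (THE (OFb(R1) (right))))
- A (region) (BEb(big))
- D2 (lie_at (THE (BETWEENb (THE (BETWEENa(R1) (R2) (curve)))
(A (FROM_TO(D1) (D3) (line)))
(intersection))))
TRANSITIVE VERB PHRASES
S2G. If $\alpha \in E_{\langle\langle\langle e,t\rangle,t\rangle,\langle e,t\rangle\rangle}$ and $\beta \in E_{\langle\langle e,t\rangle,t\rangle}$ then $F_{G1}(\alpha, \beta) \in E_{\langle e,t\rangle}$.
Examples: - BEa (A (dot))
- be_in_zone (THE (OFb(R1)(right)))
ATTRIBUTIVE VERB PHRASES
S3G. If $\alpha \in E_{\langle\langle e,t\rangle,\langle e,t\rangle\rangle}$ and $\beta \in E_{\langle e,t\rangle}$ then $F_{G1}(\alpha, \beta) \in E_{\langle e,t\rangle}$.
Example: - BEb (big)
TERMS
S4G. If $\alpha \in E_{\langle\langle e,t\rangle,\langle\langle e,t\rangle,t\rangle\rangle}$ and $\beta \in E_{\langle e,t\rangle}$, then $F_{G1}(\alpha, \beta) \in E_{\langle\langle e,t\rangle,t\rangle}$.
Examples: - A (dot)
- A (OFa(R1)(dot))
- THE (BETWEENa(R1) (R2) (curve))
- A (FROM_TO(D1) (D3) (line))
- THE (OFb(R1)(right))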
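COMMON NOUNS
S5G. If $\alpha \in E_{\langle\langle e,t\rangle,\langle e,t\rangle\rangle}$ and $\beta \in E_{\langle e,t\rangle}$, or $\alpha \in E_{\langle\langle\langle\langle e,t\rangle,t\rangle,e\rangle,\langle e,t\rangle\rangle}$ and
$\beta \in E_{\langle\langle\langle e,t\rangle,t\rangle,e\rangle}$, then $F_{G1}(\alpha, \beta) \in E_{\langle e,t\rangle}$.
Examples: - OFa(R1)(dot)
- OFb(R1)(right)
- BETWEENa(R1) (R2) (curve)
- BETWEENb (THE(BETWEENa(R1)(R2)(curve))) (A(FROM_TO(D1)(D3)
(line)))(intersection)
of PREPOSITIONAL PHRASES
S6G. If $\alpha \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha$ is either Ri or Q(region), then $F_{G2}(\alpha) \in E_{\langle\langle e,t\rangle,\langle e,t\rangle\rangle}$
and $F_{G3}(\alpha) \in E_{\langle\langle\langle\langle e,t\rangle,t\rangle,\langle e,t\rangle\rangle,\langle e,t\rangle\rangle}$,
where $F_{G2}(\alpha)$ = OFa($\alpha$) and $F_{G3}(\alpha)$ = OFb($\alpha$).
Examples: - OFa(R1)
- OFb(R2)
- OFa(A(region))
between PREPOSITIONAL PHRASES
S7G. (a) If $\alpha, \beta \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha, \beta$ are either Ri or Q(region), then
$F_{G4}(\alpha, \beta) \in E_{\langle\langle e,t\rangle,\langle e,t\rangle\rangle}$, where $F_{G4}(\alpha, \beta)$ = BETWEENa($\alpha$)($\beta$).
(b) If $\alpha, \beta \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha$ is either ci or Q(curve) and $\beta$ is either
Li or Q(line), then $F_{G5}(\alpha, \beta) \in E_{\langle\langle e,t\rangle,\langle e,t\rangle\rangle}$,
where $F_{G5}(\alpha, \beta)$ = BETWEENb($\alpha$)($\beta$).
Examples: - BETWEENa(R1) (R2)
- BETWEENa(R1) (A(region))
- BETWEENb (THE (BETWEENa(R1) (R2) (curve))) (A (FROM_TO(D1) (D3)
(line)))
from-to PREPOSITIONAL PHRASES
S8G. If $\alpha, \beta \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha, \beta$ are either Di or Q(dot), then
$F_{G5}(\alpha, \beta) \in E_{\langle\langle e,t\rangle,\langle e,t\rangle\rangle}$,
where $F_{G5}(\alpha, \beta)$ = FROM_TO($\alpha$)($\beta$).
Example: - FROM_TO(D1) (D3)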
2.3.2 Semantic Definition of G. The interpretation of expressions of G is defined
in relation not to the world W but to a domain constituted by the graphical objects
in P. For this reason, we refer to the interpreter of G as a geometrical interpreter,
and to the process of interpreting expressions of G as a geometrical interpretation
process. The semantics of G is given in a model-theoretic fashion as follows: Let
$P_{base}$ = {d1, d2, d3, r1, r2, r3, r4, l1} be the set of basic graphical objects shown in Figure 4.
Let P be the union of $P_{base}$ and all graphical objects that can be produced from $P_{base}$
with the help of geometrical functions: the emergent objects. Emergent objects can also
be produced on the basis of other emergent objects previously generated. A particular
kind of emergent object that is interesting for the current discussion is the zone of 
a map that is considered to be the east of a region. For the production of emergent 
objects in P there is a well-defined computational geometry algorithm associated with 
an operator symbol of G, as will be seen below. 
Let $D_x$ be the set of possible denotations for expressions of type x, such that $D_e = P$,
$D_t = \{1, 0\}$, and, for any types a and b, $D_{\langle a,b\rangle} = D_b^{D_a}$. Let $F_G$ be an interpretation function
that assigns to each constant of type a a member of $D_a$. For every graphical object $o$ in
$P_{base}$ there is a constant $\alpha$ of type e such that $F_G(\alpha) = o$; for our example, $F_G$ assigns the
objects d1, d2, d3, r1, r2, r3, r4, and l1 to the constants d1, d2, d3, r1, r2, r3, r4, and l1,
respectively. The interpretations (assigned by $F_G$) of the geometrical-type predicates dot,
region, curve, line, and intersection are the sets containing the corresponding graphical objects.
The constants right*, lie_at*, be_in_zone*, inside*, curve_between*, intersection_between*,
and line_from_to* are interpreted as geometrical functions. If the arguments of these
geometrical functions are of an appropriate type, expressions containing these
constants can be properly interpreted through geometrical algorithms; otherwise, these
expressions have no denotation in G and, as a consequence, their translations into L
also lack denotation. For further discussion of the interpretation of graphical
expressions that have no proper graphical referent in the interpretation state, see Pineda
(1992).
Following Montague, the interpretation of variables is defined in terms of an
assignment function g. We adopt the notational convention by which the semantic value
or denotation of an expression $\alpha$ with respect to a model M and a value assignment
g is expressed as $[\![\alpha]\!]^{M,g}$.
The semantic rules for interpreting expressions of G are the following:
1. If $\alpha \in C_s$, then $[\![\alpha]\!]^{M} = F_G(\alpha)$.
2. If $\mu \in V_s$, then $[\![\mu]\!]^{M,g} = g(\mu)$.
3. If $\alpha \in E_{\langle a,b\rangle}$ and $\beta \in E_a$, then $[\![\alpha(\beta)]\!]^{M,g} = [\![\alpha]\!]^{M,g}([\![\beta]\!]^{M,g})$.
4. If $\alpha \in E_a$ and $u \in V_b$, then $[\![\lambda u[\alpha]]\!]^{M,g}$ is that function h from $D_b$ into $D_a$
such that for all objects k in $D_b$, h(k) is equal to $[\![\alpha]\!]^{M,g'}$, where $g'$ is
exactly like g except that $g'(u) = k$.
5. If $\mu \in V_s$ and $\beta \in E_t$ then $[\![\exists\mu(\beta)]\!]^{M,g} = 1$ iff for some value assignment $g'$
such that $g'$ is exactly like g except possibly for the individual assigned
to $\mu$ by $g'$, $[\![\beta]\!]^{M,g'} = 1$.
6. If $\mu \in V_s$ and $\beta \in E_t$ then $[\![\forall\mu(\beta)]\!]^{M,g} = 1$ iff for every value assignment
$g'$ such that $g'$ is exactly like g except possibly for the individual
assigned to $\mu$ by $g'$, $[\![\beta]\!]^{M,g'} = 1$.
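The six semantic rules translate almost directly into a recursive evaluator over a finite domain. The sketch below is a minimal illustration under stated assumptions: expressions are encoded as nested tuples, truth values are Python booleans rather than 1 and 0, quantified variables range over individuals only, and the toy interpretation `F` (including the curried connective `"and"`) is hypothetical data rather than the map of Figure 4.

```python
# Expressions are nested tuples:
#   ("const", name) | ("var", name) | ("app", fn, arg)
#   ("lam", name, body) | ("exists", name, body) | ("forall", name, body)

def interpret(expr, F, g, domain):
    """Denotation of expr under interpretation F and assignment g."""
    tag = expr[0]
    if tag == "const":                        # rule 1
        return F[expr[1]]
    if tag == "var":                          # rule 2
        return g[expr[1]]
    if tag == "app":                          # rule 3
        fn = interpret(expr[1], F, g, domain)
        return fn(interpret(expr[2], F, g, domain))
    if tag == "lam":                          # rule 4
        _, name, body = expr
        return lambda k: interpret(body, F, {**g, name: k}, domain)
    if tag in ("exists", "forall"):           # rules 5 and 6
        _, name, body = expr
        values = (interpret(body, F, {**g, name: k}, domain) for k in domain)
        return any(values) if tag == "exists" else all(values)
    raise ValueError(f"unknown expression tag: {tag}")

# Hypothetical toy model: two dots, two regions, one of them big.
domain = ["d1", "d2", "r1", "r2"]
F = {
    "region": lambda o: o in {"r1", "r2"},
    "big": lambda o: o in {"r1"},
    "and": lambda p: lambda q: p and q,
}
x = ("var", "x")
body = ("app",
        ("app", ("const", "and"), ("app", ("const", "region"), x)),
        ("app", ("const", "big"), x))
some_region_is_big = ("exists", "x", body)
every_object_is_big_region = ("forall", "x", body)
```

In this toy model the existential sentence holds and the universal one does not, since `d1` is not a region.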
In order to capture the translation of expressions of L into G compositionally,
while preserving the quantificational properties of the original source natural
language expression, terms in G referring to graphical objects are type-raised;
consequently, graphical predicates like be_in_zone, curve_between, and inside have type-raised
arguments. The expression $\mathit{curve\_between}(\lambda P[P(r_1)])(\lambda P[P(r_2)])(x)$, for instance, refers
to the curve x between regions r1 and r2; the first two arguments refer not to the
regions themselves, but to the set of properties that such regions have. Similarly, the
expression $\mathit{inside}(\lambda P\exists y[\mathit{region}(y) \wedge P(y)])(z)$ denotes that the dot z is inside a region
y, but the first argument denotes the set of properties P that the region has, rather
than denoting y directly. However, whenever the full interpretation of these
expressions in relation to a finite domain of graphical objects is required, they must be
transformed into equivalent first-order expressions. This transformation is achieved
through meaning postulates. The results of these transformations for the examples
above are $\mathit{curve\_between}_*(r_1, r_2) = x$ and $\exists y[\mathit{region}(y) \wedge \mathit{inside}_*(z, y)]$, where curve_between*
and inside* denote geometrical functions whose arguments are graphical entities. The
meaning postulates are defined as follows:
MP1. $\forall x\forall P[\delta(P)(x) \leftrightarrow P(\lambda y[\delta_*(x, y)])]$ where $\delta \in$ {lie_at, be_in_zone, inside}
MP2. $\forall x\forall P[\delta(P)(x) \leftrightarrow P(\lambda y[\delta_*(y) = x])]$ where $\delta \in$ {right}
MP3. $\forall x\forall P_1\forall P_2[\delta(P_1)(P_2)(x) \leftrightarrow P_2(\lambda u[P_1(\lambda v[\delta_*(v, u) = x])])]$ where $\delta \in$ {curve_between,
intersection_between, line_from_to}
where $P$, $P_1$, and $P_2$ are variables ranging over sets of properties (i.e., of type $\langle\langle e,t\rangle,t\rangle$),
and x, y, u, and v are variables ranging over individuals. Meaning postulate MP1
establishes, for instance, that a geometrical relation that holds between a set of
properties of an individual a and an individual b stands in one-to-one correspondence
with the relation that holds between the individuals a and b themselves, since the
only property of a that is relevant for the geometrical interpretation process is the
property of being in such a geometrical relation with the object b (i.e., that the object a
lies at, is in a zone of, or is inside the object b). Similarly for meaning postulates MP2
and MP3.
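The effect of MP1 can be illustrated with closures: a type-raised predicate is built from its first-order counterpart, so that it accepts both a type-raised name and a quantified term as its first argument. This is a sketch under assumptions; the helper names `raise_individual` and `lift_relation`, and the toy `INSIDE` relation, are hypothetical and serve only to mirror the postulate.

```python
def raise_individual(a):
    """Type-raise an individual a to the set of its properties, lambda P[P(a)]."""
    return lambda P: P(a)

def lift_relation(rel_star):
    """Build the type-raised predicate from the first-order relation,
    following MP1: delta(P)(x) <-> P(lambda y[delta*(x, y)])."""
    return lambda P: lambda x: P(lambda y: rel_star(x, y))

# Hypothetical toy geometry: dots d1 and d2 lie inside region r1 only.
INSIDE = {("d1", "r1"), ("d2", "r1")}

def inside_star(x, y):
    """First-order relation inside*(x, y): object x is inside object y."""
    return (x, y) in INSIDE

inside = lift_relation(inside_star)

regions = ["r1", "r2"]

def a_region(P):
    """The quantified term lambda P exists y[region(y) and P(y)]."""
    return any(P(y) for y in regions)
```

Here `inside(raise_individual("r1"))("d1")` reduces to `inside_star("d1", "r1")`, while `inside(a_region)("d1")` reduces to an existential check over the regions, which is how the quantificational force of the argument is preserved.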
The five examples that follow illustrate how the graphical interpreter works. 
Example 1 
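Consider the interpretation of the expression A(region) (BEb (big)), which is the
translation of a country is big. The expression can be reduced as follows:
1. $\lambda P\lambda Q\exists x[P(x) \wedge Q(x)](\mathit{region})(\lambda P\lambda z P(z)(\mathit{big}))$
2. $\lambda P\lambda Q\exists x[P(x) \wedge Q(x)](\mathit{region})(\lambda z\,\mathit{big}(z))$
3. $\lambda Q\exists x[\mathit{region}(x) \wedge Q(x)](\lambda z\,\mathit{big}(z))$
4. $\exists x[\mathit{region}(x) \wedge \lambda z\,\mathit{big}(z)(x)]$
5. $\exists x[\mathit{region}(x) \wedge \mathit{big}(x)]$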
Expression (5) is interpreted through the standard quantification rules of the geomet- 
rical interpreter without the help of meaning postulates. The interpretation of big is 
an algorithm that computes the average area of all regions in the map and returns the 
set of all regions whose area is larger than the average. This is a simple convention 
for illustrative purposes and alternative conventions could be chosen. Although the 
purpose of this paper is not to explore issues related to the interpretation of vague 
terms, it is interesting to note that within the present framework specific algorithms 
related to specific application domains that take into account the graphical context 
could be defined for the construction of practical applications. 
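The convention for big described above can be sketched in a few lines. The function name `big` follows the constant of G, but the representation of the map as a dictionary from region names to areas, and the area values themselves, are assumptions made for illustration.

```python
def big(areas):
    """Return the set of regions whose area is larger than the average area.

    areas maps each region name to its area; it must be non-empty.
    """
    average = sum(areas.values()) / len(areas)
    return {name for name, area in areas.items() if area > average}

# Hypothetical areas for the four regions of the example map.
areas = {"r1": 55.0, "r2": 36.0, "r3": 4.0, "r4": 1.0}

# Expression (5), exists x[region(x) and big(x)], holds iff the set is non-empty.
a_country_is_big = len(big(areas)) > 0
```

With these values the average area is 24, so `big(areas)` contains r1 and r2 and the sentence comes out true.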
Example 2 
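Consider the interpretation of THE (BETWEENa (R1) (R2) (curve)), which is the
translation of the border between France and Germany, as will be shown in Section 2.4.1. The
expression without the abbreviations is:
1. $\lambda P\lambda Q\exists y[\forall x[P(x) \leftrightarrow x = y] \wedge Q(y)]$
$(\lambda x_{\langle\langle e,t\rangle,t\rangle}\lambda y_{\langle\langle e,t\rangle,t\rangle}\lambda z_{\langle e,t\rangle}\lambda u_e[z(u) \wedge \mathit{curve\_between}(x)(y)(u)]$
$(\lambda P[P(r_1)])(\lambda P[P(r_2)])(\mathit{curve}))$
which can be reduced as follows:
2. $\lambda P\lambda Q\exists y[\forall x[P(x) \leftrightarrow x = y] \wedge Q(y)]$
$(\lambda u[\mathit{curve}(u) \wedge \mathit{curve\_between}(\lambda P[P(r_1)])(\lambda P[P(r_2)])(u)])$
3. $\lambda Q\exists y[\forall x[\lambda u[\mathit{curve}(u) \wedge \mathit{curve\_between}(\lambda P[P(r_1)])(\lambda P[P(r_2)])(u)](x) \leftrightarrow x = y] \wedge Q(y)]$
4. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \mathit{curve\_between}(\lambda P[P(r_1)])(\lambda P[P(r_2)])(x)) \leftrightarrow x = y] \wedge Q(y)]$
5. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \mathit{curve\_between}(\lambda P[P(r_1)])(\lambda P[P(r_2)])(x)) \leftrightarrow x = y] \wedge Q(y)]$
6. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \lambda P[P(r_2)](\lambda u[\lambda P[P(r_1)](\lambda v[\mathit{curve\_between}_*(v, u) = x])])) \leftrightarrow x = y] \wedge Q(y)]$
7. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \lambda P[P(r_2)](\lambda u[\lambda v[\mathit{curve\_between}_*(v, u) = x](r_1)])) \leftrightarrow x = y] \wedge Q(y)]$
8. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \lambda P[P(r_2)](\lambda u[\mathit{curve\_between}_*(r_1, u) = x])) \leftrightarrow x = y] \wedge Q(y)]$
9. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \lambda u[\mathit{curve\_between}_*(r_1, u) = x](r_2)) \leftrightarrow x = y] \wedge Q(y)]$
10. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \mathit{curve\_between}_*(r_1, r_2) = x) \leftrightarrow x = y] \wedge Q(y)]$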
Note that Expression (5) cannot be further reduced unless the types of the
arguments of the predicate curve_between are lowered with the help of meaning
postulate MP3. The geometrical functions in Expression (10) can be evaluated directly.
Expression (10) is a denoting concept that refers to the curve between the regions
r1 and r2 and cannot be further reduced. Consider that the expression the border
between France and Germany is a definite description and, in order to obtain a truth
value, must be combined with a predicate. The graphical object referred to by (10),
on the other hand, could be identified regardless of the nature of the predicate Q,
as this predicate is not used for picking out the object referred to by the definite
description.$^{11}$ We call the object referred to by the denoting concept its concrete
extension. The concrete extension of (10) can be identified, for instance, by interpreting the
denoting concept without using the predicative abstraction Q (i.e., $\exists y[\forall x[(\mathit{curve}(x) \wedge
\mathit{curve\_between}_*(r_1, r_2) = x) \leftrightarrow x = y]]$) in relation to the graphical domain; if the
denoting concept is indefinite, we take any object satisfying the expression as its concrete
extension.
11 As argued by Kaplan, contextual factors have to be considered for the identification of the referent of a
definite description used referentially rather than attributively (Kaplan 1978). If the referent is identified
deictically, as in the current example, the referent is found through the translation of the definite
description into the graphical language, where the shape of the object is available directly. Note as well
that as expressions of G have an interpretation not only in relation to the graphical domain but also in
relation to the world, through the translation into P and the semantics of P, the referent of a definite
description in L can be found by computing the geometrical interpretation of its translation into G.
Example 3 
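Consider the interpretation of an expression similar to the one in Example 2, but
in which an indefinite is included. The expression is THE (BETWEENa (R1) (A(region))
(curve)), which is the translation of the border between France and a country. The full
expression is:
1. $\lambda P\lambda Q\exists y[\forall x[P(x) \leftrightarrow x = y] \wedge Q(y)]$
$(\lambda x_{\langle\langle e,t\rangle,t\rangle}\lambda y_{\langle\langle e,t\rangle,t\rangle}\lambda z_{\langle e,t\rangle}\lambda u_e[z(u) \wedge \mathit{curve\_between}(x)(y)(u)]$
$(\lambda P[P(r_1)])(\lambda P\exists z[\mathit{region}(z) \wedge P(z)])$
$(\mathit{curve}))$
The reduction is as follows:
2. $\lambda P\lambda Q\exists y[\forall x[P(x) \leftrightarrow x = y] \wedge Q(y)]$
$(\lambda u[\mathit{curve}(u) \wedge \mathit{curve\_between}(\lambda P[P(r_1)])(\lambda P\exists z[\mathit{region}(z) \wedge P(z)])(u)])$
3. $\lambda Q\exists y[\forall x[\lambda u[\mathit{curve}(u) \wedge \mathit{curve\_between}(\lambda P[P(r_1)])(\lambda P\exists z[\mathit{region}(z) \wedge P(z)])(u)](x) \leftrightarrow x = y] \wedge Q(y)]$
4. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \mathit{curve\_between}(\lambda P[P(r_1)])(\lambda P\exists z[\mathit{region}(z) \wedge P(z)])(x)) \leftrightarrow x = y] \wedge Q(y)]$
5. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \lambda P\exists z[\mathit{region}(z) \wedge P(z)](\lambda u[\lambda P[P(r_1)](\lambda v[\mathit{curve\_between}_*(v, u) = x])])) \leftrightarrow x = y] \wedge Q(y)]$
6. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \lambda P\exists z[\mathit{region}(z) \wedge P(z)](\lambda u[\lambda v[\mathit{curve\_between}_*(v, u) = x](r_1)])) \leftrightarrow x = y] \wedge Q(y)]$
7. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \lambda P\exists z[\mathit{region}(z) \wedge P(z)](\lambda u[\mathit{curve\_between}_*(r_1, u) = x])) \leftrightarrow x = y] \wedge Q(y)]$
8. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \exists z[\mathit{region}(z) \wedge \lambda u[\mathit{curve\_between}_*(r_1, u) = x](z)]) \leftrightarrow x = y] \wedge Q(y)]$
9. $\lambda Q\exists y[\forall x[(\mathit{curve}(x) \wedge \exists z[\mathit{region}(z) \wedge \mathit{curve\_between}_*(r_1, z) = x]) \leftrightarrow x = y] \wedge Q(y)]$
Meaning postulate MP3 is used for reducing from (4) to (5). Expression (9) is a denoting
concept similar to the final expression in Example 2, but one which has an embedded
quantified expression. Meaning postulates MP1 to MP3 are defined in such a way that
terms preserve quantificational properties through the reduction process.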
Example 4 
Consider the expression R2(be_in_zone(THE(OFb(R1)(right)))), which is the translation of
Germany is to the east of France. The reduced expression is the following:
1. $\mathit{be\_in\_zone}(\lambda Q\exists y[\forall x[\mathit{right}(\lambda P[P(r_1)])(x) \leftrightarrow x = y] \wedge Q(y)])(r_2)$
by meaning postulate MP2:
2. $\mathit{be\_in\_zone}(\lambda Q\exists y[\forall x[\lambda P[P(r_1)](\lambda z[\mathit{right}_*(z) = x]) \leftrightarrow x = y] \wedge Q(y)])(r_2)$
3. $\mathit{be\_in\_zone}(\lambda Q\exists y[\forall x[\lambda z[\mathit{right}_*(z) = x](r_1) \leftrightarrow x = y] \wedge Q(y)])(r_2)$
4. $\mathit{be\_in\_zone}(\lambda Q\exists y[\forall x[\mathit{right}_*(r_1) = x \leftrightarrow x = y] \wedge Q(y)])(r_2)$
by meaning postulate MP1:
5. $\lambda Q\exists y[\forall x[\mathit{right}_*(r_1) = x \leftrightarrow x = y] \wedge Q(y)](\lambda z[\mathit{be\_in\_zone}_*(r_2, z)])$
6. $\exists y[\forall x[\mathit{right}_*(r_1) = x \leftrightarrow x = y] \wedge \lambda z[\mathit{be\_in\_zone}_*(r_2, z)](y)]$
7. $\exists y[\forall x[\mathit{right}_*(r_1) = x \leftrightarrow x = y] \wedge \mathit{be\_in\_zone}_*(r_2, y)]$.
Expression (7) is a first-order formula that can be directly evaluated by the interpreter
of G. The operator right* is interpreted as a geometrical algorithm that computes the
centroid $(x_c, y_c)$ of a region r and returns the semiplane to the right of the centroid
of r (i.e., the set of all ordered pairs of reals $(x_i, y_i)$ such that $x_i > x_c$). This convention
captures objects that are to the right of a region, or those in the right part of a region.$^{12}$
The graphical predicate be_in_zone* checks whether r2 is within y, i.e., the zone to the
right of r1.
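The convention for right* and be_in_zone* can be sketched as follows. Several choices here are assumptions made for illustration and are not fixed by the text: regions are simple polygons given as vertex lists, the centroid is the area centroid computed with the shoelace formula, the semiplane is represented as a predicate on points, and a region counts as being within the zone when all of its vertices are (which suffices for a half-plane, since it is convex). The function names and coordinates are hypothetical.

```python
def centroid(polygon):
    """Area centroid of a simple polygon given as a list of (x, y) vertices."""
    area2 = cx = cy = 0.0  # area2 accumulates twice the signed area
    for (x0, y0), (x1, y1) in zip(polygon, polygon[1:] + polygon[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)

def right_star(region):
    """Semiplane to the right of the centroid of region, as a point predicate."""
    xc, _ = centroid(region)
    return lambda point: point[0] > xc

def be_in_zone_star(region, zone):
    """True if every vertex of region lies within the zone."""
    return all(zone(vertex) for vertex in region)

# Hypothetical coordinates: r2 lies to the east of r1, r3 to the west.
r1 = [(0, 0), (2, 0), (2, 2), (0, 2)]
r2 = [(3, 0), (5, 0), (5, 2), (3, 2)]
r3 = [(-3, 0), (-1, 0), (-1, 2), (-3, 2)]

r2_is_east_of_r1 = be_in_zone_star(r2, right_star(r1))
r3_is_east_of_r1 = be_in_zone_star(r3, right_star(r1))
```

With these coordinates the centroid of r1 is (1, 1), so r2 falls within the zone to the right of r1 and r3 does not. A more qualitative algorithm could be substituted as long as it returns a zone of the same kind.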
Example 5 
Consider the interpretation of the translation into G of the textual part of the
multimodal message in Figure 2. The translation of Saarbrücken lies at the intersection between
the border between France and Germany and a line from Paris to Frankfurt is shown in (1), its
reduction in (2), and its final reduction applying the meaning postulates in (3):
1. D2 (lie_at (THE (BETWEENb (THE (BETWEENa (R1) (R2) (curve)))
(A (FROM_TO(D1) (D3) (line)))
(intersection))))
2. $\mathit{lie\_at}(\lambda Q\exists y[\forall x[\mathit{intersection}(x) \wedge \mathit{intersection\_between}$
$(\lambda Q\exists u[\forall v[(\mathit{curve}(v) \wedge \mathit{curve\_between}(\lambda P[P(r_1)])(\lambda P[P(r_2)])(v)) \leftrightarrow v = u] \wedge Q(u)])$
$(\lambda Q\exists z[\mathit{line}(z) \wedge \mathit{line\_from\_to}(\lambda P[P(d_1)])(\lambda P[P(d_3)])(z) \wedge Q(z)])$
$= x \leftrightarrow x = y] \wedge Q(y)])$ (D2)
3. $\exists y[\forall x[(\mathit{intersection}(x) \wedge \exists z[\mathit{line}(z) \wedge \mathit{line\_from\_to}_*(d_1, d_3) = z \wedge$
$\exists u[\forall v[(\mathit{curve}(v) \wedge \mathit{curve\_between}_*(r_1, r_2) = v) \leftrightarrow v = u] \wedge$
$\mathit{intersection\_between}_*(u, z) = x]]) \leftrightarrow x = y] \wedge \mathit{lie\_at}_*(d_2, y)]$.
Expression (3) is true if the position of dot d2 is the same as the position of the
intersection between the curve between r1 and r2 and the line from d1 to d3, as is the
case in Figure 2.
It is worth emphasizing that, as the five examples illustrate, the reason for
type-raising graphical terms is to be able to translate natural language quantified expressions
into the graphical domain compositionally in a rather elegant way. The scheme
provides a clear specification strategy; however, in a practical implementation, it would
12 This is an arbitrary convention defined for illustrative purposes and alternative conventions could be
chosen. Similar conventions could be used to interpret whether other kinds of graphical objects stand
in a right-of relation. Furthermore, several conventions for the interpretation of such words can be
used and a particular geometric algorithm can be defined for each interpretation. These algorithms
need not be fully quantitative; more qualitative approaches can be employed as long as the
computation returns a semantic value of an appropriate kind.
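Constant of L   Category name   Category definition   Translation into G   Corresponding type in G
Paris           T        t/IV         $\lambda P[P(d_1)]$   $\langle\langle e,t\rangle,t\rangle$
Frankfurt       T        t/IV         $\lambda P[P(d_3)]$   $\langle\langle e,t\rangle,t\rangle$
Saarbrücken     T        t/IV         $\lambda P[P(d_2)]$   $\langle\langle e,t\rangle,t\rangle$
France          T        t/IV         $\lambda P[P(r_1)]$   $\langle\langle e,t\rangle,t\rangle$
Germany         T        t/IV         $\lambda P[P(r_2)]$   $\langle\langle e,t\rangle,t\rangle$
city            CN       CN           dot            $\langle e,t\rangle$
country         CN       CN           region         $\langle e,t\rangle$
border          CN       CN           curve          $\langle e,t\rangle$
line            CN       CN           line           $\langle e,t\rangle$
intersection    CN       CN           intersection   $\langle e,t\rangle$
east            CN'      CN'          right          $\langle\langle\langle e,t\rangle,t\rangle,e\rangle$
big             ADJ      ADJ          big            $\langle e,t\rangle$
be              TV       IV/(t/IV)    $\lambda P\lambda x P(\lambda y[x = y])$   $\langle\langle\langle e,t\rangle,t\rangle,\langle e,t\rangle\rangle$
be              IV/ADJ   IV/ADJ       $\lambda P\lambda x P(x)$   $\langle\langle e,t\rangle,\langle e,t\rangle\rangle$
lie at          TV       IV/(t/IV)    lie_at         $\langle\langle\langle e,t\rangle,t\rangle,\langle e,t\rangle\rangle$
be to           TV       IV/(t/IV)    be_in_zone     $\langle\langle\langle e,t\rangle,t\rangle,\langle e,t\rangle\rangle$
a               T/CN     (t/IV)/CN    $\lambda P\lambda Q\exists x[P(x) \wedge Q(x)]$   $\langle\langle e,t\rangle,\langle\langle e,t\rangle,t\rangle\rangle$
the             T/CN     (t/IV)/CN    $\lambda P\lambda Q\exists y[\forall x[P(x) \leftrightarrow x = y] \wedge Q(y)]$   $\langle\langle e,t\rangle,\langle\langle e,t\rangle,t\rangle\rangle$
Figure 12
Translation of constants of L into G.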
be convenient to limit the expressive power of G and to define it as a first-order
language.
2.4 Translations between L and G 
In this section, the translation functions $\rho_{L-G}$ and $\rho_{G-L}$ are defined. As discussed
in Section 1, the goal in interpreting a multimodal message like the one in Figure 2 
is to find the translations of individual constants, which are not known. In this sec- 
tion, however, we assume that the translation is fully defined in order to illustrate 
all theoretical elements of the scheme in Figure 3. The induction of the translation of 
individual constants, on the other hand, will be shown in Section 3. 
For each syntactic category of L there is a corresponding type in G. The
correspondence between linguistic categories and geometrical types resembles the translation
from English to intensional logic (Dowty, Wall, and Peters 1985) and is defined in
terms of the function f as follows:
1. f(t) = t.
2. f(CN) = f(IV) = f(ADJ) = $\langle e, t\rangle$.
3. For any categories A and B, f(A/B) = $\langle f(B), f(A)\rangle$.
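The function f is a short recursion over category structure. In the sketch below, the encoding is an assumption made for illustration: basic categories are strings, a slash category A/B is the Python pair `(A, B)`, and a type $\langle a, b\rangle$ is the pair `(a, b)`. The category CN' is left out because f as defined above does not cover it.

```python
def f(category):
    """Map a syntactic category of L to its corresponding type in G."""
    if category == "t":                        # clause 1
        return "t"
    if category in ("CN", "IV", "ADJ"):        # clause 2
        return ("e", "t")
    if isinstance(category, tuple):            # clause 3: category is A/B
        a, b = category
        return (f(b), f(a))
    raise ValueError(f"no type defined for category {category!r}")

T = ("t", "IV")     # terms: t/IV
TV = ("IV", T)      # transitive verbs: IV/(t/IV)
DET = (T, "CN")     # determiners: (t/IV)/CN
```

Evaluating `f(T)`, `f(TV)`, and `f(DET)` gives the encodings of $\langle\langle e,t\rangle,t\rangle$, $\langle\langle\langle e,t\rangle,t\rangle,\langle e,t\rangle\rangle$, and $\langle\langle e,t\rangle,\langle\langle e,t\rangle,t\rangle\rangle$, in agreement with the types listed in Figure 12.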
2.4.1 Translation from L into G. Figure 12 shows the translation of constants of L. 
Simple terms, such as the names of cities and countries, translate into expressions 
denoting characteristic functions of sets of graphical entities. This graphical type is 
interpreted as the set of "properties" that an individual named by the term has (for 
the purpose of this discussion a property is just the set of individuals, as no inten- 
sional types are considered). So, as a city is represented by a dot in the graphical 
domain, the translation of Paris, for instance, is the set of geometrical properties that 
the dot representing Paris has in the interpretation state. Common nouns of category 
CN and CN' translate into predicates and functions from sets of properties to individ- 
uals, respectively. Adjectives occurring in attributive sentences are translated as sets 
of individuals. Note that there are two constants be: one combines with a term and 
the other with an adjective and both combinations produce intransitive verbs. The 
translations corresponding to these constants are functions from sets of properties to 
sets of individuals, and from sets of individuals to sets of individuals, respectively. 
Transitive verbs like lie at and be to translate into geometrical operators whose type is
a function from sets of properties to sets of individuals. Determiners are translated in 
a standard fashion. 
The translation rules for composite expressions are as follows:
SENTENCES
T1L-G. If $\alpha \in P_T$ and $\beta \in P_{IV}$, and $\rho_{L-G}(\alpha) = \alpha'$, $\rho_{L-G}(\beta) = \beta'$ then
$\rho_{L-G}(F_{L1}(\alpha, \beta)) = \alpha'(\beta')$, that is to say, the function $\alpha'$ applied to the
argument $\beta'$.
Examples: $\rho_{L-G}$(Paris is a city of France) = D1 (BEa (A (OFa(R1) (dot))))
$\rho_{L-G}$(Germany is to the east of France) =
R2 (be_in_zone (THE (OFb(R1) (right))))
$\rho_{L-G}$(a country is big) = A(region) (BEb(big))
$\rho_{L-G}$(Saarbrücken lies at the intersection between the border between
France and Germany and a line from Paris to Frankfurt) =
D2 (lie_at (THE (BETWEENb (THE (BETWEENa (R1) (R2)(curve)))
(A (FROM_TO(D1) (D3) (line)))
(intersection))))
TRANSITIVE VERB PHRASES
T2L-G. If $\alpha \in P_{TV}$ and $\beta \in P_T$, and $\rho_{L-G}(\alpha) = \alpha'$, $\rho_{L-G}(\beta) = \beta'$ then
$\rho_{L-G}(F_{L2}(\alpha, \beta)) = \alpha'(\beta')$.
Examples: $\rho_{L-G}$(be a city) = BEa (A (dot))
$\rho_{L-G}$(be to the east of France) = be_in_zone (THE (OFb(R1)(right)))
ATTRIBUTIVE VERB PHRASES
T3L-G. If $\alpha \in P_{IV/ADJ}$ and $\beta \in P_{ADJ}$, and $\rho_{L-G}(\alpha) = \alpha'$, $\rho_{L-G}(\beta) = \beta'$ then
$\rho_{L-G}(F_{L2}(\alpha, \beta)) = \alpha'(\beta')$.
Example: $\rho_{L-G}$(be big) = BEb(big)
TERMS
T4L-G. If $\alpha \in P_{T/CN}$ and $\beta \in P_{CN}$, and $\rho_{L-G}(\alpha) = \alpha'$, $\rho_{L-G}(\beta) = \beta'$ then
$\rho_{L-G}(F_{L3}(\alpha, \beta)) = \alpha'(\beta')$.
Examples: $\rho_{L-G}$(a city) = A (dot)
$\rho_{L-G}$(a city of France) = A (OFa(R1)(dot))
$\rho_{L-G}$(the border between France and Germany) = THE (BETWEENa
(R1) (R2) (curve))
$\rho_{L-G}$(a line from Paris to Frankfurt) = A (FROM_TO(D1) (D3) (line))
$\rho_{L-G}$(the east of France) = THE (OFb(R1)(right))
Note that the term the east can be formed by the rule S4L, but it cannot be translated
into G because there is a type restriction in the definition of T4L-G (i.e., $\beta \in P_{CN}$, but
east $\in P_{CN'}$). This restriction prevents the translation of terms like the east, as these
expressions have no concrete graphical representation; however, the east of France can
be generated, translated into G, and interpreted through the geometry as shown in
Section 2.3.2. In general, natural language expressions denoting abstract concepts do
not have a graphical representation (e.g., the population of France), and although in this
grammar we have focused on expressions that can be translated into G, the language
can be extended with linguistic terms that would be interpreted only in the linguistic
modality.
COMMON NOUNS
T5L-G. If $\alpha \in P_{CN}$ and $\beta \in P_{PP}$, or $\alpha \in P_{CN'}$ and $\beta \in P_{PP'}$, and
$\rho_{L-G}(\alpha) = \alpha'$, $\rho_{L-G}(\beta) = \beta'$ then $\rho_{L-G}(F_{L2}(\alpha, \beta)) = \beta'(\alpha')$.
Examples: $\rho_{L-G}$(city of France) = OFa(R1)(dot)
$\rho_{L-G}$(east of France) = OFb(R1)(right)
$\rho_{L-G}$(border between France and Germany) = BETWEENa (R1) (R2)
(curve)
$\rho_{L-G}$(intersection between the border between France and Germany and
a line from Paris to Frankfurt) = BETWEENb (THE (BETWEENa
(R1) (R2) (curve))) (A (FROM_TO(D1) (D3) (line))) (intersection)
of PREPOSITIONAL PHRASES
T6L-G. If $\alpha \in P_T$, and $\rho_{L-G}(\alpha) = \alpha'$, then $\rho_{L-G}(F_{L4}(\alpha))$ is either OFa($\alpha'$) or
OFb($\alpha'$).
Examples: $\rho_{L-G}$(of France) = OFa(R1)
$\rho_{L-G}$(of Germany) = OFb(R2)
between PREPOSITIONAL PHRASES
T7L-G. If $\alpha, \beta \in P_T$, and $\rho_{L-G}(\alpha) = \alpha'$, $\rho_{L-G}(\beta) = \beta'$ then
$\rho_{L-G}(F_{L5}(\alpha, \beta))$ is either BETWEENa($\alpha'$)($\beta'$) or BETWEENb($\alpha'$)($\beta'$).
Examples: $\rho_{L-G}$(between France and Germany) = BETWEENa(R1) (R2)
$\rho_{L-G}$(between the border between France and Germany and a line from
Paris to Frankfurt) = BETWEENb(THE(BETWEENa(R1)(R2)(curve)))
(A(FROM_TO(D1)(D3) (line)))
from-to PREPOSITIONAL PHRASES
T8L-G. If $\alpha, \beta \in P_T$, and $\rho_{L-G}(\alpha) = \alpha'$, $\rho_{L-G}(\beta) = \beta'$ then $\rho_{L-G}(F_{L6}(\alpha, \beta))$ =
FROM_TO($\alpha'$)($\beta'$).
Example: $\rho_{L-G}$(from Paris to Frankfurt) = FROM_TO (D1) (D3)
Constant of G: $\alpha$   Translation into L: $\rho_{G-L}(\alpha)$
dot                 city
region              country
curve               border
line                line
intersection        intersection
right               east
big                 big
lie_at              lie at
be_in_zone          be to
Figure 13
Translation of constants of language G into L.
2.4.2 Translation from G into L. In this section, the translation function $\rho_{G-L}$ is
defined. The translations of expressions of G into L are shown in Figures 13 and 14. Note
that constants of G in Figure 13 translate into constants of L; however, the translations
shown in Figure 14 are more complex, since composite expressions of G can translate
into basic or composite expressions of L.
The translation from G into L is shown below. In rules T6G-L to T8G-L, Q stands
for either the quantifier A or THE.
SENTENCES
T1G-L. If $\alpha \in E_{\langle\langle e,t\rangle,t\rangle}$ and $\beta \in E_{\langle e,t\rangle}$, and $\rho_{G-L}(\alpha) = \alpha'$, $\rho_{G-L}(\beta) = \beta'$ then
$\rho_{G-L}(F_{G1}(\alpha, \beta)) = \alpha'\beta''$ (the concatenation), where $\beta''$ is the result of
replacing the first verb in $\beta'$ with its third person singular present
form.
Examples: $\rho_{G-L}$(D1 (BEa (A (OFa(R1) (dot))))) = Paris is a city of France
$\rho_{G-L}$(R2 (be_in_zone (THE (OFb(R1) (right))))) = Germany is to the
east of France
$\rho_{G-L}$(A(region) (BEb(big))) = a country is big
$\rho_{G-L}$(D2 (lie_at (THE (BETWEENb (THE (BETWEENa (R1) (R2) (curve)))
(A (FROM_TO(D1)(D3) (line)))
(intersection))))) =
Saarbrücken lies at the intersection between the border between France
and Germany and a line from Paris to Frankfurt
Expression of G: $\delta$   Translation into L: $\rho_{G-L}(\delta)$
$\lambda P[P(d_1)]$, $\lambda P[P(d_2)]$, $\lambda P[P(d_3)]$   Paris, Saarbrücken, Frankfurt, respectively
$\lambda P[P(r_1)]$, $\lambda P[P(r_2)]$   France, Germany, respectively
$\lambda P[P(o)]$   the border between France and Germany
$\lambda P\lambda x P(\lambda y[x = y])$   be
$\lambda P\lambda x P(x)$   be
$\lambda P\lambda Q\exists x[P(x) \wedge Q(x)]$   a
$\lambda P\lambda Q\exists y[\forall x[P(x) \leftrightarrow x = y] \wedge Q(y)]$   the
Figure 14
Translation of some composite expressions of G into constants of L.
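TRANSITIVE VERB PHRASES
T2G-L. If $\alpha \in E_{\langle\langle\langle e,t\rangle,t\rangle,\langle e,t\rangle\rangle}$ and $\beta \in E_{\langle\langle e,t\rangle,t\rangle}$, and $\rho_{G-L}(\alpha) = \alpha'$, $\rho_{G-L}(\beta) = \beta'$
then $\rho_{G-L}(F_{G1}(\alpha, \beta)) = \alpha'\beta'$.
Examples: $\rho_{G-L}$(BEa (A (dot))) = be a city
$\rho_{G-L}$(be_in_zone (THE (OFb(R1)(right)))) = be to the east of France
ATTRIBUTIVE VERB PHRASES
T3G-L. If $\alpha \in E_{\langle\langle e,t\rangle,\langle e,t\rangle\rangle}$ and $\beta \in E_{\langle e,t\rangle}$, and $\rho_{G-L}(\alpha) = \alpha'$, $\rho_{G-L}(\beta) = \beta'$ then
$\rho_{G-L}(F_{G1}(\alpha, \beta)) = \alpha'\beta'$.
Example: $\rho_{G-L}$(BEb(big)) = be big
TERMS
T4G-L. If $\alpha \in E_{\langle\langle e,t\rangle,\langle\langle e,t\rangle,t\rangle\rangle}$, $\beta \in E_{\langle e,t\rangle}$, and $\rho_{G-L}(\alpha) = \alpha'$, $\rho_{G-L}(\beta) = \beta'$ then
$\rho_{G-L}(F_{G1}(\alpha, \beta)) = \alpha''\beta'$, where $\alpha''$ is $\alpha'$ except in the case where $\alpha'$ is
a and the first word in $\beta'$ begins with a vowel;
here, $\alpha''$ is an.
Examples: $\rho_{G-L}$(A (dot)) = a city
$\rho_{G-L}$(A (OFa(R1)(dot))) = a city of France
$\rho_{G-L}$(THE (BETWEENa (R1) (R2) (curve))) = the border between
France and Germany
$\rho_{G-L}$(A (FROM_TO (D1) (D3) (line))) = a line from Paris to Frankfurt
$\rho_{G-L}$(THE (OFb(R1)(right))) = the east of France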
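The determiner adjustment in T4G-L amounts to a small string operation. The sketch below is illustrative only: the function name `term_to_english` is hypothetical, and testing the first letter against a, e, i, o, u is an orthographic approximation of "begins with a vowel".

```python
def term_to_english(determiner, common_noun):
    """Concatenate a translated determiner and common noun phrase,
    replacing "a" with "an" when the noun phrase starts with a vowel letter."""
    starts_with_vowel = bool(common_noun) and common_noun[0].lower() in "aeiou"
    if determiner == "a" and starts_with_vowel:
        determiner = "an"
    return f"{determiner} {common_noun}"

examples = [
    term_to_english("a", "city of France"),
    term_to_english("a", "intersection"),
    term_to_english("the", "east of France"),
]
```

Only the indefinite determiner is affected, so the east of France keeps the, while intersection takes an.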
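COMMON NOUNS
T5G-L. If $\alpha \in E_{\langle\langle e,t\rangle,\langle e,t\rangle\rangle}$ and $\beta \in E_{\langle e,t\rangle}$, or $\alpha \in E_{\langle\langle\langle\langle e,t\rangle,t\rangle,e\rangle,\langle e,t\rangle\rangle}$ and
$\beta \in E_{\langle\langle\langle e,t\rangle,t\rangle,e\rangle}$, and $\rho_{G-L}(\alpha) = \alpha'$, $\rho_{G-L}(\beta) = \beta'$ then
$\rho_{G-L}(F_{G1}(\alpha, \beta)) = \beta'\,\alpha'$.
Examples: $\rho_{G-L}(OF_a(R_1)(dot))$ = city of France
$\rho_{G-L}(OF_b(R_1)(right))$ = east of France
$\rho_{G-L}(BETWEEN_a(R_1)(R_2)(curve))$ = border between France and Germany
$\rho_{G-L}(BETWEEN_b(THE(BETWEEN_a(R_1)(R_2)(curve)))(A(FROM\_TO(D_1)(D_3)(line)))(intersection))$ =
intersection between the border between France and Germany
and a line from Paris to Frankfurt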
of PREPOSITIONAL PHRASES
T6G-L. If $\alpha \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha$ is either $R_i$ or $Q(region)$ and $\rho_{G-L}(\alpha) = \alpha'$
then $\rho_{G-L}(F_{G2}(\alpha)) = \rho_{G-L}(F_{G3}(\alpha))$ = of $\alpha'$
Examples: $\rho_{G-L}(OF_a(R_1))$ = of France
$\rho_{G-L}(OF_b(R_2))$ = of Germany
$\rho_{G-L}(OF_a(A(region)))$ = of a country
between PREPOSITIONAL PHRASES
T7G-L. If $\alpha, \beta \in E_{\langle\langle e,t\rangle,t\rangle}$ such that
(a) $\alpha, \beta$ are either $R_i$ or $Q(region)$ or
(b) $\alpha$ is either $C_i$ or $Q(curve)$ and $\beta$ is either $L_i$ or $Q(line)$,
and $\rho_{G-L}(\alpha) = \alpha'$, $\rho_{G-L}(\beta) = \beta'$ then
$\rho_{G-L}(F_{G4}(\alpha, \beta))$ = between $\alpha'$ and $\beta'$.
Examples: $\rho_{G-L}(BETWEEN_a(R_1)(R_2))$ = between France and Germany
$\rho_{G-L}(BETWEEN_a(R_1)(OF_a(A(region))))$ = between France and a country
$\rho_{G-L}(BETWEEN_b(THE(BETWEEN_a(R_1)(R_2)(curve)))(A(FROM\_TO(D_1)(D_3)(line))))$ =
between the border between France and Germany and a line from
Paris to Frankfurt
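from-to PREPOSITIONAL PHRASES
T8G-L. If $\alpha, \beta \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha, \beta$ are either $D_i$ or $Q(dot)$ and
$\rho_{G-L}(\alpha) = \alpha'$, $\rho_{G-L}(\beta) = \beta'$ then $\rho_{G-L}(F_{G5}(\alpha, \beta))$ = from $\alpha'$ to $\beta'$.
Example: $\rho_{G-L}(FROM\_TO(D_1)(D_3))$ = from Paris to Frankfurt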
As mentioned above, G is a very expressive language; not all expressions of G
can be translated into expressions of L. Rules T1G-L to T8G-L define the expressions
that do have a translation. Instances of expressions that cannot be translated are
individual constants (e.g., $d_1$), equality relations between individuals (e.g., $d_1 = d_2$), and
conjunctions or disjunctions (e.g., $dot(d_1) \land dot(d_2)$). Other examples are expressions of
the form $\lambda P[P(e_1) \lor P(e_2) \lor \cdots \lor P(e_n)]$, where $e_i$ is an individual constant, which denote
the set of properties that one or another individual has. However, this latter kind of
expression could be translated if the expressiveness of L were augmented by allowing
conjoined term phrases in the grammar.
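The compositional character of rules T4G-L to T8G-L can be illustrated with a minimal sketch. It assumes a hypothetical encoding in which an expression of G is a nested tuple `(operator, *arguments)`; the names `LEXICON` and `translate` are illustrative, the distinction between $OF_a$ and $OF_b$ is collapsed into a single `OF`, and the type conditions of the rules are not checked.

```python
# Hypothetical lexicon: basic constants of G paired with words of L.
LEXICON = {
    "D1": "Paris", "D2": "Saarbrücken", "D3": "Frankfurt",
    "R1": "France", "R2": "Germany",
    "dot": "city", "region": "country", "curve": "border",
    "line": "line", "intersection": "intersection", "right": "east",
}


def translate(expr):
    """Translate a G expression (nested tuple) into a phrase of L."""
    if isinstance(expr, str):
        if expr not in LEXICON:
            # e.g. a bare individual constant such as "d1" has no translation
            raise ValueError(f"{expr!r} has no translation into L")
        return LEXICON[expr]
    op, *args = expr
    if op == "A":  # T4: indefinite term, with the a/an adjustment
        noun = translate(args[0])
        article = "an" if noun[0].lower() in "aeiou" else "a"
        return f"{article} {noun}"
    if op == "THE":  # T4: definite term
        return f"the {translate(args[0])}"
    if op == "OF":  # T5 with T6: common noun followed by an of-phrase
        term, noun = args
        return f"{translate(noun)} of {translate(term)}"
    if op == "BETWEEN":  # T5 with T7
        left, right, noun = args
        return f"{translate(noun)} between {translate(left)} and {translate(right)}"
    if op == "FROM_TO":  # T5 with T8
        start, end, noun = args
        return f"{translate(noun)} from {translate(start)} to {translate(end)}"
    raise ValueError(f"no translation rule for operator {op!r}")


border = ("THE", ("BETWEEN", "R1", "R2", "curve"))
line = ("A", ("FROM_TO", "D1", "D3", "line"))
print(translate(border))  # the border between France and Germany
print(translate(("BETWEEN", border, line, "intersection")))
```

In this sketch an untranslatable expression, such as the individual constant `"d1"`, raises an error rather than producing a phrase, which mirrors the observation that only some expressions of G have a translation.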
2.5 Translations between G and P 
The translation functions $\rho_{G-P}$ and $\rho_{P-G}$ are defined in this section, concluding the
presentation of the theoretical elements of the system of multimodal representation.
For each type of P there is a corresponding type in G, defined in terms of the
function $f_{P-G}$ as follows:
1. $f_{P-G}(dot) = f_{P-G}(line) = f_{P-G}(curve) = f_{P-G}(region) = f_{P-G}(zone) = f_{P-G}(composite\_region) = f_{P-G}(dot\_set) = f_{P-G}(line\_set) = f_{P-G}(map) = \langle e, t\rangle$.
2. For any types a and b, $f_{P-G}(\langle a, b\rangle) = \langle f_{P-G}(a), f_{P-G}(b)\rangle$.
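The two clauses of this type correspondence amount to a short recursion. The sketch below assumes a hypothetical encoding in which a basic type of P is a string and a functional type $\langle a, b\rangle$ is a Python pair; the names `BASIC_TYPES_OF_P` and `f_p_g` are illustrative only.

```python
# Basic types of P listed in clause 1.
BASIC_TYPES_OF_P = {
    "dot", "line", "curve", "region", "zone",
    "composite_region", "dot_set", "line_set", "map",
}


def f_p_g(p_type):
    """Map a type of P to its corresponding type in G."""
    if isinstance(p_type, tuple):  # clause 2: functional type <a, b>
        a, b = p_type
        return (f_p_g(a), f_p_g(b))
    if p_type in BASIC_TYPES_OF_P:  # clause 1: every basic type maps to <e, t>
        return ("e", "t")
    raise ValueError(f"unknown type of P: {p_type!r}")


print(f_p_g(("region", "curve")))
```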
2.5.1 Translation from P into G. The translations of the constants of P into G are
presented in Figure 15.

Constant of P: $\alpha$  |  Translation into G: $\rho_{P-G}(\alpha)$
$d_1, d_2, d_3, \ldots$  |  $\lambda P[P(d_1)]$, $\lambda P[P(d_2)]$, $\lambda P[P(d_3)]$, ...
$l_1, l_2, l_3, \ldots$  |  $\lambda P[P(l_1)]$, $\lambda P[P(l_2)]$, $\lambda P[P(l_3)]$, ...
$c_1, c_2, c_3, \ldots$  |  $\lambda P[P(c_1)]$, $\lambda P[P(c_2)]$, $\lambda P[P(c_3)]$, ...
$r_1, r_2, r_3, \ldots$  |  $\lambda P[P(r_1)]$, $\lambda P[P(r_2)]$, $\lambda P[P(r_3)]$, ...
$z_1, z_2, z_3, \ldots$  |  $\lambda P[P(z_1)]$, $\lambda P[P(z_2)]$, $\lambda P[P(z_3)]$, ...
Figure 15
Translation of constants of language P into G.

In the following definitions, Q stands for either the quantifier
A or THE. The translation rules are as follows:
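CONSTANT 13
T1P-G. If $\alpha \in C_s$ where $s \in \{dot, line, curve, region, zone\}$ then
(a) $\rho_{P-G}(\alpha)$ is as shown in Figure 15.
(b) $\rho_{P-G}(\alpha) = Q(s)$.
Examples: $\rho_{P-G}$([drawing of dot $d_1$]) = $\lambda P[P(d_1)]$
$\rho_{P-G}$([drawing of line $l_1$]) = $\lambda P[P(l_1)]$
$\rho_{P-G}$([drawing of region $r_1$]) = $\lambda P[P(r_1)]$
$\rho_{P-G}$([drawing of a region]) = $A(region)$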
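LINE
T2P-G. If $\alpha, \beta \in E_{dot}$, and $\rho_{P-G}(\alpha) = \alpha'$ and $\rho_{P-G}(\beta) = \beta'$ then
$\rho_{P-G}(F_{P1}(\alpha, \beta)) = Q(FROM\_TO(\alpha')(\beta')(line))$.
Example: $\rho_{P-G}$([drawing of a line from $d_1$ to $d_3$]) = $A(FROM\_TO(\lambda P[P(d_1)])(\lambda P[P(d_3)])(line))$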
CURVE 14
T3P-G. If $\alpha, \beta \in E_{region}$ such that $\alpha$ and $\beta$ are adjacent, and $\rho_{P-G}(\alpha) = \alpha'$ and
$\rho_{P-G}(\beta) = \beta'$ then $\rho_{P-G}(F_{P2}(\alpha, \beta)) = Q(BETWEEN_a(\alpha')(\beta')(curve))$.
Examples 14: $\rho_{P-G}$([drawing]) = $THE(BETWEEN_a(\lambda P[P(r_1)])(\lambda P[P(r_2)])(curve))$
$\rho_{P-G}$([drawing]) = $THE(BETWEEN_a(\lambda P[P(r_1)])(A(region))(curve))$
INTERSECTION
T4P-G. If $\alpha \in E_{curve}$ and $\beta \in E_{line}$, and $\rho_{P-G}(\alpha) = \alpha'$ and $\rho_{P-G}(\beta) = \beta'$ then
$\rho_{P-G}(F_{P3}(\alpha, \beta)) = Q(BETWEEN_b(\alpha')(\beta')(intersection))$.
Example: $\rho_{P-G}$([drawing]) = $THE(BETWEEN_b(THE(BETWEEN_a(\lambda P[P(r_1)])(\lambda P[P(r_2)])(curve)))(A(FROM\_TO(\lambda P[P(d_1)])(\lambda P[P(d_3)])(line)))(intersection))$
13 Rule (b) allows the concrete extension of a graphical object in P to be represented as its corresponding
denoting concept in G.
14 These two example expressions correspond to the abbreviated expressions in Examples 2 and 3,
respectively, presented in Section 2.3.2.
RIGHT
T5P-G. If $\alpha \in E_{region}$ and $\rho_{P-G}(\alpha) = \alpha'$ then $\rho_{P-G}(F_{P4}(\alpha)) = Q(OF_b(right)(\alpha'))$.
Example: $\rho_{P-G}$([drawing]) = $THE(OF_b(right)(\lambda P[P(r_3)]))$
DOT INSIDE A REGION
T6P-G. If $\alpha \in E_{region}$ and $\rho_{P-G}(\alpha) = \alpha'$ then $\rho_{P-G}(F_{P5}(\alpha)) = Q(OF_a(\alpha')(dot))$.
Example: $\rho_{P-G}$([drawing]) = $A(OF_a(\lambda P[P(r_1)])(dot))$
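COMPOSITE REGION (1) 15
T7P-G. If $\alpha, \beta \in C_{region}$ such that $\alpha$ and $\beta$ are adjacent, and $\rho_{P-G}(\alpha) = \lambda P[\alpha']$
and $\rho_{P-G}(\beta) = \lambda P[\beta']$ then $\rho_{P-G}(F_{P6}(\alpha, \beta)) = \lambda P[\alpha' \lor \beta']$.
COMPOSITE REGION (2)
T8P-G. If $\alpha \in C_{region}$ and $\beta \in E_{composite\_region}$ such that $\alpha$ and $\beta$ are adjacent, and
$\rho_{P-G}(\alpha) = \lambda P[\alpha']$ and $\rho_{P-G}(\beta) = \lambda P[\beta']$ then
$\rho_{P-G}(F_{P6}(\alpha, \beta)) = \lambda P[\alpha' \lor \beta']$.
SET OF DOTS
T9P-G. (a) If $\alpha \in E_{dot\_set}$ with $\alpha = \emptyset$ and $\beta \in C_{dot}$, and $\rho_{P-G}(\beta) = \lambda P[\beta']$ then
$\rho_{P-G}(F_{P6}(\alpha, \beta)) = \lambda P[\beta']$.
(b) If $\alpha \in E_{dot\_set}$ with $\alpha \neq \emptyset$ and $\beta \in C_{dot}$, and $\rho_{P-G}(\alpha) = \lambda P[\alpha']$ and
$\rho_{P-G}(\beta) = \lambda P[\beta']$ then $\rho_{P-G}(F_{P6}(\alpha, \beta)) = \lambda P[\alpha' \lor \beta']$.
SET OF LINES
T10P-G. (a) If $\alpha \in E_{line\_set}$ with $\alpha = \emptyset$ and $\beta \in C_{line}$, and $\rho_{P-G}(\beta) = \lambda P[\beta']$ then
$\rho_{P-G}(F_{P6}(\alpha, \beta)) = \lambda P[\beta']$.
(b) If $\alpha \in E_{line\_set}$ with $\alpha \neq \emptyset$ and $\beta \in C_{line}$, and $\rho_{P-G}(\alpha) = \lambda P[\alpha']$ and
$\rho_{P-G}(\beta) = \lambda P[\beta']$ then $\rho_{P-G}(F_{P6}(\alpha, \beta)) = \lambda P[\alpha' \lor \beta']$.
15 Examples of the application of the rules T7P-G to T11P-G are included in the translation of a map
shown in Figure 16, as explained below.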
[Figure: step-by-step translation of a map with regions $r_1$ to $r_4$, dots $d_1$ to $d_3$, and line $l_1$; the full map translates to $\lambda P[P(r_1) \lor P(r_2) \lor P(r_3) \lor P(r_4) \lor P(d_1) \lor P(d_2) \lor P(d_3) \lor P(l_1)]$.]
Figure 16
Translation into G of a map.
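MAP
T11P-G. If $\alpha \in E_{composite\_region}$, $\beta \in E_{dot\_set}$ and $\delta \in E_{line\_set}$, and $\rho_{P-G}(\alpha) = \lambda P[\alpha']$, $\rho_{P-G}(\beta) = \lambda P[\beta']$, $\rho_{P-G}(\delta) = \lambda P[\delta']$, then
$\rho_{P-G}(F_{P7}(\alpha, \beta, \delta)) = \lambda P[\alpha' \lor \beta' \lor \delta']$.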
An example of the translation of a map from P into G by rule T11P-G is shown
in Figure 16. A map is interpreted in G as the set of properties that one or another
graphical object in the base of the map has. Computing and translating all
possible syntactic structures that can be generated in P on the basis of the overt
graphical symbols of the drawing is not required for the interpretation of the picture
in Figure 4. The translation rules permit mapping a large number of syntactic
structures into G, and they can be used as necessary. However, for the interpretation
of a map we will only translate a designated expression of type map that
results from parsing a full drawing in terms of the graphical objects in the base.
This expression will be called the map. This criterion ensures that the drawing belongs to the map
modality P. In addition, the graphical terms in the disjunction of the body of expressions
of type map are used in G to define the interpretation domain $P_{base}$. When
this set is defined the semantic rules to interpret expressions of G can be evaluated.
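As a rough illustration of rules T7P-G to T11P-G, the translation of a map can be pictured as folding the base constants into one disjunction. The sketch below assumes a hypothetical representation in which $\lambda P[P(a_1) \lor \cdots \lor P(a_n)]$ is stored as the list of its disjuncts; the function names are illustrative, and the adjacency conditions of the rules are not checked.

```python
def translate_constant(name):
    """T1P-G(a): a constant a of P becomes λP[P(a)], stored as one disjunct."""
    return [name]


def extend(disjuncts, constant):
    """T7P-G to T10P-G: add one more constant to a growing disjunction."""
    return disjuncts + translate_constant(constant)


def translate_map(regions, dots, lines):
    """T11P-G: join the composite region, the dot set, and the line set."""
    composite_region, dot_set, line_set = [], [], []
    for r in regions:
        composite_region = extend(composite_region, r)
    for d in dots:
        dot_set = extend(dot_set, d)
    for l in lines:
        line_set = extend(line_set, l)
    return composite_region + dot_set + line_set


def show(disjuncts):
    """Render the list of disjuncts as an expression of G."""
    return "λP[" + " ∨ ".join(f"P({c})" for c in disjuncts) + "]"


# Base objects of the map in Figure 16.
the_map = translate_map(["r1", "r2", "r3", "r4"], ["d1", "d2", "d3"], ["l1"])
p_base = set(the_map)  # interpretation domain read off the disjunction
print(show(the_map))
```

Reading the disjuncts back as a set is one simple way to obtain the interpretation domain that the text derives from the body of the map expression.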
2.5.2 Translation from G into P. As mentioned in Section 1 in relation to the scheme
in Figure 3, the purpose of this translation is to draw the graphical symbols that are
referred to in G. To picture the full map, the only symbols that must be drawn are the
symbols of the base ($P_{base}$), as emerging symbols do not have an independent pictorial
realization. Thus, the only translations that have to be defined are the translations of
the symbols contained in the designated expression of type map (i.e., the map). We also have to consider
that graphical terms occurring in expressions of G can have a graphical realization,
which may be required for specific purposes. For instance, if one needs to highlight
the region to the east of France the term of G denoting that region should be translated
and depicted in P. In the definition of the rules below, Q stands for either the quantifier
A or THE.
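CONSTANT 16
T1G-P. (a) If $\alpha = \lambda P[P(\alpha^*)]$ and $\alpha^* \in C_e$ then $\rho_{G-P}(\alpha)$ is the drawing of $\alpha^*$.
(b) If $\alpha = Q(s)$ where $s \in \{dot, line, curve, region, zone\}$ then $\rho_{G-P}(\alpha)$ is
the drawing of any graphical object in $C_s$.
Examples: $\rho_{G-P}(\lambda P[P(d_1)])$ = [drawing of dot $d_1$]
$\rho_{G-P}(\lambda P[P(l_1)])$ = [drawing of line $l_1$]
$\rho_{G-P}(\lambda P[P(r_1)])$ = [drawing of region $r_1$]
$\rho_{G-P}(A(region))$ = [drawing of a region]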
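LINE
T2G-P. If $\alpha, \beta \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha, \beta$ are either $D_i$ or $Q(dot)$, and $\rho_{G-P}(\alpha) = \alpha'$
and $\rho_{G-P}(\beta) = \beta'$ then $\rho_{G-P}(Q(FROM\_TO(\alpha)(\beta)(line))) = F_{P1}(\alpha', \beta')$.
Example: $\rho_{G-P}(A(FROM\_TO(\lambda P[P(d_1)])(\lambda P[P(d_3)])(line)))$ = [drawing of a line from $d_1$ to $d_3$]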
CURVE
T3G-P. If $\alpha, \beta \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha, \beta$ are either $R_i$ or $Q(region)$, and
$\rho_{G-P}(\alpha) = \alpha'$ and $\rho_{G-P}(\beta) = \beta'$ then
$\rho_{G-P}(Q(BETWEEN_a(\alpha)(\beta)(curve))) = F_{P2}(\alpha', \beta')$.
Examples 17: $\rho_{G-P}(THE(BETWEEN_a(\lambda P[P(r_1)])(\lambda P[P(r_2)])(curve)))$ = [drawing]
$\rho_{G-P}(THE(BETWEEN_a(\lambda P[P(r_1)])(A(region))(curve)))$ = [drawing]
16 Rule (b) allows a graphical denoting concept in G to be represented in P as its concrete extension. 
17 These two example expressions correspond to the abbreviated expressions in Examples 2 and 3, 
respectively, presented in Section 2.3.2. 
INTERSECTION
T4G-P. If $\alpha, \beta \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha$ is either $C_i$ or $Q(curve)$ and $\beta$ is either $L_i$ or
$Q(line)$, and $\rho_{G-P}(\alpha) = \alpha'$ and $\rho_{G-P}(\beta) = \beta'$ then
$\rho_{G-P}(Q(BETWEEN_b(\alpha)(\beta)(intersection))) = F_{P3}(\alpha', \beta')$.
Example: $\rho_{G-P}(THE(BETWEEN_b(THE(BETWEEN_a(\lambda P[P(r_1)])(\lambda P[P(r_2)])(curve)))(A(FROM\_TO(\lambda P[P(d_1)])(\lambda P[P(d_3)])(line)))(intersection)))$ = [drawing]
RIGHT
T5G-P. If $\alpha \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha$ is either $R_i$ or $Q(region)$, and $\rho_{G-P}(\alpha) = \alpha'$
then $\rho_{G-P}(Q(OF_b(right)(\alpha))) = F_{P4}(\alpha')$.
Example: $\rho_{G-P}(THE(OF_b(right)(\lambda P[P(r_3)])))$ = [drawing]
DOT INSIDE A REGION
T6G-P. If $\alpha \in E_{\langle\langle e,t\rangle,t\rangle}$ such that $\alpha$ is either $R_i$ or $Q(region)$, and $\rho_{G-P}(\alpha) = \alpha'$
then $\rho_{G-P}(Q(OF_a(\alpha)(dot))) = F_{P5}(\alpha')$.
Example: $\rho_{G-P}(A(OF_a(\lambda P[P(r_1)])(dot)))$ = [drawing]
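COMPOSITE REGION (1) 18
T7G-P. If $\alpha = \lambda P[P(\alpha^*)]$ and $\beta = \lambda P[P(\beta^*)]$ such that $region(\alpha^*)$ and
$region(\beta^*)$, and $\rho_{G-P}(\alpha) = \alpha'$, $\rho_{G-P}(\beta) = \beta'$, and $\alpha'$ and $\beta'$ are adjacent
then $\rho_{G-P}(\lambda P[P(\alpha^*) \lor P(\beta^*)]) = F_{P6}(\alpha', \beta')$.
COMPOSITE REGION (2)
T8G-P. If $\alpha = \lambda P[P(\alpha^*)]$ and $\beta = \lambda P[\beta''] = \lambda P[P(\beta_1) \lor P(\beta_2) \lor \cdots \lor P(\beta_n)]$ such
that $region(\alpha^*)$ and $region(\beta_i)$, and $\rho_{G-P}(\alpha) = \alpha'$, $\rho_{G-P}(\beta) = \beta'$, and $\alpha'$
and $\beta'$ are adjacent then $\rho_{G-P}(\lambda P[P(\alpha^*) \lor \beta'']) = F_{P6}(\alpha', \beta')$.
SET OF DOTS
T9G-P. (a) If $\beta = \lambda P[P(\beta^*)]$ such that $dot(\beta^*)$ then $\rho_{G-P}(\beta) = F_{P6}(\emptyset, \beta)$.
(b) If $\alpha = \lambda P[\alpha''] = \lambda P[P(\alpha_1) \lor P(\alpha_2) \lor \cdots \lor P(\alpha_n)]$ and $\beta = \lambda P[P(\beta^*)]$
such that $dot(\alpha_i)$ and $dot(\beta^*)$, and $\rho_{G-P}(\alpha) = \alpha'$ and $\rho_{G-P}(\beta) = \beta'$ then
$\rho_{G-P}(\lambda P[\alpha'' \lor P(\beta^*)]) = F_{P6}(\alpha', \beta')$.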
18 Examples for rules T6G_p to T11G_p are included in Figure 16 above. 
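SET OF LINES
T10G-P. (a) If $\beta = \lambda P[P(\beta^*)]$ such that $line(\beta^*)$ then $\rho_{G-P}(\beta) = F_{P6}(\emptyset, \beta)$.
(b) If $\alpha = \lambda P[\alpha''] = \lambda P[P(\alpha_1) \lor P(\alpha_2) \lor \cdots \lor P(\alpha_n)]$ and $\beta = \lambda P[P(\beta^*)]$
such that $line(\alpha_i)$ and $line(\beta^*)$, and $\rho_{G-P}(\alpha) = \alpha'$ and $\rho_{G-P}(\beta) = \beta'$
then $\rho_{G-P}(\lambda P[\alpha'' \lor P(\beta^*)]) = F_{P6}(\alpha', \beta')$.
MAP
T11G-P. If $\alpha \in E_{\langle\langle e,t\rangle,t\rangle} = \lambda P[\alpha''] = \lambda P[P(\alpha_1) \lor P(\alpha_2) \lor \cdots \lor P(\alpha_m)]$,
$\beta \in E_{\langle\langle e,t\rangle,t\rangle} = \lambda P[\beta''] = \lambda P[P(\beta_1) \lor P(\beta_2) \lor \cdots \lor P(\beta_n)]$ and
$\delta \in E_{\langle\langle e,t\rangle,t\rangle} = \lambda P[\delta''] = \lambda P[P(\delta_1) \lor P(\delta_2) \lor \cdots \lor P(\delta_p)]$ such that
$region(\alpha_i)$, $dot(\beta_i)$ and $line(\delta_i)$, and $\rho_{G-P}(\alpha) = \alpha'$, $\rho_{G-P}(\beta) = \beta'$ and
$\rho_{G-P}(\delta) = \delta'$ then $\rho_{G-P}(\lambda P[\alpha'' \lor \beta'' \lor \delta'']) = F_{P7}(\alpha', \beta', \delta')$.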
This completes the specification of the system of multimodal representation in
Figure 3. In this system, it is possible to express natural language and graphical
information about maps and translate expressions between these two modalities. Natural
language can be seen as stating or imposing an interpretation upon graphical
representations, making the graphics meaningful. Alternatively, graphics can be seen as
representing knowledge in an effective fashion. Expressions of the languages L and
P can be translated through the interface language G in which both the semantics of
L and the geometrical structure of P can be represented and reasoned about in an
integrated fashion.
The system provides solid semantic ground on which to state and resolve problems
of reference in multimodal scenarios. The syntactic and semantic structures of
the three languages permit expression and interpretation of information in each of the
modalities, and the ability to systematically find correlated expressions in different
modalities with the same semantic values. As a consequence, it is possible to state
formally what it means to resolve a multimodal reference: according to this theory, to 
resolve a multimodal reference is to find the semantic value of an expression using ei- 
ther the information expressed in the modality or information expressed through other 
modalities with the help of the translation functions. In a fully interpreted multimodal 
system such as the one illustrated in this section, interpreting a multimodal message 
is a matter of evaluating the multimodal expression. However, as argued in Section 1, 
the relationship between individual constants input through different modalities must 
be established before multimodal expressions can be evaluated. How to establish this 
relationship, the crucial part of the interpretation process, is illustrated in Section 3. 
3. Resolution of Deictic Inference by Constraint Satisfaction 
In the theory developed in Section 2, it was assumed that the translations of constants
of all categories from L into G and vice versa were available, and then multimodal
interpretation could be carried out; however, in the interpretation of multimodal
messages, natural language and graphics are input from different sources, and working
out the meaning of a multimodal message is by no means trivial. As discussed in
Section 1, resolving the references and inducing the translation between graphical and
linguistic terms can be thought of as the same problem. Consider, for instance, reading
a book with words and pictures: when the associations between textual and graphical
symbols are realized by the reader, the message as a whole has been properly
understood. However, it cannot be expected that such an association can be known in
advance.
The process of inducing the translation functions for constants of G and L is 
similar to the computer vision problem of interpreting drawings. A related antecedent 
is the work on the logic of depiction (Reiter and Mackworth 1987) in which a logic for 
the interpretation of maps, to be applied to computer vision and intelligent graphics, 
is developed. It is argued that any adequate representation scheme for visual (and 
computer graphics) knowledge must make a distinction between knowledge of the 
image (the geometry) and knowledge of the scene (its linguistic interpretation), and 
about the relation between symbols at these two levels of representation; following 
Reiter and Mackworth (1987) we call this the depiction relation. In Reiter's system,
two sets of first-order logic sentences representing the scene and the image are employed. They
express, respectively, the conceptual and geometrical knowledge about hand-drawn
sketch maps of geographical regions. In the view adopted here, the depiction relation 
corresponds to the translation function between constants of L and G as discussed 
above. An interpretation in Reiter's system is defined as a model, in the logical sense, 
of both sets of sentences and the depiction relation, and interpreting a drawing is 
a matter of finding all possible models of such sets of sentences. The domain for 
these models is determined by the set of individuals in the image and the scene 
of the picture that is being interpreted. Although computing the set of models of a 
set of first-order logical formulae is a very hard computational problem, the entities 
constituting a drawing normally form a finite set, which is often small. So, whether it 
is possible to compute the set of models of a given drawing is an empirical question. 
In particular, Reiter's system employs a constraint satisfaction algorithm to find all
possible interpretations of maps, and the output of his system is a set of labels
such as "river", "road", or "shore" for curves or chains, and "land region" or "water
region" for areas. As mentioned above, finding the translation functions between G
and L is a similar problem, with the same level of complexity. In Section 3.1, we 
present a constraint satisfaction algorithm for the induction of the translation into G 
of individual constants of L mentioned explicitly in the text of a multimodal message. 
We also show how composite terms of L can be translated into their corresponding 
graphical expressions of G (and subsequently of P). 
A second consideration in this section is that working out the translation between
graphical and linguistic individual constants suggests a method for generating natural
language expressions that refer to graphical objects and configurations. Note that
inducing the linguistic translation of a graphical term that has not been mentioned
overtly in the textual part of a multimodal message is the same as generating a
linguistic description for the object denoted by the corresponding graphical term: once
one knows the translation between individual constants of both of the modalities, the
generation of multimodal descriptions can be achieved through the translation rules.
For instance, in the map of Figure 4, if one points to the curve $c_1$ once the translation
of individual constants has been found, the expression the border between France and
Germany can be generated. This strategy for producing natural language descriptions
is discussed further in Section 3.2.
3.1 Resolution of Spatial Deixis 
From the point of view of our system, in interpreting multimodal messages like
Figures 1 and 2, what is given are expressions of L and expressions of P and what has to
be worked out is the composition $\rho_{G-P} \circ \rho_{L-G}$ and the reciprocal function $\rho_{G-L} \circ \rho_{P-G}$.
However, note that the expressions of P are the graphical symbols on the drawings
and parsing a drawing (an expression of type map) produces a syntactic structure of
P whose translation into G is the designated expression that we called the map. Emergent
objects can also be represented in G as long as they can be produced from the base 
through syntactic rules of P and their translations into G. Consequently, expressions of 
G that refer to graphical objects stand in a one-to-one relation with the corresponding 
objects in P. Although, theoretically, expressions in G and P are different represen- 
tational objects, in actual interpretation processes they always come packed together. 
The relation between expressions of G and L, on the other hand, has to be worked 
out. For this purpose we present an algorithm for establishing a relationship between
the individual constants of L and the graphical constants included in the expression
designated as the map, which correspond to the interpretation domain $P_{base}$. The algorithm for
computing the translation function assigns a graphical constant to all proper names
overtly mentioned in the linguistic part of a multimodal message (e.g., the graphical
symbols $d_1$, $d_2$, $d_3$, $r_1$, and $r_2$ to the linguistic constants Paris, Saarbrücken, Frankfurt,
France, and Germany, respectively). The set of proper names appearing in a particular
multimodal message will be referred to as Names. As the translations for linguistic
constants of other types are given beforehand, once the translations for proper names are
available, it is possible to find the graphical symbols and configurations that corefer
with composite natural language descriptions through the translation rules between
L, G, and P. For instance, once the regions representing France and Germany have
been identified, the term the border between France and Germany can be translated into an 
expression of G, which denotes the corresponding curve, and also into the drawing of 
the curve in P, which denotes the border between France and Germany itself. Here, it 
is important to highlight that the translation for individual constants cannot normally 
be found with the overt information expressed through the multimodal message only. 
For working out the interpretation of Figure 2, for instance, we need, in addition to 
the text and graphics, knowledge about the geography of Europe and also knowledge 
about the interpretation conventions of maps. 
For the definition of the algorithm, a table representing the set of possible functions
from linguistic predicates (e.g., city, country, etc.) to their corresponding graphical
types (e.g., dot, region, etc.) is defined. This table will be referred to as a function table.
For each particular interpretation task, a set of appropriate function tables is defined
according to the following rule: For each $\delta \in C_{CN}$ of L and $\delta' \in C_{\langle e,t\rangle}$ of G such that
$\rho_{L-G}(\delta) = \delta'$, create a function table $(X_\delta, Y_{\delta'})$ such that:
$X_\delta = \{x \in C_T \mid [\![x \text{ is a } \delta]\!]^M \text{ is true and } x \in Names\}$
$Y_{\delta'} = \{y \in C_e \mid [\![\delta'(y)]\!]^M \text{ is true}\}$
where $X_\delta$ and $Y_{\delta'}$ are not empty. In case either of these two sets is empty no function
table for the corresponding pair is defined.
The function tables for our example are illustrated in Figure 17. 
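The construction rule for function tables can be sketched as follows. The data structures are assumptions made for illustration: `KINDS` stands in for the linguistic facts of the form "x is a $\delta$", `OBJECT_TYPES` for the graphical facts $\delta'(y)$, and `PREDICATE_TRANSLATION` for the pairs $(\delta, \delta')$; none of these names comes from the system described here.

```python
def build_function_tables(names, kinds, object_types, predicate_translation):
    """Return {(noun, graphical_type): (X, Y)} for every pair with non-empty X and Y."""
    tables = {}
    for noun, graphical_type in predicate_translation.items():
        xs = [x for x in names if kinds.get(x) == noun]
        ys = [y for y, t in object_types.items() if t == graphical_type]
        if xs and ys:  # no table is defined when either set is empty
            tables[(noun, graphical_type)] = (xs, ys)
    return tables


# Hypothetical encoding of the running example.
NAMES = ["Paris", "Saarbrücken", "Frankfurt", "France", "Germany"]
KINDS = {
    "Paris": "city", "Saarbrücken": "city", "Frankfurt": "city",
    "France": "country", "Germany": "country",
}
OBJECT_TYPES = {
    "d1": "dot", "d2": "dot", "d3": "dot",
    "r1": "region", "r2": "region", "r3": "region", "r4": "region",
    "l1": "line",
}
PREDICATE_TRANSLATION = {"city": "dot", "country": "region", "line": "line"}

tables = build_function_tables(NAMES, KINDS, OBJECT_TYPES, PREDICATE_TRANSLATION)
for key, (xs, ys) in tables.items():
    print(key, xs, ys)
```

No table is produced for the pair (line, line) because no proper name in the message is a line, so the set X is empty.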
Note that if only one cell of each column of a function table is filled in, a function
from proper names to graphical constants is defined. Furthermore, if the result of this
process is a table in which only one cell of each row is also marked, the function is
one-to-one. Accordingly, if there are n names and m graphical objects, the first column
of a function table can be filled up in m different ways, the second in m - 1 different
ways, and so on, until n graphical objects have been assigned. As a consequence,
each function table with n names and m graphical objects defines $m!/(m - n)!$ possible
translation functions. 19 In the example, $(X_{city}, Y_{dot})$ and $(X_{country}, Y_{region})$ define 6 and 12
possible functions, respectively. Let $T_\delta$ be the set of possible translation functions for
the function table $(X_\delta, Y_{\delta'})$, and let $\Gamma$ be the cross product of all $T_\delta$ in an interpretation
context (i.e., the set of possible translation models). For our example, $\Gamma = T_{city} \times T_{country}$,
where $|\Gamma| = 72$. This set contains 72 ordered pairs of functions, and each one represents
a possible translation model for the multimodal message. Translation models can be
enumerated by assigning a natural number to every cell in the array $\Gamma$. We give the
following enumeration for bidimensional translation models: let $\gamma_n = (f_i, g_j)$ be the nth
translation model in $\Gamma = T_x \times T_y$, where $0 < n \le |\Gamma|$ and $f_i \in T_x$, $g_j \in T_y$. For every n, if
$|T_x| \neq 0$ then $i = ((n - 1) \bmod |T_x|) + 1$ and $j = ((n - 1) \operatorname{div} |T_x|) + 1$. Similar expressions
can be defined for higher dimensions.

[Figure: two function tables, $(X_{city}, Y_{dot})$ with columns Paris, Saarbrücken, Frankfurt and rows $d_1$ to $d_3$, and $(X_{country}, Y_{region})$ with columns France, Germany and rows $r_1$ to $r_4$.]
Figure 17
Function tables for the message in Figure 2.

19 In general, if graphical objects can receive more than one name (e.g., as in the multimodal
interpretation scenarios related to the Hyperproof system (Barwise and Etchemendy 1994)), the
number of possible translation functions will be $m^n$, where m is the number of graphical objects and n
is the number of names.

To enumerate the set of possible functions from n names to m graphical objects we
use the following procedure: Let N be a list of names and O a list of graphical objects,
and let F-INDEX be an n-digit string containing the n digits of a numeral in base m.
Every string in F-INDEX codifies a total function in which the jth graphical object $o_j$
in O (where $0 \le j < m$) is assigned to the ith name $n_i$ in N by the rule F-INDEX(i) = j. The
set of possible entries in F-INDEX codes the $m^n$ possible functions from n names to
m graphical objects. One-to-one functions are those in which no $o_j$ occurs more than
once in a given value of F-INDEX. The functions sought are the $m!/(m - n)!$ one-to-one
functions that result from enumerating in base m all possible values for F-INDEX from
0 to $m^n - 1$, and filtering out all numbers in which the same digit occurs more than once
in the enumeration order. Consider the graphical illustration of the F-INDEX scheme
for identifying the functions corresponding to a function table with three names and
four graphical objects in Figure 18. This function table has $4^3 = 64$ possible functions
out of which $4!/(4 - 3)! = 24$ are one-to-one. The graphical object $o_j$ in O is associated
to the name $n_i$ in N by marking the corresponding cell in the table, where j is placed
in the corresponding cell of F-INDEX. The string in F-INDEX is the numeral 000 in
base 4 and represents the function in which the graphical object $o_0$ is assigned to all
three names.

[Figure: a function table with names $n_1$, $n_2$, $n_3$ in N as columns and objects $o_0$ to $o_3$ in O as rows; all three marks are in the $o_0$ row, and F-INDEX reads 0 0 0.]
Figure 18
Set of functions associated with a function table.

[Figure: four function tables over names $n_1$, $n_2$, $n_3$ and objects $o_0$ to $o_3$, with F-INDEX values 0 1 2, 1 2 3, 3 2 1, and 3 3 3.]
Figure 19
Examples of function index.
Some examples of the enumeration of functions are shown in Figure 19. The first
table shows function 012 (base 4), which is the smallest index for a one-to-one function;
the second table shows function 123, which associates names $n_1$, $n_2$, and $n_3$ to the
graphical objects $o_1$, $o_2$, and $o_3$, respectively; the third table illustrates the function 321,
which is the largest index for a one-to-one function in the set; and finally, the fourth table
illustrates the function 333, which is a constant function assigning the object $o_3$ to all
three names.
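The F-INDEX scheme and the numbering of bidimensional translation models can be checked with a short sketch. The function names below are illustrative, and digits are kept as tuples rather than strings; otherwise the code follows the base-m enumeration and the mod/div formulas as described.

```python
def f_index_values(n, m):
    """Yield every n-digit base-m index, from 0 to m**n - 1, as a tuple of digits."""
    for value in range(m ** n):
        digits, rest = [], value
        for _ in range(n):
            rest, digit = divmod(rest, m)
            digits.append(digit)
        yield tuple(reversed(digits))  # most significant digit first


def one_to_one_indices(n, m):
    """Keep only the indices in which no digit (graphical object) is repeated."""
    return [idx for idx in f_index_values(n, m) if len(set(idx)) == n]


def as_function(index, names, objects):
    """Apply the rule F-INDEX(i) = j: the ith name receives the jth object."""
    return {names[i]: objects[j] for i, j in enumerate(index)}


def model_indices(n, size_tx):
    """Map the model number n (starting at 1) to the pair (i, j)."""
    return (n - 1) % size_tx + 1, (n - 1) // size_tx + 1


names = ["n1", "n2", "n3"]
objects = ["o0", "o1", "o2", "o3"]
injective = one_to_one_indices(3, 4)

print(len(list(f_index_values(3, 4))), len(injective))  # 64 24
print(injective[0], injective[-1])                      # (0, 1, 2) (3, 2, 1)
print(as_function((1, 2, 3), names, objects))
print(model_indices(72, 6))                             # (6, 12)
```

With six functions for cities and twelve for countries, model number 72 corresponds to the last function of each table, which agrees with the count of 72 translation models given above.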
Armed with these concepts, we can define an algorithm for working out the
interpretation of a multimodal message, as follows: Let $message_L$ be a sentence of L (the
textual part of the multimodal message), $\Phi_G$ an empty set of expressions of G, and
$\Gamma$ the set of possible translation models for $message_L$. Then, for each $\gamma_i \in \Gamma$ assume
that $\gamma_i$ is a translation model for $message_L$ and include its translation $message_G$ under
$\gamma_i$ in $\Phi_G$, i.e., $\rho_{L-G}(message_L) = message_G$. If the semantic value of all expressions in $\Phi_G$
in relation to the geometrical domain $P_{base}$ is true, then $\gamma_i$ is a translation model for
$message_L$; otherwise, exclude $\gamma_i$ from $\Gamma$. Once all translation models have been tested,
check whether there is only one $\gamma_j$ in $\Gamma$. If so, that $\gamma_j$ is the translation function;
otherwise, select a new appropriate expression of L (a general knowledge constraint) and
include its translation into G in $\Phi_G$, and repeat the process until there is only one $\gamma_j$
in $\Gamma$.
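A minimal sketch of this pruning loop is given below. Everything concrete in it is an assumption made for illustration: dots are points, regions are axis-aligned rectangles, the message is taken to say that Saarbrücken lies at the intersection between the border between France and Germany and a line from Paris to Frankfurt, and constraints are Python predicates evaluated directly on the toy geometry rather than translated expressions of G. Filtering the surviving models with one new constraint per cycle is equivalent, for this purpose, to re-evaluating the whole constraint set.

```python
from itertools import permutations, product

# Hypothetical geometry: dots are points, regions are rectangles (x0, y0, x1, y1).
DOTS = {"d1": (2, 2), "d2": (4, 2), "d3": (6, 2)}
REGIONS = {"r1": (0, 0, 4, 4), "r2": (4, 0, 8, 4), "r3": (0, 4, 4, 8), "r4": (4, 4, 8, 8)}


def inside(dot, region):
    (x, y), (x0, y0, x1, y1) = DOTS[dot], REGIONS[region]
    return x0 <= x <= x1 and y0 <= y <= y1


def on_segment(dot, start, end):
    (x, y), (xs, ys), (xe, ye) = DOTS[dot], DOTS[start], DOTS[end]
    collinear = (xe - xs) * (y - ys) == (ye - ys) * (x - xs)
    return collinear and min(xs, xe) <= x <= max(xs, xe) and min(ys, ye) <= y <= max(ys, ye)


def message(m):
    # A dot shared by two rectangles with disjoint interiors lies on their border.
    s = m["Saarbrücken"]
    on_border = inside(s, m["France"]) and inside(s, m["Germany"])
    return on_border and on_segment(s, m["Paris"], m["Frankfurt"])


def germany_east_of_france(m):
    return REGIONS[m["Germany"]][0] >= REGIONS[m["France"]][2]


def paris_city_of_france(m):
    return inside(m["Paris"], m["France"])


def frankfurt_city_of_germany(m):
    return inside(m["Frankfurt"], m["Germany"])


def translation_models(tables):
    """Cross product of the one-to-one functions of every function table."""
    per_table = [
        [dict(zip(names, objs)) for objs in permutations(objects, len(names))]
        for names, objects in tables
    ]
    return [{k: v for part in combo for k, v in part.items()} for combo in product(*per_table)]


def resolve(tables, constraints):
    """Add one constraint per cycle until a single translation model remains."""
    models, history = translation_models(tables), []
    for constraint in constraints:
        models = [m for m in models if constraint(m)]
        history.append(len(models))
        if len(models) == 1:
            break
    return models, history


TABLES = [
    (["Paris", "Saarbrücken", "Frankfurt"], ["d1", "d2", "d3"]),
    (["France", "Germany"], ["r1", "r2", "r3", "r4"]),
]
CONSTRAINTS = [message, germany_east_of_france, paris_city_of_france, frankfurt_city_of_germany]

models, history = resolve(TABLES, CONSTRAINTS)
print(len(translation_models(TABLES)), history)  # 72 [4, 2, 1]
print(models[0])
```

On this toy geometry the message alone leaves four candidate models, and two general knowledge constraints are enough to isolate one, so the loop stops after three cycles without consulting the last constraint.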
For our example, 4 translations out of the 72 $\gamma$'s in $\Gamma$ will come out true for the first
cycle of the algorithm in which the multimodal message is used as the only constraint
(Example 5 in Section 2.3), as shown in Figure 20.
To continue with the algorithm, some knowledge of the geography of Europe is 
required. For our problem the constraints relevant to interpreting the message are 
illustrated in Figure 21. 
The idea of the algorithm is simply to take constraints one at a time and produce 
the interpretation of the message incrementally. Considering constraint 1 in Figure 21, 
the translation functions (2) and (4) in Figure 20 can be removed; the translation func- 
tion (3), in turn, can be ruled out either through constraints 2 or 3 (the interpretation 
of the translation of constraint 1 into G is shown in Example 4 in Section 2.3). For the 
example, only three cycles of the algorithm are required to rule out all but the correct 
translation model in $\Gamma$, which is the translation function (1) in Figure 20.
[Figure: four pairs of function tables, numbered (1) to (4), each assigning dots $d_1$ to $d_3$ to Paris, Saarbrücken, and Frankfurt and regions $r_1$ to $r_4$ to France and Germany.]
Figure 20
Possible translation models without additional constraints.

1. Germany is to the east of France.
2. Paris is a city of France.
3. Frankfurt is a city of Germany.
Figure 21
General knowledge of geography.

This concludes the presentation of the procedure for interpreting proper names
deictically in relation to a graphical context. Although only the interpretation of this
kind of constant was required for our example, the interpretation of other kinds of
terms, e.g., pronouns, can be carried out in the present framework. To
cope with multimodal messages in which pronouns are included in the
textual part, as in Figure 1, a more general definition of the language L would be
required, but in such an extension both proper names and pronouns would be
constants of category T in the grammar. In the present framework, pronouns would be
interpreted along the lines of proper names. For the definition of function tables, each 
pronoun present in a multimodal text would be included in the set Names, and as 
a first approximation, it would be a member of the domain of all function tables; 
different instances of the same pronoun would be considered different objects
in the interpretation process (e.g., $he_0$, $he_1$, etc.), and the interpretation would be
worked out as shown above. It is also possible to think of a situation in which there 
are two or more graphical objects with the same name; in this context, proper names 
would be considered kinds of pronouns, and from the point of view of L, a different 
subscripted constant of category T ($name_0$, $name_1$, ...) would be assigned to each such
graphical object. To differentiate these objects, alternative definite descriptions could 
be obtained through the translation from constants of P into expressions of G, as will 
be argued in Section 3.2, and such descriptions could be used in the context of the 
particular rhetorical structures and communicative purposes of multimodal messages. 
A further consideration is that not only the constants of category T in a grammar can 
be used deictically; definite and indefinite descriptions can also be interpreted in this 
way. Consider that the textual part of the multimodal message in Figure 1 could have 
been John washed it, the man washed it, or even a man washed it and all three terms John, 
the man, and a man would have to be interpreted deictically in relation to the graphical 
context. To be able to deal with this latter situation, descriptions can be interpreted 
deictically in our approach if terms of this kind are also included in the set Names 
for the construction of function tables. More generally, our interpretation procedure 
defines a function from terms into individuals of the world through the graphical 
context. This is because although function tables define translation models between 
linguistic and graphical terms, graphical objects in P denote the corresponding indi- 
viduals in the world. We can think of our deictic interpretation procedure as a specific 
implementation for our graphical domain of Kaplan's operator DTHAT (in our simplified
extensional language), which takes a term and maps it into an individual of
the world in the interpretation context whenever the term is used deictically (Kaplan 
1978). 
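
The two-step mapping described above (term to graphical object through a translation model, then graphical object to individual through the denotation of P) can be sketched as a composition of two lookups. This is only an illustration: the dictionaries, the individual labels, and the function name dthat are assumptions introduced here, and the sketch covers only the simplified extensional case.

```python
# Hypothetical translation model: linguistic term -> graphical object of P.
# (d1 and r1 are the dot and region named Paris and France in the running example.)
translation_model = {"Paris": "d1", "France": "r1", "the city": "d1"}

# Hypothetical denotation of graphical objects: graphical object -> individual.
# The individual labels are placeholders, not part of the paper's formalism.
denotation = {"d1": "individual_paris", "r1": "individual_france"}


def dthat(term, model=translation_model, world=denotation):
    """Map a deictically used term to an individual via the graphical context."""
    graphical_object = model[term]   # step 1: term -> graphical object
    return world[graphical_object]   # step 2: graphical object -> individual
```

Under these assumptions a proper name and a description that translate to the same graphical object pick out the same individual, which is the sense in which both can be used deictically.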
The interpretation of proper names and definite descriptions has long been a 
source of interesting semantic problems. Consider that linguistic terms serve to iden- 
tify individuals, and whenever they are used, the individual they denote should exist. 
However, as pointed out by Donnellan and commented on by Kaplan, "using a definite 
description referentially a speaker may say something true even though the descrip- 
tion correctly applies to nothing" (Kaplan 1978). For example, suppose a bachelor 
enters a room accompanied by a woman who is misintroduced as his wife. Someone 
who notices the woman's solicitous attention to the man, says His wife is kind to him. 
The speaker uses the description his wife to refer to the woman, which implies that 
the bachelor has a wife (!), and nevertheless, what the speaker says is true. Here, one 
might be inclined to say that his wife applies to nothing, but if the woman is in the 
visual field of the speaker, it would be more proper to say that his wife applies
deictically to the woman. If the expression she is kind to him had been used instead, or
if the speaker had simply pointed to the woman at the time the expression is kind to him
was uttered, the deictic nature of the reference would be easily revealed. According
to Kaplan, whenever a description is used referentially (as opposed to attributively), 
describing can be taken as a form of pointing, and as he suggested, instead of taking 
the sense of a description as the subject of a proposition, the sense is used only to fix 
the denotation, which is then taken directly as the subject of the proposition. Similarly, 
although a proper name is usually thought of as related to an individual (the bearer) 
in an intimate fashion through an interpretation function in model theory, and it is of- 
ten stated that proper names are related to the same individual through all world and 
time indices (i.e., as rigid designators [Kripke 1972]), we would argue that proper names
can also be interpreted deictically to fix a referent, which can then be taken directly
as the subject of a proposition. 
3.2 Generation of Natural Language Descriptions 
The multimodal representation scheme and the resolution of deictic inferences pre- 
sented above permit the generation of multimodal descriptions in a simple and sys- 
tematic fashion. Once a multimodal representational system is fully defined, the gen- 
eration of graphical and linguistic expressions can be achieved directly through the 
translation rules. As the crucial piece of knowledge required for use of the translation 
rules is the translation model, the deictic inference required to identify an individual 
and the inference required to generate a description for such an individual are but 
two sides of the same coin. 
If a graphical object is pointed out on the screen, a number of natural language
descriptions to refer to it can be produced. Several strategies for finding an appropri- 
ate description are available, depending on whether the object pointed at is in Phase 
or whether it is an emergent object. Another consideration is whether the object has 
a proper name or can be referred to either by a definite or an indefinite description. 
Suppose that a graphical cursor for pointing to graphical objects is available. The cur- 
sor itself is modeled as a graphical object of type dot within the graphical domain. 
With this interactive device we can identify dots if the position of a dot in a drawing 
is the same as the position of the cursor, lines and curves if the cursor lies on the line 
or the curve, and regions if the cursor is inside or on the border of a region. With 
this device, we can select all basic or emergent graphical objects that are identified by 
an individual pointing act. Basic objects will be identified directly, and the emergent 
objects selected by a pointing act will be those that can be produced by the grammar 
of P and satisfy the geometrical conditions associated with the cursor. Objects iden- 
tified by a cursor will be terms of P that can be translated into G as proper names, 
definite descriptions, or indefinite descriptions. To refer to a graphical object we can 
use the following simple strategy: if a graphical object can be translated into L as a 
proper name, use the proper name; if the graphical object can be translated as a de- 
scription and it is the only one satisfying such a description, use a definite description; 
otherwise, use an indefinite description. 
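
The referring strategy just stated can be sketched as follows. This is a minimal illustration, assuming that the translations into L are already available as two lookup tables; the table contents, the object identifier d2, and the function name refer are hypothetical and stand in for the output of the translation rules.

```python
# Hypothetical results of translating graphical objects into L.
PROPER_NAMES = {"d1": "Paris", "r1": "France"}
DESCRIPTIONS = {
    "d1": "city",
    "d2": "city",   # assumed here to be a dot without a proper name
    "r1": "country",
    "l1": "line from Paris to Frankfurt",
    "c1": "border between France and Germany",
}


def refer(obj, proper_names=PROPER_NAMES, descriptions=DESCRIPTIONS):
    """Choose a proper name, a definite description, or an indefinite one."""
    if obj in proper_names:
        return proper_names[obj]                 # a proper name is available
    noun = descriptions[obj]
    satisfiers = [o for o, d in descriptions.items() if d == noun]
    if len(satisfiers) == 1:
        return f"the {noun}"                     # description is uniquely satisfied
    return f"a {noun}"                           # otherwise fall back to indefinite
```

With these tables, refer("l1") yields "the line from Paris to Frankfurt", while refer("d2") yields "a city" because more than one object satisfies the description.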
Consider a pointing action in which the dot d1 or the region r1 is selected. The
translations from P into G of these objects, according to rule T1_P-G (a), are λP[P(d1)]
and λP[P(r1)], respectively. As these expressions can be translated into L as shown
in Figure 14, these objects can be referred to in L as Paris and France, respectively.
However, consider that as these objects are the concrete extension of a number of
denoting concepts of G, these objects can also be translated into G through rule T1_P-G
(b) as A/THE(dot) and A/THE(region), respectively, and they can be translated into L
as a/the city and a/the country. Another possible translation for the dot d1 whenever it
is produced by rule S4_P and translated into G through T4_P-G is A/THE(OFa(R1)(dot)),
which in turn can be translated into L as a/the city of France.
There may be constants of P that cannot be translated into L as proper names.
Consider, for instance, the line l1 in Figure 4; the translation of this line into G is
λP[P(l1)], but as can be seen in Figure 14, there is no proper name that corresponds
to this expression in L. However, this constant could be translated by rule T1_P-G (b)
as the denoting concepts A/THE(line). As the line is also an emergent object that can
be produced via the syntactic rule S2_P, its translations according to rule T2_P-G are
A/THE(FROM_TO(D1)(D3)(line)); subsequently, such an object can be referred to in L as
a/the line and a/the line from Paris to Frankfurt. A similar example is the description of
the curve c1, which is produced by rule S3_P and can be translated into L through G
as a/the border between France and Germany.
As mentioned in Section 1, the generation of descriptions is required within the 
context of specific rhetorical and intentional structures, such as the activate structure 
of the WIP system, which employs Reiter and Dale's algorithm for the production of 
definite descriptions on demand. Our system can be used to support the generation 
of descriptions either definite or indefinite, and even pronouns used deictically, in 
multimodal generation systems with a solid semantic base. These descriptions could 
be used according to particular rhetorical and intentional structures related to specific 
application domains. The advantage of such an approach is that the choice of the ex- 
pressions to be used in multimodal presentations could be made not only on the basis 
of predefined heuristics, but also on the basis of the semantic value of these expres- 
sions in the context of use. In addition, the decision about what kind of knowledge is 
expressed through either modality for the production of coordinated natural language
and graphical explanations could take into account not only the kind of heuristics that
are currently employed in systems like WIP and COMET, but also the expressiveness
and effectiveness criteria of natural and graphical languages.
discourse referents 
translation conditions 
graphical conditions 
linguistic conditions 
Figure 22 
Components of the multimodal discourse representation structure. 
4. Multimodal Discourse Representation Theory 
The ability to interpret individual multimodal messages is a prerequisite for interpret- 
ing sequences of multimodal messages occurring in the normal flow of interactive 
conversations. In the same sense that discourse theories, like DRT, are designed to 
interpret sequences of sentences, it is desirable to have a theory in which sequences 
of multimodal messages can be interpreted. Such a theory would have to support 
anaphoric and deictic resolution models in an integrated fashion, and would have 
to be placed in a larger pragmatic setting in which intentions and presuppositions 
are considered, and in which mechanisms to retrieve knowledge from memory are 
also taken into account. To work out such a theory is quite an ambitious goal; however,
in the same way that DRT focuses on the internal structural processes that govern
anaphoric resolution, it is plausible to consider a multimodal discourse representation 
theory (MDRT) to cope with the resolution of spatial deictic inferences. In the same 
way that DRT postulates discourse representation structures in which referents and 
conditions are introduced incrementally through the interpretation of the incoming 
natural language discourse by means of the application of construction rules, it is
plausible to conceive similar multimodal discourse representation structures (MDRS) 
whose referents and conditions would be introduced by modality-dependent con- 
struction rules acting upon the expressions of the corresponding modality. In these 
structures, DRS conditions extracted from different modalities would be kept in sep- 
arate partitions, but discourse referents would be abstract objects common to the 
whole MDRS. In particular, MDRS's could help to specify accessibility relations be- 
tween anaphoric and deictic terms and their antecedents and interpretation context, 
imposing severe constraints on the possible interpretations, as is normal in DRT. 
The resolution process itself would be accomplished by incremental constraint sat- 
isfaction, as shown for deictic inferences. In the rest of this section, we present a 
schematic picture of how an MDRS can be developed, and illustrate using the inter- 
pretation of the multimodal message in Figure 2. Consider first the empty MDRS in 
Figure 22. 
The MDRS is a structure with four partitions; it extends traditional DRS with 
one partition for graphical conditions and another to store the translation models 
that hold in a particular interpretation state. The partition for linguistic conditions 
is used as in normal DRS, and the top partition for referents includes a variable 
for every individual that is referred to in the multimodal message in either of the 
modalities. Figure 23 illustrates the initial state for the interpretation process of the 
multimodal message in Figure 2. Graphical expressions of G (the map) are included 
in the graphical section of the MDRS, and textual conditions, with the associated 
type information, are included in the linguistic section as in normal DRS. A referent
is included in the corresponding partition for every individual that has been
introduced through either modality, as referents are considered medium-independent
abstractions. In the same way that the order of processing of linguistic information
is not crucial for the definition of the linguistic conditions, we abstract over
scanning considerations and assume that graphical expressions are introduced as a single
"sentence" (see footnote 20). Finally, the partition for the possible translation models
is empty at this stage, as the coreference relation between text and graphics has not
yet been established.

Figure 23
Initial MDRS for the interpretation of the multimodal message.
Message text: Saarbrücken lies at the intersection between the border between France and Germany and a line from Paris to Frankfurt
Discourse referents: n1, n2, n3, n4, n5; o1, o2, o3, o4, o5, o6, o7, o8
Translation conditions: (empty)
Graphical conditions: (the map)
Linguistic conditions:
  Saarbrücken(n1) city(n1)
  France(n2) country(n2)
  Germany(n3) country(n3)
  Paris(n4) city(n4)
  Frankfurt(n5) city(n5)
  n1 lies at the intersection between the border between n2 and n3 and a line from n4 to n5
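
The four-partition structure and its initial state can be written down as a small data structure. The sketch below is an assumption-laden illustration: the class layout, the field names, and the use of plain strings for conditions are choices made here for readability, and the graphical partition is left as a placeholder because the graphical expressions for the map are not spelled out in this section.

```python
from dataclasses import dataclass, field


@dataclass
class MDRS:
    """Four partitions, following Figure 22 (field names are illustrative)."""
    referents: list = field(default_factory=list)    # abstract, modality-independent
    translation: list = field(default_factory=list)  # candidate translation models
    graphical: list = field(default_factory=list)    # conditions from the drawing
    linguistic: list = field(default_factory=list)   # conditions from the text


def initial_mdrs():
    """Build the initial state for the message about Saarbruecken."""
    names = ["Saarbruecken", "France", "Germany", "Paris", "Frankfurt"]
    kinds = ["city", "country", "country", "city", "city"]
    m = MDRS()
    m.referents = [f"n{i}" for i in range(1, 6)] + [f"o{i}" for i in range(1, 9)]
    m.graphical = ["<graphical expressions for the map>"]  # placeholder only
    for i, (name, kind) in enumerate(zip(names, kinds), start=1):
        m.linguistic += [f"{name}(n{i})", f"{kind}(n{i})"]
    m.linguistic.append("n1 lies at the intersection between the border "
                        "between n2 and n3 and a line from n4 to n5")
    return m
```

The translation partition starts out empty, matching the observation that no coreference relation between text and graphics has been established at this stage.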
The interpretation process by constraint satisfaction is illustrated in Figure 24. 
Figure 24(a) illustrates the interpretation state after the first cycle of the constraint 
satisfaction algorithm presented in Section 3.1 has been carried out. In this state, 
the partition for the translation conditions contains the disjunction of the four pos- 
sible translation models that are consistent with the message, taking the message 
itself as the only interpretation constraint. Figure 24(b) illustrates the interpretation 
state once the additional constraint that Germany is to the east of France has been 
considered (see footnote 21). The interpretation of the corresponding expression introduces two additional
discourse referents (n6 and n7), as the terms Germany and France in the textual
part of the message should be resolved anaphorically in relation to the context pre- 
viously built. However, this anaphoric resolution process is kept within the linguistic 
section of the MDRS and should take into account the accessibility constraints between 
anaphor and antecedent, as commonly done in DRT. The result of this anaphoric inference
is reflected in the equality conditions n6 = n3 and n7 = n2. The inclusion
of the constraint Germany is to the east of France permits us to rule out two possible 
translation models, and the result of the second cycle of the constraint satisfaction
algorithm is reflected in the new state of the partition for translation conditions of
the MDRS. Figure 24(c) illustrates the final interpretation state, in which the constraint
that Paris is a city of France has been considered; this involves anaphoric and deictic
resolution inferences, as in the interpretation of the previous constraint. As a result of
this last constraint satisfaction cycle, only one translation model is left in the
partition for translation conditions, and it reflects the correct interpretation of the
multimodal message.

20 We leave for further research whether the analysis of scanning protocols by means of eye-tracking
techniques can provide information for imposing additional constraints on accessibility relations and
possible translation models for the construction of MDRS's. An interesting antecedent for the definition
of such constraints can be found in Faraday and Sutcliffe (1998).
21 How this constraint is selected is beyond the scope of this paper and we only make the assumption
that the symbols in the graphical and linguistic partitions of the MDRS form a part of the indexing
scheme required to retrieve the information from memory. For a prototype implementation, this kind
of constraint could be provided by the human user directly.

Figure 24
Interpretation of multimodal message by constraint satisfaction.
(a) Discourse referents: n1, n2, n3, n4, n5; o1, o2, o3, o4, o5, o6, o7, o8
    Translation conditions:
      {n1=o6, n2=o1, n3=o2, n4=o5, n5=o7} ∨
      {n1=o6, n2=o2, n3=o1, n4=o5, n5=o7} ∨
      {n1=o6, n2=o1, n3=o2, n4=o7, n5=o5} ∨
      {n1=o6, n2=o2, n3=o1, n4=o7, n5=o5}
    Graphical conditions: (the map)
    Linguistic conditions:
      Saarbrücken(n1) city(n1)
      France(n2) country(n2)
      Germany(n3) country(n3)
      Paris(n4) city(n4)
      Frankfurt(n5) city(n5)
      n1 lies at the intersection between the border between n2 and n3 and a line from n4 to n5
(b) Discourse referents: n1, n2, n3, n4, n5; o1, o2, o3, o4, o5, o6, o7, o8; n6, n7
    Translation conditions:
      {n1=o6, n2=o1, n3=o2, n4=o5, n5=o7} ∨
      {n1=o6, n2=o1, n3=o2, n4=o7, n5=o5}
    Graphical conditions: (the map)
    Linguistic conditions: as in (a), plus
      n6 is to the east of n7
      n6=n3, n7=n2
(c) Discourse referents: n1, n2, n3, n4, n5; o1, o2, o3, o4, o5, o6, o7, o8; n6, n7; n8, n9
    Translation conditions:
      {n1=o6, n2=o1, n3=o2, n4=o5, n5=o7}
    Graphical conditions: (the map)
    Linguistic conditions: as in (b), plus
      n8 is a city of n9
      n8=n4, n9=n2
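
The three constraint satisfaction cycles described above can be imitated with a brute-force generate-and-filter sketch. It is not the algorithm of Section 3.1: the coordinates, the reduction of regions to horizontal intervals, the typing of names (countries map to regions, cities to dots), and all function names are assumptions made here so that the example is small and runnable.

```python
from itertools import permutations

# Hypothetical graphical domain (coordinates are assumptions, not from the paper):
# regions are horizontal intervals (x_min, x_max); dots are (x, y) points.
REGIONS = {"o1": (0, 10), "o2": (10, 20)}
DOTS = {"o5": (5, 5), "o6": (10, 5), "o7": (15, 5)}
COUNTRIES = ("France", "Germany")
CITIES = ("Saarbruecken", "Paris", "Frankfurt")


def all_models():
    """Enumerate every type-respecting injective map from names to objects."""
    for regs in permutations(REGIONS, len(COUNTRIES)):
        for dots in permutations(DOTS, len(CITIES)):
            yield {**dict(zip(COUNTRIES, regs)), **dict(zip(CITIES, dots))}


def between(p, a, b):
    """True if point p lies on the segment from a to b."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    in_box = (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
              and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))
    return cross == 0 and in_box


def message(m):
    """Saarbruecken lies where the border meets the Paris-Frankfurt line."""
    f, g = REGIONS[m["France"]], REGIONS[m["Germany"]]
    border_x = f[1] if f[1] == g[0] else g[1]  # shared edge of the two intervals
    p = DOTS[m["Saarbruecken"]]
    return p[0] == border_x and between(p, DOTS[m["Paris"]], DOTS[m["Frankfurt"]])


def east_of(m):
    """Germany is to the east of France."""
    return REGIONS[m["Germany"]][0] >= REGIONS[m["France"]][1]


def paris_in_france(m):
    """Paris is a city of France."""
    lo, hi = REGIONS[m["France"]]
    return lo <= DOTS[m["Paris"]][0] <= hi


def solve(constraints):
    """Filter the candidate models cycle by cycle, recording how many survive."""
    models = list(all_models())
    history = [len(models)]
    for constraint in constraints:
        models = [m for m in models if constraint(m)]
        history.append(len(models))
    return models, history


models, history = solve([message, east_of, paris_in_france])
```

In this toy domain history is [12, 4, 2, 1]: the message alone leaves four models, the east-of constraint leaves two, and the city-of constraint leaves one, mirroring the progression through panels (a), (b), and (c) of Figure 24.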
As a last example of the integrated anaphoric and deictic interpretation, consider a 
situation in which the natural language expression It is big is mentioned after the
multimodal message in Figure 2 has been interpreted, as illustrated in Figure 25. In this
situation, the natural language information would enter into the partition for linguistic
conditions, and the pronoun it should be interpreted anaphorically in relation to the
context currently provided by the MDRS and could resolve to Saarbrücken (although
there are several possibilities). However, if the expression is supported by an overt 
gesture indicating the city of Paris, for instance, it would be deictic and its interpre- 
tation would have to be worked out with the same machinery; although in this latter 
situation the translation relation between graphical and linguistic referents could be 
asserted directly in the translation model as the gesture would render unnecessary the 
constraint satisfaction part of the deictic inference. 
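
The contrast between the two readings of it can be sketched as follows. The sketch assumes a single surviving translation model in which n4 (Paris) translates to o5; these object labels follow one reading of the figures and should be treated as assumptions, as should the function name and the treatment of every earlier referent as accessible.

```python
# Assumed final translation model (discourse referent -> graphical object).
FINAL_MODEL = {"n1": "o6", "n2": "o1", "n3": "o2", "n4": "o5", "n5": "o7"}


def resolve_it(new_ref, accessible, model, pointed_object=None):
    """Return candidate equality conditions and the (possibly updated) model."""
    if pointed_object is None:
        # Anaphoric use: every accessible referent is a candidate antecedent.
        return [f"{new_ref} = {r}" for r in accessible], model
    # Deictic use: the gesture asserts the translation directly, so no
    # constraint satisfaction cycle is needed.
    updated = dict(model)
    updated[new_ref] = pointed_object
    corefs = [f"{new_ref} = {r}" for r, o in model.items() if o == pointed_object]
    return corefs, updated
```

Without a gesture the function returns several candidates, reflecting the remark that there are several possibilities; with a gesture at o5 it returns the single condition n10 = n4 and records n10 in the translation model.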
With this we conclude the presentation of our model for integrated deictic and 
anaphoric inferences. The distinction between anaphora and deixis is clearly demar- 
cated. The antecedent for a pronoun, a proper name, or a description used anaphori- 
cally is provided by the discourse interpretation context, while the referent for a deictic 
pronoun, proper name, definite or indefinite description, or a demonstrative word like 
this or that is taken from an intermediate representation of a nonlinguistic modality 
such as the graphical context, and denotes an individual of the world directly, a view 
that is consistent with Kamp's distinction quoted in Section 1. 
Figure 25
Integrated interpretation of anaphora and spatial deixis.
Message text: It is big
(a) Discourse referents: n1, n2, n3, n4, n5; o1, o2, o3, o4, o5, o6, o7, o8; n6, n7; n8, n9
    Translation conditions:
      {n1=o6, n2=o1, n3=o2, n4=o5, n5=o7}
    Graphical conditions: (the map)
    Linguistic conditions:
      Saarbrücken(n1) city(n1)
      France(n2) country(n2)
      Germany(n3) country(n3)
      Paris(n4) city(n4)
      Frankfurt(n5) city(n5)
      n1 lies at the intersection between the border between n2 and n3 and a line from n4 to n5
      n6 is to the east of n7
      n6=n3, n7=n2
      n8 is a city of n9
      n8=n4, n9=n2
(b) Discourse referents: as in (a), plus n10
    Translation conditions:
      {n1=o6, n2=o1, n3=o2, n4=o5, n5=o7}
    Graphical conditions: (the map)
    Linguistic conditions: as in (a), plus
      n10 is big
      n10=n1
5. Conclusions 
In this paper, we have presented a theory of representation and interpretation for 
multimodal messages and a model for multimodal reference resolution. The model 
is based on the view that a modality is a code system on a medium that can be 
characterized by well-defined syntax and semantics. Multimodal interpretation is a 
matter of working out coreference relations between terms of different modalities. 
A central concern in articulating this theory is a clear characterization of how spatial 
deictic reference is resolved and of how spatial deictic reference relates to the resolution 
of anaphors in the normal flow of discourse. A key theoretical assumption we make is 
that graphics are interpreted deictically, which is in opposition to the view that graphical
representations are interpreted anaphorically.
The theoretical machinery for the definition of the syntax and semantics is formally 
developed along the lines of Montague's semiotic program and its associated general 
theory of translation. We have also illustrated an algorithm for finding the translation 
between texts and graphics, as messages in these modalities are introduced through 
independent input channels, and the translation between linguistic terms and their cor- 
responding graphical expressions must be induced dynamically. We also suggested an 
extension of Kamp's DRT with multimodal discourse structures (MDRS). This model 
defines an integrated interpretation model for multimodal messages while maintain- 
ing a clear demarcation between indexical and anaphoric inferential processes. Natural 
language terms, like proper names, pronouns, and descriptions, can be interpreted in
relation to a model; however, these linguistic terms also admit anaphoric and deictic 
interpretations. 
It is important to note that although we used a simplified extensional definition for 
the semantics of natural language and graphical expressions, the system was carefully
designed to move smoothly into the intensional domain. Consider that the extensional
formulation used in the semantic definition of the graphical language can be
easily extended into an intensional one by changing the types of constants, predicates, 
and sentences from individuals, sets of individuals, and truth values, into individual 
concepts, properties of individuals, and propositions, respectively. This is achieved 
simply by indexing the interpretation of expressions in terms of a possible world and 
time, and all definitions presented above could be considered relative to the current 
world and time. The move to the intensional domain would allow the definition of 
the interpretation of more comprehensive natural language segments.
Intensionality is also relevant for the interpretation of graphical languages, in general,
and for the definition of graphical and linguistic interactive systems, in particular.
In interactive sessions with a computer graphic interface, the interaction states can be 
considered as possible worlds and the interpretation of graphical constants would 
depend on particular graphical states. If a graphical object like a dot, for instance, is 
moved from one position to another in an interactive transaction, we have the intuition 
that the object before and after the change is the same and denotes the same object of 
the world and yet not even its position, which one could think of as an essential prop- 
erty of a dot, is the same. Accordingly, in the intensional setting the semantic value of 
a graphical constant is not an individual but an individual concept. Consider as well 
that the same graphical description can have different semantic values in different
interactive states; for instance, the value of the expression position of d1 will be a different
ordered pair before and after dot d1 is moved. According to this, the interpretation
of graphical operators at every index will be a function from sequences of graphical 
objects of the proper kind into graphical objects; however, unlike normal linguistic sit- 
uations in which different functions at different indices are assigned to operators and 
predicates, the same function at every index has to be assigned to geometrical operators 
and predicates, as the geometry is always the same. Moving into the intensional setting 
is also relevant for our treatment of indexicals. In our current approach, the interpreta- 
tion of a term used deictically is an individual of the world; in the intensional context, 
the interpretation of the same term in one particular interaction state will be the same 
in every state despite the fact that the description for referring to such an object in the 
state in which it was selected might pick up a different individual in a different state. 
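
A small sketch may help to make the role of interaction states concrete. It assumes two states indexed 0 and 1, invented positions, and a second dot d2 that ends up where d1 used to be; none of these values come from the paper, and the function names are hypothetical.

```python
# Interaction states play the role of possible-world indices.
# Positions are invented: d1 is moved between state 0 and state 1,
# and d2 is moved to the position that d1 previously occupied.
POSITIONS = {
    0: {"d1": (2, 3), "d2": (9, 9)},
    1: {"d1": (7, 3), "d2": (2, 3)},
}


def position_of(obj, state):
    """Value of 'position of obj' at a given index (an ordered pair)."""
    return POSITIONS[state][obj]


def dot_at(pos, state):
    """Evaluate the description 'the dot at pos' at a given index."""
    matches = [o for o, p in POSITIONS[state].items() if p == pos]
    return matches[0] if len(matches) == 1 else None


# A deictic selection made in state 0 fixes the object once and for all.
selected = dot_at((2, 3), 0)
```

Here selected stays bound to d1 in every state, although position_of("d1", 0) and position_of("d1", 1) differ and the description used to select it picks out d2 when evaluated in state 1.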
In the future, it would be interesting to deal with a more general fragment of natural
language that includes temporal expressions. In the same way that the language
G provided a finite and small domain for the interpretation of linguistic spatial
prepositions, a similar language T for the interpretation of temporal prepositions could be
defined. Temporal predicates and operators of this language would be interpreted in
terms of arithmetic functions like those presented, for instance, in Allen's temporal 
logic (Allen 1983). In the same way that the constraint satisfaction algorithm for the 
definition of the translation between graphical and linguistic terms helped to solve 
deictic inferences, a constraint satisfaction algorithm for resolving temporal deictic 
references in relation to a finite and small domain of actions and events is conceiv- 
able. The definition for such a spatial and temporal indexical model could be quite 
helpful for the implementation of natural language and graphics systems in which
actions and events are mentioned in the course of interactive conversations. 
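
To indicate the kind of arithmetic interpretation that temporal predicates could receive, the sketch below classifies two intervals into one of the thirteen interval relations of Allen (1983) by comparing endpoints. The representation of an interval as a (start, end) pair of numbers with start < end and the function name are assumptions; the language T itself is not defined in this paper.

```python
def allen_relation(a, b):
    """Name the Allen relation holding between intervals a and b.

    Each interval is a (start, end) pair with start < end.
    """
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    # Remaining case: the intervals overlap properly with no shared endpoint.
    return "overlaps" if s1 < s2 else "overlapped-by"
```

A temporal constraint such as "the first event happened before the second" could then be checked against a small domain of event intervals in the same generate-and-filter style used for spatial constraints.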
6. Implementation 
Although a prototype system for the theory presented in this paper has not been im- 
plemented, several aspects of the theory have been implemented in relation to simpler 
systems. A simpler version of the strategy for multimodal interpretation of the scheme 
in Figure 3 was implemented in the first version of the Graflog system (Pineda 1989). 
Several versions of the graphical language and its geometrical interpreter have been
implemented in relation to different application domains (Morales 1994; Masse 1994; 
Santana 1999; Garza 1995) with BinProlog and the TCL/TK programming environ- 
ment. The geometrical interpreter and the strategy of evaluating a set of geometrical 
constraints incrementally in relation to a graphical domain were used in a later version
of Graflog to solve and generate graphical explanations of geometrical constraint sat- 
isfaction problems (Pineda 1992, 1998), and also for the definition of a model (not yet 
fully implemented) for the production of solids from orthogonal views of polyhedra 
(Garza and Pineda 1998). We also implemented the scheme for enumerating functions 
used in the definition of translation models for a semantic theorem-proving system 
written in Prolog, in order to find the possible models satisfying logical theories about 
graphical scenarios of the Hyperproof system (Barwise and Etchemendy 1994).
Acknowledgments 
The authors gratefully acknowledge 
support to Luis Pineda from the Institute 
for Applied Mathematics and Systems 
(IIMAS) at the National University of 
Mexico (UNAM) and Conacyt grant 
400316-5-27948-A and to Luis Pineda and 
Gabriela Garza from the Institute for 
Electrical Research (IIE), Mexico. We are 
grateful also to the anonymous reviewers, 
and for helpful discussions with numerous 
people including James Allen, Elisabeth
André, Kees van Deemter, John Lee, Sergio
Santana, Oliviero Stock, Thomas Rist, Henk
Zeevat, and most especially Ewan Klein.

References 
Allen, James. 1983. Maintaining knowledge 
about temporal intervals. Communications 
of the ACM. 26(11): 832-843. 
André, Elisabeth and Thomas Rist. 1994.
Referring to world objects with text and 
pictures. In Proceedings of COLING 94, 
pages 530-534. 
Barwise, Jon and John Etchemendy. 1994. 
Hyperproof. CSLI. 
van Deemter, Kees and Stanley Peters, 
editors. 1995. Semantic Ambiguity and 
Underspecification. CSLI Publications 1996, 
Stanford, CA. 
Dowty, David R., Robert E. Wall, and 
Stanley Peters. 1985. Introduction to 
Montague Semantics. D. Reidel Publishing 
Company, Dordrecht, Holland. 
Faraday, Pete and Alistair Sutcliffe. 1998. 
Providing advice for multimedia 
designers. In L. Pineda, T. Rist, and J. Lee, 
editors, Proceedings of the Workshop on 
Interpretation and Generation in Intelligent 
Multimodal Systems and Graphical 
Reasonings in Expert Systems, pages 72-79, 
Fourth World Congress on Expert 
Systems. ITESM Mexico City Campus, 
Mexico. 
Feiner, Steven and Kathleen McKeown. 
1993. Automating the generation of 
coordinated multimedia explanations. In 
M. Maybury, editor, Intelligent Multimedia 
Interfaces. The MIT Press and AAAI Press, 
Menlo Park, CA, pages 117-239. 
Garza, Gabriela. 1995. Síntesis de Poliedros
a partir de sus Vistas Ortogonales: Un
Caso de Estudio acerca del Razonamiento
Gráfico. M. Sc. thesis, ITESM, Campus
Morelos, Mexico. 
Garza, Gabriela and Luis Pineda. 1998. 
Synthesis of solid models of polyhedra 
from their orthogonal views using logical 
representations. In Expert Systems with 
Applications. Elsevier Science Ltd., 
Volume 14, pages 91-108. 
Kamp, Hans. 1981. A theory of truth and 
semantic representation. Formal Methods in 
the Study of Language, 136: 277-322. 
Mathematical Centre Tracts. 
Kamp, Hans and Uwe Reyle. 1993. From 
Discourse to Logic. Kluwer Academic 
Publisher, Dordrecht, Holland. 
Kaplan, David. 1978. DTHAT. Syntax and 
Semantics. Volume 9, pages 383-399. 
Klein, Ewan and Luis Pineda. 1990. 
Semantics and graphical information. In 
Diaper, Gilmore, Cockton, and Shackel, 
editors, Human-Computer Interaction, 
Interact'90. IFIP, North-Holland, 
pages 485-491.
Kripke, Saul. 1972. Naming and Necessity. 
Basil Blackwell, Oxford. 
Lyons, John. 1968. Introduction to Theoretical 
Linguistics, Cambridge University Press, 
Cambridge. 
Mackinlay, Jock Douglas. 1987. Automatic 
Design of Graphical Presentations. Ph.D. 
thesis, Stanford University. University 
Microfilms International. 
Mann, William C. and Sandra A. 
Thompson. 1988. Rhetorical Structure 
Theory: Toward a functional theory of 
text organization. Text, 8(3): 243-281. 
Masse, J. Antonio. 1994. Satisfacción de
Restricciones por Referencia Simbólica en
Dibujos Geométricos. B. Sc. thesis, ENEP
Aragón, UNAM, Mexico.
Maybury, Mark. 1993. Planning multimedia 
explanations using communicative acts. In 
M. Maybury, editor, Intelligent Multimedia 
Interfaces. The MIT Press and AAAI Press, 
Menlo Park, CA, pages 59-74. 
Moore, Johanna. 1995. Participating in 
Explanatory Dialogues: Interpreting and 
Responding to Questions. The MIT Press, 
Cambridge. A Bradford Book. 
Morales, Rafael. 1994. Pizarrones 
Interactivos Multimodales para la 
Enseñanza de Conceptos Matemáticos. M.
Sc. thesis, ITESM, Campus Morelos, 
Mexico. 
Pineda, Luis. 1989. Graflog: A Theory of 
Semantics for Graphics with Applications to 
Human-Computer Interaction and CAD 
Systems. Ph.D. thesis, University of 
Edinburgh. 
Pineda, Luis. 1992. Reference, synthesis and 
constraint satisfaction. Computer Graphics 
Forum. 11(3): 333-344. 
Pineda, Luis. 1998. Graphical and linguistic 
dialogue for intelligent multimodal 
systems. In Expert Systems with 
Applications. Elsevier Science Ltd., 
Volume 14, pages 149-157. 
Poesio, Massimo. 1994. Discourse 
Interpretation and the Scope of Operators. 
Ph.D. thesis, University of Rochester. 
Reiter, Ehud and Robert Dale. 1992. A fast 
algorithm for the generation of referring 
expressions. In Proceedings of the 
COLING'92. Volume 1, pages 232-238, 
Nantes, France. 
Reiter, Raymond and Alan K. Mackworth. 
1987. The logic of depiction. Technical 
Reports on Research in Biological and 
Computational Vision at the University of 
Toronto. RCBV-TR-87-18. 
Rist, Thomas. 1996. Current state of the 
reference model for intelligent 
multimedia presentation systems. Paper 
presented in the workshop "Towards a 
Standard Reference Model for Intelligent 
Presentation Systems" at the 12th 
European Conference on Artificial 
Intelligence. Budapest. August. 
Santana, Sergio. 1999. The Generation of 
Coordinated Natural and Graphical 
Explanations in Design Environments. Ph.D. 
thesis, Universidad de Salford. 
Shamos, M. I. 1978. Computational Geometry. 
Ph.D. thesis, Yale University. University 
Microfilms International. 
Steedman, Mark J. 1986. Incremental 
interpretation in dialogue. ACORD 
Project Deliverable T2.4. Department of 
Artificial Intelligence and Centre for 
Cognitive Science. University of 
Edinburgh. 
Stiny, G. 1975. Pictorial and Formal Aspects of
Shape Grammars. Birkhäuser Verlag, Basel.
Stock, Oliviero and the AlFresco Project 
Team. 1993. AlFresco: Enjoying the 
combination of natural language
processing and hypermedia for 
information exploration. In M. Maybury, 
editor, Intelligent Multimedia Interfaces. The
MIT Press and AAAI Press, Menlo Park, 
CA, pages 197-224. 
Wahlster, Wolfgang. 1991. User and 
discourse models for multimodal 
communication. In J. W. Sullivan and 
S. W. Tyler, editors, Intelligent User 
Interfaces. ACM Press, New York, 
pages 45-67. 
Wahlster, Wolfgang, Elisabeth André,
Wolfgang Finkler, Hans-Jürgen Profitlich,
and Thomas Rist. 1993. Plan-based 
integration of natural language and
graphics generation. Artificial Intelligence 
63: 387-427. 
Wittenburg, Kent. 1998. Visual language
parsing: If I had a hammer .... In Harry 
Bunt, Robbert-Jan Beun, and Tijn 
Borghuis, editors, Multimodal 
Human-Computer Communication: Systems, 
Techniques and Experiments. 
Springer-Verlag, pages 231-249. 
