Abstract
To support context-based multimodal interpre-
tation in conversational systems, we have devel-
oped a semantics-based representation to
capture salient information from user inputs and
the overall conversation. In particular, we
present three unique characteristics: fine-
grained semantic models, flexible composition
of feature structures, and consistent representa-
tion at multiple levels. This representation
allows our system to use rich contexts to resolve
ambiguities, infer unspecified information, and
improve multimodal alignment. As a result, our
system is able to enhance understanding of mul-
timodal inputs, including abbreviated, imprecise,
or complex ones.
1 Introduction
Inspired by earlier works on multimodal interfaces
(e.g., Bolt, 1980; Cohen et al., 1996; Wahlster, 1991;
Zancanaro et al., 1997), we are currently building an
intelligent infrastructure, called Responsive Informa-
tion Architect (RIA) to aid users in their informa-
tion-seeking process. Specifically, RIA engages
users in a full-fledged multimodal conversation,
where users can interact with RIA through multiple
modalities (speech, text, and gesture), and RIA can
act/react through automated multimedia generation
(speech and graphics) (Zhou and Pan 2001). Cur-
rently, RIA is embodied in a testbed, called
RealHunter™, a real-estate application to help users
find residential properties.
As a part of this effort, we are building a seman-
tics-based multimodal interpretation framework
MIND (Multimodal Interpretation for Natural Dia-
log) to identify meanings of user multimodal inputs.
Traditional multimodal interpretation has
focused on integrating multimodal inputs, with
limited consideration of the interaction context.
In a conversation setting, user inputs could be abbre-
viated or imprecise. Simply combining multiple
inputs often cannot yield a full understanding.
Therefore, MIND applies rich contexts (e.g.,
conversation context and domain context) to
enhance multimodal interpretation. In support of this
context-based approach, we have designed a seman-
tics-based representation to capture salient informa-
tion from user inputs and the overall conversation.
In this paper, we will first give a brief overview of
multimodal interpretation in MIND. Then we will
present our semantics-based representation and dis-
cuss its characteristics. Finally, we will describe the
use of this representation in context-based multimo-
dal interpretation and demonstrate that, with this rep-
resentation, MIND is able to process a variety of user
inputs, including ambiguous, abbreviated, and
complex ones.
2 Multimodal Interpretation
To interpret user multimodal inputs, MIND performs
three major processes, as shown in Figure 1: unimodal
understanding, multimodal understanding, and dis-
course understanding. During unimodal understand-
ing, MIND applies modality specific recognition and
understanding components (e.g., a speech recognizer
and a language interpreter) to identify meanings
from each unimodal input, and captures those mean-
ings in a representation called modality unit. During
multimodal understanding, MIND combines seman-
tic meanings of unimodal inputs (i.e., modality
units), and uses contexts (e.g., conversation context
and domain context) to form an overall understand-
ing of user multimodal inputs. Such an overall
understanding is then captured in a representation
called conversation unit. Furthermore, MIND also
identifies how an input relates to the overall conver-
sation discourse through discourse understanding. In
particular, MIND uses a representation called con-
versation segment to group together inputs that con-
tribute to the same goal or sub-goal (Grosz and Sidner,
1986). The result of discourse understanding is an
evolving conversation history that reflects the over-
all progress of a conversation.
Figure 2 shows a conversation fragment between a
user and MIND.

Figure 1. MIND components: speech, text, and gesture
inputs pass through modality-specific recognizers and
interpreters (unimodal understanding) to produce
modality units; the Multimodal Interpreter combines
them, using the domain and visual contexts, into a
conversation unit (multimodal understanding); the
Discourse Interpreter relates that unit to conversation
segments in the conversation history (discourse
understanding), interacting with other RIA components.

Semantics-based Representation for Multimodal
Interpretation in Conversational Systems
Joyce Chai
IBM T. J. Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532, USA
jchai@us.ibm.com

In the first user input U1, the deictic gesture (shown in
Figure 3) is ambiguous. It is not clear which object the
user is pointing at: two houses nearby or the town of
Irvington.¹ The third user input
U3 by itself is incomplete since the purpose of the
input is not specified. Furthermore, in U4, a single
deictic gesture overlaps (in terms of time) with both
“this style” and “here” from the speech input, so it is
hard to determine which one of those two references
should be aligned and fused with the gesture. Finally,
U5 is also complex since multiple objects (“these two
houses”) specified in the speech input need to be uni-
fied with a single deictic gesture.
This example shows that user multimodal inputs
exhibit a wide range of varieties. They could be
abbreviated, ambiguous, or complex. Fusing inputs
together often cannot by itself yield a full
understanding; contexts are essential for processing
these inputs.
3 Semantics-based Representation
To support context-based multimodal interpretation,
both representation of user inputs and representation
of contexts are crucial. Currently, MIND uses three
types of contexts: domain context, conversation con-
text, and visual context. The domain context provides
domain knowledge. The conversation context reflects
the progress of the overall conversation. The visual
context gives the detailed semantic and syntactic
structures of visual objects and their relations. In this
paper, we focus on representing user inputs and the
conversation context. In particular, we discuss two
aspects of representation: semantic models that cap-
ture salient information and structures that represent
those semantic models.
3.1 Semantic Models
When two people participate in a conversation, their
understanding of each other’s purposes forms strong
constraints on how the conversation is going to pro-
ceed. In particular, in a conversation centered on
information seeking, understanding each other’s
information needs is crucial. Information needs can
be characterized by two main aspects: motivation for
seeking the information of interest and the informa-
tion sought itself. Thus, MIND uses an intention
model to capture the first aspect and an attention
model to capture the second. Furthermore, since users
can use different ways to specify their information of
interest, MIND also uses a constraint model to cap-
ture different types of constraints that are important
for information seeking.
3.1.1 Intention and Attention
Intention describes the purpose of a message. In an
information seeking environment, intention indicates
the motivation or task related to the information of
interest. An intention is modeled by three dimensions:
Motivator, indicating one of three high-level purposes,
namely DataPresentation, DataAnalysis (e.g., compari-
son), and ExceptionHandling (e.g., clarification); Act,
specifying whether the input is a request or a reply;
and Method, indicating a specific task, e.g., Search
(activating the relevant objects based on some crite-
ria) or Lookup (evaluating/retrieving attributes of
objects).
Attention relates to the objects and relations that are
salient at each point of a conversation. In an informa-
tion seeking environment, it relates to the information
sought. An attention model is characterized by six
dimensions. Base indicates the semantic type of the
information of interest (e.g., House, School, or City,
which are defined in our domain ontology). Topic
specifies the granularity of the information of interest
(e.g., Instance or Collection). Focus identifies the scope
of the topic as to whether it is about a particular fea-
ture (i.e., SpecificAspect) or about all main features
(i.e., MainAspect). Aspect provides specific features of
the topic. Constraint describes constraints to be satis-
fied (described later). Content points to the actual data.
The intention and attention models were derived
based on preliminary studies of user information
needs in seeking for residential properties. The details
are described in (Chai et al., 2002).
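As an illustration, the intention model's three dimensions and the attention model's six dimensions can be pictured as simple typed records. The following Python sketch is ours, not MIND's actual code; field names follow the dimensions above, and the value strings mirror the paper's examples:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Intention:
    motivator: Optional[str] = None  # DataPresentation | DataAnalysis | ExceptionHandling
    act: Optional[str] = None        # Request | Reply
    method: Optional[str] = None     # Search | Lookup

@dataclass
class Attention:
    base: Optional[str] = None       # semantic type, e.g., House, School, City
    topic: Optional[str] = None      # Instance | Collection
    focus: Optional[str] = None      # SpecificAspect | MainAspect
    aspect: Optional[str] = None     # a specific feature, e.g., Price
    constraint: Optional[dict] = None  # constraint structure (Section 3.1.2)
    content: list = field(default_factory=list)  # actual data, e.g., MLS ids

# U1 speech "How much is this": a request to look up the price of one instance.
u1_speech_intention = Intention(motivator="DataPresentation",
                                act="Request", method="Lookup")
u1_speech_attention = Attention(topic="Instance", focus="SpecificAspect",
                                aspect="Price",
                                constraint={"Category": "Anaphora",
                                            "Manner": "Demonstrative(THIS)",
                                            "Number": 1})
```

Note that, as in MIND's modality units, only features that the input instantiates carry values; the rest stay unfilled.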
For example, Figure 4(a-b) shows the Intention and
Attention identified from U1 speech and gesture input
respectively. Intention in Figure 4(a) indicates the user
is requesting RIA (Act: Request) to present her some
data (Motivator: DataPresentation) about attributes of
certain object(s) (Method: Lookup).

¹ The generated display has multiple layers, where the house icons
are on top of the Irvington town map. Thus this deictic gesture could
refer either to the town of Irvington or to the houses.

Figure 2. A conversation fragment
  (A collection of houses is shown on the map of Irvington.)
  U1: Speech: How much is this?
      Gesture: Point to the screen (not directly on any object)
  R1: Speech: Which house are you interested in?
      Graphics: Highlight two house icons
  U2: Speech: The green one.
  R2: Speech: The green house costs 250,000 dollars.
  U3: Speech: What about this one?
      Gesture: Point to a house icon on the screen
  R3: Speech: This house costs 320,000 dollars.
      Graphics: Highlight the house icon and show a picture
  U4: Speech: Show me houses with this style around here
      Gesture: Point to a position east of Irvington on the map
  R4: Speech: This is a Victorian style house. I find seven Victorian
      houses in White Plains.
      Graphics: Show seven houses in White Plains
  U5: Speech: Compare these two houses with the previous house.
      Gesture: Point to the corner of the screen where two house icons
      are displayed
  R5: Speech: Here is the comparison chart.
      Graphics: Show a chart

Figure 3. An example of graphics output: a collection of houses
shown on the map of Irvington, with a callout marking where the
user points.

The Attention indi-
cates that the information of interest is about the price
(Aspect: Price) of a certain object (Topic: Instance). The
exact object is not known but is referred to by a demon-
strative “this” (in Constraint). Intention in Figure 4(b)
does not have any information since the high level
purpose and the specific task cannot be identified
from the gesture input. Furthermore, because of the
ambiguity of the deictic gesture, three Attentions are
identified. The first two Attentions are about house
instances MLS0234765 and MLS0876542 (IDs from the
Multiple Listing Service) and the third is about the town
of Irvington.
3.1.2 Constraints
In an information seeking environment, based on the
conversation context and the graphic display, users
can refer to objects using different types of refer-
ences, for example, through temporal or spatial rela-
tions, visual cues, or simply a deictic gesture.
Furthermore, users can also search for objects using
different constraints on data properties. Therefore,
MIND models two major types of constraints: refer-
ence constraints and data constraints. Reference con-
straints characterize different types of references.
Data constraints specify relations of data properties.
A summary of our constraint model is shown in
Figure 5. Both reference constraints and data con-
straints are characterized by six dimensions. Category
sub-categorizes constraints (described later). Manner
indicates the specific way such a constraint is
expressed. Aspect indicates a feature (features) this
constraint is concerned about. Relation specifies the
relation to be satisfied between the object of interest
and other objects or values. Anchor provides a particu-
lar value, object or a reference point this constraint
relates to. Number specifies cardinal numbers that are
associated with the constraint.
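The six constraint dimensions can likewise be sketched as a record; again this is an illustrative rendering of the model in Figure 5, not MIND's implementation:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Constraint:
    category: str                   # Anaphora | Temporal | Visual | Spatial | Attributive
    manner: Optional[str] = None    # e.g., Demonstrative, Relative, Comparative
    aspect: Optional[str] = None    # feature concerned, e.g., Color, Price
    relation: Optional[str] = None  # e.g., Precede, Equals, Less-Than
    anchor: Optional[Union[str, object]] = None  # value, object, or reference point
    number: Optional[int] = None    # cardinality, e.g., 1 or 2

# "the previous house" (cf. Figure 6a): a relative temporal reference.
previous_house = Constraint(category="Temporal", manner="Relative",
                            relation="Precede", anchor="Current", number=1)

# "the green house" (cf. Figure 6b): a visual reference on the Color aspect.
green_house = Constraint(category="Visual", manner="Comparative",
                         aspect="Color", relation="Equals",
                         anchor="Green", number=1)
```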
Reference Constraints
Reference constraints are further categorized into
four categories: Anaphora, Temporal, Visual, and Spatial.
An anaphora reference can be expressed through pro-
nouns such as “it” or “them” (Pronoun), demonstra-
tives such as “this” or “these” (Demonstrative), “here” or
“there” (Here/There), or proper names such as “Lyn-
hurst” (ProperNoun). An example is shown in
Figure 4(a), where a demonstrative “this” (Manner:
Demonstrative-This) is used in the utterance “this house”
to refer to a single house object (Number: 1). Note that
Manner also keeps track of the specific type of the
term. The subtle difference between terms can pro-
vide additional cues for resolving references. For
example, the different use of “this” and “that” may
indicate the recency of the referent in the user mental
model of the discourse, or the closeness of the refer-
ent to the user’s visual focus.
Temporal references use temporal relations to refer
to entities that occurred in the prior conversation.
Manner is characterized by Relative and Absolute. Rela-
tive indicates a temporal relation with respect to a cer-
tain point in a conversation, and Absolute specifies a
temporal relation with respect to the whole interaction.
Relation indicates the temporal relations (e.g., Precede
or Succeed) or ordinal relations (e.g., first). Anchor
indicates a reference point. For example, as in
Figure 6(a), a Relative temporal constraint is used
since “the previous house” refers to the house that pre-
cedes the current focus (Anchor: Current) in the conver-
sation history. On the other hand, in the input: “the
first house you showed me,” an Absolute temporal con-
straint is used since the user is interested in the first
house shown to her at the beginning of the entire con-
versation.
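In effect, a temporal constraint tells the resolver where to enter the conversation history (Anchor), which direction to search (Relation), and how many referents to collect (Number). A possible resolution routine, assuming for simplicity that the history is a flat, oldest-first list of focused objects (MIND's actual history is hierarchical; see Section 3.2):

```python
def resolve_temporal(history, manner, relation, number=1):
    """Resolve a temporal reference constraint against an ordered
    conversation history (oldest first). Illustrative sketch only."""
    if manner == "Absolute":
        # Ordinal position with respect to the whole interaction,
        # e.g., "the first house you showed me" -> history[0].
        return history[:number] if relation == "First" else history[-number:]
    if manner == "Relative":
        # Position relative to the current focus (the last entry),
        # e.g., "the previous house" -> the entry preceding it.
        if relation == "Precede":
            return history[-1 - number:-1]
        if relation == "Succeed":
            return []  # nothing follows the current focus yet
    return []

history = ["MLS0234765", "MLS0876542", "MLS7689432"]
resolve_temporal(history, "Relative", "Precede")  # -> ["MLS0876542"]
resolve_temporal(history, "Absolute", "First")    # -> ["MLS0234765"]
```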
Spatial references describe entities on the graphic
display in terms of their spatial relations. Manner is
again characterized by Absolute and Relative. Absolute
indicates that entities are specified through orienta-
tions (e.g., left or right, captured by Relation) with
respect to the whole display screen (Anchor: Display-
Frame). In contrast, Relative specifies that entities are
described through orientations with respect to a
particular sub-frame (Anchor: FocusFrame, e.g., an area
with highlighted objects) or another object.

Figure 4. Intention and Attention for U1 unimodal inputs
(a) U1 speech: “How much is this”
  Intention
    Motivator: DataPresentation
    Act: Request
    Method: Lookup
  Attention
    Topic: Instance
    Focus: SpecificAspect
    Aspect: Price
    Constraint:
      Category: Anaphora
      Manner: Demonstrative(THIS)
      Number: 1
(b) U1 gesture: pointing
  Intention
  Attention
    Base: House
    Topic: Instance
    Content: {MLS0234765}
  Attention
    Base: House
    Topic: Instance
    Content: {MLS0876542}
  Attention
    Base: City
    Topic: Instance
    Content: {“Irvington”}

Figure 5. Constraint model
Reference constraints:
  Anaphora:  Manner: Demonstrative, Pronoun, Here/There, ProperNoun
  Temporal:  Manner: Relative, Absolute; Relation: Precede, Succeed,
             Ordinal (e.g., first); Anchor: Current, Object
  Spatial:   Manner: Relative, Absolute; Relation: Orientation (e.g.,
             Left, Right); Anchor: DisplayFrame, FocusFrame, Object
  Visual:    Manner: Comparative; Aspect: VisualProperties (e.g., Color,
             Highlight); Relation: Equals; Anchor: DataValue,
             ValueOfObject, Object
Data constraints:
  Attributive: Manner: Comparative, Superlative, Fuzzy; Aspect: data
             features (e.g., Price, Size); Relation: Less-Than, Equals,
             Greater-Than; Anchor: DataValue, ValueOfObject, Object
Number (any category): Multiple, or a cardinal number (e.g., 1, 2)

Figure 6. Temporal and visual reference constraints
(a) “the previous house”
  Attention
    Base: House
    Topic: Instance
    Constraint:
      Category: Temporal
      Manner: Relative
      Relation: Precede
      Anchor: Current
      Number: 1
(b) “the green house”
  Attention
    Base: House
    Topic: Instance
    Constraint:
      Category: Visual
      Manner: Comparative
      Aspect: Color
      Relation: Equals
      Anchor: “Green”
      Number: 1
Visual references describe entities on the graphic
output using visual properties (such as displaying col-
ors or shapes) or visual techniques (such as high-
light). Manner of Comparative indicates a visual entity
is compared with another value (captured by Anchor).
Aspect indicates the visual entity used (such as Color
and Shape, which are defined in our domain ontol-
ogy). Relation specifies the relation to be satisfied
between the visual entity and some value. For exam-
ple, the constraint used in the input “the green house” is
shown in Figure 6(b). It is worth mentioning that dur-
ing reference resolution, the color Green will be fur-
ther mapped to the internal color encoding used by
graphics generation.
Data Constraints
Data constraints describe objects in terms of their
actual data attributes (Category: Attributive). The Man-
ner of Comparative indicates the constraint is about a
comparative relation between (aspects of) the desired
entities with other entities or values. Superlative indi-
cates the constraint is about minimum or maximum
requirement(s) for particular attribute(s). Fuzzy indi-
cates a fuzzy description of the attributes (e.g.,
“cheap house”). For example, for the input “houses
under 300,000 dollars” in Figure 7(a), Manner is Compar-
ative since the constraint is about a “less than” rela-
tionship (Relation: Less-Than) between the price
(Aspect: Price) of the desired object(s) and a particular
value (Anchor: “300000 dollars”). For the input “3 largest
houses” in Figure 7(b), Manner is Superlative since it is
about the maximum (Relation: Max) requirement on
the size of the houses (Aspect: Size).
The refined characterization of different constraints
provides rich cues for MIND to identify objects of
interest. In an information seeking environment, the
objects sought can come from different sources. They
could be entities that have been described earlier in
the conversation, entities that are visible on the dis-
play, or entities that have never been mentioned or
seen but exist in a database. Thus, fine-grained con-
straints allow MIND to determine where and how to
find the information of interest. For example, tempo-
ral constraints help MIND navigate the conversation
history by providing guidance on where to start,
which direction to follow in the conversation history,
and how many to look for.
Our fine-grained semantic models of intention,
attention and constraints characterize user informa-
tion needs and therefore enable the system to come
up with an intelligent response. Furthermore, these
models are domain independent and can be applied to
any information seeking applications (for structured
information).
3.1.3 Representing User Inputs
Given the semantic models of intention, attention and
constraints, MIND represents those models using a
combination of feature structures (Carpenter, 1992).
This representation is inspired by earlier work
(Johnston et al., 1997; Johnston, 1998) and offers the
flexibility to accommodate complex inputs. Specifi-
cally, MIND represents intention, attention and con-
straints identified from user inputs as a result of both
unimodal understanding and multimodal understand-
ing.
During unimodal understanding, MIND applies a
decision tree based semantic parser on natural lan-
guage inputs (Jelinek et al., 1994) to identify salient
information. For the gesture input, MIND applies a
simple geometry-based recognizer. As a result, infor-
mation from each unimodal input is represented in a
modality unit. We have seen several modality units
(in Figure 4, Figure 6, and Figure 7), where intention,
attention and constraints are represented in feature
structures. Note that only features that can be instan-
tiated by information from the user input are included
in the feature structure. For example, since the exact
object cannot be identified from U1 speech input, the
Content feature is not included in its Attention structure
(Figure 4a). In addition to intention, attention and
constraints, a modality unit also keeps a time stamp
that indicates when a particular input takes place.
This time information is used for multimodal align-
ment which we do not discuss here.
Depending on the complexity of user inputs, the
representation can be composed by a flexible
combination of different feature structures.

Figure 7. Attributive data constraints
(a) “houses under 300,000 dollars”
  Attention
    Base: House
    Topic: Collection
    Constraint:
      Category: Attributive
      Manner: Comparative
      Aspect: Price
      Relation: Less-Than
      Anchor: “300000 dollars”
(b) “3 largest houses”
  Attention
    Base: House
    Topic: Collection
    Constraint:
      Category: Attributive
      Manner: Superlative
      Aspect: Size
      Relation: Max
      Number: 3

Figure 8. Attention structures for U4
(a) Attention structure in the modality unit for U4 speech input
  Attention (A1)
    Base: House
    Topic: Collection
    Constraint:
      Category: Attributive
      Manner: Comparative
      Aspect: Style
      Relation: Equals
      Anchor: * (A2)
    Constraint:
      Category: Attributive
      Manner: Comparative
      Aspect: Location
      Relation: Equals
      Anchor: * (A3)
  Attention (A2)
    Topic: Instance
    Constraint:
      Category: Anaphora
      Manner: Demonstrative(THIS)
      Number: 1
  Attention (A3)
    Base: GeoLocation
    Topic: Instance
    Constraint:
      Category: Anaphora
      Manner: HERE
(b) Attention structure in the conversation unit for U4 speech input
  Attention (A1)
    Base: House
    Topic: Collection
    Constraint:
      Category: Attributive
      Manner: Comparative
      Aspect: Style
      Relation: Equals
      Anchor: “Victorian”
    Constraint:
      Category: Attributive
      Manner: Comparative
      Aspect: Location
      Relation: Equals
      Anchor: “White Plains”

Specifically, an
attention structure may have a constraint structure as
its feature, and on the other hand, a constraint struc-
ture may also include another attention structure.
For example, U4 in Figure 2 is a complex input,
where the speech input “show me houses with this
style around here” consists of multiple objects with dif-
ferent relations. The modality unit created for U4
speech input is shown in Figure 8(a). The Attention
feature structure (A1) contains two attributive con-
straints indicating that the objects of interest are a
collection of houses that satisfy two attributive con-
straints. The first constraint is about the style (Aspect:
Style), and the second is about the location. Both of
these constraints are related to other objects (Manner:
Comparative), which are represented by Attention struc-
tures A2 and A3 through Anchor respectively. A2 indi-
cates an unknown object that is referred by a
Demonstrative reference constraint (this style), and A3
indicates a geographic location object referred to by
HERE. Since these two references overlap with
a single deictic gesture, it is hard to decide which one
should be unified with the gesture input. We will
show in Section 4.3 that the fine-grained representa-
tion in Figure 8(a) allows MIND to use contexts to
resolve these two references and improve alignment.
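The recursive composition in Figure 8(a), where an attention's constraint anchors another attention, can be mimicked with nested dictionaries. This is a hypothetical rendering of the feature structures, purely for illustration:

```python
# Nested feature structures for U4's speech input
# "houses with this style around here" (after Figure 8a).
a2 = {"Topic": "Instance",
      "Constraint": {"Category": "Anaphora",
                     "Manner": "Demonstrative(THIS)", "Number": 1}}
a3 = {"Base": "GeoLocation", "Topic": "Instance",
      "Constraint": {"Category": "Anaphora", "Manner": "HERE"}}
a1 = {"Base": "House", "Topic": "Collection",
      "Constraints": [
          {"Category": "Attributive", "Manner": "Comparative",
           "Aspect": "Style", "Relation": "Equals", "Anchor": a2},
          {"Category": "Attributive", "Manner": "Comparative",
           "Aspect": "Location", "Relation": "Equals", "Anchor": a3},
      ]}
# The anchors point back to attention structures, so constraints and
# attentions can nest to whatever depth the input requires.
```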
During multimodal understanding, MIND com-
bines information from modality units together and
generates a conversation unit that represents the over-
all meaning of user multimodal inputs. A conversa-
tion unit also has the same type of intention and
attention feature structures, as well as the feature
structure for data constraints. Since references are
resolved during the multimodal understanding pro-
cess, the reference constraints are no longer present in
conversation units. For example, once the two references
in Figure 8(a) are resolved during multimodal under-
standing (details are described in Section 4.3) and
MIND identifies that “this style” is “Victorian” and “here” is
“White Plains”, it creates a conversation unit represent-
ing the overall meaning of this input in Figure 8(b).
3.2 Representing Conversation Context
MIND uses a conversation history to represent the
conversation context based on the goals or sub-goals
of user inputs and RIA outputs. For example, in the
conversation fragment mentioned earlier (Figure 2),
the first user input (U1) initiates a goal of looking up
the price of a particular house. Due to the ambiguous
gesture input, in the next turn, RIA (R2) initiates a
sub-goal of disambiguating the house of interest. This
sub-goal contributes to the goal initiated by U1. Once
the user replies with the house of interest (U2), the
sub-goal is fulfilled. Then RIA gives the price infor-
mation (R2), and the goal initiated by U1 is accom-
plished. To reflect this progress, our conversation
history is a hierarchical structure which consists of
conversation segments and conversation units (in
Figure 9). As mentioned earlier, a conversation unit
records the user's (rectangles U1, U2) or RIA's (rectangles
R1, R2) overall meanings at a single turn in the conversa-
tion. These units can be grouped together to form a
conversation segment (oval DS1, DS2) based on their
goals and sub-goals. Furthermore, a conversation seg-
ment contains not only intention and attention, but
also other information such as the conversation initi-
ating participant (Initiator). In addition to conversation
segments and conversation units, a conversation his-
tory also maintains different relations between seg-
ments and between units. Details can be found in
(Chai et al., 2002).
Another main characteristic of our representation is
the consistent representation of intention and atten-
tion across different levels. Just like modality units
and conversation units, conversation segments also
consist of the same type of intention and attention
feature structures (as shown in Figure 9). This consis-
tent representation not only supports unification
based multimodal fusion, but also enables context-
based inference to enhance interpretation (described
later).
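Because modality units, conversation units, and conversation segments share the same structures, fusion can be phrased as plain feature-structure unification. A bare-bones version over dictionaries (a sketch only; full unification in the sense of Carpenter (1992) also handles typing and reentrancy):

```python
def unify(fs1, fs2):
    """Unify two feature structures represented as dicts.
    Returns the merged structure, or None on a feature clash."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            sub = unify(result[feature], value)
            if sub is None:
                return None
            result[feature] = sub
        elif result[feature] != value:
            return None  # clash: incompatible values for one feature
    return result

# U1: the speech attention (the price of some instance) unifies with a
# gesture attention carrying a concrete house, filling in its Content.
speech = {"Topic": "Instance", "Aspect": "Price"}
house = {"Base": "House", "Topic": "Instance", "Content": "MLS0234765"}
collection = {"Topic": "Collection"}
unify(speech, house)       # merged structure with Content filled in
unify(speech, collection)  # None: Instance vs. Collection clash
```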
We have described our semantics-based representa-
tion and presented three characteristics: fine-grained
semantic models, flexible composition, and consis-
tent representation. Next we will show how this
representation is used effectively in the multimodal
interpretation process.
4 The Use of Representation in Multimodal
Interpretation
As mentioned earlier, multimodal interpretation in
MIND consists of three processes: unimodal under-
standing, multimodal understanding and discourse
understanding. Here we focus on multimodal under-
standing. The key difference between MIND and ear-
lier works is the use of rich contexts to improve
understanding. Specifically, multimodal understand-
ing consists of two sub-processes: multimodal fusion
and context-based inference. Multimodal fusion fuses
intention and attention structures (from modality
units) for unimodal inputs and forms a combined rep-
resentation. Context-based inference uses rich contexts
to improve interpretation by resolving ambiguities,
deriving unspecified information, and improving
alignment.

Figure 9. A fragment of a conversation history
DS1 (Initiator: User)
  Intention
    Motivator: DataPresentation
    Method: Lookup
  Attention
    Base: House
    Topic: Instance
    Focus: SpecificAspect
    Aspect: Price
    Content: {MLS0234765}
  U1:
    Intention
      Motivator: DataPresentation
      Act: Request
      Method: Lookup
    Attention
      Base: House
      Topic: Instance
      Focus: SpecificAspect
      Aspect: Price
      Content: {MLS0234765 | MLS0876542}
  DS2 (Initiator: RIA)
    Intention
      Motivator: ExceptionHandling
      Method: Disambiguate
    Attention
      Base: House
      Topic: Instance
      Content: {MLS0234765 | MLS0876542}
    R1: ...
    U2: ...
  R2:
    Intention ...
    Attention ...
4.1 Resolving Ambiguities
User inputs could be ambiguous. For example, in U1,
the deictic gesture is not directly on a particular
object. Fusing intention and attention structures from
each individual input presents some ambiguities. For
example, in Figure 4(b), there are three Attention
structures for U1 gesture input. Each of them can be
unified with the Attention structure from U1 speech
input (in Figure 4a). The result of fusion is shown in
Figure 10(a). Since the reference constraint in the
speech input (Number: 1 in Figure 4a) indicates that
only one attention structure is allowed, MIND uses
contexts to eliminate inconsistent structures. In this
case, A3 in Figure 10(a) indicates the information of
interest is about the price of the city Irvington. Based
on the domain knowledge that the city object cannot
have the price feature, A3 is filtered out. As a result,
both A1 and A2 are potential interpretations. Therefore,
the Content features in those structures are combined
disjunctive relation as in Figure 10(b). Based on this
revised conversation unit, RIA is able to arrange the
follow-up question to further disambiguate the house
of interest (R2 in Figure 2). This example shows that
modeling semantic information by fine-grained
dimensions supports the use of domain knowledge in
context-based inference, and can therefore resolve
some ambiguities.
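The filtering step above can be sketched as checking each candidate attention structure against a domain ontology that records which features each semantic type supports. The ontology contents here are invented for illustration:

```python
# Hypothetical domain ontology: which aspects each Base type supports.
ONTOLOGY = {
    "House": {"Price", "Size", "Style", "Location"},
    "City": {"Population", "Location"},  # no Price feature
}

def filter_by_domain(attentions):
    """Keep only attention structures whose Aspect is valid for their
    Base according to the domain ontology."""
    return [a for a in attentions
            if a.get("Aspect") is None
            or a["Aspect"] in ONTOLOGY.get(a["Base"], set())]

# The three candidate readings of U1 after fusion (Figure 10a).
candidates = [
    {"Base": "House", "Aspect": "Price", "Content": "MLS0234765"},  # A1
    {"Base": "House", "Aspect": "Price", "Content": "MLS0876542"},  # A2
    {"Base": "City", "Aspect": "Price", "Content": "Irvington"},    # A3
]
survivors = filter_by_domain(candidates)        # A3 is filtered out
contents = [a["Content"] for a in survivors]
# The remaining Contents form the disjunction in the revised unit.
```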
4.2 Deriving Unspecified Information
In a conversation setting, user inputs are often abbre-
viated. Users tend to only provide new information
when it is their turn to interact. Sometimes, fusing
individual modalities together still cannot provide
overall meanings of those inputs. For example, after
multimodal fusion, the conversation unit for U3
(“What about this one”) does not give enough informa-
tion on what the user exactly wants. The motivation
and task of this input are not known, as in Figure 11(a).
Only by using the conversation context can MIND
identify the overall meaning of this input. In
this case, based on the most recent conversation seg-
ment (DS1) in Figure 9 (also as in Figure 11b), MIND
is able to derive Motivator and Method features from
DS1 to update the conversation unit for U3
(Figure 11c). As a result, this revised conversation
unit provides the overall meaning that the user is
interested in finding out the price information about
another house MLS7689432. Note that it is important
to maintain a hierarchical conversation history based
on goals and subgoals. Without such a hierarchical
structure, MIND would not be able to infer the moti-
vation of U3. Furthermore, because of the consistent
representation of intention and attention at both the
discourse level (in conversation segments) and the
input level (in conversation units), MIND is able to
directly use conversation context to infer unspecified
information and enhance interpretation.
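Because segments and units carry the same intention structure, the derivation step reduces to filling a unit's unspecified features from the enclosing segment. A minimal sketch, with feature names from the paper but the function itself our own:

```python
def inherit_intention(unit_intention, segment_intention):
    """Fill unspecified intention features of a conversation unit
    from the enclosing conversation segment."""
    merged = dict(segment_intention)
    merged.update({k: v for k, v in unit_intention.items()
                   if v is not None})
    return merged

# U3 "What about this one?" after fusion: only Act is known.
u3 = {"Motivator": None, "Act": "Request", "Method": None}
ds1 = {"Motivator": "DataPresentation", "Method": "Lookup"}
inherit_intention(u3, ds1)
# -> {"Motivator": "DataPresentation", "Method": "Lookup", "Act": "Request"}
```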
4.3 Improving Alignment
In a multimodal environment, users could use differ-
ent ways to coordinate their speech and gesture
inputs. In some cases, one reference/object men-
tioned in the speech input coordinates with one deic-
tic gesture (U1, U3). In other cases, several references/
objects in the speech input are coordinated with one
deictic gesture (U4, U5). In the latter cases, only using
time stamps often cannot accurately align and fuse
the respective attention structures from each modal-
ity. Therefore, MIND uses contexts to improve align-
ment based on our semantics-based representation.
For example, from the speech input in U4 (“show me
houses with this style around here”), three Attention struc-
tures are generated as shown in Figure 8(a). From the
gesture input, only one Attention structure is generated
which corresponds to the city of White Plains. Since
the gesture input overlaps with both “this style” (corre-
sponding to A2)and“here” (corresponding to A3),
there is no obvious temporal relation indicating
which of these two references should be unified with
the deictic gesture. In fact, both A2 and A3 are poten-
tial candidates. Based on the domain context that a
city cannot have a feature Style, MIND determines
that the deictic gesture is actually resolving the
reference of “here”.

Figure 10. Resolving ambiguity for U1
(a) Conversation unit for U1 as a result of multimodal fusion
  Intention
    Motivator: DataPresentation
    Act: Request
    Method: Lookup
  Attention (A1)
    Base: House
    Topic: Instance
    Focus: SpecificAspect
    Aspect: Price
    Content: {MLS0234765}
  Attention (A2)
    Base: House
    Topic: Instance
    Focus: SpecificAspect
    Aspect: Price
    Content: {MLS0876542}
  Attention (A3)
    Base: City
    Topic: Instance
    Focus: SpecificAspect
    Aspect: Price
    Content: {“Irvington”}
(b) Revised conversation unit for U1 as a result of context-based
inference
  Intention
    Motivator: DataPresentation
    Act: Request
    Method: Lookup
  Attention
    Base: House
    Topic: Instance
    Focus: SpecificAspect
    Aspect: Price
    Content: {MLS0234765 | MLS0876542}

Figure 11. Deriving unspecified information for U3
(a) Conversation unit for U3 as a result of multimodal fusion
  U3:
    Intention
      Act: Request
    Attention
      Base: House
      Topic: Instance
      Content: {MLS7689432}
(b) Conversation segment DS1 in the conversation history
  DS1 (Initiator: User)
    Intention
      Motivator: DataPresentation
      Method: Lookup
    Attention
      Base: House
      Topic: Instance
      Focus: SpecificAspect
      Aspect: Price
      Content: {MLS0234765}
(c) Revised conversation unit for U3 as a result of context-based
inference
  U3:
    Intention
      Motivator: DataPresentation
      Act: Request
      Method: Lookup
    Attention
      Base: House
      Topic: Instance
      Focus: SpecificAspect
      Aspect: Price
      Content: {MLS7689432}

To resolve the reference of “this style”,
MIND uses the visual context, which indicates that a
house is highlighted on the screen. A recent study
(Kehler, 2000) shows that objects in the visual focus
are often referred to by pronouns, rather than by full
noun phrases or deictic gestures. Based on this study,
MIND is able to infer that “this style” most likely
refers to the style of the highlighted house
(MLS7689432). Supposing the style is “Victorian”,
MIND is able to figure out that the overall meaning
of U4 is looking for houses with a Victorian style and
located in White Plains (as shown in Figure 8b).
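The two context-based resolution steps just described, the domain-context check that pairs the gesture with “here” and the visual-context fallback that resolves “this style”, can be sketched as follows. This is an illustrative sketch only; the feature table, helper names, and dictionary-based structures are assumptions, not MIND's actual typed feature structures.

```python
# Hypothetical sketch of context-based reference resolution for U4.
# DOMAIN_FEATURES and can_unify are illustrative names, not MIND's API.

DOMAIN_FEATURES = {
    "House": {"Style", "Price", "Size"},  # assumed domain model
    "City": {"Price"},                    # a City has no Style feature
}

def can_unify(speech_ref, gesture_ref):
    """A deictic gesture can resolve a speech reference only if the
    gesture object's Base type supports every feature (Aspect) that
    the reference constrains."""
    needed = speech_ref.get("Aspect", set())
    return needed <= DOMAIN_FEATURES.get(gesture_ref["Base"], set())

# U4: "show me houses with this style around here" + one deictic gesture
a2 = {"Phrase": "this style", "Aspect": {"Style"}}
a3 = {"Phrase": "here", "Aspect": set()}
gesture = {"Base": "City", "Content": "White Plains"}

# Domain context: only "here" can unify with the city gesture.
candidates = [ref for ref in (a2, a3) if can_unify(ref, gesture)]

# Visual context: following Kehler (2000), "this style" falls back to
# the object in the visual focus (the highlighted house).
visual_focus = {"Base": "House", "Content": "MLS7689432", "Style": "Victorian"}
style = visual_focus["Style"] if not can_unify(a2, gesture) else None
# candidates -> [a3]; style -> "Victorian"
```

The key design point is that alignment is driven by unification compatibility rather than by timestamps alone, which is why the overlapping gesture is not ambiguous in practice.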
Furthermore, for U5 (“Comparing these two houses
with the previous house”), there are two Attention
structures (A1 and A2) created for the speech input, as in
Figure 12(a). A1 corresponds to “these two houses”,
where the Number feature in the reference constraint is
set to 2. Although there is only one deictic gesture
which points to two potential houses (Figure 12b),
MIND is able to figure out that this deictic gesture is
actually referring to a group of two houses rather than
an ambiguous single house. Although the gesture
input in U5 is the same kind as that in U1, because of
the fine-grained information captured from the
speech input (i.e., Number feature), MIND processes
them differently. For the second reference of “previous
house” (A2 in Figure 12a), based on the information
captured in the temporal constraint, MIND searches
the conversation history and finds the most recent
house explored (MLS7689432). Therefore, MIND is
able to reach an overall understanding of U5 that the
user is interested in comparing three houses (as in
Figure 12c).
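The two U5 resolutions above, the group reading of the deictic gesture licensed by the Number constraint and the conversation-history lookup driven by the temporal constraint, can be sketched in the same illustrative style. The helper names and the list-based history are assumptions, not MIND's actual data structures.

```python
# Hypothetical sketch of the two reference resolutions for U5.

history = ["MLS7689432", "MLS0234765"]   # most recently explored first (assumed)

def resolve_demonstrative(ref, gesture_objects):
    """'these two houses': the Number constraint tells MIND that the
    single deictic gesture denotes a group, not an ambiguous single
    object, so the pointed-at objects are taken together."""
    if len(gesture_objects) == ref["Constraint"]["Number"]:
        return list(gesture_objects)      # whole group, no ambiguity
    return None                           # fall back to other contexts

def resolve_temporal(ref, history):
    """'the previous house': Relation=Precede with Anchor=Current means
    the most recently explored object(s) in the conversation history."""
    c = ref["Constraint"]
    if c["Relation"] == "Precede" and c["Anchor"] == "Current":
        return history[: c["Number"]]
    return None

a1 = {"Constraint": {"Category": "Anaphora", "Number": 2}}
a2 = {"Constraint": {"Category": "Temporal", "Relation": "Precede",
                     "Anchor": "Current", "Number": 1}}

group = resolve_demonstrative(a1, ["MLS0468709", "MLS0765489"])
previous = resolve_temporal(a2, history)
# group -> both pointed-at houses; previous -> ["MLS7689432"]
```

Note how the same kind of gesture as in U1 is processed differently here purely because the fine-grained Number feature captured from speech changes which resolution path applies.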
5 Conclusion
To facilitate multimodal interpretation in conversa-
tional systems, we have developed a semantics-based
representation to capture salient information from
user inputs and the overall conversation. In this paper,
we have presented three unique characteristics of our
representation. First, our representation is based on
fine-grained semantic models of intention, attention,
and constraints that are important in information-
seeking conversations. Second, our representation is
built through a flexible composition of feature
structures and thus supports complex user inputs.
Third, our representation of intention and attention is
consistent at different levels and therefore facilitates
context-based interpretation. This semantics-based
representation allows MIND to use contexts to
resolve ambiguities, derive unspecified information,
and improve alignment. As a result, MIND is able to
process a large variety of user inputs, including
incomplete, ambiguous, or complex ones.
6 Acknowledgement
The author would like to thank Shimei Pan and
Michelle Zhou for their contributions to the semantic
models.
[Figure 12. Improving alignment for U5. (a) The modality unit for the U5 speech input: Intention (Motivator: DataAnalysis; Act: Request; Method: Compare) and two Attention structures: A1 (Base: House; Topic: Collection; Focus: MainAspect) with Constraint (Category: Anaphora; Manner: Demonstrative; Number: 2), and A2 (Base: House; Topic: Instance; Focus: MainAspect) with Constraint (Category: Temporal; Manner: Relative; Relation: Precede; Anchor: Current; Number: 1). (b) The modality unit for the U5 gesture input, whose Attention structures cover the two pointed-at houses (MLS0468709 and MLS0765489). (c) The conversation unit for U5: Intention (Motivator: DataAnalysis; Act: Request; Method: Compare) and Attention A1 (Base: House; Topic: Collection; Focus: MainAspect; Content: {MLS0468709, MLS0765489, MLS7689432}).]

References

Bolt, R. (1980) Voice and gesture at the graphics inter-
face. Computer Graphics, pages 262-270.

Carpenter, R. (1992) The logic of typed feature struc-
tures. Cambridge University Press.

Chai, J.; Pan, S.; and Zhou, M. X. (2002) MIND: A Se-
mantics-based multimodal interpretation framework
for conversational systems. To appear in Proceedings
of International CLASS Workshop on Natural, Intelli-
gent and Effective Interaction in Multimodal Dialog
Systems.

Cohen, P.; Johnston, M.; McGee, D.; Oviatt, S.; Pittman,
J.; Smith, I.; Chen, L.; and Clow, J. (1996) QuickSet:
Multimodal interaction for distributed applications.
Proc. ACM MM'96, pages 31-40.

Grosz, B. J. and Sidner, C. (1986) Attention, intentions,
and the structure of discourse. Computational Linguis-
tics, 12(3):175-204.

Jelinek, F.; Lafferty, J.; Magerman, D. M.; Mercer, R.;
and Roukos, S. (1994) Decision tree parsing using a
hidden derivation model. Proc. DARPA Speech and
Natural Language Workshop.

Johnston, M.; Cohen, P. R.; McGee, D.; Oviatt, S. L.;
Pittman, J. A.; and Smith, I. (1997) Unification-based
multimodal integration. Proc. 35th ACL, pages 281-
288.

Johnston, M. (1998) Unification-based multimodal pars-
ing. Proc. COLING-ACL'98.

Kehler, A. (2000) Cognitive status and form of reference
in multimodal human-computer interaction. Proc.
AAAI'00, pages 685-689.

Wahlster, W. (1998) User and discourse models for mul-
timodal communication. In M. Maybury and W. Wahl-
ster, editors, Intelligent User Interfaces, pages 359-
370.

Zancanaro, M.; Stock, O.; and Strapparava, C. (1997)
Multimodal interaction for information access: Ex-
ploiting cohesion. Computational Intelligence,
13(4):439-464.

Zhou, M. X. and Pan, S. (2001) Automated authoring of
coherent multimedia discourse for conversation sys-
tems. Proc. ACM MM’01, pages 555–559.
