<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1001">
  <Title>Optimization in Multimodal Interpretation</Title>
  <Section position="3" start_page="0" end_page="1" type="metho">
    <SectionTitle>
2 Necessities for Optimization in
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
Multimodal Interpretation
</SectionTitle>
      <Paragraph position="0"> In a multimodal conversation, the way a user interacts with a system is dependent not only on the available input channels (e.g., speech and gesture), but also upon his/her conversation goals, the state of the conversation, and the multimedia feedback from the system. In other words, there is a rich context that involves dependencies from many different aspects established during the interaction. Interpreting user inputs can only be situated in this rich context. For example, the temporal relations between speech and gesture are important criteria that determine how the information from these two modalities can be combined. The focus of attention from the prior conversation shapes how users refer to those objects, and thus, influences the interpretation of referring expressions. Therefore, we need to simultaneously consider the temporal relations between the referring expressions and the gestures, the semantic constraints specified by the referring expressions, and the contextual constraints from the prior conversation. It is important to have a mechanism that supports competition and ranking among these constraints to achieve an optimal interpretation, in particular, a mechanism to allow constraint violation and support soft constraints.</Paragraph>
      <Paragraph position="1"> We use temporal constraints as an example to illustrate this viewpoint. The temporal constraints specify whether multiple modalities can be combined based on their temporal alignment. In earlier work, temporal constraints were empirically determined based on user studies (Oviatt 1996). For example, in the unification-based approach (Johnston 1998), one temporal constraint indicates that speech and gesture can be combined only when the speech either overlaps with the gesture or follows the gesture within a certain time frame. This is a hard constraint that has to be satisfied in order for the unification to take place. If a given input does not satisfy these hard constraints, the unification fails.</Paragraph>
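      <Paragraph> For concreteness, such a hard temporal constraint can be pictured as a simple boolean predicate, as in the sketch below; this is an illustration rather than the actual rule from the cited systems, and the 4-second window (max_follow_gap) is an assumed placeholder value.
    # Illustrative hard temporal constraint in the style of unification-based
    # fusion: speech and gesture may combine only if the speech overlaps the
    # gesture or follows it within a fixed window.  Times are in seconds.
    def can_unify(speech_begin, speech_end, gesture_begin, gesture_end,
                  max_follow_gap=4.0):
        """Return True when the hard temporal constraint is satisfied."""
        overlaps = min(speech_end, gesture_end) >= max(speech_begin, gesture_begin)
        gap = speech_begin - gesture_end   # how long after the gesture the speech starts
        follows_within_gap = gap >= 0 and max_follow_gap >= gap
        # Hard constraint: if neither condition holds, unification simply fails.
        return overlaps or follows_within_gap
      </Paragraph>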
      <Paragraph position="2"> In our user studies, we found that, although the majority of user temporal alignment behavior may satisfy pre-defined temporal constraints, there are some exceptions. (We implemented a system using real estate as an application domain: the user can interact with a map using both speech and gestures to retrieve information. All the user studies mentioned in this paper were conducted using this system.) Table 1 shows the percentage of different temporal relations collected from our user studies. The rows indicate whether there is an overlap between speech referring expressions and their accompanying gestures. The columns indicate whether the speech (more precisely, the referring expressions) or the gesture occurred first.</Paragraph>
      <Paragraph position="3"> Consistent with the previous findings (Oviatt et al., 1997), in most cases (85% of the time) gestures occurred before the referring expressions were uttered. However, in 15% of the cases the speech referring expressions were uttered before the gesture occurred. Among those cases, 8% had an overlap between the referring expressions and the gesture and 7% had no overlap.</Paragraph>
      <Paragraph position="4"> Furthermore, as shown in (Oviatt et al., 2003), although multimodal behaviors such as sequential (i.e., non-overlap) or simultaneous (i.e., overlap) integration are quite consistent during the course of interaction, there are still some exceptions. Figure 1 shows the temporal alignments from seven individual users in our study. User 2 and User 6 maintained a consistent behavior: User 2's speech referring expressions always overlapped with gestures, and User 6's gestures always occurred ahead of the speech expressions. The other five users exhibited varied temporal alignment between speech and gesture during the interaction.</Paragraph>
      <Paragraph position="5"> It will be difficult for a system using pre-defined temporal constraints to anticipate and accommodate all these different behaviors. Therefore, it is desirable to have a mechanism that allows violation of these constraints and supports soft or graded constraints.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="1" end_page="2" type="metho">
    <SectionTitle>
3 A Graph-based Optimization Approach
</SectionTitle>
    <Paragraph position="0"> To address the necessities described above, we developed an optimization approach for interpreting multimodal references using graph matching. The graph representation captures both salient entities and their inter-relations. The graph matching is an optimization process that finds the best matching between two graphs based on constraints modeled as links or nodes in these graphs. This type of structure and process is especially useful for interpreting multimodal references. One graph can represent all the referring expressions and their inter-relations, and the other graph can represent all the potential referents. The question is how to match them together to achieve a maximum compatibility given a particular context.</Paragraph>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Overview
Graph-based Representation
</SectionTitle>
      <Paragraph position="0"> An Attribute Relation Graph (ARG) (Tsai and Fu, 1979) is used to represent information in our approach.</Paragraph>
      <Paragraph position="1"> An ARG consists of a set of nodes that are connected by a set of edges. Each node represents an entity, which in our case is either a referring expression to be resolved or a potential referent.</Paragraph>
      <Paragraph position="2"> Each node encodes the properties of the corresponding entity including: * Semantic information that indicates the semantic type, the number of potential referents, and the specific attributes related to the corresponding entity (e.g., extracted from the referring expressions).</Paragraph>
      <Paragraph position="3"> * Temporal information that indicates the time when the corresponding entity is introduced into the discourse (e.g., uttered or gestured).</Paragraph>
      <Paragraph position="4"> Each edge represents a set of relations between two entities. Currently we capture temporal relations and semantic type relations. A temporal relation indicates the temporal order between two related entities during an interaction, which may  have one of the following values: * Precede: Node A precedes Node B if the entity represented by Node A is introduced into the discourse before the entity represented by Node B. * Concurrent: Node A is concurrent with Node B if the entities represented by them are referred to or mentioned simultaneously.</Paragraph>
      <Paragraph position="5"> * Non-concurrent: Node A is non-concurrent with Node B if their corresponding objects/references cannot be referred/mentioned simultaneously.</Paragraph>
      <Paragraph position="6"> * Unknown: The temporal order between two entities is unknown. It may take the value of any of the  above.</Paragraph>
      <Paragraph position="7"> A semantic type relation indicates whether two related entities share the same semantic type. It currently takes the following discrete values: Same, Different, and Unknown. It could be beneficial in the future to consider a continuous function measuring the rate of compatibility instead.</Paragraph>
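      <Paragraph> To make the representation concrete, the following minimal sketch (an illustration, not the implementation used in this work) shows how ARG nodes and edges could be encoded; the field names simply mirror the properties and relations listed above.
    from dataclasses import dataclass, field
    from typing import Dict, List, Optional, Tuple

    TEMPORAL_RELATIONS = ("Precede", "Concurrent", "Non-concurrent", "Unknown")
    SEMANTIC_TYPE_RELATIONS = ("Same", "Different", "Unknown")

    @dataclass
    class ArgNode:
        """An entity: a referring expression to be resolved or a potential referent."""
        semantic_type: Optional[str] = None    # e.g., "House"
        number: Optional[int] = None           # number of potential referents
        attributes: Dict[str, str] = field(default_factory=dict)  # e.g., {"Color": "green"}
        begin_time: Optional[float] = None     # when uttered or gestured
        end_time: Optional[float] = None

    @dataclass
    class ArgEdge:
        """Relations between two entities."""
        temporal_relation: str = "Unknown"
        semantic_type_relation: str = "Unknown"

    @dataclass
    class Arg:
        """Attribute Relation Graph: a node list plus edges keyed by node-index pairs."""
        nodes: List[ArgNode] = field(default_factory=list)
        edges: Dict[Tuple[int, int], ArgEdge] = field(default_factory=dict)
      </Paragraph>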
      <Paragraph position="8"> Specifically, two graphs are generated. One graph, called the referring graph, captures referring expressions from speech utterances. For example, suppose a user says Compare this house, the green house, and the brown one. Figure 2 shows a referring graph that represents the three referring expressions from this speech input. Each node captures the semantic information such as the semantic type (i.e., Semantic Type), the attribute (Color), and the number (Number) of the potential referents, as well as the temporal information about when this referring expression is uttered (BeginTime and EndTime). Each edge captures the semantic (e.g., SemanticTypeRelation) and temporal relations (e.g., TemporalRelation) between the referring expressions. In this case, since the green house is uttered before the brown one, there is a temporal Precede relationship between these two expressions.</Paragraph>
      <Paragraph position="9"> Furthermore, according to our heuristic that objects to be compared should share the same semantic type, the SemanticTypeRelation between these two nodes is set to Same.</Paragraph>
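      <Paragraph> Using the hypothetical ArgNode/ArgEdge sketch above, the referring graph for this utterance could be assembled roughly as follows; the time stamps are invented for illustration.
    # Referring graph for "Compare this house, the green house, and the brown one".
    referring = Arg()
    referring.nodes = [
        ArgNode(semantic_type="House", number=1,
                begin_time=1.2, end_time=1.6),                                  # "this house"
        ArgNode(semantic_type="House", number=1, attributes={"Color": "green"},
                begin_time=1.8, end_time=2.3),                                  # "the green house"
        ArgNode(semantic_type="House", number=1, attributes={"Color": "brown"},
                begin_time=2.5, end_time=3.0),                                  # "the brown one"
    ]
    # Earlier expressions Precede later ones, and objects to be compared are
    # assumed to share a semantic type, so SemanticTypeRelation is Same.
    for i in range(len(referring.nodes)):
        for j in range(i + 1, len(referring.nodes)):
            referring.edges[(i, j)] = ArgEdge(temporal_relation="Precede",
                                              semantic_type_relation="Same")
      </Paragraph>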
      <Paragraph position="10"> The second graph, called the referent graph, represents the potential referents, which may come from multiple sources (e.g., from the last conversation, gestured by the user, etc.). Each node captures the semantic and temporal information about a potential referent (e.g., the time when the potential referent is selected by a gesture). Each edge captures the semantic and temporal relations between two potential referents. For instance, suppose the user points to one position and then points to another position. The corresponding referent graph is shown in Figure 3. The objects inside the first dashed rectangle correspond to the potential referents selected by the first pointing gesture, and those inside the second dashed rectangle correspond to the second pointing gesture. Each node also contains a probability that indicates the likelihood of its corresponding object being selected by the gesture. Furthermore, the salient objects from the prior conversation are also included in the referent graph since they could also be the potential referents (e.g., the rightmost dashed rectangle in Figure 3).</Paragraph>
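      <Paragraph> A matching referent-graph fragment, again under the sketch above, might combine gesture-selected candidates with salient objects from the prior conversation; the identifiers and selection probabilities below are invented, and the probability is an extra field that the minimal ArgNode sketch does not include.
    from dataclasses import dataclass

    @dataclass
    class ReferentNode(ArgNode):
        identifier: str = ""       # object id from the domain model (hypothetical)
        probability: float = 1.0   # likelihood of being selected by the gesture

    referent = Arg()
    referent.nodes = [
        ReferentNode(semantic_type="House", begin_time=1.3, identifier="house_8", probability=0.7),   # 1st point
        ReferentNode(semantic_type="House", begin_time=1.3, identifier="house_9", probability=0.3),   # 1st point
        ReferentNode(semantic_type="House", begin_time=2.6, identifier="house_12", probability=0.9),  # 2nd point
        ReferentNode(semantic_type="House", begin_time=0.0, identifier="house_5", probability=1.0),   # prior conversation
    ]
    # Edges (omitted here) would record the temporal and semantic type relations
    # between these candidates, with conversation-context nodes linked to the
    # nodes of both pointing gestures.
      </Paragraph>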
      <Paragraph position="11"> To create these graphs, we apply a grammar-based natural language parser to process speech inputs and a gesture recognition component to process gestures. The details are described in (Chai et al. 2004a).</Paragraph>
      <Paragraph position="12"> Each node from the conversation context is linked to every node corresponding to the first pointing and the second pointing.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Graph-matching Process
</SectionTitle>
      <Paragraph position="0"> Given these graph representations, interpreting multimodal references becomes a graph-matching problem. The goal is to find the best match between a referring graph G_s, whose nodes {a_m} (referring nodes) are connected by edges {g_mn}, and a referent graph G_r, whose nodes {a_x} (referent nodes) are connected by edges {r_xy}.</Paragraph>
      <Paragraph position="1"> The following equation finds a match that achieves the maximum compatibility between G_r and G_s:
Q(G_r, G_s) = Σ_x Σ_m P(a_x, a_m) NodeSim(a_x, a_m) + Σ_x Σ_y Σ_m Σ_n P(a_x, a_m) P(a_y, a_n) EdgeSim(r_xy, g_mn)    (1)
Q(G_r, G_s) measures the degree of the overall match between the referent graph and the referring graph. P(a_x, a_m) is the matching probability between a node a_x in the referent graph and a node a_m in the referring graph. The overall compatibility depends on the similarities between nodes (NodeSim) and the similarities between edges (EdgeSim). The function NodeSim(a_x, a_m) measures the similarity between a referent node a_x and a referring node a_m by combining semantic and temporal constraints. The function EdgeSim(r_xy, g_mn) measures the similarity between an edge r_xy in the referent graph and an edge g_mn in the referring graph, which depends on the semantic and temporal constraints of the corresponding edges. These functions are described in detail in the next section.</Paragraph>
      <Paragraph position="2"> We use the graduated assignment algorithm (Gold and Rangarajan, 1996) to maximize Q(G_r, G_s) in Equation (1). The algorithm first initializes the matching probabilities P(a_x, a_m) and then iteratively updates them until convergence. The resulting P(a_x, a_m) gives the matching probabilities between the referent node a_x and the referring node a_m that maximize the overall compatibility function. Given this probability matrix, the system is able to assign the most probable referent(s) to each referring expression.</Paragraph>
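      <Paragraph> The sketch below shows the general shape of such an iterative soft-matching procedure; it is a simplified softassign-style loop, and the annealing schedule, the normalization details, and the omission of slack rows and columns are simplifying assumptions rather than the exact algorithm of Gold and Rangarajan.
    import numpy as np

    def graduated_assignment(node_sim, edge_sim, n_anneal=40, inner_iters=20,
                             beta=0.5, rate=1.1):
        """Estimate the matching matrix P[x, m] between referent nodes a_x and
        referring nodes a_m.

        node_sim: (X, M) array of NodeSim(a_x, a_m) values.
        edge_sim: (X, X, M, M) array of EdgeSim(r_xy, g_mn) values.
        """
        X, M = node_sim.shape
        P = np.full((X, M), 1.0 / M)                    # start from a uniform soft match
        for _ in range(n_anneal):                       # gradually sharpen the assignment
            for _ in range(inner_iters):
                # Partial derivative of Q with respect to P[x, m] (up to a constant):
                # NodeSim(a_x, a_m) + sum over y, n of P[y, n] * EdgeSim(r_xy, g_mn)
                grad = node_sim + np.einsum("yn,xymn->xm", P, edge_sim)
                P = np.exp(beta * (grad - grad.max()))  # softmax-like reweighting
                for _ in range(5):                      # alternating row/column normalization
                    P = P / P.sum(axis=1, keepdims=True)
                    P = P / P.sum(axis=0, keepdims=True)
            beta *= rate
        return P
After convergence, each referring node a_m can be assigned the referent node(s) a_x with the highest P[x, m].
      </Paragraph>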
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Similarity Functions
</SectionTitle>
      <Paragraph position="0"> As shown in Equation (1), the overall compatibility between a referring graph and a referent graph depends on the node similarity function and the edge similarity function. Next we give a detailed account of how we defined these functions. Our focus here is not on the actual definitions of these functions (since they may vary for different applications), but rather on a mechanism that leads to competition and ranking of constraints.</Paragraph>
      <Paragraph position="1"> Given a referring expression (represented as a node a_m in the referring graph) and a potential referent (represented as a node a_x in the referent graph), the node similarity function is defined based on the semantic and temporal information captured in a_x and a_m through a set of individual compatibility functions:
NodeSim(a_x, a_m) = Id(a_x, a_m) × SemType(a_x, a_m) × Π_k Feature_k(a_x, a_m) × Temp(a_x, a_m)
Currently, in our system, the specific return values for these functions are empirically determined through iterative regression tests.</Paragraph>
      <Paragraph position="2"> Id(a_x, a_m) captures the constraint on the compatibility between the identifiers specified in a_m and a_x. It indicates that the identifier of the potential referent, as expressed in a referring expression, should match the identifier of the true referent. This is particularly useful for resolving proper nouns. For example, if the referring expression is house number eight, then the correct referent should have the identifier number eight. We currently define this constraint as follows: Id(a_x, a_m) returns 100 if the identifier expressed in a_m matches the identifier of a_x, zero if the two identifiers do not match, and a neutral value of 1 if the identifier in a_m is unknown. The different return values enforce that a large reward is given to the case where the identifiers from the referring expressions match the identifiers from the potential referents.</Paragraph>
      <Paragraph position="3"> SemType(a_x, a_m) captures the constraint of semantic type compatibility between a_x and a_m. It indicates that the semantic type of a potential referent, as expressed in the referring expression, should match the semantic type of the correct referent. We define the following: SemType(a_x, a_m) returns 1 if the semantic type of a_x is the same as the semantic type expressed in a_m, zero if the two types are different, and a neutral value if the semantic type in a_m is unknown. Note that the return value given to the case where semantic types are the same (i.e., "1") is much lower than that given to the case where identifiers are the same (i.e., "100"). This was designed to support constraint ranking. Our assumption is that the constraint on identifiers is more important than the constraint on semantic types: because identifiers are usually unique, the identifier constraint is a greater indicator of node matching when the identifier expressed in a referring expression matches the identifier of a potential referent.</Paragraph>
      <Paragraph position="4"> Feature_k(a_x, a_m) captures a domain-specific constraint concerning a particular semantic feature (indicated by the subscript k). This constraint indicates that the expected features of a potential referent, as expressed in a referring expression, should be compatible with the features of the true referent. For example, in the referring expression the Victorian house, the style feature is Victorian; therefore, an object can only be a possible referent if the style of that object is Victorian. Thus, we define the following: Feature_k(a_x, a_m) returns a reward if a_x and a_m share the kth feature with the same value, zero if both specify the kth feature but the values are not equal, and a neutral value otherwise (i.e., when the kth feature is not present in either a_x or a_m). Note that these feature constraints are dependent on the specific domain model for a particular application.</Paragraph>
      <Paragraph position="5"> As discussed in Section 2, a hard constraint concerning temporal relations between referring expressions and gestures would be incapable of handling the flexibility of user temporal alignment behavior. Thus the temporal constraint Temp(a_x, a_m) in our approach is a graded constraint whose value decreases as the temporal distance between a_m (when it is uttered) and a_x (when it is introduced, e.g., by a gesture) grows, and which never drops to zero. This constraint indicates that the closer a referring expression and a potential referent are in terms of their temporal alignment (regardless of the absolute precedence relationship), the more compatible they are.</Paragraph>
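      <Paragraph> A compact sketch of how these node-level functions could be combined multiplicatively is given below; the rewards of 100 and 1 follow the text above, while the zero returned for violations, the neutral value of 1 for unspecified information, and the exponential decay used for the graded temporal constraint are illustrative assumptions.
    import math

    def id_compat(x, m):
        """Identifier constraint: large reward (100) on a match, zero on a mismatch."""
        if "identifier" not in m:
            return 1.0                    # identifier not specified: neutral (assumed)
        return 100.0 if x.get("identifier") == m["identifier"] else 0.0

    def sem_type_compat(x, m):
        """Semantic type constraint: a smaller reward (1) than the identifier constraint."""
        if "semantic_type" not in m:
            return 1.0
        return 1.0 if x.get("semantic_type") == m["semantic_type"] else 0.0

    def feature_compat(x, m, k):
        """Domain-specific feature constraint for feature k, e.g. 'style'."""
        if k not in m or k not in x:
            return 1.0                    # feature absent on either side: neutral
        return 1.0 if x[k] == m[k] else 0.0

    def temp_compat(x, m, decay=0.5):
        """Graded temporal constraint: closer in time is better, and never zero."""
        return math.exp(-decay * abs(x["time"] - m["time"]))

    def node_sim(x, m, features=("color", "style")):
        """NodeSim as a product: any violated semantic constraint zeroes the score."""
        score = id_compat(x, m) * sem_type_compat(x, m)
        for k in features:
            score *= feature_compat(x, m, k)
        return score * temp_compat(x, m)

    # Example: a green house gestured at t=1.3 against "the green house" uttered at t=1.8.
    # node_sim({"semantic_type": "House", "color": "green", "time": 1.3},
    #          {"semantic_type": "House", "color": "green", "time": 1.8})
      </Paragraph>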
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
Edge Similarity Function
</SectionTitle>
      <Paragraph position="0"> The edge similarity function measures the compatibility of the relations held between referring expressions (i.e., an edge g_mn in the referring graph) and the relations between the potential referents (i.e., an edge r_xy in the referent graph). It is defined by two individual compatibility functions. The first encodes the semantic type compatibility between an edge in the referring graph and an edge in the referent graph, and is defined in Table 2. This constraint indicates that the relation held between referring expressions should be compatible with the relation held between the two correct referents. For example, consider the utterance How much is this green house and this blue house. This utterance indicates that the referent of the first expression (this green house) should share the same semantic type as the referent of the second expression (this blue house). As shown in Table 2, the function rewards the case where the semantic type relation of r_xy is compatible with that of g_mn. The second function captures the temporal compatibility between an edge in the referring graph and an edge in the referent graph, and is defined in Table 3. This constraint indicates that the temporal relationship between two referring expressions (in one utterance) should be compatible with the relation between their corresponding referents as they are introduced into the context (e.g., through gesture). When the temporal relation of r_xy is the same as that of g_mn, the function returns 1. Because potential referents could come from the prior conversation, even if r_xy and g_mn are not the same, the function does not return zero when g_mn is Precede.</Paragraph>
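      <Paragraph> Since Tables 2 and 3 are not reproduced in this text, the sketch below only illustrates the general shape of such edge-level compatibility lookups; the numeric entries, the handling of Unknown relations, and the multiplicative combination are placeholders rather than the values from the paper.
    # Semantic type relation compatibility (shape of Table 2): the relation between
    # two referring expressions should be compatible with the relation between
    # their referents.  Values are placeholders.
    SEM_REL_COMPAT = {
        ("Same", "Same"): 1.0,
        ("Different", "Different"): 1.0,
        ("Same", "Different"): 0.0,
        ("Different", "Same"): 0.0,
    }

    # Temporal relation compatibility (shape of Table 3).  When g_mn is Precede the
    # score stays above zero even on a mismatch, since referents may come from the
    # prior conversation rather than a time-aligned gesture.
    TEMP_REL_COMPAT = {
        ("Precede", "Precede"): 1.0,
        ("Concurrent", "Concurrent"): 1.0,
        ("Concurrent", "Precede"): 0.5,
        ("Non-concurrent", "Precede"): 0.5,
        ("Precede", "Concurrent"): 0.0,
    }

    def edge_sim(r_edge, g_edge):
        """Edge similarity as a product of the two lookups; 0.5 stands in for Unknown."""
        sem = SEM_REL_COMPAT.get((r_edge["sem_rel"], g_edge["sem_rel"]), 0.5)
        temp = TEMP_REL_COMPAT.get((r_edge["temp_rel"], g_edge["temp_rel"]), 0.5)
        return sem * temp
      </Paragraph>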
      <Paragraph position="1"> Next, we discuss how these definitions and the process of graph matching address optimization, in particular, with respect to key principles of Optimality Theory for natural language interpretation.</Paragraph>
    </Section>
    <Section position="5" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Optimality Theory
</SectionTitle>
      <Paragraph position="0"> Optimality Theory (OT) is a theory of language and grammar, developed by Alan Prince and Paul Smolensky (Prince and Smolensky, 1993). In Optimality Theory, a grammar consists of a set of well-formedness constraints. These constraints are applied simultaneously to identify linguistic structures. Optimality Theory does not restrict the content of the constraints (Eisner 1997). An innovation of Optimality Theory is the conception of these constraints as soft, which means violable and conflicting. The interpretation that arises for an utterance within a certain context maximizes the degree of constraint satisfaction and is consequently the best alternative (hence, optimal interpretation) among the set of possible interpretations.</Paragraph>
      <Paragraph position="1"> The key principles or components of Optimality Theory can be summarized as the following three components (Blutner 1998): 1) Given an input, the Generator creates a set of possible outputs for that input. 2) From this set of candidate outputs, the Evaluator selects the optimal output for that input. 3) There is strict dominance in the ranking of constraints. Constraints are absolute, and the ranking is strict in the sense that an output that violates a higher ranked constraint is less optimal than outputs that incur arbitrarily many violations of lower ranked constraints.</Paragraph>
      <Paragraph position="2"> Although Optimality Theory is a grammar-based framework for natural language processing, its key principles can be applied to other representations.</Paragraph>
      <Paragraph position="3"> At a surface level, our approach addresses these main principles.</Paragraph>
      <Paragraph position="4"> First, in our approach, the matching matrix P(a_x, a_m) captures the probabilities of all the possible matches between a referring node a_m and a referent node a_x. The matching process updates these probabilities iteratively. This process corresponds to the Generator component in Optimality Theory.</Paragraph>
      <Paragraph position="5"> Second, in our approach, the satisfaction or violation of constraints is implemented via the return values of the compatibility functions. These functions return zero when their corresponding intended constraints are violated; in this case, the overall similarity function will also return zero. However, because of the iterative updating nature of the matching algorithm, the system will still find the most optimal match as a result of the matching process even when some constraints are violated. Furthermore, a function that never returns zero, such as Temp(a_x, a_m) in the node similarity function, implements a gradient constraint.</Paragraph>
    </Section>
    <Section position="6" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.4 Evaluation
</SectionTitle>
      <Paragraph position="0"> We conducted several user studies to evaluate the performance of this approach. Users could interact with our system using both speech and deictic gestures. Each subject was asked to complete five tasks. For example, one task was to find the cheapest house in the most populated town.</Paragraph>
      <Paragraph position="1"> Data from eleven subjects was collected and analyzed.</Paragraph>
      <Paragraph position="2"> Table 4 shows the evaluation results of 219 inputs. These inputs were categorized in terms of the number of referring expressions in the speech input and the number of gestures in the gesture input. Out of the total 219 inputs, 137 inputs had their referents correctly interpreted. For the remaining 82 inputs in which the referents were not correctly identified, the problem did not come from the approach itself, but rather from other sources such as speech recognition and language understanding errors. These two major error sources accounted for 55% and 20% of the total errors, respectively (Chai et al. 2004b).</Paragraph>
      <Paragraph position="3"> In our studies, the majority of user references were simple in that they involved only one referring expression and one gesture, consistent with earlier findings (Kehler 2000). It is trivial for our approach to handle these simple inputs since the graph is usually very small and there is only one node in the referring graph. However, 23% of the inputs were complex (the row S3 and the column G3 in Table 4), involving multiple referring expressions from the speech utterance and/or multiple gestures. Our optimization approach is particularly effective for interpreting these complex inputs because it simultaneously considers semantic, temporal, and contextual constraints.</Paragraph>
    </Section>
  </Section>
</Paper>