<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1004">
  <Title>A Salience-Based Approach to Gesture-Speech Alignment</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Our Approach
</SectionTitle>
    <Paragraph position="0"> The most important goal of our system is the ability to handle natural, human-to-human language usage. This includes disfluencies and grammatically incorrect utterances, which become even more problematic when considering that the output of speech recognizers is far from perfect. Any approach that requires significant parsing or other grammatical analysis may be ill-suited to meet these goals.</Paragraph>
    <Paragraph position="1"> Instead, we identify keywords that are likely to require gestural referents for resolution. Our goal is to produce an alignment - a set of bindings - that match at least some of the identified keywords with one or more gestures.</Paragraph>
    <Paragraph position="2"> There are several things that are known to contribute to the salience of candidate gesture-speech bindings: The relevant gesture is usually close in time to the keyword (Oviatt et al., 1997; Cohen et al., 2002) The gesture usually precedes the keyword (Oviatt et al., 1997).</Paragraph>
    <Paragraph position="3"> A one-to-one mapping is preferred. Multiple key-words rarely align with a single gesture, and multiple gestures almost never align with a single key-word (Eisenstein and Davis, 2003).</Paragraph>
    <Paragraph position="4"> Some types of gestures, such as deictic pointing gestures, are more likely to take part in keyword bindings. Other gestures (i.e., beats) do not carry this type of semantic content, and instead act to moderate turn taking or indicate emphasis. These gestures are unlikely to take part in keyword bindings (Cassell, 1998).</Paragraph>
    <Paragraph position="5"> Some keyword/gesture combinations may be particularly likely; for example, the keyword &amp;quot;this&amp;quot; and a deictic pointing gesture.</Paragraph>
    <Paragraph position="6"> These rules mirror the salience weighting features employed by the anaphora resolution methods described in the previous section. We define a parameterizable penalty function that prefers alignments that adhere to as many of these rules as possible. Given a set of verbal utterances and gestures, we then try to find the set of bindings with the minimal penalty. This is essentially an optimization approach, and we use the simplest possible optimization technique: greedy hill-climbing. Of course, given a set of penalties and the appropriate representation, any optimization technique could be applied. In the evaluation section, we discuss whether and how much our system would benefit from using a more sophisticated optimization technique. Later in this section, we formalize the problem and our proposed solution.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Leveraging Empirical Data
</SectionTitle>
      <Paragraph position="0"> One of the advantages of the salience-based approach is that it enables the creation of a hybrid system that benefits both from our intuitions about multimodal communication and from a corpus of annotated data. The form of the salience metric, and the choice of features that factor into it, is governed by our knowledge about the way speech and gesture work. However, the penalty function also requires parameters that weigh the importance of each factor. These parameters can be crafted by hand if no corpus is available, but they can also be learned from data. By using knowledge about multimodal language to derive the form and features of the salience metric, and using a corpus to fine-tune the parameters of this metric, we can leverage the strengths of both knowledge-based and data-driven approaches.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Formalization
</SectionTitle>
    <Paragraph position="0"> We define a multimodal transcript M to consist of a set of spoken utterances S and gestures G. S contains a set of references R that must be ground by a gestural referent. We define a binding, b 2 B, as a tuple relating a gesture, g 2 G, to a corresponding speech reference, r 2 R. Provided G and R, the set B enumerates all possible bindings between them. Formally, each gesture, reference, and binding are defined as</Paragraph>
    <Paragraph position="2"> wherets,te describe the start and ending time of a gesture or reference, w2S is the word corresponding to r, and Gtype is the type of gesture (e.g. deictic or trajectory).</Paragraph>
    <Paragraph position="3"> An alternative, useful description of the set B is as the function b(g) which returns for each gesture a set of corresponding references. This function is defined as b(g) =frjhg;ri2Bg (2)</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Rules
</SectionTitle>
      <Paragraph position="0"> In this section we provide the analytical form for the penalty functions of Section 3. We have designed these functions to penalize bindings that violate the preferences that model our intuitions about the relationship between speech and gesture. We begin by presenting the analytical form for the binding penalty function, b.</Paragraph>
      <Paragraph position="1"> It is most often the case that verbal references closely follow the gestures that they refer to; the verbal reference rarely precedes the gesture. To reflect this knowledge, we parameterize b using a time gap penalty, tg, and a wrong order penalty, wo as follows,</Paragraph>
      <Paragraph position="3"> and wtg =jtrs tgsj In addition to temporal agreement, specific words or parts-of-speech have varying affinities for different types of gestures. We incorporate these penalties into b by introducing a binding agreement penalty, (b), as follows:</Paragraph>
      <Paragraph position="5"> The remaining penalty functions model binding fertility. Specifically, we assign a penalty for each unassigned gesture and reference, g(g) and r(r) respectively, that reflect our desire for the algorithm to produce bindings.</Paragraph>
      <Paragraph position="6"> Certain gesture types (e.g., deictics) are much more likely to participate in bindings than others (e.g., beats). An unassigned gesture penalty is associated with each gesture type, given by g(g). Similarly, we expect references to have a likelihood of being bound that is conditioned on their word or part-of-speech tag. However, we currently handle all keywords in the same way, with a constant penalty r(r) for unassigned keywords.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Minimization Algorithm
</SectionTitle>
      <Paragraph position="0"> GivenGandR we wish to find aB B that minimizes the penalty function (B;G;R):</Paragraph>
      <Paragraph position="2"> Using the penalty functions of Section 4.1 ( ^B;G;R) is defined as,</Paragraph>
      <Paragraph position="4"> Although there are numerous optimization techniques that may be applied to minimize Equation 5, we have chosen to implement a naive gradient decent algorithm presented below as Algorithm 1. Observing the problem, note we could have initialized B = B; in other Algorithm 1 Gradient Descent Initialize B =;and B0 = B repeat Let b0 be the first element in B0</Paragraph>
      <Paragraph position="6"> Convergence test: is max &lt;limit? until convergence words, start off with all possible bindings, and gradually prune away the bad ones. But it seems likely that jB j min(jRj;jGj); thus, starting from the empty set will converge faster. The time complexity of this algorithm is given by O(jB jjBj). SincejBj = jGjjRj, and assumingjB j/jGj/jRj, this simplifies to O(jB j3), cubic in the number of bindings returned.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Learning Parameters
</SectionTitle>
      <Paragraph position="0"> We explored a number of different techniques for finding the parameters of the penalty function: setting them by hand, gradient descent, simulated annealing, and a genetic algorithm. A detailed comparison of the results with each approach is beyond the scope of this paper, but the genetic algorithm outperformed the other approaches in both accuracy and rate of convergence.</Paragraph>
      <Paragraph position="1"> The genome representation consisted of a thirteen bit string for each penalty parameter; three bits were used for the exponent, and the remaining ten were used for the base. Parameters were allowed to vary from 10 4 to 103. Since there were eleven parameters, the overall string length was 143. A population size of 200 was used, and training proceeded for 50 generations. Single-point crossover was applied at a rate of 90%, and the mutation rate was set to 3% per bit. Tournament selection was used rather than straightforward fitness-based selection (Goldberg, 1989).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluated our system by testing its performance on a set of 26 transcriptions of unconstrained human-to-human communication, from nine different speak- null ers (Eisenstein and Davis, 2003). Of the four women and five men who participated, eight were right-handed, and one was a non-native English speaker. The participants ranged in age from 22 to 28. All had extensive computer experience, but none had any experience in the task domain, which required explaining the behavior of simple mechanical devices.</Paragraph>
    <Paragraph position="1"> The participants were presented with three conditions, each of which involved describing the operation of a mechanical device based on a computer simulation. The conditions were shown in order of increasing complexity, as measured by the number of moving parts: a latchbox, a piston, and a pinball machine. Monologues ranged in duration from 15 to 90 seconds; the number of gestures used ranged from six to 58. In total, 574 gesture phrases were transcribed, of which 239 participated in gesture-speech bindings.</Paragraph>
    <Paragraph position="2"> In explaining the devices, the participants were allowed - but not instructed - to refer to a predrawn diagram that corresponded to the simulation. Vocabulary, grammar, and gesture were not constrained in any way.</Paragraph>
    <Paragraph position="3"> The monologues were videotaped, transcribed, and annotated by hand. No gesture or speech recognition was performed. The decision to use transcriptions rather than speech and gesture recognizers will be discussed in detail below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Empirical Results
</SectionTitle>
      <Paragraph position="0"> We averaged results over ten experiments, in which 20% of the data was selected randomly and held out as a test set. Entire transcripts were held out, rather than parts of each transcript. This was necessary because the system considers the entire transcript holistically when choosing an alignment.</Paragraph>
      <Paragraph position="1"> For a baseline, we evaluated the performance of choosing the temporally closest gesture to each keyword. While simplistic, this approach is used in several implemented multimodal user interfaces (Bolt, 1980; Koons et al., 1993). Kettebekov and Sharma even reported that 93.7% of gesture phrases were &amp;quot;temporally aligned&amp;quot; with the semantically associated keyword in their corpus (Kettebekov and Sharma, 2001). Our results with this base-line were somewhat lower, for reasons discussed below.</Paragraph>
      <Paragraph position="2"> Table 1 shows the results of our system and the base-line on our corpus. Our system significantly outperforms the baseline on both recall and precision on this corpus (p&lt; 0:05, two-tailed). Precision and recall differ slightly because there are keywords that do not bind to any gesture. Our system does not assume a one-to-one mapping between keywords and gestures, and will refuse to bind some keywords if there is no gesture with a high enough salience. One benefit of our penalty-based approach is that it allows us to easily trade off between recall and precision. Reducing the penalties for unassigned gestures and keywords will cause the system to create fewer alignments, increasing precision and decreasing recall.</Paragraph>
      <Paragraph position="3"> This could be useful in a system where mistaken gesture/speech alignments are particularly undesirable. By increasing these same penalties, the opposite effect can also be achieved.</Paragraph>
      <Paragraph position="4"> Both systems perform worse on longer monologues.</Paragraph>
      <Paragraph position="5"> On the top quartile of monologues by length (measured in number of keywords), the recall of the baseline system falls to 75%, and the recall of our system falls to 90%.</Paragraph>
      <Paragraph position="6"> For the baseline system, we found a correlation of -0.55 (df = 23, p &lt; 0:01) between F-measure and monologue length.</Paragraph>
      <Paragraph position="7"> This may help to explain why Kettebekov and Sharma found such success with the baseline algorithm. The multimodal utterances in their corpus consisted of relatively short commands. The longer monologues in our corpus tended to be more grammatically complex and included more disfluency. Consequently, alignment was more difficult, and a relatively na&amp;quot;ive strategy, such as the baseline algorithm, was less effective.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> To our knowledge, very few multimodal understanding systems have been evaluated using natural, unconstrained speech and gesture. One exception is (Quek et al., 2002), which describes a system that extracts discourse structure from gesture on a corpus of unconstrained human-to-human communication; however, no quantitative analysis is provided. Of the systems that are more relevant to the specific problem of gesture-speech alignment (Cohen et al., 1997; Johnston and Bangalore, 2000; Kettebekov and Sharma, 2001), evaluation is always conducted from an HCI perspective, in which participants act as users of a computer system and communicate in short, grammatically-constrained multimodal commands. As shown in Section 5.1, such commands are significantly easier to align than the natural multimodal communication found in our corpus.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 The Corpus
</SectionTitle>
      <Paragraph position="0"> A number of considerations went into gathering this corpus.1 One of our goals was to minimize the use of discourse-related &amp;quot;beat&amp;quot; gestures, so as to better focus on the deictic and iconic gestures that are more closely related to the content of the speech; that is why we focused on monologues rather than dialogues. We also wanted the corpus to be relevant to the HCI community. That is why we provided a diagram to gesture at, which we believe serves a similar function to a computer display, providing reference points for deictic gestures. We used a predrawn diagram - rather than letting participants draw the diagram themselves - because interleaved speech, gesture, and sketching is a much more complicated problem, to be addressed only after bimodal speech-gesture communication is better understood.</Paragraph>
      <Paragraph position="1"> For a number of reasons, we decided to focus on transcriptions of speech and gesture, rather than using speech and gesture recognition systems. Foremost is that we wanted the language in our corpus to be as natural as possible; in particular, we wanted to avoid restricting speakers to a finite list of gestures. Building a recognizer that could handle such unconstrained gesture would be a substantial undertaking and an important research contribution in its own right. However, we are sensitive to the concern that our system should scale to handle possibly erroneous recognition data. There are three relevant classes of errors that our system may need to handle: speech recognition, gesture recognition, and gesture segmentation.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Speech Recognition Errors
</SectionTitle>
      <Paragraph position="0"> The speech recognizer could fail to recognize a keyword; in this case, a binding would simply not be created. If the speech recognizer misrecognized a non-keyword as a keyword, a spurious binding might be created. However, since our system does not require that all keywords have bindings, we feel that our approach is likely to degrade gracefully in the face of this type of error.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Gesture Recognition Errors
</SectionTitle>
      <Paragraph position="0"> This type of error would imply a gestural misclassification, e.g., classifying a deictic pointing gesture as an iconic. Again, we feel that a salience-based system will degrade gracefully with this type of error, since there are no hard requirements on the type of gesture for forming a binding. In contrast, a system that required, say, a deictic gesture to accompany a certain type of command would be very sensitive to a gesture misclassification.</Paragraph>
      <Paragraph position="1">  corpus from the Linguistic Data Consortium. However, this corpus is presently more focused on the kinematics of hand and upper body movement, rather than on higher-level linguistic information relating to gestures and speech.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Gesture Segmentation Errors
</SectionTitle>
      <Paragraph position="0"> Gesture segmentation errors are probably the most dangerous, since this could involve incorrectly grouping two separate gestures into a single gesture, or vice versa. It seems that this type of error would be problematic for any approach, and we have no reason to believe that our salience-based approach would fare differently from any other approach.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Success Cases
</SectionTitle>
      <Paragraph position="0"> Our system outperformed the baseline by more than 10%.</Paragraph>
      <Paragraph position="1"> There were several types of phenomena that the base-line failed to handle. In this corpus, each gesture precedes the semantically-associated keyword 85% of the time. Guided by this fact, we first created a baseline system that selected the nearest preceding gesture for each keyword; clearly, the maximum performance for such a baseline is 85%. Slightly better results were achieved by simply choosing the nearest gesture regardless of whether it precedes the keyword; this is the baseline shown in Table 1. However, this baseline incorrectly bound several cataphoric gestures. The best strategy is to accept just a few cataphoric gestures in unusual circumstances, but a na&amp;quot;ive baseline approach is unable to do this.</Paragraph>
      <Paragraph position="2"> Most of the other baseline errors came about when the mapping from gesture to speech was not one-to-one.</Paragraph>
      <Paragraph position="3"> For example, in the utterance &amp;quot;this piece here,&amp;quot; the two keywords actually refer to a single deictic gesture. In the salience-based approach, the two keywords were correctly bound to a single gesture, but the baseline insisted on finding two gestures. The baseline similarly mishandled situations where a keyword was used without referring to any gesture.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Failure Cases
</SectionTitle>
      <Paragraph position="0"> Although the recall and precision of our system neared 95%, investigating the causes of error could suggest potential improvements. We were particularly interested in errors on the training set, where overfitting could not be blamed. This section describes two sources of error, and suggests some potential improvements.</Paragraph>
      <Paragraph position="1">  We adopted a keyword-based approach so that our system would be more robust to disfluency than alternative approaches that depended on parsing. While we were able to handle many instances of disfluent speech, we found that disfluencies occasionally disturbed the usual relationship between gesture and speech. For example, consider the following utterance: It has this... this spinning thing...</Paragraph>
      <Paragraph position="2"> Our system attempted to bind gestures to each occurrence of &amp;quot;this&amp;quot;, and ended up binding each reference to a different gesture. Moreover, both references were bound incorrectly. The relevant gesture in this case occurs after both references. This is an uncommon phenomenon, and as such, was penalized highly. However, anecdotally it appears that the presence of a disfluency makes this phenomenon more likely. A disfluency is frequently accompanied by an abortive gesture, followed by the full gesture occurring somewhat later than the spoken reference. It is possible that a system that could detect disfluency in the speech transcript could account for this phenomenon.</Paragraph>
      <Paragraph position="3">  Our system applies a greedy hill-climbing optimization to minimize the penalty. While this greedy optimization performs surprisingly well, we were able to identify a few cases of errors that were caused by the greedy nature of our optimization, e.g.</Paragraph>
      <Paragraph position="4"> ...once it hits this, this thing is blocked.</Paragraph>
      <Paragraph position="5"> In this example, the two references are right next to each other. The relevant gestures are also very near each other. The ideal bindings are shown in Figure 1a. The earlier &amp;quot;this&amp;quot; is considered first, but from the system's perspective, the best possible binding is the second gesture, since it overlaps almost completely with the spoken utterance (Figure 1b). However, once the second gesture is bound to the first reference, it is removed from the list of unassigned gestures. Thus, if the second gesture were also bound to the second utterance, the penalty would still be relatively high. Even though the earlier gesture is farther away from the second reference, it is still on the list of unassigned gestures, and the system can reduce the overall penalty considerably by binding it. The system ends up crisscrossing, and binding the earlier gesture to the later reference, and vice versa (Figure 1c).</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Future Work
</SectionTitle>
    <Paragraph position="0"> The errors discussed in the previous section suggest some potential improvements to our system. In this section, we describe four possible avenues of future work: dynamic programming, deeper syntactic analysis, other anaphora resolution techniques, and user adaptation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.1 Dynamic Programming
</SectionTitle>
      <Paragraph position="0"> Algorithm 1 provides only an approximate solution to Equation 5. As demonstrated in Section 6.3.2, the greedy choice is not always optimal. Using dynamic programming, an exhaustive search of the space of bindings can be performed within polynomial time.</Paragraph>
      <Paragraph position="1">  We define m[i;j] to be the penalty of the optimal sub-set B fbi;:::;bjg2B, i j. m[i;j] is implemented as a k k lookup table, where k = jBj = jGjjRj. Each entry of this table is recursively defined by preceding table entries. Specifically, m[i;j] is computed by performing exhaustive search on its subsets of bindings. Using this lookup table, an optimal solution to Equation 5 is therefore found as (B ;G;R) = m[1;k]. Again assumingjB j/jGj/jRj, the size of the lookup table is given by O(jB j4). Thus, it is possible to find the globally optimal set of bindings, by moving from an O(n3) algorithm to O(n4). The precise definition of a recurrence relation for m[i;j] and a proof of correctness will be described in a future publication.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.2 Syntactic Analysis
</SectionTitle>
      <Paragraph position="0"> One obvious possibility for improvement would be to include more sophisticated syntactic information beyond keyword spotting. However, we require that our system remain robust to disfluency and recognition errors. Part of speech tagging is a robust method of syntactic analysis which could allow us to refine the penalty function depending on the usage case. Consider that there at least three relevant uses of the keyword &amp;quot;this.&amp;quot;  1. This movie is better than A.I.</Paragraph>
      <Paragraph position="1"> 2. This is the bicycle ridden by E.T.</Paragraph>
      <Paragraph position="2"> 3. The wheel moves like this.</Paragraph>
      <Paragraph position="3">  When &amp;quot;this&amp;quot; is followed by a noun (case 1), a deictic gesture is likely, although not strictly necessary. But when &amp;quot;this&amp;quot; is followed by a verb (case 2), a deictic gesture is usually crucial for understanding the sentence. Thus, the penalty for not assigning this keyword should be very high. Finally, in the third case, when the keyword follows a preposition, a trajectory gesture is more likely, and the penalty for any such binding should be lowered.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.3 Other Anaphora Resolution Techniques
</SectionTitle>
      <Paragraph position="0"> We have based this research on salience values, which is just one of several possible alternative approaches to anaphora resolution. One such alternative is the use of constraints: rules that eliminate candidates from the list of possible antecedents (Rich and Luperfoy, 1988). An example of a constraint in anaphora resolution is a rule requiring the elimination of all candidates that disagree in gender or number with the referential pronoun. Constraints may be used in combination with a salience metric, to prune away unlikely choices before searching.</Paragraph>
      <Paragraph position="1"> The advantage is that enforcing constraints could be substantially less computationally expensive than searching through the space of all possible bindings for the one with the highest salience. One possible future project would be to develop a set of constraints for speech-gesture alignment, and investigate the effect of these constraints on both accuracy and speed.</Paragraph>
      <Paragraph position="2"> Ge, Hale, and Charniak propose a data-driven approach to anaphora resolution (Ge et al., 1998). For a given pronoun, their system can compute a probability for each candidate antecedent. Their approach of seeking to maximize this probability is similar to the saliencemaximizing approach that we have described. However, instead of using a parametric salience function, they learn a set of conditional probability distributions directly from the data. If this approach could be applied to gesture-speech alignment, it would be advantageous because the binding probabilities could be combined with the output of probabilistic recognizers to produce a pipeline architecture, similar to that proposed in (Wu et al., 1999). Such an architecture would provide multimodal disambiguation, where the errors of each component are corrected by other components.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.4 Multimodal Adaptation
</SectionTitle>
      <Paragraph position="0"> Speakers have remarkably entrenched multimodal communication patterns, with some users overlapping gesture and speech, and others using each modality sequentially (Oviatt et al., 1997). Moreover, these multimodal integration patterns do not seem to be malleable, suggesting that multimodal user interfaces should adapt to the user's tendencies. We have already shown how the weights of the salience metric can adapt for optimal performance against a corpus of user data; this approach could also be extended to adapt over time to an individual user.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML