<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1004">
<Title>A Salience-Based Approach to Gesture-Speech Alignment</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 2 Related Work </SectionTitle>
<Paragraph position="0"> This research draws mainly from two streams of related work. Researchers in human-computer interaction have worked towards developing multimodal user interfaces, which allow spoken and gestural input. These systems often feature powerful algorithms for fusing modalities; however, they also restrict communication to short, grammatically constrained commands over a very limited vocabulary. Since our goal is to handle more complex linguistic phenomena, these systems were of little help in the design of our algorithm. Conversely, we found that anaphora resolution faces a set of challenges very similar to those of gesture-speech alignment, and we were able to apply techniques from anaphora resolution to gesture-speech alignment.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1 Multimodal User Interfaces </SectionTitle>
<Paragraph position="0"> Discussion of multimodal user interfaces begins with the seminal &quot;Put-That-There&quot; system (Bolt, 1980), which allowed users to issue natural language commands and to use deictic hand gestures to resolve references from speech.</Paragraph>
<Paragraph position="1"> Commands were subject to a strict grammar, and alignment was straightforward: keywords created holes in the semantic frame, and temporally aligned gestures filled the holes.</Paragraph>
<Paragraph position="2"> More recent systems have extended this approach somewhat. Johnston and Bangalore describe a multimodal parsing algorithm built on a three-tape finite-state transducer (FST) (Johnston and Bangalore, 2000).</Paragraph>
<Paragraph position="3"> The speech and gestures of each multimodal utterance are provided as input to an FST whose output is a semantic representation conveying the combined meaning.</Paragraph>
<Paragraph position="4"> A similar system, based on a graph-matching algorithm, is described in (Chai et al., 2004). These systems perform mutual disambiguation, in which each modality helps to correct errors in the others. However, both approaches restrict users to a predefined grammar and lexicon, and rely heavily on having a complete, formal ontology of the domain.</Paragraph>
<Paragraph position="5"> In (Kettebekov et al., 2002), a co-occurrence model relates the salient prosodic features of the speech (pitch variation and pause) to characteristic features of gesticulation (velocity and acceleration). The goal was to improve the performance of gesture recognition rather than to address the problem of alignment directly. Their approach also differs from ours in that it operates at the level of the speech signal rather than recognized words.</Paragraph>
<Paragraph position="6"> Potentially, the two approaches could complement each other in a unified system.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Anaphora Resolution </SectionTitle>
<Paragraph position="0"> Anaphora resolution involves linking an anaphor to its corresponding antecedent in the same or a previous sentence. In many cases, speech/gesture multimodal fusion works in a very similar way, with gestures grounding some of the same anaphoric pronouns (e.g., &quot;this&quot;, &quot;that&quot;, &quot;here&quot;).</Paragraph>
<Paragraph position="1"> One approach to anaphora resolution is to assign a salience value to each noun phrase that is a candidate for acting as a grounding referent, and then to choose the noun phrase with the greatest salience (Lappin and Leass, 1994). Mitkov showed that a salience-based approach can be applied across genres and without complex syntactic, semantic, and discourse analysis (Mitkov, 1998). Salience values are typically computed by applying linguistic knowledge: for example, recent noun phrases are more salient, and gender and number should agree. A salience value is derived by applying a set of predefined salience weights to each such feature. Salience weights may be defined by hand, as in (Lappin and Leass, 1994), or learned from data (Mitkov et al., 2002).</Paragraph>
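<Paragraph> To make the salience-based strategy concrete, the following is a minimal illustrative Python sketch, not the algorithm of any of the cited systems: each candidate is scored by a weighted sum of binary features, and the candidate with the greatest salience is chosen. The feature names, weights, and candidate fields are hypothetical placeholders; hand-set weights correspond to the approach of Lappin and Leass (1994), whereas learning the weights from data, as in Mitkov et al. (2002), would replace the constants. The same scheme could, in principle, score gesture candidates rather than noun phrases. </Paragraph>
<Paragraph>
# Minimal illustrative sketch of salience-based resolution.
# Feature names, weights, and candidate fields are hypothetical.

WEIGHTS = {
    "recency": 2.0,           # candidate appears in the current or previous sentence
    "number_agreement": 1.0,  # candidate agrees in number with the anaphor
    "gender_agreement": 1.0,  # candidate agrees in gender with the anaphor
}

def salience(candidate, anaphor):
    """Weighted sum of binary feature indicators for one candidate."""
    features = {
        "recency": candidate["sentence_distance"] in (0, 1),
        "number_agreement": candidate["number"] == anaphor["number"],
        "gender_agreement": candidate["gender"] == anaphor["gender"],
    }
    return sum(WEIGHTS[name] for name, fired in features.items() if fired)

def resolve(anaphor, candidates):
    """Choose the candidate (noun phrase, or gesture) with the greatest salience."""
    return max(candidates, key=lambda c: salience(c, anaphor))

# Example usage with hypothetical data: the recent, agreeing candidate NP2 wins.
anaphor = {"number": "sg", "gender": "neut"}
candidates = [
    {"id": "NP1", "sentence_distance": 2, "number": "sg", "gender": "neut"},
    {"id": "NP2", "sentence_distance": 0, "number": "sg", "gender": "neut"},
]
print(resolve(anaphor, candidates)["id"])
</Paragraph>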
<Paragraph position="2"> Anaphora resolution and gesture-speech alignment are very similar problems. Both involve resolving ambiguous words that refer to other parts of the utterance. In anaphora resolution, pronominal references resolve to previously uttered noun phrases; in gesture-speech alignment, keywords are resolved by gestures, which usually precede the keyword. The salience-based approach works for anaphora resolution because the factors that contribute to noun-phrase salience are well understood. We define a parallel set of factors for evaluating the salience of gestures.</Paragraph>
</Section>
</Section>
</Paper>