<?xml version="1.0" standalone="yes"?> <Paper uid="N06-3001"> <Title>Incorporating Gesture and Gaze into Multimodal Models of Human-to-Human Communication</Title> <Section position="3" start_page="0" end_page="213" type="metho"> <SectionTitle> 2 Completed Works </SectionTitle> <Paragraph position="0"> Our previous research efforts related to the multimodal analysis of human communication can be roughly grouped into three areas: (1) multimodal corpus collection, annotation, and data processing, (2) measurement studies to enrich our knowledge of non-verbal cues to structural events, and (3) model construction using a data-driven approach. Utilizing non-verbal cues in human communication processing is quite new, and there is no standard data set or off-the-shelf evaluation method. Hence, the first part of my research has focused on corpus building. Through measurement investigations, we then obtain a better understanding of the non-verbal cues associated with structural events in order to model those structural events more effectively.</Paragraph> <Section position="1" start_page="211" end_page="211" type="sub_section"> <SectionTitle> 2.1 Multimodal Corpus Collection </SectionTitle> <Paragraph position="0"> Under an NSF KDI award (Quek et al.), we collected a multimodal dialogue corpus. The corpus contains calibrated stereo video recordings, time-aligned word transcriptions, prosodic analyses, and hand positions tracked by a video tracking algorithm (Quek et al., 2002). To improve the speed of producing a corpus while maintaining its quality, we have investigated factors impacting the accuracy of the forced alignment of transcriptions to audio files (Chen et al., 2004a).</Paragraph> <Paragraph position="1"> Meetings, in which several participants communicate with each other, play an important role in our daily lives but pose greater challenges to current information processing techniques. Understanding human multimodal communicative behavior, and how witting and unwitting visual displays (e.g., gesture, head orientation, gaze) relate to spoken content, is critical to the analysis of meetings. These multimodal behaviors may reveal the static and dynamic social structure of the meeting participants, the flow of topics being discussed, the control of the meeting floor, and so on. For this purpose, we have been collecting a multimodal meeting corpus under the sponsorship of ARDA VACE II (Chen et al., 2005). In a room equipped with synchronized multichannel audio, video, and motion-tracking recording devices, participants (from 5 to 8; civilian, military, or mixed) engage in planning exercises, such as managing a rocket launch emergency, exploring a foreign weapon component, and collaborating to select awardees for fellowships. We have collected, and continue to collect, multichannel time-synchronized audio and video recordings. Using a series of audio and video processing techniques, we obtain word transcriptions and prosodic features, as well as head, torso, and hand 3D tracking traces from visual trackers and a Vicon motion capture device. Figure 1 depicts our meeting corpus collection process.</Paragraph> </Section> <Section position="2" start_page="211" end_page="212" type="sub_section"> <SectionTitle> 2.2 Gesture Patterns during Speech Repairs </SectionTitle> <Paragraph position="0"> In the dynamic speech production process, speakers may make errors or completely change the content of what is being expressed. In either of these cases, speakers need to refocus or revise what they are saying, and therefore speech repairs appear in overt speech. A typical speech repair contains a reparandum, an optional editing phrase, and a correction. Based on the relationship between the reparandum and the correction, speech repairs can be classified into three types: repetitions, content replacements, and false starts. Since the utterance content has been modified in the last two repair types, we call them content modification (CM) repairs. We carried out a measurement study (Chen et al., 2002) to identify patterns of gestures that co-occur with speech repairs and that can be exploited by a multimodal processing system to process spontaneous speech more effectively. We observed that modification gestures (MGs), which exhibit a change in gesture state during a speech repair, have a high correlation with CM repairs but rarely occur with content repetitions. This study not only provides evidence that gesture and speech are tightly linked in production, but also shows that gestures provide an important additional cue for identifying speech repairs and their types.</Paragraph> </Section>
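To make the repair taxonomy above concrete, the following minimal Python sketch shows one way such annotations could be represented: a repair as a (reparandum, editing phrase, correction) triple, a rule that derives its type, and a count of how often modification gestures co-occur with CM repairs versus repetitions. The class and function names, the per-repair gesture labels, and the classification rule are illustrative assumptions, not the annotation scheme or tools used in (Chen et al., 2002).

# Illustrative sketch only: data structures and names are hypothetical,
# not the annotation scheme used in the cited study.
from dataclasses import dataclass
from collections import Counter

@dataclass
class SpeechRepair:
    reparandum: str        # material to be replaced, e.g. "to the left"
    editing_phrase: str    # optional, e.g. "uh, I mean"
    correction: str        # replacement material, e.g. "to the right"

    def repair_type(self) -> str:
        """Classify into repetition, content replacement, or false start."""
        if not self.correction.strip():
            return "false_start"      # speaker abandons the utterance
        if self.correction.strip() == self.reparandum.strip():
            return "repetition"       # same content restated
        return "replacement"          # content actually changed

    def is_content_modification(self) -> bool:
        # Replacements and false starts modify content; repetitions do not.
        return self.repair_type() in ("replacement", "false_start")

def cooccurrence_table(repairs, gesture_states):
    """Count how often a modification gesture (a change in gesture state
    during the repair interval) co-occurs with CM vs. repetition repairs.
    gesture_states holds one hypothetical per-repair label: 'MG' or 'hold'."""
    table = Counter()
    for repair, gesture in zip(repairs, gesture_states):
        key = ("CM" if repair.is_content_modification() else "repetition",
               gesture)
        table[key] += 1
    return table

if __name__ == "__main__":
    repairs = [
        SpeechRepair("to the left", "uh I mean", "to the right"),
        SpeechRepair("we should", "", "we should"),
        SpeechRepair("take the", "well", ""),
    ]
    gestures = ["MG", "hold", "MG"]   # hypothetical per-repair gesture labels
    print(cooccurrence_table(repairs, gestures))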
<Section position="3" start_page="212" end_page="212" type="sub_section"> <SectionTitle> 2.3 Incorporating Gesture in SU Detection </SectionTitle> <Paragraph position="0"> A sentence unit (SU) is defined as the complete expression of a speaker's thought or idea. It can be either a complete sentence or a semantically complete smaller unit. We have conducted an experiment that integrates lexical, prosodic, and gestural cues in order to more effectively detect sentence unit boundaries in conversational dialog (Chen et al., 2004b).</Paragraph> <Paragraph position="1"> As can be seen in Figure 2, our multimodal model combines lexical, prosodic, and gestural knowledge sources, with each knowledge source implemented as a separate model. A hidden event language model (LM) was trained to serve as the lexical model (P(W,E)). Using a direct modeling approach (Shriberg and Stolcke, 2004), prosodic features were extracted by collaborators at ICSI using the SRI prosodic feature extraction tool and were then used to train a CART decision tree as the prosodic model (P(E|F)). (A similar prosodic feature extraction tool has been developed in our lab using Praat (Huang et al., 2006).) As with the prosodic model, we computed gesture features directly from visual tracking measurements (Quek et al., 1999; Bryll et al., 2001): 3D hand position, Hold (a state in which there is no hand motion beyond some adaptive threshold), and Effort (analogous to the kinetic energy of hand movement). Using these gestural features, we trained a CART tree to serve as the gestural model (P(E|G)). Finally, an HMM-based model combination scheme was used to integrate the predictions of the individual models into an overall SU prediction (argmax_E P(E|W,F,G)). In our investigations, we found that gesture features complement the prosodic and lexical knowledge sources; by using all of the knowledge sources, the model achieves the lowest overall detection error rate.</Paragraph> <Paragraph position="2"> Figure 2: SU detection model using lexical, prosodic, and gestural cues.</Paragraph> </Section>
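As a concrete illustration of the fusion step, the sketch below combines per-boundary posteriors from hypothetical lexical, prosodic, and gestural models with a simple log-linear interpolation, and computes an Effort-style feature from a window of 3D hand positions. This is a simplified stand-in, not the HMM-based combination scheme or the exact feature definitions used in (Chen et al., 2004b); the function names, weights, and the form of the Effort feature are assumptions made for this example.

# Illustrative sketch only: a simplified stand-in for the HMM-based
# combination described above. Names, weights, and the gestural feature
# definition are assumptions made for this example.
import math

def effort(hand_xyz, fps=30.0):
    """Kinetic-energy-like 'Effort' feature from a window of 3D hand
    positions: mean squared speed over the window (hypothetical form)."""
    speeds_sq = []
    for (x0, y0, z0), (x1, y1, z1) in zip(hand_xyz, hand_xyz[1:]):
        v = ((x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2) * fps ** 2
        speeds_sq.append(v)
    return sum(speeds_sq) / max(len(speeds_sq), 1)

def combine_su_posteriors(p_lex, p_pros, p_gest,
                          w_lex=1.0, w_pros=1.0, w_gest=0.5):
    """Log-linear interpolation of P(SU boundary) estimates from the
    lexical, prosodic, and gestural models; returns a fused posterior."""
    eps = 1e-12
    log_yes = (w_lex * math.log(p_lex + eps)
               + w_pros * math.log(p_pros + eps)
               + w_gest * math.log(p_gest + eps))
    log_no = (w_lex * math.log(1 - p_lex + eps)
              + w_pros * math.log(1 - p_pros + eps)
              + w_gest * math.log(1 - p_gest + eps))
    # Normalize the two-class scores back to a probability.
    m = max(log_yes, log_no)
    yes, no = math.exp(log_yes - m), math.exp(log_no - m)
    return yes / (yes + no)

if __name__ == "__main__":
    window = [(0.10, 0.20, 0.95), (0.11, 0.20, 0.95), (0.18, 0.25, 0.90)]
    print("Effort:", round(effort(window), 4))
    # Hypothetical per-boundary posteriors from the three models.
    fused = combine_su_posteriors(p_lex=0.62, p_pros=0.55, p_gest=0.30)
    print("P(SU boundary | W, F, G) ~", round(fused, 3))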
<Section position="4" start_page="212" end_page="213" type="sub_section"> <SectionTitle> 2.4 Floor Control Investigation on Meetings </SectionTitle> <Paragraph position="0"> An underlying, auto-regulatory mechanism known as &quot;floor control&quot; allows participants to communicate with each other coherently and smoothly. A person controlling the floor bears the burden of moving the discourse along. A better understanding of floor control in meetings has the potential to impact two active research areas: human-like conversational agent design and automatic meeting analysis. We have recently investigated floor control in multi-party meetings (Chen et al., 2006). In particular, we analyzed patterns of speech (e.g., the use of discourse markers) and visual cues (e.g., eye gaze exchanges, pointing gestures toward the next speaker) that are often involved in floor control changes. From this analysis, we identified several multimodal cues that will be helpful for predicting floor control events.</Paragraph> <Paragraph position="1"> Discourse markers are found to occur frequently at the beginning of a floor. During floor transitions, the previous floor holder often gazes at the next floor holder, and vice versa. The well-known mutual gaze break pattern observed in dyadic conversations is also found in some meetings. A special participant, an active meeting manager, is found to play a role in floor transitions.</Paragraph> <Paragraph position="2"> Gesture cues are also found to play a role, especially floor-capturing gestures.</Paragraph> </Section> </Section> <Section position="4" start_page="213" end_page="213" type="metho"> <SectionTitle> 3 Research Directions </SectionTitle> <Paragraph position="0"> In the next stage of my research, I will focus on integrating these previous efforts into a complete multimodal model for structural event detection. In particular, I will improve the current gesture feature extraction and expand the non-verbal features to include both eye gaze and body posture. I will also investigate alternatives to the HMM-based integration architecture shown in Figure 2. In my thesis, I hope to better understand the role that non-verbal cues play in assisting structural event detection. My research is expected to support adding multimodal perception capabilities to current human communication systems that rely mostly on speech. I am also interested in investigating the mutual impact among structural events.</Paragraph> <Paragraph position="1"> For example, I will study SUs and their relationship to floor control structure. Given progress in structural event detection in human communication, I also plan to utilize the detected structural events to further enhance meeting understanding. A particularly interesting task is to locate the salient portions of a meeting from multimodal cues (Chen, 2005) in order to summarize it.</Paragraph> </Section> </Paper>