Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 211–214,
New York, June 2006. c©2006 Association for Computational Linguistics
Incorporating Gesture and Gaze into Multimodal Models of
Human-to-Human Communication
Lei Chen
Dept. of Electrical and Computer Engineering
Purdue University
West Lafayette, IN 47907
chenl@ecn.purdue.edu
Abstract
Structural information in language is im-
portant for obtaining a better understand-
ing of a human communication (e.g., sen-
tence segmentation, speaker turns, and
topic segmentation). Human communica-
tion involves a variety of multimodal be-
haviors that signal both propositional con-
tent and structure, e.g., gesture, gaze, and
body posture. These non-verbal signals
have tight temporal and semantic links to
spoken content. In my thesis, I am work-
ing on incorporating non-verbal cues into
a multimodal model to better predict the
structural events to further improve the
understanding of human communication.
Some research results are summarized in
this document and my future research plan
is described.
1 Introduction
In human communication, ideas tend to unfold in
a structured way. For example, for an individual
speaker, he/she organizes his/her utterances into sen-
tences. When a speaker makes errors in the dy-
namic speech production process, he/she may cor-
rect these errors using a speech repair scheme. A
group of speakers in a meeting organize their ut-
terances by following a floor control scheme. All
these structures are helpful for building better mod-
els of human communication but are not explicit in
the spontaneous speech or the corresponding tran-
scription word string. In order to utilize these struc-
tures, it is necessary to first detect them, and to do
so as efficiently as possible. Utilization of various
kinds of knowledge is important; For example, lex-
ical and prosodic knowledge (Liu, 2004; Liu et al.,
2005) have been used to detect structural events.
Human communication tends to utilize not only
speech but also visual cues such as gesture, gaze,
and so on. Some studies (McNeill, 1992; Cassell
and Stone, 1999) suggest that gesture and speech
stem from a single underlying mental process, and
they are related both temporally and semantically.
Gestures play an important role in human commu-
nication but use quite different expressive mecha-
nisms than spoken language. Gaze has been found
to be widely used in coordinating multi-party con-
versations (Argyle and Cook, 1976; Novick, 2005).
Given the close relationship between non-verbal
cues and speech and the special expressive capac-
ity of non-verbal cues, we believe that these cues
are likely to provide additional important informa-
tion that can be exploited when modeling structural
events. Hence, in my Ph.D thesis, I have been in-
vestigating the combination of lexical, prosodic, and
non-verbal cues for detection of the following struc-
tural events: sentence units, speech repairs, and
meeting floor control.
This paper is organized as follows: Section 1 has
described the research goals of my thesis. Section 2
summarizes the efforts made related to these goals.
Section 3 lays out the research work needed to com-
plete my thesis.
2 Completed Works
Our previous research efforts related to multimodal
analysis of human communication can be roughly
grouped to three fields: (1) multimodal corpus col-
211
Figure 1: VACE meeting corpus production
lection, annotation, and data processing, (2) mea-
surement studies to enrich knowledge of non-verbal
cues to structural events, and (3) model construc-
tion using a data-driven approach. Utilizing non-
verbal cues in human communication processing is
quite new and there is no standard data or off-the-
shelf evaluation method. Hence, the first part of my
research has focused on corpus building. Through
measurement investigations, we then obtain a bet-
ter understanding of the non-verbal cues associated
with structural events in order to model those struc-
tural events more effectively.
2.1 Multimodal Corpus Collection
Under NSF KDI award (Quek and et al., ), we col-
lected a multimodal dialogue corpus. The corpus
contains calibrated stereo video recordings, time-
aligned word transcriptions, prosodic analyses, and
hand positions tracked by a video tracking algo-
rithm (Quek et al., 2002). To improve the speed
of producing a corpus while maintaining its qual-
ity, we have investigated factors impacting the ac-
curacy of the forced alignment of transcriptions to
audio files (Chen et al., 2004a).
Meetings, in which several participants commu-
nicate with each other, play an important role in our
daily life but increase the challenges to current infor-
mation processing techniques. Understanding hu-
man multimodal communicative behavior, and how
witting and unwitting visual displays (e.g., gesture,
head orientation, gaze) relate to spoken content is
critical to the analysis of meetings. These multi-
modal behaviors may reveal static and dynamic so-
cial structure of the meeting participants, the flow
of topics being discussed, the control of floor of
the meeting, and so on. For this purpose, we have
been collecting a multimodal meeting corpus un-
der the sponsorship of ARDA VACE II (Chen et al.,
2005). In a room equipped with synchronized mul-
tichannel audio,video and motion-tracking record-
ing devices, participants (from 5 to 8 civilian, mil-
itary, or mixed) engage in planning exercises, such
as managing rocket launch emergency, exploring a
foreign weapon component, and collaborating to se-
lect awardees for fellowships. we have collected and
continued to do multichannel time synchronized au-
dio and video recordings. Using a series of audio
and video processing techniques, we obtain the word
transcriptions and prosodic features, as well as head,
torso and hand 3D tracking traces from visual track-
ers and Vicon motion capture device. Figure 1 de-
picts our meeting corpus collection process.
2.2 Gesture Patterns during Speech Repairs
In the dynamic speech production process, speak-
ers may make errors or totally change the content
of what is being expressed. In either of these cases,
speakers need refocus or revise what they are saying
212
and therefore speech repairs appear in overt speech.
A typical speech repair contains a reparandum, an
optional editing phrase, and a correction. Based
on the relationship between the reparandum and the
correction, speech repairs can be classified into three
types: repetitions, content replacements, and false
starts. Since utterance content has been modified
in last two repair types, we call them content mod-
ification (CM) repairs. We carried out a measure-
ment study (Chen et al., 2002) to identify patterns of
gestures that co-occur with speech repairs that can
be exploited by a multimodal processing system to
more effectively process spontaneous speech. We
observed that modification gestures (MGs), which
exhibit a change in gesture state during speech re-
pair, have a high correlation with content modifica-
tion (CM) speech repairs, but rarely occur with con-
tent repetitions. This study does not only provide ev-
idence that gesture and speech are tightly linked in
production, but also provides evidence that gestures
provide an important additional cue for identifying
speech repairs and their types.
2.3 Incorporating Gesture in SU Detection
A sentence unit (SU) is defined as the complete ex-
pression of a speaker’s thought or idea. It can be ei-
ther a complete sentence or a semantically complete
smaller unit. We have conducted an experiment that
integrates lexical, prosodic and gestural cues in or-
der to more effectively detect sentence unit bound-
aries in conversational dialog (Chen et al., 2004b).
As can be seen in Figure 2, our multimodal model
combines lexical, prosodic, and gestural knowl-
edge sources, with each knowledge source imple-
mented as a separate model. A hidden event lan-
guage model (LM) was trained to serve as lexical
model (P(W,E)). Using a direct modeling ap-
proach (Shriberg and Stolcke, 2004), prosodic fea-
tures were extracted using the SRI prosodic fea-
ture extraction tool1 by collaborators at ICSI and
then were used to train a CART decision tree as the
prosodic model (P(E|F)). Similarly to the prosodic
model, we computed gesture features directly from
visual tracking measurements (Quek et al., 1999;
Bryll et al., 2001): 3D hand position, Hold (a state
when there is no hand motion beyond some adaptive
1A similar prosody feature extraction tool has been devel-
oped in our lab (Huang et al., 2006) using Praat.
threshold results), and Effort (analogous to the ki-
netic energy of hand movement). Using gestural fea-
tures, we trained a CART tree to serve as the gestu-
ral model (P(E|G)). Finally, an HMM based model
combination scheme was used to integrate predic-
tions from individual models to obtain an overall SU
prediction (argmax(E|W,F,G)). In our investiga-
tions, we found that gesture features complement the
prosodic and lexical knowledge sources; by using
all of the knowledge sources, the model is able to
achieve the lowest overall detection error rate.
Figure 2: Data flow diagram of multimodal SU
model using lexical, prosodic and gestural cues
2.4 Floor Control Investigation on Meetings
An underlying, auto-regulatory mechanism known
as “floor control”, allows participants communicate
with each other coherently and smoothly. A person
controlling the floor bears the burden of moving the
discourse along. By increasing our understanding of
floor control in meetings, there is a potential to im-
pact two active research areas: human-like conver-
sational agent design and automatic meeting analy-
sis. We have recently investigated floor control in
multi-party meetings (Chen et al., 2006). In particu-
lar, we analyzed patterns of speech (e.g., the use of
discourse markers) and visual cues (e.g., eye gaze
exchange, pointing gesture for next speaker) that are
often involved in floor control changes. From this
analysis, we identified some multimodal cues that
will be helpful for predicting floor control events.
Discourse markers are found to occur frequently at
the beginning of a floor. During floor transitions, the
213
previous holder often gazes at the next floor holder
and vice verse. The well-known mutual gaze break
pattern in dyadic conversations is also found in some
meetings. A special participant, an active meeting
manager, is found to play a role in floor transitions.
Gesture cues are also found to play a role, especially
with respect to floor capturing gestures.
3 Research Directions
In the next stage of my research, I will focus on inte-
grating previous efforts into a complete multimodal
model for structural event detection. In particular, I
will improve current gesture feature extraction, and
expand the non-verbal features to include both eye
gaze and body posture. I will also investigate alter-
native integration architectures to the HMM shown
in Figure 2. In my thesis, I hope to better understand
the role that the non-verbal cues play in assisting
structural event detection. My research is expected
to support adding multimodal perception capabili-
ties to current human communication systems that
rely mostly on speech. I am also interested in inves-
tigating mutual impacts among the structural events.
For example, we will study SUs and their relation-
ship to floor control structure. Given progress in
structural event detection in human communication,
I also plan to utilize the detected structural events
to further enhance meeting understanding. A par-
ticularly interesting task is to locate salient portions
of a meeting from multimodal cues (Chen, 2005) to
summarize it.
References
M. Argyle and M. Cook. 1976. Gaze and Mutual Gaze.
Cambridge Univ. Press.
R. Bryll, F. Quek, and A. Esposito. 2001. Automatic
hand hold detection in natural conversation. In IEEE
Workshop on Cues in Communication, Kauai,Hawaii,
Dec.
J. Cassell and M. Stone. 1999. Living Hand to Mouth:
Psychological Theories about Speech and Gesture in
Interactive Dialogue Systems. In AAAI.
L. Chen, M. Harper, and F. Quek. 2002. Gesture pat-
terns during speech repairs. In Proc. of Int. Conf. on
Multimodal Interface (ICMI), Pittsburg, PA, Oct.
L. Chen, Y. Liu, M. Harper, E. Maia, and S. McRoy.
2004a. Evaluating factors impacting the accuracy of
forced alignments in a multimodal corpus. In Proc. of
Language Resource and Evaluation Conference, Lis-
bon, Portugal, June.
L. Chen, Y. Liu, M. Harper, and E. Shriberg. 2004b.
Multimodal model integration for sentence unit detec-
tion. In Proc. of Int. Conf. on Multimodal Interface
(ICMI), University Park, PA, Oct.
L. Chen, T.R. Rose, F. Parrill, X. Han, J. Tu, Z.Q. Huang,
I. Kimbara, H. Welji, M. Harper, F. Quek, D. McNeill,
S. Duncan, R. Tuttle, and T. Huang. 2005. VACE
multimodal meeting corpus. In Proceeding of MLMI
2005 Workshop.
L. Chen, M. Harper, A. Franklin, T. R. Rose, I. Kimbara,
Z. Q. Huang, and F. Quek. 2006. A multimodal anal-
ysis of floor control in meetings. In Proc. of MLMI 06,
Washington, DC, USA, May.
L. Chen. 2005. Locating salient portions of meeting us-
ing multimodal cues. Research proposal submitted to
AMI training program, Dec.
Z. Q. Huang, L. Chen, and M. Harper. 2006. An open
source prosodic feature extraction tool. In Proc. of
Language Resource and Evaluation Conference, May
2006.
Y. Liu, E. Shriberg, A. Stolcke, B. Peskin, J. Ang, Hillard
D., M. Ostendorf, M. Tomalin, P. Woodland, and
M. Harper. 2005. Structural Metadata Research in
the EARS Program. In Proc. of ICASSP.
Y. Liu. 2004. Structural Event Detection for Rich Tran-
scription of Speech. Ph.D. thesis, Purdue University.
D. McNeill. 1992. Hand and Mind: What Gestures Re-
veal about Thought. Univ. Chicago Press.
D. G. Novick. 2005. Models of gaze in multi-party dis-
course. In Proc. of CHI 2005 Workshop on the Virtu-
ality Continuum Revisted, Portland OR, April 3.
F. Quek and et al. KDI: Cross-model Analysis
Signal and Sense- Data and Computational Re-
sources for Gesture, Speech and Gaze Research,
http://vislab.cs.vt.edu/kdi.
F. Quek, R. Bryll, and X. F. Ma. 1999. A parallel algo-
righm for dynamic gesture tracking. In ICCV Work-
shop on RATFG-RTS, Gorfu,Greece.
F. Quek, D. McNeill, R. Bryll, S. Duncan, X. Ma, C. Kir-
bas, K. E. McCullough, and R. Ansari. 2002. Mul-
timodal human discourse: gesture and speech. ACM
Trans. Comput.-Hum. Interact., 9(3):171–193.
E. Shriberg and A. Stolcke. 2004. Direct modeling of
prosody: An overview of applications in automatic
speech processing. In International Conference on
Speech Prosody.
214
