<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1054"> <Title>Finite-state Multimodal Parsing and Understanding</Title> <Section position="3" start_page="0" end_page="369" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Multimodal interfaces are systems that allow input and/or output to be conveyed over multiple different channels such as speech, graphics, and gesture. They enable more natural and effective interaction, since different kinds of content can be conveyed in the modes to which they are best suited (Oviatt, 1997).</Paragraph>
<Paragraph position="1"> Our specific concern here is with multimodal interfaces supporting input by speech, pen, and touch, but the approach we describe has far broader applicability. These interfaces stand to play a critical role in the ongoing migration of interaction from the desktop to wireless portable computing devices (PDAs, next-generation phones) that offer limited screen real estate, and to other keyboard-less platforms such as public information kiosks.</Paragraph>
<Paragraph position="2"> To realize their full potential, multimodal interfaces need to support not just input from multiple modes, but synergistic multimodal utterances optimally distributed over the available modes (Johnston et al., 1997). Achieving this requires an effective method for integrating content from different modes. Johnston (1998b) shows how techniques from natural language processing (unification-based grammars and chart parsing) can be adapted to support parsing and interpretation of utterances distributed over multiple modes. In that approach, speech and gesture recognition produce n-best lists of recognition results, which are assigned typed feature structure representations (Carpenter, 1992) and passed to a multidimensional chart parser that uses a multimodal unification-based grammar to combine the representations assigned to the input elements. Possible multimodal interpretations are then ranked, and the optimal interpretation is passed on for execution. This approach overcomes many of the limitations of previous approaches to multimodal integration such as (Bolt, 1980; Neal and Shapiro, 1991) (see (Johnston et al., 1997), p. 282). It supports speech with multiple gestures and visual parsing of unimodal gestures, and its declarative nature facilitates rapid prototyping and iterative development of multimodal systems. Also, the unification-based approach allows for mutual compensation of recognition errors in the individual modalities (Oviatt, 1999).</Paragraph>
<Paragraph position="3"> However, the unification-based approach does not allow for tight coupling of multimodal parsing with speech and gesture recognition. Compensation effects depend on the correct answer appearing in the n-best list of interpretations assigned to each mode; multimodal parsing cannot directly influence the progress of speech or gesture recognition. The multidimensional parsing approach also raises significant concerns about computational complexity: in the worst case, the multidimensional parsing algorithm (Johnston, 1998b) (p. 626) is exponential in the number of input elements. Furthermore, this approach does not provide a natural framework for combining the probabilities of speech and gesture events in order to select among multiple competing multimodal interpretations. Wu et al. (1999) present a statistical approach for selecting among multiple possible combinations of speech and gesture, but it is not clear how that approach will scale to more complex verbal language and to combinations of speech with multiple gestures.</Paragraph>
<Paragraph position="4"> In this paper, we propose an alternative approach that addresses these limitations: parsing, understanding, and integration of speech and gesture are performed by a single finite-state device. With certain simplifying assumptions, multidimensional parsing and understanding with multimodal grammars can be achieved using a weighted finite-state automaton (FSA) running on three tapes, which represent speech input (words), gesture input (gesture symbols and reference markers), and their combined interpretation. We have implemented our approach in the context of a multimodal messaging application in which users interact with a company directory using synergistic combinations of speech and pen input; the application is a multimodal variant of VPQ (Buntschuh et al., 1998). For example, the user might say email this person and this person and gesture with the pen on pictures of two people on the user interface display. In addition to the user interface client, the architecture contains speech and gesture recognition components, which process incoming streams of speech and electronic ink, and a multimodal language processing component (Figure 1).</Paragraph>
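To make the three-tape idea concrete, the following minimal Python sketch simulates a hand-written fragment in which each transition carries a speech symbol, a gesture symbol, and a meaning symbol, any of which may be epsilon; consuming both input tapes along an accepting path yields the string written on the meaning tape. This is only an illustration, not the system described in this paper: the gesture symbols (Gp for a pen gesture on a person, e1/e2 for reference markers), the meaning notation, and the grammar fragment are invented to mirror the "email this person and this person" example.

```python
# Toy three-tape finite-state device: transitions are (speech, gesture, meaning)
# triples, with None acting as epsilon on a tape. The symbols and grammar
# fragment below are illustrative assumptions, not the paper's actual grammar.

EPS = None

# Each transition: (source_state, speech, gesture, meaning, target_state)
TRANSITIONS = [
    (0, "email",  EPS,  "email([",  1),
    (1, "this",   EPS,  EPS,        2),
    (2, "person", "Gp", "person(",  3),   # spoken "person" aligned with a person gesture
    (3, EPS,      "e1", "e1)",      4),   # reference marker for the selected entity
    (3, EPS,      "e2", "e2)",      4),
    (4, "and",    EPS,  ",",        1),   # loop back for another conjunct
    (4, EPS,      EPS,  "])",       5),   # close the meaning expression
]
FINAL = {5}

def parse(speech, gesture):
    """Search for a path that consumes both input tapes and
    concatenate the symbols written on the meaning tape."""
    def step(state, i, j, meaning):
        if state in FINAL and i == len(speech) and j == len(gesture):
            yield "".join(meaning)
        for (src, w, g, m, dst) in TRANSITIONS:
            if src != state:
                continue
            ni, nj = i, j
            if w is not EPS:                      # match next speech token, if required
                if i < len(speech) and speech[i] == w:
                    ni = i + 1
                else:
                    continue
            if g is not EPS:                      # match next gesture token, if required
                if j < len(gesture) and gesture[j] == g:
                    nj = j + 1
                else:
                    continue
            yield from step(dst, ni, nj, meaning + ([m] if m is not EPS else []))
    return list(step(0, 0, 0, []))

if __name__ == "__main__":
    speech = "email this person and this person".split()
    gesture = ["Gp", "e1", "Gp", "e2"]   # two person gestures with reference markers
    print(parse(speech, gesture))        # -> ['email([person(e1),person(e2)])']
```

A real implementation would instead encode such a grammar as weighted finite-state transducers and combine them with the recognizers' output by composition, which is what makes the tight coupling with speech and gesture recognition discussed in Sections 3 and 4 possible.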
<Paragraph position="5"> Section 2 provides background on finite-state language processing. In Section 3, we define and exemplify multimodal context-free grammars (MCFGs) and their approximation as multimodal FSAs. We describe our approach to finite-state representation of meaning and explain how the three-tape finite-state automaton can be factored into a number of finite-state transducers. In Section 4, we explain how these transducers can be used to enable tight coupling of multimodal language processing with speech and gesture recognition.</Paragraph> </Section> </Paper>