<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1053">
  <Title>Deixis and Conjunction in Multimodal Systems</Title>
  <Section position="1" start_page="0" end_page="362" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In order to realize their full potential, multimodal interfaces need to support not just input from multiple modes, but single comnmnds optinmlly distributed across the available input modes. A multimodal language processing architecture is needed to integrate semantic content from the different modes. Johnston 1998a proposes a modular approach to multimodal language processing in which spoken language parsing is completed before lnultimodal parsing. In this paper, I will demonstrate the difficulties this approach faces as the spoken language parsing component is expanded to provide a compositional analysis of deictic expressions. I propose an alternative architecture in which spoken and multimodal parsing are tightly interleaved. This architecture greatly simplifies the spoken language parsing grm-nmar and enables predictive information fiom spoken language parsing to drive the application of multimodal parsing and gesture combination rules. I also propose a treatment of deictic numeral expressions that supports the broad range of pen gesture combinations that can be used to refer to collections of objects in the interface.</Paragraph>
    <Paragraph position="1"> Introduction Multimodal interfaces allow content to be conveyed between humans and machines over multiple different channels such speech, graphics, pen, and hand gesture. This enables more natural and efficient interaction since different kinds of content are best suited to particular modes. For example, spatial information is effectively conveyed using gesture for input (Oviatt 1997) and 2d or 3d graphics for output (Towns et al 1998).</Paragraph>
    <Paragraph position="2"> Multimodal interfaces also stand to play a critical role in the ongoing migration of interaction onto wireless portable computing devices, such as PDAs and next generation phones, which have limited screen real estate and no keyboard. For such devices, complex graphical user interfaces are not feasible and speech and pen will be the primary input lnodes. I focus here on multimodal interfaces which support speech and pen input.</Paragraph>
    <Paragraph position="3"> Pen input consists of gestures and drawings which are made in electronic ink on the computer display and processed by a gesture recognizer. Speech input is transcribed using a speech recognizer.</Paragraph>
    <Paragraph position="4"> This paper is concerned with the relationship between spoken language parsing and nmltimodal parsing, specifically whether they should be separate modular components, and the related issue of determining the appropriate level of constituent structure at which nmltimodal integration should apply. Johnston 1998a proposes a modular approach in which the individual modes are parsed and assigned typed feature structures representing their combinatory properties and semantic content. A nmltidimensional chart parser then combines these structures in accordance with a unification-based lnultimodal grammar. This approach is outlined in Section 1. Section 2 addresses the compositional analysis of deictic expressions and their interaction with conjunction and other aspects of the gramnmr. In Section 3, a new architecture is presented in which spoken and multimodal parsing are interleaved. Section 4 presents an analysis of deictic numeral expressions, and Section 5 discusses certain constructions in which multimodal integration applies at higher levels of constituent structure than a simple deictic noun phrase. I will draw examples from a nmltimodal directory and messaging application, specifically a multimodal variant of VPQ (Buntschuh et al 1998).</Paragraph>
    <Paragraph position="5"> 1 Unification-based nmltimodal parsing Johnston 1998a presents an approach to language processing for multimodal systems in which multimodal integration strategies are specified declaratively in a unification-based grammar formalism. The basic architecture of the approach is given in Figure I. The results of speech recognition and gesture recognition are interpreted by spoken language processing (SLP) and gesture processing (GP) components respectively. These assign typed feature structure representations  (Carpenter 1992) to speech and gesture and pass those on to a nmltimodal parsing component (MP).</Paragraph>
    <Paragraph position="6"> Tim typed feature structure formalism is augmented with ftmctioual constraints (Wittenbnrg 1993). MP uses a multidimensional chart parser to combine the interpretations of speech and gesture in accordance with a nmltimodal unil'ication-based grammar, determines the range of possible lnultimodal interpretations, selects the one with the highest joint probability, and passes it on for execution.</Paragraph>
  </Section>
class="xml-element"></Paper>