<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1102"> <Title>Unification-based Multimodal Parsing</Title> <Section position="2" start_page="0" end_page="624" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Multimodal interfaces enable more natural and efficient interaction between humans and machines by providing multiple channels through which input or output may pass. Our concern here is with multi-modal input, such as interfaces which support simultaneous input from speech and pen. Such interfaces have clear task performance and user preference advantages over speech only interfaces, in particular for spatial tasks such as those involving maps (Oviatt 1996). Our focus here is on the integration of input from multiple modes and the role this plays in the segmentation and parsing of natural human input. In the examples given here, the modes are speech and pen, but the architecture described is more general in that it can support more than two input modes and modes of other types such as 3D gestural input.</Paragraph> <Paragraph position="1"> Our multimodal interface technology is implemented in QuickSet (Cohen et al 1997), a working system which supports dynamic interaction with maps and other complex visual displays. The initial applications of QuickSet are: setting up and interacting with distributed simulations (Courtemanche and Cercanowicz 1995), logistics planning, and navigation in virtual worlds. The system is distributed; consisting of a series of agents (Figure 1) which communicate through a shared blackboard (Cohen et al 1994). It runs on both desktop and handheld PCs, communicating over wired and wireless LANs.</Paragraph> <Paragraph position="2"> The user interacts with a map displayed on a wireless hand-held unit (Figure 2).</Paragraph> <Paragraph position="3"> They can draw directly on the map and simultaneously issue spoken commands. Different kinds of entities, lines, and areas may be created by drawing the appropriate spatial features and speaking their type; for example, drawing an area and saying 'flood zone'. Orders may also be specified; for example, by drawing a line and saying 'helicopterfollow this route'. The speech signal is routed to an HMM- null based continuous speaker-independent recognizer.</Paragraph> <Paragraph position="4"> The electronic 'ink' is routed to a neural net-based gesture recognizer (Pittman 1991). Both generate N-best lists of potential recognition results with associated probabilities. These results are assigned semantic interpretations by natural language processing and gesture interpretation agents respectively. A multimodal integrator agent fields input from the natural language and gesture interpretation agents and selects the appropriate multimodal or unimodal commands to execute. These are passed on to a bridge agent which provides an API to the underlying applications the system is used to control.</Paragraph> <Paragraph position="5"> In the approach to multimodal integration proposed by Johnston et al 1997, integration of spoken and gestural input is driven by a unification operation over typed feature structures (Carpenter 1992) representing the semantic contributions of the different modes. This approach overcomes the limitations of previous approaches in that it allows for a full range of gestura~ input beyond simple deictic pointing gestures. 
<Paragraph position="5"> In the approach to multimodal integration proposed by Johnston et al 1997, integration of spoken and gestural input is driven by a unification operation over typed feature structures (Carpenter 1992) representing the semantic contributions of the different modes. This approach overcomes the limitations of previous approaches in that it allows for a full range of gestural input beyond simple deictic pointing gestures. Unlike speech-driven systems (Bolt 1980, Neal and Shapiro 1991, Koons et al 1993, Wauchope 1994), it is fully multimodal in that all elements of the content of a command can be in either mode. Furthermore, compared to related frame-merging strategies (Vo and Wood 1996), it provides a well-understood, generally applicable common meaning representation for the different modes and a formally well-defined mechanism for multimodal integration. However, while this approach provides an efficient solution for a broad class of multimodal systems, there are significant limitations on the expressivity and generality of the approach.</Paragraph>
<Paragraph position="6"> A wide range of potential multimodal utterances fall outside the expressive potential of the previous architecture. Empirical studies of multimodal interaction (Oviatt 1996), utilizing Wizard-of-Oz techniques, have shown that when users are free to interact with any combination of speech and pen, a single spoken utterance may be associated with more than one gesture. For example, a number of deictic pointing gestures may be associated with a single spoken utterance: 'calculate distance from here to here', 'put that there', 'move this team to here and prepare to rescue residents from this building'. Speech may also be combined with a series of gestures of different types: the user circles a vehicle on the map, says 'follow this route', and draws an arrow indicating the route to be followed.</Paragraph>
<Paragraph position="7"> In addition to more complex multipart multimodal utterances, unimodal gestural utterances may contain several component gestures which compose to yield a command. For example, to create an entity with a specific orientation, a user might draw the entity and then draw an arrow leading out from it (Figure 3 (a)). To specify a movement, the user might draw an arrow indicating the extent of the move and indicate departure and arrival times by writing expressions at the base and head (Figure 3 (b)). These are specific examples of the more general problem of visual parsing, which has been a focus of attention in research on visual programming and pen-based interfaces for the creation of complex graphical objects such as mathematical equations and flowcharts (Lakin 1986, Wittenburg et al 1991, Helm et al 1991, Crimi et al 1995).</Paragraph>
<Paragraph position="8"> The approach of Johnston et al 1997 also faces fundamental architectural problems. The multimodal integration strategy is hard-coded into the integration agent and there is no isolatable statement of the rules and constraints independent of the code itself. As the range of multimodal utterances supported is extended, it becomes essential that there be a declarative statement of the grammar of multimodal utterances, separate from the algorithms and mechanisms of parsing. This will enable system developers to describe integration strategies in a high-level representation, facilitating rapid prototyping and iterative development of multimodal systems.</Paragraph>
</Section> </Paper>