<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1415">
<Title>The CARTOON project: Towards Integration of Multimodal and Linguistic Analysis for Cartographic Applications</Title>
<Section position="2" start_page="0" end_page="0" type="abstr">
<SectionTitle>1. THE CURRENT PROTOTYPE</SectionTitle>
<Paragraph position="0"> The current prototype enables the user to combine speech recognition, mouse pointing and the keyboard to interact with a cartographic database (figure 1). Several functions are available, such as requesting information on the name or location of a building, the shortest itinerary and the distance between two points, or zooming in and out. Several combinations are possible, such as:
* I am in front of the police station. How can I go here <pointing>?
* Show me how to go from here <pointing> to here <pointing>.
Currently, there is no linguistic analysis.
[Figure 1: Events detected on the three modalities (speech, mouse, keyboard) are displayed in the lower window as a function of time. The recognized words were: &quot;I want to go&quot;, &quot;here&quot;, &quot;here&quot;. Two mouse clicks were also detected. The system displayed the corresponding itinerary.]</Paragraph>
<Paragraph position="1"> Events produced by the speech recognition system (a Vecsys Datavox) are either words or sequences of words (&quot;I_want_to_go&quot;). There are 38 such speech events, which are characterized by the recognized word, the time of utterance and the recognition score.</Paragraph>
<Paragraph position="2"> The pointing gesture events are characterized by an (x, y) position and the time of detection.</Paragraph>
<Paragraph position="3"> The overall architecture is described in figure 2: events detected on the keyboard, mouse and speech modalities (left-hand side) are time-stamped coherently by a modality server and then integrated in the multimodal module, which merges them and activates the application.
[Figure 2: Current software and hardware architecture.]
The multimodal interface is based on a theoretical framework of « types of cooperation between modalities » that we initially presented in (Martin and Béroule 93) and that has been used by other French researchers in (Catinis and Caelen 95; Coutaz and Nigay 94). Our framework proposes six basic types of cooperation between modalities (either input or output). The combinations of modalities are described in a specification language that is based on this theoretical framework. Three criteria of fusion are available for redundancy and complementarity: temporal coincidence, sequence and structural completion. The multimodal module uses Guided Propagation Networks (Béroule 1985), which provide what we call « multimodal recognition scores » incorporating the score provided by the speech recognizer. In the case of missing events, several commands may be activated with different recognition scores. The command with the highest score is selected by the system and may prompt the user for information if needed. More details on this multimodal framework and module can be found in (Martin et al. 95).</Paragraph>
</Section>
</Paper>
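<!--
The abstract characterizes speech events by the recognized word, the time of utterance and the recognition score, pointing events by an (x, y) position and the time of detection, and a modality server that stamps events from all modalities coherently. A minimal Python sketch of these structures follows; the class and attribute names (SpeechEvent, PointingEvent, ModalityServer) are illustrative assumptions, not names taken from the paper or its software.

import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class SpeechEvent:
    word: str              # recognized word or word sequence, e.g. "I_want_to_go"
    utterance_time: float  # stamped by the modality server
    score: float           # recognition score from the speech recognizer

@dataclass
class PointingEvent:
    x: int
    y: int
    detection_time: float  # stamped by the modality server

@dataclass
class ModalityServer:
    """Stamps events from the speech, mouse and keyboard modalities on one clock."""
    t0: float = field(default_factory=time.monotonic)
    events: List[object] = field(default_factory=list)

    def stamp(self, event):
        # overwrite the device time with a time relative to the shared clock
        now = time.monotonic() - self.t0
        if isinstance(event, SpeechEvent):
            event.utterance_time = now
        else:
            event.detection_time = now
        self.events.append(event)
        return event

# illustrative usage
server = ModalityServer()
server.stamp(SpeechEvent("I_want_to_go", 0.0, 0.87))
server.stamp(PointingEvent(120, 45, 0.0))
-->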
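<!--
The abstract also describes fusing redundant and complementary events by temporal coincidence, computing a multimodal recognition score that incorporates the recognizer's score, selecting the command with the highest score, and prompting the user when events are missing. The sketch below is a deliberately simplified stand-in under those assumptions (plain dictionaries, a fixed time window, an ad hoc score penalty); it does not reproduce the paper's Guided Propagation Networks or its specification language.

def fuse_by_coincidence(speech_events, pointing_events, window=2.0):
    """speech_events: dicts with 'word', 'time', 'score';
    pointing_events: dicts with 'x', 'y', 'time'.
    Pairs each speech event with pointing gestures detected within the window."""
    commands = []
    for s in speech_events:
        points = [p for p in pointing_events
                  if abs(p["time"] - s["time"]) <= window]
        # crude multimodal recognition score: the speech score, lowered
        # when an expected pointing event is missing
        score = s["score"] * (1.0 if points else 0.5)
        commands.append({"words": s["word"], "points": points, "score": score})
    return commands

def select_command(commands):
    """Activates the command with the highest score and prompts the user
    when required information (here, a pointing target) is missing."""
    if not commands:
        return None
    best = max(commands, key=lambda c: c["score"])
    if not best["points"]:
        print("Please point to the destination on the map.")
    return best
-->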