<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3033">
<Title>MATCHKiosk: A Multimodal Interactive City Guide</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 The MATCHKiosk </SectionTitle>
<Paragraph position="0"> The MATCHKiosk runs on a Windows PC mounted in a rugged cabinet (Figure 1). It has a touch screen which supports both touch and pen input, and it also contains a printer, whose output emerges from a slot below the screen. The cabinet also contains speakers, and an array microphone is mounted above the screen. There are three main components to the graphical user interface (Figure 2). On the right, there is a panel with a dynamic map display, a click-to-speak button, and a window for feedback on speech recognition. As the user interacts with the system, the map display dynamically pans and zooms, and it shows the locations of restaurants and other points of interest, graphical callouts with information, and subway route segments. In the top left, there is a photo-realistic virtual agent (Cosatto and Graf, 2000), synthesized by concatenating and blending image samples. Below the agent, there is a panel with large buttons which enable easy access to help and common functions. The buttons presented are context-sensitive and change over the course of the interaction.</Paragraph>
<Paragraph position="1"> The basic functions of the system are to enable users to locate restaurants and other points of interest based on attributes such as price, location, and food type, to request information about them such as phone numbers, addresses, and reviews, and to provide directions on the subway or metro between locations. There are also commands for panning and zooming the map. The system provides users with a high degree of flexibility in the inputs they use to access these functions. For example, when looking for restaurants the user can employ speech, e.g. find me moderately priced italian restaurants in Alexandria; a multimodal combination of speech and pen, e.g. moderate italian restaurants in this area and circling Alexandria on the map; or pen alone, e.g. writing moderate italian and alexandria. Similarly, when requesting directions they can use speech, e.g. How do I get to the Smithsonian?; multimodal input, e.g. How do I get from here to here? and circling or touching two locations on the map; or pen, e.g. in Figure 2 the user has circled a location on the map and handwritten the word route.</Paragraph>
<Paragraph position="2"> System output consists of coordinated presentations combining synthetic speech with graphical actions on the map. For example, when showing a subway route, as the virtual agent speaks each instruction in turn, the map display zooms and shows the corresponding route segment graphically. The kiosk system also has a print capability. When a route has been presented, one of the context-sensitive buttons changes to Print Directions. When this is pressed, the system generates an XHTML document containing a map with step-by-step textual directions, which is sent to the printer using an XHTML-Print capability.</Paragraph>
<Paragraph position="3"> If the system has low confidence in a user input, based on the ASR or pen recognition score, it requests confirmation from the user. The user can confirm using speech or pen, or by touching the checkmark or cross mark that appears in the bottom right of the screen. Context-sensitive graphical widgets are also used for resolving ambiguity and vagueness in user inputs.
For example, if the user asks for the Smithsonian Museum, a small menu appears in the bottom right of the map enabling them to select between the different museum sites. If the user asks to see restaurants near a particular location, e.g. show restaurants near the white house, a graphical slider appears enabling the user to fine-tune just how near.</Paragraph>
<Paragraph position="4"> The system also features a context-sensitive multimodal help mechanism (Hastie et al., 2002) which provides assistance to users in the context of their current task, without redirecting them to a separate help system. The help system is triggered by spoken or written requests for help, by touching the help buttons on the left, or when the user has made several unsuccessful inputs. The type of help is chosen based on the current dialog state and the state of the visual interface. If more than one type of help is applicable, a graphical menu appears. Help messages consist of multimodal presentations combining spoken output with ink drawn on the display by the system. For example, if the user has just requested to see restaurants and they are now clearly visible on the display, the system will provide help on getting information about them.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Multimodal Kiosk Architecture </SectionTitle>
<Paragraph position="0"> The underlying architecture of MATCHKiosk consists of a series of re-usable components which communicate using XML messages sent over sockets through a facilitator (MCUBE) (Figure 3). Users interact with the system through the Multimodal UI displayed on the touchscreen. Their speech and ink are processed by speech recognition (ASR) and handwriting/gesture recognition (GESTURE, HW RECO) components, respectively. These recognition processes result in lattices of potential words and gestures/handwriting. These are then combined and assigned a meaning representation using a multimodal language processing architecture based on finite-state techniques (MMFST) (Johnston and Bangalore, 2000; Johnston et al., 2002b). This provides as output a lattice encoding all of the potential meaning representations assigned to the user inputs.</Paragraph>
<Paragraph position="1"> This lattice is flattened to an N-best list and passed to a multimodal dialog manager (MDM) (Johnston et al., 2002b), which re-ranks the hypotheses in accordance with the current dialogue state. If additional information or confirmation is required, the MDM uses the virtual agent to enter into a short information-gathering dialogue with the user. Once a command or query is complete, it is passed to the multimodal generation component (MMGEN), which builds a multimodal score indicating a coordinated sequence of graphical actions and TTS prompts. This score is passed back to the Multimodal UI, which passes prompts to a visual text-to-speech component (Cosatto and Graf, 2000) that communicates with the AT&T Natural Voices TTS engine (Beutnagel et al., 1999) in order to coordinate the lip movements of the virtual agent with the synthetic speech output. As prompts are realized, the Multimodal UI receives notifications and presents the coordinated graphical actions. The subway route server is an application server which identifies the best route between any two locations.</Paragraph>
</Section>
</Paper>
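
The confirmation and disambiguation behaviour described in Section 2 amounts to a simple policy over the recognition score and the candidate interpretation. The Python sketch below is illustrative only: the threshold value, the Interpretation class, and the ui widget hooks (show_confirm_widget, show_disambiguation_menu, show_slider) are assumptions, not part of the actual MATCHKiosk implementation.

# Illustrative sketch of the confirmation / disambiguation policy in Section 2.
# Thresholds, class names, and widget hooks are hypothetical.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Interpretation:
    """One candidate meaning for a user input."""
    command: str                       # e.g. "show_restaurants", "get_route"
    score: float                       # combined ASR / pen recognition confidence
    referents: List[str] = field(default_factory=list)    # resolved entities
    vague_slots: List[str] = field(default_factory=list)  # e.g. ["near"]


CONFIRM_THRESHOLD = 0.6   # hypothetical value


def handle_input(interp: Interpretation, ui) -> None:
    """Decide whether to execute, confirm, disambiguate, or refine an input."""
    if interp.score < CONFIRM_THRESHOLD:
        # Low recognition confidence: ask the user to confirm by speech, pen,
        # or the checkmark / cross mark in the bottom right of the screen.
        ui.show_confirm_widget(interp)
    elif len(interp.referents) > 1:
        # Ambiguous referent (e.g. multiple Smithsonian sites): small menu.
        ui.show_disambiguation_menu(interp.referents)
    elif interp.vague_slots:
        # Vague constraint such as "near": slider to fine-tune the value.
        ui.show_slider(interp.vague_slots[0])
    else:
        ui.execute(interp)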
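
The Print Directions function in Section 2 generates an XHTML document that is handed to the printer via XHTML-Print. A minimal sketch of how such a document could be assembled follows; the function name, page layout, and the idea of passing a pre-rendered map image are assumptions for illustration, not the system's actual code.

# Sketch of building a printable XHTML directions page (Section 2).
# Sending the result to an XHTML-Print printer is outside the sketch.

from xml.etree import ElementTree as ET


def build_directions_page(map_image_url: str, steps: list) -> str:
    """Return an XHTML document with a route map and step-by-step directions."""
    html = ET.Element("html", xmlns="http://www.w3.org/1999/xhtml")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = "Subway Directions"
    ET.SubElement(body, "img", src=map_image_url, alt="Route map")
    ol = ET.SubElement(body, "ol")
    for step in steps:
        ET.SubElement(ol, "li").text = step
    return ET.tostring(html, encoding="unicode")


# Example use with invented data:
# page = build_directions_page("route_map.png",
#                              ["Take the Blue Line to Metro Center.",
#                               "Transfer to the Red Line toward Glenmont."])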
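
Section 3 notes that the components communicate by sending XML messages over sockets through the MCUBE facilitator. The sketch below illustrates that pattern in Python; the envelope schema, host, and port are invented for the example and do not reflect the actual MCUBE protocol.

# Minimal sketch of XML-message-over-socket communication through a
# facilitator, as described in Section 3.  Schema, host, and port are assumed.

import socket
from xml.etree import ElementTree as ET


def send_message(sender: str, msg_type: str, body: ET.Element,
                 host: str = "localhost", port: int = 9000) -> None:
    """Wrap a payload element in a simple XML envelope and send it to the facilitator."""
    envelope = ET.Element("message", {"from": sender, "type": msg_type})
    envelope.append(body)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(ET.tostring(envelope, encoding="utf-8"))


# Example: MMFST forwarding its meaning lattice to the dialog manager (MDM).
# send_message("MMFST", "meaning_lattice", ET.Element("lattice"))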