<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1048">
  <Title>MATCH: An Architecture for Multimodal Dialogue Systems</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Multimodal Mobile Information Access
</SectionTitle>
    <Paragraph position="0"> In urban environments tourists and residents alike need access to a complex and constantly changing body of information regarding restaurants, theatre schedules, transportation topology and timetables.</Paragraph>
    <Paragraph position="1"> This information is most valuable if it can be delivered effectively while mobile, since places close and plans change. Mobile information access devices (PDAs, tablet PCs, next-generation phones) offer limited screen real estate and no keyboard or mouse, making complex graphical interfaces cumbersome.</Paragraph>
    <Paragraph position="2"> Multimodal interfaces can address this problem by enabling speech and pen input and output combining speech and graphics (see André (2002) for a detailed overview of previous work on multimodal input and output). Since mobile devices are used in different physical and social environments, for different tasks, by different users, they need to be both flexible in input and adaptive in output. Users need to be able to provide input in whichever mode or combination of modes is most appropriate, and system output should be dynamically tailored so that it is maximally effective given the situation and the user's preferences. We present our testbed multimodal application MATCH (Multimodal Access To City Help) and the general purpose multimodal architecture underlying it, that: is designed for highly mobile applications; enables flexible multimodal input; and provides flexible user-tailored multimodal output.</Paragraph>
    <Paragraph position="3">  Highly mobile MATCH is a working city guide and navigation system that currently enables mobile users to access restaurant and subway information for New York City (NYC). MATCH runs standalone on a Fujitsu pen computer (Figure 1), and can also run in client-server mode across a wireless network.</Paragraph>
    <Paragraph position="4"> Flexible multimodal input Users interact with a graphical interface displaying restaurant listings and a dynamic map showing locations and street information. They are free to provide input using speech, by drawing on the display with a stylus, or by using synchronous multimodal combinations of the two modes. For example, a user might ask to see cheap Italian restaurants in Chelsea by saying show cheap italian restaurants in chelsea; by circling an area on the map and saying show cheap italian restaurants in this neighborhood; or, in a noisy or public environment, by circling an area and writing cheap and italian (Figure 2). The system will then zoom to the appropriate map location and show the locations of restaurants on the map. Users can ask for information about restaurants, such as phone numbers, addresses, and reviews. For example, a user might circle three restaurants as in Figure 3 and say phone numbers for these three restaurants (or write phone). Users can also manipulate the map interface directly. For example, a user might say show upper west side or circle an area and write zoom. (Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 376-383.)</Paragraph>
    <Paragraph position="5">  Flexible multimodal output MATCH provides flexible, synchronized multimodal generation and can take initiative to engage in information-seeking subdialogues. If a user circles the three restaurants in Figure 3 and writes phone, the system responds with a graphical callout on the display, synchronized with a text-to-speech (TTS) prompt of the phone number, for each restaurant in turn (Figure 4).</Paragraph>
    <Paragraph position="6">  The system also provides subway directions. If the user says How do I get to this place? and circles one of the restaurants displayed on the map, the system will ask Where do you want to go from? The user can then respond with speech (e.g., 25th Street and 3rd Avenue), with pen by writing (e.g., 25th St &amp; 3rd Ave), or multimodally (e.g., from here with a circle gesture indicating location). The system then calculates the optimal subway route and dynamically generates a multimodal presentation of instructions. It starts by zooming in on the first station and then gradually zooms out, graphically presenting each stage of the route along with a series of synchronized TTS prompts. Figure 5 shows the final display of a subway route heading downtown on the 6 train and transferring to the L train Brooklyn bound.</Paragraph>
    <Paragraph position="7">  User-tailored generation MATCH can also provide a user-tailored summary, comparison, or recommendation for an arbitrary set of restaurants, using a quantitative model of user preferences (Walker et al., 2002). The system will only discuss restaurants that rank highly according to the user's dining preferences, and will only describe attributes of those restaurants the user considers important. This permits concise, targeted system responses. For example, the user could say compare these restaurants and circle a large set of restaurants (Figure 6). If the user considers inexpensiveness and food quality to be the most important attributes of a restaurant, the system response might be: Compare-A: Among the selected restaurants, the following offer exceptional overall value. Uguale's price is 33 dollars. It has excellent food quality and good decor. Da Andrea's price is 28 dollars. It has very good food quality and good decor. John's Pizzeria's price is 20 dollars. It has very good food quality and mediocre decor.</Paragraph>
    <Paragraph position="8"> Figure 6: Comparing a large set of restaurants</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Multimodal Application Architecture
</SectionTitle>
    <Paragraph position="0"> The multimodal architecture supporting MATCH consists of a series of agents which communicate through a facilitator MCUBE (Figure 7).</Paragraph>
    <Paragraph position="1">  MCUBE is a Java-based facilitator which enables agents to pass messages either to single agents or groups of agents. It serves a similar function to systems such as OAA (Martin et al., 1999), the use of KQML for messaging in Allen et al. (2000), and the Communicator hub (Seneff et al., 1998). Agents may reside either on the client device or elsewhere on the network and can be implemented in multiple different languages. MCUBE messages are encoded in XML, providing a general mechanism for message parsing and facilitating logging.</Paragraph>
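    <Paragraph> As an illustration of XML-encoded agent messages, the following minimal Python sketch serializes and parses a message addressed to one or more agents. The MCUBE message schema is not given in this paper, so the element names (message, to, body), the attribute name (from), and the function names are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def encode_message(sender: str, recipients: List[str], body: str) -> str:
    """Serialize a message to XML (element names are hypothetical)."""
    msg = ET.Element("message", {"from": sender})
    for name in recipients:
        # One 'to' element per recipient allows single-agent or group delivery.
        ET.SubElement(msg, "to").text = name
    ET.SubElement(msg, "body").text = body
    return ET.tostring(msg, encoding="unicode")


def decode_message(xml_text: str) -> Tuple[str, List[str], str]:
    """Parse a message produced by encode_message."""
    msg = ET.fromstring(xml_text)
    recipients = [to.text or "" for to in msg.findall("to")]
    return msg.get("from", ""), recipients, msg.findtext("body", default="")
```

Because every message is plain XML text, a logging agent could store the same string that is sent over the wire without a separate serialization step.</Paragraph>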
    <Paragraph position="2"> Multimodal User Interface Users interact with the system through the Multimodal UI, which is browser-based and runs in Internet Explorer. This greatly facilitates rapid prototyping, authoring, and reuse of the system for different applications since anything that can appear on a webpage (dynamic HTML, ActiveX controls, etc.) can be used in the visual component of a multimodal user interface. A TCP/IP control enables communication with MCUBE.</Paragraph>
    <Paragraph position="3"> MATCH uses a control that provides a dynamic pan-able, zoomable map display. The control has ink handling capability. This enables both pen-based interaction (on the map) and normal GUI interaction (on the rest of the page) without requiring the user to overtly switch 'modes'. When the user draws on the map their ink is captured and any objects potentially selected, such as currently displayed restaurants, are identified. The electronic ink is broken into a lattice of strokes and sent to the gesture recognition and handwriting recognition components which enrich this stroke lattice with possible classifications of strokes and stroke combinations. The UI then translates this stroke lattice into an ink meaning lattice representing all of the possible interpretations of the user's ink and sends it to MMFST.</Paragraph>
    <Paragraph position="4"> In order to provide spoken input the user must tap a click-to-speak button on the Multimodal UI. We found that in an application such as MATCH which provides extensive unimodal pen-based interaction, it is preferable to use click-to-speak rather than pen-to-speak or open-mike. With pen-to-speak, spurious speech results received in noisy environments can disrupt unimodal pen commands.</Paragraph>
    <Paragraph position="5"> The Multimodal UI also provides graphical output capabilities and performs synchronization of multimodal output. For example, it synchronizes the display actions and TTS prompts in the answer to the route query mentioned in Section 1.</Paragraph>
    <Paragraph position="6"> Speech Recognition MATCH uses AT&amp;T's Watson speech recognition engine. A speech manager running on the device gathers audio and communicates with a recognition server running either on the device or on the network. The recognition server provides word lattice output which is passed to MMFST. Gesture and handwriting recognition Gesture and handwriting recognition agents provide possible classifications of electronic ink for the UI. Recognitions are performed both on individual strokes and combinations of strokes in the input ink lattice. The handwriting recognizer supports a vocabulary of 285 words, including attributes of restaurants (e.g. 'chinese','cheap') and zones and points of interest (e.g. 'soho','empire','state','building'). The gesture recognizer recognizes a set of 10 basic gestures, including lines, arrows, areas, points, and question marks. It uses a variant of Rubine's classic template-based gesture recognition algorithm (Rubine, 1991) trained on a corpus of sample gestures. In addition to classifying gestures the gesture recognition agent also extracts features such as the base and head of arrows. Combinations of this basic set of gestures and hand-written words provide a rich visual vocabulary for multimodal and pen-based commands.</Paragraph>
    <Paragraph position="7"> Gestures are represented in the ink meaning lattice as symbol complexes of the following form: G FORM MEANING (NUMBER TYPE) SEM.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
FORM MEANING (NUMBER TYPE) SEM. FORM
</SectionTitle>
    <Paragraph position="0"> FORM indicates the physical form of the gesture and has values such as area, point, line, arrow. MEANING indicates the meaning of that form; for example an area can be either a loc(ation) or a sel(ection). NUMBER and TYPE indicate the number of entities in a selection (1,2,3, many) and their type (rest(aurant), theatre). SEM is a place holder for the specific content of the gesture, such as the points that make up an area or the identifiers of objects in a selection.</Paragraph>
    <Paragraph position="1"> When multiple selection gestures are present an aggregation technique (Johnston and Bangalore, 2001) is employed to overcome the problems with deictic plurals and numerals described in Johnston (2000). Aggregation augments the ink meaning lattice with aggregate gestures that result from combining adjacent selection gestures. This allows a deictic expression like these three restaurants to combine with two area gestures, one which selects one restaurant and the other two, as long as their sum is three. For example, if the user makes two area gestures, one around a single restaurant and the other around two restaurants (Figure 3), the resulting ink meaning lattice will be as in Figure 8. The first gesture (node numbers 0-7) is either a reference to a location (loc.) (0-3,7) or a reference to a restaurant (sel.) (0-2,4-7). The second (nodes 7-13,16) is either a reference to a location (7-10,16) or to a set of two restaurants (7-9,11-13,16). The aggregation process applies to the two adjacent selections and adds a selection of three restaurants (0-2,4,14-16). If the user says show chinese restaurants in this neighborhood and this neighborhood, the path containing the two locations (0-3,7-10,16) will be taken when this lattice is combined with speech in MMFST. If the user says tell me about this place and these places, then the path with the adjacent selections is taken (0-2,4-9,11-13,16). If the speech is tell me about these or phone numbers for these three restaurants then the aggregate path (0-2,4,14-16) will be chosen.</Paragraph>
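    <Paragraph> The effect of aggregation can be sketched on a flat sequence of selection gestures rather than on a full lattice. This is a simplified illustration, not the technique of Johnston and Bangalore (2001) itself: the Selection class, the aggregate function, and the restriction to adjacent selections of the same type are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Selection:
    """A selection gesture: entity type plus selected identifiers (hypothetical)."""
    entity_type: str
    ids: Tuple[str, ...]


def aggregate(selections: List[Selection]) -> List[Selection]:
    """Return one aggregate for every run of two or more adjacent
    selections that share the same entity type."""
    aggregates = []
    count = len(selections)
    for start in range(count):
        first = selections[start]
        ids = first.ids
        for end in range(start + 1, count):
            if selections[end].entity_type != first.entity_type:
                break  # a run ends at the first selection of another type
            ids = ids + selections[end].ids
            aggregates.append(Selection(first.entity_type, ids))
    return aggregates
```

For the two gestures of Figure 3 (one restaurant, then two restaurants), this yields a single aggregate of three restaurants, which is what lets these three restaurants find a matching gesture.</Paragraph>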
    <Paragraph position="2"> Multimodal Integrator (MMFST) MMFST receives the speech lattice (from the Speech Manager) and the ink meaning lattice (from the UI) and builds a multimodal meaning lattice which captures the potential joint interpretations of the speech and ink inputs. MMFST is able to provide rapid response times by making unimodal timeouts conditional on activity in the other input mode. MMFST is notified when the user has hit the click-to-speak button, when a speech result arrives, and whether or not the user is inking on the display. When a speech lattice arrives, if inking is in progress MMFST waits for the ink meaning lattice, otherwise it applies a short timeout (1 sec.) and treats the speech as unimodal. When an ink meaning lattice arrives, if the user has tapped click-to-speak MMFST waits for the speech lattice to arrive, otherwise it applies a short timeout (1 sec.) and treats the ink as unimodal.</Paragraph>
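    <Paragraph> The conditional timeout policy described above can be sketched as two small decision functions. Only the wait conditions and the one-second timeout come from the text; the Decision structure, the function names, and the boolean flags are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

UNIMODAL_TIMEOUT_SEC = 1.0  # the "short timeout (1 sec.)" from the text


@dataclass(frozen=True)
class Decision:
    """Which input to wait for, and for how long (hypothetical structure).

    timeout_sec=None means: wait until the other lattice arrives.
    Otherwise the input is treated as unimodal once the timeout expires.
    """
    wait_for: str
    timeout_sec: Optional[float]


def on_speech_lattice(inking_in_progress: bool) -> Decision:
    """Policy applied when a speech lattice arrives."""
    if inking_in_progress:
        return Decision(wait_for="ink", timeout_sec=None)
    return Decision(wait_for="ink", timeout_sec=UNIMODAL_TIMEOUT_SEC)


def on_ink_lattice(click_to_speak_tapped: bool) -> Decision:
    """Policy applied when an ink meaning lattice arrives."""
    if click_to_speak_tapped:
        return Decision(wait_for="speech", timeout_sec=None)
    return Decision(wait_for="speech", timeout_sec=UNIMODAL_TIMEOUT_SEC)
```

The design point is that the long wait is only paid when there is evidence of activity in the other mode, so purely unimodal commands are delayed by at most the short timeout.</Paragraph>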
    <Paragraph position="3"> MMFST uses the finite-state approach to multimodal integration and understanding proposed by Johnston and Bangalore (2000). Possibilities for multimodal integration and understanding are captured in a three tape device in which the first tape represents the speech stream (words), the second the ink stream (gesture symbols) and the third their combined meaning (meaning symbols). In essence, this device takes the speech and ink meaning lattices as inputs, consumes them using the first two tapes, and writes out a multimodal meaning lattice using the third tape. The three tape finite-state device is simulated using two transducers: G:W, which is used to align speech and ink, and G_W:M, which takes a composite alphabet of speech and gesture symbols as input and outputs meaning. The ink meaning lattice G and speech lattice W are composed with G:W and the result is factored into an FSA G_W which is composed with G_W:M to derive the meaning lattice M.</Paragraph>
    <Paragraph position="4"> In order to capture multimodal integration using finite-state methods, it is necessary to abstract over specific aspects of gestural content (Johnston and Bangalore, 2000). For example, all possible sequences of coordinates that could occur in an area gesture cannot be encoded in the finite-state device.</Paragraph>
    <Paragraph position="5"> We employ the approach proposed by Johnston and Bangalore (2001), in which the ink meaning lattice is converted to a transducer I:G, where G are gesture symbols (including SEM) and I contains both gesture symbols and the specific contents. I and G differ only in cases where the gesture symbol on G is SEM, in which case the corresponding I symbol is the specific interpretation. After multimodal integration a projection G:M is taken from the resulting G_W:M machine and composed with the original I:G in order to reincorporate the specific contents that were left out of the finite-state process (I:G ∘ G:M = I:M).</Paragraph>
    <Paragraph position="6"> The multimodal finite-state transducers used at runtime are compiled from a declarative multimodal context-free grammar which captures the structure  This grammar captures not just multimodal integration patterns but also the parsing of speech and gesture, and the assignment of meaning. In Figure 9 we present a small simplified fragment capable of handling MATCH commands such as phone numbers for these three restaurants. A multimodal CFG differs from a normal CFG in that the terminals are triples: W:G:M, where W is the speech stream (words), G the ink stream (gesture symbols) and M the meaning stream (meaning symbols). An XML representation for meaning is used to facilitate parsing and logging by other system components. The meaning tape symbols concatenate to form coherent XML expressions.</Paragraph>
    <Paragraph position="7"> The epsilon symbol (eps) indicates that a stream is empty in a given terminal.</Paragraph>
    <Paragraph position="8"> When the user says phone numbers for these three restaurants and circles two groups of restaurants (Figure 3), the gesture lattice (Figure 8) is turned into a transducer I:G with the same symbol on each side except for the SEM arcs, which are split. For example, path 15-16 SEM([id1,id2,id3]) becomes [id1,id2,id3]:SEM. After G and the speech W are integrated using G:W and G_W:M, the G path in the result is used to re-establish the connection between SEM symbols and their specific contents in I:G (I:G ∘ G:M = I:M). The meaning read off I:M is &lt;cmd&gt;&lt;phone&gt;&lt;restaurant&gt; [id1,id2,id3] &lt;/restaurant&gt; &lt;/phone&gt; &lt;/cmd&gt;. This is passed to the multimodal dialog manager (MDM) and from there to the Multimodal UI resulting in a display like Figure 4 with coordinated TTS output. Since the speech input is a lattice and there is also potential for ambiguity in the multimodal grammar, the output from MMFST to MDM is an N-best list of potential multimodal interpretations.</Paragraph>
    <Paragraph position="9"> Multimodal Dialog Manager (MDM) The MDM is based on previous work on speech-act based models of dialog (Stent et al., 1999; Rich and Sidner, 1998). It uses a Java-based toolkit for writing dialog managers that is similar in philosophy to TrindiKit (Larsson et al., 1999). It includes several rule-based  processes that operate on a shared state. The state includes system and user intentions and beliefs, a dialog history and focus space, and information about the speaker, the domain and the available modalities.</Paragraph>
    <Paragraph position="10"> The processes include interpretation, update, selection and generation processes.</Paragraph>
    <Paragraph position="11"> The interpretation process takes as input an N-best list of possible multimodal interpretations for a user input from MMFST. It rescores them according to a set of rules that encode the most likely next speech act given the current dialogue context, and picks the most likely interpretation from the result. The update process updates the dialogue context according to the system's interpretation of user input. It augments the dialogue history, focus space, models of user and system beliefs, and model of user intentions. It also alters the list of current modalities to reflect those most recently used by the user.</Paragraph>
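    <Paragraph> The rescoring step of the interpretation process can be sketched as follows. The actual rule set is not described in this paper, so a single additive bonus for the expected speech act stands in for it; the dictionary keys (act, score), the function name, and the bonus value are all assumptions.

```python
from typing import Dict, List


def rescore(nbest: List[Dict], expected_act: str, bonus: float = 0.5) -> List[Dict]:
    """Boost hypotheses whose speech act matches the one expected in the
    current dialogue context, then sort best-first."""
    def adjusted(hypothesis: Dict) -> float:
        matches = hypothesis["act"] == expected_act
        return hypothesis["score"] + (bonus if matches else 0.0)

    return sorted(nbest, key=adjusted, reverse=True)
```

With a list containing a zoom request and a location assertion, an expected act of location assertion moves the assertion to the top even when the zoom reading had the higher initial score.</Paragraph>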
    <Paragraph position="12"> The selection process determines the system's next move(s). In the case of a command, request or question, it first checks that the input is fully specified (using the domain ontology, which contains information about required and optional roles for different types of actions); if it is not, then the system's next move is to take the initiative and start an information-gathering subdialogue. If the input is fully specified, the system's next move is to perform the command or answer the question; to do this, MDM communicates with the UI. Since MDM is aware of the current set of preferred modalities, it can provide feedback and responses tailored to the user's modality preferences.</Paragraph>
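    <Paragraph> The completeness check in the selection process can be sketched as a lookup of required roles followed by a choice of next move. The ontology fragment, the role names, and the shape of the returned move are hypothetical; only the overall logic (request missing information, otherwise perform the action) follows the description above.

```python
from typing import Dict, List

# Hypothetical ontology fragment: required roles per action type.
REQUIRED_ROLES: Dict[str, List[str]] = {
    "route": ["source", "destination"],
    "zoom": ["location"],
}


def next_move(action: str, roles: Dict[str, str]) -> Dict[str, str]:
    """Pick the next system move for a user command or question."""
    required = REQUIRED_ROLES.get(action, [])
    missing = [role for role in required if role not in roles]
    if missing:
        # Take the initiative: start an information-gathering subdialogue.
        return {"move": "request", "role": missing[0]}
    return {"move": "perform", "action": action}
```

A route query that supplies only a destination therefore produces a request for the source, while a fully specified query is performed directly.</Paragraph>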
    <Paragraph position="13"> The generation process performs template-based generation for simple responses and updates the system's model of the user's intentions after generation. The text planner is used for more complex generation, such as the generation of comparisons.</Paragraph>
    <Paragraph position="14"> In the route query example in Section 1, MDM first receives a route query in which only the destination is specified: How do I get to this place? In the selection phase it consults the domain model and determines that a source is also required for a route.</Paragraph>
    <Paragraph position="15"> It adds a request to query the user for the source to the system's next moves. This move is selected and the generation process selects a prompt and sends it to the TTS component. The system asks Where do you want to go from? If the user says or writes 25th Street and 3rd Avenue then MMFST will assign this input two possible interpretations. Either this is a request to zoom the display to the specified location or it is an assertion of a location. Since the MDM dialogue state indicates that it is waiting for an answer of the type location, MDM reranks the assertion as the most likely interpretation. A generalized overlay process (Alexandersson and Becker, 2001) is used to take the content of the assertion (a location) and add it into the partial route request. The result is determined to be complete. The UI resolves the location to map coordinates and passes on a route request to the SUBWAY component.</Paragraph>
    <Paragraph position="16"> We found this traditional speech-act based dialogue manager worked well for our multimodal interface. Critical in this was our use of a common semantic representation across spoken, gestured, and multimodal commands. The majority of the dialogue rules operate in a mode-independent fashion, giving users flexibility in the mode they choose to advance the dialogue. On the other hand, mode sensitivity is also important since user modality choice can be used to determine system mode choice for confirmation and other responses.</Paragraph>
    <Paragraph position="17"> Subway Route Constraint Solver (SUBWAY) This component has access to an exhaustive database of the NYC subway system. When it receives a route request with the desired source and destination points from the Multimodal UI, it explores the search space of possible routes to identify the optimal one, using a cost function based on the number of transfers, overall number of stops, and the walking distance from the station at each end. It builds a list of actions required to reach the destination and passes them to the multimodal generator.</Paragraph>
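    <Paragraph> A cost function over the three factors named above might look like the following sketch. The paper does not give the functional form or any weights, so the weighted sum, the weight values, and the Route fields are assumptions; the search over the subway network is also omitted, and candidate routes are simply assumed to be given.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Route:
    """Summary of a candidate route (hypothetical fields)."""
    name: str
    transfers: int
    stops: int
    walk_start_m: float  # walking distance to the first station, in metres
    walk_end_m: float    # walking distance from the last station, in metres


def route_cost(route: Route, w_transfer: float = 10.0,
               w_stop: float = 1.0, w_walk: float = 0.01) -> float:
    """Weighted sum of transfers, stops, and walking distance (assumed form)."""
    walking = route.walk_start_m + route.walk_end_m
    return w_transfer * route.transfers + w_stop * route.stops + w_walk * walking


def best_route(candidates: Iterable[Route]) -> Route:
    """Return the candidate with the lowest cost."""
    return min(candidates, key=route_cost)
```

Under these illustrative weights a transfer costs as much as ten extra stops, so a longer direct ride can still lose to a shorter route with one transfer when the direct option also involves more walking.</Paragraph>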
    <Paragraph position="18"> Multimodal Generator and Text-to-speech The multimodal generator processes action lists from SUBWAY and other components and assigns appropriate prompts for each action using a template-based generator. The result is a 'score' of prompts and actions which is passed to the Multimodal UI. The Multimodal UI plays this 'score' by coordinating changes in the interface with the corresponding TTS prompts.</Paragraph>
    <Paragraph position="19"> AT&amp;T's Natural Voices TTS engine is used to provide the spoken output. When the UI receives a multimodal score, it builds a stack of graphical actions such as zooming the display to a particular location or putting up a graphical callout. It then sends the prompts to be rendered by the TTS server. As each prompt is synthesized the TTS server sends progress notifications to the Multimodal UI, which pops the next graphical action off the stack and executes it.</Paragraph>
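    <Paragraph> The playback of a multimodal score can be sketched as a stack of graphical actions that is popped once per TTS progress notification. The ScorePlayer class, its callback name, and the representation of a score as (prompt, action) pairs are assumptions for illustration; the TTS server itself is not modelled.

```python
from typing import Callable, List, Tuple


class ScorePlayer:
    """Coordinates graphical actions with TTS progress (hypothetical class)."""

    def __init__(self, score: List[Tuple[str, str]],
                 execute: Callable[[str], None]) -> None:
        # Prompts are sent to the TTS component in presentation order.
        self.prompts = [prompt for prompt, _ in score]
        # Reverse the actions so that pop() yields them in presentation order.
        self._actions = [action for _, action in reversed(score)]
        self._execute = execute

    def on_tts_progress(self) -> None:
        """Called once per progress notification from the TTS server."""
        if self._actions:
            self._execute(self._actions.pop())
```

Driving the display from TTS notifications, rather than from fixed timers, keeps each graphical action aligned with its prompt regardless of how long synthesis takes.</Paragraph>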
    <Paragraph position="20"> Text Planner and User Model The text planner receives instructions from MDM for execution of 'compare', 'summarize', and 'recommend' commands. It employs a user model based on multiattribute decision theory (Carenini and Moore, 2001). For example, in order to make a comparison between the set of restaurants shown in Figure 6, the text planner first ranks the restaurants within the set according to the predicted ranking of the user model.</Paragraph>
    <Paragraph position="21"> Then, after selecting a small set of the highest ranked restaurants, it utilizes the user model to decide which restaurant attributes are important to mention. The resulting text plan is converted to text and sent to TTS (Walker et al., 2002). A user model for someone who cares most highly about cost and secondly about food quality and decor leads to a system response such as that in Compare-A above. A user model for someone whose selections are driven by food quality and food type first, and cost only second, results in a system response such as that shown in Compare-B.</Paragraph>
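    <Paragraph> The ranking and attribute selection steps can be illustrated with an additive weighted utility, a common form of multiattribute model. The paper does not specify the model's details, so the additive form, the normalization of attribute values to the range 0 to 1 (higher is better, so a cheap restaurant has a high cost value), the weight threshold, and all function names are assumptions.

```python
from typing import Dict, List


def utility(attrs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Additive utility: sum of weight times normalized attribute value."""
    return sum(weight * attrs.get(name, 0.0) for name, weight in weights.items())


def rank(restaurants: Dict[str, Dict[str, float]],
         weights: Dict[str, float], top_k: int = 3) -> List[str]:
    """Return the names of the top_k restaurants for this user model."""
    ordered = sorted(restaurants,
                     key=lambda name: utility(restaurants[name], weights),
                     reverse=True)
    return ordered[:top_k]


def important_attributes(weights: Dict[str, float],
                         threshold: float = 0.2) -> List[str]:
    """Attributes worth mentioning: those weighted at or above the threshold."""
    by_weight = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, weight in by_weight if weight >= threshold]
```

Changing only the weights changes both which restaurants are selected and which attributes are mentioned, which is the behaviour contrasted in Compare-A and Compare-B.</Paragraph>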
    <Paragraph position="22"> Compare-B: Among the selected restaurants, the following offer exceptional overall value. Babbo's price is 60 dollars. It has superb food quality. Il Mulino's price is 65 dollars. It has superb food quality. Uguale's price is 33 dollars. It has excellent food. Note that the restaurants selected for the user who is not concerned about cost include two rather more expensive restaurants that are not selected by the text planner for the cost-oriented user.</Paragraph>
    <Paragraph position="23"> Multimodal Logger User studies, multimodal data collection, and debugging were accomplished by instrumenting MATCH agents to send details of user inputs, system processes, and system outputs to a logger agent that maintains an XML log designed for multimodal interactions. Our critical objective was to collect data continually throughout system development, and to be able to do so in mobile settings. While this rendered the common practice of videotaping user interactions impractical, we still required high fidelity records of each multimodal interaction.</Paragraph>
    <Paragraph position="24"> To address this problem, MATCH logs the state of the UI and the user's ink, along with detailed data from other components. These components can in turn dynamically replay the user's speech and ink as they were originally received, and show how the system responded. The browser- and component-based architecture of the Multimodal UI facilitated its reuse in a Log Viewer that reads multimodal log files, replays interactions between the user and system, and allows analysis and annotation of the data. MATCH's logging system is similar in function to STAMP (Oviatt and Clow, 1998), but does not require multimodal interactions to be videotaped and allows rapid reconfiguration for different annotation tasks since it is browser-based. The ability of the system to log data standalone is important, since it enables testing and collection of multimodal data in realistic mobile environments without relying on external equipment.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> Our multimodal logging infrastructure enabled MATCH to undergo continual user trials and evaluation throughout development. Repeated evaluations with small numbers of test users both in the lab and in mobile settings (Figure 10) have guided the design and iterative development of the system.</Paragraph>
    <Paragraph position="1"> Figure 10: Testing MATCH in NYC. This iterative development approach highlighted several important problems early on. For example, while it was originally thought that users would formulate queries and navigation commands primarily by specifying the names of New York neighborhoods, as in show italian restaurants in chelsea, early field test studies in the city revealed that the need for neighborhood names in the grammar was minimal compared to the need for cross-streets and points of interest; hence, cross-streets and a sizable list of landmarks were added. Other early tests revealed the need for easily accessible 'cancel' and 'undo' features that allow users to make quick corrections. We also discovered that speech recognition performance was initially hindered by placement of the 'click-to-speak' button and the recognition feedback box on the bottom-right side of the device, leading many users to speak 'to' this area, rather than toward the microphone on the upper left side. This placement also led left-handed users to block the microphone with their arms when they spoke. Moving the button and the feedback box to the top-left of the device resolved both of these problems.</Paragraph>
    <Paragraph position="2"> After initial open-ended piloting trials, more structured user tests were conducted, for which we developed a set of six scenarios ordered by increasing level of difficulty. These required the test user to solve problems using the system. These scenarios were left as open-ended as possible to elicit natural responses.</Paragraph>
    <Paragraph position="3"> Sample scenario: You have plans to meet your aunt for dinner later this evening at a Thai restaurant on the Upper West Side near her apartment on 95th St. and Broadway. Unfortunately, you forgot what time you're supposed to meet her, and you can't reach her by phone. Use MATCH to find the restaurant and write down the restaurant's telephone number so you can check on the reservation time.</Paragraph>
    <Paragraph position="4"> Test users received a brief tutorial that was intentionally vague and broad in scope so the users might overestimate the system's capabilities and approach problems in new ways. Figure 11 summarizes results from our last scenario-based data collection for a fixed version of the system. There were five subjects (2 male, 3 female) none of whom had been involved in system development. All of these five tests were conducted indoors in offices.</Paragraph>
    <Paragraph position="5">
exchanges      338         ASR word accuracy        59.6%
speech only    171  51%    ASR sent. accuracy       36.1%
multimodal      93  28%    handwritten sent. acc.   64%
pen only        66  19%    task completion rate     85%

There was an average of 12.75 multimodal exchanges (pairs of user input and system response) per scenario. The overall time per scenario varied from 1.5 to 15 minutes. The longer completion times resulted from poor ASR performance for some of the users. Although ASR accuracy was low, overall task completion was high, suggesting that the multimodal aspects of the system helped users to complete tasks.</Paragraph>
    <Paragraph position="6"> Unimodal pen commands were recognized more successfully than spoken commands; however, only 19% of commands were pen only. In ongoing work, we are exploring strategies to increase users' adoption of more robust pen-based and multimodal input.</Paragraph>
    <Paragraph position="7"> MATCH has a very fast system response time.</Paragraph>
    <Paragraph position="8"> Benchmarking a set of speech, pen, and multimodal commands, the average response time is approximately 3 seconds (time from end of user input to system response). We are currently completing a larger scale scenario-based evaluation and an independent evaluation of the functionality of the text planner.</Paragraph>
    <Paragraph position="9"> In addition to MATCH, the same multimodal architecture has been used for two other applications: a multimodal interface to corporate directory information and messaging and a medical application to assist emergency room doctors. The medical prototype is the most recent and demonstrates the utility of the architecture for rapid prototyping. System development took under two days for two people.</Paragraph>
  </Section>
</Paper>