<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3029">
<Title>Multimodal Database Access on Handheld Devices</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Visualization metaphors </SectionTitle>
<Paragraph position="0"> The information from the database is presented on the device using metaphors of real-world objects (cf. conceptual spaces (Gärdenfors, 2000)) so as to provide an intuitive handling of abstract concepts.</Paragraph>
<Paragraph position="1"> The lexicon metaphor, shown on the left in figure 2, presents the items alphabetically ordered in a rotary card file. Each card represents one album and contains detailed background information. The timeline visualization shows the items in chronological order, on a &quot;rubber&quot; band that can be stretched to get a more detailed view. The wheel metaphor presents the items as a list on a conveyor belt, which can be easily and quickly rotated. Finally, the terrain metaphor (see figure 1) visualizes the entire database. The rendering is based on a three-layer type hierarchy, with genre, sub-genre and title layers. Each node of the hierarchy is represented as a circle containing its daughter nodes. Similarities between the items are computed from the genre and mood information in the database and mapped to interaction forces in a physical model that groups similar items together on the terrain. Since albums are usually assigned more than one genre, they can be contained in different circles and therefore be redundantly represented on the terrain. This redundancy is made visible by lines connecting the different instances of the same item.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 The MIAMM prototype </SectionTitle>
<Paragraph position="0"> The MIAMM system uses the standard architecture for dialogue systems, with analysis and generation layers, interaction management and an application interface (see figure 3). To minimize the reaction delay of haptic feedback, the visual-haptic interaction component is decoupled from other, more time-consuming reasoning processes.</Paragraph>
<Paragraph position="1"> The German experimental prototype (French and English versions of the system also exist; the modular architecture facilitates the replacement of the language-dependent modules) incorporates the following components, some of which were reused from other projects (semantic parser and action planning). A speaker-independent, continuous speech recognizer converts the spoken input into a word lattice; it uses a 500-word vocabulary and was trained on an automatically generated corpus. A template-based semantic parser for German (Engel, 2004) interprets this word lattice semantically. The multimodal fusion module maintains the dialogue history and handles anaphoric expressions and quantification. The action planner, an adapted and enhanced version of (Löckelt, 2004), uses non-linear regression planning and the notion of communicative games to trigger and control system actions. The visual-haptic interaction manager selects the appropriate visualization metaphor based on data characteristics and maintains the visualization history. Finally, the domain model provides access to the MySQL database, which contains 7257 records with 85722 songs by 667 artists. Output is realized by speech prompts, both for spoken and for written output.</Paragraph>
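To make the domain model's role concrete, the following is a minimal sketch of a genre lookup against the album database, written in Java, one of the implementation languages mentioned below. The table and column names (albums, title, genre) and the connection parameters are hypothetical assumptions, as the paper does not document the actual MIAMM database schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    /**
     * Minimal sketch of a genre lookup against the album database.
     * Table and column names are hypothetical; the paper does not
     * document the actual MIAMM schema.
     */
    public class DomainModelSketch {

        private final Connection connection;

        public DomainModelSketch(Connection connection) {
            this.connection = connection;
        }

        /** Returns the titles of all albums annotated with the given genre. */
        public List<String> albumsByGenre(String genre) throws Exception {
            String sql = "SELECT title FROM albums WHERE genre = ?";
            try (PreparedStatement stmt = connection.prepareStatement(sql)) {
                stmt.setString(1, genre);
                try (ResultSet rs = stmt.executeQuery()) {
                    List<String> titles = new ArrayList<>();
                    while (rs.next()) {
                        titles.add(rs.getString("title"));
                    }
                    return titles;
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Connection parameters are placeholders only.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/miamm", "user", "password");
            new DomainModelSketch(conn).albumsByGenre("rock")
                    .forEach(System.out::println);
        }
    }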
<Paragraph position="2"> The prototype also includes an MP3 player to play the music and the speech output files. The demonstration system requires a Linux-based PC for the major parts of the modules, written in Java and C++, and a Windows NT computer for visualization and haptics. The integration environment is based on the standard Simple Object Access Protocol (SOAP) for information exchange in a distributed environment.</Paragraph>
<Paragraph position="3"> The communication between the modules uses a declarative, XML-schema based representation language, MMIL. This interface specification accounts for the incremental integration of multimodal data to achieve a full understanding of the multimodal acts within the system. It is therefore flexible enough to handle the various types of information processed and generated by the different modules. It is also independent of any theoretical framework, and extensible, so that further developments can be incorporated. Furthermore, it is compatible with existing standardization initiatives, so that it can be the source of future standardizing activities in the field. Figure 4 shows a sample of MMIL representing the output of the speech interpretation module for the user's utterance &quot;Give me rock&quot;.</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 An example </SectionTitle>
<Paragraph position="0"> To sketch the functionality of the running prototype we use a sample interaction, showing the user's actions, the system's textual feedback on the screen, and the displayed information. Dialogue capabilities of the MIAMM system illustrated in this example include search history (S2), relaxation of queries (S3b), and anaphora resolution (S5). At any moment of the interaction the user can navigate over the visualized items, zoom in and out for details, or change the visualization metaphor.</Paragraph>
<Paragraph position="1"> We show the processing details for the first utterance of the sample interaction, &quot;Give me rock&quot;. The speech recognizer converts the spoken input into a word graph in MPEG-7 format. The semantic parser analyzes this graph and interprets it semantically. The semantic representation consists, in this example, of a speak event and a display event, with two participants: the user and a music object with constraints on its genre (see figure 4).</Paragraph>
<Paragraph position="2"> The multimodal fusion module receives this representation, updates the dialogue context, and passes it on to the action planner, which defines the next goal on the basis of the propositional content of the top event (in the example, event id1) and its object (in the example, participant id3). In this case the user's goal cannot be achieved directly, because the object to display is still unresolved. The action planner therefore has to initiate a database query to acquire the required information. It uses the constraint on the genre of the requested object to produce a database query for the domain model and a feedback request for the visual-haptic interaction module. This feedback message (S1a in the example) is sent to the user while the database query is being processed, thus providing implicit grounding. The domain model sends the result back to the action planner, which inserts the data into a visualization request. The visual-haptic interaction module computes the most suitable visualization for this data set and sends the request to the visualization module to render it.</Paragraph>
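To make this control flow concrete, the sketch below mimics the planning step just described: if the object of the display goal is unresolved, the planner sends a feedback request, queries the domain model with the genre constraint, and wraps the result in a visualization request. All class and method names are illustrative assumptions and do not correspond to the actual MIAMM code.

    /**
     * Hypothetical sketch of the action planner's handling of "Give me rock".
     * Types and methods are illustrative only; they are not taken from the
     * actual MIAMM implementation.
     */
    public class ActionPlannerSketch {

        /** A display goal whose object may still be unresolved. */
        static class DisplayGoal {
            final String genreConstraint;          // e.g. "rock"
            java.util.List<String> resolvedItems;  // filled after the database query

            DisplayGoal(String genreConstraint) {
                this.genreConstraint = genreConstraint;
            }

            boolean isResolved() {
                return resolvedItems != null;
            }
        }

        interface DomainModel {
            java.util.List<String> queryByGenre(String genre);
        }

        interface VisualHapticInteraction {
            void sendFeedback(String message);             // e.g. "Searching for rock..."
            void visualize(java.util.List<String> items);  // triggers metaphor selection
        }

        private final DomainModel domainModel;
        private final VisualHapticInteraction vhi;

        ActionPlannerSketch(DomainModel domainModel, VisualHapticInteraction vhi) {
            this.domainModel = domainModel;
            this.vhi = vhi;
        }

        /** Resolve the goal's object via the database, with feedback as implicit grounding. */
        void handle(DisplayGoal goal) {
            if (!goal.isResolved()) {
                // The feedback request is issued while the database query is running.
                vhi.sendFeedback("Searching for " + goal.genreConstraint + " ...");
                goal.resolvedItems = domainModel.queryByGenre(goal.genreConstraint);
            }
            // Wrap the result in a visualization request; the visual-haptic
            // interaction module picks the most suitable metaphor for the data set.
            vhi.visualize(goal.resolvedItems);
        }
    }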
<Paragraph position="3"> The visual-haptic interaction module also reports the current visualization status back to the multimodal fusion module. This report is used to update the dialogue context, which is needed for reference resolution. The user can now use the haptic buttons to navigate over the search results, select a title to be played, or continue searching.</Paragraph>
</Section>
</Paper>