<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3004"> <Title>Virtual Modality: a Framework for Testing and Building Multimodal Applications</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Virtual Modality </SectionTitle> <Paragraph position="0"> This section explains the underlying concept of Virtual Modality, as well as the motivation for the work presented here.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Motivation </SectionTitle> <Paragraph position="0"> Multiple modalities and multimedia are an essential part of our daily lives. Human-human communication relies on a full range of input and output modalities, for instance, speech, gesture, vision, gaze, and paralinguistic, emotional and sensory information. In order to conduct seamless communication between humans and machines, as many such modalities as possible need to be considered.</Paragraph> <Paragraph position="1"> Intelligent devices, wearable terminals, and mobile handsets will accept multiple inputs from different modalities (e.g., voice, text, handwriting), they will render various media from various sources, and in an intelligent manner they will also be capable of providing additional, contextual information about the environment, the interaction history, or even the actual state of the user. Information such as the emotional, affective state of the user, the proximity of physical entities, the dialogue history, and biometric data from the user could be used to facilitate a more accommodating, and concise interaction with a system. Once contextual information is fully utilized and multimodal input is supported, then the load on the user can be considerably reduced. This is especially important for users with severe disabilities.</Paragraph> <Paragraph position="2"> Implementing complex multimodal applications represents the chicken-and-egg problem. A significant amount of data is required in order to build and tune a system; on the other hand, without an operational system, no real data can be collected. Incremental and rule-based implementations, as well as quick mock-ups and Wizard-of-Oz setups (Lemmela and Boda, 2002), aim to address the application development process from both ends; we follow an intermediate approach.</Paragraph> <Paragraph position="3"> The work presented here is performed under the assumption that testing and building multimodal systems can benefit from a vast amount of multimodal data, even if the data is only a result of simulation. Furthermore, generating simulated multimodal data from textual data is justified by the fact that a multimodal system should also operate in speech-only mode.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The concept </SectionTitle> <Paragraph position="0"> Most current multimodal systems are developed with particular input modalities in mind. In the majority of cases, the primary modality is speech and the additional modality is typically gesture, gaze, sketch, or any combination thereof. Once the actual usage scenario is fixed in terms of the available input modalities, subsequent work focuses only on these input channels. This is advantageous on the one hand; however, on the other hand, there is a good chance that system development will focus on tiny details related to the modalitydependent nature of the recognizers and their particular interaction in the given domain and application scenario. 
Virtual Modality represents an abstraction in this sense. The focus is on what semantic units (i.e., meaningful information from the application point of view) are delivered in this channel and how this channel aligns with the speech channel. Note that the speech channel has no exceptional role; it is in every sense equal to the Virtual Modality. There is only one specific consideration regarding the speech channel, namely, that it conveys the deictic references that establish connections with the semantic units delivered by the Virtual Modality channel.</Paragraph> <Paragraph position="1"> The abstraction provided by the Virtual Modality enables the developer to focus on the interrelation of the speech and the additional modalities, in terms of their temporal correlation, in order to study and experiment with various usage scenarios and usability issues. It also means that we do not care how the information delivered by the Virtual Modality arose, what (actual, physical) recognition process produced it, or how the recognition processes could influence each other's performance via cross-interaction using early evidence available in one channel or the other. Although we acknowledge that this aspect is important and desirable, as pointed out by Coen (2001) and Haikonen (2003), it has not yet been addressed in the first implementation of the model.</Paragraph> <Paragraph position="2"> The term &quot;virtual modality&quot; is not used in the multimodal research community, as far as we know. The only occurrence we found is in Marsic and Dorohonceanu (2003); however, with &quot;virtual modality system&quot; they refer to a multimodal management module that manages and controls various applications sharing common modalities in the context of telecollaboration user interfaces.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Operation </SectionTitle> <Paragraph position="0"> The idea behind Virtual Modality is explained with the help of Figure 1. The upper portion describes how the output of a speech recognizer (or direct natural language input from the keyboard), a sequence of words {w1, ..., wN}, is transformed into a corresponding sequence of concepts {C1, ..., CM}.</Paragraph> <Paragraph position="1"> The module responsible for this operation, for the sake of simplicity and generality, is called a classifier (CL). In real-life implementations, this module can be a sophisticated natural language understanding (NLU) unit, a simple semantic grammar, or a hybrid of several approaches.</Paragraph> <Paragraph position="2"> The middle part of Figure 1 exhibits how the Virtual Modality is plugged into the classifier. The Virtual Modality channel (VM) is parallel to the speech channel (Sp) and it delivers certain semantic units to the classifier. These semantic units correspond to a portion of the word sequence in the speech channel. For instance, the original incoming sentence might be &quot;From Harvard University to MIT.&quot; In the case of a multimodal setup, the semantic unit originally represented by a set of words in the speech channel (e.g., &quot;to MIT&quot;) will be delivered by the Virtual Modality as mi.</Paragraph> <Paragraph position="3"> The tie between the two modality channels is a deictic reference in the speech channel (di in the bottom of Figure 1).
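To make the two parallel channels and the role of the deictic reference di concrete, the following is a minimal sketch in Python; the names (VMEvent, integrate, the deictic word list) are hypothetical illustrations and this is not the actual classifier CL. It binds each deictic word in the speech channel to the nearest unused Virtual Modality event whose time marker lies within a small word-distance window, producing a single concept sequence:

```python
# Illustrative sketch only: a toy stand-in for the classifier/integrator CL.
from dataclasses import dataclass
from typing import List

DEICTIC_WORDS = {"here", "there", "this", "that"}   # assumed deictic triggers

@dataclass
class VMEvent:
    semantic_unit: str   # e.g. "MIT"
    time_marker: int     # position in the utterance, measured in words

def integrate(speech_words: List[str], vm_events: List[VMEvent],
              window: int = 2) -> List[str]:
    """Fuse the speech and Virtual Modality channels into one concept sequence."""
    concepts: List[str] = []
    used = [False] * len(vm_events)
    for i, word in enumerate(speech_words):
        if word not in DEICTIC_WORDS:
            concepts.append(word)
            continue
        # resolve the deictic reference di against the closest unused event mi
        candidates = [(abs(ev.time_marker - i), j)
                      for j, ev in enumerate(vm_events)
                      if not used[j] and abs(ev.time_marker - i) <= window]
        if candidates:
            _, j = min(candidates)
            used[j] = True
            concepts.append(vm_events[j].semantic_unit)
        else:
            concepts.append("UNRESOLVED(" + word + ")")
    return concepts

# "From Harvard University to MIT" spoken as "from here to there",
# with both location phrases delivered over the Virtual Modality channel:
print(integrate(["from", "here", "to", "there"],
                [VMEvent("Harvard University", 1), VMEvent("MIT", 3)]))
# -> ['from', 'Harvard University', 'to', 'MIT']
```

In a real system CL would be a full NLU component; the sketch is only meant to show how the time markers tie each mi to a di.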
There are various realizations of a deictic reference; in this example it could be &quot;here&quot;, &quot;over there&quot;, or &quot;this university&quot;. Nevertheless, for all of these input combinations (i.e., speech only, or speech plus some other modality), the requirement is that the classifier produce the very same sequence of semantic concepts.</Paragraph> <Paragraph position="4"> There is one more criterion: there must be a temporal correspondence between the input channels. The deictic reference can only be resolved if the input delivered by the other modality channel arrives within a certain time frame. This is indicated tentatively in the figure as mi occurring either in synchrony with, prior to, or following the deictic reference in time (see Section 4.3).</Paragraph> <Paragraph position="6"> (Figure 1: Sp and VM denote the Speech and the Virtual Modality channels, respectively. CL is a classifier and integrator that transforms and fuses a sequence of words, wi, and Virtual Modality inputs, mi, into a corresponding sequence of concepts, Ck.)</Paragraph> <Paragraph position="7"> In the model described above, the speech channel always contains a deictic replacement when a semantic unit is moved to the Virtual Modality channel, although Oviatt, DeAngeli and Kuhn (1997) reported that, in a given application domain, users did not use spoken deictic references in more than half of the multimodal input cases. Therefore, to accommodate this finding, we keep in mind that di can have a void value as well.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 The use of Virtual Modality </SectionTitle> <Paragraph position="0"> The framework described above enables two steps in the development of multimodal systems. First, with the introduction of the Virtual Modality, modules designed to resolve inputs from multimodal scenarios can be tested.</Paragraph> <Paragraph position="1"> Quite often, these inputs are ambiguous on their own, and the combination of two or more input channels is needed to resolve them.</Paragraph> <Paragraph position="2"> Second, by moving pre-defined semantic units to the Virtual Modality channel, a multimodal database can be generated from speech-only data. For instance, in a given application domain, all location references can be moved to the Virtual Modality channel and replaced by randomly chosen deictic references. Furthermore, the temporal relation between the deictic reference and the corresponding semantic unit in the Virtual Modality can be governed by external parameters. This method facilitates the generation of a large amount of &quot;multimodal&quot; data from only a limited amount of textual data. This new database can then be used for the first task, as described above, and, equally importantly, it can be used to train statistically motivated multimodal integrator/fusion modules.</Paragraph> <Paragraph position="3"> As pointed out by Oviatt et al. (2003), predictive and adaptive integration of multimodal input is necessary in order to provide robust performance for multimodal systems.
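As a toy illustration of how such a database could be derived from speech-only transcriptions (the function to_virtual_modality and its details are hypothetical; this is not the tool actually used in Section 4), the sketch below removes a location phrase from a transcribed sentence, records its original position as a time marker on the Virtual Modality channel, and substitutes a randomly chosen deictic reference:

```python
# Illustrative sketch only: turning one speech-only sentence into simulated
# multimodal data by moving a location phrase to the Virtual Modality channel.
import random

def to_virtual_modality(words, span, deictics=("here", "there", "over there")):
    """Move words[span[0]:span[1]] to the VM channel and put a deictic back."""
    start, end = span
    speech = words[:start] + [random.choice(deictics)] + words[end:]
    vm_event = {"semantic_unit": " ".join(words[start:end]),
                "time_marker": start}          # where the phrase originally stood
    return speech, vm_event

words = "from harvard university to mit".split()
speech, vm_event = to_virtual_modality(words, (1, 3))      # "harvard university"
print(" ".join(speech), vm_event)
# e.g. from there to mit {'semantic_unit': 'harvard university', 'time_marker': 1}
```

By construction, a classifier that resolves the substituted deictic reference against the Virtual Modality event should yield the same interpretation as the original sentence, and shifting the time marker simulates earlier or later delivery of the unit.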
Availability of data, even if it is generated artificially, can and will help in the development process.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Further considerations </SectionTitle> <Paragraph position="0"> The primary goal of an interactive system is the full understanding of the user's intention in the given context of an application. This can only be attained by processing all active inputs from the user, that is, by recognizing and interpreting them accurately. Additionally, by considering all passively and implicitly available information (e.g., location, sensory data, dialogue history, user preferences, pragmatics), the system can achieve an even fuller understanding of the user's intention.</Paragraph> <Paragraph position="1"> The Virtual Modality can be used to simulate the delivery of all the previously described information. From a semantic interpretation point of view, an implicitly available piece of information, e.g., the physical location of the user (detectable by a mobile device, for instance), is equivalent to an active user input generated in a given modality channel. The only difference might be the temporal availability of the data: location information derived from a mobile device is continuously available over a longer period of time, while a user gesture over a map specifying, for example, the value of a &quot;from here&quot; deictic reference is present only for a relatively short time.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 System Architecture </SectionTitle> <Paragraph position="0"> Researchers in the Spoken Language Systems group at MIT have been developing human-computer dialogue systems for nearly two decades. These systems are implemented within the Galaxy Communicator architecture, which is a multimodal conversational system framework (Seneff et al., 1998). As shown in Figure 2, a Galaxy system is configured around a central programmable hub, which handles the communication among various human language technology servers, including those that handle speech recognition and synthesis, language understanding and generation, context resolution, and dialogue management.</Paragraph> <Paragraph position="1"> Several Galaxy domains are currently under development at MIT (Zue et al., 1994; Seneff et al., 2000; Zue et al., 2000; Seneff, 2002), but the research effort presented here concerns only Voyager, the traffic and city guide domain (Glass et al., 1995; Wang, 2003), although the Virtual Modality concept is applicable to other domains as well. Voyager's map-based interface provides opportune conditions for the use of multimodal input and deictic references. For example, a typical user input may be &quot;How do I get from here to there?&quot;, spoken while the user clicks on two different locations on a graphical map.</Paragraph> <Paragraph position="2"> After the utterance has been recognized and parsed, the semantic frame representation of the utterance is sent to the Context Resolution (CR) server (Filisko and Seneff, 2003). It is the CR server's duty to interpret the user's utterance in the context of the dialogue history, the user's physical environment, and limited world knowledge, via a resolution algorithm.
This procedure includes a step to resolve any deictic references the user has made.</Paragraph> <Paragraph position="3"> In addition to the user's utterance and dialogue history, all gestures for the current turn are sent to the CR server. All of this contextual information can then be utilized to make the most appropriate resolution of the deictic references. The context-resolved semantic frame is finally sent to the dialogue manager, where an appropriate reply to the user is formulated.</Paragraph> <Paragraph position="4"> The simulation of such an interaction cycle has been facilitated by the use of a Batchmode server, developed by Polifroni and Seneff (2000). The server receives an input (e.g., the text representation of the spoken utterance) from a file of logged or pre-formatted data. After the input has been processed, the next input is obtained from the input file, and the cycle continues (more details in Section 5).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Data Generation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Application domain </SectionTitle> <Paragraph position="0"> The original data used for generating simulated multimodal inputs are taken from the log files of the Voyager application. The Voyager application provides information about city landmarks (e.g., universities, museums, sports arenas, subway stops) and gives navigation guidance and up-to-date traffic information over the phone and via a graphical interface. Geographically, it covers the area of Boston and Cambridge in Massachusetts. Users can phrase their queries in natural language, and the dialogue management takes care of user-friendly disambiguation, error recovery, and history handling. A typical dialogue between Voyager (V) and a user (U) is given below:
U: Can you show me the universities in Boston?
V: Here is a map and list of universities in Boston...</Paragraph> <Paragraph position="1">
U: What about Cambridge?
V: Here is a map and list of universities in Cambridge...</Paragraph> <Paragraph position="2">
U: How do I get there &lt;click Harvard&gt; from here &lt;click MIT&gt;?
V: Here are directions to Harvard University from MIT...</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Defining a user population </SectionTitle> <Paragraph position="0"> As mentioned earlier, the data to be generated can be used both for testing and for system development. In both scenarios, real dialogues should be simulated as closely as possible. Therefore, a virtual user population was defined for each experiment.</Paragraph> <Paragraph position="1"> First, the distribution of the various user types was defined. A user type is specified in terms of the delay the user exhibits in delivering the Virtual Modality input, relative to the speech channel. The following six user types were defined: outspoken, precise, too-fast, quickie, slowhand, and everlate. Outspoken is an imaginary user who never uses the Virtual Modality and communicates with the system using only the speech modality. Precise always issues the Virtual Modality input in synchrony with the spoken deictic reference.</Paragraph> <Paragraph position="2"> Too-fast always issues the Virtual Modality input significantly earlier than the corresponding deictic reference in the speech channel, while Quickie issues the Virtual Modality input only slightly earlier than the deictic reference.
Similar rules apply to Slowhand and Everlate, except that they issue the Virtual Modality input slightly later or much later, respectively, than the deictic reference.</Paragraph> <Paragraph position="3"> Once the composition of the user population has been determined, the corresponding temporal deviations must be specified. In a real system, the exact time instants are typically given as elapsed time from a reference point specified by a universal time value (with the different devices synchronized using the Network Time Protocol). However, such accuracy is not necessary for these experiments. Rather, a simplified measure is introduced in order to describe intuitively how the Virtual Modality input deviates from the instant when the corresponding deictic reference was issued. The unit used here is word distance, more precisely the average length of a word (i.e., how many words lie between the deictic reference and the input in the Virtual Modality channel). A value of 0 means that the deictic reference and the Virtual Modality event are in synchrony, while a -1 (+1) means that the Virtual Modality input was issued one word earlier (later) than the corresponding deictic reference.</Paragraph> <Paragraph position="4"> Using this formalism, the deviation pattern for the five user types given in Table 1 is defined as a starting point for the experiments.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Generation of the multimodal data </SectionTitle> <Paragraph position="0"> Generating multimodal data is, in a sense, the reverse of the multimodal integration step. Since it is known how the deictic references are realized in a given domain, generating sentences with deictic references, once the actual definite phrases have been found, seems straightforward.</Paragraph> <Paragraph position="1"> The idea is simple: find all instances of a given type of semantic unit (e.g., location references) in the input sentences, move them to the Virtual Modality channel with timing information and, as a replacement, put sensible deictic references back into the original sentences. The implementation, however, reveals several problems. First, identification of the location references is not necessarily an easy task. It may require a complex parsing or keyword-spotting algorithm, depending on the application in question. In our case, the log files include the output of the TINA Natural Language Understanding module, meaning that all semantically relevant units present in an input sentence are marked explicitly in the output parse frame (Seneff, 1992).</Paragraph> <Paragraph position="2"> Figure 3 gives an example of the parse frame.</Paragraph> <Paragraph position="3"> (Figure 3: parse frame for the input sentence &quot;give me directions from harvard to mit&quot;.)</Paragraph> <Paragraph position="4"> The second step, the movement of the phrases and the placement of time markers, poses no problem.</Paragraph> <Paragraph position="5"> The third step, namely the replacement of the removed semantic units with sensible deictic references, requires some manipulation. Performing the replacement using only deictic references such as &quot;here&quot;, &quot;over here&quot;, &quot;there&quot;, and &quot;over there&quot; would result in a rather biased data set. Instead, depending on the topic of the location reference (e.g., city, road, university), definite noun phrases like &quot;this city&quot; and &quot;that university&quot; were also used.
Eventually, a look-up table was defined which included the above general expressions, as well as patterns such as &quot;this $&quot; and &quot;that $&quot;, in which the variable part (i.e., $) was replaced with the actual topic. The replacement for each sentence was chosen randomly, resulting in good coverage of the various deictic references across the input sentences. For the example depicted in Figure 3, the following sentence is generated: &quot;give me directions from there to this university&quot;. The following is a summary of the overall multimodal data generation process (a code sketch of the procedure is given after Section 4.4):
1. Define the distribution of the user population (e.g., outspoken 20%, precise 40%, quickie 20%, slowhand 15%, everlate 5%);
2. Define the corresponding deviations (see Table 1);
3. Randomly allocate turns (sentences) to the pre-defined user types (e.g., 40% of all data will go to the precise user type, with deviation 0);
4. Identify all location references in the input sentence based on the parse frame;
5. Remove all or a pre-defined quantity of the location expressions from the original sentence and replace them with deictic markers;
6. Place the removed location phrases into the Virtual Modality channel;
7. Place time markers in the Virtual Modality channel referring to the original positions of the location phrases in the input sentence;
8. Apply the pre-determined time shift, if needed, in the Virtual Modality channel;
9. Randomly select an acceptable deictic reference and insert it into the original sentence in place of the deictic marker;
10. Repeat steps 4-9 until all the data has been processed.</Paragraph> <Paragraph position="6"> An example of the generated Virtual Modality data and the corresponding sentence is shown in Figure 4.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Statistics </SectionTitle> <Paragraph position="0"> (Table: statistics of the original data; turn = sentence.)</Paragraph> <Paragraph position="1"> Although the above table covers the original data, the newly generated Virtual Modality database has the same characteristics, since the location references there become deictic references.</Paragraph> </Section> </Section> </Paper>
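The following is a minimal sketch of the generation procedure summarized in Section 4.3, under stated assumptions: the names (USER_TYPES, generate_turn, etc.) are hypothetical, the user-type probabilities follow the example distribution in step 1, and the word-distance deviations are illustrative placeholders rather than the values of Table 1.

```python
# Illustrative sketch of the data generation loop (hypothetical names; the
# deviation values below are placeholders, not the actual Table 1 values).
import random

# steps 1-2: user population with word-distance deviations (None = speech only)
USER_TYPES = {
    "outspoken": (0.20, None),
    "precise":   (0.40, 0),
    "quickie":   (0.20, -1),
    "slowhand":  (0.15, +1),
    "everlate":  (0.05, +3),
}

GENERIC_DEICTICS = ["here", "over here", "there", "over there"]

def pick_deictic(topic=None):
    """Step 9: choose a generic deictic or a topic-based one ("this $"/"that $")."""
    if topic and random.random() < 0.5:
        return random.choice(["this", "that"]) + " " + topic
    return random.choice(GENERIC_DEICTICS)

def generate_turn(words, locations):
    """locations: (start, end, topic) spans of location references (step 4)."""
    # step 3: allocate the turn to a user type
    names = list(USER_TYPES)
    user = random.choices(names, weights=[USER_TYPES[n][0] for n in names])[0]
    deviation = USER_TYPES[user][1]
    if deviation is None:                       # outspoken: speech-only turn
        return {"user": user, "speech": words, "vm": []}
    speech, vm, last = [], [], 0
    for start, end, topic in locations:
        speech += words[last:start]
        speech.append(pick_deictic(topic))      # steps 5 and 9
        # steps 6-8: move the phrase to the VM channel, place and shift its time marker
        vm.append({"semantic_unit": " ".join(words[start:end]),
                   "time_marker": len(speech) - 1 + deviation})
        last = end
    speech += words[last:]
    return {"user": user, "speech": speech, "vm": vm}

# step 10: loop over all logged sentences; here, only the example from Figure 3
sentence = "give me directions from harvard to mit".split()
print(generate_turn(sentence, [(4, 5, "university"), (6, 7, "university")]))
```

Shifting time_marker by the word-distance deviation reproduces the too-fast, quickie, slowhand, and everlate behaviour described in Section 4.2.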