<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1624">
  <Title>An Experiment Setup for Collecting Data for Adaptive Output Planning in a Multimodal Dialogue System</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In the larger context of the TALK project1 we are developing a multimodal dialogue system for a Music Player application for in-car and in-home use, which should support natural, flexible interaction and collaborative behavior. The system functionalities include playback control, manipulation of playlists, and searching a large MP3 database. We believe that in order to achieve this goal, the system needs to provide advanced adaptive multimodal output.</Paragraph>
    <Paragraph position="1"> We are conducting Wizard-of-Oz experiments [Bernsen et al., 1998] in order to guide the development of our system. On the one hand, the experiments should give us data on how the potential users interact with such an application. But we also need data on the multimodal interaction strategies that the system should employ to achieve the desired naturalness, flexibility and collaboration.</Paragraph>
    <Paragraph position="2"> We therefore need a setup where the wizard has freedom of 1TALK (Talk and Look: Tools for Ambient Linguistic Knowledge; www.talk-project.org) is funded by the EU as project No. IST-507802 within the 6th Framework program.</Paragraph>
    <Paragraph position="3"> choice w.r.t. their response and its realization through single or multiple modalities. This makes it different from previous multimodal experiments, e.g., in the SmartKom project [T&amp;quot;urk, 2001], where the wizard(s) followed a strict script. But what we need is also different in several aspects from taking recordings of straight human-human interactions: the wizard does not hear the user's input directly, but only gets a transcription, parts of which are sometimes randomly deleted (in order to approximate imperfect speech recognition); the user does not hear the wizard's spoken output directly either, as the latter is transcribed and re-synthesized (to produce system-like sounding output). The interactions should thus more realistically approximate an interaction with a system, and thereby contain similar phenomena (cf.</Paragraph>
    <Paragraph position="4"> [Duran et al., 2001]).</Paragraph>
    <Paragraph position="5"> The wizard should be able to present different screen outputs in different context, depending on the search results and other aspects. However, the wizard cannot design screens on the fly, because that would take too long. Therefore, we developed a setup which includes modules that support the wizard by providing automatically calculated screen output options the wizard can select from if s/he want to present some screen output.</Paragraph>
    <Paragraph position="6"> Outline In this paper we describe our experiment setup and the first experiences with it. In Section 2 we overview the research goals that our setup was designed to address. The actual setup is presented in detail in Section 3. In Section 4 we describe the collected data, and we summarize the lessons we learnt on the basis of interviewing the experiment participants. We briefly discuss possible improvements of the setup and our future plans with the data in Section 5.</Paragraph>
    <Paragraph position="7"> 2 Goals of the Multimodal Experiment Our aim was to gather interactions where the wizard can combine spoken and visual feedback, namely, displaying (complete or partial) results of a database search, and the user can speak or select on the screen.</Paragraph>
    <Paragraph position="8"> Multimodal Presentation Strategies The main aim was to identify strategies for the screen output, and for the multi-modal output presentation. In particular, we want to learn  an in-car music player application, using the Lane Change driving simulator. Top right: User, Top left: Wizard, Bottom: transcribers.</Paragraph>
    <Paragraph position="9"> when and what content is presented (i) verbally, (ii) graphically or (iii) by some combination of both modes. We expect that when both modalities are used, they do not convey the same content or use the same level of granularity. These are important questions for multimodal fission and for turn planning in each modality.</Paragraph>
    <Paragraph position="10"> We also plan to investigate how the presentation strategies influence the responses of the user, in particular w.r.t. what further criteria the user specifies, and how she conveys them. Multimodal Clarification Strategies The experiments should also serve to identify potential strategies for multi-modal clarification behavior and investigate individual strategy performance. The wizards' behavior will give us an initial model how to react when faced with several sources of interpretation uncertainty. In particular we are interested in what medium the wizard chooses for the clarification request, what kind of grounding level he addresses, and what &amp;quot;severity&amp;quot; he indicates. 2 In order to invoke clarification behavior we introduced uncertainties on several levels, for example, multiple matches in the database, lexical ambiguities (e.g., titles that can be interpreted denoting a song or an album), and errors on the acoustic level. To simulate non-understanding on the acoustic level we corrupted some of the user utterances by randomly deleting parts of them.</Paragraph>
  </Section>
class="xml-element"></Paper>