<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1316">
  <Title>Multimodal Dialog Description Language for Rapid System Development</Title>
  <Section position="4" start_page="109" end_page="112" type="metho">
    <SectionTitle>
2 Specification of MIML
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="109" end_page="111" type="sub_section">
      <SectionTitle>
2.1 Task level markup language
</SectionTitle>
      <Paragraph position="0"> In the development of spoken dialogue systems, we proposed a task classification based on the direction of information flow (Araki et al. 1999). We consider that the same analysis can be applied to agent-based interactive systems (see Table 1).</Paragraph>
      <Paragraph position="1"> In the information assistant class, the agent has information to be presented to the user. Typically, the information contents are Web pages, instructions for consumer product usage, educational content, etc. Sometimes the contents are too long to deliver all of the information to the user. Therefore, this class needs a user model that manages the user's preferences and past interaction records in order to select or filter the contents. In the user agent class, the user has information to be delivered to the agent in order to achieve the user's goal. Typically, this information is a command to control networked home appliances, a travel schedule for reserving a train ticket, etc. The agent mediates between the user and the target application in order to make the user's input appropriate and easy in the client-side process (e.g. checking that a mandatory field is filled, or automatically filling in personal data such as name, address and e-mail).</Paragraph>
      <Paragraph position="2"> In the question and answer class, the user intends to acquire some information from the agent, which can access the Web or a database. First, the user makes a query in natural language, and then the agent responds according to the result of the information retrieval. If too much information is retrieved, the agent starts a narrowing-down subdialogue. If no information matches the user's query, the agent asks the user to reformulate the initial query. If the amount of retrieved information is appropriate for delivery through the current modality, the agent reports the results to the user. The appropriate amount of information depends on the main interaction modality of the target device, such as a small display, a normal graphic display or speech. Therefore, this class needs information about the media capability of the target device.</Paragraph>
      <Paragraph position="3"> 2.1.2 Overview of task markup language As a result of the above investigation, we specify the task level interaction description language shown in Figure 1.</Paragraph>
      <Paragraph position="4"> Figure. 1 Structure of the Task Markup Language. The features of this language are (1) the ability to model each participant in the dialogue (i.e. the user and the agent) and (2) the provision of an execution framework for each class of task.</Paragraph>
      <Paragraph position="5"> The task markup language &lt;taskml&gt; consists of two parts corresponding to the above-mentioned features: the &lt;head&gt; part and the &lt;body&gt; part. The &lt;head&gt; part specifies models of the user (by the &lt;userModel&gt; element) and the agent (by the &lt;deviceModel&gt; element). The content of each model is described in section 2.1.3. The &lt;body&gt; part specifies the class of the interaction task. The content of each task is declaratively specified under the &lt;section&gt;, &lt;xforms&gt; and &lt;qa&gt; elements, which are explained in section 2.1.4.</Paragraph>
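      <Paragraph> As an illustrative sketch of this structure (only agentType and the element names come from the text; all other attribute names and values are assumptions), a task markup document might look as follows:
&lt;taskml&gt;
  &lt;head&gt;
    &lt;userModel&gt; ... &lt;/userModel&gt;           &lt;!-- user model variables --&gt;
    &lt;deviceModel agentType="robot" ... /&gt;   &lt;!-- agent type and main modality --&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;!-- the body contains one of: a sequence of &lt;section&gt; elements (information assistant),
         an &lt;xforms&gt; element (user agent), or a &lt;qa&gt; element (question and answer) --&gt;
    ...
  &lt;/body&gt;
&lt;/taskml&gt;</Paragraph>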
      <Paragraph position="6"> 2.1.3 Head part of task markup language In the &lt;head&gt; element of the task markup language, the developer can specify the user model in the &lt;userModel&gt; element and the agent model in the &lt;deviceModel&gt; element.</Paragraph>
      <Paragraph position="7"> In the &lt;userModel&gt; element, the developer declares variables that represent information about the user, such as expertise in the domain, expertise with dialogue systems, interest level in the contents, etc.</Paragraph>
      <Paragraph position="8"> In the &lt;deviceModel&gt; element, the developer can specify the type of interactive agent and the main modality of interaction. This information is used to generate templates of interaction descriptions from the task description.</Paragraph>
      <Paragraph position="9"> 2.1.4 Body part of task markup language According to the class of the task, the &lt;body&gt; element consists of a sequence of &lt;section&gt; elements, an &lt;xforms&gt; element or a &lt;qa&gt; element. The &lt;section&gt; element represents a piece of information in a task of the information assistant class. The attributes of this element are an id, the start and end time of the presentation material, and declared user model variables indicating whether the section matches the user's needs or knowledge level. The child elements of the &lt;section&gt; element specify a multimodal presentation. These elements are the same set as the child elements of the &lt;output&gt; element in the interaction level description explained in the next subsection. In addition, an &lt;interaction&gt; element, as a child of the &lt;section&gt; element, points to an external agent interaction pattern description. It is used for additional comments generated by the agent about the presented contents. Thanks to this separation of contents and additional comments, the developer can easily adapt the agent's behavior to the user model. The interaction flow of this class is shown in Figure 2.</Paragraph>
      <Paragraph position="10"> The &lt;xforms&gt; element represents a group of information in a task of the user agent class. It specifies a data model, constraints on the values and a submission action, following the notation of XForms 1.0.</Paragraph>
      <Paragraph position="11"> In a task of the user agent class, the role of the interactive agent is to collect information from the user in order to achieve a specific task, such as a hotel reservation. XForms is designed to separate the data structure of the information from its appearance at the user's client, such as a text field, radio buttons or a pull-down menu, because such interface appearances differ across devices even in GUI-based systems. If the developer wants to use multimodal input at the user's client, such a separation of the data structure and the appearance, i.e. how to show the necessary information and how to get the user's input, is very important.</Paragraph>
      <Paragraph position="12"> In MIML, such device-dependent 'appearance' information is defined at the interaction level. Therefore, in the user agent class, the task description only defines the data structure, because the interaction flow of this task class can be limited to typical patterns. For example, in a hotel reservation, if the AP (application) access finds no available room on the requested date, the user's reservation request is rejected. If the system recommends an alternative to the user, the interaction branches to a recommendation subdialogue after the user's first request is processed (see Figure 3). The interaction pattern of each subdialogue is described in the interaction level markup language.</Paragraph>
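      <Paragraph> As a sketch of such a data-structure-only task description for a hotel reservation (the content follows standard XForms 1.0 notation; the concrete field names and URL are invented for illustration):
&lt;xforms&gt;
  &lt;model&gt;
    &lt;instance&gt;
      &lt;reservation&gt;
        &lt;checkin/&gt;
        &lt;nights/&gt;
        &lt;roomType/&gt;
      &lt;/reservation&gt;
    &lt;/instance&gt;
    &lt;bind nodeset="/reservation/checkin" required="true()" type="xsd:date"/&gt;
    &lt;submission id="reserve" action="http://example.com/reserve" method="post"/&gt;
  &lt;/model&gt;
&lt;/xforms&gt;
How these fields are rendered and filled (by speech, touch, etc.) is left entirely to the interaction level.</Paragraph>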
      <Paragraph position="13">  Figure. 3 Interaction flow of User Agent class The &lt;qa&gt; element consists of three children: &lt;query&gt;, &lt;search&gt; and &lt;result&gt;.</Paragraph>
      <Paragraph position="14"> The content of the &lt;query&gt; element is the same as that of the &lt;xforms&gt; element explained above. However, the generated interaction patterns differ between the user agent class and the question and answer class.</Paragraph>
      <Paragraph position="15"> In the user agent class, all the values (except for slots explicitly marked as optional) are expected to be filled. In contrast, in the question and answer class, any subset of the slots defined by the form description can make up a query. Therefore, the first exchange of a question and answer class task is the system's prompt and the user's query input.</Paragraph>
      <Paragraph position="16"> The &lt;search&gt; element represents an application command that uses the variables defined in the &lt;query&gt; element. Such an application command can be a database access command or a SPARQL query. The &lt;result&gt; element specifies which information is to be delivered to the user from the query result. The behavior of the back-end application in this class is not as simple as in the user agent class. If too many results are retrieved, the system moves to a narrowing-down subdialogue. If no results are retrieved, the system moves to a subdialogue that relaxes the user's initial query. If an appropriate number of results (depending on the presentation media) are retrieved, the presentation subdialogue begins. The flow of interaction is shown in Figure 4.</Paragraph>
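      <Paragraph> The three-part structure of the &lt;qa&gt; element can be sketched as follows (the comments are illustrative; the exact content syntax of &lt;search&gt; is not given in this paper):
&lt;qa&gt;
  &lt;query&gt;
    &lt;!-- data model of the query slots, in the same notation as &lt;xforms&gt;;
         any subset of the slots may be filled by the user --&gt;
  &lt;/query&gt;
  &lt;search&gt;
    &lt;!-- application command (e.g. a database access or a SPARQL query)
         that uses the variables defined in &lt;query&gt; --&gt;
  &lt;/search&gt;
  &lt;result&gt;
    &lt;!-- which fields of the retrieved records are delivered to the user --&gt;
  &lt;/result&gt;
&lt;/qa&gt;</Paragraph>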
      <Paragraph position="17"> Figure. 4 Interaction flow of Question and Answer class</Paragraph>
    </Section>
    <Section position="2" start_page="111" end_page="112" type="sub_section">
      <SectionTitle>
2.2 Interaction level markup language
</SectionTitle>
      <Paragraph position="0"> Previously, we proposed a multimodal interaction markup language (Araki et al. 2004) as an extension of VoiceXML. In this paper, we modify the previous proposal to specialize it for human-agent interaction and to realize the interaction patterns defined in the task level markup language. The main extension is the definition of modality-independent elements for input and output. In VoiceXML, the system's audio prompt is defined in the &lt;prompt&gt; element as a child of the &lt;field&gt; element, which defines an atomic interaction that acquires the value of a variable. The user's speech input pattern is defined by the &lt;grammar&gt; element under the &lt;field&gt; element. In MIML, the &lt;grammar&gt; element is replaced by the &lt;input&gt; element, which specifies the active input modalities and their input patterns to be bound to the variable indicated by the name attribute of the &lt;field&gt; element. Likewise, the &lt;prompt&gt; element is replaced by the &lt;output&gt; element, which specifies the active output modalities and a source media file or contents to be presented to the user. In the &lt;output&gt; element, the developer can specify the agent's behavior by using the &lt;agent&gt; element. The outline of this interaction level markup language is shown in Figure 5.</Paragraph>
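      <Paragraph> To make the correspondence concrete, the following hypothetical fragment contrasts a VoiceXML field with its MIML counterpart (the prompt text, the grammar file name and the agent's move value are invented):
&lt;!-- VoiceXML --&gt;
&lt;field name="date"&gt;
  &lt;prompt&gt;Which date would you like?&lt;/prompt&gt;
  &lt;grammar src="date.grxml"/&gt;
&lt;/field&gt;

&lt;!-- MIML interaction level --&gt;
&lt;field name="date"&gt;
  &lt;output&gt;
    &lt;audio&gt;Which date would you like?&lt;/audio&gt;
    &lt;agent move="point"/&gt;
  &lt;/output&gt;
  &lt;input&gt;
    &lt;speech grammar="date.grxml"/&gt;
  &lt;/input&gt;
&lt;/field&gt;</Paragraph>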
      <Paragraph position="1">  The &lt;input&gt; element and the &lt;output&gt; element are designed for implementing various types of interactive agent systems.</Paragraph>
      <Paragraph position="2"> The &lt;input&gt; element specifies the input processing of each modality. For speech input, the grammar attribute of the &lt;speech&gt; element specifies the user's input pattern in SRGS (Speech Recognition Grammar Specification); alternatively, the type attribute specifies a built-in grammar such as boolean, date, digit, etc. For image input, the type attribute of the &lt;image&gt; element specifies a built-in behavior for camera input, such as nod, faceRecognition, etc. For touch input, the value of the variable is given by referring to an external definition of the relation between a displayed object and its value.</Paragraph>
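      <Paragraph> For instance, an &lt;input&gt; element accepting a spoken date, or a date selected by touching a displayed calendar, might look like the sketch below (the &lt;touch&gt; element name and its ref attribute are assumptions, since the paper does not name the touch input element):
&lt;input&gt;
  &lt;speech grammar="date.grxml"/&gt;        &lt;!-- SRGS grammar; alternatively type="date" for the built-in grammar --&gt;
  &lt;touch ref="calendar-regions.xml"/&gt;   &lt;!-- external mapping from displayed objects to values --&gt;
&lt;/input&gt;</Paragraph>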
      <Paragraph position="3"> The &lt;output&gt; element specifies the output control of each modality. Each child element of this element is performed in parallel. If the developer wants to make a sequential output, it should be written in a &lt;smil&gt; element (Synchronized Multimedia Integration Language). For audio output, the &lt;audio&gt; element works in the same way as in VoiceXML, that is, the content of the element is passed to the TTS (text-to-speech) module, and if an audio file is specified by the src attribute, it takes priority as the output. For &lt;video&gt;, &lt;page&gt; (e.g. HTML) and &lt;smil&gt; (for rich multimedia presentation) output, each element specifies the contents file by its src attribute. In the &lt;agent&gt; element, the agent's behavior is defined; the move, emotion and status attributes specify the parameters for each action.</Paragraph>
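      <Paragraph> A hypothetical &lt;output&gt; element combining several modalities, all rendered in parallel, might be (file names, prompt text and attribute values are illustrative):
&lt;output&gt;
  &lt;audio&gt;Here is the weather forecast for tomorrow.&lt;/audio&gt;   &lt;!-- sent to TTS; an src attribute would take priority --&gt;
  &lt;page src="forecast.html"/&gt;                                  &lt;!-- HTML content --&gt;
  &lt;agent move="point" emotion="happy"/&gt;                        &lt;!-- agent behavior with its default realization --&gt;
&lt;/output&gt;</Paragraph>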
    </Section>
    <Section position="3" start_page="112" end_page="112" type="sub_section">
      <SectionTitle>
2.3 Platform level description
</SectionTitle>
      <Paragraph position="0"> The differences between the agent and other input/output devices are absorbed at this level. In the interaction level markup language, the &lt;agent&gt; element specifies the agent's behavior. However, some agents can move in the real world (e.g. a personal robot), some agents can move on a computer screen (e.g. Microsoft Agent), and some cannot move but display a face (e.g. a life-like agent).</Paragraph>
      <Paragraph position="2"> One solution for dealing with such a variety of behaviors is to define many attributes on the &lt;agent&gt; element, for example move, facial expression, gesture, point, etc. However, the drawbacks of this solution are inflexibility with respect to progress in agent technology (if an agent gains a new ability, the language specification has to be changed) and reduced reusability of the interaction description (a description for one agent cannot be applied to another agent).</Paragraph>
      <Paragraph position="3"> Our solution is to use an XML binding mechanism between the interaction level and the platform-dependent level. We assume a default behavior for each value of the move, emotion and status attributes of the &lt;agent&gt; element. If such a default behavior is not sufficient for some purpose, the developer can override the agent's behavior using the binding mechanism and the agent's native control language. As a result, the platform level description is embedded in the binding language described in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="112" end_page="113" type="metho">
    <SectionTitle>
3 Rapid system development
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="112" end_page="113" type="sub_section">
      <SectionTitle>
3.1 Usage of application framework
</SectionTitle>
      <Paragraph position="0"> Each task class has typical execution steps, as investigated in the previous section. Therefore, a system developer has to specify a data model and specific information for each task execution.</Paragraph>
      <Paragraph position="3"> A Web application framework can drive the interactive task using these declarative parameters.</Paragraph>
      <Paragraph position="4"> As an application framework, we use Struts, which is based on the Model-View-Controller (MVC) model. It clearly separates the application logic (model part), the transition of the interaction (controller part) and the user interface (view part). Although the MVC model is popular in GUI-based Web applications, it can also be applied to speech-based applications because any modality-dependent information can be excluded from the view part. Struts provides (1) a controller mechanism and (2) an integration mechanism between the back-end application part and the user interface part. To drive Struts, a developer has to (1) define a data class which stores the user's input and the responding results, (2) write action mapping rules which define the transition pattern of the target interactive system, and (3) write the view part which defines the human-computer interaction patterns. Processing in Struts begins with a request from the user's client (typically in HTML, form data is submitted to the Web server via the HTTP POST method).</Paragraph>
      <Paragraph position="5"> The controller catches the request and stores the submitted data in the data class, and then calls the action class specified by the request, following the definition of the action mapping rules.</Paragraph>
      <Paragraph position="6"> The action class communicates with the back-end application, such as a database management system or external Web servers, by referring to the data class, and returns the status of the processing to the controller. According to this status, the controller consults the action mapping rules and selects the view file which is passed to the user's client. Basically, this view file is written in Java Server Pages, which can be any XML file that includes Java code or useful tag libraries. Using this embedded programming method, the results of the application processing are reflected in the response. The flow of processing in Struts is shown in Figure 6.</Paragraph>
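      <Paragraph> As a rough illustration of what a generated controller configuration could look like (this is ordinary Struts 1 action-mapping syntax; the class names, paths and forward names are invented and do not come from the paper):
&lt;struts-config&gt;
  &lt;form-beans&gt;
    &lt;form-bean name="queryForm" type="app.QueryForm"/&gt;   &lt;!-- data class storing the user's input --&gt;
  &lt;/form-beans&gt;
  &lt;action-mappings&gt;
    &lt;action path="/search" type="app.SearchAction" name="queryForm" scope="request"&gt;
      &lt;forward name="toomany" path="/narrowDown.jsp"/&gt;   &lt;!-- narrowing-down subdialogue --&gt;
      &lt;forward name="none" path="/relaxQuery.jsp"/&gt;      &lt;!-- query relaxation subdialogue --&gt;
      &lt;forward name="ok" path="/present.jsp"/&gt;           &lt;!-- presentation subdialogue --&gt;
    &lt;/action&gt;
  &lt;/action-mappings&gt;
&lt;/struts-config&gt;</Paragraph>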
      <Paragraph position="7"> Figure. 6 MVC model.</Paragraph>
      <Paragraph position="8"> The first step of rapid development is to prepare the back-end application (typically using a database management system) and its application logic code. The action mapping file and the data class file are created automatically from the task level description, as described in the next subsection.</Paragraph>
    </Section>
    <Section position="2" start_page="113" end_page="113" type="sub_section">
      <SectionTitle>
3.2 Task definition
</SectionTitle>
      <Paragraph position="0"> Figure 7 shows an example task definition for an information assistant task. In this task setting, video contents divided into sections are presented to the user one by one. At the end of a section, a robot agent puts in a word in order to help the user's understanding and to measure the user's preference (e.g. by recognizing acknowledgement, nodding, etc.). If low user preference is observed, unimportant parts of the presentation are skipped and the robot's comments are adjusted to a beginner's level. The importance of a section is indicated by the interestLevel attribute and the knowledgeLevel attribute, which are introduced in the &lt;userModel&gt; element.</Paragraph>
      <Paragraph position="1"> If one of the values of these attributes is below the current value in the user model, the relevant section is skipped. The skipping mechanism using the user model variables is automatically inserted into the interaction level description.</Paragraph>
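      <Paragraph> Figure 7 is not reproduced here; purely as an illustration of how such a task definition could look (only the element names and the interestLevel/knowledgeLevel attributes come from the text; the begin/end attribute names and all values are invented):
&lt;taskml&gt;
  &lt;head&gt;
    &lt;userModel interestLevel="2" knowledgeLevel="1"/&gt;
    &lt;deviceModel agentType="robot"/&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;section id="s1" begin="00:00" end="01:45" interestLevel="1" knowledgeLevel="1"&gt;
      &lt;video src="part1.mpg"/&gt;
      &lt;interaction src="comment1.xml"/&gt;   &lt;!-- agent's comment on this section --&gt;
    &lt;/section&gt;
    &lt;section id="s2" begin="01:45" end="03:20" interestLevel="3" knowledgeLevel="2"&gt;
      &lt;video src="part2.mpg"/&gt;
      &lt;interaction src="comment2.xml"/&gt;
    &lt;/section&gt;   &lt;!-- skipped when its levels are below the current user model values --&gt;
  &lt;/body&gt;
&lt;/taskml&gt;</Paragraph>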
    </Section>
    <Section position="3" start_page="113" end_page="113" type="sub_section">
      <SectionTitle>
3.3 Describing Interaction
</SectionTitle>
      <Paragraph position="0"> The connection between the task level and the interaction level is realized by generating interaction description templates from the task level description. The interaction level description corresponds to the view part of the MVC model on which the task level description is based. From this point of view, the task level language gives higher-level parameters over the MVC framework, which restricts the behavior of the model to typical interactive application patterns. Therefore, from this pattern information, skeletons of the view part for each typical pattern can be generated, based on the device model information in the task markup language.</Paragraph>
      <Paragraph position="1"> For example, from the task level description shown in Figure 7, a data class is generated from the &lt;userModel&gt; element by mapping the fields of the class to the user model variables, and an action mapping rule set is generated using the sequence information of the &lt;section&gt; elements. Branching is realized by calling application logic which compares the attribute values of the &lt;section&gt; elements with the user model data class. Following the action mapping rules, an interaction level description is generated for each &lt;section&gt; element. In the information assistant class, a &lt;section&gt; element corresponds to two interaction level descriptions: one presents the contents by transforming the &lt;video&gt; element into &lt;output&gt; elements, and the other interacts with the user, as shown in Figure 8.</Paragraph>
      <Paragraph position="2"> The latter file is merely a skeleton. Therefore, the developer has to fill in the system's prompts, specify the user's input and add the corresponding actions. Figure 8 describes an interaction as follows: at the end of a segment, the agent asks the user whether the contents are interesting or not. The user can reply by speech or by a nodding gesture. If the user's response is affirmative, the global variable of the interest level in the user model is incremented.</Paragraph>
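      <Paragraph> Figure 8 itself is not reproduced; a sketch of such an interaction description, reusing the elements of section 2.2 (the &lt;filled&gt;, &lt;if&gt; and &lt;assign&gt; elements are assumed to be inherited from VoiceXML, and the prompt wording and variable names are invented), could be:
&lt;field name="interest"&gt;
  &lt;output&gt;
    &lt;audio&gt;Was this part interesting?&lt;/audio&gt;
    &lt;agent move="lookAtUser"/&gt;
  &lt;/output&gt;
  &lt;input&gt;
    &lt;speech type="boolean"/&gt;   &lt;!-- spoken yes/no --&gt;
    &lt;image type="nod"/&gt;        &lt;!-- nodding gesture via camera --&gt;
  &lt;/input&gt;
  &lt;filled&gt;
    &lt;if cond="interest"&gt;
      &lt;assign name="interestLevel" expr="interestLevel + 1"/&gt;   &lt;!-- update the user model --&gt;
    &lt;/if&gt;
  &lt;/filled&gt;
&lt;/field&gt;</Paragraph>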
      <Paragraph position="3"> The connection between the interaction level and the platform level is realized by the binding mechanism of XML. XBL (XML Binding Language, http://www.w3.org/TR/xbl/) was originally defined for smart user interface description, was later extended for SVG, and then for general XML languages. The concept of binding in XBL is tree extension, inheriting the values of attributes into the sub-tree (see Figure 9). As a result of this mechanism, the base language, in this case the interaction markup language, can keep its simplicity without losing flexibility.</Paragraph>
      <Paragraph position="4"> Figure. 9 Concept of XML binding.</Paragraph>
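      <Paragraph> As a conceptual sketch only (the selector and the robot command are placeholders, not the exact binding used in this work), a binding could override the default nod behavior of the &lt;agent&gt; element with a native robot command:
&lt;binding element="agent"&gt;
  &lt;template&gt;
    &lt;!-- platform-dependent realization attached to the bound &lt;agent&gt; element --&gt;
    &lt;robot:command motion="head_nod" speed="slow"/&gt;
  &lt;/template&gt;
&lt;/binding&gt;</Paragraph>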
      <Paragraph position="5"> By using this mechanism, we implemented various versions of a weather information system, such as with Microsoft Agent (Figure 10), Galatea (Figure 11) and a personal robot. The platform change is made only by modifying the agentType attribute of the &lt;deviceModel&gt; element of taskML.</Paragraph>
      <Paragraph position="6"> Figure. 10 Interaction with Microsoft agent.</Paragraph>
      <Paragraph position="7"> Figure. 11 Interaction with Galatea.</Paragraph>
    </Section>
  </Section>
</Paper>