<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4012"> <Title>UI on the Fly: Generating a Multimodal User Interface</Title> <Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Formalism </SectionTitle> <Paragraph position="0"> In this section, we will explain how the Multimodal Functional Unification Grammar (MUG) allows us to generate content. Our formalism and the associated evaluation algorithm work closely with a dialogue manager. As input, they receive an unambiguous, language- and modeindependent representation of the next dialogue turn.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Dialogue acts as input </SectionTitle> <Paragraph position="0"> Although the semantic input is independent of mode (screen, voice) and language (Portuguese), the input semantics are domain-specific. The representation uses the following types of dialogue acts at the top level: ask for missing information, ask for a confirmation of an action or data, inform the user about the state of objects, or give context-dependent help.</Paragraph> <Paragraph position="1"> An example is shown in Figure 1. The input-FD specifies type of act in progress (askconfirmation), and the details of the interaction type. It then specifies the details of the current action, in this case, the email that the user is sending.</Paragraph> <Paragraph position="2"> Furthermore, the dialogue manager may indicate the need to realize a certain portion of an utterance with an attribute realize. The input format integrates with principled, object-oriented dialogue managers.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 The domain: a personal assistant. </SectionTitle> <Paragraph position="0"> In this example, we have constructed a personal assistant to be used in the domain of sending email messages.</Paragraph> <Paragraph position="1"> We implemented a MUG for a PDA-size handheld device with a color touch-screen (see Figure 2a). The initial steps to adapt it to a mobile phone (Figure 2b) involved creating a device profile that uses no GUI widgets and associates a higher cost (see Section 5) with the screen or No?&quot;. b) Voice: &quot;Send the email regarding Aussie weather to Fred Cummins now?&quot; output, as the screen is smaller. All devices used have server-driven TTS output capabilities.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 The grammar </SectionTitle> <Paragraph position="0"> MUG is a collection of components. Each of them specifies a realization variant for a given partial semantic or syntactic representation. This representation may be specific to a mode or general. We call these components functional descriptions (FDs) in the tradition of the Functional Unification Grammar (Kay, 1979), from which MUG is derived.</Paragraph> <Paragraph position="1"> For each output, the MUG identifies an utterance plan, consisting of separate constituents in the output. For example, when we ask for missing information (&quot;Who would you like to send the e-mail to?&quot;), the utterance consists of an instruction and an interaction section. Such a plan is defined in a component, as is each more specific generation level down to the choice of GUI widgets or lexicon entries.</Paragraph> <Paragraph position="2"> MUG is based on the unification of such attribute-value structures. 
<Paragraph position="3"> To realize a semantic representation R, we unify a suitable grammar component FD with each m-constituent substructure F in R, until all substructures have been expanded. An m-constituent is an FD that has an attribute path m|cat, that is, one that has been designated as a constituent for mode m. Note that for a given mode, zero or one grammar component can be unified with F.</Paragraph>
<Paragraph position="4"> Components from the grammar invoke each other by instantiating the cat attribute in the mode-specific part of a substructure. Figure 3 shows a component that applies to all modes.</Paragraph>
<Paragraph position="5"> There may be several competing components in the grammar. This creates the ambiguity needed to generate a variety of outputs from the same input. Each output will be faithful to the original input; however, only one variant will be optimally adapted to the given situation, user, and device (see Section 5). Our final markup is text for the text-to-speech system as well as HTML to be displayed in a browser, similar to the MATCH system (Johnston et al., 2002).</Paragraph>
<Paragraph position="6"> The nested attribute-value structures and unification are powerful principles that allow us to cover a broad range of planning tasks, including syntactic and lexical choices. The declarative nature of the grammar allows us to easily add new ways to express a given semantic entity. The information that each component has access to is explicitly encapsulated by an FD.</Paragraph>
<Paragraph position="7"> A grammar workbench allows us to debug the generation grammar. We could improve the debugging process with a type hierarchy, which defines the allowed attributes for each type.</Paragraph>
<Paragraph position="8"> [Figure 3 caption fragment: &quot;...tion of tasks or user input. The mode in variable Mode may be voice or screen.&quot;]</Paragraph>
</Section>
</Section>
<Section position="5" start_page="2" end_page="2" type="metho">
<SectionTitle> 4 Planning for Coherence </SectionTitle>
<Paragraph position="0"> Coherence is a key element in designing a multimodal user interface, where the potential for confusion is increased. Our user interface attempts to be both consistent and coherent. For example, lexical choice does not vary: it is either 'mobile phone' or 'cell phone,' but it is the same whether it appears in text or voice. This is in line with priming effects, which are known to occur in human-human dialogue.</Paragraph>
<Paragraph position="1"> Like humans (McNeill, 1992; Oviatt et al., 1997), our system aims to be coherent and consistent across all modes. We present redundant content, for example, by choosing the same lexical realizations (never mixing 'cell phone' and 'mobile phone'). We present complementary content in linked components. If, for example, a deictic expression such as 'these two e-mails' (by voice) requires the e-mails to be put in focus on the screen, it will set a feature accordingly in the complementary mode.</Paragraph>
<Paragraph position="2"> This is possible because of a very simple principle encoded in the generation algorithm: all components realizing one semantic entity must unify. Components may still specify mode-specific information. This is done in a feature named after the mode, so it does not interfere with the realization instructions of a component that realizes the same semantic entity in another mode. The FDs allow us to distinguish information a) that needs to be shared across all output modes, b) that is specific to a particular output mode, or c) that requires collaboration between two modes, such as deictic pronouns. The unification principle replaces explicit integration rules for each coordination scheme, such as those used by Johnston (1998) for the integration of user input.</Paragraph>
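As an illustration of this principle, the sketch below reuses the hypothetical unify helper from Section 3.3 above; all attribute names are invented for the example. Two components realize the same semantic entity in different modes: their instructions live under mode-named features ("voice", "screen") and therefore unify, while the voice component's deictic expression still forces the referenced e-mails into focus on the screen.

    # Hypothetical FDs; attribute names are illustrative, not taken from the paper.
    entity = {"sem": {"ref": "email-set-42"}}

    voice_component = {
        "sem": {"ref": "email-set-42"},
        "voice": {"cat": "deictic-np", "text": "these two e-mails"},
        # complementary-mode requirement: the referents must be visible on screen
        "screen": {"focus": "email-set-42"},
    }

    screen_component = {
        "sem": {"ref": "email-set-42"},
        "screen": {"cat": "list-widget", "focus": "email-set-42"},
    }

    # All components realizing one semantic entity must unify:
    combined = unify(unify(entity, voice_component), screen_component)
    print(combined is not None)   # True: the voice and screen realizations are compatible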
</Section>
<Section position="6" start_page="2" end_page="2" type="metho">
<SectionTitle> 5 Adaptively Choosing the Best Variant </SectionTitle>
<Paragraph position="0"> The application of the MUG generates several output variants. They may include or exclude pieces of information, which may be of more or less utility to the user. (When information is being confirmed, it should be fully described, but in later interactions the email could be referred to as 'it.') For example, several components applied to the sub-FD for task in Figure 1 may depend more on the screen (Figure 2a) or be redundant in screen and voice output (Figure 2b). This allows the system to reflect a low benefit of screen output if the user is driving a car, to increase the cost of voice output if the user is in a meeting, or to reflect the fact that one cannot hear the voice output on a mobile phone while reading the screen.</Paragraph>
<Paragraph position="1"> The system adapts to the user's abilities, her preferences, and the situation she is in by choosing an appropriate variant. These properties are scalar, and the resulting constraints are weighted against each other in our objective function. Each piece of output is scored according to a simple trade-off: a) realize content where requested, b) maximize utility to the user, and c) minimize cognitive load in perceiving and analyzing the output.</Paragraph>
<Paragraph position="2"> These constraints are formalized in a score s assigned to each variant o, given a set of available modes M, a situation model ⟨a, b⟩, and a device model φ. The first part of the sum in s describes the utility benefit. The function E returns the set of semantic entities e (substructures) together with their embedding depths d.</Paragraph>
<Paragraph position="3"> The function P penalizes the non-realization of requested (attribute realize) semantic entities, while rewarding the (possibly redundant) realization of an entity. The reward decreases with the embedding depth d of the semantic entity. (Deeper entities give less relevant details by default.) The cognitive load (the second part of the sum) is represented by a prediction of the time t_m(o) it would take to interpret the output: the utterance output time for text spoken by the text-to-speech system, or an estimated reading time for text on the screen.</Paragraph>
<Paragraph position="4"> Further work will allow us to cover the range from novice to experienced users by relying on natural language phrases versus graphical user interface widgets.</Paragraph>
</Section>
</Paper>
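The defining equation for the score s does not survive in this extract. The sketch below is only a rough, illustrative guess at its shape from the prose in Section 5: a reward for realized semantic entities that decays with embedding depth, a penalty for requested but unrealized entities, and a cost term for the predicted interpretation time per mode, weighted by the situation and device models. All names and weights are invented.

    def score(realized, requested, modes, time_for, mode_cost):
        """Score one output variant; an illustrative guess, not the paper's formula.

        realized  : {semantic entity: embedding depth d} realized by the variant
        requested : entities the dialogue manager marked with `realize`
        time_for  : function(mode) -> predicted time to interpret this variant's
                    output in that mode (TTS duration / estimated reading time)
        mode_cost : per-mode weight from the situation and device models, e.g.
                    a high cost for voice in a meeting or for screen while driving
        """
        benefit = sum(1.0 / (1 + d) for d in realized.values())      # decays with depth
        benefit -= sum(1.0 for e in requested if e not in realized)  # penalty for omissions
        load = sum(mode_cost[m] * time_for(m) for m in modes)        # cognitive load
        return benefit - load

    # Toy example: two modes, the one requested entity realized at depth 0.
    print(score({"task": 0, "recipient": 1}, {"task"}, ["screen", "voice"],
                lambda m: 2.0 if m == "voice" else 1.0,
                {"screen": 0.1, "voice": 0.2}))
    # 1.5 benefit minus (0.1*1.0 + 0.2*2.0) load = 1.0

The best variant would then simply be the candidate with the highest score.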