<?xml version="1.0" standalone="yes"?> <Paper uid="P05-3012"> <Title>Multimodal Generation in the COMIC Dialogue System</Title> <Section position="4" start_page="0" end_page="46" type="metho"> <SectionTitle> 2 Dialogue Management </SectionTitle> <Paragraph position="0"> The task of the Dialogue and Action Manager (DAM) is to decide what the system will show and say in response to user input. [Figure 2: a sample interaction. User: "Tell me about this design" [click on Alt Mettlach]. COMIC: [Look at screen] "THIS DESIGN is in the CLASSIC style." [circle tiles] "As you can see, the colours are DARK RED and OFF WHITE." [point at tiles] "The tiles are from the ALT METTLACH collection by VILLEROY AND BOCH." [point at design name]] The input to the DAM consists of multiple scored hypotheses containing high-level, modality-independent specifications of the user input; the output is a similar high-level specification of the system action. The DAM itself is modality-independent. For example, the input in Figure 2 could equally well have been the user simply pointing to a design on the screen, with no speech at all. This would have resulted in the same abstract DAM input, and thus in the same output: a request to show and describe the given design.</Paragraph> <Paragraph position="1"> The COMIC DAM (Catizone et al., 2003) is a general-purpose dialogue manager which can handle different dialogue-management styles, such as system-driven, user-driven, or mixed-initiative.
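To make the modality independence concrete, here is a minimal sketch (all data structures and names are invented for illustration, not the actual DAM interfaces): the DAM takes the best-scoring abstract input hypothesis and maps it to an abstract output specification, so a spoken request and a bare click that yield the same hypothesis produce the same action.

```python
# Illustrative sketch only; hypothesis and action formats are invented.

def select_hypothesis(hypotheses):
    """Pick the best-scoring high-level, modality-independent input spec."""
    return max(hypotheses, key=lambda h: h["score"])

def decide_action(hypothesis):
    """Map an abstract input spec to an abstract output spec."""
    if hypothesis["act"] == "request-info" and "design" in hypothesis:
        return {"act": "show-and-describe", "design": hypothesis["design"]}
    return {"act": "clarify"}

# Speech plus click, and a bare click, yield the same abstract input here,
# and therefore the same abstract output.
speech_and_click = {"act": "request-info", "design": "Alt Mettlach", "score": 0.9}
click_only = {"act": "request-info", "design": "Alt Mettlach", "score": 0.7}

assert decide_action(speech_and_click) == decide_action(click_only)
```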
The general-purpose part of the DAM is a simple stack architecture with a control structure; all the application-dependent information is stored in a variation of Augmented Transition Networks (ATNs) called Dialogue Action Forms (DAFs).</Paragraph> <Paragraph position="2"> These DAFs represent general dialogue moves, as well as sub-tasks or topics, and are pushed onto and popped off the stack as the dialogue proceeds.</Paragraph> <Paragraph position="3"> When processing a user input, the control structure decides whether the DAM can stay within the current topic (and thus the current DAF), or whether a topic shift has occurred. In the latter case, a new DAF is pushed onto the stack and executed. After that topic has been exhausted, the DAM returns to the previous topic automatically. The same principle holds for error handling, which is implemented at different levels in our approach.</Paragraph> <Paragraph position="4"> In the guided-browsing phase of the COMIC system, the user may browse tiling designs by colour, style, or manufacturer, look at designs in detail, or change the number of border and decoration tiles.</Paragraph> <Paragraph position="5"> The DAM uses the system ontology to retrieve designs according to the chosen feature, and consults the user model and dialogue history to narrow down the resulting designs to a small set to be shown and described to the user.</Paragraph> </Section> <Section position="5" start_page="46" end_page="46" type="metho"> <SectionTitle> 3 Presentation Planning </SectionTitle> <Paragraph position="0"> The COMIC fission module processes high-level system-output specifications generated by the DAM.</Paragraph> <Paragraph position="1"> For the example in Figure 2, the DAM output indicates that the given tile design should be shown and described, and that the description must mention the style.
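The stack-and-DAF control structure of Section 2 can be sketched roughly as follows. This is an illustrative simplification with invented names: each DAF is reduced to a topic label plus a handler function, rather than a full ATN-style network.

```python
# Minimal sketch of a stack-based dialogue manager (names invented).

class DAF:
    """A Dialogue Action Form, reduced here to a topic plus a handler."""
    def __init__(self, topic, handler):
        self.topic = topic
        self.handler = handler          # returns a reply, or None when exhausted

class DialogueManager:
    def __init__(self, root_daf, daf_library):
        self.stack = [root_daf]
        self.daf_library = daf_library  # topic -> factory for a new DAF

    def process(self, user_input):
        current = self.stack[-1]
        if user_input["topic"] != current.topic:
            # Topic shift: push a new DAF onto the stack and execute it.
            self.stack.append(self.daf_library[user_input["topic"]]())
        reply = self.stack[-1].handler(user_input)
        if reply is None:
            # Topic exhausted: pop and return to the previous topic.
            self.stack.pop()
            reply = self.stack[-1].handler(user_input)
        return reply
```

A topic shift pushes a new DAF; once it is exhausted, control falls back to the DAF below it on the stack.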
The fission module fleshes out such specifications by selecting and structuring content, planning the surface form of the text to realise that content, choosing multimodal behaviours to accompany the text, and controlling the output of the whole schedule. In this section, we describe the planning process; output coordination is dealt with in Section 6.</Paragraph> <Paragraph position="2"> Full technical details of the fission module are given in (Foster, 2005).</Paragraph> <Paragraph position="3"> To create the textual content of a description, the fission module proceeds as follows. First, it gathers all of the properties of the specified design from the system ontology. Next, it selects the properties to include in the description, using information from the dialogue history and the user model, along with any properties specifically requested by the dialogue manager. It then structures the selected properties and creates logical forms as input for the OpenCCG surface realiser. The logical forms may include explicit alternatives in cases where there are multiple ways of expressing a property; for example, it could say either This design is in the classic style or This design is classic. OpenCCG makes use of statistical language models to choose among such alternatives. This process is described in detail in (Foster and White, 2004; Foster and White, 2005).</Paragraph> <Paragraph position="4"> In addition to text, the output of COMIC also incorporates multimodal behaviours, including prosodic specifications for the speech synthesiser (pitch accents and boundary tones), facial behaviour specifications (expressions and gaze shifts), and deictic gestures at objects on the application screen using a simulated pointer. Pitch accents and boundary tones are selected by the realiser based on the context-sensitive information-structure annotations (theme/rheme; marked/unmarked) included in the logical forms.
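As a toy illustration of choosing between such alternatives with a language model (the word counts and the unigram model here are entirely invented; the real system uses trained n-gram models inside OpenCCG, not this code):

```python
import math

# Toy unigram scorer standing in for OpenCCG's n-gram models (counts invented):
# the realiser keeps whichever paraphrase the language model scores higher.
COUNTS = {"this": 50, "design": 30, "is": 60, "in": 40, "the": 80,
          "classic": 5, "style": 10}
TOTAL = sum(COUNTS.values())

def log_prob(sentence):
    """Sum of per-word log-probabilities, with add-one smoothing."""
    return sum(math.log((COUNTS.get(w, 0) + 1) / (TOTAL + len(COUNTS)))
               for w in sentence.lower().split())

alternatives = ["This design is in the classic style",
                "This design is classic"]
best = max(alternatives, key=log_prob)
```

Note that an unsmoothed-length unigram score like this one inherently favours shorter strings; real n-gram models capture word-sequence fluency as well.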
At the moment, the other multimodal coarticulations are specified directly by the fission module, but we are experimenting with using the OpenCCG realiser's language models to choose them as well, using example-driven techniques.</Paragraph> </Section> <Section position="6" start_page="46" end_page="46" type="metho"> <SectionTitle> 4 Surface Realisation </SectionTitle> <Paragraph position="0"> Surface realisation in COMIC is performed by the OpenCCG realiser, a practical, open-source realiser based on Combinatory Categorial Grammar (CCG) (Steedman, 2000b). It employs a novel ensemble of methods for improving the efficiency of CCG realisation, and in particular, makes integrated use of n-gram scoring of possible realisations in its chart realisation algorithm (White, 2004; White, 2005). The n-gram scoring allows the realiser to work in &quot;anytime&quot; mode--able at any time to return the highest-scoring complete realisation--and ensures that a good realisation can be found reasonably quickly even when the number of possibilities is exponential. This makes it particularly suited for use in an interactive dialogue system such as COMIC.</Paragraph> <Paragraph position="1"> In COMIC, the OpenCCG realiser uses factored language models (Bilmes and Kirchhoff, 2003) over words and multimodal coarticulations to select the highest-scoring realisation licensed by the grammar that satisfies the specification given by the fission module.
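The anytime idea can be sketched as a best-first search that always remembers the best complete realisation found so far, so the search can be cut off whenever time runs out. This is a schematic stand-in, not OpenCCG's actual chart algorithm, and all names and the scoring function are invented.

```python
import heapq

# Schematic "anytime" best-first search over word orderings (not OpenCCG's
# real chart realisation): partial realisations are expanded highest-score
# first, and the best complete one found so far is always available.

def anytime_realise(words, score, max_steps=1000):
    agenda = [(-score(()), ())]          # max-heap via negated scores
    best_complete = None
    steps = 0
    while agenda and steps < max_steps:  # stop whenever the budget runs out
        steps += 1
        _, state = heapq.heappop(agenda)
        if len(state) == len(words):
            if best_complete is None or score(state) > score(best_complete):
                best_complete = state
            continue
        for w in words:
            if w not in state:
                nxt = state + (w,)
                heapq.heappush(agenda, (-score(nxt), nxt))
    return best_complete                 # may be None if interrupted very early
```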
Steedman's (2000a) theory of information structure and intonation is used to constrain the choice of pitch accents and boundary tones for the speech synthesiser.</Paragraph> </Section> <Section position="7" start_page="46" end_page="47" type="metho"> <SectionTitle> 5 Speech Synthesis </SectionTitle> <Paragraph position="0"> The COMIC speech-synthesis module is implemented as a client to the Festival speech-synthesis system. We take advantage of recent advances in version 2 of Festival (Clark et al., 2004) by using a custom-built unit-selection voice with support for APML prosodic annotation (de Carolis et al., 2004).</Paragraph> <Paragraph position="1"> Experiments have shown that synthesised speech with contextually appropriate prosodic features can be perceptibly more natural (Baker et al., 2004).</Paragraph> <Paragraph position="2"> Because the fission module needs the timing information from the speech synthesiser to finalise the schedules for the other modalities, the synthesiser first prepares and stores the waveform for its input text; the sound is then played at a later time, when the fission module indicates that it is required.</Paragraph> </Section> <Section position="8" start_page="47" end_page="47" type="metho"> <SectionTitle> 6 Output Coordination </SectionTitle> <Paragraph position="0"> In addition to planning the presentation content as described earlier, the fission module also controls the system output to ensure that all parts of the presentation are properly coordinated, using the timing information returned by the speech synthesiser to create a full schedule for the turn to be generated.</Paragraph> <Paragraph position="1"> As described in (Foster, 2005), the fission module allows multiple segments to be prepared in advance, even while the preceding segments are being played.</Paragraph> <Paragraph position="2"> This serves to minimise the output delay, as there is no need to wait until a whole turn is fully prepared before output
begins, and the time taken to speak the earlier parts of the turn can also be used to prepare the later parts.</Paragraph> </Section> </Paper>
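The pipelining described in Section 6 can be sketched with a producer thread that prepares later segments while earlier ones are being played. This is a minimal illustration only; the function names and timings are invented, and `prepare` and the playback step stand in for synthesis and audio output.

```python
import queue
import threading
import time

def prepare(segment):
    """Stand-in for speech synthesis: produce a waveform for one segment."""
    time.sleep(0.01)
    return f"waveform({segment})"

def pipeline(segments):
    ready = queue.Queue()
    played = []

    def producer():
        for seg in segments:
            ready.put(prepare(seg))  # preparation overlaps with playback below
        ready.put(None)              # end-of-turn marker

    threading.Thread(target=producer).start()
    while (wav := ready.get()) is not None:
        played.append(wav)           # stand-in for playing the waveform
    return played
```

Output can start as soon as the first segment is ready, and the time spent playing segment i is reused to prepare segment i+1.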