<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1203"> <Title>A compact Representation of prosodically relevant Knowledge in a Speech Dialogue System</Title> <Section position="2" start_page="0" end_page="18" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The speech generation and synthesis modules of a speech dialogue system are very important, since they form the output &quot;visible&quot; to the user. Natural-language generation for a speech dialogue system must therefore operate in real time. In a system which outputs speech, real time is essentially the time the system takes before it starts to utter what it has to say. The reaction time to the previous user input should be minimal. One way of increasing throughput, and thus coming closer to real-time operation, is to divide the input into small autonomous packets which can be processed independently of each other and thus simultaneously. Then, once the first part of an utterance has been generated, it can be passed on to the synthesis module, and while the latter is producing speech for this segment, the generator can proceed to process the next segment. This type of processing is known as incremental processing. The acoustic speech signal is by its nature volatile; it can only be heard once. This makes it imperative that the generator provides the synthesizer with all information necessary to produce high-quality speech. In human speech, information is conveyed not only in the individual words, but also in the prosody, which tells the listener in many subtle ways how these words are related to each other. It often conveys information not explicitly contained in the spoken words, such as the focal point of a sentence or the contrast of certain words (ideas) with other spoken or unspoken words. 
A prerequisite for high-grade synthetic speech is therefore that all phonologically and prosodically relevant information contained in the concepts forwarded to the generator and the syntactic structure produced by it be passed on in a suitable form to the synthesis module. Only by taking structural information into account can the synthesis module produce high-quality prosody. The details of these &quot;phonological&quot; structures are of course language dependent, but there are certain properties common to all languages. As the overall system described here is strictly modular and (in principle) multilingual, the phonologically relevant information of an utterance has to be coded in a high-level interface protocol that can both be output by a variety of generators and be used as input for a variety of synthesis modules.</Paragraph> <Paragraph position="1"> Since the design of such an interface protocol depends on the structure of the semantic input concepts and the syntactic structures generated from them, section 2 gives a short overview of the dialogue management module and our generation system that has been developed in the EFFENDI project 1. Section 3 then describes the compact representation for prosodically relevant knowledge and briefly indicates how the information for this representation is obtained from the input concepts of the generator and the syntactic structures generated from them. As incremental generation often requires the repair of some previously generated parts, section 4 considers the effects of incremental generation that concern the synthesis module and how we try to avoid unnecessary repetitions of words as far as possible. The final section describes the goals of the ongoing work in EFFENDI.</Paragraph> <Paragraph position="2"> 2 Overview of the Dialogue System The syntactic generator EFFENDI is integrated into the speech dialogue system implemented by Daimler-Benz. 
The generator itself is particularly adapted to the specific needs of a real-time speech dialogue system (cf. (Poller and Heisterkamp 1997)).</Paragraph> <Paragraph position="3"> A more detailed description of the dialogue system as a whole can be found elsewhere (cf. e.g. (Brietzmann et al. 1994), (Hanrieder and Heisterkamp 1994) or also (Heisterkamp 1993)). We will thus restrict our description to those components of the dialogue system that interact with the generator. 2 The planning of a system utterance (also called &quot;strategic generation&quot; or &quot;what-to-say&quot;) is the main task of the dialogue management component. This means determining the appropriate type of utterance in a given dialogue situation and the items that are to be talked about, deciding in which manner or style to present them, and finally delivering a semantic description of the utterance to the syntactic generator.</Paragraph> <Paragraph position="4"> A module called the Dialogue Manager operates with a set of goals (cf. e.g. (Heisterkamp and McGlashan 1996)) that result from the contextual interpretation of the user utterance in a Belief Module ((Heisterkamp et al. 1992)), the requirements of the application system, and the current confirmation strategy (cf. (Heisterkamp 1993)).</Paragraph> <Paragraph position="5"> 1 EFFENDI stands for &quot;EFfizientes FormulierEN von DIalogbeiträgen&quot; (Efficient formulation of dialogue contributions) and is a joint research project of the DFKI</Paragraph> <Section position="1" start_page="17" end_page="18" type="sub_section"> <SectionTitle> Saarbrücken and Daimler-Benz Research Ulm. </SectionTitle> <Paragraph position="0"> 2 Historically, our dialogue system goes back in part to the one developed in the SUNDIAL project. The architecture of that system was laid out to accommodate a generator (cf. 
(Youd and McGlashan 1992)), but for various reasons the work on this aspect was discontinued.</Paragraph> <Paragraph position="1"> The Dialogue Module selects from the overall set of goals that subset which should constitute the next system utterance. A Message Planner receives this subset, consisting of types of utterances (e.g. a request for confirmation), the task item of that goal (e.g. a departure place) and the status of this item (new, repeated n times). It requests a semantic description of that task item from the Belief Module. The semantic description is then combined with the dialogue goal types for the phrase type markers (e.g.</Paragraph> <Paragraph position="2"> question) and verbosity markers inferred from the status (e.g. the possibility of eliding a verb or reducing it to a prepositional phrase) to result in a semantic structure 3. This semantic structure is then passed on to the generation module. A special interface translates these semantic representations into syntactically oriented input specifications for the generator.</Paragraph> <Paragraph position="3"> The most important property of the EFFENDI generator is its incrementality. Incremental generation means that both the consumption of the input elements and the production of the output elements work in a piecemeal and interleaved fashion.</Paragraph> <Paragraph position="4"> Input consumption and output production interleave in such a way that the first parts of a sentence are uttered before the generation process is finished and even before all input elements are consumed. This kind of flexible syntactic generation is only possible if the processing can be broken down into a large set of independent tasks which can run in parallel (cf.</Paragraph> <Paragraph position="5"> (Kempen and Hoenkamp 1982)). 
Applying this principle, generation in EFFENDI is realized by synchronizing a set of actively communicating, independent processes (so-called objects), each of which is responsible for the syntactic realization of an input element and its integration into the syntactic structure of the whole utterance (cf. (Kilger 1994)).</Paragraph> <Paragraph position="6"> In addition, incremental generation should be separated into two main computational steps. The first step must comprise the construction of the hierarchical (syntactic) structure. The word order of the surface string is computed in a second step (linearization). The reason for this separation is the observation that decisions at the hierarchical level are often possible at a time when input information is not yet sufficient to make decisions at the positional level ((Kilger 1994)).</Paragraph> <Paragraph position="7"> Incremental syntactic generation can therefore be organized as follows. The incremental input interface immediately translates each incoming input specification into an independent process (object).</Paragraph> <Paragraph position="8"> This process immediately and independently runs the following computational steps. At the hierarchical level, an elementary syntactic structure for the individual input element is selected. In order to build a virtual syntactic structure for the whole sentence, the objects exchange structural and syntactic information by explicitly sending messages to related objects. An object that has completed the structural combination with related objects changes to the positional level, whose task is the determination of the resulting word order of the surface string (linearization) and its output. Linearization and output production have to be synchronized with respect to the word order that globally results from the local linearizations. So, incremental output production is organized as a global visit of all objects. 
As soon as an object has finished its linearization, it can be uttered, i.e. sent to the synthesizer. The incrementality of the output is automatically ensured because the individual objects finish their local linearizations at different times.</Paragraph> </Section> </Section> </Paper>