<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0418">
  <Title>Matchmaking: dialogue modelling and speech generation meet*</Title>
  <Section position="4" start_page="172" end_page="172" type="metho">
    <SectionTitle>
3 Available resources
</SectionTitle>
    <Paragraph position="0"> The overall system architecture for SPEAK! is shown in Figure 1. The text generation system (KOMET-PENMAN) receives input from a dialogue module (colt, dialogue history) and perhaps several other information sources (e.g., confidence measure from a speech recognition unit), which will be made more precise below (see Section 4). Together the information from these input sources controls the traversal of the grammar (see Section 3.2). The KOMET-PENMAN grammar can generate two types of output: A plain text, which can be embedded into, for instance, a dialogue box in a graphical user interface and a text that is marked up with intonational features (see Section 3.2 for an example), which is passed on to the MULTIVOX text-to-speech system \[23\] and presented acoustically to the user.</Paragraph>
    <Paragraph position="1"> In this article we develop a model of how the dialogue module can control the traversal of those regions of the tal generation of utterances from pre-linguistic conceptual structures to the formation of syntactic and phonological structures, with an interface to a speech synthesis module for German.</Paragraph>
    <Paragraph position="2">  grammar concerned with intonation. As a basis for discussion, we introduce our dialogue model and the relevant parts of the grammar in detail.</Paragraph>
    <Section position="1" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
3.1 The dialogue model
</SectionTitle>
      <Paragraph position="0"> A dialogue model guides the interaction between a user and an information retrieval system, i.e., it calculates a subset of possible dialogue acts that the user action (spoken or deictic) could correspond to, and on the system side it calculates those dialogue acts that would provide appropriate responses to a given user action. In the work presented here, we assume that a component exists that can choose one of the dialogue acts from these subsets (see e.g., \[13, 28\].) In the 'SPEAK!' project we have chosen to employ a modified version of the Conversational Roles model (COR) as our dialogue model (see \[31\]). COR is a task independent model based on Searle's speech act theory \[29\]. It has been modified within the 'SPEAK!' framework in order to include naturally occuring data that the original model failed to account for, but the overall speech act framework remains the same.</Paragraph>
      <Paragraph position="1"> In the model, a dialogue is represented as a sequence of dialogue moves (e.g., Request, Inform, Withdraw request), which are further decomposed into sequences of atomic acts, dialogue moves, and sub-dialogues. This recursive representation of a dialogue enables COR to account for mixed initiative dialogues, where both information seeker and informa\[ion knower can employ, for instance, retraction, correction, and clarification tactics.</Paragraph>
      <Paragraph position="2"> Below we present a simplified rewrite rule version of the dialogue model. In this version we only present the request, inform, and assert moves in detail, since the other moves are cast in the same format as the request, and one only has to insert new move names (e.g., Promise ---+ promise(K), (Dialogue(S)), etc.). Moves in parentheses are optional. The parameters indicate which participant can perform a given move, S=information seeker, K=information knower. Moves begin with upper case and acts with lower case. The first two rules encode the course that the dialogue is expected to take, while the other dialogue rules encode exceptions. (For a more detailed account see \[31\]).</Paragraph>
      <Paragraph position="3">  Based on the dialogue model, the system builds up a tree-like dialogue history of the ongoing dialogue (see Section 4). Two central themes in our current work are to identify the relevant partial structures of such trees and to determine their semantics such that, for instance, the text generation system can search the dialogue history and interpret what it finds in order to guide the choice of intonation for the system utterances. null</Paragraph>
    </Section>
    <Section position="2" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
3.2 The intonational resources of the KOMET grammar
</SectionTitle>
      <Paragraph position="0"> In this section, we describe the system networks that have been introduced into the German grammar of the KOMET-PENMAN text generation component so as to include specifications of appropriate intonation selections in its output (\[12\]). The KOMET grammar of German (\[32, 11\]) is a computational NIGEL-style systemic-functional grammar, based on the notion of choice. The systemic-functional framework provides us with representational means for describing available choices and for mapping (even though indirectly) communicative goals to intonational features.</Paragraph>
      <Paragraph position="1"> According to systemic-functional linguistics (SFL) (see \[15, 21, 7\]), intonation is just one means among others--such as syntax and lexis--to realize choices in the grammar. 3 This implies that choices underlying the realization of intonation may be organized in exactly the same way as other choices in the grammar (see \[14, 7\]. Hence, the intonational control required for speech generation in a dialogue system has been built into the existing KOMET grammar. The added discriminations are constraints on the specification of an appropriate intonation rather than constraints on the structural form.</Paragraph>
      <Paragraph position="2"> Treatment of intonation in SFL The three distinct kinds of phonological categories, i.e., tone group, tonic syllable and tone, contribute to the intonation a\[2, 30, 4, 8\] consider intonation part of phonology. specification of a clause (see for instance \[4, 30, 24\]). They signal three different kinds of relation between grammar and intonation (and thus, indirectly, con*ext), and hence realize different meanings* A choice from the available alternatives has to be made for each of the phonological categories in order to realize sentence intonation. The three sets of choices according to \[14\] are: would generate a neutral statement choosing tone la to accompany the presentation, as in//la die ergebnisse sind unten dargestellt//z (&amp;quot;The results are given below&amp;quot;). If, however, the results had so far been presented at a different position on the screen, the system would generate tone lb in order to place special emphasis on the statement: //lb die ergebmsse sina  UNTEN dargestellt//.</Paragraph>
      <Paragraph position="3"> * Tonality: The distribution into tone groups, i.e., the number of tone groups allocated by the speaker to a given stretch of language.</Paragraph>
      <Paragraph position="4"> * Tonicity: The placing of the tonic syllable, i.e., its position within the tone group.</Paragraph>
      <Paragraph position="5"> * Tone: The choice of a tone for each tone group;  the tone is associated with the tonic.</Paragraph>
      <Paragraph position="6"> Choices in the systems of tonality and tonicity lead to an information constituent structure independent of the grammatical constituency, whereas choices in tone result in the assignment of a tone contour for each identified tone group in an utterance. From these systems, only the choices in the tone systems realize an interpersonal function 4, that of indicating a speech function or the speaker's attitude (e.g., \[14\])* This interpersonal function is our present concern. Next, we want to investigate the tone more closely before turning to the actual system networks in the KOMET grammar* Following \[24\], we assume five tones, s the primary tones, plus a number of so-called secondary tones that are necessary for the description of German intonation contours. These tones are: ~all (tonel), rise (tone2), progredient (tone3),/all-rise (tone4), rise-/all (toneS), where the first four can be further differentiated into secondary a and b tones. 8 The primary tones are the undifferentiated variants, whereas the secondary tones are interpreted as realizing additional meaning. They are intepreted as follows:</Paragraph>
      <Paragraph position="8"> Consider the following example taken from one of the information seeking dialogues: The computer has retrieved an answer to a query, and this answer is presented graphically to the user. As a default, the system 4Tone moreover realizes the logical metafunction, however, we will ignore this fact for the present argument.</Paragraph>
      <Paragraph position="9"> ~Other approaches to intonation suggest a different number of tones, ranging from four to six. \[8\] even goes one step further in arguing that it is not sufficient to describe tones by a combination of fall and rise, instead, much finer distinctions have to be made (see \[8\]).</Paragraph>
      <Paragraph position="10"> 8The criteria for the distinction of primary tones is the type of the tone movement, for instance rising or falling tone contour, whereas the degree of the movement, i.e., whether it is strong or weak in expression, is considered to be a variation within a given tone contour* 174  fied) Intonational choices in the KOMET grammar Modelling intonation in the KOMET grammar involves the introduction of more delicate systems in those areas on the lexicogrammatical level, where intonational distinctions exist, thus specifying the relation between intonation features and competing linguistic resources (like lexis and syntax). Here, we will restrict ourselves to the description of the system networks reflecting the choices in tone. The networks are primarily based on the descriptive work by \[24\].</Paragraph>
      <Paragraph position="11"> The interpersonal part of the grammar provides the speaker with resources for interacting with the listener, for exchanging information, goods and services, etc.</Paragraph>
      <Paragraph position="12"> (see \[15, 20\]). On the lexicogrammatical stratum, the MOOD systems are the central resource for expressing these speech functions. More delicate speech functional distinctions--specific to spoken German--are realized by means of tone. The (primary) tone selection in a tone group serves to realize a number of speech functional distinctions* For instance, depending on the tone contour selected, the system output lisle wollen um f~nfzehn uhr fahren//(&amp;quot;You want to leave at 3 pm.&amp;quot;) can be either interpreted as a question (tone 2a) or a statement (tone la).</Paragraph>
      <Paragraph position="13"> More important is the conditioning of the (secondary) tone by attitudinal options such as the speaker's atti-Tin this paper, the following notational conventions hold: //marks tone group boundaries, CAPITAL LETTERS are used to mark the tonic element of a tone group. Numbers following the// at the beginning of a tone group indicate the type of tone contour.</Paragraph>
      <Paragraph position="14"> tude towards the proposition being expressed (surprise, reservation ...), what answer is being expected, emphasis on the proposition etc., referred to as KEY features. If one defines KEY as the part of speech functional distinctions expressed' by means of tone rather than mood alone, one can integrate the MOOD and KEY systems into the grammar by positioning KEY systems as dependent on the various MOOO systems, s  grammar for the declarative and interrogative sentence mood. The networks now include more delicate grammatical distinctions in order to realize the variations that have intonational consequences. The networks are restricted in that they omit some of the incongruent mood codings. The added discriminations to the KOMET grammar impose constraints on the specification of an appropriate intonation contour.</Paragraph>
    </Section>
    <Section position="3" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
3.3 Integrating COR and KOMET-PENMAN
</SectionTitle>
      <Paragraph position="0"> As illustrated in Section 3.2 the relation between dialogue moves and tone is many-to-many, hence the appropriate tone selection must be further constrained.</Paragraph>
      <Paragraph position="1"> The dialogue model provides general information about the structure of an information retrieval dialogue, hence we consider it a representation of genre. The KOMET grammar provides linguistic resources including intonational options. In the following section, we determine the kinds of information that are needed in addition to what these resources provide and suggest a method of integrating the additional resources in the overall system,</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="172" end_page="176" type="metho">
    <SectionTitle>
4 Constraints on choice in intonation
</SectionTitle>
    <Paragraph position="0"> In information-seeking, human-machine dialogue it is crucial to signal to the user as unambiguously as possible at which stage in the dialogue she is and what action (verbal or non-verbal) she is supposed to take (see s\[14\], \[7\] and \[21\] have described this for English, \[24\] adapted Halliday's approach for German.</Paragraph>
    <Paragraph position="1">  Section 2). When spoken mode is envisaged as output, the intonation contour is the major means to convey this information. The relation between dialogue moves and tone types is however not trivial. For instance, a dialogue move REQUEST--depending on the context in which it occurs--may be realized intonationally by using tone 1, tone 2 or tone 4. Hence, the selection of an appropriate tone is conditioned by factors other than just individual coR dialogue moves. When we think about the problem from the perspective of intonation, the picture becomes clearer. It is generally acknowledged in descriptive linguistics that the kind of tone attributed to an information unit encodes a basic semantic speech act or speech function \[27, 25\], such as command, question, statement and offer, even though this relation is not one-to-one. Also, it is uncontroversial to maintain that intonation potentially reflects a speaker's attitude towards the message she verbalizes (see e.g., \[24\]). When looking at dialogue--rather than monologue--other factors coming into play are the history off the dialogue taking place and the expectations on the part off the hearer that are evoked at particular stages in the course of the dialogue.</Paragraph>
    <Paragraph position="2"> In this section, we will discuss how these factors relate to the selection of tone. Our goal is to determine more precisely what is comprised by them and to arrive at a refinement of the general architecture we have presented in Section 3. More concretely, it will be shown that the factors just pointed out are logically independent parameters that in different combinations constrain the selection of a particular tone. We will then propose an organization of these different parameters in terms of stratification that allows for the necessary flexibility and brigdes the gap between the dialogue model and the generator. Discussing a sample dialogue (Section 4.2), we will then apply the model developed.</Paragraph>
    <Paragraph position="3"> We start from the stratum of grammar and move to the other linguistic and pragmatic resources relevant to the present problem. As the starting point for discussion we take the grammatical systems of MOOD and KEY, for they grammatically encode semantic speech function and speaker's attitudes and lead directly to selections in tone.</Paragraph>
    <Section position="1" start_page="175" end_page="176" type="sub_section">
      <SectionTitle>
4.1 The meanings of tone
</SectionTitle>
      <Paragraph position="0"> One of the primary grammatical choices relevant for the selection of tone is the choice of mood, such as declarative, interrogative and imperative. 9 The relation between mood and tone is potentially many-to-many with one exception:imperative is always realized by tone 1.</Paragraph>
      <Paragraph position="1"> However, the choice of mood is crucial since it leads to a whole variety of options that are eventually realized in different tones (these are the KEY systems).</Paragraph>
      <Paragraph position="2"> How is choice in the basic mood options constrained? 9We assume here that the information unit is the clause and that tonality is unmarked, i.e., that there is one tonegroup only. We are aware, however, that generally there is no one-to-one correspondence between information unit and clause.</Paragraph>
      <Paragraph position="3"> Mood is in the first instance tim grammatical realization of semantic speech function. Speech functions comprise command, offer, statement and question. Systemically, they are derived from the SPEECH FUNCTION network (see e.g., \[20\] and Figure 4). Again, the relation between speech function and mood is potentially many-to-many:All of imperative, declaraLive and interrogative may for instance encode a command.</Paragraph>
      <Paragraph position="4"> For example Schliefl das Fenster! (Close the window!), W~rdest Du das Fenster schlieflen, bitte? ( Would you close the window, please?), Du sollst das Fenster nicht bffnen! (You're not supposed to open the window O.</Paragraph>
      <Paragraph position="5"> How can the mapping between speech function and mood be constrained then? A major constraint on the mapping between speech function and mood is the kind of discourse or genre, and the type of discourse stage the message is produced in. For instance, the genre of information-seeking, human-machine dialogues is characterized by certain genre-specific stages or dialogue moves (see Section 3.1). A typical move in this genre is the REQUEST move. In terms of speech function, a REQUEST is typically a question, i.e., \[demanding:information\], m The REQUEST-question correlation in the kind of dialogue we are dealing with here constrains the choice of mood to interrogative or declarative, e.g., (1) Wohin mSchten Sie fahren? (Where do you want to go?) (interrogative)-- (2) Sie wollen um drei Uhr fahren? (You want to go at three o'clock?) (declarative). So, in information-seeking dialogues, the type of move largely constrains the selection of speech function, but it only partially constrains the mapping of speech time,ion and mood.</Paragraph>
      <Paragraph position="6"> Deciding between declarative and interrogative as realization of a move REQUEST requires information about the immediate context of the utterance, i.e., about the dialogue history. It is in the area of combinations of dialogue moves that we find reflections of speaker's attitudes and intentions and hearer's expectations as determined by the context. The area in the grammar encoding this is key.</Paragraph>
      <Paragraph position="7"> The KEY systems are subsystems of the basic MOOD options (see Section 3.2). In terms of key, example (1) would be \[interrogative:wh-type:whnontonic:neutral-involvement\], thus leading to an intonational realization as tone 1, example (2) would be \[ declarative:answering:answer-to-question:strong\] leading to an intonational realization as tone 2. Consider the contexts in which (1) or (2) would be appropriate: (1) would typically be used as an initiating move of an exchange, where there is no immediately preceding context--the speaker's attitude is essentially neutral. (2) would typically be used in an exchange as the realization of a responding to move; in terms of the coa model, (2) would be a possible realization of a REQUEST within an INFORM or within a REQUEST--the speaker wants to make sure she has understood correctly. Only in the REQUEST or INraThe notation \[x:y:z\] gives a path through a system network.</Paragraph>
      <Paragraph position="8">  FORM contexts of a REQUEST does it become possible to map the dialogue move/speech function correlation of REQUEST-question to the mood and key features \[ declarative:answering:answer-to-question:strong\] (see also Section 4.2).</Paragraph>
      <Paragraph position="9"> For the representation of constraints between dialogue moves on the dialogue side and speech function on the side of interpersonal semantics and mood and key on the part of the grammar, this means that a good candidate for the ultimate constraint on tone selection is the type of move in context (or: the dialogue history).</Paragraph>
      <Paragraph position="10"> Given that all of the parameters (dialogue move type, dialogue history, speech function, mood and key) are logically independent and that different combinations of them go together with different selections of tone, an organization of these parameters in terms of stratification suggests itself, for it provides the required flexibility in mapping the different categories. Such an organization is for instance proposed in systemic functional work on interaction and dialogue \[3, 20, 36\].</Paragraph>
      <Paragraph position="11"> 1 In the systemic functional model, the strata assumed are context (extra-linguistic), semantics and grammar (linguistic). On the semantic stratum, general knowledge about interactions is located, described in terms of the NEGOTIATION network (cf. Figure 4). A pass (or passes, since NEGOTIATION is recursive) through the network results in a syntagmatic structure of an interaction called exchange structure. An exchange structure consists of moves which are the units for which the SPEECH FUNCTION network holds. NEGOTIATION and SPEECH FUNCTION are the two ranks of the stratum of interpersonal semantics (see Figure 4). The MOOD and KEY systems represent the grammatical realization of a move (given that a move is realized as a clause). The ultimate constraint on the selection of features in the interpersonal semantics and grammar is the information located at the stratum of context. This is knowledge about the type of discourse or genre. In the present scenario, this contextual knowledge is provided by the dialogue model, reflecting the genre of information-seeking, human-machine dialogue.</Paragraph>
      <Paragraph position="12"> Since the stratum of context is extra-linguistic, locating the dialogue model--which has originally not been designed to be a model of linguistic dialogue, but of retrieval dialogue in general--here is a straightforward step. For a graphical overview of the stratified architecture we just described briefly see Figure 5.</Paragraph>
    </Section>
    <Section position="2" start_page="176" end_page="176" type="sub_section">
      <SectionTitle>
4.2 A top-down perspective
</SectionTitle>
      <Paragraph position="0"> In this section, we discuss our proposal of bridging the gap between the dialogue model and the text generator KOMET-PENMAN from a top-down perspective. We develop concrete mappings between the extra-linguistic and semantic strata. Further, we show how competent choices at the semantic stratum guide the selection of features in the MOOD and KEY systems, which finally result in the assignment of a tone. We base our derivation of mappings between the strata on the following sample dialogue and its COR analysis ll from the domain of giving out train information. In the following, we will discuss the different system utterances one by one. This discussion is summarized in Table 1 .</Paragraph>
      <Paragraph position="2"> Initial requests Utterance B) results from a real information need on the system part. In order to do llIn the analysis: D=(sub-)Dialogue, R/r=R/request, I/i=\[/inform. 177 anything at all, the system must know where the user wants to travel. Unless the user volunteers the destination, it must request this information from her. 12 The user did not say where she wanted to travel, hence the system initiated the exchange, this is represented by the following path through the NEGOTIATIONha network: \[negotiation:negotiating:exchanging/initiating\]. In terms of speech function, we realize this request as a question (\[demanding/in/ormation\]). Other possible realizations of a request would be command, offer and statement, though none of them applies in the given context. The scenario itself excludes the command and the statement option, since the system is in need for information. Finally, since the system is incapable of handing over, say, a ticket, this request cannot be realized as an offer (&amp;quot;Let me give you a ticket to your destination&amp;quot; ).</Paragraph>
      <Paragraph position="3"> Knowing that we have to realize a question, we have three MOOD options available: \[declarative\], \[yes/noquestion\], and \[wh-question\]. Keeping in mind that we want our system to be user friendly, we do not want it to realize this request as a yes/no question (&amp;quot;Do you want to go to Heidelberg?&amp;quot;), or a statement (&amp;quot;You want to travel to Heidelberg?&amp;quot;), since it would then exhaustively have to search through its knowledge base in order to find the right destination to include in its utterance. Hence we conclude that requests that are not in response to a user utterance should be realized as a wh-question.</Paragraph>
      <Paragraph position="4"> In terms of KEY, we do not want our system to be overly \[involved\] in the conversation or \[surprised\] by the fact that it has to request some information and there is nothing to \[clarify\], hence the only accessible KEY feature is \[neutral-involvement\], which implies that utterance B)--an initiating, neutral, wh-question--should be realized as tone 1.</Paragraph>
      <Paragraph position="5"> Responding requests Utterance D) is a request in response to the destination that the user informed. In terms of semantic choices it is \[initiating\] a new embedded exchange, while it is \[responding to\] a user move in the embedding exchange. The speech function is question since the system wants to initiate a response.</Paragraph>
      <Paragraph position="6"> We suggest that the linguistic realization of this question depends on how confident the system is about what the user informed, hence in order to choose appropriate MOOD and KEY features, we argue that we need access to an additional resource--a confidence measure.13 For the current example, we suggest the following alternatives: null 12We assume that the system has an abstract internal specification of its information needs and that it keeps a record of the information it has already received.</Paragraph>
      <Paragraph position="7"> laTbis is highly relevant if the input channel is spoken, since speech recognizers cannot achieve a 100% recognition rate. Technically, the confidence measure would come from the speech recognition unit.</Paragraph>
      <Paragraph position="8">  ~. &amp;quot;Could you repeat that, please?&amp;quot; z o &amp;quot;Where do you want to travel?&amp;quot; C.9</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>