<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1024">
  <Title>The CommandTalk Spoken Dialogue System*</Title>
  <Section position="2" start_page="0" end_page="595" type="metho">
    <SectionTitle>
2 Example Dialogues
</SectionTitle>
    <Paragraph position="0"> The following examples constitute a single extended dialogue illustrating the capabilities of the dialogue manager with regard to structured dialogue, clarification and correction, changes in initiative, integration of speech and gesture, and sensitivity to events occurring in the underlying simulated world. 1  (r) I will create CEV at FQ 643 576 Utterances 1 and 3 illustrate typical successful interactions between an operator and the system. When no exceptional event occurs, CommandTalk does not respond verbally. However, it does provide an audible tone to indicate that it has completed processing. For a successful command, it produces a rising tone, illustrated by the (r) symbol in utterances 2 and 4. For an unsuccessful command it produces a falling tone, illustrated by the (r) symbol in utterances 12 and 14. 2  logue version of the system. They were added because we observed that users did not always notice when the system had not understood them correctly, and a textual error message alone did not always get the user's attention. These tones also perform basic grounding behavior. null  Utterance 6 demonstrates a case where, although the system successfully completed the command, it chose to provide an explicit confirmation. Explicit confirmations can be given at any time. In this case, the system chose to give the confirmation because it performed a nontrivial reference, resolving &amp;quot;here&amp;quot; to the map coordinates given by the gesture, FQ 643 576. Similar situations in which the system gives an explicit confirmation are the resolution of pronouns and elided, definite or plural noun phrases.</Paragraph>
    <Paragraph position="1">  Utterance 9 is a correction of utterance 7, and is interpreted as though the operator had said &amp;quot;Put Objective Alpha here&amp;quot;. This illustrates two points. First, since utterance 7 was successful, the system undoes its effects (that is, deletes Objective Golf) before creating Objective Alpha. Second, although the edited utterance contains the word &amp;quot;here&amp;quot;, the gesture that was used to resolve that is no longer available. The system keeps track of gestural information along with linguistic information in its representation of context in order to interpret corrections. null  Example 3 illustrates a structured discourse segment containing two subsegments. Utterance 11 is uninterpretable for two reasons: the reference to &amp;quot;CEV&amp;quot; is ambiguous, and Objective Golf does not exist. The first difficulty is resolved in discourse segment 12-13, and the second in discourse segment 14-16. Notice that the operator is not required to answer the question posed by the system in utterance 14, but is free to correct the system's misunderstanding of utterance 11 even though it is not the immediately prior utterance. This is true because utterance 13 (the most recent utterance) is interpreted as if the operator had said &amp;quot;100All  Example 4 demonstrates a case where, although there are no errors in the operator's utterance, the system requires additional information before it can execute the command. Also note that the question asked by the system in utterance 18 is answered with an isolated gesture. null Ex. 5: Delayed Response U 21 A13 continue to Checkpoint 1 in a column formation.</Paragraph>
    <Paragraph position="2"> S 22 (r) There is no A13. Which unit should proceed in a column formation to Checkpoint 17  formation to Checkpoint 1.</Paragraph>
    <Paragraph position="3"> In example 5, the system asks a question but the operator needs to perform some other activity before answering it. The question asked by the system in utterance 22 is answered by the operator in utterance 25. Due to the intervening material, the most natural way to answer the question posed in utterance 22 is with a  Example 6 demonstrates the use of a guard, or test to see if a situation holds. In utterance 27, a presupposition failure occurs, leading to the open proposition expressed in utterance 28. A guard, associated with the open proposition, tests to see if the system can successfully resolve &amp;quot;Objective Bravo&amp;quot;. Rather than answering the question in utterance 28, the operator chooses to create Objective Bravo. The system then tests the guard, which succeeds because Objective Bravo now exists. The system therefore takes dialogue initiative by asking the operator in utterance 31 if that operator would like to carry out the original command. Although, in this case, the simulated world changed in direct response to a linguistic act, in general the world can change for a variety of reasons, including the operator's activities on the GUI or the activities of other operators.</Paragraph>
  </Section>
  <Section position="3" start_page="595" end_page="595" type="metho">
    <SectionTitle>
3 Language Interpretation and
Generation
</SectionTitle>
    <Paragraph position="0"> The language used in CommandTalk is derived from a single grammar using Gemini (Dowding et al., 1993), a unification-based grammar formalism. This grammar is used to provide all the language modeling capabilities of the system, including the language model used in the speech recognizer, the syntactic and semantic interpretation of user utterances (Dowding et al., 1994), and the generation of system responses (Shieber et al., 1990).</Paragraph>
    <Paragraph position="1"> For speech recognition, Gemini uses the Nuance speech recognizer. Nuance accepts language models written in a Grammar Specification Language (GSL) format that allows context-free, as well as the more commonly used finite-state, models. 3 Using a technique described in (Moore, 1999), we compile a context-free covering grammar into GSL format from the main Gemini grammar.</Paragraph>
    <Paragraph position="2"> This approach of using a single grammar source for both sides of the dialogue has several advantages. First, although there are differences between the language used by the system and that used by the speaker, there is a large degree of overlap, and encoding the grammar once is efficient. Second, anecdotal evidence suggests that the language used by the system influences the kind of language that speakers use in response. This gives rise to a consistency problem if the language models used for interpretation and generation are developed independently.</Paragraph>
    <Paragraph position="3"> The grammar used in CommandTalk contains features that allow it to be partitioned into a set of independent top-level grammars. For instance, CommandTalk contains related, but distinct, grammars for each of the four armed services (Army, Navy, Air Force, and Marine Corps). The top-level grammar currently in use by the speech recognizer can be changed dynamically. This feature is used in the dialogue manager to change the top-level grammar, depending on the state of the dialogue. Currently in CommandTalk, for each service there are two main grammars, one in which the user is free to give any top-level command, and another that contains everything in the first grammar, plus isolated noun phrases of the semantic types that can be used as answers to wh-questions, as well as answers to yes/no questions.</Paragraph>
    <Section position="1" start_page="595" end_page="595" type="sub_section">
      <SectionTitle>
3.1 Prosody
</SectionTitle>
      <Paragraph position="0"> A separate Prosody agent annotates the system's utterances to provide cues to the speech synthesizer about how they should be produced.</Paragraph>
      <Paragraph position="1"> It takes as input an utterance to be spoken, along with its parse tree and logical form. The output is an expression in the Spoken Text Markup Language 4 (STML) that annotates the locations and lengths of pauses and the locations of pitch changes.</Paragraph>
    </Section>
    <Section position="2" start_page="595" end_page="595" type="sub_section">
      <SectionTitle>
3.2 Speech Synthesis
</SectionTitle>
      <Paragraph position="0"> Speech synthesis is performed by another agent that encapsulates the Festival speech synthesizer. Festival 5 was developed by the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Festival was selected because it accepts STML commands, is available for research, educational, and individual use without charge, and is open-source.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="595" end_page="595" type="metho">
    <SectionTitle>
4 Dialogue Manager
</SectionTitle>
    <Paragraph position="0"> The role of the dialogue manager in CommandTalk is to manage the representation of linguistic context, interpret user utterances within that context, plan system responses, and set the speech recognition system's language model. The system supports natural, structured mixed-initiative dialogue and multi-modal interactions.</Paragraph>
    <Paragraph position="1"> When interpreting a new utterance from the user, the dialogue manager considers these possibilities in order:  1. Corrections: The utterance is a correction of a prior utterance.</Paragraph>
    <Paragraph position="2"> 2. Transitions/Responses: The utterance is a continuation of the current discourse segment. null 3. New Commands/Questions: The utterance is initiating a new discourse segment.</Paragraph>
    <Paragraph position="3">  The following sections will describe the data structures maintained by the dialogue manager, and show how they are affected as the dialogue manager processes each of these three types of user utterances.</Paragraph>
    <Section position="1" start_page="595" end_page="595" type="sub_section">
      <SectionTitle>
4.1 Dialogue Stack
</SectionTitle>
      <Paragraph position="0"> CommandTalk uses a dialogue stack to keep track of the current discourse context. The dialogue stack attempts to keep track of the open discourse segments at each point in the dialogue. Each stack frame corresponds to one user-system discourse pair, and contains at least the following elements:  tations of the items referred to in each system and user utterance a gesture space containing the gestures used in the interpretation of each user utterance null * an optional guard The semantic representation of the system response is related to the background, but there are cases where the background may contain more information than the response. For example, in utterance 28 the system could have simply said &amp;quot;There is no Objective Bravo&amp;quot;, and omitted the explicit follow-up question. In this case, the background may still contain the open proposition.</Paragraph>
      <Paragraph position="1"> Unlike in dialogue analyses carried out on completed dialogues (Grosz and Sidner, 1986), the dialogue manager needs to maintain a stack of all open discourse segments at each point in an on-going dialogue. When a system allows corrections, it can be difficult to determine when a user has completed a discourse segment.</Paragraph>
      <Paragraph position="2">  (r) There is no point named Objective Charlie. What point should I center on?</Paragraph>
      <Paragraph position="4"> (r) I will center on FQ 950 650 I said 55 65 (r) I will center on FQ 550 650 In example 7, for instance, when the user answers the question in utterance 36, the system will pop the frame corresponding to utterances 34-35 off the stack. However, the information in that frame is necessary to properly interpret the correction in utterance 38. Without some other mechanism it would be unsafe to ever pop a  frame from the stack, and the stack would grow indefinitely. Since the dialogue stack represents our best guess as to the set of currently open discourse segments, we want to allow the system to pop frames from the stack when it believes discourse segments have been closed. We make use of another representation, the dialogue trail, to let us to recover from these moves if they prove to be incorrect.</Paragraph>
      <Paragraph position="5"> The dialogue trail acts as a history of all dialogue stack operations performed. Using the trail, we record enough information to be able to restore the dialogue stack to any previous configuration (each trail entry records one operation taken, the top of the dialog stack before the operation, and the top of the dialog stack after). Unlike the stack, the dialogue trail represents the entire history of the dialogue, not just the set of currently open propositions. The fact that the dialogue trail can grow arbitrarily long has not proven to be a problem in practice since the system typically does not look past the top item in the trail.</Paragraph>
    </Section>
    <Section position="2" start_page="595" end_page="595" type="sub_section">
      <SectionTitle>
4.2 Finite State Machines
</SectionTitle>
      <Paragraph position="0"> Each stack frame in the dialogue manager contains a unique dialogue state identifier. These states form a collection of finite-state machines (FSMs), where each FSM describes the turns comprising a particular discourse segment. The dialogue stack is reminiscent of a recursive transition network, in that the stack records the system's progress through a series of FSMs in parallel. However, in this case, the stack operations are not dictated explicitly by the labels on the FSMs, but stack push operations correspond to the onset of a discourse segment, and stack pop operations correspond to the conclusion of a discourse segment.</Paragraph>
      <Paragraph position="1"> Most of the FSMs currently used in CommandTalk coordinate dialogue initiative. These FSMs have a very simple structure of at most two states. For instance, there are FSMs representing discourse segments for clarification questions (utterances 23-24), reference failures (utterances 27-28), corrections (utterances 910), and guards becoming true (utterances 3133). CommandTalk currently uses 22 such small FSMs. Although they each have a very simple structure, they compose naturally to support more complex dialogues. In these sub-dialogues the user retains the task initiative, but the system may temporarily take the dialogue initiative. This set of FSMs comprises the core dialogue competence of the system.</Paragraph>
      <Paragraph position="2"> In a similar way, more complex FSMs can be designed to support more structured dialogues, in which the system may take more of the task initiative. The additional structure imposed varies from short 2-3 turn interactions to longer &amp;quot;form-filling&amp;quot; dialogues. We currently have three such FSMs in CommandTalk: The Embark/Debark command has four required parameters; a user may have difficulty expressing them all in a single utterance. CommandTalk will query the user for missing parameters to fill in the structure of the command.</Paragraph>
      <Paragraph position="3"> The Infantry Attack command has a number of required parameters, a potentially unbounded number of optional parameters, and some constraints between optional arguments (e.g., two parameters are each optional, but if one is specified then the other must be also).</Paragraph>
      <Paragraph position="4"> The Nine Line Brief is a strMght-forward form-filling command with nine parameters that should be provided in a specified order. null When the system interprets a new user utterance that is not a correction, the next alternative is that it is a continuation of the current discourse segment. Simple examples of this kind of transition occur when the user is answering a question posed by the system, or when the user has provided the next entry in a form-filling dialogue. Once the transition is recognized, the current frame on top of the stack is popped. If the next state is not a final state, then a new frame is pushed corresponding to the next state. If it is a final state, then a new frame is not created, indicating the end of the discourse segment. null The last alternative for a new user utterance is that it is the onset of a new discourse segment. During the course of interpretation of the utterance, the conditions for entering one or more new FSMs may be satisfied by the utterance.</Paragraph>
      <Paragraph position="5"> These conditions may be linguistic, such as presupposition failures, or can arise from events that occur in the simulation, as when a guard  is tested in example 6. Each potential FSM has a corresponding priority (error, warning, or good). An FSM of the highest priority will be chosen to dictate the system's response.</Paragraph>
      <Paragraph position="6"> One last decision that must be made is whether the new discourse segment is a subsegment of the current segment, or if it should be a sibling of that segment. The heuristic thatwe use is to consider the new segment a subsegment if the discourse frame on top of the stack contains an open proposition (as in utterance 23). In this case, we push the new frame on the stack. Otherwise, we consider the previous segment to now be closed (as in utterance 3), and we pop the frame corresponding to it prior to pushing on the new frame.</Paragraph>
    </Section>
    <Section position="3" start_page="595" end_page="595" type="sub_section">
      <SectionTitle>
4.3 Mechanisms for Reference
</SectionTitle>
      <Paragraph position="0"> CommandTalk employs two mechanisms for maintaining local context and performing reference: a list of salient objects in the simulation, and focus spaces of linguistic items used in the dialogue.</Paragraph>
      <Paragraph position="1"> Since CommandTalk is controlling a distributed simulation, events can occur asynchronously with the operator's linguistic acts, and objects may become available for reference independently of the on-going dialogue. For instance, if an enemy unit suddenly appears on the operator's display, that unit is available for immediate reference, even if no prior linguistic reference to it has been made. The ModSAF agent notifies the dialogue manager whenever an object is created, modified, or destroyed, and these objects are stored in a salience list in order of recency. The salience list can also be updated when simulation objects are referred to using language.</Paragraph>
      <Paragraph position="2"> The salience list is not part of the dialogue stack. It does not reflect attentional state; rather, it captures recency and &amp;quot;known&amp;quot; information. null While the salience list contains only entities that directly correspond to objects in the simulation, focus spaces contain representations of entities realized in linguistic acts, including objects not directly represented in the simulation. This includes objects that do not exist (yet), as in &amp;quot;Objective Bravo&amp;quot; in utterance 28, which is referred to with a pronoun in utterance 29, and sets of objects introduced by plural noun phrases. All items referred to in an utterance are stored in a focus space associated with that utterance in the stack frame. There is one focus space per utterance.</Paragraph>
      <Paragraph position="3"> Focus spaces can be used during the generation of pronouns and definite noun phrases. Although at present CommandTalk does not generate pronouns (we choose to err on the side of verbosity, to avoid potential confusion due to misrecognitions), focus spaces could be used to make intelligent decisions about when to use a pronoun or a definite reference. In particular, while it might be dangerous to generate a pronoun referring to a noun phrase that the user has used, it would be appropriate to use a pronoun to refer to a noun phrase that the system has used.</Paragraph>
      <Paragraph position="4"> Focus spaces are also used during the interpretation of responses and corrections. In these cases the salience list reflects what is known now, not what was known at the time the utterance being corrected or clarified was made.</Paragraph>
      <Paragraph position="5"> The focus spaces reflect what was known and in focus at that earlier time; they track attentional state. For instance, imagine example 6 had instead been: Ex. 6b:</Paragraph>
      <Paragraph position="7"/>
    </Section>
    <Section position="4" start_page="595" end_page="595" type="sub_section">
      <SectionTitle>
Focusing
</SectionTitle>
      <Paragraph position="0"> A14 advance there.</Paragraph>
      <Paragraph position="1"> (r) There is no A14. Which unit should advance to Checkpoint 1? Create CEV at 635 545 and name it A14.</Paragraph>
      <Paragraph position="2"> At the end of utterance 42 the system will reinterpret utterance 40, but the most recent location in the salience list is FQ 635 545 rather than Checkpoint 1. The system uses the focus space to determine the referent for &amp;quot;there&amp;quot; at the time utterance 40 was originally made. In conclusion, CommandTalk's dialogue manager uses a dialogue stack and trail, reference mechanisms, and finite state machines to handle a wide range of different kinds of dialogue, including form-filling dialogues, freeflowing mixed-initiative dialogues, and dialogues involving multi-modality.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="595" end_page="595" type="metho">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> CommandTalk differs from other recent spoken language systems in that it is a command and control application. It provides a particularly  interesting environment in which to design spoken dialogue systems in that it supports distributed stochastic simulations, in which one operator controls a certain collection of forces while other operators simultaneously control other allied and/or opposing forces, and unexpected events can occur that require responses in real time. Other applications (Litman et al., 1998; Walker et al., 1998) have been in domains that were sufficiently limited (e.g., queries about train schedules, or reading email) that the system could presume much about the user's goals, and make significant contributions to task initiative. However, the high number of possible commands available in CommandTalk, and the more abstract nature of the user's high-level goals (to carry out a simulation of a complex military engagement) preclude the system from taking significant task initiative in most cases. The system most closely related to CommandTalk in terms of dialogue use is TRIPS (Ferguson and Allen, 1998), although there are several important differences. In contrast to TRIPS, in CommandTalk gestures are fully incorporated into the dialogue state. Also, CommandTalk provides the same language capabilities for user and system utterances.</Paragraph>
    <Paragraph position="1"> Unlike other simulation systems, such as QuickSet (Cohen et al., 1997), CommandTalk has extensive dialogue capabilities. In Quick-Set, the user is required to confirm each spoken utterance before it is processed by the system (McGee et al., 1998).</Paragraph>
    <Paragraph position="2"> Our earlier work on spoken dialogue in the air travel planning domain (Bratt et al., 1995) (and related systems) interpreted speaker utterances in context, but did not support structured dialogues. The technique of using dialogue context to control the speech recognition state is similar to one used in (Andry, 1992).</Paragraph>
  </Section>
  <Section position="6" start_page="595" end_page="595" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> We have discussed some aspects of CommandTalk that make it especially suited to handle different kinds of interactions. We have looked at the use of a dialogue stack, salience information, and focus spaces to assist interpretation and generation. We have seen that structured dialogues can be represented by composing finite-state models. We have briefly discussed the advantages of using the same grammar for all linguistic aspects of the system. It is our belief that most of the items discussed could easily be transferred to a different domain.</Paragraph>
    <Paragraph position="1"> The most significant difficulty with this work is that it has been impossible to perform a formal evaluation of the system. This is due to the difficulty of collecting data in this domain, which requires speakers who are both knowledgeable about the domain and familiar with ModSAF. CommandTalk has been used in simulations of real military exercises, but those exercises have always taken place in classified environments where data collection is not permitted. null To facilitate such an evaluation, we are currently porting the CommandTalk dialogue manager to the domain of air travel planning. There is a large body of existing data in that domain (MADCOW, 1992), and speakers familiar with the domain are easily available.</Paragraph>
    <Paragraph position="2"> The internal representation of actions in CommandTalk is derived from ModSAF. We would like to port that to a domain-independent representation such as frames or explicit representations of plans.</Paragraph>
    <Paragraph position="3"> Finally, there are interesting options regarding the finite state model. We are investigating other representations for the semantic contents of a discourse segment, such as frames or active templates.</Paragraph>
  </Section>
class="xml-element"></Paper>