<?xml version="1.0" standalone="yes"?>
<Paper uid="J95-3001">
  <Title>An Architecture for Voice Dialog Systems Based on Prolog-Style Theorem Proving</Title>
  <Section position="2" start_page="0" end_page="283" type="metho">
    <SectionTitle>
2. Target Behaviors
</SectionTitle>
    <Paragraph position="0"> The purpose of the architecture is to deliver to users in real time the behaviors needed for efficient human-machine dialog. Specifically, it is aimed at achieving the following: Convergence to a goal. Efficient dialog requires that each participant understand the purpose of the interaction and have the necessary prerequisites to cooperate in its achievement. This is the intentional structure of Grosz and Sidner (1986), the goal-oriented mechanism that gives direction to the interaction. The primary required facilities are a problem solver that can deduce the necessary action sequences and a set of subsystems capable of carrying out those sequences.</Paragraph>
    <Paragraph position="1"> Subdialogs and effective movement between them. Efficient human dialog is usually segmented into utterance sequences, subdialogs, that are individually aimed at achieving relevant subgoals (Grosz 1978; Linde and Goguen 1978; Polanyi and Scha 1983; Reichman 1985). These are called &amp;quot;segments&amp;quot; by Grosz and Sidner (1986) and constitute the linguistic structure defined in their paper. The global goal is approached by a series of attempts at subgoals each of which involves a set of interactions, the subdialogs.</Paragraph>
    <Paragraph position="2"> An aggressive strategy for global success is to choose the subgoals judged most likely to lead to success and carry out their associated subdialogs. As the system proceeds on a given subdialog, it should always be ready to drop it abruptly if some other subdialog suddenly seems more appropriate. This leads to the fragmented style that so commonly appears in efficient human communication. A subdialog is opened, leading to another, then another, then a jump to a previously opened subdialog, and so forth, in an unpredictable order until the necessary subgoals have been solved for an overall success.</Paragraph>
    <Paragraph position="3"> An accounting for user knowledge and abilities. Cooperative problem solving involves maintaining a dynamic profile of user knowledge, termed a user model. This concept is described, for example, in Kobsa and Wahlster (1988, 1989), Chin (1989), Cohen and Jones (1989), Finin (1989), Lehman and Carbonell (1989), Morik (1989), and Paris (1988). The user model specifies information needed for efficient interaction with the conversational partner. Its purpose is to indicate what needs to be said to the user to enable the user to function effectively. It also indicates what should be omitted because of existing user knowledge.</Paragraph>
    <Paragraph position="4"> Because considerable information is exchanged during the dialog, the user model changes continuously. Mentioned facts are stored in the model as known to the user and are not repeated. Previously unmentioned information may be assumed to be unknown and may be explained as needed. Questions from the user may indicate lack of knowledge and result in the removal of items from the user model.</Paragraph>
    <Paragraph position="5"> Change of initiative. A real possibility in a cooperative interaction is that the user's problem-solving ability, either on a given subgoal or on the global task, may exceed that of the machine. When this occurs, an efficient interaction requires that the machine yield control so that the more competent partner can lead the way to the fastest possible solution. Thus, the machine must be able to carry out its own problem-solving process and direct the actions to task completion or yield to the user's control and respond cooperatively to his or her requests. This is a variable initiative dialog, as studied by Kitano and Van Ess-Dykema (1991), Novick (1988), Whittaker and Stenton (1988), and Walker and Whittaker (1990). As a pragmatic issue, we have found that at least four initiative modes are useful: (1) directive. The computer has complete dialog control. It recommends a subgoal for completion and will use whatever dialog is necessary to obtain the needed item of knowledge related to the subgoal.</Paragraph>
    <Paragraph position="6">  Smith, Hipp, and Biermann An Architecture for Voice Dialog Systems (2) suggestive. The computer still has dialog control, but not as strongly. The computer will make suggestions about the subgoal to perform next, but it is also willing to change the direction of the dialog according to stated user preferences.</Paragraph>
    <Paragraph position="7"> (3) declarative. The user has dialog control, but the computer is free to mention relevant, though not required, facts as a response to the user's statements.</Paragraph>
    <Paragraph position="8"> (4) passive. The user has complete dialog control. The computer responds  directly to user questions and passively acknowledges user statements without recommending a subgoal as the next course of action.</Paragraph>
    <Paragraph position="9"> Expectation of user input. Since all interactions occur in the context of a current subdialog, the user's input is far more predictable than would be indicated by a general grammar for English. In fact, the current subdialog specifies the focus of the interaction, the set of all objects and actions that are locally appropriate. This is the attentional structure described by Grosz and Sidner (1986), and its most important function in our system is to predict the meaning structures the user is likely to communicate in an input. For illustration, the opening of a chassis cover plate will often evoke comments about the objects behind the cover; the measurement of a voltage is likely to include references to a voltmeter, leads, voltage range, and the locations of measurement points.</Paragraph>
    <Paragraph position="10"> Thus the subdialog structure provides a set of expected utterances at each point in the conversation, and these have two important roles:</Paragraph>
    <Paragraph position="12"> (2) The expected utterances provide strong guidance for the speech recognition system so that error correction can be enhanced. Where ambiguity arises, recognition can be biased in the direction of meaningful statements in the current context. Earlier researchers who have investigated this insight are Erman et al. (1980), Walker (1978), Fink and Biermann (1986), Mudler and Paulus (1988), Carbonell and Pierrel (1988), Young (1990), and Young et al. (1989).</Paragraph>
    <Paragraph position="13"> The expected utterances from subdialogs other than the current one can indicate that a shift from the current subdialog is occurring. Thus, expectations are one of the primary mechanisms needed for tracking the conversation as it jumps from subdialog to subdialog. This is known elsewhere as the plan recognition problem, and it has received much attention in recent years. See, for example, Allen and Perrault (1980), Allen (1983), Kautz (1991), Litman and Allen (1987), Pollack (1986), and Carberry (1988, 1990).</Paragraph>
    <Paragraph position="14"> Systems capable of all of the above behaviors are rare, as has been observed by Allen et al. (1989): &amp;quot;no one knows how to fit all of the pieces together.&amp;quot; An impressive early example along these lines is the MINDS system of Young et al. (1989). This system maintains an AND-OR goal tree to represent the problem-solving space, and it engages in dialog in the process of trying to achieve subgoals in the tree. A series of interactions related to a given subgoal constitute a subdialog, and expectations associated with currently active goals are used to predict incoming user utterances. These predictions are further sharpened by a user model and then are passed down to the signal-processing level to improve speech recognition. The resulting system  Computational Linguistics Volume 21, Number 3 demonstrated dramatic improvements over performance levels that had been observed without such predictive capabilities. For example, the effective perplexity in one test was reduced from 242.4 to 18.3 using dialog level constraints, while word accuracy recognition was increased from 82.1 percent to 97.0 percent.</Paragraph>
    <Paragraph position="15"> In another dialog project, Allen et al. (1989) describe an architecture that concentrates on representations for subdialog mechanisms and their interactions with sentence-level processing. Their mechanism uses the blackboard organization, which displays at a global level all pertinent information and has subroutines with specialty functions to update the blackboard. There are subroutines, for example, to do processing at the lexical, syntactic, and semantic levels, to handle referencing problems, to manage discourse structure and speech act issues, tense, and much more. A typical task for their system is to properly parse and analyze a given dialog.</Paragraph>
    <Paragraph position="16"> A third interesting project has produced the TINA system (Seneff 1992), which uses probabilistic networks to parse token sequences provided by a speech recognition system, SUMMIT, created by Zue et al. (1989). The networks and their probabilities are created automatically from grammatical rules and text samples input by the designer. Their main utility is to provide expectation for error correction as we do in our system. However, their expectation is primarily syntax-based while ours uses structure from all levels, subdialog (or focus-based), semantic, and syntactic. Their semantics is built directly into the parse trees and translated into SQL for access to a database.</Paragraph>
    <Paragraph position="17"> Our approach resembles that of Young et al. (1989), except that we employ Prolog-style rules for the knowledge base and the associated proofs for directing the goal-oriented behavior. This leads to the &amp;quot;missing axiom theory&amp;quot; we describe for processing discourse, and some rather simple mechanisms for employing the user model, for managing multiple subdialogs, and for creating and using expectation. Our theory also shows how variable initiative is built into the same simple architecture. Our implementation shows that the mechanisms are efficient enough to run in real time and sufficiently well designed to yield successful dialogs with humans.</Paragraph>
  </Section>
  <Section position="3" start_page="283" end_page="285" type="metho">
    <SectionTitle>
3. Sample Dialogs
</SectionTitle>
    <Paragraph position="0"> Many of the points discussed above are illustrated by the following excerpts from repair dialogs using our system. In the following example, the system is in directive mode. It has determined that a particular knob should be set to 10 and then a voltage measurement should be made. The movement to subdialogs is indicated by indentation.</Paragraph>
    <Paragraph position="1">  1. Computer: 2. User: 3. Computer: 4. User: 5. Computer: 6. User: 7. Computer: 8. User: 9. Computer: 10. User:  Put the knob to one zero.</Paragraph>
    <Paragraph position="2"> Okay.</Paragraph>
    <Paragraph position="3"> What is the voltage between connector one two one and connector three four? I do not know.</Paragraph>
    <Paragraph position="4"> Locate the voltmeter.</Paragraph>
    <Paragraph position="5"> Done.</Paragraph>
    <Paragraph position="6"> Add a wire between the minus com hole on the voltmeter and the connector one two one. How do I do that? Connect the end of the black wire with the small plug to the minus corn hole on the voltmeter. Done.</Paragraph>
    <Paragraph position="7">  Smith, Hipp, and Biermann An Architecture for Voice Dialog Systems 11. Computer: Connect the end of the black wire with the large plug to connector one two one. 12. User: Which knob? 13. Computer: Near the top boundary and in the center is a green region. Now, in the middle right of this green region is the control knob.</Paragraph>
    <Paragraph position="8"> 14. User: Okay.</Paragraph>
    <Paragraph position="9"> 15. Computer: Put the knob to zero one. 16. User: Done.</Paragraph>
    <Paragraph position="10"> 17. Computer: Connect the end of the black wire with the large plug to connector one two one.</Paragraph>
    <Paragraph position="11"> 18. User: Where is connector three four? 19. Computer: Near the left boundary and toward the middle is a green region. Now, in the bottom center of this green region is connector three four. 20. User: Okay.</Paragraph>
    <Paragraph position="12"> 21. Computer: What is the voltage between connector one two one and connector three four? 22. User: Zero.</Paragraph>
    <Paragraph position="13">  This dialog fragment is interesting in that it illustrates goal-oriented behavior while simultaneously jumping between subdialogs. The dialog also illustrates the effects of the user model. For example, the user model at the beginning of the dialog indicates that the user can find the knob (find(knob)). This assertion prevents the system from giving any explanation about the location of the knob. However, utterance 12 indicates an error in the current model and the system removes this assertion from the user model. This leads to a clarification subdialog explaining the position of the knob and a response from the user, &amp;quot;Okay.&amp;quot; The user's affirmation then enables the system to add the assertion find(knob) back into the user model and proceed. The use of expectation can also be illustrated by the example. When the user is preparing to respond with utterance 2, the system expects the response to pertain to the command &amp;quot;Put the knob to one zero.&amp;quot; The primary expectations are for an assertion that this request has been done or for a question about how to do this task. Consequently, speech recognition can be biased to accept this limited class of utterances. A different use of expectation appears in sentence 12, where local expectations were related to the connection of a wire. Here, local expectations failed but a previous expectation from sentence 1 is fulfilled. So control is returned to the previously active subdialog where the goal was to &amp;quot;Put the knob to one zero.&amp;quot; The following dialog segment illustrates the behavior of the system in declarative mode.</Paragraph>
    <Paragraph position="14">  1. Computer: This is the circuit-fixit-shop. How may I help you? 2. User: RSlll is not working.</Paragraph>
    <Paragraph position="15"> 3. Computer: I am familiar with that circuit. The LED is supposed to be displaying alternately flashing one and seven.</Paragraph>
    <Paragraph position="16"> 4. User: There is no wire on one zero four.</Paragraph>
    <Paragraph position="17"> 5. Computer: There is supposed to be a wire between connector one zero four and connector one zero two.</Paragraph>
    <Paragraph position="18"> 6. User: The circuit is working.</Paragraph>
    <Paragraph position="19"> 7. Computer: Glad to have been of assistance. Good-bye.  Here the machine diagnoses as well as it can the current subdialog from user comments. Then it presents information it surmises may be helpful, specifically facts  Computational Linguistics Volume 21, Number 3 from the currently active subdialog that are not in the user model. The user may ignore suggestions by following his or her own preferred dialog paths, as occurs in statements 4 and 6. In each case, the system tracks the selected topic and responds in an appropriate manner.</Paragraph>
  </Section>
  <Section position="4" start_page="285" end_page="292" type="metho">
    <SectionTitle>
4. Mechanisms for Achieving the Behaviors
</SectionTitle>
    <Paragraph position="0"> This paper presents a single self-consistent mechanism capable of achieving simultaneously the above-described behavior. We will examine sequentially (1) a theory of task-oriented language, (2) an implementation of the subdialog feature, (3) a method for accounting for user knowledge, (4) mechanisms needed to obtain variable initiative, and (5) the implementation and uses of expectation. The section following this one will give a detailed example showing all of these mechanisms working together.</Paragraph>
    <Section position="1" start_page="285" end_page="286" type="sub_section">
      <SectionTitle>
4.1 A Theory of Task-Oriented Language
</SectionTitle>
      <Paragraph position="0"> The central mechanism of our architecture is a Prolog-style theorem-proving system.</Paragraph>
      <Paragraph position="1"> The goal of the dialog is stated in a Prolog-style goal and rules are invoked to &amp;quot;prove the theorem&amp;quot; or &amp;quot;achieve the goal&amp;quot; in a normal top-down fashion. If the proof succeeds using internally available knowledge, the dialog terminates without any interaction with the user. Thus it completes a dialog of length zero. More typically, however, the proof fails, and the system finds itself in need of more information before it can proceed. In this case, it looks for so-called &amp;quot;missing axioms,&amp;quot; which would help complete the proof, and it engages in dialog to try to acquire them.</Paragraph>
      <Paragraph position="2"> As an example, the system might have the goal of determining the position, up or down, of a certain switch swl. This might appear in the proof tree as observeposition(swl,X) *-- find(swl), reportposition(swl,X) That is, it is necessary to find swl and then to report its position X. It is possible that in the process of the interaction the user has both found the switch and reported its position and that both find(swl) and report position(swl,up) appear in the database.</Paragraph>
      <Paragraph position="3"> Then the system will achieve observeposition(swl,up) and answer its question without interaction with the user. It is also possible that the user has previously found the switch but not recently reported its position. Here theorem proving would succeed with the first goal, find(swl), but fail to find reportposition(swl,-). Then reportposition(swl,X) would become a missing axiom and could be returned to the dialog controller for possible vocalization. The third possibility is that neither find(swl) or reportposition(swl,-) would exist in the database, in which case both could be sent to the controller as missing axioms for possible vocalization. (The decision as to whether to send a missing axiom to the controller depends on whether the axiom represents an answerable question by the user. The system maintains a mechanism for indicating what is reasonable to ask and what is not, as described below.) Thus our system is built around a theorem prover at its core, and the role of language is to supply missing axioms. Our system engages in dialog only for the purpose of enabling theorem proving, and voice interactions do not otherwise occur.</Paragraph>
      <Paragraph position="4"> (A later publication will propose other uses of voice interactions, but our current system uses them for only this purpose.) The user may not always respond with the desired information. He or she may respond with a request for a clarifcation such as &amp;quot;Where is the switch?&amp;quot; or with an unanticipated comment such as &amp;quot;There is no wire connected to terminal 102.&amp;quot; Thus the theorem prover needs to be immensely more flexible than ordinary Prolog, and this is the topic of the next section.</Paragraph>
      <Paragraph position="5">  Smith, Hipp, and Biermann An Architecture for Voice Dialog Systems</Paragraph>
    </Section>
    <Section position="2" start_page="286" end_page="286" type="sub_section">
      <SectionTitle>
4.2 Implementing the Subdialog Feature
</SectionTitle>
      <Paragraph position="0"> Instead of turning control over to a depth first policy, theorem proving in this system must allow for abrupt freezing of any proof and transfer of control to any other partially completed subproof or function of the system. The implementation of this was the IPSIM (Interruptible Prolog SIMulator) theorem prover, which can maintain a set of partially completed proofs and jump to the appropriate one as dialog proceeds. null The processing of IPSIM proceeds with normal theorem proving, but is interruptable in two ways. First, IPSIM may discover that there could be outside information that could be used to supply missing axioms, as described above. When this occurs, IPSIM can halt and pass control to the dialog controller, indicating an opportunity to engage in dialog. The dialog controller may choose to invoke the proposed interaction or it may select another action. Second, IPSIM may be interrupted by the dialog controller to inquire about proof status. In a real-time system, a theorem prover can never be released arbitrarily. Timing considerations by the controller may dictate the halting of a given proof and resorting to other action.</Paragraph>
      <Paragraph position="1"> Since voice dialog is always tied to proving a given subgoal, the set of all interactions related to that goal comprise a subdialog. The set of logical rules leading to the subgoal are, by definition, related to that subgoal, and the voice interactions will necessarily have the coherence that theorists (Hobbs 1979) have often discussed.</Paragraph>
      <Paragraph position="2"> The partial or completed proof of a subgoal is not erased or popped from any stack when processing moves to another part of the proof tree. This makes it possible to reopen any subdialog at a later time to clarify, revise, or continue that interaction. The reopening may be initiated either by the system because of a change in priorities in its agenda or by the user.</Paragraph>
      <Paragraph position="3"> Subdialogs are entered in several ways. First, normal theorem proving may create a new subgoal to be proved, and its related voice interactions will yield a subdialog.</Paragraph>
      <Paragraph position="4"> Second, the controller may halt an interaction that it deems unfruitful and send the system in pursuit of a new subgoal. Third, the user may initiate dialog on a new subgoal in ways that will be discussed below.</Paragraph>
    </Section>
    <Section position="3" start_page="286" end_page="288" type="sub_section">
      <SectionTitle>
4.3 Accounting for User Knowledge
</SectionTitle>
      <Paragraph position="0"> Theorem proving will often reach goals that can only be satisfied by interactions with the user. For example, if the machine has no means to manipulate some variable and the user does, the only alternative is to make a request of the user. It is necessary that the knowledge base store information related to what the user can be expected to do, and the system only should make requests that will be within this repertoire.</Paragraph>
      <Paragraph position="1"> Of course, the abilities of the user will depend on his or her level of expertise and experience in the current environment. Thus most users will know how to adjust a knob to a specified level even if they are novices; but they may be able to measure a voltage only after they have been told all of the steps at least once in the current situation.</Paragraph>
      <Paragraph position="2"> It is necessary to style outputs to the user to account for these variations, and the Prolog theorem-proving tree easily adapts to this requirement. The example above related to finding the position of a switch shows how this works. If the user knows, for example (according to the user model), how to find the switch, the model prevents the useless interaction related to finding the switch from occurring. If the user does not know (according to the user model) where the switch is, the model releases the system to so inform the user. The theory of user modeling thus is simply to specify user capabilities in the Prolog-style rules and let the natural execution of IPSIM select what to say or not say.</Paragraph>
      <Paragraph position="3">  Computational Linguistics Volume 21, Number 3 The user model must change continuously during a dialog as the interactions occur. Almost every statement will change the useVs knowledge base, and future interactions will be ill-conceived if the appropriate updates are not made. Acquisition of user model axioms is made from inferences based on user inputs. A description of the inferences is given below:  If the input indicates that the user has a goal to learn some information, then conclude that the user does not know about the information.</Paragraph>
      <Paragraph position="4"> If the input indicates that an action to achieve or observe a physical state was completed, then conclude that the user knows how to perform the action.</Paragraph>
      <Paragraph position="5"> If the input describes some physical state, then conclude that the user knows how to observe this physical state. In addition, if the physical state is a property, then infer that the user knows how to locate the object that has the property.</Paragraph>
      <Paragraph position="6"> If the input indicates that the user has not performed some primitive action, make the appropriate inference about the user's knowledge about how to perform this action.</Paragraph>
      <Paragraph position="7"> If the user has completed an action by completing each substep, then conclude that the user knows how to do the action.</Paragraph>
      <Paragraph position="8"> Infer that the user has intensional knowledge about a physical state if the user has knowledge on how to observe or achieve the physical state.</Paragraph>
      <Paragraph position="9"> Infer that the user has knowledge on how to observe a physical state if he or she has knowledge on how to achieve the physical state.</Paragraph>
      <Paragraph position="10"> The basic implementation of these rules is a &amp;quot;compute_inferences&amp;quot; predicate in Prolog that takes the meaning of the user's current utterance and causes inferences to be asserted into the axiom base. Here is an example from the Prolog code. It is the implementation of statement (2) given above: /* Inference 2: If we have learned that an action to achieve or observe a physical state was completed, then conclude that the physical state has the appropriate status and that the user knows how to perform the action. */</Paragraph>
      <Paragraph position="12"> makeAnference(ProofNum, mentaLstate(user, int_know~action(how_to_do( GoalAction)),true),infer(Meaning)).</Paragraph>
      <Paragraph position="13"> In typical dialogs, the user modeling system added a net of about 1.2 Prolog-style assertions to the user model per user utterance. There are both additions and deletions  Smith, Hipp, and Biermann An Architecture for Voice Dialog Systems occurring after each utterance, and this figure gives the average increase in assertions per user utterance.</Paragraph>
      <Paragraph position="14"> Formalizing a theory of what constitutes appropriate inferences could be a separate research project. What this research has contributed is a theory of usage for these inferences, which is usage by the theorem prover as it attempts to complete task goals.</Paragraph>
    </Section>
    <Section position="4" start_page="288" end_page="289" type="sub_section">
      <SectionTitle>
4.4 Mechanisms for Obtaining Variable Initiative
</SectionTitle>
      <Paragraph position="0"> Variable initiative dialog allows either participant to have control, and it allows the initiative to change between participants during the exchange. It also allows intermediate levels of control where one participant may gently rather than strongly lead the interactions.</Paragraph>
      <Paragraph position="1"> A system can participate in variable initiative dialog if it properly manages (1) the selection of the current subdialog, (2) the level of assertiveness in its outputs, and (3) the interpretation of its inputs. We discuss each in the following paragraphs.</Paragraph>
      <Paragraph position="2"> Selection of Subdialog. The most important aspect of dialog control is the ability to select the next subdialog to be entered. Very strong control means that the participant will select the subdialog and will ignore attempts by the partner to vary from it.</Paragraph>
      <Paragraph position="3"> Weaker control allows the partner to introduce minor but not major variations from the selected path. Loss of control means that the partner will select unconditionally the next subgoal.</Paragraph>
      <Paragraph position="4"> In our system, the four implemented levels of initiative follow these guidelines:  (1) Directive Mode. Unless the user explicitly needs some type of  clarification, the computer will select its response solely according to its next goal for the task. If the user expresses need for clarification about the previous goal, this must be addressed first. No interruptions to other subdialogs are allowed.</Paragraph>
      <Paragraph position="5"> (2) Suggestive Mode. The computer will again select its response according to its next goal for the task, but it will allow minor interruptions to subdialogs about closely related goals. As before, user requests for clarification of the previous goal have priority.</Paragraph>
      <Paragraph position="6">  (3) Declarative Mode. The user has dialog control. Consequently, the user can interrupt to any desired subdialog at any time, but the computer is free to mention relevant, though not required, facts as a response to the user's statements.</Paragraph>
      <Paragraph position="7"> (4) Passive Mode. The user has complete dialog control. Consequently, the  computer will passively acknowledge user statements. It will provide information only as a direct response to a user question.</Paragraph>
      <Paragraph position="8"> Level of Assertiveness of Outputs. The system must output statements that are compatible with its level of initiative. This means that the output generator must have a parameter that enables the system to specify assertiveness. Examples of the request to &amp;quot;turn the switch up&amp;quot; at various levels of assertiveness are as follows: Turn the switch up.</Paragraph>
      <Paragraph position="9"> Would you \[please\] turn the switch up? Can you turn the switch up? The switch can be turned up.</Paragraph>
      <Paragraph position="10"> Turning the switch up is necessary.</Paragraph>
      <Paragraph position="11">  Computational Linguistics Volume 21, Number 3 Two examples of querying for the switch position are as follows: What is the switch position? I need to know the switch position.</Paragraph>
      <Paragraph position="12"> (request) (indirect request) Interpretation of Inputs. Inputs from a passive participant can be expected to be more predictable and well behaved than those from a directive one. Our system does not account for this effect at this time. The only implemented variation in behavior concerns the treatment of silence. The system may allow rather longer silences when it is in passive mode than when it is in directive mode.</Paragraph>
    </Section>
    <Section position="5" start_page="289" end_page="292" type="sub_section">
      <SectionTitle>
4.5 The Implementation and Uses of Expectation
</SectionTitle>
      <Paragraph position="0"> The response received after a given input is likely to be related to the currently active subdialog. If it is not, then it may be related to a nearby active subdialog or, with less probability, a more remote one. The expectation facility provides a list of expected meanings organized in a hierarchy, and it is used for two purposes: (1) If the incoming utterance is syntactically near the syntax for an expected meaning in the active subdialog, the expectation provides a powerful error-correction mechanism. (2) If the incoming utterance is not in the locally active subdialog, the expectations of other active subdialogs provide a means for tracing the movement to those subdialogs. (This is known as &amp;quot;plan recognition&amp;quot; in the literature. See, for example, Allen and Perrault (1980), Allen (1983) and Carberry (1990). Expectations are specified in GADL (Goal and Action Description Language) form, an internal language for representing predicates. For example, the expectation that the user is going to report the setting of a switch would be represented as obs(phys_state(prop(switchl,state, PropValue), TruthStatus)).</Paragraph>
      <Paragraph position="1"> Expectation of user responses provides a model of the attentional state described by Grosz and Sidner (1986). It contains the list of semantic structures that have meaning for the current subdialog and for other active subdialogs. For example, after the computer produces an utterance that is an attempt to have a specific task step S performed, there are expectations for any of the following types of responses:  A statement about missing or uncertain background knowledge necessary for the accomplishment of S.</Paragraph>
      <Paragraph position="2"> A statement about a subgoal of S.</Paragraph>
      <Paragraph position="3"> A statement about the underlying purpose for S.</Paragraph>
      <Paragraph position="4"> A statement about ancestor task steps of which accomplishment of S is a part.</Paragraph>
      <Paragraph position="5"> A statement about another task step which, along with S, is needed to accomplish some ancestor task step.</Paragraph>
      <Paragraph position="6"> A statement indicating accomplishment of S.</Paragraph>
      <Paragraph position="7"> The central pragmatic issues for the management of expectation are, what are the sources of expectation (or how is the expectation list created) and how is it used. The following paragraphs describe both.</Paragraph>
      <Paragraph position="8">  Smith, Hipp, and Biermann An Architecture for Voice Dialog Systems Sources of expectation. The first source of expectation is the domain processor described below. One of the main tasks of this system is to supply the dialog machine with debugging queries such as &amp;quot;What is the LED showing?&amp;quot; A secondary task is to associate with each such query a list of expected answers. Thus the query about the LED would yield as expectations some possible descriptions for the LED. These are called situation_specific_expectations. The domain processor also supplies situation_related_expectations, which are not directly connected to the observation but which could naturally occur. For example, the LED query might also result in observations about the presence or absence of wires, the position of the power switch, and the presence or absence of a battery.</Paragraph>
      <Paragraph position="9"> The other source of expectations is the dialog_controller, also described below, which provides coordination for the complete system. The dialog controller manages a number of generic rules related to dialog and can provide the associated expectations. For example, &amp;quot;What is the LED showing?&amp;quot; can represent the action &amp;quot;observe the value for the display property of the object LED.&amp;quot; The associated task~specific_expectations would represent expectations based on this general notion with values for property and object instantiated to the situation values. Thus the task-specific expectations for the sample topic would include questions on the location of the object, on the definition of the property, and on how to perform the action. In addition, these expectations would include responses that can be interpreted as state descriptions of the relevant property. In general, they include potential questions and statements about subtasks of the current task. There are rules for 12 different generic actions (Smith 1991). The rules are based on a characterization of response types obtained during a Wizard-of-Oz study on the effects of restricted vocabulary (Moody 1988).</Paragraph>
      <Paragraph position="10"> The dialog controller also provides a broader class of expectations, called task-related expectations, which are based on general principles about the performance of actions. Example task-related expectations would include general requests for help or questions about the purpose of an action. Another important member of the task-related expectations are the expectations for topics that are ancestors of the current topic in the discourse structure. For example, the current topic could be the location of a connector, which could be a subtopic of connecting a voltmeter wire, which could be a subtopic of performing a voltage measurement. The task-related expectations for the location of this connector would include all the expectations related to the topics of connecting a voltmeter wire and performing a voltage measurement.</Paragraph>
      <Paragraph position="11"> Utilizing expectation. After the semantic expectations are computed, they are translated into linguistic expectations according to grammatical rules. Once the linguistic expectations are produced, they are labeled with an expectation cost, which is a measure of how strongly each is anticipated at the current point in the dialog. The situation-specific expectations are the most strongly anticipated, followed by the other three types. Meanings output by the minimum distance parsing algorithm (described below) have a corresponding utterance cost, which is the distance between the user's input and an equivalent well-formed phrase. Each meaning is matched with its corresponding dialog expectation, and its expectation and utterance costs are combined into a total cost by an expectation function. The meaning with the smallest total cost is selected to be the final output of the parser. An important side effect of matching meanings with expectations is the ability to interpret an utterance whose content does not fully specify its meaning. These semantic and linguistic expectations can provide the necessary context. Some examples are as follows: (a) The referent of pronouns (In the implemented system, the only pronoun is &amp;quot;it.&amp;quot;) The parser leaves the slot for the referent of &amp;quot;it&amp;quot; unspecified in its interpretation. If this interpretation of the utterance can be matched to the linguistic expectation, the value  Computational Linguistics Volume 21, Number 3 for &amp;quot;it&amp;quot; is filled with the value provided by the expectation. Consider the following example:</Paragraph>
      <Paragraph position="13"> (2) Computer: Turn the switch up. User: Where is it? In computing the task-specific expectations for the user utterance, one expectation is for a statement asking about the location of the object of interest in the current topic-in this case the switch. The parser interprets the user statement as a statement asking  about the location of an unspecified object. The linguistic expectation provides the value of the unspecified object.</Paragraph>
      <Paragraph position="14"> (b) The meaning of short answers (In the implemented system, these include such responses as &amp;quot;yes,&amp;quot; &amp;quot;no,&amp;quot; and &amp;quot;okay.&amp;quot;) The idea for each of these is similar. These utterances may have any of several meanings. The proper choice is determined by the expectations produced based on the situation. In the following example, it is likely that &amp;quot;okay&amp;quot; denotes affirmation of completion of the goal to turn up the switch: (1) Computer: Turn the switch up.</Paragraph>
      <Paragraph position="15"> (2) User: Okay.</Paragraph>
      <Paragraph position="16">  It is less likely that it denotes comprehension of the request. In any case, after statement (1), the situation-specific expectations include an expectation for an affirming utterance indicating completion, and this becomes the interpretation given to &amp;quot;okay.&amp;quot; Contrast this with the following:  (1) Computer: Turn up the switch.</Paragraph>
      <Paragraph position="17"> (2) User: Where is it? (3) Computer: In the lower left corner.</Paragraph>
      <Paragraph position="18"> (4) User: Okay.</Paragraph>
      <Paragraph position="19"> In this case, the interpretation of &amp;quot;okay&amp;quot; could be either that the location description (3) has been understood or that the original goal (1) has been accomplished. The expectation system scores the likelihood of each meaning and selects the most likely one using its scoring method.</Paragraph>
      <Paragraph position="20"> (c) Maintain dialog coherence. Consider the following subdialog taken from usage of the implemented system: (1) Computer: What is the voltage between connector 121 and connector 120? (2) User: I need help.</Paragraph>
      <Paragraph position="21"> (3) Computer: Locate the voltmeter.</Paragraph>
      <Paragraph position="22"> (4) User: Done.</Paragraph>
      <Paragraph position="23"> (5) Computer: Add a wire between the &amp;quot;- corn&amp;quot; hole on the voltmeter and connector 121.</Paragraph>
      <Paragraph position="24"> (6) User: Done.</Paragraph>
      <Paragraph position="25">  Smith, Hipp, and Biermann An Architecture for Voice Dialog Systems a (7) Computer: Add a wire between the &amp;quot;+ v omega a&amp;quot; hole on the voltmeter and connector 120.</Paragraph>
      <Paragraph position="26"> (8) User: Nine.</Paragraph>
      <Paragraph position="27">  When utterance (8) is spoken, there are two active task steps: (1) performing the voltage measurement and (2) connecting a wire between the &amp;quot;+ v omega a&amp;quot; hole and connector 120. The user response does not satisfy the missing axiom for completing the substep (7). The expectations for the response to (7) are checked, but this utterance is not one of them. However, (8) does satisfy the missing axiom for completing the main task step (1). It has meaning in that context and it is so interpreted.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="292" end_page="293" type="metho">
    <SectionTitle>
5. The Zero-Level Model
</SectionTitle>
    <Paragraph position="0"> The system built in our laboratory (described at length in Smith \[1991\] and Hipp \[1992\] and Smith and Hipp \[1995\]) implements the theory given above. Figure 1 presents a zero-level model of the main processor, which illustrates the system principles of operation without burdening the reader with too much detail. This model is the recursive subroutine ZmodSubdialog, and it is entered with a single argument, a goal to be proven. Its actions are to carry out a Prolog-style proof of the goal. A side effect of the proof may be some voice interactions with the user to supply missing axioms as described above. In fact, the only voice interactions the system undertakes are those called for by the theorem-proving machinery.</Paragraph>
    <Paragraph position="1"> The ZmodSubdialog routine is a Prolog-style interpreter with a number of special features designed for the dialog processing application. It is typical of such interpreters in that it lifts goals from a priority queue and applies rules from the knowledge base to try to satisfy them. See, for example, the first and third branches beginning with &amp;quot;If R...&amp;quot; where respectively, the trivial case and the general case for applying a rule are handled. The deviations from a standard such interpreter are (1) in the second branch &amp;quot;If R...&amp;quot;, which handles the case of a missing axiom where voice interaction is to be invoked, (2) in three steps (marked &amp;quot;mode&amp;quot;) where processing will vary according to the level of the initiative the system is in, and (3) in the controlling module for the interpreter, which may freeze execution of this computation at any time to initiate or continue some other such computation.</Paragraph>
    <Paragraph position="2"> Typical execution of ZmodSubdialog involves opening a proof tree and proceeding with a computation until an interrupt or clarification subdialog occurs. This may come, for example, from a new goal suggested by the domain processor or from a statement by the user causing movement to a different subdialog. The interrupt will cause control to pass to another existing proof tree (that was previously frozen) or to a new one aimed at the newly presented goal. Thus a set of partially completed trees will exist at all times, and control will jump back and forth between them. Of course, many of these trees will invoke associated voice interactions--and these constitute the subdialogs of the conversation.</Paragraph>
    <Paragraph position="3"> The process of dialog described here is a kind of interactive theorem proving where the guidance down paths can come from either the user's knowledge or system knowledge. However, the emphasis in the traditional interactive theorem-proving literature is on giving the user substantial opportunity to propose the notations and individual steps of the proof in a way that is not possible or desirable in our environment.</Paragraph>
    <Paragraph position="4">  Transfer control depending on which expected response was received Success response: Return with success Negative response: No action Confused response: Modify rule for clarification; prioritize for execution Interrupt: Match response to expected response of another subdialog; Go to that subdialog (mode) If R is a general rule then Store its antecedents While there are more antecedents to process Grab the next one and enter ZmodSubdialog with it If the ZmodSubdialog exits with failure then terminate processing of R If all antecedents of R succeed, return with success Halt with failure</Paragraph>
  </Section>
  <Section position="6" start_page="293" end_page="293" type="metho">
    <SectionTitle>
NOTE: SUCCESSFUL COMPLETION OF THIS ROUTINE DOES NOT NECESSARILY
MEAN TRANSFER OF CONTROL TO THE CALLING ROUTINE. CONTROL PASSES
TO THE SUBDIALOG SELECTED BY THE DIALOG CONTROLLER.
</SectionTitle>
    <Paragraph position="0"> Figure 1 The zero-level model of the main subdialog processing algorithm.</Paragraph>
  </Section>
  <Section position="7" start_page="293" end_page="302" type="metho">
    <SectionTitle>
6. Executing an Example Subdialog
</SectionTitle>
    <Paragraph position="0"> The operation of ZmodSubdialog (and similarly our implemented system) becomes clear if a complete example subdialog is carried out. Here we will trace the execution of the example 22 utterance subdialog given in Section 3 and thereby illustrate the theory of operation in detail. An overview of the computation is given here and a detailed trace of all significant details appears in Appendix A. The following database of Prolog-like rules is needed for proper system operation.</Paragraph>
    <Paragraph position="1">  Smith, Hipp, and Biermann An Architecture for Voice Dialog Systems</Paragraph>
    <Paragraph position="3"> We assume that the machine has selected a new goal that comes from the domain processor: Tl_circuit_Test2(V), where V is a voltage to be returned by the test. Thus, the domain processor is asking that test 2 on circuit T1 be performed returning a voltage V. ZmodSubdialog begins in Prolog fashion looking for a rule in the database to prove the goal Tl_circuit_Test2(V), and it finds the debugging rule Tl_circuit_Test2(V) set(knob,10), measurevoltage(121,34,V). Referring to Figure 1, this rule R is a general rule resulting in the third choice branch. Here the algorithm selects the first subgoal set(knob,10) of R and creates another subdialog by entering ZmodSubdialog with this subgoal.</Paragraph>
    <Paragraph position="4"> The new subdialog finds the rule set(knob,Y) *-- find(knob), adjust(knob,Y), which says the way to set the knob is to first find it and then do the adjustment. Execution of this rule demonstrates the mechanisms related to the use of the user model and the initiation of voice interaction. For example, the first new subgoal find(knob) causes a new entry into ZmodSubdialog, where it is immediately satisfied by find(knob) in the user model. That is, the user has achieved find(knob) (knows how to find the knob), and no further consideration of this subgoal is needed. If the user did not know (according to the user model) how to find the knob, the system might have invoked a voice interaction to try to achieve this subgoal. Moving to the second goal, adjust(knob,10), again it might have occurred that the user has just achieved this also. (Since the algorithm does unification as a rule is invoked, the variable Y has been set to 10.) Entry of ZmodSubdialog with this subgoal, however, finds no trivial resolution for this subgoal. But it can invoke Y ~ usercan(Y), vocalize(Y), which says that a goal Y can be achieved if the user is capable of doing Y (which is represented as usercan(Y)) and if we vocalize Y. (This type of rule has not been  Computational Linguistics Volume 21, Number 3 explicitly implemented in our system, but we include it here as a model of what does happen.) Further recurrences on ZmodSubdialog discover usercan(adjust(knob,X)) and undertake vocalize(adjust(knob,10)).</Paragraph>
    <Paragraph position="5"> Entry of ZmodSubdialog with vocalize(adjust(knob,10)) sends control down its second path. Sentence generation and voice output produces the statement &amp;quot;Put the knob to one zero.&amp;quot; Next a set of expected responses is compiled. Some of these include: question(location,knob).</Paragraph>
    <Paragraph position="6"> question(ACTION,how-to-do).</Paragraph>
    <Paragraph position="7"> assertion(knob,status,10).</Paragraph>
    <Paragraph position="8"> assertion(ACTION,done).</Paragraph>
    <Paragraph position="9"> When a vocalized response comes back, parsing and error correction will be biased to recognize one of these meanings.</Paragraph>
    <Paragraph position="10"> After a response meaning has been resolved, it is entered into the database with all its presuppositions. For example, the mention of an object is assumed to indicate that the user can recognize and find that object if needed. These assertions are entered into the user model. The user model thus changes on almost every interaction to note new facts that are probably known to the user or to remove facts that the user apparently does not know. Finally, control changes because of the response either to (1) return from this subdialog with success, (2) continue in this subdialog searching for a success (having received an unsuccessful response), (3) enter special processing to deal with a need for clarification, or (4) interrupt processing to jump to some other subdialog. In the example subdialog, the user responds &amp;quot;OK,&amp;quot; which corresponds to one of the expected meanings: assertion(ACTION, done). Expectations equate ACTION to the current action. Consequently, there is a successful exit of the current subdialog and achievement of the set(knob,10) goal.</Paragraph>
    <Paragraph position="11"> The first invoked rule, Tl_circuit_Test2(V) ~ set(knob,10), measurevoltage(121,34,V), is thus half satisfied, and the goal measurevoltage(121,34,V) is undertaken. The new goal leads to vocalization in the same manner described above. But the response in this case is negative: &amp;quot;I do not know.&amp;quot; So attempts to achieve measurevoltage(121,34,V) continue and involve the rule measurevoltage(X,Y,V) *-- find(voltmeter), set(voltmeter,20), connect(bw, com,X), connect(rw,+,Y), vocalize(read(voltmeter, V)).</Paragraph>
    <Paragraph position="12"> This leads to a number of interactions, as traced in detail in Appendix A. The reader should note in this continued interaction the manner in which the theorem proving drives the dialog and the user model inhibits or enables voice interactions appropriate to the situation.</Paragraph>
    <Paragraph position="13"> The next interesting action occurs in utterance 12, when the user answers a request to connect a wire with the question &amp;quot;Which knob?&amp;quot; Here the response is parsed against expected meanings without success. So the system looks for expectations of other subdialogs that either have been invoked or might be invoked. In this example,  Smith, Hipp, and Biermann An Architecture for Voice Dialog Systems &amp;quot;which knob&amp;quot; is an expected response for the first utterance, so control returns to that subdialog. In fact, in that subdialog, this response corresponds to a request for clarification. In our studies, we have found such requests for clarification to be routine and have designed a special mechanism for handling them. Our system dynamically modifies the active rule</Paragraph>
    <Paragraph position="15"> Thus, the original user model was incorrect where it included find(knob), and this assertion is deleted. Next, as specified in Figure 1, the system reattempts the computation with this revised rule. The newly inserted subgoal causes the voice output of utterance 13.</Paragraph>
    <Paragraph position="16"> This discussion shows the operation of all parts of the ZmodSubdialog model and illustrates the mechanisms used in our dialog machine. Appendix A gives the detailed steps required for completing the first part of the 22-utterance sample dialog.</Paragraph>
    <Paragraph position="17"> 7. An Overview of the System Architecture The architecture of the system is given in Figure 2, where five major subsystems are shown: the dialog controller, the domain processor, the knowledge base, the general reasoning system, and the linguistic interface. These modules will be described next. Dialog controller. This is the overall &amp;quot;boss&amp;quot; of the dialog processing system. It formulates goals at the top level to be passed on to the theorem-proving stage. It determines the role that the computer plays in the dialog by determining how user inputs relate to already established dialog as well as determining the type of response given. It also maintains all dialog information shared by the other modules and controls their activation. Its control algorithm is the highest-level dialog processing algorithm.</Paragraph>
    <Paragraph position="18"> The basic cycle followed by the dialog controller is shown below.</Paragraph>
    <Paragraph position="19">  Obtain suggested goal from the domain processor.</Paragraph>
    <Paragraph position="20"> Based on the suggested goal and the current state of the dialog, select the next goal to be pursued by the computer and determine the expectations associated with that goal. (The goal may thus be selected from one of the active subdialogs. The choice is partially dependent on the current level of initiative.) Attempt to complete the goal using the IPSIM system, possibly invoking voice interactions.</Paragraph>
    <Paragraph position="21"> Update system knowledge based on efforts at goal completion.</Paragraph>
    <Paragraph position="22"> Determine next operations to be performed by the domain processor in providing a suggested goal.</Paragraph>
    <Paragraph position="23"> Go to step 1.</Paragraph>
    <Paragraph position="24">  The system architecture.</Paragraph>
    <Paragraph position="25"> The domain processor. This is the primary application-dependent portion of the system. It contains much of the information about the application domain. It receives from the controller a request for the next suggested goal to be undertaken, and it returns to the controller its suggestion along with expected results from attempting the test. It then receives the results of the interaction and appropriately updates its data structures.</Paragraph>
    <Paragraph position="26"> In the implemented system, the domain processor assists in electronic equipment repair and contains a debugging tree that organizes the debugging task. The debug- null Smith, Hipp, and Biermann An Architecture for Voice Dialog Systems ging tree has as its root a node representing the whole device to be debugged and other nodes representing all of the subsystems. It is organized using the &amp;quot;part of&amp;quot; relationship with each node that represents a part of a subsystem connected as a child node below that subsystem.</Paragraph>
    <Paragraph position="27">  For example, in the implemented system, the top level device is a circuit called the RSl11; its subsystems are the power circuit, the T1 and T2 circuits, and the LED circuit. Their subsystems are wires, switches, transistors, and so forth. The lowest level of this tree is the atomic element that can be addressed in a dialog.</Paragraph>
    <Paragraph position="28"> Each node has as its primary constituents a set of specifications and a set of flags representing its state. The specifications have three parts, giving the observation to be made, the conditions to be satisfied before making the observation, and the actions to be taken depending on what is observed. An example of one of the specifications for the LED is as follows: Observation: Observe the behavior of the LED.</Paragraph>
    <Paragraph position="29"> Conditions: The switch must be on, and the control must be set to 10.</Paragraph>
    <Paragraph position="30"> Actions:
IF the LED is alternately displaying a 1 and a 7 with frequency greater than once per second THEN assert that the specification is satisfied
ELSE IF the LED is not on THEN assert that the battery is suspicious, in which case the power circuit is suspicious
ELSE IF the LED is on but not blinking THEN assert that the transistor circuits are suspicious
ELSE IF the LED is blinking but not alternately displaying a 1 and a 7 THEN assert that the LED circuit is suspicious
ELSE IF the LED is damaged THEN assert that the LED device should be replaced
ELSE IF ...</Paragraph>
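Purely as an illustration of the structure of such a specification, the LED example could be rendered as Prolog facts along the following lines. The predicate names (observation/3, condition/3, action/4) and the symbolic values are invented for this sketch and are not the data format of the implemented domain processor.

% Hypothetical encoding of the LED specification described above.
observation(led, spec1, behavior_of_led).

condition(led, spec1, switch(on)).
condition(led, spec1, control_setting(10)).

% action(Node, SpecId, ObservedBehavior, ActionToTake).
action(led, spec1, alternating_1_and_7,    assert_satisfied(spec1)).
action(led, spec1, led_off,                mark_suspicious([battery, power_circuit])).
action(led, spec1, on_but_not_blinking,    mark_suspicious([t1_circuit, t2_circuit])).
action(led, spec1, blinking_wrong_digits,  mark_suspicious([led_circuit])).
action(led, spec1, led_damaged,            replace_part(led)).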
    <Paragraph position="31"> The actions may be to assert that the specification is checked, to set suspicion flags on other nodes in the tree, or to replace parts.</Paragraph>
    <Paragraph position="32"> The other major component of a node is a set of status flags for the subsystem represented by the node. One flag indicates whether the subsystem is checked, unchecked, partially checked, or suspicious. Other flags give, for each individual specification, its status and a counter for the number of times that the specification has been checked.</Paragraph>
    <Paragraph position="33"> The domain processor algorithm chooses the node (subsystem) with the greatest suspicion and the specification on that node with the greatest suspicion and sends it to the dialog controller for possible checking. When two nodes or specifications are tied for being most suspicious, finer-grained criteria are used to break the tie. The algorithm is designed to guarantee that false information that may be entered into the tree will eventually be found. This is done by allowing the iteration counter on each specification to reduce its effective level of suspicion. Thus, in a problematic debugging situation, specifications with the smallest count will be checked again and again until all have been checked the same number of times. Additional checking will then make the rounds of all specifications. If erroneous information has been entered after any observation, that observation will eventually be repeated, enabling progress and guaranteeing ultimate success.</Paragraph>
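A minimal sketch of this selection policy is given below, collapsing the two-level node/specification choice into a single ranking over specifications. The facts suspicion/2 and check_count/2 and their values are invented for the example.

% Hypothetical bookkeeping facts.
suspicion(led_spec,   3).
suspicion(power_spec, 3).
suspicion(wire_spec,  1).

check_count(led_spec,   2).
check_count(power_spec, 0).
check_count(wire_spec,  1).

% next_goal(-Spec): choose the most suspicious specification; among equally
% suspicious ones, prefer the one checked the fewest times, so that repeated
% checking eventually revisits every observation.
next_goal(Spec) :-
    findall(k(NegS, Count)-Id,
            ( suspicion(Id, S),
              check_count(Id, Count),
              NegS is -S ),              % negate so ascending sort puts high suspicion first
            Pairs),
    keysort(Pairs, [_-Spec|_]).

% ?- next_goal(G).
% G = power_spec.    % tied on suspicion with led_spec, but checked fewer times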
    <Paragraph position="34"> The set of possible observations provides the situation-specific and situation-related expectations discussed in the section on expectations. Thus, in the example listed above where the LED is being observed, the expectations correspond to the possible LED behaviors enumerated in the actions. These are passed to the controller at the time of the request for the LED observation.
The reasoning system. The IPSIM system receives as input goals to be proven and commands to start, stop, and furnish information. It yields as output theorems that are proven and status reports on the proof in progress. Its processing follows the usual mechanisms of Prolog-style theorem-proving, and it is modeled by the ZmodSubdialog routine given above. It uses the rules in the knowledge base described below.
The knowledge base. This is the repository of information about task-oriented dialogs. This includes the following general knowledge about the performance of actions and goals:
(1) Knowledge about the decompositions of actions into substeps.</Paragraph>
    <Paragraph position="35"> (2) Knowledge about theorems for proving completion of goals.</Paragraph>
    <Paragraph position="36"> (3) Knowledge about the expectations for responses when performing an action.</Paragraph>
    <Paragraph position="37">  There is also general task knowledge about completing locative descriptions. General dialog knowledge includes knowledge about the linguistic realizations of task expectations as well as discourse structure information maintained on the current dialog. Finally, there is also knowledge about the user that is acquired during the course of the dialog. Note that the predefined information of this module is easily modified without requiring changes to the dialog controller.</Paragraph>
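To suggest the general shape of this knowledge, the three kinds of entries listed above might look roughly as follows when written as Prolog clauses. The predicates, the action terms, and the GADL-like expectation terms shown here are invented stand-ins for this sketch, not the system's actual rule formats.

:- dynamic holds/1, user_reported/2.

% (1) Decomposition of an action into substeps (hypothetical format).
decomposition(observe(led),
              [ achieve(switch(on)),
                achieve(control_setting(10)),
                report(led_behavior) ]).

% (2) A theorem-style rule for proving completion of a goal: the observation
%     is complete once its conditions hold and the user has reported the result.
completed(observe(led)) :-
    holds(switch(on)),
    holds(control_setting(10)),
    user_reported(led_behavior, _Value).

% (3) Expectations for user responses while performing the action; each is a
%     meaning representation the parser may be asked to prefer.
expectation(observe(led), assertion(true,  state(led, blinking))).
expectation(observe(led), assertion(false, state(led, on))).
expectation(observe(led), question(how, observe(led))).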
    <Paragraph position="38"> Linguistic interface. This system receives the spoken inputs from the user and returns spoken outputs. We will examine first the voice input system.</Paragraph>
    <Paragraph position="39"> The inputs to the voice input system for each utterance are the set of expectations from the dialog controller and the speech utterance from the user. The output from this system is a GADL meaning representation.</Paragraph>
    <Paragraph position="40"> The core of the processor is a syntax-directed translator (Aho and Ullman 1969) with rules of the form A → w1 : w2, where w1 should be thought of as an ordinary context-free grammar right-hand side. If the terminal string w can be generated by the grammar rules of the form A → w1, then the analogous derivation using the rules A → w2 will produce the translated output in GADL form. As an example, suppose the utterance "no wire" has been received and the following rules are in the system.</Paragraph>
    <Paragraph position="42"> Under these rules, the meaning representation derived from the input "no wire" is assertion(false, state(exist, wire(+,+), present)).</Paragraph>
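Since the grammar rules themselves are not reproduced in this text, the following DCG fragment is only a guess at their flavor: it accepts the token list [no, wire] and builds the GADL term quoted above. The rule names, and the use of quoted '+' as an "unspecified" placeholder, are assumptions made for this sketch.

% A hypothetical DCG that plays the role of the A → w1 : w2 rules:
% the right-hand side gives the syntax, the head argument builds the meaning.

utterance(assertion(Polarity, state(exist, Object, present))) -->
    polarity(Polarity),
    object(Object).

polarity(false) --> [no].
polarity(true)  --> [yes].

object(wire('+', '+')) --> [wire].     % '+' marks unspecified terminals

% Example:
%   ?- phrase(utterance(M), [no, wire]).
%   M = assertion(false, state(exist, wire('+', '+'), present)).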
    <Paragraph position="43"> The dialog controller, of course, will provide an expectation for each received input. In the current example, assume the machine has previously output "there should be a wire from terminal 102 to terminal 104." Then the expectations would be</Paragraph>
    <Paragraph position="45"> The selected meaning of the incoming utterance will be the least-cost match between an output of the translation grammar and the expectations. The mechanisms for finding this least-cost match will be described next.</Paragraph>
    <Paragraph position="46"> The cost C of selecting a given expectation is a function of two parameters: (1) the utterance cost U, which measures the distance of the perceived voice signal from a grammatical utterance as defined by the system grammar, and (2) the expectation cost E, which measures the degree of locality of the selected expectation with respect to the current subdialog. The utterance cost U will be small if the voiced signal precisely matches a token sequence generated by the system grammar. The expectation cost E will be small if the meaning of the utterance as generated by the translation grammar precisely matches an expected meaning for the currently active subdialog. Thus the total cost can be represented as</Paragraph>
    <Paragraph position="48"> C = f(U, E), where the exact nature of the function f is a problem to be solved.</Paragraph>
    <Paragraph position="49"> In this project, it was assumed that the speech recognizer would provide a graph of alternate guesses of the current input. For example, after a user had spoken "no wire," it might pass the parser a graph containing, among others, the path no-a-wire. The parser then searches for paths through the graph that match grammatical inputs as closely as possible. In searching for a path, the parser may delete or insert words to achieve a match. Each such edit operation has an associated cost, depending on the significance of the word being edited. Some words, such as "not" or major nouns, make a large difference in sentence meaning; other words, such as articles, may not carry significant meaning in a given context. U is the sum of the edit costs required to traverse the graph path and match a grammatical input. In the example, the path no-a-wire matches the grammatical "no wire" with the deletion of only the article "a." The computation of E was simple in our project. A low value was assigned to all expectations at the current subdialog, and successively more distant subdialogs were given higher expectation costs.</Paragraph>
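To make the edit-cost idea concrete, the following is a naive sketch of a weighted, word-level edit cost between a single recognizer hypothesis and a grammatical word string. The particular weights, and the simplification from a word graph to one hypothesis sequence, are assumptions of the sketch rather than properties of the implemented parser.

% word_cost(Word, Cost): hypothetical insertion/deletion weights.
% Content words are expensive to edit; articles are cheap.
word_cost(no,   5).
word_cost(not,  5).
word_cost(wire, 5).
word_cost(a,    1).
word_cost(the,  1).
word_cost(_,    3).                    % default weight for other words

% edit_cost(+Hypothesis, +GrammaticalString, -Cost): cost of one edit sequence.
% Naive recurrence with no memoization; adequate for short utterances.
edit_cost([], [], 0).
edit_cost([W|Hs], [W|Gs], C) :-        % words match: no cost
    edit_cost(Hs, Gs, C).
edit_cost([H|Hs], Gs, C) :-            % delete a hypothesized word
    word_cost_of(H, CH),
    edit_cost(Hs, Gs, C0),
    C is C0 + CH.
edit_cost(Hs, [G|Gs], C) :-            % insert a word required by the grammar
    word_cost_of(G, CG),
    edit_cost(Hs, Gs, C0),
    C is C0 + CG.

word_cost_of(W, C) :- word_cost(W, C), !.

% The utterance cost U is the cheapest edit sequence:
%   ?- findall(C, edit_cost([no, a, wire], [no, wire], C), Cs), min_list(Cs, U).
%   U = 1 (delete the article "a").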
    <Paragraph position="50"> The original hypothesized combining function for U and E weighted the two costs against each other with a factor β between 0 and 1. A β near 0 will tend to prefer matches to the local expectations, regardless of the value of U; a β near 1 will place most of the weight on getting a good match between the input graph and a grammatical string, regardless of the value of E. Our experimentation, as described in Hipp (1992), indicated that β should be near 1. This supports the intuition that definitive source data at the time of the utterance should be the preferred evidence regardless of expectation.</Paragraph>
    <Paragraph position="53"> Consequently, the usefulness of the expectation is for selecting between grammatical utterances derived from the perceived voice signal that have minimal utterance cost.</Paragraph>
    <Paragraph position="54"> Reflecting this experimentally determined result, the cost computation was revised so that the meaning with the minimum utterance cost is selected, with the expectation cost used only to break ties among meanings whose utterance cost equals Umin, the smallest observed utterance cost for the given utterance.</Paragraph>
    <Paragraph position="56"> A big advantage of this form comes from the fact that any partial parse whose cost exceeds the currently known minimum can be abandoned immediately, at great savings in computation time.</Paragraph>
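The effect of the revised computation can be illustrated with a small sketch that, given candidate meanings paired with their utterance and expectation costs, sorts by utterance cost first and expectation cost second. The cand/3 representation and the example data are assumptions of the sketch.

% Candidates are cand(Meaning, U, E) terms; the data below are illustrative.
% select_meaning(+Candidates, -Meaning): minimal U, ties broken by minimal E.
select_meaning(Candidates, Meaning) :-
    findall((U-E)-M, member(cand(M, U, E), Candidates), Keyed),
    keysort(Keyed, [_Key-Meaning|_]).

% Example: m2 and m3 tie on the minimal utterance cost U = 1, and the
% expectation cost selects m3.
%   ?- select_meaning([cand(m1, 2, 5), cand(m2, 1, 9), cand(m3, 1, 4)], M).
%   M = m3.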
    <Paragraph position="57"> The details of the minimization algorithm are given in Hipp (1992). It follows some of the ideas of Aho and Peterson (1972), Levinson (1985), and Lyon (1974), and will not be described here. It finds optimum answers in less than two seconds for most utterances of the lengths used in the environment of our system when running on a Sun SPARCstation 2.</Paragraph>
    <Paragraph position="58"> The voice output system will not be discussed here. It receives a GADL specification for an output and some parameters regarding the statement context, and uses a grammar to generate the desired word sequence. It uses the context information to adapt outputs to their environment and sends the sequence to a DECtalk system for voicing.</Paragraph>
  </Section>
  <Section position="8" start_page="302" end_page="302" type="metho">
    <SectionTitle>
8. Some Implementation Details
</SectionTitle>
    <Paragraph position="0"> The system has been implemented on a Sun 4 workstation with the majority of the code written in Quintus Prolog. The parser is coded in C. Speech recognition is performed by a Verbex 6000 user-dependent connected-speech recognizer running on an IBM PC, and the vocabulary is currently restricted to 125 words. The users are required to begin each utterance with the word "verbie" and end with the word "over." The Verbex machine acknowledges each input with a small beep sound. These sentinel interactions help to keep the user and machine in synchronization. The grammar used by the parser consists of 491 rules and 263 dictionary entries. The dictionary entries define insertion and deletion costs for individual words as well as substitution costs for phonetically similar words (such as "which" and "switch").</Paragraph>
    <Paragraph position="1"> The dialog system, exclusive of the parser and error correction code, consists of about 17,000 lines of Prolog (including some comments), apportioned as follows: Dialog Controller procedural mechanisms (including IPSIM), 15%; Dialog Controller knowledge base, 11%; Domain Processing procedural mechanisms, 25%; Domain Processing knowledge base, 14%; Linguistic Interface (including much language generation code), 30%; miscellaneous, 5%.</Paragraph>
    <Paragraph position="2"> The implemented domain processor was loaded with a model for a particular circuit assembled on a Radio Shack 160-in-One Electronic Project Kit. The model was complete enough to solve any problem of the circuit that involved missing wires.</Paragraph>
    <Paragraph position="3"> For example, if the system were asked to debug the circuit with no wires, it would systematically discover each missing wire and request that the user install it.</Paragraph>
    <Paragraph position="4"> The speech output was done with a DECtalk (trademark of Digital Equipment Corp.) DTC01 text-to-speech converter.</Paragraph>
  </Section>
  <Section position="9" start_page="302" end_page="305" type="metho">
    <SectionTitle>
9. Testing the System
</SectionTitle>
    <Paragraph position="0"> A reasonable test of the theory and implementation described here is to bring human subjects to the laboratory and determine whether they can converse sufficiently well with the machine to effectively solve problems. The purpose of the testing was to gather general statistics on system performance and timing, to study the effects of mode, and to judge the human factors issues, learnability, and user response. The hypotheses were that the system would function acceptably, that with a reasonable amount of user training, machine directive mode would yield longer completion times and less complex verbal behaviors than a more passive mode, and that users would respond positively to using the system. This section describes the design of the tests and the results obtained.</Paragraph>
    <Section position="1" start_page="302" end_page="304" type="sub_section">
      <SectionTitle>
9.1 Experimental Design
</SectionTitle>
      <Paragraph position="0"> Three experimental sessions were used for each subject. The first was to train the subject and register subject pronunciations on the Verbex machine. The second session was a data-gathering test in which the subject could attempt up to ten problems with the dialog system locked in either directive or declarative mode. The third session allowed the subject to attempt up to ten additional problems. It was similar to the second except that the system was placed in the mode that was not used in session 2 (either declarative or directive). Experimentation was thus to be limited to just two modes even though four were operative.</Paragraph>
      <Paragraph position="1"> Eight subjects were recruited from computer science classes. They were selected on the criteria that they (1) have demonstrated problem-solving skills by having successfully completed a computer science course and having enrolled in another, (2) not have excessive familiarity with artificial intelligence or natural language processing as would occur, for example, if they had had a course on one of these topics, and (3) not be an electrical engineering major (in which case they could probably repair the circuits without aid). They were told they would receive $36.00 for participating in the three-part experiment. All selected subjects were used and all collected data are reported regardless of the level of success achieved.</Paragraph>
      <Paragraph position="2"> Session 1 introduced the subjects to the voice equipment and required that they speak at least two examples of each of the 125 vocabulary words. They then were asked to speak 239 sentences to train the system for coarticulation. Repetitions in either of these exercises were used as needed to obtain acceptable recognition rates. Next the subjects were told about the dialog system and its functions and capabilities in brief and simple terms. They were given the basic rules on how to speak to the system, including the need for carefully enunciated speech, the requirement for verbie-over bracketing, the importance of hearing the acknowledging beep, and special requirements for stating numbers. They were told not to direct any comments to the experimenter; however, the experimenter would occasionally give them help, as will be described below. The subjects were asked to listen to and repeat four sentences spoken by the DECtalk system; this exercise was repeated until they overcame any difficulties in understanding. They were shown the target LED displays and given suggestions on how to successfully describe such displays to the system; specifically, the user should tell what they see present on the display (as in "the top of a seven is displaying") and not describe what does not appear (as in "the bottom of the seven is missing"). The subjects were provided with a list of the allowed vocabulary words and charts on a poster board suggesting implemented syntax if they wished to use it.</Paragraph>
      <Paragraph position="3"> Finally, they were given four practice problems and allowed to try solving them with the machine operating in directive mode. The complete session lasted up to two and one half hours.</Paragraph>
      <Paragraph position="4"> Session 2 was scheduled for three or four days later. It began with a reorientation, 60 practice sentences on the speech recognizer, and some review questions on the general instructions. If this session was in directive mode, the subjects were told the system would act like a teacher and that they should follow its instructions. If this session was in declarative mode, they were told the system would act like an assistant so that they could control the dialog, and they were given an example of a short interaction so that they could observe the kind of control that can be achieved. Then they were released to do up to ten problems.</Paragraph>
      <Paragraph position="5"> Session 3 was scheduled for three or four days after the second session. Appropriate instructions were given to change the subject expectations to the new mode, and ten more sample problems were given. Finally, the subjects were asked to fill in a short form and describe their reactions to using the system.</Paragraph>
      <Paragraph position="6"> An important issue in such tests, as has been observed elsewhere (Biermann, Fineman, and Heidlage 1992), is the problem of giving the subject sufficient error messages to enable satisfactory progress. Users may wander aimlessly in their behaviors without some guidance when things go wrong. If the input speech is discrete with a pause after every word, an automatic system can confirm words individually and give the user adequate feedback (Biermann et al. 1985). But with connected speech, the system cannot easily pinpoint the source of errors and may not provide satisfactory guidance. The user may receive an unexpected response from the system and then speak again but with increased volume or nonstandard vocabulary; this may yield worse machine responses and even more extreme behavior from the user. Our answer to this problem was to post the experimenter nearby and to allow him or her to give the subject several different standard error messages if they were needed. The experimenter was allowed to deliver any of the following messages if certain criteria were met:
1. Due to misrecognition your words came out as __. (Most misrecognitions were corrected automatically and thus resulted in no such message. This message was given if the interpreted meaning contradicted the intended meaning or referenced the wrong object.)
2. Please be patient. The system is taking a long time to respond.</Paragraph>
      <Paragraph position="7"> 3. The system is ready for your next utterance. (Or other synchronization warnings.) 4. Please remember to start/end utterances with verbie/over.</Paragraph>
      <Paragraph position="8"> 5. Recognition is indicated by a beep.</Paragraph>
      <Paragraph position="9"> 6. The word __ is not in the vocabulary.</Paragraph>
      <Paragraph position="10"> 7. (a number) must be spoken as digits.</Paragraph>
      <Paragraph position="11"> 8. Please restrict your utterances to one sentence.</Paragraph>
      <Paragraph position="12"> 9. Please keep your tone/volume/rhythm similar to the way you trained.</Paragraph>
      <Paragraph position="13"> 11. Please follow the computer's guidance. (After three repetitions caused by the subject's refusal to cooperate.)
Statistics were kept on the number of such messages that were delivered during the test sessions, as reported below.</Paragraph>
    </Section>
    <Section position="2" start_page="304" end_page="304" type="sub_section">
      <SectionTitle>
9.2 Test Problems
</SectionTitle>
      <Paragraph position="0"> The circuit to be repaired was a multivibrator circuit constructed on a Radio Shack 160-in-One Project Kit. It contained twenty wires and used a number of components on the board: a switch, potentiometer, light-emitting diode (LED), battery, and two transistors. Its correct behavior was to alternately display a 1 and a 7 on the LED, with the rate of alternation being adjustable by the potentiometer. For the purposes of the experiment, failures were introduced by removing one or two wires. The first eight problems for the two sessions were matched in difficulty as well as possible in order to give balance between sessions and to prevent a varying difficulty from overshadowing important effects. The last two problems were repeats of the practice problems from the first session.</Paragraph>
    </Section>
    <Section position="3" start_page="304" end_page="305" type="sub_section">
      <SectionTitle>
9.3 Test Dialog System
</SectionTitle>
      <Paragraph position="0"> The dialog system being tested was the version that was operative at the time of the test, mid-February 1991. At that time, it was running on a Sun 4 machine, which caused significant problems with execution time. A few responses during the experiment were as slow as 10 seconds, or in some cases as much as 30 seconds, which hampered the flow of the interaction. These slow responses were primarily due to the computational costs of parsing long utterances containing many misrecognized words. The system was later enhanced by moving it to a Sparc-2 machine.
[Table 1. Experimental results for eight subjects operating at two levels of machine initiative: declarative and directive. Numbers in parentheses are for the first eight problems only; the last two problems were repeats of the practice problems from Session 1.]</Paragraph>
      <Paragraph position="1"> The ability of the system to respond to silence as a legitimate input was disabled because it had earlier confused our pilot subjects. If the user was silent for a period of time, the system patiently waited for his or her input. A small number of grammar omissions and other minor system shortcomings were noticed in the early subjects and fixed for later subjects.</Paragraph>
      <Paragraph position="2"> Later versions of the system, such as the one on our demonstration tape (Hipp and Smith 1991), included some error message capabilities that were not available in the experiment. Those would have led to substantially better performance if they could have been used.</Paragraph>
    </Section>
  </Section>
</Paper>