<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1009"> <Title>A Robust System for Natural Spoken Dialogue</Title> <Section position="5" start_page="63" end_page="63" type="metho"> <SectionTitle> 3. The System </SectionTitle> <Paragraph position="0"> The TRAINS-95 system is organized as shown in Figure 5. At the top are the I/O facilities. The speech recognition system is the Sphinx-II system from CMU (Huang et al., 1993). The speech synthesizer is a commercial product: the TRUETALK system from Entropic. The rest of the system was built at Rochester. The display supports a communication language that allows other modules to control the content of the display. It also handles keyboard input. The speech recognition output is passed through the post-processor described in section 4. The parser, described in section 5, accepts input either from the post-processor (for speech) or the display manager (for keyboard), and produces a set of speech act interpretations that are passed to the discourse manager, described in section 6. The discourse manager divides into a range of subcomponents handling reference, speech act interpretation and planning (the verbal reasoner), and the back-end of the system: the problem solver and domain reasoner. When a speech act is planned for output, it is passed to the generator, which constructs a sentence and passes this to both the speech synthesizer and the display.</Paragraph> <Paragraph position="1"> The generator is a simple template-based system. It uses templates associated with different speech act forms that are instantiated with descriptions of the particular objects involved. The form of these descriptions is defined for each class of objects in the domain (a minimal sketch of this style of generation appears at the end of this section).</Paragraph> <Paragraph position="2"> In order to stress the system in our robustness evaluation, we used the ATIS language model provided by CMU. With this model, the recognizer yields an overall word error rate of 30% on TRAINS-95 dialogues, as opposed to a 20% error rate that we can currently obtain by using language models trained on our TRAINS corpus. While these error rates are significantly higher than those often reported in the literature, remember that most speech recognition results are reported for read speech, or for constrained dialogue applications such as ATIS. Natural dialogue involves a more spontaneous form of interaction that is much more difficult to interpret.</Paragraph> </Section>
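To make the template-based generation described in this section concrete, the following is a minimal Python sketch. The speech-act names, templates, and object descriptions are illustrative assumptions, not the actual TRAINS-95 inventory.

```python
# Minimal sketch of template-based generation in the spirit described above.
# Templates, speech-act names, and object classes are illustrative only.

TEMPLATES = {
    "CONFIRM":       "Okay.",
    "REQUEST-ROUTE": "What route would you like to get from {origin} to {dest}?",
    "TELL-ROUTE":    "The train goes from {origin} to {dest} via {via}.",
}

def describe(obj):
    """Render a domain object as a phrase; one rule per object class."""
    if obj.get("class") == "city":
        return obj["name"]
    if obj.get("class") == "route":
        return " to ".join(city["name"] for city in obj["cities"])
    return str(obj)

def generate(speech_act):
    """Instantiate the template for a speech act with object descriptions."""
    template = TEMPLATES[speech_act["type"]]
    slots = {role: describe(obj) for role, obj in speech_act.get("args", {}).items()}
    return template.format(**slots)

# Example: the clarification question from the sample dialogue.
act = {"type": "REQUEST-ROUTE",
       "args": {"origin": {"class": "city", "name": "Detroit"},
                "dest":   {"class": "city", "name": "Washington"}}}
print(generate(act))  # -> What route would you like to get from Detroit to Washington?
```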
<Section position="6" start_page="63" end_page="65" type="metho"> <SectionTitle> 4. Statistical Error Post-Correction </SectionTitle> <Paragraph position="0"> The following are examples of speech recognition (SR) errors that occurred in the sample dialogue. In each, the words tagged REF indicate what was actually said, while those tagged with HYP indicate what the speech recognition system proposed, and HYP' indicates the output of SPEECHPP, our post-processor. While the corrected transcriptions are not perfect, they are typically a better approximation of the actual utterance. As the first example shows, some recognition errors are simple word-for-word confusions:
HYP:  GO B_X SYRACUSE AT BUFFALO
HYP': GO VIA SYRACUSE VIA BUFFALO
REF:  GO VIA SYRACUSE AND BUFFALO
In the next example, a single word was replaced by more than one smaller word:
HYP:  LET'S GO P_M TO TRY
HYP': LET'S GO P_M TO DETROIT
REF:  LET'S GO VIA DETROIT
The post-processor yields fewer errors by effectively refining and tuning the vocabulary used by the speech recognizer. To achieve this, we adapted some techniques from statistical machine translation (such as Brown et al., 1990) in order to model the errors that Sphinx-II makes in our domain. Briefly, the model consists of two parts: a channel model, which accounts for errors made by the SR, and the language model, which accounts for the likelihood of a sequence of words being uttered in the first place.</Paragraph> <Paragraph position="1"> More precisely, given an observed word sequence o from the speech recognizer, SPEECHPP finds the most likely original word sequence by finding the sequence s that maximizes Prob(s) Prob(o|s), where Prob(s) is the prior probability that the user would utter the sequence s, and Prob(o|s) is the probability that the SR would produce the sequence o when s was actually spoken.</Paragraph> <Paragraph position="4"> For efficiency, it is necessary to estimate these distributions with relatively simple models by making independence assumptions. For Prob(s), we train a word-bigram &quot;back-off&quot; language model (Katz, 1987) from hand-transcribed dialogues previously collected with the TRAINS-95 system. For Prob(o|s), we build a channel model that assumes independent word-for-word substitutions, i.e., Prob(o|s) is approximated by the product of the individual word confusion probabilities Prob(o_i|s_i).</Paragraph> <Paragraph position="6"> The channel model is trained by automatically aligning the hand transcriptions with the output of Sphinx-II on the utterances in the (SPEECHPP) training set and by tabulating the confusions that occurred. We use a Viterbi beam-search to find the s that maximizes this expression. This technique is widely known and so is not described here (see Forney (1973) and Lowerre (1986)).</Paragraph>
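To make the decoding step concrete, the following toy Python sketch searches for the sequence s maximizing Prob(s) Prob(o|s) with a Viterbi-style beam search over a word-bigram language model and a word-for-word channel model. The probability tables are illustrative stand-ins rather than values estimated from the TRAINS-95 corpus, and unseen events are simply floored instead of being handled by a true back-off model.

```python
import math

# Toy sketch of the SPEECHPP decoding step: find the source sequence s that
# maximizes Prob(s) * Prob(o|s) under a word-bigram language model and an
# independent word-for-word channel model.  All probabilities below are
# illustrative stand-ins, not values estimated from the TRAINS corpus.

# CHANNEL[s_word][o_word] = Prob(o_word | s_word), tabulated from aligned confusions
CHANNEL = {
    "go":       {"go": 0.95, "p_m": 0.05},
    "via":      {"via": 0.6, "b_x": 0.3, "at": 0.1},
    "and":      {"and": 0.7, "at": 0.2, "in": 0.1},
    "syracuse": {"syracuse": 1.0},
    "buffalo":  {"buffalo": 0.9, "b_x": 0.1},
}

# BIGRAM[(w1, w2)] = Prob(w2 | w1); "<s>" marks the start of the utterance
BIGRAM = {
    ("<s>", "go"): 0.5, ("go", "via"): 0.4, ("via", "syracuse"): 0.3,
    ("syracuse", "and"): 0.3, ("syracuse", "via"): 0.1,
    ("and", "buffalo"): 0.4, ("via", "buffalo"): 0.2,
}

FLOOR = 1e-4  # stand-in for unseen bigram / confusion probabilities

def correct(observed, beam=10):
    """Viterbi-style beam search for the most likely source word sequence."""
    hyps = [(0.0, "<s>", [])]          # (log score, previous word, sequence so far)
    for o in observed:
        # source words that could have produced o; fall back to o itself
        cands = [s for s in CHANNEL if o in CHANNEL[s]] or [o]
        scored = []
        for score, prev, seq in hyps:
            for s in cands:
                lm = math.log(BIGRAM.get((prev, s), FLOOR))
                ch = math.log(CHANNEL.get(s, {}).get(o, FLOOR))
                scored.append((score + lm + ch, s, seq + [s]))
        hyps = sorted(scored, key=lambda h: h[0], reverse=True)[:beam]  # prune
    return hyps[0][2]

# With these toy tables, the first example above is corrected to the REF words:
print(correct(["go", "b_x", "syracuse", "at", "buffalo"]))
# -> ['go', 'via', 'syracuse', 'and', 'buffalo']
```

The real system performs the same search, but with probabilities estimated from the aligned TRAINS-95 transcriptions and a Katz back-off language model.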
<Paragraph position="7"> Having a relatively small number of TRAINS-95 dialogues for training, we wanted to investigate how well the data could be employed in models for both the SR and the SPEECHPP. We ran several experiments to weigh our options. For a baseline, we built a class-based back-off language model for Sphinx-II using only transcriptions of ATIS spoken utterances. Using this model, the word recognition rate of Sphinx-II alone on TRAINS-95 data was 58.7%. Note that this figure is lower than the 70% average implied by the 30% error rate mentioned earlier, since we were unable to exactly replicate the ATIS model from CMU.</Paragraph> <Paragraph position="10"> First, we used varying amounts of training data exclusively for building models for the SPEECHPP; this scenario would be most relevant if the speech recognizer were a black-box and we did not know how to train its model(s). Second, we used varying amounts of the training data exclusively for augmenting the ATIS data to build language models for Sphinx-II. Third, we combined the methods, using the training data both to extend the language models for Sphinx-II and to then train SPEECHPP on the newly trained SR.</Paragraph> <Paragraph position="11"> The results of the first experiment are shown by the bottom curve of Figure 6, which indicates the performance of the SPEECHPP with the baseline Sphinx-II. The first point comes from using approximately 25% of the available training data in the SPEECHPP models. The second and third points come from using approximately 50% and 75%, respectively, of the available training data. The curve clearly indicates that the SPEECHPP does a reasonable job of boosting our word recognition rates over baseline Sphinx-II, and performance improves with additional training data. We did not train with all of our available data, since the remainder was used for testing to determine the results via repeated leave-one-out cross-validation. The error bars in Figure 6 indicate 95% confidence intervals.</Paragraph> <Paragraph position="13"> Similarly, the results of the second experiment are shown by the middle curve. The points reflect the performance of Sphinx-II (without SPEECHPP) when using 25%, 50%, and 75% of the available training data in its LM. These results indicate that equivalent amounts of training data can be used with greater impact in the language model of the SR than in the post-processor.</Paragraph> <Paragraph position="14"> Finally, the outcome of the third experiment is reflected in the uppermost curve. Each point indicates the performance of the SPEECHPP using a set of models trained on the behavior of Sphinx-II for the corresponding point from the second experiment. The results from this experiment indicate that even if the language model of the SR can be modified, a post-processor trained on the same new data can still significantly improve word recognition accuracy on a separate test set. Hence, whether or not the SR's models are tunable, the post-processor is not redundant.</Paragraph> <Paragraph position="15"> Since these experiments were performed, we have enhanced the channel model by relaxing the constraint that replacement errors be aligned on a word-by-word basis. We employ a fertility model (Brown et al., 1990) that indicates how likely each word is to map to multiple words or to a partial word in the SR output.</Paragraph> <Paragraph position="16"> This extension allows us to better handle the second example above, replacing TO TRY with DETROIT. For more details, see Ringger and Allen (1996).</Paragraph> </Section> <Section position="7" start_page="65" end_page="65" type="metho"> <SectionTitle> 5. Robust Parsing </SectionTitle> <Paragraph position="0"> Given that speech recognition errors are inevitable, robust parsing techniques are essential. We use a pure bottom-up parser (using the system described in Allen (1995)) in order to identify the possible constituents at any point in the utterance based on syntactic and semantic restrictions. Every constituent in each grammar rule specifies both a syntactic category and a semantic category, plus other features to encode co-occurrence restrictions as found in many grammars. The semantic features encode selectional restrictions, most of which are domain-independent. For example, there is no general rule for PP attachment in the grammar.</Paragraph> <Paragraph position="1"> Rather, there are rules for temporal adverbial modification (e.g., at eight o'clock), locational modification (e.g., in Chicago), and so on.</Paragraph> <Paragraph position="2"> The end result of parsing is a sequence of speech acts rather than a syntactic analysis. Viewing the output as a sequence of speech acts has significant impact on the form and style of the grammar. It forces an emphasis on encoding semantic and pragmatic features in the grammar. There are, for instance, numerous rules that encode specific conventional speech acts (e.g., That's good is a CONFIRM, Okay is a CONFIRM/ACKNOWLEDGE, Let's go to Chicago is a SUGGEST, and so on). Simply classifying such utterances as sentences would miss the point. Thus the parser computes a set of plausible speech act interpretations based on the surface form, similar to the model described in Hinkelman & Allen (1989).</Paragraph>
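The grammar rules themselves are not reproduced here; the following schematic Python sketch only illustrates how conventional surface forms can be mapped directly to speech-act interpretations, with a vague TELL act as the fallback. The patterns and act names are invented for illustration.

```python
import re

# Schematic sketch: mapping conventional surface forms to speech-act types.
# The patterns and act names are illustrative; the real system does this with
# grammar rules carrying syntactic, semantic, and pragmatic features.

RULES = [
    (re.compile(r"^okay\b", re.I),             "CONFIRM/ACKNOWLEDGE"),
    (re.compile(r"^that'?s good\b", re.I),     "CONFIRM"),
    (re.compile(r"^let'?s go to (\w+)", re.I), "SUGGEST"),
    (re.compile(r"^go from (\w+)", re.I),      "REQUEST"),
]

def classify(fragment):
    """Return (act type, matched groups); fall back to a contentless TELL."""
    for pattern, act in RULES:
        m = pattern.search(fragment)
        if m:
            return act, m.groups()
    return "TELL", ()   # vague act: content noted, illocutionary force unknown

for utt in ["Okay", "That's good", "Let's go to Chicago", "send contain"]:
    print(utt, "->", classify(utt))
# Okay -> ('CONFIRM/ACKNOWLEDGE', ())
# That's good -> ('CONFIRM', ())
# Let's go to Chicago -> ('SUGGEST', ('Chicago',))
# send contain -> ('TELL', ())
```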
<Paragraph position="3"> We use a hierarchy of speech acts that encode different levels of vagueness, including a TELL act that indicates content without an identifiable illocutionary force. This allows us to always have an illocutionary force that can be refined as more of the utterance is processed. The final interpretation of an utterance is the sequence of speech acts that provides the &quot;minimal covering&quot; of the input - i.e., the shortest sequence that accounts for the input. Even if an utterance was completely uninterpretable, the parser would still produce output - a TELL act with no content.</Paragraph> <Paragraph position="4"> For example, consider an utterance from the sample dialogue that was garbled: OKAY NOW I TAKE THE LAST TRAIN IN GO FROM ALBANY TO IS. The best sequence of speech acts to cover this input consists of three acts: 1. a CONFIRM/ACKNOWLEDGE (OKAY); 2. a TELL, with content to take the last train (NOW I TAKE THE LAST TRAIN); and 3. a REQUEST to go from Albany (GO FROM ALBANY). Note that the TO IS at the end of the utterance is simply ignored, as it is uninterpretable. While they do not appear in the output, unaccounted-for words do lower the confidence score that the parser assigns to the interpretation.</Paragraph> <Paragraph position="5"> The actual utterance was Okay now let's take the last train and go from Albany to Milwaukee. Note that while the parser is not able to reconstruct the complete intentions of the user, it has extracted enough to continue the dialogue in a reasonable fashion by invoking a clarification subdialogue. Specifically, it has correctly recognized the confirmation of the previous exchange (act 1), and recognized a request to move a train from Albany (act 3). Act 2 is an incorrect analysis, and results in the system generating a clarification question that the user ends up ignoring. Thus, as far as furthering the dialogue, the system has done reasonably well.</Paragraph> </Section> <Section position="8" start_page="65" end_page="67" type="metho"> <SectionTitle> 6. Robust Speech Act Processing </SectionTitle> <Paragraph position="0"> The dialogue manager is responsible for interpreting the speech acts in context, formulating responses, and maintaining the system's idea of the state of the discourse. It maintains a discourse state that consists of a goal stack with similarities to the plan stack of Litman & Allen (1987) and the attentional state of Grosz & Sidner (1986). Each element of the stack captures: 1. the domain or discourse goal motivating the segment; 2. the object focus and history list for the segment; and 3. information on the status of problem solving activity (e.g., whether the goal has been achieved yet or not). A fundamental principle in the design of TRAINS-95 was the decision that, when faced with ambiguity, it is better to choose a specific interpretation and run the risk of making a mistake as opposed to generating a clarification subdialogue. Of course, the success of this strategy depends on the system's ability to recognize and interpret subsequent corrections if they arise. Significant effort was made in the system to detect and handle a wide range of corrections, in the grammar, the discourse processing, and the domain reasoning. In later systems, we plan to specifically evaluate the effectiveness of this strategy.</Paragraph>
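A minimal sketch of such a discourse-stack entry is given below; the field names and example values are illustrative assumptions rather than the system's actual representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal sketch of a discourse-stack entry with the three components listed
# above.  Field names and types are illustrative, not the actual TRAINS-95
# data structures.

@dataclass
class Segment:
    goal: str                          # domain or discourse goal motivating the segment
    focus: Optional[str] = None        # current object focus (e.g., an engine)
    history: List[str] = field(default_factory=list)  # history list for reference
    achieved: bool = False             # status of the problem-solving activity

discourse_stack: List[Segment] = []

# Example: the user suggests a new goal; the system opens a segment for it and
# a nested clarification segment when no route can be found.
discourse_stack.append(Segment(goal="move engine1 from Detroit to Washington",
                               focus="engine1"))
discourse_stack.append(Segment(goal="clarify desired route"))

# When the clarification is resolved, its segment is popped and the parent
# segment again provides the context for interpreting the next utterance.
discourse_stack.pop()
print(discourse_stack[-1].goal)   # -> move engine1 from Detroit to Washington
```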
<Paragraph position="1"> The discourse processing is divided into reference resolution, verbal reasoning, problem solving, and domain reasoning.</Paragraph> <Paragraph position="2"> In addition to the obvious job of identifying the referents of noun phrases, reference resolution may also reinterpret the parser's assignment of illocutionary force if it has additional information to draw upon. One way we attain robustness is by having overlapping realms of responsibility: one module may be able to do a better job resolving a problem because it has an alternative view of it. On the other hand, it is important to recognize another module's expertise as well. It could be disastrous to combine two speech acts that arise from, for instance, I really <garbled> think that's good, since the garbled part may include don't. Since speech recognition may substitute one important word for another, it is important to keep in mind that speech acts that have no firm illocutionary force due to grammatical problems may have little to do with what the speaker actually said.</Paragraph> <Paragraph position="3"> The verbal reasoner is organized as a set of prioritized rules that match patterns in the input speech acts and the discourse state. These rules allow robust processing in the face of partial or ill-formed input, as they match at varying levels of specificity, including rules that interpret fragments that have no identified illocutionary force. For instance, one rule would allow a fragment such as to Avon to be interpreted as a suggestion to extend a route, or an identification of a new goal. The prioritized rules are used in turn until an acceptable result is obtained (a sketch of this control structure follows the example below).</Paragraph> <Paragraph position="4"> The problem solver handles all speech acts that appear to be requests to constrain, extend, or change the current plan. It is also based on a set of prioritized rules, this time dealing with plan corrections and extensions.</Paragraph> <Paragraph position="5"> These rules match against the speech act, the problem solving state, and the current state of the domain. If fragmentary information is supplied, the problem solver attempts to incorporate the fragment into what it knows about the current state of the plan.</Paragraph> <Paragraph position="6"> As an example of the discourse processing, consider how the system handles the user's first utterance in the dialogue, OKAY LET'S SEND CONTAIN FROM DETROIT TO WASHINGTON. From the parser we get three acts: 1. a CONFIRM/ACKNOWLEDGE (OKAY); 2. a TELL involving mostly uninterpretable words (LET'S SEND CONTAIN); and 3. a TELL act that mentions a route (FROM DETROIT TO WASHINGTON). The discourse manager sets up its initial conversation state, passes the acts to reference resolution for identification of particular objects, and then hands the acts to the verbal reasoner. Because there is nothing on the discourse stack, the initial confirm has no effect. (Had there been something on the stack, e.g., a question or a plan, the initial confirm might have been taken as an answer to the question, or a confirm of the plan, respectively.) The following empty TELL act is uninterpretable and hence ignored. While it is possible to claim the &quot;send&quot; could be used to indicate the illocutionary force of the following fragment, and that a &quot;container&quot; might even be involved, the fact that the parser separated out the speech act indicates there may have been other fragments lost. The last speech act could be a suggestion of a new goal to move from Detroit to Washington. After checking that there is an engine at Detroit, this interpretation is accepted. The planner is unable to generate a path between these points (since it is longer than four hops). It returns two items: 1. an identification of the speech act as a suggestion of a goal to take a train from Detroit to Washington, and 2. a signal that it could not find a path to satisfy the goal. The discourse context is updated and the verbal reasoner generates a response to clarify the route desired, which is realized in the system's response What route would you like to get from Detroit to Washington?
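The following sketch illustrates the prioritized-rule control structure just described, applied to acts like the three in this example; the rule conditions and the speech-act representation are invented for illustration.

```python
# Sketch of the prioritized-rule dispatch used by the verbal reasoner: rules
# are tried from most to least specific until one accepts the act.  The rule
# conditions and act fields below are invented for illustration only.

def r_confirm(act, state):
    if act["type"] == "CONFIRM" and state["stack"]:
        return "treat as answer/confirmation of " + state["stack"][-1]
    if act["type"] == "CONFIRM":
        return "no effect (nothing on the discourse stack)"

def r_route_mention(act, state):
    if act["type"] == "TELL" and "route" in act.get("content", {}):
        frm, to = act["content"]["route"]
        return f"suggest new goal: move from {frm} to {to}"

def r_fragment(act, state):
    if act["type"] == "TELL" and act.get("content"):
        return "try to incorporate fragment into the current plan"

def r_ignore(act, state):
    return "ignore (no interpretable content)"

RULES = [r_confirm, r_route_mention, r_fragment, r_ignore]   # priority order

def interpret(act, state):
    for rule in RULES:
        result = rule(act, state)
        if result:                      # first acceptable result wins
            return result

state = {"stack": []}
acts = [{"type": "CONFIRM"},                                   # OKAY
        {"type": "TELL", "content": {}},                       # LET'S SEND CONTAIN
        {"type": "TELL", "content": {"route": ("Detroit", "Washington")}}]
for a in acts:
    print(interpret(a, state))
# -> no effect (nothing on the discourse stack)
# -> ignore (no interpretable content)
# -> suggest new goal: move from Detroit to Washington
```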
As another example of robust processing, consider an interaction later in the dialogue in which the user's response no is misheard as now: Now let's take the train from Detroit to Washington do S_X Albany (instead of No let's take the train from Detroit to Washington via Cincinnati). Since no explicit rejection is identified due to the recognition error, this utterance looks like a confirm and continuation of the plan. Thus the problem solver is called to extend the path with the currently focused engine (engine1) from Detroit to Washington.</Paragraph> <Paragraph position="7"> The problem solver realizes that engine1 isn't currently in Detroit, so this can't be a route extension. In addition, there is no other engine at Detroit, so this is not plausible as a focus shift to a different engine. Since engine1 originated in Detroit, it then decides to reinterpret the utterance as a correction. Since the utterance adds no new constraints, but cities with delays were just mentioned, it presumes the user is attempting to avoid those cities, and invokes the domain reasoner to plan a new route avoiding the congested cities. The new path is returned and presented to the user.</Paragraph> <Paragraph position="8"> While the response does not address the user's intention to go through Cincinnati due to the speech recognition errors, it is a reasonable response to the problem the user is trying to solve. In fact, the user decides to accept the proposed route and forget about going through Cincinnati. In other cases, the user might persevere and continue with another correction such as No, through Cincinnati. Robustness arises in the example because the system uses its knowledge of the domain to produce a reasonable response. Note that these examples both illustrate the &quot;strong commitment&quot; model. We believe it is easier to correct a poor plan than to have to keep trying to explain a perfect one, particularly in the face of recognition problems. For further detail on the problem solver, see Ferguson et al. (1996).</Paragraph>
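The following sketch mirrors the decision cascade the problem solver runs through in this example (route extension, then focus shift, then reinterpretation as a correction); the world-state representation, the specific congested cities, and the helper logic are invented for illustration.

```python
# Sketch of the problem solver's decision cascade for a "go from X to Y" act,
# mirroring the reasoning in the second example above.  The world state and
# the particular congested cities are hypothetical stand-ins.

world = {
    "engine_location": {"engine1": "Columbus"},       # somewhere along its planned route
    "engine_origin":   {"engine1": "Detroit"},
    "congested":       {"Cleveland", "Pittsburgh"},   # hypothetical delayed cities
    "focus":           "engine1",
}

def interpret_route(frm, to, world):
    eng = world["focus"]
    if world["engine_location"][eng] == frm:
        return f"extend current route of {eng} from {frm} to {to}"
    if any(loc == frm for e, loc in world["engine_location"].items() if e != eng):
        return f"shift focus to the engine at {frm} and plan a route to {to}"
    if world["engine_origin"][eng] == frm:
        # Reinterpret as a correction of the existing route; with no new
        # constraints, avoid the cities just mentioned as congested.
        return (f"replan route of {eng} from {frm} to {to}, "
                f"avoiding {sorted(world['congested'])}")
    return "unable to incorporate the act; ask for clarification"

print(interpret_route("Detroit", "Washington", world))
# -> replan route of engine1 from Detroit to Washington, avoiding ['Cleveland', 'Pittsburgh']
```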
<Paragraph position="9"> 7. Evaluating the System. While examples can be illuminating, they don't address the issue of how well the system works overall. To explore how well the system robustly handles spoken dialogue, we designed an experiment to contrast speech input with keyboard input. The experiment uses the different input media to manipulate the word error rate and the degree of spontaneity. Task performance was evaluated in terms of two metrics: the amount of time taken to arrive at a solution and the quality of the solution. Solution quality for our domain is determined by the amount of time needed to travel the routes.</Paragraph> <Paragraph position="10"> Sixteen subjects for the experiment were recruited from undergraduate computer science courses. None of the subjects had ever used the system before. The procedure was as follows: each subject was given five tasks to perform, in the same order. Half of the subjects were asked to use speech first, keyboard second, speech third, and keyboard fourth. The other half used keyboard first and then alternated. All subjects were given a choice of whether to use speech or keyboard input to accomplish the final task.</Paragraph> <Paragraph position="11"> After performing the final task, the subject completed a questionnaire.</Paragraph> <Paragraph position="12"> An analysis of the experimental results shows that the plans generated when speech input was used are of similar quality to those generated when keyboard input was used. However, the time needed to develop plans was significantly lower when speech input was used.</Paragraph> <Paragraph position="13"> Overall, problems were solved using speech in 68% of the time needed to solve them using the keyboard.</Paragraph> <Paragraph position="15"> Of the 16 subjects, 12 selected speech as the input medium for the final task and 4 selected keyboard input. Three of the four selecting keyboard input had actually experienced better or similar performance using keyboard input during the first four tasks. The fourth subject indicated on his questionnaire that he believed he could solve the problem more quickly using the keyboard; however, that subject had solved the two tasks using speech input 19% faster than the two tasks he solved using keyboard input.</Paragraph> <Paragraph position="16"> Of the 80 tasks attempted, there were 7 in which the stated goals were not met. In each unsuccessful attempt, the subject was using speech input. There was no particular task that was troublesome and no particular subject that had difficulty. Seven different subjects had a task where the goals were not met, and each of the five tasks was left unaccomplished at least once.</Paragraph> <Paragraph position="17"> A review of the transcripts for the unsuccessful attempts revealed that in three cases, the subject misinterpreted the system's actions, and ended the dialogue believing the goals were met. Each of the other four unsuccessful attempts resulted from a common sequence of events: after the system proposed an inefficient route, word recognition errors caused the system to misinterpret rejection of the proposed route as acceptance. The subsequent subdialogues intended to improve the route were interpreted as extensions to the route, causing the route to &quot;overshoot&quot; the intended destination. This suggests that, while our robustness techniques were effective on average, the errors do create a higher variance in the effectiveness of the interaction. These problems reveal a need for better handling of corrections, especially as resumptions of previous topics. More details on the evaluation can be found in (Sikorski & Allen, forthcoming).</Paragraph> </Section> </Paper>