<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1606"> <Title>An Empirical Study of Speech Recognition Errors in a Task-oriented Dialogue System</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 System Overview </SectionTitle> <Paragraph position="0"> The VIP (&quot;Virtual Interactive Presenter&quot;) system is a dialogue-based interface to an Electronic Programme Guide (EPG). One main advantage of human-computer dialogue is that it breaks down the information exchange into elementary units that correspond to the actual criteria on the basis of which TV programmes are selected, i.e.</Paragraph> <Paragraph position="1"> individual features such as the cast, the movie genre, its rating, etc. It assists the user in progressively refining the programme description without requiring explicit knowledge of the editorial categories used to index the EPG.</Paragraph> <Paragraph position="2"> Related applications, i.e. dialogue systems involving the choice of a film or TV programme, have also been described in [Hagen, 2000] [Ludwig et al., 2000]. Our system is a mixed-initiative conversational interface organised around a human-like character with which the user communicates through speech recognition [Nagao and Takeuchi, 1994] [Beskow and McGlashan, 1997]. The interface is based on the Microsoft Agent(TM) system with a set of animated bitmaps acquired from a real human subject (Figure 1).</Paragraph> <Paragraph position="3"> An example dialogue illustrating the system capabilities is presented below (this example has been obtained with keyboard input only). After greetings by the system, the user opens the dialogue with a first request (U1). Even though the system's goal is to refine the selection, it offers potential choices even at early stages of the dialogue, e.g. after only the programme genre has been specified (S2). 
As the dialogue progresses, it gives a count of programmes matching the current criteria and proposes the best one. The system acknowledges the most specific selections only (S10, S14). Also, the system takes initiative whenever necessary (S12). It repairs non-productive dialogue, i.e.</Paragraph> <Paragraph position="4"> when several utterances have not altered the programme description (&quot;is this programme all right, then?&quot;). Whenever high-level categories are rejected by the user, leaving the system without sufficient criteria to filter the EPG contents, it re-starts the dialogue (&quot;what would you like then?&quot;, S12).</Paragraph> <Paragraph position="5"> U1: Do you have any sports programmes S2: I have found 5 programmes for this choice.</Paragraph> <Paragraph position="6"> Would you like to watch &quot;Keegan's greatest games&quot;? S16: There are 1 other programmes for this choice. What about the following programme: &quot;Casablanca&quot;? U17: I want a western instead S18: I would suggest the following western: &quot;Unforgiven&quot; The software architecture is a pipeline comprising speech recognition, parsing and dialogue. In the next sections, we describe each of these components from the perspective of speech recognition errors. Finally, we discuss the impact of speech recognition errors on example dialogues and the mechanisms that contribute to dialogue robustness.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Speech Recognition Component </SectionTitle> <Paragraph position="0"> Speech recognition is based on the ABBOT system [Robinson et al., 1996]. A specific ABBOT version has been developed for the VIP prototype, VIP-ABBOT, with a test vocabulary of 300+ words (Figure 2). This version is based on a trigram model, trained on a small corpus of 200 user questions and replies, using data from six speakers (average recording time is twelve minutes). 
Though the size of the corpus is in principle too small to obtain an accurate language model, the VIP-ABBOT system achieves a satisfactory performance. Global speech recognition accuracy has been tested as part of the development of the VIP-ABBOT version. The recognition accuracy varied across tests from 65% to a maximum of 80% (at this stage only laboratory conditions with non-noisy environments and good quality microphones have been considered). The system outputs the 1-best recognised utterance, which is passed to the dialogue system via a datagram socket.</Paragraph> <Paragraph position="1"> We have assembled an evaluation corpus of 500 utterances, collected from five speakers including one non-native speaker. Including a non-native speaker was an empirical way of increasing the error rate. Other researchers have suggested varying parameters of the speech recognition system, such as the beam width [Boros et al., 1996], as a method to increase word error rate, in order to collect error corpora. However, they have not documented whether the kinds of errors induced in this way actually reproduce (in terms of distribution) those obtained during the actual use of the system. 
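Recognition accuracy figures of this kind can be computed over an error corpus with a standard word-level edit-distance alignment. The following sketch is our own illustration, not part of the VIP-ABBOT tooling; the example utterance pair is one of the mis-recognitions discussed in section 5.

```python
def wer(reference, hypothesis):
    # Word error rate: edit distance over words (substitutions, insertions,
    # deletions), divided by the length of the reference utterance.
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] holds the edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("kids" recognised as "good") in an eight-word reference.
print(wer("i want a movie my kids can watch",
          "i want a movie my good can watch"))
```

A sentence-level error rate, such as the proportion of utterances containing at least one recognition error, is then simply the fraction of corpus utterances with a non-zero word error rate.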
On the other hand, recognition errors obtained with native and non-native speakers appear similar in our experience, with the overall error rate simply being higher for the latter.</Paragraph> <Paragraph position="2"> For the whole corpus, approximately 50% of recognised utterances contain at least one speech recognition error.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Integrated Parsing of User Utterances </SectionTitle> <Paragraph position="0"> Strictly speaking, a significant proportion (around 50%) of the recognition hypotheses produced by VIP-ABBOT are ungrammatical.</Paragraph> <Paragraph position="1"> For obvious reasons, and from the early stages of system development, we have abandoned the idea of producing a complete parse for the speech input, not so much because user expressions themselves could be ungrammatical but rather because recognised utterances were almost certain to be, considering the error rate.</Paragraph> <Paragraph position="2"> One of the key questions for parsing, especially in the case of dialogue, where the average utterance length is 5-7 words, is whether complete parsing is at all necessary [Lewin et al., 1999]. We have implemented a simplified parser based on a variant of Tree-Adjoining Grammars [Cavazza, 1998]. Being lexicalised, this syntactic formalism has interesting properties in terms of syntax-semantics integration. This lexicalised formalism, combined with a simple bottom-up parser, is well adapted to the partial parsing of ungrammatical utterances (Figure 3).</Paragraph> <Paragraph position="3"> The main goal of parsing is to produce a semantic structure from which speech acts can be identified. Semantic features are aggregated as parsing progresses, following the syntactic operations. 
As a result, the parser produces a feature structure whose semantic elements can be mapped to the descriptors indexing the programmes in the EPG, such as genre (e.g.</Paragraph> <Paragraph position="4"> &quot;movie&quot;, &quot;news&quot;, &quot;documentary&quot;), sub-genre (e.g., &quot;comedy&quot;, &quot;lifestyle&quot;), cast (e.g., &quot;Jeremy Clarkson&quot;), channel (&quot;BBC one&quot;), rating (e.g., &quot;caution&quot;, &quot;family&quot;), etc.</Paragraph> <Paragraph position="5"> Whenever the parser fails to produce a single parse, the semantic structures obtained from partial parses are merged on a content basis. For instance, descriptors such as &quot;cast&quot; or &quot;channel&quot; are attached to programme descriptions, etc.</Paragraph> <Paragraph position="6"> This process confers a good level of robustness and tolerance to ungrammaticality. This kind of approach, where dialogue strategy is privileged over parsing, was inspired by early versions of the AGS system [Sadek, 1999]. These semantic structures are used to generate search filters on the EPG database, which correspond to semantic descriptions of the user choice. They are also used for content-based speech act identification, by comparing the semantic contents of successive utterances [Cavazza, 2000].</Paragraph> <Paragraph position="8"> The dialogue strategy derives mainly from the task model. As the task is to progressively refine a programme description by elementary dialogue acts, we have adopted a speech acts based approach [Traum and Hinkelmann, 1992]. Each speech act corresponds to a specific construction operation: it is possible to map communicative operations (rejection, implicit rejection, specification, etc.) to the updating of the programme description, which serves as a filter through which the EPG database is searched.</Paragraph> <Paragraph position="9"> We are using a content-based approach to the identification of speech acts [Cavazza, 2000].</Paragraph> <Paragraph position="10"> This method has similarities with the one previously described by Maier [1996]. 
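The merging of partial parses on a content basis, outlined earlier in this section, can be sketched as follows. This is a simplified illustration with assumed descriptor names (genre, sub_genre, channel), not the actual VIP parser.

```python
def merge_partial_parses(fragments):
    # Merge the feature structures produced by partial parses: each fragment
    # contributes the EPG descriptors it carries, and descriptors already
    # present in the programme description are kept rather than overwritten.
    programme = {}
    for fragment in fragments:
        for feature, value in fragment.items():
            if feature not in programme:
                programme[feature] = value
    return programme

# Three partial parses of one ungrammatical hypothesis, merged into a single
# semantic structure that can serve as an EPG search filter.
fragments = [{"genre": "movie"}, {"sub_genre": "western"}, {"channel": "BBC one"}]
print(merge_partial_parses(fragments))
```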
Another source of inspiration was the work of Walker [1996], though it was restricted to the recognition of acceptance rather than a complete set of speech acts. Figure 4 shows the construction of search filters from the semantic contents of user utterances. Once a new utterance is analysed, its semantic contents are compared with the active search filter, which has been constructed from previous user utterances, and this comparison determines speech act recognition. For instance, when the last utterance contains semantic information for a programme sub-genre, the speech act is a specification. Explicit rejections are signalled by markers of negation, while implicit rejection speech acts are recognised when the semantic contents of the latest utterance overwrite the descriptors of the current filter (this is the case, for instance, when the current filter contains the comedy sub-genre and the user asks &quot;can I have a western?&quot;).</Paragraph> <Paragraph position="11"> In this context, speech acts provide a unified and consistent way to determine the most appropriate answer to the user as well as the way in which the search filter should be updated at each dialogue turn. In the next section, we propose a first empirical categorisation of speech recognition errors according to their impact on the dialogue process.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 From Speech Recognition Errors to Speech Acts Recognition Errors </SectionTitle> <Paragraph position="0"> Traditional error metrics used in speech recognition such as &quot;word accuracy&quot; are not reliable for measuring the global consequences of speech recognition errors on the dialogue process. This is why it has been proposed that a &quot;concept accuracy&quot; be used in place of a word accuracy. 
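Before turning to error categories, the content-based speech act recognition of the previous section can be sketched as a small decision procedure. This is our own reconstruction of the rules just described, under assumed names; the actual system also covers further acts.

```python
def recognise_speech_act(active_filter, semantics, negated):
    # Compare the semantic contents of the latest utterance with the active
    # search filter built from previous turns (assumed rule order): negation
    # markers signal an explicit rejection; overwriting a descriptor already
    # in the filter is an implicit rejection; new semantic content is a
    # specification; no new content is treated as an acceptance.
    if negated:
        return "explicit rejection"
    for feature, value in semantics.items():
        if feature in active_filter:
            if active_filter[feature] != value:
                return "implicit rejection"
        else:
            return "specification"
    return "acceptance"

# The filter holds the comedy sub-genre and the user asks "can I have a western?"
print(recognise_speech_act({"genre": "movie", "sub_genre": "comedy"},
                           {"sub_genre": "western"}, negated=False))
```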
These two metrics appear, however, to be linearly correlated [Boros et al., 1996].</Paragraph> <Paragraph position="1"> Word errors result in semantic errors, which in turn result in speech act recognition errors. It is the latter phenomenon that we are investigating, though through a largely empirical approach.</Paragraph> <Paragraph position="2"> We have developed the following experimental approach to assess the consequences of speech recognition errors. Within a set of possible &quot;test&quot; dialogues, we have recorded each user reply from several users to collect an error corpus. The impact of errors at critical points of the dialogue can then be tested on the system using textual input, by entering the utterances containing speech recognition errors. This also makes it possible to explore the consequences of a set of errors at various stages of the dialogue process (see below).</Paragraph> <Paragraph position="3"> In doing so, we are essentially exploring system behaviour in a glass box approach: this does not constitute an evaluation method for dialogue performance [Walker et al., 1997]. The only global metric that can be used in our case is dialogue length, as a measure of the extra dialogue turns required to repair the consequences of a speech recognition error.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 &quot;Fatal&quot; Speech Recognition Errors </SectionTitle> <Paragraph position="0"> Obviously, the worst errors are those which terminate the dialogue by selecting a programme that the user intended to reject. This happens when the semantic structure produced from the mis-recognised utterance can be interpreted as an acceptance speech act. In the following short dialogues, the user implicitly rejects the selection offered by the system. He does so by specifying either the parental rating (U21) or the sub-genre (U25). 
However, due to speech recognition errors, specific semantic information disappears in both cases, while the insertion of the word &quot;good&quot; results in the system producing a semantic structure that is recognised as an acceptance speech act.</Paragraph> <Paragraph position="1"> U19: Can I have a movie? S20: I have found 22 programmes for that selection. I would recommend: &quot;Midnight Fear&quot; U21: I want a movie my good can watch (REF: I want a movie my kids can watch) S22: Here is &quot;Midnight Fear&quot;, then. Goodbye! U23: Do you have any movies? S24: I found 22 programmes for your request. I would suggest: &quot;Twelve Monkeys&quot; U25: Good if I mean anything for long (REF: could you find me anything funny) S26: Here is &quot;Twelve Monkeys&quot;, then.</Paragraph> <Paragraph position="2"> Goodbye!</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 User Correction of Mis-recognition </SectionTitle> <Paragraph position="0"> Some speech recognition errors result in an incorrect speech act being recognised, but can be easily corrected by the user at the expense of a few additional dialogue turns, as shown by the following example (especially U31).</Paragraph> <Paragraph position="1"> U27: I would like to watch a movie S28: I found 22 programmes for this choice. I can recommend: &quot;The Day of the Jackal&quot; U29: Can I have worse him? (REF: can I have a western?) S30: What about the following programme: &quot;Midnight Fear&quot;? U31: No, I want a western S32: I can recommend this programme which is a western: &quot;Unforgiven&quot; The next dialogue illustrates an important aspect of speech recognition errors. Their impact actually varies according to the dialogue stage at which they take place. The same kind of error that resulted in premature acceptance has less dramatic consequences when it occurs in the opening request. 
It results in a partial understanding, where the EPG is searched for the right genre (&quot;movie&quot;) but without the right parental rating. This additional criterion can thus be subsequently added by the user, resulting in a successful search at the price of a longer dialogue.</Paragraph> <Paragraph position="2"> U33: Is there a movie my good can watch? (REF: is there a movie my kids can watch?) S34: I have 22 programmes for your request. Would you like to watch: &quot;Midnight Fear&quot; U35: What is its rating? S36: Its rating is caution U37: No, I want a movie my kids can watch S38: I would recommend this programme corresponding to that &quot;family&quot; rating: &quot;Rio Bravo&quot;</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 System Detection of Mis-recognition </SectionTitle> <Paragraph position="0"> Some dialogue control mechanisms originally developed to monitor dialogue progression can also assist in recovering from speech recognition errors. The system can detect dead ends in dialogue, for instance when not enough information is available for programme selection. The speech recognition error in U39 results in a semantic structure rejecting sports programmes as &quot;boring&quot;. In the absence of any category selected, the system returns to the user (S40), who can then correct the selection: an explicit request for the sports category overwrites its previous rejection.</Paragraph> <Paragraph position="1"> U39: What kind of sports boring you have? (REF: what kind of sports programmes do you have) S40: What would you like to see, then? U41: A sports programme S42: I could find 5 programmes for this choice. I can recommend: &quot;Row Z&quot; Also, the system reacts to non-productive dialogue, i.e. utterances that simply repeat categories already represented in the programme descriptors. 
This has an interesting side effect, in that it can detect deletion of negation markers, as in the example below.</Paragraph> <Paragraph position="1"> S43: What about the following programme: &quot;Countryfile&quot; U44: Again want a documentary (REF: I don't want a documentary) S45: Is this programme all right, then?</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 &quot;Semantic&quot; Robustness </SectionTitle> <Paragraph position="0"> The basis for robustness can be seen as a combination of shallow processing and natural redundancy of information in some of the user replies. For instance, in U52, even though the entry &quot;role&quot; has been deleted from the recognised utterance, there are sufficient semantic cues to activate a &quot;cast&quot; descriptor.</Paragraph> <Paragraph position="1"> U46: I would like to watch a movie S47: I found 22 programmes for this choice. I would recommend: &quot;Midnight Fear&quot; U48: Good if I mean anything funny (REF: could you find me anything funny) S49: I can recommend this programme which is a comedy: &quot;Casablanca&quot; U50: what kind of movies do you have S51: I have found 22 programmes for this choice. I would suggest: &quot;Twelve Monkeys&quot; U52: Who is playing the title all? (REF: who is playing the title role?) S53: &quot;Twelve Monkeys&quot; features Bruce Willis U54: Do you have any documentaries S55: I have 2 programmes for your request. 
My first selection is: &quot;Ground Force&quot; U56: An old gardening is not a little A (REF: no, gardening is not OK) S57: I can recommend this programme, which is not a gardening: &quot;Fangs&quot; Another element of robustness lies in the surface form of some linguistic expressions for acknowledgement, acceptance and rejection, which often contains redundant information: &quot;yes, this is fine&quot;, &quot;no, I don't like that one&quot;, &quot;no, gardening is not OK&quot;, etc.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.5 &quot;Accidental&quot; Robustness </SectionTitle> <Paragraph position="0"> Another form of robustness appears to be rather accidental. It happens when an incorrect speech act is recognised, but its effects in terms of system response are a subset of those of the correct speech act. In the following example, the user intends to reject the proposal (&quot;Salvador&quot;) by specifying a parental rating. This is actually interpreted as a simple rejection of the programme instance (Other speech act), preserving the current &quot;movie&quot; category selected. The new proposal happens to match the criteria by chance (i.e., the parental rating is not grounded in the system reply, see e.g. S38).</Paragraph> <Paragraph position="1"> U58: What movies do you have? S59: I found 22 programmes for this choice. I can suggest: &quot;Salvador&quot; U60: Other one like is this to there it is (REF: I don't want my kids to see this) S61: I have 21 other programmes for that selection. What about the following programme: &quot;Casablanca&quot; Conclusion We have explored the consequences of speech recognition errors in terms of content-based speech acts recognition. 
A number of factors appear to support the robustness of the system to speech recognition errors, among them the fact that dialogue control mechanisms triggered by speech act recognition can contribute to repairing the consequences of speech recognition errors. Some improvement is possible in the treatment of errors involving mismatches between categories and connotations (such as &quot;funny motoring&quot;), by including semantic consistency checks. On the other hand, errors involving wrongful acceptance and dialogue termination appear difficult to deal with.</Paragraph> <Paragraph position="2"> Finally, Fischer and Batliner [2000] have investigated which system replies are most likely to upset the user. These replies cannot always be avoided, though, precisely because they are used to repair incorrect or inconsistent understanding. It is thus important to investigate whether speech recognition errors increase the occurrence of these upsetting replies (apart from the unavoidable and necessary repairs). Obviously, in our context the most upsetting cases are the selection of a programme explicitly rejected by the user. However, it would also be necessary to explore whether the repair mechanisms described above are well accepted by the users.</Paragraph> </Section> </Section> </Paper>