<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0620"> <Title>Dialogue Strategies for Improving the Usability of Telephone Human-Machine Communication</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> LIMSI Arise Railway Information system (Lamel et </SectionTitle> <Paragraph position="0"> al., 1996) that has 1,500 words, including more than 680 station names. This peculiarity directly impacts on the performance of the language models, that is in these applications, language modeling predictions are weaker: when the dialogue prediction says that next user's utterance is likely to be about a departure place, this does not exclude that the recognizer substitutes the actually uttered name with a phonetically similar one. Only the user is able to detect such kinds of errors. In this situation the dialogue system should identify the user's detection of miscommunication and provide appropriate repairs.</Paragraph> <Paragraph position="1"> All the problems described above lead to the decrease of the recognition performance and of the usability of spoken language systems. More specifically, they identify some severe requirements that spoken dialogue modules have to meet. In particular, dialogue systems for telephone applications have to rely not only on an adequate model of the human user, but they should also implement particular techniques for preventing and recovering communication breakdowns.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Prevention and Repair of </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Miscommunication 3.1 Prevention of miscommunication </SectionTitle> <Paragraph position="0"> The user modeling of the dialogue module of Dialogos is based on the assumption that both the system and the user are active agents that cooperate in order to fulfill the goal of the speech interaction. In the application domain of Dialogos the goal of the interaction is to collect all the parameters that are necessary to access the railway database for retrieving the information that satisfy user's needs.</Paragraph> <Paragraph position="1"> Hello, This is Train Enquiry Service.</Paragraph> <Paragraph position="2"> Please speak after the tone.</Paragraph> <Paragraph position="3"> Please state your departure and your destination. Voglio partire da Mils(no)- Milano di sera.</Paragraph> <Paragraph position="4"> Vado da Milano a Roma.</Paragraph> <Paragraph position="5"> I mould like to leave from Milano ... Milano in the evening.</Paragraph> <Paragraph position="6"> \[ am going from Milano to Rome.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> VOGLIO PAB.TIB.E DA NO MILANO SERA VADO DA MILANO P,.OMA WANT LEA VE FROM NO MILANO EVENING GO FROM MILANO TO ROMA </SectionTitle> <Paragraph position="0"> < confirm=NO, dep-city=MILANO, < arr.city=ROMA, part-day=EVENING> Do you want to go from Milano to Roma leaving in the evening? Si ... da Milano a Roma di sera.</Paragraph> <Paragraph position="1"> Yes ... from Milano to Roma in the evening. SI BLOW DA MILANO ROMA SEtLA YES BLOW FROM MILANO ROMA EVENING < confirm=YES, dep.city=MILANO, arr-city=ROMA, dep.time=EVENING> There are many trains in the evening. Which hour do you want to leave? Mah ... mi dica se c'e' qualcosa nile otto. Who knows ... 
Tell me if there is something at eight.</Paragraph> </Section> <Section position="7" start_page="0" end_page="116" type="metho"> <SectionTitle> MATTINO GINOSA ALLE OTTO MORNING GINOSA AT EIGHT </SectionTitle> <Paragraph position="0"> <part-day=MORNING, city=GINOSA, hour=EIGHT> T7-S: Train 243 leaves from Milano Centrale at 8:20 p.m.; it arrives at Roma Termini at S a.m. Do you need additional information about this train?</Paragraph> <Paragraph position="1"> As explained above, telephone recognition is error-prone. To prevent recognition errors, the dialogue system sends to the lower levels of analysis information about the domain objects in focus during each turn of the interaction; this information allows the triggering of context-dependent language models that help to constrain the lexical choices at the recognition level (see Section 4). Moreover, in order to detect recognition or interpretation errors that occurred in previous turns, the dialogue system takes advantage of the global history of the interaction and accepts only interpretations of the user's input that are coherent with the dialogue history. For example, let us consider the excerpt from the Dialogos corpus shown in Figure 1. In the example, on the left, the letter 'T' stands for 'Turn', and the letters 'U' and 'S' stand for user and system, respectively. For each user turn we report the original utterance in Italian and its English translation (in italics). Then we transcribe the best decoded sequence (in ALL CAPS), that is, the recognizer output. The English translations of the best decoded sequences are also shown in ALL CAPS (in italics). The task-oriented semantic frames (the input of the dialogue module) are enclosed in angle brackets. The system turns are reported only in their English translation.</Paragraph> <Paragraph position="2"> In T2-U the user utterance contains a hesitation when uttering the name of the departure city, &quot;Milano&quot;. The first part of the word, &quot;Mila-&quot;, was misrecognized as a noise, and the last syllable was recognized as &quot;no&quot;, which the parser interpreted as the negative adverb &quot;no&quot;. In this initial dialogue context there were no parameters to be denied, so the dialogue module was able to discard the information related to the negative adverb. It then addressed the user with the confirmation request of T3-S. T4-U was a confirmation turn of the user. After consulting the railway database, the system realized that the number of evening connections between Milano and Roma was high, and it asked the user to choose a more precise departure time (T5-S). In T6-U the utterance segment &quot;mah ... mi dica se c'e' qualcosa&quot; (who knows ... tell me if there is something) was misrecognized as &quot;mattino Ginosa&quot; (morning Ginosa, where &quot;Ginosa&quot; is the name of an Italian village). As a consequence, the parser output contains another value for the part-of-day parameter and a departure hour. However, the dialogue module discarded the information about the part-of-day, since it conflicted with a parameter value that the user had already confirmed, and only the second part of the utterance interpretation was retained (that is, the departure hour). In this case the insertion of a concept due to misrecognition was repaired at the dialogue level. (Footnote 1: This version of Dialogos considers only the first-best solution. If the expected information is not found in the semantic representation of the current user utterance, the dialogue system hypothesizes that something went wrong in the previous analysis, and it interprets that situation as an occurrence of non-understanding.)</Paragraph>
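To make this history-based filtering concrete, the following minimal Python sketch shows how parser output that conflicts with parameters already confirmed by the user can be discarded, reproducing the treatment of T6-U described above. The function, slot names, and coherence tests are illustrative assumptions, not the actual Dialogos code.

# Minimal sketch of history-based filtering of parser output.
# Slot names and the coherence tests are illustrative assumptions.

def filter_against_history(parsed_frame, confirmed_params):
    """Discard parsed values that contradict already-confirmed parameters."""
    accepted = {}
    confirmed_cities = {confirmed_params.get("dep-city"), confirmed_params.get("arr-city")}
    for slot, value in parsed_frame.items():
        if slot in confirmed_params and confirmed_params[slot] != value:
            # Conflicts with a confirmed value, e.g. part-day=MORNING after
            # an evening departure has already been confirmed.
            continue
        if slot == "city" and value not in confirmed_cities:
            # A generic city that cannot be bound to a confirmed departure or
            # arrival city is treated as a likely insertion error.
            continue
        accepted[slot] = value
    return accepted

# T6-U: "mah ... mi dica se c'e' qualcosa alle otto" decoded as "mattino Ginosa alle otto".
confirmed = {"dep-city": "MILANO", "arr-city": "ROMA", "part-day": "EVENING"}
parsed = {"part-day": "MORNING", "city": "GINOSA", "hour": "EIGHT"}
print(filter_against_history(parsed, confirmed))  # {'hour': 'EIGHT'} -- only the departure hour survives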
<Paragraph position="3"> As we can see from the example, the dialogue module makes use of confirmation turns because it deals with potentially incorrect information. However, the need for confirmations may result in a lack of naturalness of the telephone human-machine dialogues.</Paragraph> <Paragraph position="4"> In order to reduce the number of confirmation turns, we use the following strategies (a decision sketch follows the list): * the dialogue system avoids confirmation turns when the acquired information is coherent with the dialogue history and with the current focus * the dialogue system asks for multiple confirmations of the acquired parameters (as in T3-S) * the dialogue system asks for implicit confirmations whenever possible (as in &quot;From Milano to Roma. When do you want to travel?&quot;)</Paragraph>
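One possible reading of these three strategies as a decision procedure is sketched below. The function, the slot names, and the coherence test are assumptions made for illustration; this is not the confirmation logic actually implemented in Dialogos.

# Illustrative decision sketch for the confirmation strategies listed above
# (hypothetical names; not the actual Dialogos implementation).

def choose_confirmation(new_params, history, focus):
    """Pick a confirmation strategy for newly acquired parameters."""
    # Strategy 1: no confirmation turn if every value is already supported by
    # the dialogue history and concerns only the parameters currently in focus.
    already_supported = all(history.get(slot) == value for slot, value in new_params.items())
    in_focus = set(new_params) <= set(focus)
    if already_supported and in_focus:
        return "no-confirmation"
    # Strategy 2: confirm several parameters in a single turn (as in T3-S).
    if len(new_params) >= 2:
        return "multiple-confirmation"
    # Strategy 3: otherwise prefer an implicit confirmation, e.g.
    # "From Milano to Roma. When do you want to travel?"
    return "implicit-confirmation"

print(choose_confirmation({"dep-city": "MILANO", "arr-city": "ROMA", "part-day": "EVENING"},
                          history={}, focus=["dep-city", "arr-city"]))
# -> "multiple-confirmation", mirroring the joint confirmation of T3-S in Figure 1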
<Section position="1" start_page="115" end_page="116" type="sub_section"> <SectionTitle> 3.2 Repair of miscommunication </SectionTitle> <Paragraph position="0"> Sometimes the detection of recognition errors (for example, the substitution of an uttered word with another word of the same class) is outside the capabilities of both the parser and the dialogue modules. On the contrary, in principle the user is able to detect and correct such errors, and often she does so immediately or in subsequent turns.</Paragraph> <Paragraph position="1"> In (Danieli, 1996) an analysis of such phenomena is offered, and the method described in that paper is currently implemented in Dialogos. The approach is based on pragmatics-based expectations about the semantic content of the user's utterance in the next turn. The theoretical background is the account of human-human conversation given by (Grice, 1967), re-interpreted in the context of human-computer conversation. In particular, the dialogue system is able to deal with non-understandings and misunderstandings (see (Hirst et al., 1994) for the classification of non-understanding, misunderstanding, and misconception), and it may recognize the occurrence of a miscommunication phenomenon on the basis of two pragmatic counterparts, that is, the deviation of the user's behaviour from the system's expectations and the generation of a conversational implicature.</Paragraph> <Paragraph position="2"> Non-understanding is recognized by the dialogue system as soon as it happens, because the system is not able to find any interpretation of the current user turn. On the contrary, misunderstandings are more difficult to detect and solve, because the dialogue system usually does obtain an interpretation of the user's utterance, but that interpretation is not the one intended by the speaker. If the user's correction of a misunderstanding occurs while the parameter is focused (that is, it occurs as a third-turn repair, see (Schegloff, 1992)), the focusing mechanism and the dialogue expectations allow the system to grasp the correction immediately. However, it is more problematic to detect users' corrections if they happen some turns after the occurrence of the errors. The dialogue system initially interprets the user's correction with respect to its current set of expectations. As soon as it realizes that there is a deviation of the user's behaviour from the expected behaviour, it hypothesizes a misunderstanding, and it re-interprets the current utterance on the basis of the context of the misunderstood utterance (thanks to a focus-shifting mechanism).</Paragraph> <Paragraph position="3"> Finally, the output of the parsing module may be only partially determined. In that case the dialogue module initiates clarification subdialogues. Let us discuss the excerpt shown in Figure 2.</Paragraph> <Paragraph position="4"> In T2-U the arrival city is recognized and understood as a generic city; the dialogue strategy does not reject this information, but it enters a clarification subdialogue in order to resolve the ambiguity (T3-S and T4-U). The last turn of the system is a dialogue act that fulfills two communicative goals, that is, the (implicit) confirmation of the arrival city and the request for the departure city. However, clarification subdialogues may be avoided if the dialogue expectations allow the system to choose an interpretation of the ambiguous input.</Paragraph> <Paragraph position="5"> Clarification subdialogues may also occur when the parser output contains inconsistent related information. This may happen either because of recognition errors or because of users' misconceptions. In our application domain misconceptions, i.e. errors in the prior knowledge of a participant, usually concern the expression of departure dates, as in the dialogue excerpt shown in Figure 3. The conversation took place on Thursday, February 27th. The dialogue system recognizes the misconception in T2-U because the week day, the day, and the month are not interpretable with respect to its knowledge of the year's calendar (and the computer presumes that its calendar is correct). Moreover, the dialogue system finds that different chunks of the information supplied by the user could be coherently interpretable. Since it has no principled way to decide between them, it initiates the clarification subdialogue of T3-S.</Paragraph> </Section> </Section> <Section position="8" start_page="116" end_page="116" type="metho"> <SectionTitle> 4 Dialogue State Dependent </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="116" end_page="116" type="sub_section"> <SectionTitle> Language Modeling </SectionTitle> <Paragraph position="0"> The dialogue module makes use of pragmatics-based expectations about the semantic content of the user's next utterance. In order to improve speech recognition performance, this contextual knowledge may be used as a constraint for the language models.</Paragraph> <Paragraph position="1"> Contextual information is sent to the lower levels of analysis by communicating the dialogue act produced by the system for addressing the user. In Dialogos there are four classes of dialogue acts (request, confirmation, clarification, and request plus confirmation). Each class is further specialized with the indication of the focused parameters: for example, &quot;request: date of departure&quot; if the system is asking the user to provide a departure date, &quot;request-plus-confirmation: departure-plus-arrival&quot; when the system is addressing the user with feedback about the departure city and a request for the arrival city, and so on. The information about the system dialogue act is called the &quot;dialogue prediction&quot;. The recognizer makes use of the predictions by selecting a specific language model which was trained on a coherent partition of the training corpus (Popovici and Baggia, 1997).</Paragraph>
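As a concrete illustration of how a dialogue prediction can drive recognition, the sketch below maps prediction labels to dialogue-state-dependent language models and falls back to a context-independent model for states without a dedicated model. The label spellings and model names are assumptions made for the example, not the actual interface between the Dialogos modules.

# Illustrative sketch of dialogue-prediction-driven language model selection
# (hypothetical prediction labels and model names).

STATE_DEPENDENT_LMS = {
    "request: date-of-departure": "lm_request_date",
    "confirmation: departure-plus-arrival": "lm_confirm_dep_arr",
    "request-plus-confirmation: departure-plus-arrival": "lm_req_conf_dep_arr",
    "clarification: arrival-city": "lm_clarify_arr",
}
CONTEXT_INDEPENDENT_LM = "lm_context_independent"

def select_language_model(dialogue_prediction):
    """Return the LM trained on the partition of the corpus that matches
    the dialogue act the system has just produced."""
    return STATE_DEPENDENT_LMS.get(dialogue_prediction, CONTEXT_INDEPENDENT_LM)

# The dialogue module sends the prediction together with the system prompt;
# the recognizer then decodes the next user turn with the selected model.
print(select_language_model("request: date-of-departure"))  # lm_request_date
print(select_language_model("greeting"))                    # lm_context_independent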
<Paragraph position="2"> The training set was collected in previous experiments with the system: users' responses to each specific system dialogue act were classified for training different language models. At present, the speech recognizer makes use of a class-based bigram model; then, in order to re-score the n-best decoded sequences, it uses a class-based trigram model.</Paragraph> <Paragraph position="3"> In order to measure the improvements in recognition performance obtained using dialogue state dependent language models, we compared the Word Accuracy (WA) and Sentence Understanding (SU) rates obtained using different language models on the same test set. The test set included 2,040 utterances, randomly selected from corpus data collected during a field trial of Dialogos with 500 inexperienced subjects. In the first experiment, a single context-independent language model was used; it was trained on a set of 15,575 utterances (produced by native Italian speakers). In the second experiment, a set of dialogue state dependent models, trained on the same training set as the first experiment, was used; however, in this case the training set was encoded according to the different dialogue states, as explained above. The error rate reduction between the two experiments is 8.6% for WA and 10.9% for SU. Moreover, with an improved acoustic model (trained on a domain-dependent training set) the error reduction is even greater (over 12%) for both WA and SU.</Paragraph> </Section> </Section> <Section position="9" start_page="116" end_page="116" type="metho"> <SectionTitle> 5 Experimental Data </SectionTitle> <Paragraph position="0"> The Dialogos corpus consists of 1,404 dialogues, including 13,123 utterances. All the calls were made over the public telephone network and from different environments (house, office, street, and car).</Paragraph> <Paragraph position="1"> The WA and SU results on the global utterance corpus were 61% and 76%, respectively. These results were greatly influenced by the quality of the telephone acoustic signal and by the noisy environments. Moreover, several city names contained in the dictionary of the system could easily be confused.</Paragraph> <Paragraph position="2"> The overall system performance was measured with the Transaction Success (TS) metric, i.e. the measure of the success of the system in providing users with the information they require (Danieli and Gerbino, 1995). The TS rate was 70% on the 1,404 dialogues. By excluding from the corpus the set of dialogues that failed because of users' errors, we obtained a TS result of 84%. The average successful dialogue duration is about 2 minutes: in most of the dialogues all the parameters were acquired and confirmed during the first minute of user-system interaction.</Paragraph>
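For readers who want to reproduce the reported percentages, the sketch below shows the arithmetic we assume lies behind them: relative reduction of the error rate (1 - WA, or 1 - SU) for the language-model comparison, and Transaction Success as the fraction of successful dialogues. The numeric inputs are purely illustrative; the paper does not report the absolute WA/SU values of the two language-model experiments.

# Assumed formulas behind the reported figures; the inputs below are examples only.

def relative_error_reduction(acc_baseline, acc_improved):
    """Relative reduction of the error rate (1 - accuracy)."""
    err_base, err_new = 1.0 - acc_baseline, 1.0 - acc_improved
    return (err_base - err_new) / err_base

def transaction_success(successful_dialogues, total_dialogues):
    return successful_dialogues / total_dialogues

# Example: a baseline WA of 61% improved to about 64.4% would correspond to
# an 8.6% relative error reduction (error rate 0.39 -> 0.3565).
print(round(relative_error_reduction(0.61, 0.6435), 3))  # 0.086

# TS on the whole corpus: roughly 983 successful dialogues out of 1,404 give 70%.
print(round(transaction_success(983, 1404), 2))          # 0.7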
<Paragraph position="3"> It is an open issue whether a spoken dialogue system should generate a clarification subdialogue when faced with ambiguous or unclear input. For example, the system described in (Allen et al., 1996) was designed on the basis of the principle that it was better to assume an interpretation and, of course, to be able to understand corrections when they arise. On the contrary, Dialogos was designed to enter clarification subdialogues when faced with input that cannot receive a single coherent interpretation in the dialogue context. Actually, we think that in general the strategy implemented by (Allen et al., 1996) may be more effective for the naturalness of the dialogue; however, we believe that the effectiveness of that choice greatly depends on the ability of the users to grasp inconsistencies in the system feedback. In the Dialogos corpus we observed that subjects were usually able to correct errors in confirmation turns that concerned a single piece of information or two semantically related pieces of information (such as departure and arrival); on the contrary, some errors were not corrected when the feedback was offered together with a system initiative, or when the system asked the user to confirm pieces of information that were not strongly related.</Paragraph> <Paragraph position="4"> The dialogue shown in Figure 4 is a typical example.</Paragraph> <Paragraph position="5"> The acoustic decoding of &quot;Allora&quot; (a word that is used in Italian for taking the turn) was erroneous: it was substituted with &quot;All'una&quot; (at one o'clock). This was interpreted as a departure hour. A joint confirmation of the departure hour and the arrival city was requested, and the user confirmed both of them. In the next section we will elaborate more on users' errors.</Paragraph> <Paragraph position="6"> For the sake of the present discussion, this example shows that users are not always able to correct errors: on the contrary, we have seen above that the percentage of users' errors is high. In order to evaluate the effectiveness of the different approaches to facing ambiguity, we should test the different strategies in the same domain, or at least with the same interaction modality (phone or microphone).</Paragraph> <Paragraph position="7"> However, we have obtained some data that may give some insight into the issue. In the Dialogos corpus we calculated the number of turns necessary for acquiring the departure and arrival cities in the successful dialogues. While 64% of the users were able to give them in two turns (that is, without experiencing recognition errors), the remaining 36% took from three to eight turns, i.e., these users spent from three to eight turns on corrections. Since the percentage of users who were not able to detect recognition errors is around 16%, we may hypothesize that a part of the subjects who experienced clarification subdialogues would have failed to give the correct values of the task parameters. Moreover, if we consider the cost of clarifications and repairs in terms of time, it is not particularly high: giving departure and arrival in less than three turns (that is, without clarifications or repairs) takes from 20 to 29 seconds, while entering repair subdialogues increased this time by an average of 25 seconds over the total average duration of the dialogues.</Paragraph> </Section> </Paper>