<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1610"> <Title>Labeling Corrections and Aware Sites in Spoken Dialogue Systems</Title> <Section position="3" start_page="0" end_page="1" type="metho"> <SectionTitle> 2 The data 2.1 The TOOT corpus </SectionTitle> <Paragraph position="0"> Our corpus consists of dialogues between human subjects and TOOT, a spoken dialogue system that allows access to train information from the web via telephone. TOOT was collected to study variations in dialogue strategy and in user-adapted interaction (Litman and Pan, 1999). It is implemented using an IVR (interactive voice response) platform developed at AT&T, combining ASR and text-to-speech with a phone interface (Kamm et al., 1997). The system's speech recognizer is a speaker-independent hidden Markov model system with context-dependent phone models for telephone speech and constrained grammars defining the vocabulary at any dialogue state. The platform supports barge-in. Subjects performed four tasks with one of several versions of the system that differed in terms of locus of initiative (system, user, or mixed), confirmation strategy (explicit, implicit, or none), and whether these conditions could be changed by the user during the task (adaptive vs. non-adaptive). TOOT's initiative strategy specifies who has control of the dialogue, while TOOT's confirmation strategy specifies how and whether TOOT lets the user know what it just understood. The fragments in Figure 1 provide some illustrations of how dialogues vary with strategy. Subjects were 39 students: 20 native speakers and 19 non-native, 16 female and 23 male. Dialogues were recorded, and system and user behavior were logged automatically. The concept accuracy (CA) of each turn was manually labeled. If the ASR correctly captured all task-related information in the turn (e.g. time, departure and arrival cities), the turn's CA score was 1 (semantically correct). 
Otherwise, the CA score reflected the percentage of correctly recognized task information in the turn. The dialogues were also transcribed, and the transcriptions were automatically scored against the ASR recognized string to produce a word error rate (WER) for each turn. For the study described below, we examined 2328 user turns (all user input between two system inputs) from 152 dialogues.</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.2 Defining Corrections and Aware Sites </SectionTitle> <Paragraph position="0"> To identify corrections in the corpus, two authors independently labeled each turn as to whether or not it constituted a correction of a prior system failure (a rejection or CA error, which were the only system failures subjects were aware of) and subsequently decided upon a consensus label. Note that most of the discrepancies between labels were due to fatigue or incidental sloppiness of individual annotators, rather than to true disagreement. Each turn labeled `correction' was further classified as belonging to one of the following categories: REP (repetition, including repetitions with differences in pronunciation or fluency), PAR (paraphrase), ADD (task-relevant content added), OMIT (content omitted), and ADD/OMIT (content both added and omitted). Repetitions were further divided into repetitions with pronunciation variation (PRON) (e.g. yes correcting yeah) and repetitions where the correction was pronounced using the same pronunciation as the original turn, but this distinction was difficult to make and turned out not to be useful.</Paragraph> <Paragraph position="1"> User turns which included both corrections and other speech acts were distinguished by labeling them &quot;2+&quot;. For user turns containing a correction plus one or more additional dialogue acts, only the correction is used for purposes of the analysis below. We also labeled as restarts those user corrections which followed non-initial occurrences of system-initial prompts (e.g. 
&quot;How may I help you?&quot; or &quot;What city do you want to go to?&quot;); in such cases system and user essentially started the dialogue over from the beginning. Figure 2 shows examples of each correction type and additional label for corrections of system failures on I want to go to Boston on Sunday. Note that the utterance on the last line of this figure is labeled 2+PAR, given that this turn consists of two speech acts: the goal of the no part of the turn is to signal a problem, whereas the remainder of the turn serves to correct a prior error. The labels discussed in this section for corrections and aware sites may well be related to more general dialogue acts, like the ones proposed by (Allen and Core, 1997), but this needs to be explored in more detail in the future.</Paragraph> <Paragraph position="2"> Each correction was also indexed with an identifier representing the closest prior turn it was correcting, so that we could investigate &quot;chains&quot; of corrections of a single failed turn by tracing back through subsequent corrections of that turn. Figure 3 shows a fragment of a TOOT dialogue with corrections labeled as discussed above.</Paragraph> <Paragraph position="3"> We also identified aware sites in our corpus: turns where a user, while interacting with a machine, first becomes aware that the system has misrecognized a previous user turn. For our corpus, we tried to determine whether there was some evidence in the user turn indicating that the user had become aware of a mistake in the system's understanding of a previous user turn and, if so, which previous turn had occasioned that error. Note that such aware sites may or may not also be corrections (another type of post-misrecognition turn), since a user may not immediately provide correcting information. Also, it may take a while before the user is able to notice a system error. 
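The correction categories defined above can be approximated computationally. The following is an illustrative heuristic, not the authors' manual labeling protocol: it compares the word sets of a correction and the turn it corrects, and the function name and overlap criteria are our own assumptions (in particular, it cannot detect PRON-style repetitions such as yes correcting yeah, which surface here as PAR).

```python
# Illustrative heuristic for the correction categories defined above:
# compare the word sets of the original (misrecognized) turn and the
# correction. A sketch under assumptions, not the paper's protocol.

def correction_type(original: str, correction: str) -> str:
    orig_words = original.lower().split()
    corr_words = correction.lower().split()
    if corr_words == orig_words:
        return "REP"                      # verbatim repetition
    o, c = set(orig_words), set(corr_words)
    added, omitted = c - o, o - c
    if added and omitted:
        # total rewording -> paraphrase; partial overlap -> ADD/OMIT
        return "PAR" if not o.intersection(c) else "ADD/OMIT"
    if added:
        return "ADD"                      # task-relevant content added
    if omitted:
        return "OMIT"                     # content omitted
    return "REP"                          # same words, reordered
```

Under this heuristic, correcting I want to go to Boston on Sunday with just Boston would be labeled OMIT.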
Figure 4 shows an example that illustrates cases in which both the user's awareness and correction of a misrecognition occur in the same turn (e.g. turns 1159 and 1160, after system prompts for information already given in turn 1158). It also illustrates cases in which aware sites and corrections occur in different turns. For example, after the immediate explicit system confirmation of turn 1162, the user first becomes aware of the system errors (turn 1163), then separately corrects them (turn 1164). When no immediate confirmation of an utterance occurs (as with turn 1158), it may take several turns before the user becomes aware of any misrecognition errors. For example, it is not until turn 1161 that the user first becomes aware of the error in date and time from 1158; the user then corrects the error in 1162. Of all turns in our corpus, 13% represent cases of turns that are only corrections, 14% are only aware sites, and 16% are turns where aware sites and corrections co-occur. Also, note that turns 1162 and 1164 in this dialogue fragment represent cases of restarts after a non-initial occurrence of a system-initial prompt (&quot;How may I help you?&quot;).</Paragraph> </Section> </Section> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Characteristics of corrections and aware sites in TOOT </SectionTitle> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 Corrections </SectionTitle> <Paragraph position="0"> We identified 29% of the turns in the corpus as corrections. These were a serious problem for the system: they were misrecognized about twice as often as non-corrections, both in terms of transcription error (61% vs. 33%) and in terms of concept error (53% vs. 21%).</Paragraph> <Paragraph position="1"> Corrections differed from non-corrections in other ways as well, which are suggestive of how they might be identified by an SDS. 
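The corr-of indexing described earlier (each correction points at the closest prior turn it corrects) makes such chains recoverable by following links. A minimal sketch, using the turn IDs from the Figure 4 fragment as a toy reconstruction; the dict and function names are our own:

```python
# Toy reconstruction of the corr-of links from the Figure 4 fragment:
# each correction maps to the closest prior turn it corrects.
corr_of = {1159: 1158, 1160: 1158, 1162: 1158, 1164: 1162}

def correction_chain(turn_id):
    """Walk corr-of links back to the originally failed turn."""
    path = [turn_id]
    while path[-1] in corr_of:
        path.append(corr_of[path[-1]])
    return path
```

Tracing turn 1164 yields [1164, 1162, 1158], recovering the chain back to the original failure in turn 1158.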
We have previously discussed the prosodic and acoustic differences between corrections and non-corrections (Swerts et al., 2000; Hirschberg et al., 2001). In particular, in line with previous studies, we found that corrections more often tend to be hyperarticulated (longer, louder, with greater pitch excursions, etc.) than other turns. More recently we have also discovered differences in other dimensions. First, corrections appear to be longer than non-corrections in numbers of words when we examine actual human transcriptions of them, both in absolute terms (T=17.68; df=2326; p=0) and also controlling for speaker (T=5.32; df=38; p=0). Even the ASR hypotheses show this difference, with hypotheses of corrections being longer in absolute terms (T=13.72; df=2326; p=0) and across speakers (T=5.18; df=38; p=0).

[Figure 4. Dialogue fragment, giving each user turn's ID, the turn it shows awareness of, and the turn it corrects:
S: How may I help you?
U: Get me the train to New York City from Baltimore on Sunday at eight thirty pm (ID 1158; aware of: NA; corr of: NA)
S: Which city do you want to go to?
U: New York City (ID 1159; aware of: 1158; corr of: 1158)
S: Which city do you want to leave from?
U: Baltimore (ID 1160; aware of: 1158; corr of: 1158)
S: Do you want me to find the trains from Baltimore to New York City today at anytime now?
U: No (ID 1161; aware of: 1158; corr of: NA)
S: How may I help you?
U: Get me the train to New York City from Baltimore on Sunday at eight thirty pm (ID 1162; aware of: NA; corr of: 1158)
S: Do you want me to find the trains from Baltimore to New York City today at anytime now?
U: No (ID 1163; aware of: 1162; corr of: NA)
S: How may I help you?
U: Get me the train on Sunday at eight thirty pm (ID 1164; aware of: NA; corr of: 1162)]

</Paragraph> <Paragraph position="2"> Of the correction types we labeled, the largest number were REPs and OMITs, as shown in Table 1, which gives the overall distribution of correction types and the distributions for each type of system failure corrected. Table 1 shows that 39% of TOOT corrections were simple repetitions of the previously misrecognized turn. 
While this strategy is often suboptimal in correcting ASR errors (Levow, 1998), REPs (45% error) and OMITs (52% error) were better recognized than ADDs (90% error) and PARs (72% error). Thus, overall, users tend to prefer correction types that are more likely to be successful. That REPs and OMITs are more often correctly recognized can be linked to the observation that they tend to be realized with less marked prosody than ADDs and PARs. Table 2 shows that REPs and OMITs are closer to normal utterances in terms of their prosodic features than ADDs, which are considerably higher, longer, and slower. This is in line with our previous observations that marked settings for these prosodic features more often lead to recognition errors.</Paragraph> <Paragraph position="3"> What the user was correcting also influenced the type of correction chosen. Table 1 shows that corrections of misrecognitions (Post-Mrec) were more likely to omit information present in the original turn (OMITs), while corrections of rejections (Post-Rej) were more likely to be simple repetitions. The latter finding is not surprising, since the rejection message for tasks was always a close paraphrase of &quot;Sorry, I can't understand you. Can you please repeat your utterance?&quot; However, it does suggest the surprising power of system directions, and how important it is to craft prompts that favor the type of correction most easily recognized by the system. Corrections following system restarts differed in type somewhat from other corrections, with more turns adding new material to the correction and fewer of them repeating the original turn.</Paragraph> <Paragraph position="4"> Dialogue strategy clearly affected the type of correction users made. For example, users more frequently repeated their misrecognized utterance in the SystemExplicit condition than in the MixedImplicit or UserNoConfirm conditions; the latter conditions have larger proportions of OMITs and ADDs. 
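Per-type recognition-error rates like those above (REPs 45%, OMITs 52%, ADDs 90%, PARs 72%) reduce to a simple aggregation over labeled turns. The sketch below assumes hypothetical (type, misrecognized) records rather than the actual TOOT annotations:

```python
# Sketch of the per-type error aggregation behind figures like
# "REPs 45% error vs. ADDs 90% error". Turn records are hypothetical:
# (correction_type, misrecognized) with misrecognized 1 for an error.
from collections import defaultdict

def error_rate_by_type(turns):
    counts = defaultdict(lambda: [0, 0])   # type -> [errors, total]
    for corr_type, misrecognized in turns:
        counts[corr_type][0] += misrecognized
        counts[corr_type][1] += 1
    return {t: errors / total for t, (errors, total) in counts.items()}
```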
This is an important observation, given that it suggests that some dialogue strategies lead to correction types, such as ADDs, which are more likely to be misrecognized than the correction types elicited by other strategies.</Paragraph> <Paragraph position="5"> As noted above, corrections in the TOOT corpus often take the form of chains of corrections of a single original error. Looking back at Figure 3, for example, we see two chains of corrections: In the first, which begins with the misrecognition of turn 776 (&quot;Um, tomorrow&quot;), the user repeats the original phrase and then provides a paraphrase (&quot;Saturday&quot;), which is correctly recognized. In the second, beginning with turn 780, the time of departure is misrecognized. The user omits some information (&quot;am&quot;) in turn 781, but without success; an ADD correction follows, with the previously omitted information restored, in turn 783. Elsewhere (Swerts et al., 2000), we have shown that chain position has an influence on correction behavior, in the sense that more distant corrections tend to be misrecognized more often than corrections closer to the original error.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.2 Aware Sites </SectionTitle> <Paragraph position="0"> 708 (30%) of the turns in our corpus were labeled aware sites. The majority of these turns (89%) immediately follow the system failures they react to, unlike the more complex cases in Figure 4 above. If a system were able to detect aware sites with reasonable accuracy, this would be useful, given that the system could then correctly guess, in the majority of cases, that the problem occurred in the preceding turn. Aware turns, like corrections, tend to be misrecognized at a higher rate than other turns; in terms of transcription accuracy, 50% of awares are misrecognized vs. 
35% of other turns, and in terms of concept accuracy, 39% of awares are misrecognized compared to 27% of other turns. In other words, both types of post-error utterances, i.e., corrections and aware sites, tend to lead to additional errors. But whereas we have shown above that for corrections this is probably caused by the fact that these utterances are produced in a hyperarticulated speaking style, we do not find differences in hyperarticulation between aware sites and `normal' utterances (T=0.9085; df=38; p=0.3693).</Paragraph> <Paragraph position="1"> This could mean that aware sites are realized in a speaking style which is not perceptibly different from that of other turns when judged by human labelers, but which is still sufficiently different to cause problems for an ASR system.</Paragraph> <Paragraph position="2"> In terms of distinguishing features which might explain or help to identify these turns, we have previously examined the acoustic and prosodic features of aware sites (Litman et al., 2001). Here we present some additional features. Aware sites appear to be significantly shorter, in general, than other turns, both in absolute terms and controlling for speaker variation, whether we examine the ASR transcription (absolute: T=4.86; df=2326; p=0; speaker-controlled: T=5.37; df=38; p=0) or the human one (absolute: T=3.45; df=2326; p<.0001; speaker-controlled: T=4.69; df=38; p=0). A sizable but not overwhelming number of aware sites in fact consist of a simple negation (i.e., a variant of the word `no') (see Table 4). 
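A baseline negation detector of this kind can be scored with standard precision and recall against the aware-site labels. A sketch, assuming hypothetical (transcription, is-aware-site) pairs and an assumed set of negation variants:

```python
# Sketch of evaluating a simple negation detector as a predictor of
# aware sites. The turn records and the NEGATIONS set are assumptions
# for illustration, not the TOOT corpus data.
NEGATIONS = {"no", "nope"}   # assumed variants of the word 'no'

def no_detector_scores(turns):
    """turns: iterable of (transcription, is_aware_site) pairs."""
    def is_no(text):
        return text.strip().lower() in NEGATIONS
    tp = sum(1 for text, aware in turns if is_no(text) and aware)
    fp = sum(1 for text, aware in turns if is_no(text) and not aware)
    fn = sum(1 for text, aware in turns if not is_no(text) and aware)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```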
This at the same time shows that a simple no-detector will not be sufficient as an indicator of aware sites (see also Krahmer et al., 1999; Krahmer et al., to appear), given that most aware sites are more complex than that, such as turns 1159 and 1160 in the example of Figure 4.</Paragraph> <Paragraph position="3"> More concretely, Table 4 shows that a single no would correctly predict that the turn is an aware site with a precision of only 57% and a recall of only 23%.</Paragraph> </Section> </Section> </Paper>