<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2029"> <Title>Predicting Automatic Speech Recognition Performance Using Prosodic Cues</Title> <Section position="3" start_page="218" end_page="218" type="metho"> <SectionTitle> 2 The TOOT Corpus </SectionTitle> <Paragraph position="0"> Our corpus consists of a set of dialogues between humans and TOOT, an SDS for accessing train schedules from the web via telephone, which was collected to study both variations in SDS strategy and user-adapted interaction (Litman and Pan, 1999). TOOT is implemented on a platform combining ASR, text-to-speech, a phone interface, a finite-state dialogue manager, and application functions (Kamm et al., 1997). The speech recognizer is a speaker-independent hidden Markov model system with context-dependent phone models for telephone speech and constrained grammars for each dialogue state. Confidence scores for recognition were available only at the turn, not the word, level (Zeljkovic, 1996). An example TOOT dialogue is shown in Figure 1.</Paragraph> <Paragraph position="1"> Subjects performed four tasks with one of several versions of TOOT, which differed in terms of locus of initiative (system, user, or mixed), confirmation strategy (explicit, implicit, or none), and whether these conditions could be changed by the user during the task. Subjects were 39 students, 20 native speakers of standard American English and 19 non-native speakers; 16 subjects were female and 23 male. Dialogues were recorded and system and user behavior logged automatically. The concept accuracy (CA) of each turn was manually labeled by one of the experimenters. If the ASR output correctly captured all the task-related information in the turn (e.g. time, departure and arrival cities), the turn was given a CA score of 1 (a semantically correct recognition).</Paragraph> <Paragraph position="2"> Otherwise, the CA score reflected the percentage of correctly recognized task information in the turn.</Paragraph> <Paragraph position="3"> The dialogues were also transcribed by hand and these transcriptions automatically compared to the ASR recognized string to produce a word error rate (WER) for each turn. Note that a concept can be correctly recognized even though all words are not, so the CA metric does not penalize for errors that are unimportant to overall utterance interpretation.</Paragraph> <Paragraph position="4"> For the study described below, we examined 1994 user turns from 152 dialogues in this corpus. The speech recognizer was able to generate a recognized string and an associated acoustic confidence score per turn for 1975 of these turns. 1 1410 of these 1975 turns had a CA score of 1 (for an overall conceptual accuracy score of 71%) and 961 had a WER of 0 (for an overall transcription accuracy score of 49%, with a mean WER per turn of 47%).</Paragraph> <Paragraph position="5"> 1For the remaining turns, ASR output &quot;no speech&quot; (and TOOT played a timeout message) or &quot;garbage&quot; (TOOT played a rejection message).</Paragraph> </Section> <Section position="4" start_page="218" end_page="220" type="metho"> <SectionTitle> 3 Distinguishing Correct from </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="218" end_page="220" type="sub_section"> <SectionTitle> Incorrect Recognitions </SectionTitle> <Paragraph position="0"> We first looked for distinguishing prosodic characteristics of misrecognitions, defining misrecognitions in two ways: a) as turns with WER>0; and b) as turns with CA<1.
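The per-turn word error rate above is obtained by aligning the hand transcription with the ASR string. The following is a minimal sketch of such a computation using standard Levenshtein alignment over words; it is illustrative only and is not the authors' scoring code.

```python
# Minimal sketch of per-turn word error rate (WER), assuming standard
# Levenshtein (edit-distance) alignment over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# A turn counts as misrecognized under the WER criterion when wer(...) > 0,
# and under the concept-accuracy criterion when the hand-labeled CA score < 1.
```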
As noted in Section 1, previous studies have speculated that hyperarticulated speech (slower and louder speech which contains wider pitch excursions) may be associated with recognition failure. So, we examined the following features for each user turn: maximum and mean fundamental frequency (F0 max, F0 mean), maximum and mean energy (RMS max, RMS mean), total duration, length of the pause preceding the turn, speaking rate (tempo), and the amount of silence within the turn (% Silence). The F0 and RMS features, which approximate pitch excursion and loudness, were calculated from the output of Entropic Research Laboratory's pitch tracker, get_f0, with no post-correction. Timing variation was represented by four features. Duration within turns and length of pauses between turns were computed from the temporal labels associated with each turn's beginning and end. Speaking rate was approximated in terms of syllables in the recognized string per second, while % Silence was defined as the percentage of zero frames in the turn, i.e., roughly the percentage of time within the turn that the speaker was silent. These features were chosen based upon previous findings (see Section 1) and observations from our data.</Paragraph> <Paragraph position="1"> To ensure that our results were speaker independent, we calculated mean values for each speaker's recognized turns and their misrecognized turns for every feature. Then, for each feature, we created vectors of speaker means for recognized and misrecognized turns and performed paired t-tests on the vectors. For example, for the feature &quot;F0 max&quot;, we calculated mean maxima for misrecognized turns and for correctly recognized turns for each of our thirty-nine speakers. We then performed a paired t-test on these thirty-nine pairs of means to derive speaker-independent results for differences in F0 maxima between correct and incorrect recognitions.</Paragraph> <Paragraph position="2"> Tables 1 and 2 show results of these comparisons when we calculate misrecognition in terms of WER>0 and CA<1, respectively. 2While the features were automatically computed, turn beginnings and endings were hand segmented in dialogue-level speech files, as the turn-level files created by TOOT were not available.</Paragraph> <Paragraph position="3"> Figure 1: Example TOOT dialogue. TOOT: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you? User: I want the trains from New York City to Washington DC on Monday at 9:30 in the evening. TOOT: Do you want me to find the trains from New York City to Washington DC on Monday approximately at 9:30 in the evening now? User: Yes. TOOT: I am going to get the train schedule for you ...</Paragraph> <Paragraph position="4"> Tables 1 and 2: Misrecognized vs. Recognized Turns by Prosodic Feature Across Speakers; *significant at a 95% confidence level (p< .05).</Paragraph> <Paragraph position="5"> These results indicate that misrecognized turns do differ from correctly recognized ones in terms of prosodic features, although the features on which they differ vary slightly, depending upon the way &quot;misrecognition&quot; is defined. Whether defined by WER or CA, misrecognized turns exhibit significantly higher F0 and RMS maxima, longer durations, and longer preceding pauses than correctly recognized speaker turns. For a traditional WER definition of misrecognition, misrecognitions are slightly higher in mean F0 and contain a lower percentage of internal silence.
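The speaker-level paired t-test procedure described above can be sketched briefly. The example below assumes per-turn feature values and misrecognition labels are already in a pandas DataFrame; the column names ('speaker', 'misrecognized', 'f0_max') are hypothetical and this is not the authors' original analysis code.

```python
# Minimal sketch of the per-speaker paired t-test on a prosodic feature.
import pandas as pd
from scipy.stats import ttest_rel

def speaker_paired_ttest(turns: pd.DataFrame, feature: str):
    # One mean per speaker for misrecognized turns and one for recognized turns.
    means = (turns.assign(status=turns['misrecognized'].map({True: 'misrec', False: 'rec'}))
                  .groupby(['speaker', 'status'])[feature]
                  .mean()
                  .unstack()
                  .dropna())            # keep speakers with both kinds of turns
    # Paired t-test across the vectors of speaker means.
    return ttest_rel(means['misrec'], means['rec'])

# e.g. result = speaker_paired_ttest(turns, 'f0_max'); result.statistic, result.pvalue
```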
For a CA definition, on the other hand, tempo is a significant factor, with misrecognitions spoken at a faster rate than correct recognitions -- contrary to our hypothesis about the role of hyperarticulation in recognition error.</Paragraph> <Paragraph position="6"> While the comparisons in Tables 1 and 2 were made on the means of raw values for all prosodic features, little difference is found when values are normalized by the value of the first or preceding turn, or by converting to z scores. 3 From this similarity between the performance of raw and normalized values, it would seem to be relative differences in speakers' prosodic values, not deviation from some 'acceptable' range, that distinguishes recognition failures from successful recognitions. A given speaker's turns that are higher in pitch or loudness, or that are longer, or that follow longer pauses, are less likely to be recognized correctly than that same speaker's turns that are lower in pitch or loudness, shorter, and follow shorter pauses -- however correct recognition is defined. 3The only differences occur for CA-defined misrecognition, where normalizing by first utterance results in significant differences in mean RMS, and normalizing by preceding turn results in no significant differences in tempo.</Paragraph> <Paragraph position="7"> It is interesting to note that the features we found to be significant indicators of failed recognitions (F0 excursion, loudness, long prior pause, and longer duration) are all features previously associated with hyperarticulated speech. Since prior research has suggested that speakers may respond to failed recognition attempts by hyperarticulating, which itself may lead to more recognition failures, had we in fact simply identified a means of characterizing and identifying hyperarticulated speech prosodically? Since we had independently labeled all speaker turns for evidence of hyperarticulation (two of the authors labeled each turn as &quot;not hyperarticulated&quot;, &quot;some hyperarticulation in the turn&quot;, and &quot;hyperarticulated&quot;, following Wade et al. (1992)), we were able to test this possibility. We excluded any turn either labeler had labeled as partially or fully hyperarticulated, and again performed paired t-tests on mean values of misrecognized versus recognized turns for each speaker. Results show that for both WER-defined and CA-defined misrecognitions, not only are the same features significant differentiators when hyperarticulated turns are excluded from the analysis, but in addition, tempo is also significant for WER-defined misrecognition. So, our findings for the prosodic characteristics of recognized and of misrecognized turns hold even when perceptibly hyperarticulated turns are excluded from the corpus.</Paragraph> </Section> </Section> <Section position="5" start_page="220" end_page="222" type="metho"> <SectionTitle> 4 Predicting Misrecognitions Using </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="220" end_page="222" type="sub_section"> <SectionTitle> Machine Learning </SectionTitle> <Paragraph position="0"> Given the prosodic differences between misrecognized and correctly recognized utterances in our corpus, is it possible to predict accurately when a particular utterance will be misrecognized or not? This section describes experiments using the machine learning program RIPPER (Cohen, 1996) to automatically induce prediction models, using prosodic as well as additional features.
Like many learning programs, RIPPER takes as input the classes to be learned, a set of feature names and possible values, and training data specifying the class and feature values for each training example. RIPPER outputs a classification model for predicting the class of future examples. The model is learned using greedy search guided by an information gain metric, and is expressed as an ordered set of if-then rules.</Paragraph> <Paragraph position="1"> Our predicted classes correspond to correct recognition (T) or not (F). As in Section 3, we examine both WER-defined and CA-defined notions of correct recognition, and represent each user turn as a set of features. The features used in our learning experiments include the raw prosodic features in Tables 1 and 2 (which we will refer to as the feature set &quot;Prosody&quot;), the hyperarticulation score discussed in Section 3, and several additional potential predictors of misrecognition (described in Section 2). The first three of these additional features are derived from the ASR process (the context-dependent grammar used to recognize the turn, the turn-level acoustic confidence score output by the recognizer, and the recognized string). We included these features as a baseline against which to test new methods of predicting misrecognitions, although, currently, we know of no ASR system that includes the recognized string in its rejection calculations. 4 TOOT itself used only the first two features to calculate rejections and ask the user to repeat the utterance, whenever the confidence score fell below a pre-defined grammar-specific threshold. 4Note that, while the entire recognized string is provided to the learning algorithm, RIPPER rules test for the presence of particular words in the string.</Paragraph> <Paragraph position="2"> The other features represent the experimental conditions under which the data were collected (whether users could adapt TOOT's dialogue strategies, TOOT's initial initiative and confirmation strategies, experimental task, speaker's name and characteristics). We included these features to determine the extent to which particulars of task, subject, or interaction influenced ASR success rates or our ability to predict them; previous work showed that these factors impact TOOT's performance (Litman and Pan, 1999; Hirschberg et al., 1999). Except for the task, subject, gender, native language, and hyperarticulation scores, all of our features are automatically available.</Paragraph> <Paragraph position="3"> Table 3 shows the relative performance of a number of the feature sets we examined; results here are for misrecognition defined in terms of WER. 5 A baseline classifier for misrecognition, predicting that ASR is always wrong (the majority class of F), has an error of 48.66%. The best performing feature set includes only the raw prosodic and ASR features and reduces this error to an impressive 6.53% +/- 0.63%. Note that this performance is not improved by adding manually labeled features or experimental conditions: the feature set corresponding to ALL features yielded the statistically equivalent 6.68% +/- 0.63%.</Paragraph> <Paragraph position="4"> With respect to the performance of prosodic features, Table 3 shows that using them in conjunction with ASR features (error of 6.53%) significantly outperforms prosodic features alone (error of 12.76%), which, in turn, significantly outperforms any single prosodic feature; duration, with an error of 17.42%, is the best such feature.
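The prediction task set up above (per-turn feature vector in, T/F class out) can be sketched as follows. RIPPER itself is not assumed to be available here, so a scikit-learn decision tree is used purely as a stand-in rule inducer; unlike RIPPER it does not produce an ordered if-then ruleset, and the feature names are hypothetical. The categorical grammar and recognized-string features used in the paper are omitted from this sketch.

```python
# Minimal sketch of training a misrecognition predictor on prosodic + ASR
# features, with a decision tree standing in for RIPPER (Cohen, 1996).
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ['f0_max', 'f0_mean', 'rms_max', 'rms_mean', 'duration',
            'prior_pause', 'tempo', 'pct_silence',   # the "Prosody" set
            'asr_confidence']                          # one ASR-derived feature

def train_misrecognition_model(X, y):
    # X: list of per-turn feature vectors, ordered as FEATURES
    # y: 'T' for correctly recognized turns, 'F' for misrecognized turns
    clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20)
    clf.fit(X, y)
    # Print a human-readable approximation of the learned rules.
    print(export_text(clf, feature_names=FEATURES))
    return clf
```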
Although not shown in the table, the unnormalized prosodic features significantly outperform the normalized versions by 7-13%. Recall that prosodic features normalized by first task utterance, by previous utterance, or by z scores showed little performance difference in the analyses performed in Section 3. This difference may indicate that there are indeed limits on the ranges in features such as F0 and RMS maxima, duration and preceding pause within which recognition performance is optimal. It seems reasonable that extreme deviation from characteristics of the acoustic training material should in fact impact ASR performance, and our experiments may have uncovered, if not the critical variants, at least important acoustic correlates of them. However, it is difficult to compare our machine learning results with the statistical analyses, since a) the statistical analyses looked at only a single prosodic variable at a time, and b) data points for that analysis were means calculated per speaker, while the learning algorithm operated on all utterances, allowing for unequal contributions by speaker. 5The errors and standard errors (SE) result from 25-fold cross-validation on the 1975 turns where ASR yielded a string and confidence. When two errors plus or minus twice the standard error do not overlap, they are statistically significantly different.</Paragraph> <Paragraph position="6"> We now address the issue of what prosodic features are contributing to misrecognition identification, relative to the more traditional ASR techniques. Do our prosodic features simply correlate with information already in use by ASR systems (e.g., confidence score, grammar), or at least available to them (e.g., recognized string)? First, the error using ASR confidence score alone (22.23%) is significantly worse than the error when prosodic features are combined with ASR confidence scores (10.99%) -- and is also significantly worse than the use of prosodic features alone (12.76%). Similarly, the error using ASR confidence scores and the ASR grammar (17.77%) is significantly worse than prosodic features alone (12.76%). Thus, prosodic features, either alone or in conjunction with traditional ASR features, significantly outperform these traditional features alone for predicting WER-based misrecognitions.</Paragraph> <Paragraph position="7"> Another interesting finding from our experiments is the predictive power of information available to current ASR systems but not made use of in calculating rejection likelihoods: the identity of the recognized string. This feature is in fact the best performing single feature in predicting our data (15.24%). And, at a 95% confidence level, the error using ASR confidence scores, the recognized string, and grammar (9.01%) matches the performance of our best performing feature set (6.53%). It seems that, at least in our task and for our ASR system, the appearance of particular words in the recognized strings is an extremely useful cue to recognition accuracy. So, even by making use of information currently available from the traditional ASR process, ASR systems could improve their performance on identifying rejections by a considerable margin.
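The significance criterion in footnote 5 (two cross-validation error rates differ when their error +/- 2*SE intervals do not overlap) can be made concrete with a short check; the sketch below treats the +/- figures quoted in the text as standard errors, which is an assumption about Table 3's notation.

```python
# Minimal sketch of the footnote-5 significance criterion for two error rates.
def significantly_different(err1, se1, err2, se2):
    lo1, hi1 = err1 - 2 * se1, err1 + 2 * se1
    lo2, hi2 = err2 - 2 * se2, err2 + 2 * se2
    return hi1 < lo2 or hi2 < lo1   # True only when the intervals do not overlap

# Comparing the best feature set (6.53 +/- 0.63) with ALL features (6.68 +/- 0.63)
# correctly reports no significant difference, matching the text.
print(significantly_different(6.53, 0.63, 6.68, 0.63))   # False
```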
A caveat here is that this feature, like grammar state, is unlikely to generalize from task to task or recognizer to recognizer, but these findings suggest that both should be considered as a means of improving rejection performance in stable systems.</Paragraph> <Paragraph position="8"> The classification model learned from the best performing feature set in Table 3 is shown in Figure 2. 6 The first rule RIPPER finds with this feature set is that if the user turn is less than .9 seconds and the recognized string contains the word &quot;yes&quot; (and possibly other words as well), with an acoustic confidence score > -2.6, then predict that the turn will be correctly recognized. 7 Note that all of the prosodic features except for RMS mean, max, and prior pause appear in at least one rule, and that the features shown to be significant in our statistical analyses (Section 3) are not the same features as in the rules. But, as noted above, our data points in these two experiments differ. It is useful to note, though, that while this ruleset contains all three ASR features, none of the experimental parameters was found to be a useful predictor, suggesting that our results are not specific to the particular conditions of and participants in the corpus collection, although they are specific to the lexicon and grammars.</Paragraph> <Paragraph position="9"> 6Rules are presented in order of importance in classifying data. When multiple rules are applicable, RIPPER uses the first rule. 7The confidence scores observed in our data ranged from a high of -0.087662 to a low of -9.884418.</Paragraph> <Paragraph position="10"> Figure 2: Ruleset learned from the best performing feature set.
if (duration < 0.897073) ∧ (confidence > -2.62744) ∧ (string contains 'yes') then T
if (duration < 1.03872) ∧ (confidence > -2.69775) ∧ (string contains 'no') then T
if (duration < 0.982051) ∧ (confidence > -1.99705) ∧ (tempo > 3.1147) then T
if (duration < 0.813633) ∧ (duration > 0.642652) ∧ (confidence > -3.33945) ∧ (F0 Mean > 176.794) then T
if (duration < 1.30312) ∧ (confidence > -3.37301) ∧ (% silences >= 0.647059) then T
if (duration < 0.610734) ∧ (confidence > -3.37301) ∧ (% silences > 0.521739) then T
if (duration < 1.09537) ∧ (string contains 'Baltimore') then T
if (duration < 0.982051) ∧ (string contains 'no') then T
if (duration < 1.1803) ∧ (confidence > -2.93085) ∧ (grammar = date) then T
if (duration < 1.09537) ∧ (confidence > -2.30717) ∧ (% silences > 0.356436) ∧ (F0 Max > 249.225) then T
if (duration < 0.868743) ∧ (confidence > -4.14926) ∧ (% silences > 0.51923) ∧ (F0 Max > 205.296) then T
if (duration < 1.18036) ∧ (string contains 'Philadelphia') then T
else F</Paragraph>
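Footnote 6 notes that RIPPER applies the first matching rule in its ordered list. A minimal sketch of applying such an ordered ruleset to a new turn appears below; it encodes only the first three rules of Figure 2 for illustration, and the turn is represented as a plain dictionary with hypothetical keys.

```python
# Minimal sketch of applying an ordered ruleset (Figure 2 style): the first
# rule whose conditions all hold determines the prediction; otherwise 'F'.
RULES = [
    (lambda t: t['duration'] < 0.897073 and t['confidence'] > -2.62744
               and 'yes' in t['string'], 'T'),
    (lambda t: t['duration'] < 1.03872 and t['confidence'] > -2.69775
               and 'no' in t['string'], 'T'),
    (lambda t: t['duration'] < 0.982051 and t['confidence'] > -1.99705
               and t['tempo'] > 3.1147, 'T'),
]

def predict(turn: dict) -> str:
    for condition, label in RULES:
        if condition(turn):
            return label      # RIPPER-style: first applicable rule wins
    return 'F'                # default class: predicted misrecognition

# e.g. predict({'duration': 0.8, 'confidence': -1.5, 'string': 'yes', 'tempo': 2.0}) -> 'T'
```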
<Paragraph position="11"> Results of our learning experiments with misrecognition defined in terms of CA rather than WER show that the features which predict WER-defined misrecognition are, overall, less successful in predicting CA-defined error. Table 4 shows the relative performance of the same feature sets discussed above, with misrecognition now defined in terms of CA<1. As with the WER experiments, the best performing feature set makes use of prosodic and ASR-derived features. However, the predictive power of prosodic over ASR features decreases when misrecognition is defined in terms of CA -- which is particularly interesting since ASR confidence scores are intended to predict WER rather than CA; the error rate using ASR confidence scores alone (13.52%) is now significantly lower than the error obtained using prosody (18.18%). However, prosodic features still improve the predictive power of ASR confidence scores, to 11.34%, although this difference is not significant at a 95% confidence level. And the error rate of the three ASR features combined (11.70%) is reduced to the lowest error rate in our table when prosodic features are added (10.43%); this error rate is (just) significantly different from the use of ASR confidence scores alone. Thus, for CA-defined misrecognitions, our experiments have uncovered only minor improvements over traditional ASR rejection calculation procedures.</Paragraph> </Section> </Section> </Paper>