<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1040">
  <Title>Automatic Detection of Poor Speech Recognition at the Dialogue Level</Title>
  <Section position="5" start_page="312" end_page="314" type="evalu">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> Figure 4 summarizes our most interesting experimental results. For each feature set, we report accuracy rates and standard errors resulting from crossvalidation. 5 It is clear that performance depends on the features that the classifier has available. The BASELINE accuracy rate results from simply choosing the majority class, which in this case means predicting that the dialogue is always &amp;quot;good&amp;quot;. This leads to a 52% BASELINE accuracy.</Paragraph>
    <Paragraph position="1"> The REJECTION% accuracy rates arise from a classifier that has access to the percentage of dialogue utterances in which the system played a rejection message to the user. Previous research suggests that this acoustic feature predicts misrecognitions because users modify their pronunciation in response to system rejection messages in such a way as to lead to further misunderstandings (Shriberg et al., 1992; Levow, 1998). However, despite our expectations, the REJECTION% accuracy rate is not better than the BASELINE at our desired level of statistical significance.</Paragraph>
    <Paragraph position="2"> Using the EFFICIENCY features does improve the performance of the classifier significantly above the BASELINE (61%). These features, however, tend to reflect the particular experimental tasks that the users were doing.</Paragraph>
    <Paragraph position="3"> The EXP-PARAMS (experimental parameters) features are even more specific to this dialogue corpus than the efficiency features: these features consist of the name of the system, the experimen5Accuracy rates are statistically significantly different when the accuracies plus or minus twice the standard error do not overlap (Cohen, 1995), p. 134.</Paragraph>
    <Paragraph position="4"> tal subject, the experimental task, and the experimental condition (dialogue strategy or user expertise). This information alone allows the classifier to substantially improve over the BASELINE classifter, by identifying particular experimental conditions (mixed initiative dialogue strategy, or novice users without tutorial) or systems that were run with particularly hard tasks (TOOT) with bad dialogues, as in Figure 5. Since with the exception of the experimental condition these features are specific to this corpus, we wouldn't expect them to generalize.</Paragraph>
    <Paragraph position="5"> if (condition = mixed) then bad if (system = toot) then bad if (condition = novices without tutorial) then bad  The normalized DIALOGUE QUALITY features result in a similar improvement in performance (65.9%). 6 However, unlike the efficiency and experimental parameters features, the normalization of the dialogue quality features by dialogue length means that rules learned on the basis of these features are more likely to generalize.</Paragraph>
    <Paragraph position="6"> Adding the efficiency and normalized quality feature sets together (EFFICIENCY + NORMALIZED QUALITY) results in a significant performance improvement (69.7%) over EFFICIENCY alone. Figure 6 shows that this results in a classifier with three rules: one based on quality alone (percentage of cancellations), one based on efficiency  alone (elapsed time), and one that consists of a boolean combination of efficiency and quality features (elapsed time and percentage of rejections). The learned ruleset says that if the percentage of cancellations is greater than 6%, classify the dialogue as bad; if the elapsed time is greater than 282 seconds, and the percentage of rejections is greater than 6%, classify it as bad; if the elapsed time is less than 90 seconds, classify it as badT; otherwise classify it as good. When multiple rules are applicable, RIPPER resolves any potential conflict by using the class that comes first in the ordering; when no rules are applicable, the default is used.</Paragraph>
    <Paragraph position="7"> if (cancel% &gt; 6) then bad if (elapsed time &gt; 282 secs) A (rejection% &gt; 6) then bad if (elapsed time &lt; 90 secs) then bad default is good for the MEAN CONFIDENCE classifier (68.4%) is not statistically different than that for the PMISRECS%3 classifier. Furthermore, since the feature does not rely on picking an optimal threshold, it could be expected to better generalize to new dialogue situations.</Paragraph>
    <Paragraph position="8"> The classifier trained on (noisy) ASR lexical output (ASR TEXT) has access only to the speech recognizer's interpretation of the user's utterances. The ASR TEXT classifier achieves 72% accuracy, which is significantly better than the BASELINE, REJECTION% and EFFICIENCY classifiers. Figure 7 shows the rules learned from the lexical feature alone. The rules include lexical items that clearly indicate that a user is having trouble e.g. help and cancel. They also include lexical items that identify particular tasks for particular systems, e.g. the lexical item p-m identifies a task in TOOT.</Paragraph>
    <Paragraph position="9">  We discussed our acoustic REJECTION% results above, based on using the rejection thresholds that each system was actually run with. However, a posthoc analysis of our experimental data showed that our systems could have rejected substantially more misrecognitions with a rejection threshold that was lower than the thresholds picked by the system designers. (Of course, changing the thresholds in this way would have also increased the number of rejections of correct ASR outputs.) Recall that the PMISRECS% experiments explored the use of different thresholds to predict misrecognitions. The best of these acoustic thresholds was PMISRECS%3, with accuracy 72.6%. This classifier learned that if the predicted percentage of misrecognitions using the threshold for that feature was greater than 8%, then the dialogue was predicted to be bad, otherwise it was good. This classifier performs significantly better than the BASELINE, REJECTION% and EFFICIENCY classifiers.</Paragraph>
    <Paragraph position="10"> Similarly, MEAN CONFIDENCE is another acoustic feature, which averages confidence scores over all the non-rejected utterances in a dialogue.</Paragraph>
    <Paragraph position="11"> Since this feature is not tuned to the applications, we did not expect it to perform as well as the best PMISRECS% feature. However, the accuracy rate 7This rule indicates dialogues too short for the user to have completed the task. Note that this role could not be applied to adapting the system's behavior during the course of the dialogue. null if (ASR text contains cancel) then bad if (ASR text contains the) A (ASR text contains get) A (ASR text contains TIMEOUT) then bad if (ASR text contains today) ^ (ASR text contains on) then bad if (ASR text contains the) A (ASR text contains p-m) then bad if (ASR text contains to) then bad if (ASR text contains help) ^ (ASR text contains the) ^ (ASR text contains read) then bad if (ASR text contains help) A (ASR text contains previous) then bad if (ASR text contains about) then bad if (ASR text contains change-s trategy) then bad  Note that the performance of many of the classifiers is statistically indistinguishable, e.g. the performance of the ASR TEXT classifier is virtually identical to the classifier PMISRECS%3 and the EFFICIENCY + QUALITY + EXP-PARAMS classifier.</Paragraph>
    <Paragraph position="12"> The similarity between the accuracies for a range of classifiers suggests that the information provided by different feature sets is redundant. As discussed above, each system and experimental condition resuited in dialogues that contained lexical items that were unique to it, making it possible to identify experimental conditions from the lexical items alone. Figure 8 shows the rules that RIPPER learned when it had access to all the features except for the lexical and acoustic features. In this case, RIPPER learns some rules that are specific to the TOOT system.</Paragraph>
    <Paragraph position="13"> Finally, the last row of Figure 4 suggests that a classifier that has access to ALL FEATURES may do better (77.4% accuracy) than those classifiers that  if (cancel% &gt; 4) ^ (system = toot) then bad if (system turns _&gt; 26) ^ (rejection% _&gt; 5 ) then bad if (condition = mixed) ^ (user turns &gt; 12 ) then bad if (system = toot)/x (user turns &gt; 14 ) then bad if (cancels &gt; 1) A (timeout% _&gt; 11 ) then bad if (elapsed time _&lt; 87 secs) then bad  have access to acoustic features only (72.6%) or to lexical features only (72%). Although these differences are not statistically significant, they show a trend (p &lt; .08). This supports the conclusion that different feature sets provide redundant information, and could be substituted for each other to achieve the same performance. However, the ALL FEATURES classifier does perform significantly better than the EXP-PARAMS, DIALOGUE QUALITY (NORMALIZED), and MEAN CONFIDENCE classifiers. Figure 9 shows the decision rules that the ALL FEATURES classifier learns. Interestingly, this classifier does not find the features based on experimental parameters to be good predictors when it has other features to choose from. Rather it combines features representing acoustic, efficiency, dialogue quality and lexical information.</Paragraph>
    <Paragraph position="14"> if (mean confidence _&lt; -2.2) ^ (pmisrecs%4 _&gt; 6 ) then bad if (pmisrecs%3 &gt;_ 7 ) A (ASR text contains yes) A (mean confidence _&lt; -1.9) then bad if (cancel% _&gt; 4) then bad if (system turns _&gt; 29 ) ^ (ASR text contains message) then bad if (elapsed time &lt;_ 90) then bad</Paragraph>
  </Section>
class="xml-element"></Paper>