<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1048"> <Title>Predicting User Reactions to System Error</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Descriptive Analysis and Results </SectionTitle> <Paragraph position="0"> For each user turn, we examined prosodic features that had previously been shown to be useful for predicting misrecognized turns and corrections: maximum and mean fundamental frequency values (F0 Max, F0 Mean), maximum and mean energy values (RMS Max, RMS Mean), total duration (Dur), length of pause preceding the turn (Ppau), speaking rate (Tempo) and amount of silence within the turn (%Sil). F0 and RMS values, representing measures of pitch excursion and loudness, were calculated from the output of Entropic Research Laboratory's pitch tracker, get_f0, with no post-correction. Timing variation was represented by four features. Duration within turns and length of pause between turns were computed from the temporal labels associated with each turn's beginning and end. (Footnote 3: While the features were automatically computed, beginnings and endings were hand-segmented from recordings of the entire dialogue, as the turn-level speech files used as input in the original recognition process created by TOOT were unavailable.)</Paragraph> <Paragraph position="1"> Speaking rate was approximated in terms of syllables in the recognized string per second, while %Sil was defined as the percentage of zero frames in the turn, i.e., roughly the percentage of time within the turn that the speaker was silent.</Paragraph> <Paragraph position="2"> To see whether the different turn categories were prosodically distinct from one another, we applied the following procedure. We first calculated mean values for each prosodic feature for each of the four turn categories produced by each individual speaker. So, for speaker A, we divided all turns produced into four classes. 
For each class, we then calculated mean F0 Max, mean F0 Mean, and so on. After this step had been repeated for each speaker and for each feature, we created four vectors of speaker means for each individual prosodic feature. Then, for each prosodic feature, we ran a one-factor within-subjects ANOVA on the means to learn whether there was an overall effect of turn category.</Paragraph> <Paragraph position="3"> Table 1 shows that, overall, the turn categories do indeed differ significantly with respect to different prosodic features; there is a significant overall effect of category on F0 Max, RMS Max, RMS Mean, Duration, Tempo and %Sil. To identify which pairs of turn categories were significantly different where there was an overall significant effect, we performed post hoc paired t-tests using the Bonferroni method to adjust the p-level to 0.008 (0.05 divided by 6, the number of possible pairs that can be drawn from an array of 4 means). [Table header residue: Classes, F0 max, F0 mean, RMS max, RMS mean, Dur, Ppau, Tempo, %Sil; table body not recovered.]</Paragraph> <Paragraph position="5"> Results are summarized in Table 2, where ' + ' or ' - ' indicates that the feature value of the first category is either significantly higher or lower than the second. Note that, for each of the pairs, there is at least one prosodic feature that distinguishes the categories significantly, though it is clear that some pairs, like aware vs. corr and norm vs. corr, appear to have more distinguishing features than others, like norm vs. aware. It is also interesting to see that the three types of post-error turns are indeed prosodically different: awares are less prominent in terms of F0 and RMS maximum than corrawares, which, in turn, are less prominent than corrections, for example. 
In fact, awares, except for duration, are prosodically similar to normal turns.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Predictive Results </SectionTitle> <Paragraph position="0"> We next wanted to determine whether the prosodic features described above could, alone or in combination with other automatically available features, be used to predict our turn categories automatically. This section describes experiments using the machine learning program RIPPER (Cohen, 1996) to automatically induce prediction models from our data. Like many learning programs, RIPPER takes as input the classes to be learned, a set of feature names and possible values, and training data specifying the class and feature values for each training example. RIPPER outputs a classification model for predicting the class of future examples, expressed as an ordered set of if-then rules. The main advantages of RIPPER for our experiments are that RIPPER supports &quot;set-valued&quot; features (which allows us to represent the speech recognizer's best hypothesis as a set of words), and that rule output is an intuitive way to gain insight into our data.</Paragraph> <Paragraph position="1"> In the current experiments, we used 10-fold cross-validation to estimate the accuracy of the rulesets learned. Our predicted classes correspond to the turn categories described in Section 2 and variations described below. We represent each user turn using the feature set shown in Figure 2, which we previously found useful for predicting corrections (Hirschberg et al., 2001).</Paragraph> <Paragraph position="2"> A subset of the features includes the automatically computable raw prosodic features shown in Table 1 (Raw), and normalized versions of these features, where normalization was done by first turn (Norm1) or by previous turn (Norm2) in a dialogue. 
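As a minimal sketch of the two normalizations just described, the code below derives Norm1 and Norm2 values for one feature across the turns of a single dialogue. It assumes that normalization is a ratio of the turn's value to the reference turn's value; the text does not state the exact formula, so this is an assumption, and the function name and the example durations are hypothetical.

```python
from typing import Optional


def normalize_turns(values: list[float]) -> list[dict[str, Optional[float]]]:
    """Return raw, Norm1 and Norm2 versions of one feature for each turn.

    ASSUMPTION: normalization is a ratio (turn value / reference value);
    the paper only says "by first turn" and "by previous turn".
    """
    rows = []
    first = values[0] if values else None
    for i, raw in enumerate(values):
        # The first turn of a dialogue has no previous turn to normalize by.
        prev = values[i - 1] if i != 0 else None
        rows.append({
            "raw": raw,
            # None when the reference value is missing or zero.
            "norm1": raw / first if first else None,
            "norm2": raw / prev if prev else None,
        })
    return rows


# Hypothetical turn durations (seconds) for one dialogue.
durations = [2.0, 1.0, 4.0]
table = normalize_turns(durations)
```

Under this ratio reading, a Norm2 duration well above 1 would mean that the turn is much longer than the turn before it.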
The set labeled 'ASR' contains standard input and output of the speech recognition process: which grammar was used for the dialogue state the system believed the user to be in (gram), the system's best hypothesis for the user input (str), and the acoustic confidence score produced by the recognizer for the turn (conf). As subcases of the str feature, we also included whether or not the recognized string included the strings yes or no (ynstr), some variant of no such as nope (nofeat), cancel (canc), or help (help), as these lexical items were often used to signal problems in our system. We also derived features to approximate the length of the user turn in words (wordsstr) and in syllables (syls) from the str feature. And we added a boolean feature identifying whether or not the turn had been rejected by the system (rejbool).
[Figure 2 (feature set):
Raw: f0 max, f0 mean, rms max, rms mean, dur, ppau, tempo, %sil;
Norm1: f0 max1, f0 mean1, rms max1, rms mean1, dur1, ppau1, tempo1, %sil1;
Norm2: f0 max2, f0 mean2, rms max2, rms mean2, dur2, ppau2, tempo2, %sil2;
ASR: gram, str, conf, ynstr, nofeat, canc, help, wordsstr, syls, rejbool;
System Experimental: inittype, conftype, adapt, realstrat;
Dialogue Position: diadist;
PreTurn: features for preceding turn (e.g., pref0max);
PrepreTurn: features for preceding preceding turn (e.g., ppref0max);
Prior: for each boolean-valued feature (ynstr, nofeat, canc, help, rejbool), the number/percentage of prior turns exhibiting the feature (e.g., priorynstrnum/priorynstrpct);
PMean: for each continuous-valued feature, the mean of the feature's value over all prior turns (e.g., pmnf0max).]</Paragraph> <Paragraph position="3"> Next, we include a set of features representing the system's dialogue strategy when each turn was produced. 
These include the system's current initiative and confirmation strategies (inittype, conftype), whether users could adapt the system's dialogue strategies (adapt), and the combined initiative/confirmation strategy in effect at the time of the turn (realstrat). Finally, given that our previous studies showed that preceding dialogue context may affect correction behavior (Swerts et al., 2000; Hirschberg et al., 2001), we included a feature (diadist) reflecting the distance of the current turn from the beginning of the dialogue, and a set of features summarizing aspects of the prior dialogue: for the latter features, we included both the number of times prior turns exhibited certain characteristics (e.g. priorcancnum) and the percentage of the prior dialogue containing one of these features (e.g. priorcancpct). We also examined means for all raw and normalized prosodic features and some word-based features over the entire dialogue preceding the turn to be predicted (features prefixed with pmn). Lastly, we examined more local contexts, including all features of the preceding turn (prefixed with pre) and of the turn preceding that (prefixed with ppre).</Paragraph> <Paragraph position="4"> We first provided all of the above features to the learning algorithm to predict the four-way classification of turns into normal, aware, corr and corraware. A baseline for this classification (always predicting norm, the majority class) has a success rate of 57%. Compared to this, our features improve classification accuracy to 74.23% (+/- 0.96%). Figure 3 presents the rules learned for this classification. Of the features that appear in the ruleset, about half are features of the current turn and half are features of the prior context. Only once does a system feature appear, suggesting that the rules generalize beyond the experimental conditions of the data collection. 
Of the features specific to the current turn, prosodic features dominate, and, overall, timing features (dur and tempo especially) appear most frequently in the rules.</Paragraph> <Paragraph position="5"> About half of the contextual features are prosodic ones and half are ASR features, with the ASR confidence score appearing to be the most useful. The ASR features of the current turn that appear most often are string-based features and the grammar state the system used for recognizing the turn. There appear to be no differences in which types of features are chosen to predict the different classes.</Paragraph> <Paragraph position="6"> If we express the prediction results in terms of precision and recall, we see how our classification accuracy varies for the different turn categories (Table 3). From Table 3, we see that the majority class (normal) is most accurately classified. Predictions for the other three categories, which occur about equally often in our corpus, vary considerably, with modest results for corr and corraware, and rather poor results for aware. 
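Per-class precision and recall of the kind reported in Table 3 can be derived from a confusion matrix of actual versus predicted classes. The sketch below illustrates that computation only; the function name is hypothetical and the counts are invented for illustration, not taken from our data.

```python
def per_class_precision_recall(
    confusion: dict[str, dict[str, int]],
) -> dict[str, tuple[float, float]]:
    """Compute (precision, recall) per class.

    confusion[actual][predicted] holds the number of turns of class
    `actual` that the classifier labeled as `predicted`.
    """
    classes = list(confusion)
    scores = {}
    for c in classes:
        true_pos = confusion[c][c]
        # Column sum: every turn the classifier labeled as c.
        predicted_c = sum(confusion[a][c] for a in classes)
        # Row sum: every turn whose actual class is c.
        actual_c = sum(confusion[c][p] for p in classes)
        precision = true_pos / predicted_c if predicted_c else 0.0
        recall = true_pos / actual_c if actual_c else 0.0
        scores[c] = (precision, recall)
    return scores


# HYPOTHETICAL counts for two classes, for illustration only.
toy = {
    "norm": {"norm": 8, "corr": 2},
    "corr": {"norm": 4, "corr": 6},
}
scores = per_class_precision_recall(toy)
```

In this toy matrix, corr turns mislabeled as norm lower both the recall of corr and the precision of norm, which is the pattern to look for when minority classes are absorbed by the majority class.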
[Figure 3: rules learned for the 4-way classification.
if (gram=universal) ∧ (dur2 ≥ 7.31) then CORR
if (dur2 ≥ 2.19) ∧ (priornofeatpct ≥ 0.09) ∧ (tempo ≥ 1.50) ∧ (pmntempo ≤ 2.39) then CORR
if (dur2 ≥ 1.53) ∧ (pmnwordsstr ≥ 2.06) ∧ (tempo1 ≥ 1.07) ∧ (predur ≥ 0.80) ∧ (prenofeat=F) ∧ (presyls ≤ 4) then CORR
if (predur1 ≤ 0.26) ∧ (dur ≥ 0.79) ∧ (rmsmean2 ≥ 1.51) ∧ (f0mean ≤ 173.49) then CORR
if (dur2 ≥ 1.41) ∧ (prenofeat=T) ∧ (str contains word 'eight') then CORR
if (predur1 ≤ 0.18) ∧ (dur2 ≥ 4.21) ∧ (dur1 ≤ 0.50) ∧ (f0mean ≤ 276.43) then CORR
if (predur1 ≤ 0.19) ∧ (ppregram=cityname) ∧ (rmsmax1 ≥ 1.10) ∧ (pmntempo2 ≤ 1.64) then CORR
if (realstrat=SystemImplicit) ∧ (gram=cityname) ∧ (pmnf0mean1 ≤ 0.96) then CORR
if (preconf ≤ -2.66) ∧ (dur2 ≤ 0.31) ∧ (pprenofeat=T) ∧ (tempo2 ≥ 0.61) then AWARE
if (preconf ≤ -2.85) ∧ (syls ≤ 2) ∧ (predur ≥ 1.05) ∧ (pref0max [?] 4.82) ∧ (tempo2 ≥ 0.58) ∧ (pmn%sil ≤ 0.53) then AWARE
if (preconf ≤ -3.34) ∧ (syls ≤ 2) ∧ (ppau ≥ 0.57) ∧ (conf ≥ -3.07) ∧ (preppau ≥ 0.72) then AWARE
if (dur ≥ 0.74) ∧ (pmndur ≥ 2.57) ∧ (preconf ≤ -4.36) ∧ (f0mean2 ≥ 0.90) then CORRAWARE
if (preconf ≤ -2.80) ∧ (pretempo ≤ 2.16) ∧ (preconf ≤ -3.95) ∧ (tempo1 ≤ 4.67) then CORRAWARE
if (preconf ≤ -2.80) ∧ (dur ≥ 0.66) ∧ (rmsmean ≥ 488.56) then CORRAWARE
if (preconf ≤ -3.56) ∧ (dur2 ≥ 0.64) ∧ (prerejbool=T) then CORRAWARE
if (pretempo ≤ 0.71) ∧ (tempo ≤ 3.31) then CORRAWARE
if (preconf ≤ -3.01) ∧ (tempo2 ≥ 0.78) ∧ (pmndur ≥ 2.83) ∧ (pmnf0mean ≥ 199.84) then CORRAWARE
if (pmnconf ≤ -3.10) ∧ (prestr contains the word 'help') ∧ (pmndur2 ≤ 2.01) ∧ (ppau ≤ 0.98) then CORRAWARE
if (pmnconf ≤ -3.10) ∧ (gram=universal) ∧ (pregram=universal) ∧ (%sil ≤ 0.39) then CORRAWARE]
Table 4 shows a confusion matrix for the four classes, produced by [...]. The matrix clearly shows a tendency for the minority classes, aware, 
corr and corraware, to be falsely classified as normal. It also shows that aware and corraware are more often confused with each other than are the other categories.</Paragraph> <Paragraph position="7"> These confusability results motivated us to collapse aware and corraware into one class, which we will label isaware; this class thus represents all turns in which users become aware of a problem. From a system perspective, such a 3-way classification would be useful in identifying the existence of a prior system failure and in further identifying those turns which simply represent corrections; such information might be as useful, potentially, as the 4-way distinction, if we could achieve it with greater accuracy.</Paragraph> <Paragraph position="8"> Indeed, when we predict the three classes (isaware, corr, and norm) instead of four, we do improve in predictive power -- from 74.23% to 81.14% (+/- 0.83%) classification success.</Paragraph> <Paragraph position="9"> Again, this compares to the baseline (predicting norm, which is still the majority class) of 57%. We also get a corresponding improvement in terms of precision and recall, as shown in Table 5, with the isaware category considerably better distinguished than either aware or corraware in Table 3.</Paragraph> <Paragraph position="10"> The ruleset for the 3-class predictions is given in Figure 4. Here, there appear to be clear differences in which features best predict which classes. First, the features used to predict corrections are balanced between those from the current turn and features from the preceding context, whereas isaware rules primarily make use of features of the preceding context.</Paragraph> <Paragraph position="11"> Second, the features appearing most often in the rules predicting corrections are durational features (dur2, predur1, dur), while duration is used only once in isaware rules. 
Instead, these rules make considerable use of the ASR confidence score of the preceding turn; in cases where aware turns immediately follow a rejection or recognition error, one would expect this to be true. Isaware rules also appear distinct from correction rules in that they make frequent use of the tempo feature. It is also interesting to note that rules for predicting isaware turns make only limited use of the nofeat feature, i.e. whether or not a variant of the word no appears in the turn. We might expect this lexical item to be a more useful predictor, since in the explicit confirmation condition, users should become aware of errors while responding to a request for confirmation.</Paragraph> <Paragraph position="12"> Note that corrections, now the minority class, are more poorly distinguished than other classes in our 3-way classification task (Table 5). In a third set of experiments, we merged corrections with normal turns to form an overall 2-way distinction between aware turns and all others. Thus, we only distinguish turns in which a user first becomes aware of an ASR failure (our original aware and corraware categories) from those in which this is not the case (our original corr and norm categories). Such a distinction could be useful in flagging a prior system problem, even though it fails to target the material intended to correct that problem. For this new 2-way distinction, we obtain a higher degree of classification accuracy than for the 3-way classification -- 87.80% (+/- 0.61%) compared to 81.14%. Note, however, that the baseline (predict majority class of !isaware) for this new classification is 70%, considerably higher than the previous baseline. Table 6 shows the improvement in terms of accuracy, precision, and recall.</Paragraph> <Paragraph position="13"> The ruleset for the 2-way distinction is shown in Figure 5. 
[Figure 5: ruleset for the 2-way distinction, ISAWARE (T) versus the rest (F).
if (preconf ≤ -4.06) ∧ (pretempo ≤ 2.65) ∧ (ppau ≥ 0.25) then T
if (preconf ≤ -3.59) ∧ (prerejbool=T) then T
if (preconf ≤ -2.85) ∧ (predur ≥ 1.039) ∧ (tempo2 ≥ 1.04) ∧ (preppau ≥ 0.57) ∧ (pretempo ≤ 2.18) then T
if (preconf ≤ -3.78) ∧ (pmnsyls ≥ 4.04) then T
if (preconf ≤ -2.75) ∧ (prestr contains the word 'help') then T
if (pregram=universal) ∧ (pprewordsstr ≥ 2) then T
if (preconf ≤ -2.60) ∧ (predur ≥ 1.04) ∧ (%sil1 ≤ 1.06) ∧ (prermsmean ≥ 370.65) then T
if (pretempo ≤ 0.13) then T
if (predur ≥ 1.27) ∧ (pretempo ≤ 2.36) ∧ (prermsmean ≥ 245.36) then T
if (pretempo ≤ 0.80) ∧ (pmntempo ≤ 1.75) ∧ (ppretempo2 ... [rule truncated in source]]</Paragraph> <Paragraph position="14"> The features appearing most frequently in these rules are similar to those in the previous two rulesets in some ways, but quite different in others. Like the rules in Figures 3 and 4, they appear independent of system characteristics. Also, of the contextual features appearing in the rules, about half are prosodic features and half are ASR-related; and, of the current turn features, prosodic features dominate. Timing features (especially tempo) again dominate the prosodic features that appear in the rules. However, in contrast to the previous classification rulesets, very few features of the current turn appear in the rules at all. So, it would seem that, for the broader classification task, contextual features are far more important than for the more fine-grained distinctions.</Paragraph> </Section> </Paper>