<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2028">
  <Title>Learning to Predict Problematic Situations in a Spoken Dialogue System: Experiments with How May I Help You?</Title>
  <Section position="4" start_page="210" end_page="210" type="metho">
    <SectionTitle>
USER HANGS UP
</SectionTitle>
    <Paragraph position="0"> agent serving as a wizard who could override the system. There were a number of agents who participated as wizards during the trial of HMIHY and each wizard was simply told to take over the call if s/he perceived problems with the system's performance.</Paragraph>
    <Paragraph position="1"> The wizard's decision was logged by the experimental setup, resulting in labelling the call as one that the wizard took over. s A dialogue where the wizard decided that the dialogue was problematic and took over the call is shown in Figure 3.</Paragraph>
    <Paragraph position="2"> Sh AT&amp;T How may I help you?  Uh (silence) $2: Sorry. Please briefly tell me how I may help you? U2: I'm trying to call 1 8 hundred call A T T.</Paragraph>
    <Paragraph position="3"> $3: Do you want to place a call? U3: Through my calling card.</Paragraph>
    <Paragraph position="4"> $4: May I have your card number, please? U4: It's 8 7 6 5 4 3 2 1 0 and then my pin number is 8 7 6 5. (misrecognized) $5: Sorry. Please enter or say your card number again.</Paragraph>
  </Section>
  <Section position="5" start_page="210" end_page="211" type="metho">
    <SectionTitle>
WIZARD STEPS IN
</SectionTitle>
    <Paragraph position="0"> The third problematic category, the TASKFAILURE dialogues, are cases where the system completed the call, but carried out a task that was not the task that the customer was actually requesting. An example TASKFAILURE dialogue is given in Figure 4: HMIHY interpreted utterance U2 as a request to make a third-party call e.g. to bill it to my home phone. HMIHY then asked the caller for the information it needed to carry out this task, the caller complied, and the system completed the call.</Paragraph>
    <Paragraph position="1"> The corpus of 4774 dialogues used in our experiments was collected in severM experimental trials of HMIHY on live customer traffic (Pdccardi and Gorin, to appear; E. Ammicht and Alonso, 1999), and is referred to as HM2 in (Riccardi and Gorin, to appear)). During the trial, all of the system behaviors were automatically recorded in a log file, and later the dialogues were transcribed by humans and labelled with a semantic category representing the sOf course we can only infer what might have motivated the wizard to take over the call, but we assume that the wizard had good reason for doing so.</Paragraph>
    <Paragraph position="2">  SI: AT&amp;T How may I help you? UI: (silence) $2: Sorry. Please briefly tell me how I may help you? U2: I wanta find out how much my bill's worth.</Paragraph>
    <Paragraph position="3"> $3: What phone number would you like the call billed to? U3: Urn the access number.</Paragraph>
    <Paragraph position="4"> $4: Sorry. Could I have that phone number again? U4:18hundred2255288.</Paragraph>
    <Paragraph position="5"> $5: What number would you like to call? U5:800 225 5288 (touchtone input) $6: Thank you.</Paragraph>
    <Paragraph position="6">  task that the caller was asking HMIHY to perform, on a per utterance basis. The logfiles also included labels indicating whether the wizard had taken over the call or the user had hung up.</Paragraph>
  </Section>
  <Section position="6" start_page="211" end_page="213" type="metho">
    <SectionTitle>
3 Training an Automatic
</SectionTitle>
    <Paragraph position="0"> Our experiments apply the machine learning program RIPPER (Cohen, 1996) to automatically induce a &amp;quot;problematic dialogue&amp;quot; classification model. RIPPER takes as input the names of a set of classes to be learned, the names and ranges of values of a fixed set of features, and training data specifying the class and feature values for each example in a training set.</Paragraph>
    <Paragraph position="1"> Its output is a classification model for predicting the class of future examples. In RIPPER, the classification model is learned using greedy search guided by an information gain metric, and is expressed as an ordered set of if-then rules.</Paragraph>
    <Paragraph position="2"> To apply RIPPER, the dialogues in the corpus must be encoded in terms of a set of classes (the output classification) and a set of input features that are used as predictors for the classes. We start with the dialogue categories described above, but since our goal is to develop algorithms that predict/identify problematic dialogues, we treat HANGUP, WIZARD and TASKFAILURE as equivalently problematic. Thus we train the classifier to distinguish between two classes: TASKSUCCESS and PROBLEMATIC. Note that our categorization is inherently noisy because we do not know the real reasons why a caller hangs up or a wizard takes over the call. The caller may hang up because she is frustrated with the system, or she may simply dislike automation, or her child may have started crying. Similarly, one wizard may have low confidence in the system's ability to recover from errors and use a conservative approach that results in taking over many calls, while another wizard may be more willing to let the system try to recover. Nevertheless we take these human actions as a human labelling of these calls as problematic. Given this classification, approximately 36% of the calls in the corpus of 4774 dialogues are PROBLEMATIC and 64% are TASKSUCCESS.</Paragraph>
    <Paragraph position="3"> Next, we encoded each dialogue in terms of a set of 196 features that were either automatically logged by one of the system modules, hand-labelled by humans, or derived from raw features. We use the hand-labelled features to produce a TOPLINE, an estimation of how well a classifier could do that had access to perfect information. The entire feature set is summarized in Figure 5.</Paragraph>
    <Paragraph position="4">  - a confidence measure for all of the possible tasks that the user could be trying to do - salience-coverage, inconsistency, context-shift, top-task, nexttop-task, top-confidence, dillconfidence null * Dialogue Manager Features - sys-label, utt-id, prompt, reprompt, confirmation, subdial - running tallies: num-reprompts, num- null confirms, num-subdials, reprompt%, confirmation%, subdialogue% * Hand-Labelled Features - tscript, human-label, age, gender, usermodality, clean-tscript, cltscript-numwords, rsuccess * Whole-Dialogue Features num-utts, num-reprompts, percent-reprompts, num-confirms, percent-confirms, numsubdials, percent-subdials, dial-duration.</Paragraph>
    <Paragraph position="5">  There are 8 features that describe the whole dialogue, and 47 features for each of the first four exchanges. We encode features for the first four exchanges because we want to predict failures before they happen. Since 97% of the dialogues in our corpus are five exchanges or less, in most cases, any potential problematic outcome will have occurred by the time the system has participated in five exchanges. Because the system needs to be able to predict whether the dialogue will be problematic using information it has available in the initial part of the dialogue, we train classifiers that only have access to input features from exchange 1, or only the features from exchange 1 and exchange 2. To see whether our results generalize, we also experiment with a subset of features that are task-independent. We compare results for predicting problematic din- null logues, with results for identifying problematic dialogues, when the classifier has access to features representing the whole dialogue.</Paragraph>
    <Paragraph position="6"> We utilized features logged by the system because they are produced automatically, and thus could be used during runtime to alter the course of the dialogue. The system modules that we collected information from were the acoustic processer/automatic speech recognizer (ASR) (Riccardi and Gorin, to appear), the natural language understanding (NLU) module (Gorin et al., 1997), and the dialogue manager (DM) (Abella and Gorin, 1999). Below we describe each module and the features obtained from it.</Paragraph>
    <Paragraph position="7"> ASR takes as input the acoustic signal and outputs a potentially errorful transcription of what it believes the caller said. The ASR features for each of the first four exchanges were the output of the speech recognizer (recog), the number of words in the recognizer output (recog-numwords), the duration in seconds of the input to the recognizer (asr-duration), a flag for touchtone input (dtmf-flag), the input modality expected by the recognizer (rg-modality) (one of: none, speech, touchtone, speech+touchtone, touchtonecard, speech+touchtone-card, touchtone-date, speech+touchtone-date, or none-final-prompt), and the grammar used by the recognizer (rg-grammar).</Paragraph>
    <Paragraph position="8"> The motivation for the ASR features is that any one of them may have impacted performance. For example, it is well known that longer utterances are less likely to be recognized correctly, thus asr-duration could be a clue to incorrect recognition resuits. In addition, the larger the grammar is, the more likely an ASR error is, so the name of the grammar vg-grammar could be a predictor of incorrect recognition.</Paragraph>
    <Paragraph position="9"> The natural language understanding (NLU) module takes as input a transcription of the user's utterance from ASR and produces 15 confidence scores representing the likelihood that the caller's task is one of the 15 task types. It also extracts other relevant information, such as phone or credit card numbers. Thus 15 of the NLU features for each exchange represent the 15 confidence scores. There are also features that the NLU module calculates based on processing the utterance. These include an intra-utterance measure of the inconsistency between tasks that the user appears to be requesting (inconsistency), a measure of the coverage of the utterance by salient grammar fragments (salience- coverage), a measure of the shift in context between utterances (context-shift), the task with the highest confidence score (top-task), the task with the second highest confidence score (nexttop-task), the value of the highest confidence score (top-confidence), and the difference in values between the top and nextto-top confidence scores (diff-confidence).</Paragraph>
    <Paragraph position="10"> The motivation for these NLU features is to make use of information that the NLU module has based on processing the output of ASR and the current discourse context. For example, for utterances that follow the first utterance, the NLU module knows what task it believes the caller is trying to complete. If it appears that the caller has changed her mind, then the NLU module may have misunderstood a previous utterance. The context-shift feature indicates the NLU module's belief that it may have made an error (or be making one now).</Paragraph>
    <Paragraph position="11"> The dialogue manager (DM) takes the output of NLU and the dialogue history and decides what it should say to the caller next. It decides whether it believes there is a single unambiguous task that the user is trying to accomplish, and how to resolve any ambiguity. The DM features for each of the first four exchanges are the task-type label which includes a label that indicates task ambiguity (sys-label), utterance id within the dialogue (implicit in the encoding), the name of the prompt played before the user utterance (prompt), and whether that prompt was a reprompt (reprompt), a confirmation (confirm), or a subdialogue prompt (subdia O, a superset of the reprompts and confirmation prompts.</Paragraph>
    <Paragraph position="12"> The DM features are primarily motivated by previous work. The task-type label feature is to capture the fact that some tasks may be harder than others. The utterance id feature is motivated by the idea that the length of the dialogue may be important, possibly in combination with other features like task-type. The different prompt features for initial prompts, reprompts, confirmation prompts and sub-dialogue prompts are motivated by results indicating that reprompts and confirmation prompts are frustrating for callers and that callers are likely to hyperarticulate when they have to repeat themselves, which results in ASR errors (Shriberg et al., 1992; Levow, 1998).</Paragraph>
    <Paragraph position="13"> The DM features also include running tallies for the number of reprompts (num-reprompts), number of confirmation prompts (num.confirms), and number of subdialogue prompts (num-subdials), that had been played up to each point in the diMogue, as well as running percentages (percent-reprompts, percentconfirms, percent-subdials). The use of running tallies and percentages is based on the assumption that these features are likely to produce generalized predictors (Litman et al., 1999).</Paragraph>
    <Paragraph position="14"> The features obtained via hand-labelling were human transcripts of each user utterance (tscript), a set of semantic labels that are closely related to the system task-type labels (human-label), age (age) and gender (gender) of the user, the actual modality of the user utterance (user-modality) (one of: nothing, speech, touchtone, speech+touchtone, non-speech),  and a cleaned transcript with non-word noise information removed (clean-tscript). From these features we calculated two derived features. The first was the number of words in the cleaned transcript (cltscript numwords), again on the assumption that utterance length is strongly correlated with ASR and NLU errors. The second derived feature was based on calculating whether the human-label matches the sys-label from the dialogue manager (rsuccess). There were four values for rsuccess: rcorrect, rmismatch, rpartial-match and rvacuous-match, indicating respectively correct understanding, incorrect understanding, partial understanding, and the fact that there had been no input for ASR and NLU to operate on, either because the user didn't say anything or because she used touch-tone.</Paragraph>
    <Paragraph position="15"> The whole-dialogue features derived from the per-utterance features were: num-utts, num-reprompts, percent-reprampts, hum.confirms, percent-confirms, num-subdials, and per-cent-subdials for the whole dialogue, and the duration of the entire dialogue in seconds (dial-duration).</Paragraph>
    <Paragraph position="16"> In the experiments, the features in Figure 5 except the Hand-Labelled features are referred to as the AUTOMATIC feature set. We examine how well we can identify or predict problematic dialogues using these features, compared to the full feature set including the Hand-Labelled features. As mentioned earlier, we wish to generalize our problematic dialogue predictor to other systems. Thus we also discuss how well we can predict problematic dialogues using only features that are both automatically acquirable during runtime and independent of the HMIHY task.</Paragraph>
    <Paragraph position="17"> The subset of features from Figure 5 that fit this qualification are in Figure 6. We refer to them as the AUTO, TASK-INDEP feature set.</Paragraph>
    <Paragraph position="18"> The output of each RIPPER. experiment is a classification model learned from the training data. To evaluate these results, the error rates of the learned classification models are estimated using the resampling method of cross-validation. In 5-fold crossvalidation, the total set of examples is randomly divided into 5 disjoint test sets, and 5 runs of the learning program are performed. Thus, each run uses the examples not in the test set for training and the remaining examples for testing. An estimated error rate is obtained by averaging the error rate on the testing portion of the data from each of the 5 runs.</Paragraph>
    <Paragraph position="19"> Since we intend to integrate the rules learned by RIPPER into the HMIHY system, we examine the precision and recall performance of specific hypotheses. Because hypotheses from different cross-validation experiments cannot readily be combined together, we apply the hypothesis learned on one randomly selected training set (80% of the data) to that set's respective test data. Thus the precision and recall results reported below are somewhat less  reliable than the error rates from cross-validation.</Paragraph>
  </Section>
  <Section position="7" start_page="213" end_page="214" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> We present results for both predicting and identifying problematic dialogues. Because we are interested in predicting that a dialogue will be problematic at a point in the dialogue where the system can do something about it, we compare prediction accuracy after having only seen the first exchange of the diMogue with prediction accuracy after having seen the first two exchanges, with identification accuracy after having seen the whole dialogue. For each of these situations we also compare results for the AUTOMATIC and AUTO, TASK-INDEP feature sets (as described earlier), with results for the whole feature set including hand-labelled features. Table 1 summarizes the results.</Paragraph>
    <Paragraph position="1"> The baseline on the first line of Table 1 represents the prediction accuracy from always guessing the majority class. Since 64% of the dialogues are TASKSUCCESS dialogues, we can achieve 64% accuracy from simply guessing TASKSUCCESS without having seen any of the dialogue yet.</Paragraph>
    <Paragraph position="2"> The first EXCHANGE 1 row shows the results of using the AUTOMATIC features from only the first exchange to predict whether the dialogue outcome will be TASKSUCCESS or PROBLEMATIC. The results show that the machine-learned classifier can predict problematic dialogues 8% better than the baseline after having seen only the first user utterance. Using only task-independent automatic features (Figure 6) the EXCHANGE 1 classifier can still do nearly as well.</Paragraph>
    <Paragraph position="3"> The ALL row for EXCHANGE 1 indicates that even if we had access to human perceptual ability (the hand-labelled features) we would still only be able to distinguish between TASKSUCCESS and PROBLEMATIC dialogues with 77% accuracy after having seen the first exchange.</Paragraph>
  </Section>
  <Section position="8" start_page="214" end_page="215" type="metho">
    <SectionTitle>
[Table 1 row labels: AUTOMATIC, AUTO TASK-INDEP, and ALL feature sets for EXCHANGE 1, EXCHANGES 1&amp;2, and FULL DIALOGUE]
</SectionTitle>
    <Paragraph position="0"> The EXCHANGE l&amp;2 rows of Table 1 show the resuits using features from the first two exchanges in the dialogue to predict the outcome of the dialogue. 4 The additional exchange gives roughly an additional 7% boost in predictive accuracy using either of the AUTOMATIC feature sets. This is only 8% less than the accuracy we can achieve using these features after having seen the whole dialogue (see below). The ALL row for EXCHANGE l&amp;2 shows that we could achieve over 86% accuracy if we had the ability to utilize the hand-labelled features.</Paragraph>
    <Paragraph position="1"> The FULL DIALOGUE row in Table 1 for AUTOMATIC and AUTO, TASK-INDEP features shows the ability of the classifier to identify problematic dialogues, rather than predict them, using features for the whole dialogue. The ALL row for the FULL DIALOGUE shows that we could correctly identify over 92% of the outcomes accurately if we had the ability to utilize the hand-labelled features.</Paragraph>
    <Paragraph position="2"> Note that the task-independent automatic features always perform within 2% error of the automatic features, and the hand-labelled features consistently perform with accuracies ranging from 6-8% greater.</Paragraph>
    <Paragraph position="3"> The rules that RIPPER learned on the basis of the Exchange 1 automatic features are below.</Paragraph>
    <Paragraph position="4"> Exchange 1, Automatic Features: if (el-top-confidence _&lt; .924) A (el-dtmf-flag = '1') then problematic, if (el-cliff-confidence _&lt; .916) A (el-asr-duration &gt; 6.92) then problematic, default is tasksuccess.</Paragraph>
    <Paragraph position="5"> According to these rules, a dialogue will be problematic if the confidence score for the top-ranked 4Since 23% of the dialogues consisted of only two exchanges, we exclude the second exchange features for those dialogues where the second exchange consists only of the system playing a closing prompt. We also excluded any features that indicated to the classifier that the second exchange was the last exchange in the dialogue.</Paragraph>
    <Paragraph position="6"> task (given by the NLU module) is moderate or low and there was touchtone input in the user utterance. The second rule says that if the difference between the top confidence score and the second-ranked confidence score is moderate or low, and the duration of the user utterance is more than 7 seconds, predict PROBLEMATIC.</Paragraph>
    <Paragraph position="7"> The performance of these rules is summarized in Table 2. These results show that given the first exchange, this ruleset predicts that 22% of the dialogues will be problematic, while 36% of them actually will be. Of the dialogues that actually will be problematic, it can predict 41% of them. Once it predicts that a dialogue will be problematic, it is correct 69% of the time. As mentioned earlier, this reflects an overMl improvement in accuracy of 8% over the baseline.</Paragraph>
    <Paragraph position="8"> The rules learned by training on the automatic task-independent features for exchanges 1 and 2 are given below. As in the first rule set, the features that the classifier appears to be exploiting are primarily those from the ASR and NLU modules.</Paragraph>
    <Paragraph position="9"> Exchanges l&amp;2, Automatic Task-Independent Features: if (e2-recog-numwords &lt; 0) A (el-cliff-confidence &lt; .95) then problematic.</Paragraph>
    <Paragraph position="10"> if (el-salience-coverage &lt; .889) A (e2-recog contains &amp;quot;I') A (e2-asr-duration &gt; 7.48) then problematic. if (el-top-confidence &lt; .924) A (e2-asr-duration &gt;_ 5.36) A (el-asr-duration &gt; 8.6) then problematic.</Paragraph>
    <Paragraph position="11"> if (e2-recog is blank) A (e2-asr-duration &gt; 2.8) then problematic.</Paragraph>
    <Paragraph position="12"> if (el-salience-coverage &lt; .737) A (el-recog contains &amp;quot;help&amp;quot;) A (el-asr-duration &lt; 7.04) then problematic. if (el-cliff-confidence &lt; .924) A (el-dtmf-flag = '1') A (el-asr-duration &lt; 6.68) then problematic.</Paragraph>
    <Paragraph position="13"> default is tasksuccess.</Paragraph>
    <Paragraph position="14"> The performance of this ruleset is summarized in Table 3. These results show that, given the first two exchanges, this ruleset predicts that 26% of the  dialogues will be problematic, while 36% of them actually will be. Of the problematic dialogues, it can predict 57% of them. Once it predicts that a dialogue will be problematic, it is correct 79% of the time. Compared with the classifier for the first utterance alone, this classifier has an improvement of 16% in recall and 10% in precision, for an overall improvement in accuracy of 7% over using the first exchange alone.</Paragraph>
    <Paragraph position="15"> One observation from these hypotheses is the classifier's preference for the asr-duration feature over the feature for the number of words recognized (recog-numwords). One would expect longer utterances to be more difficult, but the learned rulesets indicate that duration is a better measure of utterance length than the number of words. Another observation is the usefulness of the NLU confidence scores and the NLU salience-coverage in predicting problematic dialogues. These features seem to provide good general indicators of the system's success in recognition and understanding. The fact that the main focus of the rules is detecting ASR and NLU errors and that none of the DM behaviors are used as predictors also indicates that, in all likelihood, the DM is performing as well as it can, given the noisy input that it is getting from ASR and NLU.</Paragraph>
    <Paragraph position="16"> To identify potential improvements in the problematic dialogue predictor, we analyzed which hand-labelled features made large performance improvements, under the assumption that future work can focus on developing automatic features that approximate the information provided by these hand-labelled features. The analysis indicated that the vsuceess feature alone improves the performance of the TOPLINE from 88.5%, as reported in (Langkilde et al., 1999), to 92.3%. Using rsuccess as the only feature results in 73.75% accuracy for exchange 1, 81.9% accuracy for exchanges 18z2 and 85.3% accuracy for the full dialogue. In addition, for Exchanges l&amp;2, the accuracy of the AUTOMATIC, TASK-INDEP feature set plus the rsuccess feature is 86.5%, which is only 0.2% less than the accuracy of ALL the leatures for Exchanges l&amp;2 as shown in Table 1. The rules that RIPPER learns for Exchanges 1&amp;52 when the AUTOMATIC, TASK-INDEP feature set is augmented with the single hand-labelled rsuccess feature is shown below.</Paragraph>
    <Paragraph position="17">  Note that the rsuccess feature is frequently used in the rules and that RIPPER learns rules that combine the rsuccess feature with other features, such as the confidence, asr-duration, and salience-coverage features. null</Paragraph>
  </Section>
class="xml-element"></Paper>