<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1040">
  <Title>Automatic Detection of Poor Speech Recognition at the Dialogue Level</Title>
  <Section position="3" start_page="309" end_page="309" type="metho">
    <SectionTitle>
Actual Dialogue
</SectionTitle>
    <Paragraph position="0"> AI: Annie here.</Paragraph>
    <Paragraph position="1"> U1: Hello Annie, could you give me the information on a meeting I have tomorrow? A2: What was that? U2: I need to find out uh, a meeting time and place. Kim sent me a message on it. A3: For all options say help me out.</Paragraph>
    <Paragraph position="2"> U3: Help me out.</Paragraph>
    <Paragraph position="3"> A4: Annie here.</Paragraph>
    <Paragraph position="4"> U4: Help.</Paragraph>
    <Paragraph position="5"> A5: Help. Say call or say send a message, followed by a label or a number. Or say get my messages...</Paragraph>
    <Paragraph position="6"> U5: Get messages. (user interrupts A5) A6: Speech and Image Processing Lab.</Paragraph>
    <Paragraph position="7"> U6: Help.</Paragraph>
    <Paragraph position="8"> A7: Help. Say the first and last name ... U7: Kim.</Paragraph>
    <Paragraph position="9">  are in the last two columns of Figure 1. These confidence measures are based on the recognizer's language and acoustic models. The confidence scores are typically used by the system to decide whether it believes it has correctly understood the user's utterance. When the confidence score falls below a threshold defined for each system, the utterance is considered a rejection (e.g., utterances U1, U2, and U3 in D1). Note that since our classification problem is defined by speech recognition performance, it might be argued that this confidence feature (or features derived from it) suffices for accurate classification. null However, an examination of the transcript in D1 suggests that other useful features might be derived from global or high-level properties of the dialogue history, such as features representing the system's repeated use of diagnostic error messages (utterances A2 and A3), or the user's repeated requests for help (utterances U4 and U6).</Paragraph>
    <Paragraph position="10"> Although the work presented here focuses exclusively on the problem of automatically detecting poor speech recognition, a solution to this problem clearly suggests system reaction, such as the strategy changes mentioned above. In this paper, we report on our initial experiments, with particular attention paid to the problem definition and methodology, the best performance we obtain via a machine learning approach, and the performance differences between classifiers based on acoustic and higher-level dialogue features.</Paragraph>
  </Section>
  <Section position="4" start_page="309" end_page="312" type="metho">
    <SectionTitle>
2 Systems, Data, Methods
</SectionTitle>
    <Paragraph position="0"> The learning experiments that we describe here use the machine learning program RIPPER (Cohen, 1996) to automatically induce a &amp;quot;poor speech recognition performance&amp;quot; classification model from a corpus of spoken dialogues. 1 RIPPER (like other learning programs, such as c5.0 and CART) takes as input the names of a set of classes to be learned, the names and possible values of a fixed set of features, training data specifying the class and feature values for each example in a training set, and outputs a classification model for predicting the class of future examples from their feature representation.</Paragraph>
    <Paragraph position="1"> In RIPPER, the classification model is learned using greedy search guided by an information gain metric, and is expressed as an ordered set of if-then rules.</Paragraph>
    <Paragraph position="2"> We use RIPPER for our experiments because it supports the use of &amp;quot;set-valued&amp;quot; features for representing text, and because if-then rules are often easier for people to understand than decision trees (Quinlan, 1993). Below we describe our corpus of dialogues, the assignment of classes to each dialogue, the extraction of features from each dialogue, and our learning experiments.</Paragraph>
    <Paragraph position="3"> Corpus: Our corpus consists of a set of 544 dialogues (over 40 hours of speech) between humans and one of three dialogue systems: ANNIE (Kamm et al., 1998), an agent for voice dialing and messaging; ELVIS (Walker et al., 1998b), an agent for accessing email; and TOOT (Litman and Pan, 1999), an agent for accessing online train schedules. Each agent was implemented using a general-purpose platform for phone-based spoken dialogue systems (Kamm et al., 1997). The dialogues were obtained in controlled experiments designed to evaluate dialogue strategies for each agent. The exper~We also ran experiments using the machine learning program BOOSTEXTER (Schapire and Singer, To appear), with results similar to those presented below.</Paragraph>
    <Paragraph position="4">  iments required users to complete a set of application tasks in conversations with a particular version of the agent. The experiments resulted in both a digitized recording and an automatically produced system log for each dialogue.</Paragraph>
    <Paragraph position="5"> Class Assignment: Our corpus is used to construct the machine learning classes as follows. First, each utterance that was not rejected by automatic speech recognition (ASR) was manually labeled as to whether it had been semantically misrecognized or not. 2 This was done by listening to the recordings while examining the corresponding system log. If the recognizer's output did not correctly capture the task-related information in the utterance, it was labeled as a misrecognition. For example, in Figure 1 U4 and U6 would be labeled as correct recognitions, while U5 and U7 would be labeled as misrecognitions. Note that our labeling is semantically based; if U5 had been recognized as &amp;quot;play messages&amp;quot; (which invokes the same application command as &amp;quot;get messages&amp;quot;), then U5 would have been labeled as a correct recognition. Although this labeling needs to be done manually, the labeling is based on objective criteria.</Paragraph>
    <Paragraph position="6"> Next, each dialogue was assigned a class of either good or bad, by thresholding on the percentage of user utterances that were labeled as ASR semantic misrecognitions. We use a threshold of 11% to balance the classes in our corpus, yielding 283 good and 261 bad dialogues. 3 Our classes thus reflect relative goodness with respect to a corpus. Dialogue D1 in Figure 1 would be classified as &amp;quot;bad&amp;quot;, because U5 and U7 (29% of the user utterances) are misrecognized.</Paragraph>
    <Paragraph position="7"> Feature Extraction: Our corpus is used to construct the machine learning features as follows.</Paragraph>
    <Paragraph position="8"> Each dialogue is represented in terms of the 23 primitive features in Figure 2. In RIPPER, feature values are continuous (numeric), set-valued, or symbolic. Feature values were automatically computed from system logs, based on five types of knowledge sources: acoustic, dialogue efficiency, dialogue quality, experimental parameters, and lexical. Previous work correlating misrecognition rate with acoustic information, as well as our own  - elapsed time, system turns, user turns * Dialogue Quality Features - rejections, timeouts, helps, cancels, bargeins (raw) - rejection%, timeout%, help%, cancel%, bargein% (normalized) null * Experimental Parameters Features - system, user, task, condition * Lexical Features - ASR text  hypotheses about the relevance of other types of knowledge, contributed to our features.</Paragraph>
    <Paragraph position="9"> The acoustic, dialogue efficiency, and dialogue quality features are all numeric-valued. The acoustic features are computed from each utterance's confidence (log-likelihood) scores (Zeljkovic, 1996). Mean confidence represents the average log-likelihood score for utterances not rejected during ASR. The four pmisrecs% (predicted percentage of misrecognitions) features represent different (coarse) approximations to the distribution of log-likelihood scores in the dialogue. Each pmisrecs% feature uses a fixed threshold value to predict whether a non-rejected utterance is actually a misrecognition, then computes the percentage of user utterances in the dialogue that correspond to these predictedmisrecognitions. (Recall that our dialogue classifications were determined by thresholding on the percentage of actual misrecognitions.) For instance, pmisrecs%1 predicts that if a non-rejected utterance has a confidence score below -2 then it is a misrecognition. Thus in Figure 1, utterances U5 and U7 would be predicted as misrecognitions using this threshold. The four thresholds used for the four pmisrecs% features are -2,-3,-4,-5, and were chosen by hand from the entire dataset to be informative. null The dialogue efficiency features measure how quickly the dialogue is concluded, and include elapsed time (the dialogue length in seconds), and system turns and user turns (the number of turns for each dialogue participant).</Paragraph>
    <Paragraph position="10">  The dialogue quality features attempt to capture aspects of the naturalness of the dialogue. Rejections represents the number of times that the system plays special rejection prompts, e.g., utterances A2 and A3 in dialogue D1. This occurs whenever the ASR confidence score falls below a threshold associated with the ASR grammar for each system state (where the threshold was chosen by the system designer). The rejections feature differs from the pmisrecs% features in several ways. First, the pmisrecs% thresholds are used to determine misrecognitions rather than rejections. Second, the pmisrecs% thresholds are fixed across all dialogues and are not dependent on system state. Third, a system rejection event directly influences the dialogue via the rejection prompt, while the pmisrecs% thresholds have no corresponding behavior.</Paragraph>
    <Paragraph position="11"> Timeouts represents the number of times that the system plays special timeout prompts because the user hasn't responded within a pre-specified time frame. Helps represents the number of times that the system responds to a user request with a (contextsensitive) help message. Cancels represents the number of user's requests to undo the system's previous action. Bargeins represents the number of user attempts to interrupt the system while it is speaking. 4 In addition to raw counts, each feature is represented in normalized form by expressing the feature as a percentage. For example, rejection% represents the number of rejected user utterances divided by the total number of user utterances.</Paragraph>
    <Paragraph position="12"> In order to test the effect of having the maximum amount of possibly relevant information available, we also included a set of features describing the experimental parameters for each dialogue (even though we don't expect rules incorporating such features to generalize). These features capture the conditions under which each dialogue was col4Since the system automatically detects when a bargein occurs, this feature could have been automatically logged. However, because our system did not log bargeins, we had to handlabel them.</Paragraph>
    <Paragraph position="13"> lected. The experimental parameters features each have a different set of user-defined symbolic values. For example, the value of the feature system is either &amp;quot;annie&amp;quot;, &amp;quot;elvis&amp;quot;, or &amp;quot;toot&amp;quot;, and gives RIPPER the option of producing rules that are system-dependent.</Paragraph>
    <Paragraph position="14"> The lexical feature ASR text is set-valued, and represents the transcript of the user's utterances as output by the ASR component.</Paragraph>
    <Paragraph position="15"> Learning Experiments: The final input for learning is training data, i.e., a representation of a set of dialogues in terms of feature and class values. In order to induce classification rules from a variety of feature representations our training data is represented differently in different experiments. Our learning experiments can be roughly categorized as follows. First, examples are represented using all of the features in Figure 2 (to evaluate the optimal level of performance). Figure 3 shows how Dialogue D1 from Figure 1 is represented using all 23 features. Next, examples are represented using only the features in a single knowledge source (to comparatively evaluate the utility of each knowledge source for classification), as well as using features from two or more knowledge sources (to gain insight into the interactions between knowledge sources). Finally, examples are represented using feature sets corresponding to hypotheses in the literature (to empirically test theoretically motivated proposals).</Paragraph>
    <Paragraph position="16"> The output of each machine learning experiment is a classification model learned from the training data. To evaluate these results, the error rates of the learned classification models are estimated using the resampling method of cross-validation (Weiss and Kulikowski, 1991). In 25-fold cross-validation, the total set of examples is randomly divided into 25 disjoint test sets, and 25 runs of the learning program are performed. Thus, each run uses the exampies not in the test set for training and the remaining examples for testing. An estimated error rate is obtained by averaging the error rate on the testing portion of the data from each of the 25 runs.</Paragraph>
  </Section>
class="xml-element"></Paper>