Detecting problematic turns in human-machine interactions: Rule-induction versus memory-based learning approaches

3 Approach

3.1 Data and Labeling

The corpus we used consisted of 3739 question-answer pairs, taken from 444 complete dialogues. The dialogues consist of users interacting with a Dutch spoken dialogue system which provides information about train timetables. The system prompts the user for unknown slots, such as departure station, arrival station, date, etc., in a series of questions, and uses a combination of implicit and explicit verification strategies.

The data were annotated with a highly limited set of labels: the kind of system question, and whether the reply of the user gave rise to communication problems or not. The latter feature is the one to be predicted. The following labels are used for the system questions:

O  open question ("From where to where do you want to travel?")
I  implicit verification ("When do you want to travel from Tilburg to Schiphol Airport?")
E  explicit verification ("So you want to travel from Tilburg to Schiphol Airport?")
Y  yes/no question ("Do you want me to repeat the connection?")
M  meta-question ("Can you please correct me?")

The difference between an explicit verification and a yes/no question is that the former, but not the latter, is aimed at checking whether what the system understood or assumed corresponds with what the user wants. If the current system question is a repetition of the previous question, this is indicated by the suffix R. A question only counts as a repetition when it has the same contents as the previous system question.

Of the user inputs, we only labeled whether they gave rise to a communication problem or not. A communication problem arises when the value which the system assigns to a particular slot (departure station, date, etc.) does not coincide with the value given for that slot by the user in his or her most recent contribution to the dialogue, or when the system makes an incorrect default assumption (e.g., the dialogue manager assumes that the date slot should be filled with the current date, i.e., that the user wants to travel today). Communication problems are generally easy to label, since the spoken dialogue system under consideration here always provides direct feedback (via verification questions) about what it believes the user intends. Consider the following exchange:

U: I want to go to Amsterdam.
S: So you want to go to Rotterdam?

As soon as the user hears the system's explicit verification question, it is clear to him or her that the last turn was misunderstood. The problem feature was labeled by two of the authors to avoid labeling errors.
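To illustrate the annotation scheme, one can think of each annotated question-answer pair as a small record carrying the system-question label and the to-be-predicted problem flag. The following sketch is ours, not the authors'; the class and field names are hypothetical:

    from dataclasses import dataclass
    from typing import List

    # Question-type labels from Section 3.1; the suffix "R" marks a
    # repetition of the previous system question.
    QUESTION_TYPES = {"O", "I", "E", "Y", "M", "OR", "IR", "ER", "YR", "MR"}

    @dataclass
    class QAPair:
        """One annotated question-answer pair (hypothetical schema)."""
        question_type: str     # one of QUESTION_TYPES
        user_words: List[str]  # words observed in the ASR word graph
        problem: bool          # to be predicted: did this turn cause a problem?

    # The running example: the user asks for Amsterdam, but the system's
    # subsequent verification ("So you want to go to Rotterdam?") reveals
    # a misrecognition, so the turn is labeled problematic.
    pair = QAPair(
        question_type="O",  # assumed: an open question preceded this answer
        user_words=["i", "want", "to", "go", "to", "amsterdam"],
        problem=True,
    )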
Differences between the two annotators were infrequent and could always easily be resolved.

3.2 Baselines

Of the 3739 user utterances, 1564 gave rise to communication problems (an error rate of 41.8%). The majority class is thus formed by the unproblematic user utterances, which make up 58.2% of all user utterances. This suggests a baseline for predicting communication problems: always predict that there is no problem. This strategy has an accuracy of 58.2% and a recall of 0% (all problems are missed); the precision is undefined, and consequently so is the F-score.

This baseline is misleading, however, when we are interested in predicting whether the previous user utterance gave rise to communication problems. There are cases in which the dialogue system is itself clearly aware of a communication problem, in particular when it repeats a question (indicated by the suffix R) or asks a meta-question (M). In the corpus under investigation here this happens 1024 times. It would not be very illuminating to develop an automatic error detector which detects only those problems that the system was already aware of. Therefore we take the following as our baseline strategy for predicting whether the previous user utterance gave rise to problems, henceforth referred to as the system-knows baseline: if the current system question Q_t is a repetition or a meta-question, predict that user utterance t-1 caused problems; otherwise, predict that it caused no problems. This strategy predicts problems with an accuracy of 85.6% (1024 of the 1564 problems are detected, so 540 of the 3739 decisions are wrong), a precision of 100% (all 1024 predicted problems were indeed problematic), a recall of 65.5% (1024 of the 1564 problems are predicted to be problematic), and thus an F-score of 79.1. This is a sharp baseline, but for predicting whether the previous user utterance caused problems or not, the system-knows baseline is much more informative and relevant than the majority-class baseline. Table 1 summarizes the baselines.
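To make the baseline arithmetic concrete, the following minimal sketch (ours, not from the paper; all names are hypothetical) recomputes the system-knows baseline figures from the corpus counts reported above. It assumes, as holds in this corpus, that every utterance the baseline flags is indeed problematic:

    def baseline_metrics(total, actual_problems, detected):
        """Accuracy, precision, recall and F-score for the system-knows
        baseline, which flags exactly those utterances after which the
        system repeated itself or asked a meta-question."""
        true_pos = detected                       # flagged and problematic
        false_neg = actual_problems - detected    # problems the system missed
        false_pos = 0                             # assumption: no false alarms
        true_neg = total - actual_problems        # unproblematic, not flagged

        accuracy = (true_pos + true_neg) / total
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / (true_pos + false_neg)
        f_score = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f_score

    # Counts from Section 3.2: 3739 utterances, 1564 problems, 1024 of
    # which the system itself signals (repetition or meta-question).
    acc, prec, rec, f = baseline_metrics(3739, 1564, 1024)
    print(f"accuracy={acc:.1%} precision={prec:.1%} recall={rec:.1%} F={100*f:.1f}")
    # -> accuracy=85.6% precision=100.0% recall=65.5% F=79.1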
3.3 Feature representations

Question-answer pairs were represented as feature vectors (or patterns) of the following form. Six features were reserved for the history of system questions asked so far in the current dialogue (6Q). Of course, if the system has asked only three questions so far, only three question types are stored, and the remaining three question features are not assigned a value. The representation of the user's answer is derived from the word graph produced by the ASR module. It should be kept in mind that in general the word graph is much more complex than the recognized string: the latter is typically the most plausible path (e.g., on the basis of acoustic confidence scores) in the word graph, which itself may contain many other paths. Different systems determine the plausibility of paths in the word graph in different ways. Here, for the sake of generality, we abstract over such differences and simply represent a word graph as a bag of words (BoW), collecting all words that occur in any of its paths, irrespective of the associated acoustic confidence scores. A lexicon was derived of all the words and phrases that occurred in the corpus. Each word graph is then represented as a sequence of bits, where the i-th bit is set to 1 if the i-th word in the pre-derived lexicon occurs at least once in the word graph of the current user utterance, and 0 otherwise. Finally, for each user utterance, a feature is reserved to indicate whether it gave rise to a communication problem or not. This latter feature is the one to be predicted.

There are basically two approaches to detecting communication problems. One is to decide, on the basis of the current user utterance, whether it will be recognized and interpreted correctly. The other uses the current user utterance to determine whether the processing of the previous user utterance gave rise to communication problems; this approach rests on the assumption that users give feedback on communication problems when they notice that the system misunderstood their previous input. In this study, eight prediction tasks have been defined. The first three are concerned with predicting whether the current user input will cause problems, and for these the majority-class baseline is the relevant one; the last five are concerned with predicting whether the previous user utterance caused problems, and for these the sharper system-knows baseline is the appropriate one. The eight tasks are:

(1) predict, on the basis of (the representation of) the current word graph BoW_t, whether the current user utterance (at time t) will cause a communication problem;
(2) predict, on the basis of the six most recent system question types up to t (6Q_t), whether the current user utterance will cause a communication problem;
(3) predict, on the basis of both BoW_t and 6Q_t, whether the current user utterance will cause a problem;
(4) predict, on the basis of the current word graph BoW_t, whether the previous user utterance, uttered at time t-1, caused a problem;
(5) predict, on the basis of the six most recent system questions, whether the previous user utterance caused a problem;
(6) predict, on the basis of BoW_t and 6Q_t, whether the previous user utterance caused a problem;
(7) predict, on the basis of the two most recent word graphs, BoW_{t-1} and BoW_t, whether the previous user utterance caused a problem;
(8) predict, on the basis of the two most recent word graphs, BoW_{t-1} and BoW_t, together with the six most recent system question types 6Q_t, whether the previous user utterance caused a problem.
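The bag-of-words encoding described above can be sketched in a few lines. This is our own illustration, not the authors' code; the function name and the toy lexicon are invented:

    def word_graph_to_bow(word_graph_paths, lexicon):
        """Collapse a word graph into the bag-of-words bit vector of
        Section 3.3.

        word_graph_paths: iterable of paths, each a list of recognized words
        lexicon: all corpus words/phrases in a fixed order, so that bit i
                 corresponds to lexicon[i]
        """
        # Collect every word on any path, ignoring acoustic confidence scores.
        words_in_graph = {word for path in word_graph_paths for word in path}
        # Bit i is 1 iff the i-th lexicon entry occurs somewhere in the graph.
        return [1 if word in words_in_graph else 0 for word in lexicon]

    # Toy example: two competing paths for one utterance.
    lexicon = ["amsterdam", "go", "i", "rotterdam", "tilburg", "to", "want"]
    paths = [
        ["i", "want", "to", "go", "to", "amsterdam"],
        ["i", "want", "to", "go", "to", "rotterdam"],  # competing hypothesis
    ]
    print(word_graph_to_bow(paths, lexicon))  # -> [1, 1, 1, 1, 0, 1, 1]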
3.4 Learning techniques

For the experiments we used the rule-induction algorithm RIPPER (Cohen 1996) and the memory-based IB1-IG algorithm (Aha et al. 1991, Daelemans et al. 1997); the IB1-IG experiments were run with the TiMBL software package, version 3 (Daelemans et al. 2000).

RIPPER is a fast rule-induction algorithm. It starts by splitting the training set in two. On the basis of one half, it induces rules in a straightforward way (roughly, by trying to maximize coverage for each rule), with potential overfitting. When the induced rules classify instances in the other half below a certain threshold, they are not stored. Rules are induced per class. By default the ordering is from low-frequency classes to high-frequency ones, leaving the most frequent class as the default rule, which is generally beneficial for the size of the rule set.

IB1-IG is one of the primary memory-based learning algorithms. Memory-based learning techniques can be characterized by the fact that they store a representation of a set of training data in memory, and classify new instances by looking for the most similar instances in memory. The basic distance function between two patterns is the overlap metric in equation (1), where $\Delta(X, Y)$ is the distance between patterns $X$ and $Y$ (both consisting of $n$ features) and $\delta$ is the distance between individual features. If $X$ is the test case, the $\Delta$ measure determines which group of $k$ cases $Y$ in memory is most similar to $X$; the most frequent class among those $k$ cases is the predicted class for $X$. Usually, $k$ is set to 1. Since some features are more important than others, a weighting function $w_i$ is used; here $w_i$ is the gain ratio measure. In sum, the weighted distance between vectors $X$ and $Y$ of length $n$ is determined by the following equation, where $\delta(x_i, y_i)$ gives a pointwise distance between features, which is 1 if $x_i \neq y_i$ and 0 otherwise:

    \Delta(X, Y) = \sum_{i=1}^{n} w_i \cdot \delta(x_i, y_i)    (1)

Both learning techniques were used for the same eight prediction tasks, and received exactly the same feature vectors as input. All experiments were performed using ten-fold cross-validation, which yields error margins on the predictions.
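To make the IB1-IG classification step concrete, here is a minimal sketch of the weighted overlap distance in equation (1), used in a 1-nearest-neighbor classifier (k = 1). This is our own illustration, not TiMBL's actual implementation; the gain-ratio weights are assumed to be precomputed, and all names are hypothetical:

    def weighted_overlap_distance(x, y, weights):
        """Delta(X, Y) = sum_i w_i * delta(x_i, y_i), where delta is 1 if
        the features differ and 0 otherwise (the overlap metric)."""
        return sum(w * (1 if xi != yi else 0)
                   for w, xi, yi in zip(weights, x, y))

    def classify_1nn(test_case, memory, weights):
        """Predict the class of the nearest stored instance (k = 1).

        memory: list of (feature_vector, class_label) training instances
        weights: per-feature gain-ratio weights (assumed precomputed)
        """
        _, label = min(
            memory,
            key=lambda inst: weighted_overlap_distance(test_case, inst[0], weights),
        )
        return label

    # Toy example: two stored question-answer patterns, one test case.
    memory = [
        (("E", 1, 0), "problem"),
        (("O", 0, 1), "no_problem"),
    ]
    weights = (2.0, 1.0, 1.0)  # pretend gain ratio favors the question type
    print(classify_1nn(("E", 1, 1), memory, weights))  # -> "problem"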