<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1012">
  <Title>Detecting problematic turns in human-machine interactions: Rule-induction versus memory-based learning approaches</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> First we look at the results obtained with the IB1-IG algorithm (see Table 2). Consider the problem of predicting whether the current user utterance will cause problems. Either looking at the current word graph (BoW(t)), at the six most recent system questions (6Q), or at both leads to a significant improvement with respect to the majority-class-baseline (all checks for significance were performed with a one-tailed t test). The best results are obtained with only the system question types (although the difference with the results for the other two tasks is not significant): a 63.7% accuracy and an F-score of 58.3. However, even though this is a significant improvement over the majority-class-baseline, the accuracy improves on that baseline by only 5.5 percentage points. Next consider the problem of predicting whether the previous user utterance caused communication problems (these are the five remaining tasks). The best result is obtained by taking the two most recent word graphs and the six most recent system question types as input. This yields an accuracy of 88.1%, which is a significant improvement with respect to the</Paragraph>
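The accuracy figures above are reported alongside precision, recall and an F-score. As a reminder of how the latter combines the former two, here is a minimal sketch of the F-beta computation (we assume the balanced beta = 1 variant; the precision/recall values below are illustrative, not taken from the tables):

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta = 1 weighs precision and recall equally."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# illustrative values only
print(round(f_score(0.60, 0.57), 3))  # → 0.585
```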
    <Paragraph position="1"> As an aside, we performed one experiment with the words in the actual, transcribed user utterance at time t instead of BoW(t), where the task is to predict whether the current user utterance would cause a communication problem. This resulted in an accuracy of 64.2% (with a standard deviation of 1.1%). This is not significantly better than the result obtained with the BoW.</Paragraph>
    <Paragraph position="2"> Table 2 (columns: input, output, acc (%), prec (%), rec (%), F-score): results of IB1-IG, with standard deviations, on the eight prediction tasks. *: this accuracy significantly improves the majority-class-baseline (p &lt; .001). **: this accuracy significantly improves the system-knows-baseline (p &lt; .001). Table 3 (same columns): results of RIPPER, with standard deviations, on the eight prediction tasks, with the same baseline markers. +: this accuracy result is significantly better than the IB1-IG result given in Table 2 for this particular task, with p &lt; .05. ++: this accuracy result is significantly better than the IB1-IG result given in Table 2 for this particular task, with p &lt; .001.</Paragraph>
    <Paragraph position="3"> +++: this accuracy result is significantly better than the IB1-IG result given in Table 2 for this particular task, with p &lt; .01.</Paragraph>
    <Paragraph position="4"> sharp system-knows-baseline. In addition, the F-score of 84.8 is nearly 6 points higher than that of the relevant majority-class-baseline.</Paragraph>
    <Paragraph position="5"> The results obtained with RIPPER are shown in Table 3. On the problem of predicting whether the current user utterance will cause a problem, RIPPER obtains the best results by taking as input both the current word graph and the types of the six most recent system questions, predicting problems with an accuracy of 66.0%. This is a significant improvement over the majority-class-baseline, but the result is not significantly better than that obtained with either the word graph or the system questions in isolation. Interestingly, the result is significantly better than the results for IB1-IG on the same task.</Paragraph>
    <Paragraph position="6"> On the problem of predicting whether the previous user utterance caused a problem, RIPPER obtains the best results by taking all features into account (that is: the two most recent bags of words and the six system questions). (Notice that RIPPER sometimes performs below the system-knows-baseline, even though the relevant feature (in particular the type of the last system question) is present. Inspection of the RIPPER rules obtained by training only on 6Q reveals that RIPPER learns a slightly suboptimal rule set, thereby misclassifying 10 instances on average.) This results in a 91.1% accuracy, which is a significant improvement over the sharp system-knows-baseline. This implies that 38% of the communication problems which were not detected by the dialogue system</Paragraph>
    <Paragraph position="7"> 1. if Q(t) = R then problem. (939/2)
2. if Q(t) = I ∧ &quot;naar&quot; ∈ BoW(t-1) ∧ &quot;naar&quot; ∈ BoW(t) ∧ &quot;om&quot; ∉ BoW(t) then problem. (135/16)
3. if &quot;uur&quot; ∈ BoW(t-1) ∧ &quot;om&quot; ∈ BoW(t-1) ∧ &quot;uur&quot; ∈ BoW(t) ∧ &quot;om&quot; ∈ BoW(t) then problem. (57/4)
4. if Q(t) = I ∧ Q(t-3) = I ∧ &quot;uur&quot; ∈ BoW(t-1) then problem. (13/2)
5. if &quot;naar&quot; ∈ BoW(t-1) ∧ &quot;vanuit&quot; ∈ BoW(t) ∧ &quot;van&quot; ∉ BoW(t) then problem. (29/4)
6. if Q(t-1) = I ∧ &quot;uur&quot; ∈ BoW(t-1) ∧ &quot;nee&quot; ∈ BoW(t) then problem. (28/7)
7. if Q(t) = I ∧ &quot;ik&quot; ∈ BoW(t-1) ∧ &quot;van&quot; ∈ BoW(t-1) ∧ &quot;van&quot; ∈ BoW(t) then problem. (22/8)
8. if Q(t) = I ∧ &quot;van&quot; ∈ BoW(t-1) ∧ &quot;om&quot; ∈ BoW(t-1) then problem. (16/6)
9. if Q(t) = E ∧ &quot;nee&quot; ∈ BoW(t) then problem. (42/10)
10. if Q(t) = M ∧ BoW(t-1) = ∅ then problem. (20/0)
11. if Q(t-1) = O ∧ &quot;ik&quot; ∈ BoW(t) ∧ &quot;niet&quot; ∈ BoW(t) then problem. (10/2)
12. if Q(t-2) = I ∧ Q(t) = O ∧ &quot;wil&quot; ∈ BoW(t-1) then problem. (8/0)
13. else no problem. (2114/245)
Figure 1: the rules induced by RIPPER for predicting whether the previous user utterance caused a problem, on the basis of the Bags of Words for t and t-1, and the six most recent system questions. Based on the entire data set. The question features are defined in section 2.
The word &quot;naar&quot; is Dutch for to, &quot;om&quot; for at, &quot;uur&quot; for hour, &quot;van&quot; for from; &quot;vanuit&quot; is a slightly archaic variant of &quot;van&quot; (from); &quot;ik&quot; is Dutch for I, &quot;nee&quot; for no, &quot;niet&quot; for not, and &quot;wil&quot;, finally, for want. The (c/i) numbers at the end of each line indicate how many correct (c) and incorrect (i) decisions were taken using that particular if ... then ... statement.</Paragraph>
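A RIPPER rule set of this kind is read as an ordered decision list: rules are tried in order and the first one that matches determines the class. As a minimal Python sketch of how the first few rules from Figure 1 would be applied (the function name and data representation, sets for bags of words and a dict from time offsets to question types, are our own illustration, not the authors' implementation):

```python
def problem(q, bow_t, bow_prev):
    """Apply a few of the induced rules in order; first match wins.
    q: dict mapping time offset (0, -1, -2, -3) to system question type.
    bow_t / bow_prev: word-graph bags of words at t and t-1."""
    if q.get(0) == "R":                                  # rule 1: repeated question
        return True
    if (q.get(0) == "I" and "naar" in bow_prev           # rule 2
            and "naar" in bow_t and "om" not in bow_t):
        return True
    if {"uur", "om"} <= bow_prev and {"uur", "om"} <= bow_t:
        return True                                      # rule 3: repeated material
    if "naar" in bow_prev and "vanuit" in bow_t and "van" not in bow_t:
        return True                                      # rule 5: marked "vanuit"
    return False                                         # default: no problem

print(problem({0: "R"}, set(), set()))                # rule 1 fires → True
print(problem({0: "I"}, {"naar", "om"}, {"naar"}))    # "om" blocks rule 2 → False
```

Note that the order encodes priority: moving rule 13 (the default) anywhere but last would change the classifier's behaviour.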
    <Paragraph position="8"> under investigation could be classified correctly using features which were already present in the system (word graphs and system question types).</Paragraph>
    <Paragraph position="9"> Moreover, the F-score is 89, which is 10 points higher than the F-score associated with the system-knows baseline strategy. Notice also that this RIPPER result is significantly better than the IB1-IG results for the same task.</Paragraph>
    <Paragraph position="10"> To gain insight into the rules learned by RIPPER for the last task, we applied RIPPER to the complete data set. The rules induced are displayed in Figure 1. RIPPER's first rule is concerned with repeated questions (compare with the system-knows-baseline). One important property of many other rules is that they explicitly combine pieces of information from the three main sources of information (the system questions, the current word graph and the previous word graph).</Paragraph>
    <Paragraph position="11"> Moreover, it is interesting to note that the words which crop up in the RIPPER rules are primarily function words. Another noteworthy feature of the RIPPER rules is that they reflect certain properties which have been claimed to cue communication problems. For instance, Krahmer et al. (1999), in their descriptive analysis of dialogue problems, found that repeated material is often an indication of problems, as is the use of a marked vocabulary. Rules 2, 3 and 7 are examples of the former cue, while the occurrence of the somewhat archaic &quot;vanuit&quot; instead of the ordinary &quot;van&quot; is an example of the latter.</Paragraph>
  </Section>
</Paper>