<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1049"> <Title>What's the Trouble: Automatically Identifying Problematic Dialogues in DARPA Communicator Dialogue Systems</Title> <Section position="9" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Results for Identifying Problematic </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Dialogues for Data Mining </SectionTitle> <Paragraph position="0"> So far, we have described a PDI that predicts User Satisfaction as a continuous variable. For data mining, system developers will want to extract dialogues with predicted User Satisfaction below a particular threshold. This threshhold could vary during different stages of system development. As the system is ne tuned there will be fewer and fewer dialogues with low User Satisfaction, therefore in order to nd the interesting dialogues for system development one would have to raise the User Satisfaction threshold. In order to illustrate the potential value of our PDI, consider an example threshhold of 12 which divides the data into 73.4% good dialogues where User Satisfaction a17 12 which is our baseline result.</Paragraph> <Paragraph position="1"> Table 3 gives the recall and precision for the PDIs described above which use hand-labelled Task Completion and Auto Task Completion. In the data, 26.6% of the dialogues are problematic (User Satisfaction is under 12), whereas the PDI using hand-labelled Task Completion predicts that 21.8% are problematic. Of the problematic dialogues, 54.5% are classi ed correctly (Recall). Of the dialogues that it classes as problematic 66.7% are problematic (Precision). The results for the automatic system show an improvement in Recall: it identi es more problematic dialogues correctly (66.7%) but the precision is lower.</Paragraph> <Paragraph position="2"> What do these numbers mean in terms of our original goal of reducing the number of dialogues that need to be transcribed to nd good cases to use lematic dialogues (where a good dialogue has User Satisfactiona17 12) for the PDI using hand-labelled Task Completion and Auto Task Completion for system improvement? If one had a budget to transcribe 20% of the dataset containing 100 dialogues, then by randomly extracting 20 dialogues, one would transcribe 5 problematic dialogues and 15 good dialogues. Using the fully automatic PDI, one would obtain 12 problematic dialogues and 8 good dialogues. To look at it another way, to extract 15 problematic dialogues out of 100, 55% of the data would need transcribing. To obtain 15 problematic dialogues using the fully automatic PDI, only 26% of the data would need transcribing. This is a massive improvement over randomly choosing dialogues. null</Paragraph> </Section> </Section> class="xml-element"></Paper>