<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1066">
<Title>Quantitative and Qualitative Evaluation of Darpa Communicator Spoken Dialogue Systems</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0"> The objective of the DARPA COMMUNICATOR program is to support research on multi-modal, speech-enabled dialogue systems with advanced conversational capabilities. To make this a reality, it is important to understand the contribution of various techniques to users' willingness and ability to use a spoken dialogue system. In June of 2000, we conducted an exploratory data collection experiment with nine participating COMMUNICATOR systems. All systems supported travel planning and utilized some form of mixed-initiative interaction. However, the systems varied in several critical dimensions: (1) they targeted different back-end databases for travel information; (2) system modules such as ASR, NLU, TTS and dialogue management typically differed across systems.</Paragraph>
<Paragraph position="1"> The Evaluation Committee, chaired by Walker (Walker, 2000), with representatives from the nine COMMUNICATOR sites and from NIST, developed the experimental design. A logfile standard was developed by MITRE, along with a set of tools for processing the logfiles (Aberdeen, 2000); the standard and tools were used by all sites to collect a set of core metrics for making cross-system comparisons. The core metrics were developed during a workshop of the Evaluation Committee and included every metric suggested by a committee member that could be implemented consistently across systems. NIST's contribution was to recruit the human subjects and to implement the experimental design specified by the Evaluation Committee.</Paragraph>
<Paragraph position="2"> The experiment was designed to make it possible to apply the PARADISE evaluation framework (Walker et al., 2000), which integrates and unifies previous approaches to evaluation (Price et al., 1992; Hirschman, 2000). The framework posits that user satisfaction is the overall objective to be maximized and that task success and various interaction costs can be used as predictors of user satisfaction. Applying PARADISE, we found that user satisfaction differed considerably across the nine systems. Subsequent modeling of user satisfaction gave us some insight into why each system was more or less satisfactory; four variables accounted for 37% of the variance in user satisfaction: task completion, task duration, recognition accuracy, and mean system turn duration.</Paragraph>
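<!-- The paragraph above describes a PARADISE-style regression in which logged metrics are used
     to predict user satisfaction. The sketch below shows that kind of fit on a small, made-up
     table of per-dialogue metrics; the variable names and numbers are hypothetical illustrations,
     not the model or data reported in this paper.

import numpy as np

# Hypothetical per-dialogue predictors, one row per dialogue:
# task_completion (0/1), task_duration (s), recognition_accuracy, mean_system_turn_duration (s)
X = np.array([
    [1, 210.0, 0.92, 4.1],
    [0, 340.0, 0.71, 6.3],
    [1, 180.0, 0.88, 3.8],
    [0, 400.0, 0.65, 7.0],
    [1, 250.0, 0.80, 5.0],
    [1, 300.0, 0.75, 5.5],
])
y = np.array([4.5, 2.0, 4.0, 1.5, 3.5, 3.0])  # user satisfaction ratings (e.g., survey totals)

# Ordinary least squares with an intercept term.
A = np.column_stack([np.ones(len(X)), X])
coefs, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

# Proportion of variance in user satisfaction explained by the four predictors
# (analogous to the 37% figure cited above, but computed on made-up numbers).
pred = A @ coefs
r_squared = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(coefs, r_squared)
-->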
<Paragraph position="3"> However, when doing our analysis, we were struck by the extent to which different aspects of the systems' dialogue behavior were not captured by the core metrics. For example, the core metrics logged the number and duration of system turns, but did not distinguish between turns used to request or present information, to give instructions, or to indicate errors. Recent research on dialogue has been based on the assumption that dialogue acts provide a useful way of characterizing dialogue behaviors (Reithinger and Maier, 1995; Isard and Carletta, 1995; Shriberg et al., 2000; Di Eugenio et al., 1998). Several research efforts have explored the use of dialogue act tagging schemes for tasks such as improving recognition performance (Reithinger and Maier, 1995; Shriberg et al., 2000), identifying important parts of a dialogue (Finke et al., 1998), and constraining nominal expression generation (Jordan, 2000). Thus, we decided to explore the application of a dialogue act tagging scheme to the task of evaluating and comparing dialogue systems.</Paragraph>
<Paragraph position="4"> Section 2 describes the corpus. Section 3 describes the dialogue act tagging scheme we developed and applied to the evaluation of COMMUNICATOR dialogues. Section 4 first presents our results using the standard logged metrics and then our results using the DATE metrics. Section 5 discusses future plans.</Paragraph>
</Section>
</Paper>