<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1066"> <Title>Quantitative and Qualitative Evaluation of Darpa Communicator Spoken Dialogue Systems</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Communicator 2000 Corpus </SectionTitle> <Paragraph position="0"> The corpus consists of 662 dialogues from nine different travel planning systems, with the number of dialogues per system ranging between 60 and 79. The experimental design is described in (Walker et al., 2001). Each dialogue consists of a recording, a logfile conforming to the logfile standard, transcriptions and recordings of all user utterances, and the output of a web-based user survey. Metrics collected per call included Task Ease, User Expertise, Expected Behavior, and Future Use. The objective metrics focus on measures that can be automatically logged or computed, and the web survey was used to calculate User Satisfaction (Walker et al., 2001). A ternary definition of task completion, Exact Scenario Completion (ESC), was annotated by hand for each call by annotators at AT&T. The ESC metric distinguishes between exact scenario completion (ESC), any scenario completion (ANY), and no scenario completion (NOCOMP). This metric arose because some callers completed an itinerary other than the one assigned. This could have been due to users' inattentiveness, e.g., users not correcting the system when it had misunderstood them. In that case, the system could be viewed as having done the best that it could with the information it was given, which would argue for counting task completion as the sum of ESC and ANY. However, examination of the dialogue transcripts suggested that the ANY category sometimes arose as a rational reaction by the caller to repeated recognition errors. Thus we decided to distinguish the cases where the user completed the assigned task, the cases where the user completed some other task, and the cases where the caller hung up the phone without completing any itinerary.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Dialogue Act Tagging for Evaluation </SectionTitle> <Paragraph position="0"> The hypothesis underlying the application of dialogue act tagging to system evaluation is that a system's dialogue behaviors have a strong effect on the usability of a spoken dialogue system. However, each COMMUNICATOR system has a unique dialogue strategy and a unique way of achieving particular communicative goals. Thus, in order to explore this hypothesis, we needed a way of characterizing system dialogue behaviors that could be applied uniformly across the nine different COMMUNICATOR travel planning systems.</Paragraph> <Paragraph position="1"> We developed a dialogue act tagging scheme for this purpose, which we call DATE (Dialogue Act Tagging for Evaluation).</Paragraph> <Paragraph position="2"> In developing DATE, we believed that it was important to allow for multiple views of each dialogue act. This would allow us, for example, to investigate what part of the task an utterance contributes to separately from what speech act function it serves. Thus, a central aspect of DATE is that it makes distinctions within three orthogonal dimensions of utterance classification: (1) a SPEECH-ACT dimension; (2) a TASK-SUBTASK dimension; and (3) a CONVERSATIONAL-DOMAIN dimension. We believe that these distinctions are important for using such a scheme for evaluation.
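To make the three-dimensional classification concrete, the sketch below (not from the paper; the representation and label strings are assumed for illustration) shows one way an utterance could be annotated with one label per dimension, using the booking acknowledgement discussed later in Section 3.4 as the example.

```python
from dataclasses import dataclass

# Hypothetical representation of a DATE annotation: one label per dimension.
@dataclass(frozen=True)
class DateTag:
    speech_act: str             # e.g. "acknowledge", "request-info", "status-report"
    conversational_domain: str  # "about-task", "about-communication", or "about-situation-frame"
    task_subtask: str           # e.g. "date", "hotel", "booking"

# The system utterance "Great! I am adding this flight to your itinerary."
# would carry the triple (acknowledge, about-task, booking):
example = DateTag(speech_act="acknowledge",
                  conversational_domain="about-task",
                  task_subtask="booking")
```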
Figure 1 shows a COMMUNICATOR dialogue with each system utterance classified on these three dimensions. The tagset for each dimension is briefly described in the remainder of this section. See (Walker and Passonneau, 2001) for more detail.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Speech Acts </SectionTitle> <Paragraph position="0"> In DATE, the SPEECH-ACT dimension has ten categories. We use familiar speech-act labels, such as OFFER, REQUEST-INFO, PRESENT-INFO, and ACKNOWLEDGE, and introduce new ones designed to help us capture generalizations about communicative behavior in this domain, on this task, given the range of system and human behavior we see in the data. One new label, for example, is STATUS-REPORT. Examples of each speech-act type are in Figure 2.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Conversational Domains </SectionTitle> <Paragraph position="0"> The CONVERSATIONAL-DOMAIN dimension concerns the domain of discourse that an utterance is about. Each speech act can occur in any of the three domains of discourse described below.</Paragraph> <Paragraph position="1"> The ABOUT-TASK domain is necessary for evaluating a dialogue system's ability to collaborate with a speaker on achieving the task goal of making reservations for a specific trip. It supports metrics such as the amount of time/effort the system takes to complete a particular phase of making an airline reservation, and any ancillary hotel/car reservations.</Paragraph> <Paragraph position="2"> The ABOUT-COMMUNICATION domain reflects the system goal of managing the verbal channel and providing evidence of what has been understood (Walker, 1992; Clark and Schaefer, 1989). Utterances of this type are frequent in human-computer dialogue, where they are motivated by the need to avoid potentially costly errors arising from imperfect speech recognition.</Paragraph> <Paragraph position="3"> All implicit and explicit confirmations are about communication; see Figure 1 for examples.</Paragraph> <Paragraph position="4"> The SITUATION-FRAME domain pertains to the goal of managing the culturally relevant framing expectations (Goffman, 1974). Utterances in this domain are particularly relevant in human-computer dialogues because the users' expectations need to be defined during the course of the conversation. About-frame utterances by the system attempt to help the user understand how to interact with the system, what it knows about, and what it can do. Some examples are in Figure 1.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Task Model </SectionTitle> <Paragraph position="0"> The TASK-SUBTASK dimension refers to a task model of the domain task that the system supports and captures distinctions among dialogue acts that reflect the task structure. The motivation for this dimension is to derive metrics that quantify the effort expended on particular subtasks. This dimension distinguishes among 14 subtasks, some of which can also be grouped at a level below the top-level task, as described in Figure 3.
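As described just below and in Figure 3, some of these subtasks group under a top-level trip task and others under a ground arrangements task. A minimal sketch of such a two-level grouping (the encoding is assumed for illustration; the names follow the text below):

```python
# Hypothetical encoding of the two-level task model described below (cf. Figure 3).
TASK_MODEL = {
    "top-level-trip": ["origin", "destination", "date", "time",
                       "airline", "trip-type", "retrieval", "itinerary"],
    "ground": ["hotel", "car"],
}
```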
The TOP-LEVEL-TRIP task contains as its subtasks the ORIGIN, DESTINATION, DATE, TIME, AIRLINE, TRIP-TYPE, RETRIEVAL, and ITINERARY tasks.</Paragraph> <Paragraph position="1"> The GROUND task includes both the HOTEL and CAR subtasks.</Paragraph> <Paragraph position="2"> Note that any subtask can involve multiple speech acts. For example, the DATE subtask can consist of acts requesting, or implicitly or explicitly confirming, the date. A similar example is provided by the subtasks of CAR (rental) and HOTEL, which include dialogue acts requesting and confirming the corresponding reservation information.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Implementation and Metrics Derivation </SectionTitle> <Paragraph position="0"> We implemented a dialogue act parser that classifies each of the system utterances in each dialogue in the COMMUNICATOR corpus. Because the systems used template-based generation and had only a limited number of ways of saying the same content, it was possible to achieve 100% accuracy with a parser that tags utterances automatically from a database of patterns, each paired with the relevant tags from each dimension.</Paragraph> <Paragraph position="1"> A summarizer program then examined each dialogue's labels and computed the total effort expended on each type of dialogue act over the dialogue, as well as the percentage of the dialogue given over to a particular type of dialogue behavior.</Paragraph> <Paragraph position="2"> These sums and percentages of effort were calculated along the different dimensions of the tagging scheme, as we explain in more detail below.</Paragraph> <Paragraph position="3"> We believed that the top-level distinction between different domains of action might be relevant, so we calculated the percentage of the total dialogue expended in each conversational domain, resulting in the metrics TaskP, FrameP and CommP (the percentage of the dialogue devoted to the task, frame, or communication domain, respectively).</Paragraph> <Paragraph position="4"> We were also interested in identifying differences in the effort expended on different subtasks. The effort expended on each subtask is represented by the sum of the lengths of the utterances contributing to that subtask. These are the metrics TripC, OrigC, DestC, DateC, TimeC, AirlineC, RetrievalC, FlightinfoC, PriceC, GroundC, and BookingC. See Figure 3.</Paragraph> <Paragraph position="5"> We were particularly interested in developing metrics related to differences in the systems' dialogue strategies. One difference that the DATE scheme can partially capture is the confirmation strategy, by summing the explicit and implicit confirmations. This introduces two metrics, ECon and ICon, which represent the total effort spent on these two types of confirmation.</Paragraph> <Paragraph position="6"> Another strategy difference is in the types of about-frame information that the systems provide. The metric CINSTRUCT counts instances of instructions, CREQAMB counts descriptions of what the system knows about in the context of an ambiguity, and CNOINFO counts the system's descriptions of what it doesn't know about.
SITINFO counts dialogue-initial descriptions of the system's capabilities and instructions for how to interact with the system. A final set of dialogue behaviors that the scheme captures consists of apologies for misunderstanding (CREJECT), acknowledgements of user requests to start over (SOVER), and acknowledgements of user corrections of the system's understanding (ACOR).</Paragraph> <Paragraph position="7"> We believe that it should be possible to use DATE to capture differences in initiative strategies, but we currently capture differences only at the task level, using the task metrics above. The TripC metric counts open-ended questions about the user's travel plans, whereas other subtasks typically include very direct requests for information needed to complete a subtask.</Paragraph> <Paragraph position="8"> We also counted triples identifying dialogue acts used in specific situations; e.g., the utterance "Great! I am adding this flight to your itinerary" is the speech act acknowledge, in the about-task domain, contributing to the booking subtask. This combination is the ACKBOOKING metric.</Paragraph> <Paragraph position="9"> We also keep track of metrics for dialogue acts of acknowledging a rental car booking or a hotel booking, and of requesting, presenting or confirming particular items of task information. Below we describe dialogue act triples that are significant predictors of user satisfaction.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Core Metrics </SectionTitle> <Paragraph position="0"/> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> We initially examined differences in cumulative user satisfaction across the nine systems. An ANOVA for user satisfaction by Site ID, using the modified Bonferroni statistic for multiple comparisons, showed that there were statistically significant differences across sites, and that there were four groups of performers, with sites 3, 2, 1, 4 in the top group (listed by average user satisfaction), sites 4, 5, 9, 6 in a second group, and sites 8 and 7 defining a third and a fourth group. See (Walker et al., 2001) for more detail on cross-system comparisons.</Paragraph> <Paragraph position="1"> However, our primary goal was to achieve a better understanding of the role of qualitative aspects of each system's dialogue behavior. We quantify the extent to which the dialogue act metrics improve our understanding by applying the PARADISE framework to develop a model of user satisfaction and then examining how much the dialogue act metrics improve the model (Walker et al., 2000). Section 4.1 describes the PARADISE models developed using the core metrics, and Section 4.2 describes the models derived from adding in the DATE metrics.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Results using Logfile Standard Metrics </SectionTitle> <Paragraph position="0"> We applied PARADISE to develop models of user satisfaction using the core metrics; the best model fit accounts for 37% of the variance in user satisfaction. The learned model is that User Satisfaction is a weighted sum of Exact Scenario Completion, Task Duration, System Turn Duration and Word Accuracy.
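The paper reports the learned model rather than the fitting procedure; as a hedged sketch, a PARADISE-style model of this kind amounts to a multivariate linear regression of User Satisfaction on normalized metrics. The file and column names below are assumed for illustration.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-dialogue table of core metrics plus survey-based User Satisfaction.
df = pd.read_csv("communicator_core_metrics.csv")

predictors = ["ESC", "TaskDuration", "SysTurnDuration", "WordAccuracy"]
# PARADISE normalizes predictors to z-scores so coefficient magnitudes are comparable.
Z = df[predictors].apply(lambda c: (c - c.mean()) / c.std())
model = sm.OLS(df["UserSatisfaction"], sm.add_constant(Z)).fit()

print(model.rsquared)  # model fit; about 0.37 for the core metrics reported in the paper
print(model.params)    # coefficient sign and magnitude per predictor
print(model.pvalues)   # significance of each predictor
```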
Table 1 gives the details of the model, where the coefficient indicates both the magnitude and whether the metric is a positive or negative predictor of user satisfaction, and the P value indicates the significance of the metric in the model.</Paragraph> <Paragraph position="1"> The finding that metrics of task completion and recognition performance are significant predictors duplicates results from other experiments applying PARADISE (Walker et al., 2000). The fact that task duration is also a significant predictor may indicate larger differences in task duration in this corpus than in previous studies.</Paragraph> <Paragraph position="2"> Note that the PARADISE model indicates that system turn duration is positively correlated with user satisfaction. We believed it plausible that this was because flight presentation utterances are longer than other system turns; on that view, this metric simply captures whether or not the system got enough information to present some potential flight itineraries to the user. We investigate this hypothesis further below.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Utilizing Dialogue Parser Metrics </SectionTitle> <Paragraph position="0"> Next, we add in the dialogue act metrics extracted by our dialogue parser and retrain our models of user satisfaction. We find that many of the dialogue act metrics are significant predictors of user satisfaction, and that the model fit for user satisfaction increases from 37% to 42%. The dialogue act metrics that are significant predictors of user satisfaction are detailed in Table 2.</Paragraph> <Paragraph position="1"> When we examine this model, we note that several of the significant dialogue act metrics are calculated along the task-subtask dimension, namely TripC, BookingC and PriceC. One interpretation of these metrics is that they act as landmarks in the dialogue for having achieved a particular set of subtasks. The TripC metric can be interpreted this way because it includes open-ended questions about the user's travel plans both at the beginning of the dialogue and also after one itinerary has been planned. Other significant metrics can also be interpreted this way; for example, the ReqDate metric counts utterances such as "Could you tell me what date you wanna travel?", which are typically produced only after the origin and the destination have been understood. The ReqTripType metric counts utterances such as "From Boston, are you returning to Dallas?", which are asked only after all the information for the first leg of the trip has been acquired, and in some cases, after this information has been confirmed. The AckRental metric has a similar potential interpretation; the car rental task isn't attempted until after the flight itinerary has been accepted by the caller. However, the predictors for the models already include the ternary exact scenario completion metric (ESC), which specifies whether any task was achieved and whether the exact task that the user was attempting to accomplish was achieved. The fact that the addition of these dialogue act metrics improves the fit of the user satisfaction model suggests that a finer-grained distinction of how many of the subtasks of a dialogue were completed is related to user satisfaction.
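On this reading, such metrics can be recomputed as simple landmark counts over a dialogue's DATE tags. The sketch below is illustrative only: the dialogue representation and the exact tag strings are assumed, while the metric names follow the text above.

```python
# Hypothetical: derive landmark-style features from a dialogue's DATE tags.
# `dialogue` is a list of (utterance, speech_act, domain, subtask) tuples.
def landmark_features(dialogue):
    tags = [(act, subtask) for _, act, _, subtask in dialogue]
    return {
        "ReqDate":    sum(1 for act, sub in tags if act == "request-info" and sub == "date"),
        "AckBooking": sum(1 for act, sub in tags if act == "acknowledge" and sub == "booking"),
        "AckRental":  sum(1 for act, sub in tags if act == "acknowledge" and sub == "car"),
    }
```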
This finer-grained view makes sense: a user on whom the system hung up immediately should be less satisfied than one who could never get the system to understand his destination, and both of these should be less satisfied than a user who was able to communicate a complete travel plan but still did not complete the assigned task.</Paragraph> <Paragraph position="2"> Other support for the task-completion-related nature of some of the significant metrics is that the coefficient for ESC is smaller in the model in Table 2 than in the model in Table 1. Note also that the coefficient for Task Duration is much larger. If some of the dialogue act metrics that are significant predictors are mainly so because they indicate the successful accomplishment of particular subtasks, then both of these changes would make sense. Task Duration can be a greater negative predictor of user satisfaction only when it is counteracted by the positive coefficients for subtask completion.</Paragraph> <Paragraph position="3"> The TripC and the PriceC metrics also have other interpretations. The positive contribution of the TripC metric to user satisfaction could arise from users' positive response to systems with open-ended initial greetings that give the user the initiative. The positive contribution of the PriceC metric might indicate users' positive response to getting price information, since not all systems provided price information.</Paragraph> <Paragraph position="4"> As mentioned above, our goal was to develop metrics that captured differences in dialogue strategies. The positive coefficient of the ECon metric appears to indicate that, overall, an explicit confirmation strategy leads to greater user satisfaction than an implicit confirmation strategy.</Paragraph> <Paragraph position="5"> This result is interesting, although it is unclear how general it is. The systems that used an explicit confirmation strategy did not use it to confirm each item of information; rather, the strategy seemed to be to acquire enough information for a database query and then to confirm all of the parameters before accessing the database. The other use of explicit confirmations was when a system believed that it had repeatedly misunderstood the user.</Paragraph> <Paragraph position="6"> We also explored the hypothesis that the reason system turn duration was a predictor of user satisfaction is that longer turns were used to present flight information. We removed system turn duration from the model to determine whether FlightInfoC would become a significant predictor. However, the model fit decreased and FlightInfoC was not a significant predictor. Thus it is unclear to us why longer system turn durations are a significant positive predictor of user satisfaction.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussion and Future Work </SectionTitle> <Paragraph position="0"> We showed above that the addition of the dialogue act metrics improves the fit of models of user satisfaction from 37% to 42%. Many of the significant dialogue act metrics can be viewed as landmarks in the dialogue for having achieved particular subtasks. These results suggest that a careful definition of transaction success, based on automatic analysis of events in a dialogue, such as acknowledging a booking, might serve as a substitute for the hand-labelling of task completion.</Paragraph> <Paragraph position="1"> In current work we are exploring the use of tree models and boosting for modeling user satisfaction.
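As a sketch of that direction (not the authors' actual setup; the data file and feature names are assumed), a regression tree over the same per-dialogue metrics could be fit and scored with cross-validation:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical per-dialogue table combining core metrics and DATE-derived metrics.
df = pd.read_csv("communicator_metrics_with_date.csv")
features = ["ESC", "TaskDuration", "WordAccuracy",
            "TripC", "BookingC", "PriceC", "ECon", "ICon"]

tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=20)
# Cross-validated R^2 gives a held-out estimate of model fit.
scores = cross_val_score(tree, df[features], df["UserSatisfaction"], cv=10, scoring="r2")
print(scores.mean())
```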
Tree models using the dialogue act metrics can achieve model fits as high as a 48% reduction in error. However, we need to test both these models and the linear PARADISE models on unseen data. Furthermore, we intend to explore methods for deriving additional metrics from dialogue act tags. In particular, it is possible that sequential or structural metrics based on particular sequences or configurations of dialogue acts might capture differences in dialogue strategies.</Paragraph> <Paragraph position="2"> We began a second data collection of dialogues with COMMUNICATOR travel systems in April 2001. In this data collection, the subject pool will use the systems to plan real trips that they intend to take. As part of this data collection, we hope to develop additional metrics related to the quality of the dialogue, how much initiative the user can take, and the quality of the solution that the system presents to the user.</Paragraph> </Section> </Paper>