<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1614"> <Title>Empirical Methods for Evaluating Dialog Systems</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Purpose </SectionTitle> <Paragraph position="0"> Performance can be measured in myriad ways.</Paragraph> <Paragraph position="1"> Indeed, for evaluating dialog systems, the one problem designers do not encounter is lack of choice. Dialog metrics come in a diverse assortment of styles. They can be subjective or objective, deriving from questionnaires or log files. They can vary in scale, from the utterance level to the overall dialog (Glass et al., 2000).</Paragraph> <Paragraph position="2"> They can treat the system as a &quot;black box,&quot; describing only its external behavior (Eckert et al., 1998), or as a &quot;glass box,&quot; detailing its internal processing. If one metric fails to suffice, dialog metrics can be combined. For example, the PARADISE framework allows designers to predict user satisfaction from a linear combination of objective metrics such as mean recognition score and task completion (Kamm et al., 1999; Litman & Pan, 1999; Walker et al., 1997).</Paragraph> <Paragraph position="3"> Why so many metrics? The answer has to do with more than just the absence of agreed upon standards in the research community, notwithstanding significant efforts in that direction (Gibbon et al., 1997). Part of the reason deals with the purpose a dialog metric serves. Designers want a dialog metric to address the multiple, sometimes inconsistent needs. Here are four typical needs: (1) Provide an accurate estimation of how well a system meets the goals of the domain task.</Paragraph> <Paragraph position="4"> (2) Allow for comparative judgments of one system against another, and if possible, across different domain tasks.</Paragraph> <Paragraph position="5"> (3) Identify factors or components in the system that can be improved.</Paragraph> <Paragraph position="6"> (4) Discover tradeoffs or correlations between factors.</Paragraph> <Paragraph position="7"> While the above list is not intended to be exhaustive, it is instructive. Creating such a list can help designers to anticipate the kinds of obstacles they are likely to face in trying to satisfy all of the needs. Consider the first need on the list.</Paragraph> <Paragraph position="8"> Providing an accurate estimation of how well a system meets the goals of the domain task depends on how well the designers have delineated all the possible goals of interaction. Unfortunately, users often have finer goals than those anticipated by designers, even for domain tasks that seem well defined, such as airline ticket reservation. For example, a user may be leisurely hunting for a vacation and not care about destination or time of travel, or the user may be frantically looking for an emergency ticket and not care about price. The &quot;appropriate&quot; dialog metric should reflect even these kinds of goals. While &quot;time to completion&quot; is more appropriate for the emergency ticket, &quot;concept efficiency rate&quot; is more appropriate for the savvy vacationer. As psychologists have long recognized, when people engage in conversation, they make sure that they mutually understand the goals, roles, and behaviors that can be expected (Clark, 1996; Clark & Brennan, 1991; Clark & Schaefer, 1987, 1989). They evaluate the &quot;performance&quot; of the dialog based on their mutual understanding and expectations. 
Not only do different users have different goals, they sometimes have multiple goals, or more often, their goals change dynamically in response to system behavior such as communication failures (Danieli & Gerbino, 1995; Paek & Horvitz, 1999). Because goals engender expectations that then influence evaluation at different points in time, usability ratings are notoriously hard to interpret, especially if the system is not equipped to infer and keep track of user goals (Horvitz & Paek, 1999; Paek & Horvitz, 2000).</Paragraph> <Paragraph position="9"> The second typical need for a dialog metric, allowing for comparative judgments, introduces further obstacles. In addition to unanticipated, dynamically changing user goals, different systems employ different dialog strategies operating under different architectural constraints, making the search for a dialog metric that generalizes across systems nearly impossible. While the PARADISE framework facilitates some comparison of dialog systems in different domain tasks, generalization is limited because different components can render factors irrelevant in the statistical model (Kamm et al., 1997). For example, a common measure of task completion would be possible if every system represented the domain task as an Attribute-Value Matrix (AVM). Unfortunately, that requirement excludes systems that use Bayesian networks or other non-symbolic representations. This has prompted some researchers to argue that a &quot;common inventory of concepts&quot; is necessary to have standard metrics for evaluation across systems and domain tasks (Kamm et al., 1997; Glass et al., 2000). As we discuss in the next section, the argument is actually backwards; we can use the metrics we already have to define a common inventory of concepts. Furthermore, with the proper set of descriptive statistics, we can exploit these metrics to address the third and fourth typical needs of designers, namely, identifying contributing factors, along with their tradeoffs, and optimizing them.</Paragraph> <Paragraph position="10"> This is not to say that comparative judgments are impossible; rather, it takes some amount of careful work to make them meaningful. When research papers describe evaluation studies of the performance of dialog systems, it is imperative that they provide a baseline comparison from which to benchmark their systems. Even when readers understand the scale of the metrics being reported, without a baseline, the numbers convey very little about the quality of experience users of the system can expect. For example, suppose a paper reports that a dialog system received an average usability score of 9.5/10, a high concept efficiency rate of 90%, and a low word error rate of 5%. These numbers sound terrific, but they could have resulted from low user expectations and a simplistic or highly constrained interface. Practically speaking, readers must either experience interacting with the system themselves, or have a baseline comparison for the domain task from which to make sense of the numbers. This is true even if the paper reports a statistical model for predicting one or more of the metrics from the others, which may reveal tradeoffs but not how well the system performs relative to the baseline.</Paragraph> <Paragraph position="11"> To sum up, in considering the purpose a dialog metric serves, we examined four typical needs and discussed the kinds of obstacles designers are likely to face in finding a dialog metric that satisfies those needs.
The obstacles themselves present distinct challenges: first, keeping track of user goals and expectations for performance based on those goals, and second, establishing a baseline from which to benchmark systems and make comparative judgments.</Paragraph> <Paragraph position="12"> Assuming that designers equip their system to handle the first challenge, we now propose empirical methods that allow them to handle the second, while at the same time providing a practical means of optimizing the system. These methods do not require new metrics, but instead take advantage of existing ones through experimental design and a basic set of descriptive statistics.</Paragraph> </Section> <Section position="5" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Empirical methods </SectionTitle> <Paragraph position="0"> Before designers can make comparative judgments about the performance of a dialog system relative to another system, so that readers unacquainted with either system can understand the reported metrics, they need a baseline.</Paragraph> <Paragraph position="1"> Fortunately, in evaluating dialog between humans and computers, the &quot;gold standard&quot; is oftentimes known; namely, human conversation.</Paragraph> <Paragraph position="2"> The most intuitive and effective way to substantiate performance claims is to compare a dialog system on a particular domain task with how human beings perform on the same task.</Paragraph> <Paragraph position="3"> Because human performance constitutes an ideal benchmark, readers can make sense of the reported metrics by assessing how closely the system approaches the gold standard.</Paragraph> <Paragraph position="4"> Furthermore, with a benchmark, designers can optimize their system through component analysis and cost valuation.</Paragraph> <Paragraph position="5"> In this section, we outline an experimental protocol for obtaining human performance data that can serve as a gold standard. We then highlight a basic set of descriptive statistics for substantiating performance claims, as well as for optimization.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Experimental protocol </SectionTitle> <Paragraph position="0"> Collecting human performance data for establishing a gold standard requires conducting a carefully controlled wizard-of-oz (WOZ) study. The general idea is that users communicate with a human &quot;wizard&quot; under the illusion that they are interacting with a computational system. [Figure 1. Wizard-of-Oz study for the purpose of establishing a baseline comparison.]</Paragraph> <Paragraph position="1"> For spoken dialog systems, maintaining the illusion usually involves utilizing a synthetic voice to output wizard responses, often through voice distortion or a text-to-speech (TTS) generator.</Paragraph> <Paragraph position="2"> The typical use of a WOZ study is to record and analyze user input and wizard output. This allows designers to know what to expect and what they should try to support. User input is especially critical for speech recognition systems that rely on the collected data for acoustic training and language modeling. In iterative WOZ studies, previously collected data is used to adjust the system so that as the performance of the system improves, the studies employ less of the wizard and more of the system (Glass et al., 2000).
In the process, design constraints in the interface may be revealed, in which case, further studies are conducted until acceptable tradeoffs are found (Bernsen et al., 1998).</Paragraph> <Paragraph position="3"> In contrast to the typical use, a WOZ study for establishing a gold standard prohibits modifications to the interface or experimental &quot;curtain.&quot; As shown in Figure 1, all input and output through the interface must be carefully controlled. If designers want to use previously collected performance data as a gold standard, they need to verify that all input and output have remained constant. The protocol for establishing a gold standard is straightforward: (1) Select a dialog metric to serve as an objective function for evaluation.</Paragraph> <Paragraph position="4"> (2) Vary the component or feature that best matches the desired performance claim for the dialog metric.</Paragraph> <Paragraph position="5"> (3) Hold all other input and output through the interface constant so that the only unknown variable is who does the internal processing. (4) Repeat using different wizards.</Paragraph> <Paragraph position="6"> To motivate the above protocol, consider how a WOZ study might be used to evaluate spoken dialog systems. The Achilles' heel of spoken interaction is the fragility of the speech recognizer. System performance depends heavily on the quality of the recognition. Suppose a designer is interested in bolstering the robustness of a dialog system by exploiting various repair strategies. Using task completion rate as an objective function, the designer varies the repair strategies utilized by the system. To make claims about the robustness of these repair strategies, the designer must keep all other input and output constant. In particular, the wizard in the experiment must receive utterances through the same speech recognizer as the dialog system. The performance of the wizard on the same quality of input as the dialog system constitutes the gold standard. The designer may also wish to keep the set of repair strategies constant while varying the use or disuse of the speech recognizer to estimate how much the recognizer degrades task completion.</Paragraph> <Paragraph position="7"> A deep intuition underlies the experimental control of the speech recognizer. As researchers have observed, people with impaired hearing or non-native language skills still manage to communicate effectively despite noisy or uncertain input. Unfortunately, the same cannot be said of computers with analogous deficiencies. People overcome their deficiencies by collaboratively working out the mutual belief that their utterances have been understood sufficiently for current purposes - a process referred to as &quot;grounding&quot; (Clark, 1996). Repair strategies based on grounding indeed show promise for improving the robustness of spoken dialog systems (Paek & Horvitz, 1999; Paek & Horvitz, 2000).</Paragraph> <Paragraph position="8"> A few precautions are in order. First, WOZ studies for establishing a gold standard work best with dialog systems that are highly modular. Modularity makes it possible to test components by replacing a module with the wizard. Without modularity, it is harder to use the performance of the wizard as a gold standard because the boundaries between components are blurred.</Paragraph>
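As an illustration of the protocol above and of the repair-strategy example just described, the following sketch shows one way the controlled conditions might be organized in code. It is only a sketch under assumed names: TrialCondition, run_trial, build_conditions, and the shared recognizer and TTS identifiers are hypothetical stand-ins for whatever infrastructure a particular system actually uses.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TrialCondition:
    agent: str                        # "system" or a wizard ID, e.g. "wizard_1"
    repair_strategy: str              # the single factor being varied
    recognizer: str = "shared_asr_v1" # identical across all conditions
    tts_voice: str = "shared_tts_v1"  # identical across all conditions

def run_trial(condition: TrialCondition,
              dialog_fn: Callable[[TrialCondition], bool]) -> dict:
    """Run one dialog and log whether the task was completed.

    dialog_fn encapsulates the interaction loop; it receives the same
    recognized input regardless of who does the internal processing.
    """
    completed = dialog_fn(condition)
    return {"agent": condition.agent,
            "repair_strategy": condition.repair_strategy,
            "task_completed": completed}

def build_conditions(strategies: List[str],
                     wizards: List[str]) -> List[TrialCondition]:
    # Cross every repair strategy with the system and with each wizard,
    # so the only unknown variable is who does the internal processing.
    agents = ["system"] + wizards
    return [TrialCondition(agent=a, repair_strategy=s)
            for a in agents for s in strategies]

Holding the recognizer and TTS fields fixed mirrors step (3) of the protocol; iterating over several wizard IDs mirrors step (4).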
<Paragraph position="9"> Second, what allows the performance of the wizard to be used as a gold standard is not the wizard, but rather the fact that the performance constitutes an upper bound. For example, the upper bound may be better established by graphical user interfaces (GUIs) or touch-tone systems, in which case, those systems should be the gold standard.</Paragraph> </Section> <Section position="2" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 3.2 Descriptive statistics </SectionTitle> <Paragraph position="0"> After designers collect data from the WOZ study, they can make comparative judgments about the performance of their system relative to other systems using a basic set of descriptive statistics. The descriptive statistics rest on first model fitting the data for both the wizard and the dialog system. Plotting the fitted curves on the same graph sheds light on how best to substantiate any performance claims. In fact, we advocate that designers present this &quot;benchmark graph&quot; to assist readers in interpreting dialog metrics.</Paragraph> <Paragraph position="1"> Using spoken dialog again as an example, suppose a designer is evaluating the robustness of two dialog systems utilizing two different repair strategies. The designer varies the repair strategies, while holding constant the use of the speech recognizer. As speech recognition errors increase, numerous researchers have shown that task completion rate, or dialog success rate, not surprisingly decreases. Plotting task completion rate as a function of word error rate discloses an approximately linear relationship (Lamel et al., 2000; Rudnicky, 2000).</Paragraph> <Paragraph position="2"> Figure 2 displays a benchmark graph for two dialog systems A and B, utilizing different repair strategies. Suppose that the fitted curve for System A is characteristically linear, while the curve for System B is polynomial. Because wizards are presumably more capable of recovering from recognition errors, their performance data make up the gold standard.</Paragraph> <Paragraph position="6"> Figure 2 shows a fitted curve for the gold standard staying close to the upper right-hand corner of the graph in a monotonically decreasing fashion; that is, task completion rate remains relatively high as word error rate increases and then gracefully degrades before the error rate reaches 100%.</Paragraph> <Paragraph position="7"> Looking at the benchmark graph, readers immediately get a sense of how to substantiate performance claims about robustness. For example, by noticing that task completion rate for the gold standard rapidly drops from around 65% at the 80% mark to about 15% by 100%, readers know that at 80% word error rate, even wizards, with human-level intelligence, cannot recover from failures with better than 65% task completion rate. In other words, the task is difficult. So, even if Systems A and B report low task completion rates after the 80% word error rate, they may be performing relatively well compared to the gold standard.</Paragraph> <Paragraph position="8"> In making comparative judgments, it helps to plot the absolute difference in performance from the gold standard as a function of the same independent variable as the benchmark graph.</Paragraph> <Paragraph position="9"> Figure 3 displays such a &quot;gold impurity graph&quot; for Systems A and B as a function of word error rate.
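As a minimal sketch of how a benchmark graph and a gold impurity graph might be produced, assuming numpy and matplotlib and purely illustrative data (the word error rates, completion rates, and polynomial degrees below are invented, not taken from the paper):

import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: task completion rate (%) observed at several
# word error rates (%) for the gold standard (wizard) and two systems.
wer = np.array([0, 20, 40, 60, 80, 100])
gold = np.array([98, 96, 93, 88, 65, 15])
sys_a = np.array([95, 85, 74, 62, 50, 10])
sys_b = np.array([90, 78, 70, 66, 58, 12])

x = np.linspace(0, 100, 200)

# Fit simple polynomial models (degree 1 for A, degree 2 for B and the
# gold standard); in practice the degree would be chosen by model selection.
fit = lambda y, deg: np.polyval(np.polyfit(wer, y, deg), x)
g, fa, fb = fit(gold, 2), fit(sys_a, 1), fit(sys_b, 2)

# Benchmark graph: all fitted curves on one plot.
plt.figure()
plt.plot(x, g, label="Gold standard (wizard)")
plt.plot(x, fa, label="System A")
plt.plot(x, fb, label="System B")
plt.xlabel("Word error rate (%)")
plt.ylabel("Task completion rate (%)")
plt.legend()
plt.title("Benchmark graph")

# Gold impurity graph: absolute difference from the gold standard.
plt.figure()
plt.plot(x, np.abs(g - fa), label="System A")
plt.plot(x, np.abs(g - fb), label="System B")
plt.xlabel("Word error rate (%)")
plt.ylabel("|gold - system| (percentage points)")
plt.legend()
plt.title("Gold impurity graph")
plt.show()

Producing both graphs from the same fitted curves keeps the comparison honest: the benchmark graph shows how close each system comes to the gold standard, and the gold impurity graph shows the remaining gap directly.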
The closer a system is to the gold standard, the smaller the &quot;mass&quot; of the gold impurity on the graph. Anomalies are easier to see, as they typically show up as bumps or peaks. The advantage of the graph is that if a dialog system reports terrible numbers on various performance metrics but displays a splendidly small gold impurity, the reader can be assured that the system is as good as it can possibly be.</Paragraph> <Paragraph position="10"> Looking at the gold impurity graph for Systems A and B, without having experienced either of the two systems, readers can make comparative judgments. For example, although B performs worse at lower word error rates than A, after about the 35% mark, B stays closer to the gold standard. With such crosses in performance, designers cannot categorically prefer one system to the other. In fact, assuming that the only difference between A and B is the choice of repair strategies, designers should prefer A to B if the average word error rate for the speech recognizer is below 35%, and B to A if the average error rate is about 40%.</Paragraph> <Paragraph position="11"> With a gold standard, readers are even able to substantiate performance claims about different dialog systems across domain tasks. They need only look at how close each system is to its respective gold standard in a benchmark graph, and how much mass each system shows in a gold impurity graph.</Paragraph> <Paragraph position="12"> One reason why comparative judgments, without a gold standard, are so hard to make across different domain tasks is task complexity. For example, tutoring physics is generally more complex than retrieving email. Another reason is dialog complexity. A physics tutoring system will be less complex if the system forces users to follow a predefined script. An email system that engages in &quot;mixed initiative&quot; will always be more complex because the user can take more possible actions at any point in time.</Paragraph> <Paragraph position="13"> The way to express complexity in a benchmark graph is to measure the distance of the gold standard to the absolute upper bound of performance. If wizards with human-level intelligence cannot perform close to the absolute upper bound, then the task is complex, or the dialog interface is too restrictive for the wizard, or both. Because complexity is measured only in connection with the gold standard, ceteris paribus, &quot;intellectual complexity&quot; can be defined as \frac{1}{n}\int_{0}^{n} [U - g(x)]\, dx, where U is the upper bound value of a performance metric, n is the upper bound value for an independent variable x, and g(x) is the gold standard along that variable.</Paragraph> <Paragraph position="14"> Designers can use intellectual complexity to compare systems across different domain tasks if they are not too concerned about discriminating task complexity from dialog complexity. Otherwise, they can use intellectual complexity as an objective function and vary the complexity of the dialog interface to scrutinize how much task complexity affects wizard performance.</Paragraph> <Paragraph position="15"> Before substantiating performance claims with a benchmark graph, designers should exercise a few precautionary measures. First, in model fitting a gold standard or the performance of a dialog system, beware of insufficient data.</Paragraph> <Paragraph position="16"> Without sufficient data, differences from the gold standard may be due to variance in the models.
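Returning to the intellectual complexity measure defined above, a small numerical sketch may help; it assumes the reconstructed definition (the average gap between the upper bound U and the gold standard g(x) over [0, n]) and uses an invented gold-standard curve purely for illustration:

import numpy as np

def intellectual_complexity(x, g, upper_bound=100.0):
    """Average gap between the absolute upper bound and the gold standard.

    x : samples of the independent variable (e.g., word error rate, 0..n)
    g : gold-standard performance at each x (e.g., task completion rate)
    Implements (1/n) * integral of (U - g(x)) dx via the trapezoidal rule.
    """
    gap = upper_bound - g
    area = np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(x))
    return area / (x[-1] - x[0])

# Invented gold-standard curve: stays high, then degrades near 100% WER.
x = np.linspace(0.0, 100.0, 101)
g = np.clip(100.0 - 0.002 * x**2 - 0.3 * np.maximum(x - 80.0, 0.0)**2, 0.0, 100.0)
print(round(intellectual_complexity(x, g), 1))

A larger value means that even the wizard falls well short of the upper bound, indicating a complex task, a restrictive interface, or both.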
To guarantee that designers have collected enough data, we recommend that they go through an iterative process. First, run subjects, collect data, and fit a model. Then plot the least squares distance, or \sum_i (y_i - f(x_i))^2, where f(x) is the fitted model and (x_i, y_i) are the collected data points, against the iteration. Keep running more subjects until the plot seems to approach convergence. To inform readers of the reliability of the fitted models, we suggest that designers either show the convergence plot or report the R^2 values for their curves (which relate how much of the variance can be accounted for by the fitted models).</Paragraph> <Paragraph position="17"> Second, to guarantee the reliability of the gold standard, use different wizards. The experimental protocol listed this as the last point because it is important to know whether a consistent gold standard is even possible with the given interface. Differences between wizards may reveal serious design flaws. Furthermore, just as adding more subjects improves the fit of the dialog performance models, the law of large numbers applies equally to the gold standard.</Paragraph> <Paragraph position="18"> Finally, designers may encounter problems with residual errors in model fitting that are typically well covered in most statistics textbooks. For example, because the performance metric shown in Figures 2 and 3, task completion rate, has an upper bound of 100%, it is unlikely that residual errors will be equally spread out at all word error rates.</Paragraph> <Paragraph position="20"> Another common problem is the non-normality of the residual errors, which violates the assumptions of the model.</Paragraph> <Paragraph position="21"> Designers can identify which components are contributing the most to a performance metric by examining the gold impurity graph of the system with and without the component, rendering this kind of test similar to a &quot;lesion&quot; experiment. Carrying out stepwise comparisons of the components, designers can check for tradeoffs, and even use all or part of the mass under the curve as an optimization metric. For example, a designer may wish to improve a dialog system from its current average task completion rate of 70% to 80%. Suppose that System B in Figure 2 incorporates a particular component that System A does not. Looking at the corresponding word error rates in the gold impurity graph for both systems, the mass under the curve for B is slightly greater than that for A. The designer can optimize the performance of the system by selecting components that minimize that mass, in which case, the component in System B should be excluded.</Paragraph> <Paragraph position="22"> Because components may interact with each other, designers may want to carry out a multi-dimensional component analysis for optimization.</Paragraph> <Paragraph position="23"> Suppose the main concern of the designer is to optimize the monetary cost of the dialog system. The designer can determine how much improving the system is worth by calculating the average marginal cost. To do this, a cost function must be elicited that conveys what the designer is willing to pay to achieve various levels of performance. This is actually very easy. Figure 4 displays what dollar amount a designer might be willing to pay for various rates of task completion. The average marginal cost can be computed by using the cost function as a weighting factor for the mass under the gold impurity graph for the system.
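The corresponding expression follows in the next paragraph; as a rough numerical sketch of the same cost-weighted mass (the curves, the word error rate range, and the linear cost function below are assumptions invented for illustration):

import numpy as np

def average_marginal_cost(t, f, g, cost_fn):
    """Cost-weighted mass between the system and gold-standard curves.

    t       : samples of the independent variable (e.g., word error rate)
    f, g    : system and gold-standard task completion rates at each t
    cost_fn : elicited cost function (what the designer would pay at a
              given level of performance); its shape is assumed here.
    """
    w = cost_fn(t) * np.abs(g - f)
    # Trapezoidal rule over the sampled range.
    return float(np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(t)))

# Invented inputs over the word error rates relevant to the targeted gain
# in task completion (none of these numbers come from the paper).
t = np.linspace(20.0, 40.0, 50)
f = 90.0 - 0.7 * t                 # system's task completion curve
g = 95.0 - 0.3 * t                 # gold-standard curve
cost = lambda x: 100.0 - 0.5 * x   # hypothetical elicited cost function
print(round(average_marginal_cost(t, f, g, cost), 2))

Comparing this quantity for the system with and without a candidate component points the designer to the cheaper route to the targeted gain, which is how the next paragraph uses it.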
So, following the previous example, if the designer wishes to improve the system that is currently operating at an average task completion rate of 70% to 80%, then the average marginal cost for that gain is \int c(t)\,[g(t) - f(t)]\, dt, taken over the word error rates corresponding to that gain, where f(t) is the task completion rate of the system, g(t) is the task completion rate of the gold standard, and c(t) is the cost function.</Paragraph> <Paragraph position="24"> Average marginal cost is useful for minimizing expenditure. For example, if the goal is to improve task completion rate from 70% to 80%, and the designer must choose between two systems, one with a particular component and one without, the designer should calculate the average marginal cost of both systems as stated in the above equation and select the cheaper system.</Paragraph> </Section> </Section> </Paper>