File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/01/w01-0903_intro.xml
Size: 6,985 bytes
Last Modified: 2025-10-06 14:01:17
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0903"> <Title>Usability Evaluation in Spoken Language Dialogue Systems</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Usability is becoming an increasingly important issue in the development and evaluation of spoken language dialogue systems (SLDSs).</Paragraph>
<Paragraph position="1"> Many companies would pay large amounts to know exactly which features make SLDSs attractive to users and how to evaluate whether their system has these features. In spite of its key importance, far fewer resources have been invested in the usability aspect of SLDSs over the years than in SLDS component technologies.</Paragraph>
<Paragraph position="2"> The usability aspect has often been neglected in SLDS development and evaluation, and there has been surprisingly little research on important user-related issues, such as user reactions to SLDSs in the field, users' linguistic behaviour, or the main factors which determine overall user satisfaction. However, there now seems to be growing recognition that usability is as important as, and partly independent of, the technical quality of any SLDS component, and that good usability constitutes an important competitive parameter.</Paragraph>
<Paragraph position="3"> Most of today's SLDSs are walk-up-and-use systems for shared-goal tasks. Usability of walk-up-and-use systems is of utmost importance, since users of such systems cannot be expected to undertake extensive training about the system or to read the user manual. Help must be available online and should be needed as infrequently as possible.</Paragraph>
<Paragraph position="4"> There is at present no systematic understanding of which factors must be taken into account to optimise SLDS usability, and thus also no consensus as to which usability evaluation criteria to use. Ideally, such an understanding should be comprehensive, i.e.</Paragraph>
<Paragraph position="5"> include all major usability perspectives on SLDSs, and exhaustive, i.e. describe each perspective as it pertains to the detailed development and evaluation of any possible SLDS. This paper addresses the aspect of comprehensiveness by proposing a set of usability evaluation criteria. The criteria are derived from a set of usability issues that have resulted from a decomposition of the complex space of SLDS usability best practice.</Paragraph>
<Paragraph position="6"> The present paper focuses on walk-up-and-use SLDSs for shared-goal tasks and reviews substantial parts of the authors' work as presented in, e.g., (Dybkjaer and Bernsen 2000). Due to space limitations, few examples are included in the present paper; for additional examples the reader is referred to that paper.</Paragraph>
<Paragraph position="7"> In the following we first briefly address types and purpose of evaluation (Section 2), when to evaluate and which methods to use (Section 3), user involvement (Section 4), and how to evaluate (Section 5). Section 6 presents the proposed set of evaluation criteria and discusses the usability issues behind these. Section 7 concludes the paper.</Paragraph>
<Paragraph position="8"> 2 Types and purpose of evaluation
Evaluation can be quantitative or qualitative, subjective or objective. Quantitative evaluation consists in quantifying some parameter through an independently meaningful number, percentage etc. which in principle allows comparison across systems. Qualitative evaluation consists in estimating or judging some parameter by reference to expert standards and rules. Subjective evaluation consists in judging some parameter by reference to users' opinions. Objective evaluation produces subject-independent parameter assessment. Ideally, we would like to obtain quantitative and objective progress evaluation scores for usability which can be compared objectively to scores obtained from evaluation of other SLDSs. This is what has been attempted in the PARADISE framework, based on the claim that task success and dialogue cost are potentially relevant contributors to user satisfaction (Walker, Litman, Kamm and Abella 1997).
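In outline, and only as a rough sketch of the framework as presented there, PARADISE models system performance as a weighted function of a task success measure and a set of dialogue costs, $\mathit{Performance} = \alpha \cdot \mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)$, where $\kappa$ is a kappa-style measure of task success, the $c_i$ are cost measures such as the number of turns or the elapsed time, $\mathcal{N}$ is a normalisation function, and the weights $\alpha$ and $w_i$ are obtained by regressing user satisfaction ratings on these measures.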
However, many important usability issues cannot be subjected to quantification, and objective expert evaluation is sometimes highly uncertain or nonexistent.
The purpose of evaluation may be to detect and analyse design and implementation errors (diagnostic evaluation), measure SLDS performance in terms of a set of quantitative and/or qualitative parameters (performance evaluation), or evaluate how well the system fits its purpose and meets actual user needs and expectations (adequacy evaluation), cf.</Paragraph>
<Paragraph position="9"> (Hirschmann and Thompson 1996, Gibbon, Moore and Winski 1997, Bernsen et al. 1998).</Paragraph>
<Paragraph position="10"> The latter purpose is the more important one from a usability point of view, although the others are relevant as well. Which type of evaluation to use, and for which purpose, depends on the evaluation criterion being applied (see below). Other general references to natural language systems evaluation are (EAGLES 1996, Gaizauskas 1997, Sparck Jones and Galliers 1996).</Paragraph>
<Paragraph position="11"> 3 When to evaluate and methods to use
Usability evaluation should start as early as possible and continue throughout development.</Paragraph>
<Paragraph position="12"> In general, the earlier design errors are identified, the easier and cheaper they are to correct. Different evaluation methods may have to be applied for evaluating a particular parameter depending on the phase of the lifecycle in which evaluation takes place. Early design evaluation can be based on mock-up experiments with users and on design walkthroughs. Wizard of Oz simulations with representative task scenarios can provide valuable evaluation data. When the system has been implemented, controlled scenario-based tests with representative users and field tests can be used. Recorded dialogues with the (simulated) system should be carefully analysed for indications that the users have problems or expectations which exceed the capabilities of the system. Human-system interaction data should be complemented by interviews and questionnaires to enable assessment of user satisfaction. If users interact with the prototype on the basis of scenarios, there are at least two issues to be aware of. Firstly, scenarios should be designed to avoid priming the users on how to interact with the system. Secondly, sub-tasks covered by the scenarios will not necessarily be representative of the sub-tasks which real users (not using scenarios) would expect the system to cover.</Paragraph>
<Paragraph position="13"> The final test of the system is often called the acceptance test. It involves real users and must satisfy the evaluation criteria defined as part of the requirements specification.</Paragraph>
</Section> </Paper>