<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1006"> <Title>SUBJECT-BASED EVALUATION MEASURES FOR INTERACTIVE SPOKEN LANGUAGE SYSTEMS</Title> <Section position="6" start_page="36" end_page="37" type="concl"> <SectionTitle> 4. SUMMARY AND DISCUSSION </SectionTitle> <Paragraph position="0"> As pointed out by LTC Mettala in his remarks at this meeting, we need to know more than the results of our current benchmark evaluations. We need to know how changes in these benchmarks will change the suitability of a given technology for a given application. We need to know how our benchmarks correlate with user satisfaction and user efficiency. In a sense, we need to evaluate our evaluation measures.</Paragraph> <Paragraph position="1"> At this writing, the MIT software has been transferred to SRI, and data collection is about to begin. We find that what began as an exercise in evaluation has become an exercise in software sharing. We do not want to deny the importance of software sharing and its role in strengthening portability. However, the difficulties involved (legal and other paperwork, acquisition of software and/or hardware, extensive interaction between the two sites) are costly enough that we believe we should also consider mechanisms that achieve our goals without requiring exchange of complete systems.</Paragraph> <Paragraph position="2"> Two such possibilities are described below.</Paragraph> <Paragraph position="3"> Existing logfiles, including standard transcriptions, could be presented to a panel of evaluators for judgments of the appropriateness of individual answers and of the interaction as a whole. In a sense, then, the evaluators would simulate different users going through the same problem-solving experience as the subject who generated the logfile. Cross-site variability of subjects used for this procedure could be somewhat controlled by specifying characteristics of these subjects (first-time users, 2 hours of experience, daily computer users, etc.). This approach has several important advantages: * It allows a much richer set of interactive strategies than our current metrics can assess, which can spur research in the direction of the stated program goals.</Paragraph> <Paragraph position="4"> * It provides an opportunity to assess and improve the correlation of our current metrics with measures that are closer to the views of consumers of the technology, which should yield greater predictive power in matching a given technology to a given application.</Paragraph> <Paragraph position="5"> * It provides a sanity check for our current evaluation measures, which could otherwise lead to improved scores but not necessarily to improved technology.</Paragraph> <Paragraph position="6"> * It allows the same scenario-session to be experienced by more than one user, which addresses the subject-variability issue.</Paragraph> <Paragraph position="7"> * It requires no exchange of software or hardware, and takes advantage of existing data structures currently required of all data collection sites, which means it is relatively inexpensive to implement.</Paragraph> <Paragraph position="8"> The method, however, does NOT make use of a strictly within-subject design, i.e., the same subject does not interact with different systems (although the same evaluator would assess different systems). As a result, the logfile evaluation may require the use of more subjects, or other techniques for addressing the issue of subject variability.
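To make the logfile-evaluation proposal more concrete, the short sketch below is one possible realization; the record fields, function names, and scoring scheme are illustrative assumptions on our part, not a format required of the data collection sites. It shows how per-answer appropriateness judgments from a panel could be aggregated per scenario-session and then correlated with an existing per-session benchmark score, in the spirit of the second and third advantages listed above.

    # Illustrative sketch (Python, standard library only); field and function
    # names are hypothetical, not the logfile format used by the sites.
    from dataclasses import dataclass
    from math import sqrt
    from statistics import mean

    @dataclass
    class Judgment:
        session_id: str    # one scenario-session logfile
        evaluator_id: str  # member of the evaluation panel
        query_index: int   # position of the query/answer pair in the logfile
        appropriate: bool  # did this evaluator judge the system's answer appropriate?

    def session_appropriateness(judgments):
        """Proportion of answers judged appropriate, averaged over each session's judgments."""
        by_session = {}
        for j in judgments:
            by_session.setdefault(j.session_id, []).append(1.0 if j.appropriate else 0.0)
        return {sid: mean(vals) for sid, vals in by_session.items()}

    def pearson(xs, ys):
        """Plain Pearson correlation coefficient (assumes non-constant inputs)."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    def metric_vs_panel(benchmark_score, judgments):
        """Correlate an existing per-session benchmark score with the panel's judgments."""
        panel = session_appropriateness(judgments)
        shared = sorted(set(panel).intersection(benchmark_score))
        return pearson([benchmark_score[s] for s in shared], [panel[s] for s in shared])

    # Example use: metric_vs_panel({"s1": 0.82, "s2": 0.65, "s3": 0.91}, judgments)
    # returns a value between -1 and 1.

Because each session can be judged by several evaluators, the same aggregation could also be used to check inter-evaluator agreement before placing weight on the correlation.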
A live evaluation in which sites would bring their respective systems to a common location for assessment by a panel of evaluators could provide a means for a within-subject design. Such a live test would have benefits similar to those outlined above for the logfile evaluation, but in addition subjects could assess the speed of system response, which the logfile proposal largely ignores. However, it would be more costly to transport the systems and the panel of evaluators than to ship logfiles (although most sites currently bring demonstration systems to meetings). The logfile proposal could be modified to overcome its limited value in the assessment of timing (at some additional expense) by creating a mechanism that would play back the logfiles on a standard display, based on the time stamps appearing in the logfiles. This would also open the possibility of having evaluators hear the speech of the subject, rather than just seeing transcriptions. The costs involved in using such measures are negligible given the potential benefits. We propose these methods not as a replacement for the current measures, but rather as a complement to them and as a reality check on their function in promoting technological progress.</Paragraph> <Paragraph position="9"> Acknowledgment. We gratefully acknowledge support for the work at SRI by DARPA through the Office of Naval Research Contract N00014-90-C-0085 (SRI), and Research Contract N00014-89-J-1332 (MIT). The Government has certain rights in this material. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the government funding agencies. We also gratefully acknowledge the efforts of David Goodine of MIT and of Steven Tepper at SRI in the software transfer and installation.</Paragraph> </Section> </Paper>