<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1073">
<Title>Designing a Task-Based Evaluation Methodology for a Spoken Machine Translation System</Title>
<Section position="2" start_page="0" end_page="569" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0"> Task-based evaluations for spoken language systems focus on whether the speaker's task is achieved, rather than on utterance translation accuracy or other aspects of system performance. Our MT project focuses on the travel reservation domain and facilitates on-line translation of speech between clients and travel agents arranging travel plans. Our prior evaluations (Gates et al., 1996) focused on end-to-end translation accuracy at the utterance level, i.e., the fraction of utterances translated perfectly, acceptably, and unacceptably.</Paragraph>
<Paragraph position="1"> While this method of evaluation conveys translation accuracy, it gives no information about how many of the client's travel arrangement goals were communicated, nor does it account for the complexity of the speakers' goals and task, or the priority speakers assign to each goal. For example, the same end-to-end score for two dialogues may hide the fact that in one dialogue the speakers communicated their most important goals, while in the other they communicated only the less important ones.</Paragraph>
<Paragraph position="2"> One common approach to evaluating spoken language systems for human-machine dialogue is to compare system responses to correct reference answers; however, as discussed by Walker et al. (1997), the set of reference answers for any particular user query is tied to the system's dialogue strategy. Evaluation methods independent of dialogue strategy have measured the extent to which interactive problem-solving systems aid users via log-file evaluations (Polifroni et al., 1992), quantified repair attempts via turn correction ratio and tracked user detection and correction of system errors (Hirschman and Pao, 1993), and considered transaction success (Shriberg et al., 1992). Danieli and Gerbino (1995) measure implicit recovery (the dialogue module's ability to recover from partial failures of recognition or understanding) and inappropriate utterance ratio; Simpson and Fraser (1993) discuss applying turn correction ratio, transaction success, and contextual appropriateness to dialogue evaluations; and Hirschman et al. (1990) discuss using task completion time as a black-box evaluation metric.</Paragraph>
<Paragraph position="3"> Current literature on task-based evaluation methodologies for spoken language systems focuses primarily on human-computer interactions rather than system-mediated human-human interactions. In a multilingual MT system, speakers communicate via the system, which translates their utterances and generates output in the target language via speech synthesis.</Paragraph>
<Paragraph position="4"> Measuring solution quality (Sikorski and Allen, 1995), transaction success, or contextual appropriateness is meaningless in this setting, since we are not interested in how efficiently travel agents respond to clients' queries, but rather in how well the system conveys the speakers' goals.</Paragraph>
<Paragraph position="5"> Likewise, task completion time will not capture task success for MT dialogues, since it depends on dialogue strategies and speaker styles. Task-based evaluation methodologies for MT systems must therefore focus on whether goals are communicated, rather than whether they are achieved.</Paragraph>
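<!--
To make concrete the contrast drawn above, between a flat utterance-level accuracy score and a measure that weights goals by priority, the following minimal Python sketch compares the two on hypothetical data. The dialogue judgments, goal priorities, and both scoring functions are illustrative assumptions, not the scoring scheme proposed in this paper.

# Sketch contrasting utterance-level end-to-end accuracy with a
# hypothetical priority-weighted goal-communication score. All data
# and weights below are assumed for illustration only.

def end_to_end_accuracy(utterances):
    """Fraction of utterances translated perfectly or acceptably."""
    ok = sum(1 for u in utterances if u in ("perfect", "acceptable"))
    return ok / len(utterances)

def weighted_goal_score(goals):
    """Priority-weighted fraction of goals successfully communicated.

    `goals` maps goal name to (priority weight, communicated?).
    """
    total = sum(w for w, _ in goals.values())
    conveyed = sum(w for w, done in goals.values() if done)
    return conveyed / total

# Two dialogues with identical utterance-level accuracy (assumed labels).
dialogue_a = ["perfect", "acceptable", "unacceptable", "perfect"]
dialogue_b = ["acceptable", "perfect", "perfect", "unacceptable"]

# In dialogue A the high-priority goals got through; in B only minor ones did.
goals_a = {"dates": (3, True), "destination": (3, True), "seat": (1, False)}
goals_b = {"dates": (3, False), "destination": (3, True), "seat": (1, True)}

print(end_to_end_accuracy(dialogue_a), end_to_end_accuracy(dialogue_b))  # 0.75 0.75
print(weighted_goal_score(goals_a), weighted_goal_score(goals_b))        # ~0.86 ~0.57

Under this assumed weighting, both dialogues receive the same end-to-end score (0.75) but very different goal-communication scores, which is precisely the distinction a task-based evaluation must capture.
-->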
</Section>
</Paper>