<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1073">
<Title>Designing a Task-Based Evaluation Methodology for a Spoken Machine Translation System</Title>
<Section position="5" start_page="571" end_page="571" type="concl">
<SectionTitle>6 Conclusions and Future Work</SectionTitle>
<Paragraph position="0">This work describes an initial attempt to account for some of the significant issues in a task-based evaluation methodology for an MT system. Our choice of metric reflects separate domain scores, factors in subgoal complexity, and normalizes all counts to allow for comparison among dialogues that differ in dialogue strategy, subgoal complexity, number of goals, and speaker prioritization of goals. The proposed metric is a first attempt and represents work in progress; we have tried to present the simplest possible metric as an initial approach.</Paragraph>
<Paragraph position="1">Many issues remain to be addressed; for instance, we do not take into account the optimality of translations. Although we are interested in goal communication rather than utterance translation quality, the disadvantage of the current approach is that our optimality measure is binary and gives no information about how well phrased the translated text is. More significantly, we have not resolved whether to use metric (1) for both subgoals and goals together or to score them separately. The proposed metric does not reflect that communicating main goals may be essential to communicating their subgoals. It also does not account for the possible complexity introduced by multiple main goals per speaker turn. Nor do we account for the possibility that, in an unsuccessful dialogue, a speaker may become more frustrated as the dialogue proceeds, so that her relative goal priorities are no longer reflected in the number of repair attempts. We may also want to further distinguish in-domain scores by sub-domain (e.g., flights, hotels, events). Perhaps most importantly, we still need to conduct a full-scale evaluation using the above metric, with several scorers and speaker pairs, across different versions of the system before we can report actual results.</Paragraph>
</Section>
</Paper>