<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1056">
<Title>Evaluating a Trainable Sentence Planner for a Spoken Dialogue System</Title>
<Section position="6" start_page="0" end_page="0" type="evalu">
<SectionTitle>5 Discussion and Future Work</SectionTitle>
<Paragraph position="0"> Other work has also explored automatically training modules of a generator (Langkilde and Knight, 1998; Mellish et al., 1998; Walker, 2000).</Paragraph>
<Paragraph position="1"> However, to our knowledge, this is the first reported experimental comparison showing that the quality of system utterances produced with trainable components can compete with that of utterances produced by hand-crafted or rule-based techniques. The results validate our methodology: SPoT outperforms two representative rule-based sentence planners and performs as well as the hand-crafted TEMPLATE system, yet it is more easily and quickly tuned to a new domain, since the training materials for the SPoT sentence planner can be collected as subjective judgements from a small number of judges with little or no linguistic knowledge.</Paragraph>
<Paragraph position="2"> Previous work on the evaluation of natural language generation has used three different approaches (Mellish and Dale, 1998).</Paragraph>
<Paragraph position="3"> The first is a subjective evaluation methodology such as the one we use here, in which human subjects rate NLG outputs produced by different sources (Lester and Porter, 1997). Other work has evaluated template-based spoken dialogue generation with a task-based approach, i.e., the generator is evaluated with a metric such as task completion or user satisfaction after dialogue completion (Walker, 2000). This approach works well when the task involves only one or two exchanges, when the choices have large effects over the whole dialogue, or when the choices vary the content of the utterance. Because sentence planning choices realize the same content and affect only the current utterance, we believed it important to obtain local feedback. A final approach focuses on subproblems of natural language generation, such as the generation of referring expressions. For this type of problem it is possible to evaluate the generator by the degree to which it matches human performance (Yeh and Mellish, 1997). When evaluating sentence planning, this approach does not make sense because many different realizations may be equally good.</Paragraph>
<Paragraph position="4"> However, this experiment did not show that trainable sentence planners produce, in general, better-quality output than template-based or rule-based sentence planners. That would be impossible to show: given the nature of template-based and rule-based systems, any quality standard for the output can be met with sufficient person-hours, elapsed time, and software engineering acumen. Our principal goal, rather, is to show that the quality of the TEMPLATE output, for a currently operational dialogue system whose template-based output component was developed, expanded, and refined over about 18 months, can be achieved by a trainable system for which the necessary training data was collected in three person-days. Furthermore, we wished to show that a representative rule-based system based on the current literature, without massive domain tuning, cannot achieve the same level of quality. In future work, we hope to extend SPoT and integrate it into AMELIA.</Paragraph>
</Section>
</Paper>