File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-1401_intro.xml
Size: 3,378 bytes
Last Modified: 2025-10-06 14:01:06
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1401"> <Title>Evaluation Metrics for Generation</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> For many applications in natural language generation (NLG), the range of linguistic expressions that must be generated is quite restricted, and a grammar for a surface realization component can be fully specified by hand. Moreover, in many cases it is very important not to deviate from very specific output in generation (e.g., maritime weather reports), in which case hand-crafted grammars give excellent control. In these cases, evaluations of the generator that rely on human judgments (Lester and Porter, 1997) or on human annotation of the test corpora (Kukich, 1983) are quite sufficient.</Paragraph> <Paragraph position="1"> However, in other NLG applications the variety of the output is much larger, and the demands on the quality of the output are somewhat less stringent. A typical example is NLG in the context of (interlingua- or transfer-based) machine translation.</Paragraph> <Paragraph position="2"> Another reason for relaxing the quality requirements on the output may be that not enough time is available to develop a full grammar for a new target language in NLG. In all these cases, stochastic methods provide an alternative to hand-crafted approaches to NLG.</Paragraph> <Paragraph position="3"> To our knowledge, the first to use stochastic techniques in an NLG realization module were Langkilde and Knight (1998a) and (1998b) (see also (Langkilde, 2000)). 
As is the case for stochastic approaches in natural language understanding, the research and development itself requires an effective intrinsic metric in order to be able to evaluate progress.</Paragraph> <Paragraph position="4"> In this paper, we discuss several evaluation metrics that we are using during the development of FERGUS</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> (Flexible Empiricist/Rationalist Generation Using </SectionTitle> <Paragraph position="0"> Syntax). FERGUS, a realization module, follows Knight and Langkilde's seminal work in using an n-gram language model, but we augment it with a tree-based stochastic model and a lexicalized syntactic grammar. The metrics are useful to us as relative quantitative assessments of different models we experiment with; however, we do not pretend that these metrics in themselves have any validity. Instead, we follow work done in dialog systems (Walker et al., 1997) and attempt to find metrics which on the one hand can be computed easily but on the other hand correlate with empirically verified human judgments in qualitative categories such as readability. The structure of the paper is as follows. In Section 2, we briefly describe the architecture of FERGUS, and some of the modules. In Section 3 we present four metrics and some results obtained with these metrics. In Section 4 we discuss the experimental validation of the metrics using human judgments, and present a new metric based on the results of these experiments. In Section 5 we discuss some of the many problematic issues related to the use of metrics and our metrics in particular, and discuss on-going work.</Paragraph> </Section> </Section> </Paper>