<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1401">
  <Title>Evaluation Metrics for Generation</Title>
  <Section position="5" start_page="2" end_page="8" type="concl">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> We have devised the baseline quantitative metrics presented in this paper for internal use during research and development, in order to evaluate different versions of FERGUS. However, the question also arises whether they can be used to compare two completely different realization modules. In either case, there are two main issues facing the proposed corpus-based quantitative evaluation: does it generalize and is it fair? The problem in generalization is this: can we use this method to evaluate anything other than versions of FERGUS which generate sentences from the WSJ? We claim that we can indeed use the quantitative evaluation procedure to evaluate most realization modules generating sentences from any corpus of unannotated English text. The fact that the tree-based metrics require dependency parses of the corpus is not a major impediment. Using existing syntactic parsers plus ad-hoc postprocessors as needed, one can create the input representations to the generator as well as the syntactic dependency trees needed for the tree-based metrics. The fact that the parsers introduce errors should not affect the way the scores are used, namely as relative scores (they have no real value absolutely). Which realization modules can be evaluated? First, it is clear that our approach can only evaluate single-sentence realization modules which may perform some sentence planning tasks, but cruciaUy not including sentence scoping/aggregation. Second, this approach :only works for generators whose input representation is fairly &amp;quot;syntactic&amp;quot;. For example, it may be difficult to evaluate in this manner a generator that -uses semanzic roles in-its inpntrepresent~ion, since we currently cannot map large corpora of syntactic parses onto such semantic representations, and therefore cannot create the input representation for the evaluation.</Paragraph>
    <Paragraph position="1"> The second question is that of fairness of the evaluation. FE\[,tGt.'S as described in this paper is of limited use. since it only chooses word order (and, to a certain extent, syntactic structure). Other realization and sentence planning tin{ks-which are needed for most applications and which may profit from a stochastic model include lexical choice, introduction of function words and punctuation, and generation of morphology. (See (Langkilde and Knight, 1998a) for a relevant discussion. FERGUS currently can perform punctuation and function word insertion, and morphology and lexical choice are under development.) The question arises whether our metrics will . fairly measure the:quality,~of,a, more comp!ete real~ .... ization module (with some sentence planning). Once the range of choices that the generation component makes expands, one quickly runs into the problem that, while the gold standard may be a good way of communicating the input structure, there are usually other good ways of doing so as well (using other words, other syntactic constructions, and so on).</Paragraph>
    <Paragraph position="2"> Our metrics will penalize such variation. However, in using stochastic methods one is of course precisely interested in learning from a corpus, so that the fact that there may be other ways of expressing an input is less relevant: the whole point of the stochastic approach is precisely to express the input in a manner that resembles as much as possible the realizations found in the corpus (given its genre, register, idiosyncratic choices, and so on). Assuming the test corpus is representative of the training corpus, we can then use our metrics to measure deviance from the corpus, whether it be merely in word order or in terms of more complex tasks such as lexical choice as well. Thus, as long as the goal of the realizer is to enmlate as closely as possible a given corpus (rather than provide a maximal range of paraphrastic capability), then our approach can be used for evaluation, r As in the case of machine translation, evaluation in generation is a complex issue. (For a discussion, see (Mellish and Dale, 1998).) Presumably, the quality of most generation systems can only be assessed at a system level in a task-oriented setting (rather than by taking quantitative measures or by asking humans for quality assessments). Such evaluations are costly, and they cannot be the basis of work in stochastic generation, for which evaluation is a frequent step in research and development. An advantage of our approach is that our quantitative metrics allow us to evaluate without human intervention, automatically and objectively (objectively with respect to the defined metric,-that is).- Independently, the use of the metrics has been validated using human subjects (as discussed in Section 4): once this has happened, the researcher can have increased confidence that choices nlade in research and development based on the quantitative metrics will in fact 7We could also assume a set of acceptable paraphrases for each sentence in the test corpus. Our metrics are run on all paraphrases, and the best score chosen. However. for many applications it will not be emsy to construct such paraphrase sets, be it by hand or automatically.</Paragraph>
    <Paragraph position="3">  correlate with relevant subjective qualitative measures. null</Paragraph>
  </Section>
class="xml-element"></Paper>