File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-1422_intro.xml

Size: 2,782 bytes

Last Modified: 2025-10-06 14:04:00

<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1422">
  <Title>GENEVAL: A Proposal for Shared-task Evaluation in NLG</Title>
  <Section position="4" start_page="0" end_page="136" type="intro">
    <SectionTitle>
2 Comparative Evaluations in NLG
</SectionTitle>
    <Paragraph position="0"> There is a long history of shared task initiatives in NLP, of which the best known is perhaps MUC (Hirschman, 1998); others include TREC, PARSE-VAL, SENSEVAL, and the range of shared tasks organised by CoNLL. Such exercises are now common in most areas of NLP, and have had a major impact on many areas, including machine translation and information extraction (see discussion of history of shared-task initiatives and their impact in Belz and Kilgarriff (2006)).</Paragraph>
    <Paragraph position="1"> One of the best-known comparative studies of evaluation techniques was by Papineni et al.</Paragraph>
    <Paragraph position="2"> (2002)whoproposedthe BLEU metricformachine translation and showed that BLEU correlated well with human judgements when comparing several machine translation systems. Several other studies of this type have been carried out in the MT and Summarisation communities.</Paragraph>
    <Paragraph position="3"> The first comparison of NLG evaluation techniqueswhichweareawareofisbyBangaloreetal. null (2000). The authors manually created several variants of sentences from the Wall Street Journal, and evaluated these sentences using both humanjudgementsandseveralcorpus-basedmetrics. null They used linear regression to suggest a combination of the corpus-based metrics which they be- null lieve is a better predictor of human judgements than any of the individual metrics.</Paragraph>
    <Paragraph position="4"> In our work (Belz and Reiter, 2006), we used several different evaluation techniques (human and corpus-based) to evaluate the output of five NLG systems which generated wind descriptions for weather forecasts. We then analysed how well the corpus-based evaluations correlated with the human-based evaluations. Amongst other things, we concluded that BLEU-type metrics work reasonably well when comparing statistical NLG systems, butlesswellwhencomparingstatistical NLG systems to knowledge-based NLG systems.</Paragraph>
    <Paragraph position="5"> We worked in this domain because of the availability of the SumTime corpus (Sripada et al., 2003), which contains both numerical weather prediction data (i.e., inputs to NLG) and human written forecast texts (i.e., target outputs from NLG). We are not aware of any other NLG-related corpora which contain a large number of texts and corresponding input data sets, and are freely available to the research community.</Paragraph>
  </Section>
class="xml-element"></Paper>