<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1040">
  <Title>Comparing Automatic and Human Evaluation of NLG Systems</Title>
  <Section position="4" start_page="314" end_page="315" type="metho">
    <SectionTitle>
3 Experimental Procedure
</SectionTitle>
    <Paragraph position="0"> The main goal of our experiments was to determine how well a variety of automatic evaluation metrics correlated with human judgments of text quality in NLG. A secondary goal was to determine if there were types of NLG systems for which the correlation of automatic and human evaluation was particularly good or bad.</Paragraph>
    <Paragraph position="1"> Data: We extracted from each forecast in the SUMTIME corpus the first description of wind (at 10m height) from every morning forecast (the text shown in Table 1 is a typical example), which resulted in a set of about 500 wind forecasts. We excluded several forecasts for which we had no input data (numerical weather predictions) or an incomplete set of system outputs; this left 465 texts, which we used in our evaluation.</Paragraph>
    <Paragraph position="2"> The inputs to the generators were tuples composed of an index, timestamp, wind direction, wind speed range, and gust speed range (see examples at top of Table 1).</Paragraph>
    <Paragraph position="3"> We randomly selected a subset of 21 forecast dates for use in human evaluations. For these 21 forecast dates, we also asked two meteorologists who had not contributed to the original SUMTIME corpus to write new forecasts texts; we used these as reference texts for the automatic metrics. The forecasters created these texts byrewriting thecorpus texts, as this was a more natural task for them than writing texts based on tuples.</Paragraph>
    <Paragraph position="4"> 500 wind descriptions may seem like a small corpus, but in fact provides very good coverage as  the domain language is extremely simple, involving only about 90 word forms (not counting numbers and wind directions) and a small handful of different syntactic structures.</Paragraph>
    <Paragraph position="5"> Systems and texts evaluated: We evaluated four pCRU generators and the SUMTIME system, operating in Hybrid mode (Section 2.3) for better comparability because the pCRU generators do not perform content determination.</Paragraph>
    <Paragraph position="6"> A base pCRU generator was created semi-automatically by running a chunker over the corpus, extracting generation rules and adding some higher-level rules taking care of aggregation, elision etc. This base generator was then trained on 9/10 of the corpus (the training data). 5 different random divisions of the corpus into training and testing data were used (i.e. all results were validated by 5-fold hold-out cross-validation). Additionally, a back-off 2-gram model with Good-Turing discounting and nolexical classes wasbuilt from the same training data, using the SRILM toolkit (Stolcke, 2002). Forecasts were then generated for all corpus inputs, in all four generation modes (Section 2.4).</Paragraph>
    <Paragraph position="7"> Table1 shows anexample ofan input tothe systems, along with the three human texts (Corpus, Human1, Human2) and the texts produced by all five NLG systems from this data.</Paragraph>
    <Paragraph position="8"> Automatic evaluations: We used NIST2, BLEU3, and ROUGE4 to automatically evaluate the above systems and texts. We computed BLEU-N for N = 1..4 (using BLEU-4 as our main BLEU score). We also computed NIST-5 and ROUGE-4.</Paragraph>
    <Paragraph position="9"> As a baseline we used string-edit (SE) distance  are used, the SE score for a generator forecast is the average of its scores against the reference texts; the SE score for a set of generator forecasts is the average of scores for individual forecasts. Human evaluations: We recruited 9 experts (people with experience reading forecasts for offshore oil rigs) and 21 non-experts (people with no such experience). Subjects did not have a background in NLP, and were native speakers of English. They were shown forecast texts from all the generators and from the corpus, and asked to score them on a scale of 0 to 5, for readability, clarity and general appropriateness. Experts were additionally shown the numerical weather data that the forecast text was based on. At the start, subjects were shown two practice examples. The experiments were carried out over the web. Subjects completed the experiment unsupervised, at a time and place of their choosing.</Paragraph>
    <Paragraph position="10"> Expert subjects were shown a randomly selected forecast for18ofthedates. Thenon-experts were shown 21 forecast texts, in a repeated Latin squares (non-repeating column and row entries) experimental design where each combination of date and system is assigned one evaluation.</Paragraph>
  </Section>
  <Section position="5" start_page="315" end_page="317" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> Table 2 shows evaluation scores for the five NLG systems and the corpus texts as assessed by experts, non-experts, NIST-5, BLEU-4, ROUGE-4 and SE. Scores are averaged over the 18 forecasts that were used in the expert experiments (for which we had scores by all metrics and humans) in order to make results as directly comparable as possi- null ble. Human scores are normalised to range 0 to 1. Systems are ranked in order of the scores given to them by experts. All ranks are shown in brackets behind the absolute scores.</Paragraph>
    <Paragraph position="1"> Both experts and non-experts score SUMTIME-Hybrid the highest, and pCRU-2gram and pCRUrandom the lowest. The experts have pCRU-greedy in second place, where the non-experts have pCRU-roulette. The experts rank the corpus forecasts fourth, the non-experts second.</Paragraph>
    <Paragraph position="2"> We used approximate randomisation (AR) as our significance test, as recommended by Riezler and Maxwell III (2005). Pair-wise tests between results in Table 2 showed all but three differences to be significant with the likelihood of incorrectly rejecting the null hypothesis p &lt; 0.05 (the standard threshold in NLP). The exceptions were the differences in NIST and SE scores for SUMTIME-Hybrid/pCRU-roulette, and the difference in BLEU scores for SUMTIME-Hybrid/pCRU-2gram.</Paragraph>
    <Paragraph position="3"> Table 3 shows Pearson correlation coefficients (PCC) for the metrics and humans in Table 2.</Paragraph>
    <Paragraph position="4"> The strongest correlation with experts and non-experts is achieved by NIST-5 (0.82 and 0.83), with ROUGE-4 and SE showing especially poor correlation. BLEU-4 correlates fairly well with the non-experts but less with the experts.</Paragraph>
    <Paragraph position="5"> We computed another correlation statistic (shown in brackets in Table 3) which measures how well scores by an arbitrary single human or runofametriccorrelate withtheaverage scores by a set of humans or runs of a metric. This is computed as the average PCC between the scores assigned by individual humans/runs of a metric (indexing the rows in Table 3) and the average scores assigned by a set of humans/runs of a metric (indexing the columns in Table 3). For example, the PCC for non-experts and experts is 0.845, but the average PCC between individual non-experts and average expert judgment is only 0.496, implying that an arbitrary non-expert is not very likely to correlate well with average expert judgments. Experts are better predictors for each other's judgments (0.799) than non-experts (0.609). Interestingly, it turns out that an arbitrary NIST-5 run is a better predictor (0.822) of average expert opinion than an arbitrary single expert (0.799).</Paragraph>
    <Paragraph position="6"> The number of forecasts we were able to use in our human experiments was small, and to back up the results presented in Table 2 we report NIST-5, BLEU-4, ROUGE-4 and SE scores averaged across the five test sets from the pCRU validation runs, in Table 4. The picture is similar to results for the smaller data set: the rankings assigned by all metrics are the same, except that NIST-5 and SE have swapped the ranks of SUMTIME-Hybrid and pCRU-roulette. Pair-wise AR tests showed all differences to be significant with p &lt; 0.05, except for thedifferences in BLEU, NIST and ROUGE scores for SUMTIME-Hybrid/pCRUroulette, and the difference in BLEU scores for SUMTIME-Hybrid/pCRU-2gram.</Paragraph>
    <Paragraph position="7"> In both Tables 2 and 4, there are two major differences between the rankings assigned by hu- null whereas all the automatic metrics have it the other way around; and (ii) human evaluators score pCRU-roulette highly (second and third respectively), whereas theautomatic metricsscoreitvery low, second worst to random generation (except for NIST which puts it second).</Paragraph>
    <Paragraph position="8"> There are two clear tendencies in scores going from left (humans) to right (SE) across Tables 2 and 4: SUMTIME-Hybrid goes down in rank, and pCRU-2gram comes up.</Paragraph>
    <Paragraph position="9"> In addition to the BLEU-4 scores shown in the tables, wealso calculated BLEU-1, BLEU-2, BLEU3 scores. These give similar results, except that BLEU-1 and BLEU-2 rank pCRU-roulette as highly as the human judges.</Paragraph>
    <Paragraph position="10"> It is striking how low the experts rank the corpus texts, and to what extent they disagree on their quality. This appears to indicate that corpus quality is not ideal. If an imperfect corpus is used as the gold standard for the automatic metrics, thenhighcorrelation withhumanjudgments isless likely, and this may explain the difference in human and automatic scores for SUMTIME-Hybrid.</Paragraph>
  </Section>
  <Section position="6" start_page="317" end_page="318" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> If we assume that the human evaluation scores are the most valid, then the automatic metrics do not do a good job of comparing the knowledge-based SUMTIME system to the statistical systems.</Paragraph>
    <Paragraph position="1"> One reason for this could be that there are cases where SUMTIME deliberately does not choose the most common option in the corpus, because its developers believed that it was not the best for readers. For example, in Table 1, the human forecasters and pCRU-greedy use the phrase by late evening to refer to 0000, pCRU-2gram uses the phrase later, while SUMTIME-Hybrid uses the phrase by midnight. The pCRU choices reflect frequency in the SUMTIME corpus: later (837 instances) and by late evening (327 instances) are more common than by midnight (184 instances).</Paragraph>
    <Paragraph position="2"> However, forecast readers dislike this use of later (because later is used to mean something else in a different type of forecast), and also dislike variants of by evening, because they are unsure how to interpret them (Reiter et al., 2005); this is why SUMTIME uses by midnight.</Paragraph>
    <Paragraph position="3"> The SUMTIME system builders believe deviating from corpus frequency in such cases makes SUMTIME texts better from the reader's perspective, and it does appear to increase human ratings of the system; but deviating from the corpus in such a way decreases the system's score under corpus-similarity metrics. In other words, judging the output of an NLG system by comparing it to corpus texts by a method that rewards corpus similarity will penalise systems which do not base choice on highest frequency of occurrence in the corpus, even if this is motivated by careful studies of what is best for text readers.</Paragraph>
    <Paragraph position="4"> The MT community recognises that BLEU is not effective at evaluating texts which are as good as (or better than) the reference texts. This is not a problem for MT, because the output of current (wide-coverage) MT systems is generally worse thanhuman translations. Butitisanissue for NLG, where systems are domain-specific and can generate texts that are judged better by humans than human-written texts (as seen in Tables 4 and 2).</Paragraph>
    <Paragraph position="5"> Although the automatic evaluation metrics generally replicated human judgments fairly well when comparing different statistical NLG systems, there was a discrepancy in the ranking of pCRU-roulette (ranked high by humans, low by several of the automatic metrics). pCRU-roulette differs from the other statistical generators because it does not  alwaystrytomakethemostcommonchoice(maximise the likelihood of the corpus), instead it tries to vary choices. In particular, if there are several competing words and phrases with similar prob- null abilities, pCRU-roulette will tend to use different words and phrases in different texts, whereas the other statistical generators will stick to those with the highest frequency. This behaviour is penalised by the automatic evaluation metrics, but the human evaluators do not seem to mind it.</Paragraph>
    <Paragraph position="6"> One of the classic rules of writing is to vary lexical and syntactic choices, in order to keep text interesting. However, this behaviour (variation for variation's sake) will always reduce a system's score under corpus-similarity metrics, even if it enhances text quality from the perspective of readers. FosterandOberlander (2006), intheirstudyof facial gestures, have also noted that humans do not mind and indeed in some cases prefer variation, whereas corpus-based evaluations give higher ratings to systems which follow corpus frequency.</Paragraph>
    <Paragraph position="7"> Using more reference texts does counteract this tendency, but only up to a point: no matter how many reference texts are used, there will still be one, or a small number of, most frequent variants, and using anything else will still worsen corpus-similarity scores.</Paragraph>
    <Paragraph position="8"> Canvassing expert opinion of text quality and averaging the results is also in a sense frequencybased, as results reflect what the majority of experts consider good variants. Expert opinions can vary considerably, as shown by the low correlation among experts in our study (and as seen in corpus studies, e.g. Reiter et al., 2005), and evaluations by a small number of experts may also be problematic, unless we have good reason to believe that expert opinions are highly correlated in the domain (which was certainly not the case in our weather forecast domain). Ultimately, such disagreement between experts suggests that (intrinsic) judgments of the text quality -- whether by human or metric -- really should be be backed up by (extrinsic) judgments of the effectiveness of a text in helping real users perform tasks or otherwise achieving its communicative goal.</Paragraph>
  </Section>
  <Section position="7" start_page="318" end_page="318" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> We plan to further investigate the performance of automatic evaluation measures in NLG in the future: (i) performing similar experiments to the one described here in other domains, and with more subjects and larger test sets; (ii) investigating whether automatic corpus-based techniques can evaluate content determination; (iii) investigating how well both human ratings and corpus-based measures correlate with extrinsic evaluations of the effectiveness of generated texts. Ultimately, we would like to move beyond critiques of existing corpus-based metrics to proposing (and validating) new metrics which work well for NLG.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML