<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1057">
<Title>Using a Randomised Controlled Clinical Trial to Evaluate an NLG System</Title>
<Section position="12" start_page="31" end_page="31" type="concl">
<SectionTitle>6 Other Evaluation Techniques in STOP</SectionTitle>
<Paragraph position="0"> The clinical trial was by far the biggest evaluation exercise in STOP, but we also performed some smaller evaluations in order to test our algorithms and knowledge-acquisition methodology (Reiter, 2000; Reiter et al., 2000). These included:
1. Asking smokers or domain experts to read two letters, and state which one they thought was superior;
2. Statistical analyses of characteristics of smokers; and
3. Comparing the effectiveness of different algorithms at filling up, but not exceeding, 4 A5 pages.</Paragraph>
<Paragraph position="1"> These evaluations were much smaller, simpler, and cheaper than the clinical trial, and often gave easier-to-interpret results. For example, the letter-comparison experiments suggested (although they did not prove) that older people preferred a more formal writing style than younger people did; the statistical analysis suggested (although again it did not prove) that the tailoring rules should have been more influenced by level of addiction; and the algorithmic analysis showed that a revision architecture outperformed a conventional pipeline architecture.</Paragraph>
<Paragraph position="2"> So, these experiments produced clearer results at a fraction of the cost of the clinical trial. But the cheapness of (1) and (2) was partially due to the fact that they were too small to produce statistically solid findings, and the cheapness of (2) and (3) was partially due to the fact that they exploited data sets and resources that were built as part of the clinical trial. Overall, we believe that these small-scale experiments were worth doing, but as a supplement to, not a replacement for, the clinical trial.</Paragraph>
<SectionTitle>7 When is a Clinical Trial Appropriate?</SectionTitle>
<Paragraph position="3"> When is it appropriate to evaluate an NLG system with a large-scale task or effectiveness evaluation which compares the NLG system to a non-NLG alternative? Certainly this should be done when a customer is seriously considering using the system; indeed, customers may refuse to use a system without such testing.</Paragraph>
<Paragraph position="4"> Controlled task/effectiveness evaluations are also scientifically important, because they provide a technique for testing applied hypotheses (such as 'STOP produces effective smoking-cessation letters'). As such, they should be considered whenever a researcher is interested in testing such hypotheses. Of course, much research in NLG is primarily theoretical, and thus perhaps best tested by corpus studies or psycholinguistic experiments; and much work in applied NLG is concerned with pilot studies and other hypothesis-formation exercises. But at the end of the day, researchers interested in applied NLG need to test as well as formulate hypotheses. While many speech recognition and natural-language understanding applications can be tested by comparing their output to a human-produced 'gold standard' (for example, speech recogniser output can be compared to a human transcription of a speech signal), this has to date been harder to do in NLG, especially in applications such as STOP where there are no human experts (Reiter et al., 2000): there are many experts on personalised oral communication with smokers, but none on personalised written communication, because no one currently writes personalised letters to smokers. In such applications, the only way to test hypotheses about the effects of systems on human users may be to run a controlled task/effectiveness evaluation.</Paragraph>
<Paragraph position="5"> In other words, there's probably no point in conducting a large-scale task/effectiveness evaluation of an NLG system if you're interested in formulating hypotheses instead of testing them, or if you're interested in theoretical instead of applied hypotheses. But if you want to test an applied hypothesis about the effect of an NLG system on human users, the most rigorous way of doing this is to conduct an experiment where you show some users your NLG texts and other users control texts, and measure the degree to which the desired effect is achieved in both groups.</Paragraph>
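To make the shape of such a two-group comparison concrete, here is a minimal sketch, assuming a two-proportion z-test on an outcome rate (here, smoking cessation) in an NLG-letter group versus a control group. The function and all counts below are hypothetical placeholders, not figures or code from the STOP trial:

    # Minimal sketch of the two-group comparison at the heart of a controlled
    # task/effectiveness evaluation: a two-proportion z-test on an outcome rate.
    # All counts are hypothetical placeholders, not results from the STOP trial.
    import math

    def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
        """Two-sided z-test for the difference between two independent proportions."""
        p_a = successes_a / n_a
        p_b = successes_b / n_b
        # Pooled proportion under the null hypothesis that the two rates are equal.
        p_pool = (successes_a + successes_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_a + 1.0 / n_b))
        z = (p_a - p_b) / se
        # Two-sided p-value from the standard normal: P(|Z| >= |z|) = erfc(|z|/sqrt(2)).
        p_value = math.erfc(abs(z) / math.sqrt(2.0))
        return z, p_value

    # Hypothetical example: 30 of 800 smokers quit after receiving tailored
    # NLG letters, versus 22 of 800 in the control group.
    z, p = two_proportion_z_test(30, 800, 22, 800)
    print(f"z = {z:.2f}, two-sided p = {p:.3f}")

A real trial would fix its analysis plan in advance, and would typically use a chi-squared test or a regression model that adjusts for covariates; the sketch shows only the basic comparison of outcome rates across randomised groups.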
<Paragraph position="6"> Large-scale evaluation exercises also have the benefit of forcing researchers and developers to make systems robust, and to face up to the messiness of real data, such as awkward boundary cases and noisy inputs. Indeed, we suspect that STOP is one of the most robust non-commercial NLG systems ever built, because the clinical trial forced us to think about issues such as what we should do with inconsistent or improperly scanned questionnaires, or what we should say to unusual smokers. In conclusion, large-scale task/effectiveness evaluations are expensive, time-consuming, and a considerable hassle. But they are also an essential part of the scientific and technological process, especially in testing applied hypotheses about the effectiveness of systems on real users. We hope that more such evaluations are performed in the future, and that their results are reported whether they are positive or negative.</Paragraph>
</Section>
</Paper>