<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1057">
  <Title>Using a Randomised Controlled Clinical Trial to Evaluate an NLG System</Title>
  <Section position="4" start_page="0" end_page="31" type="metho">
    <SectionTitle>
3 STOP and its Clinical Trial
</SectionTitle>
    <Paragraph position="0"> The STOP system has been described elsewhere (Reiter et al., 1999). Very briefly, the system took as input a 4-page questionnaire about smoking history, habits, intentions, and so forth, and from this produced a small (4 pages of A5) personalised smoking cessation letter. All interactions with the smoker were paper-based; he or she filled out a paper questionnaire which was scanned into the computer system, and the resultant letter was printed out and posted back to the smoker. The first page of a typical questionnaire is shown in Figure 1, and part of the letter produced from this questionnaire is shown in Figure 2. We wish to emphasise that producing personalised health information letters is not a new idea; many previous researchers have worked in this area. See Lennox et al. (2001) for a comparison of STOP to previous work.</Paragraph>
    <Paragraph position="1"> The STOP clinical trial, which is the focus of this paper, was organised as follows. We contacted 7427 smokers, and asked them to participate in the trial. 2553 smokers agreed to participate, and filled out our smoking questionnaire. These smokers were randomly split among three groups: • Tailored. These smokers received the letter generated by STOP from their questionnaire.</Paragraph>
    <Paragraph position="2"> • Non-tailored. These smokers received a fixed (non-tailored) letter. The non-tailored letter was essentially the letter produced by STOP from a blank questionnaire, with some manual post-editing and tidying up. In other words, during the course of developing STOP we created a set of default rules for handling incomplete or inconsistent questionnaires; the non-tailored letter was produced by activating these default rules without any smoker data. Part of the non-tailored letter is shown in Figure 3.</Paragraph>
    <Paragraph position="3"> • No-letter. These smokers just received a letter thanking them for participating in our study.</Paragraph>
    <Paragraph position="4"> After six months we sent a followup questionnaire asking participants if they had quit, along with other questions (for example, whether they were intending to try to quit even if they had not actually done so yet). Smokers could also make free-text comments about the letter they received. 2045 smokers responded to the followup questionnaire, of whom 154 claimed to have quit. Because people do not always tell the truth about their smoking habits, we asked these 154 people to give saliva samples, which were tested in a lab for nicotine residues. 99 smokers agreed to give such samples, and 89 of these were confirmed as non-smokers.</Paragraph>
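    <Paragraph> The participant flow described above can be tallied in a few lines. The following is a minimal sketch for illustration only: the counts are those quoted in the text, while the stage labels and the helper name stage_rates are hypothetical names introduced here and are not part of the STOP software.</Paragraph>
    <Paragraph><![CDATA[
```python
# Participant counts as quoted in the text; the stage labels are our own wording.
FLOW = [
    ("contacted", 7427),
    ("agreed and returned questionnaire", 2553),
    ("answered followup questionnaire", 2045),
    ("claimed to have quit", 154),
    ("gave saliva sample", 99),
    ("confirmed non-smoker", 89),
]


def stage_rates(flow):
    """Return (stage, count, fraction of the previous stage) for every stage after the first."""
    rows = []
    for (_, previous_count), (name, count) in zip(flow, flow[1:]):
        rows.append((name, count, count / previous_count))
    return rows


if __name__ == "__main__":
    for name, count, fraction in stage_rates(FLOW):
        print(f"{name:35s} {count:5d}  {fraction:6.1%} of previous stage")
```
]]></Paragraph>
    <Paragraph> Each fraction is relative to the stage immediately before it, so the last row is the share of saliva samples that confirmed non-smoking, not a share of all trial participants.</Paragraph>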
    <Section position="1" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
3.1 Practical Aspects of the Clinical Trial
</SectionTitle>
      <Paragraph position="0"> The STOP clinical trial took 20 months to run (of which the first 4 months overlapped software development), and cost about £75,000 (US$110,000). We believe the STOP clinical trial was the longest and costliest evaluation ever done of an NLG system. The length and cost of the clinical trial were primarily due to the large number of subjects. Whereas Levine and Mellish (1995), Young (1999), and Carenini and Moore (2000) included 10, 26, and 30 subjects (respectively) in their task effectiveness evaluations, we had 2553 subjects in our clinical trial. The cost of the trial was partly stationery and postage (we sent out over 10,000 mailings to smokers, each of which included a reply-paid envelope), but mostly staff costs to set up the trial, perform the mailings, process and analyse the returns from smokers, and handle various glitches in the trial.</Paragraph>
      <Paragraph position="1"> Another way of looking at the trial was that we spent about £30 (US$45) per subject (including staff time as well as materials). Perhaps the trial could have been done a bit more cheaply, but any experiment involving 2553 subjects is bound to be expensive and time-consuming.</Paragraph>
      <Paragraph position="2"> The reason the trial needed to be so large was that we were measuring a binary outcome variable (laboratory-verified smoking cessation) with a very low positive rate (since smoking is a very difficult habit to quit). Young, in contrast, measured numerical variables (such as the number of mistakes made by a user when following textual instructions) with substantial standard deviations.</Paragraph>
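      <Paragraph> To make the point about rare binary outcomes concrete, the sketch below applies the standard normal-approximation formula for comparing two proportions with equal group sizes, n = (z_a + z_b)^2 [p1(1 - p1) + p2(1 - p2)] / (p1 - p2)^2. This is not the power calculation used for the STOP trial, which is not described here; the function name n_per_group, the default significance level and power, and the example rates are all assumptions chosen for illustration.</Paragraph>
      <Paragraph><![CDATA[
```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate subjects needed per group to distinguish rates p1 and p2.

    Uses the normal approximation for a two-sided test with equal group sizes.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Illustrative rates only (assumed, not taken from the trial data).
    print("rare outcome, 3% vs 5%:   ", n_per_group(0.03, 0.05))
    print("common outcome, 30% vs 50%:", n_per_group(0.30, 0.50))
```
]]></Paragraph>
      <Paragraph> Under these assumed rates the rare-outcome comparison needs far more subjects per group than the common-outcome one, because the squared difference in the denominator shrinks much faster than the variance term in the numerator.</Paragraph>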
      <Paragraph position="3"> Another complication was that we wanted to use a representative sample of smokers in our trial, which meant that we could not (as Young and Levine and Mellish did) just recruit students and acquaintances. Instead, we contacted a representative set of GPs in our area, and asked them for a list of smokers from their patient record systems. This was the source of the 7427 initial smokers mentioned above.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="31" end_page="31" type="metho">
    <SectionTitle>
4 Results of the Clinical Trial
</SectionTitle>
    <Paragraph position="0"> Detailed results of the STOP clinical trial, including statistical tables, have been published in the medical literature (Lennox et al., 2001). Here we just summarise the key findings which are of NLG (as well as medical) interest. [Figure 2: part of the tailored letter.] Smoking Information for Heather Stewart. You have good reasons to stop...</Paragraph>
    <Paragraph position="1"> People stop smoking when they really want to stop. It is encouraging that you have many good reasons for stopping. The scales show the good and bad things about smoking for you. They are tipped in your favour. You could do it...</Paragraph>
    <Paragraph position="2"> Most people who really want to stop eventually succeed. In fact, 10 million people in Britain have stopped smoking - and stayed stopped - in the last 15 years. Many of them found it much easier than they expected. Although you don't feel confident that you would be able to stop if you were to try, you have several things in your favour.</Paragraph>
    <Paragraph position="3">  workmates.</Paragraph>
    <Paragraph position="4"> We know that all of these make it more likely that you will be able to stop. Most people who stop smoking for good have more than one attempt. Overcoming your barriers to stopping...</Paragraph>
    <Paragraph position="5"> You said in your questionnaire that you might find it difficult to stop because smoking helps you cope with stress. Many people think that cigarettes help them cope with stress. However, taking a cigarette only makes you feel better for a short while. Most ex-smokers feel calmer and more in control than they did when they were smoking. There are some ideas about coping with stress on the back page of this leaflet. You also said that you might find it difficult to stop because you would put on weight. A few people do put on some weight. If you did stop smoking, your appetite would improve and you would taste your food much better. Because of this it would be wise to plan in advance so that you're not reaching for the biscuit tin all the time. Remember that putting on weight is an overeating problem, not a no-smoking one. You can tackle it later with diet and exercise.</Paragraph>
    <Paragraph position="6"> And finally...</Paragraph>
    <Paragraph position="7"> We hope this letter will help you feel more confident about giving up cigarettes. If you have a go, you have a real chance of succeeding. With best wishes, The Health Centre.</Paragraph>
  </Section>
  <Section position="6" start_page="31" end_page="31" type="metho">
    <SectionTitle>
THINGS YOU LIKE
</SectionTitle>
    <Paragraph position="0"> it's relaxing; it stops stress; you enjoy it; it relieves boredom; it stops weight gain; it stops you craving</Paragraph>
  </Section>
  <Section position="7" start_page="31" end_page="31" type="metho">
    <SectionTitle>
THINGS YOU DISLIKE
</SectionTitle>
    <Paragraph position="0"> it makes you less fit; it's a bad example for kids; you're addicted; it's unpleasant for others; other people disapprove; it's a smelly habit; it's bad for you; it's expensive; it's bad for others' health</Paragraph>
    <Section position="1" start_page="31" end_page="31" type="sub_section">
      <SectionTitle>
Information for Stopping Smoking
</SectionTitle>
      <Paragraph position="0"> [Figure 3: part of the non-tailored letter.] Do you want to stop smoking? Everyone has things they like and dislike about their smoking. The decision to stop smoking depends on the things you don't like being more important than the things you do like. It can be useful to think of it as a balance. Have a look on the scales. What are the good and bad things for you? Add any more that you can think of. Are you ready to stop smoking? If yes, maybe it's the right time to have a go. If no, think about the good and bad things about smoking. This might swing the balance for you.</Paragraph>
      <Paragraph position="1"> You can do it.....</Paragraph>
      <Paragraph position="2"> People who want to stop smoking usually succeed. 10 million people in Britain have stopped smoking - and stayed stopped - in the last 15 years. Many of them found it much easier than they expected! Try it out.....</Paragraph>
      <Paragraph position="3"> If you don't feel ready for an all-out attempt to stop smoking, there are some useful ways to prepare yourself. You could try some of the following ideas now. This will help you when you try to stop smoking.  If it gets tough.....</Paragraph>
      <Paragraph position="4"> Many people do hit rough patches; there are ways to deal with these. On the back page are some suggestions that other people have found useful. If you do have a cigarette after a few days just put it behind you and keep on trying. Prepare yourself for another attempt, many people have more than one go before they stop for good! With best wishes.</Paragraph>
      <Paragraph position="5"> The Health Centre.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="31" end_page="31" type="metho">
    <SectionTitle>
GOOD THINGS
</SectionTitle>
    <Paragraph position="0"> you enjoy it; it's relaxing; it stops stress; it breaks up the day; it relieves boredom; it's sociable; it stops weight gain; it stops you craving</Paragraph>
  </Section>
  <Section position="9" start_page="31" end_page="31" type="metho">
    <SectionTitle>
BAD THINGS
</SectionTitle>
    <Paragraph position="0"> it's bad for you; it makes you less fit; it's expensive; it's a bad example for kids; it's bad for others' health; you're addicted; it's unpleasant for others; other people disapprove; it's a smelly habit
    [End of Figure 3; the summary of results continues.] Of the 2553 smokers in the trial, 89 were validated as having stopped smoking. These broke down by group as follows: [per-group figures missing from this extraction].
    The non-tailored group had the lowest number of heavy (more than 20 cigarettes per day) smokers, who are less likely to stop smoking (because they are probably addicted to nicotine) than light smokers; the tailored group had the highest number of heavy smokers. After adjusting for this fact, cessation rates were still higher in the non-tailored group than in the tailored group, but this difference was not statistically significant. We can see this if we look just at cessation rates in light smokers (few heavy smokers from any category managed to stop smoking):
    • 4.3% (25 out of 563) of the light smokers in the tailored group stopped smoking
    • 4.9% (31 out of 597) of the light smokers in the non-tailored group stopped smoking
    • 2.7% (16 out of 582) of the light smokers in the no-letter group stopped smoking
    The overall conclusion is therefore that recipients of the non-tailored letters were more likely to stop than people who got no letter (see footnote 2) (p=.047 overall unadjusted; p=.069 overall after adjusting for differences between groups, such as heavy/light smoker split; p=.049 for light smokers). However, there was no evidence that the tailored letters were any better than the non-tailored ones in terms of increasing cessation rates.</Paragraph>
    <Paragraph position="1"> [Footnote 2] Note that while a 1% or 2% increase in cessation rates is small, it is medically useful if it can be achieved cheaply. See Law and Tang (1995) for a discussion of success rates and cost-effectiveness of various smoking-cessation techniques, and Lennox et al. (2001) for an analysis that shows that sending letters is very cost-effective compared to most other smoking-cessation techniques.</Paragraph>
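    <Paragraph> The light-smoker comparison quoted above can be explored with a simple two-proportion test. The sketch below is illustrative only: it uses the counts quoted in the text, but the pooled z-test is our assumption, since the text does not say which test or adjustment produced the reported p-values; the computed proportions and p-values therefore need not match the quoted figures exactly. The name two_proportion_z is hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
import math
from statistics import NormalDist

# Light-smoker counts quoted in the text: (validated quitters, group size).
LIGHT_SMOKERS = {
    "tailored": (25, 563),
    "non-tailored": (31, 597),
    "no-letter": (16, 582),
}


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    for group, (quitters, size) in LIGHT_SMOKERS.items():
        print(f"{group:13s} {quitters}/{size} = {quitters / size:.1%}")
    z, p = two_proportion_z(*LIGHT_SMOKERS["non-tailored"], *LIGHT_SMOKERS["no-letter"])
    print(f"non-tailored vs no-letter: z = {z:.2f}, two-sided p = {p:.3f}")
    z, p = two_proportion_z(*LIGHT_SMOKERS["tailored"], *LIGHT_SMOKERS["non-tailored"])
    print(f"tailored vs non-tailored:  z = {z:.2f}, two-sided p = {p:.3f}")
```
]]></Paragraph>
    <Paragraph> A continuity-corrected or exact test, or a model adjusting for other covariates, would give somewhat different p-values from this unadjusted approximation.</Paragraph>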
    <Paragraph position="2"> There is some very weak evidence that the tailored letter may have been better than the non-tailored letter among smokers for whom quitting was especially difficult. For example, among discouraged smokers (people who wanted to quit but were not intending to quit, usually because they didn't think they could quit), cessation rates were 60% higher among recipients of tailored letters than recipients of non-tailored letters, but the numbers were too small to reach statistical significance, since (as with heavy smokers) very few such people managed to stop smoking. Furthermore, among heavy smokers, recipients of the tailored letter were 50% more likely than recipients of the non-tailored letters to show increased intention to quit (for example, say in their initial questionnaire that they did not intend to quit, but say in the followup questionnaire that they did intend to quit) (p=.059). It would be nice to test the hypothesis that tailored letters were effective among discouraged smokers or heavy smokers by running another clinical trial, but such a trial would need to be even bigger and more expensive than the STOP trial, in order to have enough validated quitters from these categories to make it possible to draw statistically significant conclusions.</Paragraph>
    <Paragraph position="3"> Recipients of the tailored letters were more likely than recipients of non-tailored letters to remember receiving the letter (67% vs 44%, significant at p &lt; .01), to have kept the letter (30% vs 19%, significant at p &lt; .01), and to make a free-text comment about the letter (20% vs 12%, significant at p &lt; .01). However, there was no statistically significant difference in perceptions of the usefulness and relevance of the tailored and non-tailored letters.</Paragraph>
    <Paragraph position="4"> Free-text comments on the tailored letters were varied, ranging from "I carried mine with me all the time and looked at it whenever I felt like giving in" to "I found it patronising . . . Smoking obviously impairs my physical health -- not my intelligence!" The most common complaint about content was that not enough information was given about practical 'how-to-stop-smoking' techniques. STOP's tailoring rules only included such information in about one third of the letters; this was in accordance with the well-established Stages of Change model of smoking cessation (Prochaska and diClemente, 1992). Note that all recipients of the non-tailored letter received such information. If practical advice was useful to more than one third of smokers, then the Stages-of-Change-based tailoring rules, which decided when to include such information, may have decreased rather than increased letter effectiveness.</Paragraph>
  </Section>
  <Section position="10" start_page="31" end_page="31" type="metho">
    <SectionTitle>
5 What Can be Learned from a Negative Result
</SectionTitle>
    <Paragraph position="0"> One of the remarkable things about the NLG, NLP, and indeed AI literatures is that little mention is made of experiments with negative results.</Paragraph>
    <Paragraph position="1"> In more established fields such as medicine and physics, papers which report negative experimental findings are common and are valued; but in NLP they are rare. It seems unlikely that NLP experiments always produce positive results (unless the experiments are badly designed and biased towards demonstrating the experimenter's desired outcome); what is probably happening is that people are choosing not to report negative results.</Paragraph>
    <Paragraph position="2"> One reason for this may be that it can be difficult to draw clear lessons from a negative result. In the case of STOP, for example, the clinical trial did not tell us why STOP failed. There are many possible reasons for the negative result, including: 1. Tailoring cannot have much effect. That is, if a smoker receives a letter from his/her doctor about smoking, then the content of the letter is only of secondary importance; the important thing is the fact of having received a communication from his/her doctor encouraging smoking cessation.</Paragraph>
    <Paragraph position="3"> 2. Tailoring could have an impact, but only if it was based on much more knowledge about the smoker's circumstances than is available via a 4-page multiple choice questionnaire.</Paragraph>
    <Paragraph position="4"> 3. Tailoring based on a multiple-choice questionnaire can work; we just didn't do it right in STOP, perhaps in part because we based our system on inappropriate theoretical models of smoking cessation.</Paragraph>
  </Section>
  <Section position="11" start_page="31" end_page="31" type="metho">
    <SectionTitle>
4. The STOP letters did in fact have an effect
</SectionTitle>
    <Paragraph position="0"> on some groups (such as heavy or discouraged smokers), but the clinical trial was too small to provide statistically significant evidence of this.</Paragraph>
    <Paragraph position="1"> In other words, did we fail because (1) what we were attempting could not work; (2) what we were attempting could only work if we had a lot more knowledge available to us; or (3) we built a poor system? Or (4) did the system actually work to some degree, but the evaluation didn't show this because it was too small? This is a key question for NLG researchers and developers (as opposed to doctors and health administrators who just want to know if they should use STOP as a black-box system), but the clinical trial does not distinguish between these possibilities.</Paragraph>
    <Paragraph position="2"> Arguments can be made for all four of the above possibilities. For example, we could argue for (1) on the basis that brief discussions about smoking with a doctor have about a 2% success rate (Law and Tang, 1995), and this may be an upper limit for the effectiveness of a brief letter from a doctor. If so, then letters cannot do much better than the 1.8% increase in cessation rates produced by the STOP non-tailored letter. Or we could argue for (2) by noting that when we asked smokers to comment on STOP letters in a small pilot study, many of their comments were very specific to their particular circumstances. For example, a single mother mentioned that a previous attempt to stop failed because of stress caused by dealing with a child's tantrum, and an older woman discussed the various stop-smoking techniques she had tried in the past and how they failed. Perhaps tailoring according to such specific circumstances would add value to letters; but such tailoring would require much more information than can be obtained from a 4-page multiple-choice questionnaire. We could also argue for (3) because there clearly are many ways in which the tailored letters could have been improved (such as having practical 'how-to-stop' tips in more letters, as mentioned at the end of Section 4); and for (4) on the basis of the weak evidence for this mentioned in Section 4.</Paragraph>
    <Paragraph position="3"> We do not know which of the above reasons was responsible for STOP's failure, so we cannot give clear lessons for future researchers or developers. This is perhaps true of many negative experimental results, and may be a reason why people do not publish them in the NLP community. Again there is perhaps a different attitude in the medical community, where papers describing experiments are taken as 'data points', and more theoretically minded researchers may look at a number of experimental papers and see what patterns and insights emerge from the collection as a whole. Under this perspective it is less important to state what lessons or insights can be drawn from a particular negative result; what matters is the overall pattern of positive and negative results in a group of related experiments. And like most such procedures, the process of inferring general rules from a collection of specific experimental results will work much better if it has access to both positive and negative examples; in other words, if researchers publish their failures as well as their successes.</Paragraph>
    <Paragraph position="4"> We believe that negative results are also important in NLG, NLP, and AI, even if it is not possible to draw straightforward lessons from them; and we hope that more such results are reported in the future.</Paragraph>
  </Section>
</Paper>