File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1429_metho.xml

Size: 29,467 bytes

Last Modified: 2025-10-06 14:07:31

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1429">
  <Title>Knowledge Acquisition for Natural Language Generation</Title>
  <Section position="4" start_page="0" end_page="217" type="metho">
    <SectionTitle>
2 Background: The STOP System
</SectionTitle>
    <Paragraph position="0"> The STOP system generates personalised smoking-cessation leaflets, based on the recipient's responses to a questionnaire about smoking beliefs, concerns, and experiences. STOP leaflets consist of four A5 pages, of which only the two inside pages are fully generated; an example of the inside pages of a STOP leaflet is shown in Figure 2. Internally, STOP is a fairly conventional shallow NLG system, with its main innovation being the processing used to control the length of leaflets (Reiter, 2000). STOP has been evaluated in a clinical trial, which compared cessation rates among smokers who received STOP leaflets; smokers who received a non-personalised leaflet with similar structure and appearance to a STOP leaflet; and smokers who did not receive any leaflet (but did fill out a questionnaire). Unfortunately, we cannot discuss the results of the clinical trial in this paper (our medical colleagues intend to publish a paper about the clinical trial in a medical journal, and have requested that we not publish anything about the results of the trial in a computing journal or conference until they have published in a medical journal). One of the research goals of the STOP project was to explore the use of expert-system knowledge acquisition techniques in building an NLG system. These knowledge acquisition sessions were primarily carried out with the following experts:</Paragraph>
    <Paragraph position="1">
 * three doctors (two general practitioners, one consultant in Thoracic Medicine)
 * one psychologist specialising in health information leaflets
 * one nurse
 None of these experts were paid for their time. We also did a small amount of KA with a (paid) graphic designer on layout and typography issues.</Paragraph>
    <Section position="1" start_page="217" end_page="217" type="sub_section">
      <SectionTitle>
2.1 Unusual Aspects of STOP from a KA Perspective
</SectionTitle>
      <Paragraph position="0"> KA research in the expert-system community has largely focused on applications such as medical diagnosis, where (1) there is a single correct solution, and (2) the task being automated is one currently done by a human expert. STOP is a different type of application in that (1) there are many possible leaflets which can be generated (and the system cannot tell which is best), and (2) no human currently writes personalised smoking-cessation leaflets (because manually writing such leaflets is too expensive). Point (2) in particular was repeatedly emphasised by the experts we worked with. The doctors and the nurse were experts on oral consultations with smokers, and the health psychologist was an expert on writing non-personalised health information leaflets, but none of them had experience writing personalised smoking-cessation leaflets.</Paragraph>
      <Paragraph position="1"> Many NLG systems have similar characteristics.</Paragraph>
      <Paragraph position="2"> The flexibility of language means that there are almost always many ways of communicating information and fulfilling communicative goals in a generated text; in other words, there are many possible texts that can be generated. Furthermore, while some synthesis tasks, such as configuration and scheduling, can be formalised as finding an optimal solution under a well-defined numerical evaluation function, this is difficult in NLG because of our poor understanding of how to computationally evaluate texts for effectiveness.</Paragraph>
      <Paragraph position="3"> With regard to human expertise, some NLG systems do indeed generate documents, such as weather reports and customer-service letters, which are currently written by humans. But many systems are similar to STOP in that they generate texts -- such as descriptions of software models (Lavoie et al., 1997) and customised descriptions of museum items (Oberlander et al., 1998) -- which are useful in principle but are not currently written by humans, perhaps because of cost or response-time issues.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="217" end_page="221" type="metho">
    <SectionTitle>
3 KA Techniques Used in STOP
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="217" end_page="218" type="sub_section">
      <SectionTitle>
3.1 Sorting
</SectionTitle>
      <Paragraph position="0"> Sorting is a standard KA technique for building up taxonomies. Experts in a sorting exercise are given a set of entities, and asked to divide the set into subsets, and 'think aloud' into a tape recorder as they do so.</Paragraph>
      <Paragraph position="1"> In STOP, we used sorting to build a classification of smokers. We started off with an initial classification which was motivated by the Stages of Change psychological theory (Prochaska and diClemente, 1992): this divided smokers into the three categories of Precontemplators (not intending to quit anytime soon), Contemplators (seriously considering quitting), and Preparers (definitely decided to quit, and soon). We wished to refine these categories, especially Precontemplator (which includes 67% of smokers in the Aberdeen area), and used sorting to do so. The basic exercise consisted of giving a doctor three sets of questionnaires (a set from Precontemplators; a set from Contemplators; and a set from Preparers), and asking him or her to subdivide each set into subsets. We repeated this exercise with three different doctors.</Paragraph>
      <Paragraph position="2"> The results of this exercise were complex, and the doctors were not in full agreement. After some analysis, we proposed to them that we subdivide all three categories on the basis of desire to quit. Precontemplators in particular would be divided up into people who neither want nor intend to quit (Committed Smokers); people who have mixed feelings about smoking but don't yet intend to quit (Classic Precontemplators); and people who would like to quit but aren't intending to quit, typically because they don't think they'll succeed (Lacks Confidence).</Paragraph>
      <Paragraph position="3"> The three doctors agreed that this was a reasonable subcategorisation, and we proceeded on this basis.</Paragraph>
      <Paragraph position="4"> In particular, we operationalised this categorisation as follows. We added the question 'Would you like to stop if it was easy' to the questionnaire. People who answered No were put into the 'Committed Smoker' category. For people who answered Not Sure or Yes, we looked at their decisional balance, that is the number of likes and dislikes they had about smoking, and placed them into Lacks Confidence if their dislikes clearly outnumbered their likes, and Classic Precontemplator otherwise.</Paragraph>
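The subcategorisation rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the STOP implementation: the names `Precontemplator`, `subcategory` and `CLEAR_MARGIN` are hypothetical, and the paper does not say how large a gap counts as dislikes 'clearly' outnumbering likes, so the margin of 2 is a placeholder.

```python
from dataclasses import dataclass

# Assumption: the paper gives no numeric threshold for "clearly outnumbered";
# a margin of 2 is a placeholder, not the value used in STOP.
CLEAR_MARGIN = 2


@dataclass
class Precontemplator:
    would_like_to_stop_if_easy: str  # "No", "Not Sure" or "Yes"
    likes: int                       # number of likes about smoking
    dislikes: int                    # number of dislikes about smoking


def subcategory(p: Precontemplator, margin: int = CLEAR_MARGIN) -> str:
    """Assign a Precontemplator to one of the three subcategories."""
    if p.would_like_to_stop_if_easy == "No":
        return "Committed Smoker"
    # "Not Sure" or "Yes": fall back on the decisional balance.
    if p.dislikes - p.likes >= margin:
        return "Lacks Confidence"
    return "Classic Precontemplator"


print(subcategory(Precontemplator("Yes", likes=1, dislikes=4)))
```

Keeping the margin as a parameter makes it easy to see how sensitive the split between Lacks Confidence and Classic Precontemplator is to that one choice.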
      <Paragraph position="5"> We defined different high-level schemas for each of these categories; these schemas essentially specified which sections and (in some cases) paragraphs should be included in the leaflet, but not the detailed content of individual paragraphs. Under these schemas, Committed Smokers got short non-argumentative letters which gently reminded smokers of some of the drawbacks of smoking, and suggested some sources of information if the smoker ever changed his/her mind; Classic Precontemplators got letters which focused on the drawbacks of smoking; and Lacks Confidence smokers got letters which focused on confidence-building and dealing with barriers to quitting (such as addiction or fear of weight gain). The example leaflet shown in Figure 2, incidentally, is for a Lacks Confidence smoker.</Paragraph>
      <Paragraph position="6"> After the clinical trial was underway, we attempted to partially evaluate the sorting-derived categories by doing a statistical analysis of the differences between smokers in the groups. In other words, we hypothesised that if our categories were correct in distinguishing different types of smokers, then we should observe differences in characteristics such as addiction and confidence between the groups. Of course, this is not an ideal evaluation because it does not test the hypothesis that the different classes of smokers we proposed should receive different types of leaflets; but this is a difficult hypothesis to test directly.</Paragraph>
      <Paragraph position="7"> In any case, our analysis suggested that the smokers in each group did indeed have different characteristics. However, it also suggested that we might have done as well (in terms of creating subgroups with different characteristics) by subcategorising purely on the Would you like to stop if it was easy question, and ignoring likes and dislikes about smoking. The analysis also suggested that it might have been useful to subcategorise on the basis of addiction, which we did not do. In fact during the sorting exercises the doctors did mention dividing into groups partially on the basis of the difficulty that individuals would have in quitting, but we did not implement this.</Paragraph>
      <Paragraph position="8"> The statistical analysis also suggested some ways of possibly improving the content schemas. For example, the analysis showed that the Committed Smoker category included many light smokers who probably smoked for social reasons; it might have been useful to specifically address this in the STOP leaflets ('quit now, before you become addicted').</Paragraph>
      <Paragraph position="9"> In retrospect, then, the sorting exercise was useful in proposing ideas about how to divide Stages of Change categories, and in suggesting new questions to ask smokers. However, the process of defining detailed category classification rules and content schemas would have benefited greatly from statistical data about smokers in our target region. In STOP we did not have such data until after the clinical trial had started (and smokers had returned their questionnaires), by which time the system could not be changed. So it would have been difficult to base smoker classification on statistical smoker data in STOP; but certainly we would recommend such an approach in projects where good data is available at the outset.</Paragraph>
    </Section>
    <Section position="2" start_page="218" end_page="219" type="sub_section">
      <SectionTitle>
3.2 Think-aloud Protocols
</SectionTitle>
      <Paragraph position="0"> The detailed content and phrasing of STOP letters was largely based on think-aloud example sessions with experts. In these sessions, health professionals would be given a questionnaire and asked to write a letter or leaflet for this person. They were also asked to 'think aloud' into a tape recorder while they did this, explaining their reasoning. Again this is a standard expert-system technique for KA.</Paragraph>
      <Paragraph position="1"> A simple example of the think-aloud process is as follows. One of the doctors wrote a letter for a smoker who had tried to quit before, and managed to stop for several weeks before starting again. The doctor made the following comments in the think-aloud transcript: Has he tried to stop smoking before? Yes, and the longest he has managed to stop - he has ticked the one week right up to three months and that's encouraging in that he has managed to stop at least once before, because it is always said that the people who have had one or two goes are more likely to succeed in the future.</Paragraph>
      <Paragraph position="2"> He also included the following paragraph in the letter that he wrote for this smoker: I see that you managed to stop smoking on one or two occasions before but have gone back to smoking, but you will be glad to know that this is very common and most people who finally stop smoking have had one or two attempts in the past before they finally succeed. What it does show is that you are capable of stopping even for a short period, and that means you are much more likely to be able to stop permanently than somebody who has never ever stopped smoking at all.</Paragraph>
      <Paragraph position="3"> After analysing this session, we proposed two rules:
 * IF (previous attempt to quit) THEN (message: more likely to succeed)
 * IF (previous attempt to quit) THEN (message: most people who quit have a few unsuccessful attempts first)
 The final system incorporated a rule (based on several KA sessions, not just the above one) that stated that if the smoker had tried to quit before, then the confidence-building section of the leaflet (which is only included for some smoker categories, see Section 3.1) should include a short message about previous attempts to quit. This message should mention length of previous cessation if this was greater than one week; otherwise, it should mention recency of previous attempt if this was within the past 6 months. The actual text generated from this rule in the example leaflet of Figure 2 is: Although you don't feel confident that you would be able to stop if you were to try, you have several things in your favour.</Paragraph>
      <Paragraph position="4"> * You have stopped before for more than a month.</Paragraph>
      <Paragraph position="5"> Note that the message (text) produced by the actual STOP code is considerably simpler than the text originally written by the expert. This is fairly common, as are simplifications in the logic used to decide whether to include a message in a leaflet or not. In some cases this is due to the expert having much more knowledge and expertise than the computer system (Reiter and Dale, 2000, pp 30-36). Consider, for example, the following extract from the same think-aloud session: The other thing I notice is that he lives in [Address] which I would suspect is quite a few floors up and that he is probably getting quite puffy on the stairs ... and if he gets more breathless he'll end up being a prisoner in his own house because he'll be able to get down, but he won't be able to get up again. This type of reasoning perhaps requires too much general 'world knowledge' about addresses, stairs, and breathlessness to be implementable in a computer system.</Paragraph>
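The previous-attempt rule described above (mention length of cessation if it exceeded one week, otherwise mention recency if the attempt was within the past 6 months) can be sketched as follows. This is a hedged illustration, not the STOP code: the function name, the parameter names, the use of days and months as units, and the returned labels are all hypothetical, and the 'plain' fallback for smokers who meet neither condition is an assumption, since the paper does not say what is generated in that case. The caller is assumed to have already checked that the leaflet contains a confidence-building section.

```python
from typing import Optional


def previous_attempt_message(tried_before: bool,
                             longest_stop_days: Optional[int],
                             months_since_last_attempt: Optional[float]) -> Optional[str]:
    """Pick the kind of previous-attempt message for the confidence section.

    Returns a label rather than leaflet wording, because the actual STOP
    phrasing is not given in full in the paper.
    """
    if not tried_before:
        return None  # no previous attempt, so no message
    # Prefer length of cessation when it was greater than one week.
    if longest_stop_days is not None and longest_stop_days > 7:
        return "mention-length"
    # Otherwise mention recency if the attempt was within the past 6 months.
    if months_since_last_attempt is not None and 6 >= months_since_last_attempt:
        return "mention-recency"
    # Assumption: a generic message when neither detail applies.
    return "plain"


print(previous_attempt_message(True, 35, 24.0))
```

A smoker who stopped for five weeks two years ago falls under the length condition, which matches the kind of message shown in the example leaflet.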
      <Paragraph position="6"> Afterwards, we attempted to partially evaluate the rules derived from think-aloud sessions by showing STOP leaflets to smokers and other smoking professionals, and asking for comments. The results were mixed. In terms of content, some smokers found the content of the leaflets to be useful and appropriate for them, but others said they would have liked to see different types of information. For example, STOP leaflets did not go into the medical details of smoking (as none of the think-aloud expert-written letters contained such information), and while this seemed like the right choice for many smokers, a few smokers did say that they would have liked to see more medical information about smoking. Reactions to style were also mixed. For example, based on KA sessions we adopted a positive tone and did not try to scare smokers; and again this seemed right for most smokers, but some smokers said that a more 'brutal' approach would be more effective for them. An issue which our experts (and other project members) disagreed on was whether leaflets should always use short and simple sentences, or whether sentence length and complexity should be varied depending on the characteristics of the smoker. In the STOP implementation we decided to always use moderately simple sentences, and not vary sentence complexity for different users. After the clinical trial started, we performed a small experiment to test this hypothesis. In this experiment, we took a computer-generated leaflet and asked one expert (who believed that short sentences with simple words should always be used) to revise the computer-generated leaflet to make it as easy to read as possible, and another expert (who believed that more complex sentences were sometimes appropriate, and such sentences could in some cases make letters seem friendlier and more understanding) to revise the computer-generated leaflet to make it friendlier and more understanding.
The revisions made by the experts were primarily microplanning ones (using NLG terminology) -- that is, aggregation, ellipsis, lexical choice, and syntactic choice. We then showed the two expert-revised leaflets to 20 smokers and asked them which they preferred. The smokers essentially split 50-50 on this question (8 preferred the easy-to-read leaflet, 9 preferred the friendly-understanding leaflet, 3 thought both were the same). This suggests that in principle it indeed may be useful to vary microplanning choices for different leaflet recipients. We hope to further investigate this issue in future research.</Paragraph>
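To make the '50-50' reading of the preference counts concrete, the following sketch runs an exact two-sided sign test on the 8 versus 9 split, leaving out the 3 smokers who saw no difference. The paper does not report any significance test, so this is purely an illustrative check; the function name is hypothetical and the choice of a sign test is an assumption.

```python
from math import comb


def sign_test_p_value(prefer_a: int, prefer_b: int) -> float:
    """Exact two-sided sign test under a 50/50 null; ties are left out by the caller."""
    n = prefer_a + prefer_b
    observed = comb(n, prefer_a)
    # Count every outcome that is no more likely than the observed one.
    extreme = sum(comb(n, i) for i in range(n + 1) if observed >= comb(n, i))
    return extreme / 2 ** n


# 8 smokers preferred the easy-to-read leaflet, 9 the friendly one.
print(sign_test_p_value(8, 9))
```

For the 8 versus 9 split this yields a p-value of 1.0, that is, the data give no evidence of an overall preference for either style, which is consistent with the authors' description of an even split.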
      <Paragraph position="7"> Overall, a general finding of the evaluation was that there were many kinds of variations (including whether to include detailed medical information, whether to adopt a 'positive' or 'brutal' tone, and how complex sentences should be) which were not performed by STOP but might have increased leaflet effectiveness if they had been performed. These types of variations were either not observed at all in the think-aloud sessions, or were observed in sessions with some experts but not others.</Paragraph>
      <Paragraph position="8"> In terms of KA methodology, perhaps the key lesson is similar to the one from the sorting sessions; the think-aloud KA sessions were very useful in suggesting ideas and hypotheses about STOP content and phrasing rules, but we should have used other information sources, such as smoker evaluations and small comparison experiments, to help refine and test these rules.</Paragraph>
    </Section>
    <Section position="3" start_page="219" end_page="220" type="sub_section">
      <SectionTitle>
3.3 Other techniques
</SectionTitle>
      <Paragraph position="0"> Some of the other KA techniques we tried are briefly described below. These had less influence on the system than the sorting and think-aloud exercises described above.</Paragraph>
      <Paragraph position="1">  We gave experts leaflets produced by the STOP system and asked them to critique and revise them. This was especially useful in suggesting local changes, such as what phrases or sentences should be used to communicate a particular message. For example, an early version of the STOP system used the phrase 'there are lots of good reasons for stopping'. One of the experts commented during a revision session that the phrasing should be changed to emphasise that the reasons listed (in this particular section of the STOP leaflet) were ones the smoker himself had selected in the questionnaire he filled out. This eventually led to the revised wording 'It is encouraging that you have many good reasons for stopping', which is in the first paragraph of the example leaflet in Figure 2.</Paragraph>
      <Paragraph position="2"> Revision was less useful in suggesting larger changes to the system, and after the clinical trial was underway, one of our experts commented that he might have been able to suggest larger changes if we had explained the system's reasoning to him, instead of just giving him a leaflet to revise. In other words, just as we asked experts to 'think-aloud' as they wrote leaflets, in order to understand their reasoning, it would be useful if we could give the experts something like the computer-system 'thinking aloud' as it produced a leaflet, so they could understand its reasoning.</Paragraph>
      <Paragraph position="3">  Because experts often disagreed, we tried a variety of activities where a group of experts either discussed or collaboratively authored a leaflet, in the hopes that this would help resolve or at least clarify conflicting opinions. This seemed to work best when we asked two experts to collaborate, and was less satisfactory with larger groups. Several experts commented that the larger (that is, more than 2-3 people) group sessions would have benefited from more structure and perhaps a professional facilitator.</Paragraph>
      <Paragraph position="4">  As mentioned in Section 3.2, we showed several smokers the leaflet STOP produced for them, and asked them to comment on the leaflet. In addition to its role as an evaluation exercise for other KA techniques, we hoped that these sessions would in themselves give us ideas for leaflet content and phrasing rules. This was again less successful than we had hoped. Part of the problem was the smokers knew very little about STOP (unlike our experts, who were all familiar with the project), and often made comments which were not useful for improving the system, such as: I did stop for 10 days til my daughter threw a wobbly and then I wanted a cigarette and bought some and after smoking for over 30 years I've tried acupuncture and hypnosis all to no avail.</Paragraph>
      <Paragraph position="5"> We were also concerned that most of our comments came from well-educated and articulate smokers (for example, university students). It was harder to get feedback from less well-educated smokers (for example, single mothers living in public housing estates). This led to the worry that the feedback we were getting was not representative of the population of smokers as a whole.</Paragraph>
      <Paragraph position="6"> Paragraph from Nov 97 KA exercise: Finally, if you do make an attempt to stop, you could consider using nicotine patches. For people like yourself who smoke 10-20 cigarettes per day, patches double your chances of success if you are determined to stop. You can get more information on patches from your local pharmacist or GP.
 Paragraph from Feb 99 KA exercise: You smoke 11-20 cigarettes a day, and smoke your first cigarette within 30 minutes of waking. These facts suggest you are moderately addicted to nicotine, and so you might get withdrawal symptoms for a time on stopping. It would be worth considering using nicotine patches when you stop; these double the chances of success for moderately heavy smokers such as yourself who make a determined attempt to stop smoking. Your pharmacist or GP can give you more information about this.</Paragraph>
      <Paragraph position="7">  Figure 1: ... the same smoker in different KA exercises</Paragraph>
    </Section>
    <Section position="4" start_page="220" end_page="221" type="sub_section">
      <SectionTitle>
3.4 KA and the Smoker Questionnaire
</SectionTitle>
      <Paragraph position="0"> KA sessions also affected the smoker questionnaire (STOP's input) as well as the text-generation component of the system. We started with an initial questionnaire which was largely based on a literature review of previous projects, and then modified it based on the information that experts used in KA sessions. For example, the original questionnaire asked people who had tried to quit before what quitting techniques they had used in their previous attempts. However, in KA sessions the experts seemed primarily interested in previous experiences with one particular technique, nicotine replacement (nicotine patches or gum); so we replaced the general question about previous quitting techniques with two questions which focused on experiences with nicotine replacement.
 4 Stability of Knowledge
 In order to determine how stable the results of KA sessions were, we asked one of our doctors to repeat in February 1999 a think-aloud exercise which he had originally done in November 1997. This exercise required examining and writing letters for two smokers. The letters and accompanying think-aloud from the 1999 exercise were somewhat different from the letters and think-aloud from the 1997 exercise; in very general terms, the 1999 letters had similar core content, but expressed the information differently, in a perhaps (this is very difficult to objectively measure) more 'empathetic' manner. An extract from this experiment is shown in Figure 1.</Paragraph>
      <Paragraph position="1"> We asked a group of seven smokers to compare one of the 1999 letters with the corresponding letter from the 1997 exercise. Five preferred the 1999 letter; one preferred the 1997 letter; one thought both were similar. Written comments from the smokers suggested that they found the 1999 letter friendlier and more understanding than the 1997 letter.</Paragraph>
      <Paragraph position="2"> In summary, it appears that our experts may themselves have been learning how to write effective smoking-cessation leaflets during the course of the STOP project. In retrospect this is perhaps not surprising given that none of them had written such leaflets before the project started.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="221" end_page="221" type="metho">
    <SectionTitle>
5 Evaluation of KA
</SectionTitle>
    <Paragraph position="0"> An issue that arose several times during the project was whether we could formally evaluate the effectiveness of KA techniques, in an analogous way to the manner in which we formally evaluated the effectiveness of STOP in a clinical trial which compared smoking-cessation rates in STOP and non-STOP groups. Unfortunately, it was not clear to us how to do this; how can one evaluate a development methodology such as KA? In principle, perhaps it might be possible to ask two groups to develop the same system, one using KA and one not, and then compare the effectiveness of the resultant systems (perhaps using a STOP-like clinical trial), and also engineering issues such as development cost and time-to-completion. This would be an expensive endeavour, however, as it would be necessary to pay for two development efforts. Also, the size of a clinical trial depends on the size of the effect it is trying to validate, and a clinical trial which compared (for example) the effectiveness of two kinds of computer-generated smoking-cessation leaflet might need to be substantially larger (and hence more expensive) than a clinical trial that tested the effectiveness of a computer-generated leaflet against a no-leaflet control group.</Paragraph>
    <Paragraph position="1"> An even more fundamental problem is that in order for such an experiment to produce meaningful results, it would be necessary to control for differences in skill, expertise, enthusiasm, and luck between the development teams. It might be necessary to repeat this exercise several times, perhaps randomly choosing which development team will use KA and which will not. Of course, repeating the experiment N times will increase the total cost by a factor of N.</Paragraph>
    <Paragraph position="2"> As we did not have the resources to do the above, we elected instead to focus on the smaller 'informal' evaluations described above. We also conducted a small experiment where we asked a group of five smoking-cessation counsellors to compare leaflets produced by an early prototype of STOP with leaflets produced by the system used in the clinical trial.</Paragraph>
    <Paragraph position="3"> 60% of the counsellors thought the clinical trial system's leaflets were more likely to be effective, with the other 40% thinking the two systems produced letters of equal effectiveness. This suggests (although does not prove) that the development effort behind the clinical trial system improved leaflet effectiveness. However, we cannot determine how much of the improvement was due to KA and how much was due to other development activities.</Paragraph>
  </Section>
  <Section position="7" start_page="221" end_page="221" type="metho">
    <SectionTitle>
6 Current Work
</SectionTitle>
    <Paragraph position="0"> We are currently in the process of analysing the results of the clinical trial (which we cannot discuss in this paper), to see if it sheds any light on the effectiveness of KA. This is not straightforward because the clinical trial was not designed to give feedback about KA, but there nevertheless seem to be some useful lessons here, which we hope to report in subsequent publications.</Paragraph>
    <Paragraph position="1"> We also are applying the KA techniques used in STOP to a project in a different domain, to see how domain-dependent our findings are. A first attempt to do this, in a domain which involved giving advice to university students, failed because the relevant expert, who initially seemed very enthusiastic, did not give us enough time for KA. This highlights the practical observation that KA requires a substantial amount of time from the expert(s), who must either be paid or otherwise motivated to participate in the sessions. In this case we could not pay the expert, but instead tried to motivate him by pointing out that a successful system would be useful to him in his job; this was not in the end sufficient motivation to get the expert to make time for KA in his (busy) schedule.</Paragraph>
    <Paragraph position="2"> After the above failure we switched to another domain, giving feedback to adults who are taking basic literacy courses. In this domain, we are working with a company, Cambridge Training and Development, which is paying experts for their time when appropriate. This work is currently in progress. One interesting KA idea which has already emerged from this work is observing tutors working with students (we did not in STOP observe doctors discussing smoking with their patients); this is similar to the ethnographic techniques suggested by Forsythe (1995).</Paragraph>
  </Section>
</Paper>