<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1007"> <Title>experiments in natural language generation for intelligent tutoring systems</Title> <Section position="4" start_page="0" end_page="52" type="metho"> <SectionTitle> 2 Natural Language Generation for DIAG </SectionTitle> <Paragraph position="0"> DIAG (Towne, 1997) is a shell to build ITSs based on interactive graphical models that teach students to troubleshoot complex systems such as home heating and circuitry. A DIAG application presents a student with a series of troubleshooting problems of increasing difficulty. The student tests indicators and tries to infer which faulty part (RU) may cause the abnormal states detected via the indicator readings. RU stands for replaceable unit, because the only course of action for the student to fix the problem is to replace faulty components in the graphical simulation.</Paragraph> <Paragraph position="1"> Fig. 1 shows the furnace, one subsystem of the home heating system in our DIAG application. Fig. 1 includes indicators such as the gauge labeled Water Temperature, RUs, and complex modules (e.g., the Oil Burner) that contain indicators and RUs. Complex components are zoomable.</Paragraph> <Paragraph position="2"> At any point, the student can consult the tutor via the Consult menu (cf. the Consult button in Fig. 1). There are two main types of queries: ConsultInd(icator) and ConsultRU. ConsultInd queries are used mainly when an indicator shows an abnormal reading, to obtain a hint regarding which RUs may cause the problem. DIAG discusses the RUs that should be most suspected given the symptoms the student has already observed. ConsultRU queries are mainly used to obtain feedback on the diagnosis that a certain RU is faulty. 
DIAG responds with an assessment of that diagnosis and provides evidence for it in terms of the symptoms that have been observed relative to that RU.</Paragraph> <Paragraph position="3"> The original DIAG system (DIAG-orig) uses very simple templates to assemble the text to present to the student. The top parts of Figs. 2 and 3 show the replies provided by DIAG-orig to a ConsultInd on the Visual Combustion Check, and to a ConsultRU on the Water Pump.</Paragraph> <Paragraph position="4"> The highly repetitive feedback by DIAG-orig screams for improvements based on aggregation techniques. Our goal in developing DIAG-NLP1 and DIAG-NLP2 was to assess whether simple, rapidly deployable NLG techniques would lead to measurable improvements in the student's learning.</Paragraph> <Paragraph position="5"> Thus, in both cases it is still DIAG that performs content determination, and provides to DIAG-NLP1 and DIAG-NLP2 a file in which the facts to be communicated are written; a fact is the basic unit of information that underlies each of the clauses in a reply by DIAG-orig. The only aspect of the interaction between student and system that we altered is the language presented in the output window. In DIAG-NLP1 we mostly explored using syntactic aggregation to improve the feedback, whereas DIAG-NLP2 is corpus-based and focuses on functional aggregation. In both DIAG-NLP1 and DIAG-NLP2, we use EXEMPLARS (White and Caldwell, 1998), an object-oriented, rule-based generator. The rules (called exemplars) are meant to capture an exemplary way of achieving a communicative goal in a given context. 
EXEMPLARS selects rules by traversing the exemplar specialization hierarchy and evaluating the applicability conditions associated with each exemplar.</Paragraph> <Paragraph position="6"> The visual combustion check is igniting which is abnormal (normal is combusting).</Paragraph> <Paragraph position="7"> Oil Nozzle always produces this abnormality when it fails.</Paragraph> <Section position="1" start_page="50" end_page="50" type="sub_section"> <SectionTitle> Oil Supply Valve always </SectionTitle> <Paragraph position="0"> produces this abnormality when it fails.</Paragraph> <Paragraph position="1"> Oil pump always produces this abnormality when it fails. Oil Filter always produces this abnormality when it fails.</Paragraph> </Section> <Section position="2" start_page="50" end_page="51" type="sub_section"> <SectionTitle> System Control Module sometimes </SectionTitle> <Paragraph position="0"> produces this abnormality when it fails.</Paragraph> <Paragraph position="1"> Ignitor Assembly never produces this abnormality when it fails. Burner Motor always produces this abnormality when it fails. The visual combustion check indicator is igniting. This is abnormal.</Paragraph> <Paragraph position="2"> Normal is combusting.</Paragraph> <Paragraph position="3"> Within the furnace system, this is sometimes caused if the System Control Module has failed. Within the Oil Burner this is never caused if the Ignitor Assembly has failed.</Paragraph> <Paragraph position="4"> In contrast, this is always caused if the Burner Motor, Oil Filter, Oil Pump, Oil Supply Valve, or Oil Nozzle has failed. The combustion is abnormal.</Paragraph> <Paragraph position="5"> In the oil burner, check the units along the path of the oil and the burner motor.</Paragraph> <Paragraph position="6"> Water pump is a very poor suspect. Some symptoms you have seen conflict with that theory. 
Water pump sound was normal.</Paragraph> <Paragraph position="7"> This normal indication never results when this unit fails. Visual combustion check was igniting. This abnormal indication never results when this unit fails. Burner Motor RMP Gauge was 525.</Paragraph> <Paragraph position="8"> This normal indication never results when this unit fails. The Water pump is a very poor suspect. Some symptoms you have seen conflict with that theory. The following indicators never display normally when this unit fails.</Paragraph> <Paragraph position="9"> Within the furnace system, the Burner Motor RMP Gauge is 525.</Paragraph> <Paragraph position="10"> Within the water pump and safety cutoff valve, the water pump sound indicator is normal. The following indicators never display abnormally when this unit fails.</Paragraph> <Paragraph position="11"> Within the fire door sight hole, the visual combustion check indicator is igniting. The water pump is a poor suspect since the water pump sound is ok.</Paragraph> <Paragraph position="12"> You have seen that the combustion is abnormal. Check the units along the path of the oil and the electrical and Mellish, 1998; Shaw, 2002) and what we call structural aggregation, namely, grouping parts according to the structure of the system; (ii) generates some referring expressions; (iii) models a few rhetorical relations; and (iv) improves the format of the output.</Paragraph> <Paragraph position="13"> The middle parts of Figs. 2 and 3 show the revised output produced by DIAG-NLP1. E.g., in Fig. 2 the RUs of interest are grouped by the system modules that contain them (Oil Burner and Furnace System), and by the likelihood that a certain RU causes the observed symptoms. 
In contrast to the original answer, the revised answer highlights that the Ignitor Assembly cannot cause the symptom.</Paragraph> <Paragraph position="14"> In DIAG-NLP1, EXEMPLARS accesses the SNePS Knowledge Representation and Reasoning System (Shapiro, 2000) for static domain information (footnote 2). SNePS makes it easy to recognize structural similarities and use shared structures. Using SNePS, we can examine the dimensional structure of an aggregation and its values to give preference to aggregations with top-level dimensions that have fewer values, to give summary statements when a dimension has many values that are reported on, and to introduce simple text structuring in terms of rhetorical relations, inserting relations like contrast and concession to highlight distinctions between dimensional values (see Fig. 2, middle). (Footnote 1: DIAG-NLP1 actually augments and refines the first feedback generator we created for DIAG, DIAG-NLP0 (Di Eugenio et al., 2002). DIAG-NLP0 only covered (i) and (iv).) (Footnote 2: In DIAG, domain knowledge is hidden and hardly accessible. Thus, in both DIAG-NLP1 and DIAG-NLP2 we had to build a small knowledge base that contains domain knowledge.)</Paragraph> <Paragraph position="15"> DIAG-NLP1 uses the GNOME algorithm (Kibble and Power, 2000) to generate referential expressions. Importantly, using SNePS, propositions can be treated as discourse entities, added to the discourse model and referred to (see This is ... caused if ... in Fig. 2, middle). Information about lexical realization and the choice of referring expression is encoded in the appropriate exemplars.</Paragraph> </Section> <Section position="3" start_page="51" end_page="52" type="sub_section"> <SectionTitle> 2.2 DIAG-NLP2: functional aggregation </SectionTitle> <Paragraph position="0"> In the interest of rapid prototyping, DIAG-NLP1 was implemented without the benefit of a corpus study. DIAG-NLP2 is the empirically grounded version of the feedback generator. 
We collected 23 tutoring interactions between a student using the DIAG tutor on home heating and two human tutors, for a total of 272 tutor turns, of which 235 are in reply to ConsultRU and 37 in reply to ConsultInd (the type of student query is automatically logged). The tutor and the student are in different rooms, sharing images of the same DIAG tutoring screen. When the student consults DIAG, the tutor sees, in tabular form, the information that DIAG would use in generating its advice -- the same &quot;fact file&quot; that DIAG gives to DIAG-NLP1 and DIAG-NLP2 -- and types a response that substitutes for DIAG's. The tutor is presented with this information because we wanted to uncover empirical evidence for aggregation rules in our domain. Although we cannot constrain the tutor to mention only the facts that DIAG would have communicated, we can analyze how the tutor uses the information provided by DIAG.</Paragraph> <Paragraph position="1"> We developed a coding scheme (Glass et al., 2002) and annotated the data. As the annotation was performed by a single coder, we lack measures of intercoder reliability. Thus, what follows should be taken as observations rather than as rigorous findings - useful observations they clearly are, since DIAG-NLP2 is based on these observations and its language fosters the most learning.</Paragraph> <Paragraph position="2"> Our coding scheme focuses on four areas. Fig. 4 shows examples of some of the tags (the SCM is the System Control Module). Each tag has from one to five additional attributes (not shown) that need to be annotated too.</Paragraph> <Paragraph position="3"> Domain ontology. We tag objects in the domain with their class (indicator or RU) and their states, denoted by indication and operationality, respectively. Tutoring actions. They include (i) Judgment. The tutor evaluates what the student did. 
(ii) Problem solving. The tutor suggests the next course of action. (iii) The tutor imparts Domain Knowledge.</Paragraph> <Paragraph position="4"> Aggregation. Objects may be functional aggregates, such as the oil burner, which is a system component that includes other components; linguistic aggregates, which include plurals and conjunctions; or a summary over several unspecified indicators or RUs. Functional/linguistic aggregate and summary tags often co-occur, as shown in Fig. 4.</Paragraph> <Paragraph position="5"> Relation to DIAG's output. Contrary to all other tags, in this case we annotate the input that DIAG gave the tutor. We tag its portions as included / excluded / contradicted, according to how it has been dealt with by the tutor.</Paragraph> <Paragraph position="6"> Tutors provide explicit problem solving directions in 73% of the replies, and evaluate the student's action in 45% of the replies (clearly, they do both in 28% of the replies, as in Fig. 4). As expected, they are much more concise than DIAG, e.g., they never mention RUs that cannot or are not as likely to cause a certain problem, such as, respectively, the ignitor assembly and the SCM in Fig. 2.</Paragraph> <Paragraph position="7"> As regards aggregation, 101 out of 551 RUs, i.e.</Paragraph> <Paragraph position="8"> 18%, are labelled as summary; 38 out of 193 indicators, i.e. 20%, are labelled as summary. These percentages, though seemingly low, represent a considerable amount of aggregation, since in our domain some items have very little in common with others, and hence cannot be aggregated. Further, tutors aggregate parts functionally rather than syntactically. 
For example, the same assemblage of parts, i.e., oil nozzle, supply valve, pump, filter, etc., can be described as the other items on the fuel line or as the path of the oil flow.</Paragraph> <Paragraph position="9"> Finally, directness - an attribute on the indicator tag - encodes whether the tutor explicitly talks about the indicator (e.g., The water temperature gauge reading is low), or implicitly via the object to which the indicator refers (e.g., the water is too cold). 110 out of 193 indicators, i.e. 57%, are marked as implicit, 45, i.e. 41%, as explicit, and 2% are not marked for directness (the coder was free to leave attributes unmarked). This, and the 137 occurrences of indication, prompted us to refer to objects and their states, rather than to indicators (as implemented by Steps 2 in Fig. 5, and 2(b)i, 3(b)i, 3(c)i in Fig. 6, which generate The combustion is abnormal and The water pump sound is OK in Figs. 2 and 3).</Paragraph> </Section> <Section position="4" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 2.3 Feedback Generation in DIAG-NLP2 </SectionTitle> <Paragraph position="0"> In DIAG-NLP1 the fact file provided by DIAG is directly processed by EXEMPLARS. In contrast, in DIAG-NLP2 a planning module manipulates the information before passing it to EXEMPLARS. This module decides which information to include according to the type of query the system is responding to, and produces one or more Sentence Structure objects. These are then passed to EXEMPLARS that transforms them into Deep Syntactic Structures.</Paragraph> <Paragraph position="1"> Then, a sentence realizer, RealPro (Lavoie and Rambow, 1997), transforms them into English sentences.</Paragraph> <Paragraph position="2"> Figs. 5 and 6 show the control flow in DIAG-NLP2 for feedback generation for ConsultInd and ConsultRU. Step 3a in Fig. 5 chooses, among all the RUs that DIAG would talk about, only those that would definitely result in the observed symptom. 
Step 2 in the AGGREGATE procedure in Fig. 5 uses a simple heuristic to decide whether and how to use functional aggregation. A table lists, for each RU, its possible aggregators and, for each aggregator, the number n of units it covers (e.g., electrical devices covers 4 RUs: ignitor, photoelectric cell, transformer and burner motor). Suppose a group of REL-RUs contains k units that a certain aggregator Agg covers: if k &lt; n/2, Agg will not be used; if n/2 ≤ k &lt; n, Agg preceded by &quot;some of&quot; will be used; if k = n, Agg will be used.</Paragraph> <Paragraph position="3"> DIAG-NLP2 does not use SNePS, but a relational database storing relations, such as the IS-A hierarchy (e.g., burner motor IS-A RU), information about referents of indicators (e.g., room temperature gauge REFERS-TO room), and correlations between RUs and the indicators they affect.</Paragraph> </Section> </Section> <Section position="5" start_page="52" end_page="53" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"> Our empirical evaluation is a three-group, between-subject study: one group interacts with DIAG-orig, [judgment [replaceable-unit the ignitor] is a poor suspect] since [indication combustion is working] during startup. The problem is that the SCM is shutting the system off during heating.</Paragraph> <Paragraph position="1"> [domain-knowledge The SCM reads [summary [linguistic-aggregate input signals from sensors]] and uses the signals to determine how to control the system.] 1. IND ← queried indicator 2. Mention the referent of IND and its state 3. IF IND reads abnormal, (a) REL-RUs ← choose relevant RUs (b) AGGR-RUs ← AGGREGATE(REL-RUs) (c) Suggest to check AGGR-RUs AGGREGATE(RUs) 1. Partition REL-RUs into subsets by system structure 2. Apply functional aggregation to subsets neering majors affiliated with our university. Each subject read some short material about home heating, went through one trial problem, then continued through the curriculum on his/her own. 
The curriculum consisted of three problems of increasing difficulty. As there was no time limit, every student solved every problem. Reading materials and curriculum were identical in the three conditions. While a subject was interacting with the system, a log was collected including, for each problem: whether the problem was solved; total time, and time spent reading feedback; how many and which indicators and RUs the subject consults DIAG about; how many, and which RUs the subject replaces. We will refer to all the measures that were automatically collected as performance measures.</Paragraph> <Paragraph position="2"> At the end of the experiment, each subject was administered a questionnaire divided into three parts. The first part (the posttest) consists of three questions and tests what the student learned about the domain. The second part concerns whether subjects remember their actions, specifically, the RUs they replaced. We quantify the subjects' recollections in terms of precision and recall with respect to the log that the system collects. We expect precision and recall of the replaced RUs to correlate with transfer, namely, to predict how well a subject is able to apply what s/he learnt about diagnosing malfunctions 1. RU ← queried RU; REL-IND ← indicator associated to RU 2. IF RU warrants suspicion, (a) state RU is a suspect (b) IF student knows that REL-IND is abnormal i. remind him of referent of REL-IND and its abnormal state ii. suggest to replace RU (c) ELSE suggest to check REL-IND 3. ELSE (a) state RU is not a suspect (b) IF student knows that REL-IND is normal i. use referent of REL-IND and its normal state to justify judgment (c) IF student knows of abnormal indicators OTHER-INDs i. remind him of referents of OTHER-INDs and their abnormal states ii. FOR each OTHER-IND A. REL-RUs ← RUs associated with OTHER-IND B. 
AGGR-RUs ← AGGREGATE(REL-RUs)</Paragraph> </Section> <Section position="6" start_page="53" end_page="55" type="metho"> <SectionTitle> ConsultRU </SectionTitle> <Paragraph position="0"> to new problems. The third part concerns usability, to be discussed below.</Paragraph> <Paragraph position="1"> We found that subjects who used DIAG-NLP2 had significantly higher scores on the posttest, and were significantly more correct (higher precision) in remembering what they did. As regards performance measures, the results are less clear-cut. As regards usability, subjects prefer DIAG-NLP1/2 to DIAG-orig; however, results are mixed as regards which of the two they actually prefer.</Paragraph> <Paragraph position="2"> In the tables that follow, boldface indicates significant differences, as determined by an analysis of variance (ANOVA) followed by post-hoc Tukey tests.</Paragraph> <Paragraph position="3"> Table 1 reports learning measures, averaged across the three problems. DIAG-NLP2 is significantly PostTest3 is illustrated in Fig. 7. Scores in DIAG-NLP2 are always higher, significantly so on questions 2 and 3 (F = 8.481, p = 0.000, and F = 7.909, p = 0.001), and marginally so on question 1</Paragraph> <Paragraph position="5"> tive across the three problems, other than average reading times. Subjects don't differ significantly in the time they spend solving the problems, or in the number of RU replacements they perform. DIAG's assumption (known to the subjects) is that there is only one broken RU per problem, but the simulation allows subjects to replace as many as they want without any penalty before they come to the correct solution. The trend on RU replacements is opposite what we would have hoped for: when repairing a fewer parts in DIAG-orig.</Paragraph> <Paragraph position="6"> The next four entries in Table 2 report the number of queries that subjects ask, and the average time it takes subjects to read the feedback. 
The subjects ask significantly fewer ConsultInd queries in DIAG-NLP1 (F = 8.905, p = 0.000), and take significantly less time reading ConsultInd feedback in DIAG-NLP2 (F = 15.266, p = 0.000). The latter result is not surprising, since the feedback in DIAG-NLP2 is much shorter than in DIAG-orig and DIAG-NLP1.</Paragraph> <Paragraph position="7"> Neither the reason nor the significance of subjects asking many fewer ConsultInd queries of DIAG-NLP1 is apparent to us; it happens for ConsultRU as well, to a lesser, non-significant degree.</Paragraph> <Paragraph position="8"> We also collected usability measures. Although these are not usually reported in ITS evaluations, in a real setting students should be more willing to sit down with a system that they perceive as more friendly and usable. Subjects rate the system along four dimensions on a five-point scale: clarity, usefulness, repetitiveness, and whether it ever misled them (the scale is appropriately arranged: the highest clarity but the lowest repetitiveness receive 5 points). There are no significant differences on individual dimensions. Cumulatively, DIAG-NLP2 (at 15.08) slightly outperforms the other two (DIAG-orig at 14.68 and DIAG-NLP1 at 14.32); however, the difference is not significant (highest possible rating is is generated by the system they just worked with, the second is generated by one of the other two systems. Subjects say which version they prefer, and why (they can judge the system along one or more of four dimensions: natural, concise, clear, contentful). The first two lines in Table 3 show that subjects prefer the NLP systems to DIAG-orig (marginally significant, χ² = 9.49, p &lt; 0.1). DIAG-NLP1 and DIAG-NLP2 receive the same number of preferences; however, a more detailed analysis (Table 4) shows that subjects prefer DIAG-NLP1 for feedback to ConsultInd, but DIAG-NLP2 for feedback to ConsultRU (marginally significant, χ² = 5.6, p &lt; 0.1). 
Finally, subjects find DIAG-NLP2 more natural, but DIAG-NLP1 more contentful (Table 5,</Paragraph> <Paragraph position="10"/> </Section> </Paper>