File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/n06-1034_metho.xml
Size: 18,655 bytes
Last Modified: 2025-10-06 14:10:10
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1034"> <Title>Modelling User Satisfaction and Student Learning in a Spoken Dialogue Tutoring System with Generic, Tutoring, and User Affect Parameters</Title> <Section position="4" start_page="265" end_page="265" type="metho"> <SectionTitle> ALMOST ALWAYS (5), OFTEN (4), SOMETIMES (3), RARELY (2), ALMOST NEVER (1) </SectionTitle> <Paragraph position="0"/> </Section> <Section position="5" start_page="265" end_page="266" type="metho"> <SectionTitle> 3 Interaction Parameters </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="265" end_page="265" type="sub_section"> <SectionTitle> 3.1 Dialogue System-Generic Parameters </SectionTitle> <Paragraph position="0"> Prior PARADISE applications predicted user satisfaction using a wide range of system-generic parameters, which include measures of speech recognition quality (e.g. word error rate), measures of dialogue communication and efficiency (e.g. total turns and elapsed time), and measures of task completion (e.g. a binary representation of whether the task was completed) (Möller, 2005a; Möller, 2005b; Walker et al., 2002; Bonneau-Maynard et al., 2000; Walker et al., 2000; Walker et al., 1997).</Paragraph> <Paragraph position="1"> In this prior work, each dialogue between user and system represents a single task (e.g., booking airline travel); thus these measures are calculated on a per-dialogue basis.</Paragraph> <Paragraph position="2"> In our work, the entire tutoring session represents a single task, and every student in our corpora completed this task. Thus we extract 13 system-generic parameters on a per-student basis, i.e. over the 5 dialogues for each user, yielding a single parameter value for each student in our 3 corpora.</Paragraph> <Paragraph position="3"> First, we extracted 9 parameters representing dialogue communication and efficiency. 
Of these parameters, 7 were used in prior PARADISE applications: Time on Task, Total ITSPOKE Turns and Words, Total User Turns and Words, Average ITSPOKE Words/Turn, and Average User Words/Turn.</Paragraph> <Paragraph position="4"> Our 2 additional communication-related (Möller, 2005a) parameters measure system-user interactivity, but were not used in prior work (to our knowledge): Ratio of User Words to ITSPOKE Words, Ratio of User Turns to ITSPOKE Turns.</Paragraph> <Paragraph position="5"> Second, we extracted 4 parameters representing speech recognition quality, which have also been used in prior work: Word Error Rate, Concept Accuracy, Total Timeouts, Total Rejections (footnote 2).</Paragraph> </Section> <Section position="2" start_page="265" end_page="266" type="sub_section"> <SectionTitle> 3.2 Tutoring-Specific Parameters </SectionTitle> <Paragraph position="0"> Although prior PARADISE applications tend to use system-generic parameters, we hypothesize that task-specific parameters may also prove useful for predicting performance. We extract 12 tutoring-specific parameters over the 5 dialogues for each student, yielding a single parameter value per student, for each student in our 3 corpora. Although these parameters are specific to our tutoring system, similar parameters are available in other tutoring systems.</Paragraph> <Paragraph position="1"> Footnote 2: A Timeout occurs when ITSPOKE does not hear speech by a pre-specified time interval. A Rejection occurs when ITSPOKE's confidence score for its ASR output is too low.</Paragraph> <Paragraph position="2"> First, we hypothesize that the correctness of the students' turns with respect to the tutoring topic (physics, in our case) may play a role in predicting system performance. Each of our student turns is automatically labeled with 1 of 3 Correctness labels by the ITSPOKE semantic understanding component: Correct, Incorrect, Partially Correct. Labeled examples are shown in Figure 1.</Paragraph> </Section> </Section>
<Section position="6" start_page="266" end_page="266" type="metho"> <SectionTitle/> <Paragraph position="0"> From these 3 Correctness labels, we derive 9 parameters: a Total and a Percent for each label, and a Ratio of each label to every other label (e.g. Correct/Incorrect).</Paragraph> <Paragraph position="1"> Second, students write and then may modify their physics essay at least once during each dialogue with ITSPOKE. We thus hypothesize that, like Correctness, the total number of essays per student may play a role in predicting system performance.</Paragraph> <Paragraph position="2"> Finally, although student test scores before/after using ITSPOKE will be used as our student learning metric, we hypothesize that these scores may also play a role in predicting user satisfaction.</Paragraph> <Section position="1" start_page="266" end_page="266" type="sub_section"> <SectionTitle> 3.3 User Affect Parameters </SectionTitle> <Paragraph position="0"> We hypothesize that user affect plays a role in predicting user satisfaction and student learning. Although affect parameters have not been used in other PARADISE studies (to our knowledge), they are generic; for example, in various spoken dialogue systems, user affect has been annotated and automatically predicted from, e.g., acoustic-prosodic and lexical features (Litman and Forbes-Riley, 2004b; Lee et al., 2002; Ang et al., 2002; Batliner et al., 2003).</Paragraph> <Paragraph position="1"> As part of a larger investigation into emotion adaptation, we are manually annotating the student turns in our corpora for affective state. Currently, we are labeling 1 of 4 states of Certainness: certain, uncertain, neutral, mixed (certain and uncertain), and we are separately labeling 1 of 2 states of Frustration/Anger: frustrated/angry, non-frustrated/angry. 
These affective states were found in pilot studies to be most prevalent in our tutoring dialogues, and are also of interest in other dialogue research, e.g. tutoring (Bhatt et al., 2004; Moore et al., 2004; Pon-Barry et al., 2004) and spoken dialogue (Ang et al., 2002). Labeled examples are shown in Figure 1. To date, one paid annotator has labeled all student turns in our SYN03 corpus, and all the turns of 17 students in our PR05 corpus. From these labels, we derived 25 User Affect parameters per student, over the 5 dialogues for that student. First, for each Certainness label, we computed a Total, a Percent, and a Ratio to each other label. We also computed a Total for each sequence of identical Certainness labels (e.g. Certain:Certain), hypothesizing that states maintained over multiple turns may have more impact on performance than single occurrences. Second, we computed the same parameters for each Frustration/Anger label.</Paragraph> </Section> </Section> <Section position="7" start_page="266" end_page="269" type="metho"> <SectionTitle> 4 Prediction Models </SectionTitle> <Paragraph position="0"> In this section, we first investigate the usefulness of our system-generic and tutoring-specific parameters for training models of user satisfaction and student learning in our tutoring corpora with the PARADISE framework. We use the SPSS statistical package with a stepwise multivariate linear regression procedure to automatically determine parameter inclusion in the model. We then investigate how well these models generalize across different user-system configurations, by testing the models in different corpora and corpus subsets. 
Finally, we investigate whether generic user affect parameters increase the usefulness of our student learning models.</Paragraph> <Section position="1" start_page="266" end_page="268" type="sub_section"> <SectionTitle> 4.1 Prediction Models of User Satisfaction </SectionTitle> <Paragraph position="0"> Only subjects in the PR05 and SYN05 corpora completed a user survey (Table 1). Each student's responses were summed to yield a single user satisfaction total per student, ranging from 9 to 24 across corpora (the possible range is 5 to 25), with no difference between corpora (p = .46). This total was used as our user satisfaction metric, as in (Möller, 2005b; Walker et al., 2002; Walker et al., 2000).</Paragraph> <Paragraph position="1"> We trained a user satisfaction model on each corpus, then tested it on the other corpus. In addition, we split each corpus in half randomly, then trained a user satisfaction model on each half, and tested it on the other half. We hypothesized that despite the decrease in the dataset size, models trained and tested in the same corpus would have higher generalizability than models trained on one corpus and tested on the other, due to the increased data homogeneity within each corpus, since each corpus used a different ITSPOKE version. As predictors, we used only the 13 system-generic and 12 tutoring-specific parameters that were available for all subjects.</Paragraph> <Paragraph position="2"> Results are shown in Table 2. The first and fourth columns show the training and test data, respectively. The second and fifth columns show the user satisfaction variance accounted for by the trained model in the training and test data, respectively. 
The third column shows the parameters that were selected as predictors of user satisfaction in the trained model, ordered by degree of contribution (footnote 9).</Paragraph> <Paragraph position="3"> For example, as shown in the first row, the model trained on the PR05 corpus uses Total Incorrect student turns as the strongest predictor of user satisfaction, followed by Total Essays; these parameters are not highly correlated (footnote 10). This model accounts for 27.4% of the user satisfaction variance in the PR05 corpus. When tested on the SYN05 corpus, it accounts for 0.1% of the user satisfaction variance. Footnote 9: The ordering reflects the standardized coefficients (beta weights), which are computed in SPSS based on scaling of the input parameters, to enable an assessment of the predictive power of each parameter relative to the others in a model. Footnote 10: Hereafter, predictors in a model are not highly correlated (R ≥ .70) unless noted. Linear regression does not assume that predictors are independent, only that they are not highly correlated. Because correlations above R = .70 can affect the coefficients, deletion of redundant predictors may be advisable.</Paragraph> <Paragraph position="4"> The low R² values for both training and testing in the first two rows show that neither corpus yields a very powerful model of user satisfaction even in the training corpus, and this model does not generalize very well to the test corpus. As hypothesized, training and testing in a single corpus yields higher R² values for testing, as shown in the last four rows, although these models still account for less than a quarter of the variance in the test data. 
The increased R² values for training here may indicate over-fitting.</Paragraph> <Paragraph position="5"> Across all 6 experiments, there is almost no overlap of parameters used to predict user satisfaction.</Paragraph> <Paragraph position="6"> Overall, these results show that this method of developing an ITSPOKE user satisfaction model is very sensitive to changes in training data; this was also found in other PARADISE applications (Möller, 2005b; Walker et al., 2000). Some applications have also reported similarly low R² values for testing both within a corpus (Möller, 2005b) and when a model trained on one system corpus is tested on another system corpus (Walker et al., 2000). However, most PARADISE applications have yielded higher R² values than ours for training (Möller, 2005b; Walker et al., 2002; Bonneau-Maynard et al., 2000; Walker et al., 2000).</Paragraph> <Paragraph position="7"> We hypothesize two reasons why our experiments did not yield more useful user satisfaction models. First, in prior PARADISE applications, users completed a survey after every dialogue with the system. In our case, subjects completed only one survey, at the end of the experiment (5 dialogues). It may be that this per-student unit for user satisfaction is too large to yield a very powerful model; i.e., this measure is not fine-grained enough. Second, tutoring systems are not designed to maximize user satisfaction; rather, their design goal is to maximize student learning. Moreover, prior tutoring studies have shown that certain features correlated with student learning do not have the same relationship to user satisfaction (e.g. are not predictive or have an opposite relationship) (Pon-Barry et al., 2004). 
In fact, it may be that user satisfaction is not a metric of primary relevance in our application.</Paragraph> </Section> <Section position="2" start_page="268" end_page="269" type="sub_section"> <SectionTitle> 4.2 Prediction Models of Student Learning </SectionTitle> <Paragraph position="0"> As in other tutoring research, e.g. (Chi et al., 2001; Litman et al., 2006), we use posttest score (POST) controlled for pretest score (PRE) as our target student learning prediction metric, such that POST is our target variable and PRE is always a parameter in the final model, although it is not necessarily the strongest predictor (footnote 11). In this way, we measure student learning gains, not just final test score.</Paragraph> <Paragraph position="1"> As shown in Table 1, all subjects in our 3 corpora took the pretest and posttest. However, in order to compare our student learning models with our user satisfaction models, our first experiments predicting student learning used the same training and testing datasets that were used to predict user satisfaction in Section 4.1 (i.e. we ran the same experiments except we predicted POST controlled for PRE instead of user satisfaction). Results are shown in the first 6 rows of Table 3.</Paragraph> <Paragraph position="2"> As shown, these 6 models all account for more than 50% of the POST variance in the training data.</Paragraph> <Paragraph position="3"> Furthermore, most of them account for close to, or more than, 50% of the POST variance in the test data. Although again we hypothesized that training and testing in one corpus would yield higher R² values for testing, this is not consistently the case; two of these models had the highest R² values for training and the lowest R² values for testing (PR05:half1 and SYN05:half2), suggesting over-fitting.</Paragraph> <Paragraph position="4"> Footnote 11: In SPSS, we regress two independent variable blocks. The first block contains PRE, which is regressed with POST using the enter method, forcing inclusion of PRE in the final model. The second block contains all remaining independent variables, which are regressed using the stepwise method.</Paragraph>
<Paragraph position="5"> Overall, these results show that this is an effective method of developing a prediction model of student learning for ITSPOKE, and is less sensitive to changes in training data than it was for user satisfaction. Moreover, there is more overlap in these 6 models of parameters that are useful for predicting student learning (besides PRE); Correctness parameters and dialogue communication and efficiency parameters appear to be most useful overall.</Paragraph> <Paragraph position="6"> Our next 3 experiments investigated how our student learning models are impacted by including our third corpus, SYN03. Using the same 25 parameters, we trained a learning model on each set of two combined corpora, then tested it on the other corpus.</Paragraph> <Paragraph position="7"> Results are shown in the last 3 rows of Table 3.</Paragraph> <Paragraph position="8"> As shown, these models still account for close to, or more than, 50% of the student learning variance in the training data (footnote 12). The model trained on PR05+SYN03 accounts for the most student learning variance in the test data, showing that the training data that is most similar to the test data will yield the highest generalizability. That is, the combined PR05+SYN03 corpora contain subjects drawn from the same subject pool (2005) as the SYN05 test data, and also contain subjects who interacted with the same tutor voice (synthesized) as this test data. In contrast, the combined PR05+SYN05 corpora did not overlap in user population with the SYN03 test data, and the combined SYN05+SYN03 corpora did not share a tutor voice with the PR05 test data. Correctness parameters and dialogue communication and efficiency parameters are consistently used as predictors in all 9 of these student learning models. Footnote 12: However, INCORS/CORS and %INCORRECT are highly correlated in the SYN05+SYN03 model, showing redundancy.</Paragraph> </Section>
<Section position="3" start_page="269" end_page="269" type="sub_section"> <SectionTitle> 4.3 Adding User Affect Parameters </SectionTitle> <Paragraph position="0"> Our final experiments investigated whether our 25 user affect parameters impacted the usefulness of the student learning models. As shown in Table 1, all 20 subjects in our SYN03 corpus were annotated for user affect, and 17 subjects in our PR05 corpus were annotated for user affect. We trained a model of student learning on each of these datasets, then tested it on the other dataset (footnote 13). As predictors, we included our 25 user affect parameters along with the 13 system-generic and 12 tutoring-specific interaction parameters. These results are shown in the first two rows of Table 4. We also reran these experiments without user affect parameters, to gauge the impact of the user affect parameters. These results are shown in the last two rows of Table 4. We hypothesized that user affect parameters would produce more useful models, because prior tutoring research has shown correlations between user affect and student learning (e.g. (Craig et al., 2004)).</Paragraph> <Paragraph position="1"> As shown in the first two rows, user affect predictors appear in both models where these parameters were included. The models trained on SYN03 use pretest score and Total Time on Task as predictors; when affect parameters are included, Neutral Certainness is added as a predictor, which increases the R² values for both training and testing. However, the two models trained on PR05:17 show no predictor overlap (besides PRE). 
Moreover, the PR05:17 model that includes an affect predictor (Total Sequence of 2 Non-Frustrated/Angry turns) has the highest training R², but the lowest testing R² value.</Paragraph> <Paragraph position="2"> Footnote 13: As only 17 subjects have both user affect annotation and user surveys, there is not enough data currently to train and test a user satisfaction model including user affect parameters.</Paragraph> </Section> </Section> </Paper>