File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/n03-2018_metho.xml
Size: 9,541 bytes
Last Modified: 2025-10-06 14:08:15
<?xml version="1.0" standalone="yes"?> <Paper uid="N03-2018"> <Title>Towards Emotion Prediction in Spoken Tutoring Dialogues</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The ITSPOKE System and Corpus </SectionTitle> <Paragraph position="0"> We are developing a spoken dialogue system, called ITSPOKE (Intelligent Tutoring SPOKEn dialogue system), which uses as its &quot;back-end&quot; the text-based Why2-Atlas dialogue tutoring system (VanLehn et al., 2002). In Why2-Atlas, a student types an essay answering a qualitative physics problem and a computer tutor then engages him/her in dialogue to provide feedback, correct misconceptions, and elicit more complete explanations, after which the student revises his/her essay, thereby ending the tutoring or causing another round of tutoring/essay revision. To date we have interfaced the Sphinx2 speech recognizer with stochastic language models trained from example user utterances, and the Festival speech synthesizer for text-to-speech, to the Why2-Atlas back-end, and are adapting the knowledge sources needed by the spoken language components; e.g. we have developed a set of dialogue-dependent language models using 4551 student utterances from the Why2-Atlas 2002 human-computer typed corpus and will enhance them using student utterances from our human-human spoken corpus.</Paragraph> <Paragraph position="1"> Our human-human spoken corpus contains spoken dialogues collected via a web interface supplemented with a high quality audio link, where a human tutor performs the same task as ITSPOKE and Why2-Atlas. Our subjects are U. Pittsburgh students who have taken no college level physics and are native speakers of (Amer.) English. Our experimental procedure, taking roughly 7 hours/student over 1-2 sessions, is as follows: students 1) take a pretest measuring their physics knowledge, 2) read a small document of background material, 3) use the web and voice interface to work through up to 10 training problems with the human tutor, and 4) take a post-test similar to the pretest. We have to date collected 63 dialogues (1290 minutes of speech from 4 females and 4 males) and transcribed 20 of them. A corpus example is shown in Figure 1, containing the problem, the student's essay, and an annotated excerpt from the subsequent dialogue.</Paragraph> <Paragraph position="2"> PROBLEM: If a car is able to accelerate at 2 m/s2, what acceleration can it attain if it is towing another car of equal mass? ORIGINAL ESSAY: If the car is towing another car of equal mass, the maximum acceleration would be the same because the car would be towed behind and the friction caused would only be by the front of the first car.</Paragraph> <Paragraph position="3"> . . . dialogue excerpt at 6.5 minutes into session . . .</Paragraph> <Paragraph position="4"> TUTOR: Now this law that force is equal to mass times acceleration, what's this law called? This is uh since this it is a very important basic uh fact uh it is it is a law of physics. Um you have you have read it in the background material. Can you recall it? STUDENT: Um no it was one of Newton's laws but I don't remember which one. (laugh) (EMOTION=NEGATIVE) TUTOR: Right, right, that is Newton's second law of motion.</Paragraph> <Paragraph position="5"> STUDENT: Ok, because I remember one, two, and three, but I didn't know if there was a different name (EMOTION=POSITIVE) null TUTOR: Yeah that's right. You know Newton was a genius and uh he looked at a large number of experiments and experimental data that was available and from that he could come to this general law...</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Predicting Emotional Speech </SectionTitle> <Paragraph position="0"> For this pilot study, we annotated 14 transcribed dialogues from 7 students, 2 dialogues per student. First, turn boundaries were manually annotated (based on consensus labelings from two coders). Each turn was then manually annotated for speaker affect (by a single coder) using three general categorizations: negative (e.g.'uncertain', 'frustration'), positive (e.g. 'confident', 'certain'), or neutral/indeterminate, as shown in Figure 1.</Paragraph> <Paragraph position="1"> Table 1 shows the distribution of our labeled turns.</Paragraph> <Paragraph position="2"> neutral positive negative total We next conducted experiments using the RIPPER (Cohen, 1996) rule induction machine learning program, which takes as input the classes to be learned (e.g. our emotion annotations), the names and possible values in a feature set (discussed below), and training examples, each specifying its class and feature values (e.g. the labeled student turns in our pilot corpus), then outputs a classification model for classifying future examples, expressed as an ordered set of if-then rules. RIPPER's &quot;setvalued&quot; features allow us to represent the speech recognizer's best hypothesis and/or the turn transcription as a set of words, and its rule output is an intuitive way to gain insight into our data.</Paragraph> <Paragraph position="3"> For our first pilot machine learning experiment, our feature set consisted of SUBJECT ID and PROBLEM ID, both representing system state, TURN START-TIME (relative to start of dialogue) and TURN DURATION, both representing timing information, TEXT IN TURN (transcription), and NUMBER OF WORDS IN TURN. Figure 2 presents the ruleset that was learned for this classification task. For example, the first learned rule states that if the duration of the turn is greater than 0.65 seconds and the transcribed text of the turn contains the lexical item &quot;I&quot;, then the turn is predicted to be labeled EMO-TION=NEGATIVE. The estimated mean error and standard error of this ruleset is 33.03% +/- 2.45%, based on 25-fold cross-validation.</Paragraph> <Paragraph position="4"> if (duration a0 0.65) a1 (text has &quot;I&quot;) then neg For comparison, our feature set in our second pilot machine learning experiment consisted of just TEXT IN TURN. The ruleset learned for this classification task contained 21 rules; Figure 3 presents an (ordered) excerpt1. Estimated mean error and standard error of this ruleset is 39.03% +/- 2.40%, based on 25-fold cross-validation.</Paragraph> <Paragraph position="5"> if (text has &quot;I&quot;) a1 (text has &quot;don't&quot;) then neg else if (text has &quot;um&quot;) a1 (text has &quot; a2 hn a3 &quot;) then neg else if (text has &quot;the&quot;) a1 (text has &quot; a2 fs a3 &quot;) then neg else if (text has &quot;right&quot;) then pos else if (text has &quot;so&quot;) then pos else if (text has &quot;(laugh)&quot;) a1 (text has &quot;that's&quot;) then pos Although both these error rates are still fairly high, they are a significant improvement over a majority class</Paragraph> <Paragraph position="7"> false start (e.g. &quot;I th- think&quot;) baseline that always predicts the majority class in our corpus (neutral/indeterminate) - which has an error rate of 55.69%. Moreover, many of the learned rules contain features that are intuitively associated with the predicted emotion; for example, disfluencies such as false starts are often associated with negative emotions such as 'uncertainty', as are lexical items such as &quot;um&quot; used in combination with human noises such as sighs.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Future Directions </SectionTitle> <Paragraph position="0"> Even using a small corpus classified by one coder and predicted using only a handful of features, our results suggest that there are indeed features that can automatically distinguish emotions in tutoring dialogues. We will next explore the utility of a wider variety of features representing many knowledge sources (including acoustic, prosodic, lexical, syntactic, semantic, discourse, and local and global contextual dialogue features), using ablation studies. We will perform our learning using and comparing large corpora of both human-human and human-computer data for training and testing, and will evaluate our results using a variety of metrics (e.g. recall, precision, and F-measure). We will also investigate a variety of emotion annotations with the goal of producing a reliable annotation scheme for the emotions associated with our tutoring domain. Previous studies have shown low inter-annotator reliability (around 70%, Kappa values around 0.47 (Narayanan, 2002)), which originates partly in vague descriptions of the emotions to be labeled.</Paragraph> <Paragraph position="1"> Finally, we hope to use this work to demonstrate that enhancing a spoken dialogue tutoring system to automatically predict and then dynamically respond to student emotional states will measurably improve system performance. Our enhancements will be motivated by tutoring literature (Evens, 2002; Aist et al., 2002) that addresses how a tutor might make use of such information if it could be inferred, as well as by looking at how the human tutor actually responded to emotionally labeled turns. Our methodology will build on previous adaptive (non-tutoring) dialogue systems (see (Litman and Pan, 2002)); however, our system will predict and adapt to both problematic and positive dialogue situations in tutoring. null</Paragraph> </Section> class="xml-element"></Paper>