<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1025"> <Title>Dependencies between Student State and Speech Recognition Problems in Spoken Tutoring Dialogues</Title> <Section position="5" start_page="193" end_page="194" type="metho"> <SectionTitle> ). Similarly, </SectionTitle> <Paragraph position="0"> we define the ASR MIS variable with two values: AsrMis and noAsrMis.</Paragraph> <Paragraph position="1"> Semantic accuracy is more relevant for dialogue evaluation, as it does not penalize for word errors that are unimportant to overall utterance interpretation. In the case of form-based information access spoken dialogue systems, computing semantic accuracy is straightforward (i.e. concept accuracy = percentage of correctly recognized concepts). In contrast, in the tutoring domain there are no clear forms with slots to be filled. We base our semantic accuracy on the &quot;correctness&quot; measure of the student turn. For each student turn, ITSPOKE interprets it and labels its correctness with regard to whether the student correctly answered the tutor question (see the labels between square brackets in Figure 1).</Paragraph> <Paragraph position="2"> We define Semantic Misrecognition as cases where ITSPOKE was confident in its recognition hypothesis and the correctness interpretation of the recognition hypothesis is different from the correctness interpretation of the manual transcript (Figure 1, STD ). Similarly, we define the SEM MIS variable with two values: SemMis and noSemMis. The top part of Table 1 lists the distribution for our three SRP variables.</Paragraph> </Section> <Section position="6" start_page="194" end_page="195" type="metho"> <SectionTitle> 3 Student State Annotation </SectionTitle> <Paragraph position="0"> Our definition of student state includes annotation on three dimensions (our three dialogue factors): Frustration/Anger/Hyperarticulation, Certainty and Correctness (see Table 1). 
The first two dimensions capture the student's emotions and the last one captures the student's correctness.</Paragraph> <Paragraph position="1"> [Figure 1, dialogue excerpt] ITSPOKE: Good. Apply the same force to an object of mass M3, which is three times the mass of M1. What is the magnitude M3's acceleration in terms of : they're it but a one (AsrMis, SemMis) [incorrect] ITSPOKE: The acceleration of M3 [...] would be one-third the acceleration of M1. [...] In general for the same force what remains constant? : great (AsrMis) [incorrect] ITSPOKE: Well... The product of the mass of a body and its acceleration remains constant for a given force. What is the relationship between a force acting on the box, the mass of the box, and its accel- [...]</Paragraph> <Paragraph position="2"> The Frustration/Anger/Hyperarticulation dimension captures the perceived negative student emotional response to the interaction with the system. Three labels were used to annotate this dimension: frustration-anger, hyperarticulation and neutral. As in Ang et al. (2002), because frustration and anger can be difficult to distinguish reliably, they were collapsed into a single label: frustration-anger (Figure 1, STD ).</Paragraph> <Paragraph position="3"> Often, frustration and anger are prosodically marked, and in many cases the prosody used is consistent with hyperarticulation (Ang et al., 2002). For this reason we included the hyperarticulation label in this dimension (even though hyperarticulation is not an emotion but a state). We used the hyperarticulation label for turns that were hyperarticulated but in which no frustration or anger was perceived. 
For our interaction experiments we define the FAH variable with three values: FrAng (frustration-anger), Hyp (hyperarticulation) and Neutral.</Paragraph> <Paragraph position="4"> The Certainty dimension captures the student's perceived reaction to the questions asked by our computer tutor and her overall reaction to the tutoring domain (Liscombe et al., 2005).</Paragraph> <Paragraph position="5"> Forbes-Riley and Litman (2005) show that student certainty interacts with a human tutor's dialogue decision process (i.e. the choice of feedback). Four labels were used for this dimension: certain, uncertain (e.g. Figure 1, STD ), mixed and neutral. In a small number of turns, both certainty and uncertainty were expressed and these turns were labeled as mixed (e.g. the student was certain about one concept, but uncertain about another concept needed to answer the tutor's question). For our interaction experiments we define the CERT variable with four values: Certain, Uncertain, Mixed and Neutral.</Paragraph> <Paragraph position="6"> To test the impact of the emotion annotation level, we define the Emotional/Non-Emotional annotation based on our two emotional dimensions: turns that are neutral on both the FAH and the CERT dimensions are labeled as neutral; all other turns are labeled as emotional. Consequently, we define the EnE variable with two values: Emotional and Neutral.</Paragraph> <Paragraph position="7"> Correctness is also an important factor of the student state. In addition to the correctness labels assigned by ITSPOKE (recall the definition of SEM MIS), each student turn was manually annotated by a project staff member in terms of its physics-related correctness. 
Our annotator used the human transcripts and his physics knowledge to label each student turn for various degrees of correctness: correct, partially correct, incorrect and unable to answer. (Footnote: To be consistent with our previous work, we label hyperarticulated turns as emotional even though hyperarticulation is not an emotion.)</Paragraph> <Paragraph position="8"> Our system can ask the student to provide multiple pieces of information in her answer (e.g. the question &quot;Try to name the forces acting on the packet. Please, specify their directions.&quot; asks for both the names of the forces and their direction). If the student's answer was correct and contained all pieces of information, it was labeled as correct (e.g. &quot;gravity, down&quot;). The partially correct label was used for turns where part of the answer was correct but the rest was either incorrect (e.g. &quot;gravity, up&quot;) or where some information from the ideal correct answer was omitted (e.g. &quot;gravity&quot;). Turns that were completely incorrect (e.g. &quot;no forces&quot;) were labeled as incorrect. Turns where the student did not answer the computer tutor's question were labeled as &quot;unable to answer&quot;. In these turns the student either used variants of &quot;I don't know&quot; or simply did not say anything. For our interaction experiments we define the CRCT variable with four values: C (correct), I (incorrect), PC (partially correct) and UA (unable to answer).</Paragraph> <Paragraph position="9"> Please note that our definition of student state is from the tutor's perspective. As we mentioned before, our emotion annotation is for perceived emotions. Similarly, the notion of correctness is from the tutor's perspective. For example, the student might think she is correct but, in reality, her answer is incorrect. This correctness should be contrasted with the correctness used to define SEM MIS. 
The SEM MIS correctness uses ITSPOKE's language understanding module applied to the recognition hypothesis or the manual transcript, while the student state's correctness uses our annotator's language understanding.</Paragraph> <Paragraph position="10"> All our student state annotations are at the turn level and were performed manually by the same annotator. While an inter-annotator agreement study is the best way to test the reliability of our two emotional annotations (FAH and CERT), our experience with annotating student emotions (Litman and Forbes-Riley, 2004) has shown that this type of annotation can be performed reliably. Given the general importance of the student's uncertainty for tutoring, a second annotator has been commissioned to annotate our corpus for the presence or absence of uncertainty. This annotation can be directly compared with a binary version of CERT: Uncertain+Mixed versus Certain+Neutral. The comparison yields an agreement of 90% with a Kappa of 0.68. Moreover, if we rerun our study on the second annotation, we find similar dependencies. We are currently planning to perform a second annotation of the FAH dimension to validate its reliability.</Paragraph> <Paragraph position="11"> We believe that our correctness annotation (CRCT) is reliable due to the simplicity of the task: the annotator uses his language understanding to match the human transcript to a list of correct/incorrect answers. When we compared this annotation with the correctness assigned by ITSPOKE on the human transcript, we found an agreement of 90% with a Kappa of 0.79.</Paragraph> </Section> <Section position="7" start_page="195" end_page="196" type="metho"> <SectionTitle> 4 Identifying dependencies using χ² </SectionTitle> <Paragraph position="0"> To discover the dependencies between our variables, we apply the χ² test. We illustrate our analysis method on the interaction between certainty (CERT) and rejection (REJ). 
The χ² value assesses whether the differences between observed and expected counts are large enough to conclude a statistically significant dependency between the two variables (Table 2, last column). For Table 2, which has 3 degrees of freedom ((4-1)*(2-1)), the critical χ² value at p&lt;0.05 is 7.81.</Paragraph> <Paragraph position="1"> We thus conclude that there is a statistically significant dependency between the student certainty in a turn and the rejection of that turn. [Table header: Combination, Obs., Exp., χ²] If either of the two variables involved in a significant dependency has more than 2 possible values, we can look more deeply into this overall interaction by investigating how particular values interact with each other. To do that, we compute a binary variable for each of the variable's values in turn and study dependencies between these variables. For example, for the value 'Certain' of variable CERT we create a binary variable with two values: 'Certain' and 'Anything Else' (in this case Uncertain, Mixed and Neutral). By studying the dependency between binary variables we can understand how the interaction works.</Paragraph> <Paragraph position="2"> Table 2 reports in rows 3 and 4 all significant interactions between the values of variables CERT and REJ. Each row shows: 1) the value for each original variable, 2) the sign of the dependency, 3) the observed counts, 4) the expected counts and 5) the χ² value. For example, in our data there are 49 rejected turns in which the student was certain. This value is smaller than the expected count (67); the dependency between Certain and Rej is significant with a χ² value of 9.13. A comparison of the observed counts and expected counts reveals the direction (sign) of the dependency. In our case we see that certain turns are rejected less than expected (row 3), while uncertain turns are rejected more than expected (row 4). 
On the other hand, there is no interaction between neutral turns and rejections or between mixed turns and rejections. Thus, the CERT - REJ interaction is explained only by the interaction between Certain and Rej and the interaction between Uncertain and Rej.</Paragraph> </Section> <Section position="8" start_page="196" end_page="197" type="metho"> <SectionTitle> 5 Results - dependencies </SectionTitle> <Paragraph position="0"> In this section we present all significant dependencies between SRP and student state both within and across turns. Within turn interactions analyze the contribution of the student state to the recognition of the turn. They were motivated by the widely believed intuition that emotion interacts with SRP. Across turn interactions look at the contribution of previous SRP to the current student state. Our previous work (Rotaru and Litman, 2005) had shown that certain SRP will correlate with emotional responses from the user.</Paragraph> <Paragraph position="1"> We also study the impact of the emotion annotation level (EnE versus FAH/CERT) on the interactions we observe. The implications of these dependencies will be discussed in Section 6.</Paragraph> <Section position="1" start_page="196" end_page="197" type="sub_section"> <SectionTitle> 5.1 Within turn interactions </SectionTitle> <Paragraph position="0"> For the FAH dimension, we find only one significant interaction: the interaction between the FAH student state and the rejection of the current turn (Table 3). By studying values' interactions, we find that turns where the student is frustrated or angry are rejected more than expected (34 instead of 16; Figure 1, STD is one of them).</Paragraph> <Paragraph position="1"> Similarly, turns where the student response is hyperarticulated are also rejected more than expected (similar to observations in (Soltau and Waibel, 2000)). In contrast, neutral turns in the FAH dimension are rejected less than expected. 
Surprisingly, FrAng does not interact with AsrMis, unlike the interaction observed by Bulyko et al. (2005); however, they use the full word error rate measure instead of the binary version used in this paper.</Paragraph> <Paragraph position="2"> [Table header: Combination, Obs., Exp., χ²] Next we investigate how our second emotion annotation, CERT, interacts with SRP. All significant dependencies are reported in Tables 2 and 4. In contrast with the FAH dimension, here we see that the interaction direction depends on the valence. We find that 'Certain' turns have fewer SRP than expected (in terms of AsrMis and Rej). In contrast, 'Uncertain' turns have more SRP both in terms of AsrMis and Rej. 'Mixed' turns interact only with AsrMis, allowing us to conclude that the presence of uncertainty in the student turn (partial or overall) results in more ASR problems than expected. Interestingly, on this dimension, neutral turns do not interact with any of our three SRP.</Paragraph> <Paragraph position="3"> [Table header: Combination, Obs., Exp., χ²] Finally, we look at interactions between student correctness and SRP. Here we find significant dependencies with all types of SRP (see Table 5). In general, correct student turns have fewer SRP while incorrect, partially correct or UA turns have more SRP than expected. Partially correct turns have more AsrMis and SemMis problems than expected, but are rejected less than expected. Interestingly, UA turns interact only with rejections: these turns are rejected more than expected. An analysis of our corpus reveals that in most rejected UA turns the student does not say anything; in these cases, the system's recognition module hypothesized that the student said something, but the system correctly rejected the recognition hypothesis.</Paragraph> <Paragraph position="4"> [Table header: Combination, Obs., Exp., χ²]</Paragraph>
<Paragraph position="6"> The only exception to the rule is SEM MIS.</Paragraph> <Paragraph position="7"> We believe that the SEM MIS behavior is explained by the &quot;catch-all&quot; implementation in our system. In ITSPOKE, for each tutor question there is a list of anticipated answers. All other answers are treated as incorrect. Thus, it is less likely that a recognition problem in an incorrect turn will affect the correctness interpretation (e.g. Figure 1,</Paragraph> </Section> </Section> <Section position="9" start_page="197" end_page="197" type="metho"> <SectionTitle> STD </SectionTitle> <Paragraph position="0"> : very unlikely to misrecognize the incorrect &quot;weight&quot; as the anticipated &quot;the product of mass and acceleration&quot;). In contrast, in correct turns recognition problems are more likely to alter the correctness interpretation (e.g. misrecognizing &quot;gravity down&quot; as &quot;gravity sound&quot;).</Paragraph> <Section position="1" start_page="197" end_page="197" type="sub_section"> <SectionTitle> 5.2 Across turn interactions </SectionTitle> <Paragraph position="0"> Next we look at the contribution of previous SRP - variable name or value followed by (-1) - to the current student state. Please note that there are two factors involved here: the presence of the SRP and the SRP handling strategy. In ITSPOKE, whenever a student turn is rejected, unless this is the third rejection in a row, the student is asked to repeat using variations of &quot;Could you please repeat that?&quot;. In all other cases, ITSPOKE makes use of the available information, ignoring any potential ASR errors. [Table header: Combination, Obs., Exp., χ²]</Paragraph> <Paragraph position="2"> - trend, p&lt;0.1).</Paragraph> <Paragraph position="3"> Here we find only 3 interactions (Table 6). We find that after a non-harmful SRP (AsrMis) the student is less frustrated and less hyperarticulated than expected. 
This result is not surprising since an AsrMis does not have any effect on the normal dialogue flow.</Paragraph> <Paragraph position="4"> In contrast, after rejections we observe several negative events. We find a highly significant interaction between a previous rejection and the student FAH state, with the student being more frustrated and more hyperarticulated than expected (e.g. Figure 1, STD ). Not only does the system elicit an emotional reaction from the student after a rejection, but her subsequent response to the repetition request suffers in terms of correctness. We find that after rejections student answers are correct or partially correct less than expected and incorrect more than expected. The</Paragraph> </Section> </Section> <Section position="10" start_page="197" end_page="197" type="metho"> <SectionTitle> REJ </SectionTitle> <Paragraph position="0"> (-1) - CRCT interaction might be explained by the CRCT - REJ interaction (Table 5) if, in general, after a rejection the student repeats her previous turn. An annotation of responses to rejections (repeat, rephrase, etc.), as in Swerts et al. (2000), should provide additional insights. We were surprised to see that a previous SemMis (more harmful than an AsrMis but less disruptive than a Rej) does not interact with the student state; likewise, student certainty does not interact with previous SRP.</Paragraph> <Section position="1" start_page="197" end_page="197" type="sub_section"> <SectionTitle> 5.3 Emotion annotation level </SectionTitle> <Paragraph position="0"> We also study the impact of the emotion annotation level on the interactions we can observe from our corpus. In this section, we look at interactions between SRP and our coarse-level emotion annotation (EnE) both within and across turns. Our results are similar to those of our previous work (Rotaru and Litman, 2005) on a smaller corpus and a similar annotation scheme. 
We find again only one significant interaction: rejections are followed by more emotional turns than expected (Table 7). The strength of the interaction is smaller than in previous work, though the results cannot be compared directly. No other dependencies are present.</Paragraph> <Paragraph position="1"> [Table header: Combination, Obs., Exp., χ²] [...] explained mainly by the FAH dimension. Not only is there no interaction between REJ (-1) and CERT, but the inclusion of the CERT dimension in the EnE annotation decreases the strength of the interaction between REJ and FAH (the χ² value decreases from 409.31 for FAH to a mere 6.19 for EnE). Collapsing emotional classes also prevents us from seeing any within turn interactions. These observations suggest that what is being counted as an emotion for a binary emotion annotation is critical to its success. In our case, if we look at affect (FAH) or attitude (CERT) in isolation we find many interactions; in contrast, combining them offers little insight.</Paragraph> </Section> </Section> </Paper>