<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1037"> <Title>Using Reinforcement Learning to Build a Better Model of Dialogue State</Title> <Section position="4" start_page="289" end_page="290" type="metho"> <SectionTitle> 3 Corpus </SectionTitle> <Paragraph position="0"> For our study, we used an annotated corpus of 20 human-computer spoken dialogue tutoring sessions. Each session consists of an interaction with one student over 5 different college-level physics problems, for a total of 100 dialogues. Before the 5 problems, the student is asked to read physics material for 30 minutes and then take a pre-test based on that material. Each problem begins with the student writing out a short essay response to the question posed by the computer tutor. The system reads the essay, detects the problem areas, and then starts a dialogue with the student, asking questions about the confused concepts. Informally, the dialogue follows a question-answer format. Each of the dialogues has been manually authored in advance, meaning that the system has a response based on the correctness of the student's last answer. Once the student has successfully answered all the questions, he or she is asked to correct the initial essay. On average, each of the dialogues takes 20 minutes and contains 25 student turns. Finally, the student is given a post-test similar to the pre-test, from which we can calculate their normalized learning gain:</Paragraph> <Paragraph position="1"> normalized learning gain = (posttest score - pretest score) / (1 - pretest score) </Paragraph> <Paragraph position="2"> Prior to our study, the corpus was annotated for Student and Tutor Moves (see Tables 1 and 2), which can be viewed as Dialogue Acts (Forbes-Riley et al., 2005). Note that tutor and student turns can consist of multiple utterances and can thus be labeled with multiple moves. For example, a tutor can give feedback and then ask a question in the same turn. Whether to include feedback is the action choice addressed in this paper, since it is an interesting open question in the Intelligent Tutoring Systems (ITS) community. Student Moves refer to the type of answer a student gives. Answers that involve a concept already introduced in the dialogue are called Shallow, answers that involve a novel concept are called Novel, &quot;I don't know&quot; type answers are called Assertions (As), and Deep answers refer to answers that involve linking two concepts through reasoning. In our study, we merge all non-Shallow moves into a new move &quot;Other.&quot; In addition to Student Moves, we annotated five other features to include in our representation of the student state. Two emotion-related features were annotated manually (Forbes-Riley and Litman, 2005): certainty and frustration. Certainty describes how confident a student seemed to be in his answer, while frustration describes how frustrated the student seemed to be in his last response.</Paragraph> <Paragraph position="3"> We include three other features for the student state that were extracted automatically. Correctness indicates whether the last student answer was correct or incorrect. As noted above, this is what most current tutoring systems use as their state. Percent Correct is the percentage of questions in the current problem that the student has answered correctly so far. Finally, if a student performs poorly on a certain topic, the system may be forced to repeat a description of that concept (concept repetition).</Paragraph> <Paragraph position="4"> It should be noted that all the dialogues were authored beforehand by physics experts.
For every turn there is a list of possible correct, incorrect, and partially correct answers the student can give, and for each of these student responses a link to the next turn. In addition to explaining physics concepts, the authors also include feedback and other types of helpful measures (such as hints or restatements) to help the student along. However, these were not written with a view to how best to influence the student state. Our goal in this study is to automatically learn from this corpus which state-action patterns evoke the highest learning gain.</Paragraph> </Section> <Section position="5" start_page="290" end_page="290" type="metho"> <SectionTitle> 4 Infrastructure </SectionTitle> <Paragraph position="0"> To test different hypotheses of what features best approximate the student state and what are the best actions for a tutor to consider, one must have a flexible system that allows one to easily test different configurations of states and actions. To accomplish this, we designed a system similar to the Reinforcement Learning for Dialogue Systems (RLDS) (Singh et al., 1999). The system allows a designer to specify what features will compose the state and actions, as well as to perform operations on each individual feature. For instance, the tool allows the user to collapse features together (such as collapsing all Question Acts into one) or quantize features that have continuous values (such as the number of utterances in the dialogue so far). These collapsing functions allow the user to easily constrain the trajectory space. To further reduce the search space for the MDP, our tool allows the user to specify a threshold so that states occurring less often than the threshold are combined into a single &quot;threshold state.&quot; In addition, the user can specify a reward function and a discount factor. For this study, we use a threshold of 50 and a discount factor of 0.9, which is also what is commonly used in other RL models, such as (Frampton and Lemon, 2005). For the dialogue reward function, we did a median split on the 20 students based on their normalized learning gain, a standard evaluation metric in the Intelligent Tutoring Systems community. So 10 students and their respective 5 dialogues were assigned a positive reward of +100 (high learners), and the other 10 students and their respective dialogues were assigned a negative reward of -100 (low learners). It should be noted that a student's 5 dialogues were all assigned the same reward, since there was no way to approximate their learning gain in the middle of a session.</Paragraph> <Paragraph position="1"> The output of the tool is a probability matrix over the user-specified states and actions. This matrix is then passed to an MDP toolkit (Chades et al., 2005) written in Matlab. The toolkit performs policy iteration and generates a policy as well as a list of V-values for each state.</Paragraph>
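To make the pipeline above concrete, here is a minimal sketch of how dialogue trajectories could be turned into an MDP and solved. It is written in Python rather than the Matlab toolkit actually used, and the function names, the data format, and the simple iterative solver are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

GAMMA = 0.9                                  # discount factor used in the paper
THRESHOLD = 50                               # states seen fewer than 50 times are merged
REWARD = {"HIGH": +100.0, "LOW": -100.0}     # terminal rewards from the median split

def build_mdp(dialogues):
    """dialogues: list of dialogues, each a list of (state, action, next_state)
    triples, where the final next_state is 'HIGH' or 'LOW' for that student."""
    counts = defaultdict(int)
    for dialogue in dialogues:
        for s, _, _ in dialogue:
            counts[s] += 1

    def collapse(s):
        # infrequent states fall into a single "threshold state"
        if s in REWARD:
            return s
        return s if counts[s] >= THRESHOLD else "THRESHOLD_STATE"

    trans = defaultdict(lambda: defaultdict(float))
    for dialogue in dialogues:
        for s, a, s2 in dialogue:
            trans[(collapse(s), a)][collapse(s2)] += 1.0

    # normalise counts into the probability matrix handed to the solver
    return {sa: {s2: c / sum(dist.values()) for s2, c in dist.items()}
            for sa, dist in trans.items()}

def policy_iteration(probs, actions, n_sweeps=200):
    """Greedy value/policy sweeps; a stand-in for the toolkit's policy iteration."""
    states = {s for s, _ in probs} | {s2 for dist in probs.values() for s2 in dist}
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states if s not in REWARD}
    for _ in range(n_sweeps):
        for s in policy:
            # Q(s, a); unobserved state-action pairs default to an empty distribution
            q = lambda a: sum(p * (REWARD.get(s2, 0.0) + GAMMA * V[s2])
                              for s2, p in probs.get((s, a), {}).items())
            policy[s] = max(actions, key=q)
            V[s] = q(policy[s])
    return policy, V
```

In the actual system, the normalised probability matrix is handed to the Matlab toolkit of (Chades et al., 2005), which performs policy iteration proper.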
</Section> <Section position="6" start_page="290" end_page="291" type="metho"> <SectionTitle> 5 Experimental Method </SectionTitle> <Paragraph position="0"> With the infrastructure created and the MDP parameters set, we can move on to the goal of this experiment: to see what sources of information impact a tutoring dialogue system. First, we develop a baseline against which to compare the effects of adding more information. Second, we generate a new policy by adding the new information source to the baseline state. However, since we are not currently running any new experiments to test our policy, or evaluating over user simulations, we evaluate the reliability of our policies by looking at how well they converge over time: that is, if we incrementally add more data (i.e., a student's 5 dialogues), does the generated policy tend to stabilize, and do the V-values for each state stabilize as well? The intuition is that if both the policies and the V-values tend to converge, we can be reasonably sure that the generated policy is reasonable.</Paragraph> <Paragraph position="1"> The first step in our experiment is to determine a baseline. We use feedback as our system action in our MDP. The action size is 3: the tutor can give feedback (Feed), give feedback with another tutor act (Mix), or give no feedback at all (NonFeed).</Paragraph> <Paragraph position="2"> Examples from our corpus can be seen in Table 3.</Paragraph> <Paragraph position="3"> It should be noted that &quot;NonFeed&quot; does not mean that the student's answer is not acknowledged; it means that something more complex than a simple positive or negative phrase is given (such as a Hint or Restatement). Currently, the system's response to a student depends only on whether or not the student answered the last question correctly, so we use correctness as the sole feature in our dialogue state. Recall that a student can be correct, partially correct, or incorrect. Since partially correct occurs infrequently compared to the other two, we reduced the state size to two by combining Incorrect and Partially Correct into one state (IPC) and keeping Correct (C).</Paragraph> <Paragraph position="4"> The third column of Table 4 shows the resulting learned MDP policy as well as the frequencies of both states in the data. For both states, the best action for the tutor to take is to give feedback, without knowing anything else about the student state.</Paragraph> <Paragraph position="5"> The second step in our experiment is to test whether the policies generated are indeed reliable.</Paragraph> <Paragraph position="6"> Normally, the best way to verify a policy is by conducting experiments and seeing if the new policy leads to a higher reward for the new dialogues. In our context, this would entail running more subjects with the augmented dialogue manager and checking whether the students had a higher learning gain with the new policies. However, collecting data in this fashion can take months. So we take a different tack and check whether the policies and values for each state converge as we add data to our MDP model. The intuition here is that if both of those quantities vary between a corpus of 19 students and one of 20 students, then we cannot assume that our policy is stable, and hence it is not reliable. However, if these quantities converge as more data is added, this indicates that the MDP is reliable.</Paragraph> <Paragraph position="7"> To test this, we conducted a 20-fold cross-averaging test over our corpus of 20 students.</Paragraph> <Paragraph position="8"> Specifically, we made 20 random orderings of our students to prevent any one ordering from giving a false convergence. Each ordering was then chunked into 20 cuts, ranging from a size of 1 student to the entire corpus of 20 students. We then passed each cut to our MDP infrastructure, starting with a corpus of just the first student of the ordering and determining an MDP policy for that cut, then adding another student to that corpus and rerunning our MDP system. We continued this incremental addition of a student (5 dialogues) until all 20 students were included. So at the end, we have 20 random orderings with 20 cuts each, for a total of 400 MDP trials. Finally, we average the V-values of same-size cuts together to produce an average V-value for that cut size. The left-hand graph in Figure 1 plots the average V-values for each state against the cut. The state marked with plusses is the positive final state, and the one at the bottom is the negative final state. However, we are most concerned with how the non-final states converge, which are the states in the middle. The plot shows that there is a lot of instability for early cuts, but each state tends to stabilize after cut 10. This tells us that the V-values are fairly stable and thus reliable when we derive policies from the entire corpus of 20 students.</Paragraph>
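As an illustration of the cross-averaging procedure just described, the following Python sketch reuses the hypothetical build_mdp and policy_iteration helpers from the Section 4 example; the data layout (a dict from student to their 5 dialogue trajectories) is an assumption made for exposition.

```python
import random

N_ORDERINGS = 20
N_STUDENTS = 20
ACTIONS = ["Feed", "Mix", "NonFeed"]

def cross_averaging(student_dialogues, build_mdp, policy_iteration):
    """Return the average V-value per state for each cut size (1..20 students)."""
    v_by_cut = {cut: [] for cut in range(1, N_STUDENTS + 1)}
    for _ in range(N_ORDERINGS):
        order = list(student_dialogues)
        random.shuffle(order)                      # one random ordering of students
        for cut in range(1, N_STUDENTS + 1):
            corpus = [d for sid in order[:cut]     # dialogues of the first `cut` students
                      for d in student_dialogues[sid]]
            probs = build_mdp(corpus)              # probability matrix for this cut
            _, V = policy_iteration(probs, ACTIONS)
            v_by_cut[cut].append(V)                # 20 orderings x 20 cuts = 400 runs
    averaged = {}
    for cut, v_list in v_by_cut.items():
        states = set().union(*(V.keys() for V in v_list))
        averaged[cut] = {s: sum(V.get(s, 0.0) for V in v_list) / len(v_list)
                         for s in states}
    return averaged
```

Plotting averaged[cut][state] against the cut size yields convergence curves of the kind shown in Figure 1.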
<Paragraph position="9"> As a further test, we also check that the policies generated for each cut tend to stabilize over time; that is, the differences between a policy at a smaller cut and the policy at the final cut should converge to zero as more data is added. This "diffs" test is discussed in more detail in Section 6.</Paragraph> </Section> <Section position="7" start_page="291" end_page="294" type="metho"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> In this section, we investigate whether adding more information to our student state leads to interesting policy changes. First, we add certainty to our baseline of correctness, and then compare this new baseline's policy (henceforth Baseline 2) with the policies generated when student moves, frustration, concept repetition, and percent correctness are included. For each test, we employed the same methodology as in the baseline case, performing a 20-fold cross-averaging and examining whether the states' V-values converge.</Paragraph> <Paragraph position="1"> We first add certainty to correctness because prior work (such as (Bhatt et al., 2004)) has shown the importance of considering certainty in tutoring systems. For example, a student who is correct and certain probably does not need a lot of feedback, but one who is correct yet uncertain could signal that the student is becoming doubtful or at least confused about a concept. There are three types of certainty: certain (cer), uncertain (unc), and neutral (neu). Adding these to our state representation increases the state size from 2 to 6. The new policy is shown in Table 4. The second and third columns show the original baseline states and their policies. The next column shows the new policy when splitting each original state into three new states based on certainty, as well as the frequency of each new state. So the first row can be interpreted as follows: if the student is correct and certain, give no feedback; if the student is correct and neutral, give feedback; and if the student is correct and uncertain, give non-feedback.</Paragraph>
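To illustrate how crossing features expands the state space, the snippet below composes the correctness and certainty values into the six Baseline 2 state labels; the label format and the example policy entries (which simply restate the first row of Table 4 as paraphrased above) are illustrative assumptions.

```python
from itertools import product

CORRECTNESS = ["C", "IPC"]            # Correct vs. Incorrect/Partially Correct
CERTAINTY = ["cer", "neu", "unc"]     # certain, neutral, uncertain

# Baseline: 2 states keyed on correctness alone.
baseline_states = list(CORRECTNESS)

# Baseline 2: crossing correctness with certainty yields 2 x 3 = 6 states.
baseline2_states = [f"{c}:{e}" for c, e in product(CORRECTNESS, CERTAINTY)]
# ['C:cer', 'C:neu', 'C:unc', 'IPC:cer', 'IPC:neu', 'IPC:unc']

# A learned policy is simply a mapping from these labels to tutor actions,
# e.g. the first row of Table 4 as described in the text:
policy_row_1 = {"C:cer": "NonFeed", "C:neu": "Feed", "C:unc": "NonFeed"}
```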
<Paragraph position="2"> Our reasoning is that if a feature is important to include in a state representation, it should change the policies of the old states. For example, if certainty did not impact how well students learned (as judged by the MDP), then the policies for certain, uncertain, and neutral would be the same as the original policy for Correct or Incorrect/Partially Correct; in this case, they would all be Feed. However, the figures show otherwise: when we add certainty to the state, only one new state (C while being neutral) retains the old policy of having the tutor give feedback. The policies that differ from the original are shown in bold.</Paragraph> <Paragraph position="3"> So in general, the learned policy is that one should not give feedback if the student is certain or uncertain, but rather give some other form of response, such as a Hint or a Restatement.</Paragraph> <Paragraph position="4"> But when the student is neutral with respect to certainty, one should give feedback. One way of interpreting these results is that, in our domain, for students who are either fully confident or not confident at all in their last answer, there are better things to say to improve their learning down the road than "Great Job!" But if the student does not display a lot of emotion, then one should use explicit positive or negative feedback to perhaps bolster their confidence level.</Paragraph> <Paragraph position="5"> The right-hand graph in Figure 1 shows the convergence plot for the baseline state with certainty. It shows that as we add more data, the values for each state converge, so in general we can say that the values for our Baseline 2 case are fairly stable. Next, we add the Student Moves, Frustration, Concept Repetition, and Percent Correct features individually to Baseline 2. The first graph in Figure 2 shows a plot of the convergence values for the Percent Correct feature. We show only one convergence plot since the other three are similar. The result is that the V-values for all four converge after 14-15 students.</Paragraph> <Paragraph position="6"> The second graph shows the differences in policies between the final cut of 20 students and all smaller cuts. This check is necessary because some states may exhibit stable V-values but actually be oscillating between two different policies of equal value. Each point on the graph tells us how many policy differences there are between the cut in question and the final cut. For example, if the policy generated at cut 15 was to give feedback for all states, and the policy at the final cut was to give feedback for all but two states, the "diff" for cut 15 would be two. So in the best case, zero differences mean that the policies generated for both cuts are exactly the same. The diff plots show that the differences decrease as data is added, and they were very similar for both Baseline cases. For cuts greater than 15 there are still some differences, but these are usually due to low-frequency states. So we can conclude that our policies are fairly stable and thus worth investigating in more detail.</Paragraph>
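The policy "diff" count described above is straightforward to compute; in the sketch below, policies_by_cut is assumed to map each cut size to a dict from state label to the action chosen at that cut.

```python
def policy_diff(policy_a, policy_b):
    """Number of states whose chosen action differs between two policies.
    States missing from either policy (e.g. below-threshold states) are skipped."""
    shared = set(policy_a) & set(policy_b)
    return sum(1 for s in shared if policy_a[s] != policy_b[s])

def diffs_against_final(policies_by_cut, final_cut=20):
    final_policy = policies_by_cut[final_cut]
    return {cut: policy_diff(policy, final_policy)
            for cut, policy in policies_by_cut.items()}

# Example from the text: if cut 15 chose Feed everywhere but the final cut
# chose Feed everywhere except two states, the diff for cut 15 is 2.
```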
<Paragraph position="7"> In the remainder of this section, we look at the differences between the Baseline 2 policies and the policies generated by adding a new feature to the Baseline 2 state. If adding a new feature does not really change what the tutor should do (that is, the tutor will follow the baseline policy regardless of the new information), one can conclude that the feature is not worth including in the student state. On the other hand, if adding the feature results in a much different policy, then the feature is important to student modeling.</Paragraph> <Paragraph position="8"> Student Move Feature: The results of adding Student Moves to Baseline 2 are shown in Table 5. Out of the 12 new states created, 7 deviate from the original policy. The main trend is for the neutral and uncertain states to receive mixed feedback after a student Shallow move, and a non-feed response when the student says something deep or novel. When the student is certain, the trend is to always give a mixed response except in the case where he said ...</Paragraph> <Paragraph position="9"> Concept Repetition Feature: Table 6 shows the new policy generated. Unlike the Student Move policies, which impacted all 6 of the baseline states, Concept Repetition changes the policies for the first three baseline states, resulting in 4 out of 12 new states differing from the baseline. For states 1 through 4, the trend is that if the concept has been repeated, the tutor should give feedback or a combination of feedback with another Tutor Act. Intuitively this seems clear: if a concept is repeated, it shows that the student has not completely understood it, and it is necessary to give a little more feedback than when the concept is first seen. So this test indicates that keeping track of repeated concepts has a significant impact on the policy generated.</Paragraph> <Paragraph position="10"> Frustration Feature: Table 7 shows the new policy generated. Comparing the baseline policy with the new policy (which includes categories for when the original state is either neutral or frustrated) shows that adding frustration changes the policy for state 1, when the student is certain or correct. In that case, the better option is to give positive feedback. For all other states, frustration co-occurs with each of them so infrequently that the resulting states appeared fewer than our threshold of 50 times. As a result, these 5 frustration states are grouped together in the &quot;threshold state&quot;, and our MDP found that the best policy in that state is to give no feedback. So the two neutral states change when the student is frustrated. Interestingly, for students who are uncertain, the policy does not change whether they are frustrated or neutral; the trend is to always give NonFeedback.</Paragraph> <Section position="1" start_page="293" end_page="294" type="sub_section"> <SectionTitle> Percent Correctness Feature </SectionTitle> <Paragraph position="0"> Table 8 shows the new policy generated for incorporating a simple model of current student performance within the dialogue. This feature, along with Frustration, seems to impact the baseline state the least, since both alter the policies for only 3 of the 12 new states. States 3, 4, and 5 show a change in policy for different values of correctness. One trend seems to be that when a student has not been performing well (L), the tutor should give a NonFeedback response such as a hint or restatement.</Paragraph> </Section> </Section> <Section position="8" start_page="294" end_page="294" type="metho"> <SectionTitle> 7 Related Work </SectionTitle> <Paragraph position="0"> RL has been applied to improve dialogue systems in past work, but very few approaches have looked at which features are important to include in the dialogue state. (Paek and Chickering, 2005) showed how the state space can be learned from data along with the policy. One result is that a state space can be constrained by using only features that are relevant to receiving a reward. Singh et al.
(1999) found an optimal dialogue length in their domain, and showed that the number of information and distress attributes impacts the state.</Paragraph> <Paragraph position="1"> They take a different approach from the work here in that they compare which feature values are optimal at different points in the dialogue. Frampton et al. (2005) is similar to our work in that they experiment with including another dialogue feature in their baseline system: the user's last dialogue act, which was found to produce a 52% increase in average reward. Williams et al. (2003) used Supervised Learning to select good state and action features as an initial policy to bootstrap an RL-based dialogue system. They found that their automatically created state and action seeds outperformed hand-crafted policies in a driving directions corpus.</Paragraph> <Paragraph position="2"> In addition, there has been extensive work on creating new corpora via user simulations (such as (Georgila et al., 2005)) to get around the possible issue of not having enough data to train on. Our results here indicate that a small training corpus is actually acceptable to use in an MDP framework, as long as the state and action features are pruned effectively. The use of features such as context and student moves is nothing new to the ITS community (for example, the BEETLE system (Zinn et al., 2005)); however, very little work has been done on using RL to develop tutoring systems.</Paragraph> </Section> </Paper>