<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0202"> <Title>Learning to Identify Student Preconceptions from Text</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle>
<Paragraph position="0"> We use data from Diagnoser, a web-based system for diagnosing student preconceptions (Hunt and Minstrell, 1994), to test our rule learner. This assessment system has two types of questions: domain-specific base questions, which can be multiple choice or numeric, and secondary follow-up questions, which can be multiple choice or free text. The answers to the base questions are designed to correlate with common student preconceptions, and the secondary questions are used to confirm the system's diagnosis. The system includes a database of common preconceptions that has been developed over a period of years (Hunt and Minstrell, 1996). The system primarily uses multiple choice follow-up questions, with just a handful of text-based ones. The developers would like to use more textual questions, but don't currently do so due to a lack of automatic analysis tools.</Paragraph>
<Paragraph position="1"> Our data consist of student answers to one of these short-answer questions. The base question is shown in figure 2. The follow-up question simply asks the student to explain their reasoning. We used the students' answers to the base question to classify the responses into three categories, one for each of the three possible answers to the base question. According to the system documentation, the first answer is predictive of students who fail to distinguish position and speed (Ppos-speed). Presumably, these students reported that the motion represented by the top line had a higher speed because the line is physically higher. The second answer indicates that students haven't understood the notion of average speed and are just reporting a comparison of the final speeds (Pfinal-avg). The third answer corresponds to the correct analysis of the question (Pcorrect). Both objects travel the same total distance in the same time, and neither ever moves backwards, so they have the same average speed.</Paragraph>
<Paragraph position="2"> We analyzed the text of the responses to confirm that the students' descriptions of their reasoning matched the preconception predicted by the system based on their multiple choice answer. We found that it was necessary to create two additional classes. One class was added for students who wrote that they had guessed their answer or otherwise gave an irrelevant answer in the free text (Pmisc). Another class corresponded to a preconception that wasn't explicitly being tested for but which was clearly indicated by some students' responses. The explanations of several students who chose answer A indicate that they didn't confuse position and speed. Instead, they tried to compute the average speed of each object, but ignored the initial conditions of the system in answering this question (Pinitial).</Paragraph>
<Paragraph position="3"> [Figure 2, base question: Compare the average speeds of the two objects shown in the graph above. a) The average speed of A is greater. b) The average speed of B is greater. c) They have the same average speed.]</Paragraph>
<Paragraph position="6"> While the students who answered B tended to be confused about the notion of average speed, few of them specifically reported considering the final speeds. Rather, many of them commented that object A's motion was smooth, while object B moved in fits and starts. The system explicitly predicts a confusion of average speed with final speed. This shows that the vocabulary of the textual description of the preconception (e.g. &quot;final speed&quot;) isn't necessarily a good indicator of the way students will express their beliefs.</Paragraph>
<Paragraph position="7"> There were 88 responses to the secondary question.</Paragraph>
<Paragraph position="8"> Based solely on the answers to the base question, 61 answers were classified as Ppos-speed, 15 as Pfinal-avg and 12 as Pcorrect. After our manual analysis, the breakdown was 43 Ppos-speed answers, 10 Pfinal-avg answers, 5 Pinitial answers, 9 Pcorrect answers and 21 Pmisc answers.</Paragraph>
<Paragraph position="9"> As a baseline for comparison with the performance of our learned rules, we computed precision, recall and F-score measures for simply labeling each textual response with the preconception predicted by the student's answer to the base question. Precision is correct positives over correct positives plus incorrectly classified negatives (i.e. false positives). Recall is correct positives over all positives (correct + incorrect). The F-score is 2*precision*recall/(precision+recall). These results are shown in table 1. Note that each row of the table shows the breakdown of all 88 examples with respect to the classification of a particular preconception. Thus each row represents the performance of a single binary classifier on the entire dataset. The recall is always 1.000 or 0.000 because of the way the data are generated. The predictions implied by the students' answers to the base question are used, and only when a student's explanation indicated otherwise was the response reassigned to a different preconception class. Thus, for those classes that were contemplated by the creator of the base question, all positive examples were correctly labeled. Conversely, for preconceptions that weren't included in the base question formulation, no positive examples are correctly identified.</Paragraph>
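For concreteness, the per-class baseline metrics described above can be computed with a short script along the lines of the sketch below. The label lists and class names here are illustrative placeholders, not the actual dataset or the paper's code.

    # Sketch: per-class precision, recall and F-score for the baseline that
    # labels every response with the preconception implied by its base answer.
    # The label lists below are illustrative placeholders, not the real data.

    def per_class_metrics(true_labels, predicted_labels, classes):
        results = {}
        for c in classes:
            tp = sum(1 for t, p in zip(true_labels, predicted_labels) if p == c and t == c)
            fp = sum(1 for t, p in zip(true_labels, predicted_labels) if p == c and t != c)
            fn = sum(1 for t, p in zip(true_labels, predicted_labels) if p != c and t == c)
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f_score = (2 * precision * recall / (precision + recall)
                       if precision + recall else 0.0)
            results[c] = (precision, recall, f_score)
        return results

    classes = ["Ppos-speed", "Pfinal-avg", "Pcorrect", "Pinitial", "Pmisc"]
    # true_labels: classes assigned after manual analysis of the free text
    # predicted_labels: classes implied by the multiple-choice base answer
    true_labels = ["Ppos-speed", "Pinitial", "Pmisc", "Pcorrect"]              # placeholder
    predicted_labels = ["Ppos-speed", "Ppos-speed", "Pfinal-avg", "Pcorrect"]  # placeholder
    print(per_class_metrics(true_labels, predicted_labels, classes))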
<Paragraph position="10"> Because we have very little data in some categories (as few as five examples for one class and nine for another), we use a leave-one-out training and testing regime.</Paragraph>
<Paragraph position="11"> For each class, we construct a data set in which examples from that class are labeled positive and all other examples are labeled negative. We then cycle through every example, training on all the others and testing on the held-out example. Since our goal is to identify answers that indicate a particular preconception, we're primarily concerned with true and false positives. We report the number of examples correctly and incorrectly labeled, as well as the number of examples that the version space was unable to classify. Precision is calculated the same way as before, but recall is now calculated as correct positives over the sum of correct, incorrect and unclassified positive examples.</Paragraph>
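A rough sketch of this evaluation loop follows. The functions train_rules and classify are placeholders standing in for the paper's version-space rule learner (which may abstain), not its actual implementation.

    # Sketch: one-vs-rest leave-one-out evaluation with a learner that may
    # abstain. `train_rules` and `classify` are placeholders for the
    # version-space rule learner; they must be supplied by the caller.

    def leave_one_out(examples, labels, target_class, train_rules, classify):
        counts = {"correct_pos": 0, "incorrect_pos": 0, "unclassified_pos": 0,
                  "correct_neg": 0, "incorrect_neg": 0, "unclassified_neg": 0}
        for i, (x, y) in enumerate(zip(examples, labels)):
            train_x = examples[:i] + examples[i + 1:]
            train_y = [lbl == target_class for lbl in labels[:i] + labels[i + 1:]]
            rules = train_rules(train_x, train_y)
            prediction = classify(rules, x)      # True, False, or None (abstain)
            is_positive = (y == target_class)
            if prediction is None:
                counts["unclassified_pos" if is_positive else "unclassified_neg"] += 1
            elif prediction == is_positive:
                counts["correct_pos" if is_positive else "correct_neg"] += 1
            else:
                counts["incorrect_pos" if is_positive else "incorrect_neg"] += 1
        tp = counts["correct_pos"]
        fp = counts["incorrect_neg"]             # negatives wrongly labeled positive
        precision = tp / (tp + fp) if tp + fp else 0.0
        # Recall counts unclassified positives as misses, as described above.
        denom = counts["correct_pos"] + counts["incorrect_pos"] + counts["unclassified_pos"]
        recall = tp / denom if denom else 0.0
        return counts, precision, recall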
<Paragraph position="12"> Our initial results, shown in table 2, show that the algorithm correctly labels 48 of the 88 examples and mislabels none. While the precision of the algorithm is excellent, the recall needs improvement. The results also show that the behavior varies widely from one class to another. Clearly, for some preconceptions, the algorithm isn't generalizing enough.</Paragraph>
<Paragraph position="13"> Examining the rules produced by the algorithm, we found that part of the problem is the existence of very similar answers in different classes. In particular, the Pinitial class consists of answers where the student claimed that Object A had a higher average speed, but not because they confused position and speed, as the automated diagnostic system had inferred. These students not only understood the difference between position and speed, but knew that the formula for speed was change in position over elapsed time, though they misapplied that formula due to a different misconception. It was their explanations of their reasoning that led us to separate them into a different class. However, those explanations are extremely similar to those of students who knew the formula and applied it correctly. Since the answers were very similar, any generalization in one class would likely be restricted by negative examples from the other class.</Paragraph>
<Paragraph position="14"> In order to test this hypothesis, we reran the trials for these two classes without including Pcorrect examples as negative evidence for Pinitial, and vice versa. These results are shown in table 3. For the Pinitial class, the number of correctly labeled positive examples jumps from zero to three, which, while not much in absolute terms, represents a recall of 60% with no reduction in precision. The Pcorrect class had more limited gains, going from one to two correctly labeled examples, again with no loss of precision.</Paragraph>
<Paragraph position="15"> These improvements led us to ask whether negative examples were limiting generalization in other cases as well. In order to test this, we ran the same leave-one-out experiment using only positive examples to test the recall of the rules we were producing, and then used all the positive examples, with no negative examples, to learn a set of rules and tested those rules on all the negative examples.</Paragraph>
<Paragraph position="16"> The results of this experiment are shown in table 4. The performance of the algorithm has improved significantly.</Paragraph>
<Paragraph position="17"> The recall on positive examples for this trial is 89% and there are still no false positives.</Paragraph>
<Paragraph position="18"> While these results are promising, we would like to be able to make use of negative examples in our system. In the process of analyzing student response data by hand, we found that it was often helpful to look at the student's answer to the base question associated with a given follow-up question. It seemed likely that this information would also be useful to the rule learner. We added to each text response a pseudo-word indicating the student's base question response and reran the algorithm using negative examples. We included Pcorrect data as negative examples for Pinitial, and vice versa, because our hope was that the base response tags would allow the algorithm to create rules that wouldn't conflict with negative examples from another class, since those examples would have different tags. The results are shown in table 5. For most classes, the addition of the tags improved the performance over untagged data. This is even true in Pinitial, where all the tags were wrong (since those data came from students whose base response indicated Ppos-speed). In this class, the addition of tags allowed the same number of positive answers to be identified as the removal of negative evidence from the Pcorrect class did, implying that the tags served to avoid the trap of generalization-quashing negative evidence. However, in both these classes, the addition of tags led to some examples being incorrectly classified instead of just remaining unclassified.</Paragraph>
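One simple way to realize this tagging is sketched below. The pseudo-word format (&quot;BASE_A&quot;, &quot;BASE_B&quot;, &quot;BASE_C&quot;) and the function name are illustrative choices, not taken from the paper's implementation.

    # Sketch: prepend a pseudo-word encoding the student's base-question answer
    # to the free-text response before it is handed to the rule learner.
    # The tag format is an illustrative assumption, not the paper's actual one.

    def tag_response(base_answer, response_text):
        """Return the response with a pseudo-word marking the base answer."""
        pseudo_word = "BASE_" + base_answer.strip().upper()   # e.g. "BASE_A"
        return pseudo_word + " " + response_text

    # Example: a student who chose answer (b) and then explained their reasoning.
    print(tag_response("b", "object B speeds up at the end so its speed is higher"))
    # -> "BASE_B object B speeds up at the end so its speed is higher"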
<Paragraph position="19"> The only class where tag data posed a problem was the Pmisc class. This is not surprising, as this class contains data with a variety of tags. In some cases, responses that were exactly the same (e.g. two students who wrote &quot;I guessed.&quot;) were associated with different base question answers. This meant that the addition of different tags resulted in non-matching answers. However, this doesn't pose a great problem for the system. The Pmisc class is unusual in that it doesn't really correspond to a specific misconception, and the examples in that class come from students responding to the base question in many different ways. Classes of this type are easy to spot and can easily be trained on untagged data. This was, in fact, the class that did the best when trained on untagged data.</Paragraph>
<Paragraph position="20"> Had the Pmisc class been trained on untagged data, the total number of correctly classified positive examples would have been 66, for a recall of 75%. The use of tag data also increases the performance of the system on negative examples to over 99%.</Paragraph> </Section> </Paper>