<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0210">
  <Title>A Hybrid Text Classification Approach for Analysis of Student Essays</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We conducted an evaluation to compare the effectiveness of CarmelTC at analyzing student essays in comparison to LSA, Rainbow, and a purely symbolic approach similar to (Furnkranz et al., 1998), which we refer to here as CarmelTCsymb. CarmelTCsymb is identical to CarmelTC except that it does not include in its feature set the prediction from Rainbow. Thus, by comparing CarmelTC with Rainbow and LSA, we can demonstrate the superiority of our hybrid approach to purely bag of words approaches. And by comparing with CarmelTCsymb, we can demonstrate the superiority of our hybrid approach to an otherwise equivalent purely symbolic approach. null We conducted our evaluation over a corpus of 126 previously unseen student essays in response to the Pumpkin Problem described above, with a total of 500 text segments, and just under 6000 words altogether. We rst tested to see if the text segments could be reliably tagged by humans with the six possible Classes associated with the problem. Note that this includes nothing as a class, i.e., Class 6. Three human coders hand classi ed text segments for 20 essays. We computed a pairwise Kappa coef cient (Cohen, 1960) to measure the agreement between coders, which was always greater than .75, thus demonstrating good agreement according to the Krippendorf scale (Krippendorf, 1980). We then selected two coders to individually classify the remaining sentences in the corpus. They then met to come to a consensus on the tagging. The resulting consensus tagged corpus was used as a gold standard for this evaluation. Using this gold standard, we conducted a comparison of the four approaches on the problem of tallying the set of correct answer aspects present in each student essay.</Paragraph>
    <Paragraph position="1"> The LSA space used for this evaluation was trained over three rst year physics text books. The other three approaches are trained over a corpus of tagged examples using a 50 fold random sampling evaluation, similar to a cross-validation methodology. On each iteration, we randomly selected a subset of essays such that the number of text segments included in the test set were greater than 10 but less than 15. The randomly selected essays were then used as a test set for that iteration, and the remainder of the essays were used for training in addition to a corpus of 248 hand tagged example sentences extracted from a corpus of human-human tutoring transcripts in our domain. The training of the three approaches differed only in terms of how the training data was partitioned. Rainbow and CarmelTCsymb were trained using all of the example sentences in the corpus as a single training set. CarmelTC, on the other hand, required partitioning the training data into two subsets, one for training the Rainbow model used for generating the value of its Rainbow feature, and one subset for training the decision trees. This is because for CarmelTC, the data for training Rainbow must be separate from that used to train the decision trees so the decision trees are trained from a realistic distribution of assigned Rainbow classes based on its performance on unseen data rather than on Rainbow's training data.</Paragraph>
    <Paragraph position="2"> In setting up our evaluation, we made it our goal to present our competing approaches in the best possible light in order to provide CarmelTC with the strongest competitors as possible. Note that LSA works by using its trained LSA space to construct a vector representation for any text based on the set of words included therein. It can thus be used for text classi cation by comparing the vector obtained for a set of exemplar texts for each class with that obtained from the text to be classi ed. We tested LSA using as exemplars the same set of examples used  as Rainbow training data, but it always performed better when using a small set of hand picked exemplars. Thus, we present results here using only those hand picked exemplars. For every approach except LSA, we rst segmented the essays at sentence boundaries and classi ed each sentence separately. However, for LSA, rather than classify each segment separately, we compared the LSA vector for the entire essay to the exemplars for each class (other than nothing ), since LSA's performance is better with longer texts. We veri ed that LSA also performed better speci cally on our task under these circumstances.</Paragraph>
    <Paragraph position="3"> Thus, we compared each essay to each exemplar, and we counted LSA as identifying the corresponding correct answer aspect if the cosine value obtained by comparing the two vectors was above a threshold. We tested LSA with threshold values between .1 and .9 at increments of .1 as well as testing a threshold of .53 as is used in the AUTO-TUTOR system (Wiemer-Hastings et al., 1998). As expected, as the threshold increases from .1 to .9, recall and false alarm rate both decrease together as precision increases. We determined based on computing f-scores2 for each threshold level that .53 achieves the best trade off between precision and recall. Thus, we used a threshold of .53, to determine whether LSA identi ed the corresponding key point in the student essay or not for the evaluation presented here.</Paragraph>
    <Paragraph position="4"> We evaluated the four approaches in terms of precision, recall, false alarm rate, and f-score, which were computed for each approach for each test essay, and then averaged over the whole set of test essays. We computed precision by dividing the number of correct answer aspects (CAAs) correctly identi ed by the total number of CAAs identi ed3 We computed recall by dividing the number of CAAs correctly identi ed over the number of CAAs actually present in the essay4 False alarm rate was computed by dividing the number of CAAs incorrectly identi ed by the total number of CAAs that could potentially be incor2We computed our f-scores with a beta value of 1 in order to treat precision and recall as equally important.</Paragraph>
    <Paragraph position="5"> 3For essays containing no CAAs, we counted precision as 1 if none were identi ed and 0 otherwise.</Paragraph>
    <Paragraph position="6"> 4For essays with no CAAs present, we counted recall as 1 for all approaches.</Paragraph>
    <Paragraph position="7"> rectly identi ed5. F-scores were computed using 1 as the beta value in order to treat precision and recall as equally important.</Paragraph>
    <Paragraph position="8"> The results presented in Figure 3 clearly demonstrate that CarmelTC outperforms the other approaches. In particular, CarmelTC achieves the highest f-score, which combines the precision and recall scores into a single measure. In comparison with CarmelTCsymb, CarmelTC achieves a higher recall as well as a slightly higher precision. While LSA achieves a slightly higher precision, its recall is much lower. Thus, the difference between the two approaches is clearly shown in the f-score value, which strongly favors CarmelTC. Rainbow achieves a lower score than CarmelTC in terms of precision, recall, false alarm rate, and f-score.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML