<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0201">
  <Title>Marineau Heather Hite-Mitchell</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"> The classifier was used in AutoTutor sessions throughout the year of 2002. The log files from these sessions contained 9094 student utterances, each of which was classified by an expert. The expert ratings were compared to the classifier's ratings, forming a 2 x 2 contingency table for each category as in Table 4.</Paragraph>
    <Paragraph position="1"> To expedite ratings, utterances extracted from the log files were split into two groups, contributions and non-contributions, according to their logged classification. Expert judges were assigned to a group and instructed to classify a set of utterances to one of the 18 categories. Though inter-rater reliability using the kappa statistic (Carletta 1996) may be calculated for each group, the distribution of categories in the contribution group was highly skewed and warrants further discussion.</Paragraph>
    <Paragraph position="2"> Skewed categories bias the kappa statistic to low values even when the proportion of rater agreement is very high (Feinstein and Cicchetti 1990a; Feinstein and Cicchetti 1990b). In the contribution group, judges can expect to see mostly one category, contribution, whereas judges in the non-contribution group can expect to see the other 17 categories. Expected agreement by chance for the contribution group was 98%. Correspondingly, inter-rater reliability using the kappa statistic was low for the contribution group, .5 despite 99% proportion agreement, and high for non-contribution group, .93.</Paragraph>
    <Paragraph position="3"> However, the .93 inter-rater agreement can be extended to all of the utterance categories. Due to classifier error, the non-contribution group consisted of 38% contributions. Thus the .93 agreement applies to contributions in this group. Equal proportion of agreement for contribution classifications in both groups, 99%, suggests that the differences in kappa solely reflect differences in category skew across groups. Under this analysis, dividing the utterances into two groups improved the distribution of categories for the calculation of kappa (Feinstein and Cicchetti 1990b).</Paragraph>
    <Paragraph position="4"> Expert judges classified questions with a .93 kappa, which supports a monothetic classification scheme for this application. In Section 3 the possibility was raised of a polythetic scheme for question classification, i.e.</Paragraph>
    <Paragraph position="5"> one in which two categories could be assigned to a given question. If a polythetic scheme were truly necessary, one would expect inter-rater reliability to suffer in a monothetic classification task. High inter-rater reliability on the monothetic classification task renders polythetic schemes superfluous for this application.</Paragraph>
    <Paragraph position="6"> The recall column for evaluation in Table 4 is generally much higher than corresponding cells in the precision column. The disparity implies a high rate of false positives for each of the categories. One possible explanation is the reconstruction algorithm applied during classification. It was observed that, particularly in the language of physics, student used question stems in utterances that were not questions, e.g. &amp;quot;The ball will land when ...&amp;quot; Such falsely reconstructed questions account for 40% of the questions detected by the classifier.</Paragraph>
    <Paragraph position="7"> Whether modifying the reconstruction algorithm would improve F-measure, i.e. improve precision without sacrificing recall, is a question for future research.</Paragraph>
    <Paragraph position="8"> The distribution of categories is highly skewed: 97% of the utterances were contributions, and example questions never occurred at all. In addition to recall, fallout, precision, and F-measure, significance tests were calcu- null lated for each category's contingency table to insure that the cells were statistically significant. Since most of the categories had at least one cell with an expected value of less than 1, Fisher's exact test is more appropriate for significance testing than likelihood ratios or chi-square (Pedersen 1996). Those categories that are not significant are starred; all other categories are significant, p &lt; .001.</Paragraph>
    <Paragraph position="9"> Though not appropriate for hypothesis testing in this instance, likelihood ratios provide a comparison of classifier performance across categories. Likelihood ratios are particularly useful when comparing common and rare events (Dunning 1993; Plaunt and Norgard 1998), making them natural here given the rareness of most question categories and the frequency of contributions.</Paragraph>
    <Paragraph position="10"> The likelihood ratios in the rightmost column of Table 4 are on a natural logarithmic scale, -2lnl, so procedural at e . 5 x 20.23 = 24711 is more likely than goal orientation, at e . 5 x 14.49 = 1401, with respect to the base rate, or null hypothesis.</Paragraph>
    <Paragraph position="11"> To judge overall performance on the AutoTutor sessions, an average weighted F-measure may be calculated by summing the products of all category F-measures with their frequencies:  The average weighted F-measure reflects real world performance since accuracy on frequently occurring classes is weighted more. The average weighted F-measure for the evaluation data is .98, mostly due to the great frequency of contributions (.97 of all utterances) and the high associated F-measure. Without weighting, the average F-measure for the significant cells is .54.</Paragraph>
    <Paragraph position="12"> With respect to the three applications mentioned, i) tracking student understanding, ii) mixed-initiative dialogue, and iii) questions answering, the classifier is doing extremely well on the first two and adequately on the last. The first two applications for the most part require distinguishing questions from contributions, which the classifier does extremely well, F-measure = .99. Question answering, on the other hand, can benefit from more precise identification of the question type, and the average unweighted F-measure for the significant questions is .48.</Paragraph>
  </Section>
class="xml-element"></Paper>