<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0208">
  <Title>Automatic Evaluation of Students' Answers using Syntactically Enhanced LSA</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Results and Discussions
</SectionTitle>
    <Paragraph position="0"> We calculated the compatibility score evaluation using SELSA (LSA) in an analogous way to the human evaluation. Thus SELSA (LSA) would evaluate the answers in the following manner. It would first break each student-answer into a number of sentences and then evaluate each sentence against the good answers for that question. If the cosine measure between the SELSA (LSA) representation of the sentence and any good answer exceeded a predefined threshold then that part was considered correct. Thus it would find the fraction of the number of sentences in a student-answer that exceeded the threshold. We performed the experiments by varying threshold between 0:05 to 0:95 with a step of 0.05. We also varied the number of singular values R from 200 to 400 with a step of 50. In the following, we present our results using the three evaluation measures.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Correlation Analysis
</SectionTitle>
      <Paragraph position="0"> For each of the five SVD dimensions R and each value of the thresholds, we calculated the correlation coefficient between the SELSA (LSA) evaluation and each human rater's evaluation. Then we averaged this across the four human evaluators. The resulting average correlation curves for SELSA and LSA are shown in figs. (1) and (2) respectively.</Paragraph>
      <Paragraph position="1"> From these two figures we observe that maximum correlation between SELSA and human raters is 0:47 and that between LSA and human is 0:51 while the average inter-human correlation was 0:59. Thus LSA seems to be closer to human than SELSA in this particular tutoring task. This seems to support the arguments from (Landauer et al., 1997) that syntax plays little role, if any, in semantic similarity judgments and text comprehension.</Paragraph>
      <Paragraph position="2"> But the likely reason behind this could be that the corpus, particularly the student answers, contained very poor syntactic structure and also that human evaluators might not have paid attention to grammatical inaccuracies in this technical domain of computer literacy.</Paragraph>
      <Paragraph position="3"> But it is also worth noting that SELSA is closer to LSA than a previous approach of adding syntactic information to LSA (Wiemer-Hastings, 2000), which had a correlation of 0:40 compared to 0:49 of LSA on the same task of evaluating students' answers, where average inter-human correlation was 0:78 between the expert raters and 0:51 between the intermediate experts. SELSA  of LSA in a modified evaluation task of judging similarity between two sentences where the correlation between skilled raters was 0:45 and that between non-proficient raters was 0:35.</Paragraph>
      <Paragraph position="4"> If we look at these curves more carefully, especially, their behavior across thresholds, then it is interesting to note that SELSA has wider threshold-widths(TW) than LSA across all the cases of SVD dimension R. In table (1) and (2) we have shown the 10% and 20% TW of SELSA and LSA respectively. This is calculated by finding the range over thresholds for which the correlation is within 10% and 20% of the maximum correlation. This observation shows that SELSA is much more robust across thresholds than LSA in the sense that semantic information is discriminated better in SELSA space than in  Another interesting observation occurs when we plot the two curves simultaneously as shown in fig. (3). Here we plotted the SELSA and LSA performances for 250 dimensions of latent space. We can easily see that SELSA performs better than LSA for thresholds less than 0.5 and viceversa. This observation along with the previous observation about TW can be understood in the following manner. When comparing two document vectors for a cosine measure exceeding a threshold, we can consider one of the vectors to be the axis of a right circular cone with a semi-vertical angle decided by the threshold. If the other vector falls within this cone, we say the two documents are matching. Now if the human raters emphasized semantic similarity, which is most likely the case, then this means that LSA could best capture the same information in a narrower cone while SELSA required a wider cone. This is quite intuitive in the sense that SELSA has zoomed the document similarity measure axis by putting finer resolution of syntactic information. Thus mere semantically similar documents are placed wider apart in SELSA space than syntactic-semantically similar documents. This concept can be best used in a language modeling task where a word is to be predicted from the history. It is observed in (Kanejiya et al., 2003) that SELSA assigns better probabilities to syntactic-semantically regular words than LSA, although the overall perplexity reduction over a bi-gram language model was less than that</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Mean Absolute Difference Analysis
</SectionTitle>
      <Paragraph position="0"> Here we calculated the mean absolute difference(MAD) between a human rater's evaluation and SELSA (LSA) evaluations as follow:</Paragraph>
      <Paragraph position="2"> where, hi and li correspond to human and SELSA(LSA) evaluation of ith answer. This was then averaged across human evaluators. These results are plotted in figs. (4) and (5). These two curves show that SELSA and LSA are almost equal to each other. Again SELSA has the advantage of more robustness and in most cases it is even better than LSA in terms of minimum MAD with human. Tables (3) and (4) show values of minimum MAD at various values of SVD dimensions R. The best minimum MAD for SELSA is 0.2412 at 250 dimensional space while that for LSA is 0.2475 at 400 dimensions. The average MAD among human evaluators is 0.2050.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Correct vs False Evaluations Analysis
</SectionTitle>
      <Paragraph position="0"> We define an evaluationli by SELSA (LSA) to be correct or false as below: li CORRECT if jli hij&lt;CT li FALSE if jli hij&gt;FT where CT and FT are correctness and falsehood thresholds which were set to 0.05 and 0.95 respectively for strict measures. Number of such correct as well as false evaluations were then averaged across the four human evaluators. They are plotted in figs. (6) and (7) for SELSA and LSA respectively (the upper curves corresponding to correct and the lower ones to false evaluations). The maximum number of correct (maxCorrect) and the minimum number of false (minFalse) evaluations across the thresholds for each value of SVD dimensions are calculated and shown in tables (3) and (4). We observe that the best performance for SELSA is achieved at 300 dimensions with 126 correct and 30 false evaluations, while for LSA it is at 400 dimensions with 123 correct and 30 false evaluations. The average correct and false evaluations among all human-human evaluator pairs were 132 and 23 respectively. Thus here also SELSA is closer to human evaluators than LSA. In fact, for the cognitive task like AutoTutor, this is a more appealing and explicit measure than the previous two. Apart from these three measures, one can also calculate precision, recall and F-measure (Burstein et al., 2003) to evaluate the performance.</Paragraph>
      <Paragraph position="1">  pared to human evaluators</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>