<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0202">
  <Title>Automatic Short Answer Marking</Title>
  <Section position="4" start_page="0" end_page="10" type="metho">
    <SectionTitle>
2. The Data
</SectionTitle>
    <Paragraph position="0"> Consider the following GCSE biology question: Statement of the question The blood vessels help to maintain normal body temperature. Explain how the blood vessels reduce heat loss if the body temperature falls below normal.</Paragraph>
    <Paragraph position="1"> Marking Scheme (full mark 3)2 any three: vasoconstriction; explanation (of vasoconstriction); less blood flows to / through the skin / close to the surface; less heat loss to air/surrounding/from the blood / less radiation / conduction / convection; null Here is a sample of real answers:  1. all the blood move faster and dose not go near the top of your skin they stay close to the moses 2. The blood vessels stops a large ammount of blood  going to the blood capillary and sweat gland. This prents the presonne from sweating and loosing heat.</Paragraph>
    <Paragraph position="2"> 3. When the body falls below normal the blood vessels 'vasoconstrict' where the blood supply to the skin is cut off, increasing the metabolism of the 2 X;Y/D/K;V is equivalent to saying that each of X, [L]={Y, D,K}, and V deserves 1 mark. The student has to write only 2 of these to get the full mark. [L] denotes an equivalence class i.e. Y, D, K are equivalent. If the student writes Y and D s/he will get only 1 mark.</Paragraph>
    <Paragraph position="3">  body. This prevents heat loss through the skin, and causes the body to shake to increase metabolism. null It will be obvious that many answers are ungrammatical with many spelling mistakes, even if they contain more or less the right content. Thus using standard syntactic and semantic analysis methods will be difficult. Furthermore, even if we had fully accurate syntactic and semantic processing, many cases require a degree of inference that is beyond the state of the art, in at least the following respects: null * The need for reasoning and making inferences: a student may answer with we do not have to wait until Spring,which only implies the marking key it can be done at any time. Similarly, an answer such as don't have sperm or egg will get a 0 incorrectly if there is no mechanism to infer no fertilisation.</Paragraph>
    <Paragraph position="4"> * Students tend to use a negation of a negation (for an affirmative): An answer like won't be done only at a specific time is the equivalent to will be done at any time. An answer like it is not formed from more than one egg and sperm is the same as saying formed from one egg and sperm. This category is merely an instance of the need for more general reasoning and inference outlined above. We have given this case a separate category because here, the wording of the answer is not very different, while in the general case, the wording can be completely different.</Paragraph>
    <Paragraph position="5"> * Contradictory or inconsistent information: Other than logical contradiction like needs fertilisation and does not need fertilisation, an answer such as identical twins have the same chromosomes but different DNA holds inconsistent scientific information that needs to be detected.</Paragraph>
    <Paragraph position="6"> Since we were sceptical that existing deep processing NL systems would succeed with our data, we chose to adopt a shallow processing approach, trading robustness for complete accuracy. After looking carefully at the data we also discovered other issues which will affect assessment of the accuracy of any automated system, namely: * Unconventional expression for scientific knowledge: Examiners sometimes accept unconventional or informal ways of expressing scientific knowledge, for example, 'sperm and egg get together' for 'fertilisation'.</Paragraph>
    <Paragraph position="7"> * Inconsistency across answers: In some cases, there is inconsistency in marking across answers. Examiners sometimes make mistakes under pressure. Some biological information is considered relevant in some answers and irrelevant in others.</Paragraph>
    <Paragraph position="8"> In the following, we describe various implemented systems and report on their accuracy.</Paragraph>
    <Paragraph position="9"> We conclude with some current work and suggest a road map.</Paragraph>
    <Paragraph position="10"> 3. Information Extraction for Short Answers null In our initial experiments, we adopted an Information Extraction approach (see also Mitchell et al. 2003). We used an existing Hidden Markov Model part-of-speech (HMM POS) tagger trained on the Penn Treebank corpus, and a Noun Phrase (NP) and Verb Group (VG) finite state machine (FSM) chunker. The NP network was induced from the Penn Treebank, and then tuned by hand. The Verb Group FSM (i.e. the Hallidayean constituent consisting of the verbal cluster without its complements) was written by hand. Relevant missing vocabulary was added to the tagger from the tagged British National Corpus (after mapping from their tag set to ours), and from examples encountered in our training data. The tagger also includes some suffix-based heuristics for guessing tags for unknown words.</Paragraph>
    <Paragraph position="11"> In real information extraction, template merging and reference resolution are important components. Our answers display little redundancy, and are typically less than 5 lines long, and so template merging is not necessary. Anaphors do not occur very frequently, and when they do, they often refer back to entities introduced in the text of the question (to which the system does not have access). So at the cost of missing some correct answers, the information extraction components really consists of little more than a set of patterns applied to the tagged and chunked text.</Paragraph>
    <Paragraph position="12"> We wrote our initial patterns by hand, although we are currently working on the development of a tool to take most of the tedious effort out of this task. We base the patterns on recurring head words or phrases, with syntactic annotation where neces- null sary, in the training data. Consider the following example training answers: the egg after fertilisation splits in two the fertilised egg has divided into two The egg was fertilised it split in two One fertilised egg splits into two one egg fertilised which split into two 1 sperm has fertilized an egg.. that split into two These are all paraphrases of It is the same fertilised egg/embryo, and variants of what is written above could be captured by a pattern like:  singular_det + &lt;fertilised egg&gt; +{&lt;split&gt;; &lt;divide&gt;; &lt;break&gt;} + {in, into} + &lt;two_halves&gt;, where &lt;fertilised egg&gt; = NP with the content of 'fertilised egg' singular_det = {the, one, 1, a, an} &lt;split&gt; = {split, splits, splitting, has split, etc.} &lt;divide&gt; = {divides, which divide, has gone, being broken...} &lt;two_halves&gt; = {two, 2, half, halves} etc.</Paragraph>
    <Paragraph position="13">  The pattern basically is all the paraphrases collapsed into one. It is essential that the patterns use the linguistic knowledge we have at the moment, namely, the part-of-speech tags, the noun phrases and verb groups. In our previous example, the requirement that &lt;fertilised egg&gt; is an NP will exclude something like 'one sperm has fertilized an egg' while accept something like 'an egg which is fertilized ...'.</Paragraph>
    <Section position="1" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
System Architecture:
</SectionTitle>
      <Paragraph position="0"> &amp;quot;When the caterpillars are feeding on the tomato plants, a chemical is released from the plants&amp;quot;.</Paragraph>
      <Paragraph position="2"> Table 1 gives results for the current version of the system. For each of 9 questions, the patterns were developed using a training set of about 200 marked answers, and tested on 60 which were not released to us until the patterns had been written. Note that the full mark for each question ranges between 1-4.</Paragraph>
      <Paragraph position="3">  Column 3 records the percentage agreement between our system and the marks assigned by a human examiner. As noted earlier, we detected a certain amount of inconsistency with the marking scheme in the grades actually awarded. Column 4 reflects the degree of agreement between the grades awarded by our system and those which would have been awarded by following the marking scheme consistently. Notice that agreement is correlated with the mark scale: the system appears less accurate on multi-part questions. We adopted an extremely strict measure, requiring an exact match. Moving to a pass-fail criterion produces much higher agreement for questions 6 and 8.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="10" end_page="12" type="metho">
    <SectionTitle>
4. Machine Learning
</SectionTitle>
    <Paragraph position="0"> Of course, writing patterns by hand requires expertise both in the domain of the examination, and in computational linguistics. This requirement makes the commercial deployment of a system like this problematic, unless specialist staff are taken on. We have therefore been experimenting with ways in which a short answer marking system might be developed rapidly using machine learning methods on a training set of marked answers.</Paragraph>
    <Paragraph position="1"> Previously (Sukkarieh et al. 2003) we reported the results we obtained using a simple Nearest  Neighbour Classification techniques. In the following, we report our results using three different machine learning methods: Inductive Logic progamming (ILP), decision tree learning(DTL) and Naive Bayesian learning (Nbayes). ILP (Progol, Muggleton 1995) was chosen as a representative symbolic learning method. DTL and NBayes were chosen following the Weka (Witten and Frank, 2000) injunction to `try the simple things first'. With ILP, only 4 out of the 9 questions shown in the previous section were tested, due to resource limitations. With DTL and Nbayes, we conducted two experiments on all 9 questions.</Paragraph>
    <Paragraph position="2"> The first experiments show the results with non-annotated data; we then repeat the experiments with annotated data. Annotation in this context is a lightweight activity, simply consisting of a domain expert highlighting the part of the answer that deserves a mark. Our idea was to make this as simple a process as possible, requiring minimal software, and being exactly analogous to what some markers do with pencil and paper. As it transpired, this was not always straightforward, and does not mean that the training data is noiseless since sometimes annotating the data accurately requires non-adjacent components to be linked: we could not take account of this.</Paragraph>
    <Section position="1" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.1 Inductive Logic Programming
</SectionTitle>
      <Paragraph position="0"> For our problem, for every question, the set of training data consists of students' answers, to that question, in a Prologised version of their textual form, with no syntactic analysis at all initially. We supplied some `background knowledge' predicates based on the work of (Junker et al. 1999). Instead of using their 3 Prolog basic predicates, however, we only defined 2, namely, wordpos(Text,Word,Pos) which represents words and their position in the text and window(Pos2-Pos1,Word1,Word2) which represents two words occurring within a Pos2-Pos1 window distance.</Paragraph>
      <Paragraph position="1"> After some initial experiments, we believed that a stemmed and tagged training data should give better results and that window should be made independent to occur in the logic rules learned by Progol. We used our POS tagger mentioned above and the Porter stemmer (Porter 1980). We set the Progol noise parameter to 10%, i.e. the rules do not have to fit the training data perfectly. They can be more general. The percentages of agreement are shown in table 23. The results reported are on a 5fold cross validation testing and the agreement is on whether an answer is marked 0 or a mark &gt;0, i.e. pass-fail, against the human examiner scores.</Paragraph>
      <Paragraph position="2"> The baseline is the number of answers with the most common mark multiplied by 100 over the  The results of the experiment are not very promising. It seems very hard to learn the rules with ILP. Most rules state that an answer is correct if it contains a certain word, or two certain words within a  predefined distance. A question such as 7, though, scores reasonably well. This is because Progol learns a rule such as mark(Answer) only if wordpos(Answer,'shiver', Pos) which is, according to its marking scheme, all it takes to get its full mark, 1. ILP has in effect found the single keyword that  the examiners were looking for.</Paragraph>
      <Paragraph position="3"> Recall that we only have ~200 answers for training. By training on a larger set, the learning algorithm may be able to find more structure in the answers and may come up with better results.</Paragraph>
      <Paragraph position="4"> However, the rules learned may still be basic since, with the background knowledge we have supplied the ILP learner always tries to find simple and small predicates over (stems of) keywords.</Paragraph>
    </Section>
    <Section position="2" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
4.2 Decision Tree Learning and Bayesian
Learning
</SectionTitle>
      <Paragraph position="0"> In our marking problem, seen as a machine learning problem, the outcome or target attribute is well-defined. It is the mark for each question and its values are {0,1, ..., full_mark}. The input attributes could vary from considering each word to be an attribute or considering deeper linguistic features like a head of a noun phrase or a verb group to be an attribute, etc. In the following experiments, each word in the answer was considered to be an attribute. Furthermore, Rennie et al. (2003)  propose simple heuristic solutions to some problems with naive classifiers. In Weka, Complement of Naive Bayes (CNBayes) is a refinement to the selection process that Naive Bayes makes when faced with instances where one outcome value has more training data than another. This is true in our case. Hence, we ran our experiments using this algorithm also to see if there were any differences.</Paragraph>
      <Paragraph position="1"> The results reported are on a 10-fold cross validation testing.</Paragraph>
      <Paragraph position="2">  We first considered the non-annotated data, that is, the answers given by students in their raw form.</Paragraph>
      <Paragraph position="3"> The first experiment considered the values of the marks to be {0,1, ..., full_mark} for each question. The results of decision tree learning and Bayesian learning are reported in the columns titled DTL1 and NBayes/CNBayes1. The second experiment considered the values of the marks to be either 0 or &gt;0, i.e. we considered two values only, pass and fail. The results are reported in columns DTL2 and NBayes2/CNBayes2. The baseline is calculated the same way as in the ILP case. Obviously, the result of the baseline differs in each experiment only when the sum of the answers with marks greater than 0 exceeds that of those with mark 0. This affected questions 8 and 9 in Table 3 below. Hence, we took the average of both results. It was no surprise that the results of the second experiment were better than the first on questions with the full mark &gt;1, since the number of target features is smaller.</Paragraph>
      <Paragraph position="4"> In both experiments, the complement of Naive Bayes did slightly better or equally well on questions with a full mark of 1, like questions 4 and 7 in the table, while it resulted in a worse performance on questions with full marks &gt;1.</Paragraph>
      <Paragraph position="5">  on non-annotated data.</Paragraph>
      <Paragraph position="6"> Since we were using the words as attributes, we expected that in some cases stemming the words in the answers would improve the results. Hence, we experimented with the answers of 6, 7, 8 and 9 from the list above but there was only a tiny improvement (in question 8). Stemming does not necessarily make a difference if the attributes/words that make a difference appear in a root form already. The lack of any difference or worse performance may also be due to the error rate in the stemmer.</Paragraph>
      <Paragraph position="7">  We repeated the second experiments with the annotated answers. The baseline for the new data differs and the results are shown in Table 4.</Paragraph>
      <Paragraph position="8">  on annotated data.</Paragraph>
      <Paragraph position="9"> As we said earlier, annotation in this context simply means highlighting the part of the answer that deserves 1 mark (if the answer has &gt;=1 mark), so for e.g. if an answer was given a 2 mark then at least two pieces of information should be highlighted and answers with 0 mark stay the same. Obviously, the first experiments could not be conducted since with the annotated answers the mark is either 0 or 1. Bayesian learning is doing better than DTL and 88% is a promising result. Furthermore, given the results of CNBayes in Table 3, we expected that CNBayes would do better on questions 4 and 7. However, it actually did better on questions 3, 4, 6 and 9. Unfortunately, we cannot see a pattern or a reason for this.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="12" end_page="14" type="metho">
    <SectionTitle>
5. Comparison of Results
</SectionTitle>
    <Paragraph position="0"> IE did best on all the questions before annotating the data as it can be seen in Fig. 1. Though, the training data for the machine learning algorithms is  tiny relative to what usually such algorithms consider, after annotating the data, the performance of NBayes on questions 3, 6 and 8 were better than IE. This is seen in Fig. 2. However, as we said earlier in section 2, the percentages shown for IE method are on the whole mark while the results of DTL and Nbayes, after annotation, are calculated on pass-fail.</Paragraph>
    <Paragraph position="1"> F ig. 1. IE vs D T L &amp; N bayes pre-anno tatio n  In addition, in the pre-annotation experiments reported in Fig. 1, the NBayes algorithm did better than that of DTL. Post-annotation, results in Fig. 2 show, again, that NBayes is doing better than the DTL algorithm. It is worth noting that, in the annotated data, the number of answers whose marks are 0 is less than in the answers whose mark is 1, except for questions 1 and 2. This may have an effect on the results.</Paragraph>
    <Paragraph position="2">  Moreover, after getting the worse performance in NBayes2 before annotation, question 8 jumps to best performance. The rest of the questions maintained the same position more or less, with question 3 always coming nearest to the top (see Fig. 3). We noted that Count(Q,1)-Count(Q,0) is highest for questions 8 and 3, where Count(Q,N) is, for question Q, the number of answers whose mark is N. Also, the improvement of performance for question 8 in relation to Count(8,1) was not surprising, since question 8 has a full-mark of 4 and the annotation's role was an attempt at a one-to-one correspondence between an answer and 1  On the other hand, question 1 that was in seventh place in DTL2 before annotation, jumps down to the worst place after annotation. In both cases, namely, NBayes2 and DTL2 after annotation, it seems reasonable to hypothesize that P(Q1) is better than P(Q2) if Count(Q1,1)-Count(Q1,0) &gt;&gt; Count(Q2,1)-Count(Q2,0), where P(Q) is the percentage of agreement for question Q.</Paragraph>
    <Paragraph position="3"> As they stand, the results of agreement with given marks are encouraging. However, the models that the algorithms are learning are very naive in the sense that they depend on words only. Unlike the IE approach, it would not be possible to provide a reasoned justification for a student as to why they have got the mark they have. One of the advantages to the pattern-matching approach is that it is very easy, knowing which patterns have matched, to provide some simple automatic feed-back to the student as to which components of the answer were responsible for the mark awarded.</Paragraph>
    <Paragraph position="4"> We began experimenting with machine learning methods in order to try to overcome the IE customisation bottleneck. However, our experience so far has been that in short answer marking (as opposed to essay marking) these methods are, while promising, not accurate enough at present to be a real alternative to the hand-crafted, pattern- null matching approach. We should instead think of them either as aids to the pattern writing process for example, frequently the decision trees that are learned are quite intuitive, and suggestive of useful patterns - or perhaps as complementary supporting assessment techniques to give extra confirmation.</Paragraph>
  </Section>
class="xml-element"></Paper>