A Hybrid Text Classi cation Approach for Analysis of Student Essays
Carolyn P. Ros·e, Antonio Roque, Dumisizwe Bhembe, Kurt Vanlehn
Learning Research and Development Center, University of Pittsburgh,
3939 O’Hara St., Pittsburgh, PA 15260
rosecp,roque,bhembe,vanlehn@pitt.edu
Abstract
We present CarmelTC, a novel hybrid text clas-
si cation approach for analyzing essay answers
to qualitative physics questions, which builds
upon work presented in (Ros·e et al., 2002a).
CarmelTC learns to classify units of text based
on features extracted from a syntactic analysis
of that text as well as on a Naive Bayes clas-
si cation of that text. We explore the trade-
offs between symbolic and  bag of words ap-
proaches. Our goal has been to combine the
strengths of both of these approaches while
avoiding some of the weaknesses. Our evalu-
ation demonstrates that the hybrid CarmelTC
approach outperforms two  bag of words ap-
proaches, namely LSA and a Naive Bayes, as
well as a purely symbolic approach.
1 Introduction
In this paper we describe CarmelTC, a novel hybrid
text classi cation approach for analyzing essay answers
to qualitative physics questions. In our evaluation we
demonstrate that the novel hybrid CarmelTC approach
outperforms both Latent Semantic Analysis (LSA) (Lan-
dauer et al., 1998; Laham, 1997) and Rainbow (Mc-
Callum, 1996; McCallum and Nigam, 1998), which is
a Naive Bayes approach, as well as a purely symbolic
approach similar to (Furnkranz et al., 1998). Whereas
LSA and Rainbow are pure  bag of words approaches,
CarmelTC is a rule learning approach where rules for
classifying units of text rely on features extracted from
a syntactic analysis of that text as well as on a  bag
of words classi cation of that text. Thus, our evalu-
ation demonstrates the advantage of combining predic-
tions from symbolic and  bag of words approaches for
text classi cation. Similar to (Furnkranz et al., 1998),
neither CarmelTC nor the purely symbolic approach re-
quire any domain speci c knowledge engineering or text
annotation beyond providing a training corpus of texts
matched with appropriate classi cations, which is also
necessary for Rainbow, and to a much lesser extent for
LSA.
CarmelTC was developed for use inside of the Why2-
Atlas conceptual physics tutoring system (VanLehn et al.,
2002; Graesser et al., 2002) for the purpose of grad-
ing short essays written in response to questions such as
 Suppose you are running in a straight line at constant
speed. You throw a pumpkin straight up. Where will it
land? Explain. This is an appropriate task domain for
pursuing questions about the bene ts of tutorial dialogue
for learning because questions like this one are known
to elicit robust, persistent misconceptions from students,
such as  heavier objects exert more force. (Hake, 1998;
Halloun and Hestenes, 1985). In Why2-Atlas, a stu-
dent  rst types an essay answering a qualitative physics
problem. A computer tutor then engages the student in
a natural language dialogue to provide feedback, cor-
rect misconceptions, and to elicit more complete expla-
nations. The  rst version of Why2-Atlas was deployed
and evaluated with undergraduate students in the spring
of 2002; the system is continuing to be actively devel-
oped (Graesser et al., 2002).
In contrast to many previous approaches to automated
essay grading (Burstein et al., 1998; Foltz et al., 1998;
Larkey, 1998), our goal is not to assign a letter grade
to student essays. Instead, our purpose is to tally which
set of  correct answer aspects are present in student es-
says. For example, we expect satisfactory answers to the
example question above to include a detailed explana-
tion of how Newton’s  rst law applies to this scenario.
From Newton’s  rst law, the student should infer that the
pumpkin and the man will continue at the same constant
horizontal velocity that they both had before the release.
Thus, they will always have the same displacement from
the point of release. Therefore, after the pumpkin rises
and falls, it will land back in the man’s hands. Our goal
is to coach students through the process of constructing
good physics explanations. Thus, our focus is on the
physics content and not the quality of the student’s writ-
ing, in contrast to (Burstein et al., 2001).
2 Student Essay Analysis
We cast the Student Essay Analysis problem as a text
classi cation problem where we classify each sentence in
the student’s essay as an expression one of a set of  cor-
rect answer aspects , or  nothing in the case where no
 correct answer aspect was expressed.
After a student attempts an initial answer to the ques-
tion, the system analyzes the student’s essay to assess
which key points are missing from the student’s argu-
ment. The system then uses its analysis of the student’s
essay to determine which help to offer that student. In
order to do an effective job at selecting appropriate inter-
ventions for helping students improve their explanations,
the system must perform a highly accurate analysis of the
student’s essay. Identifying key points as present in es-
says when they are not (i.e., false alarms), cause the sys-
tem to miss opportunities to help students improve their
essays. On the other hand, failing to identify key points
that are indeed present in student essays causes the sys-
tem to offer help where it is not needed, which can frus-
trate and even confuse students. A highly accurate inven-
tory of the content of student essays is required in order
to avoid missing opportunities to offer needed instruction
and to avoid offering inappropriate feedback, especially
as the completeness of student essays increases (Ros·e et
al., 2002a; Ros·e et al., 2002c).
In order to compute which set of key points, i.e.,  cor-
rect answer aspects , are included in a student essay, we
 rst segment the essay at sentence boundaries. Note that
run-on sentences are broken up. Once an essay is seg-
mented, each segment is classi ed as corresponding to
one of the set of key points or  nothing if it does not
include any key point. We then take an inventory of the
classi cations other than  nothing that were assigned to
at least one segment. Thus, our approach is similar in
spirit to that taken in the AUTO-TUTOR system (Wiemer-
Hastings et al., 1998), where Latent Semantic Analysis
(LSA) (Landauer et al., 1998; Laham, 1997) was used to
tally which subset of  correct answer aspects students
included in their natural language responses to short es-
say questions about computer literacy.
We performed our evaluation over essays collected
from students interacting with our tutoring system in re-
sponse to the question  Suppose you are running in a
straight line at constant speed. You throw a pumpkin
straight up. Where will it land? Explain. , which we refer
to as the Pumpkin Problem. Thus, there are a total of six
alternative classi cations for each segment:
Class 1 Sentence expresses the idea that after the release
the only force acting on the pumpkin is the down-
ward force of gravity.
Class 2 Sentence expresses the idea that the pumpkin
continues to have a constant horizontal velocity after
it is released.
Class 3 Sentence expresses the idea that the horizontal
velocity of the pumpkin continues to be equal to the
horizontal velocity of the man.
Class 4 Sentence expresses the idea that the pumpkin
and runner cover the same distance over the same
time.
Class 5 Sentence expresses the idea that the pumpkin
will land on the runner.
Class 6 Sentence does not adequately express any of the
above speci ed key points.
Note that this classi cation task is strikingly different
from those typically used for evaluating text classi ca-
tion systems. First, these classi cations represent spe-
ci c whole propositions rather than general topics, such
as those used for classifying web pages (Craven et al.,
1998), namely  student ,  faculty ,  staff , etc. Sec-
ondly, the texts are much shorter, i.e., one sentence in
comparison with a whole web page, which is a disadvan-
tage for  bag of words approaches.
In some cases what distinguishes sentences from one
class and sentences from another class is very subtle.
For example,  Thus, the pumpkin’s horizontal velocity,
which is equal to that of the man when he released it, will
remain constant. belongs to Class 2 although it could
easily be mistaken for Class 3. Similarly,  So long as
no other horizontal force acts upon the pumpkin while it
is in the air, this velocity will stay the same. , belongs
to Class 2 although looks similar on the surface to ei-
ther Class 1 or 3. A related problem is that sentences
that should be classi ed as  nothing may look very sim-
ilar on the surface to sentences belonging to one or more
of the other classes. For example,  It will land on the
ground where the runner threw it up. contains all of the
words required to correctly express the idea correspond-
ing to Class 5, although it does not express this idea, and
in fact expresses a wrong idea. These very subtle distinc-
tions also pose problems for  bag of words approaches
since they base their decisions only on which words are
present regardless of their order or the functional relation-
ships between them. That might suggest that a symbolic
approach involving syntactic and semantic interpretation
might be more successful. However, while symbolic ap-
proaches can be more precise than  bag of words ap-
proaches, they are also more brittle. And approaches that
rely both on syntactic and semantic interpretation require
a larger knowledge engineering effort as well.
3 CarmelTC
Figure 1: This example shows the deep syntactic parse
of a sentence.
Sentence: The pumpkin moves slower because the
man is not exerting a force on it.
Deep Syntactic Analysis
((clause2
((mood *declarative)
(root move)
(tense present)
(subj
((cat dp)(root pumpkin)
(speci er ((cat detp)(def +)(root the)))
(modi er ((car adv) (root slow)))))))
(clause2
(mood *declarative)
(root exert)
(tense present)
(negation +)
(causesubj
((cat dp)(root man)(agr 3s)
(speci er
((cat detp)(def +)(root the)))))
(subj
((cat dp)(root force)
(speci er ((cat detp)(root a)))))
(obj ((cat dp)(root it))))
(connective because))
The hybrid CarmelTC approach induces decision trees
using features from both a deep syntactic functional anal-
ysis of an input text as well as a prediction from the Rain-
bow Naive Bayes text classi er (McCallum, 1996; Mc-
Callum and Nigam, 1998) to make a prediction about the
correct classi cation of a sentence. In addition, it uses
features that indicate the presence or absence of words
found in the training examples. Since the Naive Bayes
classi cation of a sentence is more informative than any
single one of the other features provided, CarmelTC can
be conceptualized as using the other features to decide
whether or not to believe the Naive Bayes classi cation,
and if not, what to believe instead.
From the deep syntactic analysis of a sentence, we ex-
tract individual features that encode functional relation-
Figure 2: This example shows the features extracted
from the deep syntactic parse of a sentence.
Sentence: The pumpkin moves slower because the
man is not exerting a force on it.
Extracted Features
(tense-move present)
(subj-move pumpkin)
(speci er-pumpkin the)
(modi er-move slow)
(tense-exert present)
(negation-exert +)
(causesubj-exert man)
(subj-exert force)
(obj-exert it)
(speci er-force a)
(speci er-man the)
ships between syntactic heads (e.g., (subj-throw man)),
tense information (e.g., (tense-throw past)), and infor-
mation about passivization and negation (e.g., (negation-
throw +) or (passive-throw -)). See Figures 1 and 2. Rain-
bow has been used for a wide range of text classi cation
tasks. With Rainbow, P(doc,Class), i.e., the probability of
a document belonging to class Class, is estimated by mul-
tiplying P(Class), i.e., the prior probability of the class,
by the product over all of the words a0a2a1 found in the text of
a3a5a4
a0a7a6a9a8a10a12a11a14a13a9a15a16a15a18a17 , i.e., the probability of the word given that
class. This product is normalized over the prior probabil-
ity of all words. Using the individual features extracted
from the deep syntactic analysis of the input as well as
the  bag of words Naive Bayes classi cation of the in-
put sentence, CarmelTC builds a vector representation
of each input sentence, with each vector position corre-
sponding to one of these features. We then use the ID3
decision tree learning algorithm (Mitchell, 1997; Quin-
lin, 1993) to induce rules for identifying sentence classes
based on these feature vectors.
The symbolic features used for the CarmelTC ap-
proach are extracted from a deep syntactic functional
analysis constructed using the CARMEL broad coverage
English syntactic parsing grammar (Ros·e, 2000) and the
large scale COMLEX lexicon (Grishman et al., 1994),
containing 40,000 lexical items. For parsing we use an
incremental version of the LCFLEX robust parser (Ros·e
et al., 2002b; Ros·e and Lavie, 2001), which was designed
for ef cient, robust interpretation. While computing a
deep syntactic analysis is more computationally expen-
sive than computing a shallow syntactic analysis, we can
do so very ef ciently using the incrementalized version
of LCFLEX because it takes advantage of student typ-
ing time to reduce the time delay between when students
submit their essays and when the system is prepared to
respond.
Syntactic feature structures produced by the CARMEL
grammar factor out those aspects of syntax that modify
the surface realization of a sentence but do not change
its deep functional analysis. These aspects include tense,
negation, mood, modality, and syntactic transformations
such as passivization and extraction. In order to do this
reliably, the component of the grammar that performs the
deep syntactic analysis of verb argument functional re-
lationships was generated automatically from a feature
representation for each of COMLEX’s verb subcatego-
rization tags. It was veri ed that the 91 verb subcatego-
rization tags documented in the COMLEX manual were
covered by the encodings, and thus by the resulting gram-
mar rules. These tags cover a wide range of patterns of
syntactic control and predication relationships. Each tag
corresponds to one or more case frames. Each case frame
corresponds to a number of different surface realizations
due to passivization, relative clause extraction, and wh-
movement. Altogether there are 519 syntactic patterns
covered by the 91 subcategorization tags, all of which are
covered by the grammar.
There are nine syntactic functional roles assigned by
the grammar. These roles include subj (subject), caus-
esubj (causative subject), obj (object), iobj (indirect ob-
ject), pred (descriptive predicate, like an adjectival phrase
or an adverb phrase), comp (a clausal complement), mod-
i er, and possessor. The roles pertaining to the rela-
tionship between a verb and its arguments are assigned
based on the subcat tags associated with verbs in COM-
LEX. However, in some cases, arguments that COM-
LEX assigns the role of subject get rede ned as caus-
esubj (causative subject). For example, the subject in  the
pumpkin moved is just a subject but in  the man moved
the pumpkin , the subject would get the role causesubj
instead since ’move’ is a causative-inchoative verb and
the obj role is  lled in in the second case 1. The modi er
role is used to specify the relationship between any syn-
tactic head and its adjunct modi ers. Possessor is used
to describe the relationship between a head noun and its
genitive speci er, as in man in either  the man’s pump-
kin or  the pumpkin of the man .
With the hybrid CarmelTC approach, our goal has been
to keep as many of the advantages of both symbolic anal-
ysis as well as  bag of words classi cation approaches
as possible while avoiding some of the pitfalls of each.
Since the CarmelTC approach does not use the syntactic
analysis as a whole, it does not require that the system be
able to construct a totally complete and correct syntactic
analysis of the student’s text input. It can very effectively
1The causative-inchoative verb feature is one that we added
to verb entries in COMLEX, not one of the features provided
by the lexicon originally.
make use of partial parses. Thus, it is more robust than
purely symbolic approaches where decisions are based on
complete analyses of texts. And since it makes use only
of the syntactic analysis of a sentence, rather than also
making use of a semantic interpretation, it does not re-
quire any sort of domain speci c knowledge engineering.
And yet the syntactic features provide information nor-
mally not available to  bag of words approaches, such
as functional relationships between syntactic heads and
scope of negation and other types of modi ers.
4 Related Work: Combining Symbolic and
Bag of Words Approaches
CarmelTC is most similar to the text classi cation ap-
proach described in (Furnkranz et al., 1998). In the ap-
proach described in (Furnkranz et al., 1998), features that
note the presence or absence of a word from a text as
well as extraction patterns from AUTOSLOG-TS (Riloff,
1996) form the feature set that are input to the RIPPER
(Cohen, 1995), which learns rules for classifying texts
based on these features. CarmelTC is similar in spirit
in terms of both the sorts of features used as well as the
general sort of learning approach. However, CarmelTC is
different from (Furnkranz et al., 1998) in several respects.
Where (Furnkranz et al., 1998) make use of
AUTOSLOG-TS extraction patterns, CarmelTC makes
use of features extracted from a deep syntactic analysis
of the text. Since AUTOSLOG-TS performs a surface
syntactic analysis, it would assign a different representa-
tion to all aspects of these texts where there is variation in
the surface syntax. Thus, the syntactic features extracted
from our syntactic analyses are more general. For exam-
ple, for the sentence  The force was applied by the man
to the object , our grammar assigns the same functional
roles as for  The man applied the force to the object and
also for the noun phrase  the man that applied the force to
the object . This would not be the case for AUTOSLOG-
TS.
Like (Furnkranz et al., 1998), we also extract word
features that indicate the presence or absence of a root
form of a word from the text. However, in contrast for
CarmelTC one of the features for each training text that is
made available to the rule learning algorithm is the clas-
si cation obtained using the Rainbow Naive Bayes clas-
si er (McCallum, 1996; McCallum and Nigam, 1998).
Because the texts classi ed with CarmelTC are so
much shorter than those of (Furnkranz et al., 1998), the
feature set provided to the learning algorithm was small
enough that it was not necessary to use a learning algo-
rithm as sophisticated as RIPPER (Cohen, 1995). Thus,
we used ID3 (Mitchell, 1997; Quinlin, 1993) instead with
excellent results. Note that in contrast to CarmelTC, the
(Furnkranz et al., 1998) approach is purely symbolic.
Thus, all of its features are either word level features or
surface syntactic features.
Recent work has demonstrated that combining multi-
ple predictors yields combined predictors that are supe-
rior to the individual predictors in cases where the in-
dividual predictors have complementary strengths and
weaknesses (Larkey and Croft, 1996; Larkey and Croft,
1995). We have argued that this is the case with symbolic
and  bag of words approaches. Thus, we have reason to
expect a hybrid approach that makes a prediction based
on a combination of these single approaches would yield
better results than either of these approaches alone. Our
results presented in Section 5 demonstrate that this is true.
Other recent work has demonstrated that symbolic and
 Bag of Words approaches can be productively com-
bined. For example, syntactic information can be used
to modify the LSA space of a verb in order to make LSA
sensitive to different word senses (Kintsch, 2002). How-
ever, this approach has only been applied to the analysis
of mono-transitive verbs. Furthermore, it has never been
demonstrated to improve LSA’s effectiveness at classify-
ing texts.
In the alternative Structured Latent Semantic Analy-
sis (SLSA) approach, hand-coded subject-predicate in-
formation was used to improve the results obtained by
LSA for text classi cation (Wiemer-Hastings and Zipi-
tria, 2001), but no fully automated evaluation of this ap-
proach has been published.
In contrast to these two approaches, CarmelTC is both
fully automatic, in that the symbolic features it uses are
obtained without any hand coding whatsoever, and fully
general, in that it applies to the full range of verb subcat-
egorization frames covered by the COMLEX lexicon, not
only mono-transitive verbs. In Section 5 we demonstrate
that CarmelTC outperforms both LSA and Rainbow, two
alternative bag of words approaches, on the task of stu-
dent essay analysis.
5 Evaluation
We conducted an evaluation to compare the effective-
ness of CarmelTC at analyzing student essays in compar-
ison to LSA, Rainbow, and a purely symbolic approach
similar to (Furnkranz et al., 1998), which we refer to
here as CarmelTCsymb. CarmelTCsymb is identical to
CarmelTC except that it does not include in its feature
set the prediction from Rainbow. Thus, by comparing
CarmelTC with Rainbow and LSA, we can demonstrate
the superiority of our hybrid approach to purely  bag of
words approaches. And by comparing with CarmelTC-
symb, we can demonstrate the superiority of our hybrid
approach to an otherwise equivalent purely symbolic ap-
proach.
We conducted our evaluation over a corpus of 126 pre-
viously unseen student essays in response to the Pumpkin
Problem described above, with a total of 500 text seg-
ments, and just under 6000 words altogether. We  rst
tested to see if the text segments could be reliably tagged
by humans with the six possible Classes associated with
the problem. Note that this includes  nothing as a class,
i.e., Class 6. Three human coders hand classi ed text
segments for 20 essays. We computed a pairwise Kappa
coef cient (Cohen, 1960) to measure the agreement be-
tween coders, which was always greater than .75, thus
demonstrating good agreement according to the Krippen-
dorf scale (Krippendorf, 1980). We then selected two
coders to individually classify the remaining sentences in
the corpus. They then met to come to a consensus on
the tagging. The resulting consensus tagged corpus was
used as a gold standard for this evaluation. Using this
gold standard, we conducted a comparison of the four
approaches on the problem of tallying the set of  correct
answer aspects present in each student essay.
The LSA space used for this evaluation was trained
over three  rst year physics text books. The other three
approaches are trained over a corpus of tagged examples
using a 50 fold random sampling evaluation, similar to a
cross-validation methodology. On each iteration, we ran-
domly selected a subset of essays such that the number
of text segments included in the test set were greater than
10 but less than 15. The randomly selected essays were
then used as a test set for that iteration, and the remain-
der of the essays were used for training in addition to a
corpus of 248 hand tagged example sentences extracted
from a corpus of human-human tutoring transcripts in
our domain. The training of the three approaches dif-
fered only in terms of how the training data was parti-
tioned. Rainbow and CarmelTCsymb were trained us-
ing all of the example sentences in the corpus as a single
training set. CarmelTC, on the other hand, required parti-
tioning the training data into two subsets, one for training
the Rainbow model used for generating the value of its
Rainbow feature, and one subset for training the decision
trees. This is because for CarmelTC, the data for train-
ing Rainbow must be separate from that used to train the
decision trees so the decision trees are trained from a re-
alistic distribution of assigned Rainbow classes based on
its performance on unseen data rather than on Rainbow’s
training data.
In setting up our evaluation, we made it our goal to
present our competing approaches in the best possible
light in order to provide CarmelTC with the strongest
competitors as possible. Note that LSA works by using
its trained LSA space to construct a vector representation
for any text based on the set of words included therein. It
can thus be used for text classi cation by comparing the
vector obtained for a set of exemplar texts for each class
with that obtained from the text to be classi ed. We tested
LSA using as exemplars the same set of examples used
Figure 3: This Table compares the performance of the 4 alternative approaches in the per essay evaluation in
terms of precision, recall, false alarm rate, and f-score.
Approach Precision Recall False Alarm Rate F-Score
LSA 93% 54% 3% .70
Rainbow 81% 73% 9% .77
CarmelTCsymb 88% 72% 7% .79
CarmelTC 90% 80% 8% .85
as Rainbow training data, but it always performed better
when using a small set of hand picked exemplars. Thus,
we present results here using only those hand picked ex-
emplars. For every approach except LSA, we  rst seg-
mented the essays at sentence boundaries and classi ed
each sentence separately. However, for LSA, rather than
classify each segment separately, we compared the LSA
vector for the entire essay to the exemplars for each class
(other than  nothing ), since LSA’s performance is better
with longer texts. We veri ed that LSA also performed
better speci cally on our task under these circumstances.
Thus, we compared each essay to each exemplar, and we
counted LSA as identifying the corresponding  correct
answer aspect if the cosine value obtained by compar-
ing the two vectors was above a threshold. We tested
LSA with threshold values between .1 and .9 at incre-
ments of .1 as well as testing a threshold of .53 as is
used in the AUTO-TUTOR system (Wiemer-Hastings et
al., 1998). As expected, as the threshold increases from
.1 to .9, recall and false alarm rate both decrease together
as precision increases. We determined based on comput-
ing f-scores2 for each threshold level that .53 achieves the
best trade off between precision and recall. Thus, we used
a threshold of .53, to determine whether LSA identi ed
the corresponding key point in the student essay or not
for the evaluation presented here.
We evaluated the four approaches in terms of precision,
recall, false alarm rate, and f-score, which were computed
for each approach for each test essay, and then averaged
over the whole set of test essays. We computed preci-
sion by dividing the number of  correct answer aspects 
(CAAs) correctly identi ed by the total number of CAAs
identi ed3 We computed recall by dividing the number of
CAAs correctly identi ed over the number of CAAs actu-
ally present in the essay4 False alarm rate was computed
by dividing the number of CAAs incorrectly identi ed by
the total number of CAAs that could potentially be incor-
2We computed our f-scores with a beta value of 1 in order to
treat precision and recall as equally important.
3For essays containing no CAAs, we counted precision as 1
if none were identi ed and 0 otherwise.
4For essays with no CAAs present, we counted recall as 1
for all approaches.
rectly identi ed5. F-scores were computed using 1 as the
beta value in order to treat precision and recall as equally
important.
The results presented in Figure 3 clearly demon-
strate that CarmelTC outperforms the other approaches.
In particular, CarmelTC achieves the highest f-score,
which combines the precision and recall scores into a
single measure. In comparison with CarmelTCsymb,
CarmelTC achieves a higher recall as well as a slightly
higher precision. While LSA achieves a slightly higher
precision, its recall is much lower. Thus, the difference
between the two approaches is clearly shown in the f-
score value, which strongly favors CarmelTC. Rainbow
achieves a lower score than CarmelTC in terms of preci-
sion, recall, false alarm rate, and f-score.
6 Conclusion and Current Directions
In this paper we have introduced the CarmelTC text clas-
si cation approach as it is applied to the problem of stu-
dent essay analysis in the context of a conceptual physics
tutoring system. We have evaluated CarmelTC over data
collected from students interacting with our system in re-
sponse to one of its 10 implemented conceptual physics
problems. Our evaluation demonstrates that the novel
hybrid CarmelTC approach outperforms both Latent Se-
mantic Analysis (LSA) (Landauer et al., 1998; Laham,
1997) and a Naive Bayes approach (McCallum, 1996;
McCallum and Nigam, 1998) as well as a purely sym-
bolic approach similar to (Furnkranz et al., 1998). We
plan to run a larger evaluation with essays from multiple
problems to test the generality of our result. We also plan
to experiment with other rule learning approaches, such
as RIPPER (Cohen, 1995).
7 Acknowledgments
This research was supported by the Of ce of Naval Re-
search, Cognitive Science Division under grant number
N00014-0-1-0600 and by NSF grant number 9720359
to CIRCLE, Center for Interdisciplinary Research on
Constructive Learning Environments at the University of
Pittsburgh and Carnegie Mellon University.
5For essays containing all possible CAAs, false alarm rate
was counted as 0 for all approaches.

References
J. Burstein, K. Kukich, S. Wolff, C. Lu, M. Chodorow,
L. Braden-Harder, and M. D. Harris. 1998. Au-
tomated scoring using a hybrid feature identi cation
technique. In Proceedings of COLING-ACL’98, pages
206 210.
J. Burstein, D. Marcu, S. Andreyev, and M. Chodorow.
2001. Towards automatic classi cation of discourse
elements in essays. In Proceedings of the 39th Annual
Meeting of the Association for Computational Linguis-
tics, Toulouse, France.
J. Cohen. 1960. A coef cient of agreement for nominal
scales. Educational and Psychological Measurement,
20(Winter):37 46.
W. W. Cohen. 1995. Fast effective rule induction. In
Proceedings of the 12th International Conference on
Machine Learning.
M. Craven, D. DiPasquio, D. Freitag, A. McCallum,
T. Mitchell, K. Nigam, and S. Slattery. 1998. Learn-
ing to extract symbolic knowledge from the world wide
web. In Proceedings of the 15th National Conference
on Arti cial Intelligence.
P. W. Foltz, W. Kintsch, and T. Landauer. 1998. The
measurement of textual coherence with latent semantic
analysis. Discourse Processes, 25(2-3):285 307.
J. Furnkranz, T. Mitchell Mitchell, and E. Riloff. 1998.
A case study in using linguistic phrases for text cat-
egorization on the www. In Proceedings from the
AAAI/ICML Workshop on Learning for Text Catego-
rization.
A. Graesser, K. Vanlehn, TRG, and NLT Group. 2002.
Why2 report: Evaluation of why/atlas, why/autotutor,
and accomplished human tutors on learning gains for
qualitative physics problems and explanations. Tech-
nical report, LRDC Tech Report, University of Pitts-
burgh.
R. Grishman, C. Macleod, and A. Meyers. 1994. COM-
LEX syntax: Building a computational lexicon. In
Proceedings of the 15th International Conference on
Computational Linguistics (COLING-94).
R. R. Hake. 1998. Interactive-engagement versus tra-
ditional methods: A six-thousand student survey of
mechanics test data for introductory physics students.
American Journal of Physics, 66(64).
I. A. Halloun and D. Hestenes. 1985. The initial knowl-
edge state of college physics students. American Jour-
nal of Physics, 53(11):1043 1055.
W. Kintsch. 2001. Predication. Cognitive Science,
25:173 202.
K. Krippendorf. 1980. Content Analysis: An Introduc-
tion to Its Methodology. Sage Publications.
D. Laham. 1997. Latent semantic analysis approaches
to categorization. In Proceedings of the Cognitive Sci-
ence Society.
T. K. Landauer, P. W. Foltz, and D. Laham. 1998. In-
troduction to latent semantic analysis. Discourse Pro-
cesses, 25(2-3):259 284.
L. S. Larkey and W. B. Croft. 1995. Automatic assign-
ment of icd9 codes to discharge summaries. Technical
Report IR-64, University of Massachusetts Center for
Intelligent Information Retrieval.
L. S. Larkey and W. B. Croft. 1996. Combining classi-
 ers in text categorization. In Proceedings of SIGIR.
L. Larkey. 1998. Automatic essay grading using text
categorization techniques. In Proceedings of SIGIR.
A. McCallum and K. Nigam. 1998. A comparison of
event models for naive bayes text classi cation. In
Proceedings of the AAAI-98 Workshop on Learning for
Text Classi cation.
Andrew Kachites McCallum. 1996. Bow: A toolkit
for statistical language modeling, text retrieval, clas-
si cation and clustering. http://www.cs.cmu.edu/ mc-
callum/bow.
T. M. Mitchell. 1997. Machine Learning. McGraw Hill.
J. R. Quinlin. 1993. C4.5: Programs for Machine Learn-
ing. Morgan Kaufmann Publishers: San Mateo, CA.
E. Riloff. 1996. Using learned extraction patterns for text
classi cation. In S. Wermter, R. Riloff, and G. Scheler,
editors, Connectionist, Statistical, and Symbolic Ap-
proaches for Natural Language Processing. Springer-
Verlag.
C. P. Ros·e and A. Lavie. 2001. Balancing robustness
and ef ciency in uni cation augmented context-free
parsers for large practical applications. In J. C. Junqua
and G. Van Noord, editors, Robustness in Language
and Speech Technologies. Kluwer Academic Press.
C. P. Ros·e, D. Bhembe, A. Roque, S. Siler, R. Srivas-
tava, and K. Vanlehn. 2002a. A hybrid language un-
derstanding approach for robust selection of tutoring
goals. In Proceedings of the Intelligent Tutoring Sys-
tems Conference.
C. P. Ros·e, D. Bhembe, A. Roque, and K. VanLehn.
2002b. An ef cient incremental architecture for ro-
bust interpretation. In Proceedings of the Human Lan-
guages Technology Conference, pages 307 312.
C. P. Ros·e, P. Jordan, and K. VanLehn. 2002c. Can we
help students with high initial competency? In Pro-
ceedings of the ITS Workshop on Empirical Methods
for Tutorial Dialogue Systems.
C. P. Ros·e. 2000. A framework for robust semantic in-
terpretation. In Proceedings of the First Meeting of the
North American Chapter of the Association for Com-
putational Linguistics, pages 311 318.
K. VanLehn, P. Jordan, C. P. Ros·e, and The Natural Lan-
guag e Tutoring Group. 2002. The architecture of
why2-atlas: a coach for qualitative physics essay writ-
ing. In Proceedings of the Intelligent Tutoring Systems
Conference, pages 159 167.
P. Wiemer-Hastings and I. Zipitria. 2001. Rules for
syntax, vectors for semantics. In Proceedings of the
Twenty-third Annual Conference of the Cognitive Sci-
ence Society.
P. Wiemer-Hastings, A. Graesser, D. Harter, and the Tu-
toring Res earch Group. 1998. The foundations
and architecture of autotutor. In B. Goettl, H. Halff,
C. Red eld, and V. Shute, editors, Intelligent Tutor-
ing Systems: 4th International Conference (ITS ’98 ),
pages 334 343. Springer Verlag.
