Thumbs up? Sentiment Classiflcation using Machine Learning
Techniques
Bo Pang and Lillian Lee
Department of Computer Science
Cornell University
Ithaca, NY 14853 USA
fpabo,lleeg@cs.cornell.edu
Shivakumar Vaithyanathan
IBM Almaden Research Center
650 Harry Rd.
San Jose, CA 95120 USA
shiv@almaden.ibm.com
Abstract
We consider the problem of classifying doc-
uments not by topic, but by overall senti-
ment, e.g., determining whether a review
is positive or negative. Using movie re-
views as data, we flnd that standard ma-
chine learning techniques deflnitively out-
perform human-produced baselines. How-
ever, the three machine learning methods
we employed (Naive Bayes, maximum en-
tropy classiflcation, and support vector ma-
chines) do not perform as well on sentiment
classiflcation as on traditional topic-based
categorization. We conclude by examining
factors that make the sentiment classiflca-
tion problem more challenging.
1 Introduction
Today, very large amounts of information are avail-
able in on-line documents. As part of the efiort to
better organize this information for users, researchers
have been actively investigating the problem of au-
tomatic text categorization.
The bulk of such work has focused on topical cat-
egorization, attempting to sort documents accord-
ing to their subject matter (e.g., sports vs. poli-
tics). However, recent years have seen rapid growth
in on-line discussion groups and review sites (e.g.,
the New York Times’ Books web page) where a cru-
cial characteristic of the posted articles is their senti-
ment, or overall opinion towards the subject matter
| for example, whether a product review is pos-
itive or negative. Labeling these articles with their
sentiment would provide succinct summaries to read-
ers; indeed, these labels are part of the appeal and
value-add of such sites as www.rottentomatoes.com,
which both labels movie reviews that do not con-
tain explicit rating indicators and normalizes the
difierent rating schemes that individual reviewers
use. Sentiment classiflcation would also be helpful in
business intelligence applications (e.g. MindfulEye’s
Lexant system1) and recommender systems (e.g.,
Terveen et al. (1997), Tatemura (2000)), where user
input and feedback could be quickly summarized; in-
deed, in general, free-form survey responses given in
natural language format could be processed using
sentiment categorization. Moreover, there are also
potential applications to message flltering; for exam-
ple, one might be able to use sentiment information
to recognize and discard \ ames"(Spertus, 1997).
In this paper, we examine the efiectiveness of ap-
plying machine learning techniques to the sentiment
classiflcation problem. A challenging aspect of this
problem that seems to distinguish it from traditional
topic-based classiflcation is that while topics are of-
ten identiflable by keywords alone, sentiment can be
expressed in a more subtle manner. For example, the
sentence \How could anyone sit through this movie?"
contains no single word that is obviously negative.
(See Section 7 for more examples). Thus, sentiment
seems to require more understanding than the usual
topic-based classiflcation. So, apart from presenting
our results obtained via machine learning techniques,
we also analyze the problem to gain a better under-
standing of how di–cult it is.
2 Previous Work
This section brie y surveys previous work on non-
topic-based text categorization.
One area of research concentrates on classifying
documents according to their source or source style,
with statistically-detected stylistic variation (Biber,
1988) serving as an important cue. Examples in-
clude author, publisher (e.g., the New York Times vs.
The Daily News), native-language background, and
\brow" (e.g., high-brow vs. \popular", or low-brow)
(Mosteller and Wallace, 1984; Argamon-Engelson et
1http://www.mindfuleye.com/about/lexant.htm
                                            Association for Computational Linguistics.
                      Language Processing (EMNLP), Philadelphia, July 2002, pp. 79-86.
                         Proceedings of the Conference on Empirical Methods in Natural
al., 1998; Tomokiyo and Jones, 2001; Kessler et al.,
1997).
Another, more related area of research is that of
determining the genre of texts; subjective genres,
such as \editorial", are often one of the possible
categories (Karlgren and Cutting, 1994; Kessler et
al., 1997; Finn et al., 2002). Other work explicitly
attempts to flnd features indicating that subjective
language is being used (Hatzivassiloglou and Wiebe,
2000; Wiebe et al., 2001). But, while techniques for
genre categorization and subjectivity detection can
help us recognize documents that express an opin-
ion, they do not address our speciflc classiflcation
task of determining what that opinion actually is.
Most previous research on sentiment-based classi-
flcation has been at least partially knowledge-based.
Some of this work focuses on classifying the semantic
orientation of individual words or phrases, using lin-
guistic heuristics or a pre-selected set of seed words
(Hatzivassiloglou and McKeown, 1997; Turney and
Littman, 2002). Past work on sentiment-based cat-
egorization of entire documents has often involved
either the use of models inspired by cognitive lin-
guistics (Hearst, 1992; Sack, 1994) or the manual or
semi-manual construction of discriminant-word lex-
icons (Huettner and Subasic, 2000; Das and Chen,
2001; Tong, 2001). Interestingly, our baseline exper-
iments, described in Section 4, show that humans
may not always have the best intuition for choosing
discriminating words.
Turney’s (2002) work on classiflcation of reviews
is perhaps the closest to ours.2 He applied a spe-
ciflc unsupervised learning technique based on the
mutual information between document phrases and
the words \excellent" and \poor", where the mu-
tual information is computed using statistics gath-
ered by a search engine. In contrast, we utilize sev-
eral completely prior-knowledge-free supervised ma-
chine learning methods, with the goal of understand-
ing the inherent di–culty of the task.
3 The Movie-Review Domain
For our experiments, we chose to work with movie
reviews. This domain is experimentally convenient
because there are large on-line collections of such re-
views, and because reviewers often summarize their
overall sentiment with a machine-extractable rat-
ing indicator, such as a number of stars; hence, we
did not need to hand-label the data for supervised
learning or evaluation purposes. We also note that
Turney (2002) found movie reviews to be the most
2Indeed, although our choice of title was completely
independent of his, our selections were eerily similar.
di–cult of several domains for sentiment classiflca-
tion, reporting an accuracy of 65.83% on a 120-
document set (random-choice performance: 50%).
But we stress that the machine learning methods and
features we use are not speciflc to movie reviews, and
should be easily applicable to other domains as long
as su–cient training data exists.
Our data source was the Internet Movie Database
(IMDb) archive of the rec.arts.movies.reviews
newsgroup.3 We selected only reviews where the au-
thor rating was expressed either with stars or some
numerical value (other conventions varied too widely
to allow for automatic processing). Ratings were
automatically extracted and converted into one of
three categories: positive, negative, or neutral. For
the work described in this paper, we concentrated
only on discriminating between positive and nega-
tive sentiment. To avoid domination of the corpus
by a small number of proliflc reviewers, we imposed
a limit of fewer than 20 reviews per author per sen-
timent category, yielding a corpus of 752 negative
and 1301 positive reviews, with a total of 144 re-
viewers represented. This dataset will be available
on-line at http://www.cs.cornell.edu/people/pabo/-
movie-review-data/ (the URL contains hyphens only
around the word \review").
4 A Closer Look At the Problem
Intuitions seem to difier as to the di–culty of the sen-
timent detection problem. An expert on using ma-
chine learning for text categorization predicted rela-
tively low performance for automatic methods. On
the other hand, it seems that distinguishing positive
from negative reviews is relatively easy for humans,
especially in comparison to the standard text catego-
rization problem, where topics can be closely related.
One might also suspect that there are certain words
people tend to use to express strong sentiments, so
that it might su–ce to simply produce a list of such
words by introspection and rely on them alone to
classify the texts.
To test this latter hypothesis, we asked two gradu-
ate students in computer science to (independently)
choose good indicator words for positive and nega-
tive sentiments in movie reviews. Their selections,
shown in Figure 1, seem intuitively plausible. We
then converted their responses into simple decision
procedures that essentially count the number of the
proposed positive and negative words in a given doc-
ument. We applied these procedures to uniformly-
distributed data, so that the random-choice baseline
result would be 50%. As shown in Figure 1, the
3http://reviews.imdb.com/Reviews/
Proposed word lists Accuracy Ties
Human 1 positive: dazzling, brilliant, phenomenal, excellent, fantastic 58% 75%
negative: suck, terrible, awful, unwatchable, hideous
Human 2 positive: gripping, mesmerizing, riveting, spectacular, cool, 64% 39%
awesome, thrilling, badass, excellent, moving, exciting
negative: bad, cliched, sucks, boring, stupid, slow
Figure 1: Baseline results for human word lists. Data: 700 positive and 700 negative reviews.
Proposed word lists Accuracy Ties
Human 3 + stats positive: love, wonderful, best, great, superb, still, beautiful 69% 16%
negative: bad, worst, stupid, waste, boring, ?, !
Figure 2: Results for baseline using introspection and simple statistics of the data (including test data).
accuracy | percentage of documents classifled cor-
rectly | for the human-based classiflers were 58%
and 64%, respectively.4 Note that the tie rates |
percentage of documents where the two sentiments
were rated equally likely | are quite high5 (we chose
a tie breaking policy that maximized the accuracy of
the baselines).
While the tie rates suggest that the brevity of
the human-produced lists is a factor in the relatively
poor performance results, it is not the case that size
alone necessarily limits accuracy. Based on a very
preliminary examination of frequency counts in the
entire corpus (including test data) plus introspection,
we created a list of seven positive and seven negative
words (including punctuation), shown in Figure 2.
As that flgure indicates, using these words raised the
accuracy to 69%. Also, although this third list is of
comparable length to the other two, it has a much
lower tie rate of 16%. We further observe that some
of the items in this third list, such as \?" or \still",
would probably not have been proposed as possible
candidates merely through introspection, although
upon re ection one sees their merit (the question
mark tends to occur in sentences like \What was the
director thinking?"; \still" appears in sentences like
\Still, though, it was worth seeing").
We conclude from these preliminary experiments
that it is worthwhile to explore corpus-based tech-
niques, rather than relying on prior intuitions, to se-
lect good indicator features and to perform sentiment
classiflcation in general. These experiments also pro-
vide us with baselines for experimental comparison;
in particular, the third baseline of 69% might actu-
ally be considered somewhat di–cult to beat, since
it was achieved by examination of the test data (al-
though our examination was rather cursory; we do
4Later experiments using these words as features for
machine learning methods did not yield better results.
5This is largely due to 0-0 ties.
not claim that our list was the optimal set of four-
teen words).
5 Machine Learning Methods
Our aim in this work was to examine whether it suf-
flces to treat sentiment classiflcation simply as a spe-
cial case of topic-based categorization (with the two
\topics" being positive sentiment and negative sen-
timent), or whether special sentiment-categorization
methods need to be developed. We experimented
with three standard algorithms: Naive Bayes clas-
siflcation, maximum entropy classiflcation, and sup-
port vector machines. The philosophies behind these
three algorithms are quite difierent, but each has
been shown to be efiective in previous text catego-
rization studies.
To implement these machine learning algorithms
on our document data, we used the following stan-
dard bag-of-features framework. Let ff1;:::;fmg be
a predeflned set of m features that can appear in
a document; examples include the word \still" or
the bigram \really stinks". Let ni(d) be the num-
ber of times fi occurs in document d. Then, each
document d is represented by the document vector
~d := (n1(d);n2(d);:::;nm(d)).
5.1 Naive Bayes
One approach to text classiflcation is to assign to a
given document d the class c⁄ = argmaxc P(c j d).
We derive the Naive Bayes (NB) classifler by flrst
observing that by Bayes’ rule,
P(c j d) = P(c)P(d j c)P(d) ;
where P(d) plays no role in selecting c⁄. To estimate
the term P(d j c), Naive Bayes decomposes it by as-
suming the fi’s are conditionally independent given
d’s class:
PNB(c j d) := P(c)
¡Qm
i=1 P(fi j c)
ni(d)¢
P(d) :
Our training method consists of relative-frequency
estimation of P(c) and P(fi j c), using add-one
smoothing.
Despite its simplicity and the fact that its con-
ditional independence assumption clearly does not
hold in real-world situations, Naive Bayes-based text
categorization still tends to perform surprisingly well
(Lewis, 1998); indeed, Domingos and Pazzani (1997)
show that Naive Bayes is optimal for certain problem
classes with highly dependent features. On the other
hand, more sophisticated algorithms might (and of-
ten do) yield better results; we examine two such
algorithms next.
5.2 Maximum Entropy
Maximum entropy classiflcation (MaxEnt, or ME,
for short) is an alternative technique which has
proven efiective in a number of natural lan-
guage processing applications (Berger et al., 1996).
Nigam et al. (1999) show that it sometimes, but not
always, outperforms Naive Bayes at standard text
classiflcation. Its estimate of P(c j d) takes the fol-
lowing exponential form:
PME(c j d) := 1Z(d) exp
ˆX
i
‚i;cFi;c(d;c)
!
;
where Z(d) is a normalization function. Fi;c is a fea-
ture/class function for feature fi and class c, deflned
as follows:6
Fi;c(d;c0) :=
n1; n
i(d) > 0 and c0 = c
0 otherwise :
For instance, a particular feature/class function
might flre if and only if the bigram \still hate" ap-
pears and the document’s sentiment is hypothesized
to be negative.7 Importantly, unlike Naive Bayes,
MaxEnt makes no assumptions about the relation-
ships between features, and so might potentially per-
form better when conditional independence assump-
tions are not met.
The ‚i;c’s are feature-weight parameters; inspec-
tion of the deflnition of PME shows that a large ‚i;c
means that fi is considered a strong indicator for
6We use a restricted deflnition of feature/class func-
tions so that MaxEnt relies on the same sort of feature
information as Naive Bayes.
7The dependence on class is necessary for parameter
induction. See Nigam et al. (1999) for additional moti-
vation.
class c. The parameter values are set so as to max-
imize the entropy of the induced distribution (hence
the classifler’s name) subject to the constraint that
the expected values of the feature/class functions
with respect to the model are equal to their expected
values with respect to the training data: the under-
lying philosophy is that we should choose the model
making the fewest assumptions about the data while
still remaining consistent with it, which makes intu-
itive sense. We use ten iterations of the improved
iterative scaling algorithm (Della Pietra et al., 1997)
for parameter training (this was a su–cient num-
ber of iterations for convergence of training-data ac-
curacy), together with a Gaussian prior to prevent
overfltting (Chen and Rosenfeld, 2000).
5.3 Support Vector Machines
Support vector machines (SVMs) have been shown to
be highly efiective at traditional text categorization,
generally outperforming Naive Bayes (Joachims,
1998). They are large-margin, rather than proba-
bilistic, classiflers, in contrast to Naive Bayes and
MaxEnt. In the two-category case, the basic idea be-
hind the training procedure is to flnd a hyperplane,
represented by vector ~w, that not only separates
the document vectors in one class from those in the
other, but for which the separation, or margin, is as
large as possible. This search corresponds to a con-
strained optimization problem; letting cj 2 f1;¡1g
(corresponding to positive and negative) be the cor-
rect class of document dj, the solution can be written
as
~w :=
X
j
fijcj ~dj; fij ‚ 0;
where the fij’s are obtained by solving a dual opti-
mization problem. Those ~dj such that fij is greater
than zero are called support vectors, since they are
the only document vectors contributing to ~w. Clas-
siflcation of test instances consists simply of deter-
mining which side of ~w’s hyperplane they fall on.
We used Joachim’s (1999) SVMlight package8 for
training and testing, with all parameters set to their
default values, after flrst length-normalizing the doc-
ument vectors, as is standard (neglecting to normal-
ize generally hurt performance slightly).
6 Evaluation
6.1 Experimental Set-up
We used documents from the movie-review corpus
described in Section 3. To create a data set with uni-
form class distribution (studying the efiect of skewed
8http://svmlight.joachims.org
Features # of frequency or NB ME SVM
features presence?
(1) unigrams 16165 freq. 78.7 N/A 72.8
(2) unigrams " pres. 81.0 80.4 82.9
(3) unigrams+bigrams 32330 pres. 80.6 80.8 82.7
(4) bigrams 16165 pres. 77.3 77.4 77.1
(5) unigrams+POS 16695 pres. 81.5 80.4 81.9
(6) adjectives 2633 pres. 77.0 77.7 75.1
(7) top 2633 unigrams 2633 pres. 80.3 81.0 81.4
(8) unigrams+position 22430 pres. 81.0 80.1 81.6
Figure 3: Average three-fold cross-validation accuracies, in percent. Boldface: best performance for a given
setting (row). Recall that our baseline results ranged from 50% to 69%.
class distributions was out of the scope of this study),
we randomly selected 700 positive-sentiment and 700
negative-sentiment documents. We then divided this
data into three equal-sized folds, maintaining bal-
anced class distributions in each fold. (We did not
use a larger number of folds due to the slowness of
the MaxEnt training procedure.) All results reported
below, as well as the baseline results from Section 4,
are the average three-fold cross-validation results on
this data (of course, the baseline algorithms had no
parameters to tune).
To prepare the documents, we automatically re-
moved the rating indicators and extracted the tex-
tual information from the original HTML docu-
ment format, treating punctuation as separate lex-
ical items. No stemming or stoplists were used.
One unconventional step we took was to attempt
to model the potentially important contextual efiect
of negation: clearly \good" and \not very good" in-
dicate opposite sentiment orientations. Adapting a
technique of Das and Chen (2001), we added the tag
NOT to every word between a negation word (\not",
\isn’t", \didn’t", etc.) and the flrst punctuation
mark following the negation word. (Preliminary ex-
periments indicate that removing the negation tag
had a negligible, but on average slightly harmful, ef-
fect on performance.)
For this study, we focused on features based on
unigrams (with negation tagging) and bigrams. Be-
cause training MaxEnt is expensive in the number of
features, we limited consideration to (1) the 16165
unigrams appearing at least four times in our 1400-
document corpus (lower count cutofis did not yield
signiflcantly difierent results), and (2) the 16165 bi-
grams occurring most often in the same data (the
selected bigrams all occurred at least seven times).
Note that we did not add negation tags to the bi-
grams, since we consider bigrams (and n-grams in
general) to be an orthogonal way to incorporate con-
text.
6.2 Results
Initial unigram results The classiflcation accu-
racies resulting from using only unigrams as fea-
tures are shown in line (1) of Figure 3. As a whole,
the machine learning algorithms clearly surpass the
random-choice baseline of 50%. They also hand-
ily beat our two human-selected-unigram baselines
of 58% and 64%, and, furthermore, perform well in
comparison to the 69% baseline achieved via limited
access to the test-data statistics, although the im-
provement in the case of SVMs is not so large.
On the other hand, in topic-based classiflcation,
all three classiflers have been reported to use bag-
of-unigram features to achieve accuracies of 90%
and above for particular categories (Joachims, 1998;
Nigam et al., 1999)9 | and such results are for set-
tings with more than two classes. This provides
suggestive evidence that sentiment categorization is
more di–cult than topic classiflcation, which cor-
responds to the intuitions of the text categoriza-
tion expert mentioned above.10 Nonetheless, we still
wanted to investigate ways to improve our senti-
ment categorization results; these experiments are
reported below.
Feature frequency vs. presence Recall that we
represent each document d by a feature-count vector
(n1(d);:::;nm(d)). However, the deflnition of the
9Joachims (1998) used stemming and stoplists; in
some of their experiments, Nigam et al. (1999), like us,
did not.
10We could not perform the natural experiment of at-
tempting topic-based categorization on our data because
the only obvious topics would be the fllm being reviewed;
unfortunately, in our data, the maximum number of re-
views per movie is 27, too small for meaningful results.
MaxEnt feature/class functions Fi;c only re ects the
presence or absence of a feature, rather than directly
incorporating feature frequency. In order to investi-
gate whether reliance on frequency information could
account for the higher accuracies of Naive Bayes and
SVMs, we binarized the document vectors, setting
ni(d) to 1 if and only feature fi appears in d, and
reran Naive Bayes and SVMlight on these new vec-
tors.11
As can be seen from line (2) of Figure 3,
better performance (much better performance for
SVMs) is achieved by accounting only for fea-
ture presence, not feature frequency. Interestingly,
this is in direct opposition to the observations of
McCallum and Nigam (1998) with respect to Naive
Bayes topic classiflcation. We speculate that this in-
dicates a difierence between sentiment and topic cat-
egorization | perhaps due to topic being conveyed
mostly by particular content words that tend to be
repeated | but this remains to be verifled. In any
event, as a result of this flnding, we did not incor-
porate frequency information into Naive Bayes and
SVMs in any of the following experiments.
Bigrams In addition to looking speciflcally for
negation words in the context of a word, we also
studied the use of bigrams to capture more context
in general. Note that bigrams and unigrams are
surely not conditionally independent, meaning that
the feature set they comprise violates Naive Bayes’
conditional-independence assumptions; on the other
hand, recall that this does not imply that Naive
Bayes will necessarily do poorly (Domingos and Paz-
zani, 1997).
Line (3) of the results table shows that bigram
information does not improve performance beyond
that of unigram presence, although adding in the bi-
grams does not seriously impact the results, even for
Naive Bayes. This would not rule out the possibility
that bigram presence is as equally useful a feature
as unigram presence; in fact, Pedersen (2001) found
that bigrams alone can be efiective features for word
sense disambiguation. However, comparing line (4)
to line (2) shows that relying just on bigrams causes
accuracy to decline by as much as 5.8 percentage
points. Hence, if context is in fact important, as our
intuitions suggest, bigrams are not efiective at cap-
turing it in our setting.
11Alternatively, we could have tried integrating fre-
quency information into MaxEnt. However, feature/class
functions are traditionally deflned as binary (Berger et
al., 1996); hence, explicitly incorporating frequencies
would require difierent functions for each count (or count
bin), making training impractical. But cf. (Nigam et al.,
1999).
Parts of speech We also experimented with ap-
pending POS tags to every word via Oliver Mason’s
Qtag program.12 This serves as a crude form of word
sense disambiguation (Wilks and Stevenson, 1998):
for example, it would distinguish the difierent usages
of \love" in \I love this movie" (indicating sentiment
orientation) versus \This is a love story" (neutral
with respect to sentiment). However, the efiect of
this information seems to be a wash: as depicted in
line (5) of Figure 3, the accuracy improves slightly
for Naive Bayes but declines for SVMs, and the per-
formance of MaxEnt is unchanged.
Since adjectives have been a focus of previous work
in sentiment detection (Hatzivassiloglou and Wiebe,
2000; Turney, 2002)13, we looked at the performance
of using adjectives alone. Intuitively, we might ex-
pect that adjectives carry a great deal of informa-
tion regarding a document’s sentiment; indeed, the
human-produced lists from Section 4 contain almost
no other parts of speech. Yet, the results, shown in
line (6) of Figure 3, are relatively poor: the 2633
adjectives provide less useful information than uni-
gram presence. Indeed, line (7) shows that simply
using the 2633 most frequent unigrams is a better
choice, yielding performance comparable to that of
using (the presence of) all 16165 (line (2)). This may
imply that applying explicit feature-selection algo-
rithms on unigrams could improve performance.
Position An additional intuition we had was that
the position of a word in the text might make a dif-
ference: movie reviews, in particular, might begin
with an overall sentiment statement, proceed with
a plot discussion, and conclude by summarizing the
author’s views. As a rough approximation to deter-
mining this kind of structure, we tagged each word
according to whether it appeared in the flrst quar-
ter, last quarter, or middle half of the document14.
The results (line (8)) didn’t difier greatly from using
unigrams alone, but more reflned notions of position
might be more successful.
7 Discussion
The results produced via machine learning tech-
niques are quite good in comparison to the human-
generated baselines discussed in Section 4. In terms
of relative performance, Naive Bayes tends to do the
worst and SVMs tend to do the best, although the
12http://www.english.bham.ac.uk/stafi/oliver/soft-
ware/tagger/index.htm
13Turney’s (2002) unsupervised algorithm uses bi-
grams containing an adjective or an adverb.
14We tried a few other settings, e.g., flrst third vs. last
third vs middle third, and found them to be less efiective.
difierences aren’t very large.
On the other hand, we were not able to achieve ac-
curacies on the sentiment classiflcation problem com-
parable to those reported for standard topic-based
categorization, despite the several difierent types of
features we tried. Unigram presence information
turned out to be the most efiective; in fact, none of
the alternative features we employed provided consis-
tently better performance once unigram presence was
incorporated. Interestingly, though, the superiority
of presence information in comparison to frequency
information in our setting contradicts previous obser-
vations made in topic-classiflcation work (McCallum
and Nigam, 1998).
What accounts for these two difierences | dif-
flculty and types of information proving useful |
between topic and sentiment classiflcation, and how
might we improve the latter? To answer these ques-
tions, we examined the data further. (All examples
below are drawn from the full 2053-document cor-
pus.)
As it turns out, a common phenomenon in the doc-
uments was a kind of \thwarted expectations" narra-
tive, where the author sets up a deliberate contrast
to earlier discussion: for example, \This fllm should
be brilliant. It sounds like a great plot, the actors are
flrst grade, and the supporting cast is good as well, and
Stallone is attempting to deliver a good performance.
However, it can’t hold up" or \I hate the Spice Girls.
...[3 things the author hates about them]... Why I saw
this movie is a really, really, really long story, but I
did, and one would think I’d despise every minute of
it. But... Okay, I’m really ashamed of it, but I enjoyed
it. I mean, I admit it’s a really awful movie ...the ninth
 oor of hell...The plot is such a mess that it’s terrible.
But I loved it." 15
In these examples, a human would easily detect
the true sentiment of the review, but bag-of-features
classiflers would presumably flnd these instances dif-
flcult, since there are many words indicative of the
opposite sentiment to that of the entire review. Fun-
damentally, it seems that some form of discourse
analysis is necessary (using more sophisticated tech-
15This phenomenon is related to another common
theme, that of \a good actor trapped in a bad movie":
\AN AMERICAN WEREWOLF IN PARIS is a failed at-
tempt... Julie Delpy is far too good for this movie. She im-
bues Seraflne with spirit, spunk, and humanity. This isn’t
necessarily a good thing, since it prevents us from relax-
ing and enjoying AN AMERICAN WEREWOLF IN PARIS
as a completely mindless, campy entertainment experience.
Delpy’s injection of class into an otherwise classless produc-
tion raises the specter of what this fllm could have been
with a better script and a better cast ... She was radiant,
charismatic, and efiective ...."
niques than our positional feature mentioned above),
or at least some way of determining the focus of each
sentence, so that one can decide when the author is
talking about the fllm itself. (Turney (2002) makes
a similar point, noting that for reviews, \the whole
is not necessarily the sum of the parts".) Further-
more, it seems likely that this thwarted-expectations
rhetorical device will appear in many types of texts
(e.g., editorials) devoted to expressing an overall
opinion about some topic. Hence, we believe that an
important next step is the identiflcation of features
indicating whether sentences are on-topic (which is
a kind of co-reference problem); we look forward to
addressing this challenge in future work.
Acknowledgments
We thank Joshua Goodman, Thorsten Joachims, Jon
Kleinberg, Vikas Krishna, John Lafierty, Jussi Myl-
lymaki, Phoebe Sengers, Richard Tong, Peter Tur-
ney, and the anonymous reviewers for many valuable
comments and helpful suggestions, and Hubie Chen
and Tony Faradjian for participating in our baseline
experiments. Portions of this work were done while
the flrst author was visiting IBM Almaden. This pa-
per is based upon work supported in part by the Na-
tional Science Foundation under ITR/IM grant IIS-
0081334. Any opinions, flndings, and conclusions or
recommendations expressed above are those of the
authors and do not necessarily re ect the views of
the National Science Foundation.

References
Shlomo Argamon-Engelson, Moshe Koppel, and
Galit Avneri. 1998. Style-based text categoriza-
tion: What newspaper am I reading? In Proc. of
the AAAI Workshop on Text Categorization, pages
1{4.
Adam L. Berger, Stephen A. Della Pietra, and Vin-
cent J. Della Pietra. 1996. A maximum entropy
approach to natural language processing. Compu-
tational Linguistics, 22(1):39{71.
Douglas Biber. 1988. Variation across Speech and
Writing. Cambridge University Press.
Stanley Chen and Ronald Rosenfeld. 2000. A survey
of smoothing techniques for ME models. IEEE
Trans. Speech and Audio Processing, 8(1):37{50.
Sanjiv Das and Mike Chen. 2001. Yahoo! for
Amazon: Extracting market sentiment from stock
message boards. In Proc. of the 8th Asia Paciflc
Finance Association Annual Conference (APFA
2001).
Stephen Della Pietra, Vincent Della Pietra, and John
Lafierty. 1997. Inducing features of random flelds.
IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 19(4):380{393.
Pedro Domingos and Michael J. Pazzani. 1997. On
the optimality of the simple Bayesian classifler un-
der zero-one loss. Machine Learning, 29(2-3):103{
130.
Aidan Finn, Nicholas Kushmerick, and Barry Smyth.
2002. Genre classiflcation and domain transfer
for information flltering. In Proc. of the Eu-
ropean Colloquium on Information Retrieval Re-
search, pages 353{362, Glasgow.
Vasileios Hatzivassiloglou and Kathleen McKeown.
1997. Predicting the semantic orientation of adjec-
tives. In Proc. of the 35th ACL/8th EACL, pages
174{181.
Vasileios Hatzivassiloglou and Janyce Wiebe. 2000.
Efiects of adjective orientation and gradability on
sentence subjectivity. In Proc. of COLING.
Marti Hearst. 1992. Direction-based text interpre-
tation as an information access reflnement. In
Paul Jacobs, editor, Text-Based Intelligent Sys-
tems. Lawrence Erlbaum Associates.
Alison Huettner and Pero Subasic. 2000. Fuzzy
typing for document management. In ACL
2000 Companion Volume: Tutorial Abstracts and
Demonstration Notes, pages 26{27.
Thorsten Joachims. 1998. Text categorization with
support vector machines: Learning with many rel-
evant features. In Proc. of the European Confer-
ence on Machine Learning (ECML), pages 137{
142.
Thorsten Joachims. 1999. Making large-scale SVM
learning practical. In Bernhard Sch˜olkopf and
Alexander Smola, editors, Advances in Kernel
Methods - Support Vector Learning, pages 44{56.
MIT Press.
Jussi Karlgren and Douglass Cutting. 1994. Recog-
nizing text genres with simple metrics using dis-
criminant analysis. In Proc. of COLING.
Brett Kessler, Geofirey Nunberg, and Hinrich
Sch˜utze. 1997. Automatic detection of text genre.
In Proc. of the 35th ACL/8th EACL, pages 32{38.
David D. Lewis. 1998. Naive (Bayes) at forty: The
independence assumption in information retrieval.
In Proc. of the European Conference on Machine
Learning (ECML), pages 4{15. Invited talk.
Andrew McCallum and Kamal Nigam. 1998. A com-
parison of event models for Naive Bayes text clas-
siflcation. In Proc. of the AAAI-98 Workshop on
Learning for Text Categorization, pages 41{48.
Frederick Mosteller and David L. Wallace. 1984. Ap-
plied Bayesian and Classical Inference: The Case
of the Federalist Papers. Springer-Verlag.
Kamal Nigam, John Lafierty, and Andrew McCal-
lum. 1999. Using maximum entropy for text clas-
siflcation. In Proc. of the IJCAI-99 Workshop on
Machine Learning for Information Filtering, pages
61{67.
Ted Pedersen. 2001. A decision tree of bigrams is an
accurate predictor of word sense. In Proc. of the
Second NAACL, pages 79{86.
Warren Sack. 1994. On the computation of point of
view. In Proc. of the Twelfth AAAI, page 1488.
Student abstract.
Ellen Spertus. 1997. Smokey: Automatic recog-
nition of hostile messages. In Proc. of Innova-
tive Applications of Artiflcial Intelligence (IAAI),
pages 1058{1065.
Junichi Tatemura. 2000. Virtual reviewers for col-
laborative exploration of movie reviews. In Proc.
of the 5th International Conference on Intelligent
User Interfaces, pages 272{275.
Loren Terveen, Will Hill, Brian Amento, David Mc-
Donald, and Josh Creter. 1997. PHOAKS: A sys-
tem for sharing recommendations. Communica-
tions of the ACM, 40(3):59{62.
Laura Mayfleld Tomokiyo and Rosie Jones. 2001.
You’re not from round here, are you? Naive Bayes
detection of non-native utterance text. In Proc. of
the Second NAACL, pages 239{246.
Richard M. Tong. 2001. An operational system for
detecting and tracking opinions in on-line discus-
sion. Workshop note, SIGIR 2001 Workshop on
Operational Text Classiflcation.
Peter D. Turney and Michael L. Littman. 2002. Un-
supervised learning of semantic orientation from
a hundred-billion-word corpus. Technical Report
EGB-1094, National Research Council Canada.
Peter Turney. 2002. Thumbs up or thumbs down?
Semantic orientation applied to unsupervised clas-
siflcation of reviews. In Proc. of the ACL.
Janyce M. Wiebe, Theresa Wilson, and Matthew
Bell. 2001. Identifying collocations for recognizing
opinions. In Proc. of the ACL/EACL Workshop
on Collocation.
Yorick Wilks and Mark Stevenson. 1998. The gram-
mar of sense: Using part-of-speech tags as a flrst
step in semantic disambiguation. Journal of Nat-
ural Language Engineering, 4(2):135{144.
