Proceedings of the 43rd Annual Meeting of the ACL, pages 115–124,
Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
Seeing stars: Exploiting class relationships for sentiment categorization with
respect to rating scales
Bo Pang and Lillian Lee
(1) Department of Computer Science, Cornell University
(2) Language Technologies Institute, Carnegie Mellon University
(3) Computer Science Department, Carnegie Mellon University
Abstract
We address the rating-inference problem,
wherein rather than simply decide whether
a review is "thumbs up" or "thumbs
down", as in previous sentiment analysis
work, one must determine an author’s
evaluation with respect to a multi-point
scale (e.g., one to five "stars"). This task
represents an interesting twist on standard
multi-class text categorization because
there are several different degrees
of similarity between class labels; for example,
"three stars" is intuitively closer to
"four stars" than to "one star".
We first evaluate human performance at
the task. Then, we apply a meta-algorithm,
based on a metric labeling formulation
of the problem, that alters a
given n-ary classifier’s output in an explicit
attempt to ensure that similar items
receive similar labels. We show that
the meta-algorithm can provide significant
improvements over both multi-class
and regression versions of SVMs when we
employ a novel similarity measure appropriate
to the problem.
1 Introduction
There has recently been a dramatic surge of interest
in sentiment analysis, as more and more people
become aware of the scientific challenges posed and
the scope of new applications enabled by the processing
of subjective language. (The papers collected
by Qu, Shanahan, and Wiebe (2004) form a
representative sample of research in the area.) Most
prior work on the specific problem of categorizing
expressly opinionated text has focused on the binary
distinction of positive vs. negative (Turney,
2002; Pang, Lee, and Vaithyanathan, 2002; Dave,
Lawrence, and Pennock, 2003; Yu and Hatzivassiloglou,
2003). But it is often helpful to have more
information than this binary distinction provides, especially
if one is ranking items by recommendation
or comparing several reviewers’ opinions: example
applications include collaborative filtering and deciding
which conference submissions to accept.
Therefore, in this paper we consider generalizing
to finer-grained scales: rather than just determine
whether a review is "thumbs up" or not, we attempt
to infer the author’s implied numerical rating, such
as "three stars" or "four stars". Note that this differs
from identifying opinion strength (Wilson, Wiebe,
and Hwa, 2004): rants and raves have the same
strength but represent opposite evaluations, and referee
forms often allow one to indicate that one is
very confident (high strength) that a conference submission
is mediocre (middling rating). Also, our
task differs from ranking not only because one can
be given a single item to classify (as opposed to a
set of items to be ordered relative to one another),
but because there are settings in which classification
is harder than ranking, and vice versa.
One can apply standard n-ary classifiers or regression
to this rating-inference problem; independent
work by Koppel and Schler (2005) considers such
methods. But an alternative approach that explic-
itly incorporates information about item similarities
together with label similarity information (for instance,
"one star" is closer to "two stars" than to
"four stars") is to think of the task as one of metric
labeling (Kleinberg and Tardos, 2002), where
label relations are encoded via a distance metric.
This observation yields a meta-algorithm, applicable
to both semi-supervised (via graph-theoretic tech-
niques) and supervised settings, that alters a given
n-ary classifier’s output so that similar items tend to
be assigned similar labels.
In what follows, we first demonstrate that humans
can discern relatively small differences in (hidden)
evaluation scores, indicating that rating inference
is indeed a meaningful task. We then present
three types of algorithms (one-vs-all, regression,
and metric labeling) that can be distinguished by
how explicitly they attempt to leverage similarity
between items and between labels. Next, we con-
sider what item similarity measure to apply, propos-
ing one based on the positive-sentence percentage.
Incorporating this new measure within the metric-labeling
framework is shown to often provide significant
improvements over the other algorithms.
We hope that some of the insights derived here
might apply to other scales for text classification that
have been considered, such as clause-level opin-
ion strength (Wilson, Wiebe, and Hwa, 2004); af-
fect types like disgust (Subasic and Huettner, 2001;
Liu, Lieberman, and Selker, 2003); reading level
(Collins-Thompson and Callan, 2004); and urgency
or criticality (Horvitz, Jacobs, and Hovel, 1999).
2 Problem validation and formulation
We first ran a small pilot study on human subjects
in order to establish a rough idea of what a reasonable
classification granularity is: if even people cannot
accurately infer labels with respect to a five-star
scheme with half stars, say, then we cannot expect a
learning algorithm to do so. Indeed, some potential
obstacles to accurate rating inference include lack
of calibration (e.g., what an understated author in-
tends as high praise may seem lukewarm), author
inconsistency at assigning fine-grained ratings, and
ratings not entirely supported by the text.[1]
Rating diff.         Pooled   Subject 1    Subject 2
3 or more            100%     100% (35)    100% (15)
2 (e.g., 1 star)      83%      77% (30)    100% (11)
1 (e.g., 1/2 star)    69%      65% (57)     90% (10)
0                     55%      47% (15)     80% ( 5)
Table 1: Human accuracy at determining relative
positivity. Rating differences are given in "notches".
Parentheses enclose the number of pairs attempted.
For data, we first collected Internet movie reviews
in English from four authors, removing explicit rat-
ing indicators from each document’s text automati-
cally. Now, while the obvious experiment would be
to ask subjects to guess the rating that a review represents,
doing so would force us to specify a fixed
rating-scale granularity in advance. Instead, we examined
people’s ability to discern relative differences,
because by varying the rating differences represented
by the test instances, we can evaluate multiple
granularities in a single experiment. Specifically,
at intervals over a number of weeks, we authors
(a non-native and a native speaker of English)
examined pairs of reviews, attempting to determine
whether the first review in each pair was (1) more
positive than, (2) less positive than, or (3) as posi-
tive as the second. The texts in any particular review
pair were taken from the same author to factor out
the effects of cross-author divergence.
As Table 1 shows, both subjects performed perfectly
when the rating separation was at least 3
"notches" in the original scale (we define a notch
as a half star in a four- or five-star scheme and 10
points in a 100-point scheme). Interestingly, al-
though human performance drops as rating differ-
ence decreases, even at a one-notch separation, both
subjects handily outperformed the random-choice
baseline of 33%. However, there was large variation
in accuracy between subjects.[2]
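As a quick consistency check on Table 1, the pooled column can be approximately recovered from the per-subject accuracies and pair counts. The sketch below assumes that each rounded percentage corresponds to a whole number of correctly judged pairs; the names `TABLE_1` and `pooled_accuracy` are illustrative and do not come from the paper.

```python
# Per-subject (accuracy, pairs attempted) from Table 1,
# keyed by rating difference in notches.
TABLE_1 = {
    2: [(0.77, 30), (1.00, 11)],
    1: [(0.65, 57), (0.90, 10)],
    0: [(0.47, 15), (0.80, 5)],
}


def pooled_accuracy(rows):
    """Pool subjects by summing correct pairs.

    Assumption: the number of correct pairs for a subject is
    round(accuracy * pairs), since the table reports rounded percentages.
    """
    correct = sum(round(acc * n) for acc, n in rows)
    total = sum(n for _, n in rows)
    return correct / total


for diff, rows in TABLE_1.items():
    print(diff, f"{pooled_accuracy(rows):.0%}")
```

Under that assumption the pooled values round to 83%, 69%, and 55%, matching the table's pooled column.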
[1] For example, the critic Dennis Schwartz writes that "sometimes
the review itself [indicates] the letter grade should have
been higher or lower, as the review might fail to take into consideration
my overall impression of the film -- which I hope to
capture in the grade" (http://www.sover.net/~ozus/cinema.htm).
[2] One contributing factor may be that the subjects viewed
disjoint document sets, since we wanted to maximize experi-
mental coverage of the types of document pairs within each dif-
ference class. We thus cannot report inter-annotator agreement,
Because of this variation, we defined two different
classification regimes. From the evidence above,
a three-class task (categories 0, 1, and 2, essentially
"negative", "middling", and "positive", respectively)
seems like one that most people would
do quite well at (but we should not assume 100%
human accuracy: according to our one-notch re-
sults, people may misclassify borderline cases like
2.5 stars). Our study also suggests that people could
do at least fairly well at distinguishing full stars in
a zero- to four-star scheme. However, when we
began to construct five-category datasets for each
of our four authors (see below), we found that in
each case, either the most negative or the most pos-
itive class (but not both) contained only about 5%
of the documents. To make the classes more bal-
anced, we folded these minority classes into the ad-
jacent class, thus arriving at a four-class problem
(categories 0-3, increasing in positivity). Note that
the four-class problem seems to offer more possi-
bilities for leveraging class relationship information
than the three-class setting, since it involves more
class pairs. Also, even the two-category version of
the rating-inference problem for movie reviews has
proven quite challenging for many automated classification
techniques (Pang, Lee, and Vaithyanathan,
2002; Turney, 2002).
We applied the above two labeling schemes to
a scale dataset[3] containing four corpora of movie
reviews. All reviews were automatically pre-
processed to remove both explicit rating indicators
and objective sentences; the motivation for the latter
step is that it has previously aided positive vs. negative
classification (Pang and Lee, 2004). All of the
1770, 902, 1307, or 1027 documents in a given cor-
pus were written by the same author. This decision
facilitates interpretation of the results, since it fac-
tors out the effects of different choices of methods
for calibrating authors’ scales.[4] We point out that
but since our goal is to recover a reviewer’s "true" recommendation,
reader-author agreement is more relevant.
While another factor might be degree of English fluency, in
an informal experiment (six subjects viewing the same three
pairs), native English speakers made the only two errors.
[3] Available at http://www.cs.cornell.edu/People/pabo/movie-review-data
as scale dataset v1.0.
[4] From the Rotten Tomatoes website’s FAQ: "star systems
are not consistent between critics. For critics like Roger Ebert
and James Berardinelli, 2.5 stars or lower out of 4 stars is al-
ways negative. For other critics, 2.5 stars can either be positive
it is possible to gather author-specific information
in some practical applications: for instance, systems
that use selected authors (e.g., the Rotten Tomatoes
movie-review website, where, we note, not all
authors provide explicit ratings) could require that
someone submit rating-labeled samples of newly-
admitted authors’ work. Moreover, our results at
least partially generalize to mixed-author situations
(see Section 5.2).
3 Algorithms
Recall that the problem we are considering is multi-category
classification in which the labels can be
naturally mapped to a metric space (e.g., points on a
line); for simplicity, we assume the distance metric
$d(\ell, \ell') = |\ell - \ell'|$ throughout. In this section, we
present three approaches to this problem in order of
increasingly explicit use of pairwise similarity infor-
mation between items and between labels. In order
to make comparisons between these methods mean-
ingful, we base all three of them on Support Vec-
tor Machines (SVMs) as implemented in Joachims’
(1999) $SVM^{light}$ package.
3.1 One-vs-all
The standard SVM formulation applies only to binary
classification. One-vs-all (OVA) (Rifkin and
Klautau, 2004) is a common extension to the n-ary
case. Training consists of building, for each label $\ell$,
an SVM binary classifier distinguishing label $\ell$ from
"not-$\ell$". We consider the final output to be a label
preference function $\pi_{ova}(x, \ell)$, defined as the signed
distance of (test) item $x$ to the $\ell$ side of the $\ell$ vs.
not-$\ell$ decision plane.
Clearly, OVA makes no explicit use of pairwise
label or item relationships. However, it can perform
well if each class exhibits sufficiently distinct language;
see Section 4 for more discussion.
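The OVA label preference function can be sketched as follows. This is a minimal illustration, not the SVMlight-based setup used in the paper: it assumes each per-label classifier is linear and already trained, represents it as a hypothetical `(weights, bias)` pair, and assumes the predicted label is the one with the highest preference.

```python
import math


def signed_distance(weights, bias, x):
    """Signed distance of feature vector x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * xi for w, xi in zip(weights, x)) + bias) / norm


def pi_ova(classifiers, x, label):
    """Label preference: signed distance of x to the `label` side of the
    `label` vs. not-`label` decision plane."""
    weights, bias = classifiers[label]
    return signed_distance(weights, bias, x)


def predict_ova(classifiers, x):
    """Pick the label whose one-vs-all classifier prefers x most strongly."""
    return max(classifiers, key=lambda label: pi_ova(classifiers, x, label))


# Hypothetical hand-set classifiers for labels 0, 1, 2 (not learned from data).
classifiers = {
    0: ([-1.0, 0.0], 0.0),
    1: ([0.0, 1.0], -0.5),
    2: ([1.0, 0.0], -1.0),
}
print(predict_ova(classifiers, [2.0, 0.2]))
```

Note that nothing in this computation compares labels or items with one another, which is the sense in which OVA ignores pairwise relationships.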
3.2 Regression
Alternatively, we can take a regression perspective
by assuming that the labels come from a discretization
of a continuous function $g$ mapping from the
or negative. Even though Eric Lurio uses a 5 star system, his
grading is very relaxed. So, 2 stars can be positive." Thus,
calibration may sometimes require strong familiarity with the
authors involved, as anyone who has ever needed to reconcile
conflicting referee reports probably knows.
feature space to a metric space.[5] If we choose $g$
from a family of sufficiently "gradual" functions,
then similar items necessarily receive similar labels.
In particular, we consider linear, $\varepsilon$-insensitive SVM
regression (Vapnik, 1995; Smola and Schölkopf,
1998); the idea is to find the hyperplane that best fits
the training data, but where training points whose labels
are within distance $\varepsilon$ of the hyperplane incur no
loss. Then, for (test) instance $x$, the label preference
function $\pi_{reg}(x, \ell)$ is the negative of the distance between
$\ell$ and the value predicted for $x$ by the fitted
hyperplane function.
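The two ingredients described above can be written down directly. The sketch below assumes the regressor's prediction for an item is already available as a number; `pi_reg`, `predict_reg`, and `epsilon_insensitive_loss` are illustrative names, and choosing the label with maximal preference is our assumption about how the preference function is turned into a prediction.

```python
def pi_reg(predicted_value, label):
    """Label preference under regression: the negative of the distance
    between the label and the value predicted for the item."""
    return -abs(label - predicted_value)


def predict_reg(predicted_value, labels):
    """Choose the label closest to the regressor's prediction."""
    return max(labels, key=lambda label: pi_reg(predicted_value, label))


def epsilon_insensitive_loss(target, predicted_value, epsilon):
    """Training loss: zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(target - predicted_value) - epsilon)


labels = [0, 1, 2, 3]
print(predict_reg(2.3, labels))
print(epsilon_insensitive_loss(2.0, 2.05, epsilon=0.1))
```

Because the preference decays with distance from the predicted value, neighboring labels automatically receive similar preferences, which is how regression encodes label similarity implicitly.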
Wilson, Wiebe, and Hwa (2004) used SVM re-
gression to classify clause-level strength of opinion,
reporting that it provided lower accuracy than other
methods. However, independently of our work,
Koppel and Schler (2005) found that applying lin-
ear regression to classify documents (in a different
corpus than ours) with respect to a three-point rat-
ing scale provided greater accuracy than OVA SVMs
and other algorithms.
3.3 Metric labeling
Regression implicitly encodes the "similar items,
similar labels" heuristic, in that one can restrict
consideration to "gradual" functions. But we can
also think of our task as a metric labeling problem
(Kleinberg and Tardos, 2002), a special case
of the maximum a posteriori estimation problem
for Markov random fields, to explicitly encode our
desideratum. Suppose we have an initial label preference
function $\pi(x, \ell)$, perhaps computed via one
of the two methods described above. Also, let $d$
be a distance metric on labels, and let $nn_k(x)$ denote
the $k$ nearest neighbors of item $x$ according
to some item-similarity function $sim$. Then, it is
quite natural to pose our problem as finding a mapping
of instances $x$ to labels $\ell_x$ (respecting the original
labels of the training instances) that minimizes
$$\sum_{x \in \text{test}} \Big[ -\pi(x, \ell_x) + \alpha \sum_{y \in nn_k(x)} f\big(d(\ell_x, \ell_y)\big)\, sim(x, y) \Big],$$
where $f$ is monotonically increasing (we chose
$f(d) = d$ unless otherwise specified) and $\alpha$ is a
trade-off and/or scaling parameter. (The inner summation
is familiar from work in locally-weighted
learning[6] (Atkeson, Moore, and Schaal, 1997).) In a
sense, we are using explicit item and label similarity
information to increasingly penalize the initial classifier
as it assigns more divergent labels to similar
items.
[5] We discuss the ordinal regression variant in Section 6.
In this paper, we only report supervised-learning
experiments in which the nearest neighbors for any
given test item were drawn from the training set
alone. In such a setting, the labeling decisions for
different test items are independent, so that solving
the requisite optimization problem is simple.
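In the supervised setting just described, each test item can be labeled on its own by minimizing its term of the objective over the candidate labels. The following is a minimal sketch under that reading; the function name, the data layout (`pref` as a dict, `neighbors` as label/similarity pairs), and the numbers in the example are assumptions for illustration.

```python
def metric_label(pref, neighbors, labels, alpha, f=lambda d: d):
    """Label a single test item x.

    pref:      dict mapping each label to the initial preference pi(x, label)
    neighbors: list of (neighbor_label, sim(x, neighbor)) pairs for the
               k nearest training items, whose labels are fixed
    Returns the label l minimizing
        -pi(x, l) + alpha * sum_y f(|l - l_y|) * sim(x, y).
    """
    def cost(label):
        penalty = sum(f(abs(label - l_y)) * s for l_y, s in neighbors)
        return -pref[label] + alpha * penalty

    return min(labels, key=cost)


labels = [0, 1, 2]
pref = {0: -0.9, 1: 0.1, 2: 0.2}             # hypothetical initial preferences
neighbors = [(1, 0.9), (1, 0.8), (0, 0.7)]   # hypothetical neighbors
print(metric_label(pref, neighbors, labels, alpha=0.0))
print(metric_label(pref, neighbors, labels, alpha=1.0))
```

With `alpha=0.0` the base classifier's favorite label (2) is returned unchanged; with `alpha=1.0` the similar, lower-rated neighbors pull the decision to label 1.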
Aside: transduction. The above formulation also
allows for transductive semi-supervised learning,
in that we could allow nearest neighbors to
come from both the training and test sets. We
intend to address this case in future work, since
there are important settings in which one has a
small number of labeled reviews and a large num-
ber of unlabeled reviews, in which case consider-
ing similarities between unlabeled texts could prove
quite helpful. In full generality, the correspond-
ing multi-label optimization problem is intractable,
but for many families of $f$ functions (e.g., convex)
there exist practical exact or approximation
algorithms based on techniques for finding minimum
s-t cuts in graphs (Ishikawa and Geiger, 1998;
Boykov, Veksler, and Zabih, 1999; Ishikawa, 2003).
Interestingly, previous sentiment analysis research
found that a minimum-cut formulation for the binary
subjective/objective distinction yielded good results
(Pang and Lee, 2004). Of course, there are many
other related semi-supervised learning algorithms
that we would like to try as well; see Zhu (2005)
for a survey.
4 Class struggle: finding a label-correlated
item-similarity function
We need to specify an item similarity function $sim$
to use the metric-labeling formulation described in
Section 3.3. We could, as is commonly done, employ
a term-overlap-based measure such as the cosine
between term-frequency-based document vectors
(henceforth "TO(cos)"). However, Table 2
[6] If we ignore the $\pi(x, \ell)$ term, different choices of $f$ correspond
to different versions of nearest-neighbor learning, e.g.,
majority-vote, weighted average of labels, or weighted median
of labels.
Label difference:     1      2      3
Three-class data     37%    33%     -
Four-class data      34%    31%    30%
Table 2: Average over authors and class pairs of
between-class vocabulary overlap as the class labels
of the pair grow farther apart.
shows that in aggregate, the vocabularies of distant
classes overlap to a degree surprisingly similar to
that of the vocabularies of nearby classes. Thus,
item similarity as measured by TO(cos) may not correlate
well with similarity of the items’ true labels.
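For concreteness, a term-overlap cosine of the TO(cos) kind can be sketched as below. The paper does not spell out tokenization or term weighting, so this version assumes pre-tokenized documents and raw term counts; `term_overlap_cosine` is an illustrative name.

```python
import math
from collections import Counter


def term_overlap_cosine(doc_a, doc_b):
    """Cosine between raw term-frequency vectors of two tokenized documents."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


print(term_overlap_cosine("a dull plot".split(), "a great plot".split()))
```

The example pair shares two of three terms and so scores 2/3 despite expressing opposite opinions, a small-scale version of the concern raised about Table 2.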
We can potentially develop a more useful similar-
ity metric by asking ourselves what, intuitively, ac-
counts for the label relationships that we seek to ex-
ploit. A simple hypothesis is that ratings can be de-
termined by the positive-sentence percentage (PSP)
of a text, i.e., the number of positive sentences di-
vided by the number of subjective sentences. (Term-
based versions of this premise have motivated much
sentiment-analysis work for over a decade (Das and
Chen, 2001; Tong, 2001; Turney, 2002).) But counterexamples
are easy to construct: reviews can contain
off-topic opinions, or recount many positive aspects
before describing a fatal flaw.
We therefore tested the hypothesis as follows.
To avoid the need to hand-label sentences as positive
or negative, we first created a sentence polarity
dataset[7] consisting of 10,662 movie-review "snippets"
(a striking extract usually one sentence long)
downloaded from www.rottentomatoes.com; each
snippet was labeled with its source review’s label
(positive or negative) as provided by Rotten Tomatoes.
Then, we trained a Naive Bayes classifier on
this data set and applied it to our scale dataset to
identify the positive sentences (recall that objective
sentences were already removed).
Figure 1 shows that all four authors tend to ex-
hibit a higher PSP when they write a more pos-
itive review, and we expect that most typical re-
viewers would follow suit. Hence, PSP appears to
be a promising basis for computing document sim-
ilarity for our rating-inference task. In particular,
[7] Available at http://www.cs.cornell.edu/People/pabo/movie-review-data
as sentence polarity dataset v1.0.
we defined $\overrightarrow{PSP}(x)$ to be the two-dimensional vector
$(PSP(x), 1 - PSP(x))$, and then set the item-similarity
function required by the metric-labeling
optimization function (Section 3.3) to
$sim(x, y) = \cos\big(\overrightarrow{PSP}(x), \overrightarrow{PSP}(y)\big)$.[8]
[Figure 1 plot: "Positive-sentence percentage (PSP) statistics". x-axis: rating (in notches), 0 to 10; y-axis: mean and standard deviation of PSP, 0 to 0.8; one series each for Authors a, b, c, and d.]
Figure 1: Average and standard deviation of PSP for
reviews expressing different ratings.
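The PSP-based similarity, the cosine between the two-dimensional vectors (PSP, 1 - PSP) of two documents, is simple enough to sketch directly. The code assumes that sentence-level polarity decisions are already available (in the paper they come from a Naive Bayes classifier); the function names are illustrative.

```python
import math


def psp(sentence_is_positive):
    """Positive-sentence percentage: positive sentences divided by all
    subjective sentences. The input holds one boolean per subjective
    sentence (True if it was classified as positive)."""
    return sum(sentence_is_positive) / len(sentence_is_positive)


def psp_similarity(psp_x, psp_y):
    """sim(x, y): cosine between (PSP(x), 1 - PSP(x)) and (PSP(y), 1 - PSP(y))."""
    vx = (psp_x, 1.0 - psp_x)
    vy = (psp_y, 1.0 - psp_y)
    dot = vx[0] * vy[0] + vx[1] * vy[1]
    return dot / (math.hypot(*vx) * math.hypot(*vy))


print(psp([True, True, False, True]))
print(psp_similarity(0.8, 0.7), psp_similarity(0.8, 0.2))
```

Documents with close PSP values score near 1, while a mostly positive and a mostly negative document score much lower; the two extreme vectors (1, 0) and (0, 1) are orthogonal.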
But before proceeding, we note that it is possible
that similarity information might yield no extra
benefit at all. For instance, we don’t need it if we
can reliably identify each class just from some set
of distinguishing terms. If we define such terms
as frequent ones ($n \geq 20$) that appear in a single
class 50% or more of the time, then we do find
many instances; some examples for one author are:
"meaningless", "disgusting" (class 0); "pleasant",
"uneven" (class 1); and "oscar", "gem" (class 2)
for the three-class case, and, in the four-class case,
"flat", "tedious" (class 1) versus "straightforward",
"likeable" (class 2). Some unexpected distinguishing
terms for this author are "lion" for class 2 (three-class
case), and for class 2 in the four-class case,
"jennifer", for a wide variety of Jennifers.
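The distinguishing-term criterion can be sketched as a short counting routine. The text does not say whether the frequency threshold counts token occurrences or documents; the version below assumes token occurrences, and the function name and data layout are illustrative.

```python
from collections import Counter, defaultdict


def distinguishing_terms(docs, min_count=20, min_share=0.5):
    """docs: iterable of (class_label, list_of_tokens).

    Returns {term: class} for terms that occur at least `min_count` times
    overall and have at least `min_share` of those occurrences in one class.
    """
    per_class = defaultdict(Counter)
    total = Counter()
    for label, tokens in docs:
        per_class[label].update(tokens)
        total.update(tokens)

    result = {}
    for term, n in total.items():
        if n < min_count:
            continue
        best = max(per_class, key=lambda c: per_class[c][term])
        if per_class[best][term] / n >= min_share:
            result[term] = best
    return result


# Toy corpus (hypothetical counts, not taken from the scale dataset).
docs = [
    (0, ["dull"] * 15 + ["film"] * 10),
    (1, ["dull"] * 5 + ["film"] * 10),
    (2, ["film"] * 10 + ["gem"] * 3),
]
print(distinguishing_terms(docs))
```

In the toy corpus only "dull" qualifies: "film" is frequent but spread evenly across classes, and "gem" is too rare.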
5 Evaluation
This section compares the accuracies of the ap-
proaches outlined in Section 3 on the four corpora
comprising our scale dataset. (Results using $L_1$ error
were qualitatively similar.) Throughout, when
[8] While admittedly we initially chose this function because
it was convenient to work with cosines, post hoc analysis revealed
that the corresponding metric space "stretched" certain
distances in a useful way.
we refer to something as "significant", we mean statistically
so with respect to the paired t-test, $p < .05$.
The results that follow are based on $SVM^{light}$’s
default parameter settings for SVM regression and
OVA. Preliminary analysis of the effect of varying
the regression parameter $\varepsilon$ in the four-class case revealed
that the default value was often optimal.
The notation "A+B" denotes metric labeling
where method A provides the initial label preference
function $\pi$ and B serves as similarity measure. To
train, we first select the meta-parameters $k$ and $\alpha$
by running 9-fold cross-validation within the training
set. Fixing $k$ and $\alpha$ to those values yielding the
best performance, we then re-train A (but with SVM
parameters fixed, as described above) on the whole
training set. At test time, the nearest neighbors of
each item are also taken from the full training set.
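The meta-parameter selection step can be sketched as a small grid search. The candidate grids, the way items are assigned to folds, and the `evaluate` callback are all assumptions made for illustration; the paper specifies only that $k$ and $\alpha$ are chosen by 9-fold cross-validation within the training set.

```python
from itertools import product


def k_fold_indices(n_items, n_folds):
    """Yield (train_idx, held_out_idx) pairs; item i goes to fold i % n_folds."""
    for fold in range(n_folds):
        held_out = [i for i in range(n_items) if i % n_folds == fold]
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, held_out


def select_meta_parameters(n_items, evaluate, k_values, alpha_values, n_folds=9):
    """Return the (k, alpha) pair with the best mean held-out accuracy.

    `evaluate(train_idx, held_out_idx, k, alpha)` is a caller-supplied,
    hypothetical function that trains on train_idx, applies metric labeling
    with the given k and alpha, and returns accuracy on held_out_idx.
    """
    def mean_accuracy(params):
        k, alpha = params
        scores = [evaluate(train, held_out, k, alpha)
                  for train, held_out in k_fold_indices(n_items, n_folds)]
        return sum(scores) / len(scores)

    return max(product(k_values, alpha_values), key=mean_accuracy)
```

After the best pair is found, the base method would be re-trained on the whole training set with those values held fixed, as the paragraph above describes.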
5.1 Main comparison
Figure 2 summarizes our average 10-fold cross-validation
accuracy results. We first observe from
the plots that all the algorithms described in Section
3 always definitively outperform the simple baseline
of predicting the majority class, although the im-
provements are smaller in the four-class case. In-
cidentally, the data was distributed in such a way
that the absolute performance of the baseline it-
self does not change much between the three- and
four-class case (which implies that the three-class
datasets were relatively more balanced); and Author
c’s datasets seem noticeably easier than the others.
We now examine the effect of implicitly using label
and item similarity. In the four-class case, regression
performed better than OVA (significantly
so for two authors, as shown in the righthand table);
but for the three-category task, OVA significantly
outperforms regression for all four authors.
One might initially interpret this "flip" as showing
that in the four-class scenario, item and label similarities
provide a richer source of information relative
to class-specific characteristics, especially since
for the non-majority classes there is less data available;
whereas in the three-class setting the categories
are better modeled as quite distinct entities.
However, the three-class results for metric labeling
on top of OVA and regression (shown in Figure 2
by black versions of the corresponding icons) show
that employing explicit similarities always improves
results, often to a significant degree, and yields the
best overall accuracies. Thus, we can in fact effectively
exploit similarities in the three-class case. Additionally,
in both the three- and four-class scenarios,
metric labeling often brings the performance of
the weaker base method up to that of the stronger
one (as indicated by the "disappearance" of upward
triangles in corresponding table rows), and never
hurts performance significantly.
In the four-class case, metric labeling and regres-
sion seem roughly equivalent. One possible inter-
pretation is that the relevant structure of the problem
is already captured by linear regression (and per-
haps a different kernel for regression would have
improved its three-class performance). However,
according to additional experiments we ran in the
four-class situation, the test-set-optimal parameter
settings for metric labeling would have produced
significant improvements, indicating there may be
greater potential for our framework. At any rate, we
view the fact that metric labeling performed quite
well for both rating scales as a definitely positive result.
5.2 Further discussion
Q: Metric labeling looks like it’s just combining
SVMs with nearest neighbors, and classifier combination
often improves performance. Couldn’t we get
the same kind of results by combining SVMs with
any other reasonable method?
A: No. For example, if we take the strongest
base SVM method for initial label preferences, but
replace PSP with the term-overlap-based cosine
(TO(cos)), performance often drops significantly.
This result, which is in accordance with Section
4’s data, suggests that choosing an item similarity
function that correlates well with label similarity
is important. (ova+PSP ◁◁◁◁ ova+TO(cos) [3c];
reg+PSP ◁ reg+TO(cos) [4c])
Q: Could you explain that notation, please?
A: Triangles point toward the significantly better
algorithm for some dataset. For instance,
"M ◁◁▷ N [3c]" means, "In the 3-class task, method
M is significantly better than N for two author
datasets and significantly worse for one dataset (so
the algorithms were statistically indistinguishable on
the remaining dataset)". When the algorithms being
compared are statistically indistinguishable on
[Figure 2 plots: "Average accuracies, three-class data" and "Average accuracies, four-class data". x-axis: Author a, Author b, Author c, Author d; y-axis: accuracy, 0.35 to 0.8; series: majority, ova, ova+PSP, reg, reg+PSP.]
Average ten-fold cross-validation accuracies. Open icons: SVMs in either one-versus-all (square) or
regression (circle) mode; dark versions: metric labeling using the corresponding SVM together with the
positive-sentence percentage (PSP). The y-axes of the two plots are aligned.
Significant differences, three-class data
           ova        ova+PSP    reg        reg+PSP
           a b c d    a b c d    a b c d    a b c d
ova                   △ △ △ .    ◁ ◁ ◁ ◁    . ◁ . .
ova+PSP    ◀ ◀ ◀ .               ◁ ◁ ◁ ◁    ◁ ◁ ◁ .
reg        △ △ △ △    △ △ △ △               . △ . △
reg+PSP    . △ . .    △ △ △ .    . ◀ . ◀
Significant differences, four-class data
           ova        ova+PSP    reg        reg+PSP
           a b c d    a b c d    a b c d    a b c d
ova                   . △ △ △    △ △ . .    △ . . △
ova+PSP    . ◀ ◀ ◀               △ . . .    △ . . .
reg        ◁ ◁ . .    ◁ . . .               . . . .
reg+PSP    ◁ . . ◁    ◁ . . .    . . . .
Triangles point towards significantly better algorithms for the results plotted above. Specifically, if the
difference between a row and a column algorithm for a given author dataset (a, b, c, or d) is significant, a
triangle points to the better one; otherwise, a dot (.) is shown. Dark icons highlight the effect of adding PSP
information via metric labeling.
Figure 2: Results for main experimental comparisons.
all four datasets (the "no triangles" case), we indicate
this with an equals sign ("=").
Q: Thanks. Doesn’t Figure 1 show that the
positive-sentence percentage would be a good
classifier even in isolation, so metric labeling isn’t
necessary?
A: No. Predicting class labels directly from
the PSP value via trained thresholds isn’t as
effective (ova+PSP ◁◁◁◁ threshold PSP [3c];
reg+PSP ◁◁ threshold PSP [4c]).
Alternatively, we could use only the PSP component
of metric labeling by setting the label
preference function to the constant function
0, but even with test-set-optimal parameter settings,
doing so underperforms the trained metric
labeling algorithm with access to an initial
SVM classifier (ova+PSP ◁◁◁◁ 0+PSP [3c];
reg+PSP ◁◁ 0+PSP [4c]).
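The "threshold PSP" baseline mentioned above amounts to cutting the PSP axis into intervals, one per class. The sketch below shows only the prediction step; how the thresholds were trained is not described in the text, and the cut points used here are made up for illustration.

```python
from bisect import bisect_right


def threshold_classify(psp_value, thresholds):
    """Map a PSP value to a class index given ascending cut points.

    With thresholds [t1, t2]: PSP < t1 -> class 0,
    t1 <= PSP < t2 -> class 1, PSP >= t2 -> class 2.
    """
    return bisect_right(thresholds, psp_value)


thresholds = [0.45, 0.65]  # hypothetical cut points, not values from the paper
print([threshold_classify(p, thresholds) for p in (0.30, 0.45, 0.60, 0.90)])
```

Such a classifier uses PSP alone and ignores the text-based evidence captured by the initial SVM, which is one plausible reading of why it underperforms the combined approach.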
Q: What about using PSP as one of the features for
input to a standard classifier?
A: Our focus is on investigating the utility of simi-
larity information. In our particular rating-inference
setting, it so happens that the basis for our pair-
wise similarity measure can be incorporated as an
item-specific feature, but we view this as a tangential
issue. That being said, preliminary experiments
show that metric labeling can be better, barely
(for test-set-optimal parameter settings for both algorithms:
significantly better results for one author,
four-class case; statistically indistinguishable other-
wise), although one needs to determine an appropri-
ate weight for the PSP feature to get good perfor-
mance.
Q: You defined the "metric transformation function"
$f$ as the identity function $f(d) = d$, imposing
greater loss as the distance between labels assigned
to two similar items increases. Can you do just as
well if you penalize all non-equal label assignments
by the same amount, or does the distance between
labels really matter?
A: You’re asking for a comparison to the Potts
model, which sets $f$ to the function $\hat{f}(d) = 1$
if $d > 0$, $0$ otherwise. In the one setting
in which there is a significant difference
between the two, the Potts model does worse
(ova+PSP ◁ ova$\hat{+}$PSP [3c]). Also, employing the
Potts model generally leads to fewer significant
improvements over a chosen base method (compare
Figure 2’s tables with: reg$\hat{+}$PSP ◁ reg [3c];
ova$\hat{+}$PSP ◁◁ ova [3c]; ova$\hat{+}$PSP = ova [4c]; but
note that reg$\hat{+}$PSP ◁ reg [4c]). We note that optimizing
the Potts model in the multi-label case is NP-hard,
whereas the optimal metric labeling with the
identity metric-transformation function can be efficiently
obtained (see Section 3.3).
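The difference between the identity transformation and the Potts model is easy to see on a toy neighborhood. In the sketch below, the neighbor labels and similarities are invented for illustration, and `neighbor_penalty` is an illustrative name for the inner summation of the metric-labeling objective.

```python
def f_identity(d):
    """Identity transformation: penalty grows with label distance."""
    return d


def f_potts(d):
    """Potts model: the same unit penalty for any non-equal label pair."""
    return 1 if d > 0 else 0


def neighbor_penalty(label, neighbors, f):
    """Sum over neighbors y of f(|label - l_y|) * sim(x, y)."""
    return sum(f(abs(label - l_y)) * s for l_y, s in neighbors)


neighbors = [(3, 1.0), (3, 1.0)]  # hypothetical: two neighbors labeled 3
for label in (0, 2):
    print(label,
          neighbor_penalty(label, neighbors, f_identity),
          neighbor_penalty(label, neighbors, f_potts))
```

With the identity function, label 0 is penalized three times as heavily as label 2 for disagreeing with neighbors labeled 3, whereas under the Potts model both disagreements cost the same, so the ordering of the labels no longer influences the decision.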
Q: Your datasets had many labeled reviews and only
one author each. Is your work relevant to settings
with many authors but very little data for each?
A: As discussed in Section 2, it can be quite difficult
to properly calibrate different authors’ scales,
since the same number of "stars" even within what
is ostensibly the same rating system can mean differ-
ent things for different authors. But since you ask:
we temporarily turned a blind eye to this serious is-
sue, creating a collection of 5394 reviews by 496 au-
thors with at most 80 reviews per author, where we
pretended that our rating conversions mapped cor-
rectly into a universal rating scheme. Preliminary
results on this dataset were actually comparable to
the results reported above, although since we are
not confident in the class labels themselves, more
work is needed to derive a clear analysis of this setting.
(Abusing notation, since we’re already playing
fast and loose: [3c]: baseline 52.4%, reg 61.4%,
reg+PSP 61.5%, ova (65.4%) ▷ ova+PSP (66.3%);
[4c]: baseline 38.8%, reg (51.9%) ▷ reg+PSP
(52.7%), ova (53.8%) ▷ ova+PSP (54.6%))
In future work, it would be interesting to deter-
mine author-independent characteristics that can be
used on (or suitably adapted to) data for speci c au-
thors.
Q: How about trying ...
A: ... Yes, there are many alternatives. A few
that we tested are described in the Appendix, and
we propose some others in the next section. We
should mention that we have not yet experimented
with all-vs.-all (AVA), another standard
binary-to-multi-category classifier conversion method,
because we wished to focus on the effect of omitting
pairwise information. In independent work on
3-category rating inference for a different corpus,
Koppel and Schler (2005) found that regression out-
performed AVA, and Rifkin and Klautau (2004) ar-
gue that in principle OVA should do just as well as
AVA. But we plan to try it out.
6 Related work and future directions
In this paper, we addressed the rating-inference
problem, showing the utility of employing label similarity
and (appropriate choice of) item similarity,
either implicitly, through regression, or explicitly
and often more effectively, through metric labeling.
In the future, we would like to apply our methods
to other scale-based classification problems, and explore
alternative methods. Clearly, varying the kernel
in SVM regression might yield better results.
Another choice is ordinal regression (McCullagh,
1980; Herbrich, Graepel, and Obermayer, 2000),
which only considers the ordering on labels, rather
than any explicit distances between them; this ap-
proach could work well if a good metric on labels is
lacking. Also, one could use mixture models (e.g.,
combine “positive” and “negative” language models)
to capture class relationships (McCallum, 1999;
Schapire and Singer, 2000; Takamura, Matsumoto,
and Yamada, 2004).
We are also interested in framing multi-class but
non-scale-based categorization problems as metric
labeling tasks. For example, positive vs. negative vs.
neutral sentiment distinctions are sometimes considered
in which neutral means either objective (Engström,
2004) or a conflation of objective with a rating
of mediocre (Das and Chen, 2001). (Koppel and
Schler (2005) in independent work also discuss var-
ious types of neutrality.) In either case, we could
apply a metric in which positive and negative are
closer to objective (or objective+mediocre) than to
each other. As another example, hierarchical label
relationships can be easily encoded in a label met-
ric.
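As an illustration of the three-way case above, the following sketch encodes a label metric in which positive and negative are each closer to objective than to each other. The specific distance values (1 and 2) and all names are hypothetical; the helper only checks that the table satisfies the usual metric axioms.

```python
from itertools import product

# Hypothetical distances chosen for illustration only.
LABELS = ("positive", "objective", "negative")
DISTANCE = {
    frozenset(("positive", "objective")): 1.0,
    frozenset(("negative", "objective")): 1.0,
    frozenset(("positive", "negative")): 2.0,
}


def label_distance(a, b):
    """Symmetric distance between two sentiment labels; 0 for equal labels."""
    if a == b:
        return 0.0
    return DISTANCE[frozenset((a, b))]


def is_metric(labels, dist):
    """Check symmetry, identity of indiscernibles, and the triangle inequality."""
    for a, b, c in product(labels, repeat=3):
        if dist(a, b) != dist(b, a):
            return False
        if (dist(a, b) == 0) != (a == b):
            return False
        if dist(a, c) > dist(a, b) + dist(b, c):
            return False
    return True


if __name__ == "__main__":
    print(is_metric(LABELS, label_distance))
```

A hierarchical label scheme could be handled in the same way by filling the table with path lengths between labels in the hierarchy.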
Finally, as mentioned in Section 3.3, we would
like to address the transductive setting, in which one
has a small amount of labeled data and uses rela-
tionships between unlabeled items, since it is par-
ticularly well-suited to the metric-labeling approach
and may be quite important in practice.
Acknowledgments We thank Paul Bennett, Dave Blei,
Claire Cardie, Shimon Edelman, Thorsten Joachims, Jon Klein-
berg, Oren Kurland, John Lafferty, Guy Lebanon, Pradeep
Ravikumar, Jerry Zhu, and the anonymous reviewers for many
very useful comments and discussion. We learned of Moshe
Koppel and Jonathan Schler’s work while preparing the camera-
ready version of this paper; we thank them for so quickly an-
swering our request for a pre-print. Our descriptions of their
work are based on that pre-print; we apologize in advance for
any inaccuracies in our descriptions that result from changes
between their pre-print and their final version. We also thank
CMU for its hospitality during the year. This paper is based
upon work supported in part by the National Science Founda-
tion (NSF) under grant no. IIS-0329064 and CCR-0122581;
SRI International under subcontract no. 03-000211 on their
project funded by the Department of the Interior’s National
Business Center; and by an Alfred P. Sloan Research Fellowship.
Any opinions, findings, and conclusions or recommendations
expressed are those of the authors and do not necessarily
reflect the views or official policies, either expressed or
implied, of any sponsoring institutions, the U.S. government, or
any other entity.
References
Atkeson, Christopher G., Andrew W. Moore, and Stefan Schaal.
1997. Locally weighted learning. Artificial Intelligence Review,
11(1):11–73.
Boykov, Yuri, Olga Veksler, and Ramin Zabih. 1999. Fast ap-
proximate energy minimization via graph cuts. In Proceed-
ings of the International Conference on Computer Vision
(ICCV), pages 377–384. Journal version in IEEE Transactions
on Pattern Analysis and Machine Intelligence (PAMI)
23(11):1222–1239, 2001.
Collins-Thompson, Kevyn and Jamie Callan. 2004. A language
modeling approach to predicting reading difficulty. In HLT-NAACL:
Proceedings of the Main Conference, pages 193–200.
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Ex-
tracting market sentiment from stock message boards. In
Proceedings of the Asia Pacific Finance Association Annual
Conference (APFA).
Dave, Kushal, Steve Lawrence, and David M. Pennock. 2003.
Mining the peanut gallery: Opinion extraction and semantic
classification of product reviews. In Proceedings of WWW,
pages 519–528.
Engström, Charlotta. 2004. Topic dependence in sentiment
classification. Master’s thesis, University of Cambridge.
Herbrich, Ralf, Thore Graepel, and Klaus Obermayer. 2000.
Large margin rank boundaries for ordinal regression. In
Alexander J. Smola, Peter L. Bartlett, Bernhard Schölkopf,
and Dale Schuurmans, editors, Advances in Large Margin
Classifiers, Neural Information Processing Systems. MIT
Press, pages 115–132.
Horvitz, Eric, Andy Jacobs, and David Hovel. 1999. Attention-
sensitive alerting. In Proceedings of the Conference on
Uncertainty and Artificial Intelligence, pages 305–313.
Ishikawa, Hiroshi. 2003. Exact optimization for Markov random
fields with convex priors. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 25(10).
Ishikawa, Hiroshi and Davi Geiger. 1998. Occlusions, discon-
tinuities, and epipolar lines in stereo. In Proceedings of the
5th European Conference on Computer Vision (ECCV), volume I,
pages 232–248, London, UK. Springer-Verlag.
Joachims, Thorsten. 1999. Making large-scale SVM learning
practical. In Bernhard Schölkopf and Alexander Smola, editors,
Advances in Kernel Methods - Support Vector Learning.
MIT Press, pages 44–56.
Kleinberg, Jon and Éva Tardos. 2002. Approximation algorithms
for classification problems with pairwise relationships:
Metric labeling and Markov random fields. Journal
of the ACM, 49(5):616–639.
Koppel, Moshe and Jonathan Schler. 2005. The importance
of neutral examples for learning sentiment. In Workshop on
the Analysis of Informal and Formal Information Exchange
during Negotiations (FINEXIN).
Liu, Hugo, Henry Lieberman, and Ted Selker. 2003. A model
of textual affect sensing using real-world knowledge. In Proceedings
of Intelligent User Interfaces (IUI), pages 125–132.
McCallum, Andrew. 1999. Multi-label text classification with
a mixture model trained by EM. In AAAI Workshop on Text
Learning.
McCullagh, Peter. 1980. Regression models for ordinal data.
Journal of the Royal Statistical Society, 42(2):109–42.
Pang, Bo and Lillian Lee. 2004. A sentimental education: Sen-
timent analysis using subjectivity summarization based on
minimum cuts. In Proceedings of the ACL, pages 271–278.
Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002.
Thumbs up? Sentiment classification using machine learning
techniques. In Proceedings of EMNLP, pages 79–86.
Qu, Yan, James Shanahan, and Janyce Wiebe, editors. 2004.
Proceedings of the AAAI Spring Symposium on Explor-
ing Attitude and Affect in Text: Theories and Applications.
AAAI Press. AAAI technical report SS-04-07.
Rifkin, Ryan M. and Aldebaro Klautau. 2004. In defense of
one-vs-all classification. Journal of Machine Learning Research,
5:101–141.
Schapire, Robert E. and Yoram Singer. 2000. BoosTexter:
A boosting-based system for text categorization. Machine
Learning, 39(2/3):135–168.
Smola, Alex J. and Bernhard Schölkopf. 1998. A tutorial
on support vector regression. Technical Report Neuro-
COLT NC-TR-98-030, Royal Holloway College, University
of London.
Subasic, Pero and Alison Huettner. 2001. Affect analysis of
text using fuzzy semantic typing. IEEE Transactions on
Fuzzy Systems, 9(4):483–496.
Takamura, Hiroya, Yuji Matsumoto, and Hiroyasu Yamada.
2004. Modeling category structures with a kernel function.
In Proceedings of CoNLL, pages 57–64.
Tong, Richard M. 2001. An operational system for detecting
and tracking opinions in on-line discussion. SIGIR Workshop
on Operational Text Classification.
Turney, Peter. 2002. Thumbs up or thumbs down? Semantic
orientation applied to unsupervised classification of reviews.
In Proceedings of the ACL, pages 417–424.
Vapnik, Vladimir. 1995. The Nature of Statistical Learning
Theory. Springer.
Wilson, Theresa, Janyce Wiebe, and Rebecca Hwa. 2004. Just
how mad are you? Finding strong and weak opinion clauses.
In Proceedings of AAAI, pages 761–769.
Yu, Hong and Vasileios Hatzivassiloglou. 2003. Towards an-
swering opinion questions: Separating facts from opinions
and identifying the polarity of opinion sentences. In Pro-
ceedings of EMNLP.
Zhu, Xiaojin (Jerry). 2005. Semi-Supervised Learning with
Graphs. Ph.D. thesis, Carnegie Mellon University.
A Appendix: other variations attempted
A.1 Discretizing binary classification
In our setting, we can also incorporate class relations
by directly altering the output of a binary classifier,
as follows. We first train a standard SVM, treating
ratings greater than 0.5 as positive labels and others
as negative labels. If we then consider the resulting
classifier to output a positivity-preference function
$\pi_{+}(x)$, we can then learn a series of thresholds to
convert this value into the desired label set, under
the assumption that the bigger $\pi_{+}(x)$ is, the more
positive the review.⁹ This algorithm always outperforms
the majority-class baseline, but not to the degree
that the best of SVM OVA and SVM regression
does. Koppel and Schler (2005) independently
found in a three-class study that thresholding a
positive/negative classifier trained only on clearly positive
or clearly negative examples did not yield large
improvements.
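The threshold-learning step can be sketched as follows. The text above does not specify how the thresholds were learned, so this example assumes a simple exhaustive search over cut points that maximizes training accuracy; the function names and the toy scores are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations


def apply_thresholds(score, thresholds):
    """Map a scalar score to a label index.

    The label is the number of thresholds that are <= score, so sorted
    thresholds (t1, t2) yield labels 0, 1, or 2.
    """
    return bisect_right(thresholds, score)


def learn_thresholds(scores, labels, num_classes):
    """Pick the num_classes - 1 cut points that maximize training accuracy.

    Assumption: candidates are midpoints between consecutive distinct
    scores, searched exhaustively (fine for small illustrative data).
    Returns None if there are too few distinct scores.
    """
    distinct = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best, best_correct = None, -1
    for cuts in combinations(candidates, num_classes - 1):
        correct = sum(
            apply_thresholds(s, cuts) == y for s, y in zip(scores, labels)
        )
        if correct > best_correct:
            best, best_correct = list(cuts), correct
    return best


if __name__ == "__main__":
    # Hypothetical classifier outputs and three-class labels.
    toy_scores = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
    toy_labels = [0, 0, 1, 1, 2, 2]
    print(learn_thresholds(toy_scores, toy_labels, 3))
```

The search is exponential in the number of thresholds, which is tolerable for three or four classes but would need a dynamic-programming or greedy replacement for finer scales.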
A.2 Discretizing regression
In our experiments with SVM regression, we discretized
regression output via a set of fixed decision
thresholds $\{0.5, 1.5, 2.5, \ldots\}$ to map it into our set of
class labels. Alternatively, we can learn the thresholds
instead. Neither option clearly outperforms the
other in the four-class case. In the three-class set-
ting, the learned version provides noticeably better
performance in two of the four datasets. But these
results taken together still mean that in many cases,
the difference is negligible, and if we had started
down this path, we would have needed to consider
similar tweaks for one-vs-all SVM as well. We
therefore stuck with the simpler version in order to
maintain focus on the central issues at hand.
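The fixed-threshold discretization can be sketched in a few lines. This example assumes that class labels are the integers 0 through k-1 and that a value exactly on a threshold goes to the higher class; the function name is hypothetical.

```python
from bisect import bisect_right


def discretize(value, num_classes):
    """Map a real-valued regression output to a label in 0..num_classes-1.

    Uses the fixed thresholds 0.5, 1.5, 2.5, ...; outputs below 0.5 map to
    the lowest class and outputs above the last threshold to the highest.
    """
    thresholds = [i + 0.5 for i in range(num_classes - 1)]
    return bisect_right(thresholds, value)


if __name__ == "__main__":
    for output in (-0.3, 0.49, 1.2, 2.6, 5.0):
        print(output, discretize(output, 4))
```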
⁹ This is not necessarily true: if the classifier’s goal is to optimize
binary classification error, its major concern is to increase
confidence in the positive/negative distinction, which may not
correspond to higher confidence in separating “five stars” from
“four stars”.
