Towards Answering Opinion Questions: Separating Facts from Opinions
and Identifying the Polarity of Opinion Sentences
Hong Yu
Department of Computer Science
Columbia University
New York, NY 10027, USA
hongyu@cs.columbia.edu
Vasileios Hatzivassiloglou
Department of Computer Science
Columbia University
New York, NY 10027, USA
vh@cs.columbia.edu
Abstract
Opinion question answering is a challenging task
for natural language processing. In this paper, we
discuss a necessary component for an opinion ques-
tion answering system: separating opinions from
fact, at both the document and sentence level. We
present a Bayesian classifier for discriminating be-
tween documents with a preponderance of opinions
such as editorials from regular news stories, and
describe three unsupervised, statistical techniques
for the significantly harder task of detecting opin-
ions at the sentence level. We also present a first
model for classifying opinion sentences as positive
or negative in terms of the main perspective be-
ing expressed in the opinion. Results from a large
collection of news stories and a human evaluation
of 400 sentences are reported, indicating that we
achieve very high performance in document classi-
fication (upwards of 97% precision and recall), and
respectable performance in detecting opinions and
classifying them at the sentence level as positive,
negative, or neutral (up to 91% accuracy).
1 Introduction
Newswire articles include those that mainly present
opinions or ideas, such as editorials and letters to
the editor, and those that mainly report facts such as
daily news articles. Text materials from many other
sources also contain mixed facts and opinions. For
many natural language processing applications, the
ability to detect and classify factual and opinion sen-
tences offers distinct advantages in deciding what in-
formation to extract and how to organize and present
this information. For example, information extrac-
tion applications may target factual statements rather
than subjective opinions, and summarization sys-
tems may list separately factual information and ag-
gregate opinions according to distinct perspectives.
At the document level, information retrieval systems
can target particular types of articles and even utilize
perspectives in focusing queries (e.g., filtering or re-
trieving only editorials in favor of a particular policy
decision).
Our motivation for building the opinion detec-
tion and classification system described in this pa-
per is the need for organizing information in the
context of question answering for complex ques-
tions. Unlike questions like “Who was the first
man on the moon?” which can be answered with
a simple phrase, more intricate questions such as
“What are the reasons for the US-Iraq war?” require
long answers that must be constructed from multi-
ple sources. In such a context, it is imperative that
the question answering system can discriminate be-
tween opinions and facts, and either use the appro-
priate type depending on the question or combine
them in a meaningful presentation. Perspective in-
formation can also help highlight contrasts and con-
tradictions between different sources—there will be
significant disparity in the material collected for the
question mentioned above between Fox News and
the Independent, for example.
Fully analyzing and classifying opinions involves
tasks that relate to some fairly deep semantic and
syntactic analysis of the text. These include not only
recognizing that the text is subjective, but also de-
termining who the holder of the opinion is, what the
opinion is about, and which of many possible posi-
tions the holder of the opinion expresses regarding
that subject. In this paper, we present three
of the components of our opinion detection and or-
ganization subsystem, which have already been in-
tegrated into our larger question-answering system.
These components deal with the initial tasks of clas-
sifying articles as mostly subjective or objective,
finding opinion sentences in both kinds of articles,
and determining, in general terms and without refer-
ence to a specific subject, if the opinions are positive
or negative. The three modules of the system dis-
cussed here provide the basis for ongoing work for
further classification of opinions according to sub-
ject and opinion holder and for refining the original
positive/negative attitude determination.
We review related work in Section 2, and then
present our document-level classifier for opinion or
factual articles (Section 3), three implemented tech-
niques for detecting opinions at the sentence level
(Section 4), and our approach for rating an opinion
as positive or negative (Section 5). We have evalu-
ated these methods using a large collection of news
articles without additional annotation (Section 6)
and an evaluation corpus of 400 sentences anno-
tated for opinion classifications (Section 7). The
results, presented in Section 8, indicate that we
achieve very high performance (more than 97%) at
document-level classification and respectable per-
formance (86–91%) at detecting opinion sentences
and classifying them according to orientation.
2 Related Work
Much of the earlier research in automated opinion
detection has been performed by Wiebe and col-
leagues (Bruce and Wiebe, 1999; Wiebe et al., 1999;
Hatzivassiloglou and Wiebe, 2000; Wiebe, 2000;
Wiebe et al., 2002), who proposed methods for dis-
criminating between subjective and objective text at
the document, sentence, and phrase levels. Bruce
and Wiebe (1999) annotated 1,001 sentences as sub-
jective or objective, and Wiebe et al. (1999) de-
scribed a sentence-level Naive Bayes classifier using
as features the presence or absence of particular syn-
tactic classes (pronouns, adjectives, cardinal num-
bers, modal verbs, adverbs), punctuation, and sen-
tence position. Subsequently, Hatzivassiloglou and
Wiebe (2000) showed that automatically detected
gradable adjectives are a useful feature for subjec-
tivity classification, while Wiebe (2000) introduced
lexical features in addition to the presence/absence
of syntactic categories. More recently, Wiebe et al.
(2002) report on document-level subjectivity classi-
fication, using a k-nearest neighbor algorithm based
on the total count of subjective words and phrases
within each document.
Psychological studies (Bradley and Lang, 1999)
found measurable associations between words and
human emotions. Hatzivassiloglou and McKeown
(1997) described an unsupervised learning method
for obtaining positively and negatively oriented ad-
jectives with accuracy over 90%, and demonstrated
that this semantic orientation, or polarity, is a con-
sistent lexical property with high inter-rater agree-
ment. Turney (2002) showed that it is possible
to use only a few of those semantically oriented
words (namely, “excellent” and “poor”) to label
other phrases co-occurring with them as positive or
negative. He then used these phrases to automati-
cally separate positive and negative movie and prod-
uct reviews, with accuracy of 66–84%. Pang et al.
(2002) adopted a more direct approach, using super-
vised machine learning with words and n-grams as
features to predict orientation at the document level
with up to 83% precision.
Our approach to document and sentence classi-
fication of opinions builds upon the earlier work
by using extended lexical models with additional
features. Unlike the work cited above, we do not
rely on human annotations for training but only on
weak metadata provided at the document level. Our
sentence-level classifiers introduce additional crite-
ria for detecting subjective material (opinions), in-
cluding methods based on sentence similarity within
a topic and an approach that relies on multiple clas-
sifiers. At the document level, our classifier uses the
same document labels that the method of (Wiebe et
al., 2002) does, but automatically detects the words
and phrases of importance without further analy-
sis of the text. For determining whether an opin-
ion sentence is positive or negative, we have used
seed words similar to those produced by (Hatzivas-
siloglou and McKeown, 1997) and extended them to
construct a much larger set of semantically oriented
words with a method similar to that proposed by
(Turney, 2002). Our focus is on the sentence level,
unlike (Pang et al., 2002) and (Turney, 2002); we
employ a significantly larger set of seed words, and
we explore as indicators of orientation words from
syntactic classes other than adjectives (nouns, verbs,
and adverbs).
3 Document Classification
To separate documents that contain primarily opinions
from documents that report mainly facts, we applied
Naive Bayes,¹ a commonly used supervised
machine-learning algorithm. This approach pre-
supposes the availability of at least a collection of ar-
ticles with pre-assigned opinion and fact labels at the
document level; fortunately, Wall Street Journal ar-
ticles contain such metadata by identifying the type
of each article as Editorial, Letter to editor, Business
and News. These labels are used only to provide the
correct classification labels during training and eval-
uation, and are not included in the feature space. We
used as features single words, without stemming or
stopword removal. Naive Bayes assigns a document
$d$ to the class $c$ that maximizes $P(c \mid d)$ by applying
Bayes’ rule $P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}$ and assuming
conditional independence of the features.
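The decision rule can be written out as follows. This is a sketch of the standard derivation rather than a formula taken from the Rainbow implementation: the word tokens $w_1, \ldots, w_n$ of document $d$ are notation introduced here, and treating the document as a product over word tokens is one common reading of the independence assumption.

```latex
\begin{align*}
\hat{c} &= \arg\max_{c} P(c \mid d)
         = \arg\max_{c} \frac{P(c)\, P(d \mid c)}{P(d)} \\
        &= \arg\max_{c} P(c) \prod_{k=1}^{n} P(w_k \mid c)
           && \text{($P(d)$ is the same for every class)} \\
        &= \arg\max_{c} \Big[ \log P(c) + \sum_{k=1}^{n} \log P(w_k \mid c) \Big]
\end{align*}
```

The log form in the last line is what implementations usually compute, since the product of many small probabilities underflows.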
Although Naive Bayes can be outperformed in
text classification tasks by more complex methods
such as SVMs, Pang et al. (2002) report similar per-
formance for Naive Bayes and other machine learn-
ing techniques for a similar task, that of distinguish-
ing between positive and negative reviews at the
document level. Further, we achieved such high per-
formance with Naive Bayes (see Section 8) that ex-
ploring additional techniques for this task seemed
unnecessary.
4 Finding Opinion Sentences
We developed three different approaches to distinguish
opinions from facts at the sentence level. To
avoid the need for obtaining individual sentence an-
notations for training and evaluation, we rely in-
stead on the expectation that documents classified
as opinion on the whole (e.g., editorials) will tend to
have mostly opinion sentences, and conversely doc-
uments placed in the factual category will tend to
have mostly factual sentences. Wiebe et al. (2002)
report that this expectation is borne out 75% of the
time for opinion documents and 56% of the time for
factual documents.
¹ Using the Rainbow implementation, available from www.cs.cmu.edu/~mccallum/bow/rainbow.
4.1 Similarity Approach
Our first approach to classifying sentences as opinions
or facts explores the hypothesis that, within a
given topic, opinion sentences will be more similar
to other opinion sentences than to factual sentences.
We used SIMFINDER (Hatzivassiloglou et
al., 2001), a state-of-the-art system for measuring
sentence similarity based on shared words, phrases,
and WordNet synsets. To measure the overall simi-
larity of a sentence to the opinion or fact documents,
we first select the documents that are on the same
topic as the sentence in question. We obtain topics
as the results of IR queries (for example, by search-
ing our document collection for “welfare reform”).
We then average the sentence’s SIMFINDER-provided similarities
with each sentence in those documents, separately for
the opinion and the fact documents. Then
we assign the sentence to the category for which the
average is higher (we call this approach the “score”
variant). Alternatively, for the “frequency” variant,
we do not use the similarity scores themselves but
instead we count how many of them, for each cate-
gory, exceed a predetermined threshold (empirically
set to 0.65).
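The two variants can be sketched as below. SIMFINDER itself is not reproduced: the `jaccard` function is a hypothetical stand-in based on word overlap, the function and parameter names are illustrative, and ties are resolved in favor of "fact", a choice the text does not specify. The 0.65 threshold was tuned for SIMFINDER scores and is reused here only for illustration.

```python
from statistics import mean

def jaccard(a: str, b: str) -> float:
    """Stand-in similarity (word-set overlap); NOT the SIMFINDER measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def classify_by_similarity(sentence, opinion_sents, fact_sents,
                           variant="score", threshold=0.65, sim=jaccard):
    """Assign `sentence` to 'opinion' or 'fact' using same-topic sentences.

    variant="score":     compare average similarities per category.
    variant="frequency": compare counts of similarities above `threshold`.
    Both sentence lists are assumed to be non-empty.
    """
    op = [sim(sentence, s) for s in opinion_sents]
    fa = [sim(sentence, s) for s in fact_sents]
    if variant == "score":
        op_val, fa_val = mean(op), mean(fa)
    else:
        op_val = sum(x > threshold for x in op)
        fa_val = sum(x > threshold for x in fa)
    # Assumption: ties go to "fact".
    return "opinion" if op_val > fa_val else "fact"

opinion_sents = ["the reform is a terrible idea",
                 "this reform is a terrible mistake"]
fact_sents = ["the senate passed the bill on tuesday",
              "the bill passed with 60 votes"]
label = classify_by_similarity("the reform is a terrible plan",
                               opinion_sents, fact_sents)
```

Passing a different `sim` function is the only change needed to plug in a richer similarity measure.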
4.2 Naive Bayes Classifier
Our second method trains a Naive Bayes classifier
(see Section 3), using the sentences in opinion and
fact documents as the examples of the two cate-
gories. The features include words, bigrams, and
trigrams, as well as the parts of speech in each sen-
tence. In addition, the presence of semantically ori-
ented (positive and negative) words in a sentence is
an indicator that the sentence is subjective (Hatzi-
vassiloglou and Wiebe, 2000). Therefore, we in-
clude in our features the counts of positive and neg-
ative words in the sentence (which are obtained with
the method of Section 5.1), as well as counts of
the polarities of sequences of semantically oriented
words (e.g., “++” for two consecutive positively ori-
ented words). We also include the counts of parts
of speech combined with polarity information (e.g.,
“JJ+” for positive adjectives), as well as features en-
coding the polarity (if any) of the head verb, the
main subject, and their immediate modifiers. Syn-
tactic structure was obtained with Charniak’s statis-
tical parser (Charniak, 2000). Finally, we used as
one of the features the average semantic orientation
score of the words in the sentence.
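A minimal sketch of how the polarity-based features described above might be counted for one sentence. The feature names (`pos_words`, `seq:++`, and so on), the restriction to sequences of length two, and the lexicon format are assumptions made for illustration; the features for the head verb and main subject, which need a parse, are omitted.

```python
from collections import Counter

def polarity_features(tagged, lexicon):
    """Count polarity features for one sentence.

    tagged:  list of (word, pos_tag) pairs
    lexicon: dict mapping a word to '+' or '-' (hypothetical format)
    """
    feats = Counter()
    run = ""  # signs of the current run of consecutive oriented words
    for word, tag in tagged:
        sign = lexicon.get(word.lower())
        if sign is None:
            run = ""          # a non-oriented word breaks the sequence
            continue
        feats["pos_words" if sign == "+" else "neg_words"] += 1
        feats[tag + sign] += 1            # e.g. "JJ+" for a positive adjective
        run += sign
        if len(run) >= 2:
            feats["seq:" + run[-2:]] += 1  # e.g. "seq:++"
    return feats

lexicon = {"truly": "+", "inspirational": "+", "luck": "+",
           "disastrously": "-", "depraved": "-"}
tagged = [("truly", "RB"), ("inspirational", "JJ"), ("but", "CC"),
          ("disastrously", "RB"), ("depraved", "JJ"), ("luck", "NN")]
feats = polarity_features(tagged, lexicon)
```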
4.3 Multiple Naive Bayes Classifiers
Our designation of all sentences in opinion or factual
articles as opinion or fact sentences is an approxima-
tion. To address this, we apply an algorithm using
multiple classifiers, each relying on a different sub-
set of our features. The goal is to reduce the training
set to the sentences that are most likely to be cor-
rectly labeled, thus boosting classification accuracy.
Given separate sets of features $F_1, F_2, \ldots, F_m$,
we train separate Naive Bayes classifiers $C_1, C_2, \ldots, C_m$
corresponding to each feature set. Assuming
as ground truth the information provided by the
document labels and that all sentences inherit the
status of their document as opinions or facts, we
first train $C_1$ on the entire training set, then use the
resulting classifier to predict labels for the training
set. The sentences that receive a label different from
the assumed truth are then removed, and we train
$C_2$ on the remaining sentences. This process is repeated
iteratively until no more sentences can be removed.
We report results using five feature sets,
starting from words alone and adding in bigrams, tri-
grams, part-of-speech, and polarity.
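The data-cleaning loop can be sketched as follows. The small multinomial Naive Bayes with add-one smoothing is a stand-in written for this example (it is not Rainbow), the two feature extractors are illustrative, and the loop makes one pass over the feature sets and stops early when a pass removes nothing, which is one possible reading of the procedure.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, featurize):
    """Train a multinomial Naive Bayes model; return a predict(text) function."""
    class_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        for f in featurize(text):
            feat_counts[label][f] += 1
            vocab.add(f)
    total = sum(class_counts.values())

    def predict(text):
        best, best_score = None, -math.inf
        for c in class_counts:
            denom = sum(feat_counts[c].values()) + len(vocab)
            score = math.log(class_counts[c] / total)
            for f in featurize(text):
                if f in vocab:  # ignore features never seen in training
                    score += math.log((feat_counts[c][f] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best
    return predict

def unigrams(text):
    return text.lower().split()

def uni_bigrams(text):
    w = unigrams(text)
    return w + [a + "_" + b for a, b in zip(w, w[1:])]

def clean_training_set(examples, feature_sets):
    """Train one classifier per feature set in turn, dropping sentences whose
    predicted label differs from the label inherited from their document."""
    remaining = list(examples)
    for featurize in feature_sets:
        predict = train_nb(remaining, featurize)
        kept = [(t, y) for t, y in remaining if predict(t) == y]
        if len(kept) == len(remaining):
            break  # no more sentences can be removed
        remaining = kept
    return remaining

noisy = ("the company reported losses on monday", "opinion")  # factual sentence in an editorial
data = [("i think this policy is awful", "opinion"),
        ("i believe the plan is wonderful", "opinion"),
        ("i think the decision is awful", "opinion"),
        noisy,
        ("the company reported earnings on monday", "fact"),
        ("the company reported profits on tuesday", "fact"),
        ("the company reported sales on friday", "fact")]
cleaned = clean_training_set(data, [unigrams, uni_bigrams])
```

A final classifier would then be trained on `cleaned` with the richest feature set.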
5 Identifying the Polarity of Opinion
Sentences
Having distinguished whether a sentence is a fact or
opinion, we separate positive, negative, and neutral
opinions into three classes. We base this decision
on the number and strength of semantically oriented
words (either positive or negative) in the sentence.
We first discuss how such words are automatically
found by our system, and then describe the method
by which we aggregate this information across the
sentence.
5.1 Semantically Oriented Words
To determine which words are semantically ori-
ented, in what direction, and the strength of their
orientation, we measured their co-occurrence with
words from a known seed set of semantically ori-
ented words. The approach is based on the hypothe-
sis that positive words co-occur more than expected
by chance, and so do negative words; this hypothe-
sis was validated, at least for strong positive/negative
words, in (Turney, 2002). As seed words, we used
subsets of the 1,336 adjectives that were manually
classified as positive (657) or negative (679) by
Hatzivassiloglou and McKeown (1997). In earlier
work (Turney, 2002) only singletons were used as
seed words; varying their number allows us to test
whether multiple seed words have a positive effect
on detection performance. We experimented with
seed sets containing 1, 20, 100 and over 600 positive
and negative pairs of adjectives. For a given seed set
size, we denote the set of positive seeds as $\mathrm{ADJ}_p$
and the set of negative seeds as $\mathrm{ADJ}_n$. We then calculate
a modified log-likelihood ratio $L(W_i, \mathrm{POS}_j)$
for a word $W_i$ with part of speech $\mathrm{POS}_j$ ($j$ can be
adjective, adverb, noun or verb) as the ratio of its
collocation frequency with $\mathrm{ADJ}_p$ and $\mathrm{ADJ}_n$ within a
sentence,
$$L(W_i, \mathrm{POS}_j) = \log \left( \frac{\dfrac{\mathrm{Freq}(W_i, \mathrm{POS}_j, \mathrm{ADJ}_p) + \epsilon}{\mathrm{Freq}(W_{\mathrm{all}}, \mathrm{POS}_j, \mathrm{ADJ}_p)}}{\dfrac{\mathrm{Freq}(W_i, \mathrm{POS}_j, \mathrm{ADJ}_n) + \epsilon}{\mathrm{Freq}(W_{\mathrm{all}}, \mathrm{POS}_j, \mathrm{ADJ}_n)}} \right)$$
where $\mathrm{Freq}(W_{\mathrm{all}}, \mathrm{POS}_j, \mathrm{ADJ}_p)$ represents the collocation
frequency of all words $W_{\mathrm{all}}$ of part of speech
$\mathrm{POS}_j$ with $\mathrm{ADJ}_p$ and $\epsilon$ is a smoothing constant
($\epsilon = 0.5$ in our case). We used Brill’s tagger (Brill,
1995) to obtain part-of-speech information.
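A minimal sketch of the ratio defined in this section, under stated assumptions: a word token counts as collocated with a seed set when at least one seed from that set occurs in the same sentence, seed words themselves are not excluded from the counts, and words whose part of speech never co-occurs with one of the seed sets are skipped. The function name and data layout are illustrative.

```python
import math
from collections import Counter

def orientation_scores(tagged_sentences, pos_seeds, neg_seeds, eps=0.5):
    """Return {(word, pos_tag): L(W_i, POS_j)} from sentence-level collocations."""
    with_p, with_n = Counter(), Counter()  # Freq(W_i, POS_j, ADJ_p / ADJ_n)
    all_p, all_n = Counter(), Counter()    # Freq(W_all, POS_j, ADJ_p / ADJ_n)
    for sent in tagged_sentences:
        words = {w for w, _ in sent}
        has_p = bool(words & pos_seeds)
        has_n = bool(words & neg_seeds)
        for w, tag in sent:
            if has_p:
                with_p[(w, tag)] += 1
                all_p[tag] += 1
            if has_n:
                with_n[(w, tag)] += 1
                all_n[tag] += 1
    scores = {}
    for key in set(with_p) | set(with_n):
        tag = key[1]
        if all_p[tag] == 0 or all_n[tag] == 0:
            continue  # assumption: no score without evidence on both sides
        p = (with_p[key] + eps) / all_p[tag]
        n = (with_n[key] + eps) / all_n[tag]
        scores[key] = math.log(p / n)
    return scores

corpus = [
    [("excellent", "JJ"), ("inspirational", "JJ"), ("speech", "NN")],
    [("poor", "JJ"), ("depraved", "JJ"), ("speech", "NN")],
    [("excellent", "JJ"), ("inspirational", "JJ"), ("idea", "NN")],
]
scores = orientation_scores(corpus, {"excellent"}, {"poor"})
```

Positive scores indicate words that collocate more with the positive seeds, negative scores the opposite; the magnitude serves as the strength of the orientation.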
5.2 Sentence Polarity Tagging
As our measure of semantic orientation across an
entire sentence we used the average per word log-
likelihood scores defined in the preceding section.
To determine the orientation of an opinion sentence,
all that remains is to specify cutoffs $t_p$ and $t_n$ so
that sentences for which the average log-likelihood
score exceeds $t_p$ are classified as positive opinions,
sentences with scores lower than $t_n$ are classified
as negative opinions, and sentences with in-between
scores are treated as neutral opinions. Optimal values
for $t_p$ and $t_n$ are obtained from the training data
via density estimation—using a small, hand-labeled
subset of sentences we estimate the proportion of
sentences that are positive or negative. The values
of the average log-likelihood score that correspond
to the appropriate tails of the score distribution are
then determined via Monte Carlo analysis of a much
larger sample of unlabeled training data.
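The tagging step can be sketched as below. The cutoff estimation uses plain empirical quantiles of the unlabeled scores as a simplified stand-in for the Monte Carlo analysis, and the treatment of sentences with no scored words (score 0.0) is an assumption; all function names are illustrative.

```python
def sentence_score(tagged, word_scores, allowed_tags):
    """Average per-word orientation score over words with an allowed POS tag."""
    vals = [word_scores[(w, t)] for w, t in tagged
            if t in allowed_tags and (w, t) in word_scores]
    return sum(vals) / len(vals) if vals else 0.0  # assumption: 0.0 if no evidence

def cutoffs_from_proportions(unlabeled_scores, frac_pos, frac_neg):
    """Pick t_p and t_n so that roughly frac_pos of the scores lie above t_p
    and roughly frac_neg lie below t_n (fractions assumed to sum to < 1)."""
    s = sorted(unlabeled_scores)
    n = len(s)
    t_n = s[round(frac_neg * n)]
    t_p = s[n - 1 - round(frac_pos * n)]
    return t_p, t_n

def tag_polarity(score, t_p, t_n):
    if score > t_p:
        return "positive"
    if score < t_n:
        return "negative"
    return "neutral"

unlabeled = [-2, -1.5, -1, -0.5, 0, 0, 0.5, 1, 1.5, 2]
t_p, t_n = cutoffs_from_proportions(unlabeled, frac_pos=0.2, frac_neg=0.3)

word_scores = {("inspirational", "JJ"): 1.0, ("truly", "RB"): 0.5,
               ("problem", "NN"): -2.0}
tagged = [("truly", "RB"), ("inspirational", "JJ"), ("problem", "NN")]
score = sentence_score(tagged, word_scores, allowed_tags={"JJ", "RB", "VB"})
```

Restricting `allowed_tags` corresponds to choosing which parts of speech contribute to the aggregate score.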
6 Data
We used the TREC² 8, 9, and 11 collections, which
consist of more than 1.7 million newswire articles.
The aggregate collection covers six different
newswire sources including 173,252 Wall Street
Journal (WSJ) articles from 1987 to 1992. Some
of the WSJ articles have structured headings that
include Editorial, Letter to editor, Business, and
News (2,877, 1,695, 2,009 and 3,714 articles, respectively).
We randomly selected 2,000 articles³
from each category so that our data set was approximately
evenly divided between fact and opinion articles.
Those articles were used for both document
and sentence level opinion/fact classification.
² http://trec.nist.gov/.
7 Evaluation Metrics and Gold Standard
For classification tasks (i.e., classifying between
facts and opinions and identifying the semantic ori-
entation of sentences), we measured our system’s
performance by standard recall and precision. We
evaluated the quality of semantically oriented words
by mapping the extracted words and labels to an ex-
ternal gold standard. We took the subset of our out-
put containing words that appear in the standard, and
measured the accuracy of our output as the portion
of that subset that was assigned the correct label.
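These measures are simple to compute; the sketch below is illustrative only, with hypothetical function names and data layouts. The lexicon accuracy is restricted to words present in the gold standard, as described above, and the F-measure is the harmonic mean of recall and precision.

```python
def lexicon_accuracy(predicted, gold):
    """Accuracy over the words that appear in both the output and the gold standard.

    predicted, gold: dicts mapping word -> orientation label.
    Returns None when the two share no words.
    """
    shared = [w for w in predicted if w in gold]
    if not shared:
        return None
    return sum(predicted[w] == gold[w] for w in shared) / len(shared)

def recall_precision(predicted, gold, target):
    """(recall, precision) for class `target` over parallel label lists."""
    tp = sum(p == target and g == target for p, g in zip(predicted, gold))
    n_gold = sum(g == target for g in gold)
    n_pred = sum(p == target for p in predicted)
    recall = tp / n_gold if n_gold else 0.0
    precision = tp / n_pred if n_pred else 0.0
    return recall, precision

def f_measure(recall, precision):
    """Harmonic mean of recall and precision."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0

acc = lexicon_accuracy({"good": "+", "bad": "-", "odd": "+", "table": "+"},
                       {"good": "+", "bad": "-", "odd": "-"})
r, p = recall_precision(["fact", "opinion", "fact", "opinion"],
                        ["fact", "opinion", "opinion", "fact"], "opinion")
```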
A gold standard for document-level classification
is readily available, since each article in our Wall
Street Journal collection comes with an article type
label (see Section 6). We mapped article types News
and Business to facts, and article types Editorial and
Letter to the Editor to opinions. We cannot auto-
matically select a sentence-level gold standard dis-
criminating between facts and opinions, or between
positive and negative opinions. We therefore asked
human evaluators to classify a set of sentences as
facts or opinions and to determine the
type of each opinion.
Since we have implemented our methods in an
opinion question answering system, we selected four
different topics (gun control, illegal aliens, social
security, and welfare reform). For each topic, we
randomly selected 25 articles from the entire com-
bined TREC corpus (not just the WSJ portion); these
were articles matching the corresponding topical
phrase given above as determined by the Lucene
search engine.⁴ From each of these documents we
randomly selected four sentences. If a document
happened to have fewer than four sentences, additional
documents from the same topic were retrieved to
supply the missing sentences. The resulting
$4 \times 25 \times 4 = 400$ sentences were then interleaved so
that successive sentences came from different topics
and documents and divided into ten 50-sentence
blocks. Each block shares ten sentences with the
preceding and following block (the last block is considered
to precede the first one), so that 100 of the
400 sentences appear in two blocks. Each of ten human
evaluators (all with graduate training in computational
linguistics) was presented with one block
and asked to select a label for each sentence among
the following: “fact”, “positive opinion”, “negative
opinion”, “neutral opinion”, “sentence contains both
positive and negative opinions”, “opinion but cannot
determine orientation”, and “uncertain”.⁵
³ Except for Letters to Editor, for which we included all
1,695 articles available.
⁴ http://www.jguru.com/faq/Lucene.
Label                      A     B    Agreement
Fact                     123    16    46%
Opinion                  258    65    77%
Uncertain                 19     1    33%
Breakdown of opinion labels
Positive                  33     4    29%
Negative                 131    27    51%
No orientation            45     6    26%
Mixed orientation          8     0     0%
Uncertain orientation     41     1     7%
Table 1: Statistics of gold standards A and B.
Since we have one judgment for 300 sentences
and two judgments for 100 sentences, we created
two gold standards for sentence classification. The
first (Standard A) includes the 300 sentences with
one judgment and a single judgment for the remaining
100 sentences.⁶ The second standard (Standard
B) contains the subset of the 100 sentences for which
we obtained identical labels. Statistics of these two
standards are given in Table 1. We measured the
pairwise agreement among the 100 sentences that
were judged by two evaluators, as the ratio of the
number of sentences that receive a given label from
both evaluators to the total number of sentences
receiving that label from either evaluator. The
agreement across the 100 sentences for all seven
choices was 55%; if we group together the five subtypes
of opinion sentences, the overall agreement rises to 82%. The
low agreement for some labels was not surprising
because there is much ambiguity between facts and
opinions. An example of an arguable sentence is “A
lethal guerrilla war between poachers and wardens
now rages in central and eastern Africa”, which one
rater classified as “fact” and another rater classified
as “opinion”.
⁵ The full instructions can be viewed online at
http://www1.cs.columbia.edu/~hongyu/research/emnlp03/opinion-eval-instructions.html.
⁶ In order to assign a unique label, we arbitrarily chose the
first evaluator for those sentences.
                                      F-measure
News vs. Editorial                    0.96
News+Business vs. Editorial+Letter    0.97
Table 2: Document-level fact/opinion classification
by Naive Bayes algorithm.
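The agreement figures discussed in this section can be computed as sketched below. The per-label measure follows the definition given in the text; reading the overall agreement as the fraction of doubly judged sentences with identical labels is an assumption, although it is consistent with Table 1 (16 + 65 + 1 = 82 of 100 sentences for the grouped labels). Function names and the grouping rule are illustrative.

```python
def label_agreement(first, second, label):
    """Sentences given `label` by both evaluators, divided by
    sentences given `label` by at least one evaluator."""
    pairs = list(zip(first, second))
    both = sum(a == label and b == label for a, b in pairs)
    either = sum(a == label or b == label for a, b in pairs)
    return both / either if either else None

def overall_agreement(first, second, group=lambda label: label):
    """Fraction of sentences whose (optionally grouped) labels coincide."""
    same = sum(group(a) == group(b) for a, b in zip(first, second))
    return same / len(first)

def merge_opinions(label):
    """Collapse the five opinion subtypes into a single 'opinion' label."""
    return "opinion" if "opinion" in label else label

first = ["fact", "positive opinion", "negative opinion", "fact", "uncertain"]
second = ["fact", "negative opinion", "negative opinion", "positive opinion", "uncertain"]

fact_agreement = label_agreement(first, second, "fact")
fine_grained = overall_agreement(first, second)
grouped = overall_agreement(first, second, group=merge_opinions)
```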
Finally, for evaluating the quality of extracted
words with semantic orientation labels, we used two
distinct manually labeled collections as gold stan-
dards. One set consists of the previously described
657 positive and 679 negative adjectives (Hatzi-
vassiloglou and McKeown, 1997). We also used
the ANEW list which was constructed during psy-
cholinguistic experiments (Bradley and Lang, 1999)
and contains 1,031 words of all four open classes.
As described in (Bradley and Lang, 1999), humans
assigned valence scores to each word according to
dimensions such as pleasure, arousal, and dominance;
following heuristics proposed in psycholinguistics⁷
we obtained 284 positive and 272 negative
words from the valence scores.
8 Results and Discussion
Document Classification We trained our Bayes
classifier for documents on 4,000 articles from the
WSJ portion of our combined TREC collection, and
evaluated on 4,000 other articles also from the WSJ
part. Table 2 lists the F-measure scores (the har-
monic mean of precision and recall) of our Bayesian
classifier for document-level opinion/fact classification.
The results show the classifier achieved 97%
F-measure, which is comparable to or higher than the
93% accuracy reported by (Wiebe et al., 2002),
who evaluated their work based on a similar set of
WSJ articles. The high classification performance
is also consistent with a high inter-rater agreement
(kappa=0.95) for document-level fact/opinion annotation
(Wiebe et al., 2002). Note that we trained and
evaluated only on WSJ articles for which we can obtain
article class metadata, so the classifier may perform
less accurately when used for other newswire
articles.
⁷ http://www.sci.sdsu.edu/CAL/wordlist/.
Variant      Class      Standard A      Standard B
Score        Fact       (0.61, 0.34)    (1.00, 0.27)
             Opinion    (0.30, 0.49)    (0.16, 0.64)
Frequency    Fact       (0.82, 0.32)    (0.89, 0.19)
             Opinion    (0.17, 0.55)    (0.28, 0.55)
Table 3: (Recall, precision) of similarity classifier.
Sentence Classification Table 3 shows the re-
call and precision of the similarity-based approach,
while Table 4 lists the recall and precision of Naive
Bayes (single and multiple classifiers) for sentence-
level opinion/fact classification. In both cases, the
results are better when we evaluate against Stan-
dard B, containing the sentences for which two hu-
mans assign the same label; obviously, it is easier for
the automatic system to produce the correct label in
these more clear-cut cases.
Our Naive Bayes classifier has a higher recall and
precision (80–90%) for detecting opinions than for
facts (around 50%). While words and n-grams had
little performance effect for the opinion class, they
increased the recall for the fact class about fivefold
compared to the approach by Wiebe et al. (1999).
In general, the additional features helped the classi-
fier; the best performance is achieved when words,
bigrams, trigrams, part-of-speech, and polarity are
included in the feature set. Further, using multiple
classifiers to automatically identify an appropriate
subset of the data for training slightly increases per-
formance.
Features                                    Class      Standard A                      Standard B
                                                       Single         Multiple         Single         Multiple
Features from (Wiebe et al., 1999)          Fact       (0.03, 0.38)   (0.03, 0.38)     (0.06, 1.00)   (0.06, 1.00)
                                            Opinion    (0.97, 0.69)   (0.97, 0.69)     (1.00, 0.80)   (1.00, 0.80)
Words only                                  Fact       (0.14, 0.39)   (0.12, 0.42)     (0.28, 0.42)   (0.28, 0.45)
                                            Opinion    (0.90, 0.69)   (0.92, 0.69)     (0.90, 0.82)   (0.91, 0.83)
Words and Bigrams                           Fact       (0.15, 0.39)   (0.12, 0.43)     (0.16, 0.25)   (0.16, 0.25)
                                            Opinion    (0.89, 0.69)   (0.92, 0.69)     (0.87, 0.79)   (0.87, 0.79)
Words, Bigrams, and Trigrams                Fact       (0.18, 0.44)   (0.13, 0.41)     (0.26, 0.50)   (0.26, 0.50)
                                            Opinion    (0.89, 0.70)   (0.91, 0.69)     (0.93, 0.82)   (0.93, 0.82)
Words, Bigrams, Trigrams,                   Fact       (0.17, 0.42)   (0.13, 0.40)     (0.18, 0.49)   (0.27, 0.44)
and Part-of-Speech                          Opinion    (0.89, 0.70)   (0.91, 0.69)     (0.92, 0.70)   (0.85, 0.84)
Words, Bigrams, Trigrams,                   Fact       (0.15, 0.43)   (0.13, 0.42)     (0.44, 0.50)   (0.44, 0.53)
Part-of-Speech, and Polarity                Opinion    (0.91, 0.69)   (0.92, 0.70)     (0.88, 0.86)   (0.91, 0.86)
Table 4: (Recall, precision) of opinion/fact sentence classification using different features and either a
single or multiple (data cleaning) classifiers.
Polarity Classification Using the method of Section
5.1, we automatically identified a total of 39,652
(65,773), 3,128 (4,426), 144,238 (195,984), and
22,279 (30,609) positive (negative) adjectives, adverbs,
nouns, and verbs, respectively. Extracted positive
words include inspirational, truly, luck, and
achieve. Negative ones include depraved, disastrously,
problem, and depress. Figure 1 plots the
recall and precision of extracted adjectives by using
randomly selected seed sets of 1, 20, and 100
pairs of positive and negative adjectives from the list
of (Hatzivassiloglou and McKeown, 1997). Both recall
and precision increase as the seed set becomes
larger. We obtained similar results with the ANEW
list of adjectives (Section 7). As an additional experiment,
we tested the effect of ignoring sentences
with negative particles, obtaining a small increase in
precision and recall.
Figure 1: Recall and precision (1,336 manually labeled
positive and negative adjectives as gold standard)
of extracted adjectives using 1, 20, and 100
positive and negative adjective pairs as seeds.
We subsequently used the automatically extracted
polarity score for each word to assign an aggregate
polarity to opinion sentences. Table 5 lists the accuracy
of our sentence-level tagging process. We experimented
with different combinations of part-of-speech
classes for calculating the aggregate polarity
scores, and found that the combined evidence from
adjectives, adverbs, and verbs achieves the highest
accuracy (90% over a baseline of 48%). As in the
case of sentence-level classification between opinion
and fact, we also found the performance to be
higher on Standard B, for which humans exhibited
consistent agreement.
Parts-of-speech Used                      A       B
Adjectives                                0.49    0.55
Adverbs                                   0.37    0.46
Nouns                                     0.54    0.52
Verbs                                     0.54    0.52
Adjectives and Adverbs                    0.55    0.84
Adjectives, Adverbs, and Verbs            0.68    0.90
Adjectives, Adverbs, Nouns, and Verbs     0.62    0.74
Table 5: Accuracy of sentence polarity tagging on
gold standards A and B for different sets of parts-of-speech.
9 Conclusions
We presented several models for distinguishing be-
tween opinions and facts, and between positive and
negative opinions. At the document level, a fairly
straightforward Bayesian classifier using lexical in-
formation can distinguish between mostly factual
and mostly opinion documents with very high pre-
cision and recall (F-measure of 97%). The task is
much harder at the sentence level. For that case,
we described three novel techniques for opinion/fact
classification achieving up to 91% precision and re-
call on the detection of opinion sentences. We also
examined an automatic method for assigning polar-
ity information to single words and sentences, accu-
rately discriminating between positive, negative, and
neutral opinions in 90% of the cases.
Our work so far has focused on characterizing
opinions and facts in a generic manner, without ex-
amining who the opinion holder is or what the opin-
ion is about. While we have found presenting in-
formation organized in separate opinion and fact
classes useful, our goal is to introduce further analy-
sis of each sentence so that opinion sentences can be
linked to particular perspectives on a specific sub-
ject. We intend to cluster together sentences from
the same perspective and present them in summary
form as answers to subjective questions.
Acknowledgments
We wish to thank Eugene Agichtein, Sasha Blair-
Goldensohn, Roy Byrd, John Chen, Noemie El-
hadad, Kathy McKeown, Becky Passonneau, and
the anonymous reviewers for valuable input on ear-
lier versions of this paper. We are grateful to the
graduate students at Columbia University who par-
ticipated in our evaluation of sentence-level opin-
ions. This work was supported by ARDA under
AQUAINT project MDA908-02-C-0008. Any opin-
ions, findings, or recommendations are those of
the authors and do not necessarily reflect ARDA’s
views.

References
M. M. Bradley and P. J. Lang. 1999. Affective norms
for English words (ANEW): Stimuli, instruction man-
ual and affective ratings. Technical Report C-1, The
Center for Research in Psychophysiology, University
of Florida, Gainesville, Florida.
Eric Brill. 1995. Transformation-based error-driven
learning and natural language processing: A case study
in part of speech tagging. Computational Linguistics.
Rebecca Bruce and Janyce Wiebe. 1999. Recognizing
subjectivity: A case study in manual tagging. Natural
Language Engineering, 5(2).
Eugene Charniak. 2000. A maximum-entropy–inspired
parser. In Proceedings of NAACL-2000.
Vasileios Hatzivassiloglou and Kathleen R. McKeown.
1997. Predicting the semantic orientation of adjec-
tives. In Proceedings of the 35th Annual Meeting
of the ACL and the 8th Conference of the European
Chapter of the ACL, pages 174–181, Madrid, Spain,
July. Association for Computational Linguistics.
Vasileios Hatzivassiloglou and Janyce Wiebe. 2000. Ef-
fects of adjective orientation and gradability on sen-
tence subjectivity. In Conference on Computational
Linguistics (COLING-2000).
Vasileios Hatzivassiloglou, Judith Klavans, Melissa Hol-
combe, Regina Barzilay, Min-Yen Kan, and Kathleen
McKeown. 2001. SIMFINDER: A flexible clustering
tool for summarization. In Proceedings of the Work-
shop on Summarization in NAACL-01.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.
2002. Thumbs up? Sentiment classification using machine
learning techniques. In Proceedings of the 2002
Conference on Empirical Methods in Natural Language
Processing (EMNLP-02), Philadelphia, Pennsylvania.
Peter Turney. 2002. Thumbs up or thumbs down? Semantic
orientation applied to unsupervised classification
of reviews. In Proceedings of the 40th Annual
Meeting of the Association for Computational Linguistics,
Philadelphia, Pennsylvania.
Janyce Wiebe, Rebecca Bruce, and Thomas O’Hara.
1999. Development and use of a gold standard data
set for subjectivity classifications. In Proceedings of
the 37th Annual Meeting of the Association for Com-
putational Linguistics (ACL-99), pages 246–253.
Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew
Bell, and M. Martin. 2002. Learning subjective
language. Technical Report TR-02-100, Department
of Computer Science, University of Pittsburgh, Pitts-
burgh, Pennsylvania.
Janyce Wiebe. 2000. Learning subjective adjectives
from corpora. In Proceedings of the 17th National
Conference on Artificial Intelligence (AAAI-2000),
Austin, Texas.
