A Corpus Study of Evaluative and Speculative Language
Janyce Wiebe*, Rebecca Bruce†, Matthew Bell*, Melanie Martin‡, Theresa Wilson*
*University of Pittsburgh, †University of North Carolina at Asheville, ‡New Mexico State University
wiebe,mbell,twilson@cs.pitt.edu, bruce@cs.unca.edu, mmartin@cs.nmsu.edu
Abstract
This paper presents a corpus study of evaluative and speculative
language. Knowledge of such language would be useful in many
applications, such as text categorization and summarization.
Analyses of annotator agreement and of characteristics of
subjective language are performed. This study yields knowledge
needed to design effective machine learning systems for
identifying subjective language.
1 Introduction
Subjectivity in natural language refers to aspects of language
used to express opinions and evaluations (Banfield, 1982; Wiebe,
1994). Subjectivity tagging is distinguishing sentences used to
present opinions and other forms of subjectivity (subjective
sentences) from sentences used to objectively present factual
information (objective sentences). This task is especially
relevant for news reporting and Internet forums, in which
opinions of various agents are expressed. There are numerous
applications for which subjectivity tagging is relevant. Two are
information retrieval and information extraction. Current
extraction and retrieval technology focuses almost exclusively on
the subject matter of documents. However, additional aspects of a
document influence its relevance, including, e.g., the evidential
status of the material presented, and the attitudes expressed
about the topic (Kessler et al., 1997). Knowledge of subjective
language would also be useful in flame recognition (Spertus,
1997; Kaufer, 2000), email classification (Aone et al., 2000),
intellectual attribution in text (Teufel and Moens, 2000),
recognizing speaker role in radio broadcasts (Barzilay et al.,
2000), review mining (Terveen et al., 1997), generation and style
(Hovy, 1987), clustering documents by ideological point of view
(Sack, 1995), and any other application that would benefit from
knowledge of how opinionated the language is, and whether or not
the writer purports to objectively present factual material.
To use subjectivity tagging in applications, good linguistic
clues must be found. As with many pragmatic and discourse
distinctions, existing lexical resources are not comprehensively
coded for subjectivity. The goal of our current work is learning
subjectivity clues from corpora. This paper contributes to this
goal by empirically examining subjectivity. We explore annotating
subjectivity at different levels (expression, sentence, document)
and produce corpora annotated at different levels. Annotator
agreement is analyzed to understand and assess the viability of
such annotations. In addition, because expression-level
annotations are fine-grained and thus very informative, these
annotations are examined to gain knowledge about subjectivity.
We also use our annotations and existing editorial annotations to
generate and test features of subjectivity. Altogether, the
observations and results of these studies provide valuable
information that will facilitate designing effective machine
learning systems for recognizing subjectivity.
The remainder of this paper first provides background about
subjectivity, then presents results for document-level
annotations, followed by an analysis of expression-level
annotations. Results for features generated using document-level
annotations come next, followed by conclusions.
2 Subjectivity
Sentence (1) is an example of a simple subjective sentence, and
(2) is an example of a simple objective sentence:¹
(1) At several different layers, it's a fascinating tale.
(2) Bell Industries Inc. increased its quarterly to
10 cents from 7 cents a share.
¹ The term subjectivity is due to Ann Banfield (1982). For
references to work on subjectivity, please see (Banfield, 1982;
Fludernik, 1993; Wiebe, 1994; Stein and Wright, 1995).
The main types of subjectivity are:
1. Evaluation. This category includes emotions such as hope and
hatred as well as evaluations, judgements, and opinions. Examples
of expressions involving positive evaluation are enthused,
wonderful, and great product!. Examples involving negative
evaluation are complained, you idiot!, and terrible product.
2. Speculation. This category includes anything that removes the
presupposition of events occurring or states holding, such as
speculation and uncertainty. Examples of speculative expressions
are speculated, and maybe.
Following are examples of strong negative
evaluative language from a corpus of Usenet
newsgroup messages:
(3a) I had in mind your facts, buddy, not hers.
(3b) Nice touch. "Alleges" whenever facts posted
are not in your persona of what is "real".
Following is an example of opinionated, edito-
rial language, taken from an editorial in the Wall
Street Journal:
(4) We stand in awe of the Woodstock generation's ability to be
unceasingly fascinated by the subject of itself.
Sentences (5) and (6) illustrate the fact that sentences about
speech events may be subjective or objective:
(5) Northwest Airlines settled the remaining lawsuits filed on
behalf of 156 people killed in a 1987 crash, but claims against
the jetliner's maker are being pursued, a federal judge said.
(6) "The cost of health care is eroding our standard of living
and sapping industrial strength," complains Walter Maher, a
Chrysler health-and-benefits specialist.
In (5), the material about lawsuits and claims is presented as
factual information, and a federal judge is given as the source
of information. In (6), in contrast, a complaint is presented. An
NLP system performing information extraction on (6) should not
treat the material in the quoted string as factual information,
with the complainer as a source of information, whereas a
corresponding treatment of sentence (5) would be appropriate.
Subjective sentences often contain individual expressions of
subjectivity. Examples are fascinating in (1), and eroding,
sapping, and complains in (6). The following paragraphs mention
aspects of subjectivity expressions that are relevant for NLP
applications.
First, although some expressions, such as !, are subjective in
all contexts, many, such as sapping and eroding, may or may not
be subjective, depending on the context in which they appear. A
potential subjective element (PSE) is a linguistic element that
may be used to express subjectivity. A subjective element is an
instance of a potential subjective element, in a particular
context, that is indeed subjective in that context (Wiebe, 1994).
Second, a subjective element expresses the subjectivity of a
source, who may be the writer or someone mentioned in the text.
For example, the source of fascinating in (1) is the writer,
while the source of the subjective elements in (6) is Maher. In
addition, a subjective element has a target, i.e., what the
subjectivity is about or directed toward. In (1), the target is a
tale; in (6), the target of Maher's subjectivity is the cost of
health care. These are examples of object-centric subjectivity,
which is about an object mentioned in the text (other examples:
"I love this project"; "The software is horrible"). Subjectivity
may also be addressee-oriented, i.e., directed toward the
listener or reader (e.g., "You are an idiot").
Third, there may be multiple subjective elements in a sentence,
possibly of different types and attributed to different sources
and targets. For example, in (4), subjectivity of the Woodstock
generation is described (specifically, its fascination with
itself). In addition, subjectivity of the writer is expressed
(e.g., 'we stand in awe'). As described below, individual
subjective elements were annotated as part of this work, refining
previous work on sentence-level annotations. Finally, PSEs may be
complex expressions such as 'village idiot', 'powers that be',
'You' NP, and 'What a' NP. There is a great variety of such
expressions, including many studied under the rubric of idioms
(see, for example, (Nunberg et al., 1994)). We address learning
such expressions in another project.
3 Previous Work on Subjectivity Tagging
In previous work (Wiebe et al., 1999; Bruce and Wiebe, 1999), a
corpus of sentences from the Wall Street Journal Treebank Corpus
(Marcus et al., 1993) was manually annotated with subjectivity
classifications by multiple judges. The judges were instructed to
consider a sentence to be subjective if they perceived any
significant expression of subjectivity (of any source) in the
sentence, and to consider the sentence to be objective otherwise.
Agreement was summarized in terms of Cohen's κ (Cohen, 1960),
which compares the total probability of agreement to that
expected if the taggers' classifications were statistically
independent (i.e., "chance agreement"). After two rounds of
tagging by three judges, an average pairwise κ value of .69 was
achieved on a test set. The EM learning algorithm was used to
produce corrected tags representing the consensus opinions of the
taggers (Goodman, 1974; Dawid and Skene, 1979). An automatic
system to perform subjectivity tagging was developed using the
new tags as training and testing data. In 10-fold cross
validation experiments, a probabilistic classifier obtained an
average accuracy on subjectivity tagging of 72.17%, more than 20
percentage points higher than a baseline accuracy obtained by
always choosing the more frequent class. Five part-of-speech
features, two lexical features, and a paragraph feature were
used.
To identify richer features, (Wiebe, 2000) used Lin's (1998)
method for clustering words according to distributional
similarity, seeded by a small amount of detailed manual
annotation, to automatically identify adjective PSEs. There are
two parameters of this process, neither of which was varied in
(Wiebe, 2000): C, the cluster size considered, and FT, a
filtering threshold, such that, if the seed word and the words in
its cluster have, as a set, lower precision than the filtering
threshold on the training data, the entire cluster, including the
seed word, is filtered out. This process is adapted for use in
the current paper, as described in section 7.
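The filtering step can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of (Wiebe, 2000): the function name, the data layout, and the toy counts are hypothetical, and the precision of a cluster is assumed to be the fraction of training instances of its words that fall inside subjective elements.

```python
def filter_clusters(clusters, counts, ft):
    """Keep seed clusters whose pooled training precision is at least ft.

    clusters: dict mapping a seed word to the words in its cluster
              (the caller has already cut each cluster to size C).
    counts:   dict mapping a word to (subjective_instances, total_instances)
              in the training data (hypothetical layout).
    ft:       the filtering threshold FT.
    """
    kept = {}
    for seed, neighbours in clusters.items():
        words = [seed] + list(neighbours)
        subj = sum(counts.get(w, (0, 0))[0] for w in words)
        total = sum(counts.get(w, (0, 0))[1] for w in words)
        # The seed and its cluster are judged as a set. A set with no
        # training instances has no measurable precision and is dropped.
        if total > 0 and subj / total >= ft:
            kept[seed] = words
    return kept


# Made-up counts, purely for illustration.
clusters = {"bizarre": ["strange", "odd"], "federal": ["state", "local"]}
counts = {
    "bizarre": (4, 5), "strange": (6, 10), "odd": (3, 5),
    "federal": (1, 20), "state": (2, 30), "local": (1, 10),
}
selected = filter_clusters(clusters, counts, ft=0.5)
```

With these toy counts, the first cluster has pooled precision 13/20 and is kept, while the second has 4/60 and is filtered out together with its seed word.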
4 Choices in Annotation
In expression-level annotation, the judges first identify the
sentences they believe are subjective. They next identify the
subjective elements in the sentence, i.e., the expressions they
feel are responsible for the subjective classification. For
example (subjective elements are in parentheses):
They promised (yet) more for (really good stuff).
(Perhaps you'll forgive me) for reposting his response.
Subjective-element (expression-level) annotations are probably
the most natural. Ultimately, we would like to recognize the
subjective elements in a text, and their types, targets, and
sources. However, both manual and automatic tagging at this level
are difficult because the tags are very fine-grained, and there
is no predetermined classification unit; a subjective element may
be a single word or a large expression. Thus, in the short term,
it is probably best to use subjective-element annotations for
knowledge acquisition (analysis, training, feature generation)
alone, and not target automatic classification of subjective
elements.
In this work, document-level subjectivity annotations are text
categories of which subjectivity is a key aspect. We use three
text categories: editorials (Kessler et al., 1997), reviews, and
"flames", i.e., hostile messages (Spertus, 1997; Kaufer, 2000).
For ease of discussion, we group editorials and reviews together
under the term opinion pieces.
There are benets to using suchdocument-level
annotations. First, they are more directly re-
lated to applications (e.g., ltering hostile mes-
sages and mining reviews from Internet forums).
Second, there are existing annotations to be ex-
ploited, such as editorials and arts reviews marked
as suchby newspapers, as well as on-line product
reviews accompanied by formal numerical ratings
(e.g., 4 on a scale from 1 to 5).
However, a challenging aspect of such data is that opinion pieces
and flames contain objective sentences, while documents in other
text categories contain subjective sentences. News reports
present reactions to and attitudes toward reported events (van
Dijk, 1988); they often contain segments starting with
expressions such as critics claim and supporters argue. In
addition, quoted-speech sentences in which individuals express
their subjectivity are often included (Barzilay et al., 2000). On
the other hand, editorials contain objective sentences presenting
facts supporting the writer's argument, and reviews contain
sentences objectively presenting facts about the product. This
"impure" aspect of opinionated text categories must be considered
when such data is used for training and testing. Some specific
results are given below in section 7.
We believe that sentence-level classifications will continue to
provide an important level of analysis. The sentence provides a
prespecified classification unit² and, while sentence-level
judgements are not as fine-grained as subjective-element
judgements, they do not involve the large amount of noise we face
with document-level annotations.
² While sentence boundaries are not always unambiguous in
unedited text or spoken language, the data can always be
segmented into sentence-like units before subjectivity tagging is
performed.
5 Document-Level Annotation Results
5.1 Flame Annotations
In this study, newsgroup messages were assigned the tags flame or
not-flame. The corpus consists of 1140 Usenet newsgroup messages,
balanced among the categories alt, sci, comp, and rec in the
Usenet hierarchy. The corpus was divided, preserving the category
balance, into a training set of 778 messages and a test set of
362 messages.
The annotators were instructed to mark a message as a flame if
the "main intention of the message is a personal attack,
containing insulting or abusive language." A number of policy
decisions were made in the instructions, dealing, primarily, with
included messages (part or all of a previous message, included in
the current message as part of a reply). Some additional issues
addressed in the instructions were who the attack was directed
at, nonsense, sarcasm, humor, rants, and raves.
During the training phase, two annotators, MM and R, participated
in multiple rounds of tagging, revising the annotation
instructions as they proceeded. During the testing phase, MM and
R independently annotated the test set, achieving a κ value on
these messages of 0.69. A third annotator, L, trained on 492
messages from the training set, and then annotated 88 of the
messages in the test set. The pairwise κ values on this set of 88
are: MM & R: 0.80; MM & L: 0.75; R & L: 0.80; for an average
pairwise κ of .78.
This study provides evidence for the viability of document-level
flame annotation. We plan to build a flame-recognition system in
the future. As will be seen below, MM and R also tagged this data
at the subjective-element level.
5.2 Opinion-Piece Classifications
Our opinion-piece classications are builtonexist-
ingannotationsinthe WallStreet Journal. Specif-
ically, there are articles explicitly identied to be
Editorials, Letters to the Editor, Arts & Leisure,
and Viewpoints;; together, wecalltheseopinion
pieces. This data is a good resource for subjectiv-
ity recognition. However, an inspection of some
data revealed that some editorials and reviews are
not marked as such. For example, there are arti-
cles written in the rst person, and the purpose of
the article is to present an argument rather than
cover a news story, but there is no explicit indi-
cation that they are editorials. To create high
quality test data, two judges manually annotated
WSJ data for opinion pieces. The instructions
were to nd any additional opinion pieces that
are not marked as such. The annotators also had
the option of disagreeing with the existing anno-
tations, but did not opt to do so in any instances.
One judge annotated all articles in four datasets of the Wall
Street Journal Treebank corpus (Marcus et al., 1993) (W9-4,
W9-10, W9-22, and W9-33, each approximately 160K words) as well
as the corpus of Wall Street Journal articles used in (Wiebe et
al., 1999) (called WSJ-SE below). Another judge annotated all
articles in two of the datasets (W9-22 and W9-33).
This annotation task appears to be relatively easy. With no
training at all, the κ values are very high: .94 for dataset
W9-33 and .95 for dataset W9-22.
The agreement data for W9-22 is given in Table
1 in the form of a contingency table. In section
7, this data is used to generate and test candidate
potential subjective elements (PSEs).
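As a check on how κ is obtained from such a contingency table, the following sketch recomputes it from the Table 1 counts. The function name is ours; the formula is the standard definition of Cohen's κ, (observed agreement - chance agreement) / (1 - chance agreement).

```python
def cohen_kappa(table):
    """Cohen's kappa for a square contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: proportion of items on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: what independent taggers with these marginals
    # would be expected to agree on.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1 - expected)


# Rows: Tagger 1 (Op, NotOp); columns: Tagger 2 (Op, NotOp), from Table 1.
w9_22 = [[23, 0], [2, 268]]
kappa = cohen_kappa(w9_22)
print(round(kappa, 2))  # 0.95
```

The result rounds to .95, the value reported above for W9-22. Raw agreement is 291/293, but because non-opinion pieces dominate, chance agreement is already about .85, which is why κ rather than raw agreement is reported.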
6 Subjective-Element Annotation Results and Analyses
6.1 Annotations and Data
These subsections analyze subjective element annotations
performed on three datasets, WSJ-SE, NG-FE, and NG-SE.
WSJ-SE is the corpus of 1001 sentences of the Wall Street Journal
Treebank Corpus referred to above in section 3. Recall that the
sentences of this corpus were manually annotated with
subjectivity classifications as described in (Wiebe et al., 1999;
Bruce and Wiebe, 1999).
For this paper, two annotators (D and M) were asked to identify
the subjective elements in WSJ-SE. Specifically, the taggers were
given the subjective sentences identified in the previous study,
and asked to put brackets around the words they believe cause the
sentence to be classified as subjective.
Note that inflammatory language is a kind of subjective language.
NG-FE is a subset of the Usenet newsgroup corpus used in the
document-level flame-annotation study described in section 5.1.
Specifically, NG-FE consists of the 362-message test set for
taggers R and MM. For this study, R and MM were asked to identify
the flame elements in NG-FE. Flame elements are the subset of
subjective elements that are perceived to be inflammatory. R and
MM were asked to do this in all 362 messages, because some
messages that were not judged to be flames at the message level
do contain individual inflammatory phrases (in these cases, the
tagger does not believe that these phrases express the main
intent of the message).

                      Tagger 2
                  Op          NotOp
Tagger 1  Op      n11 = 23    n12 = 0      n1+ = 23
          NotOp   n21 = 2     n22 = 268    n2+ = 270
                  n+1 = 25    n+2 = 268    n++ = 293
Table 1: Contingency Table for Opinion Piece Agreement in W9-22
In addition to the above annotations, tagger M performed
subjective-element tagging on a different set of Usenet newsgroup
messages, corpus NG-SE. The size of this corpus is 15413 words.
In datasets WSJ-SE and NG-SE, the taggers were also asked to
specify one of five subjective element types: e+ (positive
evaluative), e- (negative evaluative), e? (some other type of
evaluation), u (uncertainty), and o (none of the above), with the
option to assign multiple types to an instance. All corpora were
stemmed (Karp et al., 1992) and part-of-speech tagged (Brill,
1992).
6.2 Agreement Among Taggers
There are techniques for analyzing agreement when annotations
involve segment boundaries (Litman and Passonneau, 1995; Marcu et
al., 1999), but our focus in this paper is on words. Thus, our
analyses are at the word level: each word is classified as either
appearing in a subjective element or not. Punctuation is excluded
from our analyses. The WSJ data is divided into two subsets in
this section, Exp1 and Exp2.
As mentioned above, in WSJ-SE Exp1 and Exp2, the taggers also
classified subjective elements with respect to the type of
subjectivity being expressed. Subjectivity type agreement is
again analyzed at the word level, but, in this analysis, only the
words classified as belonging to subjective elements by both
taggers are considered.
Table 2 provides κ values for word agreement in NG-FE (the flame
data) as well as for WSJ-SE Exp1 and Exp2. The task of
identifying subjective elements in a body of text is difficult,
and the agreement results reflect this fact; agreement is much
stronger than that expected by chance, but less than what we
would like to see when verifying a new classification. Further
refinement of the coding manual is required. Additionally, it may
be possible to refine the classifications automatically using
methods such as those described in (Wiebe et al., 1999). In this
analysis, we explore the patterns of agreement exhibited by the
taggers in an effort to better understand the classification.
We begin by looking at word agreement. Word agreement is higher
in the flame experiment (NG-FE) than it is in either WSJ
experiment (WSJ-SE Exp1 and Exp2). Looking at the WSJ data
provides one plausible explanation for the lower word agreement
in the WSJ experiments. As exhibited in the subjective elements
identified for the single clause below,
D: (e+ played the role well) (e? obligatory
ragged jeans a thicket of long hair and rejection
of all things conventional)
M: (e+ well) (e? obligatory) (e- ragged) (e?
thicket) (e- rejection) (e- all things conventional)
tagger D consistently identifies entire phrases as subjective,
while tagger M prefers to select discrete lexical items. This
difference in interpretation of the tagging instructions does not
occur in the flame experiment. Nonetheless, even within the flame
data, there are many instances where both taggers identify the
same segment of a sentence as forming a subjective element but
disagree on the boundaries of that segment, as in the example
below.
R: (classic case of you deliberately misinterpret-
ing my comments)
MM: (you deliberately misinterpreting my
comments)
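To see how a boundary disagreement of this kind is scored at the word level, the following sketch labels each word as inside or outside a subjective element and computes κ over the two label sequences. The three leading words ("this is a") are a hypothetical context added only so that some words fall outside both bracketings; the function names are ours.

```python
def word_labels(n_tokens, spans):
    """Label each token 1 if it lies inside any (start, end) span, else 0.
    Spans follow Python slice conventions: start inclusive, end exclusive."""
    labels = [0] * n_tokens
    for start, end in spans:
        for i in range(start, end):
            labels[i] = 1
    return labels


def binary_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of 0/1 labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)


# Hypothetical context "this is a" followed by the phrase from the example.
tokens = ("this is a classic case of you deliberately "
          "misinterpreting my comments").split()
r = word_labels(len(tokens), [(3, 11)])   # R brackets "classic ... comments"
mm = word_labels(len(tokens), [(6, 11)])  # MM brackets "you ... comments"
kappa = binary_kappa(r, mm)
```

On this toy sentence the two taggers agree on 8 of 11 words, yet κ is only 30/63 (about .48): agreeing on the core of an element while disagreeing on its boundary is enough to pull word-level κ well below 1.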
These patterns of partial agreement are also evident in the κ
values for words from specific syntactic categories (see Table 2
again). In the WSJ data, agreement on determiners is particularly
low because they are often included as part of a phrase by tagger
D but typically not included in the specific lexical items chosen
by tagger M. Interestingly, in the WSJ experiments, the taggers
most frequently agreed on the selection of modals and adjectives,
while in the flame experiment, agreement was highest on nouns and
adjectives. The high agreement on adjectives in both genres is
consistent with results from other work (Bruce and Wiebe, 1999;
Wiebe et al., 1999), but high agreement on nouns in the flame
data versus high agreement on modals in the WSJ data suggests a
genre-specific usage of these categories. This would be the case
if, for example, modals were most frequently used to express
uncertainty, a type of subjectivity that would be relatively rare
in flames.

               All Words  Nouns   Verbs   Modals  Adj's   Adverbs  Det's
NG-FE          0.4657     0.5213  0.4571  0.4008  0.5011  0.3576   0.4286
WSJ-SE, Exp1   0.4228     0.3999  0.4235  0.6992  0.6000  0.4328   0.2661
WSJ-SE, Exp2   0.3703     0.3705  0.4261  0.4298  0.4294  0.2256   0.1234
Table 2: κ Values for Word Agreement
Turning to subjective-element type, in both WSJ experiments, the
κ values for type agreement are comparable to those for word
agreement. Recall that multiple types may be assigned to a single
subjective instance. All such instances in the WSJ data are u in
combination with an evaluative tag (i.e., e+, e- and e?), and
they are not common: each tagger assigned multiple tags to fewer
than 7% of the subjective instances. However, if partial matches
between type tags are recognized, i.e., if they share a common
tag, then the κ values improve significantly. Table 3 shows both
types of results.
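The two matching criteria can be stated compactly. The sketch below is our reading of the description, with hypothetical function names: a full match requires the two taggers to assign exactly the same set of type tags to a word, while a partial match requires only one tag in common.

```python
def full_match(tags_a, tags_b):
    """True if both taggers assigned exactly the same set of type tags."""
    return set(tags_a) == set(tags_b)


def partial_match(tags_a, tags_b):
    """True if the two taggers' tag sets share at least one tag."""
    return bool(set(tags_a) & set(tags_b))


# One tagger assigns u together with e+, the other assigns only e+.
print(full_match({"u", "e+"}, {"e+"}))     # False
print(partial_match({"u", "e+"}, {"e+"}))  # True
# Different evaluative tags do not match under either criterion.
print(partial_match({"e-"}, {"e?"}))       # False
```

Every full match is also a partial match, so agreement under the partial criterion can never be lower than under the full criterion.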
It is interesting to note the variation in type agreement for
words of different syntactic categories. Agreement on adjectives
is consistently high while the agreement on the type of
subjectivity expressed by modals and adverbs is consistently low.
This contrasts with the fact that word agreement for modals, in
particular, and, to a lesser extent, adverbs was high. This lack
of agreement suggests that the type of subjectivity expressed by
adjectives is more easily distinguished than that of modals or
adverbs. This is particularly important because the number of
adjectives included in subjective elements is high. In contrast,
the numbers of modals and adverbs are relatively low.
Additional insight can be gained by combining the 3 evaluative
classifications (i.e., e+, e- and e?) to form a single tag, e,
representing any form of evaluative expression. Table 4 presents
type agreement results for the tag set e, u, o.
In contrasting Tables 3 and 4, it is surprising to note that most
of the κ values decrease when the distinction among the
evaluative types is removed. This suggests that the three
evaluative types are natural classifications. Only for adverbs
does type agreement improve with the smaller tag set; this
indicates that it is difficult to distinguish the evaluative
nature of adverbs. Note also that agreement for modals is not
impacted by the change in tag sets. This fact supports the
hypothesis that modals are used primarily to express uncertainty.

As a final point, we look at patterns of agreement in type
classification using the models of symmetry, marginal
homogeneity, quasi-independence, and quasi-symmetry. Each model
tests for a specific pattern of agreement: symmetry tests the
interchangeability of taggers, marginal homogeneity verifies the
absence of bias among taggers, quasi-independence verifies that
the taggers act independently when they disagree, and
quasi-symmetry tests for the presence of any pattern in their
disagreements. For a more complete description of these models
and their use in analyzing intercoder reliability see (Bruce and
Wiebe, 1999). In short, the results presented in Table 5 indicate
that the taggers are not interchangeable: they exhibit biases in
their type classifications, and there is a pattern of correlated
disagreement in the assignment of the original type tags.
Surprisingly, the taggers appear to act independently when they
disagree in assigning the compressed type tags (i.e., tags e, u
and o). This shift in the pattern of disagreement between taggers
again suggests that the compression of the evaluative tags was
inappropriate. Additionally, these findings suggest that it may
be possible to automatically correct the type biases expressed by
the taggers using the technique described in (Bruce and Wiebe,
1999), a topic that will be investigated in future work.
6.3 Uniqueness
Based on previous work (Wiebe et al., 1998), we hypothesized that
low-frequency words are associated with subjectivity. Table 6
provides evidence that the number of unique words (words that
appear just once) in subjective elements is higher than expected.
The first row gives information for all words and the second
gives information for words that appear just once. The figures in
the Num columns are total counts, and the figures in the P
columns give the proportion that appear in subjective elements.
The Agree columns give information for the subset of the
corresponding data set upon which the two annotators agree.

                     All Words  Nouns   Verbs   Modals  Adj's   Adverbs  Det's
Exp1  Full Match     0.4216     0.4228  0.2933  0.1422  0.5919  0.1207   0.5000
      Partial Match  0.5156     0.4570  0.4447  0.3011  0.6607  0.3305   0.5000
Exp2  Full Match     0.3041     0.2353  0.2765  0.1429  0.5794  0.1207   0.0000
      Partial Match  0.4209     0.2353  0.3994  0.3494  0.6719  0.4439   0.1429
Table 3: κ Values for Type Agreement Using All Types in the WSJ Data

                     All Words  Nouns   Verbs   Modals  Adj's   Adverbs  Det's
Exp1  Full Match     0.3377     0.0440  0.1648  0.1968  0.5443  0.3810   0.0000
      Partial Match  0.5287     0.1637  0.3765  0.4903  0.8125  0.3810   0.0000
Exp2  Full Match     0.2569     0.0000  0.1923  0.1509  0.4783  0.1707   0.1429
      Partial Match  0.4789     0.0000  0.4167  0.4000  0.8056  0.7671   0.4000
Table 4: κ Values for Type Agreement Using E,O,U in the WSJ Data

                        Sym.      M.H.     Q.S.     Q.I.
Exp1  All Types  G²     112.351   92.447   19.904   66.771
                 Sig.   0.000     0.000    0.527    0.007
      e,o,u      G²     85.478    84.142   1.336    12.576
                 Sig.   0.000     0.000    0.248    0.027
Exp2  All Types  G²     94.669    76.247   18.422   58.892
                 Sig.   0.000     0.000    0.241    0.001
      e,o,u      G²     66.822    66.819   0.003    0.0003
                 Sig.   0.000     0.000    0.986    0.987
Table 5: Tests for Patterns of Agreement in WSJ Type-Tagged Data

            WSJ-SE                                  NG-FE
            D           M           Agree       Agree       R           MM
            Num    P    Num    P    Num    P    Num    P    Num    P    Num    P
All words   18341  .07  18341  .08  16857  .04  15413  .15  86279  .01  88210  .02
unique      2615   .14  2615   .20  2522   .15  2348   .17  5060   .07  4836   .03
Table 6: Proportions of Unique Words in Subjective Elements
Comparison of rows 1 and 2 across columns shows that the
proportion of unique words that are subjective is higher than the
proportion of all words that are subjective. In all cases, this
difference in proportions is highly statistically significant.
6.4 Types and Context
An interesting question is, when a word appears in multiple
subjective elements, are those subjective elements all the same
type? Table 7 shows that a significant portion are used in more
than one type. Each item considered in the table is a word-POS
pair that appears more than once in the corpus. The figures shown
are the total number of word-POS items that appear more than once
(the columns labeled MultInst) and the proportion of those items
that appear in more than one type of subjective element (the
columns labeled MultType). These results highlight the need for
contextual disambiguation. For example, one thinks of great as a
positive evaluative term, but its polarity depends on the
context; it can be used negatively evaluatively in a context such
as "Just great." A goal of performing subjective-element
annotations is to support learning such local contextual
influences.
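The MultInst and MultType figures can be derived along the following lines. This is a sketch under assumptions rather than the procedure actually used: the input layout and function name are hypothetical, and instances are counted only inside subjective elements.

```python
from collections import defaultdict


def mult_type_stats(annotated):
    """Return (MultInst, MultType) for an iterable of
    (word, pos, element_type) triples, one per word instance that
    occurs inside a subjective element (hypothetical layout)."""
    instances = defaultdict(int)
    types_seen = defaultdict(set)
    for word, pos, etype in annotated:
        key = (word, pos)
        instances[key] += 1
        types_seen[key].add(etype)
    # MultInst: word-POS items that appear more than once.
    multi_inst = [k for k, c in instances.items() if c > 1]
    # MultType: proportion of those items seen with more than one type.
    multi_type = [k for k in multi_inst if len(types_seen[k]) > 1]
    proportion = len(multi_type) / len(multi_inst) if multi_inst else 0.0
    return len(multi_inst), proportion


# Made-up annotations, purely for illustration.
toy = [
    ("great", "JJ", "e+"), ("great", "JJ", "e-"),
    ("maybe", "RB", "u"), ("maybe", "RB", "u"),
    ("awe", "NN", "e+"),
]
result = mult_type_stats(toy)
```

In the toy data, great and maybe each appear twice, and only great appears with two different types, so the function returns a MultInst of 2 and a MultType proportion of 0.5.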
7 Generating and Testing PSEs using Document-Level Annotations
This section uses the opinion-piece annotations to
expand our set of PSEs beyond those that can be
derived from the subjective-element annotations.
Precision is used to assess feature quality. The precision of
feature F for class C is the number of Fs that occur in units of
class C over the total number of Fs that occur anywhere in the
data.
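This definition translates directly into code. In the sketch below, the function name and the representation of the data (one class label per occurrence of the feature) are assumptions made for illustration.

```python
def feature_precision(occurrence_classes, target_class):
    """Precision of a feature F for a class C.

    occurrence_classes: one entry per occurrence of F in the data, giving
                        the class of the unit (e.g., document) containing it.
    target_class:       the class C of interest.
    """
    if not occurrence_classes:
        raise ValueError("the feature never occurs in the data")
    in_class = sum(1 for c in occurrence_classes if c == target_class)
    return in_class / len(occurrence_classes)


# A made-up feature that occurs 10 times, 3 of them in opinion pieces.
occurrences = ["opinion"] * 3 + ["non-opinion"] * 7
precision = feature_precision(occurrences, "opinion")
```

Here the toy feature has a precision of 3/10 for the opinion class. Note that the denominator counts every occurrence of the feature, so a frequent feature that is spread evenly over the corpus simply reproduces the proportion of the class in the data.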
An important motivation for using the opinion-piece data is that
there is a large amount of it, and manually refining existing
annotations as described in section 5.2 is much easier and more
reliable than other types of subjectivity annotation.
However, we cannot expect absolutely high precisions for two
reasons. First, the distribution of opinions and non-opinions is
highly skewed in favor of non-opinions. For example, in Table 1,
tagger 1 classifies only 23 of 293 articles as opinion pieces.
Second, as discussed in section 4, opinion pieces contain
objective sentences and non-opinion pieces contain subjective
sentences. For example, in WSJ-SE, which has been annotated at
the sentence and document levels, 70% of the sentences in opinion
pieces are subjective and 30% are objective. In non-opinion
pieces, 44% of the sentences are subjective and only 56% are
objective.
To give an idea of expected precisions, let us consider the
precision of subjective sentences with respect to opinion pieces.
Suppose that 15% of the sentences in the dataset are in opinions,
85% in non-opinions. Let us assume the proportions of subjective
and objective sentences in opinion and non-opinion pieces given
just above. Let N be the total number of sentences. The desired
precision is the number of subjective sentences in opinions over
the total number of subjective sentences. It is .22:
p = (.15 * N * .70) / (.15 * N * .70 + .85 * N * .44).
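The arithmetic can be verified with a short sketch. N cancels from the numerator and denominator, so only the three proportions are needed; the function and parameter names are ours.

```python
def expected_precision(frac_in_opinion, subj_in_opinion, subj_in_non_opinion):
    """Precision of 'sentence is subjective' as a predictor of
    'sentence is in an opinion piece', from three proportions."""
    subj_op = frac_in_opinion * subj_in_opinion
    subj_non = (1 - frac_in_opinion) * subj_in_non_opinion
    return subj_op / (subj_op + subj_non)


p = expected_precision(0.15, 0.70, 0.44)
print(round(p, 2))  # 0.22
```

The numerator is .105 and the denominator is .105 + .374 = .479, giving about .219. Even a perfect detector of subjective sentences would therefore reach a precision of only about .22 with respect to opinion pieces under these assumptions.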
In addition, we are assessing PSEs, which are only potentially
subjective; many have objective as well as subjective uses.
Thus, even if precisions are much lower than 1, we use increases
in precision over a baseline as evidence of promising PSEs. The
baseline for comparison is the number of word instances in
opinion pieces, divided by the total number of word instances.
Table 8 shows the precisions for three types of PSEs. The freq
columns give total frequencies, and the +prec columns show the
improvements in precision from the baseline. The baseline
precisions are given at the bottom of the table.
As mentioned above, (Wiebe, 2000) showed success
automatically identifying adjective PSEs using
Lin's method, seeded by a small amount of detailed
manual annotations. Desiring to move away
from manually annotated data, for this paper the
same process is used, but the seed words are all
the adjectives (verbs) in the training data. In
addition, in the current setting, there are no a priori
values to use for parameters C (cluster size) and
FT (filtering threshold), as there were in (Wiebe,
2000), and results vary with different parameter
settings. Thus, a train-validate-test process is
appropriate. In Table 8, the numbers given under,
e.g., W9-10, are the results obtained when W9-10
is used as the test set. One of the other datasets,
say W9-22, was used as the training set, meaning
that all the adjectives (verbs) in that dataset are
the seed words, and all filtering was performed
using only that data. The seed-filtering process was
repeated with different settings of C and FT,
producing a different set of adjectives (verbs) for each
setting. A third dataset, say W9-33, was used as a
validation set, i.e., among all the sets of adjectives
generated from the training set, those with good
performance on the validation set were selected as
the PSEs to test on the test set. A set was
considered to have good performance on the validation
set if its precision was at least .25 and its frequency
was at least 100. Since this process is meant to
be a method for mining existing document-level
annotations for PSEs, the existing opinion-piece
annotations were used for training and validation.
Our manual opinion-piece annotations were used
for testing.
     WSJ-SE-M             WSJ-SE-D             NG-SE-M
MultInst  MultType   MultInst  MultType   MultInst  MultType
     413       .17        378       .16        571       .29
Table 7: Word-POS-Types Used in Multiple Types of Subjective Elements
                       W9-10         W9-22         W9-33         W9-04
                    freq  +prec   freq  +prec   freq  +prec   freq  +prec
adjectives           373    .21   1340    .11   2137    .09   2537    .14
verbs                721    .16   1436    .08   3139    .07   3720    .11
unique words        6065    .10   5441    .07   6045    .06   6171    .09
baseline precision          .17           .13           .14           .18
freq: Total frequency   +prec: Increase in precision over baseline
Table 8: Frequencies and Increases in Precision
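The validation step described above can be sketched in Python as a filter over candidate word sets, one per parameter setting. This is only an illustration under stated assumptions: the dictionary keyed by `(C, FT)`, the token layout, the parameter values, and the toy words are all hypothetical, and the seeding and filtering that produce the candidate sets (Lin's method) are not reproduced here. Only the two thresholds, precision of at least .25 and frequency of at least 100, come from the text.

```python
def select_candidate_sets(candidates, validation_tokens,
                          min_precision=0.25, min_freq=100):
    """Keep the candidate sets that perform well on the validation data.

    candidates:        dict mapping a (C, FT) setting to a set of words
    validation_tokens: list of (word, in_opinion_piece) pairs, one per instance
    """
    selected = {}
    for setting, words in candidates.items():
        # Instances of this set's words, flagged by where they occur.
        hits = [in_opinion for word, in_opinion in validation_tokens
                if word in words]
        freq = len(hits)
        if freq >= min_freq and sum(hits) / freq >= min_precision:
            selected[setting] = words
    return selected


# Toy validation data (hypothetical), far smaller than a real dataset.
validation_tokens = (
    [("bad", True)] * 3 + [("bad", False)] * 1
    + [("new", True)] * 1 + [("new", False)] * 9
)
candidates = {(5, 0.1): {"bad"}, (10, 0.2): {"new"}}

# The frequency threshold is lowered only because the toy data is tiny.
print(select_candidate_sets(candidates, validation_tokens, min_freq=4))
```

The sets that survive this filter are the ones that would then be scored on the held-out test dataset.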
The row labeled unique words shows the preci-
sion on the test set of the individual words that
are unique in the test set. The increase over base-
line precision shows that low-frequency words can
be informative for recognizing subjectivity.
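A rough sketch of how the unique-words row could be computed is given below in Python. It assumes that a unique word is one occurring exactly once in the dataset and that each word instance is labeled with whether it appears in an opinion piece; both the data layout and the toy tokens are illustrative assumptions rather than the study's own code.

```python
from collections import Counter


def unique_word_precision(tokens):
    """Return (freq, precision) for words that occur exactly once.

    tokens: list of (word, in_opinion_piece) pairs, one per word instance
    """
    counts = Counter(word for word, _ in tokens)
    # Keep only instances of words seen a single time in the dataset.
    hits = [in_opinion for word, in_opinion in tokens if counts[word] == 1]
    if not hits:
        return 0, None  # no unique words, so precision is undefined
    return len(hits), sum(hits) / len(hits)


# Toy data (hypothetical): "the" repeats, the other three words are unique.
tokens = [
    ("the", False), ("the", True), ("egregious", True),
    ("the", False), ("hogwash", True), ("dividend", False),
]
freq, precision = unique_word_precision(tokens)
print(freq, round(precision, 2))
```

Subtracting the dataset's baseline precision from the returned precision would give the corresponding +prec value.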
Note that the features all do better and worse
on the same data sets. This shows that the
subjectivity is somehow harder to identify in, say, W9-33
than in W9-10; it also shows an important
consistency among the features, even though they are
identified in different ways.
8 Conclusions
This paper presents the results of an empirical
examination of subjectivity at the different levels of
a text: the expression level, the sentence level,
and the document level. While analysis of
subjectivity is perhaps most natural and precise at the
expression level, document-level annotations are
freely available from a number of sources and are
appropriate for many applications. The
sentence-level annotation is a workable intermediate level:
sentence-level judgments are not as fine-grained as
expression-level judgments, and they don't involve
the large amount of noise found at the document
level.
As part of this examination, we present a study
of annotator agreement characterizing the
difficulty of identifying subjectivity at the different
levels of a text. The results demonstrate that not
only can subjectivity be identified at the
document level with high reliability, but that it is also
possible to identify expression-level subjectivity,
albeit with lower reliability.
Using manual annotations, we are able to
characterize subjective language. At the expression
level, we found that it is natural to distinguish
among positively evaluative, negatively evaluative,
and speculative uses of a word. We also
found that subjective text contains a high
proportion of unique word occurrences, much more so
than ordinary text. Rather than ignoring or
discarding unique words, we demonstrate that the
occurrence of a unique word is a PSE. We also
found that agreement is higher for some syntactic
word classes, e.g., for adjectives in comparison
with determiners.
Finally, we are able to mine PSEs from text
tagged at the document level. Given the difficulty
of evaluating PSEs in document-level subjectivity
classification due to the mix of subjective and
objective sentences, the PSEs identified in this
study exhibit relatively high precision. In future
work, we will investigate document-level
classification using these PSEs, as well as other methods
for extracting PSEs from text tagged at the
document level; methods to be investigated include
mutual-bootstrapping and/or co-training.

References
C. Aone, M. Ramos-Santacruz, and W. Niehaus.
2000. Assentor: An NLP-based solution to e-mail
monitoring. In Proc. IAAI-2000, pages 945-950.
A. Banfield. 1982. Unspeakable Sentences.
Routledge and Kegan Paul, Boston.
R. Barzilay, M. Collins, J. Hirschberg, and
S. Whittaker. 2000. The rules behind roles:
Identifying speaker role in radio broadcasts. In
Proc. AAAI.
E. Brill. 1992. A simple rule-based part of speech
tagger. In Proc. of the 3rd Conference on Applied
Natural Language Processing (ANLP-92),
pages 152-155.
R. Bruce and J. Wiebe. 1999. Recognizing
subjectivity: A case study of manual tagging. Natural
Language Engineering, 5(2).
J. Cohen. 1960. A coefficient of agreement for
nominal scales. Educational and Psychological
Meas., 20:37-46.
A. P. Dawid and A. M. Skene. 1979. Maximum
likelihood estimation of observer error-rates
using the EM algorithm. Applied Statistics,
28:20-28.
M. Fludernik. 1993. The Fictions of Language
and the Languages of Fiction. Routledge, London.
L. Goodman. 1974. Exploratory latent structure
analysis using both identifiable and unidentifiable
models. Biometrika, 61:2:215-231.
E. Hovy. 1987. Generating Natural Language
under Pragmatic Constraints. Ph.D. thesis, Yale
University.
D. Karp, Y. Schabes, M. Zaidel, and D. Egedi.
1992. A freely available wide coverage
morphological analyzer for English. In Proc. of
the 14th International Conference on
Computational Linguistics (COLING-92).
D. Kaufer. 2000. Flaming: A White Paper.
www.eudora.com.
B. Kessler, G. Nunberg, and H. Schütze. 1997.
Automatic detection of text genre. In Proc.
ACL-EACL-97.
D. Lin. 1998. Automatic retrieval and clustering
of similar words. In Proc. COLING-ACL '98,
pages 768-773.
D. J. Litman and R. J. Passonneau. 1995.
Combining multiple knowledge sources for
discourse segmentation. In Proc. 33rd Annual
Meeting of the Association for Computational
Linguistics (ACL-95), pages 108-115.
Association for Computational Linguistics, June.
D. Marcu, M. Romera, and E. Amorrortu. 1999.
Experiments in constructing a corpus of
discourse trees: Problems, annotation choices,
issues. In The Workshop on Levels of
Representation in Discourse, pages 71-78.
M. Marcus, B. Santorini, and M. Marcinkiewicz.
1993. Building a large annotated corpus of
English: The Penn Treebank. Computational
Linguistics, 19(2):313-330.
G. Nunberg, I. Sag, and T. Wasow. 1994. Idioms.
Language, 70:491-538.
W. Sack. 1995. Representing and recognizing
point of view. In Proc. AAAI Fall Symposium
on AI Applications in Knowledge Navigation
and Retrieval.
E. Spertus. 1997. Smokey: Automatic recognition
of hostile messages. In Proc. IAAI.
D. Stein and S. Wright, editors. 1995. Subjectivity
and Subjectivisation. Cambridge University
Press, Cambridge.
L. Terveen, W. Hill, B. Amento, D. McDonald,
and J. Creter. 1997. Building task-specific
interfaces to high volume conversational data. In
Proc. CHI 97, pages 226-233.
S. Teufel and M. Moens. 2000. What's yours and
what's mine: Determining intellectual attribution
in scientific texts. In Proc. Joint SIGDAT
Conference on EMNLP and VLC.
J. Wiebe, K. McKeever, and R. Bruce. 1998.
Mapping collocational properties into machine
learning features. In Proc. 6th Workshop on
Very Large Corpora (WVLC-98), pages 225-233,
Montreal, Canada, August. ACL SIGDAT.
J. Wiebe, R. Bruce, and T. O'Hara. 1999.
Development and use of a gold standard data set for
subjectivity classifications. In Proc. 37th Annual
Meeting of the Assoc. for Computational
Linguistics (ACL-99), pages 246-253, University
of Maryland, June. ACL.
J. Wiebe. 1994. Tracking point of view in
narrative. Computational Linguistics, 20(2):233-287.
J. Wiebe. 2000. Learning subjective adjectives
from corpora. In 17th National Conference on
Artificial Intelligence (AAAI-2000).
