Feature Subsumption for Opinion Analysis
Ellen Riloff and Siddharth Patwardhan
School of Computing
University of Utah
Salt Lake City, UT 84112
{riloff,sidd}@cs.utah.edu
Janyce Wiebe
Department of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260
wiebe@cs.pitt.edu
Abstract
Lexical features are key to many approaches to sentiment analysis and opinion detection. A variety of representations have been used, including single words, multi-word Ngrams, phrases, and lexico-syntactic patterns. In this paper, we use a subsumption hierarchy to formally define different types of lexical features and their relationship to one another, both in terms of representational coverage and performance. We use the subsumption hierarchy in two ways: (1) as an analytic tool to automatically identify complex features that outperform simpler features, and (2) to reduce a feature set by removing unnecessary features. We show that reducing the feature set improves performance on three opinion classification tasks, especially when combined with traditional feature selection.
1 Introduction
Sentiment analysis and opinion recognition are active research areas that have many potential applications, including review mining, product reputation analysis, multi-document summarization, and multi-perspective question answering. Lexical features are key to many approaches, and a variety of representations have been used, including single words, multi-word Ngrams, phrases, and lexico-syntactic patterns. It is common for different features to overlap representationally. For example, the unigram "happy" will match all of the texts that the bigram "very happy" matches. Since both features represent a positive sentiment and the bigram matches fewer contexts than the unigram, it is probably sufficient just to have the unigram. However, there are many cases where a feature captures a subtlety or non-compositional meaning that a simpler feature does not. For example, "basket case" is a highly opinionated phrase, but the words "basket" and "case" individually are not. An open question in opinion analysis is how often more complex feature representations are needed, and which types of features are most valuable. Our first goal is to devise a method to automatically identify features that are representationally subsumed by a simpler feature but that are better opinion indicators. These subjective expressions could then be added to a subjectivity lexicon (Esuli and Sebastiani, 2005), and used to gain understanding about which types of complex features capture meaningful expressions that are important for opinion recognition.
Many opinion classifiers are created by adopting a "kitchen sink" approach that throws together a variety of features. But in many cases adding new types of features does not improve performance. For example, Pang et al. (2002) found that unigrams outperformed bigrams, and unigrams outperformed the combination of unigrams plus bigrams. Our second goal is to automatically identify features that are unnecessary because similar features provide equal or better coverage and discriminatory value. Our hypothesis is that a reduced feature set, which selectively combines unigrams with only the most valuable complex features, will perform better than a larger feature set that includes the entire "kitchen sink" of features.
In this paper, we explore the use of a subsumption hierarchy to formally define the subsumption relationships between different types of textual features. We use the subsumption hierarchy in two ways. First, we use subsumption as an analytic tool to compare features of different complexities and automatically identify complex features that substantially outperform their simpler counterparts. Second, we use the subsumption hierarchy to reduce a feature set based on representational overlap and on performance. We conduct experiments with three opinion data sets and show that the reduced feature sets can improve classification performance.
2 The Subsumption Hierarchy
2.1 Text Representations
We analyze two feature representations that have been used for opinion analysis: Ngrams and Extraction Patterns. Information extraction (IE) patterns are lexico-syntactic patterns that represent expressions which identify role relationships. For example, the pattern "<subj> ActVP(recommended)" extracts the subject of active-voice instances of the verb "recommended" as the recommender. The pattern "<subj> PassVP(recommended)" extracts the subject of passive-voice instances of "recommended" as the object being recommended.

(Riloff and Wiebe, 2003) explored the idea of using extraction patterns to represent more complex subjective expressions that have non-compositional meanings. For example, the expression "drive (someone) up the wall" expresses the feeling of being annoyed, but the meanings of the words "drive", "up", and "wall" have no emotional connotations individually. Furthermore, this expression is not a fixed word sequence that can be adequately modeled by Ngrams. Any noun phrase can appear between the words "drive" and "up", so a flexible representation is needed to capture the general pattern "drives <NP> up the wall".

This example represents a general phenomenon: many expressions allow intervening noun phrases and/or modifying terms. For example:

  "stepped on <mods> toes"
  Ex: stepped on the boss' toes

  "dealt <np> <mods> blow"
  Ex: dealt the company a decisive blow

  "brought <np> to <mods> knees"
  Ex: brought the man to his knees

(Riloff and Wiebe, 2003) also showed that syntactic variations of the same verb phrase can behave very differently. For example, they found that passive-voice constructions of the verb "ask" had a 100% correlation with opinion sentences, but active-voice constructions had only a 63% correlation with opinions.
Pattern Type            Example Pattern
<subj> PassVP           <subj> is satisfied
<subj> ActVP            <subj> complained
<subj> ActVP Dobj       <subj> dealt blow
<subj> ActInfVP         <subj> appear to be
<subj> PassInfVP        <subj> is meant to be
<subj> AuxVP Dobj       <subj> has position
<subj> AuxVP Adj        <subj> is happy
ActVP <dobj>            endorsed <dobj>
InfVP <dobj>            to condemn <dobj>
ActInfVP <dobj>         get to know <dobj>
PassInfVP <dobj>        is meant to be <dobj>
Subj AuxVP <dobj>       fact is <dobj>
NP Prep <np>            opinion on <np>
ActVP Prep <np>         agrees with <np>
PassVP Prep <np>        is worried about <np>
InfVP Prep <np>         to resort to <np>
<possessive> NP         <noun>'s speech

Figure 1: Extraction Pattern Types
Our goal is to use the subsumption hierarchy to identify Ngram and extraction pattern features that are more strongly associated with opinions than simpler features. We used three types of features in our research: unigrams, bigrams, and IE patterns. The Ngram features were generated using the Ngram Statistics Package (NSP) (Banerjee and Pedersen, 2003).1 The extraction patterns (EPs) were automatically generated using the Sundance/AutoSlog software package (Riloff and Phillips, 2004). AutoSlog relies on the Sundance shallow parser and can be applied exhaustively to a text corpus to generate IE patterns that can extract every noun phrase in the corpus. AutoSlog has been used to learn IE patterns for the domains of terrorism, joint ventures, and microelectronics (Riloff, 1996), as well as for opinion analysis (Riloff and Wiebe, 2003). Figure 1 shows the 17 types of extraction patterns that AutoSlog generates. PassVP refers to passive-voice verb phrases (VPs), ActVP refers to active-voice VPs, InfVP refers to infinitive VPs, and AuxVP refers to VPs where the main verb is a form of "to be" or "to have". Subjects (subj), direct objects (dobj), PP objects (np), and possessives can be extracted by the patterns.2

1 NSP is freely available for use under the GPL from http://search.cpan.org/dist/Text-NSP. We discarded Ngrams that consisted entirely of stopwords. We used a list of 281 stopwords.

2 However, the items extracted by the patterns are not actually used by our opinion classifiers; only the patterns themselves are matched against the text.
2.2 The Subsumption Hierarchy
We created a subsumption hierarchy that defines the representational scope of different types of features. We will say that feature A representationally subsumes feature B if the set of text spans that match feature A is a superset of the set of text spans that match feature B. For example, the unigram "happy" subsumes the bigram "very happy" because the set of text spans that match "happy" includes the text spans that match "very happy".

First, we define a hierarchy of valid subsumption relationships, shown in Figure 2. The 2Gram node, for example, is a child of the 1Gram node because a 1Gram can subsume a 2Gram. Ngrams may subsume extraction patterns as well. Every extraction pattern has at least one corresponding 1Gram that will subsume it.3 For example, the 1Gram "recommended" subsumes the pattern "<subj> ActVP(recommended)" because the pattern only matches active-voice instances of "recommended". An extraction pattern may also subsume another extraction pattern. For example, "<subj> ActVP(recommended)" subsumes "<subj> ActVP(recommended) Dobj(movie)".

3 Because every type of extraction pattern shown in Figure 1 contains at least one word (not including the extracted phrases, which are not used as part of our feature representation).
To compare specific features we need to formally define the representation of each type of feature in the hierarchy. For example, the hierarchy dictates that a 2Gram can subsume the pattern "ActInfVP <dobj>", but this should hold only if the words in the bigram correspond to adjacent words in the pattern. For example, the 2Gram "to fish" subsumes the pattern "ActInfVP(like to fish) <dobj>". But the 2Gram "like fish" should not subsume it. Similarly, consider the pattern "InfVP(plan) <dobj>", which represents the infinitive "to plan". This pattern subsumes the pattern "ActInfVP(want to plan) <dobj>", but it should not subsume the pattern "ActInfVP(plan to start)".

To ensure that different features truly subsume each other representationally, we formally define each type of feature based on words, sequential dependencies, and syntactic dependencies. A sequential dependency between words w_i and w_{i+1} means that w_i and w_{i+1} must be adjacent, and that w_i must precede w_{i+1}. Figure 3 shows the formal definition of a bigram (2Gram) node. The bigram is defined as two words with a sequential dependency indicating that they must be adjacent.
Name = 2Gram
Constituent[0] = WORD1
Constituent[1] = WORD2
Dependency = Sequential(0, 1)
Figure 3: 2Gram Definition
A syntactic dependency between words w_i and w_{i+1} means that w_i has a specific syntactic relationship to w_{i+1}, and w_i must precede w_{i+1}. For example, consider the extraction pattern "NP Prep <np>", in which the object of the preposition attaches to the NP. Figure 4 shows the definition of this extraction pattern in the hierarchy. The pattern itself contains three components: the NP, the attaching preposition, and the object of the preposition (which is the NP that the pattern extracts). The definition also includes two syntactic dependencies: the first dependency is between the NP and the preposition (meaning that the preposition syntactically attaches to the NP), while the second dependency is between the preposition and the extraction (meaning that the extracted NP is the syntactic object of the preposition).

Name = NP Prep <np>
Constituent[0] = NP
Constituent[1] = PREP
Constituent[2] = NP EXTRACTION
Dependency = Syntactic(0, 1)
Dependency = Syntactic(1, 2)

Figure 4: "NP Prep <np>" Pattern Definition
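As an illustration, the node definitions in Figures 3 and 4 could be encoded roughly as follows. This is one plausible encoding, not the authors' implementation; the class and field names are invented for exposition.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Dependency:
    kind: str   # "sequential" or "syntactic"
    head: int   # index of the earlier constituent
    dep: int    # index of the later constituent

@dataclass
class FeatureDef:
    name: str
    constituents: list          # words, or placeholder roles like "NP_EXTRACTION"
    dependencies: list = field(default_factory=list)

# Figure 3: a bigram is two words that must be adjacent and in order.
bigram = FeatureDef(
    name="2Gram",
    constituents=["WORD1", "WORD2"],
    dependencies=[Dependency("sequential", 0, 1)],
)

# Figure 4: "NP Prep <np>" requires syntactic attachment instead of adjacency.
np_prep = FeatureDef(
    name="NP Prep <np>",
    constituents=["NP", "PREP", "NP_EXTRACTION"],
    dependencies=[Dependency("syntactic", 0, 1), Dependency("syntactic", 1, 2)],
)
```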
Consequently, the bigram "affair with" will not subsume the extraction pattern "affair with <np>" because the bigram requires the noun and preposition to be adjacent but the pattern does not. For example, the extraction pattern matches the text "an affair in his mind with Countess Olenska" but the bigram does not. Conversely, the extraction pattern does not subsume the bigram either because the pattern requires syntactic attachment but the bigram does not. For example, the bigram matches the sentence "He ended the affair with a sense of relief", but the extraction pattern does not.

[Figure 2: The Subsumption Hierarchy. The figure is a tree rooted at 1Gram, with 2Gram, 3Gram, and 4Gram nodes beneath it and the extraction pattern types of Figure 1 attached as children of the nodes that can subsume them.]
Figure 5 shows the definition of another extraction pattern, "InfVP <dobj>", which includes both syntactic and sequential dependencies. This pattern would match the text "to protest high taxes". The pattern definition has three components: the infinitive "to", a verb, and the direct object of the verb (which is the NP that the pattern extracts). The definition also shows two syntactic dependencies. The first dependency indicates that the verb syntactically attaches to the infinitive "to". The second dependency indicates that the extracted NP syntactically attaches to the verb (i.e., it is the direct object of that particular verb).

The pattern definition also includes a sequential dependency, which specifies that "to" must be adjacent to the verb. Strictly speaking, our parser does not require them to be adjacent. For example, the parser allows intervening adverbs to split infinitives (e.g., "to strongly protest high taxes"), and this does happen occasionally. But split infinitives are relatively rare, so in the vast majority of cases the infinitive "to" will be adjacent to the verb. Consequently, we decided that a bigram (e.g., "to protest") should representationally subsume this extraction pattern because the syntactic flexibility afforded by the pattern is negligible. The sequential dependency link represents this judgment call that the infinitive "to" and the verb are adjacent in most cases.

For all of the node definitions, we used our best judgment to make decisions of this kind. We tried to represent major distinctions between features, without getting caught up in minor differences that were likely to be negligible in practice.
Name = InfVP <dobj>
Constituent[0] = INFINITIVE TO
Constituent[1] = VERB
Constituent[2] = DOBJ EXTRACTION
Dependency = Syntactic(0, 1)
Dependency = Syntactic(1, 2)
Dependency = Sequential(0, 1)
Figure 5: "InfVP <dobj>" Pattern Definition
To use the subsumption hierarchy, we assign each feature to its appropriate node in the hierarchy based on its type. Then we perform a top-down breadth-first traversal. Each feature is compared with the features at its ancestor nodes. If a feature's words and dependencies are a superset of an ancestor's words and dependencies, then it is subsumed by the (more general) ancestor and discarded.4 When the subsumption process is finished, a feature remains in the hierarchy only if there are no features above it that subsume it.

4 The words that they have in common must also be in the same relative order.
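A minimal sketch of this discard test follows; it is a simplified reading of Section 2.2 rather than the authors' code, and it represents each feature as a word list plus a set of (kind, i, j) dependency triples, where i and j index the word list.

```python
def subsumed_by(feat_words, feat_deps, anc_words, anc_deps):
    """True if the (more general) ancestor subsumes the feature: the ancestor's
    words must be a subset of the feature's words, in the same relative order
    (footnote 4), and every dependency the ancestor requires must also be
    required, between the same words, by the feature."""
    # Greedily align ancestor words to feature words, preserving order.
    mapping, j = {}, 0
    for i, word in enumerate(anc_words):
        while j < len(feat_words) and feat_words[j] != word:
            j += 1
        if j == len(feat_words):
            return False  # word missing, or present only out of order
        mapping[i], j = j, j + 1
    # Every ancestor dependency must hold under the alignment.
    return all((kind, mapping[a], mapping[b]) in feat_deps
               for (kind, a, b) in anc_deps)

# "happy" subsumes "very happy", so the bigram would be discarded:
print(subsumed_by(["very", "happy"], {("sequential", 0, 1)},
                  ["happy"], set()))                        # True
# But "affair with" does not subsume "affair with <np>": the pattern
# requires syntactic attachment, not adjacency.
print(subsumed_by(["affair", "with"], {("syntactic", 0, 1)},
                  ["affair", "with"], {("sequential", 0, 1)}))  # False
```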
2.3 Performance-based Subsumption
Representational subsumption is concerned with whether one feature is more general than another. But the purpose of using the subsumption hierarchy is to identify more complex features that outperform simpler ones. Applying the subsumption hierarchy to features without regard to performance would simply eliminate all features that have a more general counterpart in the feature set. For example, all bigrams would be discarded if their component unigrams were also present in the hierarchy.

To estimate the quality of a feature, we use Information Gain (IG) because that has been shown to work well as a metric for feature selection (Forman, 2003). We will say that feature A behaviorally subsumes feature B if two criteria are met: (1) A representationally subsumes B, and (2) IG(A) ≥ IG(B) - δ, where δ is a parameter representing an acceptable margin of performance difference. For example, if δ=0 then condition (2) means that feature A is just as valuable as feature B because its information gain is the same or higher. If δ>0 then feature A is allowed to be a little worse than feature B, but within an acceptable margin. For example, δ=.0001 means that A's information gain may be up to .0001 lower than B's information gain, and that is considered to be an acceptable performance difference (i.e., A is good enough that we are comfortable discarding B in favor of the more general feature A).

Note that based on the subsumption hierarchy shown in Figure 2, all 1Grams will always survive the subsumption process because they cannot be subsumed by any other types of features. Our goal is to identify complex features that are worth adding to a set of unigram features.
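Combining the two criteria, a pruning pass might look like the following sketch. It assumes features are visited from most general to most specific (the top-down traversal), and it makes one simplification not pinned down by the paper: a feature is compared only against features that have themselves survived. The toy subsumes function in the usage example is a stand-in for the representational test above.

```python
def prune(features, ig, subsumes, delta=0.001):
    """Discard every feature that is behaviorally subsumed by a more general one.

    features: feature ids ordered from most general to most specific.
    ig:       dict mapping feature id -> information gain.
    subsumes: function (general, specific) -> bool, the representational test.
    A feature B is dropped when some retained feature A representationally
    subsumes it and IG(A) >= IG(B) - delta (Section 2.3)."""
    kept = []
    for b in features:
        if any(subsumes(a, b) and ig[a] >= ig[b] - delta for a in kept):
            continue
        kept.append(b)
    return kept

# Toy usage: treat "A subsumes B" as "B ends with A" (illustration only).
ig = {"happy": 0.004, "very happy": 0.0035}
print(prune(["happy", "very happy"], ig, lambda a, b: b.endswith(a)))
# -> ["happy"]: the bigram is dropped since IG("happy") >= IG("very happy") - delta
```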
3 Data Sets
We used three opinion-related data sets for our analyses and experiments: the OP data set created by (Wiebe et al., 2004), the Polarity data set5 created by (Pang and Lee, 2004), and the MPQA data set created by (Wiebe et al., 2005).6 The OP and Polarity data sets involve document-level opinion classification, while the MPQA data set involves sentence-level classification.

5 Version v2.0, which is available at: http://www.cs.cornell.edu/people/pabo/movie-review-data/

6 Available at http://www.cs.pitt.edu/mpqa/databaserelease/
The OP data consists of 2,452 documents from the Penn Treebank (Marcus et al., 1993). Metadata tags assigned by the Wall Street Journal define the opinion/non-opinion classes: the class of any document labeled Editorial, Letter to the Editor, Arts & Leisure Review, or Viewpoint by the Wall Street Journal is opinion, and the class of documents in all other categories (such as Business and News) is non-opinion. This data set is highly skewed, with only 9% of the documents belonging to the opinion class. Consequently, a trivial (but useless) opinion classifier that labels all documents as non-opinion articles would achieve 91% accuracy.
The Polarity data consists of 700 positive and 700 negative reviews from the Internet Movie Database (IMDb) archive. The positive and negative classes were derived from author ratings expressed in stars or numerical values. The MPQA data consists of English language versions of articles from the world press. It contains 9,732 sentences that have been manually annotated for subjective expressions. The opinion/non-opinion classes are derived from the lower-level annotations: a sentence is an opinion if it contains a subjective expression of medium or higher intensity; otherwise, it is a non-opinion sentence. 55% of the sentences belong to the opinion class.
4 Using the Subsumption Hierarchy for Analysis

In this section, we illustrate how the subsumption hierarchy can be used as an analytic tool to automatically identify features that substantially outperform simpler counterparts. These features represent specialized usages and expressions that would be good candidates for addition to a subjectivity lexicon. Figure 6 shows pairs of features, where the first is more general and the second is more specific. These feature pairs were identified by the subsumption hierarchy as being representationally similar but behaviorally different (so the more specific feature was retained). The IGain column shows the information gain values produced from the training set of one cross-validation fold. The Class column shows the class that the more specific feature is correlated with (the more general feature is usually not strongly correlated with either class).
Opinion/Non-Opinion Classification
ID   Feature                    IGain  Class  Example
A1   line                       .0016  -      ...issue consists of notes backed by credit line receivables
A2   the line                   .0075  opin   ...lays it on the line; ...steps across the line
B1   nation                     .0046  -      ...has 750,000 cable-tv subscribers around the nation
B2   a nation                   .0080  opin   It's not that we are spawning a nation of ascetics...
C1   begin                      .0006  -      Campeau buyers will begin writing orders...
C2   begin with                 .0036  opin   To begin with, we should note that in contrast...
D1   benefits                   .0040  -      ...earlier period included $235,000 in tax benefits.
DEP  NP Prep(benefits to)       .0090  opin   ...boon to the rich with no proven benefits to the economy
E1   due                        .0001  -      ...an estimated $1.23 billion in debt due next spring
EEP  ActVP Prep(due to)         .0038  opin   It's all due to the intense scrutiny...

Positive/Negative Sentiment Classification
ID   Feature                    IGain  Class  Example
F1   short                      .0014  -      to make a long story short...
F2   nothing short              .0039  pos    nothing short of spectacular
G1   ugly                       .0008  -      ...an ugly monster on a cruise liner
G2   and ugly                   .0054  neg    it's a disappointment to see something this dumb and ugly
H1   disaster                   .0010  -      ...rated pg-13 for disaster related elements
HEP  AuxVP Dobj(be disaster)    .0048  neg    ...this is such a confused disaster of a film
I1   work                       .0002  -      the next day during the drive to work...
IEP  ActVP(work)                .0062  pos    the film will work just as well...
J1   manages                    .0003  -      he still manages to find time for his wife
JEP  ActInfVP(manages to keep)  .0054  pos    this film manages to keep up a rapid pace

Figure 6: Sample features that behave differently, as revealed by the subsumption hierarchy.
(1 = unigram; 2 = bigram; EP = extraction pattern)
The top table in Figure 6 contains examples for the opinion/non-opinion classification task from the OP data. The more specific features are more strongly correlated with opinion articles. Surprisingly, simply adding a determiner can dramatically change behavior. Consider A2. There are many subjective idioms involving "the line" (two are shown in the table; others include "toe the line" and "draw the line"), while objective language about credit lines, phone lines, etc. uses the determiner less often. Similarly, consider B2. Adding "a" to "nation" often corresponds to an abstract reference used when making an argument (e.g., "a nation of ascetics"), whereas other instances of "nation" are used more literally (e.g., "the 6th largest in the nation"). 21% of feature B1's instances appear in opinion articles, while 70% of feature B2's instances are in opinion articles.

"Begin with" (C2) captures an adverbial phrase used in argumentation ("To begin with...") but does not match objective usages such as "will begin" an action. The word "benefits" alone (D1) matches phrases like "tax benefits" and "employee benefits" that are not opinion expressions, while DEP typically matches positive senses of the word "benefits". Interestingly, the bigram "benefits to" is not highly correlated with opinions because it matches infinitive phrases such as "tax benefits to provide" and "health benefits to cut". In this case, the extraction pattern "NP Prep(benefits to)" is more discriminating than the bigram for opinion classification. The extraction pattern EEP is also highly correlated with opinions, while the unigram "due" and the bigram "due to" are not.
The bottom table in Figure 6 shows feature pairs identified for their behavioral differences on the Polarity data set, where the task is to distinguish positive reviews from negative reviews. F2 and G2 are bigrams that behave differently from their component unigrams. The expression "nothing short (of)" is typically used to express positive sentiments, while "nothing" and "short" by themselves are not. The word "ugly" is often used as a descriptive modifier that is not expressing a sentiment per se, while "and ugly" appears in predicate adjective constructions that are expressing a negative sentiment. The extraction pattern HEP is more discriminatory than H1 because it distinguishes negative sentiments ("the film is a disaster!") from plot descriptions ("the disaster movie..."). IEP shows that active-voice usages of "work" are strong positive indicators, while the unigram "work" appears in a variety of both positive and negative contexts. Finally, JEP shows that the expression "manages to keep" is a strong positive indicator, while "manages" by itself is much less discriminating.
These examples illustrate that the subsumption hierarchy can be a powerful tool to better understand the behaviors of different kinds of features, and to identify specific features that may be desirable for inclusion in specialized lexical resources.
5 Using the Subsumption Hierarchy to Reduce Feature Sets
When creating opinion classifiers, people often throw in a variety of features and trust the machine learning algorithm to figure out how to make the best use of them. However, we hypothesized that classifiers may perform better if we can proactively eliminate features that are not necessary because they are subsumed by other features. In this section, we present a series of experiments to explore this hypothesis. First, we present the results for an SVM classifier trained using different sets of unigram, bigram, and extraction pattern features, both before and after subsumption. Next, we evaluate a standard feature selection approach as an alternative to subsumption and then show that combining subsumption with standard feature selection produces the best results of all.
5.1 Classification Experiments
To see whether feature subsumption can improve classification performance, we trained an SVM classifier for each of the three opinion data sets. We used the SVMlight (Joachims, 1998) package with a linear kernel. For the Polarity and OP data we discarded all features that have frequency < 5, and for the MPQA data we discarded features that have frequency < 2 because this data set is substantially smaller. All of our experimental results are averages over 3-fold cross-validation.

First, we created 4 baseline classifiers: a 1Gram classifier that uses only the unigram features; a 1+2Gram classifier that uses unigram and bigram features; a 1+EP classifier that uses unigram and extraction pattern features; and a 1+2+EP classifier that uses all three types of features. Next, we created analogous 1+2Gram, 1+EP, and 1+2+EP classifiers but applied the subsumption hierarchy first to eliminate unnecessary features before training the classifier. We experimented with three delta values for the subsumption process: δ=.0005, .001, and .002.
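As a rough illustration of this setup (not the authors' code, which used SVMlight), the following sketch reproduces the frequency cutoff, the linear SVM, and the 3-fold cross-validation using scikit-learn as a stand-in; it covers only the Ngram features, since the extraction pattern features would come from a separate parsing step.

```python
# Sketch of the experimental setup, with scikit-learn standing in for SVMlight.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def evaluate(docs, labels, min_freq=5):
    # min_df mirrors the frequency cutoff (>= 5 for Polarity/OP, >= 2 for MPQA),
    # though it counts document frequency rather than raw corpus frequency.
    vectorizer = CountVectorizer(min_df=min_freq, ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)
    clf = LinearSVC()  # linear kernel, as in the paper's SVMlight setup
    scores = cross_val_score(clf, X, labels, cv=3)  # 3-fold cross-validation
    return scores.mean()
```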
Figures 7, 8, and 9 show the results. The subsumption process produced small but consistent improvements on all 3 data sets. For example, Figure 8 shows the results on the OP data, where all of the accuracy values produced after subsumption (the rightmost 3 columns) are higher than the accuracy values produced without subsumption (the Base[line] column). For all three data sets, the best overall accuracy (shown in boldface) was always achieved after subsumption.
Features   Base   δ=.0005  δ=.001  δ=.002
1Gram      79.8   -        -       -
1+2Gram    81.2   81.0     81.3    81.0
1+EP       81.7   81.4     81.4    82.0
1+2+EP     81.7   82.3     82.3    82.7

Figure 7: Accuracies on Polarity Data
Features   Base   δ=.0005  δ=.001  δ=.002
1Gram      97.5   -        -       -
1+2Gram    98.0   98.7     98.6    98.7
1+EP       97.2   97.8     97.9    97.9
1+2+EP     97.8   98.6     98.7    98.7

Figure 8: Accuracies on OP Data
Features   Base   δ=.0005  δ=.001  δ=.002
1Gram      74.8   -        -       -
1+2Gram    74.3   74.9     74.6    74.8
1+EP       74.4   74.6     74.6    74.6
1+2+EP     74.4   74.9     74.7    74.6

Figure 9: Accuracies on MPQA Data
We also observed that subsumption had a dramatic effect on the F-measure scores on the OP data, which are shown in Figure 10. The OP data set is fundamentally different from the other data sets because it is so highly skewed, with 91% of the documents belonging to the non-opinion class. Without subsumption, the classifier was conservative about assigning documents to the opinion class, achieving F-measure scores in the 82-88 range. After subsumption, the overall accuracy improved but the F-measure scores increased more dramatically. These numbers show that the subsumption process produced not only a more accurate classifier, but a more useful classifier that identifies more documents as being opinion articles.
For the MPQA data, we get a very small improvement of 0.1% (74.8% → 74.9%) using subsumption. But note that without subsumption the performance actually decreased when bigrams and extraction patterns were added!
Features   Base   δ=.0005  δ=.001  δ=.002
1Gram      84.5   -        -       -
1+2Gram    88.0   92.5     92.0    92.3
1+EP       82.4   86.9     87.4    87.4
1+2+EP     86.7   91.8     92.5    92.3

Figure 10: F-measures on OP Data
[Figure 11: Feature Selection on OP Data. Accuracy (%) vs. Top N features (1,000-10,000), comparing Baseline, Subsumption (δ=0.002), Feature Selection, and Subsumption (δ=0.002) + Feature Selection.]
The subsumption process counteracted the negative effect of adding the more complex features.
5.2 Feature Selection Experiments
We conducted a second series of experiments to determine whether a traditional feature selection approach would produce the same, or better, improvements as subsumption. For each feature, we computed its information gain (IG) and then selected the N features with the highest scores.7 We experimented with values of N ranging from 1,000 to 10,000 in increments of 1,000.

7 In the case of ties, we included all features with the same score as the Nth-best as well.
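A minimal sketch of this selection step follows, assuming binary class labels and per-feature document counts; the information gain formula is the standard entropy reduction used for feature selection (Forman, 2003), and the tie handling mirrors footnote 7.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) split, in bits."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(docs_with, docs_without):
    """IG of a feature: class entropy minus the entropy remaining after
    splitting documents on feature presence. Each argument is a (pos, neg)
    document-count pair."""
    pos = docs_with[0] + docs_without[0]
    neg = docs_with[1] + docs_without[1]
    total = pos + neg
    h_after = (sum(docs_with) / total) * entropy(*docs_with) \
            + (sum(docs_without) / total) * entropy(*docs_without)
    return entropy(pos, neg) - h_after

def select_top_n(ig_scores, n):
    """Keep the N highest-IG features; ties with the Nth score are all kept
    (footnote 7)."""
    ranked = sorted(ig_scores, key=ig_scores.get, reverse=True)
    if len(ranked) <= n:
        return ranked
    threshold = ig_scores[ranked[n - 1]]
    return [f for f in ranked if ig_scores[f] >= threshold]
```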
We hypothesized that applying subsumption before traditional feature selection might also help to identify a more diverse set of high-performing features. In a parallel set of experiments, we explored this hypothesis by first applying subsumption to reduce the size of the feature set, and then selecting the best N features using information gain.

Figures 11, 12, and 13 show the results of these experiments for the 1+2+EP classifiers. Each graph shows four lines. One line corresponds to the baseline classifier with no subsumption, and another line corresponds to the baseline classifier with subsumption using the best δ value for that data set. Each of these two lines corresponds to just a single data point (accuracy value), but we drew that value as a line across the graph for the sake of comparison. The other two lines on the graph correspond to (a) feature selection for different values of N (shown on the x-axis), and (b) subsumption followed by feature selection for different values of N.
[Figure 12: Feature Selection on Polarity Data. Accuracy (%) vs. Top N features (1,000-10,000), comparing Baseline, Subsumption (δ=0.002), Feature Selection, and Subsumption (δ=0.002) + Feature Selection.]
[Figure 13: Feature Selection on MPQA Data. Accuracy (%) vs. Top N features (1,000-10,000), comparing Baseline, Subsumption (δ=0.0005), Feature Selection, and Subsumption (δ=0.0005) + Feature Selection.]
On all 3 data sets, traditional feature selection performs worse than the baseline in some cases, and it virtually never outperforms the best classifier trained after subsumption (but without feature selection). Furthermore, the combination of subsumption plus feature selection generally performs best of all, and nearly always outperforms feature selection alone. For all 3 data sets, our best accuracy results were achieved by performing subsumption prior to feature selection. The best accuracy results are 99.0% on the OP data, 83.1% on the Polarity data, and 75.4% on the MPQA data. For the OP data, the improvements over baseline for both accuracy and F-measure are statistically significant at the p < 0.05 level (paired t-test). For the MPQA data, the improvement over baseline is statistically significant at the p < 0.10 level.
6 Related Work
Many features and classification algorithms have been explored in sentiment analysis and opinion recognition. Lexical cues of differing complexities have been used, including single words and Ngrams (e.g., (Mullen and Collier, 2004; Pang et al., 2002; Turney, 2002; Yu and Hatzivassiloglou, 2003; Wiebe et al., 2004)), as well as phrases and lexico-syntactic patterns (e.g., (Kim and Hovy, 2004; Hu and Liu, 2004; Popescu and Etzioni, 2005; Riloff and Wiebe, 2003; Whitelaw et al., 2005)). While many of these studies investigate combinations of features and feature selection, this is the first work that uses the notion of subsumption to compare Ngrams and lexico-syntactic patterns to identify complex features that outperform simpler counterparts and to reduce a combined feature set to improve opinion classification.
7 Conclusions
This paper uses a subsumption hierarchy of feature representations as (1) an analytic tool to compare features of different complexities, and (2) an automatic tool to remove unnecessary features to improve opinion classification performance. Experiments with three opinion data sets showed that subsumption can improve classification accuracy, especially when combined with feature selection.
Acknowledgments
This research was supported by NSF Grants IIS-0208798 and IIS-0208985, the ARDA AQUAINT Program, and the Institute for Scientific Computing Research and the Center for Applied Scientific Computing within Lawrence Livermore National Laboratory.

References
S. Banerjee and T. Pedersen. 2003. The Design, Implementation, and Use of the Ngram Statistics Package. In Proc. Fourth Int'l Conference on Intelligent Text Processing and Computational Linguistics.

A. Esuli and F. Sebastiani. 2005. Determining the semantic orientation of terms through gloss analysis. In Proc. CIKM-05.

G. Forman. 2003. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. J. Mach. Learn. Res., 3:1289-1305.

M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In Proc. KDD-04.

T. Joachims. 1998. Making Large-Scale Support Vector Machine Learning Practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA.

S-M. Kim and E. Hovy. 2004. Determining the sentiment of opinions. In Proc. COLING-04.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330.

T. Mullen and N. Collier. 2004. Sentiment Analysis Using Support Vector Machines with Diverse Information Sources. In Proc. EMNLP-04.

B. Pang and L. Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proc. ACL-04.

B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proc. EMNLP-02.

A-M. Popescu and O. Etzioni. 2005. Extracting product features and opinions from reviews. In Proc. HLT-EMNLP-05.

E. Riloff and W. Phillips. 2004. An Introduction to the Sundance and AutoSlog Systems. Technical Report UUCS-04-015, School of Computing, University of Utah.

E. Riloff and J. Wiebe. 2003. Learning Extraction Patterns for Subjective Expressions. In Proc. EMNLP-03.

E. Riloff. 1996. An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. Artificial Intelligence, 85:101-134.

P. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proc. ACL-02.

C. Whitelaw, N. Garg, and S. Argamon. 2005. Using appraisal groups for sentiment analysis. In Proc. CIKM-05.

J. Wiebe, T. Wilson, R. Bruce, M. Bell, and M. Martin. 2004. Learning subjective language. Computational Linguistics, 30(3):277-308.

J. Wiebe, T. Wilson, and C. Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2/3).

H. Yu and V. Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proc. EMNLP-03.
