Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 327–335,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Get out the vote: Determining support or opposition from Congressional
floor-debate transcripts
Matt Thomas, Bo Pang, and Lillian Lee
Department of Computer Science, Cornell University
Ithaca, NY 14853-7501
mattthomas84@gmail.com, pabo@cs.cornell.edu, llee@cs.cornell.edu
Abstract
We investigate whether one can determine
from the transcripts of U.S. Congressional
floor debates whether the speeches repre-
sent support of or opposition to proposed
legislation. To address this problem, we
exploit the fact that these speeches occur
as part of a discussion; this allows us to
use sources of information regarding re-
lationships between discourse segments,
such as whether a given utterance indicates
agreement with the opinion expressed by
another. We find that the incorporation
of such information yields substantial im-
provements over classifying speeches in
isolation.
1 Introduction
One ought to recognize that the present
political chaos is connected with the de-
cay of language, and that one can prob-
ably bring about some improvement by
starting at the verbal end. — Orwell,
“Politics and the English language”
We have entered an era where very large
amounts of politically oriented text are now avail-
able online. This includes both official documents,
such as the full text of laws and the proceedings of
legislative bodies, and unofficial documents, such
as postings on weblogs (blogs) devoted to politics.
In some sense, the availability of such data is simply a
manifestation of a general trend of “everybody putting
their records on the Internet”.1 The online accessibility
of politically oriented texts in particular, however, is a
phenomenon that some have gone so far as to say will have
a potentially society-changing effect.
1 It is worth pointing out that the United States’ Library of
Congress was an extremely early adopter of Web technology:
the THOMAS database (http://thomas.loc.gov) of congressional
bills and related data was launched in January 1995,
when Mosaic was not quite two years old and Altavista did
not yet exist.
In the United States, for example, governmental bodies are
providing and soliciting political documents via the
Internet, with lofty goals in mind: electronic rulemaking
(eRulemaking) initiatives involving the “electronic
collection, distribution, synthesis, and analysis of public
commentary in the regulatory rulemaking process”,
may “[alter] the citizen-government relationship”
(Shulman and Schlosberg, 2002). Additionally,
much media attention has been focused recently
on the potential impact that Internet sites may have
on politics2, or at least on political journalism3.
Regardless of whether one views such claims as
clear-sighted prophecy or mere hype, it is obviously
important to help people understand and analyze
politically oriented text, given the importance of
enabling informed participation in the political process.
2 E.g., “Internet injects sweeping change into U.S. politics”,
Adam Nagourney, The New York Times, April 2, 2006.
3 E.g., “The End of News?”, Michael Massing, The New
York Review of Books, December 1, 2005.
Evaluative and persuasive documents, such as
a politician’s speech regarding a bill or a blogger’s
commentary on a legislative proposal, form a
particularly interesting type of politically oriented
text. People are much more likely to consult such
evaluative statements than the actual text of a bill
or law under discussion, given the dense nature of
legislative language and the fact that (U.S.) bills
often reach several hundred pages in length (Smith
et al., 2005). Moreover, political opinions are
explicitly solicited in the eRulemaking scenario.
In the analysis of evaluative language, it is fun-
damentally necessary to determine whether the au-
thor/speaker supports or disapproves of the topic
of discussion. In this paper, we investigate the
following specific instantiation of this problem:
we seek to determine from the transcripts of
U.S. Congressional floor debates whether each
“speech” (continuous single-speaker segment of
text) represents support for or opposition to a pro-
posed piece of legislation. Note that from an ex-
perimental point of view, this is a very convenient
problem to work with because we can automati-
cally determine ground truth (and thus avoid the
need for manual annotation) simply by consulting
publicly available voting records.
Task properties Determining whether or not a
speaker supports a proposal falls within the realm
of sentiment analysis, an extremely active re-
search area devoted to the computational treatment
of subjective or opinion-oriented language (early
work includes Wiebe and Rapaport (1988), Hearst
(1992), Sack (1994), and Wiebe (1994); see Esuli
(2006) for an active bibliography). In particu-
lar, since we treat each individual speech within
a debate as a single “document”, we are consider-
ing a version of document-level sentiment-polarity
classification, namely, automatically distinguish-
ing between positive and negative documents (Das
and Chen, 2001; Pang et al., 2002; Turney, 2002;
Dave et al., 2003).
Most sentiment-polarity classifiers proposed in
the recent literature categorize each document in-
dependently. A few others incorporate various
measures of inter-document similarity between the
texts to be labeled (Agarwal and Bhattacharyya,
2005; Pang and Lee, 2005; Goldberg and Zhu,
2006). Many interesting opinion-oriented docu-
ments, however, can be linked through certain re-
lationships that occur in the context of evaluative
discussions. For example, we may find textual4
evidence of a high likelihood of agreement between
two speakers, such as explicit assertions (“I
second that!”) or quotation of messages in emails
or postings (see Mullen and Malouf (2006) but cf.
Agrawal et al. (2003)). Agreement evidence can
be a powerful aid in our classification task: for
example, we can easily categorize a complicated (or
overly terse) document if we find within it
indications of agreement with a clearly positive text.
4 Because we are most interested in techniques applicable
across domains, we restrict consideration to NLP aspects of
the problem, ignoring external problem-specific information.
For example, although most votes in our corpus were almost
completely along party lines (and despite the fact that
same-party information is easily incorporated via the methods we
propose), we did not use party-affiliation data. Indeed, in
other settings (e.g., a movie-discussion listserv) one may not
be able to determine the participants’ political leanings, and
such information may not lead to significantly improved
results even if it were available.
Obviously, incorporating agreement informa-
tion provides additional benefit only when the in-
put documents are relatively difficult to classify
individually. Intuition suggests that this is true
of the data with which we experiment, for several
reasons. First, U.S. congressional debates contain
very rich language and cover an extremely wide
variety of topics, ranging from flag burning to in-
ternational policy to the federal budget. Debates
are also subject to digressions, some fairly natural
and others less so (e.g., “Why are we discussing
this bill when the plight of my constituents regarding
this other issue is being ignored?”).
Second, an important characteristic of persua-
sive language is that speakers may spend more
time presenting evidence in support of their po-
sitions (or attacking the evidence presented by
others) than directly stating their attitudes. An
extreme example will illustrate the problems in-
volved. Consider a speech that describes the U.S.
flag as deeply inspirational, and thus contains only
positive language. If the bill under discussion is a
proposed flag-burning ban, then the speech is sup-
portive; but if the bill under discussion is aimed at
rescinding an existing flag-burning ban, the speech
may represent opposition to the legislation. Given
the current state of the art in sentiment analysis,
it is doubtful that one could determine the (proba-
bly topic-specific) relationship between presented
evidence and speaker opinion.
Qualitative summary of results The above dif-
ficulties underscore the importance of enhancing
standard classification techniques with new infor-
mation sources that promise to improve accuracy,
such as inter-document relationships between the
documents to be labeled. In this paper, we demon-
strate that the incorporation of agreement model-
ing can provide substantial improvements over the
application of support vector machines (SVMs) in
isolation, which represents the state of the art in
the individual classification of documents. The
enhanced accuracies are obtained via a fairly primitive
automatically-acquired “agreement detector”
and a conceptually simple method for integrating
isolated-document and agreement-based information.
We thus view our results as demonstrating the
potentially large benefits of exploiting
sentiment-related discourse-segment relationships
in sentiment-analysis tasks.

                                               total   train   test   development
speech segments                                 3857    2740    860           257
debates                                           53      38     10             5
average number of speech segments per debate    72.8    72.1   86.0          51.4
average number of speakers per debate           32.1    30.9   41.1          22.6
Table 1: Corpus statistics.

2 Corpus
This section outlines the main steps of the process
by which we created our corpus (download site:
www.cs.cornell.edu/home/llee/data/convote.html).
GovTrack (http://govtrack.us) is an independent
website run by Joshua Tauberer that collects pub-
licly available data on the legislative and fund-
raising activities of U.S. congresspeople. Due to
its extensive cross-referencing and collating of in-
formation, it was nominated for a 2006 “Webby”
award. A crucial characteristic of GovTrack from
our point of view is that the information is pro-
vided in a very convenient format; for instance,
the floor-debate transcripts are broken into sepa-
rate HTML files according to the subject of the
debate, so we can trivially derive long sequences
of speeches guaranteed to cover the same topic.
We extracted from GovTrack all available tran-
scripts of U.S. floor debates in the House of Rep-
resentatives for the year 2005 (3268 pages of tran-
scripts in total), together with voting records for all
roll-call votes during that year. We concentrated
on debates regarding “controversial” bills (ones in
which the losing side generated at least 20% of the
speeches) because these debates should presum-
ably exhibit more interesting discourse structure.
Each debate consists of a series of speech seg-
ments, where each segment is a sequence of un-
interrupted utterances by a single speaker. Since
speech segments represent natural discourse units,
we treat them as the basic unit to be classified.
Each speech segment was labeled by the vote
(“yea” or “nay”) cast for the proposed bill by the
person who uttered the speech segment.
We automatically discarded those speech seg-
ments belonging to a class of formulaic, generally
one-sentence utterances focused on the yielding
of time on the house floor (for example, “Madam
Speaker, I am pleased to yield 5 minutes to the
gentleman from Massachusetts”), as such speech
segments are clearly off-topic. We also removed
speech segments containing the term “amend-
ment”, since we found during initial inspection
that these speeches generally reflect a speaker’s
opinion on an amendment, and this opinion may
differ from the speaker’s opinion on the underly-
ing bill under discussion.
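
The two filtering steps just described can be sketched in Python. This is only an illustration under stated assumptions, not the preprocessing code used to build the corpus: the `SpeechSegment` record, the `is_yield_statement` heuristic (a short utterance containing the word "yield"), and the 40-word cutoff are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class SpeechSegment:
    speaker: str
    text: str
    vote: str  # "yea" or "nay", taken from the speaker's roll-call vote


def is_yield_statement(text: str, max_words: int = 40) -> bool:
    """Hypothetical heuristic for formulaic time-yielding utterances."""
    words = [w.strip('.,;:"') for w in text.lower().split()]
    return len(words) <= max_words and any(w in ("yield", "yields") for w in words)


def mentions_amendment(text: str) -> bool:
    """True if the segment contains the term "amendment" (case-insensitive)."""
    return "amendment" in text.lower()


def filter_segments(segments):
    """Drop yield statements and amendment-related segments."""
    return [
        s for s in segments
        if not is_yield_statement(s.text) and not mentions_amendment(s.text)
    ]
```

A real pipeline would probably need a more careful yield detector; the point here is only that both filters are simple surface tests applied before any classification.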
We randomly split the data into training, test,
and development (parameter-tuning) sets repre-
senting roughly 70%, 20%, and 10% of our data,
respectively (see Table 1). The speech segments
remained grouped by debate, with 38 debates as-
signed to the training set, 10 to the test set, and 5
to the development set; we require that the speech
segments from an individual debate all appear in
the same set because our goal is to examine clas-
sification of speech segments in the context of the
surrounding discussion.
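
A debate-level split of this kind can be sketched as follows. The function name, the seed, and the shuffling procedure are assumptions for illustration; the text above only states that the split was random and that whole debates were kept together.

```python
import random


def split_by_debate(debate_ids, n_test, n_dev, seed=0):
    """Randomly assign whole debates to train/test/development sets.

    All speech segments of a debate follow their debate, so no debate
    is ever divided between two sets.
    """
    ids = sorted(set(debate_ids))
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    rng.shuffle(ids)
    test = set(ids[:n_test])
    dev = set(ids[n_test:n_test + n_dev])
    train = set(ids[n_test + n_dev:])
    return train, test, dev


# With 53 debates, 10 for test and 5 for development leave 38 for training.
train, test, dev = split_by_debate(range(53), n_test=10, n_dev=5)
```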
3 Method
The support/oppose classification problem can be
approached through the use of standard classifiers
such as support vector machines (SVMs), which
consider each text unit in isolation. As discussed
in Section 1, however, the conversational nature
of our data implies the existence of various rela-
tionships that can be exploited to improve cumu-
lative classification accuracy for speech segments
belonging to the same debate. Our classification
framework, directly inspired by Blum and Chawla
(2001), integrates both perspectives, optimizing
its labeling of speech segments based on both in-
dividual speech-segment classification scores and
preferences for groups of speech segments to re-
ceive the same label. In this section, we discuss
the specific classification framework that we adopt
and the set of mechanisms that we propose for
modeling specific types of relationships.
3.1 Classification framework
Let $s_1, s_2, \ldots, s_n$ be the sequence of speech
segments within a given debate, and let Y and
N stand for the “yea” and “nay” class, respectively.
Assume we have a non-negative function
ind(s, C) indicating the degree of preference
that an individual-document classifier, such as an
SVM, has for placing speech segment s in class
C. Also, assume that some pairs of speech segments
have weighted links between them, where
the non-negative strength (weight) str($\ell$) for a
link $\ell$ indicates the degree to which it is preferable
that the linked speech segments receive the
same label. Then, any class assignment
$c = c(s_1), c(s_2), \ldots, c(s_n)$ can be assigned a cost

\[
\sum_{s} \mathrm{ind}(s, \bar{c}(s))
\;+\; \sum_{s, s' :\, c(s) \neq c(s')} \;\; \sum_{\ell \text{ between } s, s'} \mathrm{str}(\ell),
\]

where $\bar{c}(s)$ is the “opposite” class from $c(s)$. A
minimum-cost assignment thus represents an opti-
mum way to classify the speech segments so that
each one tends not to be put into the class that
the individual-document classifier disprefers, but
at the same time, highly associated speech seg-
ments tend not to be put in different classes.
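
To make the cost concrete, the sketch below evaluates it for a toy debate and finds the minimum by exhaustive search. The data structures (a dictionary of ind values keyed by segment and class, and a list of weighted links, each unordered pair listed once) and all numbers are assumptions for illustration; exhaustive search is exponential in the number of segments and is shown only to make the objective explicit.

```python
from itertools import product

OPPOSITE = {"Y": "N", "N": "Y"}


def assignment_cost(labels, ind, links):
    """Cost of a labeling: ind of the rejected class plus cut link strengths.

    labels: dict segment -> "Y" or "N"
    ind:    dict (segment, class) -> non-negative preference
    links:  list of (segment, segment, strength), one entry per link
    """
    individual = sum(ind[(s, OPPOSITE[c])] for s, c in labels.items())
    separated = sum(w for s, t, w in links if labels[s] != labels[t])
    return individual + separated


def brute_force_minimum(segments, ind, links):
    """Try every labeling; feasible only for a handful of segments."""
    best_cost, best_labels = None, None
    for combo in product("YN", repeat=len(segments)):
        labels = dict(zip(segments, combo))
        cost = assignment_cost(labels, ind, links)
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels


# Toy example: segment "b" leans N in isolation, but a link of strength
# 0.5 to the confidently-Y segment "a" pulls it into class Y.
ind = {("a", "Y"): 0.9, ("a", "N"): 0.1, ("b", "Y"): 0.4, ("b", "N"): 0.6}
cost, labels = brute_force_minimum(["a", "b"], ind, [("a", "b", 0.5)])
```

Without the link, the cheapest labeling of this toy example puts "a" in Y and "b" in N; with it, separating the two segments costs more than overriding the weak individual preference of "b".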
As has been previously observed and exploited
in the NLP literature (Pang and Lee, 2004; Agarwal
and Bhattacharyya, 2005; Barzilay and Lapata,
2005), the above optimization problem, unlike
many others that have been proposed for graph or
set partitioning, can be solved exactly in a provably
efficient manner via methods for finding minimum
cuts in graphs. In our view, the contribution
of our work is the examination of new types of
relationships, not the method by which such re-
lationships are incorporated into the classification
decision.
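
The reduction to minimum cut can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes the usual construction in which a source node stands for class Y and a sink node for class N, each segment s is connected to the source with capacity ind(s, Y) and to the sink with capacity ind(s, N), and each link contributes its strength in both directions. The max-flow routine is a plain shortest-augmenting-path (Edmonds-Karp style) search, chosen only to keep the example self-contained.

```python
from collections import defaultdict, deque


def min_cut_labels(segments, ind, links, eps=1e-12):
    """Label segments by a minimum s-t cut; returns (labels, cut value)."""
    # Tuple terminals cannot collide with string segment identifiers.
    source, sink = ("SOURCE", "Y"), ("SINK", "N")
    cap = defaultdict(lambda: defaultdict(float))
    for s in segments:
        cap[source][s] += ind[(s, "Y")]
        cap[s][sink] += ind[(s, "N")]
    for s, t, strength in links:
        cap[s][t] += strength
        cap[t][s] += strength

    def reachable_from_source():
        # Breadth-first search over edges with remaining capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > eps and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= bottleneck
            cap[w][u] += bottleneck
        flow += bottleneck

    # Segments still reachable from the source lie on the Y side of the cut.
    labels = {s: ("Y" if s in parent else "N") for s in segments}
    return labels, flow


ind = {("a", "Y"): 0.9, ("a", "N"): 0.1, ("b", "Y"): 0.4, ("b", "N"): 0.6}
labels, cut_value = min_cut_labels(["a", "b"], ind, [("a", "b", 0.5)])
```

Under this construction, a segment left on the source side pays its edge to the sink, which is ind of the opposite class, and every link whose endpoints are separated is cut, so the cut value equals the assignment cost defined above. A hard constraint can be approximated by giving a link a very large strength.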
3.2 Classifying speech segments in isolation
In our experiments, we employed the well-known
classifier SVMlight to obtain individual-document
classification scores, treating Y as the positive
class and using plain unigrams as features.5
Following standard practice in sentiment analysis
(Pang et al., 2002), the input to SVMlight
consisted of normalized presence-of-feature (rather
than frequency-of-feature) vectors. The ind value
for each speech segment s was based on the signed
distance d(s) from the vector representing s to the
trained SVM decision plane:

\[
\mathrm{ind}(s, Y) \stackrel{\mathrm{def}}{=}
\begin{cases}
1 & d(s) > 2\sigma_s; \\
\left(1 + \frac{d(s)}{2\sigma_s}\right) / 2 & |d(s)| \leq 2\sigma_s; \\
0 & d(s) < -2\sigma_s
\end{cases}
\]

where $\sigma_s$ is the standard deviation of d(s) over all
speech segments s in the debate in question, and
$\mathrm{ind}(s, N) \stackrel{\mathrm{def}}{=} 1 - \mathrm{ind}(s, Y)$.
5 SVMlight is available at svmlight.joachims.org. Default
parameters were used, although experimentation with different
parameter settings is an important direction for future
work (Daelemans and Hoste, 2002; Munson et al., 2005).
We now turn to the more interesting problem of
representing the preferences that speech segments
may have for being assigned to the same class.
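
The normalization of SVM distances into ind scores defined in Section 3.2 can be sketched as follows. Two details are assumptions not settled by the text: the population standard deviation is used, and a debate whose distances are all identical (so that the deviation is zero) falls back to a hard 1 / 0.5 / 0 decision on the sign of d(s).

```python
import statistics


def ind_yea(d, sigma):
    """ind(s, Y) for signed distance d and per-debate deviation sigma."""
    if sigma == 0:
        # Assumed fallback for a degenerate debate with no spread.
        return 1.0 if d > 0 else 0.0 if d < 0 else 0.5
    if d > 2 * sigma:
        return 1.0
    if d < -2 * sigma:
        return 0.0
    return (1 + d / (2 * sigma)) / 2


def ind_scores(distances):
    """distances: dict segment -> signed SVM distance within one debate."""
    sigma = statistics.pstdev(list(distances.values()))
    scores = {}
    for segment, d in distances.items():
        yea = ind_yea(d, sigma)
        scores[(segment, "Y")] = yea
        scores[(segment, "N")] = 1 - yea
    return scores
```

Because the deviation is computed per debate, the same raw distance can map to different ind values in different debates; the clipping at two standard deviations keeps every score in [0, 1].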
3.3 Relationships between speech segments
A wide range of relationships between text seg-
ments can be modeled as positive-strength links.
Here we discuss two types of constraints that are
considered in this work.
Same-speaker constraints: In Congressional
debates and in general social-discourse contexts,
a single speaker may make a number of comments
regarding a topic. It is reasonable to expect that in
many settings, the participants in a discussion may
be convinced to change their opinions midway
through a debate. Hence, in the general case we
wish to be able to express “soft” preferences for all
of an author’s statements to receive the same label,
where the strengths of such constraints could, for
instance, vary according to the time elapsed be-
tween the statements. Weighted links are an ap-
propriate means to express such variation.
However, if we assume that most speakers do
not change their positions in the course of a dis-
cussion, we can conclude that all comments made
by the same speaker must receive the same label.
This assumption holds by fiat for the ground-truth
labels in our dataset because these labels were
derived from the single vote cast by the speaker
on the bill being discussed.6 We can implement
this assumption via links whose weights are essentially
infinite. Although one can also implement
this assumption via concatenation of same-speaker
speech segments (see Section 4.3), we view the
fact that our graph-based framework incorporates
both hard and soft constraints in a principled fashion
as an advantage of our approach.
6 We are attempting to determine whether a speech segment
represents support or not. This differs from the problem
of determining what the speaker’s actual opinion is, a problem
that, as an anonymous reviewer put it, is complicated by
“grandstanding, backroom deals, or, more innocently, plain
change of mind (‘I voted for it before I voted against it’)”.
Different-speaker agreements In House dis-
course, it is common for one speaker to make ref-
erence to another in the context of an agreement
or disagreement over the topic of discussion. The
systematic identification of instances of agreement
can, as we have discussed, be a powerful tool for
the development of intelligently selected weights
for links between speech segments.
The problem of agreement identification can be
decomposed into two sub-problems: identifying
references and their targets, and deciding whether
each reference represents an instance of agree-
ment. In our case, the first task is straightfor-
ward because we focused solely on by-name ref-
erences.7 Hence, we will now concentrate on the
second, more interesting task.
We approach the problem of classifying refer-
ences by representing each reference with a word-
presence vector derived from a window of text
surrounding the reference.8 In the training set,
we classify each reference connecting two speak-
ers with a positive or negative label depending on
whether the two voted the same way on the bill un-
der discussion9. These labels are then used to train
an SVM classifier, the output of which is subse-
quently used to create weights on agreement links
in the test set as follows.
Let d(r) denote the distance from the vector
representing reference r to the agreement-detector
SVM’s decision plane, and let $\sigma_r$ be the standard
deviation of d(r) over all references in the debate
in question. We then define the strength agr of the
agreement link corresponding to the reference as:

\[
\mathrm{agr}(r) \stackrel{\mathrm{def}}{=}
\begin{cases}
0 & d(r) < \theta_{agr}; \\
\alpha \cdot d(r) / 4\sigma_r & \theta_{agr} \leq d(r) \leq 4\sigma_r; \\
\alpha & d(r) > 4\sigma_r.
\end{cases}
\]

The free parameter α specifies the relative importance
of the agr scores. The threshold θagr controls
the precision of the agreement links, in that
values of θagr greater than zero mean that greater
confidence is required before an agreement link
can be added.10
7 One subtlety is that for the purposes of mining agreement
cues (but not for evaluating overall support/oppose
classification accuracy), we temporarily re-inserted into our
dataset previously filtered speech segments containing the
term “yield”, since the yielding of time on the House floor
typically indicates agreement even though the yield statements
contain little relevant text on their own.
8 We found good development-set performance using the
30 tokens before, 20 tokens after, and the name itself.
9 Since we are concerned with references that potentially
represent relationships between speech segments, we ignore
references for which the target of the reference did not speak
in the debate in which the reference was made.

Agreement classifier                 Devel. set   Test set
(“reference⇒agreement?”)
majority baseline                         81.51      80.26
Train: no amdmts; θagr = 0                84.25      81.07
Train: with amdmts; θagr = 0              86.99      80.10
Table 2: Agreement-classifier accuracy, in percent.
“Amdmts”=“speech segments containing the
word ‘amendment’”. Recall that boldface indicates
results for development-set-optimal settings.

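
The piecewise definition of agr(r) given in Section 3.3 can be sketched as follows. The function and argument names are hypothetical, the threshold is assumed to be non-negative (as in the two settings considered in this paper, 0 and the mean score µ), and the handling of a zero standard deviation is an assumption added so that the function is total.

```python
def agr_strength(d, sigma_r, alpha, theta_agr=0.0):
    """Strength of the agreement link for a reference at SVM distance d.

    sigma_r:   standard deviation of d(r) over the debate's references
    alpha:     free parameter scaling the agreement scores
    theta_agr: confidence threshold, assumed >= 0
    """
    if d < theta_agr:
        return 0.0  # not confident enough: no agreement link
    if d > 4 * sigma_r:
        return alpha  # strength saturates at alpha
    if sigma_r == 0:
        return 0.0  # assumed fallback; only reached when d == 0
    return alpha * d / (4 * sigma_r)
```

Raising `theta_agr` removes low-confidence links entirely rather than merely weakening them, which is the mechanism by which the threshold trades recall for precision.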
4 Evaluation
This section presents experiments testing the util-
ity of using speech-segment relationships, evalu-
ating against a number of baselines. All reported
results use values for the free parameter α derived
via tuning on the development set. In the tables,
boldface indicates the development- and test-set
results for the development-set-optimal parameter
settings, as one would make algorithmic choices
based on development-set performance.
4.1 Preliminaries: Reference classification
Recall that to gather inter-speaker agreement in-
formation, the strategy employed in this paper is
to classify by-name references to other speakers
as to whether they indicate agreement or not.
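
As an illustration of how a by-name reference might be turned into classifier input, the sketch below builds a word-presence representation from the window described in Section 3.3 (the 30 tokens before the name, the name itself, and the 20 tokens after). Tokenization, lowercasing, and the function names are assumptions for the example.

```python
def reference_window(tokens, start, end, before=30, after=20):
    """Tokens around a name mention occupying tokens[start:end].

    The window is clipped at the boundaries of the speech segment.
    """
    lo = max(0, start - before)
    hi = min(len(tokens), end + after)
    return tokens[lo:hi]


def presence_features(window):
    """Word-presence (not frequency) features for one reference."""
    return {token.lower(): 1 for token in window}


tokens = "I thank the gentleman from Ohio for his leadership on this bill".split()
features = presence_features(reference_window(tokens, start=5, end=6))
```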
To train our agreement classifier, we experi-
mented with undoing the deletion of amendment-
related speech segments in the training set. Note
that such speech segments were never included in
the development or test set, since, as discussed in
Section 2, their labels are probably noisy; how-
ever, including them in the training set allows the
classifier to examine more instances even though
some of them are labeled incorrectly. As Table
2 shows, using more, if noisy, data yields bet-
ter agreement-classification results on the devel-
opment set, and so we use that policy in all subse-
quent experiments.11
10 Our implementation puts a link between just one arbitrary
pair of speech segments among all those uttered by a
given pair of apparently agreeing speakers. The
“infinite-weight” same-speaker links propagate the agreement
information to all other such pairs.
11 Unfortunately, this policy leads to inferior test-set
agreement classification. Section 4.5 contains further discussion.

Agreement classifier    Precision (in percent):
                        Devel. set   Test set
θagr = 0                     86.23      82.55
θagr = µ                     89.41      88.47
Table 3: Agreement-classifier precision.

An important observation is that precision may
be more important than accuracy in deciding
which agreement links to add: false positives with
respect to agreement can cause speech segments
to be incorrectly assigned the same label, whereas
false negatives mean only that agreement-based
information about other speech segments is not
employed. As described above, we can raise
agreement precision by increasing the threshold
θagr, which specifies the required confidence for
the addition of an agreement link. Indeed, Table
3 shows that we can improve agreement precision
by setting θagr to the (positive) mean agreement
score µ assigned by the SVM agreement-classifier
over all references in the given debate12. However,
this comes at the cost of greatly reducing
agreement accuracy (development: 64.38%; test:
66.18%) due to lowered recall levels. Whether
or not better speech-segment classification is
ultimately achieved is discussed in the next sections.
12 We elected not to explicitly tune the value of θagr in
order to minimize the number of free parameters to deal with.
4.2 Segment-based speech-segment classification
Baselines The first two data rows of Table
4 depict baseline performance results. The
#(“support”) − #(“oppos”) baseline is meant
to explore whether the speech-segment classification
task can be reduced to simple lexical checks.
Specifically, this method uses the signed difference
between the number of words containing the
stem “support” and the number of words containing
the stem “oppos” (returning the majority class
if the difference is 0). No better than 62.67% test-set
accuracy is obtained by either baseline.
Using relationship information Applying an
SVM to classify each speech segment in isolation
leads to clear improvements over the two baseline
methods, as demonstrated in Table 4. When
we impose the constraint that all speech segments
uttered by the same speaker receive the same label
via “same-speaker links”, both test-set and
development-set accuracy increase even more, in
the latter case quite substantially so.

Support/oppose classifier              Devel. set   Test set
(“speech segment⇒yea?”)
majority baseline                           54.09      58.37
#(“support”) − #(“oppos”)                   59.14      62.67
SVM [speech segment]                        70.04      66.05
SVM + same-speaker links                    79.77      67.21
SVM + same-speaker links ...
  + agreement links, θagr = 0               89.11      70.81
  + agreement links, θagr = µ               87.94      71.16
Table 4: Segment-based speech-segment classification
accuracy, in percent.

Support/oppose classifier              Devel. set   Test set
(“speech segment⇒yea?”)
SVM [speaker]                               71.60      70.00
SVM + agreement links ...
  with θagr = 0                             88.72      71.28
  with θagr = µ                             84.44      76.05
Table 5: Speaker-based speech-segment classification
accuracy, in percent. Here, the initial SVM is
run on the concatenation of all of a given speaker’s
speech segments, but the results are computed
over speech segments (not speakers), so that they
can be compared to those in Table 4.

The last two lines of Table 4 show that the
best results are obtained by incorporating agree-
ment information as well. The highest test-set re-
sult, 71.16%, is obtained by using a high-precision
threshold to determine which agreement links to
add. While the development-set results would in-
duce us to utilize the standard threshold value of 0,
which is sub-optimal on the test set, the θagr = 0
agreement-link policy still achieves noticeable im-
provement over not using agreement links (test set:
70.81% vs. 67.21%).
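
For reference, the #(“support”) − #(“oppos”) baseline of Section 4.2 is simple enough to sketch in a few lines. The tokenization (lowercased alphabetic runs) and the function name are assumptions; the majority label is passed in because it depends on the data set at hand.

```python
import re


def support_oppose_baseline(text, majority_label):
    """Classify by counting words containing "support" vs. "oppos"."""
    tokens = re.findall(r"[a-z]+", text.lower())
    n_support = sum("support" in token for token in tokens)
    n_oppos = sum("oppos" in token for token in tokens)
    difference = n_support - n_oppos
    if difference > 0:
        return "yea"
    if difference < 0:
        return "nay"
    return majority_label  # tie: fall back to the majority class
```

The substring test means that words such as "supporters" or "opposition" count toward the respective totals, which matches the "words containing the stem" wording but is otherwise a literal reading.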
4.3 Speaker-based speech-segment
classification
We use speech segments as the unit of classifica-
tion because they represent natural discourse units.
As a consequence, we are able to exploit relation-
ships at the speech-segment level. However, it is
interesting to consider whether we really need to
consider relationships specifically between speech
segments themselves, or whether it suffices to sim-
ply consider relationships between the speakers
of the speech segments. In particular, as an al-
ternative to using same-speaker links, we tried a
speaker-based approach wherein the way we de-
termine the initial individual-document classifica-
tion score for each speech segment uttered by a
person p in a given debate is to run an SVM on the
concatenation of all of p’s speech segments within
that debate. (We also ensure that agreement-link
information is propagated from speech-segment to
speaker pairs.)
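
The concatenation step can be sketched as below. The representation of a debate as a list of (speaker, text) pairs and the function names are assumptions; the score passed to `propagate_to_segments` stands for whatever value the speaker-level classifier produces.

```python
from collections import defaultdict


def concatenate_by_speaker(segments):
    """Join each speaker's speech segments within one debate.

    segments: list of (speaker, text) pairs in debate order.
    Returns a dict speaker -> concatenated text.
    """
    by_speaker = defaultdict(list)
    for speaker, text in segments:
        by_speaker[speaker].append(text)
    return {speaker: " ".join(texts) for speaker, texts in by_speaker.items()}


def propagate_to_segments(segments, speaker_scores):
    """Give every segment the score computed for its speaker."""
    return [speaker_scores[speaker] for speaker, _ in segments]


debate = [("p", "I rise today."), ("q", "I disagree."), ("p", "Let me add one point.")]
documents = concatenate_by_speaker(debate)
```

Because every segment inherits its speaker's score, evaluation can still be carried out per speech segment, which keeps the results comparable with the segment-based setting.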
How does the use of same-speaker links com-
pare to the concatenation of each speaker’s speech
segments? Tables 4 and 5 show that, not sur-
prisingly, the SVM individual-document classifier
works better on the concatenated speech segments
than on the speech segments in isolation. How-
ever, the effect on overall classification accuracy
is less clear: the development set favors same-
speaker links over concatenation, while the test set
does not.
But we stress that the most important obser-
vation we can make from Table 5 is that once
again, the addition of agreement information leads
to substantial improvements in accuracy.
4.4 “Hard” agreement constraints
Recall that in our experiments, we created
finite-weight agreement links, so that speech seg-
ments appearing in pairs flagged by our (imper-
fect) agreement detector can potentially receive
different labels. We also experimented with forc-
ing such speech segments to receive the same la-
bel, either through infinite-weight agreement links
or through a speech-segment concatenation strat-
egy similar to that described in the previous sub-
section. Both strategies resulted in clear degrada-
tion in performance on both the development and
test sets, a finding that validates our encoding of
agreement information as “soft” preferences.
4.5 On the development/test set split
We have seen several cases in which the method
that performs best on the development set does
not yield the best test-set performance. However,
we felt that it would be illegitimate to change the
train/development/test sets in a post hoc fashion,
that is, after seeing the experimental results.
Moreover, and crucially, it is very clear that
using agreement information, encoded as prefer-
ences within our graph-based approach rather than
as hard constraints, yields substantial improve-
ments on both the development and test set; this,
we believe, is our most important finding.
5 Related work
Politically-oriented text Sentiment analysis has
specifically been proposed as a key enabling tech-
nology in eRulemaking, allowing the automatic
analysis of the opinions that people submit (Shul-
man et al., 2005; Cardie et al., 2006; Kwon et al.,
2006). There has also been work focused upon de-
termining the political leaning (e.g., “liberal” vs.
“conservative”) of a document or author, where
most previously-proposed methods make no di-
rect use of relationships between the documents to
be classified (the “unlabeled” texts) (Laver et al.,
2003; Efron, 2004; Mullen and Malouf, 2006). An
exception is Grefenstette et al. (2004), who exper-
imented with determining the political orientation
of websites essentially by classifying the concate-
nation of all the documents found on that site.
Others have applied the NLP technologies of
near-duplicate detection and topic-based text cat-
egorization to politically oriented text (Yang and
Callan, 2005; Purpura and Hillard, 2006).
Detecting agreement We used a simple method
to learn to identify cross-speaker references indi-
cating agreement. More sophisticated approaches
have been proposed (Hillard et al., 2003), in-
cluding an extension that, in an interesting re-
versal of our problem, makes use of sentiment-
polarity indicators within speech segments (Gal-
ley et al., 2004). Also relevant is work on the gen-
eral problems of dialog-act tagging (Stolcke et al.,
2000), citation analysis (Lehnert et al., 1990), and
computational rhetorical analysis (Marcu, 2000;
Teufel and Moens, 2002).
We currently do not have an efficient means
to encode disagreement information as hard con-
straints; we plan to investigate incorporating such
information in future work.
Relationships between the unlabeled items
Carvalho and Cohen (2005) consider sequential
relations between different types of emails (e.g.,
between requests and satisfactions thereof) to clas-
sify messages, and thus also explicitly exploit the
structure of conversations.
Previous sentiment-analysis work in different
domains has considered inter-document similar-
ity (Agarwal and Bhattacharyya, 2005; Pang and
Lee, 2005; Goldberg and Zhu, 2006) or explicit
inter-document references in the form of hyper-
links (Agrawal et al., 2003).
Notable early papers on graph-based semi-
supervised learning include Blum and Chawla
(2001), Bansal et al. (2002), Kondor and Lafferty
(2002), and Joachims (2003). Zhu (2005) main-
tains a survey of this area.
Recently, several alternative, often quite sophis-
ticated approaches to collective classification have
been proposed (Neville and Jensen, 2000; Laf-
ferty et al., 2001; Getoor et al., 2002; Taskar et
al., 2002; Taskar et al., 2003; Taskar et al., 2004;
McCallum and Wellner, 2004). It would be inter-
esting to investigate the application of such meth-
ods to our problem. However, we also believe
that our approach has important advantages, in-
cluding conceptual simplicity and the fact that it is
based on an underlying optimization problem that
is provably and in practice easy to solve.
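To make the last point concrete, the following is a minimal sketch of how a minimum-cut formulation of this kind can be solved; it is an illustration under stated assumptions, not the implementation used in our experiments. Each item has a non-negative individual preference for each class, and pairs of items may carry a non-negative same-label weight. The function name min_cut_labels, the argument names, the "Y"/"N" labels, and the toy scores are all hypothetical, and the Edmonds-Karp augmenting-path algorithm is used only because it is short to write down; any maximum-flow solver would serve.

```python
from collections import defaultdict, deque


def min_cut_labels(ind_yes, ind_no, links):
    """Label each item "Y" or "N" via a minimum s-t cut (illustrative sketch).

    ind_yes[x], ind_no[x]: non-negative individual preferences of item x.
    links[(x, y)]: non-negative preference that x and y share a label.
    """
    SRC, SNK = "_source_", "_sink_"  # assumed not to clash with item names
    cap = defaultdict(lambda: defaultdict(float))
    for x in ind_yes:
        cap[SRC][x] += ind_yes[x]  # this edge is cut if x is labeled "N"
        cap[x][SNK] += ind_no[x]   # this edge is cut if x is labeled "Y"
    for (x, y), w in links.items():
        cap[x][y] += w             # cut if x and y receive different labels
        cap[y][x] += w

    def bfs_parents():
        # Breadth-first search over edges with remaining residual capacity.
        parent = {SRC: None}
        queue = deque([SRC])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = bfs_parents()
        if SNK not in parent:
            break  # no augmenting path left: the flow is maximum
        path, v = [], SNK
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow

    # Items still reachable from the source lie on the "Y" side of the cut.
    return {x: ("Y" if x in parent else "N") for x in ind_yes}


if __name__ == "__main__":
    # Hypothetical scores for three speech segments a, b, c.
    ind_yes = {"a": 0.9, "b": 0.4, "c": 0.1}
    ind_no = {"a": 0.1, "b": 0.6, "c": 0.9}
    print(min_cut_labels(ind_yes, ind_no, {}))
    print(min_cut_labels(ind_yes, ind_no, {("a", "b"): 1.0}))
```

In this toy instance, segment b leans weakly toward "N" when classified in isolation, but a strong agreement link to the confidently "Y" segment a makes it cheaper to label both "Y" than to separate them, which is the kind of effect that the relationship information is meant to produce.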
6 Conclusion and future work
In this study, we focused on very general types
of cross-document classification preferences, uti-
lizing constraints based only on speaker identity
and on direct textual references between state-
ments. We showed that the integration of even
very limited information regarding inter-document
relationships can significantly increase the accu-
racy of support/opposition classification.
The simple constraints modeled in our study,
however, represent just a small portion of the
rich network of relationships that connect state-
ments and speakers across the political universe
and in the wider realm of opinionated social dis-
course. One intriguing possibility is to take ad-
vantage of (readily identifiable) information re-
garding interpersonal relationships, making use of
speaker/author affiliations, positions within a so-
cial hierarchy, and so on. Or, we could even at-
tempt to model relationships between topics or
concepts, in a kind of extension of collaborative
filtering. For example, perhaps we could infer that
two speakers sharing a common opinion on evo-
lutionary biologist Richard Dawkins (a.k.a. “Dar-
win’s rottweiler”) will be likely to agree in a de-
bate centered on Intelligent Design. While such
functionality is well beyond the scope of our cur-
rent study, we are optimistic that we can develop
methods to exploit additional types of relation-
ships in future work.
Acknowledgments We thank Claire Cardie, Jon
Kleinberg, Michael Macy, Andrew Myers, and the
six anonymous EMNLP referees for valuable dis-
cussions and comments. We also thank Reviewer
1 for generously providing additional post hoc
feedback, and the EMNLP chairs Eric Gaussier
and Dan Jurafsky for facilitating the process (as
well as for allowing authors an extra proceedings
page...). This paper is based upon work sup-
ported in part by the National Science Founda-
tion under grant no. IIS-0329064. Any opinions,
findings, and conclusions or recommendations ex-
pressed are those of the authors and do not neces-
sarily reflect the views or official policies, either
expressed or implied, of any sponsoring institu-
tions, the U.S. government, or any other entity.

References
A. Agarwal, P. Bhattacharyya. 2005. Sentiment anal-
ysis: A new approach for effective use of linguis-
tic knowledge and exploiting similarities in a set of
documents to be classified. In Proceedings of the
International Conference on Natural Language Pro-
cessing (ICON).
R. Agrawal, S. Rajagopalan, R. Srikant, Y. Xu. 2003.
Mining newsgroups using networks arising from so-
cial behavior. In Proceedings of WWW, 529–535.
N. Bansal, A. Blum, S. Chawla. 2002. Correla-
tion clustering. In Proceedings of the Symposium
on Foundations of Computer Science (FOCS), 238–
247. Journal version in Machine Learning Journal,
special issue on theoretical advances in data cluster-
ing, 56(1-3):89–113 (2004).
R. Barzilay, M. Lapata. 2005. Collective content selec-
tion for concept-to-text generation. In Proceedings
of HLT/EMNLP, 331–338.
A. Blum, S. Chawla. 2001. Learning from labeled and
unlabeled data using graph mincuts. In Proceedings
of ICML, 19–26.
C. Cardie, C. Farina, T. Bruce, E. Wagner. 2006. Us-
ing natural language processing to improve eRule-
making. In Proceedings of Digital Government Re-
search (dg.o).
V. Carvalho, W. W. Cohen. 2005. On the collective
classification of email “speech acts”. In Proceedings
of SIGIR, 345–352.
W. Daelemans, V. Hoste. 2002. Evaluation of ma-
chine learning methods for natural language pro-
cessing tasks. In Proceedings of the Third Interna-
tional Conference on Language Resources and Eval-
uation (LREC), 755–760.
S. Das, M. Chen. 2001. Yahoo! for Amazon: Extract-
ing market sentiment from stock message boards. In
Proceedings of the Asia Pacific Finance Association
Annual Conference (APFA).
K. Dave, S. Lawrence, D. M. Pennock. 2003. Mining
the peanut gallery: Opinion extraction and semantic
classification of product reviews. In Proceedings of
WWW, 519–528.
M. Efron. 2004. Cultural orientation: Classifying sub-
jective documents by cociation [sic] analysis. In
Proceedings of the AAAI Fall Symposium on Style
and Meaning in Language, Art, Music, and Design,
41–48.
A. Esuli. 2006. Sentiment classification bibliography.
liinwww.ira.uka.de/bibliography/Misc/Sentiment.html.
M. Galley, K. McKeown, J. Hirschberg, E. Shriberg.
2004. Identifying agreement and disagreement in
conversational speech: Use of Bayesian networks to
model pragmatic dependencies. In Proceedings of
the 42nd ACL, 669–676.
L. Getoor, N. Friedman, D. Koller, B. Taskar. 2002.
Learning probabilistic models of relational structure.
Journal of Machine Learning Research, 3:679–707.
Special issue on the Eighteenth ICML.
A. B. Goldberg, J. Zhu. 2006. Seeing stars
when there aren’t many stars: Graph-based semi-
supervised learning for sentiment categorization.
In TextGraphs: HLT/NAACL Workshop on Graph-
based Algorithms for Natural Language Processing.
G. Grefenstette, Y. Qu, J. G. Shanahan, D. A. Evans.
2004. Coupling niche browsers and affect analysis
for an opinion mining application. In Proceedings
of RIAO.
M. Hearst. 1992. Direction-based text interpretation as
an information access refinement. In P. Jacobs, ed.,
Text-Based Intelligent Systems, 257–274. Lawrence
Erlbaum Associates.
D. Hillard, M. Ostendorf, E. Shriberg. 2003. Detection
of agreement vs. disagreement in meetings: Train-
ing with unlabeled data. In Proceedings of HLT-
NAACL.
T. Joachims. 2003. Transductive learning via spectral
graph partitioning. In Proceedings of ICML, 290–
297.
R. I. Kondor, J. D. Lafferty. 2002. Diffusion kernels
on graphs and other discrete input spaces. In Pro-
ceedings of ICML, 315–322.
N. Kwon, S. Shulman, E. Hovy. 2006. Multidimen-
sional text analysis for eRulemaking. In Proceed-
ings of Digital Government Research (dg.o).
J. Lafferty, A. McCallum, F. Pereira. 2001. Condi-
tional random fields: Probabilistic models for seg-
menting and labeling sequence data. In Proceedings
of ICML, 282–289.
M. Laver, K. Benoit, J. Garry. 2003. Extracting policy
positions from political texts using words as data.
American Political Science Review.
W. Lehnert, C. Cardie, E. Riloff. 1990. Analyzing re-
search papers using citation sentences. In Program
of the Twelfth Annual Conference of the Cognitive
Science Society, 511–18.
D. Marcu. 2000. The theory and practice of discourse
parsing and summarization. MIT Press.
A. McCallum, B. Wellner. 2004. Conditional mod-
els of identity uncertainty with application to noun
coreference. In Proceedings of NIPS.
T. Mullen, R. Malouf. 2006. A preliminary investiga-
tion into sentiment analysis of informal political dis-
course. In Proceedings of the AAAI Symposium on
Computational Approaches to Analyzing Weblogs,
159–162.
A. Munson, C. Cardie, R. Caruana. 2005. Optimizing
to arbitrary NLP metrics using ensemble selection.
In Proceedings of HLT-EMNLP, 539–546.
J. Neville, D. Jensen. 2000. Iterative classification in
relational data. In Proceedings of the AAAI Work-
shop on Learning Statistical Models from Relational
Data, 13–20.
B. Pang, L. Lee. 2004. A sentimental education:
Sentiment analysis using subjectivity summarization
based on minimum cuts. In Proceedings of the ACL,
271–278.
B. Pang, L. Lee. 2005. Seeing stars: Exploiting class
relationships for sentiment categorization with re-
spect to rating scales. In Proceedings of the ACL.
B. Pang, L. Lee, S. Vaithyanathan. 2002. Thumbs
up? Sentiment classification using machine learning
techniques. In Proceedings of EMNLP, 79–86.
S. Purpura, D. Hillard. 2006. Automated classifica-
tion of congressional legislation. In Proceedings of
Digital Government Research (dg.o).
W. Sack. 1994. On the computation of point of view.
In Proceedings of AAAI, pg. 1488. Student abstract.
S. Shulman, D. Schlosberg. 2002. Electronic rulemak-
ing: New frontiers in public participation. Prepared
for the Annual Meeting of the American Political
Science Association.
S. Shulman, J. Callan, E. Hovy, S. Zavestoski. 2005.
Language processing technologies for electronic
rulemaking: A project highlight. In Proceedings of
Digital Government Research (dg.o), 87–88.
S. S. Smith, J. M. Roberts, R. J. Vander Wielen. 2005.
The American Congress. Cambridge University
Press, fourth edition.
A. Stolcke, N. Coccaro, R. Bates, P. Taylor, C. Van Ess-
Dykema, K. Ries, E. Shriberg, D. Jurafsky, R. Mar-
tin, M. Meteer. 2000. Dialogue act modeling for
automatic tagging and recognition of conversational
speech. Computational Linguistics, 26(3):339–373.
B. Taskar, P. Abbeel, D. Koller. 2002. Discriminative
probabilistic models for relational data. In Proceed-
ings of UAI, Edmonton, Canada.
B. Taskar, C. Guestrin, D. Koller. 2003. Max-margin
Markov networks. In Proceedings of NIPS.
B. Taskar, V. Chatalbashev, D. Koller. 2004. Learn-
ing associative Markov networks. In Proceedings of
ICML.
S. Teufel, M. Moens. 2002. Summarizing scientific
articles: Experiments with relevance and rhetorical
status. Computational Linguistics, 28(4):409–445.
P. Turney. 2002. Thumbs up or thumbs down? Seman-
tic orientation applied to unsupervised classification
of reviews. In Proceedings of the ACL, 417–424.
J. M. Wiebe, W. J. Rapaport. 1988. A computational
theory of perspective and reference in narrative. In
Proceedings of the ACL, 131–138.
J. M. Wiebe. 1994. Tracking point of view in narrative.
Computational Linguistics, 20(2):233–287.
H. Yang, J. Callan. 2005. Near-duplicate detection
for eRulemaking. In Proceedings of Digital Gov-
ernment Research (dg.o).
J. Zhu. 2005. Semi-supervised learning literature
survey. Computer Sciences Technical Report TR
1530, University of Wisconsin-Madison. Available
at http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf;
has been updated since the initial 2005 version.
