Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, pages 117–125,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Classification of Discourse Coherence Relations: An Exploratory Study
using Multiple Knowledge Sources
Ben Wellnera0a2a1 , James Pustejovskya0 , Catherine Havasia0 ,
Anna Rumshiskya0 and Roser Saur´ıa0
a0 Department of Computer Science
Brandeis University
Waltham, MA USA
a1 The MITRE Corporation
202 Burlington Road
Bedford, MA USA
a3 wellner,jamesp,havasi,arum,roser
a4 @cs.brandeis.edu
Abstract
In this paper we consider the problem of
identifying and classifying discourse co-
herence relations. We report initial re-
sults over the recently released Discourse
GraphBank (Wolf and Gibson, 2005). Our
approach considers, and determines the
contributions of, a variety of syntactic and
lexico-semantic features. We achieve 81%
accuracy on the task of discourse relation
type classification and 70% accuracy on
relation identification.
1 Introduction
The area of modeling discourse has arguably seen
less success than other areas in NLP. Contribut-
ing to this is the fact that no consensus has been
reached on the inventory of discourse relations
nor on the types of formal restrictions placed on
discourse structure. Furthermore, modeling dis-
course structure requires access to considerable
prior linguistic analysis including syntax, lexical
and compositional semantics, as well as the res-
olution of entity and event-level anaphora, all of
which are non-trivial problems themselves.
Discourse processing has been used in many
text processing applications, most notably text
summarization and compression, text generation,
and dialogue understanding. However, it is also
important for general text understanding, includ-
ing applications such as information extraction
and question answering.
Recently, Wolf and Gibson (2005) have pro-
posed a graph-based approach to representing in-
formational discourse relations.1 They demon-
strate that tree representations are inadequate for
1The relations they define roughly follow Hobbs (1985).
modeling coherence relations, and show that many
discourse segments have multiple parents (incom-
ing directed relations) and many of the relations
introduce crossing dependencies – both of which
preclude tree representations. Their annotation of
135 articles has been released as the GraphBank
corpus.
In this paper, we provide initial results for the
following tasks: (1) automatically classifying the
type of discourse coherence relation; and (2) iden-
tifying whether any discourse relation exists on
two text segments. The experiments we report
are based on the annotated data in the Discourse
GraphBank, where we assume that the discourse
units have already been identified.
In contrast to a highly structured, compositional
approach to discourse parsing, we explore a sim-
ple, flat, feature-based methodology. Such an ap-
proach has the advantage of easily accommodat-
ing many knowledge sources. This type of de-
tailed feature analysis can serve to inform or aug-
ment more structured, compositional approaches
to discourse such as those based on Segmented
Discourse Representation Theory (SDRT) (Asher
and Lascarides, 2003) or the approach taken with
the D-LTAG system (Forbes et al., 2001).
Using a comprehensive set of linguistic fea-
tures as input to a Maximum Entropy classifier,
we achieve 81% accuracy on classifying the cor-
rect type of discourse coherence relation between
two segments.
2 Previous Work
In the past few years, the tasks of discourse seg-
mentation and parsing have been tackled from
different perspectives and within different frame-
works. Within Rhetorical Structure Theory (RST),
Soricut and Marcu (2003) have developed two
117
probabilistica5 models for identifying clausal ele-
mentary discourse units and generating discourse
trees at the sentence level. These are built using
lexical and syntactic information obtained from
mapping the discourse-annotated sentences in the
RST Corpus (Carlson et al., 2003) to their corre-
sponding syntactic trees in the Penn Treebank.
Within SDRT, Baldridge and Lascarides
(2005b) also take a data-driven approach to
the tasks of segmentation and identification of
discourse relations. They create a probabilistic
discourse parser based on dialogues from the Red-
woods Treebank, annotated with SDRT rhetorical
relations (Baldridge and Lascarides, 2005a). The
parser is grounded on headed tree representations
and dialogue-based features, such as turn-taking
and domain specific goals.
In the Penn Discourse TreeBank (PDTB) (Web-
ber et al., 2005), the identification of discourse
structure is approached independently of any lin-
guistic theory by using discourse connectives
rather than abstract rhetorical relations. PDTB
assumes that connectives are binary discourse-
level predicates conveying a semantic relationship
between two abstract object-denoting arguments.
The set of semantic relationships can be estab-
lished at different levels of granularity, depend-
ing on the application. Miltsakaki, et al. (2005)
propose a first step at disambiguating the sense of
a small subset of connectives (since, while, and
when) at the paragraph level. They aim at distin-
guishing between the temporal, causal, and con-
trastive use of the connective, by means of syntac-
tic features derived from the Penn Treebank and a
MaxEnt model.
3 GraphBank
3.1 Coherence Relations
For annotating the discourse relations in text, Wolf
and Gibson (2005) assume a clause-unit-based
definition of a discourse segment. They define
four broad classes of coherence relations:
(1) 1. Resemblance: similarity (par), con-
trast (contr), example (examp), generaliza-
tion (gen), elaboration (elab);
2. Cause-effect: explanation (ce), violated
expectation (expv), condition (cond);
3. Temporal (temp): essentially narration;
4. Attribution (attr): reporting and evidential
contexts.
The textual evidence contributing to identifying
the various resemblance relations is heterogeneous
at best, where, for example, similarity and contrast
are associated with specific syntactic constructions
and devices. For each relation type, there are well-
known lexical and phrasal cues:
(2) a. similarity: and;
b. contrast: by contrast, but;
c. example: for example;
d. elaboration: also, furthermore, in addi-
tion, note that;
e. generalization: in general.
However, just as often, the relation is encoded
through lexical coherence, via semantic associa-
tion, sub/supertyping, and accommodation strate-
gies (Asher and Lascarides, 2003).
The cause-effect relations include conventional
causation and explanation relations (captured as
the label ce), such as (3) below:
(3) cause: SEG1: crash-landed in New Hope,
Ga.,
effect: SEG2: and injuring 23 others.
It also includes conditionals and violated expecta-
tions, such as (4).
(4) cause: SEG1: an Eastern Airlines Lockheed
L-1011 en route from Miami to the Bahamas
lost all three of its engines,
effect: SEG2: and land safely back in Miami.
The two last coherence relations annotated in
GraphBank are temporal (temp) and attribution
(attr) relations. The first corresponds generally to
the occasion (Hobbs, 1985) or narration (Asher
and Lascarides, 2003) relation, while the latter is
a general annotation over attribution of source.2
3.2 Discussion
The difficulty of annotating coherence relations
consistently has been previously discussed in the
literature. In GraphBank, as in any corpus, there
are inconsistencies that must be accommodated
for learning purposes. As perhaps expected, an-
notation of attribution and temporal sequence rela-
tions was consistent if not entirely complete. The
most serious concern we had from working with
2There is one non-rhetorical relation, same, which identi-
fies discontiguous segments.
118
thea6 corpus derives from the conflation of diverse
and semantically contradictory relations among
the cause-effect annotations. For canonical cau-
sation pairs (and their violations) such as those
above, (3) and (4), the annotation was expectedly
consistent and semantically appropriate. Problems
arise, however when examining the treatment of
purpose clauses and rationale clauses. These are
annotated, according to the guidelines, as cause-
effect pairings. Consider (5) below.
(5) cause: SEG1: to upgrade lab equipment in
1987.
effect: SEG2: The university spent $ 30,000
This is both counter-intuitive and temporally false.
The rationale clause is annotated as the cause, and
the matrix sentence as the effect. Things are even
worse with purpose clause annotation. Consider
the following example discourse:3
(6) John pushed the door to open it, but it was
locked.
This would have the following annotation in
GraphBank:
(7) cause: to open it
effect: John pushed the door.
The guideline reflects the appropriate intuition
that the intention expressed in the purpose or ra-
tionale clause must precede the implementation of
the action carried out in the matrix sentence. In
effect, this would be something like
(8) [INTENTION TO SEG1] CAUSES SEG2
The problem here is that the cause-effect re-
lation conflates real event-causation with telos-
directed explanations, that is, action directed to-
wards a goal by virtue of an intention. Given that
these are semantically disjoint relations, which
are furthermore triggered by distinct grammatical
constructions, we believe this conflation should be
undone and characterized as two separate coher-
ence relations. If the relations just discussed were
annotated as telic-causation, the features encoded
for subsequent training of a machine learning al-
gorithm could benefit from distinct syntactic envi-
ronments. We would like to automatically gen-
erate temporal orderings from cause-effect rela-
tions from the events directly annotated in the text.
3This specific example was brought to our attention by
Alex Lascarides (p.c).
Splitting these classes would preserve the sound-
ness of such a procedure, while keeping them
lumped generates inconsistencies.
4 Data Preparation and Knowledge
Sources
In this section we describe the various linguistic
processing components used for classification and
identification of GraphBank discourse relations.
4.1 Pre-Processing
We performed tokenization, sentence tagging,
part-of-speech tagging, and shallow syntactic
parsing (chunking) over the 135 GraphBank docu-
ments. Part-of-speech tagging and shallow parsing
were carried out using the Carafe implementation
of Conditional Random Fields for NLP (Wellner
and Vilain, 2006) trained on various standard cor-
pora. In addition, full sentence parses were ob-
tained using the RASP parser (Briscoe and Car-
roll, 2002). Grammatical relations derived from
a single top-ranked tree for each sentence (head-
word, modifier, and relation type) were used for
feature construction.
4.2 Modal Parsing and Temporal Ordering
of Events
We performed both modal parsing and tempo-
ral parsing over events. Identification of events
was performed using EvITA (Saur´ı et al., 2006),
an open-domain event tagger developed under the
TARSQI research framework (Verhagen et al.,
2005). EvITA locates and tags all event-referring
expressions in the input text that can be tempo-
rally ordered. In addition, it identifies those gram-
matical features implicated in temporal and modal
information of events; namely, tense, aspect, po-
larity, modality, as well as the event class. Event
annotation follows version 1.2.1 of the TimeML
specifications.4
Modal parsing in the form of identifying sub-
ordinating verb relations and their type was per-
formed using SlinkET (Saur´ı et al., 2006), an-
other component of the TARSQI framework. Slin-
kET identifies subordination constructions intro-
ducing modality information in text; essentially,
infinitival and that-clauses embedded by factive
predicates (regret), reporting predicates (say), and
predicates referring to events of attempting (try),
volition (want), command (order), among others.
4See http://www.timeml.org.
119
SlinkET annotates these subordination contexts
and classifies them according to the modality in-
formation introduced by the relation between the
embedding and embedded predicates, which can
be of any of the following types:
a7 factive: The embedded event is presupposed
or entailed as true (e.g., John managed to
leave the party).
a7 counter-factive: The embedded event is pre-
supposed as entailed as false (e.g., John was
unable to leave the party).
a7 evidential: The subordination is introduced
by a reporting or perception event (e.g., Mary
saw/told that John left the party).
a7 negative evidential: The subordination is a
reporting event conveying negative polarity
(e.g., Mary denied that John left the party).
a7 modal: The subordination creates an inten-
sional context (e.g., John wanted to leave the
party).
Temporal orderings between events were iden-
tified using a Maximum Entropy classifier trained
on the TimeBank 1.2 and Opinion 1.0a corpora.
These corpora provide annotated events along
with temporal links between events. The link
types included: before (a8a10a9 occurs before a8a12a11 ) , in-
cludes (a8 a11 occurs sometime during a8 a9 ), simultane-
ous (a8a13a9 occurs over the same interval as a8a14a11 ), begins
(a8 a9 begins at the same time as a8 a11 ), ends (a8 a9 ends at
the same time as a8 a11 ).
4.3 Lexical Semantic Typing and Coherence
Lexical semantic types as well as a measure of
lexical similarity or coherence between words in
two discourse segments would appear to be use-
ful for assigning an appropriate discourse rela-
tionship. Resemblance relations, in particular, re-
quire similar entities to be involved and lexical
similarity here serves as an approximation to defi-
nite nominal coreference. Identification of lexical
relationships between words across segments ap-
pears especially useful for cause-effect relations.
In example (3) above, determining a (potential)
cause-effect relationship between crash and injury
is necessary to identify the discourse relation.
4.3.1 Corpus-based Lexical Similarity
Lexical similarity was computed using the
Word Sketch Engine (WSE) (Killgarrif et al.,
2004) similarity metric applied over British Na-
tional Corpus. The WSE similarity metric imple-
ments the word similarity measure based on gram-
matical relations as defined in (Lin, 1998) with mi-
nor modifications.
4.3.2 The Brandeis Semantic Ontology
As a second source of lexical coherence, we
used the Brandeis Semantic Ontology or BSO
(Pustejovsky et al., 2006). The BSO is a lexically-
based ontology in the Generative Lexicon tradi-
tion (Pustejovsky, 2001; Pustejovsky, 1995). It fo-
cuses on contextualizing the meanings of words
and does this by a rich system of types and qualia
structures. For example, if one were to look up the
phrase RED WINE in the BSO, one would find its
type is WINE and its type’s type is ALCOHOLIC
BEVERAGE. The BSO contains ontological qualia
information (shown below). Using the BSO, one
a15a16
a16
a16
a16
a16
a17
wine
CONSTITUTIVE a18 Alcohol
HAS ELEMENT a18 Alcohol
MADE OF a18 Grapes
INDIRECT TELIC a18 drink activity
INDIRECT AGENTIVE a18 make alcoholic beverage
a19a21a20
a20
a20
a20
a20
a22
is able to find out where in the ontological type
system WINE is located, what RED WINE’s lexi-
cal neighbors are, and its full set of part of speech
and grammatical attributes. Other words have a
different configuration of annotated attributes de-
pending on the type of the word.
We used the BSO typing information to seman-
tically tag individual words in order to compute
lexical paths between word pairs. Such lexical as-
sociations are invoked when constructing cause-
effect relations and other implicatures (e.g. be-
tween crash and injure in Example 3).
The type system paths provide a measure of the
connectedness between words. For every pair of
head words in a GraphBank document, the short-
est path between the two words within the BSO
is computed. Currently, this metric only uses the
type system relations (i.e., inheritance) but prelim-
inary tests show that including qualia relations as
connections is promising. We also computed the
earliest common ancestor of the two words. These
metrics are calculated for every possible sense of
the word within the BSO.
120
The use of the BSO is advantageous compared
to other frameworks such as Wordnet because it
focuses on the connection between words and their
semantic relationship to other items. These con-
nections are captured in the qualia information and
the type system. In Wordnet, qualia-like informa-
tion is only present in the glosses, and they do
not provide a definite semantic path between any
two lexical items. Although synonymous in some
ways, synset members often behave differently in
many situations, grammatical or otherwise.
5 Classification Methodology
This section describes in detail how we con-
structed features from the various knowledge
sources described above and how they were en-
coded in a Maximum Entropy model.
5.1 Maximum Entropy Classification
For our experiments of classifying relation types,
we used a Maximum Entropy classifier5 in order
to assign labels to each pair of discourse segments
connected by some relation. For each instance (i.e.
pair of segments) the classifier makes its decision
based on a set of features. Each feature can query
some arbitrary property of the two segments, pos-
sibly taking into account external information or
knowledge sources. For example, a feature could
query whether the two segments are adjacent to
each other, whether one segment contains a dis-
course connective, whether they both share a par-
ticular word, whether a particular syntactic con-
struction or lexical association is present, etc. We
make strong use of this ability to include very
many, highly interdependent features6 in our ex-
periments. Besides binary-valued features, fea-
ture values can be real-valued and thus capture fre-
quencies, similarity values, or other scalar quanti-
ties.
5.2 Feature Classes
We grouped the features together into various
feature classes based roughly on the knowledge
source from which they were derived. Table 1
describes the various feature classes in detail and
provides some actual example features from each
class for the segment pair described in Example 5
in Section 3.2.
5We use the Maximum Entropy classifier included with
Carafe available at http://sourceforge.net/projects/carafe
6The total maximum number of features occurring in our
experiments is roughly 120,000.
6 Experiments and Results
In this section we provide the results of a set of
experiments focused on the task of discourse rela-
tion classification. We also report initial results on
relation identification with the same set of features
as used for classification.
6.1 Discourse Relation Classification
The task of discourse relation classification in-
volves assigning the correct label to a pair of dis-
course segments.7 The pair of segments to assign
a relation to is provided (from the annotated data).
In addition, we assume, for asymmetric links, that
the nucleus and satellite are provided (i.e., the di-
rection of the relation). For the elaboration rela-
tions, we ignored the annotated subtypes (person,
time, location, etc.). Experiments were carried out
on the full set of relation types as well as the sim-
pler set of coarse-grained relation categories de-
scribed in Section 3.1.
The GraphBank contains a total of 8755 an-
notated coherence relations. 8 For all the ex-
periments in this paper, we used 8-fold cross-
validation with 12.5% of the data used for test-
ing and the remainder used for training for each
fold. Accuracy numbers reported are the average
accuracies over the 8 folds. Variance was gener-
ally low with a standard deviation typically in the
range of 1.5 to 2.0. We note here also that the
inter-annotator agreement between the two Graph-
Bank annotators was 94.6% for relations when
they agreed on the presence of a relation. The
majority class baseline (i.e., the accuracy achieved
by calling all relations elaboration) is 45.7% (and
66.57% with the collapsed categories). These are
the upper and lower bounds against which these
results should be based.
To ascertain the utility of each of the various
feature classes, we considered each feature class
independently by using only features from a sin-
gle class in addition to the Proximity feature class
which serve as a baseline. Table 2 illustrates the
result of this experiment.
We performed a second set of experiments
shown in Table 3 that is essentially the converse
of the previous batch. We take the union of all the
7Each segment may in fact consist of a sequence of seg-
ments. We will, however, use the term segment loosely to
refer to segments or segment sequences.
8All documents are doubly annotated; we used the anno-
tator1 annotations.
121
Feature Description Example
Class
C Words appearing at beginning and end of the two discourse seg-
ments - these are often important discourse cue words.
first1-is-to; first2-is-The
P Proximity and direction between the two segments (in terms of
segments) - binary features such as distance less than 3, distance
greater than 10 were used in addition to the distance value itself;
the distance from beginning of the document using a similar bin-
ning approach
adjacent; dist-less-than-3; dist-less-
than-5; direction-reverse; samesentence
BSO Paths in the BSO up to length 10 between non-function words in the
two segments.
ResearchLab a23 EducationalActivity
a23 University
WSE WSE word-pair similarities between words in the two segments
were binned as (a24 0.05, a24 0.1, a24 0.2). We also computed sen-
tence similarity as the sum of the word similarities divided by the
sum of their sentence lengths.
WSE-greater-than-0.05; WSE-
sentence-sim = 0.005417
E Event head words and event head word pairs between segments as
identified by EvITA.
event1-is-upgrade; event2-is-spent;
event-pair-upgrade-spent
SlinkET Event attributes, subordinating links and their types between event
pairs in the two segments
seg1-class-is-occurrence; seg2-class-
is-occurrence; seg1-tense-is-infinitive;
seg2-tense-is-past; seg2-modal-seg1
C-E Cuewords of one segment paired with events in the other. first1-is-to-event2-is-spent; first2-is-
The-event1-is-upgrade
Syntax Grammatical dependency relations between two segments as iden-
tified by the RASP parser. We also conjoined the relation with one
or both of the headwords associated with the grammatical relation.
gr-ncmod; gr-ncmod-head1-equipment;
gr-ncmod-head-2-spent; etc.
Tlink Temporal links between events in the two segments. We included
both the link types and the number of occurrences of those types
between the segments
seg2-before-seg1
Table 1: Feature classes, their descriptions and example feature instances for Example 5 in Section 3.2.
Feature Class Accuracy Coarse-grained Acc.
Proximity 60.08% 69.43%
P+C 76.77% 83.50%
P+BSO 62.92% 74.40%
P+WSE 62.20% 70.10%
P+E 63.84% 78.16%
P+SlinkET 69.00% 75.91%
P+CE 67.18% 78.63%
P+Syntax 70.30% 80.84%
P+Tlink 64.19% 72.30%
Table 2: Classification accuracy over standard and
coarse-grained relation types with each feature
class added to Proximity feature class.
feature classes and perform ablation experiments
by removing one feature class at a time.
Feature Class Accuracy Coarse-grain Acc.
All Features 81.06% 87.51%
All-P 71.52% 84.88%
All-C 75.71% 84.69%
All-BSO 80.65% 87.04%
All-WSE 80.26% 87.14%
All-E 80.90% 86.92%
All-SlinkET 79.68% 86.89%
All-CE 80.41% 87.14%
All-Syntax 80.20% 86.89%
All-Tlink 80.30% 87.36%
Table 3: Classification accuracy with each fea-
ture class removed from the union of all feature
classes.
6.2 Analysis
From the ablation results, it is clear that overall
performance is most impacted by the cue-word
features (C) and proximity (P). Syntax and Slin-
kET also have high impact improving accuracy by
roughly 10 and 9 percent respectively as shown
in Table 2. From the ablation results in Table 3,
it is clear that the utility of most of the individ-
ual features classes is lessened when all the other
feature classes are taken into account. This indi-
cates that multiple feature classes are responsible
for providing evidence any given discourse rela-
tions. Removing a single feature class degrades
performance, but only slightly, as the others can
compensate.
Overall precision, recall and F-measure results
for each of the different link types using the set
of all feature classes are shown in Table 4 with the
corresponding confusion matrix in Table A.1. Per-
formance correlates roughly with the frequency of
the various relation types. We might therefore ex-
pect some improvement in performance with more
annotated data for those relations with low fre-
quency in the GraphBank.
122
Relation Precision Recall F-measure Count
elab 88.72 95.31 91.90 512
attr 91.14 95.10 93.09 184
par 71.89 83.33 77.19 132
same 87.09 75.00 80.60 72
ce 78.78 41.26 54.16 63
contr 65.51 66.67 66.08 57
examp 78.94 48.39 60.00 31
temp 50.00 20.83 29.41 24
expv 33.33 16.67 22.22 12
cond 45.45 62.50 52.63 8
gen 0.0 0.0 0.0 0
Table 4: Precision, Recall and F-measure results.
6.3 Coherence Relation Identification
The task of identifying the presence of a rela-
tion is complicated by the fact that we must con-
sider all a25a27a26
a11a29a28
potential relations where a30 is the
number of segments. This presents a trouble-
some, highly-skewed binary classification prob-
lem with a high proportion of negative instances.
Furthermore, some of the relations, particularly
the resemblance relations, are transitive in na-
ture (e.g. a31a33a32a35a34a13a32a37a36a38a36a39a8a12a36a41a40a39a42a12a43a45a44a46a42a48a47a50a49a52a51a53a31a54a32a55a34a10a32a37a36a56a36a39a8a14a36a57a40a39a42a58a47a13a44a46a42a60a59a13a49a62a61
a31a33a32a35a34a10a32a35a36a38a36a39a8a12a36a41a40a39a42a14a43a45a44a46a42a60a59a50a49 ). However, these transitive links
are not provided in the GraphBank annotation -
such segment pairs will therefore be presented in-
correctly as negative instances to the learner, mak-
ing this approach infeasible. An initial experiment
considering all segment pairs, in fact, resulted in
performance only slightly above the majority class
baseline.
Instead, we consider the task of identifying the
presence of discourse relations between segments
within the same sentence. Using the same set of
all features used for relation classification, perfor-
mance is at 70.04% accuracy. Simultaneous iden-
tification and classification resulted in an accuracy
of 64.53%. For both tasks the baseline accuracy
was 58%.
6.4 Modeling Inter-relation Dependencies
Casting the problem as a standard classification
problem where each instance is classified inde-
pendently, as we have done, is a potential draw-
back. In order to gain insight into how collec-
tive, dependent modeling might help, we intro-
duced additional features that model such depen-
dencies: For a pair of discourse segments, a42a12a43 and
a42a58a47 , to classify the relation between, we included
features based on the other relations involved with
the two segments (from the gold standard annota-
tions): a63a14a64a65a40a39a42 a43 a44a46a42a14a59a10a49a67a66a68a70a69a71a73a72a75a74 and a63a14a64a65a40a39a42 a47 a44a46a42a14a76a27a49a67a66a36a77a69a71a79a78a41a74 .
Adding these features improved classification ac-
curacy to 82.3%. This improvement is fairly sig-
nificant (a 6.3% reduction in error) given that this
dependency information is only encoded weakly
as features and not in the form of model con-
straints.
7 Discussion and Future Work
We view the accuracy of 81% on coherence rela-
tion classification as a positive result, though room
for improvement clearly remains. An examination
of the errors indicates that many of the remain-
ing problems require making complex lexical as-
sociations, the establishment of entity and event
anaphoric links and, in some cases, the exploita-
tion of complex world-knowledge. While impor-
tant lexical connections can be gleaned from the
BSO, we hypothesize that the current lack of word
sense disambiguation serves to lessen its utility
since lexical paths between all word sense of two
words are currently used. Additional feature engi-
neering, particularly the crafting of more specific
conjunctions of existing features is another avenue
to explore further - as are automatic feature selec-
tion methods.
Different types of relations clearly benefit from
different feature types. For example, resemblance
relations require similar entities and/or events, in-
dicating a need for robust anaphora resolution,
while cause-effect class relations require richer
lexical and world knowledge. One promising ap-
proach is a pipeline where an initial classifier as-
signs a coarse-grained category, followed by sepa-
rately engineered classifiers designed to model the
finer-grained distinctions.
An important area of future work involves in-
corporating additional structure in two places.
First, as the experiment discussed in Section 6.4
shows, classifying discourse relations collectively
shows potential for improved performance. Sec-
ondly, we believe that the tasks of: 1) identify-
ing which segments are related and 2) identify-
ing the discourse segments themselves are prob-
ably best approached by a parsing model of dis-
course. This view is broadly sympathetic with the
approach in (Miltsakaki et al., 2005).
We furthermore believe an extension to the
GraphBank annotation scheme, with some minor
changes as we advocate in Section 3.2, layered on
top of the PDTB would, in our view, serve as an
interesting resource and model for informational
123
discourse.a80
Acknowledgments
This work was supported in part by ARDA/DTO
under grant number NBCHC040027 and MITRE
Sponsored Research. Catherine Havasi is sup-
ported by NSF Fellowship # 2003014891.

References
N. Asher and A. Lascarides. 2003. Logics of Con-
versation. Cambridge University Press, Cambridge,
England.
J. Baldridge and A. Lascarides. 2005a. Annotating
discourse structures for robust semantic interpreta-
tion. In Proceedings of the Sixth International Work-
shop on Computational Semantics, Tilburg, The
Netherlands.
J. Baldridge and A. Lascarides. 2005b. Probabilistic
head-driven parsing for discourse structure. In Pro-
ceedings of the Ninth Conference on Computational
Natural Language Learning, Ann Arbor, USA.
T. Briscoe and J. Carroll. 2002. Robust accurate sta-
tistical annotation of general text. Proceedings of
the Third International Conference on Language Re-
sources and Evaluation (LREC 2002), Las Palmas,
Canary Islands, May 2002, pages 1499–1504.
L. Carlson, D. Marcu, and M. E. Okurowski. 2003.
Building a discourse-tagged corpus in the frame-
work of rhetorical structure theory. In Janvan Kup-
pelvelt and Ronnie Smith, editors, Current Direc-
tions in Discourse and Dialogue. Kluwer Academic
Publishers.
K. Forbes, E. Miltsakaki, R. Prasad, A. Sakar, A. Joshi,
and B. Webber. 2001. D-LTAG system: Discourse
parsing with a lexicalized tree adjoining grammar.
In Proceedings of the ESSLLI 2001: Workshop on
Information Structure, Discourse Structure and Dis-
course Semantics.
J. Hobbs. 1985. On the coherence and structure of dis-
course. In CSLI Technical Report 85-37, Stanford,
CA, USA. Center for the Study of Language and In-
formation.
A. Killgarrif, P. Rychly, P. Smrz, and D. Tugwell.
2004. The sketch engine. In Proceedings of Eu-
ralex, Lorient, France, pages 105–116.
D. Lin. 1998. Automatic retrieval and clustering of
similar words. In Proceedings of COLING-ACL,
Montreal, Canada.
E. Miltsakaki, N. Dinesh, R. Prasad, A. Joshi, and
B. Webber. 2005. Experiments on sense anno-
tation and sense disambiguation of discourse con-
nectives. In Proceedings of the Fourth Workshop
on Treebanks and Linguistic Theories (TLT 2005),
Barcelona, Catalonia.
J. Pustejovsky, C. Havasi, R. Saur´i, P. Hanks, and
A. Rumshisky. 2006. Towards a Generative Lexical
resource: The Brandeis Semantic Ontology. In Lan-
guage Resources and Evaluation Conference, LREC
2006, Genoa, Italy.
J. Pustejovsky. 1995. The Generative Lexicon. MIT
Press, Cambridge, MA.
J. Pustejovsky. 2001. Type construction and the logic
of concepts. In The Language of Word Meaning.
Cambridge University Press.
R. Saur´ı, M. Verhagen, and J. Pustejovsky. 2006. An-
notating and recognizing event modality in text. In
The 19th International FLAIRS Conference, FLAIRS
2006, Melbourne Beach, Florida, USA.
R. Soricut and D. Marcu. 2003. Sentence level dis-
course parsing using syntactic and lexical informa-
tion. In Proceedings of the HLT/NAACL Confer-
ence, Edmonton, Canada.
M. Verhagen, I. Mani, R. Saur´ı, R. Knippen, J. Littman,
and J. Pustejovsky. 2005. Automating temporal an-
notation within TARSQI. In Proceedings of the ACL
2005.
B. Webber, A. Joshi, E. Miltsakaki, R. Prasad, N. Di-
nesh, A. Lee, and K. Forbes. 2005. A short intro-
duction to the penn discourse TreeBank. In Copen-
hagen Working Papers in Language and Speech Pro-
cessing.
B. Wellner and M. Vilain. 2006. Leveraging ma-
chine readable dictionaries in discriminative se-
quence models. In Language Resources and Eval-
uation Conference, LREC 2006, Genoa, Italy.
F. Wolf and E. Gibson. 2005. Representing dis-
course coherence: A corpus-based analysis. Com-
putational Linguistics, 31(2):249–287.
