Sentence Level Discourse Parsing using Syntactic and Lexical Information
Radu Soricut and Daniel Marcu
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
{radu,marcu}@isi.edu
Abstract
We introduce two probabilistic models that can
be used to identify elementary discourse units
and build sentence-level discourse parse trees.
The models use syntactic and lexical features.
A discourse parsing algorithm that implements
these models derives discourse parse trees with
an error reduction of 18.8% over a state-of-
the-art decision-based discourse parser. A set
of empirical evaluations shows that our dis-
course parsing model is sophisticated enough
to yield discourse trees at an accuracy level that
matches near-human levels of performance.
1 Introduction
By exploiting information encoded in human-produced
syntactic trees (Marcus et al., 1993), research on prob-
abilistic models of syntax has driven the performance of
syntactic parsers to about 90% accuracy (Charniak, 2000;
Collins, 2000). The absence of semantic and discourse
annotated corpora prevented similar developments in se-
mantic/discourse parsing. Fortunately, recent annotation
projects have taken significant steps towards developing
semantic (Fillmore et al., 2002; Kingsbury and Palmer,
2002) and discourse (Carlson et al., 2003) annotated cor-
pora. Some of these annotation efforts have already had
a computational impact. For example, Gildea and Juraf-
sky (2002) developed statistical models for automatically
inducing semantic roles. In this paper, we describe proba-
bilistic models and algorithms that exploit the discourse-
annotated corpus produced by Carlson et al. (2003).
A discourse structure is a tree whose leaves correspond
to elementary discourse units (edus), and whose internal
nodes correspond to contiguous text spans (called dis-
course spans). An example of a discourse structure is
the tree given in Figure 1. Each internal node in a dis-
course tree is characterized by a rhetorical relation, such
[Figure 1: Discourse structure of a sentence. The example sentence "The bank also says it will use its network to channel investments." is split into edus 1-3; an ENABLEMENT relation links edus 2 and 3 into span [2,3], and an ATTRIBUTION relation links edu 1 to span [2,3].]
as ATTRIBUTION and ENABLEMENT. Within a rhetorical re-
lation a discourse span is also labeled as either NUCLEUS
or SATELLITE. The distinction between nuclei and satel-
lites comes from the empirical observation that a nucleus
expresses what is more essential to the writer’s purpose
than a satellite. Discourse trees can be represented graph-
ically in the style shown in Figure 1. The arrows link the
satellite to the nucleus of a rhetorical relation. Arrows are
labeled with the name of the rhetorical relation that holds
between the linked units. Horizontal lines correspond to
text spans, and vertical lines identify text spans which are
nuclei.
In this paper, we introduce two probabilistic models
that can be used to identify elementary discourse units
and build sentence-level discourse parse trees. We show
how syntactic and lexical information can be exploited in
the process of identifying elementary units of discourse
and building sentence-level discourse trees. Our evalu-
ation indicates that the discourse parsing model we pro-
pose is sophisticated enough to achieve near-human lev-
els of performance on the task of deriving sentence-level
discourse trees, when working with human-produced
syntactic trees and discourse segments.
2 The Corpus
For the experiments described in this paper, we use a pub-
licly available corpus (RST-DT, 2002) that contains 385
Wall Street Journal articles from the Penn Treebank. The
corpus comes conveniently partitioned into a Training set
of 347 articles (6132 sentences) and a Test set of 38 ar-
ticles (991 sentences). Each document in the corpus is
paired with a discourse structure (tree) that was manually
built in the style of Rhetorical Structure Theory (Mann
and Thompson, 1988). (See (Carlson et al., 2003) for de-
tails concerning the corpus and the annotation process.)
Out of the 385 articles in the corpus, 53 have been inde-
pendently annotated by two human annotators. We used
this doubly-annotated subset to compute human agree-
ment on the task of discourse structure derivation. In our
experiments we used as discourse structures only the dis-
course sub-trees spanning over individual sentences.
Because the discourse structures had been built on top
of sentences already associated with syntactic trees from
the Penn Treebank, we were able to create a composite
corpus which allowed us to perform an empirically driven
syntax-discourse relationship study. This composite cor-
pus was created by associating each sentence $s$ in the discourse corpus with its corresponding Penn Treebank syntactic parse tree $syntacticTree(s)$ and its corresponding sentence-level discourse tree $discourseTree(s)$. Al-
though human annotators were free to build their dis-
course structures without enforcing the existence of well-
formed discourse sub-trees for each sentence, in about
95% of the cases in the (RST-DT, 2002) corpus, there
exists a discourse sub-tree $discourseTree(s)$ associated with each sentence $s$. The remaining 5% of the sentences
cannot be used in our approach, as no well-formed dis-
course tree can be associated with these sentences.
Therefore, our Training section consists of a set of
5809 triples of the form $\langle s, syntacticTree(s), discourseTree(s) \rangle$,
which are used to train the parameters of the statistical
models. Our Test section consists of a set of 946 triples
of a similar form, which are used to evaluate the perfor-
mance of our discourse parser.
The (RST-DT, 2002) corpus uses 110 different rhetori-
cal relations. We found it useful to also compact these re-
lations into classes, as described by Carlson et al. (2003),
and operate with the resulting 18 labels as well (seen as
coarser granularity rhetorical relations). Operating with
different levels of granularity allows one to get deeper
insight into the difficulties of assigning the appropriate
rhetorical relation, if any, to two adjacent text spans.
3 The Discourse Segmenter
We break down the problem of building sentence-level
discourse trees into two sub-problems: discourse seg-
mentation and discourse parsing. Discourse segmenta-
tion is covered by this section, while discourse parsing is
covered by Section 4.
[Figure 2: Discourse segmentation using lexicalized syntactic trees. The lexicalized parse of the example sentence, segmented into the edus [The bank also says]_1, [it will use its network]_2, and [to channel investments.]_3; the boxed nodes $N_p$ = VP(says), $N_w$ = VBZ(says), and $N_r$ = SBAR(will) signal the discourse boundary after says.]
Discourse segmentation is the process in which a given
text is broken into non-overlapping segments called ele-
mentary discourse units (edus). In the present work, ele-
mentary discourse units are taken to be clauses or clause-
like units that are unequivocally the NUCLEUS or SATEL-
LITE of a rhetorical relation that holds between two adja-
cent spans of text (see (Carlson et al., 2003) for details).
Our approach to discourse segmentation breaks the prob-
lem further into two sub-problems: sentence segmen-
tation and sentence-level discourse segmentation. The
problem of sentence segmentation has been studied ex-
tensively, and tools such as those described by Palmer
and Hearst (1997) and Ratnaparkhi (1998) can handle it
well. In this section, we present a discourse segmenta-
tion algorithm that deals with segmenting sentences into
elementary discourse units.
3.1 The Discourse Segmentation Model
The discourse segmenter proposed here takes as input a
sentence and outputs its elementary discourse unit bound-
aries. Our statistical approach to sentence segmentation
uses two components: a statistical model which assigns
a probability to the insertion of a discourse boundary af-
ter each word in a sentence, and a segmenter, which uses
the probabilities computed by the model for inserting dis-
course boundaries. We first focus on the statistical model.
A good model of discourse segmentation needs to ac-
count both for local interactions at the word level and
for global interactions at more abstract levels. Consider,
for example, the syntactic tree in Figure 2. According
to our hypothesis, the discourse boundary inserted be-
tween the words says and it is best explained not by
the words alone, but by the lexicalized syntactic structure
[VP(says) [VBZ(says) $\dagger$ SBAR(will)]], sig-
naled by the boxed nodes in Figure 2. Hence, we hy-
pothesize that the discourse boundary in our example is
best explained by the global interaction between the verb
(the act of saying) and its clausal complement (what is
being said).
[Figure 3: The same syntactic information indicates discourse boundaries depending on the lexical heads involved: a boundary under VP(passed) → VBN(passed) PP(without), but not under VP(priced) → VBN(priced) PP(at).]
Given a sentence $s = w_1 w_2 \ldots w_i \ldots w_n$, we first find the syntactic parse tree $T$ of $s$. We used in our experiments both syntactic parse trees obtained using Charniak's parser (2000) and syntactic parse trees from the Penn Treebank. Our statistical model assigns a segmenting probability $P(b_i \mid w_i, T)$ for each word $w_i$, where $b_i \in \{$boundary, no-boundary$\}$. Because our model is concerned with discourse segmentation at sentence level, we define $P(\mathrm{boundary} \mid w_n, T) = 1$, i.e., the sentence boundary is always a discourse boundary as well.
Our model uses both lexical and syntactic features
for determining the probability of inserting discourse
boundaries. We apply canonical lexical head projection
rules (Magerman, 1995) in order to lexicalize syntactic
trees. For each word $w$, the upper-most node with lexical head $w$ which has a right sibling node determines the features on the basis of which we decide whether to insert a discourse boundary. We denote such a node $N_w$; the features we use are node $N_w$, its parent $N_p$, and the siblings of $N_w$. In the example in Figure 2, we determine whether to insert a discourse boundary after the word says using as features node $N_p = \mathrm{VP(says)}$ and its children $N_w = \mathrm{VBZ(says)}$ and $N_r = \mathrm{SBAR(will)}$.
We use our corpus to estimate the likelihood of inserting
a discourse boundary between word a45 and the next word
using formula (1),
$$P(b \mid w, T) \approx \frac{cnt(N_p \rightarrow \ldots N_w \dagger N_r \ldots)}{cnt(N_p \rightarrow \ldots N_w N_r \ldots)} \quad (1)$$

where the numerator represents all the counts of the rule $N_p \rightarrow \ldots N_w N_r \ldots$ for which a discourse boundary (marked $\dagger$) has been inserted after word $w$, and the denominator represents all the counts of the rule.
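To make the estimation concrete, here is a minimal Python sketch of the counting behind formula (1); the string encoding of lexicalized rules and all function names are ours, for illustration only:

```python
from collections import defaultdict

# Counts over lexicalized rules, e.g. "VP(says) -> VBZ(says) SBAR(will)".
rule_counts = defaultdict(int)       # cnt(Np -> ... Nw Nr ...), all occurrences
boundary_counts = defaultdict(int)   # occurrences with a boundary after Nw

def observe(rule, has_boundary):
    """Record one occurrence of a lexicalized rule from the training corpus."""
    rule_counts[rule] += 1
    if has_boundary:
        boundary_counts[rule] += 1

def p_boundary(rule):
    """Maximum-likelihood estimate of formula (1); the paper additionally
    interpolates with less lexicalized variants of the rule for smoothing."""
    if rule_counts[rule] == 0:
        return 0.0
    return boundary_counts[rule] / rule_counts[rule]
```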
Because we want to account for boundaries that are
motivated lexically as well, the counts used in formula (1)
are defined over lexicalized rules. Without lexicalization,
the syntactic context alone is too general and fails to dis-
tinguish genuine cases of discourse boundaries from in-
correct ones. As can be seen in Figure 3, the same syn-
tactic context may indicate a discourse boundary when
the lexical heads passed and without are present, but
it may not indicate a boundary when the lexical heads
priced and at are present.
The discourse segmentation model uses the corpus pre-
sented in Section 2 in order to estimate probabilities for
inserting discourse boundaries using equation (1). We
also use a simple interpolation method for smoothing lex-
icalized rules to accommodate data sparseness.
Once we have the segmenting probabilities given by
the statistical model, a straightforward algorithm is used
to implement the segmenter. Given a syntactic tree $T$, the algorithm inserts a boundary after each word $w$ for which $P(\mathrm{boundary} \mid w, T) > 0.5$.
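Given these probabilities, the segmenter itself is a single thresholding pass. A sketch, assuming the per-word rules have already been extracted from the lexicalized tree (the argument layout is hypothetical):

```python
def segment(words, rules, p_boundary, threshold=0.5):
    """words: the sentence; rules[i]: the lexicalized rule Np -> ... Nw Nr ...
    anchored at word i, or None if word i heads no node with a right sibling;
    p_boundary: the model estimated from formula (1)."""
    boundaries = []
    for i in range(len(words)):
        if rules[i] is not None and p_boundary(rules[i]) > threshold:
            boundaries.append(i)   # discourse boundary inserted after word i
    return boundaries
```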
4 The Discourse Parser
In the setting presented here, the input to the discourse
parser is a Discourse Segmented Lexicalized Syntactic
Tree (i.e., a lexicalized syntactic parse tree in which the
discourse boundaries have been identified), henceforth called a DS-LST. An example of a DS-LST is the tree in Figure 2. The output of the discourse parser is a dis-
course parse tree, such as the one presented in Figure 1.
As in other statistical approaches, we identify two
components that perform the discourse parsing task. The
first component is the parsing model, which assigns a
probability to every potential candidate parse tree. For-
mally, given a discourse tree $DT$ and a set of parameters $\Theta$, the parsing model estimates the conditional probability $P(DT \mid \Theta)$. The most likely parse is then given by formula (2).

$$DT_{best} = \arg\max_{DT} P(DT \mid \Theta) \quad (2)$$
The second component is called the discourse parser, and
it is an algorithm for finding $DT_{best}$. We first focus on the
parsing model.
A discourse parse tree can be formally represented
as a set of tuples. The discourse tree in Figure 1, for
example, can be formally written as the set of tuples
$\{$ATTRIBUTION-SN[1,1,3], ENABLEMENT-NS[2,2,3]$\}$. A tuple is of the form $R[i,j,k]$, and denotes a discourse relation $R$ that holds between the discourse span that contains edus $i$ through $j$, and the discourse span that contains edus $j+1$ through $k$. Each relation $R$ also signals explicitly the nuclearity assignment, which can be NUCLEUS-SATELLITE (NS), SATELLITE-NUCLEUS (SN), or NUCLEUS-NUCLEUS (NN). This notation assumes that all relations $R$
are binary relations. The assumption is justified empiri-
cally: 99% of the nodes of the discourse trees in our cor-
pus are binary nodes. Using only binary relations makes
our discourse model easier to build and reason with.
In what follows we make use of two functions: function $rel$ applied to a tuple $R[i,j,k]$ yields the discourse relation $R$; function $ds$ applied to a tuple $R[i,j,k]$ yields the structure $[i,j,k]$. Given a set of adequate parameters $\Theta$, our discourse model estimates the goodness of a discourse parse tree $DT$ using formula (3).

$$P(DT \mid \Theta) = \prod_{t \in DT} P_s(ds(t) \mid \Theta) \cdot P_r(rel(t) \mid \Theta) \quad (3)$$
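Read procedurally, formula (3) scores a candidate tree tuple by tuple. A sketch (ours), with $P_s$ and $P_r$ standing in as callables and log-space accumulation to avoid underflow:

```python
import math

def tree_log_prob(tuples, P_s, P_r, theta):
    """tuples: e.g. [("ATTRIBUTION-SN", (1, 1, 3)), ("ENABLEMENT-NS", (2, 2, 3))]
    for the tree in Figure 1; P_s and P_r are the trained conditional models."""
    total = 0.0
    for rel, span in tuples:                  # span plays the role of ds(t)
        total += math.log(P_s(span, theta))   # goodness of the structure [i,j,k]
        total += math.log(P_r(rel, theta))    # goodness of the relation R
    return total
```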
[Figure 4: Dominance set extracted from a DS-LST. In the example, the head words are H = says for edu 1, H = will for edu 2, and H = to for edu 3, and the extracted dominance set is D = {(2, SBAR(will)) < (1, VP(says)), (3, S(to)) < (2, VP(use))}.]
For each tuple $t \in DT$, the probability $P_s$ estimates the goodness of the structure of $t$. We expect these probabilities to prefer the hierarchical structure (1, (2, 3)) over ((1, 2), 3) for the discourse tree in Figure 1. For each tuple $t \in DT$, the probability $P_r$ estimates the goodness of the discourse relation of $t$. We expect these probabilities to prefer the rhetorical relation ATTRIBUTION-SN over CONTRAST-NN for the relation between span 1 and span [2,3] in the discourse tree in Figure 1. The overall probability of a discourse tree is obtained by multiplying the structural probabilities $P_s$ and the relational probabilities $P_r$ for all the tuples in the discourse tree.
Our discourse model uses as $\Theta$ the information present in the input DS-LST. However, given such a tree $ST$ as input, one cannot estimate probabilities such as $P(DT \mid ST)$ without running into a severe sparseness problem. To overcome this, we map the input DS-LST into a more abstract representation that contains only the salient features of the DS-LST. This mapping leads to the notion of a dominance set over a discourse segmented lexicalized syntactic tree. In what follows, we define this notion and show that it provides adequate parameterization for the discourse parsing problem.
4.1 The Dominance Set of a DS-LST
The dominance set of a DS-LST contains feature repre-
sentations of a discourse segmented lexicalized syntactic
tree. Each feature is a representation of the syntactic and
lexical information that is found at the point where two
edus are joined together in a DS-LST. Our hypothesis is
that such "attachment points" in the structure of a DS-
LST (the boxed nodes in the tree in Figure 4) carry the
most indicative information with respect to the potential
discourse tree we want to build. A set representation of
the "attachment points" of a DS-LST is called the domi-
nance set of a DS-LST.
For each edu $e$ we identify a word $w$ in $e$ as the head word of edu $e$ and denote it $H$. $H$ is defined as the word with the highest occurrence as a lexical head in the lexicalized tree among all the words in $e$. The node in which $H$ occurs highest is called the head node of edu $e$ and is denoted $N_h$. The edu which has as head node the root of the DS-LST is called the exception edu. In our example, the head word for edu 2 is $H = will$, and its head node is $N_h = \mathrm{SBAR(will)}$; the head word for edu 3 is $H = to$, and its head node is $N_h = \mathrm{S(to)}$. The exception edu is edu 1.
For each edu $e$ which is not the exception edu, there exists a node which is the parent of the head node of $e$, and the lexical head of this node is guaranteed to belong to a different edu than $e$, call it $f$. We call this node the attachment node of $e$ and denote it $N_a$. In our example, the attachment node of edu 2 is $N_a = \mathrm{VP(says)}$, and its lexical head says belongs to edu 1; the attachment node of edu 3 is $N_a = \mathrm{VP(use)}$, and its lexical head use belongs to edu 2. We write formally that two edus $e$ and $f$ are linked through a head node $N_h$ and an attachment node $N_a$ as $(e, N_h) < (f, N_a)$.
The dominance set of a DS-LST is given by all the
edu pairs linked through a head node and an attachment
node in the DS-LST. Each element in the dominance set
represents a dominance relationship between the edus in-
volved. Figure 4 shows the dominance set a112 for our ex-
ample DS-LST. We say that edu 2 is dominated by edu 1
(shortly written a145a58a166a168a66 ), and edu 3 is dominated by edu 2
(a146a74a166a169a145 ).
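Procedurally, the dominance set falls out of the head and attachment definitions above. A self-contained sketch (the Node class is our minimal stand-in for a DS-LST node, not the authors' data structure):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str                      # e.g. "SBAR(will)"
    edu: int                        # edu that this node's lexical head belongs to
    parent: Optional["Node"] = None

def dominance_set(head_nodes):
    """head_nodes: dict mapping each edu e to its head node N_h (the highest
    node whose lexical head H comes from e). The exception edu is the one
    whose head node is the root (parent is None)."""
    dom = set()
    for e, n_h in head_nodes.items():
        if n_h.parent is None:
            continue                                  # exception edu: skip
        n_a = n_h.parent                              # attachment node of e
        f = n_a.edu                                   # its head lies outside e
        dom.add(((e, n_h.label), (f, n_a.label)))     # (e, N_h) < (f, N_a)
    return dom

# For the example in Figure 4:
root = Node("S(says)", 1)
sbar = Node("SBAR(will)", 2, parent=Node("VP(says)", 1))
s_to = Node("S(to)", 3, parent=Node("VP(use)", 2))
assert dominance_set({1: root, 2: sbar, 3: s_to}) == {
    ((2, "SBAR(will)"), (1, "VP(says)")),
    ((3, "S(to)"), (2, "VP(use)")),
}
```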
4.2 The Discourse Model
Our discourse parsing model uses the dominance set $D$ of a DS-LST as the conditioning parameter $\Theta$ in equation (3). The discourse parsing model we propose uses the dominance set $D$ to compute the probability of a discourse parse tree $DT$ according to formula (4).

$$P(DT \mid D) = \prod_{t \in DT} P_s(ds(t) \mid filter_s(t, D)) \cdot P_r(rel(t) \mid filter_r(t, D)) \quad (4)$$
Different projections of $D$ are used to accurately estimate the structure probabilities $P_s$ and the relation probabilities $P_r$ associated with a tuple in a discourse tree. The projection functions $filter_s$ and $filter_r$ ensure that, for each tuple $t \in DT$, only the information in $D$ relevant to $t$ is conditioned upon. In the case of $P_s$ (the probability of the structure $[i,j,k]$), we filter out the lexical heads and keep only the syntactic labels; also, we filter out all the elements of $D$ which do not have at least one edu inside the span of $t$. In our running example, for instance, for $t = $ ENABLEMENT-NS[2,2,3], $filter_s(t, D) = \{(2, \mathrm{SBAR}) < (1, \mathrm{VP}), (3, \mathrm{S}) < (2, \mathrm{VP})\}$. The span of $t$ is [2,3], and set $D$ has two elements involving edus
[Figure 5: Bottom-up discourse parsing. For the example sentence, P_s([2,2,3] | (2,SBAR)<(1,VP), (3,S)<(2,VP)) = 0.47 and P_r(ENABLEMENT-NS | S(to)<VP(use)) = 0.88; P_s([1,1,3] | (2,SBAR)<(1,VP), (3,S)<(2,VP)) = 0.37 and P_r(ATTRIBUTION-SN | SBAR(will)<VP(says)) = 0.009; the overall score of the tree is the product of these four probabilities, 0.001.]
from it, namely the dominance relationships $2 < 1$ and $3 < 2$. To decide the appropriate structure, $filter_s$ keeps them both; this is because a different dominance relationship between edus 1 and 2, namely $1 < 2$, would most likely influence the structure probability of $t$.
In the case of $P_r$ (the probability of the relation $R$), we keep both the lexical heads and the syntactic labels, but filter out the edu identifiers (clearly, the relation between two spans does not depend on the positions of the spans involved); also, we filter out all the elements of $D$ whose dominance relationship does not hold across the two sub-spans of $t$. In our running example, for $t = $ ENABLEMENT-NS[2,2,3], $filter_r(t, D) = \{\mathrm{S(to)} < \mathrm{VP(use)}\}$. The two sub-spans of $t$ are [2,2] and [3,3], and only the dominance relationship $3 < 2$ holds across these spans; the other dominance relationship in $D$, $2 < 1$, does not influence the choice for the relation label of $t$.
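The two projections can be stated compactly. In the sketch below, each dominance-set element is a pair of (edu, label, head) triples; this encoding is our assumption, not the paper's:

```python
def filter_s(span, D):
    """span = (i, j, k). Keep syntactic labels only, and only those elements
    of D that have at least one edu inside [i, k]."""
    i, j, k = span
    return {((eh, lh), (ea, la))
            for ((eh, lh, hh), (ea, la, ha)) in D
            if i <= eh <= k or i <= ea <= k}

def filter_r(span, D):
    """span = (i, j, k). Keep labels and lexical heads, drop edu identifiers,
    and keep only elements whose dominance holds across [i, j] and [j+1, k]."""
    i, j, k = span
    return {((lh, hh), (la, ha))
            for ((eh, lh, hh), (ea, la, ha)) in D
            if (i <= eh <= j and j + 1 <= ea <= k)
            or (j + 1 <= eh <= k and i <= ea <= j)}

# For D = {((2, "SBAR", "will"), (1, "VP", "says")),
#          ((3, "S", "to"), (2, "VP", "use"))} and span (2, 2, 3),
# filter_s keeps both elements (labels only), while filter_r keeps only
# (("S", "to"), ("VP", "use")): the relationship 2 < 1 does not hold
# across the sub-spans [2, 2] and [3, 3].
```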
The conditional probabilities involved in equation (4)
are estimated from the training corpus using maximum
likelihood estimation. A simple interpolation method is
used for smoothing to accommodate data sparseness. The
counts for the dominance sets are also smoothed using symbolic names for the edu identifiers and accounting only for the distance between them.
4.3 The Discourse Parser
Our discourse parser implements a classical bottom-up
algorithm. The parser searches through the space of
all legal discourse parse trees and uses a dynamic pro-
gramming algorithm. If two constituents are derived for
the same discourse span, then the constituent for which
the model assigns a lower probability can be safely dis-
carded.
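A minimal version of this bottom-up search, with a CKY-style chart indexed by edu spans; the chart layout and the score() interface (standing in for the product $P_s \cdot P_r$ of formula (4)) are ours:

```python
def parse(n_edus, relations, score):
    """Return the highest-probability discourse tree over edus 1..n_edus.
    score(rel, i, j, k) is the probability the model assigns to tuple
    rel[i, j, k]."""
    best = {}                                   # best[(i, k)] = (prob, tree)
    for i in range(1, n_edus + 1):
        best[(i, i)] = (1.0, i)                 # leaves: single edus
    for width in range(2, n_edus + 1):          # spans in increasing width
        for i in range(1, n_edus - width + 2):
            k = i + width - 1
            for j in range(i, k):               # split point between sub-spans
                for rel in relations:
                    p = (best[(i, j)][0] * best[(j + 1, k)][0]
                         * score(rel, i, j, k))
                    # dynamic programming: keep only the best tree per span
                    if (i, k) not in best or p > best[(i, k)][0]:
                        best[(i, k)] = (p, (rel, best[(i, j)][1],
                                            best[(j + 1, k)][1]))
    return best[(1, n_edus)]
```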
Figure 5 shows a discourse structure created in a
bottom-up manner for the DS-LST in Figure 2. Tu-
ple ENABLEMENT-NS[2,2,3] has a score of 0.40, obtained
as the product of the structure probability $P_s$ (0.47) and the relation probability $P_r$ (0.88). Tuple ATTRIBUTION-SN[1,1,3] has a score of 0.37 for the structure, and a score of 0.009 for the relation. The final score
for the entire discourse structure is 0.001. All probabil-
ities used were estimated from our training corpus. Ac-
cording to our discourse model, the discourse structure in
Figure 5 is the most likely among all the legal discourse
structures for our example sentence.
5 Evaluation
In this section we present the evaluations carried out for
both the discourse segmentation task and the discourse
parsing task. For this evaluation, we re-trained Char-
niak’s parser (2000) such that the test sentences from the
discourse corpus were not seen by the syntactic parser
during training.
5.1 Evaluation of the Discourse Segmenter
We train our discourse segmenter on the Training sec-
tion of the corpus described in Section 2, and test it on
the Test section. The training regime uses syntactic trees
from the Penn Treebank. The metric we use to evalu-
ate the discourse segmenter records the accuracy of the
discourse segmenter with respect to its ability to insert
inside-sentence discourse boundaries. That is, if a sen-
tence has 3 edus, which correspond to 2 inside-sentence
discourse boundaries, we measure the ability of our al-
gorithm to correctly identify these 2 boundaries. We re-
port our evaluation results using recall, precision, and F-
score figures. This metric is harsher than the metric pre-
viously used by Marcu (2000), who assesses the perfor-
mance of a discourse segmentation algorithm by count-
ing how often the algorithm makes boundary and no-
boundary decisions for every word in a sentence.
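Concretely, the metric compares predicted and gold inside-sentence boundary positions (a sketch of our reading of the metric):

```python
def prf(gold_boundaries, predicted_boundaries):
    """Recall, precision, and F-score over inside-sentence boundaries."""
    gold, pred = set(gold_boundaries), set(predicted_boundaries)
    correct = len(gold & pred)
    recall = correct / len(gold) if gold else 1.0
    precision = correct / len(pred) if pred else 1.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return recall, precision, f
```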
We compare the performance of our probabilistic dis-
course segmenter with the performance of the decision-
based segmenter proposed by (Marcu, 2000) and the per-
formance of two baseline algorithms. The first baseline (B1DS) uses punctuation to determine when to insert a boundary; because commas are often used to indicate breaks inside long sentences, B1DS inserts discourse boundaries after each comma. The second baseline (B2DS) uses syntactic information; because long sentences often have embedded sentences, B2DS inserts discourse boundaries after each text span whose corresponding syntactic subtree is labeled S, SBAR, or SINV. We also compute the agreement between human annotators on the discourse segmentation task (HDS),
using the doubly-annotated discourse corpus mentioned
in Section 2.
              Recall  Precision  F-score
  B1DS          28.2       37.1     32.0
  B2DS          25.4       64.9     36.5
  DecDS         77.1       83.3     80.1
  SynDS(T_C)    82.7       83.5     83.1
  SynDS(T_H)    85.4       84.1     84.7
  HDS           98.2       98.5     98.3

Table 1: Discourse segmenter evaluation
Table 1 shows the results obtained by the algorithm
described in this paper (SynDS(T_C)) using syntactic
trees produced by Charniak’s parser (2000), in com-
parison with the results obtained by the algorithm de-
scribed in (Marcu, 2000) (DecDS), and baseline algorithms B1DS and B2DS, on the same test set. Cru-
cial to the performance of the discourse segmenter is
the recall figure, because we want to find as many dis-
course boundaries as possible. The baseline algorithms
are too simplistic to yield good results (recall figures of
28.2% and 25.4%). The algorithm presented in this pa-
per gives an error reduction in missed discourse bound-
aries of 24.5% (recall accuracy improvement from 77.1%
to 82.7%) over (Marcu, 2000). The overall error reduc-
tion is of 15.1% (improvement in F-score from 80.1% to
83.1%).
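For reference, these error-reduction figures follow the usual convention of measuring how much of the residual error ($100 - F$, or $100 - R$ for recall) is eliminated; for example:

$(83.1 - 80.1) / (100 - 80.1) \approx 15.1\%$ and $(82.7 - 77.1) / (100 - 77.1) \approx 24.5\%$.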
In order to assess the impact of incorrect syntactic parse trees on the performance of the discourse segmenter, we also carry out an evaluation using syntactic trees from the Penn Treebank. The results are shown in row SynDS(T_H). Perfect syntactic trees lead to a further er-
ror reduction of 9.5% (F-score improvement from 83.1%
to 84.7%). The performance ceiling for discourse seg-
mentation is given by the human annotation agreement
F-score of 98.3%.
5.2 Evaluation of the Discourse Parser
We train our discourse parsing model on the Training sec-
tion of the corpus described in Section 2, and test it on
the Test section. The training regime uses syntactic trees
from the Penn Treebank. The performance is assessed us-
ing labeled recall and labeled precision as defined by the
standard Parseval metric (Black et al., 1991). As men-
tioned in Section 2, we use both 18 labels and 110 la-
bels for the discourse relations. The recall and precision
figures are combined into an F-score figure in the usual
manner.
The discourse parsing model uses syntactic trees pro-
duced by Charniak’s parser (2000) and discourse seg-
ments produced by the algorithm described in Section 3.
We compare the performance of our model (SynDP)
with the performance of the decision-based discourse
parsing model (DecDP) proposed by (Marcu, 2000), and
               BDP   DecDP   SynDP    HDP
  Unlabeled   64.0    67.0    70.5   92.8
  18 Labels   23.4    37.2    49.0   77.0
  110 Labels  20.7    35.5    45.6   71.9

Table 2: SynDP performance compared to baseline, state-of-the-art, and human performance
              T_C S_C   T_H S_C   T_C S_H   T_H S_H
  Unlabeled      70.5      73.0      92.8      96.2
  18 Labels      49.0      56.4      63.8      75.5
  110 Labels     45.6      52.6      59.5      70.3

Table 3: SynDP performance with human-level accuracy for syntactic trees and discourse boundaries.
with the performance of a baseline algorithm (BDP).
The baseline algorithm builds right-branching discourse
trees labeled with the most frequent relation encountered
in the training set (i.e., ELABORATION-NS). We also com-
pute the agreement between human annotators on the dis-
course parsing task (HDP), using the doubly-annotated
discourse corpus mentioned in Section 2. The results are
shown in Table 2. The baseline algorithm has a perfor-
mance of 23.4% and 20.7% F-score, when using 18 la-
bels and 110 labels, respectively. Our algorithm has a
performance of 49.0% and 45.6% F-score, when using
18 labels and 110 labels, respectively. These results rep-
resent an error reduction of 18.8% (F-score improvement
from 37.2% to 49.0%) over a state-of-the-art discourse
parser (Marcu, 2000) when using 18 labels, and an error
reduction of 15.7% (F-score improvement from 35.5% to
45.6%) when using 110 labels. The performance ceiling
for sentence-level discourse structure derivation is given
by the human annotation agreement F-score of 77.0% and
71.9%, when using 18 labels and 110 labels, respectively.
The performance gap between the results of SynDP and
human agreement is still large, and it can be attributed
to three possible causes: errors made by the syntactic
parser, errors made by the discourse segmenter, and the
weakness of our discourse model.
In order to quantitatively assess the impact on performance of each possible cause of error, we perform further experiments. We replace the syntactic parse trees produced by Charniak's parser at 90% accuracy ($T_C$) with the corresponding Penn Treebank syntactic parse trees produced by human annotators ($T_H$). We also replace the discourse boundaries produced by our discourse segmenter at 83% accuracy ($S_C$) with the discourse boundaries taken from (RST-DT, 2002), which are produced by the human annotators ($S_H$).
The results are shown in Table 3. The results in col-
umn $T_H S_C$ show that using perfect syntactic trees leads
to an error reduction of 14.5% (F-score improvement
from 49.0% to 56.4%) when using 18 labels, and an error
reduction of 12.9% (F-score improvement from 45.6%
to 52.6%) when using 110 labels. The results in col-
umn $T_C S_H$ show that the impact of perfect discourse
segmentation is double the impact of perfect syntactic
trees. Human-level performance on discourse segmen-
tation leads to an error reduction of 29.0% (F-score im-
provement from 49.0% to 63.8%) when using 18 labels,
and an error reduction of 25.6% (F-score improvement
from 45.6% to 59.5%) when using 110 labels. Together,
perfect syntactic trees and perfect discourse segmentation
lead to an error reduction of 52.0% (F-score improvement
from 49.0% to 75.5%) when using 18 labels, and an error
reduction of 45.5% (F-score improvement from 45.6% to
70.3%) when using 110 labels. The results in column
$T_H S_H$ in Table 3 compare extremely favorably with the results in column HDP in Table 2. The discourse parsing
model produces unlabeled discourse structure at a per-
formance level similar to human annotators (F-score of
96.2%). When using 18 labels, the distance between the performance level of our discourse parsing model and that of the human annotators is 1.5% absolute (75.5% versus 77%). When using 110 labels, the distance is 1.6% absolute (70.3% versus 71.9%). Our evaluation
shows that our discourse model is sophisticated enough
to match near-human levels of performance.
6 Conclusion
In this paper, we have introduced a discourse parsing
model that uses syntactic and lexical features to estimate
the adequacy of sentence-level discourse structures. Our
model defines and exploits a set of syntactically moti-
vated lexico-grammatical dominance relations that fall
naturally from a syntactic representation of sentences.
The most interesting finding is that these dominance relations encode sufficient information to enable the
derivation of discourse structures that are almost indis-
tinguishable from those built by human annotators. Our
experiments empirically show that, at the sentence level,
there is an extremely strong correlation between syntax
and discourse. This is even more remarkable given that
the discourse corpus (RST-DT, 2002) was built with no
syntactic theory in mind. The annotators used by Carlson
et al. (2003) were not instructed to build discourse trees
that were consistent with the syntax of the sentences. Yet,
they built discourse structures at sentence level that are
not only consistent with the syntactic structures of sen-
tences, but also derivable from them.
Recent work on Tree Adjoining Grammar-based lexi-
calized models of discourse (Forbes et al., 2001) has al-
ready shown how to exploit within a single framework
lexical, syntactic, and discourse cues. Various linguis-
tics studies have also shown how intertwined syntax and
discourse are (Maynard, 1998). However, to our knowl-
edge, this is the first paper that empirically shows that the
connection between syntax and discourse can be compu-
tationally exploited at high levels of accuracy on open
domain, newspaper text.
Another interesting finding is that the performance of
current state-of-the-art syntactic parsers (Charniak, 2000)
is not a bottleneck for coming up with a good solution
to the sentence-level discourse parsing problem. Little
improvement comes from using manually built syntactic
parse trees instead of automatically derived trees. How-
ever, experiments show that there is much to be gained if
better discourse segmentation algorithms are found; 83%
accuracy on this task is not sufficient for building highly
accurate discourse trees.
We believe that semantic/discourse segmentation is
a notoriously under-researched problem. For example,
Gildea and Jurafsky (2002) present a semantic parser that
optimistically assumes it has access to perfect seman-
tic segments. Our results suggest that more effort needs
to be put on semantic/discourse-based segmentation. Im-
provements in this area will have a significant impact on
both semantic and discourse parsing.
References
E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Gr-
ishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek,
J. Klavans, M. Liberman, M. Marcus, S. Roukos,
B. Santorini, and T. Strzalkowski. 1991. A proce-
dure for quantitatively comparing the syntactic cover-
age of English grammars. In Proceedings of Speech
and Natural Language Workshop, pages 306-311, Pacific Grove, CA. DARPA.
L. Carlson, D. Marcu, and M. E. Okurowski. 2003.
Building a discourse-tagged corpus in the framework
of Rhetorical Structure Theory. In Jan van Kuppevelt
and Ronnie Smith, editors, Current Directions in Dis-
course and Dialogue. Kluwer Academic Publishers. To
appear.
Eugene Charniak. 2000. A maximum-entropy-inspired
parser. In Proceedings of the NAACL 2000, pages 132-139, Seattle, Washington, April 29-May 3.
Michael Collins. 2000. Discriminative reranking for nat-
ural language parsing. In Proceedings of ICML 2000,
Stanford University, Palo Alto, CA, June 29-July 2.
C. J. Fillmore, C. F. Baker, and S. Hiroaki. 2002. The
FrameNet database and software tools. In Proceedings of the LREC 2002, pages 1157-1160.
K. Forbes, E. Miltsakaki, R. Prasad, A. Sarkar, A. Joshi,
and B. Webber. 2001. D-LTAG System: Discourse
parsing with a lexicalized tree-adjoining grammar. In
ESSLLI’2001 Workshop on Information Structure, Dis-
course Structure and Discourse Semantics.
Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288.
Paul Kingsbury and Martha Palmer. 2002. From Tree-
bank to Propbank. In Proceedings of the LREC 2002,
Las Palmas, Canary Islands, Spain, May 28-June 3.
David M. Magerman. 1995. Statistical decision-tree
models for parsing. In Proceedings of the ACL 1995,
pages 276-283, Cambridge, Massachusetts, June 26-
30.
William C. Mann and Sandra A. Thompson. 1988.
Rhetorical Structure Theory: Toward a functional the-
ory of text organization. Text, 8(3):243-281.
Daniel Marcu. 2000. The Theory and Practice of Dis-
course Parsing and Summarization. The MIT Press,
Cambridge, Massachusetts.
M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993.
Building a large annotated corpus of English: the Penn
Treebank. Computational Linguistics, 19(2):313-330.
Senko K. Maynard. 1998. Principles of Japanese Dis-
course: A Handbook. Cambridge University Press.
David D. Palmer and Marti A. Hearst. 1997. Adaptive
multilingual sentence boundary disambiguation. Com-
putational Linguistics, 23(2):241-269, June.
Adwait Ratnaparkhi. 1998. Maximum Entropy Models
for Natural Language Ambiguity Resolution. Ph.D.
thesis, University of Pennsylvania.
RST-DT. 2002. RST Discourse Tree-
bank. Linguistic Data Consortium.
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?
catalogId=LDC2002T07.