Proceedings of the Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, pages 29–36,
Ann Arbor, June 2005. c©2005 Association for Computational Linguistics
Attribution and the (Non-)Alignment of Syntactic and Discourse Arguments
of Connectives
Nikhil Dinesh and Alan Lee and Eleni Miltsakaki and Rashmi Prasad and Aravind Joshi
University of Pennsylvania
Philadelphia, PA 19104 USA
CUnikhild,aleewk,elenimi,rjprasad,joshiCV@linc.cis.upenn.edu
Bonnie Webber
University of Edinburgh
Edinburgh, EH8 9LW Scotland
bonnie@inf.ed.ac.uk
Abstract
The annotations of the Penn Discourse
Treebank (PDTB) include (1) discourse
connectives and their arguments, and (2)
attribution of each argument of each con-
nective and of the relation it denotes. Be-
cause the PDTB covers the same text as
the Penn TreeBank WSJ corpus, syntac-
tic and discourse annotation can be com-
pared. This has revealed significant dif-
ferences between syntactic structure and
discourse structure, in terms of the argu-
ments of connectives, due in large part to
attribution. We describe these differences,
an algorithm for detecting them, and fi-
nally some experimental results. These re-
sults have implications for automating dis-
course annotation based on syntactic an-
notation.
1 Introduction
The overall goal of the Penn Discourse Treebank
(PDTB) is to annotate the million word WSJ cor-
pus in the Penn TreeBank (Marcus et al., 1993) with
a layer of discourse annotations. A preliminary re-
port on this project was presented at the 2004 work-
shop on Frontiers in Corpus Annotation (Miltsakaki
et al., 2004a), where we described our annotation
of discourse connectives (both explicit and implicit)
along with their (clausal) arguments.
Further work done since then includes the an-
notation of attribution: that is, who has expressed
each argument to a discourse connective (the writer
or some other speaker or author) and who has ex-
pressed the discourse relation itself. These ascrip-
tions need not be the same. Of particular interest is
the fact that attribution may or may not play a role
in the relation established by a connective. This may
lead to a lack of congruence between arguments at
the syntactic and the discourse levels. The issue of
congruence is of interest both from the perspective
of annotation (where it means that, even within a
single sentence, one cannot merely transfer the an-
notation of syntactic arguments of a subordinate or
coordinate conjunction to its discourse arguments),
and from the perspective of inferences that these an-
notations will support in future applications of the
PDTB.
The paper is organized as follows. We give a brief
overview of the annotation of connectives and their
arguments in the PDTB in Section 2. In Section 3,
we describe the annotation of the attribution of the
arguments of a connective and the relation it con-
veys. In Sections 4 and 5, we describe mismatches
that arise between the discourse arguments of a con-
nective and the syntactic annotation as provided by
the Penn TreeBank (PTB), in the cases where all the
arguments of the connective are in the same sen-
tence. In Section 6, we will discuss some implica-
tions of these issues for the theory and practice of
discourse annotation and their relevance even at the
level of sentence-bound annotation.
2 Overview of the PDTB
The PDTB builds on the DLTAG approach to dis-
course structure (Webber and Joshi, 1998; Webber
et al., 1999; Webber et al., 2003) in which con-
nectives are discourse-level predicates which project
predicate-argument structure on a par with verbs at
29
the sentence level. Initial work on the PDTB has
been described in Miltsakaki et al. (2004a), Milt-
sakaki et al. (2004b), Prasad et al. (2004).
The key contribution of the PDTB design frame-
work is its bottom-up approach to discourse struc-
ture: Instead of appealing to an abstract (and arbi-
trary) set of discourse relations whose identification
may confound multiple sources of discourse mean-
ing, we start with the annotation of discourse con-
nectives and their arguments, thus exposing a clearly
defined level of discourse representation.
The PDTB annotates as explicit discourse connec-
tives all subordinating conjunctions, coordinating
conjunctions and discourse adverbials. These pred-
icates establish relations between two abstract ob-
jects such as events, states and propositions (Asher,
1993).
1
We use Conn to denote the connective, and Arg1
and Arg2 to denote the textual spans from which the
abstract object arguments are computed.
2
In (1), the
subordinating conjunction since establishes a tem-
poral relation between the event of the earthquake
hitting and a state where no music is played by a
certain woman. In all the examples in this paper, as
in (1), Arg1 is italicized, Arg2 is in boldface, and
Conn is underlined.
(1) She hasn’t played any music since the earthquake
hit.
What counts as a legal argument? Since we take
discourse relations to hold between abstract objects,
we require that an argument contains at least one
clause-level predication (usually a verb – tensed or
untensed), though it may span as much as a sequence
of clauses or sentences. The two exceptions are
nominal phrases that express an event or a state, and
discourse deictics that denote an abstract object.
1
For example, discourse adverbials like as a result are dis-
tinguished from clausal adverbials like strangely which require
only a single abstract object (Forbes, 2003).
2
Each connective has exactly two arguments. The argument
that appears in the clause syntactically associated with the con-
nective, we call Arg2. The other argument is called Arg1. Both
Arg1 and Arg2 can be in the same sentence, as is the case for
subordinating conjunctions (e.g., because). The linear order of
the arguments will be Arg2 Arg1 if the subordinate clause ap-
pears sentence initially; Arg1 Arg2 if the subordinate clause ap-
pears sentence finally; and undefined if it appears sentence me-
dially. For an adverbial connective like however,Arg1isinthe
prior discourse. Hence, the linear order of its arguments will be
Arg1 Arg2.
Because our annotation is on the same corpus as
the PTB, annotators may select as arguments textual
spans that omit content that can be recovered from
syntax. In (2), for example, the relative clause is
selected as Arg1 of even though, and its subject can
be recovered from its syntactic analysis in the PTB.
In (3), the subject of the infinitival clause in Arg1 is
similarly available.
(2) Workers described “clouds of blue dust” that hung
over parts of the factory even though exhaust fans
ventilated the air.
(3) The average maturity for funds open only to institu-
tions, considered by some to be a stronger indicator
because those managers watch the market closely,
reached a high point for the year – 33 days.
The PDTB also annotates implicit connectives be-
tween adjacent sentences where no explicit connec-
tive occurs. For example, in (4), the two sentences
are contrasted in a way similar to having an explicit
connective like but occurring between them. Anno-
tators are asked to provide, when possible, an ex-
plicit connective that best describes the relation, and
in this case in contrast was chosen.
(4) The $6 billion that some 40 companies are looking to
raise in the year ending March 21 compares with only
$2.7 billion raise on the capital market in the previous
year. IMPLICIT - in contrast In fiscal 1984, before
Mr. Gandhi came into power, only $810 million
was raised.
When complete, the PDTB will contain approxi-
mately 35K annotations: 15K annotations of the 100
explicit connectives identified in the corpus and 20K
annotations of implicit connectives.
3
3 Annotation of attribution
Wiebe and her colleagues have pointed out the
importance of ascribing beliefs and assertions ex-
pressed in text to the agent(s) holding or making
them (Riloff and Wiebe, 2003; Wiebe et al., 2004;
Wiebe et al., 2005). They have also gone a consid-
erable way towards specifying how such subjective
material should be annotated (Wiebe, 2002). Since
we take discourse connectives to convey semantic
predicate-argument relations between abstract ob-
jects, one can distinguish a variety of cases depend-
ingontheattribution of the discourse relation or its
3
The annotation guidelines for the PDTB are available at
http://www.cis.upenn.edu/AOpdtb.
30
arguments; that is, whether the relation or arguments
are ascribed to the author of the text or someone
other than the author.
Case 1: The relation and both arguments are at-
tributed to the same source. In (5), the concessive
relation between Arg1 and Arg2, anchored on the
connective even though is attributed to the speaker
Dick Mayer, because he is quoted as having said
it. Even where a connective and its arguments are
not included in a single quotation, the attribution can
still be marked explicitly as shown in (6), where only
Arg2 is quoted directly but both Arg1 and Arg2 can
be attibuted to Mr. Prideaux. Attribution to some
speaker can also be marked in reported speech as
shown in the annotation of so that in (7).
(5) “Now, Philip Morris Kraft General Foods’ parent
company is committed to the coffee business and to
increased advertising for Maxwell House,” says Dick
Mayer, president of the General Foods USA division.
“Even though brand loyalty is rather strong for cof-
fee, we need advertising to maintain and strengthen
it.”
(6) B.A.T isn’t predicting a postponement because the
units “are quality businesses and we are en-
couraged by the breadth of inquiries,” said Mr.
Prideaux.
(7) Like other large Valley companies, Intel also noted
that it has factories in several parts of the nation,
so that a breakdown at one location shouldn’t leave
customers in a total pinch.
Wherever there is a clear indication that a relation
is attributed to someone other than the author of the
text, we annotate the relation with the feature value
SA for “speaker attribution” which is the case for
(5), (6), and (7). The arguments in these examples
are given the feature value IN to indicate that they
“inherit” the attribution of the relation. If the rela-
tion and its arguments are attributed to the writer,
they are given the feature values WA and IN respec-
tively.
Relations are attributed to the writer of the text by
default. Such cases include many instances of re-
lations whose attribution is ambiguous between the
writer or some other speaker. In (8), for example,
we cannot tell if the relation anchored on although
is attributed to the spokeswoman or the author of the
text. As a default, we always take it to be attributed
to the writer.
Case 2: One or both arguments have a different at-
tribution value from the relation. While the default
value for the attribution of an argument is the attribu-
tion of its relation, it can differ as in (8). Here, as in-
dicated above, the relation is attributed to the writer
(annotated WA) by default, but Arg2 is attributed to
Delmed (annotated SA, for some speaker other than
the writer, and other than the one establishing the
relation).
(8) The current distribution arrangement ends in March
1990 , although Delmed said it will continue to pro-
vide some supplies of the peritoneal dialysis prod-
ucts to National Medical, the spokeswoman said.
Annotating the corpus with attribution is neces-
sary because in many cases the text containing the
source of attribution is located in a different sen-
tence. Such is the case for (5) where the relation
conveyed by even though, and its arguments are at-
tributed to Dick Mayer.
We are also adding attribution values to the anno-
tation of the implicit connectives. Implicit connec-
tives express relations that are inferred by the reader.
In such cases, the author intends for the reader to
infer a discourse relation. As with explicit connec-
tives, we have found it useful to distinguish implicit
relations intended by the writer of the article from
those intended by some other author or speaker. To
give an example, the implicit relation in (9) is at-
tributed to the writer. However, in (10) both Arg1
and Arg2 have been expressed by the speaker whose
speech is being quoted. In this case, the implicit re-
lation is attributed to the speaker.
(9) Investors in stock funds didn’t panic the week-
end after mid-October’s 190-point market plunge.
IMPLICIT-instead Most of those who left stock
funds simply switched into money market funds.
(10) “People say they swim, and that may mean they’ve
been to the beach this year,” Fitness and Sports. “It’s
hard to know if people are responding truthfully.
IMPLICIT-because People are too embarrassed to
say they haven’t done anything.”
The annotation of attribution is currently under-
way. The final version of the PDTB will include an-
notations of attribution for all the annotated connec-
tives and their arguments.
Note that in the Rhetorical Structure Theory
(RST) annotation scheme (Carlson et al., 2003), at-
tribution is treated as a discourse relation. We, on
the other hand, do not treat attribution as a discourse
31
relation. In PDTB, discourse relations (associated
with an explicit or implicit connective) hold between
two abstracts objects, such as events, states, etc. At-
tribution relates a proposition to an entity, not to an-
other proposition, event, etc. This is an important
difference between the two frameworks. One conse-
quence of this difference is briefly discussed in Foot-
note 4 in the next section.
4 Arguments of Subordinating
Conjunctions in the PTB
A natural question that arises with the annotation
of arguments of subordinating conjunctions (SUB-
CONJS) in the PDTB is to what extent they can be
detected directly from the syntactic annotation in the
PTB. In the simplest case, Arg2 of a SUBCONJ is its
complement in the syntactic representation. This is
indeed the case for (11), where since is analyzed as
a preposition in the PTB taking an S complement
which is Arg2 in the PDTB, as shown in Figure 1.
(11) Since the budget measures cash flow, anew$1di-
rect loan is treated as a $1 expenditure.
Furthermore, in (11), since together with its com-
plement (Arg2) is analyzed as an SBAR which mod-
ifies the clause anew$1 direct loan is treated as a
$1 expenditure, and this clause is Arg1 in the PDTB.
Can the arguments always be detected in this
way? In this section, we present statistics showing
that this is not the case and an analysis that shows
that this lack of congruence between the PDTB and
the PTB is not just a matter of annotator disagree-
ment.
Consider example (12), where the PTB requires
annotators to include the verb of attribution said
and its subject Delmed in the complement of al-
though.Butalthough as a discourse connective de-
nies the expectation that the supply of dialysis prod-
ucts will be discontinued when the distribution ar-
rangement ends. It does not convey the expectation
that Delmed will not say such things. On the other
hand, in (13), the contrast established by while is be-
tween the opinions of two entities i.e., advocates and
their opponents.
4
4
This distinction is hard to capture in an RST-based pars-
ing framework (Marcu, 2000). According to the RST-based an-
notation scheme (Carlson et al., 2003) ‘although Delmed said’
and ‘while opponents argued’ are elementary discourse units
(12) The current distribution arrangement ends in March
1990, although Delmed said it will continue to pro-
vide some supplies of the peritoneal dialysis prod-
ucts to National Medical, the spokeswoman said.
(13) Advocates said the 90-cent-an-hour rise, to $4.25 an
hour by April 1991, is too small for the working poor,
while opponents argued that the increase will still
hurt small business and cost many thousands of
jobs.
In Section 5, we will identify additional cases. What
we will then argue is that it will be insufficient to
train an algorithm for identifying discourse argu-
ments simply on the basis of syntactically analysed
text.
We now present preliminary measurements of
these and other mismatches between the two corpora
for SUBCONJS. To do this we describe a procedural
algorithm which builds on the idea presented at the
start of this section. The statistics are preliminary in
that only the annotations of a single annotator were
considered, and we have not attempted to exclude
cases in which annotators disagree.
We consider only those SUBCONJS for which both
arguments are located in the same sentence as the
connective (which is the case for approximately 99%
of the annotated instances). The syntactic configura-
tion of such relations pattern in a way shown in Fig-
ure 1. Note that it is not necessary for any of BVD3D2D2,
BTD6CVBD,orBTD6CVBE to have a single node in the parse tree
that dominates it exactly. In Figure 1 we do obtain a
single node for BVD3D2D2,andBTD6CVBE but for BTD6CVBD,itis
the set of nodes CUC6C8BNCEC8CV that dominate it exactly.
Connectives like so that,andeven if are not domi-
nated by a single node, and cases where the annota-
tor has decided that a (parenthetical) clausal element
is not minimally necessary to the interpretation of
BTD6CVBE will necessitate choosing multiple nodes that
dominate BTD6CVBE exactly.
Given the node(s) in the parse tree that dominate
BVD3D2D2 (CUC1C6CV in Figure 1), the algorithm we present
tries to find node(s) in the parse tree that dominate
BTD6CVBD and BTD6CVBE exactly using the operation of tree
subtraction (Sections 4.1, and 4.2). We then discuss
its execution on (11) in Section 4.3.
annotated in the same way: as satellites of the relation Attribu-
tion. RST does not recognize that satellite segments, such as
the ones given above, sometimes participate in a higher RST
relation along with their nuclei and sometimes not.
32
CB
BDBE
SBAR NP
Anew$1 direct
loan
VP
is treated as a
$1 expenditure
IN CBBE
the budget mea-
sures cash flow
since
Given C6
BVD3D2D2
BP CUC1C6CV, our goal is to find C6
BTD6CVBD
BP
CUC6C8BNCEC8CV,andC6
BTD6CVBE
BP CUCB
BE
CV.Steps:
AF CW
BVD3D2D2
BP C1C6
AF DC
BVD3D2D2B7BTD6CVBE
BP CBBUBTCAAH D4CPD6CTD2D8B4CW
BVD3D2D2
B5
AF DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
BP CB
BDBE
AH lowest BTD2CRCTD7D8D3D6
D4CPD6CTD2D8B4DC
BVD3D2D2B7BTD6CVBE
B5
with la-
bel S or SBAR. Note that DC BE BTD2CRCTD7D8D3D6
DC
AF C6
BTD6CVBE
BP DC
BVD3D2D2B7BTD6CVBE
A0C6
BVD3D2D2
BP CBBUBTCAA0CUC1C6CV
BP CUCB
BE
CV
AF C6
BTD6CVBD
BP DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
A0CUDC
BVD3D2D2B7BTD6CVBE
CV
BP CB
BDBE
A0CUCBBUBTCACV
BP CUC6C8BNCEC8CV
Figure 1: The syntactic configuration for (11), and the execution of the tree subtraction algorithm on this configuration.
4.1 Tree subtraction
We will now define the operation of tree subtraction
the graphical intuition for which is given in Figure
2. Let CC be the set of nodes in the tree.
Definition 4.1. The ancestors of any node D8 BE CC,
denoted by BTD2CRCTD7D8D3D6
D8
AI CC is a set of nodes such
that D8 BE BTD2CRCTD7D8D3D6
D8
and D4CPD6CTD2D8B4D9BND8B5 B5 B4CJD9 BE
BTD2CRCTD7D8D3D6
D8
CLCMCJBTD2CRCTD7D8D3D6
D9
AQ BTD2CRCTD7D8D3D6
D8
CLB5
Definition 4.2. Consider a node DC BE CC, and a set
of nodes CH AQ CC A0CUDCCV, we define the set CI
BC
BP
CUD2CYD2 BE CC A0CUDCCVCMDC BE BTD2CRCTD7D8D3D6
D2
CM B4BKDD BE CHBNDD BIBE
BTD2CRCTD7D8D3D6
D2
CM D2 BIBE BTD2CRCTD7D8D3D6
DD
B5CV. Given such an DC
and CH , the operation of tree subtraction gives a set
of nodes CI such that, CI BP CUDE
BD
CYDE
BD
BE CI
BC
CM B4BKDE
BE
BE
CI
BC
BNDE
BE
BIBE B4BTD2CRCTD7D8D3D6
DE
BD
A0CUDE
BD
CVB5B5CV
We denote this by DCA0CH BP CI.
The nodes DE BE CI are the highest descendants of
DC, which do not dominate any node DD BE CH and are
not dominated by any node in CH .
4.2 Algorithm to detect the arguments
For any D8 BE CC,letC4
D8
denote the set of leaves(or
terminals) dominated by D8 and for BT AI CC we denote
the set of leaves dominated by BT as C4
BT
BP
CJ
BKCPBEBT
C4
CP
.
CG A0CUDD
BD
BNDD
BE
CV BP CUDE
BD
BNDE
BE
CV
CG
DD
BD DEBE
DD
BE DEBD
Figure 2: Tree subtraction DCA0CH BP CI
For any set of leaves C4 we define C6
BC
C4
to be a set
of nodes of maximum cardinality such that C4
C6
BC
C4
BP
CJ
BKD2BEC6
BC
C4
C4
D2
BP C4
The set C6
C4
BP CUD2
BD
CYD2
BD
BE C6
BC
C4
CM B4BKD2
BE
BE C6
BC
C4
BND2
BE
BIBE
B4BTD2CRCTD7D8D3D6
D2
BD
A0CUD2
BD
CVB5B5CV. We can think of Conn,
Arg1 and Arg2 each as a set of leaves and we use
C6
BVD3D2D2
, C6
BTD6CVBD
and C6
BTD6CVBE
to denote the set of high-
est nodes which dominate them respectively.
Given C6
BVD3D2D2
, our task is then to find C6
BTD6CVBD
and
33
C6
BTD6CVBE
. The algorithm does the following:
1. Let CW
BVD3D2D2
(the head) be the last node in C6
BVD3D2D2
in an in-
order traversal of the tree.
2. DC
BVD3D2D2B7BTD6CVBE
AH D4CPD6CTD2D8B4CW
BVD3D2D2
B5
3. Repeat while D4CPD6CTD2D8B4DC
BVD3D2D2B7BTD6CVBE
B5 has label S or SBAR,
and has only two children:
DC
BVD3D2D2B7BTD6CVBE
BP D4CPD6CTD2D8B4DC
BVD3D2D2B7BTD6CVBE
B5
This ensures the inclusion of complementizers and subor-
dinating conjuctions associated with the clause in Arg1.
The convention adopted by the PDTB was to include such
elements in the clause with which they were associated.
4. DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
is the lowest node with label S or
SBAR such that:
DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
BE BTD2CRCTD7D8D3D6
D4CPD6CTD2D8B4DC
BVD3D2D2B7BTD6CVBE
B5
5. Repeat while D4CPD6CTD2D8B4DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
B5 has label S or
SBAR, and has only two children:
DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
BP D4CPD6CTD2D8B4DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
B5
6. C6
BTD6CVBE
BP DC
BVD3D2D2B7BTD6CVBE
A0C6
BVD3D2D2
(tree subtraction)
7. C6
BTD6CVBD
BP DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
A0CUDC
BVD3D2D2B7BTD6CVBE
CV (tree sub-
traction)
4.3 Executing the algorithm on (11)
The idea behind the algorithm is as follows. Since
we may not be able to find a single node that domi-
nates BVD3D2D2, BTD6CVBD, and/or BTD6CVBE exactly, we attempt
to find a node that dominates BVD3D2D2 and BTD6CVBE to-
gether denoted by DC
BVD3D2D2B7BTD6CVBE
(CBBUBTCAin Figure 1),
and a node that dominates BVD3D2D2, BTD6CVBD and BTD6CVBE
together denoted by DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
(CB
BDBE
in Fig-
ure 1). Note that this is an approximation, and there
may be no single node that dominates BVD3D2D2,and
BTD6CVBE exactly.
Given DC
BVD3D2D2B7BTD6CVBE
theideaistoremoveallthe
material corresponding to BVD3D2D2 (C6
BVD3D2D2
) under that
node and call the rest of the material BTD6CVBE.Thisis
what the operation of tree subtraction gives us, i.e.,
DC
BVD3D2D2B7BTD6CVBE
A0C6
BVD3D2D2
which is CUCB
BE
CV in Figure 1.
Similarly, given DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
we would like
to remove the material corresponding to BVD3D2D2
and BTD6CVBE and CUDC
BVD3D2D2B7BTD6CVBE
CV is that material.
DC
BVD3D2D2B7BTD6CVBDB7BTD6CVBE
A0 CUDC
BVD3D2D2B7BTD6CVBE
CV givesusthe
nodes CUC6C8BNCEC8CV which is the desired BTD6CVBD.
5 Evaluation of the tree subtraction
algorithm
Describing the mismatches between the syntactic
and discourse levels of annotation requires a detailed
analysis of the cases where the tree subtraction al-
gorithm does not detect the same arguments as an-
notated by the PDTB. Hence this first set of exper-
iments was carried out only on Sections 00-01 of
the WSJ corpus (about 3500 sentences), which is ac-
cepted by the community to be development data.
First, the tree subtraction algorithm was run on
the PTB annotations in these two sections. The ar-
guments detected by the algorithm were classified
as: (a) Exact, if the argument detected by the al-
gorithm exactly matches the annotation; (b) Extra
Material, if the argument detected contains some
additional material in comparison with the annota-
tion; and (c) Omitted Material, if some annotated
material was not included in the argument detected.
The results are summarized in Table 1.
Argument Exact Extra Material Omitted Material
Arg1 82.5% 12.6% 4.9%
(353) (54) (21)
Arg2 93.7% 2.6% 3.7%
(401) (11) (16)
Table 1: Tree subtraction on the PTB annotations for SUB-
CONJS. Section 00-01(428 instances)
5.1 Analysis of the results in Table 1
5.1.1 Extra Material
There were 54 (11) cases where Arg1 (Arg2) in
the PTB (obtained via tree subtraction) contained
more material than the corresponding annotation in
the PDTB. We describe only the cases for Arg1,
since they were a superset of the cases for Arg2.
Second VP-coordinate - In these cases, Arg1 of
the SUBCONJ was associated with the second of two
coordinated VPs. Example (14) is the relation an-
notated by the PDTB, while (15) is the relation pro-
duced by tree subtraction.
(14) She became an abortionist accidentally, and continued
because it enabled her to buy jam, cocoa and other
war-rationed goodies.
(15) She became an abortionist accidentally, and contin-
ued because it enabled her to buy jam, cocoa and
other war-rationed goodies.
Such mismatches can be either due to the fact
that the algorithm looks only for nodes of type S
or SBAR, or due to disagreement between the PTB
and PDTB. Further investigation is needed to under-
34
stand this issue more precisely.
5
The percentage of
such mismatches (with respect to the total number
of cases of extra material) is recorded in the first col-
umn of Table 2, along with the number of instances
in parentheses.
Lower Verb - These are cases of a true mismatch
between the PDTB and the PTB, where the PDTB
has associated Arg1 with a lower clause than the
PTB. BL of the BDBF “lower verb” cases for Arg1 were
due to verbs of attribution, as in (12). (The percent-
age of “lower verb” mismatches is given in the sec-
ond column of Table 2, along with the number of
instances in parentheses.)
Clausal Adjuncts - Finally, we considered cases
where clause(s) judged not to be minimally neces-
sary to the interpretation of Arg1 were included.
(16) shows the relation annotated by the PDTB,
where the subordinate clause headed by partly be-
cause is not part of Arg1, but the tree subtraction
algorithm includes it as shown in (17).
(16) When Ms. Evans took her job, several important
divisions that had reported to her predecessor weren’t
included partly because she didn’t wish to be a full
administrator.
(17) When Ms. Evans took her job, several important
divisions that had reported to her predecessor weren’t
included partly because she didn’t wish to be a full
administrator.
To get an idea of the number of cases where a
single irrelevant clause was included, we determined
the number of instances for which pruning out one
node from Arg1 resulted in an exact match. This is
given in the third column of Table 2. The second
row of Table 2 illustrates the same information for
Arg2. Most of these are instances where irrelevant
clauses were included in the argument detected from
the PTB.
Argument Second VP Lower One Node Other
Coordinate Verb Pruned
Arg1 16.7% 24.1% 31.5% 27.7%
(9) (13) (17) (15)
Arg2 0% 9.1% 72.7% 18.2%
(0) (1) (8) (2)
Table 2: Cases which result in extra material being included
in the arguments.
5
It is also possible for the PDTB to associate an argument
with only the first of two coordinated VPs, but the number of
such cases were insignificant.
5.1.2 Omitted Material
The main source of these errors in Arg1 are the
higher verb cases. Here the PDTB has associated
Arg1 with a higher clause than the PTB. Examples
(18) and (19) show the annotated and algorithmi-
cally produced relations respectively. This is the in-
verse of the aforementioned lower verb cases, and
the majority of these cases are due to the verb of at-
tribution being a part of the relation.
(18) Longer maturities are thought to indicate declining
interest rates because they permit portfolio man-
agers to retain relatively higher rates for a longer
period.
(19) Longer maturities are thought to indicate declining in-
terest rates because they permit portfolio managers
to retain relatively higher rates for a longer period.
To get an approximate idea of these errors, we
checked if selecting a higher S or SBAR made the
Arg1 exact or include extra material. These are the
columns Two up exact and Two up extra included
in Table 3. At this time, we lack a precise under-
standing of the remaining mismatches in Arg1, and
the ones resulting in material being omitted from
Arg2.
Argument Twoupexact Two up extra Other
included
Arg1 47.6% (10) 14.3% (3) 28.1% (8)
Table 3: Cases which result in material being omitted from
Arg1 as a result of excluding a higher verb
5.2 Additional experiments
We also evaluated the performance of the tree sub-
traction procedure on the PTB annotations on Sec-
tions 02-24 of the WSJ corpus, and the results are
summarized in Table 4.
Argument Exact Extra Material Omitted Material
Arg1 76.1% 17.6% 6.3%
Arg2 92.5% 3.6% 3.9%
Table 4: Tree subtraction on PTB annotations for the SUB-
CONJS(approx. 5K instances). Sections 02-24
Finally we evaluated the algorithm on the output
of a statistical parser. The parser implementation in
(Bikel, 2002) was used in this experiment and it was
run in a mode which emulated the Collins (1997)
parser. The parser was trained on Sections 02-21
and Sections 22-24 were used as test data, where
35
the parser was run and the tree subtraction algorithm
was run on its output. The results are summarized in
Table 5.
Argument Exact Extra Material Omitted Material
Arg1 65.5% 25.2% 9.3%
Arg2 84.7% 0% 15.3%
Table 5: Tree subtraction on the output of a statistical parser
(approx. 600 instances). Sections 22-24.
6 Conclusions
While it is clear that discourse annotation goes be-
yond syntactic annotation, one might have thought
that at least for the annotation of arguments of subor-
dinating conjunctions, these two levels of annotation
would converge. However, we have shown that this
is not always the case. We have also described an
algorithm for discovering such divergences, which
can serve as a useful baseline for future efforts to de-
tect the arguments with greater accuracy. The statis-
tics presented suggest that the annotation of the dis-
course arguments of the subordinating conjunctions
needs to proceed separately from syntactic annota-
tion – certainly when annotating other English cor-
pora and very possibly for other languages as well.
A major source of the mismatches between syn-
tax and discourse is the effect of attribution, either
that of the arguments or of the relation denoted by
the connective. We believe that the annotation of at-
tribution in the PDTB will prove to be a useful aid
to applications that need to detect the relations con-
veyed by discourse connectives with a high degree
of reliability, as well as in constraining the infer-
ences that may be drawn with respect to the writer’s
commitment to the relation or the arguments. The
results in this paper also raise the more general ques-
tion of whether there may be other mismatches be-
tween the syntactic and discourse annotations at the
sentence level.

References
Nicholas Asher. 1993. Reference to Abstract Objects in Dis-
course. Kluwer Academic Press.
Daniel Bikel. 2002. Design of a Multi-lingual, Parallel-
processing Statistical Parsing Engine. In HLT.
Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski,
2003. Current Directions in Discourse and Dialogue, chap-
ter Building a Discourse-Tagged Corpus in the framework
of Rhetorical Structure Theory, pages 85–112. Kluwer Aca-
demic Publishers.
Michael Collins. 1997. Three Generative, Lexicalized Models
for Statistical Parsing. In 35th Annual Meeting of the ACL.
Katherine Forbes. 2003. Discourse Semantics of S-Modifying
Adverbials. Ph.D. thesis, Department of Linguistics, Uni-
versity of Pennsylvania.
Daniel Marcu. 2000. The Rhetorical Parsing of Unrestricted
Texts: A Surface-Based Approach. Computational Linguis-
tics, 26(3):395–448.
Mitchell Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1993. Building a large scale anno-
tated corpus of english: the Penn Treebank. Computational
Linguistics, 19.
Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie
Webber. 2004a. Annotating Discourse Connectives and
their Arguments. In the HLT/NAACL workshop on Frontiers
in Corpus Annotation, Boston, MA.
Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie
Webber. 2004b. The Penn Discourse Treebank. In the Lan-
guage Resources and Evaluation Conference, Lisbon, Portu-
gal.
Rashmi Prasad, Eleni Miltsakaki, Aravind Joshi, and Bonnie
Webber. 2004. Annotation and Data Mining of the Penn
Discourse TreeBank. In ACL Workshop on Discourse Anno-
tation, Barcelona, Spain.
Ellen Riloff and Janyce Wiebe. 2003. Learning Extraction Pat-
terns for Subjective Expressions. In Proceedings of the SIG-
DAT Conference on Empirical Methods in Natural Language
Processing (EMNLP ’03), pages 105–112, Sapporo, Japan.
Bonnie Webber and Aravind Joshi. 1998. Anchoring a
Lexicalized Tree-Adjoining Grammar for Discourse. In
ACL/COLING Workshop on Discourse Relations and Dis-
course Markers, Montreal, Canada, August.
Bonnie Webber, Alistair Knott, Matthew Stone, and Aravind
Joshi. 1999. Discourse Relations: A Structural and Presup-
positional Account using Lexicalized TAG. In ACL, College
Park, MD, June.
Bonnie Webber, Aravind Joshi, Matthew Stone, and Alistair
Knott. 2003. Anaphora and Discourse Structure. Computa-
tional Linguistics, 29(4):545–87.
Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell,
and Melanie Martin. 2004. Learning subjective language.
Computational Linguistics, 30(3):277–308.
Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. An-
notating expressions of opinions and emotions in language.
Language Resources and Evaluation, 1(2).
Janyce Wiebe. 2002. Instructions for annotating opinions in
newspaper articles. Technical Report TR-02-101, Depart-
ment of Computer Science, University of Pittsburgh.
