Combining Hierarchical Clustering and Machine Learning
to Predict High-Level Discourse Structure
Caroline Sporleder and Alex Lascarides
School of Informatics
University of Edinburgh
2 Buccleuch Place,
Edinburgh EH8 9LW
{csporled, alex}@inf.ed.ac.uk
Abstract
We propose a novel method to predict the inter-
paragraph discourse structure of text, i.e. to infer
which paragraphs are related to each other and form
larger segments on a higher level. Our method com-
bines a clustering algorithm with a model of seg-
ment “relatedness” acquired in a machine learning
step. The model integrates information from a va-
riety of sources, such as word co-occurrence, lexi-
cal chains, cue phrases, punctuation, and tense. Our
method outperforms an approach that relies on word
co-occurrence alone.
1 Introduction
For the interpretation of texts, it is not enough to un-
derstand each sentence individually; one also needs
to have an idea of how sentences relate to each other,
i.e. one needs to know the discourse structure of the
text. This knowledge is important for many NLP
applications, e.g. text summarisation or question an-
swering. Most discourse theories, such as Rhetori-
cal Structure Theory (RST) (Mann and Thompson,
1987), assume that discourse structure can be rep-
resented as a tree whose leaves are the elementary
discourse units (edus) of the text, e.g. sentences or
clauses. Edus are linked to each other by rhetorical
relations, such as Contrast or Elaboration, and then
form larger text segments (represented by interme-
diate nodes in the tree), which in turn can be linked
to other segments via rhetorical relations, giving rise
to even larger segments.
Discourse parsing is concerned with inferring
discourse structure automatically and can be viewed
as consisting of three co-dependent subtasks: (i)
identifying the edus, (ii) determining which dis-
course segments (edus or larger segments) relate
to each other, i.e. finding the correct attachment
site for each segment, and (iii) identifying how dis-
course segments are related to each other, i.e. infer-
ring the rhetorical relation.
While these tasks have been dealt with quite well
for small structures (i.e. on clause and sentence
level) (Soricut and Marcu, 2003), many of these
approaches cannot be applied directly to higher-
level structures (e.g. on multi-sentence and inter-
paragraph level) because they rely nearly exclu-
sively on cue phrases, which are much less useful
for large structures (Marcu, 2000, p. 129). In this
paper, we focus exclusively on inferring high-level
structure. In particular, we investigate ways to auto-
matically determine the correct attachment site for
paragraph and multi-paragraph segments.
Finding a good attachment site is a complex task;
even if one requires the final structure to be a tree,
the number of valid structures grows rapidly with
the number of edus in a text. An exhaustive search
is often not feasible, even for relatively small texts.
One way to address this problem is by making an
assumption that discourse structure correlates with
higher-level text structure, i.e. that it obeys sen-
tence, paragraph and section breaks (Marcu, 2000).
Under this assumption a non-sentential edu cannot
attach directly to another edu that is not part of the
same sentence and a sentence cannot attach directly
to a sentence in a preceding or following paragraph.
This leads to a significant reduction in the number of
valid trees by allowing for a “divide-and-conquer”
approach which treats inter-paragraph structure as
independent from intra-paragraph structure.
While this clearly is a simplifying assumption, it
is likely that textual and discourse structure are re-
lated in some way. For example, psycholinguistic
research has shown that paragraph boundaries are
not arbitrary (humans can predict them with an ac-
curacy that is higher than chance) and they are not
largely determined by aesthetics either (humans do
not apply a simple length criterion when deciding
where to place a boundary) (Stark, 1988). This sug-
gests that the placement of paragraph boundaries
may be influenced by discourse structure. This is
further supported by the observation that 79% of
the paragraphs in the manually annotated data set
we used (see Section 3) do correspond to discourse
segments, even though the annotators were free to
link text spans in any way they liked, provided it
did not result in crossing branches, i.e. the annota-
tion instructions did not bias the annotators to en-
sure that paragraphs corresponded to discourse seg-
ments (Carlson and Marcu, 2001).
To determine the high-level tree-structure of a
text in the absence of cue phrases, word co-
occurrence measures have been suggested (Marcu,
2000; Yaari, 1997). In this paper we take a different
approach: instead of relying on word co-occurrence
alone, we use machine learning to build a complex
model of segment relatedness that combines word
co-occurrence with other information, such as lex-
ical chains, cue phrases, segment length, punctua-
tion and tense patterns. We then combine this model
with a hierarchical clustering algorithm to derive the
inter-paragraph tree structure of a text. The results
show significant improvements over an approach
based on word co-occurrence alone.
2 Combining Clustering and Machine
Learning
We use Hierarchical Agglomerative Clustering (see
e.g. Everitt (1993)) as our clustering method. The
algorithm starts with the set of elementary seg-
ments, in our case the paragraphs of a text, and then
iteratively chooses a pair of adjacent segments to
merge until only one segment (corresponding to the
whole text) is left. The result is a binary tree whose
leaves correspond to paragraphs and whose inter-
mediate nodes correspond to intermediate discourse
segments. Most discourse theories allow non-binary
structures. For example, the List relation in RST
can be non-binary. But the majority of structures
are binary. In our data set more than 95% of the
inter-paragraph structures were binary. Hence, bi-
nary trees seem a good approximation.
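The merging loop can be sketched in a few lines; this is a minimal illustration, and `relatedness` is a placeholder for whichever scoring function is plugged in (Yaari's cosine measure or the learnt model described below), not part of the paper's specification:

```python
def build_tree(paragraphs, relatedness):
    """Hierarchical agglomerative clustering restricted to adjacent
    segments, yielding a binary tree over the paragraphs of a text."""
    # Represent each elementary segment as a one-paragraph tuple.
    segments = [(p,) for p in paragraphs]
    while len(segments) > 1:
        # Only adjacent pairs may merge (non-crossing tree constraint);
        # pick the adjacent pair the model considers most related.
        best = max(range(len(segments) - 1),
                   key=lambda i: relatedness(segments[i], segments[i + 1]))
        merged = (segments[best], segments[best + 1])
        segments[best:best + 2] = [merged]
    return segments[0]
```

Each merge replaces two adjacent segments with a new binary node, so after n−1 iterations the single remaining segment is the inter-paragraph tree of the whole text.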
In our use of clustering we draw on an idea from
Yaari (1997) who uses this technique as a first step
in his hierarchical segmentation algorithm. How-
ever, where Yaari uses a similarity measure based
on word co-occurrence to decide which segments
should be merged, we use a machine learnt model
of segment relatedness. Since we will compare our
model against Yaari’s measure, we describe the lat-
ter in more detail here.
Yaari uses a cosine measure to define segment
similarity (cf. Salton and Buckley (1994)):

sim(s_i, s_j) = \frac{\sum_t w_{t,s_i} \cdot w_{t,s_j}}{\sqrt{\sum_t w_{t,s_i}^2 \cdot \sum_t w_{t,s_j}^2}}   (1)

Here t ranges over the set of terms in the text. Terms
are extracted by removing closed-class words from
the text and then stemming it using Porter's stemmer
(Porter, 1980). Each term is weighted, where w_{t,s_i}
is the weight assigned to term t in segment s_i, de-
fined as the product of three factors: the frequency
of the term in the segment, the relative frequency of
the term in the text, and the general significance of
the term in a corpus (Gsig_t):

w_{t,s_i} = f_{t,s_i} \cdot \frac{f_t}{f_{max}} \cdot Gsig_t   (2)

Gsig_t = \log \frac{N}{N_t}   (3)

In the definition of Gsig_t, N is the number of files in
the corpus (Yaari uses the British National Corpus,
http://www.hcu.ox.ac.uk/BNC/) and N_t is the number
of files containing the term t.
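As an illustration, equations (1) to (3) can be implemented directly; the frequency statistics passed in below (the text-level counts and the corpus file counts behind Gsig) are toy values of this sketch, not figures from the paper:

```python
import math
from collections import Counter

def gsig(n_files, n_files_with_term):
    # General significance of a term over a reference corpus (equation 3).
    return math.log(n_files / n_files_with_term)

def segment_weights(terms, text_freq, max_text_freq, gsig_of):
    # Weight of each term in a segment (equation 2): within-segment
    # frequency times relative text frequency times corpus significance.
    return {t: f * (text_freq[t] / max_text_freq) * gsig_of(t)
            for t, f in Counter(terms).items()}

def cosine(w1, w2):
    # Cosine similarity between two weighted term vectors (equation 1).
    num = sum(w * w2.get(t, 0.0) for t, w in w1.items())
    den = math.sqrt(sum(w * w for w in w1.values()) *
                    sum(w * w for w in w2.values()))
    return num / den if den else 0.0
```

For example, two segments with identical weight vectors score 1.0, and segments sharing no terms score 0.0.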
We take a different approach. Instead of bas-
ing the decision to merge two segments on word
co-occurrence alone, we use supervised machine
learning to combine several contextual features (in-
cluding word co-occurrence, see Section 4) into a
model that assesses how likely two segments are
to be related. The segment pair that scores high-
est is merged. We used a maximum entropy learner
(Ratnaparkhi, 1998) to train our model, but any ma-
chine learnt classifier that returns a probability dis-
tribution over possible outcomes (e.g. merge, don't
merge), or at least ranks them, would be suitable.
3 Data
The RST Discourse Treebank (RST-DT) (Carlson
et al., 2002) was used for training and testing. It
contains 385 Wall Street Journal articles from the
Penn Treebank, which are manually annotated with
discourse structure in the framework of Rhetori-
cal Structure Theory (RST) (Mann and Thompson,
1987). The set is divided into a training set (342
texts) and a test set (43 texts). 52 texts, selected
from both sets, were annotated twice. We use these
to estimate human agreement on the task.
Since we focus only on inter-paragraph structure,
intra-paragraph structure was discarded. In most
cases the discourse structure of a text obeyed para-
graph boundaries, but about 21% of the paragraphs
did not correspond to a discourse segment. One way
to deal with such cases is by removing them from
the training set but since the training set is already
relatively small, we decided instead to replace them
by the inter-paragraph tree which comes closest to
the original structure.
In most cases where discourse structure does not
follow paragraph structure the deviation is relatively
minor. For example, Figure 1 shows 10 edus (num-
bered 1 to 10) in 3 paragraphs (A to C, indicated
by boxes). There is no discourse segment corre-
sponding to paragraph B because a subsegment of
the paragraph (consisting of edus 4 to 6) merges
with the previous paragraph and only then is the re-
sulting segment merged with the last edu of B (i.e.
number 7). However, there is a discourse segment
corresponding to the two paragraphs A and B, i.e.
the structure in Figure 1 maps relatively easily to
the inter-paragraph structure ((AB)C) (as opposed
to (A(BC))).
Figure 1: Unambiguous inter-paragraph structure
However, for 8% of paragraphs the mapping was
less straightforward. For example, in the tree in Fig-
ure 2 some of B’s edus attach to the left and some to
the right. Hence it is not immediately clear whether
one should map to ((AB)C) or (A(BC)). In these
cases we used majority voting to resolve the ambi-
guity, i.e. if most of the edus of a paragraph attached
to the left (as in Figure 2) the paragraph was merged
with its left neighbour otherwise it was merged with
its right neighbour. Hence, the tree in Figure 2 is
assumed to have the structure ((AB)C).
Figure 2: Ambiguous inter-paragraph structure
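The voting rule is trivial to state in code; this sketch assumes each edu of the ambiguous paragraph has already been labelled with the direction in which it attaches:

```python
def resolve_attachment(edu_directions):
    # Majority vote over the attachment directions ('left'/'right') of
    # a paragraph's edus; ties and right-majorities attach rightwards.
    left_votes = sum(1 for d in edu_directions if d == 'left')
    return 'left' if left_votes > len(edu_directions) / 2 else 'right'
```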
The few non-binary structures in the training
set were binarised by replacing them with left-
branching binary structures.
Since we want to predict the likelihood of merg-
ing two segments, each pair of adjacent segments
(of any size) can be treated as a training example.
Segment pairs that are contained in a discourse tree
are positive examples and segment pairs not con-
tained in the tree are added as negative examples.
For instance, the tree in Figure 3 contains 3 pos-
itive training examples (A+B, C+D, and AB+CD)
and 7 negative examples (B+C, AB+C, A+BC,
BC+D, B+CD, ABC+D, and A+BCD). Pairs of
non-adjacent segments, e.g. A+D, were ignored be-
cause they are not permitted under the assumption
that discourse structure is a tree with non-crossing
branches (i.e. their probability is 0).
Figure 3: Tree showing inter-paragraph structure
The 342 texts in the RST-DT training set gave rise
to 1,830 positive and 185,691 negative training ex-
amples.
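The extraction of training examples from a binary discourse tree can be sketched as follows; representing segments as leaf-index spans is an assumption of this sketch:

```python
def segment_pairs(tree):
    """Enumerate all pairs of adjacent segments over the leaves of a
    binary tree; a pair is positive iff both segments and their merge
    occur in the tree (single leaves always count as segments)."""
    spans = set()                       # spans covered by tree nodes

    def walk(node, start):
        if not isinstance(node, tuple):
            return 1                    # a leaf covers one unit
        left = walk(node[0], start)
        right = walk(node[1], start + left)
        spans.add((start, start + left + right))
        return left + right

    n = walk(tree, 0)

    def in_tree(span):
        return span[1] - span[0] == 1 or span in spans

    pos, neg = [], []
    for i in range(n):
        for k in range(i + 1, n):
            for j in range(k + 1, n + 1):
                pair = ((i, k), (k, j))
                if in_tree((i, k)) and in_tree((k, j)) and in_tree((i, j)):
                    pos.append(pair)
                else:
                    neg.append(pair)
    return pos, neg
```

Applied to the four-leaf tree of Figure 3, this yields the 3 positive and 7 negative adjacent pairs listed above.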
4 Feature Set
Each training example is described by a set of fea-
tures. The features were deliberately kept fairly
shallow, i.e. they make use only of tokenisation,
part-of-speech and sentence boundary information
(all of which were taken from the original Penn
Treebank mark-up). They do not require any deep
processing, such as parsing.
The model uses features from 7 areas: segment
position, segment length, term overlap, punctuation,
tense, cue phrases, and lexical chains.
Segment position This set comprises 3 features,
indicating whether the left (right) segment of the
pair is the first (last) in the text and whether the
merged segment would be in the beginning, middle
or end of the text. The motivation for these features
is that the beginning and end of a text often have
a special discourse role (at least in this domain),
e.g. the first paragraph frequently leads into the text,
while the last often provides a summary.
Segment length This set consists of 6 features:
the number of words, sentences, and paragraphs of
the left and right segment. Segment length can of-
ten be a clue as to whether two segments should be
merged. For example, very long segments are not
normally merged with very short segments unless
the short segment has a special position, e.g. is the
first or last of the text.
Term overlap We use the formulae in Section 2
to calculate term overlap. This yields a real-valued
score between 0 and 1, which was quantised by
breaking the range into 10 equal intervals.
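The quantisation step amounts to a single line; a minimal sketch:

```python
def quantise(score, bins=10):
    # Map a real-valued score in [0, 1] to one of `bins` equal
    # intervals, labelled 0 .. bins-1; a score of exactly 1.0
    # falls into the top interval.
    return min(int(score * bins), bins - 1)
```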
Punctuation This set comprises 7 features: the
final punctuation mark of the left segment and
whether the left (right) segment contains, starts
with, or ends with a quotation mark. The presence
of quotations in both segments may indicate that
they are related and so should increase their merging
probability. Likewise, the final punctuation mark
can sometimes be an important clue, e.g. if the left
segment ends with a question mark, the next seg-
ment might provide an answer to the question and
this should increase the merging probability.
Tense We use 6 tense features: the first, last, and
majority tense of the left (right) segment. Tense
information was obtained by using regular expres-
sions to extract verbal complexes from the part-of-
speech tagged text and then determine their tense.
Tense often serves as a cue for discourse structure
(Lascarides and Asher, 1993; Webber, 1988b). A
shift from simple past to past perfect, for instance,
can indicate the start of an embedded segment.
Cue phrases This set comprises 4 features. The
first three features are reserved for potential cue
phrases in the first sentence of the right segment.
Cue phrases are identified by scanning a sentence
(or the first 100 characters of it, whichever is
shorter) for an occurrence of one of the cue phrases
listed in Knott (1996). We have three features to
be able to deal with multiple cue phrases (e.g. But
because. . . ). In this case, the feature first cue
phrase will be assigned the first cue word (but),
second cue phrase the second cue word (be-
cause) and so on. Cue phrases are often ambigu-
ous between syntactic and discourse use, as well
as among different rhetorical relations. While our
algorithm does not attempt proper disambiguation
between syntactic and discourse usage, some non-
discourse usages are filtered out on the basis of part-
of-speech information. For example, second can be
an adverb (as in Example 4) as well as an adjec-
tive (as in Example 5) but when used as a discourse
marker it is usually an adverb.
(4) Second, the extra savings would spur so
much extra economic growth that the Treasury
wouldn’t suffer.
(5) It was announced yesterday that the profits have
fallen for the second year in a row.
The fourth cue phrase feature encodes whether
the first sentence of the right segment contains a
discourse anaphor, i.e. an anaphor which refers to
a discourse segment rather than a real world entity,
and if so which it is. An example is the demonstrative
that in Example (6) (cf. Webber (1988a)). We do not attempt proper
anaphora resolution, instead we treat first sentence
occurrences of this and that as discourse anaphors
if they seem to be complete NPs, e.g. are directly
followed by a verb. This method potentially over-
generates, as these expressions could still refer to a
preceding NP, and it potentially undergenerates, as
the pronoun it can sometimes also refer to discourse
segments. However, previous research has found that
demonstrative anaphors rarely refer to NPs, while
the pronoun it rarely refers to discourse segments
(Webber, 1988a).
(6) It’s always been presumed that when the
glaciers receded, the area got very hot. The
Folsum men couldn’t adapt, and they died out.
That’s what is supposed to have happened.
Lexical chains This set comprises 28 features.
The idea of using lexical chains as indicators of lex-
ical cohesion goes back to Morris and Hirst (1991).
A lexical chain is a sequence of semantically related
words and can indicate the presence and extent of
subtopics in a text. We use our own implementation
to compute chains.
A distinction is made between common noun
chains, which are built on the basis of semantic re-
latedness using WordNet (Miller et al., 1990), and
proper noun chains, which contain nouns not found
in WordNet and are based on co-reference rather
than semantic relatedness. As a first step, nouns are
extracted and lemmatised using the Morpha anal-
yser (Minnen et al., 2001) and then looked up in
WordNet. If no entry can be found and the noun is a
compound noun, the first lexeme is removed and the
remaining string is looked up until an entry is found
or only one lexeme remains. For example, if chief
executive officer could not be found in WordNet, our
algorithm would try executive officer and then of-
ficer. Each term that can be found in WordNet is
treated as a potential element of a common noun
chain, even if it is strictly speaking a proper noun.
This allows chains like Mexico – country – Chile. If
a noun cannot be found in WordNet it is treated as a
potential member of a proper noun chain.
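The compound-noun backoff can be sketched as follows; `in_wordnet` stands in for an actual WordNet membership test, and the toy vocabulary in the usage below is an illustrative assumption:

```python
def lookup_with_backoff(noun, in_wordnet):
    # Drop the first lexeme of a compound until the remainder is
    # found in WordNet or only one lexeme remains, e.g.
    # "chief executive officer" -> "executive officer" -> "officer".
    lexemes = noun.split()
    while len(lexemes) > 1 and not in_wordnet(" ".join(lexemes)):
        lexemes = lexemes[1:]
    head = " ".join(lexemes)
    # Nouns not found at all become candidates for proper noun chains.
    return head, in_wordnet(head)
```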
A potential problem for lexical chains is that
words can have more than one sense and seman-
tic relatedness depends on the sense rather than the
word itself. We take a greedy approach to word
sense disambiguation: while a noun is in a chain on
its own, the algorithm is agnostic about its sense but
this changes when another noun is added. A new
noun n is added by comparing each of its senses
to the senses of the members of existing chains, and
a score is calculated for each sense pair depending
on the WordNet distance between them. Only dis-
tances up to an empirically set cut-off point count as
a match, where the cut-off point depends on whether
the term is a proper noun and on the nature of the se-
mantic relation (only hypernym, hyponym and syn-
onym relations are considered). If there are one or
more matches, the noun is added with the sense that
achieved the highest score to the chain c with which
this score was achieved. If c contains only one noun
n', all senses of n' are removed apart from the sense
with which the match was achieved. Repeated oc-
currences of the same noun in a text are placed in
the same chain, i.e. it is assumed that a word keeps
its sense throughout the text.
When all common noun chains have been built,
the significance of each chain is assessed and chains
that are not considered significant are deleted. To
be considered significant a chain has to contain at
least two nouns (or two occurrences of the same
noun) and the Gsig (see equation 3) averaged over
all its elements either has to be relatively high or
the chain has to be relatively long compared to the
overall length of all other chains, where length is
measured as the number of “hits” a chain has in
the text.2 For example, Wall Street Journal articles
frequently contain expressions of date, such as De-
cember, month, Tuesday, but these do not normally
make interesting chains as they are high frequency
expressions and the appearance of various date ex-
pressions throughout the text does not normally in-
dicate a subtopic, i.e. it does not mean that the text
is “about” time and date expressions. However, if
time and date expressions are very frequent in the text,
this may be an indicator that these do indeed form a
subtopic and that the chain should be retained.
Proper noun chains are built for words not in
WordNet. Chain membership is determined on the
basis of identity, i.e. a chain contains repeated oc-
currences of the same noun. Some proper noun
phrase matching is done. For example, the expres-
sions U.S. District Judge Peter Smith, Judge Smith,
and Mr. Smith are treated as referring to the same en-
tity and can therefore be placed in the same chain.
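One way to sketch such matching is to compare the non-title lexemes of two mentions; the paper does not specify the exact procedure, so both the heuristic and the small title list below are illustrative assumptions:

```python
TITLES = {"Mr.", "Mrs.", "Ms.", "Judge", "U.S.", "District"}

def same_person(name_a, name_b):
    # Treat two person-name mentions as co-referring if they share
    # their final non-title lexeme (taken to be the surname).
    core = lambda name: [w for w in name.split() if w not in TITLES]
    a, b = core(name_a), core(name_b)
    return bool(a) and bool(b) and a[-1] == b[-1]
```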
When all proper noun chains have been built, those
that contain only one element (i.e. one occurrence of
a term) are removed. All other chains are retained.
Note, unlike most approaches that make use of
lexical chains, we do not break a chain in two if
too many sentences intervene between the individ-
ual chain elements; chains are continued as long as
new elements can be found. However, the algo-
rithm keeps track of where in the text chain elements
were found. If a chain skips one or two paragraphs
this is actually an important clue because it can in-
dicate that the two paragraphs form an embedded
segment. This is especially true if there are also
chains which start in the left paragraph and end in
the right. For example, Figure 4 shows a text with 5
paragraphs (A to E) and two lexical chains. Chain 1
spans the whole text but skips paragraphs B and C,
while chain 2 only spans paragraphs B and C. A sit-
uation like this makes it likely that B and C should
be merged before either of them is merged with an-
other paragraph. Hence Tree 1 in Figure 5 should be
more likely than Tree 2. For this analysis it is cru-
cial that chain 1 is not broken into two. Obviously,
for very long texts the situation will be slightly dif-
ferent and there will be circumstances where a chain
should be broken.
2 Both thresholds were empirically set.
Figure 4: A chain skipping two segments
(a) Tree 1    (b) Tree 2
Figure 5: Possible tree structures
The individual chain features distinguish between
proper and common noun chains. The reason for
this is that the former are likely to be more reliable
as they are based on term identity rather than seman-
tic relatedness. For both types the features encode
whether and how many chains:
- span the two segments
- exclusively span the two segments (i.e. start in
  the left segment and end in the right)
- start or end in the left (right) segment
- skip both of the segments
- exclusively skip the two segments (i.e. skip
  both segments but none of the neighbouring
  segments)
- skip one of the two segments
- exclusively skip the left (right) segment
To combine all features, we trained a maximum
entropy model (see e.g. Ratnaparkhi (1998)) on the
training set. Each feature is automatically assigned
a weight reflecting its usefulness. Once trained the
model outputs a probability distribution over the
classes merge and don’t merge for each pair of seg-
ments, based on the weighted features for the pair.
To prevent the model from overfitting we used a fea-
ture cut-off of 10, i.e. feature-value pairs that occur
fewer than 10 times in the training set were discarded.
5 Experiments
As described in Section 2, the trained model was
combined with the clustering method to build trees
for the test set. These were evaluated against the
manually built discourse trees. Precision (P) and re-
call (R) were defined in accordance with the PARSE-
VAL measures (Black et al., 1991), i.e. precision is
      random   RB       TO       LB       ME       ME-LC    ME-TO    ME-LCTO  human*
P     44.37%   36.76%   49.98%   53.52%   58.06%   55.86%   57.12%   55.26%   64.37%
R     46.71%   40.35%   52.42%   56.23%   60.78%   58.29%   59.70%   57.69%   64.60%
F     45.05%   37.58%   50.79%   54.27%   59.00%   56.66%   58.00%   56.07%   64.34%

Table 1: Results on RST-DT test set (* on doubly annotated set)
defined as the number of correct nodes (i.e. match-
ing brackets) divided by the number of nodes in the
automatically built tree and recall as the number of
correct nodes divided by the number of nodes in the
manually built tree. Precision and recall are com-
bined in the f-score (F), defined as F = 2PR/(P + R).
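These measures can be computed directly from the two sets of bracketed spans; a minimal sketch:

```python
def parseval(predicted, gold):
    # PARSEVAL-style precision, recall and f-score over sets of
    # bracketed spans (matching brackets count as correct nodes).
    correct = len(predicted & gold)
    p = correct / len(predicted)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```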
Table 1 shows the results. We compared the
performance of our model (ME) to Yaari’s (1997)
method of building trees based on term overlap
(TO). In addition, three baselines were used: merg-
ing segments randomly (results averaged over 100
runs), producing a right-branching tree by always
merging the last two segments (RB) and producing
a left-branching tree by always merging the first two
segments (LB). Finally, an upper bound was calcu-
lated by comparing the trees for the doubly anno-
tated text files in the RST-DT. Note that the doubly
annotated data set is slightly different from the test
set, hence the upper bound can only give an indica-
tion of the human performance on this task.
The maximum entropy model outperforms all
other methods on precision, recall and f-score. The
difference in correct discourse segments (true pos-
itives) between our method and the next best (i.e.
left-branching) is statistically significant (one-tailed
paired t-test, t=1.72, df=37, p<0.05).
Interestingly, Yaari’s word co-occurrence based
method (TO) is outperformed by left-branching
trees (LB). Furthermore, while Marcu (2000) ar-
gues that right-skewed structures should be consid-
ered better than left-skewed structures, in our exper-
iments, the latter actually outperform the former, i.e.
inter-paragraph structure in the RST-DT is predom-
inantly left-branching. Predictably, human perfor-
mance is better than any of the automatic methods.
To investigate the contribution of our different
feature sets we re-trained the model after removing
lexical chains (ME-LC), term overlap (ME-TO),
and both lexical chains and term overlap (ME-LCTO).
The results are also shown in Table 1. As can
be seen, removing lexical chain features results in
more performance loss than removing term-overlap
features. Thus it seems that lexical chains are
more useful for the task than term-overlap. How-
ever, the performance difference between ME-LC
and ME-TO is not statistically significant (t=0.96,
df=37, p>0.05). Removing both feature sets
(ME-LCTO) still leads to a better performance
than is achieved by left-skewed clustering (LB),
which indicates that other features, such as tense
and cue word features, are able to compensate to
some degree for the absence of chain and term
overlap features. But the difference between LB
and ME-LCTO is again not statistically significant
(t=1.24, df=37, p>0.05).
So far we have not said much about the rhetori-
cal relations that hold between larger discourse seg-
ments. In fact, assigning relations to higher-level
structures is easier than doing so for inter-sentence
structures. One reason for this is that there is much
less variation on inter-paragraph level. For exam-
ple, the RST-DT contains 111 different relations but
only 64 of these are used at inter-paragraph level.
Furthermore, the most frequent relation on inter-
paragraph level (Elaboration-additional) accounts
for a much larger percentage (37%) of all relations
used at this level than does the most frequent rela-
tion on intra-paragraph level (List, 13%). Hence,
always predicting Elaboration-additional would al-
ready achieve 37% accuracy. Being able to reliably
distinguish between Elaboration-additional and the
second most frequent inter-paragraph relation, List,
would guarantee 53% accuracy. In contrast, cor-
rectly predicting the two most frequent relations on
intra-paragraph level would only achieve 26% accu-
racy. We plan to address the prediction of rhetorical
relations between larger discourse segments in fu-
ture work.
6 Conclusion
In this paper, we proposed a machine learning ap-
proach for predicting inter-paragraph structure. In-
ferring inter-paragraph structure can be seen as a
subtask of discourse parsing. While low-level dis-
course parsing relies to a large extent on cue phrases
as predictors for rhetorical structure, these are less
useful for high-level structure. As an alternative,
word co-occurrence measures have been suggested.
In this paper, we took a different approach and em-
ployed a machine learning approach to build a com-
plex model of segment relatedness which was then
combined with a clustering algorithm. The use
of machine learning enabled us to combine con-
textual cues from several areas, such as word co-
occurrence, lexical chains, changes in tense pat-
terns, punctuation etc. Our model outperformed
a word co-occurrence measure as well as left- or
right-branching trees.
In future work, we plan to extend our approach
to predict rhetorical relations between paragraphs.
While an empirical analysis revealed that one can
achieve a relatively high accuracy by just predicting
the most frequent relation, it is still worthwhile to
investigate how much better one can do with more
sophisticated methods. There is also clearly a re-
lationship between structure and relation. For ex-
ample, non-binary structures are more likely to be
joined by a List relation than by an Explanation re-
lation. Hence, inferring structure and predicting re-
lations should be interleaved.
It would also be interesting to investigate
whether it would be useful to relax the constraint
that inter-paragraph structure is a tree with non-
crossing branches. Some researchers have sug-
gested that higher level discourse structure may be
better represented if one allows crossing branches
(Knott et al., 2001). In principle, the approach sug-
gested here could be used to generate such struc-
tures if one removed the constraint that only adja-
cent segments can be merged.
Finally, it remains to be seen to what extent our
results carry over to other domains. So far, the RST-
DT remains the only publicly available data set an-
notated with discourse structure but a larger corpus
is currently being annotated as part of the Penn Discourse
Treebank project (http://www.cis.upenn.edu/~pdtb/).
It would be interesting to apply
our methods to this data set as well.

References

E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Gr-
ishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek,
J. Klavans, M. Liberman, M. Marcus, S. Roukos,
B. Santorini, T. Strzalkowski. 1991. A procedure
for quantitatively comparing the syntactic coverage of
English grammars. In Proceedings of the 4th DARPA
Workshop on Speech and Natural Language, 306–
311.

L. Carlson, D. Marcu. 2001. Discourse tagging man-
ual. Technical Report ISI-TR-545, Information Sci-
ences Institute, Los Angeles, CA, 2001.

L. Carlson, D. Marcu, M. E. Okurowski. 2002.
RST Discourse Treebank. Linguistic Data Con-
sortium. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002T07, 2002.

B. Everitt. 1993. Cluster Analysis. Edward Arnold,
London, 3rd edition.

A. Knott, J. Oberlander, M. O’Donnell, C. Mellish.
2001. Beyond elaboration: The interaction of re-
lations and focus in coherent text. In T. Sanders,
J. Schilperoord, W. Spooren, eds., Text Representa-
tion: Linguistic and Psycholinguistic Aspects, 181–
196. Benjamins, Amsterdam.

A. Knott. 1996. A Data-Driven Methodology for Moti-
vating a Set of Coherence Relations. Ph.D. thesis, De-
partment of Artificial Intelligence, University of Edin-
burgh.

A. Lascarides, N. Asher. 1993. Temporal interpreta-
tion, discourse relations and common sense entail-
ment. Linguistics and Philosophy, 16(5):437–493.

W. C. Mann, S. A. Thompson. 1987. Rhetorical struc-
ture theory: A theory of text organization. Technical
Report ISI/RS-87-190, Information Sciences Institute,
Los Angeles, CA, 1987.

D. Marcu. 2000. The Theory and Practice of Discourse
Parsing and Summarization. MIT Press, Cambridge,
MA.

G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. J.
Miller. 1990. Introduction to WordNet: An on-line
lexical database. International Journal of Lexicogra-
phy, 3(4):235–312.

G. Minnen, J. Carroll, D. Pearce. 2001. Applied mor-
phological processing of English. Natural Language
Engineering, 7(3):207–223.

J. Morris, G. Hirst. 1991. Lexical cohesion computed
by thesaural relations as an indicator of the structure
of text. Computational Linguistics, 17(1):21–48.

M. F. Porter. 1980. An algorithm for suffix stripping.
Program, 14:130–137.

A. Ratnaparkhi. 1998. Maximum Entropy Models for
Natural Language Ambiguity Resolution. Ph.D. the-
sis, Computer and Information Science, University of
Pennsylvania.

G. Salton, C. Buckley. 1994. Term-weighting ap-
proaches in automatic text retrieval. Information Pro-
cessing and Management, 24(5):513–617.

R. Soricut, D. Marcu. 2003. Sentence level discourse
parsing using syntactic and lexical information. In
Proceedings of the 2003 Human Language Technol-
ogy Conference of the North American Chapter of the
Association for Computational Linguistics.

H. A. Stark. 1988. What do paragraph markings do?
Discourse Processes, 11:275–303.

B. L. Webber. 1988a. Discourse deixis: Reference to
discourse segments. In Proceedings of the 26th An-
nual Meeting of the Association for Computational
Linguistics, 113–122.

B. L. Webber. 1988b. Tense as discourse anaphor. Com-
putational Linguistics, 14(2):61–73.

Y. Yaari. 1997. Segmentation of expository texts by hi-
erarchical agglomerative clustering. In Proceedings
of the 2nd International Conference on Recent Ad-
vances in Natural Language Processing, 59–65.
