Discourse Parsing: A Decision Tree Approach 
Tadashi Nomoto 
Advanced Research Laboratory 
Hitachi Ltd. 
2520 Hatoyama Saitama, 
350-0395 Japan 
nomoto@harl, hitachi, co. jp 
Yuji Matsumoto 
Nara Institute of Science and Technology 
8916-5 Takayarna Ikoma Nara, 
630-0101 Japan 
matsu@is, aist-nara, ac. jp 
Abstract 
The paper presents a new statistical method, for 
parsing discourse. A parse of discourse is defined 
as a set of semantic dependencies among sentences 
that make up the discourse. A collection of news 
articles from a Japanese economics daily are man- 
ually marked for dependency and used as a train- 
ing/testing corpus. We use a C4.5 decision tree 
method to develop a model of sentential dependen- 
cies. However, rather than to use class decisions 
made by C4.5, we exploit information on class dis- 
tributions to rank possible dependencies among sen- 
tences according to their probabilistic strength and 
take a parse to be a set of highest ranking dependen- 
cies. We also study effects of features such as clue 
words, distance and similarity on the performance of 
the discourse parser. Experiments have found that 
the method performs reasonably well on diverse text 
types, scoring an accuracy rate of over 60%. 
1 Introduction 
Attempts to the automatic identification of a struc- 
ture in discourse have so far met with a limited 
success in the computational linguistics literature. 
Part of the reason is that, compared to sizable data 
resources available to parsing research such as the 
Penn Treebank (Marcus et al., 1993), large cor- 
pora annotated for discourse information are hard to 
come by. Researchers in discourse usually work with 
a corpus of a few hundred sentences (Kurohashi and 
Nagao, 1994; Litman and Passonneau, 1995; Hearst, 
1994). The lack of a large-scale corpus has made it 
impossible to talk about results of discourse studies 
with the sufficient degree of reliability. 
In the work described here, we created a corpus 
with discourse information, containing 645 articles 
from a Japanese economic paper, an order of magni- 
tude larger than any previous work on discourse pro- 
cessing. It had a total of 12.770 sentences and 5,352 
paragraphs. Each article in the corpus was manually 
annotated for a discourse dependency" relation. We 
then built a statistical discourse parser based on the 
C4.5 decision tree method (Quinlan, 1993), which 
was trained and tested on the corpus we have cre- 
Figure 1: A discourse tree. 'S' denotes a sentence. 
ated. The design of a parser was inspired by Haruno 
(1997)'s work on statistical sentence parsing. 
The paper is organized as follows. Section 2 
presents general ideas about statistical parsing as 
applied to the discourse, After a brief introduction 
to some of the points of a decision tree model, we dis- 
cuss incorporating a decision tree within a statistical 
parsing model. In Section 3, we explain how we have 
built an annotated corpus. There we also describe 
a procedure of experiments we have conducted, and 
conclude the section with their results. 
2 Statistical Discourse Parsing 
First, let us make ourselves clear about what we 
mean by parsing a discourse. The job of parsing is 
to find whatever dependencies there are among ele- 
ments that make up a particular linguistic unit. In 
discourse parsing, elements one is interested in find- 
ing dependencies among correspond to sentences, 
and a level of unit under investigation is a discourse. 
We take a naive approach to the notion of a depen- 
dency here. We think of it as a re!ationship between 
a pair of sentences such that the interpretation of one 
sentence in some way depends on that of the other. 
Thus a dependency relationship is not a structural 
one, but rather a semantic or rhetorical one. 
The job of a discourse parser is to take as input 
216 
a discourse, or a set of sentences that make up a 
discourse and to produce as output a parse, or a 
set of dependency relations (which may give rise to 
a tree-like structure as in Figure 1). In statistical 
parsing, this could be formulated as a problem of 
finding a best parse with a model P(T I D), where 
T is a set of dependencies and D a discourse. 
Tbest = arg maXTP(T \[ D ) 
Tbe,t is a set of dependencies that maximizes the 
probability P(T I D). Further, we assume that a 
discourse D is a set of sentences marked for some 
pre-defined set of features F = {fl,..-,fn}. Let 
CF ($1) be a characterization of sentence $1 in terms 
of a feature set F. Then for D = {S1,...,Sm}, 
CF(D) = {CF(S1),CF(S2),...,CF(Sm)}. Let us 
assume that: 
P(T I D) = rI P(A ~ B \[ CF(D)). 
A+..-BET 
'A ~- B' reads like " sentence B is dependent on 
sentence A", where A,B E {$i,... ,Sin}. The prob- 
ability of T being an actual parse of discourse D is 
estimated as the product of probabilities of its ele- 
ment dependencies when a discourse has a represen- 
tation CF(D). We make a usual assumption that 
element dependencies are probabilistically iiadepen- 
dent. 
2.1 Decision Tree Model 
A general framework for discourse parsing described 
above is thus not much different from that for sta- 
tistical sentence parsing. Differences, however, lie in 
a makeup of the feature set F. Rather than to use 
information on word forms, word counts, and part- 
of-speech tags as in much research on statistical sen- 
tence parsing, we exploit as much information as can 
be gleaned from a discourse, such as lexical cohesion, 
distance, location, and clue words, to characterize a 
sentence. Therefore it is important that you do not 
end up with a mountain of irrelevant features. 
A decision tree method represents one of ap- 
proaches to classification problems, where features 
are ranked according to how much they contribute to 
a classification, and models are then built with fea- 
tures most relevant to that classification. Suppose, 
for example, that you work for a travel agency and 
want to find out what features of a hotel are more 
important for tourists, based on data from your cus- 
tomers like Table 1. With decision tree techniques, 
you would be able to tell what features are more 
closely associated with customers' preferences. 
The aim of the decision tree approach is to in- 
duce rules from data that best characterize classes. 
A particular approach called C4.5 (Quinlan, 1993), 
which we adopt here, builds rules by recursively di- 
viding the training data into subsets until all divi- 
sions contain only single class cases. In which subset 
Table 1: An illustration: hotel preferences. 
'Bath/shower' means a room has a bath, a shower or 
none. 'Time' means the travel time in min. from an 
airport. 'Class' indicates whether a particular hotel 
is a customer's choice. 
bath/shower time room rate class 
1 bath 15 expensive no 
2 shower 20 inexpensive no 
3 shower 10 inexpensive yes 
4 bath 15 moderate yes 
5 bath 25 moderate yes 
6 none 20 inexpensive no 
7 shower 50 inexpensive no 
Figure 2: A decision tree for the hotel example. 
bath/shower   
h 
room raze   
ive moderate 
NO YES 
I showcr~one 
time NO 
YES NO 
a particular case is placed is determined by the out- 
come of a 'test' on that case. Let us explain how 
this works by way of the hotel example above. Sup- 
pose that the first test is "bath/shower', which has 
three outcomes, bath, shower, and none. Then the 
data set breaks up into three groups, {1,4,5} (bath), 
{2,3,7} (shower), and {6}(none). Since the last 
group {6} consists of only a single case, there is no 
further division of the group. The bath group, being 
a multi-class set, is further divided by a test "room 
rate", which produces two subdivisions, one with { 1 } 
(expensive), and the other with {4,5} (moderate). 
Either set now consists of only single class cases. 
For the shower group, applying the time test(<=15) 
would produce two subsets, one with {3}, and the 
other with {2,7}. l Either one now contains cases 
from a single class. A decision tree for divisions we 
made is shown in Figure 2. 
Now compare a hand-created decision tree in Fig- 
i Here we choose a midpoint between I0 and 20 as in C4.5. 
217 
Figure 3: A tree for the hotel example by C4.5. 
Figures in parentheses indicate the number of cases 
that reach relevant nodes. A figure after a slash, eg. 
(4/1), indicates the number of misclassified cases. 
room rale (7) 
NO (1) YES (2) NO (4/I) 
ure 2 with one in Figure 3, which is generated by 
C4.5 for the same data. Surprisingly, the latter tree 
consists of only one test node. This happens because 
C4.5 ranks possible tests, which we did not, and ap- 
ply one that gives a most effective partitioning of 
data based on information-theoretic criteria known 
as the gain criterion and the gain ratio criterion. 2 
The intuitive idea behind the criteria is to prefer a 
test with a least entropy, i.e., a test that partitions 
data in such a way that a particular class may be- 
come dominant for each subset it creates. Thus a 
feature that best accounts for a class distribution in 
data is always chosen in preference to others. For 
the data in Table 1, C4.5 determined that the test 
room rate is a best class identifier and everything 
2 The gain criterion measures the effectiveness of partition- 
ing a data set T with respect to a test X, and is defined 
follows. 
gain(X) = info(T) -in/ox(T ) 
Define info(T) to be an entropy of T, that is, the average 
amount of information generated by T. Then we have: 
k 
in/o(T) = - ~-~ freq.(Cj, T) x log 2 freq(Cj' T) 
ITI IVl j=l 
freq(C,T) is the number of cases from a class C divided by 
the sum of cases in T. Now infox(T ) is the average amount 
of information generated by partitioning T with respect to a 
test X. That is, 
in/ox(T) = ~ 1~ \[ info(T,) 
i=1 
Thus a good classifier would give a small value for info X (T) 
and a large value for in/o X (T). 
The gain ratio criterion is a modification to the gain crite- 
rion. It has the effect of making a splitting of a data set less 
intense. 
gain ratio(X) = gain(X)/split info~') 
where: 
split info(X) = - ~ IT, \[ x " \[ Ti I 
I T \[ rag2 I T I *-=l 
The ratio decreases with an increase in the number of splits. 
else is irrelevant to identifying the classes. All that 
one needs to account for the class distribution in Ta- 
ble 1 turn out to be just one feature. So we might 
just as well conclude that the customers are just in- 
terested in the room charge when they pick up a 
hotel. 
A benefit of using the decision tree method is that 
it enables us to identify relevant features for classi- 
fication and disregard those that are not relevant, 
which is particularly useful for a task such as ours, 
where a large number of features are potentially in- 
volved and their relevance to classification is not al- 
ways known. 
2.2 Parsing with Decision Tree 
As we mentioned in section 2, we define discourse 
parsing as a task of finding a best tree T, or a set of 
dependencies among sentences that maximizes P(T I 
D). 
Tbest = arg maxTP(T \[ D) 
F(T \[ D) = H P(A ~ B \[ CF(D)). 
A+--BET 
What we do now is to equip the model with a fea- 
ture selection functionality. This can be done by 
assuming: 
P(A ~ B \[ CF(D)) = P(A ~ B \[ CF(D),DTF) 
Z P(X ~- B \[ CF(D),DTF) 
X<B (1) 
DTF is a decision tree constructed with a feature 
set F by C4.5. 'X < B' means that X is a sen- 
tence that precedes B: P(X ~ Y \[ CF(D),DTv) 
is the probability that sentence Y depends on sen- 
' tence X under the condition that both CF(D) and 
DTF are used. We estimate P, using class distribu- 
tions from the decision tree DTF. For example, we 
have numbers in parentheses after leaves in the de- 
cision tree in Figure 3. They indicate the number of 
cases that reach a particular leaf and also the num- 
ber of misclassified cases. Thus a leaf with the label 
inexpensive has the total of 4 cases, one of which 
is misclassified. This means that we have 3 cases 
correctly classified as "NO" and one case wrongly 
classified. Thus a class distribution for "NO" is 3/4 
and that for "YES" is 1/4. In practice, however, 
we slightly correct class frequencies, using Laplace's 
rule of succession, i.e., x/n ~ x + 1/n + 2. 
Now suppose that we have a discourse D = 
{...,Si,...,Sj,...,Sk,...} and want to know what 
Si depends on, assuming that Si depends on either 
Sj or Sk. To find that out involves constructing 
3 Note that here we are in effect making a claim about 
the structure of a discourse, namely that a sentence modifies 
one that precedes it. Changing it to something like 'X 6 
D,X # B' allows one to have forward as well as backward 
dependencies. 
218 
Figure 4: A hypothetical decision tree. 
dist 
YES (10/3) YES (14/8) 
CF(D) and DTF. Let us represent sentences Sj and 
Sk in terms of how far they are separated from Si, 
measured in sentences. Suppose that dist(Sj) = 2 
and dist(Sj) = 4; that is, sentence S.# appears 2 sen- 
tences behind Si and Sk 4 sentences behind. Assume 
further that we have a decision tree constructed from 
data elsewhere that looks like Figure 4. 
With CF(D) and DTF at hand, we are now in 
a position to find P(A ~ B I CF(D)), for each 
possible dependency Sj ~ Si, and Sk +-- Si. 
P(Sj ~ Si l Cai,t(D),DTai,t) 
= (10- 3 + 1)/(10 + 2). 
= .67 
P(Sk +- Si \[ Caist(D), DTaist) 
= (14 - s + 1)/(14 + 2) 
= .44 
Since Si links with either Sj or Sk, by Equation 1, 
we normalize the probability estimates so that they 
sum to 1. 
P(S i ~ Si I Cd,t(D)) = .67/(.67 + .44) = .60 
P(.-~ ~ Si \[ Caist(D)) = .44/(.67 + .44) = .40 
Recall that class frequencies are corrected by 
Laplace's rule. Let T 1 = {Sj ~ Si} and Tk = {,-~ 
Si} Then P(Tj I D) > P(Tk t D). Thus Tb,,t = Tj. 
We conclude that Si is more likely to depend on Sj 
than SI,. 
2.2.1 Features 
The following list a set of features we used to encode 
a discourse. As a convention, we refer to a sentence 
for which we like to find a dependency as 'B', and a 
sentence preceding 'B' as 'A'. 
<DistSen> records information on how far ahead 
A appears from B, measured in sentences. 
#S(B) - #S(A) 
3InT..Sen_Distance 
'#S(X)' denotes an ordinal number indicating 
the position of a sentence X in a text, i.e., 
#S(kth_sentence) = k. 'Max_Sen_Distance' de- 
notes a distance, measured in sentences, from 
B to A, when B ocurrs farthest from A, i.e., 
#S(last_sentence_in_text) - 1. DistSen thus has 
continuous values between 0 and 1. We discard texts 
which contain no more than one sentence. 
<DistPax> is defined similarly to DistSen, except 
that the distance is measured in paragraphs. 
#Par(B) - #Par(A) 
Max_Par.Distance 
'Par(X)' is a paragraph that contains a sentence 
X, and '#Par(X)? denotes an ordinal number of 
Par(X). 'Max_Par_Distance' is a maximal distance 
one could have between two paragraphs in a text, 
that is, #Par(last_sentence.in_text) - 1. 
<LocSen> defines the location of a sentence by: 
#s(x) 
# S ( Last_S en tence ) 
Here 'Last_Sentence' is the last sentence of a text. 
LocSen takes values between 0 and 1. A discourse- 
initial sentence takes 0, and a discourse-final sen- 
tence 1. 
<LocPax> is defined similarly to DistPa.r. It gives 
information on the location of a paragraph in which 
a sentence X occurs. 
#Par(X) 
#Last._Paragraph 
'#Last_Paragraph' is the position of the last para- 
graph in a text, represented by its ordinal number. 
• <LocW£thinPar> gives information on the location 
of a sentence X within a paragraph in which it ap- 
pears. 
#S(X) - #S(Par_lniL.Sen) 
Length(Par(X)) 
'ParAnit_Sen' refers to the initial sentence of a para- 
graph in which X occurs, 'Length(Par(X))' denotes 
the number of sentences that occur in that para- 
graph. LocWithinPar takes continuous values rang- 
ing from 0 to 1. A paragraph initial sentence would 
have 0 and a paragraph final sentence 1. 
<LenText> the length of a text, measured in 
Japanese characters. 
<LenSenA> the length of A in Japanese characters. 
<LenSenB> the length of B in Japanese characters. 
<Sire> gives information on the lexical similarity 
between A and B, based on an information-retrieval 
measure known as tf • idf. 4 One important point 
4 For a word j E Si, its weight u'i2 is defined by: 
N wi.i = tf, j . log d~j 
219 
hereis that we did not use words per se in mea- 
suring the similarity. What we did was to break 
up nominals from sentences into simple characters 
(grapheme) and use only them to measure the sim- 
ilarity. We did this to deal with abbreviations and 
rewordings, which we found quite frequent in the 
corpus we used. 
<Sire2> same as Sire feature, except that the sim- 
ilarity is measured between A and Par(B), a para- 
graph in which B occurs. We define Siva2 as 
'SIM(A,Concat(Par(B)))' (see footnote 4 for the 
definition of SIM), where 'Concat(Par(B))' is a 
concatenation of sentences in Par(B). 
<IsATit\].e> indicates whether A is a title. We re- 
garded a title as a special sentence that initiates a 
discourse. 
<Clues> differs from features above in that it does 
not refer to any single feature but is a collective term 
for a set of clue-related features, each of which is 
used to indicate the presence or absence of a rele- 
vant clue in A and B. We examined N most frequent 
words found in a corpus and associated each with a 
different clue feature. We experimented with cases 
where N is 0, 100, 500 and 1000. A sentence can 
be marked for a multiple number of clue expressions 
at the same time. For a clue c, an associated Clues 
feature d takes one of the four values, depending 
on the way c appears in A and B. c' = 0 if c ap- 
pears in neither A or B; d = 1 if c appears in both 
A and B; d = 2 if c appears in A and not in B; 
and d = 3 if c appears not in A but in B. We con- 
sider clue expressions from the following grammat- 
ical classes: nominals, adjectives, demonstratives, 
adverbs, sentence connectives, verbs, sentence-final 
particles, topic-marking particles, and punctuation 
marks. 5 While we did not consider a complex clue 
expression, which can be made up of multiple ele- 
ments from various grammatical classes 6 , it is pos- 
d/j is the number of sentences in the text which have an oc- 
currence of a word j. N is the total number of sentences in 
the text. The tf.idf metric has the property of favoring high 
frequency words with local distribution. For a pair of sen- 
fences X = {zl .... } and \]" = {YI,--.}, where z and y are 
words, we define the lexical similarity between X and Y by: 
t 
SIM(X, Y) = i=1 
w(xi)2, w(yi)2 
i.~l i=l 
They- are extracted from a corpus by a Japanese tokenizer 
program (Sakurai and Hisamitsu, 1997). 
6 English examples would be for example, as a result, 
etc., which are thought of as an indicator of a discourse 
relationship. 
Table 2: Top 20 lexical clues. Suushi below is a 
grammar term of a class of numerals. Since there 
are infinitely many of them, we decided not to treat 
them individually, but to represent them collectively 
with a single feature .~uushi. 
lemma explanation 
o 
suushi* 
wa 
suru 
J 
f 
mo ( 
) 
nado 
nai 
aru 
kara 
koto 
dewa 
nen 
hi 
no 
comma 
period 
numerals 
topic marker 
'do' 
right angular parenthesis 
left angular parenthesis 
topic marker 
left parenthesis 
right parenthesis 
'so forth' 
dash 
negative auxiliary 
'exist','be' 
'from' 
nominalizer 
topic marker 
'year' 
'day' 
possessive particle 
sible to think of a complex clue in terms of its com- 
ponent clues for which a sentence is marked. 
Classes For a sentence pair A and B, the class is 
either yes or no, corresponding to the presence or 
absence of a dependency link from B to A. 
The features above are more or less plucked from 
the air. Some are motivated, and some are less so. 
Our strategy here, however, is to rely on the deci- 
sion tree mechanism to select 'good' features and 
filter out features that are not relevant to the class 
identification. 
2.2.2 Notes on Discourse Encoding 
Let us make further notes on how to encode a dis- 
course with the set of features we have described. 
We characterize a sentence in relation to its po- 
tential "modifyee" sentence, a sentence in a dis- 
course which it is likely to depend on. Thus en- 
coding is based on a pair of sentences, rather than 
on a single sentence. For example, a discourse 
D = {St, $2, $3 } would give a set of possible de- 
pendency pairs "P(D) = {< S2,Si >,< S3, Si >,< 
S,,& >,< S3,S~_ >,< S~,S3 >,< S2,Ss >}. We 
assume that CF(D) = CF(P(D)). Furthermore, we 
may want to constrain P by restricting the attention 
to pairs of a particular type. If we are interested 
only in backward dependencies, then we will have 
220 
?(D) = {< Sl,& >,< Sl,S3 >,< S2, S3 >}. 
In experiments, we assumed a discourse as con- 
sisting of backward dependency pairs and encoded 
each pair with the set of features above. Assump- 
tions we made about the structure of a discourse are 
the following: 
1. Every sentence in a discourse has exactly one 
preceding "modifyee" to link to. 
2. A discourse may have crossing dependencies. 
3 Evaluation 
3.1 Data 
To evaluate our method, we have done a set of ex- 
periments, using data from a Japanese economics 
daily (Nihon-Keizai-Shimbun-Sha, 1995). They con- 
sist of 645 articles of diverse text types (prose, nar- 
rative, news report, expository text, editorial, etc.), 
which are randomly drawn from the entire set of arti- 
cles published during the year. Sentences and para- 
graphs contained in the data set totalled 12,770 and 
5,352, respectively. We had, on the average, 984.5 
characters, 19.2 sentences, and 8.2 paragraphs, for 
one article in the data. Each sentence in an article 
was annotated with a link to its associated modi- 
fyee sentence. Annotations were given manually by 
the first author. Each sentence was associated with 
exactly one sentence. 
In assigning a link tag to a sentence, we did not 
follow any specific discourse theories such as Rhetor- 
ical Structure Theory (Mann and Thompson, 1987). 
This was because they often do not provide informa- 
tion on discourse relations detailed enough to serve 
as tagging guidelines. In the face of this, we fell 
back on our intuition to determine which sentence 
links with which. Nonetheless, we followed an in- 
formal rule, motivated by a linguistic theory of co- 
hesion by Halliday and Hasan (1990): which says 
that we relate a sentence to one that is contextually 
most relevant to it, or one that has a cohesive link 
with it. This included not only rhetorical relation- 
ships such as 'reason', 'cause-result', 'elaboration', 
'justification' or 'background' (Mann and Thomp- 
son, 1987), but also communicative relationships 
such as 'question-answer' and those of the 'initiative- 
response' sort (Fox, 1987; Levinson, 1994; Carletta 
et al., 1997). 
Since the amount of data available at the time of 
the experiments was rather moderate (645 articles), 
we decided to resort to a test procedure known as 
cross-validation. The following is a quote from Quin- 
lan (1993). 
"In this procedure, the available data is di- 
vided into N blocks so as to make each 
block's number of cases and class distri- 
bution as uniform as possible. N differ- 
ent classification models are then built, in 
each of which one block is omitted from the 
training data, and the resulting model is 
tested on the cases in that omitted block." 
The average performance over the N tests is sup- 
posed to be a good predictor of the performance of 
a model built from all the data. It is common to set 
N=IO. 
However, we are concerned here with the accuracy 
of dependency parses and not with that of class de- 
cisions by decision tree models. This requires some 
modification to the way the validation procedure is 
applied to the data. What we did was to apply the 
procedure not on the set of cases as in C4.5, but 
on the set of articles. We divided the set of articles 
into 10 blocks in such a way that each block contains 
as uniform a number of sentences as possible. The 
procedure would make each block contain a uniform 
number of correct dependencies. (Recall that every 
sentence in an article is manually annotated with ex- 
actly one link. So the number of correct links equals 
that of sentences.) The number of sentences in each 
block ranged from 1,256 to 1,306. 
The performance is rated for each article in the 
test set by using a metric: 
number of correct dependencies retrieved precision = 
total number of dependencies retrieved 
At each validation step, we took an average perfor- 
mance score for articles in the test set as a precision 
of that step's model. Results from 10 parsing models 
were then averaged to give a summary figure. 
3.2 Results and Analyses 
We list major results of the experiments in Table 3, 
The results show that clues are not of much help 
to improve performance. Indeed we get the best 
result of 0.642 when N = 0, i.e., the model does 
not use clues at all. We even find that an overall 
performance tends to decline as models use more Of 
the words in the corpus as clues. It is somewhat 
tempting to take the results as indicating that clues 
have bad effects on the performance (more discus- 
sion on this later). This, however, appears to run 
counter to what we expect from results reported in 
prior work on discourse(Kurohashi and Nagao, 1994; 
Litman and Passonneau, 1995; Grosz and Sidner, 
1986; Marcu, 1997), where the notion of clues or 
cue phrases forms an important part of identifying 
a structure of discourse7 
Table 4 shows how the confidence value (CF) af- 
fects the performance of discourse models. The CF 
7 One problem with earlier work is that evaluations are 
done on very small data; 9 sections from a scientific writing 
(approx. 300 sentences) (Kurohashi and Nagao, 1994): 15 
narrathes (I113 clauses) (Lhman and Passonneau. 1995): 3 
texts (Marcu, 1997). It is not clear how reliable estimates of 
performance obtained there would be. 
221 
Table 3: Effects of lexical clues on the performance of models. N is the number of clues used. Figures in 
parentheses represent the ratio of improvements against a model with N = 0. 
\[ N=0 N= 100 N=500 N=IO00 t 
\[ 0.642 0.635 (-1.100%) 0.632 (-1.580%) 0.628 (-2.220%) I 
Table 4: Effects of pruning on performance. CF refers to a confidence value. Small CF values cause more 
prunings than large values. 
Clues CF=5% CF=10% CF=25% CF=50% CF=75% CF=95% 
0 0.626 0.636 0.642 0.633 0.625 0.624 
100 0.629 0.627 0.635 0.626 0.614 0.609 
500 0.626 0.630 0.632 0.616 0.604 0.601 
1000 0.628 0.627 0.628 0.616 0.601 0.597 
represents the extent to which a decision tree is 
pruned; A small CF leads to a heavy pruning of 
a tree. The tree pruning is a technique by which 
to prevent a decision tree from fitting training data 
too closely. The problem of a model fitting data too 
closely or overfitting usually causes an increase of 
errors on unseen data. Thus a heavier pruning of a 
tree would result in a more general tree. 
While Haruno (1997) reports that a less pruning 
produces a better performance for Japanese sentence 
parsing with a decision tree, results we got in Table 4 
show that this is not true with discourse parsing. In 
Haruno (1997), the performance improves by 1.8% 
from 82.01% (CF = 25%) to 83.35% (CF = 95%). 
25% is the default value for CF in C4.5, which is 
generally known to be the best CF level in machine 
learning. Table 4 shows that this is indeed the case: 
we get a best performance at around CF = 25% for 
all the values of N. 
Let us turn to effects that each feature might have 
on the model's performance. For each feature, we re- 
moved it from the model and trained and tested the 
model on the same set of data as before the removal. 
Results are summarized in Table 5. It was found 
that, of the features considered, DistSen, which en- 
codes a distance between two sentences, contributes 
most to the performance; at N = 0, its removal 
caused as much as an 8.62% decline in performance. 
On the other hand, lexical features Sire and Sire2 
made little contribution to the overall performance; 
their removal even led to a small improvement in 
some cases, which seems consistent with the earlier 
observation that lexical features are a poor class pre- 
dictor. 
To further study effects of lexical clues, we have 
run another experiment where clues are limited to 
sentence connectives (as identified by a tokenizer 
program). Clues included any connective that has 
an occurrence in the corpus, which is listed in Ta- 
ble 6. Since a sentence connective is relevant to es- 
tablishing inter-sententiaI relationships, it was ex- 
pected that restricting clues to connectives would 
improve performance. As with earlier experiments, 
we have run a 10-fold cross validation experiment on 
the corpus, with 52 attributes for lexical clues. We 
found that the accuracy was 0.642. So it turned out 
that using connectives is no better than when we do 
not use clues at all. 
Figure 5 gives a graphical summary of the signif- 
icance of features in terms of the ratio of improve- 
ment after their removal (given as parenthetical fig- 
ures in Table 5). Curiously, while the absence of 
the DistSen feature caused a largest decline, the 
significance of a feature tends to diminish with the 
growth of N. The reason, we suspect, may have to 
do with the susceptibility of a decision tree model 
to irrelevant features, particularly when their num- 
ber is large. But some more work needs to be done 
before we can say anything about how irrelevancy 
affects a parser's performance. 
One caveat before leaving the section; the experi- 
ments so far did not establish any correlation, either 
positive or negative, between the use of lexical infor- 
mation and the performance on discourse parsing. 
To say anything definite would probably require ex- 
periments on a corpus much larger than is currently 
available. However, it would be safe to say that dis- 
tance and length features are more prominent than 
lexical features when a corpus is relatively small. 
4 Conclusion 
The paper demonstrated how it is possible to build a 
discourse parser which performs reasonably well on 
diverse data. It relies crucially on (a) feature selec- 
tion by a decision tree and (b) the way a discourse 
is encoded. While we have found that distance and 
222 
Table 5: Measuring the significance of features. Figures below indicate how much the performance is affected 
by the removal of a feature. 'REF' refers to a model where no feature is removed. 'Clues' indicates the number 
of clues used for a model. A minus sign at a feature indicates the removal of that feature from a model. 
FEATURES/#CLUES 0 100 500 1000 
REF. 0.642 0.635 0.632 0.628 
DistSen- 0.591 
LenText- 0.626 
LocWithinPar- 0.631 
Sim- 0.643 
(-8.620%) 0.598 (-6.180%) 
(-2.550%) 0.626 (-1.430%) 
(--1.740%) 0.627 (-1.270%) 
(+0.160%) 0.640 (+0.790%) 
0.604 (-4.630%) 0.603 (-4.140%) 
0.620 (-1.930%) 0.623 (-0.800%) 
0.624 (-1.280%) 0.628 (±0.000%) 
0.632 (±0.000%) 0.630 (+0.320%) 
0.638 (+0.950%) 0.629 (+0.160%) 
0.632 (±0.000%) 0.632 (+0.640%) 
0.629 (-0A70%) 0.631 (+0.480%) 
0.631 (-0.150%) 0.627 (-0.150%) 
0.634 (+0.320%) 0.630 (+0.320%) 
0.628 (-0.630%) 0.628 (±0.000%) 
0.631 (-0.150%) 0.628 (±0.000%) 
Sim2- 
LenSenA- 
LenSenB- 
LocPar- 
LocSen- 
DistPa.r- 
IsAtitle- 
0.644 (+0.320%) 0.647 (+1.860%) 
0.641 (-0.150%) 0.638 (+0.480%) 
0.642 (±0.000%) 0.639 (+0.630%) 
0.640 (-0.310%) 0.637 (+0.320%) 
0.639 (-0.460%) 0.631 (-0.630%) 
0.636 (-0.940%) 0.631 (-0.630%) 
0.638 (-0.620%) 0.635 (~0.000%) 
2 
2 
'~ -2 
o 
< -6 
>o -S 
-10 
0 
I 
Figure 5: The ratio of improvement after removal of feature. 
. ~, .,.,~,1,,, ........... ,4. ........... I I I 
: :2 ::2 &% .",'7 -'. r ~"r:'='~---:---~'-T:~'~':-:2.Z-'~:'=':'=-~-:-~-~ ............................. -t~" --- ~ - ;-.--- : ---.-- - ~ ~'=~-.'..=.:'..=..:.~7~..~. _~.--: 7" -" 7 ;~\]~(:.--'.3 
...~........z...=-~. ~ -, ............... ~ ............ . ............................ 
........ ~-JT~:.=~-_-==:.~---=~.~.7 ................................ 0 ...................... o ............................... 
DistSen o 
LenText -+-- 
LocWithinP -O-- 
b4m -.x- .... 
Sire2 -a,-. 
LenSenA --~-- 
,~ LenSenB -.o-- - 
LocPar - + -- 
LocSen -E3-- 
DistPar 
lsAtitle -~-- 
I I I 
200 400 600 800 100 
The number of clues 
Table 6: Connectives found in the corpus. Underlined items (also marked with an asterisk) are those that 
the tokenizer program erroneously identified as a connective. 
shikashi but, ippou whereas, daga but, soreo (.), shikamo moreover, tokoroga but, soshite 
and, soreni moreover, sokode incidentally, soredemo still, sore (.), tadashi provided that, 
soredakeni all the more because, tokini by the way, dakara so, demo but, sonoue moreover, 
sitagatte therefore, dewa now, nimokakawarazu despite, soredewa well, sorede and then, 
sorekara after that, towaie nevertheless, shitagatte therefore, tsuide while, katoitte but. 
dakarakoso consequently, matawa or, soretomo or else, soreto for another thing, nanishiro 
anyhow, omakeni in addition, sunawachi in other words, toiunowa because, naraba if. 
sonokawari instead, samunaktL.ba or else, sunawachi namely, naishiwa or. sate by the way, 
toshite ('.). toiunomo because, sorenimokakawarazu nonetheless, sorenishitemo yet, oyobi 
moreover, tokorode incidentally, nazenara because, tosureba if, nanishiro anyhow, otto 
(*), nanoni but 
223 
length features are more prominent than lexical fea- 
tures, we were not able to establish the usefulness 
of the latter features, which is expected from earlier 
works on discourse as well as on sentence parsing 
(Magerman, 1995; Collins, 1996). 
The following are some of the future research is- 
sues: 
Building a larger corpus Our discourse parser 
did not perform as well as a statistical sentence 
parser, which normally performs with over 80% pre- 
cision. We suspect that the reason may have ~to do 
with inconsistencies in tagging and the size of the 
corpus we used. 
Parsing with the rhetorical structure theory 
Technically it is straightforward to turn the present 
parsing method into a full-fledged RST parser, which 
involves modifying the way classes are defined and 
redefining constraints on a structure of discourse. A 
problem, however, is that the task of assigning sen- 
tences to rhetorical relations with some consistency 
could turn out to be quite difficult for human coders. 
Extending to other Languages The general 
framework in which our parser is built does not pre- 
suppose elements specific to a particular language. 
It should be possible to carry it over to other lan- 
guages with no significant modification to it: 

References 
Jean Carlotta, Amy Isard, Stephen Isard, Jacque- 
line C. Kowtko, Gwyneth Doherty-Sneddon, and 
Anne H. Anderson. 1997. The Reliability of a Di- 
alogue Structure Coding Scheme. Computational 
Linguistics, 23(1):13-31. 
Michael John Collins. 1996. A New Statistical 
Parser Based on Bigram Lexical Dependencies. 
In Proceedings of the 34th Annual Meeting of the 
Association for Computational Linguistics, pages 
184-191, California, USA., June. Association for 
Computational Linguistics. 
Barbara A. Fox. 1987. Discourse structure and 
anaphora. Cambridge Studies in Linguistics 48. 
Cambridge University Press, Cambridge, UK. 
Barbara Grosz and Candance Sidner. 1986. Atten- 
tion, Intentions and the Structure of Discourse. 
Computational Linguistics, 12(3):175-204. 
M.A.K. Halliday and R. Hasan. 1990. Cohesion in 
English. Longman, New York. 
Masahiko Haruno. 1997. Kettei-gi o mochiita ni- 
hongo kakariuke kaiseki (A Japanese Dependency 
Parser Using A Decision Tree). Submitted for 
publication. 
Marti A. Hearst. 1994. Multi-Paragraph Segmen- 
tation of Expository Text. In Proceedings of the 
32nd Annual Meeting of the Association for Com- 
putational Linguistics, pages 9-16, New Mexico, 
USA. 
Sadao Kurohashi and Makoto Nagao. 1994. Auto- 
matic Detection od Discourse Structure by Check- 
ing Surface Information in Sentences. In Proceed- 
ings of The 15th Intewrnational Conference on 
Computational Linguistics, pages 1123-1127, Au- 
gust. Kyoto,Japan. 
Stephen C. Levinson. 1994. Pragmatics. Cambridge 
Textbooks in Linguistics. Cambridge University 
Press. 
Diane Litman and Rebecca J. Passonneau. 1995. 
Combining multiple knowledge sources for dis- 
course segmentation. In Proceedings of the 33rd 
Annual Meeting of the Association for Computa- 
tional Linguistics, pages 108-115. The Association 
for Computational Linguistics, June. Cambridge, 
MASS. USA. 
David M. Magerman. 1995. Statistical Decision- 
Tree Models for Parsing. In Proceedings of the 
33rd Annual Meeting of the Association for Com- 
putational Linguistics, pages 276--283, Camridge, 
MASS. USA., June. Association for Computa- 
tional Linguistics. 
William C. Mann and Sandra A. Thompson. 1987. 
Rhetorical Structure Theory : A Theory of Text 
Organization. Technical Report ISI/RS 87-190, 
ISI. 
Daniel Marcu. 1997. The Rhetorical Parsing of Nat- 
ural Language Texts. In Proceedings of the 35th 
Annual Meetings of the Association for Computa- 
tional Linguistics and the 8th European Chapter 
of the Association for Computational Linguistics, 
pages 96-102, Madrid, Spain, July. 
Mitcell Marcus, Beatrice Santorini, and Mary Ann 
Marcinkiewicz. 1993. Building a Large Annotated 
Corpus of English: The Penn Treebank. Compu- 
tational Linguistics, 19(2):313-330, June. 
Nihon-Keizai-Shimbun-Sha. 1995. Nihon Keizai 
Shimbun 95 sen CD-ROM ban. CD-ROM. Nihon 
Keizai Shimbun, Inc., Tokyo. 
J. Ross Quinlan. 1993. C4.5: Programs for Machine 
Learning. Morgan Kaufmann. 
Hirofumi Sa.kurai and Toru Hisamitsu. 1997. 
Keitaiso Puroguramu ANIMA no Sekkei to Jissoo. 
In Jyoohoo Shori Gakkai Zenki Zenkoku Taikai 
Kooen Ronbun Shuu, volume 2, pages 57-56. In- 
formation Processing Society of Japan, March 12- 
14. 
