Learning Discourse Relations with Active Data Selection 
Tadashi Nomoto* 
National Institute of Japanese Literature 
1-16-10 Yutaka Shinagawa 
Tokyo 142-8585 Japan 
nomoto@nijl.ac.jp
Yuji Matsumoto 
Nara Institute of Science and Technology 
8916-5 Takayama Ikoma Nara 
630-0101 Japan 
matsu@is.aist-nara.ac.jp
Abstract 
The paper presents a new approach to identi- 
fying discourse relations, which makes use of a 
particular sampling method called committee- 
based sampling (CBS). In committee-based
sampling, multiple learning models are gener-
ated to measure the utility of an input example
in classification; if the example is judged not
useful, it is ignored. The method
has the effect of reducing the amount of data 
required for training. In the paper, we extend 
CBS for decision tree classifiers. With an addi- 
tional extension called error feedback, it is found 
that the method achieves an increased accuracy 
as well as a substantial reduction in the amount 
of data for training classifiers. 
1 Introduction 
The success of corpus-based approaches to dis- 
course ultimately depends on whether one is 
able to acquire a large volume of data annotated 
for discourse-level information. However, to ac- 
quire merely a few hundred texts annotated for 
discourse information is often impossible due to 
the enormity of the human labor required.
This paper presents a novel method for reduc- 
ing the amount of data for training a decision 
tree classifier, while not compromising the accu- 
racy. While there has been some work explor- 
ing the use of machine learning techniques for
discourse and dialogue (Marcu, 1997; Samuel et 
al., 1998), to our knowledge, no computational 
research on discourse or dialogue so far has ad- 
dressed the problem of reducing or minimizing 
the amount of data for training a learning algo- 
rithm. 
* The work reported here was conducted while the first 
author was with Advanced Research Lab., Hitachi Ltd, 
2520 Hatoyama Saitama 350-0395 Japan. 
The method proposed here builds
on committee-based sampling, initially pro-
posed for probabilistic classifiers by Dagan and 
Engelson (1995), where an example is selected 
from the corpus according to its utility in im- 
proving statistics. We extend the method for 
decision tree classifiers using a statistical tech- 
nique called bootstrapping (Cohen, 1995). With 
an additional extension, which we call error 
feedback, it is found that the method achieves
an increased accuracy as well as a significant 
reduction of training data. The method pro- 
posed here should be of use in domains other 
than discourse, where a decision tree strategy is 
found applicable. 
2 Tagging a corpus with discourse 
relations 
In tagging a corpus, we adopted Ichikawa 
(1990)'s scheme for organizing discourse rela- 
tions (Table 1). The advantage of Ichikawa 
(1990)'s scheme is that it directly associates dis- 
course relations with explicit surface cues (e.g.,
sentential connectives), so it is possible for the
coder to determine a discourse relation by figur-
ing out the most natural cue that goes with the
sentence he/she is working on. Another feature is that,
unlike Rhetorical Structure Theory (Mann and 
Thompson, 1987), the scheme assumes a dis- 
course relation to be a local one, which is de- 
fined strictly over two consecutive sentences. 1 
We expected that these features would make a 
tagging task less laborious for a human coder 
than it would be with RST. Further, our earlier 
study indicated a very low agreement rate with 
1 This is not to say that all discourse
relations are local. There could be some relations that
involve sentences far apart. However, we did
not consider non-local relations, as our preliminary study
found that they are rarely agreed upon by coders.
Table 1: Ichikawa (1990)'s taxonomy of discourse relations. The first column indicates major classes 
and the second subclasses. The third column lists some examples associated with each subclass. 
Note that the EXPANDING subclass has no examples in it. This is because no explicit cue is used 
to mark the relationship. 
LOGICAL       CONSEQUENTIAL   dakara therefore, shitagatte thus
              ANTITHESIS      shikashi but, daga but
SEQUENCE      ADDITIVE        soshite and, tsugi-ni next
              CONTRAST        ippō in contrast, soretomo or
              INITIATION      tokorode to change the subject, sonouchi in the meantime
ELABORATION   APPOSITIVE      tatoeba for example, yōsuruni in other words
              COMPLEMENTARY   nazenara because, chinamini incidentally
              EXPANDING       NONE
RST (κ = 0.43; three coders); especially for a
casual coder, RST turned out to be quite a dif-
ficult guideline to follow.
In Ichikawa (1990), discourse relations are or- 
ganized into three major classes: the first class 
includes logical (or strongly semantic) relation- 
ships where one sentence is a logical conse- 
quence or contradiction of another; the second 
class consists of sequential relationships where 
two semantically independent sentences are jux- 
taposed; the third class includes elaboration- 
type relationships where one of the sentences 
is semantically subordinate to the other. 
In constructing a tagged corpus, we asked 
coders not to identify abstract discourse rela- 
tions such as LOGICAL, SEQUENCE and ELAB- 
ORATION, but to choose from a list of pre- 
determined connective expressions. We ex- 
pected that the coder would be able to iden- 
tify a discourse relation with far less effort when 
working with explicit cues than when working
with abstract concepts of discourse relations.
Moreover, since 93% of sentences considered 
for labeling in the corpus did not contain pre- 
determined relation cues, the annotation task 
was in effect one of guessing a possible connec- 
tive cue that may go with a sentence. The ad-
vantage of using explicit cues to identify dis- 
course relations is that even if one has little or 
no background in linguistics, he or she may be 
able to assign a discourse relation to a sentence 
by just asking him/herself whether the associ- 
ated cue fits well with the sentence. In addition, 
in order to make the usage of cues clear and un- 
ambiguous, the annotation instruction carried 
a set of examples for each of the cues. Fur- 
ther, we developed an emacs-based software aid 
which guides the coder to work through a cor- 
pus and also is capable of prohibiting the coder 
from making moves inconsistent with the tag- 
ging instruction. 
As it turned out, however, Ichikawa's scheme, 
using subclass relation types, did not improve 
agreement (κ = 0.33; three coders). So, we
modified the relation taxonomy so that it con- 
tains just two major classes, SEQUENCE and 
ELABORATION, (LOGICAL relationships being 
subsumed under the SEQUENCE class) and as- 
sumed that a lexical cue marks a major class to 
which it belongs. The modification successfully 
raised the κ score to 0.70. Collapsing LOGICAL
and SEQUENCE classes may be justified by not- 
ing that both types of relationships have to do 
with relating two semantically independent sen- 
tences, a property not shared by relations of the 
elaboration type. 
3 Learning with Active Data 
Selection 
3.1 Committee-based Sampling 
In the committee-based sampling method (CBS, 
henceforth) (Dagan and Engelson, 1995; Engel- 
son and Dagan, 1996), a training example is se- 
lected from a corpus according to its usefulness; 
a preferred example is one whose addition to 
the training corpus improves the current esti- 
mate of a model parameter which is relevant 
to classification and also affects a large propor- 
tion of examples. CBS tries to identify such an 
example by randomly generating multiple mod- 
els (committee members) based on posterior dis- 
tributions of model parameters and measuring 
how much the member models disagree in clas- 
sifying the example. The rationale for this is: 
disagreement among models over the class of 
an example would suggest that the example af- 
fects some parameters sensitive to classification, 
and furthermore estimates of affected parame- 
ters are far from their true values. Since models 
are generated randomly from posterior distribu- 
tions of model parameters, their disagreement 
on an example's class implies a large variance 
in estimates of parameters, which in turn indi-
cates that the statistics of the parameters involved
are insufficient, and hence that the example
should be included in the training corpus (so as
to improve the statistics of the relevant parameters).
For each example it encounters, CBS goes 
through the following steps to decide whether 
to select the example for labeling. 
1. Draw k models (committee members) ran- 
domly from the probability distribution 
P(M | S) of models M given the statistics
S of a training corpus. 
2. Classify an input example by each of 
the committee members and measure how 
much they disagree on classification. 
3. Make a biased random decision as to 
whether or not to select the example 
for labeling. This would make a highly 
disagreed-upon example more likely to be 
selected. 
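As a sketch, one round of this procedure might be expressed as follows (Python; the committee interface and all function names here are our own assumptions, not the paper's implementation):

```python
import random

def consider_example(example, draw_committee, disagreement, selection_prob, k=10):
    """One round of committee-based sampling for a single unlabeled example.

    Assumed interfaces (not from the paper):
      draw_committee(k)     -- returns k classifiers drawn from P(M | S)
      disagreement(votes)   -- maps the members' votes to a score (e.g. vote entropy)
      selection_prob(score) -- maps that score to a selection probability
    """
    members = draw_committee(k)                        # step 1: draw k models
    votes = [m.predict(example) for m in members]      # step 2: classify the example
    score = disagreement(votes)                        # step 2: quantify disagreement
    return random.random() < selection_prob(score)     # step 3: biased random decision
```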
As an illustration of how this might work, 
consider a problem of tagging words with 
parts of speech, using a Hidden Markov Model 
(HMM). A (bigram) HMM tagger is typically 
given as: 
\[ T(w_1 \ldots w_n) = \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_{i+1} \mid t_i) \]
where w_1 ... w_n is a sequence of input words,
and t_1 ... t_n is a sequence of tags. For a sequence
of input words w_1 ... w_n, the sequence of corre-
sponding tags T(w_1 ... w_n) is one that maxi-
mizes the probability of reaching t_n from t_1 via
the t_i (1 < i < n) and generating w_1 ... w_n along
the way. The probabilities P(w_i | t_i) and P(t_{i+1} | t_i)
are called model parameters of an HMM tag- 
ger. In Dagan and Engelson (1995), P(M | S)
is given as the posterior multinomial distribu-
tion P(\alpha_1 = a_1, \ldots, \alpha_n = a_n \mid S), where \alpha_i
is a model parameter and a_i represents one
of its possible values. P(\alpha_1 = a_1, \ldots, \alpha_n =
a_n \mid S) represents the proportion of the times
that each parameter \alpha_i takes the value a_i, given the
statistics S derived from a corpus. (Note that
\sum_{a_i} P(\alpha_i = a_i \mid S) = 1.) For instance, consider
a task of randomly drawing a word with replace- 
ment from a corpus consisting of 100 different 
words (w_1, ..., w_100). After 10 trials, you might
have outcomes like w_1 = 3, w_2 = 1, ..., w_55 =
2, ..., w_71 = 3, ..., w_76 = 1, ..., w_100 = 0: i.e.,
w_1 was drawn three times, w_2 was drawn once,
w_55 was drawn twice, etc. If you try another 10
times, you might get different results. A multi- 
nomial distribution tells you how likely you get 
a particular sequence of word occurrences. Da- 
gan and Engelson (1995)'s idea is to assume 
the distribution P(\alpha_1 = a_1, \ldots, \alpha_n = a_n \mid S)
to be a set of binomial distributions, each corre-
sponding to one of its parameters. An arbitrary
HMM model is then constructed by randomly
drawing a value a_i from the binomial distribu-
tion for the parameter \alpha_i, which is approximated
by a normal distribution. Given k such models 
(committee members) from the multinomial dis- 
tribution, we ask each of them to classify an 
input example. We decide whether to select 
the example for labeling based on how much 
the committee members disagree in classifying 
that example. Dagan and Engelson (1995) in-
troduce the notion of vote entropy to quantify
disagreement among members. Though one
could use the kappa statistic (Siegel and Castel-
lan, 1988) or other disagreement measures such
as the α statistic (Krippendorff, 1980) instead of
the vote entropy, in our implementation of CBS
we decided to use the vote entropy, for lack
of a reason to choose one statistic over another.
A precise formulation of the vote entropy is as 
follows: 
\[ V(e) = -\sum_{c} \frac{V(c,e)}{k} \log \frac{V(c,e)}{k} \]
Here e is an input example and c denotes a 
class. V(c, e) is the number of votes for c. k 
is the number of committee members. A se- 
lection function is given in probabilistic terms, 
based on V(e). 
\[ P_{\mathrm{select}}(e) = \frac{g}{\log k}\, V(e) \]
g here is called the entropy gain and is used to 
determine the number of times an example is 
selected; a greater g would increase the number
of examples selected for tagging. Engelson and 
Dagan (1996) investigated several plausible ap- 
proaches to the selection function but were un- 
able to find significant differences among them. 
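For concreteness, the two formulas above might be computed as below (a minimal sketch in Python; the worked example and the use of the natural logarithm are our choices, not the paper's):

```python
import math
from collections import Counter

def vote_entropy(votes, k):
    """V(e): entropy of the committee's votes over classes."""
    return -sum((n / k) * math.log(n / k) for n in Counter(votes).values())

def selection_prob(v, k, g=1.0):
    """P_select(e) = (g / log k) * V(e)."""
    return (g / math.log(k)) * v

# Example: 10 members split 6-4 between the two relation classes.
votes = ['SEQUENCE'] * 6 + ['ELABORATION'] * 4
v = vote_entropy(votes, k=10)
print(v, selection_prob(v, k=10))   # ~0.67 and ~0.29
```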
At the beginning of the section, we mentioned 
some properties of 'useful' examples. A useful 
example is one which contributes to reducing 
variance in parameter values and also affects 
classification. By randomly generating multiple 
models and measuring a disagreement among 
them, one would be able to tell whether an ex- 
ample is useful in the sense above; if there were 
a large disagreement, then one would know that 
the example is relevant to classification and also 
is associated with parameters with a large vari- 
ance and thus with insufficient statistics. 
In the following section, we investigate how 
we might extend CBS for use in decision tree 
classifiers. 
3.2 Decision Tree Classifiers 
Since it is difficult, if not impossible, to express 
the model distribution of decision tree classi- 
fiers in terms of the multinomial distribution, 
we turn to the bootstrap sampling method to 
obtain P(M | S). The bootstrap sampling
method provides a way for artificially establish- 
ing a sampling distribution for a statistic, when 
the distribution is not known (Cohen, 1995). 
For us, a relevant statistic would be the poste- 
rior probability that a given decision tree may 
occur, given the training corpus. 
Bootstrap Sampling Procedure
Repeat i = 1, ..., K times:
1. Draw a bootstrap pseudosample S_i of size
N from S by sampling with replacement, as
follows: repeat N times: select a member of
S at random and add it to S_i.
2. Build a decision tree model M from S_i.
Add M to the model pool S_s.
S is a small set of samples drawn from the
tagged corpus. Repeating the procedure 100
times would give 100 decision tree models, each
corresponding to some S_i derived from the sam-
ple set S. Note that the bootstrap procedure 
allows a datum in the original sample to be se- 
lected more than once. 
Given a sampling distribution of decision tree 
models, a committee can be formed by ran- 
domly selecting k models from S_s. Of course,
there are some other approaches to construct- 
ing a committee for decision tree classifiers (Di- 
etterich, 1998). One such, known as random- 
ization, is to use a single decision tree and ran- 
domly choose a path at each attribute test. Re- 
peating the process k times for each input ex- 
ample produces k models. 
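A minimal sketch of the bootstrap committee, using scikit-learn's DecisionTreeClassifier as a stand-in for the paper's C4.5 (the interface below is our assumption):

```python
import random
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_committee(X, y, K=100, k=10, seed=0):
    """Build K trees, one per bootstrap pseudosample S_i of (X, y),
    then draw a committee of k trees from the resulting pool S_s."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n, pool = len(X), []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)   # sample N items with replacement
        pool.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return random.sample(pool, k)          # the committee members
```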
3.2.1 Features 
In the following, we describe a set of features 
used to characterize a sentence. As a conven- 
tion, we refer to a current sentence as 'B' and 
the preceding sentence as 'A'. 
<LocSen> defines the location of a sentence by:
\[ \frac{\#S(X)}{\#S(\mathit{Last\_Sentence})} \]
'#S(X)' denotes an ordinal number indi- 
cating the position of a sentence X in a 
text, i.e., #S(kth_sentence) = k (k ≥ 0).
'Last_Sentence' refers to the last sentence in a 
text. LocSen takes a continuous value between 
0 and 1. A text-initial sentence takes 0, and a 
text-final sentence 1. 
<LocPar> is defined similarly to LocSen. It
records information on the location of the para-
graph in which a sentence X occurs:
\[ \frac{\#Par(X)}{\#\mathit{Last\_Paragraph}} \]
'#Par(X)' denotes an ordinal number indicat- 
ing the position of a paragraph containing X. 
'#Last_Paragraph' is the position of the last 
paragraph in a text, represented by the ordinal 
number. 
<LocWithinPar> records information on the
location of a sentence X within the paragraph in
which it appears:
\[ \frac{\#S(X) - \#S(\mathit{Par\_Init\_Sen})}{\mathit{Length}(Par(X))} \]
'Par_Init_Sen' refers to the initial sentence of the
paragraph in which X occurs; 'Length(Par(X))'
denotes the number of sentences that occur in
that paragraph. LocWithinPar takes continu-
ous values ranging from 0 to 1. A paragraph-
initial sentence would have 0 and a paragraph-
final sentence 1.
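Reduced to code, the three location features are simple ratios (a sketch; the argument names are ours, and positions are the 0-based ordinal numbers defined above):

```python
def loc_sen(s_index, last_s_index):
    """LocSen = #S(X) / #S(Last_Sentence): 0 for a text-initial sentence, 1 for a text-final one."""
    return s_index / last_s_index

def loc_par(par_index, last_par_index):
    """LocPar = #Par(X) / #Last_Paragraph."""
    return par_index / last_par_index

def loc_within_par(s_index, par_init_s_index, par_length):
    """LocWithinPar = (#S(X) - #S(Par_Init_Sen)) / Length(Par(X))."""
    return (s_index - par_init_s_index) / par_length
```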
<LenText> the length of a text, measured in
Japanese characters.
<LenSenA> the length of A in Japanese characters.
<LenSenB> the length of B in Japanese characters.
<Sim> encodes the lexical similarity between
A and B, based on an information-retrieval
measure known as tf·idf (Salton and McGill,
1983). 2 One important feature here is that we 
defined similarity based on (Japanese) charac- 
ters rather than on words: in practice, we broke 
up nominals from relevant sentences into simple 
alphabetical characters (including graphemes) 
and used them to measure similarity between 
the sentences. (Thus in our setup x_i in foot-
note 2 corresponds to one character, and not
to one whole word.) We did this to deal with 
abbreviations and rewordings, which we found 
quite frequent in the corpus. 
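A sketch of the character-based similarity, combining the tf·idf weights and the Dice coefficient of footnote 2 (the function names are ours):

```python
import math
from collections import Counter

def tfidf_weights(chars, all_sents_chars):
    """w_ij = tf_ij * log(N / df_j), computed over characters rather than words."""
    N = len(all_sents_chars)
    df = Counter()
    for s in all_sents_chars:
        df.update(set(s))            # sentence frequency of each character
    tf = Counter(chars)
    return {c: tf[c] * math.log(N / df[c]) for c in tf}

def dice_sim(wx, wy):
    """SIM(X, Y) = 2 * sum w(x_i)w(y_i) / (sum w(x_i)^2 + sum w(y_i)^2)."""
    num = 2 * sum(wx[c] * wy[c] for c in set(wx) & set(wy))
    den = sum(v * v for v in wx.values()) + sum(v * v for v in wy.values())
    return num / den if den else 0.0
```

Working at the character level, as described above, lets abbreviations and rewordings that share characters but not whole words still count as similar.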
<Cue> takes a discrete value 'y' or 'n'. The 
cue feature is intended to exploit surface cues 
most relevant for distinguishing between the SE- 
QUENCE and ELABORATION relations. The fea- 
2 For a word j in a sentence S_i (j ∈ S_i), its weight w_{ij}
is defined by:
\[ w_{ij} = \mathit{tf}_{ij} \cdot \log \frac{N}{\mathit{df}_j} \]
df_j is the number of sentences in the text which contain
an occurrence of the word j. N is the total number of
sentences in the text. The tf·idf metric has the property
of favoring high-frequency words with local distribution.
For a pair of sentences X = (x_1, ...) and Y = (y_1, ...),
where the x and y are words, we define the lexical similarity
between X and Y by:
\[ \mathrm{SIM}(X,Y) = \frac{2 \sum_{i} w(x_i)\, w(y_i)}{\sum_{i} w(x_i)^2 + \sum_{i} w(y_i)^2} \]
where w(x_i) represents the tf·idf weight assigned to the
term x_i. The measure is known as the Dice coefficient
(Salton and McGill, 1983).
ture takes 'y' if a sentence contains one or more 
cues relevant to distinguishing between the two 
relation types. We considered word n-grams of
up to length 5 found in the training corpus. Out of
these, those whose INFO_X values are below a
particular threshold are included in the set of
cues. 3 And if a sentence contains one of the
cues in the set, it is marked 'y', and 'n' other-
wise. The cutoff is determined in such a way
as to minimize INFO_Cue(T), where T is a set
of sentences (represented with features) in the
training corpus. We had a total of 90 cue ex-
pressions. Note that using a single binary fea- 
ture for cues alleviates the data sparseness prob- 
lem; though some of the cues may have low fre- 
quencies, they will be aggregated to form a sin- 
gle cue category with a sufficient number of in- 
stances. In the training corpus, which contained 
5221 sentences, 1914 sentences are marked 'y' 
and 3307 are marked 'n' with the cutoff at 0.85, 
which is found to minimize the entropy of the 
distribution of relation types. It is interesting to 
note that the entropy strategy was able to pick 
up cues which could be linguistically motivated 
(Table 2). In contrast to Samuel et al. (1998), 
we did not consider relation cues reported in 
the linguistics literature, since they would be 
useless unless they contribute to reducing the 
cue entropy. They may be linguistically 'right' 
cues, but their utility in the machine learning 
context is not known. 
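The entropy computations behind the cutoff might be sketched as follows (INFO and INFO_X as defined in footnote 3; the code and names are ours, and finding the cutoff is then a one-dimensional search over candidate thresholds):

```python
import math
from collections import Counter

def info(labels):
    """INFO(T): class entropy of a set of labeled cases."""
    n = len(labels)
    return -sum((f / n) * math.log2(f / n) for f in Counter(labels).values())

def info_x(partitions):
    """INFO_X(T) = sum_i |T_i|/|T| * INFO(T_i), for a partition of T by feature X."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * info(p) for p in partitions)

# E.g., the binary cue feature partitions T into its 'y' and 'n' subsets;
# the cutoff (0.85 in the paper) is the threshold minimizing info_x([T_y, T_n]).
```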
<PrevRel> makes available information about
the relation type of the preceding sentence. It has
two values, ELA for the elaboration relation, and 
SEQ for the sequence relation. 
In the Japanese linguistics literature, there is 
a popular theory that sentence endings are rel- 
evant for identifying semantic relations among 
3 INFO_X(T) measures the entropy of the distribution
of classes in a set T with respect to a feature X. We
define INFO_X just as given in Quinlan (1993):
\[ \mathrm{INFO}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \mathrm{INFO}(T_i) \]
T_i represents a partition of T corresponding to one of
the values for X. INFO(T) is defined as follows:
\[ \mathrm{INFO}(T) = -\sum_{j=1}^{k} \frac{\mathrm{freq}(C_j, T)}{|T|} \times \log_2 \frac{\mathrm{freq}(C_j, T)}{|T|} \]
freq(C, T) is the number of cases from class C in a set T
of cases.
Table 2: Some of the 'linguistically interesting' cues identified by the entropy strategy. 
mata on the other hand, dōjini at the same time, ippō in contrast, sarani in
addition, mo topic marker, ni-tsuite-wa regarding, tameda the reason is that,
kekka as the result, ga-nerai the goal is that
sentences. Some of the sentence endings are in-
flectional categories of verbs such as PAST/NON-
PAST and INTERROGATIVE, and also morpholog-
ical categories like nouns and particles (e.g.,
question markers). Based on Ichikawa (1990),
we defined six types of sentence-ending cues
and marked a sentence according to whether it
contains a particular type of cue. Included in
the set are inflectional forms of the verb and
the verbal adjective, PAST/NON-PAST, morpho-
logical categories such as COPULA and NOUN,
parentheses (quotation markers), and sentence-
final particles such as -ka. We use the follow-
ing two attributes to encode information about
sentence-ending cues.
<EndCueA> records information about the
sentence-ending form of the preceding sentence.
It takes a discrete value from 0 to 6, with
0 indicating the absence in the sentence of
relevant cues.
<EndCueB> Same as above, except that this fea-
ture is concerned with the sentence-ending form
of the current sentence, i.e. the 'B' sentence.
Finally, we have two classes, ELABORATION 
and SEQUENCE. 
4 Evaluation 
To evaluate our method, we carried out ex- 
periments, using a corpus of news articles 
from a Japanese economics daily (Nihon-Keizai- 
Shimbun-Sha, 1995). The corpus had 477 arti- 
cles, randomly selected from issues that were 
published during the year. Each sentence in the
articles was tagged with one of the discourse re- 
lations at the subclass level (i.e. CONSEQUEN- 
TIAL, ANTITHESIS, etc.). However, in evaluation 
experiments, we translated a subclass relation 
into a corresponding major class relation (SE- 
QUENCE/ELABORATION) for reasons discussed 
earlier. Furthermore, we explicitly asked coders
not to tag a paragraph initial sentence for a dis- 
course relation, for we found that coders rarely 
agree on their classifications. Paragraph-initial
sentences were dropped from the evaluation cor-
pus. This left us with 5221 sentences, of
which 56% are labeled as SEQUENCE and 44% 
ELABORATION. 
To find out the effects of the committee-based
sampling method (CBS), we ran the C4.5 (Re-
lease 5) decision tree algorithm (Quinlan, 1993)
with CBS turned on and off, and measured
the performance by the 10-fold cross validation, 
in which the corpus is divided evenly into 10 
blocks of data and 9 blocks are used for train- 
ing and the remaining one block is held out for 
testing. On each validation fold, CBS starts 
with a set of about 512 samples from the set of 
training blocks and sequentially examines sam- 
ples from the rest of the training set for pos- 
sible labeling. If a sample is selected, then a 
decision tree will be trained on the sample to- 
gether with the data acquired so far, and tested 
on the held-out data. Performance scores (er- 
ror rates) are averaged over 10 folds to give a 
summary figure for a particular learning strat- 
egy. Throughout the experiments, we assume 
that k = 10 and g = 1, i.e., 10 committee
members and an entropy gain of 1. Figure 1
shows the result of using CBS for a decision tree. 
Though the performance fluctuates erratically, 
we see a general tendency that the CBS method 
fares better than a decision tree classifier alone. 
In fact differences between C4.5/CBS and C4.5 
alone proved statistically significant (t = 7.06, 
df = 90, p < .01). 
While there seems to be a tendency for per- 
formance to improve with an increase in the 
amount of training data, either with or without 
CBS, it is apparent that an increase in the train- 
ing data has non-linear effects on performance, 
which makes an interesting contrast with proba- 
bilistic classifiers like HMM, whose performance 
improves linearly as the training data grow. The 
reason has to do with the structural complex- 
ity of the decision tree model: it is possible 
that small changes in the INFO value lead to 
Figure 1: Effects of CBS on the decision tree learning. Each point in the scatterplots represents a 
summary figure, i.e. the average of figures obtained for a given x in 10-fold cross validation trials. 
The x-axis represents the amount of training data, and the y-axis the error rate. The error rate is 
the proportion of the misclassified instances to the total number of instances. 
[Figure 1 scatterplot: error rate (roughly 43-47.5%) vs. amount of training data (0-1000); legend: STD C4.5, CBS+C4.5.]
a drastic restructuring of a decision tree. In the 
face of this, we made a small change to the way 
CBS works. The idea, which we call sampling
with error feedback, is to remove harmful exam-
ples from the training data and only use those
with positive effects on performance. It forces
the sampling mechanism to return to the status quo
ante when it finds that a selected example
degrades performance. More precisely:
\[ S_{t+1} = \begin{cases} S_t \cup \{e\} & \text{if } E(C^{S_t \cup \{e\}}) < E(C^{S_t}) \\ S_t & \text{otherwise} \end{cases} \]
S_t is the training set at time t. C^S denotes a clas-
sifier built from the training set S. E(C^S) is the
error rate of the classifier C^S. Thus if there is an
increase or no reduction in the error rate after
adding an example e to the training set, the clas-
sifier goes back to the state before the change.
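A minimal sketch of one feedback step (build_classifier and error_rate are assumed interfaces, standing in for C4.5 training and held-out evaluation):

```python
def error_feedback_step(train_set, example, build_classifier, error_rate):
    """Add `example` to `train_set` only if doing so lowers the error rate.

    Assumed interfaces (not from the paper):
      build_classifier(S) -- trains a classifier on the set S
      error_rate(clf)     -- evaluates the classifier on held-out data
    """
    candidate = train_set | {example}               # S_t ∪ {e}
    if error_rate(build_classifier(candidate)) < error_rate(build_classifier(train_set)):
        return candidate                            # keep the example
    return train_set                                # revert to the status quo ante
```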
As Figure 2 shows, the error feedback pro- 
duced a drastic reduction in the error rate. At 
900, the committee-based method with the er- 
ror feedback reduced the error rate by as much 
as 23%. Figure 3 compares performance of 
three sampling methods, random sampling, the 
committee-based sampling with 100 bootstrap 
replicates (i.e., K = 100) and that with 500 
bootstrap replicates. In the random sampling 
method, a sample is selected randomly from 
the data and added to the training data. Fig- 
ure 4 compares a random sampling approach 
with CBS with 500 bootstrap replicates. Both 
used the error feedback mechanism. Differ- 
ences, though they seem small, turned out to 
be statistically significant (t = 4.51, df = 
90, p < .01), which demonstrates the signif-
icance of the C4.5/CBS approach. Furthermore,
Figure 5 demonstrates that the number of boot- 
strap replicates affects performance (t = 8.87, 
df = 90, p < .01). CBS with 500 bootstraps 
performs consistently better than that with 100 
bootstrap replicates. This might mean that in 
the current setup, 100 replicates are not enough 
to simulate the true distribution of P(M | S).
Note that CBS with 500 replicates achieves the 
error rate of 33.40 with only 1008 training sam-
ples, which amounts to one fourth of the training
data C4.5 alone required to reach 44.64. While a
direct comparison with other learning schemes 
in discourse such as a transformation method 
(Samuel et al., 1998) is not feasible, if Samuel 
et al. (1998)'s approach is indeed comparable to 
C5.0, as discussed in Samuel et al. (1998), then 
the present method might be able to reduce the 
Figure 2: The committee-based method (with 100 bootstraps) with the error feedback as compared 
against one without. The error rate decreases rapidly with the growth of the training data (the 
lower scatterplot). The upper scatterplot represents CBS with the error feedback disabled. 
[Figure 2 scatterplot: error rate (roughly 34-46%) vs. amount of training data (0-900); legend: CBS+C4.5, CBS+C4.5+EF.]
Figure 3: Comparing performance of three approaches, random sampling (RANDOM-EF), boot- 
strapped CBS with 100 replicates (CBS100-EF), and bootstrapped CBS with 500 replicates 
(CBS500-EF), all with the error feedback on.
[Figure 3 scatterplot: error rate (roughly 32-42%) vs. amount of training data (0-1000); legend: RANDOM+EF, CBS100+EF, CBS500+EF.]
amount of training data without hurting perfor- 
mance. 
5 Conclusions 
We presented a new approach for identifying 
discourse relations, built upon the committee- 
based sampling method, in which useful ex- 
amples are selected for training and those not 
useful are discarded. Since the committee- 
based sampling method was originally devel- 
oped for probabilistic classifiers, we extended 
the method for a decision tree classifier, us- 
Figure 4: Differences in performance of RANDOM-EF (random sampling with the error feedback) 
and CBS500-EF (CBS with the error feedback, with 500 bootstrap replicates). 
[Figure 4 scatterplot: error rate (roughly 32-46%) vs. amount of training data (0-1000); legend: random sampling+EF, CBS+BP500+EF.]
Figure 5: Differences in performance of CBS500-EF and CBS100-EF. 
[Figure 5 scatterplot: error rate (roughly 32-44%) vs. amount of training data (0-900); legend: CBS100+EF, CBS500+EF.]
ing a statistical technique called bootstrapping. 
The use of the method for learning discourse 
relations resulted in a drastic reduction in the 
amount of data required and also an increased 
accuracy. Further, we found that the num- 
ber of bootstraps has substantial effects on per- 
formance; CBS with 500 bootstraps performed 
better than that with 100 bootstraps.

References 

Paul R. Cohen. 1995. Empirical Methods in Ar- 
tificial Intelligence. The MIT Press. 

Ido Dagan and Sean Engelson. 1995. 
Committee-based sampling for training 
probabilistic classifiers. In Proceedings of the
International Conference on Machine Learning,
pages 150-157, July.

Thomas G. Dietterich. 1998. An experimental 
comparison of three methods for constructing 
ensembles of decision trees: Bagging, boost-
ing, and randomization. Submitted to Machine
Learning.

Sean P. Engelson and Ido Dagan. 1996. Mini- 
mizing manual annotation cost in supervised
training from corpora. In Proceedings of the
34th Annual Meeting of the Association for
Computational Linguistics, pages 319-326.
ACL, June. University of California, Santa
Cruz.

Takashi Ichikawa. 1990. Bunshōron-gaisetsu.
Kyōiku-Shuppan, Tokyo.

Klaus Krippendorff. 1980. Content Analysis:
An Introduction to Its Methodology, volume 5
of The Sage COMMTEXT Series. Sage
Publications, Inc.

W. C. Mann and S. A. Thompson. 1987.
Rhetorical Structure Theory. In L. Polanyi,
editor, The Structure of Discourse. Ablex
Publishing Corp., Norwood, NJ.

Daniel Marcu. 1997. The Rhetorical Pars-
ing of Natural Language Texts. In Proceed-
ings of the 35th Annual Meeting of the As-
sociation for Computational Linguistics and
the 8th Conference of the European Chapter of
the Association for Computational Linguistics,
pages 96-102, Madrid, Spain, July.

Nihon-Keizai-Shimbun-Sha. 1995. Nihon
Keizai Shimbun 95 nen CD-ROM ban.
CD-ROM. Nihon Keizai Shimbun, Inc.,
Tokyo.

J. Ross Quinlan. 1993. C4.5: Programs for Ma-
chine Learning. Morgan Kaufmann.

Gerard Salton and Michael J. McGill. 1983.
Introduction to Modern Information Re-
trieval. McGraw-Hill Computer Science Se-
ries. McGraw-Hill Publishing Co.

Ken Samuel, Sandra Carberry, and K. Vijay-
Shanker. 1998. Dialogue act tagging with
transformation-based learning. In Proceed-
ings of the 36th Annual Meeting of the Asso-
ciation for Computational Linguistics and the
17th International Conference on Computa-
tional Linguistics, pages 1150-1156, August
10-14. Montreal, Canada.

Sidney Siegel and N. John Castellan. 1988.
Nonparametric Statistics for the Behavioral
Sciences. McGraw-Hill, second edition.
