From discourse structures to text summaries 
Daniel Marcu 
Department of Computer Science 
Umverslty of Toronto 
Toronto, Ontario 
Canada MSS 3G4 
marcu@cs.toronto, edu 
Abstract 
We describe experiments that show that 
the concepts of rhetorical analysts and nu- 
cleanty can be used effectively for deter- 
numng the most nnportant umts m a text 
We show how these concepts can be xm- 
plemented and we discuss results that we 
obtained with a chscourse-based summa- 
nzatmn program 
1 Motivation 
The evaluaUon of automatic summarizers has always 
been a thorny problem most papers on summanzaUon 
describe the approach that they use and give some "con- 
vmcmg" samples of the output In very few cases, the 
dtrect output of a suramanzatton program Is compared 
wtth a human-made summary or evaluated wtth the help 
of human subjects, usually, the results are modest Un- 
fortunately, evaluatmg the results of a pamcular tmple- 
mentaUon does not enable one to detenmne what part of 
the fmlure is due to the tmplementatton ttself and what 
part to Rs underlying assumpttons The posmon that we 
take m tins paper is that, m order to bmld htgh-quahty 
summarization programs, one needs to evaluate not only 
a representatave set of automattcally generated outputs (a 
htghly chfficult problem by Rself), but also the adequacy 
of the assumptaons that these programs use That way, 
one ts able to dtsungmsh the problems that pertmn to a 
parttcular implementation from those that pertmn to the 
underlying theoretical framework and explore new ways 
to improve each 
With few excepttons, automaUc approaches to summa- 
nzatmn have primarily addressed possthle ways to deter- 
rmne the most tmportant parts of a text (see Patce (!990) 
for an excellent overview) Deterrmnmg the salient parts 
IS constdered to be achievable because one or more of 
the following assumpuons hold 0) important sentences 
m a text contmn words that are used frequently (Luhn, 
1958, Edmundson, 1968), (n) tmportant sentences con- 
tam words that are used m the Utle and secuon head- 
mgs (Edmundson, 1968), On) important sentences are 
located at the begmmng or end of paragraphs (Baxen- 
dale, 1958), 0v) tmportant sentences are located at posl- 
Uons m a text that are genre dependent-- these posluons 
can be detenmned automatically, through trmnmg tech- 
tuques (Lm and Hovy, 1997), (v) important sentences use 
bq¢us words such as "greatest'~ and "stgmficant" or mdt- 
cater phrases such as "the mmn aim of thispaper" and 
"the purpose of tb~s aruclo", wlule nonqmportant sen- 
tences use stigma words such as "hardly" and "tmpossl- 
ble" (Edmundson, 1968, Rush, Salvador, and Zamora, 
1971), (v0 important sentences and concepts are the 
lughest connected enttUes m elaborate semantuc struc- 
tures (Skorochodko, 1971, Lm, 1995, Barzday and E1- 
hadad, 1997), and (vn) tmportant and nonqmportant sen- 
tences are derivable from a &scourse representaUon of 
the text (Sparck Jones, 1993, One; Surmta, and Mnke, 
1994) 
In deterrmnmg the words that occur most frequently m 
a text or the sentences that use words that occur m the 
headings of secttons, computers are accurate tools How- 
ever, m determmmgthe concepts that are semanucally 
related or the dtscourse structure of a text, computers 
are no longer so accurate, rather, they are highly depen- 
dent on the coverage of the hngmsuc resources that they 
use and the qualRy of the algorithms that they Imple- 
ment Although ~t ~s plausible that elaborate cohesion- 
and coherence-based structures can be used effecuvely 
m summanzauon, we beheve that before bmldmg sum-. 
manzzat~on programs, we should deterrmne the extent to 
winch these assumpUons hold 
In tins paper, we describe experiments that show that 
• the concepts of rbetoncal analysts and nucleanty can be 
used effecUvely for deterrmmng the most important umts 
m a text We show how these concepts were implemented 
and discuss results that we obtained with a ¢hscourse- 
based summanzauon program 
2 From discourse trees to summaries 
an empirical view 
2.1 Introduction 
Researchers m computauonal imgmsucs (Mann and 
Thompson, 1988, Mattluessen and Thompson, 1988, 
Sparck Jones, 1993) have long speculated that the nuclei 
that pertain to a rhetorical structure tree (RS-tree) (Mann 
and Thompson, 1988) consmute an adequate summanza- 
82 
! 
! 
Umt 
I 
2. ~ 
3 
4 
5 
7 
8 
9 
Table I 
10 
11 
12 
13 
14 
15 
16 
17 
18 
Judges ?~y~te \[ Program 
1 2 3 4 5 6 7 8 9 10 11 12 13 .... 
0 2 2 2 0 0 0 0 0 0 0 0 0 3 3 3 
0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 2 
0 2 0 2 0 0 0 0 0 0 0 0 " 1 3 2 3 
2 1 2 2 2 2 2 2 2 2 2 2 2 6 5 6 
1 1 0 1 1 1 0 1 2 1 0 2 2 4 3 4 
0 I 0 1 1 I 0 1 1 1 0 2 2 4 •3 4 
0 2 1 0 0 0 1 1 1 0 0 0 0 4 3 3 
0 1 0 0 0 0 0 0 0 0 0 0 0 4 3 3 
0 0 2 0 0 0 0 0 0 0 1 0 1 1 0 1 
0 2 2 2 0 0 2 0 0 0 0 0 0 3 4 3 
0 0 0 2 0 0 0 1 0 0 0 0 1 3 4 3 
2 2 2 2 2 2 2 2 2 0 1 2 2 5 4 5 
1 1 0 0 0 1 0 1 0 0 0 2 0 3 3 3 
1 0 0 0 0 1 1 0 0 0 0 2 0 3 3 3 
0 0 0 0 0 1 0 0 0 0 0 1 0 2 3 3 
0 1 .1 0 1 0 0 0 2 0" 0 1 0 4 3 4 
0 I 0 0 0 0 0 0 1 0 0 I 0 2 1 3 
2 1 1 0 1 0 1 0 2 0 I 1 2 4 3 4 
The scores assigned by the judges, analysts, and our program tothe textual umts m text I 
tion 0.fthe text for winch that RS-tree was bruit However, 
to otff knowledge, there was no experiment to confirm 
how vahd this speculaUon really is In what follows, 
• we desonbe an experiment that shows that there exists a 
strong correlataon between the nuclei of the RS-tree of a 
text and what readers perceive to be the most important 
umts m a text 
2.2 Experiment 
2.2.1 Materials and methods 
We know from the results reported m the psychological 
hterature on summanzaUon (Johnson, 1970, Chou Hare 
and Borchardt, 1984, Sherrarck 1989) that there exists a 
certmn degree of disagreement between readers with re- 
spect to the importance that they assign to various textual 
umts and that the ¢hsagreement is dependent on the qual- 
ity of the text and the comprehension and summarization 
slalls of the readers (Wmograd, 1984) In an attempt to 
produce an adequate reference set of data, we selected for 
our experiment five texts from $czenttflc American that 
we considered to be weU-wntten The texts ranged in 
size from 161 to 725 words We used square brackets to 
enclose the wammal textual units (essentially the clauses) 
of each text Overall, the five texts were broken rote 160 
textual umts with the shortest text being broken into 18 
textual umts, and the longest into 70 The shortest text is 
g!ven in (1), below (here, for the purpose of reference, the 
rmmmal umts are not only enclosed by square brackets, 
but also are numbered) 
(1) \[With its &stunt orbit I\] \[-- 50 percent farther from the 
sun than Earth _.2\] \[and shm atmospheric blanket, 3\] 
\[Mars experiences fngld weather conchaons 4\] \[Sur- 
face temperatures typically average about -60 degrees 
Celsius (-76 degrees FahrenheR) at the equator s\] \[and 
can ¢hp to -123 degrees C near the poles s\] \[Only the 
nndday sun at tropical latRudes Is warm enough to thaw 
Ice on eccaslon, ~\] \[but any hqmd water formed ~ this 
way would evaporate almost instantly s\] \[because of the 
low almosphenc pressure 9\] 
\[Although the atmosphere holds a small amount of 
water, i°\] \[and water-ice clouds someumes develop, H \] 
\[most Maman weather revolves blowing dust or car- 
ben &oxide n\] \[Each wmter0 for example, a bhzzard 
of frozen carbon ¢hoxlde rages over one pole, t3\] \[and 
a few meters of thts dry-wee snow acfumulate 14\] \[as 
previously frozen carbon thoxzde evaporates from the 
opposite polar cap is\] \[Yet even on the summer pole, Is\] 
\[where the sun remains m the sky all day long, 17 \] \[tem- 
peratures never warm enough to melt frozen water ~s\] 
We followed Garner's (1982) strategy and asked 13 
independent judges to rate each textual umt accorchng 
to its importance to a potentml summary The judges 
used a three-point scale and assigned a score of 2 to the 
umts that they beheved to be very nnportant and should 
appear m a concise summary, I to those they considered 
moderately important, whlch should appear m a long 
summary, and 0 to those they consldered ummportant, 
winch should not appear in any summary The judge s 
were instructed that there were no nght or wrong answers 
and no upper or lower bounds with respect to the number 
of textual umts that they should select as being Important 
or moderately important The judges were all graduate 
students m computer sclence, we assumed that they had 
developed adequate comprehensmn and summanzauon 
shlls on thelr own, so no trmnmg session was carried 
out Table 1 presents the scores that were assigned by 
each judge to the umts m text (1) 
The same texts were also given to two computauonal 
• hngmsts with sohd knowledge of rhetoncal structure the- 
ory (RST) The analysts were asked to bmld one RS-tree 
83 
o 
Text 1 2 3 4 5 All 
Allumts ~ 70 71 
Verylmportant.unlts 88 63 65 64 67 66 
Lessnnportantumts 51 73 54 46 - 58 
Ummportantunits 75 83 73 73 71 74 
Table 2 Percent agreement with the majonty opinion 
for each text We took then the RS-trees built by the an- 
alysts and used our formalizaUon of RST (Marcu, 1996, 
Marcu, 1997b) to assocmte with each. node m a tree its 
sal,ent umts The salient umts were computed recur- 
s~vely, assocmnng with each leaf m an RS-tree the leaf 
itself, and to each internal node the salient umts of the 
nucleus or nucle~ of the rhetoncal relauon correspon&ng 
to that node We then computed for each textual umt a 
score, depen&ng on the depth m the tree where it oc- 
curred as a salient umt the textual umts that were sal,ent 
umts of the top nodes m a tree had a Ingher score than 
those that were salient umts of the nodes found at the bot- 
tom of a tree Essentially, from a rhetorical structure tree, 
we derived an importance score for each textual umt the 
lmpoi-tance scores ranged from 0 to n where n was the 
depth of the RS-tree i Table 1 presents the scores that 
were derived from the RS-trees that were bmlt by each 
analyst for text (1) 
2.2.2 Results 
Overall agreement among judges. We measured the 
ability of judges to agree with one another, using the no- 
Uon ofpercent agreement that was defined by Gale (1992) 
and used extensively m &scourse segmentanon stud- 
les (Passonnean and Lltman, 1993, Hearst, 1994) Per- 
cent agreement reflects the ratio of observed agreements 
vath the majority opmmn to posmble agreements with 
the majority opinion The percent agreements computed 
for, each of the five texts and each level of ,mportance 
are given m table 2 The agreements among judges for 
our expenment seem to follow the same pattern as those 
described by other researchers m summanzatlon (John- 
son, 1970) That is, the judges are qmte consistent with 
respect to what they perceive as being very Lmportant 
and unimportant, but less conststent wath respect to what 
they perceive as being less tmportant In contrast with 
• the agreement observed among judges, the percentage 
agreements computed for 1000 ,mportance assignments 
that were randomly generated for the same texts followed 
a normal distnbutlon with p = 47 31, (r = 0 04 These 
results suggest that the agreement among judges ,s ssg- 
mficant 
Agreement among judges with respect to the impor- 
tance of each textual umt. We considered a textual 
umt to be labeled con~stendy ifa s,mple majonty of the 
judges (~ 7) assigned the same score to that umt Over- 
Secuon 3 2 gives an example of how the importance scores 
were computed 
84 
all, the judges labeled conmstently 140 of the 160 textual 
units (87%) In contrast, a set of 1000 randomly gener- 
ated importance scores showed agreement, on average, 
for only 50 of the 160 textual umts (3 I%), o" = 0 05 
The judges consistently labeled 36 of the umts as very 
important, 8 as less maportant, and 96 as unmaportant 
They were inconsistent with respect to 20 textual units 
For example, for text (1), thejudges consistently labeled 
umts 4 and 12 as very important, umts 5 and 6 as less ,m- 
portant, units 1,2, 3, 7, 8, 9,10,11,13,14,15,17 as umm- 
portant, and were inconsistent m labehng umt 18 If we 
compute percent agreement figures only for the textual 
umts for winch at least 7 judges agreed, we get 69% 
for the units considered very important, 63% for those 
considered less important, and 77% for those considered 
ummportant The overall percent agreement m tins case 
is 75% 
Statistical significance. It has often been emphasized 
that agreement figures of the hnds computed above could 
be mrslea&ng (Knppendorff, 1980, Passonneau and Lit- 
man, 1993) Since the "true" set of lmpertant textual 
umts cannot be mdependentlyknown, we cannot tom- 
pure how valid the importance ass,gaments of the judges 
were Moreover, although the agreement figures that 
would occur by chance offer a strong mdlcatlon that our 
data are reliable, they do not prowde a prec,se measure- 
ment ofrehabdlty 
To compute a rehablhty figure, we followed the same 
methodology as Passonneau and Lltrnan (1993) and 
Hearst (1994) and apphed the Cochran's Q summary 
statlsucs to our data (Cochran. 1950) Cochran's test 
assumes that a set of judges make binary decismns with 
respect to a dataset The null hypothesis is that the num- 
ber of judges that take the same declmon is randomly 
&sttabuted Since Cochran's test is appropriate only for 
binary judgments and since our mam goal was to deter- 
mine a rehablhty figure for the agreement among judges 
with respect to what they believe to be ,mportant, we 
evaluated two versions of.the data that reflected only one 
Importance level In the first Version we considered as 
being important the judgments with a score of 2 and 
unimportant the judgments with a score of 0 and 1 In 
the second version, we consdered as being important the 
judgments with a score of 2 and 1 and ummportant the 
judgments with a score of 0 EssenUally, we mapped the 
judgment matrices of each of the five texts rote matnces 
whose elements ranged over only two values 0 and 1 
After these mod,ficauons were made, we computed for 
each version and each text the Cochran stausucs Q, winch 
approximates the X z &stnbuuon w,th n - 1 degrees of 
freedom, where n rs the number of elements m the dataset 
In all cases we obtmned probabflmes that were very low 
p < 10 -6 Tins means that the agreement among judges 
was extremely slgmficant 
Although the probainhty was very low for both ver- 
sions, it was lower for the first Vermon of the modflied 
data than for the second Tins means that ,t is more re- 
hable to consider as important only the units that were 
I 
I 
I 
• assigned a score of 2 by a majority of the judges 
As we have already menUoned, our ulumate goal was 
to detenmne whether there exists a correlauon between ~ 
the umts that judges find important and the umts that 
have nuclear status m the rhetorical structure trees of the 
same texts Since the percentage agreement for the umts 
that were consadered very important was higher than the 
percentage agreement for the mats that were consadered 
less amportant, and since the Cochran's slgmficance com- 
puted for the first versaon of the mochfied data was Ingher 
that the one computed for the second, we decaded to con- 
sider the set of 36 textual umts labeled by a majority of 
judges wath 2 as a rehable reference set of importance 
umts for the five texts For example, umts 4 and 12 from 
text (1) belong to t/us reference set 
Agreement between analysts. Once we detenmned 
the set of textual umts that the judges beheved to be 
amportant, we needed to detenmne the agreement be- 
tween the analysts who built the &scourse trees for the 
five texts Because we chd not know the &stnbutton of 
the importance scores denved from the thscourse trees, 
we computed the correlatmn between the analysts by ap- 
plying Spearman's correlatzon coefficaent on the scores 
associated to each textual umt We interpreted these 
scores as ranks on a scale that measures the xmportance 
of the umts m a text 
The Spearman rank correlauon coefficaent as an alter- 
naUve to the usual correlauon coefficaent It ~s based on 
the ranks of the data, and not on the data itself, so as 
resastant to outhers The null hypothesis tested by the 
Spearman coefficient as that two variables are indepen- 
dent of each other, agmnst the alternative hypothesis that 
the rank of a variable is correlated with the rank of an- 
other variable The value of the staustlcs ranges from 
-1, mchcatmg that Ingh ranks of one variable occur with 
low ranks of the other variable, through 0, mchcatmg no 
correlauon between the variables, to +1, mchcalzng that 
ingh ranks of one vanable occur with ingh ranks of the 
other variable 
The Spearman correlauon coefficient between the 
- ranks assagned for each textual umt on the bases of the 
RS-trees bmlt by the two analysts was very ingh 0 798, 
at the p < 0 0001 level of sagmficance The chfferences 
between the two analysts came mmnly from then" anter- 
pretaUons of two of the texts the RS-trees 0lone analyst 
nm'Iored the paragraph structure of the texts, while the 
RS-trees of the other muTored a logical orgamzaUon of 
the text, winch that analyst believed to be amportant 
Agreement between the analysts and the judges with 
respect to the most important textual units. In order 
to detenmne whether there exists any correspondence 
between what readers beheve to be important and the 
nuclea of the RS-trees, we selected, from each of the five 
texts, the set of textual umts that were labeled as "very 
Important" by a majority of the judges For example, 
for text (1), we selected umts 4 and 12, a e, 11% of the 
umts Overall, the judges selected 36 mats as being very 
amportant, whach as approximately 22% of the mats an a 
text The percentages of tmportant umts for the five texts 
were 11,36, 35, 17, and 22 respecuvely 
We took the maximal scores computed for each textual 
umt from the RS-trees bruit by each analyst and selected 
a percentage of umts that matched the percentage of Im- 
portant umts selected by the judges In the cases m winch 
there were ues, we selected a percentage ofumts that was 
closest to the one computed for the judges For example, 
we selected umts 4 and 12, winch represented the most 
important 11% of umts as reduced from the RS-tree bruit 
by the first analyst However, we selected only • umt 4, 
winch represented 6% of the most Important umts as re- 
duced from the RS-tree bmlt by the s.e~nd analyst The 
reason for selecting only umt 4 for the second analyst 
was that umts 10,11, and 12 have the same score -- 4 
(see table I) If we had selected umts 10,11 and 12 as 
well, we wouldhave ended up selecting 22% of the umts 
m text (1), winch as farther from 11 than 6 Hence, we 
detenmned for each text the set of amportant umts as la- 
beled byjudges and as denved from the RS-trees of those 
texts 
We calculated for each text the recall and precasaon of 
the important umts derived from the RS-trees, with re- 
spect to the umts labeled important by the judges The 
overall recall and precasaon was the same for both ana- 
lysts 56% recall and 66% precision In contrast, the 
average recall and precasaon for the same percentages of 
umts selected randomly 1000 Umes from the same five 
texts were both 25 7%, o- = 0 059 
In summarizing text, at ~s often useful to consider not 
only clauses, but full sentences To account for tbJs, we 
consadered to be ~mportant all the textual units that per- 
tinned to a sentence that was characterized by at least 
one amportant textual umt For example, we labeled as 
important textual umts 1 to 4 m text (I), because they 
make up a full sentence and because umt 4 was labeled 
as nnportant For the adjusted data, we detenmned agmn 
the percentages of amportant umts for the five texts and 
we re-calculated the recall and precasmn for both ana- 
lysts the recall was 69% and 66% and the preclsaon 82% 
and 75% respectively In contrast, the average recall and 
precisaon for the same percentages of mats selected ran- 
domly 1000 ttmes from the same five texts were 38 4%, 
¢r = 0 048 These results confirm that there exasts a 
strong correlaUon between the nuclea of the RS-trees that 
pertmn to a text and what readers perceave as being ampor- 
tant m that text Gaven the values of recall and precasaon 
that we obtained, at as plausible that an adequate com- 
putatmnal treatment of dxscourse theories would provide 
most of what is needed for selecting accurately the xm- 
portant umts m a text However, the results also suggest 
that RST by atself as not enough if one wants to strive for 
peffecUon 
The above results not only provade strong evadence that 
chscourse theories can be used effecUvely for text sum- 
manzaUon, but also enable one to derive strategies that 
an automaUc summarizer naght follow For example, the 
Spearman correlauon coofficlent between the judges and 
the first analyst, the one who chd not follow the paragraph 
85 
structure, was lower than the one between the judges and 
the second analyst It follows that most humanjudges are 
mchned to use the paragraph breaks as valuable sources 
of mformaUon when they mterpret discourse If the mm 
ofa summanzaUon program ss to mmuc human behaxaor, 
~t seems adequate for the program to take advantage of 
the paragraph structure of the texts that It analyzes 
Currently, the rank asstgnment for each textual umt m 
an RS-tree ts done enurely on the basts of the mammal 
depth m the tree where that umt as sahent (Marcu, 1996) 
Our data seem to support the fact that there exists a cor- 
relatmn also between the types of relatmus that are used 
to connect various textual umts and the tmportance of 
those umts m a text We plan to desagn other experiments 
that can provade clearcut evtdence on the nature of this 
correlauon 
3 An RST-based summarization program 
3.1 Implementation 
Our summanzauon program rehes on a rhetorical parser 
that braids RS-trees for unrestricted texts The mathe- 
maUcal foundaUons of the rhetorical parsing algorithm 
rely on a first'order formahzaUon of vahd textl struc- 
tures (Marcu, 1997h) The assumpUons of the formal- 
azaUon are the following 1 The elementary umts of 
complex text structures are non-overlappmg spans of text 
2 Rhetorical, coherence, and cohessve relauons hold be- 
tween textual umts of various sizes 3 Rel~ons can 
be paruuoned into two classes paratacuc and hypotac- 
uc Paratacuc relauons are those that hold between spans 
of equal ~mportanee HypotacUc relations are those that 
hold between a span that ts essenual for the writer's pur- 
pose, I e, a nucleus, and a span that increases the under- 
standing of the nucleus but is not essenUal for the writer's 
purpose, ~ e, a satelhte 4 The abstract structure of most 
texts ts a binary, tree-lake structure 5 If a relaUon 
holds between two textual spans of the tree structure of a 
text, that relatton also holds between the most Important 
umts of the consUtuent subspans The most ~mportant 
umts of a textual span are determined recursavely they 
correspond to the most important umts of the tmmechate 
subspans when the relauon that holds between these sub- 
spans ts paratacUc, and to the most amportant umts of the 
nucleus subspan when the relauon that holds between the 
tmmedtate subspans as hypotaclac 
The rhetorical parsmg algorithm, which is outhned m 
figure l, is based on a comprehens|ve corpus analysisof 
more than 450 discourse markers and 7900 text fragments 
(see (Marcu, 199To) for detmls) When gwen a text, the 
rhetorical parser detenmnes first the &scourse markers 
and the elementary umts that make UP that text The 
parser uses then the mformatton derived from the cor- 
pus analysts m order to hypothesize rhetorical relaUons 
among the elementary umts In the end, the parser apphes 
a constrmnt-saUsfactmn procedure to deterrmne the text 
str~tures that are vahd If more than one val|d structure 
is found, the parser chooses one that ts the "best" accord- 
mg to a gwen metric The detmls of the algorithms that 
INPUT a text T 
1 Deternune the set D of all dtscourse markers m T 
and the set Ur of elementary textual nmts m T 
2 Hypothes|ze a set ofrelataons R between the elements 
of Ur 
3 Deternune the set ValTrees of all vahd RS-trees of 
T that can be built using relauons from R 
4 Deterrmne the "best" RS-tree m VaITrees on the 
basts of a metric that assagns Ingher wetghts to the trees 
that are more skewed to the right 
Figure 1 An outline of the rhetorical parsing algorithm 
• ~ 78 c/~.aa 
12 P.T.cmphf~caaon ' 
1011~- 1618 Anmhem 
17 
Figure 2 The RS-tree of mammal weight built by the 
rhetoncal parser for text (I) 
are used by the rethoncal parser are &scussed at length 
m (Mareu, 1997a, Marco, I997b) 
When the rhetoncal parser takes text (1) as mpuL R 
produces the RS-tree m figure 2 The conventaon that 
we use IS that nuclei are surrounded by sohd boxes and 
satelhtes by dotted boxes, the hnks between a node and 
a subordinate nucleus or nuclei are represented by sohd 
arrows, and the hnks between a node and a subordinate 
satelhte by dotted hnes The nodes with only one satel- 
hte denote occurrences of parenthetical mformaUon for 
example, textual tnnt 2 ss labeled as parenthetacal to the 
textual umt that results from juxtaposing 1 and 3 The 
numbers assoctated voth each leaf correspond to the nu-. 
mencal labels m text (1) The numbers assocxated voth 
each internal node correspond to the sahent umts of that 
node and are exphcatly represented m the RS-tree 
By respecting the RS-tree m figure 2, one can horace 
that the trees that are bmlt by the program do not have the 
same granulartty as the trees constructed by the analysts 
For example, the program treats umts 13,14, and 15 as 
one elementary umt However, as we argue m (Marcu, 
1997b), the corpus analysis on winch our parser as bmlt 
supports the observatton that, m most cases, the global 
structure of the RS-tree as not affected by the mabahty of 
the rbetoncal parser to uncover all clauses m a text 
86 
most of the clauses that are not uncovered are nuclet of 
JOn~ relaUons 
The summanzatton program takes the RS-tree pro- 
duced by the rbetoncal parser and selects the textual umts 
that are most salient m that text If the nim of the program 
Is to produce just a very short summary, only the salient 
umts associated with the internal nodes found closer to 
the root are selected The longer the summary one wants 
to generate, the farther the selected salient umts roll be 
from the root In fact, one can see that the RS-trees 
bmlt by the rhetoncal parser reduce a pamal order on the 
~mportance of the textual umts For text (1), the most 
important umt ~s 4 The textual umts that are sahent m 
the nodes found one level below represent the next level 
of importance (m this case, umt 12 -- umt 4 was already 
accounted for) The next level contains umts 5, 6,16, and 
18, and so on 
3.2 Evaluation 
To evaluate our program, we associated with each textual 
umt m the RS-trees bmlt by the rhetoncal parser a score 
m the same way we did for the RS-trees bmlt by the 
analysts For example, the RS-tree m figure 2 has a depth 
of 6 Because umt 4 is salient for the root, ~t gets a 
score of 6 Units 5, 6 are salient for an internal node 
found two levels below the root therefore, thmr score Is 
4 Umt 9 Is salient for a leaf found five levels below the 
root therefore, ~ts score ~s 1 Table I presents the scores 
associated by our summanzauon program to each umt m 
text (1) 
We used the importance scores assigned by our pro- 
gram to compute staUst~cs s~rmlar to those discussed m 
the prevmus secUon When the program selected only 
the textual umts w~th the highest scores, m percentages 
that were equal to those of the judges, the recall was 53% 
and the preclslon was 50% When the program selected 
the full sentences that were asseclated w~th the most im- 
portant umts, m percentages that were equal to those of 
the judges, the recall was 66% and the precls~on 68% 
The lower recall and precision scores associated w~th 
clauses seem to be caused primarily by the difference m 
granularity w~th respect to the way the texts were broken 
into subumts the program does not recover all rmmmal 
textual umts, and as a consequence, ~ts assignment of 
importance scores ~s coarser When full sentences are 
considered, the judges and the program work at the Same 
level of granularity, and as a consequence, the summa- 
nzauon results tmprove s~gmficantly 
4 Comparison with other work 
We are not aware of any RST-based summanzatlon pro- 
gram for Enghsh However, Ono et al (1994) discuss 
a summanzaUon program for Japanese whose m~mmal 
textual umts are sentences Due to the differences be- 
tween Enghsh and Japanese, R was impossible for us to 
compare Ono's summarizer wtth ours Fundamental dif- 
ferences concerning the assumpttons that underhe Ono's 
workand ours are discussed at length m (Mareu, 1997b) 
87 
, Umt type 
Clauses 
~ Sentences 
Table 3 
Recall Precision 
Random 25 7 25 7 
Microsoft 28 26 
Summarizer " 
Our snmmanzer 53 50 
Analysts 56 66" 
Random 38 4 38 4 
MierosoR 41 39 
Summarizer 
Our summarizer 66 68 
Analysts . 67 5 78 5 
An evaluauon of our summarization program 
We were able to obtmn only one other program that 
summarizes Enghsh text m the one included m the Ma- 
crosoft Office97 package We run the Microsoft summa- 
nzaUon program on the five texts from Sczent~fic Amer- 
scan and selected the same percentages of textual umts 
as those considered Important by the judges When we 
selected percentages of text that corresponded only to the 
clauses considered important by the judges, the lVherosoft 
program recalled 28% of the umts, with a prec~slon of 
26% When we selected percentages of text that corre- 
sponded to Sentences considered lmportsnt by thejudgus, 
the Microsoft program recalled 41% of the units, wxth a 
precision of 39% All Microsoft figures are only shghtly 
above those that correspond to the basehne algorithms 
that select Hnportant umts randomly It follows that our 
program outperforms slgmficantly the one found m the 
Office97 package 
We are not aware of any other summanzatton program 
that can bmld summaries with granularity as fine as a 
clause (as our program can) 
5 Conclusions 
We deserthed the first experiment that shows that the con- 
cepts of rhetorical analysts and nucleanty can be used ef- 
fecUvely for suramannng text The expemnent suggests 
that discourse-based methods can account for determin- 
ing the most zmportant umts m a text w~th a recall and 
precision as high as 70% We Showed how the concepts of 
rbetoncal analysts and nucleanty can be treated algonth- 
mtcally and we compared recall and preclsmn figures of a 
summanzauon program that implements these concepts 
with recall and prects~on figures that pertmn to a basehne 
algonthm and to a c6mmerclal system, the MlcrosoR Of. 
rice97 summarizer The discourse-based summanzauon 
program that we propose outperforms both the basehne 
and the commercial summarizer (see table 3) However, 
since ~ts results do not match yet the recall and precision 
figures that pertmn to the manual discourse analyses, zt 
zs likely that improvements of the rhetorical parser al- 
gorithm wall result m better performance of subsequent 
Lmplemetat~ons 
Acknowledgements. I am grateful to Graeme Htrst for 
the .invaluable help he gave me dunng every stage of 
tins work and to Marllyn Mantel, David Mitchell, Kevm 
Schlueter, and Melame Baljko for their advice on ex- 
perimental design and stanstlcs I am also grateful to 
Marzena Makuta for her help with the RST analyses and 
to my colleagues and friends who volunteered to act as 
judges m the experiments described here 
Tins reasearch was supported by the Natural Sciences 
and Engineering Research Council of Canada 

References 
Barzalay, Regina and Mtchael Elhadad 1997 Using 
Lexlcal Chmns for Text Summanzauon In Proceed- 
mgs of the ACL'97/EACL'97 Workshop on lntelhgent 
Scalable Text Summanzatton 

Baxendale, PB ' 1958 Macinne-mademdexfortechmcal 
hterature m an experiment IBM Journal of Research 
and Development, 2 354-361 

Chou Hare, Vlctona and Kathleen M Borchardt 1984 
Direct instruction of summanzaUon skzIls Readmg 
Research Quarterly, 20(1) 62-78, Fall 

Cochran, WG 1950 The comparison of percentages m 
matched samples Btometrtka, 37 256-266 

Edmundson, H P 1968 New methods m automatic ex: 
tractmg Journal of the Assoclatton for Compuung 
Machinery, 16(2) 264-285, April 

Gale, Wfiham, Kenneth W Church, and Dawd Yarowsky 
1992 Esumatzng upper and lower bounds on the per- 
formance of word-sense disamblguataon programs In 
Proceedings of the 30th Annual Meetmg of the Assoct- 
atwn for Computatwnal Lmgmstws (ACL-92), pages 
249--256 

Garner, Rutli 1982 Efficient text summanzauon 
costs and benefits Journal of FEducanonal Research, 
75 275-279 

Hearst, Marta 1994 Multa-paragraph segmecntauon of 
expository text In Proceedmgs of the 32nd Annual 
Meenng of the Assoczanon for ComputanonaI Lmgms- 
tws, pages 9-16, Las Cruces, New Mexico, June 27- 
30 

Johnson, RonaldE 1970 Recall of prose as a funcuon 
of structural importance of hngmsUc umts Journal of 
Verbal Learning and Verbal Behavwur, 9 12-20 

Knppendorff, Klaus 1980 Content analys~s An Intro- 
ductton to tts Methodology Sage Pubhcatmns, Bev- 
erly Hills, CA 

Lan, Chin-Yew 1995 Knowledge-based automauc topm 
ldenUficutton In Proceedings of the 33rd Annual 
Meenng of.the AssoclaUon for Computanonal Lm- 
gmsttcs (ACL-95), pages 308-310, Cambridge, Mas- 
sachusetts, June 26-~30 

Lm, Chin-Yew and Eduard Hovy 1997 Idenufymg top- 
lcs by posmon In Proceedings of the Fifth Conference 
on Apphed Natural Language Processing (ANLP-97), 
pages 283-290, Washington, DC, March 31 - April 3 

Luhn, H P 1958 The automatac creation of hterature 
abstracts IBM Journal of Research and Development, 
2(2) 159-165, April 

Mann, Wllham C and Sandra A Thompson 1988 
Rhetorical structure theory Toward a funclaonal the- 
ory of text orgamzaUon Text, 8(3) 243-281 

Marcu, Darnel 1996 Btalchng up rhetorical structure 
trees In Pwceedmgs of the Thwteenth Natzonal Con- 
ference on Arufictal lntelhgence (AAAI-96), volume 2, 
pages 1069-1074, Portland, Oregon, August 4-8, 

Marcu, Darnel 1997a The rhetorical parsing of natu- 
ral language texts In Pwceedmgs of the 35th Annual 
Meenng of the Assoctatwn for Computatlonal Lmgms- 
tws (ACIIEACL-97), Madrid, Spmn, July. 7-I0 

Marcu, Darnel 1997b The rhetorical parsing, sum- 
manzanon, and generatwn of natural language texts 
Ph D thesis, Department of Computer Science, Um- 
verslty of Toronto, Forthconung 

Mattinessen, Chnsuan and Sandra A Thompson 1988 
The structure of dtscourse and 'subordmauon' In 
J Hmman and S A Thompson, editors, Clause com- 
bining m grammar and dzscourse, volume 18 of Typo- 
logwal Studtes m Language John Benjanuns Pubhsh- 
mg Company, pages 275-329 

On0, Kenjl, Kazuo Sunuta, and Seljl Mnke 1994 Ab- 
stract generation based on rhetorical structure extrac- 
Uon In Proceedings of the lnternatwnal Conference 
on Computanonal Lmgmstws ( Cohng-94),pages 3A A. 
348, Japan 

Pmce, Chris D 1990 Construcung hterature abstracts 
by computer techmques and prospects Informatwn 
Processmg and Management, 26(1) 171-186 

Passonneau, Rebecca J and Diane J Lltman 1993 
Intenuon-based segmentation human rehabfllty and 
correlatton wtth hngmsUc cues In Proceedings of the 
31st Annual Meeting of the Assocmnon for Computa- 
nonalLmgmsttcs, pages 148-155, Oino, June 22-26 

Rush, JE, R Salvador, and A Zamora 1971 Auto- 
matac abstracting and indexing PtoducUon of mdlca- 
Uve abstracts by apphcauon of contextual reference 
and syntacuc coherence criteria Journal of Amerwan 
Society for lnformanon Sctences, 22(4) 260-274 

Sherrard, Carol 1989 Teaching students to summarize 
Applying texthngulstlcs System, 17(1) • 

. Skorochodko, E F 1971 Adaptive method of automatic 
abstracung and indexing In lnformatwn Processing, 
volume 2, pages 1179-1182 North-Holland Pubhsh- 
mg Company 

Sparck Jones, Karen 1993 What nught be m a 
summary') In Informatwn Retrieval 93 Vonder 
Modelherung zur Anwendung, pages 9-26, Umver- 
sltatsverlag Konstanz 

Wmograd, Peter N 1984 Strategic chfiicultaes m 
summanzang texts Reading Research Quaterly, 
19(4) 40~ ~25, Summer 
