Fast Generation of Abstracts from General Domain 
Text Corpora by Extracting Relevant Sentences 
Klaus Zechner 
Computational Linguistics Program 
Department of Philosophy 
135 Baker Hall 
Carnegie Mellon University 
Pittsburgh, PA 15213-3890, USA 
zechner@andrew, cmu. edu 
Abstract 
This paper describes a system for gen- 
erating text abstracts which relies on 
a general, purely statistical principle, 
i.e., on the notion of "relevance", as 
it is defined in terms of the combina- 
tion of tf*idf weights of words in a sen- 
tence. The system generates abstracts 
from newspaper articles by selecting the 
"most relevant" sentences and combin- 
ing them in text order. Since neither 
domain knowledge nor text-sort-specific 
heuristics are involved, this system pro- 
vides maximal generality and flexibility. 
Also, it is fast and can be efficiently iln- 
plemented for both on-line and off-line 
purposes. An experiment shows that re- 
call and precision for the extracted sen- 
tences (taking the sentences extracted 
by human subjects as a baseline) is 
within the same range as recall/precision 
when the human subjects are coinpared 
amongst each other: this means in fact 
that tile performance of the system is in- 
distinguishable from the performance of 
a human abstractor. Finally, the system 
yields significantly better results than 
a default "lead" algorithm does which 
chooses just some initial sentences from 
the text. 
1 Introduction 
With increasing amounts of machine readable in- 
formation being available, one of the major prob- 
lems for users is to find those texts that are most 
relevant to their interests and needs in as short an 
amount of time as possible. 
The traditional IR approach is that the user 
inputs a boolean query (possibly in a natural 
language-like formulation) and the system re- 
sponds by presenting to the user the texts that 
are a "best match" to his query. In corpora where 
abstracts are not already provided it might facil- 
itate the retrieval process a lot if text abstracts 
could be generated automatically either off-line to 
be stored together with tile texts (e.g., as ranked 
sentence numbers), or on-line, in accordance with 
the user's query. 
So far, there have been two main approaches 
in this field (for overviews on abstracting and 
summarizing see, e.g., (?) or (?)). One is ori- 
ented more towards information extraction, work- 
ing with a knowledge base in a limited domain 
("top down", see e.g., (?; ?; ?)), tile other type 
relies mainly on various heuristics ("bottom up", 
see e.g., (?; ?)) which are less dependent on the 
domain but are still at least; tuned to the text sort 
and thus have to be adapted whenever the system 
would have to be applied outside its original en- 
vironment. Combinations of these methods have 
also been attempted recently (see e.g. (?)). 
The focus of this paper will be the description 
and evaluation of an abstracting system which 
avoids the disadvantages coming along with most 
of these traditional approaches, while still be- 
ing able to achieve a performance which matches 
closely the results of an identical abstracting task 
performed by human subjects in a comparative 
study. 
The results indicate that it is indeed possible to 
build a system relying on a simple and efficient al- 
gorithm, using standard tf*idf weights only, while 
still achieving a satisfying output} 
2 A System for Generating Text 
Abstracts 
Kupiec et al. (?) present the results of a study 
where 80% of the sentences in man-made abstracts 
were "close sentence matches", i.e., they were "ei- 
ther extracted verbatim from the original or with 
minor modifications" (p.70). Therefore, we argue 
that it is not only an easy way but indeed an ap- 
propriate one for an automatic system to choose a 
number of the most relevant sentences and present 
1By "satisfying" we mean at least indicative for the 
content of ~he respective text, if not also informative 
about it. 
986 
these as a "text; abstract;" to the user. ~ We further 
argue that; coherence, although certainly desir- 
able, is imi)ossible without a large scale knowledge 
based 1;ext mldersl;an(ling syst;em which would not 
only slow down dm l)erformance signiticantly but 
necessarily could not be domain inde,1)endent. 
Our design goal was to use as simple and effl- 
cleat an algorithm as t)ossibh',, avoiding "hem(s- 
tics" and "fe, al;ures" emph)yed by other systems 
(e.g., (?)) wlfich may be hell)tiff in a specific text 
domain but would have to be redesigned whenever 
it were ported to a new domain, a In this respect, 
our system can be compared with the approach 
of (?) wit() also t)resent an abstracting system for 
general domain texts. However, whereas their fo- 
cus is on the evaluation of abstracl; readability (as 
stand-alone texts), ours is rather on abstract rele- 
vance. A flirther difference is the (non-standard) 
method of tf*idfweight ('ah:ulation timy are using 
for their system. 
Our sysl;em was deveh)ped in C+.t-, using li- 
braries for dealing with texts marke(l ut) in SGML 
format. The algorithm performs the following 
sl;et)s: 4 
1. Take an arl;Me fl'om the corl)uS 5 and 
lmild a word weight; matrix for all con- 
tellt words across all sentences (l;f*idf 
(:omputal;ion, where the idf-vahms ttte r(> 
trieved fl'om a preconqmted file). (; Iligit fre- 
(tuency closed class words (like A, THE, ON 
etc.) are excluded via a stop list file. 
2. Determine the sentence weights for all sen- 
ten(:es in tim arl;Me: Compltt;e the sum over 
2Clem'ly, there will be less (:oherence than in a man- 
made abstract, but, the extracted passages can t)e pre- 
sented in a way which indicates their relative position 
in tim text, thus avoiding a possil)ly wrong inti)ression 
of adjacency. 
aln fact,, it t,urned out that fact,ors which couhl 1)e 
thought of as %l)ecitic for newspaper articles", su(:h 
as increased weights for title words or sentences in 
the beginning, did not have a sign(titan( eriect (m the 
sys|;elll~s per\['orntance,. 
4Due to space limitations, we cannot, give all tilt; 
details here. The reader is ref('xred t,O (?) for there 
information on this algorithm, various odter nte, thotls 
that were tested and their respective result,s. (Tiffs 
paper can I)e el)rained Kern t,im author's heine 1)age 
whose URL is: 
ht tp://www.h:l.cmu.e(lu/~zechner/klaus.htnfl.) 
'~'We used the Daily Telegral)h Corpus which com- 
prises approx. 44.000 articles (15 million words). 
(~tt*idf=term frequency in a document (tfk) times 
t,he logarithm of the nunlber of documents in a collec- 
l;ion (N), divi(led I)y the IlnI\[lber of do(;untents where 
this term oc(:nrs (Ilk): tfk * log ~_N This formula n k 
yields a high numl)er for words which are frequent in 
one dneument but; api)e.ar in very few documents only; 
hence, they can be considered a.s "indicative" fbr the 
respective document. 
all tf*idf-values of the (:on(eat words 7 for each 
sentence, s 
3. Sort the sentences according to l;heir weights 
and extract the N highest weighted sentences 
in text order to yield (,he abstract of the doc- 
llHleltt. 
To r(~thtce the size of the vocabulary, our system 
(;()nv(',rts every word to Ul)I)er (:ase and (runt:ales 
words after the sixth character. This is also rout:it 
faster than a word stemming algorithm which has 
to perh)rm a inorphological analysis. For our ex- 
periment;s, the, amount of new ambiguities thereby 
introduced did not cause specific problems for tim 
system. 
For the test set, we (:host', 6 articles fl'om the cor- 
ires whi(:h are (:los(; t;o tim gh)bal cortms a,verage 
of \] 7 senl;en(:es per ardcl(;; these artich',s (:ontain 
approx. 550 words alt(l 22 sentences on the, aver- 
age (range: 19 23). All these artMes are at)out a 
single topic, i)robably becmme of our choi(:e al)out 
a ret)resenl;ative text, lengdL We (lo not address 
ttm issue of multi-topicality here; however, it is 
well-known that texts with more (hall olle tel)it 
are. hm'd to deal wit;it for all kinds of Ill. systeltlS. 
E.g., the ANES system, described i)y ('?), tries 
to i(lenl;iily l;hese texts beforehand to 1)e ex(:luded 
fl'om abstracl;ing. 
The system's rllll-til\[te ()It a SUN St)arc worksta- 
l;ion (UNIX, SUN OS 4.1.3) is appro×. 3 seconds 
for an article of th(; test, set. 
3 Experiment: Abstracts as 
Extracts Generated by Human 
Subjects 
In order to bc able to ewfluate the quality of tim 
abstracts t)rodueed by our system, we, conducted 
an experiment where we asked 13 human subjects 
to choose the "most relevant 5-7 sentences" from 
the six articles Dom the test set. 9 \]b t;~cilitate 
their (;ask, the subjects should first give each of 
the sentences in an artMe a "relewmce score" from 
l ("barely relewmt")to 5 ("highly relevant;") and 
finally choose tit(', trust scored sentences for th(;ir 
abstracts. The subjects were all native speakers 
of English (since we used an Englistl cortms) and 
were. paid for their task. Compared l;o about 3 set:- 
ends for the machine system, the hmnans nee(h;d 
rThis provides a bias towards longer sentences. Ex- 
periment,s with methods that normalized for sentence 
length yiehled worse results, so dtis bias appears to be 
apI)roI)riate. 
SWords in the title and/or appearing in t,ln! 
first,/last few sent,enees (:all be given Inore weight by 
tneans of an editable parame.l;e.r tilt;. It turns out,, how- 
ever, that, these weights do not, lead to an improvement, 
of the syst,em's performance. 
9This number corresponds in fact, well to the obser- 
vation of (Y) that, the opt,ilnal smnmary length is be- 
t;ween 20% and 30% of the original document length. 
987 
about 8 minutes (two orders of magnitude more 
time) for determining the most relevant sentences 
for an article. 
4 Results and Discussion 
4.1 Automatically created abstracts 
Table 1 shows the precision/recall values for the 
tf*idf-method described in section ?? and for a 
default method that selects just the first N sen- 
fences fi'om the beginning of each artMe ("lead"- 
method). Whereas tile lead method most likely 
provides a higher readability (see Brandow et al. 
(?)), tile data clearly indicates that the tf*idf 
method is superior to this default approach in 
terms of relevance, l° The computation of these 
precision/recall values is based on the sentences 
which were chosen by the human subjects from 
the experiinent, i.e., an average was built over the 
precision/recall between the machine system and 
each individual subject. 
4.2 Abstracts produced by human 
subjects 
The global analysis shows a surprisingly good 
correlation across the hunmn subjects for the sen- 
tence scores of all six articles (see table ??): in 
the Pearson-r correlation matrix, 71 coefticients 
are significant at the 0.01 level (***), 5 at the 0.05 
level (*), and only 2 are non significant (n.s.). This 
result indicates that there is a good inter-subject 
agreement about the relative "relevance" of sen- 
tences in these texts. 
4.3 Comparison of machine-made and 
hurnan-Inade abstracts 
We computed precision/recall for (;very human 
subject, compared to all the other 12 subjects 
(taking the average precision/recall). From these 
individual recall/precision values, the average was 
computed to yield a global measure for inter- 
huinan precision/recall. Depending oil the article, 
these values range from 0.43/0.43 to 0.58/0.58, the 
mean being 0.49/0.49. As we can see, these re- 
sults are in the same range as the results for the 
machine system discussed previously (0.46/0.55, 
for abstracts with 6' sentences). This means that 
if we compare the output of the automatic sys- 
tem to the output of an average human subject in 
the experiment, there is no noticeable ditference in 
terms of precision/recall the machine l)erforlns 
as well as human subjects do, given the task of 
selecting the most relevant sentences from a text. 
1°The tf*idf nmtho<t proved itself better than all the 
other methods of weight computation which we tested 
(see (?)); in particular, those using a combination of 
w~rious other heuristics, as proposed, e.g., in (?). 
5 Suggestions for further work 
5.1 Dealing with multi-topical texts 
It can be argued that so far we have only dealt 
with short texts about a single topic. It is not 
clear how well the system would be able to handh; 
texts where multiple threads of contents occur; 
possibly, one couhl employ the method of text- 
tiling here (see e.g., (?)), which helps determin- 
ing coherent sections within a text and thus could 
"guide" the abstracting system ill that it would 
be able to track a sequence of multit)le topics in a 
text,. 
5.2 On-line abstracting 
While our system currently produces abstracts off- 
line, it is feasible to extend it in a way where 
it uses the user's query in an IR environment to 
determine tile relevant sentences of the retrieved 
documents, tIere, instead of producing a "general 
abstract", the resulting on-line abstract would re- 
flect more of the "user's perspective" on the re- 
spective text. However, it would have to be in- 
vestigated, how nmch weight-increase the words 
from the user's query should get in order not to 
bias tile resulting output in too strong a way. 
Further issues concerning the human-inaehine 
interface are: 
• highlighting passages containing the query 
words 
• listing of top ranked keywords in tile retrieved 
text(s) 
• indicating the relative position of the ex- 
tracted sentences in the text 
• allowing for scrolling in the main text, start- 
ing at an arbitrary position within the ab- 
stract 
6 Conclusions 
Ill this paper, we have shown that it is possible to 
implement a system for generating text abstracts 
which purely operates with word frequency statis- 
tics, without using either domain specific knowl- 
edge or text, sort specific heuristics. 
It was demonstrated that the resulting ab- 
stracts have the same quality in terms of preci- 
sion/recall as the abstracts created by human sub- 
jects ill an experiment. 
While a simple lead-method is more likely to 
produce higher readability judgments, the advan- 
tage of the tf*idf-method for abstracting is its, su- 
periority in terms of capturing content relevance. 
Acknowledgments 
Tile major part of this work has been drawn froln 
the author's dissertation at the Centre for Cogni- 
tive Science, University of Edinburgh, UK. I wish 
to thank lily supervisors Steve Finch and Richard 
988 
Tal)lc' l: lh'ecision/r(!call wdues tbr default (lead) and tf*id\[' methods. 
3- ..... 6.agfi).  
s o.ss/o.sl | 0.45/0.a8 
10 0.37/0.62 | 0.41/0.74 
12 0.a4/0.( 9 | 0.ag/0.sa 
14 0"33/0"79 l 0.37/0.91 
Table 2: Significance of sentence score correlation between human sul)jeet, s: All 6 articles 
HS2 
HS4 
HS3 
HS8 
IIS9 
HS1 
HS5 
HS12 
ITS11 
HS13 
ITS10 
HS14 
height 
tlS4 IIS3 IIS8 ItS9 IIS1 HS5 ITS12 I1S11 \[IS13 HS10 HS14 IIS15 
gg* #~# g#g ##g *gg ### ### 
gg* *g* *g# g*g g*g * 
# *## *g# #g* ggg 
• g# #** g # 
ggg *** #*@ 
• ## gg# 
##$ 
**g #*# *## g*g 
### ### #gg **g 
#*~ ~** g*g g*# 
%*g #gg *g* #** 
*#$ ##* g#* #*# 
#gg #g* ggg #** 
g#~ g*$ gg# g** 
### #%* g*# \]L,S. 
IL.S. *** 
*gg 
Shillcock for vahlal)le discussions, suggestions and 
advice.. Also, I am grateful to Chris Manning for 
his comments on an (~arlier draft;, as well as t,o the 
t;WO allOllyi\[lOllS r(wieweI'8 whose, reillarks gre.al;ly 
helped in improving t, his l)aper. 
The author has l)eeal supi)orted in part by 
grants from the Austrian Chamber of Com- 
merce and ~Dade (lhlndeswirtsehafl;skammer) and 
the Austrian Ministry of Science and Research 
(BMWF). 
References 
Brandow, R., Mitze, K., Ibm, L.F. 1995. Auto- 
mat;it Condensation of Electronic Publications 
by Se, ntence Selection. In: Information P,,vccss- 
ing 84 Management, ,71(,5). pp.675 685 
Edmundson, H.P. 1969. New Methods in Au- 
tomatic Extracting. In: Journal of the ACM, 
16(2). pp.264 285 
Hearst, M.A., Plaunt, C. 1993. Subt;opie Sl;ruc- 
luring for Pull-Length Docmn(mt Access. In: 
Proceedings of the 16th ACM-SIGII~ Confl;r- 
trice, t)p.59 68 
Hobbs, ,I.R., At)I)elt, D.E., Bear, ,1.S., Israel, l).J., 
Tyson, W.M. 1992. FASTUS: A System for 
Extra(:ting Int'ormation h'oln Natural Language 
Text. SRI International, Technical Note 519, 
Menlo Park, CA 
Jaeobs, P.S., Rau, L.F. 1990. SCISOR: Extract- 
ing Informal;ion Dora On-line News. ht: Com- 
munications of the ACM, .73 (11). pp.88 97 
Kupiec, J., Pedersen, J., (\]hen, F. :1995. A Train- 
al)le, Docmne.nt Summarizer. In: l'r'oce¢'.dings of 
the. 18th ACM-SIGIR Confe~w~,cc.. t)p.68 73 
Mauhlin, M.L. 11989. Information l/.etrieval l)y 
Texl; Skimming. CMU-CS-89-193, Carnegie 
Mclhm University, Pittsbllrgh, PA 
Miike, S., It;oh, E., Ono, K., Sumita, K. 1994. 
A Full-Text Retrieval System with a Dynamic 
Abstract; Generation Funct;ion. In: Proceedings 
of the 17th ACM-SIGIR Um~fl:~v.'nce. t)t).152 
161 
Morris, A.H., Kasper, G., Adams, D. 1992. 
The Effects and Limitations of Automated T('~xt; 
Condensing on Reading Comi)rehension Per- 
formaime. In: Information Systems l},esea'rch, 
3(1). 1)t).17 35 
Paice, C.I). 1990. Constructing l,il;era- 
lure Abstracts t)y Comlmter: \[\[>(:hniques and 
Prost)ecl;s. In: Information Processing 84 Man- 
ageme.nt , 26(1). t)1).7171 186 
Salt(m, G., Allan, J., Buekley, C. \]993. Ap- 
proa(-hes to Passage Ret, rieval in Full ~l~,xt In- 
formation Systems. TR 93-1334 (1993), Cornell 
University, \[thaca, NY 
Sparck Jones, K., En(lres-Niggemeyer, i3. :1995. 
Automal;i(: Summarizing. In: Information P~v- 
cessing I"i Management, 31(5). pp.625 630 
Zeehner, K. 1995. Automatic Text Abstracting 
by Selecting Relewmt Passages. M.Se, i)isser- 
ration, Centre for Uognitive Science, University 
of Edinburgh, UK 
989 
