Sentence extraction as a classification task 
Simone Teufel 
Centre for Cognitive Science 
and Language Technology Group 
University of E&nburgh 
S. Teufel@ed. ac. uk 
Marc Moens 
Language Technology Group 
University of Ecb.nburgh 
M. Moens@ed. ac. uk 
Abstract 
A useful first step m document summau- 
sation is the selection of a small number of 
'meamngful' sentences from a larger text 
Kupiec et al (1995) describe tim as a clas- 
mficatlon task on the basis of a corpus of 
technical papers with summaries written 
by professional abstractors, their system 
ldent~fies those sentences m the text which 
also occur in the summary, and then ac- 
quires a model of the 'abstract-worthiness' 
of a sentence as a combination of a hmlted 
numbel of properties of that sentence 
We report on a rephcatlon of thin exper- 
nnent with different data summaries for 
our documents were not written by pro- 
fessional abstractors, but by the authors 
themselves Tins produced fewer allguable 
sentences to tram on We use alternative 
'meaningful' sentences (selected by a hu- 
man judge) as training and evaluation ma- 
terial, because tlns has advantages for the 
subsequent automatic generation of more 
flexible abstracts We quantitatively com- 
pare the two ¢hfferent strategies for training 
and evaluation (vm ahgnment vs human 
judgement), we also chscnss qualitative chf- 
ferences and consequences for the genera- 
tlon of abstracts 
1 Introduction 
A useful first step m the automatic or semi- 
automatic generation of abstracts from source texts 
m the selection of a small number of 'meamngful' 
sentences from the source text To achieve tins, 
each sentence m the source text is scored according 
to some measure of importance, and the best-rated 
sentences are selected Thin results m collections of 
the N most 'meamngful' sentences, m the order m 
wlnch they appeared m the source text - we will call 
these excerpts An excerpt can be used to give read- 
ers an idea of what the longer text m about, or It can 
be used as input into a process to .produce a more 
coherent abstract 
It has been argued for almost 40 years that it m 
posmble to automatically create excerpts which meet 
bamc reformation compresmon needs (Luhn, 1958) 
Since then, different measurements for the impor- 
tance of a sentence have been suggested, m partic- 
ular stochastic measurements for the mgmficance of 
-key words or phrases (Lulm, 1058, Zechner, 1995) 
Other research, starting with (Edmundson, 1969), 
stressed the Importance of heuristics for the location 
of the candidate sentence m the source text (Baxen- 
dale, 1958) and for the occurrence of cue phrases 
(Palce and Jones, 1993, Johnson et al, 1993) 
Single heunstms tend to work well on documents 
that resemble each other m style and content For 
the more robust creation of excerpts, combinations 
of these heuristics can be used The eruclal ques- 
tion m how to combine the ¢hfferent heuristics In 
the past, the relative usefulness of single methods 
had to be balanced manually Kupmc et al (1095) 
use supervised learnmg to automatically adjust fea- 
ture w~ghts, using a corpus of research papers and 
corresponding summaries 
Humans have good intuition about what makes 
a sentence 'abstract-worthy', I e suitable for inclu- 
sion in a summary Abstract-worthiness m a lugh- 
level quality, comprising notions such as semantic 
content, relative importance and appropriateness for 
representing the contents of a document • .For the 
automatic evaluation of the quality of machine gen- 
erated excerpts, one has to find an operational ap- 
proximation to this subjective notion of abstract- 
worthiness, 1 e a defuntlon of a desired result We 
will call the criteria of what constitutes success the 
gold standard, and the set of sentences that fulfill 
58 
! 
I 
! 
I 
i 
! 
i 
I 
I i 
i 
1 
l 
t 
I 
! 
l 
i 
I. 
I 
I 
I 
I 
I 
! 
I 
I 
I 
i 
I 
i 
i 
i 
I 
I 
! 
| 
these criteria the gold standard sentences Apart 
from evaluation, a gold standard m also needed for 
supervmed learning 
In Kupiec et al (1995), a gold standard sentence 
is a sentence m the source text that zs matched ruth 
a summary sentence on the basra of semantic and 
syntactic snnflanty In thear corpus of 188 engineer- 
mg papers with summaries written by professional 
abstractors, 79% of sentences occurred m both sum- 
mary and source text with at most minor moddica~ 
tzons 
However, our collection of papers, whose abstracts 
were written by the authors themselves, shows a 
szgnh~cant difference these abstracts have $1~nl~-. 
cantly fewer ahgnable sentences (31 7%) This does. 
not mean that there are fewer .abstract-worthy sen- 
tenees m the source text We used a simple (labour- 
intensive) way of defimng thin alternative gold stan- 
dard, vzz aslang a human judge to identify addi- 
tional abstract-worthy sentences in the source text 
Our mare question was whether Kuplec et al's 
methodology could be used for our kind of gold stan- 
dard sentences also, and if there was a fundamental 
chfference in extraction performance between sen- 
tences in both gold standards or between documents 
with higher or lower alignment We also conducted 
an experiment to see how additional training mate- 
hal would influence the statistical model 
The remainder of this paper is organized as fol- 
lows in the next section, we s-mmanze Kuplec et 
al's method and results Then, we describe our 
data and dmcuss the results from three experiments 
with dflferent evaluation strategies and tralmng ma- 
terial Differences between our and Kuplec et al's 
data with respect to the ahgnablhty of document 
and summary sentences, and consequences thereof 
are conmdered m the discussion 
2 Sentence selection as classification 
In Kupzec et al's experiment, the gold standard 
sentences are those summary sentences that can be 
aligned with sentences m the source texts Once 
the alignment has been carried out, the system tries 
to determine the characteristic properties of ahgned 
sentences according to a number of features, wz 
presence of particular cue phrases, location in the 
text, sentence length, occurrence of thematic words, 
and occurrence of proper names Each document 
sentence receives scores for each of the features, re- 
suiting m an estimate for the sentence's probabihty 
to also occur m the summary This probabihty is 
calculated as follows 
59 
P(8 ~ SlFi, 
P(a ~ SIFt, 
e{, ~ s) P(F,I, ~ S) 
P(F,) 
k F, 
, Fk) ~ p(,~s) N~=, P{P,l,~s) 
.1\[I;., pcF,~ 
, Fj) Probablhty that sentence 
s m the source text m mcluded 
111 ~lmmary S, given Its feature 
vvlues, 
compressmn rate (constant), 
probablhty of feature-value pair oc- 
curnng m a sentence winch m m the 
summary, 
probabihty that the feature-value 
pair occurs uncon&tzonally, 
• number of feature-valus pairs, 
j-th feature-value pair 
Aseummg statmtmal independence of the features, 
P(~ls E S) and P(Fj) can be estnnated from the 
corpus 
Evaluatmn rches on ccross-vahdatmn The model 
m trmned on a training set of documents, having one 
document out at a tune (the cu~ent test document) 
The model is then used to extract can&date sen- 
tences from the test document, allowing evaluation 
of precision (sentences selected correctly over total 
number of sentences selected) and recall (sentences 
selected correctly over ahgnable sentences m sum- 
mary) Since from anygrven test text as many sen- 
tences are selected as there are ahgnable sentences 
m the summary, precamon and recall are always the 
same 
Kupiec et al reports that preasion of the m&wd- 
ual hetmstles ranges between 20-33%, the highest 
cumulative result (44%) was adaeved using para. 
graph, fixed phrases and length cut-off features 
3 Our experiment 
3.1 Data and gold standards 
Our corpus m a collection of 202 papers from dif- 
ferent areas of computational lmgtusties, with sum- 
maries written by the authors 1 The average length 
of the summaries m 4 7 sentences, the average length 
of the documents 210 sentences 
We seml-aut0matlcally marked up the following 
structural reformation title, summary, headings, 
paragraph structure and sentences Tables, equa- 
tions, figures, captious, references and cross refer- 
ences were removed and replaced by place holders 
1The corpus was drawn from the computataon and 
language arcinve (http //xxx lan3..gov/cmp-lg), con- 
vetted from DTF~ source into HTML m order to ex- 
tract raw text and ramlmal structure automatically, then 
transformed into our SGML format with a perl script, 
and manually corrected Data colinctlen took place col- 
laboratavely with Byron Georgantopolous 
" :'i 310senmm:e~ " . .~Tm~ l~se.~ " " 
Amhof'mmman~s GoM ~,~ls 
Tranung set 1 40 documenm 
~8888~ 21% 
Auth~rmmmmnea Gold~-,,~Mds Authorsmmmanes Gold standards 
Training set 2 Trammg set 3 42 documJnts 42 documents 
Figure 1 Composition of gold standards for trmnmg sets 
U 
m 
~ Gold madard A At~sabt~ ,.umry 
  Gold ~-d,,d B non-ahxnab~ b~ mkvml sea~.,nces (tram jDdgm~m) 
We decided to use two gold standards 
• Gold standard A: AHgnment. Gold stan- 
dard sentences are those occurring m both au- 
thor summary and source text, m line with Ku- 
pmcet al's gold standard 
• Gold standard B: Human Judgement. 
Gold standard sentences are non:ahgnable 
source text sentences which a human judge 
identified as relevant, 1 e mchcatlve of the con- 
tents of the source text Exactly how many 
human-selected sentence candidates were cho- 
sen was the human judge's decision 
Ahgnment between summary and document sen- 
fences was assmted by a simple surface snmlarlty 
measure (longest common subsequence of non-stop- 
list words) Final ahgnment was declded by a hu- 
man judge The cxlterlon was snnllanty of semantic 
contents of the compared sentences The following 
sentence palr illustrates a dsrect match - . 
Summary: /n understand~ a reference, an 
agent detexmmes his confidence m Its ade- 
quacy as a means of identifying the referent 
Document: An agent understands a refer- 
ence once he is com~dent m the adequacy of 
its (referred) p/an as a means of Identifying 
the referent 
Our data show an important chfference wlth Ku-. 
plec et al's data we have slgnn~cantly lower ahgn- 
ment rates Only 17 8% of the summary sentences 
In our corpus could be automatlcally ahgned wlth 
a document sentence wlth a certain degree of reh- 
ablhty, and only 3% of all summary sentences are 
Identlcal matches wlth document sentences 
We created three chfferent sets of trmnmg mate- 
hal 
Training set I: The 40 documents with the 
highest rate of overlap, 84% of the summary 
sentences could be semi-antomatlcally ah~ned 
with a document sentence 
Training set 2:42 documents from the year 
1994 were arbitrarily chosen out of the re- 
mmnmg 163 documents and seml-automatlcally 
ahgned They showed a much lower rate of over- 
lap, only 36% of summary sentences could be 
mapped into a document sentence 
Training set 3:42 documents from the year 
1995 were arhitranly chosen out of the remain- 
mg documents and serm-automahcally ahgned 
Again, the overlap was rather low 42% 
Training set 123: Conjunctlon of training sets 
I, 2 and 3 The average document length m 194 
sentences, the average summary length m 4 7 
sentences 
• A human judge provlded a mark-up of addltlonal 
abstract-worthy sentences for these 3 trmnmg sets 
(124 documents) The remaining 78 documents 
remain as unseen test data Figure 1 shows the 
compomtlon of gold standards for our training sets 
Gold standard sentences for trmmng set I consmt of 
an approximately balanced mixture of ahgned and 
human-selected candidates, whereas training set 2 
contains three times as many human-selected as 
ahgned gold standard sentences, training set 3 even 
four times as many Each document m trmmng set 1 
is associated with an average of 7 75 gold standard 
sentences (A+B), compared to an average of 7 07 
gold standard sentences m trmnmg set 2, and an 
average of 9 14 gold standard sentences m trammg 
set 3 
6O 
II 
I 
1 
! 
I 
1 
I 
I 
,| 
i 
1 
m. 
I 
I. 
I 
I 
i 
I 
I 
I 
! 
I 
I 
I 
l 
i 
3.2 Heuristics 
We ~ employed 5 chfferent heuristics 4 of the meth- 
ods used by Kuplec et al (1995), viz cue phrase 
method, locatlon metli6d, sentence length method 
and thematic word method, and another well-known 
method m the hterature, viz title method 
1. Cue phrase method: The cue phrase method 
seeks to filter out met~-dtscourse from subject mat- 
ter We advocate the cue phrase method as our mare 
method because of the ad&tmnal 'rhetorical' context 
these meta-lmgmstlc markers make available Thls 
context of the extracted sentences - along with their 
proposmunal content - can be used to generate more 
flexible abstracts 
We use a hst of 1670 negative and positive cues 
and indicator phrases or formulalc expressions, 707 
of which occur m our training sets For sLmphclty 
and efficiency, these cue phrases are fixed strings 
Our cue phrase hst was manually created by a 
cycle of Inspection of extracted sentences, ldentlfi- 
cat!on of as yet unaccounted-for expressmns, ad&- 
tlon of these expressions to the cue phrase hst, and 
possibly inclusion of overlooked abstract-worthy sen- 
tences m the gold standard Cue phrases were man- 
ually classtfied mto 5 classes, whlch we expected to 
correspond to the hkehbood of a sentence containing 
the glvcu cue to be included m the summary a score 
of-1 means 'very unhkely', -~3 means 'very hkely 
to be included m a summary' 2 We found ~t useful 
to assist the dec~un process with corpus frequen- 
cies For each cue phrase, we compded ~ts relative 
frequency m the gold standard sentences and m the 
overall corpus If a cue phrase proved general {\] e 
~t had a high relative corpus frequency) and dtstmc- 
t~ve (~ e \]t had a high frequency within the gold 
standard sentences), we gave ~t a high score, and 
included other phrases that are syntactically and se- 
manhcally sirmlar to \]t mr0 the cue hst We scanned 
the data and found the following tendencies 
• Certain communlcat~ve verbs are typically used 
to describe the overall goals, they occur fre- 
quently m the gold-standard sentences (ar- 
gue, propane, develop and attempt) Others 
are predonnnantly used for describing com- 
munlcattve sub-goals (detaded steps and sub- 
arguments) and should therefore be m a dif- 
ferent equivalence class (prove, show and con- 
clude) W~tlnn the class of commumcat~ve 
verbs, tense and mode seem to be relevant 
for abstract-worthinesS Verbs m past tense 
~We experimented w~th larger and smaller numbers 
of classes, but obtained best results with the 5-way 
&stmct~on 
61 
or present pedect {as used m the conclumon) 
are more hkely to refer to global achieve- 
ments/goals,, and thus to be included m the 
• summary In the body of the text, present and 
future forms tend to be used to introduce sub- 
tasks 
Genre specific nominal phrases hke this paper 
are more distractive when they occur at the be- 
gmmng of the sentence (as an approxLmatlon to 
subject/topic potation)than their non-subject 
counterparts 
Exphclt summansatlon markers hke m sum, 
concluding chd occur frequently, but quite un- 
expectedly almost always m combination with 
commumcatlve sub-tasks They were therefore 
less useful at slgnalhng abstrac~worthy mate- 
hal 
Sentences m the source text are matched against 
expresslous m the hst Matching sentences are clas- 
sified into the correspundmg class, and sentences 
not contaunng cue phrases are clsssflied as 'neutral' 
(score 0) Sentences with competing cue phrases are 
classflied as members of the class with the lngher 
numerical score, unless one of the competing classes 
is negative 
Sentences occurnng directly after hsadmgs hke In- 
troductson or Results are valuable indicators of the 
general subject area of papers Even though one 
rmght argue that ttns property should be handled 
within the location method, we percetve tlas refor- 
mation as meta-hngmstlc (and thus logically belong- 
mg to the cue phrase method) Thus, scores for these 
sentences recelve a prior score of +2 ('hkely to occur 
m a summary') 
In a later section, we show how tins method per- 
forms on unseen data of the same land (viz texts m 
the genre of computational lmgulshcs research pa- 
pers of about .~6--8 pages long) Even though the 
cue phrase method is well tuned to these data, we 
are aware that the hst of phrases .we collected mlght 
not generahze to other genres Some land of automa- 
tion seems desirable to assist a possible adaptation 
2. Location method. Paragraphs at the start 
and end of a document are more hksly to contain 
material that Is useful for a summary, as papers are 
organized hierarchically Paragraphs are also orga- 
razed hierarchically, with crucial reformation at the 
beginning and the end of paragraphs Therefore, 
sentences m document peripheral paragraphs should 
be good can&dates, and even more so If they occur 
m the periphery of the paragraph 
Our algunthm assigns non-zero values only to sen- 
tences winch are m document penpheral sections, 
sentences in the middle of the document receive a 
0 score The algorithm is sensitive to prototypl- 
cal heachngs (IntrOdact:on), if such hendmgs cannot 
be found, it uses a fixed range of paragraphs (first 
7 and last 3 paragraphs) Within these document 
peripheral paragraphs, the values 'l_f' and 'm' (for 
paragraph initial-or-final and paragraph medial sen- 
tences, respectively) are assigned 
3. Sentence Length method. All sentences un- 
der a certain length (current threshold 15 tokens in- 
cluding punctuation) receive a 0 score, all sentences 
above the threshold a 1 score 
Kuplec et al mention tins method as useful for 
filtering out captious, titles and headings In our 
experiment, thin was not necessary as our format 
encodes headings and titles as such, and captions are 
removed As expected, it turns out that the sentence 
length method Is our least effective method ¢ 
4. Thematic word method. Tins method tries 
to identify key words that are characteristic for 
the contents of the document It concentrates on 
non-stop-hst words winch occur frequently m the 
document, but rarely m the overall collection In 
theory, sentences cont.~mg (clusters of) such the- 
matlc words should be characteristic for the docu- 
ment We use a standard tenn-frequency*mverse- 
doenment-frequency (tf*ldf) method 
, loy\[l°°*N~ ,core(w) = ~oo cT;r~-,. 
.floe frequency of word w m document 
fgt,~ number of documents contmnmg word w 
N number of documents m collectaon 
The 10 top-scoring words are chosen as the- 
matlc words, sentence scores are then computed 
as a weighted count of thematic word m sentence, 
meaned by sentence length The 40 top-rated sen- 
tences get score 1, all others 0 
5. Title method. Words occurring in the title 
are good candidates for document specific concepts 
The title method score of a sentence m the mean 
frequency of title word occurrences (excluding stop- 
lint words) The 18 top-sconng sentences receive 
the value 1, all other sentences 0 We also exper- 
imented with taking words occurring m all headings 
into account (these words were scored accorchng to 
the tffldf method) but received better results for tl- 
tle words only 
Method 1 (cue) 
Method 2 (location) 
Method 3 (length) 
Method 4 (tf*idf) 
Method 5 (title) 
Baseline 
Indiv. Cumul. 
552 552 
32 1 65 3 
28 9 66 3 
17 1 66 5 
21 7 68 4 
28 0 
Figure 2 First experiment Impact of mdlvtdual 
hennstlcs, training set 123, gold standards A+B 
Cue Phrase Method 
Heuristics Combination 
Basefine 
Seen Unseen 
60 9 54 9 
71 6 65 3 
291 
Figure 3 First Experiment DLfference between 
unseen and seen data, training set 3, gold stan- 
dards A+B 
3.3 Results 
Training and evaluation took place as m Kuplec et 
al's experiment As a basehne we chose sentences 
from the begmmng of the source text, winch o~ 
tamed a recall and preczmon of 28 0% on training 
set 123 Tins from-top baseline (winch zs also used 
by Kuplec et al ) • is a more conservative basehne 
than random order it zs more dn~cult to beat, as 
prototyplcal document structure places a Ingh per- 
centage of relevant reformation m the beginning 
3.3.1 First experiment 
Figure 2 summarizes the contribution of the in- 
dividual methods 8 Using the cue phrase method 
(method 1) Is Clearly the strongest single heuris- 
tic Note that the contribution of a method cannot 
be judged by the individual precision/recall for that 
method For example, the sentence length method 
(method 3) voth a recall and preczslon over the base- 
line contributes hardly anything to the end resul L 
whereas the title method (method 5), winch is be~ 
low the basehne If regarded mchvldually, performs 
much better m combination with methods 1 and 2 
than method 3 does (67 3% for heuristics 1, 2 and 
5, not to be seen from thin table) The reason for 
tins is the relative independence of the methods If 
method 5 identifies a successful canchdate, It is less 
likely that tins can&date has also been Identified by 
method I or 2 Method 4 (tf*ldf) decreased results 
shghtly m some of the expernnents, but not m the 
SAll figures m tables are preamon percentages 
I 
I 
I 
1 
I 
62 
! 
I 
i 
i 
I 
i 
I 
I 
m, 
! 
I 
! 
i 
! 
i 
1 
i 
i 
I 
comb cue base 
TS1 661 490 296 
TS2 622 i 545 249 
TS3 71-6 ~ 609 291 
TS 123 684 i 55 2 28 0 
Figure 4 First experiment Baseline, best single 
hennstxc and combination, gold standards A+B 
experiments with our final/largest training set 123 
where tt led to a (non-mgmficant) increase 
We also checked how much precision and recall 
decrease for unseen data This decrease apphes only 
to the cue phrase method, because the other henrm- 
tics are fixed and would not change by seeing more 
data After the manual mark-up of gold standard 
sentences and additions to the cue phrase hst for 
training set 3, we treated traunng set 3 as ff it was 
unseen we used only those 1423 cue phrases for ex- 
traction that were compded from training set I and 
2 A comparxson of fins 'unseen' result to the end 
result (Figure 3) shows that our cue phrase hst, even 
though hand-crafted, xs robust and general enough 
for our purposes, it generahzes reasonably well to 
texts of a mmflar kind 
Figure 4 shows mean precmion and recall for our 
different training sets for three dflferent extraction 
methods a combination of all 5 methods ('comb '), 
the best single heuristic ('cue'), and the baseline 
('base') We used both gold standards A+B These 
results reconfirm the usefulness of Kupiec et al's 
method of heunst4c combination The method m- 
creases precmlon for the best method by around 
20% It m worth pointing out that thin method pro- 
duces very short excerpts, wxth compresmous as Ingh 
as 2-5%, and with a preczslon equal to the recall 
Thus tins xs a different task from producing long ex- 
cerpts, e g with a compres~on bf25%, as usually re- 
ported m the literature Usmg tins compresmon, we 
achieved a recall of 96 0% (gold standard A), 98 0% 
(gold standard B) and 97 3% (gold standards A+B) 
for training set 123 For comparmon, Kuplee et al 
report a 85% recall 
3.3.2 Second experiment 
In order to see how the chfferent gold standards 
c°ntnbute to the results, we used only one gold stan- 
dard (A or B) at a time for trmmng and for extrac- 
tion Figure 5 summarizes the results 
Looking at Gold standard A, we see that trmnmg 
set 1 m the only training set winch obtains a recall 
that is comparable to Kuplec et al's Incidentally, 
tratmng set 1 is also the only tratmng set that Is 
Evaluation strategy . 
Gold standard A Gold startdard B 
TS comb cue base comb cue base 
1 369 275 214 453 : 30'4 ~108 
250 184 92 538 479 i 203 
3 271 135 "135 643 544 ~' 257 
123 316 232 163 572 467\[ 204 
63 
Figure 5 Second expenment Impact of type of gold 
standard 
70% 
60%- 
50%" 
40%" 
30%" 
comparable to Kuplec et al's data vnth respect to 
allgnablhty The bad performance of tratmng set 2 
and 3 under evaluation w3th gold standard A m not~- 
surprmmg, as there are too few aligned g01d standard" 
sentences to tram on 50% of the documents m these 
training sets contain no or only one ahgned sentence 
premxon/recall 
C, mld standards A+B 
. Gold~ 
~~Id mndm'd A compression 
o~1 o~ o~ o~ o~5 ~ 
Figure 6 Second expernnent Impact of type of gold 
standard on preclswn and recall, as a function of 
compresmon 
Overall, performance seems to correspond to the 
ratio of gold standard sentences to source text sen- 
tences, x e the compresmon of the task 4 The de- 
pendency between prectston/recaU and compresmon 
m depicted m Figure 6 Taking both gold stan- 
dards into account increases performance conmder- 
ably compared to either of the gold standards alone, 
because of the lower compresmon As we don't have 
training sets with exactly the same number of gold 
standard A and B sentences, we cannot directly com- 
pare the performance, but the graph m suggestive of 
a mmdar behavlour of both gold standards The re- 
sults for training set 123 ,failbetween the results of 
the mchvxdual training sets (symbolL~ed by the large 
data points) 
4The chiference m performance between trmmug sets 
m the first expenment Is thus probably meanly att.- 
tnbutable to ~hfferences m compresmon between the 
trmmn~ sets 
TS 
1 
Training 2 
3 
123 
Extraction 
1 2 3 123 
~661 612 697! 663 
658" 622 695 660 
651 i 629 .716 661 
664 \[629 i708 684 
Figure 7 Thtrd experiment Impact of training ma- 
tenal on prec~mon and recall, gold standards A+B 
From tins second experiment we conclude that for 
our task, them m no dnq~erence between gold stan- 
dard A and B The crucial factor that preclmon and 
recall depends on ms the compression of the task 
3.3.3 Third experiment 
In order to evaluate the impact of the training 
material on preclslon and recall, we computed each 
possible pair of training and evaluation material (cf 
figure 7) 
In tins experiment, all documents of the tram- 
mg set are used to trmn. the model, thin model m 
then evaluated against each document in the test 
set, and the mean preclslon and recall is reported 
Importantly, m thin experiment none of the other 
documents in the test set m used for tr~tmmg 
These expernnents show a surprising umfonmty 
wztlnn test sets overall extraction results for each 
trmnmg set are very mxmlar Trmnmg on different 
data does not change the statmtical model much In 
most cases, extraction for each training set worked 
best when the model was trmned on the training set 
itself, rather than on more data Thus, the dflYerence 
in results between mchvldual trammg sets m not an 
-effect of data' sparseness at the level of heuristics 
combmatlon 
We conclude from thin third experm~ent that im- 
provement m the overall results can primarily be 
achieved by improwng tangle hsurlstlcs, and not by 
providing more training data for our simple statmtl- 
cal model 
4 Discussion 
Companng our experiment to Kuplec et al ,s the 
most obvlous dn~erence m the dfiYerence m data 
Our texts are likely to be more heterogeneous, 
coming from areas of computational hngumtlcs with 
. different methodologles and thus having an arg u- 
mentative, experimental, or irnplementatlonal ormn- 
tatlon Also, as they are not journal artlcles, they 
are not heavdy edited There is also less of a pro- 
totyplcal article structure in computational hnguls- 
tics than m experimental dmciphnes like chemical 
en~neenng Thin makes our texts more dn~cult to 
extract from 
The major difference, however, m that we use sum- 
manes winch are not written by trained abstractore, 
but by the authors themselves In only around 20% " 
of documents m our ongjnal corpus, sentence selec- 
tion had been used as a method for sununazy gen- 
eration, whereas profesmonal abstractors rely more 
heavily and systematically on sentences m the source 
text when creating their abstracts 
Using ahgned sentences as gold-standard has two 
mare advantages First, it makes the defimtlon of 
the gold standard less labour mtenmve Second, it 
prowdes a lngher de~ee of objeetwlty It m a much 
shmpler task for a human judge to dsclds if two sen- 
tences convey, the same propositional content, than 
to decide if a sentence is qualdied for mclumon m a 
summary or not 
However, using alignment as the sole definition for 
gold standard lmphes that a sentence is only a good 
extraction candidate if its equivalent occurs m the 
summary, an assumption we beheve to be too restric- 
tive Document sentences other than the aligned. 
ones m~ht have been sumlar in quality to the chosen 
sentences, but wdl be trmned on as a negative exam- 
ple with Kupmc et al's method Kupmc et al also 
recognize that there m not only one optmlal excerpt, 
and mention Bath et al's (1961) research winch nn- 
phes that the agreement between human judges is 
rather low We argue that It makes sense to comple- 
ment ahgned sentences with manually determmed 
supplementary can&dates Tins m not solely moti- 
vated by the data we work with but also by the fact 
that we envtsage a different task than Kupmc et al 
(who use the excerpts as mchcative abstracts) We 
see the extraction of a set of sentences as an interme- 
diate step towards the eventual generation of more 
flemble and coherent abstracts of variable length 
For tins task, a whole range of sentences other than 
just the summary sentences might quahfy as good 
candidates for further processing ~ One important 
subgoal m the reconstruction of approximated docu- 
ment structure (cf rhetorical structure, as defined in 
RST (Mann et al, 1992)) One of the reasons why 
we concentrated on cue phrases was that we beheve 
that cue phrases are anobvious and easily accessible 
source of rhetorical information 
Another nnportant question was ff there were 
other properties foUowmg from the mmn difference 
between our training sets, ahgnablhty Are docu- 
ments with a Ingh degree of ahgnablhty :nherently 
5Tlns m nurrored by the fact that m our gold stan- 
dards, the number of human-selected sentence canch- 
dates outwelghed ahgned sentences by far 
64 
! 
! 
more statable for abstraction by our algorithm ~ It 
might be suspected that ahgnahhty m correlated 
witha better internal structure of the papers, but 
our experiments suggest that, for the purpose of sen- 
tence extraction, thin m eather not the case or not 
relevant Our results show that our training sets 1, 
2 and 3 behave ~ery slrmlarly under evaluation tak- 
ing ahgned gold standards or human-selected gold 
standards into account The only definite factor m- 
fiuencmg the results was the compression rate With 
respect to the quahty of abstracts, tins imphes that 
the strategy which authors use for summary gen- 
erahon - be it sentence selection or complete re- 
generation of the summary from semanhc represen- 
tahon - m a matter of authonal chmce and not an 
mchcator of style, text ~quahty, or any aspect that 
our extraction program is particularly senmhve to 
Thin means that Kupmc et al's method of clasmfi- 
catory sentence selection m not restricted to texts 
which have hlgh-quahty summaries created by hu- 
man abstractors We claim that adding human- 
selected gold standards wdl be useful for generation 
of more flembie and coherent abstracts, than tram- 
mg on just a fixed number of author-provldsd sum- 
mary sentences would allow 
5 Conclusions 
We have rephcated Kuplec et al's experiment for 
automatic sentence extraction using several rode- 
pendent heurmtlcs and superwsed learning The 
summaries for our documents were not written by 
professional abstractors, but by the authors them- 
selves As a result, our data demonstrated conmd- 
erably lower overlap between sentences m the sum- 
mary and sentences m the mare text We used an 
alternative evaluation that mL~ed ahgned sentences 
with other good can&dates for extraction, as iden- 
tified by a human judge 
We obtained a 68 4% recall and preclmon on our 
text material, compared to a 28 0% baseline and a 
best mdlvldual method of 55 2% Combimng m&- 
vldually weaker methods results m an increase of 
around 20% of the best method, m line with Kupmc 
et al's results Thin shows the ~mefulness. of Ku- 
plec et al's methodology for a different type of data 
and evaluation strategy We found that there was 
no difference m performance between our evaluation 
strategies (alignment or human judgement), apart 
from external constraints on the task hke the com- 
pression rate We also show that increased trmmng 
did not slgmficantly improve the sentence extraction 
results, and conclude that there m more room for im- 
provement m the extraction methods themselves 
With respect to our ultimate goal of generatmg of 
higher quahty abstracts (more coherent, more flex- 
ible variable-length abstracts), we argue that the 
use of human-selected extraction can&dates m ad- 
• Vantageous to the task Our favounte heurmhc in- 
cludes meta-lmgmstic cue phrases, because they can 
be used to detect rhetorical structure m the docu- 
ment, and because they provide a rhetoncal context 
for each extracted sentence m addition to its propo- 
mhonal content 
6 Acknowledgements 
The authors would hke to thank Chrm Brew, Janet 
Httzeman and two anonymous referees for comments 
on earher drafts of the paper The first author m 
supported by an EPSRC studentshp 

References 
Baxendale, P (1958) Man-made mdex for techmcal ht- 
erature - an experiment IBM3ournal on research and 
development, 2(4) 

Edmundson, H (1969) New methods m automahc ex- 
tracting Journal of the ACM, 16(2) 

Johnson, F C, Peace, C D, Black, W J, and Neal, 
A P (1993) The apphcataon of hngumtac processing 
to automata¢ abstract generation Journal of Docu- 
ment and Tezt Management, 1(3) 215-42 

Kupmc, J, Pedersen, J, and Chen, F (1995) A tram- 
able document ~rnmea'lzer In Proceedings of the ISth 
A UM-SIGIR Conference 

Lulm, H P (1958) The automatac creation ofhterature 
abstracts IBM Journal o\] Research and Development, 2(2) 

Mann, W C, Matthesen, C M I M, and Thompson, 
S A (1992) Rhetorical structure theory and text 
A.Mysm In Mann, W G and Thompson, S A, ed- 
itors, Discourse descnphon J Benj~mm~ Pub Co, 
AmsterdAm 

Peace, C D and Jones, A P (1993) The ldentdicatlon 
of important concepts m highly structured techmcal 
papers In Proceedings of the Stzteenth Annual In. 
ternatsonal ACM SIGIR conference on research and 
development m IR 

Zechner, K (1995) Automa~c text abstracting by se- 
lecting relevant passages Master's thems, Centre for 
Cogmt, ve Science, Umvermty of Edinburgh 
