
The Potential and Limitations of Automatic
Sentence Extraction for Summarization
Chin-Yew Lin and Eduard Hovy
University of Southern California / Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292, USA
{cyl,hovy}@isi.edu


Abstract
In this paper we present an empirical study of the potential and limitations of sentence extraction in text summarization. Our results show that the single document generic summarization task as defined in DUC 2001 needs to be carefully refocused, as reflected in the low inter-human agreement at 100-word summaries (0.40 score)[1] and the high upper bound at full-text summaries (0.88)[2]. For 100-word summaries, the performance upper bound, 0.65, is achieved by oracle extracts[3]. Such oracle extracts show the promise of sentence extraction algorithms; however, we first need to raise inter-human agreement to be able to achieve this performance level. We show that compression is a promising direction and that the compression ratio of summaries affects average human and system performance.
1 Introduction
Most automatic text summarization systems existing
todayareextractionsystemsthatextractpartsoforigi-
nal documents and output the results as summaries.
Among them, sentence extraction is by far the most

1
Wecomputeunigramco-occurrencescoreofapairofman-
ual summaries, one as candidate summary and the other as
reference.
2
Wecomputeunigramco-occurrencescoresofafulltextand
itsmanualsummariesof100words.Thesescoresarethebest
achievable using the unigram co-occurrence scoring metric
sinceallpossiblewordsarecontainedinthefulltext.Three
manualsummariesareused.
3
Oracle extracts are the best scoring extracts generated by
exhaustive search of all possible sentence combinations of
1005words.
popular (Edmundson 1969, Luhn 1969, Kupiec et al.
1995,Goldsteinetal.1999,HovyandLin1999).The
majorityofsystemsparticipatinginthepastDocument
Understanding Conference (DUC 2002), a large scale
summarization evaluation effort sponsored by the US
government, are extraction based. Although systems
basedoninformationextraction(RadevandMcKeown
1998,Whiteetal.2001,McKeownetal.2002)anddis-
courseanalysis(Marcu1999b,Strzalkowskietal.1999)
alsoexist,wefocusourstudyonthepotentialandlimi-
tationsofsentenceextractionsystemswiththehopethat
ourresultswillfurtherprogressinmostoftheautomatic
textsummarizationsystemsandevaluationsetup.
The evaluation results of the single document summarization task in DUC 2001 and 2002 (DUC 2002, Over and Liggett 2002) indicate that most systems are as good as the baseline lead-based system and that humans are significantly better, though not by much. This leads to the belief that lead-based summaries are as good as we can get for single document summarization in the news genre, implying that the research community should invest future efforts in other areas. In fact, a very short summary task of about 10 words (headline-like) has replaced the single document 100-word summary task in DUC 2003. The goal of this study is to renew interest in sentence extraction-based summarization and its evaluation by estimating the performance upper bound using oracle extracts, and to highlight the importance of taking the compression ratio into account when we evaluate extracts or summaries.
Section 2 gives an overview of DUC relevant to this study. Section 3 introduces a recall-based unigram co-occurrence automatic evaluation metric. Section 4 presents the experimental design. Section 5 shows the empirical results. Section 6 concludes this paper and discusses future directions.

2 Document Understanding Conference
Fully automatic single-document summarization was one of two main tasks in the 2001 Document Understanding Conference. Participants were required to create a generic 100-word summary. There were 30 test sets in DUC 2001, and each test set contained about 10 documents. For each document, one summary was created manually as the 'ideal' model summary at approximately 100 words. We will refer to this manual summary as H1. Two other manual summaries were also created at about that length. We will refer to these two additional human summaries as H2 and H3. In addition, baseline summaries were created automatically by taking the first n sentences up to 100 words. We will refer to this baseline extract as B1.
3 Unigram Co-Occurrence Metric
In a recent study (Lin and Hovy 2003), we showed that the recall-based unigram co-occurrence automatic scoring metric correlates highly with human evaluation and has high recall and precision in predicting the statistical significance of results compared with its human counterpart. The idea is to measure the content similarity between a system extract and a manual summary using simple n-gram overlap. A similar idea, the IBM BLEU score, has proved successful in automatic machine translation evaluation (Papineni et al. 2001, NIST 2002). For summarization, we can express the degree of content overlap in terms of n-gram matches as the following equation:

\[
C_n = \frac{\sum_{C \in \{\text{Model Units}\}} \sum_{n\text{-gram} \in C} \text{Count}_{\text{match}}(n\text{-gram})}{\sum_{C \in \{\text{Model Units}\}} \sum_{n\text{-gram} \in C} \text{Count}(n\text{-gram})} \tag{1}
\]
Model units are segments of manual summaries. They are typically either sentences or elementary discourse units as defined by Marcu (1999b). Count_match(n-gram) is the maximum number of n-grams co-occurring in a system extract and a model unit. Count(n-gram) is the number of n-grams in the model unit. Notice that the average n-gram coverage score, C_n, as shown in equation 1, is a recall-based metric, since the denominator of equation 1 is the sum total of the number of n-grams occurring in the model summary rather than the system summary, and only one model summary is used for each evaluation. In summary, the unigram co-occurrence statistics we use in the following sections are based on the following formula:
\[
\textit{Ngram}(i, j) = \exp\left(\sum_{n=i}^{j} w_n \log C_n\right) \tag{2}
\]

where j ≥ i, i and j range from 1 to 4, and w_n is 1/(j−i+1). Ngram(1,4) is a weighted variable-length n-gram match score similar to the IBM BLEU score, while Ngram(k,k), i.e. i = j = k, is simply the average k-gram co-occurrence score C_k. In this study, we set i = j = 1, i.e. the unigram co-occurrence score.
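As an illustration, equations (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch under our own naming; it assumes pre-tokenized input and some n-gram overlap (so that log C_n is defined), and omits practical details such as tokenization and handling of multiple model summaries:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def c_n(candidate, model_units, n):
    """Equation (1): recall of the model summary's n-grams by the
    candidate extract, using clipped (maximum co-occurring) counts."""
    cand_counts = Counter(ngrams(candidate, n))
    matched = total = 0
    for unit in model_units:  # model units, e.g. sentences of the manual summary
        for gram, count in Counter(ngrams(unit, n)).items():
            total += count
            matched += min(count, cand_counts.get(gram, 0))
    return matched / total if total else 0.0

def ngram_score(candidate, model_units, i=1, j=1):
    """Equation (2): weighted variable-length n-gram score;
    i = j = 1 gives the unigram co-occurrence score used here."""
    w = 1.0 / (j - i + 1)
    return math.exp(sum(w * math.log(c_n(candidate, model_units, n))
                        for n in range(i, j + 1)))
```

For example, a candidate sharing two of the three unigrams of a one-sentence model summary scores 2/3.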
With a test collection available and an automatic scoring metric defined, we describe the experimental setup in the next section.
4 Experimental Designs
As stated in the introduction, we aim to find the performance upper bound of a sentence extraction system and the effect of compression ratio on its performance. We present our experimental designs to address these questions in the following sections.
4.1 Performance Upper Bound Estimation Using Oracle Extracts
In order to estimate the potential of sentence extraction systems, it is important to know the upper bound that an ideal sentence extraction method might achieve and how far state-of-the-art systems are from that bound. If the upper bound is close to state-of-the-art systems' performance, then we need to look for other summarization methods to improve performance. If the upper bound is much higher than any current system can achieve, then it is reasonable to invest more effort in sentence extraction methods. The question is how to estimate the performance upper bound. Our solution is to cast this estimation problem as an optimization problem. We exhaustively generate all possible sentence combinations that satisfy given length constraints for a summary, for example, all the sentence combinations totaling 100±5 words. We then compute the unigram co-occurrence score for each sentence combination against the ideal. The best combinations are the ones with the highest unigram co-occurrence score. We call such a sentence combination the oracle extract. Figure 1 shows an oracle extract for document AP900424-0035. One of its human summaries is shown in Figure 2. The oracle extract covers almost all aspects of the human summary except sentences 5 and 6 and part of sentence 4. However, if we allow the automatic extract to contain more words, for example, 150 words as shown in Figure 3, the longer oracle extract then covers everything in the human summary. This indicates that lower compression can boost system performance. The ultimate effect of compression can be computed using the full text as the oracle extract, since the full text should contain everything included in the human summary. That situation provides the best achievable unigram co-occurrence score. A near-optimal score also confirms the validity of using the unigram co-occurrence scoring method as an automatic evaluation method.
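The exhaustive search described above can be sketched as follows. This is a simplified sketch under our own naming: `unigram_recall` stands in for the unigram co-occurrence score, sentences are assumed pre-tokenized, and the actual experiments used a 100±5-word window over documents of fewer than 30 sentences:

```python
from collections import Counter
from itertools import combinations

def unigram_recall(candidate, model):
    """Clipped unigram overlap, recall against the model summary tokens."""
    cand, mod = Counter(candidate), Counter(model)
    matched = sum(min(c, cand.get(g, 0)) for g, c in mod.items())
    return matched / sum(mod.values()) if mod else 0.0

def oracle_extract(sentences, model, min_words=95, max_words=105):
    """Exhaustively score every sentence combination whose total length
    falls in [min_words, max_words]; return (sentence indices, score)."""
    best, best_score = (), -1.0
    for k in range(1, len(sentences) + 1):
        for combo in combinations(range(len(sentences)), k):
            if min_words <= sum(len(sentences[i]) for i in combo) <= max_words:
                cand = [t for i in combo for t in sentences[i]]
                s = unigram_recall(cand, model)
                if s > best_score:
                    best, best_score = combo, s
    return best, best_score
```

The combinatorial loop is what makes the search computation-intensive and motivates the restriction to short documents.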

4.2 Compression Ratio and Its Effect on System
Performance
Oneimportantfactorthat affectsthe averageperform-
ance of sentence extraction system is the number of
sentences contained in the original documents. This
factorisoftenoverlookedandhasneverbeenaddressed
systematically. For example, if a document contains
onlyonesentencethenthisdocumentwillnotbeuseful
indifferentiatingsummarizationsystemperformanceñ
there is only one choice. However, for a document of
100sentencesandassumingeachsentenceis20words
long, there are C(100,5) = 75,287,520 different 100-
wordextracts.Thishugesearchspacelowersthechance
of agreement between humans on what constitutes a
good summary. It also makes system and human per-
formance approach average since it is more likely to
includesomegoodsentencesbutnotallofthem.Em-
piricalresultsshowninSection5confirmthisandthat
leadsustothequestionofhowtoconstructacorpusto
evaluatesummarizationsystems.Wediscussthisissue
intheconclusionsection.
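The size of this search space is easy to verify with Python's `math.comb`:

```python
from math import comb

# Picking 5 sentences (5 x 20 = 100 words) out of a
# 100-sentence document:
print(comb(100, 5))  # 75287520
```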
4.3 Inter-Human Agreement and Its Effect on System Performance
In this section we study how inter-human agreement affects system performance. Lin and Hovy (2002) reported that, compared to a manually created ideal, humans scored about 0.40 in average coverage score and the best system scored about 0.35. According to these numbers, we might assume that humans cannot agree with each other on what is important and that the best system is almost as good as humans. If this is true, then estimating an upper bound using oracle extracts is meaningless. No matter how high the estimated upper bounds may be, we probably would never be able to achieve that performance due to lack of agreement between humans: the oracle approximating one human would fail miserably with another. Therefore we set up experiments to investigate the following:
1. What is the distribution of inter-human agreement?
Figure 1. A 100-word oracle extract for document AP900424-0035.
Figure 2. A manual summary for document AP900424-0035.
Figure 3. A 150-word oracle extract for document AP900424-0035.

2. How does a state-of-the-art system differ from average human performance at different inter-human agreement levels?
We present our results in the next section using 303 newspaper articles from the DUC 2001 single document summarization task. Besides the original documents, we also have three human summaries, one lead summary (B1), and one automatic summary from one top-performing system (T) for each document.
5 Results
In order to determine the empirical upper and lower bounds of inter-human agreement, we first ran a cross-human evaluation using unigram co-occurrence scoring over six human summary pairs, i.e. (H1,H2), (H1,H3), (H2,H1), (H2,H3), (H3,H1), and (H3,H2). For a summary pair (X,Y), we used X as the model summary and Y as the system summary. Figure 4 shows the distributions of four different scenarios. The MaxH distribution picks the best inter-human agreement score for each document, the MinH distribution the minimum one, the MedH distribution the median, and the AvgH distribution the average. The average of the best inter-human agreement and the average of average inter-human agreement differ by about 10 percent in unigram co-occurrence score, and MaxH and MinH differ by 18 percent. These big differences might come from two sources. The first one is a limitation of the unigram
Figure 4. DUC 2001 single document inter-human unigram co-occurrence score distributions for maximum, minimum, average, and median (Average MAX = 0.50, Average AVG = 0.40, Average MED = 0.39, Average MIN = 0.32).
Figure 5. DUC 2001 single document inter-human, baseline, system, 100-word, 150-word, and full text oracle extracts unigram co-occurrence score distributions (number of sentences ≤ 30). Document IDs are sorted by decreasing MaxH.

co-occurrence scoring applied to manual summaries: it cannot recognize synonyms or paraphrases. The second one is a true lack of agreement between humans. We would like to conduct an in-depth study to address this question; for now we simply assume that unigram co-occurrence scoring is reliable.
In the other experiments, we used the best inter-human agreement results as the reference point for the human performance upper bound. This also implies that we used the human summary achieving the best inter-human agreement score as our reference summary.
Figure 5 shows the unigram co-occurrence scores of human, baseline, system T, and three oracle extraction systems at different extraction lengths. We generated all possible sentence combinations that satisfied the 100±5-word constraint. Due to the computation-intensive nature of this task, we only used documents with fewer than 30 sentences. We then computed the unigram co-occurrence score for each combination, selected the best one as the oracle extraction, and plotted the score in the figure. The curve for 100±5-word oracle extractions is the upper bound that a sentence extraction system can achieve within the given word limit. If an automatic system is allowed to extract more words, we can expect that longer extracts will boost system performance. The question is how much better, and what is the ultimate limit? To address these questions, we also computed unigram co-occurrence scores for oracle extractions of 150±5 words and of full text[4]. The performance of full text is the ultimate performance an extraction system can reach using the unigram co-occurrence scoring method. We also computed the scores of the lead baseline system (B1) and an automatic system (T). The average unigram co-occurrence score for full text (FT) was 0.833, for 150±5 words (E150) 0.796, for 100±5 words (E100) 0.650, for the best inter-human agreement (MaxH) 0.546, for system T 0.465, and for the baseline 0.456. It is interesting to note that the state-of-the-art system performed at the same level as the baseline system but was still about 10% away from human performance. The 10% difference between E100 and MaxH (0.650 vs. 0.546) implies that we might need to constrain humans to focus their summaries on certain aspects to boost inter-human agreement to the level of E100, while the 15% and 24% improvements from E100 to E150 and FT indicate that compression would help push overall system performance to a much higher level, if a system is able to compress longer summaries into shorter ones without losing important content.
To investigate the relative performance of humans, systems, and oracle extracts at different inter-human agreement levels, we created three separate document sets based on their maximum inter-human agreement (MaxH) scores. Set A had a MaxH score greater than or equal to 0.70, set B was between 0.70 and 0.60, and

[4] We used the full text as the extract and computed its unigram co-occurrence score against a reference summary.
Figure 6. DUC 2001 single document inter-human, baseline, system, and full text unigram co-occurrence score distributions (Set A: AvgMaxH = 0.741, AvgE100 = 0.705, AvgE150 = 0.863, AvgFT = 0.924, AvgT = 0.525, AvgB1 = 0.516).
Figure 7. DUC 2001 single document inter-human, baseline, system, and full text unigram co-occurrence score distributions (Set B: AvgMaxH = 0.645, AvgE100 = 0.698, AvgE150 = 0.840, AvgFT = 0.917, AvgT = 0.490, AvgB1 = 0.509).
Figure 8. DUC 2001 single document inter-human, baseline, system, and full text unigram co-occurrence score distributions (Set C: AvgMaxH = 0.536, AvgE100 = 0.645, AvgE150 = 0.790, AvgFT = 0.897, AvgT = 0.435, AvgB1 = 0.423).

set C was between 0.60 and 0.50. Set A had 22 documents, set B 37, and set C 100; the total was about 52% (= 159/303) of the test collection. The 100±5 and 150±5 word averages were computed over documents which contain at most 30 sentences. The results are shown in Figures 6, 7, and 8. In the highest inter-human agreement set (A), we found that the average MaxH, 0.741, was higher than the average 100±5-word oracle extract, 0.705, while the average automatic system performance was around 0.525. This is good news, since the high inter-human agreement and the big difference (0.18) between the 100±5-word oracle and automatic system performance present a research opportunity for improving sentence extraction algorithms. The scores of MaxH in the other two sets (0.645 for set B and 0.536 for set C) are both lower than the 100±5-word oracles (0.698 for set B, 5.3% lower, and 0.645 for set C, 9.9% lower). This result suggests that optimizing sentence extraction algorithms at the set C level might not be worthwhile, since the algorithms are likely to overfit the training data. The reason is that the average run-time performance of a sentence extraction algorithm depends on the maximum inter-human agreement. For example, given a training reference summary T_SUM1 and its full document T_DOC1, we optimize our sentence extraction algorithm to generate an oracle extract based on T_SUM1 from T_DOC1. At run time, we test on a reference summary R_SUM1 and its full document R_DOC1. In the unlikely case that R_DOC1 is the same as T_DOC1 and R_SUM1 is the same as T_SUM1, i.e. T_SUM1 and R_SUM1 have a unigram co-occurrence score of 1 (perfect inter-human agreement for two summaries of one document), the optimized algorithm will generate a perfect extract for R_DOC1 and achieve the best performance, since it was optimized on T_SUM1. However, T_SUM1 and R_SUM1 are usually different. Then the performance of the algorithm will not exceed the maximum unigram co-occurrence score between T_SUM1 and R_SUM1. Therefore it is important to ensure high inter-human agreement to allow researchers room to optimize sentence extraction algorithms using oracle extracts.
Finally, we present the effect of compression ratio on inter-human agreement (MaxH) and on the performance of the baseline (B1), automatic system T (T), and the full text oracle (FT) in Figure 9. Compression ratio is computed in terms of words instead of sentences. For example, a 100-word summary of a 500-word document has a compression ratio of 0.80 (= 1 − 100/500). The figure shows that the three human summaries (H1, H2, and H3) had different compression ratios (CMPR H1, CMPR H2, and CMPR H3) for different documents but did not differ
Figure 9. DUC 2001 single document inter-human, baseline, and system unigram co-occurrence scores versus compression ratio. Document IDs are sorted by increasing compression ratio CMPR H1.

much. The unigram co-occurrence scores for B1, T, and MaxH were noisy but had a general trend (Linear B1, Linear T, and Linear MaxH) of drifting toward lower performance as the compression ratio increased (i.e. as summaries became shorter), while the performance of FT did not exhibit a similar trend. This confirms our earlier hypothesis that humans are less likely to agree at high compression ratios and that system performance also suffers at high compression ratios. The constancy of FT across different compression ratios is reasonable, since FT scores should depend only on how well the unigram co-occurrence scoring method captures content overlap between a full text and its reference summaries and on how likely humans are to use vocabulary outside the original document.
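For concreteness, the word-based compression ratio used here is just a one-liner (a trivial helper, named by us):

```python
def compression_ratio(summary_words: int, document_words: int) -> float:
    """Word-based compression ratio: 1 - |summary| / |document|."""
    return 1.0 - summary_words / document_words

# A 100-word summary of a 500-word document:
print(compression_ratio(100, 500))  # 0.8
```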
6 Conclusions
In this paper we presented an empirical study of the potential and limitations of sentence extraction as a method of automatic text summarization. We showed the following:
(1) How to use oracle extracts to estimate the performance upper bound of sentence extraction methods at different extract lengths. We understand that summaries optimized using the unigram co-occurrence score do not guarantee good quality in terms of coherence, cohesion, and overall organization. However, we would argue that a good summary does require good content, and we leave how to make the content cohesive, coherent, and organized to future research.
(2) Inter-human agreement varied a lot, and the difference between maximum agreement (MaxH) and minimum agreement (MinH) was about 18% on the DUC 2001 data. To minimize the gap, we need to define the summarization task better. This has been addressed by providing guided summarization tasks in DUC 2003 (DUC 2002). We guesstimate that the gap should be smaller in the DUC 2003 data.
(3) State-of-the-art systems performed at the same level as the baseline system but were still about 10% away from the average human performance.
(4) The potential performance gains (15% from E100 to E150 and 24% to FT) estimated by oracle extracts of different sizes indicate that sentence compression and sub-sentence extraction are promising future directions.
(5) The relative performance of humans and oracle extracts at three inter-human agreement intervals showed that it is only meaningful to optimize sentence extraction algorithms if inter-human agreement is high. Although overall inter-human agreement was low, subsets with high inter-human agreement did exist. For example, humans achieved at least 60% agreement in 59 out of 303 (~19%) documents of 30 sentences or fewer.
(6) We also studied how the compression ratio affected inter-human agreement and system performance, and the results supported our hypothesis that humans tend to agree less at high compression ratios, and similarly for systems. How to take this factor into account in future summarization evaluations is an interesting topic to pursue further.
Using exhaustive search to identify oracle extracts has been studied by other researchers, but in different contexts. Marcu (1999a) suggested using exhaustive search to create training extracts from abstracts. Donaway et al. (2000) used exhaustive search to generate all three-sentence extracts to evaluate different evaluation metrics. The main difference between their work and ours is that we searched for extracts of a fixed number of words while they looked for extracts of a fixed number of sentences.
In the future, we would like to apply a similar methodology to different text units, for example, sub-sentence units such as elementary discourse units (Marcu 1999b). We want to study how to constrain the summarization task to achieve higher inter-human agreement, train sentence extraction algorithms using oracle extracts at different compression sizes, and explore compression techniques to go beyond simple sentence extraction.

References
Donaway, R. L., Drummey, K. W., and Mather, L. A. 2000. A Comparison of Rankings Produced by Summarization Evaluation Measures. In Proceedings of the Workshop on Automatic Summarization, post-conference workshop of ANLP-NAACL-2000, Seattle, WA, USA, 69–78.
DUC. 2002. The Document Understanding Conference. http://duc.nist.gov.
Edmundson, H. P. 1969. New Methods in Automatic Abstracting. Journal of the Association for Computing Machinery, 16(2).
Goldstein, J., M. Kantrowitz, V. Mittal, and J. Carbonell. 1999. Summarizing Text Documents: Sentence Selection and Evaluation Metrics. In Proceedings of the 22nd International ACM Conference on Research and Development in Information Retrieval (SIGIR-99), Berkeley, CA, USA, 121–128.
Hovy, E. and C.-Y. Lin. 1999. Automatic Text Summarization in SUMMARIST. In I. Mani and M. Maybury (eds), Advances in Automatic Text Summarization, 81–94. MIT Press.
Kupiec, J., J. Pedersen, and F. Chen. 1995. A Trainable Document Summarizer. In Proceedings of the 18th International ACM Conference on Research and Development in Information Retrieval (SIGIR-95), Seattle, WA, USA, 68–73.
Lin, C.-Y. and E. Hovy. 2002. Manual and Automatic Evaluations of Summaries. In Proceedings of the Workshop on Automatic Summarization, post-conference workshop of ACL-2002, 45–51, Philadelphia, PA, USA.
Lin, C.-Y. and E. H. Hovy. 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of the 2003 Human Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 – June 1, 2003.
Luhn, H. P. 1969. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2).
Marcu, D. 1999a. The Automatic Construction of Large-scale Corpora for Summarization Research. In Proceedings of the 22nd International ACM Conference on Research and Development in Information Retrieval (SIGIR-99), Berkeley, CA, USA, 137–144.
Marcu, D. 1999b. Discourse Trees Are Good Indicators of Importance in Text. In I. Mani and M. Maybury (eds), Advances in Automatic Text Summarization, 123–136. MIT Press.
McKeown, K., R. Barzilay, D. Evans, V. Hatzivassiloglou, J. L. Klavans, A. Nenkova, C. Sable, B. Schiffman, and S. Sigelman. 2002. Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster. In Proceedings of the Human Language Technology Conference 2002 (HLT 2002), San Diego, CA, USA.
NIST. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics.
Over, P. and W. Liggett. 2002. Introduction to DUC-2002: an Intrinsic Evaluation of Generic News Text Summarization Systems. In Proceedings of the Workshop on Automatic Summarization (DUC 2002), Philadelphia, PA, USA. http://www-nlpir.nist.gov/projects/duc/pubs/2002slides/overview.02.pdf
Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022).
Radev, D. R. and K. R. McKeown. 1998. Generating Natural Language Summaries from Multiple On-line Sources. Computational Linguistics, 24(3):469–500.
Strzalkowski, T., G. Stein, J. Wang, and B. Wise. 1999. A Robust Practical Text Summarizer. In I. Mani and M. Maybury (eds), Advances in Automatic Text Summarization, 137–154. MIT Press.
White, M., T. Korelsky, C. Cardie, V. Ng, D. Pierce, and K. Wagstaff. 2001. Multidocument Summarization via Information Extraction. In Proceedings of the Human Language Technology Conference 2001 (HLT 2001), San Diego, CA, USA.
