A Comparison of Rankings Produced by Summarization 
Evaluation Measures 
Robert L. Donaway 
Department of Defense 
9800 Savage Rd. STE 6409 
Ft. Meade, MD 20755-6409 
rldonaw@super.org 
Kevin W. Drummey 
Department of Defense 
9800 Savage Rd. STE 6341 
Ft. Meade, MD 20755-6341 
kwdrumm@super.org 
Laura A. Mather 
La Jolla Research Lab 
Britannica.com, Inc. 
3253 Holiday Ct. Suite 208 
La Jolla, CA 92037 
mather@us.britannica.com 
Abstract 
Summary evaluation measures produce a ranking 
of all possible extract summaries of a document. 
Recall-based evaluation measures, which depend on 
costly human-generated ground truth summaries, 
produce uncorrelated rankings when ground truth 
is varied. This paper proposes using sentence-rank- 
based and content-based measures for evaluating ex- 
tract summaries, and compares these with recall- 
based evaluation measures. Content-based measures 
increase the correlation of rankings induced by syn- 
onymous ground truths, and exhibit other desirable 
properties. 
1 Introduction 
The bulk of active research in the automatic 
text summarization community centers on de- 
veloping algorithms to produce extract sum- 
maries, e.g. (Schwarz, 1990), (Paice and Jones, 
1993), (Kupiec et al., 1995), (Marcu, 1997), 
(Strzalkowski et al., 1998), and (Goldstein et 
al., 1999). Yet understanding how to evalu- 
ate their output has received less attention. In 
1997, TIPSTER sponsored a conference (SUM- 
MAC) where various text summarization algo- 
rithms were evaluated for their performance in 
various tasks (Mani et al., 1999; Firmin and 
Chrzanowski, 1999). While extrinsic evalua- 
tion measures such as these are often very con- 
crete, the act of designing the task and scor- 
ing the results of the task introduces bias and 
subject-based variability. These factors may 
confound the comparison of summarization al- 
gorithms. Machine-generated summaries also 
may be evaluated intrinsically by comparing 
them with "ideal" human-generated summaries. 
However, there is often little agreement as to 
what constitutes the ideal summary of a docu- 
ment. 
Both intrinsic and extrinsic methods require 
time consuming, expert human input in order 
to evaluate summaries. While the traditional 
methods have many advantages, they are costly, 
and human assessors cannot always agree on 
summary quality. If a numerical measure were 
available which did not depend on human judge- 
ment, researchers and developers would be able 
to immediately assess the effect of modifications 
to summary generation algorithms. Also, such 
a measure might be free of the bias that is in- 
troduced by human assessment. 
This paper investigates the properties of vari- 
ous numerical measures for evaluating the qual- 
ity of generic, indicative document summaries. 
As explained by Mani et al. (1999), a generic 
summary is not topic-related, but "is aimed at 
a broad readership community" and an indica- 
tive summary tells "what topics are addressed 
in the source text, and thus can be used to 
alert the user as to source content." Section 2 
discusses the properties of numerical evaluation 
measures, points out several drawbacks associ- 
ated with intrinsic measures and introduces new 
measures developed by the authors. An exper- 
iment was devised to compare the new evalua- 
tion measures with the traditional ones. The de- 
sign of this experiment is discussed in Section 3 
and its results are presented in Section 4. The 
final section includes conclusions and a state- 
ment of the future work related to these evalu- 
ation measures. 
2 Evaluation Measures 
An evaluation measure produces a numerical 
score which can be used to compare different 
summaries of the same document. The scores 
are used to assess summary quality across a col- 
lection of test documents in order to produce 
an average for an algorithm or system. How- 
ever, it must be emphasized that the scores are 
most significant when considered per document. 
For example, two different summaries of a doc- 
ument may have been produced by two differ- 
ent summarization algorithms. Presumably, the 
summary with the higher score indicates that 
the system which produced it performed bet- 
ter than the other system. Obviously, if one 
system consistently produces higher scores than 
another system, its average score will be higher, 
and one has reason to believe that it is a bet- 
ter system. Thus, the important feature of any 
summary evaluation measure is not the value of 
its score, but rather the ranking its score im- 
poses on a set of extracts of a document. 
To compare two evaluation measures, whose 
scores may have very different ranges and distri- 
butions, one must compare the order in which 
the measures rank various summaries of a docu- 
ment. For instance, suppose a summary scoring 
function Y is completely dependent upon the 
output of another scoring function X, such as 
Y = 2^X. Since Y is an increasing function of X, 
both X and Y will produce the same ranking of 
any set of summaries. However, the scores pro- 
duced by Y will have a very different distribu- 
tion than those of X and the two sets of scores 
will not be correlated since the dependence of Y 
on X is non-linear. Therefore, in order to com- 
pare the scores two different measures assign to 
a set of summaries, one must compare the ranks 
they assign, not the actual scores. 
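The point about monotone transforms can be sketched in a few lines of Python (an illustration by analogy, not code from this study; the summary scores here are randomly generated):

```python
# Sketch: scores X and the monotone transform Y = 2**X induce the same
# ranking of a set of summaries, even though the score values differ.
import random

random.seed(0)
x_scores = [random.random() for _ in range(10)]   # hypothetical summary scores
y_scores = [2 ** x for x in x_scores]             # Y is an increasing function of X

def ranking(scores):
    """Return summary indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

assert ranking(x_scores) == ranking(y_scores)     # identical rankings
```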
The ranks assigned by an evaluation mea- 
sure produce equivalence classes of extract 
summaries; each rank equivalence class con- 
tains summaries which received the same score. 
When a measure produces the same score for 
two different summaries of a document, there is 
a tie, and the equivalence class will contain more 
than one summary. All summaries in an equiv- 
alence class must share the same rank; let this 
rank be the midrank of the range of ranks that 
would have been assigned if each score were dis- 
tinct. An evaluation measure should possess the 
following properties: (i) higher-ranking sum- 
maries are more effective or are of higher quality 
than lower-ranking summaries, and (ii) all of the 
summaries in a rank equivalence class are more- 
or-less equally effective. 
The following sections contrast the ranking 
properties of three types of evaluation measures: 
recall-based measures, a sentence-rank-based 
measure and content-based measures. These 
types of measures are defined, their properties 
are described and their use is explained. 
2.1 Recall-Based Evaluation Measures 
Recall-based evaluation measures are intrin- 
sic. They compare machine-generated sum- 
maries with sentences previously extracted by 
human assessors or judges. From each docu- 
ment, the judges extract sentences that they 
believe make up the best extract summary of 
the document. A summary of a document gen- 
erated by a summarization algorithm is typi- 
cally compared to one of these "ground truth" 
summaries by counting the number of sentences 
the ground truth summary and the algorithm's 
summary have in common. Thus, the more sen- 
tences a summary has recalled from the ground 
truth, the higher its score will be. See work by 
Goldstein et al. (1999) and Jing et al. (1998) 
for examples of the use of this measure. 
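A minimal sketch of this counting in Python (hypothetical sentence indices, not any particular system's scoring code):

```python
# Sketch of sentence-level recall as described above: J is the number of
# sentences a summary shares with the ground truth extract.
def sentence_recall(summary, ground_truth):
    """Recall R = J / M, where J = |summary ∩ ground truth|, M = |ground truth|."""
    j = len(set(summary) & set(ground_truth))
    return j / len(ground_truth)

# A 3-sentence ground truth and two candidate extracts of the same document.
ground_truth = {1, 2, 13}
print(sentence_recall({1, 2, 5}, ground_truth))   # 2/3: sentences 1 and 2 recalled
print(sentence_recall({1, 2, 32}, ground_truth))  # 2/3: same score, different content
```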
The recall-based measures introduce a bias 
since they are based on the opinions of a small 
number of assessors. It is widely acknowledged 
(Jing et al., 1998; Kupiec et al., 1995; Voorhees, 
1998) that assessor agreement is typically quite 
low. There are at least two sources of this dis- 
agreement. First, it is possible that one human 
assessor will pick a particular sentence for in- 
clusion in their summary when the content of 
another sentence or set of sentences is approx- 
imately equivalent. Jing et al. (1998) agree: 
"...precision and recall are not the best mea- 
sures for computing document quality. This is 
due to the fact that a small change in the sum- 
mary output (e.g., replacing one sentence with 
an equally good equivalent which happens not 
to match majority opinion [of the assessors]) can 
dramatically affect a system's score." We call 
this source of summary disagreement 'disagree- 
ment due to synonymy.' Here is an example of 
two human-generated extracts from the same 
1991 Wall Street Journal article which contain 
different sentences, but still seem to be describ- 
ing an article about violin playing in a film: 
EXTRACT 1: Both Ms. Streisand's film 
husband, played by Jeroen Krabbe, and 
her film son, played by her real son Ja- 
son Gould, are, for the purposes of the 
screenplay, violinists. The actual sound 
- what might be called a fiddle over - was 
produced off camera by Pinchas Zucker- 
"70 
I 
I 
I 
i 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
man. The violin program in "Prince of 
Tides" eliminates the critic's usual edge 
and makes everyone fall back on his basic 
pair of ears. 
EXTRACT 2: Journalistic ethics forbid 
me from saying if I think "Prince of Tides" 
is as good as "Citizen Kane," but I don't 
think it's wrong to reveal that the film 
has some very fine violin playing. But 
moviegoers will hear Mr. Zuckerman cast 
off the languor that too often makes him 
seem like the most bored of great violin- 
ists. With each of these pieces, Mr. Zuck- 
erman takes over the movie and shows 
what it means to play his instrument with 
supreme dash. 
Another source of disagreement can arise from 
judges' differing opinions about the true focus of 
the original document. In other words, judges 
disagree on what the document is about. We 
call this second source 'disagreement due to fo- 
cus.' Here is a human-generated extract of the 
same article which seems to differ in focus: 
EXTRACT 3: Columbia Pictures has de- 
layed the New York City and Los Angeles 
openings of "Prince Of Tides" for a week. 
So Gothamites and Angelenos, along with 
the rest of the country, will have to wait 
until Christmas Day to see this film ver- 
sion of the Pat Conroy novel about a 
Southern football coach (Nick Nolte) dal- 
lying with a Jewish female psychothera- 
pist (Barbra Streisand) in the Big Apple. 
Perhaps the postponement is a sign that 
the studio is looking askance at this ex- 
pensive product directed and co-produced 
".by its female lead. 
Whatever the source, disagreements at the 
sentence level are prevalent. This has seri- 
ous implications for measures based on a sin- 
gle opinion, when a slightly different opinion 
would result in a significantly different score 
(and rank) for many summaries. 
For example, consider the following three- 
sentence ground truth extract of a 37-sentence 
1994 Los Angeles Times article from the TREC 
collection. It contains sentences 1, 2 and 13. 
(1) Clinton Trade Initiative Sinks Under 
G-7 Criticism. (2) President Clinton came 
to the high-profile Group of Seven sum- 
mit to demonstrate new strength in for- 
eign policy but instead watched his pre- 
mier initiative sink Saturday under a wave 
of sharp criticism. (13) The negative re- 
action to the president's trade proposal 
came as a jolt after administration offi- 
cials had built it up under the forward- 
looking name of "Markets 2000" and had 
portrayed it as evidence of his interest in 
leading the other nations to more open 
trade practices. 
An extract that replaces sentence 13 with sen- 
tence 5: 
(5) In its most elementary form, it would 
have set up a one-year examination of im- 
pediments to world trade, but it would 
have also set an agenda for liberalizing 
trade rules in entirely new areas, such as 
financial services, telecommunications and 
investment. 
will receive the same recall score as one which 
replaces sentence 13 with sentence 32: 
(32) Most nations have yet to go through 
this process, which they hope to complete 
by January. 
These two alternative summaries both have the 
same recall rank, but are obviously of very dif- 
ferent quality. 
Considered quantitatively, the only impor- 
tant component of either precision or recall is 
the 'sentence agreement' J, the number of sen- 
tences a summary has in common with the 
ground truth summary. Following Goldstein 
et al. (1999), let M be the number of sen- 
tences in a ground truth extract summary and 
let K be the number of sentences in a sum- 
mary to be evaluated. With precision P = 
J/K and recall R = J/M as usual, and F1 = 
2PR/(P + R), then elementary algebra shows 
that F1 = 2J/(M+K). Often, a fixed summary 
length K is used. (In terms of word count, this 
represents varying compression rates.) When a 
particular ground truth of a given document is 
chosen, then precision, recall and F1 are all con- 
stant multiples of J. As such, these measures 
produce different scores, but the same ranking 
of all the K-sentence extracts from the docu- 
ment. Since only this ranking is of interest, it is 
not necessary to examine more than one of P, 
R and F1. 
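The algebra above is easy to verify in a short Python sketch (M and K are example sizes, not tied to any particular document):

```python
# Sketch: with ground-truth size M and summary length K fixed, P, R and F1
# are all constant multiples of the sentence agreement J, so all three
# induce the same ranking of the K-sentence extracts.
M, K = 4, 3  # hypothetical example sizes

def scores(j, m=M, k=K):
    p = j / k                                # precision
    r = j / m                                # recall
    f1 = 2 * p * r / (p + r) if j else 0.0   # harmonic mean of P and R
    return p, r, f1

for j in range(M):                           # J = 0 .. 3 for a 3-sentence summary
    p, r, f1 = scores(j)
    assert abs(f1 - 2 * j / (M + K)) < 1e-12  # F1 = 2J/(M+K), as in the text
```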
The sentence agreement J can only take on 
integer values between 0 and M, so J, P, R, 
and F1 are all discrete variables. Therefore, al- 
though there may be thousands of possible ex- 
tract summaries of a document, only M + 1 dif- 
ferent scores are possible. This will obviously 
create a large number of ties in rankings pro- 
duced by the P, R, and F1 scores, and will 
greatly increase the probability that radically 
different summaries will be given the same score 
and rank. On the other hand, two summaries 
which express the same ideas using different sen- 
tences will be given very different scores. Both 
of these problems are illustrated by the exam- 
ple above. Furthermore, if a particular ground 
truth includes a large proportion of the doc- 
ument's sentences (perhaps it is a very con- 
cise document), shorter summaries will likely in- 
clude only sentences which appear in the ground 
truth. Consequently, even a randomly selected 
collection of sentences might obtain the largest 
possible score. Thus, recall-based measures are 
likely to violate both properties (i) and (ii), dis- 
cussed at the beginning of Section 2. These in- 
herent weaknesses in recall-based measures will 
be further explored in Section 4. 
2.2 A Sentence-Rank-Based Measure 
One way to produce ground truth summaries is 
to ask judges to rank the sentences of a docu- 
ment in order of their importance in a generic, 
indicative summary. This is often a difficult 
task for which it is nearly impossible to obtain 
consistent results. However, sentences which 
appear early in a document are often more in- 
dicative of the content of the document than 
are other sentences. This is particularly true 
in newspaper articles, whose authors frequently 
try to give the main points in the first para- 
graph (Brandow et al., 1995). Similarly, adja- 
cent sentences are more likely to be related to 
each other than to those which appear further 
away in the text. Thus, sentence position alone 
may be an effective way to rank the importance 
of sentences. 
To account for sentence importance within a 
ground truth, a summary comparison measure 
was developed which treats an extract as a rank- 
ing of the sentences of the document. For ex- 
ample, a document with five sentences can be 
expressed as (1, 2, 3, 4, 5). A particular extract 
may include sentences 2 and 3. Then if sen- 
tence 2 is more important than sentence 3, the 
sentence ranks are given by (4, 1, 2, 4, 4). Sen- 
tences 1, 4, and 5 all rank fourth, since 4 is the 
midrank of ranks 3, 4 and 5. Such rank vectors 
can be compared using Kendall's tau statistic 
(Sheskin, 1997), thus quantifying a summary's 
agreement with a particular ground truth. As 
will be shown in Section 4, sentence rank mea- 
sures result in a smaller number of ties than do 
recall-based evaluation measures. 
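The midrank construction and tie-corrected tau can be sketched in Python (not the authors' exact code; a generic tau-b formulation stands in for the Sheskin reference):

```python
# Sketch of the sentence-rank representation: an extract is re-expressed as a
# rank vector over all sentences, with every excluded sentence sharing the
# midrank of the remaining rank range.
import math
from itertools import combinations

def rank_vector(n_sentences, ordered_extract):
    """Ranks 1..k for the k extracted sentences (in importance order);
    excluded sentences all get the midrank of ranks k+1..n."""
    k = len(ordered_extract)
    midrank = (k + 1 + n_sentences) / 2
    ranks = [midrank] * n_sentences
    for pos, sent in enumerate(ordered_extract, start=1):
        ranks[sent - 1] = pos
    return ranks

def kendall_tau_b(x, y):
    """Kendall's tau-b between two rank vectors, with a tie correction."""
    c = d = tx = ty = 0                       # concordant, discordant, ties
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue                          # tied in both vectors: ignored
        elif dx == 0:
            tx += 1
        elif dy == 0:
            ty += 1
        elif dx * dy > 0:
            c += 1
        else:
            d += 1
    return (c - d) / math.sqrt((c + d + tx) * (c + d + ty))

# The five-sentence example from the text: extract {2, 3}, sentence 2 ranked
# above sentence 3; sentences 1, 4 and 5 share midrank 4.
print(rank_vector(5, [2, 3]))  # [4.0, 1, 2, 4.0, 4.0]
```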
Although it is also essentially recall-based, 
the sentence rank measure has another slight 
advantage over recall. Suppose a ground truth 
summary of a 20-sentence document consists 
of sentences {2, 3, 5}. The machine-generated 
summaries consisting of sentences {2, 3, 4} and 
{2, 3, 9} would receive the same recall score, but 
{2, 3, 4} would receive a higher tau score (5 is 
closer to 4 than to 9). Of course, this higher 
score may not be warranted if the content of 
sentence 9 is more similar to that of sentence 5. 
The use of the tau statistic may be more ap- 
propriate for ground truths produced by classi- 
fying all of the sentences of the original docu- 
ment in terms of their importance to an indica- 
tive summary. Perhaps four different categories 
could be used, ranging from 'very important' to 
'not important.' This would allow comparison 
of a ranking with four equivalence classes (rep- 
resenting the document) to one with just two 
equivalence classes (representing inclusion and 
exclusion from the summary to be evaluated). 
2.3 Content-Based Measures 
Since indicative summaries alert users to doc- 
ument content, any measure that evaluates the 
quality of an indicative summary ought to con- 
sider the similarity of the content of the sum- 
mary to the content of the full document. This 
consideration should be independent of exactly 
which sentences are chosen for the summary. 
The content of the summary need only cap- 
ture the general ideas of the original docu- 
ment. If human-generated extracts are avail- 
able, machine-generated extracts may be evalu- 
ated alternatively by comparing their contents 
to these ground truths. This section defines 
content-based measures by comparing the term 
frequency (tf) vectors of extracts to tf vectors 
of the full text or to tf vectors of a ground truth 
extract. When the texts and summaries are to- 
kenized and token aliases are determined by a 
thesaurus, summaries that disagree due to syn- 
onymy are likely to have similarly-distributed 
term frequencies. Also, summaries which hap- 
pen to use synonyms appearing infrequently in 
the text will not be penalized in a summary- 
to-full-document comparison. Note that term 
frequencies can always be used to compare an 
extract with its full text, since the two will al- 
ways have terms in common, but without a the- 
saurus or some form of term aliasing, term fre- 
quencies cannot be used to compare abstracts 
with extracts. 
The vector space model of information re- 
trieval as described by Salton (1989) uses the 
inner product of document vectors to measure 
the content similarity sim(d1, d2) of two docu- 
ments d1 and d2. Geometrically, this similarity 
measure gives the cosine of the angle between 
the two document vectors. Since cos(0) = 1, doc- 
uments with high cosine similarity are deemed 
similar. We apply this concept to summary 
evaluation by computing document-summary 
content similarity sim(d, s) or ground truth- 
summary content similarity sim(g, s). 
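A minimal Python sketch of this score (a naive whitespace split stands in for the paper's tokenization and token aliasing; the texts are invented):

```python
# Sketch of the content-based score: cosine similarity between the term
# frequency (tf) vectors of a summary and its document (or a ground truth).
import math
from collections import Counter

def tf_cosine(text_a, text_b):
    """sim(a, b): cosine of the angle between the two tf vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc = "the bank merger joins two banks the bank said"
summary = "the bank merger"
print(round(tf_cosine(summary, doc), 3))
```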
Note that when comparing a summary with 
its document, a prior human assessment is not 
necessary. This may serve to eliminate the am- 
biguity of a human assessor's bias towards cer- 
tain types of summaries or sentences. How- 
ever, the scores produced by such evaluation 
measures cannot be used reliably to compare 
summaries of drastically different lengths, since 
a much longer summary is more likely than a 
short summary to produce a term frequency 
vector which is similar to the full document's 
tf vector, despite the normalization of the two 
vectors. (This contrasts with the bias of recall 
towards short summaries.) 
This similarity measure can be enhanced in a 
number of ways. For example, using term fre- 
quency counts for a large corpus of documents, 
term weighting (such as log-entropy (Dumais, 
1991) or tf-idf (Salton, 1989)) can be used to 
weight the terms in the document and summary 
vectors. This may improve the performance of 
the similarity measure by increasing the weights 
of content-indicative terms and decreasing the 
weights of those terms that are not indicative 
of content. It is demonstrated in Section 4 that 
term weighting caused a significant increase in 
the correlation of the rankings produced by dif- 
ferent ground truths; however, it is not clear 
that this weighting increases the scores of high 
quality summaries. 
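A generic tf-idf weighting can be sketched as follows (a toy three-document corpus, not the paper's TREC collection; the paper's own experiments used relative frequencies and the weightings cited above):

```python
# Sketch of tf-idf term weighting: weight(t, d) = tf(t, d) * log(N / df(t)),
# where df(t) is the number of corpus documents containing term t.
import math
from collections import Counter

corpus = ["the bank merger", "the film score", "the trade summit"]
tokenized = [doc.split() for doc in corpus]
N = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))  # document frequency

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

v = tfidf_vector(tokenized[0])
assert v["the"] == 0.0          # appears in every document: weight 0
assert v["merger"] > v["the"]   # rarer, content-indicative terms weigh more
```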
There are two potential problems with using - 
the cosine measure to evaluate the performance 
of a summarization algorithm. First of all, it 
is likely that the summary vector will be very 
sparse compared to the document vector since 
the summary will probably contain many fewer 
terms than the document. Second, it is possi- 
ble that the summary will use key terms that 
are not used often in the document. For exam- 
ple, a document about the merger of two banks 
may use the term "bank" frequently, and use the 
related (yet not exactly synonymous) term "fi- 
nancial institution" only a few times. It is pos- 
sible that a high quality extract would have a 
low cosine similarity with the full document if it 
contained only those few sentences that use the 
term "financial institution" instead of "bank." 
Both of these problems can be addressed with 
another common tool in information retrieval: 
latent semantic indexing or LSI (Deerwester et 
al., 1990). 
LSI is a method of reducing the dimension 
of the vector space model using the singular 
value decomposition. Given a corpus of doc- 
uments, create a term-by-document matrix A 
where each row corresponds to a term in the 
document set and each column corresponds to 
a document. Thus, the columns of A represent 
all the documents from the corpus, expressed 
in a particular term-weighting scheme. (In our 
testing, the document vectors' entries are the 
relative frequencies of the terms.) Compute the 
singular value decomposition (SVD) of this ma- 
trix (for details see Golub and van Loan (1989)). 
Retain some number of the largest singular val- 
ues of A and discard the rest. In general, re- 
moving singular values serves as a dimension 
reduction technique. While the SVD computa- 
tion may be time-consuming when the corpus is 
large, it needs to be performed only once to pro- 
duce a new term-document matrix and a pro- 
jection matrix. To calculate the similarity of a 
summary and a document, the summary vector 
s must also be mapped to this low-dimensional 
space. This involves computing a vector-matrix 
product, which can be done quickly. 
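The SVD-based projection can be sketched with a toy matrix (illustrative values only; the paper's matrices were built from the 103-document corpus with relative-frequency entries):

```python
# Sketch of the LSI step: decompose a term-by-document matrix A via the SVD,
# retain the r largest singular values, and map a summary's term vector into
# the same r-dimensional space with a fast vector-matrix product.
import numpy as np

A = np.array([[2., 0., 1.],     # rows: terms, columns: documents
              [1., 3., 0.],
              [0., 1., 2.]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)
r = 2                            # number of singular values retained
U_r, S_r = U[:, :r], S[:r]

def project(vec):
    """Map a term-space vector into the r-dimensional LSI space."""
    return (vec @ U_r) / S_r

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

s = np.array([1., 1., 0.])       # a summary's term vector
d = A[:, 0]                      # the document's term vector
print(cos(project(s), project(d)))  # similarity in the reduced space
```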
The effect of using the scaled, reduced- 
dimension document and summary vectors is 
two-fold. First, each coordinate of both the doc- 
ument and summary vector will contribute to 
the overall similarity of the summary to the doc- 
ument (unlike the original vector space model, 
where only terms that occur in the summary 
contribute to the cosine similarity score). Sec- 
ond, the purpose of using LSI is to reduce the 
effect of near-synonymy on the similarity score. 
If a term occurs infrequently in the document 
but is highly indicative of the content of the 
document, as in the case where the infrequent 
term is synonymous with a frequent term, the 
summary will be penalized less in the reduced- 
dimension model for using the infrequent term 
than it would be in the original vector space 
model. This reduction in penalty occurs be- 
cause LSI essentially averages the weights of 
terms that co-occur frequently with other terms 
(both "bank" and "financial institution" often 
occur with the term "account"). This should 
improve the accuracy of the cosine similarity 
measure for determining the quality of a sum- 
mary of a document. 
3 Experimental Design 
This section describes the experiment that tests 
how well these summary evaluation metrics per- 
form. Fifteen documents from the Text Re- 
trieval Conference (TREC) collection were used 
in the experiment. These documents are part of 
a corpus of 103 newspaper articles. Each of the 
documents was tokenized by a  process- 
ing algorithm, which performed token aliasing. 
In our experiments, the term set was comprised 
of all the aliases appearing in the full corpus of 
103 documents. This corpus was used for the 
purposes of term weighting. Four expert judges 
created extract summaries (ground truths) for 
each of the documents. A list of the first 15 
documents, along with some of their numeri- 
cal features is found in Table 1. The judges 
were instructed to select as many sentences as 
were necessary to make an "ideal" indicative ex- 
tract summary of the document. In terms of 
the count of sentences in the ground truth, the 
lengths of the summaries varied from document 
to document. Ground truth compression rates 
were generally between 10 and 20 percent. The 
inter-assessor agreement also varied, but was of- 
ten quite high. We measured this by calculating 
the average pairwise recall in the collection of 
four ground truths. 
A suite of summary evaluation measures {Ek} 
which produce a score for a summary was de- 
veloped. These measures may depend on none, 
one, or all of the collection of ground truth 
summaries {gj}. Measures which do not de- 
pend on ground truth compute the summary- 
document similarity sim(s, d). Content-based 
measures which depend on a single ground truth 
gi compute the summary-ground truth similar- 
ity sim(s, gi). A measure which depends on 
all of the ground truths g1, ..., g4, computes 
a summary's similarity with each ground truth 
separately and averages these values. Table 2 
enumerates the 28 different evaluation measures 
that were compared in this experiment. Note 
that the Recall and Kendall measures require a 
ground truth. 
In this study, the measures will be used to 
evaluate extract summaries of a fixed sentence 
length K. In all of our tests, K = 3 for rea- 
sons of scale which will become clear. A sum- 
mary length of three sentences represents vary- 
ing proportions of the number of sentences in 
the full text document, but this length was usu- 
ally comparable to the lengths of the human- 
generated ground truths. For each document, 
the collection {Sj} was generated. This is the 
set of all possible K-sentence extracts from the 
document. If the document has N sentences 
total, there will be 

    C(N, K) = N! / (K! (N - K)!) 

extracts in the exhaustive collection {Sj}. The 
focus now is only on the set of all possible sum- 
maries and the evaluation measures, and not On 
any particular summarization algorithm. For 
each document, each of the measures in {Ek} 
was used to rank the sets {Sj}. (Note that the 
measures which do not depend on ground truths 
could, in fact, be used to generate summaries if 
it were possible to produce and rank the exhaus- 
tive set of fixed-length summaries in real time. 
Despite the authors' access to impressive com- 
puting power, the process took several hours 
for each document!) The next section compares 
these different rankings of the exhaustive set of 
extracts for each document. 
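Generating the exhaustive set is a one-liner in Python (a sketch, not the authors' implementation). Using document 14 from Table 1, with N = 25 sentences and K = 3, this yields the 2300 extracts mentioned in Section 4:

```python
# Sketch: the exhaustive collection {Sj} of K-sentence extracts is the set of
# all K-subsets of sentence indices, of size C(N, K).
import math
from itertools import combinations

N, K = 25, 3
extracts = list(combinations(range(1, N + 1), K))
assert len(extracts) == math.comb(N, K) == 2300
```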
4 Experimental Results 
One way to compare the different rankings pro- 
duced by two different evaluation measures is to 
Table 1: Test Document & Summary Statistics 

Doc.                         Sent.  Token  Gnd. Truth   Gnd. Truth 
No.   TREC File Name         Count  Count  Sent. Cnt.   Avg. Recall 
 1    WSJ911211-0057           34    667   3, 4, 12, 3     44% 
 2    WSJ900608-0126           34    603   4, 4, 9, 3      54% 
 3    WSJ900712-0047           18    364   2, 3, 5, 2      75% 
 4    latwp940604.0027         23    502   4, 5, 5, 4      69% 
 5    latwp940621.0116         27    579   12, 11, 10      84% 
 6    latwp940624.0094         17    460   5, 5, 5, 4      79% 
 7    latwp940707.0400         33    503   6, 9, 8, 8      52% 
 8    latwp940709.0051         37    877   3, 5, 5, 4      53% 
 9    latwp940713.0013         34    702   9, 4, 5, 8      35% 
10    latwp940713.0014         30    528   6, 5, 7, 5      88% 
11    latwp940721.0080         28    793   3, 3, 5, 2      88% 
12    latwp940725.0030         36    690   9, 2, 7, 5      45% 
13    latwp940725.0128         18    438   6, 3, 5, 5      63% 
14    latwp940729.0109         25    682   4, 3, 4, 3      96% 
15    latwp940801.0010         28    474   4, 5, 4, 5      43% 
Table 2: Evaluation Measures 

Similarity 
Measure        Details                                  d    g1   g2   g3   g4   avg. 
Recall         Ji/M, Ji = #(s ∩ gi)                    N.A.  E1   E2   E3   E4   E5 
Kendall Tau    see Section 2.2                         N.A.  E6   E7   E8   E9   E10 
tf Cosine      sim(s, d or gi) on tf vectors           E11  E12  E13  E14  E15  E16 
tf-idf Cosine  sim(s, .) on tf-idf weighted vectors    E17  E18  E19  E20  E21  E22 
SVD tf Cosine  sim(s, .) on low-dim vectors            E23  E24  E25  E26  E27  E28 
calculate their Spearman rank correlation coef- 
ficient. When two evaluation measures produce 
nearly the same ranking of the summary set, the 
rank correlation will be near 1 and a scatterplot 
of the two rankings will show points nearly ly- 
ing on a line with slope 1. When there is little 
correlation between two rankings, the statistic 
will be near 0 and the scatterplot will appear to 
have randomly-distributed points. A negative 
correlation indicates that one ranking often re- 
verses the rankings of the other and in this case 
a rank scatterplot will show points nearly lying 
on a line with negative slope. 
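Spearman's coefficient is just the Pearson correlation of the (midrank-adjusted) rank vectors, which can be sketched in plain Python (hypothetical scores; a library routine would do the same):

```python
# Sketch: Spearman rank correlation between two score vectors, computed as
# the Pearson correlation of their midrank-adjusted rank vectors.
import math

def midranks(scores):
    """Assign ranks, giving tied scores the midrank of their rank range."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1            # midrank, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = midranks(x), midranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

assert abs(spearman([1, 2, 3, 4], [10, 20, 30, 40]) - 1.0) < 1e-9  # same ranking
assert abs(spearman([1, 2, 3, 4], [40, 30, 20, 10]) + 1.0) < 1e-9  # reversed
```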
Table 3 compares the Spearman correlation 
of the rankings produced by a specific pair of 
ground truths. The first row contains the cor- 
relations of two highly similar ground truth ex- 
tracts of document 14. Both of these extracts 
consisted of three sentences; two of the sen- 
tences were common to both extracts. Not sur- 
prisingly, the correlation is high regardless of 
what measure produced the rankings. The sec- 
ond row demonstrates an increase (across the 
row) in correlation between rankings produced 
by two different ground truth summaries of doc- 
ument 8. These two ground truths did not dis- 
agree in focus, but did disagree due to synonymy 
-- they contain just one sentence in common. 
In general, the correlation among the rankings 
produced by synonymous ground truths was in- 
creased most by using the SVD content-based 
comparison. Figure 1 illustrates the correla- 
tion increase graphically for this pair of ground 
truths. By contrast, the third row of Table 3 
displays a decrease (across the row) in correla- 
tion between rankings produced by two differ- 
ent ground truths. In this case, the two ground 
truths disagreed in focus: they are Extracts 2 
and 3 contrasted in Section 2.1. Again, the cor- 
relation among the rankings produced by the 
four ground truths was decreased most by us- 
ing a weighted content-based comparison such 
Table 3: Correlation of Ground Truths Depends on Level of Disagreement 

                     recall   tau   tf cosine   tf-idf    SVD 
Agree sentences       0.87   0.96     0.95       0.87    0.99 
Disagree: synonymy    0.34   0.37     0.53       0.72    0.96 
Disagree: focus       0.22   0.31     0.32       0.20   -0.29 
as tf-idf or SVD. These patterns were typical for 
rankings produced by ground truths which dif- 
fered in focus, allaying the fear that applying 
the SVD weighting would produce correlated 
rankings based on any two ground truths. 
Of course, the lack of correlation among 
recall-based rankings whenever ground truths 
did not contain exactly the same sentences im- 
plies that a different collection of extracts would 
rank highly if one ground truth were replaced 
with the other. This effect would surely carry 
through to system averages across a set of doc- 
uments. To exemplify the size of this effect, 
for each document, the summaries which scored 
highest using one ground truth were scored (us- 
ing recall) against a second ground truth. With 
the first ground truths, these high-scoring sum- 
maries averaged over 75% recall; using the sec- 
ond ground truths, the same summaries aver- 
aged just over 25% recall. Thus, by simply 
changing judges, an automatic system which 
produced these summaries would appear to 
have a very different success rate. This dispar- 
ity is lessened when content-based measures are 
used, but the outcomes are still disparate. 
Evidence suggests that the content-based 
measures which do not rely on a ground truth 
may be an acceptable substitute for those which 
do. Over the set of 15 documents, the aver- 
age within-document inter-assessor correlation 
is 0.61 using term frequency, 0.72 using tf-idf, 
and 0.67 using SVD. The average correlation 
of the ground truth dependent measures with 
those that perform summary-document com- 
parisons is 0.48 using term frequency, 0.70 using 
tf-idf, and 0.56 using SVD. This means that on 
average, the rankings based on single ground 
truths are only slightly more correlated to each 
other than they are to the rankings that do not 
depend on any ground truth. 
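A minimal sketch of a ground-truth-free content-based score of this kind, assuming simple whitespace tokenization and raw term-frequency weighting (the tf-idf and SVD variants reweight the same vectors before taking the cosine):

```python
import math
from collections import Counter

def tf_cosine(summary, document):
    """Cosine similarity between term-frequency vectors of a summary
    and the full document it was extracted from."""
    s = Counter(summary.lower().split())
    d = Counter(document.lower().split())
    dot = sum(count * d[term] for term, count in s.items())
    norm = (math.sqrt(sum(v * v for v in s.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0
```

Because the comparison is summary-to-document rather than summary-to-ground-truth, the score requires no human judge at all.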
As noted in Section 2.1, the recall-based 
measures exhibit unfavorable scoring proper- 
ties. Figure 2 shows the histogram of scores 
assigned to the exhaustive summary set for doc- 
Figure 1: Synonymy: Content-based Measures 
Increase Rank Correlation 
[Figure panels garbled in extraction: score plots comparing rankings under
ground truths GT1 and GT2 for each of the five measures.]
ument 14 by five different measures. Each of 
these measures was based on the same ground 
truth summary of this document, which con- 
tained four sentences. Clearly, the measures 
based on a more sophisticated parsing method 
have a much greater ability to discriminate be- 
tween summaries. By contrast, the recall met-
ric can assign one of only four scores to a length-3
summary, based on the number of ground truth
sentences it contains. Elementary combinatorics
shows that 4 extracts will receive the highest
possible score (and thus will rank first), 126
summaries will rank second, 840 summaries
will rank third, and 1330 summaries will rank
last (with a score of 0). This accounts
for all of the 2300 three-sentence extracts that
are possible. It seems very unlikely that all of 
the second-ranking summaries are equally effec- 
tive. The histogram depicting this distribution 
is shown at the top of Figure 2. This is fol- 
lowed by the histograms for the Kendall met- 
ric, and the content-based metrics using term 
frequency, tf-idf, and SVD weighted vectors, re- 
spectively. The tf-idf and SVD weighted mea- 
sures produced a very fine distribution of scores, 
particularly near the top of the range. That is, 
these metrics are able to distinguish between 
different high-scoring summaries. These pat- 
terns in the score histograms were typical across 
the 15 documents. 
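The counts above follow from the hypergeometric distribution of overlap sizes; a quick check, assuming a 25-sentence document (the length consistent with the 2300 total):

```python
from math import comb

n_doc, n_gt, n_sum = 25, 4, 3   # document sentences, ground-truth size, extract length

# number of length-3 extracts containing exactly j of the 4 ground-truth sentences
counts = [comb(n_gt, j) * comb(n_doc - n_gt, n_sum - j) for j in (3, 2, 1, 0)]

print(counts)                           # [4, 126, 840, 1330]
print(sum(counts), comb(n_doc, n_sum))  # 2300 2300
```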
5 Conclusions and Future Work 
There is wide variation in the rankings pro-
duced by recall scores from non-identical ground
truths. This difference in scores is reflected in
averages computed across documents. The low
inter-assessor correlation of ranks based on re- 
call measures is distressing, and indicates that 
these measures cannot be effectively used to 
compare performances of summarization sys- 
tems. Measures which gauge content similarity 
produce more highly correlated rankings when- 
ever ground truths do not disagree in focus. 
Content-based measures assign different rank- 
ings when ground truths do disagree in focus. In 
addition, these measures provide a finer grained 
score with which to compare summaries. 
Moreover, the content-based measures which 
rely on a ground truth are only slightly more 
correlated to each other than they are to the
measures which perform summary-document 
comparisons. This suggests that the effective- 
Figure 2: Score Histograms for Document 14 
[Histogram panels garbled in extraction: score distributions for recall,
Kendall, tf cosine, tf-idf cosine, and SVD cosine, each based on ground
truth GT4.]
ness of summarization algorithms could be mea- 
sured without the use of human judges. Since 
the cosine measure is easy to calculate, feed-
back of summary quality can be almost instan- 
taneous. 
The properties of these content-based mea- 
sures need to be further investigated. For ex- 
ample, it is not clear that content-based mea- 
sures satisfy properties (i) and (ii), discussed in 
Section 2. Also, while they do produce far fewer 
ties than either recall or tau, such a fine distinc- 
tion in summary quality is probably not justi- 
fied. When human-generated ground truths are 
available, perhaps some combination of recall 
and the content-based measures could be used. 
For instance, whenever recall is not perfect, the 
content of the non-overlapping sentences could 
be compared with the missed ground truth sen- 
tences. Also, the effects of compression rate, 
summary length, and document style are not 
known. 
The authors are currently performing further 
experiments to see if users prefer summaries 
that rank highly with content-based measures 
over other summaries. Also, the outcomes 
of extrinsic evaluation techniques will be com- 
pared with each of these scoring methods. In 
other words, do the high-ranking summaries 
help users to perform various tasks better than 
lower-ranking summaries do? 
6 Acknowledgements 
The authors would like to thank Mary Ellen 
Okurowski and Duncan Buell for their sup-
port, encouragement, and advice throughout 
this project. Thanks go also to Tomek Strza- 
lkowski, Inderjeet Mani, Donna Harman, and 
Hal Wilson for their suggestions of how to im- 
prove the design of the experiment. We greatly 
appreciate the fine editing advice Oksana Las- 
sowsky provided. Finally, we are especially 
grateful to the four expert judges, Benay, Ed, 
MEO, and Toby, who produced our ground 
truth summaries. 

References 
Ronald Brandow, Karl Mitze, and Lisa F. Rau.
1995. Automatic condensation of electronic 
publications by sentence selection. Informa- 
tion Processing and Management, 31 (5):675- 
685. 
Scott Deerwester, Susan T. Dumais, George W. 
Furnas, Thomas K. Landauer, and Richard 
Harshman. 1990. Indexing by latent seman-
tic analysis. Journal of the American Society 
for Information Science, 41(6):391-407. 
Susan T. Dumais. 1991. Improving the retrieval 
of information from external sources. Behav- 
ior Research Methods, Instruments, & Com-
puters, 23(2):229-236. 
Therese Firmin and Michael J. Chrzanowski. 
1999. An evaluation of automatic text sum- 
marization systems. In Advances in Au- 
tomatic Text Summarization, chapter 21, 
pages 325-336. MIT Press, Cambridge, Mas- 
sachusetts. 
Jade Goldstein, Mark Kantrowitz, Vibhu Mit- 
tal, and Jaime Carbonell. 1999. Summa-
rizing text documents: Sentence selection 
and evaluation metrics. In Proceedings of the 
ACM SIGIR, pages 121-128. 
Gene H. Golub and Charles F. van Loan. 1989. 
Matrix Computations. The Johns Hopkins 
University Press, Baltimore. 
Hongyan Jing, Kathleen McKeown, Regina 
Barzilay, and Michael Elhadad. 1998. Sum- 
marization evaluation methods: Experiments 
and analysis. In American Association for 
Artificial Intelligence Spring Symposium Se- 
ries, pages 60-68. 
Julian Kupiec, Jan Pederson, and Francine 
Chen. 1995. A trainable document summa-
rizer. In Proceedings of the ACM SIGIR,
pages 68-73.
Inderjeet Mani, David House, Gary Klein, 
Lynette Hirschman, Therese Firmin, and 
Beth Sundheim. 1999. The TIPSTER SUM- 
MAC text summarization evaluation. In Pro- 
ceedings of the Ninth Conference of the Euro- 
pean Chapter of the ACL, pages 77-85.
Daniel Marcu. 1997. From discourse struc- 
ture to text summaries. In Proceedings of 
the ACL'97/EACL'97 Workshop on Intelli-
gent Scalable Text Summarization, pages 82- 
88. 
C. D. Paice and P. A. Jones. 1993. The identifi- 
cation of important concepts in highly struc- 
tured technical papers. In Proceedings of the 
ACM SIGIR, pages 69-78.
Gerard Salton. 1989. Automatic Text Pro- 
cessing. Addison-Wesley Publishers, Mas- 
sachusetts. 
C. Schwarz. 1990. Content based text han- 
dling. Information Processing and Manage- 
ment, 26(2):219-226. 
David J. Sheskin. 1997. Handbook of Paramet- 
ric and Nonparametric Statistical Procedures. 
CRC Press LLC, United States. 
T. Strzalkowski, J. Wang, and B. Wise. 1998. 
A robust practical text summarization sys-
tem. In AAAI Intelligent Text Summariza- 
tion Workshop, pages 26-30. 
Ellen M. Voorhees. 1998. Variations in rele- 
vance judgments and the measurement of
retrieval effectiveness. In Proceedings of the 
ACM SIGIR, pages 315-323.
