Corpus and Evaluation Measures for Multiple Document Summarization
with Multiple Sources
Tsutomu HIRAO
NTT Communication Science Laboratories
hirao@cslab.kecl.ntt.co.jp
Takahiro FUKUSIMA
Otemon Gakuin University
fukusima@res.otemon.ac.jp
Manabu OKUMURA
Tokyo Institute of Technology
oku@pi.titech.ac.jp
Chikashi NOBATA
Communication Research Laboratories
nova@crl.go.jp
Hidetsugu NANBA
Hiroshima City University
nanba@its.hiroshima-cu.ac.jp
Abstract
In this paper, we introduce a large-scale test collec-
tion for multiple document summarization, the Text
Summarization Challenge 3 (TSC3) corpus. We
detail the corpus construction and evaluation mea-
sures. The significant feature of the corpus is that it
annotates not only the important sentences in a doc-
ument set, but also those among them that have the
same content. Moreover, we define new evaluation
metrics taking redundancy into account and discuss
the effectiveness of redundancy minimization.
1 Introduction
It has been said that we have too much informa-
tion on our hands, forcing us to read through a great
number of documents and extract relevant informa-
tion from them. With a view to coping with this situ-
ation, research on automatic text summarization has
attracted a lot of attention recently and there have
been many studies in this field. There is a particular
need to establish methods for the automatic sum-
marization of multiple documents rather than single
documents.
There have been several evaluation workshops
on text summarization. In 1998, TIPSTER SUM-
MAC (Mani et al., 2002) took place and the Doc-
ument Understanding Conference (DUC)1 has been
held annually since 2001. DUC has included multi-
ple document summarization among its tasks since
the first conference. The Text Summarization Chal-
lenge (TSC)2 has been held about every year and a
half since 2001 as part of the NTCIR (NII-NACSIS Test
Collection for IR Systems) project. Multiple
document summarization was included for the first
time as one of the tasks at TSC2 (in 2002) (Okumura
et al., 2003). Multiple document summarization is
now a central issue for text summarization research.
1 http://duc.nist.gov
2 http://www.lr.pi.titech.ac.jp/tsc
In this paper, we detail the corpus construction
and evaluation measures used at the Text Summa-
rization Challenge 3 (TSC3 hereafter), where multi-
ple document summarization is the main issue. We
also report the results of a preliminary experiment
on simple multiple document summarization sys-
tems.
2 TSC3 Corpus
2.1 Guidelines for Corpus Construction
Multiple document summarization from multiple
sources, i.e., several newspapers concerned with the
same topic but with different publishers, is more dif-
ficult than single document summarization since it
must deal with more text (in terms of numbers of
characters and sentences). Moreover, it is peculiar
to multiple document summarization that the sum-
marization system must decide how much redun-
dant information should be deleted3.
In a single document, there will be few sentences
with the same content. In contrast, in multiple doc-
uments with multiple sources, there will be many
sentences that convey the same content with differ-
ent words and phrases, or even identical sentences.
Thus, a text summarization system needs to recog-
nize such redundant sentences and reduce the redun-
dancy in the output summary.
However, the corpora for DUC and TSC2 give us no
way of measuring how effectively a system reduces
such redundancy. The key data in TSC2 were abstracts
(free summaries) limited to a fixed number of
characters, which makes them difficult to use for
repeated or automatic evaluation and for evaluating
the extraction of important sentences. DUC was in
the same situation as TSC2, since most of its key
data were abstracts limited to a fixed number of
words. At DUC 2002, extracts (important sentences)
were used, and this allowed us to evaluate sentence
extraction. However, it is still not possible to
measure the effectiveness of redundant sentence
reduction, since the corpus was not annotated to
show sentences with the same content. The same is
true of the SummBank corpus (Radev et al., 2003).
In any case, because many current summarization
systems for multiple documents are based on sentence
extraction, we believe these corpora to be
unsuitable as sets of documents for evaluation.

3 It is true that we need other important techniques
such as those for maintaining the consistency of
words and phrases that refer to the same object, and
for making the results more readable; however, they
are not included here.
On this basis, in TSC3, we assumed that the process
of multiple document summarization consists of the
following three steps, and we produced a corpus for
the evaluation of a system at each of the three
steps4.
Step 1 Extract important sentences from a given set
of documents.
Step 2 Minimize redundant sentences in the result
of Step 1.
Step 3 Rewrite the result of Step 2 to reduce the
size of the summary to the specified number of
characters or less.
We have annotated not only the important sen-
tences in the document set, but also those among
them that have the same content. These are the cor-
pora for steps 1 and 2. We have prepared human-
produced free summaries (abstracts) for step 3.
In TSC3, since we have key data (a set of cor-
rect important sentences) for steps 1 and 2, we con-
ducted automatic evaluation using a scoring pro-
gram. We adopted an intrinsic evaluation by human
judges for step 3, which is currently under evalu-
ation. We provide details of the extracts prepared
for steps 1 and 2 and their evaluation measures in
the following sections. We do not report the overall
evaluation results for TSC3.
2.2 Data Preparation for Sentence Extraction
We begin with guidelines for annotating important
sentences (extracts). We think that there are two
kinds of extract.
1. A set of sentences that human annotators
judge as being important in a document set
(Fukusima and Okumura, 2001; Zechner,
1996; Paice, 1990).
2. A set of sentences that are suitable as a source
for producing an abstract, i.e., a set of sentences
in the original documents that correspond to the
sentences in the abstracts (Kupiec et al., 1995;
Teufel and Moens, 1997; Marcu, 1999; Jing and
McKeown, 1999).

4 This is based on general ideas of a summarization
system and is not intended to impose any conditions
on a summarization system.

[Figure 1: An example of an abstract and its sources.
Labels in the figure: Mainichi articles, Yomiuri
articles, abstract, (a), (b), (c), (d), Doc. x, Doc. y.]
When we consider how summaries are produced,
it seems more natural to identify important seg-
ments in the document set and then produce sum-
maries by combining and rephrasing such informa-
tion than to select important sentences and revise
them as summaries. Therefore, we believe that the
second type of extract is superior, and we prepared
the extracts in that way.
However, as stated in the previous section, with
multiple document summarization, there may be
more than one sentence with the same content, and
thus we may have more than one set of sentences
in the original document that corresponds to a given
sentence in the abstract; that is to say, there may be
more than one key datum for a given sentence in the
abstract5.
For example, suppose we have two sets of sentences
that correspond to sentence $a$ in the abstract:
(1) $s_1$ of document $x$, or
(2) a combination of $s_2$ and $s_3$ of document $y$.
This means that $s_1$ alone is able to produce $a$,
and $a$ can also be produced by combining $s_2$ and
$s_3$ (Figure 1).
We marked all the sentences in the original
documents that were suitable sources for producing
the sentences of the abstract, and this made it
possible for us to determine whether or not a
summarization system deleted redundant sentences
correctly at Step 2. If the system outputs the
sentences in the original documents that are
annotated as corresponding to the same sentence in
the abstract, it has redundancy. If not, it has no
redundancy. Returning to the above example, if the
system outputs $s_1$, $s_2$, and $s_3$, they all
correspond to sentence $a$ in the abstract, and thus
the output is redundant.

5 We use ‘set of sentences’ since we often find that
more than one sentence corresponds to a sentence in
the abstract.

Table 1: Important Sentence Data.
Sentence ID of Abstract   Set of Corresponding Sentences
1                         $\{s_1\}$ $\{s_{10}, s_{11}\}$
2                         $\{s_3, s_5, s_6\}$
3                         $\{s_{20}, s_{21}, s_{23}\}$ $\{s_1, s_{30}, s_{60}\}$
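The redundancy check described above can be sketched in a few lines of Python. This is only an illustration under one reading of the rule: an output is treated as redundant when it fully contains two or more alternative source sets annotated for the same abstract sentence. The function name `is_redundant` and the data layout are hypothetical and are not part of the TSC3 scoring program.

```python
def is_redundant(output, alternatives):
    """True if `output` fully contains two or more alternative source sets
    annotated for the same abstract sentence (one reading of the rule)."""
    chosen = set(output)
    covered = [alt for alt in alternatives if set(alt) <= chosen]
    return len(covered) >= 2


# Figure 1 example: sentence a can be produced from {s1} or from {s2, s3}.
alternatives_for_a = [{"s1"}, {"s2", "s3"}]
redundant = is_redundant(["s1", "s2", "s3"], alternatives_for_a)
compact = is_redundant(["s2", "s3"], alternatives_for_a)
```

Here `redundant` is true because both alternatives for sentence a are present, while `compact` is false because only one of them is.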
3 Evaluation Metrics
We use both intrinsic and extrinsic evaluation. The
intrinsic metrics are “Precision”, “Coverage” and
“Weighted Coverage.” The extrinsic metric is
“Pseudo Question-Answering.”
3.1 Intrinsic Metrics
3.1.1 Number of Sentences System Should
Extract
Precision and Recall are generally used as
evaluation metrics for sentence extraction, and we
used the PR Breaking Point (Precision = Recall) for
the evaluation of extracts in TSC1 (Fukusima and
Okumura, 2001). This means that we evaluate systems
when the number of sentences in the correct extract
is given. Likewise, in TSC3 we assume that the
number of sentences to be extracted is known, and we
evaluate a system output that has the same number of
sentences.
However, it is not as easy to decide the number of
sentences to be extracted in TSC3 as in TSC1. We
assume that there are correspondences between
sentences in the original documents and their
abstract as in Table 1. An ASCII space, “ ”, is the
delimiter for the sets of corresponding sentences in
the table. As shown in the table, we often see
several sets of sentences that correspond to a
sentence in the abstract in multiple document
summarization.
An ‘extract’ here is a set of sentences needed to
produce the abstract. For instance, we can obtain
‘extracts’ such as “$s_1, s_3, s_5, s_6, s_{30}, s_{60}$”
and “$s_{10}, s_{11}, s_3, s_5, s_6, s_{20}, s_{21}, s_{23}$”
from Table 1 (footnote 6). Often there are several
‘extracts’ and we must determine which of these is
the best. In such cases, we define the ‘correct
extract’ as the set with the least number of
sentences needed to produce the abstract because it
is desirable to convey the maximum amount of
information with the least number of sentences.

6 In fact, it is possible to produce the abstract
with other sentence combinations.

Finding the minimum set of sentences to produce the
abstract amounts to solving a constraint
satisfaction problem. In the example in Table 1, we
obtain the following constraints from each sentence
in the abstract:

$$C_1 = s_1 \lor (s_{10} \land s_{11}),$$
$$C_2 = s_3 \land s_5 \land s_6,$$
$$C_3 = (s_{20} \land s_{21} \land s_{23}) \lor (s_1 \land s_{30} \land s_{60}).$$

With these conditions, we now find the minimum set
that makes all the conjunctions true. We need to
find the minimum set that makes
$C_1 \land C_2 \land C_3 = \mathit{true}$. In this
case, the minimum cover is
$\{s_1, s_3, s_5, s_6, s_{30}, s_{60}\}$, and so the
system should extract six sentences.
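For a table as small as Table 1, the minimum cover can be found by brute force. The sketch below is an assumption about how one might do this, not the solver used in TSC3: it picks one alternative set per abstract sentence, takes the union, and keeps the smallest union. The names `TABLE1` and `minimum_extract` are hypothetical.

```python
from itertools import product

# Table 1: abstract sentence ID -> alternative sets of source sentences.
TABLE1 = {
    1: [{"s1"}, {"s10", "s11"}],
    2: [{"s3", "s5", "s6"}],
    3: [{"s20", "s21", "s23"}, {"s1", "s30", "s60"}],
}


def minimum_extract(table):
    """Pick one alternative per abstract sentence and keep the smallest union.

    Exhaustive search, so the cost grows exponentially with the number of
    abstract sentences that have several alternatives.
    """
    unions = (set().union(*choice) for choice in product(*table.values()))
    return min(unions, key=len)


correct_extract = minimum_extract(TABLE1)
h = len(correct_extract)  # number of sentences a system should extract
```

On Table 1 this yields the six-sentence set given in the text. Any set that satisfies all constraints must contain such a union, so the smallest union is also the smallest satisfying set.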
In TSC3, we computed the number of sentences
that the system should extract and then evaluated the
system outputs, which must have the same number
of sentences, with the following precision and cov-
erage.
3.1.2 Precision
Precision is the ratio of how many sentences in the
system output are included in the set of the
corresponding sentences. It is defined by the
following equation.

$$\mathrm{Precision} = \frac{m}{h}, \tag{1}$$

where $h$ is the least number of sentences needed to
produce the abstract by solving the constraint
satisfaction problem and $m$ is the number of
‘correct’ sentences in the system output, i.e., the
sentences that are included in the set of
corresponding sentences. For example, the sentences
listed in Table 1 are ‘correct.’ If the system
output is “$s_{10}, s_{11}, s_5, s_{17}, s_{60}, s_{61}$”,
then the Precision is as follows:

$$\mathrm{Precision} = \frac{4}{6} = 0.667. \tag{2}$$

For “$s_1, s_{10}, s_{11}, s_3, s_5, s_{60}$”, the
Precision is as follows:

$$\mathrm{Precision} = \frac{6}{6} = 1. \tag{3}$$
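Equation (1) can be checked against the two examples with a short sketch. The function and variable names are hypothetical, and the identifier `s17` simply stands for a sentence that does not appear in Table 1.

```python
# Table 1: abstract sentence ID -> alternative sets of source sentences.
TABLE1 = {
    1: [{"s1"}, {"s10", "s11"}],
    2: [{"s3", "s5", "s6"}],
    3: [{"s20", "s21", "s23"}, {"s1", "s30", "s60"}],
}


def precision(output, table, h):
    """m / h, where m counts output sentences found in any annotated set."""
    correct = set().union(*(alt for alts in table.values() for alt in alts))
    m = sum(1 for s in output if s in correct)
    return m / h


p_first = precision(["s10", "s11", "s5", "s17", "s60", "s61"], TABLE1, h=6)
p_second = precision(["s1", "s10", "s11", "s3", "s5", "s60"], TABLE1, h=6)
```

The first output contains four annotated sentences out of six and the second contains six, matching equations (2) and (3).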
3.1.3 Coverage
Coverage is an evaluation metric for measuring how
close the system output is to the abstract taking into
account the redundancy found in the set of sentences
in the output.
The set of sentences in the original documents that
corresponds correctly to the $i$-th sentence of the
human-produced abstract is denoted here as
$A_{i,1}, A_{i,2}, \ldots, A_{i,j}, \ldots, A_{i,l}$.
In this case, we have $l$ sets of corresponding
sentences. Here, $A_{i,j}$ indicates a set of
elements each of which corresponds to the sentence
number in the original documents, denoted as
$A_{i,j} = \{\theta_{i,j,1}, \theta_{i,j,2}, \ldots, \theta_{i,j,k}, \ldots\}$.
For instance, from Table 1,
$A_{1,2} = \{\theta_{1,2,1}, \theta_{1,2,2}\}$ with
$\theta_{1,2,1} = s_{10}$ and $\theta_{1,2,2} = s_{11}$.

Then, we define the evaluation score $e(i)$ for the
$i$-th sentence in the abstract as equation (4).

$$e(i) = \max_{1 \le j \le l} \left( \frac{\sum_{k=1}^{|A_{i,j}|} v(\theta_{i,j,k})}{|A_{i,j}|} \right), \tag{4}$$

where $v(\alpha)$ is defined by the following
equation.

$$v(\alpha) = \begin{cases} 1 & \text{if the system outputs } \alpha \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

Function $e$ returns 1 (one) when any $A_{i,j}$ is
output completely. Otherwise it returns a partial
score according to the number of sentences
$|A_{i,j}|$.

Given function $e$ and the number of sentences in
the abstract $n$, Coverage is defined as follows:

$$\mathrm{Coverage} = \frac{\sum_{i=1}^{n} e(i)}{n}. \tag{6}$$
If the system extracts
“$s_{10}, s_{11}, s_5, s_{17}, s_{60}, s_{61}$”,
$e(i)$ is computed as follows:

$$e(1) = \max(0, 1) = 1$$
$$e(2) = \max(0.33) = 0.33$$
$$e(3) = \max(0, 0.33) = 0.33$$

and its Coverage is 0.553. If the system extracts
“$s_1, s_{10}, s_{11}, s_3, s_5, s_{60}$”, then the
Coverage is 0.780.

$$e(1) = \max(1, 1) = 1$$
$$e(2) = \max(0.67) = 0.67$$
$$e(3) = \max(0, 0.67) = 0.67$$
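The Coverage computation can be sketched as follows, using the Table 1 data. The names are hypothetical, and `s17` stands for a sentence outside Table 1. Note that exact fractions give about 0.556 and 0.778 for the two examples; the values 0.553 and 0.780 in the text come from rounding 1/3 and 2/3 to two decimals before averaging.

```python
# Table 1: abstract sentence ID -> alternative sets of source sentences.
TABLE1 = {
    1: [{"s1"}, {"s10", "s11"}],
    2: [{"s3", "s5", "s6"}],
    3: [{"s20", "s21", "s23"}, {"s1", "s30", "s60"}],
}


def e(alternatives, output):
    """Equation (4): best fraction of any alternative set found in the output."""
    chosen = set(output)
    return max(len(alt & chosen) / len(alt) for alt in alternatives)


def coverage(table, output):
    """Equation (6): mean of e(i) over the sentences of the abstract."""
    return sum(e(alts, output) for alts in table.values()) / len(table)


cov_first = coverage(TABLE1, ["s10", "s11", "s5", "s17", "s60", "s61"])
cov_second = coverage(TABLE1, ["s1", "s10", "s11", "s3", "s5", "s60"])
```

Both outputs contain six sentences, but the second covers more of each abstract sentence's best alternative, which is what the higher Coverage reflects.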
3.1.4 Weighted Coverage
Now we define ‘Weighted Coverage’ since each
sentence in TSC3 is ranked A, B or C, where “A” is
the best. This is similar to “Relative Utility” (Radev
et al., 2003). We only use three ranks in order to
limit the ranking cost. The definition is obtained by
modifying equation (6).
$$\mathrm{W.C.} = \frac{\sum_{i=1}^{n} w(r(i)) \cdot e(i)}{w(A)\,n(A) + w(B)\,n(B) + w(C)\,n(C)}, \tag{7}$$

where $r(i)$ denotes the ranking of the $i$-th
sentence of the abstract and $w(r(i))$ is its
weight. $n(\mathit{rank})$ is the number of
sentences whose ranking is $\mathit{rank}$ in the
abstract. Suppose the first sentence is ranked A,
the second B, and the third C in Table 1, and their
weights are given as $w(A) = 1$, $w(B) = 0.5$ and
$w(C) = 0.3$ (footnote 7).
As before, if the system extracts
“$s_{10}, s_{11}, s_5, s_{17}, s_{60}, s_{61}$”, then
the Weighted Coverage is computed as follows:

$$\mathrm{W.C.} = \frac{1 \cdot 1 + 0.5 \cdot 0.33 + 0.3 \cdot 0.33}{1 \cdot 1 + 0.5 \cdot 1 + 0.3 \cdot 1} \approx 0.70. \tag{8}$$
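A sketch of equation (7) on the running example is given below. The rank assignment and weights follow the text (A, B, C with weights 1, 0.5, 0.3); the function and variable names are hypothetical, and `s17` again stands for a sentence outside Table 1.

```python
from collections import Counter

# Table 1: abstract sentence ID -> alternative sets of source sentences.
TABLE1 = {
    1: [{"s1"}, {"s10", "s11"}],
    2: [{"s3", "s5", "s6"}],
    3: [{"s20", "s21", "s23"}, {"s1", "s30", "s60"}],
}
RANKS = {1: "A", 2: "B", 3: "C"}           # r(i) for each abstract sentence
WEIGHTS = {"A": 1.0, "B": 0.5, "C": 0.3}   # w(rank) as given in the text


def e(alternatives, output):
    """Equation (4): best fraction of any alternative set found in the output."""
    chosen = set(output)
    return max(len(alt & chosen) / len(alt) for alt in alternatives)


def weighted_coverage(table, ranks, weights, output):
    """Equation (7): rank-weighted sum of e(i), normalized by sum of w * n."""
    numerator = sum(weights[ranks[i]] * e(alts, output)
                    for i, alts in table.items())
    counts = Counter(ranks.values())  # n(rank)
    denominator = sum(weights[r] * n for r, n in counts.items())
    return numerator / denominator


wc = weighted_coverage(TABLE1, RANKS, WEIGHTS,
                       ["s10", "s11", "s5", "s17", "s60", "s61"])
```

With exact fractions the result is slightly above 0.70; the denominator is 1.8 because each rank occurs once in this example.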
3.2 Extrinsic Metrics
3.2.1 Pseudo Question-Answering
Sometimes question-answering (QA) by human
subjects is used for evaluation (Morris et al., 1992;
Hirao et al., 2001). That is, human subjects judge
whether predefined questions can be answered by
reading only a machine generated summary. How-
ever, the cost of this evaluation is huge. Therefore,
we employ a pseudo question-answering evaluation,
i.e., whether a summary has an ‘answer’ to the ques-
tion or not. The background to this evaluation is in-
spired by TIPSTER SUMMAC’s QA track (Mani et
al., 2002).
For each document set, there are about five ques-
tions for a short summary and about ten questions
for a long summary. Note that the questions for the
short summary are included in the questions for the
long summary. Examples of questions for the topic
“Release of SONY’s AIBO” are as follows: “How
much is AIBO?”, “When was AIBO sold?”, and
“How many AIBO are sold?”.
Now, we evaluate the summary with ‘exact match’ and
‘edit distance’ scores for each question. ‘Exact
match’ is a scoring function that returns one when
the summary includes the answer to the question.
‘Edit distance’ measures whether the system’s
summary has strings that are similar to the answer
strings. The score $S_{ed}$ based on the edit
distance is normalized with the length of the
sentence and the answer string so that the range of
the score is [0,1]:

$$S_{ed} = \frac{\text{length of the sentence} - \text{edit distance}}{\text{length of the answer strings}}. \tag{9}$$
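Equation (9) can be illustrated with a minimal sketch. It assumes a character-level Levenshtein distance with unit costs, which the paper does not specify, and all function names are hypothetical. The summary-level score is taken as the maximum over sentences. Under these assumptions the value stays within [0, 1] for any sentence at least as long as the answer.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def s_ed(sentence, answer):
    """Equation (9) for a single sentence."""
    return (len(sentence) - edit_distance(sentence, answer)) / len(answer)


def summary_score(sentences, answer):
    """Score of a summary: the best sentence-level score."""
    return max(s_ed(s, answer) for s in sentences)


# Hypothetical summary sentences and answer string.
summary = ["AIBO is a robot dog", "AIBO costs 250,000 yen"]
best = summary_score(summary, "250,000 yen")
```

When a sentence contains the whole answer string, the edit distance equals the number of extra characters, so the score is exactly 1, as for the second sentence here.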
The score for a summary is the maximum value of the
scores for sentences in the summary. The score is 1
if the summary has a sentence that includes the
whole answer string.
It should be noted that the presence of answer
strings in the summary does not mean that a human
subject can necessarily answer the question.

7 $w(\mathit{rank})$ may be computed differently. It
is 1/rank (one divided by rank) here.

Table 2: Description of TSC3 Corpus.
# of doc. sets                   30
# of articles (The Mainichi)    175
# of articles (The Yomiuri)     177
Total                           352
# of Sentences                 3587
4 Preliminary Experiment
To examine whether our corpus is suitable for
summarization evaluation, we tested whether our
evaluation captures both the significant
information and the redundancy in system summaries.
Below we provide the details of the corpus, eval-
uation results and effectiveness of the minimization
of redundant sentences.
4.1 Description of Corpus
According to the guidelines described in Section 2,
we constructed extracts and abstracts of thirty
sets of documents drawn from the Mainichi and
Yomiuri newspapers published between 1998 and
1999, each of which was related to a certain topic.
First, we prepared abstracts (their sizes were 5%
and 10% of the total number of the characters in
the document set), then produced extracts using the
abstracts. Table 2 shows the statistics.
One document set consists of about 10 articles
on average, and almost the same number of articles
were taken from the Mainichi newspaper and the
Yomiuri newspaper. Most of the topics are classified
as single-event according to McKeown et al. (2001).
The following list contains all the topics.
0310 Two-and-a-half-million-year-old new hominid species
found in Ethiopia.
0320 Acquisition of IDC by NTT (and C&W).
0340 Remarketing of game software judged legal by Tokyo
District Court.
0350 Night landing practice of carrier-based aircraft of the
Independence.
0360 Simultaneous bombing of the US Embassies in Tanzania
and Kenya.
0370 Resignation of President Suharto.
0380 Nomination of Mr. Putin as Russian prime minister.
0400 Osama bin Laden provided shelter by Taliban regime in
Afghanistan.
0410 Transfer of Nakata to A.C. Perugia.
0420 Release of Dreamcast.
0440 Existence of Japanese otter confirmed.
0450 Kyocera Corporation makes Mita Co. Ltd. its subsidiary.
0460 Five-story pagoda at Muroji Temple damaged by ty-
phoon.
0470 Retirement of aircraft YS-11.
0480 Test observation of astronomical telescope ‘Subaru’
started.
0500 Dolly the cloned sheep.
0510 Mass of neutrinos.
0520 Human Genome Project finishes decoding of the 22nd
chromosome.
0530 Peace talks in Northern Ireland at the end of 1999.
0540 Debut of new model of bullet train (700 family).
0550 Mr. Yukio Aoshima decides not to run for gubernatorial
election.
0560 Mistakes in entrance examination of Kansai University.
0570 Space shuttle Endeavour, from its launch to return.
0580 40 million-year-old fossil of new monkey species found
by research group at Kyoto University.
0590 Dead body of George Mallory found on Mt. Everest.
0600 Release of SONY’s AIBO.
0610 e-one, look-alike of iMac.
0630 Research on Kitora tomb resumes.
0640 Tidal wave damage generated by earthquake in Papua
New Guinea.
0650 Mistaken bombing of the Chinese embassy by NATO.
4.2 Compared Extraction Methods
We used the lead-based method, the TF·IDF-based
method (Zechner, 1996) and the sequential pattern-
based method (Hirao et al., 2003), and compared
the performance of these summarization methods on
the TSC3 corpus.
Lead-based Method
The documents in a test set were sorted in chrono-
logical and ascending order. Then, we extracted a
sentence at a time from the beginning of each docu-
ment and collected them to form a summary.
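One way to read this procedure is as a round-robin over the chronologically sorted documents: first sentences first, then second sentences, and so on until the required number is reached. The sketch below follows that reading; the function name, the `(date, sentences)` layout and the ISO date strings are assumptions made for illustration.

```python
def lead_extract(documents, n):
    """documents: list of (date, [sentences]); returns the first n sentences
    taken one at a time from the beginning of each document in date order."""
    ordered = [sents for _, sents in sorted(documents, key=lambda d: d[0])]
    summary = []
    depth = 0  # position within each document
    while len(summary) < n and any(depth < len(s) for s in ordered):
        for sents in ordered:
            if depth < len(sents) and len(summary) < n:
                summary.append(sents[depth])
        depth += 1
    return summary


# Hypothetical two-document set; ISO dates sort chronologically as strings.
docs = [
    ("1999-05-02", ["b1", "b2"]),
    ("1999-05-01", ["a1", "a2", "a3"]),
]
lead = lead_extract(docs, 3)
```

With this toy input the earlier document contributes `a1`, then the later one `b1`, and the third slot goes to `a2`.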
TF·IDF-based Method
The score of a sentence is the sum of the
significance scores of each content word in the
sentence. We extracted sentences in descending
order of this importance score. The sentence score
$S_{tfidf}(s_i)$ is defined by the following.

$$S_{tfidf}(s_i) = \sum_{t \in s_i} w(t, DS), \tag{10}$$

where $w(t, DS)$ is defined as follows:

$$w(t, DS) = tf(t, DS) \cdot \log \frac{|DB|}{df(t)}. \tag{11}$$

$tf(t, DS)$ is the frequency of word $t$ in the
document set, $df(t)$ is the document frequency of
$t$, and $|DB|$ is the total number of documents in
the set. In fact, we computed these using all the
articles published in the Mainichi and Yomiuri
newspapers for the years 1998 and 1999.
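Equations (10) and (11) can be sketched as follows. The counts are hypothetical toy values, not TSC3 statistics, and the natural logarithm is an assumption since the paper does not state the base. The sketch sums over the tokens passed in, so a word that occurs twice in a sentence is counted twice.

```python
import math


def word_weight(t, tf_ds, df_db, n_db):
    """Equation (11): tf(t, DS) * log(|DB| / df(t))."""
    return tf_ds[t] * math.log(n_db / df_db[t])


def s_tfidf(content_words, tf_ds, df_db, n_db):
    """Equation (10): sum of word weights over a sentence's content words."""
    return sum(word_weight(t, tf_ds, df_db, n_db) for t in content_words)


# Hypothetical toy counts.
tf_ds = {"aibo": 3, "sony": 2}     # term frequency in the document set
df_db = {"aibo": 10, "sony": 100}  # document frequency in the reference DB
n_db = 1000                        # |DB|
score = s_tfidf(["aibo", "sony"], tf_ds, df_db, n_db)
```

Sentences would then be sorted by `score` in descending order and the top ones extracted.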
Table 3: Evaluation results for “Precision”,
“Coverage” and “Weighted Coverage.”
Method   Length  Prec.  Cov.   W.C.
Lead     Short   .426   .212   .326
         Long    .539   .259   .369
TF·IDF   Short   .497   .292   .397
         Long    .604   .325   .434
Pattern  Short   .613   .305   .403
         Long    .665   .298   .418

Table 4: Evaluation results for “Pseudo Question-
Answering.”
Method   Length  Exact  Edit
Lead     Short   .300   .589
         Long    .275   .602
TF·IDF   Short   .375   .643
         Long    .393   .659
Pattern  Short   .390   .644
         Long    .370   .640

Sequential Pattern-based Method
The score of a sentence is the sum of the
significance scores of each sequential pattern in
the sentence. The patterns used for scoring were
decided by using a statistical significance test
such as the $\chi^2$ metric test and using 1,000
patterns. This is an extension of Lin’s method (Lin
and Hovy, 2000). The sentence score $S_{pat}(s_i)$
is defined by the following.

$$S_{pat}(s_i) = \sum_{p \in s_i} w(p), \tag{12}$$

where $w(p)$ is defined as follows:

$$w(p) = \frac{\log(f(p, DS) + 1) \cdot \log \frac{|AS|}{f(p, AS)}}{len(p)}. \tag{13}$$

$f(p, DS)$ is the sentence frequency of pattern $p$
in the document set and $f(p, AS)$ is the sentence
frequency of pattern $p$ in all topics. $|AS|$ is
the number of sentences in all topics and $len(p)$
is the pattern length.
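The pattern weight of equation (13) can be sketched as follows. The frequencies are hypothetical, the natural logarithm is assumed, and representing a pattern as a tuple of words (so that its length is the number of items) is also an assumption made for illustration.

```python
import math


def pattern_weight(f_ds, f_as, n_as, length):
    """Equation (13): log(f(p,DS)+1) * log(|AS| / f(p,AS)) / len(p)."""
    return math.log(f_ds + 1) * math.log(n_as / f_as) / length


def s_pat(patterns, stats, n_as):
    """Equation (12): sum of weights of the patterns found in a sentence.

    stats maps each pattern to (f(p, DS), f(p, AS)).
    """
    return sum(pattern_weight(stats[p][0], stats[p][1], n_as, len(p))
               for p in patterns)


# Hypothetical toy statistics for one two-word pattern.
pattern = ("aibo", "release")
stats = {pattern: (9, 10)}  # f(p, DS) = 9, f(p, AS) = 10
n_as = 1000                 # |AS|
w = pattern_weight(9, 10, n_as, len(pattern))
score = s_pat([pattern], stats, n_as)
```

The first factor grows with how often the pattern occurs in the document set, while the second factor discounts patterns that are common across all topics.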
4.3 Evaluation Result
Table 3 shows the intrinsic evaluation result. All
methods have lower Coverage and Weighted Cov-
erage scores than Precision scores. This means that
the extracted sentences include redundant ones. In
particular, the difference between “Precision” and
“Coverage” is large in “Pattern.”
Although both “Pattern” and “TF·IDF” outper-
form “Lead,” the difference between them is small.
In addition, we know that “Lead” is a good extrac-
tion method for newspaper articles; however, this is
not true for the TSC3 corpus.
Table 4 shows the extrinsic evaluation results.
Again, both “Pattern” and “TF·IDF” outperform
“Lead”, but the difference between them is small.
We found a correlation between the intrinsic and ex-
trinsic measures.
Table 5: Effects of clustering (“Precision”,
“Coverage”, “Weighted Coverage”).
Method   Length  Prec.  Cov.   W.C.
TF·IDF   Short   .430   .297   .377
         Long    .533   .345   .455
Pattern  Short   .531   .289   .390
         Long    .620   .338   .456

Table 6: Effects of clustering (Pseudo Question-
Answering).
Method   Length  Exact  Edit
TF·IDF   Short   .401   .650
         Long    .377   .648
Pattern  Short   .392   .650
         Long    .380   .655
4.4 Effect of Redundant Sentence
Minimization
The experiment described in the previous section
shows that a group of sentences extracted in a sim-
ple way includes many redundant sentences. To
examine the effectiveness of minimizing redundant
sentences, we compare the Maximal Marginal Rele-
vance (MMR) based approach (Carbonell and Gold-
stein, 1998) with the clustering approach (Nomoto
and Matsumoto, 2001). We use ‘cosine similarity’
with a bag-of-words representation for the similar-
ity measure between sentences.
Clustering-based Approach
After computing importance scores using equations
(10) and (12), we conducted hierarchical clustering
using Ward’s method until we reached $h$ (see
Section 3.1.1) clusters for the first $3h$
sentences. Then, we extracted the sentence with the
highest score from each cluster.
Table 5 shows the results of the intrinsic
evaluation and Table 6 shows the results of the
extrinsic evaluation. Compared with Table 3, the
clustering-based approach lowered Precision for
both TF·IDF and Pattern but raised Coverage in
three of the four settings (Pattern with short
summaries is the exception). When comparing Table 4
with Table 6, the score is improved in most cases.
These results imply that redundancy minimization is
effective for improving the quality of summaries.
MMR-based Approach
After computing importance scores using equations
(10) and (12), we re-ranked the first $3h$
sentences by MMR and extracted the first $h$
sentences.
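A greedy MMR re-ranking over a candidate pool can be sketched as follows. This is a generic illustration rather than the exact configuration used in the experiment: the trade-off parameter `lam`, the assumption that importance scores are scaled to be comparable with cosine similarity, and all names are hypothetical. Similarity is cosine over bag-of-words vectors, as stated in the text.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists (bag of words)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def mmr_select(candidates, h, lam=0.5):
    """candidates: list of (score, tokens) for the top-ranked pool.
    Greedily returns the indices of h sentences, trading importance
    against similarity to the sentences already selected."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < h:
        def mmr(i):
            score, tokens = candidates[i]
            penalty = max((cosine(tokens, candidates[j][1]) for j in selected),
                          default=0.0)
            return lam * score - (1 - lam) * penalty
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical pool: the second sentence duplicates the first.
pool = [
    (1.0, ["aibo", "release", "sony"]),
    (0.9, ["aibo", "release", "sony"]),
    (0.5, ["price", "yen"]),
]
chosen = mmr_select(pool, h=2)
```

In this toy pool the duplicate is skipped in favour of the lower-scored but novel third sentence, which is the behaviour redundancy minimization is meant to produce.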
Tables 7 and 8 show the intrinsic and extrinsic
evaluation results, respectively. We can see the ef-
fectiveness of redundancy minimization by MMR.
Notably, in most cases, there is a large improvement
in both the intrinsic and extrinsic evaluation results
as compared with clustering.
Table 7: Effects of MMR (“Precision”, “Coverage”,
“Weighted Coverage”).
Method   Length  Prec.  Cov.   W.C.
TF·IDF   Short   .469   .306   .403
         Long    .565   .376   .475
Pattern  Short   .469   .332   .429
         Long    .577   .377   .500

Table 8: Effects of MMR (Pseudo Question-
Answering).
Method   Length  Exact  Edit
TF·IDF   Short   .386   .647
         Long    .405   .667
Pattern  Short   .417   .663
         Long    .390   .656
These results show that redundancy minimization
has a significant effect on multiple document sum-
marization.
5 Conclusion
We described the details of a corpus constructed for
TSC3 and measures for its evaluation, focusing on
sentence extraction. We think that a corpus in which
important sentences and those with the same content
are annotated for multiple documents is a new and
significant feature for summarization corpora.
It is planned to make the TSC3 corpus available
(even if the recipient is not a TSC3 participant) by
exchanging memoranda with the National Institute
of Informatics in Japan. We sincerely hope that this
corpus will be useful to researchers who are inter-
ested in text summarization and serve to facilitate
further progress in this field.

References

J. Carbonell and J. Goldstein. 1998. The Use of
MMR, Diversity-Based Reranking for Reorder-
ing Documents and Producing Summaries. In
Proc. of the 21st ACM-SIGIR, pages 335–336.

T. Fukusima and M. Okumura. 2001. Text Summa-
rization Challenge: Text Summarization Evalua-
tion in Japan. In Proc. of the NAACL 2001 Work-
shop on Automatic summarization, pages 51–59.

T. Hirao, Y. Sasaki, and H. Isozaki. 2001. An
Extrinsic Evaluation for Question-Biased Text
Summarization on QA tasks. In Proc. of the
NAACL 2001 Workshop on Automatic Summa-
rization, pages 61–68.

T. Hirao, J. Suzuki, H. Isozaki, and E. Maeda.
2003. Multiple Document Summarization using
Sequential Pattern Mining (in Japanese). In The
Special Interest Group Notes of IPSJ (NL-158-6),
pages 31–38.

H. Jing and K. McKeown. 1999. The Decom-
position of Human-Written Summary Sentences.
Proc. of the 22nd ACM-SIGIR, pages 129–136.

J. Kupiec, J. Pedersen, and F. Chen. 1995. A Train-
able Document Summarizer. In Proc. of the 18th
ACM-SIGIR, pages 68–73.

C-Y. Lin and E. H. Hovy. 2000. The Automated
Acquisition of Topic Signatures for Text Sum-
marization. In Proc. of the 16th COLING, pages
495–501.

I. Mani, G. Klein, D. House, L. Hirschman, T. Fir-
man, and B. Sundheim. 2002. SUMMAC: a text
summarization evaluation. Natural Language
Engineering, 8(1):43–68.

D. Marcu. 1999. The Automatic Construction
of Large-scale Corpora for Summarization Re-
search. Proc. of the 22nd ACM-SIGIR, pages
137–144.

K. McKeown, R. Barzilay, D. Evans, V. Hatzivas-
silogou, M. Y. Kan, B. Schiffman, and S. Teufel.
2001. Columbia Multi-Document Summariza-
tion: Approach and Evaluation. In Proc. of the
Document Understanding Conference 2001.

A. H. Morris, G. M. Kasper, and D. A. Adams.
1992. The Effects and Limitations of Automatic
Text Condensing on Reading Comprehension.
Information Systems Research, 3(1):17–35.

T. Nomoto and M. Matsumoto. 2001. A New Ap-
proach to Unsupervised Text Summarization. In
Proc. of the 24th ACM-SIGIR, pages 26–34.

M. Okumura, T. Fukusima, and H. Nanba. 2003.
Text Summarization Challenge 2, Text Summa-
rization Evaluation at NTCIR Workshop 3. In
Proc. of the HLT/NAACL 2003 Text Summariza-
tion Workshop, pages 49–56.

C. Paice. 1990. Constructing Literature Abstracts
by Computer: Techniques and Prospects. Infor-
mation Processing and Management, 26(1):171–
186.

D. R. Radev, S. Teufel, H. Saggion, W. Lam,
J. Blitzer, H. Qi, A. Celebi, D. Liu, and
E. Drabek. 2003. Evaluation challenges in large-
scale document summarization. In Proc. of the
41st ACL, pages 375–382.

S. Teufel and M. Moens. 1997. Sentence Extrac-
tion as a Classification Task. In Proc. of the ACL
Workshop on Intelligent Scalable Text Summa-
rization, pages 58–65.

K. Zechner. 1996. Fast Generation of Abstracts
from General Domain Text Corpora by Extract-
ing Relevant Sentences. In Proc. of the 16th
COLING, pages 986–989.
