A Survey for Multi-Document Summarization 
Satoshi Sekine 
New York University 
715 Broadway, 7th floor 
New York, NY, 10003, USA 
sekine@cs.nyu.edu 
 
 
Chikashi Nobata 
Communications Research Laboratory
2-2-2 Hikaridai, Seika-chou, Soraku-gun 
Kyoto, 619-0289, Japan 
nova@crl.go.jp 
 
Abstract 
Automatic multi-document summarization is still hard to realize. Under such circumstances, we believe it is important to observe how humans perform the same task and to look for different strategies.
We prepared 100 document sets similar to the ones used in the DUC multi-document summarization task. For each document set, several people prepared the following data, and we conducted a survey.
A) Free-style summarization
B) Sentence-extraction-type summarization
C) Axis (type of main topic)
D) Table-style summary
In particular, we will describe the last two in detail, as these could lead to a new direction for multi-document summarization research.
1 Introduction 
Automatic multi-document summarization is still hard to realize. As with single-document summarization of newspaper articles, where we do not yet have an automatic algorithm notably better than a simple lead-based method, automatic multi-document summarization faces very difficult challenges. Under such circumstances, we believe it is important to observe how humans perform the same task and to look for possible different strategies.
Assume you are given several documents about the same topic and are asked to summarize them; what might you do? The authors tried this themselves. First, we used a marker to highlight the important phrases or sentences. Then we tried to connect them, in some cases by figuring out the main or common topics in the marked sentences, and in other cases by making a list or a table to get an overview of the documents. When we looked at the results at this stage, we noticed that they were very good summaries, even if they are not summaries in the conventional sense (a set of sentences to be read). The main topics help the reader understand the overall issues in the document set, and the table is a good digest of the issues throughout the set. If we could automatically create such data from document sets, we might be able to make good summaries. The questions arising here are what kinds of "main topic" we can identify in general, and what percentage of document sets are suitable for table-style summarization.
The main topics we created in our hand-summary experiment were like lists of keywords, but we found that there are more general types, such as "these documents are talking about a single person". As keyword extraction has long been one of the techniques used in summarization, we will focus on the types of the main topics in the following experiments.
We will describe the definition of our types and report on the experiment of manually creating table-style summaries, as well as analyses of free-style summaries and sentence-extraction-type summaries. We prepared 100 document sets similar to the ones used in the DUC multi-document summarization task (DUC homepage). For each document set, annotators prepared the following data.
A) Free-style summarization
B) Sentence-extraction-type summarization
C) Axis (type of main topic)
D) Table-style summary
In particular, we will describe the last two in detail, as these could lead to a new direction for multi-document summarization research.
2 Document Sets 
First, we describe how we assembled our 100 multi-document sets. We found that the topics of the DUC multi-document data are somewhat biased because they are pre-filtered for evaluation purposes, i.e. the DUC document sets are carefully chosen as described in the guidelines. The pre-filtering is useful for evaluation, but it does not necessarily reflect the distribution of user needs or the distribution of topics in the news. We wanted to obtain relatively more balanced document sets, so we adopted the following procedure; the entire experiment was done using a Japanese newspaper corpus (Mainichi 1998 and 1999).
- Select an article randomly from the corpus (the seed).
- Choose keywords from the article: all nouns with frequency greater than 1, except for some special types of nouns.
- Use the Dice coefficient to retrieve articles similar to the seed article (a minimal sketch of this step is given below), gathering all documents with a coefficient greater than 0.5.
- Keep article sets that contain more than 3 articles. About 300 such sets were obtained; among them, we selected 100 document sets, preferring sets with more documents and avoiding overlapping topics.
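
The retrieval step can be sketched as follows, assuming each article has already been reduced to the set of candidate keywords described above; the function names and the corpus representation are illustrative assumptions, not the exact implementation we used.

def dice(a, b):
    """Dice coefficient between two keyword sets: 2*|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def build_document_set(seed_keywords, corpus, threshold=0.5):
    """Gather all articles whose keyword sets have a Dice coefficient
    greater than the threshold with the seed article's keywords.
    corpus: iterable of (article_id, keyword_set) pairs."""
    return [doc_id for doc_id, keywords in corpus
            if dice(seed_keywords, keywords) > threshold]

For example, build_document_set({"earthquake", "California"}, corpus) would return the article IDs forming one candidate document set.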
The average number of articles in a document set was 4.7, and the average number of sentences in a document was 12.9. Annotators read the articles in each set and checked whether any article diverged from the topic shared by the rest of the set. Such articles, which turned out to be very few, were excluded from the following experiments.
3 Task and annotator 
We have four tasks and three annotators (indicated by number). Annotators 1 and 2 did the same tasks, but annotator 3 did only part of them. All of them have college degrees; in particular, annotators 1 and 2 are native Japanese speakers who majored in linguistics at US universities.
Some examples (free summaries for one document set, and axes and table data for three sets, all translated into English) are shown in the appendix.
 
                      Annotator
Task                  1     2     3
Free style summary    100   40    -
Sent. extraction      100   20    100
Axis                  100   100   100
Table summary         100   100   -

Table 1. Task and annotator
4 Free style summarization 
The first task is free-style summarization. We calculated inter-annotator agreement using the word-vector metric adopted in the TSC evaluation (TSC homepage): the cosine of the tf*idf vectors of the words in the summaries. Most of the pairs (37 out of 40 sets) had values of 0.5 or more, which is much larger than the values of automatic systems measured against the human-made summaries in TSC-1 (around 0.4 for 10% summaries and 0.5 for 40% summaries). We can reasonably believe the summaries are very reliable.
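
For concreteness, this kind of metric can be sketched as follows, assuming summaries are pre-tokenized and idf values have been estimated from a background corpus; the exact weighting and tokenization used in the TSC evaluation may differ.

import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Map each word in one summary to its tf*idf weight."""
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(v1, v2):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(w1 * v2.get(word, 0.0) for word, w1 in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Agreement between two annotators' summaries of the same set:
# cosine(tfidf_vector(tokens_1, idf), tfidf_vector(tokens_2, idf))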
5 Sentence Extraction 
Now we will look at the summarization by sentence 
extraction. Annotators 1 and 3 conducted the task for 
the entire data, so we will compare the results of those 
two. We asked the annotators to extract about 20% of 
the sentences as a summary of each document set, but 
the actual numbers of extracted sentences are slightly 
different between the two. Table 2 shows the number of 
sentences selected by the two annotators with inter-
annotator agreement data. The number of sentences 
selected by both annotators (533) looks low, compared 
to the number of sentences selected by only one 
annotator (650 and 746). However, the chi-square test 
is 513.9, which means that these two results are 
strongly correlated (less than 0.0001%  chance). 
 
                 Annotator 1
Annotator 3      selected   not selected   Total
selected         533        746            1279
not selected     650        4050           4700
Total            1183       4796           5979

Table 2. Number of selected sentences
6 Axis 
The axis is based on an idea of McKeown et al. (2001). They defined 4 categories of document sets, based on the main topic of the set, for the purpose of applying different summarization strategies (they actually used two sub-systems); the categories are shown in Table 3.
 
Category and Description 
Single-Event (2) 
The documents center around one single event at one 
place and at roughly the same time, involving the 
same agents and actions 
Person-centered (10) 
The documents deal with one event concerning one 
person 
Multi-Event (7)
Several events, occurring at different places and times and usually with different protagonists, are reported together
Other (11)
The document set contains even more loosely related documents
 
Table 3. McKeown’s categories 
The number in parentheses for each category indicates the number of document sets in the DUC 2001 training data. As can be seen, the number of "person-centered" sets is quite high; we believe this is due to the pre-filtering of the DUC data. "Other" is also high, which suggests that more categories may be needed.
We created new categories based on our study of 
document sets (other than the 100 sets reported here). 
We defined 13 categories, shown in Table 4, for what 
we will call the axis of the document set.  
 
Single-person Multi-Person 
Single-location Multi-location 
Single-organization Multi-organization 
Single-facility Multi-facility 
Single-product Multi-product 
Single-event Multi-event 
Others  
 
Table 4. 13 Axes 
 
The axis combines two types of information: single vs. multi, and 6 kinds of named entities (person, location, organization, facility, product and event). "Single" means that all the articles are talking about a single event, person or other entity, whereas in a "multi" set the articles talk about multiple entities that might participate in similar types of events. We used 6 entity types, which are the major categories defined in MUC (Grishman and Sundheim 1996) and the ACE project (ACE homepage). For example, if a document set is about Einstein's biography, it should be tagged "single-person"; if a set is about earthquakes in California last year, it should be tagged "multi-event".
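
As a purely illustrative representation (the type and field names below are our own, not part of any system described here), an axis can be modeled as a pair of a single/multi flag and an entity type:

from dataclasses import dataclass
from enum import Enum

class EntityType(Enum):
    # the six MUC/ACE-style major categories from Table 4
    PERSON = "person"
    LOCATION = "location"
    ORGANIZATION = "organization"
    FACILITY = "facility"
    PRODUCT = "product"
    EVENT = "event"

@dataclass(frozen=True)
class Axis:
    single: bool             # True for "single-", False for "multi-"
    entity_type: EntityType

# A set about Einstein's biography -> single-person:
einstein = Axis(single=True, entity_type=EntityType.PERSON)
# (the residual "other" category would be represented separately)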
In order to demonstrate the validity of the categories, we tried to categorize the DUC 2001 multi-document training sets into our categories. Two people assigned one or two categories to each set. We allow more than one axis per document set, as some document sets inherently belong to more than one. Considering only the first choices, the inter-annotator agreement ratio is 80%; if we also count the second choices, the ratio rises to 93.3%. A minimal sketch of these two ways of counting agreement follows.
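
This sketch assumes each annotator's labels for a document set are given as an ordered list of one or two axes; the function name and representation are illustrative.

def agreement(labels_1, labels_2, include_second=False):
    """Fraction of document sets on which two annotators agree.
    labels_1, labels_2: lists (one entry per set) of ordered lists
    holding one or two axis labels."""
    hits = 0
    for a, b in zip(labels_1, labels_2):
        if include_second:
            hits += bool(set(a) & set(b))   # any shared axis counts
        else:
            hits += a[0] == b[0]            # first choices must match
    return hits / len(labels_1)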
We believe the categorization is practical. Table 5 shows the distribution of axis categories assigned to our 100 document sets by the three annotators. Note that annotators 1 and 2 assigned more than one axis to some sets, so their totals exceed 100.
All the categories except multi-facility are used by at least two annotators. Because the axis "other" is rarely used, the set of axes is empirically found to have quite good coverage.
The pairwise inter-annotator agreements among the three annotators are 55%, 61% and 67%. Although these ratios are lower than those on the DUC data (we believe this is because the DUC document sets are pre-filtered), the agreement is still high at 55-67%, given that there are 13 kinds of axis. Note that the chi-square test is not suitable for measuring these data because of data sparseness.
 
                  Annotator
Axis              1     2     3
s-event           36    32    37
s-facility        3     1     0
s-location        7     2     4
s-organization    6     12    11
s-person          12    13    12
s-product         14    14    8
m-event           15    30    8
m-facility        0     0     0
m-location        2     3     2
m-organization    9     12    13
m-person          2     7     1
m-product         2     2     0
Other             1     0     3

Table 5. Distribution of axes
 
There are 39 document sets for which all three annotators assigned the same axis, and only 7 document sets for which the three annotators assigned three different axes (no overlap at all). Even when different categories are tagged, sometimes all of them are understandable, and we can say that all are correct; for some document sets, more than one category is inherently correct. This result indicates that, for a large percentage of document sets, it is possible to assign an axis or axes. We believe that, for summarizing those document sets, knowing the axis before summarization could be quite helpful. We are seeking a method to automate the process of finding the axis(es).
7 Table 
A table is a good way to summarize a document set that deals with multiple events of the same type, a collection of similar events, or a chronological sequence of events. We asked the annotators to make a table for each document set. Table 6 shows some statistics of the created tables. The average number of columns is 3.47 for annotator 1 and 5.25 for annotator 2. Comparing the two annotators' tables, the percentages of columns in complete overlap (the column correspondence is one-to-one and the same information is collected in the columns) are 58% and 38%, respectively. The percentages of columns in partial overlap (the correspondence is not one-to-one, but the information in the columns overlaps between the tables) are 94% and 70%. We can see that annotator 1 made fewer columns than annotator 2, and that most of annotator 1's columns overlap columns made by annotator 2. The difference is thus probably due to annotator 2 making more detailed tables. (As this was the first such survey, it was not easy to write good instructions.) In other words, it may be that the most important information (which was turned into columns) was found by both annotators.
 
 
                         Annotator 1   Annotator 2
Ave. number of columns   3.47          5.25
Complete overlap         58%           38%
Overlap                  94%           70%

Table 6. Statistics of created tables
 
When we compared the tables created by the two annotators one by one, we categorized the results into 5 categories.
A) The two tables are completely the same.
B) The information in the tables is the same, but the way of segmenting information into columns is different. For example, one of the tables has a column "visiting activity (of a diplomat)" covering the visiting place, person and purpose, whereas the other table has separate columns "visiting place", "the person to meet" and "purpose of the visit".
C) One or two columns (in total) are missing from one or both tables. This means one of the tables has one or two fewer columns, and the information in those columns is not mentioned in the other table. As we can guess from Table 6, most of the missing columns were missing from annotator 1's tables.
D) More than two columns are missing from the tables.
E) The two tables are completely different in structure, because the table creators took different points of view.
Table 7 shows the result of this survey.  
 
Description                        Num. of sets
A) Same table                      8
B) Only segmentation differs       15
C) Missing one or two columns      34
D) Missing more than two columns   17
E) Completely different table      26
Total                              100

Table 7. Comparison of tables
 
There are only a small number of document sets (8) for which the annotators made exactly the same table. However, for more than half of the document sets, the tables created by the two annotators are quite similar (categories "same table", "only segmentation" and "missing one or two columns"). This is consistent with the result shown in Table 6: for many document sets, the tables by annotator 2 have additional information compared to the tables by annotator 1.
We also asked the annotators to judge whether each document set is suitable for summarization into a table, using three categories.
A) A table is natural for summarizing the document set
B) The information can be summarized in table format
C) A table is not suitable for summarizing the document set
The results for the two annotators are shown in Table 8. Annotators 1 and 2 judged 40 and 45 sets, respectively, to be natural for a table, 36 and 38 to be acceptable, and 24 and 17 to be unsuitable. It is an interesting result that for so many document sets (40-45%) a table is judged natural for summarizing; in comparison, only a smaller fraction (17-24%) is judged unsuitable. The relationship between the two annotators' judgments is also shown in Table 8. The chi-square statistic is 17.94, with a corresponding probability of 0.13%; this means that the two judges are highly correlated.
 
               Annotator 1
Annotator 2    A     B     C     Total
A              28    12    5     45
B              9     16    13    38
C              3     8     6     17
Total          40    36    24    100

Table 8. Suitability of table
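
As a check, the following minimal sketch computes the Pearson chi-square statistic for a contingency table; applied to the counts in Table 8 it reproduces the reported value of 17.94 (with 4 degrees of freedom). The function name and table representation are illustrative.

def chi_square(table):
    """Pearson chi-square statistic for a contingency table,
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# The 3x3 judgment counts from Table 8:
print(round(chi_square([[28, 12, 5], [9, 16, 13], [3, 8, 6]]), 2))  # 17.94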
 
8 Discussion 
We have reported a survey for multi-document summarization. We believe the results are encouraging for the pursuit of some novel strategies for multi-document summarization.
One of them is the notion of the axis. As we observed that for a large percentage of the document sets the axis can be tagged with some certainty, we might be able to build an automatic system to find it. Once the axis is correctly found, it could be useful for multi-document summarization. For example, if a set is "single-person", then the summary for the set should be centered on the person; this may suggest, for example, generating a summary of the 'biography' type (Mani 2001). If a document set is found to be "multi-event", then the summary should focus on the differences among the events.
The other finding of the experiment is that quite a large percentage of document sets can be summarized in table format. As this was a preliminary experiment, the instructions were incomplete, and we believe further study of this topic is necessary. In addition to setting guidelines for the degree of detail, the style of cell contents should be made more uniform; currently, cells contain words, phrases and sentences. We believe that with more careful annotation instructions, the comparison between different tables can be more systematic. In other words, a systematic evaluation may be possible; a hypothetical sketch of what such a comparison could look like follows.
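
The sketch below is purely illustrative: it scores pairs of columns by the word overlap of their cell contents and greedily aligns the columns of one table to the other. The representation (a table as a list of columns, each a list of cell strings) and the scoring rule are our assumptions, not a procedure used in this survey.

def column_overlap(col_a, col_b):
    """Dice-style word overlap between the cell contents of two columns."""
    words_a = {w for cell in col_a for w in cell.split()}
    words_b = {w for cell in col_b for w in cell.split()}
    if not words_a or not words_b:
        return 0.0
    return 2.0 * len(words_a & words_b) / (len(words_a) + len(words_b))

def align_columns(table_a, table_b):
    """Map each column index of table_a to its best match in table_b."""
    return {i: max(range(len(table_b)),
                   key=lambda j: column_overlap(table_a[i], table_b[j]))
            for i in range(len(table_a))}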
9 Future Work 
Obviously, the future work suggested by these results includes automatic methods for finding what the humans found in this experiment.
We have started work on finding the axis automatically by observing the distribution of named entities, words and phrases; a hypothetical sketch of the kind of heuristic we have in mind follows.
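
This sketch is an assumption of ours, not a completed system: it proposes "single-<type>" when one entity of some type runs through every article in the set, and "multi-<type>" when every article mentions that type but the entities differ. The input format and decision rules are illustrative.

from collections import defaultdict

def propose_axis(docs):
    """docs: one dict per article, mapping an entity type (e.g. "person")
    to the set of entity names of that type found in the article."""
    names_by_type = defaultdict(list)
    for doc in docs:
        for etype, names in doc.items():
            if names:
                names_by_type[etype].append(names)
    # only consider types that occur in every article of the set
    full = {t: s for t, s in names_by_type.items() if len(s) == len(docs)}
    for etype, name_sets in full.items():
        if set.intersection(*name_sets):   # one entity spans all articles
            return "single-" + etype
    if full:                               # a type recurs, but entities differ
        return "multi-" + next(iter(full))
    return "other"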
Once we can find the axis and the table-suitability of a given document set automatically, the next stage of research will involve using this information to select an appropriate way to summarize the set, i.e. a table summary, a sentence-extraction summary, or a summary involving rewriting. This will be extended to the case where the document set is created dynamically from a user's query; the type of query could be a helpful clue in selecting the way to summarize the retrieved document set.
The technology for summarizing a document set in table format is studied in Information Extraction. However, it has a hard limitation: the topic of the document set has to be known in advance, and the knowledge needed to build the table has to be created by hand, which usually takes a long time. There have been efforts to automate this knowledge creation (Riloff 1996; Yangarber 2000; Sudo et al. 2001); we hope to build a bridge between such automatic IE knowledge discovery and automatic summarization efforts.
 
10 Acknowledgements 
This research is supported by the Defense Advanced Research Projects Agency as part of the Translingual Information Detection, Extraction and Summarization (TIDES) program, under Grant N66001-001-1-8917 from the Space and Naval Warfare Systems Center, San Diego, and by the National Science Foundation under Grant IIS-0081962. This paper does not necessarily reflect the position of the U.S. Government. We would like to thank our colleagues at New York University, who provided useful suggestions and discussions, including Prof. Ralph Grishman, Mr. Kiyoshi Sudo and Mr. Yusuke Shinyama. We also thank the three annotators for doing this tedious job.

References 
(ACE homepage) http://www.nist.gov/speech/tests/ace/index.htm

(DUC homepage) http://www-nlpir.nist.gov/projects/duc/

(TSC homepage) http://lr-www.pi.titech.ac.jp/tsc/

(McKeown et al. 2001) K. R. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, M.-Y. Kan, B. Schiffman and S. Teufel, "Columbia Multi-Document Summarization: Approach and Evaluation", Proceedings of the Document Understanding Conference (DUC-2001), 2001

(Grishman and Sundheim 1996) R. Grishman and B. Sundheim, "Message Understanding Conference - 6: A Brief History", Proceedings of the 16th International Conference on Computational Linguistics (COLING '96), 1996

(Mani 2001) I. Mani, "Automatic Summarization", John Benjamins Publishing Company, 2001

(Riloff 1996) E. Riloff, "Automatically Generating Extraction Patterns from Untagged Text", Proceedings of the 13th National Conference on Artificial Intelligence, 1996

(Sudo et al. 2001) K. Sudo, S. Sekine and R. Grishman, "Automatic Pattern Acquisition for Japanese Information Extraction", Proceedings of Human Language Technologies (HLT '01), 2001

(Yangarber 2000) R. Yangarber, R. Grishman, P. Tapanainen and S. Huttunen, "Automatic Acquisition of Domain Knowledge for Information Extraction", Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), 2000
