Extracting Key Paragraph based on Topic and Event Detection 
-- Towards Multi-Document Summarization 
Fumiyo Fukumoto and Yoshimi Suzuki
Department of Computer Science and Media Engineering,
Yamanashi University
4-3-11 Takeda, Kofu 400-8511 Japan {fukumoto@skye.esb, ysuzuki@alps1.esi}.yamanashi.ac.jp
Abstract 
This paper proposes a method for extracting key
paragraphs for multi-document summarization based
on the distinction between a topic and an event. A topic
and an event are identified using a simple criterion
called domain dependency of words. The method
was tested on the TDT1 corpus, which was developed
by the TDT Pilot Study, and the results are
promising: they suggest that the idea of domain
dependency of words can be employed effectively.
1 Introduction 
As the volume of online documents has drastically
increased, summarization techniques have become
very important in IR and NLP studies. Most of the
summarization work has focused on a single document.
This paper focuses on multi-document summarization
of broadcast news documents about the
same topic. One of the major problems in the
multi-document summarization task is how to identify
differences and similarities across documents. This can
be interpreted as a question of how to make a clear
distinction between an event and a topic in documents.
Here, an event is the subject of a document
itself, i.e. what a writer wants to express, in other words,
notions of who, what, where, when, why and how in
a document. On the other hand, a topic in this paper
is some unique thing that happens at some specific
time and place, together with its unavoidable consequences.
It becomes the background shared among documents. For
example, in the documents on 'Kobe Japan quake', the
events include early reports of damage, location and
nature of the quake, rescue efforts, consequences of the
quake, and on-site reports, while the topic is Kobe
Japan quake. The well-known experience from
IR, that notions of who, what, where, when, why
and how may not make a great contribution to the
topic detection and tracking task (Allan and Papka,
1998), is explained by this fact, i.e. a topic and an event are
different from each other 1.
1 Some topic words can also be event words. For instance,
in the document shown in Figure 1, 'Japan' and 'quake' are
topic words and also event words in the document. However,
we regarded these words as a topic, i.e. not as an event.
In this paper, we propose a method for extracting
key paragraphs for multi-document summarization
based on the distinction between a topic and an
event. We use a simple criterion called domain
dependency of words as a solution and present how the
idea of domain dependency of words can be utilized
effectively to identify a topic and an event, and thus
allow multi-document summarization.
The basic idea of our approach is that whether a
word appearing in a document is a topic (an event)
or not depends on the domain to which the document
belongs. Let us take a look at the following
document from the TDT1 corpus.
(1-2) Two Americans known dead in Japan quake
1. The number of [Americans] known to have been
killed in Tuesday's earthquake in Japan has risen to
two, the [State] [Department] said Thursday.
2. The first was named Wednesday as Voni Lynn
Wong, a teacher from California. [State] [Department]
spokeswoman Christine Shelly declined
to name the second, saying formalities of notifying
the family had not been completed.
3. With the death toll still mounting, at least 4,000
people were killed in the earthquake which devastated
the Japanese city of Kobe.
4. [U.S.] diplomats were trying to locate the several
thousand-strong [U.S.] community in the area, and
some [Americans] who had been made homeless
were found shelter in the [U.S.] consulate there,
which was only lightly damaged in the quake.
5. Shelly said an emergency [State] [Department]
telephone number in Washington to provide information
about private [American] citizens in Japan
had received over 6,000 calls, more than half of them
seeking direct assistance.
6. The Pentagon has agreed to send 57,000 blankets
to Japan and [U.S.] ambassador to Tokyo Walter
Mondale has donated a $25,000 discretionary fund
for emergencies to the Japanese Red Cross, Shelly
said.
7. Japan has also agreed to a visit by a team of [U.S.]
experts headed by Richard Witt, national director
of the Federal Emergency Management Agency.
Figure 1: The document titled 'Two Americans
known dead in Japan quake'
Figure 1 is the document whose topic is 'Kobe Japan
quake', and the subject of the document (event
words) is 'Two Americans known dead in Japan
quake'. Underlined words denote a topic, and the
words marked with '[ ]' are events. '1...7' in Figure
1 are paragraph ids. Like Luhn's technique of keyword
extraction, our method assumes that an event
associated with a document appears throughout
paragraphs (Luhn, 1958), but a topic does not. This is
because an event is the subject of a document itself,
while a topic is an event, along with all directly
related events. In Figure 1, the event words 'Americans'
and 'U.S.', for instance, appear across paragraphs,
while a topic word, for example 'Kobe', appears only
in the third paragraph. Let us consider further a broad
coverage domain which consists of a small number of
sample news documents about the same topic, 'Kobe
Japan quake'. Figures 2 and 3 are documents on
'Kobe Japan quake'.
(1-1) Quake collapses buildings in central Japan
1. At least two people died and dozens were injured
when a powerful earthquake rolled through central
Japan Tuesday morning, collapsing buildings and
setting off fires in the cities of Kobe and Osaka.
2. The Japan Meteorological Agency said the
earthquake, which measured 7.2 on the open-ended
Richter scale, rumbled across Honshu Island from
the Pacific Ocean to the Japan Sea.
Figure 2: The document titled 'Quake collapses
buildings in central Japan'
(1-3) Kobe quake leaves questions about medical system
1. The earthquake that devastated Kobe in January
raised serious questions about the efficiency of
Japan's emergency medical system, a government
report released on Tuesday said.
2. 'The earthquake exposed many issues in terms
of quantity, quality, promptness and efficiency of
Japan's medical care in time of disaster,' the report
on health and welfare said.
Figure 3: The document titled 'Kobe quake leaves
questions about medical system'
Underlined words in Figures 2 and 3 show the topic
of these documents. In these two documents, 'Kobe',
which is a topic, appears in every document, while
'Americans' and 'U.S.', which are events of the document
shown in Figure 1, do not appear. Our technique
for making the distinction between a topic and
an event explicitly exploits this feature of the domain
dependency of words: how strongly a word features
a given set of data.
The rest of the paper is organized as follows.
The next section describes the domain dependency of
words, which is used to identify a topic and an event
in broadcast news documents. We then present a
method for extracting topic and event words, and
describe a paragraph-based summarization algorithm
using the result of topic and event extraction. Finally,
we report some experiments using the TDT1
corpus, which has been developed by the TDT (Topic
Detection and Tracking) Pilot Study (Allan and
Carbonell, 1998), with a discussion of evaluation.
2 Domain Dependency of Words 
The domain dependency of words, i.e. how strongly
a word features a given set of data (documents),
contributes to event extraction, as we previously
reported (Fukumoto et al., 1997). In that study, we
hypothesized that the articles from the Wall Street
Journal corpus can be structured by three levels, i.e.
Domain, Article and Paragraph. If a word is an event
in a given article, it satisfies two conditions: (1)
The dispersion value of the word at the Paragraph
level is smaller than that at the Article level, since the
word appears throughout paragraphs at the Paragraph
level rather than articles at the Article level.
(2) The dispersion value of the word at the Article
level is smaller than that at the Domain level, as the word
appears across articles rather than domains.
However, there are two problems in adapting it to
the multi-document summarization task. The first is
that the method extracts only events in the document,
because the goal of that study was to summarize
a single document; thus it gives no answer to
the question of how to identify differences and
similarities across documents. The second is that the
performance of the method greatly depends on the
structure of the given data itself. As with the Wall Street
Journal corpus, (i) if the given data can be structured
by three levels, Paragraph, Article and Domain, each
of which consists of several paragraphs, articles and
domains, respectively, and (ii) if Domain consists of
different subject domains, such as 'aerospace',
'environment' and 'stock market', the method
achieves satisfactory accuracy. However, there is
no guarantee that such an appropriate structure can be
built from a given set of documents in the multi-document
summarization task.
The purpose of this paper is to define domain
dependency of words for a number of sample documents
about the same topic, and thus for the
multi-document summarization task. Figure 4 illustrates
the structure of broadcast news documents which
have been developed by the TDT (Topic Detection
and Tracking) Pilot Study (Allan and Carbonell,
1998). It consists of two levels, Paragraph and Document.
At the Document level, there is a small number
of sample news documents about the same topic.
These documents are arranged in chronological order,
such as '(1-1) Quake collapses buildings in central
Japan (Figure 2)', '(1-2) Two Americans known
dead in Japan quake (Figure 1)' and '(1-3) Kobe
quake leaves questions about medical system
(Figure 3)'. A particular document consists of several
paragraphs. We call this the Paragraph level. Let each word
within a document be an event, a topic, or neither
(we call the last a general word).
[Figure 4 diagram: the Document level holds documents i = 1, 2, ..., m, i.e. (1-1), (1-2), (1-3), ...; the Paragraph level holds paragraphs j = 1, 2, ..., n of document (1-2). Legend: O: topic word, △: event word, x: general word.]
(1-1) 'Quake collapses buildings in central Japan'
(1-2) 'Two Americans known dead in Japan quake'
(1-3) 'Kobe quake leaves questions about medical system'
Figure 4: The structure of broadcast news documents
(event extraction)
Given the structure shown in Figure 4, how can we
identify every word in document (1-2) as an event,
a topic or a general word? Our method assumes that
an event associated with a document appears across
paragraphs, but a topic word does not. Then, we use
domain dependency of words to extract event and
topic words in document (1-2). Domain dependency
of words is a measure showing how strongly each word
features a given set of data.
In Figure 4, let 'O', '△' and 'x' denote a topic,
an event and a general word in document (1-2),
respectively. We recall the example shown in Figure 1.
An event word '△', for instance 'U.S.', appears across paragraphs.
However, at the Document level, '△' appears frequently
in document (1-2) itself. On the basis of this
example, we hypothesize that if word i is an event,
it satisfies the following condition:
[1] Word i greatly depends on a particular
document in the Document level rather
than a particular paragraph in the Paragraph
level.
Next, we turn to identifying each remaining word as
a topic or a general word. In Figure 5, a topic of
documents (1-1) to (1-3), for instance 'Kobe', appears
in a particular paragraph in each of the levels
Paragraph1, Paragraph2 and Paragraph3. Here, (1-1), (1-2)
and (1-3) correspond to Paragraph1, Paragraph2
and Paragraph3, respectively. On the other hand,
at the Document level, a topic frequently appears across
documents. Then, we hypothesize that if word i is a
topic, it satisfies the following condition:
[2] Word i greatly depends on a particular
paragraph in each Paragraph level
rather than a particular document in
the Document level.
[Figure 5 diagram: the Document level holds documents i = 1, 2, 3, ..., m, i.e. (1-1), (1-2), (1-3), ...; the Paragraph 1, Paragraph 2 and Paragraph 3 levels hold paragraphs j = 1, 2, ..., n of each document. Legend: O: topic word, x: general word.]
Figure 5: The structure of broadcast news documents
(topic extraction)
3 Topic and Event Extraction 
We hypothesized that the domain dependency of
words is a key clue to making a distinction between
a topic and an event. This can be broken down into
two observations: (i) whether a word appears across
paragraphs (documents), (ii) whether or not a word
appears frequently. We represented the former by
using the dispersion value, and the latter by the deviation
value. Topic and event words are extracted by using
these values.
The first step to extract topic and event words is
to assign a weight to each individual word in a document.
We applied TF*IDF at each level of
Document and Paragraph, i.e. Paragraph1, Paragraph2
and Paragraph3.
$Wd_{it} = TFd_{it} \times \log \frac{N}{Nd_t}$ (1)
Wdit in formula (1) is TF*IDF of term t in the i-th
document. In a similar way, Wpit denotes TF*IDF
of the term t in the i-th paragraph. TFdit in (1)
denotes term frequency of t in the i-th document. N
is the number of documents and Ndt is the number
of documents where t occurs. The second step is to
calculate domain dependency of words. We defined
it by using formula (2) and (3).
$DispD_t = \sqrt{\frac{\sum_{i=1}^{m} (Wd_{it} - mean_t)^2}{m}}$ (2)
$Devd_{it} = \frac{Wd_{it} - mean_t}{DispD_t} \times 10 + 50$ (3)
Formula (2) is the dispersion value of term t at the level
of Document, which consists of m documents, and
denotes how frequently t appears across documents.
In a similar way, DispPt denotes the dispersion of term
t at the level of Paragraph. Formula (3) is the deviation
value of t in the i-th document and denotes how
frequently it appears in a particular document, the
i-th document. Devpit is the deviation of term t in the
i-th paragraph. In (2) and (3), meant is the mean
of the total TF*IDF values of term t at the level of
Document.
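Formulas (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, a unit stands for either a document or a paragraph given as a list of words, the natural logarithm is assumed because the paper does not state the base, and returning 50 when the dispersion is zero is our own convention.

```python
import math

def tfidf_weights(units, term):
    """Formula (1): TF*IDF of `term` in each unit (a unit is a list of words)."""
    n_units = len(units)
    n_with_term = sum(1 for unit in units if term in unit)
    if n_with_term == 0:
        return [0.0] * n_units
    idf = math.log(n_units / n_with_term)  # assumption: natural logarithm
    return [unit.count(term) * idf for unit in units]

def dispersion(weights):
    """Formula (2): root of the mean squared distance of the weights from their mean."""
    mean = sum(weights) / len(weights)
    return math.sqrt(sum((w - mean) ** 2 for w in weights) / len(weights))

def deviation(weights, i):
    """Formula (3): deviation value of the i-th weight, centred on 50."""
    mean = sum(weights) / len(weights)
    disp = dispersion(weights)
    if disp == 0.0:
        return 50.0  # assumption: all weights are equal, so the term sits at the mean
    return (weights[i] - mean) / disp * 10 + 50
```

Applying the same three functions to the paragraphs of one document, instead of to the documents of a set, would give Wpit, DispPt and Devpit.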
The last step is to extract a topic and an event
using formula (2) and (3). We recall that if t is an
event, it satisfies [1] described in section 2. This is
shown by using formula (4) and (5).
$DispP_t < DispD_t$ (4)
for all $p_j \in d_i$, $Devp_{jt} < Devd_{it}$ (5)
Formula (4) shows that t frequently appears across
paragraphs rather than documents. In formula (5),
di is the i-th document and consists of
n paragraphs (see Figure 4). pj is an element of
di. (5) shows that t frequently appears in the i-th
document di rather than paragraphs pj (1 ≤ j ≤ n).
On the other hand, if t satisfies formula (6) and
(7), then we propose t as a topic.
$DispP_t > DispD_t$ (6)
for all $d_i \in D$, there exists $p_j$ such that $Devp_{jt} \geq Devd_{it}$ (7)
In formula (7), D consists of m documents
(see Figure 5). (7) denotes that t frequently
appears in the particular paragraph pj rather than
the document di which includes pj.
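Conditions (4) to (7) amount to two small predicates. The sketch below is a hedged illustration: the function names and argument layout are hypothetical, and passing a single paragraph-level dispersion value to the topic test is a simplifying assumption, since the paper does not spell out how DispPt is combined over several documents.

```python
def is_event(disp_p, disp_d, dev_p_list, dev_d):
    """Formulas (4) and (5): lower dispersion at the Paragraph level, and every
    paragraph deviation below the deviation of the enclosing document."""
    return disp_p < disp_d and all(dev_p < dev_d for dev_p in dev_p_list)

def is_topic(disp_p, disp_d, per_document):
    """Formulas (6) and (7). `per_document` is a list of
    (paragraph deviations, document deviation) pairs, one pair per document."""
    return disp_p > disp_d and all(
        any(dev_p >= dev_d for dev_p in dev_ps)
        for dev_ps, dev_d in per_document
    )
```

With the values reported for 'earthquake' in Table 3 (DispPt 12.3, DispDt 10.3, Devp 53.5, Devd 50.0) the topic test holds, and with those for 'emergency' (0.9, 1.5, 50.0, 74.7) the event test holds.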
4 Key Paragraph Extraction 
The summarization task in this paper is paragraph-based
extraction (Stein et al., 1999). Basically, paragraphs
which include not only event words but also
topic words are considered to be significant paragraphs.
The basic algorithm works as follows:
1. For each document, extract topic and event
words.
2. Determine the paragraph weights for all paragraphs
in the documents:
(a) Compute the sum of topic weights over the
total number of topic words for each paragraph.
(b) Compute the sum of event weights over the
total number of event words for each paragraph.
Topic and event weights are calculated
by using Devdit in formula (3). Here, t is a
topic or an event and i is the i-th document
in the documents.
(c) Compute the sum of (a) and (b) for each
paragraph.
3. Sort the paragraphs according to their weights
and extract the N highest weighted paragraphs
in the documents in order to yield a summarization
of the documents.
4. When paragraphs have the same weight, compute the
sum of all the topic and event word weights and
select the paragraph whose sum is higher than
the others.
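The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the names are hypothetical, "the sum of weights over the total number of words" is read as an average, and one weight table is used for all paragraphs, whereas in the paper the weights Devdit depend on the document i.

```python
def paragraph_score(words, topic_w, event_w):
    """Steps 2(a)-(c): average topic weight plus average event weight.
    Also returns the plain sum of all weights, used to break ties (step 4)."""
    topic = [topic_w[w] for w in words if w in topic_w]
    event = [event_w[w] for w in words if w in event_w]
    avg_topic = sum(topic) / len(topic) if topic else 0.0
    avg_event = sum(event) / len(event) if event else 0.0
    return avg_topic + avg_event, sum(topic) + sum(event)

def key_paragraphs(paragraphs, topic_w, event_w, n):
    """Steps 3 and 4: indices of the n highest weighted paragraphs."""
    scored = [(paragraph_score(p, topic_w, event_w), idx)
              for idx, p in enumerate(paragraphs)]
    # Sort by weight, then by the tie-break sum, then by original position.
    scored.sort(key=lambda item: (-item[0][0], -item[0][1], item[1]))
    return [idx for _, idx in scored[:n]]
```

The final sort key on the paragraph index only makes the output deterministic when both the weight and the tie-break sum are equal; the paper does not say what happens in that case.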
5 Experiments 
Evaluation of key paragraph extraction from
multiple documents is difficult. First, we have not found
an existing collection of summaries of multiple documents.
Second, the manual effort needed to judge
system output is far more extensive than for single
document summarization. Consequently, we focused
on the TDT1 corpus. This is because (i) events have
been defined to support the TDT study effort, and (ii)
it was completely annotated with respect to these
events (Allan and Carbonell, 1997). Therefore, we
do not need manual effort to collect documents
which discuss the target event.
We report the results of three experiments. The
first experiment, Event Extraction, is concerned with
the event extraction technique. In the second experiment,
Tracking Task, we applied the extracted topics
to the tracking task (Allan and Carbonell, 1998).
The third experiment, Key Paragraph Extraction, is
conducted to evaluate how effectively the extracted topic and
event words can be used to extract key
paragraphs.
5.1 Data 
The TDT1 corpus comprises a set of 15,863 documents
that includes both newswire (Reuters), 7,965 documents,
and manual transcriptions of broadcast
news speech (CNN), 7,898 documents. A set of 25
target events was defined 2.
All documents were tagged by the tagger (Brill,
1992). We used nouns in the documents.
2 http://morph.ldc.upenn.edu/TDT
5.2 Event Extraction 
We collected 300 documents from the TDT1 corpus,
each of which is annotated with respect to one of 25
events. The result is shown in Table 1.
In Table 1, 'Event type' illustrates the target events
defined by the TDT Pilot Study. 'Doc' denotes the
number of documents. 'Rec' (Recall) is the number
of correct events divided by the total number
of events which are selected by a human, and 'Prec'
(Precision) stands for the number of correct events
divided by the number of events which are selected
by our method. The denominator of 'Rec' is determined by
a human judge. 'Accuracy' in Table 1 is the total
average ratio.
In Table 1, recall and precision values range from
55.0/47.0 to 83.3/84.2, the average being 71.0/72.2.
The worst result of recall and precision was when
the event type was 'Serbs violate Bihac' (55.0/59.3). We
currently hypothesize that this drop in accuracy is
due to the fact that some documents violate our
assumption about an event. Examining the documents
whose event type is 'Serbs violate Bihac', 3 (one
from CNN and two from Reuters) out of 16 documents
discuss the same event, i.e. 'Bosnian
Muslim enclave hit by heavy shelling'. As a result,
the event appears across these three documents.
Future research will shed more light on that.
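The recall and precision measures used in Table 1 can be made concrete with a short sketch. The function name and the example word sets are hypothetical and are not taken from the experiments; the sketch only assumes that both the system output and the human selection are sets of event words.

```python
def recall_precision(extracted, gold):
    """Recall: correct / number selected by the human judge.
    Precision: correct / number selected by the method."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision

# Hypothetical example: 2 of the 3 human-selected words are found,
# and 2 of the 4 extracted words are correct.
rec, prec = recall_precision(
    {"emergency", "area", "worker", "fire"},
    {"emergency", "worker", "rescue"},
)
```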
5.3 Tracking Task 
The tracking task in the TDT project starts from
a few sample documents and finds all subsequent
documents that discuss the same event (Allan and
Carbonell, 1998), (Carbonell et al., 1999). The corpus
is divided into two parts: a training set and a test
set. Each of the documents is flagged as to whether
it discusses the target event, and these flags ('YES',
'NO') are the only information used for training the
system to correctly classify the target event. We applied
the extracted topic to the tracking task under
these conditions. The basic algorithm used in the
experiment is as follows:
1. Create a single document Stp and represent it as
a term vector
For the results of topic extraction, all the documents
that belong to the same topic are bundled
into a single document Stp, which is represented by
a term vector as follows:
$S_{tp} = (t_{tp1}, t_{tp2}, \ldots, t_{tpn})^T$ s.t. $t_{tpj} = \begin{cases} f(t_{tpj}) & \text{if } t_{tpj} \text{ is a topic of } S_{tp} \\ 0 & \text{otherwise} \end{cases}$
f(w) denotes the term frequency of word w.
2. Represent other training and test documents as
term vectors
Let S1, ..., Sm be all the other training documents
(where m is the number of training documents
which do not belong to the target
event) and Sx be a test document which should
be classified as to whether or not it discusses the
target event. S1, ..., Sm and Sx are represented
by term vectors as follows:
$S_i = (t_{i1}, t_{i2}, \ldots, t_{in})^T$ s.t. $t_{ij} = \begin{cases} f(t_{ij}) & \text{if } t_{ij} \ (1 \leq i \leq m) \text{ appears in } S_i \text{ and is not a topic of } S_{tp} \\ 0 & \text{otherwise} \end{cases}$
$S_x = (t_{x1}, t_{x2}, \ldots, t_{xn})^T$ s.t. $t_{xj} = \begin{cases} f(t_{xj}) & \text{if } t_{xj} \text{ appears in } S_x \\ 0 & \text{otherwise} \end{cases}$
3. Compute the similarity between a training document
and a test document
Given the vector representations of documents S1,
..., Sm, Stp and Sx, the similarity between a
document Si (i = 1, ..., m, tp) and the test document
Sx is obtained by using formula
(8), i.e. the inner product of their normalized
vectors.
$Sim(S_i, S_x) = \frac{S_i \cdot S_x}{|S_i| \, |S_x|}$ (8)
The greater the value of Sim(Si, Sx) is, the
more similar Si and Sx are. If the similarity
value between the test document Sx and the
document Stp is the largest among all the other
pairs of documents, i.e. (S1, Sx), ..., (Sm, Sx),
Sx is judged to be a document that discusses
the target event.
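The three steps above can be sketched with term-frequency counters standing in for the term vectors. This is an illustrative sketch only: the function names are hypothetical, documents are assumed to be lists of nouns, and the extra check that the similarity to Stp is greater than zero is our own guard, which the paper does not state.

```python
import math
from collections import Counter

def cosine(u, v):
    """Formula (8): inner product of the two normalized term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def discusses_target(positive_docs, topic_words, negative_docs, test_doc):
    # Step 1: bundle the positive documents into Stp, keeping topic words only.
    s_tp = Counter(w for doc in positive_docs for w in doc if w in topic_words)
    # Step 2: other training documents, with the topic words of Stp removed.
    others = [Counter(w for w in doc if w not in topic_words)
              for doc in negative_docs]
    s_x = Counter(test_doc)
    # Step 3: YES only if Stp is the most similar document to the test document.
    target_sim = cosine(s_tp, s_x)
    return target_sim > 0 and all(target_sim > cosine(s, s_x) for s in others)
```

Counters keep only the non-zero components, so terms with value 0 in the vectors of steps 1 and 2 are simply absent.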
We used the standard TDT evaluation measure 3.
Table 2 illustrates the result.
Table 2: The results of tracking task
Nt    %Miss  %F/A  F1    %Rec  %Prec
1     32.5   0.16  0.68  67.5  70.0
2     23.7   0.06  0.80  76.3  87.8
4     23.1   0.05  0.81  76.9  90.1
8     12.0   0.08  0.87  88.0  91.4
16    13.7   0.06  0.89  86.3  93.6
Avg   21.0   0.08  0.76  79.0  86.6
In Table 2, 'Nt' denotes the number of positive training
documents, where Nt takes on values 1, 2, 4, 8
and 16. 'Miss' means Miss rate, which is the ratio
of the documents that were judged as YES but
were not evaluated as YES for the run in question.
'F/A' shows false alarm rate and 'F1' is a measure
that balances recall and precision. 'Rec' denotes the
ratio of the documents judged YES that were also
evaluated as YES, and 'Prec' is the percent of the
documents that were evaluated as YES which correspond
to documents actually judged as YES.
3 http://www.nist.gov/speech/tdt98.htm
Table 1: The results of event words extraction
Event type              Doc  Avg Rec/Avg Prec
Aldrich Ames            8    61.7/70.5
Carlos the Jackal       8    60.7/73.3
Carter in Bosnia        16   76.3/79.1
Cessna on White House   8    65.7/80.0
Clinic Murders          16   75.9/80.0
Comet into Jupiter      16   65.2/61.9
Cuban riot in Panama    2    65.2/73.9
Death of Kim Jong       16   83.3/71.4
DNA in OJ trial         16   78.7/72.9
Haiti ousts observers   8    62.0/74.0
Hall's copter           16   78.5/75.0
Humble, TX, flooding    16   80.4/70.2
Justice-to-be Breyer    8    75.9/72.2
Karrigan/Harding        2    64.7/55.5
Kobe Japan quake        16   74.5/75.0
Lost in Iraq            16   75.7/68.8
NYC Subway bombing      16   68.0/84.2
OK-City bombing         16   78.8/47.0
Pentium chip flaw       4    81.1/72.9
Quayle lung clot        8    63.6/74.4
Serbians down F-16      16   78.6/75.0
Serbs violate Bihac     16   55.0/59.3
Shannon Faulker         4    71.4/82.4
USAir 427 crash         16   72.6/86.3
WTC Bombing trial       16   62.6/70.1
Accuracy                     71.0/72.2
Table 2 shows that more training data helps the
performance, as the best result was when we used
Nt = 16.
Table 3 illustrates the extracted topic and event
words in a sample document. The topic is 'Kobe
Japan quake' and the number of positive training
documents is 4. 'Devp1t', 'Devd1t', 'DispPt' and
'DispDt' denote values calculated by using formula
(2) and (3).
Table 3: Topic and event words in 'Kobe Japan
quake'
Topic word   Devp1t  Devd1t  DispPt  DispDt
earthquake   53.5    50.0    12.3    10.3
Japan        69.8    50.0    13.3    9.8
Kobe         56.6    50.0    8.6     6.4
fire         57.0    46.4    2.3     1.5
Event word   Devp1t  Devd1t  DispPt  DispDt
emergency    50.0    74.7    0.9     1.5
area         40.6    50.0    0.6     1.0
worker       50.0    66.1    0.4     1.0
rescue       43.3    50.0    2.3     3.4
In Table 3, 'Event' denotes event words in the first
document in chronological order from Nt = 4, and
the title of the document is 'Emergency Work Continues
After Earthquake in Japan'. Table 3 clearly
demonstrates that the criterion, domain dependency
of words, is effectively employed.
Figure 6 illustrates the DET (Detection Evaluation
Tradeoff) curves for sample event runs (the event type
is 'Comet into Jupiter') at several values of Nt.
[Figure 6 plot: DET curves for the 'Comet into Jupiter' event at several values of Nt; x-axis: False Alarm Probability (in %).]
Figure 6: DET curve for a sample tracking run
Overall, the curves also show that more training
helps the performance, while there is no significant
difference among Nt = 2, 4 and 8.
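The measures reported in Table 2 can all be derived from four counts per run. The sketch below is illustrative, with hypothetical names; it assumes that F1 is the usual harmonic mean of recall and precision, which the paper describes only as a measure that balances the two.

```python
def ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def tracking_measures(hits, misses, false_alarms, correct_rejections):
    """hits: judged YES and evaluated YES; misses: judged YES but not evaluated YES;
    false_alarms: judged NO but evaluated YES; correct_rejections: the rest."""
    recall = ratio(hits, hits + misses)
    precision = ratio(hits, hits + false_alarms)
    return {
        "miss": ratio(misses, hits + misses),
        "false_alarm": ratio(false_alarms, false_alarms + correct_rejections),
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }
```

By these definitions the miss rate and recall always sum to one, which agrees with the %Miss and %Rec columns of Table 2 (for example 32.5 and 67.5 at Nt = 1).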
5.4 Key Paragraph Extraction 
We used 4 different sets as test data. The sets consist
of 2, 4, 8 and 16 documents, respectively. For each set, we
36 
I 
I 
5.2 Event Extraction 
We collected 300 docmnents from the TDT1 corpus, 
each of which is annotated with respect to one of 25 
events.' The result is shown in Table 1. 
In Table 1.. 'Event type' illustrates the target events 
defined by the TDT Pilot Study. ~Doc' denotes the 
number of documents. 'Rec' (Recall) is the nmn- 
bet of correct events divided by the total number 
of events which are selected by a humaa, and :Pree ~ 
(Precision) stands for the number of correct-events 
divided by the number of events which are selected 
by our method. The denominator 'Rec: is made by 
a human judge. 'Accuracy' in Table 1 is the total 
average ratio. 
In Table 1, recall and precision values range, from 
55.0/47.0 to 83.3/84.2, the average being 71.0/72.2. 
The worst result of recall and precision was when 
event type was 'Serbs violate Bihac' (55.0/59.3). We 
currently hypothesize that this drop of accuracy is 
due to the fact that some documents are against our 
assumption of an event. Examining the ctocuments 
whose event type is 'Serbs violate Bihac', 3 ( one 
from CNN and two from Reuters) out of 16 docu- 
ments has discussed the same evefit, i.e. 'Bosnian 
Muslim enclave hit by heavy shelling'. As a result, 
the event appears across these three documents. Fu- 
ture research will shed more light on that. 
5.3 Tracking Task 
Tracking task in the TDT project is starting from 
a few sample documents and finding all subsequent 
documents that discuss the same event (Allan and 
Carbonell, 1998), (Carbonell et al., 1999). The cor- 
pus is divided into two parts: training set and test 
~et. Each of the documents is flagged as to whether 
it discusses the target event, and these flags ('YES', 
'NO') are the only information used tbr training the 
system to correctly classiC" the target event. We ap- 
plied the extracted topic to the tracking task under 
these conditions. The basic algorithm used in the 
• experiment is as follows: 
1. Create a single document Stp and represent it as 
".a term vector 
For the results of topic extraction, all the docu- 
ments that belong to the sanae topic are bundled 
into a single document S,p and represent it by 
a term vector as follows: 
~tp -~ 
ttpl 
ttp2 
• s.t. ttpj = 
ttpn 
{ /(t,pj) ift,pj is atoplc 
of Stp 
0 otherwise 
f(w) denotes term frequency of word w. 
2. Represent other training and test documents as 
term vectors 
35 
Let $1, ---, S,,, be all the other training docu- 
ments (where m is the number of training doc- 
uments which does not belong to the target 
event) and Sx be a test document which should 
be classified as to whether or not it discusses the 
target event. $1, "- -, Sm and Sx are represented " 
by term vectors as follows: 
~ = 
'" { 
s.t. llj = 
f(t~j) ift 0 (1 < i <m) 
appears in S~ and 
not be a topic of ,5"tp 
(I otherwise 
S= = 
tzl 
t~2 
• S.t. t~j = f(t.r.j) ift~j ~ppears i, S, 0 otherwise 
3. Compute the similarity between a training docu- 
ment and a test document 
Given a vector representation of documents SI, 
• .., S.,, Stp and S=; a similarity between two 
documents Si (1 < i < m, tp) mad the test doc- 
ument S= would be obtained by using formula 
(8), i.e. the inner product of their normalized 
vectors. 
Si • S= Sim(Si, S~) 
= I Si II S~ I (S) 
The greater the value of Sim(Si,S,) is, the 
more similar 5"/ and Sz are. If the similarity 
value between the test document S, and the 
document Stp is largest among all the other 
pairs of documents, i,e. ($1, Sx), " ", (Sin, S=), 
S= is judged to be a document that discusses 
the target event. 
We used the standard TDT evaluation measure 3 
Table 2 illustrates the result• 
Table 2: The results of tracking task 
Nt %Miss %F/A F1 %Rec %Prec 
1 32.5 0.16 0.68 67.5 70:0 
2 23.7 0.06 0.80 76.3 87.8 
4 23.1 0.05 0.81 76.9 90.1 
8 12.0 0.08 0.87 88.0 91.4 
16 13.7 0.06 0.89 86.3 93.6 
"Avg 21.0 0.08 0.76 79.0 86.6 
In Table 2, 'Nt' denotes the number of positive train- 
ing documents where A~ takes on values 1, 2, 4, 8 
z http://www.nist.gov/speech/tdt98.htrn 
Table 1: The results of event words extraction 
I 
I Event type Doe Avg Rec/Avg Prec Event type Doc Avg Rec/Avg Prec 
.. Aldrich Ames 8 61.7/70.5 Karrigan/Harding 2 64.7/55.5 . II 
Carlos the Jackal 8 60.7/73.3 Kobe Japan quake 16 74.5/75.0 
Carter in Bosnia " 1-6 76.3/79.1 Lost in Iraq 16 75.7/68.8 J 
Cessna on White House 8 65.7/80.0 NYC Subway bombing 16 68.0/84.2 
clinic Murders 16 75.9/80.0 OK-City bombing 16 78.8/47.0 i 
Comet into Jupiter 16 6~o.2/61.9 Pentium chip flaw 4 81.1/72.9 II 
Cuban riot in Panama 2 65.2/73.9 Quayle lung clot 8 63.6/74.4 
Death of Kim Jong 16 83.3/71.4 Serbians down F-16 16 78.6/75.0 *l 
DNA in OJ trial 16 78.7/72.9 Serbs violate Bihac 16 55.0/59.3 i 
Haiti ousts observers 8 62.0/74.0 Shannon Faulker 4 71.4/82.4 l 
Hall's copter 16 78.5/75.0 USAir 427 crash 16 72.6/86.3 
Humble, TX, flooding 16 ...... 80.21/70.2 WTC Bombing trial 16 62.6/70.1 
Justice-to-be Breyer 8 75.9/72.2 I! 
Accuracy 71.0/72.2 ! 
and 16. 'Miss' means miss rate, which is the ratio
of the documents that were judged as YES but
not evaluated as YES for the run in question.
'F/A' shows false alarm rate and 'F1' is a measure
that balances recall and precision. 'Rec' denotes the
ratio of the documents judged YES that were also
evaluated as YES, and 'Prec' is the percent of the
documents that were evaluated as YES which
correspond to documents actually judged as YES.
Table 2 shows that more training data helps the
performance, as the best result was when we used
Nt = 16.
Table 3 illustrates the extracted topic and event
words in a sample document. The topic is 'Kobe
Japan quake' and the number of positive training
documents is 4. 'DevPt', 'DevDt', 'DispPt' and
'DispDt' denote values calculated by using formula
(2) and (3).

Table 3: Topic and event words in 'Kobe Japan quake'
Topic word   DevPt  DevDt  DispPt  DispDt
earthquake    53.5   50.0    12.3    10.3
Japan         69.8   50.0    13.3     9.8
Kobe          56.6   50.0     8.6     6.4
fire          57.0   46.4     2.3     1.5
Event word   DevPt  DevDt  DispPt  DispDt
emergency     50.0   74.7     0.9     1.5
area          40.6   50.0     0.6     1.0
worker        50.0   66.1     0.4     1.0
rescue        43.3   50.0     2.3     3.4

In Table 3, 'Event' denotes event words in the first
document in chronological order from Nt = 4, and
the title of the document is 'Emergency Work
Continues After Earthquake in Japan'. Table 3 clearly
demonstrates that the criterion, domain dependency
of words, is effectively employed.
Figure 6 illustrates the DET (Detection Error
Tradeoff) curves for a sample event (event type
is 'Comet into Jupiter') run at several values of Nt.

[Plot omitted; x-axis: False Alarm Probability (in %)]
Figure 6: DET curves for a sample tracking run
Overall, the curves also show that more training
helps the performance, while there is no significant
difference among Nt = 2, 4 and 8.
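The tracking measures used in Table 2 can be illustrated with a minimal sketch. It assumes the standard set-based definitions: recall and miss rate are taken over the documents judged YES, precision over the documents evaluated as YES, the false alarm rate over the documents judged NO, and F1 as the harmonic mean of recall and precision. The text itself gives only the miss, recall and precision definitions, so the F/A denominator and the F1 formula are assumptions, as are the function name `tracking_scores` and the toy document ids.

```python
from dataclasses import dataclass


@dataclass
class TrackingScores:
    miss: float         # judged YES but not evaluated YES, over all judged YES
    false_alarm: float  # evaluated YES but judged NO, over all judged NO (assumed)
    recall: float       # judged YES and evaluated YES, over all judged YES
    precision: float    # judged YES and evaluated YES, over all evaluated YES
    f1: float           # harmonic mean of recall and precision (assumed)


def tracking_scores(judged_yes: set, evaluated_yes: set, all_docs: set) -> TrackingScores:
    """Compute the measures from the human judgments and the system decisions."""
    hits = len(judged_yes & evaluated_yes)
    misses = len(judged_yes - evaluated_yes)
    false_alarms = len(evaluated_yes - judged_yes)
    judged_no = len(all_docs - judged_yes)

    recall = hits / len(judged_yes) if judged_yes else 0.0
    precision = hits / len(evaluated_yes) if evaluated_yes else 0.0
    miss = misses / len(judged_yes) if judged_yes else 0.0
    false_alarm = false_alarms / judged_no if judged_no else 0.0
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return TrackingScores(miss, false_alarm, recall, precision, f1)


# Toy run with hypothetical document ids: 4 documents judged YES out of 10.
docs = set(range(10))
judged = {0, 1, 2, 3}
evaluated = {0, 1, 2, 7}
scores = tracking_scores(judged, evaluated, docs)
print(scores)
```

Under these definitions the miss rate and recall always sum to one, so a DET curve (miss against false alarm) and a recall/precision pair describe the same run from two different angles.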
5.4 Key Paragraph Extraction 
We used four different sets as test data. The sets
consist of 2, 4, 8 and 16 documents, respectively.
For each set, we extracted 10% and 20% of the
paragraphs of the full documents (Jing et al., 1998).
Table 4 illustrates the result.
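Extraction at a fixed ratio can be sketched as follows. This is only an illustration: it assumes that every paragraph already carries a relevance score, that the number of extracted paragraphs is rounded up, and that at least one paragraph is always returned. The function name `extract_by_ratio` and the toy scores are hypothetical; the text states only the 10% and 20% extraction ratios.

```python
def extract_by_ratio(scores: list, percent: int) -> list:
    """Return the indices of the top-scoring paragraphs, in document order.

    `scores` holds one hypothetical relevance score per paragraph and
    `percent` is the extraction ratio as an integer (10 or 20 here).
    """
    n = len(scores)
    if n == 0:
        return []
    # Integer arithmetic avoids float surprises; rounding up is an assumption.
    k = max(1, (n * percent + 99) // 100)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Ten paragraphs with made-up scores.
paragraph_scores = [0.2, 0.9, 0.1, 0.4, 0.7, 0.3, 0.8, 0.05, 0.6, 0.5]
top10 = extract_by_ratio(paragraph_scores, 10)
top20 = extract_by_ratio(paragraph_scores, 20)
print(top10, top20)
```

Returning the indices in document order keeps the extracted paragraphs readable as a sequence, which is a design choice of this sketch rather than something the evaluation requires.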
In Table 4, 'Num' denotes the number of documents
in a set. '10%' and '20%' indicate the extraction ratio.
'Para' denotes the number of paragraphs extracted
by a human judge, and 'Correct' shows the accuracy
of the method.
The best result was 77.7% (the extraction ratio is 
20% and the number of documents is 2). 
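The 'Correct(%)' figures can be checked from the counts with a short calculation. The (Para, Correct) pairs below are copied from the 'Total' columns of Table 4. Truncating the percentage to one decimal place is an assumption that happens to reproduce the printed values; the names `TABLE4_TOTAL` and `accuracy_percent` are introduced only for this illustration.

```python
# Num -> (Para, Correct), copied from the 'Total' columns of Table 4.
TABLE4_TOTAL = {2: (175, 135), 4: (321, 240), 8: (606, 416), 16: (844, 536)}


def accuracy_percent(correct: int, para: int) -> float:
    """Correct / Para as a percentage, truncated (not rounded) to one decimal."""
    return (correct * 1000 // para) / 10


per_set = {num: accuracy_percent(c, p) for num, (p, c) in TABLE4_TOTAL.items()}
total_para = sum(p for p, _ in TABLE4_TOTAL.values())
total_correct = sum(c for _, c in TABLE4_TOTAL.values())
overall = accuracy_percent(total_correct, total_para)

print(per_set)
print(total_para, total_correct, overall)
```

The accuracy falls steadily as the number of documents in a set grows, and the overall figure is the pooled ratio of all correct paragraphs to all paragraphs chosen by the human judge, not the mean of the four per-set percentages.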
We now turn our attention to the main question:
how much did the distinction between a topic and
an event contribute to the summarization task?
Figure 7 illustrates the results of the methods
which used (i) the extracted topic and event words,
i.e. our method, and (ii) only the extracted event
words.
[Plot omitted; x-axis: Num]
Figure 7: Accuracy with each method
In Figure 7, '(10%)' and '(20%)' denote the extracted
paragraph ratio. 'Event' is the result when
we used only the extracted event words. Figure 7
shows that our method consistently outperforms the
method which used only the extracted events. To
summarize the evaluation:
1. Event extraction is effective when each
document discusses a different subject about
the same topic. This shows that the method will
be applicable to other genres of corpora which
consist of different subjects.
2. The result of the tracking task (79.0% average
recall and 86.6% average precision) is comparable
to the existing tracking techniques which were
tested on the TDT1 corpus (Allan and Carbonell, 1998).
3. The distinction between a topic and an event
improved the results of key paragraph extraction,
as our method consistently outperforms
the method which used only the extracted event
words (see Figure 7).
6 Related Work 
The majority of techniques for summarization fall
within two broad categories: those that rely on
template instantiation and those that rely on passage
extraction.
For the former approach, the DARPA-sponsored
TIPSTER program and, in particular, the Message
Understanding Conferences have provided fertile
ground for such work, by placing the emphasis
of document analysis on the identification and
extraction of certain core entities and facts in a
document. However, work on template-driven,
knowledge-based summarization to date is hardly
domain- or genre-independent (Boguraev and Kennedy, 1997).
The alternative approach largely escapes this
constraint, by viewing the task as one of identifying
certain passages (typically sentences) which, by some
metric, are deemed to be the most representative of
the document's content. A variety of approaches
exist for determining the salient sentences in the text:
statistical techniques based on word distribution
(Kupiec et al., 1995), (Zechner, 1996), (Salton et
al., 1991), (Teufel and Moens, 1997), symbolic
techniques based on discourse structure (Marcu, 1997)
and semantic relations between words (Barzilay and
Elhadad, 1997). All of their results demonstrate that
passage extraction techniques are a useful first step
in document summarization, although most of them
have focused on a single document.
Some researchers have started to apply
single-document summarization techniques to multiple
documents. Stein et al. proposed a method for
summarizing multiple documents using a single-document
summarizer (Strzalkowski et al., 1998), (Strzalkowski
et al., 1999). Their method first summarizes each
document, then groups the summaries into clusters
and, finally, orders these summaries
in a logical way (Stein et al., 1999). Their technique
seems sensible. However, as they admit, (i) the order
of the information should not depend only on the topics
covered, and (ii) background information that helps clarify
related information should be placed first. More
seriously, as Barzilay and Mani claim, summarization of
multiple documents requires information about
similarities and differences across documents. It is
therefore difficult to identify this information using
a single-document summarization technique (Mani and
Bloedorn, 1997), (Barzilay et al., 1999).
A method proposed by Mani et al. deals with
the problem, i.e. they tried to detect the
similarities and differences in information content among
documents (Mani and Bloedorn, 1997). They used
a spreading activation algorithm and graph matching
in order to identify similarities and differences
across documents. The output is presented as a set
of paragraphs with similar and unique words
highlighted. However, if the same information is
mentioned several times in different documents, much of
the summary will be redundant.

Table 4: The results of Key Paragraph Extraction
                            Accuracy
            10%                 20%                 Total
Num    Para  Correct(%)    Para   Correct(%)    Para   Correct(%)
2        58   44(75.8)      117    91(77.7)      175    135(77.1)
4       107   80(74.7)      214   160(74.7)      321    240(74.7)
8       202  138(68.3)      404   278(68.8)      606    416(68.6)
16      281  175(62.2)      563   361(64.1)      844    536(63.5)
Total   648  437(67.4)    1,298   890(68.5)    1,946  1,327(68.1)

Allan et al. also address the problem and
proposed a method for event tracking using common
words and surprising features by supplementing the
corpus statistics (Allan and Papka, 1998), (Papka et
al., 1999). One of the purposes of their study is to
make a distinction between an event and an event
class using surprising features. Here, event class
features are broad news areas such as politics, death,
destruction and warfare. The idea is considered to
be necessary to obtain high accuracy, while Allan
claims that the surprising words do not provide a
broad enough coverage to capture all documents on
the event.
A more recent approach dealing with this problem
is that of Barzilay et al. (Barzilay et al., 1999).
They used paraphrasing rules which are manually
derived from the result of syntactic analysis to
identify theme intersection and used language generation
to reformulate them as a coherent summary. While
the approach promises high accuracy, the result of the
summarization task has not been reported.
Like Mani's and Barzilay's techniques, our
approach focuses on the problem of how to identify
differences and similarities across documents, rather
than the problem of how to form the actual
summary (Sparck Jones, 1993), (McKeown and Radev, 1995),
(Radev and McKeown, 1998). However, while
Barzilay's approach used paraphrasing rules to eliminate
redundancy in a summary, we proposed domain
dependency of words to address robustness of the
technique.
7 Conclusion 
In this paper, we proposed a method for extracting
key paragraphs for summarization based on the
distinction between a topic and an event. The results
showed that the average accuracy was 68.1% when
we used the TDT1 corpus. The TIPSTER Text
Summarization Evaluation (SUMMAC) proposed
various methods for evaluating document summarization
and tasks (Mani et al., 1999). Of these,
participants submitted two summaries: a fixed-length
summary limited to 10% of the length of the source,
and a summary which was not limited in length.
Future work includes quantitative and qualitative
evaluation. In addition, our method used single words
rather than phrases. Phrases, however, would
be helpful to resolve ambiguity and reduce a lot of
noise, i.e. yield much better accuracy. We plan to
apply our method to phrase-based topic and event
extraction, and then turn to the problem of
how to form the actual summary.
Acknowledgments 
The authors would like to thank the reviewers
for their valuable comments. This work was
supported by the Grant-in-aid for the Japan Society for
the Promotion of Science (JSPS, No.11780258) and
Tateisi Science and Technology Foundation.

References 
J. Allan and J. Carbonell. 1997. The TDT pilot study
corpus documentation. In TDT.Study.Corpus,
V1.3.doc.
J. Allan and J. Carbonell. 1998. Topic detection
and tracking pilot study: Final report. In Proc.
of the DARPA Broadcast News Transcription and
Understanding Workshop.
J. Allan and R. Papka. 1998. On-line new event
detection and tracking. In Proc. of 21st Annual
International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages
37-45.
R. Barzilay and M. Elhadad. 1997. Using lexical
chains for text summarization. In Proc. of ACL
Workshop on Intelligent Scalable Text
Summarization, pages 10-17.
R. Barzilay, K. R. McKeown, and M. Elhadad.
1999. Information fusion in the context of
multi-document summarization. In Proc. of 37th
Annual Meeting of Association for Computational
Linguistics, pages 550-557.
B. Boguraev and C. Kennedy. 1997. Salience-based
content characterization of text documents. In
Proc. of ACL Workshop on Intelligent Scalable
Text Summarization, pages 2-9.
E. Brill. 1992. A simple rule-based part of speech 
tagger. In Proc. of the 3rd Conference on Applied 
Natural Language Processing, pages 152-155. 
J. Carbonell, Y. Yang, and J. Lafferty. 1999. CMU
report on TDT-2: Segmentation, detection and
tracking. In Proc. of the DARPA Broadcast News
Workshop.
F. Fukumoto, Y. Suzuki, and J. Fukumoto. 1997.
An automatic extraction of key paragraphs based
on context dependency. In Proc. of the 5th
Conference on Applied Natural Language Processing,
pages 291-298.
H. Jing, R. Barzilay, K. R. McKeown, and M.
Elhadad. 1998. Summarization evaluation methods:
Experiments and analysis, intelligent text
summarization. In Proc. of 1998 American
Association for Artificial Intelligence Spring Symposium,
pages 51-59.
J. Kupiec, J. Pedersen, and F. Chen. 1995. A
trainable document summarizer. In Proc. of the
18th Annual International ACM SIGIR Conference
on Research and Development in Information
Retrieval, pages 68-73.
H. P. Luhn. 1958. The automatic creation of
literature abstracts. IBM Journal, 2(1):159-165.
I. Mani and E. Bloedorn. 1997. Multi-document
summarization by graph search and matching. In
Proc. of the 15th National Conference on
Artificial Intelligence, pages 622-628.
I. Mani, T. Firmin, and B. Sundheim. 1999. The
TIPSTER SUMMAC text summarization
evaluation. In Proc. of Ninth Conference of the
European Chapter of the Association for
Computational Linguistics, pages 77-85.
D. Marcu. 1997. From discourse structures to text
summaries. In Proc. of ACL Workshop on
Intelligent Scalable Text Summarization, pages 82-88.
K. R. McKeown and D. R. Radev. 1995. Generating
summaries of multiple news articles. In Proc. of
the 18th Annual International ACM SIGIR
Conference on Research and Development in
Information Retrieval, pages 74-82.
R. Papka, J. Allan, and V. Lavrenko. 1999. UMASS 
approaches to detection and tracking at TDT2. In 
Proc. of the DARPA Broadcast News Workshop. 
D. R. Radev and K. R. McKeown. 1998.
Generating natural language summaries from
multiple on-line sources. Computational Linguistics,
24(3):469-500.
G. Salton, J. Allan, C. Buckley, and A. Singhal.
1991. Automatic analysis, theme generation, and
summarization of machine-readable texts.
Science, 164:1421-1426.
K. Sparck Jones. 1993. What might be in a summary?
In Proc. of Information Retrieval 93, pages 9-26.
G. C. Stein, T. Strzalkowski, and G. B. Wise. 1999.
Summarizing multiple documents using text
extraction and interactive clustering. In Proc. of
the Pacific Association for Computational
Linguistics 1999, pages 200-208.
T. Strzalkowski, G. C. Stein, and G. B. Wise. 1998.
A text-extraction based summarizer. In Proc. of
Tipster Workshop.
T. Strzalkowski, G. C. Stein, and G. B. Wise. 1999.
Getracker: A robust, lightweight topic tracking
system. In Proc. of the DARPA Broadcast News
Workshop.
S. Teufel and M. Moens. 1997. Sentence extraction
as a classification task. In Proc. of ACL Workshop
on Intelligent Scalable Text Summarization, pages
58-65.
K. Zechner. 1996. Fast generation of abstracts from
general domain text corpora by extracting
relevant sentences. In Proc. of the 16th International
Conference on Computational Linguistics, pages
986-989.
