Detecting Shifts in News Stories for Paragraph Extraction
Fumiyo Fukumoto Yoshimi Suzuki
Department of Computer Science and Media Engineering,
Yamanashi University
4-3-11, Takeda, Kofu, 400-8511, Japan
{fukumoto@skye.esb, ysuzuki@alps1.esi}.yamanashi.ac.jp
Abstract
For multi-document summarization, where documents are collected over an extended period of time, the subject in a document changes over time. This paper focuses on subject shift and presents a method for extracting key paragraphs from documents that discuss the same event. Our extraction method uses the results of event tracking, which starts from a few sample documents and finds all subsequent documents that discuss the same event. The method was tested on the TDT1 corpus, and the results show the effectiveness of the method.
1 Introduction
Multi-document summarization of news stories differs from single-document summarization in that it is important to identify differences and similarities across documents. This can be interpreted as the question of how to identify an event and a subject in documents. According to the TDT project, an event is something that occurs at a specific place and time associated with some specific actions, and it forms the background shared among documents. A subject, on the other hand, refers to the theme of the document itself. Another important factor, typical of a stream of news, is recognizing and handling subject shift. Paragraphs extracted on the basis of an event and a subject may capture the main points of each document and the background shared among documents; however, when they are strung together, the resulting summary still contains much overlapping information.
This paper focuses on subject shift and presents a method for extracting key paragraphs from documents that discuss the same event. We use the results of our tracking technique, which automatically detects subject shift and produces the optimal window size in the training data so as to include only the data which are sufficiently related to the current subject. The idea behind this is that, of two documents from the target event which are close in chronological order, the latter discusses either (i) the same subject as the earlier one, or (ii) a new subject related to the target event. This is particularly well illustrated by the Kobe Japan quake event in the TDT1 data. The first document says that a severe earthquake shook the city of Kobe, and this subject continues until the 5th document. The 6th through 17th documents report the damage, location and nature of the quake. The 18th document, on the other hand, states that the Osaka area suffered much less damage than Kobe. The subject of that document is different from the earlier ones, while all of these documents are related to the Kobe Japan quake event. We use the leave-one-out estimator of Support Vector Machines (SVMs) (Vapnik, 1995) to make a clear distinction between (i) and (ii) and thus estimate the optimal window size in the training data. For the results of tracking, where documents are divided into several sets, each of which covers a different subject related to the same event, we apply SVMs again and induce classifiers. Using these classifiers, we extract key paragraphs.
The next section explains why we need to detect subject shift by introducing the notions of an event, a subject class and a subject, which are the properties that identify key paragraphs. After describing SVMs, we present our system. Finally, we report some experiments using the TDT1 corpus and end with a very brief summary of existing techniques.
2 An Event, A Subject Class and A
Subject
Our hypothesis about key paragraphs in multiple documents related to the target event is that they include words related to the subject of a document, a subject class among documents, and the target event. We call these subject, subject class and event words. A subject word relates to the theme of the document itself, i.e., something the writer wishes to express; it appears across paragraphs, but does not appear in other documents (Luhn, 1958). A subject class word denotes a broader class of subjects than a specific subject, but one narrower than an event; it appears across documents, and these documents discuss related subjects. An event word, on the other hand, refers to something that occurs at a specific place and time associated with some specific actions, and it appears across documents about the target event. Let us look at the following three documents concerning the Kobe Japan quake from the TDT1.
1. Emergency work continues after earthquake in Japan
1-1. Casualties are mounting in [Japan], where a strong [earthquake] eight hours ago struck [Kobe]. Up to 400 {people} related {deaths} are confirmed, thousands of {injuries}, and rescue crews are searching ...

2. Quake Collapses Buildings in Central Japan
2-1. At least two {people} died and dozens {injuries} when a powerful [earthquake] rolled through central [Japan] Tuesday morning, collapsing buildings and setting off fires in the cities of [Kobe] and Osaka.
2-2. The [Japan] Meteorological Agency said the [earthquake], which measured 7.2 on the open-ended Richter scale, rumbled across Honshu Island from the Pacific Ocean to the [Japan] Sea.
2-3. The worst hit areas were the port city of [Kobe] and the nearby island of Awajishima, where in both places dozens of fires broke out and up to 50 buildings, including several apartment blocks, ...

3. US forces to fly blankets to Japan quake survivors
3-1. United States forces based in [Kobe] [Japan] will take blankets to help [earthquake] survivors Thursday, in the U.S. military's first disaster relief operation in [Japan] since it set up bases here.
3-2. A military transporter was scheduled to take off in the afternoon from Yokota air base on the outskirts of Tokyo and fly to Osaka with 37,000 blankets.
3-3. Following the [earthquake] Tuesday, President Clinton offered the assistance of U.S. military forces in [Japan], and Washington provided the Japanese ...

Figure 1: Documents from the TDT1
The underlined words in Figure 1 denote subject words in each document, and words marked with '{}' and '[]' refer to subject class words and event words, respectively. Words such as 'Kobe' and 'Japan' are associated with the event, since all of these documents concern the Kobe Japan quake. The first document says that emergency work continues after the earthquake in Japan; underlined words such as 'rescue' and 'crews' denote the subject of the document. The second document states that the quake collapsed buildings in central Japan. These two documents mention the same thing: a powerful earthquake rolled through central Japan, and many people were injured. Therefore, words such as 'people' and 'injuries' which appear in both documents are subject class words, and these documents are classified into the same set. If we can determine that these documents discuss related subjects, we can eliminate the redundancy between them. The third document, on the other hand, states that the US military will fly blankets to Japan quake survivors. The subject of this document is different from that of the earlier ones, i.e., the subject has shifted.

Though it is hard to make a clear distinction between a subject and a subject class, it is easier to find properties that determine whether a later document discusses the same subject as an earlier one or not. Our method exploits this feature of documents.
3 SVMs
We use a supervised learning technique, SVMs (Vapnik, 1995), in the tracking and paragraph extraction tasks. SVMs are defined over a vector space where the problem is to find a decision surface that 'best' separates a set of positive examples from a set of negative examples by introducing the maximum 'margin' between the two sets. Figure 2 illustrates a simple problem that is linearly separable.
Figure 2: The decision surface of a linear SVM
The solid line denotes the decision surface, and the two dashed lines refer to the boundaries. The circled points on the boundaries are the support vectors, and their removal would change the decision surface. Precisely, the decision surface for a linearly separable space is a hyperplane which can be written as w · x + b = 0, where x is an arbitrary data point (x ∈ R^n) and w and b are learned from a training set. In the linearly separable case, maximizing the margin can be expressed as an optimization problem:
$$\mathrm{Minimize:}\;\; -\sum_{i=1}^{l}\alpha_i + \frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j\, x_i \cdot x_j \qquad (1)$$

$$\mathrm{s.t.:}\;\; \sum_{i=1}^{l}\alpha_i y_i = 0, \quad \forall i:\ \alpha_i \ge 0$$

$$w = \sum_{i=1}^{l}\alpha_i y_i x_i \qquad (2)$$
where x_i = (x_{i1}, ..., x_{in}) is the i-th training example and y_i is the label of the i-th training example. In formula (2), each element w_k (1 ≤ k ≤ n) of w corresponds to a word in the training examples, and the larger the value of w_k = \sum_{i=1}^{l} \alpha_i y_i x_{ik} is, the more strongly word k characterizes the positive examples.
We use E'_loo, an upper bound on the leave-one-out error of SVMs, to estimate the optimal window size in the training data. E'_loo estimates the performance of a classifier. It is based on the idea of the leave-one-out technique: the first example is removed from the l training examples; the remaining examples are used for training, and a classifier is induced; the classifier is then tested on the held-out example. The process is repeated for all training examples, and the number of errors divided by l, E_loo, is the leave-one-out estimate of the generalization error. E'_loo is an upper bound on E_loo that avoids computing E_loo directly, which is computationally very expensive. Recall that the removal of a support vector changes the decision surface; thus the worst case occurs when every support vector becomes an error. Let l be the number of training examples of a set S, and m be the number of support vectors. E'_loo(S) is defined as follows:

$$E_{loo}(S) \le E'_{loo}(S) = \frac{m}{l} \qquad (3)$$
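The exact leave-one-out loop that the bound in (3) short-circuits can be sketched in a few lines. The sketch below uses a nearest-centroid classifier as a stand-in for an SVM, and the toy data are our own assumptions, not the paper's setup; with a real SVM, E'_loo would replace the whole loop by counting support vectors and dividing by l.

```python
# Exact leave-one-out error estimation (E_loo), sketched with a
# nearest-centroid classifier standing in for the SVM of the paper.
# The data below are illustrative, not TDT1 documents.

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(x, pos, neg):
    """Assign x to the closer class centroid: +1 (positive) or -1."""
    return 1 if sq_dist(x, centroid(pos)) <= sq_dist(x, centroid(neg)) else -1

def loo_error(examples):
    """E_loo: hold out each example in turn, train on the rest, count errors."""
    errors = 0
    for i, (x, y) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]
        pos = [v for v, lab in rest if lab == 1]
        neg = [v for v, lab in rest if lab == -1]
        if classify(x, pos, neg) != y:
            errors += 1
    return errors / len(examples)

# Toy 2-D data: two well-separated clusters plus one borderline point.
data = [([0.0, 0.0], 1), ([0.2, 0.1], 1), ([0.1, 0.3], 1),
        ([2.0, 2.0], -1), ([2.1, 1.9], -1), ([1.0, 1.0], -1)]
print(loo_error(data))   # only the borderline point errs: 1/6
```

Running the l rounds above costs l trainings; the bound m/l needs only one, which is why the paper uses it inside the window-adjustment loop.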
4 System Design
4.1 Tracking by Window Adjustment
Like much previous research, our hypothesis regarding event tracking is that exploiting time will lead to improved data adjustment, because documents closer together in the stream are more likely to discuss related subjects than documents further apart. Let \vec{x}_1, ..., \vec{x}_p be the positive training documents, i.e., those discussing the target event, in chronological order. Let also \vec{y}_1, ..., \vec{y}_q be the negative training documents. The algorithm can be summarized as follows:
1. Scoring negative training documents
In the TDT tracking task, the number of labelled positive training documents is small (at most 16 documents) compared to the negative training documents. Therefore, the choice of good training data is an important issue for producing optimal results. We first represent each document as a vector in an n-dimensional space, where n is the number of words in the collection. The cosine of the angle between two vectors, \vec{x}_i and \vec{y}_j, is shown in (4).
$$\cos(\vec{x}_i, \vec{y}_j) = \frac{\sum_{k=1}^{n} x_{ik}\, y_{jk}}{\sqrt{\sum_{k=1}^{n} x_{ik}^2}\cdot\sqrt{\sum_{k=1}^{n} y_{jk}^2}} \qquad (4)$$
where x_{ik} and y_{jk} are the term frequencies of word k in the documents \vec{x}_i and \vec{y}_j, respectively. We compute a relevance score for each negative training document as the cosine of the angle between the vector of the center of gravity of the positive training documents and the vector of the negative training document, i.e., cos(\vec{g}, \vec{y}_j) (1 ≤ j ≤ q), where \vec{y}_j is the j-th negative training document, and \vec{g} is defined as follows:
$$\vec{g} = (g_1, \ldots, g_n) = \Big(\frac{1}{p}\sum_{i=1}^{p} x_{i1}, \ \ldots, \ \frac{1}{p}\sum_{i=1}^{p} x_{in}\Big) \qquad (5)$$
Here x_{ij} (1 ≤ j ≤ n) is the term frequency of word j in the positive document \vec{x}_i. The negative training documents are sorted in descending order of their relevance scores: \vec{y}_1, ..., \vec{y}_{q-1}, \vec{y}_q.
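Step 1 above can be sketched directly from formulas (4) and (5): build the centroid of the positive documents and rank the negatives by their cosine to it. The term-frequency vectors below are toy values, not TDT1 data.

```python
# Score each negative training document by its cosine (formula (4)) to
# the center of gravity of the positive documents (formula (5)), then
# sort the negatives by that relevance score.
import math

def centroid(docs):
    """g_k = (1/p) * sum_i x_ik over the p positive documents."""
    p = len(docs)
    return [sum(d[k] for d in docs) / p for k in range(len(docs[0]))]

def cosine(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

positives = [[3, 1, 0, 2], [2, 2, 1, 1]]                 # p = 2 documents
negatives = [[0, 0, 5, 1], [2, 1, 0, 1], [1, 0, 4, 0]]   # q = 3 documents

g = centroid(positives)
ranked = sorted(negatives, key=lambda y: cosine(g, y), reverse=True)
print(ranked[0])   # the negative document most similar to the positives
```

The top r documents of `ranked` are the \vec{y}_1, ..., \vec{y}_r used in the window-adjustment step.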
2. Adjusting window size
We estimate whether the most recent positive training document \vec{x}_p discusses (i) the same subject as the previous positive one, or (ii) a new subject. To do this, we use the value of E'_loo. Let \vec{y}_1, ..., \vec{y}_r be the negative training documents whose cosine similarity values are the top r among the q negative training documents. Let also Set_1 be a set consisting of \vec{x}_1, \vec{x}_p, \vec{y}_1, ..., \vec{y}_r, and Set_2 be a set consisting of \vec{x}_{p-1}, \vec{x}_p, \vec{y}_1, ..., \vec{y}_r. We compute E'_loo on Set_1 and Set_2. If the value of E'_loo on Set_2 is smaller than that on Set_1, this means that \vec{x}_p has the same subject as the previous document \vec{x}_{p-1}, since a classifier induced by training on Set_2 is estimated to have a smaller error rate than one trained on Set_1. In this case, we need to find the optimal window size so as to include only the positive documents which are sufficiently related to the subject. The flow of the algorithm is shown in Figure 3.
begin
  num = φ
  for k = 1 to p−3
    Set_a = {\vec{x}_1, \vec{x}_p, ..., \vec{x}_{p−k}, \vec{y}_1, ..., \vec{y}_{r−1}, \vec{y}_r}
    Set_b = {\vec{x}_1, \vec{x}_{p−1}, ..., \vec{x}_{(p−1)−k}, \vec{y}_1, ..., \vec{y}_{r−1}, \vec{y}_r}
    if E'_loo(Set_a) < E'_loo(Set_b) then
      num = k + 2; exit loop
    end if
  end for
  if num = φ then num = p end if
end

Figure 3: Flow of window adjustment
On the other hand, if the value of E'_loo of Set_2 is larger than that of Set_1, \vec{x}_p is regarded as discussing a new subject. In that case we use all previously seen positive documents for training as a default strategy.
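The loop of Figure 3 can be sketched as follows. The `e_loo` argument stands in for the SVM bound E'_loo(S) = m/l (training a real SVM per candidate set is omitted here), and the document identifiers are illustrative assumptions.

```python
# Sketch of the window-adjustment loop (Figure 3): grow the candidate
# window until the set anchored at x_p stops beating the set anchored
# at x_{p-1} on the (stubbed) E'_loo bound.

def adjust_window(pos, neg_top_r, e_loo):
    """Return num, the number of positive documents to keep for training.

    pos       -- positive documents x_1 .. x_p in chronological order
    neg_top_r -- the r highest-scoring negative documents y_1 .. y_r
    e_loo     -- function mapping a training set to its E'_loo bound
    """
    p = len(pos)
    num = None                       # the algorithm's 'phi' (unset) marker
    for k in range(1, p - 2):        # k = 1 .. p-3
        # Set_a = {x_1, x_p, ..., x_{p-k}} + top-r negatives
        set_a = [pos[0], *pos[p - 1 - k:p], *neg_top_r]
        # Set_b = {x_1, x_{p-1}, ..., x_{(p-1)-k}} + top-r negatives
        set_b = [pos[0], *pos[p - 2 - k:p - 1], *neg_top_r]
        if e_loo(set_a) < e_loo(set_b):
            num = k + 2
            break
    return num if num is not None else p   # default: keep all positives

# Toy bound: sets containing the newest document d5 look more separable.
print(adjust_window(["d1", "d2", "d3", "d4", "d5"], ["n1"],
                    lambda s: 0.2 if "d5" in s else 0.4))   # -> 3
```

With a real classifier, `e_loo` would train an SVM on the set and return (number of support vectors) / (set size).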
3. Tracking
Let num be the number of adjusted positive training documents. The top num negative documents are extracted from the q negative documents and merged with the num positive documents. The new set is trained by SVMs, and a classifier is induced. Recall that E'_loo is computationally inexpensive; however, the bound is sometimes too tight for a small amount of training data. This causes a high F/A rate, i.e., many documents annotated as negative are classified as positive. We therefore apply a simple check to each test document that the classifier judges positive: for each training document, we compute the cosine between the test and the training document vectors. If the cosine between the test document and a negative training document is the largest, the test document is judged to be negative; otherwise it is accepted as positive and processing of this test document terminates. Procedures 1, 2 and 3 are repeated until the last test document has been judged.
4.2 Paragraph Extraction
Our window adjustment algorithm is applied each time a document discussing the target event arrives. Therefore, some documents are assigned to more than one set of documents. We thus eliminate sets which completely overlap each other, and apply paragraph extraction to the result. Our hypothesis about key paragraphs is that they include subject, subject class, and event words. Let x_p be a paragraph in the document x, and x_{\1} be the resulting document with x_p removed. Let also l be the total number of documents in a set where each document discusses subjects related to x. If x_p includes subject words, x_p is related to x_{\1} rather than to the other l−1 documents, since subject words appear across paragraphs in x_{\1} rather than in the other l−1 documents. We apply SVMs to the training data, which consists of the l documents, and induce a classifier sbj(x_p), which identifies whether x_p is related to x_{\1} or not.
sbj(x_p) = 1 if x_p is assigned to x_{\1}; 0 otherwise.
We note that SVMs are basically introduced for solving binary classification, while our paragraph extraction is a multi-class classification problem, i.e., l classes. We use the pairwise technique for applying SVMs to multi-class data (Weston and Watkins, 1998), and assign x_p to one of the l documents. In a similar way, we apply SVMs to the other two training data sets and induce the classifiers sbj_class(x_p) and event(x_p).
sbj_class(x_p) = 1 if x_p is assigned to sbj_class_{x_{\1}}; 0 otherwise.

event(x_p) = 1 if x_p is assigned to event_{x_{\1}}; 0 otherwise.
sbj_class(x_p) is a classifier which identifies whether or not x_p is assigned to sbj_class_{x_{\1}}, the set which includes x_{\1}. It is induced from training data consisting of m different sets, including sbj_class_{x_{\1}}, each of which covers a different subject related to the target event. The classifier event(x_p) is induced from training data consisting of two sets: one is the set of all documents, including x_{\1}, concerning the target event; the other is a set of documents which do not concern the target event.
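The pairwise (one-vs-one) technique mentioned above reduces the l-class assignment to binary decisions: one classifier per pair of classes, with x_p going to the class winning the most pairwise votes. In the sketch below, the per-pair decision rule is a toy word-overlap test standing in for a trained SVM, and the document contents are illustrative assumptions.

```python
# One-vs-one multi-class assignment: run a binary decision for every
# pair of classes and pick the class with the most pairwise wins.
from itertools import combinations

def pairwise_assign(x, classes, decide):
    """decide(x, a, b) returns the winner (a or b) of the binary
    classifier trained to separate class a from class b."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[decide(x, a, b)] += 1
    return max(classes, key=lambda c: votes[c])

# Toy stand-in: each "document" class is a word set, and the winner of
# a pair is the class sharing more words with the paragraph x.
docs = {"d1": {"rescue", "crews"}, "d2": {"blanket", "military"},
        "d3": {"building", "collapse"}}

def decide(x, a, b):
    return a if len(x & docs[a]) >= len(x & docs[b]) else b

print(pairwise_assign({"rescue", "searching"}, list(docs), decide))  # -> d1
```

For l documents this trains l(l−1)/2 binary classifiers, which is the cost the pairwise scheme trades for keeping each training problem binary.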
We extract paragraphs for which (6) holds.

sbj(x_p) = 1 & sbj_class(x_p) = 1 & event(x_p) = 1   (6)
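The conjunction in rule (6) can be sketched as follows. The three classifiers here are keyword-test stand-ins for the induced SVM classifiers, and the word lists (taken loosely from the Figure 1 example) are illustrative assumptions.

```python
# Extraction rule (6): a paragraph is a key paragraph only when all
# three classifiers fire. Word lists are illustrative stand-ins for
# the learned classifiers.

SUBJECT_WORDS = {"rescue", "crews"}              # stand-in for sbj
SUBJECT_CLASS_WORDS = {"people", "injuries"}     # stand-in for sbj_class
EVENT_WORDS = {"kobe", "japan", "earthquake"}    # stand-in for event

def sbj(paragraph):
    return int(any(w in paragraph for w in SUBJECT_WORDS))

def sbj_class(paragraph):
    return int(any(w in paragraph for w in SUBJECT_CLASS_WORDS))

def event(paragraph):
    return int(any(w in paragraph for w in EVENT_WORDS))

def is_key_paragraph(paragraph_words):
    """Rule (6): sbj = 1 and sbj_class = 1 and event = 1."""
    p = set(paragraph_words)
    return sbj(p) == 1 and sbj_class(p) == 1 and event(p) == 1

para = ["rescue", "crews", "searching", "people", "earthquake", "kobe"]
print(is_key_paragraph(para))   # True: all three classifiers fire
```

Dropping any one conjunct loosens the rule; the experiments in Section 5 use exactly this point (extracting with only sbj and sbj_class) as the no-shift-detection baseline.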
5 Experiments
We used the TDT1 corpus, which comprises two different sources, Reuters (7,965 documents) and CNN (7,898 documents) (Allan et al., 1998a). A set of 25 target events was defined, and each document is labeled according to whether or not it discusses a target event. All 15,863 documents were tagged by a part-of-speech tagger (Brill, 1992) and stemmed using WordNet information (Fellbaum, 1998). We extracted all nouns in the documents.
5.1 Tracking Task
Table 1 summarizes the results, which were obtained using the standard TDT evaluation measure¹.

Table 1: Tracking results

N_t   Miss  F/A    Prec  F1
1     31%   0.16%  70%   0.68
2     27%   0.16%  79%   0.78
4     24%   0.09%  87%   0.78
8     23%   0.09%  87%   0.79
16    22%   0.09%  86%   0.79
'N_t' denotes the number of initial positive training documents, where N_t takes the values 1, 2, 4, 8 and 16. When N_t is 1, we use the document d and one negative training document \vec{y}_1 for training; here \vec{y}_1 is the negative document whose cosine with d is largest among all negative documents. The test set is always the collection minus the N_t = 16 documents. 'Miss' denotes the miss rate, the ratio of on-topic documents that the system failed to mark as Yes. 'F/A' denotes the false alarm rate, the ratio of off-topic documents that the system incorrectly marked as Yes. 'Prec' stands for precision, the number of correct assignments by the system divided by the total number of the system's assignments. 'F1' is a measure that balances recall and precision, where recall denotes the number of correct assignments by the system divided by the total number of correct assignments. Table 1 shows that there is no significant difference among N_t values except for 1, since F1 ranges from 0.78 to 0.79. This shows that the method works well even for a small number of initial positive training documents. Furthermore, the results are comparable to existing event tracking techniques: with N_t = 4, the F1, Miss and F/A scores reported by CMU were 0.66, 29% and 0.40%, and those of UMass were 0.62, 39% and 0.27% (Allan et al., 1998b).
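The four scores in Table 1 can be computed from paired system decisions and truth labels as sketched below; the decision vectors are toy values, not TDT1 output.

```python
# TDT-style scores: miss rate, false-alarm rate, precision and F1,
# computed from system Yes/No decisions and truth (human) labels.

def tdt_scores(system, truth):
    """system, truth: parallel lists of 1 (Yes) / 0 (No) decisions."""
    hits = sum(s and t for s, t in zip(system, truth))       # Yes / Yes
    miss = sum((not s) and t for s, t in zip(system, truth)) # No  / Yes
    fa = sum(s and (not t) for s, t in zip(system, truth))   # Yes / No
    n_true = sum(truth)               # documents truly on-topic
    n_false = len(truth) - n_true     # documents truly off-topic
    miss_rate = miss / n_true
    fa_rate = fa / n_false
    precision = hits / sum(system)
    recall = hits / n_true
    f1 = 2 * precision * recall / (precision + recall)
    return miss_rate, fa_rate, precision, f1

system = [1, 1, 0, 1, 0, 0, 0, 1]
truth  = [1, 1, 1, 0, 0, 0, 0, 1]
print(tdt_scores(system, truth))   # (0.25, 0.25, 0.75, 0.75)
```

Note that F/A is normalized by the (large) number of off-topic documents, which is why the F/A values in Table 1 are two orders of magnitude smaller than the miss rates.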
The contribution of the adaptive window algorithm is best explained by looking at the window sizes it estimates. Table 2 illustrates a sample tracking result for the 'Kobe Japan Quake' event with N_t = 16. This event has many documents, each of which discusses a new subject related to the target event. The table shows the first 10 documents, in chronological order, that were evaluated as positive. Columns 1-3 in Table 2 denote the id number, date, and title of each document; 'id = 1', for example, denotes the first document evaluated as positive. Columns 4 and 5 stand for the result of our method and the majority decision of three human judges, respectively. They take three values: 'Yes' denotes that the document discusses the same subject as an earlier one, 'New' indicates that the document discusses a new subject, and 'No' that the document is not a positive document.

¹ http://www.nist.gov/speech/tests/tdt/index.htm

Table 2: The adaptive window size in Event 15, 'Kobe Japan Quake'

                                                                       shifts          adjusted window size
id  date      title                                                    system  actual  recall  precision  F1
1   01/17/95  Kobe Residents Unable to Commence Rescue Operations      New     New     100%    100%       1.00
2   01/17/95  Emergency Efforts Continue After Quake in Japan          Yes     Yes     100%    100%       1.00
3   01/17/95  Japan Helpline Worker Discusses Emergency Efforts        Yes     New     100%    5%         0.10
4   01/17/95  U.S. Businessman Describes Japan Earthquake              Yes     Yes     100%    80%        0.89
5   01/17/95  Osaka, Japan, Withstands Earthquake Better Than Others   Yes     New     100%    5%         0.09
6   01/17/95  President Clinton Drums Up Support in Humanitarian Trip  No      New     100%    5%         0.09
7   01/17/95  Engineer Examines Causes of Damage in Japan Quake        Yes     New     100%    50%        0.67
8   01/18/95  Mike Chinoy Updates Japan's Earthquake Recovery Efforts  Yes     Yes     100%    100%       1.00
9   01/18/95  Smoke Hangs in a Pall Over Quake-, Fire-Ravaged Kobe     New     New     100%    4%         0.08
10  01/18/95  Japanese Wonder If Their Cities Are Really 'Quakeproof'  New     New     100%    4%         0.07

We can see that the method correctly recognizes a test document as discussing an earlier subject or a new one, since the results of our method ('system') and the human judges ('actual') coincide except for id = 5, 6 and 7.
Columns 6-8 stand for the accuracy of the adjusted window size. Recall denotes the number of documents selected by both the system and the human judges divided by the total number of documents selected by the human judges, and precision denotes the number of documents selected by both divided by the total number of documents selected by the system. When the method correctly recognizes a test document as discussing an earlier subject ('system = actual = Yes'), our algorithm selects documents which are sufficiently related to the current subject: the total average F1 was 0.82. We note that precision is low when 'system = New'. This is because we use a default strategy, i.e., we use all previously seen positive documents for training when the most recent training document is judged to discuss a new subject.
5.2 Paragraph Extraction
We used 15 of the 25 events, namely those which have more than 16 positive documents, in the experiment. Table 3 gives the number of documents and paragraphs in each event. 'Avg.' under 'doc' shows the average number of documents per event, and 'Avg.' under 'para' denotes the average number of paragraphs per document. The maximum number of paragraphs per document was 100.
Table 3: Data

Event                          CNN          Reuters
                               doc   para   doc   para
3 (Carter in Bosnia)            26    314     8     37
5 (Clinic Murders (Salvi))      36    416     5     34
6 (Comet into Jupiter)          41    539     4     23
8 (Death of Kim Jong Il)        28    337    39    353
9 (DNA in OJ trial)            108  1,407     6     75
11 (Hall's copter (N. Korea))   77    875    22    170
12 (Humble, TX, flooding)       22    243     0      0
15 (Kobe Japan quake)           72    782    12     64
16 (Lost in Iraq)               34    395    10     78
17 (NYC Subway bombing)         22    374     2      2
18 (OK-City bombing)           214  3,209    59    439
21 (Serbians down F-16)         50    572    15    135
22 (Serbs violate Bihac)        56    669    35    349
24 (USAir 427 crash)            32    435     7     98
25 (WTC Bombing trial)          18    132     4     54
Avg.                          55.4   12.7  15.2    9.7

Table 4 shows the results of paragraph extraction. 'CNN' refers to the results using the CNN corpus as both training and test data, 'Reuters' to the results using the Reuters corpus, and 'Total' to the results using both corpora. 'Tracking result' refers to the F1 score obtained by using the tracking results. 'Perfect analysis' stands for the F1 achieved using the perfect (post-edited) output of the tracking method, i.e., with the errors of both tracking and shift detection corrected: documents that the system marked Yes but that were not truly on-topic were eliminated, and documents that the system marked No but that were on-topic were added. Further, the documents were divided by a human into several sets, each of which covers a different subject related to the same event. The evaluation was made by three humans; a classification is determined to be correct if the majority of the three judges agrees. Table 4 shows that the average F1 of 'Tracking results' (0.68) in 'Total' was 0.06 lower than that of 'Perfect analysis' (0.74). Overall, the results using 'CNN' were better than those using 'Reuters'. One reason lies in a difference between the two corpora: CNN has a larger number of words per paragraph than Reuters. This yields a higher recall rate, since a paragraph consisting of many words is more likely to include event, subject-class, and subject words than a paragraph containing few words.
Table 4: Performance of paragraph extraction

       Tracking results         Perfect analysis
N_t    CNN   Reuters  Total     CNN   Reuters  Total
1      0.70  0.56     0.62
2      0.75  0.60     0.67
4      0.76  0.61     0.70      0.78  0.62     0.74
8      0.76  0.62     0.70
16     0.77  0.62     0.72
Avg.   0.85  0.60     0.68

Recall that in SVMs each word weight w_k is calculated using formula (2), and the larger the value of w_k is, the more strongly word k characterizes the positive examples. Table 5 illustrates sample words with the highest weighted values calculated using formula (2). Each classifier, sbj(x_p), sbj_class(x_p), and event(x_p), is the result obtained using both corpora. The event is the Kobe Japan quake, and the document which includes x_p states that the death toll has risen to over 800 in the Kobe-Osaka earthquake, and that officials are concentrating on getting people out. 'Words' are the words with the highest weighted value in each classifier; they are used to determine whether x_p is a key paragraph or not. We assume these words are subject, subject class and event words, though some words such as 'earthquake' and 'activity' appear in more than one classifier.
Table 5: Sample words in the Kobe Japan quake

classifier       words
sbj(x_p)         earthquake, activity, Japan, seismologist, news conference, living, prime minister Murayama, crew, Bill Dorman
sbj_class(x_p)   city, something, floor, quake, Tokyo, aftershock, activity, street, injury, fire, seismologist, police, people, building, cry
event(x_p)       Kobe, magnitude, survivor, earthquake, collapse, death, fire, damage, aftershock, Kyoto, toll, quake, magnitude, emergency, Osaka-Kobe, Japan, Osaka
Figure 4: F1 vs. the number of documents

Figure 4 illustrates how the number of documents influences extraction accuracy. The event is the USAir 427 crash, whose F1 of 0.68 is lower than the average F1 over all events (0.79); the result is for N_t = 16. 'P ana of tracking' refers to the result using the post-edited output of the tracking, i.e., with only the tracking errors corrected, while 'Perfect analysis' refers to the result with the errors of both tracking and shift detection corrected. Figure 4 shows that our method does not depend on the number of documents, since performance does not monotonically decrease as the number of documents increases. Figure 4 also shows that the difference between 'P ana of tracking' and 'Perfect analysis' is small compared to the difference between 'Tracking results' and 'Perfect analysis'. This indicates that (i) subject shifts are correctly detected, and (ii) the performance of our paragraph extraction depends directly on the tracking results.
We now note the contribution of detecting shifts to paragraph extraction. Figures 5 and 6 illustrate recall and precision for two methods: with and without shift detection. In the method without shift detection, we use the 'full memory' approach for tracking, i.e., SVMs generate the classification model from all previously seen documents, and for the tracking result we extract paragraphs for which sbj(x_p) = 1 and sbj_class(x_p) = 1 hold. We can see from Figures 5 and 6 that the method with shift detection outperformed the method without it for all N_t values. More strikingly, Figure 6 shows that the precision scores for all N_t values using the tracking results with shift detection were higher than those of 'P ana' without shift detection. Further, the difference in precision between the two methods is larger than that in recall. This demonstrates that it is necessary to detect subject shifts, and thus to identify subject class words, for paragraph extraction, since the system without shift detection extracts many documents, which yields redundancy.
Figure 5: Recall with and without detecting shift
Figure 6: Precision with and without detecting shift
6 Related Work
Most of the work on the summarization task by paragraph or sentence extraction has applied statistical techniques based on word distribution to the target document (Kupiec et al., 1995). More recently, other approaches have investigated the use of machine learning to find patterns in documents (Strzalkowski et al., 1998) and the utility of parameterized modules so as to deal with different genres or corpora (Goldstein et al., 2000). Some of these approaches to single-document summarization have been extended to deal with multi-document summarization (Mani and Bloedorn, 1997; Barzilay et al., 1999; McKeown et al., 1999).
Our work differs from this earlier work in several important respects. First, our method focuses on subject shift within the documents of the target event rather than across sets of documents from different events (Radev et al., 2000). Detecting subject shift within the documents of a single event, however, presents special difficulties, since these documents come from a very restricted domain. We thus present a window adjustment algorithm which automatically adjusts the window over the training documents, so as to include only the data which are sufficiently related to the current subject. Second, our approach works incrementally, while many approaches are static, i.e., they use documents prepared in advance and apply a variety of techniques to create summaries. We are interested in a substantially smaller number of initial training documents, which are then used to extract paragraphs from documents relevant to the initial ones: a small number of initial training documents is easy to collect, and costly human intervention can be avoided. To this end, we use a tracking technique. The small size of the training corpus, however, requires sophisticated parameter tuning for learning techniques, since we cannot build, from the initial training documents, the validation sets that would be required for optimal results. Instead we use the E'_loo bound of SVMs to cope with this problem. Further, our method does not use task-specific training features such as 'presence and type of agent' and 'presence of citation', which makes it extendable to other domains (Teufel, 2001).
7 Conclusion
This paper studied the effectiveness of detecting subject shifts for paragraph extraction. Future work includes (i) incorporating Named Entity extraction into the method, (ii) applying the method to the TDT2 and TDT3 corpora for quantitative evaluation, and (iii) extending the method to on-line paragraph extraction for real-world applications, which will extract key paragraphs each time a document discussing the target event arrives.
Acknowledgments
We would like to thank Prof. Virginia Teller of Hunter College CUNY for her valuable comments and the anonymous reviewers for their helpful suggestions.

References

J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. 1998a. Topic Detection and Tracking pilot study final report. In Proc. of DARPA Workshop.

J. Allan, R. Papka, and V. Lavrenko. 1998b. On-line new event detection and tracking. In Proc. of ACM SIGIR'98, pages 37-45.

R. Barzilay, K. R. McKeown, and M. Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proc. of ACL'99, pages 550-557.

E. Brill. 1992. A simple rule-based part of speech tagger. In Proc. of ANLP'92, pages 152-155.

C. Fellbaum, editor. 1998. Nouns in WordNet, An Electronic Lexical Database. MIT Press.

J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. 2000. Multi-document summarization by sentence extraction. In Proc. of the ANLP/NAACL-2000 Workshop on Automatic Summarization, pages 40-48.

J. Kupiec, J. Pedersen, and F. Chen. 1995. A trainable document summarizer. In Proc. of ACM SIGIR'95, pages 68-73.

H. P. Luhn. 1958. The automatic creation of literature abstracts. IBM Journal, 2(1):159-165.

I. Mani and E. Bloedorn. 1997. Multi-document summarization by graph search and merging. In Proc. of AAAI-97, pages 622-628.

K. McKeown, J. Klavans, V. Hatzivassiloglou, R. Barzilay, and E. Eskin. 1999. Towards multidocument summarization by reformulation: Progress and prospects. In Proc. of the 16th National Conference on AI, pages 18-22.

D. Radev, H. Jing, and M. Budzikowska. 2000. Centroid-based summarization of multiple documents: Sentence extraction, utility-based evaluation, and user studies. In Proc. of the ANLP/NAACL-2000 Workshop on Automatic Summarization, pages 21-30.

T. Strzalkowski, J. Wang, and B. Wise. 1998. A robust practical text summarization system. In Proc. of AAAI Intelligent Text Summarization Workshop, pages 26-30.

S. Teufel. 2001. Task-based evaluation of summary quality: Describing relationships between scientific papers. In Proc. of NAACL 2001 Workshop on Automatic Summarization, pages 12-21.

V. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.

J. Weston and C. Watkins. 1998. Multi-class Support Vector Machines. Technical Report CSD-TR-98-04.
