Extracting Important Sentences with Support Vector Machines
Tsutomu HIRAO and Hideki ISOZAKI and Eisaku MAEDA
NTT Communication Science Laboratories
2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237 Japan
{hirao,isozaki,maeda}@cslab.kecl.ntt.co.jp
Yuji MATSUMOTO
Graduate School of Information and Science, Nara Institute of Science and Technology
8516-9, Takayama, Ikoma, Nara 630-0101 Japan
matsu@is.aist-nara.ac.jp
Abstract
Extracting sentences that contain important information from a document is a form of text summarization. The technique is the key to the automatic generation of summaries similar to those written by humans. To achieve such extraction, it is important to be able to integrate heterogeneous pieces of information. One approach, parameter tuning by machine learning, has been attracting a lot of attention. This paper proposes a method of sentence extraction based on Support Vector Machines (SVMs). To confirm the method's performance, we conduct experiments that compare our method to three existing methods. Results on the Text Summarization Challenge (TSC) corpus show that our method offers the highest accuracy. Moreover, we clarify the different features effective for extracting different document genres.
1 Introduction
Extracting important sentences means extracting from a document only those sentences that have important information. Since some sentences are lost, the result may lack coherence, but important sentence extraction is one of the basic technologies for generating summaries that are useful for humans to browse. Therefore, this technique plays an important role in automatic text summarization.
Many researchers have studied important sentence extraction since the late 1950's (Luhn, 1958). Conventional methods focus on sentence features and define significance scores. The features include key words, sentence position, and certain linguistic clues. Edmundson (1969) and Nobata et al. (2001) have proposed scoring functions that integrate heterogeneous features. However, we cannot tune the parameter values by hand when the number of features is large.
When a large quantity of training data is available, tuning can be effectively realized by machine learning. In recent years, machine learning has attracted attention in the field of automatic text summarization. Aone et al. (1998) and Kupiec et al. (1995) employed Bayesian classifiers; Mani et al. (1998), Nomoto et al. (1997), Lin (1999), and Okumura et al. (1999) used decision tree learning. However, most machine learning methods overfit the training data when many features are given. Therefore, we need to select features carefully.
Support Vector Machines (SVMs) (Vapnik, 1995) are robust even when the number of features is large. Accordingly, SVMs have shown good performance in text categorization (Joachims, 1998), chunking (Kudo and Matsumoto, 2001), and dependency structure analysis (Kudo and Matsumoto, 2000).
In this paper, we present an important sentence extraction technique based on SVMs. We verified the technique against the Text Summarization Challenge (TSC) (Fukushima and Okumura, 2001) corpus.
2 Important Sentence Extraction
based on Support Vector Machines
2.1 Support Vector Machines (SVMs)
SVM is a supervised learning algorithm for two-class problems. Training data is given by

(x_1, y_1), ..., (x_u, y_u),  x_j ∈ R^n,  y_j ∈ {+1, −1}.

Here, x_j is the feature vector of the j-th sample, and y_j is its class label, positive (+1) or negative (−1). The SVM separates positive and negative examples by a hyperplane defined by

w · x + b = 0,  w ∈ R^n,  b ∈ R,  (1)
[Figure: positive and negative examples separated by the hyperplane w·x + b = 0, with the margin hyperplanes w·x + b = ±1 and the support vectors marked.]

Figure 1: Support Vector Machines.
where “·” represents the inner product.
In general, such a hyperplane is not unique.
Figure 1 shows a linearly separable case. The SVM determines the optimal hyperplane by maximizing the margin, i.e., the distance between the hyperplane and the nearest positive and negative examples.
Since training data is not necessarily linearly separable, slack variables (ξ_j) are introduced for all x_j. These ξ_j absorb misclassification error, and should satisfy the following inequalities:

w · x_j + b ≥ 1 − ξ_j   (y_j = +1)
w · x_j + b ≤ −1 + ξ_j  (y_j = −1).  (2)

Under these constraints, the following objective function is to be minimized:

(1/2) ||w||² + C Σ_{j=1}^{u} ξ_j.  (3)
The ﬁrst term in (3) corresponds to the size
of the margin and the second term represents
misclassiﬁcation.
By solving a quadratic programming problem, the decision function f(x) = sgn(g(x)) can be derived, where

g(x) = Σ_{i=1}^{ℓ} λ_i y_i x_i · x + b.  (4)
The decision function depends only on the support vectors (x_i). Training examples other than support vectors have no influence on the decision function.
Non-linear decision surfaces can be realized by replacing the inner product in (4) with a kernel function K(x_i, x):

g(x) = Σ_{i=1}^{ℓ} λ_i y_i K(x_i, x) + b.  (5)
In this paper, we use polynomial kernel functions, which have been very effective when applied to other natural language processing tasks (Joachims, 1998; Kudo and Matsumoto, 2001; Kudo and Matsumoto, 2000):

K(x, y) = (x · y + 1)^d.  (6)
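As a minimal sketch (not the authors' implementation), the decision function of equation (5) combined with the polynomial kernel of equation (6) can be written as:

```python
def poly_kernel(x, y, d=2):
    """Polynomial kernel of equation (6): K(x, y) = (x . y + 1)^d."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d

def g(x, support_vectors, lambdas, labels, b, d=2):
    """Decision value of equation (5): b + sum_i lambda_i * y_i * K(x_i, x)."""
    return b + sum(lam * y * poly_kernel(sv, x, d)
                   for sv, lam, y in zip(support_vectors, lambdas, labels))
```

The class prediction is sgn(g(x)); section 2.2 instead uses the raw value of g(x) to rank sentences.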
2.2 Sentence Ranking by using Support
Vector Machines
Important sentence extraction can be regarded as a two-class problem: important or unimportant. However, the proportion of important sentences in the training data will differ from that in the test data. The number of important sentences in a document is determined by a summarization rate that is given at run-time. A simple solution to this problem is to rank the sentences in a document. We use g(x), the distance from the hyperplane to x, to rank the sentences.
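The ranking step described above can be sketched as follows; `score_fn` stands in for the trained g(x), and the function names and interface are illustrative, not from the paper:

```python
def rank_sentences(sentences, feature_vectors, score_fn):
    """Order sentences by g(x), the signed distance from the hyperplane."""
    scored = zip(sentences, map(score_fn, feature_vectors))
    return [s for s, _ in sorted(scored, key=lambda p: p[1], reverse=True)]

def extract_important(sentences, feature_vectors, score_fn, n):
    """Keep the top n sentences, where n follows from the summarization rate."""
    return rank_sentences(sentences, feature_vectors, score_fn)[:n]
```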
2.3 Features
We define the boolean features discussed below, associated with each sentence S_i, by taking past studies into account (Zechner, 1996; Nobata et al., 2001; Hirao et al., 2001; Nomoto and Matsumoto, 1997).

We use 410 boolean variables for each S_i, i.e., x = (x[1], ..., x[410]). A real-valued feature normalized between 0 and 1 is represented by 10 boolean variables. Each variable corresponds to an interval [i/10, (i+1)/10), where i = 0 to 9. For example, Posd = 0.75 is represented by “0000000100” because 0.75 belongs to [7/10, 8/10).
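This 10-bin boolean encoding can be sketched as follows (a hypothetical helper, not from the paper):

```python
def to_boolean_bins(value, n_bins=10):
    """Encode a real value normalized to [0, 1] as n_bins booleans,
    one per interval [i/n_bins, (i+1)/n_bins); 1.0 falls in the last bin."""
    i = min(int(value * n_bins), n_bins - 1)
    return [1 if k == i else 0 for k in range(n_bins)]
```

For instance, `to_boolean_bins(0.75)` sets only the bit for [7/10, 8/10), matching the “0000000100” example above.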
Position of sentences
We define three feature functions for the position of S_i. First, Lead is a boolean that corresponds to the output of the lead-based method described below¹. Second, Posd is S_i's position in the document. Third, Posp is S_i's position in a paragraph. The first sentence obtains the highest score, and the last obtains the lowest score:

Posd(S_i) = 1 − BD(S_i)/|D(S_i)|
Posp(S_i) = 1 − BP(S_i)/|P(S_i)|.

Here, |D(S_i)| is the number of characters in the document D(S_i) that contains S_i; BD(S_i) is the number of characters before S_i in D(S_i); |P(S_i)| is the number of characters of the paragraph P(S_i) that contains S_i; and BP(S_i) is the number of characters before S_i in the paragraph.

¹When a sentence appears in the first N sentences of a document, we assign 1 to the sentence. N was given for each document by the TSC committee.
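The two character-based position scores can be computed as follows (a sketch; the document is assumed to be a list of sentence strings, and Posp applies the same formula within the containing paragraph):

```python
def posd(sentences, i):
    """Posd(S_i) = 1 - BD(S_i)/|D(S_i)|: BD counts the characters before
    S_i, |D(S_i)| the characters in the whole document."""
    before = sum(len(s) for s in sentences[:i])
    total = sum(len(s) for s in sentences)
    return 1 - before / total

def posp(paragraph_sentences, i):
    """Posp(S_i): the identical formula, restricted to the paragraph."""
    return posd(paragraph_sentences, i)
```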
Length of sentences
We define a feature function that addresses the length of a sentence as

Len(S_i) = |S_i| / max_{S_z ∈ D(S_i)} |S_z|.

Here, |S_i| is the number of characters of sentence S_i, and the denominator is the maximum number of characters in a sentence that belongs to D(S_i).
In addition, the length of the previous sentence, Len_{−1}(S_i) = Len(S_{i−1}), and the length of the next sentence, Len_{+1}(S_i) = Len(S_{i+1}), are also features of sentence S_i.
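A sketch of these length features follows; the handling of the first and last sentence boundary is my assumption, since the paper does not specify it:

```python
def len_feature(sentences, i):
    """Len(S_i) = |S_i| / max_z |S_z|, character counts within the document."""
    return len(sentences[i]) / max(len(s) for s in sentences)

def len_neighbor(sentences, i, offset):
    """Len_{-1} (offset=-1) or Len_{+1} (offset=+1); 0.0 past the boundary."""
    j = i + offset
    return len_feature(sentences, j) if 0 <= j < len(sentences) else 0.0
```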
Weight of sentences
We define a feature function that weights sentences based on frequency-based word weighting as

Wf(S_i) = Σ_t tf(t, S_i) · w(t, D(S_i)).

Here, Wf(S_i) is the sum of the weights w(t, D(S_i)) of the words that appear in the sentence, and tf(t, S_i) is the term frequency of t in S_i. We use only nouns. In addition, we define the word weight w(t, D(S_i)) based on a specific field (Hara et al., 1997):

w(t, D(S_i)) = α ((1/T) Σ_{z=1}^{T} ε_z/V_z) + β (tf(t, D(S_i)) / Σ_{t′} tf(t′, D(S_i))).
Here, T is the number of sentences in the document, and V_z is the number of words in sentence S_z ∈ D(S_i) (repetitions are ignored). Also, ε_z is a boolean value that is 1 when t appears in S_z.

The first term of the equation above is the weight of a word in a specific field. The second term is the occurrence probability of word t.
We set the parameters α and β to 0.8 and 0.2, respectively. The weight of the previous sentence, Wf_{−1}(S_i) = Wf(S_{i−1}), and the weight of the next sentence, Wf_{+1}(S_i) = Wf(S_{i+1}), are also features of sentence S_i.
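A sketch of this weighting, treating each sentence as a list of (noun) tokens; the field term follows the ε_z/V_z reading of the equation above, and the helper names are mine:

```python
from collections import Counter

def word_weight(t, doc, alpha=0.8, beta=0.2):
    """w(t, D): alpha * (1/T) sum_z eps_z / V_z  +  beta * P(t in D),
    where V_z counts distinct words in sentence z and eps_z is 1 iff
    t occurs in z."""
    field = sum((t in set(s)) / len(set(s)) for s in doc) / len(doc)
    counts = Counter(w for s in doc for w in s)
    prob = counts[t] / sum(counts.values())
    return alpha * field + beta * prob

def wf(sentence, doc):
    """Wf(S_i) = sum_t tf(t, S_i) * w(t, D(S_i))."""
    return sum(n * word_weight(t, doc) for t, n in Counter(sentence).items())
```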
Density of key words
We define the feature function Den(S_i), which represents the density of key words in a sentence, by using the Hanning window function f_H(k, m):

Den(S_i) = max_m Σ_{k=m−Win/2}^{m+Win/2} f_H(k, m) · a(k, S_i),

where f_H(k, m) is given by

f_H(k, m) = (1/2)(1 + cos(2π(k − m)/Win))  if |k − m| ≤ Win/2,
f_H(k, m) = 0                              if |k − m| > Win/2.

The key words (KW) are the top 30% of the words in a document according to w(t, D(S_i)). Also, m is the center position of the window, and Win = |S_i|/2. In addition, a(k, S_i) is defined as follows:

a(k, S_i) = w(t, D)  if a word t (∈ KW) begins at position k,
a(k, S_i) = 0        if k is not the beginning position of a word in KW.
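A sketch of the density computation; `keyword_weight_at` maps a character position k to a(k, S_i) (zero elsewhere), and discretizing the window centers over the sentence is my assumption:

```python
import math

def f_hanning(k, m, win):
    """Hanning window: (1/2)(1 + cos(2*pi*(k - m)/win)) if |k - m| <= win/2."""
    if abs(k - m) > win / 2:
        return 0.0
    return 0.5 * (1 + math.cos(2 * math.pi * (k - m) / win))

def density(keyword_weight_at, sent_len):
    """Den(S_i): maximize, over window centers m, the windowed sum of
    keyword weights a(k, S_i), with Win = |S_i| / 2 as in the paper."""
    win = sent_len / 2
    return max(
        sum(f_hanning(k, m, win) * keyword_weight_at.get(k, 0.0)
            for k in range(int(m - win / 2), int(m + win / 2) + 1))
        for m in range(sent_len))
```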
Named Entities
x[r] = 1 (1 ≤ r ≤ 8) indicates that a certain Named Entity class appears in S_i. The number of Named Entity classes is 8 (Sekine and Eriguchi, 2000), e.g., PERSON, LOCATION. We use Isozaki's NE recognizer (Isozaki, 2001).
Conjunctions
x[r] = 1 (9 ≤ r ≤ 61) if and only if a certain conjunction is used in the sentence. The number of conjunctions is 53.
Functional words
x[r] = 1 (62 ≤ r ≤ 234) if and only if a certain functional word, such as ga, ha, or ta, is used in the sentence. The number of functional words is 173.
Part of speech
x[r] = 1 (235 ≤ r ≤ 300) if and only if a certain part of speech, such as “Noun-jiritsu” or “Verb-jiritsu”, is used in the sentence. The number of parts of speech is 66.
Semantic depth of nouns
x[r] = 1 (301 ≤ r ≤ 311) if and only if S_i contains a noun at a certain semantic depth according to a Japanese lexicon, Goi-Taikei (Ikehara et al., 1997). The number of depth levels is 11. For instance, Semdep=2 means that a noun in S_i belongs to the second depth level.
Document genre
x[r] = 1 (312 ≤ r ≤ 315) if and only if the document belongs to a certain genre. The genre is explicitly written in the header of each document. The number of genres is four: General, National, Editorial, and Commentary.
Symbols
x[r] = 1 (r = 316) if and only if the sentence includes a certain symbol (for example, •, ★, or ◆).
Conversation
x[r] = 1 (r = 317) if and only if S_i includes a conversational expression.
Assertive expressions
x[r] = 1 (r = 318) if and only if S_i includes an assertive expression.
3 Experimental settings
3.1 Corpus
We used the TSC (Fukushima and Okumura, 2001) summarization collection for our evaluation. TSC was established as a subtask of NTCIR-2 (NII-NACSIS Test Collection for IR Systems). The corpus consists of 180 Japanese documents² from the Mainichi Newspapers of 1994, 1995, and 1998. In each document, important sentences were manually extracted at summarization rates of 10%, 30%, and 50%. Note that the summarization rates depend on the number of sentences in a document, not the number of characters. Table 1 shows the statistics.
3.2 Evaluated methods
We compared four methods: decision tree learning, boosting, lead, and SVM. At each summarization rate, we trained classifiers and classified the test documents.

Decision tree learning method
We used C4.5 (Quinlan, 1993) for our experiments, with the default settings. We used the features described in section 2. Sentences were ranked according to the certainty factors given by C4.5.

²Each document is presented in SGML style with sentence and paragraph separators attached.
Boosting method
We used C5.0, which applies boosting to decision tree learning. The number of boosting rounds was set to 10. Sentences were ranked according to the certainty factors given by C5.0.

Lead-based method
The first N sentences of a document were selected. N was determined according to the summarization rate.

SVM method
This is our method, as outlined in section 2. We used the second-order polynomial kernel and set C (in equation (3)) to 0.0001. We used TinySVM³.
3.3 Measures for evaluation
In the TSC corpus, the number of sentences to be extracted was explicitly given by the TSC committee. When we extract sentences according to that number, Precision, Recall, and F-measure take the same value. We call this value Accuracy, defined as follows:

Accuracy = b/a × 100,

where a is the specified number of important sentences, and b is the number of true important sentences contained in the system's output.
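Under the setting where the system outputs exactly a sentences, the measure reduces to the following (a hypothetical helper over sentence identifiers):

```python
def accuracy(system_ids, gold_ids):
    """Accuracy = b / a * 100: a is the specified number of important
    sentences, b the number of system picks that are truly important.
    When exactly a sentences are output, precision = recall = F = this."""
    a = len(gold_ids)
    b = len(set(system_ids) & set(gold_ids))
    return b / a * 100
```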
4 Results
Table 2 shows the results of five-fold cross validation using all 180 documents.
For all summarization rates and all genres, SVM achieved the highest accuracy and the lead-based method the lowest. Let the null hypothesis be “there are no differences among the scores of the four methods.” We tested this null hypothesis at a significance level of 1% by using Tukey's method. Although SVM's performance was best at the 10% rate, the differences there were not statistically significant. At 30% and 50%, SVM performed better than the other methods with statistical significance.
³http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM/
Table 1: Details of data sets.

                                   General  National  Editorial  Commentary
# of documents                          16        76         41          47
# of sentences                         342      1721       1362        1096
# of important sentences (10%)          34       172        143         112
# of important sentences (30%)         103       523        414         330
# of important sentences (50%)         174       899        693         555
Table 2: Evaluation results of cross validation.

Summarization rate 10%
Genre        SVM   C4.5  C5.0  Lead
General      55.7  55.2  52.4  47.9
Editorial    34.2  33.6  27.9  31.6
National     61.4  52.0  56.3  51.8
Commentary   28.7  27.4  21.4  15.9
Average      46.2  41.4  40.4  37.4

Summarization rate 30%
Genre        SVM   C4.5  C5.0  Lead
General      51.0  45.7  50.4  50.5
Editorial    47.8  41.6  43.3  36.7
National     55.9  44.1  49.3  54.3
Commentary   48.7  39.4  40.1  32.4
Average      51.6  42.4  45.7  44.2

Summarization rate 50%
Genre        SVM   C4.5  C5.0  Lead
General      65.2  63.0  60.2  60.4
Editorial    60.6  54.1  54.6  51.0
National     63.3  58.7  58.7  61.5
Commentary   65.7  59.6  60.6  50.4
Average      63.5  58.2  58.4  56.1
5 Discussion
Table 2 shows that Editorial and Commentary are more difficult than the other genres. We can consider two reasons for the poor scores of Editorial and Commentary:

• These genres have no feature useful for discrimination.
• Non-standard features are useful in these genres.

Accordingly, we conducted an experiment to clarify genre dependency⁴:

⁴We did not use General because the number of documents in this genre was insufficient.
1. Extract 36 documents at random from genre i for training.
2. Extract 4 documents at random from genre j for testing.
3. Repeat this 10 times for all combinations of (i, j).
Table 3 shows the result, which implies that non-standard features are useful in Editorial and Commentary documents.
Now, we examine the effective features in each genre. Since we used the second-order polynomial kernel, we can expand g(x) as follows:

g(x) = b + Σ_{i=1}^{ℓ} w_i + 2 Σ_{i=1}^{ℓ} w_i Σ_{k=1}^{u} x_i[k] x[k]
       + Σ_{i=1}^{ℓ} w_i Σ_{h=1}^{u} Σ_{k=1}^{u} x_i[h] x_i[k] x[h] x[k],  (7)

where ℓ is the number of support vectors, and w_i equals λ_i y_i.
We can rewrite it as follows when all vectors are boolean:

g(x) = W_0 + Σ_{k=1}^{u} W_1[k] x[k] + Σ_{h=1}^{u−1} Σ_{k=h+1}^{u} W_2[h,k] x[h] x[k],  (8)

where

W_0 = b + Σ_{i=1}^{ℓ} w_i,  W_1[k] = 3 Σ_{i=1}^{ℓ} w_i x_i[k],  and
W_2[h,k] = 2 Σ_{i=1}^{ℓ} w_i x_i[h] x_i[k].
Therefore, W_1[k] indicates the significance of an individual feature, and W_2[h,k] indicates the significance of a feature pair. When |W_1[k]| or |W_2[h,k]| is large, the feature or feature pair has a strong influence on the optimal hyperplane.
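Given the support vectors x_i and the coefficients w_i = λ_i y_i, these weights can be collected as follows (a sketch under the boolean-feature assumption of equation (8)):

```python
def expand_weights(support_vectors, w, b=0.0):
    """Compute W0, W1[k], and W2[h][k] of equation (8) for a second-order
    polynomial kernel over boolean feature vectors; w[i] = lambda_i * y_i."""
    n = len(support_vectors[0])
    W0 = b + sum(w)
    W1 = [3 * sum(wi * sv[k] for wi, sv in zip(w, support_vectors))
          for k in range(n)]
    W2 = [[2 * sum(wi * sv[h] * sv[k] for wi, sv in zip(w, support_vectors))
           for k in range(n)] for h in range(n)]
    return W0, W1, W2
```

Sorting features by |W1[k]| and feature pairs by |W2[h][k]| yields the kind of per-genre analysis reported in Table 4.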
Table 3: Evaluation results for three genres.

Training \ Test      National             Editorial            Commentary
                  10%   30%   50%      10%   30%   50%      10%   30%   50%
National         63.4  57.6  65.5     32.8  39.4  53.6     24.0  39.5  60.8
Editorial        49.3  46.8  58.4     33.9  49.1  64.4     24.9  43.6  62.1
Commentary       37.4  43.3  61.1     18.4  41.8  57.8     30.6  49.6  67.0
Table 4: Effective features and their pairs.

Summarization rate 10%
National                 Editorial                        Commentary
Lead ∧ ga                0.9≤Posd≤1.0 ∧ 0.7≤Wf<0.8        0.9≤Posd≤1.0 ∧ Semdep=2
0.9≤Posd≤1.0 ∧ ga        NE ∧ de                          0.5≤Len_{+1}<0.6 ∧ Noun-hijiritsu
Lead ∧ ta                0.9≤Posd≤1.0 ∧ de                0.0≤Posp<0.1 ∧ 0.5≤Wf_{+1}<0.6
0.9≤Posd≤1.0 ∧ ta        Lead ∧ 0.7≤Wf<0.8                0.8≤Posd<0.9 ∧ Particle

Summarization rate 30%
National                       Editorial                Commentary
Lead ∧ Semdep=6                0.0≤Posp<0.1 ∧ ga        Aux-verb ∧ Semdep=2
0.9≤Posd≤1.0 ∧ Semdep=6        0.9≤Posd≤1.0 ∧ NE        Verb-jiritsu ∧ Semdep=2
Lead ∧ ga                      Lead ∧ NE                Semdep=2
0.9≤Posd≤1.0                   0.0≤Posd<0.1             0.0≤Posp<0.1 ∧ 0.5≤Den<0.6

Summarization rate 50%
National                   Editorial                      Commentary
Lead                       0.0≤Posp<0.1 ∧ Semdep=6        0.0≤Posp<0.1 ∧ Particle
Lead ∧ ha                  0.0≤Posp<0.1 ∧ ga              0.2≤Posd<0.3
Lead ∧ Verb-jiritsu        0.0≤Posp<0.1                   0.4≤Len<0.5
Lead ∧ ta                  0.0≤Posd<0.1                   0.0≤Posp<0.1
Table 4 shows some of the effective features that had large weights W_1[k] or W_2[h,k] for each genre.

Effective features common to the three genres at all three rates were sentence positions. Since National has a typical newspaper style, the beginning of the document was important. Moreover, “ga” and “ta” were important; these functional words are used when a new event is introduced.
In Editorial and Commentary, the end of a paragraph and that of a document were important. The reason for this result is that subtopic or main-topic conclusions are common in those positions. This implies that National has a different text structure from Editorial and Commentary.

Moreover, in Editorial, “de” and sentence weight were important. In Commentary, semantically shallow words, sentence weight, and the length of the next sentence were important.

In short, we confirmed that the features effective for discriminating a genre differ with the genre.
6 Conclusion
This paper presented an SVM-based important sentence extraction technique. Comparisons were made with the lead-based method, the decision tree learning method, and the boosting method at summarization rates of 10%, 30%, and 50%. The experimental results show that the SVM-based method outperforms the other methods at all summarization rates. Moreover, we clarified the effective features for three genres and showed that the important features vary with the genre.

In future work, we would like to apply our method to the trainable Question Answering system SAIQA-II developed in our group.
Acknowledgement
We would like to thank all the members of the Knowledge Processing Research Group for valuable comments and discussions.

References

C. Aone, M. Okurowski, and J. Gorlinsky. 1998. Trainable Scalable Summarization Using Robust NLP and Machine Learning. Proc. of the 17th COLING and 36th ACL, pages 62–66.

H. Edmundson. 1969. New Methods in Automatic Abstracting. Journal of the ACM, 16(2):246–285.

T. Fukushima and M. Okumura. 2001. Text Summarization Challenge: Text Summarization Evaluation in Japan. Proc. of the NAACL 2001 Workshop on Automatic Summarization, pages 51–59.

M. Hara, H. Nakajima, and T. Kitani. 1997. Keyword Extraction Using a Text Format and Word Importance in a Specific Field (in Japanese). Transactions of the Information Processing Society of Japan, 38(2):299–309.

T. Hirao, M. Hatayama, S. Yamada, and K. Takeuchi. 2001. Text Summarization Based on Hanning Window and Dependency Structure Analysis. Proc. of the 2nd NTCIR Workshop, pages 349–354.

S. Ikehara, M. Miyazaki, S. Shirai, A. Yokoo, H. Nakaiwa, K. Ogura, Y. Ooyama, and Y. Hayashi. 1997. Goi-Taikei – A Japanese Lexicon (in Japanese). Iwanami Shoten.

H. Isozaki. 2001. Japanese Named Entity Recognition Based on a Simple Rule Generator and Decision Tree Learning. Proc. of the 39th ACL, pages 306–313.

T. Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proc. of ECML, pages 137–142.

T. Kudo and Y. Matsumoto. 2000. Japanese Dependency Structure Analysis Based on Support Vector Machines. Proc. of EMNLP and VLC, pages 18–25.

T. Kudo and Y. Matsumoto. 2001. Chunking with Support Vector Machines. Proc. of the 2nd NAACL, pages 192–199.

J. Kupiec, J. Pedersen, and F. Chen. 1995. A Trainable Document Summarizer. Proc. of the 18th ACM-SIGIR, pages 68–73.

Chin-Yew Lin. 1999. Training a Selection Function for Extraction. Proc. of the 18th ACM-CIKM, pages 55–62.

H. Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2):159–165.

I. Mani and E. Bloedorn. 1998. Machine Learning of Generic and User-Focused Summarization. Proc. of the 15th AAAI, pages 821–826.

C. Nobata, S. Sekine, M. Murata, K. Uchimoto, M. Utiyama, and H. Isahara. 2001. Sentence Extraction System Assembling Multiple Evidence. Proc. of the 2nd NTCIR Workshop, pages 319–324.

T. Nomoto and Y. Matsumoto. 1997. The Reliability of Human Coding and Effects on Automatic Abstracting (in Japanese). The Special Interest Group Notes of IPSJ (NL-120-11), pages 71–76.

M. Okumura, Y. Haraguchi, and H. Mochizuki. 1999. Some Observations on Automatic Text Summarization Based on Decision Tree Learning (in Japanese). Proc. of the 59th National Convention of IPSJ (5N-2), pages 393–394.

J. Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

S. Sekine and Y. Eriguchi. 2000. Japanese Named Entity Extraction Evaluation – Analysis of Results. Proc. of the 18th COLING, pages 1106–1110.

V. Vapnik. 1995. The Nature of Statistical Learning Theory. Springer, New York.

K. Zechner. 1996. Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences. Proc. of the 16th COLING, pages 986–989.
