Efficient Support Vector Classifiers
for Named Entity Recognition
Hideki Isozaki and Hideto Kazawa
NTT Communication Science Laboratories
Nippon Telegraph and Telephone Corporation
2-4 Hikari-dai, Seika-cho, Soraku-gun, Kyoto, 619-0237, Japan
{isozaki,kazawa}@cslab.kecl.ntt.co.jp
Abstract
Named Entity (NE) recognition is a task in which
proper nouns and numerical information are ex-
tracted from documents and are classified into cat-
egories such as person, organization, and date. It
is a key technology of Information Extraction and
Open-Domain Question Answering. First, we show
that an NE recognizer based on Support Vector Ma-
chines (SVMs) gives better scores than conventional
systems. However, off-the-shelf SVM classifiers are
too inefficient for this task. Therefore, we present a
method that makes the system substantially faster.
This approach can also be applied to other simi-
lar tasks such as chunking and part-of-speech tag-
ging. We also present an SVM-based feature selec-
tion method and an efficient training method.
1 Introduction
Named Entity (NE) recognition is a task in which
proper nouns and numerical information in a docu-
ment are detected and classified into categories such
as person, organization, and date. It is a key technol-
ogy of Information Extraction and Open-Domain
Question Answering (Voorhees and Harman, 2000).
We are building a trainable Open-Domain Question
Answering System called SAIQA-II. In this paper,
we show that an NE recognizer based on Support
Vector Machines (SVMs) gives better scores than
conventional systems. SVMs have given high per-
formance in various classification tasks (Joachims,
1998; Kudo and Matsumoto, 2001).
However, it turned out that off-the-shelf SVM
classifiers are too inefficient for NE recognition.
The recognizer runs at a rate of only 85 bytes/sec
on an Athlon 1.3 GHz Linux PC, while rule-based
systems (e.g., Isozaki, (2001)) can process several
kilobytes in a second. The major reason is the
inefficiency of SVM classifiers. There are other
reports on the slowness of SVM classifiers. An-
other SVM-based NE recognizer (Yamada and Mat-
sumoto, 2001) is 0.8 sentences/sec on a Pentium III
933 MHz PC. An SVM-based part-of-speech (POS)
tagger (Nakagawa et al., 2001) is 20 tokens/sec on
an Alpha 21164A 500 MHz processor. It is difficult
to use such slow systems in practical applications.
In this paper, we present a method that makes the
NE system substantially faster. This method can
also be applied to other tasks in natural language
processing such as chunking and POS tagging. Another problem with SVMs is their incomprehensibility: it is not clear which features are important or how they work. The above method is also useful for
finding useless features. We also mention a method
to reduce training time.
1.1 Support Vector Machines
Suppose we have a set of training data for a two-class problem: $(x_1, y_1), \ldots, (x_\ell, y_\ell)$, where $x_i \in \mathbf{R}^D$ is a feature vector of the $i$-th sample in the training data and $y_i \in \{+1, -1\}$ is the label of the sample. The goal is to find a decision function that accurately predicts $y$ for an unseen $x$. A non-linear SVM classifier gives a decision function $f(x) = \mathrm{sign}(g(x))$ for an input vector $x$, where
$$g(x) = \sum_{i=1}^{m} w_i K(x, z_i) + b.$$
Here, $f(x) = +1$ means $x$ is a member of a certain class and $f(x) = -1$ means $x$ is not a member. The vectors $z_i$ are called support vectors and are representatives of training examples; $m$ is the number of support vectors. Therefore, the computational complexity of $g(x)$ is proportional to $m$. Support vectors and other constants are determined by solving a certain quadratic programming problem. $K(x, z)$ is a kernel that implicitly maps vectors into a higher-dimensional space. Typical kernels use dot products: $K(x, z) = k(x \cdot z)$. A polynomial kernel of degree $d$ is given by $k(u) = (1 + u)^d$.
[Figure 1: Support Vector Machine. Filled points mark positive examples, open points mark negative examples, and the boxed points are the support vectors.]
We can use various kernels, and the design of an appropriate kernel for a particular application is an important research issue.
Figure 1 shows a linearly separable case. The decision hyperplane defined by $g(x) = 0$ separates positive and negative examples by the largest margin. The solid line indicates the decision hyperplane, and the two parallel dotted lines indicate the margin between positive and negative examples. Since such a separating hyperplane may not exist, a positive parameter $C$ is introduced to allow misclassifications. See Vapnik (1995).
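As a concrete illustration (ours, not code from the paper), the decision function above can be evaluated as follows; variable names such as support_vectors and weights are assumptions.

```python
import numpy as np

def polynomial_kernel(x, z, degree=2):
    """k(x . z) = (1 + x . z)^d, the polynomial kernel discussed above."""
    return (1.0 + np.dot(x, z)) ** degree

def g(x, support_vectors, weights, bias, degree=2):
    """g(x) = sum_i w_i K(x, z_i) + b; the cost grows linearly with the
    number of support vectors m."""
    return sum(w_i * polynomial_kernel(x, z_i, degree)
               for w_i, z_i in zip(weights, support_vectors)) + bias

def classify(x, support_vectors, weights, bias, degree=2):
    """f(x) = sign(g(x)): +1 means member of the class, -1 means non-member."""
    return 1 if g(x, support_vectors, weights, bias, degree) >= 0 else -1
```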
1.2 SVM-based NE recognition
As far as we know, the first SVM-based NE system
was proposed by Yamada et al. (2001) for Japanese. Their system is an extension of Kudo's chunking system (Kudo and Matsumoto, 2001) that gave the best performance at the CoNLL-2000 shared task. In their
system, every word in a sentence is classified se-
quentially from the beginning or the end of a sen-
tence. However, since Yamada et al. did not compare it with other methods under the same conditions, it is not clear whether their NE system is better than others.
Here, we show that our SVM-based NE system is
more accurate than conventional systems. Our sys-
tem uses the Viterbi search (Allen, 1995) instead of
sequential determination.
For training, we use ‘CRL data’, which was pre-
pared for IREX (Information Retrieval and Extraction Exercise, http://cs.nyu.edu/cs/projects/proteus/irex; Sekine and Eriguchi (2000)). It has
about 19,000 NEs in 1,174 articles. We also use
additional data by Isozaki (2001). Both datasets
are based on Mainichi Newspaper’s 1994 and 1995
CD-ROMs. We use IREX’s formal test data called
GENERAL that has 1,510 named entities in 71 ar-
ticles from Mainichi Newspaper of 1999. Systems
are compared in terms of GENERAL's F-measure, which is the harmonic mean of 'recall' and 'precision' and is defined as follows.
Recall = M/(the number of correct NEs),
Precision = M/(the number of NEs extracted by a
system),
where M is the number of NEs correctly extracted
and classified by the system.
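In other words, writing $P$ for Precision and $R$ for Recall, the F-measure used here is the harmonic mean
$$F = \frac{2\,P\,R}{P + R}.$$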
We developed an SVM-based NE system by fol-
lowing our NE system based on maximum en-
tropy (ME) modeling (Isozaki, 2001). We sim-
ply replaced the ME model with SVM classifiers.
The above datasets are processed by a morpholog-
ical analyzer, ChaSen 2.2.1 (http://chasen.aist-nara.ac.jp/). It tokenizes a sen-
tence into words and adds POS tags. ChaSen uses
about 90 POS tags such as common-noun and
location-name. Since most unknown words are
proper nouns, ChaSen’s parameters for unknown
words are modified for better results. Then, a char-
acter type tag is added to each word. It uses 17
character types such as all-kanji and small-
integer. See Isozaki (2001) for details.
Now, Japanese NE recognition is solved by the
classification of words (Sekine et al., 1998; Borth-
wick, 1999; Uchimoto et al., 2000). For instance,
the words in “President George Herbert Bush said
Clinton is . . . ” are classified as follows: “Presi-
dent” = OTHER, “George” = PERSON-BEGIN, “Her-
bert” = PERSON-MIDDLE, “Bush” = PERSON-END,
“said” = OTHER, “Clinton” = PERSON-SINGLE, “is”
= OTHER. In this way, the first word of a person’s
name is labeled as PERSON-BEGIN. The last word is
labeled as PERSON-END. Other words in the name
are PERSON-MIDDLE. If a person’s name is ex-
pressed by a single word, it is labeled as PERSON-
SINGLE. If a word does not belong to any named
entities, it is labeled as OTHER. Since IREX de-
fines eight NE classes, words are classified into 33
($= 8 \times 4 + 1$) categories.
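For illustration (ours, not from the paper), the 33 word classes can be enumerated as follows; the eight IREX NE class names used here are our assumption.

```python
# Hypothetical enumeration of the 33 word classes (8 NE classes x 4
# position tags + OTHER), assuming the eight IREX NE categories.
NE_CLASSES = ["ORGANIZATION", "PERSON", "LOCATION", "ARTIFACT",
              "DATE", "TIME", "MONEY", "PERCENT"]
POSITIONS = ["BEGIN", "MIDDLE", "END", "SINGLE"]

WORD_CLASSES = [f"{ne}-{pos}" for ne in NE_CLASSES for pos in POSITIONS]
WORD_CLASSES.append("OTHER")
assert len(WORD_CLASSES) == 33   # = 8 * 4 + 1
```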
Each sample is represented by 15 features be-
cause each word has three features (part-of-speech
tag, character type, and the word itself), and two
preceding words and two succeeding words are also
used for context dependence. Although infrequent
features are usually removed to prevent overfitting,
we use all features because SVMs are robust. Each
sample is represented by a long binary vector, i.e.,
a sequence of 0 (false) and 1 (true). For instance,
“Bush” in the above example is represented by a vector $x = (x[1], \ldots, x[D])$ described below. Only 15 elements are 1.
x[1] = 0       // Current word is not ‘Alice’
x[2] = 1       // Current word is ‘Bush’
x[3] = 0       // Current word is not ‘Charlie’
   :
x[15029] = 1   // Current POS is a proper noun
x[15030] = 0   // Current POS is not a verb
   :
x[39181] = 0   // Previous word is not ‘Henry’
x[39182] = 1   // Previous word is ‘Herbert’
   :
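A minimal sketch (ours) of how such a sparse binary vector could be built from a five-word window; the helper names and feature-name scheme are illustrative assumptions, not the paper's implementation.

```python
def window_features(words, pos_tags, char_types, t):
    """Return the 15 active feature names for the word at position t:
    word / POS tag / character type for positions t-2 .. t+2."""
    feats = []
    for offset in (-2, -1, 0, +1, +2):
        i = t + offset
        w = words[i] if 0 <= i < len(words) else "<PAD>"
        p = pos_tags[i] if 0 <= i < len(pos_tags) else "<PAD>"
        c = char_types[i] if 0 <= i < len(char_types) else "<PAD>"
        feats += [f"word[{offset}]={w}",
                  f"pos[{offset}]={p}",
                  f"ctype[{offset}]={c}"]
    return feats

def to_sparse_vector(feats, feature_index):
    """Map feature names to the indices of the 1-valued elements of x."""
    return sorted(feature_index[f] for f in feats if f in feature_index)
```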
Here, we have to consider the following prob-
lems. First, SVMs can solve only a two-class prob-
lem. Therefore, we have to reduce the above multi-
class problem to a group of two-class problems.
Second, we have to consider consistency among
word classes in a sentence. For instance, a word
classified as PERSON-BEGIN should be followed
by PERSON-MIDDLE or PERSON-END. It implies
that the system has to determine the best combina-
tions of word classes from numerous possibilities.
Here, we solve these problems by combining exist-
ing methods.
There are a few approaches to extending SVMs to cover $K$-class problems. Here, we employ the "one class versus all others" approach. That is, each classifier $f_c(x)$ is trained to distinguish members of a class $c$ from non-members. In this method, two or more classifiers may give $+1$ to an unseen vector, or no classifier may give $+1$. One common way to avoid such situations is to compare the $g_c(x)$ values and to choose the class index $c$ of the largest $g_c(x)$.
The consistency problem is solved by the Viterbi search. Since SVMs do not output probabilities, we use the SVM+sigmoid method (Platt, 2000). That is, we use a sigmoid function $s(u) = 1/(1 + \exp(-\beta u))$ to map $g_c(x)$ to a probability-like value. The output of the Viterbi search is adjusted by a postprocessor for wrong word boundaries. The adjustment rules are also statistically determined (Isozaki, 2001).
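The following Python sketch (ours; the transition rule and helper names are simplified assumptions, not the paper's code) shows how the per-class scores $g_c(x)$ can be mapped to probability-like values and combined by a Viterbi search over consistent label sequences.

```python
import math

def sigmoid(u, beta=100.0):
    """Map an SVM score g_c(x) to a probability-like value (Platt, 2000)."""
    t = max(-60.0, min(60.0, -beta * u))   # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(t))

def consistent(prev, cur):
    """Toy consistency check: X-BEGIN or X-MIDDLE must be followed by
    X-MIDDLE or X-END of the same class (assumed rules)."""
    if prev.endswith("-BEGIN") or prev.endswith("-MIDDLE"):
        stem = prev.rsplit("-", 1)[0]
        return cur in (stem + "-MIDDLE", stem + "-END")
    return not (cur.endswith("-MIDDLE") or cur.endswith("-END"))

def viterbi(scores, classes):
    """scores[t][c]: sigmoid-mapped score of class c for word t.
    Returns the best consistent label sequence (assumes one exists)."""
    best = [{c: (math.log(scores[0][c] + 1e-10), [c]) for c in classes}]
    for t in range(1, len(scores)):
        column = {}
        for c in classes:
            cands = [(lp + math.log(scores[t][c] + 1e-10), path + [c])
                     for prev, (lp, path) in best[-1].items()
                     if consistent(prev, c)]
            if cands:
                column[c] = max(cands)
        best.append(column)
    return max(best[-1].values())[1]
```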
1.3 Comparison of NE recognizers
We use a fixed value $\beta = 100$. F-measures are not very sensitive to $\beta$ unless $\beta$ is too small. When we used 1,038,986 training vectors, GENERAL's F-measure was 89.64% for $\beta = 0.1$ and 90.03% for $\beta = 100$. We employ the quadratic kernel ($d = 2$) because it gives the best results. Polynomial kernels of degree 1, 2, and 3 resulted in 83.03%, 88.31%, and 87.04% respectively when we used 569,994 training vectors.

[Figure 2: F-measures of NE systems. The figure plots F-measure (%, 76 to 90) against the number of NEs in the training data (in thousands, 0 to 120) for the three systems RG+DT, ME, and SVM; the size of the CRL data is marked on the horizontal axis.]
Figure 2 compares NE recognizers in terms of
GENERAL’s F-measures. ‘SVM’ in the figure in-
dicates F-measures of our system trained by Kudo’s
TinySVM-0.07 (http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM) with $C = 0.1$. It attained 85.04%
when we used only CRL data. ‘ME’ indicates our
ME system and ‘RG+DT’ indicates a rule-based
machine learning system (Isozaki, 2001). Accord-
ing to this graph, ‘SVM’ is better than the other sys-
tems.
However, SVM classifiers are too slow. The well-known SVM-Light 3.50 (Joachims, 1999) took 1.2
days to classify 569,994 vectors derived from 2 MB
documents. That is, it runs at only 19 bytes/sec.
TinySVM’s classifier seems best optimized among
publicly available SVM toolkits, but it still works at
only 92 bytes/sec.
2 Efficient Classifiers
In this section, we investigate the cause of this in-
efficiency and propose a solution. All experiments
are conducted for training data of 569,994 vectors.
The total size of the original news articles was 2 MB
and the number of NEs was 39,022. According to
the definition of $g(x)$, a classifier has to process $m$ support vectors for each $x$. Table 1 shows $m$ for different word classes. According to this table, classification of one word requires $x$'s dot products with 228,306 support vectors in 33 classifiers. Therefore, the classifiers are very slow. We have never seen such large values of $m$ in the SVM literature on pattern recognition. The reason for the large $m$ is word features. In other domains such as character recognition, the dimension $D$ is usually fixed. However, in the NE task, $D$ increases monotonically with respect to the size of the training data. Since SVMs learn combinations of features, $m$ tends to be very large. This tendency will hold for other tasks of natural language processing, too.
Here, we focus on the quadratic kernel $k(u) = (1 + u)^2$ that yielded the best score in the above experiments. Suppose $x = (x[1], \ldots, x[D])$ has only $L$ (= 15) non-zero elements. The dot product of $x$ and $z_i = (z_i[1], \ldots, z_i[D])$ is given by $\sum_{j=1}^{D} x[j] z_i[j]$. Hence,
$$(1 + x \cdot z_i)^2 = 1 + 2 \sum_{j=1}^{D} x[j] z_i[j] + \Bigl( \sum_{j=1}^{D} x[j] z_i[j] \Bigr)^2.$$
We can rewrite $g(x)$ as follows:
$$g(x) = w_0 + \sum_{j=1}^{D} \bigl( w_1[j]\,x[j] + w_2[j]\,x[j]^2 \bigr) + \sum_{j=1}^{D-1} \sum_{k=j+1}^{D} w'[j,k]\,x[j]\,x[k],$$
where
$$w_0 = b + \sum_{i=1}^{m} w_i, \qquad
w_1[j] = 2 \sum_{i=1}^{m} w_i z_i[j], \qquad
w_2[j] = \sum_{i=1}^{m} w_i z_i[j]^2, \qquad
w'[j,k] = 2 \sum_{i=1}^{m} w_i z_i[j] z_i[k].$$
For binary vectors, this can be simplified as
$$g(x) = w_0 + \sum_{j:\,x[j]=1} w'_1[j] + \sum_{\substack{j<k \\ x[j]=x[k]=1}} w'[j,k],$$
where
$$w'_1[j] = w_1[j] + w_2[j] = 3 \sum_{i:\,z_i[j]=1} w_i, \qquad
w'[j,k] = 2 \sum_{i:\,z_i[j]=1 \wedge z_i[k]=1} w_i.$$
Now, $g(x)$ can be obtained by summing up $w'_1[j]$ for every non-zero element $x[j]$ and $w'[j,k]$ for every non-zero pair $x[j]\,x[k]$. Accordingly, we only need to add $1 + L + L(L-1)/2$ (= 121) constants to get $g(x)$. Therefore, we can expect this method to be much faster than a naïve implementation that computes tens of thousands of dot products at run time. We call this method 'XQK' (eXpand the Quadratic Kernel).
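A minimal sketch of how XQK could be implemented in Python (ours, not the authors' code; the set-of-indices representation of sparse vectors is an assumption): the tables $w_0$, $w'_1[j]$, and $w'[j,k]$ are precomputed once from the support vectors, and a binary sparse vector is then classified by table lookups alone.

```python
from collections import defaultdict

def expand_quadratic_kernel(support_vectors, weights, bias):
    """Precompute XQK tables for g(x) = sum_i w_i (1 + x . z_i)^2 + b,
    where each support vector is the set of its non-zero feature indices."""
    w0 = bias + sum(weights)
    w1 = defaultdict(float)        # w'_1[j] = 3 * sum_{i: z_i[j]=1} w_i
    w_pair = defaultdict(float)    # w'[j,k] = 2 * sum_{i: z_i[j]=z_i[k]=1} w_i
    for w_i, z_i in zip(weights, support_vectors):
        idx = sorted(z_i)
        for a, j in enumerate(idx):
            w1[j] += 3.0 * w_i
            for k in idx[a + 1:]:
                w_pair[(j, k)] += 2.0 * w_i
    return w0, w1, w_pair

def g_xqk(x, w0, w1, w_pair):
    """Evaluate g(x) for a binary sparse vector x (set of active indices)
    with 1 + L + L(L-1)/2 additions instead of m dot products."""
    idx = sorted(x)
    total = w0
    for a, j in enumerate(idx):
        total += w1.get(j, 0.0)
        for k in idx[a + 1:]:
            total += w_pair.get((j, k), 0.0)
    return total
```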
Table 1 compares TinySVM and XQK in terms of the CPU time taken to apply the 33 classifiers to the training data. Classes are sorted by $m$. Small numbers in parentheses indicate the initialization time for reading the support vectors $\{z_i\}$ and allocating memory. XQK requires a longer initialization time in order to prepare $w'_1$ and $w'$. For instance, TinySVM took 11,490.26 seconds (3.2 hours) in total for applying OTHER's classifier to all vectors in the training data. Its initialization phase took 2.13 seconds, and all vectors in the training data were classified in 11,488.13 (= 11,490.26 - 2.13) seconds. On the other hand, XQK took 225.28 seconds in total, and its initialization phase took 174.17 seconds. Therefore, 569,994 vectors were classified in 51.11 seconds. The initialization time can be disregarded because we can reuse the above coefficients. Consequently, XQK is 224.8 (= 11,488.13/51.11) times faster than TinySVM for OTHER. TinySVM took 6 hours to process all the word classes, whereas XQK took only 17 minutes. XQK is 102 times faster than SVM-Light 3.50, which took 1.2 days.
3 Removal of useless features
XQK makes the classifiers faster, but the memory requirement increases from $O(\sum_{i=1}^{m} L_i)$ to $O(\sum_{i=1}^{m} L_i (L_i + 1)/2)$, where $L_i$ (= 15) is the number of non-zero elements in $z_i$. Therefore, removal of useless features would be beneficial. Conventional SVMs do not tell us how an individual feature works because weights are given not to features but to $K(x, z_i)$. However, the above weights ($w'_1$ and $w'$) clarify how a feature or a feature pair works. We can use this fact for feature selection after the training.
We simplify $f(x)$ by removing all features $j$ that satisfy
$$\max\bigl( |w'_1[j]|,\ \max_k |w'[j,k]|,\ \max_k |w'[k,j]| \bigr) < \theta.$$
The largest $\theta$ that does not change the number of misclassifications on the training data is found by a binary search for each word class. We call this method 'XQK-FS' (XQK with Feature Selection). This approximation slightly degraded GENERAL's F-measure from 88.31% to 88.03%.
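A sketch (ours) of XQK-FS on top of the tables from the previous sketch: a feature is pruned when its largest associated weight magnitude falls below $\theta$, and $\theta$ is raised by binary search as long as the number of training misclassifications does not grow.

```python
def max_weight_per_feature(w1, w_pair):
    """max(|w'_1[j]|, max_k |w'[j,k]|, max_k |w'[k,j]|) for each feature j."""
    strength = {j: abs(v) for j, v in w1.items()}
    for (j, k), v in w_pair.items():
        strength[j] = max(strength.get(j, 0.0), abs(v))
        strength[k] = max(strength.get(k, 0.0), abs(v))
    return strength

def g_pruned(x, kept, w0, w1, w_pair):
    """Evaluate g(x) using only the selected features."""
    idx = sorted(j for j in x if j in kept)
    total = w0
    for a, j in enumerate(idx):
        total += w1.get(j, 0.0)
        for k in idx[a + 1:]:
            total += w_pair.get((j, k), 0.0)
    return total

def select_features(w0, w1, w_pair, training_set, n_errors_original):
    """Binary search over candidate thresholds for the largest theta that
    does not increase the number of training misclassifications.
    training_set is a list of (active-index set, label in {+1, -1})."""
    strength = max_weight_per_feature(w1, w_pair)
    candidates = sorted(set(strength.values()))
    lo, hi, best = 0, len(candidates) - 1, set(strength)
    while lo <= hi:
        mid = (lo + hi) // 2
        theta = candidates[mid]
        kept = {j for j, s in strength.items() if s >= theta}
        errors = sum(1 for x, y in training_set
                     if (1 if g_pruned(x, kept, w0, w1, w_pair) >= 0 else -1) != y)
        if errors <= n_errors_original:
            best, lo = kept, mid + 1     # theta can still grow
        else:
            hi = mid - 1
    return best
```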
Table 2 shows the reduction of features that appear in support vectors. Classes are sorted by the number of original features. For instance, OTHER has 56,220 features in its support vectors. According to the binary search, its performance did not change even when the number of features was reduced to 21,852 by the selected threshold $\theta$.
Table 1: Reduction of CPU time (in seconds) by XQK
word class              m        TinySVM (init)      XQK (init)          speed up   SVM-Light
OTHER                   64,970   11,488.13 (2.13)    51.11 (174.17)      224.8      29,986.52
ARTIFACT-MIDDLE         14,171    1,372.85 (0.51)    41.32 (14.98)        33.2       6,666.26
LOCATION-SINGLE         13,019    1,209.29 (0.47)    38.24 (11.41)        31.6       6,100.54
ORGANIZATION-MIDDLE     12,050      987.39 (0.44)    37.93 (11.70)        26.0       5,570.82
  :                         :            :                :                 :             :
TOTAL                  228,306   21,754.23 (9.83)    1,019.20 (281.28)    21.3     104,466.31
Table 2: Reduction of features by XQK-FS
word class             number of features          number of non-zero weights        seconds
OTHER                  56,220 → 21,852 (38.9%)      1,512,827 →   892,228 (59.0%)      42.31
ARTIFACT-MIDDLE        22,090 →  4,410 (20.0%)        473,923 →   164,632 (34.7%)      30.47
LOCATION-SINGLE        17,169 →  3,382 (19.7%)        366,961 →   123,808 (33.7%)      27.72
ORGANIZATION-MIDDLE    17,123 →  9,959 (58.2%)        372,784 →   263,695 (70.7%)      31.02
ORGANIZATION-END       15,214 →  3,073 (20.2%)        324,514 →   112,307 (34.6%)      26.87
  :                         :                              :                               :
TOTAL                 307,721 → 75,455 (24.5%)      6,669,664 → 2,650,681 (39.7%)      763.10
The total number of features was reduced by 75%
and that of weights was reduced by 60%. The ta-
ble also shows CPU time for classification by the
selected features. XQK-FS is 28.5 (=21754.23/
763.10) times faster than TinySVM. Although the
reduction of features is significant, the reduction of
CPU time is moderate, because most of the reduced
features are infrequent ones. However, simple re-
duction of infrequent features without considering
weights damages the system’s performance. For in-
stance, when we removed 5,066 features that ap-
peared four times or less in the training data, the
modified classifier for ORGANIZATION-END mis-
classified 103 training examples, whereas the origi-
nal classifier misclassified only 19 examples. On the
other hand, XQK-FS removed 12,141 features with-
out an increase in misclassifications for the training
data.
XQK can be easily extended to a more general quadratic kernel $k(u) = (c_0 + c_1 u)^2$ and to non-binary sparse vectors. XQK-FS can be used to select useful features before training with other kernels. As mentioned above, we conducted an experiment with the cubic kernel ($d = 3$) by using all features. When we trained the cubic kernel classifiers by using only the features selected by XQK-FS, TinySVM's classification time was reduced by 40% because $m$ was reduced by 38%. GENERAL's F-measure was
slightly improved from 87.04% to 87.10%. On
the other hand, when we trained the cubic ker-
nel classifiers by using only features that appeared
three times or more (without considering weights),
TinySVM’s classification time was reduced by only
14% and the F-measure was slightly degraded to
86.85%. Therefore, we expect XQK-FS to be use-
ful as a feature selection method for other kernels
when such kernels give much better results than the
quadratic kernel.
4 Reduction of training time
Since training of 33 classifiers also takes a long
time, it is difficult to try various combinations of pa-
rameters and features. Here, we present a solution
for this problem. During training, the calculation of $k(x' \cdot x_1), k(x' \cdot x_2), \ldots, k(x' \cdot x_\ell)$ for various $x'$ is dominant. Conventional systems save time
by caching the results. By analyzing TinySVM’s
classifier, we found that they can be calculated more
efficiently.
For sparse vectors, most SVM classifiers (e.g., SVM-Light) use a sparse dot product algorithm (Platt, 1999) that compares the non-zero elements of $x$ and those of $z_i$ to get $k(x \cdot z_i)$ in $g(x)$. However, $x$ is common to all dot products in $k(x \cdot z_1), \ldots, k(x \cdot z_m)$. Therefore, we can implement a faster classifier that calculates them concurrently. TinySVM's classifier prepares a list fi2si[$j$] that contains all $z_i$s whose $j$-th coordinates are not zero. In addition, counters for $x \cdot z_1, \ldots, x \cdot z_m$ are prepared because dot products of binary vectors are integers. Then, for each non-zero $x[j]$, the counters are incremented for all $z_i \in$ fi2si[$j$]. By checking only the members of fi2si[$j$] for non-zero $x[j]$, the classifier is not bothered by fruitless cases: $x[j] = 0, z_i[j] \neq 0$ or $x[j] \neq 0, z_i[j] = 0$. Therefore, TinySVM's classifier is faster than other classifiers. This method is applicable to any kernel based on dot products.
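A sketch (ours; the name fi2si follows the description above, not TinySVM's source) of this shared-computation trick for binary feature vectors:

```python
from collections import defaultdict

def build_fi2si(support_vectors):
    """fi2si[j] lists the indices i of all support vectors z_i with z_i[j] = 1."""
    fi2si = defaultdict(list)
    for i, z_i in enumerate(support_vectors):
        for j in z_i:
            fi2si[j].append(i)
    return fi2si

def all_dot_products(x, fi2si, num_sv):
    """Compute x . z_i for every support vector at once: for each active
    feature j of x, bump the counter of every z_i that also contains j."""
    counters = [0] * num_sv
    for j in x:                      # x is the set of active feature indices
        for i in fi2si.get(j, ()):
            counters[i] += 1
    return counters

def g_shared(x, fi2si, weights, bias, degree=2):
    """g(x) = sum_i w_i (1 + x . z_i)^degree + b using the shared counters."""
    dots = all_dot_products(x, fi2si, len(weights))
    return sum(w_i * (1 + d) ** degree for w_i, d in zip(weights, dots)) + bias
```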
For the training phase, we can build fi2si$'$[$j$], which contains all $x_i$s whose $j$-th coordinates are not zero. Then, $k(x' \cdot x_1), \ldots, k(x' \cdot x_\ell)$ can be efficiently calculated because $x'$ is common. This im-
provement is effective especially when the cache is
small and/or the training data is large. When we
used a 200 MB cache, the improved system took
only 13 hours for training by the CRL data, while
TinySVM and SVM-Light took 30 hours and 46
hours respectively for the same cache size. Al-
though we have examined other SVM toolkits, we
could not find any system that uses this approach in
the training phase.
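The same kind of inverted-index pass yields a whole kernel row during training; this sketch (ours, assuming binary training vectors and an index built over the training set) computes $k(x' \cdot x_1), \ldots, k(x' \cdot x_\ell)$ at once.

```python
def kernel_row(x_prime, fi2si_train, num_train, degree=2):
    """k(x' . x_i) for every training vector x_i in one pass over the
    inverted index fi2si_train (feature j -> list of training-vector ids)."""
    dots = [0] * num_train
    for j in x_prime:                   # x_prime is a set of active indices
        for i in fi2si_train.get(j, ()):
            dots[i] += 1
    return [(1 + d) ** degree for d in dots]
```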
5 Discussion
The above methods can also be applied to other
tasks in natural language processing such as chunk-
ing and POS tagging because the quadratic kernels
give good results.
Utsuro et al. (2001) report that a combination
of two NE recognizers attained F = 84.07%, but
wrong word boundary cases are excluded. Our sys-
tem attained 85.04% and word boundaries are auto-
matically adjusted. Yamada et al. (2001) also report that $d = 2$ is best. Although their system attained F = 83.7% for 5-fold cross-validation of the CRL data (Yamada and Matsumoto, 2001), our system attained 86.8%. Since we followed
Isozaki’s implementation (Isozaki, 2001), our sys-
tem is different from Yamada’s system in the fol-
lowing points: 1) adjustment of word boundaries, 2)
ChaSen’s parameters for unknown words, 3) char-
acter types, 4) use of the Viterbi search.
For efficient classification, Burges and Schölkopf (1997) propose an approximation method that uses "reduced set vectors" instead of support vectors. Since the number of reduced set vectors is smaller than $m$, classifiers become more efficient, but the computational cost of determining the vectors is very large. Osuna and Girosi (1999) propose two methods. The first method approximates $g(x)$ by support vector regression, but this method is applicable only when $C$ is large enough. The second method reformulates the training phase. Our approach is simpler than these methods. Downs et al. (2001) try to reduce the number of support vectors by using linear dependence.
We can also reduce the run-time complexity of
a multi-class problem by cascading SVMs in the
form of a binary tree (Schwenker, 2001) or a direct
acyclic graph (Platt et al., 2000). Yamada and Mat-
sumoto (2001) applied such a method to their NE
system and reduced its CPU time by 39%. This ap-
proach can be combined with our SVM classifiers.
NE recognition can be regarded as a variable-
length multi-class problem. For this kind of prob-
lem, probability-based kernels are studied for more
theoretically well-founded methods (Jaakkola and
Haussler, 1998; Tsuda et al., 2001; Shimodaira et
al., 2001).
6 Conclusions
Our SVM-based NE recognizer attained F =
90.03%. This is the best score, as far as we know.
Since it was too slow, we made SVMs faster. The
improved classifier is 21 times faster than TinySVM
and 102 times faster than SVM-Light. The im-
proved training program is 2.3 times faster than
TinySVM and 3.5 times faster than SVM-Light.
We also presented an SVM-based feature selection
method that removed 75% of features. These meth-
ods can also be applied to other tasks such as chunk-
ing and POS tagging.
Acknowledgment
We would like to thank Yutaka Sasaki for the train-
ing data. We thank members of Knowledge Pro-
cessing Research Group for valuable comments and
discussion. We also thank Shigeru Katagiri and
Ken-ichiro Ishii for their support.

References

James Allen. 1995. Natural Language Understand-
ing, 2nd Ed. Benjamin Cummings.

Andrew Borthwick. 1999. A Maximum Entropy
Approach to Named Entity Recognition. Ph.D.
thesis, New York University.

Chris J. C. Burges and Bernhard Schölkopf. 1997.
Improving speed and accuracy of support vector
learning machines. In Advances in Neural Infor-
mation Processing Systems 9, pages 375–381.

Tom Downs, Kevin E. Gates, and Annette Masters.
2001. Exact simplification of support vector so-
lutions. Journal of Machine Learning Research,
2:293–297.

Hideki Isozaki. 2001. Japanese named entity
recognition based on a simple rule generator and
decision tree learning. In Proceedings of Associ-
ation for Computational Linguistics, pages 306–
313.

Tommi S. Jaakkola and David Haussler. 1998. Ex-
ploiting generative models in discriminative clas-
sifiers. In M. S. Kearns, S. A. Solla, and D. A.
Cohn, editors, Advances in Neural Information
Processing Systems 11. MIT Press.

Thorsten Joachims. 1998. Text categorization with
support vector machines: Learning with many
relevant features. In Proceedings of the European
Conference on Machine Learning.

Thorsten Joachims. 1999. Making large-scale
support vector machine learning practical. In
B. Schölkopf, C. J. C. Burges, and A. J. Smola,
editors, Advances in Kernel Methods, chapter 16,
pages 170–184. MIT Press.

Taku Kudo and Yuji Matsumoto. 2001. Chunking
with support vector machines. In Proceedings of
NAACL, pages 192–199.

Tetsuji Nakagawa, Taku Kudoh, and Yuji Mat-
sumoto. 2001. Unknown word guessing and
part-of-speech tagging using support vector ma-
chines. In Proceedings of the Sixth Natural Lan-
guage Processing Pacific Rim Symposium, pages
325–331.

Edgar E. Osuna and Federico Girosi. 1999. Re-
ducing the run-time complexity in support vector
machines. In B. Schölkopf, C. J. C. Burges, and
A. J. Smola, editors, Advances in Kernel Meth-
ods, chapter 16, pages 271–283. MIT Press.

John C. Platt, Nello Cristianini, and John Shawe-
Taylor. 2000. Large margin DAGs for multiclass
classification. In Advances in Neural Informa-
tion Processing Systems 12, pages 547–553. MIT
Press.

John C. Platt. 1999. Fast training of support vector
machines using sequential minimal optimization.
In B. Schölkopf, C. J. C. Burges, and A. J. Smola,
editors, Advances in Kernel Methods, chapter 12,
pages 185–208. MIT Press.

John C. Platt. 2000. Probabilities for SV machines.
In A. J. Smola, P. L. Bartlett, B. Schölkopf,
and D. Schuurmans, editors, Advances in Large
Margin Classifiers, chapter 5, pages 61–71. MIT
Press.

Friedhelm Schwenker. 2001. Solving multi-class
pattern recognition problems with tree-structured
support vector machines. In B. Radig and S. Flor-
czyk, editors, Pattern Recognition, Proceedings
of the 23rd Symposium, number 2191 in LNCS,
pages 283–290. Springer.

Satoshi Sekine and Yoshio Eriguchi. 2000.
Japanese named entity extraction evaluation —
analysis of results —. In Proceedings of 18th
International Conference on Computational Lin-
guistics, pages 1106–1110.

Satoshi Sekine, Ralph Grishman, and Hiroyuki
Shinnou. 1998. A decision tree method for find-
ing and classifying names in Japanese texts. In
Proceedings of the Sixth Workshop on Very Large
Corpora.

Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Naka,
and Shigeki Sagayama. 2001. Support vec-
tor machine with dynamic time-alignment ker-
nel for speech recognition. In Proceedings of Eu-
rospeech, pages 1841–1844.

Koji Tsuda, M. Kawanabe, G. Rätsch, S. Sonnen-
burg, and K. Müller. 2001. A new discriminative
kernel from probabilistic models. In Advances in
Neural Information Processing Systems 14.

Kiyotaka Uchimoto, Qing Ma, Masaki Murata, Hi-
romi Ozaku, Masao Utiyama, and Hitoshi Isa-
hara. 2000. Named entity extraction based on
a maximum entropy model and transformation
rules (in Japanese). Journal of Natural Language
Processing, 7(2):63–90.

Takehito Utsuro, Manabu Sassano, and Kiyotaka
Uchimoto. 2001. Learning to combine outputs
of multiple Japanese named entity extractors (in
Japanese). In IPSJ SIG notes NL-144-5.

Vladimir N. Vapnik. 1995. The Nature of Statisti-
cal Learning Theory. Springer.

E. M. Voorhees and D. K. Harman, editors. 2000.
Proceedings of the 9th Text Retrieval Conference.

Hiroyasu Yamada and Yuji Matsumoto. 2001. Ap-
plying support vector machine to multi-class clas-
sification problems (in Japanese). In IPSJ SIG
Notes NL-146-6.

Hiroyasu Yamada, Taku Kudoh, and Yuji Mat-
sumoto. 2001. Japanese named entity extraction
using support vector machines (in Japanese). In
IPSJ SIG Notes NL-142-17.
