Chunking with Support Vector Machines
Taku Kudo and Yuji Matsumoto
Graduate School of Information Science,
Nara Institute of Science and Technology
{taku-ku,matsu}@is.aist-nara.ac.jp
Abstract
We apply Support Vector Machines (SVMs) to
identify English base phrases (chunks). SVMs
are known to achieve high generalization perfor-
mance even with input data of high dimensional
feature spaces. Furthermore, by the Kernel princi-
ple, SVMs can carry out training with smaller com-
putational overhead independent of their dimen-
sionality. We apply weighted voting of 8 SVM-
based systems trained with distinct chunk repre-
sentations. Experimental results show that our ap-
proach achieves higher accuracy than previous ap-
proaches.
1 Introduction
Chunking is recognized as a series of processes —
first identifying proper chunks from a sequence of
tokens (such as words), and second classifying these
chunks into some grammatical classes. Various
NLP tasks can be seen as a chunking task. Exam-
ples include English base noun phrase identification
(base NP chunking), English base phrase identifica-
tion (chunking), Japanese chunk (bunsetsu) identi-
fication and named entity extraction. Tokenization
and part-of-speech tagging can also be regarded as
a chunking task, if we treat each character as a
token.
Machine learning techniques are often applied to
chunking, since the task is formulated as estimating
an identifying function from the information (fea-
tures) available in the surrounding context. Various
machine learning approaches have been proposed
for chunking (Ramshaw and Marcus, 1995; Tjong
Kim Sang, 2000a; Tjong Kim Sang et al., 2000;
Tjong Kim Sang, 2000b; Sassano and Utsuro, 2000;
van Halteren, 2000).
Conventional machine learning techniques, such
as Hidden Markov Model (HMM) and Maximum
Entropy Model (ME), normally require a careful
feature selection in order to achieve high accuracy.
They do not provide a method for automatic selec-
tion of given feature sets. Usually, heuristics are
used for selecting effective features and their com-
binations.
New statistical learning techniques such as Sup-
port Vector Machines (SVMs) (Cortes and Vap-
nik, 1995; Vapnik, 1998) and Boosting (Freund and
Schapire, 1996) have been proposed. These tech-
niques take a strategy that maximizes the margin
between critical samples and the separating hyper-
plane. In particular, SVMs achieve high generaliza-
tion even with training data of a very high dimen-
sion. Furthermore, by introducing the Kernel func-
tion, SVMs handle non-linear feature spaces, and
carry out the training considering combinations of
more than one feature.
In the field of natural language processing, SVMs
are applied to text categorization and syntactic de-
pendency structure analysis, and are reported to
have achieved higher accuracy than previous ap-
proaches (Joachims, 1998; Taira and Haruno, 1999;
Kudo and Matsumoto, 2000a).
In this paper, we apply Support Vector Machines
to the chunking task. In addition, in order to achieve
higher accuracy, we apply weighted voting of 8
SVM-based systems which are trained using dis-
tinct chunk representations. For the weighted vot-
ing systems, we introduce a new type of weighting
strategy which is derived from the theoretical basis
of SVMs.
2 Support Vector Machines
2.1 Optimal Hyperplane
Let us define the training samples, each of which
belongs to either the positive or the negative class, as:

$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l) \qquad \mathbf{x}_i \in \mathbf{R}^n,\; y_i \in \{+1, -1\},$

where $\mathbf{x}_i$ is a feature vector of the $i$-th sample, represented
by an $n$-dimensional vector, $y_i$ is the class
(positive $(+1)$ or negative $(-1)$) label of the $i$-th
sample, and $l$ is the number of the given training samples.
Figure 1: Two possible separating hyperplanes (small margin vs. large margin)
In the basic SVM framework, we try to sep-
arate the positive and negative samples by a hyper-
plane expressed as: $(\mathbf{w} \cdot \mathbf{x}) + b = 0$ $(\mathbf{w} \in \mathbf{R}^n, b \in \mathbf{R})$.
SVMs find an “optimal” hyperplane (i.e. an
optimal parameter set for $\mathbf{w}, b$) which separates the
training data into two classes. What does “optimal”
mean? In order to define it, we need to consider
the margin between two classes. Figure 1 illus-
trates this idea. Solid lines show two possible hyper-
planes, each of which correctly separates the train-
ing data into two classes. Two dashed lines paral-
lel to the separating hyperplane indicate the bound-
aries in which one can move the separating hyper-
plane without any misclassification. We call the dis-
tance between those parallel dashed lines the mar-
gin. SVMs find the separating hyperplane which
maximizes its margin. Precisely, the two dashed lines
and the margin ($d$) can be expressed as:

$(\mathbf{w} \cdot \mathbf{x}) + b = \pm 1, \qquad d = 2/\|\mathbf{w}\|.$
To maximize this margin, we should minimize
$\|\mathbf{w}\|$. In other words, this problem becomes equiva-
lent to solving the following optimization problem:

Minimize: $L(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$
Subject to: $y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \geq 1 \quad (i = 1, \ldots, l).$
The training samples which lie on either of the two
dashed lines are called support vectors. It is known
that only the support vectors in given training data
matter. This implies that we can obtain the same de-
cision function even if we remove all training sam-
ples except for the extracted support vectors.
In practice, even in the case where we cannot sep-
arate the training data linearly because of noise
in the training data, etc., we can build a sep-
arating linear hyperplane by allowing some mis-
classifications. Though we omit the details here, we
can build an optimal hyperplane by introducing a
soft margin parameter $C$, which trades off between
the training error and the magnitude of the margin.
Furthermore, SVMs have the potential to carry out
non-linear classification. Though we leave the
details to (Vapnik, 1998), the optimization problem
can be rewritten into a dual form, where all feature
vectors appear only in their dot products. By simply sub-
stituting every dot product of $\mathbf{x}_i$ and $\mathbf{x}_j$ in the dual form
with a certain Kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$, SVMs can
handle non-linear hypotheses. Among the many kinds
of Kernel functions available, we will focus on the
$d$-th polynomial kernel:

$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d.$

Use of $d$-th polynomial kernel functions allows us to
build an optimal separating hyperplane which takes
into account all combinations of features up to $d$.
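To make the kernel concrete, here is a minimal sketch (our illustration, not the authors' implementation; scikit-learn's SVC is assumed in place of the TinySVM package used later) of the 2nd-degree polynomial kernel setting:

```python
import numpy as np
from sklearn.svm import SVC

def polynomial_kernel(x_i, x_j, d=2):
    """K(x_i, x_j) = (x_i . x_j + 1)^d: with binary context features,
    this implicitly enumerates all feature conjunctions up to size d."""
    return (np.dot(x_i, x_j) + 1) ** d

# Equivalent off-the-shelf classifier: gamma=1 and coef0=1 reproduce
# the (x . x' + 1)^2 kernel; C=1 is the soft margin parameter.
clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=1.0)
```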
2.2 Generalization Ability of SVMs
Statistical Learning Theory (Vapnik, 1998) states
that the training error (empirical risk) $R_{emp}$ and the test error
(risk) $R$ satisfy the following theorem.

Theorem 1 (Vapnik) If $h$ ($h < l$) is the VC dimen-
sion of the class of functions implemented by some ma-
chine learning algorithm, then for all functions of
that class, with a probability of at least $1 - \eta$, the
risk is bounded by

$R \leq R_{emp} + \sqrt{\frac{h\,(\ln\frac{2l}{h} + 1) - \ln\frac{\eta}{4}}{l}},$ (1)
where $h$ is a non-negative integer called the Vapnik-
Chervonenkis (VC) dimension, a measure of
the complexity of the given decision function. The
r.h.s. term of (1) is called the VC bound. In order to
minimize the risk, we have to minimize the empir-
ical risk as well as the VC dimension. It is known that
the following theorem holds for the VC dimension $h$
and margin $d$ (Vapnik, 1998).
Theorem 2 (Vapnik) Suppose $n$ is the dimension
of the given training samples, $d$ is the margin, and $D$
is the smallest diameter which encloses all train-
ing samples; then the VC dimension $h$ of the SVMs is
bounded by

$h \leq \min(D^2/d^2, n) + 1.$ (2)

In order to minimize the VC dimension $h$, we have
to maximize the margin $d$, which is exactly the
strategy that SVMs take.
Vapnik gives an alternative bound for the risk.
Theorem 3 (Vapnik) Suppose $E_l$ is an error rate
estimated by the Leave-One-Out procedure; $E_l$ is
bounded as

$E_l \leq \frac{\textit{number of support vectors}}{\textit{number of training samples}}.$ (3)
The Leave-One-Out procedure is a simple method to ex-
amine the risk of the decision function — first, by
removing a single sample from the training data, we
construct the decision function on the basis of the
remaining training data, and then test the removed
sample. In this fashion, we test all $l$ samples of the
training data using $l$ different decision functions. (3)
is a natural consequence bearing in mind that sup-
port vectors are the only factors contributing to the
final decision function. Namely, when every re-
moved support vector becomes an error in the Leave-One-
Out procedure, $E_l$ becomes the r.h.s. term of (3). In
practice, it is known that this bound is less predic-
tive than the VC bound.
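To make bound (3) concrete, here is a small sketch (again assuming scikit-learn's SVC rather than TinySVM, which reports this bound directly):

```python
from sklearn.svm import SVC

def loo_bound(X, y):
    """Upper bound (3) on the leave-one-out error: (#SV) / l.
    X is an (l x n) feature matrix, y holds labels in {+1, -1}."""
    clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=1.0).fit(X, y)
    return len(clf.support_) / X.shape[0]  # support_ indexes the SVs
```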
3 Chunking
3.1 Chunk representation
There are mainly two types of representations for
proper chunks. One is Inside/Outside representa-
tion, and the other is Start/End representation.
1. Inside/Outside
This representation was first introduced in
(Ramshaw and Marcus, 1995), and has been
applied for base NP chunking. This method
uses the following set of three tags for repre-
senting proper chunks.
I Current token is inside of a chunk.
O Current token is outside of any chunk.
B Current token is the beginning of a chunk
which immediately follows another chunk.
Tjong Kim Sang calls this method the IOB1
representation, and introduces three alternative
versions — IOB2, IOE1 and IOE2 (Tjong Kim
Sang and Veenstra, 1999).
IOB2 A B tag is given for every token which
exists at the beginning of a chunk.
Other tokens are the same as IOB1.
IOE1 An E tag is used to mark the last to-
ken of a chunk immediately preceding
another chunk.
IOE2 An E tag is given for every token
which exists at the end of a chunk.
2. Start/End
This method has been used for the Japanese
named entity extraction task, and requires the
following five tags for representing proper
chunks (Uchimoto et al., 2000)1.
1Originally, Uchimoto uses the C/E/U/O/S representation.
However, we rename them as B/I/O/E/S for our purpose, since
we want to keep consistency with the Inside/Start (B/I/O) representation.
IOB1 IOB2 IOE1 IOE2 Start/End
In O O O O O
early I B I I B
trading I I I E E
in O O O O O
busy I B I I B
Hong I I I I I
Kong I I E E E
Monday B B I E S
, O O O O O
gold I B I E S
was O O O O O
Table 1: Example for each chunk representation
B Current token is the start of a chunk con-
sisting of more than one token.
E Current token is the end of a chunk consist-
ing of more than one token.
I Current token is in the middle of a chunk con-
sisting of more than two tokens.
S Current token is a chunk consisting of only
one token.
O Current token is outside of any chunk.
Examples of these five representations are shown
in Table 1.
If we have to identify the grammatical class of
each chunk, we represent them by a pair of an
I/O/B/E/S label and a class label. For example, in
IOB2 representation, B-VP label is given to a to-
ken which represents the beginning of a verb base
phrase (VP).
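These representations are interconvertible through the underlying chunk spans, which is what the voting procedure in Section 3.3 relies on. A minimal sketch (function names are ours) for IOB2 and IOE2:

```python
def iob2_to_spans(tags):
    """Decode IOB2 tags into (start, end) chunk spans."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "B":                      # B opens a chunk (and closes any open one)
            if start is not None:
                spans.append((start, i))
            start = i
        elif t == "O":                    # O closes any open chunk
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def spans_to_ioe2(spans, length):
    """Encode chunk spans in IOE2: E marks every chunk-final token."""
    tags = ["O"] * length
    for s, e in spans:
        for i in range(s, e):
            tags[i] = "I"
        tags[e - 1] = "E"
    return tags
```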
3.2 Chunking with SVMs
Basically, SVMs are binary classifiers, thus we must
extend SVMs to multi-class classifiers in order to
classify three (B,I,O) or more (B,I,O,E,S) classes.
There are two popular methods to extend a binary
classification task to that of $K$ classes. One is the one
class vs. all others method. The idea is to build $K$ classi-
fiers so as to separate one class from all others. The
other is pairwise classification. The idea is to build
$K \times (K-1)/2$ classifiers considering all pairs of
classes, and the final decision is given by their weighted
voting. There are a number of other methods to ex-
tend SVMs to multiclass classifiers. For example,
Dietterich and Bakiri (Dietterich and Bakiri, 1995)
and Allwein (Allwein et al., 2000) introduce a uni-
fying framework for solving the multiclass problem
by reducing them into binary models. However, we
employ the simple pairwise classifiers for the fol-
lowing reasons:
(1) In general, SVMs require $O(n^2) \sim O(n^3)$
training cost (where $n$ is the size of the training data).
Thus, if the size of the training data for individual bi-
nary classifiers is small, we can significantly reduce
the training cost. Although pairwise classifiers tend
to build a larger number of binary classifiers, the
training cost required for the pairwise method is much
more tractable than that of the one vs. all others.
(2) Some experiments (Kreßel, 1999) report that
a combination of pairwise classifiers performs bet-
ter than the one vs. all others.
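A minimal sketch of the pairwise scheme (our illustration; scikit-learn's SVC is assumed as the underlying binary learner, and the class and method names are ours):

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

class PairwiseSVM:
    """K*(K-1)/2 binary SVMs; the final class is chosen by voting."""
    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = sorted(set(y))
        self.clfs_ = {}
        for a, b in combinations(self.classes_, 2):
            mask = (y == a) | (y == b)      # train only on the two classes
            clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=1.0)
            self.clfs_[(a, b)] = clf.fit(X[mask], y[mask])
        return self

    def predict_one(self, x):
        votes = {c: 0 for c in self.classes_}
        for clf in self.clfs_.values():     # each pair casts one vote
            votes[clf.predict(x.reshape(1, -1))[0]] += 1
        return max(votes, key=votes.get)
```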
For the feature sets for actual training and classi-
fication of SVMs, we use all the information avail-
able in the surrounding context, such as the words,
their part-of-speech tags as well as the chunk labels.
More precisely, we give the following features to
identify the chunk label $c_i$ for the $i$-th word:

Word: $w_{i-2}, w_{i-1}, w_i, w_{i+1}, w_{i+2}$
POS: $t_{i-2}, t_{i-1}, t_i, t_{i+1}, t_{i+2}$
Chunk: $c_{i-2}, c_{i-1}$

Here, $w_i$ is the word appearing at the $i$-th position, $t_i$ is
the POS tag of $w_i$, and $c_i$ is the (extended) chunk
label for the $i$-th word. In addition, we can reverse the
parsing direction (from right to left) by using the two
chunk tags which appear to the r.h.s. of the current
token ($c_{i+1}, c_{i+2}$). In this paper, we call the method
which parses from left to right forward parsing,
and the method which parses from right to left
backward parsing.
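For concreteness, here is a sketch of this feature template (the padding symbol and the string encoding of features are our own illustrative choices):

```python
def chunk_features(words, pos, chunks, i):
    """Features for deciding the chunk label c_i in forward parsing.
    `chunks` holds the labels already assigned to tokens 0 .. i-1."""
    def w(j): return words[j] if 0 <= j < len(words) else "<PAD>"
    def t(j): return pos[j] if 0 <= j < len(pos) else "<PAD>"
    def c(j): return chunks[j] if 0 <= j < i else "<PAD>"
    feats = []
    for k in range(-2, 3):                  # 5-token context window
        feats.append("W%+d=%s" % (k, w(i + k)))
        feats.append("T%+d=%s" % (k, t(i + k)))
    feats.append("C-2=%s" % c(i - 2))       # dynamically assigned labels
    feats.append("C-1=%s" % c(i - 1))
    return feats
```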
Since the preceding chunk labels ($c_{i-1}, c_{i-2}$ for
forward parsing, $c_{i+1}, c_{i+2}$ for backward parsing)
are not given in the test data, they are decided dy-
namically during the tagging of chunk labels. The
technique can be regarded as a sort of Dynamic Pro-
gramming (DP) matching, in which the best answer
is searched for by maximizing the total certainty score
for the combination of tags. In using DP matching,
we limit the number of ambiguities by applying beam
search with width $N$. In the CoNLL 2000 shared task,
the number of votes for the class obtained through
the pairwise voting was used as the certainty score for
beam search with width 5 (Kudo and Matsumoto,
2000a). In this paper, however, we apply a determin-
istic method instead of applying beam search while
keeping some ambiguities. The reason we apply the de-
terministic method is that our further experiments
and investigation into the selection of beam width
show that a larger beam width does not always give a
significant improvement in the accuracy. Given our
experiments, we conclude that satisfactory accuracies
can be obtained even with deterministic parsing.
Another reason for selecting the simpler setting is
that the major purpose of this paper is to compare
weighted voting schemes and to show an effective
weighting method with the help of empirical risk
estimation frameworks.
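The deterministic (beam width 1) forward pass then reduces to the following loop (a sketch; `chunk_features` is defined above, while `classifier` and `vectorize` are hypothetical stand-ins for a trained multiclass SVM and a feature-to-vector mapping):

```python
def tag_forward(words, pos, classifier, vectorize):
    """Tag left to right; each decision immediately becomes a feature
    for the decisions on the following tokens."""
    chunks = []
    for i in range(len(words)):
        feats = chunk_features(words, pos, chunks, i)
        # vectorize() maps string features to a binary feature vector,
        # e.g. a fitted DictVectorizer; classifier could be PairwiseSVM.
        chunks.append(classifier.predict_one(vectorize(feats)))
    return chunks
```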
3.3 Weighted Voting
Tjong Kim Sang et al. report that they achieve
higher accuracy by applying weighted voting of sys-
tems which are trained using distinct chunk rep-
resentations and different machine learning algo-
rithms, such as MBL, ME and IGTree (Tjong Kim
Sang, 2000a; Tjong Kim Sang et al., 2000). It
is well known that weighted voting schemes have
the potential to maximize the margin between critical
samples and the separating hyperplane, and pro-
duce a decision function with high generalization
performance (Schapire et al., 1997). The boosting
technique is a type of weighted voting scheme, and
has been applied to many NLP problems such as
parsing, part-of-speech tagging and text categoriza-
tion.
In our experiments, in order to obtain higher ac-
curacy, we also apply weighted voting of 8 SVM-
based systems which are trained using distinct
chunk representations. Before applying weighted
voting method, first we need to decide the weights
to be given to individual systems. We could obtain
the best weights if we knew the accuracy for the
“true” test data. However, it is impossible to esti-
mate them. In the boosting technique, the voting
weights are given by the accuracy of the training
data during the iteration of changing the frequency
(distribution) of training data. However, we can-
not use the accuracy of the training data for vot-
ing weights, since SVMs do not depend on the fre-
quency (distribution) of training data, and can sepa-
rate the training data without any mis-classification
by selecting the appropriate kernel function and the
soft margin parameter. In this paper, we introduce
the following four weighting methods in our exper-
iments:
1. Uniform weights
We give the same voting weight to all systems.
This method is taken as the baseline for other
weighting methods.
2. Cross validation
Dividing the training data into $N$ portions, we em-
ploy training using $N - 1$ portions, and
then evaluate on the remaining portion. In this
fashion, we will have $N$ individual accuracies.
The final voting weights are given by the average
of these $N$ accuracies.
3. VC-bound
By applying (1) and (2), we estimate the lower
bound of the accuracy for each system, and use
this accuracy as a voting weight. The voting
weight is calculated as: $w = 1 - (\text{VC bound})$.
The value of $D$, which represents the smallest
diameter enclosing all of the training data, is
approximated by the maximum distance from
the origin.
4. Leave-One-Out bound
By using (3), we estimate the lower bound of
the accuracy of a system. The voting weight is
calculated as: $w = 1 - E_l$.
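A rough sketch of the four schemes (the per-system fields `cv_accuracies`, `vc_bound` and `loo_bound` are hypothetical estimates, e.g. as reported by the SVM package):

```python
import numpy as np

def voting_weights(systems, method):
    """Per-system voting weights for the four schemes above."""
    if method == "uniform":
        return [1.0 for _ in systems]
    if method == "cv":                     # mean accuracy over N folds
        return [np.mean(s.cv_accuracies) for s in systems]
    if method == "vc":                     # 1 - VC bound, from (1) and (2)
        return [1.0 - s.vc_bound for s in systems]
    if method == "loo":                    # 1 - E_l, from (3)
        return [1.0 - s.loo_bound for s in systems]
    raise ValueError(method)
```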
The procedure of our experiments is summarized
as follows:
1. We convert the training data into 4 representa-
tions (IOB1/IOB2/IOE1/IOE2).
2. We consider two parsing directions (For-
ward/Backward) for each representation, i.e.
$4 \times 2 = 8$ systems for a single training data set.
Then, we employ SVM training using these
independent chunk representations.
3. After training, we examine the VC bound and
Leave-One-Out bound for each of 8 systems.
As for cross validation, we employ the steps 1
and 2 for each divided training data, and obtain
the weights.
4. We test these 8 systems with a separate test
data set. Before employing weighted voting,
we have to convert them into a uniform repre-
sentation, since the tag sets used in the individ-
ual 8 systems are different. For this purpose, we
re-convert each of the estimated results into the 4
representations (IOB1/IOB2/IOE1/IOE2).
5. We employ weighted voting of the 8 systems with
respect to the converted 4 uniform representa-
tions and the 4 voting schemes respectively. Fi-
nally, we have 4 (types of uniform representa-
tions) $\times$ 4 (types of weights) = 16 results for
our experiments.
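Per token, the vote in step 5 can be sketched as follows (our own minimal illustration; `predictions` holds the systems' tag sequences after conversion to one common representation):

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine the tag sequences of several systems by weighted voting."""
    combined = []
    for i in range(len(predictions[0])):
        score = defaultdict(float)
        for tags, w in zip(predictions, weights):
            score[tags[i]] += w            # add this system's weight
        combined.append(max(score, key=score.get))
    return combined
```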
Although we can use models with IOBES-F or
IOBES-B representations for the committees for
the weighted voting, we do not use them in our
voting experiments. The reason is that the number
of classes is different (3 vs. 5) and the estimated
VC and LOO bounds cannot straightforwardly be
compared with those of the models that have three
classes (IOB1/IOB2/IOE1/IOE2) under the same
condition. We conduct experiments with IOBES-
F and IOBES-B representations only to investigate
how far the difference of various chunk representa-
tions would affect the actual chunking accuracies.
4 Experiments
4.1 Experiment Setting
We use the following three annotated corpora for
our experiments.
• Base NP standard data set (baseNP-S)
This data set was first introduced by (Ramshaw
and Marcus, 1995), and taken as the standard
data set for baseNP identification task2. This
data set consists of four sections (15-18) of
the Wall Street Journal (WSJ) part of the Penn
Treebank for the training data, and one section
(20) for the test data. The data has part-of-
speech (POS) tags annotated by the Brill tag-
ger (Brill, 1995).
• Base NP large data set (baseNP-L)
This data set consists of 20 sections (02-21)
of the WSJ part of the Penn Treebank for the
training data, and one section (00) for the test
data. POS tags in this data set are also anno-
tated by the Brill tagger. We omit the experi-
ments with IOB1 and IOE1 representations for this
training data since the data size is too large for
our current SVMs learning program. In the case
of IOB1 and IOE1, the size of the training data for
one classifier which estimates the classes I and
O becomes much larger compared with the IOB2
and IOE2 models. In addition, we also omit esti-
mating the voting weights using the cross valida-
tion method due to its large training cost.
• Chunking data set (chunking)
This data set was used for the CoNLL-2000
shared task (Tjong Kim Sang and Buchholz,
2000). In this data set, a total of 10 base
phrase classes (NP, VP, PP, ADJP, ADVP, CONJP,
INTJ, LST, PRT, SBAR) are annotated. This data
set consists of 4 sections (15-18) of the WSJ
part of the Penn Treebank for the training data,
and one section (20) for the test data3.
2ftp://ftp.cis.upenn.edu/pub/chunker/
All the experiments are carried out with our soft-
ware package TinySVM4, which is designed and op-
timized to handle large sparse feature vectors and
a large number of training samples. This package can
estimate the VC bound and Leave-One-Out bound
automatically. For the kernel function, we use the
2nd polynomial function and set the soft margin
parameter $C$ to 1.
In the baseNP identification task, the perfor-
mance of the systems is usually measured with three
rates: precision, recall and

$F_{\beta=1} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$

In this paper, we re-
fer to $F_{\beta=1}$ as accuracy.
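As a quick illustration of the metric (our own example):

```python
def f_beta1(precision, recall):
    """F_{beta=1}: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Recomputing the best chunking result from the rounded precision/recall
# in Table 4 gives ~93.9 (Table 4 reports 93.91 from unrounded counts).
print(round(100 * f_beta1(0.9389, 0.9392), 2))
```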
4.2 Results of Experiments
Table 2 shows the results of our SVM-based chunk-
ing with individual chunk representations. This ta-
ble also lists the voting weights estimated by differ-
ent approaches (B:Cross Validation, C:VC-bound,
D:Leave-one-out). We also show the results of
Start/End representation in Table 2.
Table 3 shows the results of the weighted vot-
ing of four different voting methods: A: Uniform,
B: Cross Validation ($N = 5$), C: VC bound, D:
Leave-One-Out Bound.
Table 4 shows the precision, recall and $F_{\beta=1}$ of
the best result for each data set.
4.3 Accuracy vs Chunk Representation
We obtain the best accuracy when we ap-
ply IOE2-B representation for baseNP-S and
chunking data set. In fact, we cannot find
a significant difference in the performance be-
tween Inside/Outside(IOB1/IOB2/IOE1/IOE2) and
Start/End(IOBES) representations.
Sassano and Utsuro evaluate how the difference
of the chunk representation would affect the perfor-
mance of the systems based on different machine
learning algorithms (Sassano and Utsuro, 2000).
They report that Decision List system performs
better with Start/End representation than with In-
side/Outside, since Decision List considers the spe-
cific combination of features. As for Maximum
Entropy, they report that it performs better with
Inside/Outside representation than with Start/End,
3http://lcg-www.uia.ac.be/conll2000/chunking/
4http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM/
Training Condition Acc. Estimated Weights
data rep. $F_{\beta=1}$ B C D
baseNP-S IOB1-F 93.76 .9394 .4310 .9193
IOB1-B 93.93 .9422 .4351 .9184
IOB2-F 93.84 .9410 .4415 .9172
IOB2-B 93.70 .9407 .4300 .9166
IOE1-F 93.73 .9386 .4274 .9183
IOE1-B 93.98 .9425 .4400 .9217
IOE2-F 93.98 .9409 .4350 .9180
IOE2-B 94.11 .9426 .4510 .9193
baseNP-L IOB2-F 95.34 - .4500 .9497
IOB2-B 95.28 - .4362 .9487
IOE2-F 95.32 - .4467 .9496
IOE2-B 95.29 - .4556 .9503
chunking IOB1-F 93.48 .9342 .6585 .9605
IOB1-B 93.74 .9346 .6614 .9596
IOB2-F 93.46 .9341 .6809 .9586
IOB2-B 93.47 .9355 .6722 .9594
IOE1-F 93.45 .9335 .6533 .9589
IOE1-B 93.72 .9358 .6669 .9611
IOE2-F 93.45 .9341 .6740 .9606
IOE2-B 93.85 .9361 .6913 .9597
baseNP-S IOBES-F 93.96
IOBES-B 93.58
chunking IOBES-F 93.31
IOBES-B 93.41
B:Cross Validation, C:VC bound, D:LOO bound
Table 2: Accuracy of individual representations
Training Condition Accuracy $F_{\beta=1}$
data rep. A B C D
baseNP-S IOB1 94.14 94.20 94.20 94.16
IOB2 94.16 94.22 94.22 94.18
IOE1 94.14 94.19 94.19 94.16
IOE2 94.16 94.20 94.21 94.17
baseNP-L IOB2 95.77 - 95.66 95.66
IOE2 95.77 - 95.66 95.66
chunking IOB1 93.77 93.87 93.89 93.87
IOB2 93.72 93.87 93.90 93.88
IOE1 93.76 93.86 93.88 93.86
IOE2 93.77 93.89 93.91 93.85
A:Uniform Weights, B:Cross Validation
C:VC bound, D:LOO bound
Table 3: Results of weighted voting
data set precision recall $F_{\beta=1}$
baseNP-S 94.15% 94.29% 94.22
baseNP-L 95.62% 95.93% 95.77
chunking 93.89% 93.92% 93.91
Table 4: Best results for each data set
since Maximum Entropy model regards all features
as independent and tries to catch the more general
feature sets.
We believe that SVMs perform well regardless of
the chunk representation, since SVMs have a high
generalization performance and a potential to select
the optimal features for the given task.
4.4 Effects of Weighted Voting
By applying weighted voting, we achieve higher ac-
curacy than any single-representation system, re-
gardless of the voting weights. Furthermore, we
achieve higher accuracy by applying the Cross vali-
dation, VC-bound and Leave-One-Out methods
than with the baseline method.
By using VC bound for each weight, we achieve
nearly the same accuracy as that of Cross valida-
tion. This result suggests that the VC bound has a
potential to predict the error rate for the “true” test
data accurately. Focusing on the relationship be-
tween the accuracy of the test data and the estimated
weights, we find that VC bound can predict the ac-
curacy for the test data precisely. Even if we have
no room for applying the voting schemes because
of some real-world constraints (limited computation
and memory capacity), the use of the VC bound may al-
low us to obtain the best accuracy. On the other hand,
we find that the prediction ability of Leave-One-Out
is worse than that of VC bound.
Cross validation is the standard method to esti-
mate the voting weights for different systems. How-
ever, Cross validation requires a larger amount of
computational overhead as the training data is di-
vided and is repeatedly used to obtain the voting
weights. We believe that VC bound is more effec-
tive than Cross validation, since it can obtain the
comparable results to Cross validation without in-
creasing computational overhead.
4.5 Comparison with Related Works
Tjong Kim Sang et al. report that they achieve accu-
racy of 93.86 for baseNP-S data set, and 94.90 for
baseNP-L data set. They apply weighted voting of
the systems which are trained using distinct chunk
representations and different machine learning al-
gorithms such as MBL, ME and IGTree (Tjong Kim
Sang, 2000a; Tjong Kim Sang et al., 2000).
Our experiments achieve the accuracy of 93.76 -
94.11 for baseNP-S, and 95.29 - 95.34 for baseNP-
L even with a single chunk representation. In addi-
tion, by applying the weighted voting framework,
we achieve accuracy of 94.22 for baseNP-S, and
95.77 for baseNP-L data set. As far as accuracies
are concerned, our model outperforms Tjong Kim
Sang’s model.
In the CoNLL-2000 shared task, we achieved
the accuracy of 93.48 using IOB2-F representation
(Kudo and Matsumoto, 2000b)5. By combining
weighted voting schemes, we achieve accuracy of
93.91. In addition, our method also outperforms
other methods based on weighted voting (van
Halteren, 2000; Tjong Kim Sang, 2000b).
4.6 Future Work
• Applying to other chunking tasks
Our chunking method is equally applicable
to other chunking tasks, such as English
POS tagging, Japanese chunk (bunsetsu) iden-
tification and named entity extraction. In the
future, we will apply our method to those
chunking tasks and examine the performance
of the method.
• Incorporating a variable context length model
In our experiments, we simply use the so-
called fixed context length model. We believe
that we can achieve higher accuracy by select-
ing appropriate context length which is actu-
ally needed for identifying individual chunk
tags. Sassano and Utsuro (Sassano and Ut-
suro, 2000) introduce a variable context length
model for the Japanese named entity identification
task and obtain better results. We will incor-
porate the variable context length model into
our system.
• Considering a more predictable bound
In our experiments, we introduce new types
of voting methods which stem from the theo-
rems of SVMs — VC bound and Leave-One-
Out bound. On the other hand, Chapelle and
Vapnik introduce an alternative and more pre-
dictable bound for the risk and report that their
proposed bound is quite useful for selecting
the kernel function and soft margin parame-
ter (Chapelle and Vapnik, 2000). We believe
that we can obtain higher accuracy using this
more predictable bound for the voting weights
in our experiments.
5In our experiments, the accuracy of 93.46 is obtained with
the IOB2-F representation, which was exactly the same repre-
sentation we applied for the CoNLL 2000 shared task. This slight
difference in accuracy arises from the following two reasons:
(1) the difference of beam width for parsing (N=1 vs. N=5),
(2) the difference of the applied SVM package (TinySVM vs.
$SVM^{light}$).
5 Summary
In this paper, we introduce a uniform framework for
the chunking task based on Support Vector Machines
(SVMs). Experimental results on the WSJ corpus show
that our method outperforms other conventional ma-
chine learning frameworks such as MBL and Max-
imum Entropy Models. The results are due to
the good characteristics of generalization and non-
overfitting of SVMs even with a high dimensional
vector space. In addition, we achieve higher accu-
racy by applying weighted voting of 8 SVM-based
systems which are trained using distinct chunk rep-
resentations.
References
Erin L. Allwein, Robert E. Schapire, and Yoram
Singer. 2000. Reducing multiclass to binary: A
unifying approach for margin classifiers. In In-
ternational Conf. on Machine Learning (ICML),
pages 9–16.
Eric Brill. 1995. Transformation-Based Error-
Driven Learning and Natural Language Process-
ing: A Case Study in Part-of-Speech Tagging.
Computational Linguistics, 21(4).
Olivier Chapelle and Vladimir Vapnik. 2000. Model
selection for support vector machines. In Ad-
vances in Neural Information Processing Systems
12. Cambridge, Mass: MIT Press.
C. Cortes and Vladimir N. Vapnik. 1995. Support
Vector Networks. Machine Learning, 20:273–
297.
T. G. Dietterich and G. Bakiri. 1995. Solving multi-
class learning problems via error-correcting out-
put codes. Journal of Artificial Intelligence Re-
search, 2:263–286.
Yoav Freund and Robert E. Schapire. 1996. Experi-
ments with a new boosting algorithm. In Interna-
tional Conference on Machine Learning (ICML),
pages 148–156.
Thorsten Joachims. 1998. Text Categorization with
Support Vector Machines: Learning with Many
Relevant Features. In European Conference on
Machine Learning (ECML).
Ulrich H.-G Kreßel. 1999. Pairwise Classification
and Support Vector Machines. In Advances in
Kernel Methods. MIT Press.
Taku Kudo and Yuji Matsumoto. 2000a. Japanese
Dependency Structure Analysis Based on Sup-
port Vector Machines. In Empirical Methods in
Natural Language Processing and Very Large
Corpora, pages 18–25.
Taku Kudo and Yuji Matsumoto. 2000b. Use of
Support Vector Learning for Chunk Identifica-
tion. In Proceedings of the 4th Conference on
CoNLL-2000 and LLL-2000, pages 142–144.
Lance A. Ramshaw and Mitchell P. Marcus. 1995.
Text chunking using transformation-based learn-
ing. In Proceedings of the 3rd Workshop on Very
Large Corpora, pages 88–94.
Manabu Sassano and Takehito Utsuro. 2000.
Named Entity Chunking Techniques in Su-
pervised Learning for Japanese Named Entity
Recognition. In Proceedings of COLING 2000,
pages 705–711.
Robert E. Schapire, Yoav Freund, Peter Bartlett,
and Wee Sun Lee. 1997. Boosting the margin:
a new explanation for the effectiveness of vot-
ing methods. In International Conference on Ma-
chine Learning (ICML), pages 322–330.
Hirotoshi Taira and Masahiko Haruno. 1999. Fea-
ture Selection in SVM Text Categorization. In
AAAI-99.
Erik F. Tjong Kim Sang and Sabine Buchholz.
2000. Introduction to the CoNLL-2000 Shared
Task: Chunking. In Proceedings of CoNLL-2000
and LLL-2000, pages 127–132.
Erik F. Tjong Kim Sang and Jorn Veenstra. 1999.
Representing text chunks. In Proceedings of
EACL’99, pages 173–179.
Erik F. Tjong Kim Sang, Walter Daelemans, Hervé
Déjean, Rob Koeling, Yuval Krymolowski, Vasin
Punyakanok, and Dan Roth. 2000. Applying
system combination to base noun phrase identi-
fication. In Proceedings of COLING 2000, pages
857–863.
Erik F. Tjong Kim Sang. 2000a. Noun phrase
recognition by system combination. In Proceed-
ings of ANLP-NAACL 2000, pages 50–55.
Erik F. Tjong Kim Sang. 2000b. Text Chunking by
System Combination. In Proceedings of CoNLL-
2000 and LLL-2000, pages 151–153.
Kiyotaka Uchimoto, Qing Ma, Masaki Murata, Hi-
romi Ozaku, and Hitoshi Isahara. 2000. Named
Entity Extraction Based on A Maximum Entropy
Model and Transformation Rules. In Proceedings
of the ACL 2000.
Hans van Halteren. 2000. Chunking with WPDV
Models. In Proceedings of CoNLL-2000 and
LLL-2000, pages 154–156.
Vladimir N. Vapnik. 1998. Statistical Learning
Theory. Wiley-Interscience.
