In: Proceedings of CoNLL-2000 and LLL-2000, pages 142-144, Lisbon, Portugal, 2000. 
Use of ',Support Vector Learning 
for Chunk Identification 
Taku Kudoh and Yuji Matsumoto 
Graduate School of Information Science, Nara Institute of Science and Technology 
{taku-ku, matsu}@is, aist-nara, ac. jp 
1 Introduction 
In this paper, we explore the use of Support Vec- 
tor Machines (SVMs) for CoNLL-2000 shared 
task, chunk identification. SVMs are so-called 
large margin classifiers and are well-known as 
their good generalization performance. We in- 
vestigate how SVMs with a very large number 
of features perform with the classification task 
of chunk labelling. 
2 Support Vector Machines 
Support Vector Machines (SVMs), first intro- 
duced by Vapnik (Cortes and Vapnik, 1995; 
Vapnik, 1995), are relatively new learning ap- 
proaches for solving two-class pattern recog- 
nition problems. SVMs are well-known for 
their good generalization performance, and have 
been applied to many pattern recognition prob- 
lems. In the field of natural  process- 
ing, SVMs are applied to text categorization, 
and are reported to have achieved high accu- 
racy without falling into over-fitting even with 
a large number of words taken as the features 
(Joachims, 1998; Taira and Haruno, 1999) 
First of all, let us define the training data 
which belongs to either positive or negative class 
as follows: 
(Xl,YX),..., (Xl,Yl) Xi 6 R n, Yi 6 {+1,-1} 
xi is a feature vector of the i-th sample repre- 
sented by an n dimensional vector, yi is the 
class (positive(+l) or negative(-1) class) label 
of the i-th data. In basic SVMs framework, we 
try to separate the positive and negative exam- 
ples by hyperplane written as: 
(w-x)+b=0 w 6Rn,bE R. 
SVMs find the "optimal" hyperplane (optimal 
parameter w, b) which separates the training 
0 
0 
', ~O 
o', O • 
o <~I",. ° o 
Small Margin 
0 , ,0 0"0~',, 
• 
Large Margin 
Figure 1: Two possible separating hyperplanes 
data into two classes precisely. What "opti- 
mal" means? In order to define it, we need 
to consider the margin between two classes. 
Figures 1 illustrates this idea. The solid lines 
show two possible hyperplanes, each of which 
correctly separates the training data into two 
classes. The two dashed lines parallel to the 
separating hyperplane show the boundaries in 
which one can move the separating hyperplane 
without misclassification. We call the distance 
between each parallel dashed lines as margin. 
SVMs take a simple strategy that finds the sep- 
arating hyperplane which maximizes its margin. 
Precisely, two dashed lines and margin (d) can 
be written as: 
(w. x) + b = :kl, d = 2111wll. 
SVMs can be regarded as an optimization prob- 
lem; finding w and b which minimize \[\[w\[\[ under 
the constraints: yi\[(w • xi) + b\] > 1. 
Furthermore, SVMs have potential to cope 
with the linearly unseparable training data. We 
leave the details to (Vapnik, 1995), the opti- 
mization problems can be rewritten into a dual 
form, where all feature vectors appear in their 
dot product. By simply substituting every dot 
product of xi and xj in dual form with any Ker- 
nel function K(xl, xj), SVMs can handle non- 
linear hypotheses. Among the many kinds of 
Kernel functions available, we will focus on the 
142 
d-th polynomial kernel: 
K(xi,xj) = (xi.xj + 1) d 
Use of d-th polynomial kernel function allows 
us to build an optimal separating hyperplane 
which takes into account all combination of fea- 
tures up to d. 
We believe SVMs have advantage over con- 
ventional statistical learning algorithms, such as 
Decision Tree, and Maximum Entropy Models, 
from the following two aspects: 
• SVMs have high generalization perfor- 
mance independent of dimension of fea- 
ture vectors. Conventional algorithms re- 
quire careful feature selection, which is usu- 
ally optimized heuristically, to avoid over- 
fitting. 
• SVMs can carry out their learning with 
all combinations of given features with- 
out increasing computational complexity 
by introducing the Kernel function. Con- 
ventional algorithms cannot handle these 
combinations efficiently, thus, we usually 
select "important" combinations heuristi- 
cally with taking the trade-off between ac- 
curacy and computational complexity into 
account. 
3 Approach for Chunk Identification 
The chunks in the CoNLL-2000 shared task are 
represented with IOB based model, in which ev- 
ery word is to be tagged with a chunk label ex- 
tended with I (inside a chunk), O (outside a 
chunk) and B (inside a chunk, but the preced- 
ing word is in another chunk). Each chunk type 
belongs to I or B tags. For example, NP could 
be considered as two types of chunk, I-NP or 
B-NP. In training data of CoNLL-2000 shared 
task, we could find 22 types of chunk 1 consid- 
ering all combinations of IOB-tags and chunk 
types. We simply formulate the chunking task 
as a classification problem of these 22 types of 
chunk. 
Basically, SVMs are binary classifiers, thus we 
must extend SVMs to multi-class classifiers in 
order to classify these 22 types of chunks. It is 
1Precisely, the number of combination becomes 23. 
However, we do not consider I-LST tag since it dose not 
appear in training data. 
known that there are mainly two approaches to 
extend from a binary classification task to those 
with K classes. First approach is often used 
and typical one "one class vs. all others". The 
idea is to build K classifiers that separate one 
class among from all others. Second approach 
is pairwise classification. The idea is to build 
K × (K - 1)/2 classifiers considering all pairs of 
classes, and final class decision is given by their 
majority voting. We decided to construct pair- 
wise classifiers for all the pairs of chunk labels, 
so that the total number of classifiers becomes 
22x21 231. The reasons that we use pairwise 2 -- 
classifiers are as follows: 
• Some experiments report that combination 
of pairwise classifier perform better than K 
classifier (Kret~el, 1999). 
• The amount of training data for a pair is 
less than the amount of training data for 
separating one class with all others. 
For the features, we decided to use all the in- 
formation available in the surrounding contexts, 
such as the words, their POS tags as well as the 
chunk labels. More precisely, we give the fol- 
lowing for the features to identify chunk label 
ci at i-th word: 
w j, tj 
cj 
(j = i-2, i-1, i, i+1, i+ 2) 
(j = i-2, i-1) 
where wi is the word appearing at i-th word, ti 
is the POS tag of wi, and c/ is the (extended) 
chunk label at i-th word. Since the chunk labels 
are not given in the test data, they are decided 
dynamically during the tagging of chunk labels. 
This technique can be regarded as a sort of Dy- 
namic Programming (DP) matching, in which 
the best answer is searched by maximizing the 
total certainty score for the combination of tags. 
In using DP matching, we decided to keep not 
all ambiguities but a limited number of them. 
This means that a beam search is employed, 
and only the top N candidates are kept for the 
search for the best chunk tags. The algorithm 
scans the test data from left to right and calls 
the SVM classifiers for all pairs of chunk tags 
for obtaining the certainty score. We defined 
the certainty score as the number of votes for 
the class (tag) obtained through the pairwise 
voting. 
143 
Since SVMs are vector based classifier, they 
accept only numerical values for their features. 
To cope with this constraints, we simply expand 
all features as a binary-value taking either 0 or 
1. By taking all words and POS tags appearing 
in the training data as features, the total dimen- 
sion of feature vector becomes as large as 92837. 
Generally, we need vast computational complex- 
ity and memories to handle such a huge dimen- 
sion of vectors. In fact, we can reduce these 
complexity considerably by holding only indices 
and values of non-zero elements, since the fea- 
ture vectors are usually sparse, and SVMs only 
require the evaluation of dot products of each 
feature vectors for their training. 
In addition, although we could apply some 
cut-off threshold for the number of occurrence 
in the training set, we decided to use everything, 
not only POS tags but also words themselves. 
The reasons are that we simply do not want 
to employ a kind of "heuristics", and SVMs 
are known to have a good generalization per- 
formance even with very large featm:es. 
4 Results 
We have applied our proposed method to the 
test data of CoNLL-2000 shared task, while 
training with the complete training data. For 
the kernel function, we use the 2-nd polynomial 
function. We set the beam width N to 5 ten- 
tatively. SVMs training is carried out with the 
SVM light package, which is designed and opti- 
mized to handle large sparse feature vector and 
large numbers of training examples(Joachims, 
2000; Joachims, 1999a). It took about 1 day 
to train 231 classifiers with PC-Linux (Celeron 
500Mhz, 512MB). 
Figure 1 shows the results of our experiments. 
The all the values of the chunking F-measure are 
almost 93.5. Especially, our method performs 
well for the chunk types of high frequency, such 
as NP, VP and PP. 
5 Discussion 
In this paper, we propose Chunk identification 
analysis based on Support Vector Machines. 
Although we select features for learning in 
very straight way -- using all available features 
such as the words their POS tags without any 
cut-off threshold for the number of occurrence, 
we archive high performance for test data. 
test data 
ADJP 
ADVP 
CONJP 
INTJ 
LST 
NP 
PP 
PRT 
SBAR 
VP 
precision 
79.22% 
80.86% 
62.50% 
100.00% 
0.00% 
93.72% 
96.60% 
80.58% 
89.29% 
93.76% 
all 93.45% 93.51% 
recall FZ=i 
69.63% 74.12 
80.48% 80.67 
55.56% 58.82 
50.00% 66.67 
0.00% 0.00 
94.02% 93.87 
97.94% 97.26 
78.30% 79.43 
84.11% 86.62 
93.84% 93.80 
93.48 
Table 1: The results per chunk type with our 
proposed SVMs based method 
When we use other learning methods such as 
Decision Tree, we have to select feature set man- 
ually to avoid over-fitting. Usually, these fea- 
ture selection depends on heuristics, so that it 
is difficult to apply them to other classification 
problems in other domains. 
Memory based learning method can also ham 
dle all available features. However, the function 
to compute the distance between the test pat- 
tern and the nearest cases in memory is usually 
optimized in an ad-hoc way 
Through our experiments, we have shown the 
high generalization performance and high lea- 
ture selection abilities of SVMs. 

References 
C. Cortes and Vladimir N. Vapnik. 1995. Support 
Vector Networks. Machine Learning, 20:273-297. 
Thorsten Joachims. 1998. Text Categorization with 
Support Vector Machines: Learning with Many 
Relevant Features. In European Conference on 
Machine Learning (ECML). 
Thorsten Joachims. 1999a. Making Large-Scale 
Support Vector Machine Learning Practical. In 
Advances in Kernel Methods. MIT Press. 
Thorsten Joachims. 2000. SVM tight version 
3.02. http://www-ai.cs.uni-dortmund.de/SOFT- 
WARE/ S VM_LI G H T / svm_light.eng.html. 
Ulrich H.-G Krefiel. 1999. Pairwise Classification 
and Support Vector Machines. In Advances in 
Kernel Methods. MIT Press. 
Hirotoshi Taira and Masahiko Haruno. 1999. Fea- 
ture Selection in SVM Text Categorization. In 
AAAI-99. 
Vladimir N. Vapnik. 1995. The Nature of Statistical 
Learning Theory. Springer. 
