Japanese Dependency Structure Analysis 
Based on Support Vector Machines 
Taku Kudo and Yuji Matsumoto 
Graduate School of Information Science, 
Nara Institute of Science and Technology 
{taku-ku, matsu}@is.aist-nara.ac.jp
Abstract 
This paper presents a method of Japanese dependency structure
analysis based on Support Vector Machines (SVMs). Conventional
parsing techniques based on machine learning frameworks, such as
Decision Trees and Maximum Entropy Models, have difficulty in
selecting useful features as well as in finding appropriate
combinations of the selected features. On the other hand, it is
well known that SVMs achieve high generalization performance even
with input data of a very high-dimensional feature space.
Furthermore, by introducing the Kernel principle, SVMs can carry
out training in high-dimensional spaces with a small computational
cost that is independent of their dimensionality. We apply SVMs to
the problem of identifying Japanese dependency structure.
Experimental results on the Kyoto University corpus show that our
system achieves an accuracy of 89.09% even with small training data
(7958 sentences).
1 Introduction 
Dependency structure analysis has been recognized as a basic
technique in Japanese sentence analysis, and a number of studies
have been proposed over the years. Japanese dependency structure is
usually defined in terms of the relationship between phrasal units
called 'bunsetsu' segments (hereafter "chunks"). Generally,
dependency structure analysis consists of two steps. In the first
step, a dependency matrix is constructed, in which each element
corresponds to a pair of chunks and represents the probability of a
dependency relation between them. The second step is to find the
optimal combination of dependencies to form the entire sentence.

In previous approaches, these dependency probabilities are given by
manually constructed rules. However, rule-based approaches have
problems in coverage and consistency, since there are a number of
features that affect the accuracy of the final results, and these
features usually relate to one another.
On the other hand, as large-scale tagged corpora have become
available, a number of statistical parsing techniques that estimate
the dependency probabilities from such tagged corpora have been
developed (Collins, 1996; Fujio and Matsumoto, 1998). These
approaches have outperformed the systems based on rule-based
approaches. Decision Trees (Haruno et al., 1998) and Maximum
Entropy models (Ratnaparkhi, 1997; Uchimoto et al., 1999; Charniak,
2000) have been applied to dependency or syntactic structure
analysis. However, these models require appropriate feature
selection in order to achieve high performance. In addition,
acquiring an efficient combination of features is difficult in
these models.
In recent years, new statistical learning techniques such as
Support Vector Machines (SVMs) (Cortes and Vapnik, 1995; Vapnik,
1998) and Boosting (Freund and Schapire, 1996) have been proposed.
These techniques take a strategy that maximizes the margin between
critical examples and the separating hyperplane. In particular,
compared with other conventional statistical learning algorithms,
SVMs achieve high generalization even with training data of a very
high dimension. Furthermore, by selecting an appropriate Kernel
function, SVMs can handle non-linear feature spaces and carry out
the training while considering combinations of more than one
feature.
Thanks to these properties, SVMs deliver state-of-the-art
performance in real-world applications such as recognition of
hand-written letters or of three-dimensional images. In the field
of natural language processing, SVMs have also been applied to text
categorization, and are reported to have achieved high accuracy
without overfitting even with a large number of words taken as
features (Joachims, 1998; Taira and Haruno, 1999).
In this paper, we propose an application of SVMs to Japanese
dependency structure analysis. We use the features that have been
studied in conventional statistical dependency analysis, with
slight modifications.
2 Support Vector Machines 
2.1 Optimal Hyperplane 
Let us define the training data, each of which belongs either to
the positive or to the negative class, as follows.
$$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_i, y_i), \ldots, (\mathbf{x}_l, y_l), \qquad \mathbf{x}_i \in \mathbf{R}^n,\; y_i \in \{+1, -1\}$$
$\mathbf{x}_i$ is the feature vector of the i-th sample, which is
represented by an n-dimensional vector
($\mathbf{x}_i = (f_1, \ldots, f_n) \in \mathbf{R}^n$). $y_i$ is a
scalar value that specifies the class (positive (+1) or negative
(-1)) of the i-th sample. Formally, we can define the pattern
recognition problem as the process of learning and building the
decision function $f: \mathbf{R}^n \to \{\pm 1\}$.
In the basic SVM framework, we try to separate the positive and
negative examples in the training data by a linear hyperplane
written as:
$$(\mathbf{w} \cdot \mathbf{x}) + b = 0, \qquad \mathbf{w} \in \mathbf{R}^n,\; b \in \mathbf{R}. \tag{1}$$
It is supposed that the farther the positive and negative examples
are separated by the discrimination function, the more accurately
we can separate unseen test examples, with high generalization
performance. Let us consider two hyperplanes called separating
hyperplanes:
$$(\mathbf{w} \cdot \mathbf{x}_i) + b \ge 1 \quad \text{if } (y_i = 1) \tag{2}$$
$$(\mathbf{w} \cdot \mathbf{x}_i) + b \le -1 \quad \text{if } (y_i = -1). \tag{3}$$
(2) and (3) can be written in one formula as:
$$y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1 \quad (i = 1, \ldots, l). \tag{4}$$
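The distance from the separating hyperplane to the point
$\mathbf{x}_i$ can be written as:
$$d(\mathbf{w}, b; \mathbf{x}_i) = \frac{|\mathbf{w} \cdot \mathbf{x}_i + b|}{\|\mathbf{w}\|}$$
Thus, the margin between the two separating hyperplanes can be
written as:
$$\min_{\mathbf{x}_i;\, y_i = 1} d(\mathbf{w}, b; \mathbf{x}_i) + \min_{\mathbf{x}_i;\, y_i = -1} d(\mathbf{w}, b; \mathbf{x}_i) = \min_{\mathbf{x}_i;\, y_i = 1} \frac{|\mathbf{w} \cdot \mathbf{x}_i + b|}{\|\mathbf{w}\|} + \min_{\mathbf{x}_i;\, y_i = -1} \frac{|\mathbf{w} \cdot \mathbf{x}_i + b|}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}.$$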
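As a small numerical illustration of the distance and margin formulas, the following sketch evaluates them on toy data. The four 2-D samples and the hyperplane parameters `w` and `b` are hypothetical values chosen for illustration only; they do not come from the experiments.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def distance(w, b, x):
    """d(w, b; x) = |w.x + b| / ||w||, the distance from x to the hyperplane."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))

# Hypothetical toy data: (feature vector, class label).
samples = [((2.0, 2.0), +1), ((3.0, 3.0), +1),
           ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
# Assumed hyperplane parameters satisfying y_i (w.x_i + b) >= 1.
w, b = (0.5, 0.5), -1.0

# Constraint (4) holds for every sample.
assert all(y * (dot(w, x) + b) >= 1 for x, y in samples)

# Margin = closest positive distance + closest negative distance.
d_pos = min(distance(w, b, x) for x, y in samples if y == +1)
d_neg = min(distance(w, b, x) for x, y in samples if y == -1)
margin = d_pos + d_neg
print(margin, 2 / math.sqrt(dot(w, w)))  # the two values agree
```

The closest positive and negative samples both satisfy constraint (4) with equality here, which is why the sum of the two minimum distances equals 2/||w||.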
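To maximize this margin, we should minimize $\|\mathbf{w}\|$. In
other words, this problem becomes equivalent to solving the
following optimization problem:
$$\text{Minimize:} \quad L(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$$
$$\text{Subject to:} \quad y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1 \quad (i = 1, \ldots, l).$$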
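Furthermore, this optimization problem can be rewritten into the
dual form problem: find the Lagrange multipliers
$\alpha_i \ge 0$ $(i = 1, \ldots, l)$ so that:
$$\text{Maximize:} \quad L(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \tag{5}$$
$$\text{Subject to:} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{l} \alpha_i y_i = 0 \quad (i = 1, \ldots, l)$$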
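The step from the primal problem to the dual form (5) can be sketched with the standard Lagrangian argument. The derivation below is a brief outline that uses only the symbols already introduced in this section; it is not reproduced from the original text.

```latex
\begin{align*}
% Lagrangian of the primal problem, with multipliers \alpha_i \ge 0
L(\mathbf{w}, b, \alpha) &= \frac{1}{2}\|\mathbf{w}\|^2
  - \sum_{i=1}^{l} \alpha_i \bigl\{ y_i[(\mathbf{w}\cdot\mathbf{x}_i) + b] - 1 \bigr\} \\
% Stationarity with respect to w and b
\frac{\partial L}{\partial \mathbf{w}} = 0 \;&\Rightarrow\;
  \mathbf{w} = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i, \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\;
  \sum_{i=1}^{l} \alpha_i y_i = 0 \\
% Substituting both conditions back into the Lagrangian
L(\alpha) &= \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)
\end{align*}
```

The last line is the objective (5), and the condition obtained from the derivative with respect to b is its equality constraint.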
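In this dual form problem, $\mathbf{x}_i$ with non-zero $\alpha_i$
is called a Support Vector. Using the Support Vectors,
$\mathbf{w}$ and $b$ can thus be expressed as follows:
$$\mathbf{w} = \sum_{i;\, \mathbf{x}_i \in SVs} \alpha_i y_i \mathbf{x}_i, \qquad b = y_i - \mathbf{w} \cdot \mathbf{x}_i \quad (\mathbf{x}_i \in SVs).$$
The expression for $b$ follows from
$y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] = 1$, which holds for
every Support Vector. The elements of the set SVs are the Support
Vectors that lie on the separating hyperplanes. Finally, the
decision function $f: \mathbf{R}^n \to \{\pm 1\}$ can be written
as:
$$f(\mathbf{x}) = \mathrm{sgn}\Bigl(\sum_{i;\, \mathbf{x}_i \in SVs} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + b\Bigr) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x} + b). \tag{6}$$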
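The decision function (6) can be illustrated with a minimal sketch. The two support vectors and their multipliers below are hypothetical values, assumed to be the output of an already solved dual problem. The bias is computed as b = y_i - w.x_i, which follows from y_i(w.x_i + b) = 1 at a support vector.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical support vectors as (x_i, y_i, alpha_i); note sum(alpha_i * y_i) == 0.
support_vectors = [((2.0, 2.0), +1, 0.25), ((0.0, 0.0), -1, 0.25)]

# w = sum_i alpha_i y_i x_i, computed coordinate by coordinate.
w = tuple(sum(a * y * sv[k] for sv, y, a in support_vectors) for k in range(2))

# b from any support vector: y_i (w.x_i + b) = 1  =>  b = y_i - w.x_i.
sv0, y0, _ = support_vectors[0]
b = y0 - dot(w, sv0)

def decide(x):
    """Decision function (6): sign of sum_i alpha_i y_i (x_i . x) + b."""
    score = sum(a * y * dot(sv, x) for sv, y, a in support_vectors) + b
    return 1 if score >= 0 else -1

print(w, b, decide((3.0, 3.0)), decide((-1.0, 0.0)))
```

Only the support vectors enter the sum, so the cost of classifying one input grows with the number of support vectors rather than with the size of the whole training set.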
2.2 Soft Margin 
In the case where we cannot separate the training examples
linearly, the "Soft Margin" method tolerates some classification
errors that may be caused by noise in the training examples.
First, we introduce non-negative slack variables $\xi_i$, and (2),
(3) are rewritten as:
$$(\mathbf{w} \cdot \mathbf{x}_i) + b \ge 1 - \xi_i \quad \text{if } (y_i = 1)$$
$$(\mathbf{w} \cdot \mathbf{x}_i) + b \le -1 + \xi_i \quad \text{if } (y_i = -1).$$
In this case, we minimize the following value instead of
$\frac{1}{2}\|\mathbf{w}\|^2$:
$$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \xi_i \tag{7}$$
The first term in (7) specifies the size of the margin, and the
second term evaluates how far the training data are away from the
optimal separating hyperplane. C is the parameter that defines the
balance between the two quantities. The larger we make C, the more
heavily classification errors are penalized.
Though we omit the details here, the minimization of (7) is reduced
to the problem of maximizing the objective function (5) under the
following constraints.
$$0 \le \alpha_i \le C, \quad \sum_{i=1}^{l} \alpha_i y_i = 0 \quad (i = 1, \ldots, l)$$
Usually, the value of C is estimated experimentally.
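To make the role of the slack variables and of C concrete, the sketch below evaluates the objective (7) for a fixed hyperplane. The data, `w`, `b`, and the helper name `soft_margin_objective` are hypothetical; each slack is taken at its smallest feasible value, max(0, 1 - y_i(w.x_i + b)), which is an assumption of this illustration rather than something stated in the text.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, b, samples, C):
    """Return (value of (7), slacks), with xi_i = max(0, 1 - y_i (w.x_i + b))."""
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in samples]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

# Hypothetical toy data; the last positive sample lies on the negative side (noise).
samples = [((2.0, 2.0), +1), ((3.0, 3.0), +1),
           ((0.0, 0.0), -1), ((-1.0, 0.0), -1),
           ((0.5, 0.5), +1)]
w, b = (0.5, 0.5), -1.0   # assumed hyperplane

value_c1, slacks = soft_margin_objective(w, b, samples, C=1.0)
value_c10, _ = soft_margin_objective(w, b, samples, C=10.0)
print(slacks, value_c1, value_c10)
```

Only the noisy sample receives a non-zero slack, and the same violation costs ten times more under C = 10 than under C = 1.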
2.3 Kernel Function 
In general classification problems, there are cases in which the
training data cannot be separated linearly. In such cases, the
training data could be separated linearly by expanding all
combinations of features as new ones and projecting them onto a
higher-dimensional space. However, such a naive approach requires
enormous computational overhead.
Let us consider the case where we project the training data
$\mathbf{x}$ onto a higher-dimensional space by using a projection
function $\Phi$.¹ Looking at the objective function (5) and the
decision function (6), we see that these functions depend only on
the dot products of the input training vectors. If we could
calculate the dot product from $\mathbf{x}_1$ and $\mathbf{x}_2$
directly, without considering the vectors $\Phi(\mathbf{x}_1)$ and
$\Phi(\mathbf{x}_2)$ projected onto the higher-dimensional space,
we could reduce the computational complexity considerably. Namely,
we can reduce the computational overhead if we can find a function
K that satisfies:
$$\Phi(\mathbf{x}_1) \cdot \Phi(\mathbf{x}_2) = K(\mathbf{x}_1, \mathbf{x}_2). \tag{8}$$
On the other hand, since we do not need $\Phi$ itself for actual
learning and classification, all we have to do is to prove the
existence of a $\Phi$ that satisfies (8), provided the function K
is selected properly. It is known that (8) holds if and only if the
function K satisfies the Mercer condition (Vapnik, 1998).

¹ In general, $\Phi(\mathbf{x})$ is a mapping into a Hilbert space.
In this way, instead of projecting the training data onto the
high-dimensional space, we can decrease the computational overhead
by replacing the dot products, which are calculated in the
optimization and classification steps, with the function K.
Such a function K is called a Kernel function. Among the many kinds
of Kernel functions available, we will focus on the d-th polynomial
kernel:
$$K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2 + 1)^d. \tag{9}$$
Use of the d-th polynomial kernel function allows us to build an
optimal separating hyperplane that takes into account all
combinations of features up to d.
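The claim that the polynomial kernel accounts for feature combinations without an explicit projection can be checked numerically for a small case. The sketch below compares the kernel (9) for d = 2 with the dot product of an explicit feature map; the map `phi2` is the standard expansion for 2-D inputs and is introduced here for illustration only.

```python
import math

def poly_kernel(x, z, d):
    """Polynomial kernel (9): (x . z + 1) ** d."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** d

def phi2(x):
    """Explicit feature map for d = 2 and 2-D input (illustrative).
    It contains the constant, both single features, and all products of two features."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z, 2)                        # one dot product in 2-D
explicit = sum(a * b for a, b in zip(phi2(x), phi2(z)))  # dot product in 6-D
print(implicit, explicit)
```

The kernel needs one dot product in the original space, whereas the explicit map grows quickly with the number of features and with d, which is the overhead the kernel avoids.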
Using a Kernel function, we can rewrite the decision function as:
$$y = \mathrm{sgn}\Bigl(\sum_{i;\, \mathbf{x}_i \in SVs} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\Bigr). \tag{10}$$
3 Dependency Analysis using 
SVMs 
3.1 The Probability Model 
This section describes a general formulation of the probability
model and parsing techniques for Japanese statistical dependency
analysis.
First of all, we denote a sequence of chunks
$\{b_1, b_2, \ldots, b_m\}$ by B, and the sequence of dependency
patterns $\{Dep(1), Dep(2), \ldots, Dep(m-1)\}$ by D, where
$Dep(i) = j$ means that the chunk $b_i$ depends on (modifies) the
chunk $b_j$.
In this framework, we suppose that the de- 
pendency sequence D satisfies the following 
constraints. 
1. Except for the rightmost one, each chunk 
depends on (modifies) exactly one of the 
chunks appearing to the right. 
2. Dependencies do not cross each other. 
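The two constraints can be expressed as a short validity check. This is a sketch under stated assumptions: chunks are indexed from 0 rather than 1, and the list `dep` (a name chosen here) holds the head of every chunk except the last.

```python
def is_valid_pattern(dep):
    """dep[i] = j means chunk i modifies chunk j (0-based).
    The list covers chunks 0..m-2; the rightmost chunk m-1 has no head."""
    m = len(dep) + 1  # number of chunks in the sentence
    # Constraint 1: each head lies strictly to the right, inside the sentence.
    if any(not (i < j < m) for i, j in enumerate(dep)):
        return False
    # Constraint 2: no crossing, i.e. no chunk k with i < k < dep[i] < dep[k].
    for i, j in enumerate(dep):
        for k in range(i + 1, j):
            if dep[k] > j:
                return False
    return True

print(is_valid_pattern([1, 2, 3]))  # chain: each chunk modifies its neighbour
print(is_valid_pattern([2, 3, 3]))  # 0->2 crosses 1->3
print(is_valid_pattern([3, 2, 3]))  # nested dependencies, no crossing
```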
Statistical dependency structure analysis is defined as the problem
of searching for the dependency pattern D that maximizes the
conditional probability $P(D \mid B)$ of the input sequence under
the above-mentioned constraints.
$$D_{best} = \operatorname*{argmax}_{D} P(D \mid B)$$
If we assume that the dependency probabilities are mutually
independent, $P(D \mid B)$ can be rewritten as:
$$P(D \mid B) = \prod_{i=1}^{m-1} P(Dep(i) = j \mid \mathbf{f}_{ij}), \qquad \mathbf{f}_{ij} = \{f_1, \ldots, f_n\} \in \mathbf{R}^n.$$
$P(Dep(i) = j \mid \mathbf{f}_{ij})$ represents the probability
that $b_i$ depends on (modifies) $b_j$. $\mathbf{f}_{ij}$ is an
n-dimensional feature vector that represents various kinds of
linguistic features related to the chunks $b_i$ and $b_j$.
We obtain $D_{best}$ by taking into account all combinations of
these probabilities. Generally, the optimal solution $D_{best}$ can
be identified by using a bottom-up algorithm such as the CYK
algorithm. Sekine suggests an efficient parsing technique for
Japanese sentences that parses from the end of a sentence (Sekine
et al., 2000). We apply Sekine's technique in our experiments.
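The idea of parsing from the end of the sentence while keeping only the k best partial analyses can be sketched as follows. This is a simplified illustration written for this section, not a reproduction of the algorithm of Sekine et al. (2000); the function name `beam_parse`, the 0-based indexing, and the probability matrix are all assumptions.

```python
def beam_parse(prob, k):
    """prob[i][j]: assumed probability that chunk i modifies chunk j (j > i).
    Returns (head list for chunks 0..m-2, probability of that analysis)."""
    m = len(prob)
    beam = [(1.0, {})]  # each hypothesis: (score, {chunk: head})
    for i in range(m - 2, -1, -1):  # from the second-to-last chunk leftwards
        expanded = []
        for score, heads in beam:
            for j in range(i + 1, m):
                # j is allowed only if no chunk between i and j has a head beyond j,
                # which keeps the dependencies from crossing.
                if all(heads[c] <= j for c in range(i + 1, j)):
                    new_heads = dict(heads)
                    new_heads[i] = j
                    expanded.append((score * prob[i][j], new_heads))
        expanded.sort(key=lambda h: h[0], reverse=True)
        beam = expanded[:k]  # keep the k best partial analyses
    score, heads = beam[0]
    return [heads[i] for i in range(m - 1)], score

# Hypothetical probabilities for a four-chunk sentence (upper triangle only).
prob = [[0.0, 0.2, 0.7, 0.1],
        [0.0, 0.0, 0.3, 0.7],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 0.0]]
print(beam_parse(prob, 1))
print(beam_parse(prob, 2))
```

With k = 1 the locally best choice for chunk 1 blocks the globally better analysis, while k = 2 recovers it. The score of each hypothesis is the product of the pairwise probabilities, following the independence assumption above.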
3.2 Training with SVMs 
In order to use SVMs for dependency analysis, we need to prepare
positive and negative examples, since an SVM is a binary
classifier. We adopt a simple and effective method for our purpose:
out of all combinations of two chunks in the training data, we take
a pair of chunks that are in a dependency relation as a positive
example, and two chunks that appear in a sentence but are not in a
dependency relation as a negative example.
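$$\bigcup_{1 \le i < j \le m} (\mathbf{f}_{ij}, y_{ij}) = \{(\mathbf{f}_{12}, y_{12}), (\mathbf{f}_{23}, y_{23}), \ldots, (\mathbf{f}_{m-1\,m}, y_{m-1\,m})\}$$
$$\mathbf{f}_{ij} = \{f_1, \ldots, f_n\} \in \mathbf{R}^n, \qquad y_{ij} \in \{\text{Depend}(+1), \text{Not-Depend}(-1)\}$$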
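The construction of positive and negative examples from one annotated sentence can be sketched as below. The function name `make_examples`, the 0-based indexing, and the `features` callable are hypothetical; restricting the pairs to i < j is an assumption that follows from the constraint that each chunk modifies a chunk to its right.

```python
def make_examples(features, gold_heads):
    """features(i, j) -> feature vector for the chunk pair (hypothetical callable).
    gold_heads[i] is the annotated head of chunk i, for all chunks but the last."""
    m = len(gold_heads) + 1  # number of chunks
    examples = []
    for i in range(m - 1):
        for j in range(i + 1, m):
            label = +1 if gold_heads[i] == j else -1  # Depend / Not-Depend
            examples.append((features(i, j), label))
    return examples

# Toy stand-in for real linguistic features: the pair indices and their distance.
toy_features = lambda i, j: (i, j, j - i)
examples = make_examples(toy_features, gold_heads=[1, 3, 3])
for vector, label in examples:
    print(vector, label)
```

A sentence with m chunks yields m(m-1)/2 pairs, of which only m-1 are positive, so the negative examples dominate as sentences get longer.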
Then, we define the dependency probability
$P(Dep(i) = j \mid \mathbf{f}'_{ij})$:
$P(Dep(i) = j \mid \mathbf{f}'_{ij}) =$    (11)
(11) shows that the distance between the test data
$\mathbf{f}'_{ij}$ and the separating hyperplane is put into the
sigmoid function, on the assumption that the result represents the
probability value of the dependency relation.
We adopt this method in our experiments to transform the distance
measure obtained from SVMs into a probability function, and analyze
dependency structure within the framework of a conventional
probability model.²
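The transformation of an SVM output into a probability-like value can be illustrated as follows. The logistic form and the slope parameter `a` are assumptions made for this sketch; the right-hand side of (11) is not reproduced here, so this should not be read as the exact function used in the experiments.

```python
import math

def dependency_probability(decision_value, a=1.0):
    """Squash an SVM decision value into (0, 1) with a logistic sigmoid.
    Both the logistic form and the slope `a` are assumptions of this sketch."""
    return 1.0 / (1.0 + math.exp(-a * decision_value))

# Hypothetical decision values: far on the positive side, on the hyperplane,
# and far on the negative side.
for value in (2.0, 0.0, -2.0):
    print(value, dependency_probability(value))
```

A pair lying exactly on the separating hyperplane receives 0.5, and the value grows monotonically with the signed distance, so the ranking of candidate heads by SVM output is preserved.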
3.3 Static and Dynamic Features 
Features that are supposed to be effective in Japanese dependency
analysis are: head words and their parts-of-speech, particles and
inflection forms of the words that appear at the end of chunks, the
distance between two chunks, and the existence of punctuation
marks. As these are defined solely by the pair of chunks, we refer
to them as static features.
Japanese dependency relations are heavily constrained by such
static features, since the inflection forms and postpositional
particles constrain the dependency relation. However, when a
sentence is long and there is more than one candidate that a chunk
may depend on, static features by themselves cannot determine the
correct dependency. Let us look at the following example.
    watashi-ha  kono-hon-wo    motteiru  josei-wo  sagasiteiru
    I-top       this book-acc  have      lady-acc  be looking for
In this example, "kono-hon-wo (this book-acc)" may modify either
"motteiru (have)" or "sagasiteiru (be looking for)", and the
correct one cannot be determined with the static features alone.
However, "josei-wo (lady-acc)" can modify only the verb
"sagasiteiru". Knowing such information is quite useful for
resolving syntactic ambiguity, since two accusative noun phrases
hardly ever modify the same verb. It is possible to use such
information if we add new features related to other modifiers. In
the above case, the chunk "sagasiteiru" can receive a new feature
of accusative modification (by "josei-wo") during the parsing
process, which precludes the chunk "kono-hon-wo" from modifying
"sagasiteiru", since there is a strict constraint on
double-accusative modification that will be learned from the
training examples. We decided to take into consideration all such
modification information by using functional words or inflection
forms of modifiers.

² Experimentally, it has been shown that the sigmoid function gives
a good approximation of the probability function from the decision
function of SVMs (Platt, 1999).
Using such information about modifiers in the training phase poses
no difficulty, since it is clearly available in a tree-bank. On the
other hand, it is not known in the parsing phase of the test data.
This problem can be easily solved if we adopt a bottom-up parsing
algorithm and attach the modification information dynamically to
the newly constructed phrases (the chunks that become the heads of
the phrases). As we describe later, we apply a beam search for
parsing, and it is possible to keep several intermediate solutions
while suppressing the combinatorial explosion.
We refer to the features that are added in- 
crementally during the parsing process as dy- 
namic features. 
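How dynamic features might be collected during parsing can be sketched with the example sentence above. The function name, the 0-based chunk indices, the partial head map, and the type labels are all hypothetical; the sketch only assumes that the heads of the chunks to the right of the modifier have already been decided.

```python
def dynamic_features(i, j, heads, chunk_type):
    """Types of the chunks strictly between i and j that already modify j.
    heads: partial map chunk -> head built so far during parsing."""
    return [chunk_type[c] for c in range(i + 1, j) if heads.get(c) == j]

# Chunks: 0 watashi-ha, 1 kono-hon-wo, 2 motteiru, 3 josei-wo, 4 sagasiteiru.
# Hypothetical type labels (functional word or inflection form) per chunk.
chunk_type = ["ha", "wo", "base-form", "wo", "base-form"]
# Heads decided so far when parsing from the end of the sentence.
heads = {3: 4, 2: 3}

to_sagasiteiru = dynamic_features(1, 4, heads, chunk_type)
to_motteiru = dynamic_features(1, 2, heads, chunk_type)
print(to_sagasiteiru, to_motteiru)
```

For the candidate pair (kono-hon-wo, sagasiteiru) the feature list records that an accusative chunk already modifies the verb, which is the information a classifier can use to disfavour a second accusative modifier.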
4 Experiments and Discussion 
4.1 Experiments Setting 
We use the Kyoto University text corpus (Version 2.0), consisting
of articles from the Mainichi newspaper annotated with dependency
structure (Kurohashi and Nagao, 1997). 7,958 sentences from the
articles of January 1st to January 7th are used as the training
data, and 1,246 sentences from the articles of January 9th are used
as the test data. For the kernel function, we used the polynomial
function (9). We set the soft margin parameter C to 1.
The feature set used in the experiments is shown in Table 1. The
static features are basically taken from Uchimoto's list (Uchimoto
et al., 1999) with little modification. In Table 1, 'Head' means
the rightmost content word in a chunk whose part-of-speech is not a
functional category. 'Type' means the rightmost functional word, or
the inflectional form of the rightmost predicate if there is no
functional word in the chunk. The static features include
information on the existence of brackets, question marks,
punctuation marks, etc. In addition, there are features that show
the relative relation of the two chunks, such as the distance and
the existence of brackets, quotation marks and punctuation marks
between them.
For dynamic features, we selected the functional words or
inflection forms of the rightmost predicates in the chunks that
appear between the two chunks and depend on the modifiee.
Considering the data sparseness problem, we apply a simple
filtering based on the part-of-speech of functional words: we use
the lexical form if the word's POS is particle, adverb, adnominal
or conjunction; we use the inflection form if the word has
inflection; and we use the POS tag otherwise.

Static Features
  Left/Right Chunks   Head (surface-form, POS, POS-subcategory,
                            inflection-type, inflection-form),
                      Type (surface-form, POS, POS-subcategory,
                            inflection-type, inflection-form),
                      brackets, quotation-marks, punctuation-marks,
                      position in sentence (beginning, end)
  Between Chunks      distance (1, 2-5, 6-), case-particles,
                      brackets, quotation-marks, punctuation-marks
Dynamic Features      Form of functional words or inflection that
                      modifies the right chunk
Table 1: Features used in experiments
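The filtering rule for dynamic features can be written as a three-way choice. The dictionary schema (`surface`, `pos`, `inflection_form`) and the English POS names are hypothetical stand-ins for whatever the morphological analyzer actually provides.

```python
# POS categories for which the lexical (surface) form is kept.
LEXICAL_POS = {"particle", "adverb", "adnominal", "conjunction"}

def dynamic_feature_value(word):
    """word: dict with 'surface', 'pos', and optionally 'inflection_form'
    (hypothetical schema). Returns the value used as a dynamic feature."""
    if word["pos"] in LEXICAL_POS:
        return word["surface"]           # lexical form
    if word.get("inflection_form"):
        return word["inflection_form"]   # inflection form for inflecting words
    return word["pos"]                   # POS tag for everything else

print(dynamic_feature_value({"surface": "wo", "pos": "particle"}))
print(dynamic_feature_value({"surface": "motteiru", "pos": "verb",
                             "inflection_form": "base-form"}))
print(dynamic_feature_value({"surface": "hon", "pos": "noun"}))
```

Backing off from lexical forms to inflection forms and POS tags keeps the number of distinct dynamic feature values small, which is the stated motivation of the filtering.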
4.2 Results of Experiments 
Table 2 shows the parsing accuracy under the condition k = 5 (beam
width) and d = 3 (dimension of the polynomial function used for the
kernel function).
This table shows two types of dependency accuracy, A and B. The
training data size is measured by the number of sentences. Accuracy
A is the accuracy over all dependency relations. Since Japanese is
a head-final language, the second chunk from the end of a sentence
always modifies the last chunk. Accuracy B is calculated by
excluding this dependency relation. Hereafter, we use accuracy A
unless explicitly specified otherwise, since this measure is
usually used in the literature.
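The evaluation measures can be sketched as follows. Accuracies A and B follow the definitions above; sentence accuracy is not defined in the text, so the sketch assumes it is the fraction of sentences whose dependencies are all correct. The function name and the toy data are hypothetical.

```python
def evaluate(gold, predicted):
    """gold, predicted: lists of sentences; each sentence is the list of heads
    for chunks 0..m-2. Returns (accuracy A, accuracy B, sentence accuracy)."""
    total_a = correct_a = total_b = correct_b = exact = 0
    for g, p in zip(gold, predicted):
        matches = [gh == ph for gh, ph in zip(g, p)]
        total_a += len(matches)
        correct_a += sum(matches)
        # Accuracy B excludes the second chunk from the end,
        # whose head is always the last chunk.
        total_b += len(matches[:-1])
        correct_b += sum(matches[:-1])
        # Assumed definition: a sentence counts only if every head is correct.
        exact += all(matches)
    return correct_a / total_a, correct_b / total_b, exact / len(gold)

gold = [[1, 3, 3], [1, 2]]
predicted = [[1, 3, 3], [2, 2]]
acc_a, acc_b, acc_sentence = evaluate(gold, predicted)
print(acc_a, acc_b, acc_sentence)
```

Because the excluded dependency is always correct, accuracy B is never higher than accuracy A on the same output.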
4.3 Effects of Dynamic Features 
Table 3 shows the accuracy when only static features are used.
Generally, the results with the dynamic feature set are better than
the results without it: the dependency accuracy with dynamic
features consistently exceeds that with static features only. In
most cases, the improvement is significant. In the experiments, we
restrict the dynamic features to those taken from the chunks that
appear between the two chunks under consideration; however, dynamic
features could also be taken from chunks that do not appear between
the two chunks. For example, we could also take into consideration
the chunk that is modified by the right chunk, or the chunks that
modify the left chunk. We leave experiments in such a setting for
future work.

Training data   Dependency Accuracy   Sentence
(sentences)     A          B          Accuracy
1172            86.52%     84.86%     39.31%
1917            87.21%     85.62%     40.06%
3032            87.67%     86.14%     42.94%
4318            88.35%     86.91%     44.15%
5540            88.66%     87.26%     45.20%
6756            88.77%     87.38%     45.36%
7958            89.09%     87.74%     46.17%
Table 2: Result (d = 3, k = 5)

Training data   Dependency Accuracy   Sentence
(sentences)     A          B          Accuracy
1172            86.12%     84.41%     38.50%
1917            86.81%     85.18%     39.80%
3032            87.62%     86.10%     42.45%
4318            88.33%     86.89%     44.47%
5540            88.40%     86.96%     43.66%
6756            88.55%     87.13%     45.04%
7958            88.77%     87.38%     45.04%
Table 3: Result without dynamic features (d = 3, k = 5)

[Figure 1: Training Data vs Accuracy. Horizontal axis: number of
training data (sentences).]

Dimension    Dependency   Sentence
of Kernel    Accuracy     Accuracy
1            N/A          N/A
2            86.87%       40.60%
3            87.67%       42.94%
4            87.72%       42.78%
Table 4: Dimension vs. Accuracy (3032 sentences, k = 5)
4.4 Training data vs. Accuracy 
Figure 1 shows the relationship between the size of the training
data and the parsing accuracy, both with and without the dynamic
features.
The parser achieves 86.52% accuracy on the test data even with
small training data (1172 sentences). This is due to the ability of
SVMs to cope with the data sparseness problem. Furthermore, it
achieves almost 100% accuracy on the training data, showing that
the training data are completely separated by an appropriate
combination of features. Generally, selecting such specific
features of the training data tends to cause overfitting, and the
accuracy on test data may fall. However, the SVM method achieves
high accuracy not only on the training data but also on the test
data. We claim that this is due to the high generalization ability
of SVMs. In addition, the learning curve suggests that further
improvement will be possible if we increase the size of the
training data.
4.5 Kernel Function vs. Accuracy 
Table 4 shows the relationship between the dimension of the kernel
function and the parsing accuracy under the condition k = 5.
The case of d = 4 gives the best dependency accuracy. We could not
carry out the training in realistic time for the case of d = 1.
This result supports our intuition that we need a combination of at
least two features. In other words, it is hard to confirm a
dependency relation with only the features of the modifier or only
those of the modifiee. It is natural that a dependency relation is
decided by at least the information from both of the two chunks. In
addition, further improvement was obtained by considering
combinations of three or more features.
Beam     Dependency   Sentence
Width    Accuracy     Accuracy
1        88.66%       45.76%
3        88.74%       45.20%
5        88.77%       45.36%
7        88.76%       45.36%
10       88.67%       45.28%
15       88.65%       45.28%
Table 5: Beam width vs. Accuracy (6756 sentences, d = 3)
4.6 Beam width vs. Accuracy 
Sekine (Sekine et al., 2000) gives an interesting report on the
relationship between the beam width and the parsing accuracy.
Generally, high parsing accuracy is expected when a large beam
width is employed in dependency structure analysis. However, their
result runs against this intuition. They report that a beam width
between 3 and 10 gives the best parsing accuracy, and that parsing
accuracy falls with a width larger than 10. This result suggests
that Japanese dependency structures may consist of a series of
local optimization processes.
We evaluate the relationship between the beam width and the parsing
accuracy. Table 5 shows the results under the condition d = 3, with
the beam width varied from k = 1 to 15. The best parsing accuracy
is achieved at k = 5 and the best sentence accuracy is achieved at
k = 5 and k = 7.
We have to consider how to set the beam width that gives the best
parsing accuracy. We believe that this beam width is related not
only to the length of the sentence, but also to the lexical entries
and parts-of-speech that comprise the chunks.
4.7 Committee based approach 
Instead of learning a single classifier using all the training
data, we can build n classifiers by dividing the training data into
n parts, and decide the final result by their voting. This approach
would reduce the computational overhead. The use of a
multi-processing computer would help to reduce the training time
considerably, since all the individual training runs can be carried
out in parallel.
To investigate the effectiveness of this method, we perform a
simple experiment: dividing all the training data (7958 sentences)
into 4 parts, we give the final dependency score as a weighted
average of the individual scores. This simple voting approach
achieves an accuracy of 88.66%, which is nearly the same as the
accuracy achieved with 5540 training sentences.
In this experiment, we simply give an equal weight to each
classifier. However, if we optimized the voting weights more
carefully, further improvements could be achieved (Inui and Inui,
2000).
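The committee scheme can be sketched in a few lines. The function names are hypothetical, and the round-robin split is an assumption; the text only states that the training data are divided into 4 parts and that the scores are averaged with equal weights.

```python
def split_training_data(sentences, n):
    """Divide the training sentences into n roughly equal parts
    (round-robin assignment; the actual split used is not specified)."""
    return [sentences[i::n] for i in range(n)]

def committee_score(scores, weights=None):
    """Weighted average of the dependency scores that the individual
    classifiers assign to one chunk pair."""
    if weights is None:
        weights = [1.0] * len(scores)  # equal weights, as in the experiment
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

parts = split_training_data(list(range(10)), 4)
print([len(p) for p in parts])
print(committee_score([0.2, 0.4, 0.6, 0.8]))           # equal weights
print(committee_score([1.0, 0.0], weights=[3.0, 1.0]))  # unequal weights
```

Each part can be trained independently, and the averaged score can then be used in place of a single classifier's score when ranking candidate dependencies.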
4.8 Comparison with Related Work 
Uchimoto (Uchimoto et al., 1999) and Sekine (Sekine et al., 2000)
report that, using the Kyoto University Corpus for their training
and testing, they achieve around 87.2% accuracy by building a
statistical model based on the Maximum Entropy framework. For the
training data, we used exactly the same data that they used in
order to make a fair comparison. In our experiments, an accuracy of
89.09% is achieved using the same training data. Our model
outperforms Uchimoto's model as far as the accuracies are compared.
Although Uchimoto points out the importance of considering
combinations of features, in the ME framework we must expand these
combinations by introducing a new feature set. Uchimoto
heuristically selects "effective" combinations of features.
However, such a manual selection does not always cover all relevant
combinations that are important in determining the dependency
relation.
We believe that our model is better than others from the viewpoints
of coverage and consistency, since our model learns the
combinations of features without increasing the computational
complexity. If we want to reconsider them, all we have to do is
change the Kernel function. The computational complexity depends on
the number of support vectors, not on the dimension of the Kernel
function.
4.9 Future Work 
The simplest and most effective way to achieve better accuracy is
to increase the training data. However, the proposed method, which
uses all candidate pairs that can form a dependency relation,
requires a great amount of time to compute the separating
hyperplane as the size of the training data increases. The
experiments reported in this paper actually took a long training
time.³
To handle large training data, we have to select only the portion
of the examples that is effective for the analysis. This will
reduce the training overhead as well as the analysis time. The
committee-based approach discussed in Section 4.7 is one method of
coping with this problem. For future research, to reduce the
computational overhead, we will work on the following methods for
sample selection:
• Introduction of constraints on non-dependency
  Some pairs of chunks need not be considered, since grammatical
  constraints rule out any dependency between them. Such pairs need
  not be used as negative examples in the training phase. For
  example, a chunk within quotation marks may not modify a chunk
  located outside the quotation marks. Of course, we have to be
  careful in introducing such constraints, and they should be
  learned from an existing corpus.
• Integration with other simple models
  Suppose that a computationally light and moderately accurate
  learning model is obtainable (there are actually such systems
  based on probabilistic parsing models). We can use the system to
  output some redundant parsing results and use only those results
  for the positive and negative examples. This is another way to
  reduce the size of the training data.
• Error-driven data selection
  We can start with a small amount of training data and a small
  feature set. Then, we can analyze held-out training data and
  select the features that affect the parsing accuracy. This kind
  of gradual increase of training data and feature set will be
  another method for reducing the computational overhead.
5 Summary 
This paper proposes Japanese dependency analysis based on Support
Vector Machines. In experiments with a Japanese bracketed corpus,
the proposed method achieves high accuracy even with small training
data and outperforms existing methods based on Maximum Entropy
Models. The results show that Japanese dependency analysis can be
performed effectively by using SVMs, owing to their good
generalization and resistance to overfitting.

³ With an AlphaServer 8400 (617 MHz), it took 15 days to train with
7958 sentences.

References 
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In
  Proceedings of the NAACL 2000, pages 132-139.
Michael Collins. 1996. A new statistical parser based on bigram
  lexical dependencies. In Proceedings of the ACL '96, pages
  184-191.
C. Cortes and Vladimir N. Vapnik. 1995. Support Vector Networks.
  Machine Learning, 20:273-297.
Y. Freund and Schapire. 1996. Experiments with a new Boosting
  algorithm. In 13th International Conference on Machine Learning.
Masakazu Fujio and Yuji Matsumoto. 1998. Japanese Dependency
  Structure Analysis based on Lexicalized Statistics. In
  Proceedings of EMNLP '98, pages 87-96.
Masahiko Haruno, Satoshi Shirai, and Yoshifumi Ooyama. 1998. Using
  Decision Trees to Construct a Partial Parser. In Proceedings of
  the COLING '98, pages 505-511.
Takashi Inui and Kentaro Inui. 2000. Committee-based Decision
  Making in Probabilistic Partial Parsing. In Proceedings of the
  COLING 2000, pages 348-354.
Thorsten Joachims. 1998. Text Categorization with Support Vector
  Machines: Learning with Many Relevant Features. In European
  Conference on Machine Learning (ECML).
Sadao Kurohashi and Makoto Nagao. 1997. Kyoto University text
  corpus project. In Proceedings of the ANLP, Japan, pages 115-118.
John C. Platt. 1999. Probabilistic Outputs for Support Vector
  Machines and Comparisons to Regularized Likelihood Methods. In
  Advances in Large Margin Classifiers. MIT Press.
Adwait Ratnaparkhi. 1997. A Linear Observed Time Statistical Parser
  Based on Maximum Entropy Models. In Proceedings of EMNLP '97.
Satoshi Sekine, Kiyotaka Uchimoto, and Hitoshi Isahara. 2000.
  Backward Beam Search Algorithm for Dependency Analysis of
  Japanese. In Proceedings of the COLING 2000, pages 754-760.
Hirotoshi Taira and Masahiko Haruno. 1999. Feature Selection in SVM
  Text Categorization. In AAAI-99.
Kiyotaka Uchimoto, Satoshi Sekine, and Hitoshi Isahara. 1999.
  Japanese Dependency Structure Analysis Based on Maximum Entropy
  Models. In Proceedings of the EACL, pages 196-203.
Vladimir N. Vapnik. 1998. Statistical Learning Theory.
  Wiley-Interscience.
