Proceedings of EACL '99 
Japanese Dependency Structure Analysis 
Based on Maximum Entropy Models 
Kiyotaka Uchimoto t Satoshi Sekine$ Hitoshi Isahara t 
tCommunications Research Laboratory 
Ministry of Posts and Telecommunications 
588-2, Iwaoka, Iwaoka-cho, Nishi-ku 
Kobe, Hyogo, 651-2401, Japan 
\[uchimot o i isahara\] ©crl. go. j p 
SNew York University 
715 Broadway, 7th floor 
New York, NY 10003, USA 
sekine~cs, nyu. edu 
Abstract 
This paper describes a dependency 
structure analysis of Japanese sentences 
based on the maximum entropy mod- 
els. Our model is created by learning 
the weights of some features from a train- 
ing corpus to predict the dependency be- 
tween bunsetsus or phrasal units. The 
dependency accuracy of our system is 
87.2% using the Kyoto University cor- 
pus. We discuss the contribution of each 
feature set and the relationship between 
the number of training data and the ac- 
curacy. 
1 Introduction 
Dependency structure analysis is one of the ba- 
sic techniques in Japanese sentence analysis. The 
Japanese dependency structure is usually repre- 
sented by the relationship between phrasal units 
called 'bunsetsu.' The analysis has two concep- 
tual steps. In the first step, a dependency matrix 
is prepared. Each element of the matrix repre- 
sents how likely one bunsetsu is to depend on the 
other. In the second step, an optimal set of de- 
pendencies for the entire sentence is found. In 
this paper, we will mainly discuss the first step, a 
model for estimating dependency likelihood. 
So far there have been two different approaches 
to estimating the dependency likelihood, One is 
the rule-based approach, in which the rules are 
created by experts and likelihoods are calculated 
by some means, including semiautomatic corpus- 
based methods but also by manual assignment of 
scores for rules. However, hand-crafted rules have 
the following problems. 
• They have a problem with their coverage. Be- 
cause there are many features to find correct 
dependencies, it is difficult to find them man- 
ually. 
• They also have a problem with their consis- 
tency, since many of the features compete 
with each other and humans cannot create 
consistent rules or assign consistent scores. 
• As syntactic characteristics differ across dif- 
ferent domains, the rules have to be changed 
when the target domain changes. It is costly 
to create a new hand-made rule for each do- 
main. 
At/other approach is a fully automatic corpus- 
based approach. This approach has the poten- 
tial to overcome the problems of the rule-based 
approach. It automatically learns the likelihoods 
of dependencies from a tagged corpus and calcu- 
lates the best dependencies for an input sentence. 
We take this approach. This approach is taken by 
some other systems (Collins, 1996; Fujio and Mat- 
sumoto, 1998; Haruno et ah, 1998). The parser 
proposed by Ratnaparkhi (Ratnaparkhi, 1997) is 
considered to be one of the most accurate parsers 
in English. Its probability estimation is based on 
the maximum entropy models. We also use the 
maximum entropy model. This model learns the 
weights of given features from a training corpus. 
The weights are calculated based on the frequen- 
cies of the features in the training data. The set of 
features is defined by a human. In our model, we 
use features of bunsetsu, such as character strings, 
parts of speech, and inflection types of bunsetsu, 
as well as information between bunsetsus, such as 
the existence of punctuation, and the distance be- 
tween bunsetsus. The probabilities of dependen- 
cies are estimated from the model by using those 
features in input sentences. We assume that the 
overall dependencies in a whole sentence can be 
determined as the product of the probabilities of 
all the dependencies in the sentence. 
196 
Proceedings of EACL '99 
Now, we briefly describe the algorithm of de- 
pendency analysis. It is said that Japanese de- 
pendencies have the following characteristics. 
(1) Dependencies are directed from left to right 
(2) Dependencies do not cross 
(3) A bunsetsu, except for the rightmost one, de- 
pends on only one bunsetsu 
(4) In many cases, the left context is not neces- 
sary to determine a dependency 1 
The analysis method proposed in this paper is de- 
signed to utilize these features. Based on these 
properties, we detect the dependencies in a sen- 
tence by analyzing it backwards (from right to 
left). In the past, such a backward algorithm has 
been used with rule-based parsers (e.g., (Fujita, 
1988)). We applied it to our statistically based 
approach. Because of the statistical property, we 
can incorporate a beam search, an effective way of 
limiting the search space in a backward analysis. 
2 The Probability Model 
Given a tokenization of a test corpus, the prob- 
lem of dependency structure analysis in Japanese 
can be reduced to the problem of assigning one 
of two tags to each relationship which consists of 
two bunsetsus. A relationship could be tagged as 
"0" or "1" to indicate whether or not there is a 
dependency between the bunsetsus, respectively. 
The two tags form the space of "futures" for a 
maximum entropy formulation of our dependency 
problem between bunsetsus. A maximum entropy 
solution to this, or any other similar problem al- 
lows the computation of P(f\[h) for any f from the 
space of possible futures, F, for every h from the 
space of possible histories, H. A "history" in max- 
imum entropy is all of the conditioning data which 
enables you to make a decision among the space 
of futures. In the dependency problem, we could 
reformulate this in terms of finding the probabil- 
ity of f associated with the relationship at index 
t in the test corpus as: 
P(f\]ht) = P(fl Information derivable 
from the test corpus 
related to relationship t) 
The computation of P(f\]h) in M.E. is depen- 
dent on a set of '`features" which, hopefully, are 
helpful in making a prediction about the future. 
Like most current M.E. modeling efforts in com- 
putational linguistics, we restrict ourselves to fea- 
tures which are binary functions of the history and 
aAssumption (4) has not been discussed very much, 
but our investigation with humans showed that it is 
true in more than 90% of cases. 
future. For instance, one of our features is 
g 1 : 
g(h,f) = 
t 0 : 
Here "has(h,z)" is a binary function which re- 
turns true if the history h has an attribute x. We 
focus on attributes on a bunsetsu itself and those 
between bunsetsus. Section 3 will mention these 
attributes. 
Given a set of features and some training data, 
the maximum entropy estimation process pro- 
duces a model in which every feature gi has as- 
sociated with it a parameter ai. This allows us 
to compute the conditional probability as follows 
(Berger et al., 1996): 
P(flh) - YIia\[ '(n'l) z~(h) (2) 
~,i • (3) 
I i 
The maximum entropy estimation technique 
guarantees that for every feature gi, the expected 
value of gi according to the M.E. model will equal 
the empirical expectation of gi in the training cor- 
pus. In other words: 
y\]~ P(h, f). g,(h, f) 
h,! 
= y-~P(h).y~P~(Slh)-g,(h,1). (41 
h ! 
Here /3 is an empirical probability and PME is 
the probability assigned by the M.E. model. 
We assume that dependencies in a sentence are 
independent of each other and the overall depen- 
dencies in a sentence can be determined based on 
the product of probability of all dependencies in 
the sentence. 
if has(h, x) = ture, 
= "Posterior- Head- 
POS(Major) : ~\[J'~(verb)" (1) 
&f=l 
otherwise. 
3 Experiments and Discussion 
In our experiment, we used the Kyoto University 
text corpus (version 2) (Kurohashi and Nagao, 
1997), a tagged corpus of the Mainichi newspaper. 
For training we used 7,958 sentences from news- 
paper articles appearing from January 1st to Jan- 
uary 8th, and for testing we used 1,246 sentences 
from articles appearing on January 9th. The input 
sentences were morphologically analyzed and their 
bunsetsus were identified. We assumed that this 
preprocessing was done correctly before parsing 
input sentences. If we used automatic morpholog- 
ical analysis and bunsetsu identification, the pars- 
ing accuracy would not decrease so much because 
the rightmost element in a bunsetsu is usually a 
case marker, a verb ending, or a adjective end- 
ing, and each of these is easily recognized. The 
automatic preprocessing by using public domain 
197 
Proceedings of EACL '99 
tools, for example, can achieve 97% for morpho- 
logical analysis (Kitauchi et al., 1998) and 99% for 
bunsetsu identification (Murata et al., 1998). 
We employed the Maximum Entropy tool made 
by Ristad (Ristad, 1998), which requires one to 
specify the number of iterations for learning. We 
set this number to 400 in all our experiments. 
In the following sections, we show the features 
used in our experiments and the results. Then we 
describe some interesting statistics that we found 
in our experiments. Finally, we compare our work 
with some related systems. 
3.1 Results of Experiments 
The features used in our experiments are listed in 
Tables 1 and 2. Each row in Table 1 contains a 
feature type, feature values, and an experimental 
result that will be explained later. Each feature 
consists of a type and a value. The features are 
basically some attributes of a bunsetsu itself or 
those between bunsetsus. We call them 'basic fea- 
tures.' The list is expanded from tIaruno's list 
(Haruno et al., 1998). The features in the list are 
classified into five categories that are related to 
the "Head" part of the anterior bunsetsu (cate- 
gory "a"), the '~rype" part of the anterior bun- 
setsu (category "b"), the "Head" part of the pos- 
terior bunsetsu (category "c"), the '~l~ype " part 
of the posterior bunsetsu (category "d"), and the 
features between bunsetsus (category "e") respec- 
tively. The term "Head" basically means a right- 
most content word in a bunsetsu, and the term 
"Type" basically means a function word following 
a "Head" word or an inflection type of a "Head" 
word. The terms are defined in the following para- 
graph. The features in Table 2 are combinations 
of basic features ('combined features'). They are 
represented by the corresponding category name 
of basic features, and each feature set is repre- 
sented by the feature numbers of the correspond- 
ing basic features. They are classified into nine 
categories we constructed manually. For exam- 
ple, twin features are combinations of the features 
related to the categories %" and "c." Triplet, 
quadruplet and quintuplet features basically con- 
sist of the twin features plus the features of the 
remainder categories "a," "d" and "e." The to- 
tal number of features is about 600,000. Among 
them, 40,893 were observed in the training corpus, 
and we used them in our experiment. 
The terms used in the table are the following: 
Anterior: left bunsetsu of the dependency 
Posterior: right bunsetsu of the dependency 
Head: the rightmost word in a bunsetsu other 
than those whose major part-of-speech 2 cat- 
egory is "~ (special marks)," "1~ (post- 
positional particles)," or "~ (suffix)" 
2Part-of-speech categories follow those of JU- 
MAN(Kurohashi and Nagao, 1998). 
Head-Lex: the fundamental form (uninflected 
form) of the head word. Only words with 
a frequency of three or more are used. 
Head-Inf: the inflection type of a head 
Type: the rightmost word other than those 
whose major part-of-speech category is "~ 
(special marks)." If the major category of 
the word is neither "IIJJ~-~-\] (post-positional par- 
ticles)" nor "~\[~:~. (suffix)," and the word is 
inflectable 3, then the type is represented by 
the inflection type. 
JOStiIl: the rightmost post-positional particle 
in the bunsetsu 
JOSttI2: the second rightmost post-positional 
particle in the bunsetsu if there are two or 
more post-positional particles in the bunsetsu 
TOUTEN, WA: TOUTEN means if a comma 
(Touten) exists in the bunsetsu. WA means 
if the word WA (a topic marker) exists in the 
bunsetsu 
BW: BW means "between bunsetsus" 
BW-Distance: the distance between the bunset- 
sus 
BW-TOUTEN: if TOUTEN exists between 
bunsetsus 
BW-IDto-Anterior-Type: 
BW-IDto-Anterior-Type means if there is a 
bunsetsu whose type is identical to that of 
the anterior bunsetsu between bunsetsus 
BW-IDto-Anterior-Type-Head-P OS: the 
part-of-speech category of the head word of 
the bunsetsu of "BW-IDto-Anterior-Type" 
BW-IDto-Posterior-Head: if there is between 
bunsetsus a bunsetsu whose head is identical 
to that of the posterior bunsetsu 
BW-IDto-Posterior- Head-Type(String): 
the lexical information of the bunsetsu "BW- 
IDto-Posterior-Head" 
The results of our experiment are listed in Ta- 
ble 3. The dependency accuracy means the per- 
centage of correct dependencies out of all depen- 
dencies. The sentence accuracy means the per- 
centage of sentences in which all dependencies 
were analyzed correctly. We used input sentences 
that had already been morphologically analyzed 
and for which bunsetsus had been identified. The 
first line in Table 3 (deterministic) shows the ac- 
curacy achieved when the test sentences were an- 
alyzed deterministically (beam width k = 1). The 
second line in Table 3 (best beam search) shows 
the best accuracy among the experiments when 
changing the beam breadth k from 1 to 20. The 
best accuracy was achieved when k = 11, although 
the variation in accuracy was very small. This re- 
sult supports assumption (4) in Chapter 1 because 
3The inflection types follow those of JUMAN. 
198 
Proceedings of EACL '99 
Category \] Feature number \[ Feature type 
Table 1: Features (basic features) 
Basic features (5 categories, 43 types) \[ 
• Feature values ... (Number of values) Accuracy without I 
each feature 
1 
2 
a 3 
4 
5 
6 
7 
8 
9 
b 10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 
29 
30 
31 
32 
33 
34 
35 
36 
37 
38 
39 
40 
41 
42 
43 
Anterior-Head-Lex 
Anterior-Head-POS(Major) 
Anterior-Head-POS(Minor) 
Anterior-Head-lnf(Major) 
Anterior-Head-I nf(Minor) 
Anterior-Type(String) 
Anterior-Type(Major) 
Anterior-Type(Minor) 
Anterior-J OSHll(String) 
Anterior-JOSHI 1/Minor ) 
Anterior-J OSHI2(String) 
Anterior-JOSHI2(Minor) 
Anterior-punctuation 
Anterior-bracket-open 
Anterior-bracket-close 
(2204) 
(verb), ~I#~-\] (adjective), ~ (noun) .... (117 
~1~ ~\] (common noun), ~ (quantifier) .... (24) 
~j\[t\]~ (vowel verb) .... (307 
~(stem), ~r~ (fundamental form) .... (6O) 
~, ~ a, ~c L-C, ~, &, tO, t .... (73) 
(post-positional particle), (43) 
:~\]\]J3~ (case marker), ~.zx.~ (imperative form) 
.... (lO2) 
~b, ~'~*, a)Jk, ~, ~t~., ... (63) 
\[nil\], ;~J~ (case marker) .... (5) 
YJ'~:', ~, A e', ,\];:, ~*, ... (63) 
;~gJJ~ (case marker) .... (4) 
\[ml\], comma, pemod (3) 
 nil ,\[nil\]' /<,, , >, :: 111 , 
Posterior-Head-Lex 
Post erior- Head- P OS (Maj or) 
Posterior-Head-POS (Minor) 
Posterior-Head-Inf(Maj or 7 
Post erior-Head-Inf(Minor) 
Posterior-Type(String) 
Posterior-Type(Major) 
Posterior-Type(Minor~ 
Posterior-JOSHll(Strmg) 
Posterior-JOSHIl(Minor) 
Posterior- J OS HI2( St ring) 
Posterior- JOSHI 2(Minor) 
Posterior- punct Uatlon 
Post erior-bracket- open 
Posterior-bracket-close 
BW-Dist ance 
BW-TOU'I'EIN 
BW-WA 
BW-brackets 
BW-IDt o-Ant erior-Type 
BW- IDto-Anterior-Type- 
Head-POS(Major) 
B W- IDt o-Ant erior-Type- 
Head-POS(Minor) 
BW- IDto-Ant erior-Type- 
Head-lnf(Major) 
BW- IDtc-Ant erior-Type- 
Head-lnf(Minor) 
BW-IDto-Posterior-Head 
BW- IDto-Posterior- Head- 
Type(String) 
BW- IDt o- Posterior-Head- 
Type(Major) 
BW- IDt o-Post erior-Head- 
Type(Minor) 
The same values as those of feature number 1. 
The same values as those of feature number 2. 
The same values as those of feature number 3. 
The same values as those of feature number 4. 
The same values as those of feature number 5. 
The same values as those of feature number 6. 
The same values as those of feature number 7. 
The same values as those of feature number 8. 
The same values as those of feature number 9. 
The same values as those of feature number 10. 
The same values as those of feature number 11. 
The same values as those of feature number 12. 
The same values as those of feature number 13. 
The same values as those of feature number 14. 
The same values as those of feature number 15. 
A(1), B~2 ~ 5), C(6 or more) (3) 
\[nil\], \[extstJ (2~ 
\[hill, \[exist\] (27 \[nil\], 
close, open, open-close (4) 
\[nil\], \[existJ (2) 
The same values as those of feature number 2. 
The same values as those of feature number 3. 
The same values as those of feature number 4. 
The same values as those of feature number 5. 
\[nilJ, \[exist\] (2) 
The same values as those of feature number 6. 
The same values as those of feature number 7. 
The same values as those of feature number 8. 
86.96% (-0.16%) 
86.43% (--0.71%) 
87.14% (4-0%) 
69.73% (--17.41%) 
87.11% (-0.03%) 
87.08% (-0.06%) 
85.47~ (-- 1.67v£ 
87.12% ~--0.02% 
87.10% (--0.04% 
86.31% (-0.83% 
76.15~ (--10.99%) 
87.14% (4---0% 7 
86.06% (- 1.08%) 
87.16% (+0.02% 7 
87.11% (-0.03%) 
s4.62~ (-2.52%) 
s6.s7z ~-o.27~'o) 
66.85% (-0.29%) 
84.64% (-2.50%) 
66.81% (-0.33%) 
86.96% (--0.18,%) 
86.08% ~-- 1.06%) 
86.99% (--0.15%) 
86.75% (-o.39%) 
Combination type 
Twin features: 
related to the "Type" part of 
the anterior bunsetsu and the 
"Head" part of the posterior 
bunsetsu. 
Triplet features: 
basically consist of the twin 
features plus the features 
between bunsetsus. 
Quadruplet features: 
basically consist of the twin 
features plus the features 
related to the "Head" part of 
the anterior bunsetsu, and the 
"Type" part of the posterior 
bunsetsu. 
Table 2: Features (combined features) 
Combined features (9 categories, 134 types) 
Combinations 
Category 
(b, c) 
(bx, b2, c) (b, c, e) 
(dl, d2, e) 
(bl, b2, c, d) (b, c, el, e2) 
(a, b, c, d) 
Feature set 
b = {6, 7, 8}, c = {16, 17, 18} 
(bl, b2) = {(9, 11),(10, 12)}, c = {17, 18} b = {6, 7, 8}, c = {17, lS}, 
e = {31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43} 
(dl, d,, e) = (29, 30, 34) 
b I = {6, 7, 8}, c = {17, 18},(b2, d) = (13, 28) 
b = {6, 7, 8), c = {17, 18},(el,e2) = (35, 40) 
(a, c) = {(1, 16), (2, 17), (3, 18)}, 
(b, d) = {(6, 21), (7, 22), (8, 23)} 
Accuracy without 
the feature 
86.99% (-o.15%) 
66.47%(-0.67%) 
85.65% (-1.49%) 
Quintuplet features: (a, bl, b2, c, d) (a, c) = {(2, 17), (3, 18)}, 86.96% (-0.18%) 
basically consist of the (bl, b2) = {(9, 11), (I0, 12)}, d = {21,22,23} 
quadruplet features plus the (a, b, c, d, e) (a, c) = {(1, 16), (2, 17), (3, 18)}, 
features between bunsetsus. (b, d) = {(6, 21), (7, 22), (8, 23}, e = 31 
199 
Proceedings of EACL '99 
Table 3: Results of dependency analysis 
Deterministic (k = 1) 
Best beam search(k = 11) 
Baseline 
Dependency accuracy 
87.14%(9814/11263) 
87.21%(9822/11263) 
64.09%(7219/11263) 
Sentence accuracy 
40.60% (503/1239) 
40.60% (503/1239) 
6.38% (79/1239) 
1.0 
0.8714 ....... 
0.8 
Dependency accuracy 
0.6 
0.4 
0.2 
--~-- 
, , i i I i 
10 20 30 
Number of bunsetsus in a sentence 
Figure 1: Relationship between the number of bunsetsus in a sentence and dependency accuracy. 
it shows that the previous context has almost no 
effect on the accuracy. The last line in Table 3 rep- 
resents the accuracy when we assumed that every 
bunsetsu depended on the next one (baseline). 
Figure 1 shows the relationship between the 
sentence length (the number of bunsetsus) and 
the dependency accuracy. The data for sentences 
longer than 28 segments are not shown, because 
there was at most one sentence of each length. 
Figure 1 shows that the accuracy degradation due 
to increasing sentence length is not significant. 
For the entire test corpus the average running time 
on a SUN Sparc Station 20 was 0.08 seconds per 
sentence. 
3.2 Features and Accuracy 
This section describes how much each feature set 
contributes to improve the accuracy. 
The rightmost column in Tables 1 and 2 shows 
the performance of the analysis without each fea- 
ture set. In parenthesis, the percentage of im- 
provement or degradation to the formal experi- 
ment is shown. In the experiments, when a basic 
feature was deleted, the combined features that 
included the basic feature were also deleted. 
We also conducted some experiments in which 
several types of features were deleted together. 
The results are shown in Table 4. All of the results 
in the experiments were carried out deterministi- 
cally (beam width k = 1). 
The results shown in Table 1 were very close 
to our expectation. The most useful features are 
the type of the anterior bunsetsu and the part- 
of-speech tag of the head word on the posterior 
bunsetsu. Next important features are the dis- 
tance between bunsetsus, the existence of punctu- 
ation in the bunsetsu, and the existence of brack- 
ets. These results indicate preferential rules with 
respect to the features. 
The accuracy obtained with the lexical fea- 
tures of the head word was better than that 
without them. In the experiment with the fea- 
tures, we found many idiomatic expressions, for 
example, "~,, 15-C (oujile, according to)-- b}~b (kimeru, 
decide)" and "~'~" (katachi_de, in the 
form of)-- ~b~ (okonawareru, be held)." We 
would expect to collect more of such expressions 
if we use more training data. 
The experiments without some combined fea- 
tures are reported in Tables 2 and 4. As can 
be seen from the results, the combined features 
are very useful to improve the accuracy. We used 
these combined features in addition to the basic 
features because we thought that the basic fea- 
tures were actually related to each other. With- 
out the combined features, the features are inde- 
pendent of each other in the maximum entropy 
framework. 
We manually selected combined features, which 
are shown in Table 2. If we had used all combi- 
200 
Proceedings of EACL '99 
Table 4: Accuracy without several types of features 
Features 
Without features 1 and 16 (lexical information about the head word) 
Without features 35 to 43 
Without quadruplet and quintuplet features 
Without triplet, quadruplet, and quintuplet features 
Without all combinations 
Accuracy 
86.30% (-0.84%) 
86.83% (-0.31%) 
84.27% (-2.87%) 
81.28% (-5.86%) 
68.83% (-18.31%) 
nations, the number of combined features would 
have been very large, and the training would 
not have been completed on the available ma- 
chine. Furthermore, we found that the accuracy 
decreased when several new features were added 
in our preliminary experiments. So, we should 
not use all combinations of the basic features. We 
selected the combined features based on our intu- 
ition. 
In our future work, we believe some methods 
for automatic feature selection should be studied. 
One of the simplest ways of selecting features is 
to select features according to their frequencies in 
the training corpus. But using this method in our 
current experiments, the accuracy decreased in all 
of the experiments. Other methods that have been 
proposed are one based on using the gain (Berger 
et al., 1996) and an approximate method for se- 
lecting informative features (Shirai et al., 1998a), 
and several criteria for feature selection were pro- 
posed and compared with other criteria (Berger 
and Printz, 1998). We would like to try these 
methods. 
Investigating the sentences which could not be 
analyzed correctly, we found that many of those 
sentences included coordinate structures. We be- 
lieve that coordinate structures can be detected to 
a certain extent by considering new features which 
take a wide range of information into account. 
3.3 Number of Training Data and 
Accuracy 
Figure 2 shows the relationship between the num- 
ber of training data (the number of sentences) and 
the accuracy. This figure shows dependency accu- 
racies for the training corpus and the test corpus. 
Accuracy of 81.84% was achieved even with a very 
small training set (250 sentences). We believe that 
this is due to the strong characteristic of the max- 
imum entropy framework to the data sparseness 
problem. From the learning curve, we can expect 
a certain amount of improvement if we have more 
training data. 
3.4 Comparison with Related Works 
This section compares our work with related 
statistical dependency structure analyses in 
Japanese. 
Comparison with 
Shirai's work (Shirai et al., 1998b) 
Shirai proposed a framework of statistical lan- 
guage modeling using several corpora: the EDR 
corpus, RWC corpus, and Kyoto University cor- 
pus. He combines a parser based on a hand-made 
CFG and a probabilistic dependency model. He 
also used the maximum entropy model to estimate 
the dependency probabilities between two or three 
post-positional particles and a verb. Accuracy of 
84.34% was achieved using 500 test sentences of 
length 7 to 9 bunsetsus. In both his and our ex- 
periments, the input sentences were morphologi- 
cally analyzed and their bunsetsus were identified. 
The comparison of the results cannot strictly be 
done because the conditions were different. How- 
ever, it should be noted that the accuracy achieved 
by our model using sentences of the same length 
was about 3% higher than that of Shirai's model, 
although we used a much smaller set of training 
data. We believe that it is because his approach 
is based on a hand-made CFG. 
Comparison with Ehara's work (Ehara, 1998) 
Ehara also used the Maximum Entropy model, 
and a set of similar kinds of features to ours. How- 
ever, there is a big difference in the number of fea- 
tures between Ehara's model and ours. Besides 
the difference in the number of basic features, 
Ehara uses only the combination of two features, 
but we also use triplet, quadruplet, and quintuplet 
features. As shown in Section 3.2, the accuracy in- 
creased more than 5% using triplet or larger com- 
binations. We believe that the difference in the 
combination features between Ehara's model and 
ours may have led to the difference in the accuracy. 
The accuracy of his system was about 10% lower 
than ours. Note that Ehara used TV news articles 
for training and testing, which are different from 
our corpus. The average sentence length in those 
articles was 17.8, much longer than that (average: 
10.0) in the Kyoto University text corpus. 
Comparison with 
Fujio's work (Fujio and Matsumoto, 1998) 
and Haruno's work (Haruno et al., 1998) 
Fujio used the Maximum Likelihood model 
with similar features to our model in his parser. 
Haruno proposed a parser that uses decision tree 
201 
Proceedings of EACL '99 
A 
0 
< 
O,. 
94 
92 
90 
88 
86 
84 
82 
80 
0 
'2raining" --*- 
"testing ....... 
.... ,+. ............ ~ ............ .+- ............ 
/ 
4 
I I I I I I I 
1000 2000 3000 4000 6000 6000 7000 8000 
Number o! Training Data (sentences) 
Figure 2: Relationship between the number of training data and the parsing accuracy. (beam breadth 
k=l) 
models and a boosting method. It is difficult to 
directly compare these models with ours because 
they use a different corpus, the EDR corpus which 
is ten times as large as our corpus, for training 
and testing, and the way of collecting test data 
is also different. But they reported an accuracy 
of around 85%, which is slightly worse than our 
model. 
We carried out two experiments using almost 
the same attributes as those used in their exper- 
iments. The results are shown in Table 5, where 
the lines "Feature set(l)" and "Feature set(2)" 
show the accuracies achieved by using Fujio's 
attributes and Haruno's attributes respectively. 
Considering that both results are around 85% to 
86%, which is about the same as ours. From these 
experiments, we believe that the important factor 
in the statistical approaches is not the model, i.e. 
Maximum Entropy, Maximum Likelihood, or De- 
cision Tree, but the feature selection. However, 
it may be interesting to compare these models 
in terms of the number of training data, as we 
can imagine that some models are better at cop- 
ing with the data sparseness problem than others. 
This is our future work. 
4 Conclusion 
This paper described a Japanese dependency 
structure analysis based on the maximum en- 
tropy model. Our model is created by learning 
the weights of some features from a training cor- 
pus to predict the dependency between bunset- 
sus or phrasal units. The probabilities of depen- 
dencies between bunsetsus are estimated by this 
model. The dependency accuracy of our system 
was 87.2% using the Kyoto University corpus. 
In our experiments without the feature sets 
shown in Tables 1 and 2, we found that some basic 
and combined features strongly contribute to im- 
prove the accuracy. Investigating the relationship 
between the number of training data and the accu- 
racy, we found that good accuracy can be achieved 
even with a very small set of training data. We 
believe that the maximum entropy framework has 
suitable characteristics for overcoming the data 
sparseness problem. 
There are several future directions. In particu- 
lar, we are interested in how to deal with coordi- 
nate structures, since that seems to be the largest 
problem at the moment. 
References 
Adam Berger and Harry Printz. 1998. A com- 
parison of criteria for maximum entropy / min- 
imum divergence feature selection. Proceedings 
of Third Conference on Empirical Methods in 
Natural Language Processing, pages 97-106. 
Adam L. Berger, Stephen A. Della Pietra, and 
Vincent J. Della Pietra. 1996. A maximum en- 
tropy approach to natural language processing. 
Computational Linguistics, 22(1):39-71. 
Michael Collins. 1996. A new statistical parser 
based on bigram lexical dependencies. Proceed- 
ings of the 34th Annual Meeting of the Asso- 
ciation for Computational Linguistics (ACL), 
pages 184-191. 
Terumasa Ehara. 1998. Japanese bunsetsu de- 
pendency estimation using maximum entropy 
method. Proceedings of The Fourth Annual 
202 
Proceedings of EACL '99 
Table 5: Simulation of Fujio's and Haruno's experiments 
Feature set 
Feature set (1) 
(Without features 4, 5, 9--12, 14, 15, 19, 20, 24--27, 29, 30, 34--43.) 
Feature set (2) 
(Without features 4, 5, 9--12, 19, 20, 24--27, 34-43.) 
Accuracy 
85.71% (-1.43%) 
86.47% (-0.67%) 
Meeting of The Association for Natural Lan- 
guage Processing, pages 382-385. (in Japanese). 
Masakazu Fujio and Yuuji Matsumoto. 1998. 
Japanese dependency structure analysis based 
on lexicalized statistics. Proceedings of Third 
Conference on Empirical Methods in Natural 
Language Processing, pages 87-96. 
Katsuhiko Fujita. 1988. A deterministic parser 
based on karari-uke grammar, pages 399-402. 
Masahiko Haruno, Satoshi Shiral, and Yoshifumi 
Ooyama. 1998. Using decision trees to con- 
struct a practical parser. Proceedings of the 
COLING-ACL '98. 
Akira Kitauchi, Takehito Utsuro, and Yuji Mat- 
sumoto. 1998. Error-driven model learning 
of Japanese morphological analysis. IPSJ- 
WGNL, NL124-6:41--48. (in Japanese). 
Sadao Kurohashi and Makoto Nagao. 1997. Ky- 
oto university text corpus project, pages 115- 
118. (in Japanese). 
Sadao Kurohashi and Makoto Nagao, 1998. 
Japanese Morphological Analysis System JU- 
MAN version 3.5. Department of Informatics, 
Kyoto University. 
Masaki Murata, Kiyotaka Uchimoto, Qing Ma, 
and Hitoshi Isahara. 1998. Machine learning 
approach to bunsetsu identification -- compar- 
ison of decision tree, maximum entropy model, 
example-based approach, and a new method us- 
ing category-exclusive rules --. IPSJ-WGNL, 
NL128-4:23-30. (in Japanese). 
Adwait Ratnaparkhi. 1997. A linear observed 
time statistical parser based on maximum en- 
tropy models. Conference on Empirical Meth- 
ods in Natural Language Processing. 
Eric Sven Ristad. 1998. Maximum en- 
tropy modeling toolkit, release 1.6 beta. 
http ://www.mnemonic.com/software/memt. 
Kiyoaki Shirai, Kentaro Inui, Takenobu Toku- 
naga, and I-Iozumi Tanaka. 1998a. Learning 
dependencies between case frames using max- 
imum entropy method, pages 356-359. (in 
Japanese). 
Kiyoaki Shirai, Kentaro Inui, Takenobu Toku- 
naga, and Hozumi Tanaka. 1998b. A frame- 
work of integrating syntactic and lexical statis- 
tics in statistical parsing. Journal of Nat- 
ural Language Processing, 5(3):85-106. 
Japanese). 
(in 
203 
