Using Decision Trees to Construct a Practical Parser 
Masahiko Haruno* Satoshi Shirai† Yoshifumi Ooyama†
mharuno@hlp.atr.co.jp shirai@cslab.kecl.ntt.co.jp ooyama@cslab.kecl.ntt.co.jp
*ATR Human Information Processing Research Laboratories 
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan. 
†NTT Communication Science Laboratories
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan. 
Abstract 
This paper describes novel and practical Japanese
parsers that use decision trees. First, we construct
a single decision tree to estimate modification
probabilities, i.e., how one phrase tends to modify
another. Next, we introduce a boosting algorithm
in which several decision trees are constructed and
then combined for probability estimation. The two
constructed parsers are evaluated by using the EDR
Japanese annotated corpus. The single-tree method
outperforms the conventional Japanese stochastic
methods by 4%. Moreover, the boosting version is
shown to have significant advantages: 1) better parsing
accuracy than its single-tree counterpart for any
amount of training data and 2) no over-fitting to
data for various iterations.
1 Introduction 
Conventional parsers with practical levels of performance
require a number of sophisticated rules that
have to be hand-crafted by human linguists. It is
time-consuming and cumbersome to maintain these
rules for two reasons.
• The rules are specific to the application domain.
• Specific rules handling collocational expressions
create side effects. Such rules often deteriorate
the overall performance of the parser.
The stochastic approach, on the other hand, has 
the potential to overcome these difficulties. Because 
it induces stochastic rules to maximize overall performance
against training data, it not only adapts
to any application domain but also may avoid over-fitting
to the data. In the late 80s and early 90s, the
induction and parameter estimation of probabilis- 
tic context free grammars (PCFGs) from corpora 
were intensively studied. Because these grammars 
comprise only nonterminal and part-of-speech tag 
symbols, their performances were not enough to be 
used in practical applications (Charniak, 1993). A 
broader range of information, in particular lexical in- 
formation, was found to be essential in disambiguat- 
ing the syntactic structures of real-world sentences. 
SPATTER (Magerman, 1995) augmented the pure 
PCFG by introducing a number of lexical attributes. 
The parser controlled applications of each rule by using
the lexical constraints induced by a decision tree
algorithm (Quinlan, 1993). The SPATTER parser
attained 87% accuracy and first made stochastic
parsers a practical choice. The other type of high-precision
parser, which is based on dependency analysis,
was introduced by Collins (Collins, 1996). Dependency
analysis first segments a sentence into syntactically
meaningful sequences of words and then
considers the modification of each segment. Collins' 
parser computes the likelihood that each segment 
modifies the other (2 term relation) by using large 
corpora. These modification probabilities are conditioned
on the head words of the two segments, the distance
between the two segments, and other syntactic features.
Although these two parsers have shown similar
performance, the keys to their success are slightly
different. SPATTER parser performance greatly depends
on the feature selection ability of the decision
tree algorithm rather than its linguistic representa- 
tion. On the other hand, dependency analysis plays 
an essential role in Collins' parser for efficiently ex- 
tracting information from corpora. 
In this paper, we describe practical Japanese dependency
parsers that use decision trees. In the
Japanese language, dependency analysis has been
shown to be powerful because segment (bunsetsu)
order in a sentence is relatively free compared to
European languages. Japanese dependency parsers
generally proceed in three steps.
1. Segment a sentence into a sequence of bunsetsu. 
2. Prepare a modification matrix, each value of 
which represents how one bunsetsu is likely to 
modify another. 
3. Find optimal modifications in a sentence by a 
dynamic programming technique. 
The most difficult part is the second: how to construct
a sophisticated modification matrix. With
conventional Japanese parsers, the linguist must
classify the bunsetsu and select appropriate features 
to compute modification values. The parsers thus 
suffer from application domain diversity and the side 
effects of specific rules. 
Stochastic dependency parsers like Collins', on the 
other hand, define a set of attributes for condition- 
ing the modification probabilities. The parsers con- 
sider all of the attributes regardless of bunsetsu type. 
These methods can encompass only a small number
of features if the probabilities are to be precisely
estimated from a finite amount of data. Our decision
tree method constructs a more sophisticated modi- 
fication matrix. It automatically selects a sufficient 
number of significant attributes according to bun- 
setsu type. We can use arbitrary numbers of the 
attributes which potentially increase parsing accu- 
racy. 
Natural languages are full of exceptional and collo- 
cational expressions. It is difficult for machine learn- 
ing algorithms, as well as human linguists, to judge 
whether a specific rule is relevant in terms of over- 
all performance. To tackle this problem, we test 
the mixture of sequentially generated decision trees. 
Specifically, we use the Ada-Boost algorithm (Freund
and Schapire, 1996), which iteratively performs
two procedures: 1. constructing a decision tree based
on the current data distribution and 2. updating
the distribution by focusing on data that are not
well predicted by the constructed tree. The final
modification probabilities are computed by mixing
all the decision trees according to their performance.
The sequential decision trees gradually change from
broad coverage trees to specific trees covering exceptions
that cannot be captured by a single general tree. In other
words, the method incorporates not only general ex- 
pressions but also infrequent specific ones. 
The rest of the paper is organized as follows.
Section 2 summarizes dependency analysis for the 
Japanese language. Section 3 explains our decision 
tree models that compute modification probabili- 
ties. Section 4 then presents experimental results 
obtained by using EDR Japanese annotated corpora. 
Finally, section 5 concludes the paper. 
2 Dependency Analysis in Japanese 
Language 
This section overviews dependency analysis in the 
Japanese language. The parser generally performs 
the following three steps. 
1. Segment a sentence into a sequence of bunsetsu.
2. Prepare a modification matrix, each value of
which represents how one bunsetsu is likely to
modify another.
3. Find optimal modifications in a sentence by a 
dynamic programming technique. 
Because there are no explicit delimiters between 
words in Japanese, input sentences are first word 
segmented, part-of-speech tagged, and then chunked 
into a sequence of bunsetsus. The first step yields, 
for the following example, the sequence of bunsetsu 
displayed below. The parentheses in the Japanese
expressions represent the internal structures of the
bunsetsu (word segmentations).
Example: 昨日の夕方に近所の子供がワインを飲んだ
((昨日)(の)) ((夕方)(に)) ((近所)(の))
kinou-no yuugata-ni kinjo-no
yesterday-NO evening-NI neighbor-NO
((子供)(が)) ((ワイン)(を)) ((飲ん)(だ))
kodomo-ga wain-wo nomu-ta
children-GA wine-WO drink+PAST
The second step of parsing is to construct a modifi- 
cation matrix whose values represent the likelihood 
that one bunsetsu modifies another in a sentence. 
In the Japanese language, we usually make two as- 
sumptions: 
1. Every bunsetsu except the last one modifies 
only one posterior bunsetsu. 
2. No modification crosses any other modification
in a sentence.
Table 1 illustrates a modification matrix for the
example sentence. In the matrix, columns and rows
represent anterior and posterior bunsetsus, respectively.
For example, the first bunsetsu 'kinou-no'
modifies the second 'yuugata-ni' with score 0.70 and
the third 'kinjo-no' with score 0.07. The aim of this
paper is to generate a modification matrix by using 
decision trees. 
             kinou-no  yuugata-ni  kinjo-no  kodomo-ga  wain-wo
yuugata-ni     0.70
kinjo-no       0.07      0.10
kodomo-ga      0.10      0.10        0.70
wain-wo        0.10      0.10        0.20      0.05
nomu-ta        0.03      0.70        0.10      0.95       1.00
Table 1: Modification Matrix for Sample Sentence
The final step of parsing optimizes the entire de- 
pendency structure by using the values in the mod- 
ification matrix. 
Before going into our model, we introduce the notation
that will be used in the model. Let $S$ be
the input sentence. $S$ comprises a bunsetsu set $B$ of
length $m$, $B = \{\langle b_1, f_1 \rangle, \ldots, \langle b_m, f_m \rangle\}$, in which
$b_i$ and $f_i$ represent the $i$th bunsetsu and its features,
respectively. We define $D$ to be a modification set,
$D = \{mod(1), \ldots, mod(m-1)\}$, in which $mod(i)$ indicates
the number (index) of the bunsetsu modified by the $i$th
bunsetsu. Because of the first assumption, the length of
$D$ is always $m-1$. Using this notation, the result
of the third step for the example can be given as
$D = \{2, 6, 4, 6, 6\}$, as displayed in Figure 1.
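The two assumptions can be checked mechanically for any candidate modification set. The Python sketch below is only an illustration of the notation; the function name `is_valid_modification_set` and the encoding of D as a list of 1-based indices are hypothetical choices, not part of the parser itself.

```python
from typing import Sequence


def is_valid_modification_set(d: Sequence[int]) -> bool:
    """Check the two assumptions for a modification set D.

    d[i - 1] holds mod(i), the 1-based index of the bunsetsu modified by
    bunsetsu i, so the sentence has m = len(d) + 1 bunsetsu.
    """
    m = len(d) + 1
    arcs = list(enumerate(d, start=1))  # pairs (i, mod(i))
    # Assumption 1: every non-final bunsetsu modifies one posterior bunsetsu.
    if any(not (i < head <= m) for i, head in arcs):
        return False
    # Assumption 2: no crossing, i.e. never i < k < mod(i) < mod(k).
    return not any(i < k < hi < hk for i, hi in arcs for k, hk in arcs)


print(is_valid_modification_set([2, 6, 4, 6, 6]))  # True: the example sentence
print(is_valid_modification_set([3, 6, 4, 6, 6]))  # False: 1->3 crosses 2->6
```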
3 Decision Trees for Dependency 
Analysis 
3.1 Stochastic Model and Decision Trees 
The stochastic dependency parser assigns the most
plausible modification set $D_{best}$ to a sentence $S$ in
terms of the training data distribution.

$$D_{best} = \arg\max_D P(D|S) = \arg\max_D P(D|B)$$

[Figure 1: Modification Set for Sample Sentence. Bunsetsu are numbered 1 kinou-no, 2 yuugata-ni, 3 kinjo-no, 4 kodomo-ga, 5 wain-wo, 6 nomu-ta.]
By assuming the independence of modifications,
$P(D|B)$ can be transformed as follows.

$$P(D|B) = \prod_{i=1}^{m-1} P(yes | b_i, b_j, f_1, \ldots, f_m)$$

Here $b_j$ is the bunsetsu modified by $b_i$ (that is, $j = mod(i)$), and
$P(yes | b_i, b_j, f_1, \ldots, f_m)$ means the probability that
a pair of bunsetsu $b_i$ and $b_j$ have a modification relation.
Note that each modification is constrained by
all features $\{f_1, \ldots, f_m\}$ in a sentence despite the
assumption of independence. We use decision trees
to dynamically select appropriate features for each
combination of bunsetsus from $\{f_1, \ldots, f_m\}$.
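To make the product form concrete, the following Python sketch scores every modification set for the example sentence with the values of Table 1 and keeps the best one that contains no crossing. It is a brute-force enumeration used only for illustration, not the dynamic programming technique of the third parsing step; the names `P`, `crosses`, and `best_modification_set` are hypothetical.

```python
from itertools import product
from math import prod

# Table 1 values: P[i][j] is the score that bunsetsu i modifies bunsetsu j
# (1-based; 1 kinou-no, 2 yuugata-ni, 3 kinjo-no, 4 kodomo-ga, 5 wain-wo, 6 nomu-ta).
P = {
    1: {2: 0.70, 3: 0.07, 4: 0.10, 5: 0.10, 6: 0.03},
    2: {3: 0.10, 4: 0.10, 5: 0.10, 6: 0.70},
    3: {4: 0.70, 5: 0.20, 6: 0.10},
    4: {5: 0.05, 6: 0.95},
    5: {6: 1.00},
}


def crosses(d):
    """True if any two modifications in d cross each other."""
    arcs = list(enumerate(d, start=1))
    return any(i < k < hi < hk for i, hi in arcs for k, hk in arcs)


def best_modification_set(p, m):
    """Enumerate all D with mod(i) > i and return the non-crossing D
    that maximizes the product of pairwise scores."""
    candidates = product(*[range(i + 1, m + 1) for i in range(1, m)])
    valid = (d for d in candidates if not crosses(d))
    return max(valid, key=lambda d: prod(p[i][j] for i, j in enumerate(d, start=1)))


print(best_modification_set(P, 6))  # (2, 6, 4, 6, 6)
```

The enumeration grows factorially with sentence length, so it is only workable for short examples such as this one.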
Let us first consider the single tree case. The 
training data for the decision tree comprise any un- 
ordered combination of two bunsetsu in a sentence. 
Features used for learning are the linguistic informa- 
tion associated with the two bunsetsu. The next sec- 
tion will explain these features in detail. The class 
set for learning has binary values yes and no which 
delineate whether the data (the two bunsetsu) has
a modification relation or not. In this setting, the
decision tree algorithm automatically and consecutively
selects the significant features for discriminating
modify/non-modify relations.
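The construction of training examples described above can be sketched as follows. This is a minimal illustration under assumed data structures: `training_pairs` is a hypothetical helper, per-bunsetsu features are represented by plain strings, and the annotated modification set is given as a list of 1-based indices.

```python
from itertools import combinations


def training_pairs(features, d):
    """Yield (anterior, posterior, label) for every pair of bunsetsu.

    features: one feature object per bunsetsu, in sentence order.
    d: annotated modification set, d[i - 1] = mod(i) (1-based indices).
    """
    m = len(features)
    for i, j in combinations(range(1, m + 1), 2):  # all pairs with i < j
        label = "yes" if d[i - 1] == j else "no"
        yield features[i - 1], features[j - 1], label


sentence = ["kinou-no", "yuugata-ni", "kinjo-no", "kodomo-ga", "wain-wo", "nomu-ta"]
pairs = list(training_pairs(sentence, [2, 6, 4, 6, 6]))
print(len(pairs))                               # 15 pairs
print(sum(label == "yes" for *_, label in pairs))  # 5 positive examples
```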
We slightly changed the C4.5 (Quinlan, 1993) programs
to be able to extract class frequencies
at every node in the decision tree because
our task is regression rather than classification.
By using the class distribution, we
compute the probability $P_{DT}(yes | b_i, b_j, f_1, \ldots, f_m)$,
which is the Laplace estimate of the empirical likelihood
that $b_i$ modifies $b_j$ in the constructed decision
tree $DT$. Note that it is necessary to normalize
$P_{DT}(yes | b_i, b_j, f_1, \ldots, f_m)$ to approximate
$P(yes | b_i, b_j, f_1, \ldots, f_m)$. By considering all candidates
posterior to $b_i$, $P(yes | b_i, b_j, f_1, \ldots, f_m)$ is
computed using the heuristic rule (1). It is of course
reasonable to normalize class frequencies instead of
the probability $P_{DT}(yes | b_i, b_j, f_1, \ldots, f_m)$. Equation (1)
tends to emphasize long distance dependencies
more than is true for frequency-based normalization.

$$P(yes | b_i, b_j, f_1, \ldots, f_m) \simeq \frac{P_{DT}(yes | b_i, b_j, f_1, \ldots, f_m)}{\sum_{k>i}^{m} P_{DT}(yes | b_i, b_k, f_1, \ldots, f_m)} \qquad (1)$$
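The two computations in this paragraph can be sketched in Python as follows. The add-one form of the Laplace estimate for two classes is an assumption, since the exact formula is not spelled out here, and the leaf counts are invented for the illustration; `laplace_estimate` and `normalize_over_candidates` are hypothetical names.

```python
def laplace_estimate(n_yes: float, n_no: float) -> float:
    """Laplace-smoothed P_DT(yes) from the class frequencies at a leaf.

    Assumes the common add-one form for two classes: (n_yes + 1) / (n + 2).
    """
    return (n_yes + 1.0) / (n_yes + n_no + 2.0)


def normalize_over_candidates(p_dt: dict) -> dict:
    """Equation (1): divide each P_DT by the sum over all candidates posterior to b_i."""
    total = sum(p_dt.values())
    return {j: p / total for j, p in p_dt.items()}


# Invented leaf counts (yes, no) for the candidates b_5 and b_6 of bunsetsu b_4.
leaf_counts = {5: (1, 19), 6: (18, 2)}
p_dt = {j: laplace_estimate(y, n) for j, (y, n) in leaf_counts.items()}
p = normalize_over_candidates(p_dt)
print(p)  # values over the candidates of b_4 now sum to 1
```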
Let us extend the above to use a set of decision 
trees. As briefly mentioned in Section 1, a number 
of infrequent and exceptional expressions appear in 
any natural language phenomena; they deteriorate 
the overall performance of application systems. It 
is also difficult for automated learning systems to 
detect and handle these expressions because exceptional
expressions are placed in the same class as
frequent ones. To tackle this difficulty, we generate
a set of decision trees by the Ada-boost (Freund and
Schapire, 1996) algorithm illustrated in Table 2. The
algorithm first sets the weights to 1 for all examples
(2 in Table 2) and repeats the following two
procedures T times (3 in Table 2).
1. A decision tree is constructed by using the cur- 
rent weight vector ((a) in Table 2) 
2. Example data are then parsed by using the tree 
and the weights of correctly handled examples 
are reduced ((b),(c) in Table 2) 
1. Input: sequence of N examples $\langle e_1, w_1 \rangle, \ldots, \langle e_N, w_N \rangle$
   in which $e_i$ and $w_i$ represent an example
   and its weight, respectively.
2. Initialize the weight vector $w_i = 1$ for $i = 1, \ldots, N$
3. Do for $t = 1, 2, \ldots, T$
   (a) Call C4.5 providing it with the weight vector
       $w_i$s and construct a modification probability
       set $h_t$
   (b) Let Error be a set of examples that are not
       identified by $h_t$
       Compute the pseudo error rate of $h_t$:
       $\epsilon_t = \sum_{i \in Error} w_i / \sum_{i=1}^{N} w_i$
       if $\epsilon_t > \frac{1}{2}$ then abort loop
       $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$
   (c) For examples correctly predicted by $h_t$, update
       the weights vector to be $w_i = w_i \beta_t$
4. Output a final probability set:
   $h_f = \sum_{t=1}^{T} (\log \frac{1}{\beta_t}) h_t / \sum_{t=1}^{T} (\log \frac{1}{\beta_t})$
Table 2: Combining Decision Trees by Ada-boost
Algorithm
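The loop of Table 2 can be sketched in Python as below. Several parts are assumptions made for the illustration: C4.5 is replaced by a one-threshold stump (`fit_stump`) that returns hard 0/1 probabilities, an example counts as misidentified when the predicted probability falls on the wrong side of 0.5, and the loop also stops when the error rate is exactly 0 or 1/2 so that the weights log(1/beta) stay finite and positive. All function names are hypothetical.

```python
import math


def fit_stump(examples, labels, weights):
    """Stand-in weak learner (not C4.5): best single-threshold rule on feature 0."""
    best = None
    for thr in sorted({x[0] for x in examples}):
        for above in (True, False):
            err = sum(w for x, y, w in zip(examples, labels, weights)
                      if ((x[0] >= thr) == above) != y)
            if best is None or err < best[0]:
                best = (err, thr, above)
    _, thr, above = best
    return lambda x: 1.0 if (x[0] >= thr) == above else 0.0


def boost(examples, labels, fit, rounds):
    """Steps 2 and 3 of Table 2: return the hypotheses h_t and their beta_t."""
    weights = [1.0] * len(examples)                      # step 2
    hypotheses, betas = [], []
    for _ in range(rounds):                              # step 3
        h = fit(examples, labels, weights)               # (a)
        wrong = [(h(x) >= 0.5) != y for x, y in zip(examples, labels)]
        eps = sum(w for w, bad in zip(weights, wrong) if bad) / sum(weights)  # (b)
        if eps >= 0.5 or eps == 0.0:                     # guard added for this sketch
            break
        beta = eps / (1.0 - eps)
        weights = [w if bad else w * beta for w, bad in zip(weights, wrong)]  # (c)
        hypotheses.append(h)
        betas.append(beta)
    return hypotheses, betas


def combine(hypotheses, betas, x):
    """Step 4: mix the probabilities with weights log(1 / beta_t)."""
    if not hypotheses:
        raise ValueError("no hypothesis was accepted")
    votes = [math.log(1.0 / b) for b in betas]
    return sum(v * h(x) for v, h in zip(votes, hypotheses)) / sum(votes)


X = [[0], [1], [2], [3], [4], [5]]
y = [True, True, False, True, False, False]
hypotheses, betas = boost(X, y, fit_stump, rounds=3)
print([round(combine(hypotheses, betas, x), 3) for x in X])
```

No single threshold separates this toy data, but the mixture of three stumps places every example on the correct side of 0.5, which mirrors the move from broad-coverage trees to more specific ones.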
The final probability set $h_f$ is then computed
by mixing $T$ trees according to their performance
(4 in Table 2). Using $h_f$ instead of
$P_{DT}(yes | b_i, b_j, f_1, \ldots, f_m)$ in equation (1) generates
a boosting version of the dependency parser.
3.2 Linguistic Feature Types Used for 
Learning 
This section explains the concrete feature setting we
used for learning. The feature set mainly focuses on
the two bunsetsu constituting each data item. The class
set consists of binary values which delineate whether
a sample (the two bunsetsu) has a modification relation
or not. We use 13 features for the task, 10 directly
from the 2 bunsetsu under consideration and
3 for other bunsetsu information as summarized in
Table 3.

No.  Feature Type
1    lexical information of head word
2    part-of-speech of head word
3    type of bunsetsu
4    punctuation
5    parentheses
6    distance between two bunsetsu
7    particle 'wa' between two bunsetsu
8    punctuation between two bunsetsu
Table 3: Linguistic Feature Types Used for Learning

Feature Type   Values
[rows for the lower-numbered feature types: Japanese value lists (function-word chunks, part-of-speech tags, punctuation marks, parenthesis characters), not legible in this copy]
6              A(0), B(1-4), C(≥5)
7              0, 1
8              0, 1
Table 4: Values for Each Feature Type

[Figure 2: Learning Curve of Single-Tree Parser (horizontal axis: Number of Training Data, 5000 to 50000)]
Each bunsetsu (anterior and posterior) has the 5 
features: No.1 to No.5 in Table 3. Features No.6 
to No.8 are related to bunsetsu pairs. Both No.1 
and No.2 concern the head word of the bunsetsu. 
No.1 takes values of frequent words or thesaurus cat- 
egories (NLRI, 1964). No.2, on the other hand, takes 
values of part-of-speech tags. No.3 deals with bunsetsu
types, which consist of functional word chunks
or the part-of-speech tags that dominate the bunsetsu's
syntactic characteristics. No.4 and No.5 are
binary features and correspond to punctuation and
parentheses, respectively. No.6 represents how many
bunsetsus exist between the two bunsetsus. Possible
values are A(0), B(1-4) and C(≥5). No.7 deals with
the post-positional particle 'wa', which greatly influences
the long distance dependency of subject-verb
modifications. Finally, No.8 addresses the punctuation
between the two bunsetsu. The detailed values
of each feature type are summarized in Table 4.
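The layout of the 13 features can be sketched as follows. The `Bunsetsu` record, the function names, and the way features No.7 and No.8 are derived from the intervening bunsetsu are assumptions made for this illustration; the distance classes follow the reading A for 0, B for 1 to 4, and C for 5 or more intervening bunsetsu.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bunsetsu:
    head_word: str       # No.1 lexical information of head word
    head_pos: str        # No.2 part-of-speech of head word
    bunsetsu_type: str   # No.3 type of bunsetsu
    punctuation: bool    # No.4
    parentheses: bool    # No.5


def distance_class(n_between: int) -> str:
    """No.6: A for 0, B for 1 to 4, C for 5 or more intervening bunsetsu."""
    if n_between == 0:
        return "A"
    return "B" if n_between <= 4 else "C"


def pair_features(sentence: List[Bunsetsu], i: int, j: int) -> list:
    """Return 13 values for anterior sentence[i] and posterior sentence[j] (0-based, i < j)."""
    def own(b: Bunsetsu) -> list:
        return [b.head_word, b.head_pos, b.bunsetsu_type, b.punctuation, b.parentheses]

    between = sentence[i + 1:j]
    return own(sentence[i]) + own(sentence[j]) + [
        distance_class(len(between)),
        any(b.bunsetsu_type == "wa" for b in between),  # No.7 (assumed encoding)
        any(b.punctuation for b in between),            # No.8 (assumed encoding)
    ]


# The example sentence with illustrative (hypothetical) tag values.
sentence = [
    Bunsetsu("kinou", "noun", "no", False, False),
    Bunsetsu("yuugata", "noun", "ni", False, False),
    Bunsetsu("kinjo", "noun", "no", False, False),
    Bunsetsu("kodomo", "noun", "ga", False, False),
    Bunsetsu("wain", "noun", "wo", False, False),
    Bunsetsu("nomu", "verb", "ta", False, False),
]
feats = pair_features(sentence, 0, 5)
print(len(feats), feats[10])  # 13 B
```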
4 Experimental Results 
We evaluated the proposed parser using the EDR
Japanese annotated corpus (EDR, 1995). The experiment
consisted of two parts. One evaluated the
single-tree parser and the other the boosting counterpart.
In the rest of this section, parsing accuracy
refers only to precision, i.e., how many of the system's
outputs are correct in terms of the annotated corpus.
We do not show recall because we assume every bunsetsu
modifies only one posterior bunsetsu. The features
used for learning were non head-word features
(i.e., types 2 to 8 in Table 3). Section 4.1.4 investigates
lexical information of head words such as frequent
words and thesaurus categories. Before going
into details of the experimental results, we summarize
here how training and test data were selected.
1. After all sentences in the EDR corpus
were word-segmented and part-of-speech
tagged (Matsumoto and others, 1996), they
were then chunked into a sequence of bunsetsu.
2. All bunsetsu pairs were compared with EDR
bracketing annotation (correct segmentations
and modifications). If a sentence contained a
pair inconsistent with the EDR annotation, the
sentence was removed from the data.
3. All data examined (total number of sentences:
207802, total number of bunsetsu: 1790920)
were divided into 20 files.
The training data were the same number of first
sentences from each of the 20 files, according to the
training data size. Test data (10000 sentences)
were the 2501st to 3000th sentences of each
file.

Confidence Level    25%     50%     75%     95%
Parsing Accuracy    82.01%  83.43%  83.52%  83.35%
Table 5: Pruning Confidence Level v.s. Parsing Accuracy

Number of Training Sentences   3000    6000    10000   20000   30000   50000
Parsing Accuracy               82.07%  82.70%  83.52%  84.07%  84.27%  84.33%
Table 6: Number of Training Sentences v.s. Parsing Accuracy
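Parsing accuracy as defined in this section (precision over modifications) can be computed as in the following sketch. The function name `parsing_accuracy` and the representation of each sentence as a list of 1-based head indices are assumptions for the illustration, and the two sentences are invented.

```python
def parsing_accuracy(predicted, gold):
    """Fraction of modifications (one per non-final bunsetsu) that match the annotation.

    predicted, gold: lists of modification sets, one per test sentence.
    """
    correct = total = 0
    for d_pred, d_gold in zip(predicted, gold):
        correct += sum(p == g for p, g in zip(d_pred, d_gold))
        total += len(d_gold)
    return correct / total


predicted = [[2, 6, 4, 6, 6], [2, 3]]
gold = [[2, 6, 4, 6, 6], [3, 3]]
print(parsing_accuracy(predicted, gold))  # 6 of 7 modifications are correct
```

Because every non-final bunsetsu receives exactly one head in both the output and the annotation, the number of predicted modifications equals the number of annotated ones, which is why recall carries no extra information here.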
4.1 Single Tree Experiments 
In the single tree experiments, we evaluated the fol- 
lowing 4 properties of the new dependency parser. 
• Tree pruning and parsing accuracy 
• Number of training data and parsing accuracy 
• Significance of features other than Head-word 
Lexical Information 
• Significance of Head-word Lexical Information 
4.1.1 Pruning and Parsing Accuracy 
Table 5 summarizes the parsing accuracy with var- 
ious confidence levels of pruning. The number of 
training sentences was 10000. 
In C4.5 programs, a larger value of confidence
means weaker pruning and 25% is commonly used in
various domains (Quinlan, 1993). Our experimental
results show that 75% pruning attains the best performance,
i.e. weaker pruning than usual. In the
remaining single tree experiments, we used the 75%
confidence level. Although strong pruning treats infrequent
data as noise, parsing involves many exceptional
and infrequent modifications as mentioned
before. Our result means that even information included
in only a small number of samples is useful for
disambiguating the syntactic structure of sentences.
4.1.2 The amount of Training Data and 
Parsing Accuracy 
Table 6 and Figure 2 show how the number of training
sentences influences parsing accuracy for the
same 10000 test sentences. They illustrate the following
two characteristics of the learning curve.
1. The parsing accuracy rapidly rises up to 30000 
sentences and converges at around 50000 sen- 
tences. 
2. The maximum parsing accuracy is 84.33% at 
50000 training sentences. 
We will now discuss the maximum accuracy of 84.33%.
Compared to recent stochastic English parsers that
yield 86 to 87% accuracy (Collins, 1996; Magerman,
1995), 84.33% seems unsatisfactory at first
glance. The main reason behind this lies in the difference
between the two corpora used: Penn Treebank
(Marcus et al., 1993) and EDR corpus (EDR,
1995). Penn Treebank (Marcus et al., 1993) was also
used to induce part-of-speech (POS) taggers because
the corpus contains very precise and detailed POS
markers as well as bracket annotations. In addition,
English parsers incorporate the syntactic tags that
are contained in the corpus. The EDR corpus, on the
other hand, contains only coarse POS tags. We used
another Japanese POS tagger (Matsumoto and others,
1996) to make use of fine-grained information
for disambiguating syntactic structures. Only the
bracket information in the EDR corpus was considered.
We conjecture that the difference between the
parsing accuracies is due to the difference in the corpus
information. Fujio and Matsumoto (1997) constructed
an EDR-based dependency parser by using
a method similar to Collins' (Collins, 1996). The
parser attained 80.48% accuracy. Although their
training and test sentences are not exactly the same as
ours, the result seems to support our conjecture on
the data difference between EDR and Penn Treebank.
4.1.3 Significance of Non Head-Word 
Features 
We will now summarize the significance of each non
head-word feature introduced in Section 3. The influence
of the lexical information of head words will
be discussed in the next section. Table 7 illustrates
how the parsing accuracy is reduced when each feature
is removed. The number of training sentences
was 10000. In the table, ant and post represent the
anterior and the posterior bunsetsu, respectively.
Table 7 clearly demonstrates that the most significant
features are anterior bunsetsu type and distance
between the two bunsetsu. This result may partially
support an often used heuristic: bunsetsu modification
should be as short range as possible, provided
the modification is syntactically possible. In particular,
we need to concentrate on the types of bunsetsu
to attain a higher level of accuracy. Most features
contribute, to some extent, to the parsing performance.
In our experiment, information on parentheses
has no effect on the performance. The reason
may be that EDR contains only a small number of
parentheses. One exception in our features is the
anterior POS of head: removing it slightly improves
accuracy (a decrease of -0.07% in Table 7). We currently
hypothesize that the harm caused by this feature
arises from two reasons.
• In many cases, the POS of head word can be
determined from bunsetsu type.
• Our POS tagger sometimes assigns verbs for
verb-derived nouns.

Feature                              Accuracy Decrease
ant POS of head                      -0.07%
ant bunsetsu type                    +9.34%
ant punctuation                      +1.15%
ant parentheses                      +0.00%
post POS of head                     +2.13%
post bunsetsu type                   +0.52%
post punctuation                     +1.62%
post parentheses                     ±0.00%
distance between two bunsetsus       +5.21%
punctuation between two bunsetsus    +0.01%
'wa' between two bunsetsus           +1.79%
Table 7: Decrease of Parsing Accuracy When Each Attribute Removed

Head Word Information   100 words   200 words   Level 1   Level 2
Parsing Accuracy        83.34%      82.68%      82.51%    81.67%
Table 8: Head Word Information v.s. Parsing Accuracy
4.1.4 Significance of Head-word Lexical
Information
We focused on the head-word feature by testing the 
following 4 lexical sources. The first and the second 
are the 100 and 200 most frequent words, respec- 
tively. The third and the fourth are derived from a 
broadly used Japanese thesaurus, Word List by Se- 
mantic Principles (NLRI, 1964). Level 1 and Level 2 
classify words into 15 and 67 categories, respectively. 
1. 100 most Frequent words 
2. 200 most Frequent words 
3. Word List Level 1 
4. Word List Level 2 
Table 8 displays the parsing accuracy when each
type of head word information was used in addition to the
previous features. The number of training sentences
was 10000. In all cases, the performance was worse
than 83.52%, which was attained without head word
lexical information. More surprisingly, more head
word information yielded worse performance. From
this result, it may be safely said, at least for the
Japanese language, that we cannot expect lexical information
to always improve the performance. Further
investigation of other thesauri and clustering
(Charniak, 1997) techniques is necessary to fully
understand the influence of lexical information.
4.2 Boosting Experiments 
This section reports experimental results on the
boosting version of our parser. In all experiments,
pruning confidence levels were set to 55%. Table 9
and Figure 3 show the parsing accuracy when the
number of training examples was increased. Because
the number of iterations in each data set changed between
5 and 8, we will show the accuracy by combining
the first 5 decision trees. In Figure 3, the dotted
line plots the learning curve of the single tree case
(identical to Figure 2) for the reader's convenience. The
characteristics of the boosting version can be summarized
as follows compared to the single tree version.
• The learning curve rises more rapidly with a 
small number of examples. It is surprising that 
the boosting version with 10000 sentences per- 
forms better than the single tree version with 
50000 sentences. 
• The boosting version significantly outperforms 
the single tree counterpart for any number of 
sentences although they use the same features 
for learning. 
Next, we discuss how the number of iterations influences
the parsing accuracy. Table 10 shows the
parsing accuracy for various iteration numbers when
50000 sentences were used as training data. The results
have two characteristics.
• Parsing accuracy rose rapidly at the second
iteration.
• No over-fitting to data was seen although the
performance of each generated tree fell around
30% at the final stage of iteration.
Number of Training Sentences   3000    6000    10000   20000   30000   50000
Parsing Accuracy               83.10%  84.03%  84.44%  84.74%  84.91%  85.03%
Table 9: Number of Training Sentences v.s. Parsing Accuracy

Number of Iterations           1       2       3       4       5       6
Parsing Accuracy               84.32%  84.93%  84.89%  84.86%  85.03%  85.01%
Table 10: Number of Iterations v.s. Parsing Accuracy
5 Conclusion 
We have described a new Japanese dependency 
parser that uses decision trees. First, we introduced 
the single tree parser to clarify the basic character- 
istics of our method. The experimental results show 
that it outperforms conventional stochastic parsers 
by 4%. Next, the boosting version of our parser was 
introduced. The promising results of the boosting 
parser can be summarized as follows. 
• The boosting version outperforms the single- 
tree counterpart regardless of training data 
amount. 
• No data over-fitting was seen when the number 
of iterations changed. 
We now plan to continue our research in two directions.
One is to make our parser available to a broad
range of researchers and to use their feedback to revise
the features for learning. The other is to apply
our method to other languages, say English. Although
we have focused on the Japanese language,
it is straightforward to modify our parser to work
with other languages.
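[Figure 3: learning curve of the boosting parser, with the single-tree curve as a dotted line (horizontal axis: Number of Training Data; vertical axis ticks from 82 to 85.5)]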

References 
Eugene Charniak. 1993. Statistical Language Learning.
The MIT Press.
Eugene Charniak. 1997. Statistical Parsing with a
Context-free Grammar and Word Statistics. In
Proc. 15th National Conference on Artificial Intelligence,
pages 598-603.
Michael Collins. 1996. A New Statistical Parser
based on bigram lexical dependencies. In Proc.
34th Annual Meeting of Association for Computational
Linguistics, pages 184-191.
Japan Electronic Dictionary Research Institute Ltd.
EDR. 1995. The EDR Electronic Dictionary Technical
Guide.
Yoav Freund and Robert Schapire. 1996. A
decision-theoretic generalization of on-line learning
and an application to boosting.
M. Fujio and Y. Matsumoto. 1997. Japanese dependency
structure analysis based on statistics.
In SIGNL NL117-12, pages 83-90. (in Japanese).
David M. Magerman. 1995. Statistical Decision-Tree
Models for Parsing. In Proc. 33rd Annual
Meeting of Association for Computational Linguistics,
pages 276-283.
Mitchell Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1993. Building a large annotated
corpus of English: The Penn Treebank. Computational
Linguistics, 19(2):313-330, June.
Y. Matsumoto et al. 1996. Japanese Morphological
Analyzer Chasen2.0 User's Manual.
NLRI. 1964. Word List by Semantic Principles.
Syuei Syuppan. (in Japanese).
J. Ross Quinlan. 1993. C4.5: Programs for Machine
Learning. Morgan Kaufmann Publishers.
