A Stochastic Parser Based on a Structural Word Prediction Model 
Shinsuke MORI, Masafumi NISHIMURA, Nobuyasu ITOH, 
Shiho OGINO, Hideo WATANABE 
IBM Research, Tokyo Research Laboratory, IBM Japan, Ltd.
1623-14 Shimotsuruma, Yamato-shi, 242-8502, Japan.
mori@trl.ibm.co.jp
Abstract 
In this paper, we present a stochastic language model using
dependency. This model considers a sentence as a word sequence and
predicts each word from left to right. The history at each step of
prediction is a sequence of partial parse trees covering the
preceding words. Our model first predicts which of these partial
parse trees have a dependency relation with the next word, and then
predicts the next word from only those trees. Our model is a
generative stochastic model, so it can be used not only as a parser
but also as a language model of a speech recognizer. In our
experiment, we prepared about 1,000 syntactically annotated Japanese
sentences extracted from a financial newspaper and estimated the
parameters of our model. We built a parser based on our model and
tested it on approximately 100 sentences of the same newspaper. The
accuracy of the dependency relation was 89.9%, the highest accuracy
level obtained by Japanese stochastic parsers.
1 Introduction 
Stochastic language modeling, imported from the speech recognition
area, is one of the successful methodologies of natural language
processing. In fact, all language models for speech recognition are,
as far as we know, based on an n-gram model and most practical
part-of-speech (POS) taggers are also based on a word or POS n-gram
model or its extension (Church, 1988; Cutting et al., 1992; Merialdo,
1994; Dermatas and Kokkinakis, 1995). POS tagging is the first step
of natural language processing, and stochastic taggers have solved
this problem with satisfying accuracy for many applications. The
next step is parsing, that is to say, discovering the structure of a
given sentence. Recently, many parsers based on the stochastic
approach have been proposed. Although their reported accuracies are
high, they are not accurate enough for many applications at this
stage, and more attempts have to be made to improve them further.
One of the major applications of a parser is to parse the spoken
text recognized by a speech recognizer. This attempt clearly aims at
spoken language understanding. If we consider how to combine a
parser and a speech recognizer, it is better if the parser is based
on a generative stochastic model, as required for the language model
of a speech recognizer. Here, "generative" means that the sum of
probabilities over all possible sentences is equal to or less than 1.
If the language model is generative, it allows a seamless
combination of the parser and the speech recognizer. This means that
the speech recognizer has the stochastic parser as its language
model and benefits from richer information than a normal n-gram
model. Even if such a combination is not possible in practice, the
recognizer outputs N-best sentences with their probabilities, and
the parser, taking them as input, parses all of them and outputs the
sentence with its parse tree that has the highest probability of all
possible combinations. As a result, a parser based on a generative
stochastic language model may help a speech recognizer to select the
most syntactically reasonable sentence among candidates. Therefore,
it is better if the language model of a parser is generative.
In this paper, taking Japanese as the object language, we propose a
generative stochastic language model and a parser based on it. This
model treats a sentence as a word sequence and predicts each word
from left to right. The history at each step of prediction is a
sequence of partial parse trees covering the preceding words. To
predict a word, our model first predicts which of the partial parse
trees at this stage have a dependency relation with the word, and
then predicts the word from the selected partial parse trees. In
Japanese each word depends on a subsequent word, that is to say,
each dependency relation is left to right, so it is not necessary to
predict the direction of each dependency relation. In order to
extend our model to other languages, the model may have to predict
the direction of each dependency. We built a parser based on this
model, whose parameters are estimated from 1,072 sentences in a
financial newspaper, and tested it on 119 sentences from the same
newspaper. The accuracy of the dependency relation was 89.9%, the
highest obtained by any Japanese stochastic parser.
2 Stochastic Language Model based 
on Dependency 
In this section, we propose a stochastic language model based on
dependency. Unlike most stochastic language models for a parser, our
model is theoretically based on a hidden Markov model. In our model
a sentence is predicted word by word from left to right and the
state at each step of prediction is basically a sequence of words
whose modificand has not appeared yet. According to a
psycholinguistic report on language structure (Yngve, 1960), there
is an upper limit on the number of the words whose modificands have
not appeared yet. This limit is determined by the number of slots in
short-term memory, 7 ± 2 (Miller, 1956). With this limitation, we
can design a parser based on a finite state model.
2.1 Sentence Model 
The basic idea of our model is that each word would be better
predicted from the words that have a dependency relation with the
word to be predicted than from the preceding two words (tri-gram
model). Let us consider the complete structure of the sentence in
Figure 1 and a hypothetical structure after the prediction of the
fifth word at the top of Figure 2. In this hypothetical structure,
there are three trees: one root-only tree ($t_2$, composed of $w_3$)
and two two-node trees ($t_1$, containing $w_1$ and $w_2$, and
$t_3$, containing $w_4$ and $w_5$). If the last two trees ($t_2$ and
$t_3$) depend on the word $w_6$, this word may better be predicted
from these two trees. From this point of view, our model first
predicts the trees depending on the next word and then predicts the
next word from these trees.
Now, let us make the following definitions in order to explain our
model formally.
• $\mathbf{w} = w_1 w_2 \cdots w_n$ : a sequence of words. Here a
word is defined as a pair consisting of a string of alphabetic
characters and a part of speech (e.g. the/DT).
• $\mathbf{t}_i = t_1 t_2 \cdots t_{k_i}$ : a sequence of partial
parse trees covering the $i$-prefix words ($w_1 w_2 \cdots w_i$).
• $\mathbf{t}_i^{+}$ and $\mathbf{t}_i^{-}$ : subsequences of
$\mathbf{t}_i$ having and not having a dependency relation with the
next word respectively. In Japanese, like many other languages, no
two dependency relations cross each other; thus
$\mathbf{t}_i = \mathbf{t}_i^{-}\,\mathbf{t}_i^{+}$.
• $\langle \mathbf{t}\, w \rangle$ : a tree with $w$ as its root and
$\mathbf{t}$ as the sequence of all subtrees connected to the root.
After $w_{i+1}$ has been predicted from the trees depending on it
($\mathbf{t}_i^{+}$), there are trees remaining
($\mathbf{t}_i^{-}$) and a newly produced tree
($\langle \mathbf{t}_i^{+}\, w_{i+1} \rangle$); thus
$\mathbf{t}_{i+1} = \mathbf{t}_i^{-} \cdot \langle \mathbf{t}_i^{+}\, w_{i+1} \rangle$.
• $y_{max}$ : upper limit on the number of words whose modificands
have not appeared yet.

[Figure 2: Word prediction from a partial parse. Top: the state
after the prediction of the fifth word; middle: state prediction;
bottom: word prediction.]
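
The state update $\mathbf{t}_{i+1} = \mathbf{t}_i^{-} \cdot \langle \mathbf{t}_i^{+}\, w_{i+1} \rangle$ can be illustrated with a short Python sketch. The names `Tree` and `advance` are illustrative and do not come from the original system. Allowing y = 0 (no tree depends on the word) is an assumption made here so that the example can start from an empty state.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tree:
    """A partial parse tree <t w>: a root word and the subtrees depending on it."""
    root: str
    children: Tuple["Tree", ...] = ()


def advance(state: List[Tree], y: int, next_word: str) -> List[Tree]:
    """Build t_{i+1} from t_i when the last y trees depend on next_word."""
    if not 0 <= y <= len(state):
        raise ValueError("y must lie between 0 and the number of trees")
    split = len(state) - y
    remaining, dependents = state[:split], state[split:]
    # The dependents become the subtrees of a new tree rooted at next_word.
    return remaining + [Tree(next_word, tuple(dependents))]


# The hypothetical structure after the fifth word, as described in the text.
t1 = Tree("w2", (Tree("w1"),))
t2 = Tree("w3")
t3 = Tree("w5", (Tree("w4"),))
state5 = [t1, t2, t3]

# The last two trees depend on w6, so they are merged under it.
state6 = advance(state5, 2, "w6")
```

After the call, `state6` holds two trees: the untouched `t1` and a new tree rooted at `w6` with `t2` and `t3` as its subtrees.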
Under these definitions, our stochastic language model is defined as
follows:

$$
\begin{aligned}
P(\mathbf{w}) &= \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \cdots w_{i-1}) \\
&= \sum_{\mathbf{t}_n \in T_n} \prod_{i=1}^{n}
   P(w_i \mid \mathbf{t}_{i-1}^{+})\, P(\mathbf{t}_{i-1}^{+} \mid \mathbf{t}_{i-1})
   \qquad (1)
\end{aligned}
$$

where $T_n$ is the set of all possible binary trees with $n$ nodes.
Here, the first factor, $P(w_i \mid \mathbf{t}_{i-1}^{+})$, is
called the word prediction model and the second,
$P(\mathbf{t}_{i-1}^{+} \mid \mathbf{t}_{i-1})$, the state
prediction model. Let us consider Figure 2 again. At the top is the
state just after the prediction of the fifth word. The state
prediction model then predicts the partial parse trees depending on
the next word among all partial parse trees, as shown in the second
figure. Finally, the word prediction model predicts the next word
from the partial parse trees depending on it.
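
As a rough illustration of how the two factors combine, the following Python sketch scores one fixed derivation, that is, a single term of the sum over trees in the sentence model. The function names and the tuple encoding of trees are assumptions made for this example. The two probability models are passed in as plain callables, and the toy models at the bottom are placeholders rather than estimated distributions.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A tree is encoded as (root word, tuple of subtrees); this is an assumption.
Tree = Tuple[str, tuple]


def derivation_log_prob(
    words: Sequence[str],
    ys: Sequence[int],
    word_model: Callable[[str, Tuple[Tree, ...]], float],
    state_model: Callable[[int, Tuple[Tree, ...]], float],
) -> float:
    """Log probability of one derivation: sum of log state and word factors."""
    state: List[Tree] = []
    log_p = 0.0
    for word, y in zip(words, ys):
        if not 0 <= y <= len(state):
            raise ValueError("y exceeds the number of partial parse trees")
        split = len(state) - y
        t_plus = tuple(state[split:])          # trees depending on the word
        log_p += math.log(state_model(y, tuple(state)))
        log_p += math.log(word_model(word, t_plus))
        state = state[:split] + [(word, t_plus)]
    return log_p


def toy_word_model(word: str, t_plus: Tuple[Tree, ...]) -> float:
    return 0.1  # placeholder value, not an estimated probability


def toy_state_model(y: int, state: Tuple[Tree, ...]) -> float:
    return 0.5  # placeholder value, not an estimated probability


example = derivation_log_prob(
    ["w1", "w2", "w3"], [0, 1, 1], toy_word_model, toy_state_model
)
```

Working in log space avoids underflow on long sentences. Summing the probabilities of all derivations of a sentence, rather than scoring one, would give the sentence probability itself.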
As described above, there may be an upper limit on the number of
words whose modificands have not yet appeared. To put it in another
way, the length of the sequence of partial parse trees
($\mathbf{t}_i$) is limited. Therefore, if the depth of the partial
parse tree is also limited, the number of possible states is
limited. Under this constraint, our model can be considered as a
hidden Markov model. In a hidden Markov model, the first factor is
called the output probability and the second, the transition
probability.

[Figure 1: Dependency structure of a sentence.]
Since we assume that no two dependency relations cross each other,
the state prediction model only has to predict the number of the
trees depending on the next word. Thus
$P(\mathbf{t}_{i-1}^{+} \mid \mathbf{t}_{i-1}) = P(y \mid \mathbf{t}_{i-1})$,
where $y$ is the number of trees in the sequence
$\mathbf{t}_{i-1}^{+}$. According to the above assumption, the last
$y$ partial parse trees depend on the $i$-th word. Since the number
of possible parse trees for a word sequence grows exponentially with
the number of the words, the space of the sequence of partial parse
trees is huge even if the length of the sequence is limited. This
inevitably causes a data-sparseness problem. To avoid this problem,
we limited the number of levels of nodes used to distinguish trees.
In our experiment, only the root and its children are checked to
identify a partial parse tree. Hereafter, we use $P_{LL}$ to denote
this model, in which the lexicon of the first level and that of the
second level are considered. Thus, in our experiment each word and
the number of partial parse trees depending on it are predicted by a
sequence of partial parse trees that take account of the nodes whose
depth is two or less. It is worth noting that if the dependency
structure of a sentence is linear, that is to say, if each word
depends on the next word, then our model will be equivalent to a
word tri-gram model.
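
One way to picture the restriction to the root and its children is as a function that maps a tree to a short signature. The sketch below is only an illustration: `Node`, `label` and `signature` are hypothetical names, and the treatment of the level "N" (the children are dropped entirely) is an assumption, since the text does not say how unchecked levels are encoded.

```python
from typing import NamedTuple, Optional, Tuple


class Node(NamedTuple):
    word: str
    pos: str
    children: Tuple["Node", ...] = ()


def label(node: Node, level: str) -> Optional[str]:
    """Return what is checked at a node: lexicon (L), POS (P) or nothing (N)."""
    if level == "L":
        return node.word
    if level == "P":
        return node.pos
    if level == "N":
        return None
    raise ValueError("level must be 'L', 'P' or 'N'")


def signature(tree: Node, root_level: str, child_level: str):
    """Identify a tree by its root and direct children only (depth <= 2)."""
    if child_level == "N":
        child_labels: Tuple[Optional[str], ...] = ()  # assumption: ignore children
    else:
        child_labels = tuple(label(c, child_level) for c in tree.children)
    return (label(tree, root_level), child_labels)


# A three-level tree: the grandchild never influences the signature.
tree = Node("bank", "NN", (Node("the", "DT", (Node("x", "SYM"),)),))

sig_ll = signature(tree, "L", "L")  # both levels lexicalized
sig_pp = signature(tree, "P", "P")  # both levels reduced to POS
sig_root_only = signature(tree, "L", "N")  # root lexicon only
```

Two trees with the same signature are treated as the same state component, which is what keeps the number of distinct states manageable.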
We introduce an interpolation technique (Jelinek et al., 1991) into
our model like those used in n-gram models. By loosening tree
identification regulations, we obtain a more general model. For
example, if we check only the POS of the root and the POS of its
children, we will obtain a model similar to a POS tri-gram model
(denoted $P_{PP}$ hereafter). If we check the lexicon of the root,
but not that of its children, the model will be like a word bi-gram
model (denoted $P_{NL}$ hereafter). As a smoothing method, we can
interpolate the model $P_{LL}$, similar to a word tri-gram model,
with a more general model, $P_{PP}$ or $P_{NL}$. In our experiment,
as the following formula indicates, we interpolated seven models of
different generalization levels:

$$
\begin{aligned}
P(w_i \mid \mathbf{t}_{i-1}^{+}) ={}&
   \lambda_6 P_{LL}(w_i \mid \mathbf{t}_{i-1}^{+})
 + \lambda_5 P_{PL}(w_i \mid \mathbf{t}_{i-1}^{+}) \\
&+ \lambda_4 P_{PP}(w_i \mid \mathbf{t}_{i-1}^{+})
 + \lambda_3 P_{NL}(w_i \mid \mathbf{t}_{i-1}^{+}) \\
&+ \lambda_2 P_{NP}(w_i \mid \mathbf{t}_{i-1}^{+})
 + \lambda_1 P_{NN}(w_i \mid \mathbf{t}_{i-1}^{+}) \\
&+ \lambda_0 P_{0\text{-gram}}(w_i) \qquad (2)
\end{aligned}
$$

where $X$ in $P_{YX}$ is the check level of the first level of the
tree (N: none, P: POS, L: lexicon) and $Y$ is that of the second
level, and $P_{0\text{-gram}}$ is the uniform distribution over the
vocabulary $W$ ($P_{0\text{-gram}}(w) = 1/|W|$).
The state prediction model is also interpolated in the same way. In
this case, the possible events are $y = 1, 2, \ldots, y_{max}$, thus
$P_{0\text{-gram}} = 1/y_{max}$.
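
The interpolation of several models of different generalization levels amounts to a weighted sum with coefficients that add up to one. The following Python sketch shows the arithmetic only. The function name, the component probabilities and the coefficients are made-up values for illustration, not figures from our experiment.

```python
from typing import Sequence


def interpolate(probabilities: Sequence[float], lambdas: Sequence[float]) -> float:
    """Linear interpolation: sum of lambda_k * P_k with the lambdas summing to 1."""
    if len(probabilities) != len(lambdas):
        raise ValueError("one coefficient is needed per component model")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("interpolation coefficients must sum to 1")
    return sum(lam * p for lam, p in zip(lambdas, probabilities))


vocabulary_size = 1000  # hypothetical |W|

# Component order: LL, PL, PP, NL, NP, NN, 0-gram (values are invented).
components = [0.0, 0.0, 0.2, 0.0, 0.1, 0.05, 1.0 / vocabulary_size]
coefficients = [0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]

smoothed = interpolate(components, coefficients)
```

Even though the most specific components assign zero probability in this example, the smoothed value stays positive because the uniform component never vanishes.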
2.2 Parameter Estimation
Since our model is a hidden Markov model, the parameters of a model
can be estimated from a raw corpus by the EM algorithm (Baum, 1972).
With this algorithm, the probability of the raw corpus is expected
to be maximized regardless of the structure of each sentence. So the
obtained model is not always appropriate for a parser.
In order to develop a model appropriate for a parser, it is better
that the parameters are estimated from a syntactically annotated
corpus by maximum likelihood estimation (MLE) (Merialdo, 1994) as
follows:

$$
P(w \mid \mathbf{t}^{+}) \overset{\mathrm{MLE}}{=}
  \frac{f(\langle \mathbf{t}^{+}\, w \rangle)}{f(\mathbf{t}^{+})}
\qquad
P(\mathbf{t}^{+} \mid \mathbf{t}) \overset{\mathrm{MLE}}{=}
  \frac{f(\mathbf{t}^{+}, \mathbf{t})}{f(\mathbf{t})}
$$

where $f(x)$ represents the frequency of an event $x$ in the
training corpus.
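
Maximum likelihood estimation from an annotated corpus reduces to counting and dividing. The Python sketch below shows this for generic (context, outcome) events. The function name and the toy events are assumptions for illustration; in our setting the context would be an identifier of the partial parse trees and the outcome the predicted word or the number of trees.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Event = Tuple[Hashable, Hashable]  # (context, outcome)


def relative_frequencies(events: Iterable[Event]) -> Dict[Event, float]:
    """Estimate P(outcome | context) as f(context, outcome) / f(context)."""
    observed = list(events)
    pair_counts = Counter(observed)
    context_counts = Counter(context for context, _ in observed)
    return {
        (context, outcome): count / context_counts[context]
        for (context, outcome), count in pair_counts.items()
    }


# Invented events: the context is a POS-level tree signature, the outcome a word.
toy_events = [
    (("NN",), "wa"),
    (("NN",), "wa"),
    (("NN",), "ga"),
    (("VB",), "ta"),
]
estimates = relative_frequencies(toy_events)
```

Unseen pairs are simply absent from the result, which is why the estimates are combined with more general models by interpolation.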
The interpolation coefficients in formula (2) are estimated by the
deleted interpolation method (Jelinek et al., 1991).
2.3 Selecting Words to be Lexicalized
Generally speaking, a word-based n-gram model is better than a
POS-based n-gram model in terms of predictive power; however,
lexicalization of some infrequent words may be harmful because it
may cause a data-sparseness problem. In a practical tagger (Kupiec,
1989), only the most frequent 100 words are lexicalized. Also, in a
state-of-the-art English parser (Collins, 1997) only the words that
occur more than 4 times in training data are lexicalized.
For this reason, our parser selects the words to be lexicalized at
the time of learning. In the lexicalized models described above
($P_{LL}$, $P_{PL}$ and $P_{NL}$), only the selected words are
lexicalized. The selection criterion is parsing accuracy (see
section 4) of a held-out corpus, a small part of the learning corpus
excluded from parameter estimation. Thus only the words that are
predicted to improve the parsing accuracy of the test corpus, or
unknown input, are lexicalized. The algorithm is as follows (see
Figure 3):
1. In the initial state all words are in the class of their POS.
2. All words are sorted in descending order of their frequency, and
the following process is executed for each word in this order:
(a) The word is lexicalized provisionally and the accuracy of the
held-out corpus is calculated.
(b) If an improvement is observed, the word is lexicalized
definitively.
The result of this lexicalization algorithm is used to identify a
partial parse tree. That is to say, only lexicalized words are
distinguished in lexicalized models. If no words were selected to be
lexicalized, then $P_{NL} = P_{NP}$ and
$P_{LL} = P_{PL} = P_{PP}$. It is worth noting that if we try to
join a word with another word, then this algorithm will be a normal
top-down clustering algorithm.
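
The two steps of the lexicalization algorithm can be sketched as a greedy loop in Python. Everything named here is hypothetical: `select_lexicalized` is an illustrative name, and `heldout_accuracy` stands for whatever procedure re-estimates the model with a given set of lexicalized words and parses the held-out corpus. The toy scoring function replaces that procedure only so that the example runs.

```python
from collections import Counter
from typing import Callable, FrozenSet, Set


def select_lexicalized(
    word_counts: Counter,
    heldout_accuracy: Callable[[FrozenSet[str]], float],
) -> Set[str]:
    """Greedily lexicalize words, most frequent first, keeping only improvements."""
    selected: Set[str] = set()  # initially every word is in its POS class
    best = heldout_accuracy(frozenset(selected))
    for word, _count in word_counts.most_common():
        trial = selected | {word}              # provisional lexicalization
        accuracy = heldout_accuracy(frozenset(trial))
        if accuracy > best:                    # keep it only if accuracy improves
            selected, best = trial, accuracy
    return selected


def toy_accuracy(lexicalized: FrozenSet[str]) -> float:
    """Stand-in for held-out parsing accuracy (in percent); values are invented."""
    score = 80
    if "wa" in lexicalized:
        score += 5
    if "ga" in lexicalized:
        score += 2
    if "rare" in lexicalized:
        score -= 1
    return score


counts = Counter({"wa": 50, "ga": 30, "rare": 1})
chosen = select_lexicalized(counts, toy_accuracy)
```

With these invented scores the two frequent words are kept and the infrequent one is rejected, because lexicalizing it lowers the held-out accuracy.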
2.4 Unknown Word Model 
To enable our stochastic language model to handle unknown words, we
added an unknown word model based on a character bi-gram model. If
the next word is not in the vocabulary, the model predicts its POS
and the unknown word model predicts the string of the word as
follows:

$$
P(w \mid POS) = \prod_{i=1}^{m+1} P_{POS}(x_i \mid x_{i-1})
$$

where $w = x_1 x_2 \cdots x_m$ and $x_0 = x_{m+1} = \mathrm{BT}$.
BT, a special character corresponding to a word boundary, is
introduced so that the sum of the probability over all strings is
equal to 1.

[Figure 3: A conceptual figure of the lexicalization algorithm.]

In the parameter estimation described above, a learning corpus is
divided into $k$ parts. In our experiment, the vocabulary is
composed of the words occurring in more than one partial corpus and
the other words are used for parameter estimation of the unknown
word model. The bi-gram probability of the unknown word model of a
POS is estimated from the words among them and belonging to the POS
as follows:

$$
P_{POS}(x_i \mid x_{i-1}) \overset{\mathrm{MLE}}{=}
  \frac{f_{POS}(x_i, x_{i-1})}{f_{POS}(x_{i-1})}
$$

The character bi-gram model is also interpolated with a uni-gram
model and a zero-gram model. The interpolation coefficients are
estimated by the deleted interpolation method (Jelinek et al.,
1991).
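
The product over character bi-grams, with the boundary symbol BT placed before the first and after the last character, can be written as a few lines of Python. This is a sketch under assumptions: the string used to represent BT, the function name, and the uniform toy bi-gram are illustrative choices, not the estimated model.

```python
from typing import Callable

BT = "<BT>"  # assumed representation of the word-boundary symbol


def unknown_word_prob(word: str, bigram: Callable[[str, str], float]) -> float:
    """P(w | POS) as the product of m + 1 character bi-gram probabilities."""
    chars = [BT] + list(word) + [BT]
    probability = 1.0
    for previous, current in zip(chars, chars[1:]):
        probability *= bigram(current, previous)  # P_POS(x_i | x_{i-1})
    return probability


def uniform_bigram(current: str, previous: str) -> float:
    """Toy model: three characters plus BT, all equally likely (invented)."""
    return 0.25


# A two-character word needs three factors: BT->a, a->b, b->BT.
p_ab = unknown_word_prob("ab", uniform_bigram)
```

Because the final factor predicts BT, longer strings receive geometrically smaller probabilities, which is what allows the probabilities of all strings to sum to 1.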
3 Syntactic Analysis 
Generally, a parser may be considered as a module that receives a
sequence of words annotated with a POS and outputs its structure.
Our parser, which includes a stochastic unknown word model, however,
is able to accept a character sequence as an input and execute
segmentation, POS tagging, and syntactic analysis simultaneously
(see footnote 1). In this section, we explain our parser, which is
based on the language model described in the preceding section.
3.1 Stochastic Syntactic Analyzer
A syntactic analyzer, based on a stochastic language model,
calculates the parse tree (see Figure 1) with the highest
probability for a given sequence of characters $x$ according to the
following formula:

$$
\begin{aligned}
\hat{T} &= \operatorname*{argmax}_{w(T)=x} P(T \mid x) \\
&= \operatorname*{argmax}_{w(T)=x} P(T \mid x)\, P(x) \\
&= \operatorname*{argmax}_{w(T)=x} P(x \mid T)\, P(T)
   \quad \text{(by Bayes' formula)} \\
&= \operatorname*{argmax}_{w(T)=x} P(T)
   \quad \text{(since } P(x \mid T) = 1\text{)}
\end{aligned}
$$

where $w(T)$ represents the concatenation of the word string in the
syntactic tree $T$. $P(T)$ in the last line is a stochastic language
model; in our parser, it is the probability of a parse tree $T$
defined by the stochastic dependency model including the unknown
word model described in section 2.

$$
P(T) = \prod_{i=1}^{n} P(w_i \mid \mathbf{t}_{i-1}^{+})\,
       P(\mathbf{t}_{i-1}^{+} \mid \mathbf{t}_{i-1}) \qquad (3)
$$

where $w_1 w_2 \cdots w_n = w(T)$.

1 There is no space between words in Japanese.
3.2 Solution Search Algorithm 
As shown in formula (3), our parser is based on a hidden Markov
model. It follows that the Viterbi algorithm is applicable to search
for the best solution. The Viterbi algorithm is capable of
calculating the best solution in O(n) time, where n is the number of
input characters.
The parser repeats a state transition, reading characters of the
input sentence from left to right. In order that the structure of
the input sentence may be a tree, the number of trees of the final
state $\mathbf{t}_n$ must be 1 and no more. Among the states that
satisfy this condition, the parser selects the state with the
highest probability. Since our language model uses only the root and
its children of a partial parse tree to distinguish states, the last
state does not have enough information to construct the parse tree.
The parser can, however, calculate the parse tree from the sequence
of states, or both the word sequence and the sequence of $y$, the
number of trees that depend on the next word. Thus it memorizes
these values at each step of prediction. After the most probable
last state has been selected, the parser constructs the parse tree
by reading these sequences from top to bottom.
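
Reconstructing the parse tree from the memorized sequence of $y$ values can be sketched as follows. The function name and the representation of the result (a list giving, for each word, the index of the word it depends on) are assumptions for this example. A value of y = 0 is allowed here for words on which no earlier tree depends.

```python
from typing import List, Optional, Sequence


def heads_from_ys(ys: Sequence[int]) -> List[Optional[int]]:
    """Recover each word's head index from the number of trees depending on it."""
    roots: List[int] = []                      # root word index of each partial tree
    heads: List[Optional[int]] = [None] * len(ys)
    for i, y in enumerate(ys):
        if not 0 <= y <= len(roots):
            raise ValueError("y exceeds the number of partial parse trees")
        split = len(roots) - y
        for root in roots[split:]:             # the last y trees depend on word i
            heads[root] = i
        roots = roots[:split] + [i]
    if len(roots) != 1:
        raise ValueError("the final state must contain exactly one tree")
    return heads


# Seven words: w1->w2, w4->w5, then w3 and w5 -> w6, finally w2 and w6 -> w7.
example_heads = heads_from_ys([0, 1, 0, 0, 1, 2, 2])
```

The final check mirrors the condition that the last state must consist of a single tree; the root of that tree is the only word without a head.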
4 Evaluation 
We developed a POS-based model and its lexicalized version explained
in section 2 to evaluate their predictive power, and implemented
parsers based on them that calculate the most probable dependency
tree from a given character sequence, using the solution search
algorithm explained in section 3 to observe their accuracy. In this
section, we present and discuss the experimental results.
4.1 Conditions on the Experiments 
The corpus used in our experiments consists of articles extracted
from a financial newspaper (Nihon Keizai Shinbun). Each sentence in
the articles is segmented into words and its dependency structure is
annotated by linguists using an editor specially designed for this
task at our site. The corpus was divided into ten parts; the
parameters of the model were estimated from nine of them and the
model was tested on the rest (see Table 1). A small part of each
learning corpus is withheld from parameter estimation and used to
select the words to be lexicalized. After checking the learning
corpus, the maximum number of partial parse trees is set to 10.

Table 1: Corpus.
            #sentences    #words    #chars
learning         1,072    30,292    46,212
test               119     3,268     4,909

To evaluate the predictive power of our models, we calculated their
cross entropy on the test corpus. In this process, the annotated
tree in the test corpus is used as the structure of the sentences.
Therefore the probability of each sentence in the test corpus is not
the summation over all its possible derivations. To compare the
POS-based model and the lexicalized model, we constructed these
models using the same learning corpus and calculated their cross
entropy on the same test corpus. The POS-based model and the
lexicalized model have the same unknown word model, thus its
contribution to the cross entropy is constant.
We implemented a parser based on the dependency models. Since our
models, including a character-based unknown word model, can return
the best parse tree with its probability for any input, we can build
a parser that receives a character sequence as input. It is not easy
to evaluate, however, because errors may occur in segmentation of
the sequence into words and in estimation of their POSs. For this
reason, in the following description, we assume a word sequence as
the input.
The criterion for a parser is the accuracy of its output dependency
relations. This criterion is widely used to evaluate Japanese
dependency parsers. The accuracy is the ratio of the number of the
words annotated with the same dependency as in the corpus to the
number of the words:

$$
\text{accuracy} =
  \frac{\#\text{words depending on the correct word}}{\#\text{words}}
$$

The last word and the second-to-last word of a sentence are
excluded, because there is no ambiguity. The last word has no word
to depend on and the second-to-last word always depends on the last
word.
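
The accuracy criterion, including the exclusion of the last two words of every sentence, can be computed as in the following Python sketch. The function name and the head-index encoding (one head index per word, `None` for the last word) are assumptions made for this example.

```python
from typing import List, Optional, Sequence

Heads = Sequence[Optional[int]]  # head index per word, None for the last word


def dependency_accuracy(gold: Sequence[Heads], predicted: Sequence[Heads]) -> float:
    """Share of words with the correct head, skipping the last two words."""
    correct = 0
    total = 0
    for gold_heads, predicted_heads in zip(gold, predicted):
        if len(gold_heads) != len(predicted_heads):
            raise ValueError("gold and predicted sentences differ in length")
        # The last word has no head and the one before it has no ambiguity.
        for g, p in zip(gold_heads[:-2], predicted_heads[:-2]):
            total += 1
            correct += int(g == p)
    if total == 0:
        raise ValueError("no words left to evaluate")
    return correct / total


gold_corpus: List[Heads] = [[1, 6, 5, 4, 5, 6, None]]
parsed_corpus: List[Heads] = [[1, 2, 5, 4, 5, 6, None]]

score = dependency_accuracy(gold_corpus, parsed_corpus)
```

In this toy sentence five words are evaluated and one of them has the wrong head, so four out of five dependencies count as correct.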

Table 2: Cross entropy and accuracy of each model.
language model            cross entropy    accuracy
selectively lexicalized           6.927       89.9%
completely lexicalized            6.651       87.1%
POS-based                         7.000       87.5%
linear structure*                     -       78.7%
* Each word depends on the next word.

4.2 Evaluation
Table 2 shows the cross entropy and parsing accuracy of the
baseline, where all words depend on the next word, the POS-based
dependency model and two lexicalized dependency models. In the
selectively lexicalized model, words to be lexicalized are selected
by the algorithm described in section 2. In the completely
lexicalized model, all words are lexicalized. This result attests
experimentally that the parser based on the selectively lexicalized
model is the best parser. As for predictive power, however, the
completely lexicalized model has the lowest cross entropy. Thus this
model is estimated to be the best language model for speech
recognition. Although there is no direct relation between cross
entropy of the language model and error rate of a speech recognizer,
if we consider a spoken language parser, it may be better to select
the words to be lexicalized using another criterion.
We calculated the cross entropy and the parsing accuracy of the
model whose parameters are estimated from 1/4, 1/16, and 1/64 of the
learning corpus. The relation between the learning corpus size and
the cross entropy or the parsing accuracy is shown in Figure 4. The
cross entropy has a stronger tendency to decrease as the corpus size
increases. As for accuracy, there is also a tendency for parsers to
become more accurate as the size of the learning corpus increases.
The size of the corpus we have at this stage is not at all large.
However, its accuracy is at the top level of Japanese parsers, which
use 50,000-190,000 sentences. Therefore, we conclude that our
approach is quite promising.
5 Related Works 
Historically, structures of natural languages have been described by
a context-free grammar and ambiguities have been resolved by parsers
based on a context-free grammar (Fujisaki et al., 1989). In recent
years, some attempts have been made in the area of parsing by a
finite state model (Oflazer, 1999) etc. Our parser is also based on
a finite state model. Unlike these models, we focused on reports on
a limit on language structure caused by the capacity of our memory
(Yngve, 1960) (Miller, 1956). Thus our model is psycholinguistically
more appropriate.

[Figure 4: Relation between cross entropy and parsing accuracy.
Horizontal axis: #characters in learning corpus (logarithmic scale);
vertical axes: cross entropy and accuracy (%). Accuracy values
plotted: 84.0%, 86.2%, 88.1%, 89.9%.]

Recently, in the area of parsers based on a stochastic context-free
grammar (SCFG), some researchers have pointed out the importance of
the lexicon and proposed lexicalized models (Charniak, 1997;
Collins, 1997). In these papers, they reported significant
improvement of parsing accuracy. Taking these reports into account,
we introduced a method of partial lexicalization and reported
significant improvement of parsing accuracy. Our lexicalization
method is also applicable to a SCFG-based parser and improves its
parsing accuracy.
The model we present in this paper is a generative stochastic
language model. Chelba and Jelinek (1998) presented a similar model.
In their model, each word is predicted from two right-most head
words regardless of dependency relation between these head words and
the word. Eisner (1996) also presented a stochastic structural
language model, in which each word is predicted from its head word
and the nearest one. This model is very similar to the parser
presented by Collins (1996). The greatest difference between our
model and these models is that our model predicts the next word from
the head words, or partial parse trees, depending on it. Clearly, it
is not always two right-most head words that have dependency
relation with the next word. It follows that our model is
linguistically more appropriate.
There have been some attempts at stochastic Japanese parsers (Haruno
et al., 1998) (Fujio and Matsumoto, 1998) (Mori and Nagao, 1998).
These Japanese parsers are based on a unit called bunsetsu, a
sequence of one or more content words followed by zero or more
function words. The parsers take a sequence of units and output
dependency relations between them. Unlike these parsers, our model
describes dependencies between words; thus our model can easily be
extended to other languages. As for the accuracy, although a direct
comparison is not easy between our parser (89.9%; 1,072 sentences)
and these parsers (82% - 85%; 50,000 - 190,000 sentences) because of
the difference of the units and the corpus, our parser is one of the
state-of-the-art parsers for the Japanese language. It should be
noted that our model describes relations among three or more units
(case frame, consecutive dependency relations, etc.); thus our model
benefits a greater deal from increase of corpus size.
6 Conclusion 
In this paper we have presented a stochastic language model based on
dependency structure. This model treats a sentence as a word
sequence and predicts each word from left to right. The history at
each step of prediction is a sequence of partial parse trees
covering the preceding words. To predict a word, our model first
selects the partial parse trees that have a dependency relation with
the word, and then predicts the next word from the selected partial
parse trees. We also presented an algorithm for lexicalization. We
built parsers based on the POS-based model and its lexicalized
version, whose parameters are estimated from 1,072 sentences of a
financial newspaper. We tested the parsers on 119 sentences from the
same newspaper, which were excluded from the learning. The accuracy
of the dependency relation of the lexicalized parser was 89.9%, the
highest obtained by any Japanese stochastic parser.

References 

L. E. Baum. 1972. An inequality and associated
maximization technique in statistical estimation
for probabilistic functions of Markov process.
Inequalities, 3:1-8.

Eugene Charniak. 1997. Statistical parsing with a
context-free grammar and word statistics. In
Proceedings of the 14th National Conference on
Artificial Intelligence, pages 598-603.

Ciprian Chelba and Frederick Jelinek. 1998. Exploiting
syntactic structure for language modeling. In
Proceedings of the 17th International Conference
on Computational Linguistics, pages 225-231.

Kenneth Ward Church. 1988. A stochastic parts
program and noun phrase parser for unrestricted
text. In Proceedings of the Second Conference on
Applied Natural Language Processing, pages 136-143.

Michael John Collins. 1996. A new statistical parser
based on bigram lexical dependencies. In Proceedings
of the 34th Annual Meeting of the Association
for Computational Linguistics, pages 184-191.

Michael Collins. 1997. Three generative, lexicalised
models for statistical parsing. In Proceedings of
the 35th Annual Meeting of the Association for
Computational Linguistics, pages 16-23.

Doug Cutting, Julian Kupiec, Jan Pedersen, and
Penelope Sibun. 1992. A practical part-of-speech
tagger. In Proceedings of the Third Conference on
Applied Natural Language Processing, pages 133-140.

Evangelos Dermatas and George Kokkinakis. 1995.
Automatic stochastic tagging of natural language
texts. Computational Linguistics, 21(2):137-163.

Jason M. Eisner. 1996. Three new probabilistic
models for dependency parsing: An exploration.
In Proceedings of the 16th International Conference
on Computational Linguistics, pages 340-345.

Masakazu Fujio and Yuji Matsumoto. 1998.
Japanese dependency structure analysis based on
lexicalized statistics. In Proceedings of the Third
Conference on Empirical Methods in Natural Language
Processing, pages 87-96.

T. Fujisaki, F. Jelinek, J. Cocke, E. Black, and
T. Nishino. 1989. A probabilistic parsing method
for sentence disambiguation. In Proceedings of the
International Parsing Workshop.

Masahiko Haruno, Satoshi Shirai, and Yoshifumi
Ooyama. 1998. Using decision trees to construct a
practical parser. In Proceedings of the 17th
International Conference on Computational Linguistics,
pages 505-511.

Frederick Jelinek, Robert L. Mercer, and Salim
Roukos. 1991. Principles of lexical language
modeling for speech recognition. In Advances in
Speech Signal Processing, chapter 21, pages 651-699.
Dekker.

Julian Kupiec. 1989. Augmenting a hidden Markov
model for phrase-dependent word tagging. In
Proceedings of the DARPA Speech and Natural Language
Workshop, pages 92-98.

Bernard Merialdo. 1994. Tagging English text with
a probabilistic model. Computational Linguistics,
20(2):155-171.

George A. Miller. 1956. The magical number seven,
plus or minus two: Some limits on our capacity
for processing information. The Psychological
Review, 63:81-97.

Shinsuke Mori and Makoto Nagao. 1998. A stochastic
language model using dependency and its improvement
by word clustering. In Proceedings of
the 17th International Conference on Computational
Linguistics, pages 898-904.

Kemal Oflazer. 1999. Dependency parsing with an
extended finite state approach. In Proceedings of
the 37th Annual Meeting of the Association for
Computational Linguistics, pages 254-260.

Victor H. Yngve. 1960. A model and a hypothesis
for language structure. The American Philosophical
Society, 104(5):444-466.
