A Stochastic Language Model using Dependency 
and Its Improvement by Word Clustering 
Shinsuke Mori * 
Tokyo Research Labolatory, 
IBM Japan, Ltd. 
1623-14 Shimotsuruma 
Yamatoshi, Japan 
Makoto Nagao 
Kyoto University 
Yoshida-honmachi Sakyo 
Kyoto, Japan 
Abstract 
In this paper, we present a stochastic language 
model for Japanese using dependency. The predic- 
tion unit in this model is all attribute of "bunsetsu". 
This is represented by the product of the head of con- 
tent words and that of function words. The relation 
between the attributes of "bunsetsu" is ruled by a 
context-free grammar. The word sequences axe pre- 
dicted from the attribute using word n-gram model. 
The spell of Unknow word is predicted using charac- 
ter n-grain model. This model is robust in that it can 
compute the probability of an arbitrary string and 
is complete in that it models from unknown word to 
dependency at the same time. 
1 Introduction 
An effectiveness of stochastic language modeling as 
a methodology of natural language processing has 
been attested by various applications to the recog- 
nition system such as speech recognition and to the 
analysis system such as paxt-of-speech (POS) tagger. 
In this methodology a stochastic language model 
with some parameters is built and they axe estimated 
in order to maximize its prediction power (minimize 
the cross entropy) on an unknown input. Consid- 
ering a single application, it might be better to es- 
timate the parameters taking account of expected 
accuracy of recognition or analysis. This method is, 
however, heavily dependent on the problem and of_ 
fers no systematic solution, as fax as we know. The 
methodology of stochastic language modeling, how- 
ever, allows us to separate, from various frameworks 
of natural language processing, the language descrip- 
tion model common to them and enables us a sys- 
tematic improvement of each application. 
In this framework a description on a language is 
represented as a map from a sequence of alphabetic 
characters to a probability value. The first model 
is C. E. Shannon's n-gram model (Shannon, 1951). 
The parameters of the model are estimated from the 
frequency of n character sequences of the alphabet 
(n-gram) on a corpus containing a large number of 
sentences of a language. This is the same model as 
0 This work is done when the auther was at Kyoto Univ. 
used in almost all of the recent practicM applications 
in that it describes only relations between sequential 
elements. Some linguistic phenomena, however, axe 
better described by assuming relations between sep- 
axated elements. And modeling this kind of phenom- 
ena, the accuracies of various application axe gener- 
ally augmented. 
As for English, there have been researches in 
which a stochastic context-free grammar (SCFG) 
(Fujisaki et ~1., 1989) is used for model descrip- 
tion. Recently some researchers have pointed out the 
importance of the lexicon and proposed lexicalized 
models (Jelinek et al., 1994; Collins, 1997). In these 
models, every headword is propagated up through 
the derivation tree such that every parent receives a 
headword from the head-child. This kind of special- 
ization may, however, be excessive if the criterion is 
predictive power of the model. Research ~med at 
estimating the best specialization level for 2-gram 
model (Mori et aL, 1997) shows a class-based model 
is more predictive than a word-based 2-gram model, 
a completely lexicalized model, comparing cross en- 
tropy of a POS-based 2-graxa model, a word-based 
2-gram model and a class-based 2-graxa model, es- 
timated from information theoretical point of view. 
As for a parser based on a class-based SCFG, Chax- 
niak (1997) reports better accuracy than the above 
lexicalized models, but the clustering method is not 
clear enough and, in addition, there is no report 
on predictive power (cross entropy or perplexity). 
Hogenhout and Matsumoto (1997) propose a word- 
clustering method based on syntactic behavior, but 
no language model is discussed. As the experiments 
in the present paper attest, word-class relation is 
dependent on language model. 
In this paper, taking Japanese as the object lan- 
guage, we propose two complete stochastic language 
models using dependency between bugsetsu, a se- 
quence of one or more content words followed by 
zero, one or more function words, and evaluate their 
predictive power by cross entropy. Since the number 
of sorts of bunsetsu is enormous, considering it as a 
symbol to be predicted would surely invoke the data- 
sparseness problem. To cope with this problem we 
898 
use the concept of class proposed for a word n-gram 
model (Brown et al., 1992). Each bunsetsu is repre- 
sented by the class calculated from the POS of its 
last content word and that of its last function word. 
The relation between bunsetsu, called dependency, is 
described by a stochastic context-free grammar (Fu, 
1974) on the classes. From the class of a bunsetsu, 
the content word sequence and the function word se- 
quence are independently predicted by word n-gram 
models equipped with unknown word models (Mori 
and Yamaji, 1997). 
The above model assumes that the syntactic be- 
havior of each bunsetsu depends only on POS. The 
POS system invented by grammarians may not al- 
ways be the best in terms of stochastic language 
modeling. This is experimentally attested by the 
paper (Mori et al., 1997) reporting comparisons be- 
tween a POS-based n-gram model and a class-based 
n-gram model induced automatically. SVe now pro- 
pose, based on this report, a word-clustering method 
on the model we have mentioned above to success- 
fully improve the predictive power. In addition, we 
discuss a parsing method as an application of the 
model. 
We also report the result of experiments con- 
ducted on EDR corpus (Jap, 1993) The corpus is di- 
vided into ten parts and the models estimated from 
nine of them axe tested on the rest in terms of cross 
entropy. As the result, the cross entropy of the POS- 
based dependency model is 5.3536 bits axtd that of 
the class-based dependency model estimated by our 
method is 4.9944 bits. This shows that the clus- 
tering method we propose improves the predictive 
power of the POS-based model notably. Addition- 
ally, a parsing experiment proved that the parser 
based on the improved model has a higher accuracy 
than the POS-based one. 
2 Stochastic Language Model based 
on Dependency 
In this section, we propose a stochastic language 
model based on dependency. Formally this model is 
based on a stochastic context-free grammar (SCFG). 
The terminal symbol is the attribute of a bunsetsu, 
represented by the product of the head of the con- 
tent part and that of the function part. From the 
attribute, a word sequence that matches the bun. 
setsu is predicted by a word-based 2-gram model, 
and unknown words axe predicted from POS by a 
character-based 2-gram model. 
2.1 Sentence Model 
A Japanese sentence is considered as a sequence of 
units called bunsetsu composed of one or more con- 
tent words and function words. Let Cont be a set 
of content words, Func a set of function words and 
Sign a set of punctuation symbols. Then bunsetsu 
is defined as follows: 
Bnst = Cont+ Func * U Cont+ Func* Sign, 
where the signs "+" and "*" mean positive closure 
and Kleene closure respectively. Since the relations 
between bunsetsu known as dependency are not al- 
ways between sequential ones, we use SCFG to de- 
scribe them (Fu, 1974). The first problem is how 
to choose terminal symbols. The simplest way is to 
select each bunsetsu as a terminal symbol. In this 
case, however, the data-sparseness problem would 
surely be invoked, since the number of possible bun- 
setsu is enormous. To avoid this problem we use the 
concept of class proposed for a word n-gram model 
(Brown et al., 1992). All bunsetsu axe grouped by 
the attribute defined as follows: 
attrib(b) (1) 
= qast(co.t(b)), last(f..c(b)), Zast(sig.(b))), 
where the functions cont, func and sign take a 
bun~etsu as their argument and return its content 
word sequence, its function word sequence and its 
punctuation respectively. In addition, the function 
last(m) returns the POS of the last element of word 
sequence m or NULL if the sequence has no word. 
Given the attribute, the content word sequence and 
the function word sequence of the bunsetsu axe inde- 
pendently generated by word-based 2-gram models 
(Mori and Yamaji, 1997). 
2.2 Dependency Model 
In order to describe the relation between bunsetsu 
called dependency, we make the generally accepted 
assumption that no two dependency relations cross 
each other, and we introduce a SCFG with the at- 
tribute of bunsetsu as terminals. It is known, as a 
characteristic of the Japanese language, that each 
bunsetsu depends on the single bunsetsu appearing 
just before it. We say of two sequential bunsetsu 
that the first to appear is the anterior and the sec- 
ond is the posterior. We assume, in addition, that 
the dependency relation is a binary relation - that 
each relation is independent of the others. Then 
this relation is representing by the following form of 
rewriting rule of CFG: B =~ AB, where A is the at- 
tribute of the anterior bunsetsu and B is that of the 
posterior. 
Similarly to terminal symbols, non-terminal sym- 
bols can be defined as the attribute of bunsetsu. Also 
they can be defined as the product of the attribute 
and some additional information to reflect the char- 
acteristics of the dependency. It is reported that the 
dependency is more frequent between closer bunsetsu 
in terms of the position in the sentence (Maruyama 
and Ogino, 1992). In order to model these char- 
acteristics, we add to the attribute of bunsetsu an 
899 
(verb. ending, period. 2.0) 
(noun, NULL. comma, O, 0) 
kyou/noun ./sign 
(today) 
(noun. postp.. NULL. 0. 0) 
Kyoto / noun daigaku / noun he/postp. 
(Kyoto) (university) (to) 
I 
SCFG 
(verb. ending, period. 0.0) "~ j n-gram 
i/verb ku/ending ./sign 
(go) 
Figure 1: Dependency model based on bunsetsu 
additional information field holding the number of 
bunsetsu depending on it. Also the fact that a bun. 
setsu has a tendency to depend on a bunsetsu with 
comma. For this reason the number of bunsetsu with 
comma depending on it is also added. To avoid 
data-sparseness problem we set an upper bound for 
these numbers. Let d be the number of bunsetsu de- 
pending on it and v be the number of bunsetsu with 
comma depending on it, the set of terminal symbols 
T and that of non-terminal symbols V is represented 
as follows (see Figure 1): 
T = attrib(b) × {0} × {0} 
V=attrib(b) × {1, 2, ""dmaz} x {0, 1, "''Vmaz}. 
It should be noted that terminal symbols have no 
bunsetsu depending on them. It follows that all 
rewriting rules are in the following forms: 
S ~ (a, d, v) (2) 
(~, d~, v,)~ (a,, d~, v~){~3, d~, ~) (3) 
a 1 = a 3 
dl = min(ds + i, dm~.) 
min(vs + 1, v,n~.) 
vl = if sign(a2) = comma 
v3 otherwise 
where a is the attribute of bunsetsu. 
The attribute sequence of a sentence is generated 
through applications of these rewriting rules to the 
start symbol S. Each rewriting rule has a probability 
and the probability of the attribute sequence is the 
product of those of the rewriting rules used for its 
generation. Taking the example of Figure 1, this 
value is calculated as follows: 
P((noun,  JLL, comma, 0, 0) 
(noun, postp., NULL, 0, 0) 
(verb, ending, period, 0, 0)) 
= P(S ~ (verb, ending, perlod, 2, 0)) 
× P((verb, ending, period, 2, O) 
=~ (noun, NULL, comma, 0, 0) 
(verb, ending, period, 1, 0)) 
× P((verb, ending, period, 1, 0) 
=~ (noun, postp., NULL, 0, 0) 
(verb, ending, period, 0, 0)). 
The probability value of each rewriting rule is esti- 
mated from its frequency N in a syntactically anno- 
tated corpus as follows: 
P(S ~ (a~, all, vl)) 
N(S ::~ (al, dl, Va)) 
N(s) 
N((al, dl, vl)=~ (a2, d2, v~)(a3, d3, v3)) 
N((.I, dl, vl)) 
In a word n-gram model, in order to cope with 
data-sparseness problem, the interpolation tech- 
nique is applicable to SCFG. The probability of the 
interpolated model of grammars G1 and G2, whose 
900 
probabilities axe P1 and P2 respectively, is repre- 
sented as follows: 
P(A =~ a) = ~IPI(A =~ c~) +,~P2(A =~ a) 
0<~j < l(j=l, 2) and ~,+~2=1 (4) 
where A E V and a E (VUT)*. The coefficients are 
estimated by held-out method or deleted interpola- 
tion method (Jelinek et al., 1991). 
3 Word Clustering 
The model we have mentioned above uses the POS 
given manually for the attribute of bunsetsu. Chang- 
ing it into some class may improve the predictive 
power of the model. This change needs only a slight 
replacement in the model representing formula (1): 
the function last returns the class of the last word of 
a word sequence rn instead of the POS. The problem 
we have to solve here is how to obtain such classes 
i.e. word clustering. In this section, we propose 
an objective function and a search algorithm of the 
word clustering. 
3.1 Objective Function 
The aim of word clustering is to build a language 
model with less cross entropy without referring to 
the test corpus. Similar reseaxch has been success- 
ful, aiming at an improvement of a word n-gram 
model both in English and Japanese (Mori et al., 
1997). So we have decided to extend this research 
to obtain an optimal word-class relation. The only 
difference from the previous research is the language 
model. In this case, it is a SCFG in stead of a n- 
gram model. Therefore the objective function, called 
average cross entropy, is defined as follows: 
m 
y= __1 ~ H(Li,Mi), (5) 
m i----1 
where Li is the i-th learning corpus and Mi is the 
language model estimated from the learning corpus 
excluding the i-th learning corpus. 
3.2 Algorithm 
The solution space of the word clustering is the set of 
all possible word-class relations. The caxdinality of 
the set, however, is too enormous for the dependency 
model to calculate the average cross entropy for all 
word-class relations and select the best one. So we 
abandoned the best solution and adopted a greedy 
algorithm as shown in Figure 2. 
4 Syntactic Analysis 
Syntactic Analysis is defined as a function which 
receives a character sequence as an input, divides 
it into a bunsetsu sequence and determines depen- 
dency relations among them, where the concatena- 
tion of character sequences of all the bunsetsu must 
Let ml, m2, ..., mn be .b4 sorted 
in the descending order of frequency. 
cl := {ml, m2, ..., m,} 
c = {Cl} 
foreach i (1, 2, .--, n) 
f(mi) := cl 
foreach i (1, 2, ..., n) 
c := argmincecuc,,~ -H(move(f, mi, c)) 
if (-H(move(f, mi, c)) < H(f)) then 
/:= move(/, ms, c) 
update interpolation coeffÉcients. 
if (c = c,e~) then 
C := C u {c,,,,,,} 
iffil 
iffi2 
i=3 
i=4 
update interpolation coefficients 
c! "- ........................................ 
:" ................... i.:::~.::-:~., update interpolation coefficients 
update interpolation coefficient.5 
Figure 2: The clustering algorithm. 
be equal to the input. Generally there axe one or 
more solutions for any input. A syntactic analyzer 
chooses the structure which seems the most similar 
to the human decision. There are two kinds of an- 
alyzer: one is called a rule-based analyzer, which is 
based on rules described according to the intuition 
of grarnmarians; the other is called a corpus-based 
analyzer, because it is based on a large number of 
analyzed examples. In this section, we describe a 
stochastic syntactic analyzer, which belongs to the 
second category. 
4.1 Stochastic Syntactic Analyzer 
A stochastic syntactic analyzer, based on a stochas- 
tic language model including the concept of depen- 
dency, calculates the syntactic tree (see Figure 1) 
with the highest probability for a given input x ac- 
cording to the following formula: 
rh = argmax P(Tia~) U~(T)=Z 
901 
Table 1: Corpus. Table 2: Predictive power. 
#sentences #bunsetsu #word 
learning 174,524 1,610,832 4,251,085 
test 19,397 178,415 471,189 
#non-terminal cross 
language model +#terminal entropy 
POS-based model 576 5.3536 
class-based model 10,752 4.9944 
= argmax P(TIx)P(x ) 
W(T)=Z 
= argmax P(~\]T)P(T) ('." Bayes' formula) W(T)=:v 
=argmaxP(T) ('." P(xlT ) = 1), W(T)=Z 
where to (T) represents the character sequence of the 
syntactic tree T. P(T) in the last line is a stochas- 
tic language model including the concept of depen- 
dency. We use, as such a model, the POS-based de- 
pendency model described in section 2 or the class- 
based dependency model described in section 3. 
4.2 Solution Search Algorithm 
The stochastic context-free grammar used for syn- 
tactic analysis consists of rewriting rules (see for- 
mula (3)) in Chom~ky normal form (Hopcroft and 
Ullman, 1979) except for the derivation from the 
start symbol (formula (2)). It follows that a CKY 
method extended to SCFG, a dynamic-programming 
method, is applicable to calculate the best solution 
in O(n 3) time, where n is the number of input char- 
acters. It should be noted that it is necessary to 
multiply the probability of the derivation from the 
start symbol at the end of the process. 
5 Evaluation 
We constructed the POS-based dependency model 
and the class-based dependency model to evaluate 
their predictive power. In addition, we implemented 
parsers based on them which calculate the best syn- 
tactic tree from a given sequence of bun~etsu to ob- 
serve their accuracy. In this section, we present the 
experimental results and discuss them. 
5.1 Conditions on the Experiments 
As a syntactically annotated corpus we used EDR 
corpus (Jap, 1993). The corpus was divided into 
ten parts and the models estimated from nine of 
them were tested on the rest in terms of cross en- 
tropy (see Table 1). The number of characters in 
the Japanese writing system is set to 6,879. Two 
parameters which have not been determined yet in 
the explanation of the models (dmaz and v,naz) axe 
both set to 1. Although the best value for each of 
them can also be estimated using the average cross 
entropy, they are fixed through the experiments. 
5.2 Evaluation of Predictive Power 
For the purpose of evaluating the predictive power 
of the models, we calculated their cross entropy on 
the test corpus. In this process the annotated tree 
is used as the structure of the sentences in the test 
corpus. Therefore the probability of each sentence 
in the test corpus is not the summation over all its 
possible derivations. In order to compare the POS- 
based dependency model and the class-based depen- 
dency model, we constructed these models from the 
same learning corpus and calculated their cross en- 
tropy on the same test corpus. They are both inter- 
polated with the SCFG with uniform distribution. 
The processes for their construction are as follows: 
• POS-based dependency model 
1. estimate the interpolation coefficients in 
Formula (4) by the deleted interpolation 
method 
2. count the frequency of each rewriting rule 
on the whole learning corpus 
• class-based dependency model 
1. estimate the interpolation coefficients in 
Formula (4) by the deleted interpolation 
method 
2. calculate an optimal word-class relation by 
the method proposed in Section 3. 
3. count the frequency of each rewriting rule 
on the whole learning corpus 
The word-based 2-gram model for bunsetsu gener- 
ation and the character-based 2-gram model as an 
unknown word model (Mori and Yamaji, 1997) are 
common to the POS-based model and class-based 
model. Their contribution to the cross entropy is 
constant on the condition that the dependency mod- 
els contain the prediction of the last word of the con- 
tent word sequence and that of the function word 
sequence. 
Table 2 shows the cross entropy of each model 
on the test corpus. The cross entropy of the class- 
based dependency model is lower than that of the 
POS-based dependency model. This result attests 
experimentally that the class-based model estimated 
by our clustering method is more predictive than 
the POS-based model and that our word clustering 
902 
Table 3: Accuracy of each model. 
language model cross entropy accuracy 
POS-based model 5.3536 68.77% 
class-based model 4.9944 81.96% 
select always 53.10% the next bunsetsu 
method is efficient at improvement of a dependency 
model. 
We also calculated the cross entropy of the class- 
based model which we estimated with a word 2-gram 
model as the model M in the Formula (5). The num- 
ber of terminals and non-terminals is 1,148,916 and 
the cross entropy is 6.3358, which is much higher 
than that of the POS-base model. This result indi- 
cates that the best word-class relation for the depen- 
dency model is quite different from the best word- 
class relation for the n-gram model. Comparing the 
number of the terminals and non-terminals, the best 
word-class relation for n-gram model is exceedingly 
specialized for a dependency model. We can con- 
clude that word-class relation depends on the lan- 
guage model. 
5.3 Evaluation of Syntactic Analysis 
SVe implemented a parser based on the dependency 
models. Since our models, equipped with a word- 
based 2-graan model for bunsetsu generation and the 
character-based 2-gram as an unknown word model, 
can return the probability for amy input, we can 
build a parser, based on our model, receiving a char- 
acter sequence as input. Its evaluation is not easy, 
however, because errors may occur in bunsetsu gen- 
eration or in POS estimation of unknown words. For 
this reason, in the following description, we assume 
a bunsetsu sequence as the input. 
The criterion we adopted is the accuracy of depen- 
dency relation, but the last bunsetsu, which has no 
bunsetsu to depend on, and the second-to-last bun- 
setsu, which depends always on the last bunsetsu, 
are excluded from consideration. 
Table 3 shows cross entropy and parsing accuracy 
of the POS-based dependency model and the class- 
based dependency model. This result tells us our 
word clustering method increases parsing accuracy 
considerably. This is quite natural in the light of the 
decrease of cross entropy. 
The relation between the learning corpus size and 
cross entropy or parsing accuracy is shown in Fig- 
ure 3. The lower bound of cross entropy is the en- 
tropy of Japanese, which is estimated to be 4.3033 
bit (Mori and Yamaji, 1997). Taking this fact into 
consideration, the cross entropy of both of the mod- 
els has stronger tendency to decrease. As for ac- 
12 
10 
4 
2 
01'0, 
pOS-bm~l ~M~/m~l 
dm=-Imsetl aep*t~'y mad 
100% 
8O 
=. 
so 
2 
40 
20 
i i i i t i 0 
101 102 10 ~ 104 105 106 107 
#characters in learning corpus 
Figure 3: Relation between cross entropy and pars- 
ing accuracy. 
curacy, there also is a tendency to get more accu- 
rate as the learning corpus size increases, but it is a 
strong tendency for the class-based model than for 
the POS-based model. It follows that the class-based 
model profits more greatly from an increase of the 
learning corpus size. 
6 Conclusion 
In this paper we have presented dependency mod- 
els for Japanese based on the attribute of bunsetsu. 
They are the first fully stochastic dependency mod- 
els for Japanese which describes from character se- 
quence to syntactic tree. Next we have proposed 
a word clustering method, an extension of deleted 
interpolation technique, which has been proven to 
be efficient in terms of improvement of the pre- 
dictive power. Finally we have discussed parsers 
based on our model which demonstrated a remark- 
able improvement in parsing accuracy by our word- 
clustering method. 

References 
Peter F. Brown, Vincent J. Della Pietra, Peter V. 
deSouza, Jennifer C. Lal, and Robert L. Mercer. 
1992. Class-based n-gram models of natural lan- 
guage. Computational Linguistics, 18(4):467-479. 
Eugene Charniak. 1997. Statistical parsing with a 
context-free grammar and word statistics. In Pro- 
ceedings of the l~th National Conference on Arti- 
ficial Intelligence, pages 598-603. 
Michael Collins. 1997. Three generative, lexicalised 
models for statistical parsing. In Proceedings of 
the 35th Annual Meeting of the Association for 
Computational Linguistics, pages 16-23. 
King Sun Fu. 1974. Syntactic Methods in Pattern 
Recognition, volume 12 of Mathematics in Science 
and Engineering. Accademic Press. 
903 
T. Fujisaki, F. Jelinek, J. Cocke, E. Black, and 
T. Nishino. 1989. A probabilistic parsing method 
for sentence disambiguation. In Proceedings of the 
International Parsing Workshop. 
Wide R. ttogenhout and Yuji Matsumoto. 1997. A 
preliminary study of word clustering based on syn- 
tactic behavior. In Proceedings of the Computa- 
tional Natural Language Learning, pages 16-24. 
John E. ttopcroft and Jeffrey D. UUman. 1979. In- 
troduction to Automata Theory, Languages and 
Computation. Addison-~,Vesley Publishing. 
Japan Electronic Dictionary Research Institute, 
Ltd., 1993. EDR Electronic Dictionary Technical 
Guide. 
Fredelick Jelinek, Robert L. Mercer, and Salim 
Roukos. 1991. Principles of lexical language 
modeling for speech recognition. In Advances in 
Speech Signal Processing, chapter 21, pages 651- 
699. Dekker. 
F. Jelinek, J. Lafferty, D. Magerman, R. Mercer, 
A. Rantnaparkhi, and S. Roukos. 1994. Decision 
tree parsing using a hidden derivation model. In 
Proceedings of the ARPA Workshop on Human 
Language Technology, pages 256-261. 
ttiroshi Maruyama and Shiho Ogino. 1992. A statis- 
tical property of japanese phrase-to-phrase modifi- 
cations. Mathematical Linguistics, 18(7):348-352. 
Shinsuke Mort and Osamu Yamaji. 1997. An 
estimate of an upper bound for the entropy 
of japanese. Transactions of Information Pro- 
cessing Society of Japan, 38(11):2191-2199. (In 
Japanese). 
Shinsuke Mort, Masafumi Nishimura, and Nobuyuki 
Ito. 1997. l, Vord clustering for class-based lan- 
guage models. Transactions of Information Pro- 
cessing Society of Japan, 38(11):2200-2208. (In 
Japanese). 
C. E. Shannon. 1951. Prediction and entropy of 
printed english. Bell System Technical Journal, 
30:50-64. 
