Automatic Acquisition of Language Model 
based on Head-Dependent Relation between Words 
Seungmi Lee and Key-Sun Choi 
Department of Computer Science 
Center for Artificial Intelligence Research 
Korea Advanced Institute of Science and Technology 
e-mail: {leesm, kschoi}@world, kaist, ac. kr 
Abstract 
Language modeling is to associate a sequence 
of words with a priori probability, which is a 
key part of many natural language applications 
such as speech recognition and statistical ma- 
chine translation. In this paper, we present a 
language modeling based on a kind of simple 
dependency grammar. The grammar consists 
of head-dependent relations between words and 
can be learned automatically from a raw corpus 
using the reestimation algorithm which is also 
introduced in this paper. Our experiments show 
that the proposed model performs better than 
n-gram models at 11% to 11.5~ reductions in 
test corpus entropy. 
1 Introduction 
Language modeling is to associate a priori prob- 
ability to a sentence. It is a key part of many 
natural language applications such as speech 
recognition and statistical machine translation. 
Previous works for language modeling can be 
broadly divided into two approaches; one is n- 
gram-based and the other is grammar-based. 
N-gram model estimates the probability of a 
sentence as the product of the probability of 
each word in the sentence. It assumes that 
probability of the nth word is dependent on 
the previous n- 1 words. The n-gram prob- 
abilities are estimated by simply counting the 
n-gram frequencies in a training corpus. In 
some cases, class (or part of speech) n-grams 
are used instead of word n-grams(Brown et al., 
1992; Chang and Chen, 1996). N-gram model 
has been widely used so far, but it has always 
been clear that n-gram can not represent long 
distance dependencies. 
In contrast with n-gram model, grammar- 
based approach assigns syntactic structures to 
a sentence and computes the probability of the 
sentence using the probabilities of the struc- 
tures. Long distance dependencies can be rep- 
resented well by means of the structures. The 
approach usually makes use of phrase struc- 
ture grammars such as probabilistic context-free 
grammar and recursive transition network(Lari 
and Young, 1991; Sneff, 1992; Chen, 1996). In 
the approach, however, a sentence which is not 
accepted by the grammar is assigned zero prob- 
ability. Thus, the grammar must have broad- 
coverage so that any sentence will get non-zero 
probability. But acquisition of such a robust 
grammar has been known to be very difficult. 
Due to the difficulty, some works try to use an 
integrated model of grammar and n-gram com- 
pensating each other(McCandless, 1994; Meteer 
and Rohlicek, 1993). Given a robust grammar, 
grammar-based language modeling is expected 
to be more powerful and compact in model size 
than n-gram-based one. 
In this paper we present a language modeling 
based on a kind of simple dependency gram- 
mar. The grammar consists of head-dependent 
relations between words and can be learned au- 
tomatically from a raw corpus using the rees- 
timation algorithm which is also introduced in 
this paper. Based on the dependencies, a sen- 
tence is analyzed and assigned syntactic struc- 
tures by which long distance dependences are 
represented. Because the model can be thought 
of as a linguistic bi-gram model, the smoothing 
functions of n-gram models can be applied to it. 
Thus, the model can be robust, adapt easily to 
new domains, and be effective. 
The paper is organized as follows. We intro- 
duce some definitions and notations for the de- 
pendency grammar and the reestimation algo- 
rithm in section 2, and explain the algorithm in 
section 3. In section 4, we show the experimen- 
tal results for the suggested model compared to 
n-gram models. Finally, section 5 concludes this 
paper. 
2 A Simple Dependency Grammar 
In this paper, we assume a kind of simple de- 
pendency grammar which describes a language 
723 
by a set of head-dependent relations between 
words. A sentence is analyzed by establishing 
dependency links between individual words in 
the sentence. A dependency analysis, :D, of a 
sentence can be represented with arrows point- 
ing from head to dependent as depicted in Fig- 
ure 1. For structural generality, we assume that 
there is always a marking tag, "EOS"(End of 
Sentence), at the end of a sentence and it has 
the head word of the sentence as its own depen- 
dent("gave" in Figure 1). 
I gave him a book EOS 
Figure 1: An example dependency analysis 
A/) is a set of inter-word dependencies which 
satisfy the following conditions: (1) every word 
in the sentence has its head in the sentence ex- 
cept the head word of the sentence. (2) every 
word can have only one head. (3) there is nei- 
ther crossing nor cycle of dependencies. 
The probabilistic model of the simple depen- 
dency grammar is given by 
p(sentence) = ~-'~ p(D) 
2) 
= }2 II 2) x.-.+y6D 
where p(x--+ y) = p(yl x) 
freq(x --+ y) 
E, z)" 
Complete-Link and Complete-Sequence 
Here, we define complete-link and complete- 
sequence which represent partial :Ds for sub- 
strings. They are used to construct overall 
79s and used as the basic structures for the rees- 
timation algorithm in section 3. 
A set of dependency relations on a word se- 
quence, wij l, is a complete-link when the fol- 
lowing conditions are satisfied: 
• there is (wi -+ wi) or (wi e-- wj) exclu- 
sively. 
• Every inner word has a head in the word 
sequence. 
• Neither crossing nor cycle of dependency 
relations is allowed. 
tWe use wi for ith word in a sentence and wi,j for the 
word sequence from wl to wj(i < j). 
k her second child the bus 
Figure 2: Example complete-links 
A complete-link has direction. A complete-link 
on wij is said to be "rightward" if the outermost 
relation is (wi --+ wj), and "leftward" if the rela- 
tion is (wi e-- wj). Unit complete-link is defined 
on a string of two adjacent words, wi,;+l. In 
Figure 2, (a) is a rightward complete-link, and 
both of (b) and (c) are leftward ones. 
bird in the cage the bus book 
Figure 3: Example complete-sequences 
A complete-sequence is a sequence of 0 or 
more adjacent complete-links that have the 
same direction. A unit complete-sequence is de- 
fined on a string of one word. It is 0 sequence 
of complete-links. The direction of a complete- 
sequence is determined by the direction of the 
component complete-links. In Figure 3, (a) is a 
rightward complete-sequence composed of two 
complete-links, and (b) is a leftward one. (c) is a 
complete-sequence composed of zero complete- 
links, and it can be both leftward and rightward. 
The word of "complete" means that the de- 
pendency relations on the inner words are com- 
pleted and that consequently there is no need 
to process further on them. From now on, 
we use Lr(i,j)/Lt(i,j) for rightward/leftward 
complete-links and Sr(i,j)/St(i,j) for right- 
ward/leftward complete-sequences on wi, j. 
Any complete-link on wi, j can be viewed as 
the following combination. 
• L~(i,j): {(wi --+ wj), S~(i,m), St(m+l,j)} 
• Ll(i,j): {(wi e-- wj), St(i, m), St(m+l,j)} 
foram(i<m<j). 
Otherwise, the set of dependencies does not sat- 
isfy the conditions of no crossing, no cycle and 
no multiple heads and is not a complete-link any 
more. 
Similarly, any complete-sequence on wi,j can 
be viewed as the following combination. 
• S~(i,j): {Sr(i,m), L~(m,j)} 
• St(i,j): {Lt(i,m), St(m,j)} 
foram(i<m<j). 
In the case of complete-sequence, we can 
prevent multiple constructions of the same 
724 
complete-sequence by the above combinational 
restriction. 
Figure 4: Abstract representation of/) 
Figure 4 shows an abstract representation of 
a/) of an n-word sentence. When wk(1 < k <_ 
n) is the head of the sentence, any D of the 
sentence can be represented by a St(l, EOS) 
uniquely by the assumption that there is always 
the dependency relation, (wk +-- wEos). 
3 Reestimation Algorithm 
The reestimation algorithm is a variation of 
Inside-Outside algorithm(Jelinek et al., 1990) 
adapted to dependency grammar. In this sec- 
tion we first define the inside-outside probabili- 
ties of complete-links and complete-sequences, 
and then describe the reestimation algorithm 
based on them 2. 
In the followings, ~ indicates inside probabil- 
ity and a, is for outside probability. The su- 
perscripts, l and s, are used for "complete-link" 
and "complete-sequence" respectively. The sub- 
scripts indicate direction: r for "rightward" and 
I for "leftward". 
The inside probabilities of complete-links 
(n~(i,j), Lt(i,j)) and complete-sequences 
(Sr(i,j), Sl(i,j)) are as follows. 
j-1 
/3t~(i,j) = ~ p(wi --+ wj)/3~(i, m)t3~(m + 1,j). 
rn=i 
j--I 
/3\[(i,j) = E p(wi 6.- wj)t3~(i,m)13?(m + 1,j). 
rn=i 
j--1 
fl~(i,j) = ~ /3~(i,m)~t~(m,j). 
mini 
J 
/3?(i,j) = ~ /3\[(i,m)t3?(m,j). 
m=i+l 
The basis probabilities are: 
/31r(i,i + 1) = p(wi "~ wi+l) 
/3\[(i,i + 1) = p(wi (-" wi+l) 
/3~(i, i) = fl?(i, i) = 1 
/37(1, EO S) = p( wL, ) 
~A little more detailed explanation of the expressions 
can be found in (Lee and Choi, 1997). 
/3~(i,i+ 1) = p(L~(i,i+ 1)) = p(wi ~ wi+t) 
/37 (i, i + 1) = p(Lt(i, i + 1)) = p(wi +-- wi+t). 
/37(1, EOS) is the sentence probability be- 
cause every dependency analysis, D, is repre- 
sented by a St(l, EOS) and/37(1 , EOS) is sum 
of the probability of every St(l, EOS). 
probabilities for complete- 
(i, j)) and complete-sequences 
are as follows. 
The outside 
links (L,.(i,j), Lt 
(S~(i,j), St(i,j)) 
i at~(i,j) = 
n 
c~ (v, j)/3i~(v, i). 
a~ (i, h)/3?(j, h). 
h=j 
a~(i,j) = ~ a~(i,h)/3tr(j,h) 
h=j+l 
+atr(i , h)/3i~(j + 1, h)p(wi -+ Wh) 
+al(i, h)/3?(j + 1, h)p(wi ~ wh). 
i-I 
a~(i,j) = ~ a~(v,j)fl~(v,i) 
v----I 
+dr(v,j)Z;(v, i - t)p(wv wA 
+al(v,j)t3;(v , i- 1)p(wv e- wj). 
The basis probability is 
~(1, EOS) = 1. 
Given a training corpus, the initial grammar 
is just a list of all pairs of unique words in 
the corpus. The initial pairs represent the ten- 
tative head-dependent relations of the words. 
And the initial probabilities of the pairs can 
be given randomly. The training starts with 
the initial grammar. The train corpus is an- 
alyzed with the grammar and the occurrence 
frequency of each dependency relation is cal- 
culated. Based on the frequencies, probabili- 
ties of dependency relations are recalculated by 
C(wp --+ w~) The process w,) = C(w  
continues until the entropy of the training cor- 
pus becomes the minimum. The frequency of 
occurrence, C(wi --+ wj), is calculated by 
w) = -+ 
1 t • • t = p(wt,.)a.(,,3)/3~(i,j) 
where O~(wi ~ wj, D, wl,n) is 1 if the depen- 
dency relation, (wi --+ wj), is used in the D, 
725 
and 0 otherwise. Similarly, the occurrence fre- 
quency of the dependency relation, (wi +- wj), 
is computed by ~----L---o~l(i,j)~\[(i,j ). 
4 Preliminary experiments 
We have experimented with three language 
models, tri-gram model (TRI), bi-gram model 
(BI), and the proposed model (DEP) on a raw 
corpus extracted from KAIST corpus 3. The raw 
corpus consists of 1,589 sentences with 13,139 
words, describing animal life in nature. We 
randomly divided the corpus into two parts: a 
training set of 1,445 sentences and a test set of 
144 sentences. And we made 15 partial training 
sets which include the first s sentences in the 
whole training set, for s ranging from 100 to 
1,445 sentences. We trained the three language 
models for each partial training set, and tested 
the training and the test corpus entropies. 
TRI and BI was trained by counting the oc- 
currence of tri-grams and bi-grams respectively. 
DEP was trained by running the reestimation 
algorithm iteratively until it converges to an op- 
timal dependency grammar. On the average, 26 
iterations were done for the training sets. 
Smoothing is needed for language modeling 
due to the sparse data problem. It is to com- 
pensate for the overestimated and the under- 
estimated probabilities. Smoothing method it- 
self is an important factor. But our goal is not 
to find out a better smoothing method. So we 
fixed on an interpolation method and applied it 
for the three models. It can be represented as 
(McCandless, 1994) 
..., w,-x) = ,\P,(wilw,-,+l, ..., wi_l) 
+(1 - ..., 
where = C(wl, ..., w,-1) 
C(w,, ..., + K," 
The Ks is the global smoothing factor. The big- 
ger the Ks, the larger the degree of smoothing. 
For the experiments we used 2 for Ks. 
We take the performance of a language model 
to be its cross-entropy on test corpus, 
1 s 
IVl E-l°g2Pm(Si) i=1 
3KAIST (Korean Advanced Institute of Science and 
Technology) corpus has been under construction since 
1994. It consists of raw text collection(45,000,000 
words), POS-tagged collection(6,750,000 words), and 
tree-tagged collection(30,000 sentences) at present. 
where the test corpus contains a total of IV\] 
words and is composed of S sentences. 
3.4 i | | i | ! I 
3.23 
2.8 
>" 2.6 O. 
2.4 
u~ 2.2 ~ (DEP model) o 
2 a (TRI model) i 
1.8 
1.6 
1.4 
0 200 400 600 800 1000 1200 1400 600 
No. of training sentences 
Figure 5: Training corpus entropies 
Figure 5 shows the training corpus entropies 
of the three models. It is not surprising that 
DEP performs better than BI. DEP can be 
thought of as a kind of linguistic bi-gram model 
in which long distance dependencies can be rep- 
resented through the head-dependent relations 
between words. TRI shows better performance 
than both BI and DEP. We think it is because 
TRI overfits the training corpus, judging from 
the experimental results for the test corpus. 
9.5 i I I I I I I 
8.5 
uJ 7.5 
.=( (TRI model) 
7 / (DEP model) o 
6.5 a i I I I I I 
0 200 400 600 800 1000 1200 1400 1600 
No. of training sentences 
Figure 6: Test corpus entropies 
For the test corpus, BI shows slightly bet- 
ter performance than TRI as depicted in Fig- 
ure 6. Increase in the order of n-gram from 
two to three shows no gains in entropy reduc- 
tion. DEP, however, Shows still better per- 
formance than the n-gram models. It shows 
about 11.5% entropy reduction to BI and about 
11% entropy reduction to TRI. Figure 7 shows 
the entropies for the mixed corpus of training 
and test sets. From the results, we can see 
that head-dependent relations between words 
are more useful information than the naive n- 
gram sequences, for language modeling. We can 
see also that the reestimation algorithm can find 
out properly the hidden head-dependent rela- 
tions between words, from a raw corpus. 
726 
,r, 
f- 
uJ 
(n 
o 
Z 
10 
9 
8 
7 
6 
i i | i ! i i 
(BI model) (TRI model) 
(DEP model) 
5 
3 
0 200 400 600 800 1000 1200 1400 
No. of training sentences 
Figure 7: Mixed corpus entropies 
60000 
50000 
40000 
30000 
20000 
10000 
0 
600 
i ! | i i i ! (DEP model) o 
(TRI model) "*'-- 
rT I I I I I I 
200 400 600 800 1000 1200 1400 1600 No. of training sentences 
Figure 8: Model size 
Related to the size of model, however, DEP 
has much more parameters than TRI and BI 
as depicted in Figure 8. This can be a serious 
problem when we create a language model from 
a large body of text. In the experiments, how- 
ever, DEP used the grammar acquired automat- 
ically as it is. In the grammar, many inter-word 
dependencies have probabilities near 0. If we 
exclude such dependencies as was experimented 
for n-grams by Seymore and Rosenfeld (1996), 
we may get much more compact DEP model 
with very slight increase in entropy. 
5 Conclusions 
In this paper, we presented a language model 
based on a kind of simple dependency gram- 
mar. The grammar consists of head-dependent 
relations between words and can be learned au- 
tomatically from a raw corpus by the reestima- 
tion algorithm which is also introduced in this 
paper. By the preliminary experiments, it was 
shown that the proposed language model per- 
forms better than n-gram models in test cor- 
pus entropy. This means that the reestimation 
algorithm can find out the hidden information 
of head-dependent relation between words in a 
raw corpus, and the information is more useful 
than the naive word sequences of n-gram, for 
language modeling. 
We are planning to experiment the perfor- 
mance of the proposed language model for large 
corpus, for various domains, and with various 
smoothing methods. For the size of the model, 
we are planning to test the effects of excluding 
the dependency relations with near zero proba- 
bilities. 

References 
P. F. Brown, V. J. Della Pietra, P. V. deSouza, 
J. C. Lai, and R. L. Mercer. 1992. "Class- 
Based n-gram Models of Natural Language". 
Computational Linguistics, 18(4):467-480. 
C. Chang and C. Chen. 1996. "Application Is- 
sues of SA-class Bigram Language Models". 
Computer Processing of Oriental Languages, io(1):i-i5. 
S. F. Chen. 1996. "Building Probabilistic 
Models for Natural Language". Ph.D. the- 
sis, Havard University, Cambridge, Mas- 
sachusetts. 
F. Jelinek, J. D. Lafferty, and R. L. Mercer. 
1990. "Basic Methods of Probabilistic Con- 
text Free Grammars". Technical report, IBM 
- T.J. Watson Research Center. 
K. Lari and S. J. Young. 1991. "Applications 
of stochastic context-free grammars using the 
inside-outside algorithm". Computer Speech 
and Language, 5:237-257. 
S. Lee and K. Choi. 1997. "Reestimation and 
Best-First Parsing Algorithm for Probabilis- 
tic Dependency Grammar". In WVLC-5, 
pages 11-21. 
M. K. McCandless. 1994. "Automatic Acquisi- 
tion of Language Models for Speech Recog- 
nition". Master's thesis, Massachusetts Insti- 
tute of Technology. 
M. Meteer and J.R. Rohlicek. 1993. "Statis- 
tical Language Modeling Combining N-gram 
and Context-free Grammars". In ICASSP- 
93, volume II, pages 37-40, January. 
K. Seymore and R. Rosenfeld. 1996. "Scalable 
Trigram Backoff Language Models". Techni- 
cal Report CMU-CS-96-139, Carnegie Mellon 
University. 
S. Sneff. 1992. "TINA: A natural language sys- 
tem for spoken language applications". Com- 
putational Linguistics, 18(1):61-86. 
