Reestimation and Best-First Parsing Algorithm 
for Probabilistic Dependency Grammars 
Seungmi Lee and Key-Sun Choi 
Dept. of Computer Science 
Korea Advanced Institute of Science and Technology 
373-1, Kusung-Dong, YuSung-Gu, Taejon, 305-701, KOREA 
e-mail: {leesm, kschoi}@world.kaist.ac.kr
Abstract 
This paper presents a reestimation algorithm and a best-first parsing (BFP) algorithm
for probabilistic dependency grammars (PDG). The proposed reestimation algorithm is
a variation of the inside-outside algorithm adapted to probabilistic dependency grammars.
The inside-outside algorithm is a probabilistic parameter reestimation algorithm
for phrase structure grammars in Chomsky Normal Form (CNF). Dependency grammar
represents a sentence structure as a set of dependency links between any two words
in the sentence, and can not be reestimated by the inside-outside algorithm directly.
In this paper, non-constituent objects, complete-link and complete-sequence, are
defined as basic units of dependency structure, and their probabilities are reestimated.
The reestimation and BFP algorithms utilize a CYK-style chart and the non-constituent
objects as chart entries. Both algorithms have O(n³) time complexity.
1 Introduction 
There have been many efforts to induce grammars automatically from corpora by utilizing
the vast amount of corpora with various degrees of annotation. Corpus-based, stochastic
grammar induction has many profitable advantages such as simple acquisition and extension
of linguistic knowledge, easy treatment of ambiguities by virtue of its innate scoring
mechanism, and fail-soft reaction to ill-formed or extra-grammatical sentences.
Most corpus-based grammar induction has concentrated on phrase structure grammars
(Black, Lafferty, and Roukos, 1992, Lari and Young, 1990, Magerman, 1994). The
typical works on phrase structure grammar induction proceed as follows (Lari and Young, 1990,
Carroll, 1992b): (1) generating all the possible rules, (2) reestimating the probabilities
of the rules using the inside-outside algorithm, and (3) finally finding a stable grammar by
eliminating the rules whose probability values are close to 0. Generating all the rules
is done by restricting the number of nonterminals and/or the number of right-hand-side
symbols in the rules and enumerating all the possible combinations. Chen extracts
rules by some heuristics and reestimates the probabilities of the rules using the inside-outside
algorithm (Chen, 1995). The inside-outside algorithm learns a grammar by iteratively adjusting
the rule probabilities to minimize the training corpus entropy. It is extensively used
as a reestimation algorithm for phrase structure grammars.
Most of the works on phrase structure grammar induction, however, have only partially
succeeded. Estimating phrase structure grammars by minimizing the training corpus entropy
does not lead to the desired grammars which are consistent with human intuitions
(de Marcken, 1995). To increase the correctness of the learned grammar, de Marcken proposed
to include lexical information in the phrase structure grammar. A recent trend in
parsing is also to include lexical information to increase correctness (Magerman, 1994,
Collins, 1996). This means that the lack of lexical information in phrase structure grammar
is a major weak point for syntactic disambiguation. Besides the lack of lexical information,
the induction of phrase structure grammar may suffer from structural data
sparseness with a medium-sized training corpus. Structural data sparseness means the
lack of information on the grammar rules. An approach to increase the correctness of
grammar induction is to learn a grammar from a tree-tagged corpus or bracketed corpus
(Pereira and Schabes, 1992, Black, Lafferty, and Roukos, 1992). But the construction of
a vast-sized tree corpus or bracketed corpus is very labour-intensive, and manual construction
of such a corpus may produce serious inconsistencies. And the structural-data sparseness
problem still remains.
The problems of structural-data sparseness and lack of lexical information can be lessened
with PDG. Dependency grammar defines a language as a set of dependency relations
between any two words. The basic units of sentence structure in DG, the dependency relations,
are much simpler than the rules in phrase structure grammar. So, the search space
of dependency grammar may be smaller and the grammar induction may be less affected
by structural-data sparseness. Dependency grammar induction has been studied by
Carroll (Carroll, 1992b, Carroll, 1992a). In those works, however, the dependency grammar
was rather a restricted form of phrase structure grammar. Accordingly, they extensively
used the inside-outside algorithm to reestimate the grammar and have the same problem
of structural-data sparseness.
In this paper, we propose a reestimation algorithm and a best-first parsing algorithm
for PDG. The reestimation algorithm is a variation of the inside-outside algorithm adapted
to PDG. The inside-outside algorithm is a probabilistic parameter reestimation algorithm
for phrase structure grammars in CNF and thus can not be directly used for reestimation
of probabilistic dependency grammars. We define non-constituent objects, complete-link
and complete-sequence, as basic units of dependency structure. Both the reestimation
algorithm and the best-first parsing algorithm utilize a CYK-style chart and the non-constituent
objects as chart entries. Both algorithms have O(n³) time complexity.
The rest of the paper is organized as follows. Section 2 defines the basic units and describes
the best-first parsing algorithm. Section 3 describes the reestimation algorithm. Section
4 shows the experimental results of the reestimation algorithm on Korean, and finally section 5
concludes this paper.
2 PDG Best-First Parsing Algorithm
Dependency grammar describes a language with a set of head-dependent relations between
any two words in the language. Head-dependent relations represent specific relations such
as modifiee-modifier, predicate-argument, etc. In general, a functional role is assigned to
a dependency link and specifies the syntactic/semantic relation between the head and the
dependent. However, in this paper, we use the minimal definition of dependency grammar,
with head-dependent relations only. In the future we will extend our dependency grammar
into one with functions on the dependency links.
42 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
A dependency tree of an n-word sentence is always composed of n-1 dependency links.
Every word in the sentence must have its head, except the word which is the head of the
sentence. In a dependency tree, crossing links are not allowed.
Salesperson sold the dog biscuits

Figure 1: Dependency tree: link representations

Figure 1 shows a dependency tree as a hierarchical representation and as a link representation
respectively. In both, the word "sold" is the head of the sentence.
Here, we define the non-constituent objects, complete-link and complete-sequence, which
are used in the PDG reestimation and BFP algorithms. A set of dependency links constructed
over a word sequence wi,j is defined as a complete-link if the set satisfies the following conditions:

• The set has exclusively either (wi → wj) or (wi ← wj).
• There is neither link-crossing nor link-cycle.
• The set is composed of j - i dependency links.
• Every inner word of wi,j must have its head within wi,j, and thus a link from its head.
A complete-link has directionality, determined by the direction of the outermost dependency
relation. If the complete-link has (wi → wj), it is rightward, and if the complete-link
has (wi ← wj), then it is leftward. A basic complete-link is a dependency link between
two adjacent words.
A complete-sequence is defined as a sequence of zero or more adjacent complete-links of the
same direction. A basic complete-sequence is the null sequence of complete-links, which is defined
on one word, the smallest word sequence. The direction of a complete-sequence is determined
by the direction of its component complete-links. If the complete-sequence is composed of
leftward complete-links, the complete-sequence is leftward, and vice versa.
Figure 2 shows an abstract rightward complete-link for wi,j, a rightward complete-sequence
for wi,m, and a leftward complete-sequence for wm+1,j. A double-slashed line means a complete-sequence.
Whatever the direction is, a complete-link for wi,j is always constructed with
a dependency link between wi and wj, a rightward complete-sequence from i to m, and
a leftward complete-sequence from j down to m+1, for an m between i and j-1. A rightward
complete-sequence is always composed of a combination of a rightward complete-sequence
and a rightward complete-link. On the contrary, a leftward complete-sequence is always
composed of a combination of a leftward complete-link and a leftward complete-sequence.
These restrictions on the composition of complete-sequences are for ease of description of the
algorithms. The basic complete-link and complete-sequence are also shown in Figure 2.
The following notation is used to represent the four kinds of objects for a word sequence wi,j,
for an m from i to j-1:

• Lr(i,j): rightward complete-link for wi,j, i.e. {(wi → wj), Sr(i,m), Sl(m+1,j)}
• Ll(i,j): leftward complete-link for wi,j, i.e. {(wi ← wj), Sr(i,m), Sl(m+1,j)}
• Sr(i,j): rightward complete-sequence for wi,j, i.e. {Sr(i,m), Lr(m,j)}
• Sl(i,j): leftward complete-sequence for wi,j, i.e. {Ll(i,m), Sl(m,j)}

Figure 2: Abstract complete-link and complete-sequence
To generalize the structure of a dependency tree, we assume that there are marking
tags, BOS (Begin Of Sentence) before w1 and EOS (End Of Sentence) after wn, and that
there are always the dependency links (wBOS → wEOS) and (wk ← wEOS) when wk is
the head word of the sentence. Then, by definition, any dependency tree of a sentence
w1,n can be uniquely represented by either an Lr(BOS, EOS) or an Sl(1, EOS) as depicted
in Figure 3. This is because Lr(BOS, EOS) for any sentence is always composed of the null
Sr(BOS, BOS) and Sl(1, EOS). The head of a dependency tree, wk, can be found in the
rightmost Ll(k, EOS) of Sl(1, EOS).
Figure 3: Abstract dependency tree of a sentence (BOS w1 ... wk(head) ... wn EOS, represented as Lr(BOS, EOS))
The probability of each object is defined as follows:

p(Lr(i,j)) = p(wi → wj) · p(Sr(i,m)) · p(Sl(m+1,j))
p(Ll(i,j)) = p(wi ← wj) · p(Sr(i,m)) · p(Sl(m+1,j))
p(Sr(i,j)) = p(Sr(i,m)) · p(Lr(m,j))
p(Sl(i,j)) = p(Ll(i,m)) · p(Sl(m,j))
The m varies from i to j-1 for Ll, Lr, and Sr, and from i+1 to j for Sl. The best Ll
and the best Lr always share the same m, because both are composed of the same
sub-Sr and sub-Sl with maximum probabilities. The basis probabilities are as follows:

p(Lr(i,i+1)) = p(wi → wi+1)
p(Ll(i,i+1)) = p(wi ← wi+1)
p(Sr(i,i+1)) = p(Lr(i,i+1))
p(Sl(i,i+1)) = p(Ll(i,i+1))
p(Sl(i,i)) = p(Sr(i,i)) = 1

Thus, the probability of a dependency tree is given either by p(Lr(BOS, EOS)) or by
p(Sl(1, EOS)).
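For example, unfolding the recurrences for a chain of two rightward links gives

p(Sr(1,3)) = p(Sr(1,2)) · p(Lr(2,3)) = p(Lr(1,2)) · p(Lr(2,3)) = p(w1 → w2) · p(w2 → w3)

so the probability of a dependency tree reduces, as expected, to the product of the probabilities of its n-1 dependency links.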
Figure 4: Best-first parsing entries
The PDG best-first parsing algorithm constructs the best dependency tree in a bottom-up
manner, with a dynamic programming method using a CYK-style chart. It is based on
the complete-link and complete-sequence non-constituent objects. The parsing algorithm
constructs the complete-links and complete-sequences for substrings, and incrementally
merges complete-links into larger complete-sequences and complete-sequences into larger
complete-links until the Lr(BOS, EOS) with maximum probability is constructed.
Eisner (Eisner, 1996) proposed an O(n³) parsing algorithm for PDG. In that work, the
basic unit of chart entry is the span, which is also a non-constituent object. But the span
differs slightly from our complete-sequence and complete-link. When two adjacent spans are
merged into a larger span, some conditional tests must be satisfied. In our work, best-first
parsing is done by inscribing the four entries with maximum probabilities, Ll(i,j), Lr(i,j),
Sl(i,j), and Sr(i,j), in each chart position in a bottom-up/left-to-right manner without any
extra condition checking.
Figure 4 depicts the possible combinations of chart entries into a larger Lr, Ll, Sl,
and Sr each. The sub-entries under the white-headed arrow and the sub-entries under the
black-headed arrow are merged into a larger entry. The larger entry is inscribed in the
bold box.
There is an exception for the chart entries of the (n+1)th column. In the (n+1)th column, only
the Ll(k, EOS) whose sub-Sl is null can be inscribed. This is because there can be only
one head word for a tree structure. If an Ll(k, EOS) whose sub-Sl is not null were inscribed into
the chart, the overall tree structure would have two or more heads.
The best parse is the maximum Lr(BOS, EOS) in chart position (0, n+1). The best
parse can also be found from the maximum Sl(1, EOS), because the Lr(BOS, EOS) is always
composed of Sr(BOS, BOS) and Sl(1, EOS).
The chart size is (n² + 4n + 3)/2 for an n-word sentence. For each of the four items
(Lr, Ll, Sr, and Sl) in each chart position, there can be maximally n searches. Thus, the time
complexity of the best-first parsing algorithm is O(n³).
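To make the chart computation concrete, the following is a minimal sketch of the best-first pass in Python. It is our illustration rather than the authors' implementation: p_right(i, j) and p_left(i, j) are hypothetical lookups for p(wi → wj) and p(wi ← wj) (with p_right(0, n+1) = 1 for the structural BOS → EOS link), words are indexed 1..n with BOS = 0 and EOS = n+1, and only the best score is returned; backpointers for tree recovery would be kept in the same loops.

import math
from collections import defaultdict

def log(p):
    # log-probability helper; impossible events get -infinity
    return math.log(p) if p > 0.0 else float("-inf")

def best_parse_score(n, p_right, p_left):
    NEG = float("-inf")
    # Best log-probability of each object over each span (i, j).
    Lr = defaultdict(lambda: NEG); Ll = defaultdict(lambda: NEG)
    Sr = defaultdict(lambda: NEG); Sl = defaultdict(lambda: NEG)
    EOS = n + 1
    for i in range(EOS + 1):
        Sr[i, i] = Sl[i, i] = 0.0            # null complete-sequences, p = 1
    for width in range(1, EOS + 1):          # bottom-up, by span width
        for i in range(EOS + 1 - width):
            j = i + width
            # Complete-links: outermost link + Sr(i,m) + Sl(m+1,j).
            for m in range(i, j):
                inner = Sr[i, m] + Sl[m + 1, j]
                if j < EOS or i == 0:        # no Lr(k, EOS) for k >= 1
                    Lr[i, j] = max(Lr[i, j], log(p_right(i, j)) + inner)
                if j < EOS or m == j - 1:    # Ll(k, EOS) needs a null sub-Sl
                    Ll[i, j] = max(Ll[i, j], log(p_left(i, j)) + inner)
            # Complete-sequences: Sr(i,m) + Lr(m,j) and Ll(i,m) + Sl(m,j).
            for m in range(i, j):
                Sr[i, j] = max(Sr[i, j], Sr[i, m] + Lr[m, j])
            for m in range(i + 1, j + 1):
                Sl[i, j] = max(Sl[i, j], Ll[i, m] + Sl[m, j])
    return Lr[0, EOS]                        # best log p of Lr(BOS, EOS)

The three nested loops over width, i, and m make the O(n³) behaviour explicit.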
3 PDG Reestimation Algorithm 
For reestimation of the dependency probabilities of PDG, eight kinds of chart entries are
defined based on three factors: inside/outside, complete-link/complete-sequence, and leftward/rightward.
In the following definitions, β denotes an inside probability and α an outside probability.
Superscripts indicate whether the object is a complete-link or a complete-sequence: l
for complete-link and s for complete-sequence. Subscripts of β and α give the directionality:
r for rightward and l for leftward.
Complete-link Inside Probabilities: β^l_r, β^l_l  The inside probability of a complete-link
is the probability that the word sequence wi,j will be generated when there is a dependency
relation between wi and wj:

β^l_r(i,j) = p(wi,j | Lr(i,j)) = Σ_{m=i..j-1} p(wi → wj) · β^s_r(i,m) · β^s_l(m+1,j)
β^l_l(i,j) = p(wi,j | Ll(i,j)) = Σ_{m=i..j-1} p(wi ← wj) · β^s_r(i,m) · β^s_l(m+1,j)
In Figure 5, β^l_r(i,j), the inside probability of Lr(i,j), is depicted.

Figure 5: Rightward complete-link inside probability

In the left part of the figure, the gray partitions indicate all the possible constructions
of Sr(i,m) and all the possible constructions of Sl(m+1,j) respectively. Double-slashed
links depict complete-sequences which compose the Lr together with the outermost
dependency (wi → wj). The right part of the figure represents the chart. The bold box
is the position where the β^l_r is to be inscribed. The inside probability of a complete-link
is the sum of the probabilities of all the possible constructions of the complete-link.
As explained in the previous section, a
complete-link over wi,j is composed of the dependency link between word i and word j
(either (wi → wj) or (wi ← wj)), Sr(i,m), and Sl(m+1,j) for an m from i to j-1. The inside
probability of Ll(i,j) can be computed in the same way as that of Lr(i,j), except for the
direction of the dependency link between wi and wj: the outermost dependency (wi → wj)
must be replaced with (wi ← wj). Lr and Ll are not defined on word sequences of length 1, wi.
The unit probabilities for β^l_r and β^l_l are as follows:

β^l_r(i,i+1) = p(wi → wi+1)
β^l_l(i,i+1) = p(wi ← wi+1)
β^l_r(BOS, EOS) = p(w1,n)

Any dependency tree of a sentence always has the dependency (wBOS → wEOS) as the
outermost dependency. So β^l_r(BOS, EOS) is the same as the sentence probability,
which is the sum of the probabilities of all the possible parses.
Complete-sequence Inside Probabilities: β^s_r, β^s_l  The inside probability of a
complete-sequence is the probability that the word sequence wi,j is generated when there
is Sr(i,j) or Sl(i,j):

β^s_r(i,j) = p(wi,j | Sr(i,j)) = Σ_{m=i..j-1} β^s_r(i,m) · β^l_r(m,j)
β^s_l(i,j) = p(wi,j | Sl(i,j)) = Σ_{m=i+1..j} β^l_l(i,m) · β^s_l(m,j)
In Figures 6 and 7, the double-slashed links denote complete-sequences, sequences of zero
or more adjacent complete-links of the same direction. A complete-sequence is composed of
a sub-complete-sequence and a sub-complete-link. Figure 6 depicts a rightward
complete-sequence for an m; the value of m varies from i to j-1. In Figure 7, Sl is composed
of a sub-Ll and a sub-Sl. The inside probability of a complete-sequence is the sum of the
probabilities of all of its possible constructions.

Figure 6: Rightward complete-sequence inside probability

Figure 7: Leftward complete-sequence inside probability

The basis for the inside probabilities of complete-sequences is as follows:

β^s_r(i,i) = β^s_l(i,i) = 1
β^s_r(i,i+1) = p(Lr(i,i+1)) = p(wi → wi+1)
β^s_l(i,i+1) = p(Ll(i,i+1)) = p(wi ← wi+1)
Because the (n+1)th word, wEOS, can not be a dependent of any other word, β^l_r(k, EOS) and
β^s_r(k, EOS) for k from 1 to n are not computed. And because there can be only one head
of a tree, wEOS must be the head of only one word. Thus, in the computation of β^l_l(x, EOS)
and β^s_l(x, EOS), only the Ll's whose sub-Sl is null are considered.
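As a concrete illustration, here is a sketch of the inside pass under the same indexing assumptions as the parsing sketch in section 2 (BOS = 0, EOS = n+1; hypothetical p_right/p_left lookups with p_right(0, n+1) = 1). It is identical in shape to the best-first pass, with sums in place of maximizations:

from collections import defaultdict

def inside(n, p_right, p_left):
    EOS = n + 1
    bLr = defaultdict(float); bLl = defaultdict(float)   # beta^l_r, beta^l_l
    bSr = defaultdict(float); bSl = defaultdict(float)   # beta^s_r, beta^s_l
    for i in range(EOS + 1):
        bSr[i, i] = bSl[i, i] = 1.0                      # basis: null sequences
    for width in range(1, EOS + 1):                      # bottom-up, left-to-right
        for i in range(EOS + 1 - width):
            j = i + width
            for m in range(i, j):
                inner = bSr[i, m] * bSl[m + 1, j]
                if j < EOS or i == 0:                    # beta^l_r(k, EOS), k >= 1, not computed
                    bLr[i, j] += p_right(i, j) * inner
                if j < EOS or m == j - 1:                # Ll(x, EOS): only a null sub-Sl
                    bLl[i, j] += p_left(i, j) * inner
            for m in range(i, j):
                bSr[i, j] += bSr[i, m] * bLr[m, j]
            for m in range(i + 1, j + 1):
                bSl[i, j] += bLl[i, m] * bSl[m, j]
    return bLr, bLl, bSr, bSl                            # bLr[0, EOS] = p(w1,n)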
Complete-link Outside Probabilities: α^l_r, α^l_l  This is the probability of producing the
words before i and after j of a sentence while the complete-link over (i,j) generates wi,j:

α^l_r(i,j) = p(w1,i-1, Lr(i,j), wj+1,n) = Σ_{v=1..i} α^s_r(v,j) · β^s_r(v,i)
α^l_l(i,j) = p(w1,i-1, Ll(i,j), wj+1,n) = Σ_{h=j..n} α^s_l(i,h) · β^s_l(j,h)
Figures 8 and 9 depict the cases where an Lr and an Ll become a substructure of a larger
Sr and Sl respectively.

Figure 8: Rightward complete-link outside probability

Figure 9: Leftward complete-link outside probability

In Figure 8, the outside probability of the Lr which is inscribed in the bold box is computed
by summing the products of the inside probabilities in the boxes under the white-headed
arrow and the outside probabilities in the boxes under the black-headed arrow. Likewise,
in Figure 9, the outside probability of the Ll in the bold box is computed by summing all
the products of the inside probabilities under the white-headed arrow and the outside
probabilities under the black-headed arrow. This is because, in parsing, the subentries
under the white-headed arrows and the Lr/Ll in the bold boxes are merged into larger
entries which are to be inscribed in the boxes under the black-headed arrows. The basis
probability for complete-link outside probabilities is as follows:

α^l_r(BOS, EOS) = 1

α^l_r(k, EOS) is always 0 for k = 1, ..., n, because wEOS can not be a dependent of
any other word.
Complete-sequence Outside Probabilities: α^s_r, α^s_l  This is the probability of producing
the word sequences w1,i-1 and wj+1,n while words i through j construct a complete-sequence:

α^s_r(i,j) = p(w1,i-1, Sr(i,j), wj+1,n)
          = Σ_{h=j+1..n} [ α^s_r(i,h) · β^l_r(j,h) + α^l_r(i,h) · β^s_l(j+1,h) · p(wi → wh) + α^l_l(i,h) · β^s_l(j+1,h) · p(wi ← wh) ]

In the above expression, the first term is for the construction of a larger Sr(i,h) from the
combination of Sr(i,j) and its adjacent Lr(j,h). The second term means the construction
of a larger Lr(i,h) from the combination of Sr(i,j), Sl(j+1,h), and the dependency link
from wi to wh. The third term is for the larger Ll(i,h) from the combination of Sr(i,j),
Sl(j+1,h), and the dependency link from wh to wi. The three terms in the expression are
depicted in Figure 10.

Figure 10: Rightward complete-sequence outside probability

α^s_l(i,j) = p(w1,i-1, Sl(i,j), wj+1,n)
          = Σ_{v=1..i-1} [ α^s_l(v,j) · β^l_l(v,i) + α^l_r(v,j) · β^s_r(v,i-1) · p(wv → wj) + α^l_l(v,j) · β^s_r(v,i-1) · p(wv ← wj) ]

α^s_l is the sum of all the probabilities that the Sl is to become a subentry of a larger
entry: an Sl, an Lr, or an Ll. The first term in the above expression is for the combination
of an Sl(v,j) from Ll(v,i) and Sl(i,j). The second is for the construction of an Lr(v,j) from
Sr(v,i-1), Sl(i,j), and the dependency link from wv to wj. The third term is for the
construction of an Ll(v,j) from Sr(v,i-1), Sl(i,j), and the dependency relation from wj to wv.
The three cases are depicted in Figure 11. The basis probabilities for complete-sequence
outside probabilities are as follows:

α^s_r(BOS, EOS) = α^s_l(BOS, EOS) = α^s_l(1, EOS) = 1
The reestimation algorithm computes the inside probabilities (β^l_r, β^l_l, β^s_r, and β^s_l),
inscribing them into the chart bottom-up and left-to-right. The outside probabilities (α^l_r,
α^l_l, α^s_r, and α^s_l) are computed and inscribed into the chart top-down and right-to-left.
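A corresponding sketch of the outside pass, again our illustration under the same assumptions (it consumes the tables produced by the inside() sketch above). Spans are visited from widest to narrowest; within a span, the complete-sequence entries are computed before the complete-link entries, because α^l_r(i,j) and α^l_l(i,j) refer to α^s values of the same span. Note that, as our choice, the sums run through EOS so the boundary entries are covered without separately-stated bases:

from collections import defaultdict

def outside(n, p_right, p_left, bLr, bLl, bSr, bSl):
    EOS = n + 1
    aLr = defaultdict(float); aLl = defaultdict(float)   # alpha^l_r, alpha^l_l
    aSr = defaultdict(float); aSl = defaultdict(float)   # alpha^s_r, alpha^s_l
    aLr[0, EOS] = 1.0                                    # basis: the whole tree Lr(BOS, EOS)
    for width in range(EOS, 0, -1):                      # top-down, widest spans first
        for i in range(EOS + 1 - width):
            j = i + width
            for h in range(j + 1, EOS + 1):              # Sr(i,j) inside Sr(i,h), Lr(i,h), Ll(i,h)
                aSr[i, j] += (aSr[i, h] * bLr[j, h]
                              + aLr[i, h] * bSl[j + 1, h] * p_right(i, h)
                              + aLl[i, h] * bSl[j + 1, h] * p_left(i, h))
            for v in range(i):                           # Sl(i,j) inside Sl(v,j), Lr(v,j), Ll(v,j)
                aSl[i, j] += (aSl[v, j] * bLl[v, i]
                              + aLr[v, j] * bSr[v, i - 1] * p_right(v, j)
                              + aLl[v, j] * bSr[v, i - 1] * p_left(v, j))
            if (i, j) != (0, EOS):
                for v in range(i + 1):                   # Lr(i,j) inside some Sr(v,j)
                    aLr[i, j] += aSr[v, j] * bSr[v, i]
                for h in range(j, EOS + 1):              # Ll(i,j) inside some Sl(i,h)
                    aLl[i, j] += aSl[i, h] * bSl[j, h]
    return aLr, aLl, aSr, aSl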
Training  The training process is as follows:

1. Initialize the probabilities of the dependency relations between all the possible word pairs.
2. Compute the initial entropy.
3. Analyze the training corpus using the known probabilities, and recalculate the frequency of each dependency relation based on the analysis results.
4. Compute the new probabilities based on the newly counted frequencies.
5. Compute the new entropy.
6. Repeat steps 3 through 5 until the new entropy ≥ the previous entropy.
Figure 11: Leftward complete-sequence outside probability
The above iteration is continued until all the probabilities settle down, i.e., the training
corpus entropy converges to a (local) minimum. The new usage count of a dependency relation
is calculated as follows. In the following expressions, θ is 1 if the dependency relation
is used in the given tree and 0 otherwise.
c(wi → wj) = Σ_tree p(tree | w1,n) · θ(wi → wj, tree, w1,n)
= (1/p(w1,n)) Σ_tree p(tree, w1,n) · θ(wi → wj, tree, w1,n)
= (1/p(w1,n)) Σ_tree p(tree, w1,n, wi → wj)
= (1/p(w1,n)) p(w1,n, wi → wj)
= (1/p(w1,n)) Σ_{m=i..j-1} p(w1,n, Lr(i,j), Sr(i,m), Sl(m+1,j))
= (1/p(w1,n)) Σ_{m=i..j-1} p(w1,i-1, wi,j, wj+1,n, Lr(i,j), Sr(i,m), Sl(m+1,j))
= (1/p(w1,n)) Σ_{m=i..j-1} α^l_r(i,j) · p(wi → wj) · p(wi,m | Sr(i,m)) · p(wm+1,j | Sl(m+1,j))
= (1/p(w1,n)) α^l_r(i,j) · β^l_r(i,j)

Similarly, the usage count of (wi ← wj), c(wi ← wj), is (1/p(w1,n)) α^l_l(i,j) · β^l_l(i,j).
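Putting the pieces together, one reestimation iteration might look like the sketch below. It reuses the inside() and outside() sketches above; corpus is assumed to be a list of tag sequences with tags[0] = 'BOS' and tags[-1] = 'EOS', prob maps a tag pair (a, b) to p(a → b), and the final normalization (over relations sharing the same first tag) is our simplifying assumption, since the renormalization step is not spelled out above.

from collections import defaultdict

def em_iteration(corpus, prob):
    count = defaultdict(float)
    for tags in corpus:
        n = len(tags) - 2
        EOS = n + 1
        # Hypothetical link-probability lookups; the structural BOS -> EOS link has p = 1.
        p_r = lambda i, j: 1.0 if (i, j) == (0, EOS) else prob.get((tags[i], tags[j]), 0.0)
        p_l = lambda i, j: prob.get((tags[j], tags[i]), 0.0)
        bLr, bLl, bSr, bSl = inside(n, p_r, p_l)
        aLr, aLl, aSr, aSl = outside(n, p_r, p_l, bLr, bLl, bSr, bSl)
        z = bLr[0, EOS]                          # sentence probability p(w1,n)
        if z <= 0.0:
            continue
        for i in range(EOS):
            for j in range(i + 1, EOS + 1):
                if (i, j) == (0, EOS):
                    continue                     # the BOS -> EOS link is structural
                # c(wi -> wj) and c(wi <- wj) from the expected-count formulas above
                count[tags[i], tags[j]] += aLr[i, j] * bLr[i, j] / z
                count[tags[j], tags[i]] += aLl[i, j] * bLl[i, j] / z
    total = defaultdict(float)
    for (a, b), c in count.items():
        total[a] += c
    return {(a, b): c / total[a] for (a, b), c in count.items() if total[a] > 0.0}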
The chart has (n² + 4n + 3)/2 boxes. The reestimation algorithm computes eight items
for each chart box, and the computation of each item needs maximally n products and
summations. So the time complexity of the algorithm is O(n³).
The algorithm can be used for class-based (or tag-based) dependency grammar. With
the concept of word class/tag, the complexity is affected by the class/tag size due to the
class/tag ambiguities of each word. In the worst case, the time needed is 8 × t² × (n³ + 4n² + 3n)/2,
so the complexity will be O(t²n³) with respect to t, the number of classes, and n, the length
of the input string.
4 Experimental results on Korean 
We performed experiment of the reestimation algorithm on Korean language. Korean is a 
partially ordered language in which the head words are always placed to the right of their 
dependent words. Under such restriction on word order, an abstract dependency structure 
for Korean is depicted in Figure 12. Thus, in Korean, both of `sr(i,j) and Lr(i,j) are 
meaningless and not constructed. Only ,St, Ll and null `sr(i:i) are considered. Ll(i,j) is 
always composed with the combination of null Sr(i,i) and St(i + 1,j). 
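In terms of the earlier sketches, this head-final restriction simply amounts to zeroing every rightward link probability except the structural (wBOS → wEOS) one, for example:

def make_p_right_korean(EOS):
    # Head-final Korean: no word has a dependent to its right, so only the
    # structural BOS -> EOS link keeps a nonzero probability.
    return lambda i, j: 1.0 if (i, j) == (0, EOS) else 0.0

With this lookup plugged into the sketches, the Sr and Lr cells (other than the null Sr(i,i) and the top Lr(BOS, EOS)) stay empty, matching the restriction described above.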
Figure 12: Abstract complete-link and complete-sequence for Korean
A Korean sentence is segmented into "word-phrases", each of which is a sequence of a content
word and its functional words. In this experiment, the final part of speech (POS) of a
"word-phrase" is selected as its representative POS, and the inter-word-phrase dependencies
are reestimated. We used a 54-tag POS set. The initial probabilities of all the possible
dependencies were set equal. The experiment was performed on three kinds of training and
test sets extracted from KAIST corpora¹.
52 
I 
I 
l 
I 
I 
I 
I 
i 
! 
i 
l 
I 
l 
! 
l 
i 
i 
Table 1: Train and test entropies

iteration    Set-1 (14,427 words)    Set-2 (40,818 words)    Set-3 (336,824 words)
1            4.439422                4.486444                4.486717
2            2.273562                2.356933                2.357249
3            2.194952                2.283601                2.289591
4            2.167079                2.258157                2.265482
13           2.139625                2.235572                2.243396
14           2.139398                2.235414                2.243232
15           2.139225                2.235297                2.243109
16           2.139093                2.235207                2.243015
17           2.138989
18           2.138906

             Set-1 (2,170 words)     Set-2 (4,662 words)     Set-3 (15,903 words)
test entropy 2.476484                2.150553                2.251653
The convergence of entropies (bits/word) through the training iterations and the
test corpus entropies are shown in Table 1. Set-1 is extracted from information and
communication related texts; the training corpus of set-1 is 1,124 sentences (14,427 words)
and the test corpus is 162 sentences (2,170 words). Set-2 is extracted from economy related
texts; the training corpus is 3,499 sentences (40,818 words) and the test corpus is 409 sentences
(4,662 words). Set-3 is not restricted to a domain; the training corpus is 29,169 sentences
(336,824 words) and the test corpus is 1,917 sentences (15,903 words).
The experimental results show that the proposed reestimation algorithm converges to a
(local) minimum entropy. They also show that the train and test entropies are not affected
much by the domain or by the size of the training corpus. This may be because the
reestimation was done on inter-POS dependencies, which are relatively simple. If the
reestimation were done on the dependencies between POS sequences for "word-phrases" or
on the dependencies between lexical entities, the entropies might be affected much more by
the domain and the size of the corpus. We plan such experiments.
Below we show the parsing results for two example Korean sentences. We used the
proposed best-first parsing algorithm to find the most probable parse of each sentence. The
inter-word-phrase probabilities used for parsing are the ones reestimated on training
set-3. To the right of each Korean word-phrase, its English gloss is given in square
brackets. In the parse representations, each individual inter-word-phrase probability
is given to the right of the dependent word-phrase. The probability of each parse is the
product of all the inter-word-phrase probabilities in the parse and is given at the top of
each parse.
¹ KAIST (Korea Advanced Institute of Science and Technology) corpora have been under construction
since 1994 under the Korean Information Base System (KIBS) construction project, supported by the
Ministry of Culture and Sports, Korea. The KAIST corpora at present consist of a raw text corpus
(45,000,000 word-phrases), a POS-tagged corpus (6,750,000 word-phrases), and a tree-tagged corpus
(30,000 sentences). For our experiment, we extracted each train/test set from the POS-tagged corpus.

Input Sentence:
1. [the laboratory]
2. [an electric generator]
3. [equipped with]
4. [is]

Parse: 7.893022e-04
(EOS
  (4. [is] 9.996915e-01
    (1. [the laboratory] 7.129873e-02)
    (3. [equipped with] 1.150859e-01
      (2. [an electric generator] 9.622178e-02))))
Input Sentence:
1. [tunnel]
2. [the front of]
3. [passing]
4. [when]
5. [suddenly]
6. [the car]
7. [stopped]

Parse: 3.293545e-06
(EOS
  (7. [stopped] 9.996915e-01
    (4. [when] 5.528747e-02
      (3. [passing] 9.044279e-01
        (2. [the front of] 1.785020e-01
          (1. [tunnel] 9.537951e-02))))
    (5. [suddenly] 5.879496e-02)
    (6. [the car] 9.504489e-02)))
5 Conclusion and Further Works
In this paper we have proposed a reestimation algorithm and a best-first parsing algorithm
for probabilistic dependency grammars (PDG). The reestimation algorithm is a variation of
the inside-outside algorithm adapted to PDG. In both the reestimation algorithm and
the parsing algorithm, the non-constituent objects, complete-link and complete-sequence,
are defined as basic units for dependency structures and a CYK-style chart is utilized.
Both algorithms have O(n³) time complexity and can be used for corpus-based, stochastic
PDG induction. By experiments on Korean, we have shown that the reestimation algorithm
converges to a local minimum and constitutes a stable grammar.
Compared to phrase structure grammars, PDG can be a useful and practical scheme
for parsing models and language models, because a dependency tree is much simpler
and more easily understood than the structures constructed by phrase structure grammars.
Besides, the search space of the grammar may be smaller and the effect of structural data
sparseness may be less. This also means that the reestimation algorithm for PDG can
converge with a smaller training corpus. We are planning to evaluate a parsing model
based on the reestimated PDG and a PDG-based language model.

References 
Black, E., J. Lafferty, and S. Roukos. 1992. "Development and evaluation of a broad-coverage
probabilistic grammar of English-language computer manuals". In 30th Annual
Meeting of the Association for Computational Linguistics, pages 185-192.
Carroll, G. 1992a. "Learning probabilistic dependency grammars from labeled text". In 
Working Notes, Fall Symposium Series, AAAI, pages 25-31. 
Carroll, G. 1992b. "Two Experiments on Learning Probabilistic Dependency Grammars 
from Corpora". Technical Report CS-92-16, Brown University. 
Chen, S.F. 1995. "Bayesian grammar induction for language modeling". In 33rd Annual
Meeting of the Association for Computational Linguistics, pages 228-235.
Collins, Michael John. 1996. "A New Statistical Parser Based on Bigram Lexical Dependencies".
In 34th Annual Meeting of the Association for Computational Linguistics.
de Marcken, C. 1995. "Lexical heads, phrase structure and the induction of grammar". In 
Third Workshop on Very Large Corpora. 
Eisner, Jason M. 1996. "Three New Probabilistic Models for Dependency Parsing: An 
Exploration". In COLING-96, pages 340-345. 
Lari, K. and S.J. Young. 1990. "The estimation of stochastic context-free grammars using
the inside-outside algorithm". Computer Speech and Language, 4:35-56.
Magerman, David M. 1994. "Natural Language Parsing as Statistical Pattern Recognition". 
Ph.D. thesis, Stanford University. 
Pereira, F. and Y. Schabes. 1992. "Inside-outside reestimation from partially bracketed
corpora". In 30th Annual Meeting of the Association for Computational Linguistics,
pages 128-135.
