A Reestimation Algorithm for Probabilistic Recursive Transition Network*
Young S. Han and Key-Sun Choi
Computer Science Department
Center for Artificial Intelligence Research
Korea Advanced Institute of Science and Technology
Taejon, 305-701, Korea
yshan@csking.kaist.ac.kr, kschoi@csking.kaist.ac.kr
Abstract 
Probabilistic Recursive Transition Network (PRTN) is an elevated version of RTN to model and process languages in stochastic parameters. The representation is a direct derivation from the RTN and keeps much of the spirit of the Hidden Markov Model at the same time. We present a reestimation algorithm for PRTN that is a variation of the Inside-Outside algorithm and that computes the values of the probabilistic parameters from sample sentences (parsed or unparsed).
1. Introduction
In this paper, we introduce a network representation, Probabilistic Recursive Transition Network, that is directly derived from RTN and HMM, and present an estimation algorithm for the probabilistic parameters. PRTN is an RTN augmented with probabilities in the transitions and states and with the lexical distributions in the transitions, or is the Hidden Markov Model augmented with a stack that makes some transitions deterministic.
The parameter estimation of PRTN is developed as a variation of the Inside-Outside algorithm. The Inside-Outside algorithm has been applied to SCFG recently by Jelinek (1990) and Lari (1991). The algorithm was first introduced by Baker in 1979 and is the context free language version of the Forward-Backward algorithm in Hidden Markov Models. Its theoretical foundation was laid by Baum and Welch in the late 60's, which in turn is a type of the EM algorithm in statistics (Rabiner, 1989).

* This research is partly supported by KOSEF (Korea Science and Technology Foundation) under the title "A Study on the Building Techniques for Robust Knowledge based Systems" from 1991 through 1994.
Kupiec (1991) introduced a trellis based estimation algorithm of Hidden SCFG that accommodates both the Inside-Outside algorithm and the Forward-Backward algorithm. The meaning of our work can be sought from the use of the more plain topology of RTN, whereas Kupiec's work is a unified version of the Forward-Backward and Inside-Outside algorithms. Nonetheless, the implementation of the reestimation algorithm carries no more theoretical significance than the applicative efficiency and variation for differing representations since Baker first applied it to CFGs.
2. Probabilistic Recursive Transition Network
A probabilistic RTN (PRTN, hereafter), denoted by $\lambda$, is a 4-tuple.
$A$ is a transition matrix containing transition probabilities, and $B$ is an observation matrix containing the probability distribution of the words observable at each terminal transition, where rows and columns correspond to terminal transitions and a list of words, respectively. $\Gamma$ specifies the types of transitions, and the fourth element denotes a stack. The first two model parameters are the same as those of HMM; thus typed transitions and the existence of a stack are what distinguish PRTN from HMM.
The stack operations are associated with transitions. According to the stack operation, transitions are classified into three types. The first type is the push transition, in which a state identification is pushed onto the stack. The second type is the pop transition, which is selected by the content of the stack. Transitions of the third type are not committed to any stack operation. The three types also carry different grammatical implications; hence grammatical categories are assigned to transitions except pop transitions. Push transitions are associated with nonterminal categories, and will be called nonterminal transitions when that is more transparent in later discussions. In general, the grammar expressed in PRTN consists of layers. A layer is a fragment of network that corresponds to a nonterminal. The third type of transition is linked to the category of terminals (words), and thus is named terminal transition. A table of the probability distribution of words is also defined on each terminal transition. In the context of HMMs, the words in the terminal transition are observations to be generated. Pop transitions represent the return of a layer to one of its possibly multiple higher layers.
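The three transition types can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the names `Kind`, `Transition`, `outgoing_mass`, and the toy network `TOY` (a layer S that calls a layer NP and then reads a verb) are all hypothetical, and the requirement that the probabilities leaving a state sum to one is an assumption borrowed from standard stochastic models.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Kind(Enum):
    PUSH = "push"          # nonterminal transition: pushes a state id, enters a layer
    POP = "pop"            # selected by the stack content, returns to a higher layer
    TERMINAL = "terminal"  # no stack operation, emits one word


@dataclass
class Transition:
    src: str
    dst: str
    kind: Kind
    prob: float                     # transition probability (an entry of A)
    category: Optional[str] = None  # grammatical category; pop transitions have none
    words: Dict[str, float] = field(default_factory=dict)  # word distribution (a row of B)


def outgoing_mass(transitions: List[Transition]) -> Dict[str, float]:
    """Sum the transition probabilities leaving each state."""
    mass: Dict[str, float] = defaultdict(float)
    for t in transitions:
        mass[t.src] += t.prob
    return dict(mass)


# Hypothetical toy network (assumption, not from the paper).
TOY = [
    Transition("s0", "n0", Kind.PUSH, 1.0, category="NP"),
    Transition("n0", "n1", Kind.TERMINAL, 1.0, category="noun",
               words={"dogs": 0.5, "cats": 0.5}),
    Transition("n1", "s1", Kind.POP, 1.0),
    Transition("s1", "s2", Kind.TERMINAL, 1.0, category="verb",
               words={"run": 0.7, "sleep": 0.3}),
]
```

Calling `outgoing_mass(TOY)` gives a quick sanity check that every non-final state distributes its probability mass over its outgoing transitions.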
The network topology of PRTN is not different from that of RTN. In a conceptual drawing of a grammar, each layer looks like an independent network. Compared with the conceptual drawing of the network, an operational view provides a more vivid representation in which actual paths or parses are composed. The only difference between the two is that in the operational view a nonterminal transition is connected directly to the first state of the corresponding layer. In this paper, the parses or paths are assumed to be sequences of dark-headed transitions (see Fig. 1 for example).
Before we start explaining the algorithms, let us define some notations. There is one start state denoted by $\mathcal{S}$, and one final state denoted by $\mathcal{F}$. Also let us call states immediately following a terminal transition terminal states, and states at which pop transitions are defined pop states. Some more notations are as follows.
• first(l) returns the first state of layer l.
• last(l) returns the last state of layer l.
• layer(s) returns the layer state s belongs to.
• bout(l) returns the states from which layer l branches out.
• bin(l) returns the states to which layer l returns.
• terminal(l) returns a set of terminal edges in layer l.
• nonterminal(l) returns a set of nonterminal edges in layer l.
• ij denotes the edge between states i and j.
• [i, j] denotes the network segment between states i and j.
• $W_{a \sim b}$ is an observation sequence covering from the ath to the bth observations.
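The notation above maps naturally onto lookups over an edge list. The sketch below assumes the operational view (a nonterminal edge leads directly to the first state of the called layer, and a pop edge leaves the last state of a layer); the toy tables `LAYER_OF`, `FIRST`, `LAST`, and `EDGES` are hypothetical, and `bin_` carries a trailing underscore only to avoid shadowing Python's built-in `bin`.

```python
# Hypothetical toy network: layer "S" calls layer "NP" once (assumption).
LAYER_OF = {"s0": "S", "s1": "S", "s2": "S", "n0": "NP", "n1": "NP"}
FIRST = {"S": "s0", "NP": "n0"}
LAST = {"S": "s2", "NP": "n1"}
EDGES = [
    ("s0", "n0", "push"),      # nonterminal edge into layer NP
    ("n0", "n1", "terminal"),
    ("n1", "s1", "pop"),       # return from layer NP to layer S
    ("s1", "s2", "terminal"),
]


def first(l):
    return FIRST[l]


def last(l):
    return LAST[l]


def layer(s):
    return LAYER_OF[s]


def bout(l):
    """States from which layer l branches out."""
    return {i for i, j, kind in EDGES if kind == "push" and j == first(l)}


def bin_(l):
    """States to which layer l returns."""
    return {j for i, j, kind in EDGES if kind == "pop" and i == last(l)}


def terminal(l):
    """Terminal edges whose source state lies in layer l."""
    return {(i, j) for i, j, kind in EDGES if kind == "terminal" and layer(i) == l}


def nonterminal(l):
    """Nonterminal (push) edges whose source state lies in layer l."""
    return {(i, j) for i, j, kind in EDGES if kind == "push" and layer(i) == l}
```

For the toy network, `bout("NP")` is the single calling state `s0` and `bin_("NP")` is the single return state `s1`.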
3. Reestimation Algorithm
PRTN is an RTN with probabilistic transitions and words¹ that can be estimated from sample sentences by means of statistical techniques. We present a reestimation algorithm for obtaining the probabilities of transitions and the observation symbols (words) defined at each terminal transition. The Inside-Outside algorithm provides a formal basis for estimating parameters of context free languages such that the probabilities of the observation sequences (sample sentences) are maximized. The reestimation algorithm iteratively estimates the probabilistic parameters until the probability of the sample sentence(s) reaches a certain stability. The reestimation algorithm for PRTN is a variation of the Inside-Outside algorithm customized for the representation. The algorithm to be discussed is defined only for well formed observation sequences.
Definition 1 An observation sequence is well formed if there exists at least one path that generates the sequence in the network and starts at $\mathcal{S}$ and ends at $\mathcal{F}$.
Let an observation sequence of length N be denoted by

$$W = W_1 W_2 \ldots W_N.$$

We start explaining the reestimation algorithm by defining the Inside probability.
¹ We do not consider probabilistic states in this paper.

[Figure 1: Illustration of PRTN for the grammar E → T + E, E → T, T → F * T, T → F, F → ( E ), F → a, with calling states and return states marked. A parse is composed of dark-headed transitions.]

The Inside probability of state $i$, denoted by $P_I(i)_{s \sim t}$, is the probability that a portion of layer(i) (from state $i$ to the last state of the layer) generates $W_{s \sim t}$. That is, it is the probability that a certain fragment of a layer generates a certain segment of an input sentence, and this can be computed by summing the probabilities of all the possible paths in the layer segment that generate the given input segment.

$$P_I(i)_{s \sim t} = P([i, e] \to W_{s \sim t} \mid \lambda)$$

where $e = \mathrm{last}(\mathrm{layer}(i))$.

A more constructive representation of the Inside probability is then

$$P_I(i)_{s \sim t} = \sum_{k} a_{ik}\, b(ik, W_s)\, P_I(k)_{s+1 \sim t} + \sum_{j} \sum_{r=s}^{t} a_{ij}\, P_I(j)_{s \sim r}\, a_{uv}\, P_I(v)_{r+1 \sim t}$$

where $ik \in \mathrm{terminal}(\mathrm{layer}(i))$,
$ij \in \mathrm{nonterminal}(\mathrm{layer}(i))$,
$u = \mathrm{last}(\mathrm{layer}(j))$,
$v \in \mathrm{bin}(\mathrm{layer}(j))$.

The paths starting at state $i$ are classified into two cases according to the type of the immediate transition from $i$: it can be of terminal or nonterminal type. In the terminal case, after the probability of the terminal transition is taken into account, the rest of the layer segment is responsible for the input segment short of the one word just generated by the terminal transition. In the nonterminal case, first the transition probabilities (of the push and the respective pop transitions) are multiplied; then, depending on the coverage of the nonterminal transition (sublayer), the rest of the current layer is responsible for the input sequence that remains after the sublayer is done. After the last observation is made, the last state (pop state) of layer(i) should be reached:

$$P_I(i)_{t+1 \sim t} = \begin{cases} 1 & \text{if } i = e, \\ 0 & \text{otherwise.} \end{cases}$$
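The recursion described here can be sketched as a memoized function. This is an illustrative sketch under stated assumptions, not the authors' code: the toy tables `TERMINAL`, `PUSH`, `POP`, `LAYER_OF`, and `LAST` are hypothetical, the sublayer is assumed to cover at least one word, return states are restricted to the caller's layer, and the network is assumed to contain no left-recursive calls (otherwise the recursion would not terminate). Positions are 1-based and inclusive, as in $W_{s \sim t}$.

```python
from functools import lru_cache

# Hypothetical toy PRTN: layer S calls layer NP, then reads a verb (assumption).
LAYER_OF = {"s0": "S", "s1": "S", "s2": "S", "n0": "NP", "n1": "NP"}
LAST = {"S": "s2", "NP": "n1"}
TERMINAL = {  # (i, k) -> (a_ik, word distribution b(ik, .))
    ("n0", "n1"): (1.0, {"dogs": 0.5, "cats": 0.5}),
    ("s1", "s2"): (1.0, {"run": 0.7, "sleep": 0.3}),
}
PUSH = {("s0", "n0"): 1.0}  # (i, j) -> a_ij, j is the first state of the sublayer
POP = {("n1", "s1"): 1.0}   # (u, v) -> a_uv, u is the last state of the sublayer


def make_inside(words):
    """Return a memoized p_inside(i, s, t) for one observation sequence."""

    @lru_cache(maxsize=None)
    def p_inside(i, s, t):
        e = LAST[LAYER_OF[i]]
        if s > t:  # no observation left: the pop state must have been reached
            return 1.0 if i == e else 0.0
        total = 0.0
        # Terminal case: emit W_s, the rest of the layer covers W_{s+1..t}.
        for (src, k), (a_ik, b) in TERMINAL.items():
            if src == i:
                total += a_ik * b.get(words[s - 1], 0.0) * p_inside(k, s + 1, t)
        # Nonterminal case: the sublayer covers W_{s..r}, the rest covers W_{r+1..t}.
        for (src, j), a_ij in PUSH.items():
            if src != i:
                continue
            u = LAST[LAYER_OF[j]]
            for (pop_src, v), a_uv in POP.items():
                if pop_src != u or LAYER_OF[v] != LAYER_OF[i]:
                    continue
                for r in range(s, t + 1):
                    total += a_ij * p_inside(j, s, r) * a_uv * p_inside(v, r + 1, t)
        return total

    return p_inside


def sentence_probability(words, start="s0"):
    """Inside probability of the start state over the whole sequence."""
    return make_inside(tuple(words))(start, 1, len(words))
```

For the toy network, `sentence_probability(["dogs", "run"])` multiplies the noun and verb emission probabilities along the single path, while a sequence that no path generates gets probability zero.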
Fig. 2 is the pictorial view of the Inside probability. A well formed sequence can begin only at state $\mathcal{S}$; thus, to be strict, $P_I(\mathcal{S})$ has an additional product term $F(\mathcal{S})$ that can be computed also using the Inside-Outside algorithm. Now define the Outside probability.
The Outside probability denoted by $P_O(i,j)_{s \sim t}$ is the probability that the partial sequences $W_{1 \sim s-1}$ and $W_{t+1 \sim N}$ are generated, provided that the partial sequence $W_{s \sim t}$ is generated by $[i, j]$ given the model $\lambda$. This is the complementary point of the Inside probability. This time, we look at the outside of a given layer segment and input segment. Assuming a given layer segment generates a given input segment, we want to compute the probability that the surrounding portion of the whole PRTN generates the rest of the input sequence.
[Figure 2: Illustration of Inside probability.]
The Outside probability is computed first by considering the current layer, which consists of two parts after excluding $[i, j]$; these parts are captured by Inside probabilities. Beyond the current layer is simply an Outside probability with respect to the current layer.
And by definition,

$$P_O(i,j)_{s \sim t} = P([\mathcal{S}, i] \to W_{1 \sim s-1},\ [j, \mathcal{F}] \to W_{t+1 \sim N} \mid \lambda)$$

$$= \sum_{x, y} \sum_{a=1}^{s} \sum_{b=t+1}^{N} a_{xf}\, a_{ey}\, P'_I(f, i)_{a \sim s-1}\, P_I(j)_{t+1 \sim b}\, P_O(x, y)_{a \sim b}$$

where $x \in \mathrm{bout}(\mathrm{layer}(i))$,
$y \in \mathrm{bin}(\mathrm{layer}(i))$,
$f = \mathrm{first}(\mathrm{layer}(i))$,
$e = \mathrm{last}(\mathrm{layer}(i))$,
$\mathrm{layer}(i) = \mathrm{layer}(j)$,
$\mathrm{layer}(x) = \mathrm{layer}(y)$.

$x$ represents a state from which layer(i) branches out, and $y$ represents a state to which layer(j) returns. Every time a different combination of left and right sequences with respect to $W_{s \sim t}$ is tried in the layer that states $i$ and $j$ belong to, the rest of the remaining sequences is accounted for by the Outside probability at the layer above layer(i).
When there is no subsequence to the right of $W_{a \sim b}$ (i.e., $b = N$),

$$P_O(i,j)_{a \sim N} = 1.$$

Fig. 3 shows the network configuration in computing the Outside probability. $P'_I(f, i)_{a \sim s-1}$ is the probability that the sequence $W_{a \sim s-1}$ is generated by the part of layer(i) to the left of state $i$. $P_I(j)_{t+1 \sim b}$ is the probability that the sequence $W_{t+1 \sim b}$ is generated by the part of layer(i) to the right of state $j$. The portions of $W$ not covered by $W_{a \sim b}$ are then left to the parent layers of layer(i).
$P'_I(f, i)_{s \sim t}$ is a slight variation of the Inside probability in which the $P_I(f)_{a \sim b}$'s in the Inside probability formula are replaced by $P'_I(f, i)_{a \sim b}$. Its actual computation is done as follows:

$$P'_I(f, i)_{s \sim t} = \begin{cases} P_I(f)_{s \sim t} & \text{if } s \le t, \\ 1 & \text{if } s > t \text{ and } f = i, \\ 0 & \text{if } s > t \text{ and } f \ne i. \end{cases}$$

It is basically the same as the Inside probability except that it carries a state identification $i$ to check the validity of the stop state. If there are observations left for generation ($s \le t$), things are done just as in computing the Inside probability, ignoring $i$. When the boundary point is reached ($s > t$), it returns 1 if the last state is $i$, and 0 otherwise.
The probability of an observation sequence can be computed using the Inside probability as

$$P(W \mid \lambda) = P([\mathcal{S}, \mathcal{F}] \to W_{1 \sim N} \mid \lambda) = P_I(\mathcal{S})_{1 \sim N}.$$

[Figure 3: Illustration of Outside probability.]

Now we can derive the reestimation algorithm for $A$ and $B$ using the Inside and Outside probabilities. As the result of constrained maximization of Baum's auxiliary function, we have the following form of reestimation for each transition (Rabiner 1989).

$$\bar{a}_{ij} = \frac{\text{expected no. of transitions from } i \text{ to } j}{\text{expected no. of transitions from } i}$$

The expected frequency is defined for each of the three types of transition. For a terminal transition,

$$E_t(ij) = \frac{\sum_{r=1}^{N} a_{ij}\, b(ij, W_r)\, P_O(i,j)_{r \sim r}}{P(W \mid \lambda)}$$

For a nonterminal transition,

$$E_{nt}(ij) = \frac{\sum_{s=1}^{N} \sum_{t=s}^{N} a_{ij}\, P_I(j)_{s \sim t}\, a_{uv}\, P_O(i, v)_{s \sim t}}{P(W \mid \lambda)}$$

where $u = \mathrm{last}(\mathrm{layer}(j))$, $v \in \mathrm{bin}(\mathrm{layer}(j))$,
$\mathrm{layer}(i) = \mathrm{layer}(v)$, $\mathrm{layer}(j) = \mathrm{layer}(u)$,
$uv$ is a pop transition.

For a pop transition,

$$E_{pop}(ij) = \frac{\sum_{s=1}^{N} \sum_{t=s}^{N} a_{uv}\, P_I(v)_{s \sim t}\, a_{ij}\, P_O(u, j)_{s \sim t}}{P(W \mid \lambda)}$$

where $u \in \mathrm{bout}(\mathrm{layer}(i))$,
$j \in \mathrm{bin}(\mathrm{layer}(i))$,
$v = \mathrm{first}(\mathrm{layer}(i))$,
$\mathrm{layer}(u) = \mathrm{layer}(j)$,
$\mathrm{layer}(v) = \mathrm{layer}(i)$,
$uv$ is a nonterminal transition.

Considering that transitions of terminal and nonterminal types can occur together at a state, the reestimation for terminal transitions is done as follows:

$$\bar{a}_{ij} = \frac{E_t(ij)}{\sum_{k} E_t(ik) + \sum_{k} E_{nt}(ik)}$$

For nonterminal transitions,

$$\bar{a}_{ij} = \frac{E_{nt}(ij)}{\sum_{k} E_t(ik) + \sum_{k} E_{nt}(ik)}$$

And for pop transitions, notice that only pop transitions are possible at a pop state,

$$\bar{a}_{ij} = \frac{E_{pop}(ij)}{\sum_{k} E_{pop}(ik)}$$

For a terminal transition $ij$ and an observation symbol $w$,

$$\bar{b}(ij, w) = \frac{\sum_{t \,:\, W_t = w} a_{ij}\, b(ij, W_t)\, P_O(i,j)_{t \sim t}}{\sum_{t=1}^{N} a_{ij}\, b(ij, W_t)\, P_O(i,j)_{t \sim t}}$$
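Once the expected frequencies are available, the reestimation itself is a normalization. The sketch below assumes the expected counts have already been computed elsewhere; the function names `normalize` and `reestimate_words` and the example numbers are hypothetical. Terminal and nonterminal transitions leaving the same state are placed in one dictionary so that they share a denominator, and a pop state is handled by passing only its pop counts.

```python
from collections import defaultdict


def normalize(expected):
    """Divide each expected frequency by the total leaving the state."""
    total = sum(expected.values())
    if total == 0.0:
        raise ValueError("all expected frequencies are zero for this state")
    return {key: value / total for key, value in expected.items()}


def reestimate_words(contributions):
    """Reestimate the word distribution of one terminal transition.

    contributions: iterable of (word, weight) pairs, one per input position,
    where the weight stands for a_ij * b(ij, W_t) * P_O(i,j) at that position.
    """
    per_word = defaultdict(float)
    for word, weight in contributions:
        per_word[word] += weight
    return normalize(dict(per_word))


# Hypothetical expected frequencies for one state (illustrative numbers only).
expected_at_state = {("terminal", "k1"): 1.5, ("nonterminal", "j1"): 0.5}
new_probs = normalize(expected_at_state)
new_words = reestimate_words([("a", 0.25), ("b", 0.5), ("a", 0.25)])
```

With these numbers the terminal transition receives three quarters of the mass and the nonterminal transition one quarter, and the two words end up equally likely.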
The reestimation process continues until the probability of the observation sequences reaches a certain stability. It is not unusual to assume that the training set can be very large, and even grow indefinitely in non trivial applications, in which case additive training can be tried using a smoothing technique as in (Jarre and Pieraccini 1987).
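The stopping rule can be sketched as a loop that watches the change in log probability. This is a generic sketch, not the authors' procedure: the threshold `tol`, the cap `max_iter`, and the callables `reestimate` and `likelihood` are assumptions, and the two toy functions below are stand-ins that merely converge to a fixed value rather than actual PRTN reestimation.

```python
import math


def train_until_stable(params, reestimate, likelihood, tol=1e-6, max_iter=100):
    """Repeat reestimation until the log probability changes by less than tol.

    Assumes likelihood(params) is strictly positive, i.e. the training
    sequences are well formed under the current parameters.
    """
    previous = math.log(likelihood(params))
    for iteration in range(1, max_iter + 1):
        params = reestimate(params)
        current = math.log(likelihood(params))
        if abs(current - previous) < tol:
            return params, iteration
        previous = current
    return params, max_iter


# Stand-in functions for illustration only (not PRTN reestimation).
def toy_reestimate(p):
    return p + 0.5 * (0.8 - p)  # moves halfway toward a fixed point at 0.8


def toy_likelihood(p):
    return p


final, steps = train_until_stable(0.2, toy_reestimate, toy_likelihood)
```

Comparing log probabilities rather than raw probabilities keeps the test meaningful when the probability of a long training set becomes very small.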
The complexity of the Inside-Outside algorithm is $O(N^3)$ both in the number of states and in the input length (Lari 1990). The efficiency comes from the fact that the algorithm successfully exploits the context-freeness. For instance, the generation of substrings by a nonterminal is independent of the surroundings of the nonterminal, and this is how the product of the Inside and Outside probabilities works and how the complexity is derived.
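The cubic dependence on input length can be illustrated by counting the work per state. The sketch below only enumerates the combinations of a segment start s, a split point r, and a segment end t that a span-based recursion visits; the function name `count_split_triples` is hypothetical, and the count is an illustration of the growth rate under that assumption, not a measurement of the algorithm.

```python
def count_split_triples(n):
    """Count triples (s, r, t) with 1 <= s <= r <= t <= n."""
    return sum(
        1
        for s in range(1, n + 1)
        for t in range(s, n + 1)
        for r in range(s, t + 1)
    )


def closed_form(n):
    """The same count in closed form: n(n+1)(n+2)/6, which grows as n cubed."""
    return n * (n + 1) * (n + 2) // 6
```

Doubling the input length from 4 to 8 raises the count from 20 to 120, a factor of six that approaches eight as the length grows.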
4. Conclusion 
Recently several probabilistic parsing approaches 
have been suggested such as SCFG, probabilis- 
tic GLR, and probabilistic link grammar (Laf- 
ferty, 1992). Kupiec extended the reestimation 
algorithm for SCFG to cover non-Chomsky nor- 
mal forms (Carroll, 1993). This paper further ad- 
vances the trend by implanting the Inside-Outside 
algorithm on the plain topology of RTN which 
distinguishes itself from Kupiec's work. 
References
[1] Baker, J. K. (1979). Trainable Grammars for Speech Recognition. Speech Communication Papers for the 97th Meeting of the Acoustical Society of America (D.H. Klatt & J.J. Wolf, eds): 547-550.
[2] Baum, L. E. (1972). An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of a Markov Process. Inequalities 3: 1-8.
[3] Carroll, J., and Briscoe, E. (1993). Generalized probabilistic LR parsing of natural language (Corpora) with unification-based grammars. ACL 19 (1). 25-59.
[4] Jarre, A., and Pieraccini, R. (1987). Some Experiments on HMM Speaker Adaptation. Proceedings of ICASSP, paper 29.5.
[5] John Lafferty, Daniel Sleator, and Davy Temperley. (1992). Grammatical trigrams: a probabilistic model of link grammar. In Proceedings of AAAI Fall Symposium on Probabilistic Approaches to Natural Language Processing, Cambridge, MA. 89-97.
[6] Jelinek, F., Lafferty, J. D., and Mercer, R. L. (1990). Basic Methods of Probabilistic Context Free Grammars. IBM RC 16374. IBM Continuous Speech Recognition Group.
[7] Kupiec, Julian (1991). A Trellis-Based Algorithm For Estimating the Parameters of a Hidden Stochastic Context-Free Grammar. Proceedings, Speech and Natural Language Workshop, sponsored by DARPA. Pacific Grove: 241-246.
[8] Lari, K., and Young, S. J. (1991). Applications of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language 5: 237-257.
[9] Rabiner, Lawrence R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77 (2).
