A Chart Re-estimation Algorithm for a 
Probabilistic Recursive Transition 
Network 
Young S. Haw 
University of Suwon 
Key-Sun Choi* 
Korea Advanced Institute of Science and 
Technology 
A Probabilistic Recursive Transition Network is an elevated version of a Recursive Transition 
Network used to model and process context-free languages in stochastic parameters. We present 
a re-estimation algorithm for training probabilistic parameters, and show how efficiently it can 
be implemented using charts. The complexity of the Outside algorithm we present is O(N4G 3) 
where N is the input size and G is the number of states. This complexity can be significantly 
overcome when the redundant computations are avoided. Experiments on the Penn tree corpus 
show that re-estimation can be done more efficiently with charts. 
1. Introduction 
Though hidden Markov models have been successful in some applications such as 
corpus tagging, they are limited to the problems of regular languages. There have 
been attempts to associate probabilities with context-free grammar formalisms. Re- 
cently Briscoe and Carroll (1993) have reported work on generalized probabilistic LR 
parsing, and others have tried different formalisms such as LTAG (Schabes, Roth, and 
Osborne 1993) and Link grammar (Lafferty, Sleator, and Temperley 1992). Kupiec ex- 
tended a SCFG that worked on CNF to a general CFG (Kupiec 1991). The re-estimation 
algorithm presented in this paper may be seen as another version for general CFG. 
One significant problem of most probabilistic approaches is the computational 
burden of estimating the parameters (Lari and Young 1990). In this paper, we consider 
a probabilistic recursive transition network (PRTN) as an underlying grammar rep- 
resentation, and present an algorithm for training the probabilistic parameters, then 
suggest an improved version that works with reduced redundant computations. The 
key point is to save intermediate results and avoid the same computation later on. 
Moreover, the computation of Outside probabilities can be made only on the valid 
parse space once a chart is prepared. 
2. A Probabilistic Recursive Transition Network 
A PRTN denoted by A is a 6-tuple. 
= (A,u,s,;:,r,5). 
* Computer Science Department, University of Suwon, Suwon P.O. Box 77-78, Suwon, 440-600, Korea. 
E-mail: yshan@world.kaist.ac.kr 
Computer Science Department, Korea Advanced Institute of Science and Technology, Taejon, 305-701, 
Korea. E-mail: kschoi@csking.kaist.ac.kr 
(~ 1996 Association for Computational Linguistics 
Computational Linguistics Volume 22, Number 3 
CFG NP ~ art AP noun 
NP , AP noun 
NP ~ noun 
AP . adj AP 
AP = adj 
PRTN 
Call 
i States noun 0.2 
~.o o.8 ~ 
p a i v ~ v o.~,,~ 
............... J : ..... 
Figure 1 
Illustration of PRTN. A parse is composed of dark-headed transitions. 
i Return 
i States 
~@ 
A is a transition matrix containing transition probabilities, and B is a word matrix 
containing the probability distribution of the words observable at each terminal tran- 
sition. I? specifies the types of transitions, and ~ represents a stack. S and .~ denote 
start and final states, respectively. 
Stack operations are associated with transitions; transitions are classified into three 
types, according to the stack operation. The first type is nonterminal transition, in 
which state identification is pushed into the stack. The second type is pop transition, in 
which transition is determined by the content of the stack. The third type is transitions 
not committed to stack operation; these are terminal and empty transitions. In general, 
the grammar expressed in PRTN consists of layers. A layer is a fragment of network 
that corresponds to a nonterminal. A table of the probability distribution of words is 
defined at each terminal transition. Pop transitions represent the returning of a layer 
to one of its (possibly multiple) higher layers. 
In this paper, parses are assumed to be sequences of dark-headed transitions (see 
Figure 1). States at which pop transitions are defined are called pop states. Other 
notations are listed below. 
first(1) returns the first state of layer I. 
last(l) returns the last state of layer I. 
layer(s) returns the layer state s belongs to. 
bout(1) returns the states from which layer 1 branches out. 
bin(1) returns the states to which layer 1 returns. 
terminal(1) returns a set of terminal edges in layer 1. 
nonterminal(1) returns a set of nonterminal edges in layer I. 
denotes the edge between states i and j. 
\[i,j\] denotes the network segment between states i and j. 
Wa~b is a word sequence covering the a th to b th word. 
422 
Han and Choi A Chart Re-estimation Algorithm 
3. Re-estimation Algorithm 
The task of a re-estimation algorithm is to assign probabilities to transitions and the 
word symbols defined at each terminal transition. The Inside-Outside algorithm pro- 
vides a formal basis for estimating parameters of context free languages so that the 
probabilities of the word sequences (sample sentences) may be maximized. The re- 
estimation algorithm for PRTN uses a variation of the Inside-Outside algorithm cus- 
tomized for PRTN. 
Let a word sequence of length N be denoted by: 
W = Wl Wa--.WN. 
Now define the Inside probability. 
Definition 1 
The Inside probability denoted by Pi(i)s~t of state i is the probability that layer(i) 
generates the string positioned from s to t starting at state i given a model ,~. 
That is: 
Pi(i)s~t = P(\[i,e\] --+ Ws~tl,~) 
where e = last(layer(i)). And by definition: 
t 
P,(i)s~t = ~ aikb(T~, Ws)P,(k)s+l~t + ~ Y2 aijauvP,(j)s~rP,(v)r+l~t. 
k j r=s 
(1) 
---+ ___+ 
where ik E terminal(layer(i)), ij E nonterminal(layer(i) ), u = last(layer(j)), v E bin(layer(j)), 
and layer(i) = layer(v). 
After the last word is generated, the last state of layer(i) should be reached. 
1 if i = last(layer(i)), 
Pl(i)t+l~t = 0 otherwise. 
Figure 2 is the pictorial view of the Inside probability. A valid sequence can be- 
gin only at state $, thus to be strict, P1(8) has an additional product, P($). When 
the immediate transition ~ is of terminal type, the transition probability aq and the 
probability of the S th word at the transition b(~j, Ws) are multiplied together with the 
Inside probability of the rest of the sequence, Ws+l~t. 
Now define the Outside probability. 
Definition 2 
The Outside probability denoted by Po (i,j)s~t is the probability that partial sequences, 
Wl~s_l and Wt+l~N, are generated, provided that the partial sequence, Ws~t, is gen- 
erated by \[i,j\] given a model ~. 
And by definition: 
Po(i,j)s~t = P(\[,S',i\] --~ Wl~s_l, ~',..~-\] -~- Wt+l~ N I A) 
N 
= ~ ~ ~ axfaeYPr(f'i)a~s-lPl(j)t+l~bPo(x,y)a~b • 
x a=l b=t (2) 
423 
Computational Linguistics Volume 22, Number 3 
T~ 
layer(j) ,.,, = .... 
W I~ t I 
Figure 2 
Illustration of Inside probability. 
,aye~Ix> _- ? @ 
layer(i) ~ .... ~\[)_~.... =~_~...._~ 
w I' s, I' '1'*' NI 
Figure 3 
Illustration of Outside probability. 
where x E bout(layer(i)), y E bin(layer(i)), f =first(layer(i)), e = last(layer(i)), layer(i) = 
layer(\]'), and layer(x) = layer(y). 
The summation on x is defined only when a ~ 1 or b ~ N (i.e., there are words 
left to be generated). Nonterminal and its corresponding pop transitions are defined 
to be 1 when a = 1 and b = N. 
For a boundary case of the Outside probability where f is the first state of a layer 
in the above equation: 
1 iff = $, 
Po(i,j)l~S = 0 otherwise. 
Figure 3 shows the network configuration in computing the Outside probability. In 
equation 2, P~(f, i)~s-1 is the probability that sequence, Wa~-l, is generated by layer(i) 
left to state i, and Pl(j)t+l~b is the probability that sequence Wt+l~b is generated by 
layer(i) right to state j. 
The computation of P~ 0 c, i)s~t--a slight variation of the Inside probability in which 
the P~(f)a~b'S in equation 1 are replaced by P~0 c, i)a~b--is done as follows: 
P~(f,i)s~t 
{ pl0C)s~t if s __q t, 
= 1 if s > t andf = i, 
0 if s > t andf¢ i. 
It is basically the same as the Inside probability except that it carries an i that indicates 
a stop state. 
Now we can derive the re-estimation algorithm for .g and/3 using the Inside and 
Outside probabilities. As the result of constrained maximization of Baum's auxiliary 
424 
Han and Choi A Chart Re-estimation Algorithm 
function, we have the following form of re-estimation for each transition (Rabiner 
1989). 
expected number of transitions from state i to state j 
expected number of transitions from state i 
The expectation of each transition type is computed as follows: For a terminal transi- 
tion: 
Y'~ffr=l aijb(~, W~)Po(i,j)~~~ Et( ij ) = 
P(W I A) 
For a nonterminal transition: 
N ~s=l Y~fft-s aijP,(j)~ta~Po( i, v)s~t 
Ent(~ ) = P(wl 
where u = last(layer(j)), v ¢ bin(layer(j)), layer(i) = layer(v), layer(j) = layer(u), and uv 
is a pop transition. For a pop transition: 
N N ~-~s=l ~-~t=s &,vPl(v)s~taijPo(u, J)sNt 
Go,( ij ) = P(W I ,k) 
where u E bout(layer(i)), j c bin(layer(i)), v = first(layer(i)), layer(u) = layer(j), layer(v) = 
layer(i), and uv is a nonterminal transition. 
Since transitions of terminal and nonterminal types can occur together at a state, 
terminal transitions are estimated as follows: 
---+ 
a-ij Gk Et(ik) + Gk Ent(~) (3) 
For nonterminal transitions: 
E,,,( q) (4) 
And for pop transitions, notice that only pop transitions are possible at a pop state: 
L aq _ Go,(q) 
Y~k Epop( ik ) 
(5) 
For a terminal transition ~ and a word symbol w: 
Ct ~.,. w,=~,, aijb(~, Wt)Po(i,j)t~t b( ij,w) = 
Y'~fft=~ aqb(~, Wt)Po(i,j)t~t 
The re-estimation continues until the probability of the word sequences reaches a 
certain stability. 
425 
Computational Linguistics Volume 22, Number 3 
sentence W 
T 
Compute Inside probability 1 ofW \[ 
Inside Table 
Select valid Insides by \[~ running Inside algorithm 
topdown 
of 
valid Insides 
Run Outside algorithm \] I 
Figure 4 
Outside computation with chart. Inside computation builds a table of computed Insides. 
4. Chart Re-estimation Algorithm 
It can be shown that the complexity of the Inside algorithm is O(N3G 3) and that of the 
Outside algorithm is O(N4G 3) where N is the input size and G is the number of states. 
The complexity is too much for current workstations when either N or G becomes 
bigger than a few 10s. A basic implementation of the algorithm is to use a chart and 
avoid doing the same computations more than once. For instance, the table for storing 
Inside computations takes O(N2G2C) store, where C is the number of terminal and 
nonterminal categories. A chart item is a function of five parameters, and returns an 
Inside probability. 
I(i,j, s, t, c) = Pi(i,j)s~t. 
A chart item is associated with categories implying that the item is valid on the 
specified categories that begin the net fragment of the item. Suppose a net fragment 
\[i,j\] begins with NP and ADJP, then given a sentence fragment Ws~t, ADJP may not 
participate in generating Ws~t, while NP may. The information of valid categories is 
useful when the chart is used in computing Outside probabilities. 
An Outside probability is the result of computing many Inside probabilities. Com- 
puting an Inside probability even in an application of moderate size can be impractical. 
A naive implementation of Outside computation takes numerous Inside computations, 
so estimating even a parameter will not be realistic in a serial workstation (Lari and 
Young 1990). 
The proposed estimation algorithm aims at reducing the redundant Inside com- 
putations in computing an Outside probability. The idea is to identify the Inside prob- 
abilities used in generating an input sentence and to compute an Outside probability 
using mainly those Insides. This is done first by computing an Inside probability of 
the input sentence, which can return a table of Insides used in the computation. Note 
that the Insides in the deepest depth are produced first, as the recursion is released, 
thus there can be many Insides that are not relevant to the given sentence. The Insides 
that participate in generating the input sentence can be identified by running the In- 
side algorithm one more time, top-down. Figure 4 illustrates the steps of the revised 
Outside computation. 
The identified Insides, however, do not cover all the Insides needed in computing 
an Outside probability. This is because the Inside algorithm works on a network from 
426 
Han and Choi A Chart Re-estimation Algorithm 
left to right and one transition at a time. Many Insides that are missed in the table are 
compositions of smaller Insides. 
Once charts of selected Insides are prepared, an Outside probability is computed 
as follows: 
Po(i,j)s~, = ~ ~ axfaevlOe, i,a,s - 1)I(j,e,t + 1, b)Po(x,y)a~b . 
{xiI(x,y,a,b,c)>O} (a,b)ca(f,e,s,t) 
where x E bout(layer(i)), y c bin(layer(i)), f = first(layer(i)), e = last(layer(i)), c c 
{nonterminal}, layer(i) = layer(j), and layer(x) = layer(y). 
The function cr0C, e,s, t) returns a set of (a, b) pairs where there are Inside items 
I0 c, e, a, b) defined at the chart such that a G s and b > t. In short, the items for 
state f indicate the possible combinations of sentence segments inclusive of the given 
fragment Ws~t because the chart contains items of all the valid sentence segments that 
were generated through the layer ~c, e\]. When the current layer ~c, e\] is completed with 
the two Insides computed, the computation extends to the Outside. 
Useless advancements into high layers that do not lead to the successful comple- 
tion of a given sentence can be avoided by making sure that Ix, y\] generates Wa~b and 
the category of current layer c is defined, which can be checked by consulting the 
chart items for state x. 
5. Experiments 
The goal of our experiments is to see how much saving the new estimation algorithm 
achieves in computational cost. Out of 14,132 Wall Street Journal trees of the Penn tree 
corpus, 1,543 trees corresponding to sentences with 10 words or less were chosen, and 
the programs written in C language were run at a Sparcl0 workstation. 
The basic implementation of an Inside-Outside algorithm assumes tables for In- 
sides and Outsides so that identical Insides and Outsides need not be recomputed. 
A chart Outside or re-estimation algorithm assumes a refined table of Insides that 
contains only valid Insides used in generating the input sentence as discussed earlier, 
and Outside computation is done based on the refined table. 
The improvement from the chart re-estimation algorithm is measured in the num- 
ber of actual Inside and Outside computations done to estimate the parameters. Fig- 
ure 5 shows the average counts of Insides used in estimating 50 trees randomly selected 
from 1,543 samples. Before the re-estimation algorithm is applied, an RTN that faith- 
fully encodes the input trees without any overgeneration is constructed from the 50 
trees. The gain in Insides from the chart re-estimation algorithm is very clear, and in 
the case of Outsides the gain is even more conspicuous (see Figure 6). The number of 
Insides counted in chart version also includes the Insides computed in preparing the 
chart. 
6. Conclusion 
We have presented an efficient re-estimation algorithm for a PRTN that made use 
of only valid Insides. The method requires the preparation of a chart by running 
Inside computation twice over a whole sentence. The suggested method focuses mainly 
on reducing the computational overhead of Outside computation, which is a major 
portion of a re-estimation. The computation of an Inside probability may be improved 
further using a similar technique introduced in this paper. 
427 
Computational Linguistics Volume 22, Number 3 
2000 I I I I 
without chart o +J 
1500 wit r I , 
Insides 1000 
5OO 
0 
0 2 4 6 8 10 
Sentence size 
Figure g 
Gain in Insides using chart re-estimation, showing 50 randomly-chosen sentences out of 1,543 
samples. 
400 I I J I I I I | 
,1, 350 without chart o 
with ch 
3O0 
25O 
Outsides 200 
150 
100 
50 
o _._.__.7-~ o .~ t I I I I I 
2 3 4 5 6 7 8 9 10 
Sentence size 
Figure 6 
Gain in Outsides using chart re-estimation with the same 50 sentences as in Figure 5. 
428 
Han and Choi A Chart Re-estimation Algorithm 
References 
griscoe, Ted, and John Carroll. 1993. 
Generalized probabilistic LR parsing of 
natural language (Corpora) with 
unification-based grammars." 
Computational Linguistics 19(1): 25-57. 
Kupiec, Julian. 1991. A trellis-based 
algorithm for estimating the parameters 
of a hidden stochastic context-free 
grammar. In Proceedings of the Speech and 
Natural Language Workshop, pages 241-246, 
DARPA, Pacific Grove. 
Lafferty, John, Daniel Sleator, and Davy 
Temperley. 1992. Grammatical trigrams: A 
probabilistic model of link grammar. 
AAAI Fall Symposium Series: Probabilistic 
Approaches to Natural Language, pages 
89-97, Cambridge. 
Lari, K. and S. J. Young. 1990. The 
estimation of stochastic context-free 
grammars using the Inside-Outside 
algorithm. Computer Speech and Language 
4:35-56. 
Rabiner, Lawrence R. 1989. A tutorial on 
hidden Markov models and selected 
applications in speech recognition. In 
Proceedings of the IEEE 77, Volume 2. 
Schabes, Yves, Michael Roth, and Randy 
Osborne. 1993. Parsing the Wall Street 
Journal with the inside-outside algorithm. 
Sixth Conference of the European Chapter of 
the ACL, '93, Utrecht, the Netherlands, 
April. 
429 
