A Trellis-Based Algorithm 
For Estimating The Parameters Of 
A Hidden Stochastic Context-Free Grammar 
Julian Kupiec 
Xerox Palo Alto Research Center 
3333 Coyote Hill Road 
Palo Alto, CA 94304 
ABSTRACT 
The paper presents a new algorithm for estimating the pa- 
rameters of a hidden stochastic context-free grammar. In con- 
trast to the Inside/Outside (I/O) algorithm it does not require 
the grammar to be expressed in Chomsky normal form, and thus 
can operate directly on more natural representations of a gram- 
mar. The algorithm uses a trellis-based structure as opposed to 
the binary branching tree structure used by the I/O algorithm. 
The form of the trellis is an extension of that used by the For- 
ward/Backward algorithm, and as a result the algorithm reduces 
to the latter for components that can be modeled as finite-state 
networks. In the same way that a hidden Markov model (HMM) 
is a stochastic analogue of a finite-state network, the representation 
used by the new algorithm is a stochastic analogue of a 
recursive transition network, in which a state may be simple or 
itself contain an underlying structure. 
INTRODUCTION 
The algorithm described in this paper is concerned with 
using hidden Markov methods for estimation of the param- 
eters of a stochastic context-free grammar from free text. 
The Forward/Backward (F/B) algorithm (Baum, 1972) is 
capable of estimating the parameters of a hidden Markov 
model (i.e. a hidden stochastic regular grammar) and has 
been used with success to train text taggers (Jelinek, 1985). 
In the tagging application the observed symbols are words 
and their underlying lexical categories are the hidden states 
of the model. 
A context-free grammar comprises both lexical (terminal) 
categories and grammatical (nonterminal) categories. 
One iterative method of estimation in this case involves 
parsing each sentence in the training corpus and for each 
derivation, accumulating counts of the number of times each 
rule is used. This method has been used by Fujisaki et al. 
(1989), and Chitrao & Grishman (1990). A more efficient 
method is the Inside/Outside algorithm, devised by Baker 
(1979) for grammars that are expressed in Chomsky nor- 
mal form. The algorithm described in this paper relaxes 
the requirement for a grammar to be expressed in a nor- 
mal form, and it is based on a trellis representation that is 
closely related to the F/B algorithm, and which reduces to 
it for finite-state networks. 
The development of the algorithm has various motiva- 
tions. Grammars must provide a large coverage to accom- 
modate the diversity of expression present in large collec- 
tions of unrestricted text. As a result they become more 
ambiguous. A stochastic grammar provides the capability 
to resolve ambiguity on a probabilistic basis, providing a 
practical approach to the problem. It also provides a way of 
modeling conditional dependence for incomplete grammars, 
or in the absence of any specific structural information. The 
latter is exemplified by the approach taken in many current 
taggers, which have a uniform model of second-order depen- 
dency between word categories. Kupiec (1989) has experi- 
mented with the inclusion of networks to model mixed-order 
dependencies. 
The use of hidden Markov methods is motivated by the 
flexibility they afford. Text corpora from any domain can be 
used for training, and there are no restrictions on a grammar 
due to conventions used during labeling. The methods also 
lend themselves to multilingual application. 
The representation used by the algorithm can be related 
to constituent structures used in other parsers such as chart 
parsers, providing a means of embedding this technique in 
them. 
REPRESENTATION 
The representation of a grammar and the basic trellis 
structure are discussed in this section. The starting point is 
the conventional HMM network in which symbols are gen- 
erated at states (rather than on transitions) as described in 
Levinson et al. (1983). Such a network is represented by 
the parameter set (A, B, I) comprising the transition, out- 
put and initial matrices. The states in this kind of network 
will be referred to as terminal states from now on, and will 
be represented pictorially with single circles. As a short- 
hand convenience in what follows, if the circle contains a 
symbol, then it is assumed that only that symbol is ever 
generated by the state. (The probability of generating it is 
then unity, and zero for all other symbols.) A single sym- 
bol is generated by a transition to a terminal state. For 
the grammars considered here, terminal states correspond 
to lexical categories. 
To this parameter set we will add four other parameters 
(N, F, Top, L). The boolean Top indicates whether the network 
is to be considered as the top-level network. Only one 
network may be assigned as the top-level network, and it is 
analogous to the root symbol of a grammar. The parameter 
F is the set of final states, specifying the allowable states 
in which a network can be considered to have accepted a 
sequence of observations. A different type of state will now 
be introduced, called a nonterminal state. It represents a 
reference to another network and is indicated diagrammati- 
cally with two concentric circles. When a transition is made 
to a nonterminal state, the state does not generate any ob- 
servations per se, but terminal nodes within the referred 
network do. A nonterminal state may be associated with a 
sequence of observation symbols, corresponding to the se- 
quence accepted by the underlying network. The parameter 
N is a matrix which indicates whether a state is a terminal 
or nonterminal state. Terminal states have a null entry in 
the matrix, and nonterminal states have a reference to the 
network which they represent. A grammar is usually com- 
posed of several networks, so each one is referred to with a 
unique label L. 
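The parameter set described above can be pictured as a small record type. The following is a minimal sketch only: the class name, field names and the example network are illustrative assumptions, not part of the paper, and N is stored as a list that holds either a null entry or the label of the referenced network.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Network:
    """One component network; field names are illustrative, not from the paper."""
    label: str                  # L: unique label of the network
    A: List[List[float]]        # transition matrix, A[i][j] = a(i, j)
    B: List[Dict[str, float]]   # output probabilities per state (empty for nonterminals)
    I: List[float]              # initial probabilities (rule probabilities below top level)
    N: List[Optional[str]]      # None for a terminal state, else the referenced label
    F: Set[int]                 # final states
    top: bool = False           # Top: True for the single top-level network

    def term(self) -> List[int]:
        """States that generate a symbol directly."""
        return [s for s, ref in enumerate(self.N) if ref is None]

    def nonterm(self) -> List[int]:
        """States that refer to another network."""
        return [s for s, ref in enumerate(self.N) if ref is not None]


# Hypothetical two-state network: a reference to "NP" followed by a verb state.
s_net = Network(
    label="S",
    A=[[0.0, 1.0], [0.0, 0.0]],
    B=[{}, {"runs": 1.0}],
    I=[1.0, 0.0],
    N=["NP", None],
    F={1},
    top=True,
)
```

Keeping N as a per-state list makes the terminal/nonterminal distinction a simple null check, which mirrors the null entries described for the N matrix.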
Figure 1 shows how rules in Chomsky normal form are 
represented as networks using the above scheme. The lexical 
form of the rules is included, illustrating how the left hand 
side of a rule corresponds to a network label, and the net- 
work structure is associated with the right-hand side. Ter- 
minal states are labeled in lower case and nonterminals in 
upper case. The numbers associated with the states are 
their initial probabilities which are also rule probabilities. 
For terminal nodes in the top-level network, initial prob- 
abilities have the same meaning as in the F/B algorithm. 
For all other networks, an initial probability corresponds to 
a production probability. States which have a non-zero ini- 
tial probability will be termed "Initial states" from now on. 
Any sequence recognized by a network must start on an ini- 
tial state and end on a final state. In Figure 1, final states 
are designated with the annotation "F". Figure 2 shows how 
the terminal symbols in Figure 1 may be represented in a 
more compact style, by a single state having different B 
matrix probabilities for the symbols x and y. 
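[Figure 1: Network and Rules for Chomsky Normal Form. Network A represents the rules A -> B C, A -> x and A -> y; its initial states carry the probabilities 0.5, 0.3 and 0.2, and final states are marked F.] 
[Figure 2: Representation for Terminal Symbols. A single final state with initial probability 0.5.] 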
Terminology 
A grammar is represented as a set $\mathcal{N}$ of networks, and 
a component network labeled n is composed of parameters 
(A, B, I, N, F, Top, n). To strictly identify an element in the 
parameter set each element must be a function of its as- 
sociated network (e.g. A(n), I(n) etc.). In the following 
sections however, where the reference is obvious this nota- 
tion has been omitted to make formulae less cumbersome. 
Thus, given a network $n \in \mathcal{N}$, an element of its transition 
matrix A, from state i to state j is written a(i, j). Likewise 
the initial probability for state i is I(i). Assuming that sen- 
tences are used as text units, an observation sequence may 
consist of Y + 1 words, indexed from 0 to Y: 
$(w_0, w_1, w_2, \ldots, w_Y)$ 
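[Figure 3: An Example Network and Trellis Diagram. The network has a nonterminal state 1 and a terminal state 2; the trellis shows start and end nodes for state 1 and nodes for state 2 at word positions $w_0$, $w_1$ and $w_2$.] 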
It is useful to define a lookup function W(y) which returns 
the index k of the vocabulary entry $v_k$ matching the 
word $w_y$ at position y in the sentence. The vocabulary entry 
may be a word or an equivalence class based on categories 
(Kupiec, 1989). An element of the output matrix B, representing 
the probability of seeing word $w_y$ in terminal 
state j, is then b(j, W(y)). In addition, three sets will be 
mentioned: 
1. Term(n) The set of terminal states in network n. 
2. Nonterm(n) This is the set of nonterminal states in 
network n. 
3. Final(n) The set F of final states in network n. 
The predicate Top(n) indicates that n is a top-level network, 
and $\neg$Top(n) indicates that it is not. Finally, the function N(p, n) 
returns the network to which state p in network n refers. 
(If p is a terminal state it returns a null value.) 
TRELLIS DIAGRAM 
Figure 3 shows the form of the trellis diagrams that are 
used for the computation of probabilities. In the F/B al- 
gorithm a single trellis is used, whose dimensions are the 
number of states in the network and the length of the sen- 
tence. A single trellis spans the whole sentence. In the new 
algorithm each network has an associated set of trellises, 
for subsequences starting at different positions in a sentence 
and ending at subsequent ones. (Only a single trellis start- 
ing at w0 is shown in Figure 3.) It can be seen that terminal 
state 2 has corresponding nodes in the trellis diagram, but 
nonterminal state 1 is represented by pairs of nodes. One 
node of the pair is called the start node and the other is 
termed the end node. Paths exist in the trellis for possible 
state transitions between successive words. However, it is 
implicitly understood that paths also exist between the 
start node and subsequent end nodes for each nonterminal 
state. These implicit paths are shown as broken lines in Fig- 
ure 3 and correspond to paths that enter network A at some 
time, and return from it at the same or a later time. The 
probabilities associated with the implicit paths are assigned 
by reference to the trellis diagrams of the appropriate net- 
work. An implicit path from a start node at position x to 
an end node at position y for a nonterminal state p can be 
thought of as a constituent labeled p, that dominates the 
words from positions x through to y (inclusive) in a sentence. 
A network n is deemed to include the sequence $w_x \ldots w_y$ 
if paths exist through the network which will generate 
this sequence or a longer one which includes it as a prefix. 
Thus it is not necessary to be at a final state of n at word 
$w_y$ to include $w_x \ldots w_y$. 
The algorithm makes use of one set of trellis diagrams to 
compute "alpha" probabilities, and another for "beta" probabilities. 
These are both split into terminal, nonterminal-start 
and nonterminal-end probabilities, corresponding to 
the three different types of nodes in the trellis diagram. For 
the alpha set, these are labeled $\alpha_t$, $\alpha_{nts}$ and $\alpha_{nte}$ 
respectively. 
$\alpha_t(x, y, j, n)$: The probability of generating the words 
$w_x \ldots w_y$ inclusive, with network n including them, and of 
being at the node for terminal state j at position y. 

$$\alpha_t(x,y,j,n) = \Big[\sum_i \alpha_t(x,y-1,i,n)\,a(i,j)\Big]\, b(j,W(y)) + \Big[\sum_q \alpha_{nte}(x,y-1,q,n)\,a(q,j)\Big]\, b(j,W(y)) \qquad (1)$$
where $0 < y \le Y$, $0 \le x < y$, $j, i \in \mathrm{Term}(n)$ and $q \in \mathrm{Nonterm}(n)$. 

$$\alpha_t(x,x,j,n) = I(j)\, b(j,W(x)) \qquad (2)$$
where $0 \le x \le Y$ and $j \in \mathrm{Term}(n)$. 
It can be seen that if x = 0 and there are no nonterminal 
states, the previous expressions are as in the F/B algorithm. 
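The reduction to the F/B algorithm can be illustrated with a short sketch. It assumes a single network with terminal states only and a trellis starting at x = 0, so only the first bracket of equation (1) contributes; the function name and the list-of-lists layout for A and B are illustrative choices, with observations given as vocabulary indices W(y).

```python
def alpha_terminal_only(A, B, I, obs):
    """Return alpha[y][j] = alpha_t(0, y, j, n) for a terminal-only network.

    A[i][j] is a(i, j), B[j][k] is b(j, k), I[j] is the initial probability,
    and obs[y] is the vocabulary index W(y).
    """
    n_states = len(I)
    alpha = [[0.0] * n_states for _ in obs]
    for j in range(n_states):
        # Equation (2) with x = 0.
        alpha[0][j] = I[j] * B[j][obs[0]]
    for y in range(1, len(obs)):
        for j in range(n_states):
            # Equation (1) without the nonterminal term.
            reach = sum(alpha[y - 1][i] * A[i][j] for i in range(n_states))
            alpha[y][j] = reach * B[j][obs[y]]
    return alpha


# Two states: state 0 always emits symbol 0, state 1 always emits symbol 1.
A = [[0.5, 0.5], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
I = [1.0, 0.0]
alpha = alpha_terminal_only(A, B, I, [0, 0, 1])
```

In this example the only path that generates the sequence is 0, 0, 1, so the value at the last position for state 1 is the product of the two transition probabilities along that path.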
$\alpha_{nts}(x, y, p, n)$: The probability of generating the words 
$w_x \ldots w_{y-1}$ inclusive, with network n including them, and of 
being at the start node of nonterminal state p at position y. 

$$\alpha_{nts}(x,y,p,n) = \sum_i \alpha_t(x,y-1,i,n)\,a(i,p) + \sum_q \alpha_{nte}(x,y-1,q,n)\,a(q,p) \qquad (3)$$
where $0 < y \le Y$, $0 \le x < y$, $p, q \in \mathrm{Nonterm}(n)$ and $i \in \mathrm{Term}(n)$. 

$$\alpha_{nts}(x,x,p,n) = I(p) \qquad (4)$$
where $0 \le x \le Y$ and $p \in \mathrm{Nonterm}(n)$. 
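$\alpha_{nte}(x, y, p, n)$: The probability of generating the words 
$w_x \ldots w_y$ inclusive, with network n including them, and of 
being at the end node of nonterminal state p at position y. 

$$\alpha_{nte}(x,y,p,n) = \sum_{x \le v \le y} \alpha_{nts}(x,v,p,n)\,\alpha_{total}(v,y,N(p,n)) \qquad (5)$$
where $0 \le y \le Y$, $0 \le x \le y$ and $p \in \mathrm{Nonterm}(n)$. 

$$\alpha_{total}(v,y,n) = \sum_i \alpha_t(v,y,i,n) + \sum_p \alpha_{nte}(v,y,p,n) \qquad (6)$$
where $0 \le y \le Y$, $0 \le v \le y$, $i \in \mathrm{Term}(n)$ & $i \in \mathrm{Final}(n)$, and $p \in \mathrm{Nonterm}(n)$ & $p \in \mathrm{Final}(n)$. 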
The quantity $\alpha_{total}(v, y, n)$ refers to the probability that 
network n generates the words $w_v \ldots w_y$ inclusive and is 
in a final state of n at position y. The $\alpha_{total}$ probabilities 
correspond to the "Inner" (bottom-up) probabilities of the 
I/O algorithm. If the network topology for Chomsky normal 
form shown in Figure 1 is substituted in equation (6), the 
recursion for the inner probabilities of the I/O algorithm 
will be produced after further substitutions using equations 
(1)-(6). 
In the previous equations (5) and (6) it can be seen that 
the $\alpha_{nte}$ probabilities for a network are defined in terms of 
other ones. They will never be self-referential if the grammar 
is cycle-free (i.e. there are no derivations $A \Rightarrow A$ for any 
nonterminal production A). In the new algorithm cycles can 
be detected and self-referencing avoided. This is a similar 
situation to a chart parser where once a constituent with 
a given label, start and end position is built, no further 
instances of it are added. 
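The bottom-up alpha computation can be sketched as a set of mutually recursive, memoized functions. This is a minimal illustration under stated assumptions, not the implementation described in the paper: the grammar is assumed cycle-free, networks are plain dictionaries with hypothetical keys A, B, I, N and F, output probabilities are keyed directly by word, and the toy grammar (S -> NP verb, NP -> det noun | noun) is invented for the example.

```python
from functools import lru_cache


def make_alpha_total(grammar, words):
    """Return alpha_total(v, y, n) for a cycle-free grammar (illustrative layout)."""

    def incoming(x, y, s, n):
        # Sum over predecessors of state s: terminal nodes and nonterminal end nodes.
        g = grammar[n]
        total = 0.0
        for r, ref in enumerate(g["N"]):
            if g["A"][r][s] == 0.0:
                continue
            prev = a_t(x, y - 1, r, n) if ref is None else a_nte(x, y - 1, r, n)
            total += prev * g["A"][r][s]
        return total

    @lru_cache(maxsize=None)
    def a_t(x, y, j, n):
        g = grammar[n]
        emit = g["B"][j].get(words[y], 0.0)
        if y == x:
            return g["I"][j] * emit                # equation (2)
        return incoming(x, y, j, n) * emit         # equation (1)

    @lru_cache(maxsize=None)
    def a_nts(x, y, p, n):
        if y == x:
            return grammar[n]["I"][p]              # equation (4)
        return incoming(x, y, p, n)                # equation (3)

    @lru_cache(maxsize=None)
    def a_nte(x, y, p, n):
        sub = grammar[n]["N"][p]                   # network referenced by state p
        return sum(a_nts(x, v, p, n) * a_total(v, y, sub)
                   for v in range(x, y + 1))       # equation (5)

    @lru_cache(maxsize=None)
    def a_total(v, y, n):
        g = grammar[n]
        return sum((a_t if g["N"][s] is None else a_nte)(v, y, s, n)
                   for s in g["F"])                # equation (6)

    return a_total


# Hypothetical toy grammar: S -> NP verb, NP -> det noun (0.6) | noun (0.4).
GRAMMAR = {
    "S": {"A": [[0.0, 1.0], [0.0, 0.0]], "B": [{}, {"runs": 1.0}],
          "I": [1.0, 0.0], "N": ["NP", None], "F": {1}},
    "NP": {"A": [[0.0, 1.0], [0.0, 0.0]],
           "B": [{"the": 1.0}, {"dog": 0.5, "cat": 0.5}],
           "I": [0.6, 0.4], "N": [None, None], "F": {1}},
}
p_long = make_alpha_total(GRAMMAR, ("the", "dog", "runs"))(0, 2, "S")
p_short = make_alpha_total(GRAMMAR, ("dog", "runs"))(0, 1, "S")
```

Memoizing on (start, end, state, network) plays the same role as the chart-parser convention mentioned above: each labeled span is computed once and then reused. The sketch does not detect cycles, so a grammar with a derivation of a nonterminal from itself would recurse without terminating.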
The alpha probabilities are all computed first. The beta 
probabilities can then be calculated, which unlike the F/B 
algorithm involve the alpha probabilities because prefixes of 
a sentence must be accounted for as well as suffixes. The 
beta probabilities are described below. For convenience in 
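later equations the following functions $\beta_{above}$ and $\beta_{side}$ are 
first defined: 

$$\beta_{above}(x,y,n) = \sum_{m \in \mathcal{N}} \; \sum_{r : N(r,m) = n} \; \sum_{0 \le v \le x} \alpha_{nts}(v,x,r,m)\,\beta_{nte}(v,y,r,m) \qquad (7)$$
where $r \in \mathrm{Nonterm}(m)$. 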
$$\beta_{side}(x,y,l,n) = \sum_i a(l,i)\,\beta_t(x,y+1,i,n)\,b(i,W(y+1)) + \sum_q a(l,q) \sum_{y < v \le Y} \alpha_{total}(y+1,v,N(q,n))\,\beta_{nte}(x,v,q,n) \qquad (8)$$
where $i \in \mathrm{Term}(n)$ and $q \in \mathrm{Nonterm}(n)$. 
$\beta_t(x, y, j, n)$: The probability of generating the prefix 
$w_0 \ldots w_{x-1}$ and suffix $w_{y+1} \ldots w_Y$ given that network n 
includes $w_x \ldots w_y$ and is in terminal state j at position y. 
The indicator function Ind() is used in subsequent equations. Its 
value is unity when its argument is true and zero otherwise. In 
addition, elements that are not explicitly referenced by the 
ranges in the equations are assumed to be zero. 

$$\beta_t(x,y,j,n) = \beta_{side}(x,y,j,n) + \mathrm{Ind}(j \in \mathrm{Final}(n))\,\beta_{above}(x,y,n) \qquad (9)$$
where $0 \le y < Y$, $0 \le x \le y$ and $j \in \mathrm{Term}(n)$. 

$$\beta_t(x,Y,j,n) = \beta_{above}(x,Y,n) \qquad (10)$$
where $0 \le x \le Y$, $j \in \mathrm{Term}(n)$, and $j \in \mathrm{Final}(n)$ & $\neg\mathrm{Top}(n)$. 

$$\beta_t(0,Y,j,n) = 1.0 \qquad (11)$$
where $j \in \mathrm{Term}(n)$, and $j \in \mathrm{Final}(n)$ & $\mathrm{Top}(n)$. 

The previous equations reduce to the definitions for $\beta$ in the 
F/B algorithm when x = 0 and there are no nonterminal 
states in the network. 
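That reduction can be made concrete with a small sketch. It assumes a single top-level network with terminal states only, so the above-network term vanishes and only the first sum of the side term remains; final states act purely as constraints, as in the text. The function name and data layout are illustrative assumptions.

```python
def beta_terminal_only(A, B, final, obs):
    """Return beta[y][j] = beta_t(0, y, j, n) for a terminal-only top-level network.

    A[i][j] is a(i, j), B[j][k] is b(j, k), final is the set of final states,
    and obs[y] is the vocabulary index W(y).
    """
    n_states = len(A)
    last = len(obs) - 1
    beta = [[0.0] * n_states for _ in obs]
    for j in final:
        # Equation (11): only final states may end the sentence.
        beta[last][j] = 1.0
    for y in range(last - 1, -1, -1):
        for l in range(n_states):
            # First sum of equation (8): move to state i and emit word y + 1.
            beta[y][l] = sum(A[l][i] * beta[y + 1][i] * B[i][obs[y + 1]]
                             for i in range(n_states))
    return beta


# Same two-state example: state 0 emits symbol 0, state 1 emits symbol 1.
A = [[0.5, 0.5], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
beta = beta_terminal_only(A, B, {1}, [0, 0, 1])
# Sentence probability from the first position, with initial state 0.
sentence_prob = 1.0 * B[0][0] * beta[0][0]
```

The sentence probability recovered from the first trellis column agrees with the value obtained from the alpha pass at the last position for the same example, which is the usual consistency check for a forward/backward pair.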
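$\beta_{nte}(x, y, p, n)$: The probability of generating the prefix 
$w_0 \ldots w_{x-1}$ and suffix $w_{y+1} \ldots w_Y$ given that network n 
includes $w_x \ldots w_y$ and is at the end node of state p at position 
y. 

$$\beta_{nte}(x,y,p,n) = \beta_{side}(x,y,p,n) + \mathrm{Ind}(p \in \mathrm{Final}(n))\,\beta_{above}(x,y,n) \qquad (12)$$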
It can be seen that the values for $\beta_{nte}(x,y,p,n)$ are defined 
in terms of those in other networks which reference 
n via $\beta_{above}$. As a result this computation has a top-down 
order. In contrast, the $\alpha_{nte}(x,y,p,n)$ probabilities involve 
other networks that are referred to by network n and so 
are assigned in a bottom-up order. If the network topology 
for Chomsky normal form is substituted in equation (12), 
the recursion for the "Outer" probabilities of the I/O 
algorithm can be derived after further substitutions. The $\beta$ 
probabilities for final states then correspond to the outer 
probabilities. 
$\beta_{nts}(x, y, p, n)$: The probability of generating the prefix 
$w_0 \ldots w_{x-1}$ and suffix $w_y \ldots w_Y$ given that network n includes 
$w_x \ldots w_{y-1}$ and is at the start node of state p at position 
y. 
RE-ESTIMATION FORMULAE 
Once the alpha and beta probabilities are available, it is 
straightforward to obtain new parameter estimates $(\bar{A}, \bar{B}, \bar{I})$. 
The total probability P of a sentence is found from the 
top-level network $n_{Top}$. 
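A sketch of how this can be written, assuming P is read off the alpha trellis of the top-level network that starts at the first word and ends in a final state at the last word, as in the description accompanying equation (6):

```latex
P = \alpha_{total}(0, Y, n_{Top})
```

For a top-level network with terminal states only, this is the familiar F/B sum of the final-state alpha values at position Y.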
There are four different kinds of transition: 
1. Terminal node i to terminal node j. 
2. Terminal node i to nonterminal start node p. 
3. Nonterminal end node p to nonterminal start node q. 
4. Nonterminal end node p to terminal node i. 
The expected total number of times a transition is made 
from state i to state j conditioned on the observed sentence 
is $E(\xi_{i,j})$. The following formulae give $E(\xi)$ for each of the 
above cases: 
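[Formulae for the four cases; only their range conditions are recoverable: $x = 0$ for Top(n); $0 \le x \le Y$ for $\neg$Top(n); $x \le y < Y$.] 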
A new estimate $\bar{a}(i, j)$ for a typical transition is then: 
Only B matrix elements for terminal states are used, 
and are re-estimated as follows. The expected total number 
of times the k'th vocabulary entry $v_k$ is generated in state 
i conditioned on the observed sentence is $E(\eta_{i,k})$. A new 
estimate for $\bar{b}(i, k)$ can then be found: 
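As in the F/B algorithm, estimates of this kind can be formed as ratios of expected counts. The sketch below assumes that the denominator is the total expected count for the state (over all outgoing transitions for A, or over all vocabulary entries for B); the function name and the handling of states with no expected visits are illustrative choices.

```python
def normalize_rows(expected_counts):
    """Turn a matrix of expected counts into row-wise relative frequencies.

    expected_counts[i][j] holds an expected count such as the number of
    transitions from state i to state j, accumulated over the training text.
    """
    estimates = []
    for row in expected_counts:
        total = sum(row)
        # A state that was never visited keeps an all-zero row (an assumption).
        estimates.append([c / total if total > 0 else 0.0 for c in row])
    return estimates


new_A = normalize_rows([[1.0, 3.0], [0.0, 0.0]])
```

The same helper applies unchanged to the output matrix when each row holds the expected counts of vocabulary entries generated in one terminal state.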
The initial state matrix I is re-estimated as follows: 
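$$\bar{I}(i) \propto \sum_x \alpha_t(x,x,i,n)\,\beta_t(x,x,i,n) \qquad (24)$$
where $x = 0$ for $i \in \mathrm{Term}(n)$ & $\mathrm{Top}(n)$, and $0 \le x \le Y$ for $i \in \mathrm{Term}(n)$ & $\neg\mathrm{Top}(n)$. 

$$\bar{I}(p) \propto \sum_x \alpha_{nts}(x,x,p,n)\,\beta_{nts}(x,x,p,n) \qquad (25)$$
where $x = 0$ for $p \in \mathrm{Nonterm}(n)$ & $\mathrm{Top}(n)$, and $0 \le x \le Y$ for $p \in \mathrm{Nonterm}(n)$ & $\neg\mathrm{Top}(n)$. 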
IMPLEMENTATION 
Inspection of the preceding equations indicates that in 
similar fashion to the I/O algorithm, this algorithm has cu- 
bic complexity in both the length of a sentence and the 
number of states in the grammar. It has been implemented 
as a computer program, and verification was conducted in 
four stages, to facilitate debugging: 
1. Using top-level networks having only terminal states, 
check for exact numerical agreement of re-estimated 
parameters with those obtained by applying the F/B 
algorithm to the same examples. 
2. Create examples involving nonterminals, but which 
have finite-state equivalents, and verify as in stage 1. 
3. Create examples with several references to a given net- 
work, then build a finite-state equivalent in which the 
references are supplanted by network copies having 
tied parameters. Verify as in stage 1. 
4. Test using examples in Chomsky normal form and 
compare with results from the I/O algorithm. 
Unscaled arithmetic was employed to simplify the initial im- 
plementation. Subsequent versions will include logarithmic 
scaling to prevent inaccuracies due to arithmetic underflow. 
The representation would also benefit from the inclusion of 
a probability matrix for final states, rather than their use 
simply as constraints. 
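One common way to realize logarithmic scaling is to store log probabilities in the trellis and replace each sum of products with a log-sum-exp over sums of logs. The following is a generic sketch of that technique, not the planned implementation; the function names are illustrative.

```python
import math


def log_sum_exp(log_values):
    """Return log(sum(exp(v))) without underflow for very negative inputs."""
    m = max(log_values)
    if m == -math.inf:
        return -math.inf  # every term has probability zero
    return m + math.log(sum(math.exp(v - m) for v in log_values))


def log_alpha_step(prev_log_alpha, log_A_column, log_emit):
    """One trellis update in log space: log(sum_i alpha_i * a(i, j)) + log b."""
    terms = [la + lt for la, lt in zip(prev_log_alpha, log_A_column)]
    return log_sum_exp(terms) + log_emit


tiny = log_sum_exp([-1000.0, -1000.0])
step = log_alpha_step([math.log(0.5), math.log(0.5)],
                      [math.log(0.5), math.log(0.5)],
                      math.log(1.0))
```

Subtracting the maximum before exponentiating keeps at least one term equal to one, so the sum stays representable even when the individual probabilities would underflow in unscaled arithmetic.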
As the representation used by the algorithm is a superset 
of that used by the F/B algorithm, it conveniently permits 
"Staged Training". Components that are finite-state net- 
works can be pre-trained using the F/B algorithm, and then 
inserted into a context-free superstructure. This may be 
done to obtain improved initial estimates for the algorithm, 
and/or to reduce the total amount of computation involved. 
Lari and Young (1990) describe experiments using the I/O 
algorithm in which such pre-training was found useful. Us- 
ing the algorithm, the parameters of a context-free grammar 
can be trained from a corpus of untagged text. Values for 
the production probabilities are directly available, and no 
conversion of the rules to or from Chomsky normal form is 
needed. Once trained, a grammar can be used to predict the 
most likely syntactic structure of new sentences using a cor- 
responding analogue of the Cocke-Younger-Kasami parser. 
CONCLUSION 
An iterative algorithm for estimating the parameters of a 
hidden stochastic context-free grammar has been described, 
which is a generalization of the F/B algorithm and the I/O 
algorithm. The algorithm reduces to the F/B algorithm 
for finite-state grammars, and to the I/O algorithm when a 
context-free grammar is expressed in Chomsky normal form. 
ACKNOWLEDGEMENTS 
I would like to thank Phil Chou and John Maxwell of 
Xerox PARC, for their helpful comments on this paper. 
REFERENCES 
[1] Baker, J.K. (1979). Trainable Grammars for Speech Recognition. 
Speech Communication Papers for the 97th Meeting 
of the Acoustical Society of America (D.H. Klatt & J.J. Wolf, 
eds), pp. 547-550. 
[2] Baum, L.E. (1972). An Inequality and Associated Maximization 
Technique in Statistical Estimation for Probabilistic 
Functions of a Markov Process. Inequalities, 3, pp. 1-8. 
[3] Chitrao, M.V. & Grishman, R. (1990). Statistical Parsing 
of Messages. Proceedings of the DARPA Speech and Natural 
Language Workshop. 
[4] Fujisaki, T., Jelinek, F., Cocke, J., Black, E. & Nishino, T. 
(1989). A Probabilistic Parsing Method for Sentence Disambiguation. 
International Workshop on Parsing Technologies, 
Pittsburgh, PA. pp. 85-94. 
[5] Jelinek, F. (1985). Markov Source Modeling of Text Generation. 
Impact of Processing Techniques on Communication 
(J.K. Skwirzynski, ed), Nijhoff, Dordrecht. 
[6] Kupiec, J.M. (1989). Augmenting a Hidden Markov Model 
for Phrase-Dependent Word Tagging. Proceedings of the 
DARPA Speech and Natural Language Workshop, Cape 
Cod, MA pp. 92-98. Morgan Kaufmann. 
[7] Lari, K. & Young, S.J. (1990). The Estimation of Stochastic 
Context-Free Grammars Using the Inside-Outside Algorithm. 
Computer Speech and Language, 4, pp. 35-56. 
[8] Levinson, S.E., Rabiner, L.R. & Sondhi, M.M. (1983). An Introduction 
to the Application of the Theory of Probabilistic 
Functions of a Markov Process to Automatic Speech Recognition. 
Bell System Technical Journal, 62, pp. 1035-1074. 
