An Algorithm for Estimating 
the Parameters of Unrestricted Hidden 
Stochastic Context-Free Grammars 
Julian Kupiec 
Xerox Palo Alto Research Center 
3333 Coyote Hill Road 
Palo Alto, CA 94304 
ABSTRACT 
A new algorithm is presented for estimating the parameters of a stochastic context-free grammar (SCFG) from ordinary unparsed text. Unlike the Inside/Outside (I/O) algorithm, which requires a grammar to be specified in Chomsky normal form, the new algorithm can estimate an arbitrary SCFG without any need for transformation. The algorithm has worst-case cubic complexity in the length of a sentence and the number of nonterminals in the grammar. Instead of the binary branching tree structure used by the I/O algorithm, the new algorithm makes use of a trellis structure for computation. The trellis is a generalization of that used by the Baum-Welch algorithm, which is used for estimating hidden stochastic regular grammars. The paper describes the relationship between the trellis and the more typical parse tree representation. 
INTRODUCTION 
This paper describes an iterative method for estimating the parameters of a hidden stochastic context-free grammar (SCFG). The "hidden" aspect arises from the fact that some information is not available when the grammar is trained. When a parsed corpus is used for training, production probabilities can be estimated by counting the number of times each production is used in the parsed corpus. In the case of a hidden SCFG, the characteristic grammar is defined but the parse trees associated with the training corpus are not available. To proceed in this circumstance, some initial probabilities are assigned, which are iteratively re-estimated from their current values and the training corpus. They are adjusted to (locally) maximize the likelihood of generating the training corpus. The EM algorithm (Dempster et al., 1977) embodies the approach just mentioned; the new algorithm can be viewed as its application to arbitrary SCFGs. The use of unparsed training corpora is desirable because changes in the grammar rules could conceivably require manually re-parsing the training corpus several times during grammar development. Stochastic grammars enable ambiguity resolution to be performed on the rational basis of most likely interpretation. They also accommodate the development of more robust grammars having high coverage, where the attendant ambiguity is generally higher. 
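The counting estimate for a parsed corpus mentioned above can be sketched as follows. This is a minimal illustration only: the list `observed` stands for productions read off a hypothetical parsed corpus, and the function name `relative_frequencies` is an assumption, not something defined in the paper.

```python
from collections import Counter, defaultdict

# Hypothetical productions read off a tiny parsed corpus,
# each given as (left-hand side, right-hand side).
observed = [
    ("NP", ("Det", "Noun")),
    ("NP", ("Noun",)),
    ("NP", ("Det", "Noun")),
    ("NP", ("ADJP", "Noun")),
    ("ADJP", ("Adj",)),
]


def relative_frequencies(productions):
    """Estimate each production probability as count(rule) / count(lhs)."""
    counts = Counter(productions)
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}


probs = relative_frequencies(observed)
```

In the hidden case these counts are not observable, which is why the algorithm described in this paper replaces them with expected counts.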
Previous approaches to the problem of estimating hidden SCFGs include parsing schemes in which all derivations of all sentences in the training corpus are enumerated (Fujisaki et al., 1989; Chitrao & Grishman, 1990). An efficient alternative is the Inside/Outside (I/O) algorithm (Baker, 1979), which, like the new algorithm, is limited to cubic complexity in both the number of nonterminals and the length of a sentence. The I/O algorithm requires that the grammar be in Chomsky normal form (CNF). The new algorithm has the same complexity, but does not have this restriction, dispensing with the need to transform to and from CNF. 
TERMINOLOGY 
The training corpus can be conveniently segmented into sentences for purposes of training, each sentence comprising a sequence of words. A typical one may consist of Y + 1 words, indexed from 0 to Y: 

$$ w_0, w_1, \ldots, w_y, \ldots, w_Y $$

The lookup function W(y) returns the index k of the vocabulary entry $v_k$ matching the word $w_y$ at position y in the sentence. 
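A minimal sketch of the lookup function W(y) follows. The vocabulary contents, the example sentence and the helper name `index_of` are all assumptions made for illustration.

```python
# Hypothetical vocabulary v_0 ... v_3 and a sentence w_0 ... w_Y with Y = 1.
vocabulary = ["a", "cat", "dog", "the"]
index_of = {word: k for k, word in enumerate(vocabulary)}
sentence = ["the", "cat"]


def W(y):
    """Return the index k of the vocabulary entry matching the word at position y."""
    return index_of[sentence[y]]
```

With this toy data, `W(0)` is the index of "the" and `W(1)` is the index of "cat".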
The algorithm uses an extension of the representation and terminology used for "hidden Markov models" (hidden stochastic regular grammars), for which the Baum-Welch algorithm (Baum, 1972) is applicable (and which is also called the Forward/Backward (F/B) algorithm). Grammar rules are represented as networks and illustrated graphically, maintaining a correspondence with the trellis structure on which the computation can be conveniently represented. The terminology is closely related to that of Levinson, Rabiner & Sondhi (1983) and also Lari & Young (1990). 

ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 387 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 

[Figure 1: Networks for Lexical Rules (network NP, with states labeled Det, ADJP and Noun)] 
A set of $\mathcal{N}$ different nonterminals is represented by $\mathcal{N}$ networks. A component network for the nonterminal labeled n has a parameter set (A, B, I, N, F, Top, n). To uniquely identify an element of the parameter set requires that it be a function of its nonterminal label (e.g. A(n), I(n) etc.). However, this notation has been dropped to make formulae less cumbersome. A network labeled NP is shown in Figure 1, which represents the following rules: 
NP ⇒ Noun (0.2) 
NP ⇒ Det Noun (0.2) 
NP ⇒ Det ADJP Noun (0.2) 
NP ⇒ ADJP Noun (0.4) 
Noun ⇒ "cat" (0.002) 
Noun ⇒ "dog" (0.001) 
Det ⇒ "the" (0.5) 
Det ⇒ "a" (0.2) 
The rule NP ⇒ Noun (0.2) means that if the NP rule is used, the probability is 0.2 that it produces a single Noun. In Figure 1, states are represented by circles with numbers inside to index them. Nonterminal states are shown with double circles and represent references to other networks, such as ADJP. States marked with single circles are called terminal states and represent part-of-speech categories. When a transition is made to a terminal state, a word of the current training sentence is generated. The word must have the same category as the state that generated it. Rules of the form Noun ⇒ "cat" (0.002) and Noun ⇒ "dog" (0.001) are collapsed into a state-dependent probability vector b(j), which is an element of the output matrix B. Elements of the vector such as b(j, W(y)) represent the probability of seeing word $w_y$ in terminal state j. A transition to a nonterminal state does not in itself generate any words; however, terminal states within the referenced network will do so. The parameter N is a matrix which indicates the label (e.g. n, NP, ADJP) of the network that a nonterminal state refers to. The probability of making a transition from state i to state j is labeled a(i, j), and collectively these probabilities form the transition matrix A. The initial matrix I contains the production probabilities for rules that are modelled by the network. They are indicated in Figure 1 as numbers beneath the state, if they are non-zero. I(i) can be equivalently viewed as the probability that some subsequence of n is started at state i. The parameter F is the set of final states; any sequence accepted by the network must terminate on a final state. In Figure 1 final states are designated with the annotation "F". The boolean value Top indicates whether the network is the top-level network. Only one network may be assigned as the top-level network, which models productions involving the root symbol of a grammar. 

[Figure 2: Equivalent Network (network NP)] 

[Figure 3: Representation for Terminal Productions] 
An equivalent network for the same set of rules is shown in Figure 2. The lexical rules can be written compactly as networks, with fewer states. The transitions from the determiner state each have probability 0.5 (i.e. a(1,2) = a(1,3) = 0.5). It should be noted that the algorithm can operate on either network. 
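The equivalence between a compact network and the NP rules can be checked by enumerating every path from a start state to a final state and multiplying I(i) by the a(i, j) values along the way. The sketch below does this under stated assumptions: state 1 is the determiner (as in the text), the assignment of states 2 and 3 to ADJP and Noun is a guess, and the initial probabilities are chosen so that the path products agree with the four NP rules.

```python
# Compact NP network in the spirit of Figure 2.
# Assumed numbering: 1 = Det, 2 = ADJP, 3 = Noun.
label = {1: "Det", 2: "ADJP", 3: "Noun"}
I = {1: 0.4, 2: 0.4, 3: 0.2}                  # initial probabilities I(i)
a = {(1, 2): 0.5, (1, 3): 0.5, (2, 3): 1.0}   # transition probabilities a(i, j)
F = {3}                                       # set of final states


def expand(state, prob, rhs, out):
    """Follow transitions from `state`, recording each path that ends in F."""
    rhs = rhs + (label[state],)
    if state in F:
        out[rhs] = out.get(rhs, 0.0) + prob
    for (i, j), p in a.items():
        if i == state:
            expand(j, prob * p, rhs, out)


def productions():
    """Return a map from right-hand side to the probability the network gives it."""
    out = {}
    for start, p in I.items():
        expand(start, p, (), out)
    return out


rules = productions()
```

Under these assumptions `rules` contains exactly the four NP right-hand sides with probabilities 0.2, 0.2, 0.2 and 0.4, which is why either network can be used.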
TRELLIS DIAGRAM 
Trellis diagrams conveniently relate computational quantities to the network structure and a training sentence. Each network n has a set of Y + 1 trellises for subsequences of a sentence $w_0 \ldots w_Y$, starting at each different position and ending at subsequent ones. A single trellis spanning positions 0...2 is shown in Figure 4 for network NP. Nonterminal states are associated with a row of start nodes indicating where daughter constituents may start, and a row of end nodes that indicate where they end. A pair of start/end nodes thus refers to a daughter nonterminal constituent. In Figure 4, the ADJP network is referenced via the start state at position 0. An adjective is then generated by a terminal state in the trellis for the ADJP network, followed by a transition and another adjective. The ADJP network is left at position 1, and a transition is made to the noun state where the word "cat" is generated. Terminal states are associated with a single row of nodes in the trellis (they represent terminal productions that span only a single position). The path taken through the trellis is shown with a broken line. A path through different trellises has a corresponding unique tree representation, as exemplified in Figure 5. In cases where parses are ambiguous, several paths exist corresponding to the alternative derivations. 

We shall next consider the computation of the probabilities of the paths. Two basic quantities are involved, namely alpha and beta probabilities. Loosely speaking, the alphas represent probabilities of subtrees associated with nonterminals, while the betas refer to the rest of the tree structure external to the subtrees. Subsequently, products of these quantities will be formed, which represent the probabilities of productions being used in generating a sentence. These are summed over all sentences in the training corpus to obtain the expected number of times each production is used, based on the current production probabilities and the training corpus. These are used as frequency counts would be for a parsed corpus, to form ratios that represent new estimates of the production probabilities. The procedure is iterated several times until the estimates do not change between iterations (i.e. the overall likelihood of producing the training corpus no longer increases). 
[Figure 4: A Path through a Trellis Diagram. Network NP, with rows for the ADJP start and end nodes, Det and Noun, over the words "big black cat".] 

[Figure 5: The Equivalent Tree. NP dominates ADJP (Adj "big", Adj "black") and Noun "cat".] 

The algorithm makes use of one set of trellis diagrams to compute alpha probabilities, and another for beta probabilities. These are both split into terminal, nonterminal-start and nonterminal-end probabilities, corresponding to the three different types of nodes in the trellis diagram. The alpha set are labeled $\alpha_t$, $\alpha_{nts}$ and $\alpha_{nte}$ respectively. The algorithm was originally formulated using solely the trellis representation (Kupiec, 1991); however, the definitions that follow will also be related to the constituent structures used in the equivalent parse trees. In the following equations, three sets will be mentioned: 
1. Term(n) The set of terminal states in network n. 
2. Nonterm(n) This is the set of nonterminal states 
in network n. 
3. Final(n) The set F of final states in network n. 
$\alpha_t(x,y,j,n)$: The probability that network n generates the words $w_x \ldots w_y$ inclusive and is at the node for terminal state j at position y. 

$$ \alpha_t(x,y,j,n) = \Big[\sum_{i} \alpha_t(x,y-1,i,n)\,a(i,j)\Big]\, b(j,W(y)) + \Big[\sum_{q} \alpha_{nte}(x,y-1,q,n)\,a(q,j)\Big]\, b(j,W(y)) \tag{1} $$
$$ 0 < y \le Y,\quad 0 \le x < y,\quad i,j \in \mathrm{Term}(n),\quad q \in \mathrm{Nonterm}(n) $$

$$ \alpha_t(x,x,j,n) = I(j)\, b(j,W(x)) \tag{2} $$
$$ 0 \le x \le Y,\quad j \in \mathrm{Term}(n) $$

$\alpha_t(x,y,j,n)$ represents a constituent for nonterminal n spanning positions x...y. It is formed by extending an incomplete constituent for n, by addition of the terminal $w_y$ at state j. The two terms indicate cases where the constituent previously ended on either a terminal or another constituent completed at y - 1 (as in Figure 5, where the complete ADJP constituent is followed by the noun "cat"). If j is a final state the extended constituent is complete. 

$\alpha_{nts}(x,y,p,n)$: The probability that network n generates the words $w_x \ldots w_{y-1}$ inclusive, and is at the start node of nonterminal state p at position y. 

$$ \alpha_{nts}(x,y,p,n) = \sum_{i} \alpha_t(x,y-1,i,n)\,a(i,p) + \sum_{q} \alpha_{nte}(x,y-1,q,n)\,a(q,p) \tag{3} $$
$$ 0 < y \le Y,\quad 0 \le x < y,\quad p,q \in \mathrm{Nonterm}(n),\quad i \in \mathrm{Term}(n) $$

$$ \alpha_{nts}(x,x,p,n) = I(p) \tag{4} $$
$$ 0 \le x \le Y,\quad p \in \mathrm{Nonterm}(n) $$

$\alpha_{nts}(x,y,p,n)$ represents an incomplete constituent for nonterminal n whose left subtrees span x...y - 1, and whose next extension will involve trees dominated by N(p, n), the nonterminal referred to by state p. 

$\alpha_{nte}(x,y,p,n)$: The probability that network n generates the words $w_x \ldots w_y$ inclusive, and is at the end node of nonterminal state p at position y. 

$$ \alpha_{nte}(x,y,p,n) = \sum_{x \le v \le y} \alpha_{nts}(x,v,p,n)\,\alpha_{total}(v,y,N(p,n)) \tag{5} $$
$$ 0 \le y \le Y,\quad 0 \le x \le y,\quad p \in \mathrm{Nonterm}(n) $$

$$ \alpha_{total}(x,y,n) = \sum_{i} \alpha_t(x,y,i,n) + \sum_{p} \alpha_{nte}(x,y,p,n) \tag{6} $$
$$ 0 \le y \le Y,\quad 0 \le x \le y,\quad i \in \mathrm{Term}(n)\ \&\ i \in \mathrm{Final}(n),\quad p \in \mathrm{Nonterm}(n)\ \&\ p \in \mathrm{Final}(n) $$

$\alpha_{nte}(x,y,p,n)$ represents the probability of a constituent for n that spans x...y, formed by extending the various constituents $\alpha_{nts}(x,v,p,n)$ (ending at v - 1) with corresponding completed constituents starting at v, ending at y and dominated by N(p, n). 
The quantity $\alpha_{total}(v,y,n)$ refers to the probability that network n generates the words $w_v \ldots w_y$ inclusive and is in a final state of n at position y. Equivalently, it is the probability that nonterminal n dominates all derivations that span positions v...y. The $\alpha_{total}$ probabilities correspond to the "Inner" (bottom-up) probabilities of the I/O algorithm. If a network corresponding to Chomsky normal form is substituted in equation (6), the recursion for the inner probabilities of the I/O algorithm will be produced after further substitutions using equations (1)-(6). 
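As an illustration of the substitution just described, consider a single hypothetical Chomsky normal form rule $A \Rightarrow B\,C$ with probability $P(A \Rightarrow B\,C)$. Assume it is modelled in network A by a nonterminal state p that refers to B, with $I(p) = P(A \Rightarrow B\,C)$ and no incoming transitions, followed by a final nonterminal state q that refers to C, with $a(p,q) = 1$ and $I(q) = 0$. This network layout is an assumption made for the sketch, and the summation limits are taken to be $x \le v \le y$ in equation (5).

```latex
\begin{aligned}
\alpha_{nte}(x,u,p,A) &= I(p)\,\alpha_{total}(x,u,B)
  && \text{by (4) and (5), since only the } v = x \text{ term survives} \\
\alpha_{nts}(x,v,q,A) &= \alpha_{nte}(x,v-1,p,A)\,a(p,q)
   = I(p)\,\alpha_{total}(x,v-1,B), \quad v > x
  && \text{by (3)} \\
\alpha_{total}(x,y,A) &= \alpha_{nte}(x,y,q,A)
   = \sum_{x < v \le y} P(A \Rightarrow B\,C)\,
     \alpha_{total}(x,v-1,B)\,\alpha_{total}(v,y,C)
  && \text{by (5) and (6), using } I(q) = 0
\end{aligned}
```

With one such pair of states per binary rule, equation (6) sums the last line over all rules for A, which has the form of the inner-probability recursion.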
In the previous equations (5) and (6) it can be seen that the $\alpha_{nte}$ probabilities for a network are defined recursively. They will never be self-referential if the grammar is cycle-free (i.e. there are no derivations A ⇒ A for any nonterminal production A). Although only cycle-free grammars are of interest here, it is worth mentioning that if cycles do exist (with associated probabilities less than unity), the recursions form a geometric series which has a finite sum. 
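The alpha recursions of equations (1)-(6) can be sketched with memoized functions, assuming a cycle-free grammar so that the mutual recursion terminates. Everything concrete here is an assumption made for illustration: the dictionary layout `NETS`, the state numbering, the ADJP rules, the Adj word probabilities, the tying of b(j, ·) to the part-of-speech category of state j, and the summation limits $x \le v \le y$ in equation (5).

```python
from functools import lru_cache

# Hypothetical toy networks. "term" maps terminal states to categories,
# "nonterm" maps nonterminal states to the network label N(p, n).
NETS = {
    "NP": {
        "term": {1: "Det", 3: "Noun"},
        "nonterm": {2: "ADJP"},
        "I": {1: 0.4, 2: 0.4, 3: 0.2},
        "a": {(1, 2): 0.5, (1, 3): 0.5, (2, 3): 1.0},
        "final": {3},
    },
    "ADJP": {  # assumed rules: ADJP => Adj (0.6), ADJP => Adj Adj (0.4)
        "term": {0: "Adj", 1: "Adj", 2: "Adj"},
        "nonterm": {},
        "I": {0: 0.6, 1: 0.4},
        "a": {(1, 2): 1.0},
        "final": {0, 2},
    },
}
# Output probabilities per category; the Adj values are invented.
B = {
    "Det": {"the": 0.5, "a": 0.2},
    "Noun": {"cat": 0.002, "dog": 0.001},
    "Adj": {"big": 0.1, "black": 0.05},
}


def inside(sentence):
    """Return alpha_total(x, y, n) for the given list of words."""

    def b(n, j, y):
        return B[NETS[n]["term"][j]].get(sentence[y], 0.0)

    def a(n, i, j):
        return NETS[n]["a"].get((i, j), 0.0)

    def incoming(x, y, s, n):
        # Probability mass arriving at state s from nodes at position y - 1.
        net = NETS[n]
        return (sum(alpha_t(x, y - 1, i, n) * a(n, i, s) for i in net["term"])
                + sum(alpha_nte(x, y - 1, q, n) * a(n, q, s) for q in net["nonterm"]))

    @lru_cache(maxsize=None)
    def alpha_t(x, y, j, n):
        if y == x:                                       # equation (2)
            return NETS[n]["I"].get(j, 0.0) * b(n, j, x)
        return incoming(x, y, j, n) * b(n, j, y)         # equation (1)

    @lru_cache(maxsize=None)
    def alpha_nts(x, y, p, n):
        if y == x:                                       # equation (4)
            return NETS[n]["I"].get(p, 0.0)
        return incoming(x, y, p, n)                      # equation (3)

    @lru_cache(maxsize=None)
    def alpha_nte(x, y, p, n):                           # equation (5)
        m = NETS[n]["nonterm"][p]
        return sum(alpha_nts(x, v, p, n) * alpha_total(v, y, m)
                   for v in range(x, y + 1))

    @lru_cache(maxsize=None)
    def alpha_total(x, y, n):                            # equation (6)
        net = NETS[n]
        return (sum(alpha_t(x, y, i, n) for i in net["term"] if i in net["final"])
                + sum(alpha_nte(x, y, p, n) for p in net["nonterm"] if p in net["final"]))

    return alpha_total


p_the_cat = inside(["the", "cat"])(0, 1, "NP")
p_big_black_cat = inside(["big", "black", "cat"])(0, 2, "NP")
```

Memoization stands in for the trellis here: each (x, y, state, network) node is computed once. A table-filling implementation ordered bottom-up over networks would avoid deep recursion on long sentences.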
The alpha probabilities are all computed first because the beta probabilities make use of them. The latter are defined recursively in terms of trees that are external to a given constituent, and as a result the recursions are less obvious than those for the alpha probabilities. The basic recursion rests on the quantity $\beta_{nte}$, which involves the following functions $\beta_{above}$ and $\beta_{side}$: 
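$$ \beta_{above}(x,y,n) = \sum_{m \in \mathcal{N}}\ \sum_{r : N(r,m) = n}\ \sum_{0 \le v \le x} \alpha_{nts}(v,x,r,m)\,\beta_{nte}(v,y,r,m) \tag{7} $$
$$ r \in \mathrm{Nonterm}(m) $$

$$ \beta_{side}(x,y,l,n) = \sum_{i} a(l,i)\,\beta_t(x,y+1,i,n)\,b(i,W(y+1)) + \sum_{q} a(l,q) \sum_{y < w \le Y} \alpha_{total}(y+1,w,N(q,n))\,\beta_{nte}(x,w,q,n) \tag{8} $$
$$ i \in \mathrm{Term}(n),\quad q \in \mathrm{Nonterm}(n) $$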
Given a constituent n spanning x...y, $\beta_{above}(x,y,n)$ indicates how the constituents spanning v...y and labeled m that immediately dominate n relate to the constituents that are in turn external to m via $\beta_{nte}(v,y,r,m)$. This situation is shown in Figure 6, where for simplicity $\beta_{nte}(v,y,r,m)$ has not been graphically indicated. Note that m can dominate x...y as well as left subtrees. $\beta_{side}(x,y,l,n)$ defines another recursion for a constituent labeled n that spans x...y, and is in state l at time y. The recursion relates the addition of right subtrees (spanning y + 1...w) to the remaining external tree, via $\beta_{nte}(x,w,q,n)$. This is depicted in Figure 7 (again the external trees represented by $\beta_{nte}(x,w,q,n)$ are not shown). Also omitted from the figure is the first term of Equation 8, which relates the addition of a single terminal at position y + 1 to an external tree defined by $\beta_t(x,y+1,i,n)$. $\beta_t$ and the various other probabilities for a beta trellis are defined next: 
$\beta_t(x,y,j,n)$: The probability of generating the prefix $w_0 \ldots w_{x-1}$ and suffix $w_{y+1} \ldots w_Y$ given that network n generated $w_x \ldots w_y$ and is in terminal state j at position y. The indicator function Ind() is used in subsequent equations. Its value is unity when its argument is true and zero otherwise. 

$$ \beta_t(x,y,j,n) = \beta_{side}(x,y,j,n) + \mathrm{Ind}(j \in \mathrm{Final}(n))\,\beta_{above}(x,y,n) \tag{9} $$
$$ 0 \le y < Y,\quad 0 \le x \le y,\quad j \in \mathrm{Term}(n) $$

$$ \beta_t(x,Y,j,n) = \beta_{above}(x,Y,n) \tag{10} $$
$$ 0 \le x \le Y,\quad j \in \mathrm{Term}(n),\quad j \in \mathrm{Final}(n)\ \&\ \neg\mathrm{Top}(n) $$

$$ \beta_t(0,Y,j,n) = 1.0 \tag{11} $$
$$ j \in \mathrm{Term}(n),\quad j \in \mathrm{Final}(n)\ \&\ \mathrm{Top}(n) $$

The first term in Equation 9 describes the relationship of the tree external to x...y + 1 to the tree external to x...y, with respect to state j generating the terminal $w_{y+1}$ at time y + 1. If the constituent n spanning x...y is complete, the second term describes the probability of the external tree via constituents that immediately dominate n. 

[Figure 6: Part of $\beta_{above}(x,y,n)$, drawn over positions 0 ... v ... x ... y] 

[Figure 7: Part of $\beta_{side}(x,y,l,n)$, drawn over positions x ... y, y+1 ... w ... Y] 
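$\beta_{nte}(x,y,p,n)$: The probability of generating the prefix $w_0 \ldots w_{x-1}$ and suffix $w_{y+1} \ldots w_Y$ given that network n generated $w_x \ldots w_y$ and the end node of state p is reached at position y. 

$$ \beta_{nte}(x,y,p,n) = \beta_{side}(x,y,p,n) + \mathrm{Ind}(p \in \mathrm{Final}(n))\,\beta_{above}(x,y,n) \tag{12} $$
$$ 0 \le y < Y,\quad 0 \le x \le y,\quad p \in \mathrm{Nonterm}(n) $$

$$ \beta_{nte}(x,Y,p,n) = \beta_{above}(x,Y,n) \tag{13} $$
$$ 0 \le x \le Y,\quad p \in \mathrm{Nonterm}(n),\quad p \in \mathrm{Final}(n)\ \&\ \neg\mathrm{Top}(n) $$

$$ \beta_{nte}(0,Y,p,n) = 1.0 \tag{14} $$
$$ p \in \mathrm{Nonterm}(n),\quad p \in \mathrm{Final}(n)\ \&\ \mathrm{Top}(n) $$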
$\beta_{nte}(x,y,p,n)$ has the same form as the previous formula for $\beta_t$, but is used with nonterminal states. Via $\beta_{above}$ and $\beta_{side}$ it relates the tree external to the constituent n (spanning x...y) to the trees external to v...y and x...w. During the recursion, the trees shown in Figures 6 and 7 are substituted into each other (at the positions shown with shaded areas). Thus the external trees are successively extended to the left and right, until the root of the outermost tree is reached. It can be seen that the values for $\beta_{nte}(x,y,p,n)$ are defined in terms of those in other networks which reference n via $\beta_{above}$. As a result this computation has a top-down order. In contrast, the $\alpha_{nte}(x,y,p,n)$ probabilities involve other networks that are referred to by network n and so are assigned in a bottom-up order. If the network topology for Chomsky normal form is substituted in equation (12), the recursion for the "Outer" probabilities of the I/O algorithm can be derived after further substitutions. The $\beta_{nte}$ probabilities for final states then correspond to the outer probabilities. 
$\beta_{nts}(x,y,p,n)$: The probability of generating the prefix $w_0 \ldots w_{x-1}$ and suffix $w_y \ldots w_Y$ given that network n generated $w_x \ldots w_{y-1}$ and is at the start node of state p at position y. 

$$ \beta_{nts}(x,y,p,n) = \sum_{y \le v \le Y} \alpha_{total}(y,v,N(p,n))\,\beta_{nte}(x,v,p,n) \tag{15} $$
$$ 0 \le x \le Y,\quad x \le y \le Y,\quad p \in \mathrm{Nonterm}(n) $$
RE-ESTIMATION FORMULAE 
Once the alpha and beta probabilities are available, it is a straightforward matter to obtain new parameter estimates (A, B, I). The total probability P of a sentence is found from the top-level network $n_{Top}$: 

$$ P = \alpha_{total}(0,Y,n_{Top}), \qquad \mathrm{Top}(n_{Top}) \tag{16} $$
There are four different kinds of transition: 
1. Terminal node i to terminal node j. 
2. Terminal node i to nonterminal start node p. 
3. Nonterminal end node p to nonterminal start node q. 
4. Nonterminal end node p to terminal node i. 
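The expected total number of times a transition is made from state i to state j, conditioned on the observed sentence, is $E(\phi_{i,j})$. The following formulae give $E(\phi)$ for each of the above cases: 

$$ E(\phi_{i,j}) = \frac{1}{P} \sum_{x} \sum_{y} \alpha_t(x,y,i,n)\,a(i,j)\,b(j,W(y+1))\,\beta_t(x,y+1,j,n) \tag{17} $$

$$ E(\phi_{i,p}) = \frac{1}{P} \sum_{x} \sum_{y} \alpha_t(x,y,i,n)\,a(i,p)\,\beta_{nts}(x,y+1,p,n) \tag{18} $$

$$ E(\phi_{p,q}) = \frac{1}{P} \sum_{x} \sum_{y} \alpha_{nte}(x,y,p,n)\,a(p,q)\,\beta_{nts}(x,y+1,q,n) \tag{19} $$

$$ E(\phi_{p,i}) = \frac{1}{P} \sum_{x} \sum_{y} \alpha_{nte}(x,y,p,n)\,a(p,i)\,b(i,W(y+1))\,\beta_t(x,y+1,i,n) \tag{20} $$

$$ x = 0 \text{ if } \mathrm{Top}(n);\qquad 0 \le x < Y \text{ if } \neg\mathrm{Top}(n);\qquad x \le y < Y $$

A new estimate $\hat{a}(i,j)$ for a typical transition is then: 

$$ \hat{a}(i,j) = \frac{E(\phi_{i,j})}{\sum_{j} E(\phi_{i,j})} \tag{21} $$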
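The ratio in equation (21) amounts to normalizing the expected counts over all transitions that leave the same state. A minimal sketch follows; the expected counts are invented numbers and the function name `reestimate_transitions` is an assumption.

```python
def reestimate_transitions(expected):
    """Map expected counts E(phi_ij), keyed by (i, j), to new estimates a_hat(i, j)."""
    row_totals = {}
    for (i, _j), e in expected.items():
        row_totals[i] = row_totals.get(i, 0.0) + e
    # States with no expected outgoing transitions are left out of the result.
    return {(i, j): e / row_totals[i]
            for (i, j), e in expected.items() if row_totals[i] > 0.0}


# Hypothetical expected counts accumulated over a training corpus.
expected = {(1, 2): 3.0, (1, 3): 1.0, (2, 3): 2.5}
a_hat = reestimate_transitions(expected)
```

In practice the counts from equations (17)-(20) would be accumulated over every sentence in the corpus before this normalization is applied.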
Only B matrix elements for terminal states are used, and are re-estimated as follows. The expected total number of times the k'th vocabulary entry $v_k$ is generated in state i, conditioned on the observed sentence, is $E(\eta_{i,k})$. A new estimate for b(i, k) can then be found: 

$$ E(\eta_{i,k}) = \frac{1}{P} \sum_{x}\ \sum_{y : W(y) = k} \alpha_t(x,y,i,n)\,\beta_t(x,y,i,n) \tag{22} $$
$$ x = 0,\ i \in \mathrm{Term}(n)\ \&\ \mathrm{Top}(n);\qquad 0 \le x \le Y,\ i \in \mathrm{Term}(n)\ \&\ \neg\mathrm{Top}(n);\qquad x \le y \le Y $$

$$ \hat{b}(i,k) = \frac{E(\eta_{i,k})}{\sum_{k} E(\eta_{i,k})} \tag{23} $$
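The initial state matrix I is re-estimated as follows: 

$$ \hat{I}(i) = \frac{1}{P} \sum_{x} \alpha_t(x,x,i,n)\,\beta_t(x,x,i,n) \tag{24} $$
$$ x = 0,\ i \in \mathrm{Term}(n)\ \&\ \mathrm{Top}(n);\qquad 0 \le x \le Y,\ i \in \mathrm{Term}(n)\ \&\ \neg\mathrm{Top}(n) $$

$$ \hat{I}(p) = \frac{1}{P} \sum_{x} \alpha_{nts}(x,x,p,n)\,\beta_{nts}(x,x,p,n) \tag{25} $$
$$ x = 0,\ p \in \mathrm{Nonterm}(n)\ \&\ \mathrm{Top}(n);\qquad 0 \le x \le Y,\ p \in \mathrm{Nonterm}(n)\ \&\ \neg\mathrm{Top}(n) $$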
DISCUSSION 
The preceding equations have been implemented as a computer program, and this section describes some practical issues with regard to a robust implementation. The first concerns the size of the B matrix. For pedagogical reasons individual words were shown as elements of this matrix. A vocabulary exceeding 220,000 words is actually used by the program, so it is not practical to try to reliably estimate the B matrix probabilities for each word individually. Instead, only common words are represented individually; the rest of the words in the dictionary are partitioned into word equivalence classes (Kupiec, 1989) such that all words that can function as a particular set of part-of-speech categories are given the same label. Thus "type", "store" and "dog" would all be labelled as singular-noun-or-uninflected-verb. For the category set that is used, only 250 different equivalence classes are necessary to cover the dictionary. 
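The partition into word equivalence classes can be sketched as a mapping from each word to a label built from the set of categories it can take. The lexicon entries, the category names, the set of common words and the label format below are assumptions chosen to reproduce the example in the text.

```python
# Hypothetical dictionary: word -> set of part-of-speech categories.
lexicon = {
    "type": {"singular-noun", "uninflected-verb"},
    "store": {"singular-noun", "uninflected-verb"},
    "dog": {"singular-noun", "uninflected-verb"},
    "the": {"determiner"},
    "black": {"adjective"},
}
# Common words keep their own entry in the B matrix (assumed list).
common_words = {"the"}


def class_label(word):
    """Return the word itself if it is common, else its equivalence class label."""
    if word in common_words:
        return word
    return "-or-".join(sorted(lexicon[word]))


labels = {word: class_label(word) for word in lexicon}
```

Words that share a category set collapse onto one column of the B matrix, which is what keeps the number of parameters to estimate manageable.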
It is important that the initial guesses for parameter values are well-informed. All productions for any given nonterminal were initially assumed to be equally likely, but the B matrix values were conveniently copied from a trained hidden Markov model (HMM) used for text-tagging. The HMM was also found very useful for verifying correct program operation. The algorithm has worst-case cubic complexity in both the length of a sentence and the number of nonterminal states in the grammar. An index can be used to efficiently update terminal states. For any word (or equivalence class) the index determines which terminal states require updating. Also, when all probabilities in a column of any trellis become zero, no further computation is required for any other columns in the trellis. Grammars are currently being developed, and initial experiments have typically used eight training iterations, on training corpora comprising 10,000 sentences or more (having an average of about 22 words per sentence). 
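One way to realize the index described above is an inverted map from each word (or equivalence class) to the terminal states that can generate it. The sketch below is only a guess at such a structure: the layout of `b`, the VP network and its probability are hypothetical.

```python
from collections import defaultdict

# Hypothetical output probabilities: (network, terminal state) -> {word: probability}.
b = {
    ("NP", 1): {"the": 0.5, "a": 0.2},
    ("NP", 3): {"cat": 0.002, "dog": 0.001},
    ("VP", 0): {"dog": 0.0005},
}


def build_index(b):
    """Map each word to the terminal states with a non-zero probability for it."""
    index = defaultdict(list)
    for state, dist in b.items():
        for word, p in dist.items():
            if p > 0.0:
                index[word].append(state)
    return index


index = build_index(b)
```

When the word at position y is processed, only the states listed in `index[word]` need their trellis nodes updated; all other terminal states contribute zero.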
To complement the training algorithm, a parser has also been constructed which is a corresponding analogue of the Cocke-Younger-Kasami parser. The parser is quite similar to the training algorithm, except that maximum probability paths are propagated instead of sums of probabilities. Trained grammars are used by the parser to predict the most likely syntactic structure of new sentences. The applications for which the parser was developed make use of incomplete parses if a sentence is not covered by the grammar; thus top-down filtering is not used. 

REFERENCES

[1] Baker, J.K. (1979). Trainable Grammars for Speech Recognition. Speech Communication Papers for the 97th Meeting of the Acoustical Society of America (D.H. Klatt & J.J. Wolf, eds), pp. 547-550.

[2] Baum, L.E. (1972). An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of a Markov Process. Inequalities, 3, pp. 1-8.

[3] Chitrao, M.V. & Grishman, R. (1990). Statistical Parsing of Messages. Proceedings of the DARPA Speech and Natural Language Workshop.

[4] Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, B39, pp. 1-38.

[5] Fujisaki, T., Jelinek, F., Cocke, J., Black, E. & Nishino, T. (1989). A Probabilistic Parsing Method for Sentence Disambiguation. International Workshop on Parsing Technologies, Pittsburgh, PA, pp. 85-94.

[6] Jelinek, F. (1985). Markov Source Modeling of Text Generation. Impact of Processing Techniques on Communication (J.K. Skwirzinski, ed), Nijhoff, Dordrecht.

[7] Kupiec, J.M. (1989). Augmenting a Hidden Markov Model for Phrase-Dependent Word Tagging. Proceedings of the DARPA Speech and Natural Language Workshop, Cape Cod, MA, pp. 92-98. Morgan Kaufmann.

[8] Kupiec, J.M. (1991). A Trellis-Based Algorithm for Estimating the Parameters of a Hidden Stochastic Context-Free Grammar. Proceedings of the DARPA Speech and Natural Language Workshop, Pacific Grove, CA, pp. 241-246. Morgan Kaufmann.

[9] Lari, K. & Young, S.J. (1990). The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm. Computer Speech and Language, 4, pp. 35-56.

[10] Levinson, S.E., Rabiner, L.R. & Sondhi, M.M. (1983). An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition. Bell System Technical Journal, 62, pp. 1035-1074.
