Computation of the Probability of 
Initial Substring Generation by Stochastic 
Context-Free Grammars 
Frederick Jelinek* 
John D. Lafferty* 
IBM T. J. Watson Research Center 
Speech recognition language models are based on probabilities $P(w_{k+1} = v \mid w_1 w_2 \ldots w_k)$
that the next word $w_{k+1}$ will be any particular word $v$ of the vocabulary, given that the
word sequence $w_1, w_2, \ldots, w_k$ is hypothesized to have been uttered in the past. If probabilistic
context-free grammars are to be used as the basis of the language model, it will be necessary
to compute the probability that successive application of the grammar rewrite rules (beginning
with the sentence start symbol $s$) produces a word string whose initial substring is an arbitrary
sequence $w_1, w_2, \ldots, w_{k+1}$. In this paper we describe a new algorithm that achieves the required
computation in at most a constant times $k^3$ steps.
1. Introduction 
The purpose of this article is to develop an algorithm for computing the probability 
that a stochastic context-free grammar (SCFG) (that is, a grammar whose production 
rules have attached to them a probability of being used) generates an arbitrary initial 
substring of terminals. Thus, we treat the same problem recently considered by Wright 
and Wrigley (1989) from the point of view of LR grammars. 
Probabilistic methods have been shown most effective in automatic speech recog- 
nition. Recognition (actually transcription) of natural unrestricted speech requires a 
"language model" that attaches probabilities to the production of all possible strings 
of words (Bahl et al. 1983). Consequently, if we believe that word generation can be 
modeled by context-free grammars, and if we want to base speech recognition (or 
handwriting recognition, optical character recognition, etc.) on such models, then it
will become necessary to embed them into a probabilistic framework. 
In speech recognition we are presented with words one at a time, in sequence, and
so we would like to calculate the probability $P(s \to w_1 w_2 \ldots w_k \ldots)$ that an arbitrary
string $w_1 w_2 \ldots w_k$ is the initial substring of a sentence generated by the given SCFG.¹
* P.O. Box 218, Yorktown Heights, NY 10598 
1 In fact, in speech recognition (Bahl et al. 1983) we are presented with a hypothesized past text (the
history) $w_1 w_2 \ldots w_k$ and are interested in computing, for any arbitrary word $v$, the conditional
probability $P(w_{k+1} = v \mid w_1 w_2 \ldots w_k)$ that the next word uttered will be $v$ given the hypothesized past
$w_1 w_2 \ldots w_k$. Assuming that successive sentences $s$ are independent of each other (a rather dubious
assumption justifiable only by a lack of adequate understanding of how one sentence influences
another), we may as well take the view that $w_1$ is the first word of the current sentence and that $w_k$ is
not the last. Then
$$P(w_{k+1} = v \mid w_1 w_2 \ldots w_k) = \frac{P(s \to w_1 w_2 \ldots w_k v \ldots)}{P(s \to w_1 w_2 \ldots w_k \ldots)}$$
Hence our interest in the calculation of $P(s \to w_1 w_2 \ldots w_k \ldots)$.
© 1991 Association for Computational Linguistics
Computational Linguistics Volume 17, Number 3 
2. Definition of Stochastic Context-Free Grammars 
We will now define stochastic context-free grammars (SCFGs) and establish some
notation. We will use script symbols for sets, lowercase letters for elements of the
sets or specific string items, and capitals for variables. We start with a vocabulary
$\mathcal{V} = \{v_1, v_2, \ldots, v_N\}$ whose elements, words $v_i$, are the terminal symbols of the
language. We next list a set of nonterminals $\mathcal{G} = \{g_1 = s, g_2, \ldots, g_M\}$ whose elements
$g_j$ are grammatical phrase markers. They include the distinguished phrase marker $s$,
the sentence "start" symbol. The purpose of our grammar is to generate sentences
$w_1 w_2 \ldots w_n$ of varying length $n$. The generation is accomplished by use of production
rules, belonging to a set $\mathcal{R}$, that rewrite individual phrase markers as sequences of
phrase markers or words. For simplicity of manipulation but without loss of generality,
we will limit the productions to the Chomsky Normal Form (CNF). That is, only
the following types of productions will be allowed:
1. $H \to G_1 G_2$
2. $H \to V$ (1)
The grammar is stochastic, because to each rule there is assigned a probability of its
use. Let $H$ be any nonterminal, and let $\#(H)$ be the number of productions rewriting
$H$. The $i$th of these productions will then take place with probability $P(i \mid H)$. It is
assumed that for all $i = 1, 2, \ldots, \#(H)$, $P(i \mid H)$ is a strictly positive number and that
$$\sum_{i=1}^{\#(H)} P(i \mid H) = 1 \qquad (2)$$
It will be convenient to denote the probabilities $P(i \mid H)$ by the productions they
refer to, e.g., $P(H \to G_1 G_2)$ or $P(H \to V)$.
A context-free grammar is assumed to generate sentences from top to bottom,
starting with some rule $s \to G_1 G_2$ that rewrites the sentence symbol $s$ and is used with
probability $P(s \to G_1 G_2)$. The generated nonterminals $G_1$ and $G_2$ are then rewritten,
and the rewriting process continues until no nonterminals remain to be rewritten, all
having been replaced by words through use of rewrite rules of type (1). The probability
of the entire process is equal to the product of the probabilities of the individual
rewrite rules used.
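As an illustration of these definitions, the following minimal sketch stores a CNF grammar as two dictionaries, checks condition (2), and multiplies rule probabilities along one derivation. The toy grammar (S -> S A | 'b', A -> 'a'), the dictionary layout, and the function names are assumptions made for this example only; they do not come from the paper.

```python
# Toy SCFG in Chomsky Normal Form (an illustrative assumption):
#   S -> S A (0.5) | 'b' (0.5),   A -> 'a' (1.0)
binary_rules = {("S", "S", "A"): 0.5}                 # P(H -> G1 G2)
lexical_rules = {("S", "b"): 0.5, ("A", "a"): 1.0}    # P(H -> v)


def rule_mass(nonterminal):
    """Sum of the probabilities of all rules rewriting `nonterminal` (condition (2))."""
    total = sum(p for (h, _, _), p in binary_rules.items() if h == nonterminal)
    total += sum(p for (h, _), p in lexical_rules.items() if h == nonterminal)
    return total


def derivation_probability(tree):
    """Product of the rule probabilities used in a parse tree.

    A tree is either (H, word) or (H, left_subtree, right_subtree).
    """
    if len(tree) == 2:
        return lexical_rules[(tree[0], tree[1])]
    head, left, right = tree
    p = binary_rules[(head, left[0], right[0])]
    return p * derivation_probability(left) * derivation_probability(right)


# One derivation of the sentence "b a a".
tree = ("S", ("S", ("S", "b"), ("A", "a")), ("A", "a"))
print(rule_mass("S"), rule_mass("A"), derivation_probability(tree))
```

The derivation uses S -> S A twice, S -> 'b' once, and A -> 'a' twice, so its probability is the product of those five rule probabilities.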
We say that a SCFG is well defined if it forms a language model; that is, the
total probability of strings of terminals generated by the grammar is equal to 1:
$$\sum_{n=1}^{\infty} \; \sum_{w_1 w_2 \ldots w_n \in \mathcal{V}^n} P(s \to w_1 w_2 \ldots w_n) = 1$$
A context-free grammar is said to be proper if, starting from the distinguished
nonterminal $s$, the only nonterminals produced are those whose further rewriting can eventually
result in a string of terminals. In fact, condition (2) is necessary and sufficient for a
SCFG to be well defined if the underlying grammar is proper.²
The solutions to the following four problems are of interest. 
2 The following simple algorithm determines whether or not a grammar may be made proper by the
elimination of rules.
Let $\mathcal{S}$ be the set of all nonterminals $H$ such that a rule $H \to V$ exists for some terminal (word) $V$.
Jelinek and Lafferty Probability of Initial Substring Generation 
1. What is the probability $P(s \to w_1 w_2 \ldots w_n)$ that the grammar, beginning
with the start nonterminal $s$, generates a given word string (sentence)
$w_1 w_2 \ldots w_n$, $w_i \in \mathcal{V}$?
The desired probability is computed by the Inside Algorithm (Baker 1979), which is a 
modification of the well-known CYK parsing algorithm (Younger 1967; Graham et al. 
1980). 
2. What is the most probable parse of a given word string $w_1 w_2 \ldots w_k$? That
is, which sequence of rewrite rules resulting in $w_1 w_2 \ldots w_k$ is such that
the product of its probabilities is maximal?
This parse is computed by the Viterbi Algorithm (Jelinek 1985), which uses the same 
chart as the CYK algorithm. 
3. What is the probability $P(s \to w_1 w_2 \ldots w_n \ldots)$ that the grammar,
beginning with the start nonterminal $s$, generates a word string
(sentence) whose initial substring is $w_1 w_2 \ldots w_n$?
The algorithm providing the answer to this question is developed in the present paper. 
4. Given the set of rules specifying a context-free grammar, how should the 
probabilities of their use be determined? 
An answer to this question requires a criterion by which to judge it. The maximum
likelihood criterion is as follows: given a "training corpus" $W_T$ (that is, a set of
sentences), determine the production probabilities so as to maximize the probability
that the grammar generated $W_T$. The Inside-Outside Algorithm (Baker 1979) extracts
probabilities that locally (i.e., not necessarily globally) maximize the likelihood of $W_T$.
3. Development of the Left-to-Right Inside (LRI) Algorithm 
In this section we will develop the Left-to-Right Inside (LRI) Algorithm, which will allow us
to calculate the desired probabilities $P(s \to w_1 w_2 \ldots w_k \ldots)$. In order to present the LRI
Algorithm, we will introduce some notation that will simplify the appearance of the
following formulas. Let $P(H\langle i,j\rangle)$ denote the probability $P(H \to w_i \ldots w_j)$ that starting
with the nonterminal $H$, successive application of grammar rules has produced the
sequence $w_i w_{i+1} \ldots w_j$. That is, if the SCFG production process is represented by the
usual tree diagram, then $P(H\langle i,j\rangle)$ is the sum of the probabilities of all trees whose
root is $H$ and whose leaves are $w_i, w_{i+1}, \ldots, w_j$.
1. If $\mathcal{S} = \mathcal{G}$, the grammar is proper. Stop.
Else if $\mathcal{S} \neq \mathcal{G}$, then find the set $\mathcal{A}$ of all nonterminals $H$ not belonging to $\mathcal{S}$ that rewrite as $H \to G_1 G_2$
with $G_1$ and $G_2$ belonging to $\mathcal{S}$.
2. If $\mathcal{A}$ is not empty, include the set $\mathcal{A}$ in $\mathcal{S}$ and go to 1.
3. If $s \in \mathcal{S}$, eliminate from $\mathcal{G}$ all nonterminals not belonging to $\mathcal{S}$ and purge all rules involving
nonterminals not belonging to $\mathcal{S}$. The resulting grammar is proper.
Else if $s \notin \mathcal{S}$, the grammar cannot be made proper by purging.
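The purging procedure of footnote 2 can be sketched as a fixed-point computation over the set of nonterminals that can rewrite to terminals. The function names and the representation of rules as tuples are assumptions of this sketch, and it handles only the structure of the grammar: rule probabilities would still have to be renormalized after purging.

```python
def productive_nonterminals(binary_rules, lexical_rules):
    """Nonterminals that can eventually rewrite to a string of terminals.

    binary_rules: set of (H, G1, G2) tuples; lexical_rules: set of (H, word) tuples.
    """
    productive = {h for (h, _word) in lexical_rules}
    changed = True
    while changed:
        changed = False
        for (h, g1, g2) in binary_rules:
            if h not in productive and g1 in productive and g2 in productive:
                productive.add(h)
                changed = True
    return productive


def make_proper(start, binary_rules, lexical_rules):
    """Return purged (binary, lexical) rule sets, or None if `start` is unproductive."""
    keep = productive_nonterminals(binary_rules, lexical_rules)
    if start not in keep:
        return None
    purged_binary = {rule for rule in binary_rules if all(x in keep for x in rule)}
    purged_lexical = {rule for rule in lexical_rules if rule[0] in keep}
    return purged_binary, purged_lexical


# Hypothetical example: B can never produce terminals, so its rules are purged.
binary = {("S", "S", "A"), ("S", "B", "A"), ("B", "B", "B")}
lexical = {("S", "b"), ("A", "a")}
result = make_proper("S", binary, lexical)
print(result)
```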
Next, let $P(H \ll i,j)$ denote the sum of the probabilities of all trees with root node
$H$ resulting in word strings whose initial substring is $w_i w_{i+1} \ldots w_j$. Thus
$$P(H \ll i,j) = P(H\langle i,j\rangle) + \sum_{x_1} P(H \to w_i \ldots w_j x_1) + \sum_{x_1 x_2} P(H \to w_i \ldots w_j x_1 x_2) + \cdots + \sum_{x_1 \ldots x_n} P(H \to w_i \ldots w_j x_1 \ldots x_n) + \cdots \qquad (3)$$
Note that the first sum in (3) is over all possible words $x_1$, the second is over all
possible word pairs $x_1 x_2$, and the third sum (the general term) is over all possible word
$n$-tuples $x_1 x_2 \ldots x_n$. Using the notation (3), the desired probability $P(s \to w_1 w_2 \ldots w_k \ldots)$
is denoted by $P(s \ll 1, k)$.
In what follows we will need $P_L(H \to G)$, the sum of the probabilities of all the
rules $H \to G_1 G_2$ whose first righthand side element is $G_1 = G$. That is,
$$P_L(H \to G) = \sum_{G_2} P(H \to G G_2) \qquad (4)$$
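Definition (4) amounts to grouping the binary rules by their left-hand side and first right-hand-side symbol. A minimal sketch follows; the rule table and the function name are hypothetical and chosen only to show two rules that share the same first element.

```python
from collections import defaultdict

# Hypothetical binary rules P(H -> G1 G2); two of them share the first element A.
binary_rules = {
    ("S", "A", "S"): 0.3,
    ("S", "A", "B"): 0.2,
    ("S", "B", "B"): 0.1,
}


def left_corner_rule_sums(rules):
    """P_L(H -> G): total probability of rules H -> G G2, summed over G2 (formula (4))."""
    sums = defaultdict(float)
    for (head, first, _second), prob in rules.items():
        sums[(head, first)] += prob
    return dict(sums)


p_left = left_corner_rule_sums(binary_rules)
print(p_left)
```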
Next we define the quantity
$$Q_L(H \Rightarrow G) = P_L(H \to G) + \sum_{A_1} P_L(H \to A_1) P_L(A_1 \to G) + \sum_{A_1, A_2} P_L(H \to A_1) P_L(A_1 \to A_2) P_L(A_2 \to G) + \cdots + \sum_{A_1, \ldots, A_k} P_L(H \to A_1) \cdots P_L(A_k \to G) + \cdots = \sum_{\alpha} P(H \Rightarrow^{*} G\alpha) \qquad (5)$$
which is the sum of probabilities of all trees with root node $H$ that produce $G$ as the
leftmost (first) nonterminal. Note that the last displayed (general) term accounts for
all trees whose leftmost leaf has depth k. Note further that the above sum converges
since we assume that our underlying grammar is proper, and that rule probabilities
are non-zero.
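The convergence of series (5) can be observed numerically by accumulating powers of the matrix of $P_L$ values. The sketch below assumes a hypothetical two-nonterminal grammar, ordered [S, A], in which the only nonzero entry is $P_L(S \to S) = 0.5$ (a left-recursive rule S -> S A); the helper names are illustrative.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def left_corner_series(p_left, terms=60):
    """Partial sum P_L + P_L^2 + ... + P_L^terms of series (5)."""
    n = len(p_left)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in p_left]
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
        power = mat_mul(power, p_left)
    return total


# Nonterminal order [S, A]; assumed toy values, not taken from the paper.
p_left = [[0.5, 0.0],
          [0.0, 0.0]]
q_left = left_corner_series(p_left)
print(q_left[0][0])
```

For this grammar the entry for (S, S) is the geometric series 0.5 + 0.25 + ..., so the partial sums approach 1.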
We are now ready to compute
$$P(H \ll i,i) = P(H \to w_i) + \sum_{G} P_L(H \to G) P(G \to w_i) + \sum_{G} \sum_{A_1} P_L(H \to A_1) P_L(A_1 \to G) P(G \to w_i) + \sum_{G} \sum_{A_1, A_2} P_L(H \to A_1) P_L(A_1 \to A_2) P_L(A_2 \to G) P(G \to w_i) + \cdots + \sum_{G} \sum_{A_1, \ldots, A_k} P_L(H \to A_1) P_L(A_1 \to A_2) \cdots P_L(A_k \to G) P(G \to w_i) + \cdots$$
Thus, using definition (5) we get
$$P(H \ll i,i) = P(H \to w_i) + \sum_{G} Q_L(H \Rightarrow G) P(G \to w_i) \qquad (6)$$
To compute $P(H \ll i, i+n)$ for $n > 0$, we will need to define
$$Q_L(H \Rightarrow G_1 G_2) = P(H \to G_1 G_2) + \sum_{A} Q_L(H \Rightarrow A) P(A \to G_1 G_2) \qquad (7)$$
which can be seen to be the sum of probabilities of all trees with root node $H$ whose
last leftmost production results in leaves $G_1$ and $G_2$. To compute $P(H \ll i, i+n)$ we
will rely on the strict Chomsky Normal Form of the grammar. Obviously,
$$P(H \ll i,i+n) = \sum_{G_1, G_2} P(H \to G_1 G_2) \bigl[ P(G_1\langle i,i\rangle) P(G_2 \ll i+1, i+n) + P(G_1\langle i,i+1\rangle) P(G_2 \ll i+2, i+n) + \cdots + P(G_1\langle i,i+n-1\rangle) P(G_2 \ll i+n, i+n) + P(G_1 \ll i, i+n) \bigr] \qquad (8)$$
since to generate the initial substring $w_i w_{i+1} \ldots w_{i+n}$, some rule $H \to G_1 G_2$ must first
be applied and then the first part of the substring must be generated from $G_1$ and its
remaining part (and perhaps more!) from $G_2$.
Defining the function
$$R(G_1, G_2) = \bigl[ P(G_1\langle i,i\rangle) P(G_2 \ll i+1, i+n) + P(G_1\langle i,i+1\rangle) P(G_2 \ll i+2, i+n) + \cdots + P(G_1\langle i,i+n-1\rangle) P(G_2 \ll i+n, i+n) \bigr] \qquad (9)$$
we can next rearrange (8) as follows:
$$P(H \ll i,i+n) = \sum_{G_1, G_2} P(H \to G_1 G_2) R(G_1, G_2) + \sum_{A_1} P_L(H \to A_1) P(A_1 \ll i, i+n) \qquad (10)$$
where we took advantage of the definition (4) and denoted the variable in the last
sum by $A_1$ instead of by $G_1$.
Renaming $H$ in (10) as $A_1$, and $A_1$ as $A_2$, we get
$$P(A_1 \ll i,i+n) = \sum_{G_1, G_2} P(A_1 \to G_1 G_2) R(G_1, G_2) + \sum_{A_2} P_L(A_1 \to A_2) P(A_2 \ll i, i+n) \qquad (11)$$
Substituting (11) into (10) and collecting and factoring out common terms, we get
$$P(H \ll i,i+n) = \sum_{G_1, G_2} \Bigl[ P(H \to G_1 G_2) + \sum_{A_1} P_L(H \to A_1) P(A_1 \to G_1 G_2) \Bigr] R(G_1, G_2) + \sum_{A_1, A_2} P_L(H \to A_1) P_L(A_1 \to A_2) P(A_2 \ll i, i+n) \qquad (12)$$
Next, renaming $A_1$ in (11) as $A_2$, and $A_2$ as $A_3$, and substituting the result into (12),
we get
$$P(H \ll i,i+n) = \sum_{G_1, G_2} \Bigl[ P(H \to G_1 G_2) + \sum_{A_1} P_L(H \to A_1) P(A_1 \to G_1 G_2) + \sum_{A_1, A_2} P_L(H \to A_1) P_L(A_1 \to A_2) P(A_2 \to G_1 G_2) \Bigr] R(G_1, G_2) + \sum_{A_1, A_2, A_3} P_L(H \to A_1) P_L(A_1 \to A_2) P_L(A_2 \to A_3) P(A_3 \ll i, i+n) \qquad (13)$$
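The pattern is now clear. Since
$$\sum_{A_1, \ldots, A_k} P_L(H \to A_1) P_L(A_1 \to A_2) \cdots P_L(A_{k-1} \to A_k) P(A_k \ll i, i+n)$$
tends to 0 as $k$ grows without limit, then using definition (7) and successive
resubstitutions, we get the final formula
$$P(H \ll i,i+n) = \sum_{G_1, G_2} Q_L(H \Rightarrow G_1 G_2) R(G_1, G_2) = \sum_{G_1, G_2} Q_L(H \Rightarrow G_1 G_2) \Bigl[ \sum_{j=1}^{n} P(G_1\langle i,i+j-1\rangle) P(G_2 \ll i+j, i+n) \Bigr] \qquad (14)$$
where the last equality follows from (9), the definition of $R(G_1, G_2)$.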
We can now notice that formula (14) is very similar to the well-known formula
$$P(H\langle i,i+n\rangle) = \sum_{G_1, G_2} P(H \to G_1 G_2) \Bigl[ \sum_{j=1}^{n} P(G_1\langle i,i+j-1\rangle) P(G_2\langle i+j, i+n\rangle) \Bigr] \qquad (15)$$
that allows an iterative calculation of the (inside) probabilities $P(H\langle i,i+n\rangle)$ ((15) serves
as the basis for the Inside Algorithm (Baker 1979)). There are two differences between
(14) and (15): instead of the rule probability $P(H \to G_1 G_2)$ in (15), we have in (14)
the sum-of-tree-probability function $Q_L(H \Rightarrow G_1 G_2)$ (defined in (7)), and instead of the
simple span generation probability $P(G_2\langle i+j, i+n\rangle)$ in (15), we have in (14) the initial
substring generation probability $P(G_2 \ll i+j, i+n)$ (defined in (3)). It follows that
once we determine how to calculate the values of $Q_L(H \Rightarrow G_1 G_2)$ (this is discussed in
the next section), we will be able to compute iteratively all the other quantities (that
is, $P(H \ll i,j)$ and $P(H\langle i,j\rangle)$). In fact, it follows from (14) that to calculate $P(s \ll 1,k)$
one proceeds as follows:
1. Calculate probabilities $P(G\langle i, i+n\rangle)$ for $i = 1, 2, \ldots, k-1$,
$n = 0, 1, 2, \ldots, k-i-1$, iteratively by formula (15).
2. Calculate probabilities $P(H \ll k, k)$ by formula (6).
3. Calculate probabilities
$$P(H \ll k-1, k) = \sum_{G_1, G_2} Q_L(H \Rightarrow G_1 G_2) P(G_1\langle k-1, k-1\rangle) P(G_2 \ll k, k)$$
4. Calculate probabilities
$$P(H \ll k-2, k) = \sum_{G_1, G_2} Q_L(H \Rightarrow G_1 G_2) \Bigl[ \sum_{j=1}^{2} P(G_1\langle k-2, k+j-3\rangle) P(G_2 \ll k+j-2, k) \Bigr]$$
$\vdots$
k. Calculate probabilities
$$P(H \ll 2, k) = \sum_{G_1, G_2} Q_L(H \Rightarrow G_1 G_2) \Bigl[ \sum_{j=1}^{k-2} P(G_1\langle 2, 1+j\rangle) P(G_2 \ll 2+j, k) \Bigr]$$
k+1. Calculate the probability
$$P(s \ll 1, k) = \sum_{G_1, G_2} Q_L(s \Rightarrow G_1 G_2) \Bigl[ \sum_{j=1}^{k-1} P(G_1\langle 1, j\rangle) P(G_2 \ll j+1, k) \Bigr]$$
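The procedure above can be sketched end to end in a few dozen lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the dictionary-based grammar, the function name `prefix_probability`, and the toy grammar are all hypothetical, and $Q_L(H \Rightarrow G)$ is approximated by truncating series (5) rather than by the closed form derived in Section 4. Positions are 0-based in the code.

```python
from collections import defaultdict


def prefix_probability(words, start, nonterminals, binary, lexical, series_terms=200):
    """P(start << 1,k) for words w_1..w_k, following formulas (4)-(7), (14), (15)."""
    if not words:
        raise ValueError("expected at least one word")
    k = len(words)
    # P_L of formula (4), then a truncated version of series (5) for Q_L(H => G).
    p_left = defaultdict(float)
    for (h, g1, _g2), p in binary.items():
        p_left[(h, g1)] += p
    q_left = defaultdict(float)
    power = dict(p_left)
    for _ in range(series_terms):
        for key, val in power.items():
            q_left[key] += val
        nxt = defaultdict(float)
        for (h, a), val in power.items():
            for g in nonterminals:
                nxt[(h, g)] += val * p_left.get((a, g), 0.0)
        power = nxt
    # Q_L(H => G1 G2), formula (7).
    q_rule = defaultdict(float)
    for (a, g1, g2), p in binary.items():
        q_rule[(a, g1, g2)] += p
        for h in nonterminals:
            q_rule[(h, g1, g2)] += q_left[(h, a)] * p
    # Inside probabilities P(H<i,j>), formula (15).
    inside = defaultdict(float)
    for i, w in enumerate(words):
        for h in nonterminals:
            inside[(h, i, i)] = lexical.get((h, w), 0.0)
    for span in range(1, k):
        for i in range(k - span):
            j = i + span
            for (h, g1, g2), p in binary.items():
                inside[(h, i, j)] += p * sum(
                    inside[(g1, i, m)] * inside[(g2, m + 1, j)] for m in range(i, j))
    # prefix[(H, i)] = P(H << i, k): formula (6) at the last word, then formula (14).
    last = k - 1
    prefix = defaultdict(float)
    for h in nonterminals:
        prefix[(h, last)] = lexical.get((h, words[last]), 0.0) + sum(
            q_left[(h, g)] * lexical.get((g, words[last]), 0.0) for g in nonterminals)
    for i in range(k - 2, -1, -1):
        for (h, g1, g2), q in q_rule.items():
            r = sum(inside[(g1, i, m)] * prefix[(g2, m + 1)] for m in range(i, last))
            prefix[(h, i)] += q * r
    return prefix[(start, 0)]


# Hypothetical toy grammar: S -> S A (0.5) | 'b' (0.5), A -> 'a' (1.0).
binary = {("S", "S", "A"): 0.5}
lexical = {("S", "b"): 0.5, ("A", "a"): 1.0}
print(prefix_probability(["b", "a", "a"], "S", ["S", "A"], binary, lexical))
```

The toy grammar generates the sentences b, b a, b a a, ... with probabilities 0.5, 0.25, 0.125, ..., so the prefix probabilities can be checked by hand: every sentence starts with b, half of them continue with a, and a quarter continue with a a.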
4. Determination of the Functions $Q_L(H \Rightarrow G_1 G_2)$ and $P(H \ll i, i)$
Let us first observe that if $w_i = v$ then $P(H \ll i, i) = P(H \to v \ldots)$ which, consistent
with previous notation (5), we denote by $Q_L(H \Rightarrow v)$. We then get from (6)
$$Q_L(H \Rightarrow v) = P(H \to v) + \sum_{G} Q_L(H \Rightarrow G) P(G \to v) \qquad (16)$$
It follows from (16) and (7) that to calculate the desired quantities $P(H \ll i, i)$ and
$Q_L(H \Rightarrow G_1 G_2)$ we must first determine the left corner probability sums $Q_L(H \Rightarrow G)$.
We will use matrix algebra to compute them.
Let $\mathbf{P}_L$ and $\mathbf{Q}_L$ denote the square matrices (their dimension is equal to the number
of nonterminals) whose elements in the $H$th row and $G$th column are $P_L(H \to G)$
(defined in (4)) and $Q_L(H \Rightarrow G)$, respectively. Then equation (5) can be rewritten in
matrix form as
$$\mathbf{Q}_L = \mathbf{P}_L + \mathbf{P}_L^2 + \mathbf{P}_L^3 + \cdots + \mathbf{P}_L^k + \cdots \qquad (17)$$
where $\mathbf{P}_L^i$ denotes $i$-fold multiplication of the matrix $\mathbf{P}_L$ with itself. Post-multiplying
both sides of (17) by the matrix $\mathbf{P}_L$, subtracting the resulting equation from (17), and
cancelling terms, we get
$$\mathbf{Q}_L - \mathbf{Q}_L \mathbf{P}_L = \mathbf{P}_L \qquad (18)$$
Finally, denoting by $\mathbf{I}$ the diagonal unit matrix of the same dimension as $\mathbf{P}_L$, we
get from (18) the desired solution
$$\mathbf{Q}_L = \mathbf{P}_L [\mathbf{I} - \mathbf{P}_L]^{-1} \qquad (19)$$
where $[\mathbf{I} - \mathbf{P}_L]^{-1}$ denotes the inverse of the matrix $[\mathbf{I} - \mathbf{P}_L]$.
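Formula (19) can be evaluated directly for small grammars. The sketch below uses a hand-written Gauss-Jordan inversion so that it needs no external library; the helper names and the two-nonterminal example (order [S, A], with S -> S A at 0.5 and S -> A A at 0.25) are assumptions made for illustration.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def left_corner_closed_form(p_left):
    """Q_L = P_L [I - P_L]^{-1}, formula (19)."""
    n = len(p_left)
    i_minus_p = [[(1.0 if i == j else 0.0) - p_left[i][j] for j in range(n)]
                 for i in range(n)]
    inv = invert(i_minus_p)
    return [[sum(p_left[i][k] * inv[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Assumed toy values: P_L(S -> S) = 0.5, P_L(S -> A) = 0.25, all others zero.
p_left = [[0.5, 0.25],
          [0.0, 0.0]]
q_left = left_corner_closed_form(p_left)
print(q_left)
```

The same values follow from series (17): the (S, S) entry is 0.5 + 0.25 + ..., and the (S, A) entry is 0.25 times (1 + 0.5 + 0.25 + ...).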
Equation (16) can also be stated in matrix form. Denoting by $\mathbf{P}_w$ and $\mathbf{Q}_w$ the
rectangular matrices with elements $P(H \to w)$ and $Q_L(H \Rightarrow w)$ in the $H$th row and $w$th
column, respectively, we get from (16) that
$$\mathbf{Q}_w = [\mathbf{I} + \mathbf{Q}_L] \mathbf{P}_w \qquad (20)$$
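A small numerical instance of (20) may help. The sketch assumes a hypothetical grammar with nonterminal order [S, A] and vocabulary order [a, b]: S -> S A (0.5), S -> A A (0.25), S -> 'b' (0.25), A -> 'a' (1.0). Its $\mathbf{Q}_L$ entries, $Q_L(S \Rightarrow S) = 1$ and $Q_L(S \Rightarrow A) = 0.5$, are taken as given here; they were worked out by hand from (17). The function name is illustrative.

```python
def first_word_probabilities(q_left, p_word):
    """Q_w = [I + Q_L] P_w, formula (20); rows are nonterminals, columns are words."""
    n = len(q_left)
    n_words = len(p_word[0])
    i_plus_q = [[(1.0 if i == j else 0.0) + q_left[i][j] for j in range(n)]
                for i in range(n)]
    return [[sum(i_plus_q[i][g] * p_word[g][w] for g in range(n))
             for w in range(n_words)]
            for i in range(n)]


# Assumed toy values (nonterminals [S, A], words [a, b]).
q_left = [[1.0, 0.5],
          [0.0, 0.0]]
p_word = [[0.0, 0.25],   # P(S -> a), P(S -> b)
          [1.0, 0.0]]    # P(A -> a), P(A -> b)
q_word = first_word_probabilities(q_left, p_word)
print(q_word)
```

Each row of the result gives, for one nonterminal, the probability that the string it generates begins with each word; for a well-defined grammar the entries of a row sum to 1.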
5. Conclusion 
While the LRI algorithm together with formulas (19) and (20) constitutes the solution
to the stated problem, its practicality is limited to grammars whose total number
of nonterminals is small enough to allow the calculation of the inverse
$[\mathbf{I} - \mathbf{P}_L]^{-1}$.
The algorithm itself has exactly twice the complexity of the Inside Algorithm
computing $P(H\langle i, i+n\rangle)$ by formula (15), and is thus of order $n^3$. In fact, once all the
probabilities required for the computation of $P(s \ll 1, k)$ are computed, to get the next
probability of interest, $P(s \ll 1, k+1)$, one needs to compute the following quantities:
1. The probabilities $P(G\langle i,k\rangle)$ for $i = k, k-1, \ldots, 1$, in that order.
2. The probabilities $P(H \ll i, k+1)$ for $i = k+1, k, \ldots, 2$, in that order.
3. The probability $P(s \ll 1, k+1)$.
Let us finally recall that the language model of speech recognition provides to the
recognizer the probability $P(w_k = v \mid w_1 w_2 \ldots w_{k-1})$ for all possible words $v$, and that
we therefore must be able to compute the probability $P(s \to w_1 w_2 \ldots w_{k-1} v \ldots)$ for all
$N$ words $v$ of the vocabulary. Fortunately, this does not mean carrying out the LRI
algorithm $N$ times for each word position $k$, but only $M$ times, where $M$ is the number
of nonterminals of the grammar.
In fact, a simple modification of the algorithm allows one to compute the
probabilities $P(s \to w_1 w_2 \ldots w_{k-1} g_i \ldots)$ where $g_i$ is an element of the set of nonterminals
$\mathcal{G} = \{g_1 = s, g_2, \ldots, g_M\}$. This may be done, for example, by setting
$$P(H \ll k, k) = \begin{cases} 1 & \text{if } H = g_i \\ 0 & \text{otherwise} \end{cases} \qquad (21)$$
in the algorithm of Section 3. Our desired LRI probabilities can then be computed by
the formula
$$P(s \to w_1 w_2 \ldots w_{k-1} v \ldots) = \sum_{i=1}^{M} Q_L(g_i \Rightarrow v) P(s \to w_1 w_2 \ldots w_{k-1} g_i \ldots) \qquad (22)$$
This modification is particularly practical when the size of the vocabulary greatly 
exceeds the number of nonterminals in the grammar. 
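To make formula (22) concrete, the sketch below combines it with the ratio of footnote 1 to obtain next-word probabilities for every word of the vocabulary at once. All numbers are hand-derived for a hypothetical toy grammar (S -> S A (0.5) | 'b' (0.5), A -> 'a' (1.0)) with history $w_1$ = 'b'; the variable and function names are assumptions of this example.

```python
vocabulary = ["a", "b"]

# Q_L(g_i => v) = P(g_i -> v ...), from formula (16), for the toy grammar.
q_word = {("S", "a"): 0.0, ("S", "b"): 1.0, ("A", "a"): 1.0, ("A", "b"): 0.0}

# P(s -> w_1 g_i ...): one run of the LRI recursion per nonterminal, using (21).
prefix_then_nonterminal = {"S": 0.0, "A": 0.5}

# P(s << 1,1): every sentence of the toy grammar starts with 'b'.
history_probability = 1.0


def next_word_distribution(prefix_then_nonterminal, q_word, vocabulary, history_probability):
    """Formula (22) for every word v, divided by the history probability (footnote 1)."""
    result = {}
    for v in vocabulary:
        joint = sum(q_word.get((g, v), 0.0) * p
                    for g, p in prefix_then_nonterminal.items())
        result[v] = joint / history_probability
    return result


distribution = next_word_distribution(
    prefix_then_nonterminal, q_word, vocabulary, history_probability)
print(distribution)
```

The inner sum runs over the $M$ nonterminals, so the cost per vocabulary word is one dot product rather than one full run of the LRI recursion. In this toy case the probabilities over the vocabulary do not sum to 1 because the remaining mass belongs to sentences that end after 'b'.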
References 
Bahl, L. R.; Jelinek, F.; and Mercer, R. L.
(1983). "A maximum likelihood approach
to continuous speech recognition." IEEE
Transactions on Pattern Analysis and
Machine Intelligence, Vol PAMI-5, No 2,
179-190.
Baker, J. K. (1979). "Trainable grammars for
speech recognition." Proceedings, Spring
Conference of the Acoustical Society of
America, Boston, MA, 547-550.
Graham, S. L.; Harrison, M. A.; and Ruzzo,
W. L. (1980). "An improved context-free
recognizer." ACM Transactions on
Programming Languages and Systems, Vol 2,
No 3, 415-462.
Jelinek, F. (1985). "Markov source modeling
of text generation." In The Impact of
Processing Techniques on Communications,
edited by J. K. Skwirzynski. Dordrecht:
Nijhoff.
Wright, J. H.; and Wrigley, E. N. (1989).
"Probabilistic LR parsing for speech
recognition." International Workshop on
Parsing Technologies, 105-114.
Younger, D. H. (1967). "Recognition and
parsing of context-free languages in time
$n^3$." Information and Control 10, 189-208.

