Squibs and Discussions 
Estimation of Probabilistic Context-Free
Grammars 1
Zhiyi Chi* 
Brown University 
Stuart Geman* 
Brown University 
The assignment of probabilities to the productions of a context-free grammar may generate an 
improper distribution: the probability of all finite parse trees is less than one. The condition 
for proper assignment is rather subtle. Production probabilities can be estimated from parsed 
or unparsed sentences, and the question arises as to whether or not an estimated system is 
automatically proper. We show here that estimated production probabilities always yield proper 
distributions. 
1. Introduction 
Context-free grammars (CFG's) are useful because of their relatively broad coverage 
and because of the availability of efficient parsing algorithms. Furthermore, CFG's are 
readily fit with a probability distribution (to make probabilistic CFG's--or PCFG's), 
rendering them suitable for ambiguous languages through the maximum a posteriori 
rule of choosing the most probable parse. 
For each nonterminal symbol, a (normalized) probability is placed on the set of all 
productions from that symbol. Unfortunately, this simple procedure runs into an un- 
expected complication: the language generated by the grammar may have probability 
less than one. The reason is that the derivation tree may have probability greater than 
zero of never terminating--some mass can be lost to infinity. This phenomenon is well 
known and well understood, and there are tests for "tightness" (by which we mean 
total probability mass equal to one) involving a matrix derived from the expected 
growth in numbers of symbols generated by the probabilistic rules (see for example 
Booth and Thompson \[1973\], Grenander \[1976\], and Harris \[1963\]). 
What if the production probabilities are estimated from data? Suppose, for ex- 
ample, that we have a parsed corpus that we treat as a collection of (independent) 
samples from a grammar. It is reasonable to hope that if the trees in the sample are fi- 
nite, then an estimate of production probabilities based upon the sample will produce 
a system that assigns probability zero to the set of infinite trees. For example, there is 
a simple maximum-likelihood prescription for estimating the production probabilities 
from a corpus of trees (see Section 2), resulting in a PCFG. Is it tight? If the corpus is 
unparsed then there is an iterative approach to maximum-likelihood estimation (the 
EM or Baum-Welch algorithm--again, see Section 2) and the same question arises: do
we get actual probabilities or do the estimated PCFG's assign some mass to infinite 
trees? We will show that in both cases the estimated probability is tight. 2 
* Division of Applied Mathematics, Brown University, Providence, RI 02912 USA 
1 Note added in proof: An alternative proof of one of our main results (see Corollary, Section 3) recently 
appeared in the IEEE Transactions on Pattern Analysis and Machine Intelligence (Sánchez and Benedí \[1997\]).
2 When estimating from an unparsed corpus, we shall assume a model without null or unit productions; 
see Section 2. 
Computational Linguistics Volume 24, Number 2 
Wetherell (1980) has asked a similar question: a scheme (different from maximum 
likelihood) is introduced for estimating production probabilities from an unparsed 
corpus, and it is conjectured that the resulting system is tight. (Wetherell and others 
use the designation "consistent" instead of "tight," but in statistics, consistency refers 
to the asymptotic correctness of an estimator.) 
A trivial example is the CFG with one nonterminal and one terminal symbol, in 
Chomsky normal form: 
$$A \to AA$$
$$A \to a$$
where $a$ is the only terminal symbol. Assign probability $p$ to the first production
($A \to AA$) and $q = 1 - p$ to the second ($A \to a$). Let $S_h$ be the total probability of
all trees with depth less than or equal to $h$. For example, $S_2 = q$, corresponding to
$A \to a$, and $S_3 = q + pq^2$, corresponding to $\{A \to a\} \cup \{A \to AA, A \to a, A \to a\}$. In
general, $S_{h+1} = q + pS_h^2$. (Condition on the first production: with probability $q$ the tree
terminates and with probability $p$ it produces two nonterminal symbols, each of which
must now terminate with depth less than or equal to $h$.) It is not hard to show that
$S_h$ is nondecreasing and converges to $\min(1, q/p)$, meaning that a proper probability is
obtained if and only if $p \le \frac{1}{2}$.
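The recursion for the depth-limited mass can be checked numerically. The following is a minimal sketch, not part of the original argument; the function names `depth_limited_mass` and `limit_mass` are introduced here for illustration, and the starting value $S_1 = 0$ is an assumption chosen to agree with the convention $S_2 = q$ used above.

```python
def depth_limited_mass(p, max_depth):
    """Return [S_1, ..., S_max_depth] for the grammar A -> AA (prob. p) | a (prob. 1 - p)."""
    q = 1.0 - p
    s = 0.0                  # assumed S_1 = 0, so that S_2 = q as in the text
    masses = [s]
    for _ in range(max_depth - 1):
        s = q + p * s * s    # S_{h+1} = q + p * S_h^2
        masses.append(s)
    return masses


def limit_mass(p):
    """Claimed limit min(1, q/p) of S_h (taken to be 1 when p = 0)."""
    q = 1.0 - p
    return 1.0 if p == 0 else min(1.0, q / p)


if __name__ == "__main__":
    for p in (0.3, 0.5, 0.7):
        print(p, depth_limited_mass(p, 2000)[-1], limit_mass(p))
```

For $p > \frac{1}{2}$ the iterates level off strictly below one, which is the lost mass. At $p = \frac{1}{2}$ convergence to one is slow, so a finite number of iterations only approximates the limit.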
What if $p$ is estimated from data? Given a set of finite parse trees $\omega_1, \omega_2, \ldots, \omega_n$, the
maximum-likelihood estimator for $p$ (see Section 2) is, sensibly enough, the "relative
frequency" estimator
$$\hat{p} = \frac{\sum_{i=1}^{n} f(A \to AA; \omega_i)}{\sum_{i=1}^{n} \left[ f(A \to AA; \omega_i) + f(A \to a; \omega_i) \right]}$$
where $f(\cdot\,; \omega)$ is the number of occurrences of the production "$\cdot$" in the tree $\omega$. The
sentence $a^m$, although ambiguous (there are multiple parses when $m > 2$), always
involves $m - 1$ of the $A \to AA$ productions and $m$ of the $A \to a$ productions. Hence
$f(A \to AA; \omega_i) < f(A \to a; \omega_i)$ for each $\omega_i$. Consequently:
$$f(A \to AA; \omega_i) < \frac{1}{2} \left[ f(A \to AA; \omega_i) + f(A \to a; \omega_i) \right]$$
for each $\omega_i$, and $\hat{p} < \frac{1}{2}$. The maximum-likelihood probability is tight.
If only the yields (left-to-right sequence of terminals) $Y(\omega_1), Y(\omega_2), \ldots, Y(\omega_n)$ are
available, the EM algorithm can be used to iteratively "climb" the likelihood surface
(see Section 2). In the simple example here, the estimator converges in one step and
is the same $\hat{p}$ as if we had observed the entire parse tree for each $\omega_i$. Thus, $\hat{p}$ is again
less than $\frac{1}{2}$ and the distribution is again tight.
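Because every parse of $a^m$ uses $m - 1$ branching productions and $m$ terminal productions, the estimator in this example depends only on the sentence lengths. A small sketch of that computation follows; the helper name `relative_frequency_p` is hypothetical, and exact rational arithmetic is used only to make the strict inequality visible.

```python
from fractions import Fraction


def relative_frequency_p(lengths):
    """Estimate p for A -> AA | a from the lengths m_i of observed sentences a^m."""
    if not lengths or any(m < 1 for m in lengths):
        raise ValueError("every sentence must contain at least one terminal")
    branch = sum(m - 1 for m in lengths)   # total uses of A -> AA
    leaf = sum(lengths)                    # total uses of A -> a
    return Fraction(branch, branch + leaf)


if __name__ == "__main__":
    print(relative_frequency_p([1, 2, 3]))      # 3 branching uses, 6 terminal uses
    print(relative_frequency_p([50, 80, 120]))  # long sentences push the estimate toward 1/2
```

Since `branch` is always smaller than `leaf`, the returned value stays strictly below one half however long the sentences are.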
2. Maximum-Likelihood Estimation 
More generally, let $G = (V, T, R, S)$ denote a context-free grammar with finite variable
set $V$, start symbol $S \in V$, finite terminal set $T$, and finite production (or rule) set $R$.
(We use $R$ in place of the more typical $P$ to avoid confusion with probabilities.) Each
production in $R$ has the form $A \to \alpha$, where $A \in V$ and $\alpha \in (V \cup T)^*$. In the usual way,
probabilities are introduced through the productions: $p : R \to [0,1]$ such that $\forall A \in V$:
$$\sum_{\alpha \in (V \cup T)^* \text{ s.t. } (A \to \alpha) \in R} p(A \to \alpha) = 1. \tag{1}$$
Chi and Geman Probabilistic Context-Free Grammars 
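Given a set of finite parse trees $\omega_1, \omega_2, \ldots, \omega_n$, drawn independently according to
the distribution imposed by $p$, we wish to estimate $p$.
In terms of the frequency function $f$, introduced in Section 1, the likelihood of the
data is
$$L = L(p; \omega_1, \omega_2, \ldots, \omega_n) = \prod_{i=1}^{n} \prod_{(A \to \alpha) \in R} p(A \to \alpha)^{f(A \to \alpha; \omega_i)}.$$
Recall the derivation of the maximum-likelihood estimator of $p$: The log of the likelihood is:
$$\sum_{A \in V} \sum_{\alpha \text{ s.t. } (A \to \alpha) \in R} \sum_{i=1}^{n} f(A \to \alpha; \omega_i) \log p(A \to \alpha). \tag{2}$$
The function $p : R \to [0,1]$ subject to (1) that maximizes (2) satisfies:
$$\frac{\delta}{\delta p(B \to \beta)} \left\{ \sum_{A \in V} \lambda_A \sum_{\alpha \text{ s.t. } (A \to \alpha) \in R} p(A \to \alpha) + \sum_{A \in V} \sum_{\alpha \text{ s.t. } (A \to \alpha) \in R} \sum_{i=1}^{n} f(A \to \alpha; \omega_i) \log p(A \to \alpha) \right\} = 0$$
$\forall (B \to \beta) \in R$, where $\{\lambda_A\}_{A \in V}$ are Lagrange multipliers. Denote the maximum-likelihood
estimator by $\hat{p}$:
$$\lambda_B + \frac{\sum_{i=1}^{n} f(B \to \beta; \omega_i)}{\hat{p}(B \to \beta)} = 0 \quad \forall (B \to \beta) \in R.$$
Since $\sum_{\beta \text{ s.t. } (B \to \beta) \in R} \hat{p}(B \to \beta) = 1$,
$$\hat{p}(B \to \beta) = \frac{\sum_{i=1}^{n} f(B \to \beta; \omega_i)}{\sum_{\alpha \text{ s.t. } (B \to \alpha) \in R} \sum_{i=1}^{n} f(B \to \alpha; \omega_i)}. \tag{3}$$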
The maximum-likelihood estimator is the natural, "relative frequency," estimator. 
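As an illustration of the relative-frequency estimator (3), the sketch below counts productions in a small treebank and normalizes per left-hand side. The tree encoding (nested tuples `(label, child, ...)` with terminals as plain strings) and the names `count_productions` and `mle` are assumptions made for this example only.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def count_productions(tree, counts):
    """Add f(A -> alpha; tree) to `counts` for every production used in `tree`."""
    label, *children = tree
    # The right-hand side is the sequence of child labels (terminals are strings).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            count_productions(child, counts)


def mle(trees):
    """Relative-frequency estimate: count of B -> beta over count of all B -> alpha."""
    counts = Counter()
    for tree in trees:
        count_productions(tree, counts)
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: Fraction(c, totals[rule[0]]) for rule, c in counts.items()}


if __name__ == "__main__":
    corpus = [("A", "a"), ("A", ("A", "a"), ("A", "a"))]
    for rule, prob in mle(corpus).items():
        print(rule, prob)
```

Only productions that occur in the sample receive an entry; variables that never occur are left unassigned, matching the remark that their probabilities may be chosen arbitrarily.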
Suppose $B \in V$ is unobserved among the parse trees $\omega_1, \omega_2, \ldots, \omega_n$. Then we can
assign $\hat{p}(B \to \beta)$ arbitrarily, requiring only that (1) be respected. Evidently the likelihood
is unaffected by the particular assignment of $\hat{p}(B \to \beta)$. Furthermore, it is not
hard to see that any such $B$ has probability zero of arising in any derivation that is
based upon the maximum-likelihood probabilities, 3 and hence the issue of tightness is
independent of this assignment.
We will show that if $\Omega$ is the set of all (finite) parse trees generated by $G$, and if $\hat{p}(\omega)$
is the probability of $\omega \in \Omega$ under the maximum-likelihood production probabilities,
then $\hat{p}(\Omega) = 1$.
3 Consider any sequence of productions that leads from $S$ to $B$. If the parent (antecedent) of $B$ arose in
the sample, then the last production has $\hat{p}$ probability zero and hence the sequence has probability
zero. Otherwise, move "up" through the ancestors of $B$ until finding the first variable in the $S$-to-$B$
sequence represented in the sample (certainly $S$ is represented). Apply the same reasoning to the
production from that variable, and conclude that the given sequence has $\hat{p}$ probability zero.
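The claim that the total mass of finite trees equals one can be probed numerically for any given set of production probabilities by generalizing the depth-limited recursion of Section 1. The following sketch is a numerical check, not the proof given in Section 3; the function name `termination_mass` and the dictionary encoding of rules are assumptions made for this example.

```python
def termination_mass(rules, nonterminals, iterations=1000):
    """Approximate, for each nonterminal, the probability that its derivation terminates.

    rules maps (lhs, rhs_tuple) -> probability. Starting from zero and iterating
    z_A <- sum over A -> alpha of p(A -> alpha) * product of z_B for nonterminals B in alpha
    gives the mass of trees of bounded depth, which increases toward the total
    mass of finite trees.
    """
    z = {a: 0.0 for a in nonterminals}
    for _ in range(iterations):
        new = {a: 0.0 for a in nonterminals}
        for (lhs, rhs), prob in rules.items():
            term = prob
            for sym in rhs:
                if sym in z:          # terminals contribute a factor of one
                    term *= z[sym]
            new[lhs] += term
        z = new
    return z


if __name__ == "__main__":
    tight = {("A", ("A", "A")): 0.25, ("A", ("a",)): 0.75}
    leaky = {("A", ("A", "A")): 0.7, ("A", ("a",)): 0.3}
    print(termination_mass(tight, {"A"}))
    print(termination_mass(leaky, {"A"}))
```

A value visibly below one for the start symbol indicates lost mass. A value near one is only suggestive, since convergence can be slow in borderline cases.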
2.1 The EM Algorithm 
Usually the derivation trees are unobserved--the sample, or corpus, contains only 
the yields $Y(\omega_1), Y(\omega_2), \ldots, Y(\omega_n)$ ($Y(\omega_i) \in T^*$ for each $1 \le i \le n$). The likelihood is
substantially more complex, since $p(Y(\omega))$ is now a marginal probability; we need to
sum over the set of $\omega' \in \Omega$ that yield $Y(\omega)$:
$$p(Y(\omega)) = \sum_{\omega' \in \Omega \text{ s.t. } Y(\omega') = Y(\omega)} p(\omega').$$
In the case where only yields are observed, the treatment is complicated considerably
by the possibility of null productions ($A \to \emptyset$) and unit productions ($A \to B \in V$).
If, however, the language of the grammar does not include the null string, then there is 
an equivalent grammar (one with the same language) that has no null productions and 
no unit productions (cf. Hopcroft & Ullman \[1979\], Theorem 4.4). It is, then, perhaps 
best to simplify the treatment by assuming that there are no null or unit productions. 
Therefore, when the corpus consists of yields only, we shall assume a priori a model 
free of null and unit productions, and study tightness for probabilities estimated under 
such a model. Based upon the results of Stolcke \[1995\] it is likely that this restriction 
can be relaxed, but we have not pursued this. 
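Letting $\Omega_Y$ denote $\{\omega \in \Omega : Y(\omega) = Y\}$, the likelihood of the corpus becomes
$$\prod_{i=1}^{n} \sum_{\omega \in \Omega_{Y(\omega_i)}} \prod_{(A \to \alpha) \in R} p(A \to \alpha)^{f(A \to \alpha; \omega)}.$$
And the maximum-likelihood equation becomes
$$\lambda_B + \frac{1}{\hat{p}(B \to \beta)} \sum_{i=1}^{n} \frac{\sum_{\omega \in \Omega_{Y(\omega_i)}} f(B \to \beta; \omega) \prod_{(A \to \alpha) \in R} \hat{p}(A \to \alpha)^{f(A \to \alpha; \omega)}}{\sum_{\omega \in \Omega_{Y(\omega_i)}} \prod_{(A \to \alpha) \in R} \hat{p}(A \to \alpha)^{f(A \to \alpha; \omega)}} = 0,$$
which gives
$$\hat{p}(B \to \beta) = \frac{\sum_{i=1}^{n} E_{\hat{p}}\left[ f(B \to \beta; \omega) \mid \omega \in \Omega_{Y(\omega_i)} \right]}{\sum_{\alpha \text{ s.t. } (B \to \alpha) \in R} \sum_{i=1}^{n} E_{\hat{p}}\left[ f(B \to \alpha; \omega) \mid \omega \in \Omega_{Y(\omega_i)} \right]} \tag{4}$$
where $E_{\hat{p}}$ is expectation under $\hat{p}$ and where "$\mid \omega \in \Omega_{Y(\omega_i)}$" means "conditioned on
$\omega \in \Omega_{Y(\omega_i)}$."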
There is no hope for a closed form solution, but (4) does suggest an iteration 
scheme, which, as it turns out, "climbs" the likelihood surface (though there are no 
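guarantees about approaching a global maximum): Let $\hat{p}_0$ be an arbitrary assignment
respecting (1). Define a sequence of probabilities, $\hat{p}_n$, by the iteration
$$\hat{p}_{n+1}(B \to \beta) = \frac{\sum_{i=1}^{n} E_{\hat{p}_n}\left[ f(B \to \beta; \omega) \mid \omega \in \Omega_{Y(\omega_i)} \right]}{\sum_{\alpha \text{ s.t. } (B \to \alpha) \in R} \sum_{i=1}^{n} E_{\hat{p}_n}\left[ f(B \to \alpha; \omega) \mid \omega \in \Omega_{Y(\omega_i)} \right]} \tag{5}$$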
The right-hand side is manageable, as long as we can manageably compute all possible 
parses of a sentence (yield) Y(w). (More efficient approaches exist; see Baker \[1979\].) 
This iteration procedure is an instance of the EM Algorithm. Baum \[1972\] first intro- 
duced it for hidden Markov models (regular grammars) and Baker \[1979\] extended 
it to the problem addressed here (estimation for context-free grammars). Dempster, 
Laird, and Rubin \[1977\] put the idea into a much more general setting and coined the 
term EM for Expectation-Maximization. The right-hand side of (5) is computed using
the expected frequencies under $\hat{p}_n$; $\hat{p}_{n+1}$ is then the maximum-likelihood estimator,
treating the expected frequencies as though they were observed frequencies. 
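One step of iteration (5) can be sketched by brute-force enumeration of all parses of each yield. This is an illustrative sketch only, not the efficient procedure of Baker \[1979\]. It assumes, for simplicity, that every rule is either binary with two nonterminals on the right or rewrites to a single terminal; the names `parses` and `em_step` are hypothetical.

```python
from collections import Counter
from math import prod


def parses(sym, words, rules):
    """Yield, for each parse of `words` rooted at `sym`, a Counter of production uses."""
    for lhs, rhs in rules:
        if lhs != sym:
            continue
        if len(rhs) == 1:                       # assumed form A -> a (single terminal)
            if len(words) == 1 and rhs[0] == words[0]:
                yield Counter({(lhs, rhs): 1})
        else:                                   # assumed form A -> B C
            left_sym, right_sym = rhs
            for k in range(1, len(words)):
                for left in parses(left_sym, words[:k], rules):
                    for right in parses(right_sym, words[k:], rules):
                        yield Counter({(lhs, rhs): 1}) + left + right


def em_step(p, corpus, start):
    """Return the next production probabilities given the current ones in `p`."""
    rules = list(p)
    expected = Counter()
    for sentence in corpus:
        trees = list(parses(start, tuple(sentence), rules))
        weights = [prod(p[r] ** c for r, c in t.items()) for t in trees]
        total = sum(weights)
        if total == 0:                          # sentence has no parse of positive probability
            continue
        for tree, w in zip(trees, weights):     # conditional expectation of each count
            for r, c in tree.items():
                expected[r] += c * w / total
    lhs_total = Counter()
    for (lhs, _), c in expected.items():
        lhs_total[lhs] += c
    # Rules whose left-hand side never occurs keep their previous probability.
    return {r: expected[r] / lhs_total[r[0]] if lhs_total[r[0]] > 0 else p[r]
            for r in rules}


if __name__ == "__main__":
    p0 = {("A", ("A", "A")): 0.7, ("A", ("a",)): 0.3}
    print(em_step(p0, ["aaa", "a", "aa"], "A"))
```

In the one-nonterminal example of Section 1 all parses of a sentence have equal weight, so a single call already returns the relative-frequency value, whatever the starting point in the open interval (0, 1).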
The issue of tightness comes up again. We will show that $\hat{p}_n(\Omega) = 1$ for each $n > 0$.
3. Tightness of the Maximum-Likelihood Estimator 
Given a context-free grammar $G = (V, T, R, S)$, let $\Omega$ be the set of finite parse trees, let
$p : R \to [0,1]$ be a system of production probabilities satisfying (1), and let $\omega_1, \omega_2, \ldots, \omega_n$
be a set (sample) of finite parse trees $\omega_k \in \Omega$. For now, null and unit productions are
permitted. Finally, let $\hat{p}$ be the maximum-likelihood estimator of $p$, as defined by (3).
(See also the remarks following (3) concerning variables unobserved in $\omega_1, \omega_2, \ldots, \omega_n$.)
More generally, $\hat{p}$ will refer to the probability distribution on (possibly infinite) parse
trees induced by the maximum-likelihood estimator.
Theorem
$\hat{p}(\Omega) = 1$
Proof 
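Let $q_A = \hat{p}(\text{derivation tree rooted with } A \text{ fails to terminate})$. We will show that $q_S = 0$
(i.e., derivation trees rooted with $S$ always terminate).
For each $A \in V$, let $F(A; \omega)$ be the number of instances of $A$ in $\omega$ and let $\tilde{F}(A; \omega)$
be the number of nonroot instances of $A$ in $\omega$. Given $\alpha \in (V \cup T)^*$, let $n_A(\alpha)$ be the
number of instances of $A$ in the string $\alpha$, and, finally, let $\alpha_i$ be the $i$th component of
the string $\alpha$. For any $A \in V$:
$$\begin{aligned}
q_A &= \hat{p}\Big( \bigcup_{B \in V} \; \bigcup_{\alpha \text{ s.t. } B \in \alpha,\, (A \to \alpha) \in R} \; \bigcup_{i \text{ s.t. } \alpha_i = B} \{\alpha_i \text{ fails to terminate}\} \Big) \\
&\le \sum_{B \in V} \hat{p}\Big( \bigcup_{\alpha \text{ s.t. } B \in \alpha,\, (A \to \alpha) \in R} \; \bigcup_{i \text{ s.t. } \alpha_i = B} \{\alpha_i \text{ fails to terminate}\} \Big) \\
&= \sum_{B \in V} \; \sum_{\alpha \text{ s.t. } B \in \alpha,\, (A \to \alpha) \in R} \hat{p}(A \to \alpha)\, \hat{p}\Big( \bigcup_{i \text{ s.t. } \alpha_i = B} \{\alpha_i \text{ fails to terminate}\} \;\Big|\; A \to \alpha \Big) \\
&\le \sum_{B \in V} \; \sum_{\alpha \text{ s.t. } B \in \alpha,\, (A \to \alpha) \in R} \hat{p}(A \to \alpha)\, n_B(\alpha)\, q_B \\
&= \sum_{B \in V} q_B \left\{ \frac{\sum_{\alpha \text{ s.t. } B \in \alpha,\, (A \to \alpha) \in R} n_B(\alpha) \sum_{i=1}^{n} f(A \to \alpha; \omega_i)}{\sum_{\alpha \text{ s.t. } (A \to \alpha) \in R} \sum_{i=1}^{n} f(A \to \alpha; \omega_i)} \right\} \\
&= \sum_{B \in V} q_B \left\{ \frac{\sum_{i=1}^{n} \sum_{\alpha \text{ s.t. } B \in \alpha,\, (A \to \alpha) \in R} n_B(\alpha) f(A \to \alpha; \omega_i)}{\sum_{i=1}^{n} \sum_{\alpha \text{ s.t. } (A \to \alpha) \in R} f(A \to \alpha; \omega_i)} \right\} \\
&= \sum_{B \in V} q_B \left\{ \frac{\sum_{i=1}^{n} \sum_{\alpha \text{ s.t. } B \in \alpha,\, (A \to \alpha) \in R} n_B(\alpha) f(A \to \alpha; \omega_i)}{\sum_{i=1}^{n} F(A; \omega_i)} \right\}
\end{aligned}$$
so that
$$q_A \sum_{i=1}^{n} F(A; \omega_i) \le \sum_{B \in V} q_B \sum_{i=1}^{n} \; \sum_{\alpha \text{ s.t. } B \in \alpha,\, (A \to \alpha) \in R} n_B(\alpha) f(A \to \alpha; \omega_i).$$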
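Sum over $A \in V$:
$$\sum_{A \in V} q_A \sum_{i=1}^{n} F(A; \omega_i) \le \sum_{B \in V} q_B \sum_{i=1}^{n} \sum_{A \in V} \; \sum_{\alpha \text{ s.t. } B \in \alpha,\, (A \to \alpha) \in R} n_B(\alpha) f(A \to \alpha; \omega_i) = \sum_{B \in V} q_B \sum_{i=1}^{n} \tilde{F}(B; \omega_i),$$
i.e.,
$$\sum_{A \in V} q_A \sum_{i=1}^{n} \left( F(A; \omega_i) - \tilde{F}(A; \omega_i) \right) \le 0.$$
Clearly, for every $i = 1, 2, \ldots, n$, $\tilde{F}(A; \omega_i) = F(A; \omega_i)$ whenever $A \ne S$ and $\tilde{F}(S; \omega_i) <
F(S; \omega_i)$. Hence $q_S = 0$, completing the proof of the theorem. □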
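The counting step used when summing over $A$ is that every nonroot instance of $B$ in a tree appears exactly once on the right-hand side of some production used in that tree. A small sketch checking this on a concrete tree follows; the nested-tuple tree encoding (terminals as plain strings) and the function names are assumptions made for this illustration.

```python
from collections import Counter


def instance_counts(tree):
    """Return (F, F_tilde): counts of all and of nonroot nonterminal instances."""
    all_counts = Counter()

    def visit(node):
        all_counts[node[0]] += 1
        for child in node[1:]:
            if not isinstance(child, str):
                visit(child)

    visit(tree)
    nonroot = Counter(all_counts)
    nonroot[tree[0]] -= 1          # the root is the only instance that is not nonroot
    return all_counts, nonroot


def rhs_occurrences(tree):
    """Sum, over productions used in the tree, of the nonterminal counts on right-hand sides."""
    occ = Counter()

    def visit(node):
        for child in node[1:]:
            if not isinstance(child, str):
                occ[child[0]] += 1     # one right-hand-side occurrence of this nonterminal
                visit(child)

    visit(tree)
    return occ


if __name__ == "__main__":
    t = ("S", ("A", "a"), ("S", ("A", "a"), ("A", "a")))
    F, F_tilde = instance_counts(t)
    occ = rhs_occurrences(t)
    print(all(F_tilde[b] == occ[b] for b in F))
```

The two counts differ from $F$ only at the root symbol, which is what forces $q_S = 0$ in the final inequality.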
Now let $\hat{p}_n$ be the system of probabilities produced by the $n$th iteration of the EM
Algorithm (5):
Corollary
If $R$ contains no null productions and no unit productions, then $\hat{p}_n(\Omega) = 1$ $\forall n \ge 1$.
Proof
Almost identical, except that we use (5) in place of (3) and end up with:
$$\sum_{A \in V} q_A \sum_{i=1}^{n} E_{\hat{p}_{n-1}}\left[ F(A; \omega) - \tilde{F}(A; \omega) \mid \omega \in \Omega_{Y(\omega_i)} \right] \le 0. \tag{6}$$
In the absence of unit productions and null productions, $F(A; \omega) < 2|\omega|$ (twice the
length of the string $\omega$). Hence the expectations in (6) are finite. Furthermore, $F(A; \omega)$
and $\tilde{F}(A; \omega)$ satisfy the same conditions as before: $\tilde{F}(A; \omega) = F(A; \omega)$ except when $A = S$,
in which case $\tilde{F}(A; \omega) < F(A; \omega)$. Again, we conclude that $q_S = 0$. □
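The bound $F(A; \omega) < 2|\omega|$ is stated without derivation. One way to see it, sketched here under the assumption that $|\omega|$ denotes the length $m \ge 1$ of the yield of $\omega$, is by induction on the tree; the symbol $N(\omega)$ for the total number of nonterminal nodes is introduced only for this sketch.

```latex
% N(\omega): total number of nonterminal nodes in \omega (notation assumed for this sketch)
% m: length of the yield of \omega; with no null productions, m >= 1
\begin{align*}
&\text{Base case: the root rewrites to a single terminal, so } N = 1 = 2m - 1. \\
&\text{Inductive step: the root rewrites to } k \ge 2 \text{ symbols, of which } t \text{ are terminals and } k - t \\
&\quad \text{are nonterminals heading subtrees with yield lengths } m_1, \ldots, m_{k-t} \ge 1,\ \textstyle\sum_j m_j = m - t. \\
&\quad N = 1 + \sum_{j=1}^{k-t} N_j \;\le\; 1 + \sum_{j=1}^{k-t} (2 m_j - 1) \;=\; 1 + 2(m - t) - (k - t) \;=\; 2m + 1 - k - t \;\le\; 2m - 1. \\
&\text{Therefore } F(A; \omega) \le N(\omega) \le 2m - 1 < 2m \text{ for every } A \in V.
\end{align*}
```

A right-hand side of length one must be a terminal because unit productions are excluded, and a right-hand side of length zero cannot occur because null productions are excluded, so the two cases are exhaustive.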
Acknowledgments 
We are indebted to Mark Johnson for 
encouraging us to look at this problem in 
the first place, and for much good advice 
along the way. This work was supported by 
the Army Research Office 
(DAAL03-92-G-0115), the National Science 
Foundation (DMS-9217655), and the Office 
of Naval Research (N00014-96-1-0647). 
References 
Baker, J. K. 1979. Trainable grammars for 
speech recognition. In Speech 
Communications Papers of the 97th Meeting of 
the Acoustical Society of America, 
pages 547-550, Cambridge, MA. 
Baum, L. E. 1972. An inequality and 
associated maximization techniques in 
statistical estimation of probabilistic 
functions of Markov processes. 
Inequalities, 3:1-8. 
Booth, T. L. and R. A. Thompson. 1973. 
Applying probability measures to abstract 
languages. IEEE Trans. on Computers, 
C-22:442-450. 
Dempster, A., N. Laird, and D. Rubin. 1977. 
Maximum likelihood from incomplete 
data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 
39:1-38. 
Grenander, U. 1976. Lectures in Pattern Theory 
Volume 1, Pattern Synthesis. 
Springer-Verlag, New York. 
Harris, T. E. 1963. The Theory of Branching Processes. 
Springer-Verlag, Berlin. 
Hopcroft, J. E. and J. D. Ullman. 1979. 
Introduction to Automata Theory, Languages, 
and Computation. Addison Wesley. 
Sánchez, J. A. and J. M. Benedí. 1997.
Consistency of stochastic context-free 
grammars from probabilistic estimation 
based on growth transformations. IEEE 
Transactions on Pattern Analysis and Machine 
Intelligence, 19:1052-1055. 
Stolcke, A. 1995. An efficient probabilistic 
context-free parsing algorithm that 
computes prefix probabilities. 
Computational Linguistics, 21:165-201. 
Wetherell, C. S. 1980. Probabilistic 
languages: A review and some open 
questions. Computing Surveys, 12:361-379. 

