Statistical Properties of Probabilistic 
Context-Free Grammars 
Zhiyi Chi* 
University of Chicago 
We prove a number of useful results about probabilistic context-free grammars (PCFGs) and 
their Gibbs representations. We present a method, called the relative weighted frequency method, 
to assign production probabilities that impose proper PCFG distributions on finite parses. We 
demonstrate that these distributions have finite entropies. In addition, under the distributions, 
sizes of parses have finite moment of any order. We show that Gibbs distributions on CFG parses, 
which generalize PCFG distributions and are more powerful, become PCFG distributions if their 
features only include frequencies of production rules in parses. Under these circumstances, we 
prove the equivalence of the maximum-likelihood (ML) estimation procedures for these two types 
of probability distributions on parses. We introduce the renormalization of improper PCFGs 
to proper ones. We also study PCFGs from the perspective of stochastic branching processes. 
We prove that with their production probabilities assigned by the relative weighted frequency 
method, PCFGs are subcritical, i.e., their branching rates are less than one. We also show that 
by renormalization, connected supercritical PCFGs become subcritical ones. Finally, some minor 
issues, including identifiability and approximation of production probabilities of PCFGs, are 
discussed. 
1. Introduction 
This article proves a number of useful properties of probabilistic context-free grammars 
(PCFGs). In this section, we give an introduction to the results and related topics. 
1.1 Assignment of Proper PCFG Distributions 
Finite parse trees, or parses, generated by a context-free grammar (CFG) can be equipped 
with a variety of probability distributions. The simplest way to do this is by production 
probabilities. First, for each nonterminal symbol in the CFG, a probability distribution 
is placed on the set of all productions from that symbol. Then each finite parse tree 
is allocated a probability equal to the product of the probabilities of all productions 
in the tree. More specifically, denote a finite parse tree by τ. For any production rule
A → α of the CFG, let f(A → α; τ) be the number of times it occurs in τ. Let R be the
set of all production rules. Then

$$p(\tau) = \prod_{(A\to\alpha)\in R} p(A \to \alpha)^{f(A\to\alpha;\tau)}.$$
A CFG with a probability distribution on its parses assigned in this way is called 
a probabilistic context-free grammar (PCFG) (Booth and Thompson 1973; Grenander 
* Department of Statistics, University of Chicago, Chicago, IL 60637, USA. Email: 
chi@galton.uchicago.edu. This work was supported by the Army Research Office (DAAL03-92-G-0115), 
the National Science Foundation (DMS-9217655), and the Office of Naval Research (N00014-96-1-0647). 
© 1999 Association for Computational Linguistics
Computational Linguistics Volume 25, Number 1 
1976) 1 and the probability distribution is called a PCFG distribution. A PCFG may 
be improper, i.e., the total probability of parses may be less than one. For instance, 
consider the CFG in Chomsky normal form: 
$$S \to S\,S$$
$$S \to a \qquad (1)$$
where S is the only nonterminal symbol, and a is the only terminal symbol. If p(S → SS) = p,
then p(S → a) = 1 − p. Let x_h be the total probability of all parses with height
no larger than h. Clearly, x_h is increasing. It is not hard to see that x_{h+1} = 1 − p + p x_h².
Therefore the limit x of the x_h, which is the total probability of all parses, is a solution of
the equation x = 1 − p + p x². The equation has two solutions: 1 and 1/p − 1. It can
be shown that x is the smaller of the two: x = min(1, 1/p − 1). Therefore, if p > 1/2,
then x < 1, an improper probability.
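The fixed-point argument can be checked numerically: iterating x_{h+1} = 1 − p + p x_h² from x_0 = 0 converges to min(1, 1/p − 1). A minimal sketch (the function name is ours):

```python
# Iterate x_{h+1} = 1 - p + p * x_h^2 for the grammar S -> S S | a,
# where x_h is the total probability of all parses of height <= h.
def total_parse_probability(p, iterations=10_000):
    x = 0.0
    for _ in range(iterations):
        x = 1.0 - p + p * x * x
    return x

for p in (0.3, 0.5, 0.7):
    print(p, total_parse_probability(p), min(1.0, 1.0 / p - 1.0))
```

For p ≤ 1/2 the iteration tends to 1 (properness); for p > 1/2 it settles strictly below 1.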
How to assign proper production probabilities is quite a subtle problem. A suffi- 
cient condition for proper assignment is established by Chi and Geman (1998), who 
prove that production probabilities estimated by the maximum-likelihood (ML) esti- 
mation procedure (or relative frequency estimation procedure, as it is called in com- 
putational linguistics) always impose proper PCFG distributions. Without much diffi- 
culty, this result can be generalized to a simple procedure, called "the relative weighted 
frequency" method, which assigns proper production probabilities of PCFGs. We will 
give more details of the generalization in Section 3 and summarize the method in 
Proposition 1. 
1.2 Entropy and Moments of Parse Tree Sizes 
As a probabilistic model for languages, the PCFG model has several important statis- 
tical properties, among which is the entropy of PCFG distribution on parses. Entropy 
is a measure of the uncertainty of a probability distribution. The larger its entropy, 
the less one can learn about parses randomly sampled from the distribution. As an 
example, suppose we have a set S of N parses, or any objects, τ_1, …, τ_N, where N is
very large. We may ask how much one can learn from the sentence "τ is a random
sample from S." At one extreme, let the distribution on S be p(τ_1) = 1 and p(τ_i) = 0
for i ≠ 1. Then, because τ = τ_1 with probability one, there is no uncertainty about the
sample. In other words, we can get full information from the above sentence. At the
other extreme, suppose the distribution is p(τ_1) = ⋯ = p(τ_N) = 1/N. In this case, all
the elements of S are statistically equivalent. No specific information is given about τ
that would make it possible to identify it within S. Greater effort is required, for example,
enumerating all the elements of S, to find what τ is. Since S is big, the uncertainty
about the sample is then much greater. Correspondingly, for the two cases, the entropy
is 0 and log N ≫ 0, respectively.
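The two extremes can be checked in a few lines of code (natural logarithm; the helper function is ours):

```python
import math

def entropy(dist):
    """Shannon entropy H(p) = sum_i p_i log(1 / p_i), with the convention 0 log(1/0) = 0."""
    return sum(p * math.log(1.0 / p) for p in dist if p > 0)

N = 100_000
point_mass = [1.0] + [0.0] * (N - 1)   # all probability on tau_1
uniform = [1.0 / N] * N                # every element equally likely

print(entropy(point_mass))             # no uncertainty
print(entropy(uniform))                # log N, maximal uncertainty
```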
Entropy plays a central role in the theory of information. For an excellent exposi- 
tion of this theory, we refer the reader to Cover and Thomas (1991). The theory has been 
applied in probabilistic language modeling (Mark, Miller, and Grenander 1996; Mark 
et al. 1996; Johnson 1998), natural language processing (Berger, Della Pietra, and Della 
Pietra 1996; Della Pietra, Della Pietra, and Lafferty 1997), as well as computational 
vision (Zhu, Wu, and Mumford 1997). In addition, all the models proposed in these 
articles are based on an important principle called the maximum entropy principle. 
Chapter 11 of Cover and Thomas (1991) gives an introduction to this principle. 
1 A probabilistic context-free grammar is also called a stochastic context-free grammar (SCFG).
Briefly, the maximum entropy principle says that among all the distributions that 
satisfy the same given conditions, the one that achieves the largest entropy should be 
the model of choice. For a distribution p on parses, its entropy is

$$H(p) = \sum_{\tau} p(\tau) \log \frac{1}{p(\tau)}.$$
In order that the maximum entropy principle makes sense, all the candidate distribu- 
tions should have finite entropies, and this is usually implicitly assumed. 
Take Mark, Miller, and Grenander's (1996) model, for example. First, a PCFG 
distribution p is selected to serve as a "reference" distribution on parses. Then, by 
invoking the minimum relative entropy principle, which is a variant of the maximum 
entropy principle, the distribution q that minimizes

$$D(q\|p) = \sum_{\tau} q(\tau) \log \frac{q(\tau)}{p(\tau)} = \sum_{\tau} q(\tau) \log \frac{1}{p(\tau)} - H(q)$$
subject to a set of constraints incorporating context-sensitive features is chosen to be 
the distribution of the model. It is then easy to see that the assumption that H(q) is 
finite is necessary. 
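The decomposition of D(q‖p) used above, into a cross-entropy term minus H(q), is easy to verify numerically on a finite toy distribution (the helper names are ours):

```python
import math

def kl(q, p):
    """Kullback-Leibler divergence D(q || p) = sum_i q_i log(q_i / p_i)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def cross_entropy(q, p):
    return sum(qi * math.log(1.0 / pi) for qi, pi in zip(q, p) if qi > 0)

def entropy(q):
    return sum(qi * math.log(1.0 / qi) for qi in q if qi > 0)

q = [0.5, 0.25, 0.25]
p = [0.6, 0.2, 0.2]
print(kl(q, p), cross_entropy(q, p) - entropy(q))   # the two agree
```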
Conceptually, having finite entropy is a basic requirement for a "good" proba- 
bilistic model because a probability distribution with infinite entropy has too much 
uncertainty to be informative. 
Problems regarding entropies of PCFGs are relatively easy to tackle because they 
can be studied analytically. Several authors have reported results on this subject, in- 
cluding Miller and O'Sullivan (1992), who gave analytical results on the rates of en- 
tropies of improper PCFGs. It is worthwhile to add a few more results on entropies of 
proper PCFGs. In this paper, we show that the entropies of PCFG distributions im- 
posed by production probabilities assigned by the relative weighted frequency method 
are finite (Section 4, Corollary 2). 
In addition to entropy, we will also study the moment of sizes of parses. The mo- 
ment is of statistical interest because it gives information on how sizes of parses are 
distributed. For PCFG distributions, the first moment of sizes of parses, i.e., the mean 
size of parses, is directly linked with the entropy: the mean size of parses is finite if 
and only if the entropy is. The second moment of sizes is another familiar quantity. 
The difference between the second moment and the mean squared is the variance of 
sizes, which tells us how "scattered" sizes of parses are distributed around the mean. 
Proposition 2 shows that, under distributions imposed by production probabilities as- 
signed by the relative weighted frequency method, sizes of parses have finite moment 
of any order. 
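As an illustration (our example, not part of the original analysis), for the grammar S → SS | a of (1) with p < 1/2, the mean size m (number of nonterminal nodes) satisfies m = 1 + 2pm, i.e., m = 1/(1 − 2p), and a simulation agrees:

```python
import random

def sample_parse_size(p, rng):
    """Sample a parse of the PCFG S -> S S (prob p) | a (prob 1 - p) and
    return its size (number of nonterminal nodes); subcritical for p < 1/2."""
    size, pending = 0, 1               # pending = number of unexpanded S's
    while pending:
        pending -= 1
        size += 1
        if rng.random() < p:           # apply S -> S S
            pending += 2
    return size

rng = random.Random(0)
p = 0.4
sizes = [sample_parse_size(p, rng) for _ in range(100_000)]
print(sum(sizes) / len(sizes), 1.0 / (1.0 - 2.0 * p))   # both close to 5
```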
1.3 Gibbs Distributions on Parses and Renormalization of Improper PCFGs 
Besides PCFG distributions, a CFG can be equipped with many other types of proba- 
bility distributions. Among the most widely studied is the Gibbs distribution (Mark, 
Miller, and Grenander 1996; Mark et al. 1996; Mark 1997; Abney 1997). Gibbs distribu- 
tions arise naturally by invoking the maximum entropy principle. They are considered 
to be more powerful than PCFG distributions because they incorporate more features, 
especially context-sensitive features, of natural languages, whereas PCFG distributions 
only consider frequencies of production rules. On the other hand, Gibbs distributions 
are not always superior to PCFG distributions. A Gibbs distribution with only frequencies
of production rules in parses as its features turns into a PCFG. More specifically,
we will show in Proposition 4 in Section 5 that a CFG equipped with a Gibbs
distribution of the form

$$p_\lambda(\tau) = \frac{1}{Z_\lambda} \prod_{(A\to\alpha)\in R} e^{\lambda_{A\to\alpha} f(A\to\alpha;\tau)} \qquad (2)$$
is actually a PCFG, and we can get the production probabilities of the PCFG explicitly 
from the Gibbs form. 
The fact that a Gibbs distribution of the form in (2) is imposed by production 
probabilities has a useful consequence. Suppose p is an improper PCFG distribution.
If we write the sum of p over all parses as Z, and assign to each parse tree a new
probability equal to p(τ)/Z, then we renormalize p to a Gibbs distribution p̃ on parses.
What (2) implies is that p̃ is also a PCFG distribution (Corollary 3). Moreover, in
Section 6 we will show that, under certain conditions, p̃ is subcritical.
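For the grammar in (1) the renormalization can be made explicit. The following sketch (ours) checks that p̃(τ) = p(τ)/Z is again a PCFG distribution, with new production probabilities p̃(S → SS) = pZ and p̃(S → a) = (1 − p)/Z, and that the renormalized PCFG is subcritical, previewing Section 6:

```python
# Improper PCFG: S -> S S with probability p > 1/2, S -> a with 1 - p.
p = 0.7
Z = 1.0 / p - 1.0          # total probability of all finite parses (< 1)

p_ss = p * Z               # renormalized probability of S -> S S
p_a = (1.0 - p) / Z        # renormalized probability of S -> a
print(p_ss, p_a)           # a proper system: they sum to 1

# A parse with k uses of S -> S S has k + 1 uses of S -> a; check p~ = p/Z:
for k in range(6):
    assert abs(p_ss**k * p_a**(k + 1) - (p**k * (1 - p)**(k + 1)) / Z) < 1e-12

# Branching rate of the renormalized PCFG: 2 * p_ss = 2 (1 - p) < 1.
print(2 * p_ss)
```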
There is another issue about the relations between PCFG distributions and Gibbs 
distributions of the form in (2), from a statistical point of view. Although PCFG dis- 
tributions are special cases of Gibbs distributions in the sense that the former can be 
written in the form of the latter, PCFG distributions cannot be put in the framework 
of Gibbs distributions if they have different parameter estimation procedures. We will 
compare the maximum-likelihood (ML) estimation procedures for these two distribu- 
tions. As will be seen in Section 5, numerically these two estimation procedures are 
different. However, Corollary 4 shows that they are equivalent in the sense that esti- 
mates by the two procedures impose the same distributions on parses. For this reason, 
a Gibbs distribution may be considered a generalization of PCFG, not only in form, 
but also in a certain statistical sense. 
1.4 Branching Rates of PCFGs 
Because of their context-free nature, PCFG distributions can also be studied from the 
perspective of stochastic processes. A PCFG can be described by a random branch- 
ing process (Harris 1963), and its asymptotic behavior can be characterized by its 
branching rate. A branching process, or its corresponding PCFG, is called subcritical 
(critical, supercritical), if its branching rate < 1 (= 1, > 1). A subcritical PCFG is always 
proper, whereas a supercritical PCFG is always improper. Many asymptotic properties 
of supercritical branching processes are established by Miller and O'Sullivan (1992). 
Chi and Geman (1998) proved the properness of PCFG distributions imposed by esti- 
mated production probabilities, and around the same time Sánchez and Benedí (1997)
established the subcriticality of the corresponding branching processes, hence their 
properness. In this paper we will explore properties of branching rate further. First, 
in Proposition 5, we will show that if a PCFG distribution is imposed by production 
probabilities assigned by the relative weighted frequency method, then the PCFG is 
subcritical. The result generalizes that of Sánchez and Benedí (1997), and has a less
involved proof. Then in Proposition 7, we will demonstrate that a connected and 
improper PCFG, after being renormalized, becomes a subcritical PCFG. 
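The branching rate can be computed as the largest eigenvalue of the mean matrix M, where M[X][Y] is the expected number of instances of nonterminal Y produced by one rewrite of X. A sketch for a small hypothetical grammar (the rules, probabilities, and names are ours), using power iteration:

```python
# Hypothetical PCFG:  S -> A S (0.4) | a (0.6)    A -> A A (0.3) | b (0.7)
# Mean matrix over nonterminals; e.g. M["A"]["A"] = 0.3 * 2 = 0.6.
M = {
    "S": {"S": 0.4, "A": 0.4},
    "A": {"S": 0.0, "A": 0.6},
}
nts = ["S", "A"]

def branching_rate(M, nts, iters=2000):
    """Spectral radius of the nonnegative mean matrix, by power iteration."""
    v = {X: 1.0 for X in nts}
    rate = 1.0
    for _ in range(iters):
        w = {X: sum(M[X][Y] * v[Y] for Y in nts) for X in nts}
        rate = max(w.values())
        v = {X: w[X] / rate for X in nts}
    return rate

print(branching_rate(M, nts))   # less than 1: subcritical
```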
1.5 Identifiability and Approximation of Production Probabilities 
Returning to the statistical aspect of PCFGs, we will discuss the identifiability of pro- 
duction probabilities of PCFGs as well as parameters of Gibbs distributions. Briefly 
speaking, production probabilities of PCFGs are identifiable, which means that differ- 
ent production probabilities always impose different distributions on parses (Proposi- 
tion 8). In contrast, for the Gibbs distribution given by (2), the λ parameters are not
identifiable; in fact, there are infinitely many different λ that impose the same Gibbs
distribution.
Finally, in Proposition 9, we propose a method to approximate production prob- 
abilities. Perhaps the most interesting part about the result lies in its proof, which is 
largely information theoretic. We apply the Kullback-Leibler divergence, which is the 
information distance between two probability distributions, to prove the convergence 
of the approximation. In information theory literature, the Kullback-Leibler divergence 
is also called the relative entropy. We also use Lagrange multipliers to solve the con- 
strained minimization problem involved. Both Kullback-Leibler divergence and La- 
grange multipliers method are becoming increasingly useful in statistical modeling, 
e.g., modeling based on the maximum entropy principle. 
1.6 Summary 
As a simple probabilistic model, the PCFG model is applied to problems in linguistics 
and pattern recognition that do not involve much context sensitivity. To design sensi- 
ble PCFG distributions for such problems, it is necessary to understand some of the 
statistical properties of the distributions. On the other hand, the PCFG model serves as 
a basis for more expressive linguistic models. For example, many Gibbs distributions 
are built upon PCFG distributions by defining 
$$P(\tau) = \frac{p(\tau)\, e^{\lambda U(\tau)}}{Z},$$
where p is a PCFG distribution. Therefore, in order for the Gibbs distribution P to 
have certain desired statistical properties, it is necessary for p to have those properties 
first. This paper concerns some of the fundamental properties of PCFGs. However, the 
methods used in the proofs are also useful for the study of statistical issues on other 
probabilistic models. 
This paper proceeds as follows: In Section 2, we gather the notations for PCFGs 
that will be used in the remaining part of the paper. Section 3 establishes the rel- 
ative weighted frequency method. Section 4 proves the finiteness of the entropies 
of PCFG distributions when production probabilities are assigned using the relative 
weighted frequency method. In addition, finiteness of the moment of sizes of parses 
are proved. Section 5 discusses the connections between PCFG distributions and Gibbs 
distributions on parses. Renormalization of improper PCFGs is also discussed here. 
In Section 6, PCFGs are studied from the random branching process point of view. 
Finally, in Section 7, identifiability of production probabilities and their approximation
are addressed. 
2. Notations and Definitions
In this section, we collect the notations and definitions we will use for the remaining 
part of the paper. 
Definition 1 
A context-free grammar (CFG) G is a quadruple (N, T, R, S), where N is the set of 
variables, T the set of terminals, R the set of production rules, and S E N is the start 
symbol. 2 Elements of N are also called nonterminal symbols. N, T, and R are always 
2 Some of our discussion requires that each sentential form have only finitely many parses. For this 
reason, we shall assume that in G, there are no null or unit productions. 
assumed to be finite. Let Ω denote the set of finite parse trees of G, an element of
which is always denoted as τ. For each τ ∈ Ω and each production rule (A → α) ∈ R,
define f(A → α; τ) to be the number of occurrences, or frequency, of the rule in τ, and
f(A; τ) to be the number of occurrences of A in τ. f(A; τ) and f(A → α; τ) are related
by

$$f(A;\tau) = \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} f(A \to \alpha; \tau).$$

Define h(τ) as the height of τ, which is the number of nonterminal nodes on the
longest route from τ's root to its terminal nodes. Define |τ| as the size of τ, which is
the total number of nonterminal nodes in τ. For any A ∈ N and any sentential form
γ ∈ (N ∪ T)*, define n(A; γ) as the number of instances of A in γ. Define |γ| as the
length of the sentential form.
Definition 2 
Let A ∈ τ denote that the symbol A occurs in the parse τ. If A ∈ τ, let τ_A be the
left-most maximum subtree of τ rooted in A, which is the subtree of τ rooted in A
satisfying the condition that if τ′ ≠ τ_A is also a subtree of τ rooted in A, then τ′ is
either a subtree of τ_A, or a right sibling of τ_A, or a subtree of a right sibling of τ_A. Let
A_τ be the root of τ_A, which is the left-most "shallowest" instance of A in τ.
Definition 3 
For any two symbols A ∈ N and B ∈ N ∪ T, not necessarily different, B is said to be
reachable from A in G if there is a sequence of symbols A_0, A_1, …, A_n with A_0 = A
and A_n = B, and a sequence of sentential forms α_0, …, α_{n−1}, such that each A_i → α_i
is a production in R and each α_i contains the next symbol A_{i+1}. G is called connected
if all elements in N ∪ T can be reached from all nonterminal symbols.
We now define the probabilistic version of reachability in CFG. Suppose p is a
distribution on Ω. For any two symbols A ∈ N and B ∈ N ∪ T, not necessarily different,
B is said to be reachable from A in G under p if there is a τ ∈ Ω with p(τ) > 0 and
there is a subtree τ′ of τ such that τ′ is rooted in A and B ∈ τ′. G is called connected
under p if all symbols in N ∪ T can be reached from all nonterminal symbols under p.
Definition 4 
A system of production probabilities of G is a function p : R → [0, 1] such that for any
A ∈ N,

$$\sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} p(A \to \alpha) = 1. \qquad (3)$$
We will also use p to represent the PCFG probability distribution on parses imposed
by p, via the formula

$$p(\tau) = \prod_{(A\to\alpha)\in R} p(A \to \alpha)^{f(A\to\alpha;\tau)}. \qquad (4)$$
Similarly, for any estimated system of production probabilities p̂, we will also use p̂ to
represent the probability distribution on parses imposed by p̂. We will write p(Ω) as
the total probability of all finite parse trees in Ω.
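The distribution in (4) is straightforward to compute once the rule frequencies f(A → α; τ) are read off a parse. A minimal sketch (the tree encoding, as nested (symbol, children) pairs, is ours):

```python
# A parse tree is (symbol, children); a terminal leaf is (symbol, []).
# Example: a parse of "a a" in the grammar S -> S S | a.
tree = ("S", [("S", [("a", [])]), ("S", [("a", [])])])

def rule_counts(t, counts=None):
    """f(A -> alpha; tau): frequency of each production rule in the parse."""
    if counts is None:
        counts = {}
    sym, children = t
    if children:                             # nonterminal node
        rhs = tuple(c[0] for c in children)
        counts[(sym, rhs)] = counts.get((sym, rhs), 0) + 1
        for c in children:
            rule_counts(c, counts)
    return counts

def pcfg_probability(t, probs):
    """p(tau) = product over rules of p(A -> alpha)^f(A -> alpha; tau), as in (4)."""
    prob = 1.0
    for rule, f in rule_counts(t).items():
        prob *= probs[rule] ** f
    return prob

probs = {("S", ("S", "S")): 0.4, ("S", ("a",)): 0.6}
print(pcfg_probability(tree, probs))   # 0.4 * 0.6**2
```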
Definition 5 
Now we introduce a notation from statistics. Let p be an arbitrary distribution on Ω and
g(τ) a function of τ ∈ Ω. The expected value of g under the distribution p, denoted
E_p g(τ), is defined as

$$E_p g(\tau) = \sum_{\tau\in\Omega} p(\tau)\, g(\tau).$$
Definition 6 
All the parse trees we have seen so far are rooted in S. It is often useful to investigate
subtrees of parses, therefore it is necessary to consider trees rooted in symbols other
than S. We call a tree rooted in A ∈ N a parse (tree) rooted in A if it is generated from
A by the production rules in R. Let Ω_A be the set of all finite parse trees with root A.
Define p_A as the probability distribution on Ω_A imposed by a system of production
probabilities p, via (4). Also extend the notions of height and size of parses to trees
in Ω_A.
When we write p_A(τ), we always assume that τ is a parse tree rooted in A. When
p = p_A, E_p g(τ) equals E_{p_A} g(τ) = Σ_{τ∈Ω_A} p_A(τ)g(τ). We will use p(Ω_A) instead of p_A(Ω_A) to
denote the total probability of finite parse trees in Ω_A. With no subscripts, Ω and p are
assumed to be Ω_S and p_S, respectively.
For convenience, we also extend the notion of trees to terminals. For each terminal
t ∈ T, define Ω_t as the set of the single "tree" {t}. Define p_t(t) = 1, |t| = 0, and h(t) = 0.
For this paper we make the following assumptions:

1. For each symbol A ≠ S, there is at least one parse τ with root S such that
A ∈ τ. This will guarantee that each A ≠ S can be reached from S;

2. When a system of production probabilities p is not explicitly assigned,
each production rule (A → α) ∈ R is assumed to have positive
probability, i.e., p(A → α) > 0. This guarantees that there are no useless
productions in the PCFG.
3. Relative Weighted Frequency 
The relative weighted frequency method is motivated by the maximum-likelihood 
(ML) estimation of production probabilities. We shall first give a brief review of ML 
estimation. 
We consider two cases of ML estimation. In the first case, we assume the data
are fully observed, which means that all the samples are fully observed finite parse
trees. Let τ_1, τ_2, …, τ_n be the samples. Then the ML estimate of p(A → α) is the ratio
between the total number of occurrences of the production A → α in the samples and
the total number of occurrences of the symbol A in the samples, 
$$\hat p(A \to \alpha) = \frac{\sum_{i=1}^{n} f(A \to \alpha; \tau_i)}{\sum_{i=1}^{n} f(A; \tau_i)}. \qquad (5)$$
Because of the form of the estimator in (5), ML estimation in the full observa- 
tion case is also called relative frequency estimation in computational linguistics. This 
simple estimator, as shown by Chi and Geman (1998), assigns proper production prob- 
abilities for PCFGs. 
In the second case, the parse trees are unobserved. Instead, the yields Y_1 =
Y(τ_1), …, Y_n = Y(τ_n), which are the left-to-right sequences of terminals of the unknown
parses τ_1, …, τ_n, form the data. It can be proved that the ML estimate p̂ is
given by

$$\hat p(A \to \alpha) = \frac{\sum_{i=1}^{n} E_{\hat p}[f(A \to \alpha; \tau) \mid \tau \in \Omega_{Y_i}]}{\sum_{i=1}^{n} E_{\hat p}[f(A; \tau) \mid \tau \in \Omega_{Y_i}]}, \qquad (6)$$

where Ω_Y is the set of all parses with yield Y, i.e., Ω_Y = {τ ∈ Ω : Y(τ) = Y}.
Equation (6) cannot be solved in closed form. Usually, the solution is computed
by the EM algorithm with the following iteration (Baum 1972; Baker 1979; Dempster,
Laird, and Rubin 1977):

$$\hat p_{k+1}(A \to \alpha) = \frac{\sum_{i=1}^{n} E_{\hat p_k}[f(A \to \alpha; \tau) \mid \tau \in \Omega_{Y_i}]}{\sum_{i=1}^{n} E_{\hat p_k}[f(A; \tau) \mid \tau \in \Omega_{Y_i}]}. \qquad (7)$$
Like p̂ in (5), the p̂_k for k > 0 impose proper probability distributions on Ω (Chi and Geman
1998).
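For the toy grammar in (1) the conditional expectations in (7) are degenerate: every parse of the yield a^n contains exactly n − 1 uses of S → SS and n uses of S → a, whatever its shape, so the update does not depend on p̂_k and EM converges in one step. A sketch of this special case (ours; general grammars need the inside-outside computations behind (7)):

```python
# Yields for S -> S S | a are strings a^n; every parse of a^n has n - 1
# occurrences of S -> S S and n of S -> a, so expected counts are exact counts.
def em_estimate(yield_lengths):
    e_ss = sum(n - 1 for n in yield_lengths)    # total E[f(S -> S S; tau)]
    e_a = sum(yield_lengths)                    # total E[f(S -> a; tau)]
    return e_ss / (e_ss + e_a)                  # estimate of p(S -> S S)

lengths = [1, 3, 4, 2, 5]                       # yields a, aaa, aaaa, aa, aaaaa
print(em_estimate(lengths))
```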
To unify (6) and (7), expand E_{p̂_k}[f(A → α; τ) | τ ∈ Ω_{Y_i}], by the definition of expectation,
into

$$E_{\hat p_k}[f(A \to \alpha; \tau) \mid \tau \in \Omega_{Y_i}] = \sum_{\tau\in\Omega_{Y_i}} f(A \to \alpha; \tau)\, \hat p_k(\tau \mid \tau \in \Omega_{Y_i}).$$
Let Λ be the set of parses whose yields belong to the data, i.e., Λ = {τ : Y(τ) ∈
{Y_1, …, Y_n}}. For each τ ∈ Λ, let y = Y(τ) and

$$W(\tau) = \sum_{i:\,Y_i = y} \hat p_k(\tau \mid \tau \in \Omega_{Y_i}).$$

Then we observe that, for any production rule A → α,

$$\sum_{i=1}^{n} E_{\hat p_k}[f(A \to \alpha; \tau) \mid \tau \in \Omega_{Y_i}] = \sum_{i=1}^{n} \sum_{\tau\in\Omega_{Y_i}} f(A \to \alpha; \tau)\, \hat p_k(\tau \mid \tau \in \Omega_{Y_i}) = \sum_{\tau\in\Lambda} f(A \to \alpha; \tau)\, W(\tau).$$
Therefore, (7) is transformed into

$$\hat p_{k+1}(A \to \alpha) = \frac{\sum_{\tau\in\Lambda} f(A \to \alpha; \tau)\, W(\tau)}{\sum_{\tau\in\Lambda} f(A; \tau)\, W(\tau)}.$$
The ML estimator in (5) can also be written in the above form, as can be readily
checked by letting Λ be the set {τ_1, …, τ_n} and W(τ), for each τ ∈ Λ, be the number
of occurrences of τ in the data. In addition, in both the full observation case and the partial
observation case, we can divide the weights W(τ) by a constant so that their sum
is 1.
The above discussion leads us to define a procedure to assign production probabilities
as follows. First, pick an arbitrary finite subset Λ of Ω, with every production
rule appearing in the trees of Λ. Second, assign to each τ ∈ Λ a positive weight W(τ)
such that Σ_{τ∈Λ} W(τ) = 1. Finally, define a system of production probabilities p by

$$p(A \to \alpha) = \frac{\sum_{\tau\in\Lambda} f(A \to \alpha; \tau)\, W(\tau)}{\sum_{\tau\in\Lambda} f(A; \tau)\, W(\tau)}. \qquad (8)$$
Because of the similarity between (5) and (8), we call the procedure to assign produc- 
tion probabilities by (8) the "relative weighted frequency" method. 
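A direct implementation of (8) (ours; parses are given as rule-frequency dictionaries, and the weights are assumed positive with sum 1):

```python
def relative_weighted_frequency(parses, weights):
    """Eq. (8): p(A -> alpha) = sum_tau f(A -> alpha; tau) W(tau)
                              / sum_tau f(A; tau) W(tau)."""
    num = {}                                   # weighted rule frequencies
    den = {}                                   # weighted nonterminal frequencies
    for counts, w in zip(parses, weights):
        for (lhs, rhs), f in counts.items():
            num[(lhs, rhs)] = num.get((lhs, rhs), 0.0) + f * w
            den[lhs] = den.get(lhs, 0.0) + f * w
    return {rule: c / den[rule[0]] for rule, c in num.items()}

# Two parses from the grammar S -> S S | a, as {(lhs, rhs): frequency}:
parses = [
    {("S", ("S", "S")): 1, ("S", ("a",)): 2},  # a parse of "a a"
    {("S", ("a",)): 1},                        # the parse of "a"
]
p = relative_weighted_frequency(parses, [0.5, 0.5])
print(p)   # p(S -> S S) = 0.25, p(S -> a) = 0.75
```

The resulting probabilities sum to 1 for each nonterminal; here the estimated PCFG has branching rate 2 × 0.25 = 0.5 < 1.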
Proposition 1 
Suppose all the symbols of N occur in the parses of Λ, and all the parses have positive
weight. Then the production probabilities given by (8) impose proper distributions on 
parses. 
Proof
The proof is almost identical to the one given by Chi and Geman (1998). Let q_A = p
(derivation tree rooted in A fails to terminate). We will show that q_S = 0 (i.e., derivation
trees rooted in S always terminate). For each A ∈ N, let f̃(A; τ) be the number of nonroot
instances of A in τ. Given α ∈ (N ∪ T)*, let α_i be the ith symbol of the sentential
form α. For any A ∈ N,

$$\begin{aligned}
q_A &= p\Big(\bigcup_{(A\to\alpha)\in R} \bigcup_{i} \{\text{derivation begins } A \to \alpha \text{ and } \alpha_i \text{ fails to terminate}\}\Big) \\
&= \sum_{(A\to\alpha)\in R} p(A \to \alpha)\, p\Big(\bigcup_{i} \{\alpha_i \text{ fails to terminate}\}\Big) \\
&\le \sum_{(A\to\alpha)\in R} p(A \to \alpha) \sum_{i} p(\{\alpha_i \text{ fails to terminate}\}) \\
&= \sum_{(A\to\alpha)\in R} p(A \to \alpha) \sum_{B\in N} n(B;\alpha)\, q_B \\
&= \sum_{B\in N} q_B\, \frac{\sum_{(A\to\alpha)\in R} n(B;\alpha) \sum_{\tau\in\Lambda} f(A \to \alpha; \tau)\, W(\tau)}{\sum_{\tau\in\Lambda} f(A; \tau)\, W(\tau)},
\end{aligned}$$

i.e.,

$$q_A \sum_{\tau\in\Lambda} f(A; \tau)\, W(\tau) \le \sum_{B\in N} q_B \sum_{\tau\in\Lambda} \sum_{(A\to\alpha)\in R} n(B;\alpha)\, f(A \to \alpha; \tau)\, W(\tau).$$

Sum over A ∈ N:

$$\sum_{A\in N} q_A \sum_{\tau\in\Lambda} f(A; \tau)\, W(\tau) \le \sum_{B\in N} q_B \sum_{\tau\in\Lambda} \sum_{A\in N} \sum_{(A\to\alpha)\in R} n(B;\alpha)\, f(A \to \alpha; \tau)\, W(\tau) = \sum_{B\in N} q_B \sum_{\tau\in\Lambda} \tilde f(B; \tau)\, W(\tau),$$

i.e.,

$$\sum_{A\in N} q_A \sum_{\tau\in\Lambda} \big(f(A; \tau) - \tilde f(A; \tau)\big)\, W(\tau) \le 0.$$

Clearly, for every τ ∈ Λ, f̃(A; τ) = f(A; τ) whenever A ≠ S, and f̃(S; τ) = f(S; τ) − 1.
Since every term of the sum is nonnegative, it follows that q_S Σ_{τ∈Λ} W(τ) = 0.
Hence q_S = 0, completing the proof. □
Corollary 1
Under the same assumptions as Proposition 1, for each symbol A ∈ N, p(Ω_A) = 1.

Proof
For any A ∈ N, there is a τ ∈ Λ such that A ∈ τ. Since p(τ) > 0, this implies A is
reachable from S under p. Using the notation given in Definition 2, we have

$$q_S \ge p(\{A \in \tau \text{ and } \tau_A \text{ fails to terminate}\}) = p(\{\tau_A \text{ fails to terminate}\} \mid A \in \tau)\, p(A \in \tau),$$

so that

$$p(\{\tau_A \text{ fails to terminate}\} \mid A \in \tau) = 0,$$

since q_S = 0 and p(A ∈ τ) > 0. By the nature of PCFGs, the form of τ_A is distributed
according to p_A, independent of its location in τ or of the choice of subtrees elsewhere
in τ. Therefore the conditional probability of τ_A failing to terminate, given that A
occurs in τ, equals q_A, proving that q_A = 0. □
4. Entropy and Moments of Parse Tree Sizes 
In this section, we will first show that if production probabilities are assigned by 
the relative weighted frequency method, then they impose PCFG distributions under 
which parse tree sizes have finite moment of any order. Based on this result, we 
will then demonstrate that such PCFG distributions have finite entropy and give the 
explicit form of the entropy. 
The mth moment of sizes of parses is given by

$$E_p|\tau|^m = \sum_{\tau\in\Omega} p(\tau)\, |\tau|^m$$

and the entropy of a PCFG distribution p is given by

$$H(p) = \sum_{\tau\in\Omega} p(\tau) \log \frac{1}{p(\tau)}.$$
To make the proofs more readable, we define, for any given Λ = {τ_1, …, τ_n} and
any (A → α) ∈ R,

$$F(A \to \alpha) = \sum_{\tau\in\Lambda} f(A \to \alpha; \tau)\, W(\tau),$$

and, for any A ∈ N,

$$F(A) = \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} F(A \to \alpha) = \sum_{\tau\in\Lambda} f(A; \tau)\, W(\tau);$$

that is, F(A → α) is the weighted sum of the number of occurrences of the production
rule A → α in Λ, and F(A) is the weighted sum of the number of occurrences of A
in Λ.
The relative weighted frequency method given by (8) can then be written as

$$p(A \to \alpha) = \frac{F(A \to \alpha)}{F(A)}. \qquad (9)$$
We have the following simple lemma:

Lemma 1
We have

$$\sum_{A\in N} F(A) = \sum_{\tau\in\Lambda} |\tau|\, W(\tau) \qquad (10)$$

and, for any A ∈ N,

$$\sum_{B\in N} \sum_{\substack{\gamma \text{ s.t.} \\ (B\to\gamma)\in R}} F(B \to \gamma)\, n(A;\gamma) = \begin{cases} F(S) - 1 & \text{if } A = S \\ F(A) & \text{if } A \ne S. \end{cases} \qquad (11)$$

(If Σ_{τ∈Λ} W(τ) ≠ 1, F(S) − 1 should be changed to F(S) − Σ_{τ∈Λ} W(τ).)
Proof
For the first equation,

$$\sum_{A\in N} F(A) = \sum_{A\in N} \sum_{\tau\in\Lambda} f(A;\tau)\, W(\tau) = \sum_{\tau\in\Lambda} \sum_{A\in N} f(A;\tau)\, W(\tau) = \sum_{\tau\in\Lambda} |\tau|\, W(\tau).$$

For the second equation,

$$\sum_{B\in N} \sum_{\substack{\gamma \text{ s.t.} \\ (B\to\gamma)\in R}} F(B \to \gamma)\, n(A;\gamma) = \sum_{B\in N} \sum_{\substack{\gamma \text{ s.t.} \\ (B\to\gamma)\in R}} \sum_{\tau\in\Lambda} f(B \to \gamma; \tau)\, W(\tau)\, n(A;\gamma) = \sum_{\tau\in\Lambda} W(\tau) \sum_{B\in N} \sum_{\substack{\gamma \text{ s.t.} \\ (B\to\gamma)\in R}} f(B \to \gamma; \tau)\, n(A;\gamma). \qquad (12)$$
For each A,

$$\sum_{B\in N} \sum_{\substack{\gamma \text{ s.t.} \\ (B\to\gamma)\in R}} f(B \to \gamma; \tau)\, n(A;\gamma)$$

is the number of nonroot instances of A in τ. When A ≠ S, the number of nonroot
instances of A in τ is equal to f(A; τ). Substitute this into (12) to prove (11) for the case
A ≠ S. The case A = S is similarly proved. □
Proposition 2
Suppose all the symbols in N occur in the parses of Λ, and all parses have positive
weights. If the production probabilities p are assigned by the relative weighted
frequency method in (8), then for each m ∈ ℕ ∪ {0}, E_p|τ|^m < ∞.
Proof
We shall show that for any A ∈ N, if p = p_A, then E_p|τ|^m < ∞. When m = 0, this is
clearly true. Now suppose the claim is true for 0, …, m − 1. For each A ∈ N and k ∈ ℕ,
define

$$M_{k,A} = \sum_{\substack{\tau\in\Omega_A \\ h(\tau)<k}} p_A(\tau)\, |\tau|^m.$$

It is easy to check that

$$M_{k+1,A} = \sum_{\substack{\alpha\in(N\cup T)^* \\ (A\to\alpha)\in R}} \; \sum_{\substack{\tau_1,\dots,\tau_L \\ \tau_i\in\Omega_{\alpha_i} \\ h(\tau_i)<k}} \Big(1 + \sum_{i=1}^{L} |\tau_i|\Big)^{m} p(A \to \alpha)\, p_{\alpha_1}(\tau_1) \cdots p_{\alpha_L}(\tau_L), \qquad (13)$$

where for ease of notation, we write L for |α|. For fixed α, write

$$\Big(1 + \sum_{i=1}^{L} |\tau_i|\Big)^{m} = P(|\tau_1|, \dots, |\tau_L|) + \sum_{i=1}^{L} |\tau_i|^{m},$$

where P is a polynomial in |τ_1|, …, |τ_L|, each term of which is of the form

$$|\tau_1|^{s_1} |\tau_2|^{s_2} \cdots |\tau_L|^{s_L}, \qquad 0 \le s_i < m, \quad s_1 + s_2 + \cdots + s_L \le m. \qquad (14)$$
By the induction hypothesis, there is a constant C > 1 such that for all 0 ≤ s < m and
A ∈ N ∪ T,

$$\sum_{\tau\in\Omega_A} p_A(\tau)\, |\tau|^{s} = E_{p_A}|\tau|^{s} < C.$$

Then for each term of the form given in (14),

$$\sum_{\substack{\tau_1,\dots,\tau_L \\ \tau_i\in\Omega_{\alpha_i}}} |\tau_1|^{s_1} \cdots |\tau_L|^{s_L}\, p_{\alpha_1}(\tau_1) \cdots p_{\alpha_L}(\tau_L) = \prod_{i=1}^{L} E_{p_{\alpha_i}}|\tau_i|^{s_i} \le C^{L}.$$
There are fewer than L^m = |α|^m terms in P(|τ_1|, …, |τ_L|). Hence

$$\sum_{\substack{\tau_1,\dots,\tau_L \\ \tau_i\in\Omega_{\alpha_i}}} P(|\tau_1|, \dots, |\tau_L|)\, p(A \to \alpha)\, p_{\alpha_1}(\tau_1) \cdots p_{\alpha_L}(\tau_L) \le |\alpha|^{m} C^{|\alpha|} p(A \to \alpha).$$

So we get

$$M_{k+1,A} \le \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} |\alpha|^{m} C^{|\alpha|} p(A \to \alpha) + \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} \sum_{i=1}^{|\alpha|} M_{k,\alpha_i}\, p(A \to \alpha).$$

Because the set of production rules is finite, the length of a sentential form that occurs
on the right-hand side of a production rule is bounded above, i.e.,

$$\sup\{|\alpha| : \text{for some } A \in N,\ (A \to \alpha) \in R\} < \infty.$$

Therefore we can bound (|α| + 1)^m C^{|α|} by a constant, say, K. Then we get

$$M_{k+1,A} \le K + \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} \sum_{i=1}^{|\alpha|} M_{k,\alpha_i}\, p(A \to \alpha). \qquad (15)$$
Replace p(A → α) by F(A → α)/F(A), then multiply both sides of (15) by F(A)
and sum over all A ∈ N with F(A) > 0. By (10) and (11), we then get

$$\begin{aligned}
\sum_{A\in N} M_{k+1,A}\, F(A) &\le K \sum_{\tau\in\Lambda} |\tau|\, W(\tau) + \sum_{A\in N} \sum_{\substack{\alpha \text{ s.t.} \\ (A\to\alpha)\in R}} \sum_{i=1}^{|\alpha|} M_{k,\alpha_i}\, F(A \to \alpha) && \text{(by (10))} \\
&= K \sum_{\tau\in\Lambda} |\tau|\, W(\tau) + \sum_{A\in N} \sum_{\substack{\alpha \text{ s.t.} \\ (A\to\alpha)\in R}} \sum_{B\in N} n(B;\alpha)\, M_{k,B}\, F(A \to \alpha) \\
&= K \sum_{\tau\in\Lambda} |\tau|\, W(\tau) + \sum_{B\in N} M_{k,B} \sum_{A\in N} \sum_{\substack{\alpha \text{ s.t.} \\ (A\to\alpha)\in R}} n(B;\alpha)\, F(A \to \alpha) \\
&= K \sum_{\tau\in\Lambda} |\tau|\, W(\tau) + \sum_{B\ne S} M_{k,B}\, F(B) + M_{k,S}\,(F(S) - 1). && \text{(by (11))}
\end{aligned}$$

Because M_{k+1,A} ≥ M_{k,A} for each A ∈ N, we get

$$\sum_{A\in N} M_{k,A}\, F(A) \le K \sum_{\tau\in\Lambda} |\tau|\, W(\tau) + \sum_{A\ne S} M_{k,A}\, F(A) + M_{k,S}\,(F(S) - 1),$$

whence

$$M_{k,S} \le K \sum_{\tau\in\Lambda} |\tau|\, W(\tau) < \infty.$$
Letting k → ∞, since M_{k,S} ↑ E_{p_S}|τ|^m, we get E_{p_S}|τ|^m ≤ K Σ_{τ∈Λ} |τ| W(τ) < ∞. To complete
the induction, we shall show that for every A ∈ N ∪ T other than S, E_{p_A}|τ|^m < ∞.
By conditional expectation (see Definition 2 for the notations A ∈ τ and τ_A),

$$E_{p_S}|\tau|^m = E_{p_S}(|\tau|^m \mid A \in \tau)\, p_S(A \in \tau) + E_{p_S}(|\tau|^m \mid A \notin \tau)\, p_S(A \notin \tau),$$

so that

$$E_{p_S}(|\tau|^m \mid A \in \tau) \le \frac{E_{p_S}|\tau|^m}{p_S(A \in \tau)} < \infty, \qquad (16)$$

since p_S(A ∈ τ) > 0. Because |τ_A| ≤ |τ|, E_{p_S}(|τ_A|^m | A ∈ τ) < ∞.
As in the proof of Corollary 1, τ_A is independent of its location and of the other parts of
τ, and is distributed according to p_A. Therefore

$$p_S(\tau_A \mid A \in \tau) = p_A(\tau_A),$$

which leads to E_{p_A}|τ|^m = E_{p_S}(|τ_A|^m | A ∈ τ) < ∞. □
From Proposition 2 it follows that the mean size of parses is finite under p. Since
f(A → α; τ) ≤ |τ| for each production A → α, it follows that the mean frequency of
f(A → α; τ) is finite. The next proposition gives the explicit form of the mean frequency
in terms of the production probabilities assigned by the relative weighted frequency 
method. 
Proposition 3
Under the same conditions as Proposition 2, the mean frequency of the production
rule $(A\to\alpha)\in R$ is the weighted sum of the numbers of its occurrences in the parses
in $\Delta$, with weights $W(\tau)$, i.e.,
$$E_p f(A\to\alpha;\tau) = \sum_{\tau\in\Delta} f(A\to\alpha;\tau)\,W(\tau). \tag{17}$$
Proof
Fix $(A\to\alpha)\in R$. For each $C\in N$, write $E(C)$ for $E_{p_C} f(A\to\alpha;\tau)$. We shall find the
linear relations between the $E(C)$. To this end, for each $\tau\in\Omega_C$, let $C\to\gamma$ be the production
rule applied to $\tau$'s root. Suppose $\gamma$ is composed of $m$ symbols, $\gamma_1,\ldots,\gamma_m$, and $\tau_1,\ldots,\tau_m$
are the daughter subtrees of $\tau$ rooted in $\gamma_1,\ldots,\gamma_m$, respectively. Then
$$f(A\to\alpha;\tau) = \chi(C\to\gamma) + \sum_{k=1}^{m} f(A\to\alpha;\tau_k),$$
where
$$\chi(C\to\gamma) = \begin{cases} 0 & \text{if } C\to\gamma \ne A\to\alpha,\\ 1 & \text{otherwise.} \end{cases}$$
Multiply both sides by $p(\tau)$ and sum over all $\tau\in\Omega_C$ which have $C\to\gamma$ as
the production rule applied at the root. By the definition of PCFG, $p(\tau) = p(C\to\gamma)\,p(\tau_1)\cdots p(\tau_m)$, and $\tau_k$ can be any parse in $\Omega_{\gamma_k}$. Therefore, by factorization, we get
$$\sum_{\tau\in\Omega_{C\to\gamma}} \chi(C\to\gamma)\,p(\tau) = \sum_{\tau\in\Omega_{C\to\gamma}} \chi(C\to\gamma)\,p(C\to\gamma)\,p(\tau_1)\cdots p(\tau_m) = p(C\to\gamma)\,\chi(C\to\gamma)\prod_{k=1}^{m} p(\Omega_{\gamma_k}) = p(C\to\gamma)\,\chi(C\to\gamma)$$
(all $p(\Omega_{\gamma_k}) = 1$ by Proposition 1),
where $\Omega_{C\to\gamma}$ stands for the set of trees in which $C\to\gamma$ is the rule applied at the root.
Similarly, for each $k$,
$$\sum_{\tau\in\Omega_{C\to\gamma}} f(A\to\alpha;\tau_k)\,p(\tau) = p(C\to\gamma)\sum_{\tau_k\in\Omega_{\gamma_k}} f(A\to\alpha;\tau_k)\,p(\tau_k)\prod_{i\ne k} p(\Omega_{\gamma_i}) = p(C\to\gamma)\,E(\gamma_k).$$
Therefore we get
$$\sum_{\tau\in\Omega_{C\to\gamma}} p(\tau)\,f(A\to\alpha;\tau) = p(C\to\gamma)\Big(\chi(C\to\gamma) + \sum_{k=1}^{m} E(\gamma_k)\Big) = p(C\to\gamma)\Big(\chi(C\to\gamma) + \sum_{\substack{B\in N \\ \text{s.t. } B\in\gamma}} n(B;\gamma)\,E(B)\Big).$$
Sum over all production rules for $C$. The left-hand side totals $E(C)$ and
$$E(C) = \sum_{\substack{\gamma\in(N\cup T)^* \\ \text{s.t. } (C\to\gamma)\in R}} p(C\to\gamma)\Big(\chi(C\to\gamma) + \sum_{\substack{B\in N \\ \text{s.t. } B\in\gamma}} n(B;\gamma)\,E(B)\Big).$$
Replace $p(C\to\gamma)$ by $F(C\to\gamma)/F(C)$, according to (9). Then multiply both sides by
$F(C)$ and sum both sides over all $C\in N$. We get
$$
\begin{aligned}
\sum_{C\in N} F(C)\,E(C)
&= \sum_{C\in N}\sum_{\substack{\gamma\in(N\cup T)^* \\ \text{s.t. } (C\to\gamma)\in R}} F(C\to\gamma)\,\chi(C\to\gamma) + \sum_{C\in N}\sum_{\substack{\gamma\in(N\cup T)^* \\ \text{s.t. } (C\to\gamma)\in R}} F(C\to\gamma)\sum_{\substack{B\in N \\ \text{s.t. } B\in\gamma}} n(B;\gamma)\,E(B)\\
&= F(A\to\alpha) + \sum_{B\in N} E(B)\sum_{C\in N}\sum_{\gamma\,\text{s.t.}\,(C\to\gamma)\in R} F(C\to\gamma)\,n(B;\gamma)\\
&= F(A\to\alpha) + \sum_{B\in N} E(B)\,F(B) - E(S), && \text{(by (11))}
\end{aligned}
$$
hence
$$E(S) = F(A\to\alpha),$$
completing the proof of (17). □
Now we can calculate the entropy of p in terms of production probabilities. 
Corollary 2
Under the conditions in Proposition 2,
$$H(p) = \sum_{(A\to\alpha)\in R} F(A\to\alpha)\log\frac{1}{F(A\to\alpha)} - \sum_{A\in N} F(A)\log\frac{1}{F(A)},$$
which is clearly finite.
Proof
The calculation goes as follows:
$$
\begin{aligned}
H(p) &= \sum_{\tau\in\Omega} p(\tau)\log\frac{1}{p(\tau)}\\
&= \sum_{\tau\in\Omega} p(\tau)\log\frac{1}{\prod_{(A\to\alpha)\in R} p(A\to\alpha)^{f(A\to\alpha;\tau)}}\\
&= \sum_{\tau\in\Omega} p(\tau)\sum_{(A\to\alpha)\in R} f(A\to\alpha;\tau)\log\frac{1}{p(A\to\alpha)}\\
&= \sum_{(A\to\alpha)\in R}\sum_{\tau\in\Omega} p(\tau)\,f(A\to\alpha;\tau)\log\frac{1}{p(A\to\alpha)}\\
&= \sum_{(A\to\alpha)\in R} E_p f(A\to\alpha;\tau)\log\frac{1}{p(A\to\alpha)}\\
&= \sum_{(A\to\alpha)\in R}\sum_{\tau\in\Delta} f(A\to\alpha;\tau)\,W(\tau)\log\frac{F(A)}{F(A\to\alpha)}\\
&= \sum_{(A\to\alpha)\in R}\sum_{\tau\in\Delta} f(A\to\alpha;\tau)\,W(\tau)\log F(A) - \sum_{(A\to\alpha)\in R}\sum_{\tau\in\Delta} f(A\to\alpha;\tau)\,W(\tau)\log F(A\to\alpha)\\
&= \sum_{A\in N} F(A)\log F(A) - \sum_{(A\to\alpha)\in R} F(A\to\alpha)\log F(A\to\alpha). && \text{(exchanging the order of summation)}
\end{aligned}
$$
□
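Finite entropy can be watched numerically. The sketch below (Python; not from the paper) uses the two-production grammar $S\to SS$ (probability $p$), $S\to a$ (probability $1-p$) with $p<1/2$, so sampling terminates almost surely. The closed form $H = h/(1-2p)$, with $h$ the per-rewrite entropy, is my own derivation from the expected number of rewrites $1/(1-2p)$; it is compared against a Monte Carlo estimate of $E[-\log p(\tau)]$.

```python
import math, random

# Assumed closed form (my derivation, not the paper's): for S -> S S (p),
# S -> a (1-p), p < 1/2, the expected number of rewrites is 1/(1-2p),
# so the entropy H = E[-log p(tau)] equals h/(1-2p).

def sample_neg_log_prob(p, rng):
    """Simulate one parse as a branching process; return -log p(tau)."""
    nlp, pending = 0.0, 1          # pending = number of unexpanded S's
    while pending:
        pending -= 1
        if rng.random() < p:       # apply S -> S S
            nlp -= math.log(p); pending += 2
        else:                      # apply S -> a
            nlp -= math.log(1 - p)
    return nlp

p = 0.3
h = -p * math.log(p) - (1 - p) * math.log(1 - p)   # per-rewrite entropy
H = h / (1 - 2 * p)                                # exact entropy
rng = random.Random(0)
n = 200_000
est = sum(sample_neg_log_prob(p, rng) for _ in range(n)) / n
print(round(H, 4), round(est, 4))                  # the estimate tracks H
```

All moments of parse size are finite here (Proposition 2), so the Monte Carlo average has finite variance and settles close to the exact value.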
5. Gibbs Distributions on Parses and Renormalization of Improper PCFGs 
A Gibbs distribution on parses has the form
$$P_\lambda(\tau) = \frac{e^{\lambda\cdot U(\tau)}}{Z_\lambda},$$
where $Z_\lambda = \sum_{\tau\in\Omega} e^{\lambda\cdot U(\tau)}$, and $\lambda = \{\lambda_i\}$ and $U(\tau) = \{U_i(\tau)\}$ are constants and functions
on $\Omega$, respectively, both indexed by elements in a finite set $I$. The inner product $\lambda\cdot U =
\sum \lambda_i U_i$ is called the potential function of the Gibbs distribution and $Z_\lambda$ is called the
partition number for the exponential $e^{\lambda\cdot U}$.

The functions $U_i$ are usually considered features of parses and the constants $\lambda_i$
are weights of these features. The index set $I$ and the functions $U_i(\tau)$ can take various forms. Among the simplest choices for $I$ is $R$, the set of production rules, and
correspondingly,
$$U(\tau) = f(\tau) = \{f(A\to\alpha;\tau)\}_{(A\to\alpha)\in R}. \tag{18}$$
Given constants $\lambda$, if $Z_\lambda < \infty$, then we get a Gibbs distribution on parses given by
$$P_\lambda(\tau) = \frac{e^{\lambda\cdot f(\tau)}}{Z_\lambda}. \tag{19}$$
A proper PCFG distribution is a Gibbs distribution of the form in (19). To see this,
let $\lambda_{A\to\alpha} = \log p(A\to\alpha)$ for each $(A\to\alpha)\in R$. Then
$$p(\tau) = \prod_{(A\to\alpha)\in R} p(A\to\alpha)^{f(A\to\alpha;\tau)} = e^{\lambda\cdot f(\tau)}, \qquad Z_\lambda = \sum_{\tau\in\Omega} e^{\lambda\cdot f(\tau)} = \sum_{\tau\in\Omega} p(\tau) = 1,$$
so $p(\tau) = e^{\lambda\cdot f(\tau)}/Z_\lambda$, which is a Gibbs form.
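The identity is easy to check numerically. The sketch below uses a hypothetical three-nonterminal toy grammar (the rules and probabilities are mine, not the paper's) and verifies that the product of rule probabilities coincides with $e^{\lambda\cdot f(\tau)}$ when $\lambda_{A\to\alpha} = \log p(A\to\alpha)$:

```python
import math

# Hypothetical toy grammar: S -> NP VP, NP -> 'she' | 'her', VP -> 'runs'.
p = {('S', ('NP', 'VP')): 1.0,
     ('NP', ('she',)): 0.4,
     ('NP', ('her',)): 0.6,
     ('VP', ('runs',)): 1.0}

# Rule frequencies f(.; tau) for the parse of "she runs".
f = {('S', ('NP', 'VP')): 1, ('NP', ('she',)): 1, ('VP', ('runs',)): 1}

prob_pcfg = math.prod(p[r] ** f.get(r, 0) for r in p)   # PCFG product form
lam = {r: math.log(p[r]) for r in p}                    # Gibbs parameters
prob_gibbs = math.exp(sum(lam[r] * n for r, n in f.items()))
print(prob_pcfg, prob_gibbs)                            # both equal 0.4
```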
A Gibbs distribution usually is not a PCFG distribution, because its potential 
function in general includes features other than frequencies of production rules. What 
if its potential function only has frequencies as features? More specifically, is the Gibbs 
distribution in (19) a PCFG distribution? The next proposition gives a positive answer 
to this question. 
Proposition 4
The Gibbs distribution $P_\lambda$ given by (19) is a PCFG distribution. That is, there are
production probabilities $p$, such that for every $\tau\in\Omega$,
$$P_\lambda(\tau) = \prod_{(A\to\alpha)\in R} p(A\to\alpha)^{f(A\to\alpha;\tau)}.$$
Proof
The Gibbs distributions we have seen so far are only defined for parses rooted in $S$. By
obvious generalization, we can define for each nonterminal symbol $A$ the partition
number
$$Z_\lambda(A) = \sum_{\tau\in\Omega_A} e^{\lambda\cdot f(\tau)}$$
and the Gibbs distribution $P_\lambda(\tau)$ on parses rooted in $A$. For simplicity, also define
$Z_\lambda(t) = 1$ and $P_t(t) = 1$ for each $t\in T$.

We first show $Z_\lambda(A) < \infty$ for all $A$. Suppose $(S\to\alpha)\in R$ with $|\alpha| = n$. The
sum of $e^{\lambda\cdot f(\tau)}$ over all $\tau\in\Omega_S$ with $S\to\alpha$ being applied at the root is equal to
$e^{\lambda_{S\to\alpha}} Z_\lambda(\alpha_1)\cdots Z_\lambda(\alpha_n)$, while less than the sum of $e^{\lambda\cdot f(\tau)}$ over all $\tau\in\Omega_S$, which is
$Z_\lambda(S)$. Therefore,
$$Z_\lambda(S) \ge e^{\lambda_{S\to\alpha}}\, Z_\lambda(\alpha_1)\cdots Z_\lambda(\alpha_n).$$
Since $Z_\lambda(S) = Z_\lambda < \infty$ and $Z_\lambda(A) > 0$ for all $A$, it follows that each $Z_\lambda(\alpha_i)$ is finite. For any variable
$A$, there are variables $A_0 = S, A_1, \ldots, A_n = A \in N$ and sentential forms $\alpha^{(0)}, \ldots, \alpha^{(n-1)} \in
(N\cup T)^*$, such that $(A_i\to\alpha^{(i)})\in R$ and $A_{i+1}\in\alpha^{(i)}$. By the same argument as above,
we get
$$Z_\lambda(A_i) \ge e^{\lambda_{A_i\to\alpha^{(i)}}} \prod_{k=1}^{|\alpha^{(i)}|} Z_\lambda\big(\alpha_k^{(i)}\big),$$
where $\alpha_k^{(i)}$ is the $k$th element in $\alpha^{(i)}$. By induction, $Z_\lambda(A) < \infty$.
Now for $(A\to\alpha)\in R$, with $|\alpha| = n$, define
$$p(A\to\alpha) = \frac{1}{Z_\lambda(A)}\, e^{\lambda_{A\to\alpha}}\, Z_\lambda(\alpha_1)\cdots Z_\lambda(\alpha_n). \tag{20}$$
Since $Z_\lambda(A)$ and the $Z_\lambda(\alpha_i)$ are finite, $p(A\to\alpha)$ is well defined.

The $p$'s form a system of production probabilities, because for each $A\in N$,
$$\sum_{(A\to\alpha)\in R} p(A\to\alpha) = \frac{1}{Z_\lambda(A)}\sum_{(A\to\alpha)\in R} e^{\lambda_{A\to\alpha}}\prod_{k=1}^{|\alpha|} Z_\lambda(\alpha_k) = \frac{1}{Z_\lambda(A)}\sum_{\tau\in\Omega_A} e^{\lambda\cdot f(\tau)} = 1.$$
We shall prove, by induction on $h(\tau)$, that
$$P_\lambda(\tau) = \prod_{(A\to\alpha)\in R} p(A\to\alpha)^{f(A\to\alpha;\tau)}.$$
When $h(\tau) = 0$, $\tau$ is just a terminal, and the equation is obviously true. Suppose the
equation is true for all $\tau\in\Omega_A$ with $h(\tau) < h$, and all $A\in N$. For any $\tau\in\Omega_A$ with
$h(\tau) = h$, let $A\to\beta = \beta_1\cdots\beta_m$ be the production rule applied at the root. Then
$$P_\lambda(\tau) = \frac{1}{Z_\lambda(A)}\, e^{\lambda\cdot f(\tau)} = \frac{1}{Z_\lambda(A)}\, e^{\lambda_{A\to\beta}}\prod_{k=1}^{m} e^{\lambda\cdot f(\tau_k)},$$
where $\tau_k$ is the daughter subtree of $\tau$ rooted in $\beta_k$. Each $\tau_k$ has height $< h$. Hence, by
the induction assumption,
$$P_\lambda(\tau_k) = \frac{1}{Z_\lambda(\beta_k)}\, e^{\lambda\cdot f(\tau_k)} = \prod_{(B\to\alpha)\in R} p(B\to\alpha)^{f(B\to\alpha;\tau_k)},$$
i.e.,
$$e^{\lambda\cdot f(\tau_k)} = Z_\lambda(\beta_k)\prod_{(B\to\alpha)\in R} p(B\to\alpha)^{f(B\to\alpha;\tau_k)}.$$
Therefore,
$$
\begin{aligned}
P_\lambda(\tau) &= \frac{1}{Z_\lambda(A)}\, e^{\lambda_{A\to\beta}}\prod_{k=1}^{m} e^{\lambda\cdot f(\tau_k)}\\
&= \frac{1}{Z_\lambda(A)}\, e^{\lambda_{A\to\beta}}\prod_{k=1}^{m} Z_\lambda(\beta_k)\prod_{(B\to\alpha)\in R} p(B\to\alpha)^{f(B\to\alpha;\tau_k)}\\
&= p(A\to\beta)\prod_{k=1}^{m}\prod_{(B\to\alpha)\in R} p(B\to\alpha)^{f(B\to\alpha;\tau_k)}\\
&= \prod_{(B\to\alpha)\in R} p(B\to\alpha)^{f(B\to\alpha;\tau)},
\end{aligned}
$$
proving $P_\lambda$ is imposed by $p$. □
Proposition 4 has a useful application to the renormalization of improper PCFGs.
Suppose a PCFG distribution $p$ on $\Omega = \Omega_S$ is improper. We define a new, proper
distribution $\hat p$ on $\Omega$ by
$$\hat p(\tau) = \frac{p(\tau)}{p(\Omega)}, \quad \tau\in\Omega.$$
We call $\hat p$ the renormalized distribution of $p$ on $\Omega$. We can also define the renormalized
distribution of $p_A$ on $\Omega_A$, for each $A\in N$, by
$$\hat p_A(\tau) = \frac{p_A(\tau)}{p(\Omega_A)}, \quad \tau\in\Omega_A. \tag{21}$$
Comparing $\hat p$ with (19), we see that $\hat p$ is a Gibbs distribution with frequencies of
production rules as features. Therefore, by Proposition 4, $\hat p$ is a PCFG distribution, and
from the proof of Proposition 4, we get Corollary 3.

Corollary 3
Suppose the production probabilities of the improper distribution $p$ are positive for
all the production rules. Then the renormalized distributions $\hat p$ are induced by the
production probabilities
$$\hat p(A\to\alpha) = \frac{1}{p(\Omega_A)}\, p(A\to\alpha)\prod_{B\in N} p(\Omega_B)^{n(B;\alpha)}. \tag{22}$$
Therefore, $\hat p$ on $\Omega$ is a PCFG distribution.

Proof
The only thing we have not mentioned is that the $\lambda_{A\to\alpha} = \log p(A\to\alpha)$ are all bounded,
since the $p$ are all positive. □
We have seen that PCFG distributions can be expressed in the form of Gibbs distributions. However, from the statistical point of view, this is not enough for regarding
PCFG distributions as special cases of Gibbs distributions. An important statistical
issue about a distribution is the estimation of its parameters. To equate PCFG distributions with special cases of Gibbs distributions, we need to show that estimators for
production probabilities of PCFGs and parameters of Gibbs distributions produce the
same results.

Among many estimation procedures, the maximum-likelihood (ML) estimation
procedure is commonly used. In the full observation case, if the data are composed of
$\tau_1,\ldots,\tau_n$, then the estimator for the system of production probabilities is
$$\hat p = \{\hat p(A\to\alpha)\} = \arg\max_{p} \prod_{i=1}^{n}\prod_{(A\to\alpha)\in R} p(A\to\alpha)^{f(A\to\alpha;\tau_i)}, \tag{23}$$
subject to
$$\sum_{(A\to\alpha)\in R} p(A\to\alpha) = 1,$$
for any $A\in N$, and the estimator for the parameters of Gibbs distributions of the
form in (19) is
$$\hat\lambda = \arg\max_{\lambda} \prod_{i=1}^{n} \frac{e^{\lambda\cdot U(\tau_i)}}{Z_\lambda}. \tag{24}$$
In addition, the ML estimate $\hat p$ in (23) can be solved analytically, and the solution is
given by Equation (5).
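Equation (5), the relative-frequency solution, is simple to compute from fully observed parses. A sketch follows; the tree encoding and helper names are my own, not the paper's:

```python
from collections import Counter

# Trees are nested tuples (nonterminal, child, ...); leaves are terminal strings.
def count_rules(tree, counts):
    """Accumulate f(A -> alpha; tau) for every production used in the tree."""
    if isinstance(tree, tuple):
        head, kids = tree[0], tree[1:]
        rhs = tuple(k[0] if isinstance(k, tuple) else k for k in kids)
        counts[(head, rhs)] += 1
        for k in kids:
            count_rules(k, counts)

def ml_estimate(trees):
    """Relative-frequency ML estimate: count of A -> alpha over count of A."""
    counts = Counter()
    for t in trees:
        count_rules(t, counts)
    totals = Counter()
    for (head, _), c in counts.items():
        totals[head] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

# One parse of "aaa" under the grammar S -> S S | a:
data = [('S', ('S', ('S', 'a'), ('S', 'a')), ('S', 'a'))]
est = ml_estimate(data)
print(est)   # S -> S S used 2 of 5 times, S -> a used 3 of 5 times
```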
In the partial observation case, if $Y_1,\ldots,Y_n$ are the observed yields, then the estimators for the two distributions are
$$\hat p = \{\hat p(A\to\alpha)\} = \arg\max_{p} \prod_{i=1}^{n}\sum_{Y(\tau)=Y_i}\;\prod_{(A\to\alpha)\in R} p(A\to\alpha)^{f(A\to\alpha;\tau)}, \tag{25}$$
subject to
$$\sum_{(A\to\alpha)\in R} p(A\to\alpha) = 1,$$
for any $A\in N$, and
$$\hat\lambda = \arg\max_{\lambda} \prod_{i=1}^{n}\sum_{\tau\in\Omega_{Y_i}} \frac{e^{\lambda\cdot U(\tau)}}{Z_\lambda}, \tag{26}$$
respectively.
We want to compare the ML estimators for the two distributions and see whether they
produce the same results in some sense. Since the parameters $p$ serve as base numbers
in PCFG distributions, whereas $\lambda$ are exponents in Gibbs distributions, to make the
comparison sensible, we take the logarithms of $\hat p$ and ask whether or not $\log\hat p$ and $\hat\lambda$
are the same. Since the ML estimation procedure for PCFGs involves constrained
optimization, whereas the estimation procedure for Gibbs distributions involves only
unconstrained optimization, it is reasonable to suspect $\log\hat p \ne \hat\lambda$. Indeed, numerically
$\log\hat p$ and $\hat\lambda$ are different. For example, the estimator (23) gives only one estimate of
the system of production probabilities, whereas the estimator (24) may yield infinitely
many solutions. Such uniqueness and nonuniqueness of estimates is related to the
identifiability of parameters. We will discuss this in more detail in Section 7.

Despite their numerical differences, the ML estimators for PCFG distributions and
Gibbs distributions with the form (19) are equivalent, in the sense that the estimates
produced by the estimators impose the same distributions on parses. Because of this,
in the context of ML estimation of parameters, we can regard PCFG distributions as
special cases of Gibbs distributions.
Corollary 4
If $\hat p$ is the solution of (23), then $\log\hat p$ is a solution of ML estimation (24). Similarly,
if $\hat p$ is a solution of (25), then $\log\hat p$ is a solution of ML estimation (26). Hence, the
estimates of production probabilities of PCFG distributions and parameters of Gibbs
distributions with the form (19) impose the same distributions on parses.

Proof
Suppose $\hat\lambda$ is a solution for (24). By Proposition 4, the Gibbs distribution $P_{\hat\lambda}$ is imposed
by a system of production probabilities $\hat p$. Then $\hat p$ is the solution of (23). Let $\lambda = \log\hat p$,
i.e., $\lambda(A\to\alpha) = \log\hat p(A\to\alpha)$. Then $\lambda$ imposes the same distribution on parses as
$\hat\lambda$. Therefore $\lambda$ is also a solution to (24). This proves the first half of the result. The
second half is similarly proved. □
6. Branching Rates of PCFGs 
In this section, we study PCFGs from the perspective of stochastic branching processes.
Adopting the setup given by Miller and O'Sullivan (1992), we define the mean matrix
$M$ of $p$ as an $|N|\times|N|$ square matrix, with its $(A,B)$th entry being the expected number
of variables $B$ resulting from rewriting $A$:
$$M(A,B) = \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} p(A\to\alpha)\, n(B;\alpha). \tag{27}$$
Clearly, $M$ is a nonnegative matrix.

We say $B\in N$ can be reached from $A\in N$ if for some $n > 0$, $M^{(n)}(A,B) > 0$, where
$M^{(n)}(A,B)$ is the $(A,B)$th element of $M^n$. $M$ is irreducible if for any pair $A, B\in N$, $B$
can be reached from $A$. The corresponding branching process is called connected if $M$
is irreducible (Walters 1982). It is easy to check that these definitions are equivalent to
Definition 3.
We need the result below for the study of branching processes.

Theorem 1 (Perron-Frobenius)
Let $M = [m_{ij}]$ be a nonnegative $k\times k$ matrix.

1. There is a nonnegative eigenvalue $\rho$ such that no eigenvalue of $M$ has
absolute value greater than $\rho$.

2. Corresponding to the eigenvalue $\rho$ there is a nonnegative left (row)
eigenvector $\nu = (\nu_1,\ldots,\nu_k)$ and a nonnegative right (column) eigenvector
$\mu = (\mu_1,\ldots,\mu_k)^{\mathsf T}$.

3. If $M$ is irreducible, then $\rho$ is a simple eigenvalue (i.e., the multiplicity of $\rho$
is 1), and the corresponding eigenvectors are strictly positive (i.e., $\mu_i > 0$,
$\nu_i > 0$ for all $i$).
The eigenvalue $\rho$ is called the branching rate of the process. A branching process
is called subcritical (critical, supercritical) if $\rho < 1$ ($\rho = 1$, $\rho > 1$). We also say a PCFG
is subcritical (critical, supercritical) if its corresponding branching process is. When a
PCFG is subcritical, it is proper. When a PCFG is supercritical, it is improper.
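The mean matrix (27) and the branching rate are easy to compute in practice. The sketch below (the grammar and its probabilities are hypothetical, chosen for illustration) estimates $\rho$ by power iteration, which converges for these nonnegative matrices:

```python
# Hypothetical grammar: S -> S S (0.4) | A (0.6);  A -> A A (0.3) | a (0.7).
rules = {('S', ('S', 'S')): 0.4, ('S', ('A',)): 0.6,
         ('A', ('A', 'A')): 0.3, ('A', ('a',)): 0.7}
N = ['S', 'A']

def mean_matrix(rules, N):
    """Equation (27): M(A,B) = sum over alpha of p(A->alpha) * n(B; alpha)."""
    M = {(a, b): 0.0 for a in N for b in N}
    for (head, rhs), prob in rules.items():
        for sym in rhs:
            if sym in N:
                M[(head, sym)] += prob
    return M

def branching_rate(M, N, iters=300):
    """Power iteration for the Perron-Frobenius eigenvalue of nonnegative M."""
    v = {b: 1.0 for b in N}
    rho = 1.0
    for _ in range(iters):
        w = {a: sum(M[(a, b)] * v[b] for b in N) for a in N}
        rho = max(w.values())
        v = {a: w[a] / rho for a in N}
    return rho

M = mean_matrix(rules, N)
print(branching_rate(M, N))   # 0.8 here: subcritical, hence proper
```

For this grammar the mean matrix is upper triangular with diagonal entries 0.8 and 0.6, so the exact branching rate is 0.8.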
The next result demonstrates that production probabilities assigned by the relative 
weighted frequency method impose subcritical PCFG distributions. 
Proposition 5
For $p$ assigned by the relative weighted frequency method (8) and $M$ by (27),
$$\rho < 1. \tag{28}$$

Proof
Let $\mu$ be the right eigenvector of $\rho$, as described in item 2 of Theorem 1. We have
$M\mu = \rho\mu$. For each variable $A$,
$$\sum_{B\in N} M(A,B)\,\mu(B) = \rho\,\mu(A).$$
Therefore
$$\sum_{B\in N}\sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} p(A\to\alpha)\,n(B;\alpha)\,\mu(B) = \rho\,\mu(A).$$
Replacing $p(A\to\alpha)$ by $F(A\to\alpha)/F(A)$, according to (9),
$$\sum_{B\in N}\sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} \frac{F(A\to\alpha)}{F(A)}\, n(B;\alpha)\,\mu(B) = \rho\,\mu(A).$$
Multiply both sides by $F(A)$ and sum over $A\in N$. By (11),
$$\sum_{A\in N} F(A)\,\mu(A) - \mu(S) = \rho\sum_{A\in N} F(A)\,\mu(A). \tag{29}$$
We need to show that $\mu(S) > 0$. Assume $\mu(S) = 0$. Then for any $n > 0$, since $M^n\mu = \rho^n\mu$,
we have
$$\sum_{A\in N} M^{(n)}(S,A)\,\mu(A) = \rho^n\mu(S) = 0.$$
So for each $A\in N$, $M^{(n)}(S,A)\,\mu(A) = 0$. Because each $A\in N$ is reachable from $S$ under
$p$, there is $n > 0$ such that $M^{(n)}(S,A) > 0$. So we get $\mu(A) = 0$. Hence $\mu = 0$. This
contradicts the fact that $\mu$ is a nonnegative eigenvector of $M$. Therefore $\mu(S) > 0$. By
(29),
$$\sum_{A\in N} F(A)\,\mu(A) > \rho\sum_{A\in N} F(A)\,\mu(A) \ge 0.$$
This proves $\rho < 1$. □
We will apply the above result to give another proof of Proposition 2. Before doing
this, we need to introduce a spectral theorem, which is well known in matrix analysis.

Theorem 2
Suppose $M$ is an $n\times n$ real matrix. Let $\sigma(M)$ be the largest absolute value of $M$'s
eigenvalues. Then
$$\sigma(M) = \lim_{k\to\infty} \|M^k\|^{1/k},$$
where $\|M\|$ is the norm of $M$ defined by
$$\|M\| = \sup_{|v|=1} |Mv|.$$
Now we can prove the following result.

Proposition 6
If $M$ given by (27) has branching rate $\rho < 1$, then for each $m\in\mathbb{N}\cup\{0\}$,
$$E_p|\tau|^m < \infty. \tag{30}$$

Proof
We repeat the proof of Proposition 2 from Section 4 up to (15). Then, instead of summing over $A$, we observe that (15) can be written as
$$M_{k+1,A} \le K + \sum_{B\in N} M(A,B)\,M_{k,B}.$$
Write $\{M_{k,A}\}_{A\in N}$ as $\vec M_k$, which is a vector indexed by $A\in N$. We then have
$$\vec M_{k+1} \le K\mathbf 1 + M\vec M_k,$$
where $\mathbf 1$ is defined as $(1,\ldots,1)^{\mathsf T}$, and for two column vectors $\mu$ and $\nu$, $\mu\le\nu$ means
each component of $\mu$ is $\le$ the corresponding component of $\nu$. Since the components
of $K\mathbf 1$, $M$, and $\vec M_k$ are positive, the above relation implies
$$M\vec M_{k+1} \le KM\mathbf 1 + M^2\vec M_k.$$
Hence, we get
$$\vec M_{k+2} \le K\mathbf 1 + M\vec M_{k+1} \le K\mathbf 1 + KM\mathbf 1 + M^2\vec M_k.$$
By induction, we get
$$\vec M_k \le K\sum_{j=0}^{k-2} M^j\mathbf 1 + M^{k-1}\vec M_1,$$
so
$$|\vec M_k| \le K\sum_{j=0}^{k-2} \|M^j\|\,|\mathbf 1| + \|M^{k-1}\|\,|\vec M_1|. \tag{31}$$
By Theorem 2, for any $\rho < \rho' < 1$, $\|M^n\| = O(\rho'^n)$. Then (31) implies that $|\vec M_k|$ is
bounded. Since the $\vec M_k$ are positive and increasing, it follows that the $\vec M_k$ converge. □
Next we investigate how branching rates of improper PCFGs change after renormalization. First, let us look at a simple example. Consider the CFG given by (1).
Assign probability $p$ to the first production $(S\to SS)$, and $1-p$ to the second one
$(S\to a)$. It was proved that the total probability of parses is $\min(1, 1/p - 1)$. If $p > 1/2$,
then $\min(1, 1/p-1) = 1/p - 1 < 1$, implying the PCFG is improper. To get the renormalized distribution, take a parse $\tau$ with yield $a^m$. Since $f(S\to SS;\tau) = m-1$ and
$f(S\to a;\tau) = m$, $p(\tau) = p^{m-1}(1-p)^m$. Then the renormalized probability of $\tau$ equals
$$\hat p(\tau) = \frac{p(\tau)}{1/p - 1} = \frac{p^{m-1}(1-p)^m}{1/p - 1} = p^m(1-p)^{m-1}.$$
Therefore, $\hat p$ is assigned by a system of production probabilities $\hat p$ with $\hat p(S\to SS) =
1 - p < 1/2$, and $\hat p(S\to a) = p$. So the renormalized PCFG is subcritical.
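Formula (22) of Corollary 3 reproduces this example by direct computation; a small check (Python, my own sketch):

```python
# Grammar (1): S -> S S (prob p) | a (prob 1-p); supercritical when p > 1/2.
p = 0.7
q = min(1.0, 1.0 / p - 1.0)   # total probability p(Omega_S) = 1/p - 1 = 3/7

# Equation (22): p_hat(A->alpha) =
#   p(A->alpha) * prod_B p(Omega_B)^n(B;alpha) / p(Omega_A)
p_hat_SS = p * q ** 2 / q     # S -> S S: two S's on the right-hand side
p_hat_a = (1.0 - p) / q       # S -> a: no nonterminals on the right-hand side
print(p_hat_SS, p_hat_a)      # approximately 0.3 and 0.7, i.e. 1-p and p
```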
More generally, we have the following result, which says that a connected, improper
PCFG, after being renormalized, becomes a subcritical PCFG.

Proposition 7
If $p$ is a connected, improper PCFG distribution on parses, then its renormalized version $\hat p$ is subcritical.
Proof
We have $0 < p(\Omega_S) < 1$, and we shall first show, based on the fact that the PCFG is
connected, that $0 < p(\Omega_A) < 1$ for all $A$. Recall the proof of Corollary 1. There we got the
relation $q_S \ge q_A\, p_S(A\in\tau)$, where $q_A$ is the probability that trees rooted in $A$ fail to
terminate. Because the PCFG is connected, $S$ is reachable from $A$, too. By the same
argument, we also have $q_A \ge q_S\, p_A(S\in\tau)$. Since both $q_S$ and $p_A(S\in\tau)$ are positive, $q_A > 0$,
and then $p(\Omega_A) = 1 - q_A < 1$. Similarly, we can prove $p(\Omega_A) \ge p(\Omega_S)\,p_A(S\in\tau) > 0$.

For each $A$, define generating functions $\{g_A\}$ as in Harris (1963, Section 2.2),
$$g_A(s) = \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} p(A\to\alpha)\prod_{B\in N} s_B^{n(B;\alpha)}, \tag{32}$$
where $s = \{s_A\}_{A\in N}$. Write $g = \{g_A\}_{A\in N}$ and define $g^{(n)} = \{g_A^{(n)}\}$ recursively as
$$g^{(1)}(s) = g(s), \qquad g^{(n+1)}(s) = g\big(g^{(n)}(s)\big). \tag{33}$$
It is easy to see that $g_A(0)$ is the total probability of parses with root $A$ and height
1. By induction, $g_A^{(n)}(0)$ is the total probability of parses with root $A$ and height $\le n$.
Therefore, $g_A^{(n)}(0) \uparrow p(\Omega_A) < 1$.
Write $r = \{p(\Omega_A)\}_{A\in N}$. Then
$$g(r) = g\Big(\lim_{n\to\infty} g^{(n)}(0)\Big) = \lim_{n\to\infty} g\big(g^{(n)}(0)\big) = \lim_{n\to\infty} g^{(n+1)}(0) = r.$$
Therefore, $r$ is a nonnegative solution of $g(s) = s$. It is also the smallest among such
solutions. That is, if there is another nonnegative solution $r'\ne r$, then $r \le r'$. This is
because $0 \le r'$ implies $g^{(n)}(0) \le g^{(n)}(r') = r'$ for all $n > 0$, and by letting $n\to\infty$, $r\le r'$.
Clearly, $\mathbf 1$ is also a solution of $g(s) = s$.
We now renormalize $p$ to get $\hat p$ by (22). Define generating functions $f = \{f_A\}$ of $\hat p$
and $f^{(n)}$ in the same way as (32) and (33). Then
$$f_A(s) = \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} \hat p(A\to\alpha)\prod_{B\in N} s_B^{n(B;\alpha)} = \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} \frac{1}{r_A}\, p(A\to\alpha)\prod_{B\in N} r_B^{n(B;\alpha)}\prod_{B\in N} s_B^{n(B;\alpha)}. \tag{34}$$
For two vectors $r = \{r_A\}$ and $s = \{s_A\}$, write $rs$ for $\{r_A s_A\}$, and $r/s$ for $\{r_A/s_A\}$. Then
(34) can be written
$$f(s) = \frac{g(rs)}{r}.$$
Since all $r_A = p(\Omega_A)$ are positive, the $f(s)$ are well defined by the fractions.
Because $r$ is the smallest nonnegative solution of $g(s) = s$, by the above equation,
$\mathbf 1$ is the only solution of $f(s) = s$ in the unit cube. Since $g(s) = s$ also has the solution $\mathbf 1$,
$f(s) = s$ has the solution $\mathbf 1/r$, which is strictly larger than $\mathbf 1$.
We want to know how $f$ changes on the line segment connecting $\mathbf 1$ and $\mathbf 1/r$. Let
$u = \mathbf 1/r - \mathbf 1$. Then $u$ is strictly positive. Elements on the line segment between $\mathbf 1$ and
$\mathbf 1/r$ can be represented by $\mathbf 1 + tu$, with $t\in[0,1]$. Define $h(t) = f(\mathbf 1 + tu) - \mathbf 1 - tu$. Then
$$h_A(t) = \sum_{\substack{\alpha\in(N\cup T)^* \\ \text{s.t. } (A\to\alpha)\in R}} \hat p(A\to\alpha)\prod_{B\in N}(1 + tu_B)^{n(B;\alpha)} - 1 - tu_A. \tag{35}$$
Differentiate $h$ at $t = 0$. Then $h'(0) = Mu - u$, where $M$ is the mean matrix corresponding
to $\hat p$. Every $h_A(t)$ is a convex function. Then, because $h_A(0) = h_A(1) = 0$, $h_A'(0) \le 0$,
which leads to $Mu \le u$.
We now show that for at least one $A$, $(Mu)_A < u_A$. First of all, note that $h_A'(0) = 0$
only if $h_A(t)$ is linear. Assume $Mu = u$, which leads to $h'(0) = 0$ and the linearity of
$h(t)$. Together with $h(0) = 0$, this implies $h(t)\equiv 0$. Choose $t < 0$ such that $1 + tu_A > 0$
for all $A$. Then $f(\mathbf 1 + tu) - \mathbf 1 - tu = h(t) = 0$. Therefore $\mathbf 1 + tu$ is a nonnegative solution
of $f(s) = s$ and is strictly less than $\mathbf 1$. This contradicts the fact that $\mathbf 1$ is the smallest
nonnegative solution of $f(s) = s$.

Now we have $Mu \le u$, and $(Mu)_A < u_A$ for some $A$. Because $\hat p$ is connected, $M$ is
irreducible. By item 3 of Theorem 1, since $u$ is strictly positive, there is a strictly
positive left eigenvector $\nu$ such that $\nu M = \rho\nu$. Therefore $\nu Mu < \nu u$, i.e., $\rho\,\nu u < \nu u$.
Hence $\rho < 1$. This completes the proof. □
7. Identifiability and Approximation of Production Probabilities of PCFGs 
Identifiability of parameters is related to the consistency of estimates, both being important statistical issues. Proving the consistency of the ML estimate of a system of
production probabilities given in (5) is relatively straightforward. Consistency in this
case means that, if $p$ imposes a proper distribution, then as the size of the data, composed of independent and identically distributed (i.i.d.) samples, goes to infinity, with
probability one the estimate $\hat p$ converges to $p$. To see this, think of the sample parses
as taken independently from a branching process governed by $p$. By the context-free
nature of the branching process, for $A\in N$, each instance of $A$ selects a production
$A\to\alpha$ with probability $p(A\to\alpha)$ independently of the other instances of $A$. As the
size of the data set goes to infinity, the number of occurrences of $A$ goes to infinity.
Therefore, by the law of large numbers, the ratio between the number of occurrences
of $A\to\alpha$ and the number of occurrences of $A$, which is $\hat p(A\to\alpha)$, converges to
$p(A\to\alpha)$ with probability one.
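This law-of-large-numbers argument is easy to watch in simulation. A sketch for the subcritical grammar $S\to SS$ ($p$) $\mid$ $a$ ($1-p$); the code and constants are mine, not the paper's:

```python
import random
from collections import Counter

def add_rule_counts(p, rng, counts):
    """Sample one parse of S -> S S (p) | a (1-p); accumulate rule counts."""
    pending = 1                      # number of unexpanded S's
    while pending:
        pending -= 1
        if rng.random() < p:
            counts['S->SS'] += 1; pending += 2
        else:
            counts['S->a'] += 1

p, rng, counts = 0.3, random.Random(1), Counter()
for _ in range(50_000):              # 50,000 i.i.d. sample parses
    add_rule_counts(p, rng, counts)
p_hat = counts['S->SS'] / (counts['S->SS'] + counts['S->a'])
print(p_hat)                         # close to the true value 0.3
```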
By the consistency of the ML estimate of a system of production probabilities, we
can prove that production probabilities are identifiable parameters of PCFGs. In other
words, different systems of production probabilities impose different PCFG distributions.

Proposition 8
If $p_1$, $p_2$ impose distributions $P_1$, $P_2$, respectively, and $p_1\ne p_2$, then $P_1\ne P_2$.

Proof
Assume $P_1 = P_2$. Then draw $n$ i.i.d. samples from $P_1$. Because the ML estimator $\hat p$ is
consistent, as $n\to\infty$, $\hat p\to p_1$ with probability 1. Because the $n$ i.i.d. samples can also
be regarded as drawn from $P_2$, by the same argument, $\hat p\to p_2$ with probability 1.
Hence $p_1 = p_2$, a contradiction. □
We mentioned in Section 5 that the ML estimators (24) and (26) may produce
infinitely many estimates if the Gibbs distributions on parses have the form (19). This
phenomenon of multiple solutions results from the nonidentifiability of the parameters
of the Gibbs distributions (19), which means that different parameters may yield the
same distributions.

To see why the parameters of the Gibbs distribution (19) are nonidentifiable, we note that
the frequencies of production rules are linearly dependent:
$$\sum_{(A\to\alpha)\in R} f(A\to\alpha;\tau) = \sum_{(B\to\alpha)\in R} n(A;\alpha)\,f(B\to\alpha;\tau), \quad \text{if } A\ne S,$$
$$\sum_{(S\to\alpha)\in R} f(S\to\alpha;\tau) = \sum_{(B\to\alpha)\in R} n(S;\alpha)\,f(B\to\alpha;\tau) + 1.$$
Therefore, there exists $\lambda_0\ne 0$ such that for any $\tau$, $\lambda_0\cdot f(\tau) = 0$. If $\hat\lambda$ is a solution for
(24), then for any number $t$,
$$(\hat\lambda + t\lambda_0)\cdot f(\tau) = \hat\lambda\cdot f(\tau) \;\Longrightarrow\; e^{(\hat\lambda + t\lambda_0)\cdot f(\tau)} = e^{\hat\lambda\cdot f(\tau)},\quad Z_{\hat\lambda + t\lambda_0} = Z_{\hat\lambda} \;\Longrightarrow\; P_{\hat\lambda + t\lambda_0} = P_{\hat\lambda}.$$
Thus for any $t$, $\hat\lambda + t\lambda_0$ is also a solution for (24). This shows that the parameters
of the Gibbs distribution (19) are nonidentifiable.
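The linear dependence can be made concrete on a hypothetical toy grammar $S\to AA$, $A\to a \mid b$ (my example, not the paper's): every parse satisfies $f(A\to a) + f(A\to b) = 2\,f(S\to AA)$, so $\lambda_0 = \{S\to AA: -2,\ A\to a: 1,\ A\to b: 1\}$ annihilates $f(\tau)$, and shifting $\lambda$ along $\lambda_0$ leaves the Gibbs distribution unchanged. The $\lambda$ values below are arbitrary:

```python
import math
from itertools import product

lam = {'S->AA': 0.2, 'A->a': -0.5, 'A->b': 0.1}
lam0 = {'S->AA': -2.0, 'A->a': 1.0, 'A->b': 1.0}  # lam0 . f(tau) = 0 for all tau

def gibbs(lam):
    """Gibbs distribution (19) over the four parses, one per pair of leaves."""
    weights = {}
    for leaves in product('ab', repeat=2):
        f = {'S->AA': 1,
             'A->a': leaves.count('a'),
             'A->b': leaves.count('b')}
        weights[leaves] = math.exp(sum(lam[r] * f[r] for r in f))
    Z = sum(weights.values())
    return {t: w / Z for t, w in weights.items()}

P1 = gibbs(lam)
P2 = gibbs({r: lam[r] + 3.7 * lam0[r] for r in lam})  # shifted parameters
print(max(abs(P1[t] - P2[t]) for t in P1))            # zero up to rounding
```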
Finally, we consider how to approximate production probabilities by mean frequencies of productions. Given i.i.d. samples of parses $\tau_1,\ldots,\tau_n$ from the distribution
imposed by $p$, by the consistency of the ML estimate of $p$ given by (5),
$$\hat p(A\to\alpha) = \frac{\displaystyle\sum_{i=1}^{n} f(A\to\alpha;\tau_i)/n}{\displaystyle\sum_{i=1}^{n} f(A;\tau_i)/n} \;\to\; p(A\to\alpha)$$
with probability 1, as $n\to\infty$. If the entropy of the distribution $p$ is finite, then for
every production rule $(A\to\alpha)\in R$,
$$\frac{1}{n}\sum_{i=1}^{n} f(A\to\alpha;\tau_i) \to E_p\big(f(A\to\alpha;\tau)\big) \quad \text{with probability 1},$$
hence
$$\hat p(A\to\alpha) \to \frac{E_p\big(f(A\to\alpha;\tau)\big)}{E_p\big(f(A;\tau)\big)} \quad \text{with probability 1}.$$
If the entropy is infinite, the above argument does not work, because both the
numerator and the denominator of the fraction are infinite. Can we change the fraction
a little bit so that it still makes sense, and at the same time yields a good approximation
to $p(A\to\alpha)$?

One way to do this is to pick a large finite subset $\Omega'$ of $\Omega$ and replace the fraction
by
$$\frac{E_p\big(f(A\to\alpha;\tau)\mid\tau\in\Omega'\big)}{E_p\big(f(A;\tau)\mid\tau\in\Omega'\big)},$$
where $E_p\big(f(A\to\alpha;\tau)\mid\tau\in\Omega'\big)$ is the conditional expectation of $f(A\to\alpha;\tau)$ given $\Omega'$,
which is defined as
$$E_p\big(f(A\to\alpha;\tau)\mid\tau\in\Omega'\big) = \frac{\displaystyle\sum_{\tau\in\Omega'} f(A\to\alpha;\tau)\,p(\tau)}{\displaystyle\sum_{\tau\in\Omega'} p(\tau)}.$$
Because $\Omega'$ is finite, the numerator of the fraction on the right-hand side is finite. Also the
denominator of the fraction is positive. Therefore the conditional expectation of $f$ is finite.
The conditional expectation $E_p\big(f(A;\tau)\mid\tau\in\Omega'\big)$ is similarly defined.
The following result shows that as $\Omega'$ expands, the approximation gets better.

Proposition 9
Suppose a system of production probabilities $p$ imposes a proper distribution. Then
for any increasing sequence of finite subsets $\Omega_n$ of $\Omega$ with $\Omega_n\uparrow\Omega$, i.e., $\Omega_1\subset\Omega_2\subset\cdots\subset\Omega$,
$\Omega_n$ finite and $\bigcup\Omega_n = \Omega$,
$$p(A\to\alpha) = \lim_{n\to\infty}\frac{E_p\big(f(A\to\alpha;\tau)\mid\tau\in\Omega_n\big)}{E_p\big(f(A;\tau)\mid\tau\in\Omega_n\big)}.$$
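For the grammar $S\to SS$ ($p$) $\mid$ $a$ ($1-p$), Proposition 9 can be checked in closed form: a parse with $m$ leaves has $f(S\to SS) = m-1$, $f(S) = 2m-1$, probability $p^{m-1}(1-p)^m$, and there are Catalan$(m-1)$ of them. Taking $\Omega_n$ to be the parses with at most $n$ leaves (a sketch; the Catalan bookkeeping is mine):

```python
import math

def p_n(p, n):
    """Conditional-expectation ratio of Prop. 9 over parses with <= n leaves."""
    num = den = 0.0
    for m in range(1, n + 1):
        catalan = math.comb(2 * m - 2, m - 1) // m   # parses with m leaves
        w = catalan * p ** (m - 1) * (1 - p) ** m    # their total probability
        num += (m - 1) * w                           # f(S -> S S) contribution
        den += (2 * m - 1) * w                       # f(S) contribution
    return num / den

p = 0.3
print([round(p_n(p, n), 4) for n in (2, 5, 20, 80)])  # increases toward 0.3
```

The tail terms decay geometrically (like $(4p(1-p))^m$), so the truncated ratio approaches $p$ quickly.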
To prove the proposition, we introduce the Kullback-Leibler divergence. For any
two probability distributions $p$ and $q$ on $\Omega$, the Kullback-Leibler divergence between
$p$ and $q$ is defined as
$$D(p\,\|\,q) = \sum_{\tau\in\Omega} p(\tau)\log\frac{p(\tau)}{q(\tau)},$$
where $0\log\frac{0}{q(\tau)}$ is defined as 0 for any $q(\tau)\ge 0$. $D(p\,\|\,q)$ is nonnegative and equal
to 0 if and only if $p = q$. One thing to note is that $q$ need not be proper in order
for $D(p\,\|\,q)$ to be nonnegative. Even when $\sum_\tau q(\tau) < 1$, it is still true that $D(p\,\|\,q) \ge 0$.
For more about the Kullback-Leibler divergence, we refer the readers to Cover and
Thomas (1991).
The Kullback-Leibler divergence has the simple property described below, which
will be used in the proof of Proposition 9.

Lemma 2
If $\Omega'$ is an arbitrary subset of $\Omega$, then
$$D(p\,\|\,q) \ge p(\Omega')\log\frac{p(\Omega')}{q(\Omega')} + \big(1 - p(\Omega')\big)\log\frac{1 - p(\Omega')}{1 - q(\Omega')}.$$

Proof
Consider the Kullback-Leibler divergence between the conditional distributions $p(\tau\mid\Omega')$
and $q(\tau\mid\Omega')$, which is nonnegative:
$$\sum_{\tau\in\Omega'} p(\tau\mid\Omega')\log\frac{p(\tau\mid\Omega')}{q(\tau\mid\Omega')} = \frac{1}{p(\Omega')}\sum_{\tau\in\Omega'} p(\tau)\Big(\log\frac{p(\tau)}{q(\tau)} - \log\frac{p(\Omega')}{q(\Omega')}\Big) \ge 0,$$
hence
$$\sum_{\tau\in\Omega'} p(\tau)\log\frac{p(\tau)}{q(\tau)} \ge p(\Omega')\log\frac{p(\Omega')}{q(\Omega')}.$$
Similarly,
$$\sum_{\tau\in\Omega\setminus\Omega'} p(\tau)\log\frac{p(\tau)}{q(\tau)} \ge p(\Omega\setminus\Omega')\log\frac{p(\Omega\setminus\Omega')}{q(\Omega\setminus\Omega')} \ge \big(1 - p(\Omega')\big)\log\frac{1 - p(\Omega')}{1 - q(\Omega')}.$$
The second "$\ge$" holds because $q(\Omega) \le 1$. These two inequalities together prove the lemma.
□
Proof of Proposition 9
Given $n$, for production probabilities $q$, let $K_n(q)$ be the Kullback-Leibler divergence
between the conditional distribution $p(\tau\mid\Omega_n)$ and the distribution imposed by $q$:
$$K_n(q) = \sum_{\tau\in\Omega_n} p(\tau\mid\Omega_n)\log\frac{p(\tau\mid\Omega_n)}{q(\tau)}. \tag{36}$$
We want to find the $q$ that minimizes $K_n(q)$. This can be achieved by applying the
Lagrange multiplier method. The condition that $q$ is subject to is
$$\sum_{(A\to\alpha)\in R} q(A\to\alpha) = 1, \tag{37}$$
for every $A\in N$. There are $|N|$ such constraints. To incorporate them into the minimization, we consider the function
$$L(q) = K_n(q) + \sum_{A\in N}\lambda_A\Big(\sum_{\alpha\,\text{s.t.}\,(A\to\alpha)\in R} q(A\to\alpha) - 1\Big),$$
where the unknown coefficients $\{\lambda_A\}_{A\in N}$ are called Lagrange multipliers.

The $q$ that minimizes $K_n(q)$ subject to (37) satisfies
$$\frac{\partial L(q)}{\partial q(A\to\alpha)} = 0$$
for all $(A\to\alpha)\in R$. By simple computation, this is equivalent to
$$\sum_{\tau\in\Omega_n} f(A\to\alpha;\tau)\,p(\tau\mid\Omega_n) = \lambda_A\, q(A\to\alpha).$$
Sum both sides over all $\alpha\in(N\cup T)^*$ such that $(A\to\alpha)\in R$. By the constraints (37),
$$\lambda_A = \sum_{\tau\in\Omega_n} f(A;\tau)\,p(\tau\mid\Omega_n).$$
Therefore we have proved that if there is a minimizer, it has to be $p_n$, where
$$p_n(A\to\alpha) = \frac{\displaystyle\sum_{\tau\in\Omega_n} f(A\to\alpha;\tau)\,p(\tau)}{\displaystyle\sum_{\tau\in\Omega_n} f(A;\tau)\,p(\tau)}.$$
To see that there is a minimizer of $K_n(q)$ subject to (37), consider the boundary
points of the region
$$\Big\{q : q\ge 0,\ \sum_{\alpha\,\text{s.t.}\,(A\to\alpha)\in R} q(A\to\alpha) = 1\Big\}.$$
Any boundary point of the region has a component equal to zero, hence for some
$\tau\in\Omega_n$, $q(\tau) = 0$, implying $K_n(q) = \infty$. Because $K_n(q)$ is a continuous function, $K_n$
must attain its minimum inside the above region, and this minimizer, as has been
shown, is $p_n$.
We need to show $p_n\to p$. Let $\Omega' = \Omega_n$ and apply Lemma 2 to $p(\tau\mid\Omega_n)$ and $p_n(\tau)$.
Since $p(\Omega_n\mid\Omega_n) = 1$, we get $0 \le -\log p_n(\Omega_n) \le K_n(p_n)$. On the other hand, because $p_n$
is the minimizer of $K_n$, $K_n(p_n) \le K_n(p) = -\log p(\Omega_n)$.

Because $\Omega_n\uparrow\Omega$ and $p$ is proper, $p(\Omega_n)\to 1$. Therefore $0 \le -\log p_n(\Omega_n) \le
-\log p(\Omega_n)\to 0$. Hence $p_n(\Omega_n)\to 1$.

Choose an arbitrary $\tau\in\Omega$. For all $n$ large enough, $\tau\in\Omega_n$. Apply Lemma 2 to $\{\tau\}$
and get
$$0 \le p(\tau\mid\Omega_n)\log\frac{p(\tau\mid\Omega_n)}{p_n(\tau)} + \big(1 - p(\tau\mid\Omega_n)\big)\log\frac{1 - p(\tau\mid\Omega_n)}{1 - p_n(\tau)} \le K_n(p_n) \to 0.$$
Together with $p(\tau\mid\Omega_n)\to p(\tau) > 0$ and $p_n(\Omega_n)\to 1$, this implies
$$\lim_{n\to\infty} p_n(\tau) = p(\tau).$$
This nearly completes the proof. By the identifiability of production probabilities,
$p_n$ should converge to $p$. To make the argument more rigorous, by compactness,
every subsequence of $p_n$ has a limit point. Let $p'$ be a limit point of a subsequence
$p_{n_i}$. For any $\tau$, since $p_{n_i}(\tau)\to p(\tau)$, $p'(\tau) = p(\tau)$. By the identifiability of production
probabilities, $p' = p$. Therefore $p$ is the only limit point of $p_n$. This proves $p_n\to p$. □
Acknowledgments 
I am indebted to Stuart Geman and Mark 
Johnson for encouraging me to look at 
problems in this paper in the first place, 
and for various discussions and suggestions 
along the way. I am also grateful to the
referees for their careful reading of the original
manuscript.
References 
Abney, Steven P. 1997. Stochastic 
attribute-value grammars. Computational
Linguistics, 23(4):597-618. 
Baker, James K. 1979. Trainable grammars 
for speech recognition. In Speech 
Communications Papers of the 97th Meeting of 
the Acoustical Society of America, 
pages 547-550, Cambridge, MA. 
Baum, Leonard E. 1972. An inequality and 
associated maximization technique in 
statistical estimation of probabilistic 
functions of Markov processes. 
Inequalities, 3:1-8. 
Berger, Adam L., Stephen Della Pietra, and
Vincent Della Pietra. 1996. A maximum 
entropy approach to natural language 
processing. Computational Linguistics, 
22(1):39-71. 
Booth, Taylor L. and Richard A. Thompson. 
1973. Applying probability measures to 
abstract languages. IEEE Transactions on 
Computers, C-22:442-450. 
Chi, Zhiyi and Stuart Geman. 1998. 
Estimation of probabilistic context-free 
grammars. Computational Linguistics, 
24(2):299-305. 
Cover, Thomas M. and Joy A. Thomas. 
1991. Elements of Information Theory. John 
Wiley & Sons, Inc. 
Della Pietra, Stephen, Vincent Della Pietra, 
and John Lafferty. 1997. Inducing features 
of random fields. IEEE Transactions on 
Pattern Analysis and Machine Intelligence, 
19(4):1-13, April. 
Dempster, Arthur Pentland, N. M. Laird, 
and Donald B. Rubin. 1977. Maximum 
likelihood from incomplete data via the 
EM algorithm. Journal of the Royal Statistical
Society, Series B, 39:1-38. 
Grenander, Ulf. 1976. Lectures in Pattern 
Theory Volume 1, Pattern Synthesis. 
Springer-Verlag, New York. 
Harris, Theodore Edward. 1963. The Theory 
of Branching Processes. Springer-Verlag, 
Berlin. 
Johnson, Mark. 1998. PCFG models of 
linguistic tree representations. 
Computational Linguistics. To appear. 
Mark, Kevin E. 1997. Markov Random Field 
Models for Natural Language. Ph.D. thesis, 
Department of Electrical Engineering, 
Washington University, May. 
Mark, Kevin E., Michael I. Miller, and Ulf 
Grenander. 1996. Constrained stochastic 
language models. In S. E. Levinson and L. 
Shepp, editors, Image Models (and Their 
Speech Model Cousins). Springer-Verlag, 
pages 131-140. 
Mark, Kevin E., Michael I. Miller, Ulf 
Grenander, and Steven P. Abney. 1996. 
Parameter estimation for constrained 
context-free language models. In 
Proceedings of the DARPA Speech and 
Natural Language Workshop, Image Models 
(and Their Speech Model Cousins), 
pages 146-149, Harriman, NY, February. 
Morgan Kaufmann. 
Miller, Michael I. and Joseph A. O'Sullivan. 
1992. Entropies and combinatorics of 
random branching processes and 
context-free languages. IEEE Transactions 
on Information Theory, 38(4), July. 
Sánchez, Joan-Andreu and José-Miguel
Benedí. 1997. Consistency of stochastic
context-free grammars from probabilistic 
estimation based on growth 
transformations. IEEE Transactions on 
Pattern Analysis and Machine Intelligence, 
19(9):1052-1055, September. 
Walters, Peter. 1982. An Introduction to 
Ergodic Theory. Springer-Verlag, NY. 
Zhu, Song Chun, Ying Nian Wu, and David 
B. Mumford. 1997. Minimax entropy 
principle and its application to texture 
modeling. Neural Computation, 
9(8):1627-1660. 