Relating Probabilistic Grammars and Automata 
Steven Abney David McAllester Fernando Pereira 
AT&T Labs-Research 
180 Park Ave 
Florham Park NJ 07932 {abney, dmac, pereira}@research.att.com 
Abstract 
Both probabilistic context-free grammars 
(PCFGs) and shift-reduce probabilistic push- 
down automata (PPDAs) have been used for 
language modeling and maximum likelihood 
parsing. We investigate the precise relationship 
between these two formalisms, showing that, 
while they define the same classes of probabilis- 
tic languages, they appear to impose different 
inductive biases. 
1 Introduction 
Current work in stochastic language models 
and maximum likelihood parsers falls into two 
main approaches. The first approach (Collins, 
1998; Charniak, 1997) uses directly the defini- 
tion of stochastic grammar, defining the prob- 
ability of a parse tree as the probability that 
a certain top-down stochastic generative pro- 
cess produces that tree. The second approach 
(Briscoe and Carroll, 1993; Black et al., 1992; 
Magerman, 1994; Ratnaparkhi, 1997; Chelba 
and Jelinek, 1998) defines the probability of a 
parse tree as the probability that a certain shift- 
reduce stochastic parsing automaton outputs 
that tree. These two approaches correspond to 
the classical notions of context-free grammars 
and nondeterministic pushdown automata re- 
spectively. It is well known that these two clas- 
sical formalisms define the same language class. 
In this paper, we show that probabilistic context- 
free grammars (PCFGs) and probabilistic push- 
down automata (PPDAs) define the same class 
of distributions on strings, thus extending the 
classical result to the stochastic case. We also 
touch on the perhaps more interesting ques- 
tion of whether PCFGs and shift-reduce pars- 
ing models have the same inductive bias with 
respect to the automatic learning of model pa- 
rameters from data. Though we cannot provide 
a definitive answer, the constructions we use to 
answer the equivalence question involve blow- 
ups in the number of parameters in both direc- 
tions, suggesting that the two models impose 
different inductive biases. 
We are concerned here with probabilistic 
shift-reduce parsing models that define prob- 
ability distributions over word sequences, and 
in particular the model of Chelba and Je- 
linek (1998). Most other probabilistic shift- 
reduce parsing models (Briscoe and Carroll, 
1993; Black et al., 1992; Magerman, 1994; Rat- 
naparkhi, 1997) give only the conditional prob- 
ability of a parse tree given a word sequence. 
Collins (1998) has argued that those models fail 
to capture the appropriate dependency relations 
of natural language. Furthermore, they are not 
directly comparable to PCFGs, which define 
probability distributions over word sequences. 
To make the discussion somewhat more con- 
crete, we now present a simplified version of the 
Chelba-Jelinek model. Consider the following 
sentence: 
The small woman gave the fat man her 
sandwich. 
The model under discussion is based on shift- 
reduce PPDAs. In such a model, shift transi- 
tions generate the next word w and its associ- 
ated syntactic category X and push the pair 
(X, w) on the stack. Each shift transition 
is followed by zero or more reduce transitions 
that combine topmost stack entries. For exam- 
ple the stack elements (Det, the), (hdj, small), 
(N, woman) can be combined to form the single 
entry (NP, woman) representing the phrase "the 
small woman". In general each stack entry con- 
sists of a syntactic category and a head word. 
After generating the prefix "The small woman 
gave the fat man" the stack might contain the 
sequence (NP, woman)<Y, gave)(NP, man). The 
Chelba-Jelinek model then executes a shift tran- 
542 
S --+ (S, admired) 
(S, admired) --+ (NP, Mary)(VP, admired) 
(VP, admired) -+ (V, admired)(Np, oak) 
(NP, oak) -+ (Det, the)(N, oak) (N, 
oak) -+ (Adj, towering> (N, oak> 
(N, oak> -~ (Adj, strong>(N, oak> 
(N, oak) -+ (hdj, old>(N, oak) 
(NP, Mary) -+ Mary 
(N, oak) -+ oak 
Figure 1: Lexicalized context-free grammar 
sition by generating the next word. This is 
done in a manner similar to that of a trigram 
model except that, rather than generate the 
next word based on the two preceding words, it 
generates the next word based on the two top- 
most stack entries. In this example the Chelba- 
Jelinek model generates the word "her" from 
(V, gave)(NP, man) while a classical trigram 
model would generate "her" from "fat man". 
We now contrast Chelba-Jelinek style mod- 
els with lexicalized PCFG models. A PCFG is 
a context-free grammar in which each produc- 
tion is associated with a weight in the interval 
\[0, 1\] and such that the weights of the produc- 
tions from any given nonterminal sum to 1. For 
instance, the sentence 
Mary admired the towering strong old oak 
can be derived using a lexicalized PCFG based 
on the productions in Figure 1. Production 
probabilities in the PCFG would reflect the like- 
lihood that a phrase headed by a certain word 
can be expanded in a certain way. Since it can 
be difficult to estimate fully these likelihoods, 
we might restrict ourselves to models based on 
bilexical relationships (Eisner, 1997), those be- 
tween pairs of words. The simplest bilexical re- 
lationship is a bigram statistic, the fraction of 
times that "oak" follows "old". Bilexical rela- 
tionships for a PCFG include that between the 
head-word of a phrase and the head-word of a 
non-head immediate constituent, for instance. 
In particular, the generation of the above sen- 
tence using a PCFG based on Figure 1 would 
exploit a bilexical statistic between "towering" 
and "oak" contained in the weight of the fifth 
production. This bilexical relationship between 
"towering" and "oak" would not be exploited in 
either a trigram model or in a Chelba-Jelinek 
style model. In a Chelba-Jelinek style model 
one must generate "towering" before generating 
"oak" and then "oak" must be generated from 
(Adj, strong), (Adj, old). In this example the 
Chelba-Jelinek model behaves more like a clas- 
sical trigram model than like a PCFG model. 
This contrast between PPDAs and PCFGs 
is formalized in theorem 1, which exhibits a 
PCFG for which no stochastic parameterization 
of the corresponding shift-reduce parser yields 
the same probability distribution over strings. 
That is, the standard shift-reduce translation 
from CFGs to PDAs cannot be generalized to 
the stochastic case. 
We give two ways of getting around the above 
difficulty. The first is to construct a top-down 
PPDA that mimics directly the process of gen- 
erating a PCFG derivation from the start sym- 
bol by repeatedly replacing the leftmost non- 
terminal in a sentential form by the right-hand 
side of one of its rules. Theorem 2 states 
that any PCFG can be translated into a top- 
down PPDA. Conversely, theorem 3 states that 
any PPDA can be translated to a PCFG, not 
just those that are top-down PPDAs for some 
PCFG. Hence PCFGs and general PPDAs de- 
fine the same class of stochastic languages. 
Unfortunately, top-down PPDAs do not al- 
low the simple left-to-right processing that mo- 
tivates shift-reduce PPDAs. A second way 
around the difficulty formalized in theorem 1 
is to encode additional information about the 
derivation context with richer stack and state 
alphabets. Theorem 7 shows that it is thus 
possible to translate an arbitrary PCFG to a 
shift-reduce PPDA. The construction requires a 
fair amount of machinery including proofs that 
any PCFG can be put in Chomsky normal form, 
that weights can be renormalized to ensure that 
the result of grammar transformations can be 
made into PCFGs, that any PCFG can be put 
in Greibach normal form, and, finally, that a 
Greibach normal form PCFG can be converted 
to a shift-reduce PPDA. 
The construction also involves a blow-up in 
the size of the shift-reduce parsing automaton. 
This suggests that some languages that are con- 
cisely describable by a PCFG are not concisely 
describable by a shift-reduce PPDA, hence that 
the class of PCFGs and the class of shift-reduce 
PPDAs impose different inductive biases on the 
543 
CF languages. In the conversion from shift- 
reduce PPDAs to PCFGs, there is also a blow- 
up, if a less dramatic one, leaving open the pos- 
sibility that the biases are incomparable, and 
that neither formalism is inherently more con- 
cise. 
Our main conclusion is then that, while the 
generative and shift-reduce parsing approaches 
are weakly equivalent, they impose different in- 
ductive biases. 
2 Probabilistic and Weighted 
Grammars 
For the remainder of the paper, we fix a terminal 
alphabet E and a nonterminal alphabet N, to 
which we may add auxiliary symbols as needed. 
A weighted context-free grammar (WCFG) 
consists of a distinguished start symbol S E N 
plus a finite set of weighted productions of the 
form X -~ a, (alternately, u : X --~ a), where 
X E N, a E (Nt2E)* and the weight u is a non- 
negative real number. A probabilistic context- 
free grammar (PCFG) is a WCFG such that for 
all X, )-~u:x-~a u = 1. Since weights are non- 
negative, this also implies that u <_ 1 for any 
individual production. 
A PCFG defines a stochastic process with 
sentential forms as states, and leftmost rewrit- 
ing steps as transitions. In the more general 
case of WCFGs, we can no longer speak of 
stochastic processes; but weighted parse trees 
and sets of weighted parse trees are still well- 
defined notions. 
We define a parse tree to be a tree whose 
nodes are labeled with productions. Suppose 
node ~ is labeled X -~ a\[Y1,...,Yn\], where we 
write a\[Y1,...,Yn\] for a string whose nonter- 
minal symbols are Y1,...,Y~. We say that ~'s 
nonterminal label is X and its weight is u. The 
subtree rooted at ~ is said to be rooted in X. ~ is 
well-labeled just in case it has n children, whose 
nonterminal labels are Y1,..., Yn, respectively. 
Note that a terminal node is well-labeled only 
if a is empty or consists exclusively of terminal 
symbols. We say a WCFG G admits a tree d 
just in case all nodes of d are well-labeled, and 
all labels are productions of G. Note that no 
requirement is placed on the nonterminal of the 
root node of d; in particular, it need not be S. 
We define the weight of a tree d, denoted 
Wa(d), or W(d) if G is clear from context, to be 
the product of weights of its nodes. The depth 
r(d) of d is the length of the longest path from 
root to leaf in d. The root production it(d) is the 
label of the root node. The root symbol p(d) is 
the left-hand side of ~r(d). The yield a(d) of 
the tree d is defined in the standard way as the 
string of terminal symbols "parsed" by the tree. 
It is convenient to treat the functions 7r, p, 
a, and r as random variables over trees. We 
write, for example, {p = X} as an abbreviation 
for {dip(d)= X}; and WG(p = X) represents 
the sum of weights of such trees. If the sum 
diverges, we set WG(p = X) = oo. We call 
IIXHG = WG(p = X) the norm of X, and IIGII = 
IISlla the norm of the grammar. 
A WCFG G is called convergent if \[\[G\[\[ < oo. 
If G is a PCFG then \[\[G\[\[ = WG(p "- S) < 1, 
that is, all PCFGs are convergent. A PCFG 
G is called consistent if \]\]GII = 1. A sufficient 
condition for the consistency of a PCFG is given 
in (Booth and Thompson, 1973). If (I) and • are 
two sets of parse trees such that 0 < WG(~) < 
co we define PG((I)\]~) to be WG(~Nqt)/WG(kO). 
For any terminal string y and grammar G such 
that 0 < WG(p -- S) < co we define PG(Y) to 
be Pa(a = YIP = S). 
3 Stochastic Push-Down Automata 
We use a somewhat nonstandard definition of 
pushdown automaton for convenience, but all 
our results hold for a variety of essentially equiv- 
alent definitions. In addition to the terminal 
alphabet ~, we will use sets of stack symbols 
and states as needed. A weighted push-down 
automaton (WPDA) consists of a distinguished 
start state q0, a distinguished start stack symbol 
X0 and a finite set of transitions of the following 
form where p and q are states, a E E L.J {e}, X 
and Z1, ..., Zn are stack symbols, and w is a 
nonnegative real weight: 
x, pa~ Zl ... Zn, q 
A WPDA is a probabilistic push-down automa- 
ton (PPDA) if all weights are in the interval 
\[0, 1\] and for each pair of a stack symbol X and 
a state q the sum of the weights of all transitions 
of the form X,p ~ Z1 ...Z=, q equals 1. A ma- 
chine configuration is a pair (fl, q) of a finite 
sequence fl of stack symbols (a stack) and a ma- 
chine state q. A machine configuration is called 
halting if the stack is empty. If M is a PPDA 
containing the transition X,p ~ Z1...Zn,q 
then any configuration of the form (fiX, p) has 
544 
probability w of being transformed into the con- 
figuration (f~Z1...Zn, q> where this transfor- 
mation has the effect of "outputting" a if a ¢ e. 
A complete execution of M is a sequence of tran- 
sitions between configurations starting in the 
initial configuration <X0, q0> and ending in a 
configuration with an empty stack. The prob- 
ability of a complete execution is the product 
of the probabilities of the individual transitions 
between configurations in that execution. For 
any PPDA M and y E E* we define PM(Y) to 
be the sum of the probabilities of all complete 
executions outputting y. A PPDA M is called 
consistent if )-~ye~* PM(Y) = 1. 
We first show that the well known shift- 
reduce conversion of CFGs into PDAs can not 
be made to handle the stochastic case. Given a 
(non-probabilistic) CFG G in Chomsky normal 
form we define a (non-probabilistic) shift-reduce 
PDA SIt(G) as follows. The stack symbols of 
SIt(G) are taken to be nonterminals of G plus 
the special symbols T and ±. The states of 
SR(G) are in one-to-one correspondence with 
the stack symbols and we will abuse notation 
by using the same symbols for both states and 
stack symbols. The initial stack symbol is 1 
and the initial state is (the state corresponding 
to) _L. For each production of the form X --+ a 
in G the PDA SIt(G) contains all shift transi- 
tions of the following form 
Y,Z-~ YZ, X 
The PDA SR(G) also contains the following ter- 
mination transitions where S is the start symbol 
of G. 
E 1, S -+, T 
I,T -~,T 
Note that if G consists entirely of productions of 
the form S -+ a these transitions suffice. More 
generally, for each production of the form X -+ 
YZ in G the PDA SR(G) contains the following 
reduce transitions. 
Y, Z -~, X 
All reachable configurations are in one of the 
following four forms where the first is the initial 
configuration, the second is a template for all 
intermediate configurations with a E N*, and 
the last two are terminal configurations. 
<1, 1>, <11., x>, <I,T>, T> 
Furthermore, a configuration of the form 
(l_l_a, X) can be reached after outputting y if 
and only if aX :~ y. In particular, the machine 
can reach configuration (±_L, S) outputting y 
if and only if S :~ y. So the machine SR(G) 
generates the same language as G. 
We now show that the shift-reduce transla- 
tion of CFGs into PDAs does not generalize to 
the stochastic case. For any PCFG G we define 
the underlying CFG to be the result of erasing 
all weights from the productions of G. 
Theorem 1 There exists a consistent PCFG G 
in Chomsky normal .form with underlying CFG 
G' such that no consistent weighting M of the 
PDA SR(G ~) has the property that PM(Y) = 
Pa(u) for all U e 
To prove the theorem take G to be the fol- 
lowing grammar. 
1_ 1_ 
S -~ AX1, S 3+ BY1 
X, -~ CX2, X2 -~ CA 
Yl Cy2, Y2 A, C B 
A-~ a, S-~ b, C-~ c 
Note that G generates acca and bccb each 
with probability ½. Let M be a consistent 
PPDA whose transitions consist of some weight- 
ing of the transitions of SR(G'). We will as- 
sume that PM(Y) = PG(Y) for all y E E* 
and derive a contradiction. Call the nonter- 
minals A, B, and C preterminals. Note that 
the only reduce transitions in SR(G ~) com- 
bining two preterminals are C, A -~,X2 and 
C, B -~,Y2. Hence the only machine configu- 
ration reachable after outputting the sequence 
ace is (.I__LAC, C>. If PM(acca) -- ½ and 
PM(accb) -- 0 then the machine in configuration 
(.I_±AC, C> must deterministically move to con- 
figuration (I±ACC, A>. But this implies that 
configuration (IIBC, C> also deterministically 
moves to configuration <±±BCC, A> so we have 
PM(bccb) -= 0 which violates the assumptions 
about M. ,, 
Although the standard shift-reduce transla- 
tion of CFGs into PDAs fails to generalize to 
the stochastic case, the standard top-down con- 
version easily generalizes. A top-down PPDA 
is one in which only ~ transitions can cause the 
stack to grow and transitions which output a 
word must pop the stack. 
545 
Theorem 2 Any string distribution definable 
by a consistent PCFG is also definable by a top- 
down PPDA. 
Here we consider only PCFGs in Chom- 
sky normal form--the generalization to arbi- 
trary PCFGs is straightforward. Any PCFG 
in Chomsky normal form can be translated to 
a top-down PPDA by translating each weighted 
production of the form X --~ YZ to the set of 
expansion moves of the form W, X ~ WZ, Y 
and each production of the form X -~ a to the 
set of pop moves of the form Z, X 72-'~, Z. • 
We also have the following converse of the 
above theorem. 
Theorem 3 Any string distribution definable 
by a consistent PPDA is definable by a PCFG. 
The proof, omitted here, uses a weighted ver- 
sion of the standard translation of a PDA into 
a CFG followed by a renormalization step using 
lemma 5. We note that it does in general in- 
volve an increase in the number of parameters 
in the derived PCFG. 
In this paper we are primarily interested in 
shift-reduce PPDAs which we now define for- 
mally. In a shift-reduce PPDA there is a one- 
to-one correspondence between states and stack 
symbols and every transition has one of the fol- 
lowing two forms. 
Y, Za-~YZ, X a¢E 
EgW Y, Z -~+ , X 
Transitions of the first type are called shift 
transitions and transitions of the second type 
are called reduce transitions. Shift transitions 
output a terminal symbol and push a single 
symbol on the stack. Reduce transitions are 
e-transitions that combine two stack symbols. 
The above theorems leave open the question of 
whether shift-reduce PPDAs can express arbi- 
trary context-free distributions. Our main the- 
orem is that they can. To prove this some ad- 
ditional machinery is needed. 
4 Chomsky Normal Form 
A PCFG is in Chomsky normal form (CNF) if 
all productions are either of the form X -St a, 
a E E or X -~ Y1Y2, Y1,Y2 E N. Our next 
theorem states, in essence, that any PCFG can 
be converted to Chomsky normal form. 
Theorem 4 For any consistent PCFG G with 
PG(e) < 1 there exists a consistent PCFG C(G) 
in Chomsky normal form such that, for all y E 
E+: 
Pa(y) - ea(yly # e) PC(G)(Y) -- 1 - Pa(e) 
To prove the theorem, note first that, without 
loss of generality, we can assume that all pro- 
ductions in G are of one of the forms X --~ YZ, 
X -5t Y, X -~ a, or X -Y+ e. More specifi- 
cally, any production not in one of these forms 
must have the form X -5t ¢rfl where a and fl 
are nonempty strings. Such a production can 
be replaced by X -~ AB, A -~ a, and B 2+ fl 
where A and B are fresh nonterminal symbols. 
By repeatedly applying this binarization trans- 
formation we get a grammar in the desired form 
defining the same distribution on strings. 
We now assume that all productions of G 
are in one of the above four forms. This im- 
plies that a node in a G-derivation has at most 
two children. A node with two children will 
be called a branching node. Branching nodes 
must be labeled with a production of the form 
X -~ YZ. Because G can contain produc- 
tions of the form X --~ e there may be ar- 
bitrarily large G-derivations with empty yield. 
Even G-derivations with nonempty yield may 
contain arbitrarily large subtrees with empty 
yield. A branching node in the G-derivation 
will be called ephemeral if either of its chil- 
dren has empty yield. Any G-derivation d with 
la(d)l _ 2 must contain a unique shallowest 
non-ephemeral branching node, labeled by some 
production X ~ YZ. In this case, define 
fl(d) = YZ. Otherwise (la(d)l < 2), let fl(d) = 
a(d). We say that a nonterminal X is nontrivial 
in the grammar G if Pa(a # e I P = X) > O. 
We now define the grammar G' to consist of all 
productions of the following form where X, Y, 
and Z are nontrivial nonterminals of G and a is 
a terminal symbol appearing in G. 
X PG(~=YZ~p=x, ~#~) YZ 
X PG(~=a 12+=x, ~¢~) a 
We leave it to the reader to verify that G' has 
the property stated in theorem 4. • 
The above proof of theorem 4 is non- 
constructive in that it does not provide any 
546 
way of computing the conditional probabilities 
PG(Z = YZ I p = x, # and Pa(Z = 
a \[ p = X, a ¢ e). However, it is not 
difficult to compute probabilities of the form 
PG(¢ \[ p = X, r <_ t+ 1) from probabili- 
ties of the form PG((I) \] p = X, v _< t), and 
PG(¢ I P = X) is the limit as t goes to infinity 
of Pa((I )\] p= X, r_< t). We omit the details 
here. 
from X equals 1: 
= ~:x-~\[Y1 ..... y.\] u~ 
--  E .x-,oIv, ..... Y.l II lla 
= ..... y.\]ul-LwG(p= 
= wo(p=x)Wa(p= X) 
- 1 
5 Renormalization 
A nonterminal X is called reachable in a gram- 
mar G if either X is S or there is some (re- 
cursively) reachable nonterminal Y such that G 
contains a production of the form Y -~ a where 
contains X. A nonterminal X is nonempty 
in G if G contains X -~ a where u > 0 and a 
contains only terminal symbols, or G contains 
X -~ o~\[Y1, ..., Yk\] where u > 0 and each 
1~ is (recursively) nonempty. A WCFG G is 
proper if every nonterminal is both reachable 
and nonempty. It is possible to efficiently com- 
pute the set of reachable and nonempty non- 
terminals in any grammar. Furthermore, the 
subset of productions involving only nontermi- 
nals that are both reachable and nonempty de- 
fines the same weight distribution on strings. 
So without loss of generality we need only con- 
sider proper WCFGs. A reweighting of G is any 
WCFG derived from G by changing the weights 
of the productions of G. 
Lemma 5 For any convergent proper WCFG 
G, there exists a reweighting G t of G such that 
G ~ is a consistent PCFG such that for all ter- 
minal strings y we have PG' (Y) = Pa (Y). 
Proof." Since G is convergent, and every non- 
terminal X is reachable, we must have IIXIla < 
oo. We now renormalize all the productions 
from X as follows. For each production X -~ 
a\[Y1,..., Yn\] we replace u by 
¢ = II IIG 
IIXIla 
To show that G' is a PCFG we must show 
that the sum of the weights of all productions 
For any parse tree d admitted by G let 
d ~ be the corresponding tree admitted by G ~, 
that is, the result of reweighting the pro- 
ductions in d. One can show by induc- 
tion on the depth of parse trees that if 
p(d) = X then Wc,(d') = \[-~GWG(d). 
Therefore IIXIIG, = ~~{d\[p(d)=X} WG,(d') -~ 
~ ~{alo(e)=x} Wa(d) = = 1. In par- 
ticular, Ilaql = IlSlla,- 1, that is, G' is consis- 
tent. This implies that for any terminal string 
Y we have PG'(Y) = li-~Wa,(a = y, p = S) = 
Wa,(a = y, p = S). Furthermore, for any tree 
d with p(d) = S we have Wa,(d') = ~\[~cWa(d) 
and so WG,(a = y, p = S) - ~WG(a = 
y, p = S) = Pc(Y). " 
6 Greibach Normal Form 
A PCFG is in Greibach normal form (GNF) if 
every production X -~ a satisfies (~ E EN*. 
The following holds: 
Theorem 6 For any consistent PCFG G in 
CNF there exists a consistent PCFG G ~ in GNF 
such that Pc,(Y) = Pa(Y) for y e E*. 
Proof: A left corner G-derivation from X to 
Y is a G-derivation from X where the leftmost 
leaf, rather than being labeled with a produc- 
tion, is simply labeled with the nonterminal 
Y. For example, if G contains the productions 
X ~ YZ and Z -~ a then we canconstruct a 
left corner G-derivation from X to Y by build- 
ing a tree with a root labeled by X Z.~ YZ, a 
left child labeled with Y and a right child la- 
beled with Z -~ a. The weight of a left corner 
G-derivation is the product of the productions 
on the nodes. A tree consisting of a single node 
labeled with X is a left corner G-derivation from 
X toX. 
For each pair of nonterminals X, Y in G 
we introduce a new nonterminal symbol X/Y. 
547 
The H-derivations from X/Y will be in one 
to one correspondence with the left-corner G- 
derivations from X to Y. For each production 
in G of the form X ~ a we include the following 
in H where S is the start symbol of G: 
S --~ a S/X 
We also include in H all productions of the fol- 
lowing form where X is any nonterminal in G: 
x/x 
If G consists only of productions of the form 
S -~ a these productions suffice. More gener- 
ally, for each nonterminal X/Y of H and each 
pair of productions U ~ YZ, W ~-~ a we in- 
clude in H the following: 
X/Y ~2 a Z/W X/U 
Because of the productions X/X -~ e, WH(# : 
X/X) > 1 , and H is not quite in GNF. These 
two issues will be addressed momentarily. 
Standard arguments can be used to show 
that the H-derivations from X/Y are in one- 
to-one correspondence with the left corner G- 
derivations from X to Y. Furthermore, this one- 
to-one correspondence preserves weight--if d is 
the H-derivation rooted at X/Y corresponding 
to the left corner G-derivation from X to Y then 
WH (d) is the product of the weights of the pro- 
ductions in the G-derivation. 
The weight-preserving one-to-one correspon- 
dence between left-corner G-derivations from X 
to Y and H-derivations from X/Y yields the 
following. 
WH ( ao~ ) 
: ~'~(S_U+aS/X)EHUWH(~r : Ollp--- S/X) 
Po(a ) 
Theorem 5 implies that we can reweight the 
proper subset of H (the reachable and nonempty 
productions of H) so as to construct a consistent 
PCFG g with Pj((~) = PG(~). To prove theo- 
rem 6 it now suffices to show that the produc- 
tions of the form X/X -~ e can be eliminated 
from the PCFG J. Indeed, we can eliminate 
the e productions from J in a manner similar 
to that used in the proof of theorem 4. A node 
in an J-derivation is ephemeral if it is labeled 
X -~ e for some X. We now define a function 7 
on J-derivations d as follows. If the root of d is 
labeled with X -~ aYZ then we have four sub- 
cases. If neither child of the root is ephemeral 
then 7(d) is the string aYZ. If only the left child 
is ephemeral then 7(d) is aZ. If only the right 
child is ephemeral then 7(d) is aY and if both 
children are ephemeral then 7(d) is a. Analo- 
gously, if the root is labeled with X -~ aY, then 
7(d) is aY if the child is not ephemeral and a 
otherwise. If the root is labeled with X -~ e 
then 7(d) is e. 
A nonterminal X in K will be called trivial 
ifPj(7= e I P =X) = 1. We now define the 
final grammar G' to consist of all productions 
of the following form where X, Y, and Z are 
nontrivial nonterminals appearing in J and a is 
a terminal symbol appearing in J. 
X Pj(a=a I__~=X, "y¢¢) a 
X pj(a=aY~_~=X, "yCe) aY 
X PJ(a=aYZl-~ p=X' ~¢) aYZ 
As in section 4, for every nontrivial nonterminal 
X in K and terminal string (~ we have PK (a = 
(~ I P= X) = Pj(a= a I P= X, a ~ e). In 
particular, since Pj(e) = PG(() = 0, we have 
the following: 
= 
= Pj(a=alp=S ) 
= Pj(a) 
= Pa( ) 
The PCFG K is the desired PCFG in Greibach 
normal form. • 
The construction in this proof is essen- 
tially the standard left-corner transformation 
(Rosenkrantz and II, 1970), as extended by Sa- 
lomaa and Soittola (1978, theorem 2.3) to alge- 
braic formal power series. 
7 The Main Theorem 
We can now prove our main theorem. 
Theorem 7 For any consistent PCFG G there 
exists a shift-reduce PPDA M such that 
PM(Y) = PG(Y) for all y E ~*. 
Let G be an arbitrary consistent PCFG. By 
theorems 4 and 6~ we can assume that G con- 
sists of productions of the form S -~ e and 
548 
S l~w St plus productions in Greibach normal 
form not mentioning S. We can then replace 
the rule S 1_:+~ S ~ with all rules of the form 
S 0-__~)~' a where G contains S ~ ~' -+ a. We now 
assume without loss of generality that G con- 
sists of a single production of the form S -~ e 
plus productions in Greibach normal form not 
mentioning S on the right hand side. 
The stack symbols of M are of the form W~ 
where ce E N* is a proper suffix of the right hand 
side of some production in G. For example, if 
G contains the production X -~ aYZ then the 
symbols of M include Wyz, Wy, and We. The 
initial state is Ws and the initial stack symbol is 
±. We have assumed that G contains a unique 
production of the form S -~ e. We include the 
following transition in M corresponding to this 
production. 
A_,Ws~,T 
Then, for each rule of the form X -~ a~ in G 
and each symbol of the form Wx,~ we include 
the following in M: 
Z, Wx. ~ ZWx., Wz 
We also include all "post-processing" rules of 
the following form: 
Wx~W~ ~ W~ 
~.,1 ±,W~ ~,T 
I,T -:+,T 
Note that all reduction transitions are determin- 
istic with the single exception of the first rule 
listed above. The nondeterministic shift tran- 
sitions of M are in one-to-one correspondence 
with the productions of G. This yields the prop- 
erty that PM(Y) = PG(Y). • 
8 Conclusions 
The relationship between PCFGs and PPDAs 
is subtler than a direct application of the clas- 
sical constructions relating general CFGs and 
PDAs. Although PCFGs can be concisely trans- 
lated into top-down PPDAs, we conjecture that 
there is no concise translation of PCFGs into 
shift-reduce PPDAs. Conversely, there appears 
to be no concise translation of shift-reduce PP- 
DAs to PCFGs. Our main result is that PCFGs 
and shift-reduce PPDAs are intertranslatable, 
hence weakly equivalent. However, the non- 
conciseness of our translations is consistent with 
the view that stochastic top-down generation 
models are significantly different from shift- 
reduce stochastic parsing models, affecting the 
ability to learn a model from examples. 

References 
Alfred V. Aho and Jeffrey D. Ullman. 1972. The 
Theory of Parsing, Translation and Compiling, 
volume I. Prentice-Hall, Englewood Cliffs, New 
Jersey. 
Ezra Black, Fred Jelinek, John Lafferty, David 
Magerman, Robert Mercer, and Salim Roukos. 
1992. Towards history-based grammars: Using 
richer models for probabilistic parsing. In Pro- 
ceedings of the 5th DARPA Speech and Natural 
Language Workshop. 
Taylor Booth and Richard Thompson. 1973. Apply- 
ing probability measures to abstract languages. 
IEEE Transactions on Computers, C-22(5):442- 
450. 
Ted Briscoe and John Carroll. 1993. Generalized 
probabilistic LR parsing of natural language (cor- 
pora) with unification-based grammars. Compu- 
tational Linguistics, 19(1):25-59. 
Eugene Charniak. 1997. Statistical parsing with 
a context-free grammar and word statistics. 
In Fourteenth National Conference on Artificial 
Intelligence, pages 598-603. AAAI Press/MIT 
Press. 
Ciprian Chelba and Fred Jelinek. 1998. Exploit- 
ing syntactic structure for language modeling. In 
COLING-ACL '98, pages 225-231. 
Michael Collins. 1998. Head-Driven Statistical Mod- 
els for Natural Language Parsing. Ph.D. thesis, 
University of Pennsylvania. 
Jason Eisner. 1997. Bilexical grammars and a cubic- 
time probabilistic parser. In Proceedings of the 
International Workshop on Parsing Technologies. 
David M. Magerman. 1994. Natural Language Pars- 
ing as Statistical Pattern Recognition. Ph.D. the- 
sis, Department of Computer Science, Stanford 
University. 
Adwait Ratnaparkhi. 1997. A linear oberved time 
statistical parser based on maximum entropy 
models. In Claire Cardie and Ralph Weischedel, 
editors, Second Conference on Empirical Meth- 
ods in Natural Language Processing (EMNLP-2), 
Somerset, New Jersey. Association For Computa- 
tional Linguistics. 
Daniel J. Rosenkrantz and Philip M. Lewis II. 1970. 
Deterministic left corner parser. In IEEE Con- 
ference Record of the 11th Annual Symposium on 
Switching and Automata Theory, pages 139-152. 
Arto Salomaa and Matti Soittola. 1978. Automata- 
Theoretic Aspects of Formal Power Series. 
Springer-Verlag, New York. 
