Probabilistic Parsing Strategies
Mark-Jan Nederhof
Faculty of Arts
University of Groningen
P.O. Box 716
NL-9700 AS Groningen
The Netherlands
markjan@let.rug.nl
Giorgio Satta
Dept. of Information Engineering
University of Padua
via Gradenigo, 6/A
I-35131 Padova
Italy
satta@dei.unipd.it
Abstract
We present new results on the relation between
context-free parsing strategies and their probabilis-
tic counter-parts. We provide a necessary condition
and a sufficient condition for the probabilistic exten-
sion of parsing strategies. These results generalize
existing results in the literature that were obtained
by considering parsing strategies in isolation.
1 Introduction
Context-free grammars (CFGs) are standardly used
in computational linguistics as formal models of the
syntax of natural language, associating sentences
with all their possible derivations. Other computa-
tional models with the same generative capacity as
CFGs are also adopted, as for instance push-down
automata (PDAs). One of the advantages of the use
of PDAs is that these devices provide an operational
specification that determines which steps must be
performed when parsing an input string, something
that is not offered by CFGs. In other words, PDAs
can be associated to parsing strategies for context-
free languages. More precisely, parsing strategies
are traditionally specified as constructions that map
CFGs to language-equivalent PDAs. Popular ex-
amples of parsing strategies are the standard con-
structions of top-down PDAs (Harrison, 1978), left-
corner PDAs (Rosenkrantz and Lewis II, 1970),
shift-reduce PDAs (Aho and Ullman, 1972) and LR
PDAs (Sippu and Soisalon-Soininen, 1990).
CFGs and PDAs have probabilistic counterparts,
called probabilistic CFGs (PCFGs) and probabilis-
tic PDAs (PPDAs). These models are very popular
in natural language processing applications, where
they are used to define a probability distribution
function on the domain of all derivations for sen-
tences in the language of interest. In PCFGs and
PPDAs, probabilities are assigned to rules or tran-
sitions, respectively. However, these probabilities
cannot be chosen entirely arbitrarily. For example,
for a given nonterminal $A$ in a PCFG, the sum of the
probabilities of all rules rewriting $A$ must be 1. This
means that, out of a total of say $m$ rules rewriting $A$,
only $m - 1$ rules represent “free” parameters.
Depending on the choice of the parsing strategy,
the constructed PDA may allow different probabil-
ity distributions than the underlying CFG, since the
set of free parameters may differ between the CFG
and the PDA, both quantitatively and qualitatively.
For example, (Sornlertlamvanich et al., 1999) and
(Roark and Johnson, 1999) have shown that a prob-
ability distribution that can be obtained by training
the probabilities of a CFG on the basis of a corpus
can be less accurate than the probability distribution
obtained by training the probabilities of a PDA con-
structed by a particular parsing strategy, on the basis
of the same corpus. Also the results from (Chitrao
and Grishman, 1990), (Charniak and Carroll, 1994)
and (Manning and Carpenter, 2000) could be seen
in this light.
The question arises of whether parsing strate-
gies can be extended probabilistically, i.e., whether
a given construction of PDAs from CFGs can be
“augmented” with a function defining the probabili-
ties for the target PDA, given the probabilities asso-
ciated with the input CFG, in such a way that the ob-
tained probabilistic distributions on the CFG deriva-
tions and the corresponding PDA computations are
equivalent. Some first results on this issue have been
presented by (Tendeau, 1995), who shows that the
already mentioned left-corner parsing strategy can
be extended probabilistically, and later by (Abney et
al., 1999) who show that the pure top-down parsing
strategy and a specific type of shift-reduce parsing
strategy can be probabilistically extended.
One might think that any “practical” parsing
strategy can be probabilistically extended, but this
turns out not to be the case. We briefly discuss
here a counter-example, in order to motivate the ap-
proach we have taken in this paper. Probabilistic
LR parsing has been investigated in the literature
(Wright and Wrigley, 1991; Briscoe and Carroll,
1993; Inui et al., 2000) under the assumption that
it would allow more fine-grained probability distri-
butions than the underlying PCFGs. However, this
is not the case in general. Consider a PCFG with
rule/probability pairs:
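\[
\begin{array}{ll}
S \to A\,B,\ 1 & B \to b\,C,\ \tfrac{2}{3} \\
A \to a\,C,\ \tfrac{1}{3} & B \to b\,D,\ \tfrac{1}{3} \\
A \to a\,D,\ \tfrac{2}{3} & C \to x\,c,\ 1 \\
 & D \to x\,d,\ 1
\end{array}
\]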
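There are two key transitions in the associated LR
automaton, which represent shift actions over $c$ and
$d$ (we denote LR states by their sets of kernel items
and encode these states into stack symbols):
\[
\begin{array}{l}
\tau_c:\ \{C \to x \bullet c,\ D \to x \bullet d\} \overset{c}{\mapsto}
  \{C \to x \bullet c,\ D \to x \bullet d\}\ \{C \to x\,c\,\bullet\} \\
\tau_d:\ \{C \to x \bullet c,\ D \to x \bullet d\} \overset{d}{\mapsto}
  \{C \to x \bullet c,\ D \to x \bullet d\}\ \{D \to x\,d\,\bullet\}
\end{array}
\]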
Assume a proper assignment of probabilities to the
transitions of the LR automaton, i.e., the sum of
transition probabilities for a given LR state is 1. It
can be easily seen that we must assign probability 1
to all transitions except $\tau_c$ and $\tau_d$, since this
is the only pair of distinct transitions that can be
applied for one and the same top-of-stack symbol, viz.
$\{C \to x \bullet c,\ D \to x \bullet d\}$. However, in the PCFG
model we have
\[
\frac{\Pr(axcbxd)}{\Pr(axdbxc)}
  = \frac{\Pr(A \to aC) \cdot \Pr(B \to bD)}{\Pr(A \to aD) \cdot \Pr(B \to bC)}
  = \frac{\frac{1}{3} \cdot \frac{1}{3}}{\frac{2}{3} \cdot \frac{2}{3}}
  = \frac{1}{4}
\]
whereas in the LR PPDA model we have
\[
\frac{\Pr(axcbxd)}{\Pr(axdbxc)}
  = \frac{\Pr(\tau_c) \cdot \Pr(\tau_d)}{\Pr(\tau_d) \cdot \Pr(\tau_c)}
  = 1 \neq \frac{1}{4}.
\]
Thus we conclude that there is no proper assignment
of probabilities to the transitions of the LR automa-
ton that would result in a distribution on the gener-
ated language that is equivalent to the one induced
by the source PCFG. Therefore the LR strategy does
not allow probabilistic extension.
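The arithmetic of this counter-example can be checked mechanically. The following is a minimal sketch in Python; the dictionary layout and the function names `pcfg_prob` and `lr_ratio` are our own illustrative choices, not part of the formal development. It computes the ratio under the PCFG exactly, and shows that in the LR PPDA the ratio is 1 for any proper choice of the probabilities of the two shift transitions.

```python
from fractions import Fraction

# Rule probabilities of the example PCFG, keyed by (lhs, rhs).
PCFG = {
    ("S", "AB"): Fraction(1),
    ("A", "aC"): Fraction(1, 3),
    ("A", "aD"): Fraction(2, 3),
    ("B", "bC"): Fraction(2, 3),
    ("B", "bD"): Fraction(1, 3),
    ("C", "xc"): Fraction(1),
    ("D", "xd"): Fraction(1),
}


def pcfg_prob(derivation):
    """Probability of a derivation: the product of its rule probabilities."""
    prob = Fraction(1)
    for rule in derivation:
        prob *= PCFG[rule]
    return prob


# Left-most derivations of the two strings compared in the text.
AXCBXD = [("S", "AB"), ("A", "aC"), ("C", "xc"), ("B", "bD"), ("D", "xd")]
AXDBXC = [("S", "AB"), ("A", "aD"), ("D", "xd"), ("B", "bC"), ("C", "xc")]

PCFG_RATIO = pcfg_prob(AXCBXD) / pcfg_prob(AXDBXC)


def lr_ratio(p_c):
    """Ratio in the LR PPDA when tau_c has probability p_c and tau_d has 1 - p_c.

    All other transitions have probability 1, and each of the two
    computations uses tau_c once and tau_d once.
    """
    p_d = 1 - p_c
    return (p_c * p_d) / (p_d * p_c)


print(PCFG_RATIO)                 # ratio under the PCFG
print(lr_ratio(Fraction(1, 3)))   # ratio under one proper LR assignment
```

Whatever value strictly between 0 and 1 is passed to `lr_ratio`, numerator and denominator are the same product, which is the reason no proper assignment can reproduce the PCFG ratio.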
One may seemingly solve this problem by dropping
the constraint of properness, letting each transition
that outputs a rule have the same probability
as that rule in the PCFG, and letting other transitions
have probability 1. However, the properness condition
for PDAs has been heavily exploited in parsing
applications, in doing incremental left-to-right
probability computation for beam search (Roark
and Johnson, 1999; Manning and Carpenter, 2000),
and more generally in integration with other linear
probabilistic models. Furthermore, commonly
used training algorithms for PCFGs/PPDAs always
produce proper probability assignments, and many
desired mathematical properties of these methods
are based on such an assumption (Chi and Geman,
1998; Sánchez and Benedí, 1997). We may therefore
discard non-proper probability assignments in
the current study, which investigates aspects of the
potential of training algorithms for CFGs and PDAs.
What has been lacking in the literature is a theo-
retical framework to relate the parameter space of a
CFG to that of a PDA constructed from the CFG by
a particular parsing strategy, in terms of the set of
allowable probability distributions over derivations.
Note that the number of free parameters alone is
not a satisfactory characterization of the parameter
space. In fact, if the “nature” of the parameters is
ill-chosen, then an increase in the number of param-
eters may lead to a deterioration of the accuracy of
the model, due to sparseness of data.
In this paper we extend previous results, where
only a few specific parsing strategies were considered
in isolation, and provide some general characterization
of parsing strategies that can be probabilistically
extended. Our main contributions can
be stated as follows.
• We define a theoretical framework to relate the
parameter space defined by a CFG and that defined
by a PDA constructed from the CFG by a
particular parsing strategy.
• We provide a necessary condition and a sufficient
condition for the probabilistic extension
of parsing strategies.
We use the above findings to establish new results
about probabilistic extensions of parsing strategies
that are used in standard practice in computational
linguistics, as well as to provide simpler proofs of
already known results.
We introduce our framework in Section 3 and re-
port our main results in Sections 4 and 5. We discuss
applications of our results in Section 6.
2 Preliminaries
In this paper we assume some familiarity with def-
initions of (P)CFGs and (P)PDAs. We refer the
reader to standard textbooks and publications as for
instance (Harrison, 1978; Booth and Thompson,
1973; Santos, 1972).
A CFG $\mathcal{G}$ is a tuple $(\Sigma, N, S, R)$, with $\Sigma$ and $N$
the sets of terminals and nonterminals, respectively,
$S$ the start symbol and $R$ the set of rules. In this
paper we only consider left-most derivations, represented
as strings $d \in R^*$ and simply called derivations.
For $\alpha, \beta \in (\Sigma \cup N)^*$, we write
$\alpha \Rightarrow^d \beta$ with the usual meaning. If $\alpha = S$ and
$\beta = w \in \Sigma^*$, we call $d$ a complete derivation of $w$.
We say a CFG is reduced if each rule in $R$ occurs in
some complete derivation.
A PCFG is a pair $(\mathcal{G}, p)$ consisting of a CFG $\mathcal{G}$
and a probability function $p$ from $R$ to real numbers
in the interval $[0, 1]$. A PCFG is proper
if $\sum_{\pi = (A \to \alpha) \in R} p(\pi) = 1$ for each $A \in N$.
The probability of a (left-most) derivation
$d = \pi_1 \cdots \pi_m$, $\pi_i \in R$ for $1 \le i \le m$, is
$p(d) = \prod_{i=1}^{m} p(\pi_i)$. The probability of a string
$w \in \Sigma^*$ is $p(w) = \sum_{S \Rightarrow^d w} p(d)$. A PCFG is
consistent if $\sum_{w \in \Sigma^*} p(w) = 1$. A PCFG $(\mathcal{G}, p)$ is
reduced if $\mathcal{G}$ is reduced.
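The definitions of properness and of derivation probability translate directly into code. The sketch below is illustrative only: the toy grammar `RULES`, the encoding of a rule as a `(lhs, rhs)` pair and the function names are assumptions made for the example.

```python
from collections import defaultdict
from math import isclose, prod

# Hypothetical toy PCFG: a rule is (lhs, rhs), with rhs a tuple of symbols.
RULES = {
    ("S", ("a", "S", "b")): 0.4,
    ("S", ("a", "b")): 0.6,
}


def is_proper(rules):
    """True if the rule probabilities sum to 1 for every left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return all(isclose(total, 1.0) for total in totals.values())


def derivation_probability(rules, derivation):
    """p(d): the product of the probabilities of the rules occurring in d."""
    return prod(rules[rule] for rule in derivation)


# Left-most derivation of "aabb": S => a S b => a a b b.
AABB = [("S", ("a", "S", "b")), ("S", ("a", "b"))]

print(is_proper(RULES))
print(derivation_probability(RULES, AABB))
```

In this toy grammar every string has a single derivation, so the probability of a string coincides with the probability of its derivation; in general $p(w)$ sums over all derivations of $w$.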
In this paper we will mainly consider push-down
transducers rather than push-down automata. Push-
down transducers not only compute derivations of
the grammar while processing an input string, but
they also explicitly produce output strings from
which these derivations can be obtained. We use
transducers for two reasons. First, constraints on
the output strings allow us to restrict our attention
to “reasonable” parsing strategies. Those strategies
that cannot be formalized within these constraints
are unlikely to be of practical interest. Secondly,
mappings from input strings to derivations, such as
those realized by push-down transducers, turn out
to be a very powerful abstraction and allow direct
proofs of several general results.
Contrary to many textbooks, our push-down de-
vices do not possess states next to stack symbols.
This is without loss of generality, since states can
be encoded into the stack symbols, given the types
of transitions that we allow. Thus, a PDT $\mathcal{A}$ is a
6-tuple $(\Sigma_1, \Sigma_2, Q, X_{in}, X_{fin}, \Delta)$, with $\Sigma_1$ and
$\Sigma_2$ the input and output alphabets, respectively, $Q$
the set of stack symbols, including the initial and final
stack symbols $X_{in}$ and $X_{fin}$, respectively, and
$\Delta$ the set of transitions. Each transition has one
of the following three forms: $X \mapsto XY$, called a
push transition, $YX \mapsto Z$, called a pop transition,
or $X \overset{x,y}{\mapsto} Y$, called a swap transition; here $X$, $Y$,
$Z \in Q$, $x \in \Sigma_1 \cup \{\varepsilon\}$ is the input read by the
transition and $y \in \Sigma_2^*$ is the written output. Note that
in our notation, stacks grow from left to right, i.e.,
the top-most stack symbol will be found at the right
end. A configuration of a PDT is a triple $(\alpha, w, v)$,
where $\alpha \in Q^*$ is a stack, $w \in \Sigma_1^*$ is the remaining
input, and $v \in \Sigma_2^*$ is the output generated so
far. Computations are represented as strings $c \in \Delta^*$.
For configurations $(\alpha, w, v)$ and $(\beta, w', v')$, we
write $(\alpha, w, v) \vdash^c (\beta, w', v')$ with the usual meaning,
and write $(\alpha, w, v) \vdash^* (\beta, w', v')$ when $c$ is of
no importance. If $(X_{in}, w, \varepsilon) \vdash^c (X_{fin}, \varepsilon, v)$, then
$c$ is a complete computation of $w$, and the output
string $v$ is denoted $out(c)$. A PDT is reduced if
each transition in $\Delta$ occurs in some complete
computation.
Without loss of generality, we assume that combinations
of different types of transitions are not allowed
for a given stack symbol. More precisely,
for each stack symbol $X \neq X_{fin}$, the PDT can
only take transitions of a single type (push, pop or
swap). A PDT can easily be brought into this form
by introducing for each $X$ three new stack symbols
$X_{push}$, $X_{pop}$ and $X_{swap}$ and new swap transitions
$X \overset{\varepsilon,\varepsilon}{\mapsto} X_{push}$, $X \overset{\varepsilon,\varepsilon}{\mapsto} X_{pop}$ and
$X \overset{\varepsilon,\varepsilon}{\mapsto} X_{swap}$. In
each existing transition that operates on top-of-stack
$X$, we then replace $X$ by one from $X_{push}$, $X_{pop}$ or
$X_{swap}$, depending on the type of that transition. We
also assume that $X_{fin}$ does not occur in the left-hand
side of a transition, again without loss of generality.
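A PPDT is a pair $(\mathcal{A}, p)$ consisting of a PDT $\mathcal{A}$
and a probability function $p$ from $\Delta$ to real numbers
in the interval $[0, 1]$. A PPDT is proper if
• $\sum_{\tau = (X \mapsto XY) \in \Delta} p(\tau) = 1$ for each $X \in Q$
such that there is at least one transition $X \mapsto XY$,
$Y \in Q$;
• $\sum_{\tau = (X \overset{x,y}{\mapsto} Y) \in \Delta} p(\tau) = 1$ for each $X \in Q$ such
that there is at least one transition $X \overset{x,y}{\mapsto} Y$,
$x \in \Sigma_1 \cup \{\varepsilon\}$, $y \in \Sigma_2^*$, $Y \in Q$; and
• $\sum_{\tau = (YX \mapsto Z) \in \Delta} p(\tau) = 1$, for each $X, Y \in Q$
such that there is at least one transition $YX \mapsto Z$,
$Z \in Q$.
The probability of a computation $c = \tau_1 \cdots \tau_m$,
$\tau_i \in \Delta$ for $1 \le i \le m$, is $p(c) = \prod_{i=1}^{m} p(\tau_i)$.
The probability of a string $w$ is
$p(w) = \sum_{(X_{in}, w, \varepsilon) \vdash^c (X_{fin}, \varepsilon, v)} p(c)$.
A PPDT is consistent if $\sum_{w \in \Sigma_1^*} p(w) = 1$.
A PPDT $(\mathcal{A}, p)$ is reduced if $\mathcal{A}$ is reduced.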
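The three properness conditions differ only in how transitions are grouped: push and swap transitions are grouped by the top-of-stack symbol, pop transitions by the pair of the two top-most stack symbols. A minimal sketch follows, assuming a hypothetical tuple encoding of transitions (`("push", X, Y)`, `("swap", X, x, y, Y)` and `("pop", Y, X, Z)`) that is chosen only for this example.

```python
from collections import defaultdict
from math import isclose, prod

# Hypothetical toy PPDT. Encoding (an assumption of this sketch):
#   ("push", X, Y)        stands for  X -> X Y
#   ("swap", X, x, y, Y)  stands for  X -> Y, reading x and writing y
#   ("pop", Y, X, Z)      stands for  Y X -> Z  (X is the top of the stack)
TRANSITIONS = {
    ("push", "Xin", "Y"): 1.0,
    ("swap", "Y", "a", "r1", "Y1"): 0.3,
    ("swap", "Y", "b", "r2", "Y1"): 0.7,
    ("pop", "Xin", "Y1", "Xfin"): 1.0,
}


def properness_key(transition):
    """Group transitions as in the three properness conditions."""
    kind = transition[0]
    if kind == "pop":
        # Pop transitions are normalized per pair (below, top).
        return (kind, transition[1], transition[2])
    # Push and swap transitions are normalized per top-of-stack symbol.
    return (kind, transition[1])


def is_proper_ppdt(transitions):
    totals = defaultdict(float)
    for transition, p in transitions.items():
        totals[properness_key(transition)] += p
    return all(isclose(total, 1.0) for total in totals.values())


def computation_probability(transitions, computation):
    """p(c): the product of the probabilities of the transitions in c."""
    return prod(transitions[t] for t in computation)


# The complete computation on input "a".
ON_A = [
    ("push", "Xin", "Y"),
    ("swap", "Y", "a", "r1", "Y1"),
    ("pop", "Xin", "Y1", "Xfin"),
]

print(is_proper_ppdt(TRANSITIONS))
print(computation_probability(TRANSITIONS, ON_A))
```

Because each stack symbol admits transitions of a single type, including the transition type in the grouping key does not change the outcome; it only keeps the three conditions visibly separate.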
3 Parsing Strategies
The term “parsing strategy” is often used informally
to refer to a class of parsing algorithms that behave
similarly in some way. In this paper, we assign a
formal meaning to this term, relying on the obser-
vation by (Lang, 1974) and (Billot and Lang, 1989)
that many parsing algorithms for CFGs can be de-
scribed in two steps. The first is a construction of
push-down devices from CFGs, and the second is
a method for handling nondeterminism (e.g. back-
tracking or dynamic programming). Parsing algo-
rithms that handle nondeterminism in different ways
but apply the same construction of push-down de-
vices from CFGs are seen as realizations of the same
parsing strategy.
Thus, we define a parsing strategy to be a function
$\mathcal{S}$ that maps a reduced CFG $\mathcal{G} = (\Sigma_1, N, S, R)$
to a pair $\mathcal{S}(\mathcal{G}) = (\mathcal{A}, f)$ consisting of a reduced
PDT $\mathcal{A} = (\Sigma_1, \Sigma_2, Q, X_{in}, X_{fin}, \Delta)$, and a function
$f$ that maps a subset of $\Sigma_2^*$ to a subset of $R^*$,
with the following properties:
• $R \subseteq \Sigma_2$.
• For each string $w \in \Sigma_1^*$ and each complete
computation $c$ on $w$, $f(out(c)) = d$ is a (left-most)
derivation of $w$. Furthermore, each symbol
from $R$ occurs as often in $out(c)$ as it occurs
in $d$.
• Conversely, for each string $w \in \Sigma_1^*$ and
each derivation $d$ of $w$, there is precisely
one complete computation $c$ on $w$ such that
$f(out(c)) = d$.
If $c$ is a complete computation, we will write $f(c)$
to denote $f(out(c))$. The conditions above then imply
that $f$ is a bijection from complete computations
to complete derivations. Note that output strings of
(complete) computations may contain symbols that
are not in $R$, and the symbols that are in $R$ may
occur in a different order in $v$ than in $f(v) = d$.
The purpose of the symbols in $\Sigma_2 \setminus R$ is to help
this process of reordering of symbols from $R$ in $v$,
as needed for instance in the case of the left-corner
parsing strategy (see (Nijholt, 1980, pp. 22–23) for
discussion).
A probabilistic parsing strategy is defined to
be a function $\mathcal{S}$ that maps a reduced, proper and
consistent PCFG $(\mathcal{G}, p_{\mathcal{G}})$ to a triple
$\mathcal{S}(\mathcal{G}, p_{\mathcal{G}}) = (\mathcal{A}, p_{\mathcal{A}}, f)$, where $(\mathcal{A}, p_{\mathcal{A}})$ is a
reduced, proper and consistent PPDT, with the same
properties as a (non-probabilistic) parsing strategy,
and in addition:
• For each complete derivation $d$ and each complete
computation $c$ such that $f(c) = d$, $p_{\mathcal{G}}(d)$
equals $p_{\mathcal{A}}(c)$.
In other words, a complete computation has the
same probability as the complete derivation that it
is mapped to by function $f$. An implication of
this property is that for each string $w \in \Sigma_1^*$, the
probabilities assigned to that string by $(\mathcal{G}, p_{\mathcal{G}})$ and
$(\mathcal{A}, p_{\mathcal{A}})$ are equal.
We say that probabilistic parsing strategy $\mathcal{S}'$ is an
extension of parsing strategy $\mathcal{S}$ if for each reduced
CFG $\mathcal{G}$ and probability function $p_{\mathcal{G}}$ we have
$\mathcal{S}(\mathcal{G}) = (\mathcal{A}, f)$ if and only if
$\mathcal{S}'(\mathcal{G}, p_{\mathcal{G}}) = (\mathcal{A}, p_{\mathcal{A}}, f)$ for some $p_{\mathcal{A}}$.
4 Correct-Prefix Property
In this section we present a necessary condition for
the probabilistic extension of a parsing strategy. For
a given PDT, we say a computation $c$ is dead if
$(X_{in}, w_1, \varepsilon) \vdash^c (\alpha, \varepsilon, v_1)$, for some $\alpha \in Q^*$,
$w_1 \in \Sigma_1^*$ and $v_1 \in \Sigma_2^*$, and there are no
$w_2 \in \Sigma_1^*$ and $v_2 \in \Sigma_2^*$ such that
$(\alpha, w_2, \varepsilon) \vdash^* (X_{fin}, \varepsilon, v_2)$. Informally,
a dead computation is a computation that
cannot be continued to become a complete computation.
We say that a PDT has the correct-prefix
property (CPP) if it does not allow any dead computations.
We also say that a parsing strategy has
the CPP if it maps each reduced CFG to a PDT that
has the CPP.
Lemma 1 For each reduced CFG $\mathcal{G}$, there is a
probability function $p_{\mathcal{G}}$ such that PCFG $(\mathcal{G}, p_{\mathcal{G}})$ is
proper and consistent, and $p_{\mathcal{G}}(d) > 0$ for all complete
derivations $d$.
Proof. Since $\mathcal{G}$ is reduced, there is a finite set $D$
consisting of complete derivations $d$, such that for
each rule $\pi$ in $\mathcal{G}$ there is at least one $d \in D$ in
which $\pi$ occurs. Let $n_{\pi,d}$ be the number of occurrences
of rule $\pi$ in derivation $d \in D$, and let $n_\pi$ be
$\sum_{d \in D} n_{\pi,d}$, the total number of occurrences of $\pi$ in
$D$. Let $n_A$ be the sum of $n_\pi$ for all rules $\pi$ with $A$ in
the left-hand side. A probability function $p_{\mathcal{G}}$ can be
defined through “maximum-likelihood estimation”
such that $p_{\mathcal{G}}(\pi) = n_\pi / n_A$ for each rule $\pi = A \to \alpha$.
For all nonterminals $A$,
$\sum_{\pi = A \to \alpha} p_{\mathcal{G}}(\pi) = \sum_{\pi = A \to \alpha} n_\pi / n_A = n_A / n_A = 1$,
which means that the
PCFG $(\mathcal{G}, p_{\mathcal{G}})$ is proper. Furthermore, it has been
shown in (Chi and Geman, 1998; Sánchez and
Benedí, 1997) that a PCFG $(\mathcal{G}, p_{\mathcal{G}})$ is consistent if
$p_{\mathcal{G}}$ was obtained by maximum-likelihood estimation
using a set of derivations. Finally, since $n_\pi > 0$ for
each $\pi$, also $p_{\mathcal{G}}(\pi) > 0$ for each $\pi$, and $p_{\mathcal{G}}(d) > 0$
for all complete derivations $d$.
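The estimate used in this proof is plain relative-frequency counting. As an illustration, the sketch below applies it to a set $D$ consisting of the two derivations from the example in the introduction, which together use every rule of that grammar; the function name `estimate` and the data layout are assumptions of the sketch.

```python
from collections import Counter
from fractions import Fraction


def estimate(derivations):
    """Relative-frequency estimate: p(rule) = n_rule / n_lhs."""
    rule_counts = Counter(rule for d in derivations for rule in d)
    lhs_counts = Counter()
    for (lhs, _rhs), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {
        rule: Fraction(n, lhs_counts[rule[0]])
        for rule, n in rule_counts.items()
    }


# A finite set D of complete derivations in which every rule occurs.
D = [
    [("S", "AB"), ("A", "aC"), ("C", "xc"), ("B", "bD"), ("D", "xd")],
    [("S", "AB"), ("A", "aD"), ("D", "xd"), ("B", "bC"), ("C", "xc")],
]

P = estimate(D)
for rule, p in sorted(P.items()):
    print(rule, p)
```

Every rule receives a strictly positive probability because it occurs at least once in $D$, and the probabilities of the rules for each left-hand side sum to 1 by construction.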
We say a computation is a shortest dead compu-
tation if it is dead and none of its proper prefixes is
dead. Note that each dead computation has a unique
prefix that is a shortest dead computation. For a
PDT $\mathcal{A}$, let $T_{\mathcal{A}}$ be the union of the set of all complete
computations and the set of all shortest dead
computations.
Lemma 2 For each proper PPDT $(\mathcal{A}, p_{\mathcal{A}})$,
$\sum_{c \in T_{\mathcal{A}}} p_{\mathcal{A}}(c) \le 1$.
Proof. The proof is a trivial variant of the proof
that for a proper PCFG $(\mathcal{G}, p_{\mathcal{G}})$, the sum of $p_{\mathcal{G}}(d)$ for
all derivations $d$ cannot exceed 1, which is shown by
(Booth and Thompson, 1973).
From this, the main result of this section follows.
Theorem 3 A parsing strategy that lacks the CPP
cannot be extended to become a probabilistic pars-
ing strategy.
Proof. Take a parsing strategy $\mathcal{S}$ that does not have
the CPP. Then there is a reduced CFG $\mathcal{G} = (\Sigma_1, N, S, R)$,
with $\mathcal{S}(\mathcal{G}) = (\mathcal{A}, f)$ for some $\mathcal{A}$ and $f$, and
a shortest dead computation $c$ allowed by $\mathcal{A}$.
It follows from Lemma 1 that there is a probability
function $p_{\mathcal{G}}$ such that $(\mathcal{G}, p_{\mathcal{G}})$ is a proper and
consistent PCFG and $p_{\mathcal{G}}(d) > 0$ for all complete
derivations $d$. Assume we also have a probability
function $p_{\mathcal{A}}$ such that $(\mathcal{A}, p_{\mathcal{A}})$ is a proper and consistent
PPDT and $p_{\mathcal{A}}(c') = p_{\mathcal{G}}(f(c'))$ for each complete
computation $c'$. Since $\mathcal{A}$ is reduced, each transition
$\tau$ must occur in some complete computation
$c'$. Furthermore, for each complete computation $c'$
there is a complete derivation $d$ such that $f(c') = d$,
and $p_{\mathcal{A}}(c') = p_{\mathcal{G}}(d) > 0$. Therefore, $p_{\mathcal{A}}(\tau) > 0$
for each transition $\tau$, and $p_{\mathcal{A}}(c) > 0$, where $c$ is the
above-mentioned dead computation.
Due to Lemma 2,
\[
1 \ge \sum_{c' \in T_{\mathcal{A}}} p_{\mathcal{A}}(c')
  \ge \sum_{w \in \Sigma_1^*} p_{\mathcal{A}}(w) + p_{\mathcal{A}}(c)
  > \sum_{w \in \Sigma_1^*} p_{\mathcal{A}}(w)
  = \sum_{w \in \Sigma_1^*} p_{\mathcal{G}}(w).
\]
This is in contradiction with the consistency
of $(\mathcal{G}, p_{\mathcal{G}})$. Hence, a probability function
$p_{\mathcal{A}}$ with the properties we required above cannot exist,
and therefore $\mathcal{S}$ cannot be extended to become a
probabilistic parsing strategy.
5 Strong Predictiveness
In this section we present our main result, which is a
sufficient condition allowing the probabilistic exten-
sion of a parsing strategy. We start with a technical
result that was proven in (Abney et al., 1999; Chi,
1999; Nederhof and Satta, 2003).
Lemma 4 Given a non-proper PCFG $(\mathcal{G}, p_{\mathcal{G}})$,
$\mathcal{G} = (\Sigma, N, S, R)$, there is a probability function $p'_{\mathcal{G}}$ such
that PCFG $(\mathcal{G}, p'_{\mathcal{G}})$ is proper and, for every complete
derivation $d$, $p'_{\mathcal{G}}(d) = \frac{1}{C} \cdot p_{\mathcal{G}}(d)$, where
$C = \sum_{S \Rightarrow^{d'} w,\ w \in \Sigma^*} p_{\mathcal{G}}(d')$.
Note that if PCFG $(\mathcal{G}, p_{\mathcal{G}})$ in the above lemma is
consistent, then $C = 1$ and $(\mathcal{G}, p'_{\mathcal{G}})$ and $(\mathcal{G}, p_{\mathcal{G}})$ define
the same distribution on derivations. The normalization
procedure underlying Lemma 4 makes
use of quantities $\sum_{A \Rightarrow^d w,\ w \in \Sigma^*} p_{\mathcal{G}}(d)$ for each
$A \in N$. These quantities can be computed to any degree
of precision, as discussed for instance in (Booth and
Thompson, 1973) and (Stolcke, 1995). Thus normalization
of a PCFG can be effectively computed.
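To make the normalization concrete, here is a minimal sketch under stated assumptions. It approximates the per-nonterminal quantities by fixed-point iteration from zero and then rescales each rule $A \to \alpha$ by the quantities of the nonterminals in $\alpha$ divided by the quantity of $A$; this rescaling is the standard renormalization scheme and is assumed here rather than taken from the text above. The toy grammar, the function names and the fixed iteration count are illustrative choices, and the sketch assumes a reduced grammar so that no quantity is zero.

```python
from math import isclose


def inside_product(rhs, z):
    """Product of z[B] over the nonterminal occurrences B in rhs."""
    result = 1.0
    for symbol in rhs:
        if symbol in z:
            result *= z[symbol]
    return result


def partition_values(rules, nonterminals, iterations=200):
    """Approximate, for each A, the total probability of derivations from A."""
    z = {a: 0.0 for a in nonterminals}
    for _ in range(iterations):
        updated = {a: 0.0 for a in nonterminals}
        for (lhs, rhs), p in rules.items():
            updated[lhs] += p * inside_product(rhs, z)
        z = updated
    return z


def normalize(rules, nonterminals):
    """Rescale rule probabilities so that the result is proper."""
    z = partition_values(rules, nonterminals)
    return {
        (lhs, rhs): p * inside_product(rhs, z) / z[lhs]
        for (lhs, rhs), p in rules.items()
    }


# Hypothetical non-proper PCFG: the probabilities for S sum to 0.75.
RULES = {("S", ("a",)): 0.5, ("S", ("a", "S")): 0.25}

Z = partition_values(RULES, {"S"})
NORMALIZED = normalize(RULES, {"S"})
print(Z)
print(NORMALIZED)
```

In this example the quantity for $S$ solves $z = 0.5 + 0.25 z$, and the derivation of the string $a$ is rescaled by exactly $1/C$ with $C$ equal to that quantity, as Lemma 4 requires.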
For a fixed PDT, we define the binary relation $\leadsto$
on stack symbols by: $Y \leadsto Y'$ if and only if
$(Y, w, \varepsilon) \vdash^* (Y', \varepsilon, v)$ for some $w \in \Sigma_1^*$ and
$v \in \Sigma_2^*$. In words, some subcomputation of the
PDT may start with stack $Y$ and end with stack $Y'$.
Note that all stacks that occur in such a subcomputation
must have height of 1 or more. We say that a
(P)PDA or a (P)PDT has the strong predictiveness
property (SPP) if the existence of three transitions
$X \mapsto XY$, $XY_1 \mapsto Z_1$ and $XY_2 \mapsto Z_2$ such that
$Y \leadsto Y_1$ and $Y \leadsto Y_2$ implies $Z_1 = Z_2$. Informally,
this means that when a subcomputation starts
with some stack $\alpha$ and some push transition $\tau$, then
solely on the basis of $\tau$ we can uniquely determine
what stack symbol $Z_1 = Z_2$ will be on top of the
stack in the first reached configuration with stack
height equal to $|\alpha|$. Another way of looking at it is
that no information may flow from higher stack elements
to lower stack elements that was not already
predicted before these higher stack elements came
into being, hence the term “strong predictiveness”.
We say that a parsing strategy has the SPP if it maps
each reduced CFG to a PDT with the SPP.
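For a fixed PDT, the relation on stack symbols and the SPP itself can be checked by a simple closure computation. The sketch below is one possible way to do so and is not taken from the text: transitions are given as hypothetical tuples `(X, Y)` for a push $X \mapsto XY$, `(X, Y)` for a swap from $X$ to $Y$ (input and output are irrelevant to the relation and are dropped), and `(X, Y, Z)` for a pop $XY \mapsto Z$.

```python
def reach_relation(symbols, pushes, pops, swaps):
    """All pairs (Y, Y2) such that a subcomputation leads from stack Y to stack Y2."""
    reach = {(y, y) for y in symbols}  # the empty subcomputation
    changed = True
    while changed:
        new = set()
        for (start, current) in reach:
            # Extend by one swap transition on the top-of-stack symbol.
            for (source, target) in swaps:
                if source == current:
                    new.add((start, target))
            # Extend by a push, a nested subcomputation, and a matching pop.
            for (pushed_on, pushed) in pushes:
                if pushed_on != current:
                    continue
                for (below, top, result) in pops:
                    if below == current and (pushed, top) in reach:
                        new.add((start, result))
        changed = not new <= reach
        reach |= new
    return reach


def has_spp(symbols, pushes, pops, swaps):
    """Check that every push transition determines a unique symbol after the pop."""
    reach = reach_relation(symbols, pushes, pops, swaps)
    for (x, y) in pushes:
        results = {
            z for (below, top, z) in pops
            if below == x and (y, top) in reach
        }
        if len(results) > 1:
            return False
    return True


SYMBOLS = {"X", "Y", "Y1", "Y2", "Z", "Z1", "Z2"}
PUSHES = {("X", "Y")}
SWAPS = {("Y", "Y1"), ("Y", "Y2")}

# The symbol after the pop depends on what happened above X: no SPP.
print(has_spp(SYMBOLS, PUSHES, {("X", "Y1", "Z1"), ("X", "Y2", "Z2")}, SWAPS))
# Both pops yield the same symbol: the SPP holds.
print(has_spp(SYMBOLS, PUSHES, {("X", "Y1", "Z"), ("X", "Y2", "Z")}, SWAPS))
```

The closure terminates because the relation is a subset of the finite set of pairs of stack symbols and only grows between iterations.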
Theorem 5 Any parsing strategy that has the CPP
and the SPP can be extended to become a proba-
bilistic parsing strategy.
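Proof. Consider a parsing strategy $\mathcal{S}$ that has the
CPP and the SPP, and a proper, consistent and reduced
PCFG $(\mathcal{G}, p_{\mathcal{G}})$, $\mathcal{G} = (\Sigma_1, N, S, R)$. Let
$\mathcal{S}(\mathcal{G}) = (\mathcal{A}, f)$, $\mathcal{A} = (\Sigma_1, \Sigma_2, Q, X_{in}, X_{fin}, \Delta)$.
We will show that there is a probability function $p_{\mathcal{A}}$
such that $(\mathcal{A}, p_{\mathcal{A}})$ is a proper and consistent PPDT,
and $p_{\mathcal{A}}(c) = p_{\mathcal{G}}(f(c))$ for all complete computations
$c$.
We first construct a PPDT $(\mathcal{A}, p'_{\mathcal{A}})$ as follows.
For each swap transition $\tau = X \overset{x,y}{\mapsto} Y$ in $\Delta$, let
$p'_{\mathcal{A}}(\tau) = p_{\mathcal{G}}(y)$ in case $y \in R$, and $p'_{\mathcal{A}}(\tau) = 1$
otherwise. For all remaining transitions $\tau \in \Delta$, let
$p'_{\mathcal{A}}(\tau) = 1$. Note that $(\mathcal{A}, p'_{\mathcal{A}})$ may be non-proper.
Still, from the definition of $f$ it follows that, for each
complete computation $c$, we have
\[ p'_{\mathcal{A}}(c) = p_{\mathcal{G}}(f(c)), \tag{1} \]
and so our PPDT is consistent.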
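We now map $(\mathcal{A}, p'_{\mathcal{A}})$ to a language-equivalent
PCFG $(\mathcal{G}', p_{\mathcal{G}'})$, $\mathcal{G}' = (\Sigma_1, Q, X_{in}, R')$, where $R'$
contains the following rules with the specified associated
probabilities:
• $X \to YZ$ with $p_{\mathcal{G}'}(X \to YZ) = p'_{\mathcal{A}}(X \mapsto XY)$,
for each $X \mapsto XY \in \Delta$ with $Z$ the
unique stack symbol such that there is at least
one transition $XY' \mapsto Z$ with $Y \leadsto Y'$;
• $X \to xY$ with $p_{\mathcal{G}'}(X \to xY) = p'_{\mathcal{A}}(X \overset{x}{\mapsto} Y)$,
for each transition $X \overset{x}{\mapsto} Y \in \Delta$;
• $Y \to \varepsilon$ with $p_{\mathcal{G}'}(Y \to \varepsilon) = 1$, for each stack
symbol $Y$ such that there is at least one transition
$XY \mapsto Z \in \Delta$ or such that $Y = X_{fin}$.
It is not difficult to see that there exists a bijection
$f'$ from complete computations of $\mathcal{A}$ to complete
derivations of $\mathcal{G}'$, and that we have
\[ p_{\mathcal{G}'}(f'(c)) = p'_{\mathcal{A}}(c), \tag{2} \]
for each complete computation $c$. Thus $(\mathcal{G}', p_{\mathcal{G}'})$
is consistent. However, note that $(\mathcal{G}', p_{\mathcal{G}'})$ is not
proper.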
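By Lemma 4, we can construct a new PCFG
$(\mathcal{G}', p'_{\mathcal{G}'})$ that is proper and consistent, and such that
$p_{\mathcal{G}'}(d) = p'_{\mathcal{G}'}(d)$, for each complete derivation $d$ of
$\mathcal{G}'$. Thus, for each complete computation $c$ of $\mathcal{A}$, we
have
\[ p'_{\mathcal{G}'}(f'(c)) = p_{\mathcal{G}'}(f'(c)). \tag{3} \]
We now transfer back the probabilities of rules of
$(\mathcal{G}', p'_{\mathcal{G}'})$ to the transitions of $\mathcal{A}$. Formally, we define
a new probability function $p_{\mathcal{A}}$ such that, for each
$\tau \in \Delta$, $p_{\mathcal{A}}(\tau) = p'_{\mathcal{G}'}(\pi)$, where $\pi$ is the rule in $R'$
that has been constructed from $\tau$ as specified above.
It is easy to see that PPDT $(\mathcal{A}, p_{\mathcal{A}})$ is now proper.
Furthermore, for each complete computation $c$ of $\mathcal{A}$
we have
\[ p_{\mathcal{A}}(c) = p'_{\mathcal{G}'}(f'(c)), \tag{4} \]
and so $(\mathcal{A}, p_{\mathcal{A}})$ is also consistent. By combining
equations (1) to (4) we conclude that, for each complete
computation $c$ of $\mathcal{A}$,
$p_{\mathcal{A}}(c) = p'_{\mathcal{G}'}(f'(c)) = p_{\mathcal{G}'}(f'(c)) = p'_{\mathcal{A}}(c) = p_{\mathcal{G}}(f(c))$.
Thus our parsing
strategy $\mathcal{S}$ can be probabilistically extended.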
Note that the construction in the proof above can
be effectively computed (see the discussion following
Lemma 4 for effective computation of normalized PCFGs).
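The middle step of the proof, the mapping from the transitions of the PDT to the rules of the intermediate grammar, can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the relation between stack symbols used in the definition of the SPP is supplied as a precomputed set of pairs `reach`, swap transitions are keyed by `(X, x, Y)` with the output component already dropped (so two swaps that differ only in their output are not distinguished), and all names are hypothetical.

```python
def pdt_to_cfg_rules(pushes, pops, swaps, final, reach):
    """Build the rules of the intermediate PCFG with their probabilities.

    pushes: {(X, Y): prob} for push transitions X -> X Y
    pops:   iterable of (X, Y, Z) for pop transitions X Y -> Z
    swaps:  {(X, x, Y): prob} for swap transitions from X to Y reading x
            (x == "" stands for the empty string)
    final:  the final stack symbol
    reach:  set of pairs (Y, Y2) in the relation used to define the SPP
    """
    rules = {}
    for (x, y), p in pushes.items():
        results = {
            z for (below, top, z) in pops
            if below == x and (y, top) in reach
        }
        if len(results) != 1:
            raise ValueError("push transition without a unique pop result")
        (z,) = results
        rules[(x, (y, z))] = p                      # X -> Y Z
    for (x, symbol, y), p in swaps.items():
        rhs = (symbol, y) if symbol else (y,)
        rules[(x, rhs)] = p                         # X -> x Y
    for (_below, top, _result) in pops:
        rules[(top, ())] = 1.0                      # Y -> epsilon
    rules[(final, ())] = 1.0                        # Xfin -> epsilon
    return rules


# Hypothetical PDT accepting the single string "a".
PUSHES = {("Xin", "Y"): 1.0}
SWAPS = {("Y", "a", "Y1"): 0.5}
POPS = [("Xin", "Y1", "Xfin")]
REACH = {("Y", "Y"), ("Y", "Y1")}

RULES = pdt_to_cfg_rules(PUSHES, POPS, SWAPS, "Xfin", REACH)
for rule, p in RULES.items():
    print(rule, p)
```

In the toy example the resulting grammar derives $a$ from $X_{in}$ in four steps, one per rule, mirroring the single complete computation of the PDT; the swap probability of 0.5 makes the grammar non-proper, which is the situation that the subsequent normalization step repairs.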
The definition of $p'_{\mathcal{A}}$ in the proof of Theorem 5
relies on the strings output by $\mathcal{A}$. This is the main
reason why we needed to consider PDTs rather
than PDAs. Now assume an appropriate probability
function $p_{\mathcal{A}}$ has been computed, such that the
source PCFG and $(\mathcal{A}, p_{\mathcal{A}})$ define equivalent distributions
on derivations/computations. Then the
probabilities assigned to strings over the input al-
phabet are also equal. We may subsequently ignore
the output strings if the application at hand merely
requires probabilistic recognition rather than proba-
bilistic transduction, or in other words, we may sim-
plify PDTs to PDAs.
The proof of Theorem 5 also leads to the obser-
vation that parsing strategies with the CPP and the
SPP as well as their probabilistic extensions can be
described as grammar transformations, as follows.
A given (P)CFG is mapped to an equivalent (P)PDT
by a (probabilistic) parsing strategy. By ignoring
the output components of swap transitions we ob-
tain a (P)PDA, which can be mapped to an equiva-
lent (P)CFG as shown above. This observation gives
rise to an extension with probabilities of the work on
covers by (Nijholt, 1980; Leermakers, 1989).
6 Applications
Many well-known parsing strategies with the CPP
also have the SPP. This is for instance the case
for top-down parsing and left-corner parsing. As
discussed in the introduction, it has already been
shown that for any PCFG G, there are equiva-
lent PPDTs implementing these strategies, as re-
ported in (Abney et al., 1999) and (Tendeau, 1995),
respectively. Those results more simply follow
now from our general characterization. Further-
more, PLR parsing (Soisalon-Soininen and Ukko-
nen, 1979; Nederhof, 1994) can be expressed in our
framework as a parsing strategy with the CPP and
the SPP, and thus we obtain as a new result that this
strategy allows probabilistic extension.
The above strategies are in contrast to the LR
parsing strategy, which has the CPP but lacks the
SPP, and therefore falls outside our sufficient condi-
tion. As we have already seen in the introduction, it
turns out that LR parsing cannot be extended to be-
come a probabilistic parsing strategy. Related to LR
parsing is ELR parsing (Purdom and Brown, 1981;
Nederhof, 1994), which also lacks the SPP. By an
argument similar to the one provided for LR, we can
show that also ELR parsing cannot be extended to
become a probabilistic parsing strategy. (See (Ten-
deau, 1997) for earlier observations related to this.)
These two cases might suggest that the sufficient
condition in Theorem 5 is tight in practice.
Decidability of the CPP and the SPP obviously
depends on how a parsing strategy is specified. As
far as we know, in all practical cases of parsing
strategies these properties can be easily decided.
Also, observe that our results do not depend on the
general behaviour of a parsing strategy $\mathcal{S}$, but just
on its “point-wise” behaviour on each input CFG.
Specifically, if $\mathcal{S}$ does not have the CPP and the
SPP, but for some fixed CFG $\mathcal{G}$ of interest we obtain
a PDT $\mathcal{A}$ that has the CPP and the SPP, then
we can still apply the construction in Theorem 5.
In this way, any probability function $p_{\mathcal{G}}$ associated
with $\mathcal{G}$ can be converted into a probability function
$p_{\mathcal{A}}$, such that the resulting PCFG and PPDT induce
equivalent distributions. We point out that the CPP
and the SPP for a fixed PDT can
be efficiently decided using dynamic programming.
One more consequence of our results is this. As
discussed in the introduction, the properness condi-
tion reduces the number of parameters of a PPDT.
However, our results show that if the PPDT has the
CPP and the SPP then the properness assumption is
not restrictive, i.e., by lifting properness we do not
gain new distributions with respect to those induced
by the underlying PCFG.
7 Conclusions
We have formalized the notion of CFG parsing strat-
egy as a mapping from CFGs to PDTs, and have in-
vestigated the extension to probabilities. We have
shown that the question of which parsing strategies
can be extended to become probabilistic heavily re-
lies on two properties, the correct-prefix property
and the strong predictiveness property. As far as we
know, this is the first general characterization that
has been provided in the literature for probabilistic
extension of CFG parsing strategies. We have also
shown that there is at least one strategy of practical
interest with the CPP but without the SPP, namely
LR parsing, that cannot be extended to become a
probabilistic parsing strategy.
Acknowledgements
The first author is supported by the PIO-
NIER Project Algorithms for Linguistic Process-
ing, funded by NWO (Dutch Organization for
Scientific Research). The second author is par-
tially supported by MIUR under project PRIN No.
2003091149 005.
References
S. Abney, D. McAllester, and F. Pereira. 1999. Re-
lating probabilistic grammars and automata. In
37th Annual Meeting of the Association for Com-
putational Linguistics, Proceedings of the Con-
ference, pages 542–549, Maryland, USA, June.
A.V. Aho and J.D. Ullman. 1972. Parsing, vol-
ume 1 of The Theory of Parsing, Translation and
Compiling. Prentice-Hall.
S. Billot and B. Lang. 1989. The structure of
shared forests in ambiguous parsing. In 27th
Annual Meeting of the Association for Com-
putational Linguistics, Proceedings of the Con-
ference, pages 143–151, Vancouver, British
Columbia, Canada, June.
T.L. Booth and R.A. Thompson. 1973. Apply-
ing probabilistic measures to abstract languages.
IEEE Transactions on Computers, C-22(5):442–
450, May.
T. Briscoe and J. Carroll. 1993. Generalized prob-
abilistic LR parsing of natural language (cor-
pora) with unification-based grammars. Compu-
tational Linguistics, 19(1):25–59.
E. Charniak and G. Carroll. 1994. Context-
sensitive statistics for improved grammatical lan-
guage models. In Proceedings Twelfth National
Conference on Artificial Intelligence, volume 1,
pages 728–733, Seattle, Washington.
Z. Chi and S. Geman. 1998. Estimation of prob-
abilistic context-free grammars. Computational
Linguistics, 24(2):299–305.
Z. Chi. 1999. Statistical properties of probabilistic
context-free grammars. Computational Linguis-
tics, 25(1):131–160.
M.V. Chitrao and R. Grishman. 1990. Statistical
parsing of messages. In Speech and Natural Lan-
guage, Proceedings, pages 263–266, Hidden Val-
ley, Pennsylvania, June.
M.A. Harrison. 1978. Introduction to Formal Lan-
guage Theory. Addison-Wesley.
K. Inui, V. Sornlertlamvanich, H. Tanaka, and
T. Tokunaga. 2000. Probabilistic GLR parsing.
In H. Bunt and A. Nijholt, editors, Advances
in Probabilistic and other Parsing Technologies,
chapter 5, pages 85–104. Kluwer Academic Pub-
lishers.
B. Lang. 1974. Deterministic techniques for ef-
ficient non-deterministic parsers. In Automata,
Languages and Programming, 2nd Colloquium,
volume 14 of Lecture Notes in Computer Science,
pages 255–269, Saarbrücken. Springer-Verlag.
R. Leermakers. 1989. How to cover a grammar.
In 27th Annual Meeting of the Association for
Computational Linguistics, Proceedings of the
Conference, pages 135–142, Vancouver, British
Columbia, Canada, June.
C.D. Manning and B. Carpenter. 2000. Proba-
bilistic parsing using left corner language mod-
els. In H. Bunt and A. Nijholt, editors, Ad-
vances in Probabilistic and other Parsing Tech-
nologies, chapter 6, pages 105–124. Kluwer Aca-
demic Publishers.
M.-J. Nederhof and G. Satta. 2003. Probabilis-
tic parsing as intersection. In 8th International
Workshop on Parsing Technologies, pages 137–
148, LORIA, Nancy, France, April.
M.-J. Nederhof. 1994. An optimal tabular parsing
algorithm. In 32nd Annual Meeting of the Associ-
ation for Computational Linguistics, Proceedings
of the Conference, pages 117–124, Las Cruces,
New Mexico, USA, June.
A. Nijholt. 1980. Context-Free Grammars: Cov-
ers, Normal Forms, and Parsing, volume 93 of
Lecture Notes in Computer Science. Springer-
Verlag.
P.W. Purdom, Jr. and C.A. Brown. 1981. Pars-
ing extended LR(k) grammars. Acta Informatica,
15:115–127.
B. Roark and M. Johnson. 1999. Efficient proba-
bilistic top-down and left-corner parsing. In 37th
Annual Meeting of the Association for Compu-
tational Linguistics, Proceedings of the Confer-
ence, pages 421–428, Maryland, USA, June.
D.J. Rosenkrantz and P.M. Lewis II. 1970. Deter-
ministic left corner parsing. In IEEE Conference
Record of the 11th Annual Symposium on Switch-
ing and Automata Theory, pages 139–152.
J.-A. Sánchez and J.-M. Benedí. 1997. Consistency
of stochastic context-free grammars from
probabilistic estimation based on growth trans-
formations. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 19(9):1052–1055,
September.
E.S. Santos. 1972. Probabilistic grammars and au-
tomata. Information and Control, 21:27–47.
S. Sippu and E. Soisalon-Soininen. 1990. Parsing
Theory, Vol. II: LR(k) and LL(k) Parsing, vol-
ume 20 of EATCS Monographs on Theoretical
Computer Science. Springer-Verlag.
E. Soisalon-Soininen and E. Ukkonen. 1979. A
method for transforming grammars into LL(k)
form. Acta Informatica, 12:339–369.
V. Sornlertlamvanich, K. Inui, H. Tanaka, T. Toku-
naga, and T. Takezawa. 1999. Empirical sup-
port for new probabilistic generalized LR pars-
ing. Journal of Natural Language Processing,
6(3):3–22.
A. Stolcke. 1995. An efficient probabilistic
context-free parsing algorithm that computes
prefix probabilities. Computational Linguistics,
21(2):167–201.
F. Tendeau. 1995. Stochastic parse-tree recognition
by a pushdown automaton. In Fourth Interna-
tional Workshop on Parsing Technologies, pages
234–249, Prague and Karlovy Vary, Czech Re-
public, September.
F. Tendeau. 1997. Analyse syntaxique et
sémantique avec évaluation d'attributs dans
un demi-anneau. Ph.D. thesis, University of
Orléans.
J.H. Wright and E.N. Wrigley. 1991. GLR pars-
ing with probability. In M. Tomita, editor, Gen-
eralized LR Parsing, chapter 8, pages 113–128.
Kluwer Academic Publishers.
