Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 343–350,
New York, June 2006. c©2006 Association for Computational Linguistics
Estimation of Consistent
Probabilistic Context-free Grammars
Mark-Jan Nederhof
Max Planck Institute
for Psycholinguistics
P.O. Box 310
NL-6500 AH Nijmegen
The Netherlands
MarkJan.Nederhof@mpi.nl
Giorgio Satta
Dept. of Information Engineering
University of Padua
via Gradenigo, 6/A
I-35131 Padova
Italy
satta@dei.unipd.it
Abstract
We consider several empirical estimators
for probabilistic context-free grammars,
and show that the estimated grammars
have the so-called consistency property,
under the most general conditions. Our
estimators include the widely applied ex-
pectation maximization method, used to
estimate probabilistic context-free gram-
mars on the basis of unannotated corpora.
This solves a problem left open in the lit-
erature, since for this method the consis-
tency property has been shown only under
restrictive assumptions on the rules of the
source grammar.
1 Introduction
Probabilistic context-free grammars are one of the
most widely used formalisms in current work in sta-
tistical natural language parsing and stochastic lan-
guage modeling. An important property for a proba-
bilistic context-free grammar is that it be consistent,
that is, the grammar should assign probability of one
to the set of all finite strings or parse trees that it
generates. In other words, the grammar should not
lose probability mass with strings or trees of infinite
length.
Several methods for the empirical estimation of
probabilistic context-free grammars have been pro-
posed in the literature, based on the optimization of
some function on the probabilities of the observed
data, such as the maximization of the likelihood of
a tree bank or a corpus of unannotated sentences. It
has been conjectured in (Wetherell, 1980) that these
methods always provide probabilistic context-free
grammars with the consistency property. A first re-
sult in this direction was presented in (Chaudhuri et
al., 1983), by showing that a probabilistic context-
free grammar estimated by maximizing the likeli-
hood of a sample of parse trees is always consistent.
In later work by (S´anchez and Bened´ı, 1997)
and (Chi and Geman, 1998), the result was in-
dependently extended to expectation maximization,
which is an unsupervised method exploited to es-
timate probabilistic context-free grammars by find-
ing local maxima of the likelihood of a sample of
unannotated sentences. The proof in (S´anchez and
Bened´ı, 1997) makes use of spectral analysis of ex-
pectation matrices, while the proof in (Chi and Ge-
man, 1998)isbasedonasimplercountingargument.
Both these proofs assume restrictions on the un-
derlying context-free grammars. More specifically,
in (Chi and Geman, 1998) empty rules and unary
rules are not allowed, thus excluding infinite ambi-
guity, that is, the possibility that some string in the
input sample has an infinite number of derivations in
thegrammar. Thetreatmentofgeneralformcontext-
free grammars has been an open problem so far.
In this paper we consider several estimation meth-
ods for probabilistic context-free grammars, and we
show that the resulting grammars have the consis-
tency property. Our proofs are applicable under
the most general conditions, and our results also
include the expectation maximization method, thus
solving the open problem discussed above. We use
an alternative proof technique with respect to pre-
343
vious work, based on an already known renormal-
ization construction for probabilistic context-free
grammars, which has been used in the context of
language modeling.
The structure of this paper is as follows. We pro-
vide some preliminary definitions in Section 2, fol-
lowed in Section 3 by a brief overview of the esti-
mation methods we investigate in this paper. In Sec-
tion 4 we prove some properties of a renormaliza-
tion technique for probabilistic context-free gram-
mars, and use this property to show our main results
in Section 5. Section 6 closes with some concluding
remarks.
2 Preliminaries
In this paper we use mostly standard notation, as for
instance in (Hopcroft and Ullman, 1979) and (Booth
and Thompson, 1973), which we summarize below.
A context-free grammar (CFG) is a 4-tupleG =
(N,Σ,S,R) where N and Σ are finite disjoint sets
of nonterminal and terminal symbols, respectively,
S ∈ N is the start symbol and R is a finite set of
rules. Each rule has the form A → α, where A ∈ N
and α ∈ (Σ ∪N)∗. We write V for set Σ ∪N.
Each CFG G is associated with a left-most de-
rive relation ⇒, defined on triples consisting of two
strings γ,δ ∈ V∗ and a rule pi ∈ R. We write γ pi⇒ δ
if and only if γ = uAγprime and δ = uαγprime, for some
u ∈ Σ∗, γprime ∈ V∗, and pi = (A → α). A left-
most derivation for G is a string d = pi1···pim,
m ≥ 0, such that γ0 pi1⇒ γ1 pi2⇒ ··· pim⇒ γm, for
some γ0,...,γm ∈ V∗; d = ε (where ε denotes
the empty string) is also a left-most derivation. In
the remainder of this paper, we will let the term
derivation always refer to left-most derivation. If
γ0 pi1⇒ ··· pim⇒ γm for some γ0,...,γm ∈ V∗, then
we say that d = pi1···pim derives γm from γ0 and
we writeγ0 d⇒ γm; d = εderives anyγ0 ∈ V∗ from
itself.
A (left-most) derivation d such that S d⇒ w,
w ∈ Σ∗, is called a complete derivation. If d is
a complete derivation, we write y(d) to denote the
(unique) string w ∈ Σ∗ such that S d⇒ w. We
define D(G) to be the set of all complete deriva-
tions for G. The language generated by G is the set
of all strings derived by complete derivations, i.e.,
L(G) = {y(d)|d ∈ D(G)}. It is well-known that
there is a one-to-one correspondence between com-
plete derivations and parse trees for strings in L(G).
For X ∈ V and α ∈ V∗, we write f(X,α) to
denote the number of occurrences of X in α. For
(A → α) ∈ R and a derivation d, f(A → α,d)
denotes the number of occurrences of A → α in d.
We let f(A,d) =summationtextαf(A → α,d).
A probabilistic CFG (PCFG) is a pair G =
(G,pG), where G is a CFG and pG is a function
from R to real numbers in the interval [0,1]. We
say that G is proper if, for every A ∈ N, we have
summationdisplay
A→α
pG(A → α) = 1. (1)
Function pG can be used to assign probabilities to
derivations of the underlying CFG G, in the follow-
ing way. Ford = pi1···pim ∈ R∗,m ≥ 0, we define
pG(d) =
mproductdisplay
i=1
pG(pii). (2)
Note that pG(ε) = 1. The probability of a string
w ∈ Σ∗ is defined as
pG(w) = summationdisplay
y(d)=w
pG(d). (3)
A PCFG is consistent if
summationdisplay
w
pG(w) = 1. (4)
Consistency implies that the PCFG defines a proba-
bility distribution over both sets D(G) and L(G).
If a PCFG is proper, then consistency means that
no probability mass is lost in derivations of infinite
length. All PCFGs in this paper are implicitly as-
sumed to be proper, unless otherwise stated.
3 Estimation of PCFGs
In this section we give a brief overview of some esti-
mation methods for PCFGs. These methods will be
later investigated to show that they always provide
consistent PCFGs.
In natural language processing applications, esti-
mation of a PCFG is usually carried out on the ba-
sis of a tree bank, which in this paper we assume to
be a sample, that is, a finite multiset, of complete
derivations. Let D be such a sample, and let D be
344
the underlying set of derivations. For d ∈ D, we
let f(d,D) be the multiplicity of d in D, that is, the
number of occurrences of d in D. We define
f(A → α,D) =summationdisplay
d∈D
f(d,D)·f(A → α,d), (5)
and let f(A,D) =summationtextαf(A → α,D).
Consider a CFG G = (N,Σ,R,S) defined by
all and only the nonterminals, terminals and rules
observed in D. The criterion of maximum likeli-
hood estimation (MLE) prescribes the construction
of a PCFGG = (G,pG) such that pG maximizes the
likelihood of D, defined as
pG(D) = productdisplay
d∈D
pG(d)f(d,D), (6)
subject to the properness conditions summationtextαpG(A →
α) = 1 for eachA ∈ N. The maximization problem
above has a unique solution, provided by the estima-
tor (see for instance (Chi and Geman, 1998))
pG(A → α) = f(A → α,D)f(A,D) . (7)
We refer to this as the supervised MLE method.
In applications in which a tree bank is not avail-
able, one might still use the MLE criterion to train
a PCFG in an unsupervised way, on the basis of a
sample of unannotated sentences, also called a cor-
pus. Let us call C such a sample and C the underly-
ing set of sentences. For w ∈ C, we let f(w,C) be
the multiplicity of w in C.
Assume a CFG G = (N,Σ,R,S) that is able
to generate all of the sentences in C, and possibly
more. The MLE criterion prescribes the construc-
tion of a PCFG G = (G,pG) such that pG maxi-
mizes the likelihood of C, defined as
pG(C) = productdisplay
w∈C
pG(w)f(w,C), (8)
subject to the properness conditions as in the super-
vised case above. The above maximization prob-
lem provides a system of |R| nonlinear equations
(see (Chi and Geman, 1998))
pG(A → α) =summationtext
w∈C f(w,C)·EpG(d|w) f(A → α,d)summationtext
w∈C f(w,C)·EpG(d|w) f(A,d)
, (9)
where Ep denotes an expectation computed under
distribution p, and pG(d|w) is the probability of
derivation d conditioned by sentence w (so that
pG(d|w) > 0 only if y(d) = w). The solution to
the above system is not unique, because of the non-
linearity. Furthermore, each solution of (9) identi-
fies a point where the curve in (8) has partial deriva-
tives of zero, but this does not necessarily corre-
spond to a local maximum, let alone an absolute
maximum. (A point with partial derivatives of zero
that is not a local maximum could be a local min-
imum or even a so-called saddle point.) In prac-
tice, this system is typically solved by means of an
iterative algorithm called inside/outside (Charniak,
1993), whichimplementstheexpectationmaximiza-
tion (EM) method (Dempster et al., 1977). Starting
with an initial function pG that probabilistically ex-
tends G, a so-called growth transformation is com-
puted, defined as
pG(A → α) =
summationtext
w∈C f(w,C)·
summationtext
y(d)=w
pG(d)
pG(w)·f(A → α,d)summationtext
w∈C f(w,C)·
summationtext
y(d)=w
pG(d)
pG(w)·f(A,d)
.(10)
Following (Baum, 1972), one can show that
pG(C) ≥ pG(C). Thus, byiteratingthegrowthtrans-
formation above, we are guaranteed to reach a local
maximum for (8), or possibly a saddle point. We
refer to this as the unsupervised MLE method.
We now discuss a third estimation method for
PCFGs, which was proposed in (Corazza and Satta,
2006). This method can be viewed as a general-
ization of the supervised MLE method to probabil-
ity distributions defined over infinite sets of com-
plete derivations. Let D be an infinite set of com-
plete derivations using nonterminal symbols in N,
start symbol S ∈ N and terminal symbols in Σ.
We assume that the set of rules that are observed
in D is drawn from some finite set R. Let pD be
a probability distribution defined over D, that is,
a function from set D to interval [0,1] such thatsummationtext
d∈D pD(d) = 1.
Consider the CFG G = (N,Σ,R,S). Note that
D ⊆ D(G). We wish to extend G to some PCFG
G = (G,pG) in such a way that pD is approxi-
mated bypG (viewed as a distribution over complete
derivations) as well as possible according to some
criterion. One possible criterion is minimization of
345
the cross-entropy between pD and pG, defined as
the expectation, under distribution pD, of the infor-
mation of the derivations in D computed under dis-
tribution pG, that is
H(pD ||pG) = EpD log 1p
G(d)
= −summationdisplay
d∈D
pD(d)·logpG(d). (11)
We thus want to assign to the parameters pG(A →
α), A → α ∈ R, the values that minimize (11), sub-
ject to the conditionssummationtextαpG(A → α) = 1 for each
A ∈ N. Note that minimization of the cross-entropy
above is equivalent to minimization of the Kullback-
Leibler distance between pD and pG. Also note that
the likelihood of an infinite set of derivations would
always be zero and therefore cannot be considered
here.
The solution to the above minimization problem
provides the estimator
pG(A → α) = EpD f(A → α,d)E
pD f(A,d)
. (12)
A proof of this result appears in (Corazza and Satta,
2006), and is briefly summarized in Appendix A,
in order to make this paper self-contained. We call
the above estimator the cross-entropy minimization
method.
The cross-entropy minimization method can be
viewed as a generalization of the supervised MLE
method in (7), as shown in what follows. Let D and
D bedefinedasforthesupervisedMLEmethod. We
define a distribution over D as
pD(d) = f(d,D)|D| . (13)
Distribution pD is usually called the empirical dis-
tributionassociated withD. Applying the estimator
in (12) to pD, we obtain
pG(A → α) =
=
summationtext
d∈D pD(d)·f(A → α,d)summationtext
d∈D pD(d)·f(A,d)
=
summationtext
d∈D
f(d,D)
|D| ·f(A → α,d)summationtext
d∈D
f(d,D)
|D| ·f(A,d)
=
summationtext
d∈D f(d,D)·f(A → α,d)summationtext
d∈D f(d,D)·f(A,d)
. (14)
This is the supervised MLE estimator in (7). This re-
minds us of the well-known fact that maximizing the
likelihood of a (finite) sample through a PCFG dis-
tribution amounts to minimizing the cross-entropy
between the empirical distribution of the sample and
the PCFG distribution itself.
4 Renormalization
In this section we recall a renormalization technique
for PCFGs that was used before in (Abney et al.,
1999), (Chi, 1999) and (Nederhof and Satta, 2003)
for different purposes, and is exploited in the next
section to prove our main results. In the remainder
of this section, we assume a fixed, not necessarily
proper PCFGG = (G,pG), with G = (N,Σ,S,R).
We define the renormalization of G as the PCFG
R(G) = (G,pR) with pR specified by
pR(A → α) =
pG(A → α)·
summationtext
d,w pG(α
d⇒ w)
summationtext
d,w pG(A
d⇒ w). (15)
It is not difficult to see that R(G) is a proper PCFG.
We now show an important property of R(G), dis-
cussed before in (Nederhof and Satta, 2003) in the
context of so-called weighted context-free gram-
mars.
Lemma 1 For each derivation d with A d⇒ w, A ∈
N and w ∈ Σ∗, we have
pR(A d⇒ w) = pG(A
d⇒ w)
summationtext
dprime,wprime pG(A
dprime⇒ wprime). (16)
Proof. The proof is by induction on the length ofd,
written |d|. If |d| = 1 we must have d = (A → w),
and thus pR(d) = pR(A → w). In this case, the
statement of the lemma directly follows from (15).
Assume now |d| > 1 and let pi = (A → α)
be the first rule used in d. Note that there must
be at least one nonterminal symbol in α. We can
then write α as u0A1u1A2···uq−1Aquq, for q ≥ 1,
Ai ∈ N, 1 ≤ i ≤ q, and uj ∈ Σ∗, 0 ≤
j ≤ q. In words, A1,...,Aq are all of the occur-
rences of nonterminals in α, as they appear from
left to right. Consequently, we can write d in the
form d = pi · d1···dq for some derivations di,
1 ≤ i ≤ q, with Ai di⇒ wi, |di| ≥ 1 and with
346
w = u0w1u1w2···uq−1wquq. Below we use the
fact that pR(uj ε⇒ uj) = pG(uj ε⇒ uj) = 1 for
each j with 0 ≤ j ≤ q, and further using the def-
inition of pR and the inductive hypothesis, we can
write
pR(A d⇒ w) =
= pR(A → α)·
qproductdisplay
i=1
pR(Ai di⇒ wi)
= pG(A → α)·
summationtext
dprime,wprime pG(α
dprime⇒ wprime)
summationtext
dprime,wprime pG(A
dprime⇒ wprime) ·
·
qproductdisplay
i=1
pR(Ai di⇒ wi)
= pG(A → α)·
summationtext
dprime,wprime pG(α
dprime⇒ wprime)
summationtext
dprime,wprime pG(A
dprime⇒ wprime) ·
·
qproductdisplay
i=1
pG(Ai di⇒ wi)
summationtext
dprime,wprime pG(Ai
dprime⇒ wprime)
= pG(A → α)·
summationtext
dprime,wprime pG(α
dprime⇒ wprime)
summationtext
dprime,wprime pG(A
dprime⇒ wprime) ·
·
producttextq
i=1 pG(Ai
di⇒ w
i)
producttextq
i=1
summationtext
dprime,wprime pG(Ai
dprime⇒ wprime)
= pG(A → α)·
summationtext
dprime,wprime pG(α
dprime⇒ wprime)
summationtext
dprime,wprime pG(A
dprime⇒ wprime) ·
·
producttextq
i=1 pG(Ai
di⇒ w
i)
summationtext
dprime,wprime pG(α
dprime⇒ wprime)
= pG(A → α)·
producttextq
i=1 pG(Ai
di⇒ w
i)
summationtext
dprime,wprime pG(A
dprime⇒ wprime) ·
= pG(A
d⇒ w)
summationtext
dprime,wprime pG(A
dprime⇒ wprime). (17)
As an easy corollary of Lemma 1, we have that
R(G) is a consistent PCFG, as we can write
summationdisplay
d,w
pR(S d⇒ w) =
= summationdisplay
d,w
pG(S d⇒ w)
summationtext
dprime,wprime pG(S
dprime⇒ wprime)
=
summationtext
d,w pG(S
d⇒ w)
summationtext
dprime,wprime pG(S
dprime⇒ wprime) = 1. (18)
5 Consistency
In this section we prove the main results of this
paper, namely that all of the estimation methods
discussed in Section 3 always provide consistent
PCFGs. We start with a technical lemma, central
to our results, showing that a PCFG that minimizes
the cross-entropy with a distribution over any set of
derivations must be consistent.
Lemma 2 Let G = (G,pG) be a proper PCFG
and let pD be a probability distribution defined over
some set D ⊆ D(G). If G minimizes function
H(pD ||pG), then G is consistent.
Proof. LetG = (N,Σ,S,R), and assume thatG is
not consistent. We establish a contradiction. SinceG
is not consistent, we must havesummationtextd,w pG(S d⇒ w) <
1. Let then R(G) = (G,pR) be the renormalization
of G, defined as in (15). For any derivation S d⇒ w,
w ∈ Σ∗, with d in D, we can use Lemma 1 and
write
pR(S d⇒ w) =
= 1summationtext
dprime,wprime pG(S
dprime⇒ wprime) ·pG(S
d⇒ w)
> pG(S d⇒ w). (19)
In words, every complete derivation d in D has a
probability in R(G) that is strictly greater than in
G. But this means H(pD ||pR) < H(pD ||pG),
against our hypothesis. Therefore, G is consistent
and pG is a probability distribution over set D(G).
Thus function H(pD ||pG) can be interpreted as the
cross-entropy. (Observe that in the statement of the
lemma we have avoided the term ‘cross-entropy’,
sincecross-entropiesareonlydefinedforprobability
distributions.)
Lemma 2 directly implies that the cross-entropy
minimization method in (12) always provides a con-
sistent PCFG, since it minimizes cross-entropy for a
distribution defined over a subset of D(G). We have
already seen in Section 3 that the supervised MLE
method is a special case of the cross-entropy min-
imization method. Thus we can also conclude that
a PCFG trained with the supervised MLE method is
347
alwaysconsistent. Thisprovidesanalternativeproof
of a property that was first shown in (Chaudhuri et
al., 1983), as discussed in Section 1.
We now prove the same result for the unsuper-
vised MLE method, without any restrictive assump-
tion on the rules of our CFGs. This solves a problem
that was left open in the literature (Chi and Geman,
1998); see again Section 1 for discussion. Let C and
C be defined as in Section 3. We define the empiri-
cal distribution of C as
pC(w) = f(w,C)|C| . (20)
Let G = (N,Σ,S,R) be a CFG such that C ⊆
L(G). Let D(C) be the set of all complete deriva-
tions for G that generate sentences in C, that is,
D(C) = {d|d ∈ D(G), y(d) ∈ C}.
Further, assume some probabilistic extensionG =
(G,pG) of G, such that pG(d) > 0 for every d ∈
D(C). We define a distribution over D(C) by
pD(C)(d) = pC(y(d))· pG(d)p
G(y(d))
. (21)
It is not difficult to verify that
summationdisplay
d∈D(C)
pD(C)(d) = 1. (22)
We now apply to G the estimator in (12), in order
to obtain a new PCFG ˆG = (G, ˆpG) that minimizes
the cross-entropy betweenpD(C) and ˆpG. According
to Lemma 2, we have that ˆG is a consistent PCFG.
Distribution ˆpG is specified by
ˆpG(A → α) =
=
summationtext
d∈D(C) pD(C)(d)·f(A → α,d)summationtext
d∈D(C) pD(C)(d)·f(A,d)
=
summationtext
d∈D(C)
f(y(d),C)
|C| ·
pG(d)
pG(y(d))·f(A → α,d)summationtext
d∈D(C)
f(y(d),C)
|C| ·
pG(d)
pG(y(d))·f(A,d)
=
summationtext
w∈C f(w,C)·
summationtext
y(d)=w
pG(d)
pG(w)·f(A → α,d)summationtext
w∈C f(w,C)·
summationtext
y(d)=w
pG(d)
pG(w)·f(A,d)
=
summationtext
w∈C f(w,C)·EpG(d|w)f(A → α,d)summationtext
w∈C f(w,C)·EpG(d|w)f(A,d)
. (23)
Since distribution pG was arbitrarily chosen, sub-
ject to the only restriction that pG(d) > 0 for ev-
ery d ∈ D(C), we have that (23) is the growth
estimator (10) already discussed in Section 3. In
fact, for each w ∈ L(G) and d ∈ D(G), we have
pG(d|w) = pG(d)pG(w). We conclude with the desired
result, namely that a general form PCFG obtained at
any iteration of the EM method for the unsupervised
MLE is always consistent.
6 Conclusions
In this paper we have investigated a number of
methods for the empirical estimation of probabilis-
tic context-free grammars, and have shown that the
resulting grammars have the so-called consistency
property. This property guarantees that all the prob-
ability mass of the grammar is used for the finite
strings it derives. Thus if the grammar is used in
combination with other probabilistic models, as for
instance in a speech processing system, consistency
allows us to combine or compare scores from differ-
ent modules in a sound way.
To obtain our results, we have used a novel proof
technique that exploits an already known construc-
tion for the renormalization of probabilistic context-
free grammars. Our proof technique seems more
intuitive than arguments previously used in the lit-
erature to prove the consistency property, based on
counting arguments or on spectral analysis. It is
not difficult to see that our proof technique can
also be used with probabilistic rewriting formalisms
whose underlying derivations can be characterized
by means of context-free rewriting. This is for
instance the case with probabilistic tree-adjoining
grammars (Schabes, 1992; Sarkar, 1998), for which
consistency results have not yet been shown in the
literature.
A Cross-entropy minimization
Inorder tomake thispaper self-contained, wesketch
a proof of the claim in Section 3 that the estimator
in (12) minimizes the cross entropy in (11). A full
proof appears in (Corazza and Satta, 2006).
Let D, pD and G = (N,Σ,R,S) be defined as
in Section 3. We want to find a proper PCFG G =
(G,pG) such that the cross-entropy H(pD ||pG) is
minimal. We use Lagrange multipliers λA for each
A ∈ N and define the form
∇ = summationdisplay
A∈N
λA ·(summationdisplay
α
pG(A → α)−1) +
348
−summationdisplay
d∈D
pD(d)·logpG(d). (24)
We now consider all the partial derivatives of∇. For
each A ∈ N we have
∂∇
∂λA =
summationdisplay
α
pG(A → α)−1. (25)
For each (A → α) ∈ R we have
∂∇
∂pG(A → α) =
= λA − ∂∂p
G(A → α)
summationdisplay
d∈D
pD(d)·logpG(d)
= λA −summationdisplay
d∈D
pD(d)· ∂∂p
G(A → α)
logpG(d)
= λA −summationdisplay
d∈D
pD(d)· ∂∂p
G(A → α)
log productdisplay
(B→β)∈R
pG(B → β)f(B→β,d)
= λA −summationdisplay
d∈D
pD(d)· ∂∂p
G(A → α)
summationdisplay
(B→β)∈R
f(B → β,d)·logpG(B → β)
= λA −summationdisplay
d∈D
pD(d)· summationdisplay
(B→β)∈R
f(B → β,d)·
∂
∂pG(A → α) logpG(B → β)
= λA −summationdisplay
d∈D
pD(d)·f(A → α,d)·
· 1ln(2) · 1p
G(A → α)
= λA − 1ln(2) · 1p
G(A → α)
·
·summationdisplay
d∈D
pD(d)·f(A → α,d)
= λA − 1ln(2) · 1p
G(A → α)
·
· EpD f(A → α,d). (26)
By setting to zero all of the above partial derivatives,
we obtain a system of|N|+|R|equations, which we
must solve. From ∂∇∂pG(A→α) = 0 we obtain
λA ·ln(2)·pG(A → α) =
EpDf(A → α,d). (27)
We sum over all strings α such that (A → α) ∈ R,
deriving
λA ·ln(2)·summationdisplay
α
pG(A → α) =
= summationdisplay
α
EpD f(A → α,d)
= summationdisplay
α
summationdisplay
d∈D
pD(d)·f(A → α,d)
= summationdisplay
d∈D
pD(d)·summationdisplay
α
f(A → α,d)
= summationdisplay
d∈D
pD(d)·f(A,d)
= EpD f(A,d). (28)
From each equation ∂∇∂λA = 0 we obtainsummationtext
αpG(A → α) = 1 for each A ∈ N (our original
constraints). Combining this with (28) we obtain
λA ·ln(2) = EpD f(A,d). (29)
Replacing (29) into (27) we obtain, for every rule
(A → α) ∈ R,
pG(A → α) = EpD f(A → α,d)E
pD f(A,d)
. (30)
This is the estimator introduced in Section 3.
References
S. Abney, D. McAllester, and F. Pereira. 1999. Relating
probabilistic grammars and automata. In 37th Annual
Meeting of the Association for Computational Linguis-
tics, Proceedings of the Conference, pages 542–549,
Maryland, USA, June.
L. E. Baum. 1972. An inequality and associated max-
imization technique in statistical estimations of prob-
abilistic functions of Markov processes. Inequalities,
3:1–8.
T.L. Booth and R.A. Thompson. 1973. Applying prob-
abilistic measures to abstract languages. IEEE Trans-
actions on Computers, C-22(5):442–450, May.
E. Charniak. 1993. Statistical Language Learning. MIT
Press.
R.Chaudhuri, S.Pham, andO.N.Garcia. 1983. Solution
of an open problem on probabilistic grammars. IEEE
Transactions on Computers, 32(8):748–750.
Z. Chi and S. Geman. 1998. Estimation of probabilis-
ticcontext-freegrammars. Computational Linguistics,
24(2):299–305.
349
Z. Chi. 1999. Statistical properties of probabilistic
context-free grammars. Computational Linguistics,
25(1):131–160.
A. Corazza and G. Satta. 2006. Cross-entropy and es-
timation of probabilistic context-free grammars. In
Proc. of HLT/NAACL 2006 Conference (this volume),
New York.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977.
Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society, B,
39:1–38.
J.E. Hopcroft and J.D. Ullman. 1979. Introduction
to Automata Theory, Languages, and Computation.
Addison-Wesley.
M.-J. Nederhof and G. Satta. 2003. Probabilistic pars-
ing as intersection. In 8th International Workshop on
Parsing Technologies, pages137–148, LORIA,Nancy,
France, April.
J.-A. S´anchez and J.-M. Bened´ı. 1997. Consistency
of stochastic context-free grammars from probabilis-
tic estimation based on growth transformations. IEEE
Transactions on Pattern Analysis and Machine Intelli-
gence, 19(9):1052–1055, September.
A. Sarkar. 1998. Conditions on consistency of proba-
bilistic tree adjoining grammars. In Proc. of the 36th
ACL, pages 1164–1170, Montreal, Canada.
Y. Schabes. 1992. Stochastic lexicalized tree-adjoining
grammars. In Proc. of the 14th COLING, pages 426–
432, Nantes, France.
C. S. Wetherell. 1980. Probabilistic languages: A re-
view and some open questions. Computing Surveys,
12(4):361–379.
350
