Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 335–342,
New York, June 2006. c©2006 Association for Computational Linguistics
Cross-Entropy and Estimation of
Probabilistic Context-Free Grammars
Anna Corazza
Department of Physics
University “Federico II”
via Cinthia
I-80126 Napoli, Italy
corazza@na.infn.it
Giorgio Satta
Department of Information Engineering
University of Padua
via Gradenigo, 6/A
I-35131 Padova, Italy
satta@dei.unipd.it
Abstract
We investigate the problem of training
probabilistic context-free grammars on
the basis of a distribution defined over
an infinite set of trees, by minimizing
the cross-entropy. This problem can be
seen as a generalization of the well-known
maximum likelihood estimator on (finite)
tree banks. We prove an unexpected the-
oretical property of grammars that are
trained in this way, namely, we show
that the derivational entropy of the gram-
mar takes the same value as the cross-
entropy between the input distribution and
the grammar itself. We show that the re-
sult also holds for the widely applied max-
imum likelihood estimator on tree banks.
1 Introduction
Probabilistic context-free grammars are able to de-
scribe hierarchical, tree-shaped structures underly-
ing sentences, and are widely used in statistical nat-
ural language processing; see for instance (Collins,
2003) and references therein. Probabilistic context-
free grammars seem also more suitable than finite-
state devices for language modeling, and several
language models based on these grammars have
been recently proposed in the literature; see for in-
stance (Chelba and Jelinek, 1998), (Charniak, 2001)
and (Roark, 2001).
Empirical estimation of probabilistic context-free
grammars is usually carried out on tree banks, that
is, finite samples of parse trees, through the max-
imization of the likelihood of the sample itself. It
is well-known that this method also minimizes the
cross-entropy between the probability distribution
induced by the tree bank, also called the empirical
distribution, and the tree probability distribution in-
duced by the estimated grammar.
In this paper we generalize the maximum like-
lihood method, proposing an estimation technique
that works on any unrestricted tree distribution de-
fined over an infinite set of trees. This generalization
is theoretically appealing, and allows us to prove un-
expected properties of the already mentioned maxi-
mum likelihood estimator for tree banks, that were
not previously known in the literature on statistical
natural language parsing. More specifically, we in-
vestigate the following information theoretic quanti-
ties
• the cross-entropy between the unrestricted tree
distribution given as input and the tree distri-
bution induced by the estimated probabilistic
context-free grammar; and
• the derivational entropy of the estimated prob-
abilistic context-free grammar.
These two quantities are usually unrelated. We show
that these two quantities take the same value when
the probabilistic context-free grammar is trained us-
ing the minimal cross-entropy criterion. We then
translate back this property to the method of max-
imum likelihood estimation. Our general estima-
tion method also has practical applications in cases
one uses a probabilistic context-free grammar to ap-
proximate strictly more powerful rewriting systems,
335
as for instance probabilistic tree adjoining gram-
mars (Schabes, 1992).
Not much is found in the literature about the
estimation of probabilistic grammars from infinite
distributions. This line of research was started
in (Nederhof, 2005), investigating the problem of
training an input probabilistic finite automaton from
an infinite tree distribution specified by means of an
input probabilistic context-free grammar. The prob-
lem we consider in this paper can then be seen as
a generalization of the above problem, where the in-
putmodeltobetrainedisaprobabilisticcontext-free
grammar and the input distribution is an unrestricted
tree distribution. In (Chi, 1999) an estimator that
maximizes the likelihood of a probability distribu-
tion defined over a finite set of trees is introduced,
as a generalization of the maximum likelihood es-
timator. Again, the problems we consider here can
be thought of as generalizations of such estimator to
the case of distributions over infinite sets of trees or
sentences.
The remainder of this paper is structured as fol-
lows. Section 2 introduces the basic notation and
definitions and Section 3 discusses our new esti-
mation method. Section 4 presents our main re-
sult, which is transferred in Section 5 to the method
of maximum likelihood estimation. Section 6 dis-
cusses some simple examples, and Section 7 closes
with some further discussion.
2 Preliminaries
Throughout this paper we use standard notation and
definitions from the literature on formal languages
and probabilistic grammars, which we briefly sum-
marize below. We refer the reader to (Hopcroft and
Ullman, 1979) and (Booth and Thompson, 1973) for
a more precise presentation.
A context-free grammar (CFG) is a tuple G =
(N,Σ,R,S), where N is a finite set of nonterminal
symbols, Σ is a finite set of terminal symbols dis-
joint from N, S ∈ N is the start symbol and R is a
finite set of rules. Each rule has the form A → α,
where A ∈ N and α ∈ (Σ ∪ N)∗. We denote by
L(G) and T(G) the set of all strings, resp., trees,
generated by G. For t ∈ T(G), the yield of t is
denoted by y(t).
For a nonterminal A and a string α, we write
f(A,α) to denote the number of occurrences of A
in α. For a rule (A → α) ∈ R and a tree t ∈ T(G),
f(A → α,t) denotes the number of occurrences of
A → α in t. We let f(A,t) =summationtextα f(A → α,t).
A probabilistic context-free grammar (PCFG) is
a pair G = (G,pG), with G a CFG and pG a func-
tion from R to the real numbers in the interval [0,1].
A PCFG is proper if for every A ∈ N we havesummationtext
α pG(A → α) = 1. The probability of t ∈ T(G)
is the product of the probabilities of all rules in t,
counted with their multiplicity, that is,
pG(t) = productdisplay
A→α
pG(A → α)f(A→α,t). (1)
The probability of w ∈ L(G) is the sum of the prob-
abilities of all the trees that generate w, that is,
pG(w) = summationdisplay
y(t)=w
pG(t). (2)
A PCFG is consistent ifsummationtextt∈T(G) pG(t) = 1.
In this paper we write log for logarithms in base 2
and ln for logarithms in the natural base e. We also
assume 0 · log0 = 0. We write Ep to denote the
expectation operator under distribution p. In case G
is proper and consistent, we can define the deriva-
tional entropy of G as the expectation of the infor-
mation of parse trees in T(G), computed under dis-
tribution pG, that is,
Hd(pG) = EpG log 1p
G(t)
= − summationdisplay
t∈T(G)
pG(t)·logpG(t). (3)
Similarly, for each A ∈ N we also define the non-
terminal entropy of A as
HA(pG) =
= EpG log 1p
G(A → α)
= −summationdisplay
α
pG(A → α)·logpG(A → α). (4)
3 Estimation based on cross-entropy
Let T be an infinite set of (finite) trees with inter-
nal nodes labeled by symbols in N, root nodes la-
beled by S ∈ N and leaf nodes labeled by symbols
336
in Σ. We assume that the set of rules that are ob-
served in the trees inT is drawn from some finite set
R. Let pT be a probability distribution defined over
T, that is, a function from T to set [0,1] such thatsummationtext
t∈T pT(t) = 1.
The skeleton CFG underlying T is defined as
G = (N,Σ,R,S). Note that we have T ⊆ T(G)
and,inthegeneralcase,theremightbetreesinT(G)
that do not appear in T. We wish anyway to approx-
imate distribution pT the best we can, by turning
G into some proper PCFG G = (G,pG) and set-
ting parameters pG(A → α) appropriately, for each
(A → α) ∈ R.
One possible criterion is to choose pG in such a
way that the cross-entropy between pT and pG is
minimized, where we now view pG as a probability
distribution defined over T(G). The cross-entropy
between pT and pG is defined as the expectation un-
der distributionpT of the information, computed un-
der distribution pG, of the trees in T(G)
H(pT ||pG) = EpT log 1p
G(t)
= −summationdisplay
t∈T
pT(t)·logpG(t). (5)
Since G should be proper, the minimization of (5) is
subject to the constraintssummationtextα pG(A → α) = 1, for
each A ∈ N.
To solve the minimization problem above, we use
Lagrange multipliers λA for each A ∈ N and define
the form
∇ = summationdisplay
A∈N
λA ·(summationdisplay
α
pG(A → α)−1) +
−summationdisplay
t∈T
pT(t)·logpG(t). (6)
We now view ∇ as a function of all the λA and the
pG(A → α), and consider all the partial derivatives
of ∇. For each A ∈ N we have
∂∇
∂λA =
summationdisplay
α
pG(A → α)−1.
For each (A → α) ∈ R we have
∂∇
∂pG(A → α) =
= λA − ∂∂p
G(A → α)
summationdisplay
t∈T
pT(t)·logpG(t)
= λA −summationdisplay
t∈T
pT(t)· ∂∂p
G(A → α)
logpG(t)
= λA −summationdisplay
t∈T
pT(t)· ∂∂p
G(A → α)
log productdisplay
(B→β)∈R
pG(B → β)f(B→β,t)
= λA −summationdisplay
t∈T
pT(t)· ∂∂p
G(A → α)
summationdisplay
(B→β)∈R
f(B → β,t)·logpG(B → β)
= λA −summationdisplay
t∈T
pT(t)· summationdisplay
(B→β)∈R
f(B → β,t)·
∂
∂pG(A → α) logpG(B → β)
= λA −summationdisplay
t∈T
pT(t)·f(A → α,t)·
· 1ln(2) · 1p
G(A → α)
= λA − 1ln(2) · 1p
G(A → α)
·
·summationdisplay
t∈T
pT(t)·f(A → α,t)
= λA − 1ln(2) · 1p
G(A → α)
·
·EpT f(A → α,t).
We now need to solve a system of|N|+|R|equa-
tions obtained by setting to zero all of the above par-
tial derivatives. From each equation ∂∇∂pG(A→α) = 0
we obtain
λA ·ln(2)·pG(A → α) =
= EpT f(A → α,t). (7)
We sum over all strings α such that (A → α) ∈ R
λA ·ln(2)·summationdisplay
α
pG(A → α) =
= summationdisplay
α
EpT f(A → α,t)
= summationdisplay
α
summationdisplay
t∈T
pT(t)·f(A → α,t)
= summationdisplay
t∈T
pT(t)·summationdisplay
α
f(A → α,t)
= summationdisplay
t∈T
pT(t)·f(A,t)
= EpT f(A,t). (8)
337
From each equation ∂∇∂λA = 0 we obtainsummationtext
α pG(A → α) = 1 for each A ∈ N (our origi-
nal constraints). Combining with (8) we obtain
λA ·ln(2) = EpT f(A,t). (9)
Replacing (9) into (7) we obtain, for every rule
(A → α) ∈ R,
pG(A → α) = EpT f(A → α,t)E
pT f(A,t)
. (10)
The equations in (10) define the desired estimator
for our PCFG, assigning to each rule A → α a prob-
ability specified as the ratio between the expected
number of A → α and the expected number of A,
under the distribution pT. We remark here that the
minimization of the cross-entropy above is equiva-
lent to the minimization of the Kullback-Leibler dis-
tance between pT and pG, viewed as tree distribu-
tions. Also, note that the likelihood of an infinite set
of derivations would always be zero and therefore
cannot be considered here.
To be used in the next section, we now show that
the PCFG G obtained as above is consistent. The
line of our argument below follows a proof provided
in (Chi and Geman, 1998) for the maximum like-
lihood estimator based on finite tree distributions.
Without loss of generality, we assume that in G the
start symbol S is never used in the right-hand side
of a rule.
For each A ∈ N, let qA be the probability that a
derivation in G rooted in A fails to terminate. We
can then write
qA ≤ summationdisplay
B∈N
qB ·summationdisplay
α
pG(A → α)f(B,α).(11)
The inequality follows from the fact that the events
considered in the right-hand side of (11) are not mu-
tually exclusive. Combining (10) and (11) we obtain
qA ·EpT f(A,t) ≤
≤ summationdisplay
B∈N
qB ·summationdisplay
α
EpTf(A → α,t)f(B,α).
Summing over all nonterminals we have
summationdisplay
A∈N
qA ·EpT f(A,t) ≤
≤ summationdisplay
B∈N
qB · summationdisplay
A∈N
summationdisplay
α
EpT f(A → α,t)f(B,α)
= summationdisplay
B∈N
qB ·EpT fc(B,t), (12)
where fc(B,t) indicates the number of times a node
labeled by nonterminal B appears in the derivation
tree t as a child of some other node.
From our assumptions on the start symbol S, we
have that S only appears at the root of the trees
in T(G). Then it is easy to see that, for every
A negationslash= S, we have EpTfc(A,t) = EpTf(A,t), while
EpTfc(S,t) = 0 and EpTf(S,t) = 1. Using these
relations in (12) we obtain
qS ·EpT f(S,T) ≤ qS ·EpT fc(S,T),
from which we conclude qS = 0, thus implying the
consistency of G.
4 Cross-entropy and derivational entropy
In this section we present the main result of the pa-
per. We show that, when G = (G,pG) is estimated
by minimizing the cross-entropy in (5), then such
cross-entropy takes the same value as the deriva-
tional entropy of G, defined in (3).
In (Nederhof and Satta, 2004) relations are de-
rived for the exact computation ofHd(pG). For later
use, we report these relations below, under the as-
sumption that G is consistent (see Section 3). We
have
Hd(pG) = summationdisplay
A∈N
outG(A)·HA(pG). (13)
Quantities HA(pG), A ∈ N, have been defined
in(4). ForeachA ∈ N, quantityoutG(A)isthesum
of the probabilities of all trees generated by G, hav-
ing root labeled by S and having a yield composed
of terminal symbols with an unexpanded occurrence
of nonterminal A. Again, we assume that symbol
S does not appear in any of the right-hand sides of
the rules in R. This means that S only appears at
the root of the trees in T(G). Under this condi-
tion, quantities outG(A) can be exactly computed
by solving the following system of linear equations
(see also (Nederhof, 2005))
outG(S) = 1; (14)
for each A negationslash= S
outG(A) =
= summationdisplay
B→β
outG(B)·f(A,β)·pG(B → β).(15)
338
We can now prove the equality
Hd(pG) = H(pT ||pG), (16)
where G is the PCFG estimated by minimizing the
cross-entropy in (5), as described in Section 3.
We start from the definition of cross-entropy
H(pT ||pG) =
= −summationdisplay
t∈T
pT(t)·logpG(t)
= −summationdisplay
t∈T
pT(t)·log productdisplay
A→α
pG(A → α)f(A→α,t)
= −summationdisplay
t∈T
pT(t)·
· summationdisplay
A→α
f(A → α,t)·logpG(A → α)
= − summationdisplay
A→α
logpG(A → α)·
·summationdisplay
t∈T
pT(t)·f(A → α,t)
= − summationdisplay
A→α
logpG(A → α)·
·EpT f(A → α,t). (17)
From our estimator in (10) we can write
EpT f(A → α,t) =
= pG(A → α)·EpT f(A,t). (18)
Replacing (18) into (17) gives
H(pT ||pG) =
= − summationdisplay
A→α
logpG(A → α)·
·pG(A → α)·EpT f(A,t)
= − summationdisplay
A∈N
EpT f(A,t)·
·summationdisplay
α
pG(A → α)·logpG(A → α)
= summationdisplay
A∈N
EpT f(A,t)·H(pG,A). (19)
Comparing (19) with (13) we see that, in order to
prove the equality in (16), we need to show relations
EpT f(A,t) = outG(A), (20)
for every A ∈ N. We have already observed in Sec-
tion 3 that, under our assumption on the start symbol
S, we have
EpT f(S,t) = 1. (21)
We now observe that, for any A ∈ N with A negationslash= S
and any t ∈ T(G), we have
f(A,t) =
= summationdisplay
B→β
f(B → β,t)·f(A,β). (22)
For each A ∈ N with A negationslash= S we can then write
EpT f(A,t) =
= summationdisplay
t∈T
pT(t)·f(A,t)
= summationdisplay
t∈T
pT(t)· summationdisplay
B→β
f(B → β,t)·f(A,β)
= summationdisplay
B→β
summationdisplay
t∈T
pT(t)·f(B → β,t)·f(A,β)
= summationdisplay
B→β
EpT f(B → β,t)·f(A,β). (23)
Once more we use relation (18), which replaced
in (23) provides
EpT f(A,t) =
= summationdisplay
B→β
EpT f(B,t)·
·f(A,β)·pG(B → β). (24)
Notice that the linear system in (14) and (15) and the
linear system in (21) and (24) are the same. Thus we
conclude that quantities EpT f(A,t) and outG(A)
are the same for each A ∈ N. This completes our
proof of the equality in (16). Some examples will be
discussed in Section 6.
Besides its theoretical significance, the equality
in (16) can also be exploited in the computation of
the cross-entropy in practical applications. In fact,
cross-entropy is used as a measure of tightness in
comparing different models. In case of estimation
from an infinite distribution pT, the definition of the
cross-entropy H(pT ||pG) contains an infinite sum-
mation, which is problematic for the computation of
such quantity. In standard practice, this problem is
overcome by generating a finite sampleT(n) of large
sizen, throughthedistributionpT, andthencomput-
ing the approximation (Manning and Sch¨utze, 1999)
H(pT ||pG) ∼ −1nsummationdisplay
t∈T
f(t,T(n))·logpG(t),
339
where f(t,T(n)) indicates the multiplicity, that is,
the number of occurrences, oftinT(n). However, in
practical applications n must be very large in order
to have a small error. Based on the results in this
section, we can instead compute the exact value of
H(pT ||pG) by computing the derivational entropy
Hd(pG), using relation (13) and solving the linear
system in (14) and (15), which takes cubic time in
the number of nonterminals of the grammar.
5 Estimation based on likelihood
In natural language processing applications, the es-
timation of a PCFG is usually carried out on the ba-
sis of a finite sample of trees, called tree bank. The
so-called maximum likelihood estimation (MLE)
method is exploited, which maximizes the likeli-
hood of the observed data. In this section we show
that the MLE method is a special case of the esti-
mation method presented in Section 3, and that the
results of Section 4 also hold for the MLE method.
Let T be a tree sample, and let T be the under-
lying set of trees. For t ∈ T, we let f(t,T ) be the
multiplicity of t in T . We define
f(A → α,T ) =
= summationdisplay
t∈T
f(t,T )·f(A → α,t), (25)
and let f(A,T ) = summationtextα f(A → α,T ). We can in-
duce from T a probability distribution pT , defined
over T, by letting for each t ∈ T
pT (t) = f(t,T )|T| . (26)
Note thatsummationtextt∈T pT (t) = 1. DistributionpT is called
the empirical distribution of T .
Assume that the trees in T have internal nodes
labeled by symbols in N, root nodes labeled by
S and leaf nodes labeled by symbols in Σ. Let
also R be the finite set of rules that are observed
in T . We define the skeleton CFG underlying T as
G = (N,Σ,R,S). In the MLE method we proba-
bilistically extend the skeleton CFG G by means of
a function pG that maximizes the likelihood of T ,
defined as
pG(T ) = productdisplay
t∈T
pG(t)f(t,T), (27)
subject to the usual properness conditions on pG.
Such maximization provides the estimator (see for
instance (Chi and Geman, 1998))
pG(A → α) = f(A → α,T )f(A,T ) . (28)
Letusconsidertheestimatorin(10). Ifwereplace
distribution pT with our empirical distribution pT ,
we derive
pG(A → α) =
= EpT f(A → α,t)E
pT f(A,t)
=
summationtext
t∈T
f(t,T)
|T| ·f(A → α,t)summationtext
t∈T
f(t,T)
|T| ·f(A,t)
=
summationtext
t∈T f(t,T )·f(A → α,t)summationtext
t∈T f(t,T )·f(A,t)
= f(A → α,T )f(A,T ) . (29)
This is precisely the estimator in (28).
From relation (29) we conclude that the MLE
method can be seen as a special case of the general
estimatorinSection3, withtheinputdistributionde-
fined over a finite set of trees. We can also derive
the well-known fact that, in the finite case, the maxi-
mization of the likelihood pG(T ) corresponds to the
minimization of the cross-entropy H(pT ||pG).
Let nowG = (G,pG) be a PCFG trained onT us-
ing the MLE method. Again from relation (29) and
Section 3 we have that G is consistent. This result
has been firstly shown in (Chaudhuri et al., 1983)
and later, with a different proof technique, in (Chi
and Geman, 1998). We can then transfer the results
of Section 4 to the supervised MLE method, show-
ing the equality
Hd(pG) = H(pT ||pG). (30)
This result was not previously known in the litera-
ture on statistical parsing of natural language. Some
examples will be discussed in Section 6.
6 Some examples
In this section we discuss a simple example with the
aim of clarifying the theoretical results in the previ-
ous sections. For a real number q with 0 < q < 1,
340
Figure 1: Derivational entropy of Gq and cross-
entropies for three different corpora.
consider the CFG G defined by the two rules S →
aS and S → a, and letGq = (G,pG,q) be the proba-
bilistic extension of G with pG,q(S → aS) = q and
pG,q(S → a) = 1 −q. This grammar is unambigu-
ous and consistent, and each tree t generated by G
has probability pG,q(t) = qi · (1 −q), where i ≥ 0
is the number of occurrences of rule S → aS in t.
We use below the following well-known relations
(0 < r < 1)
+∞summationdisplay
i=0
ri = 11−r, (31)
+∞summationdisplay
i=1
i·ri−1 = 1(1−r)2. (32)
The derivational entropy of Gq can be directly
computed from its definition as
Hd(pG,q) = −
+∞summationdisplay
i=0
qi ·(1−q)·log
parenleftBig
qi ·(1−q)
parenrightBig
= −(1−q)
+∞summationdisplay
i=0
qi logqi +
−(1−q)·log(1−q)·
+∞summationdisplay
i=0
qi
= −(1−q)·logq ·
+∞summationdisplay
i=0
i·qi −log(1−q)
= − q1−q ·logq −log(1−q). (33)
See Figure 1 for a plot of Hd(pG,q) as a function
of q.
If a tree bank is given, composed of occurrences
of trees generated by G, the value of q can be es-
timated by applying the MLE or, equivalently, by
minimizingthecross-entropy. Weconsiderheresev-
eral tree banks, to exemplify the behaviour of the
cross-entropy depending on the structure of the sam-
ple of trees. The first tree bank T contains a single
tree t with a single occurrence of rule S → aS and
a single occurrence of rule S → a. We then have
pT (t) = 1 and pG,q(t) = q · (1 − q). The cross-
entropy between distributions pT and pG,q is then
H(pT ,pG,q) = −logq ·(1−q)
= −logq −log(1−q). (34)
The cross-entropy H(pT ,pG,q), viewed as a func-
tion of q, is a convex-∪ function and is plotted in
Figure 1 (line indicated by Kd = 1, see below). We
can obtain its minimum by finding a zero for the first
derivative
d
dqH(pT ,pG,q) = −
1
q +
1
1−q
= 2q −1q ·(1−q) = 0, (35)
which gives q = 0.5. Note from Figure 1 that
the minimum of H(pT ,pG,q) crosses the line cor-
responding to the derivational entropy, as should be
expected from the result in Section 4.
More in general, for integers d > 0 and K > 0,
consider a tree sample Td,K consisting of d trees ti,
1 ≤ i ≤ d. Each ti contains ki ≥ 0 occurrences
of rule S → aS and one occurrence of rule S → a.
Thus we have pTd,K(ti) = 1d and pG,q(ti) = qki ·
(1−q). We letsummationtextdi=1 ki = K. The cross-entropy is
H(pTd,K,pG,q) =
= −
dsummationdisplay
i=0
1
d ·logq
ki −log(1−q)
= −Kd logq −log(1−q). (36)
In Figure 1 we plot H(pTd,K,pG,q) in the case Kd =
0.5 and in the case Kd = 1.5. Again, we have that
these curves intersect with the curve corresponding
to the derivational entropy Hd(pG,q) at the points
were they take their minimum values.
341
7 Conclusions
We have shown in this paper that, when a PCFG is
estimated from some tree distribution by minimiz-
ing the cross-entropy, then the cross-entropy takes
the same value as the derivational entropy of the
PCFG itself. As a special case, this result holds for
the maximum likelihood estimator, widely applied
in statistical natural language parsing. The result
also holds for the relative weighted frequency esti-
mator introduced in (Chi, 1999) as a generalization
of the maximum likelihood estimator, and for the es-
timator introduced in (Nederhof, 2005) already dis-
cussedintheintroduction. Inajournalversionofthe
present paper, which is under submission, we have
also extended the results of Section 4 to the unsuper-
vised estimation of a PCFG from a distribution de-
fined over an infinite set of (unannotated) sentences
and, as a particular case, to the well-knonw inside-
outside algorithm (Manning and Sch¨utze, 1999).
In practical applications, the results of Section 4
can be exploited in the computation of model tight-
ness. In fact, cross-entropy indicates how much the
estimated model fits the observed data, and is com-
monly exploited in comparison of different models
on the same data set. We can then use the given
relation between cross-entropy and derivational en-
tropy to compute one of these two quantities from
the other. For instance, in the case of the MLE
method we can choose between the computation of
the derivational entropy and the cross-entropy, de-
pending basically on the instance of the problem at
hand. As already mentioned, the computation of the
derivational entropy requires cubic time in the num-
ber of nonterminals of the grammar. If this num-
ber is large, direct computation of (5) on the corpus
might be more efficient. On the other hand, if the
corpus at hand is very large, one might opt for direct
computation of (3).
Acknowledgements
Helpful comments from Zhiyi Chi, Alberto lavelli,
Mark-Jan Nederhof and Khalil Simaan are grate-
fully acknowledged.
References
T.L. Booth and R.A. Thompson. 1973. Applying prob-
abilistic measures to abstract languages. IEEE Trans-
actions on Computers, C-22(5):442–450, May.
E.Charniak. 2001. Immediate-headparsingforlanguage
models. In 39th Annual Meeting and 10th Conference
of the European Chapter of the Association for Com-
putational Linguistics, Proceedings of the Conference,
pages 116–123, Toulouse, France, July.
R.Chaudhuri,S.Pham,andO.N.Garcia. 1983. Solution
of an open problem on probabilistic grammars. IEEE
Transactions on Computers, 32(8):748–750.
C. Chelba and F. Jelinek. 1998. Exploiting syntactic
structureforlanguagemodeling. In36thAnnualMeet-
ing of the Association for Computational Linguistics
and 17th International Conference on Computational
Linguistics, volume 1, pages 225–231, Montreal, Que-
bec, Canada, August.
Z. Chi and S. Geman. 1998. Estimation of probabilis-
ticcontext-freegrammars. Computational Linguistics,
24(2):299–305.
Z. Chi. 1999. Statistical properties of probabilistic
context-free grammars. Computational Linguistics,
25(1):131–160.
M. Collins. 2003. Head-driven statistical models for
natural language parsing. Computational Linguistics,
pages 589–638.
J.E. Hopcroft and J.D. Ullman. 1979. Introduction
to Automata Theory, Languages, and Computation.
Addison-Wesley.
C.D. Manning and H. Sch¨utze. 1999. Foundations
of Statistical Natural Language Processing. Mas-
sachusetts Institute of Technology.
M.-J. Nederhof and G. Satta. 2004. Kullback-Leibler
distance between probabilistic context-free grammars
and probabilistic finite automata. In Proc. of the 20th
COLING, volume 1, pages 71–77, Geneva, Switzer-
land.
M.-J. Nederhof. 2005. A general technique to train lan-
guage models on language models. Computational
Linguistics, 31(2):173–185.
B. Roark. 2001. Probabilistic top-down parsing
and language modeling. Computational Linguistics,
27(2):249–276.
Y. Schabes. 1992. Stochastic lexicalized tree-adjoining
grammars. In Proc. of the fifteenth International
Conference on Computational Linguistics, volume 2,
pages 426–432, Nantes, August.
342
