Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X),
pages 14–20, New York City, June 2006. c©2006 Association for Computational Linguistics
Non-Local Modeling with a Mixture of PCFGs
Slav Petrov Leon Barrett Dan Klein
Computer Science Division, EECS Department
University of California at Berkeley
Berkeley, CA 94720
{petrov, lbarrett, klein}@eecs.berkeley.edu
Abstract
While most work on parsing with PCFGs
has focused on local correlations between
tree configurations, we attempt to model
non-local correlations using a finite mix-
ture of PCFGs. A mixture grammar fit
with the EM algorithm shows improve-
ment over a single PCFG, both in parsing
accuracy and in test data likelihood. We
argue that this improvement comes from
the learning of specialized grammars that
capture non-local correlations.
1 Introduction
The probabilistic context-free grammar (PCFG) for-
malism is the basis of most modern statistical
parsers. The symbols in a PCFG encode context-
freedom assumptions about statistical dependencies
in the derivations of sentences, and the relative con-
ditional probabilities of the grammar rules induce
scores on trees. Compared to a basic treebank
grammar (Charniak, 1996), the grammars of high-
accuracy parsers weaken independence assumptions
by splitting grammar symbols and rules with ei-
ther lexical (Charniak, 2000; Collins, 1999) or non-
lexical (Klein and Manning, 2003; Matsuzaki et al.,
2005) conditioning information. While such split-
ting, or conditioning, can cause problems for sta-
tistical estimation, it can dramatically improve the
accuracy of a parser.
However, the configurations exploited in PCFG
parsers are quite local: rules’ probabilities may de-
pend on parents or head words, but do not depend
on arbitrarily distant tree configurations. For exam-
ple, it is generally not modeled that if one quantifier
phrase (QP in the Penn Treebank) appears in a sen-
tence, the likelihood of finding another QP in that
same sentence is greatly increased. This kind of ef-
fect is neither surprising nor unknown – for exam-
ple, Bock and Loebell (1990) show experimentally
that human language generation demonstrates prim-
ing effects. The mediating variables can not only in-
clude priming effects but also genre or stylistic con-
ventions, as well as many other factors which are not
adequately modeled by local phrase structure.
A reasonable way to add a latent variable to a
generative model is to use a mixture of estimators,
in this case a mixture of PCFGs (see Section 3).
The general mixture of estimators approach was first
suggested in the statistics literature by Titterington
et al. (1962) and has since been adopted in machine
learning (Ghahramani and Jordan, 1994). In a mix-
ture approach, we have a new global variable on
which all PCFG productions for a given sentence
can be conditioned. In this paper, we experiment
with a finite mixture of PCFGs. This is similar to the
latent nonterminals used in Matsuzaki et al. (2005),
but because the latent variable we use is global, our
approach is more oriented toward learning non-local
structure. We demonstrate that a mixture fit with the
EM algorithm gives improved parsing accuracy and
test data likelihood. We then investigate what is and
is not being learned by the latent mixture variable.
While mixture components are difficult to interpret,
we demonstrate that the patterns learned are better
than random splits.
2 Empirical Motivation
It is commonly accepted that the context freedom
assumptions underlying the PCFG model are too
14
VP
VBD
increased
NP
CD
11
NN
%
PP
TO
to
NP
QP
#
#
CD
2.5
CD
billion
PP
IN
from
NP
QP
#
#
CD
2.25
CD
billion
Rule Score
QP→ # CD CD 131.6
PRN→ -LRB- ADJP -RRB 77.1
VP→ VBD NP , PP PP 33.7
VP→ VBD NP NP PP 28.4
PRN→ -LRB- NP -RRB- 17.3
ADJP→ QP 13.3
PP→ IN NP ADVP 12.3
NP→ NP PRN 12.3
VP→ VBN PP PP PP 11.6
ADVP→ NP RBR 10.1
Figure 1: Self-triggering: QP→# CD CD. If one British financial occurs in the sentence, the probability of
seeing a second one in the same sentence is highly inreased. There is also a similar, but weaker, correlation
for the American financial ($). On the right hand side we show the ten rules whose likelihoods are most
increased in a sentence containing this rule.
strong and that weakening them results in better
models of language (Johnson, 1998; Gildea, 2001;
Klein and Manning, 2003). In particular, certain
grammar productions often cooccur with other pro-
ductions, which may be either near or distant in the
parse tree. In general, there exist three types of cor-
relations: (i) local (e.g. parent-child), (ii) non-local,
and (iii) self correlations (which may be local or
non-local).
In order to quantify the strength of a correlation,
we use a likelihood ratio (LR). For two rules X→α
and Y→β, we compute
LR(X→α, Y→β) = P(α,β|X,Y )P(α|X,Y )P(β|X,Y )
This measures how much more often the rules oc-
cur together than they would in the case of indepen-
dence. For rules that are correlated, this score will
be high (≫ 1); if the rules are independent, it will
be around 1, and if they are anti-correlated, it will be
near 0.
Among the correlations present in the Penn Tree-
bank, the local correlations are the strongest ones;
they contribute 65% of the rule pairs with LR scores
above 90 and 85% of those with scores over 200.
Non-local and self correlations are in general com-
mon but weaker, with non-local correlations con-
tributing approximately 85% of all correlations1. By
adding a latent variable conditioning all productions,
1Quantifying the amount of non-local correlation is prob-
lematic; most pairs of cooccuring rules are non-local and will,
due to small sample effects, have LR ratios greater than 1 even
if they were truly independent in the limit.
we aim to capture some of this interdependence be-
tween rules.
Correlations at short distances have been cap-
tured effectively in previous work (Johnson, 1998;
Klein and Manning, 2003); vertical markovization
(annotating nonterminals with their ancestor sym-
bols) does this by simply producing a different dis-
tribution for each set of ancestors. This added con-
text leads to substantial improvement in parsing ac-
curacy. With local correlations already well cap-
tured, our main motivation for introducing a mix-
ture of grammars is to capture long-range rule cooc-
currences, something that to our knowledge has not
been done successfully in the past.
As an example, the rule QP→# CD CD, rep-
resenting a quantity of British currency, cooc-
curs with itself 132 times as often as if oc-
currences were independent. These cooccur-
rences appear in cases such as seen in Figure 1.
Similarly, the rules VP→VBD NP PP , S and
VP→VBG NP PP PP cooccur in the Penn Tree-
bank 100 times as often as we would expect if they
were independent. They appear in sentences of a
very particular form, telling of an action and then
giving detail about it; an example can be seen in Fig-
ure 2.
3 Mixtures of PCFGs
In a probabilistic context-free grammar (PCFG),
each rule X→α is associated with a conditional
probability P(α|X) (Manning and Sch¨utze, 1999).
Together, these rules induce a distribution over trees
P(T). A mixture of PCFGs enriches the basic model
15
VP
VBD
hit
NP
a record
PP
in 1998
,
,
S
VP
VBG
rising
NP
1.7%
PP
after inflation adjustment
PP
to $13,120
S
NP
DT
No
NX
NX
NNS
lawyers
CC
or
NX
NN
tape
NNS
recorders
VP
were present
.
.
(a) (b)
S
S
NP
DT
These
NN
rate
NNS
indications
VP
VBP
are
RB
n’t
ADJP
directly comparable
:
;
S
NP
NN
lending
NNS
practices
VP
VBP
vary
ADVP
widely
PP
by location
.
.
X
X
SYM
**
ADJP
VBN
Projected
(c) (d)
Figure 2: Tree fragments demonstrating coocurrences. (a) and (c) Repeated formulaic structure in one
grammar: rules VP→VBD NP PP , S and VP→VBG NP PP PP and rules VP→VBP RB ADJP
and VP→VBP ADVP PP. (b) Sibling effects, though not parallel structure, rules: NX→NNS and
NX→NN NNS. (d) A special structure for footnotes has rules ROOT→X and X→SYM coocurring
with high probability.
by allowing for multiple grammars, Gi, which we
call individual grammars, as opposed to a single
grammar. Without loss of generality, we can as-
sume that the individual grammars share the same
set of rules. Therefore, each original rule X→α
is now associated with a vector of probabilities,
P(α|X,i). If, in addition, the individual grammars
are assigned prior probabilities P(i), then the entire
mixture induces a joint distribution over derivations
P(T,i) = P(i)P(T|i) from which we recover a dis-
tribution over trees by summing over the grammar
index i.
As a generative derivation process, we can think
of this in two ways. First, we can imagine G to be
a latent variable on which all productions are con-
ditioned. This view emphasizes that any otherwise
unmodeled variable or variables can be captured by
the latent variable G. Second, we can imagine se-
lecting an individual grammar Gi and then gener-
ating a sentence using that grammar. This view is
associated with the expectation that there are multi-
ple grammars for a language, perhaps representing
different genres or styles. Formally, of course, the
two views are the same.
3.1 Hierarchical Estimation
So far, there is nothing in the formal mixture model
to say that rule probabilities in one component have
any relation to those in other components. However,
we have a strong intuition that many rules, such as
NP→DT NN, will be common in all mixture com-
ponents. Moreover, we would like to pool our data
across components when appropriate to obtain more
reliable estimators.
This can be accomplished with a hierarchical es-
timator for the rule probabilities. We introduce a
shared grammar Gs. Associated to each rewrite is
now a latent variable L = {S, I} which indicates
whether the used rule was derived from the shared
grammar Gs or one of the individual grammars Gi:
P(α|X,i) =
λP(α|X,i,ℓ=I) + (1−λ)P(α|X,i,ℓ=S),
where λ ≡ P(ℓ = I) is the probability of
choosing the individual grammar and can also
be viewed as a mixing coefficient. Note that
P(α|X,i,ℓ=S) = P(α|X,ℓ=S), since the shared
grammar is the same for all individual grammars.
This kind of hierarchical estimation is analogous to
that used in hierarchical mixtures of naive-Bayes for
16
text categorization (McCallum et al., 1998).
The hierarchical estimator is most easily de-
scribed as a generative model. First, we choose a
individual grammar Gi. Then, for each nonterminal,
we select a level from the back-off hierarchy gram-
mar: the individual grammar Gi with probability λ,
and the shared grammar Gs with probability 1−λ.
Finally, we select a rewrite from the chosen level. To
emphasize: the derivation of a phrase-structure tree
in a hierarchically-estimated mixture of PCFGs in-
volves two kinds of hidden variables: the grammar
G used for each sentence, and the level L used at
each tree node. These hidden variables will impact
both learning and inference in this model.
3.2 Inference: Parsing
Parsing involves inference for a given sentence S.
One would generally like to calculate the most prob-
able parse – that is, the tree T which has the high-
est probability P(T|S)∝summationtexti P(i)P(T|i). How-
ever, this is difficult for mixture models. For a single
grammar we have:
P(T,i) = P(i)
productdisplay
X→α∈T
P(α|X,i).
This score decomposes into a product and it is sim-
ple to construct a dynamic programming algorithm
to find the optimal T (Baker, 1979). However, for a
mixture of grammars we need to sum over the indi-
vidual grammars:summationdisplay
i
P(T,i) =
summationdisplay
i
P(i)
productdisplay
X→α∈T
P(α|X,i).
Because of the outer sum, this expression unfor-
tunately does not decompose into a product over
scores of subparts. In particular, a tree which maxi-
mizes the sum need not be a top tree for any single
component.
As is true for many other grammar formalisms in
which there is a derivation / parse distinction, an al-
ternative to finding the most probable parse is to find
the most probable derivation (Vijay-Shankar and
Joshi, 1985; Bod, 1992; Steedman, 2000). Instead
of finding the tree T which maximizes summationtexti P(T,i),
we find both the tree T and component i which max-
imize P(T,i). The most probable derivation can be
found by simply doing standard PCFG parsing once
for each component, then comparing the resulting
trees’ likelihoods.
3.3 Learning: Training
Training a mixture of PCFGs from a treebank is an
incomplete data problem. We need to decide which
individual grammar gave rise to a given observed
tree. Moreover, we need to select a generation path
(individual grammar or shared grammar) for each
rule in the tree. To learn estimate parameters, we
can use a standard Expectation-Maximization (EM)
approach.
In the E-step, we compute the posterior distribu-
tions of the latent variables, which are in this case
both the component G of each sentence and the hier-
archy level L of each rewrite. Note that, unlike dur-
ing parsing, there is no uncertainty over the actual
rules used, so the E-step does not require summing
over possible trees. Specifically, for the variable G
we have
P(i|T) = P(T,i)summationtext
j P(T,j)
.
For the hierarchy level L we can write
P(ℓ = I|X →α,i,T) =
λP(α|X,ℓ=I)
λP(α|X,i,ℓ=I) + (1−λ)P(α|X,ℓ=S),
where we slightly abuse notation since the rule
X →α can occur multiple times in a tree T.
In the M-step, we find the maximum-likelihood
model parameters given these posterior assign-
ments; i.e., we find the best grammars given the way
the training data’s rules are distributed between in-
dividual and shared grammars. This is done exactly
as in the standard single-grammar model using rela-
tive expected frequencies. The updates are shown in
Figure 3.3, where T = {T1,T2,...}is the training
set.
We initialize the algorithm by setting the assign-
ments from sentences to grammars to be uniform
between all the individual grammars, with a small
random perturbation to break symmetry.
4 Results
We ran our experiments on the Wall Street Jour-
nal (WSJ) portion of the Penn Treebank using the
standard setup: We trained on sections 2 to 21,
and we used section 22 as a validation set for tun-
ing model hyperparameters. Results are reported
17
P(i)←
summationtext
Tk∈TP(i|Tk)summationtext
i
summationtext
Tk∈TP(i|Tk)
=
C8
Tk∈TP(i|Tk)
k
P(l = I)←
summationtext
Tk∈T
summationtext
X→α∈Tk P(ℓ = I|X →α)summationtext
Tk∈T|Tk|
P(α|X,i,ℓ = I)←
summationtext
Tk∈T
summationtext
X→α∈Tk P(i|Tk)P(ℓ = I|Tk,i,X →α)summationtext
α′
summationtext
Tk∈T
summationtext
X→α′∈Tk P(i|Tk)P(ℓ = I|Tk,i,X →α′)
Figure 3: Parameter updates. The shared grammar’s parameters are re-estimated in the same manner.
on all sentences of 40 words or less from section
23. We use a markovized grammar which was an-
notated with parent and sibling information as a
baseline (see Section 4.2). Unsmoothed maximum-
likelihood estimates were used for rule probabili-
ties as in Charniak (1996). For the tagging proba-
bilities, we used maximum-likelihood estimates for
P(tag|word). Add-one smoothing was applied to
unknown and rare (seen ten times or less during
training) words before inverting those estimates to
give P(word|tag). Parsing was done with a sim-
ple Java implementation of an agenda-based chart
parser.
4.1 Parsing Accuracy
The EM algorithm is guaranteed to continuously in-
crease the likelihood on the training set until conver-
gence to a local maximum. However, the likelihood
on unseen data will start decreasing after a number
of iterations, due to overfitting. This is demonstrated
in Figure 4. We use the likelihood on the validation
set to stop training before overfitting occurs.
In order to evaluate the performance of our model,
we trained mixture grammars with various numbers
of components. For each configuration, we used EM
to obtain twelve estimates, each time with a different
random initialization. We show the F1-score for the
model with highest log-likelihood on the validation
set in Figure 4. The results show that a mixture of
grammars outperforms a standard, single grammar
PCFG parser.2
4.2 Capturing Rule Correlations
As described in Section 2, we hope that the mix-
ture model will capture long-range correlations in
2This effect is statistically significant.
the data. Since local correlations can be captured
by adding parent annotation, we combine our mix-
ture model with a grammar in which node probabil-
ities depend on the parent (the last vertical ancestor)
and the closest sibling (the last horizontal ancestor).
Klein and Manning (2003) refer to this grammar as
a markovized grammar of vertical order = 2 and hor-
izontal order = 1. Because many local correlations
are captured by the markovized grammar, there is a
greater hope that observed improvements stem from
non-local correlations.
In fact, we find that the mixture does capture
non-local correlations. We measure the degree to
which a grammar captures correlations by calculat-
ing the total squared error between LR scores of the
grammar and corpus, weighted by the probability
of seeing nonterminals. This is 39422 for a sin-
gle PCFG, but drops to 37125 for a mixture with
five individual grammars, indicating that the mix-
ture model better captures the correlations present
in the corpus. As a concrete example, in the Penn
Treebank, we often see the rules FRAG→ADJP
and PRN→, SBAR , cooccurring; their LR is 134.
When we learn a single markovized PCFG from the
treebank, that grammar gives a likelihood ratio of
only 61. However, when we train with a hierarchi-
cal model composed of a shared grammar and four
individual grammars, we find that the grammar like-
lihood ratio for these rules goes up to 126, which is
very similar to that of the empirical ratio.
4.3 Genre
The mixture of grammars model can equivalently be
viewed as capturing either non-local correlations or
variations in grammar. The latter view suggests that
the model might benefit when the syntactic structure
18
 0  10  20  30  40  50  60
Log Likelihood
Iteration
Training data
Validation data
Testing data
 79
 79.2
 79.4
 79.6
 79.8
 80
 1  2  3  4  5  6  7  8  9
F1
Number of Component Grammars
Mixture model
Baseline: 1 grammar
(a) (b)
Figure 4: (a) Log likelihood of training, validation, and test data during training (transformed to fit on the
same plot). Note that when overfitting occurs the likelihood on the validation and test data starts decreasing
(after 13 iterations). (b) The accuracy of the mixture of grammars model with λ = 0.4 versus the number of
grammars. Note the improvement over a 1-grammar PCFG model.
varies significantly, as between different genres. We
tested this with the Brown corpus, of which we used
8 different genres (f, g, k, l, m, n, p, and r). We fol-
low Gildea (2001) in using the ninth and tenth sen-
tences of every block of ten as validation and test
data, respectively, because a contiguous test section
might not be representative due to the genre varia-
tion.
To test the effects of genre variation, we evalu-
ated various training schemes on the Brown corpus.
The single grammar baseline for this corpus gives
F1 = 79.75, with log likelihood (LL) on the testing
data=-242561. The first test, then, was to estimate
each individual grammar from only one genre. We
did this by assigning sentences to individual gram-
mars by genre, without using any EM training. This
increases the data likelihood, though it reduces the
F1 score (F1 = 79.48, LL=-242332). The increase
in likelihood indicates that there are genre-specific
features that our model can represent. (The lack of
F1 improvement may be attributed to the increased
difficulty of estimating rule probabilities after divid-
ing the already scant data available in the Brown cor-
pus. This small quantity of data makes overfitting
almost certain.)
However, local minima and lack of data cause dif-
ficulty in learning genre-specific features. If we start
with sentences assigned by genre as before, but then
train with EM, both F1 and test data log likelihood
drop (F1 = 79.37, LL=-242100). When we use
EM with a random initialization, so that sentences
are not assigned directly to grammars, the scores go
down even further (F1 = 79.16, LL=-242459). This
indicates that the model can capture variation be-
tween genres, but that maximum training data likeli-
hood does not necessarily give maximum accuracy.
Presumably, with more genre-specific data avail-
able, learning would generalize better. So, genre-
specific grammar variation is real, but it is difficult
to capture via EM.
4.4 Smoothing Effects
While the mixture of grammars captures rule cor-
relations, it may also enhance performance via
smoothing effects. Splitting the data randomly could
produce a smoothed shared grammar, Gs, that is
a kind of held-out estimate which could be supe-
rior to the unsmoothed ML estimates for the single-
component grammar.
We tested the degree of generalization by eval-
uating the shared grammar alone and also a mix-
ture of the shared grammar with the known sin-
gle grammar. Those shared grammars were ex-
tracted after training the mixture model with four in-
dividual grammars. We found that both the shared
grammar alone (F1=79.13, LL=-333278) and the
shared grammar mixed with the single grammar
(F1=79.36, LL=-331546) perform worse than a sin-
19
gle PCFG (F1=79.37, LL=-327658). This indicates
that smoothing is not the primary learning effect
contributing to increased F1.
5 Conclusions
We examined the sorts of rule correlations that may
be found in natural language corpora, discovering
non-local correlations not captured by traditional
models. We found that using a model capable of
representing these non-local features gives improve-
ment in parsing accuracy and data likelihood. This
improvement is modest, however, primarily because
local correlations are so much stronger than non-
local ones.

References
J. Baker. 1979. Trainable grammars for speech recog-
nition. Speech Communication Papers for the 97th
Meeting of the Acoustical Society of America, pages
547–550.
K. Bock and H. Loebell. 1990. Framing sentences. Cog-
nition, 35:1–39.
R. Bod. 1992. A computational model of language per-
formance: Data oriented parsing. International Con-
ference on Computational Linguistics (COLING).
E. Charniak. 1996. Tree-bank grammars. In Proc. of
the 13th National Conference on Artificial Intelligence
(AAAI), pages 1031–1036.
E. Charniak. 2000. A maximum–entropy–inspired
parser. In Proc. of the Conference of the North Ameri-
can chapter of the Association for Computational Lin-
guistics (NAACL), pages 132–139.
M. Collins. 1999. Head-Driven Statistical Models for
Natural Language Parsing. Ph.D. thesis, Univ. of
Pennsylvania.
Z. Ghahramani and M. I. Jordan. 1994. Supervised
learning from incomplete data via an EM approach. In
Advances in Neural Information Processing Systems
(NIPS), pages 120–127.
D. Gildea. 2001. Corpus variation and parser perfor-
mance. Conference on Empirical Methods in Natural
Language Processing (EMNLP).
M. Johnson. 1998. Pcfg models of linguistic tree repre-
sentations. Computational Linguistics, 24:613–632.
D. Klein and C. Manning. 2003. Accurate unlexicalized
parsing. Proc. of the 41st Meeting of the Association
for Computational Linguistics (ACL), pages 423–430.
C. Manning and H. Sch¨utze. 1999. Foundations of Sta-
tistical Natural Language Processing. The MIT Press,
Cambridge, Massachusetts.
T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilis-
tic CFG with latent annotations. In Proc. of the 43rd
Meeting of the Association for Computational Linguis-
tics (ACL), pages 75–82.
A. McCallum, R. Rosenfeld, T. Mitchell, and A. Ng.
1998. Improving text classification by shrinkage in a
hierarchy of classes. In Int. Conf. on Machine Learn-
ing (ICML), pages 359–367.
M. Steedman. 2000. The Syntactic Process. The MIT
Press, Cambridge, Massachusetts.
D. Titterington, A. Smith, and U. Makov. 1962. Statisti-
cal Analysis of Finite Mixture Distributions. Wiley.
K. Vijay-Shankar and A. Joshi. 1985. Some computa-
tional properties of tree adjoining grammars. Proc. of
the 23th Meeting of the Association for Computational
Linguistics (ACL), pages 82–93.
