Parsing with Treebank Grammars: Empirical Bounds, Theoretical
Models, and the Structure of the Penn Treebank
Dan Klein and Christopher D. Manning
Computer Science Department
Stanford University
Stanford, CA 94305-9040
{klein, manning}@cs.stanford.edu
Abstract
This paper presents empirical studies and
closely corresponding theoretical models of
the performance of a chart parser exhaus-
tively parsing the Penn Treebank with the
Treebank’s own CFG grammar. We show
how performance is dramatically affected by
rule representation and tree transformations,
but little by top-down vs. bottom-up strate-
gies. We discuss grammatical saturation, in-
cluding analysis of the strongly connected
components of the phrasal nonterminals in
the Treebank, and model how, as sentence
length increases, the effective grammar rule
size increases as regions of the grammar
are unlocked, yielding super-cubic observed
time behavior in some configurations.
1 Introduction
This paper originated from examining the empirical
performance of an exhaustive active chart parser us-
ing an untransformed treebank grammar over the Penn
Treebank. Our initial experiments yielded the sur-
prising result that for many configurations empirical
parsing speed was super-cubic in the sentence length.
This led us to look more closely at the structure of
the treebank grammar. The resulting analysis builds
on the presentation of Charniak (1996), but extends
it by elucidating the structure of non-terminal inter-
relationships in the Penn Treebank grammar. On the
basis of these studies, we build simple theoretical
models which closely predict observed parser perfor-
mance, and, in particular, explain the originally ob-
served super-cubic behavior.
We used treebank grammars induced directly from
the local trees of the entire WSJ section of the Penn
Treebank (Marcus et al., 1993) (release 3). For each
length and parameter setting, 25 sentences evenly dis-
tributed through the treebank were parsed. Since we
were parsing sentences from among those from which
our grammar was derived, coverage was never an is-
sue. Every sentence parsed had at least one parse - the
parse with which it was originally observed.1
The sentences were parsed using an implementa-
tion of the probabilistic chart-parsing algorithm pre-
sented in (Klein and Manning, 2001). In that paper,
we present a theoretical analysis showing an O(n^3)
worst-case time bound for exhaustively parsing arbi-
trary context-free grammars. In what follows, we do
not make use of the probabilistic aspects of the gram-
mar or parser.
2 Parameters
The parameters we varied were:
• Tree Transforms: NOTRANSFORM, NOEMPTIES, NOUNARIESHIGH, and NOUNARIESLOW
• Grammar Rule Encodings: LIST, TRIE, or MIN
• Rule Introduction: TOPDOWN or BOTTOMUP
The default settings are NOTRANSFORM, TRIE, and BOTTOMUP.
We do not discuss all possible combinations of these settings. Rather, we take the bottom-up parser using an
untransformed grammar with trie rule encodings to be
the basic form of the parser. Except where noted, we
will discuss how each factor affects this baseline, as
most of the effects are orthogonal. When we name a
setting, any omitted parameters are assumed to be the
defaults.
2.1 Tree Transforms
In all cases, the grammar was directly induced from
(transformed) Penn treebank trees. The transforms
used are shown in figure 1. For all settings, func-
tional tags and cross-referencing annotations were
stripped. For NOTRANSFORM, no other modification
was made. In particular, empty nodes (represented as
-NONE- in the treebank) were turned into rules that
generated the empty string (ε), and there was no collapsing of categories (such as PRT and ADVP) as is often done in parsing work (Collins, 1997, etc.). For
1Effectively “testing on the training set” would be invalid if we wished to present performance results such as precision and recall, but it is not a problem for the present experiments, which focus solely on the parser load and grammar structure.
Figure 1: Tree Transforms, shown here in bracketed form with * for the empty terminal:
(a) the raw tree: (TOP (S-HLN (NP-SBJ (-NONE- *)) (VP (VB Atone))))
(b) NOTRANSFORM: (TOP (S (NP (-NONE- *)) (VP (VB Atone))))
(c) NOEMPTIES: (TOP (S (VP (VB Atone))))
(d) NOUNARIESHIGH: (TOP (S Atone))
(e) NOUNARIESLOW: (TOP (VB Atone))
[FSA diagrams for the LIST, TRIE, and MIN encodings]
Figure 2: Grammar Encodings: FSAs for a subset of
the rules for the category NP. Non-black states are
active, non-white states are accepting, and bold transi-
tions are phrasal.
NOEMPTIES, empties were removed by pruning nonterminals which covered no overt words. For NOUNARIESHIGH and NOUNARIESLOW, unary nodes were removed as well, by keeping only the tops and the bottoms of unary chains, respectively.2
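The NOEMPTIES pruning just described can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes a simple (label, children) tuple representation of trees.

```python
# Minimal sketch of the NOEMPTIES transform (not the authors' code):
# prune every nonterminal whose yield contains no overt words.
# Trees are (label, children) tuples; words are leaves with no children.

def is_empty(tree):
    """True if the tree yields only empty (-NONE-) material."""
    label, children = tree
    if not children:              # a leaf: an actual word
        return False
    if label == "-NONE-":         # an empty element
        return True
    return all(is_empty(c) for c in children)

def no_empties(tree):
    """Copy of the tree with all empty-yield subtrees pruned."""
    label, children = tree
    kept = [no_empties(c) for c in children if not is_empty(c)]
    return (label, kept)

# The tree of figure 1(b), writing '*' for the empty terminal:
raw = ("TOP", [("S", [("NP", [("-NONE-", [("*", [])])]),
                      ("VP", [("VB", [("Atone", [])])])])])
pruned = no_empties(raw)   # figure 1(c): (TOP (S (VP (VB Atone))))
```

The NOUNARIES transforms would be a further pass over the pruned tree, collapsing unary chains to their tops or bottoms.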
2.2 Grammar Rule Encodings
The parser operates on Finite State Automata (FSA)
grammar representations. We compiled grammar
rules into FSAs in three ways: LISTs, TRIEs, and
MINimized FSAs. An example of each representa-
tion is given in figure 2. For LIST encodings, each
local tree type was encoded in its own, linearly struc-
tured FSA, corresponding to Earley (1970)-style dot-
ted rules. For TRIE, there was one FSA per cate-
gory, encoding together all rule types producing that
category. For MIN, state-minimized FSAs were con-
structed from the trie FSAs. Note that while the rule
encoding may dramatically affect the efficiency of a
parser, it does not change the actual set of parses for a
given sentence in any way.3
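The state-count difference between the LIST and TRIE encodings can be made concrete with a toy grammar (hypothetical rules, not treebank counts): LIST has one dotted state per item of every rule, while TRIE shares one state per unique rule prefix within each category.

```python
# Toy illustration of why TRIE encodings have far fewer active states
# than LIST encodings. The rules below are hypothetical.

from collections import defaultdict

rules = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "JJ", "NN")),
    ("NP", ("NNP",)),
]

def list_states(rules):
    # one active state per dot position strictly inside each rule
    return sum(len(rhs) for _, rhs in rules)

def trie_states(rules):
    # one active state per unique proper prefix of an RHS, per category
    prefixes = defaultdict(set)
    for lhs, rhs in rules:
        for i in range(len(rhs)):
            prefixes[lhs].add(rhs[:i])
    return sum(len(p) for p in prefixes.values())
```

Here the shared DT prefix of the first two rules is stored once in the trie but twice in the list encoding; at treebank scale this sharing accounts for the large gap in figure 12.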
2In no case were the nonterminal-to-word or TOP-to-
nonterminal unaries altered.
3FSAs are not the only method of representing and com-
pacting grammars. For example, the prefix compacted tries
we use are the same as the common practice of ignoring
items before the dot in a dotted rule (Moore, 2000). Another
Figure 3: The average time to parse sentences using various parameters. Best-fit power-law exponents (and correlation r): List-NoTransform 3.54 (r = 0.999); Trie-NoTransform 3.16 (r = 0.995); Trie-NoEmpties 3.47 (r = 0.998); Trie-NoUnariesHigh 3.67 (r = 0.999); Trie-NoUnariesLow 3.65 (r = 0.999); Min-NoTransform 2.87 (r = 0.998); Min-NoUnariesLow 3.32 (r = 1.000).
3 Observed Performance
In this section, we outline the observed performance
of the parser for various settings. We frequently speak
in terms of the following:
• span: a range of words in the chart, e.g., [1,3]4
• edge: a category over a span, e.g., NP:[1,3]
• traversal: a way of making an edge from an active and a passive edge, e.g., NP:[1,3] ← (NP→DT.NN:[1,2] + NN:[2,3])
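These chart objects might be represented as follows; this is an illustrative sketch, not the actual parser's data structures.

```python
# Illustrative representations of spans, edges, and traversals
# (a sketch, not the implementation from Klein and Manning, 2001).

from collections import namedtuple

Span = namedtuple("Span", "start end")               # e.g. [1,3]
PassiveEdge = namedtuple("PassiveEdge", "cat span")  # e.g. NP:[1,3]
ActiveEdge = namedtuple("ActiveEdge", "state span")  # e.g. NP->DT.NN:[1,2]
Traversal = namedtuple("Traversal", "result active passive")

# NP:[1,3] <- (NP->DT.NN:[1,2] + NN:[2,3])
t = Traversal(result=PassiveEdge("NP", Span(1, 3)),
              active=ActiveEdge("NP->DT.NN", Span(1, 2)),
              passive=PassiveEdge("NN", Span(2, 3)))
```

Note the defining constraint of a traversal: the active edge ends where the passive edge begins, and the result covers their union.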
3.1 Time
The parser has an O(SCn^3) theoretical time bound, where n is the number of words in the sentence to be parsed, C is the number of nonterminal categories in the grammar, and S is the number of (active) states in the FSA encoding of the grammar. The time bound is derived from counting the number of traversals processed by the parser, each taking O(1) time.
In figure 3, we see the average time5 taken per sentence length for several settings, with the empirical exponent (and correlation r-value) from the best-fit simple power law model to the right. Notice that most settings show time growth greater than O(n^3).
Although O(n^3) is simply an asymptotic bound,
there are good explanations for the observed behav-
ior. There are two primary causes for the super-cubic
time values. The first is theoretically uninteresting.
The parser is implemented in Java, which uses garbage
collection for memory management. Even when there
is plenty of memory for a parse’s primary data struc-
tures, “garbage collection thrashing” can occur when
logical possibility would be trie encodings which compact
the grammar states by common suffix rather than common
prefix, as in (Leermakers, 1992). The savings are less than
for prefix compaction.
4Note that the number of words (or size) of a span is equal to the difference between the endpoints.
5The hardware was a 700 MHz Intel Pentium III, and we
used up to 2GB of RAM for very long sentences or very
poor parameters. With good parameter settings, the system
can parse 100+ word treebank sentences.
Figure 4: (a) The number of traversals for different grammar transforms: NoTransform exp 2.86 (r = 1.000), NoEmpties exp 3.28 (r = 1.000), NoUnariesHigh exp 3.74 (r = 0.999), NoUnariesLow exp 3.83 (r = 0.999). (b) The number of traversals for different grammar encodings: List exp 2.60 (r = 0.999), Trie exp 2.86 (r = 1.000), Min exp 2.78 (r = 1.000). (c) The ratio of the number of edges and traversals produced with a top-down strategy over the number produced with a bottom-up strategy (shown for TRIE-NOTRANSFORM; others are similar).
parsing longer sentences as temporary objects cause
increasingly frequent reclamation. To see past this ef-
fect, which inflates the empirical exponents, we turn to
the actual traversal counts, which better illuminate the
issues at hand. Figures 4 (a) and (b) show the traversal
curves corresponding to the times in figure 3.
The interesting cause of the varying exponents
comes from the “constant” terms in the theoretical
bound. The second half of this paper shows how
modeling growth in these terms can accurately predict
parsing performance (see figures 9 to 13).
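The empirical exponents and r-values quoted throughout come from fitting a simple power law t = c·n^e to the per-length averages. A minimal version of such a fit, via least squares in log-log space; the data here is synthetic (exactly cubic), purely to demonstrate the method.

```python
# Power-law fitting as used for the reported exponents: regress
# log(y) on log(x); the slope is the exponent e, and r is the
# correlation in log-log space.

import math

def fit_power_law(xs, ys):
    """Return (exponent e, correlation r) for the best fit y = c*x^e."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

lengths = [10, 20, 30, 40, 50]
times = [2.0e-4 * n ** 3 for n in lengths]   # synthetic cubic data
e, r = fit_power_law(lengths, times)         # e ~ 3.0, r ~ 1.0
```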
3.2 Memory
The memory bound for the parser is O(Sn^2). Since
the parser is running in a garbage-collected environ-
ment, it is hard to distinguish required memory from
utilized memory. However, unlike time and traversals
which in practice can diverge, memory requirements
match the number of edges in the chart almost exactly,
since the large data structures are all proportional in
size to the number of edges, which is O(Sn^2).6
Almost all edges stored are active edges (the overwhelming majority for sentences longer than 30 words), of which there can be O(Sn^2): one for every grammar state and span. Passive edges, of which there can be O(Cn^2), one for every category and span, are a shrinking minority. This is because, while C is bounded above by 27 in the treebank7 (for spans ≥ 2), S numbers in the thousands (see figure 12). Thus, required memory will be implicitly modeled when we model active edges in section 4.3.
3.3 Tree Transforms
Figure 4 (a) shows the effect of the tree transforms on
traversal counts. The NOUNARIES settings are much
more efficient than the others; however, this efficiency
comes at a price in terms of the utility of the final
parse. For example, regardless of which NOUNARIES
6A standard chart parser might conceivably require storing more than O(Sn^2) traversals on its agenda, but ours provably never does.
7This count is the number of phrasal categories with the
introduction of a TOP label for the unlabeled top treebank
nodes.
transform is chosen, there will be NP nodes missing
from the parses, making the parses less useful for any
task requiring NP identification. For the remainder of
the paper, we will focus on the settings NOTRANS-
FORM and NOEMPTIES.
3.4 Grammar Encodings
Figure 4 (b) shows the effect of each grammar encoding on
traversal counts. The more compacted the grammar
representation, the more time-efficient the parser is.
3.5 Top-Down vs. Bottom-Up
Figure 4 (c) shows the effect on total edges and
traversals of using top-down and bottom-up strategies.
There are some extremely minimal savings in traversals due to top-down filtering effects, but there is a corresponding penalty in edges as rules whose left-corner
cannot be built are introduced. Given the highly unre-
strictive nature of the treebank grammar, it is not very
surprising that top-down filtering provides such little
benefit. However, this is a useful observation about
real world parsing performance. The advantages of
top-down chart parsing in providing grammar-driven
prediction are often advanced (e.g., Allen 1995:66),
but in practice we find almost no value in this for broad coverage CFGs. While some part of this is perhaps due to errors in the treebank, a large part just reflects the true nature of broad coverage grammars: e.g., once
you allow adverbial phrases almost anywhere and al-
low PPs, (participial) VPs, and (temporal) NPs to be
adverbial phrases, along with phrases headed by ad-
verbs, then there is very little useful top-down control
left. With such a permissive grammar, the only real
constraints are in the POS tags which anchor the local
trees (see section 4.3). Therefore, for the remainder of
the paper, we consider only bottom-up settings.
4 Models
In the remainder of the paper we provide simple mod-
els that nevertheless accurately capture the varying
magnitudes and exponents seen for different grammar
encodings and tree transformations. Since the n^3 term of O(SCn^3) comes directly from the number of start,
split, and end points for traversals, it is certainly not
responsible for the varying growth rates. An initially
plausible possibility is that the quantity bounded by
the C term is non-constant in n in practice, because
longer spans are more ambiguous in terms of the num-
ber of categories they can form. This turns out to
be generally false, as discussed in section 4.2. Alter-
nately, the effective S term could be growing with n,
which turns out to be true, as discussed in section 4.3.
The number of (possibly zero-size) spans for a sentence of length n is fixed: (n + 1)(n + 2)/2. Thus,
to be able to evaluate and model the total edge counts,
we look to the number of edges over a given span.
Definition 1 The passive (or active) saturation of a
given span is the number of passive (or active) edges
over that span.
In the total time and traversal bound O(SCn^3), the effective value of S is determined by the active saturation, while the effective value of C is determined by
the passive saturation. An interesting fact is that the
saturation of a span is, for the treebank grammar and
sentences, essentially independent of what size sen-
tence the span is from and where in the sentence the
span begins. Thus, for a given span size, we report the
average over all spans of that size occurring anywhere
in any sentence parsed.
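The span count and the per-size averaging just described can be sketched directly; `edge_counts` below is a hypothetical chart summary (span → number of edges over it), not real parser output.

```python
# Direct check of the span count, plus the per-size averaging used
# for the saturation curves.

def spans(n):
    # all [i,j] with 0 <= i <= j <= n, zero-size spans included
    return [(i, j) for i in range(n + 1) for j in range(i, n + 1)]

n = 5
assert len(spans(n)) == (n + 1) * (n + 2) // 2    # 21 spans for n = 5

def avg_saturation(edge_counts):
    """Average edge count per span size."""
    by_size = {}
    for (i, j), c in edge_counts.items():
        by_size.setdefault(j - i, []).append(c)
    return {k: sum(v) / len(v) for k, v in by_size.items()}
```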
4.1 Treebank Grammar Structure
The reason that effective growth is not found in the
C component is that passive saturation stays almost
constant as span size increases. However, the more in-
teresting result is not that saturation is relatively con-
stant (for spans beyond a small, grammar-dependent
size), but that the saturation values are extremely large
compared to C (see section 4.2). For the NOTRANSFORM and NOEMPTIES grammars, most categories
are reachable from most other categories using rules
which can be applied over a single span. Once you get
one of these categories over a span, you will get the
rest as well. We now formalize this.
Definition 2 A category X is empty-reachable in a grammar G if X can be built using only empty terminals.
The empty-reachable set for the NOTRANSFORM
grammar is shown in figure 5.8 These 23 categories
plus the tag -NONE- create a passive saturation of 24
for zero-spans for NOTRANSFORM (see figure 9).
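Definition 2 suggests a simple fixpoint computation over the rules; the sketch below uses a toy grammar for illustration, not the treebank rules.

```python
# Empty-reachability (definition 2) as a fixpoint: start from the
# empty leaf '-NONE-' and keep adding categories all of whose RHS
# items are already empty-reachable.

def empty_reachable(rules):
    reach = {"-NONE-"}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in reach and all(x in reach for x in rhs):
                reach.add(lhs)
                changed = True
    return reach - {"-NONE-"}

toy = [("NP", ("-NONE-",)), ("VP", ("-NONE-",)),
       ("S", ("NP", "VP")), ("PP", ("IN", "NP"))]
# S is empty-reachable because both its children are; PP is not,
# since the tag IN must cover an overt word.
```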
Definition 3 A category Y is same-span-reachable from a category X in a grammar G if Y can be built from X using a parse tree in which, aside from at most
8The set of phrasal categories used in the Penn Treebank is documented in Manning and Schütze (1999, 413); Marcus et al. (1993, 281) has an early version.
ADJP ADVP FRAG INTJ NAC NP
NX PP PRN QP RRC S
SBAR SBARQ SINV SQ TOP UCP
VP WHADVP WHNP WHPP X
Figure 5: The empty-reachable set for the NOTRANS-
FORM grammar.
Figure 6: The same-span-reachability graph for the NOTRANSFORM grammar. Largest SCC: {ADJP, ADVP, FRAG, INTJ, NAC, NP, NX, PP, PRN, QP, RRC, S, SBAR, SBARQ, SINV, SQ, UCP, VP, WHNP, X}; remaining nodes: TOP, CONJP, LST, PRT, WHADJP, WHADVP, WHPP.
Figure 7: The same-span-reachability graph for the NOEMPTIES grammar. Largest SCC: {ADJP, ADVP, FRAG, INTJ, NP, PP, PRN, QP, S, SBAR, UCP, VP, WHNP}; remaining nodes: TOP, CONJP, LST, NAC, NX, SQ, X, RRC, PRT, WHADJP, SBARQ, WHADVP, SINV, WHPP.
one instance of X, every node not dominating that instance is an instance of an empty-reachable category.
The same-span-reachability relation induces a graph
over the 27 non-terminal categories. The strongly-
connected component (SCC) reduction of that graph is
shown in figures 6 and 7.9 Unsurprisingly, the largest
SCC, which contains most “common” categories (S,
NP, VP, PP, etc.) is slightly larger for the NOTRANS-
FORM grammar, since the empty-reachable set is non-
empty. However, note that even for NOTRANSFORM,
the largest SCC is smaller than the empty-reachable
set, since empties provide direct entry into some of the
lower SCCs, in particular because of WH-gaps.
Interestingly, this same high-reachability effect oc-
curs even for the NOUNARIES grammars, as shown in
the next section.
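The SCC reduction shown in figures 6 and 7 can be computed with any standard algorithm; below is a compact sketch using Tarjan's algorithm on a toy reachability graph (the graph is illustrative, not the treebank's).

```python
# SCC computation (Tarjan's algorithm) for a same-span-reachability
# graph given as {node: [successors]}.

def tarjan_scc(graph):
    index, low, on_stack, stack = {}, {}, set(), []
    sccs, counter = [], [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:       # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

# Toy graph: S, NP, VP mutually same-span-reachable; TOP, PRT outside.
toy = {"S": ["NP", "VP"], "NP": ["S"], "VP": ["NP"],
       "TOP": ["S"], "PRT": []}
components = tarjan_scc(toy)
```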
4.2 Passive Edges
The total growth and saturation of passive edges is rel-
atively easy to describe. Figure 8 shows the total num-
9Implied arcs have been removed for clarity. The relation
is in fact the transitive closure of this graph.
Figure 8: The average number of passive edges processed in practice (left), and predicted by our models (right). Observed exponents: NoTransform 1.84 (r = 1.000), NoEmpties 1.97 (r = 1.000), NoUnariesHigh 2.13 (r = 1.000), NoUnariesLow 2.21 (r = 0.999). Predicted: NoTransform 1.84 (r = 1.000), NoEmpties 1.95 (r = 1.000), NoUnariesHigh 2.08 (r = 1.000), NoUnariesLow 2.20 (r = 1.000).
Figure 9: The average passive saturation (number of passive edges) for a span of a given size as processed in practice (left), and as predicted by our models (right), for the NoTransform, NoEmpties, NoUnariesHigh, and NoUnariesLow settings.
ber of passive edges by sentence length, and figure 9
shows the saturation as a function of span size.10 The
grammar representation does not affect which passive
edges will occur for a given span.
The large SCCs cause the relative independence of
passive saturation from span size for the NOTRANS-
FORM and NOEMPTIES settings. Once any category in
the SCC is found, all will be found, as well as all cate-
gories reachable from that SCC. For these settings, the
passive saturation can be summarized by three saturation numbers: zero-spans (empties) sat0, one-spans (words) sat1, and all larger spans (categories) sat2. Taking averages directly from the data, we have our first model, shown on the right in figure 9.
For the NOUNARIES settings, there will be no
same-span reachability and hence no SCCs. To reach
a new category always requires the use of at least one
overt word. However, for spans of size 6 or so, enough
words exist that the same high saturation effect will
still be observed. This can be modeled quite simply
by assuming each terminal unlocks a fixed fraction of
the nonterminals, as seen in the right graph of figure 9,
but we omit the details here.
Using these passive saturation models, we can di-
rectly estimate the total passive edge counts by sum-
mation:
ptot(n) = Σ_{k=0}^{n} (n + 1 − k) · sat_k
10The maximum possible passive saturation for any span
greater than one is equal to the number of phrasal categories
in the treebank grammar: 27. However, empty and size-one
spans can additionally be covered by POS tag edges.
The predictions are shown in figure 8. For the NO-
TRANSFORM or NOEMPTIES settings, this reduces to:
ptot(n) = (n(n − 1)/2) · sat2 + n · sat1 + (n + 1) · sat0
We correctly predict that the passive edge total ex-
ponents will be slightly less than 2.0 when unaries are
present, and greater than 2.0 when they are not. With
unaries, the linear terms in the reduced equation are
significant over these sentence lengths and drag down
the exponent. The linear terms are larger for NO-
TRANSFORM and therefore drag the exponent down
more.11 Without unaries, the more gradual satura-
tion growth increases the total exponent, more so for
NOUNARIESLOW than NOUNARIESHIGH. However,
note that for spans around 8 and onward, the saturation curves are essentially constant for all settings.
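The reduced formula can be checked against the explicit per-size summation; the saturation constants below are illustrative stand-ins, not the measured treebank values.

```python
# Passive-total model: closed form vs. explicit summation.

def passive_total(n, sat0, sat1, sat2):
    # (n+1) zero-size spans, n one-word spans, n(n-1)/2 larger spans
    return (n + 1) * sat0 + n * sat1 + (n * (n - 1) // 2) * sat2

def passive_total_sum(n, sat):
    # ptot(n) = sum over k of (n + 1 - k) * sat(k)
    return sum((n + 1 - k) * sat(k) for k in range(n + 1))

sat0, sat1, sat2 = 24, 30, 27   # illustrative saturation constants
step = lambda k: sat0 if k == 0 else (sat1 if k == 1 else sat2)
assert passive_total(10, sat0, sat1, sat2) == passive_total_sum(10, step)
```

The linear terms n·sat1 and (n+1)·sat0 are what drag the best-fit exponent below 2 over the sentence lengths considered.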
4.3 Active Edges
Active edges are the vast majority of edges and essen-
tially determine (non-transient) memory requirements.
While passive counts depend only on the grammar
transform, active counts depend primarily on the encoding for general magnitude, but also on the transform for the details (and exponent effects). Figure 10 shows
the total active edges by sentence size for three set-
tings chosen to illustrate the main effects. Total active
growth is sub-quadratic for LIST, but has an exponent
of up to about 2.4 for the TRIE settings.
11Note that, over these values of n, even a basic quadratic function like the simple sum Σ_{k≤n} k = n(n + 1)/2 has a best-fit simple power curve exponent well below 2 for the same reason. Moreover, a curve can have a higher best-fit exponent than another and yet never actually outgrow it.
Figure 10: The average number of active edges for sentences of a given length as observed in practice (left), and as predicted by our models (right). Observed exponents: List-NoTransform 1.88 (r = 0.999), Trie-NoTransform 2.18 (r = 0.999), Trie-NoEmpties 2.43 (r = 0.999). Predicted: List-NoTransform 1.81 (r = 0.999), Trie-NoTransform 2.10 (r = 1.000), Trie-NoEmpties 2.36 (r = 1.000).
Figure 11: The average active saturation (number of active edges) for a span of a given size as processed in practice (left), and as predicted by our models (right). Observed exponents: List-NoTransform 0.092 (r = 0.957), Trie-NoTransform 0.323 (r = 0.999), Trie-NoEmpties 0.389 (r = 0.997). Predicted: List-NoTransform 0.111 (r = 0.999), Trie-NoTransform 0.297 (r = 0.998), Trie-NoEmpties 0.298 (r = 0.991).
        NOTRANS   NOEMPTIES   NOUHIGH   NOULOW
LIST      80120       78233     81287   100818
TRIE      17298       17011     17778    22026
MIN        2631        2610      2817     3250
Figure 12: Grammar sizes: active state counts.
To model the active totals, we again begin by mod-
eling the active saturation curves, shown in figure 11.
The active saturation for any span is bounded above by S, the number of active grammar states (states in the grammar FSAs which correspond to active edges). For
list grammars, this number is the sum of the lengths of
all rules in the grammar. For trie grammars, it is the
number of unique rule prefixes (including the LHS)
in the grammar. For minimized grammars, it is the
number of states with outgoing transitions (non-black
states in figure 2). The value of S is shown for each
setting in figure 12. Note that the maximum number of
active states is dramatically larger for lists since com-
mon rule prefixes are duplicated many times. For min-
imized FSAs, the state reduction is even greater. Since
states which are earlier in a rule are much more likely
to match a span, the fact that tries (and min FSAs)
compress early states is particularly advantageous.
Unlike passive saturation, which was relatively
close to its bound C, active saturation is much farther below S. Furthermore, while passive saturation was
relatively constant in span size, at least after a point,
active saturation quite clearly grows with span size,
even for spans well beyond those shown in figure 11.
We now model these active saturation curves.
What does it take for a given active state to match a
given span? For TRIE and LIST, an active state cor-
responds to a prefix of a rule and is a mix of POS
tags and phrasal categories, each of which must be
matched, in order, over that span for that state to be
reached. Given the large SCCs seen in section 4.1,
phrasal categories, to a first approximation, might as
well be wildcards, able to match any span, especially
if empties are present. However, the tags are, in com-
parison, very restricted. Tags must actually match a
word in the span.
More precisely, consider an active state a in the grammar and a span s. In the TRIE and LIST encodings, there is some, possibly empty, list L of labels that must be matched over s before an active edge with this state can be constructed over that span.12 Assume that the phrasal categories in L can match any span (or any non-zero span in NOEMPTIES).13 Therefore, phrasal categories in L do not constrain whether a can match s. The real issue is whether the tags in L will match words in s. Assume that a random tag matches a random word with a fixed probability p, independently of where the tag is in the rule and where the word is in the sentence.14 Assume further that, although tags occur more often than categories in rules (63.9% of rule items are tags in the NOTRANSFORM case15), given a
12The essence of the MIN model, which is omitted here,
is that states are represented by the “easiest” label sequence
which leads to that state.
13The model for the NOUNARIES cases is slightly more
complex, but similar.
14This is of course false; in particular, tags at the end of
rules disproportionately tend to be punctuation tags.
15Although the present model does not directly apply to
the NOUNARIES cases, NOUNARIESLOW is significantly
fixed number of tags and categories, all permutations
are equally likely to appear as rules.16
Under these assumptions, the probability that an active state a is in the treebank grammar will depend only on the number t of tags and c of categories in L. Call this pair sig(a) = (t, c) the signature of a. For a given signature σ, let count(σ) be the number of active states in the grammar which have that signature. Now, take a state a of signature (t, c) and a span s. If we align the tags in a with words in s and align the categories in a with spans of words in s, then provided the categories align with a non-empty span (for NOEMPTIES) or any span at all (for NOTRANSFORM), the question of whether this alignment of a with s matches is determined entirely by the t tags. However, with our assumptions, the probability that a randomly chosen set of t tags matches a randomly chosen set of t words is simply p^t.
We then have an expression for the chance of matching a specific alignment of an active state to a specific span. Clearly, there can be many alignments which differ only in the spans of the categories, but line up the same tags with the same words. However, there will be a certain number of unique ways in which the words and tags can be lined up between a and s. If we know this number, we can calculate the total probability that there is some alignment which matches. For example, consider the state NP → NP CC NP . PP (which has signature (1,2); the PP has no effect) over a span of length n, with empties available. The NPs can match any span, so there are n alignments which are distinct from the standpoint of the CC tag: it can be in any position. The chance that some alignment will match is therefore 1 − (1 − p)^n, which, for small p, is roughly linear in n. It should be clear that for an active state like this, the longer the span, the more likely it is that this state will be found over that span.
It is unfortunately not the case that all states
with the same signature will match a span length
with the same probability. For example, the state
NP → NP NP CC . NP has the same signature, but must
align the CC with the final element of the span. A state
like this will not become more likely (in our model) as
span size increases. However, with some straightfor-
ward but space-consuming recurrences, we can calcu-
late the expected chance that a random rule of a given
signature will match a given span length. Since we
know how many states have a given signature, we can
calculate the total active saturation a63a113a61a90a63a38a64a50a3a25a5a71a9 as
a63a113a61a23a63a36a64a50a3a25a5a71a9a123a29a78a72a78a124a125a111a50a69a10a117a84a5a118a64a50a3a39a112a8a9a70a28a46a126a10a127
a124a8a128a129
a3a25a130a122a63a38a64a70a111a120a131a132a3a25a63a91a116a56a5a71a9a134a133
more efficient than NOUNARIESHIGH despite having more
active states, largely because using the bottoms of chains in-
creases the frequency of tags relative to categories.
16This is also false; tags occur slightly more often at the
beginnings of rules and less often at the ends.
This model has two parameters. First, there is p, which we estimated directly by looking at the expected match
between the distribution of tags in rules and the distri-
bution of tags in the Treebank text (which is around
1/17.7). No factor for POS tag ambiguity was used,
another simplification.17 Second, there is the map count from signatures to numbers of active states,
which was read directly from the compiled grammars.
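A sketch of how these pieces fit together follows. The exact per-signature match probability requires the recurrences mentioned above, so `match` here uses only the paper's worked single-tag case and treats tag-free signatures as always matching; the signature counts are hypothetical.

```python
# Sketch of the active-saturation model. P is the tag-word match
# probability (estimated at about 1/17.7 above); `counts` is a
# hypothetical signature table, not the grammar's.

P = 1 / 17.7

def match_one_tag(n, p=P):
    # e.g. NP -> NP CC NP . PP over a span of n words, with empties:
    # the single CC tag can align with any of the n words
    return 1 - (1 - p) ** n

def active_saturation(n, counts, match):
    # asat(n) = sum over signatures sig of count(sig) * E[match | sig, n]
    return sum(c * match(sig, n) for sig, c in counts.items())

counts = {(1, 2): 100, (0, 1): 50}     # hypothetical signature counts
match = lambda sig, n: match_one_tag(n) if sig[0] == 1 else 1.0
sat10 = active_saturation(10, counts, match)   # grows with span size
```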
This model predicts the active saturation curves
shown to the right in figure 11. Note that the model,
though not perfect, exhibits the qualitative differences
between the settings, both in magnitudes and expo-
nents.18 In particular:
• The transform primarily changes the saturation over short spans, while the encoding determines the overall magnitudes. For example, in TRIE-NOEMPTIES the low-span saturation is lower than in TRIE-NOTRANSFORM, since short spans in the former case can match only signatures which have both t and c small, while in the latter only t needs to be small. Therefore, the several hundred states which
are reachable only via categories all match every
span starting from size 0 for NOTRANSFORM, but
are accessed only gradually for NOEMPTIES. How-
ever, for larger spans, the behavior converges to
counts characteristic for TRIE encodings.
• For LIST encodings, the early saturations are huge, due to the fact that most of the states which are available early for trie grammars are precisely the ones duplicated up to thousands of times in the list grammars. However, the additive gain over the initial states is roughly the same for both, as after a few items are specified, the tries become sparse.
• The actual magnitudes and exponents19 of the sat-
urations are surprisingly well predicted, suggesting
that this model captures the essential behavior.
These active saturation curves produce the active total curves in figure 10, which are also qualitatively correct in both magnitudes and exponents.
4.4 Traversals
Now that we have models for active and passive edges,
we can combine them to model traversal counts as
well. We assume that the chance for a passive edge
and an active edge to combine into a traversal is a sin-
gle probability representing how likely an arbitrary active state is to have a continuation with a label matching an arbitrary passive state. List rule states have only
one continuation, while trie rule states in the branch-
17In general, the p we used was lower for not having modeled tagging ambiguity, but higher for not having modeled the fact that the SCCs are not of size 27.
18And does so without any “tweakable” parameters.
19Note that the list curves do not compellingly suggest a
power law model.
[Figure 13 plot data: x-axis Sentence Length (0-50), y-axis Avg. Traversals (0.0M-20.0M). Left panel (observed): List-NoTransform exp 2.60 (r 0.999), Trie-NoTransform exp 2.86 (r 1.000), Trie-NoEmpties exp 3.28 (r 1.000). Right panel (predicted): List-NoTransform exp 2.60 (r 0.999), Trie-NoTransform exp 2.92 (r 1.000), Trie-NoEmpties exp 3.47 (r 1.000).]
Figure 13: The average number of traversals for sentences of a given length as observed in practice (left), and as
predicted by the models presented in the latter part of the paper (right).
ing portion of the trie average about 3.7 (min FSAs 4.2).20 Making another uniformity assumption, we assume that this combination probability is the continuation degree divided by the total number of passive labels, categorical or tag (73).
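Concretely, a traversal joins an active edge over [i, j] with a passive edge over [j, k], so the model sums the product of the two saturations and the combination probability over all such position triples. A sketch under the section's uniformity assumptions (hypothetical code, not the paper's; the saturation functions are whatever model one plugs in):

```python
# Sketch: expected traversal count under the uniformity assumption.
# A traversal combines an active edge over [i, j] with a passive edge
# over [j, k]; each pairing succeeds with probability
# p = (average continuation degree) / (number of passive labels),
# e.g. roughly 3.7 / 73 for trie states in the branching portion.

def expected_traversals(n, active_sat, passive_sat, degree, labels=73):
    """Modeled traversals for a sentence of n words (positions 0..n)."""
    p = degree / labels
    total = 0.0
    for i in range(n + 1):
        for j in range(i, n + 1):        # active edge spans [i, j]
            for k in range(j, n + 1):    # passive edge spans [j, k]
                total += active_sat(j - i) * passive_sat(k - j) * p
    return total
```

Since the triple loop contributes a cubic number of sites in n, any growth of the saturations in span size is what lifts the modeled traversal totals above cubic.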
In figure 13, we give graphs and exponents of the traversal counts, both observed and predicted, for various settings. Our model correctly predicts the approximate values and qualitative facts, including:
• For LIST, the observed exponent is lower than for TRIEs, though the total number of traversals is dramatically higher. This is because the active saturation is growing much faster for TRIEs; note that in cases like this the lower-exponent curve will never actually outgrow the higher-exponent curve.
• Of the settings shown, only TRIE-NOEMPTIES exhibits super-cubic traversal totals. Despite their similar active and passive exponents, TRIE-NOEMPTIES and TRIE-NOTRANSFORM differ in traversal growth due to the "early burst" of active edges which gives TRIE-NOTRANSFORM significantly more edges over short spans than its power law would predict. This excess leads to a sizeable quadratic addend in the number of traversals, causing the average best-fit exponent to drop without greatly affecting the overall magnitudes.
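The best-fit exponents quoted throughout can be recovered by ordinary least squares in log-log space; a minimal sketch of that fitting procedure (our own illustration, not the paper's code):

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y); return (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    # Slope of the log-log regression line is the exponent b.
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b
```

Note that a lower-order addend, like the quadratic excess from an early burst of edges, pulls the fitted b below the true asymptotic exponent, which is exactly the effect described above.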
Overall, growth of saturation values in span size increases best-fit traversal exponents, while early spikes in saturation reduce them. The traversal exponents therefore range from LIST-NOTRANSFORM at 2.6 to TRIE-NOUNARIESLOW at over 3.8. However, the final performance depends more on the magnitudes, which range from LIST-NOTRANSFORM as the worst, despite its exponent, to MIN-NOUNARIESHIGH as the best. The single biggest factor in time and traversal performance turned out to be the encoding, which is fortunate because the choice of grammar transform will depend greatly on the application.
20This is a simplification as well, since the shorter prefixes that tend to have higher continuation degrees are on average also a larger fraction of the active edges.
5 Conclusion
We built simple but accurate models on the basis of two observations. First, passive saturation is relatively constant in span size, but large due to high reachability among phrasal categories in the grammar. Second, active saturation grows with span size because, as spans increase, the tags in a given active edge are more likely to find a matching arrangement over a span. Combining these models, we demonstrated that a wide range of empirical qualitative and quantitative behaviors of an exhaustive parser could be derived, including the potential super-cubic traversal growth over sentence lengths of interest.

References

James Allen. 1995. Natural Language Understand-
ing. Benjamin Cummings, Redwood City, CA.

Eugene Charniak. 1996. Tree-bank grammars. In
Proceedings of the Thirteenth National Conference
on Artificial Intelligence, pages 1031-1036.
Michael John Collins. 1997. Three generative, lexicalised models for statistical parsing. In ACL/EACL 8, pages 16-23.

Jay Earley. 1970. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94-102.

Dan Klein and Christopher D. Manning. 2001. An
agenda-based chart parser for arbitrary prob-
abilistic context-free grammars. Technical Report
dbpubs/2001-16, Stanford University.

R. Leermakers. 1992. A recursive ascent Earley
parser. Information Processing Letters, 41:87-91.

Christopher D. Manning and Hinrich Schütze. 1999.
Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1993. Building a large annotated
corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.

Robert C. Moore. 2000. Improved left-corner chart
parsing for large context-free grammars. In Proceedings of the Sixth International Workshop on
Parsing Technologies.
