Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X),
pages 29–36, New York City, June 2006. ©2006 Association for Computational Linguistics
What are the Productive Units of Natural Language Grammar? A DOP Approach to the Automatic Identification of Constructions
Willem Zuidema
Institute for Logic, Language and Computation
University of Amsterdam
Plantage Muidergracht 24, 1018 TV, Amsterdam, the Netherlands.
jzuidema@science.uva.nl
Abstract
We explore a novel computational approach to identifying "constructions" or "multi-word expressions" (MWEs) in an annotated corpus. In this approach, MWEs have no special status, but emerge in a general procedure for finding the best statistical grammar to describe the training corpus. The statistical grammar formalism used is that of stochastic tree substitution grammars (STSGs), such as used in Data-Oriented Parsing. We present an algorithm for calculating the expected frequencies of arbitrary subtrees given the parameters of an STSG, and a method for estimating the parameters of an STSG given observed frequencies in a tree bank. We report quantitative results on the ATIS corpus of phrase-structure annotated sentences, and give examples of the MWEs extracted from this corpus.
1 Introduction
Many current theories of language use and acquisition assume that language users store and use much larger fragments of language than the single words and rules of combination of traditional linguistic models. Such fragments are often called constructions, and the theories that assign them a central role are known as "construction grammars" (Goldberg, 1995; Kay and Fillmore, 1999; Tomasello, 2000; Jackendoff, 2002, among others). For construction grammarians, multi-word expressions (MWEs) such as idioms, collocations, fixed expressions and compound verbs and nouns, are not so much exceptions to the rule, but rather extreme cases that reveal some fundamental properties of natural language.
In the construction grammar tradition, co-occurrence statistics from corpora have often been used as evidence for hypothesized constructions. However, such statistics are typically gathered on a case-by-case basis, and no reliable procedure exists to automatically identify constructions. In contrast, in computational linguistics, many automatic procedures for identifying MWEs have been studied (Sag et al., 2002), with varying success, but here MWEs are treated as exceptions: identifying multi-word expressions is a pre-processing step, where typically adjacent words are grouped together, after which the usual procedures for syntactic or semantic analysis can be applied. In this paper I explore an alternative formal and computational approach, where multi-word constructions have no special status, but emerge in a general procedure to find the best statistical grammar to describe a training corpus. Crucially, I use a formalism known as Stochastic Tree Substitution Grammars (henceforth, STSGs), which can represent single words, contiguous and noncontiguous MWEs, context-free rules or complete parse trees in a unified representation.
My approach is closely related to work in statistical parsing known as Data-Oriented Parsing (DOP), an empirically highly successful approach with labeled recall and precision scores on the Penn Tree Bank that are among the best currently obtained (Bod, 2003). DOP, first proposed in (Scha, 1990), can be seen as an early formalization and combination of ideas from construction grammar and statistical parsing. Its key innovations were (i) the proposal to use fragments of trees from a tree bank as the symbolic backbone; (ii) the proposal to allow, in principle, trees of arbitrary size and shape as the elementary units of combination; (iii) the proposal to use occurrence and co-occurrence frequencies as the basis for structural disambiguation in parsing.
The model I develop in this paper is true to these general DOP ideals, although it differs in important respects from the many DOP implementations that have been studied since its first inception (Bod, 1993; Goodman, 1996; Bod, 1998; Sima’an, 2002; Collins and Duffy, 2002; Bod et al., 2003, and many others). The crucial difference is in the estimation procedure for choosing the weights of the STSG based on observed frequencies in a corpus. Existing DOP models converge to STSGs that either (i) give all subtrees of the observed trees nonzero weights (Bod, 1993; Bod, 2003), or (ii) give only the largest possible fragments nonzero weights (Sima’an and Buratto, 2003; Zollmann and Sima’an, 2005). The model in this paper, in contrast, aims at finding the smallest set of productive units that explains the occurrences and co-occurrences in a corpus. Large subtrees only receive nonzero weights if they occur more frequently than can be expected on the basis of the weights of smaller subtrees.
2 Formalism, Notation and Definitions
2.1 Stochastic Tree Substitution Grammars
STSGs are a simple generalization of Stochastic Context-Free Grammars (henceforth, SCFGs), where the productive units are elementary trees of arbitrary size instead of the rewrite rules of SCFGs (which can be viewed as trees of depth 1). STSGs form a restricted subclass of Stochastic Tree-Adjoining Grammars (henceforth, STAGs) (Resnik, 1992; Schabes, 1992), the difference being that STSGs only allow substitution and not adjunction (Joshi and Sarkar, 2003). This limits their generative capacity to that of context-free grammars, and means STSGs cannot be fully lexicalized. These limitations notwithstanding, the close relationship with STAGs is an attractive feature with extensions to the class of mildly context-sensitive languages (Joshi et al., 1991) in mind. Most importantly, however, STSGs are already able to model a vast range of statistical dependencies between words and constituents, which allows them to correctly predict the occurrences of many constructions (Bod, 1998).
For completeness, we include the usual definitions of STSGs, the substitution operation and derivation and parse probabilities (Bod, 1998), using our own notation. An STSG is a 5-tuple $\langle V_n, V_t, S, T, w \rangle$, where $V_n$ is the set of non-terminal symbols; $V_t$ is the set of terminal symbols; $S \in V_n$ is the start symbol; $T$ is a set of elementary trees, such that for every $t \in T$ the unique root node $r(t) \in V_n$, the set of internal nodes $i(t) \subset V_n$ and the set of leaf nodes $l(t) \subset V_n \cup V_t$; finally, $w : T \to [0,1]$ is a probability (weight) distribution over the elementary trees, such that for any $t \in T$, $\sum_{t' \in R(t)} w(t') = 1$, where $R(t)$ is the set of elementary trees with the same root label as $t$. It will prove useful to also define the set of all possible trees $\theta$ over the defined alphabets (with the same conditions on root, internal and leaf nodes as for $T$), and the set of all possible complete parse trees $\Theta$ (with $r(t) = S$ and all leaf nodes $l(t) \subset V_t$). Obviously, $T \subset \theta$ and $\Theta \subset \theta$.
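As a concrete toy illustration of this 5-tuple and the per-root-label normalization condition on $w$, a minimal container might look as follows; the nested-tuple tree encoding and all names are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class STSG:
    """Toy container for the 5-tuple <Vn, Vt, S, T, w> (illustrative encoding)."""
    Vn: set     # non-terminal symbols
    Vt: set     # terminal symbols
    S: str      # start symbol
    T: list     # elementary trees as nested tuples, e.g. ("NP", ("DT", "the"), ...)
    w: dict     # elementary tree -> weight in [0, 1]

    def check_weights(self, tol=1e-9):
        """Check that weights sum to 1 within each root-label category R(t)."""
        by_root = defaultdict(float)
        for t in self.T:
            by_root[t[0]] += self.w[t]
        return all(abs(s - 1.0) < tol for s in by_root.values())

g = STSG({"S", "NP"}, {"john"}, "S",
         [("S", ("NP",)), ("NP", "john")],
         {("S", ("NP",)): 1.0, ("NP", "john"): 1.0})
print(g.check_weights())   # True
```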
The substitution operation $\circ$ is defined if the leftmost nonterminal leaf in $t_1$ is identical to the root of $t_2$. Performing the substitution $t_1 \circ t_2$ yields $t_3$, if $t_3$ is identical to $t_1$ with the leftmost nonterminal leaf replaced by $t_2$. A derivation is a sequence of elementary trees, where the first tree $t \in T$ has root label $S$ and every next tree combines through substitution with the result of the substitutions before it. The probability of a derivation $d$ is defined as the product of the weights of the elementary trees involved:
\[
P(d = t_1 \circ \ldots \circ t_n) = \prod_{i=1}^{n} w(t_i). \tag{1}
\]
A parse tree is any tree $t \in \Theta$. Multiple derivations can yield the same parse tree; the probability of a parse tree $p$ equals the sum of the probabilities of the different derivations that yield that same tree:
\[
P(p) = \sum_{d:\,\hat{d}=p} P(d), \tag{2}
\]
where $\hat{d}$ is the tree derived by derivation $d$.
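Equations (1) and (2) are straightforward to compute once the derivations are given. A minimal sketch, under the assumption that an elementary tree can be represented by any hashable label and `w` maps elementary trees to their weights:

```python
from math import prod

def derivation_prob(derivation, w):
    """P(d): product of the weights of the elementary trees in d (equation 1)."""
    return prod(w[t] for t in derivation)

def parse_prob(derivations_of_p, w):
    """P(p): sum of P(d) over all derivations d yielding parse tree p (equation 2)."""
    return sum(derivation_prob(d, w) for d in derivations_of_p)

# Toy STSG fragment: two derivations of the same parse tree.
w = {"S->NP VP": 0.5, "NP->John": 1.0, "VP->sleeps": 1.0, "S->John VP": 0.5}
d1 = ["S->NP VP", "NP->John", "VP->sleeps"]   # three depth-1 steps
d2 = ["S->John VP", "VP->sleeps"]             # one larger elementary tree
print(parse_prob([d1, d2], w))                # 0.5*1*1 + 0.5*1 = 1.0
```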
In this paper, we are only concerned with grammars that define proper probability distributions over trees, such that the probabilities of all derivations sum to 1 and no probability mass gets lost in derivations that never reach a terminal yield. We require:
\[
\sum_{p \in \Theta} P(p) = \sum_{d:\,\hat{d} \in \Theta} P(d) = 1. \tag{3}
\]
2.2 Usage Frequency and Occurrence Frequency
In addition to these conventional definitions, we will make use in this paper of the concepts "usage frequency" and "occurrence frequency". When we consider an arbitrary subtree $t$, the usage frequency $u(t)$ describes the relative frequency with which elementary tree $t$ is involved in a set of derivations. Given a grammar $G \in \mathrm{STSG}$, the expected usage frequency is:
\[
u(t) = \sum_{d:\,t \in d} P(d)\, C(t,d), \tag{4}
\]
where $C(t,d)$ gives the number of occurrences of $t$ in $d$. The set of derivations, and hence the usage frequency, is usually considered hidden information.
The occurrence frequency $f(t)$ describes the relative frequency with which $t$ occurs as a subtree of a set of parse trees, which is usually assumed to be observable information. If grammar $G$ is used to generate trees, it will create a tree bank where each parse tree occurs with an expected frequency as in equation (2). More generally, the expected occurrence frequency $f(t)$ (relative to the number $n$ of complete trees in the tree bank) of a subtree $t$ is:
\[
E[f(t)] = \sum_{p:\,t \in p^{*}} P(p)\, C(t,p^{*}), \tag{5}
\]
where $p^{*}$ is the multiset of all subtrees of $p$.
Hence, $w(t)$, $u(t)$ and $f(t)$ all assign values (the latter two not necessarily between 0 and 1) to trees. An important question is how these different values can be related. For STSGs which have only elementary trees of depth 1, and are thus equivalent to SCFGs, these relations are straightforward: the usage frequency of an elementary tree simply equals its expected frequency, and can be derived from the weights by multiplying inside and outside probabilities (Lari and Young, 1990). Estimating the weights of an (unconstrained and untransformed) SCFG from a tree bank is straightforward, as the weights, in the limit, simply equal the relative frequency of each depth-1 subtree (relative to other depth-1 subtrees with the same root label).
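The relative-frequency estimator for the depth-1 (SCFG) case can be sketched as follows; the nested-tuple tree encoding and function names are assumptions for illustration, not the paper's implementation:

```python
from collections import Counter

def depth1_rules(tree):
    """Read off each node's depth-1 rule as a (parent, children-labels) pair.
    Trees are nested tuples like ("S", ("NP", "John"), ("VP", "sleeps"));
    a bare string is a terminal."""
    if isinstance(tree, str):
        return
    label, *children = tree
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        yield from depth1_rules(c)

def relative_frequency_weights(treebank):
    """Weight of each rule: its count relative to all rules with the same root."""
    counts = Counter(r for t in treebank for r in depth1_rules(t))
    totals = Counter()
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}

bank = [("S", ("NP", "John"), ("VP", "sleeps")),
        ("S", ("NP", "Mary"), ("VP", "sleeps"))]
weights = relative_frequency_weights(bank)
print(weights[("NP", ("John",))])   # 0.5: "John" is 1 of 2 NP expansions
```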
When elementary trees can be of arbitrary depth, however, many different derivations can yield the same tree, and a given subtree $t$ can emerge without the corresponding elementary tree ever having been used. The expected frequencies are sums of products, and, if one wants to avoid exhaustively enumerating all possible parse trees, surprisingly difficult to calculate, as will become clear below.
2.3 From weights to usage frequencies and back
Relating usage frequencies to weights is relatively simple. With a bit of algebra we can work out the following relations:
\[
u(t) =
\begin{cases}
w(t) & \text{if } r(t) = S \\[4pt]
w(t) \displaystyle\sum_{t':\,r(t) \in l(t')} u(t')\, C_{t't} & \text{otherwise}
\end{cases} \tag{6}
\]
where $C_{t't}$ gives the number of occurrences of the root label $r(t)$ of $t$ among the leaves of $t'$. The inverse relation is straightforward:
\[
w(t) = \frac{u(t)}{\sum_{t' \in R(t)} u(t')}. \tag{7}
\]
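Equation (7) amounts to renormalizing usage frequencies within each root category; a minimal sketch, with an assumed string encoding of elementary trees:

```python
from collections import defaultdict

def weights_from_usage(usage, root):
    """Equation (7): w(t) = u(t) / sum of u(t') over trees t' with the same root.
    `usage` maps elementary trees to usage frequencies; `root` returns a tree's
    root label (both encodings are illustrative)."""
    totals = defaultdict(float)
    for t, u in usage.items():
        totals[root(t)] += u
    return {t: u / totals[root(t)] for t, u in usage.items()}

usage = {"NP->John": 3.0, "NP->Mary": 1.0, "VP->sleeps": 5.0}
root = lambda t: t.split("->")[0]
print(weights_from_usage(usage, root))
# {'NP->John': 0.75, 'NP->Mary': 0.25, 'VP->sleeps': 1.0}
```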
2.4 From usage frequency to expected
frequency
The two remaining problems, calculating expected frequencies from weights and estimating the weights from observed frequencies, are surprisingly difficult and heretofore not satisfactorily solved. In (Zuidema, 2006) we evaluate existing estimation methods for Data-Oriented Parsing, and show that they are ill-suited for learning tasks such as studied in this paper. In the next section, we present a new algorithm for estimation, which makes use of a method for calculating expected frequencies that we sketch in this section. This method makes use of sub- and supertree relations that we explain first.
We define two types of subtrees of a given tree $t$, which, for lack of better terminology, we will call "twigs" and "prunes" of $t$. Twigs are those subtrees headed by any of $t$'s internal nodes and everything below. Prunes are those subtrees headed by $t$'s root node, pruned at any number ($\geq 0$) of internal nodes. Using $\circ$ to indicate left-most substitution, we write:
• $t_1$ is a twig of $t_2$, if either $t_1 = t_2$ or $\exists t_3$, such that $t_3 \circ t_1 = t_2$;
• $t_1$ is a prune of $t_2$, if either $t_1 = t_2$ or $\exists t_3 \ldots t_n$, such that $t_1 \circ t_3 \ldots \circ t_n = t_2$;
• $t' = pr_x(t)$, if $x$ is a set of nodes in $t$, such that if $t$ is pruned at each $i \in x$ it equals $t'$.
Thus defined, the set of all subtrees $st(t)$ of $t$ corresponds to the set of all prunes of all twigs of $t$: $st(t) = \{t'' \mid \exists t'\,(t' \in tw(t) \wedge t'' \in pr(t'))\}$.
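The twig and prune sets can be enumerated directly from these definitions. A sketch under an assumed encoding (nested tuples; a bare string is a terminal, a 1-tuple like `("NP",)` is a nonterminal leaf, i.e. an open substitution site); names are illustrative:

```python
from itertools import product

def twigs(t):
    """t itself plus every subtree headed by an internal node of t."""
    if isinstance(t, str) or len(t) == 1:
        return []
    out = [t]
    for child in t[1:]:
        out += twigs(child)
    return out

def prunes(t):
    """All trees obtained by cutting t at any subset of its internal nodes
    (a cut node keeps its label and becomes a nonterminal leaf)."""
    label, children = t[0], t[1:]
    child_opts = []
    for c in children:
        if isinstance(c, str) or len(c) == 1:
            child_opts.append([c])            # leaf: nothing to prune
        else:
            opts = [(c[0],)]                  # either cut the child here...
            opts += prunes(c)                 # ...or keep it and recurse below
            child_opts.append(opts)
    return [(label,) + combo for combo in product(*child_opts)]

t = ("NP", ("DT", "the"), ("NN", "dog"))
subtrees = {p for tg in twigs(t) for p in prunes(tg)}   # st(t): prunes of twigs
print(len(twigs(t)), len(prunes(t)), len(subtrees))     # 3 4 6
```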
We further define the sets of supertwigs, superprunes and supertrees as follows:
• $\widehat{tw}(t) = \{t' \mid t \in tw(t')\}$
• $\widehat{pr}_x(t) = \{t' \mid t = pr_x(t')\}$
• $\widehat{st}(t) = \{t' \mid t \in st(t')\}$.
Using these sets, and the set of derivations $D(t)$ of the fragment $t$, a general expression for the expected frequency of $t$ is:
\[
E[f(t)] = \sum_{d \in D(t)} \alpha_d\, \beta_d, \tag{8}
\]
with
\[
\alpha_d = \sum_{\tau \in \widehat{tw}(d_1)}\ \sum_{\tau' \in \widehat{pr}_x(\tau)} u(\tau'), \qquad
\beta_d = \prod_{t' \in \langle d_2, \ldots, d_n \rangle}\ \sum_{\tau' \in \widehat{pr}_x(t')} w(\tau'),
\]
where $\langle d_1, \ldots, d_n \rangle$ is the sequence of elementary trees in derivation $d$. A derivation of this equation is provided on the author's website¹. Note that it
¹http://staff.science.uva.nl/∼jzuidema. The intuition behind it is as follows. Observe first that there are many ways in which an arbitrary fragment $t$ can emerge, many of which do not involve the usage of the elementary tree $t$. It is useful to partition the set of all derivations of complete parse trees according to the substitution sites inside $t$ that they involve, and hence according to the corresponding derivations of $t$. The first summation in (8) simply sums over all these cases.
Each derivation of $t$ involves a first elementary tree $d_1$, and possibly a sequence of further elementary trees $\langle d_2, \ldots, d_n \rangle$. Roughly speaking, the $\alpha$-term in equation (8) describes the frequency with which a $d_1$ will be generated. The $\beta$-term then describes the probability that $d_1$ will be expanded as $t$. The equation simplifies considerably for those fragments that have no nonterminal leaves: the set $\widehat{pr}_x(t)$ then only contains $t$, and the two summations over this set disappear. The equation further simplifies if only depth-1 elementary trees have nonzero weights (i.e. for SCFGs): $\alpha$ and $\beta$ then essentially give outside and inside probabilities (Lari and Young, 1990). However, for unconstrained STSGs we need all sums and products in (8).
will, in general, be computationally extremely expensive to calculate $E[f(t)]$. We will come back to computational efficiency issues in the discussion.
3 Estimation: push-n-pull
The goal of this paper is an automatic discovery procedure for finding "constructions" based on occurrence and co-occurrence frequencies in a corpus. Now that we have introduced the necessary terminology, we can reformulate this goal as follows: what are the elementary trees with multiple words with the highest usage frequency in the STSG estimated from an annotated corpus? Thus phrased, the crucial next step is to decide on an estimation procedure for learning an STSG from a corpus.
Here we develop an estimation procedure we call "push-n-pull". The basic idea is as follows. Given an initial setting of the parameters, the method calculates the expected frequency of all complete and incomplete trees. If a tree's expected frequency is higher than its observed frequency, the method subtracts the difference from the tree's score and distributes ("pushes") it over the trees involved in its derivations. If it is lower, it "pulls" the difference from these same derivations. The method includes a bias for moving probability mass to smaller elementary trees, to avoid overfitting; its effects become smaller as more data gets observed.
Because the method for calculating estimated frequency works with usage frequencies, the push-n-pull algorithm also uses these as parameters. More precisely, it manipulates a "score", which is the product of the usage frequency and the total number of parse trees observed. Implicit here is the assumption that by shifting usage frequencies between different derivations, the relation with the weights remains as in equation (6). Simulations suggest this is reasonable.
In the current implementation, the method starts with all frequency mass in the longest derivations, i.e. in the depth-1 elementary trees. Finally, the current implementation is incremental. It keeps track of the frequencies with which it observes subtrees in a corpus. For each tree received, it finds all derivations and all probabilities, and updates frequencies and scores according to the rules sketched above. In pseudo-code, the push-n-pull algorithm is as follows:
for each observed parse tree p:
    for each depth-1 subtree t in p:
        update-score(t, 1.0)
    for each subtree t of p:
        ∆ = min(sc(t), B + γ(E[f(t)] − f(t)))
        ∆' = 0
        for each of the n derivations d of t:
            let t' ... t'' be all elementary trees in d
            δ = min(sc(t'), ..., sc(t''), −∆/n)
            ∆' −= δ
            for each elementary tree t' in d:
                update-score(t', δ)
        update-score(t, ∆')

where sc(t) is the score of t, B is the bias towards smaller subtrees, γ is the learning-rate parameter and f(t) is the observed frequency of t. ∆' thus gives the actual change in the score of t, based on the difference between expected and observed frequency, the bias, the learning rate, and how much score can be pushed or pulled². For computational efficiency, only subtrees with a depth no larger than d = 3 or d = 4 and only derivations involving 2 elementary trees are considered.
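A literal transcription of the update step for a single subtree might look as follows; all names are illustrative, update-score is taken to be plain addition, and the enumeration of derivations and the frequency functions are stubbed out, so this is a sketch of the bookkeeping rather than the paper's implementation:

```python
def push_n_pull_step(t, sc, derivations, E_f, f, B=0.0, gamma=0.1):
    """One push-n-pull update for subtree t. `sc` maps trees to scores;
    `derivations(t)` enumerates t's derivations as sequences of elementary
    trees; E_f and f give expected and observed frequencies (all stubs)."""
    # ∆: score movement licensed by the expected/observed gap, bias and
    # learning rate, capped by the score available on t itself.
    delta = min(sc[t], B + gamma * (E_f(t) - f(t)))
    derivs = derivations(t)
    total = 0.0
    for d in derivs:
        # δ per derivation, capped by the scores available on its parts.
        step = min(min(sc[e] for e in d), -delta / len(derivs))
        total -= step                 # accumulate ∆' = −Σδ
        for e in d:
            sc[e] += step             # update-score(e, δ)
    sc[t] += total                    # update-score(t, ∆')

# Toy run: one fragment "AB" with a single two-part derivation A ∘ B.
sc = {"A": 1.0, "B": 2.0, "AB": 0.5}
derivations = lambda t: [["A", "B"]]
push_n_pull_step("AB", sc, derivations,
                 E_f=lambda t: 0.8, f=lambda t: 0.3, gamma=1.0)
print(sc)   # {'A': 0.5, 'B': 1.5, 'AB': 1.0}
```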
4 Results
We have implemented the algorithm for calculating the expected frequency, and the push-n-pull algorithm for estimation. We have evaluated the algorithms on a number of simple example STSGs and found that the expected-frequency algorithm correctly predicts observed frequencies. We have further found that, unlike existing estimation methods, the push-n-pull algorithm converges to STSGs that closely model the observed frequencies (i.e. that maximize the likelihood of the data) without putting all probability mass in the largest elementary trees (i.e. whilst retaining generalizations about the data).
Here we report first quantitative results on the ATIS3 corpus (Hemphill et al., 1990). Before processing, all trees (train and test set) were converted to a format that our current implementation requires (all non-terminal labels are unique, all internal nodes have two daughters, all preterminal nodes have a single lexical daughter; all unary productions and all traces were removed). The set of trees was randomly split into a train set of 462 trees and a test set of 116 trees. The push-n-pull algorithm was then run in 10 passes over the train set, with d = 3, B = 0 and γ = 0.1. By calculating the most probable parse³ for each yield of the trees in the test set, and running "evalb", we arrive at the following quantitative results: a string-set coverage of 84% (19 failed parses), labeled recall of 95.07, and labeled precision of 95.07. We obtained almost identical numbers on the same data with a reimplementation of the DOP1 algorithm (Bod, 1998).
²An important topic for future research is to clarify the relation between push-n-pull and Expectation Maximization.
method  # rules  Cov.  LR     LP     EM
DOP1    77852    84%   95.07  95.07  83.5
p-n-p   58799    84%   95.07  95.07  83.5

Table 1: Parseval scores of DOP1 and push-n-pull on the same 462–116 random train-testset split of a treebank derived from the ATIS3 corpus. (We emphasize that all trees, also those of the test set, were converted to Chomsky Normal Form, whereby unary productions and traces were removed and top nodes relabeled "TOP". These results are thus not comparable to previous methods evaluated on the ATIS3 corpus.) EM is "exact match".
method    # rules  Cov.  LR    LP    EM
sc > 0.3  8593     77%   80.8  80.8  46.3
sc > 0.1  98443    77%   81.9  81.9  48.8

Table 2: Parseval scores using a p-n-p induced STSG on the same treebank as in table 1, using a different random 525–53 train-testset split. Shown are results where only elementary trees with scores higher than 0.3 and 0.1 respectively are used.
However, more interesting is a qualitative analysis of the induced STSG, which shows that, unlike DOP1, push-n-pull arrives at a grammar that gives high weights (and scores) to those elementary trees that best explain the overrepresentation of certain constructions in the data. For instance, in a run with d = 4, γ = 1.0, B = 1.0, the 50 elementary trees with the highest scores, as shown in figure 1, are all exemplary of frequent formulas in the ATIS corpus such as "show me X", "I'd like to X", "which of these", "what is the X", "cheapest fare" and "flights from X to Y". In short, the push-n-pull algorithm, while starting out considering all possible subtrees, converges to a grammar which makes linguistically relevant generalizations. This allows for a more compact grammar (58799 rules in the SCFG reduction, vs. 77852 for DOP1), whilst retaining DOP's excellent empirical performance.
³We approximated the most probable parse as follows (following (Bod, 2003)). We first converted the induced STSG to an isomorphic SCFG, by giving the internal nodes of every elementary tree t unique address labels, and reading off all CFG productions (all with weight 1.0, except for the top production, which receives the weight of t). An existing SCFG parser (Schmid, 2004) was then used, with a simple unknown-word heuristic, to generate the Viterbi n-best parses with n = 100; after removing the address labels, all equal parses and their probabilities were summed, and the one with the highest probability was chosen.
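The STSG-to-SCFG reduction described in footnote 3 can be sketched as follows; the address-label scheme and tree encoding are assumptions for illustration (nested tuples; a bare string is a terminal, a 1-tuple an open substitution site):

```python
def to_scfg(elementary_tree, weight, tree_id):
    """Encode one elementary tree as weighted CFG productions: internal nodes
    get unique address labels so the tree can only be rebuilt one way; the top
    production carries the tree's weight, all others get weight 1.0."""
    productions = []
    counter = [0]

    def walk(node, is_root):
        label, children = node[0], node[1:]
        if is_root:
            lhs = label                              # root keeps its label
        else:
            counter[0] += 1
            lhs = f"{label}@{tree_id}.{counter[0]}"  # unique internal address
        rhs = []
        for c in children:
            if isinstance(c, str):
                rhs.append(c)                        # terminal
            elif len(c) == 1:
                rhs.append(c[0])                     # open substitution site
            else:
                rhs.append(walk(c, False))           # internal node: recurse
        productions.append((lhs, tuple(rhs), weight if is_root else 1.0))
        return lhs

    walk(elementary_tree, True)
    return productions

# Toy fragment "show NP" with weight 0.3, address-labeled with tree id 7.
prods = to_scfg(("VP", ("VB", "show"), ("NP",)), 0.3, tree_id=7)
for p in prods:
    print(p)
```

Summing the probabilities of equal parses after stripping the address labels then recovers the STSG parse probability of equation (2).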
5 Discussion
Calculating $E[f(t)]$ using equation (8) can be extremely expensive in computational terms. One will typically want to calculate this value for all subtrees, the number of which is exponential in the size of the trees in the training data. For each subtree $t$, we will need to consider the set of all its derivations (exponential in the size of $t$), and for each derivation the set of supertwigs of the first elementary tree and, for incompletely lexicalized subtrees, the set of superprunes of all elementary trees in their derivations. The latter two sets, however, need not be constructed every time the expected frequency $E[f(t)]$ is calculated. Instead, we can, as we do in the current implementation, keep track of the two sums for every change of the weights.
However, there are many further possibilities for improving the efficiency of the algorithm that are currently not implemented. Equation (8) remains valid under various restrictions on the elementary trees that we are willing to consider as productive units. Some of these will remove the exponential dependence on the size of the trees in the training data. For instance, in the case where we restrict the productive units (with nonzero weights) to depth-1 trees (i.e. CFG rules), equation (8) collapses to the product of inside and outside probabilities, which can be calculated using dynamic programming in polynomial time (Lari and Young, 1990). A major topic for future research is to define linguistically motivated restrictions that allow for efficient computation.
Another concern is the size of the grammar the estimation procedure produces, and hence the time and space efficiency of the resulting parser. Table 1 already showed that push-n-pull leads to a more concise grammar. The reason is that many potential elementary trees receive a score (and weight) of 0. More generally, push-n-pull generates extremely tilted score distributions, which allows for even more compact but highly accurate approximations. In table 2 we show, for the d = 4 grammar of figure 1, that a 10-fold reduction of the grammar size by pruning elementary trees with low scores leads only to a small decrease in the LP and LR measures.
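The pruning experiment of table 2 amounts to a simple threshold filter over scores; a minimal sketch with an illustrative grammar encoding:

```python
def prune(grammar, threshold):
    """Keep only elementary trees whose score exceeds the threshold.
    `grammar` maps elementary trees to scores (illustrative encoding)."""
    return {t: s for t, s in grammar.items() if s > threshold}

g = {"t1": 8.8, "t2": 0.25, "t3": 0.12, "t4": 0.05}
print(len(prune(g, 0.3)), len(prune(g, 0.1)))   # 1 3
```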
Another interesting question is if and how the
current algorithm can be extended to the full class
of Stochastic Tree-Adjoining Grammars (Schabes,
1992; Resnik, 1992). With the added operation of
adjunction, equation (8) is not valid anymore. Given
the computational complexities that it already gives
rise to, however, it seems that issue of linguisti-
cally motivated restrictions (other than lexicaliza-
tion) should be considered  rst. Finally, given that
the current approach is dependent on the availability
of a large annotated corpus, an important question
is if and how it can be extended to work with un-
labeled data. That is, can we transform the push-n-
pull algorithm to perform the unsupervised learning
of STSGs? Although most work on unsupervised
grammar learning concerns SCFGs (including some
of our own (Zuidema, 2003)) it is interesting to note
that much of the evidence for construction grammar
in fact comes from the language acquisition litera-
ture (Tomasello, 2000).
6 Conclusions
Theoretical linguistics has long strived to account for the unbounded productivity of natural language syntax with as few units and rules of combination as possible. In contrast, construction grammar and related theories of grammar postulate a heterogeneous and redundant storage of "constructions". If this view is correct, we expect to see statistical signatures of these constructions in the distributional information that can be derived from corpora of natural language utterances. How can we recover those signatures? In this paper we have presented an approach to identifying the relevant statistical correlations in a corpus based on the assumption that the
Figure 1 panels (tree diagrams, here in bracket notation):
(a) (TOP (VB "SHOW") (VP* (PRP "ME") (NP (NP* DT NNS) (NP** PP-DIR PP-DIR*)))): the "show me NP PP" frame, which occurs very frequently in the training data and is represented in several elementary trees with high weight.
(b) (WHNP-1 (WDT "WHICH") (PP (IN "OF") (NP (DT "THESE") (NNS "FLIGHTS")))): the complete parse tree for the sentence "Which of these flights", which occurs 16 times in the training data.
(c) (TOP (NNS "FLIGHTS") (NP* (PP-DIR (IN "FROM") (NP** NNP NNP*)) (PP-DIR* (TO "TO") NNP**))): the frame for "flights from NP to NP".
1. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (NP* DT NNS) (NP** PP-DIR PP-DIR*)))) 17.79 0.008 30
2. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (NP* DT NNS) NP**))) 10.34 0.004 46
3. (TOP (PRP "I") (VP (MD "WOULD") (VP* (VB "LIKE") (VP** TO VP***)))) 10.02 0.009 20
4. (WHNP-1 (WDT "WHICH") (PP (IN "OF") (NP (DT "THESE") (NNS "FLIGHTS")))) 8.80 0.078 16
5. (TOP (WP "WHAT") (SQ (VBZ "IS") (NP-SBJ (DT "THE") (NN "PRICE")))) 8.76 0.005 20
6. (TOP (WHNP (WDT "WHAT") (NNS "FLIGHTS")) (SQ (VBP "ARE") (SQ* (EX "THERE") SQ**))) 8.25 0.006 36
7. (VP* (PRP "ME") (NP (NP* (DT "THE") (NNS "FLIGHTS")) (NP** (PP-DIR IN NNP) (PP-DIR* TO NNP*)))) 7.90 0.023 18
8. (TOP (WHNP (WDT "WHAT") (NNS "FLIGHTS")) (SQ (VBP "ARE") (SQ* (EX "THERE") (SQ** PP-DIR-3 PP-DIR-4)))) 6.64 0.005 26
9. (TOP (PRP "I") (VP MD (VP* (VB "LIKE") (VP** TO VP***)))) 6.48 0.006 20
10. (TOP (PRP "I") (VP (VBP "NEED") (NP (NP* DT NN) (NP** PP-DIR NP***)))) 5.01 0.004 10
11. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (DT "THE") NNS))) 4.94 0.002 16
12. (TOP WP (SQ (VBZ "IS") (NP-SBJ (DT "THE") (NN "PRICE")))) 4.91 0.0028 20
13. (TOP (WHNP (WDT "WHAT") (NNS "FLIGHTS")) (SQ (VBP "ARE") (SQ* EX (SQ** PP-DIR-3 PP-DIR-4)))) 4.16 0.003 26
14. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (NNS "FLIGHTS") NP*))) 4.01 0.001 16
15. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (DT "THE") NP*))) 3.94 0.002 12
16. (TOP (WHNP (WDT "WHAT") (NNS "FLIGHTS")) (SQ (VBP "ARE") (SQ* EX SQ**))) 3.92 0.003 36
17. (TOP (PRP "I") (VP (VBP "NEED") (NP (NP* DT NN) NP**))) 3.85 0.003 14
18. (TOP (WP "WHAT") (SQ VBZ (NP-SBJ (DT "THE") (NN "PRICE")))) 3.79 0.002 20
19. (WHNP-1 (WDT "WHICH") (PP (IN "OF") (NP (DT "THESE") NNS))) 3.65 0.032 16
20. (TOP (VB "SHOW") (VP* (PRP "ME") (NP NP* (SBAR WDT VP**)))) 3.64 0.002 14
21. (TOP (VB "SHOW") (VP* PRP (NP (NP* DT NNS) (NP** PP-DIR PP-DIR*)))) 3.61 0.002 30
22. (TOP (WHNP (WDT "WHAT") NNS) (SQ (VBP "ARE") (SQ* (EX "THERE") (SQ** PP-DIR-3 PP-DIR-4)))) 3.30 0.002 26
23. (VP (MD "WOULD") (VP* (VB "LIKE") (VP** (TO "TO") (VP*** VB* VP****)))) 3.25 0.012 16
24. (TOP (WDT "WHICH") VP) 3.1460636 0.001646589 12
25. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (NP* DT NP**) NP***))) 3.03 0.001 12
26. (TOP (VB "SHOW") (VP* (PRP "ME") (NP NP* (NP*** PP-DIR PP-DIR*)))) 2.97 0.001 12
27. (PP (IN "OF") (NP* (NN* "FLIGHT") (NP** NNP (NP*** NNP* NP****)))) 2.95 0.015 8
28. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (DT "THE") (NNS "FARES")))) 2.85 0.001 8
29. (VP (VBP "NEED") (NP (NP* (DT "A") (NN "FLIGHT")) (NP** PP-DIR NP***))) 2.77 0.009 12
30. (TOP (VB "SHOW") (VP* (PRP "ME") (NP NP* (NP** PP-DIR PP-DIR*)))) 2.77 0.001 34
31. (TOP (JJS "CHEAPEST") (NN "FARE")) 2.74 0.001 6
32. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (NP* DT NP**) (NP*** PP-DIR PP-DIR*)))) 2.71 0.001 8
33. (TOP (NN "PRICE") (PP (IN "OF") (NP* (NN* "FLIGHT") (NP** NNP NP***)))) 2.69 0.001 6
34. (TOP (NN "PRICE") (PP (IN "OF") (NP* (NN* "FLIGHT") NP**))) 2.68 0.001 8
35. (PP-DIR (IN "FROM") (NP (NNP "WASHINGTON") (NP* (NNP* "D") (NNP** "C")))) 2.67 0.006 6
36. (PP-DIR (IN "FROM") (NP** (NNP "NEWARK") (NP*** (NNP* "NEW") (NNP** "JERSEY")))) 2.60 0.005 6
37. (S* (PRP "I") (VP (MD "WOULD") (VP* (VB "LIKE") (VP** TO VP***)))) 2.59 0.11 8
38. (TOP (VBZ "DOES") (SQ* (NP-SBJ DT (NN "FLIGHT")) (VP (VB "SERVE") (NN* "DINNER")))) 2.48 0.002 8
39. (TOP (PRP "I") (VP (MD "WOULD") (VP* (VB "LIKE") VP**))) 2.37 0.002 20
40. (TOP (WP "WHAT") (SQ (VBZ "IS") (NP-SBJ DT (NN "PRICE")))) 2.33 0.001 20
41. (S* (PRP "I") (VP MD (VP* (VB "LIKE") (VP** TO VP***)))) 2.33 0.100 8
42. (WHNP**** (PP-TMP (IN* "ON") (NNP** "FRIDAY")) (PP-LOC (IN** "ON") (NP (NNP*** "AMERICAN") (NNP**** "AIRLINES")))) 2.30 0.086 6
43. (VP* (PRP "ME") (NP (NP* (DT "THE") NNS) (NP** (PP-DIR IN NNP) (PP-DIR* TO NNP*)))) 2.29 0.007 18
44. (TOP (WHNP* (WDT "WHAT") (NNS "FLIGHTS")) (WHNP** (PP-DIR (IN "FROM") NNP) (WHNP*** (PP-DIR* TO NNP*) (PP-TMP IN* NNP**)))) 2.28 0.001 12
45. (SQ (VBP "ARE") (SQ* EX (SQ** (PP-DIR-3 IN NNP) (PP-DIR-4 TO NNP*)))) 2.26 0.015 14
46. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (NP* DT NNS) (SBAR WDT VP**)))) 2.22 0.001 8
47. (TOP (NNS "FLIGHTS") (NP* (PP-DIR (IN "FROM") (NP** NNP NNP*)) (PP-DIR* (TO "TO") NNP**))) 2.20 0.001 10
48. (VP (VBP "NEED") (NP (NP* (DT "A") (NN "FLIGHT")) (NP** (PP-DIR IN NNP) NP***))) 2.1346128 0.007185978 10
49. (NP (NP* (DT "THE") (NNS "FLIGHTS")) (NP** (PP-DIR (IN "FROM") (NNP "BALTIMORE")) (PP-DIR* (TO "TO") (NNP* "OAKLAND")))) 2.1335514 0.00381956 10
50. (TOP (VB "SHOW") (VP* (PRP "ME") (NP (NP* DT NNS) (NP** PP-DIR NP***)))) 2.09 0.001 8
Figure 1: Three examples and a list of the first 50 elementary trees with multiple words of an STSG induced using the push-n-pull algorithm on the ATIS3 corpus. For use in the current implementation, the parse trees have been converted to Chomsky Normal Form (all occurrences of A → B, B → ω are replaced by A → ω; all occurrences of A → BCω are replaced by A → BA*, A* → Cω), all non-terminal labels are made unique for a particular parse tree (address labeling not shown) and all top nodes are replaced by the non-terminal "TOP". Listed are the elementary trees of the induced STSG with, for each tree, the score, the weight and the frequency with which it occurs in the training set.
corpus is generated by an STSG, and by inferring the properties of that underlying STSG. Given our best guess of the STSG that generated the data, we can start to ask questions like: which subtrees are overrepresented in the corpus? Which correlations are so strong that it is reasonable to think of the correlated phrases as a single unit? We presented a new algorithm for estimating the weights of an STSG from a corpus, and reported promising empirical results on a small corpus.
Acknowledgments
The author is funded by the Netherlands Organisation for Scientific Research (Exacte Wetenschappen), project number 612.066.405. Many thanks to Yoav Seginer, Rens Bod and Remko Scha, and the anonymous reviewers for very useful comments.

References
Rens Bod, Remko Scha, and Khalil Sima’an, editors.
2003. Data-Oriented Parsing. CSLI Publications,
University of Chicago Press, Chicago, IL.
Rens Bod. 1993. Using an annotated corpus as a stochastic grammar. In Proceedings EACL'93, pages 37–44.
Rens Bod. 1998. Beyond Grammar: An experience-based theory of language. CSLI, Stanford, CA.
Rens Bod. 2003. An efficient implementation of a new DOP model. In Proceedings EACL'03.
Michael Collins and Nigel Duffy. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings ACL'02.
Adele E. Goldberg. 1995. Constructions: A Construction Grammar Approach to Argument Structure. The University of Chicago Press, Chicago, IL.
Joshua Goodman. 1996. Efficient algorithms for parsing the DOP model. In Proceedings EMNLP'96, pages 143–152.
C.T. Hemphill, J.J. Godfrey, and G.R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Proceedings of the DARPA Speech and Natural Language Workshop. Morgan Kaufman, Hidden Valley.
Ray Jackendoff. 2002. Foundations of Language. Oxford University Press, Oxford, UK.
Aravind Joshi and Anoop Sarkar. 2003. Tree adjoining grammars and their application to statistical parsing. In Bod et al. (2003), pages 253–282.
A. Joshi, K. Vijay-Shanker, and D. Weir. 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, Stuart Shieber, and Tom Wasow, editors, Foundational issues in natural language processing, pages 21–82. MIT Press, Cambridge, MA.
P. Kay and C. Fillmore. 1999. Grammatical constructions and linguistic generalizations. Language, 75:1–33.
K. Lari and S.J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56.
Philip Resnik. 1992. Probabilistic tree-adjoining grammar as a framework for statistical natural language processing. In Proceedings COLING'92, pages 418–424.
Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings CICLing, pages 1–15.
Remko Scha. 1990. Taaltheorie en taaltechnologie; competence en performance. In R. de Kort and G.L.J. Leerdam, editors, Computertoepassingen in de Neerlandistiek, pages 7–22. LVVN, Almere. http://iaaa.nl/rs/LeerdamE.html.
Yves Schabes. 1992. Stochastic lexicalized tree-adjoining grammars. In Proceedings COLING'92, pages 425–432.
Helmut Schmid. 2004. Efficient parsing of highly ambiguous context-free grammars with bit vectors. In Proceedings COLING'04.
Khalil Sima’an and Luciano Buratto. 2003. Backoff parameter estimation for the DOP model. In Proceedings ECML'03, pages 373–384.
Khalil Sima’an. 2002. Computational complexity of probabilistic disambiguation. Grammars, 5(2):125–151.
Michael Tomasello. 2000. The item-based nature of children's early syntactic development. Trends in Cognitive Science, 4(4):156–163.
Andreas Zollmann and Khalil Sima’an. 2005. A consistent and efficient estimator for data-oriented parsing. Journal of Automata, Languages and Combinatorics.
Willem Zuidema. 2003. How the poverty of the stimulus solves the poverty of the stimulus. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 51–58. MIT Press, Cambridge, MA.
Willem Zuidema. 2006. Theoretical evaluation of estimation methods for Data-Oriented Parsing. In Proceedings EACL'06 (Conference Companion), pages 183–186.
