Transformational Priors Over Grammars
Jason Eisner <jason@cs.jhu.edu>
Johns Hopkins University, 3400 N. Charles St., NEB 224, Baltimore, MD
Abstract
This paper proposes a novel class of PCFG parameterizations
that support linguistically reasonable priors over PCFGs. To
estimate the parameters is to discover a notion of relatedness
among context-free rules such that related rules tend to have
related probabilities. The prior favors grammars in which the
relationships are simple to describe and have few major excep-
tions. A basic version that bases relatedness on weighted edit
distance yields superior smoothing of grammars learned from
the Penn Treebank (20% reduction of rule perplexity over the
best previous method).
1 A Sketch of the Concrete Problem
This paper uses a new kind of statistical model to smooth
the probabilities of PCFG rules. It focuses on “flat” or
“dependency-style” rules. These resemble subcategoriza-
tion frames, but include adjuncts as well as arguments.
The verb put typically generates 3 dependents—a
subject NP at left, and an object NP and goal PP at right:
  S → NP put NP PP: Jim put [the pizza] [in the oven]
But put may also take other dependents, in other rules:
  S → NP Adv put NP PP: Jim often put [a pizza] [in the oven]
  S → NP put NP PP PP: Jim put soup [in an oven] [at home]
  S → NP put NP: Jim put [some shares of IBM stock]
  S → NP put Prt NP: Jim put away [the sauce]
  S → TO put NP PP: to put [the pizza] [in the oven]
  S → NP put NP PP SBAR: Jim put it [to me] [that … ]
These other rules arise if put can add, drop, reorder,
or retype its dependents. These edit operations on rules
are semantically motivated and quite common (Table 1).
We wish to learn contextual probabilities for the edit
operations, based on an observed sample of flat rules. In
English we should discover, for example, that it is quite
common to add or delete PP at the right edge of a rule.
These contextual edit probabilities will help us guess the
true probabilities of novel or little-observed rules.
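As an illustration (not code from the paper), the edit-distance-1 neighborhood of a flat rule can be enumerated directly. The nonterminal inventory and the list-of-strings rule encoding below are assumptions made for the sketch:

```python
# Sketch only: enumerate flat rules at edit distance 1 from a given rule.
# The nonterminal inventory and rule encoding are invented for illustration.
NONTERMINALS = ["NP", "PP", "Adv", "Prt", "SBAR"]

def neighbors(rhs):
    """RHSs reachable from `rhs` by one insert, delete, substitute, or swap.
    The headword (written in lowercase) is never edited."""
    out = set()
    n = len(rhs)
    for i in range(n + 1):                          # Insert a dependent
        for nt in NONTERMINALS:
            out.add(tuple(rhs[:i] + [nt] + rhs[i:]))
    for i in range(n):
        if rhs[i].islower():                        # skip the headword
            continue
        out.add(tuple(rhs[:i] + rhs[i + 1:]))       # Delete
        for nt in NONTERMINALS:                     # Substitute (retype)
            if nt != rhs[i]:
                out.add(tuple(rhs[:i] + [nt] + rhs[i + 1:]))
        if i + 1 < n and not rhs[i + 1].islower():  # Swap adjacent dependents
            out.add(tuple(rhs[:i] + [rhs[i + 1], rhs[i]] + rhs[i + 2:]))
    out.discard(tuple(rhs))
    return out

rule = ["NP", "put", "NP", "PP"]
# right-edge PP-insertion and PP-deletion both yield neighbors of the rule:
assert ("NP", "put", "NP", "PP", "PP") in neighbors(rule)
assert ("NP", "put", "NP") in neighbors(rule)
```

The set of such neighbors is exactly the kind of candidate space over which the contextual edit probabilities will be learned.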
However, rules are often idiosyncratic. Our smooth-
ing method should not keep us from noticing (given
enough evidence) that put takes a PP more often than
most verbs. Hence this paper’s proposal is a Bayesian
smoothing method that allows idiosyncrasy in the gram-
mar while presuming regularity to be more likely a priori.
The model will assign a positive probability to each
of the infinitely many formally possible rules. The fol-
lowing bizarre rule is not observed in training, and seems
very unlikely. But there is no formal reason to rule it out,
and it might help us parse an unlikely test sentence. So
the model will allow it some tiny probability:
  S → NP Adv PP put PP PP PP NP AdjP S
2 Background and Other Approaches
A PCFG is a conditional probability function p(RHS |
LHS).1 For example, p(V NP PP | VP) gives the proba-
bility of the rule VP → V NP PP. With lexicalized non-
terminals, it has form p(V_put NP_pizza PP_in | VP_put).
Usually one makes an independence assumption and
defines this as p(V_put NP PP | VP_put) times factors that
choose dependent headwords pizza and in according
to the selectional preferences of put. This paper is about
estimating the first factor, p(V_put NP PP | VP_put).
In supervised learning, it is simplest to use a max-
imum likelihood estimate (perhaps with backoff from
put). Charniak (1997) calls this a “Treebank grammar”
and gambles that assigning 0 probability to rules unseen
in training data will not hurt parsing accuracy too much.
However, there are four reasons not to use a Treebank
grammar. First, ignoring unseen rules necessarily sacri-
fices some accuracy. Second, we will show that it im-
proves accuracy to flatten the parse trees and use flat,
dependency-style rules like p(NP put NP PP | S_put);
this avoids overly strong independence assumptions, but
it increases the number of unseen rules and so makes
Treebank grammars less tenable. Third, backing off from
the word is a crude technique that does not distinguish
among words.2 Fourth, one would eventually like to re-
duce or eliminate supervision, and then generalization is
important to constrain the search to reasonable grammars.
To smooth the distribution p(RHS | LHS), one can de-
fine it in terms of a set of parameters and then estimate
those parameters. Most researchers have used an n-gram
model (Eisner, 1996; Charniak, 2000) or more general
Markov model (Alshawi, 1996) to model the sequence
of nonterminals in the RHS. The sequence V_put NP PP
in our example is then assumed to be emitted by some
Markov model of VP_put rules (again with backoff from
put). Collins (1997, model 2) uses a more sophisticated
model in which all arguments in this sequence are gener-
ated jointly, as in a Treebank grammar, and then a Markov
process is used to insert adjuncts among the arguments.
While Treebank models overfit the training data,
Markov models underfit. A simple compromise (novel to
this paper) is a hybrid Treebank/Markov model, which
backs off from a Treebank model to a Markov. Like
this paper’s main proposal, it can learn well-observed id-
iosyncratic rules but generalizes when data are sparse.3
1Nonstandardly, this allows infinitely many rules with p > 0.
2One might do better by backing off to word clusters, which
Charniak (1997) did find provided a small benefit.
3Carroll and Rooth (1998) used a similar hybrid technique
Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP), Philadelphia, July 2002, pp. 63-70.
Association for Computational Linguistics.
These models are beaten by our rather different model,
transformational smoothing, which learns common
rules and common edits to them. The comparison is a
direct one, based on the perplexity or cross-entropy of
the trained models on a test set of S → … rules.4
A subtlety is that two annotation styles are possible. In
the Penn Treebank, put is the head of three constituents
(V, VP, and S, where underlining denotes a head child)
and joins with different dependents at different levels:
 [S [NP Jim] [VP [V put] [NP pizza] [PP in the oven]]]
In the flattened or dependency version that we prefer,
each word joins with all of its dependents at once:
 [S [NP Jim] put [NP pizza] [PP in the oven]]
A PCFG generating the flat structure must estimate
p(NP put NP PP | S_put). A non-flat PCFG adds
the dependents of put in 3 independent steps, so in ef-
fect it factors the flat rule's probability into 3 suppos-
edly independent “subrule probabilities,” p(NP VP_put |
S_put) · p(V_put NP PP | VP_put) · p(put | V_put).
Our evaluation judges the estimates of flat-rule prob-
abilities. Is it better to estimate these directly, or as a
product of estimated subrule probabilities?5 Transforma-
tional smoothing is best applied to the former, so that the
edit operations can freely rearrange all of a word’s depen-
dents. We will see that the Markov and Treebank/Markov
models also work much better this way—a useful finding.
3 The Abstract Problem: Designing Priors
This section outlines the Bayesian approach to learning
probabilistic grammars (for us, estimating a distribution
over flat CFG rules). By choosing among the many
grammars that could have generated the training data, the
learner is choosing how to generalize to novel sentences.
To guide the learner’s choice, one can explicitly spec-
ify a prior probability distribution p(θ) over possible
grammars θ, which themselves specify probability dis-
tributions over strings, rules, or trees. A learner should
seek θ that maximizes p(θ) · p(D | θ), where D is the
set of strings, rules, or trees observed by the learner. The
first factor favors regularity (“pick an a priori plausible
grammar”), while the second favors fitting the idiosyn-
crasies of the data, especially the commonest data.6
to evaluate rule distributions that they acquired from an
automatically-parsed treebank.
4All the methods evaluated here apply also to full PCFGs,
but verb-headed rules S → … present the most varied, inter-
esting cases. Many researchers have tried to learn verb subcat-
egorization, though usually not probabilistic subcategorization.
5In testing the latter case, we sum over all possible internal
bracketings of the rule. We do train this case on the true internal
bracketing, but it loses even with this unfair advantage.
6This approach is called semi-Bayesian or Maximum A Pos-
Priors can help both unsupervised and supervised
learning. (In the semi-supervised experiments here, train-
ing data is not raw text but a sparse sample of flat rules.)
Indeed a good deal of syntax induction work has been
carried out in just this framework (Stolcke and Omohun-
dro, 1994; Chen, 1996; De Marcken, 1996; Grünwald,
1996; Osborne and Briscoe, 1997). However, all such
work to date has adopted rather simple prior distributions.
Typically, it has defined p(θ) to favor PCFGs whose rules
are few, short, nearly equiprobable, and defined over a
small set of nonterminals. Such definitions are conve-
nient, especially when specifying an encoding for MDL,
but since they treat all rules alike, they may not be good
descriptions of linguistic plausibility. For example, they
will never penalize the absence of a predictable rule.
A prior distribution can, however, be used to encode
various kinds of linguistic notions. After all, a prior is
really a soft form of Universal Grammar: it gives the
learner enough prior knowledge of grammar to overcome
Chomsky’s “poverty of the stimulus” (i.e., sparse data).
• A preference for small or simple grammars, as above.
• Substantive preferences, such as a preference for verbs
to take 2 nominal arguments, or to allow PP adjuncts.
• Preferences for systematicity, such as a preference for
the rules to be consistently head-initial or head-final.
This paper shows how to design a prior that favors a
certain kind of systematicity. Lexicalized grammars for
natural languages are very large—each word specifies a
distribution over all possible dependency rules it could
head—but they tend to have internal structure. The new
prior prefers grammars in which a rule’s probability can
be well-predicted from the probabilities of other rules, us-
ing linguistic transformations such as edit operations.
For example, p(NP Adv w NP PP | S_w) cor-
relates with p(NP w NP PP | S_w). Both numbers are
high for w = put, medium for w = fund, and low for
w = sleep. The slope of the regression line has to do
with the rate of preverbal Adv-insertion in English.
The correlation is not perfect (some verbs are espe-
cially prone to adverbial modification), which is why we
will only model it with a prior. To just the extent that evi-
dence about w is sparse, the prior will cause the learner to
smooth the two probabilities toward the regression line.
4 Patterns Worth Modeling
Before spelling out our approach, let us do a sanity check.
A frame is a flat rule whose headword is replaced with
teriori learning, since it is equivalent to maximizing p(θ | D).
It is also equivalent to Minimum Description Length (MDL)
learning, which minimizes the total number of bits ℓ(θ) + ℓ(D |
θ) needed to encode grammar and data, because one can choose
an encoding scheme where ℓ(x) = −log2 p(x), or conversely,
define probability distributions by p(x) = 2^−ℓ(x).
MI   α   β                              MI   α   β                          MI   α   β
9.01 [NP ADJP-PRD] [NP RB ADJP-PRD] 4.76 [TO S] [ S] 5.54 [TO NP PP] [NP TO NP]
8.65 [NP ADJP-PRD] [NP PP-LOC-PRD] 4.17 [TO S] [TO NP PP] 5.25 [TO NP PP] [NP MD NP .]
8.01 [NP ADJP-PRD] [NP NP-PRD] 2.77 [TO S] [TO NP] 4.67 [TO NP PP] [NP MD NP]
7.69 [NP ADJP-PRD] [NP ADJP-PRD .] 6.13 [TO NP] [TO NP SBAR-TMP] 4.62 [TO NP PP] [TO ]
8.49 [NP NP-PRD] [NP NP-PRD .] 5.72 [TO NP] [TO NP PP PP] 3.19 [TO NP PP] [TO NP]
7.91 [NP NP-PRD] [NP ADJP-PRD .] 5.36 [TO NP] [NP MD RB NP] 2.05 [TO NP PP] [ NP]
7.01 [NP NP-PRD] [NP ADJP-PRD] 5.16 [TO NP] [TO NP PP PP-TMP] 5.08 [ NP] [ADVP-TMP NP]
8.45 [NP ADJP-PRD .] [NP PP-LOC-PRD] 5.11 [TO NP] [TO NP ADVP] 4.86 [ NP] [ADVP NP]
8.30 [NP ADJP-PRD .] [NP NP-PRD .] 4.85 [TO NP] [TO NP PP-LOC] 4.53 [ NP] [ NP PP-LOC]
8.04 [NP ADJP-PRD .] [NP NP-PRD] 4.84 [TO NP] [MD NP] 3.50 [ NP] [ NP PP]
7.01 [NP ADJP-PRD .] [NP ADJP-PRD] 4.49 [TO NP] [NP TO NP] 3.17 [ NP] [ S]
7.01 [NP SBAR] [NP SBAR . ”] 4.36 [TO NP] [NP MD S] 2.28 [ NP] [NP NP]
4.75 [NP SBAR] [NP SBAR .] 4.36 [TO NP] [NP TO NP PP] 1.89 [ NP] [NP NP .]
6.94 [NP SBAR .] [“ NP SBAR .] 4.26 [TO NP] [NP MD NP PP] 2.56 [NP NP] [NP NP .]
5.94 [NP SBAR .] [NP SBAR . ”] 4.26 [TO NP] [TO NP PP-TMP] 2.20 [NP NP] [ NP]
5.90 [NP SBAR .] [S , NP .] 4.21 [TO NP] [TO PRT NP] 4.89 [NP NP .] [NP ADVP-TMP NP .]
5.82 [NP SBAR .] [NP ADVP SBAR .] 4.20 [TO NP] [NP MD NP] 4.57 [NP NP .] [NP ADVP NP .]
4.68 [NP SBAR .] [ SBAR] 3.99 [TO NP] [TO NP PP] 4.51 [NP NP .] [NP NP PP-TMP]
4.50 [NP SBAR .] [NP SBAR] 3.69 [TO NP] [NP MD NP .] 3.35 [NP NP .] [NP S .]
3.23 [NP SBAR .] [NP S .] 3.60 [TO NP] [TO ] 2.99 [NP NP .] [NP NP]
2.07 [NP SBAR .] [NP ] 3.56 [TO NP] [TO PP] 2.96 [NP NP .] [NP NP PP .]
1.91 [NP SBAR .] [NP NP .] 2.56 [TO NP] [NP NP PP] 2.25 [NP NP .] [ NP PP]
1.63 [NP SBAR .] [NP NP] 2.04 [TO NP] [NP S] 2.20 [NP NP .] [ NP]
4.52 [NP S] [NP S .] 1.99 [TO NP] [NP NP] 4.82 [NP S .] [ S]
4.27 [NP S] [ S] 1.69 [TO NP] [NP NP .] 4.58 [NP S .] [NP S]
3.36 [NP S] [NP ] 1.68 [TO NP] [NP NP PP .] 3.30 [NP S .] [NP ]
2.66 [NP S] [NP NP .] 1.03 [TO NP] [ NP] 2.93 [NP S .] [NP NP .]
2.37 [NP S] [NP NP] 4.75 [S , NP .] [NP SBAR .] 2.28 [NP S .] [NP NP]
Table 1: The most predictive pairs of sentential frames. If S → α occurs in training data at least 5 times with a given headword in
the — position, then S → β also tends to appear at least once with that headword. MI measures the mutual information of these
two events, computed over all words. When MI is large, as here, the edit distance between α and β tends to be strikingly small (1
or 2), and certain linguistically plausible edits are extremely common.
the variable “—” (corresponding to w above). Table 1 il-
lustrates that in the Penn Treebank, if frequent rules with
frame α imply matching rules with frame β, there are
usually edit operations (section 1) to easily turn α into β.
How about rare rules, whose probabilities are most in
need of smoothing? Are the same edit transformations
that we can learn from frequent cases (Table 1) appropri-
ate for predicting the rare cases? The very rarity of these
rules makes it impossible to create a table like Table 1.
However, rare rules can be measured in the aggregate,
and the result suggests that the same kinds of transforma-
tions are indeed useful—perhaps even more useful—in
predicting them. Let us consider the set R of 2,809,545
possible flat rules that stand at edit distance 1 from the set
of S → … rules observed in our English training data.
That is, a rule such as S_put → NP put NP is in R if it
did not appear in training data itself, but could be derived
by a single edit from some rule that did appear.
A bigram Markov model (section 2) was used to iden-
tify 2,714,763 rare rules in R—those that were predicted
to occur with probability < 0.0001 given their head-
words. 79 of these rare rules actually appeared in a
development-data set of 1423 rules. The bigram model
would have expected only 26.2 appearances, given the
lexical headwords in the test data set. The difference is
statistically significant (p < 0.001, bootstrap test).
In other words, the bigram model underpredicts the
edit-distance “neighbors” of observed rules by a factor
of 3.⁷ One can therefore hope to use the edit transforma-
tions to improve on the bigram model. For example, the
7Similar results are obtained when we examine just one par-
ticular kind of edit operation, or rules of one particular length.
Delete-Y transformation recognizes that if … X Y Z …
has been observed, then … X Z … is plausible even if
the bigram X Z has not previously been observed.
Presumably, edit operations are common because they
modify a rule in semantically useful ways, allowing the
filler of a semantic role to be expressed (Insert), sup-
pressed (Delete), retyped (Substitute), or heavy-shifted
(Swap). Such “valency-affecting operations” have re-
peatedly been invoked by linguists; they are not confined
to English.8 So a learner of an unknown language can
reasonably expect a priori that flat rules related by edit
operations may have related probabilities.
However, which edit operations varies by language.
Each language defines its own weighted, contextual,
asymmetric edit distance. So the learner will have to dis-
cover how likely particular edits are in particular con-
texts. For example, it must learn the rates of prever-
bal Adv-insertion and right-edge PP-insertion. Evidence
about these rates comes mainly from the frequent rules.
5 A Transformation Model
The form of our new model is shown in Figure 1. The
vertices are flat context-free rules, and the arcs between
them represent edit transformations. The set of arcs leav-
8Carpenter (1991) writes that whenever linguists run into the
problem of systematic redundancy in the syntactic lexicon, they
design a scheme in which lexical entries can be derived from
one another by just these operations. We are doing the same
thing. The only twist is that the lexical entries (in our case, flat
PCFG rules) have probabilities that must also be derived, so we
will assume that the speaker applies these operations (randomly
from the hearer’s viewpoint) at various rates to be learned.
[Figure 1 diagram: a graph whose vertices include START, START(fund), START(merge), HALT, and the rules To fund NP,
To fund NP PP, To fund PP NP, To merge NP, To merge NP PP, To merge PP NP. Arc probabilities have the log-linear form
exp(θ_i + θ_j + …)/Z_k, e.g. exp(θ3 + θ5 + θ8)/Z2 on a PP-inserting arc and exp(θ0)/Z_k on HALT arcs; the START arcs to
START(fund) and START(merge) carry probabilities 0.0011 and 0.0002 respectively.]
θ0 halts              θ3 inserts PP                     θ6 deletes PP              θ8 yields To fund NP PP
θ1 chooses To NP      θ4 inserts PP before NP           θ7 moves NP right past PP  θ9 yields To merge NP PP
θ2 chooses To NP PP   θ5 inserts PP before right edge
Figure 1: A fragment of a transformation model. Vertices are possible context-free rules (their left-hand sides, S_fund → and
S_merge →, are omitted to avoid visual clutter). Arc probabilities are determined log-linearly, as shown, from a real-valued vector θ
of feature weights. The Z values are chosen so that the arcs leaving each vertex have total probability 1. Dashed arrows represent
arcs not shown here (there are hundreds from each vertex, mainly insertions). Also, not all features are shown (see Table 2).
ing any given vertex has total probability 1. The learner’s
job is to discover the probabilities.
Fortunately, the learner does not have to learn a sep-
arate probability for each of the (infinitely) many arcs,
since many of the arcs represent identical or similar edits.
As shown in Figure 1, an arc’s probability is determined
from meaningful features of the arc, using a conditional
log-linear model of p(arc j source vertex). The learner
only has to learn the finite vector θ of feature weights.
Arcs that represent similar transformations have similar
features, so they tend to have similar probabilities.
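A minimal sketch of this parameterization (feature names and weights below are invented for illustration; real arcs have many more features, per Table 2): each arc out of a vertex scores exp(θ · f(arc)), normalized over the competing arcs:

```python
import math

# Sketch of the conditional log-linear arc model p(arc | source vertex).
# The feature names and weights here are invented for illustration.
theta = {"Insert": 1.0, "Insert_PP": 0.5, "Halt": -0.2, "Delete": -1.0}

def arc_probs(arcs_out):
    """arcs_out: {arc label: list of feature names} for one source vertex.
    Returns p(arc | source) = exp(theta . f(arc)) / Z, where Z sums over
    the arcs leaving this vertex so the probabilities total 1."""
    scores = {a: math.exp(sum(theta.get(f, 0.0) for f in feats))
              for a, feats in arcs_out.items()}
    Z = sum(scores.values())                 # local normalizer for this vertex
    return {a: s / Z for a, s in scores.items()}

p = arc_probs({"insert PP at right edge": ["Insert", "Insert_PP"],
               "delete PP": ["Delete"],
               "HALT": ["Halt"]})
assert abs(sum(p.values()) - 1.0) < 1e-12    # arcs leaving a vertex sum to 1
```

Two arcs that share most of their features receive nearly equal scores, which is exactly how similar transformations end up with similar probabilities.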
This transformation model is really a PCFG with un-
usual parameterization. That is, for any value of θ, it
defines a language-specific probability distribution over
all possible context-free rules (graph vertices). To sam-
ple from this distribution, take a random walk from the
special vertex START to the special vertex HALT. The
rule at the last vertex reached before HALT is the sample.
This sampling procedure models a process where the
speaker chooses an initial rule and edits it repeatedly.
The random walk might reach S_fund → To fund NP
in two steps and simply halt there. This happens
with probability 0.0011 · (exp θ1/Z1) · (exp θ0/Z2). Or, having
arrived at S_fund → To fund NP, it might transform
it into S_fund → To fund PP NP and then further to
S_fund → To fund NP PP before halting.
Thus, p_θ(S_fund → To fund NP PP) denotes the
probability that the random walk somehow reaches
S_fund → To fund NP PP and halts there. Condi-
tionalizing this probability gives p_θ(To NP PP |
S_fund), as needed for the PCFG.9
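The generative story (choose an initial rule, then edit it repeatedly until halting) can be sketched as a random walk over a toy graph; the vertices and hand-set arc probabilities below stand in for the learned log-linear ones:

```python
import random

# Toy transformation graph: vertex -> [(successor or "HALT", probability)].
# These probabilities are hand-set for the sketch, not learned.
ARCS = {
    "START": [("To fund NP", 0.7), ("To merge NP", 0.3)],
    "To fund NP": [("HALT", 0.6), ("To fund NP PP", 0.3), ("To fund PP NP", 0.1)],
    "To fund NP PP": [("HALT", 0.7), ("To fund NP", 0.3)],
    "To fund PP NP": [("HALT", 0.5), ("To fund NP PP", 0.5)],
    "To merge NP": [("HALT", 0.8), ("To merge NP PP", 0.2)],
    "To merge NP PP": [("HALT", 1.0)],
}

def sample_rule(rng):
    """Walk from START until the HALT arc is taken; the sample is the
    last vertex visited before HALT."""
    v = "START"
    while True:
        succs, probs = zip(*ARCS[v])
        nxt = rng.choices(succs, probs)[0]
        if nxt == "HALT":
            return v
        v = nxt

rng = random.Random(0)
samples = [sample_rule(rng) for _ in range(1000)]
assert all(s in ARCS and s != "START" for s in samples)
```

The empirical frequency of each rule in such samples approximates the p_θ values that the recurrence in the next paragraphs computes exactly.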
9The experiments of this paper do not allow transformations
Given θ, it is nontrivial to solve for the probability dis-
tribution over grammar rules e. Let I_θ(e) denote the flow
to vertex e. This is defined to be the total probability of
all paths from START to e. Equivalently, it is the expected
number of times e would be visited by a random walk
from START. The following recurrence defines p_θ(e):10
I_θ(e) = δ_{e,START} + Σ_{e′} I_θ(e′) p(e′ → e)   (1)
p_θ(e) = I_θ(e) p(e → HALT)   (2)
Since solving the large linear system (1) would be pro-
hibitively expensive, in practice we use an approximate
relaxation algorithm (Eisner, 2001) that propagates flow
through the graph until near-convergence. In general this
may underestimate the true probabilities somewhat.
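A sketch of the relaxation idea on a toy graph (vertex names and arc probabilities invented for the example): repeatedly apply recurrence (1) until the flows stop changing, then read off (2):

```python
# Toy graph for equations (1)-(2); arc probabilities are invented, and the
# bulk of START's mass (going to rules not modeled here) is simply omitted.
ARCS = {
    "START": {"A": 0.001, "B": 0.002},
    "A": {"HALT": 0.6, "B": 0.4},
    "B": {"HALT": 0.7, "A": 0.3},
}

def rule_probs(arcs, sweeps=100):
    """Approximate I(e) by repeatedly applying recurrence (1), then
    return p(e) = I(e) * p(e -> HALT) as in (2)."""
    flow = {v: 0.0 for v in arcs}
    flow["START"] = 1.0                  # the delta_{e,START} term
    for _ in range(sweeps):              # relaxation sweeps
        for e in arcs:
            if e == "START":
                continue
            flow[e] = sum(out.get(e, 0.0) * flow[src]
                          for src, out in arcs.items())
    return {e: flow[e] * arcs[e].get("HALT", 0.0)
            for e in arcs if e != "START"}

p = rule_probs(ARCS)
# fixed point: I(A) = 0.001 + 0.3*I(B), I(B) = 0.002 + 0.4*I(A)
assert abs(p["A"] - (0.0016 / 0.88) * 0.6) < 1e-12
```

On this tiny cyclic graph the sweeps converge geometrically; truncating them early underestimates the flows, mirroring the underestimate noted above.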
Now consider how the parameter vector θ affects the
distribution over rules, p_θ(e), in Figure 1:
• By raising the initial weight θ1, one can
increase the flow to S_fund → To fund NP,
S_merge → To merge NP, and the like. By equa-
tion (2), this also increases the probability of these rules.
But the effect also feeds through the graph to increase
the flow and probability at those rules' descendants in
the graph, such as S_merge → To merge NP PP.
So a single parameter θ1 controls a whole complex of
rule probabilities (roughly speaking, the infinitival transi-
tives). The model thereby captures the fact that, although
that change the LHS or headword of a rule, so it is trivial to find
the divisor p_θ(S_fund): in Figure 1 it is 0.0011. But in general,
LHS-changing transformations can be useful (Eisner, 2001).
10Where δ_{x,y} = 1 if x = y, else δ_{x,y} = 0.
rules are mutually exclusive events whose probabilities
sum to 1, transformationally related rules have positively
correlated probabilities that rise and fall together.
• The exception weight θ9 appears on all and only the
arcs to S_merge → To merge NP PP. That rule has
even higher probability than predicted by PP-insertion as
above (since merge, unlike fund, actually tends to sub-
categorize for PP_with). To model its idiosyncratic prob-
ability, one can raise θ9. This “lists” the rule specially
in the grammar. Rules derived from it also increase in
probability (e.g., S_merge → To Adv merge NP PP),
since again the effect feeds through the graph.
• The generalization weight θ3 models the strength of
the PP-insertion relationship. Equations (1) and (2) im-
ply that p_θ(S_fund → To fund NP PP) is modeled as
a linear combination of the probabilities of that rule's
parents in the graph. θ3 controls the coefficient of
p_θ(S_fund → To fund NP) in this linear combination,
with the coefficient approaching zero as θ3 → −∞.
• Narrower generalization weights such as θ4 and θ5
control where PP is likely to be inserted. To learn the
feature weights is to learn which features of a transfor-
mation make it probable or improbable in the language.
Note that the vertex labels, graph topology, and arc
parameters are language independent. That is, Figure 1
is supposed to represent Universal Grammar: it tells a
learner what kinds of generalizations to look for. The
language-specific part is θ, which specifies which gener-
alizations and exceptions help to model the data.
6 The Prior
The model has more parameters than data. Why? Beyond
the initial weights and generalization weights, in practice
we allow one exception weight (e.g., θ8, θ9) for each rule
that appeared in training data. (This makes it possible to
learn arbitrary exceptions, as in a Treebank grammar.)
Parameter estimation is nonetheless possible, using a
prior to help choose among the many values of θ that do
a reasonable job of explaining the training data. The prior
constrains the degrees of freedom: while many parame-
ters are available in principle, the prior will ensure that
the data are described using as few of them as possible.
The point of reparameterizing a PCFG in terms of θ,
as in Figure 1, is precisely that only one parameter is
needed per linguistically salient property of the PCFG.
Making θ3 > 0 creates a broadly targeted transforma-
tion. Making θ9 ≠ 0 or θ1 ≠ 0 lists an idiosyncratic rule,
or class of rules, together with other rules derived from
them. But it takes more parameters to encode less sys-
tematic properties, such as narrowly targeted edit trans-
formations (θ4, θ5) or families of unrelated exceptions.
A natural prior for the parameter vector θ ∈ R^k is
therefore specified in terms of a variance σ². We simply
say that the weights θ1, θ2, …, θk are independent sam-
ples from the normal distribution with mean 0 and vari-
ance σ² > 0 (Chen and Rosenfeld, 1999):
θ ∼ N(0, σ²) × N(0, σ²) × ⋯ × N(0, σ²)   (3)
or equivalently, that θ is drawn from a multivariate Gaus-
sian with mean 0 (the zero vector) and diagonal covariance
matrix σ²I, i.e., θ ∼ N(0, σ²I).
This says that a priori, the learner expects most fea-
tures in Figure 1 to have weights close to zero, i.e., to be
irrelevant. Maximizing p(θ) · p(D | θ) means finding
a relatively small set of features that adequately describe
the rules and exceptions of the grammar. Reducing the
variance σ² strengthens this bias toward simplicity.
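Concretely, the objective of section 3 with the prior of equation (3) can be sketched as follows; `log_likelihood` is a stand-in for the rule model's data term, which in the paper comes from the transformation graph:

```python
import math

# Sketch of the MAP objective: log p(theta) + log p(D | theta), where the
# Gaussian prior contributes -theta_i^2 / (2 sigma^2) per weight (+ const).
def log_prior(theta, sigma2):
    k = len(theta)
    return (-sum(t * t for t in theta) / (2.0 * sigma2)
            - 0.5 * k * math.log(2.0 * math.pi * sigma2))

def map_objective(theta, sigma2, log_likelihood):
    # `log_likelihood` is a placeholder callable, an assumption of this sketch
    return log_likelihood(theta) + log_prior(theta, sigma2)

def prior_gradient(theta, sigma2):
    """Contribution of the prior to the gradient used in ascent:
    each weight is pulled toward zero with strength theta_i / sigma^2."""
    return [-t / sigma2 for t in theta]

# shrinking sigma^2 strengthens the bias toward theta = 0:
assert log_prior([1.0], 0.1) < log_prior([1.0], 1.0)
```

The gradient term makes the trade-off explicit: a weight earns its distance from zero only if the likelihood gain outweighs the quadratic penalty.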
For example, if S_fund → To fund NP PP and
S_merge → To merge NP PP are both observed more
often than the current p_θ distribution predicts, then the
learner can follow either (or both) of two strategies: raise
θ8 and θ9, or raise θ3. The former strategy fits the training
data only; the latter affects many disparate arcs and leads
to generalization. The latter strategy may harm p(D | θ)
but is preferred by the prior p(θ) because it uses one pa-
rameter instead of two. If more than two words act like
merge and fund, the pressure to generalize is stronger.
7 Perturbation Parameters
In experiments, we have found that a slight variation on
this model gets slightly better results. Let θ_e denote the
exception weight (if any) that allows one to tune the prob-
ability of rule e. We eliminate θ_e and introduce a different
parameter μ_e, called a perturbation, which is used in the
following replacements for equations (1) and (2):
I_θ(e) = δ_{e,START} + Σ_{e′} I_θ(e′) exp(μ_e) p(e′ → e)   (4)
p_θ(e) = I_θ(e) exp(μ_e) p(e → HALT) / Z   (5)
where Z is a global normalizing factor chosen so that
Σ_e p_θ(e) = 1. The new prior on μ_e is the same as the
old prior on θ_e.
Increasing either θ_e or μ_e will raise p_θ(e); the learner
may do this to account for observations of e in training
data. The probabilities of other rules consequently de-
crease so that Σ_e p_θ(e) = 1. When μ_e is raised, all
rules' probabilities are scaled down slightly and equally
(because Z increases). When θ_e is raised, e steals proba-
bility from its siblings,11 but these are similar to e so tend
to appear in test data if e is in training data. Raising θ_e
without disproportionately harming e's siblings requires
manipulation of many other parameters, which is discour-
aged by the prior and may also suffer from search error.
We speculate that this is why μ_e works better.
11Raising the probability of an arc from e′ to e decreases the
probabilities of arcs from e′ to siblings of e, as they sum to 1.
(Insert) (Insert, target)
(Insert, left) (Insert, target, left)
(Insert, right) (Insert, target, right)
(Insert, left, right)
(Insert, side) (Insert, side, target)
(Insert, side, left) (Insert, side, target, left)
(Insert, side, right) (Insert, side, target, right)
(Insert, side, left, right)
If the arc inserts Adv after TO in TO fund PP, then:
target = Adv, left = TO, right = ——, side = left of head.
Table 2: Each Insert arc has 14 features. The features of any
given arc are found by instantiating the tuples above, as shown.
Each instantiated tuple has a weight specified in θ.
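Table 2's feature templates can be sketched as code (the tuple encoding is an assumption of the sketch; Table 2's own example arc is used, with `None` standing in for the "——" headword slot):

```python
# Sketch of Table 2: seven templates over {target, left, right}, each
# instantiated with and without the `side` field, giving 14 features per arc.
def insert_features(target, left, right, side):
    templates = [(), ("target",), ("left",), ("target", "left"),
                 ("right",), ("target", "right"), ("left", "right")]
    vals = {"target": target, "left": left, "right": right}
    feats = []
    for tmpl in templates:
        body = tuple((role, vals[role]) for role in tmpl)
        feats.append(("Insert",) + body)                 # side-blind version
        feats.append(("Insert", ("side", side)) + body)  # side-aware version
    return feats

# Table 2's example: the arc inserting Adv after TO in "TO fund PP".
# right=None stands in for "——", the headword slot to the insertion's right.
f = insert_features(target="Adv", left="TO", right=None, side="left of head")
assert len(f) == 14
assert ("Insert", ("target", "Adv"), ("left", "TO")) in f
```

Coarse templates like `(Insert,)` fire on every insertion and so learn the language's overall insertion rate, while fully instantiated ones capture narrow contexts.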
S → … rules only               train    dev    test
Treebank sections 0–15 16 17
sentences 15554 1343 866
rule tokens 18836 1588 973
rule types 11565 1317 795
frame types 2722 564 365
headword types 3607 756 504
novel rule tokens 51.6% 47.8%
novel frame tokens 8.9% 6.3%
novel headword tokens 10.4% 10.2%
novel rule types 61.4% 57.5%
novel frame types 24.6% 16.4%
novel headword types 20.9% 18.8%
nonterminal types 78
# transformations applicable to   158n−1   158n−1   158n−1
rule with RHS length = n
Table 3: Properties of the experimental data. “Novel” means
not observed in training. “Frame” was defined in section 4.
8 Evaluation¹²
To evaluate the quality of generalization, we used pre-
parsed training data D and testing data E (Table 3).
Each dataset consisted of a collection of flat rules such as
S_put → NP put NP PP extracted from the Penn Tree-
bank (Marcus et al., 1993). Thus, p(D | θ, μ) and
p(E | θ, μ) were each defined as a product of rule prob-
abilities of the form p_{θ,μ}(NP put NP PP | S_put).
The learner attempted to maximize p(θ, μ) · p(D |
θ, μ) by gradient ascent. This amounts to learning the
generalizations and exceptions that related the training
rules D. The evaluation measure was then the perplex-
ity on test data, 2^(−log2 p(E | θ, μ) / |E|). To get a good
(low) perplexity score, the model had to assign reason-
able probabilities to the many novel rules in E (Table 3).
For many of these rules, even the frame was novel.
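The measure can be sketched directly (the probabilities below are toy numbers standing in for the model's rule probabilities; note that every test rule must receive p > 0 for the score to be finite):

```python
import math

# Per-rule perplexity of a test set: 2 ** (-(1/|E|) * sum_e log2 p(e)).
def perplexity(rule_probs):
    assert all(p > 0 for p in rule_probs)   # a zero would make perplexity infinite
    cross_entropy = -sum(math.log2(p) for p in rule_probs) / len(rule_probs)
    return 2.0 ** cross_entropy

assert perplexity([0.5, 0.5]) == 2.0        # uniform over two outcomes
```

This is why a pure Treebank grammar scores infinitely badly on the flat-rule test set: it assigns zero probability to unseen rules.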
Note that although the training data was preparsed into
rules, it was not annotated with the paths in Figure 1 that
generated those rules, so estimating θ and μ was still an
unsupervised learning problem.
The transformation graph had about 14 features per arc
(Table 2). In the finite part of the transformation graph
that was actually explored (including bad arcs that com-
pete with good ones), about 70000 distinct features were
encountered, though after training, only a few hundred
12See (Eisner, 2001) for full details of data preparation,
model structure, parameter initialization, backoff levels for the
comparison models, efficient techniques for computing the ob-
jective and its gradient, and more analysis of the results.
                                  Treebank/Markov
            basic               Katz      one-count^a
        flat    non-flat^b      flat      flat    non-flat
(a) Treebank      ∞        ∞
1-gram 1774.9 86435.1 340.9 160.0 193.2
2-gram 135.2 199.3 127.2 116.2 174.7
3-gram 136.5 177.4 132.7 123.3 174.8
Collinsc 363.0 494.5 197.9
transformation 108.6
averagedd 102.3
(b) 1-gram 1991.2 96318.8 455.1 194.3 233.1
2-gram 162.2 236.6 153.2 138.8 205.6
3-gram 161.9 211.0 156.8 145.7 208.1
Collins 414.5 589.4 242.0
transformation 124.8
averaged 118.0
aBack off from Treebank grammar with Katz vs. one-count
backoff (Chen and Goodman, 1996) (Note: One-count was al-
ways used for backoff within the n-gram and Collins models.)
bSee section 2 for discussion
cCollins (1997, model 2)
dAverage of transformation model with best other model
Table 4: Perplexity of the test set under various models. (a) Full
training set. (b) Half training set (sections 0–7 only).
feature weights were substantial, and only a few thousand
were even far enough from zero to affect performance.
There was also a parameter μ_e for each observed rule e.
Results are given in Table 4a, which compares the
transformation model to various competing models dis-
cussed in section 2. The best (smallest) perplexities ap-
pear in boldface. The key results:
• The transformation model was the winner, reducing
perplexity by 20% over the best model replicated from
previous literature (a bigram model).
• Much of this improvement could be explained by
the transformation model’s ability to model exceptions.
Adding this ability more directly to the bigram model,
using the new Treebank/Markov approach of section 2,
also reduced perplexity from the bigram model, by 6%
or 14% depending on whether Katz or one-count backoff
was used, versus the transformation model’s 20%.
• Averaging the transformation model with the best com-
peting model (Treebank/bigram) improved it by an addi-
tional 6%. So using transformations yields a total per-
plexity reduction of 12% over Treebank/bigram, and 24%
over the best previous model from the literature (bigram).
• What would be the cost of achieving such a perplexity
improvement by additional annotation? Training the av-
eraged model on only the first half of the training set, with
no further tuning of any options (Table 4b), yielded a test
set perplexity of 118.0. So by using transformations, we
can achieve about the same perplexity as the best model
without transformations (Treebank/bigram, 116.2), using
only half as much training data.
 Furthermore, comparing Tables 4a and 4b shows that
the transformation model had the most graceful perfor-
mance degradation when the dataset was reduced in size.
[Three log-log scatter plots; both axes are p(rule | headword): the averaged transformation model (vertical) vs. Treebank/bigram (horizontal).]
Figure 2: Probabilities of test set flat rules under the averaged model, plotted against the corresponding probabilities under the best transformation-free model. Improvements fall above the main diagonal; dashed diagonals indicate a factor of two. The three log-log plots (at different scales!) partition the rules by the number of training observations: 0 (left graph), 1 (middle), ≥ 2 (right).
This is an encouraging result for the use of the method in less supervised contexts (although results on a noisy dataset would be more convincing in this regard).
• The competing models from the literature are best used to predict flat rules directly, rather than by summing over their possible non-flat internal structures, as has been done in the past. This result is significant in itself. Extending Johnson (1998), it shows the inappropriateness of the traditional independence assumptions that build up a frame by several rule expansions (section 2).
Figure 2 shows that averaging the transformation model with the Treebank/bigram model improves the latter not merely on balance, but across the board. In other words, there is no evident class of phenomena for which incorporating transformations would be a bad idea.
• Transformations particularly helped raise the estimates of the low-probability novel rules in test data, as hoped.
• Transformations also helped on test rules that had been observed once in training with relatively infrequent words. (In other words, the transformation model does not discount singletons too much.)
• Transformations hurt slightly on balance for rules observed more than once in training, but the effect was tiny.
All these differences are slightly exaggerated if one compares the transformation model directly with the Treebank/bigram model, without averaging.
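The averaged model can be read as a simple mixture of the two component estimates; a minimal sketch, under the assumption (not stated in this excerpt) that the average is an equal-weight arithmetic mean of the conditional probabilities:

```python
def average_models(p1, p2, weight=0.5):
    """Mixture of two conditional models p(frame | word).
    Equal weights are an illustrative assumption, not the paper's choice."""
    return lambda f, w: weight * p1(f, w) + (1.0 - weight) * p2(f, w)

# Two toy component models with hypothetical numbers (both ignore w here).
p_transform = lambda f, w: {'f1': 0.3, 'f2': 0.7}[f]
p_bigram    = lambda f, w: {'f1': 0.5, 'f2': 0.5}[f]

p_avg = average_models(p_transform, p_bigram)
print(p_avg('f1', 'put'))  # ≈ 0.4
```

Note that if each component distribution sums to 1 over frames, the mixture does too, so it remains a proper conditional model.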
The transformation model was designed to use edit
operations in order to generalize appropriately from a
word’s observed frames to new frames that are likely to
appear with that word in test data. To directly test the
model’s success at such generalization, we compared it
to the bigram model on a pseudo-disambiguation task.
Each instance of the task consisted of a pair of rules from test data, expressed as (word, frame) pairs (w1, f1) and (w2, f2), such that f1 and f2 are “novel” frames that did not appear in training data (with any headword). Each model was then asked: Does f1 go with w1 and f2 with w2, or vice versa? In other words, which is bigger, p(f1 | w1) · p(f2 | w2) or p(f2 | w1) · p(f1 | w2)?
Since the frames were novel, the model had to make
the choice according to whether f1 or f2 looked more
like the frames that had actually been observed with w1
in the past, and likewise w2. What this means depends
on the model. The bigram model takes two frames to
look alike if they contain many bigrams in common. The
transformation model takes two frames to look alike if
they are connected by a path of probable transformations.
The test data contained 62 distinct rules (w, f) in which f was a novel frame. This yielded 62 · 61/2 = 1891 pairs of rules, leading to 1811 task instances after obvious ties were discarded.13
Baseline performance on this difficult task is 50% (ran-
dom guess). The bigram model chose correctly in 1595
of the 1811 instances (88.1%). Parameters for “memo-
rizing” specific frames do not help on this task, which in-
volves only novel frames, so the Treebank/bigram model
had the same performance. By contrast, the transforma-
tion model got 1669 of 1811 correct (92.2%), for a more-
than-34% reduction in error rate. (The development set
showed similar results.) However, since the 1811 task
instances were derived non-independently from just 62
novel rules, this result is based on a rather small sample.
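The decision rule for each task instance can be sketched directly; the toy estimator below, with its hypothetical frames and numbers, stands in for any conditional model p(frame | word), such as the bigram or transformation model:

```python
def pseudo_disambiguate(p, w1, f1, w2, f2):
    """Return True iff the model prefers the attested pairing
    (w1, f1), (w2, f2) over the swapped pairing (w1, f2), (w2, f1),
    i.e. p(f1|w1) * p(f2|w2) > p(f2|w1) * p(f1|w2)."""
    return p(f1, w1) * p(f2, w2) > p(f2, w1) * p(f1, w2)

# Toy estimator (hypothetical numbers): 'put' prefers a transitive
# frame with a PP, while 'sleep' prefers an intransitive frame.
probs = {
    ('NP _ NP PP', 'put'):   0.4,  ('NP _', 'put'):   0.1,
    ('NP _ NP PP', 'sleep'): 0.05, ('NP _', 'sleep'): 0.6,
}
p = lambda f, w: probs[(f, w)]

print(pseudo_disambiguate(p, 'put', 'NP _ NP PP', 'sleep', 'NP _'))
```

An instance is an "obvious tie" (footnote 13) exactly when the two products are equal, so such instances are discarded before scoring.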
9 Discussion
This paper has presented a nontrivial way to reparameter-
ize a PCFG in terms of “deep” parameters representing
transformations and exceptions. A linguistically sensible
prior was natural to define over these deep parameters.
Famous examples of “deep reparameterization” are the
Fourier transform in speech recognition and the SVD
transform for Latent Semantic Analysis in IR. Like our
technique, they are intended to reveal significant structure
through the leading parameters while relegating noise and
exceptions to minor parameters. Such representations
13An obvious tie is an instance where f1 = f2, or where
both w1 and w2 were novel headwords. (The 62 rules included
11 with novel headwords.) In such cases, neither the bigram nor
the transformation model has any basis for making its decision:
the probabilities being compared will necessarily be equal.
make it easier to model the similarity or probability of the
objects at hand (waveforms, documents, or grammars).
Beyond the fact that it shows at least a good perplex-
ity improvement (it has not yet been applied to a real
task), an exciting “big idea” aspect of this work is its
flexibility in defining linguistically sensible priors over
grammars. Our reparameterization is made with refer-
ence to a user-designed transformation graph (Figure 1).
The graph need not be confined to edit distance transfor-
mations, or to the simple features of Table 2 (used here
for comparability with the Markov models), which con-
dition a transformation’s probability on local context.
In principle, the approach could be used to capture
a great many linguistic phenomena. Figure 1 could be
extended with more ambitious transformations, such as
gapping, gap-threading, and passivization. The flat rules
could be annotated with internal structure (as in TAG) and
thematic roles. Finally, the arcs could bear further fea-
tures. For example, the probability of unaccusative move-
ment (someone sank the boat!the boat sank) should de-
pend on whether the headword is a change-of-state verb.
Indeed, Figure 1 can be converted to any lexicalized theory of grammar, such as categorial grammar, TAG, LFG, HPSG, or Minimalism. The vertices represent lexical entries and the arcs represent probabilistic lexical redundancy rules or metarules (see footnote 8). The transformation model approach is therefore a full stochastic treatment of lexicalized syntax, apparently the first to treat lexical redundancy rules, although Briscoe and Copestake (1999) give an ad hoc approach. See Eisner (2001; 2002a) for more discussion.
It is worthwhile to compare the statistical approach here with some other approaches:
• Transformation models are similar to graphical models: they allow similar patterns of deductive and abductive inference from observations. However, the vertices of a transformation graph do not represent different random variables, but rather mutually exclusive values of the same random variable, whose probabilities sum to 1.
• Transformation models incorporate conditional log-linear (maximum entropy) models. As an alternative, one could directly build a conditional log-linear model of p(RHS | LHS). However, such a model would learn probabilities, not relationships. A feature weight would not really model the strength of the relationship between two frames e, e′ that share that feature. It would only influence both frames’ probabilities. If the probability of e were altered by some unrelated factor (e.g., an exception weight), then the probability of e′ would not respond.
• A transformation model can be regarded as a probabilistic FSA that consists mostly of ε-transitions. (Rules are only emitted on the arcs to HALT.) This perspective allows use of generic methods for finite-state parameter estimation (Eisner, 2002b). We are strongly interested in improving the speed of such methods and their ability to avoid local maxima, which are currently the major difficulty with our system, as they are for many unsupervised learning techniques. We expect to further pursue transformation models (and simpler variants that are easier to estimate) within this flexible finite-state framework.
The interested reader is encouraged to consult Eisner (2001) for a much more careful and wide-ranging discussion of transformation models, their algorithms, and their relation to linguistic theory, statistics, and parsing. Chapter 1 provides a good overview. For a brief article highlighting the connection to linguistics, see Eisner (2002a).
References
Hiyan Alshawi. 1996. Head automata for speech translation.
In Proceedings of ICSLP, Philadelphia, PA.
T. Briscoe and A. Copestake. 1999. Lexical rules in constraint-
based grammar. Computational Linguistics, 25(4):487–526.
Bob Carpenter. 1991. The generative power of categorial gram-
mars and head-driven phrase structure grammars with lexical
rules. Computational Linguistics, 17(3):301–313.
Glenn Carroll and Mats Rooth. 1998. Valence induction with a
head-lexicalized PCFG. In Proceedings of EMNLP.
Eugene Charniak. 1997. Statistical parsing with a context-free
grammar and word statistics. In Proc. of AAAI, 598–603.
Eugene Charniak. 2000. A maximum-entropy inspired parser.
In Proceedings of NAACL.
Stanley Chen and Joshua Goodman. 1996. An empirical study
of smoothing techniques. In Proceedings of ACL.
Stanley F. Chen and Ronald Rosenfeld. 1999. A Gaussian prior
for smoothing maximum entropy models. Technical Report
CMU-CS-99-108, Carnegie Mellon University, February.
Stanley Chen. 1996. Building Probabilistic Models for Natural
Language. Ph.D. thesis, Harvard University.
Michael J. Collins. 1997. Three generative, lexicalised models
for statistical parsing. In Proceedings of ACL/EACL, 16–23.
Carl De Marcken. 1996. Unsupervised Language Acquisition.
Ph.D. thesis, MIT.
Jason Eisner. 1996. Three new probabilistic models for depen-
dency parsing: An exploration. Proc. of COLING, 340–345.
Jason Eisner. 2001. Smoothing a Probabilistic Lexicon via Syn-
tactic Transformations. Ph.D. thesis, Univ. of Pennsylvania.
Jason Eisner. 2002a. Discovering syntactic deep structure via
Bayesian statistics. Cognitive Science, 26(3), May.
Jason Eisner. 2002b. Parameter estimation for probabilistic
finite-state transducers. In Proceedings of the 40th ACL.
P. Grünwald. 1996. A minimum description length approach to grammar inference. In S. Wermter et al., eds., Symbolic, Connectionist and Statistical Approaches to Learning for NLP, no. 1040 in Lecture Notes in AI, pages 203–216.
Mark Johnson. 1998. PCFG models of linguistic tree represen-
tations. Computational Linguistics, 24(4):613–632.
Beth Levin. 1993. English Verb Classes and Alternations: A
Preliminary Investigation. University of Chicago Press.
M. Marcus, B. Santorini, and M.A. Marcinkiewicz. 1993.
Building a large annotated corpus of English: The Penn Tree-
bank. Computational Linguistics, 19(2):313–330.
Miles Osborne and Ted Briscoe. 1997. Learning stochastic cat-
egorial grammars. In Proceedings of CoNLL, 80–87. ACL.
A. Stolcke and S.M. Omohundro. 1994. Inducing probabilistic
grammars by Bayesian model merging. In Proc. of ICGI.