Theoretical Evaluation of Estimation Methods for Data-Oriented Parsing
Willem Zuidema
Institute for Logic, Language and Computation
University of Amsterdam
Plantage Muidergracht 24, 1018 TV, Amsterdam, the Netherlands.
jzuidema@science.uva.nl
Abstract
We analyze estimation methods for Data-
Oriented Parsing, as well as the theoret-
ical criteria used to evaluate them. We
show that all current estimation methods
are inconsistent in the “weight-distribution
test”, and argue that these results force us
to rethink both the methods proposed and
the criteria used.
1 Introduction
Stochastic Tree Substitution Grammars (hence-
forth, STSGs) are a simple generalization of Prob-
abilistic Context Free Grammars, where the pro-
ductive elements are not rewrite rules but elemen-
tary trees of arbitrary size. The increased flexibil-
ity allows STSGs to model a variety of syntactic
and statistical dependencies, using relatively complex
primitives but just a single and extremely simple
global rule: substitution. STSGs can be seen as
Stochastic Tree Adjoining Grammars without the
adjunction operation.
STSGs are the underlying formalism of most instantiations
of an approach to statistical parsing
known as “Data-Oriented Parsing” (Scha, 1990;
Bod, 1998). In this approach the subtrees of the
trees in a tree bank are used as elementary trees of
the grammar. In most DOP models the grammar
used is an STSG with, in principle, all subtrees¹ of
the trees in the tree bank as elementary trees. For
disambiguation, the best parse tree is taken to be
the most probable parse according to the weights
of the grammar.
¹ A subtree $t'$ of a parse tree $t$ is a tree such that every node $i'$ in $t'$ equals a node $i$ in $t$, and $i'$ either has no daughters or the same daughter nodes as $i$.
Several methods have been proposed to decide
on the weights based on observed tree frequencies
in a tree bank. The first such method is now known
as “DOP1” (Bod, 1993). In combination with
some heuristic constraints on the allowed subtrees,
it has been remarkably successful on small tree
banks. Despite this empirical success, (Johnson,
2002) argued that it is inadequate because it is bi-
ased and inconsistent. His criticism spearheaded
a number of other methods, including (Bonnema
et al., 1999; Bod, 2003; Sima’an and Buratto,
2003; Zollmann and Sima’an, 2005), and will be
the starting point of our analysis. As it turns out,
the DOP1 method really is biased and inconsis-
tent, but not for the reasons Johnson gives, and it
really is inadequate, but not because it is biased
and inconsistent. In this note, we further show that
alternative methods that have been proposed only
partly remedy the problems with DOP1, leaving
weight estimation as an important open problem.
2 Estimation Methods
The DOP model and STSG formalism are de-
scribed in detail elsewhere, for instance in (Bod,
1998). The main difference with PCFGs is that
multiple derivations, using elementary trees with
a variety of sizes, can yield the same parse tree.
The probability of a parse $p$ is therefore given by
$P(p) = \sum_{d : \hat{d} = p} P(d)$, where $\hat{d}$ is the tree derived
by derivation $d$, $P(d) = \prod_{t \in d} w(t)$, and $w(t)$ gives
the weights of the elementary trees $t$ that are combined
in the derivation $d$ (here treated as a multiset).
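As a minimal sketch of this definition, the following Python fragment sums the derivation probabilities of one parse. The elementary-tree names (`full`, `top`, `leaf`) and their weights are hypothetical and chosen only for illustration. Enumerating the derivations of a parse is assumed to be done elsewhere.

```python
from math import prod

def derivation_probability(derivation, weights):
    """P(d): product of the weights of the elementary trees used in d."""
    return prod(weights[t] for t in derivation)

def parse_probability(derivations, weights):
    """P(p): sum of P(d) over all derivations d that yield the parse p."""
    return sum(derivation_probability(d, weights) for d in derivations)

# Hypothetical weights: 'full' and 'top' share a root label, 'leaf' has another.
weights = {"full": 0.2, "top": 0.3, "leaf": 0.5}

# Two derivations of the same parse tree: the whole tree in one step,
# or a top fragment with two substitutions of 'leaf'.
derivations = [("full",), ("top", "leaf", "leaf")]

print(parse_probability(derivations, weights))
```

Each derivation is written as a tuple, so repeated use of the same elementary tree is counted with its multiplicity, in line with the multiset reading above.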
2.1 DOP1
In Bod’s original DOP implementation (Bod,
1993; Bod, 1998), henceforth DOP1, the weight
of an elementary tree t is defined as its relative
frequency (relative to other subtrees with the same
root label) in the tree bank. That is, the weight
$w_i = w(t_i)$ of an elementary tree $t_i$ is given by:
$$w_i = \frac{f_i}{\sum_{j : r(t_j) = r(t_i)} f_j}, \qquad (1)$$
where $f_i = f(t_i)$ gives the frequency of subtree $t_i$
in a corpus, and $r(t_i)$ is the root label of $t_i$.
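The relative-frequency estimate of equation (1) can be sketched in Python as follows. This is an illustration under stated assumptions, not Bod's implementation: trees are encoded as nested tuples `(label, child, ...)`, a bare string stands for a terminal or for a frontier nonterminal left open for substitution, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def fragments(node):
    """All subtrees rooted at `node`: every daughter is kept, and each
    nonterminal daughter is either cut off (left as a bare label) or expanded."""
    label, *children = node
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])                        # terminal: always kept
        else:
            options.append([child[0]] + fragments(child))  # cut here, or expand
    return [(label, *combo) for combo in product(*options)]

def all_subtrees(tree):
    """Subtrees rooted at any nonterminal node of `tree`."""
    found = fragments(tree)
    for child in tree[1:]:
        if not isinstance(child, str):
            found.extend(all_subtrees(child))
    return found

def dop1_weights(treebank):
    """Equation (1): subtree frequency relative to all subtrees with the same root."""
    counts = Counter(s for tree in treebank for s in all_subtrees(tree))
    root_totals = Counter()
    for subtree, count in counts.items():
        root_totals[subtree[0]] += count
    return {s: c / root_totals[s[0]] for s, c in counts.items()}

# Toy tree bank with two parse trees.
treebank = [("S", ("A", "a"), ("A", "a")), ("S", ("A", "a"))]
weights = dop1_weights(treebank)
```

On this toy tree bank there are six distinct S-rooted subtrees, each occurring once, so each receives weight 1/6, while the single A-rooted subtree receives weight 1.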
In his critique of this method, (Johnson, 2002)
considers a situation where there is an STSG G
(the target grammar) with a specific set of subtrees
$(t_1 \ldots t_N)$ and specific values of the weights
$(w_1 \ldots w_N)$. He evaluates an estimation procedure
which produces a grammar $G'$ (the estimated
grammar), by looking at the difference between
the weights of $G$ and the expected weights of $G'$.
Johnson’s test for consistency is thus based on
comparing the weight-distributions between target
grammar and estimated grammar2. I will therefore
refer to this test as the “weight-distribution test”.
t1 = [S [A a] [A a]]   t2 = [S [A a] A]   t3 = [S A [A a]]   t4 = [S A A]
t5 = [S [A a]]   t6 = [S A]   t7 = [A a]
Figure 1: The example of (Johnson, 2002)
(Johnson, 2002) looks at an example grammar
$G \in \mathrm{STSG}$ with the subtrees as in figure 1. Johnson
considers the case where the weights of all
trees of the target grammar $G$ are 0, except for
$w_7$, which is necessarily 1, and $w_4$ and $w_6$, which
are $w_4 = p$ and $w_6 = 1 - p$. He finds that the
expected values of the weights $w_4$ and $w_6$ of the
estimated grammar $G'$ are:
$$E[w'_4] = \frac{p}{2 + 2p}, \qquad (2)$$
$$E[w'_6] = \frac{1 - p}{2 + 2p}, \qquad (3)$$
which differ from their target values for every
value of $p$ with $0 < p < 1$. This analysis
thus shows that DOP1 is unable to recover the true
weights of the given STSG, and hence the inconsistency
of the estimator with respect to the class
of STSGs.
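Equations (2) and (3) can be checked with a short calculation. The sketch below assumes the large-sample limit, in which the complete tree $t_1$ occurs with relative frequency $p$ and $t_5$ with relative frequency $1 - p$, so that the subtrees $t_1 \ldots t_4$ each have frequency $p$ and $t_5$, $t_6$ each have frequency $1 - p$. The function name is hypothetical.

```python
from fractions import Fraction

def limiting_dop1_weights(p):
    """DOP1 weights of the S-rooted subtrees of figure 1 in the large-sample
    limit, when the target grammar has w4 = p, w6 = 1 - p and w7 = 1."""
    freq = {"t1": p, "t2": p, "t3": p, "t4": p, "t5": 1 - p, "t6": 1 - p}
    total = sum(freq.values())          # equals 2 + 2p
    return {name: f / total for name, f in freq.items()}

p = Fraction(1, 2)
estimated = limiting_dop1_weights(p)
print(estimated["t4"], estimated["t6"])
```

With exact fractions the result agrees with $p / (2 + 2p)$ and $(1 - p) / (2 + 2p)$, and neither value equals its target.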
² More precisely, it is based on evaluating the estimator’s behavior for any weight-distribution possible in the STSG model. (Prescher et al., 2003) give a more formal treatment of bias and consistency in the context of DOP.
Although usually cited as showing the inadequacy
of DOP1, Johnson’s example is in fact
not suitable to distinguish DOP1 from alternative
methods, because no possible estimation proce-
dure can recover the true weights in the case con-
sidered. In the example there are only two com-
plete trees that can be observed in the training
data, corresponding to the trees t1 and t5. It is
easy to see that when generating examples with
the grammar in figure 1, the relative frequencies³
$f_1 \ldots f_4$ of the subtrees $t_1 \ldots t_4$ must all be the
same, and equal to the frequency of the complete
tree $t_1$, which can be composed in the following
ways from the subtrees in the original grammar:
$$t_1 = t_2 \circ t_7 = t_3 \circ t_7 = t_4 \circ t_7 \circ t_7. \qquad (4)$$
It follows that the expected frequencies of each of
these subtrees are:
$$E[f_1] = E[f_2] = E[f_3] = E[f_4] = w_1 + w_2 w_7 + w_3 w_7 + w_4 w_7 w_7. \qquad (5)$$
Similarly, the other frequencies are given by:
$$E[f_5] = E[f_6] = w_5 + w_6 w_7 \qquad (6)$$
$$E[f_7] = 2(w_1 + w_2 w_7 + w_3 w_7 + w_4 w_7 w_7) + w_5 + w_6 w_7 = 2E[f_1] + E[f_5]. \qquad (7)$$
From these equations it is immediately clear
that, regardless of the amount of training data,
the problem is simply underdetermined. The values
of the 6 weights $w_1 \ldots w_6$ ($w_7 = 1$), given only
2 frequencies $f_1$ and $f_5$ (and the constraint that
$\sum_{i=1}^{6} f_i = 1$), are not uniquely defined, and no
possible estimation method will be able to reliably
recover the true weights.
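The underdetermination can be made concrete with two different weight assignments for the grammar of figure 1 that yield identical expected frequencies under equations (5) and (6). The sketch below uses exact fractions. The two grammars are illustrative choices, not taken from (Johnson, 2002).

```python
from fractions import Fraction

def expected_tree_frequencies(w):
    """Expected frequencies of the complete trees t1 and t5, equations (5) and (6)."""
    f1 = w["t1"] + w["t2"] * w["t7"] + w["t3"] * w["t7"] + w["t4"] * w["t7"] ** 2
    f5 = w["t5"] + w["t6"] * w["t7"]
    return f1, f5

half = Fraction(1, 2)

# Grammar A: all S-rooted mass on the smallest fragments t4 and t6.
grammar_a = {"t1": 0, "t2": 0, "t3": 0, "t4": half, "t5": 0, "t6": half, "t7": 1}

# Grammar B: all S-rooted mass on the complete trees t1 and t5.
grammar_b = {"t1": half, "t2": 0, "t3": 0, "t4": 0, "t5": half, "t6": 0, "t7": 1}

print(expected_tree_frequencies(grammar_a))
print(expected_tree_frequencies(grammar_b))
```

Both grammars produce $t_1$ and $t_5$ with frequency 1/2 each, so no amount of data generated from either can tell them apart.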
The relevant test is whether for all possible
STSGs and in the limit of infinite data, the ex-
pected relative frequencies of trees given the es-
timated grammar, equal the observed relative fre-
quencies. I will refer to this test as the “frequency-
distribution test”. As it turns out, the DOP1
method also fails this more lenient test. The easiest
way to show this, using again figure 1, is as follows.
The weights $w'_1 \ldots w'_7$ of grammar $G'$ will –
by definition – be set to the relative frequencies of
the corresponding subtrees:
$$w'_i = \begin{cases} \dfrac{f_i}{\sum_{j=1}^{6} f_j} & \text{for } i = 1 \ldots 6 \\ 1 & \text{for } i = 7. \end{cases} \qquad (8)$$
³ Throughout this paper I take frequencies $f_i$ to be relative
to the size of the corpus.
The grammar $G'$ will thus produce the complete
trees $t_1$ and $t_5$ with expected frequencies:
$$E[f'_1] = w'_1 + w'_2 w'_7 + w'_3 w'_7 + w'_4 w'_7 w'_7 = \frac{4 f_1}{\sum_{j=1}^{6} f_j} \qquad (9)$$
$$E[f'_5] = w'_5 + w'_6 w'_7 = \frac{2 f_5}{\sum_{j=1}^{6} f_j}. \qquad (10)$$
Now consider the two possible complete trees
$t_1$ and $t_5$, and the fraction of their frequencies
$f_1/f_5$. In the estimated grammar $G'$ this fraction
becomes:
$$\frac{E[f'_1]}{E[f'_5]} = \frac{4 f_1 / \sum_{j=1}^{6} f_j}{2 f_5 / \sum_{j=1}^{6} f_j} = \frac{2 f_1}{f_5}. \qquad (11)$$
That is, in the limit of infinite data, the estima-
tion procedure not only –understandably– fails to
find the target grammar amongst the many gram-
mars that could have produced the observed fre-
quencies, it in fact chooses a grammar that could
never have produced these observed frequencies
at all. This example shows the DOP1 method is
biased and inconsistent for the STSG class in the
frequency-distribution test⁴.
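The factor of two in equation (11) can be reproduced numerically. The sketch below takes observed relative frequencies $f_1$ and $f_5$ of the two complete trees, sets the weights as in equation (8), and computes the tree frequencies that the estimated grammar would itself generate, following equations (9) and (10). The function name and the input values are illustrative assumptions.

```python
from fractions import Fraction

def dop1_induced_frequencies(f1, f5):
    """Tree frequencies generated by the DOP1 estimate for figure 1,
    given observed relative frequencies f1 (tree t1) and f5 (tree t5)."""
    subtree_freq = {"t1": f1, "t2": f1, "t3": f1, "t4": f1, "t5": f5, "t6": f5}
    total = sum(subtree_freq.values())
    w = {name: f / total for name, f in subtree_freq.items()}   # equation (8)
    w["t7"] = 1
    e_f1 = w["t1"] + w["t2"] * w["t7"] + w["t3"] * w["t7"] + w["t4"] * w["t7"] ** 2
    e_f5 = w["t5"] + w["t6"] * w["t7"]
    return e_f1, e_f5

f1, f5 = Fraction(1, 4), Fraction(3, 4)
e_f1, e_f5 = dop1_induced_frequencies(f1, f5)
print(f1 / f5, e_f1 / e_f5)
```

For these inputs the observed ratio is 1/3, while the estimated grammar generates the two trees in the ratio 2/3, twice the observed value.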
2.2 Correction-factor approaches
Based on similar observations, (Bonnema et al.,
1999; Bod, 2003) propose alternative estimation
methods, which involve a correction factor to
move probability mass from larger subtrees to
smaller ones. For instance, Bonnema et al. replace
equation (1) with:
$$w_i = 2^{-N(t_i)} \frac{f_i}{\sum_{j : r(t_j) = r(t_i)} f_j}, \qquad (12)$$
where $N(t_i)$ gives the number of internal nodes
in $t_i$ (such that $2^{-N(t_i)}$ is inversely proportional
to the number of possible derivations of $t_i$). Similarly,
(Bod, 2003) changes the way frequencies
$f_i$ are counted, with a similar effect. This approach
solves the specific problem shown in equation (11).
However, the following example shows
that the correction-factor approaches cannot solve
the more general problem.
⁴ Note that there are settings of the weights $w_1 \ldots w_7$ that
generate a frequency-distribution that could also have been
generated with a PCFG. The example given applies to such
distributions as well, and therefore also shows the inconsistency
of the DOP1 method for PCFG distributions.
t1 = [S [A a] [A b]]   t2 = [S [A b] [A a]]   t3 = [S [A a] [A a]]   t4 = [S [A b] [A b]]
t5 = [S [A a] A]   t6 = [S A [A b]]   t7 = [S [A b] A]   t8 = [S A [A a]]
t9 = [S A A]   t10 = [A a]   t11 = [A b]
Figure 2: Counter-example to the correction-factor approaches
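Consider the STSG in figure 2. The expected
frequencies $f_1 \ldots f_4$ are here given by:
$$\begin{aligned}
E[f_1] &= w_1 + w_5 w_{11} + w_6 w_{10} + w_9 w_{10} w_{11} \\
E[f_2] &= w_2 + w_7 w_{10} + w_8 w_{11} + w_9 w_{11} w_{10} \\
E[f_3] &= w_3 + w_5 w_{10} + w_8 w_{10} + w_9 w_{10} w_{10} \\
E[f_4] &= w_4 + w_6 w_{11} + w_7 w_{11} + w_9 w_{11} w_{11}
\end{aligned} \qquad (13)$$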
Frequencies $f_5 \ldots f_{11}$ are again simple combinations
of the frequencies $f_1 \ldots f_4$. Observations
of these frequencies therefore do not add
any extra information, and the problem of finding
the weights of the target grammar is in general
again underdetermined. But consider the situation
where $f_3 = f_4 = 0$ and $f_1 > 0$ and $f_2 > 0$.
This constrains the possible solutions enormously.
If we solve the following equations for $w_3 \ldots w_{11}$,
with the constraint that probabilities with the same
root label add up to 1 (i.e. $\sum_{i=1}^{9} w_i = 1$,
$w_{10} + w_{11} = 1$):
$$w_3 + w_5 w_{10} + w_8 w_{10} + w_9 w_{10} w_{10} = 0$$
$$w_4 + w_6 w_{11} + w_7 w_{11} + w_9 w_{11} w_{11} = 0,$$
we find, in addition to the obvious $w_3 = w_4 = 0$,
the following solutions: $w_{10} = w_6 = w_7 = w_9 = 0$
$\lor$ $w_{11} = w_5 = w_8 = w_9 = 0$ $\lor$
$w_5 = w_6 = w_7 = w_8 = w_9 = 0$. That is, if we observe
no occurrences of trees $t_3$ and $t_4$ in the training
sample, we know that at least one subtree in
each derivation of these strings must have weight
zero. However, any estimation method that uses
the (relative) frequencies of subtrees and a (non-zero)
correction factor that is based on the size of
the subtrees will assign non-zero values to all
weights $w_5 \ldots w_{11}$ if $f_1 > 0$ and $f_2 > 0$, as we
assumed. In other words, these weight estimation
methods for STSGs are also biased and inconsistent
in the frequency-distribution test.
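The argument can be illustrated by evaluating equation (13) for two weight assignments. The sketch below is only a numerical check under stated assumptions: the particular weight values are hypothetical, and the uniform assignment merely stands in for any estimator that gives every subtree a strictly positive weight.

```python
from fractions import Fraction

def expected_frequencies(w):
    """Expected frequencies of the complete trees t1..t4 of figure 2, equation (13).
    `w` maps the subtree index (1..11) to its weight."""
    e1 = w[1] + w[5] * w[11] + w[6] * w[10] + w[9] * w[10] * w[11]
    e2 = w[2] + w[7] * w[10] + w[8] * w[11] + w[9] * w[11] * w[10]
    e3 = w[3] + w[5] * w[10] + w[8] * w[10] + w[9] * w[10] * w[10]
    e4 = w[4] + w[6] * w[11] + w[7] * w[11] + w[9] * w[11] * w[11]
    return e1, e2, e3, e4

half = Fraction(1, 2)

# A grammar of the third solution type: w3..w9 are all zero.
consistent = {i: Fraction(0) for i in range(1, 12)}
consistent.update({1: half, 2: half, 10: half, 11: half})

# Strictly positive weights on every subtree (uniform per root label).
positive = {i: Fraction(1, 9) for i in range(1, 10)}
positive.update({10: half, 11: half})

print(expected_frequencies(consistent))
print(expected_frequencies(positive))
```

The first grammar never generates $t_3$ or $t_4$, whereas the strictly positive assignment gives each of them expected frequency 1/4, which contradicts the observation $f_3 = f_4 = 0$.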
2.3 Shortest derivation estimators
Because the STSG formalism allows elementary
trees of arbitrary size, every parse tree in a tree
bank could in principle be incorporated in an
STSG grammar. That is, we can define a trivial
estimator with the following weights:
$$w_i = \begin{cases} f_i & \text{if } t_i \text{ is an observed parse tree} \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$
Such an estimator is not particularly interesting,
because it does not generalize beyond the training
data. It is worth noting, however, that this estimator
is unbiased and consistent in the frequency-distribution
test. (Prescher et al., 2003) prove that
any unbiased estimator that uses the “all subtrees”
representation has the same property, and con-
clude that lack of bias is not a desired property.
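The trivial estimator of equation (14) amounts to counting complete parse trees. A minimal sketch, assuming trees are encoded as hashable nested tuples and that any subtree absent from the returned dictionary has weight 0 (the function name is hypothetical):

```python
from collections import Counter
from fractions import Fraction

def trivial_weights(treebank):
    """Equation (14): each observed parse tree gets its relative frequency
    as weight; all other subtrees implicitly get weight 0."""
    counts = Counter(treebank)
    size = len(treebank)
    return {tree: Fraction(count, size) for tree, count in counts.items()}

tree_aa = ("S", ("A", "a"), ("A", "a"))
tree_a = ("S", ("A", "a"))
weights = trivial_weights([tree_aa, tree_aa, tree_a])
print(weights[tree_aa], weights[tree_a])
```

Because every derivation under these weights consists of a single elementary tree, the grammar reproduces the observed tree frequencies exactly and assigns zero probability to any unseen tree.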
(Zollmann and Sima’an, 2005) propose an esti-
mator based on held-out estimation. The training
corpus is split into an estimation corpus EC and a
held-out corpus HC. The HC corpus is parsed
by searching for the shortest derivation of each
sentence, using only fragments from EC. The
elementary trees of the estimated STSG are as-
signed weights according to their usage frequen-
cies u1,...,uN in these shortest derivations:
$$w_i = \frac{u_i}{\sum_{j : r(t_j) = r(t_i)} u_j}. \qquad (15)$$
This approach solves the problem with bias de-
scribed above, while still allowing for consistency,
as Zollmann & Sima’an prove. However, their
proof only concerns consistency in the frequency-
distribution test. As the corpus EC grows to be
infinitely large, every parse tree in HC will also
be found in EC, and the shortest derivation will
therefore in the limit only involve a single ele-
mentary tree: the parse tree itself. Target STSGs
with non-zero weights on smaller elementary trees
will thus not be identified correctly, even with an
infinitely large training set. In other words, the
Zollmann & Sima’an method, and other methods
that converge to the “complete parse tree” solution
such as LS-DOP (Bod, 2003) and BackOff-DOP
(Sima’an and Buratto, 2003), are inconsistent in
the weight-distribution test.
3 Discussion & Conclusions
A desideratum for parameter estimation methods
is that they converge to the correct parameters with
infinitely many data – that is, we would like an estimator
to be consistent. The STSG formalism, however,
allows for many different derivations of the
same parse tree, and for many different grammars
to generate the same frequency-distribution. Con-
sistency in the weight-distribution test is there-
fore too stringent a criterion. We have shown that
DOP1 and methods based on correction factors
also fail the weaker frequency-distribution test.
However, the only current estimation methods
that are consistent in the frequency-distribution
test have the linguistically undesirable property
of converging to a distribution with all probability
mass in complete parse trees. Although these
methods fail the weight-distribution test for the
whole class of STSGs, we argued earlier that this
test is not the appropriate test either. Both estimation
methods for STSGs and the criteria for evaluating
them thus require thorough rethinking. In
forthcoming work we therefore study yet another
estimator, and the linguistically motivated evaluation
criterion of convergence to a maximally general
STSG consistent with the training data⁵.
References
Rens Bod. 1993. Using an annotated corpus as a stochastic
grammar. In Proceedings EACL’93, pp. 37–44.
Rens Bod. 1998. Beyond Grammar: An experience-based
theory of language. CSLI, Stanford, CA.
Rens Bod. 2003. An efficient implementation of a new DOP
model. In Proceedings EACL’03.
Remko Bonnema, Paul Buying, and Remko Scha. 1999.
A new probability model for data oriented parsing. In
Paul Dekker, editor, Proceedings of the Twelfth Amster-
dam Colloquium. ILLC, University of Amsterdam.
Mark Johnson. 2002. The DOP estimation method is biased
and inconsistent. Computational Linguistics, 28(1):71–
76.
D. Prescher, R. Scha, K. Sima’an, and A. Zollmann. 2003.
On the statistical consistency of DOP estimators. In Pro-
ceedings CLIN’03, Antwerp, Belgium.
Remko Scha. 1990. Taaltheorie en taaltechnologie; compe-
tence en performance. In R. de Kort and G.L.J. Leerdam,
eds, Computertoepassingen in de Neerlandistiek, pages 7–22.
LVVN, Almere. http://iaaa.nl/rs/LeerdamE.html.
Khalil Sima’an and Luciano Buratto. 2003. Backoff parameter
estimation for the DOP model. In Proceedings
ECML’03, pp. 373–384. Berlin: Springer Verlag.
Andreas Zollmann and Khalil Sima’an. 2005. A consistent
and efficient estimator for data-oriented parsing. Journal
of Automata, Languages and Combinatorics. In press.
⁵ The author is funded by NWO, project nr. 612.066.405,
and would like to thank the anonymous reviewers and several
colleagues for comments.
