Parsing with the Shortest Derivation
Rens Bod
Informatics Research Institute, University of Leeds, Leeds LS2 9JT, &
Institute for Logic, Language and Computation, University of Amsterdam
rens@scs.leeds.ac.uk
Abstract 
Common wisdom has it that the bias of stochastic
grammars in favor of shorter derivations of a sentence
is harmful and should be redressed. We show that the
common wisdom is wrong for stochastic grammars
that use elementary trees instead of context-free
rules, such as Stochastic Tree-Substitution Grammars
used by Data-Oriented Parsing models. For such
grammars a non-probabilistic metric based on the
shortest derivation outperforms a probabilistic metric
on the ATIS and OVIS corpora, while it obtains
competitive results on the Wall Street Journal (WSJ)
corpus. This paper also contains the first published
experiments with DOP on the WSJ.
1. Introduction 
A well-known property of stochastic grammars is their
propensity to assign higher probabilities to shorter
derivations of a sentence (cf. Chitrao & Grishman
1990; Magerman & Marcus 1991; Briscoe & Carroll
1993; Charniak 1996). This propensity is due to the
probability of a derivation being computed as the
product of the rule probabilities, and thus shorter
derivations, involving fewer rules, tend to have higher
probabilities, almost regardless of the training data.
While this bias may seem interesting in the light of
the principle of cognitive economy, shorter derivations
generate smaller parse trees (consisting of fewer
nodes) which are not warranted by the correct parses
of sentences. Most systems therefore redress this bias,
for instance by normalizing the derivation probability
(see Caraballo & Charniak 1998).
However, for stochastic grammars that use
elementary trees instead of context-free rules, the
propensity to assign higher probabilities to shorter
derivations does not necessarily lead to a bias in
favor of smaller parse trees, because elementary trees
may differ in size and lexicalization. For Stochastic
Tree-Substitution Grammars (STSG) used by Data-
Oriented Parsing (DOP) models, it has been observed
that the shortest derivation of a sentence consists of
the largest subtrees seen in a treebank that generate
that sentence (cf. Bod 1992, 98). We may therefore
wonder whether for STSG the bias in favor of shorter
derivations is perhaps beneficial rather than harmful.
To investigate this question we created a new
STSG-DOP model which uses this bias as a feature.
This non-probabilistic DOP model parses each
sentence by returning its shortest derivation
(consisting of the fewest subtrees seen in the corpus).
Only if there is more than one shortest derivation does
the model back off to a frequency ordering of the corpus-
subtrees, choosing the shortest derivation whose
subtrees are ranked highest. We compared this non-
probabilistic DOP model against the probabilistic
DOP model (which estimates the most probable parse
for each sentence) on three different domains: the
Penn ATIS treebank (Marcus et al. 1993), the Dutch
OVIS treebank (Bonnema et al. 1997) and the Penn
Wall Street Journal (WSJ) treebank (Marcus et al.
1993). Surprisingly, the non-probabilistic DOP model
outperforms the probabilistic DOP model on both the
ATIS and OVIS treebanks, while it obtains competit-
ive results on the WSJ treebank. We conjecture that
any stochastic grammar which uses units of flexible
size can be turned into an accurate non-probabilistic
version.
The rest of this paper is organized as follows:
we first explain both the probabilistic and non-
probabilistic DOP model. Next, we go into the
computational aspects of these models, and finally
we compare the performance of the models on the
three treebanks.
2. Probabilistic vs. Non-Probabilistic 
Data-Oriented Parsing 
Both probabilistic and non-probabilistic DOP are
based on the DOP model in Bod (1992), which
extracts a Stochastic Tree-Substitution Grammar from
a treebank ("STSG-DOP").¹ STSG-DOP uses subtrees
from parse trees in a corpus as elementary trees, and
leftmost-substitution to combine subtrees into new
trees.

¹ Note that the DOP approach of extracting grammars from
corpora has been applied to a wide variety of other
grammatical frameworks, including Tree-Insertion Grammar
(Hoogweg 2000), Tree-Adjoining Grammar (Neumann
1998), Lexical-Functional Grammar (Bod & Kaplan 1998;
Way 1999; Bod 2000a), Head-driven Phrase Structure
Grammar (Neumann & Flickinger 1999), and Montague
Grammar (van den Berg et al. 1994; Bod 1998). For the
relation between DOP and Memory-Based Learning, see
Daelemans (1999).

As an example, consider a very simple corpus
consisting of only two trees (we leave out some
subcategorizations to keep the example simple):
[Tree diagrams omitted. The two corpus trees are, in bracket notation:
(1) [S [NP she] [VP [V saw] [NP [NP the dress] [PP [P on] [NP the rack]]]]]
(2) [S [NP she] [VP [VP [V saw] [NP the dog]] [PP [P with] [NP the telescope]]]]]

Figure 1. A simple corpus of two trees.
A new sentence such as She saw the dress with the
telescope can be parsed by combining subtrees from
this corpus by means of leftmost-substitution
(indicated as ∘):
[Tree diagrams omitted: corpus subtrees combined by leftmost substitution (indicated as ∘) into the parse tree in which the PP with the telescope attaches to the VP.]

Figure 2. Derivation and parse tree for the sentence She saw the dress with the telescope.
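The substitution operation itself is easy to make concrete. Below is a minimal sketch, assuming trees are represented as nested tuples whose first element is the node label, with an uppercase leaf standing for an open substitution site; this representation is an illustrative assumption, not part of the model:

```python
# A minimal sketch of leftmost substitution. Trees are nested
# tuples (label, child1, child2, ...); string leaves in uppercase
# stand for open substitution sites (frontier nonterminals).

def leftmost_substitute(tree, subtree):
    """Replace the leftmost open nonterminal leaf of `tree` whose
    label equals the root of `subtree`; return (new_tree, done)."""
    if isinstance(tree, str):
        if tree.isupper() and tree == subtree[0]:
            return subtree, True      # substitute here
        return tree, False
    new_children, done = [], False
    for child in tree[1:]:
        if not done:
            child, done = leftmost_substitute(child, subtree)
        new_children.append(child)
    return (tree[0],) + tuple(new_children), done

# Usage: combine a large subtree of the second corpus tree (with an
# open NP site) with the subtree [NP the dress].
t1 = ("S", ("NP", "she"),
           ("VP", ("VP", ("V", "saw"), "NP"),
                  ("PP", ("P", "with"), ("NP", "the", "telescope"))))
np = ("NP", "the", "dress")
parse, _ = leftmost_substitute(t1, np)
print(parse)
```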
Note that other derivations, involving different 
subtrees, may yield the same parse tree; for instance: 
[Tree diagrams omitted: the subtree [S [NP she] [VP [VP [V saw] NP] [PP [P with] [NP the telescope]]]] combined with the subtree [NP the dress], yielding the same parse tree.]

Figure 3. Different derivation yielding the same parse tree for She saw the dress with the telescope.
Note also that, given this example corpus, the
sentence we considered is ambiguous; by combining
other subtrees, a different parse may be derived,
which is analogous to the first rather than the second
corpus sentence:
[Tree diagrams omitted: a derivation whose parse tree attaches the PP with the telescope to the NP the dress, as in the first corpus tree.]

Figure 4. Different derivation yielding a different parse tree for She saw the dress with the telescope.
The probabilistic and non-probabilistic DOP models
differ in the way they define the best parse tree of a
sentence. We now discuss these models separately.
2.1 The probabilistic DOP model 
The probabilistic DOP model introduced in Bod
(1992, 93) computes the most probable parse tree of a
sentence from the normalized subtree frequencies in
the corpus. The probability of a subtree t is estimated
as the number of occurrences of t seen in the corpus,
divided by the total number of occurrences of corpus-
subtrees that have the same root label as t.² Let |t|
return the number of occurrences of t in the corpus
and let r(t) return the root label of t; then

\[ P(t) \;=\; \frac{|t|}{\sum_{t':\, r(t') = r(t)} |t'|}. \]

The probability of a derivation is computed as the
product of the probabilities of the subtrees involved
in it. The probability of a parse tree is computed as
the sum of the probabilities of all distinct derivations
that produce that tree. The parse tree with the highest
probability is defined as the best parse tree of a
sentence.
² It should be stressed that there may be several other ways
to estimate subtree probabilities in DOP. For example,
Bonnema et al. (1999) estimate the probability of a subtree
as the probability that it has been involved in the derivation
of a corpus tree. It is not yet known whether this alternative
probability model outperforms the model in Bod (1993).
Johnson (1998) pointed out that the subtree estimator in
Bod (1993) yields a statistically inconsistent model. This
means that as the training corpus increases, the
corresponding sequences of probability distributions do not
converge to the true distribution that generated the training
data. Experiments with a consistent maximum likelihood
estimator (based on the inside-outside algorithm in Lari and
Young 1990) lead, however, to a significant decrease in
parse accuracy on the ATIS and OVIS corpora. This
indicates that statistical consistency does not necessarily
lead to better performance.
The probabilistic DOP model thus considers
counts of subtrees of a wide range of sizes in
computing the probability of a tree: everything from
counts of single-level rules to counts of entire trees.
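As an illustration of this estimator, here is a minimal sketch in Python, assuming subtrees are represented as hashable objects (e.g. the nested tuples used earlier, whose first element is the root label) and that `subtree_counts` already holds the corpus counts; both are assumptions for the example, not part of the model:

```python
# Relative-frequency subtree estimator of Section 2.1:
# P(t) = |t| / total count of subtrees sharing t's root label.

from collections import Counter

def root(t):
    # Root label of a nested-tuple subtree (assumed representation).
    return t[0] if isinstance(t, tuple) else t

def subtree_probabilities(subtree_counts: Counter) -> dict:
    # Total occurrences per root label.
    root_totals = Counter()
    for t, n in subtree_counts.items():
        root_totals[root(t)] += n
    # Normalize each subtree count by its root total.
    return {t: n / root_totals[root(t)]
            for t, n in subtree_counts.items()}
```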
2.2 The non-probabilistic DOP model
The non-probabilistic DOP model uses a rather
different definition of the best parse tree. Instead of
computing the most probable parse of a sentence, it
computes the parse tree which can be generated by
the fewest corpus-subtrees, i.e., by the shortest
derivation, independent of the subtree probabilities.
Since subtrees are allowed to be of arbitrary size, the
shortest derivation typically corresponds to the parse
tree which consists of the largest possible corpus-
subtrees, thus maximizing syntactic context. For
example, given the corpus in Figure 1, the best parse
tree for She saw the dress with the telescope is given
in Figure 3, since that parse tree can be generated by
a derivation of only two corpus-subtrees, while the
parse tree in Figure 4 needs at least three corpus-
subtrees to be generated. (Interestingly, the parse tree
with the shortest derivation in Figure 3 is also the
most probable parse tree according to probabilistic
DOP for this corpus, but this need not always be so.
As mentioned, the probabilistic DOP model already
has a bias to assign higher probabilities to parse
trees that can be generated by shorter derivations. The
non-probabilistic DOP model makes this bias
absolute.)
The shortest derivation may not be unique: it
may happen that different parses of a sentence are
generated by the same minimal number of corpus-
subtrees. In that case the model backs off to a
frequency ordering of the subtrees. That is, all
subtrees of each root label are assigned a rank
according to their frequency in the corpus: the most
frequent subtree (or subtrees) of each root label get
rank 1, the second most frequent subtree gets rank 2,
etc. Next, the rank of each (shortest) derivation is
computed as the sum of the ranks of the subtrees
involved. The derivation with the smallest sum, or
highest rank, is taken as the best derivation, producing
the best parse tree.
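A minimal sketch of this back-off, assuming the same subtree representation and counts as in the earlier sketch; the treatment of frequency ties (tied subtrees share a rank, the next distinct frequency gets the next rank) is one possible reading of the text:

```python
# Back-off ranking scheme: within each root label, subtrees are
# ranked by corpus frequency (most frequent = rank 1), and a
# derivation's rank is the sum of the ranks of its subtrees.

from collections import defaultdict

def root(t):
    # Root label of a nested-tuple subtree (assumed representation).
    return t[0] if isinstance(t, tuple) else t

def rank_subtrees(subtree_counts):
    by_root = defaultdict(list)
    for t, n in subtree_counts.items():
        by_root[root(t)].append((n, t))
    ranks = {}
    for label, items in by_root.items():
        items.sort(key=lambda x: -x[0])      # most frequent first
        rank, prev = 0, None
        for count, t in items:
            if count != prev:                # ties share a rank
                rank, prev = rank + 1, count
            ranks[t] = rank
    return ranks

def best_shortest_derivation(shortest_derivations, ranks):
    # Among equally short derivations (each a list of subtrees),
    # pick the one whose subtree ranks sum to the smallest value.
    return min(shortest_derivations,
               key=lambda d: sum(ranks[t] for t in d))
```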
The way we compute the rank of a derivation,
by summing up the ranks of its subtrees, may seem
rather ad hoc. However, it is possible to provide an
information-theoretical motivation for this model.
According to Zipf's law, rank is roughly proportional
to the negative logarithm of frequency (Zipf 1935). In
Shannon's Information Theory (Shannon 1948), the
negative logarithm (of base 2) of the probability of an
event is better known as the information of that event.
Thus, the rank of a subtree is roughly proportional to
its information. It follows that minimizing the sum of
the subtree ranks in a derivation corresponds to
minimizing the (self-)information of a derivation.
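Spelled out, writing I(t) = -log₂ P(t) for the information of subtree t and assuming Zipf's proportionality with some constant c > 0, the argument for a derivation d = t₁ ∘ ... ∘ tₙ runs:

```latex
\begin{align*}
\mathrm{rank}(t) &\approx -c \log_2 P(t) \;=\; c\, I(t), \\
\sum_{i=1}^{n} \mathrm{rank}(t_i)
  &\approx c \sum_{i=1}^{n} I(t_i)
  \;=\; -c \log_2 \prod_{i=1}^{n} P(t_i),
\end{align*}
```

so the shortest derivation with the smallest rank sum is, roughly, the one with the smallest self-information.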
3. Computational Aspects 
3.1 Computing the most probable parse 
Bod (1993) showed how standard chart parsing
techniques can be applied to probabilistic DOP. Each
corpus-subtree t is converted into a context-free rule r
where the lefthand side of r corresponds to the root
label of t and the righthand side of r corresponds to
the frontier labels of t. Indices link the rules to the
original subtrees so as to maintain the subtree's
internal structure and probability. These rules are used
to create a derivation forest for a sentence, and the
most probable parse is computed by sampling a
sufficiently large number of random derivations from
the forest ("Monte Carlo disambiguation", see Bod
1998; Chappelier & Rajman 2000). While this
technique has been successfully applied to parsing
the ATIS portion in the Penn Treebank (Marcus et al.
1993), it is extremely time consuming. This is mainly
because the number of random derivations that should
be sampled to reliably estimate the most probable
parse increases exponentially with the sentence
length (see Goodman 1998). It is therefore question-
able whether Bod's sampling technique can be scaled
to larger corpora such as the OVIS and the WSJ
corpora.
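A sketch of this subtree-to-rule conversion, under the same nested-tuple representation assumed earlier; the flat index used to link a rule back to its subtree is an illustrative choice:

```python
# Convert each corpus subtree into an indexed CFG rule: lefthand
# side = root label, righthand side = frontier labels; the index
# links the rule back to the original subtree (and hence its
# internal structure and probability).

def frontier(t):
    # Left-to-right frontier (leaf labels) of a nested-tuple tree.
    if isinstance(t, str):
        return [t]
    leaves = []
    for child in t[1:]:
        leaves.extend(frontier(child))
    return leaves

def subtrees_to_rules(subtree_probs):
    """`subtree_probs`: dict mapping subtree -> probability.
    Returns (lhs, rhs, index, probability) tuples."""
    return [(t[0], tuple(frontier(t)), index, p)
            for index, (t, p) in enumerate(subtree_probs.items())]
```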
Goodman (1998) showed how the probabil-
istic DOP model can be reduced to a compact
stochastic context-free grammar (SCFG) which
contains exactly eight SCFG rules for each node in
the training set trees. Although Goodman's reduction
method still does not allow for an efficient
computation of the most probable parse in DOP (in
fact, the problem of computing the most probable
parse is NP-hard -- see Sima'an 1996), his method
does allow for an efficient computation of the
"maximum constituents parse", i.e., the parse tree that
is most likely to have the largest number of correct
constituents (also called the "labeled recall parse").
Goodman has shown on the ATIS corpus that the
maximum constituents parse performs at least as well
as the most probable parse if all subtrees are used.
Unfortunately, Goodman's reduction method remains
beneficial only if indeed all treebank subtrees are
used (see Sima'an 1999: 108), while maximum parse
accuracy is typically obtained with a subtree set
which is smaller than the total set of subtrees (this is
probably due to data-sparseness effects -- see
Bonnema et al. 1997; Bod 1998; Sima'an 1999).
In this paper we will use Bod's subtree-to-rule
conversion method for studying the behavior of
probabilistic against non-probabilistic DOP for
different maximum subtree sizes. However, we will
not use Bod's Monte Carlo sampling technique from
complete derivation forests, as this turns out to be
computationally impractical for our larger corpora.
Instead, we use a Viterbi n-best search and estimate
the most probable parse from the 1,000 most probable
derivations, summing up the probabilities of derivat-
ions that generate the same tree. The algorithm for
computing the n most probable derivations follows
straightforwardly from the algorithm which computes
the most probable derivation by means of Viterbi
optimization (see Sima'an 1995, 1999).
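The final summation step can be sketched as follows, assuming the n-best search yields (parse tree, derivation probability) pairs after the subtree-internal structure has been restored via the rule indices:

```python
# Estimate the most probable parse from the n best derivations:
# sum the probabilities of derivations that generate the same
# tree and return the tree with the largest total.

from collections import defaultdict

def most_probable_parse(nbest):
    """`nbest`: iterable of (parse_tree, derivation_probability)
    pairs; parse trees must be hashable (e.g. nested tuples)."""
    tree_prob = defaultdict(float)
    for tree, p in nbest:
        tree_prob[tree] += p
    return max(tree_prob.items(), key=lambda kv: kv[1])
```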
3.2 Computing the shortest derivation 
As with the probabilistic DOP model, we first convert
the corpus-subtrees into rewrite rules. Next, the
shortest derivation can be computed in the same way
as the most probable derivation (by Viterbi) if we
give all rules equal probabilities, in which case the
shortest derivation is equal to the most probable
derivation. This can be seen as follows: if each rule
has a probability p then the probability of a derivation
involving n rules is equal to p^n, and since 0 < p < 1 the
derivation with the fewest rules has the greatest
probability. In our experiments, we gave each rule a
probability mass equal to 1/R, where R is the number
of distinct rules derived by Bod's method.
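In the negative-log-probability space in which Viterbi search is usually implemented, the same argument reads:

```latex
P(d) \;=\; \Bigl(\frac{1}{R}\Bigr)^{\!n}
\quad\Longrightarrow\quad
-\log P(d) \;=\; n \log R,
```

which grows strictly with n (since R > 1), so minimizing the total cost is exactly minimizing the number of rules in the derivation.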
As mentioned above, the shortest derivation
may not be unique. In that case we compute all
shortest derivations of a sentence and then apply our
ranking scheme to these derivations. Note that this
ranking scheme does distinguish between subtrees of
different root labels, as it ranks the subtrees given
their root label. The ranks of the shortest derivations
are computed by summing up the ranks of the
subtrees they involve. The shortest derivation with the
smallest sum of subtree ranks is taken to produce the
best parse tree.³
³ It may happen that different shortest derivations generate
the same tree. We will not distinguish between these cases,
however, and compute only the shortest derivation with the
highest rank.
4. Experimental Comparison 
4.1 Experiments on the ATIS corpus 
For our first comparison, we used 10 splits from the
Penn ATIS corpus (Marcus et al. 1993) into training
sets of 675 sentences and test sets of 75 sentences.
These splits were random except for one constraint:
that all words in the test set actually occurred in the
training set. As in Bod (1998), we eliminated all
epsilon productions and all "pseudo-attachments". As
accuracy metric we used the exact match, defined as
the percentage of the best parse trees that are
identical to the test set parses. Since the Penn ATIS
portion is relatively small, we were able to compute
the most probable parse both by means of Monte
Carlo sampling and by means of Viterbi n-best. Table
1 shows the means of the exact match accuracies for
increasing maximum subtree depths (up to depth 6).
Depth of    Probabilistic DOP                Non-probabilistic
subtrees    Monte Carlo    Viterbi n-best    DOP
1           46.7           46.7              24.8
≤2          67.5           67.5              40.3
≤3          78.1           78.2              57.1
≤4          83.6           83.0              81.5
≤5          83.9           83.4              83.6
≤6          84.1           84.0              85.6

Table 1. Exact match accuracies for the ATIS corpus
The table shows that the two methods for probabilistic
DOP score roughly the same: at depth ≤ 6, the Monte
Carlo method obtains 84.1% while the Viterbi n-best
method obtains 84.0%. These differences are not
statistically significant. The table also shows that for
small subtree depths the non-probabilistic DOP model
performs considerably worse than the probabilistic
model. This may not be surprising since for small
subtrees the shortest derivation corresponds to the
smallest parse tree, which is known to be a bad
prediction of the correct parse tree. Only if the
subtrees are larger than depth 4 does the non-probabilistic
DOP model score roughly the same as its
probabilistic counterpart. At subtree depth ≤ 6, the
non-probabilistic DOP model scores 1.5% better than
the best score of the probabilistic DOP model, which
is statistically significant according to paired t-tests.
4.2 Experiments on the OVIS corpus
For our comparison on the OVIS corpus (Bonnema et
al. 1997; Bod 1998) we again used 10 random splits
under the condition that all words in the test set
occurred in the training set (9000 sentences for
training, 1000 sentences for testing). The OVIS trees
contain both syntactic and semantic annotations, but
no epsilon productions. As in Bod (1998), we treated
the syntactic and semantic annotations of each node
as one label. Consequently, the labels are very
restrictive and collecting statistics over them is
difficult. Bonnema et al. (1997) and Sima'an (1999)
report that (probabilistic) DOP suffers considerably
from data-sparseness on OVIS, yielding a decrease in
parse accuracy if subtrees larger than depth 4 are
included. Thus it is interesting to investigate how non-
probabilistic DOP behaves on this corpus. Table 2
shows the means of the exact match accuracies for
increasing subtree depths.
Depth of    Probabilistic    Non-probabilistic
subtrees    DOP              DOP
1           83.1             70.4
≤2          87.6             85.1
≤3          89.6             89.5
≤4          90.0             90.9
≤5          89.7             91.5
≤6          88.8             92.2

Table 2. Exact match accuracies for the OVIS corpus
We again see that the non-probabilistic DOP model
performs badly for small subtree depths while it
outperforms the probabilistic DOP model if the
subtrees get larger (in this case for depth > 3). But
while the accuracy of probabilistic DOP deteriorates
after depth 4, the accuracy of non-probabilistic DOP
continues to grow. Thus non-probabilistic DOP seems
relatively insensitive to the low frequency of larger
subtrees. This property may be especially useful if no
meaningful statistics can be collected while
sentences can still be parsed by large chunks. At
depth ≤ 6, non-probabilistic DOP scores 3.4% better
than probabilistic DOP, which is statistically signifi-
cant using paired t-tests.
4.3 Experiments on the WSJ corpus 
Both the ATIS and OVIS corpora represent restricted
domains. In order to extend our results to a broad-
coverage domain, we tested the two models also on
the Wall Street Journal portion in the Penn Treebank
(Marcus et al. 1993).
To make our results comparable to others, we
did not test on different random splits but used the
now standard division of the WSJ with sections 2-21
for training (approx. 40,000 sentences) and section 23
for testing (see Collins 1997, 1999; Charniak 1997,
2000; Ratnaparkhi 1999); we only tested on sentences
≤ 40 words (2245 sentences). All trees were stripped
of their semantic tags, co-reference information and
quotation marks. We used all training set subtrees of
depth 1, but due to memory limitations we used a
subset of the subtrees larger than depth 1, by taking for
each depth a random sample of 400,000 subtrees. No
subtrees larger than depth 14 were used. This resulted
in a total set of 5,217,529 subtrees which were
smoothed by Good-Turing (see Bod 1996). We did not
employ a separate part-of-speech tagger: the test
sentences were directly parsed by the training set
subtrees. For words that were unknown in the training
set, we guessed their categories by means of the
method described in Weischedel et al. (1993) which
uses statistics on word-endings, hyphenation and
capitalization. The guessed category for each
unknown word was converted into a depth-1 subtree
and assigned a probability (or frequency for non-
probabilistic DOP) by means of simple Good-Turing.
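For concreteness, a heavily simplified sketch in the spirit of the cited method; the word-shape features and back-off used here are illustrative assumptions, not the exact statistics of Weischedel et al. (1993):

```python
# Guess an unknown word's category from crude word-shape
# statistics (suffix, capitalization, hyphenation) collected over
# the training set. Features and back-off are illustrative
# assumptions, not the exact model of Weischedel et al. (1993).

from collections import Counter, defaultdict

def shape(word):
    return (word[-3:].lower(), word[:1].isupper(), "-" in word)

def train_guesser(tagged_words):
    """`tagged_words`: iterable of (word, category) training pairs."""
    by_shape = defaultdict(Counter)
    overall = Counter()
    for word, cat in tagged_words:
        by_shape[shape(word)][cat] += 1
        overall[cat] += 1
    return by_shape, overall

def guess_category(word, by_shape, overall):
    dist = by_shape.get(shape(word))
    if dist:
        return dist.most_common(1)[0][0]
    return overall.most_common(1)[0][0]   # global back-off
```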
As accuracy metric we used the standard
PARSEVAL scores (Black et al. 1991) to compare a
proposed parse P with the corresponding correct
treebank parse T as follows:

\[ \text{Labeled Precision} \;=\; \frac{\#\,\text{correct constituents in } P}{\#\,\text{constituents in } P} \]

\[ \text{Labeled Recall} \;=\; \frac{\#\,\text{correct constituents in } P}{\#\,\text{constituents in } T} \]
A constituent in P is "correct" if there exists a
constituent in T of the same label that spans the same
words. As in other work, we collapsed ADVP and
PRT to the same label when calculating these scores
(see Collins 1997; Ratnaparkhi 1999; Charniak 1997).
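A minimal sketch of this computation, representing a constituent as a (label, start, end) triple; collapsing PRT into ADVP and the use of multisets for duplicate spans follow the text and standard practice, but the details are assumptions:

```python
# PARSEVAL labeled precision/recall: a proposed constituent is
# correct if the treebank parse contains a constituent with the
# same label over the same span. Multisets handle duplicate spans.

from collections import Counter

def normalize(constituents):
    return Counter(("ADVP" if lab == "PRT" else lab, i, j)
                   for lab, i, j in constituents)

def labeled_precision_recall(proposed, treebank):
    P, T = normalize(proposed), normalize(treebank)
    correct = sum((P & T).values())   # multiset intersection
    return correct / sum(P.values()), correct / sum(T.values())
```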
Table 3 shows the labeled precision (LP) and
labeled recall (LR) scores for probabilistic and non-
probabilistic DOP for six different maximum subtree
depths.
Depth of    Probabilistic DOP    Non-probabilistic DOP
subtrees    LP       LR          LP       LR
≤4          84.7     84.1        81.6     80.1
≤6          86.2     86.0        85.0     84.7
≤8          87.9     87.1        87.2     87.0
≤10         88.6     88.0        86.8     86.5
≤12         89.1     88.8        87.1     86.9
≤14         89.5     89.3        87.2     86.9

Table 3. Scores on the WSJ corpus (sentences ≤ 40 words)
The table shows that probabilistic DOP outperforms
non-probabilistic DOP for maximum subtree depths 4
and 6, while the models yield rather similar results for
maximum subtree depth 8. Surprisingly, the scores of
non-probabilistic DOP deteriorate if the subtrees are
further enlarged, while the scores of probabilistic
DOP continue to grow, up to 89.5% LP and 89.3%
LR. These scores are higher than those of several
other parsers (e.g. Collins 1997, 99; Charniak 1997),
but remain behind the scores of Charniak (2000) who
obtains 90.1% LP and 90.1% LR for sentences ≤ 40
words. However, in Bod (2000b) we show that even
higher scores can be obtained with probabilistic DOP
by restricting the number of words in the subtree
frontiers to 12 and restricting the depth of unlexical-
ized subtrees to 6; with these restrictions an LP of
90.8% and an LR of 90.6% is achieved.
We may raise the question as to whether we
actually need these extremely large subtrees to obtain
our best results. One could argue that DOP's gain in
parse accuracy with increasing subtree depth is due to
the model becoming sensitive to the influence of
lexical heads higher in the tree, and that this gain
could also be achieved by a more compact depth-1
DOP model (i.e. an SCFG) which annotates the
nonterminals with headwords. However, such a head-
lexicalized stochastic grammar does not capture
dependencies between nonheadwords (such as more
and than in the WSJ construction carry more people
than cargo, where neither more nor than are headwords
of the NP-constituent more people than cargo),
whereas a frontier-lexicalized DOP model using large
subtrees does capture these dependencies, since it
includes subtrees in which e.g. more and than are the
only frontier words.

In order to isolate the contribution
of nonheadword dependencies, we eliminated all
subtrees containing two or more nonheadwords (where
a nonheadword of a subtree is a word which is not a
headword of the subtree's root nonterminal -- although
such a nonheadword may be a headword of one of the
subtree's internal nodes). On the WSJ this led to a
decrease in LP/LR of 1.2%/1.0% for probabilistic
DOP. Thus nonheadword dependencies contribute to
higher parse accuracy, and should not be discarded.
This goes against the common wisdom that the relevant
lexical dependencies can be restricted to the locality
of headwords of constituents (as advocated in Collins
1999). It also shows that DOP's frontier lexicalization
is a viable alternative to constituent lexicalization
(as proposed in Charniak 1997; Collins 1997, 99;
Eisner 1997). Moreover, DOP's use of large subtrees
makes the model not only more lexically but also
more structurally sensitive.
5. Conclusion 
Common wisdom has it that the bias of stochastic
grammars in favor of shorter derivations is harmful
and should be redressed. We have shown that the
common wisdom is wrong for stochastic tree-
substitution grammars that use elementary trees of
flexible size. For such grammars, a non-probabilistic
metric based on the shortest derivation outperforms a
probabilistic metric on the ATIS and OVIS corpora,
while it obtains competitive results on the Wall
Street Journal corpus. We have seen that a non-
probabilistic version of DOP performed especially
well on corpora for which collecting subtree statistics
is difficult, while sentences can still be parsed by
relatively large chunks. We have also seen that
probabilistic DOP obtains very competitive results on
the WSJ corpus. Finally, we conjecture that any
stochastic grammar which uses elementary trees
rather than context-free rules can be turned into an
accurate non-probabilistic version (e.g. Tree-Insertion
Grammar and Tree-Adjoining Grammar).
Acknowledgements 
Thanks to Khalil Sima'an and three anonymous 
reviewers for useful suggestions. 

References 

Berg, M. van den, R. Bod and R. Scha, 1994. "A Corpus-Based Approach to Semantic Interpretation", Proceedings Ninth Amsterdam Colloquium, Amsterdam, The Netherlands.

Black, E. et al. 1991. "A Procedure for Quantitatively Comparing the Syntactic Coverage of English", Proceedings DARPA Speech and Natural Language Workshop, Pacific Grove, Morgan Kaufmann.

Bod, R. 1992. "Data-Oriented Parsing (DOP)", Proceedings COLING-92, Nantes, France.

Bod, R. 1993. "Using an Annotated Corpus as a Stochastic Grammar", Proceedings EACL'93, Utrecht, The Netherlands.

Bod, R. 1996. "Two Questions about Data-Oriented Parsing", Proceedings Fourth Workshop on Very Large Corpora, Copenhagen, Denmark.

Bod, R. 1998. Beyond Grammar: An Experience-Based Theory of Language, CSLI Publications, Cambridge University Press.

Bod, R. 2000a. "An Empirical Evaluation of LFG-DOP", Proceedings COLING-2000, Saarbrücken, Germany.

Bod, R. 2000b. "Redundancy and Minimality in Statistical Parsing with the DOP Model", submitted for publication.

Bod, R. and R. Kaplan 1998. "A Probabilistic Corpus-Driven Model for Lexical-Functional Analysis", Proceedings COLING-ACL'98, Montreal, Canada.

Bonnema, R., R. Bod and R. Scha, 1997. "A DOP Model for Semantic Interpretation", Proceedings ACL/EACL-97, Madrid, Spain.

Bonnema, R., P. Buying and R. Scha, 1999. "A New Probability Model for Data-Oriented Parsing", Proceedings of the Amsterdam Colloquium 1999, Amsterdam, The Netherlands.

Briscoe, T. and J. Carroll, 1993. "Generalized Probabilistic LR Parsing of Natural Language (Corpora) with Unification-based Grammars", Computational Linguistics 19(1), 25-59.

Caraballo, S. and E. Charniak, 1998. "New Figures of Merit for Best-First Probabilistic Chart Parsing", Computational Linguistics 24, 275-298.

Chappelier, J. and M. Rajman, 2000. "Monte Carlo Sampling for NP-hard Maximization Problems in the Framework of Weighted Parsing", in Natural Language Processing -- NLP 2000, Lecture Notes in Artificial Intelligence 1835, D. Christodoulakis (ed.), 2000, 106-117.

Charniak, E. 1996. "Tree-bank Grammars", Proceedings AAAI-96, Menlo Park, Ca.

Charniak, E. 1997. "Statistical Parsing with a Context-Free Grammar and Word Statistics", Proceedings AAAI-97, Menlo Park, Ca.

Charniak, E. 2000. "A Maximum-Entropy-Inspired Parser", Proceedings ANLP-NAACL'2000, Seattle, Washington.

Chitrao, M. and R. Grishman, 1990. "Statistical Parsing of Messages", Proceedings DARPA Speech and Language Workshop 1990.

Collins, M. 1997. "Three Generative Lexicalised Models for Statistical Parsing", Proceedings EACL/ACL'97, Madrid, Spain.

Collins, M. 1999. Head-Driven Statistical Models for Natural Language Parsing, PhD thesis, University of Pennsylvania, PA.

Daelemans, W. (ed.) 1999. Memory-Based Language Processing, Journal for Experimental and Theoretical Artificial Intelligence, 11(3).

Eisner, J. 1997. "Bilexical Grammars and a Cubic-Time Probabilistic Parser", Proceedings of International Workshop on Parsing Technologies, Boston, Mass.

Goodman, J. 1998. Parsing Inside-Out, PhD thesis, Harvard University, Mass.

Hoogweg, L. 2000. Extending DOP1 with the Insertion Operation, Master's thesis, University of Amsterdam, The Netherlands.

Johnson, M. 1998. "The DOP Estimation Method is Biased and Inconsistent", squib.

Lari, K. and S. Young 1990. "The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm", Computer Speech and Language 4, 35-56.

Magerman, D. and M. Marcus, 1991. "Pearl: A Probabilistic Chart Parser", Proceedings EACL'91, Berlin, Germany.

Marcus, M., B. Santorini and M. Marcinkiewicz, 1993. "Building a Large Annotated Corpus of English: the Penn Treebank", Computational Linguistics 19(2).

Neumann, G. 1998. "Automatic Extraction of Stochastic Lexicalized Tree Grammars from Treebanks", Proceedings of the 4th Workshop on Tree-Adjoining Grammars and Related Frameworks, Philadelphia, PA.

Neumann, G. and D. Flickinger, 1999. "Learning Stochastic Lexicalized Tree Grammars from HPSG", DFKI Technical Report, Saarbrücken, Germany.

Ratnaparkhi, A. 1999. "Learning to Parse Natural Language with Maximum Entropy Models", Machine Learning 34, 151-176.

Shannon, C. 1948. "A Mathematical Theory of Communication", Bell System Technical Journal 27, 379-423, 623-656.

Sima'an, K. 1995. "An Optimized Algorithm for Data Oriented Parsing", in R. Mitkov and N. Nicolov (eds.), Recent Advances in Natural Language Processing 1995, volume 136 of Current Issues in Linguistic Theory, John Benjamins, Amsterdam.

Sima'an, K. 1996. "Computational Complexity of Probabilistic Disambiguation by means of Tree Grammars", Proceedings COLING-96, Copenhagen, Denmark.

Sima'an, K. 1999. Learning Efficient Disambiguation, ILLC Dissertation Series 1999-02, Utrecht University / University of Amsterdam, March 1999, The Netherlands.

Way, A. 1999. "A Hybrid Architecture for Robust MT using LFG-DOP", Journal of Experimental and Theoretical Artificial Intelligence 11 (Special Issue on Memory-Based Language Processing).

Weischedel, R., M. Meteer, R. Schwarz, L. Ramshaw and J. Palmucci, 1993. "Coping with Ambiguity and Unknown Words through Probabilistic Models", Computational Linguistics 19(2), 359-382.

Zipf, G. 1935. The Psycho-Biology of Language, Houghton Mifflin.
