Efficient Algorithms for Parsing the DOP Model * 
Joshua Goodman 
Harvard University 
33 Oxford St. 
Cambridge, MA 02138 
goodman@das.harvard.edu 
Abstract 
Excellent results have been reported for Data- 
Oriented Parsing (DOP) of natural language texts 
(Bod, 1993c). Unfortunately, existing algorithms 
are both computationally intensive and difficult 
to implement. Previous algorithms are expen- 
sive due to two factors: the exponential number 
of rules that must be generated and the use of 
a Monte Carlo parsing algorithm. In this paper 
we solve the first problem by a novel reduction of 
the DOP model to a small, equivalent probabilistic
context-free grammar. We solve the second prob- 
lem by a novel deterministic parsing strategy that 
maximizes the expected number of correct con- 
stituents, rather than the probability of a correct 
parse tree. Using the optimizations, experiments
yield a 97% crossing brackets rate and 88% zero 
crossing brackets rate. This differs significantly 
from the results reported by Bod, and is compara- 
ble to results from a duplication of Pereira and 
Schabes's (1992) experiment on the same data. 
We show that Bod's results are at least partially 
due to an extremely fortuitous choice of test data, 
and partially due to using cleaner data than other 
researchers. 
Introduction 
The Data-Oriented Parsing (DOP) model has a 
short, interesting, and controversial history. It 
was introduced by Remko Scha (1990), and was 
then studied by Rens Bod. Unfortunately, Bod 
(1993c, 1992) was not able to find an efficient exact 
* I would like to acknowledge support from Na- 
tional Science Foundation Grant IRI-9350192 and a 
National Science Foundation Graduate Student Fel- 
lowship. I would also like to thank Rens Bod, Stan 
Chen, Andrew Kehler, David Magerman, Wheeler 
Ruml, Stuart Shieber, and Khalil Sima'an for help-
ful discussions and comments on earlier drafts, and
the comments of the anonymous reviewers. 
algorithm for parsing using the model; however he 
did discover and implement Monte Carlo approxi- 
mations. He tested these algorithms on a cleaned 
up version of the ATIS corpus, and achieved some 
very exciting results, reportedly getting 96% of his 
test set exactly correct, a huge improvement over 
previous results. For instance, Bod (1993b) com- 
pares these results to Schabes (1993), in which, 
for short sentences, 30% of the sentences have no 
crossing brackets (a much easier measure than ex- 
act match). Thus, Bod achieves an extraordinary 
8-fold error rate reduction.
Not surprisingly, other researchers attempted 
to duplicate these results, but due to a lack of de- 
tails of the parsing algorithm in his publications, 
these other researchers were not able to confirm 
the results (Magerman, Lafferty, personal commu-
nication). Even Bod's thesis (Bod, 1995a) does
not contain enough information to replicate his re- 
sults. 
Parsing using the DOP model is especially dif- 
ficult. The model can be summarized as a spe- 
cial kind of Stochastic Tree Substitution Grammar 
(STSG): given a bracketed, labelled training cor- 
pus, let every subtree of that corpus be an elemen- 
tary tree, with a probability proportional to the 
number of occurrences of that subtree in the train- 
ing corpus. Unfortunately, the number of trees is 
in general exponential in the size of the training 
corpus trees, producing an unwieldy grammar. 
In this paper, we introduce a reduction of the 
DOP model to an exactly equivalent Probabilistic 
Context Free Grammar (PCFG) that is linear in 
the number of nodes in the training data. Next, 
we present an algorithm for parsing, which returns 
the parse that is expected to have the largest num- 
ber of correct constituents. We use the reduction 
and algorithm to parse held out test data, com- 
paring these results to a replication of Pereira and 
[Figure 1: Training corpus tree for DOP example: (S (NP PN PN) (VP V (NP DET N)))]
Schabes (1992) on the same data. These results 
are disappointing: the PCFG implementation of 
the DOP model performs about the same as the 
Pereira and Schabes method. We present an anal- 
ysis of the runtime of our algorithm and Bod's. 
Finally, we analyze Bod's data, showing that some 
of the difference between our performance and his 
is due to a fortuitous choice of test data. 
This paper contains the first published repli- 
cation of the full DOP model, i.e. using a parser 
which sums over derivations. It also contains algo- 
rithms implementing the model with significantly 
fewer resources than previously needed. Further- 
more, for the first time, the DOP model is com- 
pared on the same data to a competing model. 
Previous Research 
The DOP model itself is extremely simple and can 
be described as follows: for every sentence in a 
parsed training corpus, extract every subtree. In 
general, the number of subtrees will be very large, 
typically exponential in sentence length. Now, use 
these trees to form a Stochastic Tree Substitu- 
tion Grammar (STSG). There are two ways to de- 
fine a STSG: either as a Stochastic Tree Adjoin- 
ing Grammar (Schabes, 1992) restricted to sub- 
stitution operations, or as an extended PCFG in 
which entire trees may occur on the right hand 
side, instead of just strings of terminals and non- 
terminals. 
Given the tree of Figure 1, we can use the 
DOP model to convert it into the STSG of Figure 
2. The numbers in parentheses represent the prob- 
abilities. These trees can be combined in various 
ways to parse sentences. 
In theory, the DOP model has several advan- 
tages over other models. Unlike a PCFG, the use 
of trees allows capturing large contexts, making 
the model more sensitive. Since every subtree is 
included, even trivial ones corresponding to rules 
in a PCFG, novel sentences with unseen contexts 
can still be parsed.
[Figure 3: Simple Example STSG]
Unfortunately, the number of subtrees is huge; 
therefore Bod randomly samples 5% of the sub- 
trees, throwing away the rest. This significantly 
speeds up parsing. 
There are two existing ways to parse using the 
DOP model. First, one can find the most proba- 
ble derivation. That is, there can be many ways 
a given sentence could be derived from the STSG. 
Using the most probable derivation criterion, one 
simply finds the most probable way that a sentence 
could be produced. Figure 3 shows a simple ex- 
ample STSG. For the string xx, what is the most 
probable derivation? The parse tree 
(S (A x) (C x))
has probability 3/9 of being generated by the trivial
derivation containing a single tree. This tree cor- 
responds to the most probable derivation of xx. 
Second, one could try to find the most probable parse
tree. For a given sentence and a given parse tree, 
there are many different derivations that could 
lead to that parse tree. The probability of the 
parse tree is the sum of the probabilities of the 
derivations. Given our example, there are two dif- 
ferent ways to generate the parse tree 
(S (E x) (B x))
each with probability 2/9, so that the parse tree has
probability 4/9. This parse tree is most probable.
Bod (1993c) shows how to approximate this 
most probable parse using a Monte Carlo algo- 
rithm. The algorithm randomly samples possible 
derivations, then finds the tree with the most sam- 
pled derivations. Bod shows that the most proba- 
ble parse yields better performance than the most 
probable derivation on the exact match criterion. 
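The sampling idea can be illustrated with a minimal sketch. It is not Bod's implementation, whose details are not fully specified: here derivations of a single string are drawn from a fixed, hypothetical weighted list (`DERIVATIONS`, an assumption made for illustration), and the parse tree whose derivations are sampled most often is returned.

```python
import random
from collections import Counter

# Hypothetical derivations of the string "x x": (parse tree, relative weight).
# A parse tree is written as its (left label, right label) pair under S.
# The two ("E", "B") entries are distinct derivations of the same parse tree.
DERIVATIONS = [
    (("A", "C"), 3),
    (("A", "D"), 2),
    (("E", "B"), 2),
    (("E", "B"), 2),
]

def monte_carlo_parse(derivations, samples, seed=0):
    """Sample derivations by weight and return the most frequently seen parse tree."""
    rng = random.Random(seed)
    trees = [tree for tree, _ in derivations]
    weights = [weight for _, weight in derivations]
    counts = Counter(rng.choices(trees, weights=weights, k=samples))
    return counts.most_common(1)[0][0]

print(monte_carlo_parse(DERIVATIONS, 20000))
```

With these assumed weights the single most probable derivation yields (A, C), while the tree sampled most often is (E, B), because its probability mass is split over two derivations.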
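[Figure 2: Sample STSG Produced from DOP Model. Elementary trees, with probabilities in parentheses:
NP (1/2 each): (NP PN PN), (NP DET N)
VP (1/2 each): (VP V NP), (VP V (NP DET N))
S (1/6 each): (S NP VP), (S (NP PN PN) VP), (S NP (VP V NP)), (S NP (VP V (NP DET N))), (S (NP PN PN) (VP V NP)), (S (NP PN PN) (VP V (NP DET N)))]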
Khalil Sima'an (1996) implemented a version 
of the DOP model, which parses efficiently by lim- 
iting the number of trees used and by using an 
efficient most probable derivation model. His ex- 
periments differed from ours and Bod's in many 
ways, including his use of a different version of the
ATIS corpus; the use of word strings, rather than 
part of speech strings; and the fact that he did not 
parse sentences containing unknown words, effec- 
tively throwing out the most difficult sentences. 
Furthermore, Sima'an limited the number of sub-
stitution sites for his trees, effectively using a sub-
set of the DOP model.
Reduction of DOP to PCFG 
Unfortunately, Bod's reduction to a STSG is ex- 
tremely expensive, even when throwing away 95% 
of the grammar. Fortunately, it is possible to find 
an equivalent PCFG that contains exactly eight 
PCFG rules for each node in the training data; 
thus it is O(n). Because this reduction is so much 
smaller, we do not discard any of the grammar 
when using it. The PCFG is equivalent in two 
senses: first it generates the same strings with the 
same probabilities; second, using an isomorphism 
defined below, it generates the same trees with the 
same probabilities, although one must sum over 
several PCFG trees for each STSG tree. 
To show this reduction and equivalence, we 
must first define some terminology. We assign ev- 
ery node in every tree a unique number, which we 
will call its address. Let A@k denote the node at 
address k, where A is the non-terminal labeling 
that node. We will need to create one new non- 
terminal for each node in the training data. We 
will call this non-terminal $A_k$. We will call non-
terminals of this form "internal" non-terminals,
and the original non-terminals in the parse trees
"external".
Let $a_j$ represent the number of subtrees
headed by the node A@j. Let $a$ represent the number
of subtrees headed by nodes with non-terminal
A, that is $a = \sum_j a_j$.
Consider a node A@j of the form:
    A@j
B@k   C@l
How many subtrees does it have? Consider first
the possibilities on the left branch. There are $b_k$
non-trivial subtrees headed by B@k, and there is
also the trivial case where the left node is simply
B. Thus there are $b_k + 1$ different possibilities
on the left branch. Similarly, for the right
branch there are $c_l + 1$ possibilities. We can create
a subtree by choosing any possible left subtree
and any possible right subtree. Thus, there are
$a_j = (b_k + 1)(c_l + 1)$ possible subtrees headed by
A@j. In our example tree of Figure 1, both noun
phrases have exactly one subtree: $np_4 = np_2 = 1$;
the verb phrase has 2 subtrees: $vp_3 = 2$; and the
sentence has 6: $s_1 = 6$. These numbers correspond
to the number of subtrees in Figure 2.
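The counting recurrence can be checked mechanically. The sketch below is illustrative only: it assumes trees are nested tuples `(label, left, right)`, that part-of-speech tags are plain strings acting as terminals with zero subtrees, and the names `FIGURE_1` and `subtree_count` are hypothetical.

```python
# Figure 1 tree; strings are part-of-speech terminals (an assumption of this sketch).
FIGURE_1 = ("S", ("NP", "PN", "PN"), ("VP", "V", ("NP", "DET", "N")))

def subtree_count(node, table):
    """Return the number of subtrees headed by node, recording (label, count) per node."""
    if isinstance(node, str):
        return 0  # a terminal heads no subtrees
    label, left, right = node
    # a_j = (b_k + 1)(c_l + 1): each branch is either cut off or expanded further
    a = (subtree_count(left, table) + 1) * (subtree_count(right, table) + 1)
    table.append((label, a))
    return a

table = []
total = subtree_count(FIGURE_1, table)
print(table)  # nodes in post-order
```

The recorded counts are 1 for each noun phrase, 2 for the verb phrase, and 6 for the sentence, matching the numbers quoted in the text.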
We will call a PCFG subderivation isomor- 
phic to a STSG tree if the subderivation begins 
with an external non-terminal, uses internal non- 
terminals for intermediate steps, and ends with 
external non-terminals. For instance, consider the 
tree
(S (NP PN PN) (VP V NP))
taken from Figure 2. The following PCFG subderivation
is isomorphic: S ⇒ NP@1 VP@2 ⇒
PN PN VP@2 ⇒ PN PN V NP. We say
that a PCFG derivation is isomorphic to a STSG 
derivation if there is a corresponding PCFG sub- 
derivation for every step in the STSG derivation. 
We will give a simple small PCFG with the 
following surprising property: for every subtree in 
the training corpus headed by A, the grammar will 
generate an isomorphic subderivation with proba- 
bility 1/a. In other words, rather than using the 
large, explicit STSG, we can use this small PCFG 
that generates isomorphic derivations, with iden- 
tical probabilities. 
The construction is as follows. For a node such 
as 
A@j 
B@k C@l 
we will generate the following eight PCFG rules, 
where the number in parentheses following a rule 
is its probability. 
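$$
\begin{array}{llll}
A_j \rightarrow B\,C & (1/a_j) & A \rightarrow B\,C & (1/a) \\
A_j \rightarrow B_k\,C & (b_k/a_j) & A \rightarrow B_k\,C & (b_k/a) \\
A_j \rightarrow B\,C_l & (c_l/a_j) & A \rightarrow B\,C_l & (c_l/a) \\
A_j \rightarrow B_k\,C_l & (b_k c_l/a_j) & A \rightarrow B_k\,C_l & (b_k c_l/a)
\end{array}
\qquad (1)
$$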
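A minimal sketch of this construction follows, under stated assumptions: trees are nested tuples `(label, left, right)`, part-of-speech tags are strings treated as terminals (so rules that would use an internal non-terminal for a terminal child get weight zero and are dropped), addresses are assigned in pre-order, and internal non-terminals are spelled `A_j`. The function name `dop_to_pcfg` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import count

FIGURE_1 = ("S", ("NP", "PN", "PN"), ("VP", "V", ("NP", "DET", "N")))

def dop_to_pcfg(trees):
    """Return {(lhs, (left, right)): probability} following the rule table (1)."""
    nodes, ids = [], count(1)

    def visit(node):
        if isinstance(node, str):
            return node, None, 0  # terminal: no address, no subtrees
        label, left, right = node
        address = next(ids)  # pre-order address
        left_info, right_info = visit(left), visit(right)
        a_j = (left_info[2] + 1) * (right_info[2] + 1)
        nodes.append((label, address, a_j, left_info, right_info))
        return label, address, a_j

    for tree in trees:
        visit(tree)
    totals = defaultdict(int)  # a = sum over j of a_j, per non-terminal label
    for label, _, a_j, _, _ in nodes:
        totals[label] += a_j
    rules = defaultdict(Fraction)
    for label, address, a_j, (B, k, b_k), (C, m, c_m) in nodes:
        for lhs, denom in ((f"{label}_{address}", a_j), (label, totals[label])):
            for left_sym, left_w in ((B, 1), (f"{B}_{k}", b_k)):
                for right_sym, right_w in ((C, 1), (f"{C}_{m}", c_m)):
                    if left_w and right_w:  # skip zero-weight rules
                        rules[lhs, (left_sym, right_sym)] += Fraction(left_w * right_w, denom)
    return dict(rules)

rules = dop_to_pcfg([FIGURE_1])
for (lhs, rhs), p in sorted(rules.items()):
    print(lhs, "->", *rhs, p)
```

Since $(1 + b_k + c_l + b_k c_l) = (b_k + 1)(c_l + 1) = a_j$, the rules for each left-hand side sum to one, which the sketch makes easy to confirm on the Figure 1 tree.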
We will show that subderivations headed 
by A with external non-terminals at the roots 
and leaves, internal non-terminals elsewhere have 
probability 1/a. Subderivations headed by Aj 
with external non-terminals only at the leaves, in- 
ternal non-terminals elsewhere, have probability 
1/aj. The proof is by induction on the depth of 
the trees. 
For trees of depth 1, there are two cases: 
A A@j 
B C B C 
Trivially, these trees have the required probabili- 
ties. 
Now, assume that the theorem is true for trees 
of depth n or less. We show that it holds for trees 
of depth n + 1. There are eight cases, one for each 
of the eight rules. We show two of them. Let 
B@k
 ⋮
represent a tree of at most depth n with
external leaves, headed by B@k, and with internal
intermediate non-terminals. Then, for trees such
as
[Figure 4: Example of Isomorphic Derivation. Left panel: PCFG derivation, 4 productions. Right panel: STSG derivation, 2 subtrees.]
      A@j
  B@k     C@l
   ⋮       ⋮
the probability of the tree is $\frac{b_k c_l}{a_j} \cdot \frac{1}{b_k} \cdot \frac{1}{c_l} = \frac{1}{a_j}$. Similarly,
for another case, trees headed by
     A
 B@k   C
  ⋮
the probability of the tree is $\frac{b_k}{a} \cdot \frac{1}{b_k} = \frac{1}{a}$. The other
six cases follow trivially with similar reasoning.
We call a PCFG derivation isomorphic to a 
STSG derivation if for every substitution in the 
STSG there is a corresponding subderivation in 
the PCFG. Figure 4 contains an example of iso- 
morphic derivations, using two subtrees in the 
STSG and four productions in the PCFG. 
We call a PCFG tree isomorphic to a STSG 
tree if they are identical when internal non- 
terminals are changed to external non-terminals. 
Our main theorem is that this construction pro- 
duces PCFG trees isomorphic to the STSG trees 
with equal probability. If every subtree in the 
training corpus occurred exactly once, this would 
be trivial to prove. For every STSG subderiva- 
tion, there would be an isomorphic PCFG sub- 
derivation, with equal probability. Thus for every 
STSG derivation, there would be an isomorphic 
PCFG derivation, with equal probability. Thus 
every STSG tree would be produced by the PCFG 
with equal probability. 
However, it is extremely likely that some sub- 
trees, especially trivial ones like 
S 
NP VP 
will occur repeatedly. 
If the STSG formalism were modified slightly, 
so that trees could occur multiple times, then our 
relationship could be made one to one. Consider a 
modified form of the DOP model, in which when 
subtrees occurred multiple times in the training 
corpus, their counts were not merged: both iden- 
tical trees are added to the grammar. Each of 
these trees will have a lower probability than if 
their counts were merged. This would change the 
probabilities of the derivations; however the prob- 
abilities of parse trees would not change, since 
there would be correspondingly more derivations 
for each tree. Now, the desired one to one relation- 
ship holds: for every derivation in the new STSG 
there is an isomorphic derivation in the PCFG 
with equal probability. Thus, summing over all 
derivations of a tree in the STSG yields the same 
probability as summing over all the isomorphic 
derivations in the PCFG. Thus, every STSG tree 
would be produced by the PCFG with equal prob- 
ability. 
It follows trivially from this that no extra trees 
are produced by the PCFG. Since the total prob- 
ability of the trees produced by the STSG is 1, 
and the PCFG produces these trees with the same 
probability, no probability is "left over" for any 
other trees. 
Parsing Algorithm 
There are several different evaluation metrics one 
could use for finding the best parse. In the sec- 
tion covering previous research, we considered the 
most probable derivation and the most probable 
parse tree. There is one more metric we could con- 
sider. If our performance evaluation were based on 
the number of constituents correct, using measures 
similar to the crossing brackets measure, we would 
want the parse tree that was most likely to have 
the largest number of correct constituents. With 
this criterion and the example grammar of Figure 
3, the best parse tree would be 
(S (A x) (B x))
The probability that the S constituent is correct is 
1.0, while the probability that the A constituent is 
correct is 5/9, and the probability that the B con-
stituent is correct is 4/9. Thus, this tree has on
average 2 constituents correct. All other trees will 
have fewer constituents correct on average. We 
call the best parse tree under this criterion the 
Maximum Constituents Parse. Notice that this 
parse tree cannot even be produced by the gram- 
mar: each of its constituents is good, but it is not 
necessarily good when considered as a full tree. 
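The three criteria can be compared on a toy case. The derivation probabilities below are an assumption chosen to be consistent with the discussion of the string xx, not values read from Figure 3; each derivation is reduced to the pair of labels it places over the two words, and all names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical derivations of "x x": (left label, right label, probability).
# The two ("E", "B") entries are two derivations of the same parse tree.
DERIVATIONS = [
    ("A", "C", F(3, 9)),
    ("A", "D", F(2, 9)),
    ("E", "B", F(2, 9)),
    ("E", "B", F(2, 9)),
]

def parse_probabilities(derivations):
    """Sum derivation probabilities per parse tree."""
    probs = {}
    for left, right, p in derivations:
        probs[left, right] = probs.get((left, right), F(0)) + p
    return probs

def maximum_constituents_parse(parse_probs):
    """Pick the label pair with the highest expected number of correct constituents."""
    left, right = {}, {}
    for (l, r), p in parse_probs.items():
        left[l] = left.get(l, F(0)) + p    # probability the left constituent is l
        right[r] = right.get(r, F(0)) + p  # probability the right constituent is r
    best = max(product(left, right), key=lambda lr: left[lr[0]] + right[lr[1]])
    return best, 1 + left[best[0]] + right[best[1]]  # the S constituent is always correct

parse_probs = parse_probabilities(DERIVATIONS)
print(max(DERIVATIONS, key=lambda d: d[2])[:2])  # most probable derivation
print(max(parse_probs, key=parse_probs.get))     # most probable parse
print(maximum_constituents_parse(parse_probs))   # Maximum Constituents Parse
```

Under these assumed numbers the three criteria select three different trees, and the Maximum Constituents Parse (A, B) is one that no single derivation produces.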
Bod (1993a, 1995a) shows that the most prob- 
able derivation does not perform as well as the 
most probable parse for the DOP model, getting 
65% exact match for the most probable deriva- 
tion, versus 96% correct for the most probable 
parse. This is not surprising, since each parse 
tree can be derived by many different deriva- 
tions; the most probable parse criterion takes 
all possible derivations into account. Similarly, 
the Maximum Constituents Parse is also derived 
from the sum of many different derivations. Fur- 
thermore, although the Maximum Constituents 
Parse should not do as well on the exact match 
criterion, it should perform even better on the 
percent constituents correct criterion. We have 
previously performed a detailed comparison be- 
tween the most likely parse, and the Maximum 
Constituents Parse for Probabilistic Context Free 
Grammars (Goodman, 1996); we showed that the 
two have very similar performance on a broad
range of measures, with at most a 10% difference 
in error rate (i.e., a change from 10% error rate to 
9% error rate.) We therefore think that it is rea- 
sonable to use a Maximum Constituents Parser to 
parse the DOP model. 
The parsing algorithm is a variation on the 
Inside-Outside algorithm, developed by Baker 
(1979) and discussed in detail by Lari and Young 
(1990). However, while the Inside-Outside algo- 
rithm is a grammar re-estimation algorithm, the 
algorithm presented here is just a parsing algo- 
rithm. It is closely related to a similar algorithm 
used for Hidden Markov Models (Rabiner, 1989) 
for finding the most likely state at each time. How- 
ever, unlike in the HMM case where the algorithm 
produces a simple state sequence, in the PCFG 
case a parse tree is produced, resulting in additional
constraints.

    for length := 2 to n
        for s := 1 to n-length+1
            t := s + length - 1;
            for all non-terminals X
                sum[X] := g(s, t, X);
            loop over addresses k
                let X := non-terminal at k;
                let sum[X] := sum[X] + g(s, t, X_k);
            loop over non-terminals X
                let max_X := arg max of sum[X]
            loop over r such that s <= r < t
                let best_split :=
                    max of maxc[s,r] + maxc[r+1,t];
            maxc[s,t] := sum[max_X] + best_split;

Figure 5: Maximum Constituents Data-Oriented
Parsing Algorithm
A formal derivation of a very similar algorithm 
is given elsewhere (Goodman, 1996); only the in- 
tuition is given here. The algorithm can be sum- 
marized as follows. First, for each potential con- 
stituent, where a constituent is a non-terminal, a 
start position, and an end position, find the prob- 
ability that that constituent is in the parse. After 
that, put the most likely constituents together to 
form a parse tree, using dynamic programming. 
The probability that a potential constituent
occurs in the correct parse tree,
$P(X \stackrel{*}{\Rightarrow} w_s \ldots w_t \mid S \stackrel{*}{\Rightarrow} w_1 \ldots w_n)$, will be called $g(s,t,X)$.
In words, it is the probability that, given the
sentence $w_1 \ldots w_n$, a symbol $X$ generates $w_s \ldots w_t$.
We can compute this probability using elements
of the Inside-Outside algorithm. First, compute
the inside probabilities, $e(s,t,X) = P(X \stackrel{*}{\Rightarrow} w_s \ldots w_t)$.
Second, compute the outside probabilities,
$f(s,t,X) = P(S \stackrel{*}{\Rightarrow} w_1 \ldots w_{s-1} X w_{t+1} \ldots w_n)$.
Third, compute the matrix $g(s,t,X)$:
$$
g(s,t,X) = \frac{P(S \stackrel{*}{\Rightarrow} w_1 \ldots w_{s-1} X w_{t+1} \ldots w_n)\, P(X \stackrel{*}{\Rightarrow} w_s \ldots w_t)}{P(S \stackrel{*}{\Rightarrow} w_1 \ldots w_n)}
= \frac{f(s,t,X) \times e(s,t,X)}{e(1,n,S)}
$$
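The three steps can be sketched for a generic PCFG in Chomsky normal form. This is the textbook inside and outside recurrences rather than code specific to the DOP grammar; the rule dictionaries, the toy grammar, and the function name are assumptions made for illustration, and the sentence is assumed to be parsable so that $e(1,n,S) > 0$.

```python
from collections import defaultdict

def constituent_posteriors(words, binary_rules, lexical_rules, start="S"):
    """Return g[(s, t, X)] = f(s,t,X) * e(s,t,X) / e(1,n,S), positions 1-based inclusive.

    binary_rules: {(X, Y, Z): prob} for X -> Y Z; lexical_rules: {(X, word): prob}.
    """
    n = len(words)
    e = defaultdict(float)  # inside probabilities
    f = defaultdict(float)  # outside probabilities
    for i, w in enumerate(words, start=1):
        for (X, word), p in lexical_rules.items():
            if word == w:
                e[i, i, X] += p
    for length in range(2, n + 1):  # inside pass, short spans first
        for s in range(1, n - length + 2):
            t = s + length - 1
            for (X, Y, Z), p in binary_rules.items():
                for r in range(s, t):
                    e[s, t, X] += p * e[s, r, Y] * e[r + 1, t, Z]
    f[1, n, start] = 1.0
    for length in range(n, 1, -1):  # outside pass, long spans first
        for s in range(1, n - length + 2):
            t = s + length - 1
            for (X, Y, Z), p in binary_rules.items():
                out = f[s, t, X]
                if out == 0.0:
                    continue
                for r in range(s, t):
                    f[s, r, Y] += out * p * e[r + 1, t, Z]
                    f[r + 1, t, Z] += out * p * e[s, r, Y]
    total = e[1, n, start]
    return {key: f[key] * val / total
            for key, val in e.items() if val > 0.0 and f[key] > 0.0}

# Toy grammar (hypothetical): "x x" has two parses, S -> A B and S -> A C.
binary = {("S", "A", "B"): 0.5, ("S", "A", "C"): 0.5}
lexical = {("A", "x"): 1.0, ("B", "x"): 1.0, ("C", "x"): 1.0}
g = constituent_posteriors(["x", "x"], binary, lexical)
print(g)
```

In the toy grammar the A constituent over the first word has posterior 1, while B and C over the second word each have posterior 0.5.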
Once the matrix g(s, t, X) is computed, a dy- 
namic programming algorithm can be used to de- 
termine the best parse, in the sense of maximizing 
the number of constituents expected correct. Fig- 
ure 5 shows pseudocode for a simplified form of 
this algorithm. 
For a grammar with $g$ nonterminals and training
data of size $T$, the run time of the algorithm
is $O(Tn^2 + gn^3 + n^3)$ since there are two layers
of outer loops, each with run time at most $n$, and
inner loops, over addresses (training data), non-
terminals and $n$. However, this is dominated by
the computation of the Inside and Outside prob-
abilities, which takes time $O(rn^3)$, for a grammar
with $r$ rules. Since there are eight rules for every
node in the training data, this is $O(Tn^3)$.
By modifying the algorithm slightly to record 
the actual split used at each node, we can recover 
the best parse. The entry maxc[1, n] contains
the expected number of correct constituents, given 
the model. 
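A minimal sketch of the dynamic program with recorded splits follows. It assumes the posteriors are supplied as a dictionary `g[(s, t, X)]` that has already been summed over the internal non-terminals of each label, and that single-word spans contribute a score of zero, mirroring the pseudocode's start at length 2. The function name and the toy posteriors are hypothetical.

```python
def max_constituents_parse(n, g, labels):
    """Return (expected number of correct constituents, best binary tree).

    g[(s, t, X)]: posterior that X spans words s..t (1-based, inclusive).
    """
    maxc = {(s, s): 0.0 for s in range(1, n + 1)}  # single words score 0 in this sketch
    back = {}                                      # (s, t) -> (best label, best split)
    for length in range(2, n + 1):
        for s in range(1, n - length + 2):
            t = s + length - 1
            best_label = max(labels, key=lambda X: g.get((s, t, X), 0.0))
            best_r = max(range(s, t), key=lambda r: maxc[s, r] + maxc[r + 1, t])
            maxc[s, t] = (g.get((s, t, best_label), 0.0)
                          + maxc[s, best_r] + maxc[best_r + 1, t])
            back[s, t] = (best_label, best_r)

    def build(s, t):
        """Rebuild the tree from the recorded labels and splits."""
        if s == t:
            return s  # a leaf is represented by its word position
        label, r = back[s, t]
        return (label, build(s, r), build(r + 1, t))

    return maxc[1, n], build(1, n)

# Toy posteriors for a three-word sentence (hypothetical values).
g = {(1, 3, "S"): 1.0, (2, 3, "VP"): 0.7, (1, 2, "NP"): 0.3}
score, tree = max_constituents_parse(3, g, ["S", "NP", "VP"])
print(score, tree)
```

Here the split after the first word wins because the VP over words 2 to 3 has a higher posterior than the NP over words 1 to 2.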
Experimental Results and 
Discussion 
We are grateful to Bod for supplying the data 
that he used for his experiments (Bod, 1995b, 
Bod, 1995a, Bod, 1993c). The original ATIS data 
from the Penn Tree Bank, version 0.5, is very 
noisy; it is difficult to even automatically read this 
data, due to inconsistencies between files. Re- 
searchers are thus left with the difficult decision 
as to how to clean the data. For this paper, we 
conducted two sets of experiments: one using a 
minimally cleaned set of data, 1 making our results 
comparable to previous results; the other using 
the ATIS data prepared by Bod, which contained 
much more significant revisions. 
Ten data sets were constructed by randomly 
splitting minimally edited ATIS (Hemphill et al., 
1990) sentences into a 700 sentence training set, 
and 88 sentence test set, then discarding sentences 
of length > 30. For each of the ten sets, both the 
DOP algorithm outlined here and the grammar 
induction experiment of Pereira and Schabes were 
run. Crossing brackets, zero crossing brackets, and 
the paired differences are presented in Table 1. 
All sentences output by the parser were made bi- 
nary branching (see the section covering analysis 
of Bod's data), since otherwise the crossing brack- 
ets measures are meaningless (Magerman, 1994). 
1A diff file between the original ATIS data and
the cleaned up version, in a form usable by the
"ed" program, is available by anonymous FTP from
ftp://ftp.das.harvard.edu/pub/goodman/atis-ed/
ti_tb.par-ed and ti_tb.pos-ed. Note that the number
of changes made was small. The diff files sum to 457
bytes, versus 269,339 bytes for the original files, or less
than 0.2%.
Criteria                     Min      Max      Range    Mean     StdDev
Cross Brack DOP              86.53%   96.06%    9.53%   90.15%   2.65%
Cross Brack P&S              86.99%   94.41%    7.42%   90.18%   2.59%
Cross Brack DOP-P&S          -3.79%    2.87%    6.66%   -0.03%   2.34%
Zero Cross Brack DOP         60.23%   75.86%   15.63%   66.11%   5.56%
Zero Cross Brack P&S         54.02%   78.16%   24.14%   63.94%   7.34%
Zero Cross Brack DOP-P&S     -5.68%   11.36%   17.05%    2.17%   5.57%
Table 1: DOP versus Pereira and Schabes on Minimally Edited ATIS 
Criteria                     Min      Max      Range    Mean     StdDev
Cross Brack DOP              95.63%   98.62%    2.99%   97.16%   0.93%
Cross Brack P&S              94.08%   97.87%    3.79%   96.11%   1.14%
Cross Brack DOP-P&S          -0.16%    3.03%    3.19%    1.05%   1.04%
Zero Cross Brack DOP         78.67%   90.67%   12.00%   86.13%   3.99%
Zero Cross Brack P&S         70.67%   88.00%   17.33%   79.20%   5.97%
Zero Cross Brack DOP-P&S     -1.33%   20.00%   21.33%    6.93%   5.65%
Exact Match DOP              58.67%   68.00%    9.33%   63.33%   3.22%
Table 2: DOP versus Pereira and Schabes on Bod's Data 
A few sentences were not parsable; these were as- 
signed right branching period high structure, a 
good heuristic (Brill, 1993). 
We also ran experiments using Bod's data, 
75 sentence test sets, and no limit on sentence 
length. However, while Bod provided us with his 
data, he did not provide us with the split into 
test and training that he used; as before we used 
ten random splits. The results are disappointing, 
as shown in Table 2. They are noticeably worse 
than those of Bod, and again very comparable 
to those of Pereira and Schabes. Whereas Bod 
reported 96% exact match, we got only 86% us-
ing the less restrictive zero crossing brackets cri-
terion. It is not clear what exactly accounts for
these differences. 2 It is also noteworthy that the 
results are much better on Bod's data than on the 
minimally edited data: crossing brackets rates of 
96% and 97% on Bod's data versus 90% on min- 
imally edited data. Thus it appears that part of 
Bod's extraordinary performance can be explained 
by the fact that his data is much cleaner than the 
data used by other researchers. 
DOP does do slightly better on most mea- 
sures. We performed a statistical analysis using a 
t-test on the paired differences between DOP and 
Pereira and Schabes performance on each run. On 
2Ideally, we would exactly reproduce these exper-
iments using Bod's algorithm. Unfortunately, it was 
not possible to get a full specification of the algorithm. 
the minimally edited ATIS data, the differences 
were statistically insignificant, while on Bod's data 
the differences were statistically significant beyond 
the 98th percentile. Our technique for finding sta-
tistical significance is more stringent than most:
we assume that since all test sentences were parsed 
with the same training data, all results of a sin- 
gle run are correlated. Thus we compare paired 
differences of entire runs, rather than of sentences 
or constituents. This makes it harder to achieve 
statistical significance. 
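The run-level paired test can be approximated from the summary rows of Tables 1 and 2. The sketch assumes the reported StdDev of each DOP-P&S row is the sample standard deviation of the ten paired differences, which the text does not state; 2.821 is the standard two-sided 98% critical value of Student's t with 9 degrees of freedom. All names are hypothetical.

```python
import math

RUNS = 10            # ten random splits per data set
T_CRITICAL = 2.821   # two-sided 98% critical value, Student's t, 9 degrees of freedom

def paired_t_statistic(mean_diff, stddev_diff, runs):
    """t = mean of paired differences / (sample std dev / sqrt(number of pairs))."""
    return mean_diff / (stddev_diff / math.sqrt(runs))

# (mean, std dev) of the DOP-P&S rows, in percentage points.
ROWS = {
    "minimally edited, crossing brackets": (-0.03, 2.34),
    "minimally edited, zero crossing brackets": (2.17, 5.57),
    "Bod's data, crossing brackets": (1.05, 1.04),
    "Bod's data, zero crossing brackets": (6.93, 5.65),
}

t_values = {name: paired_t_statistic(m, sd, RUNS) for name, (m, sd) in ROWS.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}, beyond 98%: {abs(t) > T_CRITICAL}")
```

Under that assumption, both rows for the minimally edited data fall below the critical value and both rows for Bod's data exceed it, in line with the significance claims in the text.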
Notice also the minimum and maximum 
columns of the "DOP-P&S" lines, constructed by 
finding for each of the paired runs the difference 
between the DOP and the Pereira and Schabes 
algorithms. Notice that the minimum is usually 
negative, and the maximum is usually positive, 
meaning that on some tests DOP did worse than 
Pereira and Schabes and on some it did better. It 
is important to run multiple tests, especially with 
small test sets like these, in order to avoid mis- 
leading results. 
Timing Analysis 
In this section, we examine the empirical runtime 
of our algorithm, and analyze Bod's. We also note 
that Bod's algorithm will probably be particularly 
inefficient on longer sentences. 
It takes about 6 seconds per sentence to run 
our algorithm on an HP 9000/715, versus 3.5 hours 
to run Bod's algorithm on a Sparc 2 (Bod, 1995b). 
Factoring in that the HP is roughly four times 
faster than the Sparc, the new algorithm is about 
500 times faster. Of course, some of this difference 
may be due to differences in implementation, so 
this estimate is fairly rough. 
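The arithmetic behind the estimate is short enough to spell out. The variable names are hypothetical; the figures are the ones quoted in the text.

```python
bod_seconds = 3.5 * 3600   # 3.5 hours per sentence on the Sparc 2
our_seconds = 6            # about 6 seconds per sentence on the HP 9000/715
hp_speed_factor = 4        # the HP is roughly four times faster than the Sparc

# Scale our time up to Sparc-equivalent seconds before comparing.
speedup = bod_seconds / (our_seconds * hp_speed_factor)
print(speedup)
```

The ratio comes out at 525, which the text rounds to about 500.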
Furthermore, we believe Bod's analysis of his 
parsing algorithm is flawed. Letting G represent 
grammar size, and e represent maximum estima- 
tion error, Bod correctly analyzes his runtime as 
$O(Gn^3 e^{-2})$. However, Bod then neglects analy-
sis of this $e^{-2}$ term, assuming that it is constant.
Thus he concludes that his algorithm runs in poly- 
nomial time. However, for his algorithm to have 
some reasonable chance of finding the most proba- 
ble parse, the number of times he must sample his 
data is at least inversely proportional to the con- 
ditional probability of that parse. For instance, 
if the maximum probability parse had probability 
1/50, then he would need to sample at least 50 
times to be reasonably sure of finding that parse. 
Now, we note that the conditional probabil- 
ity of the most probable parse tree will in general 
decline exponentially with sentence length. We as- 
sume that the number of ambiguities in a sentence 
will increase linearly with sentence length; if a five 
word sentence has on average one ambiguity, then 
a ten word sentence will have two, etc. A linear 
increase in ambiguity will lead to an exponential 
decrease in probability of the most probable parse. 
Since the probability of the most probable 
parse decreases exponentially in sentence length, 
the number of random samples needed to find 
this most probable parse increases exponentially 
in sentence length. Thus, when using the Monte 
Carlo algorithm, one is left with the uncomfortable 
choice of exponentially decreasing the probability 
of finding the most probable parse, or exponen- 
tially increasing the runtime. 
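The growth argument can be made concrete with a small sketch. It rests on assumptions that go beyond the text: every ambiguity is treated as an independent binary choice in which the correct reading has probability `best_choice_prob`, and the sample count is taken to be exactly the reciprocal of the parse probability. The function name and parameters are hypothetical.

```python
import math

def min_samples(sentence_length, words_per_ambiguity=5, best_choice_prob=0.5):
    """Lower bound on samples: 1 / P(most probable parse), with linear ambiguity growth."""
    ambiguities = sentence_length / words_per_ambiguity
    parse_prob = best_choice_prob ** ambiguities  # exponential decline in length
    return math.ceil(1 / parse_prob)

for length in (5, 10, 20, 40):
    print(length, min_samples(length))
```

Doubling the sentence length squares the required number of samples in this toy model, which is the exponential behaviour the argument describes.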
We admit that this is a somewhat informal 
argument. Still, the Monte Carlo algorithm has 
never been tested on sentences longer than those 
in the ATIS corpus; there is good reason to believe 
the algorithm will not work as well on longer sen- 
tences. Note that our algorithm has true runtime 
$O(Tn^3)$, as shown previously.
Analysis of Bod's Data 
In the DOP model, a sentence cannot be given an 
exactly correct parse unless all productions in the 
correct parse occur in the training set. Thus, we 
can get an upper bound on performance by ex- 
amining the test corpus and finding which parse 
trees could not be generated using only produc- 
tions in the training corpus. Unfortunately, while 
Bod provided us with his data, he did not specify 
which sentences were test and which were training. 
We can however find an upper bound on average 
case performance, as well as an upper bound on 
the probability that any particular level of perfor- 
mance could be achieved. 
Bod randomly split his corpus into test and 
training. According to his thesis (Bod, 1995a, 
page 64), only one of his 75 test sentences had 
a correct parse which could not be generated from 
the training data. This turns out to be very sur- 
prising. An analysis of Bod's data shows that at 
least some of the difference in performance be- 
tween his results and ours must be due to an 
extraordinarily fortuitous choice of test data. It 
would be very interesting to see how our algorithm 
performed on Bod's split into test and training, 
but he has not provided us with this split. Bod did 
examine versions of DOP that smoothed, allowing 
productions which did not occur in the training 
set; however his reference to coverage is with re- 
spect to a version which does no smoothing. 
In order to perform our analysis, we must de- 
termine certain details of Bod's parser which af- 
fect the probability of having most sentences cor- 
rectly parsable. When using a chart parser, as 
Bod did, three problematic cases must be han-
dled: ε productions, unary productions, and n-ary
(n > 2) productions. The first two kinds of pro- 
ductions can be handled with a probabilistic chart 
parser, but large and difficult matrix manipula- 
tions are required (Stolcke, 1993); these manipu- 
lations would be especially difficult given the size 
of Bod's grammar. Examining Bod's data, we find 
he removed ε productions. We also assume that
Bod made the same choice we did and eliminated 
unary productions, given the difficulty of correctly 
parsing them. Bod himself does not know which 
technique he used for n-ary productions, since the 
chart parser he used was written by a third party 
(Bod, personal communication). 
The n-ary productions can be parsed in a 
straightforward manner, by converting them to bi- 
nary branching form; however, there are at least 
three different ways to convert them, as illus- 
trated in Table 3. In method "Correct", the n- 
ary branching productions are converted in such a 
way that no overgeneration is introduced. A set of 
special non-terminals is added, one for each partial 
right hand side. In method "Continued", a single 
Original       Correct                        Continued                   Simple
(A B C D E)    (A B (*_CDE C (*_DE D E)))     (A B (A_* C (A_* D E)))     (A B (A C (A D E)))
Table 3: Transformations from N-ary to Binary Branching Structures
            Correct             Continued           Simple
no unary    0.78  0.0000002     0.88  0.0009484     0.90  0.0041096
unary       0.80  0.0000011     0.90  0.0037355     0.92  0.0150226
Table 4: Probabilities of Sentences with Unique Productions/Test Data with Ungeneratable Sentences 
new non-terminal is introduced for each original 
non-terminal. Because these non-terminals occur 
in multiple contexts, some overgeneration is in- 
troduced. However, this overgeneration is con- 
strained, so that elements that tend to occur only 
at the beginning, middle, or end of the right hand 
side of a production cannot occur somewhere else. 
If the "Simple" method is used, then no new non- 
terminals are introduced; using this method, it is 
not possible to recover the n-ary branching struc- 
ture from the resulting parse tree, and significant 
overgeneration occurs. 
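As a rough illustration of the three conversions in Table 3, the following Python sketch binarizes a tree by repeatedly peeling off the leftmost child. The tuple-based tree encoding, the function names `binarize` and `_fold`, and the exact spelling of the generated labels (`*_CDE`, `A_*`) are assumptions made for this example, modeled on the labels shown in the table; it is not the implementation used in the experiments.

```python
def _root(node):
    """Return the label of a node; a leaf is represented by a plain string."""
    return node if isinstance(node, str) else node[0]


def _fold(label, children, method):
    """Right-fold a list of already-binarized children under `label`."""
    if len(children) <= 2:
        return (label, *children)
    rest = children[1:]
    if method == "correct":
        # One special non-terminal per partial right hand side.
        # Labels are concatenated without a separator to mirror Table 3;
        # multi-character labels would need a delimiter.
        new_label = "*_" + "".join(_root(c) for c in rest)
    elif method == "continued":
        # A single extra non-terminal per original non-terminal.
        new_label = label if label.endswith("_*") else label + "_*"
    elif method == "simple":
        # No new non-terminals: the parent label is simply reused.
        new_label = label
    else:
        raise ValueError(f"unknown method: {method}")
    return (label, children[0], _fold(new_label, rest, method))


def binarize(tree, method):
    """Convert a (label, child, ...) tuple tree to binary branching form."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [binarize(child, method) for child in children]
    return _fold(label, children, method)


if __name__ == "__main__":
    flat = ("A", "B", "C", "D", "E")
    for method in ("correct", "continued", "simple"):
        print(method, binarize(flat, method))
```

With the "simple" method the output for the flat tree is indistinguishable from a tree in which A was genuinely nested inside A, which is why the n-ary structure cannot be recovered from it.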
Table 4 shows the undergeneration probabilities
for each of these possible techniques for handling
unary productions and n-ary productions.³
The first number in each column is p, the probability
that a sentence in the training data has no
production that occurs nowhere else, so that the
sentence remains generatable when held out as test
data. The second number is the probability that a
test set of 75 sentences drawn from this database
will have exactly one ungeneratable sentence:
$75p^{74}(1 - p)$.⁴
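The arithmetic behind the second number can be checked with a few lines of Python. This is a sketch under the assumption that p denotes the probability that a single sentence can be generated, which is the reading under which the binomial expression gives the probability of exactly one failure in 75 independent draws. The function name is hypothetical, and the inputs are the two-digit values printed in Table 4, so the outputs agree with the table only approximately (the tabulated probabilities were presumably computed from unrounded values).

```python
def prob_exactly_one_ungeneratable(p, n=75):
    """Binomial probability that exactly one of n sentences is ungeneratable.

    p is the (assumed) probability that a single sentence can be generated,
    so n - 1 sentences succeed with probability p each and one fails with
    probability 1 - p; the factor n counts which sentence fails.
    """
    return n * p ** (n - 1) * (1 - p)


if __name__ == "__main__":
    # Rounded per-sentence values as printed in Table 4.
    for p in (0.78, 0.80, 0.88, 0.90, 0.92):
        print(f"p = {p:.2f}: {prob_exactly_one_ungeneratable(p):.7f}")
```

For p = 0.78 this gives roughly 1.7e-7, which rounds to the 0.0000002 shown in the upper left cell of the table.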
The table is arranged from least generous to
most generous: in the upper left hand corner is
a technique Bod might reasonably have used; in
that case, the probability of getting the test set
he described is less than one in a million. In the
lower right corner we give Bod the absolute maximum
benefit of the doubt: we assume he used
a parser capable of parsing unary branching
productions, that he used a very overgenerating
grammar, and that he used a loose definition of "Exact
Match." Even in this case, there is only about a
1.5% chance of getting the test set Bod describes.

³ A perl script for analyzing Bod's data is available
by anonymous FTP from
ftp://ftp.das.harvard.edu/pub/goodman/analyze.perl
⁴ Actually, this is a slight overestimate for a few
reasons, including the fact that the 75 sentences are
drawn without replacement. Also, consider a sentence
with a production that occurs only in one other
sentence in the corpus; there is some probability that both
sentences will end up in the test data, causing both to
be ungeneratable.
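The per-sentence figures in this analysis come from counting productions that occur in only one sentence of the corpus. The analysis script itself is not reproduced here; the following Python sketch shows one way such a count could be made, assuming trees are encoded as nested `(label, child, ...)` tuples with plain strings as leaves. The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def productions(tree):
    """Yield (parent, child-labels) for every internal node of a tuple tree."""
    if isinstance(tree, str):
        return
    label, *children = tree
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for child in children:
        yield from productions(child)


def fraction_generatable(corpus):
    """Fraction of sentences whose every production also occurs elsewhere.

    A sentence counts as ungeneratable when held out if at least one of its
    productions appears in no other sentence of the corpus.
    """
    per_sentence = [set(productions(tree)) for tree in corpus]
    # Number of sentences (not occurrences) in which each production appears.
    sentence_count = Counter(rule for rules in per_sentence for rule in rules)
    generatable = sum(
        all(sentence_count[rule] > 1 for rule in rules) for rules in per_sentence
    )
    return generatable / len(per_sentence)


if __name__ == "__main__":
    toy_corpus = [
        ("S", ("NP", "a"), ("VP", "b")),
        ("S", ("NP", "a"), ("VP", "b")),
        ("S", ("NP", "a"), ("X", "c")),  # S -> NP X and X -> c occur only here
    ]
    print(fraction_generatable(toy_corpus))
```

Counting by sentence rather than by occurrence matters: a production used twice within a single sentence, but in no other sentence, still leaves that sentence ungeneratable once it is removed from the training data.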
Conclusion 
We have given efficient techniques for parsing the 
DOP model. These results are significant since the 
DOP model has perhaps the best reported parsing 
accuracy; previously the full DOP model had not 
been replicated due to the difficulty and computa- 
tional complexity of the existing algorithms. We 
have also shown that previous results were par- 
tially due to an unlikely choice of test data, and 
partially due to the heavy cleaning of the data, 
which reduced the difficulty of the task. 
Of course, this research raises as many ques- 
tions as it answers. Were previous results due only 
to the choice of test data, or are the differences in 
implementation partly responsible? If the latter,
significant future work is required to understand
which differences account for Bod's exceptional
performance. This will be complicated by
the fact that sufficient details of Bod's implemen- 
tation are not available. 
This research also shows the importance of 
testing on more than one small test set, as well 
as the importance of not making cross-corpus com- 
parisons; if a new corpus is required, then previous 
algorithms should be duplicated for comparison. 

References 
[Baker, 1979] J.K. Baker. 1979. Trainable grammars
for speech recognition. In Proceedings of
the Spring Conference of the Acoustical Society
of America, pages 547-550, Boston, MA, June.
[Bod, 1992] Rens Bod. 1992. Mathematical properties
of the data oriented parsing model. Paper
presented at the Third Meeting on Mathematics
of Language (MOL3), Austin, Texas.
[Bod, 1993a] Rens Bod. 1993a. Data-oriented
parsing as a general framework for stochastic
language processing. In K. Sikkel and
A. Nijholt, editors, Parsing Natural Language.
Twente, The Netherlands.
[Bod, 1993b] Rens Bod. 1993b. Monte
Carlo parsing. In Proceedings Third International
Workshop on Parsing Technologies,
Tilburg/Durbuy.
[Bod, 1993c] Rens Bod. 1993c. Using an annotated
corpus as a stochastic grammar. In Proceedings
of the Sixth Conference of the European
Chapter of the ACL, pages 37-44.
[Bod, 1995a] Rens Bod. 1995a. Enriching Linguistics
with Statistics: Performance Models of
Natural Language. University of Amsterdam
ILLC Dissertation Series 1995-14. Academische
Pers, Amsterdam.
[Bod, 1995b] Rens Bod. 1995b. The problem
of computing the most probable tree in data-oriented
parsing and stochastic tree grammars.
In Proceedings of the Seventh Conference of the
European Chapter of the ACL.
[Brill, 1993] Eric Brill. 1993. A Corpus-Based Approach
to Language Learning. Ph.D. thesis,
University of Pennsylvania.
[Goodman, 1996] Joshua Goodman. 1996. Parsing
algorithms and metrics. In Proceedings of
the 34th Annual Meeting of the ACL. To appear.
[Hemphill et al., 1990] Charles T. Hemphill,
John J. Godfrey, and George R. Doddington.
1990. The ATIS spoken language systems pilot
corpus. In DARPA Speech and Natural Language
Workshop, Hidden Valley, Pennsylvania,
June. Morgan Kaufmann.
[Lari and Young, 1990] K. Lari and S.J. Young.
1990. The estimation of stochastic context-free
grammars using the inside-outside algorithm.
Computer Speech and Language, 4:35-56.
[Magerman, 1994] David Magerman. 1994. Natural
Language Parsing as Statistical Pattern
Recognition. Ph.D. thesis, Stanford University,
February.
[Pereira and Schabes, 1992] Fernando Pereira
and Yves Schabes. 1992. Inside-Outside reestimation
from partially bracketed corpora. In
Proceedings of the 30th Annual Meeting of the
ACL, pages 128-135, Newark, Delaware.
[Rabiner, 1989] L.R. Rabiner. 1989. A tutorial
on hidden Markov models and selected applications
in speech recognition. Proceedings of the
IEEE, 77(2), February.
[Scha, 1990] R. Scha. 1990. Language theory and
language technology; competence and performance.
In Q.A.M. de Kort and G.L.J. Leerdam,
editors, Computertoepassingen in de Neerlandistiek.
Landelijke Vereniging van Neerlandici
(LVVN-jaarboek), Almere. In Dutch.
[Schabes et al., 1993] Yves Schabes, Michal Roth,
and Randy Osborne. 1993. Parsing the Wall
Street Journal with the Inside-Outside algorithm.
In Proceedings of the Sixth Conference of
the European Chapter of the ACL, pages 341-347.
[Schabes, 1992] Y. Schabes. 1992. Stochastic lexicalized
tree-adjoining grammars. In Proceedings
of the 14th International Conference on
Computational Linguistics.
[Sima'an, 1996] Khalil Sima'an. 1996. Efficient
disambiguation by means of stochastic tree
substitution grammars. In R. Mitkov and N. Nicolov,
editors, Recent Advances in NLP 1995,
volume 136 of Current Issues in Linguistic Theory.
John Benjamins, Amsterdam.
[Stolcke, 1993] Andreas Stolcke. 1993. An efficient
probabilistic context-free parsing algorithm
that computes prefix probabilities. Technical
Report TR-93-065, International Computer
Science Institute, Berkeley, CA.
