Using an Annotated Corpus as a Stochastic Grammar 
Rens Bod 
Department of Computational Linguistics 
University of Amsterdam 
Spuistraat 134 
NL-1012 VB Amsterdam 
rens@alf.leLuva.nl 
Abstract 
In Data Oriented Parsing (DOP), an annotated 
corpus is used as a stochastic grammar. An 
input string is parsed by combining subtrees 
from the corpus. As a consequence, one parse 
tree can usually be generated by several 
derivations that involve different subtrees. This
leads to a statistical model in which the probability of a
parse is equal to the sum of the probabilities of
all its derivations. In (Scha, 1990) an informal
introduction to DOP is given, while (Bod,
1992a) provides a formalization of the theory.
In this paper we compare DOP with other 
stochastic grammars in the context of Formal 
Language Theory. It is proved that it is not
possible to create for every DOP-model a 
strongly equivalent stochastic CFG which also 
assigns the same probabilities to the parses. 
We show that the maximum probability parse 
can be estimated in polynomial time by 
applying Monte Carlo techniques. The model 
was tested on a set of hand-parsed strings from 
the Air Travel Information System (ATIS) 
spoken language corpus. Preliminary 
experiments yield 96% test set parsing 
accuracy. 
1 Motivation 
As soon as a formal grammar characterizes a non-
trivial part of a natural language, almost every input
string of reasonable length gets an unmanageably large 
number of different analyses. Since most of these 
analyses are not perceived as plausible by a human 
language user, there is a need for distinguishing the 
plausible parse(s) of an input string from the 
implausible ones. In stochastic language processing, it 
is assumed that the most plausible parse of an input 
string is its most probable parse. Most instantiations 
of this idea estimate the probability of a parse by 
assigning application probabilities to context free 
rewrite rules (Jelinek et al., 1990), or by assigning
combination probabilities to elementary structures 
(Resnik, 1992; Schabes, 1992). 
There is some agreement now that context free rewrite 
rules are not adequate for estimating the probability of 
a parse, since they cannot capture syntactic/lexical
context, and hence cannot describe how the probability 
of syntactic structures or lexical items depends on that 
context. In stochastic tree-adjoining grammar 
(Schabes, 1992), this lack of context-sensitivity is 
overcome by assigning probabilities to larger 
structural units. However, it is not always evident 
which structures should be considered as elementary 
structures. In (Schabes, 1992) it is proposed to infer a 
stochastic TAG from a large training corpus using an 
inside-outside-like iterative algorithm. 
Data Oriented Parsing (DOP) (Scha, 1990; Bod,
1992a) distinguishes itself from other statistical
approaches in that it omits the step of inferring a 
grammar from a corpus. Instead, an annotated corpus 
is directly used as a stochastic grammar. An input 
string is parsed by combining subtrees from the 
corpus. In this view, every subtree can be considered 
as an elementary structure. As a consequence, one 
parse tree can usually be generated by several 
derivations that involve different subtrees. This leads
to a statistical model in which the probability of a parse is equal
to the sum of the probabilities of all its derivations. It 
is hoped that this approach can accommodate all 
statistical properties of a language corpus. 
Let us illustrate DOP with an extremely simple 
example. Suppose that a corpus consists of only two
trees: 
[Figure: the two corpus trees, each an S with daughters NP and VP.]
Suppose that our combination operation (indicated
with $\circ$) consists of substituting a subtree on the
leftmost identically labeled leaf node of another 
subtree. Then the sentence Mary likes Susan can be 
parsed as an S by combining the following subtrees
from the corpus. 
[Figure: first derivation, of the form $S \circ NP \circ NP$.]
But the same parse tree can also be derived by 
combining other subtrees, for instance:
[Figure: second and third derivations of the same parse tree, of the forms $S \circ NP \circ V$ and $S \circ NP \circ VP \circ NP$.]
Thus, a parse can have several derivations involving 
different subtrees. These derivations have different 
probabilities. Using the corpus as our stochastic 
grammar, we estimate the probability of substituting a
certain subtree on a specific node as the probability of 
selecting this subtree among all subtrees in the corpus 
that could be substituted on that node. The probability 
of a derivation can be computed as the product of the 
probabilities of the subtrees that are combined. For the
example derivations above, this yields: 
$P(\text{1st example}) = 1/20 \cdot 1/4 \cdot 1/4 = 1/320$
$P(\text{2nd example}) = 1/20 \cdot 1/4 \cdot 1/2 = 1/160$
$P(\text{3rd example}) = 2/20 \cdot 1/4 \cdot 1/8 \cdot 1/4 = 1/1280$
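The three products can be checked with exact rational arithmetic. The following is a minimal sketch; the dictionary layout and the names `derivations` and `derivation_probs` are illustrative assumptions, and only the numeric factors are taken from the text.

```python
from fractions import Fraction
from math import prod

# Subtree probabilities of the three example derivations, as given in the text.
derivations = {
    "1st example": [Fraction(1, 20), Fraction(1, 4), Fraction(1, 4)],
    "2nd example": [Fraction(1, 20), Fraction(1, 4), Fraction(1, 2)],
    "3rd example": [Fraction(2, 20), Fraction(1, 4), Fraction(1, 8), Fraction(1, 4)],
}

# The probability of a derivation is the product of its subtree probabilities.
derivation_probs = {name: prod(ps) for name, ps in derivations.items()}

for name, p in derivation_probs.items():
    print(name, p)
```

Using `Fraction` rather than floating point keeps the results directly comparable with the fractions quoted above.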
This example illustrates that a statistical language
model which defines probabilities over parses by
taking into account only one derivation does not
accommodate all statistical properties of a language
corpus. Instead, we will define the probability of a
parse as the sum of the probabilities of all its 
derivations. Finally, the probability of a string is 
equal to the sum of the probabilities of all its parses. 
We will show that conventional parsing techniques
can be applied to DOP, but that this becomes very 
inefficient, since the number of derivations of a parse 
grows exponentially with the length of the input 
string. However, we will show that the maximum probability
parse can be estimated in polynomial time by using Monte Carlo
techniques. 
An important advantage of using a corpus for 
probability calculation is that no training of
parameters is needed, as is the case for other stochastic 
grammars (Jelinek et al., 1990; Pereira and Schabes, 
1992; Schabes, 1992). Secondly, since we take into 
account all derivations of a parse, no relationship that 
might possibly be of statistical interest is ignored. 
2 The Model 
As might be clear by now, a DOP-model is
characterized by a corpus of tree structures, together 
with a set of operations that combine subtrees from 
the corpus into new trees. In this section we explain 
more precisely what we mean by subtree, operations 
etc., in order to arrive at definitions of a parse and the 
probability of a parse with respect to a corpus. For a 
treatment of DOP in more formal terms we refer to 
(Bod, 1992a). 
2.1 Subtree 
A subtree of a tree T is a connected subgraph S of T 
such that for every node in S it holds that if it has
daughter nodes, then these are equal to the daughter
nodes of the corresponding node in T. It is trivial to
see that a subtree is also a tree. In the following
example $T_1$ and $T_2$ are subtrees of T, whereas $T_3$
is not.
[Figure: example tree T, with the subtrees $T_1$ and $T_2$ and the non-subtree $T_3$.]
The general definition above also includes subtrees
consisting of one node. Since such subtrees do not
contribute to the parsing process, we exclude these
pathological cases and consider as the set of subtrees
the non-trivial ones consisting of more than one node. 
We shall use the following notation to indicate that a 
tree t is a non-trivial subtree of a tree in a corpus C: 
$t \in C \;=_{\text{def}}\; \exists\, T \in C$: $t$ is a non-trivial subtree of $T$
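The definition can be made concrete by enumerating subtrees. The sketch below assumes a hypothetical encoding that is not part of the original model description: a tree is a nested tuple `(label, child, ...)`, a word is a plain string, and a node whose daughters have been cut off is the one-element tuple `(label,)`. The example tree and its words are likewise only illustrative.

```python
from collections import Counter
from itertools import product

def rooted_subtrees(node):
    """Yield all subtrees rooted at `node`, including the one-node subtree."""
    if isinstance(node, str):          # a word: only the word itself
        yield node
        return
    label, *children = node
    yield (label,)                     # cut here: the node becomes a non-terminal leaf
    for combo in product(*(list(rooted_subtrees(c)) for c in children)):
        yield (label, *combo)          # keep all daughters, as the definition requires

def nontrivial_subtrees(tree):
    """Yield every subtree of `tree` that consists of more than one node."""
    if isinstance(tree, str):
        return
    for sub in rooted_subtrees(tree):
        if len(sub) > 1:
            yield sub
    for child in tree[1:]:
        yield from nontrivial_subtrees(child)

# Hypothetical example tree of the shape S(NP, VP(V, NP)).
tree = ("S", ("NP", "Mary"), ("VP", ("V", "likes"), ("NP", "Susan")))
counts = Counter(sub[0] for sub in nontrivial_subtrees(tree))
print(counts)
```

For this single tree the sketch finds 10 subtrees rooted in S, 4 in VP, 2 in NP and 1 in V. If both trees of the introductory corpus have this shape (an assumption here), doubling these counts gives 20, 8, 4 and 2, which agrees with the denominators used in the introductory example.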
2.2 Operations 
In this article we will limit ourselves to the basic 
operation of substitution. Other possible operations 
are left to future research. If t and u are trees, such that 
the leftmost non-terminal leaf of t is equal to the root 
of u, then $t \circ u$ is the tree that results from substituting
this non-terminal leaf in t by tree u. The partial
function $\circ$ is called substitution. We will write
$(t \circ u) \circ v$ as $t \circ u \circ v$, and in general
$(\ldots((t_1 \circ t_2) \circ t_3) \circ \ldots) \circ t_n$ as
$t_1 \circ t_2 \circ t_3 \circ \ldots \circ t_n$. The restriction leftmost in the definition
is motivated by the fact that it eliminates
different derivations consisting of the same subtrees.
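Leftmost substitution can be sketched as follows. The tuple encoding is an assumption made for illustration: a tree is `(label, child, ...)`, a word is a string, and a non-terminal leaf is the one-element tuple `(label,)`. The function returns `None` where the partial function is undefined, and the example subtrees are hypothetical.

```python
def leftmost_open_leaf(tree, path=()):
    """Return the path to the leftmost non-terminal leaf of `tree`, or None."""
    if isinstance(tree, str):
        return None                    # words are terminal leaves
    if len(tree) == 1:
        return path                    # (label,) is a non-terminal leaf
    for i, child in enumerate(tree[1:], start=1):
        found = leftmost_open_leaf(child, path + (i,))
        if found is not None:
            return found
    return None

def substitute(t, u):
    """Return t o u, or None if the leftmost non-terminal leaf of t is not root(u)."""
    path = leftmost_open_leaf(t)
    if path is None:
        return None

    def rebuild(node, rest):
        if not rest:                   # reached the leaf to be replaced
            return u if node[0] == u[0] else None
        i = rest[0]
        new_child = rebuild(node[i], rest[1:])
        if new_child is None:
            return None
        return node[:i] + (new_child,) + node[i + 1:]

    return rebuild(t, path)

# Hypothetical subtrees for the sentence "Mary likes Susan".
s = ("S", ("NP",), ("VP", ("V", "likes"), ("NP",)))
step1 = substitute(s, ("NP", "Mary"))
step2 = substitute(step1, ("NP", "Susan"))
print(step2)
```

Because only the leftmost non-terminal leaf is ever a substitution site, the two NP subtrees can be combined with `s` in one order only, which is the effect the restriction is meant to have.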
2.3 Parse 
Tree T is a parse of input string s with respect to a
corpus C, iff the yield of T is equal to s and there are
subtrees $t_1, \ldots, t_n \in C$, such that $T = t_1 \circ \ldots \circ t_n$. The set
of parses of s with respect to C, is thus given by:
$\mathit{Parses}(s, C) = \{\, T \mid \mathit{yield}(T) = s \wedge \exists\, t_1, \ldots, t_n \in C : T = t_1 \circ \ldots \circ t_n \,\}$
The definition correctly includes the trivial case of a 
subtree from the corpus whose yield is equal to the 
complete input string. 
2.4 Derivation 
A derivation of a parse T with respect to a corpus C, 
is a tuple of subtrees $(t_1, \ldots, t_n)$ such that $t_1, \ldots, t_n \in C$
and $t_1 \circ \ldots \circ t_n = T$. The set of derivations of T with
respect to C, is thus given by:
$\mathit{Derivations}(T, C) = \{\, (t_1, \ldots, t_n) \mid t_1, \ldots, t_n \in C \wedge t_1 \circ \ldots \circ t_n = T \,\}$
2.5 Probability 
2.5.1 Subtree 
Given a subtree $t_1 \in C$, a function root that yields the
root of a tree, and a node labeled X, the conditional
probability $P(t = t_1 \mid \mathit{root}(t) = X)$ denotes the probability
that $t_1$ is substituted on X. If $\mathit{root}(t_1) \neq X$, this
probability is 0. If $\mathit{root}(t_1) = X$, this probability can
be estimated as the ratio between the number of
occurrences of $t_1$ in C and the total number of
occurrences of subtrees $t'$ in C for which it holds that
$\mathit{root}(t') = X$. Evidently, $\sum_i P(t = t_i \mid \mathit{root}(t) = X) = 1$
holds.
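Written as a formula, the relative-frequency estimate described above reads as follows. The notation $\#(t)$ for the number of occurrences of a subtree $t$ in C is introduced here for illustration only.

```latex
P(t = t_1 \mid \mathit{root}(t) = X) \;=\;
  \frac{\#(t_1)}{\sum_{t' :\ \mathit{root}(t') = X} \#(t')}
  \qquad \text{if } \mathit{root}(t_1) = X,
  \text{ and } 0 \text{ otherwise.}
```

In the introductory example, a subtree that occurs twice among 20 occurrences of S-rooted subtrees thus receives probability 2/20.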
2.5.2 Derivation 
The probability of a derivation $(t_1, \ldots, t_n)$ is equal to
the probability that the subtrees $t_1, \ldots, t_n$ are combined.
This probability can be computed as the product of the
conditional probabilities of the subtrees $t_1, \ldots, t_n$. Let
$\mathit{lnl}(x)$ be the leftmost non-terminal leaf of tree x, then:
$P((t_1, \ldots, t_n)) = P(t = t_1 \mid \mathit{root}(t) = S) \cdot \prod_{i=2}^{n} P(t = t_i \mid \mathit{root}(t) = \mathit{lnl}(t_1 \circ \ldots \circ t_{i-1}))$
2.5.3 Parse 
The probability of a parse is equal to the probability 
that any of its derivations occurs. Since the 
derivations are mutually exclusive, the probability of a 
parse T is the sum of the probabilities of all its 
derivations. Let $\mathit{Derivations}(T, C) = \{d_1, \ldots, d_n\}$, then:
$P(T) = \sum_i P(d_i)$. The conditional probability of a
parse T given input string s, can be computed as the
ratio between the probability of T and the sum of the
probabilities of all parses of s.
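The conditional probability described in the last sentence can be written out as follows; the symbol $T'$ for an arbitrary parse of s is introduced here for illustration.

```latex
P(T \mid s) \;=\; \frac{P(T)}{\sum_{T' \in \mathit{Parses}(s, C)} P(T')}
  \qquad \text{for } T \in \mathit{Parses}(s, C).
```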
2.5.4 String 
The probability of a string is equal to the probability 
that any of its parses occurs. Since the parses are 
mutually exclusive, the probability of a string s can be 
computed as the sum of the probabilities of all its 
parses. Let $\mathit{Parses}(s, C) = \{T_1, \ldots, T_n\}$, then: $P(s) =
\sum_i P(T_i)$. It can be shown that $\sum_i P(s_i) = 1$ holds.
3 Superstrong Equivalence 
There is an important question as to whether it is 
possible to create for every DOP-model a strongly 
equivalent stochastic CFG which also assigns the 
same probabilities to the parses. In order to discuss
this question, we introduce the notion of superstrong 
equivalence. Two stochastic grammars are called 
superstrongly equivalent, if they are strongly 
equivalent (i.e. they generate the same strings with the 
same trees) and they generate the same probability 
distribution over the trees. 
The question as to whether for every DOP-model there 
exists a strongly equivalent stochastic CFG, is rather 
trivial, since every subtree can be decomposed into 
rewrite rules describing exactly every level of 
constituent structure of that subtree. The question as 
to whether for every DOP-model there exists a 
superstrongly equivalent stochastic CFG, can also be
answered without too much difficulty. We shall give a 
counter-example, showing that there exists a DOP- 
model for which there is no superstrongly equivalent 
stochastic CFG. 
Proposition It is not the case that for every DOP-
model there exists a superstrongly equivalent
stochastic CFG. 
Proof 
Consider the following DOP-model, consisting of a 
corpus with just one tree. 
(S (S a) b)
This corpus contains three subtrees, namely 
$t_1$ = (S (S a) b), $t_2$ = (S S b), $t_3$ = (S a),
where the inner S of $t_2$ is a non-terminal leaf.
The conditional probabilities of the subtrees are: 
$P(t = t_1 \mid \mathit{root}(t) = S) = 1/3$, $P(t = t_2 \mid \mathit{root}(t) = S) = 1/3$,
$P(t = t_3 \mid \mathit{root}(t) = S) = 1/3$. Thus, $\sum_i P(t = t_i \mid \mathit{root}(t) = S) = 1$
holds. The language generated by this model is
$ab^*$. Let us consider the probabilities of the parses
of the strings a and ab. The parse of string a can be
generated by exactly one derivation: by applying 
subtree t3. The probability of this parse is hence equal 
to 1/3. The parse of ab can be generated by two 
derivations: by applying subtree $t_1$, or by combining
subtrees $t_2$ and $t_3$. The probability of this parse is
equal to the sum of the probabilities of its two
derivations, which is equal to
$P(t = t_1 \mid \mathit{root}(t) = S) + P(t = t_2 \mid \mathit{root}(t) = S) \cdot P(t = t_3 \mid \mathit{root}(t) = S) = 1/3 + 1/3 \cdot 1/3 = 4/9$.
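The parse probabilities in this one-tree model can be recomputed by enumerating derivations. The sketch below is illustrative only: it identifies the parse of the string consisting of a followed by n occurrences of b with the integer n, and the function names are assumptions introduced here.

```python
from fractions import Fraction
from math import prod

# Each of the three subtrees is selected with probability 1/3 on an S node.
P_SUBTREE = {"t1": Fraction(1, 3), "t2": Fraction(1, 3), "t3": Fraction(1, 3)}

def derivations(n):
    """All derivations of the parse of a b^n (n >= 0), as tuples of subtree names."""
    if n == 0:
        return [("t3",)]                                  # only t3 yields the string a
    result = [("t2",) + d for d in derivations(n - 1)]    # t2 adds one b and leaves an open S
    if n == 1:
        result.append(("t1",))                            # t1 yields ab in a single step
    return result

def parse_probability(n):
    """Sum, over all derivations, of the product of the subtree probabilities."""
    return sum(prod(P_SUBTREE[name] for name in d) for d in derivations(n))

print(parse_probability(0), parse_probability(1))
```

The two printed values correspond to the parses of a and ab discussed above.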
If we now want to construct a superstrongly equivalent 
stochastic CFG, it should assign the same 
probabilities to these parses. We will show that this is 
impossible. A CFG which is strongly equivalent with 
the DOP-model above, should contain the following 
rewrite rules. 
$S \rightarrow S\,b$ (1)
$S \rightarrow a$ (2)
There may be other rules as well, but they should not 
modify the language or structures generated by the
CFG above. Thus, the rewrite rule $S \rightarrow A$ may be
added to the rules, as well as $A \rightarrow B$, whereas the
rewrite rule $S \rightarrow ab$ may not be added.
Our problem is now whether we can assign 
probabilities to these rules such that the probability of 
the parse of a equals 1/3, and the probability of the 
parse of ab equals 4/9. The parse of a can exhaustively 
be generated by applying rule (2), while the parse of 
ab can exhaustively be generated by applying rules (1) 
and (2). Thus the following should hold: 
$P(2) = 1/3$
$P(1) \cdot P(2) = 4/9$
This implies that $P(1) \cdot 1/3 = 4/9$, thus $P(1) = 4/9 \cdot 3
= 4/3$. This means that the probability of rule (1)
should be larger than 1, which is not allowed. Thus,
we have proved that not for every DOP-model there 
exists a superstrongly equivalent stochastic CFG. In 
(Bod, 1992b) superstrong equivalence relations 
between other stochastic grammars are studied. 
4 Monte Carlo Parsing 
It is easy to show that an input string can be parsed 
with conventional parsing techniques, by applying 
subtrees instead of rules to the input string (Bod, 
1992a). Every subtree t can be seen as a production 
rule $\mathit{root}(t) \rightarrow t$, where the non-terminals of the yield
of the right hand side constitute the symbols to which 
new rules/subtrees are applied. Given a polynomial 
time parsing algorithm, a derivation of the input
string, and hence a parse, can be calculated in 
polynomial time. But if we calculate the probability 
of a parse by exhaustively calculating all its 
derivations, the time complexity becomes exponential, 
since the number of derivations of a parse of an input 
string grows exponentially with the length of the 
input string. 
Nevertheless, by applying Monte Carlo techniques 
(Hammersley and Handscomb, 1964), we can estimate
the probability of a parse and make its error arbitrarily 
small in polynomial time. The essence of Monte 
Carlo is very simple: it estimates a probability 
distribution of events by taking random samples. The 
larger the samples we take, the higher the reliability. 
For DOP this means that, instead of exhaustively 
calculating all parses with all their derivations, we 
randomly calculate N parses of an input string (by 
taking random samples from the subtrees that can be 
substituted on a specific node in the parsing process). 
The estimated probability of a certain parse given the 
input string, is then equal to the number of times that 
parse occurred normalized with respect to N. We can 
estimate a probability as accurately as we want by 
choosing N as large as we want, since according to the
Strong Law of Large Numbers the estimated 
probability converges to the actual probability. From a 
classical result of probability theory (Chebyshev's 
inequality) it follows that the time complexity of 
achieving a maximum error $\epsilon$ is given by $O(\epsilon^{-2})$. Thus
the error of probability estimation can be made 
arbitrarily small in polynomial time - provided that 
the parsing algorithm is not worse than polynomial. 
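The quadratic bound can be made explicit with a standard argument, sketched here under the assumption that the N sampled parses are independent. The symbols $\hat{p}$ (the estimate), $k$ (the number of times the parse occurred), $p$ (its actual probability) and $\delta$ (the tolerated probability of exceeding the error) are introduced for illustration.

```latex
\begin{align*}
\hat{p} &= \frac{k}{N}, \qquad
\mathrm{E}[\hat{p}] = p, \qquad
\mathrm{Var}[\hat{p}] = \frac{p(1-p)}{N} \le \frac{1}{4N} \\
P\bigl(|\hat{p} - p| \ge \epsilon\bigr)
  &\le \frac{\mathrm{Var}[\hat{p}]}{\epsilon^{2}}
  \le \frac{1}{4N\epsilon^{2}} \\
\frac{1}{4N\epsilon^{2}} \le \delta
  &\iff N \ge \frac{1}{4\,\delta\,\epsilon^{2}}
\end{align*}
```

For a fixed $\delta$, the required number of samples therefore grows as $\epsilon^{-2}$ and does not depend on the length of the input string.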
Obviously, probable parses of an input string are more 
likely to be generated than improbable ones. Thus, in 
order to estimate the maximum probability parse, it 
suffices to sample until stability in the top of the 
parse distribution occurs. The parse which is generated 
most often is then the maximum probability parse. 
We now show that the probability that a certain parse 
is generated by Monte Carlo, is exactly the probability 
of that parse according to the DOP-model. First, the 
probability that a subtree $t \in C$ is sampled at a certain
point in the parsing process (where a non-terminal X
is to be substituted) is equal to $P(t \mid \mathit{root}(t) = X)$.
Secondly, the probability that a certain sequence
$t_1, \ldots, t_n$ of subtrees that constitutes a derivation of a
parse T, is sampled, is equal to the product of the 
conditional probabilities of these subtrees. Finally, the 
probability that any sequence of subtrees that 
constitutes a derivation of a certain parse T, is 
sampled, is equal to the sum of the probabilities that 
these derivations are sampled. This is the probability 
that a certain parse T is sampled, which is equivalent 
to the probability of T according to the DOP-model. 
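The argument can be illustrated on the one-tree model of Section 3, where $t_1$ = (S (S a) b), $t_2$ = (S S b) and $t_3$ = (S a) are each selected with probability 1/3. The sketch below is a simplification and not the parser described here: it samples derivations from the whole model instead of restricting the sampling to subtrees compatible with a given input string. The function names and the fixed seed are assumptions.

```python
import random
from collections import Counter

def sample_parse(rng):
    """Sample one derivation top-down; return n, identifying the parse of a b^n."""
    n = 0
    while True:
        subtree = rng.choice(("t1", "t2", "t3"))   # each has probability 1/3
        if subtree == "t1":        # (S (S a) b): closes the tree and adds one b
            return n + 1
        if subtree == "t3":        # (S a): closes the tree
            return n
        n += 1                     # t2 = (S S b): adds one b and leaves an open S

def estimate(num_samples, seed=0):
    """Estimate parse probabilities as relative frequencies over the samples."""
    rng = random.Random(seed)
    counts = Counter(sample_parse(rng) for _ in range(num_samples))
    return {n: c / num_samples for n, c in counts.items()}

est = estimate(20000)
print(est[0], est[1])
```

The relative frequencies for the parses of a and ab should approach 1/3 and 4/9, the values obtained by summing over derivations, although different derivations of the same parse are never distinguished by the sampler.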
We shall call a parser which applies this Monte Carlo 
technique, a Monte Carlo parser. With respect to the 
theory of computation, a Monte Carlo parser is a probabilistic algorithm 
which belongs to the class of 
Bounded error Probabilistic Polynomial time (BPP) 
algorithms. BPP-problems are characterized by the 
following: it may take exponential time to solve them 
exactly, but there exists an estimation algorithm with 
a probability of error that becomes arbitrarily small in 
polynomial time. 
5 Experiments on the ATIS corpus
For our experiments we used part-of-speech sequences 
of spoken-language transcriptions from the Air Travel 
Information System (ATIS) corpus (Hemphill et al., 
1990), with the labeled-bracketings of those sequences 
in the Penn Treebank (Marcus, 1991). The 750 
labeled-bracketings were divided at random into a 
DOP-corpus of 675 trees and a test set of 75 part-of- 
speech sequences. The following tree is an example 
from the DOP-corpus, where for reasons of readability
the lexical items are added to the part-of-speech tags. 
( (S (NP *)
     (VP (VB Show)
         (NP (PP me))
         (NP (NP (PDT all))
             (DT the) (JJ nonstop) (NNS flights)
             (PP (IN from)
                 (NP (NP Dallas)))
             (PP (TO to)
                 (NP (NP Denver))))
         (ADJP (JJ early)
               (PP (IN in)
                   (NP (DT the)
                       (NN morning))))))
  .)
As a measure for parsing accuracy we took the
percentage of the test sentences for which the
maximum probability parse derived by the Monte
Carlo parser (for a sample size N) is identical to the
Treebank parse.
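This exact-match measure can be sketched as a small function. The name `parsing_accuracy` and the representation of parses as bracketed strings are assumptions made for illustration; the toy data is not taken from the ATIS experiments.

```python
def parsing_accuracy(predicted, gold):
    """Fraction of test sentences whose predicted parse equals the Treebank parse."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

# Toy data: two of the three predicted parses match the gold parses exactly.
predicted = ["(S a)", "(S (S a) b)", "(S a)"]
gold = ["(S a)", "(S (S a) b)", "(S (S a) b)"]
accuracy = parsing_accuracy(predicted, gold)
print(accuracy)
```

Under this measure a parse counts as correct only if it is identical to the Treebank parse, so partially correct bracketings receive no credit.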
It is one of the most essential features of the DOP 
approach, that arbitrarily large subtrees are taken into 
consideration. In order to test the usefulness of this 
feature, we performed different experiments 
constraining the depth of the subtrees. The depth of a 
tree is defined as the length of its longest path. The
following table shows the results of seven
experiments. The accuracy refers to the parsing
accuracy at sample size N = 100, and is rounded off to
the nearest integer.
depth       accuracy
≤2          87%
≤3          92%
≤4          93%
≤5          93%
≤6          95%
≤7          95%
unbounded   96%
Parsing accuracy for the ATIS corpus, sample size N = 100.
The table shows that there is a relatively rapid increase
in parsing accuracy when enlarging the maximum
depth of the subtrees to 3. The accuracy keeps
increasing, at a slower rate, when the depth is enlarged 
further. The highest accuracy is obtained by using all 
subtrees from the corpus: 72 out of the 75 sentences 
from the test set are parsed correctly. 
In the following figure, parsing accuracy is plotted 
against the sample size N for three of our experiments:
the experiments where the depth of the subtrees is
constrained to 2 and 3, and the experiment where the
depth is unconstrained. (The maximum depth in the
ATIS corpus is 13.) 
[Figure: parsing accuracy (%) plotted against sample size N, for N up to 100.]
Parsing accuracy for the ATIS corpus, with depth ≤ 2, with
depth ≤ 3 and with unbounded depth.
In (Pereira and Schabes, 1992), 90.36% bracketing 
accuracy was reported using a stochastic CFG trained 
on bracketings from the ATIS corpus. Though we 
cannot make a direct comparison, our pilot experiment
suggests that our model may have better performance 
than a stochastic CFG. However, there is still an error 
rate of 4%. Although there is no reason to expect 
100% accuracy in the absence of any semantic or 
pragmatic analysis, it seems that the accuracy might 
be further improved. Three limitations of the current 
experiments are worth mentioning.
First, the Treebank annotations are not rich enough.
Although the Treebank uses a relatively rich part-of-
speech system (48 terminal symbols), there are only
15 non-terminal symbols. Especially the internal
structure of noun phrases is very poor. Semantic
annotations are completely absent. 
Secondly, it could be that subtrees which occur only 
once in the corpus, give bad estimations of their actual 
probabilities. The question as to whether reestimation 
techniques would further improve the accuracy, must 
be considered in future research. 
Thirdly, it could be that our corpus is not large 
enough. This brings us to the question as to how 
much parsing accuracy depends on the size of the 
corpus. For studying this question, we performed 
additional experiments with different corpus sizes. 
Starting with a corpus of only 50 parse trees 
(randomly chosen from the initial DOP-corpus of 675 
trees), we increased its size with intervals of 50. As 
our test set, we took the same 75 p-o-s sequences as 
used in the previous experiments. In the next figure 
the parsing accuracy, for sample size N = 100, is 
plotted against the corpus size, using all corpus 
subtrees. 
[Figure: parsing accuracy (%) plotted against corpus size, for corpus sizes up to 675 trees.]
Parsing accuracy for the ATIS corpus, with unbounded
depth.
The figure shows the increase in parsing accuracy. For 
a corpus size of 450 trees, the accuracy already reaches
88%. After this, the growth slows down, but the accuracy
is still growing at corpus size 675. Thus, we would 
expect a higher accuracy if the corpus is further 
enlarged. 
6 Conclusions and Future Research 
We have presented a language model that uses an 
annotated corpus as a stochastic grammar. We 
restricted ourselves to substitution as the only 
combination operation between corpus subtrees. A 
statistical parsing theory was developed, where one 
parse can be generated by different derivations, and 
where the probability of a parse is computed as the 
sum of the probabilities of all its derivations. It was 
shown that our model cannot always be described by a 
stochastic CFG. It turned out that the maximum 
probability parse can be estimated as accurately as 
desired in polynomial time by using Monte Carlo 
techniques. The method has been successfully tested on
a set of part-of-speech sequences derived from the 
ATIS corpus. It turned out that parsing accuracy 
improved if larger subtrees were used. 
We would like to extend our experiments to larger 
corpora, like the Wall Street Journal corpus. This 
might raise computational problems, since the number 
of subtrees becomes extremely large. Furthermore, in 
order to tackle the problem of data sparseness, the 
possibility of abstracting from corpus data should be 
included, but statistical models of abstractions of 
features and categories are not yet available. 
Acknowledgements 
The author is very much indebted to Remko Scha for 
many valuable comments on earlier versions of this 
paper. The author is also grateful to Mitch Marcus for 
supplying the ATIS corpus. 
References 
R. Bod, 1992a. "A Computational Model of 
Language Performance: Data Oriented Parsing", 
Proceedings COLING'92, Nantes.
R. Bod, 1992b. "Mathematical Properties of the Data 
Oriented Parsing Model", paper presented at the Third
Meeting on Mathematics of Language (MOL3),
Austin, Texas. 
J.M. Hammersley and D.C. Handscomb, 1964. Monte 
Carlo Methods, Chapman and Hall, London. 
C.T. Hemphill, J.J. Godfrey and G.R. Doddington, 
1990. "The ATIS spoken language systems pilot 
corpus". DARPA Speech and Natural Language 
Workshop, Hidden Valley, Morgan Kaufmann. 
F. Jelinek, J.D. Lafferty and R.L. Mercer, 1990. Basic 
Methods of Probabilistic Context Free Grammars, 
Technical Report IBM RC 16374 (#72684), Yorktown 
Heights. 
M. Marcus, 1991. "Very Large Annotated Database of 
American English". DARPA Speech and Natural
Language Workshop, Pacific Grove, Morgan
Kaufmann.
F. Pereira and Y. Schabes, 1992. "Inside-Outside 
Reestimation from Partially Bracketed Corpora",
Proceedings ACL'92, Newark.
P. Resnik, 1992. "Probabilistic Tree-Adjoining 
Grammar as a Framework for Statistical Natural 
Language Processing", Proceedings COLING'92,
Nantes. 
R. Scha, 1990. "Language Theory and Language 
Technology; Competence and Performance" (in 
Dutch), in Q.A.M. de Kort & G.L.J. Leerdam (eds.),
Computertoepassingen in de Neerlandistiek, Almere:
Landelijke Vereniging van Neerlandici (LVVN-
jaarboek).
Y. Schabes, 1992. "Stochastic Lexicalized Tree- 
Adjoining Grammars", Proceedings COLING'92, 
Nantes. 
