A COMPUTATIONAL MODEL OF LANGUAGE PERFORMANCE:
DATA ORIENTED PARSING
RENS BOD*
Department of Computational Linguistics
University of Amsterdam
Spuistraat 134
1012 VB Amsterdam
The Netherlands
rens@alf.let.uva.nl
Abstract 
Data Oriented Parsing (DOP) is a model where no
abstract rules, but language experiences in the form of an
analyzed corpus, constitute the basis for language
processing. Analyzing a new input means that the
system attempts to find the most probable way to
reconstruct the input out of fragments that already exist
in the corpus. Disambiguation occurs as a side-effect.
DOP can be implemented by using conventional parsing
strategies.
Introduction
This paper formalizes the model for natural language
introduced in [Scha 1990]. Since that article is written
in Dutch, we will translate some parts of it more or less
literally in this introduction. According to Scha, the
current tradition of language processing systems is based
on linguistically motivated competence models of
natural languages. The problems that these systems run
into suggest the necessity of a more performance
oriented model of language processing, that takes into
account the statistical properties of real language use.
Therefore Scha proposes a system that makes use of an
annotated corpus. Analyzing a new input means that the
system attempts to find the most probable way to
reconstruct the input out of fragments that already exist
in the corpus.
The problems with competence grammars that
are mentioned in Scha's article include the explosion of
ambiguities, the fact that human judgements on
grammaticality are not stable, that competence grammars
do not account for language change, and that no existing
rule-based grammar gives a descriptively adequate
characterization of an actual language. According to
Scha, the development of a formal grammar for natural
language gets more difficult as the grammar gets larger.
When the number of phenomena one has already taken
into account gets larger, the number of interactions that
must be considered when one tries to introduce an
account of a new phenomenon grows accordingly.
As to the problem of ambiguity, it has turned
out that as soon as a formal grammar characterizes a
non-trivial part of a natural language, almost every input
sentence of reasonable length gets an unmanageably
large number of different structural analyses (and
semantical interpretations).¹ This is problematic since
most of these interpretations are not perceived as
possible by a human language user, while there are no
systematic reasons to exclude them on syntactic or
semantic grounds. Often it is just a matter of relative
implausibility: the only reason why a certain
interpretation of a sentence is not perceived, is that
another interpretation is much more plausible.
* The author wishes to thank his colleagues at the Department
of Computational Linguistics of the University of Amsterdam
for many fruitful discussions, and, in particular, Remko Scha,
Martin van den Berg, Kwee Tjoe Liong and Frederik Somsen
for valuable comments on earlier versions of this paper.
Competence and Performance 
The limitations of the current language processing
systems are not surprising: they are the direct
consequence of the fact that these systems implement
Chomsky's notion of a competence grammar. The
formal grammars that constitute the subject-matter of
theoretical linguistics aim at characterizing the
competence of the language user. But the preferences
language users have in the case of ambiguous sentences
are paradigm instances of performance phenomena.
In order to build effective language processing
systems we must implement performance-grammars,
rather than competence grammars. These performance
grammars should not only contain information on the
structural possibilities of the general language system,
but also on details of actual language use in a language
community, and of the language experiences of an
individual, which cause this individual to have certain
expectations on what kinds of utterances are going to
occur, and what structures and interpretations these
utterances are going to have.
There is an alternative linguistic tradition that
has always focused on the concrete details of actual
language use: the statistical tradition. In this approach,
syntactic structure is usually ignored; only 'superficial'
statistical properties of a large corpus are described: the
probability that a certain word is followed by a certain
other word, the probability that a certain sequence of
two words is followed by a certain word, etc. (Markov
chains, see e.g. [Bahl 1983]). This approach has
performed successfully in certain practical tasks, such as
selecting the most probable sentence from the outputs of
a speech recognition component. It will be clear that this
approach is not suitable for many other tasks, because
no notion of syntactic structure is used. And there are
statistical dependencies within the sentences of a corpus
that can extend over an arbitrarily long sequence of
words; this is ignored by the Markov approach. The
challenge is now to develop a theory of language
processing that does justice to the statistical as well as
to the structural aspects of language.
¹ In [Martin 1979] it is reported that their parser generated 455
different parses for the sentence "List the sales of products
produced in 1973 with the products produced in 1972".
ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 855 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992
The Synthesis of Syntax and Statistics 
The idea that a synthesis between syntactic and
statistical approaches could be useful has occasionally
been proposed before, but has not been worked out very
well so far. The only technical elaboration of this idea
that exists at the moment, the notion of a probabilistic
grammar, is of a rather simplistic nature. A probabilistic
grammar is simply a juxtaposition of the most
fundamental syntactic notion and the most fundamental
statistical notion: it is an "old-fashioned" context free
grammar, that describes syntactic structures by means of
a set of abstract rewrite rules that are now provided with
probabilities that correspond to the application
probabilities of the rules (see e.g. [Jelinek 1990]).
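As a rough illustration of this notion, a probabilistic context free grammar can be sketched as a table of rewrite rules with application probabilities, where the probability of a derivation is the product of the probabilities of the rules it applies. The toy rules, their probabilities and the name `derivation_probability` are assumptions made for this example; they are not taken from [Jelinek 1990].

```python
from math import prod

# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability.
# The probabilities of all rules with the same left-hand side sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("NP", ("who",)): 0.4,
    ("NP", ("coling",)): 0.6,
    ("V", ("opened",)): 1.0,
}

def derivation_probability(rules_used):
    """Multiply the application probabilities of the rules in a derivation."""
    return prod(RULES[rule] for rule in rules_used)

# One derivation of the string "who opened coling".
derivation = [
    ("S", ("NP", "VP")),
    ("NP", ("who",)),
    ("VP", ("V", "NP")),
    ("V", ("opened",)),
    ("NP", ("coling",)),
]
p = derivation_probability(derivation)  # 1.0 * 0.4 * 0.7 * 1.0 * 0.6
```

In this sketch every rule contributes the same factor wherever it is applied, regardless of the rules applied around it.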
As long as a probabilistic grammar only
assigns probabilities to individual rewrite rules, the
grammar cannot account for all statistical properties of a
language corpus. It is, for instance, not possible to
indicate how the probability of syntactic structures or
lexical items depends on their syntactic/lexical context.
As a consequence of this, it is not possible to recognize
frequent phrases and figures of speech as such - a
disappointing property, for one would prefer that such
phrases and figures of speech would get a high priority
in the ranking of the possible syntactic analyses of a
sentence. Some improvements can be made by applying
the Markov approach to rewrite rules, as is found in the
work of [Salomaa 1969] and [Magerman 1991].
Nevertheless, any approach which ties probabilities to
rewrite rules will never be able to accommodate all
statistical dependencies. Optimal statistical estimations
can only be achieved if the statistics are applied to
different kinds of units than rewrite rules. It is
interesting to note that also in the field of theoretical
linguistics the necessity to use other kinds of structural
units has been put forward. The clearest articulation of
this idea is found in the work of [Fillmore 1988].
From a linguistic point of view that
emphasizes the syntactic complexities caused by
idiomatic and semi-idiomatic expressions, Fillmore et
al. arrive at the proposal to describe language not by
means of a set of rewrite rules, but by means of a set of
constructions. A construction is a tree-structure: a
fragment of a constituent-structure that can comprise
more than one level. This tree is labeled with syntactic,
semantic and pragmatic categories and feature-values.
Lexical items can be specified as part of a construction.
Constructions can be idiomatic in nature: the meaning
of a larger constituent can be specified without being
constructed from the meanings of its sub-constituents.
Fillmore's ideas still show the influence of the
tradition of formal grammars: the constructions are
schemata, and the combinatorics of putting the
constructions together looks very much like a context
free grammar. But the way in which Fillmore
generalizes the notion of grammar resolves the problems
we found in the current statistical grammars: if a
construction-grammar is combined with statistical
notions it is perhaps possible to represent all statistical
information. This is one of the central ideas behind our
approach.
A New Approach: Data Oriented Parsing
The starting-point of our approach is the idea indicated 
above, that when a human language user analyzes 
sentences, there is a strong preference for the recognition 
of sentences, constituents and patterns that occurred 
before in the experience of the language user. There is a 
statistical component in language processing that prefers 
more frequent structures and interpretations to less 
frequently perceived alternatives. 
The information we ideally would like to use
in order to model the language performance of a natural
language user comprises therefore an enumeration of all
lexical items and syntactic/semantic structures ever
experienced by the language user, with their frequency of
occurrence. In practice this means: a very large corpus of
sentences with their syntactic analyses and semantic
interpretations. Every sentence comprises a large number
of constructions: not only the whole sentence and all its
constituents, but also the patterns that can be abstracted
from the analyzed sentence by introducing 'free variables'
for lexical elements or complex constituents.
Parsing then does not happen by applying
grammatical rules to the input sentence, but by
constructing an optimal analogy between the input
sentence and as many corpus sentences as possible.
Sometimes the system shall need to abstract away from
most of the properties of the trees in the corpus, and
sometimes a part of the input is found literally in the
corpus, and can be treated as one unit in the parsing
process. Thus the system tries to combine constructions
from the corpus so as to reconstruct the input sentence
as 'well' as possible. The preferred parse out of all parses
of the input sentence is obtained by maximizing the
conditional probability of a parse given the sentence.
Finally, the preferred parse is added to the corpus,
bringing it into a new 'state'.
To illustrate the basic idea, consider the
following extremely simple example. Assume that the
whole corpus consists of only the following two trees:
[Figure: the two corpus trees, with category labels S, NP, VP, V, PP, P and Pr; the tree diagrams are not recoverable from the source.]
Then the input sentence "who opened coling'92 in Nantes
in July" can be analyzed as an S by combining the
following constructions from the corpus:
[Figure: the constructions that are combined to analyze the input sentence; the tree diagrams are not recoverable from the source.]
The Model 
In order to come to formal definitions of parse and
preferred parse we first specify some basic notions.
Labels 
We distinguish between the set of lexical labels $L$ and
the set of non-lexical labels $N$. Lexical labels represent
words. Non-lexical labels represent syntactic and/or
semantic and/or pragmatic information, depending on
the kind of corpus being used. We write $\mathcal{L}$ for $L \cup N$.
String
Given a set of labels $\mathcal{L}$, a string is an n-tuple of
elements of $\mathcal{L}$: $(L_1,\ldots,L_n) \in \mathcal{L}^n$. An input string is an
n-tuple of elements of $L$: $(L_1,\ldots,L_n) \in L^n$. A
concatenation $*$ can be defined on strings as usual:
$(a,\ldots,b) * (c,\ldots,d) = (a,\ldots,b,c,\ldots,d)$.
Tree 
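Given a set of labels $\mathcal{L}$, the set of trees is defined as the
smallest set $\mathit{Tree}$ such that
(1) if $L \in \mathcal{L}$, then $(L,()) \in \mathit{Tree}$
(2) if $L \in \mathcal{L}$ and $t_1,\ldots,t_n \in \mathit{Tree}$, then $(L,(t_1,\ldots,t_n)) \in \mathit{Tree}$
For a set of trees $\mathit{Tree}$ over a set of labels $\mathcal{L}$, we define a
function $\mathit{root}: \mathit{Tree} \to \mathcal{L}$ and a function $\mathit{leaves}: \mathit{Tree} \to \mathcal{L}^n$
by
for $n \geq 0$, $\mathit{root}((L,(t_1,\ldots,t_n))) = L$
for $n > 0$, $\mathit{leaves}((L,(t_1,\ldots,t_n))) = \mathit{leaves}(t_1) * \ldots * \mathit{leaves}(t_n)$
for $n = 0$, $\mathit{leaves}((L,())) = (L)$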
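The definitions of tree, root and leaves can be sketched in Python as follows. Representing a tree as a nested pair `(label, children)` follows the definition directly; the helper `node` and the example tree are assumptions introduced only for illustration.

```python
def node(label, *children):
    """Build a tree (label, (t1, ..., tn)); with no children this is (label, ())."""
    return (label, tuple(children))

def root(tree):
    """Return the label at the top of the tree."""
    label, _children = tree
    return label

def leaves(tree):
    """Return the tuple of leaf labels, from left to right."""
    label, children = tree
    if not children:             # n = 0: leaves((L, ())) = (L)
        return (label,)
    result = ()
    for child in children:       # n > 0: concatenate the leaves of the children
        result += leaves(child)
    return result

# Hypothetical example tree for the string "who opened coling'92".
t = node("S",
         node("NP", node("who")),
         node("VP", node("V", node("opened")),
                    node("NP", node("coling'92"))))
```

Here `root(t)` is `"S"` and `leaves(t)` is the input string as a tuple of words. A tree whose leaves include non-lexical labels, such as `node("S", node("NP"), node("VP"))`, is also a tree under this definition.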
Corpus 
A corpus $C$ is a multiset of trees, in the sense that any
tree can occur zero, one or more times. The leaves of
every tree in a corpus is an element of $L^n$: it constitutes
the string of words of which that tree is the analysis that
seemed most appropriate for understanding the string in
the context in which it was uttered.
Constructions
In order to define the Constructions of a tree, we need
two additional notions: Subtrees and Patterns.
$$\mathit{Subtrees}((L,(t_1,\ldots,t_n))) = \{(L,(t_1,\ldots,t_n))\} \cup \Big(\bigcup_{i=1}^{n} \mathit{Subtrees}(t_i)\Big)$$
$$\mathit{Patterns}((L,(t_1,\ldots,t_n))) = \{(L,())\} \cup \{(L,(u_1,\ldots,u_n)) \mid \forall i \in [1,n]: u_i \in \mathit{Patterns}(t_i)\}$$
$$\mathit{Constructions}(T) = \{t \mid \exists u \in \mathit{Subtrees}(T): t \in \mathit{Patterns}(u)\}$$
We shall use the following notation for a construction of
a tree in a corpus: $t \in C =_{def} \exists u \in C: t \in \mathit{Constructions}(u)$.
Example: consider tree $T$. The trees $T_1$ and $T_2$ are
constructions of $T$, while $T_3$ is not.
[Figure: tree T and the trees T1, T2 and T3; the tree diagrams are not recoverable from the source.]
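The three set definitions translate almost literally into Python. The sketch below assumes the nested-pair representation `(label, children)` for trees; the helper `node` and the small example tree are illustrative assumptions and do not reproduce the trees of the example above.

```python
from itertools import product

def node(label, *children):
    """Build a tree (label, (t1, ..., tn))."""
    return (label, tuple(children))

def subtrees(tree):
    """The tree itself together with all subtrees of its children."""
    _label, children = tree
    result = {tree}
    for child in children:
        result |= subtrees(child)
    return result

def patterns(tree):
    """All trees obtained by cutting the tree off below some of its nodes."""
    label, children = tree
    result = {(label, ())}                    # cut off directly below the root
    if children:
        # keep all children of the root, each replaced by one of its patterns
        for combo in product(*(patterns(c) for c in children)):
            result.add((label, combo))
    return result

def constructions(tree):
    """The patterns of all subtrees of the tree."""
    return {p for u in subtrees(tree) for p in patterns(u)}

# Hypothetical example: a small VP tree.
t = node("VP", node("V", node("opened")), node("NP", node("coling'92")))
cs = constructions(t)
```

For this small tree, `cs` contains for instance `node("VP", node("V"), node("NP"))`, but not `node("VP", node("V"))`: a pattern keeps either all children of a node or none of them.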
Composition 
If $t$ and $u$ are trees, such that the leftmost non-lexical
leaf of $t$ is equal to the root of $u$, then $t \circ u$ is the tree that
results from substituting this leaf in $t$ by tree $u$. The
partial function $\circ: \mathit{Tree} \times \mathit{Tree} \to \mathit{Tree}$ is called composition.
We will write $(t \circ u) \circ v$ as $t \circ u \circ v$, and in general
$(\ldots((t_1 \circ t_2) \circ t_3) \circ \ldots) \circ t_n$ as $t_1 \circ t_2 \circ t_3 \circ \ldots \circ t_n$.
Example:
[Figure: the composition of two trees; the tree diagrams are not recoverable from the source.]
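Composition can be sketched as a search for the leftmost non-lexical leaf, followed by a substitution. The set `NONLEXICAL`, the helper names and the choice to signal the undefined cases of the partial function with a `ValueError` are assumptions of this sketch.

```python
NONLEXICAL = {"S", "NP", "VP", "V"}   # hypothetical set N of non-lexical labels

def node(label, *children):
    """Build a tree (label, (t1, ..., tn))."""
    return (label, tuple(children))

def _substitute(tree, u):
    """Replace the leftmost non-lexical leaf of tree by u.
    Returns (new_tree, found); found is False if tree has no such leaf."""
    label, children = tree
    if not children:
        if label in NONLEXICAL:
            if label != u[0]:
                raise ValueError("leftmost non-lexical leaf differs from root of u")
            return u, True
        return tree, False            # a lexical leaf: keep looking to the right
    for i, child in enumerate(children):
        new_child, found = _substitute(child, u)
        if found:
            new_children = children[:i] + (new_child,) + children[i + 1:]
            return (label, new_children), True
    return tree, False

def compose(t, u):
    """Composition of t and u; raises ValueError where it is undefined."""
    result, found = _substitute(t, u)
    if not found:
        raise ValueError("t has no non-lexical leaf")
    return result

t = node("S", node("NP"), node("VP"))
u = node("NP", node("who"))
v = node("VP", node("V", node("opened")))
tuv = compose(compose(t, u), v)       # (t o u) o v
```

In this example `tuv` is the tree for "who opened". `compose(t, v)` is undefined, because the leftmost non-lexical leaf of `t` is NP rather than VP.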
Parse
Tree $T$ is a parse of input string $s$ with respect to $C$, iff
$\mathit{leaves}(T) = s$ and there are constructions $t_1,\ldots,t_n \in C$,
such that $T = t_1 \circ \ldots \circ t_n$. A tuple $(t_1,\ldots,t_n)$ of such
constructions is said to generate parse $T$ of $s$. Note that
different tuples of constructions can generate the same
parse. The set of parses of $s$ with respect to $C$,
$\mathit{Parse}(s,C)$, is given by
$$\mathit{Parse}(s,C) = \{T \in \mathit{Tree} \mid \mathit{leaves}(T) = s \wedge \exists t_1,\ldots,t_n \in C: T = t_1 \circ \ldots \circ t_n\}$$
The set of tuples of constructions that generate a parse
$T$, $\mathit{Tuples}(T,C)$, is given by
$$\mathit{Tuples}(T,C) = \{(t_1,\ldots,t_n) \mid t_1,\ldots,t_n \in C \wedge t_1 \circ \ldots \circ t_n = T\}$$
Probability 
An input string can have several parses and every such
parse can be generated by several different combinations
of constructions from the corpus. What we are interested
in, is, given an input string $s$, the probability that
arbitrary combinations of constructions from the corpus
generate a certain parse $T_i$ of $s$. Thus we are interested in
the conditional probability of a parse $T_i$ given $s$, with as
probability space the set of constructions of trees in the
corpus.
Let $T_i$ be a parse of input string $s$, and suppose
that $T_i$ can exhaustively be generated by $k$ tuples of
constructions: $\mathit{Tuples}(T_i,C) = \{(t_{11},\ldots,t_{1n_1}),
(t_{21},\ldots,t_{2n_2}), \ldots, (t_{k1},\ldots,t_{kn_k})\}$. Then $T_i$ occurs iff
$(t_{11},\ldots,t_{1n_1})$ or $(t_{21},\ldots,t_{2n_2})$ or ... or $(t_{k1},\ldots,t_{kn_k})$
occur, and $(t_{h1},\ldots,t_{hn_h})$ occurs iff $t_{h1}$ and $t_{h2}$ and ...
and $t_{hn_h}$ occur ($h \in [1,k]$). Thus the probability of $T_i$ is
given by
$$P(T_i) = P\big((t_{11} \cap \ldots \cap t_{1n_1}) \cup \ldots \cup (t_{k1} \cap \ldots \cap t_{kn_k})\big)$$
In shortened form:
$$P(T_i) = P\Big(\bigcup_{p=1}^{k} \Big(\bigcap_{q=1}^{n_p} t_{pq}\Big)\Big)$$
The events $t_{pq}$ are not mutually exclusive, since
constructions can overlap, and can include other
constructions. The formula for the joint probability of
events $E_i$ is given by:
$$P\Big(\bigcap_{i=1}^{n} E_i\Big) = \prod_{i=1}^{n} P(E_i \mid E_{i-1} \ldots E_1)$$
The formula for the probability of combination of events
$E_i$ (that are not independent) is given by (see e.g. [Harris
1966]):
$$P\Big(\bigcup_{i=1}^{k} E_i\Big) = \sum_{i} P(E_i) - \sum_{i_1<i_2} P(E_{i_1} \cap E_{i_2}) + \sum_{i_1<i_2<i_3} P(E_{i_1} \cap E_{i_2} \cap E_{i_3}) - \ldots \pm P(E_1 \cap E_2 \cap \ldots \cap E_k)$$
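To see how the two formulas combine, consider the smallest non-trivial case, in which a parse $T$ is generated by exactly $k = 2$ tuples, $(t_{11},t_{12})$ and $(t_{21},t_{22})$. The choice of two tuples with two constructions each is an assumption made only for this illustration.

```latex
\begin{aligned}
P(T) &= P\big((t_{11} \cap t_{12}) \cup (t_{21} \cap t_{22})\big) \\
     &= P(t_{11} \cap t_{12}) + P(t_{21} \cap t_{22})
        - P(t_{11} \cap t_{12} \cap t_{21} \cap t_{22}), \\
P(t_{11} \cap t_{12}) &= P(t_{11})\, P(t_{12} \mid t_{11}).
\end{aligned}
```

The second line is the formula for the combination of events with $E_1 = t_{11} \cap t_{12}$ and $E_2 = t_{21} \cap t_{22}$; the last line is the formula for the joint probability with $n = 2$.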
We will use Bayes' decomposition formula to
derive the conditional probability of $T_i$ given $s$. Let $T_i$
and $T_j$ be parses of $s$; the conditional probability of $T_i$
given $s$, is then given by:
$$P(T_i \mid s) = \frac{P(T_i)\,P(s \mid T_i)}{P(s)} = \frac{P(T_i)\,P(s \mid T_i)}{\sum_j P(T_j)\,P(s \mid T_j)}$$
Since $P(s \mid T_j)$ is 1 for all $j$, we may write
$$P(T_i \mid s) = \frac{P(T_i)}{\sum_j P(T_j)}$$
A parse $T_i$ of $s$ with maximal conditional
probability $P(T_i \mid s)$ is called a preferred parse of $s$.
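The last step amounts to normalizing the probabilities of the parses of $s$ and selecting the largest one. A minimal sketch, assuming that the values $P(T_i)$ have already been computed; the function name `preferred_parse` and the numbers are hypothetical.

```python
def preferred_parse(parse_probabilities):
    """Return (best, conditional), where conditional maps each parse T_i of s
    to P(T_i | s) = P(T_i) / sum_j P(T_j) and best has maximal P(T_i | s)."""
    total = sum(parse_probabilities.values())
    if total <= 0:
        raise ValueError("no parse with positive probability")
    conditional = {parse: p / total for parse, p in parse_probabilities.items()}
    best = max(conditional, key=conditional.get)
    return best, conditional

# Hypothetical values of P(T_i) for three parses of one input string.
best, conditional = preferred_parse({"T1": 0.002, "T2": 0.006, "T3": 0.002})
```

Since the denominator is the same for all parses of $s$, the preferred parse is also the parse with maximal $P(T_i)$; the normalization only matters when the conditional probabilities themselves are needed.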
Implementation 
Several different implementations of DOP are possible.
In [Scholtes 1992] a neural net implementation of DOP
is proposed. Here we will show that conventional
rule-based parsing strategies can be applied to DOP, by
converting constructions into rules. A construction can
be seen as a production rule, where the lefthand-side of
the rule is constituted by the root of the construction
and the righthand-side is constituted by the leaves of the
construction. The only extra condition is that of every
such rule its corresponding construction should be
remembered in order to generate a parse-tree for the input
string (by composing the constructions that correspond
to the rules that are applied). For a construction $t$, the
corresponding production rule is given by
$$\mathit{root}(t) \to \mathit{leaves}(t)$$
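The conversion of constructions into rules can be sketched as a table from rules to the constructions they stand for. The nested-pair tree representation, the function name `rules_from_constructions` and the three example constructions are assumptions of this sketch.

```python
def node(label, *children):
    """Build a tree (label, (t1, ..., tn))."""
    return (label, tuple(children))

def root(tree):
    return tree[0]

def leaves(tree):
    label, children = tree
    if not children:
        return (label,)
    return tuple(leaf for child in children for leaf in leaves(child))

def rules_from_constructions(constructions):
    """Map each rule (root(t), leaves(t)) to the list of constructions t
    that correspond to it, so that a parse tree can be rebuilt later."""
    table = {}
    for c in constructions:
        table.setdefault((root(c), leaves(c)), []).append(c)
    return table

# Hypothetical constructions; c1 and c2 differ in structure but share a rule.
c1 = node("VP", node("V", node("opened")), node("NP"))
c2 = node("VP", node("opened"), node("NP"))
c3 = node("S", node("NP"), node("VP"))
table = rules_from_constructions([c1, c2, c3])
```

In this example the rule VP → opened NP corresponds to both `c1` and `c2`, which shows why the rule alone is not enough: the constructions behind it have to be kept in order to compose the parse tree.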
In order to calculate the preferred parse of an input string
by maximizing the conditional probability, all parses
with all possible tuples of constructions must be
generated, which becomes highly inefficient. Often we
are not interested in all parses of an ambiguous input
string, neither in their exact probabilities, but only in
which parse is the preferred parse. Thus we would like
to have a strategy that estimates the top of the
probability hierarchy of parses. This can be achieved by
using Monte Carlo techniques (see e.g. [Hammersley
1964]): we estimate the preferred parse by taking random
samples from the space of possibilities. This will give
us a more effective approach than exhaustively
calculating the probabilities.
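The text does not specify the sampling procedure, so the following is only one possible reading, under strong simplifying assumptions: the tuples of constructions are listed explicitly with hypothetical weights and are treated as mutually exclusive alternatives, which the model above does not assume. A practical sampler would draw tuples without enumerating them. The function name and the weights are illustrative.

```python
import random
from collections import Counter

def estimate_preferred_parse(derivations, n_samples, rng):
    """derivations: list of (parse, weight) pairs, one pair per tuple of
    constructions; the weights are hypothetical unnormalized probabilities.
    Draw n_samples tuples at random in proportion to their weight and
    return the parse that was generated most often."""
    parses = [parse for parse, _weight in derivations]
    weights = [weight for _parse, weight in derivations]
    drawn = rng.choices(parses, weights=weights, k=n_samples)
    return Counter(drawn).most_common(1)[0][0]

# Parse T1 is generated by three tuples, parse T2 by a single one.
derivations = [("T1", 0.30), ("T1", 0.30), ("T1", 0.30), ("T2", 0.10)]
best = estimate_preferred_parse(derivations, 2000, random.Random(0))
```

The estimate only ranks the parses by how often they are sampled; the exact probabilities are never computed, which is the point of the approach described above.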
Discussion 
Although DOP has not yet been tested thoroughly², we
can already predict some of its capabilities. In DOP, the
probability of a parse depends on all tuples of
constructions that generate that parse. The more different
ways in which a parse can be generated, the higher its
probability. This implies that a parse which can (also)
be generated by relatively large constructions is favoured
over a parse which can only be generated by relatively
small constructions. This means that prepositional
phrase attachments and figures of speech can be
processed adequately by DOP.
As to the problem of language acquisition, this
might seem problematic for DOP: with an already
analyzed corpus, only adult language behaviour can be
simulated. The problem of language acquisition is in
our perspective the problem of the acquisition of an
initial corpus, in which non-linguistic input and
pragmatics should play an important role.
An additional remark should be devoted here to
formal grammars and disambiguation. Much work has
been done to extend rule-based grammars with
selectional restrictions such that the explosion of
ambiguities is constrained considerably. However, to
represent semantic and pragmatic constraints is a very
expensive task. No one has ever succeeded in doing so
except in relatively small grammars. Furthermore, a
basic question remains as to whether it is possible to
formally encode all of the syntactic, semantic and
pragmatic information needed for disambiguation. In
DOP, the additional information that one can draw from
a corpus of hand-marked structural annotations is that
one can by-pass the necessity for modelling world
knowledge, since this will automatically enter into the
disambiguation of structures by hand. Extracting
constructions from these structures, and combining them
in the most probable way, taking into account all
possible statistical dependencies between them,
preserves this world knowledge in the best possible
way.
In conclusion, it may be interesting to note that
our idea of using past language experiences instead of
rules has much in common with Stich's ideas about
language ([Stich 1971]). In Stich's view, judgements of
grammaticality are not determined by applying a
precompiled set of grammar rules, but rather have the
character of a perceptual judgement on the question to
what extent the judged sentence 'looks like' the
sentences the language user has in his head as examples
of grammaticality. The concrete language experiences of
the past of a language user determine how a new
utterance is processed; there is no evidence for the
assumption that past language experiences are
generalized into a consistent theory that defines the
grammaticality and the structure of new utterances
univocally.
² Corpora that will be used to test DOP include the Tosca
Corpus, built at the University of Nijmegen, and possibly the
Penn Treebank, built at the University of Pennsylvania.
References 
[Bahl 1983]: Bahl, L., Jelinek, F. and Mercer, R., 'A
Maximum Likelihood Approach to Continuous Speech
Recognition', in: IEEE Transactions on Pattern Analysis
and Machine Intelligence, Vol. PAMI-5, No. 2.
[Fillmore 1988]: Fillmore, C., Kay, P. and O'Connor,
M., 'Regularity and idiomaticity in grammatical
constructions: the case of let alone', Language 64, p.
501-538.
[Hammersley 1964]: Hammersley, J.M. and
Handscomb, D.C., Monte Carlo Methods, Chapman
and Hall, London.
[Harris 1966]: Harris, B., Theory of Probability,
Addison-Wesley, Reading (Mass).
[Jelinek 1990]: Jelinek, F., Lafferty, J.D. and Mercer,
R.L., Basic Methods of Probabilistic Context Free
Grammars, Yorktown Heights: IBM RC 16374
(#72684).
[Magerman 1991]: Magerman, D. and Marcus, M.,
'Pearl: A Probabilistic Chart Parser', in: Proceedings of
the European Chapter of the ACL'91, Berlin.
[Martin 1979]: Martin, W.A., Preliminary analysis of a
breadth-first parsing algorithm: Theoretical and
experimental results (Technical Report No. TR-261).
MIT LCS.
[Salomaa 1969]: Salomaa, A., 'Probabilistic and
weighted grammars', in: Information and Control 15, p.
529-544.
[Scha 1990]: Scha, R., 'Language Theory and Language
Technology; Competence and Performance' (in Dutch),
in: Q.A.M. de Kort & G.L.J. Leerdam (eds.),
Computertoepassingen in de Neerlandistiek, Almere:
Landelijke Vereniging van Neerlandici. (LVVN-jaarboek)
[Scholtes 1992]: Scholtes, J.C. and Bloembergen, S.,
'The Design of a Neural Data-Oriented Parsing (DOP)
System', Proceedings of the International Joint
Conference on Neural Networks 1992, Baltimore.
[Stich 1971]: Stich, S.P., 'What every speaker knows',
in: Philosophical Review 80, p. 476-496.
