A DOP Model for Semantic Interpretation* 
Remko Bonnema, Rens Bod and Remko Scha 
Institute for Logic, Language and Computation 
University of Amsterdam 
Spuistraat 134, 1012 VB Amsterdam 
Bonnema~mars. let. uva. nl 
Rens. Bod@let. uva. nl 
Remko. Scha@let. uva. nl 
Abstract 
In data-oriented language processing, an 
annotated language corpus is used as a 
stochastic grammar. The most probable 
analysis of a new sentence is constructed 
by combining fragments from the corpus in 
the most probable way. This approach has 
been successfully used for syntactic anal- 
ysis, using corpora with syntactic annota- 
tions such as the Penn Tree-bank. If a cor- 
pus with semantically annotated sentences 
is used, the same approach can also gen- 
erate the most probable semantic interpre- 
tation of an input sentence. The present 
paper explains this semantic interpretation 
method. A data-oriented semantic inter- 
pretation algorithm was tested on two se- 
mantically annotated corpora: the English 
ATIS corpus and the Dutch OVIS corpus. 
Experiments show an increase in seman- 
tic accuracy if larger corpus-fragments are 
taken into consideration. 
1 Introduction 
Data-oriented models of language processing em- 
body the assumption that human language per- 
ception and production works with representations 
of concrete past language experiences, rather than 
with abstract grammar rules. Such models therefore 
maintain large corpora of linguistic representations 
of previously occurring utterances. When processing 
a new input utterance, analyses of this utterance are 
constructed by combining fragments from the cor- 
pus; the occurrence-frequencies of the fragments are 
used to estimate which analysis is the most probable 
one. 
* This work was partially supported by NWO, the 
Netherlands Organization for Scientific Research (Prior- 
ity Programme Language and Speech Technology). 
For the syntactic dimension of language, vari- 
ous instantiations of this data-oriented processing 
or "DOP" approach have been worked out (e.g. 
Bod (1992-1995); Charniak (1996); Tugwell (1995); 
Sima'an et al. (1994); Sima'an (1994; 1996a); 
Goodman (1996); Rajman (1995ab); Kaplan (1996); 
Sekine and Grishman (1995)). A method for ex- 
tending it to the semantic domain was first intro- 
duced by van den Berg et al. (1994). In the present 
paper we discuss a computationally effective version 
of that method, and an implemented system that 
uses it. We first summarize the first fully instanti- 
ated DOP model as presented in Bod (1992-1993). 
Then we show how this method can be straightfor- 
wardly extended into a semantic analysis method, if 
corpora are created in which the trees are enriched 
with semantic annotations. Finally, we discuss an 
implementation and report on experiments with two 
semantically analyzed corpora (ATIS and OVIS). 
2 Data-Oriented Syntactic Analysis 
So far, the data-oriented processing method has 
mainly been applied to corpora with simple syntac- 
tic annotations, consisting of labelled trees. Let us 
illustrate this with a very simple imaginary example. 
Suppose that a corpus consists of only two trees: 
S S 
NP VP NP VP 
Det N sings Det N whistles I I I I 
every woman a man 
Figure 1: Imaginary corpus of two trees 
We employ one operation for combining subtrees, 
called composition, indicated as o; this operation 
identifies the leftmost nonterminal leaf node of one 
tree with the root node of a second tree (i.e., the 
second tree is substituted on the leftmost nontermi- 
159 
nal leaf node of the first tree). A new input sentence 
like "A woman whistles" can now be parsed by com- 
bining subtrees from this corpus. For instance: 
S N S 
NP VP wom~ NP VP 
Det N whistles Det N whistles 
t I I 
a a woman 
Figure 2: Derivation and parse for "A woman 
whistles" 
Other derivations may yield the same parse tree; 
for instance I : 
S NP Det VP I o I 
NP VP Det N a whistles I 
wonlaN 
S 
NP VP 
Det N whistles 
I I 
a woman 
Figure 3: Different derivation generating the same 
parse for ".4 woman whistles" 
or 
S Det N S 
NP VP a woman NP VP 
Det N whistles Det N whistles 
I I 
a woman 
Figure 4: Another derivation generating the same 
parse for "A woman whistles" 
Thus, a parse tree can have many derivations in- 
volving different corpus-subtrees. DOP estimates 
the probability of substituting a subtree t on a spe- 
cific node as the probability of selecting t among all 
subtrees in the corpus that could be substituted on 
that node. This probability is equal to the number of 
occurrences of a subtree t, divided by the total num- 
ber of occurrences of subtrees t' with the same root 
node label as t : P(t) = Itl/~t':root(e)=roo~(t) It'l" 
The probability of a derivation tl o ... o tn can be 
computed as the product of the probabilities of the 
subtrees this derivation consists of: P( tl o... o t,~) = 
rL P(ti). The probability of a parse tree is equal to 
1Here t o u o v o w should be read as ((t o u) o v) o w. 
the probability that any of its distinct derivations is 
generated, which is the sum of the probabilities of all 
derivations of that parse tree. Let t~ be the i-th sub- 
tree in the derivation d that yields tree T, then the 
probability of T is given by: P(T) = ~d 1-Ii P(tid). 
The DOP method differs from other statisti- 
cal approaches, such as Pereira and Schabes (1992), 
Black et al. (1993) and Briscoe (1994), in that it 
does not predefine or train a formal grammar; in- 
stead it takes subtrees directly from annotated sen- 
tences in a treebank with a probability propor- 
tional to the number of occurrences of these sub- 
trees in the treebank. Bod (1993b) shows that 
DOP can be implemented using context-free pars- 
ing techniques. To select the most probable parse, 
Bod (1993a) gives a Monte Carlo approximation al- 
gorithm. Sima'an (1995) gives an efficient polyno- 
mial algorithm for a sub-optimal solution. 
The model was tested on the Air Travel In- 
formation System (ATIS) corpus as analyzed in 
the Penn Treebank (Marcus et al. (1993)), achiev- 
ing better test results than other stochastic 
grammars (cf. Bod (1996), Sima'an (1996a), 
Goodman (1996)). On Penn's Wall Street Jour- 
nal corpus, the data-oriented processing approach 
has been tested by Sekine and Grishman (1995) and 
by Charniak (1996). Though Charniak only uses 
corpus-subtrees smaller than depth 2 (which in our 
experience constitutes a less-than-optimal version 
of the data-oriented processing method), he reports 
that it "outperforms all other non-word-based sta- 
tistical parsers/grammars on this corpus". For an 
overview of data-oriented language processing, we 
refer to (Bod and Scha, 1996). 
3 Data-Oriented Semantic Analysis 
To use the DOP method not just for syntactic anal- 
ysis, but also for semantic interpretation, four steps 
must be taken: 
1. decide on a formalism for representing the 
meanings of sentences and surface-constituents. 
2. annotate the corpus-sentences and their 
surface-constituents with such semantic repre- 
sentations. 
3. establish a method for deriving the mean- 
ing representations associated with arbitrary 
corpus-subtrees and with compositions of such 
subtrees. 
4. reconsider the probability calculations. 
We now discuss these four steps. 
3.1 Semantic formalism 
The decision about the representational formalism 
is to some extent arbitrary, as long as it has a well- 
160 
S :Vx(woman(x)-*sing(x)) S'.qx(man(x)Awhistle(x)) 
NP: ~.YVx(woman(×)-*Y(x)) VP:sing NP: ;~.Y"'lx(man(x)AY(x)) VP:whistle 
Det:kX~Y~(X(x)-*Y(x)) N:woman sings Det:XXXY3x(X(×)AY(x)) N:man whistles I I I 
every woman a man 
Figure 5: Imaginary corpus of two trees with syntactic and semantic labels. 
S:dl(d2) 
NP:d 1 (d2) VP:sing NP:dl(d2) VP:whistle 
.i L Det:kXkY~(X(x)---~Y(x)) N:woman stags Det: ~,X~.Y3×(X(x)^Y(x)) N:man whi ties I I 
every woman a man 
Figure 6: Same imaginary corpus of two trees with syntactic and semantic labels using the daughter notation. 
defined model-theory and is rich enough for repre- 
senting the meanings of sentences and constituents 
that are relevant for the intended application do- 
main. For our exposition in this paper we will 
use a wellknown standard formalism: extensional 
type theory (see Gamut (1991)), i.e., a higher-order 
logical language that combines lambda-abstraction 
with connectives and quantifiers. The first imple- 
mented system for data-oriented semantic interpre- 
tation, presented in Bonnema (1996), used a differ- 
ent logical language, however. And in many appli- 
cation contexts it probably makes sense to use an 
A.I.-style language which highlights domain struc- 
ture (frames, slots, and fillers), while limiting the 
use of quantification and negation (see section 5). 
3.2 Semantic annotation 
We assume a corpus that is already syntactically 
annotated as before: with labelled trees that indi- 
cate surface constituent structure. Now the basic 
idea, taken from van den Berg et al. (1994), is to 
augment this syntactic annotation with a semantic 
one: to every meaningful syntactic node, we add a 
type-logical formula that expresses the meaning of 
the corresponding surface-constituent. H we would 
carry out this idea in a completely direct way, the 
toy corpus of Figure 1 might, for instance, turn into 
the toy corpus of Figure 5. 
Van den Berg et al. indicate how a corpus of this 
sort may be used for data-oriented semantic inter- 
pretation. Their algorithm, however, requires a pro- 
cedure which can inspect the semantic formula of a 
node and determine the contribution of the seman- 
tics of a lower node, in order to be able to "fac- 
tor out" that contribution. The details of this pro- 
cedure have not been specified. However, van den 
Berg et ai. also propose a simpler annotation con- 
vention which avoids the need for this procedure, 
and which is computationally more effective: an an- 
notation convention which indicates explicitly how 
the semantic formula for a node is built up on the 
basis of the semantic formulas of its daughter nodes. 
Using this convention, the semantic annotation of 
the corpus trees is indicated as follows: 
• For every meaningful lexical node a type logical 
formula is specified that represents its meaning. 
• For every meaningful non-lexical node a for- 
mula schema is specified which indicates how 
its meaning representation may be put together 
out of the formulas assigned to its daughter 
nodes. 
In the examples below, these schemata use the vari- 
able dl to indicate the meaning of the leftmost 
daughter constituent, d2 to indicate the meaning 
of the second daughter constituent, etc. Using this 
notation, the semantically annotated version of the 
toy corpus of Figure 1 is the toy corpus rendered in 
Figure 6. This kind of semantic annotation is what 
will be used in the construction of the corpora de- 
scribed in section 5 of this paper. It may be noted 
that the rather oblique description of the semantics 
of the higher nodes in the tree would easily lead to 
mistakes, if annotation would be carried out com- 
pletely manually. An annotation tool that makes 
the expanded versions of the formulas visible for the 
annotator is obviously called for. Such a tool was 
developed by Bonnema (1996), it will be briefly de- 
scribed in section 5. 
161 
This annotation convention obviously, assumes 
that the meaning representation of a surface- 
constituent can in fact always be composed out of 
the meaning representations of its subconstituents. 
This assumption is not unproblematic. To maintain 
it in the face of phenomena such as non-standard 
quantifier scope or discontinuous constituents cre- 
ates complications in the syntactic or semantic anal- 
yses assigned to certain sentences and their con- 
stituents. It is therefore not clear yet whether 
our current treatment ought to be viewed as com- 
pletely general, or whether a treatment in the vein 
of van den Berg et al. (1994) should be worked out. 
3.3 The meanings of subtrees and their 
compositions 
As in the purely syntactic version of DOP, we now 
want to compute the probability of a (semantic) 
analysis by considering the most probable way in 
which it can be generated by combining subtrees 
from the corpus. We can do this in virtually the 
same way. The only novelty is a slight modification 
in the process by which a corpus tree is decomposed 
into subtrees, and a corresponding modification in 
the composition operation which combines subtrees. 
If we extract a subtree out of a tree, we replace the 
semantics of the new leaf node with a unification 
variable of the same type. Correspondingly, when 
the composition operation substitutes a subtree at 
this node, this unification variable is unified with 
the semantic formula on the substituting tree. (It 
is required that the semantic type of this formula 
matches the semantic type of the unification vari- 
able.) 
A simple example will make this clear. First, let 
us consider what subtrees the corpus makes avail- 
able now. As an example, Figure 7 shows one of the 
decompositions of the annotated corpus sentence "A 
man whistles". We see that by decomposing the tree 
into two subtrees, the semantics at the breakpoint- 
node N: man is replaced by a variable. Now an 
analysis for the sentence "A woman whistles" can, 
for instance, be generated in the way shown in Fig- 
ure 8. 
3.4 The Statistical Model of Data-Oriented 
Semantic Interpretation 
We now define the probability of an interpretation 
of an input string. 
Given a partially annotated corpus as defined 
above, the multiset of corpus subtrees consists of 
all subtrees with a well-defined top-node seman- 
tics, that are generated by applying to the trees of 
the corpus the decomposition mechanism described 
NP:dl(d2) VP:whistle 
w !t,os Det: kX~.Y~x(X(x)^Y(x)) N:man I I 
a mail 
VP:whistle 
Det: ~,XkY3x(X(x)AY(x)) N:U whistles I 
a 
N:man 
man 
Figure 7: Decomposing a tree into subtrees with uni- 
fication variables. 
N:woman 
o L 
NP:dl(d2) VP:whistle woman 
Det: kXkY--Jx(X(x) AY(x)) N:U whistles 
a 
NP:d I (d2) VP:whistle 
Dec kXkY3x(X(x)^Y(x)) N:woman whistles I 
a woman 
Figure 8: Generating an analysis for "A woman 
whistles". 
above. The probability of substituting a subtree t on 
a specific node is the probability of selecting t among 
all subtrees in the multiset that could be substituted 
on that node. This probability is equal to the num- 
ber of occurrences of a subtree t, divided by the total 
number of occurrences of subtrees t' with the same 
root node label as t: 
N 
P(t) = Et':root(t')=root(t) Irl (1) 
A derivation of a string is a tuple of subtrees, such 
that their composition results in a tree whose yield is 
the string. The probability of a derivation tl o... o tn 
is the product of the probabilities of these subtrees: 
P(tl o... o tn) = II P(td (2) 
i 
A tree resulting from a derivation of a string is called 
a parse of this string. The probability of a parse is 
162 
the probability that any of its derivations occurs; 
this is the sum of the probabilities of all its deriva- 
tions. Let rid be the i-th subtree in the derivation d 
that yields tree T, then the probability of T is given 
by: 
P(T) = E H P(t,d) (3) 
d i 
An interpretation of a string is a formula which is 
provably equivalent to the semantic annotation of 
the top node of a parse of this string. The proba- 
bility of an interpretation I of a string is the sum of 
the probabilities of the parses of this string with a 
top node annotated with a formula that is provably 
equivalent to I. Let ti4p be the i-th subtree in the 
derivation d that yields parse p with interpretation 
I, then the probability of I is given by: 
P(I) = E E H P(t,d,) (4) 
p d i 
We choose the most probable interpretation/.of a 
string s as the most appropriate interpretation of s. 
In Bonnema (1996) a semantic extension of the 
DOP parser of Sima'an (1996a) is given. But in- 
stead of computing the most likely interpretation 
of a string, it computes the interpretation of the 
most likely combination of semantically annotated 
subtrees. As was shown in Sima'an (1996b), the 
most likely interpretation of a string cannot be com- 
puted in deterministic polynomial time. It is not yet 
known how often the most likely interpretation and 
the interpretation of the most likely combination of 
semantically enriched subtrees do actually coincide. 
4 Implementations 
The first implementation of a semantic DOP-model 
yielded rather encouraging preliminary results on a 
semantically enriched part of the ATIS-corpus. Im- 
plementation details and experimental results can 
be found in Bonnema (1996), and Bod et al. (1996). 
We repeat the most important observations: 
* Data-oriented semantic interpretation seems to 
be robust; of the sentences that could be parsed, 
a significantly higher percentage received a cor- 
rect semantic interpretation (88%), than an ex- 
actly correct syntactic analysis (62%). 
* The coverage of the parser was rather low 
(72%), because of the sheer number of differ- 
ent semantic types and constructs in the trees. 
• The parser was fast: on the average six times 
as fast as a parser trained on syntax alone. 
The current implementation is again an extension 
of Sima'an (1996a), by Bonnema 2. In our experi- 
ments, we notice a robustness and speed-up compa- 
rable to our experience with the previous implemen- 
tation. Besides that, we observe higher accuracy, 
and higher coverage, due to a new method of orga- 
nizing the information in the tree-bank before it is 
used for building the actual parser. 
A semantically enriched tree-bank will generally 
contain a wealth of detail. This makes it hard for 
a probabilistic model to estimate all parameters. In 
sections 4.1 and 4.2, we discuss a way of generalizing 
over semantic information in the tree-bank, be\]ore a 
DOP-parser is trained on the material. We automat- 
ically learn a simpler, less redundant representation 
of the same information. The method is employed 
in our current implementation. 
4.1 Simplifying the tree-bank 
A tree-bank annotated in the manner described 
above, consists of tree-structures with syntactic and 
semantic attributes at every node. The semantic 
attributes are rules that indicate how the meaning- 
representation of the expression dominated by that 
node is built-up out of its parts. Every instance of 
a semantic rule at a node has a semantic type asso- 
ciated with it. These types usually depend on the 
lexical instantiations of a syntactic-semantic struc- 
ture. 
If we decide to view subtrees as identical iff their 
syntactic structure, the semantic rule at each node, 
and the semantic type of each node is identical, 
any fine-grained type-system will cause a huge in- 
crease in different instantiations of subtrees. In the 
two tree-banks we tested on, there are many sub- 
trees that differ in semantic type, hut otherwise 
share the same syntactic/semantic structure. Disre- 
garding the semantic types completely, on the other 
hand, will cause syntactic constraints to govern both 
syntactic substitution and semantic unification. The 
semantic types of constituents often give rise to dif- 
ferences in semantic structure. If this type informa- 
tion is not available during parsing, important clues 
will be missing, and loss of accuracy will result. 
Apparently, we do need some of the information 
present in the types of semantic expressions. Ignor- 
ing semantic types will result in loss of accuracy, but 
distinguishing all different semantic types will result 
in loss of coverage and generalizing power. With 
these observations in mind, we decided to group the 
types, and relax the constraints on semantic unifi- 
cation. In this approach, every semantic expression, 
2With thanks to Khalil Sima'an for fruitful discus- 
sions, and for the use of his parser 
163 
and every variable, has a set of types associated with 
it. In our semantic DOP model, we modify the con- 
straints on semantic unification as follows: A vari- 
able can be unified with an expression, if the inter- 
section of their respective sets of types is not empty. 
The semantic types are classified into sets that 
can be distinguished on the basis of their behavior 
in the tree-bank. We let the tree-bank data decide 
which types can be grouped together, and which 
types should be distinguished. This way we can 
generalize over semantic types, and exploit relevant 
type-information in the parsing process at the same 
time. In learning the optimal grouping of types, we 
have two concerns: keeping the number of different 
sets of types to a minimum, and increasing the se- 
mantic determinacy of syntactic structures enhanced 
with type-information. We say that a subtree T, 
with type-information at every node, is semantically 
determinate, iff we can determine a unique, correct 
semantic rule for every CFG rule R 3 occurring in T. 
Semantic determinacy is very attractive from a com- 
putational point of view: if our processed tree-bank 
has semantic determinacy, we do not need to involve 
the semantic rules in the parsing process. Instead, 
the parser yields parses containing information re- 
garding syntax and semantic types, and the actual 
semantic rules can be determined on the basis of 
that information. In the next section we will elabo- 
rate on how we learn the grouping of semantic types 
from the data. 
4.2 Classification of semantic types 
The algorithm presented in this section proceeds by 
grouping semantic types occurring with the same 
syntactic label into mutually exclusive sets, and as- 
signing to every syntactic label an index that indi- 
cates to which set of types its corresponding seman- 
tic type belongs. It is an iterative, greedy algorithm. 
In every iteration a tuple, consisting of a syntactic 
category and a set of types, is selected. Distinguish- 
ing this tuple in the tree bank, leads to the great- 
est increase in semantic determinacy that could be 
found. Iteration continues until the increase in se- 
mantic determinacy is below a certain threshold. 
Before giving the algorithm, we need some defini- 
tions: 
3By "CFG rule", we mean a subtree of depth 1, with- 
out a specified root-node semantics, but with the features 
relevant for substitution, i.e. syntactic category and se- 
mantic type. Since the subtree of depth 1 is the smallest 
structural building block of our DOP model, semantic 
determinacy of every CFG rule in a subtree, means the 
whole subtree is semantically determinate. 
tuplesO 
tuples(T) is the set of all pairs (c, s) in a tree- 
bank T, where c is a syntactic category, and s is 
the set of all semantic types that a constituent 
of category c in T can have. 
apply() 
if c is a category, s is a set of types, and T is a 
tree-bank 
then apply((c, s), T) yields a tree-bank T', by 
indexing each instance of category c in T, such 
that the c constituent is of semantic type t E s, 
with a unique index i. 
ambO 
if T is a tree-bank 
then arab(T) yields an n E N, such that n is the 
sum of the frequencies of all CFG rules R that 
occur in T with more than one corresponding 
semantic rule. 
The algorithm starts with a tree-bank To; in To, 
the cardinality of tuples(To) equals the number of 
different syntactic categories in To. 
1. Ti=o 
repeat 
2. 
3. 
4. 
until 
5. Ti-1 
D((c, s)) = amb(T/)- 
amb( apply( (c, s), Ti) ) 
= {(c,s')13(c, s) 
tuples(T~)& s' E 21sl) 
7-/= argmax D(r') 
r'ET; 
i:=i+1 
Ti := apply(ri, Ti-1) 
D(T~-I) <-- 5 
(5) 
21sl is the powerset of s. In the implementation, 
a limit can be set to the cardinality of s' E 21sl, to 
avoid excessively long processing time. Obviously, 
the iteration will always end, if we require 5 to be 
> 0. When the algorithm finishes, TO,... , Ti--1 con- 
tain the category/set-of-types pairs that took the 
largest steps towards semantic determinacy, and are 
therefore distinguished in the tree-bank. The se- 
mantic types not occurring in any of these pairs are 
grouped together, and treated as equivalent. 
Note that the algorithm cannot be guaranteed to 
achieve full semantic determinacy. The degree of se- 
mantic determinacy reached, depends on the consis- 
tency of annotation, annotation errors, the granular- 
ity of the type system, peculiarities of the language, 
in short: on the nature of the tree-bank. To force 
semantic determinacy, we assign a unique index to 
those rare instances of categories, i.e, left hand sides 
164 
PER 
USer I 
ik V 
wants 
I wi! 
ADV MP # today 
I I niet vandaag 
S 
dl.d2 
VP 
dl.d2 
MP 
MP MP dl.d2 
MP CON MP P NP ! tomorrow destinatlon.place 
I I 
maar morgen naar NP NP 
town.almere suffix.buiten 
I I 
almere buiten 
Figure 9: A tree from the OVIS tree-bank 
of CFG-rules, that do not have any distinguishing 
features to account for their differing semantic rule. 
Now the resulting tree-bank embodies a function 
from CFG rules to semantic rules. We store this 
function in a table, and strip all semantic rules from 
the trees. As the experimental results in the next 
section show, using a tree-bank obtained in this way 
for data oriented semantic interpretation, results in 
high coverage, and good probability estimations. 
5 Experiments on the OVIS 
tree-bank 
The NWO 4 Priority Programme "Language and 
Speech Technology" is a five year research pro- 
gramme aiming at the development of advanced 
telephone-based information systems. Within this 
programme, the OVIS 5 tree-bank is created. Using 
a pilot version of the OVIS system, a large number 
of human-machine dialogs were collected and tran- 
scribed. Currently, 10.000 user utterances have re- 
ceived a full syntactic and semantic analysis. Re- 
grettably, the tree-bank is not available (yet) to the 
public. More information on the tree-bank can be 
found on http : ~~grid. let. rug. nZ : 4321/. The se- 
mantic domain of all dialogs, is the Dutch railways 
schedule. The user utterances are mostly answers 
to questions, like: "From where to where do you 
want to travel?", "At what time do you want to 
arrive in Amsterdam?", "Could you please repeat 
your destination?". The annotation method is ro- 
bust and flexible, as we are dealing with real, spo- 
ken data, containing a lot of clearly ungrammatical 
utterances. For the annotation task, the annotation 
4Netherlands Organization for Scientific Research 
5Public Transport Information System 
workbench SEMTAGS is used. It is a graphical inter- 
face, written by Bonnema, offering all functionality 
needed for examining, evaluating, and editing syn- 
tactic and semantic analyses. SEMTAGS is mainly 
used for correcting the output of the DOP-parser. 
It incrementally builds a probabilistic model of cor- 
rected annotations, allowing it to quickly suggest al- 
ternative semantic analyses to the annotator. It took 
approximately 600 hours to annotate these 10.000 
utterances (supervision included). 
Syntactic annotation of the tree-bank is conven- 
tional. There are 40 different syntactic categories in 
the OVIS tree-bank, that appear to cover the syn- 
tactic domain quite well. No grammar is used to 
determine the correct annotation; there is a small 
set of guidelines, that has the degree of detail nec- 
essary to avoid an "anything goes"-attitude in the 
annotator, but leaves room for his/her perception of 
the structure of an utterance. There is no concep- 
tual division in the tree-bank between POS-tags and 
nonterminal categories. 
Figure 9 shows an example tree from the tree- 
bank. It is an analysis of the Dutch sentence: "Ik(I) 
wil( want ) niet( not ) vandaag( today) maar( but ) mor- 
gen(tomorrow) naar(to) Almere Buiten(Almere 
Buiten)". The analysis uses the formula schemata 
discussed in section 3.2, but here the interpreta- 
tions of daughter-nodes are so-called "update" ex- 
pressions, conforming to a frame structure, that 
are combined into an update of an information 
state. The complete interpretation of this utterance 
is: user.wants.((\[#today\];\[itomorrow\]);destination.- place.(town.almere;suffix.buiten)). 
The semantic for- 
malism employed in the tree-bank is the topic of the 
next section. 
5.1 The Semantic formalism 
The semantic formalism used in the OVIS 
tree-bank, is a frame semantics, defined in 
Veldhuijzen van Zanten (1996). In this section, we 
give a very short impression. The well-formedness 
and validity of an expression is decided on the ba- 
sis of a type-lattice, called a frame structure. The 
interpretation of an utterance, is an update of an 
information state. An information state is a repre- 
sentation of objects and the relations between them, 
that complies to the frame structure. For OVIS, the 
various objects are related to concepts in the train 
travel domain. In updating an information state, 
the notion of a slot-value assignment is used. Every 
object can be a slot or a value. The slot-value assign- 
ments are defined in a way that corresponds closely 
to the linguistic notion of a ground-focus structure. 
The slot is part of the common ground, the value 
165 
Interpretation: Exact Match 
95 % 
85 , , , 
1 2 3 4 5 
Max. subtree depth 
Figure 10: Size of training set: 8500 
Sem./Synt. Analysis: Exact Match 
90 % 87.69 88.21 
85.64 -- 88.66 
85 83.08-- 
80" , I I I I 1 2 3 4 5 
Max. subtree depth 
Figure 11: Size of training set: 8500 
is new information. Added to the semantic formal- 
ism are pragmatic operators, corresponding to de- 
nial, confirmation , correction and assertion 6 that 
indicate the relation between the value in its scope, 
and the information state. 
An update expression is a set of paths through the 
frame structure, enhanced with pragmatic operators 
that have scope over a certain part of a path. For 
the semantic DOP model, the semantic type of an 
expression ¢ is a pair of types (tz,t2). Given the 
type-lattice "/-of the frame structure, tl is the lowest 
upper bound in T of the paths in ¢, and t2 is the 
greatest lower bound in Tof the paths in ¢. 
5.2 Experimental results 
We performed a number of experiments, using a ran- 
dom division of the tree-bank data into test- and 
training-set. No provisions were taken for unknown 
words. The results reported here, are obtained by 
randomly selecting 300 trees from the tree-bank. All 
utterances of length greater than one in this selection 
are used as testing material. We varied the size of 
the training-set, and the maximal depth of the sub- 
trees. The average length of the test-sentences was 
4.74 words. There was a constraint on the extrac- 
tion of subtrees from the training-set trees: subtrees 
could have a maximum of two substitution-sites, and 
no more than three contiguous lexical nodes (Expe- 
rience has shown that such limitations improve prob- 
6In the example in figure 9, the pragmatic opera- 
tors #, denial, and !, correction, axe used 
Interpretation: Exact Match 
90.76 92.31 /0 88.21~ 
90 87.18 
71.27 
70" , ' , 1000 2500 40'00 5500 7000 85'00 
Tralningset size 
Figure 12: Max. depth of subtrees = 4 
Sem./Synt. Analysis: Exact Match 
90 % 87.18 88.21 
80 79 49 
68 711 
70 i , , 
1000 2500 40'00 5500 7000 8500 
Tralningset size 
Figure 13: Max. depth of subtrees = 4 
ability estimations, while retaining the full power of 
DOP). Figures 10 and 11 show results using a train- 
ing set size of 8500 trees. The maximal depth of sub- 
trees involved in the parsing process was varied from 
1 to 5. Results in figure 11 concern a match with 
the total analysis in the test-set, whereas Figure 10 
shows success on just the resulting interpretation. 
Only exact matches with the trees and interpreta- 
tions in the test-set were counted as successes. The 
experiments show that involving larger fragments in 
the parsing process leads to higher accuracy. Appar- 
ently, for this domain fragments of depth 5 are too 
large, and deteriorate probability estimations 7. The 
results also confirm our earlier findings, that seman- 
tic parsing is robust. Quite a few analysis trees that 
did not exactly match with their counterparts in the 
test-set, yielded a semantic interpretation that did 
match. Finally, figures 12 and 13 show results for 
differing training-set sizes, using subtrees of maxi- 
mal depth 4. 

References 
M. van den Berg, R. Bod, and R. Scha. 1994. 
A Corpus-Based Approach to Semantic Interpre- 
tation. In Proceedings Ninth Amsterdam Collo- 
quium. ILLC,University of Amsterdam. 
E. Black, R. Garside, and G. Leech. 1993. 
Statistically-Driven Computer Grammars of En- 
glish: The IBM/Lancaster Approach. Rodopi, 
Amsterdam-Atlanta. 
R. Bod. 1992. A computational model of language 
performance: Data Oriented Parsing. In Proceed- 
ings COLING'92, Nantes. 
R. Bod. 1993a. Monte Carlo Parsing. In Proceedings 
Third International Workshop on Parsing Tech- 
nologies, Tilburg/Durbuy. 
R. Bod. 1993b. Using an Annotated Corpus as a 
Stochastic Grammar. In Proceedings EACL'93, 
Utrecht. 
R. Bod. 1995. Enriching Linguistics with 
Statistics: Performance models of Natural 
Language. Phd-thesis, ILLC-dissertation 
series 1995-14, University of Amsterdam. 
ftp ://ftp. fwi. uva. nl/pub/theory/illc/- 
dissert at ions/DS-95-14, text. ps. gz 
R Bod. 1996. Two Questions about Data-Oriented 
Parsing. In Proceedings Fourth Workshop on Very 
Large Corpora, Copenhagen, Denmark. (cmp- 
lg/9606022). 
R. Bod, R. Bonnema, and R. Scha. 1996. A data- 
oriented approach to semantic interpretation. In 
Proceedings Workshop on Corpus-Oriented Se- 
mantic Analysis, ECAI-96, Budapest, Hungary. 
(cmp-lg/9606024). 
R. Bod and R. Scha. 1996. Data-oriented lan- 
guage processing, an overview. Technical Re- 
port LP-96-13, Institute for Logic, Language and 
Computation, University of Amsterdam. (cmp- 
lg/9611003). 
R. Bonnema. 1996. Data oriented se- 
mantics. Master's thesis, Department of 
Computational Linguistics, University of Am- 
sterdam, http ://mars. let. uva. nl/remko_b/- 
dopsem/script ie. html 
T. Briscoe. 1994. Prospects for practical parsing of 
unrestricted text: Robust statistical parsing tech- 
niques. In N. Oostdijk and P de Haan, editors, 
Corpus-based Research into Language. Rodopi, 
Amsterdam. 
E. Charniak. 1996. Tree-bank grammars. In Pro- 
ceedings AAAI'96, Portland, Oregon. 
L. Gamut. 1991. Logic, Language and Meaning. 
Chicago University Press. 
J Goodman. 1996. Efficient Algorithms for Parsing 
the DOP Model. In Proceedings Empirical Meth- 
ods in Natural Language Processing, Philadelphia, 
Pennsylvania. 
R. Kaplan. 1996. A probabilistic approach 
to Lexical-Functional Grammar. Keynote pa- 
per held at the LFG-workshop 1996, Greno- 
ble, France. ftp ://ftp.parc. xerox, com/pub/- 
nl/slides/grenoble96/kaplan-dopt alk .ps. 
M. Marcus, B. Santorini, and M. Marcinkiewicz. 
1993. Building a Large Annotated Corpus of En- 
glish: The Penn Treebank. Computational Lin- 
guistics, 19(2). 
F. Pereira and Y. Schabes. 1992. Inside-outside 
reestimation from partially bracketed corpora. In 
Proceedings of the 30th Annual Meeting of the 
ACL, Newark, De. 
M. Rajman. 1995a. Apports d'une approche a base 
de corpus aux techniques de traitement automa- 
tique du langage naturel. Ph.D. thesis, Ecole Na- 
tionale Superieure des Telecommunications, Paris. 
M. Rajman. 1995b. Approche probabiliste de 
l'analyse syntaxique. Traitement Automatique des 
Langues, 36:1-2. 
S. Sekine and R. Grishman. 1995. A corpus- 
based probabilistic grammar with only two 
non-terminals. In Proceedings Fourth Interna- 
tional Workshop on Parsing Technologies, Prague, 
Czech Republic. 
K. Sima'an, R. Bod, S. Krauwer, and R. Scha. 1994. 
Efficient Disambiguation by means of Stochastic 
Tree Substitution Grammars. In Proceedings In- 
ternational Conference on New Methods in Lan- 
guage Processing. CCL, UMIST, Manchester. 
K. Sima'an. 1995. An optimized algorithm for Data 
Oriented Parsing. In Proceedings International 
Conference on Recent Advances in Natural Lan- 
guage Processing. Tzigov Chark, Bulgaria. 
K. Sima'an. 1996a. An optimized algorithm for 
Data Oriented Parsing. In R. Mitkov and N. Ni- 
colov, editors, Recent Advances in Natural Lan- 
guage Processing 1995, volume 136 of Current Is- 
sues in Linguistic Theory. John Benjamins, Ams- 
terdam. 
K. Sima'an. 1996b. Computational Complexity of 
Probabilistic Disambiguation by means of Tree- 
Grammars. In Proceedings COLING'96, Copen- 
hagen, Denmark. 
D. Tugwell. 1995. A state-transition grammar for 
data-oriented parsing. In Proceedings European 
Chapter of the ACL'95, Dublin, Ireland. 
G. Veldhuijzen van Zanten. 1996. Seman- 
tics of update expressions. NWO priority 
Programme Language and Speech Technology, 
http ://grid. let. rug. nl : 4321/. 
