A query tool for syntactically annotated corpora* 
Laura Kallmeyer 
UFRL, Université Paris 7
2, place Jussieu 
75251 Paris cedex 05 
laura.kallmeyer@linguist.jussieu.fr
Abstract
This paper presents a query tool for syntactically
annotated corpora. The query tool was developed to search
the Verbmobil treebanks annotated at the University of
Tübingen. In principle, however, it can also be adapted to
other corpora such as the Negra corpus, the Penn Treebank
or the French treebank developed in Paris. The tool uses a
query language that allows searching for tokens, syntactic
categories, grammatical functions and binary relations of
(immediate) dominance and linear precedence between nodes.
The overall idea is to extract the relevant information
from the corpus in an initializing phase and store it in a
relational database. An incoming query is then translated
into a corresponding SQL query that is evaluated on the
database.
1 Introduction 
1.1 Syntactic annotation and 
linguistic research 
With the increasing availability of large amounts of
electronic texts, linguists have access to more and more
material for empirically based linguistic research.
Furthermore, electronic corpora are more and more richly
annotated, and thereby more and more detailed and
structured information contained in the corpora becomes
accessible. Currently many corpora are tagged with
morphosyntactic categories (part-of-speech) and there are
already several syntactically annotated corpora. Examples
are the Penn Treebank (Marcus et al., 1994; Bies et al.,
1995) annotated at the University of Pennsylvania, the
Negra corpus (Brants et al., 1999) developed in
Saarbrücken, the Verbmobil treebanks (Hinrichs et al.,
2000) annotated in Tübingen and the French treebank
annotated in Paris (Abeillé and Clément, 1999).

*The work presented here was done as part of a project in
SFB 441 "Linguistic Data Structures" at the University of
Tübingen.
However, in order to have access to these 
rich linguistic annotations, adequate query 
tools are needed. 
In the following, an example of a linguisti- 
cally relevant construction is considered that 
illustrates how useful access to structural in- 
formation in a corpus might be. 
(1) Über Chomsky habe ich ein Buch gelesen
    about Chomsky have I a book read
    'I have read a book about Chomsky'
Linguists are often concerned with con- 
structions that seem not very natural and 
where intuitions about grammaticality fail. 
An example is (1) where we have an accusative 
object (ein Buch) which is positioned between 
the two verbal elements and whose modifier 
(the prepositional phrase über Chomsky) is
topicalized.
Some people claim (1) to be ungrammatical 
whereas other people are inclined to accept it. 
In these cases it is very useful to search in an 
adequate corpus for more natural data show- 
ing the same construction (see also (Meurers, 
1999) for other examples of the use of corpora
for linguistic research). 
In order to find structures like (1) in a Ger- 
man corpus, one needs to search for 
(a) a prepositional phrase modifying the ac- 
cusative object and preceding the finite 
verb (i.e. in the so-called vorfeld), and 
(b) an accusative object between the finite verb
and the non-finite verb forms (i.e. in the
so-called mittelfeld)
Obviously, two things need to be available 
in order to enable such a search. On the one 
hand, one needs a corpus with an annotation 
that is rich enough to encode the properties 
(a) and (b). On the other hand, one needs a 
query tool with a query language that allows
one to express the properties (a) and (b).
Corpora encoding features such as (a) and 
(b) are for example the Verbmobil treebanks. 
1.2 Current query tools 
Query tools such as xkwic (Christ, 1994), which allow
searching the tokens and their part-of-speech categories
using regular expressions, do not allow a direct search on
the syntactic annotation structures of the corpus, i.e. a
search for specific relations between nodes in the
annotation such as dominance or linear precedence.
Therefore many queries a linguist would like to ask using a
syntactically annotated corpus either cannot be expressed
in a regular-expression-based language or at least cannot
be expressed in an intuitive way.
Even more recent query languages such as SgmlQL (Le Maitre
et al., 1998) and XML-QL (Deutsch et al., 1999), which
refer to the SGML or XML annotation of a corpus, are in
general not adequate for syntactically annotated corpora.
If the annotations are trees and the nesting of SGML/XML
elements encodes the tree structure, such query languages
work nicely. With regular path expressions as supported by
XML-QL, it is possible to search
not only for parent but also for dominance 
relations. However, in order to deal with 
discontinuous constituents, most syntactically 
annotated corpora do not contain trees but 
slightly different data structures. The Penn 
Treebank for example consists of trees with 
an additional coindexation relation, Negra al- 
lows crossing branches and in Verbmobil, an 
element (a tree-like structure) in the corpus 
might contain completely disconnected nodes. 
In order to express these annotations in XML, 
one has to encode for example each node and 
each edge as a single element as in (Mengel 
and Lezius, 2000). But then a query for a 
dominance relation can no longer be formu- 
lated with a regular path expression. 
In this paper, I propose a query tool that allows
searching for parent, dominance and linear precedence
relations even in corpora annotated with structures
slightly different from trees.
2 The Verbmobil treebanks 
The German Verbmobil corpus (Stegmann et al., 1998;
Hinrichs et al., 2000) is a treebank annotated at the
University of Tübingen that contains approx. 38,000 trees
(or rather tree-like annotation structures since, as
already mentioned, the structures are not always trees).
The corpus consists of spoken texts restricted to the
domain of arrangement of business appointments.

[Figure 1: Annotation of (1) in Verbmobil format (tree
diagram not reproduced). The diagram shows a SIMPX node
over the fields VF, LK, MF and VC: VF contains the PX
"über C." (APPR, NE), LK the VXFIN "habe" (VAFIN), MF the
NX "ich" (PPER) and the NX "ein Buch" (ART, NN), and VC
the VXINF "gelesen" (VVPP).]
The Verbmobil corpus is part-of-speech 
tagged using the Stuttgart-Tübingen tagset
(STTS) described in (Schiller et al., 1995). 
One of the design decisions in Verbmobil was that, for the
purpose of reusability of the treebank, the annotation
scheme should not reflect a commitment to a particular
syntactic theory. Therefore a surface-oriented annotation
scheme was adopted that is inspired by the notion of
topological fields in the sense of (Höhle, 1985). The
discontinuous positioning of the verbal elements in
verb-first and verb-second sentences (as in (1) for
example) is the traditional reason to structure the German
sentence by means of topological fields: the verbal
elements have the categories LK (linke Klammer) and VC
(verbal complex), and roughly everything preceding the LK
forms the 'vorfeld' VF, everything between LK and VC forms
the 'mittelfeld' MF and the 'nachfeld' NF follows the
verbal complex.
The Verbmobil corpus is annotated with 
syntactic categories as node labels, grammat- 
ical functions as edge labels and dependency 
relations. The syntactic categories are based 
on traditional phrase structure and on the the- 
ory of topological fields. In contrast to Negra 
or Penn Treebank, there are neither crossing 
branches nor empty categories. Instead, dependency
relations are expressed within the grammatical functions
(e.g. OA-MOD for a constituent modifying the accusative
object).
A sample annotation conformant to the 
Verbmobil annotation scheme is the annota- 
tion of (1) shown in Fig. 1. (The elements set 
in boxes are edge labels.) 
In order to search for structures as in Fig. 1, one needs
to search for trees containing a node n1 with label PX and
grammatical function OA-MOD, a node n2 with label VF that
dominates n1, a node n3 with label MF and a node n4 with
label NX and grammatical function OA that is immediately
dominated by n3.
Evaluating a query for structures as in 
Fig. 1 on the Verbmobil corpus gives results 
such as (2) that sound much more natural 
than the constructed example (1). 
(2) tja, über Flugverbindungen habe ich leider keine Information.
    about flight connections have I unfortunately no information
    'unfortunately I have no information about flight connections.'
This example illustrates the usefulness of syntactic
annotations for linguistic research, and it shows the need
for query languages and query tools that allow access to
these annotations.
3 The query language
3.1 Syntax 
As query language for the German Verbmobil corpus, a
first-order logic without quantification is chosen, where
variables are interpreted as existentially quantified.
Negation is only allowed for atomic formulas. It seems that
even this very simple logic already gives a high degree of
expressive power with respect to the queries linguists are
interested in (see for example (Kallmeyer, 2000) for
theoretical investigations of query languages). However, it
might be that at a later stage the query language will be
extended.
Let C (the node labels, i.e. syntactic categories and
part-of-speech categories), E (the edge labels, i.e.
grammatical functions) and T (the terminals, i.e. tokens)
be pairwise disjoint finite sets. >, >>, .. are constants
for the binary relations immediate dominance (parent
relation), dominance (reflexive transitive closure of
immediate dominance) and linear precedence. The set
$\mathbb{N}$ of natural numbers is used as variables.
Further, &, |, ! are logical connectives (conjunction,
disjunction, and negation).
Definition 1 ((C, E, T)-queries)
(C, E, T)-queries are inductively defined:
(a) for all $i \in \mathbb{N}$, $t \in T$:
    token(i)=t and token(i)!=t are queries,
(b) for all $i \in \mathbb{N}$, $c \in C$:
    cat(i)=c and cat(i)!=c are queries,
(c) for all $i \in \mathbb{N}$, $e \in E$:
    fct(i)=e and fct(i)!=e are queries,
(d) for all $i, j \in \mathbb{N}$:
    i > j and i !> j are queries,
    i >> j and i !>> j are queries,
    i .. j and i !.. j are queries,
(e) for all queries q1, q2:
    q1 & q2 and (q1 | q2) are queries.
Of course, when adapting this language to another corpus,
depending on the specific annotation scheme, other unary or
binary predicates might be added to the query language.
This does not change the complexity of the query language
in general.
However, it is also possible that at a later point negation
needs to be allowed in a general way or that quantification
needs to be added to the query language for linguistic
reasons. Such modifications would affect the complexity of
the language and the performance of the tool. Therefore the
decision was taken to keep the language as simple as
possible in the beginning.
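
As a rough illustration of Definition 1, the conjunctive fragment of the query language can be read into a small data structure. The following Python sketch is not part of the tool described in this paper (which is implemented in Java); the names Atom, parse_atom and parse_conjunction are hypothetical, disjunction is left out, and labels are assumed not to contain whitespace or the character &.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Atom(NamedTuple):
    kind: str               # "token", "cat", "fct", ">", ">>" or ".."
    negated: bool           # True for the "!" variants of Definition 1
    args: Tuple[int, ...]   # one variable for unary atoms, two for binary ones
    value: Optional[str]    # the label for unary atoms, None for binary ones


# Unary atoms such as cat(1)=PX or fct(1)!=OA-MOD
UNARY = re.compile(r"^(token|cat|fct)\((\d+)\)(!?)=(.+)$")
# Binary atoms such as 2>>1, 3>4 or 2!..3
BINARY = re.compile(r"^(\d+)(!?)(>>|>|\.\.)(\d+)$")


def parse_atom(text: str) -> Atom:
    """Parse one atomic query (cases (a) to (d) of Definition 1)."""
    s = "".join(text.split())  # drop all whitespace
    m = UNARY.match(s)
    if m:
        return Atom(m.group(1), m.group(3) == "!", (int(m.group(2)),), m.group(4))
    m = BINARY.match(s)
    if m:
        return Atom(m.group(3), m.group(2) == "!",
                    (int(m.group(1)), int(m.group(4))), None)
    raise ValueError(f"not an atomic query: {text!r}")


def parse_conjunction(query: str):
    """Parse a conjunction of atomic queries joined by '&' (no disjunction)."""
    return [parse_atom(part) for part in query.split("&")]


atoms = parse_conjunction("cat(1)=PX & fct(1)!=OA-MOD & 2>>1 & 2!..3")
```

Keeping negation as a flag on the atom mirrors the restriction that negation is only allowed for atomic formulas.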
3.2 Intended models 
In the case of the German Verbmobil corpus, 
the data structures are not trees, since struc- 
tures as in Fig. 2, which shows the annotation 
of the long-distance wh-movement in (3), can 
occur. The structure in Fig. 2 does not have 
a unique root node, and the two nodes with 
label SIMPX have neither a dominance nor a
linear precedence relation. 
(3) wen glaubst du liebt Maria 
whom believe you loves Maria 
'whom do you believe Maria loves' 
Therefore, the models of our queries are de- 
fined as more general structures than finite 
trees. 
A model is a tuple $(\mathcal{U}, \mathcal{P}, \mathcal{D}, \mathcal{L}, \mu, \eta, \alpha)$
where $\mathcal{U}$ is the set of nodes, $\mathcal{P}$, $\mathcal{D}$ and
$\mathcal{L}$ are the binary relations immediate dominance
(parent), dominance and linear precedence, $\mu$ is a
function assigning syntactic categories or part-of-speech
tags to nodes, $\eta$ is a function mapping edges to
grammatical functions, and $\alpha$ assigns tokens to the
leaves (i.e. the nodes that do not dominate any other
node).

[Figure 2: Annotation of (3) in Verbmobil format (tree
diagram not reproduced). The diagram shows two SIMPX nodes
without a common root, covering the tokens wen (PWS),
glaubst (VVFIN), du (PPER), liebt (VVFIN) and Maria (NE).]
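Definition 2 (Query model)
Let C, E and T be disjoint alphabets.
$(\mathcal{U}, \mathcal{P}, \mathcal{D}, \mathcal{L}, \mu, \eta, \alpha)$ is a query model with
categories C, edge labels E and terminals T iff
1. $\mathcal{U}$ is a finite set with $\mathcal{U} \cap (C \cup E \cup T) = \emptyset$,
   the set of nodes.
2. $\mathcal{P}, \mathcal{L}, \mathcal{D} \subseteq \mathcal{U} \times \mathcal{U}$, such that:
   (a) $\mathcal{P}$ is irreflexive, and for all $x \in \mathcal{U}$
       there is at most one $v \in \mathcal{U}$ with $(v, x) \in \mathcal{P}$.
   (b) $\mathcal{D}$ is the reflexive transitive closure of
       $\mathcal{P}$, and $\mathcal{D}$ is antisymmetric.
   (c) $\mathcal{L}$ is transitive.
   (d) for all $x, y \in \mathcal{U}$: if $(x, y) \in \mathcal{L}$, then
       $(x, y) \notin \mathcal{D}$ and $(y, x) \notin \mathcal{D}$.
   (e) for all $x, y \in \mathcal{U}$: $(x, y) \in \mathcal{L}$ iff for
       all $z, w \in \mathcal{U}$ with $(x, z), (y, w) \in \mathcal{D}$,
       $(z, w) \in \mathcal{L}$ holds.
   (f) for all $x, y, z \in \mathcal{U}$: if $(x, y), (x, z) \in \mathcal{D}$,
       then either $(y, z) \in \mathcal{D}$ or $(z, y) \in \mathcal{D}$ or
       $(y, z) \in \mathcal{L}$ or $(z, y) \in \mathcal{L}$.
3. $\mu : \mathcal{U} \to C$ is a total function.
4. $\eta : \mathcal{P} \to E$ is a total function.
5. $\alpha : \{u \in \mathcal{U} \mid \text{there is no } u' \text{ with } (u, u') \in \mathcal{P}\} \to T$
   is a total function.
With (b), (c) and (d), $\mathcal{L}$ is also irreflexive
and antisymmetric.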
In contrast to finite trees, our query mod- 
els do not necessarily have a unique root 
node, i.e. a node that dominates all other 
nodes. Consequently, the so-called exhaus- 
tiveness property does not hold since two 
nodes in a query model might be completely 
disconnected. In other words, it does not hold
in general that $(x, y) \in \mathcal{D}$ or $(y, x) \in \mathcal{D}$ or
$(x, y) \in \mathcal{L}$ or $(y, x) \in \mathcal{L}$ for all $x, y \in \mathcal{U}$. This
holds only for nodes $x, y \in \mathcal{U}$ where a node $z$
exists that dominates $x$ and $y$.
3.3 Semantics 
Satisfiability of a query q by a query model 
M is defined in the classical model-theoretic 
way with respect to an injective assignment g 
mapping node variables to nodes in the query 
model. 
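Definition 3 (Satisfiability)
Let $M = (\mathcal{U}, \mathcal{P}, \mathcal{D}, \mathcal{L}, \mu, \eta, \alpha)$ be a query model
and let $g : \mathbb{N} \to \mathcal{U}$ be an injective function.
For all $i \in \mathbb{N}$, $t \in T$, $c \in C$, $e \in E$:
• $M \models_g$ token(i)=t iff $\alpha(g(i)) = t$.
• $M \models_g$ token(i)!=t iff $\alpha(g(i)) \neq t$.
• $M \models_g$ cat(i)=c iff $\mu(g(i)) = c$.
• $M \models_g$ cat(i)!=c iff $\mu(g(i)) \neq c$.
• $M \models_g$ fct(i)=e iff there is a $u \in \mathcal{U}$ with
  $(u, g(i)) \in \mathcal{P}$ and $\eta((u, g(i))) = e$.
• $M \models_g$ fct(i)!=e iff there is no $u \in \mathcal{U}$ with
  $(u, g(i)) \in \mathcal{P}$ and $\eta((u, g(i))) = e$.
For all $i, j \in \mathbb{N}$:
• $M \models_g$ i > j iff $(g(i), g(j)) \in \mathcal{P}$
  (i.e. $g(i)$ immediately dominates $g(j)$)
• $M \models_g$ i !> j iff $(g(i), g(j)) \notin \mathcal{P}$
• $M \models_g$ i >> j iff $(g(i), g(j)) \in \mathcal{D}$
  (i.e. $g(i)$ dominates $g(j)$)
• $M \models_g$ i !>> j iff $(g(i), g(j)) \notin \mathcal{D}$
• $M \models_g$ i .. j iff $(g(i), g(j)) \in \mathcal{L}$
  (i.e. $g(i)$ is left of $g(j)$)
• $M \models_g$ i !.. j iff $(g(i), g(j)) \notin \mathcal{L}$
For all queries q1, q2:
• $M \models_g$ q1 & q2 iff $M \models_g$ q1 and $M \models_g$ q2
• $M \models_g$ (q1 | q2) iff $M \models_g$ q1 or $M \models_g$ q2.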
Note that the condition that g needs to 
be injective means that different variables are 
considered to refer to different nodes. In this
respect, Def. 3 differs from traditional model- 
theoretic semantics. 
[Figure 3: The relational database schema (diagram not
reproduced). Tables and columns:
  node_pair_i (tree_id, node1, node2, cl_id)
  pair_class  (cl_id, n_id1, n_id2, p1, p2, d1, d2, l1, l2)
  node_class  (n_id, cat, fct)
  tokens_i    (tree_id, node, word)
Arrows in the original figure mark foreign keys.]
As an example, consider the query for struc- 
tures as in (1) that is shown in (4). The struc- 
ture in Fig. 1 is a query model satisfying (4). 
(4) cat(1)=PX & fct(1)=OA-MOD
    & cat(2)=VF & 2>>1
    & cat(3)=MF & cat(4)=NX
    & fct(4)=OA & 3>>4
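
The satisfiability relation of Def. 3 can be made concrete by brute force: try every injective assignment of nodes to the query variables and keep those under which all conjuncts hold. The Python sketch below is only an illustration and does not describe how the tool evaluates queries (the tool translates queries into SQL). The toy model loosely follows Fig. 1, restricted to internal nodes; the node numbers, the "-" edge labels and the token spans are assumptions made for the example, and token atoms are left out.

```python
from itertools import permutations

# Hand-built toy model loosely following Fig. 1 (internal nodes only).
CAT = {0: "SIMPX", 1: "VF", 2: "PX", 3: "LK", 4: "MF", 5: "NX", 6: "NX", 7: "VC"}
# child -> (mother, grammatical function on the edge); "-" is a placeholder
MOTHER = {1: (0, "-"), 2: (1, "OA-MOD"), 3: (0, "-"), 4: (0, "-"),
          5: (4, "-"), 6: (4, "OA"), 7: (0, "-")}
# first and last token position covered by each node (assumed)
SPAN = {0: (0, 6), 1: (0, 1), 2: (0, 1), 3: (2, 2), 4: (3, 5),
        5: (3, 3), 6: (4, 5), 7: (6, 6)}


def build_model(cat, mother, span):
    nodes = sorted(cat)
    parent = {(m, c) for c, (m, _) in mother.items()}
    dom = set()  # reflexive transitive closure of the parent relation
    for n in nodes:
        anc = n
        dom.add((n, n))
        while anc in mother:
            anc = mother[anc][0]
            dom.add((anc, n))
    # x is left of y if everything covered by x precedes everything covered by y
    left = {(x, y) for x in nodes for y in nodes if span[x][1] < span[y][0]}
    return {"nodes": nodes, "cat": cat, "mother": mother,
            ">": parent, ">>": dom, "..": left}


def holds(model, atom, g):
    """Check one atom (kind, variables, value, positive) under assignment g."""
    kind, args, value, positive = atom
    if kind == "cat":
        result = model["cat"][g[args[0]]] == value
    elif kind == "fct":
        edge = model["mother"].get(g[args[0]])
        result = edge is not None and edge[1] == value
    else:  # binary relation: ">", ">>" or ".."
        result = (g[args[0]], g[args[1]]) in model[kind]
    return result == positive


def satisfying_assignments(model, atoms):
    variables = sorted({v for _, args, _, _ in atoms for v in args})
    # permutations never repeat a node, so every assignment is injective
    for image in permutations(model["nodes"], len(variables)):
        g = dict(zip(variables, image))
        if all(holds(model, atom, g) for atom in atoms):
            yield g


MODEL = build_model(CAT, MOTHER, SPAN)
QUERY_4 = [("cat", (1,), "PX", True), ("fct", (1,), "OA-MOD", True),
           ("cat", (2,), "VF", True), (">>", (2, 1), None, True),
           ("cat", (3,), "MF", True), ("cat", (4,), "NX", True),
           ("fct", (4,), "OA", True), (">>", (3, 4), None, True)]
RESULT = list(satisfying_assignments(MODEL, QUERY_4))
```

On this toy model, query (4) has exactly one satisfying assignment: variable 1 is mapped to the PX, 2 to the VF, 3 to the MF and 4 to the NX with function OA. The search is exponential in the number of variables and is meant only to spell out the definition.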
4 Storing the corpus in a database 
As already mentioned, the general idea of the query tool is
to store the information one wants to search for in a
relational database and then to translate an expression in
the query language presented in the previous section into
an SQL expression that is evaluated on the database. The
first part is performed by an initializing component and
needs to be done only once per corpus, usually by the
corpus administrator. The second part, i.e. the querying of
the corpus, is performed by a query component.
The tool is implemented in Java with Java Database
Connectivity (JDBC) as interface and MySQL as database
management system.
4.1 The relational database schema 
The German Verbmobil corpus consists of several subcorpora.
In the relational database there are two global tables,
node_class and pair_class. Besides these, for each of the
subcorpora identified by i there are tables tokens_i and
node_pair_i. The database schema is shown in Fig. 3. The
arrows represent foreign keys. The column cl_id in the
table node_pair_i, for example, is a foreign key referring
to the column cl_id in the table pair_class. This means
that each entry for cl_id in node_pair_i uniquely refers to
one entry for cl_id in pair_class.
The content of the tables is as follows: 
• node_class contains node classes characterized by
  category (node label) and grammatical function (edge
  label between the node and its mother). Each node class
  has a unique identifier, namely the column n_id.
• pair_class contains classes of node pairs characterized
  by the two node classes and the parent, dominance and
  linear precedence relation between the two nodes. The
  columns p1, p2, d1, d2, l1 and l2 stand for binary
  relations and have values 1 or 0 depending on whether the
  relation holds or not. p1 signifies immediate dominance
  of the first node over the second, p2 immediate dominance
  of the second over the first, d1 dominance of the first
  over the second, etc. Each node pair class has a unique
  identifier, namely its cl_id.
• tokens_i contains all leaves from subcorpus i with their
  tokens (word).
• node_pair_i contains all node pairs from subcorpus i with
  their pair class. Of course, only pairs of nodes
  belonging to one single annotation structure are stored.
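
The table layout just described can be written down as DDL. The sketch below uses Python's built-in sqlite3 module as a stand-in for the MySQL database of the actual tool; the column types, the primary keys, the NOT NULL constraints and the helper name create_schema are assumptions, since only the table and column names are given here.

```python
import sqlite3

GLOBAL_TABLES = """
CREATE TABLE node_class (
    n_id INTEGER PRIMARY KEY,
    cat  TEXT NOT NULL,          -- category (node label)
    fct  TEXT NOT NULL           -- grammatical function, '-' if none
);
CREATE TABLE pair_class (
    cl_id INTEGER PRIMARY KEY,
    n_id1 INTEGER NOT NULL REFERENCES node_class(n_id),
    n_id2 INTEGER NOT NULL REFERENCES node_class(n_id),
    p1 INTEGER NOT NULL,         -- first node is the mother of the second
    p2 INTEGER NOT NULL,         -- second node is the mother of the first
    d1 INTEGER NOT NULL,         -- first node dominates the second
    d2 INTEGER NOT NULL,         -- second node dominates the first
    l1 INTEGER NOT NULL,         -- first node is left of the second
    l2 INTEGER NOT NULL          -- second node is left of the first
);
"""

PER_SUBCORPUS_TABLES = """
CREATE TABLE tokens_{i} (
    tree_id INTEGER NOT NULL,
    node    INTEGER NOT NULL,
    word    TEXT NOT NULL,
    PRIMARY KEY (tree_id, node)
);
CREATE TABLE node_pair_{i} (
    tree_id INTEGER NOT NULL,
    node1   INTEGER NOT NULL,
    node2   INTEGER NOT NULL,
    cl_id   INTEGER NOT NULL REFERENCES pair_class(cl_id),
    PRIMARY KEY (tree_id, node1, node2)
);
"""


def create_schema(conn, subcorpus_ids):
    """Create the two global tables and one table pair per subcorpus."""
    conn.executescript(GLOBAL_TABLES)
    for i in subcorpus_ids:
        conn.executescript(PER_SUBCORPUS_TABLES.format(i=int(i)))


conn = sqlite3.connect(":memory:")
create_schema(conn, [20])
tables = {row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type='table'")}
```

Storing the six relation flags once per pair class, rather than once per node pair, is what keeps the per-subcorpus tables narrow: each stored pair only carries a reference to its class.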
4.2 Initializing the database 
The storage of the corpus in the database is 
done by an initializing component. This com- 
ponent extracts information from the struc- 
tures in export format (the format used for the 
German Verbmobil corpus) and stores them 
in the database. The export format explicitly 
encodes tokens, categories and edge labels, 
linear precedence between leaves and the par- 
ent (immediate dominance) relation. Domi- 
nance and linear precedence in general how- 
ever need to be precompiled. 
First the dominance relation is computed 
simply as reflexive transitive closure of the 
parent relation. 
Linear precedence on the leaves can be immediately
extracted from the export format. When computing linear
precedence for internal nodes, the specific properties of
the data structures in Verbmobil (see Section 3) must be
taken into account. Unlike in finite trees, for two nodes
u1, u2, the fact that u1 dominates some x and u2 dominates
some y and x is left of y is not sufficient to decide that
u1 is left of u2. Instead (see axiom (e) in Def. 2) the
following holds for two nodes u1, u2: u1 is left of u2 iff
for all x, y dominated by u1, u2 respectively: x is left
of y.
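
The two precompilation steps can be sketched as follows. This is an illustrative Python version, not the Java code of the initializing component. It assumes that the structure is given as a child-to-mother map plus the order of the leaves, and it checks the condition of axiom (e) on the leaves only: u1 is taken to be left of u2 if the last leaf under u1 precedes the first leaf under u2. The example data encode sentence 24 of cd20 (Fig. 4), with nodes numbered in the order of the export lines.

```python
def precompile(mother, leaf_order):
    """Return (dominance, left_of) as sets of node pairs.

    mother     : dict child -> mother (immediate dominance)
    leaf_order : leaves in surface order
    """
    nodes = set(mother) | set(mother.values()) | set(leaf_order)

    # Dominance: reflexive transitive closure of the parent relation.
    dom = set()
    for n in nodes:
        anc = n
        dom.add((n, n))
        while anc in mother:
            anc = mother[anc]
            dom.add((anc, n))

    # Span of leaf positions covered by each node.
    position = {leaf: k for k, leaf in enumerate(leaf_order)}
    span = {}
    for u in nodes:
        covered = [position[x] for (a, x) in dom if a == u and x in position]
        if covered:
            span[u] = (min(covered), max(covered))

    # u1 is left of u2 iff every leaf under u1 precedes every leaf under u2.
    left = {(u1, u2) for u1 in span for u2 in span
            if span[u1][1] < span[u2][0]}
    return dom, left


# Sentence 24 of cd20: nodes 0-4 are the leaves, 5-10 the nodes #500-#505.
MOTHER_24 = {0: 5, 1: 6, 2: 7, 3: 8, 5: 10, 6: 10, 7: 9, 8: 9, 9: 10}
DOM_24, LEFT_24 = precompile(MOTHER_24, [0, 1, 2, 3, 4])
```

For nodes 9 and 10 this yields dominance of 10 over 9 and no linear precedence in either direction, in line with the pair class discussed for this sentence. Two nodes whose leaves are interleaved, such as the two SIMPX nodes in Fig. 2, receive no precedence relation at all.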
In general, the database schema itself does not reflect the
concrete properties of the query model. In particular, the
properties of the binary relations are not part of the
database schema; e.g. considering only the database, the
dominance and linear precedence relations are not
necessarily transitive. Therefore, the query tool can
easily be adapted to other data structures, for example to
feature structures with reentrancy as annotations. In this
case, a modification of the part of the initializing
component that computes the binary relations would be
sufficient.
As an example, consider how sentence 24 in the subcorpus
cd20 (identifier 20) is stored in the database. This
sentence was chosen for the simple reason that it is not
too long but contains enough nodes to provide a useful
example. Besides this, its construction and its tokens are
not of any interest here.
Fig. 4 shows the sentence in its export format, i.e. the
way it originally occurs in the corpus, together with a
picture of the corresponding structure. Parts of the tables
in the database concerning sentence 24 are shown in Fig. 5.
Each line in the export format corresponds to one node. The
nodes are assigned numbers 0, 1, ... in the order of the
lines in the export format. The nodes with tokens (i.e.
that are leaves) are inserted into the table tokens_20.
Furthermore, each pair of nodes occurring in sentence 24 is
inserted into the table node_pair_20 together with its pair
class. Both orders of a pair are stored.¹ The pair classes
and node classes belonging to a pair can be found in the
two global tables.

    #BOS 24 25 898511955 1
    scheinbar   ADV      --   HD   500
    nicht       PTKNEG   --   HD   501
    beides      PIS      --   HD   502
    zusammen    ADV      --   HD   503
                $.       --   --   0
    #500        ADVX     --   -    505
    #501        ADVX     --   -    505
    #502        NX       --   HD   504
    #503        ADVX     --   -    504
    #504        NX       --   HD   505
    #505        NX       --   --   0
    #EOS 24

    Corresponding structure (tree diagram not reproduced):
    the topmost NX dominates the ADVX over scheinbar, the
    ADVX over nicht and an NX (HD) that in turn dominates
    the NX over beides and the ADVX over zusammen.

Figure 4: Export format of sentence 24 in cd20 and
corresponding structure

    tokens_20
    tree_id  node  word
    24       0     scheinbar
    24       1     nicht
    24       2     beides
    24       3     zusammen
    24       4

    node_pair_20
    tree_id  node1  node2  cl_id
    24       0      1      1459
    24       0      2      2608
    24       0      3      120
    ...
    24       9      10     1327

    pair_class
    cl_id  n_id1  n_id2  p1  p2  d1  d2  l1  l2
    120    13     13     0   0   0   0   1   0
    1327   24     25     0   1   0   1   0   0
    ...

    node_class
    n_id  cat  fct
    ...
    13    ADV  HD
    ...
    24    NX   HD
    25    NX   -

Figure 5: Sentence 24 in the database

Consider for example the nodes 9 and 10 in sentence 24 (the
node labelled NX that dominates beides zusammen and the
topmost node with label NX). The cl_id of this pair is
1327. The corresponding entry in pair_class tells us that
the second node is the mother of the first, that the second
dominates the first, and that there is no linear precedence
relation between the two nodes. Furthermore, the node
classes identified by n_id1 and n_id2 are such that the
first node has label NX and grammatical function HD whereas
the second has label NX and no grammatical function.

¹ In a previous version just one order was stored, but it
turned out that for some queries this causes an exponential
time complexity depending on the number of variables
occurring in the query. This problem is avoided by storing
both orders of a node pair.
4.3 The size of the database 
So far, in order to test the tool, approximately 
one quarter of the German Verbmobil corpus 
is stored in the database, namely the following 
subcorpora: 
    id   subcorpus   trees   tokens   pairs
    15   cd15        1567    15474    1326416
    20   cd20        2151    21069    1941056
    21   cd21        2416    22360    1761082
    22   cd22        1723    16587    1229324
    24   cd24        2255    22763    2129548
The table pair_class has 23024 entries and node_class has
213 entries. The following table shows the current size of
the files:

    Table          data file (KB)   index file (KB)
    node_class     1                10
    pair_class     585              1500
    tokens_15      303              293
    tokens_20      413              403
    tokens_21      439              423
    tokens_22      325              311
    tokens_24      446              430
    node_pair_15   9067             72556
    node_pair_20   13269            106377
    node_pair_21   12039            96383
    node_pair_22   8404             67153
    node_pair_24   14557            116694
5 Searching the corpus 
In order to search the corpus, one needs of 
course to know the specific properties of the 
annotation scheme. These are described in the 
STTS guidelines (Schiller et al., 1995) and the 
Verbmobil stylebook (Stegmann et al., 1998) 
which must both be available to any user of the
query tool. 
Currently, the query component does not yet process all
possible expressions in the query language. In particular,
it does not allow disjunctions and it does not allow
querying for tokens. Other atomic queries combined with
negations and conjunctions are possible. In particular,
complex syntactic structures involving category and edge
labels and binary relations can be searched. The query
component will be completed very soon to process all
queries defined in Section 3.
The query component takes an expression in the query
language as input and translates this into a corresponding
SQL expression, which is then passed to the database.
As an example, consider again the query (4) repeated here
as (5):
(5) cat(1)=PX & fct(1)=OA-MOD &
    cat(2)=VF & 2>>1 & cat(3)=MF &
    cat(4)=NX & fct(4)=OA & 3>>4
For query (5) as input performed on the 
subcorpus cd20, the query component pro- 
duces the following SQL query: 
    SELECT DISTINCT np1.tree_id
    FROM
      node_class AS nc1, node_class AS nc2,
      node_class AS nc3, node_class AS nc4,
      node_pair_20 AS np1, pair_class AS pc1,
      node_pair_20 AS np2, pair_class AS pc2
    WHERE
      nc1.cat='PX' AND nc1.fct='OA-MOD'
      AND nc2.cat='VF' AND nc3.cat='MF'
      AND nc4.cat='NX' AND nc4.fct='OA'
      AND pc1.n_id1=nc2.n_id
      AND pc1.n_id2=nc1.n_id AND pc1.d1=1
      AND pc2.n_id1=nc3.n_id
      AND pc2.n_id2=nc4.n_id AND pc2.d1=1
      AND np1.cl_id=pc1.cl_id
      AND np1.tree_id=np2.tree_id
      AND np2.cl_id=pc2.cl_id;
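
The pattern visible in this SQL query can be reproduced mechanically. The following Python sketch is a reconstruction from the example output only and should not be read as the actual query component, which is written in Java and whose internals are not described here. The function name translate and the tuple encoding of atoms are hypothetical. It assumes a purely conjunctive query with at least one binary atom, passes labels as bound parameters instead of inlining them, and does not join variables that occur in no binary atom.

```python
RELATION_COLUMN = {">": "p", ">>": "d", "..": "l"}


def translate(unary, binary, subcorpus):
    """Build (sql, params) for a conjunctive query.

    unary  : list of (kind, var, label, positive), kind in {"cat", "fct"}
    binary : list of (rel, var1, var2, positive), rel in {">", ">>", ".."}
    """
    if not binary:
        raise ValueError("this sketch needs at least one binary atom")
    variables = sorted({v for _, v, _, _ in unary}
                       | {v for _, a, b, _ in binary for v in (a, b)})

    # Group binary atoms by node pair. An atom on (b, a) is folded into an
    # existing pair (a, b) by using the "...2" column instead of "...1".
    pairs = {}
    for rel, a, b, positive in binary:
        key, suffix = ((b, a), "2") if (b, a) in pairs else ((a, b), "1")
        pairs.setdefault(key, []).append(
            (RELATION_COLUMN[rel] + suffix, 1 if positive else 0))

    tables = [f"node_class AS nc{v}" for v in variables]
    where, params = [], []
    for kind, v, label, positive in unary:
        if kind not in ("cat", "fct"):
            raise ValueError(f"unsupported unary predicate: {kind}")
        where.append(f"nc{v}.{kind}{'=' if positive else '<>'}?")
        params.append(label)

    first_use = {}  # variable -> column of its first occurrence in a pair
    for k, ((a, b), conditions) in enumerate(pairs.items(), start=1):
        tables += [f"node_pair_{int(subcorpus)} AS np{k}",
                   f"pair_class AS pc{k}"]
        where += [f"pc{k}.n_id1=nc{a}.n_id", f"pc{k}.n_id2=nc{b}.n_id"]
        where += [f"pc{k}.{col}={val}" for col, val in conditions]
        where.append(f"np{k}.cl_id=pc{k}.cl_id")
        if k > 1:  # all pairs must come from the same annotation structure
            where.append(f"np{k - 1}.tree_id=np{k}.tree_id")
        for var, col in ((a, "node1"), (b, "node2")):
            if var in first_use:  # same variable, hence same node
                where.append(f"{first_use[var]}=np{k}.{col}")
            else:
                first_use[var] = f"np{k}.{col}"

    sql = ("SELECT DISTINCT np1.tree_id FROM " + ", ".join(tables)
           + " WHERE " + " AND ".join(where))
    return sql, params


UNARY_5 = [("cat", 1, "PX", True), ("fct", 1, "OA-MOD", True),
           ("cat", 2, "VF", True), ("cat", 3, "MF", True),
           ("cat", 4, "NX", True), ("fct", 4, "OA", True)]
BINARY_5 = [(">>", 2, 1, True), (">>", 3, 4, True)]
sql5, params5 = translate(UNARY_5, BINARY_5, 20)
```

For query (5) the generated conditions coincide with those shown above, up to the order of the conjuncts and the use of ? placeholders for the labels. Note that, like the query shown above, the sketch does not add conditions that force different variables onto different nodes.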
As a second example consider the search for 
long distance wh-movements as in (3). The 
annotation of (3) using the Verbmobil annota- 
tion scheme was shown in Fig. 2. Such struc- 
tures might be characterized by the following 
properties: there is an interrogative pronoun 
(part-of-speech tag PWS for substituting inter- 
rogative pronoun) that is part of a simplex 
clause and there is another simplex clause
containing a finite verb such that the two
simplex clauses are not connected and the
pronoun precedes the finite verb. This leads to
the query (6): 
(6) cat(1)=PWS & cat(2)=SIMPX & 2>>1
    & cat(3)=SIMPX & cat(4)=VVFIN
    & 2!>>3 & 3!>>2 & 2!..3 & 3!..2
    & 1..4 & 3>>4
Performed on cd20, (6) as input leads to the 
following SQL query: 
    SELECT DISTINCT np1.tree_id
    FROM
      node_class AS nc1, node_class AS nc2,
      node_class AS nc3, node_class AS nc4,
      node_pair_20 AS np1,
      pair_class AS pc1,
      node_pair_20 AS np2,
      pair_class AS pc2,
      node_pair_20 AS np3,
      pair_class AS pc3,
      node_pair_20 AS np4,
      pair_class AS pc4
    WHERE
      nc1.cat='PWS' AND nc2.cat='SIMPX'
      AND nc3.cat='SIMPX'
      AND nc4.cat='VVFIN'
      AND pc1.n_id1=nc2.n_id
      AND pc1.n_id2=nc1.n_id AND pc1.d1=1
      AND pc2.n_id1=nc2.n_id
      AND pc2.n_id2=nc3.n_id
      AND pc2.d1=0 AND pc2.d2=0
      AND pc2.l1=0 AND pc2.l2=0
      AND pc3.n_id1=nc1.n_id
      AND pc3.n_id2=nc4.n_id AND pc3.l1=1
      AND pc4.n_id1=nc3.n_id
      AND pc4.n_id2=nc4.n_id AND pc4.d1=1
      AND np1.cl_id=pc1.cl_id
      AND np1.node1=np2.node1
      AND np1.node2=np3.node1
      AND np1.tree_id=np2.tree_id
      AND np2.cl_id=pc2.cl_id
      AND np2.node2=np4.node1
      AND np2.tree_id=np3.tree_id
      AND np3.cl_id=pc3.cl_id
      AND np3.node2=np4.node2
      AND np3.tree_id=np4.tree_id
      AND np4.cl_id=pc4.cl_id;
Currently the database and the tool are running on a
Pentium II PC (400 MHz, 128 MB) under Linux. On this
machine, example (5) takes 1.46 sec to be answered by
MySQL, and example (6) takes 6.43 sec. This shows that
although the queries, in particular the last one, are quite
complex and involve many intermediate results, the system
answers them within a few seconds.
The performance of course depends cru- 
cially on the size of intermediate results. 
In cases where more than one node pair is 
searched for (as in the two examples above) 
the order of the pairs is important since the 
result set of the first pair restricts the second 
pair. In (5) for example, first a node pair with 
a PX with function OA-MOD dominated by a VF 
is searched for. Afterwards, the search for the 
NX with function OA in the MF is restricted to
those trees that were found when searching 
for the first pair. Obviously, the first pair is 
much more restrictive than the second. If the 
order is reversed, the query takes much more 
time to process. Currently the ordering of the 
pairs needs to be done by the user, i.e. depends 
on the incoming query. However, we plan to 
implement at least partly an ordering of the 
binary conjuncts in the query depending on 
the frequency of the syntactic categories and 
grammatical functions involved in the pairs. 
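
One possible shape of such an ordering step is sketched below in Python. It is only a guess at what the planned mechanism could look like: each node pair is given a crude size estimate, the product of the corpus frequencies of the node classes it involves, and the pairs are sorted so that the smallest estimate comes first. The function name order_pairs, the default frequency for unconstrained variables and all frequency values in the example are invented for illustration and are not measurements from the Verbmobil corpus.

```python
def order_pairs(pairs, constraints, frequency, default=10**6):
    """Sort node pairs so that the presumably most restrictive comes first.

    pairs       : list of (var1, var2)
    constraints : dict var -> (cat, fct), with fct None if unconstrained
    frequency   : dict (cat, fct) -> number of matching nodes in the corpus
    default     : assumed frequency for variables without a known count
    """
    def estimate(pair):
        size = 1
        for var in pair:
            size *= frequency.get(constraints.get(var), default)
        return size

    return sorted(pairs, key=estimate)


# Query (5): pairs (3, 4) and (2, 1), deliberately given in the slow order.
CONSTRAINTS = {1: ("PX", "OA-MOD"), 2: ("VF", None),
               3: ("MF", None), 4: ("NX", "OA")}
# Invented counts, for illustration only.
FREQUENCY = {("PX", "OA-MOD"): 40, ("VF", None): 2000,
             ("MF", None): 2100, ("NX", "OA"): 1500}
ORDERED = order_pairs([(3, 4), (2, 1)], CONSTRAINTS, FREQUENCY)
```

With these invented counts the pair (2, 1), i.e. the PX with function OA-MOD inside a VF, is moved to the front, which is the order described above as the faster one. Since sorted is stable, pairs with equal estimates keep the order given by the user.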
The obvious advantage of using a relational 
database to store the corpus is that some parts 
of the work are taken over by the database 
management system such as the search of the 
corpus. Furthermore, and this is crucial, the 
indexing functionalities of the database man- 
agement system can be used to increase the 
performance of the tool, e.g. indexes are put
on cl_id in node_pair_i and on n_id1 and n_id2
in pair_class.
6 Conclusion and future work 
In this paper, I have presented a query tool for
syntactically annotated corpora that was developed for the
German Verbmobil treebank annotated at the University of
Tübingen. The key idea is to extract, in an initializing
phase, the information one wants to search for from the
corpus and to store it in a relational database. The search
itself is done by translating an input query, an expression
in a simple quantifier-free first-order logic, into an SQL
query that is then passed to the database system.
An obvious advantage of this architecture is that a
considerable amount of work is taken over by the database
management system and therefore need not be implemented.
Furthermore, the MySQL indexing functionalities can be used
to directly affect the performance of the search.
The query tool is work in progress, and I briefly want to
point out some of the things that still need to be done.
First, the set of queries the tool can process needs to be
extended to all queries allowed in the query language. This
will be done very soon. Another task for the near future
is, as mentioned in the previous section, to add an
ordering mechanism on binary conjuncts in order to ensure
that the more restrictive node pairs are searched for
first. Further, the design of a graphical user interface to
enter the queries is planned, allowing users to specify
queries by drawing partial trees instead of typing in the
expressions in the query language. Finally, we also want to
implement a web-based user interface for the query tool.
Besides these tasks that all concern the current query tool
for the German Verbmobil corpus, a more general issue to
pursue in the future is to adapt the tool to other corpora.
In some cases, this implies a modification of the way
binary relations are precompiled, and in some other cases
this would even lead to a modification of the query
language and the database schema, namely in those cases
where other binary relations are needed, e.g. the
coindexation relation in the case of the Penn Treebank.
Acknowledgments 
For fruitful discussions I would like to thank 
Oliver Plaehn and Ilona Steiner. Further- 
more, I am grateful to three anonymous re- 
viewers for their valuable comments. 

References 
Anne Abeillé and Lionel Clément. 1999. A tagged reference
  corpus for French. In Proceedings of EACL-LINC, Bergen.
Ann Bies, Mark Ferguson, Karen Katz, and Robert MacIntyre.
  1995. Bracketing Guidelines for Treebank II Style Penn
  Treebank Project. University of Pennsylvania.
Thorsten Brants, Wojciech Skut, and Hans Uszkoreit. 1999.
  Syntactic Annotation of a German Newspaper Corpus. In
  Journées ATALA, 18-19 juin 1999, Corpus annotés pour la
  syntaxe, pages 69-76, Paris.
Oliver Christ. 1994. A modular and flexible architecture
  for an integrated corpus query system. In Proceedings of
  COMPLEX'94.
Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy,
  and Dan Suciu. 1999. A Query Language for XML. In
  Proceedings of the International World Wide Web
  Conference (WWW), volume 31, pages 1155-1169.
Erhard W. Hinrichs, Julia Bartels, Yasuhiro Kawata, Valia
  Kordoni, and Heike Telljohann. 2000. The VERBMOBIL
  Treebanks. In Proceedings of KONVENS 2000, October. To
  appear.
Tilman Höhle. 1985. Der Begriff 'Mittelfeld'. Anmerkungen
  über die Theorie der topologischen Felder. In A. Schöne,
  editor, Kontroversen alte und neue. Akten des 7.
  Internationalen Germanistenkongresses Göttingen, pages
  329-340.
Laura Kallmeyer. 2000. On the Complexity of Queries for
  Structurally Annotated Linguistic Data. In Proceedings of
  ACIDCA'2000, pages 105-110, March.
Jacques Le Maitre, Elisabeth Murisasco, and Monique
  Rolbert. 1998. From Annotated Corpora to Databases: the
  SgmlQL Language. In John Nerbonne, editor, Linguistic
  databases. CSLI.
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert
  MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and
  Britta Schasberger. 1994. The Penn Treebank: Annotating
  Predicate Argument Structure. In ARPA '94.
Andreas Mengel and Wolfgang Lezius. 2000. An XML-based
  encoding format for syntactically annotated corpora. In
  Proceedings of LREC 2000.
Detmar Meurers. 1999. Von partiellen Konstituenten,
  erstaunlichen Passiven und verwirrten Franken. Zur
  Verwendung von Korpora für die theoretische Linguistik.
  Handout at the DGfS Jahrestagung, February.
A. Schiller, S. Teufel, and C. Thielen. 1995. Guidelines
  für das Tagging deutscher Textcorpora mit STTS.
  Manuskript Universität Stuttgart und Universität
  Tübingen.
Rosemary Stegmann, Heike Schulz, and Erhard W. Hinrichs.
  1998. Stylebook for the German Treebank in VERBMOBIL.
  Eberhard-Karls Universität Tübingen.
