Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 555–562, Vancouver, October 2005. ©2005 Association for Computational Linguistics
Using MONA for Querying Linguistic Treebanks
Stephan Kepser∗
Collaborative Research Centre 441
University of Tübingen
Tübingen, Germany
kepser@sfs.uni-tuebingen.de
Abstract
MONA is an automata toolkit provid-
ing a compiler for compiling formulae of
monadic second order logic on strings or
trees into string automata or tree automata.
In this paper, we evaluate the option of
using MONA as a treebank query tool.
Unfortunately, we find that MONA is not
a viable option. There are several reasons,
the main one being unsustainable query answer
times. If the treebank contains larger
trees with more than 100 nodes, then even
the processing of simple queries may take
hours.
1 Introduction
In recent years large amounts of electronic texts have
become available providing a new base for empiri-
cal studies in linguistics and offering a chance to lin-
guists to compare their theories with large amounts
of utterances from “the real world”. While tagging
with morphosyntactic categories has become a stan-
dard for almost all corpora, more and more of them
are nowadays annotated with refined syntactic information.
Examples are the Penn Treebank (Marcus et al., 1993) for
American English annotated at the University of Pennsylvania,
the French treebank (Abeillé and Clément, 1999) developed in Paris,
the TIGER Corpus (Brants et al., 2002) for German annotated at
the Universities of Saarbrücken and Stuttgart, and the Tübingen
Treebanks (Hinrichs et al., 2000) for Japanese, German and English
from the University of Tübingen. To make these rich syntactic
annotations accessible for linguists, the development of powerful
query tools is an obvious need and has become an important task
in computational linguistics.
∗ This research was funded by a German Science Foundation grant (DFG SFB441-6).
Consequently, a number of treebank query tools
have been developed. Probably amongst the most
important ones are CorpusSearch (Randall, 2000),
ICECUP III (Wallis and Nelson, 2000), fsq (Kepser,
2003), TGrep2 (Rohde, 2001), and TIGERSearch
(König and Lezius, 2000). A common feature of
these tools is the relatively low expressive power
of their query languages. Explicit or implicit references
to nodes in a tree are mostly interpreted existentially.
The notable exception is fsq, which employs
full first-order logic as its query language.
The importance of the expressive power of the
query language is a consequence of the sizes of the
available treebanks, which can contain several tens of
thousands of trees. It is clearly impossible to browse
these treebanks manually in search of linguistic
phenomena. But a query tool that does not permit
the user to specify the sought linguistic phenomenon
quite precisely is not very helpful either. If the user
can only approximate the phenomenon he seeks, answer
sets will be very large, often containing several
hundred to a thousand trees. Weeding through answer
sets of this size is cumbersome and not really fruitful.
If the task is to gain small answer sets, then
query languages must be powerful. The reason why
the above mentioned query tools still offer query
languages of limited expressive power is the fear that
there may be a price to be paid for offering a pow-
erful query language, namely longer query answer
times due to more complex query evaluation algo-
rithms.
At least on a theoretical level, this fear is not
necessarily justified. As was recently shown by
Kepser (2004), there exists a powerful query language
with a query evaluation algorithm of low complexity.
The query language is monadic second-order logic
(MSO henceforth), an extension of first-order logic
that additionally allows for the quantification
over sets of tree nodes. The fact that makes
this language so appealing beyond its expressive
power is that the evaluation time of an MSO query
on a tree is only linear in the size of the tree. The
query evaluation algorithm proceeds in two steps. In
the first step, a query is compiled into an equivalent
tree automaton. In the second, the automaton is run
on each tree of the treebank. Since a run of an automaton
on a tree is linear in the size of the tree, the
evaluation of an MSO query is linear in the size of a
tree.
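The second step can be illustrated with a minimal sketch of a deterministic bottom-up tree automaton on binary trees. The class Node, the function run, and the toy transition function delta_contains_nx are illustrative names that do not come from MONA or from the system described here. The visit counter only makes visible that every node is processed exactly once, which is where the linear bound comes from.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def run(delta: Callable[[str, str, str], str],
        node: Optional[Node],
        visits: List[int]) -> str:
    """Return the state the automaton assigns to `node`, computed bottom-up."""
    if node is None:
        return "none"          # state standing for a missing daughter
    visits[0] += 1             # every node is processed exactly once
    left_state = run(delta, node.left, visits)
    right_state = run(delta, node.right, visits)
    return delta(node.label, left_state, right_state)


def delta_contains_nx(label: str, left_state: str, right_state: str) -> str:
    """Toy transition function: state 'yes' iff the subtree contains an NX node."""
    if label == "NX" or "yes" in (left_state, right_state):
        return "yes"
    return "no"


# A small hand-made binary tree (an assumption for illustration only).
tree = Node("SIMPX", Node("LK", None, Node("MF", Node("NX"))), Node("$PERIOD"))
visits = [0]
accepted = run(delta_contains_nx, tree, visits) == "yes"
print(accepted, visits[0])
```

The automaton accepts a tree when the state computed at the root is a final state, here "yes". An automaton compiled from an MSO query would have a far larger state set, but the run itself would have the same shape.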
There has sometimes been the question whether
the expressive power of MSO is really needed. Be-
yond the statements above about retrieving small an-
swer sets there is an important argument concern-
ing the expressive power of the grammars underly-
ing the annotation of the treebanks. A standard as-
sumption in the description of the syntax of natural
languages is that at least context-free string grammars
are required. On the level of trees, these correspond
to regular tree grammars (Gécseg and Steinby,
1997). It is natural to demand that the expressive
power of the query language matches the expressive
power of the underlying grammar. Otherwise there
can be linguistic phenomena annotated in the treebank
that a user cannot query directly. The
query language which exactly matches the expressive
power of regular tree grammars is MSO. In
other words, a set of trees is definable by a regular
tree grammar iff there is an MSO formula that
defines this set of trees (Gécseg and Steinby, 1997).
Hence MSO is a natural choice of query language
under the given assumption that the syntax of natu-
ral language is (at least) context-free on the string or
token level.
Since the use of MSO as a query language for
treebanks is, at least on a theoretical level, quite
appealing, it is worth trying to develop a query system
that puts these theoretical concepts into practice.
The largest and most demanding subpart of
this enterprise is the development of a tree automata
toolkit, a toolkit that compiles formulae into tree
automata and performs standard operations on tree
automata such as union, intersection, negation, and
determinisation. Since this task is very demanding, it
makes sense to investigate whether one could use
an existing tree automata toolkit before starting to
develop a new one. To the author’s knowledge, there
exists only one tree automata toolkit that is usable
off the shelf, and that is MONA (Klarlund, 1998). It is
the aim of this paper to evaluate the use of
MONA for querying linguistic treebanks.
2 The Tree Automata Toolkit MONA
Tree automata are generalisations of finite state
automata to trees. For a general introduction to tree
automata, we refer the reader to Gécseg and Steinby
(1997). There exists a strong connection between tree
automata and MSO. A set of trees is definable by
an MSO formula if and only if there exists a tree
automaton accepting this set. This equivalence is
constructive: there is an algorithm that constructs an
automaton from a given MSO formula.
MONA is an implementation of this relationship.
It is being developed by Nils Klarlund, Anders
Møller, and Michael Schwartzbach. Its intended
main uses are hardware and program verification.
MONA is actually an implementation of the compilation
of monadic second-order logic on strings and
trees into finite state automata or tree automata,
respectively. But we focus exclusively on the tree part
here. As we will see later, MONA was not developed
with linguistic applications in mind.
2.1 The Language of MONA
The language of MONA is pure monadic second-order
logic of two successors. We will only mention
the part of the language that is needed for describing
trees. There are first-order and second-order terms.
A first-order variable is a first-order term. The
constant root is a first-order term denoting the root node
of a tree. If t is a first-order term and s is a non-empty
sequence of 0’s and 1’s, then t.s is a first-order
term. 0 denotes the left daughter, and 1 the
right daughter of a node. A sequence of 0’s and
1’s denotes a path in the tree. The term root.011,
e.g., denotes the node that is reached from the root
by first going to the left daughter and then going
twice down to the right daughter. A set variable is
a second-order term. If t1, ..., tk are first-order terms,
then {t1, ..., tk} is a second-order term. We consider
the following formulae. Let t, t′ be first-order terms
and T, T′ be second-order terms. Atomic formulae
are
• t = t′ – Equality of nodes,
• T = T′ – Equality of node sets,
• t in T – t is a member of set T,
• empty(T) – Set T is empty.
Formulae are constructed from atomic formulae
using the boolean connectives and quantification.
Let φ and ψ be formulae. Then we define complex
formulae as
• ~φ – Negation of φ,
• φ & ψ – Conjunction of φ and ψ,
• φ | ψ – Disjunction of φ and ψ,
• φ => ψ – Implication of φ and ψ,
• ex1 x : φ – First-order existential quantification of x in φ,
• all1 x : φ – First-order universal quantification of x in φ,
• ex2 X : φ – Existential quantification of set X in φ,
• all2 X : φ – Universal quantification of set X in φ.
We note that there is no way to extend this language.
This has three important consequences.
Firstly, we are restricted to using binary trees only.
Secondly, we cannot accommodate linguistic labels
in a direct way. We have to find some coding.
Finally, and this is a significant drawback that may
exclude the use of MONA for many applications,
we cannot code the tokens, the word sequence at the
leaves of a tree. Hence we can query neither for
particular words nor for sequences of words. We can only
query the structure of a tree, including the labels.
2.2 The MONA Compiler
The main program of MONA is a compiler that com-
piles formulae in the above described language into
tree automata. The input is a file containing the formulae.
The output is an analysis of the automaton
that is constructed. In particular, it is stated whether
the formula is satisfiable at all, i.e., whether an
automaton can be constructed.
[Figure 1 (diagram): The original tree t is translated into a MONA formula for t, which is compiled into an automaton for t. The original query q is translated into a MONA formula for q and t, into which the automaton for t is imported; compiling it yields an automaton for q and t, or ∅.]
Figure 1: Method of using MONA for querying.
MONA does not provide a method to execute an
automaton on a tree. But if a formula can be compiled
into an automaton, this automaton can be written
to a file. And a file containing an automaton can be
imported into a file containing a formula. We therefore
use the following strategy to query treebanks
using MONA. Each tree from the treebank is translated
into a formula in the MONA language. How
this is done is described in Section 4. The formula
representing the tree is then compiled into an
automaton and written to file. Now the treebank exists
as a set of automata files. A query to the original
treebank is also translated into a MONA formula.
For each tree of the treebank, this formula is
extended by an import statement for the automaton
representing the tree. The tree is a match for the
original query if and only if the extended formula
representing query and tree can be compiled
into an automaton. This way we can use MONA to query
the treebank. The method is depicted in Figure 1.
3 The Tübingen Treebanks
In order to evaluate the usability of MONA as a
query tool we had to choose some treebank to do our
evaluation on. We opted for the Tübingen Treebank
of spoken German. The Tübingen Treebanks,
annotated at the University of Tübingen, comprise a
German, an English and a Japanese treebank consisting
of spoken dialogs restricted to the domain of
arranging business appointments. For our evaluation,
we focus on the German treebank (TüBa-D/S)
(Stegmann et al., 2000; Hinrichs et al., 2000) that
contains approximately 38,000 trees.
The treebank is part-of-speech tagged using the
Stuttgart-Tübingen tag set (STTS) developed by
Schiller et al. (1995). One of the design decisions
for the development of the treebank was the commitment
to reusability. As a consequence, the choice of
the syntactic annotation scheme should not reflect a
particular syntactic theory but rather be as theory-neutral
as possible. Therefore a surface-oriented
scheme was adopted to structure German sentences
that uses the notion of topological fields in a sense
similar to that of Höhle (1985). The verbal elements
have the categories LK (linke Klammer) and VC (verbal
complex); roughly, everything preceding the LK
forms the “Vorfeld” VF, everything between LK and
VC forms the “Mittelfeld” MF, and the material
following the VC forms the “Nachfeld” NF.
[Figure 2 (tree diagram): the tokens ist/VAFIN, der/ART, vierundzwanzigste/ADJA, Juli/NN, and ./$. with the phrase nodes VXFIN, ADJX, NX, LK, MF, and SIMPX; edges carry grammatical functions such as HD and PRED.]
Figure 2: An example tree from TüBa-D/S.
The treebank is annotated with syntactic cate-
gories as node labels, grammatical functions as edge
labels and dependency relations. The syntactic cat-
egories follow traditional phrase structure and the
theory of topological fields. An example of a tree
can be found in Figure 2. To cope with the characteristics
of spontaneous speech, the data structures in
the Tübingen Treebanks are of a more general form
than trees. For example, an entry may consist of sev-
eral tree structures. It may also contain completely
disconnected nodes. In contrast to TIGER or the
Penn Treebank, there are neither crossing branches
nor empty categories.
There is no particular reason why we chose this
treebank. Many others could have been used as well
for testing the applicability of MONA.
4 Converting Trees into Automata
4.1 Translating Trees into Tree Descriptions
When translating trees from the treebank into
MONA formulae describing these trees we consider
proper trees only. Many treebanks, including TüBa-D/S,
contain more complex structures than proper
trees. For the evaluation purpose here we simplify
these structures as follows. We ignore the secondary
relations. And we introduce a new super root. All
disconnected subparts are connected to this super
root. Note that we employ this restriction for the
evaluation purpose only. The general method does
not require these restrictions, because even more
complex tree-like structures can be recoded into
proper binary trees, as shown by Kepser (2004).
As stated above, the translation of trees into for-
mulae has to perform two tasks. The trees, which
are arbitrarily branching, must be transformed into
binary trees. And the linguistic labels, i.e., the node
categories and grammatical functions, have to be
coded. For the transformation into binary trees, we
employ the First-Daughter-Next-Sibling encoding,
a rather standard technique. Consider an arbitrary
node x in the tree. If x has any daughters, its leftmost
daughter becomes the left daughter of x in
the transformed tree. If x has any sisters to its right,
then the nearest of them, its next sister, becomes the
right daughter of x in
the transformed tree. This transformation is applied
recursively to all nodes in the tree. For example, the
tree in Figure 2 is transformed into the binary tree in
Figure 3.
Note how the disconnected punctuation node at
the lower right corner in Figure 2 becomes the right
daughter of the SIMPX node in Figure 3. Note also
that we have both category and grammatical func-
tion as node labels for those nodes that have a gram-
matical function.
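The encoding can be sketched in a few lines. The classes Tree and BinNode and the functions encode and addresses are illustrative names, not part of the translation program described in this paper, and the example tree is rebuilt by hand from Figures 2 and 3 with grammatical functions omitted. Addresses are written in the root.s notation of Section 2.1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class Tree:
    label: str
    children: List["Tree"] = field(default_factory=list)


@dataclass
class BinNode:
    label: str
    left: Optional["BinNode"] = None    # first daughter in the original tree
    right: Optional["BinNode"] = None   # next sister in the original tree


def encode(node: Tree, right_sisters: Sequence[Tree] = ()) -> BinNode:
    """First-Daughter-Next-Sibling encoding of `node` and its right sisters."""
    coded = BinNode(node.label)
    if node.children:
        coded.left = encode(node.children[0], node.children[1:])
    if right_sisters:
        coded.right = encode(right_sisters[0], right_sisters[1:])
    return coded


def addresses(coded: BinNode, suffix: str = "") -> Dict[str, str]:
    """Map addresses such as root, root.0, root.01 to node labels."""
    result = {("root." + suffix) if suffix else "root": coded.label}
    if coded.left is not None:
        result.update(addresses(coded.left, suffix + "0"))
    if coded.right is not None:
        result.update(addresses(coded.right, suffix + "1"))
    return result


# The example tree with the added super root (structure read off Figure 3).
example = Tree("SROOT", [
    Tree("SIMPX", [
        Tree("LK", [Tree("VXFIN", [Tree("VAFIN")])]),
        Tree("MF", [Tree("NX", [Tree("ART"),
                                Tree("ADJX", [Tree("ADJA")]),
                                Tree("NN")])]),
    ]),
    Tree("$PERIOD"),
])

coded_addresses = addresses(encode(example))
for address in sorted(coded_addresses):
    print(address, coded_addresses[address])
```

In this sketch the punctuation node ends up at root.01, i.e. as the right daughter of SIMPX, in line with the remark above.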
Such a binary tree is now described by several
formulae. The first formula, called Carcase, collects
the addresses of all nodes in the tree to describe the
tree structure without any labels. For our example
tree, the formula would be
Carcase = {root, root.0, root.00, root.000, root.0000,
root.01, root.001, root.0010, root.00100,
root.001001, root.0010010, root.0010011}.
A syntactic category or grammatical function is
coded as the set of nodes in the tree that are labelled
with this category or function. This is the way to
circumvent the problem that we cannot extend the
language of MONA. Here are some formulae for some
labels of the example tree.
LK = {root.00}, ART = {root.00100}, HD =
{root.000, root.0000, root.0010010, root.0010011}.
[Figure 3 (binary tree diagram) with the nodes SROOT, SIMPX, LK, $PERIOD, VXFIN HD, MF, VAFIN HD, NX PRED, ART, ADJX, ADJA HD, and NN HD.]
Figure 3: Binary recoding of the tree in Figure 2.
For all category or function labels that are not
present in a particular tree, but part of the label set of
the treebank, we state that the corresponding sets are
empty. For example, the description of the example
tree contains the formula empty(VC).
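A minimal sketch of how such a tree description could be generated is given below. The dictionary CODED_TREE is the example tree written out by hand, LABEL_SET is a hypothetical excerpt of the treebank's label inventory, and describe is an illustrative name. The output follows the notation used in this paper and is not claimed to be verbatim MONA input syntax.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Coded example tree: address -> (category, grammatical function or None)
CODED_TREE: Dict[str, Tuple[str, Optional[str]]] = {
    "root": ("SROOT", None),
    "root.0": ("SIMPX", None),
    "root.00": ("LK", None),
    "root.000": ("VXFIN", "HD"),
    "root.0000": ("VAFIN", "HD"),
    "root.01": ("$PERIOD", None),
    "root.001": ("MF", None),
    "root.0010": ("NX", "PRED"),
    "root.00100": ("ART", None),
    "root.001001": ("ADJX", None),
    "root.0010010": ("ADJA", "HD"),
    "root.0010011": ("NN", "HD"),
}

# Hypothetical excerpt of the label set of the treebank (an assumption).
LABEL_SET = ["SIMPX", "LK", "VC", "MF", "NX", "ART", "HD", "PRED"]


def describe(coded_tree, label_set) -> List[str]:
    """Return the Carcase formula plus one formula per label in label_set."""
    members = defaultdict(list)
    for address, (category, function) in coded_tree.items():
        members[category].append(address)
        if function is not None:
            members[function].append(address)
    lines = ["Carcase = {" + ", ".join(coded_tree) + "}"]
    for label in label_set:
        if label in members:
            lines.append(label + " = {" + ", ".join(members[label]) + "}")
        else:
            lines.append("empty(" + label + ")")   # label absent from this tree
    return lines


description = describe(CODED_TREE, LABEL_SET)
print("\n".join(description))
```

Labels that occur in the tree yield a set of addresses, while a label from the inventory that does not occur, such as VC here, yields an emptiness statement.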
We implemented a program that actually performs
this translation step. The input is a fraction of
the TüBa-D/S treebank in NEGRA export format
(Brants, 1997). The output is a file for each tree
containing the MONA formulae coding this tree. In
this way, we get a set of MONA formulae describing
each tree.
4.2 Compiling Tree Descriptions into
Automata
As mentioned above, the next step consists in compiling
each tree description into an equivalent automaton.
This is the first part of the evaluation.
We tested whether MONA can actually perform this
compilation. Surprisingly, the answer is not as
simple as one might expect. It turns out that the
computing power required to perform the compilation is
quite high. To start, we chose a very small subset of
the TüBa-D/S, just 1000 trees. Some of these trees
contain more than 100 nodes, one more than 200
nodes. Processing descriptions of these large trees
actually requires a lot of computing power.
It seems it is not possible to perform this compilation
step on a desktop machine. We tried an AMD
2200 machine with 2 GB RAM, but aborted
the compilation of the 1000 trees after 15 hours. At
that time, only 230 trees had been compiled.
To actually get through the compilation of the
treebank we transferred the task to a cluster computer.
On this cluster we used 4 nodes in parallel, each
equipped with two AMD Opteron 146 processors
(2 GHz, 4 GB RAM). Parallelisation is simple since each
tree description can be compiled independently of all the
others. The parallelisation was done by hand. Using
this equipment we could compile 999 trees in about
4 hours. These 4 hours are the time needed to complete
the whole task, not pure processing time. The
tree containing more than 200 nodes could still not
be compiled. Its compilation terminated unsuccessfully
after 6 hours. We decided to drop this tree from
the sample.
It is obvious that this is a major obstacle for using
MONA. It is difficult to believe that many linguists
will have access to a cluster computer and sufficient
knowledge to use it. And we expect on the basis of
our experience that a compilation on an ordinary
desktop machine can take several days, provided the
machine is equipped with large amounts of memory.
Otherwise it will fail. One still has to consider that
1000 trees are not many. The TIGER corpus and the
TüBa-D/S each have about 40,000 trees. Thus one
may argue that this fact alone makes MONA unsuitable
for use by linguists. But the compilation step
has to be performed only once. The files containing
the resulting automata are machine independent.
Hence a corpus provider could, at least in theory,
provide his corpus as a collection of MONA automata.
This labour would be worth the effort if the resulting
automata could be used for efficient querying.
5 Querying the Treebank
In order to query the treebank we designed a query
language that has MSO as its core but contains fea-
tures desirable for treebank querying. Naturally the
language is designed to query the original trees, not
their codings. It is therefore necessary to translate
a query into an equivalent MONA formula that re-
spects the translation of the trees.
5.1 The Query Language
The query language is defined as follows. The language
has a LISP-like syntax. First-order variables
(x, y, ...) range over nodes, set variables (X, Y, ...)
range over sets of nodes. The atomic formulae are
• (cat x NX) – Node x is of category NX,
• (fct x HD) – Node x is of grammatical function HD,
• (> x y) – Node x is the mother of node y,
• (>+ x y) – Node x properly dominates y,
• (. x y) – Node x is immediately to the left of y,
• (.. x y) – Node x is to the left of y,
• (= x y) – Nodes x and y are identical,
• (= X Y) – Node sets X and Y are identical,
• (in x X) – Node x is a member of set X.
Complex formulae are constructed by boolean
connectives and quantification. Let x be a node
variable, X a set variable, and φ and ψ formulae.
Then we have
• (! φ) – Negation of φ,
• (& φ ψ) – Conjunction of φ and ψ,
• (| φ ψ) – Disjunction of φ and ψ,
• (-> φ ψ) – Implication of φ and ψ,
• (E x φ) – Existential quantification of x in φ,
• (A x φ) – Universal quantification of x in φ,
• (E2 X φ) – Existential quantification of set variable X in φ,
• (A2 X φ) – Universal quantification of set variable X in φ.
5.2 Translating the Query Language
The next step consists of translating queries in this
language into MONA formulae. As is easy to see,
the translation of the complex formulae is
straightforward, because they are essentially the same in
both languages. The more demanding task is the
translation of formulae on category
and function labels and on the tree structure, i.e.,
dominance and precedence.
As described above, categories and functions are
coded as sets. Hence a query for a category or func-
tion is translated into a formula expressing set mem-
bership in the relevant set. For example, the query
(cat x SIMPX) is translated into (x in SIMPX).
The translations of dominance and precedence are
the most complicated ones, because we transformed
the treebank trees into binary trees. Now we have to
reconstruct the original tree structures out of these
binary trees. In the first step we have to define
dominance on coded binary trees. The MONA language
contains formulae for the left and right daughter
of a node, but there is no formula for dominance,
the transitive closure of the daughter relation. That
we can define dominance at all is a consequence of
the expressive power of MSO. As was shown by
Courcelle (1990), the transitive closure of any
MSO-definable binary relation is also MSO-definable. Let
R be an MSO-definable binary relation. Then
\[
\forall X \Bigl( \bigl( \forall z, w\, (z \in X \wedge R(z,w) \rightarrow w \in X) \wedge \forall z\, (R(x,z) \rightarrow z \in X) \bigr) \rightarrow y \in X \Bigr)
\]
is a formula with free variables x and y that defines
the transitive closure of R. If we now take R(x,y) in
the above formula to be (x.0 = y | x.1 = y) we define
dominance (dom). In a similar fashion we can define
that y is on the rightmost branch of x (rb(x,y)) by
taking R(x,y) to be (x.1 = y).
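The closure formula can be checked on a small finite structure by brute force: quantify over all subsets X of the domain and compare the result with ordinary reachability. The sketch below does this for the daughter relation of a five-node binary tree. Nodes are written as bit strings with the root as the empty string, which is a shorthand chosen for this example. All function names are illustrative.

```python
from itertools import combinations


def subsets(domain):
    """Enumerate all subsets of a finite domain."""
    for size in range(len(domain) + 1):
        for combo in combinations(domain, size):
            yield set(combo)


def mso_closure(domain, R, x, y):
    """Evaluate the closure formula: every set X that is closed under R and
    contains all R-successors of x must also contain y."""
    for X in subsets(domain):
        closed = all(w in X for (z, w) in R if z in X)
        has_successors = all(w in X for (z, w) in R if z == x)
        if closed and has_successors and y not in X:
            return False
    return True


def reachable(R, x, y):
    """Reference definition: there is an R-path of length >= 1 from x to y."""
    frontier, seen = [x], set()
    while frontier:
        node = frontier.pop()
        for (z, w) in R:
            if z == node and w not in seen:
                seen.add(w)
                frontier.append(w)
    return y in seen


# Nodes of a small binary tree as bit strings; "" is the root.
domain = ["", "0", "1", "00", "01"]
# Daughter relation: R(a, b) iff b = a.0 or b = a.1
R = {(a, b) for a in domain for b in domain if b in (a + "0", a + "1")}

agree = all(mso_closure(domain, R, x, y) == reachable(R, x, y)
            for x in domain for y in domain)
print(agree)
```

Note that the relation defined this way is the proper (irreflexive) closure: for a leaf x the empty set satisfies both closure conditions, so no y is related to x.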
Now for immediate dominance, if node x is the
mother of y in the original tree, we have to distinguish
two cases. In the simpler case, y is the leftmost
daughter of x, so after transformation, y is the left
daughter of x. Otherwise y is not the leftmost daughter of
x; in that case it is a sister of the leftmost daughter
z of x. All sisters of z are found on the rightmost
branch of z in the transformed trees. Hence (> x y)
is translated into (x.0 = y | ex1 z : x.0 = z & rb(z,y)).
Proper dominance is treated similarly. If we iter-
ate the above argument that the daughters of a node
x in the original tree become the left daughter z of x
and the rightmost successors of z, we can see that z
and all the nodes dominated by z in the translated
tree are actually all the nodes dominated by x in
the original tree. Hence (>+ x y) is translated into
(x.0 = y | ex1 z : x.0 = z & dom(z,y)).
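Both translations can be tested against the original example tree. In the sketch below, nodes are identified by their labels, which happen to be unique in this tree, and coded addresses are bit strings with the root as the empty string. Because the addresses are explicit, dom and rb are computed with prefix tests rather than with the closure formula; this shortcut, like all names in the code, is an assumption of the example.

```python
# Original example tree: mother -> ordered daughters.
CHILDREN = {
    "SROOT": ["SIMPX", "$PERIOD"],
    "SIMPX": ["LK", "MF"],
    "LK": ["VXFIN"],
    "VXFIN": ["VAFIN"],
    "MF": ["NX"],
    "NX": ["ART", "ADJX", "NN"],
    "ADJX": ["ADJA"],
}


def fdns_addresses(children, root):
    """Bit-string address of every node in the First-Daughter-Next-Sibling coding."""
    address = {}

    def visit(node, path, right_sisters):
        address[node] = path
        daughters = children.get(node, [])
        if daughters:
            visit(daughters[0], path + "0", daughters[1:])
        if right_sisters:
            visit(right_sisters[0], path + "1", right_sisters[1:])

    visit(root, "", [])
    return address


ADDR = fdns_addresses(CHILDREN, "SROOT")


def dom(a, b):
    """Proper dominance in the coded tree: a is a proper prefix of b."""
    return len(a) < len(b) and b.startswith(a)


def rb(a, b):
    """b is on the rightmost branch below a: a followed by one or more 1s."""
    return dom(a, b) and set(b[len(a):]) == {"1"}


def mother(x, y):
    """Translation of (> x y): x.0 = y, or y is on the rightmost branch of x.0."""
    first = ADDR[x] + "0"
    return ADDR[y] == first or rb(first, ADDR[y])


def dominates(x, y):
    """Translation of (>+ x y): x.0 = y, or x.0 dominates y in the coded tree."""
    first = ADDR[x] + "0"
    return ADDR[y] == first or dom(first, ADDR[y])


def descendants(node):
    """Reference: proper descendants of node in the original tree."""
    result = set()
    for daughter in CHILDREN.get(node, []):
        result.add(daughter)
        result |= descendants(daughter)
    return result


nodes = list(ADDR)
mother_ok = all(mother(x, y) == (y in CHILDREN.get(x, []))
                for x in nodes for y in nodes)
dominates_ok = all(dominates(x, y) == (y in descendants(x))
                   for x in nodes for y in nodes)
print(mother_ok, dominates_ok)
```

On this tree both translated relations coincide with the relations of the original tree for every pair of nodes.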
For precedence, consider a node x in a coded binary
tree. By definition the left daughter of x and
all her successors are nodes that precede the right
daughter of x and her successors in the original tree.
Thus (.. x y) is translated into
(x.1 = y | ( ex1 z,w,v : z.0 = w & z.1 = v &
(w = x | dom(w,x)) &
(v = y | dom(v,y)))).
Immediate precedence can be expressed using
precedence. Node x immediately precedes y if x precedes
y and there is no node z that is preceded by x and
precedes y.
There is a small issue in the translation of quantified
formulae. In the translation of a first-order
quantification (existential or universal) of a variable
x we have to make sure that x actually ranges over
the nodes in a particular tree. Otherwise MONA
may construct an automaton that contains the coded
tree as a substructure, but is more general. In such
a case we could no longer be certain that a solution
found by MONA actually represents a proper match
of the original query on the original tree. To solve
this problem, we add (x in Carcase) to the translation
of (E x φ) or (A x φ). E.g., (E x φ) translates
to (ex1 x : x in Carcase & φ′) where φ′ is the
translation of φ. The same holds, mutatis mutandis,
for set variable quantification.
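A minimal sketch of such a translator for the label and boolean fragment of the query language follows. The functions parse and translate are illustrative and are not the translation program described in Section 5.3. The existential case follows the scheme given above. Guarding the universal quantifier with an implication is an assumption of this sketch, since only the existential case is spelled out here. Structural relations and set quantifiers are left out.

```python
def parse(text):
    """Parse a LISP-like query string into nested Python lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        return tokens[pos], pos + 1

    tree, _ = read(0)
    return tree


BINARY = {"&": "&", "|": "|", "->": "=>"}
# Quantifier keyword and the connective that attaches the Carcase guard.
# The "=>" for the universal case is an assumption of this sketch.
FIRST_ORDER = {"E": ("ex1", "&"), "A": ("all1", "=>")}


def translate(query):
    """Translate a parsed query into a formula string in the notation of Section 2.1."""
    op = query[0]
    if op in ("cat", "fct", "in"):
        # Labels are coded as sets, so label tests become set membership.
        return f"({query[1]} in {query[2]})"
    if op == "=":
        return f"({query[1]} = {query[2]})"
    if op == "!":
        return f"~{translate(query[1])}"
    if op in BINARY:
        return f"({translate(query[1])} {BINARY[op]} {translate(query[2])})"
    if op in FIRST_ORDER:
        quantifier, glue = FIRST_ORDER[op]
        var = query[1]
        return f"({quantifier} {var} : {var} in Carcase {glue} {translate(query[2])})"
    raise NotImplementedError(f"operator {op!r} is not covered by this sketch")


simple = translate(parse("(E x (cat x NX))"))
nested = translate(parse("(E x (& (cat x NX) (! (fct x HD))))"))
print(simple)
print(nested)
```

The guard keeps every quantified node variable inside the set Carcase of the tree under consideration, which is the restriction motivated in the paragraph above.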
5.3 Performing a Query
We implemented a small program that performs the
above described translation of queries. It actually
does a little bit more. It adds the defining formulae
for dom and rb. Furthermore, as mentioned above,
MONA allows a precompiled automaton to be included
in a set of MONA formulae via a special import
declaration. Such an import declaration is used to
include the automata representing the (coded) trees
from the treebank. Thus the set of MONA formulae
to evaluate a query consists of the translation of the
query, the formulae for dom and rb, and an import
declaration for one tree from the treebank. This set
of MONA formulae can now be fed into MONA to
try to compile it into an automaton. If the compilation
is successful, there exists an automaton that at
the same time represents the translation of the query
and the translation of the given tree. Hence the tree
is a match for the query. If there is no automaton, the
tree is no match for the query. To perform the query
on the whole treebank there is a loop that imports
every tree in turn and calls MONA to check if an
automaton can be compiled. The result is the set of
tree IDs that identify the trees that match the query.
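The shape of this loop can be sketched as follows. The call to MONA itself is deliberately not modelled: the parameter compiles is a hypothetical stand-in for writing the extended formula file and invoking the compiler, and the file names and the stub fake_compiles exist only to make the sketch executable.

```python
from typing import Callable, Dict, List


def query_treebank(query_formula: str,
                   automaton_files: Dict[str, str],
                   compiles: Callable[[str, str], bool]) -> List[str]:
    """Return the IDs of all trees for which query plus imported tree compile."""
    matches = []
    for tree_id, automaton_file in automaton_files.items():
        # One compiler call per tree: query formula plus import of this tree.
        if compiles(query_formula, automaton_file):
            matches.append(tree_id)
    return matches


# Stand-in data and stub (assumptions for illustration only).
FAKE_LABELS = {
    "tree1.aut": {"SIMPX", "NX"},
    "tree2.aut": {"SIMPX"},
    "tree3.aut": {"NX"},
}


def fake_compiles(query_formula: str, automaton_file: str) -> bool:
    """Pretend the compiler succeeds iff the tree contains an NX node."""
    return "NX" in FAKE_LABELS[automaton_file]


files = {"t1": "tree1.aut", "t2": "tree2.aut", "t3": "tree3.aut"}
result = query_treebank("(ex1 x : x in Carcase & (x in NX))", files, fake_compiles)
print(result)
```

Since each tree is handled by a separate compiler call, the total query time is the sum of the per-tree times.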
We tested this method on our small treebank of
999 trees from TüBa-D/S. Unfortunately it turned
out that the reloading of large precompiled automata
(representing large trees) also requires enormous
computational resources. We experimented with
a very simple query: ∃x NX(x) (or (E x (cat x
NX))). On our desktop machine (AMD 2200, 2 GB
RAM), it took 6 hours and 9 minutes to process this
query. If we pose the same query on the whole treebank
TüBa-D/S (with about 38,000 trees) using established
query tools like TIGERSearch or fsq, processing
time is about 5 seconds. Hence the method
of using MONA is clearly not appropriate for desktop
computers.
Even access to larger computing power does not
solve the problem. We processed the same query
on one processor (AMD Opteron 146, 2 GHz, 4 GB
RAM) of the cluster computer mentioned above.
There it took 1 minute and 30 seconds. About the
same query answer time was required for a second,
more complex query that asked for two different NX
nodes and a third SIMPX node. These query answer
times are still too long, because we queried only
about one fortieth of the whole treebank. Since each
tree is queried separately, we can expect a linear time
increase in the query time in the number of trees. In
other words, evaluating the query on the whole tree-
bank would probably take about 1 hour. And that
on a computer with such massive computing power.
TIGERSearch and fsq are 720 times faster, and they
run on desktop computers.
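The extrapolation can be retraced with a few lines of arithmetic. All input figures are the ones reported above; the variable names are illustrative, and the linear scaling is the assumption stated in the text.

```python
trees_total = 38_000          # approximate size of TüBa-D/S
trees_sample = 999            # trees actually queried
cluster_seconds_sample = 90   # 1 minute 30 seconds on one Opteron processor
baseline_seconds_total = 5    # TIGERSearch or fsq on the whole treebank

# Linear extrapolation from the sample to the whole treebank.
scale = trees_total / trees_sample
cluster_minutes_total = cluster_seconds_sample * scale / 60

# Speed ratio, once with the rounded figure of 1 hour and once unrounded.
ratio_rounded = 3600 / baseline_seconds_total
ratio_unrounded = cluster_seconds_sample * scale / baseline_seconds_total

print(round(scale), round(cluster_minutes_total), ratio_rounded, round(ratio_unrounded))
```

The sample is roughly one thirty-eighth of the treebank, so the extrapolated time is about 57 minutes, which rounds to the one hour used above; the factor of 720 corresponds to that rounded hour.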
6 Conclusions
Despite the many reported successful applications of
MONA in other areas, we have to state that MONA
is clearly not a choice for querying linguistic treebanks.
Firstly, we cannot use MONA to query for tokens
(or words). Secondly, the compilation of a treebank
into a set of automata is extremely difficult and
resource-consuming, if not impossible. And finally,
practical query answer times are far too long.
Apparently, reloading precompiled automata representing
large trees takes too much time, because the
automata representing these large trees are themselves
huge.
We note that this is unfortunately not the first negative
experience of trying to apply MONA to computational
linguistics tasks. Morawietz and Cornell
(1999), who try to use MONA to compile logical
formalisations of GB-theory, also report that
automata get too large to work with.
The general problem behind these two unsuccess-
ful applications of MONA to problems in computa-
tional linguistics seems to be that MONA does not
allow users to define their own signatures. Hence
linguistic labels have to be coded in an indirect fash-
ion. Though this coding works in theory, the result-
ing automata can become huge. The reason for this
explosion in automata size, though, remains myste-
rious.
The negative experience we had with MONA
does not, on the other hand, mean that the whole
enterprise of using tree automata for querying treebanks
is doomed to fail. It seems that it is rather
this particular deficit of MONA, that it provides no
direct way to cope with labelled trees, that causes the
negative result. It could therefore well be worth trying
to implement tree automata for labelled trees and
use these for treebank querying.
References
Anne Abeillé and Lionel Clément. 1999. A tagged reference
corpus for French. In Proceedings of EACL-LINC.
Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang
Lezius, and George Smith. 2002. The TIGER Treebank.
In Kiril Simov, editor, Proceedings of the Workshop
on Treebanks and Linguistic Theories, Sozopol.
Thorsten Brants. 1997. The NEGRA export format.
CLAUS Report 98, Universität des Saarlandes,
Computerlinguistik, Saarbrücken, Germany.
Bruno Courcelle. 1990. Graph rewriting: An algebraic
and logic approach. In Jan van Leeuwen, editor,
Handbook of Theoretical Computer Science,
volume B, chapter 5, pages 193–242. Elsevier.
Ferenc Gécseg and Magnus Steinby. 1997. Tree languages.
In Grzegorz Rozenberg and Arto Salomaa,
editors, Handbook of Formal Languages, Vol 3:
Beyond Words, pages 1–68. Springer-Verlag.
Erhard Hinrichs, Julia Bartels, Yasuhiro Kawata, Valia
Kordoni, and Heike Telljohann. 2000. The VERBMOBIL
treebanks. In Proceedings of KONVENS 2000.
Tilman Höhle. 1985. Der Begriff ‘Mittelfeld’.
Anmerkungen über die Theorie der topologischen Felder.
In A. Schöne, editor, Kontroversen, alte und neue.
Akten des 7. Internationalen Germanistenkongresses,
pages 329–340.
Stephan Kepser. 2003. Finite Structure Query: A tool
for querying syntactically annotated corpora. In Ann
Copestake and Jan Hajič, editors, Proceedings EACL
2003, pages 179–186.
Stephan Kepser. 2004. Querying linguistic treebanks
with monadic second-order logic in linear time.
Journal of Logic, Language, and Information, 13:457–470.
Nils Klarlund. 1998. Mona & Fido: The logic-automaton
connection in practice. In Computer Science
Logic, CSL ’97, LNCS 1414, pages 311–326.
Springer.
Esther König and Wolfgang Lezius. 2000. A description
language for syntactically annotated corpora. In
Proceedings of the COLING Conference, pages 1056–1060.
Mitchell Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1993. Building a large annotated corpus
of English: The Penn Treebank. Computational
Linguistics, 19(2):313–330.
Frank Morawietz and Tom Cornell. 1999. The MSO
logic-automaton connection in linguistics. In Alain
Lecomte, François Lamarche, and Guy Perrier, editors,
Logical Aspects of Computational Linguistics,
LNCS 1582, pages 112–131. Springer.
Beth Randall. 2000. CorpusSearch user’s manual. Technical
report, University of Pennsylvania.
http://www.ling.upenn.edu/mideng/ppcme2dir/.
Douglas Rohde. 2001. Tgrep2. Technical report,
Carnegie Mellon University.
Anne Schiller, Simone Teufel, and Christine Thielen.
1995. Guidelines für das Tagging deutscher Textcorpora
mit STTS. Manuscript, Universities of Stuttgart
and Tübingen.
Rosmary Stegmann, Heike Telljohann, and Erhard Hinrichs.
2000. Stylebook for the German treebank in
VERBMOBIL. Technical Report 239, SfS, University
of Tübingen.
Sean Wallis and Gerald Nelson. 2000. Exploiting fuzzy
tree fragment queries in the investigation of parsed corpora.
Literary and Linguistic Computing, 15(3):339–361.
