A Description Language for Syntactically Annotated Corpora 
Esther KSnig and Wolfgang Lezius 
IMS, University of Stuttgart, Germany 
www. ims. uni-stuttgart, de/proj ekte/TIGEg 
Abstract 
This paper introduces a description  
for syntactically annotated corpora which al- 
lows for encoding both the syntactic annota- 
tion to a corpus and the queries to a syntac- 
tically annotated corpus. 
In terms of descriptive adequacy and com- 
putational efficiency, the description lan- 
guage is a compromise between script-like 
corpus query s and high-level, typed 
unification-based grammar formalisms. 
1 Introduction 
Syntactically annotated corpora like the 
Penn Treebank (Marcus et al., 1993), the 
NeGra corpus (Skut et al., 1998) or the sta- 
tistically dismnbiguated parses in (Bell et 
al., 1999) provide a wealth of intbrmation, 
which can only be exploited with an ade- 
quate query . For example, one 
might want to retrieve verbs with their sen- 
tential complements, or specific fronting or 
extraposition phenomena. So far, queries to 
a treebank have been formulated in script- 
ing s like tgrep, Perl or others. Re- 
cently, some powerful query s have 
been developed: an exalnple of a high- 
level, constraint-based  is described 
in (Duchier and Niehren, 1999). (Bird et al., 
2000) propose a query  for the gen- 
eral concept of annotation grat)hs,, A graph- 
ical query notation tbr trees is under devel- 
opment in the ICE project (UCL, 2000). 
In the current paper, we present a pro- 
posal for a graph description  which 
is meant to fulfill two conflicting require- 
ments: On the one hand, the  
should be close to traditional linguistic de- 
scriptions s, i.e. to grammar for- 
malisms, as a basis for modular, under- 
standable code, even for complex corpus 
queries. On the other lmnd, the  
should not preclude etlicient query evalua- 
tion. Our answer is to profit from the re- 
search on typed, feature-based/constraint- 
based grammar tbrmalisms (e.g. (Carpenter, 
1992), (Copestake, 1999), (DSrre and Dorna, 
1993), (D6I're et al., 1996), (Emele and Za- 
jac, 1990), (H6ht~ld and Smolka, 1988)), and 
to pick those ingredients which are known to 
be con~i)utationally 'tractable' in some sense. 
2 The Query Language 
2.1 The right kind of graphs 
If syntactic analysis is meant to provide 
for a basis of semantic interpretation, the 
predicate-argulnent structure of a sentence 
nmst be recoverable fi'om its syntactic ana- 
lysis. Nonlocal dependencies like topicaliza- 
tion, right extraposition, tell us that tr'ccs 
are not expressive enough. We need a way 
to connect an extraposed constituent with its 
syntactic resp. semantic head. This can be 
done either by introducing empty leaf nodes 
plus a means for node coreference (like in 
the Penn Treebank) or by admitting cross- 
ing edges. In our project, the latter solution 
has been chosen (Skut et al., 1997), partly 
tbr the reason that it is simpler to annotate 
(no decision on the right place of a trace has 
to be taken). We call this extension of trees 
with crossing edges syntaz graphs. An exam- 
ple is shown in Fig. 1. 
In order to discuss the details of the lan- 
guage, we will make reference to the simpler 
syntax graph in Fig. 2. 
1056 
C 
K Eq 
Kq r~q 
F~q + + 
Die Tagung hat mohr Teilnohmer als \]e zuvor 
ART NN VVFIN HAT NN KOKOM ADV ADV 
Def.Fem.Nom.Sg Fem.Nom.Sg.* 3.Akk.Pl %'.* Masc.Akk.Pl.* ...... 
Figure 1: A syntax graph with crossing edges ("the conference has more tmrticipants than 
ever bet:bre") 
eJn Mann Ifiuft 
ART NN VVFIN 
Figure 2: A simple syntax graph ("a man 
I'l\].IIS '~) 
2.2 Nodes: feature records 
Syntactic phrases and lexical entries usu- 
ally come with a bundle of morphosyntae- 
tic information like part-of speech, case, gen- 
der, and mnnber. In computational linguis- 
tics, t~ature structures are used for that pur- 
pose. Since we need only a way to repre- 
sent morphosyntactic information (not Syll- 
tactic or semantic structures) themselves, we 
restrict ourselves to feature records, i.e. fiat; 
feature structures whose tbature values are 
constants. We admit Boolean tbrmulas, tbr 
the fl.'ature values, as well as tbr the feature- 
value pairs themselves. 
For example, all proper nouns ("NE") and 
nouns ("NN") can be retrieved by 
\[pos= "NE" I "NN"\] 
As usual, strucl;ura\] identity ca.n be ex- 
pressed by the use of logical variables. How- 
ever, variables must not occur in the SCOl)e 
of negation, since this would introduce the 
colnlmtational overhead of inequality con- 
straints. 
The values of a feature with 'infinite' range 
like word or 1emma can be referred to by reg- 
ular exl)ressions, e.g. the nouns ("NN") with 
initial M can be retrieved by 
\[word = /^M.*/ & pos="NN"\] 
The/-symbols inark a regular expression. 
2.3 Node relations 
Since gral)hs are two-dimensional objects, we 
need one basic node relation tbr each di- 
mension, direct precedence . for the hor- 
izontal dilnension and direct dominance > 
tbr the vertical dimension (the precedence of 
two inner nodes is defined as the precedence 
1057 
of their leftmost terminal successors (Lezius 
and KSnig, 2000a)) Some convenient derived 
node relations are the following: 
>* dominance (minimum path length 1) 
>n dominance in n steps (n > 0) 
>m,n dominance between ~n, and n steps 
(0 < m < n) 
>Ol leftmost terminal successor 
('left corner') 
>@r rightmost terminal successor 
('right corner') 
• * precedence (minimum nmnber of inter- 
vals: 1) 
• n precedence with rt intervals (n > 0) 
• m,n precedence between m and 'n, intervals 
(0 < m < 
$ siblings 
$.* siblings with precedence 
2.4 Graph descriptions 
We admit restricted 13oolean expressions 
over node relations, i.e. conjunction and dis- 
junction, but no negation. For examI)le, tile 
queries 
#nl : \[word="ein" ~ pos="ART"\] 
#n2: \[word="Mann" & pos="NN"\] 
#nl #n2 
and 
#nl:\[cat="NP"\] >"NK" \[pos="kRT"\] 
& #nl >"NK" \[word="Mann"\] 
art both satisfied by the NP-constituent 
in Fig. 2. #nl, #n2 art variables. Tile sym- 
bol "NR" is an edge label. Edges can be la- 
belled in order to indicate the syntactic re- 
lation between two nodes. 
2.5 Types 
For tile t)urpose of conceptual chuity, tile 
user can define type hierarchies. 'SubtylleS: 
may also be constants e.g. like in the case of 
part-of-speech symbols. Here is all excerpt 
from the type hierarchy tbr the STTS tagset: 
nominal := noun,properNoun,pronoun. 
noun := "NN". 
properNoun := "NE". 
pronoun "= 
"PPPER","PPOS","PRELS", ... . 
This hierarchy can be used to tbrmulate 
queries in a more concise manner: 
\[pos=nominal\] .* \[pos="VVFIN"\] 
2.6 Templates 
E.g. Ibr a concrete lexicon acquisition task, 
one might have to define a collection of in- 
terdependent, comI)lex queries. In order 
to keel) tile resulting code tractable and 
reusable, queries call be organised into teln- 
plates (oi macros). Templates can take log- 
ical variables as arguments and may refer to 
other temi)lates , as long as there is no (em- 
bedded) self reference. Logically, templates 
art offline-compilable Horn fbrmula. 
Here are some examples tbr template def 
initions. A simple notion of VerbPhrase is 
being de.fined with reference to a notion of 
PrepPhrase. 
PrepPhrase ( #nO : \[cat="PP ''\] 
> #nl : \[pos="APPR"\] 
#nO 
> #n2: \[pos="NE"\] 
#nl.#n2 ) ; 
VerbPhrase ( #nO : \[cat="VP"\] 
> #nl : \[pos="VVFIN"\] 
#nO > #n2 & 
#nl.#n2 ) <- 
PrepPhrase (#n2) ; 
1058 
3 The Corpus Annotation 
Language 
3.1 Corpus annotation vs. queries 
Actually, the query  is rather a dc- 
,scription  which (:an 1)e used also 
for encoding the syntactic annotation of a 
corpus. \]n the current proje, ct, a SylltaC- 
tically disambiguated corpus is being 1)re- 
duced. This means, that, for corl)us anno- 
tation, only a sub of the i)rol)osed 
 is adnlissibh', with the following re- 
strict;ions: 
• The graph (;ollstrailltS Illay only inclu(le 
the, t)asi(: node relations (>, .). 
,, The only logical contlective on all struc- 
tural levels is the COl\junction el)cra- 
ter &. 
• lq,egular expressions are, 'not admitted. 
,, Tyl)es and teml)lates are 'uo/, admitted. 
The automatically generate(1 corl)us anno- 
tation (:ode (generate(1 from the, outl)ut of 
tile gral)hical annotation interface) for Fig. 2 
looks as fl)llows, with some additional mark- 
up for ease of processing. 
<sentence £d="i" roeC="5"> 
"1": \[uord="ein" & pos="hRT"\] gg 
"2": \[word="Mann" g~ pos="NN"\] g~ 
"3": \[uord="l~iuft" & pos="VVFIN"\] 
"4": \[cat="NP"\] & 
"5": \[cat="S ''\] & 
("l" "2") ("2" "3") 
("5" >"SB" "4") & ("5" >"HD" "3") 
("4" >"NK" "1") & ("4" >"NK" "2") 
3.2 An XML representation 
When designing the, architecture of our sys- 
loin, we had to deal with the 1)roblem of var- 
ious diflhrent formats for the representation 
of syntactically annotated corpora: Penn 
~lYe, ebank, Ne, Gra (Skut et al., 1.997), Tip- 
st;er, Susmme, several fi)rnlats for chunked 
texts and the I)roposed des(:ription ,. 
Thus, we have developed an XML based for- 
mat which guarantees maximmn 1)ortabil- 
ity (Mengel and Lezius, 2000). An online 
('onversion tool (NeOra, Penn Treebank -+ 
XML) is availabh', on our project homepage. 
4: Formal Semantics 
Compared to most other corpus description 
and corpus query s, o111 graph (te- 
scription  comes with a ibrmal and 
a clear-cut operational semantics, which has 
been described ill a technical report (Lez- 
ills anti KSnig, 2000a). The semantics has 
been compiled from the correslmntling parts 
of tbrmal semantics of the typed, unification- 
based gramlnar tbrmalisms and constraint- 
based logic programming s which 
have been cited above. Due to the, fact 
that the corpus slid the query are repre- 
se, nted in the same description , one 
Call detille a (;oi1se(tllellce relation })et\veell 
the corl)uS and the query. Essentially, the 
annotated cortms corresponds to a Prolog 
database, and the corpus query to a Prolog 
query. A query result is a syntax graph from 
the tort)us. 
5 Implementation 
One might argue that commercial and re- 
search implementations tbr structurally an- 
notated texts are already available, i.e. 
XML-retrieval systems, e.f. (LTG, 1999). 
However, we intend to solve t)rol)lems 
which are spe('ifi(" to natural  de- 
scriptions: non-eml)e(t(ling (non-tree-lilw,) 
structm'al annotations crossing edge, s, 
and, on the long-texm, re, trieval of co- 
indexed sul.)structures (co-refl;rence phenom- 
ena). A domain-specific impleme, ntation of 
the search engine gives the basis for opti- 
inizations wrt. linguistic applications (Lezius 
and KSnig, 20001)). 
Before queries can be (wahlate.d on a new 
corl)uS (e.ncoded in the NeGra, Penn Tree- 
bank or XML format), a preprocessing tool 
has to convert it into the format of the de- 
scription . Subsequently, the col 
pus is indexed in order to guarantee efficient 
lookups during the query evaluation. The 
query processor to date is cal)able of evaluat- 
ing 1)asic queries (cf. Sect. 2.2-2.4)..To sup- 
port all popular platforms, the tool is imple- 
mented in JawL There, is a servlet available 
on the project web page which illustrates the, 
cuir(:nt stage of the implementation. 
1059 
Conclusion 
Syntactic corpus annotations, complex cor- 
pus queries and comt)utational grammars 
have one common point: they are descrip- 
tions of natural  grammars. Our 
claim is that corpus query s should 
be close to traditional grammar fbrmalisins 
in order to make complicated information ex- 
traction tasks easier to encode. The level of 
processing efficiency of scripting s 
can still be reached if one restricts oneself to 
'off-line' compilable  elements only. 

References 

Franz Bell, Glenn Carroll, Detlef Prescher, 
Stefan Riezler, and Mats Rooth. 1999. 
Inside-outside estimation of a lexicalized 
pcfg ibr german. In Proceedings of the 
37th Annual Meeting of the ACL, Mary- 
land. 

Steven Bird, Peter Buneman, and Tan 
Wang-Chiew. 2000. Towards a query lan- 
guage for annotation graphs. In Proceed- 
ings of the LREC 2000, Athens, Greece. 

Bob Carpenter. 1992. The Logic of Typed 
Feature Structures. Tracts in Theoretical 
Computer Science. Cambridge University 
Press, Cambridge. 

Ann Copestake, 1999. Th, e (new) LKB sys- 
tem. www-csli.stanibrd.edu, /~aac/doc5- 
2.pdf 

Jochen D6rre and Michael Dorna. 1993. 
cur - a formalism tbr linguistic knowl- 
edge representation. Deliverable R.1.2A, 
DYANA 2, August. 

3ochen DSrre, Dov M. Gabbay, and Es- 
ther KSnig. 1996. Fibred semantics tbr 
feature-based grammar logic. Journal of 
Logic, Language, and Infi)rmation. Spe- 
cial Issue on Language and Proof Theory, 
5:387-422. 

Denys Duchier and Joachim Niehren. 1999. 
Solving dominance constraints with finite 
set constraint programming. Technical 
report, Universitiit des Saarlandes, Pro- 
gramming Systems Lab. 

Martin Emele and Rfmi Zajac. 1990. A 
fixed-point semantics for feature type 
systems. In Proceedings of the 2nd 
International Workshop on Conditional 
and Typed Rewriting Systems, Montreal, 
Canada. 

Markus HShfeld and Gert Smolka. 1988. 
Definite relations over constraint lan- 
guages. LILOG-Report 53, IBM Deutsch- 
land, Stuttgart, Baden-Wfirttemberg, Oc- 
tober. 

Wolfgang Lezius and Esther KSnig. 2000a. 
The TIGER  - a description lan- 
guage for syntax graphs. Internal reI)ort, 
IMS, University of Stuttgart. 

Wolf'gang Lezius and Esther K5nig. 2000b. 
Towards a search engine for syntactically 
annotated corpora. In Proceedings of the 
KONVENS 2000, Ihnenau, Germany. 

LTG Language Technology Group, Ed- 
inburgh, 1999. LT XML version 1.1. 
User docum.cntation and reference guide. 
www. ltg. ed. ac. uk, software/xmL 

Mitchell Marcus, Beatrice Santorini, and 
Mary Ann Marcinkiewicz. 1993. Building 
a large annotated corpus of English: The 
Penn Treebank. Coraputational Linguis- 
tics. 

Andreas Mengel and Wolfgang Lezius. 2000. 
An XML-based representation tbrmat tbr 
syntactically annotated corpora. In Pro- 
ceedings of the LREC 2000, Athens, 
Greece. 

Wojciech Skut, Brigitte Krenn, Thorsten 
Brants, and Hans Uszkoreit. 1997. An 
annotation scheme ibr free word order 
s. In Proceedings of the 5th 
Conference on Applied Natural Language 
Processing (ANLP), Washington, D.C., 
March. 

Wojciech Skut, Thorsten Brants, Brigitte 
Krenn, and Hans Uszkoreit. 1998. A lin- 
guistically interpreted corpus of german 
newspaper text. In ESSLI 1998, Work- 
shop on Recent Advances in Corpus An- 
notation. 

UCL University College London, 2000. ICE 
(International Corpus of English). 
