Proceedings of HLT/EMNLP 2005 Demonstration Abstracts, pages 14–15,
Vancouver, October 2005.
Prague Dependency Treebank as an exercise book of Czech
Barbora Hladk´a and Ondˇrej Kuˇcera
Institute of Formal and Applied Linguistics
Charles University
Malostransk·e n·am. 25
118 00 Prague, Czech Republic
hladka@ufal.mff.cuni.cz, ondrej.kucera@centrum.cz
Abstract
There was simply linguistics at the begin-
ning. During the years, linguistics has
been accompanied by various attributes.
For example corpus one. While a name
corpus is relatively young in linguistics,
its content related to a language - collec-
tion of texts and speeches - is nothing new
at all. Speaking about corpus linguistics
nowadays, we keep in mind collecting of
language resources in an electronic form.
There is one more attribute that comput-
ers together with mathematics bring into
linguistics - computational. The progress
from working with corpus towards the
computational approach is determined by
the fact that electronic data with the  un-
limited computer potential give opportu-
nities to solve natural language processing
issues in a fast way (with regard to the pos-
sibilities of human being) on a statistically
signi cant amount of data.
Listing the attributes, we have to stop for
a while by the notion of annotated cor-
pora. Let us build a big corpus including
all Czech text data available in an elec-
tronic form and look at it as a sequence of
characters with the space having dominat-
ing status  a separator of words. It is very
easy to compare two words (as strings), to
calculate how many times these two words
appear next to each other in a corpus, how
many times they appear separately and so
on. Even more, it is possible to do it
for every language (more or less). This
kind of calculations is language indepen-
dent  it is not restricted by the knowl-
edge of language, its morphology, its syn-
tax. However, if we want to solve more
complex language tasks such as machine
translation we cannot do it without deep
knowledge of language. Thus, we have
to transform language knowledge into an
electronic form as well, i.e. we have to
formalize it and then assign it to words
(e.g., in case of morphology), or to sen-
tences (e.g., in case of syntax). A cor-
pus with additional information is called
an annotated corpus.
We are lucky. There is a real annotated
corpus of Czech  Prague Dependency
Treebank (PDT). PDT belongs to the top
of the world corpus linguistics and its sec-
ond edition is ready to be of cially pub-
lished (for the  rst release see (Haji c et al.,
2001)). PDT was born in Prague and had
arisen from the tradition of the successful
Prague School of Linguistics. The depen-
dency approach to a syntactical analysis
with the main role of verb has been ap-
plied. The annotations go from the mor-
phological level to the tectogrammatical
level (level of underlying syntactic struc-
ture) through the intermediate syntactical-
analytical level. The data (2 mil. words)
have been annotated in the same direction,
i.e., from a more simple level to a more
14
complex one. This fact corresponds to
the amount of data annotated on a partic-
ular level. The largest number of words
have been annotated morphologically (2
mil. words) and the lowest number of
words tectogramatically (0.8 mil. words).
In other words, 0.8 million words have
been annotated on all three levels, 1.5 mil.
words on both morphological and syntac-
tical level and 2 mil. words on the lowest
morphological level.
Besides the veri cation of ’pre-PDT’ the-
ories and formulation of new ones, PDT
serves as training data for machine learn-
ing methods. Here, we present a system
Styx that is designed to be an exercise
book of Czech morphology and syntax
with exercises directly selected from PDT.
The schoolchildren can use a computer to
write, to draw, to play games, to page en-
cyclopedia, to compose music - why they
could not use it to parse a sentence, to de-
termine gender, number, case, . . . ? While
the Styx development, two main phases
have been passed:
1. transformation of an academic ver-
sion of PDT into a school one. 20
thousand sentences were automati-
cally selected out of 80 thousand
sentences morphologically and syn-
tactically annotated. The complex-
ity of selected sentences exactly cor-
responds to the complexity of sen-
tences exercised in the current text-
books of Czech. A syntactically an-
notated sentence in PDT is repre-
sented as a tree with the same num-
ber of nodes as is the number of the
words in the given sentence. It dif-
fers from the schemes used at schools
(Grepl and Karl· k, 1998). On the
other side, the linear structure of PDT
morphological annotations was taken
as it is  only morphological cate-
gories relevant to school syllabuses
were preserved.
2. proposal and implementation of ex-
ercises. The general computer facil-
ities of basic and secondary schools
were taken into account while choos-
ing a potential programming lan-
guage to use. The Styx is imple-
mented in Java that meets our main
requirements  platform-independent
system and system stability.
At least to our knowledge, there is no
such system for any language corpus that
makes the schoolchildren familiar with an
academic product. At the same time, our
system represents a challenge and an op-
portunity for the academicians to popular-
ize a  eld devoted to the natural language
processing with promising future.
A number of electronic exercises of Czech
morphology and syntax were created.
However, they were built manually, i.e.
authors selected sentences either from
their minds or randomly from books,
newspapers. Then they analyzed them
manually. In a given manner, there is no
chance to build an exercise system that
re ects a real usage of language in such
amount the Styx system fully offers.
References
Jan Haji c, Eva Haji cov·a, Barbora Hladk·a, Petr Pajas,
Jarmila Panevov·a, and Petr Sgall. 2001. Prague
Dependency Treebank 1.0 (Final Production Label)
CD-ROM, CAT: LDC2001T10, ISBN 1-58563-212-0,
Linguistic Data Consortium.
Miroslav Grepl and Petr Karl· k 1998. Skladba ˇceˇsiny.
[Czech Langauge.] Votobia, Praha.
15
