Dependency Treebank for Russian: 
Concept, Tools, Types of Information
Igor BOGUSLAVSKY, Svetlana GRIGORIEVA,
Nikolai GRIGORIEV, Leonid KREIDLIN, Nadezhda FRID
Laboratory for Computational Linguistics
Institute for Information Transmission Problems
Russian Academy of Sciences
Bolshoi Karetnyi per. 19, 101447 Moscow - RUSSIA
{bogus, sveta, grig, lenya, nadya}@iitp.ru
Abstract 
The paper describes a tagging scheme designed
for the Russian Treebank, and presents tools used 
for corpus creation. 
1. Introductory Remarks
The present paper describes a project aimed at 
developing the first annotated corpus of Russian
texts. Large text corpora have long been used in the
computational linguistics community: at present,
over 20 large corpora for the main European
languages are available, the
largest of them containing hundreds of millions of
words (Language Resources (1997); Marcus,
Santorini, and Marcinkiewicz (1993); Kurohashi, 
Nagao (1998)). So far, however, no annotated 
corpora for Russian have been developed. To the 
best of our knowledge, the present project is the 
first attempt to fill the gap. 
Different tasks require different annotation levels
that entail different amounts of additional
information about text structure. The corpus that
is being created in the framework of the present
project consists of several subcorpora that differ 
by the level of annotation. The following three 
levels are envisaged: 
• lemmatized texts: for every word, its normal
form (lemma) and part of speech are
indicated;
• morphologically tagged texts: for every word,
a full set of morphological attributes is
specified along with the lemma and the part of
speech;
• syntactically tagged texts: apart from the full
morphological markup at the word level,
every sentence has a syntactic structure.
We annotate Russian texts with dependency
structures - a formalism that is more suitable for
Slavonic languages with their relatively free word
order. The structure not only contains information
on which words of the sentence are syntactically 
linked, but also assigns each link to one of
several dozen syntactic types (at present, we use
78 syntactic relations). This formalism ensures a
more complete and informative representation
than that of any other syntactically annotated corpus.
This is a major innovation, since the majority of
syntactically annotated corpora, both those
already available and those under construction,
represent the syntactic structure by means of 
constituents. 
The closest analogue to our work is the Czech 
annotated corpus collected at Charles University 
in Prague - see Hajicova, Panevova, Sgall (1998).
In this corpus, the syntactic data are also
expressed in a dependency formalism, although
the set of syntactic functional relations is much
smaller: it has only 23 relations.
In what follows, we describe the types of texts
used to create the corpus (Section 2), markup
format (Section 3), annotation tools and
procedures (Section 4), and types of linguistic
data included in the markup (Section 5). 
2. Source text selection 
The well-known Uppsala University Corpus of 
contemporary Russian prose, totalling ca. 
1,000,000 words, has been chosen as the primary
source for our work. The Uppsala Corpus is well
balanced between fiction and journalistic genres,
with a smaller percentage of scientific and popular
science texts. The Corpus includes samples of
contemporary Russian prose, as well as excerpts
from newspapers and magazines of recent
decades, and gives a representative coverage of 
written Russian in modern use. Conversational 
examples are scarce and appear as dialogues 
inside fiction texts. 
3. Markup format
The design principles were formulated as follows:
• "layered" markup - several annotation levels
coexist and can be extracted or processed 
independently; 
• incrementality - it should be easy to add 
higher annotation levels; 
• convenient parsing of the annotated text by 
means of standard software packages. 
The most natural solution to meet these criteria is
an XML-based markup language. We have tried
to make our format compatible with TEI (Text
Encoding Initiative, see TEI Guidelines
(1994)), introducing new elements or attributes
only in situations where TEI markup does not
provide adequate means to describe the text
structure in the dependency grammar framework.
Listed below are the types of information about text
structure that must be encoded in the markup, and
the corresponding tags/attributes used to carry them.
a) Splitting of text into sentences. A special 
container element <S> (available in TEI) is used 
to delimit sentence boundaries. The element may 
have an (optional) ID attribute that supplies a 
unique identifier for the sentence within the text; 
this identifier may be used to store information
about extra-sentential relations in the text. It may 
also have a COMMENT attribute, used by linguists 
to store observations about particular syntactic 
phenomena encountered in the sentence; 
b) Splitting of sentences into lexical items.
The words are delimited by a container
element <W>. Like sentences, words may have a
unique ID attribute that is used to reference the
word within the sentence;
c) Ascribing morphological features to words.
Morphological information is ascribed to the word
by means of two attributes borne by the <W> tag:
LEMMA - a normalized word form;
FEAT - morphological features. 
d) Storing information about the syntax structure. 
To annotate the information about syntactic 
dependencies, we use two other attributes in the 
<W> element: 
DOM - the ID of the master word;
LINK - syntactic function label. 
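As an illustration of items a) to d), the following Python sketch parses a small hand-made fragment in this markup and lists its dependency arcs. The element and attribute names (S, W, ID, LEMMA, FEAT, DOM, LINK) are the ones described above. The sample sentence, the feature strings, the link label and the use of the value "root" for the top node are assumptions made for the example and should be checked against the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Hand-made sample; FEAT values, the LINK label and DOM="root" are assumptions.
SAMPLE = """
<S ID="1">
  <W ID="1" LEMMA="PETE" FEAT="S" DOM="2" LINK="predicative">Pete</W>
  <W ID="2" LEMMA="READ" FEAT="V" DOM="root">reads</W>
</S>
"""

def dependency_arcs(sentence_xml):
    """Return (master_form, link, slave_form) triples for one <S> element."""
    sentence = ET.fromstring(sentence_xml.strip())
    words = {w.get("ID"): w for w in sentence.iter("W")}
    arcs = []
    for w in words.values():
        master = words.get(w.get("DOM"))
        if master is not None:  # the top node has no master among the words
            arcs.append((master.text, w.get("LINK"), w.text))
    return arcs

print(dependency_arcs(SAMPLE))  # [('reads', 'predicative', 'Pete')]
```

Because the format is plain XML, a standard library parser is sufficient here, which is in line with the third design principle.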
There are also special provisions in the formalism
to store auxiliary information, e.g. multiple 
morphological analyses and syntax trees. They are 
expected to disappear from the final version of the 
corpus. 
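The "layered" principle stated at the start of this section suggests that a lower annotation level can be derived from a higher one by dropping attributes. A minimal Python sketch of this idea follows, assuming the attribute names given above. The sample sentence and its FEAT and LINK values are invented for the example, and the function name strip_syntax_layer is hypothetical.

```python
import xml.etree.ElementTree as ET

# Attributes that carry the syntactic layer (names as described in the text).
SYNTAX_ATTRS = ("DOM", "LINK")

# Hand-made sample; the attribute values are illustrative only.
SAMPLE = (
    '<S ID="1">'
    '<W ID="1" LEMMA="PETE" FEAT="S" DOM="2" LINK="predicative">Pete</W>'
    '<W ID="2" LEMMA="READ" FEAT="V" DOM="root">reads</W>'
    '</S>'
)

def strip_syntax_layer(sentence_xml):
    """Return the sentence element with only the morphological layer kept."""
    sentence = ET.fromstring(sentence_xml)
    for w in sentence.iter("W"):
        for name in SYNTAX_ATTRS:
            w.attrib.pop(name, None)  # ignore words that lack the attribute
    return sentence

morph_only = strip_syntax_layer(SAMPLE)
print(ET.tostring(morph_only, encoding="unicode"))
```

The same approach would yield a lemmatized-only layer if FEAT were reduced to the part of speech, although how the part of speech is encoded inside FEAT is not specified here.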
4. Annotation tools and procedures 
The procedure of corpus data acquisition is
semi-automatic. An initial version of markup is
generated by a computer using a general-purpose
morphological analyzer and syntax parser engine;
after that, the results of the automatic processing 
are submitted to human post-editing. The analysis 
engine (morphology and parsing) is based upon 
the ETAP-3 machine translation engine - see 
Apresjan et al. (1992, 1993). 
To support the creation of annotated data, a set of
tools was designed and implemented. All tools are
Win32 applications written in C++. The tools
available are:
• a program for sentence boundaries markup,
called Chopper;
• a post-editor for building, editing and managing
syntactically annotated texts - Structure
Editor (or StrEd).
The amount of manual work required to build 
annotations depends on the complexity of the 
input data. StrEd offers different options for
building structures. Most sentences can be reliably 
processed without any human intervention; in this 
case, a linguist should look through the processing 
result and confirm it. If the structure contains 
errors, the linguist can edit it using a user-friendly 
graphical interface (see screenshots below). If the 
errors are too many or no structure could be 
produced, the linguist may use a special
split-and-run mode. This mode includes manual
pre-chunking of the input phrase into pieces with a
more transparent structure and applying the 
analyzer/parser to every chunk. Then the linguist 
must manually link the subtrees produced for 
every chunk into a single structure. 
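The last step of the split-and-run mode, linking the per-chunk subtrees into one structure, can be sketched in Python as follows. This is only an illustration of the idea and not the StrEd implementation: the data layout (a dictionary from word ID to a (master ID, link) pair, with None marking a chunk root), the function name merge_chunks and the link labels in the usage example are all assumptions.

```python
def merge_chunks(chunk_trees, manual_links):
    """Merge per-chunk trees into one sentence tree.

    chunk_trees: list of dicts {word_id: (master_id, link)}; the root of
        each chunk has master_id None.
    manual_links: dict {chunk_root_id: (master_id, link)} chosen by the
        linguist to attach chunk roots to words of other chunks.
    """
    merged = {}
    for tree in chunk_trees:
        overlap = merged.keys() & tree.keys()
        if overlap:
            raise ValueError(f"word IDs occur in two chunks: {sorted(overlap)}")
        merged.update(tree)
    for root_id, (master_id, link) in manual_links.items():
        if merged[root_id][0] is not None:
            raise ValueError(f"word {root_id} is not a chunk root")
        if master_id not in merged:
            raise ValueError(f"unknown master word {master_id}")
        merged[root_id] = (master_id, link)
    roots = [w for w, (master, _) in merged.items() if master is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {roots}")
    return merged

# Two chunks parsed separately, then linked by hand (labels are illustrative).
chunk_a = {1: (2, "predicative"), 2: (None, None)}
chunk_b = {3: (4, "predicative"), 4: (None, None)}
tree = merge_chunks([chunk_a, chunk_b], {2: (4, "adverbial")})
print(tree[2])  # (4, 'adverbial')
```

The sketch only checks that a single root remains after linking; it does not check for cycles.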
If the linguist has encountered a very peculiar 
syntactic construction so that he/she is uncertain 
about the correct structure, he/she may mark as
"doubtful" the whole sentence or single words
whose functions are not completely clear. The
information will be stored in the markup, and
StrEd will visualize the respective sentence as
one in need of further editing.
Fig. 1 presents the main dialog window for
editing sentence properties. An operator can edit
the markup directly, or edit single properties using
a graphical interface. The source text under
analysis is written in an edit window at the top:
Xotja pis'mo ne bylo podpisano, ja mgnovenno
dogadalsja, kto ego napisal [Although the letter
was not signed, I instantly guessed who had
written it]. The information about single words is
written into a list: e.g. the first word xotja
[although] has an identifier ID="1"; the
lemmatized form is XOTJA; its feature list
consists of a single feature - a part-of-speech
characteristic (it is a conjunction); the word
depends on a word with ID="8" by an adverbial
relation (link type is "adverb"). By
double-clicking an item in the word list or pressing the
button, a linguist can invoke dialog windows for
editing properties of single words. However, the
most convenient way of editing the structure
consists in invoking a Tree Editor window,
shown in Fig. 2 with the same sentence as in the
previous picture.
The Tree Editor interface is simple and natural.
Words of the source sentence are written on the
left, their lemmas are put into gray rectangles, and
their morphological features are written on the
right. The syntactic relations are shown as arrows
directed from the master to the slave; the link
types are indicated in rounded rectangles on the
arcs. All text fields except for the source sentence
are editable in-place. Moreover, one can drag the
rounded rectangles: dropping one on a word means
that this word is declared a new master for the
word from which the rectangle was dragged. A
single right-button click on the lemma rectangle
brings out the word properties dialog. All colors,
sizes and fonts are customizable.
Figure 1. Sentence properties dialog in StrEd.
Figure 2. Tree Editor dialog in StrEd.
5. Types of linguistic information by level 
Morphology information
The morphological analyzer ascribes features to 
every word. The feature set for Russian includes: 
part of speech, animateness, gender, number, 
case, degree of comparison, short form (of 
adjectives and participles), representation (of 
verbs), aspect, tense, person, voice. 
Syntax information 
As we have already mentioned, the result of the 
parsing is a tree composed of links. Links are 
binary and oriented; they link single words rather 
than syntactic groups. For every syntactic group, 
one word (head) is chosen to represent it as a 
slave in larger syntactic units; all other members 
of the group become slaves of the head. 
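The properties just listed (binary, oriented links between single words that together form a tree) can be verified mechanically. The Python sketch below is one possible check and not part of the tools described in this paper. It assumes that each word stores the ID of its master and that the top node is marked with None; the function name check_tree is hypothetical.

```python
def check_tree(masters):
    """Check that the links form a tree and return the ID of its root.

    masters: dict {word_id: master_id}, with None for the top node.
    Raises ValueError if there is not exactly one root, if a master ID
    is unknown, or if the links contain a cycle.
    """
    roots = [w for w, m in masters.items() if m is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {roots}")
    for word in masters:
        seen = set()
        node = word
        while node is not None:  # climb from the word up to the root
            if node not in masters:
                raise ValueError(f"unknown master word {node}")
            if node in seen:
                raise ValueError(f"cycle through word {node}")
            seen.add(node)
            node = masters[node]
    return roots[0]

# Words 1 and 3 both depend on word 2, which is the top node.
print(check_tree({1: 2, 2: None, 3: 2}))  # 2
```

Since every word has exactly one master entry, a structure that passes this check has one incoming link per non-root word and is connected through the root.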
In a typical case, the number of nodes in the 
syntactic tree corresponds to the number of word 
tokens. However, several exceptional situations 
occur in which the number of nodes may be less 
or even greater than the number of word tokens. 
The latter case is especially interesting. We 
postulate such a description in the following 
cases: 
a) Copulative sentences in the present tense
where the auxiliary verb can be omitted. This
is treated as a special "zero-form" of the
copula, e.g. On - uchitel' [He is a teacher, lit.
He - teacher]. The copula should be
introduced in the syntactic representation.
b) Elliptical constructs (omitted members of
contrasted coordinative expressions), as in
Ja kupil rubashku, a on galstuk [I bought a
shirt, and he bought a necktie, lit. I bought a
shirt, and he a necktie].
The latter type of sentences should be discussed in 
more detail. Elliptical constructions are known to 
be one of the toughest problems in the 
formalization of natural language syntax. In our
corpus, we decided to reconstruct the omitted
elements in the syntactic trees, marking them with
a special "phantom" feature. In the above
example, a phantom node is inserted into the
sentence between the words on 'he' and galstuk
'necktie'. This new node will have a lemma
POKUPAT' [BUY] and will bear exactly the same
morphological features as the wordform kupil
[bought] physically present in the sentence, plus a
special "phantom" marker. In certain cases, the
feature set for the phantom may differ from that of
the prototype, e.g. in a slightly modified phrase Ja
kupil rubashku, a ona galstuk [I bought a shirt,
and she (bought) a necktie] the phantom node will
have the feminine gender, as required by the 
agreement with the subject of the second clause. 
Most real-life elliptical constructs can be 
represented in this way. 
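The construction of a phantom node from its prototype can be sketched in Python as follows. The sketch only mirrors the description above: the Word class, the make_phantom function, the empty surface form of the phantom and the English feature names are assumptions introduced for the example, not the corpus encoding.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Word:
    form: str            # surface form; assumed empty for a phantom node
    lemma: str
    feats: frozenset     # morphological features (illustrative names)
    phantom: bool = False

def make_phantom(prototype, drop=(), add=()):
    """Copy the prototype's lemma and features and mark the copy as phantom.

    drop/add allow the feature set to deviate from the prototype, e.g. to
    agree in gender with the subject of the second clause.
    """
    feats = (prototype.feats - frozenset(drop)) | frozenset(add)
    return replace(prototype, form="", feats=feats, phantom=True)

kupil = Word("kupil", "POKUPAT'", frozenset({"V", "past", "sg", "masc"}))

# Ja kupil rubashku, a ona galstuk: the phantom takes the feminine gender.
phantom = make_phantom(kupil, drop={"masc"}, add={"fem"})

words = ["Ja", "kupil", "rubashku", "a", "ona", "galstuk"]
nodes = words[:5] + [phantom] + words[5:]  # inserted between ona and galstuk
print(phantom.lemma, sorted(phantom.feats), phantom.phantom)
```

Keeping the prototype immutable and deriving the phantom as a modified copy makes it easy to see which features were changed for agreement.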
The inventory of syntactic relationship types 
generated by the ETAP-3 system is quite large:
at present, we count 78 different syntactic 
function types. All relationships are divided into 6 
major groups: actant, attributive, quantitative,
adverbial, coordinative, auxiliary.
For readers' convenience, we will give equivalent
English examples:
Actant relationships link the predicate word to
its arguments. Some examples ([X] - master,
[Y] - slave):
predicative - Pete [Y] reads [X];
completive (1, 2, 3) - translate [X]
the book [Y, 1-compl]
from [Y1, 2-compl] English
into [Y2, 3-compl] Russian.
Attributive relationships often link a noun to a
modifier expressed by an adjective, another noun,
a participle clause, etc.:
relative - The house [X] we live [Y] in.
Quantitative relationships link a noun to a word
with quantity semantics, or two such words one to
another:
quantitative - five [Y] pages [X];
auxiliary-quantitative - thirty [Y] five [X];
Adverbial relationships link the predicate word to
various adverbial modifiers:
adverbial - come [X] in the evening [Y];
parenthetic - In my opinion [Y], that's [X] right.
Coordinative relationships serve for clauses
coordinated by conjunctions:
coordinative - buy apples [X] and pears [Y];
coordinative-conjunctive - buy apples
and [X] pears [Y].
Auxiliary relationships typically link two
elements that form a single syntactic unit:
analytical - will [X] buy [Y];
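The grouping illustrated by these examples can be written down as a lookup table, for instance to count links per major group. The Python sketch below covers only the relations named in the examples above, not the full inventory of 78. The English label spellings (including "1-compl" to "3-compl") and the function name group_counts are assumptions, and the labels actually stored in the LINK attribute may differ.

```python
from collections import Counter

# Relation -> major group, restricted to the relations named in the examples.
GROUP_OF = {
    "predicative": "actant",
    "1-compl": "actant",
    "2-compl": "actant",
    "3-compl": "actant",
    "relative": "attributive",
    "quantitative": "quantitative",
    "auxiliary-quantitative": "quantitative",
    "adverbial": "adverbial",
    "parenthetic": "adverbial",
    "coordinative": "coordinative",
    "coordinative-conjunctive": "coordinative",
    "analytical": "auxiliary",
}

def group_counts(links):
    """Count link labels per major group; unlisted labels go to 'unknown'."""
    return Counter(GROUP_OF.get(link, "unknown") for link in links)

# Links of "translate the book from English into Russian" (completives only).
print(group_counts(["1-compl", "2-compl", "3-compl"]))
```

Labels missing from the table are reported as "unknown" rather than rejected, which suits an inventory that is still being extended.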
The list of syntactic relations is not closed. The
process of data acquisition brings up a variety of
rare syntactic constructions, hardly covered by
traditional grammars. In some cases, this has led
to the introduction of new syntactic link types in
order to reflect the semantic relation between
single words and make the syntactic structure
unambiguous. 
Conclusion 
Corpus creation is not yet completed: at present,
the full syntactic markup has been generated for
4,000 sentences (55,000 words), which constitutes
30% of the total amount planned. Our approach
makes it possible to include all information expressed by
morphological and syntactic means in 
contemporary Russian. We expect that the new 
corpus will stimulate a broad range of further 
investigations, both theoretical and applied. 
We plan to make the corpus available via the ELRA
framework after completion. Samples of tagged
text, documentation and structure editing tools
will be available for download from our site:
http://proling.iitp.ru/Corpus/preview.zip.
Acknowledgements 
This work is supported by Russian Foundation of 
Fundamental Research, grant No. 98-0790072. 

References 

Apresjan Ju.D., Boguslavskij I.M., Iomdin L.L.,
Lazurskij A.V., Sannikov V.Z. and Tsinman L.L.
(1992). The Linguistics of a Machine Translation
System. Meta, 37 (1), pp. 97-112.

Apresjan Ju.D., Boguslavskij I.M., Iomdin L.L.,
Lazurskij A.V., Sannikov V.Z. and Tsinman L.L.
(1993). Système de traduction automatique ETAP.
In: La Traductique. P. Bouillon and A. Clas (eds).
Les Presses de l'Université de Montréal, Montréal.

Hajicova E., Panevova J., Sgall P. (1998). Language
Resources Need Annotations To Make Them
Really Reusable: The Prague Dependency
Treebank. In: Proceedings of the First International
Conference on Language Resources &
Evaluation, pp. 713-718.

Kurohashi S., Nagao M. (1998). Building a Japanese
Parsed Corpus while Improving the Parsing
System. In: Proceedings of the First International
Conference on Language Resources & Evaluation,
pp. 719-724.

Language Resources (1997). In: Survey of the State of
the Art in Human Language Technology. Eds.
G. B. Varile, A. Zampolli, Linguistica Computazionale,
vol. XII-XIII, pp. 381-408.

Marcus M. P., Santorini B., and Marcinkiewicz M.-A.
(1993). Building a Large Annotated Corpus of
English: The Penn Treebank. Computational
Linguistics, Vol. 19, No. 2.

TEI Guidelines (1994). TEI Guidelines for Electronic
Text Encoding and Interchange (P3). URL:
http://etext.lib.virginia.edu/TEI.html
