R{j}ecnik.com: English—Serbo-Croatian Electronic Dictionary
Vlado KEˇSELJ
Faculty of Computer Science
Dalhousie University, Halifax
Canada, vlado@cs.dal.ca
Tanja KEˇSELJ
Korlex Software
Bedford NS, Canada
tanja@keselj.net
Larisa ZLATI´C
Larisa Zlatic Language Services
Austin, Texas, USA
larisaz@serbiantranslator.com
Abstract
The features of R{j}ecnik.com dictionary, as
one of the first on-line English-Serbo-Croatian
dictionaries are presented. The dictionary has
been on-line for the past five years and has been
frequently visited by the Internet users. We
evaluate and discuss the system based on the
analysis of the collected data about site vis-
its during this five-year period. The dictionary
structure is inspired by the WordNet basic de-
sign. The dictionary’s source knowledge base
and the software system provide interfaces to
producing an on-line dictionary, a printed-paper
dictionary, and several electronic resources use-
ful in Natural Language Processing.
1 Introduction
The dictionaries, monolingual, bilingual, or
multilingual, are the standard way of collecting
and presenting lexicographic knowledge about
one or more languages. The electronic dic-
tionaries (EDs) are not merely a straightfor-
ward extension of their printed counterparts,
but they entail additional purely computational
problems.
ED as marked-up text. An ED may be seen
simply as a long, marked-up text. The im-
portant computational issues arise around the
problem of efficient keyword search and appro-
priate presentation of the dictionary data. The
search is performed in the context of a markup
scheme, such as SGML or XML, and the query
model has to provide expressibility for search
queries within this scheme; e.g, searching for
a keyword within a certain text region. An
example of such research is the OED project
conducted from 1987 through 1994 (Tompa and
Gonnet, 1999; OED, 2004). One of the achieve-
ments of the OED project was that the search
software was able to retrieve all occurrences of
words and phrases within the dictionary corpus
of size 570 MB in less than a second (Tompa
and Gonnet, 1999).
Knowledge-base Structure of an ED.
The second aspect of EDs is the structure of
information represented in them. This struc-
ture is of interest to linguists, lexicographers,
and various dictionary users, but it is of chief
interest to computational linguists. A major
computational challenge is how to design the
dictionary structure in order to make its main-
tenance manageable and efficient. Various lex-
ical resources that were developed in the last
few decades have become invaluable in Natural
Language Processing (NLP), most notably the
WordNet. Another reason why efficiency in dic-
tionary maintenance is important is that natu-
ral languages change dynamically and good ED
should track these lexical innovations. Differ-
ent domains need to be covered, and the parts
of the dictionary that are becoming old and ar-
chaic need to be time-stamped and archived as
such.
In this paper, we present a bilingual bidirec-
tional on-line Serbo-Croatian (SC)-English dic-
tionary that has been available on the Internet
since 1999. This is the first published report de-
scribing this resource. The dictionary internal
structure is motivated by the WordNet struc-
ture, and it provides a way of producing mono-
lingual SC and bilingual SC-English wordnet.
2 Related Work
The OED project (Tompa and Gonnet, 1999;
OED, 2004) is a related project that was dis-
cussed in section 1. There are many on-line
dictionaries on the Internet: monolingual, bilin-
gual, and even multilingual. Probably the most
comprehensive list is given at the site YourDic-
tionary.com1, collected by Robert Beard from
the Bucknell University, which lists on-line dic-
tionaries for 294 languages, including two en-
tries for sign languages (ASL and Sign).
There are not that many on-line SC-English2
1http://www.YourDictionary.com
2Under language name “Serbo-Croatian” (SC) we
assume labels Serbo-Croatian, Serbian, Croatian, or
Bosnian.
dictionaries. YourDictionary.com lists about
five such dictionaries. Most of them are narrow-
domain dictionaries. The Google directory3
lists seven dictionaries. Rjecnik.com4 is the old-
est one in these language pairs and is still active
and expanding. Tkusmic.com5 was created in
2003 and has a very similar interface. One of the
most popular dictionaries is Krstarica.com6. A
long list of dictionaries is given at Danko ˇSipka’s
web site.7 Many of those are not active any
more, or they are textual dictionary files with a
limited domain.
The WordNet (Miller, 2004) project is rele-
vant to our work, since we propose a dictionary
structure based on the building blocks that fol-
low the WordNet structure. As a result, a di-
rect by-product of our ED is an SC WordNet.
The task of creating a Serbo-Croatian Word-
Net is already underway within the Balkanet
project (Christodoulakis, 2002).
3 Project Description
Project history. The on-line dictionary
R{j}ecnik.com has been active since 1999. One
of its most visible characteristics, also noted by
other users, is simplicity of the user interface.
There is one search textual field in which the
user enters the query and the dictionary reports
all dictionary entries matching the query on ei-
ther English or SC side. It provides an efficient
search mechanism, returning the results within
a second.
Lexical resources. As a lexicographic re-
source, this is a wide-coverage, up-to-date, bidi-
rectional, and bilingual dictionary covering not
only general, often used terms, but also over
8,000 computer and Internet terms,8 as well as
healthcare and medical vocabulary, including
useful abbreviations. The entries are grouped
by semantic meaning and part of speech, in
the WordNet fashion. The English lexemes are
associated with their phonetic representations,
and the entries are marked by domain of usage
(e.g., computers, business, finance, medicine).
Colloquial and informal expressions are marked
3http://directory.google.com
4http://rjecnik.com and http://recnik.com
5http://www.tkuzmic.com/dictionary/
6http://www.krstarica.com/recnik/
7http://www.public.asu.edu/˜dsipka/rjeynici.html
8A number of the terms was collected
through public discussion at the e-mail list Ser-
bian Terminology maintained by Danko Sipka
(http://main.amu.edu.pl/mailman/listinfo/st-l).
with special symbols so that they can be eas-
ily identified. In addition, the dictionary con-
tains plenty of illustrative examples showing the
language in use. A suitable text encoding for
SC is used so that the software generates both
Latin (Roman) and Cyrillic script versions. Di-
alectical and geographical differences are also
marked.
Software overview. The dictionary software
is developed in the Perl programming language.
From the source dictionary file, the searchable
on-line resource file is generated. It is in tex-
tual format and it is indexed through an in-
verted file index for searchable terms in English
and SC. The searchable terms are chosen selec-
tively. The tags and descriptions are not search-
able since this would produce spurious search
results.
Dictionary structure. Following the ideas
from OED (Tompa and Gonnet, 1999), we
adopted the philosophy of modern text markup
systems that “a computer-processable version
of text is well-represented by interleaving ‘tags’
with the text of the original document, still leav-
ing the original words in proper sequence.” Ad-
ditionally, we adopted the ideas from the Word-
Net project (Miller, 2004) in structuring our
knowledge base around the basic entry unit be-
ing a meaning; i.e., one meaning = one en-
try. One source dictionary entry (vs. a printed,
or on-line dictionary entry) corresponds to one
synset in WordNet. It is represented in one
physical line in a textual file, or it may be stored
in several lines which are continued by having a
backslash (\) character at the end of each line
but the last one. An entry starts with the En-
glish lexemes separated by commas followed by
an equal sign (=), and the corresponding SC lex-
emes, also separated by commas. Additional
pertinent information is encoded using tags.
This representation is conceptually simple and
efficient in terms of manual maintenance and
memory use. It is also flexible, since it allows
tags to define features that refer to the whole
entry or just individual lexemes. Such rep-
resentation deviates from the commonly used
XML notation because we find the XML nota-
tion to be more “machine-friendly” than user-
friendly, but it can be automatically converted
to XML. To illustrate the difference between
TEI (Sperberg-McQueen and Burnard, 2003),
the standard XML-based markup scheme, and
our markup scheme, we adopt an example from
(Erjavec, 1999), which is shown in in Fig. 1.
(A)
<entry key="bewilder">
<form> <orth type=’hw’>bewilder</orth>
<pron>bI"wIld@(r)</pron> </form>
<gramgrp><pos>vtr</pos></gramgrp>
<sense orig=’sem’>
<trans><tr>zbuniti</tr>, <tr>zaplesti</tr>,
<tr>zavesti</tr>, <tr>posramiti</tr>,
<tr>pobrkati</tr></trans>
<eg><quote>too much choice can bewilder a
small child</quote>
<trans><tr>prevelik izbor mo"ze zbuniti
malo d{ij}ete</tr></trans>
</eg>
</entry>
(B)
abash [\eb’ae"s], bewilder [biw’ild\er], \
confound [kanf’aund], confuse [k\enfj’u:z]\
= :v zbuniti, zaplesti, zavesti, posramiti,\
:coll pobrkati :/coll, :eg too much choice\
can bewilder a small child = prevelik izbor\
mo"ze zbuniti malo d{ij}ete
Figure 1: Comparative example with TEI
The entry (A) in Fig. 1 shows an entry with
TEI markup, in (B) we give our corresponding
entry. The tags are preceded with a colon (:).
English lexemes are associated with their pho-
netic representations within the square brack-
ets. The phonetic representation is encoded us-
ing the vfon encoding.9 All changes to the dic-
tionary can be easily tracked down using the key
:id tag and the standard CVS (Control Version
System) system. The encoding ipp is used to en-
code SC text fragments, since they include ad-
ditional letters beside the standard 7-bit ASCII
set. The on-line version of the dictionary is en-
coded using the dual1 encoding for simplicity
and efficiency reasons. The input query can be
entered using the ipp encoding, and is trans-
lated into the dual1 encoding before matching.
The krascii encoding10 is additionally accepted
in the input query as the most common tran-
scribing scheme, although it inherently leads to
some incorrect matches.
A very systematic variation in SC is
ekavian vs. ijekavian dialect; for example:
mleko/mlijeko (milk) and primeri/primjeri (ex-
amples), but also hteo/htio (wanted). The
text is converted via the following regular ex-
9The details about different encodings such as ipp,
vfon, and dual1 are provided in (Keˇselj and others, 2004).
10Krascii is a simple transcribing scheme that ignores
diacritics.
POS tags: noun (n), verb (v), adjective (a), ad-
verb (adv), article (art), preposition (prep), conjunction
(conj), interjection (interj), pronoun (pron), numeral
(num), noun phrase (np), verb phrase (vp), symbol or
special character (sym), and idiom (idiom).
Morpho-syntactic features: diminutive (dim), femi-
nine (fm), imperfective (ipf), intransitive (itv), mascu-
line (m), neuter (nt), past participle (pp), perfective (pf),
plural (pl), preterite or past tense (pret), singular (sl),
and transitive (tv).
Dialect tags: American (am), Bosnian (bos), British
(br), Croatian (hr), Serbian (sr), and Old Slavic (ssl).
Domain tags: agriculture (agr), archaeological (archl),
architecture (archt), biology (bio), botany (bot), com-
puter (c), diplomacy (dipl), electrical (elect), chemistry
(chem), culinary (cul), law (law), linguistic (ling), math-
ematics (mat), medicine (med), military (mil), mythol-
ogy (myt), music (mus), religion (rel), sports (sp), and
zoology (zoo).
Computer science subareas, cob tag (e.g., cob pl):
internet (int), programing languages (pl), computational
linguistics (cl), graph theory (gt), cryptography (crypt),
data structures (ds), formal languages (fl), computer net-
works (cn), information retrieval (ir), and object oriented
programming (oop).
Misc.: abbreviation (abb), abbreviation expansion
(abbE), colloquial (coll), description (desc), example
(eg), obsolete (obs), see (see), unique entry identifier
(id), and vulgar (vul).
Figure 2: List of tags
Avg.visits Avg.time Len. of the
Year per day b/w visits longest query
1999 106 13m 34s 953
2000 249 5m 47s 710
2001 402 3m 34s 1556
2002 662 2m 10s 2492
2003 1018 1m 25s 4958
2004 2158 40s 1249
Figure 3: Site visit statistics
pression substitutions for ekavian and ijeka-
vian: s/\{(([^\|\}]*)\|)?([^\}]*)\}/$2/g
and s/\{(([^\|\}]*)\|)?([^\}]*)\}/$3/g.
The list of tags used in the dictionary is given
in Fig. 2.
4 Dictionary and Usage Statistics
The dictionary has been on-line for five years
(since 22-Jul-99). As of 28-Apr-2004, it has
60,338lexemes, organizedin20,911entries. The
average system response time is 0.4 sec. Some
site statistics are given in Fig. 3. The inter-
face is supposed to be used only for short-word
queries, but long queries are also submitted in
hope that the system would do machine transla-
tion. As can be seen from the figure, the longest
submitted query had the length of 4958 bytes.
Still, the majority of the queries are below 100
bytes: in 1999 there were 0.03% queries sub-
1999 2000 2001 2002 2003 2004
95 love 522 love 854 love 1252 hello 1977 hello 607 hello
95 hello 499 hello 756 hello 1205 love 1776 love 590 love
70 you 346 you 521 you 892 you 1287 you 416 you
57 devojka 215 good 324 good 487 i 707 good 259 i
38 i 170 f...(en) 278 i 453 good 705 i 216 good
34 k...(sc) 158 i 264 devojka 341 f...(en) 578 thank you 204 da
34 djevojka 154 I 254 f...(en) 335 thank you 573 f...(en) 191 se
30 djak 148 devojka 252 thank you 333 happy 551 beautiful 191 thank you
30 f...(en) 144 are 243 happy 330 beautiful 499 are 189 beautiful
28 word 141 thank you 218 I 319 I 486 i love you 185 volim
Figure 5: The most commonly asked queries (f...(en) and k...(sc) denote obscene words
 0
 0.1
 0.2
 0  5  6  7  10  15  20  25  30
Distribution in a year
Query length
19992000
20012002
20032004
Figure 4: Distribution of query lengths
mited longer than 100 bytes, 0.05% in 2000 and
2001, 0.14% in 2002, 0.27% in 2003, and 0.12%
in 2004. The distribution of query lengths less
than 30 bytes is given in Fig. 4. The most com-
monly asked queries are given in Fig. 5.
5 Conclusions and Future Work
We have presented the features of an electronic
English-SC dictionary. The dictionary is de-
signed to be multi-functional, providing the in-
terfaces to produce a printed dictionary copy
and an on-line searchable lexicon. We propose
a dictionary structure inspired by the WordNet,
which is flexible and easy to maintain. We also
report the site statistics of the on-line dictionary
during the last five years.
Future work. The plan for future work in-
cludes incorporating a lemmatizer that would
translate inflected word forms into their canoni-
cal representations. This is relevant for English,
but it is a more important issue in SC, which is a
highly-inflectional language. We do not know of
any lemmatizer or stemmer currently available
for SC. The software interfaces for producing a
wordnet form, and a TEI-encoded form will be
developed. An issue of long queries needs to be
addressed. Currently, if a user submits a long
query, which is usually a sentence or paragraph,
the dictionary reports “zero entries found.” A
fall-back strategy should be provided, which will
consist of tokenizing the input and giving the
results on querying separate lexemes.
6 Acknowledgments
We thank Danko ˇSipka, Duˇsko Vitas, and
anonymous reviewers for helpful feedback. The
first author is supported by the NSERC.

References
D. Christodoulakis. 2002. Balkanet: Design
and development of a multilingual Balkan
WordNet. WWW.
T. Erjavec. 1999. Encoding and presenting an
English-Slovene dictionary and corpus. In 4th
TELRI Seminar, Bratislava.
V. Keˇselj et al. 2004. Report on R{j}ecnik.com:
An English — Serbo-Croatian electronic dic-
tionary. Technical Report CS-2004-XX, Dal-
housie University. Forthcoming.
G.A. Miller. 2004. WordNet home page.
http://www.cogsci.princeton .edu/˜wn/.
OED. 2004. Oxford English Dictionary.
WWW. http://www.oed.com/, Apr. 2004.
C.M. Sperberg-McQueen and L. Burnard. 2003.
Text encoding initiative. http://www.tei-
c.org/P4X/index.html, accessed in May 2004.
F. Tompa and G. Gonnet. 1999. UW cen-
tre for the new OED and text research.
http://db.uwaterloo.ca/OED/.
D. Vitas, C. Krstev, I. Obradovi´c, Lj. Popovi´c,
and G. Pavlovi´c-Laˇzeti´c. 2003. An overview
of resources and basic tools for the process-
ing of Serbian written texts. In D. Piperidis,
editor, First workshop on Balkan Languages
and Resources, pages 1–8.
D. ˇSipka. 1998. Osnovi Leksikologije i Srodnih
Disciplina. Matica srpska.
