A Czech Morphological Lexicon 
Hana Skoumalovfi 
Institute of Theoretical and Computational Linguistics 
Charles University 
Celetn£ 13, Praha 1 
Czech Republic 
hana.skoumalova @ff . cuni. cz 
Abstract 
In this paper, a treatment of Czech 
phonological rules in two-level mor- 
phology approach is described. First 
the possible phonological alternations 
in Czech are listed and then their 
treatment in a practical application of 
a Czech morphological lexicon. 
1 Motivation 
In this paper I want to describe the way in which 
I treated the phonological changes that occur in 
Czech conjugation, declension and derivation. 
My work concerned the written language, but 
as spelling of Czech is based on phonological 
principles, moSt statements will be true about 
phonology, too, 
My task was to encode an existing Czech mor- 
phological dictionary (Haji~, 1994) as a finite 
state transducer. The existing lexicon was orig- 
inally designed :for simple C programs that only 
attach "endings" to the "stems". The quota- 
tion marks in the previous sentence mean that 
the terms are not used in the linguistic mean- 
ing but rather, technically: Stem means any 
part of a word: that is not changed in declen- 
sion/conjugation. Ending means the real ending 
and possibly also another part of the word that 
is changed. Wh:en I started the work on convert- 
ing this lexicon to a two-level morphology sys- 
tem, the first idea was that it should be linguis- 
tically more elegant and accurate. This required 
me to redesign the set of patterns and their cor- 
responding endings. From the original number 
of 219 paradigms I got 159 that use 116 sets of 
endings. Under the term paradigm I mean the 
set of endings that belong to one lemma (e.g. 
noun endings for all seven cases in both num- 
bers) and possible derivations with their cor- 
responding endings (e.g. possessive adjectives 
derived from nouns in all possible forms). That 
is why the number of paradigms is higher then 
the number of endings. 
In this approach, it is necessary to deal with 
the phonological changes that occur at bound- 
aries between the stem and the suffix/ending or 
between the suffix and the ending. There are 
also changes inside the stem (e.g. p~'tel 'friend' 
x p~dteld 'friends', or hndt 'to chase' x 5enu 
'I chase'), but I will not deal with them, as 
they are rather rare and irregular. They are 
treated in the lexicon as exceptions. I also will 
not deal with all the changes that may occur in 
a verb stem--this would require reconstructing 
the forms of the verbs back in the 14th cen- 
tury, which is outside the scope:of my work. 
Instead, I work with several stems of these ir- 
regular verbs. For example the verb hndt ('to 
chase') has three different stems, hnd- for infini- 
tive, 5en- for the present tense, imperative and 
present participles, and hna- for the past par- 
ticiples. The verb vdst ('to lead') has two stems, 
vds- for the infinitive and ved- for all finite forms 
and participles. The verb tit ('to cut') has the 
stem tn- in the present tense, and the stem ra- 
in the past tense; the participles can be formed 
both from the present and the past stem. For 
practical reasons we work either with one verb 
stem (for regular verbs) or with six stems (for 
irregular verbs). These six stems are stems for 
4-1 
infinitive, present indicative, imperative, past 
participle, transgressive and passive participle. 
In fact, there is no verb in Czech with six differ- 
ent stems, but this division is made because of 
various combinations of endings with the stems. 
2 Types of phonological alternations 
in Czech 
We will deal with three types of phonological 
alternations: palatalization, assimilation and 
epenthesis. Palatalization occurs mainly in de- 
clension and partly also in conjugation. Assimi- 
lation occurs mainly in conjugation. Epenthesis 
occurs both in declension and in conjugation. 
2.1 Epenthesis 
An epenthetic e occurs in a group of consonants 
before a O-ending. The final group of conso- 
nants can consist of a suffix (e.g. -k or -b) and 
a part of the stem; in this case the epenthesis is 
obligatory (e.g. kousek x kousku 'piece', malba 
x maleb 'painting'). In cases when the group 
is morphologically unseparable, the application 
of epenthesis depends on whether the group of 
consonants is phonetically admissable at word 
end. In loan words, the epenthetic e may occur 
if the final group of consonants reminds a Czech 
suffix (e.g. korek x korku 'cork', but alba x alb 
'alb'). In declension, two situations can occur: 
• The base form contains an epenthetic e; the 
rule has to remove it, if the form has a 
non-O ending, e.g. chlapec 'boy', chlapci 
dative/locative sg or nominative pl. 
• The base form has a non-O ending; the rule 
has to insert an epenthetic e, if the ending 
is O, e.g. chodba 'corridor', chodeb genitive 
pl. 
In conjugation, an epenthetic e occurs in the 
past participle, masculine sg of the verb jit 'to 
go' (and its prefixed derivations): gel 'he-gone', 
gla 'she-gone', glo 'it-gone'. The rule has to in- 
sert an epenthetic e if the form has a O-ending. 
2.2 Palatalization and assimilation 
Palatalization or assimilation at the morpheme 
boundaries occurs when an ending/suffix starts 
with a soft vowel. The alternations are different 
for different types of consonants. The types of 
consonants and vowels are as follows: 
• hard consonants--d, (g,)h, ch, k, n, r, t 
• soft consonants--c, d, d, j, ~, ÷, g, t, 2 
• neutral consonants--b, l, m; p, s, v, z 
• hard vowels--a, d, e, d, o, 6, u, ~, y, ~\] and 
the diphthong ou 
• soft vowels--d, i, ( 
The vowel d cannot occur in the ending/suffix 
so it will not be interesting for us. I also will not 
discuss what happens with 'foreign' consonants 
/, q, w and x--they would be treated as v, k, 
v and s, respectively. The only borrowing from 
foreign languages that I included to the above 
lists is g: This sound existed in Old Slavonic but 
in Czech it changed into h. However, when later 
new words with g were adopted from other lan- 
guages, this sound behaved phonologically as h 
(e.g. hloh, hlozich--from Common Slavonic glog 
'hawthorn', and katalog, kataloz(ch 'catalog'). 
The phonological alternations are reflected in 
writing, with one exception--if the consonants 
d, n and t are followed by a soft vowel, they are 
palatalized, but the spelling is not changed: 
spelling: d~, di phonology: /de/,/di/ 
ne, ni I el, la l 
t~, ti / \[e/, / \[i/ 
In other cases the spelling reflects the phonol- 
ogy. In the further text I will use { } for the 
morpho-phonological level, / / for the phonolog- 
ical level and no brackets for the orthographical 
level. In the cases where the orthography and 
phonology are the same I will only use the or- 
thographical level. Let us look at the possible 
types of alternation of consonants: 
• Soft consonant and ~-- The soft consonant 
is not changed, the soft ~ is changed to e. 
{d(d@} ---+ d(de 'pussycat' dative sg 
• Soft or neutral consonant and i/(-- No al- 
ternations occur. 
{ d(di} ~ didi 'pussycat' genitive sg 
• Hard consonant and a soft vowel -- The 
alternations differ depending on when and 
how the soft vowel originated. 
Assimilation: 
- {k j} -~ e 
tlak 'pressure' ---+ tladen 'pressed' 
- {hj)~ 
mnoho 'much, many' ~ mno2eni'mul- 
t/plying' 
- {gj}.-~2 
It is !not easy to find an example of 
i this sprt of alternation, as g only oc- 
curs in loan words that do not use the 
old t~rpes of derivation. In colloquial 
speec h it would be perhaps possible to 
creat~ the following form: 
pedaglog 'teacher' ---+ pedago2en( 'work- 
ing as a teacher' 
- {dj}-~z 
sladit 'to sweeten' ~ slazen('sweeten- 
ing' 
This sort of alternation is not pro- 
ductive any more--in newer words 
r palatalization applies: 
sladit.'to tune up' --+ slad~n( 'tuning 
up' 
In some cases both variants are pos- 
sible, :or the different variants exist in 
different dialects--the east (Moray/an) 
dialects tend to keep this phonolog- 
ical alternation, while the west (Bo- 
hemiah) dialects often abandoned it. 
- {tie} ~ ~e 
platit !to pay' ~ placen( 'paying' 
This alternation is also not productive 
any more. The newest word that I 
found which shows this sort of phono- 
log/ca! alternation is the word fotit 
'to take a photo' ~ focen( 'taking a 
photo ~. 
Palatalization: 
During the historical development of the 
language several sorts of palatalization 
occured--the first and second Slavonic 
palatalization and further Czech palataliza- 
tions. 
- {k~/ki} --+ 5e/di (1st pMat.) 
matka 'mother' ---+ matSin possesive 
adjective 
- {k~/ki) --~ ce/ci (2nd palat.) 
matka ~ matce dative/locative sg 
- {hi/hi} ~ 2e/2i (1st palat.) 
B~h 'God' ~ Bo2e vocative sg 
- {hi/hi} ~ ze/zi (2nd palat.) 
Bgh ~ Bozi nominative/vocative pl 
- {g~/gi} ~ 2e/2i (1st palat.) 
Jaga a witch from Russian tales --~ 
Ja2in possesive adjective 
- {ge/gi} -+ ze/zi (2nd palat.) 
Jaga ~ Jaze dative/locative sg 
- { d~} ~ / de/--4 dg 
rada 'council' --~ radg dative/locative 
sg 
- {t4 --~ lie/--~ t~ 
teta 'aunt' --+ tet~ dative/locative sg 
Both palatalization and assimilation yields 
the same result: 
- {oh} ~ 
moucha 'fly' -+ mouse dative/locative 
sg, muM derived adjective 
- {n) ~/~/~ 
hon 'chase' ---+ honit 'to chase', hongn~\] 
'chased' 
- {r)-~ ~ 
vat 'boil' --~ va÷it 'to cook', va÷en( 
'cooking' 
• Neutral consonant and ~--:The alterna- 
tions differ depending on when and how 
originated. 
Assimilation: 
- { bje} ~ be 
zlobit 'to irritate' ---+ {zlobjem\] 
zloben( 'irritating' 
- {m j4 -~ .~e 
zlomit 'to break' ~ {zlornjen~\]} --+ 
zlornen~ 'broken' 
- {pie} ~ pe 
kropit 'to sprinkle' ----+ { kropjen,~ --+ 
kropeni 'sprinkling' 
- {vie} -+ ve 
lovit 'to hunt' ---+ {lovjen~\] -+ loven( 
'hunting' 
- {sje} ~ ge 
prosit 'to ask' --+ {prosjenz~ -+ proven( 
'asking' 
This type of assimilation is not pro- 
ductive any more. In newer deriva- 
tions {sje} --+ se (e.g. kosit 'to mow' 
kosen( 'mowing') . 
- {zje} ~ 2e 
kazit 'to spoil' ~ { kazjenz~ -+ ka2en( 
'spoiling' 
This type of assimilation is also not 
productive any more. In newer deriva- 
tions {zje} ~ ze (e.g. ~et&it 'to con- 
catenate' --+ ÷et&eni'concatenating'). 
Palatalization: 
With b, m, p and v no alternation occurs 
({vrb~} 'willow' dative/locative sg ---+ vrb~). 
- {s~) + se 
rosa 'wasp' ---+ {vos@} ~ rose da- 
tive/locative sg 
- {z~} --~ ze 
koza 'goat' --.+ {koz@} --+ koze da- 
tive/locative sg 
Both palatalization and assimilation yields 
the same result: 
- {lje} -+ le 
akolit 'to school' --+ {$koljem~ 
gkolen( 'schooling' 
- {le} ~ le 
~kola 'school' -+ { $kol~} ~ ~kole da- 
tive/locative sg 
• Group of hard consonants and a soft vowel. 
Here again either palatalization or assimi- 
lation occurs. 
Assimilation: 
- {stj} ~ Igtl 
distit 'to clean' --+ 5igt~n( 'cleaning' 
- {slj} -~ ~z 
myslit 'to think' --+ my~leni'thinking' 
Palatalization: 
- {.k} +/~i/ 
kamarddsk~\] 'friendly' ~ kamarddgt( 
masculine animate, nominative pl, ka- 
marddgt~jg( 'more friendly' 
- {ck} ~/d/ 
5ack~\] 'brave' ~ 5aSt( masculine ani- 
mate, nominative pl, 5a2t~jM 'braver' 
- {ek) +/d/ 
2lu\[oudkU 'yellowish' ~ 2lu\[oudt~jg( 
'more yellowish', but 21ufoudc( mascu- 
line animate, nominative pl 
The alternations affect also the vowel ~. 
When it causes palatalization or assimilation of 
the previous consonant, it looses its 'softness', 
i.e. ~ --~ e: 
{matk@} ~ matce 
{ sestr@} ~ sest÷e 
{ gkol@} --+ gkole 
3 Phenomena treated by two-level 
rules in the Czech lexicon 
As the Czech lexicon should serve practical ap- 
plications I did not try to solve all the prob- 
lems that occur in Czech phonology. I concen- 
trated on dealing with the alternations that oc- 
cur in declension and regular conjugation, and 
the most productive derivations. The rest of al- 
ternations occurring in conjugation are treated 
by inserting several verb stems in the lexicon. 
The list of alternations and other changes cov- 
ered by the rules: 
• epenthesis 
• palatalization in declension 
• palatalization in conjugation 
• palatalization in derivation 
nouns from masculines 
of feminine 
• palatalization in derivation of possessive 
adjectives 
• palatalization in derivation of adverbs 
• palatalization in derivation of comparatives 
of adjectives and adverbs 
• palatalization or assimilation in derivation 
of passive participles 
• shortening of the vowel in suffixes -ik (in 
derivation of feminine noun from mascu- 
line) and-~v (in declension of possesive ad- 
jectives) 
For the CZech lexicon I used the software r 
tools for two-level morphology developed at Xe- 
rox (Karttune.n and Beesley, 1992; Karttunen, 
1993). The le:kical forms are created by attach- 
ing the proper ending/suffix to the base form 
in a separate:program. To help the two-level 
rules to find where they should operate, I also 
marked morpheme boundaries by special mark- 
ers. These markers have two further functions: 
• They bear the information about the length 
of ending i(or suffix and ending) of the base 
form, i.e. how many characters should be 
removed before attaching the ending. 
• They bear the information about the kind 
of alternation. 
Beside the markers for morpheme boundaries 
I also use markers for an epenthetic e. As I said 
before, e is inserted before the last consonat of a 
final consonant group, if the last consonant is a 
suffix, or if the consonant group is not phoneti- 
cally admissable. However, as I do not generally 
deal with derivation nor with the phonetics, I 
am not able to recognize what is a suffix and 
what is phone~ically admissable. That is why I 
need these special markers. 
Another auxiliary marker is used for mark- 
ing the suffix -~7~, that needs a special treatment 
in derivation of feminine nouns and their poss- 
esive adjectives. The long vowel/must be short- 
ened in the derivation, and the final k must be 
palatalized even if the O-ending follows. I need 
a special marker, as -ik- allows two realizations 
for both the sohnds in same contexts: 
Two realizations of i 
d~edn~7~ 'clerk' ~ d~ednice 'she-clerk', but 
rybnzT~ 'pond' ~ rybnlce locative sg 
Two realizations of k 
d÷ednzT~ x d÷ednic (genitive pl of the derived 
feminine) 
i 
In the previous section, I described all pos- 
sible alternations concerning single consonants. 
When I work with the paradigms or with the 
derivations, it is necessary to specify the kind 
of the alternation for all consonants that can 
occur at the boundary. For this purpose I in- 
troduced four types of markers: 
"1P -- 1st palatalization for g, h and k, or 
the only possible (or no) palatalization for 
other consonants. I use this marker also for 
palatalization c --~ 5 in vocative sg of the 
paradigm chlapec. The final c is in fact a 
palatalized k, so there is even a linguistic 
motivation for this. 
A2P -- 2nd palatalization for g, h and k, or 
the only possible (or no) palatalization for 
other consonants. 
^A -- Assimilation (or nothing). 
AN --- NO alternation. 
These markers are followed by a number that 
denotes how many characters of the base form 
should be removed before attaching the end- 
ing/suffix. Thus there are markers ~ 1P0, ^2P0, 
^1P1, etc. The markers starting with ^N only 
denote the length of the ending of the base 
form--and instead of using ^N0 I attach the 
suffix/ending directly to the base form. For- 
tunately, nearly all paradigms and derivations 
cause at most one type of alternation, so it 
is possible to use one marker for the whole 
paradigm. 
The markers for an epenthetic e are ^El (for 
e that should be deleted) and ^E2 (for e that 
should be inserted). The marker for the suffix 
-zTc in derivations is ^ IK. 
Here are some examples of lexical items and 
the rules that transduce them to the surface 
form: 
(1) doktorka ^ 1Plin^2P0~ch 
The base form is doktorka 'she-doctor'. The 
marker ^IP1 denotes that the possible alter- 
nation at this morpheme boundary is (first) 
palatalization and that the length of the end- 
ing of the base form is 1 (it means that a must 
be removed from the word form and the possi- 
ble alternation concerns k). The marker ~2P0 
means that the derived possessive adjective has 
a O-ending and the possible alternation at this 
morpheme boundary is palatalization. If we 
rewrite this string to a sequence of morphemes 
we get the following string: doktork-in-~jch. The 
sound k in front of i is palatalized, so the cor- 
rect final form is doktordin~eh, which is genitive 
plural of the possessive adjective derived from 
the word doktorka. 
Let us look now at the two-level rules that 
transduce the lexical string to the surface string. 
We need four rules in this example: two for 
deleting the markers, one for deleting the end- 
ing -a, and one for palatalization. The rules for 
deleting auxiliary markers are very simple, as 
these markers should be deleted in any context. 
The rules can be included in the definition of 
the alphabet of symbols: 
Alphabet 
7j IP0 : 0 7j 1P1:0 
7.'2P0:0 7,'2PI:0 7j2P2:0 7,'2P3:0 
7jA2:0 
Z'NI:0 Z'N2:0 Z'N3:0 Z'N4:0 
Y,'EI:0 Y.'E2:0 Y.'IK:0 
This notation means that the auxiliary markers 
are always realized as zeros on the surface level. 
The rule for deleting the ending -c looks as 
follows: 
"Deletion of the ending -a-" 
a:O <=> _ \[ Y,'NI: I ~jiPI: I ~,'2Pl: \] ; 
_ t: \[ Z'N2: I Z'N4: \]; 
The first line of the rule describes the context 
of a one-letter nominal ending u, and the second 
line describes the context of an infinitive suffix 
with ending -at or -ovut. 
The rule for palatalization k -+ d looks as fol- 
lows: 
"First palatalization k -> ~" 
k:~ <=> _ (7,'IK:) \[ a: I ~: \] 7.'iPi: i ; 
NonCeS: (End) 7.'1PI: ~: ; 
The first line describes two possible cases: ei- 
ther the derivation of a possesive adjective from 
a feminine noun (doktorku--~ doktordin), or the 
derivation of a possesive adjective from a fem- 
inine noun derived from a masculine that ends 
with -~7~ ( ~ednzT~ ~ ( d÷ednice -+) d÷ednidin). 
The second context describes a comparative 
of an adjective, or a comparative of adverb de- 
rived from that adjective (ho÷k~\] ~ ho÷dejM, 
ho~deji). The set NonCCS contains all character 
except c, d and s and it is defined in a speciM 
section. This context condition is introduced, 
because the groups of consonants ck, dk and sk 
have different 1st palatalization. 
The label End denotes any character that can 
occur in an ending and that is removed from the 
base form. 
(2) korek'2P0^Elem 
The base form of this word form is korek 'cork'; 
the marker ^2P0 means their the possible alter- 
nation is (second) palatalization and that the 
length of ending of the base form is 0. The 
marker ^El means that the base form contains 
an epenthetic e, and em is the ending of in- 
strumental singular. The correct final form is 
korkem. The rule for deleting an (epenthetic) e 
follows: 
"Deletion of e" 
e:0 <=> Cons c: 7,'N2:; 
\[ YjIPI~" I 7j2P1: I Y,'NI: I 7jN2: \]; 
Dons Cons: (\[Z'IPO: IZ'2PO:\]) Z'Ei: Vowel: ; 
t:0-\[ Z*2P2: I Z'N2: \]; 
The first line describes the context for dele- 
tion of the suffix -ec in the derivation of the type 
v~dec 'scientist' --+ v~dkyn~ 'she-scientist'. 
The second context is the context of the end- 
ing -e or the suffix -ce. This suffix must be 
removed in the derivation of the type soudce 
'judge' ~ soudkyn~ 'she-judge'. : 
The third context is the context of an 
epenthetic e that is present in the base form 
and must be removed from a form with a non-O 
ending. The sets Cons and Vowel contain all 
consonants and all vowels, respectively. 
The fourth line describes the context for dele- 
tion of the infinitive ending -et. 
The whole program contains 35 rules. Some 
of the rules concern rather morphology than 
phonology; namely the rules that remove end- 
ings or suffixes. One rule is purely technical; 
it is one of the two rules for the alternation 
ch ~ ~, as c and h must be treated separately 
4.6 
(though ch is considered one letter in Czech 
alphabet). Six rules are forced by the Czech 
spelling rules (e.g. rules for treating /d/, /t/ 
and/~/in various contexts, or a rule for rewrit- 
ing y ~ i after soft consonants). 18 rules deal 
b with the actual phonological alternations and 
they cover the whole productive phonological 
system of Czech language. The lexicon using 
these rules was tested on a newspaper text con- 
taining 2,978,320 word forms, with the result of 
more than 96% analyzed forms. 
4 Acknowledgements 
My thanks to Ken Beesley, who taught me how 
to work with the Xerox tools, and to my fa- 
ther, Jan Skoumal, for fr~uitful discussions on 
the draft of tNis paper. 

References 
Jan Hajji. 1994. Unification Morphology Grammar, 
Ph.D. dissertation, Faculty of Mathematics and 
Physics, Charles University, Prague. 
Josef Holub, and Stanislav Lyer. 1978. StruSn~ 
etymologick~ :slovnzT~ jazyka 5eskdho (Concise et- 
ymological dictionary of Czech language), SPN, 
Prague. 
Lauri Karttunen, and Kenneth R. Beesley. 1992. 
Two-Level Role Compiler, Xerox Palo Alto Re- 
search Center', Palo Alto. 
Lauri Karttunen. 1993. Finite-State Lexicon Com- 
piler, Xerox Palo Alto Research Center, Palo Alto. 
Kimmo Koskenniemi. 1983. Two-level Morphology: 
E A General Computational Model for Word-Form 
Recognition ~ind Production, Publication No. 11, 
University of iHelsinki. 
Arno~t Lamprecht, Dugan Slosar, and Jaroslav 
Bauer. 1986.i Historickd mluvnice 5egtiny (His- 
torical Grammar of Czech), SPN, Prague. 
Jan Petr et al. 1!986. Mluvnice 5egtiny (Grammar of 
Czech), Academia, Prague. 
Jana Weisheiteiov£, Kv~ta Kr£1fkov£, and Petr 
Sgall. 1982. Morphemic Analysis of Czech. No. 
VII in Explizite Beschreibung der Sprache und au- 
tomatische Textbearbeitung, Faculty of Mathemat- 
ics and Physics, Charles University, Prague. 
