Formal Morphology
Jan HAJIČ
Research Institute of Mathematical Machines
Loretánské nám. 3
118 55 Praha 1, Czechoslovakia
Abstract
A formalism for the description of a 
system of formal morphology for flexive and 
agglutinative languages (such as Czech) is
presented, borrowing some notions and the 
style from the theory of formal languages. 
Some examples (for Czech adjectives) are 
presented at the end of the paper. In these 
examples, the formalism's rules are used for 
the phonology-based changes as well, but 
nothing prevents the use of a separate 
phonology level (e.g. that of Koskenniemi's
two-level model) as a front- (and back-) end 
for the analysis (and synthesis). 
1. The Motivation
When a computer is used, the morphological
level is the basis for building the
syntactico-semantic part of any NL analysis.
The CL world began to pay more attention to
morphology only after /Koskenniemi 1983/ was
published. However, as Kay remarked (e.g.
in /Kay 1987/), what was actually done in
/Koskenniemi 1983/ was phonology. Moreover,
the strategy used there is best suited to
agglutinative languages with an almost
one-to-one mapping between morphemes and
grammatical meanings, and Slavonic languages
are different in this respect.
One of the practical reasons for
formalizing morphology is that although there
are some computer implementations using a
Czech morphology subsystem (/Hajič, Oliva
1986/, /Kirschner 1983/, /Kirschner 1987/),
based on the same sources (/EBSAT VI 1981/,
/EBSAT VII 1982/), no unifying formalism for
a complete description of formal morphology
exists.
2. The Formalism
The terms alphabet, string, concatenation,
..., the symbol $N$ (positive integers) and
the indexes $*$ and $+$ are used here in the
same way as in formal grammar theory; the
symbol $\mathrm{exp}(A)$ denotes the set of all
subsets of $A$, and $e$ denotes the empty
string. Uppercase letters are used mainly for
denoting sets and newly defined structures;
lowercase letters are used for mappings, for
elements of an alphabet and for strings.
Definition 1. A finite set $K$ of symbols is
called a set of grammatical meanings (or
simply meanings for short); values from $K$
represent values of morphological categories
(e.g., sg may represent singular number, p3
may represent dative ("3rd case") for nouns,
etc.).
Definition 2. A finite set
$D \subseteq \{(w,i) \in A^* \times (N \cup \{0\})\}$,
where $A$ is an alphabet, is called a
dictionary. A pair $(w,i) \in D$ is called a
dictionary entry, $w$ is a lexical unit and $i$
is called the pattern number. In the linguistic
interpretation, a lexical unit represents the
notion "systemic word", but it need not be
represented by a traditional dictionary form.
Definition 3. Let $A$ be a finite alphabet, $K$
a finite set of meanings, $V$ a finite alphabet
of variables such that $A \cap V = \{\}$. The
quintuple $(A,V,K,t,R)$, where $t$ is a mapping
$t: V \to \mathrm{exp}(A^*)$ assigning types to
variables and $R$ is a finite set of rules
$(I,H,u,v,C)$, where $I \subseteq N$ is a finite
set (of labels), $C \subseteq (N \cup \{0\})$ is
a finite set (of continuations), $H \subseteq K$
is a set of meanings belonging to a particular
rule from $R$, and $u,v \in (A \cup V)^*$, is
called a controlled rewriting system (CRS); all
variables from the left-hand side ($u$) must be
present on the right-hand side ($v$) and vice
versa (rule symmetry according to variables).
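A rule and its symmetry condition can be pictured as a small data structure. The sketch below is only an illustration in Python, not part of the formalism: the class name Rule, its field names and the helper is_symmetric are our own, and the sample rule assumes the Czech spellings -Lý and -Lejší for the comparative rule of the example in Section 3, with the variables V = {-, L, M}.

```python
from dataclasses import dataclass

VARIABLES = {"-", "L", "M"}   # V as in the Section 3 example (assumed here)

@dataclass(frozen=True)
class Rule:
    labels: frozenset     # I: labels under which the rule may be applied
    meanings: frozenset   # H: meanings belonging to the rule
    lhs: str              # u: string over A and V
    rhs: str              # v: string over A and V
    conts: frozenset      # C: continuations (0 closes a derivation)

def is_symmetric(rule):
    """Rule symmetry: u and v must mention exactly the same variables."""
    vars_u = {ch for ch in rule.lhs if ch in VARIABLES}
    vars_v = {ch for ch in rule.rhs if ch in VARIABLES}
    return vars_u == vars_v

# A well-formed rule, and a hypothetical malformed one that drops L on the right.
comp = Rule(frozenset({2}), frozenset({"comp"}), "-Lý", "-Lejší", frozenset({3}))
bad = Rule(frozenset({2}), frozenset({"comp"}), "-Lý", "-ejší", frozenset({3}))
```

Checking symmetry when the rules are loaded is what later allows every rule to be read in both directions.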
Definition 4. Let $T = (A,V,K,t,R)$ be a CRS.
A (simple) substitution on $T$ will be any
mapping $q: V \to A^*$; $q(v) \in t(v)$.
Definition 5. Let $T = (A,V,K,t,R)$ be a CRS
and $q$ a simple substitution on $T$. A mapping
$d: (A \cup V)^* \to A^*$ such that $d(e) = e$;
$d(a) = a$ for $a \in A$; $d(v) = q(v)$ for
$v \in V$; $d(bu) = d(b)d(u)$ for
$b \in (A \cup V)$, $u \in (A \cup V)^*$ will
be called the (generalized) substitution derived
from $q$.
Comment. The (generalized) substitution
substitutes (in a given string) all
variables by some string. The same string is
substituted for all occurrences of this
variable (follows from the definition).
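Definitions 4 and 5 translate almost directly into code. In this minimal sketch a simple substitution q is a Python dict, the types t(v) are written as regular expressions (an assumption made here for convenience), and derive(q) builds the generalized substitution d by the recursion d(bu) = d(b)d(u). The names TYPES, is_simple_substitution and derive are hypothetical, and the type values and Czech strings are taken from the example in Section 3.

```python
import re

# Types t(v) as regular expressions; values follow the Section 3 example.
TYPES = {"-": r".*", "L": r"[lz]", "M": r"[mnv]"}

def is_simple_substitution(q):
    """Def. 4: q(v) must belong to t(v) for every variable that q binds."""
    return all(re.fullmatch(TYPES[v], s) is not None for v, s in q.items())

def derive(q):
    """Def. 5: build the generalized substitution d derived from q."""
    def d(s):
        if s == "":                     # d(e) = e
            return ""
        b, u = s[0], s[1:]
        head = q[b] if b in q else b    # d(v) = q(v) for variables, d(a) = a
        return head + d(u)              # d(bu) = d(b)d(u)
    return d

q = {"-": "pod", "L": "l"}
d = derive(q)
print(d("-Lý"), d("nej-Lejší"))   # prints: podlý nejpodlejší
```

Because d looks every occurrence of a variable up in the same q, the same string is necessarily substituted for all occurrences.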
Definition 6. Let $T = (A,V,K,t,R)$ be a CRS
and $F \subseteq K$. Let then
$G, G' \subseteq K$, $w,z \in (A \cup V)^*$,
$i \in N$, $i' \in (N \cup \{0\})$. We say that
$w$ can be directly rewritten in the state
$(i,G)$ to $z$ with a continuation $(i',G')$
according to meanings $F$ (written as
$w(i,G) \Rightarrow_{[T,F]} z(i',G')$), if there
exist a rule $(I,H,u,v,C) \in R$ and a simple
substitution $q$ on $T$ such that $i \in I$,
$i' \in C$, $H \subseteq F$, $G = G' \cup H$,
$d(u) = w$ and $d(v) = z$, where $d$ is the
substitution derived from $q$.
The relation $\Rightarrow^{*}_{[T,F]}$ is defined
as the reflexive and transitive closure of
$\Rightarrow_{[T,F]}$.
Comment. The CRS is controlled through
continuations and labels. After a direct
rewriting operation, the only rules that
can be applied next are those whose label
set contains at least one number from the
continuation of that rewriting operation.
Please notice that:
- this operation always rewrites whole words;
- the restriction that the left-hand and
right-hand sides of a rule be only strings
(of letters and/or variables) is not as
strong as it may seem, because no
restrictions are imposed on the substitution
q. However, to be able to implement the rules
in a particular implementation as finite
state machines, we shall require q to be
defined using regular expressions only.
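One direct rewriting step can be sketched as follows. This is a minimal illustration under our own assumptions, not the author's implementation: rules are plain tuples (I, H, u, v, C), the types t(v) are regular expressions, and the helpers matches, subst and step are hypothetical names. Since the condition G = G' u H does not determine G' uniquely, the sketch enumerates every admissible G', from "all meanings of H removed" to "none removed".

```python
import re
from itertools import combinations

TYPES = {"-": r".*", "L": r"[lz]", "M": r"[mnv]"}   # t(v) as regexes (assumption)

def matches(pattern, word, q=None):
    """Yield every simple substitution q with d(pattern) == word."""
    q = q or {}
    if not pattern:
        if not word:
            yield dict(q)
        return
    head, rest = pattern[0], pattern[1:]
    if head not in TYPES:                        # a letter of A
        if word.startswith(head):
            yield from matches(rest, word[1:], q)
    elif head in q:                              # variable already bound
        if word.startswith(q[head]):
            yield from matches(rest, word[len(q[head]):], q)
    else:                                        # try every prefix of the right type
        for k in range(len(word) + 1):
            if re.fullmatch(TYPES[head], word[:k]):
                yield from matches(rest, word[k:], {**q, head: word[:k]})

def subst(pattern, q):
    """Apply the generalized substitution derived from q."""
    return "".join(q.get(ch, ch) for ch in pattern)

def step(rules, word, i, G, F):
    """Yield every (z, i2, G2) such that word(i,G) =>[T,F] z(i2,G2)."""
    for I, H, u, v, C in rules:
        if i not in I or not (H <= F and H <= G):
            continue
        for q in matches(u, word):
            z = subst(v, q)
            for i2 in C:
                for r in range(len(H) + 1):      # which meanings of H stay in G2
                    for kept in combinations(sorted(H), r):
                        yield z, i2, (G - H) | frozenset(kept)

RULES = [  # three level-2 rules, spelled as assumed from the Section 3 example
    (frozenset({2}), frozenset(), "-", "-", frozenset({3})),
    (frozenset({2}), frozenset({"comp"}), "-Lý", "-Lejší", frozenset({3})),
    (frozenset({2}), frozenset({"sup"}), "-Lý", "nej-Lejší", frozenset({3})),
]
F = frozenset({"sup", "masc", "pl", "acc"})
SUCCESSORS = set(step(RULES, "podlý", 2, F, F))
```

Here the comparative rule is skipped because comp is not in F, while the superlative rule contributes two successors, one with sup removed and one with sup kept.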
Definition 7. Let $T = (A,V,K,t,R)$ be a CRS
and let $n$ be the maximal number from all
labels from all rules from $R$; the n-tuple
$P = (p_1, ..., p_n)$ will be called a list of
patterns on $T$ (the elements of $P$ are called
patterns) if for every $i$ a mapping
$p_i: \mathrm{exp}(K) \times A^* \to \mathrm{exp}(A^*)$
is defined as
$z \in p_i(F,w) \Leftrightarrow w(i,F) \Rightarrow^{*}_{[T,F]} z(0,\{\})$.
Comment. The "strange" sets $G$ and $G'$ from
Definition 6 acquire a real meaning only
in connection with the definition of
patterns; they have a controlling task during
the construction of $p_i$, namely, they check
whether all meanings from $F$ are used during
the derivation. "To use a meaning $k$" means
here that there is some rule $(I,H,u,v,C)$
applied in the course of the derivation from
$w(i,F)$ to $z(0,\{\})$ such that $k \in H$.
Such a meaning can then be removed from $G$
when constructing $G'$ (see Def. 6); meanings
not from $H$ cannot. Thus, to get the empty set
in $z(0,\{\})$ when starting from $w(i,F)$, all
meanings from $F$ must be "used" in this sense.
A pattern describes how to construct, for a
given word w, all possible forms according to
the meanings F. In this sense, the notion of
pattern does not differ substantially from
the traditional notion of pattern in formal
morphology, although traditionally it is not
the constructive description, but just some
representative of such a description, that is
called a pattern.
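To make the constructive reading of a pattern concrete, the following sketch computes p_i(F,w) by exhaustively exploring all derivations from w(i,F) and keeping the strings that reach continuation 0 with an empty set of meanings. It is an illustration only: the rule table assumes the Czech spellings of the example in Section 3, the types are written as regular expressions, all function names are hypothetical, and the search is guaranteed to stop only for rule sets whose continuations cannot cycle while growing the word.

```python
import re
from itertools import combinations

TYPES = {"-": r".*", "L": r"[lz]", "M": r"[mnv]"}   # t(v) as regexes (assumption)

def rule(label, meanings, u, v, cont):
    return (frozenset({label}), frozenset(meanings.split()), u, v, frozenset({cont}))

RULES = [  # rule set assumed from the Section 3 example
    rule(1, "", "-", "-", 2), rule(1, "neg", "-", "ne-", 2),
    rule(2, "", "-", "-", 3),
    rule(2, "comp", "-Lý", "-Lejší", 3), rule(2, "sup", "-Lý", "nej-Lejší", 3),
    rule(2, "comp", "-Mý", "-Mější", 3), rule(2, "sup", "-Mý", "nej-Mější", 3),
    rule(3, "masc sg nom", "-ý", "-ý#", 0), rule(3, "masc sg acc", "-ý", "-ého#", 0),
    rule(3, "masc pl nom", "-ý", "-í#", 0), rule(3, "masc pl acc", "-ý", "-é#", 0),
    rule(3, "masc sg nom", "-í", "-í#", 0), rule(3, "masc sg acc", "-í", "-ího#", 0),
    rule(3, "masc pl nom", "-í", "-í#", 0), rule(3, "masc pl acc", "-í", "-í#", 0),
]

def matches(pattern, word, q=None):
    """Yield every simple substitution q with d(pattern) == word."""
    q = q or {}
    if not pattern:
        if not word:
            yield dict(q)
        return
    head, rest = pattern[0], pattern[1:]
    if head not in TYPES:
        if word.startswith(head):
            yield from matches(rest, word[1:], q)
    elif head in q:
        if word.startswith(q[head]):
            yield from matches(rest, word[len(q[head]):], q)
    else:
        for k in range(len(word) + 1):
            if re.fullmatch(TYPES[head], word[:k]):
                yield from matches(rest, word[k:], {**q, head: word[:k]})

def step(word, i, G, F):
    """All (z, i2, G2) with word(i,G) =>[T,F] z(i2,G2), where G = G2 u H."""
    for I, H, u, v, C in RULES:
        if i in I and H <= F and H <= G:
            for q in matches(u, word):
                z = "".join(q.get(ch, ch) for ch in v)
                for i2 in C:
                    for r in range(len(H) + 1):
                        for kept in combinations(sorted(H), r):
                            yield z, i2, (G - H) | frozenset(kept)

def pattern(i, F, w):
    """p_i(F, w): every z with w(i,F) =>* z(0,{})."""
    F = frozenset(F)
    todo, seen, forms = [(w, i, F)], set(), set()
    while todo:
        state = todo.pop()
        if state in seen:
            continue
        seen.add(state)
        word, label, G = state
        if label == 0:
            if not G:                  # every meaning of F has been "used"
                forms.add(word)
            continue
        todo.extend(step(word, label, G, F))
    return forms

print(pattern(2, {"sup", "masc", "pl", "acc"}, "podlý"))   # prints: {'nejpodlejší#'}
```

Derivations that end with a non-empty set of meanings are simply dropped, which is the controlling role of G and G' described in the comment above.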
Definition 8. Let $D$ be a dictionary over an
alphabet $A$, $T = (A,V,K,t,R)$ a CRS and $P$ a
list of patterns on $T$. A quadruple
$M = (A,D,K,P)$ is called a morphology
description on $T$ (M[T]-description).
Definition 9. Let $T = (A,V,K,t,R)$ be a CRS
and $M = (A,D,K,P)$ an M[T]-description. The set
$L = \{z \in A^*;\ \exists w \in A^*,\ i \in N,\ H \subseteq K;\ z \in p_i(H,w)\}$
will be called the language generated by the
M[T]-description $M$. The elements of $L$ will
be called word forms.
Comment. The term morphology description
introduced above is a counterpart to a
description of a system of formal morphology,
as used in traditional literature on
morphology.
Definition 9 is introduced here just for the
purpose of formalization of the notion of
word form, i.e. any form derived from any
word from the dictionary using all possible
meanings according to M[T].
Definition 10. Let $T = (A,V,K,t,R)$ be a CRS
and $M = (A,D,K,P)$ be an M[T]-description. The
term synthesis on $M$ is used for a mapping
$s: \mathrm{exp}(K) \times A^* \to \mathrm{exp}(A^*)$;
$s(H,w) = \{z;\ \exists i \in N,\ i \le n;\ z \in p_i(H,w) \ \&\ (w,i) \in D\}$.
The term analysis is used then for a mapping
$a: A^* \to \mathrm{exp}(\mathrm{exp}(K) \times A^*)$;
$a(z) = \{(H,w);\ z \in s(H,w)\}$.
Comment. According to Definition 10,
synthesis means to use patterns for words
from the dictionary only. The definition of
analysis is based on the synthesis
definition, so it clearly follows the
intuition of what an analysis is. In this
sense, these definitions do not differ
substantially from the traditional view on
formal morphology, as opposed to Koskenniemi;
however, the so-called complex word forms
("have been called") are not covered, and
their analysis is shifted to syntax.
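Definition 10 can be read almost literally as code. In the sketch below the patterns are not computed by a CRS but replaced by a small hand-written lookup table (TABLE is a hypothetical stand-in holding a few forms of the Section 3 example, with assumed Czech spellings), so that only the roles of the dictionary D, of synthesis and of analysis are shown. The analysis is a brute-force enumeration over all subsets of K, which is tolerable only because K is tiny here.

```python
from itertools import combinations

K = ["sg", "pl", "comp", "sup", "neg", "masc", "nom", "acc"]
D = {("nový", 1), ("podlý", 2)}

# Hypothetical stand-in for the patterns p_i: (i, F, w) -> set of forms.
TABLE = {
    (1, frozenset({"masc", "sg", "nom"}), "nový"): {"nový#"},
    (1, frozenset({"sup", "masc", "sg", "nom"}), "nový"): {"nejnovější#"},
    (1, frozenset({"sup", "masc", "pl", "nom"}), "nový"): {"nejnovější#"},
    (1, frozenset({"sup", "masc", "pl", "acc"}), "nový"): {"nejnovější#"},
    (2, frozenset({"sup", "masc", "pl", "acc"}), "podlý"): {"nejpodlejší#"},
}

def p(i, F, w):
    """Look up p_i(F, w) in the stand-in table (empty set if absent)."""
    return TABLE.get((i, frozenset(F), w), set())

def synthesis(F, w):
    """s(F, w): forms from every pattern i such that (w, i) is in D."""
    return {z for (u, i) in D if u == w for z in p(i, F, w)}

def analysis(z):
    """a(z) = {(F, w); z in s(F, w)}, by generate-and-test over subsets of K."""
    found = set()
    for r in range(len(K) + 1):
        for combo in combinations(K, r):
            F = frozenset(combo)
            for (w, _) in D:
                if z in synthesis(F, w):
                    found.add((F, w))
    return found
```

A word that is not in the dictionary yields no forms at all, and a form is analysed by asking which pairs (F, w) would have synthesized it; the cost of the enumeration grows exponentially with the size of K.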
The definition of analysis is quite clear,
but it contains no procedure capable
of actually carrying out this process.
However, thanks to rule symmetry it is 
possible to reverse the rewriting process: 
Definition 11. Let $T = (A,V,K,t,R)$ be a CRS.
Further, let $G, G' \subseteq K$, $i \in N$,
$i' \in (N \cup \{0\})$, $z,w \in A^*$. We say
that under the condition $(i',G')$ it is
possible to directly analyse a string $z$ to $w$
with a continuation $(i,G)$ (we write
$z(i',G') =<_{[T]} w(i,G)$), if there exist a
rule $(I,H,u,v,C) \in R$ and a simple
substitution $q$ on $T$ such that $i \in I$,
$i' \in C$, $G = G' \cup H$, $d(u) = w$ and
$d(v) = z$, where $d$ is the generalized
substitution derived from $q$. The relation "it
is possible to analyse" ($=<^{*}_{[T]}$) is
defined as the reflexive and transitive closure
of $=<_{[T]}$.
Definition 12. Let $T = (A,V,K,t,R)$ be a CRS
and $z \in A^*$. Every string $w \in A^*$,
$i \in N$ and $F \subseteq K$ such that
$z(0,\{\}) =<^{*}_{[T]} w(i,F)$ is called
a predecessor of $z$ with a continuation $(i,F)$.
Lemma. Let $T = (A,V,K,t,R)$ be a CRS and
$w \in A^*$ a predecessor of a string
$z \in A^*$ with a continuation $(i,F)$. Then
$z \in p_i(F,w)$, where $p_i$ is a pattern by
$T$ (see Def. 7). Proof (idea). The only
"asymmetry" in the definition of $\Rightarrow$
as opposed to $=<$, i.e. the condition
$H \subseteq F$, can be solved by putting (see
Def. 11) $F = \{\} \cup H_1 \cup H_2 \cup ... \cup H_n$
(for $n$ analysis steps). Then surely
$H_i \subseteq F$ for every $i$.
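As a worked instance of the proof idea, consider the analysis path of the example in Section 3 that leads from the superlative form to the dictionary entry for "new" with the meanings sup, masc, pl, acc. Its three reversed steps use rules with the meaning sets $H_1 = \{masc, pl, acc\}$, $H_2 = \{sup\}$ and $H_3 = \{\}$ (the numbering of the steps is ours):

```latex
\begin{align*}
F &= \{\} \cup H_1 \cup H_2 \cup H_3 \\
  &= \{\} \cup \{\mathrm{masc},\mathrm{pl},\mathrm{acc}\} \cup \{\mathrm{sup}\} \cup \{\} \\
  &= \{\mathrm{sup},\mathrm{masc},\mathrm{pl},\mathrm{acc}\}
\end{align*}
```

Each $H_i$ is a subset of this $F$ by construction, so the same three rules may be applied in the forward direction with $F$ as the set of required meanings.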
Theorem. Let $T = (A,V,K,t,R)$ be a CRS,
$M = (A,D,K,P)$ an M[T]-description, $a$ an
analysis by $M$ and $w \in A^*$ a predecessor of
$z \in A^*$ with a continuation $(i,F)$.
Moreover, let $(w,i) \in D$. Then
$(F,w) \in a(z)$.
Proof follows from the preceding lemma and
from the definition of analysis.
Comment. This theorem helps us to manage an
analysis of a word form: we begin with the
form being analysed (z) and a "continuation"
(0,{}), then use "reversed" rules for back
rewriting. In any state w(i,F) during this
process, a correct analysis is obtained
whenever (w,i) is found in the dictionary.
At the same time we have in F the appropriate
meanings. Passing along all possible paths
of back rewriting, we obtain the whole set
a(z).
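The back-rewriting procedure described in this comment can be sketched as a search over states w(i,F). The code below is an illustration under our own assumptions rather than the original implementation: the rule table and the dictionary use the assumed Czech spellings of the example in Section 3, the types are regular expressions, and the names matches, subst and analyse are hypothetical. Each rule is matched on its right-hand side and rewritten to its left-hand side, and the meanings of the rule are added to the current set.

```python
import re

TYPES = {"-": r".*", "L": r"[lz]", "M": r"[mnv]"}   # t(v) as regexes (assumption)

def rule(label, meanings, u, v, cont):
    return (frozenset({label}), frozenset(meanings.split()), u, v, frozenset({cont}))

RULES = [  # rule set assumed from the Section 3 example
    rule(1, "", "-", "-", 2), rule(1, "neg", "-", "ne-", 2),
    rule(2, "", "-", "-", 3),
    rule(2, "comp", "-Lý", "-Lejší", 3), rule(2, "sup", "-Lý", "nej-Lejší", 3),
    rule(2, "comp", "-Mý", "-Mější", 3), rule(2, "sup", "-Mý", "nej-Mější", 3),
    rule(3, "masc sg nom", "-ý", "-ý#", 0), rule(3, "masc sg acc", "-ý", "-ého#", 0),
    rule(3, "masc pl nom", "-ý", "-í#", 0), rule(3, "masc pl acc", "-ý", "-é#", 0),
    rule(3, "masc sg nom", "-í", "-í#", 0), rule(3, "masc sg acc", "-í", "-ího#", 0),
    rule(3, "masc pl nom", "-í", "-í#", 0), rule(3, "masc pl acc", "-í", "-í#", 0),
]
D = {("nový", 1), ("podlý", 2)}

def matches(pattern, word, q=None):
    """Yield every simple substitution q with d(pattern) == word."""
    q = q or {}
    if not pattern:
        if not word:
            yield dict(q)
        return
    head, rest = pattern[0], pattern[1:]
    if head not in TYPES:
        if word.startswith(head):
            yield from matches(rest, word[1:], q)
    elif head in q:
        if word.startswith(q[head]):
            yield from matches(rest, word[len(q[head]):], q)
    else:
        for k in range(len(word) + 1):
            if re.fullmatch(TYPES[head], word[:k]):
                yield from matches(rest, word[k:], {**q, head: word[:k]})

def subst(pattern, q):
    return "".join(q.get(ch, ch) for ch in pattern)

def analyse(z):
    """Collect (F, w) for every predecessor w(i,F) of z with (w,i) in D."""
    todo, seen, found = [(z, 0, frozenset())], set(), set()
    while todo:
        state = todo.pop()
        if state in seen:
            continue
        seen.add(state)
        word, cont, G = state
        if cont != 0 and (word, cont) in D:      # predecessors need i in N
            found.add((G, word))
        for I, H, u, v, C in RULES:
            if cont in C:
                for q in matches(v, word):       # match the right-hand side
                    w = subst(u, q)              # rewrite back to the left-hand side
                    todo.extend((w, i, G | H) for i in I)
    return found

RESULT = analyse("nejnovější#")
```

For the form analysed in the example, RESULT contains three pairs, all with the lexical unit "nový"; the negated form of "podlý" is rejected because the dictionary lists that word only under pattern number 2.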
3. An Example 
To illustrate the most important
features of the formalism described above,
we have chosen a simplified example of Czech
adjectives (regular declension according to
two traditional "patterns", mladý (young)
and jarní (spring), with negation, full
comparative and superlative, sg and pl, but
only masc. anim. nominative and genitive).
The dictionary:
D = {(nový,1),     new
     (podlý,2)}    vile (it has no neg. forms)
The CRS:
CRS T = (A,V,K,t,R):
A = {a,á,b,c,č,...,z,ž,#}
(# means word separator)
K = {sg,pl,comp,sup,neg,masc,nom,acc}
V = {-,L,M}
t(-) = A*; t(L) = {l,z}; t(M) = {m,n,v}
R = { (see fig. 1) }
({1},{ },    -,   -,         {2}),  ({3},{masc,sg,nom}, -ý, -ý#,   {0}),
({1},{neg},  -,   ne-,       {2}),  ({3},{masc,sg,acc}, -ý, -ého#, {0}),
({2},{ },    -,   -,         {3}),  ({3},{masc,pl,nom}, -ý, -í#,   {0}),
({2},{comp}, -Lý, -Lejší,    {3}),  ({3},{masc,pl,acc}, -ý, -é#,   {0}),
({2},{sup},  -Lý, nej-Lejší, {3}),  ({3},{masc,sg,nom}, -í, -í#,   {0}),
({2},{comp}, -Mý, -Mější,    {3}),  ({3},{masc,sg,acc}, -í, -ího#, {0}),
({2},{sup},  -Mý, nej-Mější, {3}),  ({3},{masc,pl,nom}, -í, -í#,   {0}),
                                    ({3},{masc,pl,acc}, -í, -í#,   {0})
Fig. 1
...................................................................... 
using p2:
podlý(2,{sup,masc,pl,acc}) =>           two possib.
  nejpodlejší(3,{masc,pl,acc}) =>       1st alt.
    nejpodlejší#(0,{}) ..........       G' empty, O.K.
  podlý(3,{sup,masc,pl,acc}) =>         2nd alt.
    podlé#(0,{sup}) .............       G' not empty, so
                                        this is not a solution
Possibilities without removing "used" meanings are not shown;
all lead to non-empty G' in the resulting z(0,G').
Fig. 2
......................................................................
nejnovější#(0,{}) =< ..................... not in D (4 alter.)
  nejnovější(3,{masc,pl,acc}) =< ......... not in D (3 alter.)
    nový(2,{sup,masc,pl,acc}) =< ......... not in D
      nový(1,{sup,masc,pl,acc}) .......... in D; SOLUTION
    nejnový(2,{comp,masc,pl,acc}) =< ..... not in D (2 alter.)
      jnový(1,{neg,comp,masc,pl,acc}) .... not in D
      nejnový(1,{comp,masc,pl,acc}) ...... not in D
    nejnovější(2,{masc,pl,acc}) =< ....... not in D (2 alter.)
      jnovější(1,{neg,masc,pl,acc}) ...... not in D
      nejnovější(1,{masc,pl,acc}) ........ not in D
  nejnovější(3,{masc,pl,nom}) =< ......... not in D
    nový(2,{sup,masc,pl,nom}) =< ......... not in D
      nový(1,{sup,masc,pl,nom}) .......... in D; SOLUTION
    ... same as 1st alter., but nom instead of acc ...
  nejnovější(3,{masc,sg,nom}) =< ......... not in D
    nový(2,{sup,masc,sg,nom}) =< ......... not in D
      nový(1,{sup,masc,sg,nom}) .......... in D; SOLUTION
    ... same as 1st alter., but sg,nom instead of pl,acc
  nejnovějšý(3,{masc,pl,nom}) =< ......... not in D
    nejnovějšý(2,{masc,pl,nom}) =< ....... not in D (2 alter.)
      nejnovějšý(1,{masc,pl,nom}) ........ not in D
      jnovějšý(1,{neg,masc,pl,nom}) ...... not in D
Fig. 3
....................................................................... 
An example of synthesis: we want to obtain
s({sup,masc,pl,acc},podlý) -> (podlý,2) in D;
see fig. 2
An example of analysis: we want to obtain
a(nejnovější#); see fig. 3
Comment. Better written rules in the CRS would
not allow for the 4th alternative in the
first step ("nejnovějšý"), because "š" cannot
be followed by "ý" in any Czech word
form; however, constructing the other
unsuccessful alternatives cannot be ruled out
a priori: only the dictionary can
decide whether e.g. "jnový" is or is not a
Czech adjective.
Comment on comment. No change in the rules
would be necessary if a separate phonology
and/or orthography level is used; then, the
"šý" possibility, being orthographically
impossible, is excluded there, of course.
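Such a separate orthography level could be as simple as a filter applied to candidate strings before any rule is tried on them. The sketch below is a deliberately tiny illustration: FORBIDDEN lists only the single constraint named in the text (no "š" followed by "ý") and is not meant as a description of Czech orthography; the function name is hypothetical.

```python
# Letter pairs that may not occur in any word form; only the one
# constraint mentioned in the text is listed (an illustrative assumption).
FORBIDDEN = ("šý",)

def orthographically_possible(word):
    """Reject candidate strings containing a forbidden letter pair."""
    return not any(pair in word for pair in FORBIDDEN)

candidates = ["nejnovější", "nejnovějšý"]
kept = [w for w in candidates if orthographically_possible(w)]
print(kept)   # prints: ['nejnovější']
```

With such a filter in front of the analysis, the fourth alternative of the first analysis step is discarded before any further back rewriting is attempted, and the rules themselves stay unchanged.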
4. Conclusion 
This formalism will probably be
sufficient for Czech (no counter-example to
this thesis has been discovered so far). For
inflected words one or two "levels" (i.e.,
successive rule applications) will suffice;
agglutinative elements (e.g., adjective
comparison) will probably need three to five
rules.

References 

EBSAT VII (1982): Morphemic Analysis of Czech,
Prague 1982

EBSAT VI (1981): Lexical Input Data for
Experiments with Czech, Praha 1981

Koskenniemi, K. (1983): Two-level morphology,
Univ. of Helsinki, Dept. of Gen. Linguistics,
Publications No. 11

Hajič, J., Oliva, K. (1986): Projekt česko-ruského
strojového překladu (A Project of Czech to
Russian MT System), in: Proceedings of
SOFSEM'86, Liptovský Ján

Kirschner, Z. (1983): MOSAIC (A Method of
Automatic Extraction of Significant
Terms from Texts), EBSAT X

Kirschner, Z. (1987): APAC3-2: An
English-to-Czech Machine Translation
System, EBSAT XIII

Kay, M. (1987): Non-Concatenative Finite-State
Morphology, in: Proceedings of the
3rd European ACL meeting, Copenhagen,
Denmark, April 1987

EBSAT = Explizite Beschreibung der Sprache
und automatische Textbearbeitung, LK Praha
