ASSOCIATIVE MODEL OF MORPHOLOGICAL ANALYSIS: 
AN EMPIRICAL INQUIRY
Harri Jäppinen and Matti Ylilammi
Helsinki University of Technology 
Helsinki, Finland 
This paper presents a computational model for the analysis of word forms of a highly inflectional, 
agglutinative language. We call the model "associative" as it directly links phonemic stimulus with its 
morphemic interpretation(s) under the guidance of a coherence constraint. The model has been fully 
implemented for Finnish. We discuss separately the abstract model and various algorithms to implement 
the model. We also demonstrate the implementation. The best features of the method are its efficiency 
and its capability of supporting open lexicons. 
1 INTRODUCTION 
In the search for computational models of language, 
syntactic analysis of sentences has so far received much 
greater attention than morphological analysis of word 
forms. For example, Winograd (1983) in his thorough 
book on syntax and parsing methods has allocated six or 
so pages to the morphological analysis. That makes about 
one percent of the whole text. This does not, of course, 
mean that the author rates morphology unimportant, but 
it does, we believe, reflect the general interest of the 
research community. 
Such heavy emphasis on syntax on the one hand and 
almost total neglect of morphology on the other follows 
from the idiosyncrasies of English. And due to the dominant
role English has in the computational linguistic
community, this somewhat unbalanced view permeates 
the computational linguistic literature. 
The neglect of English morphology has obvious 
reasons. The basic rules of English word inflection are 
quite simple. There are some fusion phenomena that 
produce portmanteau morphs resistant to inflectional 
analysis, but their number is small. For syntactic analysis 
of sentences, one needs a lexicon anyway. Why not then 
take the easy way out and let the lexicon bear the burden 
of morphology and carry at least the hard word forms if 
not all word forms as separate lexemes? 
For agglutinative inflectional languages, however, the 
economy of computation may shake hands with theoretical
ambitions. In Finnish, to take an example close to
our heart, a nominal may appear in a running text in 
thousands of different forms, and verbs have an even 
wider spectrum of forms (Karlsson 1983). The probabili- 
ty distribution of the word forms of a given lexeme is 
uneven, to be sure, but even the most conservative estimates
render the brute-force method inappropriate in
other than naive attempts for extremely limited purposes. 
Fortunately Finnish is an agglutinative language; port- 
manteau morphs are almost nonexistent. In order to 
analyze Finnish sentences computationally one must 
analyze word forms as well. 
This paper describes a morphological model for Finn- 
ish. The model has been fully implemented, and our tests 
rate it efficient. A synopsis of the model appears in 
Jäppinen et al. (1983a). This paper describes the model
in more detail. 
The functional requirements set for the model were: 
1. clean separation between linguistic knowledge and 
algorithms (we wanted to increment and modify the 
model without structural changes in the algorithm), 
2. general and efficient analysis of inflectional, possessive,
and cliticized morphs (all inflectional word
forms should be analyzed on the basis of general 
linguistic knowledge; synthesis of word forms and 
the analysis of derivational forms were left out), and 
3. support of an open lexicon (the model should recog- 
nize the occurrence of a new lexeme and support 
lexical update). 
Copyright 1986 by the Association for Computational Linguistics. Permission to copy without fee all or part of this material is granted provided that
the copies are not made for direct commercial advantage and the CL reference and this copyright notice are included on the first page. To copy
otherwise, or to republish, requires a fee and/or specific permission.
0362-613X/86/040257-272$03.00
Computational Linguistics, Volume 12, Number 4, October-December 1986
The model we came up with is associative. It is not
generative in the sense that it does not utilize, backwards
or forwards, formal rules designed to generate valid and
only valid Finnish word forms. The associative rules
connect strings of phonemes directly with morphemes
without passing through intermediate syntagmatic categories.
The associative model is composed of two interrelated
parts:
1. a collection of independent stimulus-result rules, 
each associating a (partial) phonemic stimulus with 
one or more morphemes, and 
2. a coherence constraint that determines which of the 
fragmentary results proposed by the rules make up a 
coherent whole. 
This model is the result of an empirical inquiry guided 
by pragmatic constraints. Therefore, the discussion plac- 
es emphasis on computational issues rather than on 
theoretical linguistic argumentation. We first outline the 
problem, and then describe separately a model for its 
solution and various algorithms to realize the model. We 
also discuss and demonstrate an implementation, and 
contrast it with other known models. 
2 THE PROBLEM DECOMPOSED 
Henceforth we call lexical entry or lexeme the basic word 
form that serves as a carrier of morpho-syntactic information
about a word in a lexicon. Word form or simply
form denotes an inflected lexeme. The grammatical 
representation of a word form is called grammatical word 
or word. The grammatical word of a lexeme in our model 
is singular nominative case for nominals and first infini- 
tive for verbs. This is the practice of printed Finnish 
mono- and bilingual dictionaries and lexicons. 
There is a one-to-many (one-to-thousands) mapping 
between lexemes and their valid forms. An effective 
inverse mapping is the analysis problem. Two distinct 
lexemes may, and frequently do, produce identical forms, 
homographs. Consequently, the analysis of a form is 
occasionally ambiguous, and it is paramount for the anal- 
ysis to find all lexemes a given form represents. 
Any native Finn is able to produce without effort the 
proper phonemic (or graphemic) word form of a given 
lexeme to fit a given context. When various dialects are 
reduced into standard written orthographic Finnish, the 
lexeme takki ('coat'), to take a random example, may 
appear in text in any of the following forms: 
(1) 
{takki,takit,takin,takkien,takkiesi,takkina,takkeihin, 
takeissasi,takeittako,takeissannehan,takkeinensakohan,...} 
Similarly, we can take any verb lexeme, say jakaa 
('deliver'), and list all its distinct forms: 
(2) 
{ jakaa,jaan,jaat,jakavat,jaoin,jakanevat,jakakoot, 
jakaisimme,jaetaankohan,jaettaisiinkohan,... } 
A native speaker, when browsing lists such as (1) or (2), 
would accept the elements as valid forms and, further- 
more, easily assign each its basic form, its lexeme. For 
the ninth element in (1), for instance, an average adult 
Finn would spontaneously recognize it as a properly 
inflected form of takki. But if asked to interpret the
form, he would, after some hesitation, probably not be
able to extract and identify the attached affixes.
The basic meaning-bearing unit in a word is a 
morpheme. The linear arrangements of morpheme classes 
in Finnish word forms are (Ikola 1977): 
(3) Nominals:
STEM < [COMPAR] < NUMBER < CASE
< [POSSESSIVE] < [CLITICS1..2]
Verbs:
STEM < VOICE < [TENSE/MOOD] < [PERSON]
< [CLITICS1..2]
Verbal Nominals:
STEM < [VOICE] < [VERBNOM] < [COMPAR] <
NUMBER < CASE < [POSSESSIVE] < [CLITICS1..2]
Brackets are used to distinguish optional morpheme 
classes. VERBNOM (verb nominal) comprises two partici- 
ple and four infinitive morphemes. Some of the classes 
are cross-categorial. COMPAR (comparison) may appear 
only with adjectival stems or with participial forms. Past 
tense can be joined only with the indicative mood; tense 
and mood are therefore expressed in a common class. 
Two clitics may be attached in a row, on rare occasions 
even three. In speech a < b means b follows a in time; in 
writing b is to the right of a. 
Once the morpheme classes and their ordering are 
made explicit, it is then easy for any given word form to 
isolate the representatives of the morpheme classes. A 
noun, say takkeinensakohan from (1), when matched 
against (3) is analyzed as takke+i+ne+nsa+ko+han 
corresponding to a stem, plural, comitative case, third 
person possessive, and two clitic morphemes. Similarly, a 
randomly picked complex verb form such as jaettaisiinkohan
from (2) reads as jae+tta+isi+:n+ko+han, representing
a stem, passive, conditional (and hence present
tense), passive suffix (serving as a kind of "fourth"
person), and two clitic morphemes.
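The matching of a segmented form against the ordering (3) can be sketched in code. The following is a minimal illustration and not the implementation described in this paper: the names NOMINAL_TEMPLATE and obeys_template are hypothetical, the morpheme-class labels are assigned by hand, and the clitic slot is limited to two as in (3).

```python
# Nominal ordering from (3): (class name, minimum count, maximum count).
# Optional classes have minimum 0; the clitic slot allows up to two members.
NOMINAL_TEMPLATE = [
    ("STEM", 1, 1),
    ("COMPAR", 0, 1),
    ("NUMBER", 1, 1),
    ("CASE", 1, 1),
    ("POSSESSIVE", 0, 1),
    ("CLITIC", 0, 2),
]


def obeys_template(classes, template=NOMINAL_TEMPLATE):
    """Return True if the sequence of class labels respects the ordering."""
    pos = 0
    for name, lowest, highest in template:
        count = 0
        # Consume as many consecutive labels of this class as the slot allows.
        while pos < len(classes) and classes[pos] == name and count < highest:
            pos += 1
            count += 1
        if count < lowest:
            return False
    # Every label must have been consumed by some slot.
    return pos == len(classes)


# takke+i+ne+nsa+ko+han, labelled by hand as in the text above.
segments = [
    ("takke", "STEM"), ("i", "NUMBER"), ("ne", "CASE"),
    ("nsa", "POSSESSIVE"), ("ko", "CLITIC"), ("han", "CLITIC"),
]
classes = [label for _, label in segments]
print(obeys_template(classes))
```

The sketch only checks the order of classes that are already known; finding the segments and their classes is the task of the morphotactic rules introduced below.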
Each class has a small number of elements, and each 
morpheme has at most a few allomorphs, phonemic real- 
izations. From the strategic point of view the morpheme 
classes fall into two categories. Other morphemes than 
the stems have a closed set of allomorphs. They can be 
recognized with a finite set of rules. Stems constitute a 
large and unbounded set of morphemes as new lexemes 
may develop and enter in the vocabulary. Hence the allo- 
morphic variants of stems cannot be directly used in a 
closed set of rules. To bound a set of rules, the rules for
stems must recognize invariant parts of phoneme strings, 
not entire stem alternants. 
The original problem was thus decomposed into two 
parts: The Morphotactic Problem segments word forms 
and solves the allomorph relation for morphemes other 
than stems; The Stem Alternation Problem solves the 
residual allomorph relation for stems. 
3 GENERAL VIEW OF THE ASSOCIATIVE MODEL 
An associative model for the analysis of word forms of 
Finnish consists of a triplet <{MRi}, <*, {SRi}>, where 
{MRi} is a set of associative morphotactic rules, {SRi} is a
set of associative stem rules, and <* is a precedence 
relation in the set of the morphotactic rules. The rules 
associate phonemic stimulus with morphemic data. They 
obey the general form: 
(4) [ml context] <pl context> key <pr context>
[mr context] → [m result]
A rule is applicable and fires whenever a phonemic string
identical to key appears somewhere in an input form and
the contextual conditions of the rule are satisfied. pl and
pr stand for phonemic contexts, and ml and mr designate
morphemic contexts. A contextual phonemic string calls
for substring identity; morphemic contexts require set
inclusion. Infixes l and r denote left and right sensitivity,
respectively. When a rule fires, it contributes one or more
morphemes (m result) as a possible partial interpretation 
of the word form. There is not necessarily a one-to-one 
correspondence between keys and morphs. Keys may 
represent entire morphs or morphs that have been trun- 
cated by fusion processes. 
The model is implemented in a sequential machine. 
Therefore the rule (4) has two slightly differing applica- 
tions: 
(5a) Morphotactic rules:
<pl context>allomorph[mr context]
→ [morphemes]
(5b) Stem rules:
<pl context>a ending[mr context]
→ [conc(ROOT,b ending)]
Morphotactic rules recognize and interpret all morphs
except stem alternants. As our algorithm is tuned
to right-to-left sequential processing, these rules are 
invoked first. Only left phonemic and right morphemic 
contexts make sense in these rules. Stem rules recognize 
the allomorph relation of a potentially unlimited number 
of stems. They use alternant stem endings (a ending) as 
keys. A stem rule produces a hypothetical basic stem 
(lexeme) in which the recognized alternant ending is 
replaced (concatenated with the root) by the basic 
ending (b ending) shown. Only left phonemic and right 
morphemic contexts are meaningful in stem rules in the 
chosen strategy. We discuss the morphotactic part and 
the stem alternation part of the model in separate 
sections below. 
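How a rule of the form (5a) might be represented and tested for applicability can be sketched as follows. This is an illustration under stated assumptions, not the implementation reported here: the class MorphotacticRule, the function fires, and the sample rule are hypothetical, archiphonemes are ignored, the left phonemic context is a literal string (substring identity), and the right morphemic context is a set tested by inclusion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphotacticRule:
    key: str                              # phonemic key (an allomorph)
    result: frozenset                     # morphemes contributed on firing
    pl_context: str = ""                  # phonemes required just left of the key
    mr_context: frozenset = frozenset()   # morphemes required to the right


def fires(rule, form, end, right_morphemes):
    """Check whether `rule` fires with its key ending at index `end` of `form`.

    `right_morphemes` holds the morphemes already recognized to the right,
    since processing runs from right to left.
    """
    start = end - len(rule.key)
    if start < 0 or form[start:end] != rule.key:
        return False
    ctx_start = start - len(rule.pl_context)
    if ctx_start < 0 or form[ctx_start:start] != rule.pl_context:
        return False
    # Morphemic context: set inclusion.
    return rule.mr_context <= set(right_morphemes)


# Hypothetical rule: key "ssa" interpreted as inessive case.
inessive = MorphotacticRule(key="ssa", result=frozenset({"in"}))
print(fires(inessive, "talossa", 7, []))
```

A rule with a nonempty mr_context fires only when all of its required morphemes have already been found to the right of the key.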
4 MORPHOTACTIC MODEL 
This section lists the morphemes of Finnish and presents 
a few outstanding problems in their allomorph relation. 
The discussion uses semiformal generative rules similar to those of
Matthews (1972). Then an associative solution to the
Morphotactic Problem is displayed. 
4.1 THE MORPHEMES 
Expression (3) arranges the morpheme classes of Finnish 
in three precedence orders, one for the nominal forms, 
one for the verbal forms, and one for the verbal nominal 
forms. These morpheme classes have grammatical func- 
tions shown in (6). (Clitics have such complex functions 
in Finnish that we do not attempt to mark them by 
morphemes but use instead their phonetic realizations in 
the discussion.) 
(6) COMPAR = {com(parative),sup(erlative)}
NUMBER = {s(in)g(ular),pl(ural)}
CASE = {nom(inative),gen(itive),
part(itive),ess(ive),
in(essive),ela(tive),
ill(ative),ad(essive),
abl(ative),all(ative),
ins(tructive),com(itative),
ab(essive),tran(slative)}
POSSESSIVE = {1p(erson)s(ingular),2ps,3ps,
1pp(lural),2pp,3pp}
CLITICS = {...}
VERBNOM = {Ipart(iciple),IIpart,Iinf(initive),
...,IVinf}
VOICE = {pass(ive),act(ive)}
MOOD = {ind(icative),imp(erative),
cond(itional),pot(ential)}
TENSE = {pres(ent),past}
PERSON = {1ps,2ps,3ps,1pp,2pp,3pp,p(assive)
suff(ix)}
4.2 SOME PROBLEMS IN THE ALLOMORPH RELATIONS 
Morphemes of Finnish are complex. Some morphemes 
have no correlates on the phonemic level, some 
morphemes have more than one allomorph, and some
phoneme strings have the simplest explanation, it seems, 
if empty morphs are postulated. Further complexities are 
caused by certain fusion processes. We outline these 
phenomena below for non-Finnish readers. 
The pl morpheme has the suppletive allomorphs 'i', 'j', 
or 't', and in some rare cases no phonemic string marks 
plural. The sg morpheme has no allomorphs. These are 
not the only morphemes without phonemic correlates. In 
order to faithfully pair off morphs and morphemes in a 
one-to-one correspondence, it is convenient to postulate 
a zero alternant in place of a missing morph. For the 
lexeme kala ('fish'), for instance, we then get among 
other derivations between the morphemic (ML) and 
phonemic levels (PL) the few possibilities shown in 
Figure 1. (The root of a stem is henceforth written in 
capital letters.) 
The standard treatment assigns four allomorphs for 
the partitive case: 'a', 'ä', 'ta', and 'tä'. This variety
decreases if one posits two allomorphs, 'a' and 'ä', for
part and allows the existence of an empty morph 't' to be 
conditioned by stress. The partitive forms of kala ('fish') 
and pasuuna ('trombone') in Figure 2 illustrate the inter- 
play between pl and part morphemes in this stipulation. 
ML: KALa + sg + nom        ML: KALa + pl + part
PL: kala + ε + ε           PL: kalo + j + a
ML: KALa + pl + nom        ML: KALa + pl + ess
PL: kala + t + ε           PL: kalo + i + na
Figure 1. Examples of suppletive allomorphs and zero morphs.
ML: KALa + pl + part 
PL: kalo + j + a 
ML: PASUUNa + pl + part
PL: pasuuno + i + t + a 
Figure 2. An example of empty morphs 
Some alternants, when joined with neighboring
affixes, exhibit regularities in behavior which can be
captured conveniently by archiphonemes on the mediating
morphophonemic level (MPL). The allomorphs of
comparison are examples of such alternants, and so are
some clitic segments. The use of archiphonemes captures
nicely consonant gradation in the former and vowel
harmony in the latter. The two part allomorphs discussed
above can also be generated via a single archiphoneme
'A' on the morphophonemic level. It is realized as an 'a'
or an 'ä' on the phonemic level as vowel harmony
demands. In Figure 3 lexemes suuri ('big') and jää ('ice')
exemplify how the use of archiphonemes reduces a set
of generative rules.
There are fusion processes that delete information.
These phenomena are easily formulated in generative
terms but are problematic for analysis.
The leftmost consonant in the possessive morphs
(1ps:'ni'; 2ps:'si'; 3ps:'nsa'; 1pp:'mme'; 2pp:'nne';
3pp:'nsa'), be it a nasal or a fricative, overlaps and dominates
the preceding consonant. For the lexeme kala
('fish'), for instance, we get the derivations in Figure 4 in
the singular and plural nominative and genitive cases
when a possessive segment is present or absent, respectively.
Notice how the four forms are distinct when a posses- 
sive is absent (kala, kalat, kalan, kalojen) and become 
threefold ambiguous when the possessive segment is 
attached (kalamme, kalamme, kalamme, kalojemme). 
This is a general phenomenon. A nominal in Finnish 
always becomes grammatically ambiguous when a 
possessive suffix is attached to a singular nominative or 
genitive, or to a plural nominative form. 
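The loss of information caused by this fusion can be made concrete with a small generative sketch. It is a simplification assumed for illustration only: the function attach_possessive is hypothetical and simply lets the possessive suffix override a word-final consonant, which is enough to reproduce the four kala forms discussed above.

```python
VOWELS = set("aeiouyäö")


def attach_possessive(form, possessive):
    """Attach a possessive suffix; its first consonant overrides a final consonant."""
    if form and form[-1] not in VOWELS:
        form = form[:-1]          # the case or number consonant is lost
    return form + possessive


# sg nom, pl nom, sg gen, pl gen of kala ('fish') without a possessive.
plain_forms = ["kala", "kalat", "kalan", "kalojen"]
with_1pp = [attach_possessive(form, "mme") for form in plain_forms]
print(with_1pp)
```

Four distinct forms collapse into two after suffixation, so an analyzer that sees kalamme has to propose all three nominal readings.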
4.3 MORPHOTACTIC MODEL
An associative Morphotactic Model (MTModel) is a pair 
<{MRi},<*>, where {MRi} is a set of morphotactic rules 
(5a) and <* is a precedence relation in the set. <* is an 
irreflexive, antisymmetric, and nontransitive relation 
which imposes a coherence constraint on the rules. Each 
morphotactic rule associates a morphemic interpretation 
with a phonemic substring. The relation <* orders the 
rules in such a way that partial interpretations, when a 
word form is processed from right to left, contribute to 
valid total interpretations. 
ML:   SUURi + comp + sg + nom        SUURi + comp + pl + nom
MPL:  [illegible]
PL:   suur + empi + ε + ε            suur + emma + t + ε
ML:   SUURi + sg + part + 'kO'       JÄÄ + sg + part + 'kO'
MPL:                  A      kO                   A      kO
PL:   suur + ε + t + a + ko          jää + ε + t + ä + kö
Figure 3. Examples of archiphonemes.
ML: KALa + sg + nom           KALa + sg + nom + 1pp
PL: kala + ε + ε              kala + ε + ε + mme
ML: KALa + pl + nom           KALa + pl + nom + 1pp
                              kala + t + ε + mme
PL: kala + t + ε              kala + mme
ML: KALa + sg + gen           KALa + sg + gen + 1pp
                              kala + ε + n + mme
PL: kala + ε + n              kala + ε + mme
ML: KALa + pl + gen           KALa + pl + gen + 1pp
                              kalo + j + e + n + mme
PL: kalo + j + e + n          kalo + j + e + mme
Figure 4. Examples of fusion processes.
MRi <* MRj iff MRi can "immediately follow" MRj. A
rule can immediately follow another if the key of the 
former can be juxtaposed to the left of the latter on the 
phonemic level. The keys may not overlap, or be discon- 
tinuous, and their morphemic interpretations must obey 
the ordering (3). 
For coherence, the model also needs boundary rules.
Let ε denote a zero key for zero morphs, and let α and β
mark the zero keys for two special empty sets of
morphemes. The "rightmost" morphotactic rule MRα = α → []
and the "leftmost" morphotactic rule MRβ = β → [] in
the coherence constraint are defined below. The
two boundary rules have obvious interpretations: MRα
signals the right end of a word form and MRβ indicates a
stem boundary.
(7) ForAll (MRi)[NOT(MRα <* MRi)]
    ForAll (MRi)[NOT(MRi <* MRβ)]
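The properties required of the precedence relation can be checked mechanically once the relation is stored as a set of ordered pairs. The sketch below is an assumption-laden illustration: the function name check_constraint is hypothetical, a pair (x, y) stands for x <* y, and the sample pairs are a few of those listed in the fragment (8).

```python
def check_constraint(pairs):
    """Check irreflexivity, antisymmetry, and the boundary conditions (7)."""
    irreflexive = all(left != right for left, right in pairs)
    antisymmetric = all((right, left) not in pairs for left, right in pairs)
    # (7): MRα precedes nothing, and nothing precedes MRβ.
    alpha_is_rightmost = all(left != "MRα" for left, _ in pairs)
    beta_is_leftmost = all(right != "MRβ" for _, right in pairs)
    return (irreflexive and antisymmetric
            and alpha_is_rightmost and beta_is_leftmost)


# A few pairs taken from the fragment of the relation shown in (8).
FRAGMENT = {
    ("MR10", "MRα"), ("MR11", "MRα"), ("MR12", "MRα"),
    ("MR20", "MRα"), ("MR3", "MRα"),
    ("MRβ", "MR10"), ("MRβ", "MR11"), ("MRβ", "MR12"),
    ("MR4", "MR3"), ("MR4", "MR20"),
}
print(check_constraint(FRAGMENT))
```

Such a check is useful when rules are added by hand, because a pair that violates (7) would let an analysis run past the end of the word or past the stem boundary.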
Brodda and Karlsson (1980) tried to find the most likely 
morphotactic segmentation for a given Finnish word 
form drawn from a running text. The algorithm does not 
use a lexicon, neither does it associate phonemic 
segments with their morphemic interpretations. 
From that work we were able to extract and enumer- 
ate the valid phonemic keys for the morphotactic rules. 
The keys were then associated with their morphemic 
correlates and the rules were organized under the preced- 
ence relation <*. The set in (8) lists a small subset of the 
rule set and a fragment of the coherence constraint. For 
the sake of brevity, only the key is shown in the left hand 
side of a rule. To compress the rules, archiphonemes, 
typed in upper case letters, are used in keys whenever 
possible. Figure 5 illustrates this part of the coherence 
constraint in graphic form. 
Figure 5. Partial coherence constraint of the Morphotactic Model 
(8) MRα = α → []
MRβ = β → []
MR3 = ε → []
MR4 = ε → []
MR10 = ε → [sg,nom]
MR11 = ε → [act,ind,pres,3ps]
MR12 = t → [act,imp,pres,2ps]
MR20 = kO → ['kO']
MR21 = kin → ['kin']
MR32 = isi → [act,cond,pres,3ps]
MR41 = :n → [pres]
MR42 = tA → [pass,ind]
MR54 = mme → [1pp]
MR61 = ε → [pl,nom]
MR62 = ε → [sg,gen]
MR63 = ε → [act,ind,pres]
MR71 = mA → [act,IIIinf]
MR74 = ssA → [in]
MR75 = see → [sg,ill]
MR77 = i → [pl]
MR80 = nee → [act,IIpart]
MR81 = ne → [act,IIpart]
R<* = { <MR10,MRα>, <MR11,MRα>,
<MR12,MRα>, <MR20,MRα>,
<MR3,MRα>, <MRβ,MR10>,
<MRβ,MR11>, <MRβ,MR12>,
<MR32,MR20>, <MRβ,MR32>,
<MR4,MR3>, <MR21,MR3>,
<MR4,MR20>, <MR21,MR20>,
<MR41,MR4>, <MR41,MR21>,
<MR42,MR41>, <MRβ,MR42>, ... }
The rule set and the coherence constraint represent
the morphotactic part for morphological analysis. A
phoneme string is a morphotactically valid form if there is
a "path" between the "rightmost" rule, MRα, and the
"leftmost" rule, MRβ, in the coherence constraint. The
interpretation of the form is the union of the morphemes
associated with the rules along the path. For an ambiguous
word form more than one path exists between
MRα and MRβ.
The fragmentary rule set and the constraint in (8)
give, for instance, the following morphotactic interpretations
for the ambiguous form kalamme shown in Figure 4:
(9)
kala + [sg,nom,1pp]
(MRβ <* MR10 <* MR54 <* MR4 <* MR3 <* MRα)
kala + [sg,gen,1pp]
(MRβ <* MR62 <* MR54 <* MR4 <* MR3 <* MRα)
kala + [pl,nom,1pp]
(MRβ <* MR61 <* MR54 <* MR4 <* MR3 <* MRα)
kala + [act,ind,pres,1pp]
(MRβ <* MR63 <* MR54 <* MR4 <* MR3 <* MRα)
The first three are valid interpretations. The verbal inter- 
pretation, although morphotactically valid, does not 
result in an existing verb stem. That interpretation will be 
rejected by the Stem Alternation Model discussed below. 
That the verbal interpretation is indeed morphotactically
plausible can be seen, for instance, with the form
palamme, analyzed as pala+[act,ind,pres,1pp], which is a
valid interpretation for the verb lexeme palaa ('burn').
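The search for paths between MRα and MRβ can be sketched as a right-to-left walk over the precedence relation. This is a minimal illustration rather than the algorithm of the later sections: the names RULES, PRECEDES, and analyze are hypothetical, zero keys are written as empty strings, and the relation contains only the pairs that can be read off the four paths in (9), several of which are not part of the fragment printed in (8).

```python
# Rule name -> (key, contributed morphemes). Zero keys are empty strings.
RULES = {
    "MR3": ("", ()),
    "MR4": ("", ()),
    "MR54": ("mme", ("1pp",)),
    "MR10": ("", ("sg", "nom")),
    "MR61": ("", ("pl", "nom")),
    "MR62": ("", ("sg", "gen")),
    "MR63": ("", ("act", "ind", "pres")),
}

# (left, right) means left <* right; pairs inferred from the paths in (9).
PRECEDES = {
    ("MR3", "MRα"), ("MR4", "MR3"), ("MR54", "MR4"),
    ("MR10", "MR54"), ("MR61", "MR54"), ("MR62", "MR54"), ("MR63", "MR54"),
    ("MRβ", "MR10"), ("MRβ", "MR61"), ("MRβ", "MR62"), ("MRβ", "MR63"),
}


def analyze(form):
    """Return every (stem alternant, morphemes) pair licensed by a path."""
    results = []

    def walk(rule, pos, morphemes):
        for left, right in sorted(PRECEDES):
            if right != rule:
                continue
            if left == "MRβ":
                # Stem boundary reached: what remains to the left is the stem.
                results.append((form[:pos], morphemes))
                continue
            key, contributed = RULES[left]
            if form[:pos].endswith(key):
                walk(left, pos - len(key), contributed + morphemes)

    walk("MRα", len(form), ())
    return results


for stem, morphemes in analyze("kalamme"):
    print(stem, morphemes)
```

With this toy relation the form kalamme yields four readings on the stem alternant kala, matching (9); the verbal reading is left for the stem rules and the lexicon to reject.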
MTModel for Finnish consists of 178 rules. It is not 
yet an algorithm. It does not state how analysis is being 
done, that is, how control is to proceed in an analysis. 
These are matters of an algorithm discussed in a later 
section. The previous discussion has committed the
model to right-to-left processing, but reverse processing
or some more advanced control schemes might be
used as well.
5 STEM ALTERNATION MODEL 
For any given word form, MTModel resolves sets of 
morphemes that make up coherent wholes. MTModel also 
indicates stem alternant boundaries (MRβ) but leaves the
alternants intact. The Stem Alternation Model 
(SAModel) discussed in this section finds for each postu- 
lated stem alternant its basic form(s), or rejects it. We 
first discuss the stem alternants in Finnish as they are 
customarily described in the Word and Paradigm Model. 
We then describe associative rules for the analysis of 
stem alternants. 
5.1 THE STEM ALTERNATION PROBLEM 
The Standard Dictionary of Modern Finnish (Nykysuo- 
men sanakirja, 1966) describes the behavior of Finnish 
word forms in terms of the Word and Paradigm Model. It 
classifies nominals into 82 and verbs into 45 equivalence 
classes - paradigms - based on variations in their stem 
alternants. For each paradigm the classification gives a 
theme word, to represent the class, and its stem alter- 
nants. Thus, for instance, the nominal paradigms 10 and 
41, and the verb paradigm 25 are listed as in Figure 6. 
The theme words are KALa ('fish'), TOSi ('true'), and
TULla ('come'), respectively. (We have slightly edited
the entries for our purpose.) Upper case letters in Figure 
6 indicate the roots and the stem-forming affixes; lower 
case letters are reserved strictly for the alternant stem 
endings. 
The information conveyed by the paradigm tables can 
be compressed into two matrices below which show just 
the distributions of the stem endings. The rows of the 
matrices represent the paradigms and the columns 
morphemic contexts (not given here). Whenever allo- 
morphs generate different stem endings, the endings are 
enclosed in parentheses. The vertical bars separate singu- 
lar nominal stems from plural stems and active verbal 
stems from passive stems. The first column in both matri- 
ces represents the ending of the basic form, the lexeme. 
e v denotes a null ending in a vowel stem, e a null ending 
in general. Upper case letters mark here archiphonemes. 
Nominals:
                     sg                               pl
theme word (nom)  gen    part   ess     ill      gen          part   ill
KALa 10           -aN    -aA    -aNA    -aAN     -oJEN, -aIN  -oJA   -oIHIN
TOSi 41           -deN   -tTA   -teNA   -teEN    -sIEN        -sIA   -sIIN
...
Verbs:
theme word (Iinf): TULla 25
act: ind pres 1ps; ind past 3ps; pot pres 3ps; cond pres 3ps;
     imp pres 2ps -e; imp pres 3ps -KOON; IIpart -LUT
pass: pres; past; IIpart -TU
...
Figure 6. Examples of the Finnish word paradigms.
(10) Nominal stem endings:
01: e v, e v, e v, e v, e v | e v, e v, e v
02: e v, e v, e v, e v, e v | e v, e v, e v
03: e v, e v, e v, e v, e v | e v, e v, e v
04: i, i, i, i, i | i, e, e
05: i, i, i, i, i | (i,e), e, e
10: A, A, A, A, A | (O,A), O, O
41: si, de, t, te, te | s, s, s
Verbal stem endings:
01: e v, e v, e v, e v, e v, e v, e v, e v | e v, e v, e v
02: A, A, e, A, A, A, A, A | e, e, e
03: tA, dA, e v, tA, tA, dA, tA, tA | de, de, de
25: la, e, e, e, e, e, e, e | e, e, e
Each interpretation postulated by MTModel unambiguously
chooses a column. The problem of stem alternation
follows from the fact that the row of the stem is
not known. Should SAModel know, say, that the postulated
singular genitive stem käde in the form käden
represents paradigm 41, a simple substring replacement
operation would produce the correct lexeme KÄsi right
away (the singular genitive case occupies the second
column in the nominal matrix above).
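The substring replacement just described, and the uncertainty about the row, can be sketched as a table lookup. The code is illustrative only: the names COLUMNS, STEM_ENDINGS, restore_lexeme, and candidates are hypothetical, the column labels are an assumed reading of the eight nominal columns in (10), and only the two rows of (10) that consist of plain letters are included.

```python
# Assumed labels for the eight nominal columns of (10):
# five singular stems, then three plural stems.
COLUMNS = ["sg_nom", "sg_gen", "sg_part", "sg_ess", "sg_ill",
           "pl_gen", "pl_part", "pl_ill"]

# Rows 04 and 41 copied from (10); rows with archiphonemes or null
# endings are left out of this sketch.
STEM_ENDINGS = {
    "04": ["i", "i", "i", "i", "i", "i", "e", "e"],
    "41": ["si", "de", "t", "te", "te", "s", "s", "s"],
}


def restore_lexeme(stem, paradigm, column):
    """Replace the alternant ending by the basic ending of a known paradigm."""
    row = STEM_ENDINGS[paradigm]
    ending = row[COLUMNS.index(column)]
    if not stem.endswith(ending):
        return None
    return stem[:len(stem) - len(ending)] + row[0]


def candidates(stem, column):
    """Try every row, since the paradigm of an unknown stem is not given."""
    found = {}
    for paradigm in STEM_ENDINGS:
        lexeme = restore_lexeme(stem, paradigm, column)
        if lexeme is not None:
            found[paradigm] = lexeme
    return found


print(restore_lexeme("käde", "41", "sg_gen"))
print(candidates("riste", "pl_part"))
```

With the full table many rows would match a given ending, which is why the stem rules of the next section add phonemic contexts.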
5.2 STEM ALTERNATION MODEL 
Our associative SAModel consists of a set of stem rules
{SRi}, each of the form (5b) and retyped below:
(11) <pl context>a ending[mr context]
     → [conc(ROOT,b ending)]
'a ending' is an alternant stem ending. When a rule fires,
its alternant ending is replaced with the basic stem ending
('b ending'). The operator 'conc' concatenates the new
ending with the root, producing a hypothetical basic
word form. The consonant gradation process in roots is
not analyzed in SAModel. Weak and strong stems are 
dealt with as separate lexemes. 
The paradigm tables (10) yield data for morphemic 
contexts ('mr context') and alternant and basic endings. 
Alternant endings are necessary but not sufficient 
phonemic data for rules. Stem rules without phonemic 
contexts are too productive. 
Luckily, due to phonotactic reasons the orthographic 
distribution of roots (unvarying parts of stems) is uneven 
in various paradigms. A manageable number of short 
phoneme strings suffice to represent all roots of whole 
paradigms. The Reverse Dictionary of Finnish (Tuomi 
1980) lists practically speaking all Finnish basic word 
forms (in reverse order), including some archaic ones and 
some of foreign origin. Each lexeme is tagged with its 
paradigm number and syntactic category. That dictionary 
was a valuable source for the contextual phoneme strings 
for the stem rules. 
For stem rules a well-formed phonemic context
(WFPC) and its truth value are defined recursively as
follows. Any lower case letter in the Finnish alphabet is a
WFPC and the context is true if the last letter of a root is
identical to that letter. If &1, &2, ..., &n are WFPCs, then
the following constructions are also WFPCs:
(12) (i) &n...&2&1
     (ii) <&1,&2,...,&n>
(i) is true if &1 and &2 and ... and &n are true, in that
order. Testing continues from the point in a stem where
the previous test left off. (ii) is true if &1 or &2 or ... or
&n is true. The testing of &i's halts if a recognition
occurs. Each &i starts its test afresh.
To enhance compact notation we stipulate that a
single capital letter may represent a WFPC. Archiphonemes
are conveniently expressed A for <a,ä>, O for
<o,ö>, and U for <u,y>; the set of consonants and
vowels appear compactly as K for <d,f,g,h,j,k,l,m,n,
p,r,s,t,v> and V for <a,e,i,o,u,y,ä,ö>. But a WFPC of
any complexity can be denoted by a single upper case
letter.
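The recursive definition (12) translates directly into a small evaluator. The encoding below is an assumption made for illustration: a single letter is a Python string, a sequence (i) is a tuple written left to right as in the notation, an alternative set (ii) is a list, and the function name match is hypothetical. The function returns the position where testing left off, or None if the context is false.

```python
A = ["a", "ä"]
O = ["o", "ö"]
U = ["u", "y"]
K = list("dfghjklmnprstv")
V = ["a", "e", "i", "o", "u", "y", "ä", "ö"]


def match(context, root, pos):
    """Test `context` against `root` to the left of index `pos`.

    Returns the new position (further left) on success, None on failure.
    """
    if isinstance(context, str):
        # A single letter: true if the letter just left of `pos` is identical.
        if pos > 0 and root[pos - 1] == context:
            return pos - 1
        return None
    if isinstance(context, tuple):
        # (i) a sequence: the rightmost member is tested first, and each
        # test continues from where the previous one left off.
        for part in reversed(context):
            pos = match(part, root, pos)
            if pos is None:
                return None
        return pos
    if isinstance(context, list):
        # (ii) alternatives: each starts afresh; halt at the first recognition.
        for part in context:
            new_pos = match(part, root, pos)
            if new_pos is not None:
                return new_pos
        return None
    raise TypeError("unsupported context")


# Consonant followed by the archiphoneme A, tested at the end of "ka".
print(match((K, A), "ka", 2))
```

Note that position 0 is a valid success value, so callers should compare the result with None rather than rely on truthiness.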
The phonemic contexts vary in complexity in the rules 
in SAModel. Most of them have a fairly simple structure. 
Two paradigms are, however, without any phonemic 
contextual regularity. One is the nominal paradigm 08. 
The stem of the theme word LOVi ('notch') ends with an 
i in the basic form and with an e in singular genitive case 
love+n ('of a notch'). This paradigm represents an old 
form and the set of its lexemes is closed. All new nomi- 
nals that end with an i in the basic form retain the i in the 
genitive case and in other singular cases. For example, 
the theme word for the paradigm 04 is RISTi ('cross') 
and its genitive form is risti+n ('of a cross'). The criteri- 
on for choosing between paradigms 04 and 08 is not 
phonotactic; it is diachronic. Therefore, no phonemic 
context short of a minilexicon would help us to resolve,
say, that suurin (a valid superlative form for suuri ('big'))
is not SUURi+[sg,gen], as muurin (for muuri ('wall')) is
MUURi+[sg,gen]. We solved the problem by using two
kinds of i's as the last letter of a lexeme.
SAModel consists of 280 rules. Added context sensitivity
greatly increased the quality of stem rules. A stem
alternant produces only slightly more than one basic form
on average. The stem rules augment the coherence
constraint of MTModel with an obvious component: a 
morphotactically coherent word form passes the coher- 
ence test of SAModel only if at least one of the basic 
forms generated by the stem rules is a valid lexeme. 
To illustrate the interplay of MTModel and SAModel,
käsissämmekö will be analyzed KÄsi+[pl,in,1pp,'kO'] in
the way shown in Figure 7. The figure exhibits schematically
only the stem rule responsible for the correct
lexeme. The morphotactic rules in Figure 7 are from (8).
The form gets other morphotactically coherent segmenta- 
tions as well, but they are rejected by the stem rules and 
the lexicon. 
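The final step of this example, a stem rule followed by the lexicon check, can be sketched as follows. The code is a simplified illustration and not the authors' implementation: the function apply_stem_rule and the toy LEXICON are hypothetical, the left context <A> of the rule shown in Figure 7 is reduced to a set of letters, and its morphemic context is reduced to the two morphemes that are legible there (in and pl).

```python
# Toy lexicon, assumed for the example only.
LEXICON = {"käsi", "kala", "takki"}


def apply_stem_rule(stem, morphemes, a_ending, b_ending, left_letters, required):
    """Apply one stem rule of the form (5b); return a hypothetical lexeme or None."""
    if not stem.endswith(a_ending):
        return None
    root = stem[:len(stem) - len(a_ending)]
    # Left phonemic context: the last letter of the root must be allowed.
    if not root or root[-1] not in left_letters:
        return None
    # Right morphemic context: set inclusion.
    if not set(required) <= set(morphemes):
        return None
    return root + b_ending          # conc(ROOT, b ending)


# MTModel has postulated the stem alternant käs with [pl, in, 1pp, 'kO'].
morphemes = ["pl", "in", "1pp", "kO"]
lexeme = apply_stem_rule("käs", morphemes, "s", "si", "aä", {"pl", "in"})
print(lexeme, lexeme in LEXICON)
```

A hypothesis that survives the stem rule but is missing from the lexicon is either rejected or, in an open lexicon, reported as a possible new lexeme.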
6 ALGORITHM 
One can think of various alternative algorithms to realize
the model. A multiprocessor environment might make the
blackboard strategy used in Hearsay-II (Erman et al.
1980) an attractive alternative. Our choice was a monoprocessor
environment and right-to-left strategy: first all
morphotactically coherent stem alternants are postulated,
then stem rules and dictionary check are invoked in that
order for each alternant. The algorithmic issues are briefly
discussed in this section.
6.1 MORPHOTACTICS 
First we decided to implement MTModel as a structured 
collection of interconnected "islands". Each island 
comprises the possible and mutually exclusive morpho- 
tactic rules at any given point of processing; the rules 
represent valid paths through the island. The coherence 
constraint provides "bridges" between the islands, a 
bridge indicating a valid continuation after a walk 
through an island. Computationally the islands were 
finite state transition automata. 
There were 32 distinct automata: 3 for clitics, 1 for 
person, 5 for tense, 3 for case, 2 for number, 3 for 
passive, 5 for participle, 5 for comparison, and 5 for 
infinitive rules. To assist automatic compilation of the 
rules into the automata, the morphotactic rules were slightly 
modified to read as follows: 
(13) automaton: (pl2) (pl1) allomorph 
       → ([morphemes], [next automata]) 
     e.g., 
     Rpp: (ALL-[i:,O:])(V+X~)n 
       → ([verb, act, ind, 1ps], [Tense1]) 

[Figure 7. The analysis of käsissämmekö. The stem rule 
SR124 = <A>s[..., in(ssA), pl(i)] → [conc(ROOT, si)] maps the stem 
alternant käs to the lexeme KÄSi, and a chain of morphotactic rules 
from (8) segments the suffixes, giving + [pl, in, 1pp, 'kö'].] 
Left contexts of phonemes were restricted to 
expressions of two optional sets: one for the phoneme next to 
the left and one for the phoneme second to the left. The term 
automaton names the island the rule belongs to; next automata 
identifies the valid continuations after this path. The mr 
context in (5a) is represented implicitly in (13) as the path 
leading to this rule. 
The example rule belongs to the person automaton. 
The rule recognizes the 1ps suffix n for an active indicative 
verb if an n is found such that it has an ordinary or 
stressed vowel first to the left and any phoneme except a 
long n, o, or ö second to the left. Control proceeds to the 
automaton Tense1 to identify modal and temporal 
morphemes. In general, more than one continuation 
automaton is possible. 
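A rule of the form (13) can be pictured as a small record plus a matching function. The following is a sketch only, assuming that phonemes are single letters; the class Rule, the function match, and the constant PERSON_1PS are hypothetical names, and the exclusion of long phonemes in the second-left context is omitted because single letters cannot express it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

VOWELS = frozenset("aeiouyäö")


@dataclass(frozen=True)
class Rule:
    automaton: str                   # island the rule belongs to
    pl2: Optional[FrozenSet[str]]    # allowed phonemes second to the left (None = any)
    pl1: Optional[FrozenSet[str]]    # allowed phonemes next to the left (None = any)
    allomorph: str
    morphemes: Tuple[str, ...]
    next_automata: Tuple[str, ...]   # valid continuations after this path


def match(rule, word, end):
    """Try the rule with its allomorph ending just before index `end`.

    Return the index where the allomorph starts, or None if it does not apply.
    """
    start = end - len(rule.allomorph)
    if start < 0 or word[start:end] != rule.allomorph:
        return None
    for offset, allowed in ((1, rule.pl1), (2, rule.pl2)):
        if allowed is None:
            continue
        pos = start - offset
        if pos < 0 or word[pos] not in allowed:
            return None
    return start


# Simplified stand-in for the example rule: suffix n after a vowel.
PERSON_1PS = Rule("Person", None, VOWELS, "n",
                  ("verb", "act", "ind", "1ps"), ("Tense1",))

print(match(PERSON_1PS, "kadun", 5))
```

When the rule applies, the returned index marks where the remaining, still unanalyzed part of the word form ends, and next_automata names the islands that may continue the analysis.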
The island approach worked quite well. However, it 
was redundant because identical transition paths existed 
for different automata. To save memory, we imple- 
mented another version of MTModel, this one as an 
orthographic tree of the keys (and rules). The islands 
were layered, so to speak, on top of each other. A pass 
through the constraint in (8), or a walk through an 
island, corresponds now to a traversal through the tree. 
Coherence is satisfied if, for each transition along a path 
from an MRα to an MRβ, a successful walk through the 
orthographic tree can take place. Automatic compilation 
again transforms the rules of the form (13) into the 
orthographic tree. 
The orthographic tree occupied only about one-tenth 
of the memory needed for the island approach. Using a 
novel key-and-lock construct, we were also able to speed 
up the analysis. With each node in the tree a "lock" was 
associated as a union of the automata names ('next 
automata' in (13)) in its subtree. Each traversal through 
the tree provides a "key" as a set of possible continua- 
tions. During the next traversal the key is checked in the 
lock of each node along the path and only a match (non- 
empty intersection) permits continuation. This method 
aborts fruitless attempts through the tree early on. 
Morphotactic analysis in the orthographic tree with this 
lock-and-key approach takes about 40% of the time the 
original island approach took. 
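The key-and-lock idea can be sketched as follows. This is one possible reading, not the original PASCAL code: the lock of a node is taken to be the set of automata that own some rule in its subtree, allomorphs are stored reversed so that the tree is walked leftwards from the end of the word, and the four rules and all names (Node, insert, traverse) are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # previous letter (reading leftwards) -> Node
        self.rules = []     # (automaton, morpheme, next_automata) ending here
        self.lock = set()   # automata owning some rule in this subtree


def insert(root, allomorph, automaton, morpheme, next_automata):
    """Store the allomorph reversed, adding its automaton to every lock on the path."""
    node = root
    node.lock.add(automaton)
    for letter in reversed(allomorph):
        node = node.children.setdefault(letter, Node())
        node.lock.add(automaton)
    node.rules.append((automaton, morpheme, frozenset(next_automata)))


def traverse(root, word, end, key):
    """Walk leftwards from word[end - 1]; return (new_end, morpheme, next_key) hits."""
    hits, node, pos = [], root, end
    while key & node.lock:  # an empty intersection aborts the walk early
        for automaton, morpheme, next_key in node.rules:
            if automaton in key:
                hits.append((pos, morpheme, next_key))
        if pos == 0 or word[pos - 1] not in node.children:
            break
        node = node.children[word[pos - 1]]
        pos -= 1
    return hits


root = Node()
insert(root, "kö", "Clitic", "'kö'", {"Person", "Case"})
insert(root, "mme", "Person", "1pp", {"Case"})
insert(root, "ssä", "Case", "in", {"Number", "Stem"})
insert(root, "i", "Number", "pl", {"Stem"})

word = "käsissämmekö"
print(traverse(root, word, len(word), {"Clitic"}))  # the clitic kö is found
print(traverse(root, word, len(word), {"Case"}))    # aborted after one letter
```

In the second call the key {"Case"} does not fit the lock of the node reached through ö, so the walk stops before the rest of the path is examined. Each hit returns the next key, which is passed to the following traversal.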
6.2 STEMS AND LEXICON 
The control of the stem rules was first realized as an 
orthographic tree of "prolonged stem endings". A 
prolonged ending concatenates an alternant ending with 
its contextual strings. The 280 stem rules yield 420 
distinct extended stem endings. Exit points were marked 
in the tree and morpheme contexts were attached to 
these nodes as exit conditions. Basic stem endings were 
also associated with the exit points. A stem alternant 
traversed the tree and produced basic forms along the 
path whenever the exit condition was satisfied in exit 
nodes. 
The stem alternant tree, however, wasted so much memory 
that we also implemented a hash-coded version 
of the extended endings. This version saves considerable 
memory without a noticeable increase in analysis 
time. 
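A hash-coded table of stem endings might look like the sketch below. It is an illustration under assumptions: a Python dict serves as the hash table, contextual strings are left out, the two entries are toy rules suggested by the examples käsissä and vimpuloissa rather than any of the 280 actual rules, and the names STEM_ENDINGS and basic_forms are hypothetical.

```python
# alternant ending -> list of (required morphemes, basic ending)
STEM_ENDINGS = {
    "s": [(frozenset({"pl", "in"}), "si")],  # käs -> käsi
    "o": [(frozenset({"pl"}), "a")],         # vimpulo -> vimpula
}
MAX_ENDING = max(len(ending) for ending in STEM_ENDINGS)


def basic_forms(stem, morphemes):
    """Look up every ending of the stem alternant, longest first."""
    present = set(morphemes)
    found = []
    for n in range(min(MAX_ENDING, len(stem)), 0, -1):
        ending = stem[-n:]
        for required, basic_ending in STEM_ENDINGS.get(ending, ()):
            if required <= present:  # exit condition on the morpheme context
                found.append(stem[:-n] + basic_ending)
    return found


print(basic_forms("käs", ["pl", "in", "1pp"]))
print(basic_forms("vimpulo", ["pl", "in"]))
```

Each lookup costs one hash probe per candidate ending length, which is consistent with the observation that the hash-coded version does not noticeably slow the analysis.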
A word form is valid only if it has at least one coher- 
ent morphotactic interpretation and if at least one of the 
lexemes produced by the stem rules appears in the lexi- 
con. Dictionary organization and its search procedure 
constitute therefore an integral part of the algorithm. In 
our implementation the dictionary is composed of three 
distinct parts. The main dictionary is preceded by a hash- 
coded lexicon that contains the function words. 
The main dictionary consists of an open set of adjec- 
tives, nouns, and verbs (and also numerals). It is imple- 
mented as a backward-sorted orthographic tree. The 
unconventional ordering allows for iterative analysis of 
compound word forms. Lexemes whose roots participate 
in the consonant-gradation process have two separate lexical 
entries: weak and strong. 
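Why a backward-sorted tree suits compounds can be seen in a small sketch. It assumes, as a simplification, that every component of the compound appears in its basic form; the tree is a nested dict keyed on reversed lexemes, and the names build, tail_lexemes, and split_compound are hypothetical.

```python
def build(lexemes):
    """Build a tree keyed on the letters of each lexeme read backwards."""
    root = {}
    for lexeme in lexemes:
        node = root
        for letter in reversed(lexeme):
            node = node.setdefault(letter, {})
        node[None] = lexeme  # None marks the end of a complete lexeme
    return root


def tail_lexemes(root, word):
    """Return the lexemes that end the word, shortest first."""
    node, found = root, []
    for letter in reversed(word):
        if letter not in node:
            break
        node = node[letter]
        if None in node:
            found.append(node[None])
    return found


def split_compound(root, word):
    """Peel lexemes off the right end until nothing is left; None on failure."""
    if word == "":
        return []
    for tail in reversed(tail_lexemes(root, word)):  # longest tail first
        head = split_compound(root, word[: -len(tail)])
        if head is not None:
            return head + [tail]
    return None


root = build(["pitkä", "perjantai"])
print(split_compound(root, "pitkäperjantai"))
```

Because the tree is indexed from the end of the word, the last component is found first and the same search is simply repeated on the remainder.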
7 IMPLEMENTATION AND DEMONSTRATION 
The ultimate test of the model and the algorithm lies in 
its performance. We felt that the primary justification of 
our model is its capability of meeting certain functional 
requirements: 
• clean separation between linguistic knowledge and 
algorithms, 
• general analysis method, 
• efficient analysis, and 
• support of an open lexicon. 
The model separates linguistic data from algorithms, as 
the discussion above has indicated. Due to the rule struc- 
ture the model has proved to be easy to augment, and 
now it covers the entire Finnish inflectional morphology. 
This satisfies the second requirement. In this section we 
discuss efficiency, open lexicon, and other issues of 
implementation and demonstrate the implementation. 
For reasons of efficiency and portability, we implemented 
the algorithm in PASCAL. Separate compiler 
procedures transform the associative rules into their 
internal representations, as discussed in the previous 
section. The orthographic morphotactic tree takes about 
4kW and the hash coded extended stem endings 5kW of 
DEC2060 memory. The procedures that utilize these data 
structures take about 20kW. The two hash-coded front 
lexicons also reside in main memory. They already 
cover the majority of function words in Finnish. Their 
data structures and code together occupy 21kW of 
DEC20 memory. There is also a version on VAX11 and 
one on IBM PC/XT. In the latter, MORFO, as we call the 
system, takes up 305kB of memory. That figure includes 
MS-DOS. 
The main lexicon resides on disc. As of this writing the 
main lexicon contains over 30,000 of the most frequently 
used Finnish verbal and nominal lexemes taken from 
Saukkonen et al. (1979) and from running ordinary texts. 
Figure 8 shows a few sample analyses with the trace 
mode of the system switched on. Alusta is a highly 
ambiguous word form in Finnish. MTModel (JAOTIN) 
finds six coherent morphotactic interpretations for it. 
SAModel (MUOKKAIN) extracts two different basic word 
forms for the first interpretation, one for the second and 
the third, three for the fourth, and none for the fifth and 
the sixth. ('VA', 'HA', and 'NE' stand for strong (or 
neutral), weak (or neutral) and neutral grade, respective- 
ly. The numbers within angle brackets are identifiers of 
the stem rules.) The weak stem alu is accompanied by its 
strong partner alku. Of the seven postulated lexemes, 
five are actually found in the main lexicon (SANAKIRJAT). 
The presence of the affix n (gen or 1pp) greatly 
reduces ambiguity, as Figure 8 further shows. The morph 
n is either gen or 1pp person. This information 
disqualifies the cases el and part. 
We have tested the system rather extensively. In addi- 
tion to randomly picked word forms we typed in, a typist 
entered news reports and columns picked from various 
Finnish newspapers. The test texts also included, of 
course, function words and compound word forms. Over 
300,000 forms have been entered in this way. The analysis 
of a word form takes about 20 ms of DEC2060, 35 ms of 
VAX11/780, and 50 ms of VAX11/750 CPU time on 
average. Throughput on an IBM PC/XT is about 95 
word forms per minute. These figures satisfy our functional 
requirement for an efficient analysis method. 
As an example trace of the system at work, the first 
word forms of Genesis in the Finnish Bible are analyzed 
by MORFO in the way shown in Figure 9. (Our lexicons 
carry English equivalents for each lexeme.) 
[Figure 8: trace output. For each input the trace lists the 
segmentations found by JAOTIN, the basic forms postulated by 
MUOKKAIN with their grade marks and stem-rule identifiers, the 
lexemes found by SANAKIRJAT, and the resulting analyses:] 

Sana: ALUSTA 
   ALUSTA     BASE          Noun SG Nom 
   ALUSTAA    INITIALIZE    Verb Act Imper Pr S 2P 
   ALKU       BEGINNING     Noun SG El 
   ALUNEN     BEDDING       Noun SG Part 
   ALUS       SHIP          Noun SG Part 
Sana: ALUSTAN 
   ALUSTAA    INITIALIZE    Verb Act Ind Pr S 1P 
   ALUSTA     BASE          Noun SG Gen 

Figure 8. Analysis of alusta and alustan. 
> Alussa loi Jumala taivaan ja maan. 
SANA: Alussa 
   ALKU      BEGINNING        Noun SG In 
SANA: loi 
   LUODA     CREATE           Verb Act Ind Imp S 3P 
SANA: Jumala 
   JUMALA    GOD              Noun SG Nom 
SANA: taivaan 
   TAIVAS    HEAVEN           Noun SG Gen 
SANA: ja 
   JA        AND              Particle Conj 
SANA: maan 
   MAA       EARTH/COUNTRY    Noun SG Gen 
> Ja maa oli autio ja tyhjä ja pimeys oli syvyyden päällä. 
SANA: Ja 
   JA        AND              Particle Conj 
SANA: maa 
   MAA       EARTH/COUNTRY    Noun SG Nom 
SANA: oli 
   OLLA      BE               Verb Act Ind Imp S 3P 
SANA: autio 
   AUTIO     DESERT           Adjective SG Nom 
SANA: ja 
   JA        AND              Particle Conj 
SANA: tyhjä 
   TYHJÄ     EMPTY            Adjective SG Nom 
SANA: ja 
   JA        AND              Particle Conj 
SANA: pimeys 
   PIMEYS    DARKNESS         Noun SG Nom 
SANA: oli 
   OLLA      BE               Verb Act Ind Imp S 3P 
SANA: syvyyden 
   SYVYYS    DEPTH            Noun SG Gen 
SANA: päällä 
   PÄÄLLÄ    UPON             Particle Adverb 
   PÄÄLLÄ    ON               Particle Prep 
   PÄÄ       HEAD             Noun SG Ad 
Figure 9. Analysis of the first words in the Finnish Bible. 
A randomly picked verbal lexeme, say katua 
('repent'), to continue in the Biblical domain, has some 
of its various forms analyzed in Figure 10. Notice, by the 
way, how the verbal forms katua and kadun are homonymous 
with the partitive and genitive forms of katu ('street'). 
(In Figure 10 'imp' stands for past, 'imper' for imper- 
ative; 's' for singular in verbs, 'sg' singular in nominals; 
'p' for plural in verbs, 'pl' plural in nominals.) 
The analysis of compound word forms is automatically 
invoked, if none of the basic forms postulated by the 
stem rules is found in the main lexicon. If this analysis 
also fails, control proceeds to the lexical acquisition 
mode. Good Friday is the compound pitkäperjantai in 
Finnish. (Its literal translation into English is Long Friday.) 
That compound belongs to a subclass of complex lexical 
items whose modifying part is inflected in various 
cases in agreement with the head. Incidentally, this 
phenomenon also holds for numerals. Figure 11 shows 
example analyses of some forms of pitkäperjantai. 
katua kadun katuvimmillaan katukaamme katumisessansa 
Sana: KATUA 
   KATU     STREET    Noun SG Part 
   KATUA    REPENT    Verb Act Iinf SG Nom 
Sana: KADUN 
   KATUA    REPENT    Verb Act Ind Pr S 1P 
   KATU     STREET    Noun SG Gen 
Sana: KATUVIMMILLAAN 
   KATUA    REPENT    Verb Act Ipartic Sup PL Ad 3P 
Sana: KATUKAAMME 
   KATUA    REPENT    Verb Act Imper Pr P 1P 
Sana: KATUMISESSANSA 
   KATUA    REPENT    Verb Act IVinf SG In 3P 
Figure 10. Sample analyses of forms of katua. 
> pitkäperjantai pitkänäperjantaina pitkäksiperjantaiksi 
Sana: PITKÄ.. (pitkäperjantai) 
   PITKÄ        LONG      Adjective SG Nom 
Sana: ..PERJANTAI (pitkäperjantai) 
   PERJANTAI    FRIDAY    Noun SG Nom 
Sana: PITKÄNÄ.. (pitkänäperjantaina) 
   PITKÄ        LONG      Adjective SG Ess 
Sana: ..PERJANTAINA (pitkänäperjantaina) 
   PERJANTAI    FRIDAY    Noun SG Ess 
Sana: PITKÄKSI.. (pitkäksiperjantaiksi) 
   PITKÄ        LONG      Adjective SG Transl 
Sana: ..PERJANTAIKSI (pitkäksiperjantaiksi) 
   PERJANTAI    FRIDAY    Noun SG Transl 
Figure 11. Sample analyses of compound nouns. 
> paholaisen 
Sana: PAHOLAISEN 
 1: PAHOLAISEN     HA   Noun SG Nom 
 2: PAHOLAIS TA    VA   Verb Act Ind Pr S 1P 
 3: PAHOLAI NEN    NE   Noun SG Gen 
Lisätään no. (0 = poistu, <esc> = peru, y = ysana, m = menu) : 3 
Subst/Adj/Numer/Verb : S 
Semantiikka : devil 
Sana: paholaisen 
   PAHOLAINEN    DEVIL    Noun SG Gen 
Lisätään no. (0 = poistu, <esc> = peru, y = ysana, m = menu) : 
> paholaistako 
Sana: PAHOLAISTAKO 
   PAHOLAINEN    DEVIL    Noun SG Part ko 
Figure 12. An example of lexical acquisition. 
We may now state in more precise terms in what way 
our model is capable of supporting open lexicons. Maybe 
inadvertently, we had not inserted paholainen ('devil') in 
the lexicon. If we input one of its forms, say paholaisen 
('devil's'), the failure of the analysis prompts the user to 
choose one of the postulated basic forms as shown in 
Figure 12. When the user has chosen the only valid 
option (3), supplied the syntactic category ('S' for 
substantive), and provided the English equivalent, the 
lexeme enters the lexicon, as the subsequent test 
in Figure 12 shows. In this convenient manner, we have built up 
our lexicon to hold about 30,000 entries. We continuous- 
ly augment the lexicon from running texts. 
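The acquisition step can be summarized in a short sketch. It is illustrative only: the lexicon is a plain dict, the user dialogue is replaced by a callback so that the example runs unattended, and the names acquire and choose are hypothetical. The candidate list mirrors the three options of Figure 12.

```python
def acquire(candidates, lexicon, choose):
    """Return analyses for known lexemes; otherwise ask the user to add one.

    candidates: list of (basic_form, interpretation) pairs from the stem rules.
    choose:     callback returning (index, category, gloss), or None to cancel.
    """
    known = [(form, interp, lexicon[form])
             for form, interp in candidates if form in lexicon]
    if known:
        return known
    picked = choose(candidates)
    if picked is None:
        return []
    index, category, gloss = picked
    form, interp = candidates[index]
    lexicon[form] = (category, gloss)  # the lexeme enters the lexicon
    return [(form, interp, lexicon[form])]


lexicon = {}
candidates = [
    ("paholaisen", "Noun SG Nom"),
    ("paholaista", "Verb Act Ind Pr S 1P"),
    ("paholainen", "Noun SG Gen"),
]
# The user picks option 3 (index 2), category 'S', gloss 'devil'.
print(acquire(candidates, lexicon, lambda c: (2, "S", "devil")))
# A later form of the same lexeme is now analyzed without any dialogue.
print(acquire([("paholainen", "Noun SG Part ko")], lexicon, lambda c: None))
```

The point of the sketch is that the candidates are produced by the rules before the lexicon is consulted, so an unknown lexeme costs the user one choice rather than a hand-written entry.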
8 DISCUSSION 
This model for the analysis of word forms originated 
from certain functional requirements. The two most 
important ones were efficiency and an open lexicon. 
These two pragmatic considerations imposed quite 
naturally two specific strategic constraints which may be 
of theoretical interest. 
Efficiency resulted in a computational strategy to use 
fully realized morphs as primitives in analysis rather than 
their abstract, morphophonemic representations. Compu- 
tationally speaking, analysis amounts then to the ordered 
recognition of phoneme substrings (morphs) within a 
phoneme string (input word form). The result of an anal- 
ysis is the union of the morphemes associated with the 
morphs. The model resulted in an efficient running 
system, as we have described. 
The requirement of an open lexicon resulted, under 
the constraint of a sequential machine, in a right-to-left 
processing strategy. Only then are stem alternants of 
those lexemes not yet listed in the lexicon unambiguously 
recognized and analyzed. Compact expressions of 
phonemes suffice to represent the phonetic make-up of 
whole paradigms - unbounded sets of phonotactically 
possible stems. Right-to-left processing enables the stem 
rules using these expressions to handle all stem alter- 
nants, regardless of their occurrence in the lexicon. 
Native speakers also have the ability to analyze forms of 
non-existing lexemes. 
Brodda and Karlsson (1980) report on a program 
that aims at the most probable segmentation of Finnish 
word forms without using a lexicon. The program neither 
interprets segmentations nor finds all possibilities. 
Källgren (1983) describes a prototype system for the 
analysis and synthesis of Finnish nominals. Sågvall-Hein 
(1980) has studied the applicability of the chart parsing 
method for the morphological analysis of Finnish word 
forms. As this experiment did not result in a full-scale 
model, we do not discuss it further. Two other papers 
report more or less complete solutions for the analysis of 
inflected Finnish word forms. 
Karttunen et al. (1981) report on an implementation 
that "can recognize, in a fraction of a second, any 
inflected form of a word it has stored in its lexicon .... 
The present lexicon consists of about 100 roots .... It can 
analyze a short unambiguous word in less than 20 milli- 
seconds \[DEC-2060/Interlisp\]. A long word or a 
compound that requires a lot of disambiguation can take 
ten times longer" (emphasis and comment added). The 
model utilizes phonetically realized morphs in analysis, as 
we do, but stores them in separate suffix lexicons. No 
explicit precedence relation links the morphs and, there- 
fore, each suffix entry must carry a description of its 
environment. Processing is from left to right. A root lexi- 
con first contributes a set of roots matching the input 
form. Each root entry lists constraints its valid forms 
must obey. As the residual input form is processed 
phoneme by phoneme rightward, the interplay between 
the root constraints and the retrieved suffix entries selects 
the roots that match the input form completely. A lot of 
run-time bookkeeping is involved. In this model, as in 
other models that use a left-to-right strategy, processing 
time is highly sensitive to the size of the root lexicon: 
the larger the lexicon, the more matching roots there are. 
Their root lexicon carries entries of such complexity 
that only a linguist can add them. Below appear their lexical 
entry, an ordinary bilingual dictionary entry, and our lexical 
entries for mato ('worm'). (Our model uses two 
distinct entries for the gradated stem, but the system 
automatically generates the pair.) 
(14) ma SUBST IntAlt t-d FinAlt o    (Karttunen et al.) 
     mato SUBST                      (Ordinary dictionary) 
     mato SUBST VA                   (Our) 
     mado SUBST HA mato              (Our) 
Koskenniemi (1983) presents a "two-level model" and 
reports: "With a large lexicon it takes about 0.1 CPU 
seconds Burroughs B7800/PASCAL to analyze reasonably 
complicated word forms." This model also runs from 
left to right and first collects from the root lexicon the 
roots that match. Like Karttunen et al., Koskenniemi also stores 
morphs in separate suffix lexicons and retains from the 
initial root set only those roots whose combinations with 
suffixes match the input word form. The main difference is 
that Koskenniemi uses abstracted, morphophonemic 
representations of morphs, and the matching of a suffix is 
not performed by expert routines in the lexicons but by 
external rules, implemented as finite state automata. The 
morphophonemic representations of morphs and the 
processing rules capture linguistically appealing gener- 
alizations. The realization of rules at run time at least 
partly explains the slow speed reported. Lexical entries 
are abstracted roots and hence unnatural to ordinary 
users, although less so than in Karttunen et al. Below is 
his entry for hakata ('hack') contrasted with our entries. 
(15) hakKa VERB                      (Koskenniemi) 
     hakata VERB                     (Ordinary dictionary) 
     haka ta VERB HA                 (Our) 
     hakka ta VERB VA hakata         (Our) 
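How the pair of entries in (14) and (15) could be generated automatically is sketched below. This is a toy under stated assumptions, not the system's procedure: the gradation table covers only four alternations, it is applied to the consonants just before the final vowel of the stem, and the names STRONG_TO_WEAK, regrade, and entry_pair are hypothetical.

```python
# A small, illustrative subset of consonant gradation (strong -> weak).
STRONG_TO_WEAK = {"kk": "k", "pp": "p", "tt": "t", "t": "d"}
WEAK_TO_STRONG = {weak: strong for strong, weak in STRONG_TO_WEAK.items()}


def regrade(stem, table):
    """Replace the consonants before the final vowel according to the table."""
    body, vowel = stem[:-1], stem[-1]
    for old in sorted(table, key=len, reverse=True):  # longest pattern first
        if body.endswith(old):
            return body[: -len(old)] + table[old] + vowel
    return None


def entry_pair(stem, ending, category, grade):
    """Return the entry for the basic form and its partner in the other grade.

    Each entry is (stem, ending, category, grade, basic_form_pointer).
    """
    base = stem + ending
    if grade == "VA":
        partner, partner_grade = regrade(stem, STRONG_TO_WEAK), "HA"
    else:
        partner, partner_grade = regrade(stem, WEAK_TO_STRONG), "VA"
    return [
        (stem, ending, category, grade, None),
        (partner, ending, category, partner_grade, base),
    ]


print(entry_pair("mato", "", "SUBST", "VA"))
print(entry_pair("haka", "ta", "VERB", "HA"))
```

The user supplies only the ordinary dictionary form and its grade; the partner entry, with its pointer back to the basic form, is derived mechanically. Weak-to-strong derivation is ambiguous in general, which is why the table here is deliberately tiny.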
Neither Karttunen et al. nor Koskenniemi seems to 
consider it important, as we do, that the form of lexical 
entries be "natural" to casual users. This issue certainly 
has at least practical import. Both of the discussed 
models process from left to right, unlike our model. It 
follows that they cannot analyze forms of lexemes not 
yet in the lexicon. For example, vimpuloissa is a phonotactically 
well-formed form, and all native speakers would 
agree that it is the plural inessive of the meaningless 
word vimpula. As these models begin an analysis by 
first finding the matching roots, they are in deep trouble with 
a form whose root is missing. As our model proceeds 
from right to left and has all morphological knowledge 
embedded in rules, vimpuloissa will be analyzed as 
vimpula + [pl, in]. This interpretation is rejected only 
because the lexeme does not exist. 
One might argue that Koskenniemi could also augment 
his model to handle new forms by supplementing the root 
lexicon with a dummy auxiliary lexicon whose entries are 
skeletons that represent well-formedness constraints on 
Finnish stems. But without knowledge of the stem boundary 
of an unknown lexeme, the system would have to invoke 
skeletons haphazardly, and processing time would 
degrade. 
9 SUMMARY 
We have described an associative model for the morpho- 
logical analysis of word forms of Finnish, which is an 
agglutinative language. Such a model consists of sets of 
associative morphotactic and stem rules that directly link 
phonemic segments with their morphemic interpretations, 
and of a holistic coherence constraint which filters out 
the associations that make up coherent wholes. We have 
argued that such an associative model results in an effi- 
cient analysis and that it supports open lexicons. We then 
described a fragment of our fully defined associative 
model for Finnish word forms. We also discussed an 
algorithm and its various implementations. 
The algorithm has been fully implemented in PASCAL. 
Our tests demonstrate that the model satisfies quite well 
the functional requirements we set for it. A clear sepa- 
ration exists between linguistic knowledge (associative 
rules, the coherence constraint, and the lexicons) on the one 
hand and the algorithm on the other. The model provides 
a general analysis method of Finnish word forms, includ- 
ing compound word forms. (The model can easily be 
extended to analyze derivational word forms as well, and 
in fact it currently analyzes a few of the most commonly 
used derivational forms.) Analysis is efficient as it takes 
about 30 ms of VAX11/780 CPU-time on average to 
analyze word forms in a running text. The figure includes 
the analysis of the compound word forms that ordinary 
newspaper texts contain. Throughput on an IBM PC/XT is about 
95 forms per minute. The model supports open lexicons, 
as shown by the fact that our more than 30,000 lexical 
entries have been added from ordinary 
word forms entered into the system. 
ACKNOWLEDGMENTS 
This research has been supported by the SITRA Foundation. 
We greatly appreciate the help of Professor Tuomo 
Tuomi, who gave us a computer print-out of the Reverse 
Dictionary of Finnish. Aarno Lehtola and Esa Nelimarkka 
have contributed to the formation of the model. Juha 
Niemistö has been heavily involved in the implementation 
of various algorithms. In addition, Asko Hentunen, 
Esko Nuutila, Pentti Soini, Panu Viljamaa, and Vesa 
Yläjääski, all students at the Helsinki University of 
Technology, implemented various versions of the algorithm. 
We greatly appreciate their help. 

REFERENCES 
Brodda, B. and Karlsson, F. 1980 An Experiment with Automatic 
Morphological Analysis of Finnish. Papers from the Institute of 
Linguistics, University of Stockholm, Stockholm. 
Cercone, N.; Boates, J.; and Krause, M. 1983 A Semi-interactive 
System for Finding Perfect Hash Functions. Technical Report in 
Computing Science, Simon Fraser University, Burnaby, British 
Columbia, Canada. 
Cercone, N. and Mercer, R. 1980 Design of Lexicons in some Natural 
Language Systems. ALLC Journal 1(2): 37-59. 
Cichelli, R. 1980 Minimal Perfect Hash Functions Made Simple. 
Comm ACM 23(1): 17-19. 
Erman, L.; Hayes-Roth, F.; Lesser, V.; and Reddy, D. 1980 The 
Hearsay-II Speech-Understanding System: Integrating Knowledge to 
Resolve Uncertainty. Computing Surveys 12(2): 213-253. 
Ikola, O. 1977 Nykysuomen käsikirja. Weilin & Göös, Espoo, Finland. 
Jäppinen, H.; Lehtola, A.; Nelimarkka, E.; and Ylilammi, M. 1983a 
Knowledge Engineering Approach to Morphological Analysis. First 
Conference of the European Chapter of ACL, Pisa, Italy; 49-51. 
Jäppinen, H.; Lehtola, A.; Nelimarkka, E.; and Ylilammi, M. 1983b 
Morphological Analysis of Finnish: A Heuristic Approach. Report 
B26, Helsinki University of Technology, Digital Systems 
Laboratory, Helsinki, Finland. 
Karlsson, F. 1981 Finsk Grammatik. Suomalaisen Kirjallisuuden Seura, 
Helsinki, Finland. 
Karlsson, F. 1983 Suomen kielen äänne- ja muotorakenne. WSOY, 
Porvoo, Finland. 
Karttunen, L.; Root, R.; and Uszkoreit, H. 1981 Morphological 
Analysis of Finnish by Computer. Proceedings of the 71st Ann. 
Meeting of the SASS. Albuquerque, New Mexico, USA. 
Koskenniemi, K. 1983 Two-level Model for Morphological Analysis. 
IJCAI-83, Karlsruhe, West Germany; 683-685. 
Källgren, G. 1983 Computerized Analysis and Synthesis of Finnish 
Nominals. Papers from the Seventh Scandinavian Conference of 
Linguistics II, Helsinki, Finland; 433-444. 
Matthews, P.H. 1972 Inflectional Morphology. Cambridge University 
Press. 
Penttilä, A. 1957 Suomen Kielioppi. WSOY, Porvoo, Finland. 
Sadeniemi, M., Ed. 1966 Nykysuomen sanakirja. WSOY, Porvoo, 
Finland. 
Saukkonen, P.; Haipus, M.; Niemikorpi, A.; and Sulkala, H. 1979 
Suomen Kielen Taajuussanasto. WSOY, Porvoo, Finland. 
Sågvall-Hein, A. 1980 An Outline of a Computer Model of Finnish 
Word Recognition. Report 3, Fenno-Ugrica Suecana, Uppsala 
University, Center for Computational Linguistics, Uppsala, Sweden. 
Tuomi, T. 1980 The Reverse Dictionary of Finnish. Suomalaisen 
Kirjallisuuden Seura, Hämeenlinna, Finland. 
Wiik, K. 1967 Suomen Kielen Morfofonemiikkaa. Report 3, 
Publications of the Phonetics Department, University of Turku, 
Turku, Finland. 
Winograd, T. 1983 Language as a Cognitive Process. Volume I: Syntax. 
Addison-Wesley. 
