Induction of a Simple Morphology for Highly-Inflecting Languages
Mathias Creutz and Krista Lagus
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 5400, FIN-02015 HUT, Finland
{Mathias.Creutz, Krista.Lagus}@hut.fi
Abstract
This paper presents an algorithm for the unsuper-
vised learning of a simple morphology of a nat-
ural language from raw text. A generative prob-
abilistic model is applied to segment word forms
into morphs. The morphs are assumed to be gener-
ated by one of three categories, namely prefix, suf-
fix, or stem, and we make use of some observed
asymmetries between these categories. The model
learns a word structure, where words are allowed
to consist of lengthy sequences of alternating stems
and affixes, which makes the model suitable for
highly-inflecting languages. The ability of the al-
gorithm to find real morpheme boundaries is eval-
uated against a gold standard for both Finnish and
English. In comparison with a state-of-the-art al-
gorithm the new algorithm performs best on the
Finnish data, and on roughly equal level on the En-
glish data.
1 Introduction
We are intrigued by the endeavor of devising artifi-
cial systems that are capable of learning natural lan-
guage in an unsupervised manner. As untagged text
data is available in large quantities for a large num-
ber of languages, unsupervised methods may be ap-
plied much more widely, or with much lower cost,
than supervised ones.
Some languages, such as Finnish, Turkish, and
Swahili, are highly-inflecting. We wish to use this
term in a wide sense including many kinds of pro-
cesses for word forming, e.g., compounding and
derivation. Their essential challenge for natural lan-
guage applications arises from the very large num-
ber of possible word forms, which causes problems
of data sparsity. For instance, creating extensive
word lists is not a feasible strategy for obtaining
good coverage on the vocabulary necessary for a
general dictation task in automatic speech recogni-
tion. Instead, a model of the language should in-
corporate regularities of how words are formed; cf.
e.g., (Siivola et al., 2003; Hacioglu et al., 2003).
We will now focus on methods that try to induce
the morphology of a natural language from raw text,
that is, on algorithms that learn in an unsupervised
manner how words are formed. If a human were to
learn a language in an analogous way, this would
correspond to being exposed to a stream of large
amounts of language without observing or interact-
ing with the world where this language is produced.
This is clearly not a realistic assumption about lan-
guage learning in humans. However, Saffran et
al. (1996) show that adults are capable of discov-
ering word units rapidly in a stream of a nonsense
language, where there is no connection to a meaning
of the discovered word-like units. This suggests that
humans do use distributional cues, such as transition
probabilities between sounds, in language learning.
And these kinds of statistical patterns in language
data can be successfully exploited by appropriately
designed algorithms.
Existing morphology learning algorithms are
commonly based on the Item and Arrangement
model, i.e., words are formed by a concatenation
of morphemes, which are the smallest meaning-
bearing units in language. The methods segment
words, and the resulting segments are supposed to
be close to linguistic morphemes. In addition to
producing a segmentation of words the aim is of-
ten to discover structure, such as knowledge of
which word forms belong to the same inflectional
paradigm.
Typically, generative models are used, either for-
mulated in a Bayesian framework, e.g., (Brent,
1999; Creutz, 2003); or applying the Minimum De-
scription Length (MDL) principle, e.g., (de Mar-
cken, 1996; Goldsmith, 2001; Creutz and Lagus,
2002). There is another approach, inspired by the
works of Zellig Harris, where a morpheme bound-
ary is suggested at locations where the predictability
of the next letter in a letter sequence is low, cf. e.g.,
(D´ejean, 1998).
As it is necessary to learn both which segments
are plausible morphemes and what sequences of
morphemes are possible, the learning task is al-
                                                                  Barcelona, July 2004
                                              Association for Computations Linguistics
                       ACL Special Interest Group on Computational Phonology (SIGPHON)
                                                    Proceedings of the Workshop of the
huumori n taju ttom uute nne
humor of sense -less -ness your
Figure 1: Morpheme segmentation of the Finnish
word ‘huumorintajuttomuutenne’ (“your lack of
sense of humor”).
leviated by making simplifying assumptions about
word structure. Often words are assumed to con-
sist of one stem followed by one, possibly empty,
suffix as in, e.g., (D´ejean, 1998; Snover and Brent,
2001). In (Goldsmith, 2001) a recursive struc-
ture is proposed, such that stems can consist of a
sub-stem and a suffix. Also prefixes are possible.
Other algorithms (Creutz and Lagus, 2002; Creutz,
2003) have been developed for highly-inflecting
languages, such as Finnish, where words can con-
sist of lengthy sequences of alternating stems and
affixes (see Fig. 1 for an example). These resem-
ble algorithms that segment text without blanks (or
transcribed speech) into words, e.g., (de Marcken,
1996; Brent, 1999), in that they do not distinguish
between stems and affixes, but split words into so
called morphs, which carry no explicit category in-
formation.
Some algorithms do not rely on the Item and Ar-
rangement (IA) model, but learn relationships be-
tween words by comparing the orthographic sim-
ilarity of pairs of words. In (Neuvel and Fulop,
2002), a morphological learner based on the theory
of Whole Word Morphology is outlined. Full words
are related to other full words, and complex word
forms are analyzed into a variable and non-variable
component. Conceivably, in this framework non-
concatenative morphological processes, such as um-
laut in German, should not be as problematic as in
the IA model.
Other algorithms combine information of both
orthographic and semantic similarity of words
(Schone and Jurafsky, 2000; Baroni et al., 2002).
Semantic similarity is measured in terms of sim-
ilar word contexts. If two orthographically simi-
lar words occur in the context of roughly the same
set of other words they probably share the same
base form, e.g. German ‘Vertrag’ vs. ‘Vertr¨agen’
(treaty).
Further cues for morphological learning are
presented in (Schone and Jurafsky, 2001) and
(Yarowsky and Wicentowsky, 2000). The latter uti-
lizes frequency distributions over different inflec-
tional forms (e.g., how often an English verb oc-
curs in its past tense form in comparison to its base
form). The algorithm is not entirely unsupervised.
However, none of these non-IA models suits
highly-inflecting languages as they assume only two
or three constituents per word, analogous to stem
and suffix. In order to cope with a broader range
of languages we would need the following: On the
one hand, words should be allowed to consist of any
number of alternating stems and affixes, making the
model more flexible than, e.g., the model in (Gold-
smith, 2001). On the other hand, in contrast with
(Creutz and Lagus, 2002; Creutz, 2003), sequential
dependencies between morphs, i.e., morphotactics,
should be taken into account in order to reduce the
error rate.
We present a model that incorporates both of
these aspects. Experiments show that the new al-
gorithm is able to obtain considerable improve-
ments over the segmentation produced by the al-
gorithm described in (Creutz, 2003). Moreover, it
performs better than a state-of-the-art morphology-
learning algorithm, Linguistica (Goldsmith, 2001),
when evaluated on Finnish data. In the evaluation,
the ability of the algorithms to detect morpheme
boundaries are measured against a gold standard for
both Finnish and English, languages with rather dif-
ferent types of word structure.
2 A probabilistic category-learning
algorithm
2.1 Linguistic assumptions
We use a Hidden Markov Model (HMM) to model
morph sequences. Previously, HMM’s have been
used extensively for, e.g., segmentation or tagging
purposes. The challenge in this task lies in knowing
neither the segments (morphs), nor their tags (cat-
egories) in advance. To make the task easier, we
utilize the following linguistic assumptions, formu-
lated probabilistically:
(a) Categorial assumption. We assume that with
respect to sequential behavior, morphs fall into two
main categories, stems and affixes. Affixes are fur-
ther divided into prefixes and suffixes.
(b) Impossible category sequences. We want to
be able to cope with languages with extensive com-
pounding and many consecutive affixes, but not
with just any sequence. In particular, we do not wish
to allow that a suffix may start a word, or that a pre-
fix end it. Moreover, a prefix should not be followed
directly by a suffix. These restrictions are captured
by the following regular expression:
word = ( prefix* stem suffix* )+
(1)
Note that no assumptions are made regarding
whether the language is more likely to employ pre-
fixation or suffixation.
(c) Likely properties of morphs in each category.
Grammatical affixes mainly carry syntactic infor-
mation. We therefore assume that affixes are likely
to be common “general-purpose” morphs that can
be used in connection with a large number of other
morphs. By contrast, the set of stems is much larger
and there are a considerable number of rare stems
that mainly carry semantic information. In order for
all stems to be distinguishable from each other they
are not likely to be very short morphs.
2.2 Probabilistic generative model for word
formation
We use an HMM to assign probabilities to each pos-
sible segmentation of a word form. The word is
segmented into morphs, each of which belongs to
one category: prefix, suffix, or stem. We assume a
first-order Markov chain, i.e., a bigram model, for
the morph categories. For each category, there is a
separate probability distribution over the set of pos-
sible morphs. The probability of a particular seg-
mentation of the word w into the morph sequence
µ1µ2 ...µk is thus:
p(µ1µ2 ...µk | w) = (2)
bracketleftBig kproductdisplay
i=1
p(Ci | Ci−1) · p(µi | Ci)
bracketrightBig
· p(Ck+1 | Ck).
The bigram probability of a transition from
one morph category to another is expressed by
p(Ci | Ci−1). For instance, the probability of
observing a stem after a prefix would be written
as p(STM | PRE). The probability of observing
the morph µi when the category Ci is given is ex-
pressed by p(µi | Ci). The categories C0 and
Ck+1 represent word boundaries. That is, we take
into account the transition from a word boundary to
the first morph in the word, as well as the transition
from the last morph to a word boundary.
Note also that a morph can be generated from sev-
eral categories, e.g., a particular morph can function
as a stem or a suffix depending on the context.
2.3 The algorithm step by step
The algorithm involves the following steps: (i) pro-
duction of a baseline segmentation, (ii) initialization
of p(µi | Ci) and p(Ci | Ci−1), (iii) removal of re-
dundant morphs, and (iv) removal of noise morphs.
All steps involve a modification of the morph seg-
mentations, except step (ii), where the probability
distributions are initialized.
Steps (ii)–(iv) are all concluded with a re-
estimation of the probabilities by means of
Expectation-Maximization (EM): The words seg-
mented into morphs are re-tagged using the Viterbi
algorithm by maximizing Equation 2. The proba-
bilities p(µi | Ci) and p(Ci | Ci−1) are then
re-estimated from the tagged data. This is repeated
until convergence of the probabilities.
Step (iv) is further followed by a final pass of
the Viterbi algorithm, which re-splits and tags the
words. Viterbi re-splitting improves the segmen-
tation somewhat, but it is much slower than mere
tagging. Therefore Viterbi re-splitting is only per-
formed at the final stage.
2.3.1 Baseline segmentation
A good initial morph segmentation is obtained by
the baseline segmentation method (Creutz and La-
gus, 2002). The choice of baseline segmentation
was motivated by the fact that we wanted the best
possible segmentation that suits highly-inflecting
languages. The baseline algorithm is based on a
probabilistic model that learns a set of morphs, or a
morph lexicon, that contains the most likely “build-
ing blocks” of the word forms observed in the cor-
pus used as data. The learning process is guided
by two prior probability distributions, a prior dis-
tribution on morph lengths and a prior distribution
on morph frequencies, i.e., the balance between fre-
quent and rare morphs.
2.3.2 Initialization of the probability
distributions
Given an initial baseline morph segmentation
and initial category membership probabilities
p(Ci | µk) for each segment (morph), random
sampling of this distribution can be utilized to ob-
tain specific tags for the morphs. From the tagged
segmentation we can estimate the desired values
p(Cj | Ci) and p(µk | Ci).
Below we describe how the initial category mem-
bership probabilities p(Ci | µk) emerge. These are
probabilities that a particular morph is a prefix, suf-
fix, or stem. In addition, during the process a tem-
porary noise category is utilized, to hold segments
that cannot be considered as prefix, suffix, or stem.
Identifying plausible affixes and stems. To iden-
tify plausible affixes in our corpus we take the base-
line splitting and collect information on the contexts
that every discovered morph occurs in. More specif-
ically, we assume that a morph is likely to be a prefix
if it is difficult to predict what the following morph
is going to be. That is, there are many possible right
contexts of the morph. Correspondingly, a morph is
likely to be a suffix if it is difficult to predict what
the preceding morph can be.
We use perplexity to measure the predictability
of the preceding or following morph in relation to
a specific target morph. The following formula can
be used for calculating the left perplexity of a target
morph µ:
Left-ppl(µ) =
bracketleftBig productdisplay
νi ∈ Left-of(µ)
p(νi | µ)
bracketrightBig− 1
Nµ . (3)
There are Nµ occurrences of the target morph µ in
the corpus. The morph tokens νi occur to the left of,
immediately preceding, the occurrences of µ. The
probability distribution p(νi | µ) is calculated over
all such νi. Right perplexity can be computed anal-
ogously.
Next, we implement a graded threshold of suffix-
likeness by applying a sigmoid function to the left
perplexity of the morphs.
Suffix-like(µ) = [1 + e−a·(Left-ppl(µ)−b)]−1. (4)
The parameter b is the perplexity threshold, which
indicates the point where a morph µ is as likely to be
a suffix as a non-suffix. The parameter a governs the
steepness of the sigmoid. The equations for prefix
are identical except that right perplexity is applied
instead of left perplexity.
As for stems, we assume that the stem-likeness
of a morph correlates positively with the length in
letters of the morph. We employ a sigmoid function
as above, which yields:
Stem-like(µ) = [1 + e−c·(Length(µ)−d)]−1, (5)
where d is the length threshold and c governs the
steepness of the curve.
Initial probability of a morph belonging to a cat-
egory. Prefix-, suffix- and stem-likeness assume
values between zero and one, but they are no prob-
abilities, since they usually do not sum up to one.
In order to create a probability distribution, we
first introduce a fourth category besides prefixes,
suffixes and stems. This category is the noise cat-
egory and corresponds to cases where none of the
proper morph classes is likely. Typically noise
morphs arise as a consequence of over-segmentation
of rare word forms in the baseline word splitting.
We set the probability of a morph being noise
(NOI) to:
p(NOI | µ) = [1 − Prefix-like(µ)]
·[1 − Suffix-like(µ)] · [1 − Stem-like(µ)]. (6)
We then distribute the remaining probability mass
proportionally between prefix (PRE), suffix (SUF),
and stem (STM), e.g.:
p(SUF | µ) =
Suffix-like(µ) · [1 − p(NOI | µ)]
Prefix-like(µ) + Suffix-like(µ) + Stem-like(µ).(7)
2.3.3 Removal of redundant morphs
As a result of applying the baseline segmentation al-
gorithm, there are possibly many redundant morphs,
that is, morphs that consist of other discovered
morphs. Each morph is studied. If it is possible
to split it into two other known morphs, the most
probable split is selected and the redundant morph
is removed. The most probable split is determined
as:
arg max
µ1,C1,µ2,C2
p(µ1 | C1) · p(C2 | C1) · p(µ2 | C2),
(8)
where C1 and C2 are morph categories, and µ1 and
µ2 are substrings of the redundant morph µ, such
that the concatenation of µ1 and µ2 yields µ.
However, some restrictions apply: Splitting into
“noise morphs” is not allowed. Furthermore, forbid-
den category transitions are not allowed to emerge,
such as a direct transition from a prefix to a suffix
without going through a stem. Nor is splitting into
sub-morphs with very low probability accepted.
2.3.4 Removal of noise morphs
As noise morphs are mainly very short morphs and
a product of over-segmentation in the baseline split-
ting algorithm, they are removed by joining with ei-
ther of the adjacent morphs. The new morph is then
labeled as noise. This is repeated until the resulting
morph can qualify as a stem, which is determined
by the Equation 5. The following heuristics are ap-
plied: Joining with shorter morphs is preferred, and
joining noise with noise or a stem is always pre-
ferred to joining with a prefix or a suffix. These pri-
orities are motivated by the observation that noise
morphs tend to be fragments of what should be a
stem.
3 Evaluation
We have produced gold standard segmentations
with marked morpheme boundaries for 1.4 mil-
lion Finnish and 36 000 English word forms. We
evaluate the segmentations produced by our split-
ting algorithm against the gold standard, and com-
pute precision and recall on discovered morpheme
boundaries. Precision is the proportion of correct
boundaries among all morph boundaries suggested
by the algorithm. Recall is the proportion of correct
boundaries discovered by the algorithm in relation
to all morpheme boundaries in the gold standard.
The gold standard was created semi-
automatically, by first running all words through
a morphological analyzer based on the two-level
morphology of Koskenniemi (1983).1 For each
1The software was licensed from Lingsoft, Inc. <http:
word form, the analyzer outputs the base form of
the word together with grammatical tags indicating,
e.g., the part-of-speech, case, or derivational type
of the word form. In addition, the boundaries
between the constituents of compound words are
often marked. We thoroughly investigated the
correspondence between the grammatical tags
and the corresponding morphemes and created a
rule-set for segmenting the original word forms
with the help of the output of the analyzer.
As there can sometimes be many plausibly cor-
rect segmentation of a word we supplied several
alternatives when needed, e.g., English ‘evening’
(time of day) vs. ‘even+ing’ (verb). We also intro-
duced so called “fuzzy” boundaries between stems
and endings, allowing some letter to belong to ei-
ther the stem or ending, when both alternatives are
reasonable, e.g., English ‘invite+s’ vs. ‘invit+es’
(cf. ‘invit+ing’), or Finnish ‘t¨ahde+n’ vs. ‘t¨ahd+en’
(“of the star”; the base form is ‘t¨ahti’).2
4 Experiments
We report experiments on Finnish and English cor-
pora. The new category-learning algorithm is com-
pared to two other algorithms, namely the baseline
segmentation algorithm presented in (Creutz, 2003),
which was also utilized for initializing the segmen-
tation in the category-learning algorithm, and the
Linguistica algorithm (Goldsmith, 2001).3
4.1 Data sets
The Finnish corpus consists of mainly news texts
from the CSC (The Finnish IT Center for Science)4
and the Finnish News Agency. The corpus consists
of 32 million words and it was divided into a devel-
opment set and a test set, each containing 16 million
words. For experiments on English we have used
the Brown corpus5. It contains one million words,
divided into a development set of 250 000 words and
a test set of 750 000 words.
The development sets were utilized for optimiz-
ing the algorithms and for selecting parameter val-
ues, whereas the test sets were used solely in the
final evaluation.
//www.lingsoft.fi>.
2Our gold standard segmentations for Finnish and English
words are not public, but we are currently investigating the pos-
sibility of making them public.
3We used the December 2003 version of the soft-
ware, available at <http://humanities.uchicago.
edu/faculty/goldsmith/Linguistica2000/>.
4<http://www.csc.fi/kielipankki/>
5Available at the Linguistic Data Consortium: <http://
www.ldc.upenn.edu>
Finnish English
Word tokens Word types Word types
10 000 5 500 2 400
50 000 20 000 7 400
250 000 65 000 20 000
16 000 000 1 100 000 –
Table 1: Sizes of the Finnish and English test sets.
The algorithms were evaluated on different sub-
sets of the test set to produce the precision-recall
curves in Figure 2. The sizes of the subsets are
shown in Table 1. As can be seen, the Finnish and
English data sets contain the same number of word
tokens (words of running text), but the number of
word types (distinct word forms) is higher in the
Finnish data. The word type figures are important,
since what was referred to as a ‘corpus’ in the pre-
vious sections is actually a word list. That is, one
occurrence of each distinct word form in the data is
picked for the morphology learning task.
The word forms in the test sets for which there
are no gold standard segmentations are simply left
out of the evaluation. The proportions of such word
forms are 5%, 6%, 8%, and 15% in the Finnish
sets of size 10 000, 50 000, 250 000 and 16 million
words, respectively. For English the proportions are
5%, 9%, and 14% for the data sets (in growing or-
der).
4.2 Parameters
The development sets were used for setting the val-
ues of the parameters of the algorithms. As a cri-
terion for selecting the optimal values, we used
the (equally weighted) F-measure, which is the
harmonic mean of the precision and recall of de-
tected morpheme boundaries. For each data size
and language separately, we selected the configura-
tion yielding the best F-measure on the development
set. These values were then fixed and utilized when
evaluating the performance of the algorithms on the
test set of corresponding size.
In the Baseline algorithm, we optimized the prior
morph length distribution. The prior morph fre-
quency distribution was left at its default value.
The Category algorithm has four parameters: a,
b, c, and d; cf. Equations 4, and 5. The constant val-
ues c = 2, d = 3.5 work well for every data set size
and language, as does the relation a = 10/b. The
perplexity threshold, b, assumes values between 5
and 100 depending on the data set. Conveniently,
the algorithm is robust with respect to the value of b
and the result is always better than that of the Base-
line algorithm, except for values of b that are orders
30 40 50 60 7040
50
60
70
80
90
Recall [%]
Precision [%]
Finnish
10k
50k
250k
10k
50k
250k 16M
10k
50k
250k
16M
Baseline
Categories
Linguistica
(a)
60 70 8040
50
60
70
80
90
Recall [%]
Precision [%]
English
10k
50k
250k
10k
50k 250k
10k
50k 250k
(b)
Figure 2: Precision and recall of the three algorithms on test sets of increasing sizes on both Finnish (a) and
English (b) data. Each data point is an average of 4 runs on separate test sets, with the exception of the 16M
(16 million) words for Finnish (with 1 test set), and the 250k (250 000) words for English (3 test sets). In
these cases the lack of test data constrained the number of runs. The standard deviations of the averages are
shown as intervals around the data points. There is no 16M data point for Linguistica on Finnish, because
the algorithm is very memory-consuming and we could not run it on larger data sizes than 250 000 words
on our PC. In most curves, when the data size is increased, recall also rises. An exception is the Baseline
curve for Finnish, where precision rises, while recall drops.
of magnitude too large.
In the Linguistica algorithm, we used the com-
mands ‘Find suffix system’ and ‘Find prefixes of
suffixal stems’.
4.3 Results
Figure 2 depicts the precision and recall of the algo-
rithms on test sets of different sizes.
When studying the curves for Finnish (Fig. 2a),
we observe that the Baseline and Category algo-
rithms perform on a similar level on the smallest
data set (10k). However, from there the perfor-
mances diverge: the Category algorithm improves
on both precision and recall, whereas the Baseline
algorithm displays a strong increase in precision
while recall actually decreases. This means that
words are split less often but the proposed splits are
more often correct. This is due to measuring the cost
of both the lexicon and the data in the optimization
function: with a much larger corpus (more data) the
optimal solution contains a much larger morph lex-
icon. Hence, less splitting ensues. The effect is not
seen on the English data (Fig. 2b), but this might be
due to the smaller corpus sizes.
For Linguistica, an increase in the amount of data
is reflected in higher recall, but lower precision.
Linguistica only suggests a morpheme boundary be-
tween a stem and an affix, if the same stem has been
observed in combination with at least one other af-
fix. This leads to a “conservative word-splitting be-
havior”, with a rather low recall for small data sets,
but with high precision. As the amount of data in-
creases, the sparsity of the data decreases, and more
morpheme boundaries are suggested. This results
in higher recall, but unfortunately lower precision.
As Linguistica was not designed for discovering
the boundaries within compound words, it misses
a large number of them.
For Finnish, the Category algorithm is better than
the other two algorithms when compared on data
sets of the same size. We interpret a result to be
better, even though precision might be somewhat
lower, if recall is significantly higher (or vice versa).
As an example, for the 16 million word set, the cate-
gory algorithm achieves 79.0% precision and 71.0%
recall. The Baseline achieves 88.5% precision but
only 45.9% recall. T-tests show significant differ-
ences at the level of 0.01 between all algorithms
on Finnish, except for Categories vs. Baseline at
10 000 words.
For English, the Baseline algorithm generally
performs worst, but it is difficult to say which of
the other two algorithm performs best. According
to T-tests there are no significant differences at the
level of 0.05 between the following: Categories vs.
Linguistica (50k & 250k), and Categories vs. Base-
line (10k). However, if one were to extrapolate
from the current trends to a larger data set, it would
seem likely that the Category algorithm would out-
perform Linguistica.
4.4 Computational requirements
The Baseline and Category algorithms are imple-
mented as Perl scripts. On the Finnish 250 000 word
set, the Baseline algorithm runs in 45 minutes, and
the Category algorithm additionally takes 20 min-
utes on a 900 MHz AMD Duron processor with
a maximum memory usage of 20 MB. The Lin-
guistica algorithm is a compiled Windows program,
which uses 500 MB of memory and runs in 90 min-
utes, of which 80 minutes(!) are taken up by the
saving of the results.
5 Discussion
It is worth remembering that the gold standard split-
ting used in these evaluations is based on a tradi-
tional morphology. If the segmentations were eval-
uated using a real-world application, perhaps some-
what different segmentations would be most useful.
For example, the tendency to keep common words
together, seen in the Baseline model and generally
in Bayesian or MDL-based models, might not be at
all troublesome, e.g., in speech recognition or ma-
chine translation applications. In contrast, excessive
splitting might be a problem in both applications.
When compared to the gold standard segmen-
tation used here, the Baseline algorithm produces
three types of errors that are prominent: (i) exces-
sive segmentation especially when trained on small
amounts of data, (ii) too little segmentation espe-
cially with large amounts of data, and (iii) erroneous
segments suggested in the beginning of words due
to the fact that the same segments frequently occur
at the end of words (e.g. ‘s+wing’). The Category
algorithm is able to clearly reduce these types of er-
rors due to its following properties: (i) the joining
of noise morphs with adjacent morphs, (ii) the re-
moval of redundant morphs by splitting them into
sub-morphs, and (iii) the simple morphotactics in-
volving three categories (stem, prefix, and suffix)
implemented as an HMM. Furthermore, (iii) is nec-
essary for being able to carry out (i) and (ii).
The Category algorithm does a good job in find-
ing morpheme boundaries and assigning categories
to the morphs, as can be seen in the examples in Fig-
ure 3, e.g., ‘photograph+er+s’, ‘un+expect+ed+ly’,
‘aarre+kammio+i+ssa’ (“in treasure chambers”; ‘i’
is a plural marker and ‘ssa’ marks the inessive case),
‘bahama+saar+et’ (“[the] Bahama islands”; ‘saari’
means “island” and ‘saaret’ is the plural form). The
reader interested in the analyses of other words can
try our on-line demo athttp://www.cis.hut.
fi/projects/morpho/.
It is nice to see that the same morph can be
tagged differently in different contexts, e.g. ‘p¨a¨a’
is a prefix in ‘p¨a¨a+aihe+e+sta’ (“about [the] main
topic”), whereas ‘p¨a¨a’ is a stem in ‘p¨a¨a+h¨an’ (“in
[the] head”). In this case the morph categories
also resolve the semantic ambiguity of the morph
‘p¨a¨a’. Occasionally, the segmentation is correct, but
the category tagging differs from the linguistic con-
vention, e.g., ‘taka+penkki+l¨ais+et’ (“[the] ones in
[the] back seat”), where ‘l¨ais’ is tagged as a stem
instead of a suffix.
The segmentation of ‘p¨a¨aaiheesta’ is not entirely
correct: ‘p¨a¨a+aihe+e+sta’ contains an superfluous
morph (‘e’), which should be part of the stem, i.e.,
‘p¨a¨a+aihee+sta’. This mistake is explained by a
comparison with the plural form ‘p¨a¨a+aihe+i+sta’,
which is correct. As the singular and plural only dif-
fer in one letter, ‘e’ vs. ‘i’, the algorithm has found
a solution, where the alternating letter is treated as
an independent “number marker”: ‘e’ for singular,
‘i’ for plural.
In the Linguistica algorithm, stems and suffixes
are grouped into so called signatures, which can be
thought of as inflectional paradigms: a certain set
of stems goes together with a certain set of suffixes.
Words will be left unsplit unless the potential stem
and suffix fit into a signature. As a consequence, if
there is only the plural of some particular English
noun in the data, but not the singular, Linguistica
will not split the noun into a stem and the plural
‘s’, since this does not fit into any signature. In
this respect, our category-based algorithm is better
at coping with data sparsity. For highly-inflecting
languages, such as Finnish, this is especially impor-
tant.
In contrast with Linguistica, our algorithms can
incorrectly “overgeneralize” and suggest a suffix,
aarre + kammio + i + ssa j¨a¨ady + tt¨a + ¨a abandon long + est
aarre + kammio + i + sta j¨a¨ady + tt¨a + ¨a + kseen abandon + ed long + fellow + ’s
aarre + kammio + ita j¨a¨ady + tt¨a + isi abandon + ing longish
aarre + kammio + nsa maclare + n abandon + ment long + itude
aarre + kammio + on nais + auto + ili + ja beauti + ful master + piece + s
aarre + kammio + t nais + auto + ili + ja + a beauti + fully micro + organ + ism + s
aarre + kammio + ta nais + auto + ili + joista beauty + ’s near + ly
bahama + saar + et prot + e + iin + eja calculat + ed necess + ary
bahama + saari + en prot + e + iin + i calculat + ion + s necess + ities
bahama + saari + lla prot + e + iin + ia con + figur + ation necess + ity
bahama + saari + lle p¨a¨a + aihe + e + sta con + firm + ed photograph
bahama + saar + ten p¨a¨a + aihe + i + sta express + ion + ist photograph + er + s
edes + autta + isi + vat p¨a¨a + h¨an express + ive + ness photograph + y
edes + autta + ko + on p¨a¨a + kin fanatic + ism phrase + d
edes + autta + maan p¨a¨a + ksi invit + ation + s phrase + ology
edes + autta + ma + ssa taka + penkki + l¨a + in+ en invit + e phrase + s
haap + a + koske + a taka + penkki + l¨ais + et invit + ed sun + rise
haap + a + koske + en voida + kaan invit + e + es thanks + giving
haap + a + koske + lla voi + mme + ko invit + es un + avail + able
haap + a + koski voisi + mme invit + ing un + expect + ed + ly
Figure 3: Examples of segmentations learned from the large Finnish data set and a small English data set.
Discovered stems are underlined, suffixes are slanted, and prefixes are rendered in the standard font.
where there is none, e.g., ‘maclare+n’ (“Mac-
Laren”). Furthermore, nonsensical sequences of
suffixes (which in other contexts are true suffixes)
can be suggested, e.g., ‘prot+e+iin+i’, which should
be ‘proteiini’ (“protein”). A model with more fine-
grained categories might reduce such shortcomings
in that it could model morphotactics more accu-
rately.
Another aspect requiring attention in the future
is allomorphy. Currently each discovered segment
(morph) is assigned a role (prefix, stem, or suf-
fix), but no further “meaning” or relation to other
morphs. In Figure 3 there are some examples
of allomorphs, morphs representing the same mor-
pheme, i.e., morphs having the same meaning but
used in complementary distributions. The current
algorithm has no means for discovering that ‘on’
and ‘en’ mark the same case, namely illative, in
‘aarre+kammio+on’ (“into [the] treasure chamber”)
and ‘haap+a+koske+en’ (“to Haapakoski”). 6 To
enable such discovery in principle, one would prob-
ably need to look at contexts of nearby words,
not just the word-internal context. Additionally,
one should allow the learning of a model with
richer category structure. Moreover, ‘on’ and ‘en’
do not always mark the illative case. In ‘ba-
6Furthermore the algorithm cannot deduce that the illative
is actually realized as a vowel lengthening + ‘n’: ‘kammioon’
vs. ‘koskeen’.
hama+saari+en’ the genitive is marked as ‘en’, and
in ‘edes+autta+ko+on’ (“may he/she help”) ‘on’
marks the third person singular. Similar examples
can be found for English, e.g., ‘ed’ and ‘d’ are al-
lomorphs in ‘invit+ed’ vs. ‘phrase+d’, and so are
‘es’ and ‘s’ in ‘invit+es’ vs. ‘phrase+s’. However,
the meaning of ‘s’ is often ambiguous. It can mark
either the plural of a noun or the third person sin-
gular of a verb in the present tense. But this kind
of ambiguity is in principle solvable in the current
model; the Category algorithm resolves similar, also
semantic, ambiguities occurring between the three
current categories: prefix, stem, and suffix.
6 Conclusions
We described an algorithm that differs from earlier
morpheme segmentation algorithms in that it mod-
els dependencies between morph categories in se-
quences of arbitrary length. Even a simple model
with few categories, namely prefix, suffix, and stem
is able to capture relevant dependencies that consid-
erably improve the obtained segmentation on both
Finnish and English, languages with rather different
types of word structure. An interesting future di-
rection is whether the application of more complex
model structures may lead to improvements in the
morphology induction task.
Acknowledgments
We are grateful to Krister Lind´en, as well as the
anonymous reviewers for their valuable and thor-
ough comments on the manuscript.

References
M. Baroni, J. Matiasek, and H. Trost. 2002. Un-
supervised learning of morphologically related
words based on orthographic and semantic sim-
ilarity. In Proc. Workshop on Morphological &
Phonological Learning of ACL’02, pages 48–57.
M. R. Brent. 1999. An efficient, probabilistically
sound algorithm for segmentation and word dis-
covery. Machine Learning, 34:71–105.
M. Creutz and K. Lagus. 2002. Unsupervised dis-
covery of morphemes. In Proc. Workshop on
Morphological and Phonological Learning of
ACL’02, pages 21–30, Philadelphia, Pennsylva-
nia, USA.
M. Creutz. 2003. Unsupervised segmentation of
words using prior distributions of morph length
and frequency. In Proc. ACL’03, pages 280–287,
Sapporo, Japan.
C. G. de Marcken. 1996. Unsupervised Language
Acquisition. Ph.D. thesis, MIT.
H. D´ejean. 1998. Morphemes as necessary con-
cept for structures discovery from untagged cor-
pora. In Workshop on Paradigms and Grounding
in Natural Language Learning, pages 295–299,
Adelaide.
J. Goldsmith. 2001. Unsupervised learning of the
morphology of a natural language. Computa-
tional Linguistics, 27(2):153–198.
K. Hacioglu, B. Pellom, T. Ciloglu, O. Ozturk,
M. Kurimo, and M. Creutz. 2003. On lexi-
con creation for Turkish LVCSR. In Proc. Eu-
rospeech’03, pages 1165–1168, Geneva, Switzer-
land.
K. Koskenniemi. 1983. Two-level morphology:
A general computational model for word-form
recognition and production. Ph.D. thesis, Uni-
versity of Helsinki.
S. Neuvel and S. A. Fulop. 2002. Unsupervised
learning of morphology without morphemes. In
Proc. Workshop on Morphological & Phonologi-
cal Learning of ACL’02, pages 31–40.
J. R. Saffran, E. L. Newport, and R. N. Aslin. 1996.
Word segmentation: The role of distributional
cues. Journal of Memory and Language, 35:606–
621.
P. Schone and D. Jurafsky. 2000. Knowledge-free
induction of morphology using Latent Semantic
Analysis. In Proc. CoNLL-2000 & LLL-2000,
pages 67–72.
P. Schone and D. Jurafsky. 2001. Knowledge-free
induction of inflectional morphologies. In Proc.
NAACL-2001.
V. Siivola, T. Hirsim¨aki, M. Creutz, and M. Kurimo.
2003. Unlimited vocabulary speech recognition
based on morphs discovered in an unsupervised
manner. In Proc. Eurospeech’03, pages 2293–
2296, Geneva, Switzerland.
M. G. Snover and M. R. Brent. 2001. A Bayesian
model for morpheme and paradigm identifica-
tion. In Proc. 39th Annual Meeting of the ACL,
pages 482–490.
D. Yarowsky and R. Wicentowsky. 2000. Mini-
mally supervised morphological analysis by mul-
timodal alignment. In Proc. ACL-2000, pages
207–216.
