Unsupervised Discovery of Morphemes
Mathias Creutz and Krista Lagus
Neural Networks Research Centre
Helsinki University of Technology
P.O.Box 9800, FIN-02015 HUT, Finland
{Mathias.Creutz, Krista.Lagus}@hut.fi
Abstract
We present two methods for unsupervised
segmentation of words into morpheme-
like units. The model utilized is espe-
cially suited for languages with a rich
morphology, such as Finnish. The first
method is based on the Minimum Descrip-
tion Length (MDL) principle and works
online. In the second method, Max-
imum Likelihood (ML) optimization is
used. The quality of the segmentations is
measured using an evaluation method that
compares the segmentations produced to
an existing morphological analysis. Ex-
periments on both Finnish and English
corpora show that the presented methods
perform well compared to a current state-
of-the-art system.
1 Introduction
According to linguistic theory, morphemes are con-
sidered to be the smallest meaning-bearing ele-
ments of language, and they can be defined in a
language-independent manner. However, no ade-
quate language-independent definition of the word
as a unit has been agreed upon (Karlsson, 1998,
p. 83). If effective methods can be devised for the
unsupervised discovery of morphemes, they could
aid the formulation of a linguistic theory of mor-
phology for a new language.
It seems that even approximative automated mor-
phological analysis would be beneficial for many
natural language applications dealing with large vo-
cabularies. For example, in text retrieval it is cus-
tomary to preprocess texts by returning words to
their base forms, especially for morphologically rich
languages.
Moreover, in large vocabulary speech recognition,
predictive models of language are typically used for
selecting the most plausible words suggested by an
acoustic speech recognizer (see, e.g., Bellegarda,
2000). Consider, for example the estimation of the
standard n-gram model, which entails the estima-
tion of the probabilities of all sequences of n words.
When the vocabulary is very large, say 100 000
words, the basic problems in the estimation of the
language model are: (1) If words are used as ba-
sic representational units in the language model, the
number of basic units is very high and the estimated
word n-grams are poor due to sparse data. (2) Due
to the high number of possible word forms, many
perfectly valid word forms will not be observed at
all in the training data, even in large amounts of text.
These problems are particularly severe for languages
with rich morphology, such as Finnish and Turkish.
For example, in Finnish, a single verb may appear in
thousands of different forms (Karlsson, 1987).
The utilization of morphemes as basic representa-
tional units in a statistical language model instead of
words seems a promising course. Even a rough mor-
phological segmentation could then be sufficient.
On the other hand, the construction of a comprehen-
sive morphological analyzer for a language based
on linguistic theory requires a considerable amount
of work by experts. This is both slow and expen-
sive and therefore not applicable to all languages.
                     July 2002, pp. 21-30.  Association for Computational Linguistics.
        ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia,
       Morphological and Phonological Learning: Proceedings of the 6th Workshop of the
Table 1: The morphological structure of the Finnish
word for ‘also for [the] coffee drinker’.
Word kahvinjuojallekin
Morphs kahvi n juo ja lle kin
Transl. coffee of drink -er for also
The problem is further compounded as languages
evolve, new words appear and grammatical changes
take place. Consequently, it is important to develop
methods that are able to discover a morphology for
a language based on unsupervised analysis of large
amounts of data.
As the morphology discovery from untagged cor-
pora is a computationally hard problem, in practice
one must make some assumptions about the struc-
ture of words. The appropriate specific assumptions
are somewhat language-dedependent. For example,
for English it may be useful to assume that words
consist of a stem, often followed by a suffix and pos-
sibly preceded by a prefix. By contrast, a Finnish
word typically consists of a stem followed by multi-
ple suffixes. In addition, compound words are com-
mon, containing an alternation of stems and suf-
fixes, e.g., the wordkahvinjuojallekin(Engl.
’also for [the] coffee drinker’; cf. Table 1)1. More-
over, one may ask, whether a morphologically com-
plex word exhibits some hierarchical structure, or
whether it is merely a flat concatenation of stems
and suffices.
1.1 Previous Work on Unsupervised
Segmentation
Many existing morphology discovery algorithms
concentrate on identifying prefixes, suffixes and
stems, i.e., assume a rather simple inflectional mor-
phology.
D´ejean (1998) concentrates on the problem of
finding the list of frequent affixes for a language
rather than attempting to produce a morphological
analysis of each word. Following the work of Zellig
Harris he identifies possible morpheme boundaries
by looking at the number of possible letters follow-
ing a given sequence of letters, and then utilizes fre-
quency limits for accepting morphemes.
1For a comprehensive view of Finnish morphology, see
(Karlsson, 1987).
Goldsmith (2000) concentrates on stem+suffix-
languages, in particular Indo-European languages,
and tries to produce output that would match as
closely as possible with the analysis given by a hu-
man morphologist. He further assumes that stems
form groups that he calls signatures, and each sig-
nature shares a set of possible affixes. He applies an
MDL criterion for model optimization.
The previously discussed approaches consider
only individual words without regard to their con-
texts, or to their semantic content. In a different ap-
proach, Schone and Jurafsky (2000) utilize the con-
text of each term to obtain a semantic representa-
tion for it using LSA. The division to morphemes
is then accepted only when the stem and stem+affix
are sufficiently similar semantically. Their method
is shown to improve on the performance of Gold-
smith’s Linguistica on CELEX, a morphologically
analyzed English corpus.
In the related field of text segmentation, one can
sometimes obtain morphemes. Some of the ap-
proaches remove spaces from text and try to identify
word boundaries utilizing e.g. entropy-based mea-
sures, as in (Redlich, 1993).
Word induction from natural language text with-
out word boundaries is also studied in (Deligne
and Bimbot, 1997; Hua, 2000), where MDL-based
model optimization measures are used. Viterbi or
the forward-backward algorithm (an EM algorithm)
is used for improving the segmentation of the cor-
pus2.
Also de Marcken (1995; 1996) studies the prob-
lem of learning a lexicon, but instead of optimiz-
ing the cost of the whole corpus, as in (Redlich,
1993; Hua, 2000), de Marcken starts with sentences.
Spaces are included as any other characters.
Utterances are also analyzed in (Kit and Wilks,
1999) where optimal segmentation for an utterance
is sought so that the compression effect over the seg-
ments is maximal. The compression effect is mea-
sured in what the authors call Description Length
Gain, defined as the relative reduction in entropy.
The Viterbi algorithm is used for searching for the
optimal segmentation given a model. The input ut-
2The regular EM procedure only maximizes the likelihood
of the data. To follow the MDL approach where model cost is
also optimized, Hua includes the model cost as a penalty term
on pure ML probabilities.
terances include spaces and punctuation as ordinary
characters. The method is evaluated in terms of pre-
cision and recall on word boundary prediction.
Brent presents a general, modular probabilistic
model structure for word discovery (Brent, 1999).
He uses a minimum representation length criterion
for model optimization and applies an incremental,
greedy search algorithm which is suitable for on-line
learning such that children might employ.
1.2 Our Approach
In this work, we use a model where words may con-
sist of lengthy sequences of segments. This model
is especially suitable for languages with agglutina-
tive morphological structure. We call the segments
morphs and at this point no distinction is made be-
tween stems and affixes.
The practical purpose of the segmentation is
to provide a vocabulary of language units that is
smaller and generalizes better than a vocabulary
consisting of words as they appear in text. Such a
vocabulary could be utilized in statistical language
modeling, e.g., for speech recognition. Moreover,
one could assume that such a discovered morph vo-
cabulary would correspond rather closely to linguis-
tic morphemes of the language.
We examine two methods for unsupervised learn-
ing of the model, presented in Sections 2 and 3. The
cost function for the first method is derived from the
Minimum Description Length principle from classic
information theory (Rissanen, 1989), which simul-
taneously measures the goodness of the representa-
tion and the model complexity. Including a model
complexity term generally improves generalization
by inhibiting overlearning, a problem especially se-
vere for sparse data. An incremental (online) search
algorithm is utilized that applies a hierarchical split-
ting strategy for words. In the second method the
cost function is defined as the maximum likelihood
of the data given the model. Sequential splitting is
applied and a batch learning algorithm is utilized.
In Section 4, we develop a method for evaluat-
ing the quality of the morph segmentations produced
by the unsupervised segmentation methods. Even
though the morph segmentations obtained are not in-
tended to correspond exactly to the morphemes of
linguistic theory, a basis for comparison is provided
by existing, linguistically motivated morphological
analyses of the words.
Both segmentation methods are applied to the
segmentation of both Finnish and English words.
In Section 5, we compare the results obtained from
our methods to results produced by Goldsmith’s Lin-
guistica on the same data.
2 Method 1: Recursive Segmentation and
MDL Cost
The task is to find the optimal segmentation of the
source text into morphs. One can think of this as
constructing a model of the data in which the model
consists of a vocabulary of morphs, i.e. the code-
book and the data is the sequence of text. We try to
find a set of morphs that is concise, and moreover
gives a concise representation for the data. This is
achieved by utilizing an MDL cost function.
2.1 Model Cost Using MDL
The total cost consists of two parts: the cost of the
source text in this model and the cost of the code-
book. Let M be the morph codebook (the vocab-
ulary of morph types) and D = m1m2 ...mn the
sequence of morph tokens that makes up the string
of words. We then define the total cost C as
C = Cost(Source text) + Cost(Codebook)
=
summationdisplay
tokens
−logp(mi) +
summationdisplay
types
k ∗ l(mj)
The cost of the source text is thus the negative log-
likelihood of the morph, summed over all the morph
tokens that comprise the source text. The cost of the
codebook is simply the length in bits needed to rep-
resent each morph separately as a string of charac-
ters, summed over the morphs in the codebook. The
length in characters of the morph mj is denoted by
l(mj) and k is the number of bits needed to code a
character (we have used a value of 5 since that is suf-
ficient for coding 32 lower-case letters). For p(mi)
we use the ML estimate, i.e., the token count of mi
divided by the total count of morph tokens.
2.2 Search Algorithm
The online search algorithm works by incremen-
tally suggesting changes that could improve the cost
function. Each time a new word token is read
from the input, different ways of segmenting it into
morphs are evaluated, and the one with minimum
cost is selected.
Recursive segmentation. The search for the opti-
mal morph segmentation proceeds recursively. First,
the word as a whole is considered to be a morph and
added to the codebook. Next, every possible split of
the word into two parts is evaluated.
The algorithm selects the split (or no split) that
yields the minimum total cost. In case of no split,
the processing of the word is finished and the next
word is read from input. Otherwise, the search for a
split is performed recursively on the two segments.
The order of splits can be represented as a binary tree
for each word, where the leafs represent the morphs
making up the word, and the tree structure describes
the ordering of the splits.
During model search, an overall hierarchical data
structure is used for keeping track of the current
segmentation of every word type encountered so
far. Let us assume that we have seen seven in-
stances of linja-auton (Engl. ’of [the] bus’)
and two instances of autonkuljettajalla-
kaan (Engl. ’not even by/at/with [the] car driver’).
Figure 1 then shows a possible structure used for
representing the segmentations of the data. Each
chunk is provided with an occurrence count of the
chunk in the data set and the split location in this
chunk. A zero split location denotes a leaf node, i.e.,
a morph. The occurrence counts flow down through
the hierachical structure, so that the count of a child
always equals the sum of the counts of its parents.
The occurrence counts of the leaf nodes are used for
computing the relative frequencies of the morphs.
To find out the morph sequence that a word consists
of, we look up the chunk that is identical to the word,
and trace the split indices recursively until we reach
the leafs, which are the morphs.
Note that the hierarchical structure is used only
during model search: It is not part of the final model,
and accordingly no cost is associated with any other
nodes than the leaf nodes.
Adding and removing morphs. Adding new
morphs to the codebook increases the codebook
cost. Consequently, a new word token will tend to
be split into morphs already listed in the codebook,
which may lead to local optima. To better escape lo-
cal optima, each time a new word token is encoun-
13:2
0:210:2
0:2
5:2
kuljettajallakaan
kuljettajalla
kuljettaja lla
kaan
0:2
6:7
0:7
0:9
linja−
n
4:9
0:9
linja−auton
auton
auto
autonkuljettajallakaan
Figure 1: Hierarchical structure of the segmenta-
tion of the words linja-auton and autonkul-
jettajallakaan. The boxes represent chunks.
Boxes with bold text are morphs, and are part of the
codebook. The numbers above each box are the split
location (to the left of the colon sign) and the occur-
rence count of the chunk (to the right of the colon
sign).
tered, it is resegmented, whether or not this word has
been observed before. If the word has been observed
(i.e. the corresponding chunk is found in the hierar-
chical structure), we first remove the chunk and de-
crease the counts of all its children. Chunks with
zero count are removed (remember that removal of
leaf nodes corresponds to removal of morphs from
the codebook). Next, we increase the count of the
observed word chunk by one and re-insert it as an
unsplit chunk. Finally, we apply the recursive split-
ting to the chunk, which may lead to a new, different
segmentation of the word.
“Dreaming”. Due to the online learning, as the
number of processed words increases, the quality
of the set of morphs in the codebook gradually im-
proves. Consequently, words encountered in the be-
ginning of the input data, and not observed since,
may have a sub-optimal segmentation in the new
model, since at some point more suitable morphs
have emerged in the codebook. We have therefore
introduced a ’dreaming’ stage: At regular intervals
the system stops reading words from the input, and
instead iterates over the words already encountered
in random order. These words are resegmented and
thus compressed further, if possible. Dreaming con-
tinues for a limited time or until no considerable de-
crease in the total cost can be observed. Figure 2
shows the development of the average cost per word
as a function of the increasing amount of source text.
0 20 40 60 80 10020
25
30
35
40
45
50Corpus: Finnish newspaper text 100 000 words
Number of words read [x 1000 words]
Average cost per word [bits]
Figure 2: Development of the average word cost
when processing newspaper text. Dreaming, i.e., the
re-processing of the words encountered so far, takes
place five times, which can be seen as sudden drops
on the curve.
3 Method 2: Sequential Segmentation and
ML Cost
3.1 Model Cost Using ML
In this case, we use as cost function the likelihood
of the data, i.e., P(data|model). Thus, the model
cost is not included. This corresponds to Maximum-
Likelihood (ML) learning. The cost is then
Cost(Source text) =
summationdisplay
morph tokens
−logp(mi), (1)
where the summation is over all morph tokens in the
source data. As before, for p(mi) we use the ML
estimate, i.e., the token count of mi divided by the
total count of morph tokens.
3.2 Search Algorithm
In this case, we utilize batch learning where an EM-
like (Expectation-Maximization) algorithm is used
for optimizing the model. Moreover, splitting is not
recursive but proceeds linearly.
1. Initialize segmentation by splitting words into
morphs at random intervals, starting from the
beginning of the word. The lengths of intervals
are sampled from the Poisson distribution with
λ = 5.5. If the interval is larger than the num-
ber of letters in the remaining word segment,
the splitting ends.
2. Repeat for a number of iterations:
(a) Estimate morph probabilities for the given
splitting.
(b) Given the current set of morphs and their
probabilities, re-segment the text using the
Viterbi algorithm for finding the segmen-
tation with lowest cost for each word.
(c) If not the last iteration: Evaluate the seg-
mentation of a word against rejection cri-
teria. If the proposed segmentation is not
accepted, segment this word randomly (as
in the Initialization step).
Note that the possibility of introducing a random
segmentation at step (c) is the only thing that allows
for the addition of new morphs. (In the cost function
their cost would be infinite, due to ML probability
estimates). In fact, without this step the algorithm
seems to get seriously stuck in suboptimal solutions.
Rejection criteria. (1) Rare morphs. Reject the
segmentation of a word if the segmentation contains
a morph that was used in only one word type in the
previous iteration. This is motivated by the fact that
extremely rare morphs are often incorrect. (2) Se-
quences of one-letter morphs. Reject the segmenta-
tion if it contains two or more one-letter morphs in
a sequence. For instance, accept the segmentation
halua + n (Engl. ’I want’, i.e. present stem of
the verb ’to want’ followed by the ending for the first
person singular), but reject the segmentation halu
+ a + n (stem of the noun ’desire’ followed by
a strange sequence of endings). Long sequences of
one-letter morphs are usually a sign of a very bad
local optimum that may even get worse in future it-
erations, in case too much probability mass is trans-
ferred onto these short morphs3.
3Nevertheless, for Finnish there do exist some one-letter
morphemes that can occur in a sequence. However, these mor-
phemes can be thought of as a group that belongs together: e.g.,
4 Evaluation Measures
We wish to evaluate the method quantitatively from
the following perspectives: (1) correspondence with
linguistic morphemes, (2) efficiency of compression
of the data, and (3) computational efficiency. The ef-
ficiency of compression can be evaluated as the total
description length of the corpus and the codebook
(the MDL cost function). The computational effi-
ciency of the algorithm can be estimated from the
running time and memory consumption of the pro-
gram. However, the linguistic evaluation is in gen-
eral not so straightforward.
4.1 Linguistic Evaluation Procedure
If a corpus with marked morpheme boundaries is
available, the linguistic evaluation can be computed
as the precision and recall of the segmentation. Un-
fortunately, we did not have such data sets at our dis-
posal, and for Finnish such do not even exist. In ad-
dition, it is not always clear exactly where the mor-
pheme boundary should be placed. Several alterna-
tives may be possible, cf. Engl. hope + dvs. hop
+ ed, (past tense of to hope).
Instead, we utilized an existing tool for providing
a morphological analysis, although not a segmenta-
tion, of words, based on the two-level morphology
of Koskenniemi (1983). The analyzer is a finite-state
transducer that reads a word form as input and out-
puts the base form of the word together with gram-
matical tags. Sample analyses are shown in Figure 3.
The tag set consists of tags corresponding to
morphological affixes and other tags, for example,
part-of-speech tags. We preprocessed the analyses
by removing other tags than those corresponding
to affixes, and further split compound base forms
(marked using the # character by the analyzer) into
their constituents. As a result, we obtained for each
word a sequence of labels that corresponds well to
a linguistic morphemic analysis of the word. A la-
bel can often be considered to correspond to a single
word segment, and the labels appear in the order of
the segments.
The following step consists in retrieving the seg-
mentation produced by one of the unsupervised seg-
mentation algorithms, and trying to align this seg-
the Finnish talo + j + a (plural partitive of ’house’); can
also be thought of as talo + ja.
Input Output
Word Base form Tags
easily EASY <DER:ly> ADV
bigger BIG A CMP
hours’ HOUR N PL GEN
auton AUTO N SG GEN
puutaloja PUU#TALO N PL PTV
tehnyt TEHD ¨A V ACT PCP2 SG
Figure 3: Morphological analyses for some English
and Finnish word forms. The Finnish words areau-
ton (car’s), puutaloja ([some] wooden houses)
and tehnyt ([has] done). The tags are A (adjec-
tive), ACT (active voice), ADV (adverb), CMP (com-
parative), GEN (genitive), N (noun), PCP2 (2nd par-
ticiple), PL (plural), PTV (partitive), SG (singular),
V (verb), and <DER:ly> (-ly derivative).
mentation with the desired morphemic label se-
quence (cf. Figure 4).
A good segmentation algorithm will produce
morphs that align gracefully with the correct mor-
phemic labels, preferably producing a one-to-one
mapping. A one-to-many mapping from morphs
to labels is also acceptable, when a morph forms a
common entity, such as the suffix -ja in puutaloja,
which contains both the plural and partitive element.
By contrast, a many-to-one mapping from morphs
to a label is a sign of excessive splitting, e.g., t +
alo for talo (cf. English h + ouse for house).
Correct labels BIG CMP
Morph sequence bigg er
Correct labels HOUR PL GEN
Morph sequence hour s ’
Correct labels PUU TALO PL PTV
Morph sequence puu t alo ja
Figure 4: Alignment of obtained morph sequences
with their respective correct morphemic analyses.
We assume that the segmentation algorithm has
split the word bigger into the morphs bigg + er,
hours’ into hour + s + ’ and puutaloja into
puu + t + alo + ja.
Alignment procedure. We align the morph se-
quence with the morphemic label sequence using
dynamic programming, namely Viterbi alignment,
to find the best sequence of mappings between
morphs and morphemic labels. Each possible pair
of morph/morphemic label has a distance associated
with it. For each segmented word, the algorithm
searches for the alignment that minimizes the to-
tal alignment distance for the word. The distance
d(M,L) for a pair of morph M and label L is given
by:
d(M,L) = −log cM,Lc
M
, (2)
where cM,L is the number of word tokens in which
the morph M has been aligned with the label L; and
cM is the number of word tokens that contain the
morph M in their segmentation. The distance mea-
sure can be thought of as the negative logarithm of a
conditional probability P(L|M). This indicates the
probability that a morph M is a realisation of a mor-
pheme represented by the label L. Put another way,
if the unsupervised segmentation algorithm discov-
ers morphs that are allomorphs of real morphemes, a
particular allomorph will ideally always be aligned
with the same (correct) morphemic label, which
leads to a high probability P(L|M), and a short dis-
tance d(M,L)4. In contrast, if the segmentation al-
gorithm does not discover meaningful morphs, each
of the segments will be aligned with a number of dif-
ferent morphemic labels throughout the corpus, and
as a consequence, the probabilities will be low and
the distances high.
We then utilize the EM algorithm for iteratively
improving the alignment. The initial alignment that
is used for computing initial distance values is ob-
tained through a string matching procedure: String
matching is efficient for aligning the stem of the
word with the base form (e.g., the morph puu with
the label PUU, and the morphs t + alo with the
label TALO). The suffix morphs that do not match
well with the base form labels will end up aligned
somehow with the morphological tags (e.g., the
morph ja with the labels PL + PTV).
4This holds especially for allomorphs of ’stem morphemes’,
e.g., it is possible to identify the English morpheme easy with
a probability of one from both its allomorphs: easy and easi.
However, suffixes, in particular, can have several meanings,
e.g., the English suffix s can mean either the plural of nouns
or the third person singular of the present tense of verbs.
Comparison of methods. In order to compare two
segmentation algorithms, the segmentation of each
is aligned with the linguistic morpheme labels, and
the total distance of the alignment is computed.
Shorter total distance indicates better segmentation.
However, one should note that the distance mea-
sure used favors long morphs. If a particular “seg-
mentation” algorithm does not split one single word
of the corpus, the total distance can be zero. In such
a situation, the single morph that a word is com-
posed of is aligned with all morphemic labels of the
word. The morph M, i.e., the word, is unique, which
means that all probabilities P(L|M) are equal to
one: e.g., the morph puutaloja is always aligned
with the labels PUU + TALO + PL + PTV and no
other labels, which yields the probabilities P(PUU |
puutaloja) = P(TALO | puutaloja) = P(PL |
puutaloja) = P(PTV | puutaloja) = 1.
Therefore, part of the corpus should be used as
training data, and the rest as test data. Both data sets
are segmented using the unsupervised segmentation
algorithms. The training set is then used for estimat-
ing the distance values d(M,L). These values are
used when the test set is aligned. The better seg-
mentation algorithm is the one that yields a better
alignment distance for the test set.
For morph/label pairs that were never observed in
the training set, a maximum distance value is as-
signed. A good segmentation algorithm will find
segments that are good building blocks of entirely
new word forms, and thus the maximum distance
values will occur only rarely.
5 Experiments and Results
We compared the two proposed methods as well as
Goldsmith’s program Linguistica5 on both Finnish
and English corpora. The Finnish corpus consisted
of newspaper text from CSC6. A morphosyntac-
tic analysis of the text was performed using the
Conexor FDG parser7. All characters were con-
verted to lower case, and words containing other
characters than a through z and the Scandinavian
letters ˚a, ¨a and ¨o were removed. Other than mor-
phemic tags were removed from the morphological
5http://humanities.uchicago.edu/faculty/goldsmith/Linguist-
ica2000/
6http://www.csc.fi/kielipankki/
7http://www.conexor.fi/
analyses of the words. The remaining tags corre-
spond to inflectional affixes (i.e. endings and mark-
ers) and clitics. Unfortunately the parser does not
distinguish derivational affixes. The first 100 000
word tokens were used as training data, and the fol-
lowing 100 000 word tokens were used as test data.
The test data contained 34 821 word types.
The English corpus consisted of mainly newspa-
per text from the Brown corpus8. A morphologi-
cal analysis of the words was performed using the
Lingsoft ENGTWOL analyzer9. In case of multi-
ple alternative morphological analyses, the shortest
analysis was selected. All characters were converted
to lower case, and words containing other characters
than a through z, an apostrophe or a hyphen were
removed. Other than morphemic tags were removed
from the morphological analyses of the words. The
remaining tags correspond to inflectional or deriva-
tional affixes. A set of 100 000 word tokens from the
corpus sections Press Reportage and Press Editorial
were used as training data. A separate set of 100 000
word tokens from the sections Press Editorial, Press
Reviews, Religion, and Skills Hobbies were used as
test data. The test data contained 12 053 word types.
Test results for the three methods and the two lan-
guages are shown in Table 2. We observe different
tendencies for Finnish and English. For Finnish,
there is a correlation between the compression of
the corpus and the linguistic generalization capac-
ity to new word forms. The Recursive splitting with
the MDL cost function is clearly superior to the Se-
quential splitting with ML cost, which in turn is su-
perior to Linguistica. The Recursive MDL method
is best in terms of data compression: it produces the
smallest morph lexicon (codebook), and the code-
book naturally occupies a small part of the total cost.
It is best also in terms of the linguistic measure, the
total alignment distance on test data. Linguistica, on
the other hand, employs a more restricted segmenta-
tion, which leads to a larger codebook and to the fact
that the codebook occupies a large part of the total
MDL cost. This also appears to lead to a poor gen-
eralization ability to new word forms. The linguis-
tic alignment distance is the highest, and so is the
percentage of aligned morph/morphemic label pairs
8The Brown corpus is available at the Linguistic Data Con-
sortium at http://www.ldc.upenn.edu/
9http://www.lingsoft.fi/
that were never observed in the training set. On the
other hand, Linguistica is the fastest program10.
Also for English, the Recursive MDL method
achieves the best alignment, but here Linguistica
achieves nearly the same result. The rate of com-
pression follows the same pattern as for Finnish,
in that Linguistica produces a much larger morph
lexicon than the methods presented in this pa-
per. In spite of this fact, the percentage of unseen
morph/morphemic label pairs is about the same for
all three methods. This suggests that in a morpho-
logically poor language such as English a restrictive
segmentation method, such as Linguistica, can com-
pensate for new word forms – that it does not rec-
ognize at all – with old, familiar words, that it “gets
just right”. In contrast, the methods presented in this
paper produce a morph lexicon that is smaller and
able to generalize better to new word forms but has
somewhat lower accuracy for already observed word
forms.
Visual inspection of a sample of words. In an
attempt to analyze the segmentations more thor-
oughly, we randomly picked 1000 different words
from the Finnish test set. The total number of occur-
rences of these words constitute about 2.5% of the
whole set. We inspected the segmentation of each
word visually and classified it into one of three cat-
egories: (1) correct and complete segmentation (i.e.,
all relevant morpheme boundaries were identified),
(2) correct but incomplete segmentation (i.e., not all
relevant morpheme boundaries were identified, but
no proposed boundary was incorrect), (3) incorrect
segmentation (i.e., some proposed boundary did not
correspond to an actual morpheme boundary).
The results of the inspection for each of the three
segmentation methods are shown in Table 3. The
Recursive MDL method performs best and segments
about half of the words correctly. The Sequential
ML method comes second and Linguistica third with
a share of 43% correctly segmented words. When
considering the incomplete and incorrect segmenta-
tions the methods behave differently. The Recursive
MDL method leaves very common word forms un-
split, and often produces excessive splitting for rare
10Note, however, that the computing time comparison with
Linguistica is only approximate since it was a compiled pro-
gram run on Windows whereas the two other methods were im-
plemented as Perl scripts run on Linux.
Table 2: Test results for the Finnish and English corpus. Method names are abbreviated: Recursive seg-
mentation and MDL cost (Rec. MDL), Sequential segmentation and ML cost (Seq. ML), and Linguistica
(Ling.). The total MDL cost measures the compression of the corpus. However, the cost is computed accord-
ing to Equation (1), which favors the Recursive MDL method. The final number of morphs in the codebook
(#morphs in codebook) is a measure of the size of the morph “vocabulary”. The relative codebook cost
gives the share of the total MDL cost that goes into coding the codebook. The alignment distance is the total
distance computed over the sequence of morph/morphemic label pairs in the test data. The unseen aligned
pairs is the percentage of all aligned morph/label pairs in the test set that were never observed in the training
set. This gives an indication of the generalization capacity of the method to new word forms.
Language Finnish English
Method Rec. MDL Seq. ML Ling. Rec. MDL Seq. ML Ling.
Total MDL cost [bits] 2.09M 2.27M 2.88M 1.26M 1.34M 1.44M
#morphs in codebook 6302 10 977 22 075 3836 4888 8153
Relative codebook cost 10.16% 15.27% 36.81% 9.42% 10.90% 19.14%
Alignment distance 768k 817k 1111k 313k 444k 332k
Unseen aligned pairs 23.64% 20.20% 37.22% 18.75% 19.67% 20.94%
Time [sec] 620 390 180 130 80 30
Table 3: Estimate of accuracy of morpheme bound-
ary detection based on visual inspection of a sample
of 2500 Finnish word tokens.
Method Correct Incomplete Incorrect
Rec. MDL 49.6% 29.7% 20.6%
Seq. ML 47.3% 15.3% 37.4%
Linguistica 43.1% 24.1% 32.8%
words. The Sequential ML method is more prone to
excessive splitting, even for words that are not rare.
Linguistica, on the other hand, employs a more con-
servative splitting strategy, but makes incorrect seg-
mentations for many common word forms.
The behaviour of the methods is illustrated by ex-
ample segmentations in Table 4. Often the Recur-
sive MDL method produces complete and correct
segmentations. However, both it and the Sequential
ML method can produce excessive splitting, as is
shown for the latter, e.g. affecti + on + at
+ e. In contrast, Linguistica refrains from splitting
words when they should be split, e.g., the Finnish
compound words in the table.
6 Discussion of the Model
Regarding the model, there is always room for im-
provement. In particular, the current model does
not allow representation of contextual dependencies,
i.e., that some morphs appear only in particular con-
texts (allomorphy). Moreover, languages have rules
regarding the ordering of stems and affixes (morpho-
tax). However, the current model has no way of rep-
resenting such contextual dependencies.
7 Conclusions
In the experiments the online method with the MDL
cost function and recursive splitting appeared most
successful especially for Finnish, whereas for En-
glish the compared methods were rather equal in
performance. This is likely to be partially due to
the model structure of the presented methods which
is especially suitable for languages such as Finnish.
However, there is still room for considerable im-
provement in the model structure, especially regard-
ing the representation of contextual dependencies.
Considering the two examined model optimiza-
tion methods, the Recursive MDL method per-
formed consistently somewhat better. Whether this
is due to the cost function or the splitting strategy
cannot be deduced based on these experiments. In
the future, we intend to extend the latter method to
utilize an MDL-like cost function.
Table 4: Some English and Finnish word segmentations produced by the three methods. The Finnish words
are el¨ainl¨a¨ak¨ari (veterinarian, lit. animal doctor), el¨ainmuseo (zoological museum, lit. animal
museum), el¨ainpuisto (zoological park, lit. animal park), and el¨aintarha (zoo, lit. animal garden).
The suffixes -lle, -n, -on, and -sta are linguistically correct. (Note that in the Sequential ML method
the rejection criteria mentioned are not applied on the last round of Viterbi segmentation. This is why two
one letter morphs appear in a sequence in the segmentation el¨ain + tarh + a + n.)
Recursive MDL Sequential ML Linguistica
affect affect affect
affect + ing affect + ing affect + ing
affect + ing + ly affect + ing + ly affect + ing + ly
affect + ion affecti + on affect + ion
affect + ion + ate affecti + on + at + e affect + ion + ate
affect + ion + s affecti + on + s affect + ion + s
affect + s affect + s affect + s
el¨ain + l¨a¨ak¨ari el¨ain + l¨a¨ak¨ari el¨ainl¨a¨ak¨ari
el¨ain + l¨a¨ak¨ari + lle el¨ain + l¨a¨ak¨ari + lle el¨ainl¨a¨ak¨ari + lle
el¨ain + museo + n el¨ain + museo + n el¨ainmuseo + n
el¨ain + museo + on el¨ain + museo + on el¨ainmuseo + on
el¨ain + puisto + n el¨ain + puisto + n el¨ainpuisto + n
el¨ain + puisto + sta el¨ain + puisto + sta el¨ainpuisto + sta
el¨ain + tar + han el¨ain + tarh + a + n el¨aintarh + an

References
Jerome Bellegarda. 2000. Exploiting latent semantic in-
formation in statistical language modeling. Proceed-
ings of the IEEE, 88(8):1279–1296.
Michael R. Brent. 1999. An efficient, probabilistically
sound algorithm for segmentation and word discovery.
Machine Learning, 34:71–105.
Carl de Marcken. 1995. The unsupervised acquisition
of a lexicon from continuous speech. Technical Re-
port A.I. Memo 1558, MIT Artificial Intelligence Lab.,
Cambridge, Massachusetts.
Carl de Marcken. 1996. Linguistic structure as compo-
sition and perturbation. In Meeting of the Association
for Computational Linguistics.
Herv´e D´ejean. 1998. Morphemes as necessary con-
cept for structures discovery from untagged corpora.
In Workshop on Paradigms and Grounding in Natural
Language Learning, pages 295–299, Adelaide, Jan.
22.
Sabine Deligne and Fr´ed´eric Bimbot. 1997. Inference of
variable-length linguistic and acoustic units by multi-
grams. Speech Communication, 23:223–241.
John Goldsmith. 2001. Unsupervised learning of the
morphology of a natural language. Computational
Linguistics, 27(2):153–198.
Yu Hua. 2000. Unsupervised word induction using MDL
criterion. In Proceedings of ISCSL, Beijing.
Fred Karlsson. 1987. Finnish Grammar. WSOY, Juva,
second edition.
Fred Karlsson. 1998. Yleinen kielitiede.
Yliopistopaino/Helsinki University Press.
Chunyu Kit and Yorick Wilks. 1999. Unsupervised
learning of word boundary with description length
gain. In Proceedings of CoNLL99 ACL Workshop,
Bergen.
Kimmo Koskenniemi. 1983. Two-level morphology:
A general computational model for word-form recog-
nition and production. Ph.D. thesis, University of
Helsinki.
A. Norman Redlich. 1993. Redundancy reduction as a
strategy for unsupervised learning. Neural Computa-
tion, 5:289–304.
Jorma Rissanen. 1989. Stochastic complexity in statis-
tical inquiry. World Scientific Series in Computer Sci-
ence, 15:79–93.
Patrick Schone and Daniel Jurafsky. 2000. Knowledge-
free induction of morphology using latent semantic
analysis. In Proceedings of CoNLL-2000 and LLL-
2000, pages 67–72, Lisbon.
