Efficient Unsupervised Recursive Word Segmentation Using Minimum
Description Length
Shlomo ARGAMONa Navot AKIVAb Amihood AMIRb Oren KAPAHb
aIllinois Institute of Technology, Dept. of Computer Science, Chicago, IL 60616, USA
argamon@iit.edu
bBar-Ilan University, Dept. of Computer Science, Ramat Gan 52900, ISRAEL
{navot,amir,kapaho}@cs.biu.ac.il
Abstract
Automatic word segmentation is a basic re-
quirement for unsupervised learning in morpho-
logical analysis. In this paper, we formulate a
novel recursive method for minimum descrip-
tion length (MDL) word segmentation, whose
basic operation is resegmenting the corpus on
a prefix (equivalently, a suffix). We derive a
local expression for the change in description
length under resegmentation, i.e., one which de-
pends only on properties of the specific prefix
(not on the rest of the corpus). Such a formula-
tion permits use of a new and efficient algorithm
for greedy morphological segmentation of the
corpus in a recursive manner. In particular, our
method does not restrict words to be segmented
only once, into a stem+affix form, as do many
extant techniques. Early results for English and
Turkish corpora are promising.
1 Introduction
Although computational morphological analyzers
have existed for many years for a number of lan-
guages, there are still many languages for which no
such analyzer exists, but for which there is an abun-
dance of electronically-available text. Developing
a morphological analyzer for a new language by
hand can be costly and time-consuming, requiring
a great deal of effort by highly-specialized experts.
Supervised learning methods, on the other hand, re-
quire annotated data, which is often scarce or non-
existent, and is also costly to develop. For this
reason, there is increasing interest in unsupervised
learning of morphology, in which unannotated text
is analysed to find morphological structures. Even
approximate unsupervised morphological analysis
can be useful, as an aid to human annotators.
This paper addresses a key task for unsuper-
vised morphological analysis: word segmentation,
segmenting words into their most basic meaning-
ful constituents (substrings), called morphs (ortho-
graphic realizations of morphemes). We adopt the
minimum description length (MDL) approach to
word segmentation, which has been shown to be ef-
fective in recent work (notably (Goldsmith, 2001)
and (Brent et al., 1995)). The minimum descrip-
tion length principle (Barron et al., 1998) is an
information-theoretic criterion to prefer that model
for observed data which gives a minimal length cod-
ing of the observed data set (given the model) to-
gether with the model itself.
1.1 Our approach
Our approach in this paper is to better clarify the
use of MDL for morphological segmentation by en-
abling direct use of a variety of MDL coding criteria
in a general and efficient search algorithm. Issues of
computational efficiency have been a bottleneck in
work on unsupervised morphological analysis, lead-
ing to various approximations and heuristics being
used. Our key contribution is to show how a lo-
cal formulation of description length (DL) for word
segmentation enables an efficient algorithm (based
on pattern-matching methods) for greedy morpho-
logical segmentation of the corpus. We thus provide
a search framework method which avoids some re-
strictions needed in previous work for efficiency. In
particular, our method segments words in the cor-
pus recursively, enabling multiple morphs to be ex-
tracted from a single word, rather than just allow-
ing a single stem+affix pair for a given word, as in
many previous approaches. For example, we might
find the segmentation inter+nation+al+ist,
whereas a single-boundary method would segment
the word on just one of those boundaries.
This paper describes the first step in a larger re-
search program; it’s purpose is to show how to most
efficiently recursively segment the words in a cor-
pus based on an MDL criterion, rather than exhibit a
full morphological analysis system. The procedure
developed here is a component of a larger planned
system, which will use semantic and structural in-
formation to correct word segmentation errors and
will cluster morphological relations into productive
paradigms.
1.2 Related work
Several systems for unsupervised learning of mor-
phology have been developed over the last decade or
so. D´ejean (1998), extending ideas in Harris (1955),
describes a system for finding the most frequent af-
fixes in a language and identifying possible mor-
pheme boundaries by frequency bounds on the num-
ber of possible characters following a given char-
acter sequence. Brent et al. (1995) give an in-
formation theoretic method for discovering mean-
ingful affixes, which was later extended to enable
a novel search algorithm based on a probabilistic
word-generation model (Snover et al., 2002). Gold-
smith (2001) gives a comprehensive heuristic al-
gorithm for unsupervised morphological analysis,
which uses an MDL criterion to segment words
and find morphological paradigms (called signa-
tures). Similarly, Creutz and Lagus (2002) use an
MDL formulation for word segmentation. All of
these approaches assume a stem+affix morpholog-
ical paradigm.
Further, the above approaches only consider in-
formation in words’ character sequences for im-
prove morphological segmentation, and do not con-
sider syntactic or semantic context. Schone and Ju-
rafsky (2000) extend this by using latent semantic
analysis (Dumais et al., 1988) to require that a pro-
posed stem+affix split is sufficiently semantically
similar to the stem before the split is accepted. A
conceptually similar approach is taken by Baroni et
al. (2002) who combine use of edit distance to mea-
sure orthographic similarity and mutual information
to measure semantic similarity, to determine mor-
phologically related word pairs.
2 Overview of the Approach
In this section we provide an overview of our ap-
proach to greedy construction of a set of morphs
(a dictionary), using a minimal description length
(MDL) criterion (Barron et al., 1998) (we present
three alternative MDL-type criteria below, of vary-
ing levels of sophistication). The idea is to initialize
a dictionary of morphs to the set of all word types in
the corpus, and incrementally refine it by resegment-
ing affixes (either prefixes or suffixes) from the cor-
pus. Resegmenting on a prefix p (depicted in Fig-
ure 1) means adding the prefix as a new morph, and
removing it from all words where it occurs as a pre-
fix. Some of the morphs thus created may already
exist in the corpus (e.g., “cognition” in Fig. 1). We
denote the set of morphs starting with p as Vp, and
the set of continuations that follow p by Sp (i.e.,
Vp = pSp). The number of occurrences of a morph
m in the corpus (as currently segmented) is denoted
Dictionary before Dictionary after
relic re
retire lic
recognition tire
relive cognition
tire live
cognition farm
farm
Figure 1: Illustration of resegmenting on the prefixre-.
Note that Vre ={relic, retire, recognition, relive}, and
Sre ={lic, tire, cognition, live}.
by C(m), and the number of tokens in the corpus
with prefix p is denoted B(p) = summationtextvk∈Vp C(vk).
The algorithm examines all prefixes of current
morphs in the dictionary as resegmentation candi-
dates. The candidate p∗ that would give the greatest
decrease in description length upon resegmentation
is chosen, and the corpus is then resegmented on p∗.
This is repeated until no candidate can decrease de-
scription length.
Key to this process is efficient resegmentation of
the corpus, which entails incremental update of the
description length change that each prefix p will
give upon resegmentation, denoted ∆CODEp (the
change in the coding cost CODE(M,Data) for the
corpus plus the model M). This is achieved in two
ways. First, we develop (Sec. 3) expressions for
∆CODEp which depend only on simple properties
of p, Vp, and Sp, and their occurrences in the corpus.
This locality property obviates the need to exam-
ine most of the corpus to determine ∆CODEp. Sec-
ond, we use a novel word/suffix indexing data struc-
ture which permits efficient resegmentation and up-
date of the statistics on which ∆CODEp depends
(Sec. 4). Initial experimental results for the different
models using our algorithm are given in Section 5.
3 Local Description Length Models
As we show below, the key to efficiency is deriving
local expressions for the change in coding length
that will be caused by resegmentation on a partic-
ular prefix p. That is, this coding length change,
∆CODEp, should depend only on direct properties
of p, those morphs Vp = {vk = psk} for which it is
a prefix, and those strings Sp = {sk|psk ∈ Vp} (p’s
continuations). This enables us to efficiently main-
tain the necessary data about the corpus and to up-
date it on resegmentation, avoiding costly scanning
of the entire corpus on each iteration.
We now describe three description length mod-
els for word segmentation. First, we introduce local
description length via two simple models, and then
give a derivation of a local expression for descrip-
tion length change for a more realistic description
length measure.
3.1 Model 1: Dictionary count
Perhaps the simplest possible model is to find a seg-
mentation which minimizes the number of morphs
in the dictionary CODE1(M,Data) = |M|. Al-
though the global minimum will almost always be
the trivial solution where each morph is an individ-
ual letter, this trivial solution may be avoided by
enforcing a minimal morph length (of 2, say). Fur-
thermore, when implemented via a greedy prefix (or
suffix) resegmenting algorithm, this measure gives
surprisingly good results, as we show below.
Locality in this model is easily shown, as
∆CODE1p(M) = 1 + |Sp −M| − |Vp|
= 1 − |Sp ∩M|
since p is added to M as are all its continuations
not currently in M, while each morph vk ∈ Vp is re-
moved (being resegmented as the 2-morph sequence
psk).
3.2 Model 1a: Adjusted count
We also found a heuristic modification of Model
1 to work well, based on the intuition that an af-
fix with more continuations that are current morphs
will be better, while to a lesser extent more contin-
uations that are not current morphs indicates lower
quality. This gives the local heuristic formula:
∆CODE1ap (M) = 1 + |Sp −M| −α|Sp ∩M|
where α is a tunable parameter determining the rel-
ative weights of the two factors.
3.3 Model 2: MDL
A more theoretically motivated model seeks to min-
imize the combined coding cost of the corpus and
the dictionary (Barron et al., 1998):
CODE2(Data|M) + CODE2(M)
where we assume a minimal length code for the cor-
pus based on the morphs in the dictionary1.
The coding cost of the dictionary M is:
CODE2(M) = CODE2(M)
= bsummationtextm∈M len(m)
1As is well known, MDL model estimation is equivalent to
MAP estimation for appropriately chosen prior and conditional
data distribution (Barron et al., 1998).
where b is the number of bits needed to represent a
character and len(m) is the length of m in charac-
ters.
The coding cost CODE(Data|M) of the corpus
given the dictionary is simply the total number of
bits to encode the data using M’s code:
CODE2(Data|M)
= CODE(M(Data) = M1...N)
= −summationtextNi=1 logP(mi)
= −summationtext|M|j=1C(mj)logP(mj)
= −summationtext|M|j=1C(mj)(logC(mj) − logN)
where M(Data) is the corpus segmented according
to M, N is the number of morph tokens in the seg-
mented corpus, mi is the ith morph token in that
segmentation, P(m) is the probability of morph
m in the corpus estimated as P(m) = C(m)/N,
C(m) is the number of times morph m appears in
the corpus, |M| is the total number of morph types
in M, and mj is the jth morph type in the M.
Now suppose we wish to add a new morph to M
by resegmenting on a prefix p from all morphs shar-
ing that prefix, as above. First, consider the total
change in cost for the dictionary. Note that the ad-
dition of the new morph p will cause an increase of
blen(p) bits to the total dictionary size. At the same
time, each new morph s ∈ Sp − M will add its
coding cost blen(s), while each preexisting morph
s′ ∈ Sp∩M will not change the dictionary length at
all. Finally, each vk is removed from the dictionary,
giving a change of −blen(vk). The total change in
coding cost for the dictionary by resegmenting on p
is thus:
∆CODE2p(M) = b (len(p)
+summationtextsk∈(Sp−M) len(sk)
−summationtextk len(vk))
Now consider the change in coding cost for the
corpus after resegmentation. First, consider each
preexisting morph type m negationslash∈ Vp, with the same
count after resegmentation (since it does not con-
tain p). The coding cost of each occurrence of m,
however, will change, since the total number of to-
kens in the corpus will change. Thus the total cost
change for such an m is:
∆CODE2p(Data|m negationslash∈ Vp)
= C(m)(logP(m) − log ˆP(m))
= C(m)(logC(m) − logN − logC(m) + log ˆN)
= C(m)(log ˆN − logN)
= C(m)(log(N + B(p)) − logN)
The total corpus cost change for unchanged morphs
depends only on N and B(p):
∆CODE2p(Data|M −Vp)
=summationtextm∈M−Vp C(m)(log(N + B(p)) − logN)
= (summationtextm∈M−Vp C(m))(log(N + B(p)) − logN)
= (N −summationtextvk C(vk))(log(N + B(p)) − logN)
= (N −B(p))(log(N + B(p)) − logN)
Now, consider explicitly each morph vk ∈ Vp
which will be split after resegmentation. First,
remove the code for each occurrence of vk from
the corpus coding: C(vk)logP(vk). Next, add a
code for each occurrence of the new morph cre-
ated by the prefix: −C(vk)log ˆP(p), where ˆP(p) =
B(p)/(N + B(p)) is the probability of morph p
in the resegmented corpus. Finally, code the con-
tinuations sk: −C(vk)log ˆP(sk) (where ˆP(sk) =
ˆC(sk)
ˆN =
C(vk)+C(sk)
ˆN is the probability of the ‘new’
morph sk). Putting this together, we have the cor-
pus coding cost change for Vp (noting that B(p) =summationtext
vk C(vk)):
∆CODE2p(Data|Vp)
=summationtextvk C(vk)[ logP(vk) − log ˆP(p) − log ˆP(sk)]
=summationtextvk C(vk) (logC(vk) − logN
+log ˆN − logB(p)
+log ˆN − log ˆC(sk))
= summationtextvk C(vk)(logC(vk) − log ˆC(sk))
+B(p)(2log ˆN − logN)
−B(p)logB(p)
Thus the cost change for resegmenting on p is:
∆CODE2p(M,Data)
= ∆CODE2p(M) + ∆CODE2p(Data|M)
= ∆CODE2p(M) + ∆CODE2p(Data|M −Vp)
+∆CODE2p(Data|Vp)
= b
bracketleftBig
len(p) +summationtextsk∈(Sp−M) len(sk) −summationtextvk len(vk)
bracketrightBig
+ (N −B(p)) (log(N + B(p)) − logN)
+ summationtextvk C(vk)(logC(vk) − log ˆC(sk))
+B(p)(2log ˆN − logN)
−B(p)logB(p)
Note that all terms are local to the prefix p, its in-
cluding morphs Vp and its continuations Sp. This
will enable an efficient incremental algorithm for
greedy segmentation of all words in the corpus, as
described in the next section.
4 Efficient Greedy Prefix Search
The straightforward greedy algorithm schema for
finding an approximately minimal cost dictio-
nary is to repeatedly find the best prefix p∗ =
argminp ∆CODEp(M,Data) and resegment the
corpus on p∗, until no p∗ exists with negative
∆CODE. However, the expense of passing over the
entire corpus repeatedly would be prohibitive. Due
to lack of space, we sketch here our method for
caching corpus statistics in a pair of tries, in such
a way that ∆CODEp can be easily computed for any
prefix p, and such that the data structures can be ef-
ficiently updated when resegmenting on a prefix p.
(A heap is also used for efficiently finding the best
prefix.)
The main data structures consist of two tries. The
first, which we term the main suffix trie (MST), is a
suffix trie (Gusfield, 1997) for all the words in the
corpus. Each node in the MST represents either the
prefix of a current morph (initially, a word in the
corpus), or the prefix of a potential morph (in case
its preceding prefix gets segmented). Each such
node is labeled with various statistics of its prefix p
(denoted by the path to it from the root) and its suf-
fixes, such as its prefix length len(p), its count B(p),
the number of its continuations |Sp|, and the col-
lective length of its continuations summationtextsk∈Sp len(sk),
as well as the current value of ∆CODEp(M,Data)
(computed from these statistics). Also, each node
representing the end of an actual word in the corpus
is marked as such.
The second trie, the reversed prefix trie (RPT),
contains all the words in the corpus in reverse.
Hence each node in the RPT corresponds to the suf-
fix of a word in the corpus. We maintain a list of
pointers at each node in the RPT to each node in the
MST which has an identical suffix. This allows ef-
ficient access to all prefixes of a given string. Also,
those nodes corresponding to a complete word in
the corpus are marked.
Initial construction of the data structures can be
done in time linear in the size of the corpus, us-
ing straightforward extensions of known suffix trie
construction techniques (Gusfield, 1997). Finding
the best prefix p∗ can be done efficiently by stor-
ing pointers to all the prefixes in a heap, keyed by
∆CODEp. To then remove all words prefixed by p∗
and add all its continuations as new morphs (as well
as p∗ itself), proceed as follows, for each continua-
tion sk:
1. If sk is marked in RPT, then it is a complete
word, and only its count needs to be updated.
2. Otherwise
(a) Mark sk’s node in MST as a complete
word, and update its statistics
(b) Add sRk to RPT and mark the correspond-
ing nodes in MST as accepting stems.
3. Update the heap for the changed prefixes.
Prefixes
re- *ter-
un- im-
in- com-
de- trans-
con- sub-
dis- *se-
pre- en-
ex- *pa-
pro- *pe-
over- *mi-
Suffixes
-’s *-at
-ing -ate
-ed -ive
-es -able
-ly -ment
-er -or
?-ers -en
-ion ?-ors
?-ions ?-ings
-al *-is
Figure 2: The first 20 English prefix and suffix morphs
extracted from Reuters-21578 corpus using Model 1.
Meaningless morphs are marked by ’*’; nonminimal
meaningful morphs by ’?’.
Prefixes Suffixes
α = 1 α = 2 α = 1 α = 2
over-
non-
under-
mis-
food-
stock-
feed-
view-
work-
export-
book-
warn-
borrow-
depres-
market-
high-
narrow-
turn-
trail-
steel-
un-
over-
non-
*der-
dis-
mis-
out-
inter-
trans-
re-
super-
fore-
up-
down-
tele-
stock-
im-
air-
euro-
mid-
-’s
-ly
-ness
-ship
?-ships
?-ization
-ize
?-ized
?-isation
?-izing
?-izes
?-holders
?-izations
?-isations
-water
?-ised
-ise
?-ising
?-ises
?-iser
-’s
-ly
-ness
-ment
?-ments
?-ized
-ize
?-ization
?-izing
?-isation
?-ised
-ise
?-ising
?-ises
-ship
-men
?-ened
?-ening
?-izes
*-mental
Figure 3: The first 20 English prefix and suffix morphs
extracted using Model 1a, as above.
The complexity for resegmenting on p is
O(len(p) + summationdisplay
sk∈Sp
len(sk) + NSUF(Sp)log(|M|))
where NSUF(Sp) is the number of different morphs
in the previous dictionary that have a suffix in Sp
(which need to be updated in the heap).
5 Experimental Results
In this section we give initial results for the above
algorithm in English and Turkish, showing how
meaningful morphs are extracted using different
greedy MDL criteria. Recall that the models and
algorithm described in this paper are intended as
parts of a more comprehensive morphological anal-
ysis system, as we describe below in future work.
5.1 English
For evaluation in English, we used the standard
Reuters-21578 corpus of news articles (comprising
1.7M word tokens and 32,811 unique words). For
each of the 3 models described above, we extracted
morphs either by resegmenting on prefixes or on
suffixes (looking at the words reversed). When seg-
menting according to Models 1 and 2, a minimum
prefix length of 2 was enforced, to improve morph
quality (though not for suffixes, since in English
there are some one-letter suffixes such as -s).
First, consider morphs found by Model 1 (Fig. 2).
The prefix morphs found are surprisingly good for
this simple model, with only one wrong in the first
15 extracted. That erroneous morph is ter-, which
is part of inter-, however in- was extracted
first; this kind of error could be ameliorated by
a merging postprocessing step. The suffixes are
similarly good, although oddly the system did not
find -s, which caused it to find several compos-
ite morphs, such as -ers and -ions, which can
get resegmented into their components (-er+s and
-ion+s) later.
Model 1a also performs extremely well, for dif-
ferent values of α (we show just α = 1 and α = 2
in Fig. 3, for lack of space). Note that the morphs
found by this model differ qualitatively from those
found by Model 1, in that we get longer morphs
more related to agglutination than to regular inflec-
tion patterns. This suggests that multiple statistical
models should be used together to extract different
facets of a language’s morphological composition.
Finally, morphs from the more complex Model
2 are given in Fig. 4. As in Model 1a, Model
2 gives more agglutinative morphs than inflective
morphs, and has a greater tendency to segment com-
plex morphs (such as -ification-), which pre-
sumably will later be resegmented into their com-
ponent parts (e.g., -if+ic+at+ion). This may
enable construction of hierarchical models of mor-
phological composition in the future.
5.2 Turkish
In addition to English, we tested the method’s abil-
ity to extract meaningful morphs on a small corpus
of Turkish texts from the Turkish Natural Language
Processing Initiative (Oflazer, 2001), which consists
of one foreign ministry press release, texts of two
treaties, and three journal articles on translation.
The corpus comprises 20,284 individual words, of
which 5961 are unique. Turkish is a highly aggluti-
native language, hence a prime candidate for recur-
sive morphological segmentation. Results for Mod-
els 1 and 2 are shown in Tables 5–8. Meaningful
Prefixes
non- rein-
bio- over-
?disi- *ine-
diss- ?interc-
video- fluor-
financier- wood-
quadr- key-
*kl- *kar-
weather- vin-
*jas- ?kings-
Suffixes
-’s -ville
-town -field
?-ification ?-ians
?-alize ?-alising
?-ically ?-ological
-tech -wood
?-ioning ?-etic
?-sively -point
?-nating -tally
?-tational *-uting
Figure 4: The first 20 English prefix and suffix morphs
extracted using Model 2, as above, with b = 8.
p Meaning
bahs- talk (about)
terk- leaving
redd- refuse, rejected
zikr- mention (someone)
bey- Mr., sir
akt- agree
haps- (im)prison
birbirlerin- one to another
s¸efin- your chief
tedbirler- precautions
birin- somebody
h¨uk¨umlerin- your opinions
¨ulkesin- his country
elimiz- our hand
d¨uzenlemelerin- your arrangements
yerin- your place
kendin- yourself
devletler- governments
bic¸imin- your style
istedi˘gim- (thing) that I want
Figure 5: Turkish morphs segmented as prefixes using
Model 1.
morphs were found using all models, with Model 2
finding longer morphs, as in English. We do note
some issues with boundary letters for Model 2 pre-
fixes, however.
6 Conclusions
We have given a firmer foundation for the use of
minimal description length (MDL) criteria for mor-
phological analysis by giving a novel local formula-
tion of the change in description length (DL) upon
resegmentation of the corpus on a prefix (or suffix),
p Meaning
-nin of
-nın of
-nı your
-na to your
-ler plural form
-leri plural form
-nda at your
-ni your
-lerin your (things)
-ki that
p Meaning
-si of
-ndan from
-ları plural form
-lar plural form
-sı of
-ların your (pl.) (things)
-lerine to your (pl.) (things)
-ya to
-lara to (pl.)
-dir is
Figure 6: Turkish morphs segmented as suffixes using
Model 1.
p Meaning
hizmet(l)- service
neden(l)- reason
madd- material
birbir- one another
belg- document
izlenim- observation
nitelik- specification
en- width
dil- language
bilg(i)- knowledge
p Meaning
bahs- mention
zih(in)- memory
verg(i)- tax
person(el)- employee
biri- one of
verme- giving
vere(n)- giver
belirsi(z)- unknown
bildirim- announcement
zikr- mention
Figure 7: Turkish morphs segmented as prefixes using
Model 2. Turkish letters in parentheses are not in the
segmented morphs, though a better segmentation would
have included them.
which enables an efficient algorithm for greedy con-
struction of a morph dictionary using an MDL cri-
terion. The algorithm we have devised is generic, in
that it may easily be applied to any local description
length model. Early results of our method, as evalu-
ated by examination of the morphs it extracts, show
high accuracy in finding meaningful morphs based
solely on orthographic considerations; in fact, we
find that Model 1, which depends only on the num-
ber of morphs in the dictionary (and not on frequen-
cies in the corpus at all) gives surprisingly good re-
sults, though Model 2 may generally be preferable
(more experiments on varied and larger corpora still
remain to be run).
We see two immediate directions for future work.
The first comprises direct improvements to the tech-
niques presented here. Rather than segmenting pre-
fixes and suffixes separately, the data structures and
algorithms should be extended to segment both pre-
fixes and suffixes in the current morph list, depend-
ing on which gives the best overall DL improve-
ment. Related is the need to enable approximate
matching of ‘boundary’ characters due to ortho-
graphic shifts such as-yto-i-, as well as incorpo-
rating other orthographic filters on possible morphs
(such as requiring prefixes to contain a vowel). An-
other algorithmic extension will be to develop an
p Meaning
-isine toward (someone)
-nlerinin of their (things)
-taki which at
-isini from, towards
-yeti to
-iyorsa if (pres.)
-ili with
-likte at (the place of)
-’in of
-imizden from our
p Meaning
-ilerine to (pl.)
-lemektedir it does
*-tik
-ilemez cannot do
-lerimizi our things
-mun from my
-mlar (plural)
-tmak to
-unca while
-lu with
Figure 8: Turkish morphs segmented as suffixes using
Model 2; tables as in Figure 5.
efficient beam-search algorithm (avoiding copying
the entire data structure), which may improve accu-
racy over the current greedy search method. In ad-
dition, we will investigate the use of more sophisti-
cated DL models, including, for example, semantic
similarity between candidate affixes and stems, us-
ing the probability of occurrence of individual char-
acters for coding, or using n-gram probabilities for
coding the corpus as a sequence of morphs (instead
of the unigram coding model used here and previ-
ously).
The second direction involves integrating the cur-
rent algorithm into a larger system for more compre-
hensive morphological analysis. As noted above,
due to the greedy nature of the search, a recombi-
nation step may be needed to ’glue’ morphs that
got incorrectly separated (such asun-and-der-).
More fundamentally, we intend to use the algorithm
presented here (with the above extensions) as a sub-
routine in a paradigm construction system along the
lines of Goldsmith (2001). It seems likely that effi-
cient and accurate MDL segmentation as we present
here will enable more effective search through the
space of possible morphological signatures.
Acknowledgements
Thanks to Moshe Fresko and Kagan Agun for help
with the Turkish translations, as well as the anony-
mous reviewers for their comments.

References

M. Baroni, J. Matiasek, and H. Trost. 2002. Unsu-
pervised discovery of morphologically related words
based on orthographic and semantic similarity. In
Proceedings of the Workshop on Morphological
and Phonological Learning of ACL/SIGPHON-2002,
pages 48–57.

Andrew Barron, Jorma Rissanen, and Bin Yu. 1998. The
minimum description length principle in coding and
modeling. IEEE Transactions on Information Theory,
44(6):2743–2760, October.

Michael R. Brent, Sreerama K. Murthy, and Andrew
Lundberg. 1995. Discovering morphemic suffixes: A
case study in minimum description length induction.
In Proceedings of the Fifth International Workshop on
Artificial Intelligence and Statistics, Ft. Lauderdale,FL.

Mathias Creutz and Krista Lagus. 2002. Unsupervised
discovery of morphemes. In Proceedings of the Work-
shop on Morphological and Phonological Learning of
ACL-02, pages 21–30, Philadelphia.

Herv´e D´ejean. 1998. Morphemes as necessary concept
for structures discovery from untagged corpora. In
Workshop on Paradigms and Grounding in Natural
Language Learning, pages 295–299, Adelaide.

S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deer-
wester, and R. Harshman. 1988. Using latent seman-
tic analysis to improve access to textual information.
In Proceedings of the SIGCHI conference on Human
factors in computing systems, pages 281–285. ACM
Press.

John Goldsmith. 2001. Unsupervised learning of the
morphology of a natural language. Computational
Linguistics, 27:153–198.

Dan Gusfield. 1997. Algorithms on Strings, Trees, and
Sequences - Computer Science and Computational Bi-
ology. Cambridge University Press.

Zellig Harris. 1955. From phoneme to morpheme. Lan-
guage, 31:190–222.

Kemal Oflazer. 2001. English Turkish aligned
parallel corpora. Turkish Natural Language
Processing Initiative, Bilkent University.
http://www.nlp.cs.bilkent.edu.tr/Turklang/
corpus/par-corpus/.

Patrick Schone and Daniel Jurafsky. 2000. Knowledge
free induction of morphology using latent semantic
analysis. In Proceedings of CoNLL-2000 and LLL-
2000, pages 67–72, Lisbon.

Matthew Snover, Gaja Jarosz, and Michael Brent. 2002.
Unsupervised learning of morphology using a novel
directed search algorithm: Taking the first step. In
ACL-2002 Workshop on Morphological and Phono-
logical Learning.
