Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology at HLT-NAACL 2006, pages 79–88,
New York City, USA, June 2006. c©2006 Association for Computational Linguistics
A Naive Theory of Affixation and an Algorithm for Extraction
Harald Hammarstr¨om
Dept. of Computing Science
Chalmers University of Technology
412 96, Gothenburg Sweden
harald2@cs.chalmers.se
Abstract
We present a novel approach to the unsu-
pervised detection of affixes, that is, to ex-
tract a set of salient prefixes and suffixes
from an unlabeled corpus of a language.
The underlying theory makes no assump-
tions on whether the language uses a lot
of morphology or not, whether it is pre-
fixing or suffixing, or whether affixes are
long or short. It does however make the
assumption that 1. salient affixes have to
be frequent, i.e occur much more often
that random segments of the same length,
and that 2. words essentially are vari-
able length sequences of random charac-
ters, e.g a character should not occur in
far too many words than random without
a reason, such as being part of a very fre-
quent affix. The affix extraction algorithm
uses only information from fluctation of
frequencies, runs in linear time, and is free
from thresholds and untransparent itera-
tions. We demonstrate the usefulness of
the approach with example case studies on
typologically distant languages.
1 Introduction
The problem at hand can be described as follows:
Input : An unlabeled corpus of an arbitrary natural
language
Output : A (possibly ranked) set of prefixes and
suffixes corresponding to true prefixes and suf-
fixes in the linguistic sense, i.e well-segmented
and with grammatical meaning, for the lan-
guage in question.
Restrictions : Weconsideronlyconcatenativemor-
phology and assume that the corpus comes al-
ready segmented on the word level.
The theory and practice of the problem is relevant
or even essential in fields such as child language ac-
quisition, information retrieval and, of course, the
fuller scope of computational morphology and its
further layers of application (e.g Machine Transla-
tion).
The reasons for attacking this problem in an un-
supervised manner include advantages in elegance,
economyoftimeandmoney(noannotatedresources
required), and the fact that the same technology may
be used on new languages.
An outline of the paper is as follows: we start
withsomenotationandbasicdefinitions, withwhich
we describe the theory that is intended to model
the essential behaviour of affixation in natural lan-
guages. Then we describe in detail and with ex-
amples the thinking behind the affix extraction al-
gorithm, which actually requires only a few lines to
define mathematically. Next, we present and discuss
some experimental results on typologically different
languages. The paper then finishes with a brief but
comprehensive characterization of related work and
its differences to our work. At the very end we state
the most important conclusions and ideas on future
components of unsupervised morphological analy-
sis.
79
2 A Naive Theory of Affixation
Notation and definitions:
• w,s,b,x,y,... ∈ Σ∗: lowercase-letter vari-
ablesrangeoverstringsofsomealphabetΣand
are variously called words, segments, strings,
etc.
• s triangleleft w: s is a terminal segment of the word w
i.e there exists a (possibly empty) string x such
that w = xs
• W,S,... ⊆ Σ∗: capital-letter variables range
over sets of words/strings/segments
• fW(s) = |{w ∈ W|s triangleleft w}|: the number of
words in W with terminal segment s
• SW = {s|s triangleleft w ∈ W}: all terminal segments
of the words in W
• |·|: is overloaded to denote both the length of
a string and the cardinality of a set
Assume we have two sets of random strings over
some alphabet Σ:
• Bases B = {b1,b2,...,bm}
• Suffixes S = {s1,s2,...,sn}
Such that:
Arbitrary Character Assumption (ACA) Each
character c ∈ Σ should be equally likely in any
word-position for any member of B or S.
Note that B and S need not be of the same car-
dinality and that any string, including the empty
string, could end up belonging to both B and S.
They need neither to be sampled from the same
distribution; pace the requirement, the distributions
from which B and S are drawn may differ in how
much probability mass is given to strings of different
lengths. For instance, it would not be violation if B
were drawn from a a distribution favouring strings
of length, say, 42 and S from a distribution with a
strong bias for short strings.
Next, build a set of affixed words W ⊆ {bs|b ∈
B,s ∈ S}, that is, a large set whose members are
concatenations of the form bs for b ∈ B,s ∈ S,
such that:
Frequent Flyer Assumption (FFA) : The mem-
bers of S are frequent. Formally: Given any
s ∈ S: fW(s) >> fW(x) for all x such that 1.
|x| = |s|; and 2. not x triangleleft sprime for all sprime ∈ S).
In other words, if we call s ∈ S a true suffix and we
call x an arbitrary segment if it neither a true suffix
nor the terminal segment of a true suffix, then any
true suffix should have much higher frequency than
an arbitrary segment of the same length.
One may legimately ask to what extent words of
real natural languages fit the construction model of
W, with the strong ACA and FFA assumptions, out-
lined above. For instance, even though natural lan-
guages often aren’t written phonemically, it is not
hard to come up with languages that have phonotac-
tic constraints on what may appear at the beginning
or end of a word, e.g, Spanish *st- may not begin
a word and yields est- instead. Another violation
of ACA is that (presumably all (Ladefoged, 2005))
languages disallow or disprefer a consonant vs. a
vowel conditioned by the vowel/consonant status of
its predecessor. However, if a certain element occurs
with less frequency than random (the best example
wouldbeclickconsonantswhich, insomelanguages
e.g Eastern !X˜oo (Traill, 1994), occur only initially),
this is not a problem to the theory.
As for FFA, we may have breaches such as Bibli-
cal Aramaic (Rosenthal, 1995) where an old -¯a el-
ement appears on virtually everywhere on nouns,
making it very frequent, but no longer has any syn-
chronic meaning. Also, one can doubt the require-
ment that an affix should need to be frequent; for
instance, the Classical Greek inflectional (lacking
synchronicinternalsegmentation)alternativemedial
3p. pl. aorist imperative ending -σθων (Blomqvist
and Jastrup, 1998), is not common at all.
Just how realistic the assumptions are is an empir-
ical question, whose answer must be judged by ex-
periments on the relevant languages. In the absense
of fully annotated annotated test sets for diverse lan-
guages, and since the author does not have access to
the Hutmegs/CELEX gold standard sets for Finnish
and English (Creutz and Lind´en, 2004), we can only
give some guidelining experimental data.
ACA On a New Testament corpus of Basque
(Leizarraga, 1571) we computed the probabil-
ity of a character appearing in the initial, sec-
80
Positions Distance
||p1 −p2|| 0.47
||p1 −p3|| 0.36
||p1 −p4|| 0.37
||p2 −p3|| 0.34
||p2 −p4|| 0.23
||p3 −p4|| 0.18
Table 1: Difference between character distributions
according to word position.
ond, third or fourth position of the word. Since
Basque is entirely suffixing, if it complied to
ACA, we’d expect the distributions to be simi-
lar. However, if we look at the difference of the
distributions in terms of variation distance be-
tween two probability distributions (||p−q|| =
1
2
summationtext
x|p(x) − q(x)|), it shows that they dif-
fer considerably – especially the initial position
proves more special (see table 1).
FFA As for the FFA, we checked a corpus of bible
portions of Warlpiri (Yal, 1968 2001). This was
chosen because it is one of the few languages
known to the author where data was available
and which has a decent amount of frequent suf-
fixes which are also long, e.g case affixes are
typically bisyllabic phonologically and five-ish
characters long orthographically. Since the or-
thography used marked segmentation, it was
easy to compute FFA statistics on the words
as removed from segmentation marking. Com-
paring with the lists in (Nash, 1980, Ch. 2) it
turns out that FFA is remarkably stable for all
grammatical suffixes occuring in the outermost
layer. There are however the expected kind
of breaches; e.g a tense suffix -ku combined
with a last vowel -u which is frequent in some
frequent preceding affixes making the terminal
segment -uku more frequent than some genuine
three-letter suffixes.
The language known to the author which has
shown the most systematic disconcord with the
FFA is Haitian Creole (also in bible corpus
experiments (Hai, 2003 1999)). Haitian cre-
ole has very little morphology of its own but
owes the lion’s share of it’s words from French.
French derivational morphemes abound in
these words, e.g -syon, which have been care-
fully shown by (Lefebvre, 2004) not to be pro-
ductive in Haitian Creole. Thus, the little mor-
phology there is in Haitian creole is very dif-
ficult to get at without also getting the French
relics.
3 An Algorithm for Affix Extraction
The key question is, if words in natural languages
are constructed as W explained above, can we re-
cover the segmentation? That is, can we find B and
S, given only W? The answer is yes, we can par-
tially decide this. To be more specific, we can com-
pute a score Z such that Z(x) > Z(y) if x ∈ SW
and y /∈ SW. In general, the converse need not hold,
i.e if both x,y ∈ SW, or both x,y /∈ SW, then
it may still be that Z(x) > Z(y). This is equiva-
lent to constructing a ranked list of all possible seg-
ments, where the true members of SW appear at the
top, and somewhere down the list the junk, i.e non-
members of SW, start appearing and fill up the rest
of the list. Thus, it is not said where on the list the
true-affixes/junk border begins, just that there is a
consistent such border.
Now, howshouldthislistbecomputed? Giventhe
FFA, it’s tempting to look at frequencies alone, i.e
just go through all words and make a list of all seg-
ments, ranking them by frequency? This won’t do it
because 1. it doesn’t compensate between segments
of different length; naturally, short segments will be
more frequent than long ones, solely by virtue of
their shortness 2. it overcounts ill-segmented true
affixes, e.g -ng will invariably get a higher (or equal)
count than -ing. What we will do is a modification
of this strategy, because 1. can easily be amended
by subtracting estimated prior frequencies (under
ACA) and there is a clever way of tackling 2. Note
that, to amend 2., when going through w and each
striangleleftw, it would be nice if we could count s only when
it is well-segmented in w. We are given only W so
this information is not available to us, but, the FFA
assumption let’s us make a local guess of it.
We shall illustrate the idea with an example of an
evolving frequency curve of a word “playing” and
its segmentations “playing”, “aying”, “ying”, “ing”,
“ng”, “g” (W being the set of words from an Eng-
lish bible corpus (Eng, 1977)). Figure 1 shows a
81
 0
 100
 200
 300
 400
 500
 600
 700
 800
 laying  aying  ying  ing  ng  g
fe
 playing
Figure 1: The observed fW(s) and expected eW(s)
frequency for s triangleleft w = playing.
frequency curve fW(s) and its expected frequency
curve eW(s). The expected frequency of a suffix s
doesn’t depend on the actual characters of s and is
defined as:
eW(s) = |W|· 1r|s|
Wherer isthesizeofthealphabetundertheassump-
tion that its characters are uniformly distributed. We
don’t simply use 26 in the case of lowercase English
since not all characters are equally frequent. Instead
we estimate the size of a would-be uniform distribu-
tion from the entropy of the distribution of the char-
acters in W. This gives r ≈ 18.98 for English and
other languages with a similar writing practice.
Next, define the adjusted frequency as the differ-
ence between the observed frequency and the ex-
pected frequency:
fprimeW(s) = fW(s)−eW(s)
It is the slope of this curve that predicts the presence
of a good split. Figure 2 shows the appearance of
this curve again exemplified by “playing”.
After these examples, we are ready to define the
segmentation score of a suffix relative to a word Z :
SW ×W →Q:
ZW(s,w) =
braceleftBigg 0 if not s triangleleft w
fprime(si)−fprime(si−1)
|fprime(si−1)| if s = si(w) for some i
Table 2 shows the evolution of exact values from
the running example.
To move from a Z-score for a segment that is rel-
ative to a word we simply sum over all words to get
−2
 0
 2
 4
 6
 8
 10
 12
 14
 16
 18
 ng  g
Z
playing  laying  aying  ying  ing
Figure 2: The slope of the fprimeW(s) curve for s triangleleft w =
playing.
Input: A text corpus C
Step 1. Extract the set of words W from C (thus all
contextual and word-frequency information is
discarded)
Step 2. Calculate ZW(s,w) for each w ∈ W and
s triangleleft w
Step 3. Accumulate ZW(s) =summationtextw∈W Z(s,w)
Table 3: Summary of affix-extraction algorithm.
the final score Z : SW →Q:
ZW(s) = summationdisplay
w∈W
Z(s,w) (1)
To be extra clear, the FFA assumption is “ex-
ploited” in two ways. On the one hand, frequent
affixes get many oppurtunities to get a score (which
could, however, be negative) in the final sum over
w ∈ W. On the other hand, the frequency is what
make up the appearance of the slope that predicts the
segmentation point.
The final Z-score in equation 1 is the one that
purports to have the property that Z(x) > Z(y) if
x ∈ SW and y /∈ SW – at least if purged (see be-
low). A summary of the algorithm described in this
section is displayed in table 3.
The time-complexity bounding factor is the num-
ber of suffixes, i.e the cardinality of SW, which is
linear (in the size of the input) if words are bounded
in length by a constant and quadratic in the (really)
worst case if not.
82
s playing laying aying ying ing ng g
f(s) 1 4 12 40 706 729 756
eW(s) 0.00 0.00 0.00 0.10 1.90 36.0 684
f(s)−eW(s) 0.99 3.99 11.9 39.8 704 692 71.0
Z(s,”playing”) 0.00 2.99 1.99 2.32 16.6 -0.0 -0.8
Table 2: Exact values of frequency curves and scores from the running “playing” example.
1028682.0 ing 111264.0 ling
594208.0 ed 111132.0 ent
371145.0 s 109725.0 ating
337464.0 ’s 109125.0 ate
326250.0 ation 108228.0 an
289536.0 es 97020.0 ies
238853.5 e 94560.0 ts
222256.0 er 81648.0 ically
191889.0 ers 81504.0 ment
172800.0 ting 78669.0 led
168288.0 ly 77900.0 ering
159408.0 ations 74976.0 er’s
143775.0 ted 73988.0 y
130960.0 able ... ...
116352.0 ated -26137.9 l
113364.0 al -38620.6 m
113280.0 ness -78757.3 a
Table 4: Top 30 and bottom 3 extracted suffixes
for English. 47178 unique words yielded a total of
154407 ranked suffixes.
4 Experimental Results
For a regular English 1 million token newspaper
corpus we get the top 30 plus bottom 3 suffixes as
shown in table 4.
English has little affixation compared to e.g Turk-
ish which is at the opposite end of the typological
scale (Dryer, 2005). The corresponding results for
Turkish on a bible corpus (Tur, 1988) is shown in
table 5.
The results largely speak for themselves but some
comments are in order. As is easily seen from the
lists, some suffixes are suffixes of each other so one
could purge the list in some way to get only the
most “competitive” suffixes. One purging strategy
would be to remove x from the list if there is a z
1288402.4 i 33756.55 ler
151056.9 er 29816.53 da
142552.6 in 29404.49 di
141603.3 im 28337.89 le
134403.2 en 26580.41 dan
130794.5 e 26373.54 r
127352.0 an 24183.99 ti
113482.6 a 22527.26 un
82581.95 ya 21388.71 iniz
78447.74 ar 20993.87 sin
76353.77 ak 20117.60 ik
68730.00 n 18612.14 li
64761.37 ir 18316.45 ek
53021.67 la ... ...
47218.78 ini -38091.8 t
44858.18 lar -240917.5 l
37229.14 iz -284460.1 s
Table 5: Top 30 and bottom 3 extracted suffixes
for Turkish. 56881 unique words yielded a total of
175937 ranked suffixes.
such that x = yx and Z(z) > Z(x) (this would
remove e.g -ting if -ing is above it on the list). A
more sophisticated purging method is the following,
which does slightly more. First, for a word w ∈ W
define its best segmentation as: Segment(w) =
argmaxstriangleleftwZ(s). Thenpurge bykeeping onlythose
suffixes which are the best parse for at least one
word: SprimeW = {s ∈ SW|∃wSegment(w) = s}.
Such purging kicks out the bulk of “junk” suf-
fixes. Table 4 shows the numbers for English, Turk-
ish and the virtually affixless Maori (Bauer et al.,
1993). It should noted that “junk” suffixes still re-
main after purging – typically common stem-final
characters – and that there is no simple relation
between the number of suffixes left after purging
and the amount of morphology of the language in
question. Otherwise we would have expected the
morphology-less Maori to be left with no, or 28-ish,
83
Language Corpus Tokens |W| |SW| |SprimeW|
Maori (Mao, 1996) 1101665 8354 23007 78
English (Eng, 1977) 917634 12999 39845 63
Turkish (Tur, 1988) 574592 56881 175937 122
Table 6: Figures for different languages on the ef-
fects on the size of the suffix list after purging.
suffixes or at least less than English.
A good sign is that the purged list and its order
seems to be largely independent of corpus size (as
long as the corpus is not very small) but we do get
some significant differences between bible English
and newspaper English.
We have chosen to illustrate using affixes but the
method readily generalizes to prefixes as well and
even prefixes and suffixes at the same time. As
an example of this, we show top-10 purged prefix-
suffix scores in the same table also for some typo-
logically differing languages in table 7. Again, we
use bible corpora for cross-language comparability
(Swedish (Swe, 1917) and Swahili (Swa, 1953)).
The scores have been normalized in each language
to allow cross-language comparison – which, judg-
ing from the table, seems meaningful. Swahili is an
exclusively prefixing language but verbs tend to end
in -a (whose status as a morpheme is the linguistic
sense can be doubted), whereas Swedish is suffix-
ing, although some prefixes are or were productive
in word-formation.
A full discussion of further aspects such as a more
informed segmentation of words, peeling of multi-
ple suffix layers and purging of unwanted affixes re-
quires, is beyond the scope of this paper.
5 Related Work
For reasons of space we cannot cite and comment
every relevant paper even in the narrow view of
highly unsupervised extraction of affixes from raw
corpus data, but we will cite enough to cover each
line of research. The vast fields of word segmenta-
tion for speech recognition or for languages which
do not mark word boundaries will not be covered.
In our view, segmentation into lexical units is a dif-
ferent problem than that of affix extraction since the
frequencies of lexical items are different, i.e occur
Swedish English Swahili
f¨or- 0.097 -eth 0.086 -a 0.100
-en 0.086 -ing 0.080 wa- 0.095
-na 0.036 -ed 0.063 ali- 0.065
-ade 0.035 -est 0.036 nita- 0.059
-a 0.034 -th 0.035 aka- 0.049
-ar 0.033 -es 0.034 ni- 0.046
-er 0.033 -s 0.033 ku- 0.044
-as 0.032 -ah 0.026 ata- 0.042
-s 0.031 -er 0.026 ha- 0.032
-de 0.031 -ation 0.019 a- 0.031
... ... ... ... ... ...
Table 7: Comparative figures for prefix vs. suffix
detection. The high placement of English -eth and
-ah are due to the fact that the bible version used has
drinketh, sitteth etc and a lot of personal names in
-ah.
much more sparsely. Results from this area which
have been carried over or overlap with affic detec-
tion will however be taken into account. A lot of
the papers cited have a wider scope and are still use-
ful even though they are critisized here for having a
non-optimal affix detection component.
Many authors trace their approches back to two
early papers by Zellig Harris (Harris, 1955; Har-
ris, 1970) which count letter successor varieties.
The basic procedure is to ask how many different
phonemes occur (in various utterances e.g a corpus)
after the first n phonemes of some test utterance and
predictthatsegmentation(s)occurwherethenumber
ofsuccesorsreachesapeak. Forexample, ifwehave
play, played, playing, player, players, playground
and we wish to test where to segment plays, the suc-
cesor count for the prefix pla would be 1 because
only y occurs after whereas the number of succes-
sors of play peak at three (i.e {e,i,g}). Although
the heuristic has had some success it was shown (in
various interpretations) as early as (Hafer and Weiss,
1974) that it is not really sound – even for English.
A slightly better method is to compile a set of words
into a trie and predict boundaries at nodes with high
actitivity (e.g (Johnson and Martin, 2003; Schone
and Jurafsky, 2001; Kazakov and Manandhar, 2001)
and earlier papers by the same authors), but this not
sound either as non-morphemic short common char-
acter sequences also show significant branching.
84
The algorithm in this paper is differs significantly
from the Harris-inspired varieties. First, we do
not record the number of phonemes/character of a
given prefix/suffix but the total number of contin-
uations. In the example above, that would be the
set {ed,ing,er,ers,ground} rather than the three-
member set of continuing phonemes/characters.
Secondly, segmentation of a given word is not the
immediate objective and what amounts to identifi-
cation of the end of a lexical (thus generally low-
frequency) item is not within the direct reach of the
model. Thirdly, and most importantly, the algorithm
in this paper looks at the slope of the frequency
curve not at peaks in absolute frequency.
A different approach, sometimes used in com-
plement of other sources of information, is to se-
lect aligned pairs (or sets) of strings that share a
long character sequence (work includes (Jacquemin,
1997; Yarowsky and Wicentowski, 2000; Baroni et
al., 2002; Clark, 2001)). A notable advantage is that
one is not restricted to concatenative morphology.
Many publications (´Cavar et al., 2004; Brent et
al., 1995; Goldsmith et al., 2001; D´ejean, 1998;
Snover et al., 2002; Argamon et al., 2004; Gold-
smith, 2001; Creutz and Lagus, 2005; Neuvel and
Fulop, 2002; Baroni, 2003; Gaussier, 1999; Sharma
et al., 2002; Wicentowski, 2002; Oliver, 2004),
and various other works by the same authors, de-
scribe strategies that use frequencies, probabilities,
and optimization criteria, often Minimum Descrip-
tion Length (MDL), in various combinations. So far,
all these are unsatisfactory on two main accounts; on
the theretical side, they still owe an explanation of
why compression or MDL should give birth to seg-
mentations coinciding with morphemes as linguisti-
cally defined. On the experimental side, thresholds,
supervised/developed parametres and selective input
still cloud the success of reported results, which, in
any case, aren’t wide enough to sustain some too
rash language independence claims.
To be more specific, some MDL approaches aim
to minimize the description of the set of words in
the input corpus, some to describe all tokens in
the corpus, but, none aims to minimize, what one
would otherwise expect, the set of possible words
in the language. More importantly, none of the re-
viewed works allow any variation in the descrip-
tion language (“model”) during the minimization
search. Therefore they should be more properly la-
beled“weightingschemes”andit’sanopenquestion
whether their yields correspond to linguistic analy-
sis. Given an input corpus and a traditional linguis-
tic analysis, it is trivial to show that it is possible to
decrease description length (according to the given
schemes) by stepping away from linguistic analysis.
Moreover, various forms of codebook compression,
such as Lempel-Ziv compression, yield shorter de-
scription but without any known linguistic relevance
at all. What is clear, however, apart from whether it
is theoretically motivated, is that MDL approaches
are useful.
A systematic test of segmentation algorithms over
many different types of languages has yet to be pub-
lished. For three reasons, it will not be undertaken
here either. First, as e.g already Manning (1998)
notes for sandhi phenomena, it is far from clear
what the gold standard should be (even though we
may agree or agree to disagree on some familiar
European languages). Secondly, segmentation al-
gorithms may have different purposes and it might
not make good sense to study segmentation in isola-
tion from induction of paradigms. Lastly, and most
importantly, all of the reviewed techniques (Wicen-
towski, 2004; Wicentowski, 2002; Snover et al.,
2002; Baroni et al., 2002; Andreev, 1965; ´Cavar
et al., 2004; Snover and Brent, 2003; Snover and
Brent, 2001; Snover, 2002; Schone and Jurafsky,
2001; Jacquemin, 1997; Goldsmith and Hu, 2004;
Sharmaetal., 2002; Clark, 2001; KazakovandMan-
andhar, 1998; D´ejean, 1998; Oliver, 2004; Creutz
and Lagus, 2002; Creutz and Lagus, 2003; Creutz
and Lagus, 2004; Hirsim¨aki et al., 2003; Creutz
and Lagus, 2005; Argamon et al., 2004; Gaussier,
1999; Lehmann, 1973; Langer, 1991; Flenner, 1995;
Klenk and Langer, 1989; Goldsmith, 2001; Gold-
smith, 2000; Hu et al., 2005b; Hu et al., 2005a;
Brent et al., 1995), as they are described, have
threshold-parameters of some sort, explicitly claim
not to work well for an open set of languages, or
require noise-free all-form input (Albright, 2002;
Manning, 1998; Borin, 1991). Therefore it is not
possible to even design a fair test.
In any event, we wish to appeal to the merits of
developing a theory in parallel with experimentation
– as opposed to only ad hoc result chasing. If we
have a theory and we don’t get the results we want,
85
wemayscrutinizetheassumptionsbehindthetheory
in order to modify or reject it (understanding why
we did so). Without a theory there’s no telling what
to do or how to interpret intermediate numbers in a
long series of calculations.
6 Conclusion
We have presented a new theory of affixation and a
parameter-less efficient algorithm for collecting af-
fixes from raw corpus data of an arbitrary language.
Depending on one’s purposes with it, a cut-off point
for the collected list is still missing, or at least, we
do not consider that matter here. The results are very
promising and competitive but at present we lack
formal evaluation in this respect. Future directions
also include a more specialized look into the relation
between affix-segmentation and paradigmatic varia-
tion and further exploits into layered morphology.
7 Acknowledgements
The author has benefited much from discussions
with Bengt Nordstr¨om.

References
Adam C. Albright. 2002. The Identification of Bases in
Morphological Paradigms. Ph.D. thesis, University of
California at Los Angeles.
Nikolai Dmitrievich Andreev, editor. 1965. Statistiko-
kombinatornoe modelirovanie iazykov. Akademia
Nauk SSSR, Moskva.
Shlomo Argamon, Navot Akiva, Amihood Amit, and
Oren Kapah. 2004. Efficient unsupervised recursive
word segmentation using minimum description length.
In COLING-04, 22-29 August 2004, Geneva, Switzer-
land.
Marco Baroni, Johannes Matiasek, and Harald Trost.
2002. Unsupervised discovery of morphologically re-
lated words based on orthographic and semantic simi-
larity. In Proceedings of the Workshop on Morpholog-
ical and Phonological Learning of ACL/SIGPHON-
2002, pages 48–57.
Marco Baroni. 2003. Distribution-driven morpheme dis-
covery: A computational/experimental study. Year-
book of Morphology, pages 213–248.
Winifred Bauer, William Parker, and Te Kareongawai
Evans. 1993. Maori. Descriptive Grammars. Rout-
ledge, London & New York.
Jerker Blomqvist and Poul Ole Jastrup. 1998. Grekisk
Grammatik: Graesk grammatik. Akademisk Forlag,
København, 2 edition.
Lars Borin. 1991. The Automatic Induction of Morpho-
logical Regularities. Ph.D. thesis, University of Upp-
sala.
Michael R. Brent, S. Murthy, and A. Lundberg. 1995.
Discovering morphemic suffixes: A case study in min-
imum description length induction. In Fifth Interna-
tional Workshop on Artificial Intelligence and Statis-
tics, Ft. Lauderdale, Florida.
Damir ´Cavar, Joshua Herring, Toshikazu Ikuta, Paul Ro-
drigues, and Giancarlo Schrementi. 2004. On in-
duction of morphology grammars and its role in boot-
strapping. In Gerhard J¨ager, Paola Monachesi, Gerald
Penn, and Shuly Wintner, editors, Proceedings of For-
mal Grammar 2004, pages 47–62.
Alexander Clark. 2001. Learning morphology with pair
hidden markov models. In ACL (Companion Volume),
pages 55–60.
Mathias Creutz and Krista Lagus. 2002. Unsupervised
discovery of morphemes. In Proceedings of the 6th
Workshop of the ACL Special Interest Group in Com-
putational Phonology (SIGPHON), Philadelphia, July
2002, pages 21–30. Association for Computational
Linguistics.
Mathias Creutz and Krista Lagus. 2003. Unsupervised
discovery of morphemes. In Proceedings of the 6th
Workshop of the ACL Special Interest Group in Com-
putational Phonology (SIGPHON), Philadelphia, July
2002, pages 21–30. Association for Computational
Linguistics.
Mathias Creutz and Krista Lagus. 2004. Induction of
a simple morphology for highly-inflecting languages.
In Proceedings of the 7th Meeting of the ACL Spe-
cial Interest Group in Computational Phonology (SIG-
PHON), pages 43–51. Barcelona.
Mathias Creutz and Krista Lagus. 2005. Unsupervised
morpheme segmentation and morphology induction
from text corpora using morfessor 1.0. Technical re-
port, Publications in Computer and Information Sci-
ence, Report A81, Helsinki University of Technology,
March.
Mathias Creutz and Krister Lind´en. 2004. Morpheme
segmentation gold standards for finnish and english.
publications in computer and information science, re-
port a77, helsinki university of technology. Technical
report, Publications in Computer and Information Sci-
ence, Report A77, Helsinki University of Technology,
October.
Herv´e D´ejean. 1998. Concepts et algorithmes pour
la d´ecouverte des structures formelles des langues.
Ph.D. thesis, Universit´e de Caen Basse Normandie.
Matthew S. Dryer. 2005. Prefixing versus suffix-
ing in inflectional morphology. In Bernard Comrie,
Matthew S. Dryer, David Gil, and Martin Haspelmath,
editors, World Atlas of Language Structures, pages
110–113. Oxford University Press.
1977. The holy bible, containing the old and new testa-
ments and the apocrypha in the authorized king james
version. Thomas Nelson, Nashville, New York.
Gudrun Flenner. 1995. Quantitative morphseg-
mentierung im spanischen auf phonologischer basis.
Sprache und Datenverarbeitung, 19(2):63–78. Also
cited as: Computatio Linguae II, 1994, pp. 1994 as
well as Sprache und Datenverarbeitung 19(2):31-62,
1994.
´Eric Gaussier. 1999. Unsupervised learning of deriva-
tional morphology from inflectional lexicons. In Pro-
ceedings of the 37th Annual Meeting of the Associa-
tion for Computational Linguistics (ACL-1999). Asso-
ciation for Computational Linguistics, Philadephia.
John Goldsmith and Yu Hu. 2004. From signatures to fi-
nite state automata. Technical report TR-2005-05, De-
partment of Computer Science, University of Chicago.
John Goldsmith, Derrick Higgins, and Svetlana Soglas-
nova. 2001. Automatic language-specific stem-
ming in information retrieval. In Carol Peters, edi-
tor, Cross-Language Information Retrieval and Eval-
uation: Proceedings of the CLEF 2000 Workshop,
Lecture Notes in Computer Science, pages 273–283.
Springer-Verlag, Berlin.
John Goldsmith. 2000. Linguistica: An automatic
morphological analyzer. In A. Okrent and J. Boyle,
editors, Proceedings from the Main Session of the
Chicago Linguistic Society’s thirty-sith Meeting.
John Goldsmith. 2001. Unsupervised learning of the
morphology of natural language. Computational Lin-
guistics, 27(2):153–198.
Margaret A. Hafer and Stephen F. Weiss. 1974. Word
segmentation by letter successor varieties. Informa-
tion and Storge Retrieval, 10:371–385.
2003 [1999]. Bib la. American Bible Society.
Zellig S. Harris. 1955. From phoneme to morpheme.
Language, 31(2):190–222.
Zellig S. Harris. 1970. Morpheme boundaries within
words: Report on a computer test. In Zellig S. Harris,
editor, Papers in Structural and Transformational Lin-
guistics, volume 1 of Formal Linguistics Series, pages
68–77. D. Reidel, Dordrecht.
Teemu Hirsim¨aki, Mathias Creutz, Vesa Siivola, and
Mikko Kurimo. 2003. Unlimited vocabulary speech
recognition based on morphs discovered in an unsu-
pervised manner. In Proceedings of Eurospeech 2003,
Geneva, pages 2293–2996. Geneva, Switzerland.
Yu Hu, Irina Matveeva, John Goldsmith, and Colin
Sprague. 2005a. Refining the SED heuristic for
morpheme discovery: Another look at Swahili. In
Proceedings of the Workshop on Psychocomputational
ModelsofHumanLanguageAcquisition, pages28–35,
Ann Arbor, Michigan, June. Association for Computa-
tional Linguistics.
Yu Hu, Irina Matveeva, John Goldsmith, and Colin
Sprague. 2005b. Using morphology and syntax to-
gether in unsupervised learning. In Proceedings of
the Workshop on Psychocomputational Models of Hu-
man Language Acquisition, pages 20–27, Ann Arbor,
Michigan, June. Association for Computational Lin-
guistics.
Christian Jacquemin. 1997. Guessing morphology from
terms and corpora. In Proceedings, 20th Annual In-
ternational ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR ’97),
Philadelphia, PA.
Howard Johnson and Joel Martin. 2003. Unsuper-
vised learning of morphology for english and inukti-
tut. In HLT-NAACL 2003, Human Language Technol-
ogy Conference of the North American Chapter of the
Association for Computational Linguistics, May 27 -
June 1, Edmonton, Canada, volume Companion Vol-
ume - Short papers.
Dimitar Kazakov and Suresh Manandhar. 1998. A hy-
brid approach to word segmentation. In C. D. Page,
editor, Proceedings of the 8th International Workshop
on Inductive Logic Programming (ILP-98) in Madi-
son, Wisconsin, USA, volume 1446 of Lecture Notes
in Artificial Intelligence. Springer-Verlag, Berlin.
Dimitar Kazakov and Suresh Manandhar. 2001. Un-
supervised learning of word segmentation rules with
genetic algorithms and inductive logic programming.
Machine Learning, 43:121–162.
Ursula Klenk and Hagen Langer. 1989. Morphological
segmentation without a lexicon. Literary and Linguis-
tic Computing, 4(4):247–253.
Peter Ladefoged. 2005. Vowels and Consonants. Black-
well, Oxford, 2 edition.
Hagen Langer. 1991. Ein automatisches Morphseg-
mentierungsverfahrenf¨urdeutscheWortformen. Ph.D.
thesis, Georg-August-Universit¨at zu G¨ottingen.
Claire Lefebvre. 2004. Issues in the study of Pidgin and
Creole languages, volume 70 of Studies in Language
Companion Series. John Benjamins, Amsterdam.
Hubert Lehmann. 1973. Linguistische Modellbildung
und Methodologie. Max Niemeyer Verlag, T¨ubingen.
Pp. 71-76 and 88-93.
Joanes Leizarraga. 1571. Iesus krist gure iaunaren tes-
tamentu berria. Pierre Hautin, Inprimizale, Roxellan.
[NT only].
Christopher D. Manning. 1998. The segmentation prob-
lem in morphology learning. In Jill Burstein and Clau-
dia Leacock, editors, Proceedings of the Joint Confer-
ence on New Methods in Language Processing and
Computational Language Learning, pages 299–305.
Association for Computational Linguistics, Somerset,
New Jersey.
1996. Maori bible. The British & Foreign Bible Society,
London, England.
David G. Nash. 1980. Topics in Warlpiri Grammar.
Ph.D. thesis, Massachusetts Institute of Technology.
Sylvain Neuvel and Sean A. Fulop. 2002. Unsuper-
vised learning of morphology without morphemes. In
Workshop on Morphological and Phonological Learn-
ing at Association for Computational Linguistics 40th
Anniversary Meeting (ACL-02), July 6-12, pages9–15.
ACL Publications.
A. Oliver. 2004. Adquisici´o d’informaci´o l`exica i mor-
fosint`actica a partir de corpus sense anotar: apli-
caci´o al rus i al croat. Ph.D. thesis, Universitat de
Barcelona.
Franz Rosenthal. 1995. A grammar of biblical Aramaic,
volume 5 of Porta linguarum Orientalium. Harras-
sowitz, Wiesbaden, 6 edition.
Patrick Schone and Daniel Jurafsky. 2001. Knowledge-
free induction of inflectional morphologies. In Pro-
ceedings of the North American Chapter of the Asso-
ciation for Computational Linguistics, Pittsburgh, PA,
2001.
Utpal Sharma, Jugal Kalita, and Rajib Das. 2002. Unsu-
pervised learning of morphology for building lexicon
for a highly inflectional language. In Proceedings of
the 6th Workshop of the ACL Special Interest Group in
Computational Phonology (SIGPHON), Philadelphia,
July 2002, pages 1–10. Association for Computational
Linguistics.
Matthew G. Snover and Michael R. Brent. 2001. A
bayesian model for morpheme and paradigm identifi-
cation. In Proceedings of the 39th Annual Meeting of
the Association for Computational Linguistics (ACL-
2001), pages 482–490. Morgan Kaufmann Publishers.
Matthew G. Snover and Michael R. Brent. 2003. A prob-
abilistic model for learning concatenative morphology.
In S. Becker, S. Thrun, and K. Obermayer, editors, Ad-
vances in Neural Information Processing Systems 15,
pages 1513–1520. MIT Press, Cambridge, MA.
Matthew G. Snover, Gaja E. Jarosz, and Michael R.
Brent. 2002. Unsupervised learning of morphol-
ogy using a novel directed search algorithm: Taking
the first step. In Workshop on Morphological and
Phonological Learning at Association for Computa-
tionalLinguistics40thAnniversaryMeeting(ACL-02),
July 6-12. ACL Publications.
Matthew G. Snover. 2002. An unsupervised knowledge
free algorithm for the learning of morphology in nat-
ural languages. Master’s thesis, Department of Com-
puter Science, Washington University.
1953. Maandiko matakatifu ya mungu yaitwaya biblia,
yaani agano la kale na agano jipya, katika lugha ya
kiswahili. British and Foreign Bible Society, London,
England.
1917. Gamla och nya testamentet: de kanoniska
b¨ockerna. Norstedt, Stockgholm.
Anthony Traill. 1994. A !X´o˜o Dictionary, volume 9 of
Quellen zur Khoisan-Forschung/Research in Khoisan
Studies. R¨udiger K¨oppe Verlag, K¨oln.
1988. Turkish bible. American Bible Society, Tulsa, Ok-
lahoma.
Richard Wicentowski. 2002. Modeling and Learning
Multilingual Inflectional Morphology in a Minimally
Supervised Framework. Ph.D. thesis, Johns Hopkins
University, Baltimore, MD.
Richard Wicentowski. 2004. Multilingual noise-robust
supervised morphological analysis using the word-
frame model. In Proceedings of the ACL Special Inter-
est Group on Computational Phonology (SIGPHON),
pages 70–77.
1968–2001. Bible: selections in warlpiri. Summer Insti-
tute of Linguistics. Document 0650 of the Aboriginal
Studies Electronic Data Archive (ASEDA), AIATSIS
(Australian Institute of Aboriginal and Torres Strait Is-
lander Studies), Canberra.
David Yarowsky and Richard Wicentowski. 2000. Min-
imally supervised morphological analysis by multi-
modal alignment. In Proceedings of the 38th Annual
Meeting of the Association for Computational Linguis-
tics (ACL-2000), pages 207–216.
