Morphemes as Necessary Concept 
for Structures Discovery 
from Untagged Corpora 
Hervé Déjean
GREYC - CNRS - UPRESA 6072
Université de Caen - Basse Normandie
Herve.Dejean@info.unicaen.fr
Abstract 
This paper gives an overview of a method
which allows the discovery of syntactic structures
from untagged corpora. It is composed of three
main steps: the discovery of the grammatical
morphemes of the language; the construction
of the chunks, which are a multilingual
conceptual level allowing the bypass of
the problematic notion of word; and finally the
discovery of the relations between chunks. We
give an overview of the different procedures
realized, and we especially describe the discovery
of morphemes. This operation is divided
into three steps: the discovery of the most
frequent morphemes of the language, the
discovery of the other morphemes, and finally
the segmentation of the words of the corpus.
We conclude with the correction procedure,
which requires the chunk level. The concepts
and algorithms were tested on twenty natural
languages, including English, German, Turkish,
Vietnamese, Swahili, Finnish, Latin, and
Indonesian.
1 Introduction 
The method presented in this paper is inspired by
the distributional approach developed by American
structuralists between 1940 and 1950 (Harris, 1951).
This approach is characterized by two facts: (a) the
use of corpora and (b) the use of the notion of distribution
instead of the sense of elements. The distribution
of an element is the set of the environments
in which the element occurs. Other works describe
systems that induce structures from corpora, but
they use tagged corpora (Brill, 1993), or grammatical
information (Brent, 1993), or work with artificial
samples (Elman, 1990). Our originality lies in
the fact that we only use untagged and non-artificial
corpora, without specific knowledge about the studied
language. We try to discover the structures of a natural
language from raw texts of this language (of about
100,000 words). We show that this kind of discovery
is possible if we have some expectations about the
structure of natural languages and if we use some
formal properties.
The method relies on structural linguistic concepts:
the morpheme, the chunk and the linearity of
the language, i.e. the corpus is composed of a
one-dimensional sequence of elements. We first give here
an overview of the concepts and general principles,
from morphemes to syntactic structures discovery.
Then we explain in detail how the segmentation is
carried out.
2 The General Structure of 
Sentences 
Natural languages are linear objects: sentences
are sequences of sounds. In the case
of written sentences, we consider them as sequences
of letters (or characters). We also consider that
languages are not only sequences of sounds but are
structured in several structural levels. We claim that
these different levels are formally indicated in the
sentences. How? Since sentences are unidimensional
objects, a simple way is the use of boundary
indicators between the elements which compose the
sentences. Applying this principle to several
languages, we find three multilingual and hierarchical
levels: the morpheme level, the chunk level
and the clause level. One useful formal criterion in
the discovery of these structures is the position of
words and morphemes relative to the beginnings and
ends of sentences.
The morpheme level is already well known in
linguistics. The morphemes are the basic elements of
the structure. In this paper, we call morphemes the
affixes of the language. These elements are discovered
during the word segmentation operation.
The morphemes contain as much structural
information as grammatical words and are essential to
the discovery of the syntactic structures. Section 3
explains how we list them.
The higher level is the chunk one. We note that 
Déjean 295 Morphemes for Structures Discovery
Hervé Déjean (1998) Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora. In D.M.W.
Powers (ed.) NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Language Learning, ACL, pp 295-298.
some elements have a specific behaviour: they never
occur at the beginning or at the end of sentences.
For example, the English word the never ends a
sentence. There exist in all the studied languages
similar elements (words or morphemes) that we can
consider as indicating the beginning or the end of
structures: the grammatical words as well as the
morphemes are considered as boundary indicators.
We systematically consider grammatical words either
as beginning indicators or as ending indicators. In
practice, we tie them to their nearest lexical element
(the following lexical element for beginnings and the
preceding one for endings). In the same way, prefixes are
considered as beginning indicators and suffixes as ending
indicators. For example, both postpositions and inflexional
suffixes are considered as ending indicators. The structures
generated by these elements correspond to a lexical
element (the nucleus of the chunk) surrounded by
grammatical elements (words or morphemes, generally
a combination of both).
The chunks may be viewed as non-recursive
phrases. Though not every chunk of the corpus
systematically has boundary indicators, there generally
exist enough delimited chunks to allow the
discovery of these boundaries. The discovery
of such indicators is, for a large part, realized
automatically.
The last level is the clause level. By working
on boundary indicators, we have noted that some
indicators have a more specific behaviour: they
mainly occur at the beginning or at the end of
sentences. Furthermore, since some chunks have
the same behaviour, clause boundaries are indicated
either by morphemes, by sole grammatical words,
or by chunks. These elements always characterize
elements of clauses: conjunctions or verbal phrases.
For instance, the English conjunction but begins
sentences 672 times out of 760 occurrences. This
behaviour is specific to clause boundary indicators.
A German clause is, most of the time, closed either
by grammatical words such as her, zurück, verbal
particles, or by verbal phrases. In Turkish, the
conjunction ama (but) occurs 763 times and begins
sentences 743 times. The Turkish clause is closed by
verbal chunks, which implies that all the verbal
morphemes (-tir, -yor) are well marked as absolute
endings. All the languages which have an SOV or OSV
structure offer obvious end boundaries for clauses,
and languages which have a VSO or VOS structure offer
beginning boundaries for clauses. This formal information
does not cover all the formal characteristics of the
languages, but it offers enough information
to discover the different syntactic relations between
chunks, and offers a good starting point to
find specific structures of a language, such as the
position of the finite verb in German.
In practice, we note that some languages privilege
beginning indicators (prepositional languages,
such as many European ones), others privilege ending
indicators (postpositional languages such as Turkish or
Japanese), at chunk level as well as at clause level,
but languages generally use both methods. Some
languages (Asian tonal languages) have a low number
of boundary indicators, which complicates the discovery
of chunks and clauses. For the moment, we have
stopped this study at the clause level (or sequences of
clauses), but higher levels perhaps exist.
3 The Morphemes Discovery 
We now explain in detail how the morphemes of a
particular language are found. We refer the readers
to other works dealing with this problem (Brent,
Murthy, and Lundberg, 1995), (de Marcken, 1995).
Our aim is not the realization of a morphological
analysis of each word of the corpus, but the
production of the list of the morphemes for a given
language. We do not try to discover all the morphemes
contained in the corpus, since only the hundred
most frequent ones are necessary in order to
climb to the chunk level. The method is inspired by
the works of Zellig Harris. His algorithm is based on
the number of different letters which follow a given
sequence of letters. An increase of this number indicates
a morpheme boundary. For instance, after the
English sequence direc, we only find, in our corpus,
one letter: t. After direct, we find four letters: i, l, o,
and e (directly, director, directed, direction). This
increase indicates a boundary between the root (direct)
and the suffixes (-ion, -ly, -or and -ed). The algorithm
works well when the corpus contains enough
occurrences of a stem family. But it may generate
wrong segmentations. For example, from the
list started, startled, startling, the algorithm outputs
this segmentation: start-ed, start-led, start-ling. The
errors occur when two kinds of stem families are used
for the segmentation. (Harris, 1955) exposes several
variations, more or less complex. Their implementation
does not furnish great improvements.
Our idea for improving the segmentation is to di- 
vide into three steps this operation. The first step 
computes the list of the most frequent morphemes. 
The second steps extends the list by segmenting 
words with the help of the morphemes already gener- 
ated. The third step consists in the segmentation of 
all the words with the morphemes obtained at the 
second step. The algorithm is illustrated with the 
suffixes segmentation, but the discovery of prefixes 
is totally symmetric: we just reverse the letters of 
D~jean 296 Morphemes for Structures Discovery 
I 
1 
1 
1 
II 
1 
1 
1 
1 
1 
1 
1 
II 
I 
I 
m 
B 
m 
II 
m 
m 
m 
m 
B 
II 
m 
I 
m 
m 
m 
I 
m 
.-: • 
the words. 
3.1 The discovery of the most frequent 
morphemes 
The discovery of the most frequent morphemes is
based on Harris' algorithm. We try to find beginnings
or endings of words which have the following
property: after a given sequence of letters, we count
the number of different letters. If this number is
higher than a threshold (half of the letters of the
alphabet), we have arrived at a morpheme boundary,
except when we are inside a sequence which corresponds
to a longer morpheme, a case we can detect. For
example, before the sequence on, we find 18 different
letters, thus on may be a morpheme. But 292 of
the 367 words in the corpus which end with on
actually end with ion. Since the longer sequence ion
represents more than 50% of the words ending in on,
we consider that on is a part of the morpheme
-ion.¹ We only keep the sequences which have a
frequency higher than 100.
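The absorption check, by which a shorter candidate is folded into a longer morpheme when the longer one covers more than half of its occurrences, might be sketched as follows. The function name and the toy word list are illustrative, not from the paper:

```python
def absorbed_by_longer(words, suffix, longer):
    """True if the longer suffix accounts for more than half of the words
    ending in the shorter one, so the short suffix is folded into it."""
    with_short = [w for w in words if w.endswith(suffix)]
    with_long = [w for w in with_short if w.endswith(longer)]
    return len(with_long) > 0.5 * len(with_short)

# Toy stand-in for the paper's counts (292 of the 367 "-on" words end in "-ion"):
words = ["station", "nation", "action", "carbon"]
print(absorbed_by_longer(words, "on", "ion"))  # True: "on" is part of "-ion"
```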
Table 1: The most frequent morphemes.
English:    -e -s -ed -ing -al -ation -ly -ic -ent
French:     -s -e -es -ent -er -ds -re -ation -ique
German:     -en -e -te -ten -er -es -lich -el
Turkish:    -m -in -lar -ler -dan -den -ini -mi
Swahili:    -wa -ia -u -eni -o -isha -ana -we
Swahili:    wa- m- ku- ali- ni- aka- ki- vi-
Vietnamese: NONE
3.2 The discovery of other morphemes 
Once these morphemes are found, we use them in
order to segment words and to find other morphemes,
thanks to the following rule: for a given
sequence of letters (light in Table 2), we check
if the following sequences of letters correspond to
morphemes already found. If half of them belong to the
morphemes found (like -s, -ed, -ing, -ly, -er), then the
others (-ness, -en, -est) are also considered as
morphemes.
This algorithm also generates wrong morphemes,
but their frequency is very low (1 or 2). Thus,
we only keep new morphemes which have a frequency
higher than a given threshold (5 in practice).
The morphemes with a frequency lower than this
threshold are not found. The morphemes list may
greatly depend on the type of corpus used. The number
of morphemes depends on the morphology of the
language. In Vietnamese, no morpheme is found.
¹ From the sequence on, we generate the morpheme
-ation.
Table 2: Second step of the morphemes discovery.
Morphemes found: -s -ed -ing -ly -er

words        New morphemes
light
lights
lighted
lighting
lightly
lighter
lightness    -ness
lightest     -est
lighten      -en
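The extension rule of this second step can be sketched as follows. This is a minimal Python illustration under our own naming; the paper's actual procedure scans all stems of the corpus rather than a single given one:

```python
def extend_morphemes(words, stem, known):
    """If at least half of the endings observed after `stem` are already known
    morphemes, accept the remaining endings as new morphemes."""
    endings = {w[len(stem):] for w in words if w.startswith(stem) and w != stem}
    matched = {e for e in endings if e in known}
    if 2 * len(matched) >= len(endings):
        return endings - matched
    return set()

known = {"s", "ed", "ing", "ly", "er"}   # the Table 2 seed list, dashes dropped
words = ["light", "lights", "lighted", "lighting", "lightly",
         "lighter", "lightness", "lightest", "lighten"]
print(sorted(extend_morphemes(words, "light", known)))  # ['en', 'est', 'ness']
```

Here five of the eight observed endings are already known, so the remaining three are promoted, reproducing the third column of Table 2.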
In English, a list of fifty morphemes is generated
(Table 3). The Turkish list contains more than 500
morphemes. We note that morphemes have a similar
behaviour to words: a small number of them
possesses a high frequency and accounts for the
majority of the occurrences in the corpus. We do not try
to generate all the morphemes of the corpus, since
the hundred most frequent morphemes are sufficient
for the construction of the higher level (the chunk
level). Some morphemes of the list given in Table 3
are composed of a sequence of morphemes (ful-ly,
ence-s). In highly morphological languages, most of
the morphemes correspond to sequences of elementary
morphemes. We do not try to resegment these
elements now: because of the presence of one-letter
morphemes, the resegmentation would inevitably lead to
the segmentation of the morphemes into letters. We
wait until the chunk level in order to refine these
morphemes (Section 4).
Table 3: Final English morphemes.
suffixes: -y -ward -ure -s -ry -ously -ous -ors -or -ness
-ments -ment -ly -less -ively -ive -ity -ious -ions -ion
-ings -ingly -ing -in -ily -ies -ic -ible -fully -ful -est -es
-ers -er -ence -ences -en -ement -ements -ely -ed -e
-ations -ation -ance -ances -ally -al -age -ably -able
-'s
prefixes: dis- in- pro- re- un-
3.3 The segmentation of the words 
Once the list of the morphemes is found, we use
it for segmenting all the words of the corpus. We
segment the words by applying the longest-match
algorithm: we segment each word with the longest
morpheme which matches the beginning or the ending
of the word. In order to allow the chunks discovery,
some words are not segmented: the most
frequent ones (5% of the words). They generally
correspond to grammatical words, and we do not segment
them in order to make the chunks discovery easier.
The following section explains how the lexical
words which appear in this list are segmented. We
checked the segmentation of 500 randomly selected
words and obtained 8 segmentations we consider
wrong (such as compla-in, forese-en, or the German word
antwortest² segmented as antwor-test with the
morpheme -test, which is correct in lern-test³,
preterite 2nd person).
Harris' algorithm realizes the segmentation of
words during the discovery of morphemes. The
dissociation of the two phases allows a more correct
segmentation. With Harris' algorithm, the words
startling, startled and started generate the following
segmentation: start-ed, start-led, start-ling (Section
3). With our method, the segmentation is
startl-ing, startl-ed and start-ed, since -ling and -led
are not morphemes. Some errors may be generated
(as with antwortest), but only for a few words.
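A minimal sketch of this longest-match step, assuming suffixes are stored without the leading dash and the frequent grammatical words are passed in as a skip set (names and data are ours):

```python
def segment(word, suffixes, skip=frozenset()):
    """Split off the longest known suffix; frequent grammatical words
    listed in `skip` are left whole."""
    if word in skip:
        return word
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s):
            return word[:-len(s)] + "-" + s
    return word

suffixes = {"s", "ed", "ing"}
print(segment("startling", suffixes))          # startl-ing
print(segment("startled", suffixes))           # startl-ed
print(segment("started", suffixes))            # start-ed
print(segment("the", suffixes, skip={"the"}))  # the (frequent word, kept whole)
```

Because -ling and -led are absent from the suffix list, the start-led/start-ling errors of the one-pass algorithm cannot occur here.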
4 The correction of word
segmentation
We now explain how the frequent lexical words and
the morphemes composed of a sequence of other
morphemes are segmented. The method uses the
contextual information discovered at the chunk level.
During the construction of chunks, we generate
bigrams of morphemes (Table 4). We use these bigrams
in order to refine the segmentation. Each
word or morpheme occurring in a context corresponding
to a chunk structure will be segmented. For
example, the German word Hauses (house) occurring
in des Hauses is segmented as des Haus-es
thanks to the context des S-es⁴. The algorithm is
the same for sequences of morphemes. The French
sequence antes is segmented as ante-s thanks to the
context les S-s.
Table 4: Segmentation correction.
         bigram     example      correct segmentation
German   des S-es   des Hauses   des Haus-es
         ich S-te   ich machte   ich mach-te
French   les S-s    les S-es     les S-e-s
         les S-s    les S-antes  les S-ante-s

² (you) answer. Antwort-en: to answer.
³ (you) learned. Lern-en: to learn.
⁴ S for stem.
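The bigram-driven resegmentation might be sketched as follows; the function name and the two-entry bigram table are a hypothetical miniature in the spirit of Table 4:

```python
def resegment(prev_word, word, bigrams):
    """If the left word and the following word's ending match a chunk bigram
    pattern such as ('des', 'es'), split the ending off as a morpheme."""
    for left, suffix in bigrams:
        if prev_word == left and word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)] + "-" + suffix
    return word

# Hypothetical miniature of the bigram table:
bigrams = [("des", "es"), ("les", "s")]
print(resegment("des", "Hauses", bigrams))  # Haus-es
print(resegment("les", "antes", bigrams))   # ante-s
```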
5 The necessity of morphemes in a 
procedure of discovery 
The morpheme level allows the emergence of structures
which hardly appear at the word level: structures
which are marked by morphemes, like the concordance
structures. For example, the French structure
(les-S-s S-s) or the German one (des-S-en S-es) are
easily found thanks to their frequencies. Other
structures are also easily found, like the adverb-verb
structure in English, characterized by the high frequency
of the bigram (S-ly S-ed). Other useful morphemes
are the inflectional ones, which mark relations between
chunks at the clause level. The relations between chunks
are discovered from bigrams composed of grammatical
words and morphemes belonging to contiguous
chunks. Frequent bigrams generally correspond to
relations between two chunks (like S-ed S-ly). A
positional criterion allows the elimination of bad
frequent bigrams like (of-S S-ed) (Noun Complement
- Verb sequence): since this bigram never begins a
sentence, we consider that the structure is not
complete and requires another chunk in order to
complete the relational structure (the-S of-S S-ed).
We conclude by claiming that the morphemic level is
essential and unavoidable in a procedure of syntactic
structures discovery.

References 
Brent, Michael. 1993. From grammar to lexicon:
Unsupervised learning of lexical syntax. Computational
Linguistics, 19:243-262.
Brent, Michael, Sreerama K. Murthy, and Andrew
Lundberg. 1995. Discovering morphemic suffixes:
A case study in MDL induction. In Fifth International
Workshop on AI and Statistics, Ft. Lauderdale,
Florida.
Brill, Eric. 1993. Automatic grammar induction
and parsing free text: a transformation-based
approach. In ACL93.
de Marcken, Carl. 1995. The unsupervised acquisition
of a lexicon from continuous speech. Technical
report, MIT Artificial Intelligence Lab. Memo
1558.
Elman, J.L. 1990. Finding structure in time. Cognitive
Science, 14:179-211.
Harris, Zellig. 1951. Structural Linguistics. The
University of Chicago Press.
Harris, Zellig. 1955. From phonemes to morphemes.
Language, 31(2):190-222.
