BROAD COVERAGE AUTOMATIC MORPHOLOGICAL 
SEGMENTATION OF GERMAN WORDS 
T. PACHUNKE, O. MERTINEIT, K. WOTHKE, R. SCHMIDT
IBM Germany 
Heidelberg Scientific Center 
Tiergartenstr. 15 
D-W-6900 Heidelberg
ABSTRACT 
A system for the automatic segmentation of 
German words into morphs was developed. 
The main linguistic knowledge sources used 
by the system are a word syntax and a morph 
dictionary. The syntax is written in the 
formalism of right linear regular grammars 
and comprises approximately 1,400 rules de- 
scribing the set of those sequences of morph 
classes which underlie syntactically well 
formed words. The morph dictionary contains 
almost 11,000 morphs. Each morph is as- 
signed to up to 6 morph classes. - Statistical 
evaluations with 6000 test words showed that 
more than 99% of the segmented words got a 
correct segmentation. 
1 INTRODUCTION 
IBM Scientific Center Heidelberg is develop- 
ing a large vocabulary speech recognition sys- 
tem for German (Wothke et al. 1989). The 
system needs for each word of its reference 
vocabulary two types of reference patterns: 
• prototypal acoustic reference patterns. 
• phonetic transcriptions of the main pro- 
nunciation variants of the word. 
Up to now the transcriptions were generated 
for each orthographic word of the reference 
vocabulary by an automatic procedure having 
two drawbacks which caused a high amount 
of manual revision for the generated tran- 
scriptions: 
• For each word only one transcription was
generated. Our speech recognition system,
however, needs at least the most significant
pronunciation variants of each word.
• The automatic procedure took into account
only the letter context of each letter
to determine its transcription. In German,
however, the transcription of a letter
very often also depends on its
morphological context. Most of the
transcription errors of the former system
were a consequence of the fact that the
system did not have any information
about the morph structure of the words.
To reduce the manual work necessary to revise
the transcriptions, we are currently developing
a system with the following new features:
1. An orthographic word is first segmented 
into its morphs. 
2. In a second step one or more phonetic 
transcriptions are produced for each seg- 
mentation of the word using letter-to- 
phone rules which can refer to the morph 
structure detected in the first step. 
The following paragraphs will deal with the 
first step. We will mainly restrict ourselves to 
the linguistic knowledge incorporated in our 
current morph segmentation system. The 
overall architecture of the segmentation system 
and details of the segmentation algorithm are 
described in Wothke/Schmidt (1991). 
A morphological segmentation procedure for 
German has to deal with the following basic 
features of German morphology: 
• Composition. 
• Derivation. 
• Inflexion.
• Ambiguous morph structure: Some words 
can be segmented in several ways. 
• Reduction of consonant triples: If two lexi- 
cal morphs are concatenated, where the 
first morph ends in a vocalic letter and 
two identical consonantal letters and the 
second morph starts with the same con- 
sonantal letter and a vocalic letter, then 
the result of the concatenation does not 
contain the consonantal letter three times 
but only twice. - The inverse process, i.e. 
trebling of double consonants, has to be 
carried out, when segmenting such words. 
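The inverse process described in the last item can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of the system itself: the function name `trebling_candidates`, the vowel set, and the example word "Schiffahrt" (a pre-reform spelling of Schiff + Fahrt) are ours and do not come from the paper.

```python
# Assumed set of vocalic letters; the paper does not list them.
VOWELS = set("aeiouäöü")


def trebling_candidates(word):
    """Return the word itself plus every variant in which a double
    consonant standing between two vocalic letters is expanded to a
    triple consonant. A segmenter could then try all candidates."""
    lower = word.lower()
    candidates = [word]
    for i in range(1, len(lower) - 2):
        c = lower[i]
        if (c.isalpha() and c not in VOWELS
                and lower[i + 1] == c          # double consonant
                and lower[i - 1] in VOWELS     # vocalic letter before
                and lower[i + 2] in VOWELS):   # vocalic letter after
            # Insert one more copy of the consonant at position i.
            candidates.append(word[:i] + word[i] + word[i:])
    return candidates


print(trebling_candidates("Schiffahrt"))
```

The trebled candidate "Schifffahrt" can then be split into the two lexical morphs, whereas the original spelling cannot.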
Figure 1 shows the architecture of the 
morphological segmentation system. The inter- 
preter for the segmentation has 5 main input 
files: 
Actes de COLING-92, Nantes, 23-28 août 1992 1218 Proc. of COLING-92, Nantes, Aug. 23-28, 1992
Figure 1. Architecture of the Morphological Segmentation System
• A morph dictionary containing information
about the morph class/es each morph
belongs to.
• A word syntax represented in the
formalism of right linear regular grammars.
It has to describe the set of those
sequences of morph classes which underlie
words.
• A morph boundary table, where the user
can specify the symbols used by the interpreter
to mark the different kinds of
morph boundaries. We specified that
+ is inserted before a prefix,
= is inserted before a lexical morph,
% is inserted before an infix, a derivational,
or an inflexional suffix,
_ is inserted before a Latin or Greek
derivational suffix,
~ is inserted before a French or English
derivational suffix.
• A table of forbidden classes, where the
user can enter the names of those morph
classes which may not attract any of the
three identical consonantal letters arising
from consonant trebling (i.e. infix classes,
suffix classes, and prefix classes).
• A file containing the orthographic words 
to be segmented into morphs. 
The linguistic knowledge in the first 4 files
exists in 2 representations:
• An external representation which is created
by the user of the system and which
is human readable.
• An internal representation which is automatically
generated by a preprocessor
from the external representation and
which is more suitable for processing
by the interpreter.
The interpreter loads the internal representations
of the 4 files and segments orthographic
words according to the knowledge in the files.
If a word is morphologically ambiguous, several
segmentations are generated.
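How a morph dictionary, a right linear word syntax and a boundary table can interact may be illustrated with a toy interpreter. This is a minimal sketch under our own assumptions: the five dictionary entries, the class names `PREFIX`, `VSTEM`, `INFL`, the nonterminals and the function `segment` are hypothetical, and the real algorithm is the one described in Wothke/Schmidt (1991).

```python
# Toy morph dictionary: morph -> set of morph classes (hypothetical).
MORPH_DICT = {
    "herunter": {"PREFIX"},
    "her": {"PREFIX"},
    "unter": {"PREFIX"},
    "geh": {"VSTEM"},
    "en": {"INFL"},
}

# Boundary symbol written before a morph of the given class.
BOUNDARY = {"PREFIX": "+", "VSTEM": "=", "INFL": "%"}

# Right linear rules: nonterminal -> morph_class nonterminal.
RULES = {
    "WORD": [("PREFIX", "STEM"), ("VSTEM", "ENDING")],
    "STEM": [("VSTEM", "ENDING")],
    "ENDING": [("INFL", "END")],
}
FINAL = {"END"}  # nonterminals that may derive the empty string


def segment(word, state="WORD"):
    """Return all segmentations of `word` licensed by RULES."""
    if not word:
        return [""] if state in FINAL else []
    results = []
    for morph_class, next_state in RULES.get(state, []):
        for end in range(1, len(word) + 1):
            morph = word[:end]
            if morph_class in MORPH_DICT.get(morph, ()):
                for rest in segment(word[end:], next_state):
                    results.append(BOUNDARY[morph_class] + morph + rest)
    return results


print(segment("heruntergehen"))
```

Because this toy syntax admits at most one prefix, the split into "her" and "unter" is not generated, although both morphs are in the dictionary. A morphologically ambiguous word would simply yield more than one list entry.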
2 THE LINGUISTIC 
KNOWLEDGE 
The main linguistic knowledge sources of the
system are the morph dictionary, which contains
information about the morph class/es
each morph belongs to, and the word syntax.
We developed a classification scheme for
German morphs and a suitable word syntax.
The significant step to our current version was
the classification of an extensive German
morph list based on about 9,000 morphs
compiled by the Institut für deutsche Sprache
in Mannheim (Germany). We merged these
morphs with an experimental list of about
2,200 morphs which we used in the former
versions of our system. Additionally, we increased
the resulting list up to almost 11,000
entries by many foreign morphs.
It turned out that for the manual development
of the syntax the formalism of finite state
networks is easier to handle than a right linear
regular grammar. So we first represented the
syntax with a finite state network, which
finally was translated into a functionally
equivalent right linear grammar.
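The translation from a finite state network into a right linear grammar can be sketched with the standard textbook construction: every arc p --c--> q yields a rule P -> c Q, and an additional rule P -> c if q is a final state. The paper does not spell out the construction actually used, so the function `network_to_grammar` and the tiny example network below are assumptions for illustration only.

```python
def network_to_grammar(arcs, final_states):
    """Translate arcs (source, morph_class, target) into right linear
    rules (lhs, morph_class, rhs_nonterminal). A rule whose third
    component is None ends the derivation."""
    rules = []
    for source, morph_class, target in arcs:
        rules.append((source, morph_class, target))
        if target in final_states:
            rules.append((source, morph_class, None))
    return rules


def format_rule(rule):
    lhs, morph_class, rhs = rule
    return f"{lhs} -> {morph_class}" + (f" {rhs}" if rhs else "")


# Hypothetical four-arc network with one final state F.
ARCS = [
    ("S", "PREFIX", "A"),
    ("S", "STEM", "B"),
    ("A", "STEM", "B"),
    ("B", "INFL", "F"),
]
for rule in network_to_grammar(ARCS, {"F"}):
    print(format_rule(rule))
```

With such a construction the number of rules stays close to the number of arcs, which would be in line with the figures reported here (1,368 arcs, approximately 1,400 rules), although the paper does not state this relationship explicitly.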
Syntax and Classification Scheme used        Third     Fourth
Syntax           Number of states              129        289
                 Number of arcs              1,050      1,368
Classification   Number of morphs            2,200     10,784
Scheme           Number of morph classes       183        198
Table 1. Overview of the Extent of the Syntaxes and Classification Systems Developed
So far, we have successively developed and tested
four classification schemes, each with a
new, better syntax. We describe the third and
fourth schemes, which are of current interest (cf.
Table 1).
The substructures of the entire transition net
dealing with the word classes verb, adjective,
noun etc. will be called verb net, adjective net,
noun net etc. We should stress that these substructures
are not independent automata with
any separate input. Nevertheless we call them
nets; parts of these nets will be called
subnets. Although the word formation of
the different word classes is not fully distinct
and does share some substructures, it was not
possible to design the entire net in such a way
that the nets for these word classes physically
share some subnets. Instead, physical copies
of common subnets had to be created for each
occurrence of such a subnet in each of the
nets. The reason is that we used a finite state
network for the representation of the word syntax.
This formalism does not allow one common
subnet to be activated from different points
with a subsequent return to the appropriate
activation point.
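The need for physical copies can be made concrete with a small sketch. Since a plain finite state network has no call and return mechanism, a shared subnet has to be inlined at every place of use, with its inner states renamed so that the copies stay separate. The helper `embed_subnet`, the state names and the two-arc comparison subnet below are hypothetical and only illustrate the idea.

```python
def embed_subnet(subnet_arcs, subnet_start, subnet_end, entry, exit_, tag):
    """Return a private copy of a subnet, wired between the states
    `entry` and `exit_` of a host net. Inner states get a suffix so
    that different copies never share a state."""
    def rename(state):
        if state == subnet_start:
            return entry
        if state == subnet_end:
            return exit_
        return f"{state}@{tag}"

    return [(rename(s), label, rename(t)) for s, label, t in subnet_arcs]


# Hypothetical subnet for adjective comparison, used in two host nets.
COMPARISON = [("c0", "ADJ_STEM", "c1"), ("c1", "COMP_SUFFIX", "c2")]

noun_copy = embed_subnet(COMPARISON, "c0", "c2", "N5", "N6", "noun")
adj_copy = embed_subnet(COMPARISON, "c0", "c2", "A1", "A2", "adj")
print(noun_copy)
print(adj_copy)
```

Each use of the subnet therefore adds its full number of arcs to the overall network, which is one plausible reason for the growth in states and arcs between versions.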
We will limit the following description to the 
nets for those word classes with productivity 
in word formation. 
Verbs 
Our Verb Net is responsible for the segmentation
of finite verbs. Those of its subnets
containing stem-labelled arcs refer to different
combinations of mood, tense, and weak
vs. strong inflexion of the verb stem. Each of these
combinations demands specific inflexional
endings. Weak stems are tense-invariant;
strong stems can vary according to tense
by vowel gradation (Umlaut or Ablaut). As a
consequence, the classification of strong verb
stems is oriented towards their suitability for
certain tenses. For example, <=ging> is an
imperfect tense form of <=geh%en> (engl.:
to go). In our morph dictionary the two
morphs <geh> and <ging> are two independent
entries, each with its own tense-oriented
classification. Weak stems are
classified according to prefixation and derivation
needs. We took into account three groups
of derivational suffixes: 1) <-el>, <-er>, 2)
<-ig>, <-lich>, and 3) <~ier>. In the area
of verb prefixes, one problem solved in our
current version was to avoid the splitting of
particular prefixes, e.g. *<+her+unter=geh%en>
alongside the correct segmentation,
which is <+herunter=geh%en> (engl.:
to go down).
In German, each infinitive can take the role
of a noun, and each participle can do the same
after being inflected. As a consequence, the
part of our transition net related to non-finite
verbs is integrated into the Noun Net.
The set of verb stem classes had to be expanded
for our current version to implement
composition restrictions concerning verb
stems as parts of nouns. For example, we had
to cope with missegmentations such as
*<+Er=find+er=schon%ung> alongside
the correct segmentation <+Er=find%er=schon%ung>
(engl.: careful treatment of
inventors). At least two restrictions exist:
Firstly, the verb stem <find> is not allowed
before a noun (which the word
*<Erschonung> would be, if it existed) but,
e.g., the originally identically classified verb
stem <bind> is, as in <Bindfaden> (engl.:
string). Secondly, the morph <er> is not a
suitable prefix for the verb stem <schon>
but is one, e.g., for the originally identically
classified verb stem <schein>, leading to
<erscheinen> (engl.: to appear). Verb-stem-related
restrictions like these, which we implemented
in our system by adding morph
classifications to the existing ones, are only
relevant for nouns. In the first case mentioned
above, this is obvious. The second restriction
does not concern finite verbs, because missegmentations
only occur when the morph <er>
is positioned between two stems. At the beginning
of a word, the morph <er> can be
seen as a prefix without any restrictions.
Adjectives
The adjective net consists of three subnets,
each representing a possible way of adjectival
derivation in German.
1. Simple adjectives like <schnell> (engl.:
fast), <schön> (engl.: beautiful) etc.
These stems can be compared and
inflected. Some stems occur only in a
certain degree of comparison like
<bess> (stem of engl.: better), <bes>
(stem of engl.: best). They have obligatory
comparative or superlative suffixes while
the corresponding stems of the positive
degree must not be followed by these suffixes,
like e.g. <gut> (engl.: good).
2. Adjectives derived from verbs or verbal
stems like <+be=geh%bar> (engl.:
passable).
3. Adjectives derived from nouns. Example:
<=held%en%haft> (engl.: heroic).
As a peculiarity of German word formation,
a past participle may be compared and
inflected like an adjective stem. Example: The
past participle of <gelingen> (engl.: to succeed)
has the forms of comparison
<+ge=lung%en>, <+ge=lung%en%er>,
<(am) +ge=lung%en%st%en>. These
may be translated as "successful, more successful,
most successful".
Roughly speaking, the concept of the adjective
net is to allow an adjective stem to be substituted
by more complex constructions, like the
ones described above. Special subnets exist
for adverbs and for adjectives with non-German
stems. The latter are needed for
marking foreign suffixes as in
<=parall_el>, because these suffixes attract
the word accent, which in German causes a
vowel to be pronounced long.
Nouns 
A very productive feature of German word
formation in the area of nouns is composition:
New nouns may be formed by concatenation
of lexical morphs, optionally interspersed with
prefixes, suffixes, and infixes. In our noun net,
this feature is modelled by loops over lexical
morphs which can be left by inflexion modules
to reach a final state and which cross infix
modules (including zero-infix), prefix modules,
and (derivational) suffix modules.
Inflexional suffixes occurring in compound
nouns between lexical morphs are treated by
us the same way as infixes.
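The loop structure of the noun net can be illustrated by a tiny acceptor over sequences of morph classes. This is a strongly simplified sketch: the three states, the class names `PREFIX`, `LEX`, `INFIX`, `DERIV`, `INFL` and the function `accepts` are hypothetical, whereas the real noun net distinguishes many more classes.

```python
# (state, morph_class) -> next state; all names are hypothetical.
NOUN_NET = {
    ("N0", "PREFIX"): "N0",   # prefix module before a lexical morph
    ("N0", "LEX"): "N1",      # lexical morph
    ("N1", "LEX"): "N1",      # zero infix: next lexical morph follows
    ("N1", "INFIX"): "N0",    # infix, another lexical morph must follow
    ("N1", "DERIV"): "N1",    # derivational suffix module
    ("N1", "INFL"): "END",    # inflexion module leaves the loop
}
FINAL = {"N1", "END"}


def accepts(morph_classes):
    """Check whether a sequence of morph classes is a well formed
    compound noun according to NOUN_NET."""
    state = "N0"
    for morph_class in morph_classes:
        state = NOUN_NET.get((state, morph_class))
        if state is None:
            return False
    return state in FINAL


print(accepts(["LEX", "INFIX", "LEX"]))   # stem, infix, stem
print(accepts(["LEX", "INFIX"]))          # dangling infix
```

Because the lexical morph arcs form a loop, compounds of any length are covered without adding states, while the inflexion arc provides the only way out of the loop towards the end of the word.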
Noun stems are classified according to the
features umlaut, etymology (German vs. not
German), obligatory affix, inflexional suffix,
and composition needs.
Foreign Words 
Each of the described nets contains subnets
dealing with the formation of foreign words
which are involved in German word formation,
e.g. <=mum.if iz%ier%en> (engl.: to
mummify), <=Bas is> (engl.: basis),
<=Port~ier> (French; engl.: porter).
Foreign words without connection to German
word formation and names are not intended
to be segmented by our system. So an unsegmented
word is not necessarily a system failure
but can be a required rejection (cf.
Table 3).
3 EVALUATION 
The experimental morph list used during the
development of versions 1, 2, and 3 of our
syntax and classification scheme consisted of
2,200 morphs mainly selected from Ortmann
(1985). In the fourth system the morph list
was extended to almost 11,000 morphs (cf.
Table 1). To evaluate the different versions
(each consisting of syntax and classification
scheme), two word sets were used, each containing
3,000 words. The first set consisted of
ranks 1-1,000, 300,001-301,000, and
600,001-601,000 of a frequency list sorted in
descending order which was created from a
corpus containing articles of a German business
newspaper with about 31,000,000 running
words. This set was used for the iterative improvement
of our system. It is called the test set.
The second set, serving as a control set, contained
ranks 1,001-2,000, 200,001-201,000, and
400,001-401,000 of a frequency list created from
a corpus of a common newspaper with about
13,200,000 running words. The control set was necessary
because of the risk that the later versions were
designed in such a way as to cope only with
those errors which arose when applying the
earlier versions to the test set.
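The selection of rank windows from a frequency list can be sketched as follows. The function `rank_windows` and the toy token list are our own illustration; only the rank ranges in `TEST_WINDOWS` and `CONTROL_WINDOWS` are taken from the text above, and how ties in frequency were ordered in the original lists is not stated.

```python
from collections import Counter

# Rank ranges (first, last), 1-based and inclusive, as given in the text.
TEST_WINDOWS = [(1, 1_000), (300_001, 301_000), (600_001, 601_000)]
CONTROL_WINDOWS = [(1_001, 2_000), (200_001, 201_000), (400_001, 401_000)]


def rank_windows(tokens, windows):
    """Build a frequency list sorted in descending order and return
    the word forms whose ranks fall into the given windows."""
    ranked = [word for word, _ in Counter(tokens).most_common()]
    sample = []
    for first, last in windows:
        sample.extend(ranked[first - 1:last])
    return sample


# Toy corpus: "a" has rank 1, "b" rank 2, "c" rank 3, "d" rank 4.
toy_tokens = ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"]
print(rank_windows(toy_tokens, [(1, 1), (3, 4)]))
```

Each of the two window lists covers 3 x 1,000 ranks, which matches the stated size of 3,000 words per set, and mixes very frequent word forms with rare ones.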
Table 2 shows the improvement in coverage,
mainly achieved by extending the morph list:
While, referring to the control set only, the
third system segmented only 1,425 of the input
words (= 47.5%), the fourth system segmented
2,492 words (= 83.1%). The quality of
the segmentations of the fourth system is a
little worse (94.4% vs. 97.1% correct segmentations
on the control set). The reason for this effect is the
larger morph list, which allows more nonsense
concatenations of morphs. Although we made
grammar and classification system more restrictive,
it would have been too costly to
strive for equal or better segmentation quality.
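The derived figures of Table 2 follow directly from three raw counts per column, as the following sketch shows. The function `table2_row` and its key names are ours; the counts are those reported in Table 2, and the total of 3,000 words per set is taken from the text.

```python
def table2_row(words_segmented, segmentations, correct, total_words=3000):
    """Recompute the derived figures of one Table 2 column."""
    wrong = segmentations - correct
    return {
        "coverage_pct": round(100 * words_segmented / total_words, 1),
        "segs_per_word": round(segmentations / words_segmented, 2),
        "correct_pct": round(100 * correct / segmentations, 1),
        "wrong_pct": round(100 * wrong / segmentations, 1),
    }


# Counts from Table 2 (control set).
third_control = table2_row(1425, 1533, 1489)
fourth_control = table2_row(2492, 2815, 2656)
print(third_control)
print(fourth_control)
```

On the control set, coverage thus rises from 47.5% to 83.1%, while the share of correct segmentations drops from 97.1% to 94.4% and the average number of segmentations per segmented word grows from 1.08 to 1.13.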
Table 3 gives an overview of the words which 
were not segmented by the fourth system. 
Many of them are proper names. 
Syntax and Classification         Third                          Fourth
Scheme used
Corpus used                       "business      "common         "business      "common
                                  newspaper"     newspaper"      newspaper"     newspaper"
                                  (test set)     (control set)   (test set)     (control set)

Number of words segmented and     1,955          1,425           2,715          2,492
percentage related to total       65.2%          47.5%           90.5%          83.1%
number of 3,000

Number of segmentations and       2,026          1,533           2,960          2,815
ratio of segmentations obtained   1.04           1.08            1.09           1.13
per segmented word on average

Number of correct segmentations   1,991          1,489           2,854          2,656
and percentage related to         98.3%          97.1%           96.4%          94.4%
total number of segmentations

Number of wrong segmentations     35             44              106            159
and percentage related to total   1.7%           2.9%            3.6%           5.6%
number of segmentations

Number of words with at least     not evaluated  not evaluated   2,710          2,482
one correct segmentation and                                     99.8%          99.6%
percentage related to number of
segmented words
Table 2. Results for Words Segmented by the Third and Fourth Version of the Segmentation Procedure

Corpus used                             "business        "common          both
                                        newspaper"       newspaper"       (test + control
                                        (test set)       (control set)    set)
Number of words rejected                285 (of 3,000)   508 (of 3,000)   793 (of 6,000)
- words which should be segmented        33  11.6%       215  42.3%       248  31.3%
- words which cannot be segmented       252  88.4%       293  57.7%       545  68.7%
  - words with spelling errors           19   6.7%         8   1.6%        27   3.4%
  - foreign words which are not
    used in German                       55  19.3%        17   3.3%        72   9.1%
  - names                               178  62.5%       268  52.8%       446  56.2%
Table 3. Results for Words Rejected by Fourth Version of the Segmentation Procedure
4 CONCLUSION
A morph classification scheme and a syntax
for the segmentation of German words were
developed, with which more than 99% of the
segmented words were correctly segmented,
i.e. at least one resulting segmentation was
correct.
With the extended morph list used, about
13.2% of the words (= 793 out of 6,000) were
not segmented. Out of these unsegmented
words, 68.7% a priori could not be segmented,
i.e. no conceivable segmentation procedure
can segment these words.
ACKNOWLEDGEMENTS
We thank Georg Walch, who put at our disposal
the frequency lists mentioned in chapter
3. We would also like to express our gratitude
for the encouragement received from our
manager Eric Keppel.
REFERENCES 
Ortmann, W.D. (1985): Wortbildung und
Morphemstruktur hochfrequenter deutscher
Wortformen. Teil I. München.
Wothke, K. / Bandara, U. / Kempf, J. / Keppel, E.
/ Mohr, K. / Walch, G. (1989): The SPRING
Speech Recognition System for German. In:
Eurospeech 89, European Conference on
Speech Communication and Technology.
Paris - September 1989. Edinburgh. Vol. II.
pp. 9-12.
Wothke, K. / Schmidt, R. (1991): A Morphological
Segmentation Procedure for German. In: Proceedings
of the International Conference on
Current Issues in Computational Linguistics.
Penang (Malaysia). pp. 137-147.
