A Finite State Approach to German Verb Morphology 
Giinther GORZ 
IBM -- WT LILOG 
Schlot3str. 70, D-7000 Stuttgart 1, W. Germany 
(on leave from Univ. of Erlangen-Niirnberg) 
Goerz@SUMEX-AIM.STANFORD.EDU, GOERZ@DSOLILOG.BITNET 
Dietrich PAULUS 
Univ. of Erlangen-Nilrnberg, IMMD V 
Martensstr. 3, D-8520 Erlangen, W. Germany 
Paulus@fani53.uucp 
Abstract 
This paper presents a new, language independent model for 
analysis and generation of word forms based on Finite State 
Transducers (FSTs). It has been completely implemented on 
a PC and successfully tested with lexicons and rules cover- 
ing all of German verb morphology and the most interesting 
subsets of French and Spanish verbs as well. The linguistic 
databases consist of a'letter-tree structured lexicon with annc~ 
tated feature lists and a FST which is constructed from a set 
of morphophonological rules. These rewriting rules operate on 
complete words unlike other FST-based systems. 
1 Introduction 
Until the beginning of this decade, morphological parsers usually were 
restricted to on e particular language; in fact we do not know of any 
one which was language independent or even applicable to a wide 
class of non-trivial inflected languages. In the meantime, the situa- 
tion has changed a lot through the usage of Finite State Transducers 
(FSTs). Although the formalism of generative phonology seems to 
be powerful enough to cover almost any language, it is very diffi- 
cult to implement it computationally. Recent approaches to compile 
the rules of generative phonology into finite automata offer solutions 
to both problems. In the following, we report on a successful and 
complete application of this technique to the morphology of German 
verbs. To demonstrate its generality, it has also been applied to a 
large subset of French -- in fact the most interesting cases -- and 
some Spanish verbs. 
2 The Finite State Approach 
As Gazdar/1985/ \[1\] observed, only very few papers on the math- 
ematical foundations of modern phonology exist. He quotes from 
Johnson's/1970/PhD thesis \[2\], the earliest study of this kind, who 
states that "any theory which allows phonological rules to simulate 
arbitrary rewriting systems is seriously defective, for it asserts next to 
nothing about the sorts of mappings the rules can perform" (/John- 
son 1970\] \[2\], p. 42). According to Gazdar /1985/ (\[1\], p. 2) Johnson 
"proves that a phonology that permits only simultaneous rule appli- 
cation, as opposed to iterative derivational application, is equivalent 
to an FST. And he then argues that most of the phonology current 
around 1970 could eitlmr be formalized or reanalyzed in terms of 
simultaneous rule application, and could thus be reduced to FSTs." 
At the Winter LSA meeting at New York in December 1981, R. 
Kaplan and M. Kay gave a talk -- a written account does nat exist -- 
in which they showed "how the iteratively applied rules of standard 
generative phonology could, individually, be algorithmically compiled 
into FSTs" (/Gazdar 1985/\[1\], p. 2) under the constraint that rules 
may not be reapplied to their own outputs. Such a finite ordered 
cascade of FSTs can be collapsed into a single FST whose behavior 
is equivalent to that of the original generative rules (cf./Kay 1983/ 
\[4\], p. 100-104). A FST is a special kind of finite automaton which 
operates simultaneously on an input and an output tape such that it 
inspects two symbols at a time. In Kay's approach, the FSTs carry 
two labels, each label referring to one of the two tapes. In general, a 
FST is said to accept a pair of tapes if the symbols on them match a 
sequence of transitions starting in an initial state and ending in one 
of the designated final states. If no such sequence can be found, the 
tapes are rejected. To allow tapes of different length to be accepted, 
a symbol to be matched against one or other of the tapes to do a 
transition may be empty, in which case the corresponding tape is 
ignored. 
There are two advantages with this approach: The first is, that 
such a combined FST can be implemented easily as a simple and 
very efficient program. Second, unlike ordered sets of rewriting rules, 
there is no directionality in principle, so that the same machine can 
be used for analysis and generation as well. 
Kimmo Koskenniemi /1983a, 1983b, 1984, 1985/\] (\[5\], \[6\], \[7\], 
\[8\]) took up this approach and applied a variation of it to some heav- 
ily inflected languages, first of all to Finnish. His "two-level" model 
proposes parallel rules instead of successive ones like those of gener- 
ative phonology. The term "two-level" is supposed to express that 
there are only two levels, the lexicai and the surface level, and that 
there are no intermediate ones, even logically. Besides its simplicity 
-- in particular with respect to implementation -- the problematic 
ordering of rules is avoided. 
3 A Parallel ttewriting Variant of FSTs with 
Feature Unification 
Although Koskenniemi's machinery works in parallel with respect to 
the rules, rewriting is still performed in a sequential manner: each 
word form is processed letter by letter (or morpheme by morpheme) 
such that all replacements are done one at a time. Certainly this 
model does not depend on the processing direction from left to right, 
but at any time during processing it focusses on only one symbol on 
the input tape. 
It is precisely this feature, where our approach, based on a sug- 
gestion by Kay, differs from Koskenniemi's. Our work grew out of 
discussions with M. Kay, to which the first author had the oppor- 
tunity during a research stay at CSLI, Stanford, in summer 1985. 
Without his help oar investigations would not have been possible. In 
oul system, rewriting is performed over complete surface words, not 
letters or morphemes. There is no translation from lexical to surface 
strings, because there is only one level, the level of surface strings. 
Rewriting is defined by rules satisfying the scheme 
Pattern -+ Replacement 
where both, Pattern and Replacement, are strings that are allowed to 
contain the wild card "?" character which matches exactly one (and 
the same) letter. Let a,b, wl,w2 E ~* where \]E is an alphabet. For all 
wl, w2 the rule a --~ b, with Pattern= a and Replacement= b, rewrites 
wlaw2 to wlbw2. It should be noted that only one occurrence of the 
Pattern is rewritten. Furthermore, it can be specified whether the 
search is to be conducted from left to right or vice versa. Hence, it is 
possible to perform rewriting in parallel in contrast to Koskenniemi's 
sequential mode. 
The rules are attached to the edges of a FST; hence the applica- 
tion order of the rules is determined by the sequence of admissible 
transitions. Conflicts arising from the fact that at a given state the 
patterns of several rules match are resolved by the strategy described 
in sec. 5. 
212 
Matchil~g of the left hand side of a rule is only one condition to do 
a transition successfully. The second condition is that the list of mor- 
phosyntactic features of the actual item can be successfully unified 
with the fe~ture llst attadmd to the resp. edge of the automaton. 
The required unification procedure realizes a slightly extended 
version of the well known term unification algorithm. The objects to 
be unified are not lists of functors and arguments in a fixed order with 
fixed lengths, but sets of attributes (named arguments) of arbitrary 
length. The argument values, however, are restricted to atomic ob- 
jects, and therefore not allowed to be attribute lists themselves (as it 
is the case with the recnrsively defined functional structure datatype 
in unification grammars). 
Example I: 
Note that words are delimited by angle brackets such that affixes 
can be substituted for the empty string at the beginning or end of a 
word. 
Some rewriting rules 
~ ~ O 
Ilt --+ mm 
• Corresponding automaton fragment 
((s~art all 
("a .... o" stl ((tompus imperf) (group 1))) 
...) 
(st1 nil 
("m" "~" st2 ((group 1)))) 
(st2 nil 
(">" "en>" end ((pers 1) (num sing) 
(mode indic))) 
...) 
(end t)) 
This automaton fragment generates "<kam>" with the feature 
list ((tempus import) (group 1) (hUm sing) (mode indic) 
(pets 1)) t rein the infinitive form "<kommen>'. 
Currently, there is no cmnpiler wldch generates an automaton 
from a given set of rules like the one by Karttunen et al. /1987/ \[3\], 
i.e. the automaton has to be coded manually. 
4 The \[,exicon 
In order to achieve fast access and to avoid redundancy wherever 
possible, the lexicon is realized as a letter tree with annotated feature 
lists for terminal nodes. 
Example 2: A section of the letter4ree lexicon containing "wa- 
gen', "wiege£', and "w~igen'. 
4" (\~ (\a (\g (\e (\n (\÷ ((group 2))))))) 
(\i (\e (\g (\e (\n (\÷ ((group 4)))))))) 
(\a (\g (\e (\n 4\÷ ((group 3))))))))) 
5 The Control Strategy 
Our implementation follows Kay's suggestion, in that processing -- 
analysis am\[ generation as well -- is done in two essential steps: 
First, along v. path beginning at a start state, for all applicable rules 
the attached feature unifications are performed until a final state is 
reached. The search strategy is depth-first, i.e. at each state the 
fir~t applicable rewriting rule in the list of transitions is selected. 
In a second phase, such a successful path is traced back to its ori- 
gin with simultaneous execution of the corresponding rewriting rules. 
For rewriting, a device called exclusion list is employed, which al- 
lows to coml)ine several distinct rules into one unit (which has been 
omitted ia example 1 for tim sake of simplicity). This adds a further 
restriction to \[ransitions: A transition is blocked if the pattern of the 
corresponding rule matches, but is contained in the exclusion list. 
6 German Verb Morphology 
In German, inflected verb forms exist for the four tense/mode com- 
binations: present tense/indicative, present tense/conjunctive, past 
tense/ indicative and past tense/ conjunctive. Furthermore there 
are two participles (present and perfect) and two imperative forms 
derived from the infinitive verb stem. This adds up to 29 possibly 
different forms per verb. 
With respect to inflection, German verbs can be divided into three 
classes: regular "weak" verbs ("schwache Verben"), "strong" verbs 
("starke Verben") and irregular verbs. 
Inflection of weak verbs is done by simply adding a suffix to the 
stem. In the special case of the past participle the prefix "ge-" is 
added too. This class can easily be handled by existing algorithms, 
like the one of Kay described above. 
Inflection of strong verbs is also done by adding to the stem suf- 
fixes, which slightly differ from the ones used with weak verbs. In 
addition to the change of the ending, the stein itself may vary, too. 
In most cases it is not the whole stem that changes, but only one 
special vowel in the stem, tbe stem vowel. 
This change introduces the problems that make an extension of 
the existing algorithm necessary. 
In most cases irregular verbs can be treated like regular strong 
verbs with the exception of some special forms. 
Example 3: "sein" (engl.: to be) 
To conjugate the verb in past tense ("ich war", "du wurst", 
...), conjugate "war-" as a regular strong verb. 
The following fourteen graphemes can be stem vowels: "a", "e", 
"i', "o', "u', "~\[", "5", "ii", "el", "ai", "au", "~u", "eu" and "ie". 
When conjugating a verb, the stem vowel may change up to six 
times, as the following example demonstrates: 
Form Intl. Type Grammatical Description 
ich helle (1) present ist person 
du hilfst (2) present 2nd person 
er half (3) past tense 
er hffife (rarely used, (4) past tense conj. 
but correct) 
er h£1fe (4) past tense conj. 
er hat geholfen (5) past participle 
This gives rise to tile combinatorial explosion of 14 6 possible se- 
ries of stem vowels for each verb conjugation ("paradigm"). Only a 
small number of those are actually used in the language, but even 
this number is too big to be handled easily by one of the described 
algorithms. 
7 Hard Problems in German Verb Inflection 
The following problems are hard to be solved by any one of the ex- 
isting algorithms: 
• How can the stem vowel be located? This may be difficult, 
especially when compound verbs are to be analyzed, like "be- 
herzigen". 
• Given an inflected verb form, how can we find the infinitive stein 
from which this form is derived? Example: "wSge": "wages'? 
or "wiegen'? or "w£gen"? 
• tIow can the lexicon be kept small; i.e. can we get around adding 
all the possible changes of the stem to the lexicon? 
The general idea behind our solution is to build a "shell" around 
Kay's generic two-state-morphology scheme which takes care of the 
special stem vowel problems in German verbs. The core of this 
scheme, which is the rewriting-rule algorithm, remains unchanged 
and adds all appropriate affixes to the stem. This leads to an al- 
gorithm that can generate all forms of any German verb, even of a 
213 
prefixed verb, and analyze these forms as well. One important part of 
the extended algorithm is a matrix called the stem-vowel table which 
contains all the information about the vowel series occurring in the 
conjugations of one verb. After some compression and combination 
of related series the size of the table is 40*5 lists of characters. This 
matrix is organized in tile following manner: 
There are five columns corresponding to the five cases of stem 
vowel change ill example 3. Each entry in a column is a list of char 
utters; mostly this list has length one. (The fourth element of the 
list corresponding to the verb "helfen" would have the two elements 
"ii" and "~"). 
The rows list all the possible combinations of vowel change that 
occur in the present use of the language. 
The ,;hell consists of five basic parts (placed in order of tile way 
they are called when the algorithm 9ene~*ttes forms): 
1. A routine for locating the stem vowel and replacing it by a 
generic symbol; it is realized by a simple function. 
2. An algorithm that separates prefixes from the stem when a 
compound verb is to be analyzed. It also strips off the infinitive 
ending. This is done by a simple lookup in the prefix table. 
3. A lexicon module which also adds some default intormation to 
the grammatical information obtained from the lexicon entry. 
Irregular and strong vm'bs get a group number added to the 
feature list. The prefix, if one is found~ is compared with the 
list of permissible prefixes in ttle lexicon. 
4. The core of the algorithm uses an automaton and rewriting rules 
to modify the affixes of the verb. In the course of unification 
new attributes are added to the feature list. In particular, if the 
verb is strong or irregular, information about the stein vowel is 
added to the list. The new information contains an offset into 
the stem vowel table. 
5. The generic symbol is replaced by the stem vowel indicated by 
the feature list using a single rewriting rule. The new vowel 
is looked up in the table which is indexed by two values in 
the feature list, namely the group number of the verb (whirh 
is either defaulted or part of the lexical information), amt a 
column number, which is added by the automaton. 
8 Further Enhancements to Keep the Anal- 
ysis of Verbs Fast 
The main problem with the analysis of German verb forms is to find 
the infinitive stem belonging to the stein. As soon as this stem is 
found, the search tree can be pruned considerably. This is because 
the lexicon information of the infinitive form may restrict the pos- 
sible unifications when stepping from one state of the automaton to 
another one. 
This problem h~ been solved in tile following way. Given an 
inflected form with a possible changed stem vowel, we can at least find 
the position of the actual stem vowel. We can also strip off the ending 
and the prefix, if one exists (e.g. "erwSge" \[infinitive :'erw£gen"\] --~ 
"wXg-"). This leads to a rather peculiar structure for the lexicon. 
Tile lexicon mainly contains verb infinitives in an encoded form. The 
stem vowel of the infinitive is replaced by a place holder, the stem 
vowel is added to the end of the form, separated frmn the stem by 
a hyphen: Stein vowels consisting of more than one character are 
encoded as a single symbol. 
Example 4: 
wiegen -+ wXg-I 
w£gen -} wXg-~ 
wagen --~ wXg-a 
Putting these forms into a lexicon tree we find that the three verbs 
differ only in the last position. 
(* (\u (\X (\g (\- (\I (\+ ((group 2)))) ;"i~" 'too is 
;onco4~d as "i" 
(\a (\+ ((group S)))) 
(\a (\+ ((group 4))))))))) 214 
The analysis is simplified. Immedia.l, ely after preprocessing the 
the form we can reduce the possible candidates ibr tile related inliui-- 
tire to the subtree below the hyphen. This special encoding has the 
side effect that the nmnber of nodes of the lexicon tree is reduced 
when many similar forms are added to the lexicon. 
9 Constraints On ~.\['he :~3e:dco~ 
Three other classes of verbs have to be considered, if we want to find 
the stem of any German verb easily: 
l. Verbs which change the stein at places other than the stein 
vowel. 
2. Verbs with an infinitive ending on "-era" or '¢-eln'~ These verbs 
omit in some cases the %" which belongs to the stem (!). 
3. Verbs with the ending "-ssen" or "-lieu'. For these verbs i.he 
"ss" and "fl" have to be exchanged in some forms. 
~br (1) all the changed stems are added to the lexicon together 
with the grammatical information, that restricts their use to the per- 
missible forms, whicll results in about 75 new entries for the lexicon 
The verbs in (2) and (3) are em:oded in a special way. The en- 
coding has no side effects on the rest of the Ngorithm. it only add8 
some transitions to the automaton (el./Paulus 1986/\[9\]). 
10 Furthew N:~<tc:*;mions 
Tile algorithm as implemented can handle all rases of prefixed verbs, 
even the cases where the prefix is separated from the verb for some 
forms (e.g. %r kani an"). 
The prefixes are added to the lexical information of the infinitive 
torm. Thus an extra prefix requires only little extra, ,storage lbr the 
lexicon. The analysis-mode checks whether the prefix is allowable or 
not. 
Finally the algorithm also takes care of tile tra.usitive and intran- 
sitive use of a verb, if this alfects ~he way the verb is inflected (e.g. 
"er schrak', "er erschreckte reich"). 
11 Practical Experience 
The complete system for analysis and generation including all of the 
mentioned extensions has been implmnented in TLC-LISP on a PC. 
The lexicon contains all irregular and strong verbs with their prefixes, 
and many other verbs, without running into memory limitations. 
In a first try the German lexicon was built in a straightforward 
way (as shown in example 2) and all the inflection was done us- 
ing, rewriting-rules only. Comparison with the extended algorithm 
strawed a runtime improvement of more than 75 percent. In absolute 
figures the performance of analysis is less than 1 second per verb tbrm; 
the present version of the program consists of non-optimized compiled 
LISP code. French and Spanish verbs can he haudled directly by the 
kernel algorithm without the described extensions. 
12 Conclusion 
A general, language independent FST-based model for morphological 
analysis and generation has been implemented and applied to the fill 
range of German verb morphology. In the course of our investigatiom% 
we found that the treatment of particular language dependent inflec~ 
tional phenomena which cannot be handled by the general model~ 
can be easily embedded in a way which does not require to modify 
the basic model: but can instead be wrapped around it. Iience~ prob-- 
lems which might come up in localizing the stem vowel by means of 
rewriting rules alone do not occur, From a general point of vi,~w, 
the main innovations in our sy~tem are a new mettmd for wo~-d stem 
recognition and a gener~lized fi:;~mework for lexicM represeutx, tion. 

References

\[1\] Gazdar~ G.: l;~nite State Morphology. A Review of Koskenniemi 
(1983). Center for the Study of Language and Information, Stan- 
ford University, Report No. CSI,L85-32, Stanford, Cal., 1985. 

\[2\] Johnson, C.D.: On the Formal Properties of Phonological 
Rules° PhD Dissertation, University of California, Santa Bar- 
bara. POLA Report 11, University of California, Berkeley, 1970. 
(Published as Formal Aspects: A Phonological Description. The 
\]tague: Mouton~ 19'/2) 

\[3\] Karttunen, L., Koskenniemi, K., Kaplan, R.M.: A Compiler 
for Two-level Phonological Rules. Technical Report, Xerox Palo 
Alto Researdl Center and Center for the Study of Language and 
Information, Stanford University, Stanford, Cal., 1987 

\[4\] Kay, M.: When Meta-Rules are not Meta-Rules. In: Sparck 
Jones, K., Wilks, Y.: Automatic Natural Language Parsing. 
Chichester: Ellis tlorwood, 1983, 94-116 

\[5\] Koskenniemi, K.: Two-level Morphology: A General Computa- 
tional Model lbr Word-form Recognition and Production. Uni- 
versity of tlelsinki, Department of General Linguistics, Publica- 
tion 11, 1983 

\[6\] Koskenniemi, K.: Two-level Model for Morphological Analysis. 
In: Proc. IJCAI-83, 1983, 683-685 

\[7\] KoskemLiemi, K.: A General Computational Model for Word: 
form R~cognition and Production. In: Proc. COLING-84, 1984, 
178--181 

\[8\] Koskenniemi K.: Compilation of Automata Erom Two-level 
Rules. Paper presented to the CSLI Workshop on Finite State 
MorphoLogy, Stanford, July 29-30, 1985 

\[9\] Paulus, 1).: FAn Programmpaket zur Morphologischen Analyse. 
Univers~t/~t lCrlangen-Niirnberg, RRZE, Diplomarbeit, RRZE- 
IAB-259, 1986. 

\[10\] Panlus, D.: Endliche Automaten zur Verbttexion und ein 
speziellvs deutsches Verblexikon. In: Morik, K. (Ed.): GWAI- 
87 -- German Workshop on Artificial Intelligence. Proceedings 
Berlin: Springer (IFB 152), 1987, 340-344
