Rapid Development of Morphological Descriptions for 
Full Language Processing Systems 
David Carter 
SRI International Cambridge Computer Science Research Centre 
23 Millers Yard, Mill Lane 
Cambridge CB2 1RQ, U:K. 
dmc~cam, sri. com 
Abstract 
I describe a compiler and development 
environment for feature-augmented two- 
level morphology rules integrated into 
a full NLP system. The compiler is 
optimized for a class of languages in- 
cluding many or most European ones, 
and for rapid development and debug- 
ging of descriptions of new languages. 
The key design decision is to compose 
morphophonological and morphosyntac- 
tic information, but not the lexicon, 
when compiling the description. This 
results in typical compilation times of 
about a minute, and has allowed a rea- 
sonably full, feature-based description of 
French inflectional morphology to be de- 
veloped in about a month by a linguist 
new to the system. 
1 Introduction 
The paradigm of two-level morphology (Kosken- 
niemi, 1983) has become popular for handling 
word formation phenomena in a variety of lan- 
guages. The original formulation has been ex- 
tended to allow morphotactic constraints to be ex- 
pressed by feature specification (Trost, 1990; A1- 
shawi et al, 1991) rather than Koskenniemi's less 
perspicuous device of continuation classes. Meth- 
ods for the automatic compilation of rules from a 
notation convenient for the rule-writer into finite- 
state automata have also been developed, allowing 
the efficient analysis and synthesis of word forms. 
The automata may be derived from the rules alone 
(Trost, 1990), or involve composition with the lex- 
icon (Karttunen, Kaplan and Zaenen, 1992). 
However, there is often a trade-off between run- 
time efficiency and factors important for rapid and 
accurate system development, such as perspicuity 
of notation, ease of debugging, speed of compi- 
lation and the size of its output, and the inde- 
pendence of the morphological and lexical compo- 
nents. In compilation, one may compose any or 
all of 
(a) the two-level rule set, 
(b) the set of affixes and their allowed combina- 
tions, and 
(c) the lexicon; 
see Kaplan and Kay (1994 / for an exposition of 
the mathematical basis. The type of compilation 
appropriate for rapid development and acceptable 
run-time performance depends on, at least, the 
nature of the language being described and the 
number of base forms in the lexicon; that is, on the 
position in the three-dimensional space defined by 
(a), (b) and (c). 
For example, English inflectional morphology is 
relatively simple; dimensions (a) and (b) are fairly 
small, so if (c), the lexicon, is known in advance 
and is of manageable size, then the entire task of 
morphological anMysis can be carried out at com- 
pile time, producing a list of analysed word forms 
which need only be looked up at run time, or a 
network which can be traversed very simply. Al- 
ternatively, there may be no need to provide as 
powerful a mechanism as two-level morphology at 
all; a simpler device such as affix stripping (A1- 
shawi, 1992, pll9ff) or merely listing all inflected 
forms explicitly may be preferable. 
For agglutinative languages such as Korean, 
Finnish and Turkish (Kwon and Karttunen, 1994; 
Koskenniemi, 1983; Oflazer, 1993), dimension (b) 
is very large, so creating an exhaustive word list 
is out of the question unless the lexicon is trivial. 
Compilation to a network may still make sense, 
however, and because these languages tend to ex- 
hibit few non-eoncatenative morphophonological 
phenomena other than vowel harmony, the con- 
tinuation class mechanism may suffice to describe 
the allowed affix sequences at the surface level. 
Many European languages are of the inflect- 
ing type, and occupy still another region of the 
space of difficulty. They are too complex mor- 
phologically to yield easily to the simpler tech- 
niques that can work for English. The phonologi- 
cal or orthographic changes involved in affixation 
may be quite complex, so dimension (a) can be 
laige, and a feature mechanism may be needed to 
handle such varied but interrelated morphosyn- 
202 
tactic phenomena such as umlaut (Trost, 1991), 
case, number, gender, and different morphologi- 
cal paradigms. On the other hand, while there 
may be many different affixes, their possibilities 
for combination within a word are fairly limited, 
so dimension (b) is quite manageable. 
This paper describes a representation and as- 
sociated compiler intended for two-level morpho- 
logical descriptions of the written forms of inflect- 
ing languages. The system described is a com- 
ponent of the Core Language Engine (CLE; AI- 
shawi, 1992), a general-purpose language analyser 
and generator implemented in Prolog which sup- 
ports both a built-in lexicon and access to large 
external lexical databases. In this context, highly 
efficient word analysis and generation at run-time 
are less important than ensuring that the mor- 
phology mechanism is expressive, is easy to debug, 
and allows relatively quick compilation. Morphol- 
ogy also needs to be well integrated with other 
processing levels. In particular, it should be pos- 
sible to specify relations among morphosyntactic 
and morphophonological rules and lexical entries; 
for the convenience of developers, this is done by 
means of feature equations. Further, it cannot be 
assumed that the lexicon has been fully specified 
when the morphology rules are compiled. Devel- 
opers may wish to add and test further lexical 
entries without frequently recompiling the rules, 
and it may also be necessary to deal with un- 
known words at run time, for example by query- 
ing a large external lexical database or attempt- 
ing spelling correction (Alshawi, 1992, pp124-7). 
Also, both analysis and generation of word forms 
are required. Run-time speed need only be enough 
to make the time spent on morphology small com- 
pared to sententia\] and contextual processing. 
These parameters - languages with a complex 
morphology/syntax interface but a limited num- 
ber of affix combinations, tasks where the lexicon 
is not necessarily known at compile time, bidirec- 
tional processing, and the need to ease develop- 
ment rather than optimize run-time efficiency - 
dictate the design of the morphology compiler de- 
scribed in this paper, in which spelling rules and 
possible affix combinations (items (a) and (b)), 
but not the lexicon (item (c)), are composed in 
the compilation phase. Descriptions of French, 
Polish and English inflectional morphology have 
been developed for it, and I show how various as- 
peers of the mechanism allow phenomena in these 
languages to be handled. 
2 The Description Language 
2.1 Morphophonology 
The formalism for spelling rules (dimension (a)) is 
a syntactic variant of that of Ruessink (1989) and 
Pulman (1991). A rule is of the form 
spell(Name, Surface Op Lexical, 
Classes, Features). 
Rules may be optional (Op is "~") or obliga- 
tory (Op is "¢~"). Surface and Lexical are both 
strings of the form 
" LContext I Target I RContext" 
meaning that the surface and lexical targets may 
correspond if the left and right contexts and the 
Features specification are satisfied. The vertical 
bars simply separate the parts of the string and 
do not themselves match letters. The correspon- 
dence between surface and lexical strings for an 
entire word is licensed if there is a partitioning of 
both so that each partition (pair of corresponding 
surface and lexica\] targets) is licensed by a rule, 
and no partition breaks an obligatory rule. A par- 
tition breaks an obligatory rule if the surface tar- 
get does not match but everything else, including 
the feature specification, does. 
The Features in a rule is a list of Feature = 
Value equations. The allowed (finite) set of values 
of each feature must be prespecified. Value may 
be atomic or it may he a boolean expression. 
Members of the surface and lexieal strings may 
be characters or classes of single characters. The 
latter are represented by a single digit N in the 
string and an item N/ClassName in the Classes 
list; multiple occurrences of the same N in a single 
rule must all match the same character in a given 
application. 
Figure I shows three of the French spelling rules 
developed for this system. The change_e_~l rule 
(simplified slightly here) makes it obligatory for a 
lexical e to be realised as a surface ~ when followed 
by t, r, or l, then a morpheme boundary, then 
e, as long as the feature cdouble has an appro- 
priate value. The default rule that copies char- 
acters between surface and lexical levels and the 
boundary rule that deletes boundary markers are 
both optional. Together these rules permit the fol- 
lowing realization of cher ("expensive") followed 
by e (feminine gender suffix) as chore, as shown 
in Figure 2. Because of the obligatory nature of 
change_e_~l, and the fact that the orthographic 
feature restriction on the root cher, \[cdouble=n\], 
is consistent with the one on that rule, an alter- 
native realisation chere, involving the use of the 
default rule in third position, is ruled out. 1 
Unlike many other flavours of two-level mor- 
phology, the Target parts of a rule need not con- 
sist of a single character (or class occurrence); 
they can contain more than one, and the surface 
target may be empty. This obviates the need 
for "null" characters at the surface. However, 
although surface targets of any length can use- 
fully be specified, it is in practicea good strategy 
1The cdouble feature is in fact used to specify the 
spelling changes when e is added to various stems: 
cher+e=chdre, achet+e=ach~te, but jet+e=jette. 
203 
spell(change_e_~l, " I ~1" ~:~ " I e I l+e", \[l/trl\], \[,cdouble=n\]). 
spell(default, "Ill" =~ "Ill", \[,1/letter\], \['3). 
spell(boundary, "\[ \[" ~ "Ill", \[,I/bmarker\] , \['1). 
Figure 1: Three spelling rules 
Surface: c h ~ r e 
Lexical: c h e r + e + 
Rule: def. def. c.e_~l def. bdy. def. bdy. 
Figure 2: Partitioning of ehtre as chef+e+ 
always to make lexical targets exactly one char- 
acter long, because, by definition, an obligatory 
rule cannot block the application of another rule 
if their lexicM targets axe of different lengths. The 
example in Section 4.1 below clarifies this point. 
2.2 Word Formation and Interfacing to 
Syntax 
The allowed sequences of morphemes, and the 
syntactic and semantic properties of morphemes 
and of the words derived by combining them, are 
specified by morphosyntactic production rules (di- 
mension (b)) and lexical entries both for affixes 
(dimension (b)) and for roots (dimension (c)), es- 
sentially as described by Alshawi (1992) (where 
the production rules are referred to as "morphol- 
ogy rules"). Affixes may appear explicitly in pro- 
duction rules or, like roots, they may be assigned 
complex feature-valued categories. Information, 
including the creation of logical forms, is passed 
between constituents in a rule by the sharing of 
variables. These feature-augmented production 
rules are just the same device as those used in the 
CLE's syntactico-semantic descriptions, and are a 
much more natural way to express morphotactic 
information than finite-state devices such as con- 
tinuation classes (see Trost and Matiasek, 1994, 
for a related approach). 
The syntactic and semantic production rules for 
deriving the feminine singular of a French adjec- 
tive by suffixation with "e" are given, with some 
details omitted, in Figure 3. In this case, nearly 
MI features are shared between the inflected word 
and the root, as is the logical form for the word 
(shown as Adj in the doriv rule). The only differ- 
ing feature is that for gender, shown as the third 
argument of the ©agr macro, which itself expands 
to a category. 
Irregular forms, either complete words or affix- 
able stems, are specified by listing the morpho- 
logical rules and terminal morphemes from which 
the appropriate analyses may be constructed, for 
example: 
irreg(dit, \[-dire, ' PRESENT_3s ' \], \[v_v_affix-only\] 
). 
Here, PRESENT_3s is a pseudo-affix which has the 
same syntactic and semantic information attached 
to it as (one sense of) the affix "t", which is 
used to form some regular third person singulars. 
However, the spelling rules make no reference to 
PRESENT_3s; it is simply a device allowing cate- 
gories and logical forms for irregulax words to be 
built up using the same production rules as for 
regular words. 
3 Compilation 
All rules and lexieal entries in the CLE are com- 
piled to a form that allows normal Prolog unifi- 
cation to be used for category matching at run 
time. The same compiled forms are used for anal- 
ysis and generation, but are indexed differently. 
Each feature for a major category is assigned a 
unique position in the compiled Prolog term, and 
features for which finite value sets have been spec- 
ified are compiled into vectors in a form that al- 
lows boolean expressions, involving negation as 
well as conjunction and disjunction, to be con- 
joined by unification (see Mellish, 1988; Alshawi, 
1992, pp46-48). 
The compilation of morphological information 
is motivated by the nature of the task and of the 
languages to be handled. As discussed in Sec- 
tion 1, we expect the number of affix combina- 
tions to be limited, but the lexicon is not neces- 
sarily known in advance. Morphophonological in- 
teractions may be quite complex, and the purpose 
of morphological processing is to derive syntactic 
and semantic analyses from words and vice versa 
for the purpose of full NLP. Reasonably quick 
compilation is required, and run-time speed need 
only be moderate. 
3.1 Compiling Spelling Patterns 
Compilation of individual spell rules is straight- 
forward; feature specifications are compiled to 
positional/boolean format, characters and occur- 
rences of character classes are also converted to 
boolean vectors, and left contexts are reversed (cf 
Abrahamson, 1992) for efficiency. However, al- 
though it would be possible to analyse words di- 
rectly with individually compiled rules (see Sec- 
tion 5 below), it can take an unacceptably long 
time to do so, largely because of the wide range of 
204 
morph(adjp_adjp_fem, 
\[adjp:\[agr= @agr(3,sing,f) \] Shared\], 
adjp:\[agr= ~agr(3,sing,m) I Shared\], 
el) 
:- Shared=\[aform=Aform, ..., wh=n\]. 
Z rule (syntax) 
Z mother category 
Z first daughter (category) 
Z second daughter (literal) 
shared syntactic features 
deriv(adjp_adjp_fem, only 
\[(Adj,adjp:Shared), 
(Adj,adjp:Shared), Z 
(_,e)\]) 
• - Shared=\[anaIn=Ai, ..., subjval=Subj\] 
rule (semantics) 
mother logical form and cat. 
first daughter 
second daughter 
• ~ shared semantic features 
Figure 3: Syntactic and semantic morphological production rules 
choices of rule available at each point and the need 
to check at each stage that obligatory rules have 
not been broken. We therefore take the following 
approach. 
First, all legal sequences of morphemes are pro- 
duced by top-down nondeterministic application 
of the production rules (Section 2.2), selecting af- 
fixes but keeping the root morpheme unspecified 
because, as explained above, the lexicon is unde- 
termined at this stage. For example, for English, 
the sequences *+ed+ly and un+*+ing are among 
those produced, the asterisk representing the un- 
specified root. 
Then, each sequence, together with any associ- 
ated restrictions on orthographic features, under- 
goes analysis by the compiled spelling rules (Sec- 
tion 2.1), with the surface sequence and the root 
part of the lexical sequence initially uninstanti- 
ated. Rules are applied recursively and nondeter- 
ministically, somewhat in the style of Abramson 
(1992), taking advantage of Prolog's unification 
mechanism to instantiate the part of the surface 
string corresponding to affixes and to place some 
spelling constraints on the start and/or end of the 
surface and/or lexical forms of the root. 
This process results in a set of spelling palterns, 
one for each distinct application of the spelling 
rules to each affix sequence suggested by the pro- 
duction rules. A spelling pattern consists of par- 
tially specified surface and lexical root character 
sequences~ fully specified surface and lexical affix 
sequences, orthographic feature constraints asso- 
ciated with the spelling rules and affixes used, and 
a pair of syntactic category specifications derived 
from the production rules used. One category is 
for the root form, and one for the inflected form. 
Spelling patterns are indexed according to the 
surface (for analysis) and lexical (for generation) 
affix characters they involve. At run time, an in- 
flected word is analysed nondeterministically in 
several stages, each of which may succeed any 
number of times including zero. 
• stripping off possible (surface) affix charac- 
ters in the word and locating a spelling pat- 
tern that they index; 
• matching the remaining characters in the 
word against the surface part of the spelling 
pattern, thereby, through shared variables, 
instantiating the characters for the lexical 
part to provide a possible root spelling; 
• checking any orthographic feature constraints 
on that root; 
• finding a lexical entry for the root, by any of a 
range of mechanisms including lookup in the 
system's own lexicon, querying an external 
lexical database, or attempting to guess an 
entry for an undefined word; and 
• unifying the root lexical entry with the root 
category in the spelling pattern, thereby, 
through variable sharing with the other cate- 
gory in the pattern, creating a fully specified 
category for the inflected form that can be 
used in parsing. 
In generation, the process works in reverse, start- 
ing from indexes on the lexical affix characters. 
3.2 Representing Lexical Roots 
Complications arise in spelling rule application 
from the fact that, at compile time, neither the 
lexical nor the surface form of the root, nor even 
its length, is known. It would be possible to hy- 
pothesize all sensible lengths and compile separate 
spelling patterns for each. However, this would 
lead to many times more patterns being produced 
than are really necessary. 
Lexical (and, after instantiation, surface) 
strings for the unspecified roots are therefore rep- 
resented in a more complex but less redundant 
way: as a structure 
L1 ... Lm v(L, R) R1 ... R,. 
Here the Li's are variables later instantiated to 
single characters at the beginning of the root, and 
L is a variable, which is later instantiated to a 
list of characters, for its continuation. Similarly, 
the /~'s represent the end of the root, and R 
is the continuation (this time reversed) leftwards 
into the root from the R1. The v(L, R) structure 
is always matched specially with a Kleene-star of 
205 
the default spelling rule. For full generality and 
minimal redundancy, Lm and R1 are constrained 
not to match the default rule, but the other Li's 
and Ri's may. The values of n required are those 
for which, for some spelling rule, there are k char- 
acters in the target lexical string and n - k from 
the beginning of the right context up to (but not 
including) a boundary symbol. The lexical string 
of that rule may then match R1,...,Rk, and its 
right context match Rk+l,..., Rn,+,.... The re- 
quired values of m may be calculated similarly 
with reference to the left contexts of rules. 2 
During rule compilation, the spelling pattern 
that leads to the run-time analysis of chore given 
above is derived from m = 0 and n = 2 and the 
specified rule sequence, with the variables R1 R2 
matching as in Figure 4. 
3.3 Applying Obligatory Rules 
In the absence of a lexical string for the root, the 
correct treatment of obligatory rules is another 
problem for compilation. If an obligatory rule 
specifies that lexical X must be realised as surface 
Y when certain contextual and feature conditions 
hold, then a partitioning where X is realised as 
something other than Y is only" allowed if one or 
more of those conditions is unsatisfied. Because of 
the use of boolean vectors for both features and 
characters, it is quite possible to constrain each 
partitioning by unifying it with the complement 
of one of the conditions of each applicable obliga- 
tory rule, thereby preventing that rule from apply- 
ing. For English, with its relatively simple inflec- 
tional spelling changes, this works well. However, 
for other languages, including French, it leads to 
excessive numbers of spelling patterns, because 
there are many obligatory rules with non-trivial 
contexts and feature specifications. 
For this reason, complement unification is not 
actually carried out at compile time. Instead, the 
spelling patterns are augmented with the fact that 
certain conditions on certain obligatory rules need 
to be checked on certain parts of the partitioning 
when it is fully instantiated. This slows down run- 
time performance a little but, as we will see below, 
the speed is still quite acceptable. 
3.4 Timings 
The compilation process for the entire rule set 
takes just over a minute for a fairly thorough de- 
2Alternations in the middle of a root, such as um- 
laut, can be handled straightforwardly by altering the 
root/affix pattern from L1... Lm v(L,R) R1...R, to 
L1...Lm v(L,R) M v(L',R') R1...Rn, with M for- 
bidden to be the default rule. This has not been 
necessary for the descriptions developed so far, but its 
implementation is not expected to lead to any great 
decrease in run-time performance, because the non- 
determinism it induces in the lookup process is no 
different in kind from that arising from alternations 
at root-affix boundaries. 
scription of French inflectional morphology, run- 
ning on a Sparcstation 10/41 (SPECint92=52.6). 
Run-time speeds are quite adequate for full NLP, 
and reflect the fact that the system is imple- 
mented in Prolog rather than (say) C and that full 
syntactico-semantic analyses of sentences, rather 
than just morpheme sequences or acceptability 
judgments, are produced. 
Analysis of French words using this rule set and 
only an in-core lexicon averages around 50 words 
per second, with a mean of 11 spelling analyses 
per word leading to a mean of 1.6 morphological 
analyses (the reduction being because many of the 
roots suggested by spelling analysis do not exist 
or cannot combine with the affixes produced). If 
results are cached, subsequent attempts to anal- 
yse the same word are around 40 times faster still. 
Generation is also quite acceptably fast, running 
at around 100 Words per second; it is slightly faster 
than analysis because only one spelling, rather 
than all possible analyses, is sought from each 
call. Because of the separation between lexical 
and morphological representations, these timings 
are essentially unaffected by in-core lexicon size, 
as full advantage is taken of Prolog's built-in in- 
dexing. 
Development times are at least as important 
as computation times. A rule set embodying a 
quite comprehensive treatment of French inflec- 
tional morphology was developed in about one 
person month. The English spelling rule set was 
adapted from Ritchie e~ al (1992) in only a day or 
two. A Polish rule set is also under development, 
and Swedish is planned for the near future. 
4 Some Examples 
To clarify further the use of the formalism and 
the operation of the mechanisms, we now examine 
several further examples. 
4.1 Multiple-letter spelling changes 
Some obligatory spelling changes in French involve 
more than one letter. For example, masculine ad- 
jectives and nouns ending in eau have feminine 
counterparts ending in elle: beau ("nice") becomes 
belle, chameau ("camel") becomes chamelle. The 
final e is a feminizing affix and can be seen as 
inducing the obligatory spelling change au ~ II. 
However, although the obvious spelling rule, 
spell(change_au_ll, "Ill\[" +-+ "laui+e"), 
allows this change, it does not rule out the incor- 
rect realization of beau+e as e'beaue, shown in Fig- 
ure 5, because it only affects partitionings where 
the au at the lexical level forms a single partition, 
rather than one for a and one for u. Instead, the 
following pair of rules, in which the lexical targets 
have only one character each, achieve the desired 
effect: 
206 
Compile 
time: 
Run 
time: 
Variable: v( L, t=0 R1 R2 ... 
Surface: c h ~ r e 
Figure 4: Spelling pattern application to the analysis of ch@re 
Surface: b e a u e 
Lexical: b e a u + e + 
Rule: def. def. def. def. bdy. def. bdy. 
Figure 5: Incorrect partitioning for beau+e+ 
spell(change_au_lll, " Ill" ~ "lalu+e") 
spell(change_au_ll2, "Ill" ~-+ "alul+e") 
Here, change_au_lll rules out a:a partition in 
Figure 5, and change_au_ll2 rules out the u:u 
one. 
It is not necessary for the surface target to con- 
tain exactly one character for the blocking effect 
to apply, because the semantics of obligatoriness 
is that the lezicaltarget and all contexts, taken to- 
gether, make the specified surface target (of what- 
ever length) obligatory for that partition. The re- 
verse constraint, on the lexical target, does not 
apply. 
4.2 Using features to control rule 
application 
Features can be used to control the application of 
rules to particular lexical items where the appli- 
cability cannot be deduced from spellings alone. 
For example, Polish nouns with stems whose fi- 
nal syllable has vowel 6 normally have inflected 
forms in which the accent is dropped. Thus in the 
nominative plural, kr6j ("style") becomes kroje, 
b6r ("forest") becomes bory, b6j ("combat") be- 
comes boje. However, there are exceptions, such as 
zb6j ("bandit") becoming zbgje. Similarly, some 
French verbs whose infinitives end in -eler take 
a grave accent on the first e in the third per- 
son singular future (modeler, "model", becomes 
mod~lera), while others double the I instead (e.g. 
appeler, "call", becomes appellera). 
These phenomena can be handled by providing 
an obligatory rule for the case whether the letter 
changes, but constraining the applicability of the 
rule with a feature and making the feature clash 
with that for roots where the change does not oc- 
cur. In the Polish case: 
spell(change_6_o, "\[o\[" +-+ "\[611+2", 
\[i/c, 21v\], \[clmgo:y\]). 
orth(zb6j, \[chngo=n\] ). 
Then the partitionings given in Figure 6 will be 
the only possible ones. For b6j, the change_6_o 
rule must apply, because the chngo feature for b6j 
is unspecified and therefore can take any value; for 
zb@ however, the rule is prevented from applying 
by the feature clash, and so the default rule is the 
only one that can apply. 
5 Debugging the Rules 
The debugging tools help in checking the opera- 
tion of the spelling rules, either (1) in conjunction 
with other constraints or (2) on their own. 
For case (1), the user may ask to see all inflec- 
tions of a root licensed by the spelling rules, pro- 
duction rules, and lexicon; for chef, the output 
is 
\[cher,e\] : adjp -> chore 
\[cher,e,s\]: adjp -> chores 
\[cher,s\] : adjp -> chers 
meaning that when cher is an adjp (adjective) it 
may combine with the suffixes listed to produce 
the inflected forms shown. This is useful in check- 
ing over- and undergeneration. It is also possible 
to view the spelling patterns and production rule 
tree used to produce a form; for chore, the trace 
(slightly simplified here) is as in figure 7. The 
spelling pattern 194 referred to here is the one 
depicted in a different form in Figure 4. The no- 
tation {clmnprstv=A} denotes a set of possible 
consonants represented by the variable A, which 
also occurs on the right hand side of the rule, in- 
dicating that the same selection must be made for 
both occurrences. Production rule tree 17 is that 
for a single application of the rule adjp_adjp_fem, 
which describes the feminine form of the an ad- 
jective, where the root is taken to be the mas- 
culine form. The Root and Infl lines show the 
features that differ between the root and inflected 
forms, while the Both line shows those that they 
share. Tree 18, which is also pointed to by the 
spelling pattern, describes the feminine forms of 
nouns analogously. 
For case (2), the spelling rules may be applied 
directly, just as in rule compilation, to a speci- 
fied surface or lexical character sequence, as if no 
207 
Surface: b o j e 
Lexical: b 6 j + e + 
Rule: def. c_6_o, def. bdy. def. bdy. 
Surface: z b 6 j e 
Lexicah z b 6 j + e + 
Rule: def. def. def. def. bdy. def. bdy. 
Figure 6: Feature-dependent dropping of accent 
"chbre" has root "chef" with pattern 194 and tree 17. 
Pattern 194: 
"___~{clmnprstv=A}e" <-> "___e{clmnprstv=A}+e+" 
=> tree 17 and 18 if \[doublec=n\] 
Uses: default* change_e_~l default boundary default boundary 
Tree 17: 
Both = adjp:\[dmodified=n,headfinal=y,mhdfl=y,synmorpha=l,wh=n\] 
Root = adjp:\[agr=agr:\[gender=m\]\] 
Infl = adjp:\[agr=agr:\[gender=f\]\] 
Tree = adjp_adjp_fem=>\[*,e\] 
Figure 7: Debugger trace of derivation of chore 
lexical or morphotactic constraints existed. Fea- 
ture constraints, and cases where the rules will not 
apply if those constraints are broken, are shown. 
For the lexical sequence cher+e+, for example, the 
output is as follows. 
Surface: "chbre" <-> 
Lexical: "chef". Suffix: "e" 
c :: c <- default 
h :: h <- default 
b :: e <- change_e_bl 
r :: r <- default 
:: + <- boundary 
Category: orth: \[cdouble=n\] 
e :: e <- default 
:: + <~ boundary 
Surface : 
Lexical : 
c :: c <- default 
h :: h <- default 
e :: e <- default 
" change_e_ b l !' ) 
r :: r <- default 
:: + <- boundary 
e :: e <- default 
:: + <- boundary 
"chere" <-> 
"cher". Suffix: "e" 
(breaks 
This indicates to the user that if chef is given 
a lexical entry consistent with the constraint 
cdouble=n, then only the first analysis will be 
valid; otherwise, only the second will be. 
6 Conclusions and Further Work 
The rule formalism and compiler described here 
work well for European languages with reasonably 
complex orthographic changes but a limited range 
of possible affix combinations. Development, com- 
pilation and run-time efficiency are quite accept- 
able, and the use of rules containing complex 
feature-augmented categories allows morphotactic 
behaviours and non-segmentM spelling constraints 
to be specified in a way that is perspicuous to lin- 
guists, leading to rapid development of descrip- 
tions adequate for full NLP. 
The kinds of non-linear effects common in 
Semitic languages, where vowel and consonant 
patterns are interpolated in words (Kay, 1987; 
Kiraz, 1994) could be treated efficiently by the 
mechanisms described here if it proved possible to 
define a representation that allowed the parts of 
an inflected word corresponding to the root to be 
separated fairly cleanly from the parts expressing 
the inflection. The latter could then be used by a 
modified version of the current system as the basis 
for efficient lookup of spelling patterns which, as 
in the current system, would allow possible lexical 
roots to be calculated. 
Agglutinative languages could be handled ef- 
208 
flciently by the current mechanism if specifica- 
tions were provided for the affix combinations that 
were likely to occur at all often in real texts. A 
backup mechanism could then be provided which 
attempted a slower, but more complete, direct ap- 
plication of the rules for the rarer cases. 
The interaction of morphological analysis with 
spelling correction (Carter, 1992; Oflazer, 1994; 
Bowden, 1995) is another possibly fruitful area of 
work. Once the root spelling patterns and the affix 
combinations pointing to them have been created, 
analysis essentially reduces to an instance of affix- 
stripping, which would be amenable to exactly the 
technique outlined by Carter (1992). As in that 
work, a discrimination net of root forms would be 
required; however, this could be augmented inde- 
pendently of spelling pattern creation, so that the 
flexibility resulting from not composing the lexi- 
con with the spelling rules would not be lost. 
Acknowledgments 
I am grateful to Manny Rayner and anonymous 
European ACL referees for commenting on earlier 
versions of this paper, and to Pierrette Bouillion 
and Malgorzata Styg for comments and also fo~ 
providing me with their analyses of the French 
and Polish examples respectively. 
This research was partly funded by the Defence 
Research Agency, Malvern, UK, under Strategic 
Research Project M2YBT44X. 
References 
Abramson, H., (1992). "A Logic Programming 
View of Relational Morphology". Proceedings 
of COLING-92, 850-854. 
Alshawi, H. (1992). The Core Language Engine 
(ed). MIT Press. 
Alshawi, H., D.J. Arnold, R. Backofen, D.M. 
Carter, J. Lindop, K. Netter, S.G. Pulman, 
J. Tsujii, and H. Uszkoreit (1991). Euro- 
ira ET6/I: Rule Formalism and Virtual Ma- 
chine Design Study. Commission of the Eu- 
ropean Communities, Luxembourg. 
Bowden, T. (1995) "Cooperative Error Handling 
and Shallow Processing", these proceedings. 
Carter, D.M. (1992). "Lattice-based Word Identi- 
fication in CLARE". Proceedings of A CL-92. 
Kaplan, R., and M. Kay (1994). "Regular Mod- 
els of Phonological Rule Systems", Computa- 
tional Linguistics, 20:3, 331-378. 
Kay, M. (1987). "Non-concatenative Finite-State 
Morphology". Proceedings of EA CL-87. 
Karttunen, L., R.M. Kaplan, and A. Zaenen 
(1992). "Two-level Morphology with Com- 
position". Proceedings of COLING-92, 141: 
148. 
Kiraz, G. (1994). "Multi-tape Two-level Morphol- 
ogy". Proceedings of COLING-94, 180-186. 
Koskenniemi, K. (1983). Two-level morphology: 
a general computational model for word.form 
recognition and production. University of 
Helsinki, Department of General Linguistics, 
Publications, No. 11. 
Kwon, H-C., and L. Karttunen (1994). "Incre- 
mental Construction of a Lexical Transducer 
for Korean". Proceedings of COLING-9~, 
1262-1266. 
Mellish, C. S. (1988). "Implementing Systemic 
Classification by Unification". Computa- 
tional Linguistics 14:40-51. 
Oflazer, K. (1993). "Two-level Description of 
Turkish Morphology". Proceedings of Euro- 
pean A CL- 93. 
Oflazer, K. (1994). Spelling Correction in Agglu- 
tinative Languages. Article 9410004 in 
cmp-lg©xxx, lanl. gov archive. 
Ritchie, G., G.J. Russell, A.W. Black and S.G. 
Pulman (1992). Computational Morphology. 
MIT Press. 
Ruessink, H. (1989). Two Level Formalisms. 
Utrecht Working Papers in NLP, no. 5. 
Trost, H. (1990). "The Application of Two-level 
Morphology to Non-Concatenative German 
Morphology". Proceedings of COLING.90, 
371-376. 
Trost, H. (1991). "X2MORF: A Morphologi- 
cal Component Based on Augmented Two- 
level Morphology". Proceedings of IJCAI-91, 
1024-1030. 
Tr0st, H., and J. Matiasek (1994). "Morphol- 
ogy with a Null-Interface", Proceedings of 
COLING-94. 
209 
