A FINITE-STATE MORPHOLOGICAL PROCESSOR FOR
SPANISH
Evelyne Tzoukermann and Mark Y. Liberman 
AT&T Bell Laboratories 
600 Mountain Avenue 
Murray Hill, NJ 07974
Abstract 
A finite transducer that processes Spanish inflectional and
derivational morphology is presented. The system handles both
generation and analysis of tens of millions of inflected forms.
Lexical and surface (orthographic) representations of the words
are linked by a program that interprets a finite directed graph
whose arcs are labelled by n-tuples of strings. Each of about
55,000 base forms requires at least one arc in the graph.
Representing the inflectional and derivational possibilities for
these forms imposed an overhead of only about 3000 additional
arcs, of which about 2500 represent (phonologically predictable)
stem allomorphy, so that we pay a storage price of about 5% for
compiling these forms off-line. A simple interpreter for the
resulting automaton processes several hundred words per second on
a Sun4.
1 Introduction
One useful way to look at computational morphology and phonology
is in terms of transductions, that is, n-ary word relations
definable by the element-wise concatenation of n-tuple labels
along paths in a finite directed labeled graph. For instance, we
can take one member of such a relation to be the spelling of an
inflected form, another member to be the corresponding lemma,
another to be a string representing its morphosyntactic features,
another to represent its pronunciation, and so forth.
Inspired by the (unpublished) work of Kaplan and 
Kay (ongoing since the late 1970's), and that of 
Koskenniemi in [12], many researchers have used binary
word relations to represent "underlying" and
"surface" forms in the morphophonology of words. 
Much of the interest of this work has been focused 
on methods to combine multiple two-tape automata, 
which may be composed or run in parallel in order 
to compute the desired binary relation. 
In this paper, we take a somewhat different approach to defining
and computing word relations, and discuss its application in a
morphological processor for Spanish orthographic words that
covers more than forty million forms generable from the
approximately 55,000 basic words in the Collins Spanish
Dictionary ([3]).¹ The main advantage of this approach is the
extreme simplicity both of its data structures and of their
interpretation. As a result, an interpreter is easy to implement;
time and/or space optimization issues in the implementation are
straightforward to define; at the same time, it is extremely easy
to compile traditional morphological information into the
required form, at least for languages like Spanish that can be
fairly well modeled in terms of the concatenation of stems and
affixes. As is usually the case in automata-based approaches, the
system treats analysis and generation symmetrically, and the same
description can be run with equal facility in either direction.
Define² an n-ary nondeterministic finite automaton as a 5-tuple

    $A = (Q, q_1, F, \Sigma, H)$

where $Q$ is a finite non-empty set of states, $q_1$ is a
designated start state, $F$ is a set of designated final states,
$\Sigma$ is a finite non-empty alphabet, and $H$ is a finite
subset of $Q \times (\Sigma^*)^n \times Q$, where $(\Sigma^*)^n$
is the set of n-tuples of (possibly empty) words over $\Sigma$.
$A$ can be thought of as a labeled directed graph, whose nodes
are elements of $Q$, and whose edges are elements of $H$, each
such edge being labeled with the appropriate n-tuple of words.
The component-wise concatenation of labels along every path that
begins in $q_1$ and ends in an element of $F$ defines a set of
n-tuples, $R \subseteq (\Sigma^*)^n$, which is the relation
accepted by $A$.
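The definition above can be made concrete with a short sketch. The code below is an illustration only, not the authors' interpreter: it assumes an acyclic arc set (a cycle would make the recursion run forever), and the state names and the three toy arcs are invented for the example.

```python
from typing import Dict, List, Set, Tuple

# One edge of H: (source state, n-tuple label, target state).
Arc = Tuple[str, Tuple[str, ...], str]


def accepted_relation(arcs: List[Arc], start: str, finals: Set[str], n: int) -> Set[Tuple[str, ...]]:
    """Enumerate R for an acyclic arc set by depth-first search from the start state."""
    outgoing: Dict[str, List[Arc]] = {}
    for arc in arcs:
        outgoing.setdefault(arc[0], []).append(arc)

    relation: Set[Tuple[str, ...]] = set()

    def walk(state: str, so_far: Tuple[str, ...]) -> None:
        if state in finals:
            relation.add(so_far)
        for _, label, target in outgoing.get(state, []):
            # Component-wise concatenation of the n-tuple labels along the path.
            walk(target, tuple(a + b for a, b in zip(so_far, label)))

    walk(start, ("",) * n)
    return relation


# Toy 3-ary automaton over (surface form, lemma, features); all arcs are hypothetical.
toy_arcs: List[Arc] = [
    ("q1", ("cambi", "cambiar", ""), "q2"),
    ("q2", ("o", "", "1st singular present indicative"), "qf"),
    ("q2", ("as", "", "2nd singular present indicative"), "qf"),
]
toy_relation = accepted_relation(toy_arcs, "q1", {"qf"}, 3)
```

Here `toy_relation` contains one 3-tuple per path from `q1` to `qf`, for example `("cambio", "cambiar", "1st singular present indicative")`.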
As a practical matter, we generally want to run a program that
(explicitly or implicitly) searches this graph in order to find
all the n-tuples in R with some interesting property, say those
corresponding to forms whose surface spelling is the string w, or
those corresponding to the first person plural imperfect
subjunctive of such-and-such a verb. Depending on the structure
of H and the property selected, the search will be harder or
easier. For the stem-and-affix kind of morphology exemplified by
Spanish, the natural structure for H is quite easy to search. We
do not have space to discuss search methods here, but will simply
observe that a non-optimal method devised for convenience in
another experiment ([6]) processes several hundred Spanish words
per second on a Sun4.

¹ The present set can be increased almost exponentially by adding
new derivational affixes.

² The name and the basic idea of these automata come from [5].
For simplicity of exposition we gloss over various authors'
attempts to distinguish variously among machines, automata and
transducers, as well as the profusion of precursors and
descendants in ([15], [16], [2], [7], etc.). Our notation is
eclectic.

For the application discussed in this paper, we 
want to relate inflected forms, lemmas, and mor- 
phosyntactic features, so that the elements of R 
should be 3-tuples like: 
(cambiaran, cambiar, 3rd plural imperfect subjunctive).
Since most Spanish words consist of a stem, which 
mainly specifies the lemma, and a set of affixes that 
mainly specify the morphosyntactic features, it is ap- 
propriate to use 2-tuples made by concatenating the 
second and third elements. 
The basis of our run-time system is the arc list H. 
For a large lexicon, it is inconvenient to write this list 
by hand, and so we compile it from a lexical table 
that reflects more directly the way that morpholog- 
ical information is represented in a standard dictio- 
nary, such as the Collins dictionary we began with. 
The program recursively follows every applicable arc in the
list, so an input may yield more than one analysis or generated
form. For instance, the analysis for the input word "retirada" is
of the form:
retirar   past participle feminine singular
retirado  adjective feminine singular
retirada  noun feminine singular
2 The arc-list compiler 
The arc-list compiler starts with a list of lexical 
items with their morphological classes, applying mor- 
phophonological transformations to generate the arc 
list. For instance, each verb headword in the Collins 
dictionary is given an index that specifies one of 
62 conjugation classes. Based on this information, 
the arc-list compiler calculates the set of stem al- 
lomorphs necessary for that verb's inflection, along 
with the set of endings that each stem allomorph 
selects. Spanish verbs have from one to five orthographic stem
allomorphs. When the verb is regular there is only one stem, like
"cambi-" in "cambiar" (to change). An irregular verb may have up
to five stems, like "ten-", "teng-", "tien-", "tend-", "tuv-" for
the verb "tener" (to have). This is common in Romance languages
(see Tzoukermann 1986 for French). These different stems are the
result of morphophonological changes occurring during verbal
inflection, usually related to the stress implications of the
verbal ending or to the features of its initial vowel.
Depending on the conjugation class, the character 
string corresponding to the verb lemma is subjected 
to one or more rewriting rules. These rewriting rules 
are of different types: 
* they can be the consequence of a stress change during verbal
  inflection:
  (a) e → ie when the last syllable is not stressed, as in
      querer / quiero.
* they can be a morphographic change that is general to Spanish
  orthography:
  (b) c → qu before "e" and "i", as in sacar / saque.
  or the reverse rule
  (c) qu → c before "a", "o", "u", as in delinquir / delinco.
Some verbs are subject to one type of rewriting rule such as in
(a) - (c) above, and consequently produce one additional stem
allomorph. The verb "sacar" (to take / pull out) will generate
"sac-" and "saqu-", and likewise "delinquir" (to offend) will
generate "delinqu-" and "delinc-".
Some other verbs, fewer in number but more frequent in actual
use, are subject to two rewriting rules and need a more complex
treatment. In "forzar" (to force), the morphophonological rule
combines with the orthographic one and produces a distribution of
four stems: "forz-", "forc-", "fuerc-", "fuerz-". The same
phenomenon occurs for "rogar" (to beg) with the stems "rog-",
"rogu-", "rueg-", "ruegu-". For some verbs of the second group in
"-er", the stem production is less predictable; for instance
"tener" presents five stems "ten-", "teng-", "tien-", "tend-",
"tuv-". Notice that some of them, such as "teng-", do not follow
the type of morphophonological rules mentioned above.
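As an illustration of how such rewriting rules can yield the stem sets quoted above, here is a minimal sketch. It is not the authors' arc-list compiler: the table keyed on infinitive endings, the function names, and the z/c and g/gu pairs (added only to cover "forzar" and "rogar") are assumptions made for the example, and the diphthong rule is passed in by hand rather than looked up from a conjugation class.

```python
from typing import List, Optional, Tuple

# Infinitive ending -> (stem-final spelling before a/o/u, spelling before e/i).
# Hypothetical table; only c/qu corresponds to rules (b) and (c) in the text.
ORTHOGRAPHIC = {
    "quir": ("c", "qu"),
    "car": ("c", "qu"),
    "zar": ("z", "c"),
    "gar": ("g", "gu"),
}


def spelling_variants(infinitive: str) -> List[str]:
    """Return the stem spellings required by the orthographic alternation, if any."""
    for ending, (back, front) in ORTHOGRAPHIC.items():
        if infinitive.endswith(ending):
            root = infinitive[: -len(ending)]
            return [root + back, root + front]
    return [infinitive[:-2]]  # regular verb: drop -ar / -er / -ir


def diphthongize(stem: str, old: str, new: str) -> str:
    """Rewrite the last occurrence of the vowel `old` in the stem as `new`."""
    i = stem.rfind(old)
    return stem if i < 0 else stem[:i] + new + stem[i + 1:]


def stem_allomorphs(infinitive: str, diphthong: Optional[Tuple[str, str]] = None) -> List[str]:
    """Combine the orthographic rule with an optional stress-related vowel change."""
    stems = spelling_variants(infinitive)
    if diphthong is not None:
        old, new = diphthong
        stems = stems + [diphthongize(s, old, new) for s in stems]
    return stems
```

With these assumptions, `stem_allomorphs("sacar")` gives the two stems "sac" and "saqu", while `stem_allomorphs("forzar", ("o", "ue"))` gives the four stems "forz", "forc", "fuerz", "fuerc". Unpredictable sets such as those of "tener" would still have to be listed by hand.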
Because of Spanish orthographic conventions con- 
nected with the notation of stress, some nouns and 
adjectives also acquire more than one stem allomorph
in a rule-governed way. In addition, of course, there 
must be a list of cases where the allomorphy is simply 
unique to the word in question. 
3 The arc list 
Using a state labeled 1 by convention as the start state, and a
state labeled 0 by convention as the (unique) final state, we
express all of the information needed to define our automaton $A$
by enumerating the arcs in $H$, which now can be represented as
lists of 4-tuples $(q_i, q_j, u, v)$, where $q_i$ and $q_j$ are
arbitrary identifiers for states, $u$ is a substring of an
inflected form, and $v$ is a substring of the corresponding
lemma + morphosyntactic category.
Used either for analysis or for generation, our program
interprets this same arc list. The arc list can be conceptually
divided in two parts: one contains the stems of the verbs, nouns
and adjectives; the other contains a number of sub-lexicons that
provide the endings for these lexical categories as well as the
clitics.
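A minimal sketch of how one arc list can serve both directions is given below. It is not the authors' interpreter: the function name, the convention of joining lexical pieces with a space, and the arc from state 500 to the final state 0 are assumptions made for the example, and the search assumes there are no cycles of empty-string arcs.

```python
from typing import List, Tuple

Arc = Tuple[str, str, str, str]  # (q_i, q_j, surface substring u, lexical substring v)
START, FINAL = "1", "0"


def transduce(arcs: List[Arc], text: str, analyze: bool = True) -> List[str]:
    """Match `text` on one side of every path from START to FINAL and
    return what the same paths spell out on the other side."""
    results: List[str] = []
    joiner = " " if analyze else ""

    def search(state: str, rest: str, out: List[str]) -> None:
        if state == FINAL and not rest:
            results.append(joiner.join(piece for piece in out if piece))
        for q_i, q_j, u, v in arcs:
            if q_i != state:
                continue
            consumed, emitted = (u, v) if analyze else (v, u)
            # Lexical pieces are assumed to be separated by single spaces.
            remaining = rest if analyze else rest.lstrip(" ")
            if remaining.startswith(consumed):
                search(q_j, remaining[len(consumed):], out + [emitted])

    search(START, text, [])
    return results


demo_arcs: List[Arc] = [
    ("1", "2", "cambi", "cambiar"),
    ("2", "150", "", ""),
    ("150", "500", "o", "1st singular present indicative"),
    ("150", "500", "as", "2nd singular present indicative"),
    ("500", "0", "", ""),  # hypothetical arc into the final state
]
analysis = transduce(demo_arcs, "cambias", analyze=True)
generation = transduce(demo_arcs, "cambiar 1st singular present indicative", analyze=False)
```

The only difference between the two directions is which member of each arc label is consumed and which is emitted; the arc list itself is untouched.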
Our Spanish system is defined by a set of about 58,000 such
4-tuples, (most of which are) generated by rule from head words
and category information extracted from the typographer's tape
for the Collins Spanish Dictionary. Affixes, assorted null-string
transitions and clitics account for about 1000 elements of this
set; the remainder are stems or stem allomorphs. Since we have
about 55,000 lemmas, the overhead for compiling out predictable
aspects of allomorphy is at worst the approximately 2,500 stem
allomorphs and affix arcs, i.e. less than 5%. There are about 225
states in total.
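As a rough check of the overhead figure, using only the approximate counts quoted in this paragraph (both are rounded values, so the result is indicative only):

```latex
\[
  \frac{2{,}500 \text{ additional arcs}}{55{,}000 \text{ lemmas}} \approx 0.045 = 4.5\% < 5\%
\]
```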
3.1 Verbal stems
The verbal stem lexicon was obtained by extracting the verb
headwords (about 6,800 Spanish verbs) from the Collins
dictionary.
Once the grammar provides the stems, a state pair is associated
with them. The first state is always the initial state "1"; the
second depends on the type of stem and its ending throughout the
conjugation (digits or character strings can be used
indifferently for labelling the states). For example, for the
first verb conjugation, whose infinitives end in "-ar", the
second states are spread out among 10 different states.
1  2   cambi   cambiar
1  6   cruc    cruzar
1  3   enví    enviar
1  4   envi    enviar
1  3   sitú    situar
1  4   situ    situar
1  5   cruz    cruzar
1  6   cruc    cruzar
1  7   jug     jugar
1  8   jueg    jugar
1  9   juegu   jugar
1  10  jugu    jugar
Two verb stems x and y will share the same second state number
if and only if:
* x has the same number of stems as y,
* x has the same ending distribution as y.
This permits a compression of the database since the sets of
stems are gathered under a common second state number. Other
arguments in favor of this choice of representation are given in
section 4.1.
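The sharing criterion can be sketched as a grouping by signature. This is an illustration under stated assumptions, not the authors' compiler: the function name, the starting state number, and the ending-distribution labels ("stressed stem", "unstressed stem", "all endings") are placeholders invented for the example.

```python
from typing import Dict, List, Tuple

# Signature of a stem: (number of stems its verb has, label of the endings it selects).
Signature = Tuple[int, str]
StemEntry = Tuple[str, str, Signature]  # (stem, lemma, signature)


def assign_second_states(stems: List[StemEntry], first_free: int = 2) -> List[Tuple[str, str, str, str]]:
    """Give each distinct signature one state number and emit (1, state, stem, lemma) arcs."""
    state_of: Dict[Signature, int] = {}
    arcs = []
    for stem, lemma, signature in stems:
        if signature not in state_of:
            state_of[signature] = first_free + len(state_of)
        arcs.append(("1", str(state_of[signature]), stem, lemma))
    return arcs


example_stems: List[StemEntry] = [
    ("cambi", "cambiar", (1, "all endings")),
    ("enví", "enviar", (2, "stressed stem")),
    ("envi", "enviar", (2, "unstressed stem")),
    ("sitú", "situar", (2, "stressed stem")),
    ("situ", "situar", (2, "unstressed stem")),
]
example_arcs = assign_second_states(example_stems)
```

Stems with the same signature, such as "enví" and "sitú", receive the same second state, so the endings reachable from that state are stored only once.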
For the 62 conjugation classes, grouped in three 
verb conjugations, the number of stems combined 
with the various ending distributions creates a num- 
ber of verb-stem-final states close to 150. 
Defective verbs, due to their idiosyncrasies, are
listed separately.
3.2 The adjective stems 
The adjective base forms (about 10,500) were derived from the
masculine singular forms listed in the dictionary. The lexical
representation of a regular adjective has an entry in the lexicon
as follows:
1 300 buen bueno
where "buen-" is the stem and "bueno" (good) the dictionary base
form. Special attention needed to be paid to stressed adjectives
like "musulmán" (Muslim) or "mandón" (bossy) where the inflected
form does not keep the accent. Therefore, both forms (stressed
and unstressed) needed to be stored.
3.3 The noun stems 
About 30,700 nouns were extracted from the dictio- 
nary. These nouns are not inflected for gender, but 
are simply listed as masculine or feminine. Thus the 
arc label for a noun contains the complete form of 
the singular. Some examples of arcs for nouns are: 
(a) 1 499 aerodromo aerodromo noun masculine
(b) 1 500 mariscos  mariscos noun masculine plural
In the above examples, (a) can either generate a singular form or
it can acquire the plural form in a further step, whereas (b),
which occurs only in the plural, can have no further inflection
added.
4 The affixes 
Besides the stems, various sublexicons containing 
"intermediary states" and affixes of different types 
constitute the other part of the Spanish arc list. 
4.1 Intermediary nodes or continuation classes
The regrouping of the verbal arc list by stem and 
person allows reduction of the number of states and 
therefore, of arcs. For instance, an intermediary 
state was added for the tenses only. The arc marked 
"#" shows a transition on an empty string. 
2 150 # # 
This arc takes any verb stem of which the final state is 2 and
links it to the indicative present node (labeled here 150) of the
"-ar" verbs. Consequently, there are as many nodes of that kind
as tenses for each group and verb category.
4.2 Endings 
A series of sublexicons lists the inflections for the 
verbs, nouns and adjectives. Verbal inflections are of 
the form: 
150 500 o   1st singular present indicative
150 500 as  2nd singular present indicative
In the same way, the regular endings for the adjec- 
tives are of the form: 
300 497 o adjective 
497 498 # masculine 
497 500 # singular 
498 500 s plural 
Each transition corresponds to the gender or number 
feature of the adjective. 
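For readers who want to experiment with the listings shown in this paper, the printed lines can be turned into 4-tuples with a few lines of code. This sketch assumes the whitespace-separated layout of the printed examples, with "#" standing for the empty string; the authors' actual file format is not specified, and `parse_arc` is a hypothetical helper.

```python
from typing import Tuple


def parse_arc(line: str) -> Tuple[str, str, str, str]:
    """Split 'q_i q_j u v...' into a 4-tuple; '#' stands for the empty string."""
    q_i, q_j, u, *rest = line.split()
    v = " ".join(rest)
    return (q_i, q_j, "" if u == "#" else u, "" if v == "#" else v)


listing = """\
1 300 buen bueno
300 497 o adjective
497 498 # masculine
497 500 # singular
498 500 s plural
2 150 # #
150 500 o 1st singular present indicative"""

parsed_arcs = [parse_arc(line) for line in listing.splitlines()]
```

Everything after the third field is kept as the lexical string, so multi-word feature descriptions such as "1st singular present indicative" survive intact.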
4.3 Clitics 
The eleven Spanish clitics can occur either alone or in
combination ([1]). Over sixty-five combinations can be formed
such as "seles", "noslas", etc. The infinitive, gerund and
imperative are the only forms in which they can occur, for
instance, "hacerlo" (to do it) or "diciéndooslo" (saying it to
you). Nevertheless, they are sometimes subject to orthographic
rules of the type: deletion of "s" for first person plural
imperative verbs in front of the enclitic "nos", such as in
"amamonos".
Consequently, about 300 arcs were listed to handle 
the general cases as well as the idiosyncrasies. 
4.4 Reflexive verbs 
In the case of reflexive verbs such as "afiliarse" (to affiliate,
to join) or "abstenerse" (to abstain, to refrain), a special
treatment is motivated. Such verbs have a paradigm like:
(a) me afilio, (I affiliate)
    te afilias, (you affiliate)
    me afiliaba, (I was affiliating)
    te afiliabas, (you were affiliating)
(b) afiliandome (affiliating myself)
    afiliate (affiliate!)
The reflexive pronouns generally precede the verb form, separated
from it by white space as shown in (a), except for the
infinitive, imperative and present participle (example (b)
above).³ For the preceding reflexive pronouns, there is a
dependency between the person-and-number of the pronoun and the
person-and-number of the verbal ending, spanning the intervening
verb stem. To capture such dependencies in a single automaton of
the kind that we are using, we would have to use a separate path
for each person-number combination, duplicating the verb stem
(and its allomorphs, if any) six times. This seems like a bad
idea. A better alternative, in such cases, is to set up the
automaton to permit all reflexive pronouns to co-occur with all
endings, and to filter the resulting set of tuples to remove the
ones that do not match. This can be done, for example, by passing
the output through a second automaton that does nothing but check
person and number agreement in reflexive verbs.

³ Note that some verbs (e.g. "afiliarse") occur only reflexively,
while others (e.g. "lavar" (to wash, to clean)) may be used
reflexively or non-reflexively. Note also that object pronouns in
general are cliticized, not only the reflexive ones.

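The over-generate-and-filter idea can be sketched as follows. A plain Python predicate stands in for the second automaton described in the text, and the candidate format (a pronoun paired with a space-separated analysis string) is an assumption made for the example.

```python
from typing import List, Tuple

# Person and number required by each preceding reflexive pronoun.
PRONOUN_AGREEMENT = {
    "me": ("1st", "singular"),
    "te": ("2nd", "singular"),
    "nos": ("1st", "plural"),
    "os": ("2nd", "plural"),
}


def agrees(pronoun: str, analysis: str) -> bool:
    """Check that the pronoun matches the person and number in the analysis string."""
    features = analysis.split()
    if pronoun == "se":  # "se" serves both 3rd singular and 3rd plural
        return "3rd" in features
    expected = PRONOUN_AGREEMENT.get(pronoun)
    if expected is None:
        return False
    person, number = expected
    return person in features and number in features


def filter_reflexive(candidates: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the (pronoun, analysis) pairs whose person and number agree."""
    return [(p, a) for p, a in candidates if agrees(p, a)]


# Over-generated pairs: every pronoun combined with every ending.
candidates = [
    ("me", "afiliar 1st singular present indicative"),
    ("me", "afiliar 2nd singular present indicative"),
    ("te", "afiliar 2nd singular present indicative"),
    ("te", "afiliar 1st singular present indicative"),
]
kept = filter_reflexive(candidates)
```

The first automaton therefore needs only one copy of each verb stem, and the agreement check is stated once, independently of the lexicon.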
We find it interesting that precisely those aspects 
of Spanish morphology that require such a treatment 
are those whose formatives are written as separate 
words. 
4.5 Prefixes and suffixes 
About 60 suffixes and 90 prefixes were added to the arc list for
handling derivational morphology. Only the very productive ones
were selected. The prefixes are of the form "aero-", "ante-",
"auto-", "bio-" occurring with or without the dash; the suffixes
are of the form "-ejo", "-eta", "-zuela", "-uelo", etc.
The resulting arc list, in addition to supporting an efficient
computation of relations between surface and lexical forms,
provides a good overview of the morphological structure of the
Spanish verbal system, permitting easy access to the sets of
verbs that behave in a similar way.
5 Conclusion 
We have implemented a complete morphological processor for
Spanish, one which generates and recognizes all (and only)
well-formed inflected and derived forms. It covers about 95% of
Spanish text extracted from the EFE newswire text coming from
Madrid. It has been linked to a browser for the Spanish newswire
and to the Collins bilingual dictionary (see Appendix), and is
also being utilized in the construction of a Spanish parser
(Donald Hindle at Bell Laboratories) and for further research in
Spanish text analysis. We have found this model to be both simple
and powerful. We plan to implement other Romance languages, and
to experiment with German, where the treatment of compounds
presents some special interest.
Appendix
[Figure not recoverable from the source.]

References 

[1] Casajuana R. and C. Rodríguez 1985. Clasificación de los
    verbos castellanos para un diccionario en ordenador. I
    congreso de lenguajes naturales y lenguajes formales.
    Universidad de Barcelona. Facultad de Filología.
    Departamento de Lingüística General. Barcelona.

[2] Chomsky, N. 1962. Context-free Grammars and Pushdown
    Storage, M.I.T. Research Laboratory of Electronics Quarterly
    Progress Report #65, pp. 187-193.

[3] Collins Spanish Dictionary: Spanish-English. Collins
    Publishers, Glasgow, 1989.

[4] Corbin D. 1987. Morphologie dérivationnelle et structuration
    du lexique. Niemeyer Verlag: Tübingen.

[5] Elgot, C.C. and J.E. Mezei 1965. On Relations Defined by
    Generalized Finite Automata, IBM Journal Res. 9, pp. 47-68.

[6] Feigenbaum, J., M.Y. Liberman, R.N. Wright (forthcoming).
    Cryptographic Protection of Databases and Software. In
    Proceedings of the DIMACS Workshop on Distributed Computing
    and Cryptography, Feigenbaum and Merritt, Eds. AMS and ACM.

[7] Ginsburg, S. 1966. The Mathematical Theory of Context-Free
    Languages, McGraw Hill.

[8] Kay, M. 1982. When Meta-rules are not Meta-rules. In
    Sparck-Jones & Wilks (eds.) Automatic Natural Language
    Processing. University of Essex, Cognitive Studies Center
    (CSM-10).

[9] Karttunen, L. 1983. KIMMO: A general morphological
    processor. Texas Linguistic Forum, No. 22 pp 165-185.

[10] Karttunen, L., K. Koskenniemi, R. Kaplan 1987. A Compiler
    for Two-level Phonological Rules. Ms. Xerox Palo Alto
    Research Center.

[11] Khan R. 1983. A two-level morphological analysis of
    Roumanian. Texas Linguistic Forum, No. 22 pp 153-170.

[12] Koskenniemi, K. 1983. Two-level morphology: A General
    Computational Model for Word-Form Recognition and
    Production. University of Helsinki, Dept. of General
    Linguistics, Publications, No. 11.

[13] Koskenniemi, K., K. W. Church 1988. Complexity, Two-level
    morphology and Finnish. Proceedings of the 12th
    International Conference on Computational Linguistics.
    Budapest, Hungary.

[14] Lun S. 1983. A two-level morphological analysis of French.
    Texas Linguistic Forum, No. 22 pp 271-277.

[15] Rabin, M.O. and D. Scott, 1959. Finite Automata and their
    Decision Problems, IBM J. Res. 3, pp. 114-125.

[16] Schützenberger, M.P. 1961. A Remark on Finite Transducers,
    Information and Control 4, pp. 185-196.

[17] Tzoukermann E., R. Byrd 1988. The Application of a
    Morphological Analyzer to on-line French Dictionaries.
    Proceedings of the International Conference on Lexicography,
    Euralex. Budapest, Hungary.

[18] Tzoukermann E. 1986. Morphologie et génération des verbes
    français. Unpublished PhD dissertation. Institut National
    des Langues Orientales, Sorbonne Nouvelle, Paris III,
    France.
