Arabic Finite-State Morphological Analysis and Generation 
Kenneth R. Beesley 
Rank Xerox Research Centre 
Grenoble Laboratory 
Le Quartz 
6, chemin de Maupertuis 
38240 MEYLAN
France 
ken.beesley@xerox.fr
Abstract 
This paper describes a large-scale system that performs
morphological analysis and generation of on-line Arabic words
represented in the standard orthography, whether fully voweled,
partially voweled or unvoweled. Analyses display the root, pattern
and all other affixes together with feature tags indicating part of
speech, person, number, mood, voice, aspect, etc. The system is
based on lexicons and rules from an earlier KIMMO-style two-level
morphological system, reworked extensively using Xerox Finite-State
Morphology tools. The result is an Arabic Finite-State Lexical
Transducer that is applied with the same runtime code used for
English, French, German, Spanish, Portuguese, Dutch and Italian
lexical transducers.
1 Introduction 
1.1 Challenges of Arabic Morphology 
Semitic languages like Arabic present unusual challenges to
automatic morphological analysis and generation. The first
challenge is morphotactic: whereas most languages construct words
out of morphemes which are just concatenated one after another, as
in un+fail+ing+ly, an Arabic stem like daras (دَرَس)¹ is
traditionally analyzed as consisting of a three-consonant root,
transliterated as drs (د ر س), which is interdigitated with a
pattern CaCaC, where C represents a slot for a root consonant,
sometimes termed a radical; various prefixes and suffixes can then
concatenate to the stem in the familiar way. See Figure 1.
Similarly, the root ktb (ك ت ب) interdigitates with the same
pattern to form katab (كَتَب); and the root brj (ب ر ج)
interdigitates with the pattern taCaC-aC to form the stem tabar-aj
(تَبَرَّج).

¹ The Arabic examples in this paper were produced using the ArabTeX
package for TeX and LaTeX by Professor Klaus Lagally.

Abstract lexical level:
          C a C a C
   wa+                +at
          d   r   s
Abstract intersected level:
   wa+daras+at
Figure 1: Abstract wa+daras+at ("and she learned/studied")

There are perhaps 5000 Arabic roots in common usage, and about 400
phonologically distinct patterns, most of which are ambiguous. Each
root can legally combine with only a small subset of the
phonologically distinct patterns, an average of about seventeen or
eighteen, and this decidedly derivational process must be
controlled by old-fashioned lexicography.
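
The traditional slot-filling view of interdigitation, together with the lexicographic restriction on which patterns a root may take, can be sketched in a few lines of Python. This is only an illustration, not the system described in this paper: the names `ROOT_PATTERNS`, `interdigitate`, `stems` and `word` are hypothetical, and the two-entry lexicon (including the pattern CuCiC) is an assumption made for the example.

```python
# Illustrative root lexicon: each root is coded with the patterns it may take.
# The entries, and the pattern CuCiC, are assumptions made for this sketch.
ROOT_PATTERNS = {
    "drs": ["CaCaC", "CuCiC"],
    "ktb": ["CaCaC", "CuCiC"],
}


def interdigitate(root: str, pattern: str) -> str:
    """Fill the C slots of a pattern, left to right, with the root's radicals."""
    if pattern.count("C") != len(root):
        raise ValueError(f"{pattern!r} needs one C slot per radical of {root!r}")
    radicals = iter(root)
    return "".join(next(radicals) if ch == "C" else ch for ch in pattern)


def stems(root: str) -> list[str]:
    """Return only the stems that the lexicon sanctions for this root."""
    return [interdigitate(root, p) for p in ROOT_PATTERNS.get(root, [])]


def word(prefix: str, stem: str, suffix: str) -> str:
    """Prefixes and suffixes attach to the stem by ordinary concatenation."""
    return prefix + stem + suffix


if __name__ == "__main__":
    print(stems("drs"))
    print(word("wa+", interdigitate("drs", "CaCaC"), "+at"))
```

A root that is absent from the lexicon yields no stems at all, which mirrors the point that legal root and pattern combinations have to be listed rather than computed.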
The second challenge is that standard Arabic surface orthography
seldom represents short vowels, distinctive consonant length, and
other potentially helpful details. The wa+daras+at example could
conceivably be written fully voweled as wadarasat (وَدَرَسَتْ), but
it is much more likely to appear as the unvoweled wdrst (ودرست).
The resulting incompleteness of the surface orthography makes
written text unusually ambiguous, with an average of almost five
valid morphological analyses per word. Finally, Arabic orthography
displays an idiosyncratic mix of deep morphophonological elements
carried to the surface, resulting in silent letters, and more
surfacy representations of epenthesis, deletion and assimilation.
1.2 Challenges of Arabic Lexical Lookup
Standard Arabic dictionaries like the Wehr-Cowan are organized by
root headwords like drs (د ر س) and ktb (ك ت ب). In fact the roots
by themselves are not valid words, nor are they even pronounceable
until they are combined with a pattern. Because in orthographical
words these root consonants or radicals are usually surrounded, and
even split up, by other consonant letters, and because the radicals
themselves may be modified by assimilation or even deleted entirely
in a written word, root identification and dictionary lookup are
significant challenges for learners and native speakers alike.
2 Goals 
To be interesting in our applications, the Ara- 
bic morphology system had to have the following 
qualities: 
1. It had to deal with real Arabic surface orthog- 
raphy, as represented on-line in standards 
such as ASMO 449 or the Macintosh Arabic 
code page (ISO8859-6). While it is possible to 
devise strict roman transliterations of Arabic 
orthography that are unambiguously convert- 
ible back and forth into real Arabic orthogra- 
phy, most existing romanizations are in fact 
transcriptions that contain more or less in- 
formation than the original and so represent 
different orthographical systems. 
2. It had to be able to analyze Arabic words 
as they appear in real texts. This means 
that input words may be fully voweled or 
diacriticized (i.e. supplied with full diacrit- 
ical markings, a style of writing found only 
in religious texts, poetry, and writings in- 
tended for children and other learners), par- 
tially diacriticized or undiacriticized, which is 
the normal case. A single system had to han- 
dle undiacriticized words and yet be able to 
take advantage of any diacritics that might 
be present. 
3. To facilitate lookup of words in printed and 
on-line dictionaries, and for pedagogical pur- 
poses, the system had to return the root as an 
easily distinguished part of the analysis. An 
easier to build, but less useful, system would 
simply deal with complete stems rather than 
roots and patterns. 
4. The system had to be large and open-ended, 
with each root coded to restrict the patterns 
with which it can in fact co-occur. 
5. It had to be efficient and accurate, suc- 
cessfully analyzing hundreds or thousands 
of words per second on commonly available 
workstations and higher-end PCs. 
6. It had to perform efficient and accurate gen- 
eration of valid surface forms when supplied 
with the component root and relevant fea- 
ture tags. Analysis and generation had to 
be straightforward inverse operations. 
Forest of Lexicon "Letter Trees"
   Trees are connected by "continuation classes."
   A letter path through the trees is an abstract word.
Rules hand-compiled into FSTs
   The intersection of the rules is simulated in code.
   Rules allow and control the discrepancies between the abstract
   words in the lexicon and the surface words being analyzed.
Figure 2: Traditional Kimmo-Style System Architecture
3 History 
In 1989 and 1990, with colleagues at ALPNET (Beesley, 1989;
Beesley, Buckwalter and Newton, 1989; Buckwalter, 1990; Beesley,
1990), I built a large two-level morphological analyzer for Arabic
using a slightly enhanced implementation of KIMMO-style two-level
morphology (Koskenniemi, 1983, 1984; Gajek, 1983; Karttunen, 1983).
Traditional two-level morphology (see Figure 2), as in the publicly
available PC-KIMMO implementation (Antworth, 1990), allows only
concatenation of morphemes in the morphotactics. Lexicons are
stored and manipulated at runtime as a forest of letter trees, with
each tree typically containing a single class of morphemes, with
the leaves connected to subsequent morpheme trees via a system of
"continuation classes". A letter path through the lexical trees
from a legal starting state to a final leaf defines an abstract or
"lexical" string. The various two-level rules, which had to be
hand-compiled into finite-state transducers, were run in parallel
by code that simulated their intersection. The rules allowed and
controlled the variations between the lexical strings and the
surface strings being analyzed: thus the Arabic surface word wdrst
(ودرست) could be matched with the lexical string wa+daras+at, among
others, via appropriate rules.
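
The continuation-class mechanism described above can be sketched as follows. This is a simplified illustration rather than the PC-KIMMO or ALPNET data structures: the letter trees are flattened into lists of whole morphemes, and the names `SUBLEXICONS` and `lexical_strings`, the class names, and the lexicon contents are all assumptions made for the example.

```python
from typing import Iterator

# Each sublexicon lists (morpheme, continuation class) entries.
# "#" marks the end of a word. All contents are illustrative assumptions.
SUBLEXICONS = {
    "Prefix": [("wa+", "Stem"), ("", "Stem")],
    "Stem": [("daras", "Suffix"), ("katab", "Suffix")],
    "Suffix": [("+at", "#"), ("", "#")],
}


def lexical_strings(cls: str = "Prefix") -> Iterator[str]:
    """Enumerate every abstract (lexical) string reachable from a class."""
    if cls == "#":
        yield ""
        return
    for morpheme, continuation in SUBLEXICONS[cls]:
        for rest in lexical_strings(continuation):
            yield morpheme + rest


if __name__ == "__main__":
    for s in lexical_strings():
        print(s)
```

Because each entry can only name the class that follows it, a lexicon of this kind expresses concatenation and nothing else, which is why root and pattern interdigitation needed a separate mechanism.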
In the ALPNET Arabic system, roots and patterns were stored in
separate trees in the lexical forest, and an algorithm, called
Detouring, performed the interdigitation of Semitic roots and
patterns into stems at runtime. The other challenges of Arabic
morphological variation and orthography, including varying amounts
of diacritical marking, all succumbed to rather complex but
completely traditional two-level rules. While the resulting system
was successfully sold and is also currently being used as the
morphological engine of an Arabic project at the University of
Maryland, it suffers from many well-known limitations of
traditional two-level morphology.
1. As there was no automatic rule compiler available to us, the
rules had to be compiled into finite-state transducers by hand, a
tedious task that often influences the linguist to simplify the
rules by postulating a rather surfacy lexical level.
Hand-compilation of a complex rule, which can easily take hours, is
a real disincentive to change and experimentation.
2. Because there was no algorithm to intersect the rule
transducers, over 100 of them in the ALPNET system, they are stored
separately and must each be consulted separately at each step of
the analysis. As the time necessary to move a rule transducer to a
new state is usually independent of its size, moving 100
transducers at runtime can be 100 times slower than moving a single
intersected transducer.
3. Because the lexical letter trees in a traditional Kimmo-style
system are decorated with glosses, features and other miscellaneous
information on the leaves, they are not pure finite-state machines,
cannot be combined into a single fsm, cannot be composed with the
rules, and have to be stored and run as separate data structures.
4. Various diacritical features inserted into the lexical strings
to insure proper analyses made this and other KIMMO-style systems
awkward or impractical for generation.
5. Finally, in the enhanced ALPNET implementation, the storage of
almost 5000 roots and hundreds of patterns in separate sublexicons
saved memory space, but the Detouring operation that interdigitated
them in realtime was inherently inefficient, building and then
throwing away many superficially plausible stems that were not
sanctioned by the lexicon codings. (Any Arabic root can combine
legally with only a small subset of the possible patterns.) With
the building of phantom stems and the unavoidable backtracking
caused by the overall deficiency and ambiguity of written Arabic
words, the resulting system was rather slow, analyzing about 2
words per second on a small IBM mainframe.
Abstract lexical level:
   [ drs & CaCaC ]
Abstract intersected level:
   daras
Figure 3: Intersection of Lexically Consecutive Root and Pattern

Abstract lexical level:
   [ drs & CVCVC & aa ]
Abstract intersected level:
   daras
Figure 4: Intersection of Lexically Consecutive Root, CV-Template,
and Voweling
4 Reimplementation 
Work began in 1995 to convert the analysis to the Xerox fst format.
The ALPNET lexicons were first converted into the format of lexc,
the lexicon compiler (Karttunen and Beesley, 1992). Although lexc
by itself is largely limited to concatenative morphotactics, just
like traditional two-level morphology, it was noted that the
interdigitation of Semitic roots and patterns is nothing more or
less than their intersection, an operation supported in the Xerox
finite-state calculus. Thus if ? represents any letter, and C
represents any radical (consonant), the root drs (د ر س) can be
interpreted as ?*d?*r?*s?*.
The intersection of this root with the pattern CaCaC yields the
stem daras (دَرَس). See Figure 3.
In some analyses (e.g. McCarthy, 1981), the voweling of the pattern
is also abstracted out, leaving pattern templates like CVCVC and a
vocalic element that can be formalized as ?*a?*a?*. If V represents
a vowel, then the intersection of the root, template and vocalic
elements yields the same result. See Figure 4.
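
The idea that interdigitation is simply intersection can be checked on a toy scale with ordinary regular expressions. The sketch below does not use the Xerox finite-state calculus: it enumerates the (finite) language of a template and keeps the strings that also match each `?*x?*y?*` constraint. The function names and the small consonant and vowel inventories are assumptions made for the example.

```python
import re
from itertools import product

# Toy symbol inventories (assumptions); real Arabic has many more consonants.
CONSONANTS = "bdjkrst"
VOWELS = "aiu"


def subsequence_regex(symbols: str) -> re.Pattern:
    """Build the paper's ?*d?*r?*s?* as the Python regex .*d.*r.*s.*"""
    return re.compile(".*" + ".*".join(re.escape(s) for s in symbols) + ".*")


def template_language(template: str):
    """Yield every string a template denotes: C is any consonant, V any vowel."""
    choices = {"C": CONSONANTS, "V": VOWELS}
    slots = [choices.get(ch, ch) for ch in template]
    return ("".join(chars) for chars in product(*slots))


def intersect(template: str, *constraints: str) -> list[str]:
    """Keep the template strings that satisfy every subsequence constraint."""
    regexes = [subsequence_regex(c) for c in constraints]
    return [s for s in template_language(template)
            if all(rx.fullmatch(s) for rx in regexes)]


if __name__ == "__main__":
    print(intersect("CaCaC", "drs"))        # root and pattern, as in Figure 3
    print(intersect("CVCVC", "drs", "aa"))  # root, template, voweling (Figure 4)
    print(intersect("CVCVC", "drs", "ui"))
```

Enumeration is only workable here because the template language is tiny; a finite-state toolkit computes the same intersection directly on the automata without listing strings.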
Using standard operations available through the lexc compiler and
other finite-state tools, the analysis can be constructed according
to the taste and needs of the linguists.
Because the upper-side string is returned as the result of an
analysis, it is often more helpful to define the upper-side string
as a baseform (here a root) followed by a set of symbol tags
designed to represent relevant morphosyntactic features of the
analysis. For example, daras (دَرَس) happens to be the Form I
perfect active stem based on the root drs (د ر س), with CVCVC being
the Form I pattern and the vocalic element aa representing active
voice. The stem duris (دُرِس), using the passive voweling ui, is
the parallel passive example. If +FormI, +Perfect, +Active and
+Passive are defined as single symbols, and if +FormI+Perfect maps
to CVCVC, and if +Active maps to aa and +Passive to ui, the
analyses can be constructed as in Figures 5 and 6.

Abstract lexical level:
   drs+FormI+Perfect+Active
Abstract intermediate level:
   drs+CVCVC+aa
Abstract intersected level:
   daras
Figure 5: Root drs with CVCVC Template and Active Voweling

Abstract lexical level:
   drs+FormI+Perfect+Passive
Abstract intermediate level:
   drs+CVCVC+ui
Abstract intersected level:
   duris
Figure 6: Root drs with CVCVC Template and Passive Voweling
After composition of the relevant transducers, 
the intermediate levels disappear, resulting in a 
direct mapping between the upper and lower levels 
shown. The resulting single transducer is called 
the lexicon transducer. 
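
As a rough illustration of how composition makes the intermediate level disappear, the two mappings of Figures 5 and 6 can be written as two Python functions and then collapsed into one direct table. This is a sketch under simplifying assumptions, not the lexc implementation: the tag strings are split on "+", and the names `TEMPLATES`, `VOWELINGS`, `to_intermediate`, `to_stem` and `compose` are hypothetical.

```python
# Tag-to-shape mappings taken from the running example in the text.
TEMPLATES = {"+FormI+Perfect": "CVCVC"}
VOWELINGS = {"+Active": "aa", "+Passive": "ui"}


def to_intermediate(lexical: str) -> tuple[str, str, str]:
    """drs+FormI+Perfect+Active -> ('drs', 'CVCVC', 'aa')"""
    root, form, aspect, voice = lexical.split("+")
    return root, TEMPLATES[f"+{form}+{aspect}"], VOWELINGS[f"+{voice}"]


def to_stem(intermediate: tuple[str, str, str]) -> str:
    """Fill C slots with radicals and V slots with vowels, left to right."""
    root, template, voweling = intermediate
    radicals, vowels = iter(root), iter(voweling)
    return "".join(
        next(radicals) if ch == "C" else next(vowels) if ch == "V" else ch
        for ch in template
    )


def compose(lexical_strings: list[str]) -> dict[str, str]:
    """Precompute the direct upper-to-lower mapping; no intermediate level remains."""
    return {lex: to_stem(to_intermediate(lex)) for lex in lexical_strings}


if __name__ == "__main__":
    table = compose(["drs+FormI+Perfect+Active", "drs+FormI+Perfect+Passive"])
    print(table)
```

The resulting table pairs tagged lexical strings directly with stems, which is the role the lexicon transducer plays; the real transducer encodes this relation as a network rather than as an explicit list.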
All valid stems, currently about 85,000 of them, 
are automatically intersected, at compile time, at 
one level of the analysis. Suitable prefixes and 
suffixes are also present in the lexicon transducer, 
added in the normal concatenative ways. 
Stems like daras (دَرَس) and duris (دُرِس), and especially those
like banay (بَنَي) based on "weak" roots, are still quite abstract
and idealized compared to their ultimate surface realizations.
Finite-state rules map the idealized strings into surface strings,
handling all kinds of epentheses, deletions and assimilations. The
twolc rule compiler (Karttunen and Beesley, 1992) is able not only
to recompile the rules automatically but to intersect them into a
single rule fst. This rule fst is then composed on the bottom of
the lexicon fst, yielding a single Lexical Transducer. The symbol
.o. in Figure 7 indicates composition.

Lexicon FST
.o.
Rule FST
Lexical Transducer
Figure 7: Composition of Lexicon and Rule FSTs into a Single
Lexical Transducer

Lexical level:
   drs+FormI+Perfect+Active+3P+Fem+Sg
Surface level:
   drst
Figure 8: Typical Transduction from Lexical String to Unvoweled
Surface String درست
Another transducer is also composed on top of 
the lexicon fst to map various rule-triggering fea- 
tures, no longer needed, into epsilon and to enforce 
various long-distance morphotactic restrictions. 
All intermediate levels disappear in the compositions, and one is
left with a single two-level lexical transducer that contains
surface strings on the bottom and lexical strings, including roots
and tags, on the top. A typical transduction is shown in Figure 8,
where the final t (ت) is the surface realization of the
third-person feminine singular suffix -at. Fully voweled, the
surface string for this reading would be darasat (دَرَسَتْ).
Because short vowels are seldom written in surface words, drst is
also analyzed as the Form I perfect passive third-person feminine
singular, which would be fully voweled as durisat (دُرِسَتْ), and
as several other forms.
At runtime, strings being analyzed are simply matched along paths
on the bottom side of the lexical transducer, and the solution
strings are read off of the matching top side. Like all
finite-state transducers, it also generates as easily as it
analyzes, literally by running the transducer "backwards".

Lexical Cleanup Transducer
.o.
Lexicon Transducer
.o.
Rules that Generate Fully-Voweled Forms
.o.
Rules Generating from Fully Voweled Forms to All Surface Variations
Figure 9: Full System with Two Levels of Rules
The Arabic system runs in exactly the same way, using the same
runtime code, as the lexical transducers for other languages like
English, French and Spanish. The Arabic system is, however,
substantially slower than the other languages, because the
ambiguity of the surface words forces many dead-end analysis paths
to be explored and because more valid solutions have to be found
and returned. The mismatch between the concatenated root and
pattern on the lexical side and the intersected stem on the lower
side also creates an Arabic system that is substantially larger
than the other languages.
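
The bidirectional lookup described in this section can be pictured with a small stand-in for the lexical transducer: a finite list of (lexical, surface) pairs indexed in both directions. This is only a sketch; a real transducer encodes the relation as a network and is not an explicit table. The names `PAIRS`, `analyze` and `generate` are hypothetical, and the passive tag string is an assumption formed by analogy with Figure 8.

```python
from collections import defaultdict

# (lexical string, surface string) pairs standing in for the transducer.
# The passive lexical string is assumed by analogy with Figure 8.
PAIRS = [
    ("drs+FormI+Perfect+Active+3P+Fem+Sg", "darasat"),
    ("drs+FormI+Perfect+Active+3P+Fem+Sg", "drst"),
    ("drs+FormI+Perfect+Passive+3P+Fem+Sg", "durisat"),
    ("drs+FormI+Perfect+Passive+3P+Fem+Sg", "drst"),
]

UPPER_TO_LOWER = defaultdict(list)
LOWER_TO_UPPER = defaultdict(list)
for upper, lower in PAIRS:
    UPPER_TO_LOWER[upper].append(lower)
    LOWER_TO_UPPER[lower].append(upper)


def analyze(surface: str) -> list[str]:
    """Match a surface string on the lower side and read off the upper side."""
    return list(LOWER_TO_UPPER.get(surface, []))


def generate(lexical: str) -> list[str]:
    """Run the same relation 'backwards': upper side in, lower side out."""
    return list(UPPER_TO_LOWER.get(lexical, []))


if __name__ == "__main__":
    print(analyze("drst"))
    print(generate("drs+FormI+Perfect+Active+3P+Fem+Sg"))
```

The unvoweled input returns both the active and the passive reading, while a fully voweled input returns only one, which is the ambiguity effect that slows the Arabic system down.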
5 Generation 
A single underlying Arabic word may be spelled many ways on the
surface, depending on how completely the writer specifies the
diacritics. Because the system described above recognizes all
possible written forms of a word, with varying degrees of
diacritical marking, it also generates all the possible surface
forms of a word, which may be less than useful in many
applications. Typically, a user wants to see only the fully voweled
form during generation.
The Arabic rules have now been modified to work in two steps, first
to generate the fully voweled form, and then to generate the
various partially voweled forms and the unvoweled form. Where
desired, the lexicon fst can be composed with only the upper set of
rules to make a lexical transducer that generates (and recognizes)
only fully-voweled surface forms. For general recognition, both
sets of rules, as shown in Figure 9, are composed. The result is
equivalent to the original lexical transducer described in Figure
7.
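
The two-step arrangement can be sketched as two small functions, one producing the fully voweled form and one expanding it into every partially voweled and unvoweled spelling. This is a simplified illustration in romanized transliteration, not the actual rule set: it assumes that every a, i or u is an optional short-vowel diacritic, which ignores long vowels and other diacritics. The names `FULLY_VOWELED`, `surface_variants` and `generate` are hypothetical.

```python
from itertools import product

# Assumption: in this toy transliteration every a/i/u is a short vowel
# that the writer may leave out.
SHORT_VOWELS = frozenset("aiu")

# Step 1 stand-in: lexical string -> fully voweled surface form.
FULLY_VOWELED = {"drs+FormI+Perfect+Active+3P+Fem+Sg": "darasat"}


def surface_variants(fully_voweled: str) -> list[str]:
    """Step 2: keep or drop each short vowel independently."""
    options = [(ch, "") if ch in SHORT_VOWELS else (ch,) for ch in fully_voweled]
    variants = {"".join(choice) for choice in product(*options)}
    # Longest (most fully voweled) spellings first, then alphabetical.
    return sorted(variants, key=lambda s: (-len(s), s))


def generate(lexical: str, fully_voweled_only: bool = True) -> list[str]:
    """Use only the upper rules for generation, both levels for recognition."""
    voweled = FULLY_VOWELED[lexical]
    return [voweled] if fully_voweled_only else surface_variants(voweled)


if __name__ == "__main__":
    key = "drs+FormI+Perfect+Active+3P+Fem+Sg"
    print(generate(key))
    print(generate(key, fully_voweled_only=False))
```

With three optional vowels the second step yields eight spellings, from darasat down to drst, which shows why a user who is generating usually wants the first step alone.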
6 Conclusion 
Arabic morphology, though considerably more difficult than the
morphology found in the commonly studied European languages, is
fully susceptible to finite-state analysis techniques, either in an
enhanced two-level morphology or in the mathematically equivalent
but much more computationally efficient Xerox finite-state format.
We hope to extend our finite-state techniques to cover Hebrew and
other languages with exotic morphology.
References 
Antworth, Evan L. 1990. PC-KIMMO: A Two- 
level Processor for Morphological Analysis. Occa- 
sional Publications in Academic Computing No. 
16. Dallas: Summer Institute of Linguistics. 
Beesley, Kenneth R. 1989. Computer Analysis of Arabic Morphology: A
Two-Level Approach with Detours. Read at the Third Annual Symposium
on Arabic Linguistics, University of Utah, Salt Lake City, Utah,
3-4 March 1989. Published in Bernard Comrie and Mushira Eid (eds.),
Perspectives on Arabic Linguistics III: Papers from the Third
Annual Symposium on Arabic Linguistics, Amsterdam: John Benjamins,
pp. 155-172.
Beesley, Kenneth R.; Buckwalter, Tim; and Newton, Stuart N. 1989.
Two-Level Finite-State Analysis of Arabic Morphology. In
Proceedings of the Seminar on Bilingual Computing in Arabic and
English, Cambridge, England, 6-7 Sept 1989. No pagination.
Beesley, Kenneth R. 1990. Finite-State Description of Arabic
Morphology. In Proceedings of the Second Cambridge Conference on
Bilingual Computing in Arabic and English, 5-7 Sept 1990. No
pagination.
Beeston, A.F.L. 1968. Written Arabic: an ap- 
proach to the basic structures. Cambridge: Cam- 
bridge University Press. 
Buckwalter, Timothy A. 1990. Lexicographic 
Notation of Arabic Noun Pattern Morphemes and 
Their Inflectional Features. In Proceedings of the 
Second Cambridge Conference on Bilingual Com- 
puting in Arabic and English, 5-7 Sept 1990. No 
pagination. 
Gajek, Oliver et al. 1983. LISP Implementation. Texas Linguistic
Forum 22 ed. by Dalrymple et al. Austin: Linguistics Department,
University of Texas, pp. 187-202.
Kaplan, Ronald M. and Kay, Martin. 1981. Phonological rules and
finite-state transducers [Abstract]. Linguistic Society of America
Meeting Handbook. Fifty-Sixth Annual Meeting, December 27-30, 1981.
New York.
Kaplan, Ronald M. and Kay, Martin. 1994. 
Regular Models of Phonological Rule Systems. 
Computational Linguistics. 20:3, pp. 331-378. 
Karttunen, Lauri. 1983. A General Morphological Processor. Texas
Linguistic Forum 22 ed. by Dalrymple et al. Austin: Linguistics
Department, University of Texas, pp. 165-186.
Karttunen, Lauri. 1991. Finite-State Constraints. In the
Proceedings of the International Conference on Current Issues in
Computational Linguistics. June 10-14, 1991. Penang: Universiti
Sains Malaysia.
Karttunen, Lauri; Kaplan, Ronald M.; and Za- 
enen, Annie. 1992. Two-Level Morphology with 
Composition. COLING 92, pp. 141-148. 
Karttunen, Lauri. 1993. Finite-State Lexicon 
Compiler. Technical Report. ISTL-NLTT-1993- 
04-02. Xerox Palo Alto Research Center. Palo 
Alto, California. 
Koskenniemi, Kimmo. 1983. Two-Level Mor- 
phology: A General Computational Model for 
Word-Form Recognition and Production. Publi- 
cation No. 11. Helsinki: Department of General 
Linguistics, University of Helsinki. 
Koskenniemi, Kimmo. 1984. A General Com- 
putational Model for Word-Form Recognition and 
Production. COLING 84, pp. 178-181. 
Karttunen, Lauri and Beesley, Kenneth R. 
1992. Two-Level Rule Compiler. Technical Re- 
port. ISTL-1992-2. Xerox Palo Alto Research 
Center. Palo Alto, California. 
McCarthy, J. 1981. A Prosodic Theory of Non- 
concatenative Morphology. Linguistic Inquiry, 
12(3), pp. 373-418. 
Wehr, Hans. 1976. A Dictionary of Modern Written Arabic. Third
edition, ed. by J. Milton Cowan. Ithaca, N.Y.: Spoken Language
Services, Inc.
