A COMPRESSION TECHNIQUE FOR ARABIC DICTIONARIES :
THE AFFIX ANALYSIS.
Abdelmajid BEN HAMADOU 
Department of Computer Science - FSEG Faculty
B.P 69 - Route de l'aéroport -
SFAX - TUNISIA 
ABSTRACT 
In every application that involves the automatic
processing of natural language, the problem of
dictionary size arises. In this paper, we propose a
dictionary compression algorithm based on an affix
analysis of non-diacritical Arabic.
It consists in decomposing a word into its basic
elements, taking into account the different linguistic
transformations that can affect the morphological
structures.
This work has been carried out as part of a study of
the automatic detection and correction of spelling errors
in non-diacritical Arabic texts.
I- INTRODUCTION 
In every application that involves the automatic
processing of natural language, the problem of
dictionary size arises. This important question can be
approached in several ways, in particular :
- By grouping together the common prefixes of the
words of the language. In the PIAF system (an interactive
program for the analysis of French), for instance, words
are represented in chained lists following alphabetical
order [COUR 77].
EX : PART -> PARTIE -> PARTIEL -> PARTIES -> PARTOUT ...
- By creating multiple dictionaries, one for each
major topic area. This approach requires, in addition,
a common base dictionary. When a particular area is
concerned, a temporary master dictionary is created
by augmenting the base dictionary with selected local
ones.
- By using affix analysis, which consists in
performing a morphological analysis in order to identify,
in a given word, the redundant elements
(affixes). The dictionary is then limited to the
non-redundant elements (roots). This technique is used
in particular in the DEC-10 SPELL system for detecting
and correcting spelling errors.
In the present paper, we develop this last
approach for non-diacritical Arabic.
The particularities of the algorithms that we
propose stem, in large part, from the specific features
of the language used :
- Words are written in consonantal form
- Words can contain infixes 
- Morphological structures can be altered by 
linguistic transformations. 
This work has been developed within a national
research project for the study of the automatic detection
and correction of spelling errors in Arabic texts
[BEN 86].
II - THEORETICAL ASPECTS 
Let $V$ be a finite set
and $V^{*}$ the set of words built on $V$, including
the null string, noted $\emptyset$.
$W \in V^{*}$ : $W = W_1 W_2 \ldots W_n$, $W_i \in V$, $i \in [1, n]$
Let $V^{+} = V^{*} - \{\emptyset\}$
1°/ Prefix ( W )
Let $W = W_1 W_2 \ldots W_n$, $W \in V^{+}$
We call order $i$ prefix the quantity $P_i = W_1 W_2 \ldots W_i$
$(1 \le i < n-1)$
The order 0 prefix is $\emptyset$.
2°/ Suffix ( W )
Let $W = W_1 W_2 \ldots W_n$, $W \in V^{+}$
We call order $j$ suffix the quantity $S_j = W_j W_{j+1} \ldots W_n$
$(1 \le j \le n)$
The order $n+1$ suffix is $\emptyset$.
3°/ Infix ( W )
Let $W = W_1 W_2 \ldots W_n$, $W \in V^{+}$
We call order 1 infix the quantity $I = W_k$
$(1 \le k \le n)$
We call order 2 infix the quantity
$I = W_k, W_l$
$(1 \le k < l \le n)$
The order zero infix is $\emptyset$.
4°/ Root ( W )
Let $W = W_1 W_2 \ldots W_n$, $W \in V^{+}$
We call Root the quantity : $R = W_p \ldots W_q$
$(1 \le p < q \le n)$, $(\mathrm{card}(R) \le q-p+1)$
5°/ Card ( $\mathcal{P}_i$ )
Let $W = W_1 \ldots W_n$, $W \in V^{+}$
Let $\mathcal{P}_i = \{\emptyset, P_1, P_2, P_3, \ldots, P_i\}$
- $\mathrm{Card}(\mathcal{P}_i) = i + 1$ if $i \ge 1$
- $\mathrm{Card}(\mathcal{P}_i) = 1$ if $\mathcal{P}_i = \{\emptyset\}$
6°/ Card ( $\mathcal{S}_j$ )
Let $W = W_1 \ldots W_n$, $W \in V^{+}$
Let $\mathcal{S}_j = \{\emptyset, S_j, S_{j+1}, \ldots, S_n\}$
- $\mathrm{Card}(\mathcal{S}_j) \le n-j+2$ if $(1 \le j < n)$
- $\mathrm{Card}(\mathcal{S}_j) = 1$ if $\mathcal{S}_j = \{\emptyset\}$
III- AFFIX ANALYSIS 
1. Morphological decomposition
The affix analysis consists in decomposing a
given word into its basic elements, among which we can
distinguish the affixes (prefix, infix and suffix),
which are the redundant elements of the language, and
the root, which is its non-redundant one.
This decomposition is based on the derivational
structure of the language : nearly all words are
obtained by adding an affix combination to a given
root.
EX : a word segmented into Suffix, Infix, Root and Prefixes
[Arabic script not recoverable from the source] :
- Root = kataba
- Prefix, Infix, Suffix = [not recoverable]
Among the possible affix combinations, we distinguish
those that are valid and those that are not.
Valid combinations constitute what is called a
Morphological Pattern (MP).
For a given word, the number of possible morphological
decompositions depends on the root, according
to whether or not it contains characters which can be
assimilated to different affixes.
This number is calculated using the following
formula :
$N_d = \mathrm{Card}(\mathcal{P}_i) \cdot \mathrm{Card}(\mathcal{S}_j)$
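As an illustration of this count, the following Python sketch enumerates the candidate prefixes and suffixes of a word and multiplies the two cardinalities. The affix tables and the transliterated word are hypothetical toy data, not the tables used in this work; the null affix is represented by the empty string.

```python
# Hypothetical toy tables (Latin transliteration, consonantal writing).
PREFIX_TABLE = ("y", "s", "sy")
SUFFIX_TABLE = ("n", "wn")


def candidate_prefixes(word, table=PREFIX_TABLE):
    """Null prefix plus every table prefix that starts the word."""
    return [""] + [p for p in table if len(p) < len(word) and word.startswith(p)]


def candidate_suffixes(word, table=SUFFIX_TABLE):
    """Null suffix plus every table suffix that ends the word."""
    return [""] + [s for s in table if len(s) < len(word) and word.endswith(s)]


def count_decompositions(word):
    """Nd = Card(candidate prefixes) * Card(candidate suffixes)."""
    return len(candidate_prefixes(word)) * len(candidate_suffixes(word))


# "yktbwn" is a made-up consonantal form used only for the demonstration.
print(count_decompositions("yktbwn"))  # 2 prefix candidates * 3 suffix candidates
```

The count only bounds the number of (prefix, suffix) pairs to examine; most of them are expected to be rejected later by the validation step.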
2. Study of the morphological transformations
The morphological derivation from a root can be
accompanied by transformations caused by linguistic
phenomena such as assimilation, contraction and
metathesis.
These transformations can affect the root as
well as the affixes (MP). The roots affected are
mainly those which contain the characters
yaa (ي), waw (و) and hamza (ء).
EX 1 : Root affected.
Consider a root and the MP = (P, I, $\emptyset$) :
Derivation -> Transformation [Arabic forms not recoverable from the source].
EX 2 : Affix affected.
Consider the root daja'a and the MP = (P, I, $\emptyset$) :
Derivation -> Transformation :
daja'a -> idtaja'a -> [transformed form not recoverable from the source]
The morphological transformations can be classified
into two categories :
- The morpho-phonological transformations are
those that substitute one character for another
without changing the length of the word (isometric
transformations) (see EX 1 and EX 2).
- The purely phonological transformations are
those that suppress one or more characters and therefore
modify the length of the word.
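The distinction between the two categories rests on word length alone, which can be sketched as follows in Python. The function name and the returned labels are illustrative choices, not part of the original programs; the pair yawqifu / yaqifu is the transliterated pair of EX 3.

```python
def classify_transformation(derived, transformed):
    """Classify a transformation by comparing the lengths of the two forms."""
    if len(derived) == len(transformed):
        # Same length: one character was substituted for another.
        return "morpho-phonological (isometric)"
    if len(transformed) < len(derived):
        # Shorter result: one or more characters were suppressed.
        return "purely phonological (characters suppressed)"
    # Assumption of this sketch: a transformation never lengthens a word.
    raise ValueError("a transformation is not expected to lengthen the word")


print(classify_transformation("yawqifu", "yaqifu"))
```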
EX 3 : consider the root waqafa
Derivation -> Transformation :
waqafa -> yawqifu -> yaqifu
EX 4 : consider the root ahada. The derivation is
followed by a transformation that removes a character
from the derived form [Arabic forms not recoverable from the source].
These transformations are a source of ambiguity
for the morphological decomposition. To remove these
ambiguities, we use heuristics, among which we can
mention for instance :
Let $D$ be the morphological derivation operator
such that : $D(R, P, I, S) = W$, $W \in V^{+}$
and $T$ the operator composed of a derivation followed
by a transformation. Let $D^{-1}$ be the morphological
decomposition operator (inverse of $D$) and $T^{-1}$ the
morphological decomposition operator taking into account the
transformational rules (inverse of $T$).
Consider $W$ the word to be analysed.
If $D^{-1}(W) = (R_1, P_1, I_1, S_1)$, $R_1 \in V^{+}$ and $P_1, I_1, S_1 \in V^{*}$
and $T^{-1}(W) = (R_2, P_2, I_2, S_2)$, $R_2 \in V^{+}$ and $P_2, I_2, S_2 \in V^{*}$
then $R_1$ is the selected root ($R_2$ is rejected).
EX : [Arabic example not recoverable from the source.]
This heuristic means that the transformations cannot
be applied at the expense of semantics.
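The heuristic can be sketched in Python as a simple priority rule. The tuple layout (root, prefix, infix, suffix) and the function name are assumptions made for the illustration; falling back on the second root when the direct decomposition fails is also an assumption, since the text only states which root wins when both decompositions succeed.

```python
def select_root(direct, with_rules):
    """Choose between the two decompositions of the same word.

    direct     -- result of the decomposition without transformation rules
    with_rules -- result of the decomposition using transformation rules
    Each argument is a (root, prefix, infix, suffix) tuple, or None when
    the corresponding decomposition fails.
    """
    if direct is not None:
        return direct[0]      # R1 is selected, R2 is rejected
    if with_rules is not None:
        return with_rules[0]  # assumption: use R2 only when R1 does not exist
    return None


# Made-up placeholder roots "r1" and "r2".
print(select_root(("r1", "", "", ""), ("r2", "", "", "")))
```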
IV - IMPLEMENTATION
The affix analysis is composed of two modules
(see Fig. 1) :
- morphological decomposition module
- validation module
1. The morphological decomposition module
identifies the different candidate combinations.
It is executed in two steps :
Step one : identification of prefixes and suffixes
by using a table of prefixes and a table of
suffixes.
Step two : identification of the infix by
analysing the remaining chain after eliminating P and
S.
The analyser has a single initial state and
as many paths as there are infix possibilities.
The interest of performing this decomposition
in two steps lies in the use of a single analyser
to recognise all the morphological forms.
We distinguish different morphological patterns.
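The two steps can be sketched as follows in Python. This is a minimal illustration, not the FORTRAN analyser: the tables are hypothetical toy data in Latin transliteration, an infix pattern is modelled as (position, character) pairs inside the remaining chain, and candidate roots are restricted to three or four characters, which is an assumption borrowed from the root dictionary described later.

```python
from itertools import product

# Hypothetical toy tables; "" stands for the null affix.
PREFIXES = ("", "y", "a")
SUFFIXES = ("", "wn", "t")
# Each infix pattern lists (position, character) pairs; () is the null infix.
INFIX_PATTERNS = ((), ((1, "t"),))


def split_affixes(word):
    """Step one: every (prefix, stem, suffix) split allowed by the tables."""
    for p, s in product(PREFIXES, SUFFIXES):
        if len(p) + len(s) < len(word) and word.startswith(p) and word.endswith(s):
            yield p, word[len(p):len(word) - len(s)], s


def strip_infix(stem, pattern):
    """Step two: remove the infix if the stem carries it at the given positions."""
    if any(pos >= len(stem) or stem[pos] != ch for pos, ch in pattern):
        return None
    positions = {pos for pos, _ in pattern}
    return "".join(c for i, c in enumerate(stem) if i not in positions)


def decompose(word):
    """Return all candidate (root, prefix, infix, suffix) decompositions."""
    results = []
    for p, stem, s in split_affixes(word):
        for pattern in INFIX_PATTERNS:
            root = strip_infix(stem, pattern)
            if root is not None and len(root) in (3, 4):
                infix = "".join(ch for _, ch in pattern)
                results.append((root, p, infix, s))
    return results


# "aktsb" is a made-up consonantal form used only for the demonstration.
print(decompose("aktsb"))
```

On this toy input the sketch returns two candidates, one with a four-character root and a null infix and one with a three-character root and the infix "t", which shows why a separate validation step is needed.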
2. Validation module
The two preceding steps lead to a list of
candidate decompositions. It is necessary to apply to
this list an adequate validation mechanism to sort
out the valid decompositions.
This filtering can yield multiple solutions.
In that case, we speak of morphological ambiguity,
which cannot be removed without considering the
context of the word in the sentence.
However, when affix analysis is used only to
verify whether or not a word belongs to
the language, the first valid decomposition is
sufficient.
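The difference between the two uses of the filter can be sketched in Python. The candidate tuples, the toy root set and the predicate are hypothetical; in the analysis described here the predicate would combine the affix congruity checks and the root dictionary look-up.

```python
def first_valid(candidates, is_valid):
    """Stop at the first accepted decomposition (enough for spell checking)."""
    return next((c for c in candidates if is_valid(c)), None)


def all_valid(candidates, is_valid):
    """Keep every accepted decomposition (morphological ambiguity possible)."""
    return [c for c in candidates if is_valid(c)]


# Hypothetical candidates as (root, prefix, infix, suffix) tuples.
candidates = [("ktsb", "a", "", ""), ("ksb", "a", "t", "")]
toy_roots = {"ksb", "ktb"}          # stand-in for the root dictionary


def root_is_known(candidate):
    """Toy validity test: accept a candidate whose root is in the toy set."""
    return candidate[0] in toy_roots


print(first_valid(candidates, root_is_known))
print(all_valid(candidates, root_is_known))
```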
Fig. 1 : Functional Affix Analysis diagram. The word W = W1 W2 ... Wn
is submitted to the ANALYSER, which relies on the tables of prefixes
and suffixes, the CONGRUITY MATRIX and the list of morphological
codes, and delivers the root R and the affixes (P, I, S).
The validation is based on the principle of affix
congruity and on the result of the root dictionary
check.
The affix congruity arises at three different
levels :
- Compatibility between the prefix (P) and
the suffix (S).
- Compatibility between the couple (P, S) and
the infix (I).
- Compatibility between the Morphological
Pattern (P, I, S) and the Root (R).
The compatibility between P and S is obtained from
the affix congruity matrix $C(P_i, S_j)$, composed of 609
elements (21 prefixes and 29 suffixes). The values
attributed to a couple $(P_i, S_j)$ are :
$C(P_i, S_j) = 0$ if $P_i$ and $S_j$ are incompatible
$C(P_i, S_j) = N_k$ if $P_i$ and $S_j$ are compatible, $N_k \in [1, 226]$
The compatibility between $(P_i, S_j)$ and the infix
$I_k$ is obtained by performing the intersection of the
Morphological Code (MC) generated by the analyser
with the set of Morphological Codes associated with
the couple $(P_i, S_j)$. This set or list is referred to
by $N_k$. Let $L$ be this list : $L = \{MC_1, MC_2, \ldots, MC_l\}$
If $MC \wedge MC_i = 0$, $i \in [1, l]$, then $(P_i, S_j)$ and $I_k$ are incompatible.
If $MC \wedge MC_i = MC$, $i \in [1, l]$, then $(P_i, S_j)$ and $I_k$ are compatible.
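The two congruity checks above can be sketched in Python. Several points are assumptions of this sketch rather than facts from the text: the Morphological Codes are modelled as integer bit masks, the intersection is taken to be a bitwise AND, a couple is declared compatible with the infix as soon as one code of its list satisfies the second condition, and all table contents are made-up toy values.

```python
# Hypothetical congruity matrix: (prefix, suffix) -> list number N_k.
# A missing couple stands for the value 0 (incompatible).
CONGRUITY = {("a", ""): 1, ("y", "wn"): 2}

# Hypothetical lists of Morphological Codes, indexed by N_k.
CODE_LISTS = {1: [0b0011, 0b0100], 2: [0b0001]}


def prefix_suffix_compatible(prefix, suffix):
    """First level: the matrix entry of the couple is not zero."""
    return CONGRUITY.get((prefix, suffix), 0) != 0


def affixes_compatible(prefix, suffix, infix_code):
    """Second level: the infix code is contained in one code of the list."""
    n_k = CONGRUITY.get((prefix, suffix), 0)
    if n_k == 0:
        return False
    return any(infix_code & mc_i == infix_code for mc_i in CODE_LISTS[n_k])


print(affixes_compatible("a", "", 0b0001))   # contained in 0b0011
print(affixes_compatible("a", "", 0b1000))   # disjoint from both codes
```

Storing a list number in the matrix rather than the codes themselves keeps the matrix small, since several couples can share the same list of codes.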
The compatibility of the Morphological Pattern
with the root does not have a morphological origin;
it is essentially semantic.
EX : the word [Arabic form not recoverable from the source]
does not exist because its root and its MP
are incompatible.
The detection of this incompatibility requires
flagging each root in the dictionary with its legal
non-systematic morphological patterns (e.g. derived
verb forms, 'masdar', some nouns).
The dictionary look-up verifies whether
the analysed word belongs to the linguistic corpus or
not. It plays a decisive part in identifying the valid
root if the analysis, for one morphological pattern,
generates several candidate roots (nondeterministic
analysis).
EX : Consider the root ahada and the MP = (P, I, S)
Derivation -> Transformation :
ahada -> i'tahada -> ittahada
The decomposition of the target word ittahada
according to the transformation rules gives the three
plausible roots :
ahada, wahada, tahada
These transformation rules are the following ones :
r2 : iwta -> itta
r3 : i'ta -> itta
The dictionary look-up makes it possible to eliminate the
candidates tahada and wahada.
Our root dictionary has been built by
taking a census of the roots related to the linguistic
corpus of the Maghreb countries. This corpus was
compiled by the Permanent Commission of Functional
Arabic [PCFA 76].
The size of the resulting dictionary is about 1,500
three-character roots and 100 four-character roots.
It can easily be extended thanks to its evolutionary
structure.
Access to this dictionary is direct. The access
argument is calculated from the first three characters
of the root and its length L.
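The text does not give the formula of the access argument, so the following Python sketch shows only one possible way of deriving a key from the first three characters of a root and its length. The transliteration alphabet, the key formula, the toy roots and the use of short lists to absorb collisions are all assumptions of the sketch.

```python
# Hypothetical Latin stand-ins for the consonants used in the toy roots.
ALPHABET = "'btjhxdrzsfqklmnwy"
BASE = len(ALPHABET)


def access_argument(root):
    """Key built from the first three characters and the length (3 or 4)."""
    if len(root) not in (3, 4):
        raise ValueError("only three- and four-character roots are handled")
    code = 0
    for ch in root[:3]:
        code = code * BASE + ALPHABET.index(ch)
    return code * 2 + (len(root) - 3)


def build_table(roots):
    """Map each access argument to the roots stored under it."""
    table = {}
    for root in roots:
        table.setdefault(access_argument(root), []).append(root)
    return table


def contains(table, root):
    """Direct access: compute the key, then compare with the stored roots."""
    return root in table.get(access_argument(root), [])


table = build_table(["ktb", "wqf", "'xd", "trjm"])
# Filtering three made-up candidate roots through the look-up.
print([r for r in ("'xd", "wxd", "txd") if contains(table, r)])
```

With such a key, two four-character roots sharing their first three characters would fall under the same argument, which is why the sketch keeps a short list per key.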
V- CONCLUSION 
The affix analysis makes it possible to replace a large
dictionary by one containing roots only. This technique has
proved efficient for Arabic because of its derivational
structure. We have tested this technique on a corpus
of about 100,000 words, using the dictionary
of 1,600 roots.
The programs are written in FORTRAN for reasons
of portability, easy calculation of the dictionary
access argument and index manipulation.
Used in the context of the detection and correction
of spelling errors, the affix analysis is interesting
in that it :
- makes it easier to use the dictionary loaded
in memory,
- performs a natural cutting of the words,
which facilitates the automatic correction algorithms
based on inferential mechanisms and heuristics.
These features give the suggested algorithms some
originality and make them a contribution to the work in the
field of Arabic morphological analysis.
BIBLIOGRAPHY 

(BEN 86) - A. BEN HAMADOU : Automatic detection and
correction of spelling errors in Arabic texts.
2nd International Baghdad Conference, 24-26 March 86.

(COUR 77) - J. COURTIN : Algorithmes pour le traitement
interactif des langues naturelles. Th. Etat, GRENOBLE 77.

(WOOD 70) - W.A. WOODS : Transition Network grammars for
natural language analysis. C.A.C.M., Vol. 13, N° 10, Oct 70.

(PCFA 76) - Permanent Commission of Functional Arabic :
L'arabe Fonctionnel. 2nd Edition - Tunis 1976.
