LOGIC COMPRESSION OF DICTIONARIES FOR 
MULTILINGUAL SPELLING CHECKERS 
Boubaker MEDDEB HAMROUNI
GETA, IMAG-campus (UJF & CNRS)
BP 53, F-38041 Grenoble Cedex 09, FRANCE
Boubaker.Meddeb-Hamrouni@imag.fr
&
WinSoft SA.
34, Bd. de l'Esplanade
F-38000 Grenoble, FRANCE
ABSTRACT 
To provide practical spelling checkers on micro-computers, good
compression algorithms are essential. Current techniques used to
compress lexicons for Indo-European languages provide efficient
spelling checkers. Applying the same methods to languages which have
a different morphological system (Arabic, Turkish,...) gives
insufficient results. To get better results, we apply other "logical"
compression mechanisms based on the structure of the language itself.
Experiments with multilingual dictionaries show a significant
reduction rate attributable to our logic compression alone and even
better results when using our method in conjunction with existing
methods.
KEY WORDS: Spelling checkers, Multilinguism, 
Compression, Dictionary, Finite-state machines. 
INTRODUCTION 
Since the first work in 1957 by Glantz [6], a great deal of theorizing
and research has taken place on the subject of spelling verification
and correction. Many commercial products (word processors, desktop
presentation,...) include efficient spelling checkers on
micro-computers. The classical methods used are generally based on a
morphological analyzer. This is sufficient to provide a robust
monolingual spelling checker, but using morphological analyzers can
become unrealistic when we want to develop a universal solution. In
fact, the analyzers built for each language use various linguistic
models and engines, and it is impossible to convert a morphological
analyzer from one formalism to another. Furthermore, using these
classical methods would lead to combining into the host application as
many grammars and parsers as languages, which would increase the code
size and the maintenance problem of rules and data. The method
presented in this paper is based on building a dictionary of all
surface forms for each language, which is sufficient for spelling
checker applications. The dictionary built with the existing
generators can be easily updated manually but may be huge, especially
for some agglutinative languages (Arabic, Turkish,...). A compression
process on the multilingual dictionaries is necessary to obtain a
reduced size. The existing compression methods generally used are
physical and provide good results for Indo-European languages.
Applying the same techniques to other languages (Arabic, Turkish,...)
shows their limits. For this reason we introduce a new kind of
compression technique that we call "logic compression". This new
technique requires primitive morphological knowledge during the
compression process and requires less storage space than previous
methods. It also has the advantage of being a universal method
applicable to all languages.
Section 1 contains an overview of existing methods for building spell
checkers and the limits of such systems when we take into account new
constraints such as multilingualism. Section 2 outlines the first two
steps of our work: we adapt an existing method to Arabic, then make a
first extension by introducing a new kind of compression called "logic
compression". Section 3 introduces in detail the logic compression
with its application to other languages, and shows the improvements
obtained when using logic compression in conjunction with existing
methods. Section 4 outlines the architecture of our multilingual
spelling checker system and some future projects.
I. OVERVIEW OF EXISTING
METHODS
I.1. Grammar-based approach
These methods were used in the beginning on early computers, when
storage space was expensive. They consist in building a small lexicon
containing roots and affixes, a grammar of rules that express the
morphographemic alternations, and an engine that uses the grammar and
the lexicon to see if an input word belongs to the language or not.
If the process of recognition fails, some operations (substitution,
insertion,...) are performed on the misspelled word to provide a list
of candidate words that helps the user to select the correct form.
Even though it is a great accomplishment to design a powerful engine
[3] [8] and to express rules in a pseudo-natural way [9], even for
different languages [1] [2] [11], these systems present some limits:
- Multilinguism: These methods do not support all languages. To offer
a multilingual solution for n languages you have to store n grammars
and n lexicons, and generally n different engines, in the host
application.
- Cost of retrieval: For some languages, the retrieval of words may be
long. For instance, a vocalized Arabic spell checker must accept
non-vocalized or partially vocalized words, which require more time to
be accepted than fully vocalized words.
- Cost of guessing alternatives for a misspelled word: To guess a
correct word when a misspelled word is found, we have to modify the
misspelled word by all possible operations (substitution, insertion,
suppression,...) for 1 or 2 characters and then try to check the
results. This can take a lot of time before the correct forms are
displayed to end-users.
- Maintaining the grammars and data: The grammars and lexicon require
continuous updating. You need to find a multilingual computational
linguist who knows the linguistic theory and the formalism to easily
update data and rules [8].
- Ergonomic features: In some languages, end users want to have some
options that let them choose how the spell checker will accept words.
In Arabic, for example, different regions have slightly different
orthographical conventions.
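The cost of guessing described in the list above can be made concrete with a small sketch. It is only an illustration, not the engine of any cited system: the function names, the Latin alphabet and the toy lexicon are assumptions introduced here. It enumerates every string that is one substitution, insertion or suppression away from the misspelled word and keeps those found in the lexicon.

```python
import string


def single_edit_candidates(word, alphabet=string.ascii_lowercase):
    """All strings one substitution, insertion or suppression away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    suppressions = {left + right[1:] for left, right in splits if right}
    substitutions = {
        left + c + right[1:] for left, right in splits if right for c in alphabet
    }
    insertions = {left + c + right for left, right in splits for c in alphabet}
    return (suppressions | substitutions | insertions) - {word}


def guess(word, lexicon, alphabet=string.ascii_lowercase):
    """Candidates that the lexicon accepts (the lexicon is a plain set here)."""
    return sorted(c for c in single_edit_candidates(word, alphabet) if c in lexicon)


# Hypothetical toy lexicon, for illustration only.
lexicon = {"spell", "spill", "shell", "spells", "sell"}
print(guess("spel", lexicon))
```

The number of candidates grows with the word length and the alphabet size, and each candidate must be checked, which is why the text notes that guessing can be slow when every check goes through a grammar-based engine.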
I.2. Lexical-based approach:
Lexical-based approaches appeared after the first methods described
above, when storage space became less expensive. The first step is to
build a complete list of surface forms belonging to the language using
morphological generators, SLLP (Specialized Languages for Linguistic
Programs), etc., and then to compress the large word-dictionary. These
methods are generally used for office applications such as word
processors, desktop presentation, etc. Their main advantage is that
they cover a complete language, since all the forms can be found in
the initial list. Also, they allow efficient retrieval and guessing of
misspelled words [4]. However, some limits exist in such systems:
- Multilinguism: The compression process gives a good ratio for
languages with a weak inflexion factor (English,...), where the
compression mechanism gives up to 150 KB of storage from around 3 MB
of a full list [4]. The compression technologies are still powerful
for languages with a medium inflexion factor (Russian,...). For
example, a list of all surface Russian words of between 10 and 15 MB
in size can be reduced to 700 KB [4]. For languages with a high
inflexion factor (Arabic, Finnish, Hungarian,...), it won't be easy to
find compression technologies that give practical results [4]. For
instance, a full list of completely vocalized words in Arabic has 300
MB in size and the current compression methods are impractical.
- No morphological knowledge: These methods are neutral with respect
to the text language; the efficiency of compression techniques may be
improved by using specific properties of the language [4].
II. A FIRST APPROACH: ADAPTING
AN EXISTING METHOD FOR ARABIC
II.1. Using an existing method
As a first step, we take an efficient method used to compress
dictionaries for European (English, French,...) spelling checkers [4]
and try to apply it to Arabic. The first step of our work consists in
building a full list of surface forms using a morphological generator
[5], completed by all irregular forms and an existing corpus. The
final large word dictionary, which covers non-vocalized Arabic, has a
size of 75 MB. The compression process yields 18 MB in a compressed
format. For an idea of the compression process readers can refer to
[10]. Table 1 gives some results of the compression process for a few
European languages, to show the efficiency of the method and its
inadequacy for the Arabic language.
            Word forms    Size uncompressed   Size compressed
Danish      448.000       5689 KB             725 KB
German      403.000       5297 KB             866 KB
Arabic      7 millions    75 MB               18 MB
English     88.000        841 KB              224 KB
Table 1
The result for Arabic is impractical for small computers. We must then
find other techniques that produce a smaller dictionary, or extend
this method, to get an exploitable solution.
II.2. Extension of the method:
The initial idea is applied to the morphological system of Arabic.
Since most of the fully inflected word forms in Arabic are built by
adding prefixes and suffixes to a stem, we propose replacing some
words with only one form, beginning with a special code that
represents a family of prefixes and finishing with another special
code which represents a family of suffixes. For this purpose, we wrote
a program in MPW-C that processes a full list of inflected forms and,
using an existing decomposition of affixes into sub-sets already
established, gives the reduced lexicon, where many forms are replaced
by only one representation (PSi stem SSj), where PSi (respectively
SSj) is the set i (respectively j) of prefixes (respectively
suffixes). Note that the reduced lexicon represents faithfully the
initial list without any silence (missing words) or noise (incorrect
words). Only compressed words are replaced, and the rest remain in the
reduced list. Figure 1 gives an example of words, an example of a
decomposition and the obtained result.
[Figure: three panels labelled "Full surface forms", "Decomposition" and "Reduced list"]
Fig. 1: Example of the compression process
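The replacement of a family of forms by a single (PSi stem SSj) entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the MPW-C program mentioned in the text: it handles a single prefix class and a single suffix class, the class codes "PS1" and "SS1", the transliterated affixes and the word list are all hypothetical, and stems are tried in a simple shortest-first order. A stem is coded only when every prefix + stem + suffix combination is present in the full list, so that expansion yields neither noise nor silence.

```python
def family(stem, prefixes, suffixes):
    """All surface forms obtained from one stem and the two affix classes."""
    return {p + stem + s for p in prefixes for s in suffixes}


def compress(forms, prefixes, suffixes):
    """Replace complete families by one coded entry; keep other forms as-is."""
    forms = set(forms)
    candidates = set()
    for form in forms:
        for p in prefixes:
            for s in suffixes:
                fits = form.startswith(p) and form.endswith(s)
                if fits and len(form) > len(p) + len(s):
                    candidates.add(form[len(p):len(form) - len(s)])
    coded, covered = [], set()
    for stem in sorted(candidates, key=lambda x: (len(x), x)):
        fam = family(stem, prefixes, suffixes)
        # Code a stem only if its whole family is attested (no noise).
        if len(fam) > 1 and fam <= forms and not fam <= covered:
            coded.append(("PS1", stem, "SS1"))
            covered |= fam
    # Forms not covered by any coded entry stay in the list (no silence).
    return coded + sorted(forms - covered)


def expand(reduced, prefixes, suffixes):
    """Rebuild the full list of surface forms from the reduced lexicon."""
    out = set()
    for entry in reduced:
        if isinstance(entry, tuple):
            out |= family(entry[1], prefixes, suffixes)
        else:
            out.add(entry)
    return out


# Hypothetical classes and forms; the empty string stands for "no affix".
prefixes = {"", "wa", "fa"}
suffixes = {"", "ha", "hom"}
forms = family("kitab", prefixes, suffixes) | {"qalam", "waqalam"}
reduced = compress(forms, prefixes, suffixes)
print(len(forms), "forms ->", len(reduced), "entries")
```

In this toy case the nine forms built on "kitab" collapse into one coded entry, while "qalam" and "waqalam" stay as plain words because their family is incomplete.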
The next crucial problem to resolve is to find the best decomposition,
that is, the one that provides the best reduced lexicon. The method
must be automatic. It must process the large word-dictionary and,
given an initial list of prefixes and suffixes, must give as output
the best decomposition and the optimal reduced dictionary. But, before
studying the implementation of such an algorithm, we began by seeing
how much space we could gain by this technique starting from a manual
decomposition.
• Manual method: Starting from different full lists for each category
of words (transitive verbs, nouns,...), we chose different
decompositions and processed the full list with the compression tool.
The best decomposition kept for each category was the decomposition
which eliminated the maximum number of forms. This method gave many
candidate decompositions depending on the grammatical category of the
word. To choose the best global one we took into account the frequency
of dictionary entries. This method was tested on different Arabic word
lists and some results are described here. Readers can refer to [10]
or [11] for more information. To see some decompositions, consider the
following sets:
E1 = {wa, fa}, E2 = {la, sa}, E3 = {ha, at}, ...
F1 = {tom, [?], ta, [?]},
F2 = {ya, [?], [?], [?]}, ...
F6 = {ha, [?], ya, ka, kom, kouma, konna, hom, houma, honna, [?]},
F7 = F6 \ {ya, [?]} + {hi},
F9 = {wa}, ...
Ei (respectively Fj) is a set of prefixes (respectively suffixes). We
denote by Ei.Ej (respectively Fi.Fj) the set of all strings built by
concatenating each element of Ei (respectively Fi) with each element
of Ej (respectively Fj).
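The concatenation notation can be illustrated with a few lines of Python. The helper name `concat` is an assumption made for this sketch; the sets E1 and E2 are the ones listed above, and "+" in the class definitions is read here as set union.

```python
def concat(*classes):
    """Ei.Ej...: every string built by taking one element from each class in order."""
    result = {""}
    for cls in classes:
        result = {left + right for left in result for right in cls}
    return result


E1 = {"wa", "fa"}
E2 = {"la", "sa"}

print(sorted(concat(E1, E2)))        # ['fala', 'fasa', 'wala', 'wasa']
print(sorted(E1 | concat(E1, E2)))   # the union E1 + E1.E2
```

A class built this way grows multiplicatively: two sets of two prefixes already describe four compound prefixes, which is what makes a single class code so economical.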
Example of 3 classes (out of 6) of the prefix classes:
P1 = E1. P2 = E4.
P3 = E3 + [?].E3 + E1.E2.E3
Example of 4 classes (out of 13) of the suffix classes:
S1 = F1. S2 = F2. S7 = F7. S8 = F9.F7.

[Figure: automaton with prefix arcs, consonant and vowel arcs, and suffix arcs]
Fig. 2: Initial automaton
• First results: case of Arabic: With all the classes already found
for Arabic (6 classes of prefixes, 13 classes of suffixes; each class
containing an average of 8 affixes), we processed a collection of
non-vocalized Arabic dictionaries (17 MB); the result was a reduced
lexicon of 254 KB. Using this in combination with the compression
process described in § I.2, the final result is 121 KB. Note also that
part of this work was implemented in a commercial multilingual word
processor (WinText ©) to offer Arabic spell checking.
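To put these sizes in perspective, the reduction factors can be computed directly from the figures quoted above. The sketch below assumes 1 MB = 1024 KB, a convention the text does not state; the function name is introduced here for illustration.

```python
KB_PER_MB = 1024  # assumption: binary megabytes


def reduction_factor(original_kb, compressed_kb):
    """How many times smaller the compressed lexicon is."""
    return original_kb / compressed_kb


original_kb = 17 * KB_PER_MB                     # the 17 MB word list
logic_only = reduction_factor(original_kb, 254)  # logic compression alone
combined = reduction_factor(original_kb, 121)    # logic + existing compression

print(f"logic compression alone: about {logic_only:.0f}x smaller")
print(f"combined with the existing method: about {combined:.0f}x smaller")
```

Under that assumption, logic compression alone shrinks the list by a factor of roughly 69, and the combination by a factor of roughly 144.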
III. LOGIC COMPRESSION:
III.1. Theoretical aspects:
Let $V$ be a finite set and $V^*$ the set of words built on $V$,
including the null string, denoted $\emptyset$.
$W \in V^*$. $W = W_1 W_2 \ldots W_n$. $W_i \in V$, $i \in [1..n]$.
Let $V^+ = V^* - \{\emptyset\}$.
Let $Y$ be a sub-set of $V$ that contains the vowels.
1. Prefix(W). $\forall W \in V^+$.
We call order $i$ prefix the quantity:
$P_i = W_1 W_2 \ldots W_i$, $(1 \le i \le n-1)$.
2. Suffix(W). $\forall W \in V^+$.
We call order $j$ suffix the quantity:
$S_j = W_j W_{j+1} \ldots W_n$, $(1 \le j \le n)$.
3. VocPat(W). $\forall W \in V^+$.
We call vocalic pattern of $W$ the set:
$V_y = \{W_i, W_j, \ldots, W_k\}$, $W_i \in Y$.
$\mathrm{card}(V_y) \le \mathrm{length}(W)$.
4. Root(W). $\forall W \in V^+$.
We call root the quantity:
$R = W_p \ldots W_q$, $(1 \le p < q \le n)$,
$\mathrm{card}(R) \le q - p + 1$.
5. $P_i$: Prefixes class. $P_i = \{\emptyset, P_{i1}, P_{i2}, \ldots, P_{ik}\}$.
$P_{ij}$ is a prefix, $1 \le j \le k$.
$\mathrm{Card}(P_i) = k + 1$ if $k \ge 1$;
$\mathrm{Card}(P_i) = 1$ if $P_i = \{\emptyset\}$.
6. $S_j$: Suffixes class. $S_j = \{\emptyset, S_{j1}, S_{j2}, \ldots, S_{jk}\}$.
$S_{ji}$ is a suffix, $1 \le i \le k$.
$\mathrm{Card}(S_j) = k + 1$ if $k \ge 1$;
$\mathrm{Card}(S_j) = 1$ if $S_j = \{\emptyset\}$.
7. $V_k$: Vowel class.
$V_k = \{\emptyset, Vy_{k1}, Vy_{k2}, \ldots, Vy_{kk}\}$.
$Vy_{ki}$ is a vocalic pattern, $1 \le i \le k$.
$\mathrm{Card}(V_k) = k + 1$ if $k \ge 1$;
$\mathrm{Card}(V_k) = 1$ if $V_k = \{\emptyset\}$.
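Definitions 1 to 3 translate almost directly into code. The sketch below is only an illustration: the function names, the transliterated word "kataba" and the vowel set are assumptions, indices are 1-based as in the definitions, and the vocalic pattern is returned as an ordered list rather than a set so that repeated vowels stay visible.

```python
def prefix(word, i):
    """Order-i prefix W1...Wi, defined for 1 <= i <= n-1."""
    if not 1 <= i <= len(word) - 1:
        raise ValueError("i must satisfy 1 <= i <= n-1")
    return word[:i]


def suffix(word, j):
    """Order-j suffix Wj...Wn, defined for 1 <= j <= n."""
    if not 1 <= j <= len(word):
        raise ValueError("j must satisfy 1 <= j <= n")
    return word[j - 1:]


def vocalic_pattern(word, vowels):
    """Characters of the word that belong to the vowel sub-set Y, in order."""
    return [c for c in word if c in vowels]


Y = set("aiu")        # hypothetical vowel sub-set
W = "kataba"          # hypothetical transliterated word

print(prefix(W, 2))            # ka
print(suffix(W, 5))            # ba
print(vocalic_pattern(W, Y))   # ['a', 'a', 'a']
```

As the definition requires, the pattern never has more elements than the word has characters.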
III.2. Logic Compression: What is it?
Let's take the following automaton, which represents some surface
vocalized words (fig. 2):
$P_{ij}$ is a prefix, $1 \le j \le n$.
$S_{ji}$ is a suffix, $1 \le i \le n$.
$C_i$ are the consonants of the vocabulary, $1 \le i \le k$.
$V_{ij}$ is the vowel attached to the consonant $C_j$,
$1 \le i \le q$ and $1 \le j \le k$.
$\emptyset$ is the null string.
This automaton recognizes all words beginning from an initial state
(marked by *) and finishing in a final state (marked by a double
circle).
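The number of arcs of such an automaton is:
$$\sum_{m=1}^{n} \mathrm{length}(P_{im}) + \sum_{m=1}^{n} \mathrm{length}(S_{jm}) + 2 + 2q(k-1)$$
where the constant 2 counts the two null-string arcs, one for the
prefix class and one for the suffix class. If we consider, for
example, that affixes have a single character, the number of arcs is
equal to $2(n+1) + 2q(k-1)$.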
Logic compression consists in supplying the classes of prefixes,
suffixes and vowels, and replacing each set by only one arc that
represents a family of prefixes, suffixes or vowels.
Starting from the following sets already established:
$P_i = \{\emptyset, P_{i1}, P_{i2}, \ldots, P_{in}\}$, a class of prefixes stored as x.
$S_j = \{\emptyset, S_{j1}, S_{j2}, \ldots, S_{jn}\}$, a class of suffixes stored as y.
$V_k = \{\{V_{11}, \ldots, V_{1k}\}, \{V_{21}, \ldots, V_{2k}\}, \ldots, \{V_{q1}, \ldots, V_{qk}\}\}$, a class of
vocalic patterns stored as z.
The logic compression reduces the initial automaton to this new one:
[Figure: reduced automaton]
Fig. 3: Reduced automaton
The number of arcs kept in the automaton is equal to 3 + k. The set
$V_k$ contains a sub-set of k vowels which must be applied to the last
k characters.
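The two arc counts can be compared numerically. The sketch below uses only the two closed forms stated in the text, the single-character-affix case 2(n+1) + 2q(k-1) for the initial automaton and 3 + k for the reduced one; the function names and the sample values of n, q and k are assumptions chosen for illustration.

```python
def initial_arcs(n, q, k):
    """Arcs of the initial automaton when every affix has a single character."""
    return 2 * (n + 1) + 2 * q * (k - 1)


def reduced_arcs(k):
    """Arcs kept after logic compression: one per class (x, y, z) plus k."""
    return 3 + k


# Hypothetical sizes: n affixes per class, q vowels per consonant, k consonants.
n, q, k = 8, 3, 3
print(initial_arcs(n, q, k), "arcs before,", reduced_arcs(k), "arcs after")
```

With these sample values the automaton goes from 30 arcs to 6. The saving grows with n and q, since neither appears in the reduced count.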
III.3. Experiments:
Logic compression with only an affix decomposition, built by the
manual method explained above, has been tested on various lists of
words that represent collections of multilingual dictionaries (lists
of inflected forms). Three languages were tested: non-vocalized
Arabic, which has a great inflexion factor; French, which has a weak
inflexion factor; and Russian, which has a medium inflexion factor.
Experiments were done in two ways: first by using our logic
compression alone, and then in conjunction with other methods, by
supplying the reduced lexicon (list of compressed words in text
format) obtained with our method as input to existing methods. The
three other methods tested are the following:
• Physical compression: using a commercial physical process (Stuffit).
• Morpho-physical compression: this method was used to compress
dictionaries used to build a spell checker [4]. It uses morphological
properties by taking into account the suffixes of the language, but
without any link between them. It also contains some physical features
[7].
• FSM (Finite-State Machine) compression: using Lexc (Finite State
Lexicon Compiler), which allows the conversion of a list of surface
forms into a transducer which is then minimized [8].
Results are described in Table 2.

                                      Arabic       French         Russian
Size of uncompressed list (MB)        17           2.636          1
Ratio from a complete dictionary      33           80             16
Number of inflected forms             1.980.280    247.406        75.234
Class decomposition (prefixes)        6            0              3
                    (suffixes)        13           84             23
1 - Physical compression              5 660 [?]    892.646        348.636
2 - Morpho-physical compression       [?]          311.593        109.418
3 - FSM compression                   88 [?]       201.216        48.78
4 - Logic compression                 253.686      480.770        163.202
4 + 1                                 145.086      207.376        56.784
4 + 2                                 121.500      1114.665 [?]   37.74
4 + 3                                 57.214       150.321        36.71
Table 2
III.4. Interpretations:
The most interesting thing observed in this table is the improvement
obtained when we combine our method with a previous one. These results
show that the existing methods are not optimal and can be improved by
our logic compression in its first step. These important results in
storage space should not hide other aspects of spell checker systems
(retrieval and guessing). It would be interesting if the results given
in the table were followed by other results showing improvements in
the retrieval and guessing of words.
IV. A PROPOSED ARCHITECTURE OF
A UNIVERSAL SPELLING CHECKER:
Figure 3 shows the architecture of our proposed universal spelling
checker. Our method is inspired by previous methods (§ I.2), but
presents some new original aspects that allow it to be considered a
truly multilingual solution. In summary, our system has the following
features:
• Multilinguism: this method will ensure the multilingual constraint
by using different tools, specific to each language, to create a list
of all surface forms.
• Storage space: by introducing the logic compression into the
compression process, we will be able to get a reduced lexicon for
whatever language we have to use. One task that still remains is to
improve the logic compression by making the task of finding the best
decomposition more automatic. This problem is combinatorial; we must
discover how to apply the optimization algorithms (genetic algorithm,
stochastic algorithm,...) in each case to find an optimal reduced
lexicon starting from the large word-dictionary and primitive
morphological knowledge (list of affixes and vowels).
• Retrieval/guessing: even though we haven't any concrete results now,
the first experiments show that the process of checking words in an
FSM formalism is faster than other existing methods. Furthermore, we
are exploring paths to introduce functions (similarity key,...) into
the final obtained lexicon to make a rapid guessing of replacements
for misspelled words.
[Figure: processing pipeline; legible box labels: Finite-State Machine (FSM), Physical compression, Reduced lexicon]
Fig. 3: Universal spelling checker

CONCLUSION
Our approach to spell checking differs from previous methods by taking
into account a new parameter, which is the multilinguism. The system
proposed tries to give solutions for the three main problems:
multilinguism, detection/guessing and storage size.
The first results, although using a manual method to find the
decomposition in this first step, show that the previous methods to
store dictionaries are not optimal and can be improved by exploring
other techniques drawn from the language itself. Another interesting
experiment is to find an original optimization algorithm to find the
optimal reduced lexicon that represents faithfully the initial list
without any silence (missing words) or noise (incorrect words). Yet
another project is to build a more robust method for the two other
problems (detection and guessing) from the reduced lexicon.
ACKNOWLEDGMENTS 
The author would like to thank Prof. Christian BOITET for his constant
support and encouragement. I am also very grateful to Mr. Kenneth
BEESLEY (Rank Xerox, Grenoble) for his fruitful discussions and Mr.
Lauri KARTTUNEN (Rank Xerox, Grenoble) for his help in carrying out
some experiments.
REFERENCES 
[1] Beesley K. R., Buckwalter T., (1989) Two-level, Finite-State
Analysis of Arabic Morphology. Proceedings of the Seminar on Bilingual
Computing in Arabic and English, 6-7 Sept. 1989. Cambridge, England:
The Literary and Linguistic Computing Center & The Center for Middle
Eastern Studies.
[2] Beesley K. R., (1990) Finite-state description of Arabic
Morphology, in the Proceedings of the Second Cambridge Conference on
Bilingual Computing in Arabic and English, Cambridge, England, 6-7
September 1989. No pagination.
[3] Ben Hamadou A., (1986) A Compression technique for Arabic
Dictionaries: The affix Analysis, in the Proceedings of COLING-86,
Bonn 1986, pp. 286-289.
[4] Circle Noetic Services (1989) Passwd, Reference Manual, MIT Branch
Office, Boston, pp. 1-6.
[5] Circle Noetic Services (1989) Conjugate tool, Reference Manual,
MIT Branch Office, Boston, pp. 1-5.
[6] Glantz H., (1957) On the recognition of information with a digital
computer, J. ACM, Vol. 4, No. 2, 178-188.
[7] Huffman D. A., (1951) A method for the construction of minimum
redundancy codes, Proc. IRE 40 (1951), 1098-1101.
[8] Karttunen L. (1993), Finite-State Lexicon Compiler, Xerox Palo
Alto Research Center, April 1993, 1-35.
[9] Koskenniemi K., (1983) Two level Morphology, Publication no. 11,
Department of General Linguistics, University of Helsinki, pp. 18.
[10] Meddeb H.B., (1993) Intégration d'une composante morphologique
pour la compression d'un dictionnaire arabe, in Proc. Langue Arabe et
Technologies Informatiques Avancées, Casablanca, pp. 14.
[11] Meddeb H.B., (1994) Logic Compression of Multilingual
dictionaries, in Proc. of ICEMCO-94, International Conference and
Exhibition on Multi-lingual Computing, University of Cambridge, Center
of Middle Eastern Studies, London, April 1994, pp. 14.
[12] Oflazer K., Solak A., (1992) Parsing agglutinative word
structures and its application to spelling checking for Turkish, Proc.
of COLING-92, Nantes, Aug. 23-28, Vol. 1, pp. 39-45.
