File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/86/c86-1068_abstr.xml
Size: 6,632 bytes
Last Modified: 2025-10-06 13:46:19
<?xml version="1.0" standalone="yes"?> <Paper uid="C86-1068"> <Title>A COMPRESSION TECHNIQUE FOR ARABIC DICTIONARIES: THE AFFIX ANALYSIS.</Title> <Section position="1" start_page="0" end_page="286" type="abstr"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> In every application that involves the automatic processing of natural language, the problem of dictionary size arises. In this paper, we propose a dictionary compression algorithm based on an affix analysis of non-diacritical Arabic.</Paragraph> <Paragraph position="1"> It consists in decomposing a word into its elementary components, taking into account the different linguistic transformations that can affect the morphological structures.</Paragraph> <Paragraph position="2"> This work was carried out as part of a study of the automatic detection and correction of spelling errors in non-diacritical Arabic texts.</Paragraph> <Paragraph position="3"> I - INTRODUCTION In every application that involves the automatic processing of natural language, the problem of dictionary size arises. This important question can be approached in several ways, in particular: - By grouping together the common prefixes of the words of the language. In the PIAF system (an interactive program for French analysis), for instance, words are represented in chained lists following an alpha-</Paragraph> <Paragraph position="5"> - By creating multiple dictionaries, one for each major topic area. This approach requires, in addition, a common base dictionary. When a particular area is concerned, a temporary master dictionary is created by augmenting the base dictionary with selected local ones.</Paragraph> <Paragraph position="6"> - By using affix analysis, which consists in performing a morphological analysis in order to identify, in a given word, the redundant elements (affixes). The dictionary is then limited to the non-redundant elements (roots). 
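As a rough illustration of how limiting the dictionary to roots reduces its size, here is a minimal Python sketch. It assumes tiny, hypothetical affix tables in Latin transliteration (PREFIXES, SUFFIXES) and two hypothetical helper functions (strip_affixes, compress); it ignores infixes and morphological transformations, and it is not the algorithm of this paper.

```python
# Hypothetical, transliterated affix tables used only for illustration;
# they are not the tables of the system described in this paper.
PREFIXES = ("", "al", "wa")
SUFFIXES = ("", "at", "un")


def strip_affixes(word):
    """Yield (prefix, stem, suffix) for each table prefix and suffix that fit the word."""
    for p in PREFIXES:
        for s in SUFFIXES:
            if word.startswith(p) and word.endswith(s):
                stem = word[len(p):len(word) - len(s)]
                if stem:  # empty when the affixes cover or overlap the whole word
                    yield p, stem, s


def compress(words, roots):
    """Return the roots that suffice to recognise every analysable word."""
    needed = set()
    for w in words:
        for _, stem, _ in strip_affixes(w):
            if stem in roots:
                needed.add(stem)
    return needed


words = ["ktb", "alktb", "waktbat", "drsun", "aldrsat"]
stored = compress(words, {"ktb", "drs"})
print(len(words), "surface forms,", len(stored), "stored roots")
```

With these toy tables, the five surface forms are covered by two stored roots; a realistic analyser would also have to validate each affix combination before accepting a decomposition.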
This technique is used in particular in the DECIO - SPELL system for detecting and correcting spelling errors.</Paragraph> <Paragraph position="7"> In the present paper, we develop this last approach for non-diacritical Arabic.</Paragraph> <Paragraph position="8"> The particularities of the algorithms that we propose stem, in large part, from the specific features of the language used: - Words are written in consonantal form - Words can contain infixes - Morphological structures can be altered by linguistic transformations.</Paragraph> <Paragraph position="9"> This work has been developed within a national research project for the study of the automatic detection and correction of spelling errors in Arabic texts.</Paragraph> <Paragraph position="11"> Let $V$ be a finite set and $V^{*}$ the set of words built on $V$, including the null string, noted</Paragraph> <Paragraph position="13"> Let $W = w_1 w_2 \dots w_n$, $W \in V^{+}$. We call the order-$i$ prefix the quantity $P_i = w_1 w_2 \dots w_i$ ($1 \le i &lt; n-1$); the order-0 prefix is the null string. The affix analysis consists in decomposing a given word into its elementary components, among which we distinguish the affixes (prefix, infix and suffix), which are the redundant elements of the language, and the root, which is its non-redundant element. This decomposition is based on the derivational structure of the language: nearly all words are obtained by adding an affix combination to a given root.</Paragraph> <Paragraph position="14"> [Example: decomposition of a word into Prefixes, Root, Infix and Suffix; Root = kataba; the Arabic-script forms are not recoverable.] Among the possible affix combinations, we distinguish those that are valid and those that are not. 
Valid combinations constitute what is called a Morphological Pattern (MP). For a given word, the number of possible morphological decompositions depends on the root, according to whether or not it contains characters that can be taken for different affixes.</Paragraph> <Paragraph position="15"> This number is calculated using the following</Paragraph> <Paragraph position="17"> 2. Study of the morphological transformations The morphological derivation of a root can be accompanied by transformations caused by linguistic phenomena such as assimilation, contraction and metathesis.</Paragraph> <Paragraph position="18"> These transformations can affect the root as well as the affixes (MP). The roots affected are mainly those that contain the characters yaa (ي), waw (و) and hamza (ء). [Arabic-script examples not recoverable.] The morphological transformations can be classified into two categories: - The morpho-phonological transformations are those that substitute one character for another without changing the length of the word (isometric transformations) (see EX1 and EX2).</Paragraph> <Paragraph position="19"> - The purely phonological transformations are those that suppress one or more characters and therefore modify the length of the word.</Paragraph> <Paragraph position="20"> [Example: ahada] These transformations are a source of ambiguity for the morphological decomposition. To remove these ambiguities, we use heuristics, among which we can mention, for instance: Let $D$ be the morphological derivation operator such that $D(R, P, I, S) = W$, $W \in V^{+}$, and $T$ the operator composed of a derivation followed by a transformation. 
And $D^{-1}$ the morphological decomposition operator (inverse of $D$), and $T^{-1}$ the morphological decomposition operator that takes the transformational rules into account (inverse of $T$).</Paragraph> <Paragraph position="21"> Consider $W$, the word to be analysed.</Paragraph> <Paragraph position="23"> The root retained is: da.PSaAa. This heuristic means that the transformations cannot be applied at the expense of semantics.</Paragraph> <Paragraph position="24"> IV - IMPLEMENTATION: The affix analysis is composed of two modules (see Fig. 1): - morphological decomposition module - validation module 1. The morphological decomposition module makes it possible to identify the different combinations. It is executed in two steps: Step one: identification of prefixes and suffixes by using a table of prefixes and a table of suffixes.</Paragraph> <Paragraph position="25"> Step two: identification of the infix by analysing the string that remains after eliminating P and S.</Paragraph> <Paragraph position="26"> The analyser has a single initial state and as many paths as there are infix possibilities. The benefit of performing this decomposition in two steps lies in the use of a single analyser to recognise all the morphological forms. We distinguish different morphological patterns.</Paragraph> </Section> </Paper>