<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1604">
  <Title>The Architecture of a Standard Arabic lexical database: some figures, ratios and categories from the DIINAR.1 source program Ramzi ABBES</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The research and development work referred to in
</SectionTitle>
    <Paragraph position="0"> the SILAT research group goes back to the 1980s and has been going on since (Descles et al. 1983, Dichy &amp; Hassoun, eds. 1989, Dichy 1984/89, 1987, 1993, 1997, 2000, Lelubre 1993, Braham 1998, Braham &amp; Ghazali 1998). It includes a number of doctoral dissertations (Hassoun 1987, Abu Al-Chay 1988, Dichy 1990, Gader 1992, Ghenima 1998). For further developments, see: [...] DIINAR.1 is a comprehensive Arabic language dB operating at word-form level (morphological analysis or generation). It has been completed in close cooperation, in Tunis by IRSIT (now SOTETEL-IT - A. Braham and S. Ghazali), and in France by ENSSIB (M. Hassoun) and the Lumiere-Lyon 2 University (J. Dichy). See Dichy, Braham, Ghazali &amp; Hassoun, 2002.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Non diacriticized writing
</SectionTitle>
      <Paragraph position="0"> It is well-known that Arabic script belongs to a group of Semitic writings originating from ancient Phoenician alphabets, such as Hebrew, Aramaic or Syriac. Phonographic translation is basically restricted to the notation of consonants and &quot;long vowels&quot;. In the course of time, these writing systems have developed additional diacritic symbols, mainly for the needs of the oral reading of sacred texts (Bible, New Testament, Koran). Arabic writing has thus been provided with a sophisticated system of diacritical marks (comparable to the Massora diacritics which were later devised for the Hebrew Bible). Standard writing nevertheless disregards these symbols. This results in a high degree of homography, accounting for the multiple analyses encountered in a majority of single words by morphological analysers (which are, needless to recall, bound to consider every word off-context).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 &quot;Nucleus&quot; and &quot;extensions&quot;: a quick recall of the structure of word-forms in written Arabic
</SectionTitle>
      <Paragraph position="0"> Unlike automatic recognition software, human readers are, of course, able to combine semantic, syntactic and morphological analyses. They are helped in their reading of Arabic written utterances by another major feature of the writing system: the very regular structure of the word-form. This structure has been introduced and extensively described previously (Descles ed., 1983; Dichy 1984, 1990; Hassoun, 1987 - after the pioneering work of Cohen, 1961/70), and is only recalled here for the sake of clarity.</Paragraph>
      <Paragraph position="1"> Word-forms in Arabic can be described on the whole as consisting of a nucleus formative (henceforth NF) to which extension formatives (henceforth EF) are added, either to the left or to the right (Dichy, 1997). Ante-positioned EF-s are abbreviated as aEF, and post-positioned ones as pEF. The nucleus formative, usually called stem, can be represented in terms of prosodic or non-concatenative morphology (after J. McCarthy's original and much discussed insights, 1981). In Semitic morphology, the stem is considered, according to a somewhat recent, but very widely followed tradition, as a compound of root and pattern. One must keep in mind, though, that many nouns cannot be analysed in such a way: they are referred to as quasi-stems (Dichy &amp; Hassoun, eds., 1989).</Paragraph>
      <Paragraph position="2"> Arabic word-forms consist of: - proclitics (PCL), which include mono-consonantal conjunctions, e.g. wa-, 'and', li-, 'in order to', or prepositions, e.g. bi-, 'in, at' or 'by', etc.; - a prefix (PRF). This category, after D. Cohen's representation of the word-form, only includes the prefixes of the imperfective, e.g., ya-, prefixed morpheme of the 3rd person; - a stem, which can be represented in terms of a ROOT (an ordered triple of consonants, or, by extension of the system, a quadruple) and a PATTERN (roughly: a template of syllables, the consonants of which are the triple of the ROOT to which monoconsonantal affixes are added). The stem takabbar, 'to be haughty', thus consists of the 3-consonant ROOT /k-b-r/ and of the PATTERN /taR1aR2R2aR3/, where R1, R2 and R3 stand for 'radical consonant 1, 2, 3', and are instantiated by the triple of the ROOT (R1=k, R2=b, R3=r). Nouns that cannot be analysed in ROOT and PATTERN are conventionally referred to as quasi-stems, e.g.: 'isma'il, 'Ishmael', yunisku, 'UNESCO', kahraman, 'amber'; - suffixes (SUF), such as verb endings, nominal cases, the nominal feminine ending -at, etc.; - enclitics (ECL). In Arabic, enclitics are complement pronouns.</Paragraph>
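      <Paragraph> The ROOT and PATTERN instantiation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name instantiate and the convention of writing radical slots as R1, R2, R3 inside a plain transliterated string are hypothetical choices made here, not part of the DIINAR.1 source program.

```python
import re


def instantiate(pattern, root):
    """Fill the radical slots R1, R2, ... of a PATTERN with the consonants of a ROOT.

    `pattern` is a transliterated template such as "taR1aR2R2aR3";
    `root` is an ordered tuple of consonants such as ("k", "b", "r").
    Both conventions are assumptions made for this sketch.
    """
    def fill(match):
        index = int(match.group(1))  # radical numbers are 1-based
        return root[index - 1]

    return re.sub(r"R(\d)", fill, pattern)


# The example given in the text: ROOT /k-b-r/ with PATTERN /taR1aR2R2aR3/.
stem = instantiate("taR1aR2R2aR3", ("k", "b", "r"))
print(stem)  # takabbar
```

A quadruple ROOT would be handled the same way, with a fourth slot R4 in the pattern string. Quasi-stems, which have no such analysis, would simply be stored whole.</Paragraph>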
      <Paragraph position="3"> In the table below two apparently equivalent representations of the structure of the Arabic word-form are given. The main difference between them lies in the fact that (2) aims at highlighting the relations between nucleus and extension formatives (NF and EF-s), featuring a triangle (in bold-face below). The rules governing the relations between morphemes embedded in the word-form are included in a word-formatives grammar (henceforth WFG - Dichy, 1987, 1997). These rules, and the features they involve, are distributed along these three relations, a great number of which are related to the lexical nucleus, and have to rely upon the finite set of grammar-lexis relations operating at word-level (formalised in Dichy,</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Word-formatives grammar (WFG) and word-level grammar-lexis relations
</SectionTitle>
      <Paragraph position="0"> Complex as it may appear, the above structure is regular, and remains, up to a certain point, recognisable from a psychological standpoint. It is, consequently, very restrictive: Arabic word-forms include one lexical stem and one only [6]. In fact, the word-formatives grammar (WFG) accounts for the regular structure of the word-form.</Paragraph>
      <Paragraph position="1"> Rules involving word-formatives (the above nucleus and extension formatives, NF and EF) are based on three fundamental types of relations (Dichy, 1987): 'entails', 'excludes', and 'is compatible with' (or 'admits'). The third is attached to the opposed pair of the first two as an 'elsewhere' relation of a special kind, directly connected to ambiguity in language analysis processes (Dichy, 2000). In generation, all 'compatibility' (or 'admit') relations can in fact be rewritten in terms of 'entail' or 'exclude' rules associated with specific sets of word-formatives.</Paragraph>
      <Paragraph position="2"> 'Compatibility' relations are mostly useful in the formalisation of recognition rules, when ambiguity is at stake [7]. The formal structure of the WFG thus includes relations of the three types above, which are, in turn, involved in either one of the two following combination schemes:</Paragraph>
      <Paragraph position="4"> which can be phrased as: 'the proclitic preposition bi# entails one of the indirect (or genitive) case suffixes'. Other rules will point to a given case suffix in a given utterance.</Paragraph>
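      <Paragraph> As a hedged illustration of how an 'entails' or 'excludes' rule of this kind might be represented and checked, the following Python sketch encodes the rule phrased above for the proclitic preposition bi#. The class Rule, the function satisfies, the slot labels and the suffix label "genitive" are all hypothetical names introduced here; the sketch does not reproduce the actual WFG formalism. 'Compatibility' is left out, since the text notes that in generation it can be rewritten as 'entail' or 'exclude' rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    slot: str          # formative slot that triggers the rule, e.g. "PCL"
    value: str         # value of that slot, e.g. "bi"
    relation: str      # "entails" or "excludes"
    target: str        # slot constrained by the rule, e.g. "SUF"
    values: frozenset  # set of values the relation refers to


def satisfies(word, rule):
    """Return True if the word-form (a dict of slots) respects the rule."""
    if word.get(rule.slot) != rule.value:
        return True  # the rule is not triggered by this word-form
    hit = word.get(rule.target) in rule.values
    return hit if rule.relation == "entails" else not hit


# Toy encoding of: 'the proclitic preposition bi# entails a genitive case suffix'.
BI_RULE = Rule("PCL", "bi", "entails", "SUF", frozenset({"genitive"}))

print(satisfies({"PCL": "bi", "STEM": "kitab", "SUF": "genitive"}, BI_RULE))    # True
print(satisfies({"PCL": "bi", "STEM": "kitab", "SUF": "nominative"}, BI_RULE))  # False
```

Keeping rules as data rather than as code is one possible design choice; it lets the same rule set be consulted by both a generator and an analyser. The stem kitab ('book') is used purely as an illustrative value.</Paragraph>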
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
NF - EF combinations
</SectionTitle>
    <Paragraph position="0"> These include STEM - SUF rules, e.g.: STEM = 'diptote' → SUF = {u, a, i}, which can be phrased as: 'a stem whose declension is diptote entails case-endings belonging to the listed set'. (Diptote stems may also be compatible with dual or plural suffixes, which is taken into account in another rule.) Another type of relation to be encoded in a lexical database is that of NF - NF linking combinations, which have to be encoded whenever the morphological variation is not rule-predictable (cf. Mel'čuk's concept of syntactics, 1982). This is the case in a majority of singular - 'broken plural' links in nouns or adjectives, as well as in 'perfective' - 'imperfective' (madi - mudari') links, in verbs belonging to 'simple' PATTERNS (al-fi'l al-mujarrad). [6] A few exceptive compound items exist, but they are kept marginal by the structure of the language, for the obvious reasons hinted at here, unlike what has happened in Modern Hebrew, as opposed to the Biblical and Medieval state of the language (Kirtchuk, 1997). [7] Automatic recognition and generation are not to be considered as reverse processes. Evidence from Arabic is given in Descles, ed., 1983; Dichy, 1984, 1997, 2000.</Paragraph>
    <Paragraph position="1"> In an Arabic lexical dB, lexical entries (NF-s or STEMS in the above representation) need to be associated with morphosyntactic specifiers ensuring their insertion in word-forms, and their morphological variation (conjugation or declension). Morphosyntactic specifiers, in other words, account for: - grammar-lexis relations, i.e. NF - EF combinations; - morpho-lexical variation, i.e. NF - NF linking combinations.</Paragraph>
    <Paragraph position="2"> Lexical entries thus 'entail', 'exclude' or 'admit' a number of grammatical morphemes listed in the various fields of the word-form as word-formatives, either on a non regular basis, or on the basis of rules founded on semantic features that cannot be deduced from the formal structure of the morpheme. As shown in Dichy (1997), morphosyntactic specifiers make up formally, in a lexical database, for information associated in the speaker's memory to various levels of linguistic analysis (morpho-phonological, syntactic or semantic features).</Paragraph>
    <Paragraph position="3"> This structure has often been disregarded in the elaboration of Arabic lexical databases on the assumption that the representation of lexical entries as a mere combination of PATTERN and ROOT (plus a number of suffixes) is sufficient. This is definitely not the case: evidence recalled in this paragraph (also in Hassoun &amp; Dichy, eds. 1989, Dichy, 1997, Dichy &amp; Fergaly, 2003) shows that grammar-lexis relations operating at word-form level can only be taken into account if information is associated to whole stems (or nuclei), or to stem+suffix 'frozen' compounds. These relations cannot be predicted on the sole basis of patterns.</Paragraph>
    <Paragraph position="4"> The description of the WFG outlined in this paragraph has led to the elaboration of exhaustive and finite sets of morphosyntactic specifiers liable to be associated to the non finite lexical entries of an Arabic database (Dichy, 1997). These sets have been associated with the entries of the DIINAR.1 Arabic Language database. The WFG has been on the other hand implemented in the related generation and analysis source programs.</Paragraph>
    <Paragraph position="5"> Another lexical LR including morphosyntactic information at word level is the lexicon elaborated and completed by Timothy Buckwalter, which has been used in the finite-state morphological analyser elaborated at the European Xerox Research Centre (Meylan, France) [8].</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A few figures and ratios from DIINAR.1: generated lexica vs. source lexicon
</SectionTitle>
    <Paragraph position="0"> In the previous section, we outlined the structure of the WFG and the information associated with lexical entries in the source program of the DIINAR.1 database.</Paragraph>
    <Paragraph position="1"> It is essential to note that the expression lexical database is ambiguous, i.e. that it is liable to refer, either: - to a source program drawing on lists of basic lexical or grammatical items (related to a grammar of the kind outlined in the previous section), - or to a set of generated lexica, the items of which can be either basic (as in the source program) or combined, i.e. resulting from the combination of basic items, according to the rules of the word-formatives grammar.</Paragraph>
    <Paragraph position="2"> Software relying partly or entirely on morphological analysis may, or may not, need all the information outlined in section 2. Such software draws on lexica generated by the source program associated with the dB (Hassoun, 1987). Generated lexica can be restricted to a subset of information, as in a spelling checker (Gader, 1992), or extended to all available information, as in a parser (Ouersighni, 2002) or in an interactive language teaching software (Zaafrani, 2002). In the current section, we will examine the architecture of the DIINAR.1 database, from the standpoint of the relation between the figures of the basic entries included (§ 3.1 and 3.2), and that of the inflected word-forms (§ 3.3).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The basic figures of the DIINAR.1 source program
</SectionTitle>
      <Paragraph position="0"> The total number of lemma-entries in the DIINAR.1 database is 121,522. This includes 445 tool-words belonging to various grammatical categories (e.g.: prepositions, conjunctions, etc.) and the prototype of a proper names database of 1,384 entries. Both types of entries are associated with a particular word-formatives grammar, and with their own subsets of morpho-syntactic specifiers.</Paragraph>
      <Paragraph position="1"> [8] Beesley, 1998, 2001; Beesley and Karttunen, 2003. Also: Buckwalter, 2002. The main parts of the database include:</Paragraph>
      <Paragraph position="2"> Nouns, including adjectives 29,534</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Comments and critical analysis
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> Table 2 features two ratios of general interest for the structure of the Arabic general Lexicon: - The ratio between broken plural nominal forms (which are not counted as lemmas [9]) and nouns and adjectives is roughly one to four.</Paragraph>
      <Paragraph position="3"> - Deverbals appear to be 3.6 times more numerous than verbs.</Paragraph>
      <Paragraph position="4"> (2) The above categorisation follows that of traditional Arabic grammar. Two sub-categorisations should, nevertheless, be revisited for reasons of linguistic consistency: - Adjectives (although they can appear as nouns in many syntactic structures) should be isolated. This will be needed, of course, in parsing - even in 'shallow parsing'. Adjectives in Arabic can be identified through syntactic tests.</Paragraph>
      <Paragraph position="5"> - 'Nouns of time and place' ('asma'u l-makan waz-zaman) should not, in future versions of DIINAR, remain in the 'deverbal' category.</Paragraph>
      <Paragraph position="6"> They are in fact (except for the earliest stages in the development of the language) inserted in syntactic structures as full nouns.</Paragraph>
      <Paragraph position="8"> [9] 'Broken plural' forms are related to a singular noun-form lemma. Links between singular and plural forms, in the dB, are described as NF - NF linking combinations (see § 2.3).</Paragraph>
      <Paragraph position="9"> It is to be noted, on the other hand, that (except for 'nouns of time and place') DIINAR.1 is very consistent in distinguishing between nouns and deverbals: deverbals re-used as nouns, and showing full nominal features, appear in the dB twice (as 'deverbals' and as 'nouns', with their related morphosyntactic specifiers), e.g.: * sakin, plur. sakinun, sakinat, 'dwelling', 'inhabiting', is a deverbal, e.g.: Nahnu sakinuna madinata al-'iskandariyya = 'We live in Alexandria'.</Paragraph>
      <Paragraph position="10"> * sakin, plur. sukkan (broken plural form), 'inhabitant', is a full noun (appearing in the first line of Table 2), e.g.: Nahnu sukkanu madinati l-'iskandariyya = 'We are the inhabitants of Alexandria'.</Paragraph>
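      <Paragraph> The double listing of sakin can be pictured with a small Python sketch. It assumes a hypothetical Entry record with an explicit links field standing for NF - NF linking combinations, and a toy plural function that falls back on the regular masculine plural ending -un when no link is stored. None of these names come from DIINAR.1 itself.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    stem: str
    category: str  # "noun" or "deverbal"
    gloss: str
    # Explicitly stored NF - NF links, e.g. singular to 'broken plural'.
    links: dict = field(default_factory=dict)


def plural(entry):
    """Return the stored broken plural if there is one, else a regular plural."""
    if "broken_plural" in entry.links:
        return entry.links["broken_plural"]
    return entry.stem + "un"  # regular masculine plural, as in sakinun


# The same stem is listed twice, once per category, as in the text.
deverbal = Entry("sakin", "deverbal", "dwelling, inhabiting")
noun = Entry("sakin", "noun", "inhabitant", links={"broken_plural": "sukkan"})

print(plural(deverbal), plural(noun))  # sakinun sukkan
```

The point of the sketch is that the broken plural cannot be computed from the stem and has to be looked up, whereas the deverbal plural follows from a general rule.</Paragraph>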
      <Paragraph position="11"> (4) The number of roots in DIINAR.1 is 6,546, it being understood that a great many nouns cannot be analysed in ROOT and PATTERN. (On the other hand, all the verbs and deverbals of the language can - Dichy, 1984/89.)</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 The DIINAR.1 lexica of inflected word-forms
</SectionTitle>
      <Paragraph position="0"> The number of combined proclitics (which are effectively in use in Modern Standard Arabic), suffixes, prefixes and enclitics is shown in the table. It is easy to imagine, on the basis of the above table, that one could generate huge figures through multiplying the number of extension formatives among themselves, then multiplying the result by the number of nouns and/or verbs. In order to avoid 'over-powerful' inflation of data, a consistent database needs to be filtered through (a) a word-formatives grammar and (b) morphosyntactic specifiers associated to stems.</Paragraph>
      <Paragraph position="1"> The overall figures for inflected forms lexica generated by DIINAR.1 can be broken down as shown in Table 5:</Paragraph>
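      <Paragraph> The difference between blind multiplication and filtered generation can be made concrete with a toy Python sketch. The inventories below (two proclitic options, one stem, three case suffixes) and the single rule used as a filter are illustrative assumptions, far smaller than anything in DIINAR.1; the function name admitted is likewise hypothetical.

```python
from itertools import product

PROCLITICS = ["", "bi"]  # "" stands for 'no proclitic'
STEMS = ["kitab"]        # illustrative stem, 'book'
SUFFIXES = [("u", "nominative"), ("a", "accusative"), ("i", "genitive")]


def admitted(proclitic, case):
    # Single toy WFG rule: the proclitic bi- entails a genitive suffix.
    return proclitic != "bi" or case == "genitive"


# Blind multiplication of all formatives ...
unfiltered = list(product(PROCLITICS, STEMS, SUFFIXES))
# ... versus generation filtered through the rule.
filtered = [p + s + suf for p, s, (suf, case) in unfiltered if admitted(p, case)]

print(len(unfiltered), len(filtered))  # 6 4
print(filtered)  # ['kitabu', 'kitaba', 'kitabi', 'bikitabi']
```

Even with one rule, a third of the candidate forms is discarded. With full inventories of formatives and stem-specific specifiers, the gap between the unfiltered product and the set of well-formed word-forms would be expected to grow accordingly.</Paragraph>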
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 The fundamental ratio between lemma-entries and inflected word-forms
</SectionTitle>
      <Paragraph position="0"> High as they may seem, the above figures are not over-powerful, and result from stem-by-stem filtering of information through morphosyntactic specifiers and the associated word-formatives grammar.</Paragraph>
      <Paragraph position="1"> One can also compare the ratio between the total number of stems and that of inflected forms to what can be found in another language, which is equally known to be a highly inflected one. The Xerox Spanish Lexical Transducer contained, in 1996, over 46,000 base-forms, and generated over 3,400,000 inflected word-forms (Beesley &amp; Karttunen, 2003, p. xvii). The ratio between inflected forms and base-forms in the Xerox Spanish database was then around 74 to one. In the DIINAR.1 dB, the same ratio is just under 60 to one, which can be considered as reasonable. The question of how many 'maximal word' forms can be correctly generated remains to be introduced and discussed in a further paper.</Paragraph>
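      <Paragraph> The Spanish ratio quoted above can be checked with a short calculation. The two counts are the rounded lower bounds given in the text ('over 46,000' and 'over 3,400,000'), so the result is only an approximation; the variable names are chosen here for illustration.

```python
# Rounded figures quoted for the Xerox Spanish Lexical Transducer (1996).
base_forms = 46_000
inflected_forms = 3_400_000

ratio = inflected_forms / base_forms
print(round(ratio, 1))  # 73.9
```

This agrees with the figure of around 74 inflected forms per base-form.</Paragraph>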
      <Paragraph position="2"> 4 The rationale beyond ratios: towards a first set of validation criteria for Arabic lexica. The ratios considered in the present paper are divided in two general categories: * The category encountered in § 3.2 involves NF - NF linking combinations (§ 2.3): (a) The ratio between the number of noun lemmas (in general vocabulary) and that of 'broken plurals' is 1 'broken plural' for every 4 nouns.</Paragraph>
      <Paragraph position="3"> (b) The overall ratio between verbs and  deverbals gives an average of 3.6 deverbals for one verb.</Paragraph>
      <Paragraph position="4"> * The ratios given in § 3.3 and 3.4 consider the number of basic entries, such as nouns, verbs, deverbals, etc., and the inflected forms generated through the rules of the WFG and the grammar-lexis relations specifiers included in the dB. In nouns, the relatively high ratio of 45.55 is due to the combination of case-endings with other suffixes. In proper names, case-endings are limited, because they do not vary according to definiteness or indefiniteness, and also because some categories of proper names are in addition not liable to be followed by the relative suffix -iyy. In this contribution, the numbers of lemma-entries reflect the state of the DIINAR.1 database, which is likely to be modified, in the course of time, through eliminating lemmas corresponding to words that have fallen out of use or through adding new entries. Ratios, on the other hand, reflect the word-formatives grammar as well as the overall structure of the sets of morpho-syntactic specifiers associated to lexical entries. They are, on the whole, to remain stable. It is therefore reasonable to consider that they should be added to the language-specific parts of a check-list devised for the evaluation and validation of Arabic lexical resources, or of multilingual lexica including Arabic.</Paragraph>
    </Section>
  </Section>
</Paper>