<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1086">
  <Title>MAGEAD: A Morphological Analyzer and Generator for the Arabic Dialects</Title>
  <Section position="5" start_page="0" end_page="681" type="metho">
    <SectionTitle>
2 Arabic Morphology
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="681" type="sub_section">
      <SectionTitle>
2.1 Variants of Arabic
</SectionTitle>
      <Paragraph position="0"> The Arabic-speaking world is characterized by diglossia (Ferguson, 1959). Modern Standard Arabic (MSA) is the shared written language from Morocco to the Gulf, but it is no one's native language. It is spoken only in formal, scripted contexts (news, speeches). In addition, there is a continuum of spoken dialects (varying geographically, but also by social class, gender, etc.) which are native languages, but rarely written (except in very informal contexts: collections of folk tales, newsgroups, email, etc.). We will refer to MSA and the dialects as variants of Arabic. Variants differ phonologically, lexically, morphologically, and syntactically from one another; many pairs of variants are mutually unintelligible.</Paragraph>
      <Paragraph position="1"> In unscripted situations where spoken MSA would normally be required (such as talk shows on TV), speakers usually resort to repeated code-switching between their dialect and MSA, as nearly all native speakers of Arabic are unable to produce sustained spontaneous discourse in MSA.</Paragraph>
      <Paragraph position="2">  In this paper, we discuss MSA and Levantine, the dialect spoken (roughly) in Syria, Lebanon, Jordan, Palestine, and Israel. Our Levantine data comes from Jordan. The discussion in this section uses only examples from MSA, but all variants show a combination of root-and-pattern and affixational morphology and similar examples could be found for Levantine.</Paragraph>
    </Section>
    <Section position="2" start_page="681" end_page="681" type="sub_section">
      <SectionTitle>
2.2 Roots, Patterns and Vocalism
</SectionTitle>
      <Paragraph position="0"> Arabic morphemes fall into three categories: templatic morphemes, affixational morphemes, and non-templatic word stems (NTWSs). NTWSs are word stems that are not constructed from a root/pattern/vocalism combination. Verbs are never NTWSs.</Paragraph>
      <Paragraph position="1"> Templatic morphemes come in three types, all of which are needed to create a word stem: roots, patterns and vocalisms. The root morpheme is a sequence of three, four, or five consonants (termed radicals) that signifies some abstract meaning shared by all its derivations. For example, the words كتب katab 'to write', كاتب kaAtib 'writer', and مكتوب maktuwb 'written' all share the root morpheme ktb (ك ت ب) 'writing-related'. The pattern morpheme is an abstract template into which roots and vocalisms are inserted. The vocalism morpheme specifies which short vowels to use with a pattern. We represent a pattern as a string made up of numbers that indicate radical positions, the symbol V, which indicates the position of a vocalism vowel, and pattern consonants (if needed).</Paragraph>
      <Paragraph position="2"> A word stem is constructed by interleaving the three types of templatic morphemes. For example, the word stem كتب katab 'to write' is constructed from the root ktb (ك ت ب), the pattern 1V2V3 and the vocalism aa.</Paragraph>
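      <Paragraph> The interleaving step can be sketched in a few lines of Python. This is only an illustration, not the MAGEAD implementation (which uses multi-tape finite-state machines); the function name interdigitate is hypothetical, and the sketch assumes a transliteration in which every radical and every short vowel is a single character.

```python
def interdigitate(root, pattern, vocalism):
    """Fill a pattern such as '1V2V3' with root radicals and vocalism vowels."""
    vowels = iter(vocalism)
    out = []
    for symbol in pattern:
        if symbol.isdigit():
            # A digit selects a radical; positions are counted from 1.
            out.append(root[int(symbol) - 1])
        elif symbol == "V":
            # Each V slot takes the next short vowel of the vocalism.
            out.append(next(vowels))
        else:
            # Any other symbol is a pattern consonant and is copied unchanged.
            out.append(symbol)
    return "".join(out)


print(interdigitate("ktb", "1V2V3", "aa"))  # katab
```

      The same function covers patterns with pattern consonants, since any symbol that is neither a digit nor V is passed through as is.</Paragraph>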
    </Section>
    <Section position="3" start_page="681" end_page="681" type="sub_section">
      <SectionTitle>
2.3 Affixational Morphemes
</SectionTitle>
      <Paragraph position="0"> Arabic affixes can be prefixes such as sa+</Paragraph>
      <Paragraph position="2"> fem. plural]'. Multiple affixes can appear in a word. For example, the word وسيكتبونها wasayaktubuwnahA 'and they will write it' has two prefixes, one circumfix and one suffix. [Footnote 2: We analyze the imperfective word stem as including an initial short vowel, and leave a discussion of this analysis to future publications.]</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="4" start_page="681" end_page="681" type="sub_section">
      <SectionTitle>
2.4 Morphological Rewrite Rules
</SectionTitle>
      <Paragraph position="0"> An Arabic word is constructed by first creating a word stem from templatic morphemes or by using an NTWS. Affixational morphemes are then added to this stem. The process of combining morphemes involves a number of phonological, morphemic and orthographic rules that modify the form of the created word so it is not a simple interleaving or concatenation of its morphemic components. An example of a phonological rewrite rule is the voicing of the /t/ of the verbal pattern V1tV2V3 (Form VIII) when the first root radical is /z/, /d/, or /*/ (ز, د, or ذ): the verbal stem zhr+V1tV2V3+iaa is realized phonologically as /izdahar/ (orthographically: ازدهر) 'flourish', not /iztahar/ (orthographically: ازتهر). An example of an orthographic rewrite rule is the deletion of the Alif (ا) of the definite article morpheme Al+ (ال) in nouns when preceded by the preposition l+ (ل).</Paragraph>
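      <Paragraph> The two rewrite rules just described can be illustrated with a minimal Python sketch. It is not how MAGEAD encodes its rules; the function names are hypothetical, the stem is assumed to be an already interleaved Form VIII stem, and the noun ktAb 'book' (written without diacritics) is an example chosen here rather than one taken from the paper.

```python
# Radicals that trigger voicing of the Form VIII pattern /t/
# (z, d, and * in the transliteration used in this section).
VOICING_RADICALS = {"z", "d", "*"}


def voice_form_viii(stem, root):
    """Phonological rule: Form VIII /t/ becomes /d/ after a first radical z, d, or *.

    Assumes `stem` is an interleaved Form VIII stem such as "iztahar".
    """
    first = root[0]
    if first in VOICING_RADICALS:
        # Only the pattern consonant directly after the first radical changes.
        return stem.replace(first + "t", first + "d", 1)
    return stem


def spell_with_article(morphemes):
    """Orthographic rule: the Alif (A) of Al+ is not written after l+."""
    written = []
    previous = None
    for morpheme in morphemes:
        drop_alif = previous == "l+" and morpheme == "Al+"
        written.append("l" if drop_alif else morpheme.strip("+"))
        previous = morpheme
    return "".join(written)


print(voice_form_viii("iztahar", "zhr"))         # izdahar
print(spell_with_article(["l+", "Al+", "ktAb"]))  # llktAb
```

      A stem whose first radical is not one of the three triggers, or a noun without the preposition, passes through both functions unchanged.</Paragraph>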
    </Section>
  </Section>
  <Section position="6" start_page="681" end_page="682" type="metho">
    <SectionTitle>
3 Previous Work
</SectionTitle>
    <Paragraph position="0"> There has been a considerable amount of work on Arabic morphological analysis; for an overview, see (Al-Sughaiyer and Al-Kharashi, 2004). We summarize some of the most relevant work here.</Paragraph>
    <Paragraph position="1"> Kataja and Koskenniemi (1988) present a system for handling Akkadian root-and-pattern morphology by adding an additional lexicon component to Koskenniemi's two-level morphology (1983). The first large-scale implementation of Arabic morphology within the constraints of finite-state methods is that of Beesley et al. (1989), with a 'detouring' mechanism for access to multiple lexica, which gave rise to other works by Beesley (Beesley, 1998) and, independently, by Buckwalter (2004).</Paragraph>
    <Paragraph position="2"> The approach of McCarthy (1981) to describing root-and-pattern morphology in the framework of autosegmental phonology has given rise to a number of computational proposals. Kay (1987) proposes a framework in which each of the autosegmental tiers is assigned a tape in a multi-tape finite state machine, with an additional tape for the surface form. Kiraz (2000, 2001) extends Kay's approach and implements a small working multi-tape system for MSA and Syriac. Other autosegmental approaches (described in more detail in Kiraz 2001 (Chapter 4)) include those of Kornai (1995), Bird and Ellison (1994), Pulman and Hepple (1993), whose formalism Kiraz adopts, and others.</Paragraph>
  </Section>
  <Section position="7" start_page="682" end_page="682" type="metho">
    <SectionTitle>
4 Design Goals for MAGEAD
</SectionTitle>
    <Paragraph position="0"> This work is aimed at a unified processing architecture for the morphology of all variants of Arabic, including the dialects. Three design goals follow from this overall goal. First, we want to be able to use the analyzer when we do not have a lexicon, or have only a partial lexicon. This is because, despite the similarities between dialects at the morphological and lexical levels, we cannot assume we have a complete lexicon for every dialect we wish to morphologically analyze. As a result, we want an on-line analyzer which performs full morphological analysis at run time.</Paragraph>
    <Paragraph position="1"> Second, we want to be able to exploit the existing regularities among the variants, in particular systematic sound changes which operate at the level of the radicals, and pattern changes. This requires an explicit analysis into root and pattern.</Paragraph>
    <Paragraph position="2"> Third, the dialects are mainly used in spoken communication, and in the rare cases when they are written they do not have standard orthographies; different (inconsistent) orthographies may be used even within a single written text. We thus need a representation of morphology that incorporates models of both phonology and orthography. In addition, we add two general requirements for morphological analyzers. First, we want both a morphological analyzer and a morphological generator. Second, we want to use a representation that is defined in terms of a lexeme and attribute-value pairs for morphological features such as aspect or person. This is because we want our component to be usable in natural language processing (NLP) applications such as natural language generation and machine translation, and the lexeme provides a usable lexicographic abstraction. Note that the second general requirement (an analysis to a lexemic representation) appears to clash with the first design desideratum (we may not have a lexicon).</Paragraph>
    <Paragraph position="3"> We tackle these requirements by doing a full analysis of templatic morphology, rather than "precompiling" the templatic morphology into stems and only analyzing affixational morphology on-line (as is done in (Buckwalter, 2004)). Our implementation uses the multitape approach of Kiraz (2000). This is the first large-scale implementation of that approach. We extend it by adding an additional tape for independently modeling phonology and orthography. The use of finite state technology makes MAGEAD usable as a generator as well as an analyzer, unlike some morphological analyzers which cannot be converted to generators in a straightforward manner (Buckwalter, 2004; Habash, 2004).</Paragraph>
  </Section>
  <Section position="8" start_page="682" end_page="683" type="metho">
    <SectionTitle>
5 The MAGEAD System: Representation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="682" end_page="682" type="sub_section">
      <SectionTitle>
of Linguistic Knowledge
</SectionTitle>
      <Paragraph position="0"> MAGEAD relates (bidirectionally) a lexeme and a set of linguistic features to a surface word form through a sequence of transformations. In a generation perspective, the features are translated to abstract morphemes which are then ordered, and expressed as concrete morphemes. The concrete templatic morphemes are interdigitated and affixes added, and finally morphological and phonological rewrite rules are applied. In this section, we discuss our organization of linguistic knowledge, and give some examples; a more complete discussion of the organization of linguistic knowledge in MAGEAD can be found in (Habash et al., 2006).</Paragraph>
    </Section>
    <Section position="2" start_page="682" end_page="683" type="sub_section">
      <SectionTitle>
5.1 Morphological Behavior Classes
</SectionTitle>
      <Paragraph position="0"> Morphological analyses are represented in terms of a lexeme and features. We define the lexeme to be a triple consisting of a root (or an NTWS), a meaning index, and a morphological behavior class (MBC). We do not deal with issues relating to word sense here and therefore do not further discuss the meaning index. It is through this view of the lexeme (which incorporates productive derivational morphology without making claims about semantic predictability) that we can both have a lexeme-based representation, and operate without a lexicon. In fact, because lexemes have internal structure, we can hypothesize lexemes on the fly without having to make wild guesses (we know the pattern, it is only the root that we are guessing). We will see in Section 8 that this approach does not wildly overgenerate.</Paragraph>
      <Paragraph position="1"> We use as our example the surface form Aizdaharat (Azdhrt without diacritics) 'she/it flourished'. The lexeme-and-features representation of this word form is as follows: (2) Root:zhr MBC:verb-VIII POS:V PER:3</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="683" end_page="683" type="metho">
    <SectionTitle>
GEN:F NUM:SG ASPECT:PERF
</SectionTitle>
    <Paragraph position="0"> An MBC maps sets of linguistic feature-value pairs to sets of abstract morphemes. For example, MBC verb-VIII maps the feature-value pair ASPECT:PERF to the abstract pattern morpheme [PAT PV:VIII], which in MSA corresponds to the concrete pattern morpheme AV1tV2V3, while the MBC verb-I maps ASPECT:PERF to the abstract pattern morpheme [PAT PV:I], which in MSA corresponds to the concrete pattern morpheme 1V2V3. We define MBCs using a hierarchical representation with non-monotonic inheritance. The hierarchy allows us to specify only once those feature-to-morpheme mappings that are shared by several MBCs. For example, the root node of our MBC hierarchy is a word, and all Arabic words share certain mappings, such as that from the linguistic feature conj:w to the clitic w+. This means that all Arabic words can take a cliticized conjunction. Similarly, the object pronominal clitics are the same for all transitive verbs, no matter what their templatic pattern is. We have developed a specification language for expressing MBC hierarchies in a concise manner. Our hypothesis is that the MBC hierarchy is variant-independent, though as more variants are added, some modifications may be needed. Our current MBC hierarchy specification for both MSA and Levantine, which covers only the verbs, comprises 66 classes, of which 25 are abstract, i.e., only used for organizing the inheritance hierarchy and never instantiated in a lexeme.</Paragraph>
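    <Paragraph> A hierarchy with non-monotonic inheritance can be pictured as a chain of classes in which a local mapping overrides an inherited one. The Python sketch below is an assumption-laden illustration and not the MAGEAD specification language: the class MBC, its lookup method, the intermediate class named verb, and the default mapping placed on it are all hypothetical. Only the classes word, verb-I and verb-VIII and the mappings for conj:w and ASPECT:PERF come from the text.

```python
class MBC:
    """A morphological behavior class: feature-value pairs mapped to abstract morphemes."""

    def __init__(self, name, parent=None, mappings=None):
        self.name = name
        self.parent = parent
        self.mappings = dict(mappings or {})

    def lookup(self, feature_value):
        # Non-monotonic inheritance: the nearest class that defines the
        # mapping wins, so a subclass can override what it inherits.
        node = self
        while node is not None:
            if feature_value in node.mappings:
                return node.mappings[feature_value]
            node = node.parent
        raise KeyError(feature_value)


# Root of the hierarchy: mappings shared by all Arabic words.
word = MBC("word", mappings={"conj:w": "w+"})
# Hypothetical abstract class with a default that subclasses may override.
verb = MBC("verb", parent=word, mappings={"ASPECT:PERF": "[PAT PV:I]"})
verb_i = MBC("verb-I", parent=verb)
verb_viii = MBC("verb-VIII", parent=verb, mappings={"ASPECT:PERF": "[PAT PV:VIII]"})

print(verb_viii.lookup("ASPECT:PERF"))  # [PAT PV:VIII]
print(verb_viii.lookup("conj:w"))       # w+
```

    Stating the conjunction clitic once at the root is what lets every class below it take a cliticized conjunction without repeating the mapping.</Paragraph>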
    <Section position="1" start_page="683" end_page="683" type="sub_section">
      <SectionTitle>
5.2 Ordering and Mapping Abstract and
Concrete Morphemes
</SectionTitle>
      <Paragraph position="0"> To keep the MBC hierarchy variant-independent, we have also chosen a variant-independent representation of the morphemes that the MBC hierarchy maps to. We refer to these morphemes as abstract morphemes (AMs). The AMs are then ordered into the surface order of the corresponding concrete morphemes. The ordering of AMs is specified in a variant-independent context-free grammar. At this point, our example (2) looks like this:</Paragraph>
      <Paragraph position="2"> Note that as the root, pattern, and vocalism are not ordered with respect to each other, they are simply juxtaposed. The '+' sign indicates the ordering of affixational morphemes. Only now are the AMs translated to concrete morphemes (CMs), which are concatenated in the specified order. Our example becomes: (4) [zhr,AV1tV2V3,iaa]+at. The interdigitation of root, pattern and vocalism then yields the form Aiztahar+at.</Paragraph>
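      <Paragraph> The translation from ordered AMs to CMs amounts to a table lookup that keeps the templatic morphemes juxtaposed and the affixes in order. The following Python sketch shows that step under stated assumptions: the function to_concrete and the table layout are hypothetical, and of the three AM names only [PAT PV:VIII] appears in the text, while the names used for the vocalism and the subject suffix are invented for illustration.

```python
# Table from abstract morphemes (AMs) to MSA concrete morphemes (CMs).
# Only [PAT PV:VIII] is named in the text; the other two AM names are hypothetical.
MSA_CM_TABLE = {
    "[PAT PV:VIII]": "AV1tV2V3",
    "[VOC PV:VIII]": "iaa",
    "[SUBJSUF PV:3FS]": "+at",
}


def to_concrete(root, templatic_ams, affix_ams, table):
    """Translate ordered AMs to CMs, keeping the templatic morphemes juxtaposed."""
    templatic = (root,) + tuple(table[am] for am in templatic_ams)
    affixes = [table[am] for am in affix_ams]
    return templatic, affixes


stem, suffixes = to_concrete(
    "zhr", ["[PAT PV:VIII]", "[VOC PV:VIII]"], ["[SUBJSUF PV:3FS]"], MSA_CM_TABLE
)
print(",".join(stem) + "".join(suffixes))  # zhr,AV1tV2V3,iaa+at
```

      Because the table is the only variant-specific piece here, swapping in a table for another variant would leave the function untouched.</Paragraph>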
    </Section>
    <Section position="2" start_page="683" end_page="683" type="sub_section">
      <SectionTitle>
5.3 Morphological, Phonological, and
Orthographic Rules
</SectionTitle>
      <Paragraph position="0"> We have two types of rules. Morphophonemic/phonological rules map from the morphemic representation to the phonological and orthographic representations. This includes default rules which copy roots and vocalisms to the phonological and orthographic tiers, and specialized rules to handle hollow verbs (verbs with a glide as their middle radical), or more specialized rules for cases such as the pattern consonant change in Form VIII (the /t/ of the pattern changes to a /d/ if the first radical is /z/, /d/, or /*/; this rule operates in our example). For MSA, we have 69 rules of this type.</Paragraph>
      <Paragraph position="1"> Orthographic rules rewrite only the orthographic representation. These include, for example, rules for using the shadda (consonant doubling diacritic). For MSA, we have 53 such rules.</Paragraph>
      <Paragraph position="2"> For our example, we get /izdaharat/ at the phonological level. Using standard MSA diacritized orthography, our example becomes Aizdaharat (in transliteration). Removing the diacritics turns this into the more familiar ازدهرت Azdhrt. Note that in analysis mode, we hypothesize all possible diacritics (a finite number, even in combination) and perform the analysis on the resulting multi-path automaton.</Paragraph>
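      <Paragraph> To see why the set of hypothesized diacritizations is finite, one can enumerate it for a short word. The Python sketch below does this by brute force, which is not what MAGEAD does (it represents the alternatives as paths in an automaton); the function name diacritizations is hypothetical, and the diacritic inventory is deliberately reduced to the empty mark, the three short vowels and sukun (o), leaving out shadda and nunation.

```python
from itertools import product

# Reduced, illustrative inventory: no mark, short vowels a/i/u, and sukun (o).
DIACRITICS = ["", "a", "i", "u", "o"]


def diacritizations(undiacritized):
    """Yield every way of placing at most one diacritic after each letter."""
    for marks in product(DIACRITICS, repeat=len(undiacritized)):
        yield "".join(letter + mark for letter, mark in zip(undiacritized, marks))


candidates = set(diacritizations("Azdhrt"))
print(len(candidates))              # 15625
print("Aizdaharat" in candidates)   # True
```

      With five options per letter and six letters there are 5**6 candidates, which is why a compact automaton is a better fit than an explicit list.</Paragraph>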
    </Section>
  </Section>
  <Section position="10" start_page="683" end_page="684" type="metho">
    <SectionTitle>
6 The MAGEAD System: Implementation
</SectionTitle>
    <Paragraph position="0"> We follow (Kiraz, 2000) in using a multitape representation. We extend the analysis of Kiraz by introducing a fifth tier. The five tiers are used as follows: Tier 1: pattern and affixational morphemes; Tier 2: root; Tier 3: vocalism; Tier 4: phonological representation; Tier 5: orthographic representation. In the generation direction, tiers 1 through 3 are always input tiers. Tier 4 is first an output tier, and subsequently an input tier. Tier 5 is always an output tier.</Paragraph>
    <Paragraph position="1"> We have implemented multi-tape finite state automata as a layer on top of the AT&amp;T two-tape finite state transducers (Mohri et al., 1998). We have defined a specification language for the higher multitape level, the new Morphtools format. Specifications in the Morphtools format of different types of information, such as rules or context-free grammars for morpheme ordering, are compiled to the appropriate Lextools format (Sproat, 1995), an NLP-oriented extension of the AT&amp;T toolkit for finite-state machines. For reasons of space, we omit a further discussion of Morphtools. For details, see (Habash et al., 2005).</Paragraph>
  </Section>
  <Section position="11" start_page="684" end_page="684" type="metho">
    <SectionTitle>
7 From MSA to Levantine
</SectionTitle>
    <Paragraph position="0"> We modified MAGEAD so that it accepts Levantine rather than MSA verbs. Our effort concentrated on the orthographic representation; to simplify our task, we used a diacritic-free orthography for Levantine developed at the Linguistic Data Consortium (Maamouri et al., 2006). Changes were made only to the representations of linguistic knowledge at the four levels discussed in Section 5, not to the processing engine.</Paragraph>
    <Paragraph position="1"> Morphological Behavior Classes: The MBCs are variant-independent, so in theory no changes needed to be implemented. However, as Levantine is our first dialect, we expanded the MBCs to include two AMs not found in MSA: the aspectual particle and the postfix negation marker.</Paragraph>
    <Paragraph position="2"> Abstract Morpheme Ordering: The context-free grammar representing the ordering of AMs needed to be extended to order the two new AMs, which was straightforward.</Paragraph>
    <Paragraph position="3"> Mapping Abstract to Concrete Morphemes: This step requires four types of changes to a table representing this mapping. First, the new AMs require mapping to CMs. Second, those AMs which do not exist in Levantine need to be mapped to zero (or to an error value). These are the dual number and the subjunctive and jussive moods.</Paragraph>
    <Paragraph position="4"> Third, in Levantine some AMs allow additional CMs in allomorphic variation with the same CMs as seen in MSA. This affects three object clitics; for example, the second person masculine plural clitic, in addition to كم +kum (also found in MSA), can also be كوا +kuwA. Fourth, in five cases, the subject suffix in the imperfective is simply different in Levantine. For example, the second person feminine singular indicative imperfective suffix changes from ين +iyna in MSA to ي +iy in Levantine. Note that more changes in CMs would be required were we completely modeling Levantine phonology (i.e., including the short vowels). Morphological, Phonological, and Orthographic Rules: We needed to change one rule and add one. In MSA, the vowel between the second and third radical is deleted when they are identical ("gemination") only if the third radical is followed by a suffix starting with a vowel. In Levantine, in contrast, gemination always happens, independently of the suffix. If the suffix starts with a consonant, a long /e/ is inserted after the third radical. The new rule deletes the first person singular subject prefix for the imperfective, ا A+, when it is preceded by the aspectual marker ب b+.</Paragraph>
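    <Paragraph> The table changes described for the AM-to-CM mapping can be sketched as a Levantine table derived from the MSA one. This Python fragment is an illustration under assumptions: all AM names and both function names are hypothetical labels, None stands in for the error value, and only the concrete morphemes +kum, +kuwA, +iyna, +iy and b+ are taken from the text.

```python
# AM names below are hypothetical labels; the CM values come from the text.
MSA_CMS = {
    "OBJ:2MP": ["+kum"],
    "SUBJSUF-IV:2FS": ["+iyna"],
}

LEVANTINE_CHANGES = {
    "ASPECT-PARTICLE": ["b+"],     # new AM, not found in MSA
    "OBJ:2MP": ["+kum", "+kuwA"],  # extra allomorph alongside the MSA form
    "SUBJSUF-IV:2FS": ["+iy"],     # different suffix in Levantine
}

# AMs with no Levantine realization map to an error value (None here).
LEVANTINE_ABSENT = ["NUM:DUAL", "MOOD:SUBJUNCTIVE", "MOOD:JUSSIVE"]


def build_levantine_table(msa_table):
    """Copy the MSA table, then apply the Levantine additions and removals."""
    table = dict(msa_table)
    table.update(LEVANTINE_CHANGES)
    for am in LEVANTINE_ABSENT:
        table[am] = None
    return table


def concrete_morphemes(table, am):
    """Return the list of CMs for an AM, or fail if the variant lacks it."""
    cms = table.get(am)
    if cms is None:
        raise ValueError("no concrete morpheme for " + am)
    return cms


levantine = build_levantine_table(MSA_CMS)
print(concrete_morphemes(levantine, "OBJ:2MP"))  # ['+kum', '+kuwA']
```

    Keeping the MSA table untouched and layering the dialect changes on a copy mirrors the point that only the linguistic knowledge changes, not the processing engine.</Paragraph>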
    <Paragraph position="5"> We summarize now the expertise required to convert MSA resources to Levantine, and we comment on the amount of work needed for adding a further dialect. We modified the MBC hierarchy, but only minor changes were needed. We expect only one major further change to the MBCs, namely the addition of an indirect object clitic (since the indirect object in some dialects is sometimes represented as an orthographic clitic). The AM ordering can be read off from examples in a fairly straightforward manner; the introduction of an indirect object AM would, for example, require an extension of the ordering specification.</Paragraph>
    <Paragraph position="6"> The mapping from AMs to CMs, which is variant-specific, can be obtained easily from a linguistically trained (near-)native speaker or from a grammar handbook, and with a little more effort from an informant. Finally, the rules, which again can be variant-specific, require either a good morpho-phonological treatise for the dialect, a linguistically trained (near-)native speaker, or extensive access to an informant. In our case, the entire conversion from MSA to Levantine was performed by a native speaker linguist in about six hours.</Paragraph>
  </Section>
</Paper>