<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0509">
  <Title>A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew Morphological analysis, lemmatization, vocalization, disambiguation and text-to-speech</Title>
  <Section position="3" start_page="0" end_page="4" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
1.1 The common Semitic basis from an NLP standpoint
</SectionTitle>
      <Paragraph position="0"> Modern Standard Arabic (MSA) and Modern Hebrew (MH) share the basic Semitic traits: a rich morphology based on consonantal roots (Jiðr / ʃoreʃ), which depends on vowel changes and in some cases consonantal insertions and deletions to create inflections and derivations. For example, in MSA: the consonantal root /ktb/ combined with the vocalic pattern CaCaCa derives the verb kataba 'to write'. This derivation is further inflected into forms that indicate semantic features, such as number, gender, tense etc.: katab-tu 'I wrote', katab-ta 'you (sing. masc.) wrote', katab-ti 'you (sing. fem.) wrote', ?a-ktubu 'I write/will write', etc.</Paragraph>
      <Paragraph position="1"> A remark about the notation: Phonetic transcriptions always appear in italics and follow the IPA convention, except for the following: ? glottal stop, voiced pharyngeal fricative (Ayn), d velarized d, s velarized s. Orthographic transliterations appear in curly brackets. Bound morphemes (affixes, clitics, consonantal roots) are written between two slashes. Arabic and Hebrew linguistic terms are written in phonetic spelling beginning with a capital letter. The Arabic term comes first.</Paragraph>
      <Paragraph position="2"> For a review of the different approaches to Semitic inflections see Beesley (2001), p. 2.</Paragraph>
      <Paragraph position="3"> Similarly in MH: the consonantal root /ktv/ combined with the vocalic pattern CaCaC derives the verb katav 'to write', and its inflections are: katav-ti 'I wrote', katav-ta 'you (sing. masc.) wrote', katav-t 'you (sing. fem.) wrote', e-xtov 'I will write' etc.</Paragraph>
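The root-and-pattern derivation described in this subsection can be illustrated with a minimal sketch. The function name interdigitate is hypothetical and is not taken from Morfix; the sketch only fills the C slots of a pattern with root consonants and ignores the consonantal insertions and deletions mentioned above.

```python
def interdigitate(root, pattern):
    """Fill each C slot of a vocalic pattern with the next root consonant."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern slots and root consonants do not match")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


# Examples taken from the text: MSA /ktb/ + CaCaCa and MH /ktv/ + CaCaC.
msa_verb = interdigitate("ktb", "CaCaCa")
mh_verb = interdigitate("ktv", "CaCaC")
print(msa_verb, mh_verb)

# Suffixing the MH person markers quoted in the text gives inflected forms.
print(mh_verb + "-ti", mh_verb + "-ta")
```

A real analyzer would run this mapping in reverse, recovering candidate roots and patterns from a surface form, which is considerably harder than generation.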
      <Paragraph position="4"> In fact, morphological similarity extends much further than this general observation, and includes very specific similarities that matter for NLP systems, such as the use of nominal forms to mark tenses and moods of verbs, the use of pronominal enclitics to convey direct objects, and the use of proclitics to convey some prepositions. Moreover, the inflectional patterns and clitics are quite similar in form in most cases. Both languages exhibit construct formation (Ida:fa / Smixut), which is similar in its structure and in its role. The suffix marking feminine gender is also similar, and the similarity goes as far as peculiarities in the numeral system, where the feminine gender suffix marks the masculine. Some of these phenomena will be demonstrated below.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
1.2 Lemmatization of Semitic Languages
</SectionTitle>
      <Paragraph position="0"> A consistent definition of lemma is crucial for a data retrieval system. A lemma can be said to be equivalent to a lexical entry: the basic grammatical unit of natural language that is semantically closed. In applications such as search engines, it is usually the lemma that is sought, while additional information such as tense, number, and person is dispensable.</Paragraph>
      <Paragraph position="1"> In MSA and MH a lemma is actually the common denominator of a set of forms (hundreds or thousands of forms in each set) that share the same meaning and some morphological and syntactic features. Thus, in MSA, the forms ?awla:d and walada:ni, despite their remarkable difference in appearance, share the same lemma WALAD 'a boy'.</Paragraph>
      <Paragraph position="2"> This is even more noticeable in verbs, where forms like kataba, yaktubu, kutiba, yuktabu, kita:ba and many more are all part of the same lemma: KATABA 'to write'.</Paragraph>
      <Paragraph position="3"> The rather large number of inflections and complex forms (forms that include clitics, see 1.5 below) possible for each lemma results in a high total number of forms, which, in fact, is estimated to be the same for both languages: around 70 million. The mapping of these forms into lemmas is inconclusive (see Dichy (2001), p. 24). Hence the question arises: what should be defined as a lemma in MSA and MH?</Paragraph>
      <Paragraph position="4"> For Arabic, see Beesley (2001), p. 7. For Hebrew, our own sources.</Paragraph>
      <Paragraph position="5"> The fact that MSA and MH morphology is root-based might suggest identifying the lemma with the root. But this solution is not satisfactory: in most cases there is indeed a diachronic relation in meaning among words and forms of the same consonantal root. However, semantic shifts which occur over the years rule out this method in synchronic analysis. Moreover, some diachronic processes result in totally coincidental sharing of a root by two or more completely different semantic domains. For example, in MSA, the words fajr 'dawn' and infija:r 'explosion' share the same root /fjr/ (the latter might have originally been a metaphor). Similarly, in MH the verbs pasal 'to ban, disqualify' and pisel 'to sculpt' share the same root /psl/ (the former is an old loan from Aramaic).</Paragraph>
      <Paragraph position="6"> In Morfix, as described below (2.1), a lemma is defined not as the root, but as the manifestation of this root, most commonly as the least marked form of a noun, adjective or verb. Some arbitrariness in the implementation of this definition is unavoidable, due to the fine line between inflectional morphology and derivational morphology. However, Morfix generally follows the tradition set by dictionaries, especially bilingual dictionaries. Thus, for example, a difference in part of speech entails different lemmas, even if the morphological process is partially predictable.</Paragraph>
      <Paragraph position="7"> Similarly each verb pattern (Wazn / Binyan) is treated as a different lemma.</Paragraph>
      <Paragraph position="8"> Even so, the roots should not be overlooked, as they are a good basis for forming groups of lemmas; in other words, the root can often serve as a super-lemma, joining together several lemmas, provided they all share a semantic field.</Paragraph>
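The idea of the root as a super-lemma can be sketched as a simple grouping step. This is only an illustration under stated assumptions: the function name super_lemmas, the semantic field labels, and the choice to represent the 'to dictate' lemma by its form yuktibu are all hypothetical and do not describe the Morfix data model.

```python
from collections import defaultdict

# (lemma, root, semantic field). The roots and glosses come from the text;
# the field labels are illustrative assumptions.
ENTRIES = [
    ("KATABA (to write)", "ktb", "writing"),
    ("yuktibu (to dictate)", "ktb", "writing"),
    ("fajr (dawn)", "fjr", "dawn"),
    ("infija:r (explosion)", "fjr", "explosion"),
    ("pasal (to ban)", "psl", "prohibition"),
    ("pisel (to sculpt)", "psl", "sculpture"),
]


def super_lemmas(entries):
    """Group lemmas that share both a consonantal root and a semantic field."""
    groups = defaultdict(list)
    for lemma, root, field in entries:
        groups[(root, field)].append(lemma)
    return dict(groups)


for key, lemmas in super_lemmas(ENTRIES).items():
    print(key, lemmas)
```

Keying on the pair of root and field keeps coincidental root sharing, such as /fjr/ and /psl/, from merging unrelated lemmas.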
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
1.3 The Issue of Nominal Inflections of Verbs
</SectionTitle>
      <Paragraph position="0"> The inconclusive selection of lemmas in MSA and MH can be demonstrated by looking into an interesting phenomenon: the nominal inflections of verbs (roughly parallel to the Latin participle, see below). Since this issue is a good example both of a characteristic of Semitic NLP and of the similarities between MSA and MH, it is worthwhile to elaborate on it.</Paragraph>
      <Paragraph position="1"> Both MSA and MH use the nominal inflections of verbs to convey tenses, moods and aspects.</Paragraph>
      <Paragraph position="2"> These inflections are derived directly from the verb according to strict rules, and their forms are predictable in most cases. Nonetheless, grammatically, these forms behave as nouns or adjectives. This means that they bear case marking in MSA, nominal marking for number and gender (in both languages) and they can be definite or indefinite (in both languages). Moreover, these inflections often serve as nouns or adjectives in their own right. This, in fact, causes the crucial problem for data retrieval, since the system has to determine whether the user refers to the noun/adjective or rather to the verb for which it serves as an inflection. Nominal inflections of verbs exist in non-Semitic languages as well; in most European languages participles and infinitives have nominal features. However, two Semitic traits make this phenomenon more challenging in our case. The first is the rich morphology, which creates a large set of inflections for each base form (i.e. the verb is inflected to create nominal forms and then each form is inflected again for case, gender and number).</Paragraph>
      <Paragraph position="3"> Furthermore, Semitic languages allow nominal clauses, namely verbless sentences, which increase ambiguity. For example, in English it is easy to recognize the form drunk in he has drunk as related to the lemma DRINK (V) (and not as an adjective). This is done by spotting the auxiliary has which precedes this form. However, in MH the clause axi ʃomer could mean 'my brother is a guard' or 'my brother guards/is guarding'. The syntactic cues for the final decision are subtle and elusive. Similarly in MSA: axi ka:tibun could mean 'my brother is writing' or 'my brother is a writer'.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
1.4 Orthography
</SectionTitle>
      <Paragraph position="0"> From the viewpoint of NLP, especially commercially applicable NLP, it is important to note that the writing systems of both MSA and MH follow the same conventions, in which most vowels are not marked. Therefore, in MSA the form yaktubu 'he writes/will write' is written {yktb}. Similarly in MH, the form yilmad 'he will learn' is written {ylmd}. Both languages have a supplementary marking system for vocalization (written above, under and beside the text), but it is not used in the overwhelming majority of texts. In both languages, when vowels do appear as letters, letters of consonantal origin are used, which makes these letters ambiguous (between their consonantal and vocalic readings).</Paragraph>
      <Paragraph position="1"> It is easy to see the additional difficulty that this writing convention presents for NLP. The string {yktb} in MSA can be interpreted as yaktubu (future tense), yaktuba (subjunctive), yaktub (jussive), yuktabu (future tense passive) and even yuktibu 'he dictates/will dictate', a form that is considered by Morfix to be a different lemma altogether (see 1.2 above). Furthermore, ambiguity can occur between totally unrelated words, as will be shown in section 1.7. A trained MSA reader can distinguish between these forms by using contextual cues (both syntactic and semantic). A similar contextual sensitivity must be programmed into the NLP system in order to meet this challenge.</Paragraph>
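The collapse of several vocalized readings into one written string can be reproduced with a minimal sketch. It assumes, for illustration only, that the unvocalized spelling is obtained by dropping the short vowels a, i and u from the transcription; long vowels, vowel letters and the Hamza are not modelled, and the function name unvocalized is hypothetical.

```python
SHORT_VOWELS = "aiu"


def unvocalized(form):
    """Drop short vowels from a transcription (simplified assumption)."""
    return "".join(ch for ch in form if ch not in SHORT_VOWELS)


# The five MSA readings of {yktb} listed in the text.
READINGS = ["yaktubu", "yaktuba", "yaktub", "yuktabu", "yuktibu"]

spellings = {unvocalized(reading) for reading in READINGS}
print(spellings)

# The MH example from the text behaves the same way.
print(unvocalized("yilmad"))
```

All five readings map to a single spelling, so the written form alone cannot identify the reading and context has to decide.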
      <Paragraph position="2"> Each language also has some orthographic peculiarities of its own. The most striking in MH is the multiple spelling conventions that are used simultaneously. The classical convention has been replaced in most texts with some kind of spelling system that partially indicates vowels, and thus reduces ambiguities. An NLP system has to take into account the various spelling systems and the fact that the classical convention is still occasionally used. Thus, each word often has more than one spelling. For example: the word shi?ur 'a lesson' can be written {wr} or {ywr}. The word kiven 'to direct' can be written {kwn} or {kywwn}; the former is the classical spelling (Ktiv Xaser) while the latter is the standard semi-vocalized system (Ktiv Male), but some non-standard spellings can also appear: {kywn}, {kwwn}.</Paragraph>
      <Paragraph position="3"> MSA spelling is much more standardized and follows classical conventions. Nonetheless, some of these conventions may seem confusing at first sight. The Hamza sign, which represents the glottal stop phoneme, can be written in five different ways, depending on its phonological environment. Therefore, any change in vowels (a very regular phenomenon in MSA inflectional paradigms) results in a different shape of Hamza. This occurs even when the vowels themselves are not marked. Moreover, there is often more than one shape possible per form, without any mandatory convention. One could argue that all Hamza shapes should be encoded as one for our purposes. This may solve some problems, but it would also deprive us of crucial information about the vowels in the word.</Paragraph>
      <Paragraph position="4"> Since the Hamza changes according to the vowels around it, it is a good cue for retrieving the vocalization of the word and for reducing ambiguity.</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
1.5 Clitics and Complex Forms
</SectionTitle>
      <Paragraph position="0"> The phenomenon described in this section is related both to the morphological structure of MSA and MH, and to the orthographical conventions shared by these languages. Both languages use a diverse system of clitics that are appended to the inflectional forms, creating complex forms and further complications in proper lemmatization and data retrieval.</Paragraph>
      <Paragraph position="1"> For example, in MSA, the form ?awla:dun 'boys (nom.)', a part of the lemma WALAD 'boy', can take the genitive pronominal enclitic /-ha/ 'her' and create the complex form ?awla:d-u-ha boys-nom.-her (='her boys'). This complex form is orthographically represented as follows: {?wladha}. Similarly in Hebrew, the form yeladim 'children' (of the lemma YELED 'child'), combined with the genitive pronominal enclitic /-ha/ 'her', yields the complex form yelade-ha children-her (='her children'). The orthographical representation is: {yldyh}.</Paragraph>
      <Paragraph position="2"> Enclitics usually denote genitive pronouns for nouns (as demonstrated above) and accusative pronouns for verbs. For example, in MSA, ?akaltu-hu 'I ate it' {?klth}, or in MH axalti-v 'I ate it' {?kltyw}. It is easy to see how this phenomenon, especially the orthographic convention which conjoins these enclitics to the basic form, may create confusion in lemmatizing and data retrieval. However, the nature of clitics, which limits their position and possible combinations, helps to locate them and trace the basic form from which the complex one was created.</Paragraph>
      <Paragraph position="3"> There are also several proclitics denoting prepositions and other particles, attached to the following form by orthographic convention. The most common are the conjunctions /w, f/, the prepositions /b, l, k/ and the definite article /al/ in MSA, and the conjunction /w/, the prepositions /b, k, l, m/ (often referred to as Otiyot Baxlam), the relative pronoun /ʃ/ and the definite article /h/ in MH. Therefore, in MSA, the phrase wa-li-l-?awla:di 'and to the boys' will have the following orthographical representation: {wll?wlad}. In MH the phrase ve-la-yeladim 'and to the children' will be represented orthographically as: {wlyldym}.</Paragraph>
      <Paragraph position="4"> The term clitics is employed here as the closest term which can describe this phenomenon without committing to any linguistic theory.</Paragraph>
      <Paragraph position="5"> Once again, when scanning a written text, these proclitics must be taken into account in the lemmatization process.</Paragraph>
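How proclitics might be taken into account when scanning text can be sketched as a search over prefix splits. The sketch is not the Morfix algorithm: the names MH_PROCLITICS, LEXICON and segmentations are hypothetical, the lexicon is a toy with one entry, the relative pronoun is left out, and the ordering constraints on clitics mentioned above are not enforced.

```python
# Single-letter MH proclitics named in the text, in transliteration.
MH_PROCLITICS = set("wbklmh")

# Hypothetical toy lexicon of unvocalized base forms: {yldym} = yeladim.
LEXICON = {"yldym"}


def segmentations(word, lexicon=LEXICON):
    """Return every split into proclitic letters plus a known base form."""
    results = []
    for i in range(len(word)):
        prefix, base = word[:i], word[i:]
        if all(ch in MH_PROCLITICS for ch in prefix) and base in lexicon:
            results.append((list(prefix), base))
    return results


# The MH example from the text: ve-la-yeladim, written {wlyldym}.
print(segmentations("wlyldym"))
```

With a realistic lexicon a string usually admits several such splits, which is one source of the ambiguity discussed in section 1.7.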
    </Section>
    <Section position="6" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
1.6 Syntax
</SectionTitle>
      <Paragraph position="0"> The syntactic structure of MSA and MH is very similar. In fact, the list of major syntactic rules is almost identical, though the actual application of these rules may differ between the languages. A good demonstration of that is the agreement rule. Both languages demand strict noun-adjective-verb agreement. The agreement includes features such as number, gender, definiteness and, in MSA, also case marking (in noun-adjective agreement). The MH agreement rule is more straightforward than the MSA one. For example: ha-yeladim ha-gdolim halxu the-child-pl. the-big-pl. go-past-pl. (=The big children went). Note that all elements in the sentence are marked as plural, and the noun and the adjective also agree in definiteness. The case of MSA is slightly different. MSA has incomplete agreement in verb-subject sentences, which are the vast majority. In this case the verb agrees with the subject only in gender, not in number, e.g. ðahaba l-?awla:du go-past-masc.-sing. boy-pl. (=The boys went). MSA also distinguishes between human plural forms and non-human plural forms, i.e. if the plural form does not have a human referent, the verb or the adjective will be marked as feminine singular rather than plural, e.g. ðahabat el-kila:bu l-kabi:ratu go-past-fem.-sing. the-dog-masc.-pl. the-big-fem.-sing. (=The big dogs went).</Paragraph>
      <Paragraph position="1"> The example of the agreement rule demonstrates both the similarities and the differences between MSA and MH. Furthermore, it demonstrates how minor the differences are as far as our purposes go. As long as the agreement rule is taken into account, its actual implementation has hardly any consequences at the level of the system. This example also demonstrates a very useful cue for reducing ambiguity among forms. This cue is probably used intuitively by trained readers of MSA and MH, and encoding it into the Morfix NLP system turns out to be quite useful.</Paragraph>
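The agreement rule can be expressed as a small function that predicts the marking a verb should carry, which an analyzer could then use to discard readings that do not agree. This is a hedged sketch of the three cases described above only; the function name, the feature labels and the dictionary layout are assumptions and do not reflect how Morfix encodes the rule.

```python
def expected_verb_marking(language, subject):
    """Return the (gender, number) marking expected on the verb."""
    gender = subject["gender"]
    number = subject["number"]
    if language == "MH":
        # MH: the verb agrees with the subject in all features shown here.
        return (gender, number)
    if language == "MSA":
        # MSA, verb-subject order: a non-human plural takes feminine singular.
        if number == "pl" and not subject["human"]:
            return ("fem", "sing")
        # Otherwise the verb agrees in gender only and stays singular.
        return (gender, "sing")
    raise ValueError("unsupported language: " + language)


boys = {"gender": "masc", "number": "pl", "human": True}
dogs = {"gender": "masc", "number": "pl", "human": False}

print(expected_verb_marking("MH", boys))
print(expected_verb_marking("MSA", boys))
print(expected_verb_marking("MSA", dogs))
```

The two languages differ only in the body of one branch, which matches the observation that the implementation details have little effect at the level of the system.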
    </Section>
    <Section position="7" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
1.7 Ambiguity
</SectionTitle>
      <Paragraph position="0"> Perhaps the major challenge for NLP analysis in MSA and MH is overcoming the ambiguity of forms. In this respect, Morfix has to imitate the rather sophisticated reading of a trained MSA or MH speaker, who continuously disambiguates word tokens while reading.</Paragraph>
      <Paragraph position="1"> The reasons for ambiguity can be reduced to three main factors: i. The large number of morphological forms, which are sometimes homographic.</Paragraph>
      <Paragraph position="2"> For example, both in MSA and MH the verbal inflection of the imperfect for the singular is the same for 2nd person masculine and 3rd person feminine: MSA taktubu, MH tixtov. ii. The possibility of creating complex forms by conjoining clitics, which raises the possibility of coincidental identity.</Paragraph>
      <Paragraph position="3"> For example, in MSA: ka-ma:l 'as money', kama:l 'perfection', Kamal (proper name) → {kmal}. Similarly in MH: ha-naxa 'the-resting-fem.', hanaxa 'an assumption, a discount' → {hnhh}.</Paragraph>
      <Paragraph position="4"> iii. The orthographical conventions, such as the lack of vowel marking and various spelling alternatives. For example, in MSA: muda:fi 'defender', mada:fi 'cannons' → {mdaf}, and in MH baneha 'her sons', bniya 'building' → {bnyh}. In many cases ambiguity is the result of the combination of two factors or even all three. This makes the ambiguity rate rather high, and its resolution a major component of the NLP mechanism. Disambiguation is based on syntactic structures and semantic cues that can be retrieved from the text, which might resemble the way a human reader copes with these problems. It is the objective of NLP systems dealing with MSA and MH to formalize these cues.</Paragraph>
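The effect of these factors can be made concrete with a homograph index that maps each written string to all of its readings. The sketch below uses only examples quoted in this section; the names ANALYSES and build_index are hypothetical, and a real system would derive the readings from a morphological analyzer rather than a hand-written list.

```python
from collections import defaultdict

# (written string, reading, gloss), taken from the examples in the text.
ANALYSES = [
    ("kmal", "ka-ma:l", "as money"),
    ("kmal", "kama:l", "perfection"),
    ("kmal", "Kamal", "proper name"),
    ("bnyh", "baneha", "her sons"),
    ("bnyh", "bniya", "building"),
]


def build_index(analyses):
    """Map each written string to the list of readings it may stand for."""
    index = defaultdict(list)
    for spelling, reading, gloss in analyses:
        index[spelling].append((reading, gloss))
    return dict(index)


index = build_index(ANALYSES)
for spelling, readings in index.items():
    print(spelling, len(readings), readings)
```

Disambiguation then amounts to choosing one entry from each list, using cues such as the agreement rule of section 1.6.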
    </Section>
  </Section>
</Paper>