<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1073">
<Title>Maximum Entropy Based Restoration of Arabic Diacritics</Title>
<Section position="4" start_page="0" end_page="577" type="intro">
<SectionTitle> 2 Arabic Diacritics </SectionTitle>
<Paragraph position="0"> The Arabic alphabet consists of 28 letters that can be extended to a set of 90 by additional shapes, marks, and vowels (Tayli and Al-Salamah, 1990).</Paragraph>
<Paragraph position="1"> The 28 letters represent the consonants and long vowels such as ا, ى (both pronounced as /a:/), ي (pronounced as /i:/), and و (pronounced as /u:/). Long vowels are constructed by combining ا, ى, ي, and و with the short vowels. The short vowels and certain other phonetic information such as consonant doubling (shadda) are not represented by letters, but by diacritics. A diacritic is a short stroke placed above or below the consonant. Table 1 shows the complete set of Arabic diacritics. [Table 1: Arabic diacritics, shown on the consonant ت (pronounced as /t/).]</Paragraph>
<Paragraph position="2"> We split the Arabic diacritics into three sets: short vowels, doubled case endings, and syllabification marks. Short vowels are written as symbols either above or below the letter in text with diacritics, and dropped altogether in text without diacritics. We find three short vowels: * fatha: it represents the /a/ sound and is an oblique dash over a consonant, as in تَ (cf.</Paragraph>
<Paragraph position="3"> fourth row of Table 1).</Paragraph>
<Paragraph position="4"> * damma: it represents the /u/ sound and is a loop over a consonant that resembles the shape of a comma (cf. fifth row of Table 1).</Paragraph>
<Paragraph position="5"> * kasra: it represents the /i/ sound and is an oblique dash under a consonant (cf.
sixth row of Table 1).</Paragraph>
<Paragraph position="6"> The doubled case ending diacritics are vowels used at the end of words to mark case distinctions; they can be regarded as doubled short vowels, and the term &quot;tanween&quot; is used for this phenomenon. As with the short vowels, there are three tanween diacritics: tanween al-fatha, tanween al-damma, and tanween al-kasra. They are placed on the last letter of the word and have the phonetic effect of placing an &quot;N&quot; at the end of the word. Text with diacritics also contains two syllabification marks: * shadda: it is a gemination mark placed above the Arabic letter, as in تّ. It denotes the doubling of the consonant. The shadda is usually combined with a short vowel.</Paragraph>
<Paragraph position="8"> * sukuun: written as a small circle, as in تْ. It indicates that the letter carries no vowel.</Paragraph>
<Paragraph position="9"> Figure 1 shows an Arabic sentence transcribed with and without diacritics. In modern Arabic, text is normally written without diacritics. Because many words with different vowel patterns may appear identical in a diacritic-less setting, considerable ambiguity exists at the word level.</Paragraph>
<Paragraph position="10"> The word كتب, for example, has 21 possible forms that have valid interpretations when adding diacritics (Kirchhoff and Vergyri, 2005). It may be read as the verb &quot;to write&quot;, كَتَبَ (pronounced /kataba/). It can also be read as the noun &quot;books&quot;, كُتُبٌ (pronounced /kutubun/). A study by Debili et al. (2002) shows that there is an average of 11.6 possible diacritizations for every non-diacritized word when analyzing a text of 23,000 script forms.</Paragraph>
<Paragraph position="12"> Figure 1: An Arabic sentence written without (upper row) and with (lower row) diacritics.
The English translation is &quot;the president wrote the document.&quot; Arabic diacritic restoration is a non-trivial task, as noted by El-Imam (2003). Native speakers of Arabic are able, in most cases, to accurately vocalize words in text based on their context, their knowledge of the grammar, and the lexicon of Arabic. Our goal is to convert the knowledge used by native speakers into features and incorporate them into a maximum entropy model. We assume that the input text does not contain any diacritics.</Paragraph>
</Section>
</Paper>