<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1613"> <Title>Letter-to-Sound Conversion for Urdu Text-to-Speech System</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Urdu Writing System and Phonemic Inventory </SectionTitle> <Paragraph position="0"> Urdu is written in Arabic script in the Nastaleeq style using an extended Arabic character set. Nastaleeq is a cursive, context-sensitive and highly complex writing system (Hussain 2003). The character set includes basic and secondary letters, aerab (diacritical marks), punctuation marks and special symbols (Hussain and Afzal 2001, Afzal and Hussain 2001). Urdu is normally written with only the letters. However, the letters represent only the consonantal content of the string and, in some cases, under-specified vocalic content. The vocalic content can optionally be specified completely by writing the aerab with the letters. Aerab are normally not written and are assumed to be known by the native speaker, which makes the script very hard for a foreigner to read. Certain aerab are also used to specify additional consonants. Urdu letters and aerab are given in Table 1 below.</Paragraph> <Paragraph position="1"> [Table 1: Urdu letters and aerab (aerab in the bottom row)]</Paragraph> <Paragraph position="3"> Combinations of these characters realize a rich inventory of 44 consonants, 8 long oral vowels, 7 long nasal vowels, 3 short vowels and numerous diphthongs (e.g. Saleem et al. 2002, Hussain 1997; the set of Urdu diphthongs is still under analysis).</Paragraph> <Paragraph position="4"> This phonemic inventory is given in Table 2.</Paragraph> <Paragraph position="5"> The italicized phonemes, whose existence has not yet been established, are not considered any further (see Saleem et al. 
2002 for further discussion).</Paragraph> <Paragraph position="6"> The mapping of this phonetic inventory to the characters given in Table 1 is discussed later.</Paragraph> <Paragraph position="8"/> </Section> <Section position="4" start_page="0" end_page="3" type="metho"> <SectionTitle> 3 NLP for Urdu TTS </SectionTitle> <Paragraph position="0"> As discussed earlier, enabling a text-to-speech system for any language requires a Natural Language Processing component. The NLP system may have differing requirements for different languages. However, it always takes raw text as input and outputs a precise phonetic transcription for the language. The system can be divided into two parts, Text-Normalization</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Component and Phonological Processing </SectionTitle> <Paragraph position="0"> Component. These components may be further divided. A simplified schematic is shown in</Paragraph> <Paragraph position="2"> The Text Normalization component takes a character string as input and converts it into a string of letters. Within it, the Tokenizer uses punctuation marks and the spaces between words to mark token boundaries; the tokens are then tagged as words, punctuation, date, time and other relevant categories by the Semantic Tagger. The String Generator takes any non-letter input (e.g. a number or a date containing digits) and converts it into a letter string.</Paragraph> <Paragraph position="3"> After the input is converted into a string comprising only letters, the Phonological Processing Component generates the corresponding phonetic transcription. This is done through a series of processes. The first process uses the Letter-to-Sound Converter (detailed below) to convert the normalized text input to a phonemic string. This process may also be referred to as grapheme-to-phoneme conversion. This is followed by the Syllabifier, which marks syllable boundaries. 
The intermediate output is then forwarded to a module which applies Urdu sound change rules to generate the corresponding phonetic string. Following these modules, the Stress Marker and Intonation Marker modules add stress and intonation to the string being processed. Resyllabification is also performed after the sound change rules are applied, in case phones are epenthesized or deleted and syllable boundaries require re-adjustment. Urdu shows reasonably regular behavior and most of these tasks can be achieved through rule-based systems (e.g. see Hussain 1997 for a stress assignment algorithm).</Paragraph> <Paragraph position="4"> This paper focuses on Letter-to-Sound rules for Urdu, the first in the series of modules in the Phonological Processing Component.</Paragraph> </Section> <Section position="2" start_page="0" end_page="3" type="sub_section"> <SectionTitle> 4 Urdu Letter to Sound Rules </SectionTitle> <Paragraph position="0"> Urdu shows a very regular mapping from graphemes to phonemes. However, to explain the behavior, the letters need to be further classified into the following categories:
a. Consonantal characters
b. Dual (consonantal and vocalic) behavior characters
c. Vowel modifier character
d. Consonant modifier character
e. Composite (consonantal and vocalic) character
Similarly, the aerab set can also be divided into the following categories:
f. Basic vowel specifier
g. Extended vowel specifier
h. Consonantal gemination specifier
i. Dual (vocalic and consonantal) insertor
Finally, there is a third category which may take the shape of a letter and an aerab:
j. Vowel-aerab placeholder
The Consonantal characters in (a) above always represent a consonant of Urdu. In Urdu, there is always a single consonant corresponding to a single character of this category, unlike some other languages; English, for example, maps the string &quot;ph&quot; to the phoneme /f/. Most of the Urdu consonantal characters fall into this category. 
These characters and the corresponding consonantal phonemes are given in Table 3 below. A simple mapping rule would generate the phoneme corresponding to each of these characters.</Paragraph> <Paragraph position="1"> [Table 3: Urdu consonantal characters and corresponding consonantal phonemes]</Paragraph> <Paragraph position="3"> Three characters of Urdu show dual behavior, i.e. in certain contexts they act as consonants, but in other contexts they act as vowels. These characters are Alef (ا), Vao (و), and Yay (ی or ے). Alef acts exceptionally in this category and is therefore discussed separately in (j) below. Vao changes to /v/ and Yay changes to the approximant /j/ when they occur in consonantal positions (in the onset or coda of a syllable). However, when they occur as the nucleus of a syllable, they form long vowels. As an example, Yay is a consonant when it occurs in the onset of a single-syllable word (/jar/, &quot;friend&quot;) but is a vowel when it occurs word medially (/bael/, &quot;ox&quot;). These characters represent category (b) listed above.</Paragraph> <Paragraph position="4"> There is only one character in category (c), the letter Noon Ghunna (ں), which does not add any additional sound to the string but only nasalizes the preceding vowel. This letter follows and combines with the category (b) characters (when occurring as vowels) to form the nasal long vowels, e.g. (/d/, &quot;go&quot;) vs. (/d/, &quot;life&quot;). Category (d) is the letter Do-Chashmey Hay (ھ), which combines with all the stops and affricates to form aspirated (breathy or voiceless) consonants but does not add an additional phoneme. It may also combine with nasal stops and approximants to form their aspirated versions, though these sounds are not clearly established phonetically. As an example, adding this character adds aspiration to the phoneme /p/:</Paragraph> <Paragraph position="6"> (/pl/, &quot;fruit&quot;). 
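The behavior of the two modifier letters described above can be illustrated with a minimal sketch. It assumes a hypothetical input of romanized letter names and already-resolved vowel symbols; the names PAY, DO_CHASHMEY_HAY and NOON_GHUNNA, the output symbols "ph" and "a~", and the function apply_modifiers are illustrative assumptions, not the tables or the implementation used in the project.

```python
# Hypothetical letter names and phoneme symbols, for illustration only.
CONSONANTS = {"PAY": "p", "BAY": "b", "TAY": "t", "DAL": "d"}
ASPIRATE = "DO_CHASHMEY_HAY"   # category (d): consonant modifier
NASALIZE = "NOON_GHUNNA"       # category (c): vowel modifier
LONG_VOWELS = {"a", "i", "u", "e", "o"}  # assumed placeholder vowel symbols


def apply_modifiers(tokens):
    """Turn letter names and vowel symbols into a list of phonemes.

    A modifier letter changes the preceding phoneme instead of adding one.
    """
    phonemes = []
    for token in tokens:
        if token == ASPIRATE:
            if not phonemes or phonemes[-1] not in CONSONANTS.values():
                raise ValueError("aspiration marker must follow a plain consonant")
            phonemes[-1] = phonemes[-1] + "h"   # p -> ph (aspirated)
        elif token == NASALIZE:
            if not phonemes or phonemes[-1] not in LONG_VOWELS:
                raise ValueError("nasalization marker must follow a long vowel")
            phonemes[-1] = phonemes[-1] + "~"   # a -> a~ (nasal long vowel)
        else:
            # Consonant letters are looked up; vowel symbols pass through.
            phonemes.append(CONSONANTS.get(token, token))
    return phonemes
```

For example, apply_modifiers(["PAY", "DO_CHASHMEY_HAY", "a", "NOON_GHUNNA"]) yields two phonemes rather than four, which reflects the point that neither modifier letter contributes a phoneme of its own.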
Finally, there is also a single character in category (e), the Alef Madda (آ). This character is a stylistic way of writing two Alefs and thus represents an Alef in consonantal position (see (j) below) and an Alef in vocalic position, forming the /a/ vowel, e.g. b a (/b/, &quot;now&quot;) vs. b (/b/, &quot;water&quot;).</Paragraph> <Paragraph position="7"> There are three Basic vowel aerab used in Urdu, called Zabar (Arabic Fatha), Zer (Arabic Kasra) and Pesh (Arabic Damma). In addition, the absence of these aerab also defines certain vowels, and this absence is therefore referred to as the Null aerab. They combine with characters to form vowels according to the following principles: (i) Short vowels, when they occur with category (a) and (b) consonants not followed by category (b) letters.</Paragraph> <Paragraph position="8"> (ii) Long vowels, when they occur with category (a) and (b) consonants followed by, and combined with, category (b) characters.</Paragraph> <Paragraph position="9"> (iii) Long nasal vowels, when they combine with category (a) and (b) consonants followed by category (b) characters followed by the category (c) Noon Ghunna.</Paragraph> <Paragraph position="10"> Different combinations of these aerab with category (b) characters generate the various vowels, as indicated in Table 4 (all vowels are shown in combination with b (phoneme /b/), as a consonant character is required as a placeholder for the aerab).</Paragraph> <Paragraph position="11"> NULL or Zer. It is controversial whether Zer is present for the representation of the vowel /i/. One solution is to process both cases until the diction controversy is resolved.</Paragraph> <Paragraph position="12"> The existence of the remaining vocalic phoneme // is controversial in Urdu, as there is no way of expressing it using the Urdu writing system and because it is a schwa conditioned by the following /h/ phoneme and occurs only in this context. However, it may exist phonetically e.g. 
in the word (/hr/, &quot;city&quot;) (see discussion in Qureshi, 1992; also see some supporting acoustic evidence in Fatima et al., 2003, e.g. the duration of // is 136 ms compared with 235 ms for /ae/).</Paragraph> <Paragraph position="13"> The next category (g) consists of Khari Zabar. This represents the vowel Alef and, whenever it occurs on top of a Vao or Yay, replaces these sounds with the Alef vowel sound /a/, as in the words (/zkt/, &quot;zakat&quot;) and (/l/, &quot;special&quot;). Khari Zer and Ulta Pesh are sporadically referred to in Urdu as well, but they generally do not occur on Urdu words. These are not considered here. The gemination mark of category (h) is called Shad in Urdu and occurs on consonantal characters (of categories (a) and (b), except Alef). Shad geminates the consonant on which it occurs, which is normally word medial and inter-vocalic. As a result of gemination, the duplicated consonant acts as the coda of the previous syllable and the onset of the following syllable. For example, ( /.d/, &quot;a poor person&quot;) vs. ( /d.d/, &quot;mattress&quot;).</Paragraph> <Paragraph position="14"> The category (i) aerab, called Do-Zabar, occurs only on Alef (in vocalic position) and converts the long vowel /a/ to a short schwa followed by the consonant /n/, e.g. in the word (/frn/, &quot;immediately&quot;). Do-Zer and Do-Pesh are similarly referred to in Urdu but are not used generatively; they occur mostly in foreign words, especially from Arabic, and are not considered further here. If considered, they would receive a similar analysis. Finally, (j) is a very interesting category as it represents the allographs Alef and Hamza (the former a character and the latter, arguably, both an aerab and a character). Both of them are default markers and occur in complementary distribution, Alef always word initially and Hamza always otherwise. As discussed earlier, aerab in Urdu always need a Kursi (&quot;seat&quot;). 
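The complementary distribution of the two default markers can be sketched as follows, viewed from the spelling side. This is only an illustration under stated assumptions: the syllable representation (a dict with "onset" and "rest" lists), the letter names ALEF and HAMZA, and both function names are hypothetical and are not taken from the project implementation.

```python
def placeholder_for(syllable_index):
    """Return the default seat (Kursi) for a syllable without an onset.

    Alef is used word initially (index 0), Hamza in every other position.
    """
    if syllable_index == 0:
        return "ALEF"
    return "HAMZA"


def add_placeholders(syllables):
    """Give every onsetless syllable a placeholder letter for its aerab.

    Each syllable is a dict with an "onset" list and a "rest" list;
    this representation is an assumption made for the sketch.
    """
    spelled = []
    for index, syllable in enumerate(syllables):
        letters = list(syllable["onset"])
        if not letters:
            # No consonant is available to carry the aerab.
            letters.append(placeholder_for(index))
        spelled.append(letters + list(syllable["rest"]))
    return spelled
```

A letter-to-sound converter would apply the same distinction in reverse: a letter recognized as a placeholder carries the vowel but contributes no consonant of its own.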
If a short vowel occurs word initially without a consonant (i.e. in a syllable which has no onset), there is no placeholder for the aerab. A default placeholder is necessary and Alef is used. Word medially, if there is an onset-less syllable, Urdu faces the same problem. In these cases, Hamza (instead of Alef) is used as a placeholder for the aerab. There are two further possible sub-cases. In one, the preceding syllable is open and ends with a vowel. This case is very frequent and Hamza is introduced inter-vocalically (e.g. /fa.dh/, &quot;advantage&quot;). In the second, less productive sub-case, the preceding syllable is closed by a coda consonant. In this case, Hamza is t /dr.t/, &quot;courage&quot;).</Paragraph> <Paragraph position="15"> Hindi which employs a different mechanism by defining different shapes for vowels word-initially and word-medially (Matras). The Matras are anchored onto the consonants, e.g. in Aanad vaalaa, &quot;about to come&quot;, the vowel /a/ is written as Aa word initially, but is written as a word medially).</Paragraph> <Paragraph position="16"> These rules have been implemented in an on-going project (see Footnote 1 above) and are successfully generating the desired phonemic output. This phonemic output is passed through the sound change rule module to generate the desired phonetic form.</Paragraph> </Section> </Section> </Paper>