<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2114">
  <Title>Sinhala Grapheme-to-Phoneme Conversion and Rules for Schwa Epenthesis</Title>
  <Section position="4" start_page="890" end_page="891" type="metho">
    <SectionTitle>
2 Sinhala Phonemic Inventory and Writing System
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="890" end_page="890" type="sub_section">
      <SectionTitle>
2.1 The Sinhala Phonemic Inventory
</SectionTitle>
      <Paragraph position="0"> Sinhala is the official language of Sri Lanka and the mother tongue of the majority (74%) of its population. Spoken Sinhala contains 40 segmental phonemes: 14 vowels and 26 consonants, as classified below in Table 1 and Table 2 (Karunatillake, 2004).</Paragraph>
      <Paragraph position="1"> There are four nasalized vowels, which occur in only two or three words in Sinhala: /a~/, /a~:/, /ae~/ and /ae~:/ (Karunatillake, 2004). Spoken Sinhala also has the following diphthongs: /iu/, /eu/, /aeu/, /ou/, /au/, /ui/, /ei/, /aei/, /oi/ and /ai/ (Disanayaka, 1991).</Paragraph>
    </Section>
    <Section position="2" start_page="890" end_page="890" type="sub_section">
      <SectionTitle>
Vowel table header: Front, Central and Back, each split into Short and Long (table body not recovered)
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> A separate sign for vowel /\/ is not provided by the Sinhala writing system. In terms of distribution, the vowel /\/ does not occur at the beginning of a syllable except in the conjugational variants of verbs formed from the verbal stem /k\r\/ (to do). In contrast, though the letter "nyj", which symbolizes the consonant sound /O~/, exists, that sound is not considered a phoneme in Sinhala. (Table footnote: Lab. - Labial, Den. - Dental, Alv. - Alveolar, Ret. - Retroflex, Pal. - Palatal, Vel. - Velar and Glo. - Glottal.)</Paragraph>
    </Section>
    <Section position="3" start_page="890" end_page="891" type="sub_section">
      <SectionTitle>
2.2 The Sinhala Writing System
</SectionTitle>
      <Paragraph position="0"> The Sinhala character set has 18 vowels and 42 consonants, as shown in Table 3.</Paragraph>
      <Paragraph position="1"> Vowels and corresponding vowel modifiers (within brackets): a aa (*aa ) ae (*ae ) aae (*aae ) i (*</Paragraph>
      <Paragraph position="3"> Consonants: k kh g gh ng nng c ch j jh ny nyj tt tth dd ddh nn nndd t th d dh n nd p ph b bh m mb y r l v sh ss s h ll f *N *H
Special symbols: * null *null null jny
Inherent vowel remover (Hal marker): *
Table 3. Sinhala Character Set.</Paragraph>
      <Paragraph position="4"> Sinhala characters are written left to right in horizontal lines. Words are generally delimited by a space. Vowels have corresponding full-character forms when they appear in the absolute initial position of a word. In other positions, they appear as 'strokes' and are used with consonants to denote vowel modifiers. All vowels except "RR" /iru:/ are able to occur in word-initial position (Disanayaka, 1995). The vowels /@/ and /@:/ occur only in loan words of English origin.</Paragraph>
      <Paragraph position="5"> Since there are no special symbols to represent them, the "a" vowel is frequently used to symbolize them (Karunatillake, 2004).</Paragraph>
      <Paragraph position="6"> All consonants occur in word-initial position except /ng/ and nasals (Disanayaka, 1995). The symbols "nn" and "ll" represent the retroflex nasal /-/ and the retroflex lateral /AE/ respectively, but they are pronounced as their respective alveolar counterparts "n"-/n/ and "l"-/l/. Similarly, the symbol "ss", representing the retroflex sibilant /I/, is pronounced as the palatal sibilant "sh"-/ss/. The aspirated symbols corresponding to the letters k, g, c, j, tt, dd, t, d, p, b, namely kh, gh, ch, jh, tth, ddh, th, dh, ph, bh respectively, are pronounced like the corresponding unaspirates (Karunatillake, 2004). When consonants are combined with /r/ or /j/, special conjunct symbols are used. "r"-/r/ immediately following a consonant can be marked by the symbol "* null" added to the bottom of the consonant preceding it. Similarly, "y"-/j/ immediately following a consonant can be marked by the symbol "*null" added to the right-hand side of the consonant preceding it (Karunatillake, 2004). "L" /ilu/ and "LL" /ilu:/ do not occur in contemporary Sinhala (Disanayaka, 1995). Though there are 60 symbols in Sinhala (Disanayaka, 1995), only 42 symbols are necessary to represent Spoken Sinhala (Karunatillake, 2004).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="891" end_page="891" type="metho">
    <SectionTitle>
3 G2P Conversion Approaches
</SectionTitle>
    <Paragraph position="0"> The issue of mapping textual content into phonemic content is highly language dependent.</Paragraph>
    <Paragraph position="1"> The three main approaches to G2P conversion are the use of a pronunciation dictionary, the use of well-defined language-dependent rules, and data-driven methods (El-Imam and Don, 2005).</Paragraph>
    <Paragraph position="2"> One of the easiest ways of G2P conversion is the use of a lexicon or pronunciation dictionary.</Paragraph>
    <Paragraph position="3"> A lexicon consists of a large list of words together with their pronunciations. The use of lexicons has several limitations. It is practically impossible to construct a lexicon that covers the whole vocabulary of a language, owing to Zipfian phenomena. Even when a large lexicon is constructed, one faces other limitations such as efficient access and memory storage. Most lexicons do not include many proper names, and only very few provide pronunciations for abbreviations and acronyms. Only a few lexicons provide distinct entries for morphological productions of words. In addition, the pronunciations of some words differ based on context and part of speech. Further, an enormous effort is required to develop a comprehensive lexicon. In practical scenarios, speech synthesizers as well as speech recognizers need to be able to produce the pronunciation of words that are not in the lexicon. Names, morphological productivity and numbers are the three most important cases that make the use of lexica impractical (Jurafsky and Martin, 2000).</Paragraph>
    <Paragraph position="4"> To overcome these difficulties, rules can be specified for how letters are mapped to phonemes. In this way, the size of the lexicon can be reduced so that it contains only exceptions to the rules. In contrast, some systems rely on very large lexicons, together with a set of letter-to-sound conversion rules to deal with words that are not found in the lexicon (Black and Lenzo, 2003).</Paragraph>
    <Paragraph position="5"> These language and context dependent rules are formulated using phonetic and linguistic knowledge of a particular language. The complexity of devising a set of rules for a particular language is dependent on the degree of correspondence between graphemes and phonemes.</Paragraph>
    <Paragraph position="6"> For some languages such as English and French, the relationship is complex and requires a large number of rules (El-Imam and Don, 2005; Damper et al., 1998), while languages such as Urdu (Hussain, 2004) and Hindi (Ramakishnan et al., 2004; Choudhury, 2003) show regular behavior, so their pronunciation can be modeled by defining fairly regular, simple rules.</Paragraph>
    <Paragraph position="7"> Data-driven methods are widely used to avoid the tedious manual work involved in the above approaches. In these methods, G2P rules are captured by means of various machine learning techniques based on a large amount of training data. Most previous data-driven approaches have been used for English. Widely used data-driven approaches include Pronunciation by Analogy (PbA), Neural Networks (Damper et al., 1998), and Finite-State-Machines (Jurafsky and Martin, 2000). Black et al. (1998) discussed a method for building general letter-to-sound rules suitable for any language, based on training a CART decision tree.</Paragraph>
  </Section>
  <Section position="6" start_page="891" end_page="892" type="metho">
    <SectionTitle>
4 Schwa Epenthesis in Sinhala
</SectionTitle>
    <Paragraph position="0"> G2P conversion problems encountered in Sinhala are similar to those encountered in the Hindi language (Ramakishnan et al., 2004). All consonant graphemes in Sinhala are associated with an inherent vowel, schwa /@/ or /a/, which is not represented in orthography. Vowels other than /@/ and /a/ are represented in orthographic text by placing specific vowel modifier diacritics around the consonant grapheme. In the absence of any vowel modifier for a particular consonant grapheme, it is ambiguous whether /@/ or /a/ should be associated with it. The inherent vowel association in Sinhala differs from that of Hindi. In Hindi the only possible association is the schwa vowel, whereas in Sinhala either the vowel /a/ or the schwa /@/ can be associated with a consonant. Native Sinhala speakers are naturally capable of choosing the appropriate vowel (/@/ or /a/) in context. Moreover, linguistic rules describing the G2P transformation are rarely found in the literature, and the available literature does not provide any precise procedure suitable for G2P conversion of contemporary Sinhala. Automating the G2P conversion process is a difficult task due to the ambiguity of choosing between /@/ and /a/.</Paragraph>
    <Paragraph position="1"> A similar phenomenon is observed in Hindi and Malay as well. In Hindi, the "deletion of the schwa vowel (in some cases)" is successfully solved by using rule-based algorithms (Choudhury, 2003; Ramakishnan et al., 2004). In Malay, the character 'e' can be pronounced as either the vowel /e/ or /@/, and rule-based algorithms are used to address this ambiguity (El-Imam and Don, 2005).</Paragraph>
    <Paragraph position="2"> In our research, a set of rules is proposed to disambiguate the epenthesis of /a/ and /@/ when associating them with consonants. Unlike in Hindi, the schwa in Sinhala is not deleted but is always inserted. Hence, this process is named "Schwa Epenthesis" in this paper.</Paragraph>
  </Section>
  <Section position="7" start_page="892" end_page="894" type="metho">
    <SectionTitle>
5 Sinhala G2P Conversion Architecture
</SectionTitle>
    <Paragraph position="0"> An architecture is proposed to convert Sinhala Unicode text into phonemes encompassing a set of rules to handle schwa epenthesis. The G2P architecture developed for Sinhala is identical to the Hindi G2P architecture (Ramakishnan et al., 2004). The input to the system is normalized Sinhala Unicode text. The G2P engine first maps all characters in the input word into corresponding phonemes by using the letter-to-phoneme mapping table below (Table 4).</Paragraph>
    <Paragraph position="1">  The mapping procedure is given in section 5.1. Then, a set of rules is applied to this phonemic string in a specific order to obtain a more accurate version. This phonemic string is then compared with the entries in the exception lexicon. If a matching entry is found, the correct pronunciation of the text is obtained from the lexicon; otherwise the resultant phonemic string is returned. Hence, the final output of the G2P model is the phonemic transcription of the input text.</Paragraph>
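    <Paragraph> The three stages described above (mapping, ordered rules, exception lookup) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name g2p, its parameters, the toy stand-ins, the ASCII "@" for the schwa, and the choice to key the exception lexicon on the rule-derived phonemic string are all hypothetical and are not taken from the paper.

```python
def g2p(word, map_letters, rules, exception_lexicon):
    """Convert one normalized word to a phonemic string (illustrative only)."""
    # Step 1: letter-to-phoneme mapping (stands in for the Table 4 lookup).
    phonemes = map_letters(word)
    # Step 2: apply the rules in their given order to refine the string.
    for rule in rules:
        phonemes = rule(phonemes)
    # Step 3: a matching entry in the exception lexicon overrides the result.
    # Assumption: the lexicon is keyed by the rule-derived phonemic string.
    return exception_lexicon.get(phonemes, phonemes)


# Toy stand-ins, NOT the paper's actual mapping table, rules or lexicon.
toy_map = lambda word: word.replace("y", "j")
toy_rules = [lambda s: s.replace("@ji", "aji")]
toy_lexicon = {"abc": "xyz"}

print(g2p("n@my@yi", toy_map, toy_rules, toy_lexicon))
print(g2p("abc", toy_map, toy_rules, toy_lexicon))
```

Passing the mapping, rules and lexicon in as arguments keeps the control flow separate from the language-specific data, which matches the description of a table, a rule set and an exception lexicon as distinct components.</Paragraph>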
    <Section position="1" start_page="892" end_page="892" type="sub_section">
      <SectionTitle>
5.1 G2P Mapping Procedure
</SectionTitle>
      <Paragraph position="0"> Each tokenized word, represented in a Unicode normalization form, is analyzed grapheme by grapheme from left to right. The corresponding phonemes are obtained from the G2P mapping table (Table 4). As in the example given in Figure 1, no mappings are required for the Zero-Width-Joiner or for the diacritic Hal marker "*" (Halant), which is used to remove the inherent vowel of a consonant.</Paragraph>
      <Paragraph position="1">  The next step is the epenthesis of schwa /@/ for consonants. In Sinhala, a consonant is far more likely to be associated with /@/ than with the vowel /a/. Therefore, initially, all plausible consonants are associated with /@/. To obtain the accurate pronunciation, the assigned /@/ is altered to /a/ or vice versa by applying the set of rules given in the next section. However, /@/ should be associated only with consonant graphemes, other than "*N", "ng" and "*H", that carry neither a vowel modifier nor the diacritic Hal marker. In the above example, only /n/ and the first /j/ are associated with schwa, because the other consonants violate this principle. When schwa is associated with the appropriate consonants, the resultant phonemic string for the given example (section 5.1) is /n@mj@ji/.</Paragraph>
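      <Paragraph> The initial schwa assignment described above can be sketched as follows. The token names ("HAL", "ZWJ", "i-sign" and so on), the small mapping dictionaries and the phoneme symbols chosen for "*N", "ng" and "*H" are hypothetical stand-ins, since Table 4 and Figure 1 are not reproduced in this text; the ASCII "@" stands for the schwa. The token sequence in the example is an assumed decomposition that yields the phonemic string quoted above.

```python
# Hypothetical transliterated grapheme tokens and phonemes; the real
# letter-to-phoneme table (Table 4) is not available here.
CONSONANTS = {"n": "n", "m": "m", "y": "j", "k": "k",
              "*N": "N", "ng": "N", "*H": "h"}
VOWEL_SIGNS = {"i-sign": "i", "u-sign": "u"}
NO_INHERENT_VOWEL = {"*N", "ng", "*H"}  # never receive a schwa
UNMAPPED = {"HAL", "ZWJ"}               # no phoneme is emitted for these


def assign_schwa(tokens):
    """Map tokens to phonemes, adding '@' after each bare consonant."""
    out = []
    for idx, tok in enumerate(tokens):
        if tok in UNMAPPED:
            continue
        if tok in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[tok])
            continue
        out.append(CONSONANTS[tok])
        # Look one token ahead: a Hal marker or a vowel sign blocks the schwa.
        following = tokens[idx + 1: idx + 2]
        marked = bool(following) and (
            following[0] == "HAL" or following[0] in VOWEL_SIGNS)
        if tok not in NO_INHERENT_VOWEL and not marked:
            out.append("@")
    return "".join(out)


print(assign_schwa(["n", "m", "HAL", "ZWJ", "y", "y", "i-sign"]))
```

The result of this step is only a first approximation; the conversion rules of the next section decide which of the assigned schwas become /a/.</Paragraph>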
    </Section>
    <Section position="2" start_page="892" end_page="894" type="sub_section">
      <SectionTitle>
5.2 G2P Conversion Rules
</SectionTitle>
      <Paragraph position="0"> It is observed that resultant phoneme strings from the above procedure should undergo several modifications in terms of schwa assignments into vowel /a/ or vice versa, in order to obtain the accurate pronunciation of a particular word.</Paragraph>
      <Paragraph position="1"> Guided by the literature (Karunatillake, 2004), it was noticed that these modifications can be carried out by formulating a set of rules.</Paragraph>
      <Paragraph position="2"> The G2P rules were formulated with the aid of phonological rules described in the linguistic literature (Karunatillake, 2004) and by a comprehensive word search analysis using the UCSC  Sinhala corpus BETA (2005). Some of these existing phonological rules were altered in order to reflect the observations made in the corpus word analysis and to achieve more accurate results.</Paragraph>
      <Paragraph position="3"> The proposed new set of rules is empirically shown to be effective and can be conveniently implemented using regular expressions.</Paragraph>
      <Paragraph position="4"> Each rule given below is applied from left to right, and the presented order of the rules is to be preserved. All rules other than rule #1, rule #5, rule #6 and rule #8 are applied repeatedly to a single word until the conditions presented in the rules are satisfied.</Paragraph>
      <Paragraph position="5"> Rule #1: If the nucleus of the first syllable is a schwa, the schwa should be replaced by vowel /a/ (Karunatillake, 2004), except in the following situations: (a) The syllable starts with /s/ followed by /v/ (i.e. /sv/). (b) The first syllable starts with /k/, where /k/ is followed by /@/ and /@/ is in turn followed by /r/ (i.e. /k@r/). (c) The word consists of a single syllable having CV structure (e.g. /d@/). Rule #2: (a) If /r/ is preceded by any consonant, followed by /@/ and subsequently followed by /h/, then /@/ should be replaced by /a/.</Paragraph>
      <Paragraph position="6"> (/[consonant]r@h/-&gt;/[consonant]rah/) (b) If /r/ is preceded by any consonant, followed by /@/ and subsequently followed by any consonant other than /h/, then /@/ should be replaced by /a/.</Paragraph>
      <Paragraph position="7"> (/[consonant]r@[!h]/-&gt;/[consonant]ra[!h]/) (c) If /r/ is preceded by any consonant, followed by /a/ and subsequently followed by any consonant other than /h/, then /a/ should be replaced by /@/.</Paragraph>
      <Paragraph position="8"> (/[consonant]ra[!h]/-&gt;/[consonant]r@[!h]/) (d) If /r/ is preceded by any consonant, followed by /a/ and subsequently followed by /h/, then /a/ is retained.</Paragraph>
      <Paragraph position="9"> (/[consonant]ra[h]/-&gt;/[consonant]ra[h]/) Rule #3: If any vowel in the set {/a/, /e/, /ae/, /o/, /\/} is followed by /h/ and /h/ is in turn followed by schwa, then the schwa should be replaced by vowel /a/.</Paragraph>
      <Paragraph position="10"> Rule #4: If schwa is followed by a consonant  cluster, the schwa should be replaced by /a/ (Karunatillake, 2004).</Paragraph>
      <Paragraph position="11"> Rule #5: If /@/ is followed by the word-final consonant, it should be replaced by /a/, except in the situations where the word-final consonant is /r/, /b/, /I/ or /V/.</Paragraph>
      <Paragraph position="12"> Rule #6: At the end of a word, if schwa precedes the phoneme sequence /ji/, the schwa should be replaced by /a/ (Karunatillake, 2004).</Paragraph>
      <Paragraph position="13"> Rule #7: If /k/ is followed by schwa and the subsequent phonemes are /r/ or /l/ followed by /u/, then the schwa should be replaced by the phoneme /a/ (i.e. /k@(r|l)u/-&gt;/ka(r|l)u/). Rule #8: Within the given context of the following words, the /a/ found in the phoneme sequence /kal/ (the left-hand side of the arrow) should be changed to /@/, as shown on the right-hand side.</Paragraph>
      <Paragraph position="14">  The above rules handle the schwa epenthesis problem. The corresponding diphthongs (see section 2) are then obtained by processing the resultant phonetized string. This string is again analyzed from left to right, and the phoneme sequences given in the first column of Table 5 are replaced by the diphthongs given in the second column.</Paragraph>
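      <Paragraph> Since the rules are said to be implementable with regular expressions, two of them (Rule #6 and Rule #7) can be sketched with Python's re module as below. The function names and the ASCII "@" for the schwa are assumptions made for illustration, the input strings are test strings rather than attested words, and the remaining rules are deliberately left out, so the outputs are not complete pronunciations.

```python
import re


def rule_6(phonemes):
    # Rule #6: a schwa before word-final /ji/ becomes /a/.
    return re.sub(r"@ji$", "aji", phonemes)


def rule_7(phonemes):
    # Rule #7: /k@(r|l)u/ becomes /ka(r|l)u/; re.sub rewrites every
    # non-overlapping match in a single left-to-right pass.
    return re.sub(r"k@([rl])u", r"ka\1u", phonemes)


def apply_rules(phonemes):
    # The rule order is preserved, as the text requires.
    for rule in (rule_6, rule_7):
        phonemes = rule(phonemes)
    return phonemes


print(apply_rules("n@mj@ji"))
print(apply_rules("k@ru"))
```

Keeping each rule as its own small function makes the required ordering explicit and lets individual rules be tested in isolation.</Paragraph>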
      <Paragraph position="15">  The application of the above rules for the given example (section 5.1) is illustrated in Fig-</Paragraph>
    </Section>
  </Section>
</Paper>