<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2114">
  <Title>Sinhala Grapheme-to-Phoneme Conversion and Rules for Schwa Epenthesis</Title>
  <Section position="8" start_page="894" end_page="895" type="evalu">
    <SectionTitle>
6 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> Text obtained from the category &quot;News Paper &gt; Feature Articles &gt; Other&quot; of the UCSC Sinhala corpus was chosen for testing because these texts are heterogeneous and were therefore perceived to represent the language better*. A list of distinct words was first extracted, and the 30,000 most frequently occurring words were chosen for testing. The overall accuracy of our G2P module was calculated at 98%, measured against transcriptions of the same words by an expert.</Paragraph>
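The evaluation procedure described above (select the most frequent distinct words, then compare each system transcription with the expert's) can be sketched as follows. This is only an illustration, not the authors' code: the function names `top_words` and `accuracy`, the toy tokens, and the toy transcriptions are all hypothetical.

```python
from collections import Counter


def top_words(tokens, n):
    """Return the n most frequent distinct words in the token list."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def accuracy(words, predicted, reference):
    """Fraction of words whose predicted transcription equals the reference."""
    correct = sum(1 for w in words if predicted[w] == reference[w])
    return correct / len(words)


# Toy data standing in for the corpus and the expert transcriptions.
tokens = ["w1", "w2", "w1", "w3", "w2", "w1"]
test_words = top_words(tokens, 2)
predicted = {"w1": "p1", "w2": "px"}  # hypothetical system output
reference = {"w1": "p1", "w2": "p2"}  # hypothetical expert transcription
print(test_words, accuracy(test_words, predicted, reference))
```

Exact string match per word is assumed here as the correctness criterion; the paper does not state whether partial credit was given.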
    <Paragraph position="1"> Since this is the first known documented work on implementing a G2P scheme for Sinhala, its contribution to the existing body of knowledge is difficult to evaluate. However, an experiment was conducted to approximate the scale of this contribution. It was first necessary to define a baseline against which this work could be measured.</Paragraph>
    <Paragraph position="2"> While this could be done by giving a single default letter-to-sound mapping for every Sinhala letter, rule #1 applies almost universally in Sinhala words (22,766 of the 30,000 words used in testing), so the baseline was defined as the application of this rule in addition to the 'default mapping'. This baseline gives an error of approximately 24%. Since the proposed solution reduces this error to 2%, this work can claim to have reduced the error by 22 percentage points.</Paragraph>
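The arithmetic behind this comparison can be checked with a short sketch. The helper name `error_reduction` is hypothetical, and the relative reduction it reports is not a figure from the paper; it is derived here from the rounded error rates quoted above (24% and 2%).

```python
def error_reduction(baseline_error, system_error):
    """Return (absolute reduction, relative reduction) for two error rates."""
    absolute = baseline_error - system_error
    relative = absolute / baseline_error
    return absolute, relative


# Rounded error rates as quoted in the text.
absolute, relative = error_reduction(0.24, 0.02)
print(f"absolute: {absolute * 100:.0f} points, relative: {relative:.1%}")
# prints "absolute: 22 points, relative: 91.7%"
```

Reporting both numbers avoids the ambiguity of a bare "22%": the absolute drop is 22 percentage points, while relative to the baseline the error falls by roughly nine-tenths.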
    <Paragraph position="3"> An error analysis revealed the following types of errors (Table 6): Error description | # of words. Compound words (i.e., single words formed by combining two or more distinct words, as in the English word &quot;thereafter&quot;).</Paragraph>
    <Paragraph position="4">  Foreign (mainly English) words directly encoded in Sinhala, e.g. faessn - fashion, kaemps - campus.</Paragraph>
    <Paragraph position="5">  The errors categorized as &quot;Other&quot; are given below with clarifications: * The modifier used to denote the long vowel &quot;aa&quot; /a:/ is &quot;*aa&quot;, which is known as &quot;Aelapilla&quot;. E.g. the consonant &quot;k&quot; /k/ combines with &quot;*aa&quot; /a:/ to produce the grapheme &quot;kaa&quot;, which is pronounced as /ka:/. [* Footnote on the test category: it accounts for almost two-thirds of the size of this version of the corpus.]</Paragraph>
    <Paragraph position="6"> The above exercise revealed some 37 words that end without the vowel modifier &quot;*aa&quot; but are usually pronounced with the associated long vowel /a:/. In the following examples, each input word is listed first, followed by the erroneous output of G2P conversion and then the correct transcription.</Paragraph>
    <Paragraph position="7"> &quot;amm&quot; (mother) -&gt; /amm@/ -&gt; /amma:/; &quot;akk&quot; (sister) -&gt; /akk@/ -&gt; /akka:/; &quot;gtt&quot; (taken) -&gt; /gatt@/ -&gt; /gatta:/. * There were 27 erroneously converted words containing the letter &quot;h&quot;, which corresponds to the phoneme /h/. The study revealed that this letter behaves unusually in G2P conversion.</Paragraph>
    <Paragraph position="8"> * The modifier used to denote the vowel &quot;R&quot;, &quot;*R&quot;, is known as &quot;Geta-pilla&quot;. When this vowel appears as the initial letter of a word, it is pronounced as /ri/, as in &quot;Rnn&quot; /rin@/ (minus). When the corresponding vowel modifier appears in the middle of a word, it is pronounced as /ru/ most of the time (Disanayaka, 2000), e.g. &quot;kRtiy&quot; (book) is pronounced as /krutij@/, &quot;pRss tthy&quot; (surface) - /prussV\j\/, &quot;utkRss tt&quot; (excellent) - /utkrussV\/. However, 13 words were found to be exceptions to this general rule. In those words, the &quot;*R&quot; is pronounced as /ur/ rather than /ru/, e.g. &quot;pvRtti&quot; (news) - /pr@wurti/, &quot;smRddhi&quot; (prosperity) - /samurdi/, &quot;vivRt&quot; (opened) - /wiwurt@/.</Paragraph>
    <Paragraph position="9"> * In general, the vowel modifiers &quot;*ae&quot; (Adha-pilla) and &quot;*aae&quot; (Diga Adha-pilla) symbolize the vowels &quot;ae&quot; /ae/ and &quot;aae&quot; /ae:/ respectively, e.g. the consonant &quot;k&quot; /k/ combines with the vowel modifier &quot;*ae&quot; to create &quot;kae&quot;, which is pronounced as /kae/. A few words were found in which this rule is violated. In such words, the vowel modifiers &quot;*ae&quot; and &quot;*aae&quot; represent the vowels &quot;u&quot; /u/ and &quot;uu&quot; /u:/ respectively, e.g. &quot;jnshaeti&quot; (legend) - /Oan@ssruti/, &quot;kaaer&quot; (cruel) - /kru:r\/. * The verbal stem &quot;kr&quot; (to do) is pronounced as /k@r@/. Although many words start with this verbal stem, a few other words beginning with the same letters are pronounced differently, as /kar@/ or /kara/.</Paragraph>
    <Paragraph position="10"> E.g. &quot;krtty&quot; (cart) - /karatt@y@/, &quot;krvl&quot; (dried fish) - /kar@v@l@/.</Paragraph>
    <Paragraph position="11">  * A few of the remaining errors are due to homographs: &quot;vn&quot; - /van@/, /v@n@/; &quot;kl&quot; - /kal@/, /k@l@/; &quot;kr&quot; - /kar@/, /k@r@/. The above error analysis itself shows that the model can be extended. Failures in the current model are mostly due to compound words and foreign words directly encoded in Sinhala (1.66%). The accuracy of the G2P model can be increased significantly by incorporating a method to identify compound words and transcribe them accurately. If the constituent words of a compound word can be identified and separated, the same set of rules can be applied to each constituent word, and the resulting phonetized strings can be combined to obtain the correct pronunciation. The same problem is observed in Hindi.</Paragraph>
    <Paragraph position="12"> Ramakishnan et al. (2004) proposed a procedure for extracting compound words from a Hindi corpus. The use of a compound-word lexicon in their rule-based G2P conversion module improved the accuracy of G2P conversion by 1.6% (Ramakishnan et al., 2004). In our architecture, the most frequently occurring compound words and foreign words are handled with the aid of an exceptions lexicon. Homographs are also disambiguated using the most frequently occurring words in Sinhala. Future improvements to the architecture will include a compound word identification and phonetization module.</Paragraph>
  </Section>
</Paper>