<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2005">
  <Title>Thai Grapheme-Based Speech Recognition</Title>
  <Section position="3" start_page="17" end_page="17" type="metho">
    <SectionTitle>
2 Grapheme-to-Phoneme Relation in Thai
</SectionTitle>
    <Paragraph position="0"> In the grapheme-based approach, the pronunciation dictionary is constructed by splitting a word into its constituent letters. Previous experiments have shown that the quality of a grapheme-based recognizer depends strongly on the nature of the grapheme-to-phoneme relation of the specific language (Killer, 2003). In this section we take a closer look at the grapheme-to-phoneme relation in Thai.</Paragraph>
    <Paragraph position="1"> Thai, an alphabetical language, has 44 letters for 21 consonant sounds, 19 letters for 24 vowel sounds (9 short vowels, 9 long vowels and 6 diphthongs), 4 letters for tone markers (5 tones), a few special letters, and numerals. Several characteristics of Thai writing can cause problems for GBSR: Some vowel letters can appear before, after, above or below a consonant letter. For example, in the word "แมว" (/mae:w/), the vowel "แ" (/ae:/) appears before the consonant "ม" (/m/).</Paragraph>
    <Paragraph position="2"> Some vowel and consonant letters can be combined to make a new vowel. For example, in the word "มัว" (/mua/), the vowel /ua/ is composed of the vowel letter "ั" and the consonant letter "ว".</Paragraph>
    <Paragraph position="3"> Some vowels are represented by more than one vowel letter. For example, the vowel /ae/ requires two vowel letters, "แ" and "ะ". To form a syllable, a consonant is inserted between the two vowel letters, as in "และ" (/lae/), where the consonant "ล" (/l/) is in the middle.</Paragraph>
    <Paragraph position="4"> In some syllables, vowel letters are not explicitly written. For example, the word "ยก" (/yok/) consists of two consonant letters, "ย" (/y/) and "ก" (/k/); there is no letter to represent the vowel /o/.</Paragraph>
    <Paragraph position="5"> The special letter "์", called Karan, is a deletion marker. If it appears above a consonant, that consonant is ignored. Sometimes it can also delete the immediately preceding consonant or even the whole syllable.</Paragraph>
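    <Paragraph> The simplest case of this deletion rule can be sketched in Python. This is our own minimal illustration with an invented function name; the paper does not specify its actual rules, and the preceding-consonant and whole-syllable cases are omitted:
```python
KARAN = "\u0e4c"  # Thai thanthakhat ("Karan"): marks the letter it sits on as silent

def apply_karan(word):
    """Remove each consonant that carries a Karan (handles only the simple case)."""
    out = []
    i = 0
    while len(word) > i:
        if len(word) > i + 1 and word[i + 1] == KARAN:
            i += 2  # skip the silenced consonant and the marker itself
        else:
            out.append(word[i])
            i += 1
    return "".join(out)
```
For example, in จันทร์ (/can/) the final ร carries a Karan and is dropped.</Paragraph>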
    <Paragraph position="6"> To make the relationship between graphemes and phonemes in Thai as close as possible, we apply two preprocessing steps: Reordering of graphemes when a vowel comes before a consonant.</Paragraph>
    <Paragraph position="7"> Merging multiple letters representing a single phoneme into one symbol.</Paragraph>
    <Paragraph position="8"> We use simple heuristic rules for this purpose: 10 rules for reordering and 15 for merging. In our initial experiments, reordering alone gave better results than reordering plus merging. Hence, we used only the reordering rules for the rest of the experiments.</Paragraph>
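    <Paragraph> As an illustration, the reordering step for preposed vowels might look as follows. This is a hypothetical sketch; the paper's 10 actual rules are not listed, and real rules would also handle multi-letter vowel patterns:
```python
# The five Thai vowels written before their consonant: เ แ โ ใ ไ
PREPOSED = set("\u0e40\u0e41\u0e42\u0e43\u0e44")

def reorder(word):
    """Swap each preposed vowel with the consonant that follows it,
    so grapheme order matches phoneme order."""
    chars = list(word)
    i = 0
    while len(chars) - 1 > i:
        if chars[i] in PREPOSED:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)
```
After reordering, แมว (grapheme order ae:-m-w) becomes มแว, matching the phoneme order /m ae: w/.</Paragraph>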
  </Section>
  <Section position="4" start_page="17" end_page="19" type="metho">
    <SectionTitle>
3 Thai Grapheme-Based Speech Recognition
</SectionTitle>
    <Paragraph position="0"> In this section, we explain the details of our Thai GBSR system. We used the Thai GlobalPhone corpus (Suebvisai et al., 2005) as our data set, which consists of read speech in the news domain.</Paragraph>
    <Paragraph position="1"> The corpus contains 20 hours of recorded speech from 90 native Thai speakers consisting of 14k utterances. There are approximately 260k words covering a vocabulary of about 7,400 words. For testing we used 1,181 utterances from 8 different speakers. The rest was used for training. The language model was built on news articles and gave a trigram perplexity of 140 and an OOV-rate of 1.4% on the test set.</Paragraph>
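    <Paragraph> The reported OOV rate is simply the fraction of test tokens missing from the recognizer vocabulary; as a minimal sketch (the function name is ours):
```python
def oov_rate(tokens, vocab):
    """Fraction of running test tokens not covered by the vocabulary."""
    oov = sum(1 for t in tokens if t not in vocab)
    return oov / len(tokens)

# e.g. an OOV rate of 1.4% means 1.4 of every 100 running words are unseen
```
</Paragraph>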
    <Paragraph position="2"> To start building the acoustic models for Thai, we first used a distribution that equally divided the number of frames among the graphemes. This was then trained for six iterations followed by writing the new labels. We repeated these steps six times.</Paragraph>
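    <Paragraph> The flat-start initialization described above, equally dividing the frames of an utterance among its graphemes, can be sketched as follows (names are ours; the real system then re-estimates these boundaries over several label-writing passes):
```python
def flat_start_alignment(n_frames, graphemes):
    """Assign an equal share of frames to each grapheme (uniform segmentation).
    Returns (grapheme, start_frame, end_frame) triples covering all frames."""
    k = len(graphemes)
    bounds = [n_frames * j // k for j in range(k + 1)]
    return [(g, bounds[j], bounds[j + 1]) for j, g in enumerate(graphemes)]
```
</Paragraph>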
    <Paragraph position="3"> As can be seen in Table 1, the resulting system (Flat-Start) had poor performance. Hence we decided to bootstrap from a context-independent acoustic model of an existing phoneme-based speech recognition (PBSR) system.</Paragraph>
    <Section position="1" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
3.1 Bootstrapping
</SectionTitle>
      <Paragraph position="0"> We trained two grapheme-based systems by bootstrapping from the acoustic models of two different PBSR systems. The first system (Thai) was bootstrapped from a Thai PBSR system (Suebvisai et al., 2005) trained on the same corpus. The second system (Multilingual) was bootstrapped from acoustic models trained on the multilingual GlobalPhone corpus (Schultz and Waibel, 1998), which shares acoustic models of similar sounds across multiple languages. In mapping phones to graphemes, when a grapheme can be mapped to several different phones, we selected the one that occurs most frequently.</Paragraph>
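      <Paragraph> The phone-selection heuristic, picking for each grapheme the phone it is most frequently aligned with, might be implemented like this (a sketch assuming grapheme-phone alignment observations are available as pairs):
```python
from collections import Counter

def build_mapping(aligned_pairs):
    """aligned_pairs: iterable of (grapheme, phone) observations.
    For each grapheme, pick the phone it most frequently aligns with."""
    counts = {}
    for g, p in aligned_pairs:
        counts.setdefault(g, Counter())[p] += 1
    return {g: c.most_common(1)[0][0] for g, c in counts.items()}
```
</Paragraph>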
      <Paragraph position="1"> Both systems were based on trigraphemes (+/-1) with 500 acoustic models. Training was identical to the Flat-Start system. Table 1 compares the word error rates (WER) of the three systems on the test set.</Paragraph>
      <Paragraph position="2"> [Table 1: WER with different bootstrapping techniques] The results show that the two bootstrapped systems perform comparably, with the Thai system giving the lowest WER. For the rest of the experiments we used the system bootstrapped from the multilingual acoustic models.</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
3.2 Building Context Dependent Systems
</SectionTitle>
      <Paragraph position="0"> For the context dependent systems, we trained two systems with different polygrapheme units: one with trigraphemes (+/-1) and one with quintgraphemes (+/-2).</Paragraph>
      <Paragraph position="1"> The question set used in building the context dependent systems was constructed manually from the question set of the Thai PBSR system: we replaced every phoneme in the question set with the appropriate grapheme(s). In addition, we compared two different acoustic model sizes: 500 and 2000 models.</Paragraph>
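      <Paragraph> The phoneme-to-grapheme rewriting of the question set can be sketched as follows (hypothetical data structures; the system's actual question format is not given in the paper):
```python
def questions_to_graphemes(questions, phone2graph):
    """questions: dict mapping a question name to a list of phonemes.
    phone2graph: dict mapping each phoneme to the grapheme(s) that write it.
    Returns the same questions with phoneme sets rewritten as grapheme sets."""
    out = {}
    for name, phones in questions.items():
        graphs = set()
        for p in phones:
            graphs.update(phone2graph.get(p, []))
        out[name] = sorted(graphs)
    return out
```
</Paragraph>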
      <Paragraph position="2"> Table 2 shows the recognition results for the resulting GBSR systems.</Paragraph>
      <Paragraph position="3"> [Table 2: WER with different speech units and numbers of acoustic models]</Paragraph>
      <Paragraph position="4"> The system with 500 acoustic models based on trigraphemes produced the best results. The higher WER of the quintgrapheme system might be due to data sparseness.</Paragraph>
    </Section>
    <Section position="3" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
3.3 Enhanced Tree Clustering (ETC)
</SectionTitle>
      <Paragraph position="0"> Yu and Schultz (2003) introduced a tree clustering approach that allows the sharing of parameters across phonemes. In this enhanced tree clustering, a single decision tree is constructed for all sub-states of all phonemes. The clustering procedure starts with all the polyphones at the root of the tree.</Paragraph>
      <Paragraph position="1"> The decision tree can ask questions about the identity of the center phoneme and its neighboring phonemes, plus the sub-state identity (begin/middle/end). At each node, the question that yields the highest information gain is chosen and the tree is split. This process is repeated until the tree reaches a certain size. Enhanced tree clustering is well suited to implicitly capturing pronunciation variation in speech, by allowing polyphones that are pronounced similarly to share the same set of parameters. Mimer et al. (2004) show that this approach can successfully be applied to grapheme-based speech recognition by building separate trees for each sub-state of consonants and vowels.</Paragraph>
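      <Paragraph> The split-selection step, choosing the question with the highest information gain, can be sketched as below. This is a simplified illustration over discrete labels; the real clustering evaluates questions on acoustic model statistics:
```python
import math

def entropy(counts):
    """Shannon entropy of a dict of label counts."""
    total = sum(counts.values())
    h = 0.0
    for c in counts.values():
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def best_question(items, questions):
    """items: list of (features, label). questions: dict name -> predicate.
    Return the question whose yes/no split yields the highest information gain."""
    def label_counts(subset):
        c = {}
        for _, lab in subset:
            c[lab] = c.get(lab, 0) + 1
        return c
    base = entropy(label_counts(items))
    n = len(items)
    best, best_gain = None, -1.0
    for name, pred in questions.items():
        yes = [it for it in items if pred(it[0])]
        no = [it for it in items if not pred(it[0])]
        gain = base - (len(yes) / n) * entropy(label_counts(yes)) \
                    - (len(no) / n) * entropy(label_counts(no))
        if gain > best_gain:
            best, best_gain = name, gain
    return best
```
</Paragraph>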
      <Paragraph position="2"> For the experiments on enhanced tree clustering, we used the same setting as the grapheme-based system. Instead of growing a single tree, we built six separate trees - one each for begin, middle and end sub-states of vowels and consonants. Apart from the question set used in the grapheme-based system, we added singleton questions, which ask about the identity of different graphemes in a certain context. To apply the decision tree algorithm, a semi-continuous recognition system was trained.</Paragraph>
      <Paragraph position="3"> Since the number of models sharing the same codebook increases drastically, we increased the number of Gaussians per codebook. Two values were tested: 500 (ETC-500) and 1500 (ETC-1500) Gaussians. Table 3 shows the recognition results on the test set after applying enhanced tree clustering to the system based on trigraphemes (MUL-TRI).</Paragraph>
      <Paragraph position="4"> As can be seen from Table 3, enhanced tree clustering yields a significant improvement over the best grapheme-based system. The ETC-500 system, with a relatively smaller number of parameters, outperformed the ETC-1500 system. Performance also decreases when we increase the number of leaf nodes in the tree from 500 to 2000. A closer look at the cluster trees produced by enhanced clustering reveals that 50-100 models share parameters across different center graphemes.</Paragraph>
      <Paragraph position="5"> 4 Grapheme vs. Phoneme-Based SR
To evaluate our grapheme-based approach against the traditional phoneme-based approach, we compared the best GBSR system with two phoneme-based systems.</Paragraph>
      <Paragraph position="6"> The first system (PB-Man) uses a manually created dictionary and is identical to (Suebvisai et al., 2005), except that we used triphones as the speech unit. The second system (PB-LTS) uses a dictionary generated automatically with letter-to-sound rules. To generate the dictionary for PB-LTS, we used the letter-to-sound rules of the Festival (Black, 1998) speech synthesis system, trained on 20k words. We also applied the same reordering rules used in the GBSR system, as described in Section 2. Both systems have 500 acoustic models based on triphones.</Paragraph>
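      <Paragraph> A letter-to-sound dictionary in the style of PB-LTS can be approximated by greedy longest-match rules. This is a toy sketch under our own naming; Festival's trained LTS rules are context-dependent decision trees, not a plain lookup:
```python
def apply_lts(word, rules):
    """Greedy longest-match letter-to-sound conversion.
    rules: dict mapping grapheme strings to lists of phones."""
    phones = []
    i = 0
    longest = max(len(k) for k in rules)
    while len(word) > i:
        # try the longest grapheme chunk first, then shorter ones
        for l in range(longest, 0, -1):
            chunk = word[i:i + l]
            if chunk in rules:
                phones.extend(rules[chunk])
                i += l
                break
        else:
            i += 1  # no rule matched: skip this letter
    return phones
```
</Paragraph>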
      <Paragraph position="7"> Table 4 gives the WER of the two systems on the test set. The best results from the GBSR systems are also reproduced for comparison.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="19" end_page="19" type="metho">
    <SectionTitle>
PBSR systems
</SectionTitle>
    <Paragraph position="0"> As expected, the manually generated dictionary gives the best performance. The performance of PB-LTS and the grapheme-based system is comparable. The ETC-500 system performs significantly better than the system with the automatically generated dictionary, and achieves almost the same results as the phoneme-based baseline. This shows that grapheme-based speech recognition, coupled with enhanced tree clustering, can be successfully applied to Thai speech recognition without the need for a manually generated dictionary.</Paragraph>
  </Section>
</Paper>