<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1034">
  <Title>Subphonetic Modeling for Speech Recognition</Title>
  <Section position="3" start_page="174" end_page="175" type="metho">
    <SectionTitle>
2 SHARED DISTRIBUTION MODELS
</SectionTitle>
    <Paragraph position="0"> In phone-based HMM systems, each phonetic model is formed by a sequence of states. Phonetic models are shared across different word models. In fact, the state can also be shared across different phonetic models. This section will describe the usage of senones for parameter sharing.</Paragraph>
    <Section position="1" start_page="174" end_page="174" type="sub_section">
      <SectionTitle>
2.1 Senone Construction by State Clustering
</SectionTitle>
      <Paragraph position="0"> The number of triphones in a large vocabulary system is generally very large. With limited training data, there is no hope to obtain well-trained models. Therefore, different technologies have been studied to reduce the number of parameters \[1, 9, 2, 10, 11\]. In generalized triphones, every state of a triphone is merged with the corresponding state of another triphone in the same cluster. It may be true that some states are merged not because they are similar, but because the other states of the involved models resemble each other. To fulfill more accurate modeling, states with differently-shaped output distributions should be kept apart, even though the other states of the models are tied. Therefore, clustering should be carried out at the output-distribution level rather than the model level. The distribution clustering thus creates a senone codebook as Figure 2 shows \[12\]. The clustered distributions or senones are fed back to instantiatephonetic models. Thus, states of different phonetic models may share the same senone.</Paragraph>
      <Paragraph position="1"> This is the same as theshared-distributionmodel (SDM) \[13\].</Paragraph>
      <Paragraph position="2"> Moreover, different states within the same model may also be tied together if too many states are used to model this phone's acoustic variations or ifa certain acoustic event appears repetitively within the phone.</Paragraph>
      <Paragraph position="4"> All HMMs are first estimated.</Paragraph>
      <Paragraph position="5"> Initially, every output distribution of all HMMs is created as a cluster.</Paragraph>
      <Paragraph position="6"> Find the most similar pair of clusters and merge them together.</Paragraph>
      <Paragraph position="7"> For each element in each cluster of the current configuration, move it to another cluster if that results in improvement. Repeat this shifting until no improvement can be made.</Paragraph>
      <Paragraph position="8"> Go to step 3 unless some convergence criterion is  of states for each phonetic model. Although an increase in the number of states will increase the total number of free parameters, yet by clustering similar states we can essentially eliminate those redundant states and have the luxury to maintain the necessary ones \[13\].</Paragraph>
    </Section>
    <Section position="2" start_page="174" end_page="175" type="sub_section">
      <SectionTitle>
2.2 Performance Evaluation
</SectionTitle>
      <Paragraph position="0"> We incorporated the above distribution clustering technique in the SPHINX-II system \[14\] and experimented on the speaker-independent DARPA Resource Management (RM) task with a word-pair grammar of perplexity 60. The test set consisted of the February 89 and October 89 test sets, totaling 600 sentences. Table 1 shows the word error rates of several systems.</Paragraph>
      <Paragraph position="1">  In the SPHINX system, there were 1100 generalized triphones, each with 3 distinct output distributions. In the  SPHINX-II system, we used 5-state Bakis triphone models and clustered all the output distributions in the 7500 or so triphones down to 3500-5500 senones. The system with 4500 senones had the best performance with the given 3990 training sentences. The similarity between two distributions was measured by their entropies. After two distributions are merged, the entropy-increase, weighted by counts, is computed: (ca + cb)no+b -- Coaa - CbHb where Ca is the summation of the entries of distribution a in terms of counts, and Ha is the entropy. The less the entropy-increase is, the closer the two distributions are. Weighting entropies by counts enables those distributions with less occurring frequency be merged before frequent ones. This makes each senone (shared distribution) more trainable.</Paragraph>
    </Section>
    <Section position="3" start_page="175" end_page="175" type="sub_section">
      <SectionTitle>
2.3 Behavior of State Clustering
</SectionTitle>
      <Paragraph position="0"> To understand the quality of the senone codebook, we examined several examples in comparison with 1100 generalized triphone models. As shown in Figure 3, the two/ey/triphones in -PLACE and --LaTION were mapped to the same generalized triphone. Similarly, phone/d/in START and ASTORIA were mapped to another generalized triphone. Both has the same left context, but different right contexts. States with the same color were tied to the same senone in the 4500-SDM system, x, y, z, and w represent different sentries. Figure (a) demonstrates that distribution clustering can keep dissimilar states apart while merging similar states of two models. Figure (b) shows that redundant states inside a model can be squeezed. It also reveals that distribution clustering is able to learn the same effect of similar contexts (/aa/ and /at/) on the current phone (/dO.</Paragraph>
      <Paragraph position="1"> It is also interesting to note that when 3 states, 5 states, and 7 states per triphone model are used with a senone codebook size of 4500, the average number of distinct senones a triphone used is 2.929, 4.655, and 5.574 respectively. This might imply that 5 states per phonetic model are optimal to model the acoustic variations within a triphone unit for the given DARPA RM training database. In fact, 5-state models indeed gave us the best performance.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="175" end_page="176" type="metho">
    <SectionTitle>
3 PRONUNCIATION OPTIMIZATION
</SectionTitle>
    <Paragraph position="0"> As shown in Figure 1, senones can be shared not only by different phonetic models, but also by different word models.</Paragraph>
    <Paragraph position="1"> This section will describe one of the most important applications of senones: word pronunciation optimization.</Paragraph>
    <Section position="1" start_page="175" end_page="176" type="sub_section">
      <SectionTitle>
3.1 Senonic Baseform by State Quantization
</SectionTitle>
      <Paragraph position="0"> Phonetic pronunciation optimization has been considered by \[15, 8\]. Subphonetic modeling also has a potential application to pronunciation learning. Most speech recognition systems use a fixed phonetic transcription for each word in the vocabulary. If a word is transcribed improperly, it will be difficult for the system to recognize it. There may be quite a few improper transcriptions in a large vocabulary system for the given task.</Paragraph>
      <Paragraph position="1"> Most importantly, some words may be pronounced in several different ways such as THE (/dh ax/or/dh ih/), TOMATO (/tax m ey dx owl or/t ax m aa dx ow/), and so  clustering. Figure (a) shows two/ey/triphones which were in the same generalized triphone cluster; (b) shows two/d/ triphones in another cluster. In each sub-figure, states with the same color were tied to the same senone, z, y, z, and w represent different senones.</Paragraph>
      <Paragraph position="2"> on. We can use multiple phonetic transcriptions for every word, or to learn the pronunciation automatically from the data.</Paragraph>
      <Paragraph position="3"> Figure 4 shows the algorithm which looks for the most appropriate senonic baseform for a given word when training examples are available.</Paragraph>
      <Paragraph position="4">  1. Compute the average duration (number of timeframes), given multiple tokens of the word.</Paragraph>
      <Paragraph position="5"> 2. Build a Bakis word HMM with the number of states equal to a portion of the average duration (usually 0.8).</Paragraph>
      <Paragraph position="6"> 3. Run several iterations (usually 2 - 3) of the forward-backward algorithm on the word model starting from uniform output distributions, using the given utterance tokens.</Paragraph>
      <Paragraph position="7"> 4. Quantize each state of the estimated word model  Here arbitrary sequences of senones are allowed to provide the freedom for the automatically learned pronunciation. This senonic baseform tightly combines the model and acoustic data. After the senonic baseform of every word is determined,  the senonic word models may be trained, resulting in a new set of senones.</Paragraph>
      <Paragraph position="8"> Similar to fenones, sentries take full advantage of the multiple utterances in baseform construction. In addition, both phonetic baseform and senonic baseform can be used together, without doubling the number of parameters in contrast to fenones. So we can keep using phonetic baseforrn when training examples are unavailable. The senone codebook also has a better acoustic resolution in comparison with the 200 VQ-dependent fenones. Although each senonic word model generally has more states than the traditional phoneme-concatenated word model, the number of parameters are not increased since the size of the senone codebook is fixed.</Paragraph>
    </Section>
    <Section position="2" start_page="176" end_page="176" type="sub_section">
      <SectionTitle>
3.2 Performance Evaluation
</SectionTitle>
      <Paragraph position="0"> As a pivotal experiment for pronunciation learning, we used the speaker-independent continuous spelling task (26 English alphabet). No grammar is used. There are 1132 training sentences from 100 speakers and 162 testing sentences from 12 new speakers. The training data were segmented into words by a set of existing HMMs and the Viterbi alignment \[16, 1\].</Paragraph>
      <Paragraph position="1"> For each word, we split its training data into several groups by a DTW clustering procedure according to their acoustic resemblance. Different groups represent different acoustic realizations of the same word. For each word group, we estimated the word model and computed a senonic baseform as Figure 4 describes. The number of states of a word model was equal to 75% of the average duration. The Euclidean distance was used as the distortion measure during state quantization.</Paragraph>
      <Paragraph position="2"> We calculated the predicting ability of the senonic word model M,o,a obtained from the g-th group of word w as:</Paragraph>
      <Paragraph position="4"> X,. egroup g X,. ~group a where X~o is an utterance of word w. For each word, we picked two models that had the best predicting abilities. The pronunciation of each word utterance in the training set was labeled by:</Paragraph>
      <Paragraph position="6"> After the training data were labeled in this way, we re-trained the system parameters by using the senonic baseform.</Paragraph>
      <Paragraph position="7"> Table 2 shows the word error rate. Both systems used the sex-dependent semi-continuous HMMs. The baseline used word-dependent phonetic models. Therefore, it was essentially a word-based system. Fifty-six word-dependent phonetic models were used. Note both systems used exactly the same number of parameters.</Paragraph>
      <Paragraph position="8"> This preliminary results indicated that the senonic baseform can capture detailed pronunciation variations for speaker-independent speech recognition.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="176" end_page="177" type="metho">
    <SectionTitle>
4 NEW WORD LEARNING
</SectionTitle>
    <Paragraph position="0"> In dictation applications, we can start from speaker-independent system. However, new words will often appear when users are dictating. In real applications, these new  baseform on the spelling task.</Paragraph>
    <Paragraph position="1"> word samples are often speaker-dependent albeit speaker-independent systems may be used initially. A natural extension for pronunciation optimization is to generate speaker-dependent senonic baseforms for these new words. In this study, we assume possible new words are already detected, and we want to derive the senonic baseforms of new words automatically. We are interested in using acoustic data only. This is useful for acronym words like IEEE (pronounced as ltriple-E), CAT-2 (pronounced as cat-two) and foreign person names, where spelling-to-sound rules are hard to generalize. The senonic baseform can also capture pronunciation characteristics of each individual speaker that cannot be represented in the phonetic baseform.</Paragraph>
    <Section position="1" start_page="176" end_page="176" type="sub_section">
      <SectionTitle>
4.1 Experimental Database and System Configuration
</SectionTitle>
      <Paragraph position="0"> With word-based senonic models, it is hard to incorporate between-word co-articulation modeling. Therefore, our base-line system used within-word triphone models only. Again we chose RM as the experimental task. Speaker-independent (SI) sex-dependent SDMs were used as our baseline system for this study. New word training and testing data are speaker-dependent (SD). We used the four speakers (2 females, 2 males) from the June-1990 test set; each supplied 2520 SD sentences. The SD sentences were segmented into words using the Viterbi alignment.</Paragraph>
      <Paragraph position="1"> Then we chose randomly 42 words that occurred frequently in the SD database (so that we have enough testing data) as shown in Table 3, where their frequencies in the speaker-independent training database are also included. For each speaker and each of these words, 4 utterances were used as samples to learn the senonic baseform, and at most 10 other utterances as testing. Therefore, the senonic baseform of a word is speaker-dependent. There were together 1460 testing word utterances for the four speakers. During recognition, the segmented data were tested in an isolated-speech mode without any grammar.</Paragraph>
    </Section>
    <Section position="2" start_page="176" end_page="177" type="sub_section">
      <SectionTitle>
4.2 State Quantization of the Senonic Baseform
</SectionTitle>
      <Paragraph position="0"> For each of the 42 words, we used 4 utterances to construct the senonic baseform. The number of states was set to be 0.8 of the average duration. To quantize states at step 4 of Figure 4, we aligned the sample utterances against the estimated word model by the Viterbi algorithm. Thus, each state had 5 to 7 frames on average. Each state of the word model is quantized to the senone that has the maximum probability of generating all the aligned frames. Given a certain senone, senone, the probability of generating the aligned frames of state s is computed in the same manner as the semi-continuous output probability:</Paragraph>
      <Paragraph position="2"> VXi aligned to s k=l where b(. I senone) denote the discrete output distribution that represents senone, L denotes the size of the front-end VQ codebook, and fk (.) denote the probability density function of codeword k.</Paragraph>
    </Section>
    <Section position="3" start_page="177" end_page="177" type="sub_section">
      <SectionTitle>
4.3 Experimental Performance
</SectionTitle>
      <Paragraph position="0"> For the hand-written phonetic baseform, the word error rate was 2.67% for the 1460 word utterances. As a pilot study, a separate senonic baseform was constructed for CASREP and its derivatives (CASREPED, and CASREPS). Similarly, for the singular and plural forms of the selected nouns. The selected 42 words were modeled by automatically constructed senonic baseforrns. They are used together with the rest 955 words (phonetic baseforms) in the RM task. The word error rate was 6.23%. Most of the errors came from the derivative confusion.</Paragraph>
      <Paragraph position="1"> To reduce the derivative confusion, we concatenated the original senonic baseform with the possible suffix phonemes as the baseform for the derived words. For example, the baseform of FLEETS became/fleet &lt;ts s ix-z&gt;~, where the context-independent phone model/ts/,/s/, and the concatenated/ix z/were appended parallelly after the senonic base-form of FLEE T. In this way, no training data were used to learn the pronunciations of the derivatives. This suffix senonic approach significantly reduced the word error to 3.63%. Still there were a lot of misrecognitions of CASREPED tO be CASREP and MAX tO be NEXT. These were due to the high confusion between/td/and/pd/,/m/and/n/. The above results are summarized in Table 4.</Paragraph>
      <Paragraph position="2"> system error rate hand-written phonetic baseform 2.67 % pilot senonic baseform 6.23 % suffix senonic baseform 3.63%  utterances for the selected 42 words.</Paragraph>
      <Paragraph position="3"> The study reported here is preliminary. Refinement on the algorithm of senonic-baseform construction (especially incorporation of the spelling information) is still under investigation. Our goal is to approach the phonetic system.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>