<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1034">
  <Title>Subphonetic Modeling for Speech Recognition</Title>
  <Section position="2" start_page="0" end_page="174" type="intro">
    <SectionTitle>
1 INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> For large-vocabulary speech recognition, we will never have sufficient training data to model all the various acoustic-phonetic phenomena. How to capture important acoustic clues and estimate essential parameters reliably is one of the central issues in speech recognition. To share parameters among different word modes, context-dependent subword models have been used successfully in many state-of-the-art speech recognition systems \[1, 2, 3, 4\]. The principle of parameter sharing can also be extended to subphonetic models.</Paragraph>
    <Paragraph position="1"> For subphonetic modeling, fenones \[5, 6\] have been used as the front end output of the IBM acoustic processor. To generate a fenonic pronunciation, multiple examples of each word are obtained. The fenonic baseform is built by searching for a sequence of fenones which has the maximum probability of generating all the given multiple utterances. The codeword-dependent fenonic models are then trained just like phonetic models. We believe that the 200 codeword-dependent fenones may be insufficient for large-vocabulary continuous speech recognition.</Paragraph>
    <Paragraph position="2"> In this paper, we propose to model subphonetic events with Markov states. We will treat the state in hidden Markov models (HMMs) as a basic subphonetic unit -- senone. The total number of HMM states in a system is often too large to be well trained. To reduce the number of free parameters, we can cluster the state-dependent output distributions. Each clustered output distribution is denoted as a senone. In this way, senones can be shared across different models as illustrated in Figure 1.</Paragraph>
    <Paragraph position="3">  The advantages of senones include better parameter sharing and improved pronunciation optimization. After clustering, different states in different models may share the same senone if they exhibit acoustic similarity. Clustefingat the granularity of the state rather than the entire model (like generalized tripbones) can keep the dissimilar states of two similar models apart while the other corresponding states are merged, and thus lead to better parameter sharing. For instance, the first, or the second states of the/ey/phones in PLACE and RELATION may be tied together. However, to magnify the acoustic effects of the fight contexts, their last states may be kept separately. In addition to finer parameter sharing, senones also give us the freedom to use a larger number of states for each phonetic model. Although an increase in the number of states will increase the total number of free parameters, with senone sharing we can essentially eliminate those redundant states and have the luxury of maintaining the necessary ones.</Paragraph>
    <Paragraph position="4"> Since senones depend on Markov states, the senonic base-form of a word can be constructed naturally with the forward-backward algorithm \[7\]. Regarding pronunciation optimization as well as new word learning, we can use the forward-backward algorithm to iteratively optimize a senone sequence appropriate for modeling multiple utterances of a word. That is, given the multiple examples, we can train a word HMM with the forward-backward algorithm. When the reestimation  reaches its optimality, the estimated states can be quantized with the codebook of senones. The closest one can be used to label the state of the word HMM. This sequence of senones becomes the senonic baseform of the word. Here arbitrary sequences of senones are allowed to provide the freedom for the automatically learned pronunciation. After the senonic base-form of every word is determined, the senonic word models may be trained, resulting in a new set of senones. Although each senonic word model generally has more states than the traditional phoneme-concatenated word model, the number of parameters remains the same since the size of the senone codebook is intact.</Paragraph>
    <Paragraph position="5"> In dictation applications, new words will often appear during user's usage. A natural extension for pronunciation optimization is to generate senonic baseforms for new words. Automatic determination of phonetic baseforms has been considered by \[8\], where four utterartces and spelling-to-soundrules are need. For the senonic baseform, we can derive senonic baseform only using acoustic data without any spelling information. This is useful for acronym words like IEEE (pronounced as 1-triple-E), CAT-2 (pronounced as cat-two) and foreign person names, where speUing-to-soundrules are hard to generalize. The acoustic-driven senonic baseform can also capture pronunciation of each individual speaker, since in dictation applications, multiple new-word samples are often from the same speaker.</Paragraph>
    <Paragraph position="6"> By constructing senone codebook and using senones in the triphone system, we were able to reduce the word error rate of the speaker-independent Resource Management task by 20% in comparison with the generalized triphone \[2\]. When senones were used for pronunciation optimization, our preliminary results gave us another 15% error reduction in a speaker-independent continuous spelling task. The word error rate was reduced from 11.3% to 9.6%. For new word learning, we used 4 utterances for each new word. Our preliminary results indicate that the error rate of automatically generated senonic baseform is comparable to that of hand-written phonetic baseform.</Paragraph>
  </Section>
class="xml-element"></Paper>