<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2047"> <Title>Improvements in the Stochastic Segment Model for Phoneme Recognition</Title> <Section position="3" start_page="0" end_page="332" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> Although hidden Markov models (HMMs) are currently one of the most successful approaches to acoustic modelling for continuous speech recognition, their performance is limited in part because of the assumption that observation features at different times are conditionally independent given the underlying state sequence, and because the Markov assumption on the state sequence may not adequately model time structure. An alternative model, the stochastic segment model (SSM), was proposed to overcome some of these deficiencies [Roucos and Dunham 1987, Ostendorf and Roucos 1989, Roucos et al 1988].</Paragraph> <Paragraph position="1"> An observed segment of speech (e.g., a phoneme) is represented by a sequence of q-dimensional feature vectors Y = [y1 y2 ... yk]^T, where the length k is variable and T denotes block transposition. The stochastic segment model for Y has two components [Roucos et al 1988]: 1) a time transformation T_k to model the variable-length observed segment, Y, in terms of a fixed-length unobserved sequence, X = [x1 x2 ... xm]^T, as Y = T_k X, and 2) a probabilistic representation of the unobserved feature sequence X. The conditional density of the observed segment Y given phoneme a and observed length k is:</Paragraph> <Paragraph position="2"> p(Y | a, k) = p(T_k X | a). </Paragraph> <Paragraph position="3"> Assuming the observed length is less than or equal to the length of X, k ≤ m, T_k is a time-warping transformation which obtains Y by selecting a subset of the elements of X, and the density p(Y|a, k) is a qk-dimensional marginal distribution of p(X|a). In practice, we can accommodate observations of length k > m by either tying distributions of X (so m is effectively larger) or discarding some of the observations in Y. 
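The warping and marginal-density computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the rounding rule used to discretize the linear warp are our own choices.

```python
import numpy as np

def linear_warp_indices(k, m):
    """Map each of k observed frames to the nearest of m model samples
    under a linear time-warping criterion (hypothetical rounding rule)."""
    if k == 1:
        return np.array([0])
    # frame i at relative time i/(k-1) maps to the nearest model index
    return np.round(np.arange(k) * (m - 1) / (k - 1)).astype(int)

def segment_log_likelihood(Y, mean, cov):
    """log p(Y|a, k): the qk-dimensional marginal of the mq-dimensional
    Gaussian p(X|a) over the components selected by T_k (assumes k <= m).
    Y is (k, q); mean is (m*q,); cov is (m*q, m*q)."""
    k, q = Y.shape
    m = mean.shape[0] // q
    idx = linear_warp_indices(k, m)
    # scalar component indices of the selected frames in the stacked mq-vector
    sel = (idx[:, None] * q + np.arange(q)[None, :]).ravel()
    mu = mean[sel]
    sigma = cov[np.ix_(sel, sel)]
    d = Y.ravel() - mu
    sign, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (len(sel) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(sigma, d))
```

Marginalizing a Gaussian over a subset of components only requires selecting the corresponding sub-vector of the mean and sub-matrix of the covariance, which is why the warp reduces to an index-selection step here.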
In this work, as in previous work, the time transformation, T_k, is chosen to map each observed frame y_i to the nearest model sample x_j according to a linear time-warping criterion. The distribution p(X|a) for the segment X given the phoneme a is then modelled using an mq-dimensional multivariate Gaussian distribution.</Paragraph> <Paragraph position="4"> Algorithms for automatic recognition and training of the stochastic segment model are similar to those for hidden Markov modelling. The maximum a posteriori probability rule is used for classification of segments when the phoneme segmentation is known: max_a p(Y|a, k) p(k|a) p(a), where p(k|a) is the probability that phoneme a has length k. A Viterbi search over all possible segmentations is used for recognition with unknown segmentations. The models are trained from known segmentations using maximum likelihood parameter estimation. When segmentations are unknown, there is an iterative algorithm, based on automatic segmentation and maximum likelihood parameter estimation, which is guaranteed to increase the probability of the observations with each iteration.</Paragraph> <Paragraph position="5"> Initial results with segment-based models have been encouraging. A stochastic segment model has previously been used for speaker-dependent phoneme and word recognition, demonstrating that a segment model outperformed a discrete hidden Markov model when both models were context-independent [Ostendorf and Roucos 1989]. Other segment-based models have also shown encouraging results in speaker-independent applications [Bush and Kopec 1987, Bocchieri and Doddington 1986, Makino and Kido 1986, Zue et al 1989].</Paragraph> <Paragraph position="6"> Although the previous results using the stochastic segment model were encouraging, there were several limitations. 
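The MAP classification rule for known segmentations, max_a p(Y|a, k) p(k|a) p(a), can be illustrated in the log domain as below. The `models` container and its fields are hypothetical stand-ins for the per-phoneme segment densities, length probabilities, and priors; this is a sketch, not the paper's recognizer.

```python
import numpy as np

def map_classify(Y, models):
    """MAP rule for a segment with known boundaries:
    argmax_a log p(Y|a, k) + log p(k|a) + log p(a).
    `models` maps a phoneme label to a tuple (loglik_fn, length_probs, prior);
    these names are illustrative, not from the paper."""
    k = Y.shape[0]
    best, best_score = None, -np.inf
    for a, (loglik, p_k, prior) in models.items():
        # floor unseen lengths at a tiny probability to avoid log(0)
        score = loglik(Y, k) + np.log(p_k.get(k, 1e-12)) + np.log(prior)
        if score > best_score:
            best, best_score = a, score
    return best
```

Working in the log domain turns the product of the three terms into a sum, which avoids numerical underflow for long segments; the argmax is unchanged because log is monotone.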
First, the comparison to HMMs did not clearly show the advantages of the segment model, since the SSM and the HMM used disparate feature distributions: the segment model used ten continuous distributions and the HMM used three discrete distributions for each phoneme. Second, the flexibility of the segment model was not fully exploited, because time samples within a segment were assumed independent due to training data limitations in these speaker-dependent applications. Finally, results showed that the context-dependent HMM triphone models [Schwartz et al 1985] outperformed the context-independent segment models.</Paragraph> <Paragraph position="7"> Again due to training data limitations, context-dependent segment models were not effective.</Paragraph> <Paragraph position="8"> In this work we address the issues of 1) time correlation modelling and 2) meaningful comparisons of the SSM with HMMs in a speaker-independent phoneme classification task. In the next section, we describe refinements to the SSM which improve its time correlation modelling capability, including time-dependent parameter reduction and the assumption of a Markov time correlation structure. Then experimental results using the TIMIT database are described. These results include comparisons of the HMM and SSM, as well as the effects of modelling time correlation. Although HMM performance is similar to segment model performance when time-sample independence is assumed for the segment model, we demonstrate that the refinements improve the performance of the stochastic segment model so that it outperforms the HMM for phoneme classification.</Paragraph> </Section> </Paper>