<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2047"> <Title>Improvements in the Stochastic Segment Model for Phoneme Recognition</Title> <Section position="7" start_page="334" end_page="336" type="evalu"> <SectionTitle> EXPERIMENTAL RESULTS </SectionTitle> <Paragraph position="0"> In this section we present experimental results for speaker-independent phoneme recognition. We performed experiments on the TIMIT database \[Lamel et al 1986\] for segment models and hidden Markov models using known phonetic segmentations. Mel-warped cepstra and their derivatives, together with the derivative of log power, are used for recognition. We used 61 phonetic models. However, in counting errors, different phones representing the same English phoneme can be substituted for each other without causing an error. The set of 39 English phonemes that we used is the same set on which the CMU SPHINX and MIT SUMMIT systems reported phonetic recognition results \[Lee and Hon 1988, Zue et al 1989\].</Paragraph> <Paragraph position="1"> The portion of the database that we have available consists of 420 speakers and 10 sentences per speaker; two of these, the &quot;sa&quot; sentences, are the same across all speakers and were not used in either recognition or training because they would lead to optimistic results. We designated 219 male and 98 female speakers for training (a total of 2536 training sentences) and a second set of 71 male and 32 female different speakers for testing (a total of 824 test sentences with 31,990 phonemes). We deliberately selected a large number of testing speakers to increase the confidence of performance estimates. The best-case results, reported at the end of this section, were obtained using all of the available training sentences (from both male and female speakers) and testing over the entire test set. 
Most of the other results for algorithm development and comparisons were obtained by training over the male speakers only and testing on the Western and N.Y. dialect male speakers (a total of 219 training and 17 test speakers), a subset that gives us good estimates of the overall performance, as we can see from the global results.</Paragraph> <Paragraph position="2"> [Table caption fragment: ... cepstra and derivatives, no time correlation.]</Paragraph> <Paragraph position="3"> The TIMIT database has also been used by other researchers for the evaluation of the phonetic accuracy of their speech recognizers. Lee and Hon, 1988, reported a phonetic accuracy for the SPHINX system of 58.8% with 12% insertions when context-independent (61) phone models were used. Zue et al, 1989 obtained a 70% classification performance on the same database for unknown speakers.</Paragraph> <Paragraph position="4"> HMM/segment comparison. We first evaluated the relative performance of the SSM against a continuous Gaussian distribution HMM \[Bahl et al 1981\]. In this experiment, the features were ten mel-warped cepstra and their derivatives. Both the SSM and the HMM had the same number of distributions (an SSM of length m = 5 and no time correlation versus a 7-state HMM with 5 distributions), and the recognition algorithm for the HMM was a full search along the observed length. For the SSM, we obtained results for two cases: with and without using the duration information p(k|a). Note that, for a fair comparison, the segment model should include duration information, which was not incorporated in earlier versions of the segment model. With the duration information, both the SSM and the HMM gave similar performance (68.6% and 68.7%, see Table 2). The superiority of the SSM becomes clearer after the time correlation and the parameter reduction methods are incorporated, even though the SSM with time correlation suffered from limited training.</Paragraph> <Paragraph position="5"> Parameter Reduction. 
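The role of the duration term in the comparison above can be sketched as follows. This is a minimal, hypothetical scoring routine for a scalar-feature SSM, assuming independent Gaussian model frames, nearest-neighbour time warping, and an empirical duration prior p(k|a); the function names and the scalar simplification are our illustration, not the paper's implementation:

```python
import math

def gauss_logpdf(x, mean, var):
    # log-density of a univariate Gaussian
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def ssm_score(obs, means, variances, dur_logprob, floor=math.log(1e-6)):
    """Score a k-frame segment `obs` against an m-frame segment model:
    warp the observation onto the model's time axis, sum the per-frame
    Gaussian log-likelihoods, and add the duration term log p(k | a)."""
    m, k = len(means), len(obs)
    total = dur_logprob.get(k, floor)      # duration information p(k|a)
    for j in range(m):
        t = min(k - 1, j * k // m)         # model frame j -> observed frame t
        total += gauss_logpdf(obs[t], means[j], variances[j])
    return total
```

Dropping the `dur_logprob` term recovers a duration-free variant of the score, analogous to the second case compared in Table 2.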
We compared the single and multiple transformation reduction methods on a single-speaker task, using 14 mel-warped cepstra (but not their derivatives) as original features. We evaluated the recognition performance of the SSM for 1) different numbers of the original cepstral coefficients (from 4 up to 14), 2) different numbers of linear discriminants obtained using a single transformation and 3) different numbers of linear discriminants obtained from frame-dependent transformations (see Figure 1).</Paragraph> <Paragraph position="6"> [Figure caption fragment: ... for a segment model length m = 5 using linear upsampling parameter estimation.] The frame-dependent features gave the best performance of 78.2% when the features were reduced to 9 (larger feature sets suffered from training problems), whereas the single-transformation features actually gave lower performance than the original features. This can be explained by the fact that there was a small between-class scatter for the single transformation, relative to the frame-dependent transformations. The eigenvalue spread for the single transformation was only 6.2 (the ratio of largest to smallest eigenvalue), whereas in the case of multiple transformations this ratio ranged from 178.7 to 318.3. The larger ratios occur at the middle frames, since the effect of adjacent phonemes is smallest at the middle of a phoneme, which makes it easier to discriminate. The recognition performance for the single speaker reported here was measured on a set of 61 phonemes, counting misrecognized allophones as errors.</Paragraph> <Paragraph position="7"> Time Correlation. We performed a series of experiments on three different types of covariance matrices for the SSM. The length of the SSM in this case was m = 5.</Paragraph> <Paragraph position="8"> In Figure 2, we have plotted the phonetic accuracy versus the number of features q for 1) a full covariance, 2) a Markov structure and 3) a block-diagonal covariance (independent frames). 
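The eigenvalue-spread diagnostic quoted above (the ratio of the largest to the smallest eigenvalue) can be illustrated with a toy two-dimensional sketch. The closed-form 2x2 eigensolver and the example matrices below are our own illustration, not the paper's data:

```python
import math

def eig2x2(a, b, c, d):
    # eigenvalues of the 2x2 matrix [[a, b], [c, d]], assumed real
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    return (tr + disc) / 2, (tr - disc) / 2

def eigenvalue_spread(scatter):
    """Ratio of the largest to the smallest eigenvalue of a 2x2
    scatter matrix; a large spread indicates directions that differ
    sharply in class separability."""
    (a, b), (c, d) = scatter
    hi, lo = eig2x2(a, b, c, d)
    return hi / lo
```

A near-spherical scatter gives a spread close to 1 (the weakly discriminative single-transformation regime), while a strongly elongated one gives a large spread (the frame-dependent regime, 178.7 to 318.3 in the text).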
When the number of features is small and there is enough data to estimate the parameters, the full covariance model outperforms the other two. It should be noted, though, that the performance of the Markovian model is close to that of the full covariance even for a small number of features. Hence, the Markov hypothesis represents the structure of the covariance matrix well, and as the number of features increases the Markov model outperforms the full covariance model, since it is more easily trainable. In addition, we expect that the curves for those two models would be further separated for a bigger segment length m, since the number of parameters for the full model is quadratic in m, whereas that of the Markov model is linear. (For m = 5 the full model has almost twice as many parameters as the Markov model.)</Paragraph> <Paragraph position="9"> As the number of features increases further, the independence assumption gives the best recognition performance. However, with more training data the models that use time correlation will outperform the model that does not. Furthermore, we were able to duplicate the best-case results using a Markov model for the first features and a second independent distribution for the &quot;less significant&quot; features. (In this case, the correlation between the first and the last features is lost, but the time correlation between successive frames compensates for this.)</Paragraph> <Paragraph position="10"> Parameter Estimation Algorithms. We evaluated the different methods of parameter estimation that we presented in Section 2. In this set of experiments, we used only 10 features (either 5 mel-warped cepstra and their derivatives, or 10 linear discriminants) due to limitations in the available computer time. A segment of length m = 8 rather than the usual m = 5 was used, in order to obtain a better understanding of the interpolation potential of each algorithm. The comparative results are summarized in Table 3. 
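The parameter counts behind the quadratic-versus-linear argument above can be made concrete. This is a small sketch, assuming the Markov structure is block-tridiagonal (m symmetric within-frame blocks plus m - 1 adjacent-frame cross-covariance blocks); that block layout is our assumption, consistent with but not spelled out in the text:

```python
def full_cov_params(m, q):
    # one symmetric (mq x mq) covariance: grows quadratically in m
    n = m * q
    return n * (n + 1) // 2

def markov_cov_params(m, q):
    # block-tridiagonal: m symmetric q x q within-frame blocks plus
    # (m - 1) full q x q adjacent-frame blocks: grows linearly in m
    return m * q * (q + 1) // 2 + (m - 1) * q * q

def block_diag_params(m, q):
    # independent frames: m symmetric q x q blocks only
    return m * q * (q + 1) // 2
```

For m = 5 and q = 10 this gives 1275 full-covariance parameters versus 675 Markov parameters, a ratio of about 1.9, matching the text's "almost twice as many".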
When cepstra and their derivatives are used, the EM algorithm clearly gives better results than the linear time upsampling method. In addition, the &quot;forward prediction&quot; approximation gave us similar recognition performance to the one obtained when the full reestimation formulas were used. However, the situation is inverted when the features are linear discriminants, and we refer the reader to Section 2 for an explanation of those results.</Paragraph> <Paragraph position="11"> Global Results. The best-case system - based on independent samples, m = 5, and 38 linear discriminants - was evaluated using the entire data set. The classifier was trained on the whole training set of all male and female speakers, and tested on 824 sentences from 103 speakers. As can be seen in Table 4, where we present the results by region and sex, the phoneme classification rate does not have large variations among different regions, indicating the robustness of our classifier. The somewhat higher numbers for the male speakers can be attributed to the fact that approximately 70% of our training set consisted of sentences spoken by male speakers, so the classifier was biased in this sense. The results were also consistent among different speakers. The recognition rates for all speakers ranged from 59.9% to 80.7%, with medians of 72.7% for the male test speakers, 71.9% for the female speakers and 72.3% for the whole test set. Approximately 80% of all the test speakers (82 out of 103) had a recognition performance over 69%, and only 8% of the speakers gave performance below 65%, including some &quot;problematic&quot; speakers.</Paragraph> <Paragraph position="12"> Our best-case result of 72% correct classification can be compared to the SUMMIT 70% classification performance on the TIMIT data for unknown speakers \[Zue et al 1989\]. 
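The linear time upsampling estimator that EM is compared against can be sketched as follows; a minimal scalar-feature version, assuming nearest-neighbour warping of each training token onto the m model frames (the function names are ours):

```python
def linear_upsample(obs, m):
    # resample a k-frame observation to the fixed model length m
    k = len(obs)
    return [obs[min(k - 1, j * k // m)] for j in range(m)]

def estimate_means(tokens, m):
    """Per-model-frame sample means over the warped training tokens,
    i.e. the linear-upsampling counterpart of a mean reestimate."""
    warped = [linear_upsample(obs, m) for obs in tokens]
    return [sum(col) / len(col) for col in zip(*warped)]
```

Unlike EM, which reestimates the frame assignments on every iteration, this estimator fixes the time alignment once before averaging.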
Although these results are based on known segmentations, past work in segment modelling for speaker-dependent phoneme recognition showed that recognition with unknown segmentations yields only a small loss in recognition performance, at a cost of 10% phoneme insertions \[Ostendorf and Roucos 1989\].</Paragraph> <Paragraph position="13"> Even with this small loss in performance, the segment models can still be expected to outperform the 59% HMM phoneme recognition performance reported on this task \[Lee and Hon 1988\].</Paragraph> </Section> </Paper>