<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1054">
  <Title>A Study on Speaker-Adaptive Speech Recognition</Title>
  <Section position="3" start_page="278" end_page="279" type="metho">
    <SectionTitle>
2 BASELINE SYSTEM
</SectionTitle>
    <Paragraph position="0"> Large-vocabulary speaker-independent continuous speech recognition has made significant progress during the past years \[1, 2, 3, 4\]. Sphinx, a state-of-the-art speaker-independent speech recognition system developed at CMU \[1\], has achieved high word recognition accuracy with the introduction and usage of the following techniques: (1) multiple VQ codebooks. In order to incorporate the multiple knowledge sources and minimize VQ errors, multiple vector quantized codebooks incorporating LPC cepstrum, differential cepstrum, second order differential cepstrum, and logpower parameters were used \[13\]; (2) generalized triphone models. Triphones have been successfully used by \[16, 17\].</Paragraph>
    <Paragraph position="1"> However, many contexts are quite similar, and can be combined. Clustering contexts leads to fewer, and thus more trainable, models \[18\]; (3) function-word-dependent phone models. These models were used to model phones in function words, which are typically short, poorly-articulated words such as the, a, in, and; (4) between-word coarticulation modeling. The concept of triphone modeling was extended to the word boundary, which leads to between-word triphone models \[19\]; (5) semi-continuous models. SCHMMs mutually optimize the VQ codebook and HMM parameters under a unified probabilistic framework \[20\], which greatly enhances the robustness in comparison with the discrete HMM \[12\]; (6) speaker-clustered models. Another advantage to use the SCHMM is that it requires less training data in comparison with the discrete HMM. Therefore, speaker-clustered models (male/female in this study) were employed to improve the recognition accuracy \[ 12\].</Paragraph>
    <Paragraph position="2"> The above system was evaluated on the June 90 (RM2) test set, which consists of 480 sentences spoken by four speakers. The evaluation results are shown in Table 1. This will be referred as the baseline system in comparison with both speaker-dependent and speaker-adaptive systems. Recent resuits using the shared distribution modeling have not yet included, which led to additional 15% error reduction \[12\].</Paragraph>
    <Paragraph position="3">  The same technology was extended for speaker-dependent speech recognition with 600/2400 training sentences for each speaker \[21\]. The SCHMM parameters and VQ codebook were estimated jointly starting with speaker-independent models. Results are listed in Table 2. The error rate of the speaker-dependent system can be reduced by three times in comparison with the speaker-independent system, albeit this comparison is not fair since the speaker-independent system is trained with 3990 sentences from about 100 speakers. However, these results clearly indicate the importance of speaker-dependent training data, and effects of speaker variability in the speaker-independent system. If speaker-dependent data or speaker-normalization techniques are available, the error rate may be significantly reduced.</Paragraph>
  </Section>
  <Section position="4" start_page="279" end_page="281" type="metho">
    <SectionTitle>
3 SPEAKER-ADAPTIVE SYSTEM
</SectionTitle>
    <Paragraph position="0"> Last section clearly demonstrated the importance of speaker-dependent data, and requirements of speaker normalization mechanism for speaker-independent system design. This section will describe several techniques to adapt the speaker-independent system so that an initially speaker-independent system can be rapidly improved as a speaker uses the system.</Paragraph>
    <Paragraph position="1"> Speaker normalization techniques that may have a significant impact on both speaker-adaptive and speaker-independent speech recognition are also examined.</Paragraph>
    <Section position="1" start_page="279" end_page="279" type="sub_section">
      <SectionTitle>
3.1 Codebook adaptation
</SectionTitle>
      <Paragraph position="0"> The SCHMM has been proposed to extend the discrete HMM by replacing discrete output probability distributions with a combination of the original discrete output probability distributions and continuous pdf of a codebook \[8, 20\], In comparison with the conventional codebook adaptation techniques \[5,6, 7\], the SCHMM can jointly reestimate both the codebook and HMM parameters in order to achieve an optimal codebook/model combination according to the maximum likelihood criterion. The SCHMM can thus be readily applied to speaker-adaptive speech recognition by reestimating the codebook.</Paragraph>
      <Paragraph position="1"> With robust speaker-independent models, the codebook is modified according to the SCHMM structure such that the SCHMM likelihood can be maximized for a given speaker.</Paragraph>
      <Paragraph position="2"> Here, both phonetic and acoustic information are considered in the codebook mapping procedure since Pr(XI.A4), the probability of acoustic observations ?d given the model .A/l, is directly maximized. To elaborate, the posterior probability Ai (t) is first computed based on the speaker-independent model \[20\]. Ai (t) measures the similarity that acoustic vector at time t will be quantized with codeword i. The ith mean vector #i of the codebook can then be computed with In this study, the SCHMM is used to reestimate the mean vector only. Three iterations are carried out for each speaker. The error rates with 5 to 40 adaptive sentences from each speaker are 3.8% and 3.6%, respectively. In comparison with the speaker-independent model, the error rate of adaptive systems is reduced by about 15% with only 40 sentences from each speaker. Further increase in the number of adaptive sentences did not lead to any significant improvement. Speaker-adaptive recognition results with 5 to 150 adaptive sentences  In fact, both the mean and variance vector can be adapted iteratively. However, the variances cannot be reliably estimated with limited adaptive data. Because of this, estimates are interpolated with speaker-independent estimates analogous to Bayesian adaptation \[9, 22\]. However, in comparison with iterative SCHMM codebook reestimation, there is no significant error reduction by combining interpolation into the codebook mapping procedure. It is sufficient by just using very few samples to reestimate the mean vector.</Paragraph>
    </Section>
    <Section position="2" start_page="279" end_page="280" type="sub_section">
      <SectionTitle>
3.2 Output distribution adaptation
</SectionTitle>
      <Paragraph position="0"> Several output-distribution adaptation techniques, including cooccurence mapping \[23, 24\], deleted interpolation \[25, 20\], and state-level-distribution clustering, are examined. All these studies are based on SCHMM-adapted codebook as discussed above.</Paragraph>
      <Paragraph position="1"> In cooccurence mapping, the cooccurence matrix, the probability of codewords of the target speaker given the codeword of speaker-independent models, is first computed \[24\]. The output distribution of the speaker-independent models is then projected according to the cooccurence matrix, there is no improvement with cooccurence mapping. This is probably because that cooccurence smoothing only plays the role of smoothing, which is not directly relatect to maximum likelihood estimation.</Paragraph>
      <Paragraph position="2"> A better adaptation technique should be consistent with the criterion used in the speech recognition system. As the total number of distribution parameters is much larger than the codebook parameters, direct reestimation based on the SCHMM will not lead to any improvement. To alleviate the parameter problem, the similarity between output distributions of different phonetic models is measured. If two distributions are similar, they are grouped into the same cluster in a similar manner as the generalized triphone \[23\]. Since clustering is carried out at the state-level, it is more flexible  and more reliable in comparison with model-level clustering. Given two distributions, bi(Oh) and bj (Oh), the similarity between hi(Ok) and bj (Ok) is measured by d(bi, bj) = (\[Ik bi(Ok)C'(Ok))(H~ b.i(Ok) cAdeg&amp;quot;)) (2) (lq~ b~+j ( O~ )C,+~( o~) ) where Ci(Ok) is the count of codeword k in distribution i, bi+j (Ok) is the merged distribution by adding bi(Ok) and bj (O k ). Equation 2 measures the ratio between the probability that the individual distributions generated the training data and the probability that the merged distribution generated the training data in the similar manner as the generalized triphone.  Based on the similarity measure given in Equation 2, the Baum-Welch reestimation can be directly used to estimate the clustered distribution, which is consistent with the criterion used in our speaker-independent system. With speaker-dependent clustered distributions, the original speaker-independent models are interpolated. The interpolation weights can be either estimated using deleted interpolation or by mixing speaker-independent and speaker-dependent counts according to a pre-determined ratio that depends on the number of speaker-dependent data. Due to limited amount of adaptive data, the latter approach is more suitable to the former. It is also found that this procedure is more effective when the interpolation is performed directly on the raw data (counts), rather than on estimates of probability distributions derived from the counts. Let Cg -dep and C~ -indep represent speaker-dependent and speaker-independent counts for distribution i, Afi denote the number of speaker-dependent data for distribution i. Final interpolated counts are computed with</Paragraph>
      <Paragraph position="4"> from which interpolated counts are interpolated with context-independent models and uniform distributions with deleted interpolation. Varying the number of clustered distributions from 300 to 2100, speaker-adaptive recognition resuits are shown in Table 5. Just as in generalized triphone \[23\], the number of clustered distributions depends on the available adaptive data. From Table 5, it can be seen that when 40 sentences are used, the optimal number of clustered distributions is 500. The error rate is reduced from 3.6% (without distribution adaptation) to 3.1%. Detailed results for each speaker is shown in Table 6. In comparison with the speaker-independent system, the error reduction is more than 25%.</Paragraph>
      <Paragraph position="5"> The proposed algorithm can also be employed to incrementally adapt the voice of each speaker. Results are shown in Table 7. When 300 to 600 adaptive sentences are used, the error rate becomes lower than that of the best speaker-dependent systems. Here, clustered distributions are not used because of available adaptation data. With 300-600 adaptive sentences, the error rate is reduced to 2.5-2.4%, which is better than the best speaker-dependent system trained with 600 sentences. This indicates speaker-adaptive speech recognition is quite robust since information provided by speaker-independent models is available.</Paragraph>
    </Section>
    <Section position="3" start_page="280" end_page="281" type="sub_section">
      <SectionTitle>
3.3 Speaker normalization
</SectionTitle>
      <Paragraph position="0"> Speaker normalization may have a significant impact on both speaker-adaptive and speaker-independent speech recognition. Normalization techniques proposed here involve eepstrum transformation of a target speaker to the reference speaker. For each cepstrum vector ,Y=, the transformation function F(?() is defined such that the SCHMM probability Pr(T(?()\]M) can be maximized, where .h4 can be either speaker-independent or speaker-dependent models; and f(Af) can be either a simple function as A~Y + B or any complicated nonlinear function. Thus, a speaker-dependent function ~(,Y=) can be used to normalize the voice of any target speaker to a chosen reference speaker for speaker-adaptive speech recognition. Furthermore, a speaker-independent function .T0-V ) can also be built to reduce the difference of speakers before speaker-independent HMM training is applied such that the resulting speaker-independent models have sharp distributions.</Paragraph>
      <Paragraph position="1"> In the first experiment, two transformation matrix A and/3 are defined such that the speaker-independent SCHMM probability Pr(.AX + Bl.?vl ) is maximized. The mapping structure used here can be regarded as a one-layer perceptron, where the SCHMM probability is used as the objective function. Based on the speaker-independent model, the error rate for the same test set is reduced from 4.3% to 3.9%. This indicates that the  linear transformation used here may be insufficient to bridge the difference between speakers.</Paragraph>
      <Paragraph position="2"> As multi-layer perceptrons (MLP) can be used to approximate any nonlinear function, the fully-connected MLP as shown in Figure 1 is employed for speaker normalization.</Paragraph>
      <Paragraph position="3"> Such a network can be well trained with the back-propagation algorithm. The input of the nonlinear mapping network consists of three frames (3x13) from the target speaker. The output of the network is a normalized cepstrum frame, which is made to approximate the frame of the desired reference speaker. The objective function for network learning is to minimize the distortion (mean squared error) between the network output and the desired reference speaker frame. The network has two hidden layers, each of which has 20 hidden units. Each hidden unit is associated with a sigmoid function.</Paragraph>
      <Paragraph position="4"> For simplicity, the objective function used here has not been unified with the SCHMM. However, the extension should be straightforward.</Paragraph>
      <Paragraph position="5"> To provide learning examples for the network, a DTW algorithm \[26\] is used to warp the target data to the reference data. Optimal alignment pairs are used to supervise network learning. For the given input frames, the desired output frame for network learning is the one paired by the middle input frame in DTW alignment. Since the goal here is to transform the target speaker to the reference speaker, the sigmoid function is not used for the output layer. Multiple input frames feeded to the network not only alleviate possible inaccuracy of DTW alignment but also incorporate dynamic information in the learning procedure. As nonlinear network may be less well trained, full connections between input units and output units are added. This has an effect of interpolation between the nonlinear network output and the original speech frames. This interpolation helps generalization capability of the nonlinear network significantly. To minimize the objective function, both nonlinear connection weights and direct linear connection weights are simultaneously adjusted with the back-propagation algorithm. Experimental experience indicates that 200 to 250 epochs are required to achieve acceptable distortion.</Paragraph>
      <Paragraph position="6"> speaker normalization. Speaker-dependent models (2400 training sentences) are used instead of speaker-independent models. When the reference speaker is randomly selected as LPN, the average recognition error rate for the other three speakers is 41.9% as shown in Table 8. When 40 text- null dependent training sentences are used to build the speaker normalization network, the average error rate is reduced to 6.8%. Note that neither codebook nor output distribution has been adapted yet in this experiment. The error rate has already been reduced by 80%. It is also interesting to note that for female speakers QRM and BJW), speaker normalization dramatically reduces the error rate. Although the error rate of 6.8% is worse than that of the speaker-independent system (4.5%) for the same test set, this nevertheless demonstrated the ability of MLP-based speaker normalization.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>