<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1037">
  <Title>Minimizing Speaker Variation Effects for Speaker-Independent Speech Recognition</Title>
  <Section position="3" start_page="0" end_page="193" type="metho">
    <SectionTitle>
2. REVIEW OF THE SPHINX-II SYSTEM
</SectionTitle>
    <Paragraph position="0"> In comparison with the SPHINX system \[20\], the SPHINX-II system \[6\] reduced the word error rate by more than 50% through incorporating between-word coarticulation modeling \[13\], high-order dynamics \[9\], sex-dependent shared-distribution semi-continuous hidden Markov models \[9, 15\]. This section will review SPHINX-II, which will be used as our baseline system for this study \[6\].</Paragraph>
    <Section position="1" start_page="191" end_page="191" type="sub_section">
      <SectionTitle>
2.1. Signal Processing
</SectionTitle>
      <Paragraph position="0"> The input speech signal is sampled at 16 kHz with a preemphasized filter, 1 - 0.9Z -1. A Hamming window with a width of 20 msec is applied to speech signal every 10 msec. The 32-order LPC analysis is followed to compute the 12-order cepstral coefficients. Bilinear transformation of cepstral coefficients is employed to approximate reel-scale representation. In addition, relative power is also computed together with eepstral coefficients. Speech features used in SPHINX-II include (t is in units of 10 msec) LPC cepstral coefficients; 40-msec and 80-msec differenced LPC cepstral coefficients; second-order differenced cepstral coefficients; and power, 40-msec differenced power, second-order differenced power. These features are vector quantized into four independent codebooks by the Linde-Buzo-Gray algorithm \[21\], each of which has 256 entries.</Paragraph>
    </Section>
    <Section position="2" start_page="191" end_page="193" type="sub_section">
      <SectionTitle>
2.2. Training
</SectionTitle>
      <Paragraph position="0"> Training procedures are based on the forward-backward algorithm. Word models are formed by concatenating phonetic models; sentence models by concatenating word models.</Paragraph>
      <Paragraph position="1"> There are two stages at training. The first stage is to generate the shared output distribution mapping table. Forty-eight context-independent discrete phonetic models are initially estimated from the uniform distribution. Deleted interpolation \[17\] is used to smooth the estimated parameters with the uniform distribution. Then context-dependent models have to be estimated based on context-independent ones. There are 7549 triphone models in the DARPA RM task when both within-word and between-word triphones are considered. To facilitate training, one codebook discrete models were used, where acoustic feature consists of the cepstrai coefficients, 40-msec differenced cepstrum, power and 40-msec differenced power. After the 7549 discrete models are obtained, the distribution clustering procedure \[14\] is then applied to create 4500 distributions (senones). The second stage is to train 4codebook models. We first estimate 48 context independent, four-codebook discrete models with the uniform distribution.</Paragraph>
      <Paragraph position="2"> With these context independent models and the senone table, we then estimate the shared-distribution SCHMMs \[9\].</Paragraph>
      <Paragraph position="3"> Because of substantial difference between male and female speakers, two sets of sex-dependent SCHMMs are are separately trained to enhance the performance.</Paragraph>
      <Paragraph position="4"> To summarize, the configuration of the SPHINX-II system has:  * four codebooks of acoustic features, * shared-distribution between-word and within-word triphone models, * sex-dependent SCHMMs.</Paragraph>
      <Paragraph position="5"> 2.3. Recognition  In recognition, a language network is pre-compiled to represent the search space. For each input utterance, the (artificial) sex is first determined automatically as follows \[8, 31\]. Assume each codeword occurs equally and assume codeword i is represented by a Gaussian density function N(x, Pi, ~i). Then given a segment of speech x~, Prsex, the probability that x~&amp;quot; is generated from codebook-sex is approximated by:</Paragraph>
      <Paragraph position="7"> where r/t is a set that contains the top N codeword indices during quantization for cepstrum data xt at time t. If Prrnale  &gt; Pry~mat~, then x~ belongs to male speakers. Otherwise, x~ is female speech. After the sex is determined, only the models of the determined sex are activated during recognition. This saves both CPU time and memory requirement. For each input utterance, the Viterbi beam search algorithm is used to find out the optimal state sequence in the language network. 3. NEURAL NETWORK ARCHITECTURE 3.1. Codeword-Dependent Neural Networks (CDNN)  When presented with a large amount of training data, a single network is often unable to produce satisfactory results during training as each network is only suitable to a relatively small task. To improve the mapping performance, breaking up a large task and modular construction are usually required \[5, 7\]. This is because the nonlinear relationship between two speakers is very complicated, a simple network may not be powerful enough. One solution is to partition the mapping spaces into smaller regions, and to construct a neural network for each region as shown in Figure 1. As each neural network is trained on a separate region in the acoustic space, the complexity of the mapping required of each network is thus reduced. In Figure 1, the switch can be used to select the most likely network or top N networks based on some probability measures of acoustic similarity \[101. Functionally, the assembly of networks is similar to a huge neural network. However, each network in the assembly is learned independently with training data for the corresponding regions. This reduces the complexity of finding a good solution in a huge space of possible network configurations since strong constraints are introduced in performing complex constraint satisfaction in a massively interconnected network.</Paragraph>
      <Paragraph position="8"> Vector quantization (VQ) has been widely used for data compression in speech and image processing. Here, it can be used to to partition original acoustic space into different prototypes (codewords). This partition can be regarded as a procedure to perform broad-acoustic pattern classification.  The broad-acoustic patterns are automatically generated via a self-organization procedure based on the LBG algorithm \[21\]. When the codeword-dependent neural network (CDNN) was constructed from the data in the corresponding cell, it was found that learning for the CDNN converges very quickly in comparison with a huge neural network. The larger the codebook, the quicker it converges. However, the size of codebook relies on the number of available training data since codeword-dependent structure fragments training data. The size of codebook should be determined experimentally.</Paragraph>
      <Paragraph position="9"> Speaker normalization involves acoustic data transformation from one speaker cluster to another. In general, let X a = xl,xz,a a ...x\[ be a sequence of observations (frames) at time 1, 2, .. t of speaker a. Here, each observation at time k, x\[, is a multidimensional vector, which usually characterizes some short-time spectral features. For the sequence of speech observations X a produced by speaker-cluster a, our goal is to find a mapping function .Tt'(X a ) such that ~(X a ) resembles the corresponding sequence of observations produced by speakers in the golden speaker cluster. Speaker variations include many factors such as vocal tract, pitch, speaking speed, intensity, and cultural differences. Unfortunately, given two different speakers, there is no simple mapping function that can account for all these variations. Consequently, we are mainly concerned with spectral normalization. For each frame x a, we want to find out a mapping function to transform it to x b, the corresponding phonetic realization produced by speaker b. We believe that x\[ can represent most important features produced by the speaker. Thus, our objective functions is to minimize:</Paragraph>
      <Paragraph position="11"> corresponding pairs where ~D(x,y) denotes a predefined distortion measure between frame x and y, and corresponding pairs are constructed to approximate acoustic realizations of different speakers. Even if we are only interested in spectral normalization, there is no analytic mapping solution. Instead, stochastic approach has to be used to study the nonlinear relationship between the two observed spaces. We need to have a set of supervision data (corresponding pairs in Equation 1) to extract the nonlinear relationship.</Paragraph>
      <Paragraph position="12"> It has been found that dynamic information plays an important role in speech recognition \[4, 20, 12\]. As frame to frame normalization lacks use of dynamic information, the architecture of normalization network is thus chosen to incorporate multiple neighboring frames. One of such architectures is shown in Figure 2. Here, the current frame and its left and right neighboring frames are fed to the multi-layer neural network as inputs. The network output is a normalized frame corresponding to the current input frame. By using multiple input frames for the network, the important dynamic information can be effectively used in estimating network parameters and in normalization. In Figure 2, there are input layer, hidden layer, and output layer. Each arc k is associated with normalized frame previous frame current frame next frame  a weight wk. In the hidden and output layer, each node is characterized by an internal offset 0. The hidden node is also characterized by a nonlinear sigmoid function. The input to each hidden node and output node is a weighted sum of corresponding inputs with the offset 0. Both the internal offset and arc weights are learned by the backpropagation algorithm \[30\], which uses a gradient search to minimize the objective function. If the dimension of observation space is d and the number of input frames is m, we will have dxm input units in the normalization network. If we want to incorporate more neighboring frames, this will definitely increase the number of free parameters in the network. Although the increase in the number of free parameters lead to quick convergence during training, this nevertheless may not lead to improved general- null ization capability. Since the network is designed to normalize new data from a given speaker to the reference speaker, good generalitzation capability will be the most important concern. Therefore, a compromise has to be made between generalization capability and the number of free parameters.</Paragraph>
    </Section>
    <Section position="3" start_page="193" end_page="193" type="sub_section">
      <SectionTitle>
3.2. Golden Speaker-Cluster Selection
</SectionTitle>
      <Paragraph position="0"> Speaker-dependent CDNNs have been used successfully for speaker-adaptive speech recognition \[7\] (speaker-dependent mapping). If we need to map multiple speakers to one golden speaker and simply construct a speaker-independent CDNN, it is unlikely that a single network will do the job. With the same rational as CDNN for speaker-adaptive speech recognition, we can partition multiple speakers into speaker-clusters and construct cluster-dependent CDNN.</Paragraph>
      <Paragraph position="1"> For speaker clustering, we first generated 48 phonetic HMM for each speaker in the speaker-independent training database.</Paragraph>
      <Paragraph position="2"> Thus, for each speaker, we have a set of output distributions.</Paragraph>
      <Paragraph position="3"> We then merge the two speaker-clusters iteratively that resulted in the least loss of information, and then move elements from cluster to cluster to improve the overall quality.</Paragraph>
      <Paragraph position="4"> The clustering procedure used here is similar to the one used for generalized triphone clustering \[19\]. We can continue the clustering process until the specified speaker-clusters are obtained. The golden speaker-cluster is the one that contains the largest number of speakers. We generated two golden clusters for male and female respectively.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>