<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1014">
  <Title>EFFICIENT CEPSTRAL NORMALIZATION FOR ROBUST SPEECH RECOGNITION</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The need for speech recognition systems and spoken language systems to be robust with respect to their acoustical environment has become more widely appreciated in recent years (e.g. [1]). Many studies have demonstrated that even automatic speech recognition systems designed to be speaker independent can perform very poorly when they are tested with a different type of microphone or acoustical environment from the one with which they were trained (e.g. [2,3]), even in a relatively quiet office environment. Applications such as speech recognition over telephones, in automobiles, on a factory floor, or outdoors demand an even greater degree of environmental robustness.</Paragraph>
    <Paragraph position="1"> Many approaches have been considered in the development of robust speech recognition systems, including techniques based on autoregressive analysis, the use of special distortion measures, the use of auditory models, and the use of microphone arrays, among many other approaches (as reviewed in [1,4]).</Paragraph>
    <Paragraph position="2"> In this paper we describe and compare the performance of a series of cepstrum-based procedures that enable the CMU SPHINX-II speech recognition system to maintain a high level of recognition accuracy over a wide variety of acoustical environments. The most recently developed algorithm is multiple fixed codeword-dependent cepstral normalization (MFCDCN). MFCDCN is an extension of a similar algorithm, FCDCN, which provides an additive environmental compensation to cepstral vectors, but in an environment-specific fashion [5]. MFCDCN is less computationally complex than the earlier CDCN algorithm and more accurate than the related SDCN and BSDCN algorithms [6], and it does not require domain-specific training for new acoustical environments. In this paper we describe the performance of MFCDCN and related algorithms, and we compare it to the popular RASTA approach to robustness.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="70" type="metho">
    <SectionTitle>
2. EFFICIENT CEPSTRUM-BASED
COMPENSATION TECHNIQUES
</SectionTitle>
    <Paragraph position="0"> In this section we describe several of the cepstral normalization techniques we have developed to compensate simultaneously for additive noise and linear filtering. Most of these algorithms are completely data-driven, as the compensation parameters are determined by comparisons between the testing environment and simultaneously recorded speech samples using the DARPA standard close-talking Sennheiser HMD-414 microphone (referred to as the CLSTLK microphone in this paper). The remaining algorithm, codeword-dependent cepstral normalization (CDCN), is model-based, as the speech that is input to the recognition system is characterized as speech from the CLSTLK microphone that undergoes unknown linear filtering and corruption by unknown additive noise.</Paragraph>
    <Paragraph position="1"> In addition, we discuss two other procedures, the RASTA method and cepstral mean normalization, that may be referred to as cepstral-filtering techniques. These procedures do not provide as much improvement as CDCN, MFCDCN and related algorithms, but they can be implemented with virtually no computational cost.</Paragraph>
    <Section position="1" start_page="0" end_page="70" type="sub_section">
      <SectionTitle>
2.1. Cepstral Normalization Techniques
</SectionTitle>
      <Paragraph position="0"> SDCN. The simplest compensation algorithm, SNR-Dependent Cepstral Normalization (SDCN) [2,4], applies an additive correction in the cepstral domain that depends exclusively on the instantaneous SNR of the signal. This correction vector equals the average difference in cepstra between simultaneous &quot;stereo&quot; recordings of speech samples from both the training and testing environments at each SNR of speech in the testing environment. At high SNRs, this correction vector primarily compensates for differences in spectral tilt between the training and testing environments (in a manner similar to the blind deconvolution procedure first proposed by Stockham et al. [7]), while at low SNRs the vector provides a form of noise subtraction (in a manner similar to the spectral subtraction algorithm first proposed by Boll [8]). The SDCN algorithm is simple and effective, but it requires environment-specific training.</Paragraph>
      <Paragraph position="1"> FCDCN. Fixed codeword-dependent cepstral normalization (FCDCN) [4,6] was developed to provide greater recognition accuracy than SDCN in a more computationally efficient fashion than the CDCN algorithm, which is summarized below.</Paragraph>
      <Paragraph position="2"> The FCDCN algorithm applies an additive correction that depends on the instantaneous SNR of the input (like SDCN), but that can also vary from codeword to codeword (like CDCN): $\hat{x} = z + r[k,l]$, where for each frame $\hat{x}$ represents the estimated cepstral vector of the compensated speech, $z$ is the cepstral vector of the incoming speech in the target environment, $k$ is an index identifying the VQ codeword, $l$ is an index identifying the SNR, and $r[k,l]$ is the correction vector.</Paragraph>
      <Paragraph position="3"> The selection of the appropriate codeword is done at the VQ stage, so that the label $k$ is chosen to minimize $\|z + r[k,l] - c[k]\|^2$, where the $c[k]$ are the VQ codewords of the speech in the training database. The new correction vectors are estimated with an EM algorithm that maximizes the likelihood of the data.</Paragraph>
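      <Paragraph> As a minimal sketch of the per-frame compensation described above, the following Python fragment picks the codeword that minimizes the squared residual distance and then adds the corresponding correction vector. The function name, the nested-list layout of r and c, and the toy values are assumptions made for illustration only; they are not taken from the SPHINX-II implementation.

```python
def fcdcn_compensate(z, l, r, c):
    """Compensate one frame; return (k_best, x_hat).

    z: cepstral vector of the incoming frame
    l: SNR index of the frame
    r: r[k][l] is the correction vector for codeword k at SNR index l
    c: c[k] is the k-th VQ codeword of the training environment
    """
    def distortion(k):
        # Squared distance between the corrected frame and codeword k.
        return sum((zi + ri - ci) ** 2
                   for zi, ri, ci in zip(z, r[k][l], c[k]))

    k_best = min(range(len(c)), key=distortion)
    x_hat = [zi + ri for zi, ri in zip(z, r[k_best][l])]
    return k_best, x_hat


# Toy data (hypothetical): two codewords, two coefficients, one SNR level.
codebook = [[0.0, 0.0], [4.0, 4.0]]
corrections = [[[1.0, 1.0]], [[-1.0, -1.0]]]
k_best, x_hat = fcdcn_compensate([4.5, 5.5], 0, corrections, codebook)
print(k_best, x_hat)  # 1 [3.5, 4.5]
```

Because the correction vectors are simply looked up, the per-frame cost in this sketch is one pass over the codebook.</Paragraph>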
      <Paragraph position="4"> The probability density function of $x$ is assumed to be a mixture of Gaussian densities as in [2,4].</Paragraph>
      <Paragraph position="6"> The cepstra of the corrupted speech are modeled as Gaussian random vectors, whose variance depends also on the instantaneous SNR, l, of the input.</Paragraph>
      <Paragraph position="7"> $p(z \mid k, r, l) \propto \frac{1}{\sigma[l]} \exp\left(-\frac{\|z + r[k,l] - c[k]\|^2}{2\sigma^2[l]}\right)$ In [4] it is shown that the EM solution is given by the following iterative algorithm. In practice, convergence is reached after 2 or 3 iterations if we choose the initial values of the correction vectors to be the ones specified by the SDCN algorithm.</Paragraph>
      <Paragraph position="8"> 1. Assume initial values for $r'[k,l]$ and $\sigma^2[l]$. 2. Estimate $f_i[k]$, the a posteriori probabilities of the mixture components given the correction vectors $r'[k,l_i]$, variances $\sigma^2[l_i]$, and codebook vectors $c[k]$: $f_i[k] = \frac{\exp\left(-\frac{1}{2\sigma^2[l_i]} \|z_i + r'[k,l_i] - c[k]\|^2\right)}{\sum_{p} \exp\left(-\frac{1}{2\sigma^2[l_i]} \|z_i + r'[p,l_i] - c[p]\|^2\right)}$</Paragraph>
      <Paragraph position="10"> where $z_i$ is the cepstral vector and $l_i$ is the instantaneous SNR of the $i$th frame.</Paragraph>
      <Paragraph position="11">  3. Maximize the likelihood of the complete data by obtaining new estimates for the correction vectors $r'[k,l]$ and the corresponding variances $\sigma^2[l]$.</Paragraph>
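      <Paragraph> Step 2 of the iteration can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: each codeword is weighted by exp(-d / (2 sigma^2[l])), where d is the squared residual distance, and the weights are normalized to sum to one. Equal mixture weights are assumed here, and the function name and data layout are hypothetical.

```python
import math


def posteriors(z, l, r, c, var):
    """A posteriori probabilities f[k] of the mixture components for one frame.

    z: cepstral vector of the frame, l: its SNR index
    r: r[k][l] is the current correction vector estimate
    c: c[k] is the k-th VQ codeword
    var: var[l] is the current variance estimate at SNR index l
    Assumes equal prior weight for every codeword (an assumption).
    """
    d = [sum((zi + ri - ci) ** 2 for zi, ri, ci in zip(z, r[k][l], c[k]))
         for k in range(len(c))]
    d_min = min(d)  # subtracted for numerical stability; cancels on normalization
    w = [math.exp(-(dk - d_min) / (2.0 * var[l])) for dk in d]
    total = sum(w)
    return [wk / total for wk in w]


# Toy data (hypothetical): a frame midway between two codewords.
print(posteriors([1.0], 0, [[[0.0]], [[0.0]]], [[0.0], [2.0]], [1.0]))  # [0.5, 0.5]
```

Subtracting the smallest distance before exponentiating leaves the result unchanged and avoids underflow when all distances are large.</Paragraph>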
      <Paragraph position="13"> In the current version of FCDCN the SNR is varied over a range of 30 dB in 1-dB steps, with the lowest SNR set equal to the estimated noise level. At each SNR compensation vectors are computed for each of 8 separate VQ clusters.</Paragraph>
      <Paragraph position="14"> Figure 1 illustrates some typical compensation vectors obtained with the FCDCN algorithm, computed using the standard close-talking Sennheiser HMD-414 microphone and the unidirectional desktop PCC-160 microphone used as the target environment. The vectors are computed at the extreme SNRs of 0 and 29 dB, as well as at 5 dB. These curves are obtained by calculating the cosine transform of the cepstral compensation vectors, so they provide an estimate of the effective spectral profile of the compensation vectors. The horizontal axis represents frequency, warped nonlinearly according to the mel scale [9]. The maximum frequency corresponds to the Nyquist frequency, 8000 Hz. We note that the spectral profile of the compensation vector varies with SNR, and that especially for the intermediate SNRs the various VQ clusters require compensation vectors of different spectral shapes. The compensation curves for 0-dB SNR average to zero dB at low frequencies by design.</Paragraph>
      <Paragraph position="16"/>
    </Section>
  </Section>
  <Section position="5" start_page="70" end_page="71" type="metho">
    <SectionTitle>
Figure 1
</SectionTitle>
    <Paragraph position="0"> Comparison of compensation vectors using the FCDCN method with the PCC-160 unidirectional desktop microphone, at three different signal-to-noise ratios. The maximum SNR used by the FCDCN algorithm is 29 dB.</Paragraph>
    <Paragraph position="1"> The computational complexity of the FCDCN algorithm is very low because the correction vectors are precomputed.</Paragraph>
    <Paragraph position="2"> However, FCDCN does require simultaneously recorded data from the training and testing environments. In previous studies [6] we found that the FCDCN algorithm provided a level of recognition accuracy that exceeded what was obtained with all other algorithms, including CDCN.</Paragraph>
    <Paragraph position="3"> MFCDCN. Multiple fixed codeword-dependent cepstral normalization (MFCDCN) is a simple extension to the FCDCN algorithm, with the goal of exploiting the simplicity and effectiveness of FCDCN but without the need for environment-specific training.</Paragraph>
    <Paragraph position="4"> In MFCDCN, compensation vectors are precomputed in parallel for a set of target environments, using the FCDCN procedure as described above. When an utterance from an unknown environment is input to the recognition system, compensation vectors computed using each of the possible target environments are applied successively, and the environment is chosen that minimizes the average residual VQ distortion over the entire utterance, $\|z + r[k,l,m] - c[k]\|^2$, where $k$ refers to the VQ codeword, $l$ to the SNR, and $m$ to the target environment used to train the ensemble of compensation vectors. This general approach is similar in spirit to that used by the BBN speech system [13], which performs a classification among six groups of secondary microphones and the CLSTLK microphone to determine which of seven sets of phonetic models should be used to process speech from unknown environments.</Paragraph>
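    <Paragraph> The environment selection described above can be sketched as follows, assuming the precomputed correction vectors are stored as r[m][k][l] for environment m. The function names, the data layout, and the toy values are hypothetical and serve only to illustrate the selection rule.

```python
def residual_distortion(frames, snr_idx, r_m, c):
    """Average residual VQ distortion of an utterance for one environment.

    frames: list of cepstral vectors z, snr_idx: SNR index l of each frame
    r_m: r_m[k][l] is the correction vector of this environment
    c: c[k] is the k-th VQ codeword of the training environment
    """
    total = 0.0
    for z, l in zip(frames, snr_idx):
        # Distortion of the best-matching codeword after correction.
        total += min(
            sum((zi + ri - ci) ** 2 for zi, ri, ci in zip(z, r_m[k][l], c[k]))
            for k in range(len(c))
        )
    return total / len(frames)


def select_environment(frames, snr_idx, r, c):
    """Return (m_best, scores), where r[m] holds the vectors of environment m."""
    scores = [residual_distortion(frames, snr_idx, r_m, c) for r_m in r]
    m_best = min(range(len(r)), key=lambda m: scores[m])
    return m_best, scores


# Toy data (hypothetical): environment 1 removes a constant offset of 2.
codebook = [[0.0], [10.0]]
env_vectors = [[[[0.0]], [[0.0]]], [[[-2.0]], [[-2.0]]]]
m_best, scores = select_environment([[2.0], [12.0]], [0, 0], env_vectors, codebook)
print(m_best, scores)  # 1 [4.0, 0.0]
```

In this sketch every candidate environment is scored over the whole utterance, so the cost grows linearly with the number of environments.</Paragraph>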
    <Paragraph position="5"> The success of MFCDCN depends on the availability of training data with stereo pairs of speech recorded from the training environment and from a variety of possible target environments, and on the extent to which the environments in the training data are representative of what is actually encountered in testing.</Paragraph>
    <Paragraph position="6"> IMFCDCN. While environment selection for the compensation vectors of MFCDCN is generally performed on an utterance-by-utterance basis, the probability of a correct selection can be improved by allowing the classification process to make use of cepstral vectors from previous utterances in a given session as well. We refer to this type of unsupervised incremental adaptation as Incremental Multiple Fixed Codeword-Dependent Cepstral Normalization (IMFCDCN). CDCN. One of the best known compensation algorithms developed at CMU is Codeword-Dependent Cepstral Normalization (CDCN) [2,4]. CDCN uses EM techniques to compute ML estimates of the parameters characterizing the contributions of additive noise and linear filtering. When applied in inverse fashion to the cepstra of an incoming utterance, these parameters produce an ensemble of cepstral coefficients that best match (in the ML sense) the cepstral coefficients of the incoming speech in the testing environment to the locations of VQ codewords in the training environment.</Paragraph>
    <Paragraph position="7"> The CDCN algorithm has the advantage that it does not require a priori knowledge of the testing environment (in the form of any sort of simultaneously recorded &quot;stereo&quot; training data in the training and testing environments).</Paragraph>
    <Paragraph position="8"> However, it has the disadvantage of a somewhat more computationally demanding compensation process than MFCDCN and the other algorithms described above. Compared to MFCDCN and similar algorithms, CDCN uses a greater amount of structural knowledge about the nature of the degradations to the speech signal in order to improve recognition accuracy. Liu et al. [5] have shown that the structural knowledge embodied in the CDCN algorithm enables it to adapt to new environments much more rapidly than an algorithm closely related to SDCN, but this experiment has not yet been repeated for FCDCN.</Paragraph>
    <Section position="1" start_page="71" end_page="71" type="sub_section">
      <SectionTitle>
2.2. Cepstral Filtering Techniques
</SectionTitle>
      <Paragraph position="0"> In this section we describe two extremely simple techniques, RASTA and cepstral mean normalization, which can achieve a considerable amount of environmental robustness at almost negligible cost.</Paragraph>
      <Paragraph position="1"> RASTA. In RASTA filtering [10], a high-pass filter is applied to a log-spectral representation of speech such as the cepstral coefficients. The SRI DECIPHER™ system, for example, uses the high-pass filter described by the difference equation $y[n] = x[n] - x[n-1] + 0.97\,y[n-1]$, where $x[n]$ and $y[n]$ are the time-varying cepstral vectors of the utterance before and after RASTA filtering, and the index $n$ refers to the analysis frames [11].</Paragraph>
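      <Paragraph> The difference equation above can be sketched in Python as a first-order recursive filter applied independently to each cepstral coefficient. The function name is hypothetical, and the zero initial conditions (x[-1] = 0 and y[-1] = 0) are an assumption, since the text does not specify how the filter is initialized.

```python
def rasta_highpass(frames, pole=0.97):
    """Apply y[n] = x[n] - x[n-1] + pole * y[n-1] to each coefficient.

    frames: list of cepstral vectors, one per analysis frame
    Initial conditions are taken to be zero (an assumption).
    """
    if not frames:
        return []
    prev_x = [0.0] * len(frames[0])
    prev_y = [0.0] * len(frames[0])
    out = []
    for x in frames:
        y = [xi - pxi + pole * pyi for xi, pxi, pyi in zip(x, prev_x, prev_y)]
        out.append(y)
        prev_x, prev_y = x, y
    return out


# A constant input decays toward zero, as expected of a high-pass filter.
filtered = rasta_highpass([[1.0], [1.0], [1.0]])
print([round(y[0], 4) for y in filtered])  # [1.0, 0.97, 0.9409]
```

With a constant input the output shrinks by a factor of 0.97 per frame, which is the time-domain counterpart of the notch at zero frequency.</Paragraph>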
      <Paragraph position="2"> Cepstral mean normalization. Cepstral mean normalization (CMN) is an alternate way to high-pass filter cepstral coefficients. In cepstral mean normalization the mean of the cepstral vectors of an utterance is subtracted from the cepstral coefficients of that utterance on a sentence-by-sentence basis: $y[n] = x[n] - \frac{1}{N} \sum_{m=1}^{N} x[m]$</Paragraph>
      <Paragraph position="4"> where $N$ is the total number of frames in an utterance.</Paragraph>
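      <Paragraph> Cepstral mean normalization can be sketched in a few lines of Python. The function name and the toy values are hypothetical; the sketch assumes a non-empty utterance whose frames all have the same number of coefficients.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from every cepstral vector.

    frames: non-empty list of cepstral vectors of equal length
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(dim)]
    return [[f[j] - mean[j] for j in range(dim)] for f in frames]


clean = [[1.0, 2.0], [3.0, 6.0]]
# The same utterance with a constant offset added to every frame.
shifted = [[1.5, 2.25], [3.5, 6.25]]
print(cepstral_mean_normalize(clean))    # [[-1.0, -2.0], [1.0, 2.0]]
print(cepstral_mean_normalize(shifted))  # [[-1.0, -2.0], [1.0, 2.0]]
```

The two outputs coincide because any offset that is constant over the utterance is absorbed into the mean and removed.</Paragraph>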
      <Paragraph position="5"> Figure 2 shows the low-frequency portions of the transfer functions of the RASTA and CMN filters. Both curves exhibit a deep notch at zero frequency. The shape of the CMN curve depends on the duration of the utterance, and is plotted in Figure 2 for the average duration in the DARPA Wall Street Journal task, 7 seconds. The Nyquist frequency for the time-varying cepstral vectors is 50 frames per second. Algorithms like RASTA and CMN compensate for the effects of unknown linear filtering because linear filters produce a static compensation vector in the cepstral domain that is the average difference between the cepstra of speech in the training and testing environments. Because the RASTA and CMN filters are highpass, they force the average values of cepstral coefficients to be zero in both the training and testing domains, which removes this static offset. Nevertheless, neither CMN nor RASTA can compensate directly for the combined effects of additive noise and linear filtering. It is seen in Figure 1 that the compensation vectors that maximize the likelihood of the data vary as a function of the SNR of individual frames of the utterance. Hence we expect compensation algorithms like MFCDCN (which incorporate this knowledge) to be more effective than RASTA or CMN. Figure 2: Low-frequency portions of the transfer functions of the highpass cepstral filters implemented by the RASTA algorithm as used by SRI (dotted curve), and as implied by CMN (solid curve). The CMN curve assumes an utterance duration of 7 seconds.</Paragraph>
    </Section>
  </Section>
</Paper>