<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2045">
  <Title>SPECTRAL ESTIMATION FOR NOISE ROBUST SPEECH RECOGNITION</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SPECTRAL ESTIMATION FOR NOISE ROBUST SPEECH RECOGNITION
Adoram Erell and Mitch Weintraub
SRI International
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> We present results on the recognition accuracy of a continuous speech, speaker independent HMM recognition system that incorporates a novel noise reduction algorithm. The algorithm is a minimum mean square error estimation tailored for a filter-bank front-end. It introduces a significant improvement over similar published algorithms by incorporating a better statistical model for the filter-bank log-energies, and by attempting to jointly estimate the log-energies vector rather than individual components. The algonthm was tested with SRrs recognizer trained on the official speaker-independent &amp;quot;Resource management task&amp;quot; clean speech database. When tested with additive white gaussian noise, the noise reduction achieved by the algorithm is equivalent to a 13 dB SNR improvement. When tested with desktop microphone recordings, the error rate at 13 dB SNR is only 40% higher than that with close-talking microphone at 31 dB SNR.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="320" type="metho">
    <SectionTitle>
I. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Speech recognition systems are very sensitive to differences between the testing and training conditions. In particular, systems that are trained on high quality speech degrade drastically in noisy environments. One of several commonly used methods for handling this problem is to supplement the acoustic front-end of the recognizer with a statistical estimator. This paper introduces a novel estimation algorithm for a filter-bank based front-end, and describes recognition experiments with noisy speech.</Paragraph>
    <Paragraph position="1"> Estimation algorithms that have been used in filter-bank based systems can be classified by their estimation method -- minimum mean square error (MMSE) versus subtraction -- and by the features they estimate -- DFT coefficients versus the filter-bank output energies. Boll \[1\] and Macaulay and Malpass \[2\] used spectral subtraction; Ephraim and Malah \[3\],\[4\] and Porter and Boll \[5\] used MMSE estimation of various functionals of the DFT coefficients; Van Compemolle used filter energy subtraction \[6\] and MMSE estimation of filter-bank log energies \[7\]. The latter MMSE algorithm lacks, however, the degree of precision that has been incorporated by Porter and Boll in their statistical modeling and was not compared to their algorithm.</Paragraph>
    <Paragraph position="2"> A common deficiency of all of the above algorithms is that they estimate different frequency channels (whether DFT coefficients or filter output energies) independently. However, for a speech recognition system which is based on a distance metric, whether for template matching or vector quantization, the estimation should strive to minimize the average distortion as measured by the distance metric. For euclidean distance this criterion yields the following estimator</Paragraph>
    <Paragraph position="4"> where S is the clean speech feature vector and P (S I S t) is the aposteriori probability of the clean speech given the noisy observation. This estimator is considerably more complex than the independent MMSE of individual components (denoted by Sk),</Paragraph>
    <Paragraph position="6"> since Eq.(1) involves the estimation of multidimensional probability distributions and multidimensional integrations, whereas Eq.(2) is a relatively simple one dimensional integral. Both Eqs.(1) and (2) can in principle be computed using Bayes's rule, which than requires the conditioned probability P(~'I~) of the noisy observation S ~ given that the clean speech was S, and the clean speech probability distribution P(S).</Paragraph>
    <Paragraph position="7"> Most HMM systems use probability densities or distance metrics in a transformed domain. For example, our recognizer uses a weighted euclidean distance on the cepstral Iransform of the filter-bank energies. Since the optimal estimation criterion cannot be easily satisfied for such a metric, the practical question is which features and what computationally feasible estimation scheme will best approximate the optimal estimator. We argue that the filter-bank log energies are more attractive to estimate relative to either the DFT or the cepstral coefficients.</Paragraph>
    <Paragraph position="8"> They are more attractive than DFT coefficients since (a) the euclidean distance between filter-bank energies vectors is a better approximation to the cepstral distance than a euclidean distance between any functional of the DFT coefficients, and (tO, the estimation of typically 20-25 filter-bank energies is computationally easier than the estimation of typically 200 DFT coefficients. They are more attractive than the cepstral coefficients since the conditioned probability P(S'I~) can be modeled accurately for gaussian type noise in the frequency domain but not in the cepstral domain.</Paragraph>
    <Paragraph position="9"> In the present work we achieved three objectives. First, we derived an MMSE estimator for filter-bank log-energies based on a more accurate statistical model than the one derived by Van Compernolle \[7\], and compared its performance to that achieved with the DFT estimator of Porter and Boll \[5\]. Second, we improved over the independent MMSE estimation of individual filter energies by computing an approximation to the minimumdistortion estimator, Eq.(1). Third, the estimation algorithm was evaluated with SRI's DECIPHER continuous speech recognition system \[8\] on the NBS &amp;quot;Resource management task&amp;quot; speech database \[9\] with both additive white gaussian noise, and with desktop microphone recordings.</Paragraph>
    <Paragraph position="10"> II. ESTIMATION OF FILTER LOG-ENERGIES A. MMSE OF INDIVIDUAL FILTER LOG-ENERGIES The MMSE estimator given by Eq.(2) can be computed using Bayes's rule as follows:</Paragraph>
    <Paragraph position="12"> where Sk is the clean speech filter log-energy and S'k is the observed noisy value. The clean speech probability distribution P(Sk) was estimated in our implementation from speech data. The conditioned probability P(S'klSk) was modeled as follows.</Paragraph>
    <Paragraph position="13"> The filter output energy (Ek) is computed by a weighted sum of squared DFT coefficients. For additive, gaussian noise, the DFT coefficients of the noise are approximately independent, gaussian random variables. Approximating the weighted sum by a non weighted sum of M coefficients, and assuming that the noise spectral power is uniform over the frequency range spanned by any single filter, the noisy filter energy E'k is given by</Paragraph>
    <Paragraph position="15"> where the subscripts s and n correspond to speech and noise. Since the noise spectral power is assumed to be uniform within the range of summation, both Re{DFTn} and Im {DFTn} are gaussian random variables with sigma given by</Paragraph>
    <Paragraph position="17"> where Nk is the expected value of the noise filter energy. Under these conditions the random variable E'ld62 will obey a probability distribution of non central chi square with 2M degrees of freedom and non central parameter</Paragraph>
    <Paragraph position="19"> and the probability of the log-energy can then be easily computed.</Paragraph>
    <Paragraph position="20"> To account for correlations between DFT coefficients (introduced for noise by the hamming window), we relaxed the parameter M to fit the above model to simulated distributions with white gaussian noise. The modeled conditioned probability P(S'klSI0 and the estimated clean speech probability distribution P(Sk) were then used to compute the MMSE estimator of individual filter log energies.</Paragraph>
    <Paragraph position="21"> B. APPROXIMATE MINIMUM-DISTORTION JOINT ESTIMATION OF</Paragraph>
  </Section>
  <Section position="3" start_page="320" end_page="323" type="metho">
    <SectionTitle>
B. APPROXIMATE MINIMUM-DISTORTION JOINT ESTIMATION OF FILTER LOG-ENERGIES
</SectionTitle>
    <Paragraph position="0"> To improve over the individual components MMSE estimator we approximated the joint estimator, Eq.(1), by the following method: Eq.(1) can be computed with Bayes' rule, similarly to Eq.(3), with the vector S replacing the components Sk. The conditioned probability P(S'IS) can then be modeled simply as the product of the marginal probabilities: P (S'I S) =H P (S'kl Sk) k (8) This approximation is fairly good for nonoverlapping filters, since gaussian noise is uncorrelated in the frequency domain and the value of a given noisy filter energy S'k is indeed dependent only on the clean energy Sk and on the noise level in that frequency band. For overlapping filters, such as in our system, the approximation in Eq.(8) is not as good as for nonoverlapping ones, but is still quite reasonable. The clean speech probability distribution P(S), on the other hand, cannot be represented in the frequency domain as a product of the marginal probabilities. In fact, if it could have been represented as such, the joint estimate would have been reduced to MMSE of individual componentsl However, one can improve over the single product model of P(S-~ by a sum of such products: &amp;quot;.&amp;quot;4&amp;quot; P (s) = E Cn n Pn (Sk) n k (9) The acoustic space can be partitioned either in the filter energies coordinate space, or in any other reduced representation. The estimator can be then approximated by</Paragraph>
    <Paragraph position="2"> where Sk is the MMSE estimator obtained for the n-th distribution component.</Paragraph>
    <Paragraph position="3"> III. RECOGNITION EXPERIMENTS The above algonthms were evaluated with SRI's DECIPHER continuous-speech, speaker-independent, HMM recognition system \[8\]. The system's acoustic front-end consists of performing 512-point FFT on 25.6 msec long speech frames, every 10 msec. The spectral power is summed in frequency bands corresponding to 25 overlapping Mel-scale filters, spanning a frequency range of 100-6400 Hz. A discrete cepstral transform is performed on the filter energies. The HMM is trained with discrete densities of four features: the truncated cepstral vector C1-C12, the DC component CO, and their corresponding time denvatives (Delta). The vector quantization of the cepstral and delta-cepstral vectors use a variance-weighted euclidean distance metric. The recognition task was the 1000 word vocabulary of the DARPA-NIST &amp;quot;Resource management task&amp;quot; using a word-pair grammar (perplexity 60) \[11\].</Paragraph>
    <Paragraph position="4"> The training was based on 3990 sentences of high quality speech, recorded at TI in a sound attenuated room with a close-talking microphone (designated by NIST as the Feb. '89 large training set). The testing material consisted of two types of noisy speech. The first was the DARPA-NIST &amp;quot;Resource management task&amp;quot; February 1989 test set (30 sentence from each of 10 talkers not in the training set), with additive white gaussian noise. The second consisted of recordings made at SRI, with both close talking and desktop microphones, in a noisy environment. Nine speakers participated in the SRI recordings, each speaking 30 sentences from the &amp;quot;Resource management task&amp;quot;. The noise, predominantly generated by air conditioning, was estimated from a three seconds sample, recorded at the beginning of each speaker session. For these recordings the estimation algonthm was supplemented with equalization, to compensate for global differences between the SRI microphones and the one used for the training database. The equalization was particularly necessary for the desktop microphone, whose frequency response was very much dependent on the location of the speaker relative to the microphone. The equalization was performed by aligning each speaker's long term average spectrum with that obtained by averaging the spectrum over the whole training database.</Paragraph>
    <Paragraph position="5"> Fig. 1 shows the recognition error rate obtained with no processing and with our best estimation algonthm (Eq. (10)), when trained on clean speech and tested with additive white noise. The SNR is defined here as the ratio between signal and noise average power, computed directly on the waveform. The performance without processing at 23 dB SNR is almost equal to that achieved with preprocessing at 10 dB SNR, suggesting that the estimation improves the effective SNR by 13 dB.</Paragraph>
    <Paragraph position="6"> Fig. 2 compares the error rate achieved with several estimation algonthms, all tested on the TI recorded speech with additive white noise at 10 dB SNR. With the exception of the two fightmost charts, the training was performed on clean speech. The estimation algorithms are, from left to right: (1) no processing; (2) filter energy subtraction following Van Compernolle's method \[6\] where, whenever the noisy filter-energy is below the noise level, it is fixed to 50 dB below the highest observed energies for that filter, (3) MMSE of the logarithm of DFT magnitude, following the method of Porter and Boll \[5\]; (4) our MMSE estimation of filter log-energies (Eq. (3)); (5) our improved algorithm (Eq. (10)); (6) train on the TI recorded database with additive white gaussian noise at 10 dB SNR, without any processing; (7) train in noise at 10 dB SNR with our improved estimation algonthm.</Paragraph>
    <Paragraph position="7"> Summarizing the results in fig. 2, the performance with the filter-bank MMSE is equivalent to that with IoglDFTI MMSE, both achieving error rate which is approximately twice that obtained when the training is performed under exactly the same noise conditions as the testing. The improved filter-bank estimator reduces the error rate to only 50% above the training in noise. Finally, the estimation improves the performance even when the training and testing are done under exactly the same conditions.</Paragraph>
    <Paragraph position="8"> Fig. 3 shows the error rate for the SRI recordings with the close-taRing and desktop microphones. The average SNR, computed on the waveforms, was 31 dB for the close-talking and 13 dB for the desktop. The  speaker-average SNRs in individual filters, averaged over all filters, were 32 and 23 dB, respectively. Error rate is given with no processing, with our best estimation algonthm, and with both estimation and equalization. With both estimation and equalization, the error rate with the desktop microphone is only 40% higher than that with the close-talking microphone.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML