<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1034"> <Title>Towards Environment-Independent Spoken Language Systems</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> Towards Environment-Independent </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="157" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In this paper we discuss recent results from our efforts to make SPHINX, the CMU continuous-speech speaker-independent recognition system, robust to changes in the environment. To deal with differences in noise level and spectral tilt between close-talking and desk-top microphones, we describe two novel methods based on additive corrections in the cepstral domain. In the first algorithm, an additive correction is imposed that depends on the instantaneous SNR of the signal. In the second technique, EM techniques are used to best match the cepstral vectors of the input utterances to the ensemble of codebook entries representing a standard acoustical ambience. Use of these algorithms dramatically improves recognition accuracy when the system is tested on a microphone other than the one on which it was trained.</Paragraph> <Paragraph position="1"> Introduction There are many sources of acoustical distortion that can degrade the accuracy of speech-recognition systems. For example, obstacles to robustness include additive noise from machinery, competing talkers, etc., reverberation from surface reflections in a room, and spectral shaping by microphones and the vocal tracts of individual speakers.</Paragraph> <Paragraph position="2"> These sources of distortion cluster into two complementary classes: additive noise (as in the first two examples) and distortions resulting from the convolution of the speech signal with an unknown linear system (as in the remaining three).</Paragraph> <Paragraph position="3"> A number of algorithms for speech enhancement have been proposed in the literature. For example, Boll [3] and Berouti et al. [2] introduced the spectral subtraction of DFT coefficients, and Porter and Boll [11] used MMSE techniques to estimate the DFT coefficients of corrupted speech. Spectral equalization to compensate for convolved distortions was introduced by Stockham et al. [13]. Recent applications of spectral subtraction and spectral equalization for speech recognition systems include the work of Van Compernolle [5] and Stern and Acero [12]. Although relatively successful, the above methods all depend on the assumption of independence of the spectral estimates across frequencies. Erell and Weintraub [6] demonstrated improved performance with an MMSE estimator in which correlation among frequencies is modeled explicitly.</Paragraph> <Paragraph position="4"> Acero and Stern [1] proposed an approach to environment normalization in the cepstral domain, going beyond the noise stripping problem.</Paragraph> <Paragraph position="5"> In this paper we present two algorithms for speech normalization based on additive corrections in the cepstral domain and compare them to techniques that operate in the frequency domain. We have chosen the cepstral domain rather than the frequency domain so that we can work directly with the parameters that SPHINX uses, and because speech can be characterized with a smaller number of parameters in the cepstral domain than in the frequency domain. The first algorithm, SNR-dependent cepstral normalization (SDCN), is simple and effective, but it cannot be applied to new microphones without microphone-specific training. The second algorithm, codeword-dependent cepstral normalization (CDCN), uses the speech knowledge represented in a codebook to estimate the noise and spectral equalization necessary for the environmental normalization. We also describe an interpolated SDCN algorithm (ISDCN) which combines the simplicity of SDCN and the normalization capabilities of CDCN. These algorithms are evaluated with a number of microphones using an alphanumeric database in which utterances were recorded simultaneously with two different microphones.</Paragraph>
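As a rough illustration of why the cepstral representation is compact, the sketch below reduces one frame's log power spectrum to a handful of cepstral coefficients with a discrete cosine transform. It is a minimal sketch, not the SPHINX front end; the 32-band spectrum and the choice of 12 coefficients are assumptions made for the example.

```python
import numpy as np

def cepstrum_from_log_spectrum(log_power, num_ceps=12):
    """Project a log power spectrum onto a DCT-II basis and keep the first
    num_ceps coefficients, so a frame is summarized by a short cepstral
    vector instead of the full spectral vector."""
    n_bands = len(log_power)
    k = np.arange(num_ceps)[:, None]                       # cepstral index
    j = np.arange(n_bands)[None, :]                        # spectral band index
    basis = np.cos(np.pi * k * (2 * j + 1) / (2 * n_bands))
    return basis @ log_power

# Example: a 32-band log spectrum becomes a 12-dimensional cepstral vector.
frame = np.log(np.random.rand(32) + 1e-3)
print(cepstrum_from_log_spectrum(frame).shape)             # (12,)
```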
<Section position="1" start_page="157" end_page="157" type="sub_section"> <SectionTitle> Experimental Procedures </SectionTitle> <Paragraph position="0"> The alphanumeric database and system used for these experiments have been described previously [12] [1].</Paragraph> <Paragraph position="1"> Briefly, the database contains utterances that were recorded simultaneously in stereo using both the close-talking Sennheiser HMD224 microphone (CLSTK), a standard in previous DARPA evaluations, and a desk-top Crown PZM6FS microphone (CRPZM). The recordings with the CRPZM exhibit not only background noise but also key clicks from workstations, interference from other talkers, and reverberation. The task has a vocabulary of 104 words that are highly confusable. A simplified version of SPHINX with no grammar was used.</Paragraph> <Paragraph position="2"> Baseline recognition results obtained by training and testing SPHINX using this database are shown in the first two columns of Table 1. With no processing, training and testing using the CRPZM degrades recognition accuracy by about 60 percent relative to that obtained by training and testing on the CLSTK. Although most of the &quot;new&quot; errors introduced by the CRPZM were confusions of silence or noise segments with weak phonetic events, a significant percentage was also due to crosstalk [12]. It can also be seen that the &quot;cross conditions&quot; (training on one microphone and testing using the other) produce a very large degradation in recognition accuracy.</Paragraph> </Section> <Section position="2" start_page="157" end_page="157" type="sub_section"> <SectionTitle> Independent Compensation for Noise and Filtering </SectionTitle> <Paragraph position="0"> In this section we examine the performance of SPHINX under some of the techniques that have been used in the literature to combat noise and spectral tilt: multi-style training, short-time liftering, spectral subtraction, and spectral equalization.</Paragraph> </Section> <Section position="3" start_page="157" end_page="157" type="sub_section"> <SectionTitle> Multi-Style Training </SectionTitle> <Paragraph position="0"> Multi-style training is a technique in which the training set includes data representing different conditions so that the resulting HMM models are more robust to this variability. This simple approach has been used successfully in the field of speech styles [10] and speaker independence [9].
The price one must pay for the robustness is a degradation in performance for cases in which the training and testing are done with the same condition.</Paragraph> <Paragraph position="1"> An experiment was carried out in which all the speech recorded from the CLSTK and the CRPZM microphones was used in training (Table 1). As expected, robustness is gained by using multi-style training, but at the expense of sacrificing performance with respect to the case of training and testing on the same conditions.</Paragraph> <Paragraph position="2"> [Table 1: recognition accuracy of SPHINX under different training and testing conditions (TRAIN: CLSTK, CRPZM, MULTI). CLSTK is the Sennheiser HMD224, CRPZM is the Crown PZM6FS, and MULTI means that the data from both microphones were used in training.]</Paragraph> </Section> <Section position="4" start_page="157" end_page="158" type="sub_section"> <SectionTitle> Liftering </SectionTitle> <Paragraph position="0"> Many studies have examined several potential distortion measures for speech recognition in noise. Most of these measures involve unequal weightings of the mean-square distance between cepstral coefficients of the reference and test utterances. The motivation for weighting distances between cepstral vectors is twofold: it provides some variance normalization for the coefficients, and it makes the system more robust to noise and spectral tilt by giving less weight to the low-order cepstral coefficients. We tried in our system several weighting measures that have been proposed in the literature, including the inverse of the intra-cluster variance as defined by Tohkura [14], the exponential lifter with s = 1.0, which Junqua [8] found to be optimum, and the bandpass liftering method defined by Juang [7].</Paragraph> <Paragraph position="1"> Unfortunately, we found that application of these techniques produced essentially no improvement for clean speech and only a very small improvement for corrupted speech. Since the frequency-warping transformation in SPHINX alters the variances of the coefficients, some other set of weights may prove more effective.</Paragraph>
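For concreteness, the sketch below builds two kinds of cepstral weights and uses them in a weighted cepstral distance. The raised-sine form is the bandpass lifter of Juang [7]; the power-law weighting k**s is only a stand-in for the exponential lifter, whose exact formula is not reproduced here, and p = 12 is an assumption.

```python
import numpy as np

def bandpass_lifter(p=12, L=12):
    """Raised-sine (bandpass) lifter: w_k = 1 + (L/2) sin(pi*k/L), k = 1..p."""
    k = np.arange(1, p + 1)
    return 1.0 + (L / 2.0) * np.sin(np.pi * k / L)

def power_lifter(p=12, s=1.0):
    """Power-law weighting w_k = k**s; a simple stand-in for the exponential
    lifter with s = 1.0 (the exact exponential-lifter formula is assumed)."""
    k = np.arange(1, p + 1)
    return k.astype(float) ** s

def weighted_cepstral_distance(c_ref, c_test, weights):
    """Weighted squared Euclidean distance between two cepstral vectors,
    excluding c0 as is customary for liftered distances."""
    d = weights * (np.asarray(c_ref[1:]) - np.asarray(c_test[1:]))
    return float(d @ d)

# Example with 13 coefficients (c0..c12): weight the 12 higher-order terms.
c_a, c_b = np.random.randn(13), np.random.randn(13)
print(weighted_cepstral_distance(c_a, c_b, bandpass_lifter(p=12)))
```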
</Section> <Section position="5" start_page="157" end_page="158" type="sub_section"> <SectionTitle> Spectral Subtraction and Equalization </SectionTitle> <Paragraph position="0"> In spectral subtraction and equalization it is assumed that the speech signal x(t) is degraded by linear filtering and/or uncorrelated additive noise, as depicted in Fig. 1. The goal of the compensation is to reverse the effects of these degradations.</Paragraph> <Paragraph position="1"> Using the notation of Fig. 1, we can characterize the power spectral density (PSD) of the processes involved as Py(f) = Px(f) |H(f)|^2 + Pn(f) (1) Spectral equalization techniques attempt to compensate for the filter h(t), while spectral subtraction techniques attempt to remove the effects of the noise from the signal.</Paragraph> <Paragraph position="2"> We compare the performance of the following different implementations of spectral subtraction and equalization techniques in Table 2.</Paragraph> <Paragraph position="3"> * A spectral equalization algorithm (EQUAL) that is similar to the approach of [13]. It compensates for the effects of the linear filtering, but not the additive noise, as described in [12].</Paragraph> <Paragraph position="4"> * A direct implementation of the original power spectral subtraction rule (PSUB) on 32 frequency bands obtained via a real DFT of the cepstrum vector. The restored cepstrum is obtained with an inverse DFT.</Paragraph> <Paragraph position="5"> * An implementation of Boll's algorithm (MMSE1) [4], in which a transformation is applied to all the frequency bands of the CRPZM speech that minimizes the mean squared error relative to the CLSTK speech. The log-power correction in each frequency band depended only on the instantaneous SNR in that band.</Paragraph> <Paragraph position="6"> * An implementation of magnitude spectral subtraction (MSUB) described in [12] that incorporates over- and under-subtraction depending on the SNR, as suggested by [2]. In [12] it was noted that a cascade of the EQUAL and MSUB algorithms did not yield any further improvement in recognition accuracy because they interact nonlinearly.</Paragraph> <Paragraph position="7"> The different criteria used in PSUB, MSUB, and MMSE1 produce different curves that relate the effective SNR of the input and output. Some of these curves are shown in Figure 2.</Paragraph> <Paragraph position="8"> [Figure 2: input-output transformation curves for the MSUB, MMSE1, and PSUB algorithms, plotted against the instantaneous SNR, defined as the log-power of the signal in a frequency band minus the log-power of the noise in that band.] The transformation for MSUB is not a single curve but a family of curves that depend on the total SNR for a given frame.</Paragraph> <Paragraph position="9"> [Table 2: results for the spectral equalization and spectral subtraction algorithms. EQUAL and MMSE1 were applied only to the CRPZM speech, while PSUB and MSUB were applied to both the CLSTK and the CRPZM speech.]</Paragraph> <Paragraph position="10"> For the most part these algorithms provide increasing degrees of compensation, but their recognition accuracy under the &quot;cross&quot; conditions is still much worse than that obtained even when the system is trained and tested on the CRPZM. We have found that the above techniques produce many output frames that do not constitute legitimate speech vectors, especially at low SNR, because they do not take into account correlations across frequency. That problem, along with the nonlinear interaction of the subtraction and normalization processes, motivated us to consider new algorithms that jointly compensate for noise and filtering, with some attention paid to the spectral profile of the compensated speech.</Paragraph> </Section> </Section> <Section position="3" start_page="157" end_page="158" type="metho"> <SectionTitle> Joint Compensation for Noise and Filtering </SectionTitle> <Paragraph position="0"> In this section we discuss two algorithms that perform noise suppression and spectral-tilt compensation jointly in the cepstrum by means of additive corrections.</Paragraph> <Paragraph position="1"> If we let the cepstral vectors x, n, y and q represent the Fourier series expansions of ln Px(f), ln Pn(f), ln Py(f) and ln |H(f)|^2 respectively, (1) can be rewritten as y = x + q + r(x, n, q) (2) where the correction vector r(x, n, q) is given by r(x, n, q) = IDFT{ ln(1 + exp(DFT(n - x - q))) } (3)</Paragraph> <Paragraph position="2"> Let z be an estimate of y obtained through our spectral estimation algorithm. Our goal is to recover the uncorrupted vectors X = x0, ..., xN-1 of an utterance given the observations Z = z0, ..., zN-1 and our knowledge of the environment n and q.</Paragraph>
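The following minimal sketch evaluates the relation in equations (1)-(3) in the log-spectral domain, where the correction term has the closed form log(1 + exp(n - x - q)) in each band; the cepstral version in the text is the same relation carried through the DFT pair. The 32-band vectors and the numerical values are illustrative assumptions, not data from the paper.

```python
import numpy as np

def degraded_log_spectrum(x_ls, q_ls, n_ls):
    """Log-spectral form of equations (1)-(3): since Py = Px*|H|^2 + Pn,
    ln Py = ln Px + ln|H|^2 + ln(1 + exp(ln Pn - ln Px - ln|H|^2)),
    i.e. y = x + q + r(x, n, q) with r = log(1 + exp(n - x - q)) per band."""
    r = np.log1p(np.exp(n_ls - x_ls - q_ls))   # correction term in each band
    return x_ls + q_ls + r

# The two asymptotic regimes discussed later in the text (hypothetical numbers):
x = np.full(32, 10.0)                          # clean speech log-power per band
q = np.full(32, 1.0)                           # filter log-magnitude response
n_quiet, n_loud = np.full(32, -6.0), np.full(32, 25.0)
print(np.allclose(degraded_log_spectrum(x, q, n_quiet), x + q, atol=1e-3))   # high SNR: y ~ x + q
print(np.allclose(degraded_log_spectrum(x, q, n_loud), n_loud, atol=1e-3))   # low SNR:  y ~ n
```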
</Section> <Section position="4" start_page="158" end_page="159" type="metho"> <SectionTitle> SDCN Algorithm </SectionTitle> <Paragraph position="0"> SNR-Dependent Cepstral Normalization (SDCN) is a simple algorithm that applies an additive correction vector w to the cepstral coefficients that depends exclusively on the instantaneous SNR of the input frame: x̂ = z - w(SNR) (4) At high SNR, inspection of equations (1), (2) and (3) indicates that x(0) + q(0) >> n(0), r = 0, and y = x + q. On the other hand, at low SNR, x(0) + q(0) << n(0) and y = n. Hence, the SDCN algorithm performs spectral equalization at high SNR and noise suppression at low SNR.</Paragraph> <Paragraph position="1"> SNR is estimated in the SDCN algorithm as z(0) - n(0). This is not the true signal-to-noise ratio, but it is related to it and easier to compute. The compensation vectors w(SNR) were estimated with an MMSE criterion by computing the average difference between cepstral vectors for the test condition versus a standard acoustical environment from simultaneous stereo recordings. We have observed that applying a correction to just c0 and c1 yields essentially the same results as normalizing all the cepstrum coefficients.</Paragraph> <Paragraph position="2"> For the sake of comparison between algorithms operating in the spectral domain and the cepstral domain, we developed an algorithm called MMSEN that accomplishes noise suppression and spectral equalization jointly using different transformations for every frequency band. MMSEN is similar in concept to SDCN except that it operates in the spectral (rather than cepstral) domain. As is seen in Table 4, SDCN performs slightly better than MMSEN, and it is more computationally efficient as well.</Paragraph> <Paragraph position="3"> [Table: performance of the SDCN and MMSEN algorithms when compared with the baseline.]</Paragraph> <Paragraph position="4"> Although liftering provided very little improvement for our baseline system, this technique is actually complementary to SDCN: liftering techniques can be viewed as a variance normalization, while SDCN is a bias-compensation algorithm. Using SDCN and the algorithm of Juang [7] with p = 12 and values of the parameter L ranging from 0 to 6, we observed a modest improvement over pure SDCN (from 67.2% to 72.3%) when training using the CLSTK and testing with the CRPZM microphone.</Paragraph>
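A minimal sketch of SDCN along the lines described above: the correction table w(SNR) is learned as the average cepstral difference between simultaneously recorded frames, binned by the SNR proxy z(0) - n(0), and equation (4) is then applied frame by frame. The binning (1 dB bins, 20 bins) and array shapes are assumptions; this is not the original implementation.

```python
import numpy as np

def train_sdcn(clstk_frames, crpzm_frames, noise_c0, n_bins=20, bin_width=1.0):
    """Learn w(SNR): for each SNR bin, average the difference between the
    noisy-microphone cepstra (z) and the close-talking cepstra (x) over the
    stereo training data.  clstk_frames/crpzm_frames: (T, dim) arrays."""
    dim = clstk_frames.shape[1]
    sums, counts = np.zeros((n_bins, dim)), np.zeros(n_bins)
    snr = crpzm_frames[:, 0] - noise_c0                    # SNR proxy z(0) - n(0)
    bins = np.clip((snr / bin_width).astype(int), 0, n_bins - 1)
    for b, z, x in zip(bins, crpzm_frames, clstk_frames):
        sums[b] += z - x
        counts[b] += 1
    return sums / np.maximum(counts, 1)[:, None]           # w, one vector per bin

def apply_sdcn(z_frames, w_table, noise_c0, bin_width=1.0):
    """Equation (4): x_hat = z - w(SNR), with SNR estimated as z(0) - n(0)."""
    snr = z_frames[:, 0] - noise_c0
    bins = np.clip((snr / bin_width).astype(int), 0, len(w_table) - 1)
    return z_frames - w_table[bins]
```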
</Section> <Section position="5" start_page="159" end_page="161" type="metho"> <SectionTitle> CDCN Algorithm </SectionTitle> <Paragraph position="0"> Although the SDCN technique performs acceptably, it has the disadvantage that new microphones must be &quot;calibrated&quot; by collecting long-term statistics from a new stereo database. Since only long-term averages are used, SDCN is clearly not able to model a non-stationary environment. The second new algorithm, Codeword-Dependent Cepstral Normalization (CDCN), was proposed to circumvent these problems.</Paragraph> <Paragraph position="1"> The CDCN algorithm attempts to determine the fixed equalization and noise vectors q and n that provide an ensemble of compensated cepstral vectors x̂ that are collectively closest to the set of locations of legitimate VQ codewords. The correction vector will be different for every codebook vector. The q and n are estimated using ML techniques via the EM algorithm, since no closed-form expression can be obtained. The compensated vectors x̂ are estimated using MMSE techniques. The reader is referred to [1] for the details of this algorithm.</Paragraph> <Section position="1" start_page="159" end_page="159" type="sub_section"> <SectionTitle> Results and Discussion </SectionTitle> <Paragraph position="0"> Table 4 describes the recognition accuracy of the original SPHINX system with no preprocessing, and with the SDCN and CDCN algorithms. Use of the CDCN algorithm brings the performance obtained when training on the CLSTK and testing on the CRPZM to the level observed when the system is trained and tested on the CRPZM. Moreover, use of CDCN improves performance obtained when training and testing on the CRPZM to a level greater than the baseline performance. The much simpler SDCN algorithm also provides considerable improvement in performance when the system is trained and tested on two different microphones.</Paragraph> <Paragraph position="1"> Unlike in previous studies, where estimates of the power normalization factor, spectral equalization function, and noise are obtained independently, these quantities are jointly estimated in CDCN using a common maximum likelihood framework that is based on a priori knowledge of the speech signal. Since CDCN only requires a single utterance in order to estimate noise and spectral tilt, it can better capture the non-stationarity of the environment. Moreover, in a real application, long-term averages may not be available for every speaker and new microphone.</Paragraph> <Paragraph position="2"> In Figures 3, 4, 5 and 6 we show 3-D representations of an utterance with the CLSTK and no processing, the CRPZM with no processing, SDCN, and CDCN, respectively. While it can be seen that noise suppression is achieved with both SDCN and CDCN, the CDCN algorithm provides greater compensation for spectral tilt.</Paragraph> </Section> <Section position="2" start_page="159" end_page="161" type="sub_section"> <SectionTitle> Results with other microphones </SectionTitle> <Paragraph position="0"> To confirm the ability of the CDCN algorithm to adapt to new environmental conditions, a series of tests was performed with 5 new stereo speech databases. The test data were all collected after development of the CDCN algorithm was completed. In all cases the system was trained using the Sennheiser HMD224. The &quot;second&quot; microphones (with which the system was not trained) were the Sennheiser 518, the Sennheiser ME80, the Crown PZM6FS, the Crown PCC160, and the HME microphone.</Paragraph> <Paragraph position="1"> [Table: recognition accuracy with the baseline system and with the CDCN algorithm. Two microphones were recorded in stereo in each case; training was done with the Sennheiser HMD224 in all cases.]</Paragraph> <Paragraph position="2"> Recognition accuracy on these second microphones remained somewhat below that obtained with the Sennheiser HMD224. We believe that one cause for this is that estimates of q and n are not very good for short utterances.</Paragraph> </Section> </Section> <Section position="6" start_page="159" end_page="161" type="metho"> <SectionTitle> Interpolated SDCN </SectionTitle> <Paragraph position="0"> One of the deficiencies of the SDCN algorithm is its inability to adapt to new environments, since the correction vectors are derived from a stereo database of our &quot;standard&quot; Sennheiser HMD224 and the new microphone. By using an MMSE criterion that includes some a priori information about the distribution of speech (a codebook), SDCN can estimate the parameters of the environment q and n just as CDCN does.</Paragraph> <Paragraph position="1"> As we have noted above, the correction vector in SDCN, w, has the asymptotic value of the noise vector n at low SNR and of the equalization vector q at high SNR. In interpolated SDCN (ISDCN) the dependence on SNR is modelled as w_i(SNR) = n_i + (q_i - n_i) f_i(SNR) (5) where f_i is a smooth interpolation function, parameterized by α_i and β_i, that rises from 0 at low SNR to 1 at high SNR. In this evaluation α_i and β_i were set empirically to 3.0 for i > 0 and 6.0 for i = 0. The vectors n and q were determined by an EM algorithm whose objective function is the minimization of the total VQ distortion.</Paragraph>
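A minimal sketch of the ISDCN idea under stated assumptions: the interpolation function f_i is taken to be a logistic in the SNR (the exact form used in the evaluation is not reproduced here), and the EM-style estimation of n and q is reduced to a simple alternating procedure that lowers the total VQ distortion over a small codebook. The codebook, array shapes, and iteration count are illustrative assumptions, not the original algorithm.

```python
import numpy as np

def interp_weight(snr, alpha=3.0, beta=3.0):
    """Assumed logistic interpolation f(SNR): near 0 at low SNR, near 1 at high SNR."""
    return 1.0 / (1.0 + np.exp(-alpha * (snr - beta)))

def isdcn_correction(snr, n_vec, q_vec, alpha=3.0, beta=3.0):
    """Equation (5): w(SNR) = n + (q - n) * f(SNR)."""
    f = interp_weight(snr, alpha, beta)
    return n_vec + (q_vec - n_vec) * f

def estimate_n_q(z_frames, codebook, alpha=3.0, beta=3.0, iters=10):
    """Alternating (EM-style) estimation of n and q: compensate the frames with
    the current (n, q), assign each frame to its nearest codeword, and update
    n from the low-SNR frames and q from the high-SNR frames so the corrected
    frames move toward the codewords.  A crude stand-in for the paper's EM
    procedure, not the original implementation."""
    dim = z_frames.shape[1]
    n_vec, q_vec = np.zeros(dim), np.zeros(dim)
    for _ in range(iters):
        snr = z_frames[:, 0] - n_vec[0]
        w = np.array([isdcn_correction(s, n_vec, q_vec, alpha, beta) for s in snr])
        x_hat = z_frames - w
        # nearest codeword for every compensated frame
        dist = ((x_hat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        targets = codebook[dist.argmin(1)]
        residual = z_frames - targets                  # what w "should" have been
        f = interp_weight(snr, alpha, beta)
        low, high = f < 0.5, f >= 0.5
        if low.any():
            n_vec = residual[low].mean(0)              # low-SNR frames constrain n
        if high.any():
            q_vec = residual[high].mean(0)             # high-SNR frames constrain q
    return n_vec, q_vec
```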
<Paragraph position="2"> In evaluating the ISDCN algorithm we also varied the amount of speech used for estimation of q and n. Since these parameters are normally estimated over the course of only a single utterance, the estimates of q and n will exhibit a large variance for short utterances. We believe this is one of the causes for the slight degradation in performance in Table 5 observed when the system was trained and tested using the CLSTK microphone.</Paragraph> <Paragraph position="3"> We compared the recognition accuracy with the ISDCN algorithm using estimates of the model parameters obtained by considering only one utterance at a time, and with estimates obtained using all 14 utterances spoken by a given speaker. Estimating the model parameters from all utterances for a speaker produced an accuracy of 85.9%, which is slightly higher than the baseline 85.3%. (The corresponding recognition accuracy working an utterance at a time was 84.8%.) These results lead us to believe that CDCN could also benefit from a longer estimation time; this will be analyzed in future work.</Paragraph> </Section> <Section position="7" start_page="161" end_page="161" type="metho"> <SectionTitle> Conclusions </SectionTitle> <Paragraph position="0"> We described and evaluated two algorithms to make SPHINX more robust with respect to changes of microphone and acoustical environment. With the first algorithm, SNR-dependent cepstral normalization, a correction vector is added that depends exclusively on the instantaneous SNR of the input. While SDCN is very simple, it provides a considerable improvement in performance when the system is trained and tested on different microphones, while maintaining the same performance for the case of training and testing on the same microphone. Two drawbacks of the method are that the system must be retrained using a stereo database for each new microphone considered, and that the normalization is based on long-term statistical models.</Paragraph> <Paragraph position="1"> The second algorithm, codeword-dependent cepstral normalization, uses a maximum likelihood technique to estimate noise and spectral tilt in the context of an iterative algorithm similar to the EM algorithm. With CDCN, the system can adapt to new speakers, microphones, and environments without the need for collecting statistics about them a priori. By not relying on long-term a priori information, the CDCN algorithm can dynamically adapt to changes in the acoustical environment as well.</Paragraph> <Paragraph position="2"> Both algorithms provided dramatic improvement in performance when SPHINX is trained on one microphone and tested on another, without degrading recognition accuracy obtained when the same microphone was used for training and testing.</Paragraph> </Section> </Paper>