<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2044"> <Title>ACOUSTICAL PRE-PROCESSING FOR ROBUST SPEECH RECOGNITION</Title> <Section position="4" start_page="311" end_page="311" type="metho"> <SectionTitle> THE ALPHANUMERIC DATABASE </SectionTitle> <Paragraph position="0"> Although the bulk of research using the Sphinx system at Carnegie Mellon has made use of the well-known Resource Management database, we were forced to use a different database, the Alphanumeric database, for our evaluations of signal processing. The primary reason for this is that the Resource Management database with its large vocabulary size and many utterances required several weeks to train satisfactorily, which was excessively long since the entire system had to be retrained each time a new signal-processing algorithm was introduced. We also performed these evaluations using a more compact and easily-trained version of Sphinx with only about 650 phonetic models, omitting such features as function-word models, between-word triphone models, and corrective training. We were willing to tolerate the somewhat lower absolute recognition accuracy that this version of Sphinx provided because of the reduced time required by the training process. Using the Alphanumeric database, the more compact Sphinx system, and faster computers, we were able to reduce the training time to the point that an entire train-and-test cycle could be performed in about 9 hours.</Paragraph> <Paragraph position="1"> A second reason why we resorted to a new database is that we specifically wanted to compare simultaneous recordings from close-talking and desktop microphones in our evaluations. We believe that it is very important to evaluate speech-recognition systems in the context of natural acoustical environments with natural noise sources, rather than using speech that is recorded in a quiet environment into which additive noise and spectral tilt are artificially injected.</Paragraph> </Section> <Section position="5" start_page="311" end_page="312" type="metho"> <SectionTitle> CONTENTS OF THE DATABASE </SectionTitle> <Paragraph position="0"> The Alphanumeric database consists of 1000 training utterances and 140 different testing utterances, that were each recorded simultaneously in stereo using both the Sennheiser HMD224 close-talking microphone that has been a standard in previous DARPA evaluations, and a desk-top Crown PZM6fs microphone. The recordings were made in one of the CMU speech laboratories (the &quot;Agora&quot; lab), which has high ceilings, concrete-block walls, and a carpeted floor. Although the recordings were made behind an acoustic partition, no attempt was made to silence other users of the room during recording sessions, and there is consequently a significant amount of audible interference from other talkers, key clicks from other workstations, slamming doors, and other sources of interference, as well as the reverberation from the room itself. Since the database was limited in size, it was necessary to perform repeated evaluations on the same test utterances.</Paragraph> <Paragraph position="1"> The database consisted of strings of letters, numbers, and a few control words, that were naturally elicited in the context of a task in which speakers spelled their names, addresses, and other personal information, and entered some random letter and digit strings. Some sample utterances are N-S-V-H-6-T-49, ENTER-4-5-8-2-1 and P-I-T-T-S-B-U-R-G-H. 
A total of 106 vocabulary items appeared in the vocabulary, of which about 40 were rarely uttered. Although it contains fewer vocabulary items, the Alphanumeric database is more difficult than the Resource Management database with its perplexity-60 grammar, both because of the greater number of word choices at any point and because of their greater intrinsic acoustic confusability.</Paragraph> </Section> <Section position="6" start_page="312" end_page="313" type="metho"> <SectionTitle> AVERAGE SPEECH AND NOISE SPECTRA </SectionTitle> <Paragraph position="0"> Figure 1 compares averaged spectra from the Alphanumeric database for frames believed to contain speech and background noise from each of the two microphones. By comparing these curves, it can be seen that the average signal-to-noise ratio (SNR) using the close-talking Sennheiser microphone is about 25 dB. The signals from the Crown PZM, on the other hand, exhibit an SNR of less than 10 dB for frequencies below 1500 Hz and about 15 dB for frequencies above 2000 Hz. Furthermore, the response of the Crown PZM exhibits a greater spectral tilt than that of the Sennheiser, perhaps because the noise-cancelling transducer on the Sennheiser also suppresses much of the low-frequency content of the speech signal.</Paragraph> <Paragraph position="1"> [Figure 1 caption, fragment: &quot;... the Sennheiser microphone and the Crown PZM microphone. The separation of the two curves in each panel provides an indication of the signal-to-noise ratio for each microphone. It can also be seen that the Crown PZM produces greater spectral tilt.&quot;]</Paragraph> </Section> <Section position="7" start_page="313" end_page="313" type="metho"> <SectionTitle> BASELINE RECOGNITION ACCURACY </SectionTitle> <Paragraph position="0"> We first consider the &quot;baseline&quot; recognition accuracy of the Sphinx system obtained using the two microphones with the standard signal processing routines. Table I summarizes the recognition accuracy obtained by training and testing using each of the two microphones. Recognition accuracy is reported using the standard DARPA scoring procedure (Pallett, 1989), with penalties for insertions and deletions as well as for substitutions. It can be seen that training and testing on the Crown PZM produces an error rate that is 60% worse than the error rate produced when the system is trained and tested on the Sennheiser microphone. When the system is trained using one microphone and tested using the other, however, the performance degrades to a very low level. Hence we can identify two goals of signal processing for greater robustness: we need to drastically improve the performance of the system for the &quot;cross conditions&quot;, and to elevate the absolute performance of the system when it is trained and tested using each of the two microphones.</Paragraph> <Paragraph position="1"> In order to better understand why performance degraded when the microphone was changed from the Sennheiser to the Crown PZM, even when the PZM was used for training as well as testing, we studied the spectrograms and listened carefully to all utterances for which training and testing with the PZM produced errors that did not appear when the system was trained and tested on the close-talking Sennheiser microphone. The estimated causes of the &quot;new&quot; errors using the Crown PZM are summarized in Table II. Not too surprisingly, the major consequence of using the PZM was that the effective SNR was lowered. As a result, there were many confusions of silence or noise segments with weak phonetic events. 
These confusions accounted for some 58 percent of the additional errors, with crosstalk (either by competing speakers or key clicks from other workstations) identified as the most significant other cause of new errors.</Paragraph> <Paragraph position="2"> We now consider the extent to which the use of acoustical pre-processing can mitigate the effects of the Crown PZM and of the change in microphone.</Paragraph> </Section> <Section position="8" start_page="313" end_page="313" type="metho"> <SectionTitle> ACOUSTICAL PRE-PROCESSING FOR SPEECH RECOGNITION </SectionTitle> <Paragraph position="0"> In this section we briefly review the baseline signal procedures used in the Sphinx system, and we describe the spectral normalization and spectral subtraction operations in the cepstral domain.</Paragraph> </Section> <Section position="9" start_page="313" end_page="314" type="metho"> <SectionTitle> GENERAL SIGNAL PROCESSING </SectionTitle> <Paragraph position="0"> The first stages of signal processing in the evaluation system are virtually identical to those that have been reported for the Sphinx system previously. Briefly, speech is digitized with a sampling rate of 16 kHz and pre-emphasized, and a Hamming window is applied to produce analysis frames of 20-ms duration every 10 ms. 14 LPC coefficients are produced for each frame using the autocorrelation method, from which 32 cepstral coefficients are obtained using the standard recursion method. Finally, these cepstral coefficients are frequency warped to a pseudo-mel scale using the bilinear-transform method with 12 stages, producing a final 12 cepstral coefficients after the frequency warping. (We found that increasing the number of cepstral coefficients before the warping from 12 to 32 provided better frequency resolution after frequency warping, which led to a 5-percent relative improvement of the baseline Sphinx system on the Resource Management task.) In addition to the LPC cepstral coefficients, differenced LPC cepstral coefficients, power and differenced power are also computed for every frame. The cepstra, differenced cepstra, and combined power and differenced power parameters are vector quantized into three different codebooks.</Paragraph> </Section> <Section position="10" start_page="314" end_page="315" type="metho"> <SectionTitle> PROCESSING FOR ROBUSTNESS IN THE CEPSTRAL DOMAIN </SectionTitle> <Paragraph position="0"> We describe in this section the procedures we used to achieve spectral normalization and spectral subtraction in the cepstral domain. Because signal processing and feature extraction in the Sphinx system was already based on cepstral analysis, these procedures could be implemented with an almost negligible increase in computational load beyond that of the existing signal processing procedures.</Paragraph> <Section position="1" start_page="314" end_page="314" type="sub_section"> <SectionTitle> Spectral Normalization </SectionTitle> <Paragraph position="0"> The goal of spectral normalization is to compensate for distortions to the speech signal produced by linear convolution, which could be the result of filtering by the vocal tract, room acoustics, or the transfer function of a particular microphone. As noted above, compensation for linear convolution could be accomplished by multiplying the magnitude of the spectrum by a correction factor. Since the cepstrum is the log of the magnitude of the spectrum, this corresponds to a simple additive correction of the cepstrum vector. 
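As a rough illustration of this idea, the following numpy sketch (ours, not the original Sphinx code; the array layout and function names are assumptions) shows the compensation as a fixed vector added to every cepstral frame, which is equivalent to multiplying every magnitude spectrum by a fixed correction function. The particular estimate shown, a difference of long-term average cepstra, anticipates the static method described in the next paragraph.

```python
import numpy as np

def estimate_compensation(ceps_close, ceps_desktop):
    """Static compensation vector: the difference between the long-term average
    cepstra of speech frames recorded with the close-talking and desktop
    microphones.  Each argument is a hypothetical (num_frames, num_coeffs)
    array of per-frame cepstral vectors for speech frames only."""
    return ceps_close.mean(axis=0) - ceps_desktop.mean(axis=0)

def normalize(ceps_frames, compensation):
    """Additive correction in the cepstral domain.  Because the cepstrum is a
    transform of the log spectral magnitude, adding this vector corresponds to
    multiplying each frame's magnitude spectrum by a fixed correction factor."""
    return ceps_frames + compensation
```
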
The major differences among the various spectral normalization algorithms concern how the additive compensation vector is estimated.</Paragraph> <Paragraph position="1"> The most effective form of spectral normalization that we have considered so far is also the simplest. Specifically, a static reference vector is estimated by computing the inverse DFT of the long-term average of the cepstral vector for the speech frames from the training database. (Samples of these averages for the alphanumeric database are shown in Fig. 1.) The compensation vector is defined to be the difference between the two sets of averaged cepstral coefficients from the two types of microphones in the training database. Although the compensation vector is determined only from averages of spectra in the speech frames, it is applied to both the speech and nonspeech frames.</Paragraph> <Paragraph position="2"> We have also considered other types of spectral normalization in the cepstral domain, including one that determines the compensation vector that minimizes the average VQ distortion. While none of these methods works any better in isolation than the simple static spectral normalization described above, some of them have exhibited better performance than the static normalization when used in conjunction with spectral subtraction.</Paragraph> </Section> <Section position="2" start_page="314" end_page="315" type="sub_section"> <SectionTitle> Spectral Subtraction </SectionTitle> <Paragraph position="0"> Spectral subtraction is more complex than spectral normalization, both because it cannot be applied to the cepstral coefficients directly, and because there are more free parameters and arbitrary decisions that must be resolved in determining the best procedure for a particular system.</Paragraph> <Paragraph position="1"> Spectral subtraction in the Sphinx system is accomplished by converting the feature vectors from cepstral coefficients to log-magnitude coefficients using a 32-point inverse DFT (for the 16 real and even cepstral coefficients). These log-magnitude vectors are then exponentiated to produce direct spectral magnitudes, from which a reference vector is subtracted according to the general procedure described below. The log of the resulting difference spectrum is then converted once again to a cepstral vector using a 32-point forward DFT. Although both an inverse and a forward DFT must be performed on the cepstral vectors in this algorithm, little time is consumed because only 16 real coefficients are involved in the DFT computations. In addition, a computationally efficient procedure similar to the one described by Van Compernolle (1987) can be applied to perform the exponentiation and logarithm operations using a single table lookup.</Paragraph> <Paragraph position="2"> The estimated noise spectrum is either over-subtracted or under-subtracted from the input spectrum, depending on the estimated instantaneous signal-to-noise ratio (of the current analysis frame). In our current implementation of spectral subtraction, the estimation of the noise vector and the determination of the amount of subtraction to be invoked are based on a comparison of the incoming signal energy to two thresholds, representing a putative maximum power level for noise frames (the &quot;noise threshold&quot;) and a putative minimum power level for speech frames (the &quot;speech threshold&quot;). 
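A minimal numpy sketch of this pipeline follows (ours, not the Sphinx implementation). It combines the cepstrum-to-spectrum round trip just described with the threshold-based over- and under-subtraction rule detailed in the continuation below; the 5 dB and 2.5 dB offsets and the averaged noise vector follow that description, while the use of frame power as a stand-in for the instantaneous signal-to-noise ratio, the magnitude floor, and all names are assumptions of ours.

```python
import numpy as np

N = 32  # DFT length for the cepstrum <-> log-magnitude conversion

def ceps_to_logmag(ceps):
    """Cepstrum -> log-magnitude spectrum via an inverse DFT.  The cepstral
    vector is treated as half of a real, even sequence, so the transform is
    real and only N/2 + 1 unique spectral values are kept."""
    half = np.zeros(N // 2 + 1)
    half[:len(ceps)] = ceps
    even = np.concatenate([half, half[-2:0:-1]])   # length-N even extension
    return np.real(np.fft.ifft(even))[:N // 2 + 1]

def logmag_to_ceps(logmag, n_coeffs):
    """Log-magnitude spectrum -> cepstrum via a forward DFT (inverse of the above)."""
    even = np.concatenate([logmag, logmag[-2:0:-1]])
    return np.real(np.fft.fft(even))[:n_coeffs]

def subtract(ceps_frame, frame_power_db, noise_ceps,
             noise_thresh_db, speech_thresh_db,
             over_db=5.0, under_db=-2.5, floor=1e-3):
    """Subtract the averaged noise spectrum from one frame in the magnitude domain.

    noise_ceps is the average cepstrum of frames whose power fell below the
    noise threshold.  Frames at or below the noise threshold are over-subtracted
    (noise + 5 dB), frames at or above the speech threshold are under-subtracted
    (noise - 2.5 dB), and the offset is interpolated linearly in between."""
    t = np.clip((speech_thresh_db - frame_power_db) /
                (speech_thresh_db - noise_thresh_db), 0.0, 1.0)
    offset_db = under_db + t * (over_db - under_db)

    signal_mag = np.exp(ceps_to_logmag(ceps_frame))
    noise_mag = np.exp(ceps_to_logmag(noise_ceps)) * 10.0 ** (offset_db / 20.0)
    # Magnitude (not power) subtraction, floored so the logarithm stays defined.
    diff_mag = np.maximum(signal_mag - noise_mag, floor)
    return logmag_to_ceps(np.log(diff_mag), n_coeffs=len(ceps_frame))
```

The exponentiation and logarithm are computed directly here; the single-table-lookup shortcut mentioned above is omitted for clarity.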
While these thresholds are presently set empirically, they could easily be estimated from histograms of the average power for the signals in the analysis frames. The estimated noise vector is obtained by averaging the cepstra of all frames with a power that falls below the noise threshold. Once the noise vector is estimated, a magnitude equal to that of the reference spectrum plus 5 dB is subtracted from the magnitude of the spectrum of the incoming signal, for all frames in which the power of the incoming signal falls below the noise threshold. If the power of the incoming signal is above the speech threshold, the magnitude of the reference spectrum minus 2.5 dB is subtracted from the magnitude of the spectrum of the incoming signal. The amount of over- or under-subtraction (in dB) is a linearly interpolated function of the instantaneous signal-to-noise ratio (in dB) for incoming signals whose power is between the two thresholds. We note that we subtract the magnitudes of spectra [as did Berouti et al. (1979)] rather than the more intuitively appealing spectral power because we found that magnitude subtraction provides greater recognition accuracy.</Paragraph> </Section> </Section> <Section position="13" start_page="315" end_page="315" type="metho"> <SectionTitle> INTEGRATION OF SPECTRAL SUBTRACTION AND NORMALIZATION </SectionTitle> <Paragraph position="0"> Since spectral subtraction and normalization each provide some improvement in recognition accuracy when applied individually, one would expect that further improvement should be obtained when they are used simultaneously.</Paragraph> <Paragraph position="1"> Indeed, in pilot experiments using the Resource Management database, training using the Sennheiser microphone and testing using the Crown PZM, we obtained a 15 percent relative reduction in error rate when spectral normalization was added to spectral subtraction (Morii, 1987). Nevertheless, we have found that the effects of the two enhancement procedures interact with each other, and simple cascades of the two implementations that work best in isolation do not produce great improvements in performance. We are confident that with better understanding of the nature of these interactions we can more fully exploit the complementary nature of the two types of processing.</Paragraph> </Section> <Section position="14" start_page="315" end_page="316" type="metho"> <SectionTitle> INTRODUCTION OF NON-PHONETIC MODELS </SectionTitle> <Paragraph position="0"> In these Proceedings, Ward (1989) describes a procedure by which the performance of the Sphinx system can be improved by explicitly developing phonetic models for such non-speech events as filled pauses, breath noises, door slams, telephone rings, paper rustling, etc. Most of these phenomena are highly transitory in nature, and as such are not directly addressed by either spectral subtraction or normalization. [Figure caption, fragment: &quot;... and spectral normalization, and each of the two microphones. The horizontal dotted lines indicate performance obtained in the baseline condition when the system is trained and tested using the same microphone.&quot;]</Paragraph> <Paragraph position="1"> While Ward was especially concerned with the non-phonetic events associated with spontaneous speech, there is no reason why these techniques cannot be applied to process speech recorded from desk-top microphones as well. 
Since it appears that about 20 percent of the &quot;new&quot; errors introduced when the Sennheiser microphone is replaced by the Crown PZM are the result of crosstalk, we are optimistic that implementation of Ward's non-phonetic models should provide further improvement in recognition accuracy.</Paragraph> </Section> <Section position="15" start_page="316" end_page="316" type="metho"> <SectionTitle> CONSIDERATION OF SPECTRAL CORRELATIONS ACROSS FREQUENCY </SectionTitle> <Paragraph position="0"> Traditional spectral subtraction techniques assume that all speech frames are statistically independent of each other, and that every frequency component within a frame is statistically independent of the other frequencies. As a result, it is quite possible that the result of a spectral subtraction operation may bear little resemblance to any legitimate speech spectrum, particularly at low SNRs. We are exploring several techniques to take advantage of information about correlations across frequency to ensure that the result of the spectral subtraction is likely to represent a legitimate speech spectrum.</Paragraph> </Section> </Paper>