<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2046">
  <Title>THE AUDITORY PROCESSING AND RECOGNITION OF SPEECH</Title>
  <Section position="4" start_page="0" end_page="325" type="metho">
    <SectionTitle>
NOISE ROBUSTNESS
</SectionTitle>
    <Paragraph position="0"> Cochlear models in various forms are now commonly used in speech recognition systems. In many cases, they are severely simplified to reduce computational complexity, preserving only salient features of the original models, e.g., the psuedo-logarithmic frequency axis, critical-band filters, and the fast and/or slow adaptation (as AGCs). These and other processing steps have been justified in many elaborate and detailed experiments. One of the most desired features of cochlear processing has been robustness to noise, specifically, their supposed ability to provide a stable representation of the speech signal over a wide range of signal-to-noise ratios. Results from a few studies have so far been equivocal for many reasons, primary among them is the complexity of the systems tested which precluded clear separation of the causes of improvements and degradations. We have compared the noise immunity of cochlear representations to that of linear predictor coefficients (LPC), LPC cepstral coefficients, and discrete time Fourier Transform (DFT) spectra.</Paragraph>
    <Paragraph position="1"> Specifically, three investigations are performed: first, the distortion of each representation due to additive white noise is measured; in the second experiment, the robustness is measured through the deterioration in the vector quantizer performance of each representation; and finally, in the third experiment we measure the ability of each representation to discriminate speech sounds in noise.</Paragraph>
    <Paragraph position="2"> Ninety sentences spoken by ten male speakers are taken from the phonetically labeled Icecream database and transformed into each of the four representations after upsampling from 16KHz to 20Kiiz (the cochlear model requires a 20KItz sampling rate). The cochlear model followed by two stages of lateral inhibition  \[1\] produces vectors of 128 tonotopically ordered elements from the 100ttz to 10KHz region of the basilar membrane. For all representations, a 20ms frame and 8ms step size are used, and, except for the cochlear model, a preemphasis of 1.0 and a Hamming window are applied.</Paragraph>
    <Paragraph position="3"> The Log Area Ratios are obtained from the LPC coefficients of an order 28 predictor found via the autocorrelation method. The Log Area Ratios are used because of their appropriateness for vector quantization and mean square distortion measurements. The LPC cepstral coefficients are also computed via autocorrelation and the quefrency ranged from 0.0625 ms to 3.0625 ms. The spectrum is computed by a 256 point FFT with zero padding.</Paragraph>
    <Paragraph position="4"> In the tests performed, noisy speech is obtained by adding white gaussian noise of the appropriate amplitude to the clean speech. The slight oversampling of the speech is taken into account in determining the noise amplitude. The signal to noise levels investigated are 24dB - 0dB in steps of 3dB.</Paragraph>
  </Section>
  <Section position="5" start_page="325" end_page="325" type="metho">
    <SectionTitle>
FEATURE DISTORTION
</SectionTitle>
    <Paragraph position="0"> The actual effect of additive noise on the various representations is measured first as</Paragraph>
    <Paragraph position="2"> where F(sj) is the representation of frame j of the clean speech, N is the number of frames, and sj + nj refers to a frame of speech with additive noise.</Paragraph>
    <Paragraph position="3"> The Karhunen-Loeve transform, K, is computed for each representation from the autocovariance of the clean speech features. It is chosen as a means of reducing the dimension of the cochlear model in an optimum fashion. Since it also can be used to restrict measured data to a known signal space, it is applied to all representations so as not to give the cochlear model an unfair advantage. The eigenvectors corresponding to the 48 largest eigenvalues of the autocovariance matrix are chosen to form the transform kernel for both the spectral and cochlear representations. For the LPC and LPC cepstrum, all eigenvectors are retained.</Paragraph>
    <Paragraph position="4"> Note that if all eigenvectors are retained I\[g(F(s)) - g(F(s + n))\[\[2 = HE(s) - F(s + n)\[\[2 so the transform does not affect the distortion computation for the LPC and LPC cepstrum, and in practice, the spectral distortion is not reduced by the change in dimension.</Paragraph>
    <Paragraph position="5"> The cochlear model suffers less distortion than the other representations at noise levels less than 9db, at which point it becomes parallel to the parametric models (Figure l(a)).</Paragraph>
  </Section>
  <Section position="6" start_page="325" end_page="326" type="metho">
    <SectionTitle>
VECTOR QUANTIZER DISTORTION
</SectionTitle>
    <Paragraph position="0"> Another comparison among the different representations is through the effects of noise on the performance of vector quantizers (VQs) trained with clean speech. The effect of noise on the both VQ class distributions for each phoneme and the increase in codebook distortion are used as the measuring criteria.</Paragraph>
    <Paragraph position="1"> Codebooks of 64 symbols are trained on clean speech and sample distributions of the VQ classes are formed for each phoneme at all noise levels. The similarity between the class distribution of the quantized clean speech, f,, and the distribution of the quantized noisy speech, fs+n, is measured by DDi,t~ib~tlo. Di.to~tio. = 1 -- Ei6_41 fs(i)&amp;quot; f~+n(i) 64 C/2 (iS ~...~64 For presentation and comparison, the measurement of each representation is normalized by its 0db value. Only the results for the most frequently occurring vowel,/ey/, are given (Figure l(b)), but the results for other phonemes, with the exception of stops, are essentially the same. In the case of the stops (and during silences), all representations seem to perform similarly.</Paragraph>
    <Paragraph position="2"> Since the distribution of the VQ classes for particular phonemes is important to many statistical methods of speech recognition, the superior performance of the cochlear representation is significant. The class  distributions show that the LPC and cepstrum, as noise increases, model all the speech sounds as noise, which the VQ labels as one of three or four classes. This happens also to the cochlear representation, but at higher noise levels.</Paragraph>
    <Paragraph position="3"> An alternative way of measuring VQ performance is through the codebook distortion, defined as</Paragraph>
    <Paragraph position="5"> This is also computed for each phoneme, but only the composite results are presented here, normalized by the 0db distortion (Figure l(c)).</Paragraph>
    <Paragraph position="6"> A similar measure based on 1/g~;=l IlYQ(F(sj))- YQ(F(sj + nj))ll2 is also computed. The results closely resemble those in Figure l(c), but include a common bias due to the codebook distortion. The measures DDistribution Distortion and DVQ Distortion show that the cochlear model performs well at noise levels below 9db.</Paragraph>
  </Section>
  <Section position="7" start_page="326" end_page="327" type="metho">
    <SectionTitle>
DISCRIMINATION ABILITY
</SectionTitle>
    <Paragraph position="0"> The ability of the LPC cepstrum and cochlear model to discriminate between different phonemes in the presence of additive noise is an important performance measure in speech recognition. The phonetic labels  in the database are used to compute a variant of the Fischer Discrimination to compare the intra-class scatter to the inter-class scatter at each noise level. This measure favors representations in which features assigned to any particular phoneme are tightly clustered and distant from features assigned to other phonemes. The evaluation is given by Dcon\]usion Score = 1/n log det Sw det SB where Swand SB are the intra-class and inter-class scatter matrices, respectively</Paragraph>
    <Paragraph position="2"> and c is the number of phonemes, Xi is the collection of all representations, x, labeled as the i ~h phoneme, ni is the cardinality of Xi, and m and mi are found by averaging all features and averaging all the features in Xi, respectively.</Paragraph>
    <Paragraph position="3"> Both the cepstrum and the cochlear model have similar discrimination performance at low noise levels (Figure l(d)), but the cochlear model retains its performance better as the additive noise level increases.</Paragraph>
  </Section>
  <Section position="8" start_page="327" end_page="327" type="metho">
    <SectionTitle>
DISCUSSION
</SectionTitle>
    <Paragraph position="0"> Why is the cochlear representation performance superior to other representations? There are probably two sources: the first is the compression by the hair cell models; the second is the spectral extraction strategy - the lateral inhibitory network (LIN) - applied to the cochlear model output. Compression produces a well know effect of enhancing a signal in a noisy background (see \[3\]). In the cochlear models it is possible to apply strong compression without loss of spectral detail because the spectral information is encoded in the phase locked responses. The LIN utilizes this phase locking to extract a robust spectral estimate that can tolerate extreme compression. Such compression is not feasible for spectrogram representations since it completely destroys the spectral peaks and valleys.</Paragraph>
  </Section>
  <Section position="9" start_page="327" end_page="327" type="metho">
    <SectionTitle>
AUDITORY NEUROPHYSIOLOGY
</SectionTitle>
    <Paragraph position="0"> In the central auditory system, we are investigating the nature of the representation of complex acoustic spectra in the auditory cortex \[4\]. Recordings of unit responses along the isofrequency contours of the ferret primary auditory cortex reveal systematic changes in the symmetry of their receptive fields. At the center, units with narrow and symmetric inhibitory sidebands predominate. These give way gradually to asymmetric inhibition, with high frequencies (relative to the best frequency of the units) becoming more effective caudally, and weaker rostrally. This organization gives rise to a new columner organization in the primary auditory cortex that seems to encode spectral slopes and the symmetry of spectral peaks, edges, and envelopes. These columns are analogous to the well known orientation columns of the visual system. The implication of these findings is that in the perception and recognition of complex sounds special attention must be given to the representation of spectral gradients. We have simulated the receptive fields obtained in neurophysiological experiments and are in the process of examining in detail the representation of natural and synthetic stationary speech tokens in the responses of the cortex (Figure 2).</Paragraph>
  </Section>
  <Section position="10" start_page="327" end_page="330" type="metho">
    <SectionTitle>
WORD RECOGNITION
</SectionTitle>
    <Paragraph position="0"> Finally, we have been developing models of networks that can be used for the recognition of temporallyordered sequences (e.g., phoneme sequences in a word) \[5\]. These networks are biologically plausible in that they do not require delay-lines to memorize the word prior to recognition. Instead, they function in a manner  Along the Iso-Frequency Planes. (left bottom) Response Patterns Elicited by Different Spectral Peaks. (right) Examples of the Distribution of Activity Produced by Speech Stimuli in a Model of AI with Spectral Orientation Columns. The Input Profiles are Shown to the Right of Each Figure.</Paragraph>
    <Paragraph position="1"> analogous to phase-locked loops, where the network locks onto an incoming sequence and predicts one state ahead. An error signal between the network state and the input is fed back to control the rate of progression in the network states (Figure 3).</Paragraph>
    <Paragraph position="2"> The system is based on a nonlinear recurrent lateral inhibitory network operating in a hysteresis mode which functions as a pattern generator. The network consists of a single layer of reciprocally and strongly inhibited neurons. The profile of connectivities is designed such that the patterns of the desired sequence are stable states of the network outputs. It can be shown that, when equally activated, the network settles in any one of its stable states depending on its initial conditions, i.e. displays a hysteresis behavior. A network generates a sequence when it cycles through its stable states. In order to control the order and rate of this process, integrating excitatory connections are formed that project from the elements of one pattern to the elements of the succeeding pattern. Only one time-constant of integration is used for all connections in the network. The varying durations of the sequence patterns are encoded not as different time constants but as different widths of the hysteresis loops between the different patterns, i.e. through the magnitudes of the inhibitory connectivities in the network.</Paragraph>
    <Paragraph position="3">  The proposed network can be readily used as a recognizer of sequences applied to its input. The key concept here is the degree of correspondence between the applied input and the internally predicted state of the network. This measure is used to modulate the mode of operation in the network between a free-cycling mode when the correspondence is high, and an input-dominated mode when it is low. The measure is a state-dependent function derived during training, similar to a likelihood function. Thus, this measure can also be used as an indicator of the match between the applied sequence and the sequence generated by the network.</Paragraph>
    <Paragraph position="4">  When the confidence is relatively high and the network is free-cycling, it automatically substitutes missing patterns and is rather insensitive to small irregularities of the input temporal durations. Therefore, in such a scheme, no time-warping is needed.</Paragraph>
  </Section>
class="xml-element"></Paper>