<?xml version="1.0" standalone="yes"?> <Paper uid="N03-2031"> <Title>Auditory-based Acoustic Distinctive Features and Spectral Cues for Robust Automatic Speech Recognition in Low-SNR Car Environments</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Auditory-based Processing </SectionTitle> <Paragraph position="0"> Several studies have shown that exploiting properties of human hearing provides insight into defining a potentially useful front-end speech representation (O'Shaughnessy, 2000). However, the performance of current ASR systems remains far below that achieved by humans. In an attempt to improve ASR performance in noisy environments, we evaluate in this work the use of hearing/perception knowledge for ASR in noisy car environments. This is accomplished through the use of auditory-based acoustic distinctive features and formant frequencies for robust ASR.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Caelen's Auditory Model </SectionTitle> <Paragraph position="0"> Caelen's auditory model (Caelen, 1985) consists of three parts that simulate the behavior of the ear. The external and middle ear are modeled by a bandpass filter whose response can be adjusted to the signal energy, accounting for the adaptive motions of the ossicles. The next part of the model simulates the behavior of the basilar membrane (BM), the most important part of the inner ear, which acts essentially as a non-linear filter bank. Because its stiffness varies along its length, different places along the BM are sensitive to sounds with different spectral content. In particular, the BM is stiff and thin at the base, but less rigid and more sensitive to low-frequency signals at the apex. Each location along the BM has a characteristic frequency at which it vibrates maximally for a given input sound. This behavior is simulated in the model by a cascade filter bank; the larger the number of filters, the more accurate the model. These filtering stages are preceded by a pre-emphasis stage that simulates the effects of the outer and middle ear. In our experiments we used 24 filters. This number depends on the sampling rate of the signals (16 kHz) and on other parameters of the model, such as the overlapping factor of the filter bands and the quality factor of the resonant part of the filters. The final part of the model deals with the electro-mechanical transduction of the hair cells and afferent fibers and the encoding at the level of the synaptic endings. For more details see (Caelen, 1985).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Acoustic Distinctive Cues </SectionTitle> <Paragraph position="0"> The acoustic distinctive cues are computed from the spectral data as linear combinations of the energies in the various channels. It was shown in (Jakobson et al., 1951) that 12 acoustic cues are sufficient to characterize all languages acoustically. However, it is not necessary to use all of these cues to characterize a specific language. In our study, we chose 7 cues to be merged into a multi-stream feature vector in an attempt to improve ASR performance. These cues are based on the Caelen ear model described above and thus do not correspond exactly to Jakobson's cues. Each cue is computed from the outputs of the 24 channel filters of this ear model. These seven normalized acoustic cues are: acute/grave (AG), open/closed (OC), diffuse/compact (DC), sharp/flat (SF), mat/strident (MS), continuous/discontinuous (CD) and tense/lax (TL).</Paragraph> </Section> </Section>
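To make the processing of Sections 2.1 and 2.2 concrete, the following is a minimal sketch, not the authors' implementation: it approximates Caelen's cascade filter bank with 24 independent bandpass filters on a mel-like scale and derives cues as linear combinations of the channel energies. The center frequencies, bandwidths, and combination weights are hypothetical placeholders (the actual model stages and coefficients are those of Caelen, 1985), and only two of the seven cues (AG, DC) are sketched; the remaining cues follow the same linear-combination pattern with different weights.

```python
import numpy as np
from scipy.signal import butter, lfilter

FS = 16000          # sampling rate assumed in the paper (16 kHz)
N_CHANNELS = 24     # number of filters used in the experiments

def channel_energies(signal, fs=FS, n_channels=N_CHANNELS):
    """Log energies of a crude 24-channel bandpass filter bank.

    Stand-in for Caelen's cascade filter bank; real center frequencies
    and bandwidths follow the model described in (Caelen, 1985).
    """
    # Pre-emphasis stage (rough outer/middle ear approximation).
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Hypothetical mel-like spacing of channel edges from 100 Hz to 7.6 kHz.
    edges = 700.0 * (10 ** np.linspace(np.log10(1 + 100 / 700.0),
                                       np.log10(1 + 7600 / 700.0),
                                       n_channels + 1) - 1)
    energies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo, hi], btype="bandpass", fs=fs)
        band = lfilter(b, a, emphasized)
        energies.append(np.log(np.sum(band ** 2) + 1e-10))
    return np.array(energies)

# Hypothetical weights: each cue is a linear combination of channel
# energies, e.g. acute/grave contrasts high- vs. low-frequency energy.
CUE_WEIGHTS = {
    "AG": np.concatenate([-np.ones(12), np.ones(12)]) / 12.0,           # acute/grave
    "DC": np.concatenate([np.ones(8), -np.ones(8), np.ones(8)]) / 8.0,  # diffuse/compact
}

def acoustic_cues(energies):
    """Map the 24 channel energies to illustrative distinctive cues."""
    return {name: float(w @ energies) for name, w in CUE_WEIGHTS.items()}
```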
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Multi-stream Statistical Framework </SectionTitle> <Paragraph position="0"> Most recognizers typically use left-to-right HMMs, which consist of an arbitrary number of states N (O'Shaughnessy, 2000). The output distribution associated with each state depends on one or more statistically independent streams. Assuming an observation sequence O composed of S input streams O_s, possibly of different lengths, representing the utterance to be recognized, the probability of the composite input vector O_t at time t in state j can be written as follows:</Paragraph> <Paragraph position="1"> b_j(O_t) = \prod_{s=1}^{S} \left[ b_{js}(O_{st}) \right]^{\gamma_s} </Paragraph> <Paragraph position="2"> where O_{st} is the input observation vector in stream s at time t and \gamma_s is the stream weight. Each individual stream probability b_{js}(O_{st}) is represented by a multivariate Gaussian mixture. To investigate the multi-stream paradigm using the proposed features for ASR, we performed a number of experiments in which we merged different sources of information about the speech signal that may be lost in cepstral analysis.</Paragraph> </Section> </Paper>
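As a worked illustration of the state output probability above, the sketch below combines per-stream Gaussian-mixture likelihoods with exponential stream weights, i.e. b_j(O_t) = prod_s b_js(O_st)^gamma_s, computed in the log domain for numerical stability. The stream dimensions, mixture parameters, and weights are toy placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def stream_log_likelihood(x, weights, means, covs):
    """log b_js(O_st): log-likelihood of one stream under a GMM."""
    comps = [np.log(w) + multivariate_normal.logpdf(x, m, c)
             for w, m, c in zip(weights, means, covs)]
    return np.logaddexp.reduce(comps)

def state_log_output_prob(stream_obs, stream_gmms, stream_weights):
    """log b_j(O_t) = sum_s gamma_s * log b_js(O_st)."""
    return sum(g * stream_log_likelihood(x, *gmm)
               for x, gmm, g in zip(stream_obs, stream_gmms, stream_weights))

# Toy example with two streams (e.g. cepstral features and acoustic cues).
rng = np.random.default_rng(0)
gmm_cep = ([0.6, 0.4], [np.zeros(3), np.ones(3)], [np.eye(3)] * 2)  # 2-mixture GMM
gmm_cue = ([1.0], [np.zeros(2)], [np.eye(2)])                       # single Gaussian
obs = [rng.normal(size=3), rng.normal(size=2)]
print(state_log_output_prob(obs, [gmm_cep, gmm_cue], stream_weights=[1.0, 0.5]))
```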