<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1092"> <Title>COMPARISON OF AUDITORY MODELS FOR ROBUST SPEECH RECOGNITION*</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> COMPARISON OF AUDITORY MODELS FOR ROBUST SPEECH RECOGNITION* </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> Two auditory front ends which emulate some aspects of the human auditory system were compared using a high performance isolated word Hidden Markov Model (HMM) speech recognizer. In these initial studies, auditory models from Seneff \[2\] and Ghitza \[4\] were compared using both clean speech and speech corrupted by speech-like &quot;babble&quot; noise. Preliminary results indicate that the auditory models reduce the error rate slightly, especially at intermediate and high noise levels.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. MOTIVATION </SectionTitle> <Paragraph position="0"> The performance of speech recognizers often degrades dramatically in noise, with different talking styles, when the microphone is changed, and as a talker moves relative to a microphone. New auditory front ends that mimic some aspects of human auditory-nerve and psychoacoustic behavior have been proposed to reduce these problems.</Paragraph> <Paragraph position="1"> Although past limited experiments suggest that these front ends improve robustness, no thorough comparisons have been performed using high-performance Hidden Markov Model (HMM) recognizers. In addition, few studies have evaluated the effect of speech babble noise and frequency response variability on performance or explored alternative approaches to feature reduction. This is an early progress report on research in this area. Further ongoing experiments are exploring additional front ends and alternative data reduction techniques.</Paragraph> </Section> <Section position="4" start_page="0" end_page="453" type="metho"> <SectionTitle> 2. AUDITORY MODELS </SectionTitle> <Paragraph position="0"> Two auditory front ends which produce features that correspond to phase or synchrony information in the speech signal were explored. These auditory front ends were compared to a more conventional reel-scale cepstral front end.</Paragraph> <Paragraph position="1"> *This work was sponsored by the Defense Advanced Research th-ojects Agency. The views expressed are those of the authors and do not reflect the official policy or position of the U. S. Government.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1. Mel-Scale Cepstra </SectionTitle> <Paragraph position="0"> The reel-scale cepstral front end described by Davis and Mermelstein \[1\] was used as a reference. This is a common signal representation which is currently used in all speech recognition systems at Lincoln Laboratory. In this front end, a 20 ms Hamming window is applied to the speech signal every 10 ms. The power spectrum of the windowed waveform is weighted by a series of filters, linearly spaced from 0 to 1000 Hz, and logarithmically spaced above 1000 Hz. Each filter &quot;width&quot; is twice the spacing. An inverse cosine transform converts the logarithm of the resulting filter bank coefficients to the cepstral domain.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2. 
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2. Seneff's Auditory Model </SectionTitle> <Paragraph position="0"> The first auditory front end evaluated was motivated by physiological data and is described by Seneff [2,3]. This front end incorporates a first stage of 40 linear filters, followed by a series of nonlinearities modelling the transformation from basilar membrane motion to auditory nerve stimulation. These nonlinearities include soft half-wave rectification, a model for short-term adaptation, and a rapid AGC.</Paragraph> <Paragraph position="1"> Seneff's front end produces two outputs. &quot;Mean rate&quot; outputs are generated by detecting the envelope of the nonlinear stage output. These outputs roughly correspond to spectral magnitude information. &quot;Synchrony&quot; outputs detect the extent to which the nonlinear stage output for a particular channel has energy at the center frequency of that channel. This emulates the extent to which the nerve firings from a particular location on the basilar membrane are synchronized to the &quot;characteristic frequency&quot; corresponding to that location.</Paragraph> </Section> <Section position="3" start_page="0" end_page="453" type="sub_section"> <SectionTitle> 2.3. EIH </SectionTitle> <Paragraph position="0"> The second auditory front end evaluated was the Ensemble Interval Histogram (EIH) model developed by Ghitza [4].</Paragraph> <Paragraph position="1"> The EIH model has a first stage of linear filters similar to Seneff's, but with a considerably larger number of filters (133 instead of 40). The second stage takes the output of each filter and computes the intervals between the positive crossings of the filtered waveform at various logarithmically spaced thresholds. A histogram of the frequencies corresponding to these intervals is then created. The final stage combines the histograms for all of the channels into the final output, the Ensemble Interval Histogram. In this respect, the EIH model performs a function similar to Seneff's &quot;Synchrony&quot; output, measuring the extent to which the output of the linear filter is in synchrony with the center frequency of that filter. The EIH model has been shown to be useful for isolated-word recognition under high-noise conditions.</Paragraph> </Section> </Section>
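To make the two output types of Section 2.2 concrete, the toy sketch below computes, for a single channel, a "mean rate"-style envelope and a "synchrony"-style measure of energy at the channel's characteristic frequency. Seneff's model realizes these ideas with more elaborate hair-cell and synchrony-detection stages; the frame size, normalization, and function names here are illustrative assumptions.

```python
# Toy illustration of the two output types in Section 2.2, for a single
# channel of a critical-band filter bank. Not Seneff's actual stages.
import numpy as np

def channel_outputs(y, fs, f_c, frame_ms=10.0):
    """y: one channel's (post-nonlinearity) waveform; f_c: its center frequency."""
    frame = int(frame_ms * fs / 1000.0)
    n = len(y) // frame
    frames = y[:n * frame].reshape(n, frame)
    # "Mean rate": short-time average of the half-wave rectified signal,
    # i.e. a crude envelope detector.
    mean_rate = np.maximum(frames, 0.0).mean(axis=1)
    # "Synchrony": magnitude of the Fourier component at f_c, normalized
    # by frame energy so it reflects phase locking to f_c, not level.
    t = np.arange(frame) / fs
    probe = np.exp(-2j * np.pi * f_c * t)
    sync = np.abs(frames @ probe) / (np.linalg.norm(frames, axis=1) + 1e-10)
    return mean_rate, sync
```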
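The interval-histogram computation of Section 2.3 can likewise be sketched in a few lines. This is a minimal sketch of the idea only: the threshold placement, the number of thresholds per channel, and the histogram bin edges are illustrative assumptions rather than Ghitza's actual choices.

```python
# Sketch of the EIH idea: per channel, measure intervals between
# positive-going crossings of several logarithmically spaced thresholds,
# convert each interval to a frequency (1/interval), and pool the
# resulting histograms across channels into one "ensemble" histogram.
import numpy as np

def interval_histogram(channels, fs, n_bins=64, f_lo=100.0, f_hi=4000.0):
    """channels: list of filtered waveforms, one per cochlear channel."""
    edges = np.geomspace(f_lo, f_hi, n_bins + 1)   # log-spaced frequency bins
    hist = np.zeros(n_bins)
    for y in channels:
        # A few log-spaced thresholds below the channel's peak level
        # (placement is an illustrative assumption).
        peak = np.max(np.abs(y)) + 1e-10
        for thr in peak * np.geomspace(0.05, 0.5, 4):
            above = y >= thr
            crossings = np.flatnonzero(~above[:-1] & above[1:])  # positive-going
            if len(crossings) > 1:
                freqs = fs / np.diff(crossings)    # interval -> frequency
                counts, _ = np.histogram(freqs, bins=edges)
                hist += counts
    return hist, edges
```

Pooling the per-channel histograms into a single vector is what makes the representation an "ensemble" interval histogram.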
<Section position="5" start_page="453" end_page="453" type="metho"> <SectionTitle> 3. EVALUATION CONDITIONS </SectionTitle> <Paragraph position="0"> Initial evaluation is being performed using the TI 105-word speech corpus [5]. This corpus includes speech spoken in various talking styles, covers 8 speakers (5 male and 3 female), and provides 5 training tokens and 2 testing tokens per condition for each vocabulary item.</Paragraph> <Paragraph position="1"> Noise was added to the clean speech condition to evaluate performance under noisy channel conditions. The noise was speech babble recorded in a public meeting place with many background speakers. For this evaluation we used Ghitza's definition of signal-to-noise ratio [4]: the ratio of the energy per sample in the clean speech to that in the noise, averaged over the entire duration of the utterance.</Paragraph> <Paragraph position="2"> The recognition system was a word-based HMM system with eight speech states per word model, continuous-density observations, and a single diagonal covariance matrix tied across all states. This robust recognizer provides low error rates on many isolated-word databases.</Paragraph> <Paragraph position="3"> Both auditory front ends produce high-dimensional feature vector outputs. For classification, the dimensionality was reduced using the same inverse cosine transform used for the mel-scale cepstral front end. For this purpose, all auditory model outputs were treated as representing spectral magnitude. This was done in lieu of a more advanced data reduction technique such as principal components analysis or linear discriminant analysis.</Paragraph> </Section> <Section position="6" start_page="453" end_page="453" type="metho"> <SectionTitle> 4. RESULTS </SectionTitle> <Paragraph position="0"> Table 1 shows the results of the preliminary recognition experiments. The signal-to-noise ratio is indicated in the first column, followed by the word accuracy results for the mel-frequency cepstra (MFC), the &quot;mean rate&quot; response (MEAN) and &quot;synchrony&quot; output (SYN) from Seneff's auditory model, and the results using the EIH model. The binomial standard deviation of the word accuracy rate, assuming the MFC performance level, is indicated in the final column.</Paragraph> <Paragraph position="1"> These preliminary results are encouraging. There is no degradation in performance at low noise levels, and all front ends provide reduced error rates at both intermediate and high noise levels.</Paragraph> </Section> <Section position="7" start_page="453" end_page="453" type="metho"> <SectionTitle> Table 1. Word accuracy by SNR for the MFC, MEAN, SYN, and EIH front ends </SectionTitle> <Paragraph position="0"> [Table values were not recovered in extraction; only the column headers MEAN, SNR, and MFC survive.] </Paragraph> </Section> </Paper>