<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1014">
  <Title>EFFICIENT CEPSTRAL NORMALIZATION FOR ROBUST SPEECH RECOGNITION</Title>
  <Section position="6" start_page="71" end_page="73" type="evalu">
    <SectionTitle>
3. EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"> In this section we describe the ability of the various environmental compensation algorithms to improve the recognition accuracy obtained with speech from unknown or degraded microphones.</Paragraph>
    <Paragraph position="1"> The environmental compensation algorithms were evaluated using the SPHINX-II recognition system [12] in the context of the November, 1992, evaluations of continuous speech recognition systems using a 5000-word closed-vocabulary task consisting of dictation of sentences from the Wall Street Journal. A component of that evaluation involved utterances from a set of unknown &quot;secondary&quot; microphones, including desktop microphones, telephone handsets and speakerphones, stand-mounted microphones, and lapel-mounted microphones.</Paragraph>
    <Paragraph position="2"> 3.1. Results from November CSR Evaluations We describe in this section results of evaluations of the MFCDCN and CDCN algorithms using speech from secondary microphones in the November, 1992, CSR evaluations. null Because of the desire to benchmark multiple algorithms under several conditions in this evaluation combined with limited resources and the severe time constraints imposed by the evaluation protocol, this evaluation was performed using a version of SPHINX-II that was slightly reduced in performance, but that could process the test data more rapidly than the system described in \[12\]. Specifically, the selection of phonetic models (across genders) was performed by minimizing mean VQ distortion of the cepstral vectors before recognition was attempted, rather than on the basis of a posteriori probability after classification. In addition, neither the unified stochastic engine (USE) described in \[12\] nor the cepstral mean normalization algorithms were applied. Finally, the CDCN evaluations were conducted without making use of the CART decision tree or alternate  pronunciations in the recognition dictionary. The effect of these computational shortcuts was to increase the baseline error rate for the 5000-word task from 6.9% as reported in \[12\] to 8.1% for the MFCDCN evaluation, and to 8.4% for  panel) and the CDCN algorithm (lower panel) on the official DARPA CSR evaluations of November, 1992 Figure 3 summarizes the results obtained in the official November, 1992, evaluations. For these experiments, the MFCDCN algorithm was trained using the 15 environments in the training set and developmental test set for this evaluation. It is seen that both the CDCN and MFCDCN algorithms significantly improve the recognition accuracy obtained with the secondary microphones, with little or no loss in performance when applied to speech from the closetalking Sennheiser HMD-414 (CLSTLK) microphone.</Paragraph>
    <Paragraph position="3"> The small degradation in recognition accuracy observed for speech from the CLSTLK microphone using the MFCDCN algorithm may be at least in part a consequence of errors in selecting the environment for the compensation vectors.</Paragraph>
    <Paragraph position="4"> Environment-classification errors occurred on 48.8% of the CLSTLK utterances and on 28.5% of the utterances from secondary microphones. In the case of the secondary microphones, however, recognition accuracy was no better using the FCDCN algorithm, which presumes knowledge of the correct environment, so the confusions appear to have taken place primarily between acoustically-similar environments.</Paragraph>
    <Paragraph position="5"> In a later study we repeated the evaluation using MFCDCN compensation vectors obtained using only the seven categories of microphones suggested by BBN rather than the original 15 environments. This simplification produced only a modest increase in error rate for speech from secondary microphones (from 17.7% to 18.9%) and actually improved the error rate for speech from the CLSTLK microphone (from 9.4% to 8.3%).</Paragraph>
    <Paragraph position="6"> Figure 4 summarizes the results of a series of (unofficial) experiments run on the same data that explore the interaction between MFCDCN and the various cepstral filtering techniques. The vertical dotted line identifies the system described in [12].</Paragraph>
    <Paragraph position="7"> [Figure 4 (caption fragment): ... RASTA algorithm on recognition accuracy of the Sennheiser HMD-414 microphone (solid curve) and the secondary microphones (dashed curve), from the November 1992 DARPA CSR evaluation data.]</Paragraph>
    <Paragraph position="8"> It can be seen in Figure 4 that RASTA filtering provides only a modest reduction in errors for the secondary microphones, and degrades performance on speech from the CLSTLK microphone. CMN, on the other hand, provides almost as much improvement in recognition accuracy as MFCDCN, without degrading performance on speech from the CLSTLK microphone.</Paragraph>
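CMN itself is simple to state: a stationary linear channel (a fixed microphone or room response) adds a constant offset in the cepstral domain, so subtracting the per-utterance mean removes it. A minimal sketch, with array shapes assumed for illustration:

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean from every cepstral frame.

    cepstra: (T, D) array of cepstral vectors for one utterance.
    Any constant convolutional-channel offset cancels in the result.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

RASTA, by contrast, band-pass filters each cepstral trajectory over time rather than removing a single utterance-level mean, which is one plausible source of the differing behavior reported here.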
    <Paragraph position="9"> We do not yet know why our results using CMN are so much better than those obtained using RASTA; Schwartz et al., in contrast, obtained approximately comparable results using the two procedures [13].</Paragraph>
    <Paragraph position="10"> Finally, adding MFCDCN to CMN improves the error rate from 21.4% to 16.2%, and the use of IMFCDCN provides a further reduction in error rate to 16.0% for this task.</Paragraph>
    <Paragraph position="11"> 3.2. Results from the &quot;Stress Test&quot; Evaluation In addition to the evaluation described above, a second unofficial &quot;stress-test&quot; evaluation was conducted in December, 1992, which included spontaneous speech, utterances containing out-of-vocabulary words, and speech from unknown microphones and environments, all related to the Wall Street Journal domain.</Paragraph>
    <Paragraph position="12"> The version of SPHINX-II used for this evaluation was configured to maximize the robustness of the recognition process. It was trained on 13,000 speaker-independent utterances from the Wall Street Journal task and 14,000 utterances of spontaneous speech from the ATIS travel planning domain. The trigram grammar for the system was derived from 70.0 million words of text without verbalized punctuation and 11.6 million words with verbalized punctuation. Two parallel versions of the SPHINX-II system were run, with and without IMFCDCN. The results obtained are summarized in Table I below.</Paragraph>
    <Paragraph position="13"> [Table I (flattened in extraction; only the header fragments &quot;In&quot;, &quot;Out of&quot;, &quot;STRESS&quot;, and &quot;BASE&quot; survive): results of the December, 1992, &quot;Stress-Test&quot; Evaluation. The baseline CSR results are provided for comparison only, and were not obtained using a comparably-configured system.]</Paragraph>
    <Paragraph position="14"> We also compared these results with the performance of the baseline SPHINX-II system on the same data. The baseline system achieved a word error rate of 22.9% using only the bigram language model. Adding IMFCDCN reduced the error rate only to 22.7%, compared to 20.8% for the stress-test system using IMFCDCN. We believe that the IMFCDCN algorithm provided only a small benefit because only a small percentage of data in this test was from secondary microphones.</Paragraph>
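The word error rates quoted throughout are the standard edit-distance measure: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch (collapsing the three error types into a single count):

```python
def word_error_rate(ref, hyp):
    """WER = minimum edit distance between word sequences / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, scoring the hypothesis "a x c" against the reference "a b c d" counts one substitution and one deletion over four reference words.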
    <Paragraph position="15"> In general, we are very encouraged by these results, which are as good or better than the best results obtained only one year ago under highly controlled conditions. We believe that the stress-test protocol is a good paradigm for future evaluations.</Paragraph>
  </Section>
</Paper>