File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/90/h90-1065_abstr.xml
Size: 9,979 bytes
Last Modified: 2025-10-06 13:46:59
<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1065"> <Title>for DECIPHER (Uses February 1989 RM Test Set)</Title> <Section position="1" start_page="0" end_page="338" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> SRI has developed the DECIPHER system, a hidden Markov model (HMM) based continuous speech recognition system typically used in a speaker-independent manner. Initially we review the DECIPHER system, then we show that DECIPHER's speaker-independent performance improved by 20% when the standard 3990-sentence speaker-independent test set was augmented with training data from the 7200-sentence resource management speaker-dependent training sentences.</Paragraph> <Paragraph position="1"> We show a further improvement of over 20% when a version of corrective training was implemented. Finally we show improvement using parallel male- and femaletrained models in DECIPHER. The word-error rate when all three improvements were combined was 3.7% on DARPA's February 1989 speaker-independent test set using the standard perplexity 60 wordpair grammar.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> System Description Front End Analysis </SectionTitle> <Paragraph position="0"> Decipher uses a FFT-based Mel-cepstra front end. Twenty-five FFT-Mel filters spanning 100 to 6400 hz are used to derive 12 Mel-cepstra coefficients every 10-rns frame. Four features are derived every frame from this cepstra sequence. They are: We use 256-word speaker-independent codebooks to vector-quantize the Mel-cepstra and the Mel-cepstral differences. The resulting four-feature-perframe vector is used as input to the DECIPHER HMM-based speech recognition system.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Pronunciation Models </SectionTitle> <Paragraph position="0"> DECIPHER uses pronunciation models generated by applying a phonological rule set to word baseforms. The technique used to generate the rules are described in Murveit89 and Cohen90. These generate approximately 40 pronunciations per word as measured on the DARPA resource management vocabulary. Speaker-independent pronunciation probabilities are then estimated using these bushy word networks and the forward-backward algorithm in DECIPHER. The networks are then pruned so that only the likely pronunciations remain--typically about four pronunciations per word for the resource management task.</Paragraph> <Paragraph position="1"> This modeling of pronunciation is one of the ways that DECIPHER is distinguished from other HMM-based systems. We have shown in Cohen90 that this modeling improves system performance.</Paragraph> </Section> <Section position="3" start_page="0" end_page="337" type="sub_section"> <SectionTitle> Acoustic Modeling </SectionTitle> <Paragraph position="0"> DECIPHER builds and trains word models by using context-based phone models arranged according to the pronunciation networks for the word being modeled.</Paragraph> <Paragraph position="1"> Models used include unique-phone-in-word, phone-inword, triphone, biphone, and generalized-phone forms of biphones and triphones, as well as context-independent models. Similar contexts are automatically smoothed together, if they do not adequately model the training dam, according to a deleted-estimation interpolation algorithm developed at SRI (similar to Jelinek80). 
<Paragraph position="2"> Training proceeds as follows: * Initially, context-independent boot models are estimated from hand-labeled portions of the training part of the TIMIT database.</Paragraph>
<Paragraph position="3"> * The boot models are used as input for a 2-iteration context-independent model training run, in which context-independent models are refined and pronunciation probabilities are calculated using the large 40-pronunciation word networks. As stated above, these large networks are then pruned to about four pronunciations per word.</Paragraph>
<Paragraph position="4"> * Context-dependent models are then estimated from a second 2-iteration forward-backward run, which uses the context-independent models and the pruned networks as input.</Paragraph> </Section>
<Section position="4" start_page="337" end_page="337" type="sub_section"> <SectionTitle> System Evaluation </SectionTitle>
<Paragraph position="0"> DECIPHER has been evaluated on the speaker-independent continuous-speech DARPA resource management test sets [Price88] [Pallet89]. DECIPHER was evaluated on the November 1989 test set (evaluated by SRI in March 1990) and had 6% word error on the perplexity 60 task. This performance was equal to the best previously reported error rate for that condition. We recently evaluated on the June 1990 task, and achieved 6.5% word error for a system trained on 3990 sentences and 4.8% word error using 11,190 training sentences.</Paragraph>
<Paragraph position="1"> Since the October 1989 evaluation, DECIPHER's performance has improved in three ways: * We noted, when using the standard 3990-sentence resource management training set, that many of DECIPHER's probability distributions were poorly estimated. Therefore, we evaluated DECIPHER with several different amounts of training data. The largest training set we used, an 11,190-sentence resource management training set, improved the word error rate by about 20%.</Paragraph>
<Paragraph position="2"> * We implemented a modified version of IBM's corrective training algorithm, additionally improving the word error rate by about 20%.</Paragraph>
<Paragraph position="3"> * We separated the male and female training data and estimated different HMM output distributions for each sex. This also improved word accuracy by 20%.</Paragraph>
<Paragraph position="4"> These improvements are described in more detail below.</Paragraph> </Section>
<Section position="5" start_page="337" end_page="338" type="sub_section"> <SectionTitle> Effects of Training Data </SectionTitle>
<Paragraph position="0"> In a recent study, we discovered that DECIPHER's word error rate on its training set using the perplexity 60 grammar was very low (0.7% over the 3990 resource management sentences). Since the test-set error rate for that system was about 7%, we concluded that the system would profit from more training data. To test this, we evaluated the system with four databases easily available to us, as shown in Table 1 (Word Error as a Function of Training Set). There SI refers to the 3990-sentence speaker-independent portion of the resource management (RM) database--109 speakers, 30 or 40 sentences each; SD refers to the speaker-dependent portion of that database--12 speakers, 600 sentences each; and TIMIT refers to the training portion of the TIMIT database--420 speakers, 8 sentences each. Note that all SI and SD sentences are related to the resource management task, while TIMIT's sentences are not. All systems were tested under a continuous-speech, speaker-independent condition with the perplexity 60 resource management grammar. Table 1 shows that performance improved as data increased, even when adding the out-of-task TIMIT data. The only exception was that training with 3990 sentences from 109 talkers was slightly better than 7200 sentences from 12 talkers. This is to be expected in a speaker-independent system. This last result is consistent with the findings in [Kubala90], which showed that there was not a big performance drop when the number of speakers was drastically reduced (from 109 to 12) in speaker-independent systems. It is likely that more training data would continue to improve performance on this task; however, we believe that a more sensible study would be to focus on how large training sets could improve performance across tasks and vocabularies. (See, for instance, [Hon90].)</Paragraph> </Section>
<Section position="6" start_page="338" end_page="338" type="sub_section"> <SectionTitle> Separating Male and Female Models </SectionTitle>
<Paragraph position="0"> We experimented with maintaining sex consistency in DECIPHER's hypotheses by partitioning male and female training data and using parallel recognition systems, as in [Bush87]. Two subrecognizers are run in parallel on the unknown speech, and the hypothesis from the recognizer with the higher probability is used. The disadvantage of this approach is that it makes inefficient use of training data: in the best scenario, the male models are trained from only half of the training data and the female models use only the other half. This is inefficient because, even though there may be a fundamental difference between the two types of speech, they still have much in common and could profit from each other's training data if it were used properly.</Paragraph>
<Paragraph position="1"> It is no wonder, then, that this approach has been successful in digit recognition systems, which have an abundance of training data for each parameter to be estimated, but has not significantly improved performance in large-vocabulary systems with a relatively small amount of training data [Paul89]. To validate the idea of sex consistency, we trained male-only and female-only versions of the DECIPHER speech recognition system using the 11,190-sentence SI+SD training set, to make sure the data partitions had enough data. We produced SI+SD subsets with 4160 female and 7030 male sentences. These systems were tested on the DARPA February 1989 speaker-independent test set using the DARPA word-pair grammar (perplexity 60) and are compared below to a similar recognition system trained on all 11,190 sentences. The results in Table 2 show a 19% reduction in the error rate when using sex-consistent recognition systems. This is a significant error rate reduction. A closer look at the system's performance showed that it correctly assigned the talker's sex in each of the 300 test sentences.</Paragraph>
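As an illustration of the parallel decoding scheme described in this section, here is a minimal Python sketch. The recognizer interface (a callable returning a hypothesis and its log probability) is a hypothetical stand-in for illustration; the paper does not expose DECIPHER's programming interface.

    from typing import Callable, Tuple

    # A recognizer maps an utterance to (hypothesis, log_probability).
    # In the experiment above these would be DECIPHER systems trained on the
    # male-only (7030-sentence) and female-only (4160-sentence) partitions.
    Recognizer = Callable[[object], Tuple[str, float]]

    def sex_consistent_decode(speech: object,
                              male_rec: Recognizer,
                              female_rec: Recognizer) -> Tuple[str, str]:
        """Run both subrecognizers on the unknown speech and keep the
        hypothesis whose models assign it the higher log probability.
        The winning model's identity is the sex assigned to the talker."""
        male_hyp, male_lp = male_rec(speech)
        female_hyp, female_lp = female_rec(speech)
        if male_lp >= female_lp:
            return male_hyp, "male"
        return female_hyp, "female"

Since every test sentence is decoded by both subrecognizers, the scheme doubles decoding cost but needs no separate sex classifier; as reported above, the higher-probability model identified the talker's sex correctly on all 300 test sentences.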
</Section> </Section> </Paper>