THE LINCOLN CONTINUOUS SPEECH RECOGNITION SYSTEM: RECENT DEVELOPMENTS AND RESULTS

INTRODUCTION

Our earlier development efforts [2,3,4,5,6,7,8,9] centered on improving speaker-dependent (SD) speaker-stress robustness for both isolated-word recognition (IWR) and continuous speech recognition (CSR) tasks. Since our IWR database included a normal-speech test section, we were able to determine that our enhancements for robustness also improved performance on normally spoken speech (0 errors in 1,680 test tokens, 105-word vocabulary, multi-style training). An independent test on the TI-20 word database [10] confirmed this normal-speech performance, with 3 errors out of 5,120 test tokens on our first run on that database and no errors after a small amount of development [3]. Our robust CSR database, however, was not useful for determining large-vocabulary normal-speech performance.

In order to work on a large-vocabulary normal-speech CSR task, we switched to the DARPA Resource Management database [1]. The SD portion of this database has 12 speakers, each with 600 training sentences and 100 development test sentences, providing a total of 1,200 test sentences containing 10,242 words. For speaker-independent (SI) work we used the same development test sentences, but trained on 2,880 sentences from 72 speakers drawn from the SI training portion of the database. (There was an overlap of 8 speakers between the SI and SD training sets, making up the total of 80 speakers reported in [1].) When additional SI training data was needed, we added the designated "SI development test" data, again avoiding test-speaker overlaps, to the designated SI training data, for a total of 3,990 training sentences from 109 speakers.

The vocabulary of the Resource Management database is 991 words. There is also an "official" word-pair recognition grammar [11]. This grammar is simply a list of allowable word pairs, without probabilities, and it reduces the recognition perplexity to about 60. (Including the probabilities slightly more than halves the perplexity.)
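As a rough illustration of the perplexity figure above: with no probabilities attached, a word-pair grammar makes every allowed successor of the previous word equally likely, so its test-set perplexity is simply the geometric mean of the number of allowed successors. The sketch below (hypothetical Python, not from the paper; the function name and toy grammar are invented for illustration) computes that quantity.

import math

def word_pair_perplexity(test_sentences, allowed_pairs, vocab_size):
    """Perplexity of a word-pair grammar with no probabilities: every
    successor allowed after the previous word is treated as equally
    likely, so the perplexity is the geometric mean of the number of
    allowed successors over the test words."""
    successors = {}
    for prev, nxt in allowed_pairs:
        successors.setdefault(prev, set()).add(nxt)

    total_log2, n_words = 0.0, 0
    for sentence in test_sentences:
        prev = "<s>"  # sentence-start symbol; assume any word may begin a sentence
        for word in sentence:
            branches = len(successors.get(prev, range(vocab_size)))
            total_log2 += math.log2(branches)
            n_words += 1
            prev = word
    return 2.0 ** (total_log2 / n_words)

pairs = [("<s>", "list"), ("<s>", "show"), ("list", "ships"), ("show", "ships")]
print(word_pair_perplexity([["show", "ships"]], pairs, vocab_size=991))  # about 1.41 for this toy grammar

Attaching estimated probabilities to the same pairs concentrates mass on the likely successors, which is how including probabilities can more than halve the perplexity even though the list of allowed pairs is unchanged.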
Working with a single development test set carries the risk of tuning one's system to idiosyncrasies of that test set, because of the many tests and decisions made during algorithm development. Methodologies that focus on correcting individual test-set errors are particularly subject to this problem and become, in effect, corrective training [12] on the test set. In contrast, since different test sets have different inherent difficulties, comparisons of two systems using different test sets have significantly reduced resolution. The DARPA program has therefore held several "official evaluation tests", the most recent of which was held in June 1988. This test used 25 sentences from each of the 12 speakers (2,546 total words), all of them new data that had not been used in the development process. These evaluation tests provide the best comparison between systems developed at different sites. The development test data, while less useful for comparing systems across sites, is useful for judging progress over time at a single site, subject to the risk that a later system may enjoy an advantage over an earlier one because of training to the test set. The results reported below are identified according to which test set was used: June 88 or development test. Error rates are quoted in the text as "% word error rate", i.e., 100 × (substitutions + deletions + insertions) / (number of words in the reference transcriptions).

The "June 88" CSR system (the system used for the June 88 DARPA tests) uses a continuous-observation HMM with triphone (left- and right-context-sensitive phone) models [13]. The observation probability density functions are diagonal-covariance Gaussians with either a grand (shared) variance or a fixed, perceptually motivated variance. (Both give similar performance on normal speech; however, the perceptually motivated variance appears to be more robust to stress. The grand variance is used in the systems reported here.) The observation vector is a centisecond mel-cepstrum augmented with temporal back differences. The phone models have three states with no state-skip transitions. Only one Gaussian per state is used in SD mode; the SI system is identical except that fourth-order Gaussian mixtures are used.

The system is trained by an unsupervised bootstrapping procedure. The training data is not time-marked; only its orthographic transcription and a dictionary are required. The initial iterations of the Baum-Welch algorithm are performed using monophone (context-free phone) models starting from a uniform initial state, which in effect automatically marks the data. The monophone models are then used to provide initial values for the (single-Gaussian) triphone models, and a few more iterations are performed. If mixtures are to be used, minor random perturbations of the single-Gaussian mean vectors initialize the Gaussian mixtures, and a few final iterations are performed.

During recognition, the system extrapolates (guesses, based on a linear combination of the available triphones) the triphones that were not observed during training. The recognition environment is modeled by adaptive background states. To control the relative numbers of word insertions and deletions, the likelihood is multiplied by a penalty for each word. A Viterbi beam search using a finite-state grammar with optional interword silences produces the recognized output.
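The recognition pass can be pictured with the following heavily simplified sketch (hypothetical Python, not the Lincoln implementation): each word is collapsed to a single HMM state whose per-frame acoustic log-likelihoods are assumed to be precomputed, the word-pair grammar supplies the allowed successors, the word insertion penalty is applied in the log domain at every word boundary, and hypotheses falling outside the beam are pruned.

def viterbi_beam(frame_logliks, successors, word_penalty, beam=10.0):
    """Token-passing Viterbi sketch over a word-pair grammar.

    frame_logliks[t][w] -- acoustic log-likelihood of word w at frame t
                           (each word collapsed to a single HMM state here).
    successors[w]       -- words the grammar allows to follow w.
    word_penalty        -- log-domain penalty paid at every word transition,
                           used to balance insertions against deletions.
    beam                -- hypotheses scoring more than this far below the
                           best hypothesis are pruned at each frame.
    """
    # Each token is (score, word history); any word may start at frame 0.
    tokens = {w: (ll, [w]) for w, ll in frame_logliks[0].items()}
    for t in range(1, len(frame_logliks)):
        new_tokens = {}
        for w, (score, hist) in tokens.items():
            # Self-loop: stay within the same word.
            cand = score + frame_logliks[t][w]
            if w not in new_tokens or cand > new_tokens[w][0]:
                new_tokens[w] = (cand, hist)
            # Cross a word boundary to an allowed successor, paying the penalty.
            for nxt in successors.get(w, ()):
                cand = score + frame_logliks[t][nxt] - word_penalty
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, hist + [nxt])
        best = max(score for score, _ in new_tokens.values())
        tokens = {w: tk for w, tk in new_tokens.items() if tk[0] >= best - beam}
    return max(tokens.values(), key=lambda tk: tk[0])  # (score, word sequence)

Raising the penalty trades insertions for deletions, which is the balancing role the text assigns to the per-word penalty; in the full system each word would instead expand into its sequence of triphone models, with optional interword silences and the adaptive background states mentioned above.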