<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1015">
  <Title>SPEECH RECOGNITION IN SRI'S RESOURCE MANAGEMENT AND ATIS SYSTEMS</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SPEECH RECOGNITION IN SRI'S RESOURCE MANAGEMENT
AND ATIS SYSTEMS
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SRI International, Menlo Park, CA 94025
OVERVIEW
</SectionTitle>
    <Paragraph position="0"> This paper describes improvements to DECIPHER, the speech recognition component in SRI's Air Travel Information Systems (ATIS) and Resource Management systems. DECIPHER is a speaker-independent continuous speech recognition system based on hidden Markov model (HMM) technology. We show significant performance improvements in DECIPHER due to (1) the addition of tied-mixture HMM modeling, (2) rejection of out-of-vocabulary speech and background noise while continuing to recognize speech, (3) adaptation to the current speaker, and (4) the implementation of N-gram statistical grammars with DECIPHER. Finally, we describe our performance in the February 1991 DARPA Resource Management evaluation (4.8 percent word error) and in the February 1991 DARPA-ATIS speech and SLS evaluations (95 sentences correct, 15 wrong of 140). We show that, for the ATIS evaluation, a well-conceived system integration can be relatively robust to speech recognition errors and to linguistic variability and errors.</Paragraph>
    <Paragraph position="1"> Introduction
The DARPA ATIS Spoken Language System (SLS) task represents significant new challenges for speech and natural language technologies. For speech recognition, the SLS task is more difficult than our previous task, DARPA Resource Management, along several dimensions: it is recorded in a noisier environment, the vocabulary is not fixed, and, most important, it is spontaneous speech, which differs significantly from read speech.</Paragraph>
    <Paragraph position="2"> Spontaneous speech is a significant challenge to speech recognition, since it contains false starts and non-words, and because it tends to be more casual than read speech. It is also a major challenge to natural language technologies because the structure of spontaneous language differs dramatically from the structure of written language, and almost all natural language research has been focused on written language.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SLS Architecture
</SectionTitle>
    <Paragraph position="0"> SRI has developed a spoken language system (SLS) for DARPA's ATIS benchmark task \[1\]. This system can be broken into two distinct components: the speech recognition component and the natural language component. DECIPHER, the speech recognition component, accepts the speech waveform as input and produces a word list. The word list is processed by the natural language (NL) component, which generates a database query (or no response).</Paragraph>
    <Paragraph position="1"> This simple serial integration of speech and natural language processing works well because the speech recognition system uses a statistical language model to improve recognition performance, and because the natural language processing uses a template matching approach that makes it somewhat insensitive to recognition errors. SRI's SLS achieves relatively high performance because the SLS-level system integration acknowledges the imperfect performance of the speech and natural language technologies. Our natural language component is described in another paper in this volume \[2\]. This paper focuses on the speech recognition system and the evaluation of the speech recognition and overall ATIS SLS systems.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Resource Management Architecture
</SectionTitle>
      <Paragraph position="0"> SRI has also evaluated DECIPHER using DARPA's Resource Management task \[3,4\]. The system architecture for this task is simply the speech recognition system with no NL postprocessing.</Paragraph>
      <Paragraph position="1"> There are two language models used in the evaluation: a perplexity 60 word-pair grammar, and a perplexity 1000 all-word grammar.</Paragraph>
      <Paragraph position="2"> The output is simply an attempted transcription of the input speech.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="94" type="metho">
    <SectionTitle>
DECIPHER
</SectionTitle>
    <Paragraph position="0"> This section reviews the structure of the DECIPHER system. We use 256-word speaker-independent codebooks to vector-quantize the Mel-cepstra and the Mel-cepstral differences. The resulting four-feature-per-frame vector is used as input to the DECIPHER HMM-based speech recognition system.</Paragraph>
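As an illustration of this front end, the following is a minimal sketch of vector quantization against a fixed codebook; the function name, the NumPy interface, and the example usage are ours, not part of DECIPHER.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codeword.

    frames:   (T, D) array, e.g. Mel-cepstra or Mel-cepstral differences
    codebook: (256, D) array of speaker-independent codewords
    returns:  (T,) array of codeword indices used as discrete HMM observations
    """
    # Squared Euclidean distance from every frame to every codeword
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Each feature stream (cepstra, cepstral differences, ...) would be quantized
# with its own codebook, giving one discrete symbol per stream per frame.
```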
    <Section position="1" start_page="0" end_page="94" type="sub_section">
      <SectionTitle>
Pronunciation Models
</SectionTitle>
      <Paragraph position="0"> DECIPHER uses pronunciation models generated by applying a phonological rule set to word baseforms. The techniques used to generate the rules are described in \[6\] and \[5\]. These generate approximately 40 pronunciations per word as measured on the DARPA Resource Management vocabulary and 75 per word on the ATIS vocabulary. Speaker-independent pronunciation probabilities are then estimated using these bushy word networks and the forward-backward algorithm in DECIPHER. The networks are then pruned so that only the likely pronunciations remain--typically about 4 per word for the Resource Management task and 2.6 per word on the ATIS task. This modeling of pronunciation is one of the ways that DECIPHER is distinguished from other HMM-based systems. We have shown in \[6\] that this modeling reduces error rate.</Paragraph>
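A minimal sketch of the pruning step described above, assuming a per-word dictionary of estimated pronunciation probabilities; the relative threshold, the function name, and the toy phone strings are illustrative, not the values used in DECIPHER.

```python
def prune_pronunciations(pron_probs, keep_ratio=0.1):
    """Keep only pronunciations whose estimated probability is within a
    factor `keep_ratio` of the most likely one, then renormalize.

    pron_probs: dict mapping a pronunciation (tuple of phones) to the
                probability estimated for it by forward-backward training
    keep_ratio: relative threshold (hypothetical value)
    """
    best = max(pron_probs.values())
    kept = {p: pr for p, pr in pron_probs.items() if pr >= keep_ratio * best}
    total = sum(kept.values())
    return {p: pr / total for p, pr in kept.items()}

# A bushy network of ~40 variants per word would typically collapse to a few
# likely ones, e.g.:
variants = {("ae", "n", "d"): 0.55, ("ax", "n", "d"): 0.30,
            ("ax", "n"): 0.12, ("ae", "n", "t"): 0.03}
print(prune_pronunciations(variants))   # drops the 0.03 variant
```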
    </Section>
    <Section position="2" start_page="94" end_page="94" type="sub_section">
      <SectionTitle>
Acoustic Modeling
</SectionTitle>
      <Paragraph position="0"> DECIPHER builds and trains word models by using context-dependent phone models arranged according to the pronunciation networks for the word being modeled. Models used include unique-phone-in-word, phone-in-word, triphone, biphone, and generalized biphone and triphone models, as well as context-independent models.</Paragraph>
      <Paragraph position="1"> Similar contexts are automatically smoothed together, if they do not adequately model the training data, according to a deleted-estimation interpolation algorithm similar to \[7\]. The acoustic models reflect both within-word and across-word coarticulatory effects. Training proceeds as follows: * Initially, context-independent boot models are estimated from hand labels in the TIMIT training database.</Paragraph>
      <Paragraph position="2"> * The boot models are used as input for a two-iteration context-independent model training run, where context-independent models are refined and pronunciation probabilities are calculated using the full word networks. These large networks are then pruned by eliminating low probability pronunciations.</Paragraph>
      <Paragraph position="3"> * Context-dependent models are then estimated from a second two-iteration forward-backward run, which uses the context-independent models and the pruned networks from the previous iterations as input.</Paragraph>
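The deleted-estimation smoothing mentioned above can be pictured with the following minimal sketch; the two-level form, the fixed weight, and the names are our simplifications (in practice the interpolation weights are themselves estimated on held-out data, per the algorithm cited).

```python
import numpy as np

def smooth_with_parent(cd_counts, ci_dist, lam=0.7):
    """Interpolate a context-dependent output distribution with its
    context-independent parent, in the spirit of deleted interpolation.

    cd_counts: raw observation counts for the context-dependent state
    ci_dist:   normalized context-independent distribution (same length)
    lam:       weight on the context-dependent estimate; fixed here, but
               normally estimated per state from held-out ("deleted") data,
               so that poorly trained contexts lean on the parent model
    """
    cd_dist = cd_counts / cd_counts.sum()
    return lam * cd_dist + (1.0 - lam) * ci_dist

# Example: a rarely seen triphone (10 frames of data) stays close to its
# context-independent parent.
print(smooth_with_parent(np.array([6.0, 3.0, 1.0]),
                         np.array([0.5, 0.3, 0.2]), lam=0.3))
```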
    </Section>
  </Section>
  <Section position="5" start_page="94" end_page="97" type="metho">
    <SectionTitle>
ACOUSTIC MODELING IMPROVEMENTS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="94" end_page="94" type="sub_section">
      <SectionTitle>
Tied Mixtures
</SectionTitle>
      <Paragraph position="0"> We have implemented tied-mixture HMMs (TM-HMMs) in the DECIPHER system. Tied mixtures were first described by Huang \[9\] and more recently by Bellegarda and Nahamoo \[8\].</Paragraph>
      <Paragraph position="1"> TM-HMMs use Gaussian mixtures as HMM output probabilities.</Paragraph>
      <Paragraph position="2"> The mixture weights are unique to each phonetic model used, but the set of Gaussians is shared among the states. The tied Gaussians could be viewed as forming a Gaussian-based VQ codebook that is reestimated by the HMM forward-backward algorithm.</Paragraph>
      <Paragraph position="3"> Our implementation of TM-HMMs has the following characteristics: * We used 12-dimensional diagonal-covariance Gaussians.</Paragraph>
      <Paragraph position="4"> The variances were estimated and then smoothed with grand variances.</Paragraph>
      <Paragraph position="5"> * Computation can be significantly reduced in TM-HMMs by pruning either the mixture weights or the Gaussians themselves. We found that shortfall threshold Gaussian pruning--discarding all Gaussians whose probability density of the input at a frame is less than a constant times the best probability density for that frame--works as well for us as standard top-N pruning (keeping the N best Gaussians) and requires less computation (a sketch of this pruning appears after this list).</Paragraph>
      <Paragraph position="6"> * We use two separate sets of Gaussian mixtures for our TM-HMMs; one for Mel cepstra and one for Mel-cepstral derivatives. We retained our discrete distribution models for our energy features.</Paragraph>
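As a concrete illustration of the tied-mixture output computation and the shortfall pruning described above, here is a small log-domain sketch; the interfaces and the shortfall constant are ours and only indicative of the idea, not DECIPHER's implementation.

```python
import numpy as np

def gaussian_log_densities(x, means, log_vars):
    """Log density of one frame x under each tied diagonal-covariance Gaussian.
    x: (D,); means, log_vars: (K, D)."""
    diff = x - means                                        # (K, D) by broadcasting
    return -0.5 * np.sum(log_vars + diff ** 2 / np.exp(log_vars)
                         + np.log(2.0 * np.pi), axis=1)

def tm_state_logprob(x, means, log_vars, log_weights, shortfall=1e-4):
    """Tied-mixture state output log-probability with shortfall pruning:
    Gaussians whose density falls below `shortfall` times the best density
    for this frame are dropped before mixing with the state's weights."""
    log_dens = gaussian_log_densities(x, means, log_vars)
    keep = log_dens >= log_dens.max() + np.log(shortfall)
    terms = log_weights[keep] + log_dens[keep]
    m = terms.max()
    return m + np.log(np.exp(terms - m).sum())              # log-sum-exp
```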
      <Paragraph position="7"> Corrective training \[5,10,11\] was used to update the mixture weights for the TM-HMMs. The algorithm is identical to that used for discrete HMMs. That is, the mixture weights are updated as if they were discrete output probabilities. No mixture means or variances were corrected.</Paragraph>
      <Paragraph position="8"> We evaluated TM-HMMs on the RM task using the perplexity 60 word-pair grammar. Our training corpus was the standard 3990-sentence training set. We used the combined DARPA 1988, February 1989, and October 1989 test sets for our development set. This contains 900 sentences from 32 speakers. We achieved a 6.8 percent word error rate using our discrete HMM system on this test set. The TM-HMM approach achieved an error rate of 5.5 percent. Thus, the TM-HMMs reduced the word error rate by 20 percent relative to the discrete HMMs.</Paragraph>
    </Section>
    <Section position="2" start_page="94" end_page="95" type="sub_section">
      <SectionTitle>
Male-Female Separation
</SectionTitle>
      <Paragraph position="0"> In the June 1990 DARPA Speech and Natural Language meeting \[5\], we reported a 20 percent reduction in RM word error rate from training separate male and female recognizers, decoding with recognizers of both sexes, and then choosing the sex according to the recognizer with the highest-probability hypothesis. This improvement was achieved using a recognizer trained on 11,190 sentences. We did not achieve a significant improvement using male-female separation on the smaller 3990-sentence training set. We set out to see, as has been claimed in \[8\], whether TM-HMMs can take advantage of male-female separation with smaller (3990-sentence) training sets. Our results were mixed. Although performance did improve from 5.5 percent word error with combined models to 4.9 percent word error with separate male-female models (a 10 percent improvement), we note that two-thirds of the overall improvement was due to the dramatic improvement for speaker HXS. Aside from this one speaker, the performance gain was not significant. Based on our last study, however, we are confident that male-female separation does improve performance with sufficient training data. The table below shows performance for tied-mixture HMMs using combined and sex-separated models.</Paragraph>
      <Paragraph position="1">  There was no significant additional gain from using corrective training in addition to male-female separation. Performance improved from 4.9 percent error (male-female only) or 4.7 percent error (corrective training only) to 4.5 percent error (both methods). This lack of further improvement is due to the reduction in training data.</Paragraph>
    </Section>
    <Section position="3" start_page="95" end_page="96" type="sub_section">
      <SectionTitle>
Speaker Adaptation
</SectionTitle>
      <Paragraph position="0"> We have begun experiments in speaker adaptation, converting speaker-independent models into speaker-dependent ones. Our experiment used VQ codebook adaptation via tied-mixture HMMs as proposed by Rtischev \[13\]. That is, we adjusted VQ codeword locations based on forward-backward alignments of adaptation sentences. However, since we are using a tied-mixture recognition system, we adapted the Gaussian means instead of the codebook.</Paragraph>
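One way to realize this mean adaptation, as a hedged sketch: accumulate posterior-weighted statistics for each tied Gaussian over the adaptation sentences and re-estimate its mean, leaving rarely observed Gaussians at their speaker-independent values. The posterior matrix, the occupancy floor, and the function name are ours, not necessarily SRI's procedure.

```python
import numpy as np

def adapt_means(si_means, frames, posteriors, min_count=5.0):
    """Shift tied-mixture Gaussian means toward a new speaker's data.

    si_means:   (K, D) speaker-independent means
    frames:     (T, D) frames from the adaptation sentences
    posteriors: (T, K) per-frame Gaussian occupation probabilities taken
                from a forward-backward alignment of those sentences
    min_count:  occupancy floor (hypothetical); Gaussians with too little
                adaptation data keep their speaker-independent means
    """
    counts = posteriors.sum(axis=0)            # (K,) expected frames per Gaussian
    sums = posteriors.T @ frames               # (K, D) weighted frame sums
    adapted = si_means.copy()
    seen = counts >= min_count
    adapted[seen] = sums[seen] / counts[seen, None]
    return adapted
```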
      <Paragraph position="1"> We selected 21 of the speakers in our development test set for use in an adaptation experiment. We had either 25 or 30 Resource Management sentences recorded for each of these speakers. We chose to use their first 20 sentences for adaptation, and the other 5 or 10 sentences for adaptation testing.</Paragraph>
      <Paragraph position="2"> Using our original TM-HMM models, we achieved an error rate of 7.4 percent (114 errors in 1541 reference words) on this adaptation test set. After adjusting means for each speaker using the 20 adaptation sentences, we achieved an error rate of 6.1 percent (94 errors in 1541 reference words) on the adaptation test sentences.</Paragraph>
      <Paragraph position="3"> This improvement with adaptation leads to performance that is still quite short of speaker-dependent accuracy (the ultimate goal of adaptation). Thus, it does not seem worth the added inconvenience of obtaining 20 known sentences from a potential system user, though it is promising for on-line adaptation. We plan to look into several areas for further improvement. For example:  1. Rtischev et al. \[14\] have shown that adapting mixture weights is at least as important as adapting means.</Paragraph>
      <Paragraph position="4"> 2. Kubala et al. \[15\] have shown that adapting from speaker-dependent models can be superior to adapting from speaker-independent models.</Paragraph>
      <Paragraph position="5"> 3. It is possible that the adaptation sentences need not be supervised, given the relatively good (7.4 percent error) initial performance.
Rejection of Out-of-Vocabulary Input
We implemented a version of DECIPHER that rejects false input as well as recognizing legal input (our standard recognizer attempts to classify all the input). In addition to standard word models, it uses an out-of-vocabulary word model to recognize the extraneous input. The word model has a pronunciation network similar to that of \[17\].
[Figure: pronunciation network of the out-of-vocabulary word model]
There are 67 phonetic models on each of the arcs in this word network. All phonetic transition probabilities in the network are equal, and are scaled by a parameter that adjusts the trade-off between false rejection and false acceptance.</Paragraph>
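A minimal sketch of how such a filler arc might be parameterized; the uniform distribution over 67 phone models follows the text, while the penalty parameter name, the log-domain form, and the toy comparison are our own illustration.

```python
import numpy as np

def filler_arc_logprob(n_phones=67, penalty=0.5):
    """Per-phone log transition probability on an arc of the filler
    (out-of-vocabulary) word model: all phones are equally likely, and the
    whole arc is scaled by a penalty that trades false rejection against
    false acceptance (a smaller penalty makes the filler less attractive,
    so less speech is absorbed by it)."""
    return np.log(penalty) - np.log(n_phones)

# Toy comparison for one segment: hypothetical acoustic log-likelihoods for
# the best in-vocabulary word versus four phones taken through the filler.
word_score = -120.0
filler_score = -110.0 + 4 * filler_arc_logprob()
print("reject segment" if filler_score > word_score else "keep word hypothesis")
```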
      <Paragraph position="6"> Thus far, we have performed a pilot study that shows this method to be promising. We gathered a database of 58 sentences in total from six people. About half of the sentences are digit strings, and the other half are digits mixed with other words. There are a total of 426 digits in the database, and 176 additional non-digit words. Example sentences are outlined in Table 3.</Paragraph>
      <Paragraph position="7"> We considered correct recognition for these sentences to be the digits in the string without the rest of the words (i.e. 2138767287, 3876541104, 33589170429 are the correct answers for the top three sentences in Table 3).</Paragraph>
      <Paragraph position="8"> We trained a digit recognizer with rejection from the Resource Management training set and achieved a word error rate of 5.3 percent for the 27 sentences that contained only digits (13 errors = 1 insertion, 3 deletions, 9 substitutions in 243 reference words), which is within one error of the system without rejection. Thus, in this pilot study, using rejection did not hurt performance for &amp;quot;clean&amp;quot; input. The overall error rate was 11.7 percent (26 insertions, 15 deletions, 9 substitutions in 426 reference words). That is, 402 of 426 digits were detected, and at least 141 of the 176 extraneous words were rejected.</Paragraph>
      <Paragraph position="9"> my parents number is 2 1 3 um 8 7 6 ok 7 2 8 7
if you have questions please dial extension 3 8 7 6 at 5 4 1 1 oh 4
please call 3 3 5 8 9 1 um 7 oh 4 2 9
hmm let's see what's this 1 2 3 4 5 uh that's not right 2 3 4 5
TABLE 3. Example Sentences
We used a bigram language model to constrain the speech recognition system for the ATIS evaluation. A back-off estimation algorithm \[16\] was used to estimate the bigram parameters. The training data for the grammar consisted of 5,050 sentences of spontaneous speech from various sites--1,606 from MIT's ATIS data collection project, 774 from NIST CD-ROM releases, 538 from SRI's ATIS data collection project, and 2,132 from various other sites.</Paragraph>
      <Paragraph position="10"> Robust estimates for many of the bigram probabilities cannot be achieved since the vast majority of them are seen very infrequently (because of the lack of sufficient training data). Furthermore, frequencies of words such as months and cities were biased by the data collection scenarios and the time of year the data was collected. To reduce these effects, words with effectively similar usage were assigned to groups, and instead of collecting counts for the individual words, counts were collected for the groups. After estimation of the bigram probabilities, the probabilities of transitioning to individual words were assigned the group probability divided by the number of words in the group. This scheme not only reduced some of the problems due to the sparse training data, but also allowed some unseen words (other city names, restriction codes, etc.) to be easily added to the grammar. The table below contains the groups of words tied together.</Paragraph>
      <Paragraph position="11"> months, days, digits, teens, decades, date-ordinals, cities, airports, states, airlines, class-codes, restriction-codes, fare-codes, airline-codes, aircraft-codes, airport-codes, other-codes
TABLE 4. Tied Groups
Using our back-off bigram on our ATIS development set (most of the June 1990 DARPA-ATIS test set), we achieved a 14.1 percent word error rate with a test-set perplexity of 19 (not counting 6 words not covered by the grammar). When we applied this grammar to the February 1991 ATIS evaluation test set (200 sentences), the perplexity was 43, excluding 26 instances of words not covered by our vocabulary. For the 148 Class A sentences, the recognition word error rate was 17.8 percent.</Paragraph>
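A minimal sketch of the group-sharing scheme described above: bigrams are estimated over the tied groups, and each member word receives an equal share of its group's probability. The dictionary-based interfaces and the toy numbers are illustrative only.

```python
def word_bigram_prob(prev, word, group_bigram, word_to_group, group_members):
    """Bigram probability of `word` following `prev` when bigrams are
    estimated over word groups (e.g. cities, months) and a group's
    probability is shared uniformly among its member words."""
    g_prev = word_to_group[prev]
    g_next = word_to_group[word]
    p_group = group_bigram[g_prev][g_next]
    return p_group / len(group_members[g_next])

# Toy example: adding a new city to the "cities" group needs no new counts.
word_to_group = {"to": "to", "boston": "cities", "san-jose": "cities"}
group_members = {"to": ["to"], "cities": ["boston", "san-jose"]}
group_bigram = {"to": {"cities": 0.4, "to": 0.01}}
print(word_bigram_prob("to", "san-jose", group_bigram, word_to_group, group_members))  # 0.2
```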
      <Paragraph position="12"> We also explored various class-grammar implementations.</Paragraph>
      <Paragraph position="13"> These grammars were generated by interpolating word-based bigrams with class-based bigrams. We were able to vary the grammars and their perplexities by varying the interpolation coefficients. However, recognition performance never improved over that for the back-off bigram. In fact, accuracy remained relatively constant throughout a large range of perplexities.</Paragraph>
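The interpolation just described might be sketched as follows; the functional form (a fixed mixture of the two bigram estimates) and the coefficient name alpha are assumptions, since the text does not give the exact scheme.

```python
def interpolated_bigram(prev, word, p_word, p_class, alpha=0.5):
    """Mix a word-based bigram estimate with a class-based one.  Sweeping
    alpha between 0 and 1 produces the family of grammars (and perplexities)
    referred to in the text; alpha here is an assumed fixed coefficient."""
    return alpha * p_word(prev, word) + (1.0 - alpha) * p_class(prev, word)
```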
      <Paragraph position="14"> Table 5 illustrates recognition accuracy using bigrams with different perplexities on our ATIS development test set. A preliminary set of models was used for recognition (with 442 words in the vocabulary), and the grammars were estimated using 2,909 sentences. These tables also illustrate that recognition performance did not depend strongly on the test-set perplexity. Clearly, other factors are dominating performance. We believe that one of our most pressing needs in this research is to understand what this bottleneck is, and to develop measures that characterize it better than perplexity does.</Paragraph>
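For reference, test-set perplexities like those quoted in this section can be computed as in the sketch below; how out-of-vocabulary words are handled (here they are skipped but counted) is a design choice, and the interfaces are ours.

```python
import math

def test_set_perplexity(sentences, bigram_logprob, vocab):
    """Bigram test-set perplexity, skipping (but counting) out-of-vocabulary
    words, in the spirit of the figures reported above.

    sentences:      iterable of word lists
    bigram_logprob: function (prev, word) -> natural-log probability
    vocab:          set of in-vocabulary words
    """
    total_logprob, n_words, n_oov = 0.0, 0, 0
    for sentence in sentences:
        prev = "<s>"                      # sentence-start token
        for word in sentence:
            if word not in vocab:
                n_oov += 1
                continue                  # skip OOV words entirely
            total_logprob += bigram_logprob(prev, word)
            n_words += 1
            prev = word
    return math.exp(-total_logprob / n_words), n_oov
```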
    </Section>
    <Section position="4" start_page="96" end_page="97" type="sub_section">
      <SectionTitle>
Multi-Word Lexical Units
</SectionTitle>
      <Paragraph position="0"> Many words occur with sufficient frequency and with significant cross-word coarticulation that a better acoustic model might be made by training these word combinations as a single word model. These words include &amp;quot;what-are-the,&amp;quot; &amp;quot;give-me,&amp;quot; etc., which can have a variety of pronunciations best modeled with a network of phones representing the phonetic and phonological variation of the whole sequence (&amp;quot;what're-the,&amp;quot; &amp;quot;gimme,&amp;quot; etc.) instead of each word separately.</Paragraph>
      <Paragraph position="1"> Also, when considering class grammars, multiple word sequences allow classes which could not be constructed by considering every word separately. For instance, having distinct models of all the restriction codes (e.g. &amp;quot;v-u-slash-one&amp;quot;) might be more appropriate than modeling alpha-&gt;alpha-&gt;slash-&gt;number in the bigram. The latter form would allow all the alphabet letters to transition to all the alphabet letters, with probabilities as prescribed by the bigram, and would incorrectly increase the probability for invalid restriction codes.</Paragraph>
      <Paragraph position="2"> This multi-word technique allows all the probabilities of all the restriction codes to be tied together, so that all are equally covered at the appropriate place in the grammar, instead of depending completely on the individual words' statistics estimated from sparse training data. The multi-word approach resulted in only a slight performance improvement compared to a system where non-coarticulatory multi-words were left separated. That is, for the &amp;quot;separate words&amp;quot; system, words like &amp;quot;a p slash eighty&amp;quot; were separate words, but coarticulatory word models like &amp;quot;what-are-the&amp;quot; and &amp;quot;list-the&amp;quot; were retained. On a  Note that the higher perplexity of the multi-word system is deceiving since high probability grammar transitions are now hidden within the multi-word models, and are not seen by the grammar. Tables 7 and 8 list the various multi-word units.</Paragraph>
      <Paragraph position="3"> flights-from, what-is-the, show-me-the, show-me-all, show-me, how-many, one-way, what-are-the, give-me, what-is, i-would-like, i'd-like-to, what-does
TABLE 7. Coarticulatory Multi-Words
san-francisco, washington-d-c, ...</Paragraph>
      <Paragraph position="4"> a-l, c-o, t-w-a, u-s-air, ...</Paragraph>
      <Paragraph position="5"> d-c-ten, seven-forty-seven, ...</Paragraph>
      <Paragraph position="6"> a-t-l, b-o-s, s-f-o, d-f-w, ...</Paragraph>
      <Paragraph position="7"> q-x, f-y-b-m-q, k-y, y-n ....</Paragraph>
      <Paragraph position="8"> a-p-eighty, a-p-slash-eighty,...</Paragraph>
      <Paragraph position="9"> d-u-r-a, e-q-p, r-t-n-max ....</Paragraph>
      <Paragraph position="10"> TABLE 8. Semantic Multi-Words</Paragraph>
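To make the mechanism concrete, here is a hedged sketch of how word sequences might be merged into single multi-word lexical units before lexicon and language-model training; the greedy longest-match strategy and the function name are our own illustration, not necessarily SRI's procedure.

```python
def merge_multiwords(tokens, multiwords):
    """Greedily replace word sequences with single multi-word lexical units.

    multiwords: set of tuples, e.g. {("what", "are", "the"), ("d", "c", "ten")}
    """
    out, i = [], 0
    max_len = max(len(m) for m in multiwords)
    while i < len(tokens):
        for n in range(max_len, 1, -1):          # try longest match first
            if tuple(tokens[i:i + n]) in multiwords:
                out.append("-".join(tokens[i:i + n]))
                i += n
                break
        else:                                    # no multi-word starts here
            out.append(tokens[i])
            i += 1
    return out

print(merge_multiwords("what are the flights from boston".split(),
                       {("what", "are", "the"), ("flights", "from")}))
# ['what-are-the', 'flights-from', 'boston']
```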
    </Section>
  </Section>
</Paper>