<?xml version="1.0" standalone="yes"?> <Paper uid="H89-1042"> <Title>SRI's DECIPHER System</Title> <Section position="3" start_page="238" end_page="238" type="metho"> <SectionTitle> 2 DECIPHER's Basic Design </SectionTitle> <Paragraph position="0"> SRI's DECIPHER speech recognition system uses discrete-density, 3-state hidden Markov models to represent phones. Four discrete densities per state model the variation of vector-quantized Mel-cepstra, vector-quantized Mel-cepstral time derivatives, quantized energy, and quantized energy time derivatives. Word models are constructed from network representations of word pronunciations and from a set of phone models (context-independent, left biphones, right biphones, triphones, and phone-in-word models). The more samples of a word available in the system's training set, the more specific the contexts used for the phone models in the word. The most detailed, primary models are smoothed by averaging in other models of less specific context, with weights estimated automatically using an SRI version of IBM's deleted-interpolation algorithm [6].</Paragraph> </Section> <Section position="4" start_page="238" end_page="238" type="metho"> <SectionTitle> 3 Database </SectionTitle> <Paragraph position="0"> The speech database used for training and testing SRI's DECIPHER system is described in [11]. This database, intended for the design and evaluation of algorithms for continuous speech recognition, consists of sentences read in a sound-isolated room. The sentences are appropriate to a naval resource management task based on existing interactive database and graphics programs. The database includes 160 male and female talkers with a variety of dialects. The design includes a partition of the database into independent training and testing portions.</Paragraph> <Paragraph position="1"> The training materials used for the results reported here are the 3950 sentences from 97 training and development talkers that do not overlap the test set reported on. The testing materials used for most of the results reported here are the 150 sentences (1287 words) from the 1987 test sets designated by the National Institute of Standards and Technology (NIST, formerly NBS).</Paragraph> <Paragraph position="2"> The results reported here were obtained with and without the use of a grammar to constrain the recognition search. These conditions are not those that would be used in a real application, but they are simply defined, they allow recognition systems to be evaluated under more than one degree of grammatical constraint, and they have been accepted as standards of comparison. The degree of constraint provided by the grammar is measured by test-set perplexity [7], the geometric mean of the number of words allowed by the grammar at each point in the test set, given the previous words. With no grammatical constraint, any word can follow any other word and the perplexity is equal to the vocabulary size, in this case 1000. The DARPA standard word-pair grammar was created by collecting all two-word sequences allowed in the sentence patterns used to generate the task sentences (as described in [11]). The perplexity of this grammar, as measured on several different 25-sentence test sets from the database, is about 60.</Paragraph>
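To make the perplexity measure concrete, the following is a minimal sketch (not from the original paper) of computing test-set perplexity for a word-pair grammar as defined above: the geometric mean of the number of words the grammar allows at each point. The data structures and the assumption that any word may begin a sentence (branching factor = vocabulary size) are illustrative.

```python
import math

def test_set_perplexity(sentences, allowed_successors, vocabulary_size):
    """Geometric mean of the number of words the grammar allows at
    each point in the test set, given the previous word.

    sentences          -- list of word lists (the test set)
    allowed_successors -- dict mapping each word to the set of words
                          that may follow it (the word-pair grammar)
    vocabulary_size    -- branching factor at sentence start (assumed:
                          any word may begin a sentence)
    """
    log_sum, n_points = 0.0, 0
    for sentence in sentences:
        prev = None
        for word in sentence:
            # Number of choices the grammar permits at this point.
            branches = (vocabulary_size if prev is None
                        else len(allowed_successors[prev]))
            log_sum += math.log(branches)
            n_points += 1
            prev = word
    return math.exp(log_sum / n_points)
```

With no grammatical constraint, every branching factor equals the vocabulary size, so this quantity reduces to 1000; the word-pair grammar yields roughly 60.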
</Section> <Section position="5" start_page="238" end_page="239" type="metho"> <SectionTitle> 4 Phonological Modeling </SectionTitle> <Paragraph position="0"> Pronunciation varies significantly across speakers, as well as within the speech of individuals [5]. However, most current speech recognition systems model words with a single pronunciation or a small number of alternate pronunciations. For systems that use statistical training of models of speech segments, this lack of explicit representation of the range of pronunciation variation causes different phenomena to be averaged together into the same model, resulting in a less precise model. These less precise models are likely to become more problematic as speech recognition systems move from corpora of read speech to the spontaneous speech that can be expected in real applications, since significantly more pronunciation variability occurs in spontaneous than in read speech (cf. [2]).</Paragraph> <Paragraph position="1"> Some previous attempts to explicitly model many pronunciations for each word have led to performance degradation, possibly resulting from (1) many additional parameters to be estimated with the same amount of training data, and (2) unlikely pronunciations, not previously modeled, causing new false alarms. To deal with the first problem, we have designed a method for developing phonological rule sets based on measures of coverage and overcoverage of a database of pronunciations [4], in order to maximize the coverage of pronunciations observed in a corpus while minimizing the size of the pronunciation networks.</Paragraph> <Paragraph position="2"> To address the problem of hypothesizing unlikely pronunciations in inappropriate places, the DECIPHER system incorporates probabilities into our network representation of word pronunciations. The incorporation of pronunciation probabilities has been shown to significantly increase the predictive power of our representation [4].</Paragraph> <Paragraph position="3"> Current databases for training speech recognition systems have too few occurrences of all but the most frequent words to make accurate estimates of pronunciation probabilities. Therefore, we have developed and implemented an automatic method for tying together frequently occurring sub-word units for training. Knowledge embedded in the rule set can be used to determine equivalence classes of nodes that share similar contextual constraints [4]. Nodes in the same equivalence class share training samples. The probabilities in the pronunciation networks combine word-trained probabilities for frequently occurring words with these equivalence-class-trained probabilities.</Paragraph> <Paragraph position="4"> [Table 1 caption fragment: accuracy on the 1987 test set with various lexicons and expansions.] </Paragraph> </Section> <Section position="6" start_page="239" end_page="239" type="metho"> <SectionTitle> 5 Lexicon Performance </SectionTitle> <Paragraph position="0"> We compared performance of the DECIPHER system for a number of different lexicons based on the rule-set development method described above. A rule set with high coverage of a corpus of pronunciations was developed, and pronunciation probabilities were computed for the resulting pronunciation networks using the node equivalence classes described above. The data used to estimate the pronunciation probabilities was the same data used to train the phonetic models.</Paragraph>
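As a rough illustration of the node-tying idea (a sketch under assumed data structures, not the DECIPHER implementation), counts can be pooled across all nodes in an equivalence class before being normalized into arc probabilities:

```python
from collections import defaultdict

def tied_arc_probabilities(arc_counts, node_class):
    """Estimate pronunciation-network arc probabilities with
    equivalence-class tying.

    arc_counts -- dict mapping (node, arc_label) -> observed count
    node_class -- dict mapping each node to its equivalence class, so
                  that nodes sharing similar contextual constraints
                  also share training samples

    Returns a dict mapping (node, arc_label) -> probability.
    """
    # Pool counts over all nodes in the same equivalence class.
    class_counts = defaultdict(float)
    class_totals = defaultdict(float)
    for (node, label), count in arc_counts.items():
        cls = node_class[node]
        class_counts[(cls, label)] += count
        class_totals[cls] += count

    # Each node inherits the normalized distribution of its class.
    return {(node, label): class_counts[(node_class[node], label)]
                           / class_totals[node_class[node]]
            for (node, label) in arc_counts}
```

Word-trained probabilities for frequent words could then be interpolated with these class-trained estimates, mirroring the combination described above.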
<Paragraph position="1"> A series of less bushy networks was derived by eliminating the least probable pronunciations from the networks. We refer to this series as Rule-Single (each word has only one pronunciation), Rule-Sparse (the mean number of pronunciations per word is 1.3), and Rule-Full (the mean number of pronunciations per word is 4.2). Performance was also measured using the lexicon from the BBN BYBLOS system (BBN), the lexicon from the CMU SPHINX system (CMU), and the lexicon developed for an early version of the DECIPHER system, prior to the incorporation of multiple pronunciations.</Paragraph> <Paragraph position="2"> This latter lexicon is referred to as Hand-Single, since it consists of a single pronunciation per word that was specified by hand by an expert linguist.</Paragraph> <Paragraph position="3"> Table 1 shows the results we have obtained with the SRI DECIPHER system using the various lexicons described above. The recognized word strings were aligned against the correct reference word strings and differences were tallied using the DARPA-NIST software package. The percent word correct is 100 - 100 x (substitutions + deletions + insertions) / ref, where ref is the number of words in the set of reference words. The DARPA-NIST homophone-equivalency table is used for the no-grammar condition (perplexity P=1000) and not for the grammar condition (perplexity P=60).</Paragraph> <Paragraph position="4"> The rows labeled BBN and CMU do not compare the DECIPHER system to the BYBLOS or SPHINX systems; rather, they compare the CMU, BBN, and SRI lexicons, all as used in the DECIPHER system without cross-word coarticulatory modeling.</Paragraph> <Paragraph position="5"> The results of Table 1 show that the lexicon has a significant effect on performance: for perplexity 1000, percent word correct ranges from 67.0% (for the BBN lexicon) to 74.1% (for SRI's Rule-Sparse lexicon); for perplexity 60, the range is from 90.6% (Hand-Single) to 93.7% (Rule-Sparse). Within the set of single (or near-single) pronunciation lexicons, the range is nearly as large. Thus, a system that explicitly models only a single pronunciation per word can be improved with careful design of the dictionary of pronunciations. Automatically deriving a dictionary of the most common pronunciations (as in the Rule-Single lexicon) was shown to improve performance over a dictionary of pronunciations carefully designed by hand by an expert linguist (Hand-Single).</Paragraph> <Paragraph position="6"> The improvement from the Rule-Single to the Rule-Sparse lexicon suggests that modeling multiple probabilistic pronunciations can improve recognition performance. The degradation in performance from Rule-Sparse to Rule-Full illustrates the importance of keeping pronunciation networks from getting too bushy, while maintaining coverage of likely pronunciations.</Paragraph>
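The derivation of the less bushy lexicons can be pictured with a simple threshold-pruning sketch. This is illustrative only; the probability threshold, cap parameter, and lexicon representation are assumptions, not the procedure actually used to build Rule-Single and Rule-Sparse.

```python
def prune_pronunciations(lexicon, min_prob=0.1, max_variants=None):
    """Derive a less bushy lexicon by dropping unlikely pronunciations.

    lexicon      -- dict mapping a word to a list of
                    (pronunciation, probability) pairs
    min_prob     -- drop pronunciations below this probability
                    (illustrative threshold)
    max_variants -- optional cap on variants per word; max_variants=1
                    yields a Rule-Single-style lexicon

    Surviving probabilities are renormalized per word.
    """
    pruned = {}
    for word, variants in lexicon.items():
        kept = sorted((v for v in variants if v[1] >= min_prob),
                      key=lambda v: -v[1])
        if not kept:  # always retain at least the best pronunciation
            kept = [max(variants, key=lambda v: v[1])]
        if max_variants is not None:
            kept = kept[:max_variants]
        total = sum(p for _, p in kept)
        pruned[word] = [(pron, p / total) for pron, p in kept]
    return pruned
```

The Rule-Sparse/Rule-Full contrast above suggests the design trade-off such a threshold controls: too aggressive and likely pronunciations are lost, too lax and the networks become bushy enough to hurt accuracy.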
</Section> <Section position="7" start_page="239" end_page="240" type="metho"> <SectionTitle> 6 Cross-word Modeling </SectionTitle> <Paragraph position="0"> Triphone models and models of whole words have been used extensively (e.g., [3]) to take coarticulatory effects into account. However, extending this notion to operate across word boundaries had not been done before 1989. Word-boundary contexts have not typically been used because the resulting word networks can get very large, and because they require keeping track of which ending arcs can map to which starting arcs. Since we had already dealt with these issues in our large pronunciation networks, cross-word boundary contexts were a natural extension to the SRI DECIPHER system.</Paragraph> <Section position="1" start_page="240" end_page="240" type="sub_section"> <SectionTitle> 1989 Speaker-Independent Test Set </SectionTitle> <Paragraph position="0"> Modeling acoustic variations across words was limited to initial and final phones in words with sufficient training data. To illustrate how the algorithm works, consider the initial phone &quot;dh&quot; in the word the. In the training database, there are many instances of words ending in &quot;n&quot; before the. An additional &quot;dh&quot; arc is added to the pronunciation graph of the, but this arc is allowed to connect only to arcs with the &quot;n&quot; phonetic label. The original &quot;dh&quot; arc is prevented from connecting to arcs with the &quot;n&quot; phonetic label. The above algorithm is applied to all words in the vocabulary, provided that at least 15 occurrences of a (previous/next) phone occurred in the training database.</Paragraph>
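A minimal sketch of this arc-splitting step might look as follows. The graph representation, field names, and count lookup are hypothetical; the paper's pronunciation graphs are richer than this.

```python
def split_initial_arc(graph, phone, prev_phone, n_occurrences, min_count=15):
    """Add a context-dependent copy of a word-initial arc.

    graph["initial_arcs"] is assumed to be a list of arcs; each arc is
    a dict with a 'phone' label, an 'only_after' set of permitted
    word-final phones (None means unrestricted), and a 'never_after'
    set of barred word-final phones.
    """
    if n_occurrences < min_count:
        return  # too few training samples for a context-dependent model

    for arc in list(graph["initial_arcs"]):
        if arc["phone"] == phone and arc["only_after"] is None:
            # Context-dependent copy, e.g. "dh" after a word-final "n";
            # it may connect only to arcs labeled prev_phone.
            graph["initial_arcs"].append(
                {"phone": phone, "only_after": {prev_phone},
                 "never_after": set()})
            # The original context-independent arc is barred from
            # following prev_phone, so the two arcs never compete.
            arc["never_after"].add(prev_phone)

# For the example above (hypothetical count table):
# split_initial_arc(the_graph, "dh", "n", n_occurrences=counts[("n", "the")])
```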
<Paragraph position="1"> Table 2 shows that the addition of the cross-word context models improves performance of the DECIPHER system in both the perplexity 60 condition and the perplexity 1000 condition. Also shown in the table, labeled SPHINX, are the best results previously reported for this database (actually obtained using a little more training data) [8].</Paragraph> </Section> </Section> <Section position="8" start_page="240" end_page="240" type="metho"> <SectionTitle> 7 1989 Test Results </SectionTitle> <Paragraph position="0"> Table 3 shows SRI's official results reported at the 1989 DARPA speech and natural language workshop.</Paragraph> <Paragraph position="1"> These results use the Rule-Sparse pronunciation networks and the across-word-boundary pronunciation constraints. Px stands for perplexity.</Paragraph> <Paragraph position="2"> If the tradeoff between insertions and deletions is appropriately adjusted, which was not done for the results in Table 3, performance can be improved slightly: 5.7% substitutions, 1.3% deletions, and 0.8% insertions, for an overall error rate of 7.9%.</Paragraph> <Paragraph position="3"> Speaker-by-speaker performance varies greatly in the DECIPHER system, as in the other systems reported at this workshop. For the official DECIPHER performance results for the 1989 workshop with the 109-speaker training set, speaker performance ranged from 17.8% to 37.1% error (perplexity=1000) and from 3.7% to 14.3% error (perplexity=60). This variability causes difficulties when trying to compare systems from different sites. For instance, when comparing DECIPHER's results to Carnegie Mellon's SPHINX system, we find that with perplexity=1000, SPHINX outperformed DECIPHER on six speakers and DECIPHER outperformed SPHINX on four speakers; with perplexity=60, SPHINX had better performance on 7 speakers. When comparing DECIPHER to the Lincoln Laboratories system, DECIPHER had fewer errors on 6 of 10 speakers with perplexity=1000, and on 7 of 10 with perplexity=60. This analysis shows that the variation across speakers swamps the variation among these three systems, and that the apparent system differences may be due to sampling error. To properly differentiate these systems at a high confidence level would require a test with many more speakers.</Paragraph> </Section> </Paper>