SRI's DECIPHER System 
Hy Murveit, Michael Cohen, Patti Price, Gay Baldwin, 
Mitch Weintraub, and Jared Bernstein 
Speech Research Program, SRI International 
Menlo Park, CA 94025 
Abstract 
SRI has developed a speaker-independent con- 
tinuous speech, large vocabulary speech recogni- 
tion system, DECIPHER, that provides state-of-the- 
art performance on the DARPA standard speaker- 
independent resource management training and test- 
ing materials. SRI's approach is to integrate speech 
and linguistic knowledge into the HMM framework. 
This paper describes performance improvements aris- 
ing from detailed phonological modeling and from the 
incorporation of cross-word coarticulatory constraints. 
1 Introduction 
The hidden Markov model (HMM) formulation is 
a powerful statistical framework that is well-suited to 
the speech recognition problem. Systems based on this 
formulation have improved dramatically, however, as 
developers have learned how to modify them appropri- 
ately to take into account principles from speech re- 
search and from linguistics. Concepts arising from the 
study of the sound system of a language, i.e., phonol- 
ogy, are language-specific; they are done once for En- 
glish and little additional labor is required when, for 
example, the vocabulary or the domain changes. Care 
should be taken, however, in modeling detailed linguis- 
tic structure since the practice can lead to models with 
many additional parameters to be estimated; unless 
this problem is addressed directly, performance gains 
will be compromised. 
SRI is not the first group to incorporate speech 
knowledge and concepts from linguistics in an HMM 
formulation for speech recognition. In fact, many im- 
provements in I-IMM-based systems are implicitly re- 
lated to principles from speech and linguistics, even 
though this was not their original motivation. We sur- 
vey some of these modifications below. 
Phonetic Units. Not all recognizers are based 
on phonetic units. A number of HMM-based speech 
recognition systems have been based on word-level 
models \[1, 12, 9\]. The use of phonetic units allows 
for larger vocabularies by sharing training across sub- 
word units that repeat more frequently than words do. 
Phonetic-based units are now common to many HMM- 
based recognizers (e.g., \[7, 8, 10\]). 
Triphones. Triphones are phonetic models condi- 
tioned on the immediately surrounding phonetic units. 
BBN \[3\] was able to show significant performance 
gain (roughly halving the error rate compared to a 
similar system without the triphone models), pro- 
vided the context-dependent models were averaged 
("smoothed") with the context-independent ones in or- 
der to deal with the large number of poorly trained 
triphone models. Triphones are used extensively now 
(e.g., at BBN, CMU, Lincoln Laboratories, SPd). Tri- 
phones can model major coarticulatory effects de- 
scribed in the speech research literature, as well as 
phonological variation conditioned on the immediately 
surrounding phones. In general, more detailed models 
will perform better than less detailed models, provided 
there is sufficient data to estimate the parameters. For 
60 phones, there are 60 cubed triphones, which repre- 
sents a significant increase in parameters to estimate. 
Triphones would not have shown a performance gain 
had they been introduced without a mechanism to take 
this into account, i.e., in this case, smoothing with 
more general models. 
Difference Parameters. The use of additional, 
independently trainable parameters means that more 
details can be included in the model without a dra- 
matic increase in the amount of training material. In 
particular, recognition performance has been signifi- 
cantly increased \[8, 10\] through the use of codebooks 
that represent the difference between the current value 
of a parameter and its value several frames previously. 
Spectral and energy difference parameters are used in 
addition to their current values. This captures impor- 
tant dynamic patterns exhibited in speech as well as 
the standard static information. 
238 
In this paper we describe SRI's recent work in incor- 
porating linguistic concepts in the DECIPHER system: 
improved phonological modeling and modeling cross- 
word coarticulation. 
2 DECIPHEI:Us Basic Design 
SRI's DECIPHER speech recognition system uses 
discrete density 3-state hidden-Markov models to rep- 
resent phones.. Four discrete densities per state model 
the variation of vector quantized Mel-cepstra, Vector 
quantized Mel-cepstral time-derivatives, and quantized 
energy and energy time-derivatives. Word models are 
constructed from network representations of word pro- 
nunciations and from a set of phone models (context- 
independent, left biphones, right biphones, triphones, 
and phone-in-word models). The more samples of a 
word available in the system's training set, the more 
specific the contexts used for the phone models in 
the word. The most detailed, primary, models are 
smoothed by averaging in other models of less specific 
context with weights estimated automatically using an 
SRI version of IBM's deleted-interpolation algorithm 
\[6\]. 
3 Database 
The speech database used for training and testing 
SRI's DECIPHER system is described in \[11\]. This 
database, intended for the design and evaluation of 
algorithms for continuous speech recognition, consists 
of sentences read in a sound-isolated room. The sen- 
tences are appropriate to a naval resource management 
task based on existing interactive database and graph- 
ics programs. The database includes 160 male and 
female talkers with a variety of dialects. The design 
includes a partition of the database into independent 
training and testing portions. 
The training materials used for the results reported 
here are the 3950 sentences from 97 training and devel- 
opment talkers that do not overlap the test set reported 
on. The testing materials used for most of the results 
reported here are the 150 sentences (1287 words) from 
the 1987 designated test sets designated by the Na- 
tional Institute of Standards and Technology (NIST, 
formerly NBS). 
The results reported here were obtained with and 
without the use of a grammar to constrain the recogni- 
tion search. These conditions are not those that would 
be used in a real application, but they are simply de- 
fined, allow recognition systems to be evaluated over 
more than one condition of grammatical constraint, 
and they have been accepted as standards of com- 
parisons. The degree of constraint provided by the 
grammar is measured by test set perplexity \[7\], or, the 
geometric mean of the number of words allowed by 
the grammar at each point in the test set, given the 
previous words. In the case of no grammatical con- 
straint, any word can follow any other word and the 
perplexity is equal to the vocabulary size, in this case 
1000. The DARPA standard word-pair grammar was 
created by collecting all two-word sequences allowed 
in the sentence-patterns used to generate the task sen- 
tences (as described in \[11\]). The perplexity of this 
grammar as measured on several different 25-sentence 
test sets from the database is about 60. 
4 Phonological Modeling 
Pronunciation varies significantly across speakers, 
as well as in the speech of individuals \[5\]. However, 
most current speech recognition systems model words 
with a single pronunciation or a small number of alter- 
nate pronunciations. For systems which use statistical 
training of models of speech segments, this lack of ex- 
plicit representation of the range of variation of pro- 
nunciation causes different phenomena to be averaged 
together into the same model, resulting in a less pre- 
cise model. These less precise models are likely to be- 
come more problematic as speech recognition systems 
move from eorpera of read speech to the spontaneous 
speech that can be expected in real applications, since 
significantly more pronunciation variability occurs in 
spontaneous than in read speech (c.f. \[2\]). 
Some previous attempts to explicitly model many 
pronunciations for each word have led to performance 
degradation possibly resulting from (1) many addi- 
tional parameters to be estimated with the same 
amount of training data, and (2) unlikely pronuncia- 
tions not previously modeled causing new false alarms. 
To deal with the first condition, we have designed a 
method for developing phonological rule sets based on 
measures of coverage and overcoverage of a database of 
pronunciations \[4\] in order to maximize the coverage of 
pronunciations observed in a corpus, while minimizing 
the size of the pronunciation networks. 
To address the problem of hypothesizing unlikely 
pronunciations in inappropriate places, the DECI- 
PHER system incorporates probabilities into our net- 
work representation of word pronunciations. The in- 
corporation of pronunciation probabilities has been 
shown to significantly increase the predictive power of 
our representation \[4\]. 
239 
Current databases for training speech recognition 
systems have too few occurrences of all but the most 
frequent words to make accurate estimates of pronun- 
ciation probabilities. Therefore, we have developed 
and implemented an automatic method for tying to- 
gether frequently occurring sub-word units for train- 
ing. Knowledge embedded in the rule set can be used 
to determine equivalence classes of nodes that share 
similar contextual constraints \[4\]. Nodes in the same 
equivalence class share training samples. The proba- 
bilities in the pronunciation networks combine word- 
trained probabilities for frequently occurring words 
with these node equivalence class trained probabilities. 
LEXICON 
BBN-lexicon 
CMU-lexicon 
Hand-Single 
Rule-Single 
Rule:Sparse 
Rule-Full 
Mean Number 
of Pronunciations 
per Word 
1.1 
1.0 
1.0 
1.0 
1.3 
4.2 
Px:lO00 
word 
correct 
67.0 
67.5 
69.3 
72.8 
74.1 
72.9 
Px:60 
word 
correct 
91.6 
92.4 
90.6 
92.6 
93.7 
92.6 
Table 1: Phonological Effects in DECIPHER: Word 
accuracy on the 1987 test set with various lexicons and 
expansions. 
5 Lexicon Performance 
We compared performance of the DECIPHER system 
for a number of different lexicons, based on the rule 
set development method described above. A rule set 
with high coverage of a corpus of pronunciations was 
developed, and pronunciation probabilities were com- 
puted for the resulting pronunciation networks using 
the node equivalence classes described above. The 
data used to estimate the pronunciation probabilities 
was the same data used to train the phonetic models. 
A series of less bushy networks was derived by elimi- 
nating the least probable pronunciations from the net- 
works. We refer to this series as Rule-Single (each word 
has only one pronunciation), Rule-Sparse (the mean 
number of pronunciations per word is 1.3), and Rule- 
Full (the mean number of pronunciations per word is 
4.2). Performance was also measured using the lexi- 
con from the BBN BYBLOS system (BBN), from the 
CMU SPHINX system (CMU), and the lexicon devel- 
oped for an early version of the DECIPHER system, 
prior to the incorporation of multiple pronunciations. 
This latter lexicon is referred to as Hand-Single since 
it consists of a single pronunciation per word and was 
specified by hand by an expert linguist. 
Table 1 shows the results we have obtained with 
the SRI DECIPHER system using the various lexi- 
cons described above. The recognized word strings 
were aligned against the correct reference word 
string and differences tallied using the DARPA- 
NIST software package. The word correct is 1- 
100 substituti°ns+deleti°ns+inserti°ns where ref is the r~f 
number of words in the set of reference words. The 
DARPA-NIST homophone-equivalency table is used 
for the no-grammar condition (Perplexity P=1000) 
and not for the grammar condition (Perplexity P=60). 
The lexicons labeled BBN and CMU do not compare 
the DECIPHER system to the BYBLOS system or 
to the SPHINX system, rather it compares the CMU, 
BBN and SRI lexicons all as used in the DECIPHER 
system without cross-word coarticulatory modeling. 
The results of Table 1 show that the lexicon has a 
significant effect on performance: for perplexity 1000, 
percent word correct ranges from 67.0% (for the BBN 
lexicon) to 74.1% (for SRI's Rule-Sparse lexicon); for 
perplexity 60, the range is from 90.6% (Hand-Single) 
to 93.7% (Rule-Sparse). Within the set of single (or 
near single) pronunciation lexicons, the range is nearly 
as large. Thus, a system that explicitly models only 
a single pronunciation per word can be improved with 
careful design of the dictionary of pronunciations. Au- 
tomatically deriving a dictionary of most common pro- 
nunciations (as in the Rule-Single lexicon) was shown 
to improve performance over a dictionary of pronunci- 
ations carefully designed by hand by an expert linguist 
(Hand-Single). 
The improvement from the rule-single to the rule- 
sparse lexicon suggests that modeling multiple proba- 
bilistic pronunciations can improve recognition perfor- 
mance. The degradation in performance from Rule- 
Sparse to Rule-Full illustrates the importance of keep- 
ing pronunciation networks from getting too bushy, 
while mMntaining coverage of likely pronunciations. 
6 Cross-word Modeling 
The use of triphone modeling and models of whole 
words has been used extensively (e.g., \[3\]) to take 
into account coarticulatory effects. However, extend- 
ing this notion to operate across word boundaries had 
not been been done before 1989. ~Vord-boundary con- 
texts have not typically been used because the sizes 
of the resulting word networks can get very large, and 
because it requires keeping track of which ending arcs 
can map to which starting arcs. Since we have already 
dealt with these issues in our large pronunciation net- 
240 
Px:1000 Px:60 
SYSTEM Context word word 
Model? correct correct 
DECIPHER No 74.1 93.7 
Yes 78.3 95.0 
Table 2: Word Accuracy Results With and Without 
Cross-Word Coarticulatory Modeling for the 1987 test 
set). 
Speakers 
in Training 
109 
72 
109 
72 
Px %sub %del %ins %error 
60 5.9 2.5 0.4 8.8 
60 6.4 2.5 0.3 9.2 
1000 I 21.0 5.2 1.5 27.7 
10001 23.4 6.1 1.8 31.3 
Table 3: DECIPHER'S Performance using DARPA's 
1989 Speaker-Independent Test Set 
works, cross-word boundary contexts were a natural 
extension to the SRI DECIPHER system. 
Modeling acoustic variations across words was lim- 
ited to initial and final phones in words with sufficient 
training data. To illustrate how the algorithm works, 
let us consider the initial phone "dh" in the word lhe. 
In the training database, there are many instances of 
words ending in "n" before the. An additional "dh" 
arc is added to the pronunciation graph of lhe, though 
this arc is only allowed to connect to arcs with the "n" 
phonetic label. The original "dh" arc is prevented from 
connecting with arcs with the "n" phonetic label. The 
above algorithm is applied to all words in the vocabu- 
lary, provided that 15 occurrences of a (previous/next) 
phone occurred in the training database. 
Table 2 shows that the addition of the cross-word 
context models improves performance of the DECI- 
PHER system in both the perplexity 60 condition and 
the perplexity 1000 condition. Also shown in the table, 
labeled SPHINX, are the best previous results reported 
on for this database (actually using a little more train- 
ing data) \[8\]. 
7 1989 Test Results 
Table 3 shows SRI's official results reported at the 
1989 DARPA speech and natural language workshop. 
These results use the Rule-Sparse pronunciation net- 
works and the across-word-boundary pronunciation 
constraints. Px stands for perplexity. 
If the tradeoff for insertions and deletions is appro- 
priately changed, which was not done for the results in 
Table 3, performance call be improved slightly: 5.7% 
substitutions, 1.3% deletions, and 0.8% insertions, for 
an overall error rate of 7.9%. 
Speaker-by-speaker performance varies greatly in 
the DECIPHER system, and in other systems that 
were reported in this workshop. For the official DE- 
CIPHER performance results for the 1989 workshop 
for the 109 speaker training set, speaker performance 
ranged from 17.8% to 37.1% (perplexity=1000), and 
3.7% error to 14.3% error (perplexity=60). This vari- 
ability causes difficulties when trying to compare sys- 
tems from different sites. For instance, when compar- 
ing DECIPHER's results to Carnegie-Mellon's Sphinx 
system, we find that with perplexity=1000, Sphinx 
outperformed DECIPHER on six speakers, and DE- 
CIPHER outperformed Sphinx on four speakers. With 
perplexity=1000, Sphinx had better performance on 7 
speakers. When comparing DECIPHER to the Lincoln 
Laboratories system, DECIPHER had fewer errors on 
6 of 10 with perplexity=1000, and 7 of 10 with perplex- 
ity=60. This analysis shows that the variation across 
speakers swamps the variation among these three sys- 
tems, and that the apparent system differences may be 
due to sampling error. To properly differentiate these 
systems at a high confidence level would require a test 
with many more speakers. 
8 Discussion 
In this section we discuss first the results relating 
to choice of lexicon and then the results relating to 
cross-word models. 
Though we have shown an important performance 
gain through improved phonological modeling, we be- 
lieve that more substantial gains will be shown in the 
future for the following reasons: 
1. The system tests were based on read speech rather 
than spontaneous speech. The significant increase 
in phonological reduction and deletion in sponta- 
neous compared to read speech \[2\] should result 
in a bigger difference between systems that in- 
clude techniques for modeling multiple probabilis- 
tic pronunciations and those that do not. 
2. The rule sets used in the studies described here 
were developed using a corpus of hand phonetic 
transcriptions, rather than some form of system 
output. Different types of variation may be more 
important to model explicitly in different systems, 
241 
and are likely to be different from those captured 
by hand transcriptions. 
3. Larger amounts of training data will allow the de- 
sign of more detailed models of phonological vari- 
ation. 
Lee has suggested \[8\] that modeling multiple pro- 
nunciations is not worth-while because (1) it makes 
systems run too slowly, (2) it is impossible to estimate 
pronunciation probabilities, and (3) it unfairly penal- 
izes words with too many pronunciations. Although, 
we believe that improvements can certainly still be 
made in the way we estimate and use our pronunci- 
ation probabilities, it is clear from our studies that 
modelingpronunciation has significant positive impact 
on recognition performance without an excessive cost 
in speed. We suggest that the reason for our opposite 
conclusions lies in the difference between the multiple- 
pronunciation word-networks that CMU and SRI have 
tried. As shown in Table 1, SRI's best network models 
on average about 1.3 pronunciations per word. The 
network shown as an example in \[8\] allows thousands 
of pronunciations, partly because of the excessive de- 
tail in the example and partly because of the lack of 
constraint to correlate the many possibilities. 
As for the cross-word coarticulatory modeling we re- 
port on here, we believe that the performance improve- 
ment can be attributed to the following: (1) for short 
words (and the most frequent words are short, i.e., 
one to three phones long), the word boundaries form 
a significant portion of the context that should not be 
ignored, and (2) many triphones that otherwise would 
not be observed can be found across word boundaries, 
and the additional triphone training can help model 
the less frequent words. 
In sum, the use of speech and linguistic knowledge 
sources can be used to improve the performance of 
HMM-based speech recognition systems, provided that 
care is taken to incorporate these knowledge sources 
appropriately. 
Acknowledgement. This work was principally 
supported by SRI IR&D and investment funding. We 
also gratefully acknowledge the National Science Foun- 
dation and the Defense Advanced Research Projects 
Agency for partial support. 
References 
\[1\] Bakis, R. (1976) "Continuous Speech Recognition 
via Centisecond Acoustic States," J. Acoust Soc. 
Am., Suppl. 1, Vol. 59, $97. 
\[2\] Bernstein, J., G. Baldwin, M. Cohen, H. Murveit, 
and M. Weintraub (1986) "Phonological Studies 
for Speech Recognition", Proc. DARPA Speech 
Recognition Workshop, pp. 41-48. 
\[3\] Chow, Y., R. Schwartz, S. Roucos, O. Kim- 
ball, P. Price, F. Kubala, M. Dunham, M. 
Krasner, and J. Makhoul (1986) "The Role 
of Word-Dependent Coarticulatory Effects in 
a Phoneme-Based Speech Recognition System," 
Proc. ICASSP, pp. 1593-1596. 
\[4\] Cohen, M. (1989) Phonological Structures for 
Speech Recognition, U. C. Berkeley Ph.D. thesis. 
\[5\] Cohen, M., J. Sernstein, and H. Murveit (1987) 
"Pronunciation Variation Within and Across 
Speakers," ll3th Meeting Acoust. Soc. Am., Pg. 
\[6\] Jelinek, F. and R. Mercer (1980) "Interpolated 
Estimation of Markov Source Parameters from 
Sparse Data," pp. 381-397 in E. S. Gelsema and L. 
N. Kanal (editors), Pattern Recognition in Prac- 
tice, North-Holland Publishing Company, Ams- 
terdam, the Netherlands. 
\[7\] Kubala, F., Y. Chow, A. Derr, M. Feng, O. Kim- 
ball, J. Makhoul, P. Price, J. Rohlicek, S. Roucos, 
R. Schwartz, and J. Vandegrift (1988) "Continu- 
ous Speech Recognition Results of the BYBLOS 
System on the DARPA 1000-Word Resource Man- 
agement Database," Proc. ICASSP, pp. 291-294. 
\[8\] K. F. Lee (1988) Large-Vocabulary Speaker- 
Independent Continuous Speech Recognition: the 
SPHINX System, CMU Ph.D. Thesis. 
\[9\] Lippmann, R., E. Martin, and D. Paul (1987) 
"Multi-Style Training for Robust Isolated-Word 
Speech Recognition," Proc. ICASSP, pp. 705- 
708. 
\[10\] Murveit, H. and M. Weintraub (1988) "1000- 
Word Speaker-Independent Continuous-Speech 
Recognition Using Hidden Markov Models," 
Proc. ICASSP, pp. 115-118. 
\[11\] Price, P., W. Fisher, J. Bernstein, and D. Pallett 
(1988) "The DARPA 1000-Word Resource Man- 
agement Database for Continuous Speech Recog- 
nition," Proc. ICASSP, pp. 651-654. 
\[12\] Rabiner, L. R., B. H. Juang, S. E. Levinson, and 
M. M. Sondhi (1985) "Recognition of Isolated Dig- 
its using Hidden Markov with Continuous Mixture 
Densities," ATT Technical Journal, 64(6):1211- 
1233. 
242 
