New Results with the Lincoln Tied-Mixture HMM CSR System 1 
Douglas B. Paul 
Lincoln Laboratory, MIT 
Lexington, Ma. 02173 
ABSTRACT 
The following describes recent work on the Lincoln CSR 
system. Some new variations in semiphone modeling have 
been tested. A very simple improved duration model has 
reduced the error rate by about 10~ in both triphone and 
semiphone systems. A new training strategy has been tested 
which, by itself, did not provide useful improvements but 
suggests that improvements can be obtained by a related 
rapid adaptation technique. Finally, the recognizer has been 
modified to use bigram back-off language models. The sys- 
tem was then transferred from the RM task to the ATIS 
CSR task and a limited number of development tests per- 
formed. Evaluation test results are presented for both the 
RM and ATIS CSR tasks. 
INTRODUCTION 
The following experiments are all carried out in the 
context of the Lincoln tied-mixture (TM) hidden Markov 
model (HMM) continuous speech recognition (CSR) sys- 
tem. This system uses two observation streams (TM-2) 
for speaker-dependent (SD) recognition: mel-cepstra and 
time differential mel-cepstra. For speaker-independent (SI) 
recognition, a second differential mel-cepstral observation 
stream is added (TM-3). The system uses Gaussian tied 
mixture \[1, 2\] observation pdfs and treats each observation 
stream as if it is statistically independent of all others. Tri- 
phone models \[14\], including cross-word triphone models 
\[10, 7, 16\], are used to model phonetic coarticulation. These 
models are smoothed with reduced context phone models 
\[14\]. Each phone model is a three state "linear" (no skip 
transitions) HMM. The phone models are trained by the 
forward-backward algorithm using an unsupervised mono- 
phone (context independent phone) bootstrapping proce- 
dure. The recognizer extrapolates (estimates) untrained 
phone models and recognizes using a Viterbi beam search. 
The initial implementation uses finite-state grammars, con- 
tains an adaptive background model, and allows optional 
inter-word silences. All RM1 development tests use the des- 
ignated SD development test set (100 sentences x 12 speak- 
1This work was sponsored by the Defense Advanced Research 
Projects Agency. 
ers) and all RM2 tests use the designated development test 
set (120 sentences x 4 speakers). 
SEMIPHONES 
One difficulty with the current triphone-based HMM 
systems with cross-word triphone models is that the num- 
ber of triphones becomes very large (~60K triphones) when 
used in a large (20K word) vocabulary task\[ll\]. This re- 
quires estimation of very large numbers of parameters and 
makes execution of the trainer and recognizer inefficient 
on practical hardware. We have previously proposed semi- 
phones as a modeling unit because they significantly reduce 
the number of elemental phonetic models by as much as an 
order of magnitude. (Semiphone models split each phone 
into a triplet of left and right context dependent models\[11\]. 
Semiphones include triphones and "classic" diphones--which 
extend from the center of one phone to the center of the 
next--as special cases.) On the Resource Management (RM) 
task, they reduced the number of unique states by about a 
factor of 5 at the cost of a performance penalty of about 
20% for the speaker-independent (SI) task and 30% for the 
speaker-dependent (SD) task. 
The initial semiphone system used 1 left state, 1 center 
state, and 1 right state (notation: 1-1-1) system \[11\]. (In 
this notation, a triphone system is designated by 0-x-0 and 
a classic diphone system is designated by x-0-y.) We have 
recently explored a number of other variations on the semi- 
phone scheme subject to the constraint of three states per 
phone. The performance of 2-0-1 and 1-0-2 systems is shown 
in table 1. The lower error rate of the 1-0-2 system sug- 
gests that, on the average, the anticipatory coarticulation 
is stronger than the backward coarticulation. This agrees 
with an assertion by Ladefoged that English is dominantly 
an anticipatory coarticulation language \[6\]. 
We have also tested a hybrid triphone-semiphone sys- 
tem. This hybrid used 1-0-2 semiphones for the cross-word 
models and triphones for the word4nternal models. (50K of 
the above mentioned 60K triphones were cross-word-context 
phones.) Its performance was the same as the 1-0-2 system. 
This suggests that the less detailed modeling of the word 
boundary phones is the primary site where information is 
lost in the semiphone systems compared to the triphone 
65 
systems. 
These results may be affected by the lack of richness 
in the RM database--there were 1752 word-internal (WI) 
semiphones and 2413 WI triphones and therefore only 27% 
of the WI triphones were merged in transitioning to the 
semiphone models. Similarly there were 1891 cross-word 
(XW) semiphones and 3580 XW triphones and therefore 
47% of the cross-word (XW) triphones were merged in the 
transition. Thus the transition to semiphones would be ex- 
pected to affect the XW modeling more than the WI mod- 
eling. All of the XW semiphone systems, however, outper- 
form the corresponding non-XW triphone systems. 
Attempts to improve semiphone results by smooth- 
ing the mixture weights with occurrence based smoothing 
weights\[14\] proved unsuccessful. (This form of smooth- 
ing significantly improved the triphone system results \[11\].) 
This correlates with the reduced number of single occur- 
fence models in the semiphone system (1340=37% of the 
semiphones) compared to the triphone system (3094=52% 
of the triphones). 
IMPROVED DURATION MODELING 
The standard HMM system suffers from the difficulty 
that an incorrect phone can minimize its scoring penalty by 
minimizing the dwell time of the path through its model. 
The current CSR uses three states per phone and can suffer 
from this problem for long duration phones. Since there are 
no skip arcs within the phone model, a path can traverse a 
phone in 30 msec (3 time steps). Some phones are essentially 
never produced with this short a duration and therefore an 
incorrect short segment matched to this phone can have too 
high a score. 
One way to minimize this problem is alter the phone 
model to increase the minimum path dwell time to a time 
commensurate with the minimum duration of the phone. 
Since this system does not adapt in any way to the speaking 
rate, the desired minimum would be the minimum duration 
at the fastest speaking speed. Since the available training 
data is not fast speech, a pragmatic estimate of the mini- 
mum might be the shortest observed duration times a safety 
factor. An additional difficulty in estimating the minimum 
duration is that some phones are observed only a very few 
times in the training data thereby making such an estimate 
less reliable. 
For this experiment, a much simpler estimate of the 
minimum duration was chosen. The system was trained 
normally with three states per phone, which has the dual 
advantages of maintaining a uniform phone topology to al- 
low smoothing between different phone models and of not 
increasing the number of parameters to be estimated. Fi- 
nally, states whose average duration (as computed from the 
stay transition probability) was above a constant were split 
into a linear sequence of states until each final state had 
an average duration below the constant. Each of the split 
states shared the same observation pdf--only the stay and 
move transition probabilities were altered on the split states. 
Since no skip transitions were allowed in the phone models, 
the minimum duration was proportional to the final number 
of states in the phone. 
This simple strengthening of the duration model im- 
proved the triphone system results by about 10% for both SI 
and SD systems (Table 2). This result is in agreement with 
a similar improvement obtained adding minimum phone du- 
ration constraints to a large vocabulary IWR\[8\]. The overall 
amount of computation was not significantly changed. Es- 
sentially all of the word error rate reduction was a result of 
reduced word insertion and deletion error rates. 
NEW TRAINING STRATEGY WITH 
IMPLICATIONS FOR ADAPTATION 
A modified multi-speaker/speaker-independent training 
strategy was tested. The standard strategy used to date has 
been: 
1. Monophone bootstrap 
2. Train triphones (all parameters trained on all speak- 
ers) 
The new strategy is: 
1. Monophone bootstrap (single set of Gaussians) 
2. Train triphones (transition probabilities and mixture 
weights trained on all speakers, speaker-specific Gans- 
sians) 
3. (Optional) Fix transition probabilities and mixture 
weights and train a single set of Gaussians on all 
speakers 
This new multi-speaker (MS)/SI strategy (without the op- 
tion), in effect, implements a theory to the effect that all 
persons spea k alike except that each uses a different section 
of the acoustic space, perhaps due to differently sized and 
shaped vocal tracts. 
The new strategy without the option uses more data 
to train the mixture weights and might therefore, with the 
speaker-specific Ganssians, provide better SD recognition 
than the old method. It was significantly worse than the 
standard SD training for the RM1 database (12 speakers, 
Table 3), but slightly better for the RM2 database (4 speak- 
ers, Table 4). In both cases the new procedure was better 
than the SI-109 system. 
The new strategy with the option is a new method for 
training a MS or SI system. The mixture weights are again 
trained in the context of speaker-specific Gaussians, but 
then the weights are fixed and a single set of MS or SI Gaus- 
sians trained. In all cases, the systems using SD Gaussians 
outperformed the MS/SI Ganssians. On the RM1 database, 
the old training method outperformed the new with the op- 
tion respectively for both the MS-12 and the SI-109 training 
condition. Similarly, when training on the RM1 database 
and testing on the RM2 database, the old training method 
outperformed the new with the option respectively for the 
SI-12 and SI-109 training conditions. (The MS-12 models 
from RM1 become SI-12 when tested on the RM2 database 
because the RM2 database uses speakers which are not in- 
cluded in RM1.) 
The controls for this experiment (SI-109 and SI-12), 
when tested on the RM2 database, confirm BBN's result 
\[4\] that similar SI performance can be obtained by training 
on large amounts of data from a small number of speakers as 
66 

the June 90 spontaneous training (774 sentences) and test 
data was used. Due to the limited amount of time available 
before the evaluation tests, no attempt was made to model 
the open vocabulary, disfluencies partial words, thinking 
noises and extraneous noises. Thus the SNOR transcrip- 
tions of the acoustic data were used for both training and 
testing. The lexicon (548 words) and a bigram back-off lan- 
guage model were generated from the training data which 
produced a test set perplexity of 23.8 with 1.3% out-of- 
vocabulary words. 
The first system was as described in the introduction ex- 
cept that the system used SI TM-2 non-cross word triphone 
models and the improved duration modeling described above. 
Recognition was performed using the perplexity 23.8 bigram 
language model. The pilot tests were all SI trained with two 
observation streams. The closest RM system showed an SI- 
109 WPG word error rate of 10.4% \[11\]. After fixing some 
pruning difficulties in training due to the large silences in 
the training data, the system produced a word error rate 
of 37.5% (Table 5). Enabling optional inter-word silences 
in training reduced the pruning difficulties and improved 
the recognition performance to 33.3% (Table 5). (Optional 
inter-word silences during training had been tested on the 
RM task and found not to help the performance.) Finally, 
this system was tested using the perplexity 17.8 baseline 
language model and the error rate was reduced to 30.9% 
(Table 5). 
ATIS BASELINE 
DEVELOPMENT TESTS 
When the baseline test definition became available, the 
best pilot system was trained on the baseline training data. 
The error rate improved to 26.4% (Table 6). The addi- 
tional data, which consisted of read in-task sentences and 
read adaptation sentences, increased the number of train- 
ing sentences by a factor of 6.5, but produced a surprisingly 
small performance improvement. Cross-word triphone mod- 
eling was added which reduced the word error rate to 23.0%. 
(The closest corresponding system RM SI-109 WPG error 
rate is 8.5% \[11\].) Next, the third observation stream (sec- 
ond differential mel-ceptsra) was added (TM-a) which in- 
creased the error rate to 25.3%. In contrast, a 30% error 
rate reduction on the SI RM task occurred when the third 
observation stream was added\[ll\]. Finally, a TM-3 1-0-2 
semiphone system yielded 24.0% word error rate, which is 
between the results obtained with the TM-2 and TM-3 tri- 
phone systems. 
EVALUATION TESTS 
The SD and SI-109 RM evaluation tests were run with 
WPG and no grammar (NG). The systems are identical 
to the systems tested in the last set of evaluation tests\[ll\] 
except the enhanced duration models were used. The SD 
system used two observation streams and the SI-109 system 
used three observation streams. The average word error 
rates with the WPG are 1.77% and 4.39% respectively (Ta- 
ble 7). 
Due to the limited time between the distribution of the 
ATIS development data and the deadline for the evalua- 
tion tests, it was not possible to test all desired systems 
nor was it possible to adequately set the recognition pa- 
rameters such as the grammar weight and word insertion 
penalty. As noted earlier, the open vocabulary, disfluencies, 
partial words, thinking noises, and extraneous noises were 
not modeled. The tested system is an SI TM-2 XW tri- 
phone system with the improved duration model. The test 
set perplexity of the class A test data was 24 with .8% out- 
of-vocabulary words using the informal baseline language 
model and the recognition word error rate was 26.5% (Ta- 
ble 8). The non-Class A test sets were also tested. Their 
results and perplexities are shown in Table 8. The recog- 
nition output sentences (top-l) were sent to Unisys to be 
input to their natural language system\[9\]. 
DISCUSSION AND CONCLUSIONS 
While the additional work on semiphone models has not 
yielded any improvements over the original semiphone sys- 
tems, they still represent a potentially useful tradeoff. They 
still yield a 20-30% higher error rate than do triphone mod- 
els, but provide more than an order of magnitude reduction 
in the number of states required in a large vocabulary recog- 
nition system. 
The improved duration model, as tested here, is ex- 
tremely simple way to reduce the error rate by about 10%. 
A better method for determining the minimum state dura- 
tions might be to perform a Viterbi alignment of the train- 
ing data and determine the desired splitting factor from the 
observed minima. 
The new training strategy, while it did not improve per- 
formance as tested, did yield results consistent with a method 
of rapid speaker adaptation. This method of speaker adap- 
tation, which is performed by a modified TM trainer, is well 
suited to the current DARPA applications. 
The bigram back-off language model was added to the 
Lincoln CSR. This made the system operational with a more 
practical class of language models than the previously im- 
plemented finite state grammars. In particular, it made 
testing on the ATIS CSR task feasible. 
The tripling of error rates obtained on the ATIS task 
compared to the RM task is quite reasonable. A perplexity 
25.7 bigram back-off language model trained on 8K RM 
sentences resulted in an approximate doubling of the error 
rate compared to the WPG\[12\] and the perplexity 17.8 ATIS 
bigram language model was trained on only 4K sentences. 
Thus, only a factor of about 1.5 increase occurred due to the 
extemporaneous speech and the less controlled environment. 
Given the limited time between distribution of the data 
and the evaluation tests, it has not been possible to ade- 
quately study the difficulties unique to the ATIS database 
nor has it been possible to adequately test our systems. 
There are some known difficulties with the systems reported 
here (a bug in the recognition network generation has been 
found) and some known phenomena have not been modeled. 
We tested our best system-to-date and hope to be able to 
improve the modeling and cure the system difficulties in the 
near future. 
68 
REFERENCES 
1. J.R. Bellegardaand D.H. Nahamoo, "Tied Mixture Continu- 
ous Parameter Models for Large Vocabulary Isolated Speech 
Recognition," Proc. ICASSP 89, Glasgow, May 1989. 
2. X. D. Huang and M.A. Jack, "Semi-continuous Hidden 
Markov Models for Speech Recognition," Computer Speech 
and Language, Vol. 3, 1989. 
3. S. M. Katz, "Estimation of Probabilities from Sparse Data 
for the Language Model Component of a Speech Recog- 
nizer," ASSP-35, pp 400.-401, March 1987. 
4. F. Kubala and R. Schwartz, "A New Paradigm for Speaker- 
Independent Training and Speaker Adaptation," Proc. 
DARPA Speech and Natural Language Workshop, Morgan 
Kaufmarm Publishers, June 1990. 
5. F. Kubala, S. Austin, C. Barry, J. Makhoul, P. Placeway, 
R. Schwartz, "BYBLOS Speech Recognition Benchmark Re- 
sults," Proc. DARPA Speech and Natural Language Work- 
shop, Morgan Kau_fmann Publishers, Feb. 1991. 
6. P. Ladefoged, A Course in Phonetics, Harcourt Brace Ja- 
vanovich, New York, 1982. 
7. K. F. Lee, H. W. Hon, M. Y. Hwang, S. Mahajan, and R. 
Reddy, "The SPHINX Speech Recognition System," Proc. 
ICASSP 89, May 1989. 
8. M. Lermig, V. Gupta, P. Keuny, P. Mermelstein, D. 
O'Shaughnessy, "An 86,000--Word Recognizer Based on 
Phonemic Models," Proc. DARPA Speech and Natural Lan- 
guage Workshop, Morgan Kaufmarm Publishers, June 1990. 
9. L. M. Norton, M. C. Linebarger, D. A. Dahl and N. Nguyen, 
"Augmented Role Filling Capabilities for Semantic Inter- 
pretation of Spoken Language," Proc. DARPA Speech and 
Natural Language Workshop, Morgan Kaufmann Publish- 
ers, Feb. 1991. 
10. D. B. Paul, "The Lincoln Robust Continuous Speech Rec- 
ognlzer," Proc. ICASSP 89., Glasgow, Scotland, May 1989. 
11. D. B. Paul, "The Lincoln Tied-Mixture HMM Continu- 
ous Speech Recognizer," Proc. DARPA Speech and Natural 
Language Workshop, Morgan Kau_fmann Publishers, June 
1990. 
12. D. B. Paul, "Experience with a Stack Decoder-Based 
HMM CSR and Back-Off N-Gram Language Models," Proc. 
DARPA Speech and Natural Language Workshop, Morgan 
Kaufmarm Publishers, Feb. 1991. 
13. D. Rtischev, "Speaker Adaptation in a Large-Vocabulary 
Speech Recognition System," Masters Thesis, MIT, 1989. 
14. R Schwartz, Y. Chow, O. Kimball, S. Roucos, M. 
Krasner, and J. Makhoul, "Context-Dependent Modeling 
for Acoustic-Ph0netic Recognition of Continuous Speech," 
Proc. ICASSP 85, Tampa, FL, April 1985. 
15. R. Schwartz, personal communication. 
16. M. Weintraub, H. Murveit, M. Cohen, P. Price, J. Bernstein, 
G. Baldwin, and D. Bell, "Linguistic Constraints in Hidden 
Maxkov Model Based Speech Recognition," Proc. ICASSP 
89, May 1989. 
Table 1: SD RM TM-2 XW Semiphone Results 
System States per Phone Total States Wd Err 
Triphone 0-3-0 24000 1.7% (.13%) 
Semiphone 1-1-1 3800 2.2% (.14%) 
Semiphone 1-0-2 5500 2.2% (.14%) 
Semiphone 2-0-1' 5300 2.5% (.15%) 
Mixed wd bdry 1-0-2 9300 2.2% (.14%) 
wd int 0-3-0 
Table 2: Improved Duration Model 
RM1% Word Error Rates (s-d) with WPG 
Improved Dur Model 
System Models without with 
TM-2 SD* XW triphone 1.74% (.13%) 1.55% (.12%) 
TM-3 SI-109" XW triphone 5.64% (.23%) 5.20% (.22%) 
* Evaluation test systems 
Table 
System 
SD old 
MS-12 (SDG) new 
MS-12 old 
MS-12 (MSG) new, opt 
old 
3: New Training Strategy: RM1 Tests Using a TM-2 XW Triphone Systems 
Training Mixture Training: Procedure Weights Gauss Wd Err (s-d) 
SI-109 
SL109 (MSG) new, opt 
SD SD SD 1.7% (.13%) 
MS SD SD-12 2.6% (.16%) 
MS MS SD-12 3.4% (.18%) 
MS MS SD-12 5.2% (.22%) 
SI SI SI-109 7.8% (.27%) 
SI SI SI-109 8.6% (.28%) 
(Codes: SD=speaker dependent, MS----multi-speaker, SI=speaker independent, -12----all 12 RM1 SD speakers combined, 
-109=109 RM1 SI training speakers, SDG=SD Gaussians, MSG=MS Gaussians) 
59 
Table 
System 
4: New Training Strategy: RM2 Tests Using a TM-2 XW Triphone Systems 
Training Mixture Training 
Procedure Weights Gauss Set Wd Err (s-d) 
MS-4 (SDG) new 
SD old 
MS-4 (MSG) new,opt 
SI-12* old 
SI-12 (sIG)* new,opt 
SI-109 
SI-109 (SIG) 
MS SD 
SD SD 
MS MS 
SI SI 
SI SI 
old SI SI SI-109 
new,opt SI SI SI-109 
SD-4 .8% (.14%) 
SD (RM2) 1.0% (.16%) 
SD-4 1.8% (.21%) 
SD-12 6.4% (.39%) 
SD-12 i 7.0% (.40%) 
7.6% (.42%) 
8.3% (.44%) 
* These systems axe the same as the corresponding MS systems in Table 3 but are actually SI in these tests because the 
test speakers are not in the training set. (-4, -12, and -109 are all disjoint speaker sets.) 
(Codes: SD=speaker dependent (2400 training sentences for RM2), MS=multi-speaker, SI=speaker independent, -4=all 4 
RM2 speakers combined, -12=all 12 RM1 SD speakers combined, -109=109 RM1 SI training speakers, SDG=SD Gaussians, 
MSG=MS Gaussians) 
Table 5: ATIS Pilot Development Tests: SI, non-cross word triphones, 774 June 90 training sentences 
system opt silences 
TM-2 triphone no 
TM-2 triphone yes 
TM-2 triphone yes 
bigram perplexity 
23.8 
23.8 
17.8 
wd err (s-d) 
37.5% (1.2%) 
33.3% (1.2%) 
30.9% (1.2%) 
Table 6: ATIS Baseline Development Tests: SI, 5020 training sentences, opt silences, perplexity 17.8 
system 
TM-2 triphone 
TM-2 triphone* 
TM-3 triphone 
cross-word 
models 
no 
yes 
observation 
streams wd err (s:d) 
2 , l 26.4% (1.1%) 
2 23.0% (1.1%) 
yes 
TM-3 semiphone yes 3 
3 25.0% (1.1%) 
* Evaluation test system 
24.0% (1.1%) 
Table 7: RM Evaluation Test Results: XW triphones, improved duration model 
% Word Error Rates (std dev) 
System \]Training sub l ins I dell word (s-d) sent sub l ins I del word (s-d) sent 
TM-2 \] SD 
TM-3 SI-109 ilol 11 71177(26  1201 58113117 873 05  440 2.8 .6 1.0 4.39 (.41) 23.3 14.2 2.9 2.7 19.73 (.80) 71.7 
* Homonyms equivalent 
Table 8: ATIS Baseline Evaluation Test Results: SI, 5120 training sentences 
% Word Error Rates with Bigram Back-off Language Model 
System Models 
TM-2 XW triphone 
TM-2 XW triphone 
TM-2 XW triphone 
TM-2 XW triphone 
TM-2 XW triphone 
Test Nr Test Set 
Class Sent perplexity 
A 148 22.6 
D1 58 27.2 
A opt 11 73.7 
D1 opt 4 23.8 
all 200 27.5 
I II vocab wds sub ins del word (s-d) sent 
.8% 16.2 5.9 4.0 26.1 (1.1) 88.5 
1.4% 22.2 3.9 7.1 33.2 (1.9) 88.5 
1.4% 22.8 13.1 2.9 I 38.8 (3.4) 100.0 
.0% 15.8 21.1 3.5\] 40.4 (6.5) 100.0 
1.1% 19.1 6.5 '4.8 30.4 (1.0) 90.5 
70 
