Improved HMM Models for High Performance Speech Recognition 
Steve Austin, Chris Barry, Yen-Lu Chow,
Alan Derr, Owen Kimball, Francis Kubala, John Makhoul,
Paul Placeway, William Russell, Richard Schwartz, George Yu
BBN Systems and Technologies Corporation 
Cambridge, MA 02138 
ABSTRACT
In this paper we report on the various techniques that
we implemented in order to improve the basic speech
recognition performance of the BYBLOS system. Some
of these methods are new, while others are not. We
present methods that improved performance as well as
those that did not. The methods include Linear
Discriminant Analysis, Supervised Vector Quantization,
Shared Mixture VQ, Deleted Estimation of Context
Weights, MMI Estimation Using "N-Best" Alternatives,
and Cross-Word Triphone Models. While we have not yet
combined all of the methods in one system, the overall
word recognition error rate on the May 1988 test set
using the Word-Pair grammar has decreased from 3.4% to
1.7%.
1 Introduction
We considered several directions for trying to improve 
the recognition accuracy within the basic framework of 
the BYBLOS system. The various techniques can be rea- 
sonably grouped into three general topics: changing the 
underlying distance metric in the spectral space, optimiz- 
ing the few weights that are used with the system, and 
improving the phonetic coarticulation model by adding 
cross-word triphone context. We introduce each of these 
areas below and discuss them in more detail in the body 
of the paper. Finally, we will present recognition results 
for a combination of two of the methods. 
Even in a discrete HMM system, there is an under- 
lying distance metric that is used to divide the spectral 
space into distinct regions. It has been suggested that 
it is possible to improve recognition accuracy by per- 
forming a linear discriminant analysis. We have also 
considered several methods of nonlinearly warping the
spectral space as part of the vector quantization process.
We classify these methods as "supervised clustering"
techniques. In addition, we implemented the technique
that has been called "tied mixture vector quantization"
(Bellegarda, 1989) or semi-continuous densities (Huang,
1989).
In the BYBLOS system there are a number of system
parameters that are fixed for all speakers, based on
intuition and on the results of a small number of tuning
experiments. Among these are the weights for the
different context-dependent models of phonemes and the
relative weights for different feature sets (codebooks).
While the weights chosen are certainly reasonable on
average, it seems inconsistent to estimate millions of
probabilities automatically while setting a handful of
parameters manually. Therefore, we implemented a
deleted estimation algorithm to estimate the context
model weights and developed a new MMI technique for
estimating the feature set weights automatically.
One obvious extension to context-dependent model- 
ing (which was introduced in BYBLOS in 1984) is to 
model context between phonemes that are not in the 
same word. In fact, three research sites (Paul, 1989; 
Lee, 1989; Murveit, 1989) reported modeling triphone 
context across word boundaries at the February 1989 
meeting. We have now implemented a similar algorithm
in the BYBLOS system. However, because other
researchers remarked that the changes to the training and
recognition programs were extensive and difficult to
implement, we chose to achieve the effect by precompiling
all of the models in such a way that we did not need
to change either the training or recognition programs.
In sections 2 to 4 we describe the algorithms imple- 
mented under each of these areas along with results. 
In section 5 we present recognition results under
several different conditions, including the test results for
the October '89 test set.
2 Distance Measures and Supervised VQ 
This section deals with techniques for improving the dis- 
tance measure used in VQ, in particular, using linear 
discriminant analysis, and nonlinear supervised cluster- 
ing techniques. In addition, we present results when we 
replace the discrete densities with shared-mixture densi- 
ties. 
2.1 Linear Discriminant Analysis 
In our baseline system we compute 14 mel-frequency
warped cepstral coefficients (c1-c14) every 10 ms
directly from the speech power spectrum. These
parameters are grouped in one codebook. These 14 parameters
are then used to compute "difference" parameters, by
computing the slope of a least squares linear fit to a
five-frame window centered around each frame. The 14
slopes of this fit for the coefficients then make up the
second set (codebook) of features. Finally, we use a
third codebook that has the log rms energy and the
"difference" of this energy. The energy parameter is
normalized relative to a decaying running maximum, so as to
be insensitive to arbitrary changes in amplitude. We
divide the 30 features among three codebooks to avoid the
training problem associated with high dimensionality.
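The "difference" parameters described above can be sketched as follows. This is a minimal illustration, not the BYBLOS code: the function name `delta_features` and the clamping of the window at the first and last frames are assumptions, since the text does not say how utterance edges are handled.

```python
def delta_features(frames, half_window=2):
    """Slope of a least-squares line fit over a window centered on each frame.

    frames: list of equal-length feature vectors (one per 10 ms frame).
    Edge frames reuse the first/last frame (an assumption; the paper does
    not say how edges are handled).
    """
    n = len(frames)
    offsets = range(-half_window, half_window + 1)
    # With time offsets -2..2 the mean offset is zero, so the least-squares
    # slope reduces to sum(k * y_k) / sum(k * k).
    denom = sum(k * k for k in offsets)
    deltas = []
    for t in range(n):
        slope = [0.0] * len(frames[t])
        for k in offsets:
            src = frames[min(max(t + k, 0), n - 1)]  # clamp at the edges
            for d, value in enumerate(src):
                slope[d] += k * value
        deltas.append([s / denom for s in slope])
    return deltas
```

Applied to the 14 cepstral coefficients of every frame, this yields the 14 slopes that form the second codebook.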
The recognition group at IBM (Brown, 1987) has pro- 
posed using several successive frames jointly in order to 
model the joint density more accurately together with 
linear discriminant analysis (LDA) to reduce the num- 
ber of dimensions. We have attempted to use LDA to 
find a better set of features that could then be divided
into sets that would, in fact, be more independent. In
addition, we might hope that we would automatically find a
more beneficial weighting on the different features than
simple Euclidean distance (which is what we use in the
VQ).
First, we needed to define several classes that we
wanted to discriminate. We chose the (50 or so) basic
phonemes as that set, under the assumption that these
modeled most of the distinctions that must be made in
large vocabulary speech recognition. We segment all
of the training data into phonemes automatically using
the decoder constrained to find the correct answer. The
recognized segment boundaries are then used to assign
a phoneme label to each frame. Second, we compute
the within (phoneme) class and between class means
and covariances. We use the generalized eigenvector
solution to find the best set of linear discriminant features.
Third, we simply cluster and quantize the new features
as usual. Alternatively, we can divide the new features
up into a small number of codebooks in order to reduce
the quantization error.
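The second step above (within-class and between-class scatter followed by the generalized eigenvector solution) can be sketched with NumPy. The function name `lda_transform` and the Cholesky-whitening route to the generalized eigenproblem are illustrative choices, not taken from the paper, and the sketch assumes the within-class scatter is positive definite.

```python
import numpy as np

def lda_transform(frames, labels, n_keep):
    """Return a (dim x n_keep) projection solving S_b v = lambda S_w v,
    keeping the directions with the largest eigenvalues."""
    frames = np.asarray(frames, dtype=float)
    labels = np.asarray(labels)
    dim = frames.shape[1]
    grand_mean = frames.mean(axis=0)
    s_w = np.zeros((dim, dim))  # within-class scatter
    s_b = np.zeros((dim, dim))  # between-class scatter
    for phoneme in np.unique(labels):
        x = frames[labels == phoneme]
        mean = x.mean(axis=0)
        centered = x - mean
        s_w += centered.T @ centered
        diff = (mean - grand_mean)[:, None]
        s_b += len(x) * (diff @ diff.T)
    # Whiten with the Cholesky factor of S_w so that the generalized
    # problem becomes an ordinary symmetric eigenproblem.
    chol = np.linalg.cholesky(s_w)
    inv_chol = np.linalg.inv(chol)
    sym = inv_chol @ s_b @ inv_chol.T
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1][:n_keep]
    # Map the eigenvectors back to the original feature space.
    return inv_chol.T @ eigvecs[:, order]
```

Multiplying each frame by the returned matrix gives the new features, which can then be clustered and quantized as usual.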
We performed experiments with several variations in
the number of codebooks and assignment of linear
discriminants to codebooks. However, the results
(averaged over several test speakers) did not improve over
the baseline 3-codebook condition described at the
beginning of this section. We can draw two possible
conclusions from these results relative to previous successes
with this technique. First, while it might be possible to
find a small number of discriminant directions that are
important for a small vocabulary task - especially one
with minimal pair differences - it may not be as easy in
a large vocabulary task, where the important distinctions
are many and also very varied. That is, any choice of
discriminants that is better for some distinctions may be
worse for others. Second, it is not clear that optimizing
phonetic distinctions on single frames will help a
recognition system that uses models of triphones.
2.2 Supervised Vector Quantization 
Since the simple linear discriminants did not improve 
results, we chose to consider a more complex warping 
of the feature space. We classify the general area as su- 
pervised clustering or supervised VQ. The basic idea is 
that instead of finding a codebook that minimizes mean
square error, without regard to phonetic similarity, we
should be able to use the training data to generate a code- 
book that tends to preserve differences that are phoneti- 
cally important, and disregard feature differences (even 
if they are large) that are not phonetically important. 
Thus we attempt to maximize the mutual information 
between the VQ clusters and phonetic identity. We de- 
scribe two techniques below that seemed like they should 
accomplish this goal. While both methods were able 
to derive a codebook that was more closely related to 
phonetic distance, neither resulted in an improvement in 
overall continuous speech recognition accuracy. 
2.2.1 Binary Division of Space 
The first algorithm is most closely related to the 
nonuniform binary clustering algorithm that we use 
to derive an initial estimate for k-means clustering 
(Roucos, Makhoul, and Gish, 1985). We label all the speech
frames phonetically as described in the previous
section. All the labeled frames are initially in one cluster.
Then, we iteratively divide the clusters until we have the 
desired number. One of the many clustering algorithms 
we tried is given below. 
First we have a procedure to measure the entropy re- 
duction that would result from dividing a single cluster 
into two: 
1. Estimate a single Gaussian for the frames with each
phoneme label in the cluster.
2. In general there will be several different phoneme
labels in the cluster. Identify the two most
"prominent" phonemes within the cluster. The most
effective measure of prominence was simply the number
of frames with that phoneme label.
3. Divide all data into two new clusters using these
two Gaussian distributions.
4. Compute the difference between the entropy of the
phoneme labels in the original cluster and the
average entropy of the two new clusters, weighted by
the number of samples in each subcluster.
The outer loop repeatedly divides the cluster that will
result in the largest entropy reduction.
1. Place all the labeled frames initially in one cluster.
2. Using the above procedure, compute the potential
entropy reduction that would be obtained upon
dividing each of the clusters.
3. Adopt the division that resulted in the largest
entropy reduction.
4. Create two new clusters and measure the potential
entropy reduction for dividing each of the two
resulting clusters as described above.
5. If we have fewer than 256 clusters, go to (3).
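The two procedures above can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm as implemented: the Gaussians use diagonal covariances with a variance floor, the function names are hypothetical, and the outer loop recomputes every candidate split on each pass instead of caching them as step 4 describes.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of the phoneme labels in one cluster."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())

def fit_gaussian(points):
    """Diagonal-covariance Gaussian (mean, variance) with a variance floor."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    var = [max(sum((p[d] - mean[d]) ** 2 for p in points) / len(points), 1e-3)
           for d in range(dim)]
    return mean, var

def log_density(x, gaussian):
    mean, var = gaussian
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def best_split(cluster):
    """cluster: list of (feature_vector, phoneme) pairs.

    Returns (entropy_reduction, left, right); the reduction is 0.0 when
    the cluster cannot be split.
    """
    labels = [ph for _, ph in cluster]
    top = [ph for ph, _ in Counter(labels).most_common(2)]
    if len(top) < 2:
        return 0.0, cluster, []
    g0, g1 = [fit_gaussian([x for x, ph in cluster if ph == t]) for t in top]
    left, right = [], []
    for item in cluster:
        closer_to_first = log_density(item[0], g0) >= log_density(item[0], g1)
        (left if closer_to_first else right).append(item)
    if not left or not right:
        return 0.0, cluster, []
    after = sum(len(part) / len(cluster) * label_entropy([ph for _, ph in part])
                for part in (left, right))
    return label_entropy(labels) - after, left, right

def supervised_codebook(frames, n_clusters):
    """Greedy outer loop: always split the cluster with the largest gain."""
    clusters = [frames]
    while len(clusters) < n_clusters:
        splits = [best_split(c) for c in clusters]
        gain, idx = max((s[0], i) for i, s in enumerate(splits))
        if gain <= 0.0:
            break  # no remaining split reduces the label entropy
        clusters[idx:idx + 1] = [splits[idx][1], splits[idx][2]]
    return clusters
```

Calling `supervised_codebook(labeled_frames, 256)` corresponds to the 256-cluster stopping rule in step 5.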
The resulting hierarchical codebook was then used to 
quantize all of the training and test data. When we ap- 
plied the above algorithm to a single set of features (say 
14), we found only a minor improvement in the mutual 
information above the case for unsupervised k-means. 
When we used all the features in one codebook, there 
was a larger gain. However, as with LDA, there was no 
gain in the overall recognition accuracy. 
2.2.2 LVQ2: Kohonen's Learning Vector Quantizer 
The LVQ2 algorithm (Kohonen, 1988) was used very ef- 
fectively in a phoneme recognition system (McDermott, 
1989). The algorithm amounts to a discriminative
training of the codebook means to maximize recognition of
the frame labels.
As before, we start with the set of phonemically la- 
beled frames. Then we use the binary and k-means al- 
gorithm to divide the feature vectors from each phoneme 
into several clusters. We made the number of clusters 
for each phoneme proportional to the square root of the 
number of frames in that phoneme, such that the total 
number of clusters was 256. Each cluster has the name 
of the phoneme data in it. Then, we use LVQ2 to jiggle 
the means to optimize frame recognition. 
For each feature vector:
1. Find the nearest two clusters.
2. If the nearest cluster is from the wrong phoneme and
the second nearest is from the correct phoneme, shift the
mean of the correct cluster toward the feature vector
in question and shift the mean of the wrong cluster away.
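The update for a single feature vector can be sketched as below. The function names and the learning rate `rate` are illustrative assumptions. Kohonen's formulation also restricts updates to vectors that fall inside a window around the decision boundary; that condition is left out here because the steps above do not mention it.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lvq2_step(means, mean_labels, x, label, rate=0.05):
    """Apply one LVQ2-style update in place; return True if means moved.

    means: list of cluster mean vectors; mean_labels: phoneme name of each
    cluster; x: feature vector; label: its phoneme label.
    """
    order = sorted(range(len(means)),
                   key=lambda i: squared_distance(means[i], x))
    nearest, second = order[0], order[1]
    if mean_labels[nearest] != label and mean_labels[second] == label:
        # Pull the correct cluster toward x and push the wrong one away.
        means[second] = [m + rate * (xi - m)
                         for m, xi in zip(means[second], x)]
        means[nearest] = [m - rate * (xi - m)
                          for m, xi in zip(means[nearest], x)]
        return True
    return False
```

One training pass calls `lvq2_step` for every labeled frame; the passes are repeated, typically with a shrinking `rate`, until the means stop moving.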
The above algorithm is iterated until convergence
(which requires some care). As suggested in the
reference, we used several adjacent speech frames together
as a longer feature vector. This resulted in significantly
higher phoneme-frame recognition rates, both from the
k-means initial estimate, and after improvement with
LVQ2.
The LVQ2 algorithm was found to improve the frame
recognition accuracy significantly (from 48% to 70%)
on the training set, particularly for a large number of
dimensions. However, the accuracy increased only to
57% on an independent test set. As before, there was
no gain in overall system recognition accuracy. This result
is in contrast to the vast improvements seen in
(McDermott, 1989). While one possible difference is that
they used hand-marked phoneme boundaries in isolated
word utterances for both training and test, we believe
that the important difference was probably that the final
recognition task in their case was simply to recognize
the identity of the phoneme. This was quite similar to
the optimization in the LVQ2.
The conclusion from these several efforts at improving 
the vector quantization or distance measure by looking 
at the phoneme labels of single frames (or even clusters 
of frames) was that any gains that were achieved were 
not relevant to the performance of the entire system.
Any method that would improve the vector quantization
must be done within the context of the whole recognition
system.
2.3 Shared Mixture VQ 
One technique that partially avoids problems attributed 
to VQ is to use a fuzzy VQ technique (Tseng, 
ICASSP87) or a more rigorous shared mixture technique 
(Bellegarda, 1989). The basic notion is that each of the
VQ regions is now treated as a Gaussian distribution that
is shared by all of the probability densities in the entire 
HMM system. One of the effects of this is that an input 
feature vector is no longer "in" one cluster or another. 
Instead, there is a probability that it belongs to several 
clusters. The probability of an input feature vector for a
state is now a weighted combination of the discrete prob- 
abilities of the nearby clusters. This might have some 
smoothing effect on the discrete probability densities. It 
also might avoid some of the quantization effects, since 
the probability for an input feature vector would vary 
continuously between two or more clusters. 
We implemented a subset of the pieces of the shared 
mixture algorithms. In particular, we decided to avoid 
the computationally expensive reestimation of the mix- 
ture means and variances. Instead, we estimated a mean 
and full covariance matrix from the training data that 
fell within each of the original clusters. Then, we could 
compute for each training or test frame, the probability 
that it belonged to each of the 256 clusters. We found 
that the nearest five clusters accounted for 99% of the 
probability, and therefore discarded all but the nearest 
five. The five pairs of numbers (index and probability)
then could replace the single VQ index in the probability 
lookup of either the training or recognition algorithms. 
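The replacement of a single VQ index by five (index, probability) pairs can be sketched with NumPy. The function names are hypothetical. The sketch assumes equal prior probability for every cluster and renormalizes the kept probabilities to sum to one; the text does not say whether either was done.

```python
import numpy as np

def top_cluster_posteriors(x, means, covariances, keep=5):
    """Return (indices, probabilities) of the `keep` most probable clusters.

    Each cluster is a full-covariance Gaussian. Equal cluster priors are
    assumed, and the kept probabilities are renormalized (both assumptions).
    """
    x = np.asarray(x, dtype=float)
    log_dens = np.empty(len(means))
    for k, (mean, cov) in enumerate(zip(means, covariances)):
        diff = x - np.asarray(mean, dtype=float)
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)
        log_dens[k] = -0.5 * (len(x) * np.log(2 * np.pi) + logdet + maha)
    top = np.argsort(log_dens)[::-1][:keep]
    # Subtract the maximum before exponentiating for numerical stability.
    probs = np.exp(log_dens[top] - log_dens[top].max())
    return top, probs / probs.sum()

def state_probability(discrete_probs, top, probs):
    """Weighted combination of one state's discrete probabilities."""
    return float(np.dot(probs, np.asarray(discrete_probs, dtype=float)[top]))
```

In the probability lookup, `state_probability` takes the place of reading a single entry of the state's discrete distribution.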
We performed experiments with the shared mixtures
in the decoder alone, or in both the training and the
decoder. Using them in the decoder alone reduced the
error rate by 10%-20%. There was no gain from using
them in the training. While the effect of shared mixtures
might be similar to that of other density smoothing
algorithms, we still found a 5%-20% reduction in error
rate when mixtures were added to a system that already
used smoothing. This condition is included in the
recognition results given at the end of this paper.
3 Optimizing System Parameters 
Here we describe two techniques for estimating global
system parameters in the BYBLOS system. 
3.1 Deleted Estimation Of Context Weights 
The BYBLOS system interpolates all the different prob- 
ability densities of the context-dependent phonemes to 
obtain a robust estimate of the densities. Currently we 
use heuristic weights that are a function of: 
• type of context (phone, left, right, triphone) 
• number of occurrences in training (5 ranges) 
• state in phone model (left, middle, right)
The values of these weights were set based on reason- 
able intuitions about the importance of phonetic contexts 
and amount of training on different parts of a phoneme. 
We ran a few tuning experiments (on an earlier database) 
to determine rough scaling factors on the initial weights. 
Therefore, it is likely that we would see no further im- 
provement by estimating the weights automatically with 
deleted estimation. However, we might expect that if 
we estimated the weights automatically, we could use 
different weights for each speaker. We wanted to avoid,
if possible, any approximation that rests on the
assumption that the alignments remain fixed, and so we
chose to iteratively estimate the weights and then
reestimate the probability densities.
We were worried about the effectiveness of the
jackknifing procedure that is normally used, since the 
weights for combining models are estimated for the case 
where only half of the data was used to estimate the 
models. Therefore, we developed a method for hold- 
ing out only one utterance at a time, that was still very 
efficient: 
Each normal pass of forward-backward is followed 
by a second pass that estimates the weights. At the end 
of the forward-backward pass, we retain the "counts". 
In the second pass we remove the "counts" from one 
sentence at a time and then estimate context weights 
using that deleted sentence. 
1. Run the usual forward-backward iteration on all
sentences.
2. For each sentence:
(a) Run forward-backward on this sentence using the
"old" model to determine its contribution to
the new model.
(b) Subtract the contribution of this sentence from
those models relevant to this sentence.
(c) Run forward-backward to compute weight
counts from this sentence using the model
with the contribution for this sentence
removed.
3. Reestimate the context weights from the weight
counts.
4. Iterate.
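The idea of holding out one sentence at a time can be sketched on a much simplified problem. This is not the BYBLOS procedure: it assumes hard-aligned discrete observations in place of forward-backward counts, a single interpolation weight between one context-dependent and one context-independent distribution, and a hypothetical function name.

```python
def normalize(counts):
    total = sum(counts)
    if total <= 0:
        return [1.0 / len(counts)] * len(counts)  # no data left: uniform
    return [c / total for c in counts]

def leave_one_out_weight(sentences, n_symbols, n_iter=20):
    """sentences: list of lists of (context, symbol) observations of a phone.

    Returns the weight w on the context-dependent model in
        p(symbol) = w * p_context(symbol) + (1 - w) * p_phone(symbol),
    estimated with each sentence's own counts removed from the models
    that score it.
    """
    # Pass 1: accumulate counts over all sentences.
    phone_counts = [0.0] * n_symbols
    context_counts = {}
    for sent in sentences:
        for ctx, sym in sent:
            phone_counts[sym] += 1
            context_counts.setdefault(ctx, [0.0] * n_symbols)[sym] += 1
    w = 0.5
    for _ in range(n_iter):
        weight_count, total = 0.0, 0
        for sent in sentences:
            # Subtract this sentence's contribution from the relevant models.
            held_phone = list(phone_counts)
            held_ctx = {ctx: list(context_counts[ctx]) for ctx, _ in sent}
            for ctx, sym in sent:
                held_phone[sym] -= 1
                held_ctx[ctx][sym] -= 1
            p_phone = normalize(held_phone)
            # Collect weight counts from the held-out sentence.
            for ctx, sym in sent:
                num = w * normalize(held_ctx[ctx])[sym]
                den = num + (1 - w) * p_phone[sym]
                if den > 0:
                    weight_count += num / den
                    total += 1
        if total:
            w = weight_count / total  # reestimate the context weight
    return w
```

When a context occurs in only one sentence, removing that sentence leaves no data for it, so the estimated weight moves toward the context-independent model. That is the behavior deleted estimation is meant to capture.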
This algorithm requires only two times the compu- 
tation of the normal forward-backward algorithm, and 
should result in a more accurate estimate of the weights 
than the usual procedure. Unfortunately, when we ran 
our initial experiments, we found no improvement, de- 
spite the fact that the likelihood of the training data had 
increased somewhat. It is possible that the initial heuris- 
tic weights are close enough, or that the "reasonable" 
continuity constraints existing in the initial weights were
lost when each weight was estimated independently. 
3.2 MMI Estimation Using "N-Best" Alternatives
We have found in the past that the recognition results 
can be improved by optimizing the weights for the dif- 
ferent sets of features. We felt that it would make sense, 
therefore, to estimate these weights automatically. How- 
ever, since these weights are actually exponents on the 
probability densities, it is not possible to estimate them 
using maximum likelihood (ML) techniques. Clearly, 
the largest likelihood would occur when all the weights 
were large. If we constrain the weights to sum to one, 
there is still a problem, since the ML solution would 
determine one weight that would be equal to one, and 
the others would be zero. This can be shown easily 
for the Viterbi case by realizing that the final likelihood 
is simply the product of the whole sentence likelihoods 
due to each codebook. Therefore, we needed to use a 
discriminative technique to estimate the feature weights. 
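The degenerate ML solution under the sum-to-one constraint can be written out for the Viterbi case. The symbols below are introduced only for illustration: $L_k$ is the whole-sentence likelihood due to codebook $k$, $w_k$ is its weight, and $K$ is the number of codebooks.

```latex
\log L(w) \;=\; \sum_{k=1}^{K} w_k \log L_k ,
\qquad w_k \ge 0 , \quad \sum_{k=1}^{K} w_k = 1 .

% The objective is linear in w, so its maximum over the simplex is
% attained at a vertex:
\hat{w}_k \;=\;
\begin{cases}
1, & k = \arg\max_{j} \log L_j , \\
0, & \text{otherwise.}
\end{cases}
```

A linear function on the simplex has no interior maximum unless all $\log L_k$ are equal, so ML gives no useful balance among the codebooks.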
We chose to use Maximum Mutual Information 
(MMI) Estimation to estimate these (and possibly other) 
parameters. In MMI, we want to maximize the likeli- 
hood of the correct answer (given the input) relative to 
the likelihood of all the possible answers. This typically 
is done by determining a set of alternative answers and 
performing a gradient descent to improve the mutual in- 
formation. The problem of finding good alternatives to 
the correct answer is harder for continuous speech than 
for isolated words, where each alternative can be
considered explicitly. However, the N-Best algorithm
(Chow, 1989), which is described elsewhere in these
proceedings, can be used to solve this problem.
The N-Best algorithm is a time-synchronous Viterbi- 
style beam search algorithm that can be made to find 
the most likely N whole sentence alternatives that are 
within a given "beam" of the most likely utterance.
The algorithm can be shown to be exact under some 
reasonable constraints. The computation is linear with 
the length of the utterance, and faster than linear in N. 
We use the N-Best algorithm to generate a list of the 
most likely alternatives for each sentence in a held-out 
set. We then explicitly compute the likelihood of the 
correct sentence (if it is not already in the list). The mu- 
tual information for each sentence and its corresponding 
imposters is used to compute a set of weights for each 
sentence hypothesis. The weights for the correct sen- 
tences are positive, while the imposter sentences have 
negative weights. Then, we use all of the sentences (real 
and imposter) in the usual forward-backward algorithm 
with the counts multiplied by the weight for the
sentence. States common to all sentence hypotheses for a
sentence will get no counts. Then, we compute the
gradient directly from the counts and adjust the parameters
accordingly.
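For the feature set weights, the hypothesis weights and the resulting gradient can be sketched as below. The paper does not give the weight formula. The sketch assumes the usual gradient of log p(correct) - log sum_j p(j), which gives the correct sentence a weight of one minus its posterior and each imposter the negative of its posterior. The function names and the data layout are hypothetical.

```python
import math

def hypothesis_weights(log_scores, correct_index):
    """Weight applied to the counts of each hypothesis in an N-best list.

    The correct sentence gets 1 - posterior (positive) and every imposter
    gets -posterior (negative). This specific form is an assumption.
    """
    peak = max(log_scores)
    exp_scores = [math.exp(s - peak) for s in log_scores]
    total = sum(exp_scores)
    weights = [-e / total for e in exp_scores]
    weights[correct_index] += 1.0
    return weights

def codebook_weight_gradient(codebook_logliks, codebook_weights, correct_index):
    """Gradient of the log posterior of the correct sentence.

    codebook_logliks[j][k] is the log-likelihood of hypothesis j due to
    codebook k; the sentence score is the weighted sum over codebooks.
    """
    scores = [sum(w * ll for w, ll in zip(codebook_weights, hyp))
              for hyp in codebook_logliks]
    hyp_w = hypothesis_weights(scores, correct_index)
    return [sum(hw * hyp[k] for hw, hyp in zip(hyp_w, codebook_logliks))
            for k in range(len(codebook_weights))]
```

Because the hypothesis weights sum to zero, any term that is identical across all hypotheses of a sentence cancels. A gradient step then adds a small multiple of the returned vector to the codebook weights, accumulated over the held-out sentences.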
We used the above algorithm to estimate the (three) 
feature set weights for each speaker separately. We used 
the 600 training sentences to generate the models, which 
we assumed would not change. Then we generated 10 
imposter sentences for each of the 100 development test 
sentences. We used five iterations to optimize the code- 
book weights. Then we evaluated the resulting models 
on the February 1989 test data. The result was a 10% 
reduction in error rate, relative to the initial weights, 
which were empirically optimized for all the speakers. 
The gain is somewhat small, but we are not sure how 
much gain to expect from optimizing only three param- 
eters. Furthermore, we noticed that the gradient descent 
was dominated by a few bad sentences that it probably 
could not fix anyway. We believe that this area needs 
more work. 
4 Cross-Word Triphone Models 
A model of phonetic coarticulation between words has 
been proven to be effective by researchers at CMU, 
SRI, and Lincoln Labs (Paul, 1989; Lee, 1989; Murveit, 
1989). However, we wanted to avoid changes to existing 
training and decoding programs. Therefore, we devel- 
oped a compiler that reads a phonetic dictionary and a 
word grammar and writes out a dictionary of triphones 
and a triphone grammar. That is, the new dictionary has 
one "word" for each triphone (about 7,000 in this case), 
and the new grammar specifies allowable sequences of 
these triphones. There are approximately 60,000 tri- 
phone arcs in the resulting grammar (for the word-pair 
grammar). Given this new dictionary and grammar, the 
training program did not need to change at all and the 
recognition program only needed to know how to write 
out the real words instead of the triphone names - a 
small change. As a result, we were able to implement 
the cross-word triphone effect in only 5 weeks. 
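A compiler of this kind can be sketched as follows. This is an illustration, not the BBN compiler: the function name, the node representation `(word, position, left, right)`, and the boundary symbol `#` used for words with no allowed predecessor or successor are all assumptions.

```python
def compile_triphone_grammar(lexicon, word_pairs, boundary="#"):
    """lexicon: word -> list of phones; word_pairs: allowed (w1, w2) pairs.

    Returns (units, arcs). `units` is the set of triphone "words"
    (left, phone, right); `arcs` links grammar nodes of the form
    (word, position, left, right).
    """
    preds = {w: set() for w in lexicon}
    succs = {w: set() for w in lexicon}
    for a, b in word_pairs:
        succs[a].add(b)
        preds[b].add(a)

    nodes = {}  # (word, position) -> list of grammar nodes
    for w, phones in lexicon.items():
        # Cross-word contexts come from the words the grammar allows next
        # to this one; the boundary symbol is used when there are none.
        lefts = {lexicon[p][-1] for p in preds[w]} or {boundary}
        rights = {lexicon[s][0] for s in succs[w]} or {boundary}
        last = len(phones) - 1
        for i in range(len(phones)):
            ls = lefts if i == 0 else {phones[i - 1]}
            rs = rights if i == last else {phones[i + 1]}
            nodes[w, i] = [(w, i, l, r) for l in sorted(ls) for r in sorted(rs)]

    arcs = set()
    for w, phones in lexicon.items():
        for i in range(len(phones) - 1):  # arcs inside a word
            arcs.update((a, b) for a in nodes[w, i] for b in nodes[w, i + 1])
    for a, b in word_pairs:  # arcs across a word boundary
        for src in nodes[a, len(lexicon[a]) - 1]:
            for dst in nodes[b, 0]:
                if src[3] == lexicon[b][0] and dst[2] == lexicon[a][-1]:
                    arcs.add((src, dst))

    units = {(l, lexicon[w][i], r)
             for group in nodes.values() for (w, i, l, r) in group}
    return units, arcs
```

Each node keeps its word and position, so a recognizer that decodes a path through these nodes can still write out the real words, as described above.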
We tested the new models on the May 1988 test data. 
The addition of cross-word triphone models reduced the 
word error rate by 30% as will be seen in the tables of 
results below. 
5 System Recognition Results 
The tables below compare the word error rate with
several combinations of smoothing, mixtures, and
cross-word triphones. The "smoothing" algorithm is the
Triphone Cooccurrence Smoothing algorithm that was
presented at the DARPA meeting in June '88. "Mixtures"
means using the Shared Mixture VQ as described
above. "X-Word" means using Cross-Word Triphone
models (without smoothing or mixtures). The last
line includes all three algorithms.
Results are shown for the different speaker-dependent 
test sets, indicated by the dates of the test set. The 
results for the baseline system and for the system with 
smoothing have been reported previously for the May'88 
and Feb'89 test sets, and are given for reference. We 
have been using the May'88 test set as our development 
test set. Therefore, each of the conditions is shown for
this test set. As can be seen, the error rate with the Word- 
Pair grammar has been reduced from 3.4% to 1.7%. We 
never tested this configuration with no grammar until the 
Oct'89 test. 
The results for the Oct'89 test set using all three al- 
gorithm extensions indicate that the word error rate with 
the Word-Pair grammar is 2.5%, and the error rate with 
no grammar is 10.6%. While these error rates represent 
the best performance reported so far on this database, we 
were surprised at the large increase in error rate from the 
May'88 test to the Oct'89 test. Therefore, we reran the 
system configuration used in February, 1989, which in- 
cluded only the smoothing algorithm. As can be seen, 
the word error rate was 3.8% on the Oct. '89 test, as 
compared with 2.7% on the May '88 test, which is con- 
sistent with the other results. It is clear that the October 
1989 test is significantly harder (at least for our system) 
than the May 1988 test set, perhaps because it comes 
from a different recording session. However, the rela- 
tive improvements in the algorithms were observed in 
the new test set as well as the old. 
Percent word error using Word-Pair grammar

                                  Test Set
System                     May '88   Feb. '89   Oct. '89
Baseline System              3.4       2.9         -
Smooth                       2.7       3.1        3.8
Smooth + Mix                 2.5        -          -
X-Word                       2.3        -          -
Smooth + Mix + X-Word        1.7        -         2.5

Percent word error using no grammar

                                  Test Set
System                     May '88   Feb. '89   Oct. '89
Baseline System             16.2      15.3         -
Smooth                      15.8      13.8         -
Smooth + Mix                12.6        -          -
X-Word                        -         -          -
Smooth + Mix + X-Word         -         -        10.6
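The relative reductions quoted in the text follow directly from the May '88 Word-Pair results. The short check below uses a hypothetical helper, `relative_reduction`, and only the May '88 Word-Pair figures for the baseline, X-Word, and combined systems.

```python
def relative_reduction(before, after):
    """Relative error-rate reduction, in percent."""
    return 100.0 * (before - after) / before

# May '88 test set, Word-Pair grammar.
baseline, x_word, combined = 3.4, 2.3, 1.7
print(round(relative_reduction(baseline, x_word)))    # 32: X-Word alone
print(round(relative_reduction(baseline, combined)))  # 50: all three methods
```

The first figure corresponds to the roughly 30% reduction attributed to cross-word triphones, and the second to the overall drop from 3.4% to 1.7%.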
6 Conclusions 
We draw several conclusions from this work: 
• Supervising the VQ with phoneme identity does not
help overall recognition performance.
• Shared mixtures in the decoder reduce the error rate
by 10%-20% depending on the grammar, but after
smoothing only by 5%-20%.
• We found no improvement from replacing the
heuristically derived weights for the context-dependent
models with weights determined by deleted estimation.
• We have implemented an algorithm for MMI training
in continuous speech that uses alternatives
generated by the N-Best algorithm. Initial experiments
to optimize the three feature set weights using this
procedure reduced word error rate by 10%.
• As expected, using cross-word triphone models
reduced word error rate by 30%.
• The word error rate using the Word-Pair grammar is
now close to 2%, depending on the test set. When
no grammar is used, the error rate was 10.6% on the
Oct. '89 test set. Due to the very low error rate with
the Word-Pair grammar, we will use the statistical
class grammar (Derr, 1989) for most of our testing,
as it will be easier to measure improvements using
this more difficult and more realistic grammar.
Acknowledgement 
This work was supported by the Defense Advanced
Research Projects Agency and monitored by the Office
of Naval Research under Contract Nos. N00014-85-C-0279
and N00014-89-C-0008.
References
[1] Bellegarda, J. and D. Nahamoo (1989) "Tied mixture
continuous parameter models for large vocabulary
isolated speech recognition," IEEE ICASSP-89.
[2] Brown, P. (1987) "The Acoustic-Modeling Problem
in Automatic Speech Recognition," PhD Thesis, CMU,
1987.
[3] Chow, Y.C. and R.M. Schwartz (1989) "The N-Best
Algorithm: An Efficient Procedure for Finding
Top N Sentence Hypotheses," elsewhere in these
Proceedings of the Oct. 1989 DARPA Speech and Natural
Language Workshop, Morgan Kaufmann Publishers, Inc.,
Oct. 1989.
[4] Derr, A. and R.M. Schwartz (1989) "A Statistical
Class Grammar for Measuring Speech Recognition
Performance," elsewhere in these Proceedings of the Oct.
1989 DARPA Speech and Natural Language Workshop,
Morgan Kaufmann Publishers, Inc., Oct. 1989.
[5] Huang, X.D. and M.A. Jack (1989) "Semi-continuous
hidden Markov models for speech recognition,"
Computer Speech and Language, Vol. 3, 1989.
[6] Kohonen, T., G. Barna, and R. Chrisley (1988)
"Statistical Pattern Recognition with Neural Networks:
Benchmarking Studies," IEEE, Proc. of ICNN, Vol. 1, pp.
61-68, July 1988.
[7] Lee, K.F., H.W. Hon, and M.Y. Hwang (1989)
"Recent Progress in the Sphinx Speech Recognition System,"
Proceedings of the Feb. 1989 DARPA Speech and Natural
Language Workshop, Morgan Kaufmann Publishers,
Inc., Feb. 1989.
[8] McDermott, E. and S. Katagiri (1989)
"Shift-Invariant, Multi-Category Phoneme Recognition using
Kohonen's LVQ2," IEEE ICASSP-89, pp. 81-84.
[9] Paul, D. (1989) "The Lincoln continuous speech
recognition system: recent developments and results,"
Proceedings of the Feb. 1989 DARPA Speech and Natural
Language Workshop, Morgan Kaufmann Publishers,
Inc., Feb. 1989.
