Bayesian Learning of Gaussian Mixture Densities 
for Hidden Markov Models 
Jean-Luc Gauvain¹ and Chin-Hui Lee
Speech Research Department 
AT&T Bell Laboratories 
Murray Hill, NJ 07974 
ABSTRACT 
An investigation into the use of Bayesian learning of the parameters
of a multivariate Gaussian mixture density has been carried out.
In a continuous density hidden Markov model (CDHMM) framework, 
Bayesian learning serves as a unified approach for parameter smooth- 
ing, speaker adaptation, speaker clustering, and corrective training. 
The goal of this study is to enhance model robustness in a CDHMM- 
based speech recognition system so as to improve performance. Our ap- 
proach is to use Bayesian learning to incorporate prior knowledge into 
the CDHMM training process in the form of prior densities of the HMM 
parameters. The theoretical basis for this procedure is presented and 
preliminary results applying to HMM parameter smoothing, speaker 
adaptation, and speaker clustering are given. 
Performance improvements were observed on tests using the DARPA 
RM task. For speaker adaptation, under a supervised learning mode 
with 2 minutes of speaker-specific training data, a 31% reduction in 
word error rate was obtained compared to speaker-independent results.
Using Bayesian learning for HMM parameter smoothing and sex-dependent
modeling, a 21% error reduction was observed on the FEB91 test.
INTRODUCTION 
When training sub-word units for continuous speech recognition
using probabilistic methods, we are faced with the general
problem of sparse training data. This limits the effectiveness of 
conventional maximum likelihood approaches. The sparse training
data problem cannot always be solved by the acquisition of
more training data. For example, in the case of rapid adaptation 
to new speakers or environments, the amount of data available for 
adaptation is usually much less than what is needed to achieve 
good performance for speaker-dependent applications. 
Techniques used to alleviate the insufficient training data problem
include probability density function (pdf) smoothing, model
interpolation, corrective training, and parameter sharing. The 
first three techniques have been developed for HMM with dis- 
crete pdfs and cannot be directly extended to the general case 
of continuous density hidden Markov model (CDHMM). For ex- 
ample, the classical scheme of model interpolation \[4\] \[9\] can be 
applied to CDHMM only if tied mixture HMMs or an increased 
number of mixture components are used. 
Our solution to the problem is to use Bayesian learning to 
incorporate prior knowledge into the CDHMM training process. 
¹Jean-Luc Gauvain is on leave from the Speech Communication Group at
LIMSI/CNRS, Orsay, France.
The prior information consists of prior densities of the HMM pa- 
rameters. Such an approach was shown to be effective for speaker 
adaptation in isolated word recognition of a 39-word, English 
alpha-digit vocabulary where adaptation involved only the pa- 
rameters of a multivariate Gaussian state observation density of 
whole-word HMMs \[12\]. In this paper, Bayesian adaptation is
extended to handle parameters of mixtures of Gaussian densi- 
ties. The theoretical basis for Bayesian learning of parameters of 
a multivariate Gaussian mixture density for HMM is developed. 
In a CDHMM framework, Bayesian learning serves as a unified 
approach for parameter smoothing, speaker adaptation, speaker 
clustering, and corrective training. 
In the case of speaker adaptation, Bayesian learning may be 
viewed as a process for adjusting speaker-independent (SI) models 
to form speaker-specific ones based on the available prior infor- 
mation and a small amount of speaker-specific adaptation data. 
The prior densities are simultaneously estimated during the SI 
training process along with the estimation of the SI model pa- 
rameters. The joint prior density for the parameters in a state 
is assumed to be a product of normal-gamma densities for the 
mean and variance parameters of the mixture Gaussian components
and a Dirichlet density for the mixture gain parameters.
The SI models are used to initialize the iterative adaptation pro- 
cess. The speaker-specific models are derived from the adaptation 
data using a segmental MAP algorithm which uses the Viterbi al- 
gorithm to segment the data and an EM algorithm to estimate 
the mode of the posterior density. 
In the next section the principle of Bayesian learning for CD- 
HMM is presented. The remaining sections report preliminary 
results obtained for model smoothing, speaker adaptation and 
sex-dependent modeling. 
MAP ESTIMATE OF CDHMM 
The difference between maximum likelihood (ML) estimation 
and Bayesian learning lies in the assumption of an appropriate 
prior distribution of the parameters to be estimated. If $\theta$ is the
parameter vector to be estimated from a sequence of n observations
$x_1, \ldots, x_n$, given a prior density $P(\theta)$, then one way to
estimate $\theta$ is to use the maximum a posteriori (MAP) estimate, which
corresponds to the mode of the posterior density
$P(\theta \mid x_1, \ldots, x_n)$, i.e.

$$\theta_{MAP} = \arg\max_{\theta} P(x_1, \ldots, x_n \mid \theta)\, P(\theta) \qquad (1)$$

On the other hand, if $\theta$ is assumed to be a fixed but unknown
parameter vector, then there is no prior knowledge about $\theta$. This is
equivalent to assuming a non-informative prior, i.e. $P(\theta) =$ constant.
Equation (1) is then the familiar maximum likelihood formulation.
Given the MAP formulation in equation (1) two problems re- 
main: the choice of the prior distribution family and the effective 
evaluation of the maximum a posteriori. In fact these two prob- 
lems are closely related, since the choice of an appropriate prior 
distribution can greatly simplify the estimation of the maximum 
a posteriori. The most practical choice is to use conjugate den- 
sities which are related to the existence of a sufficient statistic 
of a fixed dimension \[1\] \[2\]. If the observation density possesses
such a statistic s and if $g(\theta \mid s, n)$ is the associated kernel
density, MAP estimation is reduced to the evaluation of the mode
of the product $g(\theta \mid s, n)\, P(\theta)$. In addition, if the prior density is
chosen in the conjugate family, i.e. in the same family as the kernel
density, $P(\theta) = g(\theta \mid t, m)$, the previous product is simply equal
to $g(\theta \mid u, m + n)$ since the kernel density family is closed under
multiplication. The MAP estimate is then

$$\theta_{MAP} = \arg\max_{\theta} g(\theta \mid u, m + n) \qquad (2)$$

In this case, the MAP estimation problem is closely related to the
MLE problem, which consists of finding the mode of the kernel
density. In fact, $g(\theta \mid u, m + n)$ can be seen as the kernel of the
likelihood of a sequence of m + n observations.
When there is no sufficient statistic of a fixed dimension, the 
MAP estimation, like ML estimation, has no analytical solution, 
but the problems are still very similar. For the general case of 
mixture densities of the exponential family, we propose to use 
a product of kernel densities of the exponential family assuming 
independence between the parameters of the mixture components 
in the joint prior density. To simplify the problem of finding the 
solution to equation 1, we restrict our choice to a product of a 
Dirichlet density and kernel densities of the mixture exponential 
density, i.e. 
$$P(\theta) \propto \prod_{k=1}^{K} w_k^{\lambda_k}\, g(\theta_k \mid t_k, m_k) \qquad (3)$$

where K is the number of mixture components and the $w_k$ are the
mixture weights. However, this choice may be too restrictive to
adequately represent the real prior information and in practice it 
may be of interest to choose a slightly larger family. 
In the following subsections, we focus our attention on the 
cases of normal density and mixture of normal densities for two 
reasons: solutions for the MLE problem are well known and we 
are using CDHMM based on mixtures of normal densities. 
Normal density case 
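Bayesian learning of a normal density is well known \[1\]. If
$x_1, \ldots, x_n$ is a random sample from $\mathcal{N}(x \mid m, r)$, where m and r
are respectively the mean and the precision (reciprocal of the
variance), and if $P(m, r)$ is a normal-gamma prior density,
$P(m, r) \propto r^{1/2} \exp(-\frac{\tau r}{2}(m - \mu)^2)\, r^{\alpha - 1} \exp(-\beta r)$,
the joint posterior density is also a normal-gamma density with
parameters $\hat{\mu}$, $\hat{\tau}$, $\hat{\alpha}$ and $\hat{\beta}$ such that:

$$\hat{\mu} = \frac{\tau}{\tau + n}\,\mu + \frac{n}{\tau + n}\,\bar{x} \qquad (4)$$
$$\hat{\beta} = \beta + \frac{n}{2} S_x + \frac{\tau n (\bar{x} - \mu)^2}{2(\tau + n)} \qquad (5)$$
$$\hat{\alpha} = \alpha + n/2 \qquad (6)$$
$$\hat{\tau} = \tau + n \qquad (7)$$

where $\bar{x}$ is the sample mean and $S_x$ is the variance of the random
sample. The MAP estimates of m and r are respectively $\hat{\mu}$ and
$(\hat{\alpha} - 0.5)/\hat{\beta}$.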
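As a concrete check of the update in equations (4) to (7), the following is a minimal Python sketch. It assumes a scalar sample and takes the sample variance with divisor n; the function and argument names are illustrative and do not come from the paper.

```python
from statistics import fmean, pvariance


def normal_gamma_posterior(xs, mu, tau, alpha, beta):
    """Update normal-gamma hyperparameters (mu, tau, alpha, beta) with the sample xs."""
    n = len(xs)
    xbar = fmean(xs)
    s = pvariance(xs, xbar)  # population variance: (1/n) * sum (x - xbar)^2
    mu_hat = (tau * mu + n * xbar) / (tau + n)                                   # (4)
    beta_hat = beta + 0.5 * n * s + tau * n * (xbar - mu) ** 2 / (2.0 * (tau + n))  # (5)
    alpha_hat = alpha + 0.5 * n                                                  # (6)
    tau_hat = tau + n                                                            # (7)
    return mu_hat, tau_hat, alpha_hat, beta_hat


def map_estimates(mu_hat, tau_hat, alpha_hat, beta_hat):
    """Joint posterior mode of (mean, precision) for a normal-gamma density."""
    return mu_hat, (alpha_hat - 0.5) / beta_hat


posterior = normal_gamma_posterior([1.0, 2.0, 3.0], mu=0.0, tau=1.0, alpha=1.0, beta=1.0)
mean_map, precision_map = map_estimates(*posterior)
```

With a small prior weight (tau much smaller than n) the MAP mean moves close to the sample mean, while a large tau keeps it near the prior mean.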
This approach has been widely used for sequential learning 
of the mean vectors of feature-based or template-based speech 
recognizers, see for example \[5\] and \[8\]. Ferretti and Scarci \[11\] 
used Bayesian estimation of mean vectors to build speaker-specific 
codebooks in an HMM framework. In all these cases, the precision 
parameter was assumed to be known and the prior density was 
limited to a Gaussian. 
Brown et al. \[6\] have used Bayesian estimation for speaker 
adaptation of CDHMM parameters in a connected digit recog- 
nizer. More recently Lee et al. \[12\] investigated various train- 
ing schemes of the Gaussian mean and variance parameters us- 
ing normal-gamma prior densities for speaker adaptation. They 
showed that on the alpha-digit vocabulary, with a small amount 
of speaker-specific data (1 to 3 utterances of each word), the MAP
estimates gave better results than ML estimates. 
Mixture of normal densities 
In the current implementation of the recognizer used in this 
study \[13\] \[14\] the state observation density is a mixture of mul- 
tivariate normal densities. However, to simplify the presentation 
of our approach, we assume here a mixture of univariate normal 
densities: 
$$P(x \mid \theta) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid m_k, r_k) \qquad (8)$$

where $\theta = (w_1, \ldots, w_K, m_1, \ldots, m_K, r_1, \ldots, r_K)$. For such a density
there exists no sufficient statistic of fixed dimension for $\theta$ and
therefore no conjugate distribution.
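The mixture density of equation (8) can be evaluated directly. The sketch below assumes the precision parameterization used in the text (r is the reciprocal of the variance); the function names are illustrative only.

```python
import math


def normal_pdf(x, m, r):
    """Univariate normal density with mean m and precision r (= 1 / variance)."""
    return math.sqrt(r / (2.0 * math.pi)) * math.exp(-0.5 * r * (x - m) ** 2)


def mixture_pdf(x, weights, means, precisions):
    """Weighted sum of K normal densities; weights are assumed to sum to 1."""
    return sum(w * normal_pdf(x, m, r) for w, m, r in zip(weights, means, precisions))


density = mixture_pdf(0.5, [0.3, 0.7], [0.0, 1.0], [1.0, 4.0])
```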
We propose to use a prior joint density which is the product 
of a Dirichlet density and gamma-normal densities: 
$$P(\theta) \propto \prod_{k=1}^{K} w_k^{\lambda_k}\, r_k^{1/2} \exp(-\frac{\tau_k r_k}{2}(m_k - \mu_k)^2)\, r_k^{\alpha_k - 1} \exp(-\beta_k r_k) \qquad (9)$$
The choice of such a prior density can be justified by the fact that 
the Dirichlet density is the conjugate distribution of the multi- 
nomial distribution (for the mixture weights) and the gamma- 
normal density is the conjugate density of the normal distribution 
(for the mean and the precision parameters). The problem is now 
to find the mode of the posterior joint density. 
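To make the structure of the prior in equation (9) explicit, here is a small sketch of its logarithm, up to an additive constant. It reads (9) as a Dirichlet kernel in the weights times one normal-gamma kernel per component; the hyperparameter names (lam, tau, mu, alpha, beta) are labels chosen here, and all weights and precisions are assumed strictly positive.

```python
import math


def log_prior_kernel(w, m, r, lam, tau, mu, alpha, beta):
    """Log of the prior kernel for a K-component univariate mixture (constant dropped)."""
    total = 0.0
    for k in range(len(w)):
        total += lam[k] * math.log(w[k])                    # Dirichlet kernel in w_k
        total += (alpha[k] - 0.5) * math.log(r[k])          # r^(1/2) * r^(alpha - 1)
        total -= 0.5 * tau[k] * r[k] * (m[k] - mu[k]) ** 2  # normal kernel in m_k
        total -= beta[k] * r[k]                             # gamma kernel in r_k
    return total


# The kernel is largest in m_k when the component mean equals the prior mean mu_k.
at_prior_mean = log_prior_kernel([1.0], [2.0], [1.0], [2.0], [3.0], [2.0], [1.5], [1.0])
off_prior_mean = log_prior_kernel([1.0], [3.0], [1.0], [2.0], [3.0], [2.0], [1.5], [1.0])
```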
If we assume the following regularity conditions, 1) $\lambda_k = \tau_k$
and 2) $\alpha_k = (\tau_k + 1)/2$, then the posterior density
$P(\theta \mid x_1, \ldots, x_n)$ can be seen as the likelihood of a stochastically
independent union of a set of $\sum_{k=1}^{K} \tau_k$ categorized observations
and a set of n uncategorized observations. (A mixture of K densities
can be interpreted as the density of a mixture of K populations, and an
observation is said to be categorized if its population of origin is
known with probability 1.) This fact suggests the use of the E.M.
algorithm \[3\] to find the maximum a posteriori. The following
recursive formulas estimate the MAP of the 3 parameter sets.
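$$c_{ik} = \frac{\hat{w}_k\, \mathcal{N}(x_i \mid \hat{m}_k, \hat{r}_k)}{P(x_i \mid \hat{\theta})} \qquad (10)$$
$$\hat{w}_k' = \frac{\lambda_k + \sum_{i=1}^{n} c_{ik}}{\sum_{k=1}^{K} \lambda_k + n} \qquad (11)$$
$$\hat{m}_k' = \frac{\tau_k \mu_k + \sum_{i=1}^{n} c_{ik} x_i}{\tau_k + \sum_{i=1}^{n} c_{ik}} \qquad (12)$$
$$\hat{r}_k' = \frac{2\alpha_k - 1 + \sum_{i=1}^{n} c_{ik}}{2\beta_k + \tau_k (\mu_k - \hat{m}_k')^2 + \sum_{i=1}^{n} c_{ik} (x_i - \hat{m}_k')^2} \qquad (13)$$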
By using a non-informative prior density (i.e. an improper
distribution with $\lambda_k = 0$, $\tau_k = 0$, $\alpha_k = 1/2$, and $\beta_k = 0$) the
classical E.M. reestimation formulas to compute the maximum
likelihood estimates of the mixture parameters can be recognized.
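One iteration of the re-estimation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses univariate data, derives the weight and precision updates by maximizing the expected complete-data log-likelihood plus the log of the prior (9), and uses hyperparameter names (lam, tau, mu, alpha, beta) chosen here. It does not guard against numerical underflow.

```python
import math


def normal_pdf(x, m, r):
    """Univariate normal density with mean m and precision r."""
    return math.sqrt(r / (2.0 * math.pi)) * math.exp(-0.5 * r * (x - m) ** 2)


def map_em_step(xs, w, m, r, lam, tau, mu, alpha, beta):
    """One MAP-EM iteration for a K-component univariate Gaussian mixture."""
    K, n = len(w), len(xs)
    # E-step: posterior probability c[i][k] of component k for observation i.
    c = []
    for x in xs:
        joint = [w[k] * normal_pdf(x, m[k], r[k]) for k in range(K)]
        total = sum(joint)
        c.append([j / total for j in joint])
    # M-step: mode of the posterior given the current c[i][k].
    lam_sum = sum(lam)
    new_w, new_m, new_r = [], [], []
    for k in range(K):
        ck = sum(c[i][k] for i in range(n))
        mk = (tau[k] * mu[k] + sum(c[i][k] * xs[i] for i in range(n))) / (tau[k] + ck)
        scatter = sum(c[i][k] * (xs[i] - mk) ** 2 for i in range(n))
        new_w.append((lam[k] + ck) / (lam_sum + n))
        new_m.append(mk)
        new_r.append((2.0 * alpha[k] - 1.0 + ck)
                     / (2.0 * beta[k] + tau[k] * (mk - mu[k]) ** 2 + scatter))
    return new_w, new_m, new_r


# With the non-informative setting the step reduces to the usual ML update.
ml_like = map_em_step([1.0, 2.0, 3.0], [1.0], [0.0], [1.0],
                      lam=[0.0], tau=[0.0], mu=[0.0], alpha=[0.5], beta=[0.0])
```

In the single-component case with an informative prior, the result agrees with the closed-form normal-gamma posterior mode, which is a convenient sanity check.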
Generalization to a mixture of multivariate normal densities
is relatively straightforward. For the general case where the
covariance matrices are not diagonal, the prior joint density is the
product of a Dirichlet density and multivariate normal-Wishart
densities. In the case of diagonal covariance matrices, the problem
for each component reduces to the 1-dimensional case, and
formulas (12) and (13) are applied to each vector component.
When the above regularity conditions on the prior joint den- 
sity are not satisfied we have no proof of convergence of this algo- 
rithm. However, in practice we have not encountered any prob- 
lems when these conditions were only approximately satisfied. 
Segmental MAP algorithm 
The above procedure to evaluate the MAP of a mixture of 
Gaussians can be applied to estimate the observation density
parameters of an HMM state given a set of n observations $x_1, \ldots, x_n$
assumed to be independently drawn from the state distribution.
Following the scheme of the segmental k-means algorithm \[7\] to
estimate the parameters of an HMM, first the Viterbi algorithm
is used to segment the training data X into sets of observations
associated with each HMM state and then the MAP estimate
procedure is applied to each state. The following segmental MAP
algorithm originally proposed in \[12\] is obtained:
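1. Set $\hat{\theta} = \arg\max_{\theta} P(\theta)$
2. Obtain the optimal state sequence $\hat{S}$, i.e.
$\hat{S} = \arg\max_{S} P(X \mid S, \hat{\theta})\, P(\hat{\theta})$
3. Given the state sequence $\hat{S}$, use the E.M. algorithm to find
$\hat{\theta}$ such that
$\hat{\theta} = \arg\max_{\theta} P(X \mid \hat{S}, \theta)\, P(\theta)$
4. Iterate 2 and 3 until convergence.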
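The alternation in steps 1 to 4 can be sketched as a short loop in which the segmentation and the per-state MAP update are passed in as functions. The toy helpers below are hypothetical stand-ins: a nearest-mean labelling replaces the Viterbi segmentation, and a MAP mean update with known precision replaces the E.M.-based mixture update. Only the control flow is meant to mirror the algorithm.

```python
def segmental_map(frames, theta, segment, map_update, max_iter=20):
    """Alternate segmentation (step 2) and per-state MAP re-estimation (step 3)."""
    theta = dict(theta)  # copy so the caller's initial model is left untouched
    states = None
    for _ in range(max_iter):
        new_states = segment(frames, theta)
        if new_states == states:  # segmentation unchanged: treat as converged
            break
        states = new_states
        for j in theta:
            data = [x for x, s in zip(frames, states) if s == j]
            if data:  # states with no frames keep their current value
                theta[j] = map_update(j, data)
    return theta, states


# Hypothetical toy setting: scalar frames, one mean per state.
PRIOR_MEAN = {0: 0.0, 1: 4.0}  # prior modes, also used as the initial model (step 1)
PRIOR_TAU = 1.0                # prior weight on each mean


def nearest_mean_segment(frames, theta):
    """Stand-in for Viterbi: label each frame with the closest state mean."""
    return [min(theta, key=lambda j: abs(x - theta[j])) for x in frames]


def map_mean_update(j, data):
    """MAP estimate of a state mean: prior mean weighted by PRIOR_TAU plus the data."""
    return (PRIOR_TAU * PRIOR_MEAN[j] + sum(data)) / (PRIOR_TAU + len(data))


model, labels = segmental_map([0.0, 0.2, 3.8, 4.0], PRIOR_MEAN,
                              nearest_mean_segment, map_mean_update)
```

A real system would replace `nearest_mean_segment` with a Viterbi alignment against the HMM topology and `map_mean_update` with the full mixture re-estimation.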
In order to compare our results to results previously obtained 
with the k-means segmental algorithm \[13\] we used the segmental
MAP algorithm to evaluate the HMM parameters. However, if it
is desired to maximize $P(X \mid \theta)\, P(\theta)$ over the HMM and not only
state by state along the best state sequence, a Bayesian version of
the Baum-Welch algorithm can also be designed. As in the case
of maximum likelihood estimation, simply replace $c_{ik}$ by $c_{ijk}$ in
the reestimation formulas and apply the summations over all the
observations for each state j:

$$c_{ijk} = \gamma_{ij}\, \frac{\hat{w}_{jk}\, \mathcal{N}(x_i \mid \hat{m}_{jk}, \hat{r}_{jk})}{P(x_i \mid \hat{\theta}_j)} \qquad (14)$$

where $\gamma_{ij}$ is the probability of being in the state $s_j$ at time i, given
that the model generates X. (For the segmental MAP approach
$\gamma_{ij}$ is equal to 0 or 1.)
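The weighting in equation (14) amounts to scaling the within-state component posteriors by the state occupancy probability. A minimal sketch for one frame and one state, assuming univariate components and illustrative names:

```python
import math


def normal_pdf(x, m, r):
    """Univariate normal density with mean m and precision r."""
    return math.sqrt(r / (2.0 * math.pi)) * math.exp(-0.5 * r * (x - m) ** 2)


def component_occupancies(x, gamma_ij, w_j, m_j, r_j):
    """Occupancy of each mixture component of state j for frame x, scaled by gamma_ij."""
    joint = [w * normal_pdf(x, m, r) for w, m, r in zip(w_j, m_j, r_j)]
    total = sum(joint)  # state observation density evaluated at x
    return [gamma_ij * v / total for v in joint]


# Segmental case: gamma_ij is 1 for the aligned state and 0 for every other state.
aligned = component_occupancies(0.0, 1.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The occupancies of a state always sum to its gamma value, so setting gamma to 0 or 1 recovers the segmental MAP case.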
Prior density estimation 
If the prior density defined by equation (9) for a mixture of 
univariate Gaussians is used, more parameters need to be evalu- 
ated for the prior density than for the mixture density itself. As 
in the case for the HMM parameters, it is therefore of interest to 
use tied parameters for the prior densities in order to obtain more 
robust estimators or to simply reduce the memory requirements. 
The method of estimating these parameters depends on the 
desired goals. We envisage the following three types of applica- 
tions for Bayesian learning. 
• Sequential training: The goal is to update existing models 
with new observations without reusing the original data in 
order to save time and memory. After each new data set 
has been processed, the prior densities must be replaced by 
an estimate of the posterior densities. In order to approach 
the HMM MLE estimators the size of each observation must 
be as large as possible. The process is initialized with non- 
informative prior densities. 
• Model adaptation: For model adaptation most of the prior 
density parameters are derived from parameters of an exist- 
ing HMM. (This justifies the use of the term "model adap- 
tation" even if the only sources of information for Bayesian 
learning are the prior densities and the new data.) To es- 
timate parameters not directly obtained from the existing 
model, training data is needed in which the "missing" prior 
information can be found. This data can be the data already 
used to build the existing models or a larger set containing 
the variability we want to model with the prior densities. 
• Parameter smoothing: Since the goal of parameter smooth- 
ing is to obtain robust HMM parameters, shared prior pa- 
rameters must be used. These parameters are estimated on 
the same training data used to estimate the HMM parameters
via Bayesian learning. For example, with this approach
context-dependent (CD) models can be built from context- 
independent (CI) models. 
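The sequential training idea, where the posterior of one data set becomes the prior for the next, is exact in the simplest conjugate case. The sketch below assumes a single univariate Gaussian with a normal-gamma prior, not the mixture or HMM case where the posterior can only be approximated; names are illustrative.

```python
def update(prior, xs):
    """Normal-gamma update of (mu, tau, alpha, beta) with one batch of scalar data."""
    mu, tau, alpha, beta = prior
    n = len(xs)
    xbar = sum(xs) / n
    scatter = sum((x - xbar) ** 2 for x in xs)  # n times the sample variance
    return (
        (tau * mu + n * xbar) / (tau + n),
        tau + n,
        alpha + 0.5 * n,
        beta + 0.5 * scatter + tau * n * (xbar - mu) ** 2 / (2.0 * (tau + n)),
    )


def sequential_update(prior, batches):
    """Process batches one at a time without keeping earlier data."""
    for xs in batches:
        prior = update(prior, xs)  # posterior of this batch is the prior for the next
    return prior


one_pass = update((0.0, 1.0, 1.0, 1.0), [1.0, 2.0, 3.0])
two_pass = sequential_update((0.0, 1.0, 1.0, 1.0), [[1.0, 2.0], [3.0]])
```

Here `one_pass` and `two_pass` agree up to rounding, which is the property that lets existing models be updated without reusing the original data.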
In this study we were mainly interested in the problems of 
speaker-independent training and speaker adaptation. Therefore 
parameter smoothing and model adaptation in which the prior 
density parameters must be evaluated from SI or SD models and 
from SI training data were investigated. This approach was used 
to smooth the parameters of CD models, for speaker adaptation, 
and to build sex-dependent models. 
In these three cases, the prior density parameters were es- 
timated along with the estimation of the SI model parameters 
using the segmental k-means algorithm. Information about the 
variability to be modeled with the prior densities was associated 
with each frame of the SI training data. This information was 
simply represented by a class number which can be the speaker 
number, the speaker sex, or the phonetic context. The HMM 
parameters for each class $C_l$ given the mixture component were
then computed. For the experiments reported in this paper, the 
prior density parameters were estimated as follows: 
$$\alpha_{jk} = \frac{\tau_{jk} + 1}{2} \qquad (15)$$
$$\beta_{jk} = \frac{\tau_{jk}}{2 r_{jk}} \qquad (16)$$
$$\mu_{jk} = m_{jk} \qquad (17)$$
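$$\lambda_{jk} = w_{jk} \sum_{k=1}^{K} \tau_{jk} \qquad (18)$$
$$\tau_{jk} = \frac{p \sum_{l} c_{jkl}}{\sum_{l} c_{jkl}\, (y_{jkl} - m_{jk})^t \left(\sum_{k} w_{jk} r_{jk}^{-1}\right)^{-1} (y_{jkl} - m_{jk})} \qquad (19)$$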
where $w_{jk}$, $m_{jk}$, and $r_{jk}$ are the SI HMM parameters for each
state j and each mixture component k ($m_{jk}$ and $r_{jk}$ are vectors
of p components). The class mean vector $y_{jkl}$ is equal to
$\sum_i c_{ijkl} x_i / c_{jkl}$, where $c_{ijkl}$ is defined as $c_{ijkl} = c_{ijk}$ if
$x_i \in C_l$ and $c_{ijkl} = 0$ if $x_i \notin C_l$, and $c_{jkl} = \sum_i c_{ijkl}$. It can be
seen that when the $\tau_{jk}$'s are known all the other prior parameters
are directly estimated from the SI HMM parameters. The prior density
parameter $\tau_{jk}$ can be regarded as a weight associated with the k-th
Gaussian of state $s_j$. When this weight is large, the prior density
is sharply peaked around the values of the SI HMM parameters
and these values will be modified only slightly by the adaptation
process. Conversely, if $\tau_{jk}$ is small the adaptation will be very
fast. By choosing these estimators for the prior parameters the
ability of the prior density to accurately model the inter-class
variability is reduced but more robust estimators are obtained.
Additionally, to further increase the robustness, the $\tau_{jk}$ values
can be constrained to be identical for all Gaussians of a given
state, or for all states of an HMM, or even for all the HMMs. For 
the experiments reported in this paper a common value for all 
the HMMs was estimated. This is clearly too strong a constraint 
and we plan to relax it in future experiments. 
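Once the weight of a Gaussian is fixed, the remaining normal-gamma parameters follow from the SI model. The sketch below reads equations (15) to (17) as alpha = (tau + 1)/2, beta = tau / (2 r) and mu = m, applied per feature dimension for diagonal covariances; the function names and the scalar weight `tau_jk` are assumptions made for illustration.

```python
def prior_from_si(m_jk, r_jk, tau_jk):
    """Return (mu, alpha, beta) for one Gaussian; mu and beta have one entry per dimension."""
    mu = list(m_jk)                             # prior mean centred on the SI mean
    alpha = (tau_jk + 1.0) / 2.0                # shared by all dimensions
    beta = [tau_jk / (2.0 * r) for r in r_jk]   # one value per precision component
    return mu, alpha, beta


def prior_mode_precision(alpha, beta):
    """Mode of the normal-gamma prior in the precision, per dimension."""
    return [(alpha - 0.5) / b for b in beta]


mu, alpha, beta = prior_from_si([1.0, -2.0], [2.0, 0.5], tau_jk=4.0)
mode = prior_mode_precision(alpha, beta)
```

With this choice the prior mode coincides with the SI mean and precision, so with no adaptation data the MAP estimate simply returns the SI model.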
The state log-energy density parameters can be adapted using 
the same Bayesian learning principle. In the current models, a 
discrete pdf is used to model the state log-energy. As for the
mixture parameters, these pdfs were estimated using Bayesian
learning. The prior density, a Dirichlet distribution, was esti- 
mated in the same way as the mixture weights. Bayesian learn- 
ing of the log-energy pdf was not used for fast speaker adaptation 
since we could only adapt the parameters corresponding to a few 
observed log-energy values. In fact, here the more general problem
is Bayesian learning of discrete HMMs based on multinomial
distributions, for which only the statistics of the observed symbols
can be adapted. One solution to this problem is to view, only
for training purposes, the multinomial distribution as a mixture 
of Gaussians with a common covariance matrix. 
CD MODEL SMOOTHING 
It is well known that HMM training requires smoothing, par- 
ticularly if a large number of context dependent (CD) phone mod- 
els are used with limited training data. While several solutions 
have been investigated to smooth discrete HMMs, such as model 
interpolation, co-occurrence smoothing, and fuzzy VQ, only variance
smoothing has been proposed for continuous density HMMs.
We investigated the use of Bayesian learning to train CD phone
models with prior densities obtained from CI phone training. This 
approach can be seen as model interpolation between CI and CD 
models for the case of continuous density HMMs. 
All the experiments presented in this paper use a set of 1769 
CD phone models. Each model is a 3 state left-to-right HMM with 
Gaussian mixture state observation densities except for the silence 
model which has only one state. Diagonal covariance matrices are 
used and it is assumed that the transition probabilities are fixed 
and known. As described in \[14\], a 38-dimensional feature vector 
composed of 12 cepstrum coefficients, 12 delta cepstrum coeffi- 
cients, the delta log energy, 12 delta-delta cepstrum coefficients, 
and the delta-delta log energy is used. The training and testing 
materials were taken from the DARPA Naval Resource Manage- 
ment task as provided by NIST. For telephone bandwidth com- 
patibility, the original speech signal was filtered from 100 Hz to 
3.8 kHz and down-sampled at 8 kHz. Results are reported using 
the standard word-pair grammar with a perplexity of about 60. 
For the parameter smoothing experiments, the training data 
consisted of 3969 sentences from 109 speakers (78 males and 31 
females). This data set will be subsequently referred to as the SI- 
109 training data. For the MAP estimation, the prior densities 
were based on a 47 CI model set. Covariance clipping, as reported 
in \[13\], has been used for the two approaches. Results are reported 
with a mixture of 16 Gaussian components for each state. Table 1 
shows word error rates obtained for the FEB89, OCT89, JUN90,
and FEB91 test sets using models estimated with the MLE and
MAP methods. 
An average error rate reduction of about 10% was observed 
using parameter smoothing with prior densities estimated on a 
set of 47 units. This improvement is limited since the 1769 phone 
model set was originally designed to be trainable with an MLE
approach on the SI-109 training data \[13\]. We intend to run
some other experiments with a larger number of CD units to
further explore this approach.
Model type   FEB89   OCT89   JUN90   FEB91
MAP47        5.3     6.0     5.3     5.4

Table 1: Parameter smoothing with Bayesian learning (word error rates, %).
SPEAKER ADAPTATION 
Previous work on speaker adaptation within the framework
of the DARPA RM task has been reported for fast adaptation
(using less than 2 min of speech). Model interpolation has been
proposed to adapt SI models \[9\] and probabilistic spectral map- 
ping has been proposed to adapt SD models \[10\] and multi- 
speaker models \[15\]. In the framework of Bayesian learning, 
speaker adaptation may be viewed as adjusting speaker-indepen- 
dent models to form speaker-specific ones, using the available 
prior information and a small amount of speaker-specific adapta- 
tion data. Along with the estimation of the parameters for the 
SI CD models, the prior densities are simultaneously estimated 
during the speaker-independent training process. The speaker- 
specific models are built from the adaptation data using the seg- 
mental MAP algorithm. The SI models are used to initialize the 
iterative adaptation process. After segmenting all of the training 
sentences with the models generated in the previous iteration, 
the speaker-specific training data is used to adapt the CD phone 
models both with and without reference to the segmental labels. 
Three types of adaptation were investigated: adapting all CD 
phones with the exact triphone label (type 1), those with the 
same CI phone label (type 2), and all models without regard to 
the label (type 3). Each frame of the sentence is distributed over 
the models based on the observation densities of the preceding 
iteration. When the model labels are not used, this method can 
be viewed as probabilistic spectral mapping constrained by the 
prior densities. For fast speaker adaptation, it was found that a 
combination of adaptation types 1 and 2 was the most effective. 
The same set of 1769 CD phone units, where the observation 
densities are mixtures of 38-element multivariate Gaussian distri- 
butions was used for evaluation. While a maximum of 8 mixture 
components per density was allowed, the actual average number 
of components was 7. This represents a total of 3 million param- 
eters to be estimated and adapted. 
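The figure of about 3 million parameters can be checked with simple arithmetic. The sketch assumes that each diagonal-covariance Gaussian contributes 38 means, 38 variances and one mixture weight, that the single-state silence model is one of the 1769 models, and that every state has the average of 7 components; these are assumptions for the estimate, not figures stated in the text.

```python
def gaussian_mixture_param_count(n_models, states_per_model, comps_per_state, dim,
                                 single_state_models=1):
    """Approximate number of observation-density parameters in the model set."""
    n_states = (n_models - single_state_models) * states_per_model + single_state_models
    per_component = 2 * dim + 1  # mean and variance per dimension, plus one weight
    return n_states * comps_per_state * per_component


total = gaussian_mixture_param_count(1769, 3, 7, 38)
```

Under these assumptions the count comes to roughly 2.9 million, in line with the order of magnitude quoted above.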
Experiments were conducted using approximately 1 and 2 
minutes of adaptation data to build the speaker-specific models. 
In 40 utterances, roughly 2 minutes of speech, only about 45% of
the CD phones appear (28% for 20 sentences), whereas typically
all the CI phones appear. Table 2 summarizes the test results² on
the JUN90 data for the last 80 utterances of each speaker, where 
the first 20 (or 40) utterances were used for supervised adapta- 
tion of types 1 and 2. Speaker-independent recognition results 
are also shown for comparison. With 1 minute and 2 minutes of 
speaker-specific training data, a 16% and 31% reduction in word 
error were obtained compared to the speaker-independent results. 
On this test speaker adaptation appears to be effective only for
the female speakers, for whom SI performance was lower than for
the male speakers.
²Results reported in this section were obtained with a recognizer using a
guided search strategy \[17\] which has been found to give slightly biased and
better performance than a regular beam search strategy.

Speaker   SI    SA (1 min)   SA (2 min)   Err. Red. (2 min)
BJW(F)    4.7   3.4          2.2          53%
JLS(M)    3.6   3.0          3.4          5%
JRM(F)    9.2   7.0          5.3          42%
LPN(M)    3.2   4.7          3.2          0%
Overall   5.1   4.3          3.5          31%

Table 2: Speaker adaptation results on the JUN90 test data.

Preliminary experiments have also been carried out using
unsupervised speaker adaptation, which is more applicable to on-line
situations. Starting with the SI models, adaptation of SI phone
models is performed every 40 utterances using type 2 adaptation.
The results on the JUN90 test are shown in Table 3 for the last
80 sentences of each speaker. There is an overall error reduction
of 16%.

Speaker   SI    SA (2 x 2 min)
BJW(F)    4.7   3.4
JLS(M)    3.6   3.5
JRM(F)    9.2   6.6
LPN(M)    3.2   3.7
Overall   5.1   4.3

Table 3: Unsupervised speaker adaptation results on the JUN90 test
data.
SEX-DEPENDENT MODELING 
It has recently been reported that the use of different models 
for male and female speakers reduced recognizer errors by 6% on 
the FEB89 and OCT89 tests using a word-pair grammar with 
models trained on the SI-109 data set \[16\]. We investigated the
same idea within the framework of Bayesian learning. Two sets of 
1769 CD phone models were generated using data from the male 
speakers for one set and from the female speakers for the other 
set. For both sets the same prior density parameters, which had 
been estimated along with SI training on all 109 speakers, were 
used. Recognition is performed by computing the likelihoods of 
the sentence for the two sets of models and by selecting the so- 
lution corresponding to the highest likelihood. In order to avoid 
problems due to likelihood disparities caused by implementation 
details, all the HMM parameters with the exception of the Gaus- 
sian mean vectors were assumed to be known and set to the pa- 
rameter values of the SI models trained on the 109 speakers. 
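The selection rule described here, decoding with each model set and keeping the hypothesis with the highest likelihood, can be sketched as below. The decoder interface (a callable returning a hypothesis and a log-likelihood) is a hypothetical one chosen for illustration.

```python
def recognize(sentence, decoders):
    """Decode with every model set and keep the highest-likelihood result.

    decoders maps a model-set name to a function returning
    (hypothesis, log_likelihood) for the given sentence.
    """
    best = None
    for name, decode in decoders.items():
        hypothesis, log_likelihood = decode(sentence)
        if best is None or log_likelihood > best[2]:
            best = (name, hypothesis, log_likelihood)
    return best


# Toy decoders standing in for the male and female model sets.
toy_decoders = {
    "MA": lambda s: ("hypothesis from male models", -10.0),
    "FE": lambda s: ("hypothesis from female models", -8.0),
}
chosen = recognize("some utterance", toy_decoders)
```

Comparing likelihoods across model sets is only meaningful when the sets share the same structure and scaling, which is why the text fixes all parameters except the mean vectors.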
Table 4 shows the results obtained on the FEB91 test using the 
speaker independent set (SI), the male set (MA), the female set 
(FE), and the male and female sets together (MA+FE). Looking 
at the results speaker by speaker it can be seen that sex models 
do the job for which they have been designed: the best result for
each speaker is obtained with the models of his/her sex. For the 
FEB91 test, the male models gave the higher likelihood for 153 
sentences and the female models for 147 sentences. The overall 
improvement obtained using separate models for male and female 
speakers is a reduction in error rate of about 16%. This improve- 
ment is observed for both male and female speakers. 
On the FEB91 test, using Bayesian learning for HMM parameter
smoothing and sex-dependent modeling, a 21% error reduction
compared to the baseline system results is obtained (5.8%
to 4.6%).
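The error reductions quoted in this paper are relative reductions in word error rate. A one-line helper makes the arithmetic explicit; the function name is illustrative.

```python
def relative_error_reduction(baseline, new):
    """Relative reduction in error rate, as a fraction of the baseline error."""
    return (baseline - new) / baseline


# Baseline 5.8% and combined smoothing plus sex-dependent modeling 4.6% on FEB91.
feb91_reduction = relative_error_reduction(5.8, 4.6)
```

Rounded to the nearest percent, the FEB91 figures give the 21% reduction stated above.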
Speaker   SI     MA     FE     MA+FE
ALK(F)    9.3    11.5   8.6    8.6
CAL(F)    3.8    5.1    3.8    3.8
CAU(F)    3.3    3.7    3.7    3.7
EAC(F)    7.2    8.9    6.4    7.2
JLS(M)    1.6    2.0    2.0    2.0
JWG(M)    7.9    6.6    12.9   6.6
MEB(M)    4.1    3.3    6.5    3.3
SAS(M)    1.9    2.2    3.7    2.2
STK(M)    5.0    3.3    5.0    3.3
TRB(F)    10.9   18.3   5.7    5.7
Overall   5.4    6.4    5.8    4.6

Table 4: Results on FEB91 test using separate male/female models.
SUMMARY 
An investigation into the use of Bayesian learning of CDHMM 
parameters has been carried out. The theoretical framework for
training HMMs with Gaussian mixture densities was presented. It 
was shown that Bayesian learning can serve as a unified approach 
for parameter smoothing, speaker adaptation, and speaker clus- 
tering. Encouraging results have been obtained for these three 
applications. 
Bayesian learning applied to HMM parameter smoothing gave
an overall 10% reduction in word errors compared to results
obtained using conventional segmental k-means training. Using
Bayesian learning for sex-dependent modeling, an additional 15% 
error reduction was obtained. For speaker adaptation, a 31% er- 
ror reduction was obtained on the JUN90 test with 2 minutes of 
speaker-specific training data. Since the extent of these tests is
relatively limited, other experiments should be carried out to ob- 
tain more statistically significant results in order to fully validate 
this approach. 
REFERENCES 
\[1\] M. H. DeGroot, Optimal Statistical Decisions, McGraw-Hill, New
York, 1970.
\[2\] R. O. Duda and P. E. Hart, Pattern Classification and Scene
Analysis, John Wiley & Sons, New York, 1973.
\[3\] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum Likelihood
from Incomplete Data via the EM Algorithm", J. Roy. Statist.
Soc. Ser. B, 39, pp. 1-38, 1977.
\[4\] F. Jelinek and R. L. Mercer, "Interpolated Estimation of Markov
Source Parameters from Sparse Data", Pattern Recognition in
Practice, pp. 381-397, North-Holland Publishing Company,
Amsterdam, 1980.
\[5\] R. Zelinski and F. Class, "A Learning Procedure for
Speaker-Dependent Word Recognition Systems Based on Sequential
Processing of Input Tokens", Proc. ICASSP83, pp. 1053-1056,
Boston, May 1983.
\[6\] P. F. Brown, C.-H. Lee, and J. C. Spohrer, "Bayesian Adaptation
in Speech Recognition", Proc. ICASSP83, pp. 761-764, Boston,
May 1983.
\[7\] L. R. Rabiner, J. G. Wilpon, and B.-H. Juang, "A Segmental
k-Means Training Procedure for Connected Word Recognition",
AT&T Tech. Journal, Vol. 65, No. 3, pp. 21-32, May-June 1986.
\[8\] R. M. Stern and M. J. Lasry, "Dynamic Speaker Adaptation for
Feature-Based Isolated Word Recognition", IEEE Trans. on ASSP,
Vol. ASSP-35, No. 6, June 1987.
\[9\] K.-F. Lee, "Large-Vocabulary Speaker-Independent Continuous
Speech Recognition: The SPHINX System", Ph.D. Thesis,
Carnegie-Mellon University, 1988.
\[10\] M.-W. Feng, F. Kubala, R. Schwartz, and J. Makhoul, "Improved
Speaker Adaptation Using Text Dependent Spectral Mappings",
Proc. ICASSP88, pp. 131-134, New York, April 1988.
\[11\] M. Ferretti and S. Scarci, "Large-Vocabulary Speech Recognition
with Speaker-Adapted Codebook and HMM Parameters", Proc.
Eurospeech89, pp. 154-156, Paris, Sept. 1989.
\[12\] C.-H. Lee, C.-H. Lin, and B.-H. Juang, "A Study on Speaker
Adaptation of Continuous Density HMM Parameters", Proc. ICASSP90,
pp. 145-148, Albuquerque, April 1990.
\[13\] C.-H. Lee, L. R. Rabiner, R. Pieraccini and J. G. Wilpon,
"Acoustic Modeling for Large Vocabulary Speech Recognition",
Computer Speech and Language, 4, pp. 127-165, 1990.
\[14\] C.-H. Lee, E. Giachin, L. R. Rabiner, R. Pieraccini and A. E.
Rosenberg, "Improved Acoustic Modeling for Continuous Speech
Recognition", Proc. DARPA Speech and Natural Language Workshop,
Hidden Valley, June 1990.
\[15\] F. Kubala and R. Schwartz, "A New Paradigm for
Speaker-Independent Training and Speaker Adaptation", Proc. DARPA
Speech and Natural Language Workshop, Hidden Valley, June 1990.
\[16\] X. Huang, F. Alleva, S. Hayamizu, H.-W. Hon, M.-Y. Hwang, K.-F.
Lee, "Improved Hidden Markov Modeling for Speaker-Independent
Continuous Speech Recognition", Proc. DARPA Speech and Natural
Language Workshop, Hidden Valley, June 1990.
\[17\] R. Pieraccini, C.-H. Lee, E. Giachin, L. R. Rabiner,
"Implementation Aspects of Large Vocabulary Recognition Based on
Intraword and Interword Phonetic Units", Proc. DARPA Speech and
Natural Language Workshop, Hidden Valley, June 1990.
