MAP Estimation of Continuous Density HMM: Theory and Applications
Jean-Luc Gauvain† and Chin-Hui Lee
Speech Research Department 
AT&T Bell Laboratories 
Murray Hill, NJ 07974 
ABSTRACT 
We discuss maximum a posteriori (MAP) estimation of continuous density hidden
Markov models (CDHMM). The classical MLE reestimation algorithms, namely the
forward-backward algorithm and the segmental k-means algorithm, are expanded,
and reestimation formulas are given for HMMs with Gaussian mixture observation
densities. Because of its adaptive nature, Bayesian learning serves as a
unified approach for the following four speech recognition applications,
namely parameter smoothing, speaker adaptation, speaker group modeling and
corrective training. New experimental results on all four applications are
provided to show the effectiveness of the MAP estimation approach.
INTRODUCTION 
Estimation of hidden Markov models (HMMs) is usually performed
by the method of maximum likelihood (ML) [1, 10, 6], assuming
that the size of the training data is large enough to provide robust
estimates. This paper investigates maximum a posteriori (MAP)
estimation of continuous density hidden Markov models (CDHMM).
The MAP estimate can be seen as a Bayes estimate of the vector pa-
rameter when the loss function is not specified [2]. This estimation
technique provides a way of incorporating prior information into the
training process, which is particularly useful for dealing with problems
posed by sparse training data, for which the ML approach gives
inaccurate estimates. This approach can be applied to two classes
of estimation problems, namely parameter smoothing and model
adaptation, both related to the problem of sparse training data.
In the following, the sample x = (x_1, ..., x_n) is a given set of
n observations, where x_1, ..., x_n are either independent and identi-
cally distributed (i.i.d.), or are drawn from a probabilistic function
of a Markov chain.
The difference between MAP and ML estimation lies in the
assumption of an appropriate prior distribution of the parameters to
be estimated. If θ, assumed to be a random vector taking values
in the space Θ, is the parameter vector to be estimated from the
sample x with probability density function (p.d.f.) f(·|θ), and if g
is the prior p.d.f. of θ, then the MAP estimate, θ_MAP, is defined as
the mode of the posterior p.d.f. of θ, i.e.

    θ_MAP = argmax_θ f(x|θ) g(θ)    (1)

If θ is assumed to be fixed but unknown, then there is no knowl-
edge about θ, which is equivalent to assuming a non-informative
improper prior, i.e. g(θ) = constant. Equation (1) then reduces to
the familiar ML formulation.
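As a toy numerical illustration of equation (1) (our own sketch, not part of the original experiments), consider MAP estimation of a scalar Gaussian mean with known unit variance under a conjugate normal prior. The posterior mode is then a precision-weighted average of the prior mode and the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior: theta ~ N(mu0, 1/tau); likelihood: x_t ~ N(theta, 1).
mu0, tau = 0.0, 5.0              # assumed prior mode and prior precision
x = rng.normal(2.0, 1.0, 10)     # n = 10 simulated observations

theta_ml = x.mean()              # ML estimate: ignores the prior
# MAP estimate: mode of the posterior, i.e. eq. (1) with the conjugate prior.
theta_map = (tau * mu0 + len(x) * theta_ml) / (tau + len(x))

# With sparse data the MAP estimate is pulled toward the prior mode mu0;
# as n grows it converges to the ML estimate.
```

With a flat prior (tau → 0) the two estimates coincide, which is the reduction to ML noted above.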
† This work was done while Jean-Luc Gauvain was on leave from the
Speech Communication Group at LIMSI/CNRS, Orsay, France.

Given the MAP formulation two problems remain: the choice of
the prior distribution family and the evaluation of the maximum a
posteriori. These two problems are closely related, since the appro-
priate choice of the prior distribution can greatly simplify the MAP
estimation. As with ML estimation, MAP estimation is relatively
easy if the family of p.d.f.'s {f(·|θ), θ ∈ Θ} possesses a sufficient
statistic of fixed dimension t(x). In this case, the natural solution is
to choose the prior density in a conjugate family, {k(·|φ), φ ∈ Φ},
which includes the kernel density of f(·|θ), i.e. ∀x, t(x) ∈ Φ [4, 2].
The MAP estimation is then reduced to the evaluation of the mode
of k(θ|φ′) ∝ k(θ|φ) k(θ|t(x)), a problem almost identical to the
ML estimation problem. However, among the families of interest,
only exponential families have a sufficient statistic of fixed dimen-
sion [7]. When there is no sufficient statistic of fixed dimension,
MAP estimation, like ML estimation, is a much more difficult prob-
lem because the posterior density is not expressible in terms of a
fixed number of parameters and cannot be maximized easily. For
both finite mixture densities and hidden Markov models, the lack of a
sufficient statistic of fixed dimension is due to the underlying hid-
den process, i.e. a multinomial model for the mixture and a Markov
chain for an HMM. In these cases ML estimates are usually obtained
by using the expectation-maximization (EM) algorithm [3, 1, 13].
This algorithm exploits the fact that the complete-data likelihood
can be simpler to maximize than the likelihood of the incomplete
data, as in the case where the complete-data model has sufficient
statistics of fixed dimension. As noted by Dempster et al. [3],
the EM algorithm can also be applied to MAP estimation. In the
next two sections the formulations of this algorithm for MAP esti-
mation of Gaussian mixtures and CDHMMs with Gaussian mixture
observation densities are derived.
MAP ESTIMATES FOR GAUSSIAN MIXTURE 
Suppose that x = (x_1, ..., x_n) is a sample of n i.i.d.
observations drawn from a mixture of K p-dimensional
multivariate normal densities. The joint p.d.f. is speci-
fied by

    f(x|θ) = ∏_{t=1}^n Σ_{k=1}^K w_k N(x_t|m_k, r_k)

where θ = (w_1, ..., w_K, m_1, ..., m_K, r_1, ..., r_K) is the parameter
vector and w_k denotes the mixture gain for the k-th mixture component,
with the constraint Σ_{k=1}^K w_k = 1. N(x|m_k, r_k) is the k-th normal
density function, where m_k is the p-dimensional mean vector and r_k is the
p × p precision matrix. As stated in the introduction, no joint conjugate
prior density exists for the parameter vector θ. However, a finite
mixture density can be interpreted as a density associated with a sta-
tistical population which is a mixture of K component populations
with mixing proportions (w_1, ..., w_K). In other words, f(x|θ) can
be seen as a marginal p.d.f. of the product of a multinomial density
(for the sizes of the component populations) and normal densities
(for the component densities). A practical candidate to model the
prior knowledge about the mixture gain parameter vector is there- 
fore a Dirichlet density which is the conjugate prior density for the 
multinomial distribution:

    g(w_1, ..., w_K) ∝ ∏_{k=1}^K w_k^{v_k − 1}    (2)
where v_k > 0. For the vector parameter (m_k, r_k) of the individual
Gaussian mixture component, the joint conjugate prior density is a
normal-Wishart density [2] of the form

    g(m_k, r_k) ∝ |r_k|^{(α_k − p)/2} exp[−½ tr(u_k r_k)]
                  × exp[−(τ_k/2)(m_k − μ_k)ᵀ r_k (m_k − μ_k)]    (3)

where (τ_k, μ_k, α_k, u_k) are the prior density parameters such that
α_k > p − 1, τ_k > 0, μ_k is a vector of dimension p, and u_k is a p × p
positive definite matrix.
Assuming independence between the parameters of the mixture
components and the mixture weights, the joint prior density g(θ)
is taken to be a product of the prior p.d.f.'s defined in equations
(2) and (3), i.e. g(θ) = g(w_1, ..., w_K) ∏_{k=1}^K g(m_k, r_k). As will
be shown later, this choice for the prior density family can also be
justified by noting that the EM algorithm can be applied to the MAP
estimation problem if the prior density is in the conjugate family
of the complete-data density.
The EM algorithm is an iterative procedure for approximat-
ing maximum-likelihood estimates in an incomplete-data context,
such as mixture density and hidden Markov model estimation
problems [1, 3, 13]. This procedure consists of maximizing at
each iteration the auxiliary function Q(θ, θ̂), defined as the ex-
pectation of the complete-data log-likelihood log h(y|θ) given
the incomplete data x = (x_1, ..., x_n) and the current fit θ̂, i.e.
Q(θ, θ̂) = E[log h(y|θ)|x, θ̂]. For a mixture density, the complete-
data likelihood is the joint likelihood of x and ℓ = (ℓ_1, ..., ℓ_n), the un-
observed labels referring to the mixture components, i.e. y = (x, ℓ).
The EM procedure derives from the fact that log f(x|θ) =
Q(θ, θ̂) − H(θ, θ̂), where H(θ, θ̂) = E[log h(y|x, θ)|x, θ̂] and
H(θ, θ̂) ≤ H(θ̂, θ̂); whenever a value θ satisfies Q(θ, θ̂) ≥
Q(θ̂, θ̂), then f(x|θ) ≥ f(x|θ̂). It follows that the same iterative
procedure can be used to estimate the mode of the posterior density
by maximizing the auxiliary function R(θ, θ̂) = Q(θ, θ̂) + log g(θ)
at each iteration instead of Q(θ, θ̂) [3].
For a mixture of K densities {f(·|θ_k)}_{k=1,...,K} with mixture
weights {w_k}_{k=1,...,K}, the auxiliary function Q takes the following
form [13]:

    Q(θ, θ̂) = Σ_{t=1}^n Σ_{k=1}^K [ŵ_k f(x_t|θ̂_k) / f(x_t|θ̂)] log w_k f(x_t|θ_k)    (4)

Let φ(θ, θ̂) = exp R(θ, θ̂) be the function to be maximized, and
define the following notations: c_kt = ŵ_k f(x_t|θ̂_k)/f(x_t|θ̂),
c_k = Σ_{t=1}^n c_kt, x̄_k = Σ_{t=1}^n c_kt x_t / c_k, and
S_k = Σ_{t=1}^n c_kt (x_t − x̄_k)(x_t − x̄_k)ᵀ. It
follows from the definition of f(x|θ) and equation (4) that

    φ(θ, θ̂) ∝ g(θ) ∏_{k=1}^K w_k^{c_k} |r_k|^{c_k/2}
               exp[−(c_k/2)(m_k − x̄_k)ᵀ r_k (m_k − x̄_k) − ½ tr(S_k r_k)]    (5)
From (2), (3) and (5) it can easily be verified that
φ(·, θ̂) belongs to the same family as g, with parameters
{v′_k, τ′_k, α′_k, μ′_k, u′_k}_{k=1,...,K} satisfying the following conditions:

    v′_k = v_k + c_k    (6)
    τ′_k = τ_k + c_k    (7)
    α′_k = α_k + c_k    (8)
    μ′_k = (τ_k μ_k + c_k x̄_k) / (τ_k + c_k)    (9)
    u′_k = u_k + S_k + [τ_k c_k / (τ_k + c_k)] (μ_k − x̄_k)(μ_k − x̄_k)ᵀ    (10)

The considered family of distributions is therefore a conjugate fam-
ily for the complete-data density. The mode of φ(·, θ̂), denoted
(w̃_k, m̃_k, r̃_k), may be obtained from the modes of the Dirichlet and
normal-Wishart densities: w̃_k = (v′_k − 1)/Σ_{k=1}^K (v′_k − 1),
m̃_k = μ′_k, and r̃_k = (α′_k − p) u′_k^{−1}.
Thus, the EM iteration is as follows:

    w̃_k = (v_k − 1 + Σ_{t=1}^n c_kt) / (Σ_{k=1}^K (v_k − 1) + n)    (11)
    m̃_k = (τ_k μ_k + Σ_{t=1}^n c_kt x_t) / (τ_k + Σ_{t=1}^n c_kt)    (12)
    r̃_k = (α_k − p + Σ_{t=1}^n c_kt)
          [u_k + τ_k (μ_k − m̃_k)(μ_k − m̃_k)ᵀ
               + Σ_{t=1}^n c_kt (x_t − m̃_k)(x_t − m̃_k)ᵀ]^{−1}    (13)

If it is assumed that ŵ_k > 0, then c_k1, c_k2, ..., c_kn is a sequence of
n i.i.d. random variables with a non-degenerate distribution, and
lim sup_{n→∞} Σ_{t=1}^n c_kt = ∞ with probability one. It follows that
w̃_k converges to Σ_{t=1}^n c_kt / n with probability one as n → ∞.
Applying the same reasoning to m̃_k and r̃_k, it can be seen that the
EM reestimation formulas for the MAP and ML approaches are
asymptotically similar. Thus, as long as the initial estimates are
identical, the EM algorithm will provide identical estimates with
probability one as n → ∞.
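One MAP-EM iteration can be sketched in code. The following is a minimal illustration of formulas (11)-(13) for the scalar case (p = 1); the function and variable names are our own, and a real implementation would work in the log domain and handle the full-covariance case:

```python
import numpy as np

def map_em_step(x, w, m, r, v, tau, mu, alpha, u):
    """One MAP-EM iteration, eqs. (11)-(13), for a scalar (p = 1)
    Gaussian mixture.  w, m, r: current weights, means, precisions;
    v, tau, mu, alpha, u: prior hyperparameters from eqs. (2)-(3)."""
    n, K = len(x), len(w)
    # E-step: c[k, t] = w_k N(x_t | m_k, r_k) / f(x_t | theta), as in (4).
    dens = np.array([w[k] * np.sqrt(r[k] / (2 * np.pi))
                     * np.exp(-0.5 * r[k] * (x - m[k]) ** 2)
                     for k in range(K)])
    c = dens / dens.sum(axis=0)
    ck = c.sum(axis=1)
    # M-step: posterior modes.
    w_new = (v - 1 + ck) / (np.sum(v - 1) + n)              # (11)
    m_new = (tau * mu + c @ x) / (tau + ck)                 # (12)
    num = (u + tau * (mu - m_new) ** 2
           + np.array([np.sum(c[k] * (x - m_new[k]) ** 2) for k in range(K)]))
    r_new = (alpha - 1 + ck) / num                          # (13) with p = 1
    return w_new, m_new, r_new
```

Note how the prior terms v_k, τ_k, α_k, u_k act as pseudo-observations: they keep the updates well-defined even for components with few assigned frames, and become negligible as the soft counts c_k grow, which is the asymptotic similarity to ML discussed above.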
MAP ESTIMATES FOR CDHMM 
The results obtained for a mixture of normal densities can be
extended to the case of HMMs with Gaussian mixture state ob-
servation densities, assuming that the observation p.d.f.'s of all
the states have the same number of mixture components. We
consider an N-state HMM with parameter vector λ = (π, A, θ),
where π is the initial probability vector, A is the transition ma-
trix, and θ is the p.d.f. parameter vector composed of the mixture
parameters θ_i = {w_ik, m_ik, r_ik}_{k=1,...,K} for each state i. For a
sample x = (x_1, ..., x_n), the complete data is y = (x, s, ℓ), where
s = (s_0, ..., s_n) is the unobserved state sequence and ℓ = (ℓ_1, ..., ℓ_n)
are the unobserved mixture component labels, with s_t ∈ [1, N] and
ℓ_t ∈ [1, K]. The joint p.d.f. h(·|λ) of x, s, and ℓ is defined as [1]

    h(x, s, ℓ|λ) = π_{s_0} ∏_{t=1}^n a_{s_{t−1} s_t} w_{s_t ℓ_t} f(x_t|θ_{s_t ℓ_t})    (14)

where π_i is the initial probability of state i, a_ij is the transition
probability from state i to state j, and θ_ik = (m_ik, r_ik) is the
parameter vector of the k-th normal p.d.f. associated with state i. It
follows that the likelihood of x has the form
    f(x|λ) = Σ_s π_{s_0} ∏_{t=1}^n a_{s_{t−1} s_t} f(x_t|θ_{s_t})    (15)

where f(x_t|θ_i) = Σ_{k=1}^K w_ik N(x_t|m_ik, r_ik), and the summation
is over all possible state sequences.
In the general case where MAP estimation is to be applied not
only to the observation density parameters but also to the initial
and transition probabilities, a Dirichlet density can also be used for
the initial probability vector π and for each row of the transition
probability matrix A. This choice follows directly from the results of
the previous section, since the complete-data likelihood satisfies
h(x, s, ℓ|λ) = h(s|λ) h(x, ℓ|s, λ), where h(s|λ) is the product
of N + 1 multinomial densities with parameters {n, π_1, ..., π_N}
and {n, a_i1, ..., a_iN}_{i=1,...,N}. The prior density for all the HMM
parameters is thus

    G(λ) ∝ ∏_{i=1}^N [π_i^{η_i − 1} g(θ_i) ∏_{j=1}^N a_ij^{η_ij − 1}]    (16)
In the following subsections we examine two ways of ap-
proximating λ_MAP by local maximization of f(x|λ)G(λ) and
f(x, s|λ)G(λ). These two solutions are the MAP versions of the
Baum-Welch algorithm [1] and of the segmental k-means algorithm
[12], algorithms which were developed for ML estimation.
Forward-Backward MAP Estimate 
From (14) it is straightforward to show that the auxiliary
function of the EM algorithm applied to MLE of λ, Q(λ, λ̂) =
E[log h(y|λ)|x, λ̂], can be decomposed into a sum of three aux-
iliary functions: Q_π(π, λ̂), Q_A(A, λ̂) and Q_θ(θ, λ̂) [6]. These
functions, which can be independently maximized, take the follow-
ing forms:

    Q_π(π, λ̂) = Σ_{i=1}^N γ_i0 log π_i    (17)

    Q_A(A, λ̂) = Σ_{i=1}^N Σ_{t=1}^n Σ_{j=1}^N ξ_ijt log a_ij    (18)

    Q_θ(θ, λ̂) = Σ_{i=1}^N Q_θi(θ_i, λ̂)    (19)

with

    Q_θi(θ_i, λ̂) = Σ_{t=1}^n Σ_{k=1}^K γ_it [ŵ_ik f(x_t|θ̂_ik) / f(x_t|θ̂_i)] log w_ik f(x_t|θ_ik)    (20)
where ξ_ijt = Pr(s_{t−1} = i, s_t = j|x, λ̂) and γ_it = Pr(s_t = i|x, λ̂) can
be computed at each EM iteration by using the forward-backward
algorithm [1]. As for the Gaussian mixture case discussed in the
previous section, to estimate the mode of the posterior density
the auxiliary function R(λ, λ̂) = Q(λ, λ̂) + log G(λ) must be
maximized. The form chosen for G(λ) in (16) permits indepen-
dent maximization of each of the following 2N + 1 parameter
sets: {π_1, ..., π_N}, {a_i1, ..., a_iN}_{i=1,...,N} and {θ_i}_{i=1,...,N}. The
MAP auxiliary function R(λ, λ̂) can thus be written as the sum
R_π(π, λ̂) + Σ_i R_{a_i}(a_i, λ̂) + Σ_i R_{θ_i}(θ_i, λ̂), where each term rep-
resents the MAP auxiliary function associated with the indexed
parameter set.
We can recognize in (20) the same form as seen for Q(θ, θ̂) in (4)
for the Gaussian mixture case. It follows that if the c_kt are replaced
by the c_ikt defined as

    c_ikt = γ_it ŵ_ik N(x_t|m̂_ik, r̂_ik) / f(x_t|θ̂_i)    (21)

then the reestimation formulas (11)-(13) can be used to maximize
R_{θ_i}(θ_i, λ̂). It is straightforward to find the reestimation formulas
for π and A by applying the same derivations used for the mixture
weights:

    π̃_i = (η_i − 1 + γ_i0) / (Σ_{j=1}^N η_j − N + 1)    (22)

    ã_ij = (η_ij − 1 + Σ_{t=1}^n ξ_ijt) / (Σ_{j=1}^N η_ij − N + Σ_{j=1}^N Σ_{t=1}^n ξ_ijt)    (23)
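Formula (23) amounts to smoothing the expected transition counts with Dirichlet pseudo-counts. A small numerical sketch (the counts and prior values below are our own, not from the paper):

```python
import numpy as np

# Expected transition counts out of one state i, xi[j] = sum_t xi_ijt,
# and assumed Dirichlet prior parameters eta[j] > 1 (hypothetical values).
xi = np.array([8.0, 2.0, 0.0])
eta = np.array([2.0, 2.0, 2.0])
N = len(xi)

# Formula (23): MAP reestimate of the i-th row of the transition matrix.
a_i = (eta - 1 + xi) / (eta.sum() - N + xi.sum())
# a_i = [9/13, 3/13, 1/13]: the row sums to one, and the prior keeps
# the unobserved transition at a nonzero probability.
```

With a non-informative prior (all η_ij = 1) the formula reduces to the usual ML relative-frequency estimate.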
For multiple independent observation sequences {x_q}_{q=1,...,Q},
with x_q = (x_1^(q), ..., x_{n_q}^(q)), we maximize G(λ) ∏_{q=1}^Q f(x_q|λ),
where f(·|λ) is defined by (15). The EM auxiliary function is
then R(λ, λ̂) = log G(λ) + Σ_{q=1}^Q E[log h(y_q|λ)|x_q, λ̂], where
h(·|λ) is defined by equation (14). It follows that the reestima-
tion formulas for A and θ still hold if the summations over t are
replaced by summations over q and t. The values ξ_ijt^(q) and γ_it^(q)
are then obtained by applying the forward-backward algorithm to
each observation sequence. The reestimation formula for the initial
probabilities becomes

    π̃_i = (η_i − 1 + Σ_{q=1}^Q γ_i0^(q)) / (Σ_{i=1}^N η_i − N + Q)    (24)
As for the Gaussian mixture case, it can be shown that as Q → ∞
the MAP reestimation formulas approach the ML ones, exhibiting
the asymptotic similarity of the two estimates.
These reestimation equations give estimates of the HMM param-
eters which correspond to a local maximum of the posterior density.
The choice of the initial estimates is therefore essential to finding
a solution close to a global maximum and to minimizing the number
of EM iterations needed to attain the local maximum. When using
an informative prior, one natural choice for the initial estimates is
the mode of the prior density, which represents all the available
information about the parameters when no data has been observed.
The corresponding values are simply obtained by applying the rees-
timation formulas with n equal to 0. When using a non-informative
prior, i.e. for ML estimation, uniform initial estimates can be used
for discrete HMMs, but there is no trivial solution for the
continuous density case.
Segmental MAP Estimate 
By analogy with the segmental k-means algorithm [12], a differ-
ent optimization criterion can be considered. Instead of maximizing
G(λ|x), the joint posterior density of λ and s, G(λ, s|x), is maxi-
mized. The estimation procedure becomes

    λ̃ = argmax_λ max_s G(λ, s|x)    (25)
      = argmax_λ max_s f(x, s|λ) G(λ)    (26)

and λ̃ is called the segmental MAP estimate of λ. As for the
segmental k-means algorithm, it is straightforward to prove that,
starting with any estimate λ^(m), alternate maximization over s and
λ gives a sequence of estimates with nondecreasing values of
G(λ, s|x), i.e. G(λ^(m+1), s^(m+1)|x) ≥ G(λ^(m), s^(m)|x), with

    s^(m) = argmax_s f(x, s|λ^(m))    (27)
    λ^(m+1) = argmax_λ f(x, s^(m)|λ) G(λ)    (28)

The most likely state sequence s^(m) is decoded by the Viterbi
algorithm. In fact, maximization over λ can be replaced by
any hill climbing procedure which replaces λ^(m) by λ^(m+1)
subject to the constraint that f(x, s^(m)|λ^(m+1)) G(λ^(m+1)) ≥
f(x, s^(m)|λ^(m)) G(λ^(m)). The EM algorithm is once again a good
candidate to perform this maximization using λ^(m) as an initial es-
timate. The EM auxiliary function is then R(λ, λ̂) = log G(λ) +
E[log h(y|λ)|x, s^(m), λ̂], where h(·|λ) is defined by equation (14).
It is straightforward to show that the forward-backward reestima-
tion equations still hold with ξ_ijt = δ(s_{t−1}^(m) − i) δ(s_t^(m) − j) and
γ_it = δ(s_t^(m) − i), where δ denotes the Kronecker delta function.
PRIOR DENSITY ESTIMATION 
In the previous sections it was assumed that the prior density
G(λ) is a member of a preassigned family of prior distributions de-
fined by (16). In a strictly Bayesian approach the vector parameter
of this family of p.d.f.'s {G(·|φ), φ ∈ Φ} is also assumed known,
based on common or subjective knowledge about the stochastic pro-
cess. Another solution is to adopt an empirical Bayes approach
[14], where the prior parameters are estimated directly from data.
The estimation is then based on the marginal distribution of the data
given the prior parameters.
Adopting the empirical Bayes approach, it is assumed that the
sequence of observations, X, is composed of multiple independent
sequences associated with different unknown values of the HMM
parameters. Let (X, Λ) = [(x_1, λ_1), (x_2, λ_2), ...] be such a
multiple sequence of observations, where each pair is independent
of the others and the λ_q have a common prior distribution G(·|φ).
Since the λ_q are not directly observed, the prior parameter estimates
must be obtained from the marginal density f(X|φ),

    f(X|φ) = ∫ f(X|Λ) G(Λ|φ) dΛ    (29)

where f(X|Λ) = ∏_q f(x_q|λ_q) and G(Λ|φ) = ∏_q G(λ_q|φ).
However, maximum likelihood estimation based on f(X|φ) ap-
pears rather difficult. To simplify this problem, we can choose a sim-
pler optimization criterion by maximizing the joint p.d.f. f(X, Λ|φ)
over Λ and φ instead of the marginal p.d.f. of X given φ. Starting
with an initial estimate φ^(0), we obtain a hill climbing procedure by
alternate maximization over Λ and φ, i.e.

    Λ^(m) = argmax_Λ f(X, Λ|φ^(m))    (30)
    φ^(m+1) = argmax_φ G(Λ^(m)|φ)    (31)

Such a procedure provides a sequence of estimates with non-
decreasing values of f(X, Λ|φ^(m)). The solution of (30) is the
MAP estimate of Λ based on the current prior parameter φ^(m). It
can therefore be obtained by applying the forward-backward MAP
reestimation formulas to each observation sequence x_q. The solu-
tion of (31) is simply the maximum likelihood estimate of φ based
on the current values of the HMM parameters.
Finding this estimate poses two problems. First, due to the 
Wishart and Dirichlet components, ML estimation for the density 
defined by (16) is not trivial. Second, since more parameters are 
needed for the prior density than for the HMM itself, there can 
be a problem of overparametrization when the number of pairs
(x_q, λ_q) is small. One way to simplify the estimation problem is
to use moment estimates to approximate the ML estimates. For 
the overparametrization problem, it is possible to reduce the size of 
the prior family by adding constraints on the prior parameters. For 
example, the prior family can be limited to the family of the kernel
density of the complete-data likelihood, i.e. the posterior density
family of the complete-data model when no prior information is
available. Doing so, it can be verified that the following constraints
hold:

    v_ik = τ_ik    (32)
    α_ik = τ_ik + p    (33)
Parameter tying can also be used to further reduce the size of the 
prior family. 
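As a hedged sketch of such a moment estimate (our own illustration with hypothetical values, not the exact procedure of [5]): given per-class estimates of a Gaussian mean with a shared precision r, the conjugate prior of eq. (3), under which the mean is distributed as N(μ, (τr)^{−1}), can be fit by matching first and second moments:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-class mean estimates (e.g., one per speaker) for a
# single Gaussian parameter, and a shared precision estimate r.
m_q = rng.normal(0.5, 0.2, 20)
r = 4.0

# Under eq. (3), m ~ N(mu, 1/(tau * r)) given r, so match moments:
mu_hat = m_q.mean()               # first moment  -> prior mean mu
tau_hat = 1.0 / (r * m_q.var())   # second moment -> prior weight tau
```

Here τ̂ plays the role of an equivalent sample size: a small between-class variance yields a large τ̂, so the MAP updates (12) stay close to the prior mean, which is exactly the behavior wanted for tied or sparsely observed parameters.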
We use this approach for two types of applica-
tions: parameter smoothing and adaptive learning. For parameter
"smoothing", the goal is to estimate {λ_1, λ_2, ...}. The previous
algorithm offers a direct solution to "smooth" these different esti-
mates by assuming a common prior density for all the models. For
adaptive learning, we observe a new sequence of observations x_q
associated with the unobserved vector parameter value λ_q. The
MAP estimate of λ_q can be obtained by using for prior parameters
a point estimate φ̃ obtained with the previous algorithm. Such a
training process can be seen as an adaptation of an a priori model
λ̃ = argmax_λ G(λ|φ̃) (when no training data is available) to more
specific conditions corresponding to the new observation sequence
x_q.
In the applications presented in this paper, the prior density pa- 
rameters were estimated along with the estimation of the SI model 
parameters using the segmental k-means algorithm. Information 
about the variability to be modeled with the prior densities was as- 
sociated with each frame of the SI training data. This information 
was simply represented by a class number which can be the speaker 
ID, the speaker sex, or the phonetic context. The HMM parameters
for each class, given the mixture component, were then computed,
and moment estimates were obtained for the tied prior parameters,
subject also to conditions (32)-(33) [5].
EXPERIMENTAL SETUP 
The experiments presented in this paper used various sets of 
context-independent (CI) and context-dependent (CD) phone mod- 
els. Each model is a left-to-right HMM with Gaussian mixture 
state observation densities. Diagonal covariance matrices are used 
and the transition probabilities are assumed fixed and known. As
described in [8], a 38-dimensional feature vector composed of
LPC-derived cepstrum coefficients and first- and second-order time
derivatives is used. Results are reported for the RM task with the standard
word pair grammar and for the TI/NIST connected digits. Both
corpora were down-sampled to telephone bandwidth.
MODEL SMOOTHING AND ADAPTATION 
Training    0 min   2 min   5 min   30 min
SD          --      31.5    12.1    3.5
SA (SI)     13.9    8.7     6.9     3.4
SA (M/F)    11.5    7.5     6.0     3.5

Table 1: Summary of SD, SA (SI), and SA (M/F) results on the FEB91-SD
test. Results are given as word error rate (%).

Last year we reported results for CD model smoothing, speaker
adaptation, and sex-dependent modeling [5]. CD model smoothing
was found to reduce the word error rate by 10%. Speaker adaptation
was tested on the JUN90 data with 1 minute and 2 minutes of
speaker-specific adaptation data. A 16% and 31% reduction in word
error were obtained compared to the SI results [5]. On the FEB91
test, using Bayesian learning for CD model smoothing combined
with sex-dependent modeling, a 21% word error reduction was
obtained compared to the baseline results [5].
In order to compare speaker adaptation to ML training of SD
models, an experiment has been carried out on the FEB91-SD test
material, including data from 12 speakers (7m/5f), using a set of 47
CI phone models. Two, five and thirty minutes of the SD training
data were used for training and adaptation. The SD and SA (SI) word
error rates are given in the first two rows of Table 1.
The SD word error rate for 2 min of training data was 31.5%.
The SI word error rate (0 minutes of adaptation data) was 13.9%,
roughly comparable to the SD result with 5 min of SD training
data. The SA models are seen to perform better than SD models
when relatively small amounts of data were used for training or
adaptation. When all the available training data was used, the
SA and SD results were comparable, consistent with the Bayesian
formulation that the MAP estimate converges to the MLE. Relative
to the SI results, the word error reduction was 37% with 2 min
of adaptation data, an improvement similar to that observed on the
JUN90 test data with CD models [5]. As in the previous experiment,
a larger improvement was observed for the female speakers (51%)
than for the male speakers (22%).
Speaker adaptation was also performed starting with sex-
dependent models (third row of Table 1). The word error rate with
no speaker adaptation is 11.5%. The error rate is reduced to 7.5%
with 2 min, and 6.0% with 5 min, of adaptation data. Comparing
the last 2 rows of the table it can be seen that SA is more effective
when sex-dependent seed models are used. The error reduction
with 2 min of training data is 35% compared to the sex-dependent
model results and 46% compared to the SI model results.
P.D.F. SMOOTHING 
We have shown that Bayesian learning can be used for CD model
smoothing [5]. This approach can be seen either as a way to add
extra constraints to the model parameters so as to reduce the effect
of insufficient training data, or as an "interpolation"
between two sets of parameter estimates: one corresponding to
the desired model and the other to a smaller model which can
be trained using MLE on the same data. Instead of defining a
reduced parameter set by removing the context dependency, we can
alternatively reduce the mixture size of the observation densities
and use a single Gaussian per state in the smaller model. Cast in the
Bayesian learning framework, this implies that the same marginal
prior density is used for all the components of a given mixture.
Variance clipping can also be viewed as a MAP estimation technique
with a uniform prior density constrained by a maximum (positive)
value for the precision parameters [9]. However, this does not have
the appealing interpolation capability of the conjugate priors.
         WACC   SACC (Strings Correct)
MLE      99.6   98.7 (8464)
MLE+VC   99.6   98.8 (8477)
MAP      99.7   99.1 (8502)

Table 2: TI test results for p.d.f. smoothing (213 inter-word CD-32 models).

            FEB89   OCT89   JUN90   FEB91
MLE         93.3    92.5    92.1    92.9
MLE+VC      95.0    95.0    94.8    95.9
MAP (SI)    95.0    95.5    95.0    96.2
MAP (M/F)   95.2    96.2    95.2    96.7

Table 3: RM test results for p.d.f. smoothing (2421 inter-word CD-16
models).

We experimented with this p.d.f. smoothing approach on the TI
digit and RM databases. A set of 213 CD phone models with 32
mixture components (213 CD-32) for the TI digits and a set of 2421
CD phone models with 16 mixture components (2421 CD-16) for
RM were used for evaluation. Results are given for MLE training,
MLE with variance clipping (MLE+VC), and MAP estimation with
p.d.f. smoothing in Tables 2 and 3. In Table 2, word accuracy
(WACC) and string accuracy (SACC) are given for the 8578 test
digit strings of the TI digit corpora. Compared to the variance
clipping scheme, the MAP estimate reduces the number of string
errors by 25%. Using p.d.f. smoothing, the string accuracy of 99.1%
is the best result reported on this task.
For the RM tests summarized in Table 3, a consistent improve-
ment over the variance clipping scheme (MLE+VC) is observed
when p.d.f. smoothing is applied. Combined with sex-dependent
modeling, the MAP (M/F) scheme gives an average word accuracy
of about 95.8%.
CORRECTIVE TRAINING 
Bayesian learning provides a scheme for model adaptation which
can also be used for corrective training. Corrective training maxi-
mizes the recognition rate on the training data, hoping that this will
also improve performance on the test data. One simple way to
do corrective training is to use the training sentences which were
incorrectly recognized as new data. In order to do so, the state
segmentation step of the segmental MAP algorithm was modified
to obtain the frame/state association not only for the sentence model
states but also for the states corresponding to the model of all the
possible sentences (general model). In the reestimation formulas,
the values c_ikt for each state s_i are evaluated using (21), such that
γ_it is equal to 1 in the sentence model and to −1 in the general
model. While convergence is not guaranteed, in practice it was
found that by using large values for τ_ik (≃ 200), the number of
training sentence errors decreased after each iteration until conver-
gence. If we use the forward-backward MAP algorithm, we obtain
a corrective training algorithm for CDHMMs very similar to the
recently proposed corrective MMIE training algorithm [11].
Corrective training was evaluated on both the TI/NIST SI con-
nected digit and the RM tasks. Only the Gaussian mean vectors
and the mixture weights were corrected. For the TI digits, a set of
21 phonetic HMMs were trained on the 8565 digit strings. Results
are given in Table 4 using 16 and 32 mixture components for the
observation p.d.f.'s, with and without corrective training, for both
test and training data.

Training      Training           Test
conditions    string    word     string    word
MLE-16        1.6 (134) 0.5      2.0 (168) 0.7
CT-16         0.2 (18)  0.1      1.4 (122) 0.5
MLE-32        0.8 (67)  0.2      1.5 (126) 0.5
CT-32         0.3 (29)  0.1      1.3 (111) 0.4

Table 4: Corrective training results in string and word error rates (%) on
the TI digits for 21 CI models with 16 and 32 mixture components per state.
String error counts are given in parentheses.

Test Set      MLE-32   CT-32   ICT-32
TRAIN         7.7      1.8     3.1
FEB89         11.9     10.2    8.9
OCT89         11.5     9.8     8.9
JUN90         10.2     8.8     8.1
FEB91         11.4     10.3    10.2
FEB91-SD      13.9     11.3    11.0
Overall Test  11.8     10.1    9.4

Table 5: Corrective training results on the RM task (47 CI models with 32
mixture components per state).

The CT-16 results were obtained with 8 iterations of corrective
training, while the CT-32 results were based on only 3 iterations,
where one full iteration of corrective training is implemented as one
recognition run which produces a set of "new" training strings (i.e.
errors and/or barely correct strings), followed by 10 iterations of
Bayesian adaptation using the data of these strings.
String error rates of 1.4% and 1.3% were obtained with 16 and 32
mixture components per state respectively, compared to 2.0% and
1.5% without corrective training. These represent string error re-
ductions of 27% and 12%. We note that corrective training helps
more with smaller models, as the ratio of adaptation data to the
number of parameters is larger.
The corrective training procedure is also effective for continuous 
sentence recognition of the RM task. Table 5 gives results for the 
RM task, using 47 SI-CI models with 32 mixture components. The 
CT-32 corrective training assumes a fixed beam width. Since the 
number of string errors was small in the training set, the amount 
of data for corrective training was rather limited. To increase the 
amount, a smaller beam width was used to recognize the training 
data. It was observed that this improved corrective training (ICT-
32) procedure not only reduced the error rate in training but also
increased the separation between the correct string and the other
competing strings. The number of training errors also increased, as
predicted. The regular and the improved corrective training gave
an average word error rate reduction of 15% and 20% respectively 
on the test data. 
SUMMARY 
The theoretical framework for MAP estimation of multivariate
Gaussian mixture densities and HMMs with Gaussian mixture state
observation densities was presented. Two MAP training algorithms,
forward-backward MAP estimation and segmental MAP es-
timation, were formulated. Bayesian learning serves as a unified
approach for speaker adaptation, speaker group modeling, parame-
ter smoothing and corrective training.
Tested on the RM task, encouraging results have been obtained 
for all four applications. For speaker adaptation, a 37% word er- 
ror reduction over the SI results was obtained on the FEB91-SD 
test with 2 minutes of speaker-specific training data. It was also 
found that speaker adaptation is more effective when based on 
sex-dependent models than with an SI seed. Compared to speaker- 
dependent training, speaker adaptation achieved better perfor-
mance with the same amount of training/adaptation data. Correc-
tive training applied to CI models reduced word errors by 15-20%.
The best SI results on RM tests were obtained with p.d.f. smoothing
and sex-dependent modeling, an average word accuracy of about
95.8% on four test sets.
Only corrective training and p.d.f. smoothing were applied to the
TI/NIST connected digit task. It was found that corrective training
is effective for improving CI models, reducing the number of string
errors by up to 27%. Corrective training was found to be more
effective for models having smaller numbers of parameters. This
implies that we can reduce computational requirements by using
corrective training on a smaller model and achieve performance
comparable to that of a larger model. Using 213 CD models,
p.d.f. smoothing provided a robust model that gave a 99.1% string
accuracy on the test data, the best performance reported on this
corpus.
REFERENCES 
[1] L. E. Baum, "An inequality and associated maximization technique in
statistical estimation for probabilistic functions of Markov processes,"
Inequalities, vol. 3, pp. 1-8, 1972.
[2] M. DeGroot, Optimal Statistical Decisions, McGraw-Hill, 1970.
[3] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from
Incomplete Data via the EM Algorithm," J. Roy. Statist. Soc. Ser. B,
vol. 39, pp. 1-38, 1977.
[4] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis,
John Wiley & Sons, New York, 1973.
[5] J.-L. Gauvain and C.-H. Lee, "Bayesian Learning of Gaussian Mixture
Densities for Hidden Markov Models," Proc. DARPA Speech and
Natural Language Workshop, Pacific Grove, Feb. 1991.
[6] B. H. Juang, "Maximum-Likelihood Estimation for Mixture Multi-
variate Stochastic Observations of Markov Chains," AT&T Technical
Journal, vol. 64, no. 6, July-August 1985.
[7] B. O. Koopman, "On distributions admitting a sufficient statistic,"
Trans. Amer. Math. Soc., vol. 39, pp. 399-409, 1936.
[8] C.-H. Lee, E. Giachin, L. R. Rabiner, R. Pieraccini, and A. E. Rosen-
berg, "Improved Acoustic Modeling for Continuous Speech Recogni-
tion," Proc. DARPA Speech and Natural Language Workshop, Hidden
Valley, June 1990.
[9] C.-H. Lee, C.-H. Lin, and B.-H. Juang, "A Study on Speaker Adaptation
of the Parameters of Continuous Density Hidden Markov Models,"
IEEE Trans. on ASSP, April 1991.
[10] L. R. Liporace, "Maximum Likelihood Estimation for Multivariate
Observations of Markov Sources," IEEE Trans. Inform. Theory, vol.
IT-28, no. 5, pp. 729-734, September 1982.
[11] Y. Normandin and D. Morgera, "An Improved MMIE Training Algo-
rithm for Speaker-Independent, Small Vocabulary, Continuous Speech
Recognition," Proc. ICASSP-91, pp. 537-540, May 1991.
[12] L. R. Rabiner, J. G. Wilpon, and B. H. Juang, "A segmental k-means
training procedure for connected word recognition," AT&T Tech. J.,
vol. 64, no. 3, pp. 21-40, May 1986.
[13] R. A. Redner and H. F. Walker, "Mixture Densities, Maximum Like-
lihood and the EM Algorithm," SIAM Review, vol. 26, no. 2, pp.
195-239, April 1984.
[14] H. Robbins, "The Empirical Bayes Approach to Statistical Decision
Problems," Ann. Math. Statist., vol. 35, pp. 1-20, 1964.
