<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1036">
  <Title>MAP Estimation of Continuous Density HMM : Theory and Applications</Title>
  <Section position="3" start_page="0" end_page="185" type="metho">
    <SectionTitle>
MAP ESTIMATES FOR GAUSSIAN MIXTURE
</SectionTitle>
    <Paragraph position="0"> Suppose that x = (zl,...,x,) is a sample of n i.i.d.</Paragraph>
    <Paragraph position="1"> observations drawn from a mixture of K p-dimensional multivariate normal densities. The joint p.d.f, is specified by f(x\[0) = \]-\[:=l ~-~f=t~kA/'(Zt\[mk,rk) where 0 = (wl, ..., wK, ml ,..., inK, rl, ..., rK) is the parameter vector and ~k denotes the mixture gain for the k-th mixture component with the K constraint ~kft Wk = 1. A/'(Zlmk, rk) is the k-th normal density function where mk is the p-dimensional mean vector and rk is the p x p precision matrix. As stated in the introduction, for the parameter vector 0 no joint conjugate prior density exists. However a finite mixture density can he interpreted as a density associated with a statistical population which is a mixture of K component populations with mixing proportions (wl .... , wK). In other words, f(x\[0) can be seen as a marginal p.d.f, of the product of a multinomial density (for the sizes of the component populations ) and normal densities (for the component densities). A practical candidate to model the  prior knowledge about the mixture gain parameter vector is therefore a Dirichlet density which is the conjugate prior density for the multinomial distribution</Paragraph>
    <Paragraph position="3"> where vk &gt; 0. For the vector parameter (ink, rk) of the individual Gaussian mixture component, the joint conjugate prior density is a normal-Wishart density \[2\] of the form</Paragraph>
    <Paragraph position="5"> where (rk,/zk, t~k, Uk) are the prior density parameters such that ak &gt; p -- 1, rk &gt; 0,/~k is a vector of dimension p and uk is a p x p positive definite matrix.</Paragraph>
    <Paragraph position="6"> Assuming independence between the parameters of the mixture components and the mixture weights, the joint prior density g(0) is taken to be a product of the prior p.d.f.'s defined in equations * K * (2) and (3), Le. g(0) = g(w~, ...,~K)FL:, z(m~,,-,). As will be shown later, this choice for the prior density family can also be justified by noting that the EM algorithm can be applied to the MAP estimation problem if the prior density is in the conjuguate family of the complete-data density.</Paragraph>
    <Paragraph position="7"> The EM algorithm is an iterative procedure for approximating maximum-likelihood estimates in an incomplete-data context such as mixture density and hidden Markov model estimation problems \[1, 3, 13\]. This procedure consists of maximizing at each iteration the auxilliary function Q(O, ~) defined as the expectation of the complete-data log-likelihood log h(y\[0 ) given the incomplete data x = (~, ...,x,) and the current fit 0, i.e.</Paragraph>
    <Paragraph position="8"> Q(0, ~) = E\[log h(yl0)lx, ~. For a mixture density, the complete-data likelihood is the joint likelihood of x and PS = (PSt, ..., PSn ) the unobserved labels referring to the mixture components, i.e. y = (x, PS). The EM procedure derives from the fact that log f(xl0 ) = Q(O, 0) - H(O, 0) where H(O, 0) = E(log h(ylx , 0)Ix , 0) and H(O, 0) _&lt; H(O, ~), and whenever a value 0 satisfies Q(O, O) &gt; Q(0, 0) then f(x\[0) &gt; f(xl0). It foUows that the same iterative procedure can be used to estimate the mode of the posterior density by maximizing the anxilliary function R( O , 0) = Q( O , 0) + log 9(0) at each iteration instead of Q(O, 0) \[3\].</Paragraph>
    <Paragraph position="9"> For a mixture of K densities {f(.10~)}~=L...,g with mixture</Paragraph>
    <Paragraph position="11"> Let tP(0, 0) = exp R(O, 0) be the function to be maximized and define the following notations cat &amp;~f(xtl#k) ck = ~=~ ckt,</Paragraph>
    <Paragraph position="13"> follows from the definition of f(x\[O) and equation (4) that</Paragraph>
    <Paragraph position="15"> From (2), (3) and (5) it can easily be verified that ~(.,0) belongs to the same family as g, and has parameters O,L ' ' ' ' rk,/~k, t~k, uk}k:l,...,K satisfying the following conditions:</Paragraph>
    <Paragraph position="17"> The considered family of distributions is therefore a conjugate family for the complete-data density.</Paragraph>
    <Paragraph position="18"> The mode of ~P(., 0), denoted J i , obtained (wk, ink, rk), may be from the modes of the Dirichlet and normal-Wishart densities: w~ =</Paragraph>
    <Paragraph position="20"> If it is assumed &amp;k &gt; 0, then ckl, ck2, ...,ck, is a sequence of n i.i.d, random variables with a non-degenerate distribution and limsupn_o o ~=. ckt = co with probability one. It follows that w~ converges to ~=l Ckt/n with probability one when n ~ oo.</Paragraph>
    <Paragraph position="21"> Applying the same reasoning to m~ and r~, it can be seen that the EM reestimation formulas for the MAP and ML approaches are asymptotically similar. Thus as long as the initial estimates are identical, the EM algorithm will provide identical estimates with probability one when n ~ cC/.</Paragraph>
  </Section>
  <Section position="4" start_page="185" end_page="187" type="metho">
    <SectionTitle>
MAP ESTIMATES FOR CDHMM
</SectionTitle>
    <Paragraph position="0"> The results obtained for a mixture of normal densities can be extended to the case of HMM with Gaussian mixture state observation densities, assuming that the observation p.d.f.'s of all the states have the same number of mixture components. We consider an N-state HMM with parameter vector A = (x, A, 0), where r is the initial probability vector, A is the transition matrix, and 0 is the p.d.f, parameter vector composed of the mixture parameters 0i = {Wik,mik,rik}kfl,...,K for each state i. For a sample x = (2~1, ..., zn), the complete data is y = (x, s,Q where s = (so,..., s,) is the unobserved state sequence, and l = (PSh ..., l,~) are the unobserved mixture component labels, si E \[1, N\] and li E \[1, K\]. The joint p.d.f, h(.lX) of x, s, andPS is defined as \[1\]</Paragraph>
    <Paragraph position="2"> where 7ri is the initial probabilty of state i, aij is the transition probability from state i to state j, and Oik =(mik, rik) is the parameter vector of the k-th normal p.d.f, associated to state i. It follows that the likelihood of x has the form</Paragraph>
    <Paragraph position="4"> where f(x,lOi ) K = ~k=t w~kA/'(x*lralk, rik), and the summation is over all possible state sequences.</Paragraph>
    <Paragraph position="5"> In the general case where MAP estimation is to be applied not only to the observation density parameters but also to the initial and transition probabilities, a Dirichlet density can also be used for the initial probability vector ~r and for each row of the transition probability matrix A. This choice directly follows the results of the previous section: since the complete-data likelihood satisfies h(x, s,tlA ) = h(s, A)h(x, tls , A) where h(s, A) is the product of N + 1 multinomial densities with parameters {n, a't, ..., ~N} and { n, air ..... a i N } if l,...,N . The prior density for all the HMM parameters is thus</Paragraph>
    <Paragraph position="7"> In the following subsections we examine two ways of approximating AMAp by local maximization of f(xl~)G(~) and f(x, sI~)G(A). These two solutions are the MAP versions of the B aura-Welch algorithm \[1 \] and of the segmental k-means algorithm \[12\], algorithms which were developed for ML estimation.</Paragraph>
    <Section position="1" start_page="186" end_page="186" type="sub_section">
      <SectionTitle>
Forward-Backward MAP Estimate
</SectionTitle>
      <Paragraph position="0"> From (14) it is straightforward to show that the auxilliary function of the EM algorithm applied to MLE of A, Q(A, ~) = E\[log h(Yi~)lx, PS\], can be decomposed into a sum of three auxilliary functions: Q,~(a', X), Q~(A, X) and Qo(O, ~) \[6\]. These functions which can be independently maximized take the following forms:</Paragraph>
      <Paragraph position="2"> be computed at each EM iteration by using the Forward-Backward algorithm \[I\]. As for the mixture Gaussian case discussed in the previous section, to estimate the mode of the posterior density the anxilliary function R(A, ~) = Q(A, ~) + log G(A) must be maximized. The form chosen for G(A) in (16) permits independent maximization of each of the following 2N + I parameter sets: {Trl .... ,a'N}, {ail,...,aiN}i=t,...,g and {0i}i=l,...,N. The MAP auxiUiary function R(A, A) can thus be written as the sum R. ( a', ~) + ~i R., ( a, , ~) + ~, Ro, ( O,, ~ ), where each term represents the MAP anxilliary function associated with the indexed parameter set.</Paragraph>
      <Paragraph position="3"> We can recognize in (20) the same form as seen for Q(0\[~) in (4) for the mixture Ganssian case. It follows that if the Ckt are replaced by the cikt defined as ,~,kX(xtl,h,~, ~,k ) (21) eikt = 7,t f(xt\[~i) then the reestimation formulas (11-13) can be used to maximize Ro~ (01, ~). It is straightforward to find the reesfimations formulas for ~r and A by applying the same derivations used for the mixture weights:</Paragraph>
      <Paragraph position="5"> For multiple independent observation sequences { xo } q= l,...,Q, t~(q) ~(q)~ with Xq = x't .... , ~., ,, we maximize G(A) lq?:l f(xqlA)' where f(.\[A) is defined by (15). The EM auxilliary function is then R(A, X) = logG(A) + ~qQ=t E\[ldeggh(Yql~)lxq, X\], where h(.lA) is defined by equation (14). It follows that the reestimation formulas for A and 0 still hold if the summations over t are ~(q) and - (q) replaced by summations over q and t. The values &amp;quot;,~jt 7. are then obtained by applying the forward-backward algorithm for each observation sequence. The reestimation formula for the initial probabilities becomes</Paragraph>
      <Paragraph position="7"> As for the mixture Gaussian case, it can be shown that as Q ~ co, the MAP reestimation formulas approach the ML ones, exhibiting the asymptotic similarity of the two estimates.</Paragraph>
      <Paragraph position="8"> These reestimation equations give estimates of the HMM parameters which correspond to a local maximum of the posterior density. The choice of the initial estimates is therefore essential to finding a solution close to a global maximum and to minimize the number of EM iterations needed to attain the local maximum. When using an informative prior, one natural choice for the initial estimates is the mode of the prior density, which represents all the available information about the parameters when no data has been observed.</Paragraph>
      <Paragraph position="9"> The corresponding values are simply obtained by applying the reestimation formulas with n equal to 0. When using a non-informative prior, i.e. for ML estimation, while for discrete HMMs it is possible to use uniform initial estimates, there is no trivial solution for the continuous density case.</Paragraph>
    </Section>
    <Section position="2" start_page="186" end_page="187" type="sub_section">
      <SectionTitle>
Segmental MAP Estimate
</SectionTitle>
      <Paragraph position="0"> By analogy with the segmental k-means algorithm \[12\], a different optimization criterion can be considered. Instead of maximizing G(AIx), the joint posterior density of A and s, G(A, slx ), is maximized. The estimation procedure becomes = argmax max G(A, six ) (25) ), s = argm~x m~x f(x, s\[A)G(A) (26) and A is called the segmental MAP estimate of A. As for the segmental k-means algorithm, it is straightforward to prove that starting with any estimate A (m), alternate maximization over s and  A gives a sequence of estimates with non decreasing values of G(A, slx), i.e. G(A (m+'), s(m+')\]x) &gt; G(A(m), s(m)lx) with</Paragraph>
      <Paragraph position="2"> The most likely state sequence s (m) is decoded by the Viterbi algorithm. In fact, maximization over A can be replaced by any hill climbing procedure which replaces A ('~) by A ('~+1) subject to the constraint that f(x, s(m)\[A(m+D)G(A (re+D) _&gt; f(x, s (m) \[A (m))G(A(m)). The EM algorithm is once again a good candidate to perform this maximization using A (m) as an initial estimate. The EM anxilliary function is then R(A, ~) = log G(A) + E\[log h(ylA)lx, s ~), X\] where h(.IA) is defined by equation (14).</Paragraph>
      <Paragraph position="3"> It is straightforward to show that the forward-backward reestimation equations still hold with fijt= 6ts('n)~ t-t - i)6(s~ m) - J) and &amp;quot;fit = ~(s~ '~) -- i), where ~ denotes the Kronecker delta function.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="187" end_page="187" type="metho">
    <SectionTitle>
PRIOR DENSITY ESTIMATION
</SectionTitle>
    <Paragraph position="0"> In the previous sections it was assumed that the prior density G(A) is a member of a preassigned family of prior distributions defined by (16). In a strictly Bayesian approach the vector parameter of this family ofp.d.f.'s {G(.\[~), ~ E ~b} is also assumed known based on common or subjective knowledge about the stochastic process. Another solution is to adopt an empirical Bayesian approach \[14\] where the prior parameters are estimated directly from data.</Paragraph>
    <Paragraph position="1"> The estimation is then based on the marginal disttrbution of the data given the prior parameters.</Paragraph>
    <Paragraph position="2"> Adopting the empirical Bayes approach, it is assumed that the sequence of observations, X, is composed of multiple independent sequences associated with different unknown values of the HMM parameters. Letting (X,A) = \[(xt, Ai), (x2, A2) .... \] be such a multiple sequence of observations, where each pair is independent of the others and the Aq have a common prior distribution G(.\[~).</Paragraph>
    <Paragraph position="3"> Since the Aq are not directly observed, the prior parameter estimates must be obtained from the marginal density f(X\[~),</Paragraph>
    <Paragraph position="5"> However, maximum likelihood estimation based on f(Xl~ ) appears rather difficult. To simplify this problem, we can choose a simpler optimization criterion by maximizing the joint p.d.f, f(X, A I~) over A and ~ instead of the marginal p.d.f, of X given ~. Starting with an initial estimate of ~o, we obtain a hill climbing procedure by alternate maximization over A and ~o, i.e.</Paragraph>
    <Paragraph position="7"> Such a procedure provides a sequence of estimates with non-decreasing values of f(X, Al~(m)). The solution of (30) is the MAP estimate of A based on the current prior parameter ~(m). It can therefore be obtained by applying the forward-backward MAP reestimation formulas to each observation sequence Xq. The solution of (31) is simply the maximum likelihood estimate of ~ based on the current values of the HMM parameters.</Paragraph>
    <Paragraph position="8"> Finding this estimate poses two problems. First, due to the Wishart and Dirichlet components, ML estimation for the density defined by (16) is not trivial. Second, since more parameters are needed for the prior density than for the HMM itself, there can be a problem of overparametrization when the number of pairs (xq, Aq) is small. One way to simplify the estimation problem is to use moment estimates to approximate the ML estimates. For the overparametrization problem, it is possible to reduce the size of the prior family by adding constraints on the prior parameters. For example, the prior family can be limited to the family of the kernel density of the complete-data likelihood, i.e. the posterior density family of the complete.data model when no prior information is available. Doing so, it can be verified that the following constraints</Paragraph>
    <Paragraph position="10"> Parameter tying can also be used to further reduce the size of the prior family.</Paragraph>
    <Paragraph position="11"> We use this approach for approach for two types of applications: parameter smoothing and adaptation learning. For parameter &amp;quot;smoothing&amp;quot;, the goal is to estimate {Al, A2, ...}. The previous algorithm offers a direct solution to &amp;quot;smooth&amp;quot; these different estimates by assuming a common prior density for all the models. For adaptative learning, we observe a new sequence of observations Xq associated with the unobserved vector parameter value Aq. The MAP estimate of A, can be obtained by using for prior parameters a point estimate ~ obtained with the previous algorithm. Such a training process can be seen as an adaptation of an a priori model = argmaxx G(A\[~) (when no training data is available) to more specific conditions corresponding to the new observation sequence Xq.</Paragraph>
    <Paragraph position="12"> In the applications presented in this paper, the prior density parameters were estimated along with the estimation of the SI model parameters using the segmental k-means algorithm. Information about the variability to be modeled with the prior densities was associated with each frame of the SI training data. This information was simply represented by a class number which can be the speaker ID, the speaker sex, or the phonetic context. The HMM parameters for each class given the mixture component were then computed, and moment estimates were obtained for the tied prior parameters also subject to conditions (32-33) \[5\].</Paragraph>
  </Section>
  <Section position="6" start_page="187" end_page="187" type="metho">
    <SectionTitle>
EXPERIMENTAL SETUP
</SectionTitle>
    <Paragraph position="0"> The experiments presented in this paper used various sets of context-independent (CI) and context-dependent (CD) phone models. Each model is a left-to-right HMM with Gaussian mixture state observation densities. Diagonal covariance matrices are used and the transition probabilities are assumed fixed and known. As described in \[8\], a 3g-dimensional feature vector composed of LPC-derived cepstrum coefficients, and first and second order time derivatives. Results are reported for the RM task with the standard word pair grammar and for the TI/NIST connected digits. Both corpora were down-sampled to telephone bandwidth.</Paragraph>
  </Section>
  <Section position="7" start_page="187" end_page="188" type="metho">
    <SectionTitle>
MODEL SMOOTHING AND ADAPTATION
</SectionTitle>
    <Paragraph position="0"> Last year we reported results for CD model smoothing, speaker adaptation, and sex-dependentmodeling \[5\]. CD model smoothing was found to reduce the word error rate by 10%. Speaker adaptation  test. Results are given as word error rate (%).</Paragraph>
    <Paragraph position="1"> was tested on the JUN90 data with 1 minute and 2 minutes of speaker-specific adaptation data. A 16% and 31% reduction in word error were obtained compared to the SI results \[5\]. On the FEB91 test, using Bayesian learning for CD model smoothing combined with sex-dependent modeling, a 21% word error reduction was obtained compared to the baseline results \[5\].</Paragraph>
    <Paragraph position="2"> In order to compare speaker adaption to ML training of SD models, an experiment has been carded out on the FEB91-SD test material including data from 12 speakers (7m/5f), using a set of 47 CI phone models. Two, five and thirty minutes of the SD training data were used for training and adaptation. The SD, SA (SI) word error rates are given in the two first rows of Table 1.</Paragraph>
    <Paragraph position="3"> The SD word error rate for 2 min of training data was 31.5%.</Paragraph>
    <Paragraph position="4"> The SI word error rate (0 minutes of adaptation data) was 13.9%, somewhat comparable to the SD results with 5 min of SD training data. The SA models are seen to perform better than SD models when relatively small amounts of data were used for training or adaptation. When all the available training data was used, the SA and SD results were comparable, consistent with the Bayesian formulation that the MAP estimate converges to the MLE. Relative to the SI results, the word error reduction was 37% with 2 rain of adaptation data, an improvement similar to that observed on the JUN90 test data with CD models \[5\]. As in the previous experiment, a larger improvement was observed for the female speakers (51%) than for the male speakers (22%).</Paragraph>
    <Paragraph position="5"> Speaker adaptation was also performed starting with sex-dependent models (third row of Table 1). The word error rate with no speaker adaptation is 11.5%. The error rate is reduced to 7.5% with 2 rain, and 6.0% with 5 rain, of adaptation data. Comparing the last 2 rows of the table it can be seen that SA is more effective when sex-dependent seed models are used. The error reduction with 2 rain of training data is 35% compared to the sex-dependent model results and 46% compared to the SI model results.</Paragraph>
    <Paragraph position="6"> P.D.F. SMOOTHING We have shown that Bayesian learning can be used for CD model smoothing \[5\]. This approach can be seen either as a way to add extra constraints to the model parameters so as to reduce the effect of insufficient training data, or it can be seen as an &amp;quot;interpolation&amp;quot; between two sets of parameter estimates: one corresponding to the desired model and the other to a smaller model which can be trained using MLE on the same data. Instead of defining a reduced parameter set by removing the context dependency, we can alternatively reduce the mixture size of the observation densities and use a single Ganssian per state in the smaller model. Cast in the Bayesian learning framework, this implies that the same marginal prior density is used for all the components of a given mixture.</Paragraph>
    <Paragraph position="7"> Variance clipping can also be viewed as a MAP estimation technique with a uniform prior density constrained by a maximum (positive) value for the precision parameters \[9\]. However, this does not have the appealing interpolation capability of the conjugate priors.</Paragraph>
    <Paragraph position="8"> We experimented with this p.d.f, smoothing approach on the TI  models digit and RM databases. A set of 213 CD phone models with 32 mixture components (213 CD-32) for the TI digits and a set of 2421 CD phone models with 16 mixture components (2421 CD-16) for RM were used for evaluation. Results are given for MLE training, MLE with variance clipping (MLE+VC), and MAP estimation with p.d.f, smoothing in Tables 2 and 3. In Table 2, word accuracy (WACC) and suing accuracy (SACC) are given for the 8578 test digit strings of the TI digit corpora. Compared to the variance clipping scheme, the MAP estimate reduces the number of string errors by 25%. Using p.d.f, smoothing, the suing accuracy of99.1% is the best result reported on this task.</Paragraph>
    <Paragraph position="9"> For the RM tests summarized in Table 3, a consistent improvement over the variance clipping scheme (MLE+VC) is observed when p.d.f, smoothing is applied. Combined with sex-dependent modeling, the MAP(M/F) scheme gives an average word accuracy of about 95.8%.</Paragraph>
  </Section>
  <Section position="8" start_page="188" end_page="189" type="metho">
    <SectionTitle>
CORRECTIVE TRAINING
</SectionTitle>
    <Paragraph position="0"> Bayesian learning provides a scheme for model adaptation which can also be used for corrective training. Corrective training maximizes the recognition rate on the training data hoping that that will also improve performance on the test data. One simple way to do corrective training is to use the training sentences which were incorrectly recognized as new data. In order to do so, the state segmentation step of the segmental MAP algorithm was modified to obtain not only the frame/state association for the sentence model states but also for the states corresponding to the model of all the possible sentences (general model). In the reestimation formulas, the values cikt for each state si are evaluated using (21), such that 7it is equal to 1 in the sentence model and to -1 in the general model. While convergence is not guaranteed, in practice it was found that by using large values for rik(_ ~ 200), the number of training sentence errors decreased after each iteration until convergence. If we use the forward-backward MAP algorithm we obtain a corrective training algorithm for CDHMM's very similar to the recently proposed corrective MMIE training algorithm \[11 \].</Paragraph>
    <Paragraph position="1"> Corrective training was evaluated on both the TI/NIST SI connected digit and the RM tasks. Only the Ganssian mean vectors and the mixture weights were corrected. For the TI digits a set of 21 phonetic HMMs were ~ained on the 8565 digit strings. Results are given in Table 4 using 16 and 32 mixture components for the observation p.d.L's, with and without corrective training for both test and training data. The CT-16 results were obtained with 8 iter- null the TI-digits for 21 CI models with 16 and 32 mixture components per stale. String error counts are given in parenthesis.</Paragraph>
    <Paragraph position="2">  mixture components per state) ations of corrective training while the CT-32 results were based on only 3 iterations, where one full iteration of conective training is implemented as one recognition run which produces a set of &amp;quot;new&amp;quot; training strings (i.e. errors and/or barely correct strings) followed by 10 iterations of Bayesian adaptation using the data of these strings. String error rates of 1.4% and 1.3% were obtained with 16 and 32 mixture components per state respectively, compared to 2.0% and 1.5% without corrective training. These represent suing error reductions of 27% and 12%. We note that corrective training helps more with smaller models, as the ratio of adaptation data to the number of parameters is larger.</Paragraph>
    <Paragraph position="3"> The corrective training procedure is also effective for continuous sentence recognition of the RM task. Table 5 gives results for the RM task, using 47 SI-CI models with 32 mixture components. The CT-32 corrective training assumes a fixed beam width. Since the number of string errors was small in the training set, the amount of data for corrective training was rather limited. To increase the amount, a smaller beam width was used to recognize the training data. It was observed that this improved corrective training (ICT32) procedure not only reduced the error rate in training but also increased the separation between the conect string and the other competing strings. The number of training errors also increased as predicted. The regular and the improved corrective training gave an average word error rate reduction of 15% and 20% respectively on the test data.</Paragraph>
  </Section>
class="xml-element"></Paper>