<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1063"> <Title>HIGH-ACCURACY LARGE-VOCABULARY SPEECH RECOGNITION USING MIXTURE TYING AND CONSISTENCY MODELING</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. INTRODUCTION </SectionTitle>
<Paragraph position="0"> To improve the acoustic-modeling component of SRI's DECIPHER TM speech recognition system, our research has focused on two main directions. The first is to decrease the degree of mixture tying in the mixture observation densities, since continuous-density hidden Markov models (HMMs) have recently been shown to outperform discrete-density and tied-mixture HMMs \[16\]. The second is the removal of the simplifying output-independence assumption commonly used in HMMs.</Paragraph>
<Paragraph position="1"> Tied mixtures (TM) achieve robust estimation and efficient computation of the density likelihoods. However, the typical mixture size used in TM systems is small and does not provide a good representation of the acoustic space. Increasing the number of mixture components (the codebook size) is not a feasible solution, since the mixture-weight distributions become too sparse. In large-vocabulary problems, where a large number of basic HMMs is used and each has only a few observations in the training data, sparse mixture-weight distributions cannot be estimated robustly and are expensive to store. To solve this problem, we follow the approach of simultaneously reducing the codebook size and increasing the number of different sets of mixture components (or codebooks). This procedure reduces the degree of tying, and the two changes can be balanced so that the total number of component densities in the system is effectively increased. The mapping from HMM states to codebooks can be determined using clustering techniques. Since our algorithm transforms a &quot;less&quot; continuous, or tied-mixture, system into a &quot;more&quot; continuous one, it has enabled us to investigate a number of traditional differences between tied-mixture and fully continuous HMMs, including codebook size and modeling of the speech features using multiple vs. single observation streams.</Paragraph>
<Paragraph position="2"> Our second main research direction is focused on removing the simplifying assumption used in HMMs that speech features from different frames are statistically independent given the underlying state sequence. In this paper we deal with the modeling of local temporal dependencies, that is, those that span the duration of a phonetic segment. We will show, through recognition experiments and information-theoretic criteria, that achieving decorrelation of the speech features is not a sufficient condition for improved recognition performance. To achieve the latter, it is necessary to improve the discrimination power of the output distributions through the use of new information. Local correlation modeling has recently been incorporated in our system through the use of linear discriminant features, and has reduced the word error rate by 7% on the Wall Street Journal (WSJ) corpus.</Paragraph>
<Paragraph position="3"> The remainder of the paper is organized as follows: in Section 2 we present the general form of mixture observation distributions used in HMMs, discuss variations of this form that have appeared in the literature, and present an algorithm that enables us to adjust the mixture tying for optimum recognition performance. In Section 3 we deal with the problem of local time-correlation modeling: we comment on the potential improvement in recognition performance from incorporating conditional distributions, and describe the type of local consistency modeling currently used in our system. In Section 4 we present experimental results on the WSJ corpus. These results are mainly a by-product of the system development for the November 1993 ARPA evaluation \[16\]. Finally, we conclude in Section 5.</Paragraph> </Section>
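To make the balance between codebook size and number of codebooks concrete, the following back-of-the-envelope sketch counts component densities and mixture-weight parameters under two tying schemes. All sizes (state count, feature dimension, codebook sizes) are hypothetical values chosen for illustration, not figures from this system.

```python
def mixture_params(num_states, num_codebooks, codebook_size, feat_dim):
    """Parameter counts for a given mixture-tying scheme.

    Assumes diagonal-covariance Gaussians (one mean and one variance
    vector per component) and one mixture weight per state/component pair.
    """
    total_densities = num_codebooks * codebook_size
    gaussian_params = total_densities * 2 * feat_dim
    weight_params = num_states * codebook_size
    return total_densities, gaussian_params, weight_params

STATES, DIM = 10_000, 39  # hypothetical system size

# Tied mixtures: one large shared codebook, dense per-state weight tables.
print("TM:     ", mixture_params(STATES, num_codebooks=1, codebook_size=256, feat_dim=DIM))
# Reduced tying: many smaller codebooks. Each state stores fewer weights,
# yet the total number of distinct densities in the system grows sharply.
print("genonic:", mixture_params(STATES, num_codebooks=500, codebook_size=48, feat_dim=DIM))
```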
<Section position="4" start_page="0" end_page="313" type="metho"> <SectionTitle> 2. GENONIC MIXTURES </SectionTitle>
<Paragraph position="0"> A typical mixture observation distribution in an HMM-based speech recognizer has the form</Paragraph>
<Paragraph position="1"> $$ p(x_t \mid s) = \sum_{q \in Q(s)} p(q \mid s)\, f(x_t \mid q) $$ </Paragraph>
<Paragraph position="2"> where s represents the HMM state, x_t the observed feature at frame t, and Q(s) the set of mixture-component densities used in state s. We will use the term codebook to denote the set Q(s). The stream of continuous vector observations can be modeled directly using Gaussians or other types of densities in the place of f(x_t | q), and HMMs with this form of observation distributions are known as continuous HMMs \[19\].</Paragraph>
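As a concrete reading of this form, the short sketch below evaluates log p(x_t | s) for one state with diagonal-covariance Gaussian components f(x_t | q); the shapes, weights, and helper names are illustrative assumptions, not part of the system described here.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian, log N(x; mean, diag(var))."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def mixture_log_likelihood(x, weights, means, variances):
    """log p(x | s) = log sum_q p(q | s) N(x; mu_q, Sigma_q), computed stably."""
    log_comp = np.array([log_gaussian_diag(x, m, v)
                         for m, v in zip(means, variances)])
    a = np.log(weights) + log_comp
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))  # log-sum-exp trick

# Illustrative codebook Q(s) with 3 components in a 4-dimensional feature space.
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 4))
variances = np.ones((3, 4))
weights = np.array([0.5, 0.3, 0.2])  # mixture weights p(q | s), summing to 1
x_t = rng.normal(size=4)
print(mixture_log_likelihood(x_t, weights, means, variances))
```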
<Paragraph position="3"> Various forms of tying have appeared in the literature. When tying is not used, the sets of component densities are different for different HMM states--that is, Q(s) ≠ Q(s') if s ≠ s'. We will refer to HMMs that use no sharing of mixture components as fully continuous HMMs. The other extreme is when all HMM states share the same set of mixture components--that is, Q(s) = Q is independent of the state s. HMMs with this degree of sharing were proposed in \[8\], \[2\] under the names Semi-Continuous and Tied-Mixture (TM) HMMs. Tied-mixture distributions have also been used with segment-based models, and a good review is given in \[11\]. Intermediate degrees of tying have also been examined. In phone-based tying, described in \[17\], \[13\], only HMM states that belong to allophones of the same phone share the same mixture components--that is, Q(s) = Q(s') if s and s' are states of context-dependent HMMs with the same center phone. We will use the term phonetically tied to describe this kind of tying. Of course, for context-independent models, phonetically tied and fully continuous HMMs are equivalent. However, phonetically tied mixtures (PTM) did not significantly improve recognition performance in previous work.</Paragraph>
<Paragraph position="4"> The continuum between fully continuous and tied-mixture HMMs can be sampled at any other point. The choice of phonetically tied mixtures, although linguistically motivated, is somewhat arbitrary and may not achieve the optimum trade-off between resolution and trainability. We have recently introduced an algorithm \[4\] that allows us to select the degree of tying that attains optimum recognition performance for the given computational resources. This algorithm follows a bootstrap approach from a system that has a higher degree of tying (i.e., a TM or a PTM system), and progressively unties the mixtures using three steps: clustering, splitting and pruning, and reestimation.</Paragraph>
<Section position="1" start_page="313" end_page="313" type="sub_section"> <SectionTitle> 2.1. Clustering </SectionTitle>
<Paragraph position="0"> The HMM states of all allophones of a phone are clustered following an agglomerative procedure. The clustering is based on the weighted-by-counts entropy of the mixture-weight distributions \[12\]. The clustering procedure partitions the set of HMM states S into disjoint sets of states</Paragraph>
<Paragraph position="1"> $$ S = S_1 \cup S_2 \cup \cdots \cup S_K, \qquad S_i \cap S_j = \emptyset \ \text{for} \ i \neq j $$ </Paragraph>
<Paragraph position="2"> The same codebooks will be used for all HMM states belonging to a particular cluster S_i.</Paragraph> </Section>
<Section position="2" start_page="313" end_page="313" type="sub_section"> <SectionTitle> 2.2. Splitting and Pruning </SectionTitle>
<Paragraph position="0"> After the sets of HMM states that will share the same codebook have been determined, seed codebooks for each set of states are constructed for use in the subsequent reestimation phase. These seed codebooks can be constructed by either one or a combination of two procedures:
* Identifying the most likely subset of mixture components of the boot system for each cluster of HMM states S_i and using these subsets Q(S_i) ⊂ Q(S) as seed codebooks for the next phase
* Copying the original codebook multiple times (once for each cluster of states) and performing one iteration of the Baum-Welch algorithm over the training data with the new tying scheme; the number of component densities in each codebook can then be reduced using clustering \[10\]</Paragraph> </Section>
<Section position="3" start_page="313" end_page="313" type="sub_section"> <SectionTitle> 2.3. Reestimation </SectionTitle>
<Paragraph position="0"> The parameters are reestimated using the Baum-Welch algorithm. This step allows the codebooks to deviate from the initial values and achieve a better approximation of the distributions. We will refer to the Gaussian codebooks as genones and to the HMMs with arbitrary tying of Gaussian mixtures as genonic HMMs. Clustering of either phone or subphone units in HMMs has also been used in \[18\], \[12\], \[1\], \[9\]. Mixture-weight clustering of different HMM states can reduce the number of free parameters in the system and, potentially, improve recognition performance because of the more robust estimation. It cannot, however, improve the resolution with which the acoustic space is represented, since the total number of component densities in the system remains the same. In our approach, we use clustering to identify sets of subphonetic regions that will share mixture components. The later steps of the algorithm, where the original set of mixture components is split into multiple overlapping genones and each one is reestimated using data from the states belonging to the corresponding cluster, effectively increase the number of distinct densities in the system and provide the desired detail in the resolution.</Paragraph>
<Paragraph position="1"> Reestimation of the parameters can be achieved using the standard Baum-Welch reestimation formulae for HMMs with Gaussian mixture observation densities, since tying does not alter their form, as pointed out in \[21\]. During recognition, to reduce the large amount of computation involved in evaluating Gaussian likelihoods, we can use the fast computational techniques described in \[15\].</Paragraph>
<Paragraph position="2"> In place of the component densities f(x_t | q) we use exponentially weighted Gaussian distributions:</Paragraph>
<Paragraph position="3"> $$ f(x_t \mid q) = N(x_t; \mu_q, \Sigma_q)^{\alpha} $$ </Paragraph>
<Paragraph position="4"> where the exponent α ≤ 1 is used to reduce the dynamic range of the Gaussian scores (which would otherwise dominate the mixture probabilities p(q | s)) and also to provide a smoothing effect at the tails of the Gaussians.</Paragraph> </Section> </Section>
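A minimal sketch of this exponent weighting follows, assuming log-domain scoring; the value of α and all shapes are hypothetical choices for illustration, not the system's settings.

```python
import numpy as np

def weighted_log_gaussian(x, mean, var, alpha=0.4):
    """log [ N(x; mean, diag(var)) ** alpha ] = alpha * log N(x; mean, diag(var)).

    Raising the Gaussian to a power alpha <= 1 compresses the dynamic range of
    the component scores, so they do not swamp the mixture weights p(q | s),
    and it fattens the tails, giving a smoothing effect.
    """
    log_n = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)
    return alpha * log_n

# A point 3 standard deviations from the mean: the raw log-score is extreme,
# the weighted one much less so.
x, mean, var = np.array([3.0]), np.array([0.0]), np.array([1.0])
print(weighted_log_gaussian(x, mean, var, alpha=1.0))  # unweighted log-score
print(weighted_log_gaussian(x, mean, var, alpha=0.4))  # compressed log-score
```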
<Section position="5" start_page="313" end_page="315" type="metho"> <SectionTitle> 3. TIME CORRELATION MODELING </SectionTitle>
<Paragraph position="0"> For a given HMM state sequence, the observed features at nearby frames are highly correlated. Modeling time correlation can significantly improve speech recognition performance for two reasons. First, dynamic information is very important \[6\], and explicit time-correlation modeling can potentially outperform more traditional and simplistic approaches like the incorporation of cepstral derivatives as additional feature streams.</Paragraph>
<Paragraph position="1"> Second, sources of variability--such as microphone, vocal tract shape, speaker dialect, and speech rate--will not dominate the likelihood computation during Viterbi decoding by being rescored at every frame. We will call techniques that model such temporal dependencies consistency modeling.</Paragraph>
<Paragraph position="2"> The output-independence assumption is not necessary for the development of the HMM recognition (Viterbi) and training (Baum-Welch) algorithms. Both of these algorithms can be modified to cover the case where the features depend not only on the current HMM state, but also on features at previous frames \[20\]. However, with the exception of the work reported in \[3\], which was based on segment models, explicit time-correlation modeling has not improved the performance of HMM-based speech recognizers.</Paragraph>
<Paragraph position="3"> To investigate these results, we conducted a pilot study to estimate the potential improvement in recognition performance when using explicit correlation modeling over more traditional methods like time-derivative information. We used information-theoretic criteria and measured the amount of mutual information between the current HMM state and the cepstral coefficients at a previous &quot;history&quot; frame. The mutual information was always conditioned on the identity of the left phone, and was measured under three different conditions:
* I(h,s)--mutual information between the current HMM state s and a cepstral coefficient h at the history frame; a single, left-context-dependent Gaussian distribution for the cepstral coefficient at the history frame was hypothesized.
* I(h,s|c)--conditional mutual information between the current HMM state s and a cepstral coefficient h at the history frame when the corresponding cepstral coefficient c of the current frame is given; a left-context-dependent, joint Gaussian distribution for the cepstral coefficients at the current and the history frames was hypothesized.
* I(h,s|c,d)--same as above, but conditioned on both the cepstral coefficient c and its corresponding derivative d at the current frame.</Paragraph>
<Paragraph position="4"> The results are summarized in Table 1 for history frames with lags of 1, 2, and 4, and for a variable lag. In the latter case, we condition the mutual information on features extracted at the last frame t_0 of the previous HMM state, as located by a forced Viterbi alignment. [Table 1: Mutual information between the current HMM state and a cepstral coefficient at the history frame; included is the conditional mutual information when the corresponding cepstral coefficient and its derivative at time t are given.] We can see from this table that in the unconditional case, the cepstral coefficients at frames closer to the current one provide more information about the identity of the current phone. However, the amount of additional information that these coefficients provide when knowledge of the current cepstra and their derivatives is taken into account is smaller. The additional information in this case is larger for lags greater than 1, and is maximum for the variable lag.</Paragraph>
<Paragraph position="5"> These measurements predict that the previous frame's observation is not the optimal frame to use when conditioning a state's output distribution. To verify this, and to actually evaluate recognition performance, we incorporated time-correlation modeling in an HMM system with genonic mixtures. Specifically, we generalized the Gaussian mixtures to mixtures of conditional Gaussians, with the current cepstral coefficient x_t conditioned on the corresponding cepstral coefficient h at time t - t_0 for various lags t_0:</Paragraph>
<Paragraph position="6"> $$ f(x_t \mid q, h) = N(x_t; \mu_q + b_q h, \sigma_q^2) $$ </Paragraph>
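Assuming the linear-mean conditional form reconstructed above, a single conditional component could be scored as in the sketch below; the scalar parameterization (mu, b, var) and the example values are hypothetical illustrations, not the paper's estimates.

```python
import math

def conditional_gaussian_logpdf(x_t, h, mu, b, var):
    """log N(x_t; mu + b*h, var): the component mean is shifted linearly by the
    cepstral coefficient h observed at the history frame t - t0."""
    mean = mu + b * h
    return -0.5 * (math.log(2.0 * math.pi * var) + (x_t - mean) ** 2 / var)

# With b = 0 this reduces to the unconditional Gaussian, so the history frame
# contributes information only insofar as b differs from zero.
print(conditional_gaussian_logpdf(x_t=0.8, h=0.5, mu=0.2, b=0.9, var=1.0))
print(conditional_gaussian_logpdf(x_t=0.8, h=0.5, mu=0.2, b=0.0, var=1.0))
```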
<Paragraph position="7"> We either replaced the original unconditional distributions of the cepstral coefficients and their derivatives with the conditional Gaussian distributions, or we used them in parallel as additional observation streams. The results on the 5,000-word recognition task of the WSJ0 corpus are summarized in Table 2 for fixed-lag history frames. [Table 2: Recognition results with conditional distributions either replacing the unconditional ones or used in parallel.] We can see that the recognition results are in perfect agreement with the behavior predicted by the mutual-information study. The improvements in recognition performance over the system that does not use conditional distributions are actually proportional to the measured amount of conditional mutual information at the various history frames. However, these improvements are small and statistically insignificant, and indicate that the derivative features effectively model the local dynamics.</Paragraph>
<Paragraph position="8"> Instead of using conditional Gaussian distributions, one can alternatively choose to use features obtained with linear discriminants. Local time correlation can be modeled by estimating the transformations over multiple consecutive frames \[5\], \[7\]. This approach has the additional advantage that it is computationally less expensive, since the discriminant transformations can be computed in the recognizer front end and only once at each frame. However, as we will see in the following section, linear discriminants gave only moderate improvements in recognition performance, and this is consistent with the conditional Gaussian results of this section. From the conditional mutual-information measurements that we have presented, we can see that in order to provide additional information to the recognizer we must condition the output distributions not only on a previous history frame, but also on the start time of the current subphonetic segment, and this is an area that we are currently investigating.</Paragraph> </Section> </Paper>