<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1020"> <Title>ON THE USE OF TIED-MIXTURE DISTRIBUTIONS</Title> <Section position="4" start_page="0" end_page="102" type="relat"> <SectionTitle> 2. PREVIOUS WORK </SectionTitle> <Paragraph position="0"> A central problem in the statistical approach to speech recognition is finding a good model for the probability of acoustic observations conditioned on the state in hidden-Markov models (HMMs), or, in the case of the SSM, conditioned on a region of the model. Options that have been investigated include discrete distributions based on vector quantization, as well as Gaussian, Gaussian mixture, and tied-Gaussian mixture distributions. In tied-mixture modeling, distributions are modeled as mixtures of continuous densities, but, unlike ordinary (non-tied) mixtures, the component Gaussian densities are not estimated separately for each state; instead, every mixture is constrained to share the same set of component densities, with only the weights differing. The probability density of observation vector x conditioned on being in state i is thus</Paragraph> <Paragraph position="1"> p(x | i) = Σ_k λ_ik p_k(x) (1) </Paragraph> <Paragraph position="2"> Note that the component Gaussian densities, p_k(x) ∼ N(μ_k, Σ_k), are not indexed by the state i. In this light, tied mixtures can be seen as a particular instance of the general technique of tying to reduce the number of model parameters that must be trained \[3\].</Paragraph> <Paragraph position="3"> &quot;Tied mixtures&quot; and &quot;semi-continuous HMMs&quot; are both used in the literature to refer to HMM distributions of the form given in Equation (1). The term &quot;semi-continuous HMMs&quot; was coined by Huang and Jack, who first proposed their use in continuous speech recognition \[1\]. 
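As a concrete illustration of the tied-mixture density of Equation (1), the sketch below shows the defining property: all states share one codebook of component Gaussians, and only the per-state weights differ. The one-dimensional Gaussians, the state names, and the weight values are all made up for illustration; they do not come from any system in the paper.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Shared codebook of component densities p_k: tied across ALL states.
components = [(0.0, 1.0), (3.0, 0.5)]  # (mean, variance) pairs (illustrative)

# Per-state mixture weights lambda_ik; each row sums to 1.
weights = {
    "s0": [0.8, 0.2],
    "s1": [0.1, 0.9],
}

def tied_mixture_density(x, state):
    """p(x | state) = sum_k lambda_{state,k} * p_k(x), as in Equation (1)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, (m, v) in zip(weights[state], components))
```

Only the weight vectors are state-specific, so adding a state costs just one row of weights rather than a full set of Gaussian parameters, which is the parameter-sharing benefit the text describes.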
The &quot;semi-continuous&quot; terminology highlights the relationship of this method to discrete and continuous density HMMs: the mixture component means are analogous to the vector quantization codewords of a discrete HMM, and the weights to the discrete observation probabilities, but, as in continuous density HMMs, actual quantization with its attendant distortion is avoided.</Paragraph> <Paragraph position="4"> Bellegarda and Nahamoo independently developed the same technique, which they termed &quot;tied mixtures&quot; \[2\]. For simplicity, we use only one name in this paper, tied mixtures, chosen to highlight the relationship to other types of mixture distributions and because our work is based on the SSM, not the HMM.</Paragraph> <Paragraph position="5"> Since its introduction, a number of variants of the tied mixture model have been explored. First, different assumptions can be made about feature correlation within individual mixture components. Separate sets of tied mixtures have been used for various input features, including cepstra, derivatives of cepstra, and power and its derivative, where each of these feature sets has been treated as an independent observation stream. Within an observation stream, different assumptions about feature correlation have been explored, with some researchers currently favoring diagonal covariance matrices \[4, 5\] and others adopting full covariance matrices \[6, 7\].</Paragraph> <Paragraph position="6"> Second, the issue of parameter initialization can be important, since the training algorithm is an iterative hill-climbing technique that guarantees convergence only to a local optimum. Many researchers initialize their systems with parameters estimated from data subsets determined by K-means clustering, e.g. \[6\], although Paul describes a different, bootstrapping initialization \[4\]. 
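The K-means initialization mentioned above can be sketched with a minimal one-dimensional Lloyd's algorithm. This is a generic illustration of clustering-based initialization, not the procedure of any particular cited system; real systems cluster multi-dimensional cepstral vectors, and the data here is invented.

```python
import random

def kmeans_1d(data, k, iters=20, seed=0):
    """Minimal 1-D Lloyd's algorithm: cluster centroids of this kind
    could seed the component means of a tied-mixture codebook."""
    rng = random.Random(seed)
    means = rng.sample(data, k)  # initialize at k distinct data points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest current mean.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - means[j]) ** 2)
            clusters[nearest].append(x)
        # Update step: each mean becomes its cluster's centroid
        # (empty clusters keep their previous mean).
        means = [sum(c) / len(c) if c else means[j]
                 for j, c in enumerate(clusters)]
    return sorted(means)
```

Because each iteration only moves means toward local centroids, the result depends on the initial sample, mirroring the local-optimum caveat the paragraph raises for the mixture training algorithm itself.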
Often a large number of mixture components is used and, since the parameters can be overtrained, contradictory results have been reported on the benefits of parameter re-estimation. For example, while many researchers find it useful to re-estimate all parameters of the mixture models in training, BBN reports no benefit from updating means and covariances after the initialization from clustered data \[7\]. Another variation, embodied in the CMU senone models \[8\], involves tying mixture weights over classes of context-dependent models. Their approach to finding regions of mixture-weight tying involves clustering discrete observation distributions and mapping the clustered distributions to the mixture weights of the associated triphone contexts.</Paragraph> <Paragraph position="7"> In addition to the work described above, several related methods have informed research on tied mixtures. First, mixture modeling does not require the use of Gaussian distributions: good results have also been obtained using mixtures of Laplacian distributions \[9, 10\], and presumably other component densities would perform well too. Ney \[11\] has found strong similarities between radial basis functions and mixture densities using Gaussians with diagonal covariances. Recent work at BBN has explored the use of elliptical basis functions, which share many properties with tied mixtures of full-covariance Gaussians \[12\]. Second, the positive results achieved by several researchers using non-tied mixture systems \[13\] raise the question of whether tied mixtures have significant performance advantages over untied mixtures when there is adequate training data.</Paragraph> <Paragraph position="8"> It is possible to strike a compromise and use limited tying: for instance, the context models of a phone can all use the same tied distributions (e.g. 
\[14, 15\]).</Paragraph> <Paragraph position="9"> Of course, the best choice of model depends on the nature of the observation vectors and the amount of training data. In addition, it is likely that the amount of tying in a system can be adjusted along a continuum to fit the particular task and amount of training data. However, an assessment of modeling trade-offs for speaker-independent recognition is useful for providing insight into the various choices, and also because the various results in the literature are difficult to compare due to differing experimental paradigms.</Paragraph> </Section></Paper>