<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-2010">
  <Title>Speaker Recognition with Mixtures of Gaussians with Sparse Regression Matrices</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Most state-of-the-art systems in speech and speaker recognition use mixtures of Gaussians when fitting a probability distribution to data. Reasons for this choice are the easily implementable estimation formulas and the modeling power of mixtures of Gaussians. For example, a mixture of diagonal Gaussians can still model dependencies on the global level. An established practice when applying mixtures of Gaussians is to use either full or diagonal covariances. However, imposing a structure may not be optimum and a more general methodology should allow for joint estimation of both the structure and parameter values.1.</Paragraph>
    <Paragraph position="1"> The first question we have to answer is what type of structure we want to estimate. For mixtures of Gaussians there are three choices. Covariances, inverse covariances or regression matrices. For all cases, we can see as selecting a structure by introducing zeros in the respective matrix. The three structures are distinctively different and zeros in one matrix do not, in general, map to zeros in another matrix. For example, we can have sparse covariance 1Here, we describe the Maximum Likelihood estimation methodology for both structure and parameters. One alternative is Bayesian estimation.</Paragraph>
    <Paragraph position="2"> but full inverse covariance or sparse inverse covariance and full regression matrix.</Paragraph>
    <Paragraph position="3"> There are no clear theoretical reasons why one choice of structure is more suitable than others. However, introducing zeros in the inverse covariance can be seen as deleting arcs in an Undirected Graphical Model (UGM) where each node represents each dimension of a single Gaussian (Bilmes, 2000). Similarly, introducing zeros in the regression matrix can be seen as deleting arcs in a Directed Graphical Model (DGM). There is a rich body of work on structure learning for UGM and DGM and therefore the view of a mixture of Gaussians as a mixture of DGM or UGM may be advantageous. Under the DGM framework, the problem of Gaussian parameter estimation can be cast as a problem of estimating linear regression coefficients. Since the specific problem of selecting features for linear regression has been encountered in different fields in the past, we adopt the view of a mixture of Gaussians as a mixture of DGM.</Paragraph>
    <Paragraph position="4"> In (Bilmes, 2000), the problem of introducing zeros in regression matrices of a mixture of Gaussians was presented. The approach taken was to set to zero the pairs with the lowest mutual information, i.e. bmi;j = 0 () I(Xi;Xj) 0, wheremis the Gaussian index andbi;j is the (i;j) element of regression matrix B. The approach was tested for the task of speech recognition in a limited vocabulary corpus and was shown to offer the same performance with the mixture of full-covariance Gaussians with 30% less parameters. full covariances. One issue with the work in (Bilmes, 2000) is that the structure-estimation criterion that was used was not discriminative. For classification tasks, like speaker or speech recognition, discriminative parameter estimation approaches achieve better performance than generative ones, but are in general hard to estimate especially for a high number of classes. In this work, a number of discriminative structure-estimation criteria tailored for the task of speaker recognition are introduced. We avoid the complexities of discriminative parameter estimation by estimating a discriminative structure and then applying generative parameter estimation techniques. Thus, overall the models attempt to model the discriminability between classes without the numerical and implementation diffi-</Paragraph>
    <Paragraph position="6"> culties that such techniques have. A comparison of the new discriminative structure-estimation criteria with the structural EM algorithm is also presented.</Paragraph>
    <Paragraph position="7"> This paper is structured as follows. In section 2, the view of a Gaussian as a directed graphical model is presented. In section 3, discriminative and generative structure-estimation criteria for the task of speaker recognition are detailed, along with a description of the structural EM algorithm. In section 4, the application task is described and the experiments are presented. Finally, in section 5, a summary and possible connections of this work with the speaker adaptation problem are discussed.</Paragraph>
    <Paragraph position="8"> 2 Gaussians as Directed Graphical Models Suppose that we have a mixture of M Gaussians:</Paragraph>
    <Paragraph position="10"> It is known from linear algebra that any square matrix A can be decomposed as A = LDU, where L is a lower triangular matrix, D is a diagonal matrix and U is an upper triangular matrix. In the special case where A is also symmetric and positive definite the decomposition becomes A = UTDU where U is an upper triangular matrix with ones in the main diagonal. Therefore we can write U = I B with bij = 0 if i&gt;= j.</Paragraph>
    <Paragraph position="11"> The exponent of the Gaussian function can now be written as (Bilmes, 2000):</Paragraph>
    <Paragraph position="13"> withV being the dimensionality of each vector. Equation 3 shows that the problem of Gaussian parameter estimation can be casted as a linear regression problem. Regression schemes can be represented as Directed Graphical Models. In fact, the multivariate Gaussian can be represented as a DGM as shown in Figure 1. Absent arcs represent zeros in the regression matrix. For example the B matrix in Figure 1 would have b1;4 = b2;3 = 0.</Paragraph>
    <Paragraph position="14"> We can use the EM algorithm to estimate the parameters of a mixture of Gaussian = [ mBmDm]. This formulation offers a number of advantages over the traditional formulation with means and covariances. First, it avoids inversion of matrices and instead solves V + 1 linear systems of d equations each where d = 1 : V + 1.</Paragraph>
    <Paragraph position="15"> If the number of components and dimensionality of input vectors are high, as it is usually the case in speech recognition applications, the amount of computations saved can be important. Second, the number of computations scale down with the number of regression coefficients set to zero. This is not true in the traditional formulation because introducing zeros in the covariance matrix may result in a non-positive definite matrix and iterative techniques should be used to guarantee consistency (Dempster, 1972). Third, for the traditional formulation, adapting a mixture of non-diagonal Gaussians with linear transformations leads to objective functions that cannot be maximized analytically. Instead, iterative maximization techniques, such as gradient descent, are used. With the new formulation even with arbitrary Gaussians, closed-form update equations are possible. Finally, the new formulation offers flexibility in tying mechanisms. Regression matrices Bm and variances Dm can be tied in different ways, for example all the components can share the same regression matrix but estimate a different variance diagonal matrix for each component. Similar schemes were found to be succesful for speech recognition (Gales, 1999) and this formulation can provide a model that can extend such tying schemes. The advantages of the new formulation are summarized in Table 1.</Paragraph>
  </Section>
class="xml-element"></Paper>