<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-2010">
  <Title>Speaker Recognition with Mixtures of Gaussians with Sparse Regression Matrices</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Structure Learning
</SectionTitle>
    <Paragraph position="0"> In general, structure learning in DGM is an NP-hard problem even when all the variables are observed (Chickering et al., 1994). Our case is further complicated by the fact that we have a hidden variable (the Gaussian index). Optimum structure-finding algorithms, like the one in (Meila and Jordan, 2000) assume a mixture of trees and therefore are making limiting assumptions about the space of possible structures. In this paper, no prior assumptions about the space of possible structures are made but this leads to absence of guarantee for an optimum structure. Two approaches for structure-learning are introduced. null The first approach is to learn a discriminative structure, i.e. a structure that can discriminate between classes even though the parameters are estimated in an ML fashion. The algorithm starts from the fully connected model and deletes arcs, i.e. setsbi;jm = 08m = 1 : M (M is the number of Gaussian components in a mixture). After setting regression coefficients to zero, maximum likelihood parameter estimation of the sparse mixture is employed.</Paragraph>
    <Paragraph position="1"> A number of different structure-estimation criteria were tested for the speaker recognition task (at the right of each equation a shorthand for each criterion is defined): = [ m;Bm;Dm] = [ m; m]  Each component m=1:M requires solution of V+1 linear systems of d equations each, d=1:V+1.</Paragraph>
    <Paragraph position="2"> Computations scale down with the number of regression coefficients set to zero.</Paragraph>
    <Paragraph position="3"> Each component m=1:M requires an inversion of a rank V matrix.</Paragraph>
    <Paragraph position="4"> Iterative techniques must be employed for the sparse case.</Paragraph>
    <Paragraph position="5"> Adaptation Easy EM equations for estimating a linear transformation of a mixture of arbitrary Gaussians.</Paragraph>
    <Paragraph position="6"> Gradient descent techniques for the M-step of EM algorithm for estimating a linear transformation of a mixture of arbitrary Gaussians.</Paragraph>
    <Paragraph position="7"> Tying More flexible tying mechanisms across components, i.e. components can share B but estimate Dm.</Paragraph>
    <Paragraph position="8"> Components can share the entire .</Paragraph>
    <Paragraph position="10"> where I(Xi;Xj) is the mutual information between elements Xi and Xj of input vector X. The mutual informations are estimated by first fitting a mixture of 30 diagonal Gaussians and then applying the methods described in (Bilmes, 1999). All but MI and MIimp are discriminative criteria and all are based on finding the pairs (i;j) with the lowest values and zeroing the respective regression coefficients, for every component of the mixture. MIimp assigns the same speaker-independent structure for all speakers. For DMIconf target speaker k is the most confusable for target speaker s in terms of hits, i.e. when the truth is s, speaker k fires more than any other target speaker. We can see that different criteria aim at different goals. MI attempts to avoid the overfitting problem by zeroing regression coefficients between least marginally dependent feature elements. DMIimp attempts to discriminate against impostors, MIimp attempts to build a speaker-independent structure which will be more robustly estimated since there are more data to estimate the mutual informations and DMIconf attempts to discriminate against the most confusable target speaker. The most confusable target speakerk for a given target speaker s should be determined from an independent held-out set.</Paragraph>
    <Paragraph position="11"> There are three main drawbacks that are shared by all of the above criteria. First, they are limited by the fact that all Gaussians will have the same structure. Second, since we are estimating sparse regression matrices, it is known that the absence of an arc is equivalent to conditional independencies, yet the above criteria can only test for marginal independencies. Third, we introduce another free parameter (the number of regression coefficients to be set to zero) which can be determined from a held-out set but will require time consuming trial and error techniques. Nevertheless, they may lead to better discrimination between speakers.</Paragraph>
    <Paragraph position="12"> The second approach we followed was one based on an ML fashion which may not be optimum for classification tasks, but can assign a different structure for each component. We used the structural EM (Friedman, 1997), (Thiesson et al., 1998) and adopt it for the case of mixtures of Gaussians. Structural EM is an algorithm that generalizes on the EM algorithm by searching in the combined space of structure and parameters. One approach to the problem of structure finding would be to start from the full model, evaluate every possible combination of arc removals in every Gaussian, and pick the ones with the least decrease in likelihood. Unfortunately, this approach can be very expensive since every time we remove an arc on one of the Gaussians we have to re-estimate all the parameters, so the EM algorithm must be used for each combination. Therefore, this approach alternates parameter search with structure search and can be very expensive even if we follow greedy approaches. On the other hand, structural EM interleaves parameter search with structure search. Instead of following the sequence Estep!Mstep!structure search, structural EM follows Estep! structure search !Mstep. By treating expected data as observed data, the scoring of likelihood decomposes and therefore local changes do not influence the likelihood on other parameters. In essence, structural EM has the same core idea as standard EM. If M is the structure, are the parameters and n is the iteration index, then the naive approach would be to do:</Paragraph>
    <Paragraph position="14"> If we replace M with H, i.e. the hidden variables or sufficient statistics, we will recognize the sequence of steps as the standard EM algorithm. For a more thorough discussion of structural EM, the reader is referred to (Friedman, 1997). The paper in (Friedman, 1997) has a general discussion on the structural EM algorithm for an arbitrary graphical model. In this paper, we introduced a greedy pruning algorithm with step size K for mixtures of Gaussians. The algorithm is summarized in Table 2.</Paragraph>
    <Paragraph position="15"> One thing to note about the scoring criterion is that it is local, i.e. zeroing regression coefficient m;i;j will not involve computations on other parameters.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We evaluated our approach in the male subset of the 1996 NIST speaker recognition task (Przybocki and Martin, 1998). The problem can be described as following. Given 21 target speakers, perform 21 binary classifications (one for each target speaker) for each one of the test sentences.</Paragraph>
    <Paragraph position="1"> Each one of the binary classifications is a YES if the sentence belongs to the target speaker and NO otherwise.</Paragraph>
    <Paragraph position="2"> Under this setting, one sentence may be decided to have been generated by more than one speaker, in which case there will be at least one false alarm. Also, some of the test sentences were spoken by non-target speakers (impostors) in which case the correct answer would be 21 NO. All speakers are male and the data are from the Switchboard database (Godfrey et al., 1992). There are approximately 2 minutes of training data for each target speaker. All the training data for a speaker come from the same session and the testing data come from different sessions, but from the same handset type and phone number (matched conditions). The algorithms were evaluated on sentence sizes of three and thirty seconds. The features are 20-dimensional MFCC vectors, cepstrum mean normalized and with all silences and pauses removed. In the test data there are impostors who don't appear in the training data and may be of different gender than the target speakers.</Paragraph>
    <Paragraph position="3"> A mixture of Gaussians is trained on each one of the target speakers. For impostor modeling, a separate model is estimated for each gender. There are 43 impostors for each gender, each impostor with 2 minutes of speech.</Paragraph>
    <Paragraph position="4"> Same-gender speakers are pooled together and a mixture of 100 diagonal Gaussians is estimated on each pool. Impostor models remained fixed for all the experiments reported in this work. During testing and because some of the impostors are of different gender than the target speakers, each test sentence is evaluated against both impostor models and the one with the highest log-likelihood is chosen. For each test sentence the log-likelihood of each target speaker's model is subtracted from the log-likelihood of the best impostor model. A decision for YES is made if the difference of the log-likelihoods is above a threshold. Although in real operation of the system the thresholds are parameters that need to be estimated from the training data, in this evaluation the thresholds are optimized for the current test set. Therefore the results reported should be viewed as a best case scenario, but are nevertheless useful for comparing different approaches. null The metric used in all experiments was Equal Error Rate (EER). EER is defined as the point where the probability of false alarms is equal to the probability of missed detections. Standard NIST software tools were used for the evalution of the algorithms (Martin et al., 1997).</Paragraph>
    <Paragraph position="5"> It should be noted that the number of components per Gaussian is kept the same for all speakers. A scheme that allowed for different number of Gaussians per speaker did not show any gains. Also, the number of components is optimized on the test set which will not be the case in the real operation of the system. However, since there are only a few discrete values for the number of components and EER was not particularly sensitive to that parameter, we do not view this as a major problem.</Paragraph>
    <Paragraph position="6"> Table 3 shows the EER obtained for different base-line systems. Each cell contains two EER numbers, the left is for 30-second test utterances and the right for 3second. For the Diagonal case 35 components were used, while for the full case 12 components were used.</Paragraph>
    <Paragraph position="7"> The Random case corresponds to randomly zeroing 10% of the regression coefficients of a mixture of 16 components. This particular combination of number of parameters pruned and number of components was shown to provide the best results for a subset of the test set.</Paragraph>
    <Paragraph position="8"> All structure-finding experiments are with the same number of components and percent of regression coefficients pruned.</Paragraph>
    <Paragraph position="9"> Table 4 shows the EER obtained for different baseline Algorithm: Finding both structure and parameter values using structural EM Start with the full model for a given number of Gaussians while (number of pruned regression coefficients &lt;T) E step: Collect sufficient statistics for given structure, i.e, m(n) = p(zn = mjxn;Mold) StructureSearch: Remove one arc from a Gaussian at a time, i.e. set bmi;j = 0. The score associated with zeroing a single regression coefficient is. Scorem;i;j = 2Dimbmi;jPNn m(n)~xjn;m(~xin;m Bim~xn;m) +Dim(bmi;j)2PNn m(n)~xjn;m Order coefficients in ascending order of score. P is the set of the first K coefficients. Set the new structure Mnew as Mnew = MoldnfPg.</Paragraph>
    <Paragraph position="10"> M step: Calculate the new parameters given Mnew.</Paragraph>
    <Paragraph position="11"> This step can be followed by a number of EM iterations to obtain better parameter values.  utterances and right number for 3-second sparse structures. SEM is structural EM. The first column is zeroing the pairs with the minimum values of the corresponding criterion and the second column is zeroing the pairs with the maximum values. The second column is more of a consistency check. If the min entry of criterion A is lower than the min entry of criterion B then the max entry of criterion A should be higher than the max entry of criterion B. For the structural EM, pruning step sizes of 50 and 100 were tested and no difference was observed.</Paragraph>
    <Paragraph position="12">  is for 30 second test utterances and right number for 3second. null From Table 4 we can see improved results from the full-covariance case but results are not better than the diagonal-covariance case. All criteria appear to perform similarly. Table 4 also shows that zeroing the regression coefficients with the maximum of each criterion function does not lead to systems with much different performance. Also from Table 3 we can see that randomly zeroing regression coefficients performs approximately the same as taking the minimum or maximum. These numbers, seem to suggest that the structure of a mixture of Gaussians is not a critical issue for speaker recognition, at least with the current structure-estimation criteria used.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Summary-Future work
</SectionTitle>
    <Paragraph position="0"> In this work the problem of estimating sparse regression matrices of mixtures of Gaussians was addressed. Different structure-estimation criteria were evaluated, both discriminative and generative. The general problem of finding the optimum structure of a mixture of Gaussians has direct applications in speaker identification as well as speech recognition.</Paragraph>
    <Paragraph position="1"> Interesting connections can be drawn with Maximum Likelihood Linear Regression (MLLR) speaker adaptation (Leggetter and Woodland, 1995). Not surprisingly, the estimation equations for the regression matrix bare resemblance with the MLLR equations. However, researchers have thus far barely looked into the problem of structure-finding for speaker adaptation, focusing mostly on parameter adaptation. An interesting new topic for speaker adaptation could be joint structure and parameter adaptation.</Paragraph>
  </Section>
class="xml-element"></Paper>