<?xml version="1.0" standalone="yes"?>
<Paper uid="H93-1020">
  <Title>ON THE USE OF TIED-MIXTURE DISTRIBUTIONS</Title>
  <Section position="5" start_page="102" end_page="103" type="metho">
    <SectionTitle>
3. TRAINING ALGORITHMS
</SectionTitle>
    <Paragraph position="0"> In this section we first review properties of the SSM and then describe the training algorithm used for tied mixtures with the SSM. Next, we describe an efficient method for training context-dependent models, and lastly we describe a parallel implementation of the trainer that greatly reduces experimentation time.</Paragraph>
    <Paragraph position="1"> 3.1. The SSM and &amp;quot;Viterbi&amp;quot; Training with Tied Mixtures The SSM is characterized by two components: a family of length-dependent distribution functions and a deterministic mapping function that determines the distribution for a variable-length observed segment. More specifically, in the work presented here, a linear time warping function maps each observed frame to one of m regions of the segment model. Each region is described by a tied Gaussian mixture distribution, and the frames are assumed conditionally independent given the length-dependent warping. The conditional independence assumption allows robust estimation of the model's statistics and reduces the computation of determining a segment's probability, but the potential of the segment model is not fully utilized. Under this formulation, the SSM is similar to a tied-mixture tIMM with a phone-length-dependent, constrained state trajectory.</Paragraph>
    <Paragraph position="2"> Thus, many of the experiments reported here translate to HMM systems.</Paragraph>
    <Paragraph position="3"> The SSM training algorithm \[16\] iterates between segmentation and maximum likelihood parameter estimation, so that during the parameter estimation phase of each iteration, the segmentation of that pass gives a set of known phonetic boundaries. Additionally, for a given phonetic segmentation, the assignment of observations to regions of the model is uniquely determined. SSM training is similar to IIMM &amp;quot;Viterbi training&amp;quot;, in which training data is segmented using the most likely state sequence and model parameters are updated using this segmentation. Although it is possible to define an SSM training algorithm equivalent to the Baum-Welch algorithm for HMMs, the computation is prohibitive for the SSM because of the large effective state space.</Paragraph>
    <Paragraph position="4">  The use of a constrained segmentation greatly simplifies parameter estimation in the tied mixture case, since there is only one unobserved component, the mixture mode. In this case, the parameter estimation step of the iterative segmentation/estimation algorithm involves the standard iterative expectation-maximization (EM) approach to estimating the parameters of a mixture distribution \[17\]. In contrast, the full EM algorithm for tied mixtures in an HMM handles both the unobserved state in the Markov chain and the unobserved mixture mode \[21.</Paragraph>
    <Section position="1" start_page="103" end_page="103" type="sub_section">
      <SectionTitle>
3.2. Tied-Mixture Context Modeling
</SectionTitle>
      <Paragraph position="0"> We have investigated two methods for training context-dependent models. In the first, weights are used to combine the probability of different types of context. These weights can be chosen by hand \[18\] or derived automatically using a deleted-interpolation algorithm \[3\]. Paul evaluated both types of weighting for tied-mixture context modeling and reported no significant performance difference between the two \[4\]. In our experiments, we evaluated just the use of hand-picked weights.</Paragraph>
      <Paragraph position="1"> In the second method, only models of the most detailed context (in our case triphones) are estimated directly from the data and simpler context models (left, right, and context-independent models) are computed as marginals of the triphone distributions. The computation of marginals is negligible since it involves just the summing and normalization of mixture weights at the end of training. This method reduces the number of model updates in training in proportion to the number of context types used, although the computation of observation probabilities conditioned on the mixture component densities, remains the same. In recognition with marginal models, it is still necessary to combine the different context types, and we use the same hand-picked weights as before for this purpose. We compared the two training methods and found that performance on an independent test set was essentially the same for both methods (marginal training produced 2 fewer errors on the Feb89 test set) and the marginal trainer required 20 to 35% less time, depending on the model size and machine memory.</Paragraph>
    </Section>
    <Section position="2" start_page="103" end_page="103" type="sub_section">
      <SectionTitle>
3.3. Parallel Training
</SectionTitle>
      <Paragraph position="0"> To reduce computation, our system prunes low probability observations, as in \[4\], and uses the marginal training algorithm described above. However, even with these savings, tied-mixture training involves a large computation, making experimentation potentially cumbersome.</Paragraph>
      <Paragraph position="1"> When the available computing resources consist of a network of moderately powerful workstations, as is the case at BU, we would like to make use of many machines at once to speed training. At the highest level, tied mixture training is inherently a sequential process, since each pass requires the parameter estimates from the previous pass. However, the bulk of the training computation involves estimating counts over a database, and these counts are all independent of each other. We can therefore speed training by letting machines estimate the counts for different parts of the database in parallel and combine and normalize their results at the end of each pass.</Paragraph>
      <Paragraph position="2"> To implement this approach we use a simple &amp;quot;bakery&amp;quot; algorithm to assign tasks: as each machine becomes free, it reads and increments the value of a counter from a common location indicating the sentences in the database it should work on next. This approach provides load balancing, allowing us to make efficient use of machines that may differ in speed. Because of the coarse grain of parallelism (one task typically consists of processing 10 sentences), we can use the relatively simple mechanism of file locking for synchronization and mutual exclusion, with no noticeable efficiency penalty. Finally, one processor is distinguished as the &amp;quot;master&amp;quot; processor and is assigned to perform the collation and normalization of counts at the end of each pass. With this approach, we obtain a speedup in training linear with the number of machines used, providing a much faster environment for experimentation.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="103" end_page="105" type="metho">
    <SectionTitle>
4. MODELING &amp; ESTIMATION
TRADE-OFFS
</SectionTitle>
    <Paragraph position="0"> Within the framework of tied Gaussian mixtures, there are a number of modeling and training variations that have been proposed. In this section, we will describe several experiments that investigate the performance implications of some of these choices.</Paragraph>
    <Section position="1" start_page="103" end_page="104" type="sub_section">
      <SectionTitle>
4.1. Experimental Paradigm
</SectionTitle>
      <Paragraph position="0"> The experiments described below were run on the Resource Management (RM) corpus using speakerindependent, gender-dependent models trained on the standard SI-109 data set. The feature vectors used as input to the system are computed at 10 millisecond intervals and consist of 14 cepstral parameters, their first differences, and differenced energy (second cepstral differences are not currently used). In recognition, the SSM uses an N-best rescoring formalism to reduce computation: the BBN BYBLOS system \[7\] is used to generate 20 hypotheses per sentence, which are rescored by the SSM and combined with the number of phones, number of words, and (optionally) the BBN HMM score, to rerank the hypotheses. The weights for recombination  are estimated on one test set and held fixed for all other test sets. Since our previous work has indicated problems in weight estimation due to test-set mismatch, we have recently introduced a simple time normalization of the scores that effectively reduces the variability of scores due to utterance length and leads to more robust performance across test sets.</Paragraph>
      <Paragraph position="1"> Although the weight estimation test set is strictly speaking part of the training data, we find that for most experiments, the bias in this type of testing is small enough to allow us to make comparisons between systems when both are run on the weight-training set. Accordingly some of the experiments reported below are only run on the weight training test set. Of course, final evaluation of a system must be on an independent test set.</Paragraph>
    </Section>
    <Section position="2" start_page="104" end_page="105" type="sub_section">
      <SectionTitle>
4.2. Experiments
</SectionTitle>
      <Paragraph position="0"> We conducted several series of experiments to explore issues associated with parameter allocation and training. The results are compared to a baseline, non-mixture SSM that uses full covariance Gaussian distributions.</Paragraph>
      <Paragraph position="1"> The first set of experiments examined the number of component densities in the mixture, together with the choice of full- or diagonal-covariance matrices for the mixture component densities. Although the full covariance assumption provides a more detailed description of the correlation between features, diagonal covariance models require substantially less computation and it may be possible to obtain very detailed models using a larger number of diagonal models.</Paragraph>
      <Paragraph position="2"> In initial experiments with just female speakers, we used diagonal covariance Gaussians and compared 200- versus 300-density mixture models, exploring the range typically reported by other researchers. With context-independent models, after several training passes, both systems got 6.5% word error on the Feb89 test set. For context-dependent models, the 300-density system performed substantially better, with a 2.8% error rate, compared with 4.2% for the 200 density system. These results compare favorably with the baseline SSM which has an error rate on the Feb89 female speakers of 7.7% for context-independent models and 4.8% for context-dependent models.</Paragraph>
      <Paragraph position="3"> For male speakers, we again tried systems of 200 and 300 diagonal covariance density systems, obtaining error rates of 10.9% and 9.1% for each, respectively. Unlike the females, however, this was only slightly better than the result for the baseline SSM, which achieves 9.5%.</Paragraph>
      <Paragraph position="4"> We tried a system of 500 diagonal covariance densities, which gave only a small improvement in performance to 8.8% error. Finally, we tried using full-covariance Gaussians for the 300 component system and obtained an 8.0% error rate. The context-dependent performance for males using this configuration showed similar improvement over the non-mixture SSM, with an error rate of 3.8% for the mixture system compared with 4.7% for the baseline. Returning to the females, we found that using full-covariance densities gave the same performance as diagonal. We have adopted the use of full-covariance models for both genders for uniformity, obtaining a combined word error rate of 3.3% on the Feb89 test set.</Paragraph>
      <Paragraph position="5"> In the RM SI-109 training corpus, the training data for males is roughly 2.5 times that for females, so it is not unexpected that the optimal parameter allocation for each may differ slightly.</Paragraph>
      <Paragraph position="6"> Unlike other reported systems which treat cepstral parameters and their derivatives as independent observation streams, the BU system models them jointly using a single output stream, which gives better performance than independent streams with a single Gaussian distribution (non-mixture system). Presumably, the result would also hold for mixtures.</Paragraph>
      <Paragraph position="7"> Since the training is an iterative hill climbing technique, initialization can be important to avoid converging to a poor solution. In our system, we choose initial models, using one of the two methods described below. These models are used as input to several iterations of context-independent training followed by context-dependent training. We add a small padding value to the weight estimates in the early training passes to delay premature parameter convergence.</Paragraph>
      <Paragraph position="8"> We have investigated two methods for choosing the initial models. In the first, we cluster the training data using the K-means algorithm and then estimate a mean and covariance from the data corresponding to each cluster. These are then used as the parameters of the component Gaussian densities of the initial mixture. In the second method, we initialize from models trained in a non-mixture version of the SSM. The initial densities are chosen as means of triphone models, with covariances chosen from the corresponding context-independent model. For each phone in our phone alphabet we iteratively choose the triphone model of that phone with the highest frequency of occurrence in training. The object of this procedure is to attempt to cover the space of phones while using robustly estimated models.</Paragraph>
      <Paragraph position="9"> We found that the K-means initialized models converged slower and had significantly worse performance on independent test data than that of the second method. Although it is possible that with a larger padding value added to the weight estimates and more training passes, the K-means models might have &amp;quot;caught up&amp;quot; with the</Paragraph>
    </Section>
    <Section position="3" start_page="105" end_page="105" type="sub_section">
      <SectionTitle>
Table 1
</SectionTitle>
      <Paragraph position="0"> Table 1: Word error rates (%) on the Oct89 and Sept92 test sets for the baseline non-mixture SSM (4.8, 8.5), the tied-mixture SSM alone (3.6, 7.3), and the SSM in combination with the BYBLOS HMM system (3.2, 6.1).</Paragraph>
      <Paragraph position="1"> other models, we did not investigate this further. The various elements of the mixtures (means, covariances, and weights) can each be either updated in training, or assumed to have fixed values. In our experiments, we have consistently found better performance when all parameters of the models are updated.</Paragraph>
      <Paragraph position="2"> Table 1 gives the performance on the RM Oct89 and Sept92 test set for the baseline SSM, the tied-mixture SSM system, and the tied-mixture system combined in N-best rescoring with the BBN BYBLOS HMM system.</Paragraph>
      <Paragraph position="3"> The mixture SSM's performance is comparable to results reported for many other systems on these sets. We note that it may be possible to improve SSM performance by incorporating second difference cepstral parameters as most HMM systems do.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="105" end_page="106" type="metho">
    <SectionTitle>
5. SEGMENTAL MIXTURE
MODELING
</SectionTitle>
    <Paragraph position="0"> In the version of the SSM described in this paper, in which observations are assumed conditionally independent given model regions, the dependence of observations over time is modeled implicitly by the assumption of time-dependent stationary regions in combination with the constrained warping of observations to regions. Because segmentation is explicit in this model, in principle it is straightforward to model distinct segmental trajectories over time by using a mixture of such segment-level models, and thus take better advantage of the segment formalism. The probability of the complete segment of observations, Y, given phonetic unit c~ is then</Paragraph>
    <Paragraph position="2"> where each of the densities P(Y\]trk) is an SSM. The component models could use single Gaussians instead of tied mixtures for the region dependent distributions and they would remain independent frame models, but in training all the observations for a phone would be updated jointly, so that the mixture components capture distinct trajectories of the observations across a complete segment. In practice, each such trajectory is a point in a very high-dimensional feature space, and it is necessary to reduce the parameter dimension in order to train such models. There are several ways to do this. First, we can model the trajectories within smaller, subphonetic units, as in the microsegment model described in \[19, 20\]. Taking this approach and assuming microsegments are independent, the probability for a segment is</Paragraph>
    <Paragraph position="4"> where aik is the k th mixture component of microsegment j and Yj is the subset of frames in Y that map to microsegment j. Given the SSM's deterministic warping and assuming the same number of distributions for all mixture components of a given microsegment, the extension of the EM algorithm for training mixtures of this type is straightforward. The tied-mixture SSM discussed in previous sections is a special case of this model, in which we restrict each microsegment to have just one stationary region and a corresponding mixture distribution. null A different way to reduce the parameter dimension is to continue to model the complete trajectory across a segment, but assume independence between subsets of the features of a frame. This case can be expressed in the general form of (2) if we reinterpret the Yj as vectors with the same number of frames as the complete segment, but for each frame, only a specific subset of the original frame's features are used. We can of course combine these two approaches, and assume independence between observations representing feature subsets of different microsegmental units. There are clearly a large number of possible decompositions of the complete segment into time and feature subsets, and the corresponding models for each may have different properties. In general, because of constraints of model dimensionality and finite training data, we expect a trade-off between the ability to model trajectories across time and to model the correlation of features within a local time region.</Paragraph>
    <Paragraph position="5"> Although no single model of this form may have all the properties we desire, we do not necessarily have to choose one to the exclusion of all others. All the models discussed here compute probabilities over the same observation space, allowing for a straightforward combination of different models, once again using the simple mechanism of non-tied mixtures:</Paragraph>
    <Paragraph position="7"> In this case, each of the i components of the leftmost summation is some particular realization of the general  model expressed in Equation (2). Such a mixture can combine component models that individually have beneficial properties for modeling either time or frequency correlation, and the combined model may be able to model both aspects well. We note that, in principle, this model can also be extended to larger units, such as syllables or words.</Paragraph>
  </Section>
</Paper>