<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2041">
  <Title>TIED MIXTURES IN THE LINCOLN ROBUST CSR</Title>
  <Section position="4" start_page="293" end_page="293" type="metho">
    <SectionTitle>
TIED MIXTURES
</SectionTitle>
    <Paragraph position="0"> A tied mixture HMM system simply substitutes a mixture with shared Gaussians for the observation pdf in a continuous observation HMM:</Paragraph>
    <Paragraph position="2"> where i is the state (or arc), b is the observation pdf, o is an observation vector, w is the weight, and Nj is the set of shared Gaussians. The forward-backward re-estimation procedure is identical to the procedure for independent mixtures except the Gaussians are tied. (The equations can be derived trivially from the well known independent mixture case. They are presented in \[3,2,4\] and are not repeated here.) In general, all Gaussians are used by all states, but in practice most of the weights are set to zero by the training procedure. However, the average mixture order can be very high during the early phases of training.</Paragraph>
  </Section>
  <Section position="5" start_page="293" end_page="293" type="metho">
    <SectionTitle>
TIED MIXTURE SYSTEMS AT OTHER SITES
</SectionTitle>
    <Paragraph position="0"> Several other sites have experimented with tied mixture HMM recognizers \[3,13,2,4\]. However, the initial parameters for training these systems have been derived from existing discrete observation HMM systems. The initial Gaussian means and covariances were derived from the templates of the vector quantizer, and the mixture weights were initialized from the observation probability histograms. All of these sites reported moderate performance improvements over their discrete observation systems. The work reported here does not bootstrap from any discrete observation system. This provides some additional freedom in training which may influence the final recognition performance.</Paragraph>
  </Section>
  <Section position="6" start_page="293" end_page="294" type="metho">
    <SectionTitle>
THE TESTS
</SectionTitle>
    <Paragraph position="0"> All system tests reported here were performed on the DARPA Resource Management (RM) database \[12\]. The SD system was trained on the designated 600 training sentences per speaker, and the SI system was trained on either 72 designated training speakers x 40 sentences per speaker = 2880 sentences (SI-72) or the SI-72 + 37 development test speakers x 30 sentences per speaker = 3990 sentences (SI-109). 0nly 72 of the 80 SI training and 37 of the 40 SI development test speakers could be used because the other speakers are contained in the test set. Except for the evaluation tests, the test set in all cases was all 100 development test sentences per speaker for the 12 SD speakers. These 1200 sentences contain 10242 words. The word error rate is: (substitutions + insertions + deletions) correct nr of words (3) The recognition development test results quoted in the text and in Table 1 are percent word error rate with the perplexity 60 word-pair grammar.</Paragraph>
  </Section>
  <Section position="7" start_page="294" end_page="295" type="metho">
    <SectionTitle>
THE TIED MIXTURE SYSTEMS AND EXPERIMENTS
</SectionTitle>
    <Paragraph position="0"> The systems reported at the February 1989 DARPA meeting (the &amp;quot;Feb89&amp;quot; systems) \[10,11\] were a single Gaussian per state HMM with word (boundary) context-dependent (WCD) triphones for SD, and a variable order Gaussian mixture per state HMM with word (boundary) context-free (WCF) triphone models. All Gaussians used a single-tied (grand) diagonal covariance matrix.</Paragraph>
    <Paragraph position="1"> The observation vector is a 10 ms mel-cepstrum augmented with a temporal difference (&amp;quot;delta&amp;quot;) mel-cepstrum. The development test performances are shown in Table 1.</Paragraph>
    <Paragraph position="2"> The tied-mixture systems were initialized by a modification of our monophone bootstrapping procedure \[10\]. As in the Feb89 systems, single Gaussian monophone (context independent phone) models were trained from a &amp;quot;fiat&amp;quot; start (all phones identical) and used to initialize single Ganssian triphone models. This produced about 7200 (one per state) Gaussians with a tied (grand) variance.</Paragraph>
    <Paragraph position="3"> The means of these Gaussians were treated as observations and clustered down to 256 clusters by a binary-splitting k-means algorithm. (The tied variance was used but not altered during clustering.) The mixture weights were initialized by computing the Gaussian probability of the cluster mean given the state and then normalizing according to Eq. 2. All parameters (transition probabilities, distribution weights, Gaussian means, and tied variance) were trained. (Each stage of training used the forward-backward algorithm.) If a mixture weight became less than a threshold, the component was removed from the mixture. Thus the mixtures were automatically pruned in response to the training data to reduce the computation. Average mixture orders were initially very high, but were reduced significantly by the end of training.</Paragraph>
    <Paragraph position="4"> The first tied mixture system used only mel-cepstral observations, WCF triphone models, and 256 Gaussians. (Unless otherwise noted, all of the following systems use WCF triphone models.) Results for SD (5.5% word errors) were very similar to the corresponding Feb89 system (5.2%), but the SI-72 performance was significantly degraded: 26.2% word errors vs. 12.9% for the Feb89 system. The reduced performance without the delta mel-cepstral parameters was not unexpected.</Paragraph>
    <Paragraph position="5"> However, the number of Gaussians was reduced from 24,000 for SI-72 and 7200 for SD to 256.</Paragraph>
    <Paragraph position="6"> Delta mel-cepstral parameters were then returned to the system by augmenting the observation vector. The performance on the SD task decreased to 6.1% word errors, but the SI-72 task improved to 17.2% word error rate. Including the delta parameters changed the relation between the mel-cepstral and delta mel-cepstraJ observations for the SD system. In the single Gaussian case, the diagonal covariance matrix treated the mel-cepstral and the delta mel-cepstral observations as statistically independent. However, the mixture weights induced a relation between the two parameter sets. (They were already related in the SI system due to the independent mixtures.) Increasing the number of Ganssians to 512 to increase the system's ability to model the correlation between the mel-cepstral and delta mel-cepstral parameters improved the SD performance to 5.0% word errors but had no effect on SI-72: 17.1% word errors. It appears that there was insufficient data to train the correlations or still an insufficient number of Gaussians to model the correlations in the SI task.</Paragraph>
    <Paragraph position="7"> A number of other sites \[14,5,6\], for example, have improved performance with limited training data by separating different parameters into separate observation streams and multiplying their respective observation probabilities to force the HMM to treat them as if they were statistically independent. Therefore, the mel-cepstra and the delta mel-cepstra were split into separate observation streams:</Paragraph>
    <Paragraph position="9"> where c denotes mel-cepstrum and d denotes the delta mel-cepstrum. This maintained the performance on the SD task (5.0% word errors) and further improved the performance on the SI-72 task to 14.7% word errors.</Paragraph>
    <Paragraph position="10"> Next, the training procedure was modified by, instead of clustering the means of the Gaussians, clustering a subset of the training data, again using a binary-splitting k-means algorithm. It was hoped that this would provide an initialization with better representation of outliers which might have been suppressed by the single Ganssians. This change resulted in improvements in both tasks: the SD error rate went down to 4.7% and the SI-72 error rate went down to 13.7% A variation in the training procedure of the &amp;quot;kt&amp;quot; systems was tested. It was feared that the high-frequency triphones were dominating the Ganssian means in the early iterations of training causing damage to the modeling of low-frequency triphones. Therefore, the Ganssian means were not trained until the weights had settled. This was intended to protect the Gaussian means until the phone models had become very specific. No improvement was found.</Paragraph>
    <Paragraph position="11"> To fully test for outliers, the system was initialized wilth a set of Ganssians formed by binary-splitting k-means clustering a subset of the training data using the perceptually-motivated weighting \[8,9\] (which was again not altered during clustering). The system was started with flat start tied mixture monophones (maximum order mixtures with all weights equal). These monophone models were used to bootstrap the triphone models, again using the forward-backward algorithm at each stage. These &amp;quot;ks&amp;quot; systems provided the best performance for the SD task (4.5% word errors), but failed to improve on the SI-72 task (15.3% word errors), probably due to the slight smoothing induced by the old initialization. This SD performance is better than the corresponding Feb89 system with WCF triphone models (5.2%).</Paragraph>
    <Paragraph position="12"> None of the above systems used word context (boundary) modeling. The &amp;quot;kt&amp;quot; system was tested on the SD task using word context-dependent models. The performance (4.0% word errors) was better than the WCF system (4.7% word errors), but was not better than the Feb89 SD systems with WCD models (3.0% word errors). The tied mixture system appears to require more training data than does a single Gaussian per state system.</Paragraph>
    <Paragraph position="13"> The above systems do not have any smoothing on the mixture weights. A preliminary attempt to use deleted interpolation across phonetic contexts \[1,5\] caused a slight increase in the error rate of an SI-72 system.</Paragraph>
  </Section>
</Paper>