<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1060">
  <Title>A New Paradigm for Speaker-Independent Training and Speaker Adaptation</Title>
  <Section position="3" start_page="0" end_page="306" type="metho">
    <SectionTitle>
2. Speaker-Independent Training
109 Speaker SI Training
</SectionTitle>
    <Paragraph position="0"> For several years, the DARPA Resource Management continuous speech corpus has provided a testbed for SI recognition. 109 speakers are designated as training speakers and are each represented by a sample of 40 utterances. Typically, the data from all the speakers is pooled at the outset, as if it all came from one speaker. Although the training data originates from many diverse sources, the forward-backward (Baum-Welch) training procedure is robust enough to do a reasonable job of modeling the pooled data. When used with a standard word-pair grammar of perplexity 60, state-of-the-art SI recognition performance for this corpus is 6--7% word error rate.</Paragraph>
    <Paragraph position="1"> This performance is 3 times worse than our current SD performance using 600 training utterances. Also, the sentence error rate at this level of performance is greater than 30% -- a level of error that we assume is far too high for the acoustic component of a spoken language system. Furthermore, this performance has been achieved with an artificial and non-robust grammar of modest perplexity which will not work within an SLS context. Combining the need for higher absolute performance with the need to use less powerful grammars indicates that the current SI error rate may need to be reduced by a factor of at least 4 to be acceptable for SLS applications.</Paragraph>
    <Section position="1" start_page="306" end_page="306" type="sub_section">
      <SectionTitle>
Results of SI Experiments
</SectionTitle>
      <Paragraph position="0"> Results for several SI experiments are shown in table 1. All results are from first runs of the designated Feb. '89 SI test set on the given system configuration. This test set consists of 10 speakers (4 females) with 30 utterances each.</Paragraph>
      <Paragraph position="1"> All runs used the standard word-pair grammar of perplexity 60. System parameters were fixed before running any of the conditions in this experiment. The limited development testing which we did perform was done only on the June '88 SD/SI test set using only the 109 speaker SI model.</Paragraph>
      <Paragraph position="2"> For each condition we show the number of training speakers, and the manner in which the models were trained and smoothed. The training was done either on pooled data (joint training) or on individual speakers' data (indep training). The smoothing was either not done, or was applied to either the jointly or independently trained model. For each condition, the word error rate, which includes insertion errors, and sentence error rate are given.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="306" end_page="307" type="metho">
    <SectionTitle>
12 Speaker SI Training
</SectionTitle>
    <Paragraph position="0"> Since we planned to perform adaptation from 12 reference speakers, we needed to run an SI control condition by using the data in the usual pooled fashion. We ran a comparative test using data from only the 12 speakers from the SD segment of the DARPA database. The training for each speaker consisted of 600 training utterances. Seven of the speakers are male.</Paragraph>
    <Paragraph position="1"> We did have some indication that pooling the data of even a few speakers could make large improvements from an experiment conducted at IBM and described in \[5\]. However, 12 speakers could hardly be expected to contain an example of all speaker types in the general population (including both genders), so we could anticipate the need for some kind of smoothing before we began. Our usual technique for smoothing across the bins of the discrete densities, triphone cooccurrence smoothing \[7\], has proven to be an effective method for dealing with the widely varying amounts of training data for the detailed context models in the system. When used in a SD training scenario, it has allowed us to observe a performance gain for explicitly modeling several thousand triphones which were observed only once or twice in the training.</Paragraph>
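    The smoothing idea above can be illustrated with a minimal sketch, assuming discrete-density HMMs over a VQ codebook: a sparse, undertrained state distribution is redistributed toward codewords that cooccur with it within the same speaker's data. The function name, the row-stochastic form of the cooccurrence matrix, and the fixed interpolation weight are illustrative assumptions, not BBN's implementation.

```python
import numpy as np

def cooccurrence_smooth(dist, cooc, weight=0.5):
    """Smooth a discrete HMM output distribution over VQ codewords.

    dist:   (K,) probability vector for one state (may be sparsely trained).
    cooc:   (K, K) row-stochastic matrix; cooc[j, k] estimates
            P(codeword k | codeword j) from within-speaker cooccurrence.
    weight: interpolation weight for the smoothed estimate
            (hypothetical; the paper does not give the exact combination).
    """
    smoothed = cooc.T @ dist            # spread mass onto confusable codewords
    out = (1.0 - weight) * dist + weight * smoothed
    return out / out.sum()              # renormalize to a distribution

# Toy example: a sharply peaked, undertrained distribution over 4 codewords.
dist = np.array([1.0, 0.0, 0.0, 0.0])
cooc = np.array([[0.7, 0.2, 0.1, 0.0],
                 [0.2, 0.6, 0.1, 0.1],
                 [0.1, 0.1, 0.7, 0.1],
                 [0.0, 0.1, 0.1, 0.8]])
print(cooccurrence_smooth(dist, cooc))
```

As the paper argues, this only helps when `cooc` is estimated from a single speaker; pooling speakers before estimating it makes the cooccurrences more random and the smoothing less informative.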
    <Paragraph position="2"> However, the cooccurrence smoothing is not appropriate for models derived from the pooled data of many speakers.</Paragraph>
    <Paragraph position="3"> Spectra from different speakers will cooccur much more randomly than spectra from a single speaker. This will yield poorer estimates of the smoothing matrices. As such, triphone cooccurrence smoothing is a speaker-specific modeling technique. If the data is pooled prior to training, we cannot effectively apply our best smoothing to the model.</Paragraph>
    <Paragraph position="4"> This realization has led us to examine the practice of pooling the data in the first place. A straightforward alternative to pooling the data is to keep the speakers separated until the speaker-specific operations of training and smoothing have been completed and then combine the multiple SD models.</Paragraph>
    <Paragraph position="5"> To allow the model combination to be done by averaging the model statistics, we constructed an SI codebook which was used in common for all speakers.</Paragraph>
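    Because all speakers share one SI codebook, "averaging the model statistics" reduces to averaging corresponding discrete distributions across the per-speaker models. A minimal sketch, with hypothetical function and parameter names; the optional occupancy-weighted variant is an assumption, and a plain unweighted average is the simplest reading of the paper's description:

```python
import numpy as np

def combine_sd_models(models, counts=None):
    """Combine per-speaker discrete HMM output distributions that share
    a common SI codebook, by averaging the model statistics.

    models: list of (S, K) arrays, one per speaker -- S states, K codewords,
            each row a probability distribution.
    counts: optional list of per-speaker state occupancy counts (S,) for a
            weighted average; None gives a plain unweighted average.
    """
    stacked = np.stack(models)                 # (n_speakers, S, K)
    if counts is None:
        combined = stacked.mean(axis=0)
    else:
        w = np.stack(counts)[:, :, None]       # (n_speakers, S, 1)
        combined = (w * stacked).sum(axis=0) / w.sum(axis=0)
    return combined / combined.sum(axis=1, keepdims=True)
```

Note that nothing here requires the speakers to have been trained jointly, which is what makes adding a new training speaker a matter of averaging in one more model rather than retraining.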
    <Paragraph position="6"> [Table 1 (not extracted): results on the Feb. '89 test set with word-pair grammar.]</Paragraph>
    <Paragraph position="7"> The 109 speaker conditions were run to calibrate the BYBLOS system with published results for the same test set.</Paragraph>
    <Paragraph position="8"> We observe a small improvement, from 7.1% to 6.5% word error, for using smoothing on the jointly trained model. The 6.5% error rate is comparable to the best performance on record (6.1%) for this test set which was achieved by Lee as noted in \[4\]. Furthermore, the sentence error rates are identical. Lee's system used a corrective training and reinforcement procedure to increase the discrimination ability of the model for confusable words. No corrective training was used for the BYBLOS results given in table 1.</Paragraph>
    <Paragraph position="9"> The system configuration for the 109 condition was identical to that which we use for SD recognition except for one difference. One new system parameter was added to decrease the lambda factors used for combining the context-dependent models into interpolated triphones \[6\] by a factor of eight to account for the larger corpus.</Paragraph>
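    The lambda adjustment above can be sketched as a back-off interpolation of context-dependent estimates into an interpolated triphone, where count-derived lambdas are divided by a corpus-size factor (eight in the paper). This is a sketch under stated assumptions: the count-to-weight mapping and the nesting order are illustrative, not BBN's exact scheme from \[6\].

```python
import numpy as np

def interpolate_triphone(p_tri, p_left, p_right, p_phone,
                         counts, corpus_scale=1.0):
    """Combine context-dependent models into an interpolated triphone.

    p_tri, p_left, p_right, p_phone: distributions for the triphone,
        left-diphone, right-diphone, and context-independent phone.
    counts: raw training counts for (triphone, left-diphone, right-diphone).
    corpus_scale: divides the count-derived lambdas, mimicking the
        'decrease the lambdas by a factor of eight' adjustment.
    """
    c = np.asarray(counts, dtype=float) / corpus_scale
    lam = c / (c + 1.0)                # hypothetical count-to-weight mapping
    # Back off from the most specific context to the least specific.
    p = lam[0] * np.asarray(p_tri, float) + (1 - lam[0]) * (
        lam[1] * np.asarray(p_left, float) + (1 - lam[1]) * (
        lam[2] * np.asarray(p_right, float) + (1 - lam[2]) * np.asarray(p_phone, float)))
    return p / p.sum()
```

Dividing the counts before the count-to-weight mapping shifts weight away from the sparse triphone estimates toward the better-trained backoff models, which is the intended effect of the adjustment for the larger pooled corpus.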
    <Paragraph position="10"> Next we repeated the same conditions for the 12 speaker SI model. Simply pooling the 12 speakers without smoothing does not perform as well as the 109 speaker model. And once again, smoothing the jointly trained model has a rather weak effect on performance. However, we were surprised that the 12 speaker model should have only 25% more error than the 109 speaker model.</Paragraph>
    <Paragraph position="11"> The final two results show the effect of independently smoothing the 12 speaker model after either joint or independent training. To independently smooth the jointly trained model, we first trained on the pooled data as usual. Then an SD model was made, for each training speaker, by running the forward-backward algorithm on the combined SI model but on data from only one speaker in turn. This allowed us to generate a set of SD models for smoothing, which shared a common alignment. The smoothed models were then recombined by averaging the model statistics.</Paragraph>
    <Paragraph position="12"> The approach used on the final result is the most straightforward -- we train multiple independent SD models allowing each to align optimally for the specific speaker, smooth each model to model spectral variation within each speaker, and then combine the models by averaging corresponding probabilities in the models.</Paragraph>
    <Paragraph position="13"> As is evident from the table, both of the final methods improve due to the increased effectiveness of the smoothing when it is applied to a speaker-specific model. In a final surprise, we find that constraining all the speakers to a common alignment does not help. Further, the word error rate of this simple model is only 15% worse than our best performance with the 109 speaker model and the sentence error rates are statistically indistinguishable.</Paragraph>
    <Paragraph position="14"> Some caution is required in comparing results of the 12 and 109 speaker models due to two, possibly important differences. The total amount of training speech used is different as is the number of different sentence texts contained in the training script. The 109 speaker model is trained on a total of 4360 utterances drawn from 2800 sentence texts.</Paragraph>
    <Paragraph position="15"> The 12 speaker model is trained on 7200 utterances drawn from only 600 sentence texts. While the additional speech may benefit the 12 speaker condition, the greater richness of the sentence texts may help the 109 speaker model. The effect of the additional sentence texts can be seen in the different numbers of triphone contexts observed in the two training scripts: 5000 triphones for 600 sentences vs. 7000 for the 2800-sentence script.</Paragraph>
    <Paragraph position="16"> We have observed that the forward-backward algorithm freely re-defines some of the phonemes to model peculiarities of a given speaker. If we constrain all speakers to a common alignment, the training procedure must make a compromise between these speaker-specific adjustments.</Paragraph>
    <Paragraph position="17"> Both forward-backward and triphone cooccurrence smoothing are arguably speaker-specific procedures -- they work best when the training distributions are generated by a single source. Some compromise must be made for SI recognition, where the training is not homogeneous and the test distribution is, by definition, different from the training. It appears, from these results, that the least damaging compromise may be to delay pooling of the data/models until the last possible stage in the processing.</Paragraph>
    <Paragraph position="18"> Such a simple SI paradigm has several attractive attributes. It makes the data collection effort easier. It is trivial to add new training speakers to the SI model; no re-training is required. Therefore the system can easily make use of any speakers who have already committed to giving enough speech to train a high-performance SD model. There is a large payoff for being one of the training speakers in this scenario -- highly accurate SD performance. In contrast, there is no benefit for being a training speaker for the 109 speaker model. Finally, by delaying the stage at which the data or model parameters are pooled, new opportunities arise to use speaker-specific modeling approaches such as the multiple-reference adaptation procedure described in the next section.</Paragraph>
  </Section>
  <Section position="5" start_page="307" end_page="308" type="metho">
    <SectionTitle>
3. Speaker Adaptation
</SectionTitle>
    <Paragraph position="0"> Adaptation from 109 Speakers As mentioned above, previous attempts to use large population SI corpora for speaker adaptation have met with little success. In \[3\], Lee tried to cluster over 100 training speakers into a small number of groups which were then trained independently. In recognition, the test speaker was first classified into one of the speaker groups, based on 1 known utterance, and decoded with the appropriate model.</Paragraph>
    <Paragraph position="1"> This approach failed to improve over the SI performance since it reduced the amount of training data available to each speaker-group-specific model. In another attempt, Lee devised an interpolated re-estimation procedure which combined the SI model with 4 other models derived from a small sample of known speech from the target speaker. Interpolation weights for the 5 models were computed from a deleted sample of the training data. The reduction in word error rate was less than 10%, however, when 30 utterances from the target speaker were used. The gain was small for this approach because only a small amount of new information, robustly estimated in the 4 speaker-specific models, was added to an already robust SI model.</Paragraph>
    <Paragraph position="2"> We have also attempted to use the same SI corpus of over 100 speakers for speaker adaptation as reported in \[2\]. In this work, we estimated a deterministic transformation on the speech parameters of each of the training speakers which projected them onto the feature space of a single prototypical training speaker. We then trained on all of the transformed speech as if it came from a single speaker. The target speaker was similarly projected onto the prototypical speaker and recognition proceeded using the prototypical model. This procedure reduced the word error rate by 10% compared to the SI result; a minor improvement for a significant increase in the complexity of the scenario. We believe that this method did no better because the feature transformation was not powerful enough to superimpose a pair of speakers without significant loss of information. This resulted in a prototypical model whose densities were not significantly sharper than the comparable SI model made from the original data.</Paragraph>
    <Paragraph position="3"> Adaptation from 12 Speakers Our experience with the 109 corpus led us to rethink our approach to speaker adaptation from multiple reference speakers. We already have a powerful speaker adaptation procedure which effectively transforms a single well-trained SD reference model into an adapted model of the target speaker \[1\]. The transformation is estimated from a small amount of adaptation data (40 utterances) given by the target speaker. The approach is powerful for two reasons: first, the estimate of the probabilistic spectral mapping between two speakers is robust and generalizes well to phonetic contexts not observed in the adaptation speech, and second, the transformation can be applied to the well-estimated, discriminating densities of the SD reference model without undue loss of detail.</Paragraph>
    <Paragraph position="4"> A natural extension of this approach to multiple references would be to combine the parameters of several SD models after they had been independently adapted to the same target speaker. We can assume from our 12 speaker SI experiments that the transformation will perform better if estimated independently between each speaker-pair in turn rather than from a pooled dataset, since the transformation is a speaker-pair-specific operation. We also know that we can successfully combine the multiple adapted models by averaging the model statistics.</Paragraph>
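    The multi-reference scheme above can be sketched as follows, again assuming discrete densities over a shared codebook: each reference model is pushed through its own speaker-pair-specific probabilistic spectral mapping, and the adapted models are then averaged. The representation of the mapping as a single row-stochastic matrix over codewords is a simplifying assumption, and the function names are hypothetical.

```python
import numpy as np

def adapt_reference(ref_dists, mapping):
    """Apply a probabilistic spectral mapping to an SD reference model.

    ref_dists: (S, K) per-state distributions of the reference model.
    mapping:   (K, K) row-stochastic matrix; mapping[j, k] estimates
               P(target codeword k | reference codeword j).
    """
    adapted = ref_dists @ mapping
    return adapted / adapted.sum(axis=1, keepdims=True)

def multi_reference_adapt(ref_models, mappings):
    """Adapt each reference model with its own speaker-pair-specific
    mapping, then combine the adapted models by averaging."""
    adapted = [adapt_reference(m, t) for m, t in zip(ref_models, mappings)]
    combined = np.mean(np.stack(adapted), axis=0)
    return combined / combined.sum(axis=1, keepdims=True)
```

Keeping one mapping per reference speaker, rather than one mapping from pooled data, is the point the paragraph makes: the transformation is a speaker-pair-specific operation.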
    <Section position="1" start_page="308" end_page="308" type="sub_section">
      <SectionTitle>
Results of Adaptation Experiments
</SectionTitle>
      <Paragraph position="0"> Table 2 shows results for development tests on the June '88 SD/SI test set and word-pair grammar. The test set consists of 12 speakers (7 males) and 25 utterances each.</Paragraph>
      <Paragraph position="1"> [Table 2 (not extracted): results on the June '88 test set with word-pair grammar.]</Paragraph>
      <Paragraph position="2"> Adapting from a single male reference speaker trained on 30 minutes of speech (600 utterances) gives a word error rate of 6.2%. The reference speaker in this case is LPN, from the designated RM2 database.</Paragraph>
      <Paragraph position="3"> In the second row, a small improvement is realized for increasing the reference speaker training to 2 hours (2400 utterances). We intend to make this comparison more reliable by using the three other speakers in the RM2 database as references.</Paragraph>
      <Paragraph position="4"> The third condition shows the result of combining models from 11 reference speakers after adapting them to the 12th speaker and jackknifing over all the reference speakers. The result is a significant improvement in both word and sentence error rates over the single reference performance.</Paragraph>
      <Paragraph position="5">  Speaker adaptation from a single reference speaker continues to be an economical solution for systems which are forced to retrain due to changes in channel, environmental conditions, or task domain. With only 40 utterances from the system users and 600 training utterances from the reference speaker, a speaker-adaptive system can be rapidly re-configured and deliver performance equal to the best current SI performance trained on 4000 utterances.</Paragraph>
      <Paragraph position="6"> We can also make a comparison between the multi-reference adapted result, tested on the June '88 SD/SI test set, and the 12 speaker SI result tested on the Feb. '89 SI test set, since roughly the same population of training speakers are used (except for the held-out one). The two test sets give the same performance when tested using the 109 speaker SI model. Comparing to the 12 speaker SI model, the 11 reference adapted model has reduced the word error by 45%. We are encouraged by this large improvement for a straightforward application of our basic speaker adaptation algorithm to multiple references. Individual speaker performance ranged from 0.6% to 7.7% error indicating that the multiple-reference model was very effective at eliminating the poorest outliers. Two speakers performed equal to or better than their SD models trained on 600 utterances.</Paragraph>
      <Paragraph position="7"> We intend to continue investigating the potential of speaker adaptation from multiple references. If we can continue to improve our adaptation algorithm, and understand what constitutes good reference speakers, it may be possible to bring our speaker-adaptive performance very close to our SD performance.</Paragraph>
      <Paragraph position="8"> Conclusions We have shown that it is possible to achieve near current state-of-the-art SI performance with a model trained from only 12 speakers. This result is possible due to two important changes to the usual SI training paradigm -- a large amount of speech is available from each training speaker and the data is not pooled before training.</Paragraph>
      <Paragraph position="9"> Having a large sample of data from each training speaker and keeping it separate allows us to train detailed, highly discriminating, densities in an SD model and make the most effective use of speaker-specific modeling techniques such as triphone cooccurrence smoothing and probabilistic spectral transformation.</Paragraph>
      <Paragraph position="10"> Furthermore, the new paradigm eases the burden of data collection for SI recognition and allows new training speakers to be added to the SI model with ease.</Paragraph>
      <Paragraph position="11"> Most importantly, the new SI corpus lends itself well to speaker adaptation. By combining multiple reference speaker models which have been independently transformed to the target speaker, we have cut the SI word error rate from 7.5% to 4.1% using only 40 utterances of adaptation speech.</Paragraph>
    </Section>
  </Section>
</Paper>