<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1010"> <Title>New Results with the Lincoln Tied-Mixture HMM CSR System 1</Title> <Section position="3" start_page="0" end_page="65" type="metho"> <SectionTitle> SEMIPHONES </SectionTitle> <Paragraph position="0"> One difficulty with the current triphone-based HMM systems with cross-word triphone models is that the number of triphones becomes very large (~60K triphones) when used in a large (20K word) vocabulary task \[11\]. This requires estimation of very large numbers of parameters and makes execution of the trainer and recognizer inefficient on practical hardware. We have previously proposed semiphones as a modeling unit because they significantly reduce the number of elemental phonetic models, by as much as an order of magnitude. (Semiphone models split each phone into left-context-dependent, context-independent (center), and right-context-dependent states \[11\]. Semiphones include triphones and &quot;classic&quot; diphones--which extend from the center of one phone to the center of the next--as special cases.) On the Resource Management (RM) task, they reduced the number of unique states by about a factor of 5 at the cost of a performance penalty of about 20% for the speaker-independent (SI) task and 30% for the speaker-dependent (SD) task.</Paragraph> <Paragraph position="1"> The initial semiphone system used 1 left state, 1 center state, and 1 right state (notation: 1-1-1) \[11\]. (In this notation, a triphone system is designated 0-x-0 and a classic diphone system x-0-y.) We have recently explored a number of other variations on the semiphone scheme, subject to the constraint of three states per phone. The performance of the 2-0-1 and 1-0-2 systems is shown in Table 1. The lower error rate of the 1-0-2 system suggests that, on average, anticipatory coarticulation is stronger than backward coarticulation. 
This agrees with an assertion by Ladefoged that English is dominantly an anticipatory coarticulation language \[6\].</Paragraph> <Paragraph position="2"> We have also tested a hybrid triphone-semiphone system. This hybrid used 1-0-2 semiphones for the cross-word models and triphones for the word-internal models. (50K of the above-mentioned 60K triphones were cross-word-context phones.) Its performance was the same as the 1-0-2 system.</Paragraph> <Paragraph position="3"> This suggests that the less detailed modeling of the word-boundary phones is the primary site where information is lost in the semiphone systems compared to the triphone systems.</Paragraph> <Paragraph position="4"> These results may be affected by the lack of richness in the RM database--there were 1752 word-internal (WI) semiphones and 2413 WI triphones, and therefore only 27% of the WI triphones were merged in transitioning to the semiphone models. Similarly, there were 1891 cross-word (XW) semiphones and 3580 XW triphones, and therefore 47% of the XW triphones were merged in the transition. Thus the transition to semiphones would be expected to affect the XW modeling more than the WI modeling. All of the XW semiphone systems, however, outperform the corresponding non-XW triphone systems.</Paragraph> <Paragraph position="5"> Attempts to improve semiphone results by smoothing the mixture weights with occurrence-based smoothing weights \[14\] proved unsuccessful. (This form of smoothing significantly improved the triphone system results \[11\].) 
This correlates with the reduced number of single-occurrence models in the semiphone system (1340 = 37% of the semiphones) compared to the triphone system (3094 = 52% of the triphones).</Paragraph> </Section> <Section position="4" start_page="65" end_page="65" type="metho"> <SectionTitle> IMPROVED DURATION MODELING </SectionTitle> <Paragraph position="0"> The standard HMM system suffers from the difficulty that an incorrect phone can minimize its scoring penalty by minimizing the dwell time of the path through its model.</Paragraph> <Paragraph position="1"> The current CSR uses three states per phone and can suffer from this problem for long-duration phones. Since there are no skip arcs within the phone model, a path can traverse a phone in 30 msec (3 time steps). Some phones are essentially never produced with this short a duration, and therefore an incorrect short segment matched to such a phone can have too high a score.</Paragraph> <Paragraph position="2"> One way to minimize this problem is to alter the phone model to increase the minimum path dwell time to a time commensurate with the minimum duration of the phone.</Paragraph> <Paragraph position="3"> Since this system does not adapt in any way to the speaking rate, the desired minimum would be the minimum duration at the fastest speaking speed. Since the available training data is not fast speech, a pragmatic estimate of the minimum might be the shortest observed duration times a safety factor. An additional difficulty in estimating the minimum duration is that some phones are observed only a few times in the training data, thereby making such an estimate less reliable.</Paragraph> <Paragraph position="4"> For this experiment, a much simpler estimate of the minimum duration was chosen. 
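The dwell-time arithmetic above can be sketched as follows. This is an illustrative reconstruction, not the Lincoln system's code: the 10 ms frame step follows from "30 msec (3 time steps)", while the duration threshold and function names are assumptions for the sketch. With no skip arcs, the minimum path through a phone spends one frame per state, and a state with self-loop ("stay") probability p has a geometric dwell distribution with mean 1/(1 - p) frames.

```python
import math

FRAME_STEP_MS = 10.0  # "30 msec (3 time steps)" implies a 10 ms frame step


def min_traversal_ms(num_states):
    """Minimum dwell time of a path through a phone model with no skip arcs:
    one frame per state."""
    return num_states * FRAME_STEP_MS


def avg_state_duration_ms(stay_prob):
    """Expected dwell time of one state: mean of a geometric distribution,
    1/(1 - p) frames."""
    return FRAME_STEP_MS / (1.0 - stay_prob)


def split_long_state(stay_prob, max_avg_ms):
    """Split a state whose average duration exceeds `max_avg_ms` (an assumed
    threshold) into a linear chain of n states sharing one observation pdf,
    preserving the chain's total average duration.
    Returns (n, per-state stay probability)."""
    total_ms = avg_state_duration_ms(stay_prob)
    n = max(1, math.ceil(total_ms / max_avg_ms))
    # n states each with stay prob q give a total average of n/(1 - q) frames;
    # matching 1/(1 - stay_prob) frames gives q = 1 - n * (1 - stay_prob).
    q = max(0.0, 1.0 - n * (1.0 - stay_prob))  # clamp: dwell >= 1 frame/state
    return n, q
```

For example, a state with stay probability 0.9 (100 ms average dwell) would be split into three states under an assumed 40 ms threshold, raising that phone's minimum traversal time by two frames.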
The system was trained normally with three states per phone, which has the dual advantages of maintaining a uniform phone topology to allow smoothing between different phone models and of not increasing the number of parameters to be estimated. Finally, states whose average duration (as computed from the stay transition probability) was above a constant were split into a linear sequence of states until each final state had an average duration below the constant. Each of the split states shared the same observation pdf--only the stay and move transition probabilities were altered on the split states. Since no skip transitions were allowed in the phone models, the minimum duration was proportional to the final number of states in the phone.</Paragraph> <Paragraph position="5"> This simple strengthening of the duration model improved the triphone system results by about 10% for both SI and SD systems (Table 2). This result is in agreement with a similar improvement obtained by adding minimum phone duration constraints to a large-vocabulary IWR \[8\]. The overall amount of computation was not significantly changed. Essentially all of the word error rate reduction was a result of reduced word insertion and deletion error rates.</Paragraph> </Section> <Section position="5" start_page="65" end_page="66" type="metho"> <SectionTitle> NEW TRAINING STRATEGY WITH IMPLICATIONS FOR ADAPTATION </SectionTitle> <Paragraph position="0"> A modified multi-speaker/speaker-independent training strategy was tested. The standard strategy used to date has been: 1. Monophone bootstrap 2. Train triphones (all parameters trained on all speakers) The new strategy is: 1. Monophone bootstrap (single set of Gaussians) 2. Train triphones (transition probabilities and mixture weights trained on all speakers, speaker-specific Gaussians) 3. 
(Optional) Fix transition probabilities and mixture weights and train a single set of Gaussians on all speakers. This new multi-speaker (MS)/SI strategy (without the option) in effect implements the theory that all persons speak alike except that each uses a different section of the acoustic space, perhaps due to differently sized and shaped vocal tracts.</Paragraph> <Paragraph position="1"> The new strategy without the option uses more data to train the mixture weights and might therefore, with the speaker-specific Gaussians, provide better SD recognition than the old method. It was significantly worse than the standard SD training for the RM1 database (12 speakers, Table 3), but slightly better for the RM2 database (4 speakers, Table 4). In both cases the new procedure was better than the SI-109 system.</Paragraph> <Paragraph position="2"> The new strategy with the option is a new method for training an MS or SI system. The mixture weights are again trained in the context of speaker-specific Gaussians, but then the weights are fixed and a single set of MS or SI Gaussians is trained. In all cases, the systems using SD Gaussians outperformed those using MS/SI Gaussians. On the RM1 database, the old training method outperformed the new method with the option for both the MS-12 and the SI-109 training conditions. Similarly, when training on the RM1 database and testing on the RM2 database, the old training method outperformed the new method with the option for the SI-12 and SI-109 training conditions. (The MS-12 models from RM1 become SI-12 when tested on the RM2 database because the RM2 database uses speakers which are not included in RM1.) The controls for this experiment (SI-109 and SI-12), when tested on the RM2 database, confirm BBN's result \[4\] that similar SI performance can be obtained by training on large amounts of data from a small number of speakers. The June 90 spontaneous training (774 sentences) and test data were used. 
Due to the limited amount of time available before the evaluation tests, no attempt was made to model the open vocabulary, disfluencies, partial words, thinking noises, and extraneous noises. Thus the SNOR transcriptions of the acoustic data were used for both training and testing. The lexicon (548 words) and a bigram back-off language model were generated from the training data, which produced a test-set perplexity of 23.8 with 1.3% out-of-vocabulary words.</Paragraph> <Paragraph position="3"> The first system was as described in the introduction except that the system used SI TM-2 non-cross-word triphone models and the improved duration modeling described above. Recognition was performed using the perplexity 23.8 bigram language model. The pilot tests were all SI trained with two observation streams. The closest RM system showed an SI-109 WPG word error rate of 10.4% \[11\]. After fixing some pruning difficulties in training due to the large silences in the training data, the system produced a word error rate of 37.5% (Table 5). Enabling optional inter-word silences in training reduced the pruning difficulties and improved the recognition performance to 33.3% (Table 5). (Optional inter-word silences during training had been tested on the RM task and found not to help the performance.) Finally, this system was tested using the perplexity 17.8 baseline language model and the error rate was reduced to 30.9% (Table 5).</Paragraph> </Section> <Section position="6" start_page="66" end_page="66" type="metho"> <SectionTitle> ATIS BASELINE DEVELOPMENT TESTS </SectionTitle> <Paragraph position="0"> When the baseline test definition became available, the best pilot system was trained on the baseline training data.</Paragraph> <Paragraph position="1"> The error rate improved to 26.4% (Table 6). 
The additional data, which consisted of read in-task sentences and read adaptation sentences, increased the number of training sentences by a factor of 6.5, but produced a surprisingly small performance improvement. Cross-word triphone modeling was added, which reduced the word error rate to 23.0%.</Paragraph> <Paragraph position="2"> (The error rate of the closest corresponding RM SI-109 WPG system is 8.5% \[11\].) Next, the third observation stream (second differential mel-cepstra) was added (TM-3), which increased the error rate to 25.3%. In contrast, a 30% error rate reduction on the SI RM task occurred when the third observation stream was added \[11\]. Finally, a TM-3 1-0-2 semiphone system yielded a 24.0% word error rate, which is between the results obtained with the TM-2 and TM-3 triphone systems.</Paragraph> </Section> </Paper>