<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1063"> <Title>HIGH-ACCURACY LARGE-VOCABULARY SPEECH RECOGNITION USING MIXTURE TYING AND CONSISTENCY MODELING</Title> <Section position="6" start_page="315" end_page="316" type="evalu"> <SectionTitle> 4. EXPERIMENTAL RESULTS </SectionTitle> <Paragraph position="0"> We evaluated the algorithms described in this paper on the 5,000- and 64,000-word recognition tasks of the WSJ corpus. We used the progressive-search framework [14] for fast experimentation.</Paragraph> <Paragraph position="1"> With this approach, an initial fast recognition pass creates word lattices for all sentences in the development set. These word lattices are then used to constrain the search space in all subsequent experiments. In our development we used both the WSJ0 5,000-word and the WSJ1 64,000-word portions of the database, and the baseline bigram and trigram language models provided by Lincoln Laboratory.</Paragraph> <Section position="1" start_page="315" end_page="316" type="sub_section"> <SectionTitle> 4.1. Degree of Mixture Tying </SectionTitle> <Paragraph position="0"> To determine the effect of mixture tying on recognition performance, we evaluated a number of different systems on both WSJ0 and WSJ1. Table 3 compares the performance and the number of free parameters of tied mixtures, phonetically tied mixtures (PTM), and genonic mixtures on a development set that consists of 18 male speakers and 360 sentences of the 5,000-word WSJ0 task. The training data for this experiment included 3,500 sentences from 42 speakers. We can see that systems with a smaller degree of tying outperform the conventional tied mixtures by 25%, and at the same time have a smaller number of free parameters because of the reduction in the codebook size.</Paragraph> <Paragraph position="1"> The difference in recognition performance between PTM and genonic HMMs with smaller tying is, however, much more dramatic in the WSJ1 portion of the database. 
The training data consisted of 37,000 sentences from 280 speakers, and gender-dependent models were built. The male subset of the 20,000-word November 1992 evaluation set was used, with a bigram language model. Table 4 compares various degrees of tying by varying the number of genones used in the system. We can see that, because of the larger amount of available training data, the improvement in performance of genonic systems over PTM systems is much larger (20%) than in our 5,000-word experiments. Moreover, the best performance is achieved for a larger number of genones: 1,700 instead of the 495 used in the 5,000-word experiments. These results are depicted in Figure 1.</Paragraph> <Paragraph position="2"> In Table 5 we explore the additional degree of freedom that genonic HMMs have over fully continuous HMMs, namely that states mapped to the same genone can have different mixture weights. We can see that tying the mixture weights in addition to the Gaussians introduces a significant degradation in recognition performance. This degradation increases when the features are modeled using multiple observation streams (see following section) and as the amount of training data and the number of genones decrease.</Paragraph> <Paragraph position="3"> 4.2. Multiple vs. Single Observation Streams Another traditional difference between fully continuous and tied mixture systems is the independence assumption of the latter when modeling multiple speech features. Tied mixture systems typically model static and dynamic spectral and energy features as conditionally independent observation streams given the HMM state, because tied mixture systems provide a very coarse representation of the acoustic space. 
It is, therefore, necessary to &quot;quantize&quot; each feature separately and artificially increase the resolution by modeling the features as independent: the number of &quot;bins&quot; of the augmented feature is equal to the product of the number of &quot;bins&quot; of all individual features. The disadvantage is, of course, the independence assumption. When, however, the degree of tying is smaller, the finer representation of the acoustic space makes it unnecessary to artificially improve the resolution by modeling the features as independent. Hence, for systems that are loosely tied we can remove the feature-independence assumption. This claim is verified experimentally in Table 6. The first row shows the recognition performance of a system that models the six static and dynamic spectral and energy features used in DECIPHER™ as independent observation streams. The second row shows the performance of a system that models the six features in a single stream. We can see that the performance of the two systems is similar.</Paragraph> </Section> <Section position="2" start_page="316" end_page="316" type="sub_section"> <SectionTitle> 4.3. Linear Discriminant Features </SectionTitle> <Paragraph position="0"> To capture local time correlation we used a linear discriminant feature extracted using a transformation of the features within a window around the current frame. The discriminant transformation was obtained using linear discriminant analysis with classes defined as the HMM state of the context-independent phone. The state index that was assigned to the frame was determined using the maximum a posteriori criterion and the forward-backward algorithm.</Paragraph> <Paragraph position="1"> We found that the performance of the linear discriminant feature was similar to that of the original features. However, we found that an improvement in performance can be obtained if the discriminant features are used in parallel with the original features. 
A genonic HMM system with 1,700 genones and linear discriminants as an additional feature was evaluated on the 20,000-word open-vocabulary November 1993 ARPA evaluation set. It achieved word-error rates of 16.5% and 14.5% with the standard bigram and trigram language models, respectively. These results, however, were contaminated by the presence of a large DC offset in most of the waveforms of the phase 1 WSJ1 corpus. We later removed the DC offset from the waveforms, and reestimated the models using the exact procedure followed during the development of the system used in the November 1993 evaluation. From Table 6, we can see that the linear discriminant feature reduced the error rate on the WSJ1 20,000-word open-vocabulary male development set by approximately 7% using either a bigram or a trigram language model. Table 4 presents the results of the system with linear discriminants on various test and development sets.</Paragraph> </Section> </Section> </Paper>