<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1079"> <Title>Large Vocabulary Recognition of Wall Street Journal Sentences at Dragon Systems</Title> <Section position="3" start_page="0" end_page="388" type="metho"> <SectionTitle> 2. OVERVIEW OF DRAGON TRAINING AND RECOGNITION </SectionTitle> <Paragraph position="0"> The continuous speech recognition system developed by Dragon Systems was presented at the June 1990 DARPA SLS meeting (\[5\], \[6\], \[11\]) and at the February 1991 DARPA SLS meeting (\[4\]). The version presented in this paper is speaker-dependent, and was demonstrated to be capable of near real-time performance on a 1000word task when running on a 486-based PC. When running live, a TMS320C25-based board performs the signal processing and the speech is sampled at 12ktIz. In the experiments reported in this paper, the speech was sampled at 16kiiz, the speech waveforms having been supplied in a standard format by NIST.</Paragraph> <Paragraph position="1"> An important contribution to our improved performance in the last year was our switch to 32 signal processing parameters (consisting of our eight original spectral parameters together with 12 cepstral parameters and their estimated time derivatives). The cepstral parameters were computed via an inverse Fourier transform of the log magnitude spectrum. At recognition time, the parameters are computed every 20 ms, while for purposes of training, 10 ms data was used.</Paragraph> <Paragraph position="2"> The recognition algorithm relies on frame-synchronous dynamic programming (an implementation of the forward pass of the Baum-Welch algorithm) to extend sentence hypotheses subject to the elimination of poor paths by beam pruning. In addition, the Continuous Speech Recognizer uses the DARPA-mandated digram language model (\[15\]), which is a modification of the backoff algorithm from \[13\]. The rapid matcher, as described in \[11\], is another important component of the system. For any frame, it limits the number of word candidates that can be hypothesized as starting at that frame. For purposes of this paper, which is primarily concerned with the quality of our modeling, most of the rapid match errors have been eliminated by passing through long lists of words for the detailed match to consider, at the cost of considerable additional computation. Similarly, most of the pruning errors have been eliminated by running with a high threshold. A companion paper \[10\], that appears in this volume, describes a new strategy for training the rapid match models directly from the IIidden Markov Models specified by the PICs. This new strategy shows promise for reducing the average length of the rapid match list that must be returned at any given time, and thus, speeding up the recognizer.</Paragraph> <Paragraph position="3"> In the experiments described below, models were trained for each of the 12 speaker-dependent Wall Street Journal speakers, using the approximately 600 training sentences (300 with verbalized punctuation and 300 without). 
<Paragraph position="2"> The recognition algorithm relies on frame-synchronous dynamic programming (an implementation of the forward pass of the Baum-Welch algorithm) to extend sentence hypotheses subject to the elimination of poor paths by beam pruning. In addition, the Continuous Speech Recognizer uses the DARPA-mandated digram language model ([15]), which is a modification of the backoff algorithm from [13]. The rapid matcher, as described in [11], is another important component of the system. For any frame, it limits the number of word candidates that can be hypothesized as starting at that frame. For purposes of this paper, which is primarily concerned with the quality of our modeling, most of the rapid match errors have been eliminated by passing through long lists of words for the detailed match to consider, at the cost of considerable additional computation. Similarly, most of the pruning errors have been eliminated by running with a high threshold. A companion paper [10], which appears in this volume, describes a new strategy for training the rapid match models directly from the Hidden Markov Models specified by the PICs. This new strategy shows promise for reducing the average length of the rapid match list that must be returned at any given time, and thus for speeding up the recognizer.</Paragraph>
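As a rough illustration of the search just described, the sketch below runs a frame-synchronous pass with beam pruning. The state representation, scoring callbacks, and beam width are assumptions for illustration only; note also that it keeps the single best score per state (a Viterbi-style approximation), whereas a true Baum-Welch forward pass would log-add the scores of paths meeting in the same state.

```python
# Hedged sketch of frame-synchronous dynamic programming with beam pruning.
# This shows the general shape of such a search, not Dragon's implementation.
import math

def frame_synchronous_search(frames, successors, output_logprob, beam=200.0):
    """successors(state) -> list of (next_state, log transition prob);
    output_logprob(state, frame) -> log output probability of the frame."""
    active = {"<start>": 0.0}                 # active hypotheses -> log score
    for frame in frames:
        extended = {}
        for state, score in active.items():
            for nxt, trans_lp in successors(state):
                lp = score + trans_lp + output_logprob(nxt, frame)
                # keep the best score per state (Viterbi approximation)
                if lp > extended.get(nxt, -math.inf):
                    extended[nxt] = lp
        best = max(extended.values())
        # beam pruning: discard hypotheses far below the best current path
        active = {s: lp for s, lp in extended.items() if lp >= best - beam}
    return active
```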
<Paragraph position="3"> In the experiments described below, models were trained for each of the 12 speaker-dependent Wall Street Journal speakers, using the approximately 600 training sentences (300 with verbalized punctuation and 300 without). Testing was done using the approximately 40 recorded sentences (per speaker) available as the 5000-word closed-vocabulary verbalized punctuation development test set.</Paragraph> <Paragraph position="4"> In order to incorporate context information at the phoneme level, triphone structures were constructed that include information about the immediate phonetic environment that affects a phoneme's acoustic character.</Paragraph> <Paragraph position="5"> These augmented triphones, called &quot;PICs&quot;, are the fundamental unit of the system, and are closely related to other approaches that have appeared in the literature ([16] and [14]). The information that the PICs currently contain is the identity of the preceding and succeeding phonemes, and, optionally, an estimate of the degree of the phoneme's prepausal lengthening. Each PIC is represented acoustically by a sequence of nodes. Each node is taken to have an output distribution specified by a PEL, and a duration distribution. PIC models representing the same phoneme may share PELs, but PELs can never be shared across phonemes. The parametric family used for modeling the probability distributions of the durations as well as of the individual acoustic parameters is assumed to have the double exponential form</Paragraph> <Paragraph position="6"> f(x) = \frac{1}{2\alpha} \exp\left( -\frac{|x - \mu|}{\alpha} \right) </Paragraph> <Paragraph position="7"> where \mu is the mean and \alpha is the mean absolute deviation.</Paragraph> <Paragraph position="8"> A detailed description of the original models for PICs and how they were formerly trained can be found in [6]. The following sections explain how a variety of modifications have been made to the original PIC training algorithm.</Paragraph> <Paragraph position="9"> The English phoneme alphabet used by the system includes 26 consonants (including the syllabic consonants /L/, /M/, and /N/) and three levels of stress for each of 17 vowels, constituting a total of 77 phonemes. Approximately 10% of the lexical entries for the 5000-word WSJ task have multiple pronunciations, because of stress differences in the vowels and expected pronunciation variations.</Paragraph> <Paragraph position="10"> Of course, the number of possible PICs that can appear in hypotheses at recognition time (including cross-word PICs) is vast compared to the number of PICs that typically appear in 600 sentences of Wall Street Journal training data. This paper reports results when around 35,000 PICs are built for the rePEL/respell models and when around 14,000 PICs are built for the tied mixture models. When the recognizer asks for a model for a PIC that has not been built, a backoff strategy is invoked which supplies a model for a related PIC instead.</Paragraph> </Section> <Section position="4" start_page="388" end_page="388" type="metho"> <SectionTitle> 3. REPELING/RESPELLING </SectionTitle> <Paragraph position="0"> In earlier reports [6], [7], we described a straightforward procedure that generated speaker-dependent models via several passes of adaptation of the reference speaker's models. The adaptation process modified the PEL probability distributions and the PIC-dependent duration distributions. However, no new PELs were created, nor was the PEL sequence for a given PIC allowed to change.</Paragraph> <Paragraph position="1"> The sharing of PELs by different PICs was determined by the acoustics of the reference speaker's speech, and was assumed to generalize to other speakers.</Paragraph> <Paragraph position="2"> At the last SLS meeting in February 1991 [4], we reported on a method for choosing the sequence of PELs for a PIC in a speaker-dependent fashion, essentially in the same manner as had been done for the reference speaker. This step could be performed once the original PELs had been adapted using the reference speaker's PIC spellings. To the extent that differences in PEL sequences for a given PIC can reflect different choices of allophones, this extra step can capture allophonic variation among different speakers, and it lifts the restriction that the sharing of PELs be the same for all speakers. This change produced a significant improvement in performance.</Paragraph> <Paragraph position="3"> In order to take full advantage of our new, more informative signal processing parameters, however, a further change was required. We needed to construct a new set of PELs to serve as the class of output distributions for the HMMs to be constructed. It was not adequate to simply extend, by adaptation, the 8 parameter PELs we had been working with to 32 parameter PELs, as this would prevent us from making distinctions that could not even be seen with the old signal processing.</Paragraph> <Paragraph position="4"> In the previous reports [6] and [4], we described how a set of PELs for the reference speaker was initially hand-constructed while running an interactive program for &quot;labeling&quot; spectrograms of the reference speaker's speech. We needed to be able to construct a new set of PELs automatically; thus, we implemented a k-means clustering algorithm whose purpose was to create a new set of (32 parameter) PELs for each speaker whose models were to be trained. For each phoneme, this step involved clustering the frames in the &quot;spectral models&quot; for all of the PICs to be constructed for that phoneme. A spectral model for a PIC is obtained by performing linear stretching and shrinking operations on PIC tokens (examples of the given PIC and of related PICs, available from a prior segmentation of the training data, based on the best models then available) and then averaging the resulting transformed tokens (which have a common length), to obtain a kind of &quot;expected&quot; PIC token.</Paragraph>
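The sketch below illustrates the two operations just described: linearly time-normalizing PIC tokens into a spectral model, and k-means clustering of spectral-model frames into PELs. The target length, initialization, iteration count, and Euclidean distance are all illustrative assumptions, not details taken from the Dragon implementation.

```python
# Hedged sketch of spectral-model construction and PEL clustering.
import numpy as np

def spectral_model(tokens, target_len=10):
    """Linearly stretch/shrink each token (frames x 32 params) to a common
    length, then average to get an "expected" PIC token."""
    warped = []
    for tok in tokens:
        idx = np.linspace(0, len(tok) - 1, target_len)
        # linear interpolation along the time axis, parameter by parameter
        warped.append(np.array([np.interp(idx, np.arange(len(tok)), tok[:, p])
                                for p in range(tok.shape[1])]).T)
    return np.mean(warped, axis=0)

def cluster_pels(frames, n_pels=63, n_iter=20, seed=0):
    """k-means over spectral-model frames for one phoneme; each centroid
    seeds one PEL (at most 63 per phoneme, as in the paper)."""
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), n_pels, replace=False)]
    for _ in range(n_iter):
        # assign each frame to its nearest centroid, then recompute means
        labels = np.argmin(((frames[:, None] - centroids) ** 2).sum(-1), axis=1)
        for k in range(n_pels):
            if (labels == k).any():
                centroids[k] = frames[labels == k].mean(axis=0)
    return centroids, labels
```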
<Paragraph position="5"> The primary motivation behind the rePELing step was to make it likely that each spectral frame would have at least one PEL that matched it fairly well. As each of the 77 phonemes was limited to having only 63 PELs available for building PICs, about 4500 PELs were created per speaker.</Paragraph> <Paragraph position="6"> Once the new set of PELs had been created, a dynamic programming algorithm was used for converting the spectral model to an HMM containing up to six nodes, with each node assigned a PEL and a duration distribution. This respelling step drew on about 4000 of the 4500 PELs in constructing the HMMs.</Paragraph> <Paragraph position="7"> A summary of the overall training procedure is outlined below, with rePELing and respelling appearing as steps 4 and 5: 1. Six passes of adaptation were run on each speaker's training data, starting with the reference speaker's models, using the old 8 parameter signal processing.</Paragraph> <Paragraph position="8"> 2. Segmentation of each speaker's data was performed, using the best available models (originally, those produced in step 1).</Paragraph> <Paragraph position="9"> 3. Spectral models were built for each PIC, using all 32 parameters, based on the segmentation in step 2.</Paragraph> <Paragraph position="10"> 4. RePELing was done for each speaker in order to generate a speaker-dependent set of output distributions. 5. For each speaker, respelling was performed to determine the PEL sequences that would be used in the resulting HMMs.</Paragraph> <Paragraph position="11"> 6. For each speaker, one additional pass of adaptation was performed in order to better estimate the mean absolute deviations for each parameter for each PEL.</Paragraph> <Paragraph position="12"> 7. Steps 2-6 could then be repeated, if desired.</Paragraph> <Paragraph position="13"> Results for this method appear in section 5.</Paragraph> </Section> <Section position="5" start_page="388" end_page="390" type="metho"> <SectionTitle> 4. TIED MIXTURES </SectionTitle> <Paragraph position="0"> Were the model described in section 3 correct, the 32 parameters in each acoustic frame corresponding to a given PEL would be distributed as if they were generated by 32 independent (unimodal) double exponential distributions. However, graphical displays reveal that the frame distributions for many PELs have multiple modes. Furthermore, it is well known that the parameters within a frame are correlated. In order to deal with the multimodality of the data and to capture the dependence among parameters, Dragon has implemented a modeling strategy in which the output distributions are represented in a more flexible way. This representation, similar to other tied mixture models developed elsewhere ([8], [12]), also provides the basis for achieving speaker independence.</Paragraph> <Paragraph position="1"> If we divide the parameters into groups or &quot;streams&quot;, with the property that parameters in different streams can be assumed to be independent, then our new modeling strategy represents the probability of a frame in a given state as the product of probability densities for each stream, and the probability density for a stream is assumed to be a mixture distribution over a fixed set of basis distributions specific to the stream.</Paragraph> <Paragraph position="2"> More formally, we let f(x) represent the probability density of a PEL, where x is a frame, and we assume that f(x) is the product of s probability densities, f_i(x_i), one for each stream:</Paragraph> <Paragraph position="3"> f(x) = \prod_{i=1}^{s} f_i(x_i) </Paragraph> <Paragraph position="4"> Furthermore, we assume that each f_i can be represented in terms of a set of basis distributions g_{ij}:</Paragraph> <Paragraph position="5"> f_i(x_i) = \sum_{j=1}^{C_i} a_{ij} g_{ij}(x_i) </Paragraph> <Paragraph position="6"> where C_i is the number of components for stream i, and the mixing probabilities a_{ij} for each stream sum to one.</Paragraph> <Paragraph position="7"> At the present time, we are using 32 streams; i.e., each parameter is assumed to be statistically independent of every other parameter in a given state. We have assumed the 32 parameters to be independent both as a way of relating our new results to our old results (which were also based on the same strong independence assumption), and as a debugging tool. We chose our basis distributions to be equally spaced double exponential distributions with a fixed mean absolute deviation, arranged so as to cover the full range of each parameter. Thus, when a mixture distribution was estimated, it was easy to see what values in the space were relatively likely or unlikely. In the system reported here, the set of basis components is the same for each stream, which would not be the case in a more general setting.</Paragraph>
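A minimal sketch of this output density follows, with double exponential basis distributions on a fixed, equally spaced grid shared by all streams. Only the functional form follows the equations above; the grid range, component count, and mean absolute deviation are illustrative assumptions.

```python
# Hedged sketch of the tied mixture output density: each stream's density is
# a mixture over fixed, equally spaced double exponential basis distributions.
import numpy as np

def laplace_pdf(x, mu, alpha):
    """Double exponential density with mean mu and mean absolute deviation alpha."""
    return np.exp(-np.abs(x - mu) / alpha) / (2.0 * alpha)

# one shared grid of basis centers (same for each stream, as in the paper);
# the range, count, and MAD below are assumed values
CENTERS = np.linspace(-10.0, 10.0, 50)
ALPHA = CENTERS[1] - CENTERS[0]          # fixed MAD, tied to grid spacing

def stream_density(x_i, mix_weights):
    """f_i(x_i) = sum_j a_ij g_ij(x_i) over the fixed basis components."""
    return np.dot(mix_weights, laplace_pdf(x_i, CENTERS, ALPHA))

def frame_log_density(frame, mix_weights_per_stream):
    """log f(x) = sum_i log f_i(x_i), with 32 independent streams."""
    return sum(np.log(stream_density(x_i, w) + 1e-300)
               for x_i, w in zip(frame, mix_weights_per_stream))
```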
<Paragraph position="9"> The tied mixture PIC models were assumed to be either 1-node or 2-node models, with the number of nodes being determined based on the proportion of very short PIC tokens. At the present time, no PEL is used as an output distribution for more than one node. Each tied mixture PIC model was built via the EM algorithm from instances of the given PIC found in the training data for the given speaker (based on segmentations obtained using the best available models). Unfortunately, most of the PICs that occur in the training data occur very few times, and, not surprisingly, most of the PICs that could in principle occur never in fact do.</Paragraph> <Paragraph position="10"> Thus, two key problems that must be solved in training the recognizer are (1) the smoothing problem and (2) the backoff problem. The maximum likelihood estimator (MLE), together with many related asymptotically efficient estimators, has the defect of being a rather poor estimator when it is given only a small amount of data to work with: think of estimating the probability of &quot;heads&quot; from only one coin flip. Thus, it is important to smooth the MLE when there is clearly an insufficient supply of data. We have chosen to implement a smoothing algorithm with a strong Bayesian flavor. In this paper we will not address the backoff problem in any detail; at the present time, when we do not have a model for a PIC available to the recognizer, we substitute a &quot;generic&quot; PIC model, which has less specific context information.</Paragraph> <Paragraph position="11"> The Bayesian solution to the coin flip problem amounts to representing the prior information we may have about the probability of &quot;heads&quot; as a prior number of flips, of which a certain number are taken to be heads, and then combining those &quot;prior&quot; flips with the real flips. We have taken a similar approach to the problem of estimating the mixing probabilities in our tied mixture models. We build the more common PICs before we build the less common PICs (see below). At the time that we are ready to build a given PIC, we make our best judgement as to what the mixing probabilities are for each stream of each state in the PIC. This guess is based on the models that have already been built for related PICs. Not only do we guess the mixing probabilities, but we also make a judgement about the &quot;relevance&quot; of our estimate, which is to say, the number of frames of real data that we judge our guess to be worth. We then use these prior estimates to initialize the EM algorithm, and in addition, we combine the accumulated fractional counts for each mixture component with the prior counts based on our prior guess, in forming the estimate to be used during the next iteration. Thus we have as our re-estimation formula:</Paragraph> <Paragraph position="12"> \hat{a}_{ij} = \frac{n_{ij} + k A_{ij}}{k + \sum_{j'} n_{ij'}} </Paragraph> <Paragraph position="13"> where A_{ij} is the a priori estimate based on the PICs that have already been built, k is the relevance of this estimate, and n_{ij} is the accumulated fractional count for the jth component when estimating the distribution for the ith parameter in a given node.</Paragraph>
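As a concrete reading of the re-estimation formula, the sketch below runs EM for one stream of one node, blending the accumulated fractional counts with prior counts worth k frames. It reuses the double exponential basis grid assumed in the previous sketch; this is an illustration of the formula, not Dragon's training code.

```python
# Hedged sketch of EM re-estimation of mixing probabilities with a Bayesian
# prior: prior_mix plays the role of A_ij, k is the relevance (prior frames).
import numpy as np

def reestimate_mixture(frames, prior_mix, k, centers, alpha, n_iter=10):
    """frames: 1-D data for one stream of one node; prior_mix: A_ij taken
    from related, already-built PICs; returns smoothed mixing probabilities."""
    mix = prior_mix.copy()                       # initialize EM at the prior
    for _ in range(n_iter):
        # E-step: fractional assignment of each frame to each basis component
        lik = mix * np.exp(-np.abs(frames[:, None] - centers) / alpha) \
              / (2.0 * alpha) + 1e-300
        resp = lik / lik.sum(axis=1, keepdims=True)
        n = resp.sum(axis=0)                     # fractional counts n_ij
        # M-step with smoothing: blend real counts with prior counts k * A_ij
        mix = (n + k * prior_mix) / (n.sum() + k)
    return mix
```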
<Paragraph position="14"> PICs are currently built in a prescribed order in our system: we build those for which there is the most data first. Thus, we begin by building the doubly-sided generic PICs, i.e., models for phonemes averaged over all left and right contexts. Then we move on to build singly-sided generic PICs, i.e., models for phonemes where the context is specified only on the right or on the left; we use the doubly generic PIC models to smooth the models for the singly generic ones. Finally, we build our fully contextual PICs, but again we build the most common ones first, using the doubly and singly generic PICs to smooth the fully contextual ones. When building a relatively uncommon fully contextual PIC, it is useful to smooth the model using models of related fully contextual PICs which share some of the context or have closely related contexts.</Paragraph> </Section> </Paper>