<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1016"> <Title>An Overview of the SPHINX-II Speech Recognition System</Title> <Section position="4" start_page="81" end_page="82" type="metho"> <SectionTitle> 3. DETAILED MODELING THROUGH PARAMETER SHARING </SectionTitle>
<Paragraph position="0"> We need to model a wide range of acoustic-phonetic phenomena, but this requires a large amount of training data. Since the amount of available training data will always be finite, one of the central issues is how to achieve the most detailed modeling possible by means of parameter sharing.</Paragraph>
<Paragraph position="1"> Our successful examples include SCHMMs and senones.</Paragraph>
<Section position="1" start_page="81" end_page="82" type="sub_section"> <SectionTitle> 3.1. Semi-Continuous HMMs </SectionTitle>
<Paragraph position="0"> The semi-continuous hidden Markov model (SCHMM) \[12\] has provided us with an excellent tool for achieving detailed modeling through parameter sharing. Intuitively, from the continuous mixture HMM point of view, SCHMMs employ a shared mixture of continuous output probability densities for each individual HMM. Shared mixtures substantially reduce the number of free parameters and the computational complexity in comparison with the continuous mixture HMM, while reasonably maintaining its modeling power. From the discrete HMM point of view, SCHMMs integrate quantization accuracy into the HMM and robustly estimate the discrete output probabilities by considering multiple codeword candidates in the VQ procedure. The SCHMM jointly optimizes the VQ codebook and the HMM parameters under a unified probabilistic framework \[13\], where each VQ codeword is regarded as a continuous probability density function.</Paragraph>
<Paragraph position="1"> For the SCHMM, an appropriate acoustic representation for the diagonal Gaussian density functions is crucial to recognition accuracy \[13\]. We first performed exploratory semi-continuous experiments on our three-codebook system. The SCHMM was extended to accommodate a multiple-feature front-end \[13\]. All codebook means and covariance matrices were reestimated together with the HMM parameters, except the power covariance matrices, which were fixed. When three codebooks were used, the diagonal SCHMM reduced the error rate of the discrete HMM by 10-15% on the RM task \[16\]. When we used our improved 4-codebook MFCC front-end, the error rate reduction was more than 20% over the discrete HMM.</Paragraph>
<Paragraph position="2"> Another advantage of the SCHMM is that it requires less training data than the discrete HMM. Therefore, given the current limitations on the size of the training data set, more detailed models can be employed to improve recognition accuracy. One way to increase the number of parameters is to use speaker-clustered models. Thanks to the smoothing abilities of the SCHMM, we were able to train multiple sets of models for different speakers. We investigated automatic speaker clustering as well as explicit male, female, and generic models. By using sex-dependent models with the SCHMM, the error rate was further reduced by 10% on the WSJ task.</Paragraph> </Section>
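To make the shared-mixture idea concrete, here is a minimal sketch (not SPHINX-II code) of evaluating a semi-continuous output probability: one diagonal-Gaussian codebook is shared by all HMM states, each state stores only discrete codeword weights, and only the few best-matching codewords are mixed, matching the multiple-codeword-candidate view above. All names and array shapes are illustrative assumptions.

import numpy as np

def schmm_output_prob(x, codebook_means, codebook_vars, state_weights, top_n=4):
    """Semi-continuous output probability for one HMM state (illustrative sketch).

    codebook_means, codebook_vars : (K, D) shared diagonal-Gaussian codebook
    state_weights                 : (K,) discrete codeword weights of this state
    Only the top_n best-matching codewords are mixed.
    """
    # Diagonal-Gaussian log densities of frame x under every shared codeword.
    diff = x - codebook_means
    log_dens = -0.5 * np.sum(diff * diff / codebook_vars
                             + np.log(2.0 * np.pi * codebook_vars), axis=1)
    # Keep only the top_n codeword candidates, as in the SCHMM VQ step.
    top = np.argsort(log_dens)[-top_n:]
    # Mix the selected shared densities with this state's discrete weights.
    return float(np.sum(state_weights[top] * np.exp(log_dens[top])))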
<Section position="2" start_page="82" end_page="82" type="sub_section"> <SectionTitle> 3.2. Senones </SectionTitle>
<Paragraph position="0"> To share parameters among different word models, context-dependent subword models have been used successfully in many state-of-the-art speech recognition systems \[26, 21, 17\].</Paragraph>
<Paragraph position="1"> The principle of parameter sharing can also be extended to subphonetic models \[19, 18\]. We treat the state in phonetic hidden Markov models as the basic subphonetic unit, the senone. Senones are constructed by clustering the state-dependent output distributions across different phonetic models. The total number of senones can be determined by clustering all the triphone HMM states, as in the shared-distribution models \[18\]. States of different phonetic models may thus be tied to the same senone if they are close according to the distance measure. Under the senonic modeling framework, we could also use a senonic decision tree to predict unseen triphones. This is particularly important for vocabulary independence \[10\], as we need to find subword models that are detailed, consistent, trainable, and, especially, generalizable.</Paragraph>
<Paragraph position="2"> Recently we developed a new senonic decision tree to predict the subword units not covered in the training set \[18\].</Paragraph>
<Paragraph position="3"> The decision tree classifies senones by asking questions in a hierarchical manner \[7\]. These questions were first created using speech knowledge from human experts. The tree was automatically constructed by searching for simple as well as composite questions. Finally, the tree was pruned using cross validation. When the algorithm terminated, the leaf nodes of the tree represented the senones to be used. For the WSJ task, our overall senone models gave us a 35% error reduction in comparison with the baseline SPHINX results.</Paragraph>
<Paragraph position="4"> The advantages of senones include not only better parameter sharing but also improved pronunciation optimization.</Paragraph>
<Paragraph position="5"> Clustering at the granularity of the state rather than the entire model (as in generalized triphones \[21\]) can keep the dissimilar states of two models apart while the other corresponding states are merged, and thus leads to better parameter sharing. In addition, senones give us the freedom to use a larger number of states for each phonetic model, providing more detailed modeling. Although an increase in the number of states increases the total number of free parameters, with senone sharing redundant states can be clustered while others are uniquely maintained.</Paragraph>
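To make the state-level clustering concrete, the sketch below ties triphone-state output distributions into a fixed number of senones by greedy bottom-up merging. It is only an illustration: the entropy-style merge cost and the greedy procedure are assumptions, not necessarily the distance measure used in the shared-distribution work.

import numpy as np

def cluster_states_into_senones(state_distributions, n_senones):
    """Greedily merge discrete state output distributions into n_senones clusters.

    state_distributions : (N, K) array, one discrete output distribution per
                          triphone HMM state (states at the same position).
    Returns an array mapping each original state to its senone index.
    """
    clusters = [[i] for i in range(len(state_distributions))]
    merged = [d.astype(float).copy() for d in state_distributions]
    counts = [1.0] * len(state_distributions)

    def entropy(p):
        return -float(np.sum(p * np.log(p + 1e-12)))

    def merge_cost(a, b):
        # Increase in weighted entropy (information loss) if clusters a and b are pooled.
        pooled = (counts[a] * merged[a] + counts[b] * merged[b]) / (counts[a] + counts[b])
        return ((counts[a] + counts[b]) * entropy(pooled)
                - counts[a] * entropy(merged[a]) - counts[b] * entropy(merged[b]))

    while len(clusters) > n_senones:
        # Merge the pair of clusters that loses the least information.
        _, a, b = min((merge_cost(a, b), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        counts[a], merged[a] = counts[a] + counts[b], \
            (counts[a] * merged[a] + counts[b] * merged[b]) / (counts[a] + counts[b])
        clusters[a] += clusters[b]
        del clusters[b], merged[b], counts[b]

    senone_of_state = np.empty(len(state_distributions), dtype=int)
    for senone_index, members in enumerate(clusters):
        senone_of_state[members] = senone_index
    return senone_of_state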
<Paragraph position="6"> Pronunciation Optimization. Here we use the forward-backward algorithm to iteratively optimize a senone sequence appropriate for modeling multiple utterances of a word. To explore the idea, given the multiple examples, we train a word HMM whose number of states is proportional to the average duration. When the Baum-Welch reestimation reaches its optimum, each estimated state is quantized with the senone codebook; the closest senone is used to label each state of the word HMM. This sequence of senones becomes the senonic baseform of the word. Arbitrary sequences of senones are allowed here, which provides the flexibility needed for automatically learned pronunciations. When the senone sequence of every word has been determined, the parameters (senones) may be re-trained. Although each word model generally has more states than the traditional phoneme-concatenated word model, the number of parameters remains the same since the size of the senone codebook is unchanged. When senones were used for pronunciation optimization in a preliminary experiment, we achieved a 10-15% error reduction in a speaker-independent continuous spelling task \[19\].</Paragraph> </Section> </Section>
<Section position="5" start_page="82" end_page="83" type="metho"> <SectionTitle> 4. MULTI-PASS SEARCH </SectionTitle>
<Paragraph position="0"> Recent work on search algorithms for continuous speech recognition has focused on problems related to large vocabularies, long-distance language models, and detailed acoustic modeling. A variety of approaches based on Viterbi beam search \[28, 24\] or stack decoding \[5\] form the basis for most of this work. In comparison with stack decoding, Viterbi beam search is more efficient but less optimal in the MAP sense. For stack decoding, a fast-match is necessary to reduce a prohibitively large search space. A reliable fast-match should make full use of detailed acoustic and language models to avoid introducing possibly unrecoverable errors.</Paragraph>
<Paragraph position="1"> Recently, several systems have been proposed that use Viterbi beam search as a fast-match \[27, 29\] for stack decoding or the N-best paradigm \[25\]. In these systems, N-best hypotheses are produced with very simple acoustic and language models.</Paragraph>
<Paragraph position="2"> A multi-pass rescoring is subsequently applied to these hypotheses to produce the final recognition output. One problem with this paradigm is that decisions made in the initial phase are based on simplified models, which leads to errors from which the N-best hypothesis list cannot recover. Another problem is that the rescoring procedure itself can be very expensive, as many hypotheses may have to be rescored. The challenge here is to design a search that makes the appropriate compromises among memory bandwidth, memory size, and computational power \[3\].</Paragraph>
<Paragraph position="3"> To meet this challenge we incrementally apply all available acoustic and linguistic information in three search phases.</Paragraph>
<Paragraph position="4"> Phase one is a left-to-right Viterbi beam search that produces word end times and scores using right-context between-word models with a bigram language model. Phase two, guided by the results of phase one, is a right-to-left Viterbi beam search that produces word beginning times and scores based on left-context between-word models. Phase three is an A* search that combines the results of phases one and two with a long-distance language model.</Paragraph>
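Phase one (and, mirrored, phase two) can be pictured with the following sketch of a time-synchronous Viterbi beam pass that records word end times and scores into a lattice. To stay short, every word is collapsed to a single looping HMM state and between-word context modeling is omitted, so the emit and bigram callables are assumptions, not the SPHINX-II interfaces.

import math
from collections import defaultdict

def forward_beam_pass(n_frames, vocab, emit, bigram, beam_width=20.0):
    """Left-to-right Viterbi beam pass producing word end times and scores.

    emit(w, t)   : log acoustic score of word w for frame t (single-state word model)
    bigram(v, w) : log bigram probability of w following v (None marks sentence start)
    Returns lattice[t][w], the best log score of any path in which word w ends at frame t.
    """
    tokens = {w: bigram(None, w) + emit(w, 0) for w in vocab}  # paths currently inside word w
    lattice = defaultdict(dict)
    for t in range(n_frames):
        # Any active word may end at frame t; record its score in the lattice.
        for w, score in tokens.items():
            lattice[t][w] = max(score, lattice[t].get(w, -math.inf))
        if t == n_frames - 1:
            break
        best = max(tokens.values())
        new_tokens = {}
        for w, score in tokens.items():
            if score < best - beam_width:                      # beam pruning
                continue
            # Either stay inside the same word for one more frame ...
            stay = score + emit(w, t + 1)
            new_tokens[w] = max(stay, new_tokens.get(w, -math.inf))
            # ... or end word w here and start a successor word at frame t + 1.
            for nw in vocab:
                enter = score + bigram(w, nw) + emit(nw, t + 1)
                new_tokens[nw] = max(enter, new_tokens.get(nw, -math.inf))
        tokens = new_tokens
    return lattice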
<Paragraph position="5"> 4.1. Modified A* Stack Search. Each theory th on the stack consists of five entries: a partial theory th.pt; a one-word extension th.w; a time th.t, which denotes the boundary between th.pt and th.w; and two scores, th.g, the score for th.pt up to time th.t, and th.h, the best score for the remaining portion of the input starting with th.w at time th.t+1 through to the end. Unique theories are determined by th.pt and th.w. The algorithm proceeds as follows.</Paragraph>
<Paragraph position="6"> 1. Add initial states to the stack.</Paragraph>
<Paragraph position="7"> 2. According to the evaluation function th.g + th.h, remove the best theory, th, from the stack.</Paragraph>
<Paragraph position="8"> 3. If th accounts for the entire input, then output the sentence corresponding to th. Halt if this is the Nth utterance output.</Paragraph>
<Paragraph position="9"> 4. For the word th.w, consider all possible end times t, as provided by the left/right lattice.</Paragraph>
<Paragraph position="10"> (a) For all words w beginning at time t + 1, as provided by the right/left lattice: i. Extend theory th with w. Designate this theory as th'. Set th'.pt = th.pt + th.w, th'.w = w, and th'.t = t.</Paragraph>
<Paragraph position="11"> ii. Compute the scores th'.g = th.g + w_score(w, th.t + 1, t) and th'.h. See below for the definition of w_score and the th'.h computation.</Paragraph>
<Paragraph position="12"> iii. If th' is already on the stack, then choose the best instance of th'; otherwise push th' onto the stack.</Paragraph>
<Paragraph position="13"> 5. Go to step 2.</Paragraph>
<Paragraph position="14"> 4.2. Discussion. When th is extended, we consider all possible end times t for th.w and all possible extensions w. When extending th with w to obtain th', we are only interested in the value of th'.t that gives the best value of th'.h + th'.g. For any t and w, th'.h is easily determined via table lookup from the right/left lattice. Furthermore, the value of th'.g is given by th.g + w_score(w, th.t+1, t). The function w_score(w, b, e) computes the score for the word w with begin time b and end time e.</Paragraph>
<Paragraph position="15"> Our objective is to maximize recognition accuracy with a minimal increase in computational complexity. With our decomposed, incremental, semi-between-word-triphone search, we observed that early use of detailed acoustic models can significantly reduce the recognition error rate with a negligible increase in computational complexity. By incrementally applying knowledge we have been able to decompose the search so that we can efficiently apply detailed acoustic or linguistic knowledge in each phase. Furthermore, each phase defers decisions that are better made by a subsequent phase that will apply the appropriate acoustic or linguistic information.</Paragraph>
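The stack-search loop above can be summarized in a short sketch. It is not the SPHINX-II implementation: the phase-one/phase-two lattices are abstracted into four lookup callables, a theory is a plain tuple, and the word scored over frames th.t+1..t is the theory's pending word th.w.

import heapq

def a_star_stack_search(start_word, input_len, end_times, next_words,
                        w_score, h_score, n_best=1):
    """Sketch of the modified A* stack search over forward/backward lattices.

    A theory is (pt, w, t, g): partial theory pt, one-word extension w,
    boundary time t between pt and w, and score g for pt up to time t.
    end_times(w, t)  -> possible end times of w when it begins at frame t + 1
    next_words(e)    -> words that may begin at frame e + 1
    w_score(w, b, e) -> score of word w spanning frames b..e
    h_score(w, t)    -> best score for the rest of the input, starting with w at t + 1
    All four callables stand in for lattice lookups and are assumptions of this sketch.
    """
    heap, outputs, best_seen = [], [], {}
    # Step 1: add the initial theory (empty partial theory, sentence-start word).
    heapq.heappush(heap, (-h_score(start_word, 0), (), start_word, 0, 0.0))
    while heap and len(outputs) < n_best:
        # Step 2: remove the theory with the best evaluation g + h from the stack.
        _, pt, w, t, g = heapq.heappop(heap)
        # Step 3: if the theory spans the whole input (boundary at input_len), output it.
        if t >= input_len:
            outputs.append(pt + (w,))
            continue
        # Step 4: consider every possible end time e of the pending word w ...
        for e in end_times(w, t):
            new_g = g + w_score(w, t + 1, e)
            # ... and (4a) every one-word extension nw beginning at e + 1.
            for nw in next_words(e):
                f = new_g + h_score(nw, e)
                key = (pt + (w,), nw)            # theories are unique in (pt, w)
                if f > best_seen.get(key, float("-inf")):
                    best_seen[key] = f           # keep only the best instance (step 4a-iii)
                    heapq.heappush(heap, (-f, pt + (w,), nw, e, new_g))
        # Step 5: go back to step 2.
    return outputs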
</Section>
<Section position="6" start_page="83" end_page="84" type="metho"> <SectionTitle> 5. UNIFIED STOCHASTIC ENGINE </SectionTitle>
<Paragraph position="0"> Acoustic and language models are usually constructed separately: language models are derived from a large text corpus without consideration of the acoustic data, and acoustic models are constructed from the acoustic data without exploiting the existing text corpus used for language training.</Paragraph>
<Paragraph position="1"> We have recently developed a unified stochastic engine (USE) that jointly optimizes both acoustic and language models. Since the true probability distributions of the acoustic and language models cannot be accurately estimated, their outputs cannot be regarded as real probabilities but rather as scores from two different sources. Because they are scores instead of probabilities, a straightforward implementation of the Bayes equation will generally not lead to satisfactory recognition performance.</Paragraph>
<Paragraph position="2"> To integrate language and acoustic probabilities for decoding, we are forced to weight the acoustic and language probabilities with a so-called language weight \[6\]. This constant language weight is usually tuned to balance the acoustic and language probabilities such that the recognition error rate is minimized. Most HMM-based speech recognition systems have a single constant language weight that is independent of any specific acoustic or language information and that is determined by a hill-climbing procedure on development data. It is often necessary to make many runs with different language weights on the development data in order to determine the best value.</Paragraph>
<Paragraph position="3"> In the unified stochastic engine (USE), we can iteratively adjust not only the language probabilities to fit the given acoustic representations but also the acoustic models. Our multi-pass search algorithm generates N-best hypotheses, which are used to optimize language weights or to implement discriminative training methods in which recognition errors are used as the objective function \[20, 25\]. With the construction of new databases such as DARPA's CSR Phase II, we believe acoustically-driven language modeling will eventually provide us with dramatic performance improvements.</Paragraph>
<Paragraph position="4"> In the N-best hypothesis list, we can assume that the correct hypothesis is always present (we can insert the correct answer if it is not). Let a hypothesis be a sequence of words $w_1, w_2, \ldots, w_n$ with corresponding language and acoustic probabilities. We denote the correct word sequence as $\theta$ and an incorrect sentence hypothesis as $\bar{\theta}$. We can assign a variable weight to each of the n-gram probabilities such that we have a weighted language probability:</Paragraph>
<Paragraph position="5"> $$\Pr_{\alpha}(w_1, \ldots, w_n) = \prod_{i} \Pr(w_i \mid w_{i-1}, \ldots)^{\alpha(X_i, w_i w_{i-1} \ldots)} \quad (1)$$ </Paragraph>
<Paragraph position="6"> where the weight $\alpha(X_i, w_i w_{i-1} \ldots)$ is a function of the acoustic data $X_i$ for $w_i$ and of the words $w_i, w_{i-1}, \ldots$. For a given sentence $k$, a very general objective function can be defined as</Paragraph>
<Paragraph position="7"> $$L_k(\Lambda) = -\sum_{i \in \theta} \left[ \log \Pr(X_i \mid w_i) + \alpha(X_i, w_i w_{i-1} \ldots) \log \Pr(w_i \mid w_{i-1} \ldots) \right] + \sum_{\bar{\theta}} \Pr(\bar{\theta}) \left\{ \sum_{i \in \bar{\theta}} \left[ \log \Pr(X_i \mid w_i) + \alpha(X_i, w_i w_{i-1} \ldots) \log \Pr(w_i \mid w_{i-1} \ldots) \right] \right\} \quad (2)$$ </Paragraph>
<Paragraph position="8"> where $\Lambda$ denotes the acoustic and language model parameters as well as the language weights, $\Pr(\bar{\theta})$ denotes the a priori probability of the incorrect path $\bar{\theta}$, and $\Pr(X_i \mid w_i)$ denotes the acoustic probability generated by the word model $w_i$. It is obvious that when $L_k(\Lambda) > 0$ we have a sentence classification error, so minimization of Equation 2 will lead to minimization of the sentence recognition error rate. To jointly optimize over the whole training set, we first define a nondecreasing, differentiable cost function $l_k(\Lambda)$ (we use the sigmoid function here) in the same manner as the adaptive probabilistic descent method \[4, 20\].</Paragraph>
<Paragraph position="9"> There exist many possible gradient descent procedures for the proposed problems.</Paragraph>
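As one illustration of such a procedure (not the one used in the USE experiments), the sketch below performs a numerical-gradient descent step on the summed sigmoid cost with respect to a single shared language weight. Collapsing the per-word weight of Equation 2 to one scalar, and the data layout, are assumptions of this sketch.

import numpy as np

def hypothesis_score(log_acoustic, log_lm, alpha):
    """sum_i [log Pr(X_i | w_i) + alpha * log Pr(w_i | w_{i-1} ...)] for one hypothesis."""
    return float(np.sum(log_acoustic + alpha * log_lm))

def sentence_cost(correct, incorrect, priors, alpha, beta=1.0):
    """Sigmoid of the misclassification measure L_k of Equation 2 (simplified).

    correct   : (log_acoustic, log_lm) arrays for the correct hypothesis
    incorrect : list of (log_acoustic, log_lm) pairs for competing hypotheses
    priors    : prior weight Pr(theta_bar) of each competing hypothesis
    """
    g_correct = hypothesis_score(*correct, alpha)
    g_wrong = sum(p * hypothesis_score(la, ll, alpha)
                  for p, (la, ll) in zip(priors, incorrect))
    L = g_wrong - g_correct                  # L > 0 signals a sentence error
    return 1.0 / (1.0 + np.exp(-beta * L))   # smooth, differentiable cost l_k

def update_language_weight(alpha, batch, lr=0.01, eps=1e-4):
    """One numerical-gradient descent step on the total cost over a batch of sentences."""
    def total(a):
        return sum(sentence_cost(c, inc, pr, a) for c, inc, pr in batch)
    grad = (total(alpha + eps) - total(alpha - eps)) / (2.0 * eps)
    return alpha - lr * grad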
<Paragraph position="10"> The term $\alpha(X_i, w_i w_{i-1} \ldots) \log \Pr(w_i \mid w_{i-1} \ldots)$ could be merged into a single term in Equation 2; we could thus have language probabilities estimated directly from the acoustic training data. The proposed approach is fundamentally different from traditional stochastic language modeling. First, conventional language modeling uses a text corpus only, so acoustically confusable words are not reflected in the language probabilities. Second, maximum likelihood estimation is usually used, which is only loosely related to minimum sentence error. The reason we keep $\alpha()$ separate from the language probability is that we may not have sufficient acoustic data to estimate the language parameters at present. Thus, we are forced to share $\alpha()$ across different words, so we may have n-gram-dependent, word-dependent, or even word-count-dependent language weights. We can use the gradient descent method to optimize all of the parameters in the USE.</Paragraph>
<Paragraph position="11"> When we jointly optimize $L(\Lambda)$, we obtain not only unified acoustic models but also unified language models. A preliminary experiment reduced the error rate by 5% on the WSJ task \[14\]. We will extend the USE paradigm for joint acoustic and language model optimization. We believe that the USE can further reduce the error rate with an increased amount of training data.</Paragraph> </Section>
<Section position="7" start_page="84" end_page="84" type="metho"> <SectionTitle> 6. LANGUAGE MODELING </SectionTitle>
<Paragraph position="0"> Language modeling is used in SPHINX-II at two different points. First, it is used to guide the beam search; for that purpose we used a conventional backoff bigram. Second, it is used to recalculate the linguistic scores of the top N hypotheses as part of the N-best paradigm. We concentrated most of our language modeling effort on the latter.</Paragraph>
<Paragraph position="1"> Several variants of the conventional backoff trigram language model were applied at the reordering stage of the N-best paradigm. (Eventually we plan to incorporate this language model into the A* phase of the multi-pass search with the USE.) The best result, a 22% word error rate reduction, was achieved with the simple, non-interpolated &quot;backward&quot; trigram, with the conventional forward trigram finishing a close second.</Paragraph> </Section> </Paper>