<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1033"> <Title>Vocabulary and Environment Adaptation in Vocabulary-Independent Speech Recognition</Title> <Section position="3" start_page="168" end_page="170" type="metho"> <SectionTitle> 3 Vocabulary Adaptation </SectionTitle> <Paragraph position="0"> Unlike most speaker adaptation techniques, our vocabulary adaptation algorithms rely only on an analysis of the target vocabulary and thus do not require any additional vocabulary-specific data. Two terms which play an essential role in our algorithms are defined as follows.</Paragraph> <Paragraph position="1"> relevant allophones Those allophones which occur in the target vocabulary (task).</Paragraph> <Paragraph position="2"> irrelevant allophones Those allophones which occur in the VI training set, but not in the target vocabulary (task).</Paragraph> <Paragraph position="3"> In the 1991 DARPA Speech and Natural Language Workshop \[7\], we showed that the decision-tree based generalized allophone is an adequate VI subword model. Figure 1 is an example of our VI subword unit, the generalized allophone, which is actually an allophonic cluster. The allophones in the white area are relevant allophones and the rest are irrelevant ones.</Paragraph>
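As a concrete illustration of the two definitions above, the following sketch (not from the paper) derives the relevant-allophone set of a target vocabulary from a hypothetical pronunciation lexicon, treating each allophone as a triphone (a phone with its left and right contexts); the paper's actual allophone inventory and context definition may be richer.

```python
# Sketch: deriving relevant (and irrelevant) allophones for a target task.
# The lexicon, phone symbols, and the triphone view of an allophone are
# illustrative assumptions, not the paper's actual inventory.

def triphones(phones):
    """Yield (left, phone, right) context-dependent allophones of one pronunciation."""
    padded = ["SIL"] + list(phones) + ["SIL"]
    for i in range(1, len(padded) - 1):
        yield (padded[i - 1], padded[i], padded[i + 1])

def relevant_allophones(vocabulary, lexicon):
    """Allophones occurring in the target vocabulary (task)."""
    relevant = set()
    for word in vocabulary:
        for pron in lexicon[word]:          # a word may have several pronunciations
            relevant.update(triphones(pron))
    return relevant

# Toy usage with a two-word task; the lexicon is purely illustrative.
lexicon = {"KEY": [["K", "IY"]], "COAT": [["K", "OW", "T"]]}
task_relevant = relevant_allophones(["KEY", "COAT"], lexicon)

# Anything seen in the VI training set but absent from the task is irrelevant.
vi_training_allophones = {("SIL", "K", "AA"), ("SIL", "K", "IY"), ("IY", "T", "SIL")}
irrelevant = vi_training_allophones - task_relevant
```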
<Section position="1" start_page="168" end_page="169" type="sub_section"> <SectionTitle> 3.1 Vocabulary-Adapted Decision Tree </SectionTitle> <Paragraph position="0"> Our first vocabulary adaptation algorithm changes the allophone clustering (the decision trees) so that the new set of subword models has more discriminative power for the target vocabulary. Since the clustering decision trees were built on the entire VI training set, the presence of the numerous irrelevant allophones might result in a sub-optimal clustering of allophones for the target vocabulary.</Paragraph> <Paragraph position="1"> To see why, consider the following scenario.</Paragraph> <Paragraph position="2"> Figure 2 shows a split in the original decision tree for phone /k/, generated from the vocabulary-independent training set; the question associated with this split is &quot;Is the left context a vowel?&quot;. Suppose all the left contexts for phone /k/ in the target vocabulary are vowels. Then this question is totally unsuitable for the target vocabulary, because the split assigns all the allophones of /k/ in the target vocabulary to one branch and discrimination among those allophones becomes impossible.</Paragraph> <Paragraph position="3"> On the other hand, if only the relevant allophones are considered for this split, the associated question turns out to be one of the relevant questions, which separates the relevant allophones appropriately and therefore possesses the greatest discriminative ability among them.</Paragraph> <Paragraph position="4"> Figure 3 shows such an optimal split for the relevant allophones.</Paragraph> <Paragraph position="5"> The generation of the clustering decision trees is recursive.</Paragraph> <Paragraph position="6"> The presence of numerous irrelevant allophones prevents the generation of the decision trees from concentrating on the relevant allophones and relevant questions, and results in sub-optimal trees for the relevant allophones.</Paragraph> <Paragraph position="7"> Based on this analysis, our first adaptation algorithm builds vocabulary-adapted (VA) decision trees by using only the relevant allophones during the generation of the decision trees.</Paragraph> <Paragraph position="8"> The adapted trees are not only generated automatically, but also focus on the relevant questions that separate the relevant allophones, and therefore give the resulting allophonic clusters more discriminative power for the target vocabulary.</Paragraph> <Paragraph position="9"> Three potential problems arise when one examines the algorithm closely. First of all, some relevant allophones might not occur in the VI training set, since we cannot expect 100% allophone coverage for every task, especially for a large-vocabulary task. Nevertheless, it is essential to have models for all the relevant allophones ready before generating the VA decision trees, because we need the entropy information of the models for each split. This is trivial for those relevant allophones which also occur in the VI training set: the corresponding allophonic models trained from the training data can be used directly. Because of the nature of decision trees, every allophone can find its closest generalized allophonic cluster by traversing the decision trees. Therefore, the corresponding generalized allophonic models can be used as the models for those relevant allophones not occurring in the VI training set during the generation of the VA clustering trees.</Paragraph> <Paragraph position="10"> Secondly, if only the part of the VI training set which contains the relevant allophones were used to train the new generalized allophonic models, the adapted generalized allophonic models would be under-trained and less robust. Fortunately, we can retain the entire training set because of the nature of decision trees. All the allophones can find their generalized allophonic clusters by traversing the new VA decision trees, so the entire VI training set can contribute to the training of the new adapted generalized allophonic models and make them well-trained and robust.</Paragraph>
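The tree traversal used twice above (assigning any allophone, even one unseen in the VI training set, to its generalized allophonic cluster) can be pictured with the following minimal sketch; the node layout and the example question are illustrative assumptions rather than the paper's implementation.

```python
# Sketch: walking a clustering decision tree to find the generalized
# allophonic cluster of an arbitrary allophone (left, phone, right).

VOWELS = {"AA", "AE", "IY", "OW", "UW"}   # toy phone classes for the questions

class Node:
    def __init__(self, question=None, yes=None, no=None, cluster_id=None):
        self.question = question          # predicate over an allophone; None at a leaf
        self.yes, self.no = yes, no
        self.cluster_id = cluster_id      # generalized-allophone id at a leaf

def cluster_of(allophone, node):
    """Traverse until a leaf; works even for allophones unseen in training."""
    while node.question is not None:
        node = node.yes if node.question(allophone) else node.no
    return node.cluster_id

# A two-leaf tree for phone /k/ asking "Is the left context a vowel?":
tree_k = Node(question=lambda a: a[0] in VOWELS,
              yes=Node(cluster_id="k_after_vowel"),
              no=Node(cluster_id="k_other"))

print(cluster_of(("IY", "K", "AO"), tree_k))   # -> k_after_vowel
```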
<Paragraph position="11"> The entropy criterion for splitting during the generation of the decision trees is weighted by the counts (frequencies) of the allophones \[6\]. By preferring to split nodes with large counts (allophones appearing frequently), the counts of the allophonic clusters become more balanced and the final generalized allophonic models are all adequately trainable. However, the VA decision trees are generated from the set of relevant allophones, which is not the same as the set of allophones used to train the generalized allophonic models, so this balance property no longer holds. Some generalized allophonic models might have only a few (or even no) examples in the VI training set and thus cannot be well trained. Fortunately, we can enhance the trainability of the VA subword models through gross validation with the entire VI training set. The gross validation for VA decision trees is somewhat different from conventional cross validation, which uses one part of the data to grow the trees and another, independent part to prune them in order to predict new contexts. Since the relevant allophones already constitute only a small portion of the entire VI training set, dividing them further would prevent the learning algorithm from generating reliable VA decision trees. Instead, we grow the VA decision trees very deeply; replace the entropy reduction information of each split by traversing the trees with all the allophones (including irrelevant ones); and finally prune the trees based on the new entropy information. This prunes out splits of nodes without enough training support (too few examples), even if they are relevant to the target vocabulary, so the resulting generalized allophonic models become more trainable, as the sketch below illustrates.</Paragraph> <Paragraph position="12"> The vocabulary-adapted decision tree learning algorithm, which emphasizes the relevant allophones while growing the decision trees and uses gross validation with the entire VI training set, provides an ideal means of finding the equilibrium between adaptability to the target vocabulary and trainability on the VI training database.</Paragraph>
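A rough sketch of this gross-validation step, under several assumptions: the VA tree has already been grown deeply on the relevant allophones, nodes follow the Node layout of the traversal sketch above, and each allophone is summarized by its context, its training count, and a discrete output distribution standing in for its HMM output statistics. The sketch re-computes the count-weighted entropy reduction of every split using all allophones and collapses splits that fall below a threshold.

```python
# Sketch: gross validation of a deeply grown VA decision tree. Allophones are
# dicts {"ctx": (l, p, r), "count": int, "dist": [..]}; this representation is
# an illustrative stand-in for the real HMM output statistics.
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

def pooled(group):
    """Count-weighted mixture of the output distributions in a group of allophones."""
    total = sum(a["count"] for a in group)
    dims = len(group[0]["dist"])
    mix = [sum(a["count"] * a["dist"][i] for a in group) / total for i in range(dims)]
    return total, mix

def weighted_entropy_reduction(group, question):
    yes = [a for a in group if question(a["ctx"])]
    no = [a for a in group if not question(a["ctx"])]
    if not yes or not no:                 # split has no support on one side
        return 0.0
    n, d = pooled(group)
    ny, dy = pooled(yes)
    nn, dn = pooled(no)
    return n * entropy(d) - ny * entropy(dy) - nn * entropy(dn)

def gross_validate(node, allophones, min_reduction):
    """Re-score each split with *all* allophones and prune weakly supported splits."""
    if node.question is None:             # already a leaf cluster
        return node
    if not allophones or weighted_entropy_reduction(allophones, node.question) < min_reduction:
        node.question = node.yes = node.no = None   # collapse to a leaf
        return node                                 # (a fresh cluster id would be assigned here)
    node.yes = gross_validate(node.yes, [a for a in allophones if node.question(a["ctx"])], min_reduction)
    node.no = gross_validate(node.no, [a for a in allophones if not node.question(a["ctx"])], min_reduction)
    return node
```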
</Section> <Section position="2" start_page="169" end_page="170" type="sub_section"> <SectionTitle> 3.2 Vocabulary-Bias Training </SectionTitle> <Paragraph position="0"> While the above adaptation algorithm tailors the subword units to the target vocabulary by focusing on the relevant allophones during the generation of the clustering decision trees, it treats relevant and irrelevant allophones equally in the final training of the generalized allophonic models. Our next adaptation algorithm gives the relevant allophones more prominence during the training of the generalized allophonic models.</Paragraph> <Paragraph position="1"> Since the VI training database is supposed to be very large, it is reasonable to assume that the irrelevant allophones are the majority of almost every cluster. Thus, the resulting allophonic cluster will more likely represent the acoustic behavior of the set of irrelevant allophones rather than that of the relevant allophones.</Paragraph> <Paragraph position="2"> In order to make the relevant allophones the majority of each allophonic cluster without incorporating new vocabulary-specific data, we must impose a bias toward the relevant allophones during training. Since our VI system is based on the HMM approach, it is trivial to give the relevant allophones more prominence by assigning more weight to them during Baum-Welch training. The simplest way is to multiply the parametric re-estimation equations for the relevant allophones by a prominent weight.</Paragraph> <Paragraph position="3"> The prominent weight can be a pre-defined constant, such as 2.0 or 3.0, or a function of some variables. However, it is better for the prominent weight to reflect the reliability of the relevant allophones toward which we impose the bias.</Paragraph> <Paragraph position="4"> If a relevant allophone occurs rarely in the training set, we should not assign a large weight to it, because its statistics are not reliable. On the other hand, we can assign larger weights to those relevant allophones with enough examples in the training data. In our experiments, we use a simple function based on the frequencies of the relevant allophones. All the irrelevant allophones have weight 1.0, and the weight for a relevant allophone is given by $1 + \log_a(x)$, where $x$ is the frequency of the relevant allophone and $a$ is chosen to be the minimum number of training examples needed to train a reasonable model in our configuration (see the sketch at the end of this section).</Paragraph> <Paragraph position="5"> Imposing a bias toward the relevant allophones is similar to duplicating the training data of the relevant allophones. For example, using a prominent weight of 2.0 for a training example in the Baum-Welch re-estimation is like observing the same training example twice. Therefore, our vocabulary-bias training algorithm is identical to duplicating the training examples of relevant allophones according to the weight function. Based on the same principle, this adaptation algorithm can be applied to other non-HMM systems by duplicating the training data of relevant allophones so that the relevant allophones become the majority of the training data. The resulting models will then be tailored to those relevant allophones.</Paragraph>
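A minimal sketch of the prominent-weight function just described: irrelevant allophones keep weight 1.0, and a relevant allophone with frequency x receives 1 + log_a(x). The value of a used below (50) is an illustrative assumption, not the paper's actual setting.

```python
# Sketch: prominent weight for vocabulary-bias training.
import math

def prominent_weight(count, relevant, min_training_count=50):
    """Weight applied to an allophone's Baum-Welch re-estimation statistics."""
    if not relevant or count <= 0:
        return 1.0                                   # irrelevant allophones are unweighted
    return 1.0 + math.log(count) / math.log(min_training_count)   # 1 + log_a(count)

# A weight of 2.0 behaves like observing the same training example twice,
# so rarely seen relevant allophones are boosted only mildly while
# well-attested ones are emphasized more.
print(prominent_weight(50, relevant=True))     # 2.0
print(prominent_weight(2500, relevant=True))   # about 3.0
print(prominent_weight(10, relevant=False))    # 1.0
```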
</Section> </Section> <Section position="4" start_page="170" end_page="171" type="metho"> <SectionTitle> 4 Environment Adaptation </SectionTitle> <Paragraph position="0"> It is well known that when a system is trained and tested under different environments, recognition performance drops moderately \[8\]. However, training and testing are very likely to take place under different environments for VI systems, because the VI models can be used for any task, which could happen anywhere. Even if the recording hardware remains unchanged (e.g., microphones, A/D converters, pre-amplifiers), the other environmental factors, e.g., room size, background noise, microphone positions, and reverberation from surface reflections, are all beyond our control. For example, when comparing the recording environments of Texas Instruments (TI) and Carnegie Mellon University (CMU), a few differences were observed although both used the same close-talking microphone (Sennheiser HMD-414).</Paragraph> <Paragraph position="1"> * Recording equipment - TI and CMU used different A/D devices, filters and pre-amplifiers, which might change the overall transfer function and thus produce different spectral tilts on the speech signals.</Paragraph> <Paragraph position="2"> * Room - The TI recording took place in a sound-proof room, while the CMU recording took place in a big laboratory with much background noise (mostly paper rustle, keyboard noise, and other conversations). Therefore, CMU's data tends to contain more additive noise than TI's.</Paragraph> <Paragraph position="3"> * Input level - The CMU recording process always adjusted the amplifier's gain control for different speakers to compensate for their varying sound volume. Since the sound volume of TI's female speakers tends to be much lower, TI probably did not adjust the gain control as CMU did. Therefore, the dynamic range of CMU's data tends to be larger.</Paragraph> <Section position="1" start_page="170" end_page="170" type="sub_section"> <SectionTitle> 4.1 Codebook Adaptation </SectionTitle> <Paragraph position="0"> The speech signal processing of our VI system is based on a characterization of speech by a codebook of prototypical models \[7\]. Typically, the performance of codebook-based systems degrades over time as the speech signal drifts through environmental changes, owing to the increased distortion between the speech and the codebook.</Paragraph> <Paragraph position="1"> Therefore, two possible adaptation strategies are: 1. continuously updating the codebook prototypes to fit the testing speech spectral vectors $x_t$.</Paragraph> <Paragraph position="2"> 2. continuously transforming the testing speech spectral vectors $x_t$ into normalized vectors $y_t$, so that the distribution of the $y_t$ is close to that of the training data described by the codebook prototypes.</Paragraph> <Paragraph position="3"> Our first environment adaptation algorithm follows the first strategy, while the two cepstral normalization algorithms described in Section 4.2 follow the second. Semi-continuous HMMs (SCHMMs), or tied-mixture continuous HMMs \[9, 3\], have been proposed to extend discrete HMMs by replacing discrete output distributions with a combination of the original discrete output probability distributions and continuous pdf's of the codebooks. SCHMMs can jointly re-estimate both the codebooks and the HMM parameters to achieve an optimal codebook/model combination according to a maximum likelihood criterion during training. They have been applied to several recognition systems with improved performance over discrete HMMs \[9, 3\].</Paragraph> <Paragraph position="4"> The codebooks of our vocabulary-independent system can be modified, according to the SCHMM framework, to optimize the probability that the vocabulary-independent HMMs generate the data from the new environment. Let $\mu_i$ denote the mean vector of codeword $i$ in the original codebook; then the new vector $\hat{\mu}_i$ can be obtained from the following equation:</Paragraph> <Paragraph position="5"> $$\hat{\mu}_i = \frac{\sum_m \sum_t \gamma_i^m(t)\, x_t}{\sum_m \sum_t \gamma_i^m(t)} \qquad (1)$$ </Paragraph> <Paragraph position="6"> where $\gamma_i^m(t)$ denotes the posterior probability of observing codeword $i$ at time $t$ using HMM $m$ for speech vector $x_t$.</Paragraph> <Paragraph position="7"> Note that we did not use continuous Gaussian pdf's to represent the codebooks in Equation 1. Each mean vector of the new codebook is computed from the acoustic vectors $x_t$ and the corresponding posterior probabilities in the discrete forward-backward algorithm, without involving continuous pdf computation. The new data from the different environment, $x_t$, are automatically aligned with the corresponding codewords in the forward-backward training procedure. If a frame is not closely associated with a codeword, its contribution to the re-estimation of that codeword is de-weighted by the posterior probability $\gamma_i^m(t)$ accordingly, so the new codebook is adjusted to fit the new data.</Paragraph>
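A small sketch of the codebook update in Equation 1, assuming the per-frame posteriors gamma_i^m(t) have already been produced by the discrete forward-backward pass; the array layout and function name are illustrative.

```python
# Sketch: SCHMM-style re-estimation of codebook means from adaptation data.
import numpy as np

def adapt_codebook(codebook, frames, posteriors):
    """
    codebook:   (I, D) original mean vectors mu_i
    frames:     per-utterance (T, D) arrays of adaptation vectors x_t
    posteriors: per-utterance (T, I) arrays with gamma[t, i] = gamma_i^m(t)
    Returns the (I, D) re-estimated codebook; codewords that receive no
    posterior mass keep their original means.
    """
    num = np.zeros_like(codebook)
    den = np.zeros(codebook.shape[0])
    for x, gamma in zip(frames, posteriors):
        num += gamma.T @ x            # sum_t gamma_i^m(t) * x_t
        den += gamma.sum(axis=0)      # sum_t gamma_i^m(t)
    new = codebook.copy()
    mask = den > 0
    new[mask] = num[mask] / den[mask, None]
    return new
```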
</Section> <Section position="2" start_page="170" end_page="171" type="sub_section"> <SectionTitle> 4.2 Cepstral Normalization </SectionTitle> <Paragraph position="0"> The types of environmental factors that differ between TI's and CMU's recording environments can roughly be classified into two complementary categories: 1. additive noise - noise from different sources, such as paper rustle, keyboard noise, and other conversations.</Paragraph> <Paragraph position="1"> 2. spectral equalization - distortion from the convolution of the speech signal with an unknown channel, arising from factors such as microphone positions and reverberation from surface reflections.</Paragraph> <Paragraph position="2"> Acero et al. \[1, 2\] proposed a series of environment normalization algorithms based on joint compensation for additive noise and equalization. They have been implemented successfully in SPHINX to achieve robustness to different microphones. Among these algorithms, codeword-dependent cepstral normalization (CDCN) is the most accurate, while interpolated SNR-dependent cepstral normalization (ISDCN) is the most efficient. In this study, we incorporate these two algorithms to make our vocabulary-independent system more robust to environmental variations.</Paragraph> <Paragraph position="4"> Equation 2 is the environmental compensation model, in which x, z, w, q and n represent, respectively, the normalized vector, the observed vector, the correction vector, the spectral equalization vector and the noise vector. The CDCN algorithm attempts to determine the q and n that provide an ensemble of compensated vectors x collectively closest to the set of locations of legitimate VQ codewords; the correction vector w is then obtained with an MMSE estimator based on q, n and the codebook. In ISDCN, q and n are determined by an EM algorithm aimed at minimizing VQ distortion, and the final correction vector w also depends on the instantaneous SNR of the current input frame through a sigmoid function, as sketched below.</Paragraph>
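The following is only a schematic of the interpolation idea behind ISDCN, not the actual CDCN/ISDCN estimators: the correction subtracted from each observed cepstral frame moves, through a sigmoid of the frame's instantaneous SNR, between a noise-dominated term at low SNR and the equalization vector q at high SNR. The vectors q and n and the sigmoid parameters are assumed to have been estimated beforehand (the paper obtains q and n with an EM algorithm minimizing VQ distortion).

```python
# Schematic sketch of SNR-dependent cepstral correction (illustrative only).
import numpy as np

def sigmoid(snr_db, center=10.0, slope=0.5):      # parameters are assumptions
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - center)))

def normalize_frames(z, snr_db, q, n):
    """
    z:      (T, D) observed cepstra
    snr_db: (T,)   instantaneous SNR of each frame
    q, n:   (D,)   spectral-equalization and noise-related correction vectors
    Returns (T, D) normalized cepstra x = z - w(SNR).
    """
    alpha = sigmoid(snr_db)[:, None]        # ~1 for clean frames, ~0 for noisy ones
    w = alpha * q + (1.0 - alpha) * n       # frame-dependent correction vector
    return z - w
```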
</Section> </Section> <Section position="5" start_page="171" end_page="172" type="metho"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> All the experiments are evaluated on the speaker-independent DARPA Resource Management task. This is a 991-word continuous-speech task, and a standard word-pair grammar with perplexity 60 was used throughout. The test set, TI-TEST, consists of 320 sentences from 32 speakers (a random selection from the June 1988, February 1989 and October 1990 DARPA evaluation sets).</Paragraph> <Paragraph position="1"> In order to isolate the influence of cross-environment recognition, an identical test set, CMU-TEST, from 32 speakers (different from the TI speakers) was collected at CMU. Our baseline uses the 4-codebook discrete SPHINX system and decision-tree based generalized allophones as the VI subword units \[7\]. Table 1 shows that about 9% error reduction is achieved by adapting the decision trees to the Resource Management task, while about 15% error reduction is achieved by using vocabulary-bias training for the same task. Nevertheless, when we tried to combine these two adaptation algorithms to further tailor the vocabulary-independent models to the Resource Management task, no compound improvement was produced. This might be because both algorithms are learning similar characteristics of the target task, or because the combination of the two algorithms already reaches the limit of the adaptation capability of our modeling technique without the help of vocabulary-specific data. In the codebook adaptation experiments, the 4 codebooks used in our HMM-based system are updated according to Equation 1. We randomly selected 100, 300, 1000, and 2000 sentences from the TI Resource Management database to form different adaptation sets. Two iterations were carried out for each adaptation set to estimate the new codebooks for TI's data, while the HMM parameters were kept fixed. Table 2 shows the recognition results on the TI test set after adaptation. It indicates that adapting the codebooks to the new environment yields only marginal improvement, even with plenty of adaptation data. This suggests that adapting the codebooks alone fails to produce adequate adaptation, because the HMM statistics used by the recognizer have not been updated.</Paragraph> <Paragraph position="2"> Table 3 shows the recognition error rates on the two test sets for the VI system incorporating CDCN and ISDCN. Recall that our VI training set was recorded at CMU. The degradation of cross-environment recognition on TI-TEST is reduced by roughly 50%. Like most environment normalization algorithms, there is also a minor performance degradation for same-environment recognition when gaining robustness to other environments.</Paragraph> </Section> <Section position="6" start_page="172" end_page="173" type="metho"> <SectionTitle> 6 Conclusions </SectionTitle> <Paragraph position="0"> In this paper, we have presented two vocabulary adaptation algorithms, vocabulary-adapted decision trees and vocabulary-bias training, that improve the performance of the vocabulary-independent system on the target task by tailoring the VI subword models to the target vocabulary. In the 1991 DARPA Speech and Natural Language Workshop \[7\], we showed that our VI system was already slightly better than our VD system. With these two adaptation algorithms, which led to 9% and 15% error reductions respectively on the Resource Management task, the resulting VI system is far more accurate than our VD system. In \[8\], we demonstrated improved vocabulary-independent results with vocabulary-specific adaptation data. In the future, we plan to extend our adaptation algorithms with the help of vocabulary-specific data to achieve further adaptation to the target vocabulary (task).</Paragraph> <Paragraph position="1"> CDCN and ISDCN have been successfully incorporated into the vocabulary-independent system and reduce the degradation of VI cross-environment recognition by 50%. In the future, we will keep investigating new environment normalization techniques to further reduce this degradation and ultimately achieve full environmental robustness across different acoustic environments. Moreover, environment adaptation with environment-specific data will also be explored for adapting the VI system to a new environment once we have more knowledge about it.</Paragraph> <Paragraph position="2"> Making speech recognition systems robust to new vocabularies and new environments is essential for making speech recognition applications feasible.
Our results have shown that plentiful training data, careful subword modeling (decision-tree based generalized allophones) and suitable environment normalization can compensate for the lack of vocabulary- and environment-specific training. With the additional help of vocabulary adaptation, the vocabulary-independent system can be further tailored to any task quickly and cheaply, and therefore facilitates speech applications tremendously.</Paragraph> </Section> </Paper>