<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1059"> <Title>Portability Issues for Speech Recognition Technologies</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. SYSTEM DESCRIPTION </SectionTitle> <Paragraph position="0"> The LIMSI broadcast news transcription system has two main components: the audio partitioner and the word recognizer. Data partitioning [6] serves to divide the continuous audio stream into homogeneous segments, associating appropriate labels for cluster, gender and bandwidth with the segments. The speech recognizer uses continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on large text corpora for language modeling. Each context-dependent phone model is a tied-state left-to-right CD-HMM with Gaussian mixture observation densities, where the tied states are obtained by means of a decision tree. Word recognition is performed in three steps: 1) initial hypothesis generation, 2) word graph generation, 3) final hypothesis generation. The initial hypotheses are used for cluster-based acoustic model adaptation using the MLLR technique [13] prior to word graph generation. A 3-gram LM is used in the first two decoding steps. The final hypotheses are generated with a 4-gram LM and acoustic models adapted with the hypotheses of step 2.</Paragraph> <Paragraph position="1"> In the baseline system used in DARPA evaluation tests, the acoustic models were trained on about 150 hours of audio data from the DARPA Hub4 Broadcast News corpus (the LDC 1996 and 1997 Broadcast News Speech collections) [9]. Gender-dependent acoustic models were built using MAP adaptation of SI seed models for wide-band and telephone-band speech [7]. The models contain 28000 position-dependent, cross-word triphone models with 11700 tied states and approximately 360k Gaussians [8].</Paragraph> <Paragraph position="2"> The baseline language models are obtained by interpolation of models trained on 3 different data sets (excluding the test epochs): about 790M words of newspaper and newswire texts; 240M words of commercial broadcast news transcripts; and the transcriptions of the Hub4 acoustic data. The recognition vocabulary contains 65120 words and has a lexical coverage of over 99% on all evaluation test sets from the years 1996-1999. A pronunciation graph is associated with each word so as to allow for alternate pronunciations. The pronunciations make use of a set of 48 phones, of which 3 units represent silence, filler words, and breath noises. The lexicon contains compound words for about 300 frequent word sequences, as well as word entries for common acronyms, providing an easy way to allow for reduced pronunciations [6].</Paragraph> <Paragraph position="3"> The LIMSI 10x system obtained a word error rate of 17.1% on the 1999 DARPA/NIST evaluation set and can transcribe unrestricted broadcast data with a word error rate of about 20% [8].</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. TASK INDEPENDENCE </SectionTitle> <Paragraph position="0"> Our first step in developing a &quot;generic&quot; speech transcription engine is to assess the most generic system we have under cross-task conditions, i.e., by recognizing task-specific data with a recognizer developed for a different task. Three representative tasks have been retained as target tasks: small vocabulary recognition (TI-digits), goal-oriented human-machine spoken dialog (ATIS), and dictation of texts (WSJ).
The broadcast news transcription task (Hub4E) serves as the baseline. The main criteria for task selection were that the tasks be realistic and that task-specific data be available. The characteristics of these four tasks and the available corpora are summarized in Table 1.</Paragraph> <Paragraph position="1"> For the small vocabulary recognition task, experiments are carried out on the adult speaker portion of the TI-digits corpus [14], containing over 17k utterances from a total of 225 speakers. The vocabulary contains 11 words: the digits '1' to '9', plus 'zero' and 'oh'. Each speaker uttered two versions of each digit in isolation and 55 digit strings. The database is divided into training and test sets (roughly 3.5 hours each, corresponding to 9k strings). The speech is of high quality, having been collected in a quiet environment. The best reported WERs on this task are around 0.2-0.3%.</Paragraph> <Paragraph position="2"> Since the phonemic coverage of the digits is very low, only 108 context-dependent models are used in our recognition system. The task-specific LM for the TI-digits is a simple grammar allowing any sequence of up to 7 digits. Our task-dependent system performance is 0.4% WER.</Paragraph> <Paragraph position="3"> Table 2: Word error rates for the TI-digits, ATIS94, WSJ95 and S9 WSJ93 test sets after recognition with three different configurations: (left) BN acoustic and language models; (center) BN acoustic models combined with task-specific lexica and LMs; and (right) task-dependent acoustic and language models.</Paragraph> <Paragraph position="4"> The DARPA Air Travel Information System (ATIS) task is chosen as being representative of a goal-oriented human-machine dialog task, and the ARPA 1994 Spontaneous Speech Recognition (SPREC) ATIS-3 data (ATIS94) [4] is used for testing purposes.</Paragraph> <Paragraph position="5"> The test data amounts to nearly 5 hours of speech from 24 speakers recorded with a close-talking microphone. Around 40h of speech data are available for training. The word error rates for this task in the 1994 evaluation were mainly in the range of 2.5% to 5%, which we take as state-of-the-art for this task. The acoustic models used in our task-specific system include 1641 context-dependent phones with 4k independent HMM states. A back-off trigram language model has been estimated on the transcriptions of the training utterances. The lexicon contains 1300 words, with compound words for multi-word entities in the air-travel database (city and airport names, services, etc.). The WER obtained with our task-dependent system is 4.4%.</Paragraph> <Paragraph position="6"> For the dictation task, the Wall Street Journal continuous speech recognition corpus [17] is used, abiding by the ARPA 1995 Hub3 test (WSJ95) conditions. The acoustic training data consist of 100 hours of speech from a total of 355 speakers taken from the WSJ0 and WSJ1 corpora. The Hub3 baseline test data consist of studio-quality read speech from 20 speakers with a total duration of 45 minutes. The best result reported at the time of the evaluation was 6.6%. A contrastive experiment is carried out with the WSJ93 Spoke 9 data, comprised of 200 spontaneous sentences spoken by journalists [11]. The best performance reported in the 1993 evaluation on the spontaneous data was 19.1% [18]; however, lower word error rates have since been reported on comparable test sets (14.1% on the WSJ94 Spoke 9 test data). 21000 context- and position-dependent models have been trained for the WSJ system, with 9k independent HMM states.
A 65k-word vocabulary was selected and a back-off trigram model was obtained by interpolating models trained on different data sets (training utterance transcriptions and newspaper texts). The task-dependent WSJ system has a WER of 7.6% on the read speech test data and 15.3% on the spontaneous data.</Paragraph> <Paragraph position="7"> For the BN transcription task, we follow the conditions of the 1998 ARPA Hub4E evaluation (BN98) [15]. The acoustic training data comprise 150 hours of North-American TV and radio shows. The best overall result on the 1998 baseline test was 13.5%. Three sets of experiments are reported. The first set consists of cross-task recognition experiments carried out using the BN acoustic and language models to decode the test data of the other tasks. The second set of experiments made use of mixed models, that is, the BN acoustic models combined with task-specific LMs. Due to the different evaluation paradigms, some minor modifications were made to the transcription procedure. First of all, in contrast with the BN data, the data for the 3 tasks are already segmented into individual utterances, so the partitioning step was eliminated. With this exception, the decoding process for the WSJ task is exactly the same as described in the previous section. For the TI-digits and ATIS tasks, word decoding is carried out in a single trigram pass, and no speaker adaptation was performed.</Paragraph> <Paragraph position="8"> The WERs obtained for the three recognition experiments are reported in Table 2. A comparison with Table 1 shows that the performance of the task-dependent models is close to the best reported results, even though we did not devote much effort to optimizing these models. We can also observe, by comparing the task-dependent (Table 2, right) and mixed (Table 2, middle) conditions, that the BN acoustic models are relatively generic. These models seem to be a good start towards truly task-independent acoustic models. By using task-specific language models for the TI-digits and ATIS, we can see that the gap in performance is mainly due to a linguistic mismatch. For WSJ the language models are more closely matched to BN, and only a small 1.6% WER reduction is obtained. On the spontaneous journalist dictation (WSJ S9 spoke) test data there is even an increase in WER using the WSJ LMs, which can be attributed to a better modeling of spontaneous speech effects (such as breath and filler words) in the BN models.</Paragraph> <Paragraph position="9"> Prior to introducing our approach for lightly supervised acoustic model training, we describe our standard training procedure in the next section.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. ACOUSTIC MODEL TRAINING </SectionTitle> <Paragraph position="0"> HMM training requires an alignment between the audio signal and the phone models, which usually relies on a perfect orthographic transcription of the speech data and a good phonetic lexicon. In general it is easier to deal with relatively short speech segments so that transcription errors will not propagate and jeopardize the alignment. The orthographic transcription is usually considered as ground truth, and training is done in a closely supervised manner.
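To make the role of the pronunciation lexicon in this alignment concrete, the following minimal sketch (with a tiny hypothetical lexicon and an illustrative phone_sequences helper; this is not the LIMSI lexicon or alignment code) shows how an orthographic transcription is expanded into the alternative phone sequences among which forced alignment then chooses.

from itertools import product

# Hypothetical pronunciation lexicon: each word maps to one or more phone
# strings (alternate pronunciations), in the spirit of the pronunciation graphs.
LEXICON = {
    "the": ["dh ax", "dh iy"],
    "data": ["d ey t ax", "d ae t ax"],
    "[breath]": ["brth"],  # filler/breath unit, analogous to the 48-phone set
}

def phone_sequences(transcription):
    """Expand an orthographic transcription into all phone sequences allowed
    by the lexicon; forced alignment then selects the best-matching one."""
    variants = []
    for word in transcription.lower().split():
        if word not in LEXICON:
            raise KeyError("word '%s' is missing from the lexicon" % word)
        variants.append(LEXICON[word])
    # Cartesian product over the per-word pronunciation variants.
    return [" ".join(choice) for choice in product(*variants)]

print(phone_sequences("the data"))  # four candidate phone sequences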
For each speech segment the training algorithm is provided with the exact orthographic transcription of what was spoken, i.e., the word sequence that the speech recognizer should hypothesize when confronted with the same speech segment.</Paragraph> <Paragraph position="1"> Training acoustic models for a new corpus (which could also reflect a change of task and/or language) usually entails the following sequence of operations once the audio data and transcription files have been loaded: 1. Normalize the transcriptions to a common format (some adjustment is always needed as different text sources make use of different conventions).</Paragraph> <Paragraph position="2"> 2. Produce a word list from the transcriptions and correct blatant errors (these include typographical errors and inconsistencies).</Paragraph> <Paragraph position="3"> 3. Produce a phonemic transcription for all words not in our master lexicon (these are manually verified).</Paragraph> <Paragraph position="4"> 4. Align the orthographic transcriptions with the signal using existing models and the pronunciation lexicon (or bootstrap models from another task or language). This procedure often rejects a substantial portion of the data, particularly for long segments. 5. Optionally correct transcription errors and realign (or simply ignore the rejected segments if enough audio data is available). 6. Run the standard EM training procedure.</Paragraph> <Paragraph position="5"> This sequence of operations is usually iterated several times to refine the acoustic models. In general each iteration recovers a portion of the rejected data.</Paragraph> </Section> <Section position="6" start_page="0" end_page="1" type="metho"> <SectionTitle> 5. LIGHTLY SUPERVISED ACOUSTIC MODEL TRAINING </SectionTitle> <Paragraph position="0"> One can imagine training acoustic models in a less supervised manner, using an iterative procedure in which, instead of manual transcriptions, the most likely word transcription given the current models and all the information available about the audio sample is used for alignment at each iteration. This approach still fits within the EM training framework, which is well suited to missing-data training problems. A completely unsupervised training procedure is to use the current best models to produce an orthographic transcription of the training data, keeping only words that have a high confidence measure. Such an approach, while very enticing, is limited since the only supervision is provided by the confidence measure estimator. This estimator must in turn be trained on development data, which must remain small for the approach to stay attractive.</Paragraph> <Paragraph position="1"> Between using carefully annotated data, such as the detailed transcriptions provided by the LDC, and using no transcription at all, there is a wide spectrum of possibilities. What really matters is the cost of producing the associated annotations. Detailed annotation requires on the order of 20-40 times real-time of manual effort, and even after manual verification the final transcriptions are not exempt from errors [2]. Orthographic transcriptions such as closed-captions can be produced in a few times real-time and are therefore considerably less costly. These transcriptions have the advantage that they are already available for some television channels, and therefore do not have to be produced specifically for training speech recognizers.
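Returning briefly to the fully unsupervised variant mentioned at the start of this section, confidence-based selection of automatically generated transcriptions can be sketched as follows (illustrative data layout and threshold values only; the paper does not specify its confidence estimator):

def filter_by_confidence(segments, word_threshold=0.9, keep_ratio=0.7):
    """Keep hypothesized words whose confidence exceeds word_threshold, and
    keep a segment for training only if enough of its hypothesis survives.
    The threshold values are purely illustrative."""
    kept = []
    for segment in segments:  # segment: list of (word, confidence) pairs
        confident = [word for word, conf in segment if conf >= word_threshold]
        if segment and len(confident) / len(segment) >= keep_ratio:
            kept.append(confident)
    return kept

# Toy recognizer output for two segments.
hypotheses = [
    [("president", 0.97), ("clinton", 0.95), ("said", 0.91)],
    [("uh", 0.35), ("the", 0.62), ("market", 0.88)],
]
print(filter_by_confidence(hypotheses))  # only the first segment is retained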
Closed-captions, however, are a close, but not exact, transcription of what is being spoken, and are only coarsely time-aligned with the audio signal. Hesitations and repetitions are not marked, and there may be word insertions, deletions and changes in word order. Closed-captions also lack some of the additional information provided in the detailed speech transcriptions, such as the indication of acoustic conditions, speaker turns, speaker identities and gender, and the annotation of non-speech segments such as music. NIST found the disagreement between the closed-captions and manual transcripts on a 10-hour subset of the TDT-2 data used for the SDR evaluation to be on the order of 12% [5].</Paragraph> <Paragraph position="2"> Another approach is to make use of other possible sources of contemporaneous texts from newspapers, newswires, summaries and the Internet. However, since these sources have only an indirect correspondence with the audio data, they provide less supervision.</Paragraph> <Paragraph position="3"> The basic idea of light supervision is to use a speech recognizer to automatically transcribe unannotated data, thus generating &quot;approximate&quot; labeled training data. By iteratively increasing the amount of training data, more accurate acoustic models are obtained, which can then be used to transcribe another set of unannotated data. The modified training procedure used in this work is: 1. Train a language model on all texts and closed captions after normalization. 2. Partition each show into homogeneous segments and label the acoustic attributes (speaker, gender, bandwidth) [6]. 3. Train acoustic models on a very small amount of manually annotated data (1h). 4. Automatically transcribe a large amount of training data. 5. (Optional) Align the closed-captions and the automatic transcriptions (using a standard dynamic programming algorithm). 6. Run the standard acoustic model training procedure on the speech segments (in the case of alignment with the closed captions, only keep segments where the two transcripts are in agreement). 7. Reiterate from step 4 (a schematic sketch of this loop is given below).</Paragraph> <Paragraph position="4"> It is easy to see that the manual work is considerably reduced, not only in generating the annotated corpus but also during the training procedure, since we no longer need to extend the pronunciation lexicon to cover all words and word fragments occurring in the training data, and we do not need to correct transcription errors. This basic idea was used to train acoustic models using the automatically generated word transcriptions of the 500 hours of audio broadcasts used in the spoken document retrieval task (part of the DARPA TDT-2 corpus used in the SDR'99 and SDR'00 evaluations) [3].</Paragraph> <Paragraph position="5"> This corpus is comprised of 902 shows from 6 sources broadcast between January and June 1998: CNN Headline News (550 30-minute shows), ABC World News Tonight (139 30-minute shows), America VOA Today and World Report (111 1-hour shows). These shows contain about 22k stories, with time-codes identifying the beginning and end of each story.</Paragraph> <Paragraph position="6"> First, the recognition performance as a function of the available acoustic and language model training data was assessed. Then we investigated the accuracy of the acoustic models obtained after recognizing the audio data using different levels of supervision via the language model.
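Schematically, the modified procedure above can be written as follows, with hypothetical callables (train_lm, partition, train_am, recognize) standing in for the partitioner, recognizer and trainer; the optional filtering of step 5 uses a standard dynamic-programming alignment, here illustrated with Python's difflib:

import difflib

def agreeing_regions(hyp_words, caption_words):
    """Keep only the regions where the automatic transcript and the closed
    captions agree (standard dynamic-programming alignment via difflib)."""
    matcher = difflib.SequenceMatcher(a=hyp_words, b=caption_words, autojunk=False)
    return [word for block in matcher.get_matching_blocks()
            for word in hyp_words[block.a:block.a + block.size]]

def lightly_supervised_training(shows, captions, lm_texts, manual_1h,
                                train_lm, partition, train_am, recognize,
                                iterations=3, filter_with_captions=True):
    """Schematic version of steps 1-7; all callables are hypothetical stand-ins."""
    lm = train_lm(lm_texts, captions)                      # step 1
    segments = [(story_id, seg) for show in shows
                for story_id, seg in partition(show)]      # step 2
    am = train_am(manual_1h)                               # step 3: 1h of manual data
    for _ in range(iterations):                            # step 7: reiterate
        training_set = []
        for story_id, seg in segments:
            hyp = recognize(seg, am, lm)                   # step 4
            if filter_with_captions:                       # step 5 (optional)
                hyp = agreeing_regions(hyp, captions.get(story_id, []))
            if hyp:
                training_set.append((seg, hyp))
        am = train_am(training_set)                        # step 6
    return am

Whether the caption filter of step 5 is applied corresponds to the filtered and unfiltered conditions compared in Table 5.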
With the exception of the baseline Hub4 language models, none of the language models include a component estimated on the transcriptions of the Hub4 acoustic training data. The language model training texts come from contemporaneous sources: newspaper and newswire texts, commercial summaries and transcripts, and closed-captions. The former sources have only an indirect correspondence with the audio data and provide less supervision than the closed captions. For each set of LM training texts, a new word list was selected based on the word frequencies in the training data. All language models are formed by interpolating individual LMs built on each text source. The interpolation coefficients were chosen in order to minimize the perplexity on a development set composed of the second set of the Nov98 evaluation data (3h) and a 2h portion of the TDT2 data from Jun98 (not included in the LM training data). The following combinations were investigated: LMa (baseline Hub4 LM): newspaper and newswire texts (NEWS), commercial transcripts (COM) predating Jun98, and the acoustic transcripts; LMn+t+c: NEWS, COM and closed-captions through May98; LMn+t: NEWS and COM through May98; LMn+c: NEWS and closed-captions through May98; LMn+to: NEWS and older COM predating the data epoch; LMn: NEWS only; LMno: older NEWS texts only.
Table 3: Word error rates for the different language models, with acoustic models trained on the HUB4 training data with detailed manual transcriptions. All runs were done in less than 10xRT, except the last row. &quot;1S&quot; designates one set of gender-independent acoustic models, whereas &quot;4S&quot; designates four sets of gender- and bandwidth-dependent acoustic models.</Paragraph> <Paragraph position="7"> It should be noted that all of the conditions include newspaper and newswire texts from the same epoch as the audio data. These provide an important source of knowledge, particularly with respect to the vocabulary items. Conditions which include the closed captions in the LM training data provide additional supervision in the decoding process when transcribing audio data from the same epoch.</Paragraph> <Paragraph position="8"> For testing purposes we use the 1999 Hub4 evaluation data, which is comprised of two 90-minute data sets selected by NIST. The first set was extracted from 10 hours of data broadcast in June 1998, and the second set from a set of broadcasts recorded in August-September 1998 [16]. All recognition runs were carried out in under 10xRT unless stated otherwise. The LIMSI 10x system obtained a word error rate of 17.1% on the evaluation set (the combined scores in the penultimate row of Table 3: 4S, LMa) [8]. The word error rate can be reduced to 15.6% for a system running at 50xRT (last entry in Table 3).</Paragraph> <Paragraph position="9"> As can be seen in Table 3, the word error rates with our original Hub4 language model (LMa) and the one without the transcriptions of the acoustic data (LMn+t+c) are comparable using the 1999 acoustic models trained on 123 hours of manually annotated data (123h, 4S). The quality of the different language models listed above is compared in the first row of Table 3, using speaker-independent (1S) acoustic models trained on the same Hub4 data (123h). As can be observed, removing any text source leads to a degradation in recognition performance. It appears to be more important to include the commercial transcripts (LMn+t), even if they are old (LMn+to), than the closed captions (LMn+c). This suggests that the commercial transcripts more accurately represent spoken language than closed-captioning.
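As an illustration of how the interpolation coefficients mentioned above can be obtained, the following minimal sketch (toy probabilities, not the actual LM toolkit) re-estimates the mixture weights with EM, which monotonically lowers the perplexity of the interpolated model on the development set:

import math

def estimate_interpolation_weights(dev_probs, iterations=50):
    """dev_probs[i][k] is the probability of the i-th development-set word
    under the k-th source-specific LM; returns mixture weights that
    (locally) minimize the perplexity of the interpolated LM."""
    n_models = len(dev_probs[0])
    weights = [1.0 / n_models] * n_models          # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * n_models
        for probs in dev_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for k in range(n_models):
                expected[k] += weights[k] * probs[k] / mix  # posterior of model k
        total = sum(expected)
        weights = [e / total for e in expected]
    return weights

def perplexity(dev_probs, weights):
    log_prob = sum(math.log(sum(w * p for w, p in zip(weights, probs)))
                   for probs in dev_probs)
    return math.exp(-log_prob / len(dev_probs))

# Toy example with three sources (e.g., NEWS, COM and closed-captions).
dev = [(0.010, 0.020, 0.005), (0.003, 0.010, 0.020), (0.020, 0.001, 0.010)]
w = estimate_interpolation_weights(dev)
print(w, perplexity(dev, w))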
Even if only newspaper and newswire texts are available (LMn), the word error increases by only 14% over the best configuration (LMn+t+c), and even using older newspaper and newswire texts (LMno) does not substantially increase the word error rate. The second row of Table 3 gives the word error rates with acoustic models trained on only 1 hour of manually transcribed data. These are the models used to initialize the process of automatically transcribing large quantities of data. These word error rates range from 33% to 36% across the language models.</Paragraph> <Paragraph position="10"> We compared a straightforward approach of training on all the automatically annotated data with one in which the closed-captions are used to filter the hypothesized transcriptions, removing words that are &quot;incorrect&quot;. In the filtered case, the hypothesized transcriptions are aligned with the closed captions story by story, and only the regions where the automatic transcripts agree with the closed captions are kept for training purposes. To our surprise, broadly comparable recognition results were obtained both with and without filtering, suggesting that the inclusion of the closed-captions in the language model training material provides sufficient supervision (see Table 5).</Paragraph> <Paragraph position="11"> It should be noted that in both cases the closed-caption story boundaries are used to delimit the audio segments after automatic transcription.</Paragraph> <Paragraph position="12"> To investigate this further, we are assessing the effects of reducing the amount of supervision provided by the language model training texts on the accuracy of the resulting acoustic models (see Table 4). With 14 hours (raw) of approximately labeled training data, the word error is reduced by about 20% for all LMs compared with training on 1h of data with careful manual transcriptions. Using larger amounts of data transcribed with the same initial acoustic models gives smaller improvements, as seen in the entries for 28h and 58h. The commercial transcripts (LMn+t and LMn+to), even if predating the data epoch, are seen to be more important than the closed-captions (LMn+c), supporting the earlier observation that they are closer to spoken language. Even if only news texts from the same period (LMn) are available, these provide adequate supervision for lightly supervised acoustic model training.</Paragraph> <Paragraph position="13"> Table 5: Word error rates for increasing quantities of automatically labeled training data on the 1999 evaluation test sets, using gender- and bandwidth-independent acoustic models with the language model LMn+t+c (trained on NEWS, COM and closed-captions through May98).</Paragraph> <Paragraph position="14"> Table 5 columns: amount of training data (raw, unfiltered, filtered) and %WER (unfiltered, filtered).</Paragraph>
The difference in the amounts of data transcribed and actually used for training is due to three factors. First, the total duration includes non-speech segments, which are eliminated during partitioning prior to recognition. Second, the story boundaries in the closed captions are used to eliminate irrelevant portions, such as commercials. Third, since many silence frames remain, only a portion of these is retained for training.
</Section> <Section position="7" start_page="1" end_page="2" type="metho"> <SectionTitle> 6. TASK ADAPTATION </SectionTitle> <Paragraph position="0"> The experiments reported in Section 3 show that while direct recognition with the reference BN acoustic models gives relatively competitive results, the WER on the targeted tasks can still be improved. Since we want to minimize the cost and effort involved in tuning to a target task, we are investigating methods to transparently adapt the reference acoustic models. By transparent we mean that the procedure is automatic and can be carried out without any human expertise. We therefore apply the approach presented in the previous section: the reference BN system is used to transcribe the training data of the target task. This of course supposes that audio data have been collected. However, this can be done with an operational system, and the cost of collecting task-specific training data is greatly reduced since no manual transcriptions are needed. The performance of the BN models under cross-task conditions is well within the range for which the approximate transcriptions can be used for acoustic model adaptation.</Paragraph> <Paragraph position="1"> Table 6: Word error rates (%) for the TI-digits, ATIS94, WSJ95 and S9 WSJ93 test sets after recognition with four different configurations, all including task-specific lexica and LMs: (left) BN acoustic models, (middle left) unsupervised adaptation of the BN acoustic models, (middle right) supervised adaptation of the BN acoustic models and (right) task-dependent acoustic models.</Paragraph> <Paragraph position="2"> The reference acoustic models are then adapted by means of conventional adaptation techniques such as MLLR and MAP. Thus there is no need to design a new set of models based on the characteristics of the training data. Adaptation is also preferred to the training of new models, as it is likely that the new training data will have a lower phonemic contextual coverage than the original reference models.</Paragraph> <Paragraph position="3"> The cross-task unsupervised adaptation is evaluated for three tasks: TI-digits, ATIS and WSJ. The 100 hours of WSJ training data were transcribed using the BN acoustic and language models. For ATIS, only 26 of the 40 hours of training data from 276 speakers were transcribed, due to time constraints. For the TI-digits, the training data were transcribed using a mixed configuration, combining the BN acoustic models with the simple digit-loop grammar.</Paragraph> <Paragraph position="4"> For completeness, we also used the task-specific audio data and the associated transcriptions to carry out supervised adaptation of the BN models.</Paragraph> <Paragraph position="5"> Gender-dependent acoustic models were estimated using the corresponding gender-dependent BN models as seeds and the gender-specific training utterances as adaptation data. For WSJ and ATIS, the speaker ids were used directly for gender identification, since in previous experiments with these test sets there were no gender classification errors. Only the acoustic models used in the second and third word decoding passes were adapted. For the TI-digits, the gender of each training utterance was automatically classified by decoding each utterance twice, once with each set of gender-dependent models.
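A minimal sketch of this classification step (with a hypothetical decode_score callable standing in for the recognizer interface) is:

def classify_gender(utterance, male_models, female_models, decode_score):
    """Decode the utterance twice, once with each set of gender-dependent
    models, and keep the gender whose models give the better global score.
    decode_score is a hypothetical callable wrapping the recognizer."""
    male_score = decode_score(utterance, male_models)
    female_score = decode_score(utterance, female_models)
    return "male" if male_score >= female_score else "female"

# Toy illustration with a stand-in scoring function.
fake_score = lambda utt, models: models["bias"] + len(utt)
print(classify_gender("oh one two", {"bias": 2.0}, {"bias": 1.0}, fake_score))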
The utterance gender was then determined from the better global score of the male and female models (99.0% correct classification).</Paragraph> <Paragraph position="6"> Both the MLLR and MAP adaptation techniques were applied.</Paragraph> <Paragraph position="7"> The recognition tests were carried out under mixed conditions (i.e., with the adapted acoustic models and the task-dependent LM). The BN models are first adapted using MLLR with a global transformation, followed by MAP adaptation. In order to assess the quality of the automatic transcriptions, we compared the system hypotheses to the manually provided training transcriptions; the resulting word error rates on the training data are 11.8% for WSJ, 29.1% for ATIS and 1.2% for TI-digits.</Paragraph> <Paragraph position="9"> The word error rates obtained with the task-adapted BN models are given in Table 6 for the four test sets. Using unsupervised adaptation, the performance is improved for the TI-digits (53% relative), WSJ (19% relative) and S9 (7% relative).</Paragraph> <Paragraph position="10"> The manual transcriptions for the targeted tasks were used to carry out supervised model adaptation. The results (see the 4th column of Table 6) show a clear improvement over unsupervised adaptation for both the TI-digits (60% relative) and ATIS (47% relative) tasks. A smaller gain of about 10% relative is obtained for the spontaneous dictation task, and only 3% relative for the read WSJ data. The gain appears to be correlated with the WER of the transcribed data: the difference between the BN and task-specific models is smaller for WSJ than for ATIS and the TI-digits. The TI-digits task is the only one for which the best performance is obtained using task-dependent models rather than BN models adapted with supervision. For the other tasks, the lowest WER is obtained when the supervised-adapted BN acoustic models are used: 3.2% for ATIS, 6.7% for WSJ and 11.4% for S9. This result confirms our hypothesis that better performance can be achieved by adapting generic models with task-specific data than by directly training task-specific models.</Paragraph> </Section> </Paper>