<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1019"> <Title>Identification of Non-Linguistic Speech Features</Title> <Section position="3" start_page="0" end_page="96" type="metho"> <SectionTitle> PHONE-BASED ACOUSTIC LIKELIHOODS </SectionTitle> <Paragraph position="0"> The basic idea is to train a set of large phone-based ergodic hidden Markov models (HMMs) for each non-linguistic feature to be identified (language, gender, speaker, ...). Feature identification on the incoming signal x is then performed by computing the acoustic likelihoods f(xlAi) for all the models Ai of a given set. The feature value corresponding to the model with the highest likelihood is then hypothesized. This decoding procedure can efficiently be implemented by processing all the models in parallel using a time-synchronous beam search strategy.</Paragraph> <Paragraph position="1"> This approach has the following advantages: * It can perform text-independent feature recognition. (Textdependent feature recognition can also be performed.) * It is more precise than methods based on long-term statistics such as long term spectra, VQ codebooks, or probabilistic acoustic maps\[26, 28\].</Paragraph> <Paragraph position="2"> * It can easily take advantage of phonotactic constraints. (Theseare shown to be useful for language identification.) * It can easily be integrated in recognizers which are based on phone models as all the components already exist.</Paragraph> <Paragraph position="3"> A disadvantage of the approach is that, at least in the current formulation, phonetic labels are required for training the models. However, there is in theory no absolute need for phonetic labeling of the speech training data to estimate the HMM parameters. Labeling of a small portion of the training data can be enough to bootstrap the training procedure and insure the phone-based nature of the resulting models.</Paragraph> <Paragraph position="4"> (In this case, phonotactic constraints must be obtained only from speech corpora.) We have sucessfully experimented with this approach for speaker identification.</Paragraph> <Paragraph position="5"> In our implementation, each large ergodic HMM is built from small left-to-right phonetic HMMs. The Viterbi algorithm is used to compute the joint likelihood f(x, s lAi) of the incoming signal and the most likely state sequence instead of f(xlAi). This implementation is therefore nothing more than a slightly modified phone recognizer with language-, sex-, or speaker- dependent model sets used in parallel, and where the output phone string is ignored 1 and only the acoustic likelihood for each model is taken into account.</Paragraph> <Paragraph position="6"> The phone recognizer can use either context-dependent or context-independent phone models, where each phone model is a 3-state left-to-fight continuous density hidden Markov model (CDHMM) with Gaussian mixture observation densities. The covariance matrices of all Gaussian components are diagonal. Duration is modeled with a gamma distribution per phone model. As proposed by Rabiner et a1.\[23\], the HMM and duration parameters are estimated separately and combined in the recognition process for the Viterbi search.</Paragraph> <Paragraph position="7"> Maximum likelihood estimators are used to derive language specific models whereas maximum a posteriori (MAP) estimators are used to generate sex- and speaker- specific models as has already been proposed in \[11\]. 
<Paragraph position="7"> Maximum likelihood estimators are used to derive language-specific models, whereas maximum a posteriori (MAP) estimators are used to generate sex- and speaker-specific models, as has already been proposed in [11]. The MAP estimates are obtained with the segmental MAP algorithm [16, 9, 10] using speaker-independent seed models. These seed models are used to estimate the parameters of the prior densities and serve as an initial estimate for the segmental MAP algorithm. This approach provides a way to incorporate prior information into the model training process and is particularly useful for building the speaker-specific models when only a small amount of speaker-specific data is available.</Paragraph> <Paragraph position="8"> In our earlier reported results using this approach for language and speaker identification [13, 14, 7], the acoustic likelihoods were computed sequentially for each of the models. As mentioned earlier, the Viterbi decoder is now implemented as a one-pass beam search procedure applied to all the models in parallel, resulting in an efficient decoding procedure which saves a great deal of computation.</Paragraph> </Section>
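Before turning to the experimental conditions, a small sketch of the Gaussian-mean update at the heart of the MAP adaptation described above. This is the standard conjugate-prior form with an assumed prior weight tau; the paper's segmental MAP algorithm additionally alternates such updates with Viterbi segmentation of the adaptation data.

    import numpy as np

    def map_update_mean(prior_mean, frames, gammas, tau=10.0):
        # MAP re-estimate of one Gaussian mean.
        # prior_mean: (D,) mean from the SI seed model (the prior mode)
        # frames:     (T, D) adaptation frames
        # gammas:     (T,) occupation probabilities of this Gaussian
        # tau: prior weight -- with little adaptation data the estimate stays
        # close to the seed, which is why MAP suits sparse speaker data.
        occ = gammas.sum()
        acc = (gammas[:, None] * frames).sum(axis=0)
        return (tau * prior_mean + acc) / (tau + occ)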
<Section position="4" start_page="96" end_page="97" type="metho"> <SectionTitle> EXPERIMENTAL CONDITIONS </SectionTitle> <Paragraph position="0"> Four corpora have been used to carry out the experiments reported in this paper: BDSONS [2] and BREF [15, 8] for French, and TIMIT [4] and WSJ0 [22] for English. From the BDSONS corpus only the phonetically balanced sentence sub-corpus (CDROM 6) has been used, for testing, whereas, depending on the experiment, the three other corpora have been used for both training and testing.</Paragraph> <Paragraph position="1"> The BDSONS Corpus: BDSONS, Base de Données des Sons du Français [2], was designed to provide a large corpus of French speech data for the study of the sounds of the French language and to aid speech research. The corpus contains an &quot;evaluation&quot; subcorpus consisting primarily of isolated and connected letters, digits and words from 32 speakers (16m/16f), and an &quot;acoustic&quot; subcorpus which includes phonetically balanced words and sentences from 12 speakers (6m/6f).</Paragraph> <Paragraph position="2"> The BREF Corpus: BREF is a large read-speech corpus, containing over 100 hours of speech material from 120 speakers (55m/65f) [15]. The text materials were selected verbatim from the French newspaper Le Monde, so as to provide a large vocabulary (over 20,000 words) and a wide range of phonetic environments [8]. Containing 1115 distinct diphones and over 17,500 triphones, BREF can be used to train vocabulary-independent phonetic models. The text material was read without verbalized punctuation.</Paragraph> <Paragraph position="3"> The DARPA WSJ0 Corpus: The DARPA Wall Street Journal-based Continuous-Speech Corpus (WSJ) [22] has been designed to provide general-purpose speech data (primarily read speech) with large vocabularies. Text materials were selected to provide training and test data for 5K and 20K word, closed and open vocabularies, with both verbalized and non-verbalized punctuation. The recorded speech material supports both speaker-dependent and speaker-independent training and evaluation.</Paragraph> <Paragraph position="4"> The DARPA TIMIT Corpus: The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus [4] is a corpus of read speech designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the U.S. The TIMIT CDROM [4] contains a training/test subdivision of the data that ensures that there is no overlap in the text materials. All of the utterances in TIMIT have associated time-aligned phonetic transcriptions.</Paragraph> <Paragraph position="5"> Since the identification of non-linguistic speech features is based on phone recognition, some phone recognition results for the above corpora are given here. The speaker-independent (SI) phone recognizers use sets of context-dependent (CD) models which were automatically selected based on their frequencies in the training data. There are 428 sex-dependent CD models for BREF, 1619 for WSJ and 459 for TIMIT. Phone error rates are given in Table 1. For BREF and WSJ phone errors are reported after removing silences, whereas for TIMIT silences are included as transcribed. Scoring without the sentence initial/final silence increases the phone error by about 1.5%. The phone error rate is 21.3% for BREF, 25.7% for WSJ (Feb-92 5knvp), and 27.6% for TIMIT (complete test set), scored using the 39-phone set proposed in [18]. These results are provided to calibrate the recognizers used in the experiments in this paper and to observe differences between the corpora. It appears that the BREF data is easiest to recognize at the phone level, and that TIMIT is more difficult than WSJ.</Paragraph> </Section> <Section position="5" start_page="97" end_page="98" type="metho"> <SectionTitle> SEX IDENTIFICATION </SectionTitle> <Paragraph position="0"> It is well known that the use of sex-dependent models gives improved performance over one set of speaker-independent models. However, this approach can be costly in terms of computation for medium-to-large-size tasks, since recognition of the unknown sentence is typically carried out twice, once for each sex. A logical alternative is to first determine the speaker's sex, and then to perform word recognition using the models of the selected sex. This is the approach used in our Nov-92 WSJ system [6]. In these experiments the standard SI-84 training material, containing 7240 sentences from 84 speakers (42m/42f), is used to build speaker-independent phone models. Sex-dependent models are then obtained using MAP estimation [11] with the SI seed models. The phone likelihoods using context-dependent male and female models were computed, and the sex of the speaker was selected as the sex associated with the models that gave the highest likelihood. Since these CD male and female models are the same as are used for word recognition, there is no need for additional training material or effort. No errors were observed in sex identification for WSJ on the Feb-92 or Nov-92 5k test data containing 851 sentences from 18 speakers (10m/8f). A sketch of this selection pass is given below.</Paragraph>
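A sketch of the two-pass organization just described (hypothetical helper names; loglik stands for any acoustic scorer, such as the Viterbi one sketched earlier): the same sex-dependent CD models later used for word recognition first vote on the speaker's sex, and the winning set is reused for the recognition pass.

    def pick_sex(frames, male_set, female_set, loglik):
        # loglik(frames, model_set) -> acoustic log-likelihood of the utterance.
        lm = loglik(frames, male_set)
        lf = loglik(frames, female_set)
        # Return the hypothesized sex plus the models for the word-recognition pass.
        return ("m", male_set) if lm >= lf else ("f", female_set)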
<Paragraph position="1"> For BREF, sex-dependent models were also obtained from SI seeds by MAP estimation. The training data consisted of 2770 sentences from 57 speakers (28m/29f). No errors in sex identification were observed on 109 test sentences from 21 test speakers (10m/11f).</Paragraph> <Paragraph position="2"> To further investigate sex identification based on acoustic likelihoods with a larger set of speakers, the approach was evaluated on the 168 speakers of the TIMIT test corpus. The SI seed models were trained using all the available training data, i.e., 4620 sentences from 462 speakers, and adapted using data from the 326 male and 136 female training speakers to form gender-specific models. The test data consist of 1344 sentences, comprised of 8 sentences from each of the 168 test speakers (112m/56f). Results are shown in the first row of Table 2, where the error rate is given as a function of the speech duration. Each speech segment used for the test is part of a single sentence and always starts at the beginning of the sentence, preceded by about 100 ms of silence. These results on this more significant test show that the sex identification error rate using phone-based acoustic likelihoods is 2.8% with 400 ms of speech and is under 1% with 2 s of speech.</Paragraph> <Paragraph position="3"> The 400 ms of speech signal (which includes about 100 ms of silence) represents about 4 phones, roughly the number found in a typical word (avg. 3.9 phones/word) in TIMIT. This implies that before the speaker has finished enunciating the first word, one is fairly certain of the speaker's sex. Sentences misclassified with regard to the speaker's sex had better phone recognition accuracies with the cross-sex models.</Paragraph> <Paragraph position="4"> Using exactly the same test data and the same phone models, an experiment on text-dependent sex identification was carried out in order to assess whether adding linguistic information makes the speaker's gender easier to identify. To do this, a long left-to-right HMM is built for each sex by concatenating the sex-dependent CD phone models corresponding to the TIMIT transcriptions. The basic idea is to measure the lower bound on the error rate that would be obtained if higher-order knowledge such as lexical information were provided.</Paragraph> <Paragraph position="5"> The acoustic likelihoods are then computed for the two models. These likelihood values are lower than those obtained for text-independent identification. The results are given in the second row of Table 2, where it can be seen that the error rate is no better than the error rate obtained with the text-independent method. This shows that acoustic-phonetic knowledge is sufficient to accomplish this task.</Paragraph> <Paragraph position="6"> Knowing the speaker's sex also lets a spoken-language system address the user appropriately: in French, where formalities are used perhaps more than in English, system acceptance may be easier if the familiar &quot;Bonjour Madame&quot; or &quot;Je vous en prie Monsieur&quot; is foreseen.</Paragraph> <Paragraph position="7"> Since sex identification is not perfect, some fall-back mechanism must be integrated to avoid including the signs of politeness if the system is unsure of the sex. This can be accomplished by comparing the likelihoods of the model sets, or by being wary of speakers for whom the better likelihood jumps back and forth between models. A sketch of such a check is given below.</Paragraph> </Section>
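One plausible form of that fall-back check (the per-frame normalization and the margin value are illustrative assumptions, not from the paper): commit to a gendered form of address only when the two model sets are separated by a clear likelihood margin.

    def sex_with_confidence(ll_male, ll_female, n_frames, margin=0.5):
        # Compare per-frame log-likelihoods; below the margin the dialogue
        # falls back to a neutral greeting instead of "Madame"/"Monsieur".
        diff = (ll_male - ll_female) / max(n_frames, 1)
        if abs(diff) < margin:
            return "unsure"
        return "m" if diff > 0 else "f"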
<Section position="6" start_page="98" end_page="99" type="metho"> <SectionTitle> LANGUAGE IDENTIFICATION </SectionTitle> <Paragraph position="0"> Language identification is another feature that can be identified using the same approach. In this case language-dependent models are used instead of sex-dependent ones.</Paragraph> <Paragraph position="1"> The basic idea is to process the unknown incoming speech in parallel by different sets of phone models (each set being a large ergodic HMM), one for each of the languages under consideration, and to choose the language associated with the model set providing the highest normalized likelihood. (This is in fact not a new idea: House and Neuberg (1977) [12] proposed a similar approach for language identification using models of broad phonetic classes, where we use phone models. Their experimental results, however, were synthetic, based on phonetic transcriptions derived from texts.) In this way, it is no longer necessary to ask the speaker to select the language before using the system. If the language can be accurately identified, it simplifies using speech recognition for a variety of applications, from selecting an appropriate operator to aiding with emergency assistance. Language identification can also be done using word recognition, but it is much more efficient to use phone recognition, which has the added advantage of being task independent.</Paragraph> <Paragraph position="2"> Experimental results for language identification for English/French were given in [13, 14], where models trained on TIMIT [4] and BREF [15] were tested on different sentences taken from the same corpus. While these results gave high identification accuracies (100% if an entire sentence is used, greater than 97% with 400 ms, and error-free with 1.6 s of speech signal), it is difficult to be sure that the language, and not the corpus, is being identified. Identification of independent data taken from the WSJ0 corpus was less accurate: 85% with 400 ms, and 4% error with 1.6 s of speech signal.</Paragraph> <Paragraph position="3"> In these experiments we attempted to avoid the bias due to corpus by testing both on data from the same corpora from which the models were built and on independent test data from different corpora. The language-dependent models are trained from similar-style corpora, BREF for French and WSJ0 for English, both containing read newspaper texts and similar-size vocabularies [8, 15, 22]. For each language a set of context-independent phone models was built, 35 for French and 46 for English. (The French phone set includes, among other classes, semivowels and silence; the phone table can be found in [5]. For English, the set of 46 phones includes 21 vowels (including 3 diphthongs and 3 schwas), 24 consonants (6 plosives, 8 fricatives, 2 affricates, 3 nasals, 5 semivowels), and silence.) Each phone model has 32 Gaussians per mixture, and no duration model is used. In order to minimize influences due to the use of different microphones and recording conditions, a 4 kHz bandwidth is used. The training data were the same as for sex identification on BREF (2770 sentences from 57 speakers) and WSJ (standard SI-84 training: 7240 sentences from 84 speakers).</Paragraph> <Paragraph position="4"> Language identification accuracies are given in Tables 3 and 4, without and with phonotactic constraints provided by a phone bigram (a simple rescoring view of such constraints is sketched below). Results are given for 4 test corpora, WSJ and TIMIT for English, and BREF and BDSONS for French, as a function of the duration of the speech signal, which includes approximately 100 ms of silence. As for speaker identification, the initial and final silences were automatically removed based on the HMM segmentation, so as to be able to compare language identification as a function of duration without biases due to long initial silences.</Paragraph>
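The sketch below gives a simplified rescoring view of those phone-bigram constraints; in the paper the bigram operates inside decoding, so this after-the-fact combination, the flooring of unseen bigrams, and the per-frame normalization are our assumptions.

    import math

    def language_score(acoustic_ll, phones, bigram_logp, n_frames):
        # bigram_logp: {(prev, cur): log P(cur | prev)} for one language.
        # Unseen bigrams are floored rather than scored as -infinity.
        lm = sum(bigram_logp.get((p, q), math.log(1e-6))
                 for p, q in zip(phones, phones[1:]))
        # Normalize by duration so scores are comparable across lengths.
        return (acoustic_ll + lm) / max(n_frames, 1)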
<Paragraph position="5"> The test data for WSJ are the first 10 sentences for each of the 10 speakers (5m/5f) in the Feb92-si5knvp (speaker-independent, 5k, non-verbalized punctuation) test data. For TIMIT, the 192 sentences of the &quot;coretest&quot; set, containing 8 sentences from each of 24 speakers (16m/8f), were used. The BREF test data consist of 130 sentences from 20 speakers (10m/10f), and for BDSONS the data comprise 121 sentences from 11 speakers (5m/6f).</Paragraph> <Paragraph position="6"> While WSJ sentences are more easily identified as English for short durations, errors persist longer than for TIMIT. In contrast, for French with 400 ms of signal, BDSONS data is better identified than BREF, perhaps because the sentences are phonetically balanced. For longer durations, BREF is slightly better identified than BDSONS. This performance indicates that language identification is task independent.</Paragraph> <Paragraph position="7"> Using phonotactic constraints is seen to improve language identification, particularly for short signals. The smallest improvement is seen for TIMIT, probably due to the nature of the selected sentences, which emphasize rare phone sequences. The error rate with 2 s of speech is less than 1%, and with 1 s of speech (not shown in the tables) it is about 2%. With 3 s of speech, language identification is almost error-free.</Paragraph> <Paragraph position="8"> Due to the source of the BREF and WSJ data, language identification is complicated by the inclusion of foreign words. One of the errors on BREF involved such a sentence. The sentence was identified as French at the beginning and then suddenly switched to English (a sketch of such a running decision is given at the end of this section). The sentence was &quot;Durant mon adolescence, je dévorais les récits westerns de Zane Grey, Luke Short, et Max Brand...&quot;, where the italicized words were pronounced in correct English.</Paragraph> <Paragraph position="9"> We are in the process of obtaining corpora for other languages to extend our language identification work. However, there is a variety of applications where even a bilingual French/English system would be of use, including air traffic control (where both French and English are permitted languages for flights within France), telecommunications applications, and many automated information centers, ticket distributors, and tellers, where one can already select between English and French with the keyboard or touch screen.</Paragraph> </Section>
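As referenced above, a tiny sketch (our construction, not from the paper) of a running language decision: report the leading model at each frame, so that a mid-utterance change of leader, as in the code-switched BREF sentence, becomes visible.

    def running_language(cumulative_scores):
        # cumulative_scores: list over time of {language: log-likelihood
        # accumulated up to that frame}; returns the leader at each step.
        return [max(scores, key=scores.get) for scores in cumulative_scores]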
<Section position="7" start_page="99" end_page="100" type="metho"> <SectionTitle> SPEAKER IDENTIFICATION </SectionTitle> <Paragraph position="0"> Speaker identification has been the topic of active research for many years (see, for example, [3, 21, 26]) and has many potential applications where propriety of information is a concern. In our experiments with speaker identification, a set of CI phone models was built for each speaker by supervised adaptation of SI models [11], and the unknown speech was recognized by all of the speakers' models in parallel. (Using HMMs for speaker recognition has been proposed previously; see [26] for a review, and also [24, 25].) Speaker identification experiments were performed using BREF for French and TIMIT for English.</Paragraph> <Paragraph position="1"> TIMIT has recently been used in a few studies on speaker identification [1, 20, 27, 14], with high speaker identification rates reported using subsets of 100 up to all 462 speakers.</Paragraph> <Paragraph position="2"> For the experiments with TIMIT, a speaker-independent set of 40 CI models was built using data from all of the 462 training speakers, with 8 kHz Mel frequency-based cepstral coefficients and their first-order differences. Model sets of 31 phones were then adapted to each of the 168 test speakers, using 8 sentences (2 SA, 3 SX, and 3 SI) for adaptation. We chose this set for the identification test so as to evaluate the performance for speakers not in the original SI training material, which greatly simplifies the enrollment procedure for new speakers.</Paragraph> <Paragraph position="3"> A reduced number of phones was used so as to minimize subtle distinctions and to reduce the number of models to be adapted. The remaining 2 SX sentences for each speaker were reserved for the identification test. While the original CI models had a maximum of 32 Gaussians, the adapted models were limited to 4 mixture components, since the amount of adaptation data was relatively limited.</Paragraph> <Paragraph position="4"> The unknown speech was recognized by all of the speakers' models in parallel by building one large HMM. Error rates are shown as a function of the speech signal duration in Table 5 for text-independent speaker identification. As for sex and language identification, the initial and final silences were adjusted to have a maximum duration of 100 ms according to the provided time-aligned transcriptions. Using the entire utterance, the identification accuracy is 98.5%. With 2.5 s of speech the speaker identification accuracy is 98.3%.</Paragraph> <Paragraph position="5"> For the small number of sentences longer than 3 s, speaker identification was correct, suggesting that performance will improve with longer sentences. This is also supported by the result that speaker identification using both sentences was 100%. (Table 5 reports the error rate as a function of duration for the 168 test speakers of TIMIT and 65 speakers from BREF; EOS is the end-of-sentence identification error rate.)</Paragraph> <Paragraph position="6"> For French, the acoustic seed models were 35 SI CI models, built using data from 57 BREF training speakers, excluding 10 sentences to be used for adaptation and test. In order to have a situation similar to English, these models were adapted to each of 65 speakers (including 8 new speakers not used in training) using only 8 sentences for adaptation and reserving 2 sentences for the identification test. Using only one sentence per speaker for identification, there is one error, giving an identification accuracy of 99.2%; when 2 sentences are used, all speakers are correctly identified (as observed for TIMIT). Speaker identification results are given in Table 5 for the 65 speakers (27m/38f) as a function of signal duration. A sketch of this enrollment-plus-identification procedure follows.</Paragraph>
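Concretely, assuming hypothetical callables adapt (the supervised MAP adaptation from the SI seeds) and loglik (acoustic scoring against one large parallel HMM), the loop might look like this:

    def enroll_speakers(seed_models, adaptation_data, adapt):
        # adaptation_data: {speaker: frames of the 8 enrollment sentences}.
        # Returns one adapted model set per speaker for parallel scoring.
        return {spk: adapt(seed_models, frames)
                for spk, frames in adaptation_data.items()}

    def identify_speaker(frames, speaker_models, loglik):
        # Hypothesize the speaker whose adapted models score highest.
        return max(speaker_models,
                   key=lambda spk: loglik(frames, speaker_models[spk]))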
<Paragraph position="7"> It can be noted that the identification accuracies as a function of time are similar for both corpora. However, since BREF sentences are somewhat longer than TIMIT sentences, the overall identification error rate per sentence is lower for BREF (EOS), even though the error for BREF at 2.5 s is greater. For both TIMIT and BREF, whenever there was a confusion, the speaker was confused with another speaker of the same sex.</Paragraph> <Paragraph position="8"> Experiments on text-dependent speaker identification using exactly the same models and test sentences were also performed. For both TIMIT and BREF a performance degradation was observed (on the order of 4%, using the accuracy at the end of the sentence). These results were contrary to our expectations, in that text-dependent speaker verification is typically considered to outperform text-independent verification [3, 19]. An experiment was also performed in which speaker-adapted models were built for each of the 168 test speakers from TIMIT without knowledge of the phonetic transcription, using the same 8 sentences for adaptation. Performing text-independent speaker identification as before on the remaining 2 sentences gives the results shown in Table 6 (EOS is the end-of-sentence identification error rate). As before, if both sentences are used for identification, the speaker identification accuracy is 100%. This experimental result indicates that the time-consuming step of providing phonetic transcriptions is not needed for accurate text-independent speaker identification.</Paragraph> </Section> <Section position="8" start_page="100" end_page="100" type="metho"> <SectionTitle> SUMMARY </SectionTitle> <Paragraph position="0"> In this paper we have reported on recent work on the identification of non-linguistic speech features from recorded signals using phone-based acoustic likelihoods. The inclusion of this technique in speech-based systems can broaden the scope of applications of speech technologies and lead to more user-friendly systems.</Paragraph> <Paragraph position="1"> The approach is based on training a set of large phone-based ergodic HMMs for each non-linguistic feature to be identified (language, gender, speaker, ...) and identifying the feature as that associated with the model having the highest acoustic likelihood of the set. The decoding procedure is efficiently implemented by processing all the models in parallel using a time-synchronous beam search strategy.</Paragraph> <Paragraph position="2"> This has been shown to be a powerful technique for sex, language, and speaker identification, and it has other possible applications such as dialect identification (including foreign accents) or identification of speech disfluencies. Sex identification for BREF and WSJ was error-free, and 99% accurate for TIMIT with 2 s of speech. With 2 s of speech the language is correctly identified as English or French with over 99% accuracy. Speaker identification accuracies of 98.5% on TIMIT (168 speakers) and 99.1% on BREF (65 speakers) were obtained with one utterance per speaker, and 100% if 2 utterances were used for identification. The same identification accuracy was obtained on the 168 speakers of TIMIT using unsupervised adaptation (sketched below), verifying that it is not necessary to provide phonetic transcriptions for accurate speaker identification. Being independent of the spoken text and requiring only a small amount of speech (on the order of 2.5 s), this technique is promising for a variety of applications, particularly those for which continual verification is preferable.</Paragraph>
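A sketch of the unsupervised variant just summarized (our formulation): the SI recognizer supplies its own phone segmentation of the enrollment sentences, so no manual phonetic transcription is needed before adaptation.

    def unsupervised_adapt(seed_models, frames, decode, adapt):
        # decode(seed_models, frames) -> hypothesized phone segmentation
        # adapt(seed_models, frames, labels) -> speaker-adapted model set
        auto_labels = decode(seed_models, frames)
        return adapt(seed_models, frames, auto_labels)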
<Paragraph position="3"> In conclusion, we propose a unified approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. This technique has been shown to be effective for language, sex, and speaker identification and can enable better and more user-friendly human-machine interaction.</Paragraph> </Section> </Paper>