<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1028"> <Title>FIELD TEST EVALUATIONS and OPTIMIZATION of SPEAKER INDEPENDENT SPEECH RECOGNITION for TELEPHONE APPLICATIONS</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> FIELD TEST EVALUATIONS and OPTIMIZATION of SPEAKER INDEPENDENT SPEECH RECOGNITION for TELEPHONE APPLICATIONS Christian GAGNOULET and Christel SORIN CNET Dépt RCP 22301 LANNION-France ABSTRACT </SectionTitle> <Paragraph position="0"> The first part of this paper presents the detailed results of several field evaluations of the CNET speaker independent speech recognition system in the context of two voice-activated servers accessible to the general French public over the telephone. The analysis of roughly 11 000 user tokens indicates that the rejection of incorrect input is a major problem and that the gap between the recognition rates observed in real use conditions and in the most &quot;realistic&quot; laboratory tests remains very large. The second part of the paper describes the current improvements of the system : better rejection procedures, enhanced recognition performance resulting from both the introduction of field data in the training data and the increase in the number of parameters, and automatic adjustments of the HMM topology that either reduce overall model complexity or improve recognition performance. Tested on long distance telephone databases (450 to 750 speakers), the current version of the CNET recognition system yields a laboratory error rate of 0.7 % on the 10 French digits and of 0.95 % on a 36 word vocabulary.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> At CNET, speech recognition studies are specifically oriented toward the development of telecommunications applications. This implies the development of robust, speaker independent speech recognition systems, but also the design and evaluation of complete spoken dialogue systems, for which human factor studies are essential.</Paragraph> </Section> <Section position="3" start_page="0" end_page="160" type="metho"> <SectionTitle> SYSTEM OVERVIEW and FIELD TEST EVALUATIONS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="160" type="sub_section"> <SectionTitle> System overview </SectionTitle> <Paragraph position="0"> The connected speech recognition algorithm developed at CNET in 1986 \[1\] uses the HMM approach and has been implemented on several devices (RDP 20 and RDP 50 boards) \[2\]. In the implemented version of the algorithm, 6 Mel cepstral coefficients, the energy and its derivative are computed every 16 ms to obtain the input vectors.</Paragraph> <Paragraph position="1"> The observation probabilities are represented by gaussian functions with diagonal covariance matrices and are tied to the transitions of the Markov chains.
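As a concrete illustration of the acoustic modelling described above, the sketch below evaluates the log-probability of one input frame under a diagonal-covariance gaussian density of the kind tied to the HMM transitions. It is only a minimal sketch, assuming an 8-dimensional input vector (6 Mel cepstra, energy and its derivative); the actual CNET implementation details are not given in the paper.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log-density of one input frame x under a diagonal-covariance
    gaussian. In the CNET system such densities are tied to the
    transitions of the word (or sub-word) Markov chains."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

# Toy usage : one 8-dimensional frame (6 Mel cepstra + energy + its derivative),
# computed every 16 ms in the implemented front-end.
frame = np.zeros(8)
print(log_gaussian_diag(frame, mean=np.zeros(8), var=np.ones(8)))
```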
Various kinds of modelling can be implemented : either word units or sub-word units such as phonemes, diphones or allophones.</Paragraph> <Paragraph position="2"> For each application, the network is fully compiled and includes initial and final silence models for each word.</Paragraph> <Paragraph position="3"> This system has been tested on several databases of isolated and connected words recorded over the telephone network with willing subjects (mainly long distance lines, speakers representing different regional accents) : DIGITS-1 (455 speakers, 10 French digits), NUMBERS (730 speakers, 00...99 in French), TREGOR (513 speakers, 36 French words). For each database, one half of the data was used for training, the other half for testing. Using word models, the word error rates obtained with the first version of the system were 2.1 % for DIGITS-1, 2.7 % for TREGOR and 9.6 % for NUMBERS.</Paragraph> <Paragraph position="4"> Field test evaluations Experimental server &quot;MAIRIEVOX&quot; In 1988, an experimental, one-port voice interactive system, MAIRIEVOX \[3\], was built on a PC computer using the RDP 50 board (word models, 13 states/word, 3 gaussian pdfs per state). Designed to give various information about local services around the city of Lannion (20 000 inhabitants), MAIRIEVOX has been accessible to the general public over the telephone since mid-88.</Paragraph> <Paragraph position="5"> The input interface for the user is restricted to voice input, without any complementary keypad command. A tree structure is used to access information. The complete vocabulary contains 21 words (extracted from the 36 word TREGOR database) but the dialogue module limits the active vocabulary to 6 words at each step.</Paragraph> <Paragraph position="6"> Since that time, MAIRIEVOX has been the subject of several field trials which have allowed us to identify its critical points and to substantially improve both the speech recognition performance and the acceptability of the service.</Paragraph> <Paragraph position="7"> The first evaluations (during which the input signal was not recorded) mainly allowed us to improve the ergonomics of the service. For example, it appeared extremely useful to authorize the recognition of speech commands during the delivery of the voice messages : this allows regular users of the service to anticipate the commands and therefore to quickly reach the required information in the dialogue tree. An echo-cancellation procedure (non-recursive filter with an 8 ms window) was thus introduced on the speech recognition board. The dialogue strategy was also modified to allow recovery from most recognition errors (&quot;confirmation&quot; procedures with Yes/No commands in case of recognition difficulties). With these two main improvements, the acceptability of the isolated-word, menu-driven speech command server has been demonstrated (less than 10 % failure in the access to the requested information).</Paragraph>
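The echo-cancellation step mentioned above is what makes barge-in (speaking over a prompt) usable. The paper only specifies a non-recursive (FIR) filter with an 8 ms window ; the sketch below additionally assumes 8 kHz sampling (64 taps) and an NLMS adaptation rule, so it should be read as an illustration rather than the CNET implementation.

```python
import numpy as np

def fir_echo_canceller(far_end, mic, n_taps=64, mu=0.5, eps=1e-6):
    """Illustrative non-recursive (FIR) echo canceller with NLMS adaptation.
    far_end : samples of the prompt played to the line ; mic : samples
    picked up from the line (caller's speech + echo of the prompt)."""
    w = np.zeros(n_taps)          # FIR estimate of the echo path (8 ms at 8 kHz)
    x_buf = np.zeros(n_taps)      # most recent far-end samples
    out = np.zeros(len(mic))      # echo-suppressed signal fed to the recognizer
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        e = mic[n] - np.dot(w, x_buf)                       # residual after echo removal
        w += mu * e * x_buf / (np.dot(x_buf, x_buf) + eps)  # NLMS update
        out[n] = e
    return out
```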
<Paragraph position="8"> During 1990, a new set of evaluations was carried out : roughly 4600 voice inputs (corresponding to 340 telephone calls) were systematically recorded, listened to and labelled as &quot;correct inputs&quot; (55.5 %), &quot;incorrect speech inputs&quot; (i.e. not permitted by the dialogue) (17.8 %) and &quot;noise&quot; (26.7 %).</Paragraph> <Paragraph position="9"> From the application point of view, the rejection of incorrect inputs therefore appears to be a crucial point : despite clear instructions to the caller, roughly 45 % of the inputs to MAIRIEVOX are incorrect (words outside the vocabulary or noise). The simple rejection procedure used in MAIRIEVOX (all the vocabulary words are candidates at any time even if the dialogue module filters out the words which are not valid in the context, plus a simple duration-based rejection threshold) limited the false rejection error rate on &quot;correct inputs&quot; to roughly 10 %. Of the incorrect inputs, 82 % are correctly rejected but 18 % induce an error (false acceptance).</Paragraph> <Paragraph position="10"> From the recognition point of view (&quot;correct inputs&quot; only), we observed a 12.2 % error rate (21 valid words), 36 % of which was due to bad end point detection (truncated words). On the other hand, contrary to previous observations, the need for modelling hesitations (or surrounding speech) did not really appear to be crucial : less than 5 % of the speech inputs contain hesitations or supplementary words (the design of the dialogue seems to play an essential role in this phenomenon).</Paragraph> </Section> <Section position="2" start_page="160" end_page="160" type="sub_section"> <SectionTitle> Industrial Server &quot;Horoscope&quot; </SectionTitle> <Paragraph position="0"> A commercial voice-activated server, &quot;HOROSCOPE&quot;, has been operating since April 1990 over the 9 taxation areas of the French telephone network. Based on the same recognition technology as MAIRIEVOX, it involves the recognition of the 12 horoscope signs spoken in an isolated manner. The caller could ask for a horoscope sign at any time (branching factor of 12), any number of times, either by waiting for the end of a message playback or by interrupting it. The very direct dialogue procedure prevented the use of any dialogue-driven rejection process (contrary to MAIRIEVOX).</Paragraph> <Paragraph position="1"> During June 1990, 6446 tokens from 1724 calls \[4\] were recorded, listened to and labelled as &quot;correct speech inputs&quot; (73 %), &quot;incorrect speech inputs&quot; (speech without any valid word of the vocabulary) and non-speech inputs ; here again, only a small proportion of the &quot;correct speech&quot; inputs contain hesitations or supplementary words.</Paragraph> <Paragraph position="2"> The lack of perfect noise/speech discrimination in the endpoint detector aggravates the problem, as already observed for MAIRIEVOX : of the 27.1 % word error rate observed on &quot;correct inputs&quot;, roughly 50 % is due to bad endpoint detection. The very low recognition score observed here results from 3 main shortcomings in the realisation of this industrial system : 1) only 1 gaussian/state was used in the 13 state word models, 2) only 150 speakers were used for training the models, 3) the implemented echo-cancellation procedure was a very simplified version of the procedure proposed by CNET (a 10 dB difference was observed between the 2 attenuation rates).</Paragraph>
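Both servers relied on very simple rejection cues at this stage. As a rough illustration of the duration-based rejection threshold mentioned above for MAIRIEVOX (the actual bounds are not given in the paper and are purely hypothetical here), such a criterion amounts to :

```python
def duration_reject(n_frames, min_frames=15, max_frames=90):
    """Schematic duration-based rejection : a token whose detected length
    (in 16 ms frames) falls outside a plausible range for the vocabulary
    is rejected before any recognition result is returned. The bounds
    used here are illustrative, not the CNET values."""
    return n_frames < min_frames or n_frames > max_frames
```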
<Paragraph position="3"> In conclusion, after assessing two general-public word-recognition applications in use over the French telephone network, it was found that, despite clear instructions to the caller, a considerable proportion of the input lies outside the permitted vocabulary. These extraneous inputs are either incorrect speech tokens or non-speech tokens for which the caller is not always responsible (DTMF dialing, line bursts, outside noise, etc.). There is therefore an urgent need for efficient rejection procedures. Moreover, the gap between the recognition error rates observed in real use conditions and in laboratory tests is very large : a multiplicative factor of 3-4 is observed, and it can reach 10 if the application is carelessly designed and modelled.</Paragraph> </Section> </Section> <Section position="4" start_page="160" end_page="161" type="metho"> <SectionTitle> NEW REJECTION PROCEDURES </SectionTitle> <Paragraph position="0"> Two rejection procedures \[5\] have been investigated and compared on the &quot;HOROSCOPE&quot; field database containing all the &quot;correct&quot; and &quot;extraneous&quot; tokens recorded during the &quot;Horoscope&quot; field trials, to which were added 1699 tokens from 151 willing subjects recorded through the telephone network. Half of the data was used for training, the other half for testing.</Paragraph> <Paragraph position="1"> The first rejection procedure uses 3 sink models trained with the &quot;extraneous&quot; tokens (incorrect or noise inputs) of the training corpus and imposes thresholds on word-model scoring : the rejection threshold is applied to a &quot;corrected score&quot;, which is the word HMM score minus the contribution of the silence models.</Paragraph> <Paragraph position="2"> The second rejection procedure operates on the &quot;trace&quot; of the HMM (i.e. information on the optimal Viterbi path).</Paragraph> <Paragraph position="3"> It involves the extraction of the HMM trace from a given input token and the classification of this trace into &quot;acceptance&quot; or &quot;rejection&quot; by a multi-layer perceptron (MLP). This rejection procedure is independent of the recognition process : it uses HMMs designed with the sole purpose of producing informative traces.</Paragraph> <Paragraph position="4"> For the &quot;trace&quot; rejection procedure, the best results were obtained with a trace containing 1) the number of frames observed per gaussian, 2) the average energy coefficient and 3) the average first Mel frequency coefficient of the frames observed per gaussian, i.e. with a trace exhibiting both a duration and a signal representation.</Paragraph> <Paragraph position="5"> The results of both procedures are illustrated in Figure 1, where the sum of the SE (substitution error) rate and the FR (false rejection) rate measures performance on correct tokens and the FA (false acceptance) rate measures performance on extraneous tokens.</Paragraph> <Paragraph position="7"> Figure 1 : Rejection using the HMM trace (full curve) and rejection using sink models and a score threshold (dashed curve) ; axis label : % error rate on correct tokens (SE+FR). Although the performances of both procedures appear to be similar within the confidence interval, there is one aspect unique to the rejection by trace : its ability to reject a large proportion of the substitution errors instead of proposing them to the user (substitution rejection rate of 66 %).</Paragraph> <Paragraph position="8"> Work is currently underway to refine the trace-based procedure. Another promising direction seems to be to combine these two complementary methods.</Paragraph>
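To make the trace-based procedure more concrete, the sketch below builds the trace vector described above from a Viterbi alignment and scores it with a one-hidden-layer perceptron. The MLP topology, training details and feature scaling are not given in the paper, so everything beyond the three trace components is an assumption.

```python
import numpy as np

def trace_features(frame_to_gauss, energy, mel1, n_gauss):
    """HMM 'trace' for rejection : per gaussian of the optimal Viterbi
    path, (1) the number of frames it absorbed, (2) their average energy
    coefficient and (3) their average first Mel frequency coefficient."""
    counts = np.zeros(n_gauss)
    e_sum = np.zeros(n_gauss)
    c1_sum = np.zeros(n_gauss)
    for g, e, c in zip(frame_to_gauss, energy, mel1):
        counts[g] += 1
        e_sum[g] += e
        c1_sum[g] += c
    avg_e = np.divide(e_sum, counts, out=np.zeros(n_gauss), where=counts > 0)
    avg_c1 = np.divide(c1_sum, counts, out=np.zeros(n_gauss), where=counts > 0)
    return np.concatenate([counts, avg_e, avg_c1])

def mlp_accept(trace, w1, b1, w2, b2, threshold=0.5):
    """One-hidden-layer perceptron classifying the trace into acceptance
    (True) or rejection (False). Weights are assumed to have been trained
    on labelled 'correct' vs 'extraneous' field tokens."""
    h = np.tanh(w1 @ trace + b1)
    score = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))
    return score > threshold
```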
</Section> <Section position="5" start_page="161" end_page="162" type="metho"> <SectionTitle> RECOGNITION OPTIMIZATIONS </SectionTitle> <Paragraph position="0"> Use of field data in HMM training The constant gap between the recognition rates observed in real use conditions and in laboratory tests led us to investigate whether introducing field data into the training database can significantly improve recognition performance.</Paragraph> <Paragraph position="1"> In this experiment \[6\], two telephone-speech databases corresponding to the 21 word vocabulary of the MAIRIEVOX server have been used : - a &quot;LABoratory database&quot;, a subset of the TREGOR database : 513 willing subjects, 9797 uniformly distributed tokens, - an &quot;EXPloitation database&quot;, an extension of the previously introduced field database : 1547 naive speakers (real users) produced 9536 &quot;correct tokens&quot;, non uniformly distributed among the 21 words. Both databases exclusively hold manually validated data, i.e. data labelled as &quot;perfect&quot; (not truncated and without hesitations or supplementary words) after listening. Each database was split into two equal parts : one for training, the other for testing.</Paragraph> <Paragraph position="2"> The training of the HMM word models was done either on the LAB database, on the EXP database or on a MIXed database containing an equal proportion of laboratory and field data. The results are shown in Table 1 (word error rate).</Paragraph> <Paragraph position="3"> Table 1 : Word error rate for a 21 word vocabulary (long distance telephone speech) : influence of &quot;field&quot; data introduced in training. It can be seen that the use of &quot;MIXed&quot; models leads to a 30 % reduction of the recognition error rate on the field database : the introduction of field data in the training phase does improve field recognition performance. Work is currently underway to achieve on-line selection of the &quot;correct&quot; field data to be introduced in a &quot;retraining&quot; phase of systems in exploitation.</Paragraph> <Paragraph position="4"> Increasing the number of parameters Several studies have shown the usefulness of adding time-dependent information to the HMM input vectors. Table 2 illustrates the results of various tests on the DIGITS-1 database (455 speakers) using input vectors containing either 9 acoustic coefficients (8 MFCC and energy), 18 acoustic coefficients (the same as above plus their first derivative) \[7\] or 27 acoustic coefficients (second derivative added).</Paragraph>
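The 18- and 27-coefficient vectors mentioned above are obtained by appending first (and then second) time derivatives to the 9 static coefficients. A common way to compute such derivatives is a linear regression over neighbouring frames, as sketched below ; the regression window length is an assumption, since the paper does not specify how the derivatives were estimated.

```python
import numpy as np

def append_derivatives(static, win=2):
    """Append first and second time derivatives to a (T, 9) array of
    static coefficients (8 MFCC + energy), giving 27 coefficients per
    frame. The regression window length `win` is illustrative only."""
    def regression_delta(feats):
        T = feats.shape[0]
        padded = np.pad(feats, ((win, win), (0, 0)), mode='edge')
        num = sum(k * (padded[win + k:win + k + T] - padded[win - k:win - k + T])
                  for k in range(1, win + 1))
        return num / (2 * sum(k * k for k in range(1, win + 1)))

    delta = regression_delta(static)    # first derivative : 9 -> 18 coefficients
    delta2 = regression_delta(delta)    # second derivative : 18 -> 27 coefficients
    return np.hstack([static, delta, delta2])

# Toy usage : 100 frames of 9 static coefficients
features = append_derivatives(np.random.randn(100, 9))
print(features.shape)  # (100, 27)
```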
<Paragraph position="5"> It is also well known that increasing the size of the models (i.e. the number of states and pdf's) yields better performance, at least for isolated word recognition.</Paragraph> <Paragraph position="6"> Comparative results between 13 state and 30 state word models are shown in Table 2.</Paragraph> <Paragraph position="7"> The new version of the recognition algorithm implemented on the RDP 50 board (TMS 320C25) yields an error rate of 0.69 % (41 state word models, 27 acoustic coefficients) on an expanded version of the long distance telephone DIGITS database (775 speakers) and of 0.95 % (18 acoustic coefficients, word model size depending on the word length) on the TREGOR long distance telephone database (36 words, 513 speakers).</Paragraph> <Paragraph position="8"> Automatic adjustments of the structure of HMM models Using whole-word basic units is generally a good choice for small vocabulary isolated word recognition, and increasing the size of the models usually leads to better performance. However, this also increases the computation time, due to the number of observation probabilities (gaussian functions) that must be computed for each frame. Thus, in order to use the best possible model in real-time industrial devices, it was useful to investigate the possibility of reducing the number of gaussian functions by clustering &quot;similar&quot; pdf's. This was done by iteratively merging the two gaussian pdf's inducing the smallest decrease of the total probability of the training observations, until the desired number of pdf's is reached \[8\]. On the 36 word TREGOR database, this procedure reduced the number of gaussian functions by 40 % while keeping identical performance. Using sub-word basic units leads to more compact models (since all the occurrences of a given unit share the same set of pdf's), but it is difficult to increase the a priori size of the acoustic models (they may become too long). An algorithm has thus been developed \[8\] around the two following basic ideas : splitting the pdf's having the highest contribution to the probability of the training data, and discarding the transitions which are scarcely used. These two operators (splitting and discarding) are applied successively, and the model is re-trained after each modification. By applying this procedure to a pseudo-diphone based model \[1\], the recognition error rate has been reduced from 2.5 % to 1.8 % on the 36 word TREGOR telephone database used above.</Paragraph> </Section> </Paper>
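As an illustration of the pdf-clustering step described in the last paragraph, the sketch below greedily merges pairs of diagonal-covariance gaussians. The merge cost uses a standard moment-matching approximation of the likelihood loss ; the exact criterion and bookkeeping used at CNET are not given in the paper.

```python
import numpy as np

def merge_cost(w1, m1, v1, w2, m2, v2):
    """Approximate decrease in training log-likelihood when two weighted
    diagonal gaussians are replaced by their moment-matched merge
    (increase of the weighted log-determinant used as a proxy)."""
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return 0.5 * (w * np.sum(np.log(v))
                  - w1 * np.sum(np.log(v1)) - w2 * np.sum(np.log(v2)))

def greedy_merge(weights, means, variances, target):
    """Iteratively merge the pair of pdf's with the smallest cost until
    only `target` gaussians remain (cf. the 40 % reduction reported on
    the TREGOR vocabulary with unchanged performance)."""
    W, M, V = list(weights), list(means), list(variances)
    while len(W) > target:
        _, i, j = min((merge_cost(W[i], M[i], V[i], W[j], M[j], V[j]), i, j)
                      for i in range(len(W)) for j in range(i + 1, len(W)))
        w = W[i] + W[j]
        m = (W[i] * M[i] + W[j] * M[j]) / w
        v = (W[i] * (V[i] + (M[i] - m) ** 2) + W[j] * (V[j] + (M[j] - m) ** 2)) / w
        for k in (j, i):                     # delete higher index first
            del W[k], M[k], V[k]
        W.append(w); M.append(m); V.append(v)
    return W, M, V
```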