<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1064"> <Title>The LIMSI Continuous Speech Dictation System</Title> <Section position="4" start_page="0" end_page="319" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> Speech recognition research at LIMSI aims to develop recognizers that are task-, speaker-, and vocabulary-independent, so that they can easily be adapted to a variety of applications. The applicability of speech recognition techniques developed for one language to other languages is of particular importance in Europe. The multilingual aspects are in part addressed in the context of the LRE SQALE (Speech recognizer Quality Assessment for Linguistic Engineering) project, which aims to assess language-dependent issues in multilingual recognizer evaluation. In this project, the same system will be evaluated on comparable tasks in different languages (English, French, and German) to determine cross-lingual differences, and different recognizers will be compared on the same language to assess the advantages of different recognition strategies.</Paragraph> <Paragraph position="1"> This paper addresses some of the primary issues in large-vocabulary, speaker-independent, continuous speech recognition for dictation. These issues include language modeling, acoustic modeling, lexical representation, and search.</Paragraph> <Paragraph position="2"> Acoustic modeling makes use of continuous density HMMs with Gaussian mixtures for context-dependent phone models.</Paragraph> <Paragraph position="3"> For language modeling, n-gram statistics are estimated on text material. [Footnote: This work is partially funded by the LRE project 62-058 SQALE.]</Paragraph> <Paragraph position="4"> To deal with phonological variability, alternate pronunciations are included in the lexicon, and optional phonological rules are applied during training and recognition. 
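The n-gram language modeling mentioned above can be illustrated with a minimal sketch. The following Python function estimates a bigram model with absolute discounting, backing off to unigram probabilities for unseen bigrams. This is an illustrative simplification under assumed conventions (sentence markers, a fixed discount), not the Katz back-off implementation [10] used in the system; all names are hypothetical.

```python
from collections import Counter

def train_bigram_backoff(sentences, discount=0.5):
    """Estimate bigram probabilities with absolute discounting;
    probability mass freed by the discount backs off to unigrams.
    Illustrative sketch only -- the system itself uses a Katz
    back-off bigram LM [10]."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        toks = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    total = sum(unigrams.values())

    def prob(word, prev):
        if (prev, word) in bigrams:
            # discounted maximum-likelihood bigram estimate
            return (bigrams[(prev, word)] - discount) / unigrams[prev]
        # back-off: redistribute the discounted mass via the unigram model
        seen_after_prev = sum(1 for b in bigrams if b[0] == prev)
        alpha = discount * seen_after_prev / unigrams[prev]
        return alpha * unigrams[word] / total

    return prob
```

Seen bigrams receive a discounted relative-frequency estimate, while unseen bigrams fall back to a unigram estimate scaled by the back-off weight for the preceding word, so every continuation receives non-zero probability.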
The recognizer uses a time-synchronous graph-search strategy [16] in a first pass with a bigram back-off language model (LM) [10]. A trigram LM is used in a second acoustic decoding pass, which makes use of the word graph generated with the bigram LM [6]. Experimental results are reported on the ARPA Wall Street Journal (WSJ) [19] and BREF [14] corpora; for both corpora, over 37k utterances are used for acoustic training and more than 37 million words of newspaper text for language model training. While the number of speakers is larger for WSJ, the total amount of acoustic training material is about the same (see Table 1). It is shown that for both corpora, increasing the number of training utterances by an order of magnitude reduces the word error by about 30%.</Paragraph> <Paragraph position="5"> The use of a trigram LM in a second pass also gives an error reduction of 20% to 30%. The combined error reduction is on the order of 50%.</Paragraph> </Section> </Paper>