<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1064"> <Title>The LIMSI Continuous Speech Dictation System</Title> <Section position="9" start_page="321" end_page="323" type="evalu"> <SectionTitle> EXPERIMENTAL RESULTS </SectionTitle>
<Paragraph position="0"> WSJ: The ARPA WSJ corpus[19] was designed to provide general-purpose speech data with large vocabularies. Text materials were selected to provide training and test data for large-vocabulary recognition. [Table caption fragment: ...word lexicon on the WSJ corpus; bigram/trigram (bg/tg) grammars estimated on WSJ text data; +: 20,000 word lexicon with open test.] For testing purposes, the 20k closed vocabulary includes all the words in the test data, whereas the 20k open vocabulary contains only the 20k most common words in the WSJ texts.</Paragraph>
<Paragraph position="1"> The 20k open test is also referred to as a 64k test since all of the words in these sentences occur in the 63,495 most frequent words in the normalized WSJ text material[19]. Two sets of standard training material have been used for these experiments: the standard WSJ0 SI84 training data, which include 7240 sentences from 84 speakers, and the standard set of 37,518 WSJ0/WSJ1 SI284 sentences from 284 speakers.</Paragraph>
<Paragraph position="2"> Only the primary microphone data were used for training.</Paragraph>
<Paragraph position="3"> The WSJ corpus provides a wealth of material that can be used for system development. We have worked primarily with the WSJ0-Dev (410 sentences, 10 speakers) and the WSJ1-Dev from spokes s5 and s6 (394 sentences, 10 speakers). Development of the word recognizer was done with the 5k closed vocabulary system in order to reduce the computational requirements. The Nov92 5k and 20k nvp (non-verbalized punctuation) test sets were used to assess progress during this development phase.</Paragraph>
<Paragraph position="4"> The WSJ system was evaluated in the Nov92 ARPA evaluation test[17] for the 5k-closed vocabulary and in the Nov93 ARPA evaluation test[18] for the 5k and 64k hubs. Except when explicitly stated otherwise, all of the results reported for WSJ use the standard language models[19]. Using a set of 1084 CD models trained with the WSJ0 si84 training data, the word error is 6.6% on the Nov92 5k test data and 9.4% on the Nov93 test data. Using the combined WSJ0/WSJ1 si284 training data reduces the error by about 27% for both tests. When a trigram LM is used in the second pass, the word error is reduced by an additional 35% on the Nov92 test and by 22% on the Nov93 test.</Paragraph>
<Paragraph position="5"> Results are given in Table 3 for the Nov92 nvp 64k test data using both closed and open 20k vocabularies. With si84 training (si84c, a slightly smaller model set than si84), the word error rate is doubled when the vocabulary increases from 5k to 20k words and the test perplexity goes from 111 to 244. The higher error rate with the 20k open lexicon can be largely attributed to the out-of-vocabulary (OOV) words, which account for almost 2% of the words in the test sentences. Processing the same test data with a system trained on the si284 training data reduces the word error by 30%. The word error on the Nov93 20k test is 15.2% with the si284 system. Using the trigram LM reduces the error rate by 18% on the Nov92 test and 22% on the Nov93 test.</Paragraph>
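The relative improvements quoted throughout (27%, 35%, 30%, ...) are reductions relative to the baseline word error rate, not absolute differences. A minimal sketch of the arithmetic, not from the paper (the function name is ours, and the 6.9% figure is inferred rather than reported):

    def relative_reduction(wer_before, wer_after):
        """Relative reduction, as a fraction of the baseline word error rate."""
        return (wer_before - wer_after) / wer_before

    # Nov93 5k test: 9.4% word error with si84 training; a word error of
    # roughly 6.9% with si284 training corresponds to the ~27% quoted above.
    print(round(relative_reduction(9.4, 6.9), 2))  # 0.27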
<Paragraph position="6"> The 20k trigram sentence error rates for Nov92 and Nov93 are 60% and 62% respectively. Since this is an open vocabulary test, the lower bound for the sentence error is given by the percentage of sentences with OOV words, which is 26% for Nov92 and 21% for Nov93. In addition, there are errors introduced by the use of word graphs generated by the first pass. The graph error rate (i.e., the correct solution was not in the graph) was 6% and 12% respectively for Nov92 and Nov93. In fact, in most of these cases the errors should not be considered search errors, as the recognized string has a higher likelihood than the correct string.</Paragraph>
<Paragraph position="7"> A final test was run using a 64k lexicon in an attempt to eliminate errors due to unknown words. (In principle, all of the read WSJ prompts are found in the 64k most frequent words; however, since the WSJ1 data were recorded with non-normalized prompts, additional OOV words can occur.) Running a full 64k system was not possible with the computing facilities available, so we added a third decoding pass to extend the vocabulary size. Starting with the phone string corresponding to the hypothesis of the trigram 20k system, an A* algorithm is used to generate a word graph using phone confusion statistics and the 64k lexicon. This word graph is then used by the recognizer with a 64k trigram LM trained on the standard WSJ training texts (37M words). Using this approach, only about 30% of the errors due to OOV words on the Nov93 64k test are recovered, reducing the word error from 11.8% to 11.2%.</Paragraph>
<Paragraph position="8"> BREF: BREF[14] is a large read-speech corpus, containing over 100 hours of speech material from 120 speakers (55m/65f). The text materials were selected verbatim from the French newspaper Le Monde, so as to provide a large vocabulary (over 20,000 words) and a wide range of phonetic environments[7]. The material in BREF was selected to maximize the number of different phonemic contexts. Containing 1115 distinct diphones and over 17,500 triphones, BREF can be used to train vocabulary-independent acoustic models. The text material was read without verbalized punctuation using the verbatim prompts. (The prompts for WSJ0 were normalized, whereas for BREF the prompts were presented as they appeared in the original text. This latter approach has since been adopted for the recordings of WSJ1. However, while orthographic transcriptions are provided for WSJ1, for BREF the only reference currently available is the prompt text.) [Table caption fragment: ...bigram/trigram grammars estimated on Le Monde text data; +: 20k word lexicon with open test.]</Paragraph>
<Paragraph position="9"> We have previously reported results using only a small portion (2770 sentences from 57 speakers) of the available training material for BREF[3, 5, 4]. In these experiments, the amount of training data has been extended to 38,550 sentences from 80 speakers. The amount of text material used for LM training has also been increased to 38M words, enabling us to estimate trigram LMs. Vocabularies containing the most frequent 5k and 20k words in the training material are used, and bigram and trigram LMs were estimated for both vocabularies. For each vocabulary, 200 test sentences (25 from each of 8 speakers) were selected from the development test material for a closed vocabulary test. The perplexity of all the within-vocabulary sentences of the development test data using the 5k/20k LM is 106/178 (which can be compared to 96/196 for WSJ computed under the same conditions with the 5k/20k-open LM). An additional 200 sentences were used for a 20k-open test set. As ensured by the prompt selection process, the test prompts were distinct from the training prompts.</Paragraph>
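The perplexity figures above follow the standard definition. A minimal illustrative sketch, not the authors' code, assuming per-word LM log probabilities for the test set are available:

    import math

    def perplexity(log_probs):
        """Test-set perplexity: PP = exp(-(1/N) * sum_i ln p(w_i | history)),
        where log_probs holds the natural-log probability the LM assigns
        to each of the N test words."""
        return math.exp(-sum(log_probs) / len(log_probs))

    # Toy check: if every word were predicted with probability 1/106,
    # the perplexity would be exactly 106, the scale of the 5k BREF figure.
    print(perplexity([math.log(1.0 / 106)] * 1000))  # ~106.0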
<Paragraph position="10"> Word recognition results for the 5k test are given in Table 4 with bigram and trigram LMs estimated on the 38M-word normalized text material from Le Monde. With 428 CD models trained on the si57 sentences, the word error is 12.6%. Using an order of magnitude more training data (si80) and 1747 CD models, the word error with the bigram is reduced by 28% to 9.1%. The use of a trigram LM gives an additional 36% reduction in error.</Paragraph>
<Paragraph position="11"> Results for the 20k test are given in Table 5 using the same acoustic model sets and LMs, for both closed and open vocabulary test sets. For the closed vocabulary test, the si80 training data gives an error reduction of 20% over the si57 training.</Paragraph>
<Paragraph position="12"> The use of the trigram LM reduces the word error by an additional 26%. The 20k-open test results are given in the lower part of the table. 3.9% of the words are OOV, and they occur in 72 of the 200 sentences. We observe almost a 50% increase in word error, with a three-fold increase in word insertions compared with the closed vocabulary test.</Paragraph>
<Paragraph position="13"> Thus the OOV words are apparently not simply replaced by another word, but are more often replaced by a sequence of words. The trigram LM reduces the word error by only 15% on this test.</Paragraph>
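The OOV figures above (3.9% of words, 72 of 200 sentences), like the open-vocabulary sentence-error lower bound discussed earlier, reduce to simple counts against the recognition lexicon. A minimal sketch, with tokenization and data structures as our assumptions:

    def oov_statistics(sentences, vocabulary):
        """Word-level OOV rate, and the fraction of sentences containing at
        least one OOV word (a lower bound on open-vocabulary sentence error)."""
        n_words = n_oov = n_oov_sents = 0
        for words in sentences:  # each sentence as a list of word tokens
            oov = sum(1 for w in words if w not in vocabulary)
            n_words += len(words)
            n_oov += oov
            n_oov_sents += oov > 0
        return n_oov / n_words, n_oov_sents / len(sentences)

</Section> </Paper>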