<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2007"> <Title>Modelling Non-verbal Sounds for Speech Recognition</Title>
<Section position="3" start_page="0" end_page="49" type="metho"> <SectionTitle> SPREADSHEET TASK </SectionTitle>
<Paragraph position="0"> Alex Rudnicky and Michelle Sakamoto gathered a large corpus of examples of users performing a spreadsheet task by voice (Rudnicky et al., 1989). They used an operational speech recognition system, not a PNAMBIC (Wizard-of-Oz) paradigm. The subjects spoke spontaneously and were recorded with a Sennheiser close-talking microphone. The input to their system was continuous; recognition was not initiated by pressing a key just before speaking. The data used in the present experiment consists of 15 sessions each from 7 speakers (they subsequently recorded more). A session represents approximately 100 utterances. These utterances were divided into a training set and a test set, which were then transcribed using noise words. Some &quot;utterances&quot; contained only real words, some contained words and noise, and some were noise alone; noises loud enough to pass the background thresholds were recognized as words even if the subject was not speaking. The recognizer used in their experiment was trained on 4000 read utterances from spreadsheet and calculator tasks. To avoid becoming speaker-dependent, we used those 4000 speaker-independent read utterances along with 416 spontaneous utterances containing noise words as our training set. The test set was 7185 spontaneous utterances from the seven speakers (not including any from the training set). The 416 noise utterances in the training set came from only five speakers, so two of the speakers in the test set had not been seen in training. The original recognizer, Sphinx (SPHX), and the noise-word recognizer, Phoenix (PHNX), were both run on the test set. Table 1 shows the word and sentence error rates for each type of utterance.</Paragraph>
<Section position="1" start_page="47" end_page="48" type="sub_section"> <SectionTitle> [Table 1: word and sentence error rates by utterance type; only the fragment &quot;9.6 8.5&quot; survives extraction] </SectionTitle>
<Paragraph position="0"> The WORDS AND NOISE results reflect those utterances whose transcripts contained both real words and at least one noise word. In this condition, the use of noise words reduced the sentence error rate by 43 percent. The NOISE results are for those utterances whose transcripts consist solely of noise words. The noise-word models were very effective at discriminating these events from real speech: fewer than four percent produced real-word hypotheses, whereas for the original system 81.4 percent of these noises resulted in hypothesized words from the lexicon. The two previous categories are then combined to give errors for all utterances containing noise. For this test set, using noise words reduced the number of sentence errors for utterances containing noise by roughly a factor of seven (290 vs. 41).</Paragraph>
<Paragraph position="1"> The WELL-FORMED condition comprises those utterances whose transcripts contained only real words from the lexicon.</Paragraph>
<Paragraph position="2"> This condition was run as a check that using noise words did not degrade performance on clean input. As can be seen, the system with noise words performed at least as well as the original system on clean input.</Paragraph>
</Section>
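<Paragraph position="3"> As a rough illustration of the three scoring conditions above, the following Python sketch shows one way transcripts could be partitioned, and noise words stripped before alignment so that they are not scored as errors. It is a minimal sketch under assumed conventions: the noise-word token names are hypothetical, since the paper does not give its transcription symbols.
    # Partition test transcripts into the paper's three scoring conditions,
    # assuming noise words carry a distinguishing token format.
    NOISE_WORDS = {"++BREATH++", "++RUSTLE++", "++PAPER++"}  # hypothetical names

    def categorize(transcript):
        """Assign an utterance transcript to one of the three conditions."""
        words = transcript.split()
        has_noise = any(w in NOISE_WORDS for w in words)
        has_real = any(w not in NOISE_WORDS for w in words)
        if has_noise and has_real:
            return "WORDS AND NOISE"
        if has_noise:
            return "NOISE"
        return "WELL-FORMED"

    def strip_noise(words):
        """Drop noise words before alignment, so noise insertions and
        substitutions are not counted as errors (as in the analysis above)."""
        return [w for w in words if w not in NOISE_WORDS]
For example, categorize("++BREATH++ delete row two") returns "WORDS AND NOISE", and its stripped form is scored only against the real words. </Paragraph>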
<Section position="2" start_page="48" end_page="49" type="sub_section"> <SectionTitle> Census Data Task </SectionTitle>
<Paragraph position="0"> Richard Stern and Alejandro Acero gathered data on subjects entering census data (Stern & Acero, 1989). Subjects were asked to spell their name, street address, etc.; this is largely an alphanumeric task. The recordings were made in a booth partitioned from the rest of the office, and a Sennheiser close-talking microphone was used. In this task, subjects were prompted when to speak, as opposed to the spreadsheet task, where recording was continuous; thus there were no utterances that contained only noise. As before, the utterances were transcribed using noise words, and the system was trained using these models. The error rate for the utterances containing words and noise is shown in Table 2.</Paragraph>
<Paragraph position="1"> In this task also, the use of noise words significantly reduced the error rate for noisy input. As before, there was no degradation of performance on clean input.</Paragraph>
<Paragraph position="2"> Conclusion The experiments reported here were quick studies designed to test the feasibility of using HMMs in the standard framework to model non-stationary noise in the input. The results suggest that the noises that are problematic for close-talking microphones in office environments can be modelled with these techniques. We intend to extend and refine these models to cover a wider range of events. Much more can also be done with the way the models are used. Currently, they are allowed to follow all words with no difference in probability. While environmental noise is probably interspersed randomly throughout the signal, this is not true of user-generated noise: these noises are more probable at some places than others. Breath noises and rustles, for example, are far more common at the beginning and end of utterances. Statistics on occurrences of these events can be incorporated into the search as part of the language model. This, however, requires that noise words be reliably distinguished from one another. In the data presented here, noise words were stripped out for the analysis; insertions and substitutions of noise words were not counted as incorrect. While noise words were not often confused with real words, they were often confused with other noise words. Better modelling of these events will be required before their &quot;language model&quot; probabilities can be reliably applied.</Paragraph>
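<Paragraph position="3"> The sketch below illustrates the proposed extension under stated assumptions: it estimates, from noise-word transcripts, how often each noise word occurs utterance-initially, utterance-finally, and medially, and converts the counts into a smoothed log probability that a search could apply as a language-model score. The noise-word tokens and corpus format are hypothetical; this is an illustration of the idea, not the system's actual implementation.
    # Estimate position-dependent noise-word statistics from transcribed
    # training data (token names are hypothetical).
    import math
    from collections import Counter, defaultdict

    NOISE_WORDS = {"++BREATH++", "++RUSTLE++", "++PAPER++"}

    def position_stats(transcripts):
        """Count each noise word by coarse position within the utterance."""
        counts = defaultdict(Counter)
        for line in transcripts:
            words = line.split()
            last = len(words) - 1
            for i, w in enumerate(words):
                if w in NOISE_WORDS:
                    pos = "initial" if i == 0 else "final" if i == last else "medial"
                    counts[w][pos] += 1
        return counts

    def noise_log_prob(counts, word, pos, smooth=1.0):
        """Smoothed log probability of a noise word at a given position,
        usable as a language-model score during the search."""
        c = counts[word]
        total = sum(c.values()) + 3.0 * smooth  # three position classes
        return math.log((c[pos] + smooth) / total)
A recognizer could then charge, say, noise_log_prob(counts, "++BREATH++", "medial") when hypothesizing a breath noise mid-utterance, instead of treating all positions as equally likely. </Paragraph> </Section> </Section> </Paper>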