<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1040"> <Title>Continuous Speech Recognition from a Phonetic Transcription</Title> <Section position="1" start_page="0" end_page="190" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> A long-standing and widely accepted linguistic theory of speech recognition holds that natural spoken messages are understood on the basis of an intermediate representation of the acoustic signal in terms of a small number of phonetic symbols. The traditional linguistic theory is very attractive for several reasons. First, it provides a natural way to partition the process of communication by spoken language into distinct acoustic, phonetic, lexical and syntactic sub-processes. Second, it provides for a reduction in bandwidth at each successive stage of the process. And, finally, it seems to be reflected in the development of written language. It is thus not surprising that this seminal idea formed the basis for several early speech recognition machines \[1, 2, 3, 4\].</Paragraph> <Paragraph position="1"> In this report we offer what we believe to be the simplest and most direct expression of the linguistic theory in a working speech recognition system. The present system is the culmination of a succession of experiments conducted over the past three years. The method of acoustic-phonetic mapping is described in \[5\], and results of its application to speaker-dependent recognition of fluently spoken digit strings are given in \[6\]. Next, a new method of lexical access was devised and applied to the problem of speaker-dependent recognition of isolated words from a large vocabulary \[7\] and sentences composed of them \[8\].
Attention was then turned to speaker-independent phonetic transcription \[9, 10\], which was then used in an early account of speaker-independent recognition of fluent speech from the 991-word DARPA \[11\] resource management task \[12\].</Paragraph> <Paragraph position="2"> In its present form, our speech recognition system uses a particular kind of hidden Markov model in conjunction with an appropriate dynamic programming algorithm to accomplish the acoustic-to-phonetic mapping. This part is not constrained by lexical or syntactic considerations and is thus vocabulary and task independent. Word recognition is then easily treated as a classical string-to-string editing problem which is solved by a two-level dynamic programming algorithm, the lower level of which performs lexical access while the upper level performs the parsing function.</Paragraph> <Paragraph position="3"> Our account of the present speech recognition system is given in the following order. We first give an overview of the system at the block diagram level. This is followed by a detailed description of each of the component blocks: the acoustic-phonetic model, the phonetic decoder and, finally, the lexical access and parsing techniques which, because they are so closely coupled, are treated as a unit. This is followed by an account of our experimental results and an interpretation of them.</Paragraph> <Paragraph position="4"> To summarize our results, on the DARPA resource management task with the perplexity 9 grammar, we attained 88% correct word recognition with 3% insertions, yielding a word accuracy of 85%. Phonetic transcription accuracy was assessed by resynthesizing directly from the phonetic transcription. In a few informal listening tests, we judged the word intelligibility rate to be approximately 75%.</Paragraph> <Paragraph position="5"> The word accuracy of our system is not as good as that obtained on exactly the same data by several other conventional systems \[13, 14, 15, 16\].
However, we believe that a few correctable shortcomings of the existing system are responsible for the disparity. We hope to make the necessary changes in the near future.</Paragraph> </Section> </Paper>