<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2039"> <Title>ACOUSTIC MODELING OF SUBWORD UNITS FOR LARGE VOCABULARY SPEAKER INDEPENDENT SPEECH RECOGNITION</Title> <Section position="3" start_page="0" end_page="280" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> In the past few years a number of systems for large vocabulary speech recognition have been proposed which have achieved high word recognition accuracy [1-6]. Although a couple of the systems have concentrated on isolated word input [6] or have been trained to individual speakers [5, 6], most current large vocabulary recognition systems have the goal of performing speech recognition on fluent input (continuous speech) by any talker (speaker independent systems).</Paragraph> <Paragraph position="1"> In this study we adopt a pattern recognition based approach to large vocabulary speech recognition. A detailed description of the system we have developed is given in [7]. The basic speech units in the system are modeled acoustically based on a lexical description of the words in the vocabulary. No assumption is made, a priori, about the mapping between acoustic measurements and phonemes; the mapping is learned entirely from a finite training set of utterances.</Paragraph> <Paragraph position="2"> The resulting speech units, which we call phone-like units (PLUs), are essentially acoustic descriptions of linguistically based units as represented in the words occurring in the given training set.</Paragraph> <Paragraph position="3"> The focus of this paper is a discussion of various methods used to create a set of acoustic models for characterizing the PLUs used in large vocabulary recognition (LVR). The set of context independent (CI) units we used in this study is a fixed set of 47 PLUs, in which each PLU is associated with a linguistically defined phoneme symbol. 
We model each CI PLU using a continuous density hidden Markov model (CDHMM) with a Gaussian mixture state observation density. Each word model is defined as the concatenation of the PLU models according to a fixed lexicon defined by the set of 47 associated phoneme symbols. We also consider a set of context dependent (CD) units, which includes PLUs defined by left context, right context, and both left and right context.</Paragraph> <Paragraph position="4"> † On leave from CSELT, Torino, Italy.</Paragraph> <Paragraph position="5"> We tested the recognition system on the DARPA Naval Resource Management task using the word-pair (WP) grammar in a speaker independent mode. With context independent acoustic modeling, we varied the maximum number of mixture components in each state from 1 to 256 and found that the word accuracy increased from 61% to 90%, which indicates that sufficient acoustic resolution is essential for improved performance. The 90% word accuracy is the highest performance reported based on context independent units. When intraword context dependency modeling was incorporated, our performance improved to 93% word accuracy.</Paragraph> </Section> </Paper>