<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2026">
  <Title>A STACK DECODER FOR CONTINUOUS SPEECH RECOGNITION</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE SPEECH RECOGNITION PROBLEM
</SectionTitle>
    <Paragraph position="0"> Natural language automatic speech recognition typically proceeds as follows. Human speech is recorded via a microphone, then digitized. This digitized waveform is further processed to extract time- and/or frequency-domain parameters and features. This processed input, which we will refer to as an 'utterance', is then fed to a recognizer, which is a program that uses knowledge about speech and language to present a list of possible transcriptions of the input.</Paragraph>
    <Paragraph position="1"> We will discuss a particular algorithm, known as a 'stack decoder' (or sometimes, 'A*-search') for doing continuous speech recognition. We will also discuss an implementation of this algorithm developed at Dragon Systems.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="193" type="metho">
    <SectionTitle>
THE STACK DECODER
</SectionTitle>
    <Paragraph position="0"> This section provides a description of the basic structure of the algorithm. The development of the stack decoder idea as applied to speech recognition was first done by Fred Jelinek and his associates at IBM in the early 70's (Jelinek et. al., 1975 \[3\]), based on earlier work by Jelinek (Jelinek, 1969 \[1,2\]), who had developed the algorithm as a method of sequential decoding of transmitted information.</Paragraph>
    <Paragraph position="1"> The algorithm controls the use of two sub-algorithms, which in this section we will assume as given. Each of the sub-algorithms takes as input a word sequence W and an utterance U. The first, which we will call a 'complete transcription scorer', or CTS, computes the likelihood that W is a complete and correct transcription of the utterance U. The second sub-algorithm, a 'partial transcription evaluator', or PTE, computes a 'priority' for a hypothesized partial transcription. This priority will be used to decide which of a given list of partial transcriptions will be considered first for extension.</Paragraph>
    <Paragraph position="2">  The 'stack' (which is far from being a stack in the computer science sense of a first-in-first-out list) is a list of partial transcriptions, ordered by the PTE. Initially the stack consist,,; of the empty word sequence (which is certain to extend to a correct transcription of the utterance). The algorithm then iterates the following steps until a stopping criterion, explained below, is satisfied.</Paragraph>
    <Paragraph position="3">  - I. Remove from the stack the partial transcription W having the highest priority. - 2. For each possible one-word extension W' of W: 3. Apply the PTE to W' and use the result to insert W' in the appropriate place on the stack.</Paragraph>
    <Paragraph position="4"> * 4. Apply the CTS to W' and use the result to insert W' in a 'choice list'.</Paragraph>
    <Paragraph position="5">  The stopping criterion will depend on the desired output and performance. If the desired output is a list of the N most likely candidates for the transcription, then run the algorithm until it may be determined that the best partial transcription on the stack cannot extend to a complete transcription which scores better than the Nth best choice so far computed.</Paragraph>
  </Section>
  <Section position="4" start_page="193" end_page="194" type="metho">
    <SectionTitle>
REFINEMENTS OF THE BASIC ALGORITHM
</SectionTitle>
    <Paragraph position="0"> One obvious refinement is to save the computational state of an item on the stack. That is, one expects that the computation that the PTE and CTS perform with input W', a one-word extension of W, can make use of the work done in computing with input W. In particular, suppose that the utterance U consists of a sequence of parameter vectors, one vector representing the speech signal at a certain discrete time instant. Then one could save, over a certain time interval, the likelihood that W provides a correct partial transcription of U up to each time in that interval. In other words, save an ending time distribution for the partial transcription W.</Paragraph>
    <Paragraph position="1"> Another refinement is called 'thresholding'. This is the process whereby a hypothesized partial transcription is discarded when there is good reason to believe that it will not extend to a choice in the top N. If thresholding is done, there is no guarantee that the most likely complete transcription will appear on the choice list. However, thresholding significantly reduces the amount of computation required. One method of thresholding is to maintain a likelihood score for the best current hypothesis at each time instant, and to discard a partial hypothesis if its score at a particular time is more than a fixed difference from the score of the best hypothesis at that time. Another kind of thresholding results from placing a limit on the number of items on the stack at any one time (such a limit already exists by virtue of memory limitations). When a new item is inserted into a full stack, the item having least priority will be discarded in order to make room.</Paragraph>
    <Paragraph position="2"> A third refinement is called 'shared computation'. The acoustic models corresponding to different one-word extensions of a partial transcription may be identical on initial portions. One may compute the acoustic match on those identical portions just once. This may be accomplished by changing the algorithmic assumptions slightly, and allowing the PTE and CTS to operate on all legal one-word extensions of a given partial transcription. A further advantage of doing this is that thresholding may be done sooner if the extensions are processed timesynchronously. null 'Rapid match' is the process of pruning the one-word extensions of a partial transcription to produce a shorter list. Rapid match would make use of both acoustic and linguistic information in a crude manner to discard possibilities that are not likely to extend to a correct complete transcription. One method of doing this is to combine simple word frequencies and acoustic matches to crude models for word beginnings to obtain a smaller list of candidates.</Paragraph>
  </Section>
  <Section position="5" start_page="194" end_page="194" type="metho">
    <SectionTitle>
ACOUSTIC MODELS
</SectionTitle>
    <Paragraph position="0"> With this section we start discussion of the implementation details of the stack decoder at Dragon Systems. The models we use for words are based on phonetic spellings (from the Random House Unabridged Dictionary \[4\]) together with stochastic models for phonemes in context.</Paragraph>
    <Paragraph position="1"> A phoneme in context model (or PIC) models the acoustics and duration of a phoneme in a given environment which consists of:  1. The immediately preceding phoneme.</Paragraph>
    <Paragraph position="2"> - 2. The immediately succeeding phoneme.</Paragraph>
    <Paragraph position="3"> 3. The level of stress (primary, secondary, or unstressed). - 4. Whether or not the phoneme is expected to undergo pre-pausal lengthening.  The fourth component of the environment may need further explanation. Before a pause, a speaker will typically extend speech sounds past their normal durations, and this pre-pausal lengthening is confined mainly (although not strictly) to the syllable immediately preceding the pause.</Paragraph>
    <Paragraph position="4"> A PIC is a stochastic network consisting of a sequence of from one to three nodes. Each node carries with it a probability distribution of acoustic parameters and a probability distribution of durations.</Paragraph>
    <Paragraph position="5"> For purposes of efficiency in storage, we allow the same PIC to represent the phoneme in many different environments.</Paragraph>
    <Paragraph position="6"> A stochastic model for a word is constructed &amp;quot;on the fly&amp;quot; during the recognition process using its phonetic spelling. The PICs that may represent the phonemes in a word (with a given phoneme given as left context) are assembled into a network in the following way.  1. Each phoneme that is not included in the final syllable of the word has only one possible PIC representing it.</Paragraph>
    <Paragraph position="7"> 2. The final syllable will have one of two subnetworks representing it (unless the final syllable consists of one phoneme - in that case there will be more, as indicated in 3.). One subnetwork (the 'pre-pausal subnetwork', or PPS) would consider the hypothesis that a pause immediately follows the word, the other subnetwork ('non-pausal subnetwork', or NPS) would consider the opposite hypothesis.</Paragraph>
    <Paragraph position="8"> 3. In the NPS, the final phoneme will be represented by any one of a number of PICs,  depending on which phoneme starts the next word in the hypothesis. For the PPS, the final phoneme is assumed to be followed by silence.</Paragraph>
  </Section>
  <Section position="6" start_page="194" end_page="195" type="metho">
    <SectionTitle>
LANGUAGE MODELS
</SectionTitle>
    <Paragraph position="0"> At the time of preparation of this paper, a language module was being installed in the system.</Paragraph>
    <Paragraph position="1"> This language module has been tested in conjunction with Dragon Systems' discrete utterance recognizer.</Paragraph>
    <Paragraph position="2"> The language module makes use of word-pair statistics (for common word pairs) combined with one-gram statistics.</Paragraph>
  </Section>
  <Section position="7" start_page="195" end_page="195" type="metho">
    <SectionTitle>
THE COMPLETE TRANSCRIPTION SCORER
</SectionTitle>
    <Paragraph position="0"> When we refer to a 'score', we mean a negative log likelihood of a path through a stochastic network. Thus the most likely paths are those with the lowest scores.</Paragraph>
    <Paragraph position="1"> Recall that a complete transcription scorer (or CTS) computes an estimate of the likelihood that a given word sequence is a correct transcription of the input: utterance. The CTS for the stack decoder at present consists of a single component, the acoustic likelihood module (ALM) (future versions will include a linguistic likelihood module and perhaps more). The output of the ALM is based upon consideration of the following stochastic process. Link together the stochastic networks defined by the words in the proposed transcription in a linear fashion. The right and left contexts for each word are now determined, so the resulting network is a linear sequence of nodes. For each node in this sequence, obtain a sample parameter vector sequence from its distribution. Concatenate these sequences together to produce an utterance. The ALM is a dynamic program to compute th'e likelihood that this process produces the input utterance.</Paragraph>
  </Section>
  <Section position="8" start_page="195" end_page="196" type="metho">
    <SectionTitle>
THE PARTIAL TRANSCRIPTION EVALUATOR
</SectionTitle>
    <Paragraph position="0"> The general description given in the section entitled &amp;quot;THE STACK DECODER&amp;quot; defined a partial transcription evaluator, or PTE, as a module which takes as input an utterance and a hypothesized partial transcription and returns a priority for evaluating extensions of the partial transcription. It is a challenging research problem to design algorithms that compute an effective measure of priority.</Paragraph>
    <Paragraph position="1"> An effective priority should satisfy the following requirements.</Paragraph>
    <Paragraph position="2">  - 1. There should be an algorithm (i.e., a PTE) which computes it fairly rapidly. - 2. It should tend to favor extending correct partial hypotheses.</Paragraph>
    <Paragraph position="3">  If the PTE computes an estimate of the likelihood of the best complete transcription extending the input partial transcription, then that would satisfy requirement 2. The PTEs that we have developed and tested have this goal in mind.</Paragraph>
    <Paragraph position="4"> The importance of the actual implementation of the PTE is two-fold. First of all, the earlier the best scoring transcription is considered, the better the subsequent thresholding will be. Second, a poor PTE may cause the best partial transcription to 'fall off the bottom of the stack'. So far we have tested several PTEs. The first few of these do not look ahead at the speech data. However, they do (as they should) allow comparison between hypotheses which consume differing amounts of speech data. Following are the descriptions of the PTEs.</Paragraph>
    <Paragraph position="5"> The Average Score PTE. The measure returned by this PTE is the average score per time instant for the hypothesis as computed by the ALM.</Paragraph>
    <Paragraph position="6"> The Score Difference PTE. The number returned by this PTE is the difference between the score for the hypothesis and the best score of any path at the time which is the best guess for the ending time of the last word in the hypothesis.</Paragraph>
    <Paragraph position="7"> The Confidence PTE. As explained above, each node in the stochastic network has associated with it some probability distributions. Hence there is a notion of 'expected score' for a path in the network (averaged over utterances created by drawing samples from these distributions). The 'confidence' of a path through the network obtained by running the acoustic likelihood module on an input utterance is the difference between the expected score for that path and the actual score as computed by the module, divided by a factor designed to reduce the effect of larger variance for longer hypotheses. This confidence is the measure returned by the confidence PTE. It is essentially a 'normalized' total score to allow comparison between paths of different lengths in the network.</Paragraph>
  </Section>
  <Section position="9" start_page="196" end_page="196" type="metho">
    <SectionTitle>
MISCELLANEOUS IMPLEMENTATION DETAILS
</SectionTitle>
    <Paragraph position="0"> The code implementing the stack decoder described above has been written using the C programming language, conforming completely to the ANSI standard.</Paragraph>
    <Paragraph position="1"> The implementation also was designed with parallelism in mind. In particular, given a parallel programming environment which supports C, the code may be modified so as to allow different partial transcriptions to be extended simultaneously on separate processors.</Paragraph>
    <Paragraph position="2"> A separate PC-specific module allows one to view labelled spectrograms of both recorded and fabricated data. More specifically, one can save live utterances (as above - live speech data which has undergone signal processing) or create utterances from the models (string together data based on the means of the probability distributions in the nodes). Using either one of these types of utterances as input, together with acoustic models and a transcription of the utterance, the spectrogram will display (using CGA, EGA, or VGA) a view of the utterance, segmented according to the model (i.e., the module finds the best path through the stochastic network defined by the the model corresponding to the given transcription). The module may be used for (among other things) understanding speech recognition errors and discovering programming errors.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML