<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1040">
  <Title>Continuous Speech Recognition from a Phonetic Transcription</Title>
  <Section position="2" start_page="190" end_page="190" type="metho">
    <SectionTitle>
2. The System
</SectionTitle>
    <Paragraph position="0"> Acoustic signal processing is an autocorrelation-based linear predictive analysis. The LPCs are transformed into cepstral coefficients at a centisecond frame rate. The phonetic decoding module is a dynamic programming algorithm applied to a 47-state ergodic semi-Markov model. There are two very important points to be made regarding this stage of processing. First, no lexical or syntactic information of any kind is available to the phonetic decoder. Second, once the decoding is accomplished, the acoustic signal is discarded. All that remains is its phonetic transcription and the duration, in centiseconds, of each phonetic unit in that transcription.</Paragraph>
    <Paragraph position="1"> The lexical access and parsing functions are conceptually separate but are combined here in a two-level dynamic programming algorithm. The lower level is the lexical part while the upper level accomplishes the grammatical analysis. The two are intricately coupled. The DP algorithm performs a string-to-string editing in which the error-ridden phonetic transcription is mapped into sentences of conventional orthography. The lexicon gives the phonetic transcription of each vocabulary word pronounced in citation form. The grammar is a strict right linear grammar with no null productions.</Paragraph>
    <Paragraph position="2"> The entire system is implemented in FORTRAN-77 and runs on an Alliant FX-80.</Paragraph>
    <Paragraph position="3"> Because the phonetic decoding and lexical access stages have a high degree of intrinsic parallelism, we can exploit the architecture of the FX-80 to full advantage resulting in an execution time of 15 times real time for a typical sentence.</Paragraph>
    <Paragraph position="4"> We have applied this system to the DARPA Naval Resource Management Task [11], which allows one to inquire about and display, in various ways, the status of a 180-ship fleet. The vocabulary is 992 words including silence, and the grammar imposes a highly stylized word order syntax resulting in an entropy of about 4.4 bits/word.</Paragraph>
    <Paragraph position="5"> We now turn our attention to the individual components of this system.</Paragraph>
  </Section>
  <Section position="3" start_page="190" end_page="191" type="metho">
    <SectionTitle>
3. Signal Processing
</SectionTitle>
    <Paragraph position="0"> The speech was sampled at 8 kHz and was analyzed using a sliding 30 ms window at a 100 Hz frame rate. The spectrum, S(ω, t), was represented using 12 cepstral coefficients, where the approximate relationship between the spectral magnitude and the resulting cepstral coefficients is defined as</Paragraph>
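The displayed relation (1) is not shown above; a standard statement of the real-cepstrum relationship, offered here as a reconstruction consistent with the surrounding text rather than the paper's exact equation, is

```latex
\log \left| S(\omega, t) \right| \;=\; \sum_{m=-\infty}^{\infty} c_m(t)\, e^{-j\omega m},
\qquad c_{-m}(t) = c_m(t),
```

of which the 12 coefficients c_1(t), ..., c_12(t) are retained per frame.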
    <Paragraph position="2"> The cepstral coefficients were computed from autocorrelation coefficients via LPCs [17] and they were liftered using the bandpass lifter [18]</Paragraph>
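Equation (2) is also not shown above. The bandpass lifter of [18] is usually written as follows; the lifter length L = 12 is an assumption here, not a value recoverable from the text:

```latex
w_m \;=\; 1 + \frac{L}{2}\,\sin\!\left(\frac{\pi m}{L}\right),
\qquad \hat{c}_m(t) \;=\; w_m\, c_m(t), \qquad 1 \le m \le L .
```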
    <Paragraph position="4"> Twelve additional parameters were obtained by evaluating the differential cepstral coefficients, Δĉ_m, which contain important information about the temporal rate of change of the cepstrum, and are given in [19] as</Paragraph>
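Equation (3) likewise is not shown. A common regression form of the delta cepstrum consistent with [19], with the window half-width K left as an unspecified assumption, is

```latex
\Delta \hat{c}_m(t) \;=\;
\frac{\displaystyle\sum_{k=-K}^{K} k\, \hat{c}_m(t+k)}
{\displaystyle\sum_{k=-K}^{K} k^{2}} .
```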
    <Paragraph position="6"> The combined cepstral and delta cepstral vectors form a set of 24-parameter observation vectors, O_t, which were used in all the experiments described below.</Paragraph>
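As an illustration of the front end just described, the following sketch (in Python rather than the paper's FORTRAN-77) assembles 24-parameter observation vectors from 12 liftered cepstra and 12 delta cepstra. The lifter shape and the regression window half-width K are assumptions, not the paper's exact settings:

```python
import numpy as np

def lifter(c, L=12):
    # Bandpass lifter w_m = 1 + (L/2) sin(pi*m/L), m = 1..L (assumed form of [18])
    m = np.arange(1, L + 1)
    w = 1.0 + (L / 2.0) * np.sin(np.pi * m / L)
    return c * w

def delta(C, K=2):
    # Regression-style delta over a +/-K frame window (assumed; [19] may differ)
    T = C.shape[0]
    pad = np.pad(C, ((K, K), (0, 0)), mode="edge")
    num = sum(k * pad[K + k : K + k + T] for k in range(-K, K + 1))
    den = 2 * sum(k * k for k in range(1, K + 1))
    return num / den

def observations(C):
    # Lifter each 12-dim cepstral frame, append its delta: T x 24 observations O_t
    Chat = np.array([lifter(c) for c in C])
    return np.hstack([Chat, delta(Chat)])

feats = observations(np.zeros((50, 12)))   # 50 frames -> 50 x 24 feature matrix
```

A constant cepstral trajectory yields zero deltas, which is a quick sanity check on the regression window.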
  </Section>
  <Section position="4" start_page="191" end_page="191" type="metho">
    <SectionTitle>
4. The Acoustic-Phonetic Model
</SectionTitle>
    <Paragraph position="0"> It is generally accepted that speech is an acoustic manifestation of an underlying phonetic code having relatively few symbols. The code is, however, a purely mental representation of the spoken language and, as such, is not directly observable. Since the hidden Markov model comprises an unobservable Markov chain and a set of random processes that can be directly measured, it seems most natural to represent speech as a hidden Markov chain in which the hidden states correspond to the putative unobservable phonetic symbols and the state-dependent random processes account for the variability of the observable acoustic manifestation of the corresponding phonetic symbol.</Paragraph>
    <Paragraph position="1"> The model that we use to represent the acoustic-phonetic structure of the English language is the continuously variable duration hidden Markov model (CVDHMM) [5]. The states of the model, {q_i}_{i=1}^n, represent the hidden phonetic units. The phonotactic structure of the language is modelled, to a first-order approximation, by the state transition matrix, a_ij, which defines the probability of occurrence of state (phoneme) q_j at time t + τ conditioned on state (phoneme) q_i at time t, where τ is the duration of phoneme i. The information about the temporal structure of the hidden units is contained in the set of durational densities {d_ij(τ)}_{i,j=1}^n. The acoustic correlates of the phonemes are the observations, denoted O_t, and their distributions, which are defined by a set of observation densities {b_ij(O_t)}_{i,j=1}^n.</Paragraph>
    <Paragraph position="2"> The durational densities are 3-parameter gamma distributions

d_ij(τ) = η_ij^{ν_ij} (τ − τ_min(i,j))^{ν_ij − 1} e^{−η_ij (τ − τ_min(i,j))} / Γ(ν_ij)    (4)

where Γ(·) is the ordinary gamma function. The observation densities are multivariate Gaussian distributions. Note that both are indexed by state transition rather than by initial state. This affords a rudimentary ability to account for coarticulatory phenomena.</Paragraph>
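The density (4) is straightforward to evaluate numerically; the following sketch (a Python stand-in for the paper's FORTRAN-77, with hypothetical parameter values) evaluates the shifted gamma for one state transition:

```python
import math

def duration_density(tau, nu, eta, tau_min):
    """3-parameter gamma density d_ij(tau) of equation (4):
    shape nu, rate eta, shifted origin tau_min (all per state transition)."""
    if tau <= tau_min:
        return 0.0   # density is defined only for durations beyond tau_min
    z = tau - tau_min
    return (eta ** nu) * (z ** (nu - 1)) * math.exp(-eta * z) / math.gamma(nu)

# Mean of the shifted gamma is tau_min + nu/eta; its variance is nu/eta**2.
d = duration_density(7.0, 2.0, 0.5, 3.0)
```

Because the origin is shifted, the density integrates to one over τ > τ_min regardless of the shift.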
    <Paragraph position="3"> The complete model thus consists of the set of n states (phonemes); the state transition probabilities, a_ij, 1 ≤ i, j ≤ n; the observation means, μ_ij, 1 ≤ i, j ≤ n; the observation covariances, U_ij, 1 ≤ i, j ≤ n; and the durational parameters, ν_ij and η_ij, 1 ≤ i, j ≤ n, where the mean duration associated with the transition from state i to state j is τ_min(i,j) + ν_ij/η_ij and the variance of that duration is ν_ij/η_ij².</Paragraph>
    <Paragraph position="5"> With n = 47 phonetic units, the model has 191,000 parameters in all.</Paragraph>
  </Section>
  <Section position="5" start_page="191" end_page="194" type="metho">
    <SectionTitle>
5. Phonetic Decoding
</SectionTitle>
    <Paragraph position="0"> Since we identify each phonetic unit with a unique state of the CVDHMM as described above, phonetic transcription reduces to the task of finding the most likely state sequence of the model corresponding to the sequence of acoustic vectors, O = O_1 O_2 … O_t … O_T. We do so by finding the state and duration sequences whose joint likelihood with O is maximum. The required optimization is accomplished using a modified Viterbi [20] algorithm. Let α_t(i) denote the maximum likelihood of O_1 O_2 … O_t over all state and duration sequences terminating in state i. This quantity can be evaluated recursively according to

α_t(j) = max_{1 ≤ i ≤ n} max_{τ_min(i,j) ≤ τ ≤ τ_max} [ α_{t−τ}(i) a_ij d_ij(τ) ∏_{θ=0}^{τ−1} b_ij(O_{t−θ}) ]    (5)

for 1 ≤ j ≤ n, 1 ≤ t ≤ T, where τ_min(i,j) is the minimum duration for which d_ij(τ) is defined and τ_max is the maximum allowable duration for any phonetic unit.</Paragraph>
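A toy sketch of recursion (5), in Python rather than the paper's FORTRAN-77 and in the log domain for numerical safety. The transition, duration, and observation quantities are stand-ins supplied by the caller, not the trained 47-state model, and a uniform τ_min of 1 is assumed:

```python
import math

def segmental_viterbi(obs, n, log_a, log_d, log_b, tau_max):
    """Modified Viterbi of eq. (5), log domain.
    log_a[i][j]: transition log-prob; log_d(i, j, tau): duration log-density;
    log_b(i, j, o): observation log-density. Returns (state, duration) segments."""
    T = len(obs)
    NEG = float("-inf")
    alpha = [[NEG] * n for _ in range(T + 1)]
    back = [[None] * n for _ in range(T + 1)]
    for j in range(n):
        alpha[0][j] = 0.0          # any state may start the sequence (assumption)
    for t in range(1, T + 1):
        for j in range(n):
            for i in range(n):
                for tau in range(1, min(tau_max, t) + 1):
                    # alpha_{t-tau}(i) * a_ij * d_ij(tau) * prod_theta b_ij(O_{t-theta})
                    score = (alpha[t - tau][i] + log_a[i][j] + log_d(i, j, tau)
                             + sum(log_b(i, j, obs[t - 1 - th]) for th in range(tau)))
                    if score > alpha[t][j]:
                        alpha[t][j] = score
                        back[t][j] = (i, tau)
    # trace back the best state and duration sequences
    j = max(range(n), key=lambda s: alpha[T][s])
    t, segs = T, []
    while t > 0 and back[t][j] is not None:
        i, tau = back[t][j]
        segs.append((j, tau))
        t, j = t - tau, i
    return list(reversed(segs))
```

On a 6-frame toy sequence whose first half fits one state and second half the other, the traceback recovers two 3-frame segments.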
    <Paragraph position="1"> If, at each stage of the recursion on t and j, the values of i and τ that maximize (5) are retained, then one can trace back through the α_t(j) array to obtain the best state and duration sequences

q̂ = …    (6)</Paragraph>
    <Paragraph position="2"> 6. Lexical Access and Parsing

The function of the lexical access and parsing algorithms is to find that sentence, W, which is well-formed with respect to the task grammar, G, and best matches, in some sense, the phonetic transcription, q̂. The lexical access part of the process is that of matching words to subsequences of q̂, while parsing is the part that joins the lexical hypotheses together according to grammatical rules. The two components are conceptually separate and sequential, as indicated in Figure 1. However, in order to achieve an efficient implementation, the two are interleaved in a two-level dynamic programming algorithm and hence are treated together in this section.</Paragraph>
    <Paragraph position="3"> Lexical access is effected by the lower level of the two-level DP algorithm and consists in matching standard transcriptions of lexical items to various subsequences of q̂. In particular, we seek the word, v, whose standard transcription q = q_1 q_2 … q_K is closest, in a well-defined sense, to parts of q̂, say q̂_{t+1} q̂_{t+2} … q̂_{t+L}. The well-known solution to this problem [29] is a search over the lattice shown in Figure 3, in which the desired interval of q̂ is placed on the horizontal axis and the correct transcription, q, of some word, v, is lined up along the vertical axis. The lattice point (k,l) signifies the alignment of q̂ and q such that q̂_{t+l} coincides with q_k.</Paragraph>
    <Paragraph position="4"> Let S_jkl be the cost of substituting q̂_{t+l} for q_k given that the previous state is q_j; D_kl, the cost of deleting q_l from q given that the previous state is q_k; and I_kl, the cost of inserting q̂_{t+l} in q̂ when q̂_{t+l−1} = q_k. Let us denote by C_KL(v) the cost of matching the word, v, to q̂_{t+1}, …, q̂_{t+L}, where v has the phonetic spelling q_1, q_2, …, q_K. Then the lattice is evaluated according to</Paragraph>
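The recursion (7) itself is not shown above; a form consistent with the symmetric local constraints of [21] and the costs just defined, offered as a reconstruction rather than the paper's exact equation, is

```latex
C_{k,l} \;=\; \min
\begin{cases}
C_{k-1,\,l-1} + S_{jkl} & \text{(substitution)}\\[2pt]
C_{k-1,\,l} + D_{kl} & \text{(deletion)}\\[2pt]
C_{k,\,l-1} + I_{kl} & \text{(insertion)}
\end{cases}
```

with the word cost taken as the terminal lattice value C_{K,L}(v).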
    <Paragraph position="6"> for 1 _&lt; k _&lt; K and 1 &lt;_ l _&lt; L. The relation (7) is based upon the symmetric local constraints \[21\]. The boundary values needed to perform the recursion indicated in (7) are</Paragraph>
    <Paragraph position="8"> for 1 &lt;_ k _&lt; K, 1 _&lt; l _&lt; L and V v. In (8), xij is the average duration of qj when preceded by qi and dj is the duration of q t+j as computed by (5).</Paragraph>
    <Paragraph position="9"> One could evaluate (7) and (8) based on the Levenshtein metric [22], in which case we would use the unit substitution, insertion, and deletion costs of (9).</Paragraph>
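Equation (9) is not shown above; a reconstruction of the Levenshtein costs consistent with the definitions preceding (7) is

```latex
S_{jkl} \;=\;
\begin{cases}
0, & \hat{q}_{t+l} = q_k,\\[2pt]
1, & \text{otherwise},
\end{cases}
\qquad D_{kl} \;=\; I_{kl} \;=\; 1 .
```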
    <Paragraph position="11"> However, the acoustic-phonetic model tells us a great deal about the relative similarities of the phonetic units, so we can be more precise than simply using (9) allows.</Paragraph>
    <Paragraph position="12"> The dissimilarity between two phonetic units is naturally expressed as the distance between their respective acoustic distributions integrated over their estimated durations. If we adopt the rhobar metric [23] between b_jk(x) and b_jl(x), then we have</Paragraph>
    <Paragraph position="14"> We use a simple heuristic for the costs of insertion and deletion: we treat them both as substitutions with silence, which is represented for convenience by q_1, giving (10b).</Paragraph>
    <Paragraph position="17"> The lexical hypotheses evaluated by the lower level of the DP algorithm, (7), are combined to form sentences by the upper level in accordance with the finite state diagram of the task grammar. The form of the finite state diagram is shown in Figure 4. The state set, Q, contains 4767 states connected by 60,433 transitions. There are 90 final states. This grammar was produced from the original specification of the task by a grammar compiler [24]. The language generated by this grammar has a maximum entropy of 4.4 bits/word. The states r and s are completely separate from, and not to be confused with, the states of the acoustic-phonetic model. The state transition from r to s given word v is denoted by δ(r,v) = s.</Paragraph>
    <Paragraph position="18"> Let R(s,k) be the minimum accumulated cost of any phrase of k words starting in state 1 and ending in state s. The cumulative cost function obeys the recursion</Paragraph>
    <Paragraph position="20"/>
    <Paragraph position="22"> and the global constraints on expansion and compression of words are given by</Paragraph>
    <Paragraph position="24"> where ε₁ and ε₂ are fixed compression and expansion limits. Note that the incremental costs C_{l−k,k}(v) are supplied by the lower level from (7). Because the outer minimization of (11) is over the set P as defined in (12), the operation is parallel in s.</Paragraph>
    <Paragraph position="25"> While computing R from (11), we retain the values of r, v and l that minimize each R(s,k). When R is completely evaluated, we trace back through it beginning at the least R(s,N) for which s is a final state. This allows the recovery of the best sentence, W, and its parse in the form of a state sequence.</Paragraph>
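The upper-level recursion and traceback can be sketched as follows (Python stand-in for the paper's FORTRAN-77). The grammar, the fixed per-transition word costs, and the suppression of the segment-boundary index l are simplifications of (11)-(12), not the paper's implementation:

```python
def parse(grammar, finals, word_costs, start, max_words):
    """Upper-level DP of (11), simplified: R[s][k] = min cost of reaching
    grammar state s with k words; word_costs[(r, v)] stands in for the
    lower-level lexical costs supplied by (7)."""
    INF = float("inf")
    R = {start: {0: 0.0}}          # R(s, k), stored sparsely
    back = {}                      # best (predecessor state, word) per (s, k)
    for k in range(1, max_words + 1):
        for (r, v), s in grammar.items():      # transition delta(r, v) = s
            prev = R.get(r, {}).get(k - 1, INF)
            cand = prev + word_costs[(r, v)]
            if cand < R.setdefault(s, {}).get(k, INF):
                R[s][k] = cand
                back[(s, k)] = (r, v)
    # trace back from the cheapest entry R(s, N) over final states s
    best = min(((s, k) for s in finals for k in R.get(s, {})),
               key=lambda sk: R[sk[0]][sk[1]])
    words, (s, k) = [], best
    while k > 0:
        r, v = back[(s, k)]
        words.append(v)
        s, k = r, k - 1
    return list(reversed(words))
```

On a three-state toy grammar the traceback recovers the cheapest grammatical word string, word by word, exactly as the text describes for the full task grammar.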
  </Section>
</Paper>