Continuous Speech Recognition from a Phonetic Transcription 
S. E. Levinson 
A. Ljolje 
L. G. Miller 
AT&T Bell Laboratories 
Murray Hill, New Jersey 07974 
1. Introduction 
A long-standing and widely accepted linguistic theory of speech recognition holds that 
natural spoken messages are understood on the basis of an intermediate representation of the 
acoustic signal in terms of a small number of phonetic symbols. The traditional linguistic 
theory is very attractive for several reasons. First, it provides a natural way to partition the 
process of communication by spoken language into distinct acoustic, phonetic, lexical and 
syntactic sub-processes. Second, it provides for a reduction in bandwidth at each successive 
stage of the process. And, finally, it seems to be reflected in the development of written 
language. It is thus not surprising that this seminal idea formed the basis for several early 
speech recognition machines \[1,2, 3, 4\]. 
In this report we offer what we believe to be the simplest and most direct expression of the 
linguistic theory in a working speech recognition system. The present system is the 
culmination of a succession of experiments conducted over the past three years. The method 
of acoustic phonetic mapping is described in \[5\], and results of its application to speaker- 
dependent recognition of fluently spoken digit strings are given in \[6\]. Next, a new method of 
lexical access was devised and applied to the problem of speaker-dependent recognition of 
isolated words from a large vocabulary \[7\] and sentences composed of them \[8\]. Attention was 
then tumed to speaker-independent phonetic transcription \[9, 10\] which was then used in an 
early account of speaker independent recognition of fluent speech from the 991 word DARPA 
\[11\] resource management task \[12\]. 
In its present form, our speech recognition system uses a particular kind of hidden Markov 
model in conjunction with an appropriate dynamic programming algorithm to accomplish the 
acoustic-to-phonetic mapping. This part is not constrained by lexical or syntactic 
considerations and is thus vocabulary and task independent. Word recognition is then easily 
treated as a classical string-to-string editing problem which is solved by a two-level dynamic 
programming algorithm, the lower level of which performs lexical access while the upper level 
performs the parsing function. 
Our account of the present speech recognition system is given in the following order. We 
first give an overview of the system at the block diagram level. This is followed by a detailed 
description of each of the component blocks, the acoustic phonetic model, the phonetic decoder 
and, finally, the lexical access and parsing techniques which, because they are so closely 
coupled, are treated as a unit. This is followed by an account of our experimental results and 
an interpretation of them. 
To summarize our results, on the DARPA resource management task with the perplexity 9 
grammar, we attained 88% correct word recognition with 3% insertions yielding a word 
accuracy of 85%. Phonetic transcription accuracy was assessed by resynthesizing directly from 
the phonetic transcription. In a few informal listening tests, we judged the word intelligibility 
rate to be approximately 75%. 
190 
The word accuracy of our system is not as good as that obtained on exactly the same data 
by several other conventional systems \[13,14,15,16\]. However, we believe that a few 
correctable shortcomings of the existing system are responsible for the disparity. We hope to 
make the necessary changes in the near future. 
2. The System 
Acoustic signal processing is an autocorrelation based linear predictive analysis. The 
LPC's are transformed into cepstral coefficients at a centisecond frame rate. The phonetic 
decoding module is a dynamic programming algorithm applied to a 47-state ergodic semi- 
Markov model. There are two very important points to be made regarding this stage of 
processing. First, no lexical or syntactic information of any kind is available to the phonetic 
decoder. Second, once the decoding is accomplished, the acoustic signal is discarded. All that 
remains is its phonetic transcription and the duration, in centiseconds, of each phonetic unit in 
that transcription. 
The lexical access and parsing functions are conceptually separate but are combined here in 
a two-level dynamic programming algorithm. The lower level is the lexical part while the 
upper level accomplishes the grammatical analysis. The two are intricately coupled. The DP 
algorithm simply performs a string-to-string editing in which the error-ridden phonetic 
transcription is mapped into sentences of conventional orthography. The lexicon used simply 
gives the phonetic transcription of each vocabulary word pronounced in citation form. The 
grammar is a strict right linear grammar with no null productions. 
The entire system is implemented in FORTRAN-77 and runs on an Alliant FX-80. 
Because the phonetic decoding and lexical access stages have a high degree of intrinsic 
parallelism, we can exploit the architecture of the FX-80 to full advantage resulting in an 
execution time of 15 times real time for a typical sentence. 
We have applied this system to the DARPA Naval Resource Management Task \[11\] which 
allows one to inquire about and display in various ways, the status of a 180 ship fleet. The 
vocabulary is 992 words including silence and the grammar imposes a highly stylized word 
order syntax resulting in a entropy of about 4.4 bits/word. 
We now turn our attention to the individual components of this system. 
3. Signal Processing 
The speech was sampled at 8 kHz and was analyzed using a sliding 30 ms. window at a 
100 Hz frame rate. The spectrum, S(CO, t), was represented using 12 cepstral coefficients, 
where the approximate relationship between the spectral magnitude and the resulting cepstral 
coefficients is defined as 
12 
log IS(co, t)l = 2 ~ Cm(t) COS(co mt) + Co(t) . (1) 
m=l 
The cepstral coefficients were computed from autocorrelation coefficients via LPC's \[17\] and 
they were liftered using the bandpass lifter \[18\] 
Cm = (1 + 6 sin (n m/12)) Cm 1 _< m_< 12. (2) 
Twelve additional parameters were obtained by evaluating the differential cepstral coefficients, 
Ag'm, which contain important information about the temporal rate of change of the cepstmm, 
and are given in \[19\] as 
191 
2 E kCm(t+k) 
A~?m(t ) = k=-2 _ ~ Cm (3) 2 ~gt 
k 2 
k= -2 
The combined cepstral and delta cepstral vectors form a set of 24-parameter observation 
vectors, Or, which were used in all the experiments described below. 
4. The Acoustic-Phonetic Model 
It is generally accepted that speech is an acoustic manifestation of an underlying phonetic 
code having a relatively few symbols. The code is, however, a purely mental representation of 
the spoken language and, as such, is not directly observable. Since the hidden Markov model 
comprises an unobservable Markov chain and a set of random processes that can be directly 
measured, it seems most natural to represent speech as a hidden Markov chain in which the 
hidden states correspond to the putative unobservable phonetic symbols and the state-dependent 
random processes account for the variability of the observable acoustic manifestation of the 
corresponding phonetic symbol. 
The model that we use to represent the acoustic-phonetic structure of the English language 
is the continuously variable duration hidden Markov model (CVDHMM) \[5\]. The states of the 
model, {qi }~= 1, represent the hidden phonetic units. The phonotactic structure of the language 
is modelled, to a first order approximation, by the state transition matrix, aij, which defines the 
probability of occurrence of state (phoneme) qj at time t + z conditioned on state (phoneme) qi 
at time t, where x is the duration of phoneme i. The information about the temporal structure 
of the hidden units is contained in the set of durational densities {dq(t ) }inj=l. The acoustic 
correlates of the phonemes are the observations, denoted Or, and their distributions, which are 
defined by a set of observation densities {b 0 (Or)}\[j=l. 
The durational densities are 3-parameter gamma distributions 
-- (x - Xmin (i,j)) v°-I e -n'~ (~-~ (i,y)) (4) d0(x)- r(v0 ) 
where F(x) is the ordinary gamma function. The observation densities are multivariate 
Gaussian distributions. Note that they are both indexed by state transition rather than initial 
state. This affords a rudimentary ability to account for coarticulatory phenomena. 
The complete model thus consists of the set of n states (phonemes), the state transition 
probabilities, aij, 1 _< i,j_< n; the observation means, ttii, 1 _< i,j_< n; the observation 
covariances, Uij, 1 _< i,j _< n; and the durational parameters, vii and rlij, 1 _< i,j _< n, where the 
mean duration associated with state transition i to j is vii and the variance of that duration is 
rl0 
vii 
With n = 47 phonetic units, the model has 191,000 parameters in all. 
5. Phonetic Decoding 
Since we identify each phonetic unit with a unique state of the CVDHMM as described 
above, phonetic transcription reduces to the task of finding the most likely state sequence of 
192 
the model corresponding to the sequence of acoustic vectors, O = O 10 2 ... Ot ..- O r. 
We do so by finding the state and duration sequences whose joint likelihood with O is 
maximum. The required optimization is accomplished using a modified Viterbi \[20\] algorithm. 
Let at(i) denote the maximum likelihood of O1 02 ... Ot over all state and duration 
sequences terminating in state i. This quantity can be evaluated recursively according to 
{ I " at(j) = max max a}i-)x aij dij(x) I'I bij (Or-0) (5) l ~ i ~ n xmi, (i,j) ~ x ~ Xm= O=0 
for 1 _< j _< n, 1 _< t <_ T where Xmi~(i,j) is the minimum duration for which dq(x ) is defined 
and 'rmax is the maximum allowable duration for any phonetic unit. 
If, at each stage of the recursion on t and j, the values of i and x that maximize (5) are 
retained, then one can trace back through the at(j) array to obtain the best state and duration 
sequences 
a = ... 
(6) 
6. Lexical Access and Parsing 
The function of the lexical access and parsing algorithms is to find that sentence, W, which 
is well-formed with respect to the task grammar, G, and best matches, in some sense, the 
phonetic transcription, ~. The lexical access part of the process is that of matching words to 
subsequences of ~, while parsing is the part that joins the lexical hypotheses together according 
to grammatical rules. The two components are conceptually separate and sequential as 
indicated in Figure 1. However, in order to achieve an efficient implementation, the two are 
interleaved in a two-level dynamic programming algorithm and hence are treated together in 
this section. 
Lexical access is effected by the lower level of the two-level DP algorithm and consists in 
matching standard transcriptions of lexical items to various subsequences of ft. In particular 
we seek the word, v, whose standard transcription q = q 1 q2 ... qr is closest, in a well defined 
sense, to parts of fi, say qt+l qt+2 ... qt+L. The well-known solution to this problem \[29\] is a 
search over the lattice shown in Figure 3 in which the desired interval of fi is placed on the 
horizontal axis and the correct transcription, q, of some word, v, is lined up along the vertical 
access. The lattice point (k,l) signifies the alignment of ~ and q such that qt+l coincides with 
qk. 
Let Sjk~ be the cost of substituting qt+l for qk given that the previous state is qj; Dkt, the 
cost of deleting qt from q given that the previous state is qk; and lkl the cost of inserting qt+l 
in fi when qt+l-1 = qk. Let us denote by CKL(V) the cost of matching the word, v, to 
qt+l ..... qt+L where v has the phonetic spelling ql, q2 ..... qg. Then the lattice is evaluated 
according to 
Ckt(v) =min{Ckl_l(v) + lkl, Ck_ll(1J ) +Dk_lk, Ck_ll_l(V ) -I-Sk_lkl} (7) 
193 
for 1 _< k _< K and 1 <_ l _< L. The relation (7) is based upon the symmetric local constraints 
\[21\]. The boundary values needed to perform the recursion indicated in (7) are 
Coo(V) = 0 
k 
Cko(v) = ~ Dlj ~lj (8) 
j=l 
l 
COt(V) = ~ Ilj d\] 
j=l 
for 1 <_ k _< K, 1 _< l _< L and V v. In (8), xij is the average duration of qj when preceded by qi 
and dj is the duration of q t+j as computed by (5). 
One could evaluate (7) and (8) based on the Levenshtein metric \[22\] in which case we 
would set 
{~ if qt+t = qk 
Sin = otherwise V \], k,l 
Dkt = 1 V k,l (9) 
Ikt = 1 V k,l 
However, the acoustic-phonetic model tells us a great deal about the relative similarities of the 
phonetic units so we can be more precise than simply using (9) allows. 
The dissimilarity between two phonetic units is naturally expressed as the distance between 
their respective acoustic distributions integrated over their estimated durations. If we adopt the 
rhobar metric \[23\] between bjk (X) and bjt (x) then we have 
Sjkt = I I.t.ik - lxytl dt+t Vj (10a) 
We use a simple heuristic for the costs of insertion and deletion. 
substitutions with silence, which is represented for convenience by q l. 
Dkl = Skl l 
Ikl = Ski 1 
We treat them both as 
Thus 
(lOb) 
The lexical hypotheses evaluated by the lower level of the DP algorithm, (7), are combined 
to form sentences by the upper level in accordance with the finite state diagram of the task 
grammar. The form of the finite state diagram is shown in Figure 4. The state set, Q, contains 
4767 states connected by 60,433 transitions. There are 90 final states. This grammar was 
produced from the original specification of the task by a grammar compiler \[24\]. The language 
generated by this grammar has a maximum entropy of 4.4 bits/word. The states r and s are 
completely separate from and not to be confused with the states of the acoustic/phonetic model. 
The state transition from r to s given word v is denoted by $(r,v) = s. 
Let R(s,k) be the minimum accumulated cost of any phrase of k words starting in state 1 
and ending in state s. The cumulative cost function obeys the recursion 
R(s,k) =min{m}n{R(r,k-l)p +Ck_l,k(V)} } (11) 
194 
Vs E Q and l _< k_< N. In(ll), 
P={15(r,v) =s for any v} (12) 
and the global constraints on expansion and compression of words are given by 
k - Iris1 -< l -< k - Ivl 2 (13) 
5 1 where el = ~ and e2 = ~. Note that the incremental costs Cl-k, k(V) are supplied by the 
lower level from (7). Because the outer minimization of (11) is over the set P as defined in 
(12), the operation is parallel in s. 
While computing R from (11), we retain the values of r, v and l that minimize each R(s,k). 
When R is completely evaluated, we trace back through it beginning a\[ the least R(s,N) for 
which s is a final state. This allows the recovery of the best sentence, W, and its parse in the 
form of a state sequence. 
7. Experimental Results 
All the tests described below, except for one informal listening test, were conducted on 
standard DARPA data that has been filtered and downsarnpled to a 4 kHz bandwidth. The 
training set consists of 3,267 sentences spoken by 109 different speakers. This comprises 
about 4 hrs. of speech. Two test sets each consist of 300 sentences spoken by 10 speakers. 
The tNrd test set comprises 54 sentences spoken by one of us (SEL) recorded using equipment 
similar to that used for the DARPA data. All four data sets are completely independent. 
The acoustic/phonetic model was trained as follows. The training data was segmented in 
temas of the 47 phonetic symbols by means of the segmental k-means algorithm \[25\]. All 
frames so assigned to each phonetic unit were collected and sample statistics for the spectral 
means and covariances, IXij and Uij and the durational means and variances, mij and ~ij, were 
computed for 1 _< i,j _< 47. If fewer than 500 samples were available for a particular value of 
i, then the samples for all values of i and fixed j were pooled and only a single statistic was 
computed and used for all values of i. The durational means and variances were then 
converted to parameters appropriate to the gamma distribution vii and rlij according to 
miy -- vij/~ij and ¢~ij = Vij/'l~i~. 
The transition matrix was computed from the lexicon. All adjacent pairs of words allowed 
by the grammar were formed and all occurrences of phonetic units and bigrams were counted. 
These were then converted to transition probabilities from 
Ar(i'J) (14) aij = .K ( i) 
where N(i,j) is the total number of occurrences of the bigram qi qj and .Y(i) is the total 
number of occurrences of the unit q i. 
Word recognition results are summarized in Table I. All results are for the perplexity 9 
grammar. 
195 
Data # 
words 
trainl09 1838 
feb89 2561 
oct89 2684 
sell 457 
% 1% % 
ins. del. s~s. 
2.5 2.1 6.3 
2.6 3.8 10.3 
2.3 4.1 7.9 
0.9 0.4 2.2 
% 
% word 
correct accuracy 
91.6 89.1 
85.9 83.3 
88 85.7 
97.4 96.5 
% 
# sentence 
sents accuracy 
218 57.3 
300 40 
300 44 
54 75.9 
Table I. Recognition Results 
Data set trainl09 is a subset of the training data formed by taking two sentences at random 
from each of the training set speakers. This set was used for algorithm development. The 
three independent test sets were run only once. Recognition requires about 15 times real time 
on an 8 CE AUiant FX-80. 
Rather than try to measure the accuracy of the phonetic transcription directly, we tried to 
get an impression of its quality by listening to speech ^resynthesized from it. For this purpose 
we use the PRONOUNCE module of tts \[26\] with ~, d, and a pitch contour computed by the 
harmonic sieve method \[27\]. The average data rate for these quantities is approximately 
100 bps pointing to the possible utility of the phonetic decoder as a very-low-bit-rate vocoder. 
Our informal test was made on six sentences recorded by one of us (SEL). An audio tape 
was made of the resynthesis and played for several listeners from whose responses we judged 
that about 75% of the 91 words were intelligible. The speech recognition system gave an 96% 
word accuracy on these sentences. We have also recorded, decoded and resynthesized several 
Harvard phonetically balanced sentences with nearly identical results. This is significant since 
these sentences have no vocabulary in common with the DARPA task. 
8. Interpretation of the Results 
The results listed in Table I are approximately the same as those achieved by more 
conventional systems tested on the same data \[13, 14, 15, 16\] and the perplexity 60 grammar. 
Given the difficulty of the task and the early stage of development of this system, however, we 
consider these results quite respectable. Also, note that the performance on training data is not 
substantially different from that obtained on new test data indicating a certain robustness of our 
method. Moreover, almost all of the insertions and deletions are of monosyllabic articles and 
prepositions which do not change the meaning of the sentence. 
It appears that there are two straightforward ways to improve performance. First we need 
to improve the acoustic/phonetic model. Desirable structural changes would appear to be the 
incorporation of trigram phonotactics by making the underlying Markov chain second order 
\[28\]. This would allow us to associate the spectral distributions with three states rather than 
two. This should afford a better model of coarticulatory effects. Also, the spectral 
distributions can be made more faithful by using Gaussian mixtures rather than unimodal 
multi-variate densities. Fidelity can be further improved by accounting for temporal 
correlations among observations. Finally, we need to make a global improvement in the model 
by optimizing it. We have repeatedly tried reestimation techniques but, thus far, they have 
actually degraded performance. We speculate that applying constraints to the reestimation 
196 
formulae by forcing the state sequence to be fixed will ameliorate the results of optimization. 
Second, we can improve the lexical access technique by rationalizing the insertion, deletion, 
substitution metric. One possible alternative is to replace the rhobar distance with error 
probabilities determined either analytically or empirically. Also, applying phonological rules to 
the fixed, citation form, pronunciations stored in the lexicon may eliminate some errors. 
9. Summary 
We have described a novel method for speaker independent recognition of fluent speech 
from a large vocabulary. The system is a clear and simple implementation of well known 
linguistic theories of speech perception. The two most striking features of the system are that 
phonetic decoding is accomplished by a simple optimal search algorithm operating on a 
stochastic model of the acoustic-to-phonetic mapping and that, after phonetic transcription, 
processing is entirely symbolic and makes no reference to the acoustic signal. 
The performance obtained is not competitive with those obtained from traditional 
techniques but offers several advantages deriving from the fact that phonetic transcription is 
independent of lexical or syntactic considerations. 
The method described here is in its very earliest stage of development. We are optimistic 
that further experimentation will soon yield performance at least as good as that displayed by 
conventional methods. 
197 
REFERENCES 
1. Lesser, V. R., FenneU, R. D., Erman, L. D. and Reddy, D. R., "Organization of the 
HEARSAY-II Speech Understanding System," IEEE Trans. Acoust. Speech and Signal 
Processing, ASSP-23, pp. 11-24, 1975. 
2. Woods, W. A., "Motivation and Overview of SPEECHLIS: An Experimental Prototype 
for Speech Understanding Research," IEEE Trans. Acoust. Speech and Signal 
Processing, ASSP-23, pp. 2-10, 1975. 
3. Jelinek, F., "Continuous Speech Recognition by Statistical Methods," Proc. IEEE, 
Vol. 64, pp. 532-556, 1976. 
4. Mercier, G., Nouhen, A., Quinton, P. and Siroux, J., "The KEAL Speech Understanding 
System," in Spoken Language Generation and Understanding, J. C. Simon, Ed., 
D. Reidel, Dordrecht, The Netherlands, pp. 525-544, 1979. 
5. Levinson, S. E., "Continuously Variable Duration Hidden Markov Models for Speech 
Analysis," Computer Speech and Language, Vol. 1, No. 1, pp. 29-46, March, 1986. 
6. Levinson, S. E., "Continuous Speech Recognition by Means of Acoustic/Phonetic 
Classification Obtained from a Hidden Markov Model," Proc. ICASSP-87, Dallas, TX, 
pp. 93-96, Apr., 1987. 
7. Levinson, S. E., Ljolje, A. and Miller, L. G., "Large Vocabulary Speech Recognition 
Using a Hidden Markov Model for Acoustic/Phonetic Classification," Proc. ICASSP-88, 
New York, NY, pp. 505-508, Apr., 1988. 
8. Miller, L. G. and Levinson, S. E., "Syntactic Analysis for Large Vocabulary Speech 
Recognition Using a Context-Free Covering Grammar," Proc. ICASSP-88, New York, 
NY, pp. 271-274, Apr., 1988. 
9. Levinson, S. E., Liberman, M. Y., Ljolje, A. and Miller, L. G., "Speaker Independent 
Phonetic Transcription of Fluent Speech for Large Vocabulary Speech Recognition," 
Proc. ICASSP-89, Glasgow, Scotland, UK, pp. 441-444, May, 1989. 
10. Levinson, S. E., Libennan, M. Y., Ljolje, A. and Miller, L. G., "Speaker Independent 
Phonetic Transcription of Fluent Speech for Large Vocabulary Speech Recognition," 
Proc. DARPA Workshop on Speech and Natural Language, Philadelphia, PA, pp. 75-80, 
Feb., 1989. 
11. Price, P., Fisher, W., Bemstein, J. and Pallett, D., "The DARPA 1000-Word Resource 
Management Database for Continuous Speech Recognition," Proc. ICASSP-88, New 
York, NY, pp. 651-654, April, 1988. 
12. Levinson, S. E. and Ljolje, A., "Continuous Speech Recognition from Phonetic 
Transcription," Proc. DARPA Workshop on Speech and Natural Language, Harwichport, 
MA, Oct., 1989. 
13. Schwartz, R., Barry, C., Chow, Y.-L., Derr, A., Feng, M.-W., Kimball, O., Kubala, F., 
Makhoul, J. and Vandegrift, J., "The BBN BYBLOS Continuous Speech Recognition 
System," Proc. DARPA Speech and Natural Language Workshop, Philadelphia, PA, 
pp. 94-99, Feb., 1989. 
198 
14. Paul, D. B., "The Lincoln Continuous Speech Recognition System: Recent Development 
and Results," Proc. DARPA Speech and Natural Language Workshop, Philadelphia, PA, 
pp. 160-166, Feb., 1989. 
15. Murveit, H., Cohen, M., Price, P., Baldwin, G., Weintraub, M. and Bemstein, J., "SRI's 
DECIPHER System," Proc. DARPA Workshop on Speech and Natural Language 
Workshop, Harwichport, MA, Oct., 1989. 
16. Lee, C. H., Rabiner, L. R., Pieraccini, R. and Wilpon, J. G., "Acoustic Modeling for 
Large Vocabulary Speech Recognition," Proc. DARPA Speech and Natural Language 
Workshop, Harwichport, MA, Oct. 1989. 
17. Tohkura, Y., "A Weighted Cepstral Distance Measure for Speech Recognition," Proc. 
ICASSP-86, Tokyo, Japan, pp. 761-764, Apr., 1986. 
18. Juang, B.-H., Rabiner, L. R. and Wilpon, J. G., "On the Use of Bandpass Liftering in 
Speech Recognition," IEEE Trans. Acoust. Speech and Signal Processing, ASSP-35, 
No. 7, pp. 947-954, July, 1987. 
19. Soong, F. K. and Rosenberg, A. E., "On the Use of Instantaneous and Transitional 
Spectral Information in Speaker Recognition," Proc. ICASSP-86, Tokyo, Japan, 
pp. 877-880, April, 1986. 
20. Viterbi, A. J., "Error Bounds for Convolutional Codes and an Asymptotically Optimal 
Decoding Algorithm," IEEE Trans. Information Theory, Vol. IT-13, pp. 260-269, 1967. 
21. Sakoe, H. and Chiba, S., "Dynamic Programming Algorithm Optimization for Spoken 
Word Recognition," IEEE Trans. Acoust. Speech and Signal Processing, ASSP-26, 
pp. 43-49, Feb., 1978. 
22. Levenshtein, V. I., "Binary Codes Capable of Correcting Deletions, Insertions, and 
Reversals," Sov. Phys.-Dokl., Vol. 10, pp. 707-710, 1966. 
23. Gray, R. M., Probability, Random Processes and Ergodic Properties, Springer-Verlag, 
New York, 1988, pp. 254 ff. 
24. Brown, M. K. and Wilpon, J. G., "A Grammar Compiler for Connected Speech 
Recognition," submitted to IEEE Trans. Acoust. Speech and Signal Processing. 
25. Rabiner, L. R., Wilpon, J. G. and Juang, B.-H., "A Segmental K-means Training 
Procedure for Connected Word Recognition," AT&T Tech. J., Vol. 65, No. 3, pp. 21-31, 
May-June, 1986. 
26. Olive, J. P. and Liberman, M. Y., "Text to Speech: An Overview," J. Acoust. Soc. Am., 
Vol. 78, Supp. 1, p. 56, Fall, 1985. 
27. Duifhuis, H., Willems, L. F. and Sluyter, R. J., "Measurement of Pitch in Speech: An 
Implementation of Goldstein's Theory of Pitch Perception," J. Acoust. Soc. Am., 71, 
pp. 1568-1580, 1982. 
28. Levinson, S. E., "A Method for the Incorporation of a Tri-gram Model of English 
Phonotactics in a System for Phonetic Transcription of Unrestricted Speech," 
Unpublished Bell Laboratories Technical Memorandum, 1988. 
29. Wagner, R. and Fischer, M., "The String-to-String Correction Problem," JACM, 
Vol. 21, No. 1, pp. 168-173, 1974. 
199 
