Improvements in the Stochastic Segment Model for Phoneme Recognition 
V. Digalakis t M. Ostendofft J.R. Rohlicek:~ 
tBoston University 
~BBN Systems and Technologies Corp. 
ABSTRACT 
The heart of a speech recognition system is the acoustic 
model of sub-word units (e.g., phonemes). In this work 
we discuss refinements of the stochastic segment model, 
an alternative to hidden Markov models for representa- 
tion of the acoustic variability of phonemes. We con- 
centrate on mechanisms for better modelling time corre- 
lation of features across an entire segment. Results are 
presented for speaker-independent phoneme classifica- 
tion in continuous speech based on the 'lIMIT 0a!~base. 
INTRODUCTION 
Although hidden Markov models (HMMs) are cur- 
rently one of the most successful approaches to acous- 
tic modelling for continuous speech recognition, their 
performance is limited in part became of the as- 
sumption that observation features at different times 
are conditionally independent given the underlying 
state sequence and because the Markov assumption 
on the state sequence may not adequately model 
time structure. An alternative model, the stochas- 
tic segment model (SSlVl), was proposed to overcome 
some of these deficiencies \[Roucos and Dunham 1987, 
Ostendorf and Roucos 1989, Roucos et al 1988\]. 
An observed segment of speech (e.g., a phomeme) 
is represented by a sequence of q-dimensional feature 
vectors Y = \[~ ~ ... yt\] T, where the length k 
is variable and T denotes block transposition. The 
stochastic segment model for Y has two components 
\[Roncos et al 1988\]: 1) a time transformation Tk to 
model the variable-length observed segment, Y, in 
terms of a fixed-length unobserved sequence, X ffi 
\[zl z2 ... z,,~\] a', as Y = TkX, and 2) a probabilis- 
tic representation of the unobserved feature sequence X. 
The conditional density of the observed segment Y given 
phoneme a and observed length k is: 
f p(Yl, , k) = • pCXI, )dX. 
JX :Y= TaX 
Assuming the observed length is less than or equal to the 
length of X, k _< m, T~ is a time-warping transformation 
which obtains Y by selecting a subset of elements of X 
and the density p(Yla, k) is a qk-dimensional marginal 
distribution of p(Xlc0. In practice, we can accomodate 
observations of length k > m by either tying distri- 
butions of X (so rn is effectively larger) or discarding 
some of the observations in Y. In this work, as in previ- 
ous work, the time transformation, T~, is chosen to map 
each observed frame Vi to the nearest model sample zj 
according to a linear time-warping criterion. The distri- 
bution p(Xla) for the segment X given the phoneme a 
is then modelled using an rnq-dirnensional multi-variate 
Gaussian distribution. 
Algorithms for automatic recognition and training of 
the stochastic segment model are similar to those for 
hidden Markov modelling. The maximum a posteriori 
probability rule is used for classification of segments 
when the phoneme segmentation is known: 
max p(Y lct, k )p( k la)p(~), Q 
where p(kla) is the probability that phoneme a has 
length k. A Viterbi search over all possible segmen- 
tations is used for recognition with unknown segmen- 
tations. The models are trained from known segmen- 
tations using maximum likelihood parameter estimation. 
When segmentations are unknown, an iterative algorithm 
based on automatic segmentation and maximum likeli- 
hood parameter estimation exists for which increasing 
the probability of the observations with each iteration is 
guaranteed. 
Initial results with segment-based models have been 
encouraging. A stochastic segment model has pre- 
viously been used for speaker-dependent phoneme 
and word recognition, demonstrating that a seg- 
ment model outperformed a discrete hidden Markov 
model when both models were context-independent 
\[Ostendorf and Roucos 1989\]. Other segment-based 
models have also showed encouraging results in speaker- 
independent applications \[Bush and Kopec 1987, 
Bocchieri and Doddington 1986, 
332 
Makino and Kido 1986, Zue et al 1989\]. 
Although the previous results using the stochastic seg- 
ment model were encouraging, there were several limi- 
tations. First, the comparison to HMMs did not clearly 
show the advantages of the segment model since the 
SSM and the HMM used disparate feature distribu- 
tions: the segment model used ten continuous distri- 
butions and the HMM used three discrete distributions 
for each phoneme. Second, the flexibility of the seg- 
ment model was not fully exploited because time sam- 
pies within a segment were assumed independent due 
to training data limitations in these speaker-dependent 
applications. Finally, results showed that the context- 
dependent HMM triphone models \[Schwartz et al 1985\] 
outperformed the context-independent segment models. 
Again due to training data limitations, context-dependent 
segment models were not effective. 
In this work we address the issues of 1) time correla- 
tion modelling and 2) meaningful comparisons of the 
SSM with HMMs in a speaker-independent phoneme 
classification task. In the next section, we describe 
refinements to the SSM which improve the time cor- 
relation modelling capability, including time-dependent 
parameter reduction and assumption of a Markov time 
correlation structure. Then experimental results using 
the TIMIT database are described. These results include 
comparisons of HMM and SSM, as well as the effects of 
modelling time correlation. Although the HMM perfor- 
mauce is similar to segment model performance when 
time-sample independence is assumed for the segment 
model, we demonstrate that the refinements improve per- 
formance of the stochastic segment model so that it out- 
performs the HMM for phoneme classification. 
TIME CORRELATION 
MODELLING 
Preliminary experiments using full, rru/-dimensional co- 
variance structure for p(Xla) did not improve perfor- 
rnance over the block-diagonal structure used when time 
samples are assumed to be uncorrelated. We believe 
that this was due to insufficient training data since a 
full covariance model has roughly m times as many free 
parameters as a block diagonal one and since the particu- 
lar task on which the SSM was tested had a very fimited 
amount of training data. Our efforts have focused on two 
parallel approaches of handling the training data prob- 
lem: devising an effective parameter reduction method 
and constraining the structure of the covariance matrix 
to further reduce the number of parameters to estimate. 
PARAMETER REDUCTION 
A first step towards the incorporation of time correlation 
in the SSM is parameter reduction; an obvious candi- 
date is the method of linear discriminants \[Wilks 1962\]. 
Our intuition suggested that sample-dependent reduction 
would outperform a single transformation for reduction. 
In fact, contrary to other results \[Brown 1987\], the sin- 
gle tranformation yielded poor performance (see Section 
3). 
Linear discriminant parameter reduction for the 
SSM was implemented using sample-dependent trans- 
formations as follows. The speech segment X = 
\[Zl z2 ... zm\] ~r is substituted by a sequence of lin- 
ear transformations of the ra original cepstral vectors 
= \[zl ~2 ... Zm\] ~ = \[R(I)zI R(2)z2 ... R(ra)zm\] T 
The transformation matrices, R (i), are obtained by solv- 
ing m generalized eigenvalue problems, 
S~3~=~4~, iffil,2,...,m 
The classes used at each time instant from 1 to m are 
the phonemes, but the between- and within-class scatter 
matrices (S~) and Sty, respectively) are computed using 
only the observations for that specific sample. 
MARKOVIAN ASSUMPTION 
Structure 
In order to obtain the advantages of parameterization 
of time correlation without the m-fold increase in the 
number of parameters in the full-covariance case, we 
consider a constrained structure for the covariance ma- 
trix. More specifically, we assume that the density of 
the unobserved segment is that of a non-homogeneous 
Markov process 
F(X) = i~ (zl)pz(z21zl)p3 (z3 I z2).., pm(zm I zm- I) 
Under this hypothesis, the number of parameters that 
have to be estimated increases by less than a factor of 3 
over the block-diagonal case (see Table 1). Furthermore, 
by introducing the Markov restriction to the covariance 
marx, we shall also see in the following section that we 
can simplify the reestimation formulas for the Estimate- 
Maximize (EM) algorithm. 
333 
Block Diagonal ma" + 3rod 2 2 
(3nt-2)d ~ 3ntd Markovian -'--'-3-'-- + "-Z-- 
FuLl covariance '~ + 
Table 1: Number of parameters per phone model 
Parameter Estimation 
As mentioned earlier, we assume that the observed seg- 
ment Y is a "downsampled" version of the underlying 
fixed-length "hidden" segment, X. We used two differ- 
ent approaches for the parameter estimation problem. 
Linear Time Upsampling. The first, linear time up- 
sampling, interpolates an unobserved sample zi of the 
underlying sequence X, by mapping that point to an ob- 
served frame, y~, with a linear-in-time warping trans- 
formation of the observed length k to the fixed length 
rn. The disadvantage of this method is that linear time 
upsampling introduces a correlation problem when mod- 
els with non-independent flames are assumed, and in 
\[Roucos et al 1988\] better results were reported when 
the parameter estimates were obtained with the EM al- 
gorithm. However, in the case of frame dependent trans- 
formations, a missing observation is not interpolated by 
an adjacent one, but by a different transformation of 
that observation. This partially eliminates the correla- 
tion problem. 
Estimate-Maximize Algorithm for the Markovian 
Case. A second approach for the parameter estimation 
problem is to use the Estimate-Maxitrdze (EM) algorithm 
to obtain a maximum likelihood solution under the as- 
sumptions given in Section 2.2.1. As defined in Sec- 
tion 1, X represents the sequence of the incomplete d~!~ 
and Y that of the observed data. In this case, the ob- 
served length k is mapped through a linear time warping 
transformation to the fixed length ra, and each observa- 
tion y~ is assigned to the closest in time zi. Under the 
assumption that rn is always greater than k, there are cer- 
tain elements of X that have no elements of Y assigned 
to them, and we refer to them as "missing". Let/~\[. and 
C~ denote the estimates at the r-th iteration of the mean 
vector of the i-th sample and the cross-covariance be- 
tween the i-th and j-th samples respectively. Then the 
steps of the EM algorithm are: 
1. Estimate step: Estimate the following complete 
dam statistics for each frame or appropriate combination 
of frames and each observation: 
ff zi is observed; 
ff =i is missing. 
ff both are observed; 
ff =~ is missing; 
ff both are missing. 
E"C=ilY) = ~ =~' 
L E'(=dY), 
\[ ='=~' 
where j = i or ../= i+ 1, ' denotes transposition and .E"(-) 
is the expectation operator using the r-th iteration density 
estimates. Under the assumption that the observations 
form a Markov chain, we have that 
W(=~IY) = W(=dz6,=D, k < i < t 
where k and 1 are the immediately last and next non- 
missing elements of X to i and j. Let/~ be the mean of 
=i and Ci~ be the covariance of zi and =$. Assuming 
Gaussian densities, the conditional expectations become 
~(=, 1=6, =0 = ~ + q6 v66 (=6 - ~D+,q6 Kz(=~ - ~D+ 
+c~vi~(=6 - ~I,) + c~i(=~ - ~) 
and 
~"(=,=} I=,,, =,) = q~ - {q6 v66 c;l + q6 ~,q;+ 
l l l l +~ vi~cT~ + ~ ~i } + E"(=, I=6, =0E"(=i I=6, =0 
where Vkk Vk, \] r c~k c~, \] -1 
and the Vtk, V6z, ~i matrices are obtained from the ma- 
trix inversion by partitioning lemma. 
2. Maximize step: The (r + 1)-th estimates are: 
XET 
I 
XET 
334 
where T is the set of all observations for a certain 
phoneme. In order to simplify the reestimation formu- 
las, we also investigated a "forward prediction" type ap- 
proximation, where the expectations of the r-th step are 
conditioned only on the last observed sample instead of 
both the last and the next: 
~'(z~lz~) = ~'I + Cl'~CI~-l(zk - ~'D 
~ --lfvrt. I. ~(ZiZJ Izk) = ~'i -- ~k'-'~k ~k 
EXPERIMENTAL RESULTS 
In this section we present experimental results for 
speaker-independent phoneme recognition. We per- 
formed experiments on the TIM1T database \[Lamel et 
al 1986\] for segment models and hidden Markov mod- 
els using known phonetic segmentations. Mel-warped 
cepstra and their derivatives, together with the deriva- 
tive of log power, are used for recognition. We used 
61 phonetic models. However, in counting errors, dif- 
ferent phones representing the same English phoneme 
can be substituted for each other without causing an er- 
ror. The set of 39 English phonemes that we used is 
the same as the ones that the CMU SPHINX and MIT 
SUMMIT systems reported phonetic recognition results 
on \[Lee and Hon 1988, Zue et al 1989\]. 
The portion of the database that we have available 
consists of 420 speakers and 10 sentences per speaker, 
two of these, the "sa" sentences, are the same across all 
speakers and were not used in either recognition or train- 
ing because they would lead to optimistic results. We 
designated 219 male and 98 female speakers for training 
(a total of 2536 training sentences) and a second set of 
71 male and 32 female different speakers for testing (a 
total of 824 test sentences with 31,990 phonemes). We 
deliberately selected a large number of testing speak- 
ers to increase the confidence of performance estimates. 
The best-case results, reported at the end of this section, 
were obtained using all of the available training sen- 
tences (from both male and female speakers) and testing 
over the entire test set. Most of the other results for 
algorithm development and comparisons were obtained 
by training over the male speakers only and testing on 
the Western and N.Y. dialect male speakers (a total of 
219 training and 17 test speakers), a subset that gives us 
good estimates of the overall performance as we can see 
from the global results. 
SSM w/o duration 67.0% 
SSM with duration 68.6% 
HMM 68.7% 
Table 2: HMM and Segment Comparison (rn = 5, 10 
cepstra and deriv., no time correlation). 
The TIMIT database has also been used by other re- 
searchers for the evaluation of the phonetic accuracy of 
their speech recognizers. Lee and Hon, 1988, reported 
a phonetic accuracy for the SPHINX system of from 
58.8% with 12% insertions when context-independent 
(61) phone models were used. Zue et al, 1989 obtained 
a 70% classification performance on the same database 
for unknown speakers. 
HMM/segment comparison. We first evaluated the 
relative performance of the SSM to a continuous Gaus- 
sian distribution HMM \[Babl et al 1981\]. In this ex- 
periment, the features were ten reel-warped cepstra and 
their derivatives. Both the SSM and the HMM had the 
same number of distributions (SSM of length rn = 5 
and no time correlation versus a 7-state, 5 distributions 
HMM), and the recognition algorithm for the HMM was 
a full search along the observed length. For the SSM, 
we obtained results for two cases: with and without us- 
ing the duration information p(kl~ ). Note that, for a 
fair comparison, the segment model should include du- 
ration information, which was not incorporated in ear- 
tier versions of the segment model. With the duration 
information, both the SSM and the HMM gave simi- 
lar performance (68.6% and 68.7%, see Table 2). The 
superiority of the SSM becomes clearer after the time 
correlation and the parameter reduction methods are in- 
corporated, even though the SSM with lime correlation 
suffered from limited training. 
Parameter Reduction. We compared the single and 
multiple transformation reduction methods on a single- 
speaker task, using 14 mel-warped cepstra (but not their 
derivatives) as original features. We evaluated the recog- 
nition performance of the SSM for 1) different numbers 
of the original cepstral coefficients (from 4 up to 14), 2) 
different number of linear discriminants obtained using a 
single transformation and 3) different numbers of linear 
335 
80 i i 
70 
6 10 16 
Number of Featurea (q) 
VS 
d 
o 
Q g~ 
IlO 
I I I 
/ ¢, _Markovian .umpUols 
0 lna*pmulmst Frmnm 
tiff , I i I , I i 
0 10 80 80 40 
Number of Linear Dl~rlmlnan~ (q) 
Figure 1: Evaluation of parameter reduction methods on 
a single-speaker task. 
Figure 2: Performance of different covariance structure 
for a segment model length rn = 5 using linear upsam- 
piing parameter estimation. 
discriminants obtained from frame dependent transfor- 
mations (see Figure 1). The frame dependent features 
gave the best performance of 78.2% when the features 
were reduced to 9 due to training problems, whereas 
the single transformation features gave actually lower 
performance than the original features. This can be 
explained by the fact that there was a small between- 
class scatter for the single Wansformation, relative to 
the sample-dependent transformations. The eigenvalue 
spread for the single transformation was only 6.2 (ratio 
of largest to smallest eigenvalue) whereas in the case of 
multiple transformations this ratio ranged from 178.7 to 
318.3. A larger ratio occurs at the middle frames since 
the effect of adjacent phonemes is smaller at the middle 
of a certain phoneme and is easier to discriminate. The 
recognition performance for the single speaker reported 
here was measured on a set of 61 phonemes counting 
misrecognized allophones as errors. 
Time Correlation. We performed a series of experi- 
ments on three different types of covatiance matrices for 
the SSM. The length of the SSM in this case was rn = 5. 
In Figure 2, we have plotted the phonetic accuracy ver- 
sus the number of features q for 1) a full covariance, 2) a 
Markov structure and 3) for a block diagonal covariance 
(independent frames). When the number of features is 
small and there is enough data to estimate the param- 
eters, the full covariance model outperforms the other 
two. It should be noted though that the performance 
of the Markovian model is close to that of the full co- 
variance even for a small number of features. Hence, 
the Markov hypothesis represents well the structure of 
the covariance matrix, and as the number of features 
increases the Markov model outperforms the full covari- 
ance model since it is more easily trainable. In addition, 
we expect that the curves for those two models would be 
further separated for a bigger segment length m, since 
the number of parameters for the full model is quadratic 
in ra, whereas that of the Markov model is linear. (For 
m = 5 the full model has almost twice as many param- 
eters as the Markov model). 
As the number of parameters increases, the inde- 
pendence assumption gives the best recognition perfor- 
mance. However, with more training data the mod- 
els that use time correlation will outperform the model 
which does not. Furthermore, we were able to duplicate 
the best case results using a Markov model for the first 
features and a second independent distribution for the 
"less significant" features. (In this case, the correlation 
336 
Method 5-Cepstra \[ 
5-DCepstra 
Lin.Time Resampling 45.7% 
EM Algorithm 59.0% 
Forward Approxim. 58.7% 
10 LD 
67.3% 
62.8% 
60.5% 
Table 3: Parameter Estimation Algorithms (ra = 8). 
between the first and the last features is lost, but the time 
correlation between successive frames compensates for 
this). 
Parameter Estimation Algorithms. We evaluated the 
different methods of parameter estimation that we pre- 
sented in Section 2. In this set of experiments, we only 
used 10 features (either 5 reel-warped cepstra and their 
derivatives, or 10 linear discriminants) due to limitations 
in the available computer time. A segment of length 
m = 8 rather than the usual rn = 5 was used, in order to 
obtain a better understanding of the interpolation poten- 
tial of each algorithm. The comparative results are sum- 
marized in Table 3. When cepstra and their derivatives 
are used, the EM algorithm clearly gives better results 
than the linear time upsampling method. In addition, 
the "forward prediction" approximation gave us simi- 
lar recognition performance to the one obtained when 
the full reestirnation formulas were used. However, the 
situation is inverted when the features are linear dis- 
eriminants, and we refer the reader to Section 2 for an 
explanation of those results. 
Global Results. The best case system - based on in- 
dependent samples, m = 5, and 38 linear discriminants 
- was evaluated using the entire data set. The classi- 
fier was trained on the whole training set of all male 
and female speakers, and tested on 824 sentences from 
103 speakers. As it can be seen in Table 4, where we 
present the results by region and sex, the phoneme clas- 
sification rate does not have large variations among dif- 
ferent regions, indicating the robustness of our classifier. 
The somewhat higher numbers on the male speakers can 
be attributed to the fact that approximately 70% of our 
training set consisted of sentences spoken by male speak- 
ers and the classifier was biased in this sense. The re- 
suits were also consistent among different speakers. The 
Region \] Males I Females \[ M + P I 
New England 72.4% 
Northern 73.5% 
North Midland 70.8% 
South Midland 72.4% 
Southern 73.2% 
N.Y.City 69.2% 
Western 72.8% 
Army Brat 73.8% 
70.1% 71.4% 
71.7% 73.1% 
70.0% 70.6% 
70.6% 71.8% 
70.7% 72.2% 
70.6% 69.7% 
75.9% 73.6% 
73.3% 73.7% 
Total I 72.3% \[ 71.4% \[ 72.1% \[ 
Table 4: Phonetic Accuracyby Region and Sex 
recognition rates for all speakers ranged from 59.9% to 
80.7%, with the median speakers at 72.7% for the male 
test speakers, 71.9% for the female speakers and 72.3% 
for the whole test set. Approximately 80% of all the 
test speakers (82 out of 103) had a recognition perfor- 
mance over 69%, and only 8% of the speakers gave 
performance below 65%, including some "problematic" 
speakers. 
Our best cease result of 72% correct classification can 
be compared to the SUMMIT 70% classification per- 
formance on the TIMIT data for unknown speakers 
\[Zue et al 1989\]. Although these results are based on 
known segmentations, past work in segment modelling 
for speaker-dependent phoneme recognition showed 
that recognition with unknown segmentations yields a 
small loss in recognition performance with a cost of 
10% phoneme insertion \[Ostendorf and Roucos 1989\]. 
With this small loss in performance, the segment 
models can still be expected to outperform HMM 
phoneme recognition performance of 59% on this task 
\[Lee and Hon 1988\]. 
CONCLUSIONS 
In summary, we have shown that with sufficient training 
data, it is possible to model detailed time correlation in 
a segment-based model which can outperform HMMs in 
context-independent phoneme classification tasks. It re- 
mains to be shown that this result also holds for phoneme 
recognition, when phoneme segmentation boundaries are 
not known. In addition, the result should be extended to 
337 
new tasks, where automatic training will be required. 
There are several directions to further develop the 
SSM. Since context-dependent models have been shown 
to give dramatic improvements in HMM word recog- 
nition, it is important to demonstrate similar results for 
segment models. This will require research in robust pa- 
raameter estimation techniques. In addition, research on 
the variable-to-fixed length transformation is also im- 
portant. Although a constrained transformatiou is prob- 
ably an advantage of the segment model, it is not clear 
that linear time warping is the best transformation for all 
phonemes, and it may be useful to develop a mechanism 
for estimating transformations. 
Acknowledgement 
This work was jointly funded by NSF and DARPA under 
NSF grant number IRI-8902124. 
References 
\[Bahl et al 1981\] L.R. Bahl, R. Bakis, P.S. Cohen, A. 
Cole, F. Jelinek, B.L. Lewis, and R.L. Mercer, 
"Continuous parameter acoustic processing for 
recognition of a natural speech corpus," In IEEE 
Int. Conf Acoust., Speech, Signal Processing, 
pages 1149-1152, Atlanta, GA, April 1981. 
\[Bocchieri and Doddington 1986\] E. Bocchieri and O. 
Doddington, "Frame-specific statistical features 
for speaker-independent speech recognition," 
IEEE Trans. Acoust., Speech and Signal Proc., 
ASSP-34(4):755-764, August 1986. 
\[Brown 1987\] P. F. Brown, The Acoustic-Modeling 
Probelm in Automatic Speech Recognition, PhD 
thesis, Carnegie-Mellon University, May 1987. 
IBM Technical Report RC12750. 
\[Bush and Kopec 1987/\] M.A. Bush and G.E. Kopec, 
"Network-based connected digit recognition," 
IEEE Trans. Acoust., Speech, Signal Processing, 
ASSP-35(10):1401-1413, October 1987. 
\[Lamel et al 1986\] L.F. Lamel, R. H. Kassel and S. 
Seneff, "Speech Database Development: Design 
and Analysis of the Acoustic-Phonetic Corpus," 
in Proc. DARPA Speech Recognition Workshop, 
Report No. SAIC-86/1546, pp. 100-109, Feb. 
1986. 
\[Lee and Hon 1988\] K.-F. Lee and H.-W. Hon, 
"Speaker-Independent Phone Recognition Using 
Hidden Markov Models," CMU Technical 
Report No. CMU-CS-88-121. 
\[Makino and Kido 1986\] S. Makino and K. Kido, 
"Recognition of Phonemes Using Time-Spectrum 
pattern," Speech Communication, Vol. 5, No. 2, 
June 1986, pp. 225-238. 
\[Ostendorf and Roucos 1989\] M. Ostendorf and S. 
Roucos, "A stochastic segment model for 
phoneme-based continuous speech recognition," 
to appear, IEEE Trans. Acoustic Speech and 
Signal Processing, December 1989. 
\[Roucos and Dunlmm 1987\] S. Roucos and M. 
Ostendorf Dunham, "A stochastic segment 
model for phoneme-based continuous speech 
recognition," In IEEE Int. Conf. Acoust., Speech, 
Signal Processing, pages 73---89, Dallas, TX, 
April 1987/. Paper No. 3.3. 
\[Roucos et al 1988\] S. Roucos, M. Ostendorf, H. Gish, 
and A. Derr, "Stochastic segment modeling 
using the estimate-maximize algorithm," In IEEE 
Int. Conf. Acoust., Speech, Signal Processing, 
pages 127-130, New York, New York, April 
1988. 
\[Schwartz et al 1985\] R.M. Schwartz, Y.L. Chow, 
O.A. Kimball, S. Roucos, M. Krasner, and J. 
Makhoul, "Context-dependent modeling for 
acoustic-phonetic recognition of continuous 
speech," In IEEE Int. Conf. Acoust., Speech, 
Signal Processing, pages 1205-1208, Tampa, FL, 
March 1985. Paper No. 31.3. 
\[Wilks 1962\] S.S. Wilks, Mathematical Statistics, John 
Wiley & Sons, 1962. 
\[Zue et al 1989\] V. Zue, J. Glass, M. Phillips and S. 
Seneff, "Acoustic Segmentation and Phonetic 
Classification in the Summit System," in IEEE 
Int. Conf. Acoust., Speech, Signal Processing, 
pages 389-392, Glasgow, Scotland, May 1989. 
338 
