A DYNAMICAL SYSTEM APPROACH 
TO CONTINUOUS SPEECH RECOGNITION 
V. Digalakis† J. R. Rohlicek‡ M. Ostendorf†
† Boston University
44 Cummington St.
Boston, MA 02215
‡ BBN Inc.
10 Moulton St.
Cambridge, MA 02138
ABSTRACT
A dynamical system model is proposed for better representing
the spectral dynamics of speech for recognition. We assume
that the observed feature vectors of a phone segment are the
output of a stochastic linear dynamical system and consider two
alternative assumptions regarding the relationship of the segment
length and the evolution of the dynamics. Training is equivalent
to the identification of a stochastic linear system, and we follow
a nontraditional approach based on the Estimate-Maximize
algorithm. We evaluate this model on a phoneme classification task
using the TIMIT database.
INTRODUCTION
A new direction in speech recognition via statistical methods
is to move from frame-based models, such as Hidden
Markov Models (HMMs), to segment-based models that
provide a better framework for modeling the dynamics of
the speech production mechanism. The Stochastic Segment
Model (SSM) is a joint model for a sequence of observations,
allowing explicit modeling of time correlation. Originally
in the SSM, a phoneme was modeled as a sequence
of feature vectors that obeyed a multivariate Gaussian
distribution. The variable length of an observed phoneme was
handled either by modeling a fixed-length transformation of
the observations [6] or by assuming the observation was a
partially observed sample of a trajectory represented by a
fixed-length model [7]. In the first case, the maximum
likelihood estimates of the parameters can be obtained directly,
but the Estimate-Maximize algorithm [2] may be required
in the second case.
Unfortunately, the joint Gaussian model suffers from
estimation problems, given the number of acoustic features
and the analysis-frame rate that modern continuous speech
recognizers use. Therefore, a more constrained assumption
about the correlation structure must be made. In previous
work [3], we chose to constrain the model to a
time-inhomogeneous Gauss-Markov process. Under the
Gauss-Markov assumption, we were able to model well the time
correlation of the first few cepstral coefficients, but the
performance decreased when a larger number of features were
used. We attribute the performance decrease to insufficient
training data and the noisy nature of the cepstral
coefficients. In this work we deal with the problem of noisy
observations through a time-inhomogeneous dynamical system
formalism, including observation noise in our model.
Under the assumption that we model speech as a Gaussian
process at the frame-rate level, a linear state-space
dynamical system can be used to parameterize the density
of a segment of speech. This is a natural generalization of
our previous Gauss-Markov approach, with the addition of
modeling error in the form of observation noise.
We can make two different assumptions to address the 
time-variability issue: 
1. Trajectory invariance (A1): There are underlying un- 
observed trajectories in state-space that basic units 
of speech follow. In the dynamical system formalism, 
this assumption translates to a fixed sequence of state 
transition matrices for any occurrence of a speech seg- 
ment. Then, the problem of variable segment length 
can be solved by assuming that the observed feature 
vectors are not only a noisy version of the fixed un- 
derlying trajectory, but also an incomplete one with 
missing observations. Successive observed frames of 
speech have stronger correlation for longer observa- 
tions, since the underlying trajectory is sampled at 
shorter intervals (in feature space). 
2. Correlation invariance (A2): The underlying trajec- 
tory in phase space is not invariant under time-warping 
transformations. In this case, the sequence of state 
transition matrices for a particular observation of a 
phoneme depends on the phoneme length, and we have 
a complete (albeit noisy) observation of the state se- 
quence. In this case, we assume that it is the corre- 
lation between successive frames that is invariant to 
variations in the segment length. 
Under either assumption, the training problem with a
known segmentation is that of maximum likelihood
identification of a dynamical system. We use here a nontraditional
method based on the EM algorithm that can be easily used
under either correlation or trajectory invariance. The model
is described in the section "A Dynamical Model for Speech
Segments", and the identification algorithms are in the section
"Training". In the section "Recognition" we briefly describe
phoneme classification and recognition algorithms for this
model, and finally in the section "Experimental Results" we
present phone classification results on the TIMIT database [5].
A DYNAMICAL MODEL FOR 
SPEECH SEGMENTS 
A segment of speech is represented by an $L$-long sequence
of $q$-dimensional feature vectors $Z = [z_1\ z_2\ \ldots\ z_L]$.
The original stochastic segment model for $Z$ had two
components [7]: i) a time transformation $T_L$ to model the
variable-length observed segment in terms of a fixed-length
unobserved sequence, $Z = Y T_L$, where $Y = [y_1\ y_2\ \ldots\ y_M]$, and
ii) a probabilistic representation of the unobserved feature
sequence $Y$. We assumed in the past [3] that the density
of $Y$ was that of an inhomogeneous Gauss-Markov process.
We then showed how the EM algorithm can be used to
estimate the parameters of the models under this assumption.
In this work, we extend the modeling of the feature
sequence to the more general Markovian representation for
each different phone model $\alpha$
$$x_{k+1} = F_k(\alpha)\, x_k + w_k$$
$$y_k = H_k(\alpha)\, x_k + v_k \qquad (1)$$
where $w_k$, $v_k$ are uncorrelated Gaussian vectors with
covariances
$$E\{w_k w_l^T \mid \alpha\} = Q_k(\alpha)\, \delta_{kl}$$
$$E\{v_k v_l^T \mid \alpha\} = R_k(\alpha)\, \delta_{kl}$$
where $\delta_{kl}$ is the Kronecker delta. We further assume that
the initial state $x_0$ is Gaussian with mean and covariance
$\mu_0(\alpha)$, $\Sigma_0(\alpha)$. In this work, we arbitrarily choose the
dimension of the state to be equal to that of the feature vector
and $H_k(\alpha) = I$, the identity matrix. The sequence $Y$ is
either fully or partially observed under the assumptions of
correlation and trajectory invariance respectively. In order
to reduce the number of free parameters in our model, we
assume that a phone segment is locally stationary over
different regions within the segment, where those regions are
defined by a fixed time warping that in this work we simply
choose as linear. In essence, we are tying distributions, and
the way this is done under the correlation and trajectory
invariance assumptions is shown in Figure 1.
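The distribution tying described above can be illustrated with a small sketch. It assumes 0-based frame indices and a floor-based linear warping; the function names and the exact rounding rule are hypothetical choices for illustration, not taken from the paper or from Figure 1.

```python
def region_of_frame(k, length, n_regions):
    """Region index (0-based) of frame k in a sequence of `length` frames,
    using a linear time warping onto `n_regions` locally stationary regions."""
    if not 0 <= k < length:
        raise ValueError("frame index out of range")
    return (k * n_regions) // length


def tying_correlation_invariance(length, n_regions):
    """A2 (assumed form): every observed frame is a model time step, so the
    observed frames themselves are mapped linearly onto the regions."""
    return [region_of_frame(k, length, n_regions) for k in range(length)]


def tying_trajectory_invariance(length, traj_len, n_regions):
    """A1 (assumed form): the model trajectory has a fixed length `traj_len`.
    The `length` observed frames are placed on it by linear warping, and the
    remaining trajectory points are treated as missing observations.
    Returns (trajectory times of the observed frames, region of each
    trajectory point)."""
    if length > traj_len:
        raise ValueError("this sketch assumes length <= traj_len")
    observed_times = [(k * traj_len) // length for k in range(length)]
    regions = [region_of_frame(t, traj_len, n_regions) for t in range(traj_len)]
    return observed_times, regions


if __name__ == "__main__":
    print(tying_correlation_invariance(10, 5))
    print(tying_trajectory_invariance(3, 6, 3))
```

Under A2 the region boundaries move with the segment length, while under A1 they stay fixed on the underlying trajectory and only the set of observed time points changes.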
The likelihood of the observed sequence $Z$ can be
obtained by the Kalman predictor, as
$$\log p(Z \mid \alpha) = -\sum_{k=1}^{L} \left\{ \log \left| \Sigma_k^{(e)}(\alpha) \right| + e_k^T(\alpha) \left[ \Sigma_k^{(e)}(\alpha) \right]^{-1} e_k(\alpha) \right\} + \text{constant} \qquad (2)$$
where $e_k(\alpha)$ is the prediction error and $\Sigma_k^{(e)}(\alpha)$ is the
prediction error variance given phone model $\alpha$. In the
trajectory invariance case, innovations are
only computed at the points where the output of the system
is observed, and the predicted state estimate for these times
can be obtained by the $l$-step ahead prediction form of the
Kalman filter, where $l$ is the length of the last "black-out"
interval - the number of missing observations $y$ immediately
before the last observed frame $z$.
Figure 1: Distribution tying for (a) correlation invariance and
(b) trajectory invariance.
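As an illustration of the prediction-error form of the likelihood, here is a minimal sketch for a scalar state with $H = 1$. The function name, the convention that `obs[0]` observes the initial state, and the use of `None` for a missing frame are assumptions made for this example. The sketch uses the standard Gaussian log-density, so its value may differ from (2) by scaling and additive-constant conventions.

```python
import math


def kalman_loglik(obs, F, Q, R, mu0, P0):
    """Log-likelihood of a scalar observation sequence under
        x[k+1] = F[k] * x[k] + w[k],   y[k] = x[k] + v[k]   (H = 1).
    obs[k] is a float, or None when frame k is missing (a "black-out").
    F, Q, R are lists indexed by k; mu0, P0 describe the initial state."""
    x_pred, p_pred = mu0, P0          # predicted mean and variance of x[0]
    loglik = 0.0
    for k, y in enumerate(obs):
        if y is None:
            # Missing frame: no innovation, the prediction is carried forward,
            # which amounts to multi-step-ahead prediction across the gap.
            x_filt, p_filt = x_pred, p_pred
        else:
            e = y - x_pred            # innovation (prediction error)
            s = p_pred + R[k]         # innovation variance
            loglik += -0.5 * (math.log(2.0 * math.pi * s) + e * e / s)
            gain = p_pred / s
            x_filt = x_pred + gain * e
            p_filt = (1.0 - gain) * p_pred
        x_pred = F[k] * x_filt        # time update
        p_pred = F[k] * p_filt * F[k] + Q[k]
    return loglik


if __name__ == "__main__":
    print(kalman_loglik([0.9, None, 1.2], [1.0] * 3, [0.1] * 3, [0.5] * 3, 0.0, 1.0))
```

Skipping the measurement update at missing frames is what makes the same routine usable under both invariance assumptions in this simplified setting.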
TRAINING 
The classical method to obtain maximum likelihood
estimates involves the construction of a time-varying Kalman
predictor and the expression of the likelihood function in
terms of the prediction error as in (2) [1]. The maximization
of the log-likelihood function is equivalent to a nonlinear
programming problem, and iterative optimization methods
have to be used that all require the first and perhaps the
second derivatives of the log-likelihood function with respect
to the system parameters. The solution requires the
integration of adjoint equations, and the method becomes too
involved under the trajectory invariance assumption, where
we have missing observations.
We have developed a nontraditional iterative method for
maximum likelihood identification of a stochastic dynamical
system, based on the observation that the computation
of the estimates would be simple if the state of the system
were observable: they could then be obtained from simple
first and second order sufficient statistics of the state and
observation vectors. The Estimate-Maximize algorithm
provides an approach for estimating parameters for processes
having unobserved components, in this case the state vectors,
and therefore can be used for maximum likelihood
identification of dynamical systems.
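To make the "simple if the state were observable" remark concrete, the sketch below fits a scalar, time-invariant transition $x_{k+1} = F x_k + w_k$ from a directly observed state sequence. The scalar and time-invariant setting and the function name are simplifying assumptions for illustration; the estimates are ordinary regression coefficients built from second order statistics.

```python
def fit_observed_state(xs):
    """ML estimates of F and Q for scalar x[k+1] = F * x[k] + w[k],
    w[k] ~ N(0, Q), when the state sequence xs is directly observed."""
    n = len(xs) - 1                                        # number of transitions
    if n < 1:
        raise ValueError("need at least two states")
    s00 = sum(a * a for a in xs[:-1])                      # sum of x[k]^2
    s10 = sum(b * a for a, b in zip(xs[:-1], xs[1:]))      # sum of x[k+1] * x[k]
    s11 = sum(b * b for b in xs[1:])                       # sum of x[k+1]^2
    F = s10 / s00                                          # regression coefficient
    Q = (s11 - F * s10) / n                                # residual variance
    return F, Q


if __name__ == "__main__":
    print(fit_observed_state([8.0, 4.0, 2.0, 1.0]))
```

When the state is hidden, the same formulas can be applied with the sums replaced by their conditional expectations, which is the role of the Estimate step.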
If we denote the parameter vector of phone model $\alpha$ by
$\theta$, then at the $p$th iteration of the EM algorithm the new
estimate of the parameter vector is obtained by minimizing
$$E\{\, \cdots + \log |R_k| + \text{constant} \mid Z, \theta^{(p)} \} \qquad (3)$$
where we have suppressed the parameterization of the
system parameters on phone model $\alpha$ and the first summation
is over all occurrences of a specific phone model in the
training data.
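For orientation, the following is a sketch of the kind of cost that EM minimizes for a linear Gaussian model of the form (1), written for a single segment. It is derived from the Gaussian assumptions alone and is not a transcription of (3): the symbol $R_k$ for the covariance of $v_k$, the index convention, and the omission of the initial-state term are assumptions of this sketch.

```latex
% Sketch only: expected complete-data cost for one segment under model (1).
% Assumed notation: R_k is the covariance of v_k; the initial-state term
% is omitted for brevity.
\begin{align*}
J(\theta) = E\Big\{ \sum_{k=1}^{L} \Big[
  & (x_k - F_{k-1} x_{k-1})^T Q_{k-1}^{-1} (x_k - F_{k-1} x_{k-1})
    + \log |Q_{k-1}| \\
  & + (y_k - H_k x_k)^T R_k^{-1} (y_k - H_k x_k) + \log |R_k| \Big]
  + \text{constant} \;\Big|\; Z, \theta^{(p)} \Big\}
\end{align*}
```

Expanding the quadratic forms shows that such a cost depends on the data only through conditional expectations of products such as $x_k x_k^T$, $x_k x_{k-1}^T$, $y_k x_k^T$ and $y_k y_k^T$, which is why first and second order statistics suffice.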
Since the noise process is assumed to be Gaussian, the
EM algorithm simply involves iteratively computing the
expected first and second order sufficient statistics given the
current parameter estimates. It is known from Kalman
filtering theory [1] that the conditional distribution of the
state $X$ given the observations $Z$ on an interval is
Gaussian. The sufficient statistics are then
$$E\{x_k x_k^T \mid Z, \theta\} = \Sigma_{k|L} + \hat{x}_{k|L} \hat{x}_{k|L}^T$$
$$E\{x_k x_{k-1}^T \mid Z, \theta\} = \Sigma_{k,k-1|L} + \hat{x}_{k|L} \hat{x}_{k-1|L}^T$$
$$E\{y_k x_k^T \mid Z, \theta\} = \begin{cases} z_k \hat{x}_{k|L}^T, & \text{if observed;} \\ H_k E\{x_k x_k^T \mid Z\}, & \text{if missing,} \end{cases}$$
where the quantities on the right, $\hat{x}_{k|L}$, $\Sigma_{k|L}$, $\Sigma_{k,k-1|L}$, are
the fixed interval smoothed state estimate, its variance and
the one lag cross-covariance respectively. The computation
of these sufficient statistics can be done recursively. Under
A2, since $Y = Z$, it reduces to the fixed-interval smoothing
form of the Kalman filter, together with some additional
recursions for the computation of the cross-covariance. These
recursions consist of a forward pass through the data,
followed by a backward pass and are summarized in Table 1.
Under A1, the recursions take the form of a fixed interval
smoother with blackouts, and can be derived similarly to
the standard Kalman filter recursions.
To summarize, assuming a known segmentation and
therefore a known sequence of system models, the EM algorithm
involves at each iteration the computation of the sufficient
statistics described previously using the recursions of
Table 1 and the old estimates of the model parameters
(Estimate step). The new estimates for the system parameters
can then be obtained from these statistics as simple
multivariate regression coefficients (Maximize step). In addition,
the structure of the system matrices can be constrained in
order to satisfy identifiability conditions. When the
segmentation is unknown, since the estimates obtained from
our known segmentation method are maximum likelihood
ones, training can be done in an iterative fashion, as
described in [6].
Table 1: Forward recursions (innovation $y_k - H_k \hat{x}_{k|k-1}$,
prediction covariance $\Sigma_{k+1|k} = F_k \Sigma_{k|k} F_k^T + Q_k$) and
backward recursions.
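The Estimate and Maximize steps can be sketched for a scalar, time-invariant, fully observed model (the correlation invariance setting with a single region). Everything specific here is an assumption made for illustration: the function names, the scalar state with $H = 1$, the convention that the first observation observes the initial state, and the choice to re-estimate only $F$ and $Q$. The backward pass is the standard Rauch-Tung-Striebel smoother, which may differ in form from the recursions of Table 1.

```python
def smooth(obs, F, Q, R, mu0, P0):
    """Fixed-interval smoother for x[k+1] = F x[k] + w[k], y[k] = x[k] + v[k].
    Returns smoothed means xs, variances ps, and lag-one covariances cs,
    where cs[k] = Cov(x[k], x[k-1] | all obs) and cs[0] is None."""
    n = len(obs)
    xp, pp, xf, pf = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
    x_pred, p_pred = mu0, P0
    for k, y in enumerate(obs):                  # forward (filtering) pass
        xp[k], pp[k] = x_pred, p_pred
        gain = p_pred / (p_pred + R)
        xf[k] = x_pred + gain * (y - x_pred)     # update with the innovation
        pf[k] = (1.0 - gain) * p_pred
        x_pred = F * xf[k]
        p_pred = F * pf[k] * F + Q
    xs, ps, cs = xf[:], pf[:], [None] * n
    for k in range(n - 2, -1, -1):               # backward (smoothing) pass
        j = pf[k] * F / pp[k + 1]                # smoother gain
        xs[k] = xf[k] + j * (xs[k + 1] - xp[k + 1])
        ps[k] = pf[k] + j * (ps[k + 1] - pp[k + 1]) * j
        cs[k + 1] = ps[k + 1] * j                # one lag cross-covariance
    return xs, ps, cs


def em_step(obs, F, Q, R, mu0, P0):
    """One EM iteration re-estimating F and Q (R, mu0, P0 held fixed)."""
    xs, ps, cs = smooth(obs, F, Q, R, mu0, P0)   # Estimate step
    n = len(obs)
    s00 = sum(ps[k] + xs[k] ** 2 for k in range(n - 1))           # E{x[k-1]^2}
    s11 = sum(ps[k] + xs[k] ** 2 for k in range(1, n))            # E{x[k]^2}
    s10 = sum(cs[k] + xs[k] * xs[k - 1] for k in range(1, n))     # E{x[k] x[k-1]}
    F_new = s10 / s00                            # Maximize step: regression
    Q_new = (s11 - F_new * s10) / (n - 1)
    return F_new, Q_new


if __name__ == "__main__":
    F, Q = 1.0, 1.0
    for _ in range(5):
        F, Q = em_step([1.0, 2.0, 3.0, 2.5, 2.0], F, Q, 1.0, 0.0, 1.0)
    print(F, Q)
```

The Maximize step has the same regression form as in the observable-state case, with the sums of products replaced by their smoothed expectations.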
RECOGNITION 
When the phonetic segmentation is known, under both 
assumptions A1 and A2 the model sequence can be deter- 
mined from the segmentation and therefore the MAP rule 
can be used for phone classification, where the likelihood of 
the observations is obtained from the Kalman predictor (2). 
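With a known segmentation, the MAP rule reduces to picking the phone with the largest sum of log-likelihood and log prior. The sketch below shows only that decision rule. The `toy_loglik` scorer is a hypothetical stand-in (independent Gaussian frames) used so the example runs on its own; in the setting of this paper the score would come from the Kalman predictor likelihood.

```python
import math


def map_classify(segment, models, log_prior, loglik):
    """Return (best_phone, scores), where scores[a] = log p(Z | a) + log P(a)."""
    scores = {a: loglik(segment, models[a]) + log_prior[a] for a in models}
    return max(scores, key=scores.get), scores


def toy_loglik(segment, params):
    """Hypothetical stand-in scorer: independent Gaussian frames with a
    common mean and variance. Not the dynamical system likelihood."""
    mean, var = params
    return sum(-0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)
               for z in segment)


if __name__ == "__main__":
    models = {"aa": (0.0, 1.0), "iy": (5.0, 1.0)}
    log_prior = {"aa": math.log(0.5), "iy": math.log(0.5)}
    print(map_classify([4.8, 5.1], models, log_prior, toy_loglik)[0])
```

A duration log-probability could be added to each score in the same way as the prior.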
For connected-phone recognition, with unknown segmen- 
tation, the MAP rule for detecting the most likely phonetic 
sequence involves computing the total probability of a cer- 
tain sequence by summing over all possible segmentations. 
Because of the computational complexity of this approach, 
one can jointly search for the most likely phone sequence 
and segmentation given the observed sequence. This can be 
done with a Dynamic-Programming recursion. In previous 
work we have also introduced alternative fast algorithms 
for both phone classification and recognition [4] which yield 
performance similar to Dynamic-Programming with signif- 
icant computation savings. 
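The joint search over phone sequence and segmentation can be sketched as a dynamic-programming recursion over segment end times. The scoring callback `seg_score`, the maximum segment length `max_len`, and the absence of any phone-sequence grammar are assumptions of this example; it is not one of the fast algorithms of [4].

```python
def best_segmentation(n_frames, phones, seg_score, max_len):
    """Find the phone sequence and segmentation with the highest total score.
    seg_score(phone, start, end) is the log score of frames [start, end)
    under `phone`. Returns (score, [(phone, start, end), ...])."""
    neg_inf = float("-inf")
    best = [neg_inf] * (n_frames + 1)    # best[t]: best score of frames [0, t)
    back = [None] * (n_frames + 1)       # back[t]: (start, phone) of last segment
    best[0] = 0.0
    for end in range(1, n_frames + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == neg_inf:
                continue
            for ph in phones:
                s = best[start] + seg_score(ph, start, end)
                if s > best[end]:
                    best[end], back[end] = s, (start, ph)
    segments, t = [], n_frames           # trace back from the last frame
    while t > 0:
        start, ph = back[t]
        segments.append((ph, start, t))
        t = start
    segments.reverse()
    return best[n_frames], segments


if __name__ == "__main__":
    frames = [0, 0, 0, 5, 5]
    means = {"a": 0, "b": 5}

    def toy_score(ph, start, end):
        # Hypothetical scorer: squared error plus a fixed per-segment penalty.
        return -sum((x - means[ph]) ** 2 for x in frames[start:end]) - 1.0

    print(best_segmentation(len(frames), list(means), toy_score, max_len=5))
```

The cost grows with the number of frames, the maximum segment length and the number of phone models, which is the motivation for faster search strategies.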
EXPERIMENTAL RESULTS 
We have implemented a system based on our correlation
invariance assumption and performed phone classification
experiments on the TIMIT database [5]. We used
Mel-warped cepstra and their derivatives together with the
derivative of log power. The number of different
distributions (time-invariant regions) for each segment model was
5. We used 61 phonetic models, but in counting errors
we folded homophones together and effectively used the
reduced CMU/MIT 39 symbol set. The measurement-noise
variance was common over all different phone models and
was not reestimated after the first iteration. In experiments
with class-dependent measurement noise, we observed a
decrease in performance, which we attribute to "over-training";
a first order Gauss-Markov structure can adequately model
the training data, because of the small length of the
time-invariant regions in the model. In addition, the observed
feature vectors were centered around a class-dependent mean.
Duration probabilities as well as a priori class probabilities
were also used in these experiments. The training set that
we used consists of 317 speakers (2536 sentences), and
evaluation of our algorithms is done on a separate test set with
12 speakers (96 sentences).
Figure 2: Classification performance of test data vs. number
of iterations and log-likelihood ratio of each iteration relative to
the convergent value for the training data. (Axes: classification
rate vs. iteration.)
The effectiveness of the training algorithm is shown in
Figure 2, where we present the normalized log-likelihood
of the training data and classification rate of the test data
versus the number of iterations. We used 10 cepstra for
this experiment, and the initial parameters for the models
were uniform across all classes, except the class-dependent
means. We can see the fast initial convergence of the EM
algorithm, and that the best performance is achieved after
only 4 iterations.
In Figure 3 we show the classification rates for no
correlation modeling (independent frames), the Gauss-Markov
model and the Dynamical system model for different
numbers of input features. We also include in the same plot the
classification rates when the derivatives of the cepstra are
included in the feature set, so that some form of correlation
modeling is included in the independent-frame model. We
can see that the proposed model clearly outperforms the
independent-frame model. Furthermore, we should notice
the significance of incorporating observation noise in the
model, by comparing the performance of the new model to
the earlier, Gauss-Markov one.
Figure 3: Classification rates for various types of correlation
modeling and numbers of cepstral coefficients. (Curves: Indep.
frames, Gauss-Markov model, Dynam. System model, Ind. fr. + deriv.,
D.S. model + deriv.; horizontal axis: Number of Cepstral
Coefficients.)
CONCLUSION 
In this paper, we have shown that a segment model based
on a stochastic linear system model which incorporates a
modeling/observation noise term is effective for speech
recognition. We have shown that classification performance
using this model is significantly better than that obtained using
either an independent-frame or a Gauss-Markov assumption
on the observed frames. Finally, we have presented a
novel approach to the system parameter estimation problem
based on the EM algorithm.
ACKNOWLEDGEMENTS 
This work was supported jointly by NSF and DARPA 
under NSF grant number IRI-8902124. This paper also ap- 
pears in the Proceedings of the International Conference on 
Acoustics, Speech and Signal Processing. 
REFERENCES 
1. P. E. Caines, "Linear Stochastic Systems", John Wiley &
Sons, 1988.
2. A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum
Likelihood Estimation from Incomplete Data," in Journal of the
Royal Statistical Society (B), Vol. 39, No. 1, pp. 1-38, 1977.
3. V. Digalakis, M. Ostendorf and J. R. Rohlicek, "Improvements
in the Stochastic Segment Model for Phoneme Recognition,"
in Proceedings of the Second DARPA Workshop on
Speech and Natural Language, pp. 332-338, October 1989.
4. V. Digalakis, M. Ostendorf and J. R. Rohlicek, "Fast Search
Algorithms for Connected Phone Recognition Using the
Stochastic Segment Model," manuscript submitted to IEEE
Trans. Acoustic Speech and Signal Processing (a shorter
version appeared in Proceedings of the Third DARPA
Workshop on Speech and Natural Language, June 1990).
5. L. F. Lamel, R. H. Kassel and S. Seneff, "Speech Database
Development: Design and Analysis of the Acoustic-Phonetic
Corpus," in Proc. DARPA Speech Recognition Workshop,
Report No. SAIC-86/1546, pp. 100-109, Feb. 1986.
6. M. Ostendorf and S. Roucos, "A Stochastic Segment Model
for Phoneme-based Continuous Speech Recognition," in
IEEE Trans. Acoustic Speech and Signal Processing, Vol.
ASSP-37(12), pp. 1857-1869, December 1989.
7. S. Roucos, M. Ostendorf, H. Gish, and A. Derr, "Stochastic
Segment Modeling Using the Estimate-Maximize Algorithm,"
in IEEE Int. Conf. Acoust., Speech, Signal Processing,
pp. 127-130, New York, New York, April 1988.
