VOICE SIMULATION: FACTORS AFFECTING QUALITY AND NATURALNESS 
B. Yegnanarayana 
Department of Computer Science and Engineering 
Indian Institute of Technology, Madras-60O 036, India 
J.M. Naik and D.G. Childers 
Department of Electrical Engineering 
University of Florida, Galnesville, FL 32611, U.S.A. 
ABSTRACT 
In this paper we describe a flexible 
analysls-synthesls system which can be used for a 
number of studies In speech research. The maln 
objective Is to have a synthesis system whose 
characteristics can be controlled through a set 
of parameters to realize any desired voice 
characteristics. The basic synthesis scheme 
consists of two steps: Generation of an excita- 
tion signal from pitch and galn contours and 
excitation of the linear system model described 
by linear prediction coefficients, We show that 
a number of basic studies such as time expansion/ 
compression, pitch modifications and spectral 
expansion/compression can be made to study the 
effect of these parameters on the quality of 
synthetic speech. A systematic study is made to 
determine factors responsible for unnaturalness 
tn synthetic speech. It is found that the shape 
of the glottal pulse determines the quality to a 
large extent. We have also made some studies to 
determine factors responsible for loss of Intel- 
ligibility tn some segments of speech. A signal 
dependent analysts-synthesis scheme ts proposed 
to improve the intelligibility of dynamic sounds 
such as stops. A simple implementation of the 
signal dependent analysis is proposed. 
I. INTRODUCTION 
The maln objective of this paper is to 
develop an analysis-synthesls system whose 
parameters can be varied at will to realize any 
desired voice characteristics. Thls wlll enable 
us to determine factors responsible for the 
unnatural quality of synthetic speech. It is 
also possible to determine parameters of speech 
that contribute to intelligibility. The key 
ideas In our basic system are similar to the 
usual linear predictive (LP) coding vocoder \[I\], 
\[2\]. Our main contributions to the design of the 
basic system are: (1) the flexibility incorpor- 
ated in the system for changing the parameters of 
excitation and system independently and (2) a 
means for combining the excitation and system 
through convolution without further interpolation 
of the system parameters during synthesis. 
Atal and Hanauer \[1\] demonstrated the feasl- 
billty of modifying voice characteristics through 
an LPC vocoder. There have been some attempts to 
modify some characteristics (llke pitch, speaking 
rate) of speech without explicitly extracting the 
source parameters. One such attempt is with the 
phase vocoder \[3\]. A recent attempt to 
independently modify the excitation and vocal 
tract system characteristics is due to Senef 
\[4\]. Unlike the LPC method, Senef's method 
performs the desired transformations in the 
frequency domain without explicitly extracting 
pitch. However, it Is difficult to adjust the 
intonation patterns while modifying the voice 
characteristics. 
In order to transform voice from one type 
(e.g., masculine) to another (e.g., feminine), it 
is necessary to change not only the pitch and 
vocal tract system but also the pitch contour as 
well as the glottal waveshape independently. It 
is known that glottal pulse shapes differ from 
person to person and also for the same person for 
utterances in different contexts \[5\]. Since one 
of our objectives is to determine factors respon- 
sible for producing natural sounding synthetic 
speech, we have decided to implement a scheme 
which controls independently the vocal tract 
system characteristics and the excitation charac- 
teristics such as pitch, pitch contour and 
glottal waveshape. For thls reason we have 
decided to use the standard LPC-type vocoder. 
In Sec. II we describe the basic analysis- 
synthesis system developed for our studies. We 
discuss two important innovations in our system 
which provide smooth control of the parameters 
for generating speech. In Sec. III we present 
results of our studies on voice modifications and 
transformations using the basic system. In 
particular, we demonstrate the ease wtth which 
one can vary independently the speaking rate, 
pitch, glottal pulse shape and the vocal tract 
response. We report in Sec. IV results from our 
studies to determine the factors responsible for 
unnatural quality of synthetic speech from our 
system, After accounting for the major source of 
unnaturalness in synthetic speech, we investigate 
the factors responsible for low intelligibility 
of some segments of speech. We propose a signal 
dependent analysls-synthesls scheme in Sec. V to 
improve Intelliglbility of dynamic sounds such as 
stops. 
530 
II. DESCRIPTION OF THE ANALYSIS- 
SYNTHESIS SYSTEM 
A. Basic System 
As mentioned earlier, our system is basical- 
ly same as that LPC vocoders described in the 
literature F2\]. The production model assumes 
that speech is the output of a tlme varying vocal 
tract system excited by a time varying excita- 
tion. The excitation is a quaslperlodlc glottal 
volume velocity signal or a random noise signal 
or a combination of both. Speech analysis Is 
based on the assumption of quasistationarlty 
during short intervals (10-20 msec). At the 
synthesizer the excitation parameters and gain 
for each analysis frame are used to generate the 
excitation signal. Then the system represented 
by the vocal tract parameters is excited by this 
signal to generate synthetic speech. 
B. Analysis Parameters 
For the basic system a fixed frame size of 
20 msec (200 samples at 10kHz sampling rate) and 
a frame rate of 100 frames per second are used. 
For each frame a set fo 14 LPCs are extracted 
using the autocorrelatlon method \[2\]. Pitch 
period and volce/unvoiced decisions are deter- 
mined using the SIFT algorithm \[2\]. The glottal 
pulse information is not extracted in the basic 
system. The gain for each analysis frame Is 
computed from the linear prediction residual, 
The residual energy for an Interval corresponding 
to only one pitch period is computed and the 
energy is divided by the period in number of 
samples. This method of computation of squared 
~aln per sample avoids the incorrect computation 
of the gain due to arbitrary location of analysls 
frame relative to glottal closure. 
C. Synthesis 
Synthesis consists of two steps: Generation 
of the excitation signal and synthesis of speech. 
Separation of the synthesis procedure into these 
two steps helps when modifying the voice charac- 
teristics as will be evident in the followlng 
sections. The excitation parameters are used to 
generate the excitation signal as follows: The 
pitch period and galn contours as a function of 
analysls frame number (1) are first nonllnearly 
smoothed using a 3-polnt median smoothing. Two 
arrays (called Q and H for convenience) are cre- 
ated as illustrated in Figure I. The smoothed 
pitch contour P(1) is used to generate a Q-array 
using the value of the pitch period at any point 
to determine the next point on the pitch contour. 
Since the pitch period Is given in number of 
samples and the Interframe interval is known, say 
N samples, the value of the pitch period at the 
end of the current pitch period is determined 
using suitable interpolation of P(1) for points 
in between two frame Indicles. The values of the 
pitch period as read from the pitch contour are 
stored in the Q-array. The entry In the Q-array 
is the value of the pitch period for that 
frame. For nonvolced frames the number of 
samples to be skipped along the horizontal axis 
is N, although on the pitch contour the value is 
zero. The entry in the O-array for unvoiced 
frames is zero. For each entry in the Q-array 
the corresponding squared gain per sample can be 
computed from the gain contour using suitable 
interpolation between two frame indices. The 
squared gain per sample corresponding to each 
element in the Q-array Is stored in the H-array. 
From the Q and H arrays an excitation slgnal 
is generated as follows. For each nonvoIced 
segment, identified by an entry zero in the Q- 
array, N s samples of random noise are generated. 
The average energy per sample of the noise is 
adjusted to be equal to the entry in the H-array 
corresponding to that segment. For a voiced 
segment identified by a nonzero value in the Q- 
array, the required number of excitation samples 
are generated using any desired excitation model. 
In the initial experiments only one of the five 
exctlation models shown in Figure 2 were 
considered. The model parameters were fixed 
aprlorl and they were not derived from the 
speech signal. Note that the total number of 
excitation samples generated In this way are 
equal to the number of desired synthetic speech 
samples. 
Once the excitation signal Is obtained, the 
synthetic speech Is generated by exciting the 
vocal tract system with the excitation samples. 
The system parameters are updated every N 
samples. We are not using pitch synchronous 
updating of the parameters, as is normally done 
in LPC synthesis. Therefore, interpolation of 
parameters is not necessary. Thus, the 
instability problems arising out of the 
interpolated system parameters are avolced. We 
still obtain a very smooth synthetic speech. 
III. STUDIES USING THE BASIS SYSTEM 
Two sentences spoken by a male speaker were 
used In our studies with the system: 
Sl: WE WERE AWAY A YEAR AGO 
$2: SHOULD WE CHASE THOSE COWBOYS 
Speech data sampled at lOkHz was analyzed under 
the following conditions: 
Frame size: 200 samples 
Frame rate: 100 frames/sec 
Each frame was preemphastzed and windowed 
Number of LPC's: 14 
Pitch contour: (SIFT algorithm) 
Gain contour: (from LP residual) 
3-potnt median smoothing of pitch and gatn 
contour 
The excitation signal was generated using the 
smoothed pitch and gain contours with the non- 
overlapping samples per frame being N=200, The 
excitation model-3 (Fig. 2) was used throughout 
the tntttal studies. This model was a stmple 
impulse excitation normally used in most LPC syn- 
thesizers, Synthesis was performed by using the 
excitation signal with the all-pole system, 
The system parameters were updated every 100 
samples. 
Ne conducted the following studies using 
this system. 
531 
A. Tlme expanslon/compresslon wlth spectrum 
and excitation characteristics preserved. 
B. Pitch period expanslon/compression with 
spectrum and other excitation 
characteristics preserved, 
C. Spectral expanslon/compresslon wlth all 
the excitation characteristics preserved. 
D. Modification of voice characteristics 
(both pitch and spectrum). 
The llst of recordings made from these studies Is 
given in Appendix. 
The synthetic speech is highly Intelllglble 
and devoid of c11cks, noise, etc. The speech 
quallty Is distinctly synthetic. The issues of 
quallty or naturalness w111 be addressed In 
Section IV. 
IV. FACTORS FOR UNNATURAL QUALITY 
OF SYNTHETIC SPEECH 
It appears that the quality of the overall 
speech depends on the quality of reproduction of 
voiced segments. To determine the factors 
responsible for synthetic quality of speech, a 
systematic investigation was performed. The 
first part of the investigation consisted of 
determining which of the three factors namely, 
the vocal tract response, pitch period contour, 
and glottal pulse shape contributed significantly 
to the unnatural quality. Each of these factors 
was varied over a wide range of alternatives to 
determine whether a significant improvement in 
quality can be achieved. We have found that 
glottal pulse approximation contributes to the 
voice quality more than the vocal tract system 
model and pitch period errors. 
Different excitation models were Investl- 
gated to determine the one which contributes most 
significantly to naturalness. If we replace the 
glottal pulse characteristics wlth the LP 
residual itself, we get the original speech. If 
we can model the excitation sultably and 
determine the parameters of the model from 
speech, then we can generate hlgh quality 
synthetic speech. But it is not clear how to 
model the excitation. Several artificial pulse 
shapes wlth their parameters arbitrarily fixed, 
are used In our studies (Fig. 2). 
Excitation Model-l: Impulse excitation 
Excitation Model-2: Two impulse excitation 
Excitation Model-3: Three impulse excita- 
tion 
Excitation Model-4: Hflbert transform of an 
impulse 
Excitation Model-5: First derivative of 
Fant's model \[6\] 
Out of all these, Model-5 seems to produce 
the best quality speech. However, the most 
important problem to be addressed is how to 
determine the model parameters from speech. 
The studies on excitation models indicate 
that the shape of the excitation pulse Is 
crltlcal and It should be close to the original 
pulse If naturalness Is to be obtained in the 
synthetic speech. Another way of viewing thls is 
that the phase function of the excitation plays a 
prominent role In determining the quality. None 
of the simplified models approximate the phase 
properly. So it Is necessary to model the phase 
of the original signal and incorporate it in the 
synthesis. Flanagan's phase vocoder studies \[7\] 
also suggest the need for incorporating phase of 
the signal In synthesis. 
V. SIGNAL-DEPENDENT ANALYSIS- 
SYNTHESIS SCHEME 
The quality of synthetic speech depends 
mostly on the reproduction of voiced speech, 
whereas, we conjecture that intelligibility of 
speech depends on how different segments are 
reproduced. It Is known \[8\] that analysis frame 
size, frame rate, number of LPCs, pre-emphasis 
factor, glottal pulse shape, should be different 
for different classes of segments In an 
utterance. In many cases unnecessary preemphasls 
of data, or hlgh order LPCs can produce 
undesirable effects. Human listeners perform the 
analysis dynamically depending on the nature of 
the input segment. So it is necessary to 
Incorproate a signal dependent analysls-synthesis 
feature Into the system. 
There are several ways of implementing the 
slgnal dependent analysls ideas. One way is to 
have a fixed slze window whose shape changes 
depending on the desired effective size of the 
frame. We use the signal knowledge embodied in 
the pitch contour to guide the analysls. For 
example, the shape of the window could be a 
Gaussian function, whose width can be controlled 
by the pitch contour. The frame rate is kept as 
high as possible during the analysis stage. 
Unnecessary frames can be discarded, thus 
reducing the storage requirement and synthesis 
effort. 
The slgnal dependent analysls can be taken 
to any level of sophistication, wlth consequent 
advantages of improvement in inte111glbility, 
bandwidth compression and probably quality also. 
VI. DISCUSSION 
We have presented in this paper a discussion 
of an analysts-synthesis system which is 
convenient to study various aspects of the speech 
signal such as the importance of different 
parameters of features and their effect on 
naturalness and intelligibility. Once the 
characteristics of the speech signal are well 
understood, it fs possible to transform the voice 
characteristics of an utterance tn any desired 
manner. It is to be noted that modelling both 
the excitation signal and the vocal tract system 
are crucial for any studies on speech. 
Significant success has been achieved in 
modelling the vocal tract system accurately for 
purposes of synthesis. But on the other hand we 
have not yet found a convenient way of modelling 
the excitation source. It is to be noted that 
the solution to the source modelling problem does 
not lle in preserving the entire LP residual or 
Its Fourier transform or parts of the residual 
information In either domain. Because any such 
532 
approach limits the manipulative capability in 
synthesis especially for changing voice 
characterl stl cs. 
APPENDIX A: LIST OF RECORDINGS 
1. Basic system 
Utterance of Speaker I: (a) original (b) 
synthetic (c) original 
Utterance of Speaker 2: (a) original (b) 
synthetic (c) original 
Utterance of Speaker 3: (a) original (b) 
synthetic (c) original 
2. Time expansl on/compression 
(a) original (b) 11/2 times normal speaking 
rate (c) normal speaking rate (d)I/2 the 
normal speaking rate (e) original 
3. Pitch period expansion/compression 
(a) original (b) twice the normal pitch 
frequency (c) normal pitch frequency (d) 
half the normal pitch frequency (e) 
ori gi nal 
4. Spectral expanslon/compression 
(a) original (b) spectran expansion factor 
1.1 (c) normal spectrum (d) spectral com- 
pression factor 0.9 (e) original 
5. Conversion of one voice to another 
(a) male to female voice: 
original male voice - artificial 
female voice - original female voice 
(b) male to child voice: 
original male voice artificial 
child voice - original child voice 
(c) child to male voice: 
original child voice - artificial 
male voice - original male voice 
Q(1) - o 
Q(Z) • 0 " pitch 
contour ¢ 
: . Q(3) - Pl 
I i 
iil I 0 ,I, ,' I , , . I 
i °, 
Time in # samples 
Ft~ le. Illustration of generating Q-Array from smoothed 
pitch contour 
gain contour N(1) . G 1 H(2) • G 2 
H(3) - G 3 
H(4) - G 4 
HiS) - G s 
Time in # samples 
Fig lb. I11ustratlon of qenerstlnq H-Array from smoothed 
pitch and getn contours 
6. Effect of excitation models 
(a) orlginal (b) single Impulse excitation 
(c) two Impulses excitation (d) three 
impulses excitation (e) Hllbert transform 
of an impulse if) first derivative of 
Fant's model of glottal pulse 
REFERENCES 
\[1\] B.S. Atal and S.L. Hanauer, J. Acoust. Soc. 
Amer., vol. 50, pp. 637-655, 1971. 
\[2\] J.D. Markel and A.H. Gray, Linear Predic- 
tion of Speech, Sprtnger-Verlag, 19/6. 
\[3\] J.L. Flanagan, Speech Analysts, Synthesis 
and Perception, Sprlnger-Verlag, 1972. 
\[4\] s. Seneff, IEEE Trans. Acoust., Speech and 
Signal Processing, vol. ASSP-30, no. 4, pp. 
566-577, August 1982. 
\[5\] R.H. Cotton and J.A. Estrie, Elements of 
Voice Quality in Speech and Language, N.J. 
Lass (Ed.), Academic Press, 1975. 
\[6\] G. Fant, "The Source Filter Concept in 
Voice Production," IV FASE Symposium on 
Acoustics and Speech, Venezta, April 21-24, 
1981. 
\[7\] J.L. Flanagan, 3. Acoust. Soc. Amer., vol. 
68, pp. 412-420, August lgBO. 
\[8\] C.R. Patlsaul and J.C. Hammett, Jr., J. 
Acoust. Soc. Amer., vol. 58, pp. 1296-1307, 
December 1975. 
Time tn t saumles 
T • J (a) Stngle tmpulse excitation 
P 
l (b) Two tmpulses excitation 
P 
Time In ! samples 
t I (c) 
O p T 1 IJ T2-WP 
Ttme |n t samplei 
llw,,, 
" " I I I o I 
! 
Time In # stmples 
Three tmpulses excitation 
p (d) Htlbert transform of an tmpulse 
k--'Tl ' 1~P 
Ttme to # samples 
(e) Ftrst der|vat|ve of Fanl:'s 
model of glottal pulse 
Flq 2. Different Hodels for excitation 
533 
