<?xml version="1.0" standalone="yes"?> <Paper uid="P84-1113"> <Title>VOICE SIMULATION: FACTORS AFFECTING QUALITY AND NATURALNESS</Title> <Section position="3" start_page="0" end_page="530" type="metho"> <SectionTitle> I. INTRODUCTION </SectionTitle>
<Paragraph position="0"> The main objective of this paper is to develop an analysis-synthesis system whose parameters can be varied at will to realize any desired voice characteristics. This will enable us to determine the factors responsible for the unnatural quality of synthetic speech. It also makes it possible to determine the parameters of speech that contribute to intelligibility. The key ideas in our basic system are similar to those of the usual linear predictive (LP) coding vocoder \[1\], \[2\]. Our main contributions to the design of the basic system are: (1) the flexibility incorporated in the system for changing the parameters of the excitation and the system independently, and (2) a means for combining the excitation and the system through convolution without further interpolation of the system parameters during synthesis.</Paragraph>
<Paragraph position="1"> Atal and Hanauer \[1\] demonstrated the feasibility of modifying voice characteristics through an LPC vocoder. There have been some attempts to modify certain characteristics of speech (like pitch and speaking rate) without explicitly extracting the source parameters. One such attempt is the phase vocoder \[3\]. A recent attempt to independently modify the excitation and vocal tract system characteristics is due to Senef \[4\]. Unlike the LPC method, Senef's method performs the desired transformations in the frequency domain without explicitly extracting pitch. However, it is difficult to adjust the intonation patterns while modifying the voice characteristics.</Paragraph>
<Paragraph position="2"> In order to transform a voice from one type (e.g., masculine) to another (e.g., feminine), it is necessary to change not only the pitch and the vocal tract system but also the pitch contour and the glottal waveshape, each independently. It is known that glottal pulse shapes differ from person to person, and also for the same person across utterances in different contexts \[5\]. Since one of our objectives is to determine the factors responsible for producing natural-sounding synthetic speech, we have decided to implement a scheme which independently controls the vocal tract system characteristics and the excitation characteristics such as pitch, pitch contour and glottal waveshape. For this reason we have decided to use a standard LPC-type vocoder.</Paragraph>
<Paragraph position="3"> In Sec. II we describe the basic analysis-synthesis system developed for our studies. We discuss two important innovations in our system which provide smooth control of the parameters for generating speech. In Sec. III we present results of our studies on voice modifications and transformations using the basic system. In particular, we demonstrate the ease with which one can vary independently the speaking rate, pitch, glottal pulse shape and vocal tract response. We report in Sec. IV results from our studies to determine the factors responsible for the unnatural quality of synthetic speech from our system. After accounting for the major source of unnaturalness in synthetic speech, we investigate the factors responsible for the low intelligibility of some segments of speech. We propose a signal-dependent analysis-synthesis scheme in Sec.
V to improve intelligibility of dynamic sounds such as stops.</Paragraph> </Section> <Section position="4" start_page="530" end_page="531" type="metho"> <SectionTitle> II. DESCRIPTION OF THE ANALYSIS-SYNTHESIS SYSTEM A. Basic System </SectionTitle>
<Paragraph position="0"> As mentioned earlier, our system is basically the same as the LPC vocoders described in the literature \[2\]. The production model assumes that speech is the output of a time-varying vocal tract system excited by a time-varying excitation. The excitation is a quasiperiodic glottal volume velocity signal, a random noise signal, or a combination of both. Speech analysis is based on the assumption of quasistationarity during short intervals (10-20 msec). At the synthesizer the excitation parameters and gain for each analysis frame are used to generate the excitation signal. The system represented by the vocal tract parameters is then excited by this signal to generate synthetic speech.</Paragraph>
<Paragraph position="1"> B. Analysis Parameters For the basic system a fixed frame size of 20 msec (200 samples at a 10 kHz sampling rate) and a frame rate of 100 frames per second are used.</Paragraph>
<Paragraph position="2"> For each frame a set of 14 LPCs is extracted using the autocorrelation method \[2\]. The pitch period and voiced/unvoiced decisions are determined using the SIFT algorithm \[2\]. The glottal pulse information is not extracted in the basic system. The gain for each analysis frame is computed from the linear prediction residual: the residual energy over an interval corresponding to exactly one pitch period is computed, and this energy is divided by the period length in samples. This method of computing the squared gain per sample avoids incorrect gain estimates caused by the arbitrary location of the analysis frame relative to glottal closure.</Paragraph>
<Paragraph position="3"> C. Synthesis Synthesis consists of two steps: generation of the excitation signal and synthesis of speech. Separating the synthesis procedure into these two steps helps when modifying the voice characteristics, as will be evident in the following sections. The excitation parameters are used to generate the excitation signal as follows. The pitch period and gain contours, as functions of the analysis frame number (i), are first nonlinearly smoothed using 3-point median smoothing. Two arrays (called Q and H for convenience) are created as illustrated in Figure 1. The smoothed pitch contour P(i) is used to generate the Q-array, using the value of the pitch period at any point to determine the next point on the pitch contour.</Paragraph>
<Paragraph position="4"> Since the pitch period is given in samples and the interframe interval is known, say N samples, the value of the pitch period at the end of the current pitch period is determined using suitable interpolation of P(i) between two frame indices. The values of the pitch period as read from the pitch contour are stored in the Q-array; each entry in the Q-array is the value of the pitch period at that point. For nonvoiced frames the number of samples to be skipped along the horizontal axis is N, although on the pitch contour the value is zero, and the entry in the Q-array for unvoiced frames is zero. For each entry in the Q-array the corresponding squared gain per sample can be computed from the gain contour using suitable interpolation between two frame indices. The squared gain per sample corresponding to each element in the Q-array is stored in the H-array.</Paragraph>
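The construction of the Q and H arrays can be made concrete with a short sketch. The code below is a minimal illustration of the procedure just described, assuming the pitch contour stores the pitch period in samples (zero for unvoiced frames) and the gain contour stores the squared gain per sample; the function and variable names are ours, and the handling of voiced/unvoiced boundaries is simplified.

```python
import numpy as np

def build_q_h(pitch_contour, gain_contour, N=100):
    """Build the Q (pitch period) and H (squared gain per sample) arrays.

    pitch_contour : smoothed pitch period per analysis frame, in samples
                    (0 for unvoiced frames)
    gain_contour  : smoothed squared gain per sample, per analysis frame
    N             : interframe interval in samples
    """
    pitch_contour = np.asarray(pitch_contour, dtype=float)
    gain_contour = np.asarray(gain_contour, dtype=float)
    frames = np.arange(len(pitch_contour))      # analysis frame indices i

    def at_sample(contour, n):
        # contour value at sample position n, interpolated between
        # the two neighbouring frame indices
        return np.interp(n / N, frames, contour)

    total = len(pitch_contour) * N              # samples spanned by the contours
    Q, H = [], []
    n = 0                                       # current sample position
    while n < total:
        period = at_sample(pitch_contour, n)
        gain = at_sample(gain_contour, n)
        if period > 0:                          # voiced: one entry per pitch period
            period = max(1, int(round(period)))
            Q.append(period)
            H.append(gain)
            n += period                         # advance by one pitch period
        else:                                   # unvoiced: one entry per frame interval
            Q.append(0)
            H.append(gain)
            n += N                              # advance by one frame interval
    return np.array(Q), np.array(H)
```

Each voiced entry of Q then corresponds to one pitch period of excitation, and each zero entry to one frame interval of noise, as described next.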
<Paragraph position="5"> From the Q and H arrays an excitation signal is generated as follows. For each nonvoiced segment, identified by a zero entry in the Q-array, N_s samples of random noise are generated. The average energy per sample of the noise is adjusted to equal the entry in the H-array corresponding to that segment. For a voiced segment, identified by a nonzero value in the Q-array, the required number of excitation samples is generated using any desired excitation model; a sketch of this step is given at the end of Sec. III below.</Paragraph>
<Paragraph position="6"> In the initial experiments only one of the five excitation models shown in Figure 2 was used at a time. The model parameters were fixed a priori and were not derived from the speech signal. Note that the total number of excitation samples generated in this way is equal to the number of desired synthetic speech samples.</Paragraph>
<Paragraph position="7"> Once the excitation signal is obtained, the synthetic speech is generated by exciting the vocal tract system with the excitation samples.</Paragraph>
<Paragraph position="8"> The system parameters are updated every N samples. We do not use pitch-synchronous updating of the parameters, as is normally done in LPC synthesis. Therefore, interpolation of the parameters is not necessary, and the instability problems arising from interpolated system parameters are avoided. We still obtain very smooth synthetic speech.</Paragraph>
<Paragraph position="9"> III. STUDIES USING THE BASIC SYSTEM Two sentences spoken by a male speaker were used in our studies with the system. The pitch and gain contours were smoothed with 3-point median smoothing, and the excitation signal was generated from the smoothed contours with the number of non-overlapping samples per frame being N=200. Excitation model 3 (Fig. 2) was used throughout the initial studies; this model is the simple impulse excitation normally used in most LPC synthesizers. Synthesis was performed by driving the all-pole system with the excitation signal. The system parameters were updated every 100 samples.</Paragraph>
<Paragraph position="10"> We conducted the following studies using this system.</Paragraph>
<Paragraph position="11"> A. Time expansion/compression with the spectrum and excitation characteristics preserved.</Paragraph>
<Paragraph position="12"> B. Pitch period expansion/compression with the spectrum and other excitation characteristics preserved. C. Spectral expansion/compression with all the excitation characteristics preserved.</Paragraph>
<Paragraph position="13"> D. Modification of voice characteristics (both pitch and spectrum).</Paragraph>
<Paragraph position="14"> The list of recordings made from these studies is given in the Appendix.</Paragraph>
<Paragraph position="15"> The synthetic speech is highly intelligible and devoid of clicks, noise, etc. The speech quality is, however, distinctly synthetic. The issues of quality and naturalness are addressed in Section IV.</Paragraph> </Section>
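For concreteness, the excitation-generation and synthesis steps of Sec. II-C, which underlie the modifications studied above, can be sketched as follows. This is a minimal illustration under assumptions of our own rather than the paper's: the voiced excitation uses the simple impulse of model 3, the LPCs are taken to be predictor coefficients a_1, ..., a_p so that the all-pole filter denominator is 1 - a_1 z^-1 - ... - a_p z^-p (a sign convention we assume), and all names are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def generate_excitation(Q, H, N=100, rng=None):
    """Excitation signal from the Q and H arrays: each zero entry yields N
    samples of gain-scaled noise, each nonzero entry one pitch period driven
    by a single impulse (excitation model 3)."""
    rng = np.random.default_rng() if rng is None else rng
    segments = []
    for period, g2 in zip(Q, H):                 # g2: squared gain per sample
        if period == 0:                          # unvoiced segment
            noise = rng.standard_normal(N)
            noise *= np.sqrt(g2 / np.mean(noise ** 2))
            segments.append(noise)
        else:                                    # voiced: impulse at start of the period
            pulse = np.zeros(int(period))
            pulse[0] = np.sqrt(g2 * period)      # whole period's energy in one impulse
            segments.append(pulse)
    return np.concatenate(segments)

def synthesize(excitation, lpc_frames, N=100):
    """Drive the all-pole vocal tract filter with the excitation.  The
    coefficients are simply switched every N samples; the system parameters
    are not interpolated, as in the basic system."""
    out = np.zeros_like(excitation)
    zi = np.zeros(len(lpc_frames[0]))            # filter state carried across frames
    for k in range(0, len(excitation), N):
        a = np.asarray(lpc_frames[min(k // N, len(lpc_frames) - 1)])
        denom = np.concatenate(([1.0], -a))      # 1 - a_1 z^-1 - ... - a_p z^-p
        out[k:k + N], zi = lfilter([1.0], denom, excitation[k:k + N], zi=zi)
    return out
```

With the excitation and system separated in this way, the modifications in studies A-D reduce to simple operations before synthesis (for example, scaling the nonzero entries of Q changes the pitch, and repeating or dropping contour frames changes the speaking rate); this is one plausible reading of how such modifications can be realized, not a procedure spelled out in the paper.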
<Section position="5" start_page="531" end_page="531" type="metho"> <SectionTitle> IV. FACTORS FOR UNNATURAL QUALITY OF SYNTHETIC SPEECH </SectionTitle>
<Paragraph position="0"> It appears that the quality of the overall speech depends on the quality of reproduction of the voiced segments. To determine the factors responsible for the synthetic quality of the speech, a systematic investigation was performed. The first part of the investigation consisted of determining which of the three factors, namely the vocal tract response, the pitch period contour, and the glottal pulse shape, contributed significantly to the unnatural quality. Each of these factors was varied over a wide range of alternatives to determine whether a significant improvement in quality could be achieved. We have found that the glottal pulse approximation contributes more to the voice quality than the vocal tract system model and pitch period errors do.</Paragraph>
<Paragraph position="1"> Different excitation models were investigated to determine the one which contributes most significantly to naturalness. If we replace the glottal pulse characteristics with the LP residual itself, we recover the original speech. If we can model the excitation suitably and determine the parameters of the model from the speech, then we can generate high-quality synthetic speech. But it is not clear how to model the excitation. Several artificial pulse shapes, with their parameters fixed arbitrarily, were used in our studies (Fig. 2).</Paragraph>
<Paragraph position="2"> Of all these, Model-5 seems to produce the best quality speech. However, the most important problem to be addressed is how to determine the model parameters from speech.</Paragraph>
<Paragraph position="3"> The studies on excitation models indicate that the shape of the excitation pulse is critical and that it should be close to the original pulse if naturalness is to be obtained in the synthetic speech. Another way of viewing this is that the phase function of the excitation plays a prominent role in determining the quality. None of the simplified models approximates the phase properly. So it is necessary to model the phase of the original signal and incorporate it in the synthesis. Flanagan's phase vocoder studies \[7\] also suggest the need for incorporating the phase of the signal in synthesis.</Paragraph> </Section> <Section position="6" start_page="531" end_page="532" type="metho"> <SectionTitle> V. SIGNAL-DEPENDENT ANALYSIS-SYNTHESIS SCHEME </SectionTitle>
<Paragraph position="0"> The quality of synthetic speech depends mostly on the reproduction of voiced speech, whereas we conjecture that the intelligibility of speech depends on how the different segments are reproduced. It is known \[8\] that the analysis frame size, frame rate, number of LPCs, pre-emphasis factor and glottal pulse shape should be different for different classes of segments in an utterance. In many cases unnecessary pre-emphasis of the data or high-order LPCs can produce undesirable effects. Human listeners perform their analysis dynamically, depending on the nature of the input segment. So it is necessary to incorporate a signal-dependent analysis-synthesis feature into the system.</Paragraph>
<Paragraph position="1"> There are several ways of implementing the signal-dependent analysis idea. One way is to have a fixed-size window whose shape changes depending on the desired effective size of the frame. We use the signal knowledge embodied in the pitch contour to guide the analysis. For example, the shape of the window could be a Gaussian function whose width is controlled by the pitch contour. The frame rate is kept as high as possible during the analysis stage.</Paragraph>
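One plausible realization of such a pitch-adaptive window is sketched below: a fixed-length Gaussian window whose effective width is set from the local pitch period read off the pitch contour. The mapping from pitch period to window width (the width_factor, and the default width for unvoiced frames) is an assumed, illustrative choice, not one specified in the paper.

```python
import numpy as np

def adaptive_window(frame_len, local_pitch_period, width_factor=1.5):
    """Fixed-length Gaussian analysis window whose effective width follows
    the local pitch period; width_factor and the unvoiced default are
    assumed, tunable choices."""
    n = np.arange(frame_len)
    center = (frame_len - 1) / 2.0
    if local_pitch_period > 0:                # voiced: width tied to the pitch period
        sigma = width_factor * local_pitch_period / 2.0
    else:                                     # unvoiced: a broad default width
        sigma = frame_len / 4.0
    return np.exp(-0.5 * ((n - center) / sigma) ** 2)

# Example: a 200-sample frame (20 msec at 10 kHz) analysed with a window
# whose effective width tracks a local pitch period of 80 samples.
w = adaptive_window(200, local_pitch_period=80)
# windowed_frame = w * speech_frame   # speech_frame: a hypothetical 200-sample frame
```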
<Paragraph position="2"> Unnecessary frames can be discarded, thus reducing the storage requirement and the synthesis effort.</Paragraph>
<Paragraph position="3"> The signal-dependent analysis can be taken to any level of sophistication, with consequent improvements in intelligibility and bandwidth compression, and probably in quality as well. VI. DISCUSSION We have presented in this paper an analysis-synthesis system which is convenient for studying various aspects of the speech signal, such as the importance of different parameters or features and their effect on naturalness and intelligibility. Once the characteristics of the speech signal are well understood, it is possible to transform the voice characteristics of an utterance in any desired manner. It is to be noted that modelling both the excitation signal and the vocal tract system is crucial for any studies on speech.</Paragraph>
<Paragraph position="4"> Significant success has been achieved in modelling the vocal tract system accurately for purposes of synthesis. On the other hand, we have not yet found a convenient way of modelling the excitation source. It is to be noted that the solution to the source modelling problem does not lie in preserving the entire LP residual, its Fourier transform, or parts of the residual information in either domain, because any such approach limits the manipulative capability in synthesis, especially for changing the voice characteristics.</Paragraph> </Section> <Section position="7" start_page="532" end_page="532" type="metho"> <SectionTitle> APPENDIX A: LIST OF RECORDINGS </SectionTitle>
<Paragraph position="0"> frequency (c) normal pitch frequency (d) half the normal pitch frequency (e) original
4. Spectral expansion/compression: (a) original (b) spectral expansion factor 1.1 (c) normal spectrum (d) spectral compression factor 0.9 (e) original
5. Conversion of one voice to another: (a) male to female voice: original male voice - artificial female voice - original female voice (b) male to child voice: original male voice - artificial child voice - original child voice (c) child to male voice: original child voice - artificial male voice - original male voice</Paragraph> </Section> </Paper>