SPEECH-RATE VARIATION AND THE PREDICTION OF DURATION
W. N. CAMPBELL
IBM (UK) Scientific Centre, Winchester, England
ABSTRACT 
A comparison between the output from a set of duration rules
based on Klatt '76 and measured durations in a text allows
quantification of speech rate at a local as well as a global level.
The rules account for known correlates of duration change,
such as stress, phonetic and phrasal context, and inherent
differences in the durations of each segment, but make no
allowance for local changes of rate within a text. The degree
of fit of the output from such a system to the observed
durations in the text provides a guide both to the accuracy of
the rule-set and to the rate-related variation within that text.
Statistical procedures can be applied to reduce the rule-related
error and thereby strengthen both the predictions of the rules
and the quantification of the rate variation. This paper
describes research in progress.
INTRODUCTION
Speech rate is known to be a variable affecting timing in a
speech signal, but one that is difficult to quantify. Absolute
measures of duration in a text tell little about the relative
lengths of segments, and account must be taken of all other
factors involved if relative values such as 'long', 'short', 'fast',
or 'slow' are to be applied.
Simple measures of speech rate, such as 'words-per-minute'
and 'syllables-per-second', account well for variation at a global
level, but are inadequate to describe local changes in rate, due
to the effects of differences in the structure of words and
syllables. Words can be mono- or poly-syllabic, and syllables
themselves can vary greatly in the number and type of
segments occurring in onset, peak and coda positions. A
measure of a small number of words or syllables, expressed as
a rate in counts per unit of time, will be affected by the
complexity of the component units, structure of syllables or
syllabicity of words, such that a string of simple units will yield
a higher rate than the same number of more complex ones.
This effect is reduced somewhat as the number of units
increases and a more balanced distribution occurs, but will
always be present as a corrupting factor in the accuracy of the
rate measurement. It is likely that text type, with stylistic
differences in lexical choice, will be a strong determiner of
'rate' in measures such as these, and a more text-independent
method is required.
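The effect of unit complexity on a count-based rate can be
illustrated with a toy calculation. The sketch below assumes,
purely for illustration, that every segment lasts 80 ms, so that
the underlying articulation rate is constant; the function name,
the segment duration and the segment counts are hypothetical
and are not taken from the database.

```python
def syllables_per_second(segments_per_syllable, segment_ms=80.0):
    """Syllable rate for a string of syllables spoken at a fixed segment duration."""
    total_ms = sum(segments_per_syllable) * segment_ms
    return len(segments_per_syllable) / (total_ms / 1000.0)


# Four simple (CV) syllables against four syllables with clustered
# onsets and codas; the segment duration is the same in both cases.
simple = [2, 2, 2, 2]
complex_ = [5, 4, 5, 4]

rate_simple = syllables_per_second(simple)
rate_complex = syllables_per_second(complex_)

print(f"simple:  {rate_simple:.2f} syllables/second")
print(f"complex: {rate_complex:.2f} syllables/second")
```

Although nothing about the speaker's tempo differs between the
two strings, the simple one yields the higher syllable rate.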
Segments would seem to be a better unit for such
measurement, but as yet there is no satisfactory method of
determining segment boundaries for automated measurement,
and the number of decisions required for measurement by hand
of a passage of text long enough to provide statistically
adequate results would be both uneconomical and error-prone.
In the present study, a compromise is reached in the choice of
syllables as basic unit, so boundary decisions are reduced and
enough measurements can be taken to allow statistically valid
conclusions to be made. A method of normalising for
differences in syllabic structure is proposed.
THE DATABASE
A database of five thousand syllables was prepared from
recordings in the Spoken English Corpus [1] which have been
prosodically transcribed, tagged for part of speech and
punctuated. These were measured for duration and transcribed
phonemically. Samples chosen were one long text, a
twenty-minute broadcast of a short story by Doris Lessing,
read by Elizabeth Bell, of approximately four thousand
syllables, and two shorter texts of approximately five hundred
syllables each, one Open University lecture on philosophy, and
one news extract, for cross-checking.
Syllables were measured in milliseconds from recordings
digitised at 10 kHz (4.5 kHz lowpass filtered) with the IBM UKSC
SAY speech analyser [2, 3] using interactive graphic display
at one thousand samples per screen-width, and simultaneous
auditory replay of the waveform. Hard copy of both the
waveform and gain plots was retained for reference purposes.
In the case of ambisyllabicity, the clearest boundary in the
acoustic waveform was selected, and the phonemic
transcription, later to be used as input to the rule system,
marked accordingly.
FITTING THE RULES 
'The rules operate within the framework of a model of
durational behaviour which states that (a) each rule tries to
effect a percentage increase or decrease in the duration of a
segment, but (b) segments cannot be compressed shorter than
a certain minimum duration. The model is summarised by the
formula
    DUR = [(INHDUR - MINDUR) * PRCNT] / 100 + MINDUR
where INHDUR is the inherent segment duration in ms,
MINDUR is the minimum duration of a segment if stressed
and PRCNT is the percentage shortening determined by the
rules.' (D. H. Klatt [4])
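The quoted formula translates directly into code. The following
is a minimal sketch; the function name is hypothetical, and the
example values of 120 ms and 60 ms are used only to show the
behaviour at the extremes of PRCNT.

```python
def klatt_duration(inhdur, mindur, prcnt):
    """DUR = [(INHDUR - MINDUR) * PRCNT] / 100 + MINDUR, durations in ms."""
    return (inhdur - mindur) * prcnt / 100.0 + mindur


# PRCNT = 100 leaves the inherent duration unchanged, PRCNT = 0
# compresses the segment to its minimum, and intermediate values
# scale only the compressible part (INHDUR - MINDUR).
full = klatt_duration(120, 60, 100)
half = klatt_duration(120, 60, 50)
floor = klatt_duration(120, 60, 0)
print(full, half, floor)
```

Because only the portion above MINDUR is scaled, no combination
of shortening rules can take a segment below its minimum.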
An iterative process was used to match the rule set (originally
designed for American English, based on the durations of a
single male speaker, and taken from CVC words in frame
sentences uttered in a controlled environment) to the durations
required for the prediction of British English and for this
particular speaker-text pair. The phonemic transcription of the
test text was used as input to a computer implementation of
the Klatt [5] rules for duration prediction, and the resulting
segment values summed to the syllable level. These were
compared with the measured durations according to the factors
underlying the rules, which were in turn adjusted accordingly.
The same input was passed through the improved rules and the
process repeated until the output stabilised.
Segment durations were first adjusted, by sorting 'fit' for each
syllable, expressed as a percentage of predicted duration to
observed, by nature of the segments appearing. Thus /f/, for
example, although assigned an inherent duration of 120ms and
a minimum of 60ms in the Klatt rules, was found to be
appearing in syllables that were consistently overpredicted, and
by reducing its inherent duration in the rules to 95ms, and its
minimum to 50ms, a better overall fit was observed. An exact
fit is not to be expected since the rules make no allowance for
speech-rate variation, other than offering a single variable
('PRCNT') that can be reset to change the overall rate of
duration. The variance observed in the fit for any individual
factor will never be reduced below the variance of the
underlying speech rate changes, but can only be minimised.
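The sorting of fit by segment described above might be sketched
as follows. This is an illustration only: the function names and
the three example syllables are invented, and the analysis
actually carried out on the database may have differed in detail.

```python
from collections import defaultdict


def fit_percent(predicted_ms, observed_ms):
    """Fit of one syllable: predicted duration as a percentage of observed."""
    return 100.0 * predicted_ms / observed_ms


def mean_fit_by_segment(syllables):
    """Average the syllable fit over every syllable containing each segment.

    syllables: iterable of (segments, predicted_ms, observed_ms).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for segments, predicted_ms, observed_ms in syllables:
        fit = fit_percent(predicted_ms, observed_ms)
        for seg in set(segments):
            sums[seg] += fit
            counts[seg] += 1
    return {seg: sums[seg] / counts[seg] for seg in sums}


# Hypothetical data: both syllables containing /f/ are overpredicted.
data = [
    (("f", "i"), 220.0, 200.0),  # fit 110
    (("f", "a"), 260.0, 200.0),  # fit 130
    (("t", "i"), 180.0, 200.0),  # fit 90
]
result = mean_fit_by_segment(data)
print(result["f"], result["i"])
```

A segment whose mean fit lies consistently above 100 is a
candidate for shorter default durations at the next iteration.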
The original rules assume that the minimum duration of an
unstressed segment is half that of the segment in a stressed
position. On further analysis of segment fit according to stress,
it was found that a better prediction could be achieved by
specifying absolute minima separately for the two situations;
thus /k/ for example, while 65ms and 50ms for inherent and
minimum in the Klatt rules, was found to fit better if specified
as 65ms and 35ms, with an absolute minimum (for the
unstressed position) of 15ms. The full table of final values with
the original defaults is shown in Fig 1. These represent an
intermediate stage in an iterative process, and are not presented
as statements about individual segment durations per se.
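One way to read the separate minima is that the stress of the
segment simply selects which value is used as MINDUR in the
Klatt formula. The sketch below makes that assumption, which is
not spelled out in the text; the table and function names are
hypothetical, and the values are the /k/ and /f/ rows of Fig 1.

```python
# (inherent, stressed minimum, unstressed minimum) in ms, from Fig 1.
SEGMENT_DEFAULTS = {
    "k": (65, 35, 15),
    "f": (95, 50, 25),
}


def modified_duration(segment, prcnt, stressed):
    """Klatt formula with the minimum chosen according to stress (assumed)."""
    inherent, min_stressed, min_unstressed = SEGMENT_DEFAULTS[segment]
    mindur = min_stressed if stressed else min_unstressed
    return (inherent - mindur) * prcnt / 100.0 + mindur


# At PRCNT = 0 the two minima are reached; at PRCNT = 100 both
# cases return the inherent duration.
k_stressed_floor = modified_duration("k", 0, stressed=True)
k_unstressed_floor = modified_duration("k", 0, stressed=False)
k_full = modified_duration("k", 100, stressed=False)
print(k_stressed_floor, k_unstressed_floor, k_full)
```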
With these segment defaults fitted to the sample text, the
values specified in the rules for modifying PRCNT were
similarly adjusted so that the best fit could be obtained. In
summary, clause and phrase medial syllables were found to be
overpredicted, and both initial and final syllables
underpredicted; clause final syllables considerably so. An extra
rule was included to cover the case of phrase-initial syllables,
which are not accessible through the framework of the original
rules [6].
QUANTIFYING SPEECH-RATE 
With the rules matched to the text at a global level through 
statistical analysis of averaged results, differences in output can 
be examined at the local level. Since there is no speech-rate 
information in the rule-set, differences will contain a 
quantification of this, contaminated by noise from measurement
and prediction error. 
There will inevitably be a certain amount of error in 
hand-measurement of several thousand syllables, no matter 
how precise the equipment, but since the totals are cumulative, 
and sums can be simply checked against overall durations for 
stretches of the text, it can be assumed that the majority of 
errors will lie in boundary determination. These can be
overcome by smoothing with a three-syllable moving-average
window since any over- or under-measurement in an individual 
syllable should be compensated by a corresponding under- or 
over-measurement of its immediate neighbours. 
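A three-syllable moving average of this kind can be sketched as
below. The treatment of the first and last syllables (a shortened
window) is an assumption, since the text does not say how the
ends of a stretch were handled, and the durations are invented.

```python
def smooth3(values):
    """Centred three-point moving average; the window is truncated at the ends."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


true_ms = [200.0, 200.0, 200.0, 200.0, 200.0]
# A boundary placed 30 ms late between the second and third
# syllables lengthens one measurement and shortens the other.
measured_ms = [200.0, 230.0, 170.0, 200.0, 200.0]
smoothed_ms = smooth3(measured_ms)

worst_raw = max(abs(m - t) for m, t in zip(measured_ms, true_ms))
worst_smoothed = max(abs(s - t) for s, t in zip(smoothed_ms, true_ms))
print(smoothed_ms, worst_raw, worst_smoothed)
```

The error cancels in any window that contains both affected
syllables and is reduced, though not removed, in windows that
contain only one of them.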
Errors in prediction will be systematic by definition, and
therefore susceptible to detection by statistical methods. They
can be determined by examining the measures of fit according
to criteria not included in the rule-set and implementing new
rules to cover any regularities found.

seg    inh    m+    m-   orig   min
i      165    85    25    160    50
ɪ      130    60    25    130    40
e      165    60    30    150    60
æ      205   100    35    230    60
ʌ      185    75    45    140    50
ɑ      265   125    55    240    80
ɒ      190   110    35    240    80
ɔ      235   100    55    240   100
u      185   110    30    210    60
ʊ      105    40    30    160    50
ə      100    55    25    120    40
ɜ      170    75    20    180    60
h       55    40    10     80    20
m       80    45    15     70    60
m̩      120    60    25    170   110
n       90    40    15     65    35
ŋ       90    50    25     80    50
n̩      170    80    50    170   100
l       85    45    25     80    40
ɫ       85    55    15     90    70
l̩      180    90    45    160   110
r       80    40    25     80    30
j       85    50    10     80    40
w       85    30    30     80    60
eɪ     220   110    35    190    70
aɪ     250   115    60    250    90
ɔɪ     220   110    35    280   110
əʊ     220   110    25    220    70
aʊ     220   110    50    260   100
ɪə     235   110    50    260   100
eə     270   100    50    270   100
ʊə     230   100    50    230   100
tʃ     100    75    30     70    50
dʒ      95    70    20     70    50
p       70    40    20     85    50
t       60    30    15     65    40
k       65    35    15     65    55
b       70    40    15     80    50
d       70    40    15     65    40
g       70    40    15     65    50
f       95    50    25    120    60
v       65    50    20     60    40
θ       75    35    25    110    40
ð       55    25    10     50    30
s      120    50    30    125    50
ʃ      105    55    25    125    50
z       80    35    15     75    40
ʒ       90    35    20     70    40

Figure 1. Duration values (ms) used for British English text: default
inherent (inh), stressed (m+) and unstressed (m-) minimum durations
for modified Klatt rules, with original inherent (orig) and minimum
(min) values.

The quantification of speech-rate is thus not a single, simple 
process, but an iterative one, with accuracy (and therefore 
confidence) increasing at each iteration. It can be expressed 
as a ratio ('SPRATE') of predicted rate in syllables/second to 
observed rate in syllables/second calculated from smoothed 
data. The above objection to syllables per second as a measure
of speech-rate is overcome by comparing like with like in the
present method. Thus
    SPRATE(%) = (SMOOTHED PREDICTED RATE / SMOOTHED OBSERVED RATE) * 100
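A per-syllable calculation of SPRATE might look as follows. The
formula above is stated in terms of rates; this sketch instead
takes the ratio of smoothed predicted to smoothed observed
durations. That is an assumption, chosen because it matches the
earlier definition of fit as predicted duration over observed, and
because it is the reading under which values above 100 indicate
overprediction by the rules. The function name and the data are
hypothetical.

```python
def sprate(smoothed_predicted_ms, smoothed_observed_ms):
    """SPRATE (%) per syllable from already-smoothed durations in ms.

    Assumption: predicted over observed duration, so values above 100
    mean the rules overpredict (or the speaker has speeded up).
    """
    return [
        100.0 * predicted / observed
        for predicted, observed in zip(smoothed_predicted_ms, smoothed_observed_ms)
    ]


# Hypothetical smoothed values: the speaker is faster than predicted
# in the middle of the stretch and slower at the end.
values = sprate([200.0, 210.0, 180.0, 190.0], [200.0, 175.0, 150.0, 237.5])
print(values)
```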
PRELIMINARY RESULTS 
At the current iteration, sprate mean is 100.2% for 3959 
syllables [7], indicating an almost exact overall fit between the
predicted and observed durations, but with a standard 
deviation of 19.18 that is partly accounted for by the lack of 
rate information. Of the other factors contributing to this 
variation, no significant effects could be found for e.g. the type 
of syllable structure, the position of the syllable in the word, 
or the position of that word in the phrase or clause. Part of 
speech, however, appeared to be a significant factor, with 
sprate results for selected categories as below, 
syllable type     mean    s.d.(est)     n
lexical verbs    102.7      0.7       626
nouns             98.5      1.0       750
adjectives        92.9      1.2       339
adverbs          102.6      1.6       184
which shows that while verbs and adverbs are slightly
overpredicted by the rules, nouns and especially adjectives
would generally appear to be spoken more slowly than the
rules predict.
An earlier iteration showed that polysyllabicity, instead of
being in the domain of the word as the original rules predict,
gives a better fit if measured in feet, and adjustments made
according to the number of the unstressed syllables that follow
each stressed syllable.
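Counting the unstressed syllables that follow each stressed
syllable can be sketched as below. The function name is
hypothetical, the stress marking of the example phrase is an
illustrative guess, and ignoring unstressed syllables that precede
the first stress is an assumption; the adjustment then applied to
each foot is not specified here and is not modelled.

```python
def unstressed_following(stress_flags):
    """For each stressed syllable, count the unstressed syllables up to the next stress.

    stress_flags: one boolean per syllable in text order (True = stressed).
    Unstressed syllables before the first stress are ignored (assumption).
    """
    counts = []
    for is_stressed in stress_flags:
        if is_stressed:
            counts.append(0)       # a new foot starts at each stress
        elif counts:
            counts[-1] += 1        # attach to the current foot
    return counts


# 'WAL-king DOWN the PATH with her' (illustrative stress marking)
flags = [True, False, True, False, True, False, False]
feet = unstressed_following(flags)
print(feet)
```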
A category that needs further examination is that of stressed
but unaccented syllables which are 'prominent but have no
pitch movement' [8]. By default, these are treated as stressed,
but on examination of the results, sprate, which is 100.3 for
unstressed syllables and 98.3 for stressed, is 106.7 for
stressed-but-unaccented, showing slight underprediction of
stressed syllables, but greater overprediction of the
intermediate category.
Of perhaps greater interest though, is the fit of sprate to the 
perceived speeding up and slowing down in the presentation 
of the text by the reader. Taking the mean sprate values for
all syllables in the tone-group, we find the following
 93.9  Walking down the path with her, he blurted out
112.1  'I'd like to go and have a look at those rocks
       down there.'
 86.4  She gave the idea her attention.
102.8  The water was pushing him up against the roof
 98.7  The roof was sharp and pained his back.
120.2  He pulled himself along with his hands, fast, fast.
 97.8  and used his legs as levers.
 99.8  They sat down to lunch together.
128.2  'Mummy, I can stay under water for two minutes,
       three minutes at least.'
 84.1  It came blurting out of him.
where an increased sprate indicates overprediction in the rules
or, conversely, a speeding up in the text. Here, examples have
had to be chosen to include textual clues to the rate, but
listening to longer passages confirms that rate correlates well
with sprate. Further iterations will allow more confident
examination at levels lower than the sentence.
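The tone-group means used above are simple averages over the
syllables of each group. A minimal sketch follows, assuming the
syllables arrive in text order and carry a tone-group identifier;
the function name, the identifiers and the sprate values are
hypothetical.

```python
from itertools import groupby


def tone_group_means(syllables):
    """Mean sprate per tone-group.

    syllables: list of (tone_group_id, sprate) in text order, so that
    all syllables of one tone-group are adjacent.
    """
    means = {}
    for group_id, rows in groupby(syllables, key=lambda row: row[0]):
        values = [value for _, value in rows]
        means[group_id] = sum(values) / len(values)
    return means


example = [(1, 90.0), (1, 98.0), (2, 110.0), (2, 114.0), (2, 112.0)]
means = tone_group_means(example)
print(means)
```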
DISCUSSION 
As Fig 2 suggests, a small random or high frequency error
summed with a more slowly changing effect does little to hide
its rhythms. Speech rate cannot be expressed as a simple sine
wave, but its effect on the prediction of segment duration by
rule can perhaps be seen in this way and until its processes are
understood, no predicted durations can match observations
from a real text - the lows cannot be slow enough nor the highs
fast enough. Until rate information is superimposed on
phonetic and phrase-level information in a systematic manner,
the output will be flat and, if in a computer text-to-speech
system, 'robotic'.

Figure 2. Interacting datasets.

The above method provides a quantification of speech rate that 
reveals both local and wider-range domains. Being an iterative 
process, it provides for an improvement of the rules for
prediction of duration in a text while at the same time
revealing processes within the text that govern changes in rate
at the more local level.
References

[1] Spoken English Corpus : Lancaster University and
IBM UKSC.

[2] IBM UKSC Reports No.135, June 1985, and No.145,
January 1986.

[3] W N Campbell : A Search for Higher-level Duration
Rules in a Real-Speech Corpus. Proc Conf Speech Tech
Edinburgh 1987

[4] D H Klatt : Synthesis by rule of Segmental Durations
in English Sentences in Frontiers of Speech
Communication Research edited by Lindblom & Öhman,
Academic Press 1979 (pp 287-299).

[5] D H Klatt : Linguistic uses of segmental duration in
English pp 1208-1221, JASA 59 1976.

[6] W N Campbell : Extracting Speech-Rate Values from
a Real-Speech Database. ICASSP 1988 (forthcoming)

[7] Glim 3.77 update 1 (copyright) 1985 Royal Statistical
Society, London

[8] L G Taylor & G Knowles : Manual of Information to
Accompany the SEC Corpus. UCREL University of
Lancaster 1988
