<?xml version="1.0" standalone="yes"?>
<Paper uid="C88-1020">
  <Title>SPEECH-RATE VARIATION AND THE PREDICTION OF DURATION</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Speech rate is known to be a variable affecting timing in a speech signal, but one that is difficult to quantify. Absolute measures of duration in a text tell little about the relative lengths of segments, and account must be taken of all other factors involved if relative values such as 'long', 'short', 'fast', or 'slow' are to be applied.</Paragraph>
    <Paragraph position="1"> Simple measures of speech rate, such as 'words-per-minute' and 'syllables-per-second', account well for variation at a global level, but are inadequate to describe local changes in rate, due to the effects of differences in the structure of words and syllables. Words can be mono- or poly-syllabic, and syllables themselves can vary greatly in the number and type of segments occurring in onset, peak and coda positions. A measure of a small number of words or syllables, expressed as a rate in counts per unit of time, will be affected by the complexity of the component units (the structure of syllables or the syllabicity of words), such that a string of simple units will yield a higher rate than the same number of more complex ones.</Paragraph>
    <Paragraph position="2"> This effect is reduced somewhat as the number of units increases and a more balanced distribution occurs, but will always be present as a corrupting factor in the accuracy of the rate measurement. It is likely that text type, with stylistic differences in lexical choice, will be a strong determiner of 'rate' in measures such as these, and a more text-independent method is required. Segments would seem to be a better unit for such measurement, but as yet there is no satisfactory method of determining segment boundaries for automated measurement, and the number of decisions required for measurement by hand of a passage of text long enough to provide statistically adequate results would be both uneconomical and error-prone.</Paragraph>
    <Paragraph position="3"> In the present study, a compromise is reached in the choice of syllables as the basic unit, so that boundary decisions are reduced and enough measurements can be taken to allow statistically valid conclusions to be made. A method of normalising for differences in syllable structure is proposed.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE DATABASE
</SectionTitle>
    <Paragraph position="0"> A database of five thousand syllables was prepared from recordings in the Spoken English Corpus [1] which have been prosodically transcribed, tagged for part of speech and punctuated. These were measured for duration and transcribed phonemically. Samples chosen were one long text, a twenty-minute broadcast of a short story by Doris Lessing, read by Elizabeth Bell, of approximately four thousand syllables, and two shorter texts of approximately five hundred syllables each, one Open University lecture on philosophy, and one news extract, for cross-checking.</Paragraph>
    <Paragraph position="1"> Syllables were measured in milliseconds from recordings digitised at 10 kHz (4.5 kHz lowpass filtered) with the IBM UKSC SAY speech analyser [2, 3] using interactive graphic display at one thousand samples per screen-width, and simultaneous auditory replay of the waveform. Hard copy of both the waveform and gain plots was retained for reference purposes.</Paragraph>
    <Paragraph position="2"> In the case of ambisyllabicity, the clearest boundary in the acoustic waveform was selected, and the phonemic transcription, later to be used as input to the rule system, marked accordingly.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="93" type="metho">
    <SectionTitle>
FITTING THE RULES
</SectionTitle>
    <Paragraph position="0"> 'The rules operate within the framework of a model of durational behaviour which states that (a) each rule tries to effect a percentage increase or decrease in the duration of a segment, but (b) segments cannot be compressed shorter than a certain minimum duration. The model is summarised by the formula</Paragraph>
    <Paragraph position="1"> DUR = MINDUR + ((INHDUR - MINDUR) * PRCNT) / 100</Paragraph>
    <Paragraph position="2"> where INHDUR is the inherent segment duration in ms, MINDUR is the minimum duration of a segment if stressed and PRCNT is the percentage shortening determined by the rules.' (D. H. Klatt [4]) An iterative process was used to match the rule set (originally designed for American English, based on the durations of a single male speaker, and taken from CVC words in frame sentences uttered in a controlled environment) to the durations required for the prediction of British English and for this particular speaker-text pair. The phonemic transcription of the test text was used as input to a computer implementation of the Klatt [5] rules for duration prediction, and the resulting segment values summed to the syllable level. These were compared with the measured durations according to the factors underlying the rules, which were in turn adjusted accordingly. The same input was passed through the improved rules and the process repeated until the output stabilised.</Paragraph>
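    <Paragraph> As an illustration only, the model quoted above can be sketched in Python. The formula used is the one commonly given for the Klatt duration rules, DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100. The function names and the example segment values are assumptions made for this sketch and are not taken from the implementation used in the study.

```python
def klatt_duration(inhdur_ms, mindur_ms, prcnt):
    """Predicted segment duration in ms.

    prcnt = 100 returns the inherent duration; prcnt = 0 returns the
    minimum duration, below which the segment is not compressed.
    """
    return mindur_ms + (inhdur_ms - mindur_ms) * prcnt / 100.0


def syllable_duration(segments, prcnt):
    """Sum predicted segment durations to the syllable level.

    segments is a list of (inhdur_ms, mindur_ms) pairs. A single PRCNT is
    applied to every segment here, which is a simplification: in the rule
    system each segment accumulates its own PRCNT from the rules.
    """
    return sum(klatt_duration(inh, mn, prcnt) for inh, mn in segments)


# Hypothetical three-segment syllable; the numbers are illustrative only.
example = [(65.0, 35.0), (120.0, 60.0), (95.0, 50.0)]
full = syllable_duration(example, 100.0)  # every segment at its inherent duration
half = syllable_duration(example, 50.0)   # every segment halfway to its minimum
```

Summing to the syllable level in this way gives the quantity that is compared with the measured syllable durations.</Paragraph>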
    <Paragraph position="3"> Segment durations were first adjusted by sorting the 'fit' for each syllable, expressed as a percentage of predicted duration to observed, by the nature of the segments appearing. Thus /t'/, for example, although assigned an inherent duration of 120ms and a minimum of 60ms in the Klatt rules, was found to be appearing in syllables that were consistently overpredicted, and by reducing its inherent duration in the rules to 95ms, and its minimum to 50ms, a better overall fit was observed. An exact fit is not to be expected, since the rules make no allowance for speech-rate variation, other than offering a single variable ('PRCNT') that can be reset to change the overall rate of duration. The variance observed in the fit for any individual factor will never be reduced below the variance of the underlying speech rate changes, but can only be minimised.</Paragraph>
    <Paragraph position="4"> The original rules assume that the minimum duration of an unstressed segment is half that of the segment in a stressed position. On further analysis of segment fit according to stress, it was found that a better prediction could be achieved by specifying absolute minima separately for the two situations; thus /k/, for example, while 65ms and 50ms for inherent and minimum in the Klatt rules, was found to fit better if specified as 65ms and 35ms, with an absolute minimum (for the unstressed position) of 15ms. The full table of final values with the original defaults is shown in Fig 1. These represent an intermediate stage in an iterative process, and are not presented as statements about individual segment durations per se.</Paragraph>
    <Paragraph position="5"> With these segment defaults fitted to the sample text, the values specified in the rules for modifying PRCNT were similarly adjusted so that the best fit could be obtained. In summary, clause and phrase medial syllables were found to be overpredicted, and both initial and final syllables underpredicted; clause final syllables considerably so. An extra rule was included to cover the case of phrase-initial syllables, which are not accessible through the framework of the original rules [6].</Paragraph>
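    <Paragraph> The sorting of syllable fit by segment might be sketched as follows. This is a minimal illustration, assuming each syllable is stored with its segment labels, its predicted duration and its observed duration; the data layout, the function names and the sample values are hypothetical.

```python
from collections import defaultdict


def fit_percent(predicted_ms, observed_ms):
    """Fit of one syllable: predicted duration as a percentage of observed."""
    return 100.0 * predicted_ms / observed_ms


def mean_fit_by_segment(syllables):
    """Average the syllable fit over all syllables containing each segment.

    syllables is a list of (segments, predicted_ms, observed_ms) tuples.
    A mean well above 100 suggests that syllables containing the segment
    are being overpredicted.
    """
    fits = defaultdict(list)
    for segments, predicted_ms, observed_ms in syllables:
        fit = fit_percent(predicted_ms, observed_ms)
        for seg in set(segments):  # count each segment once per syllable
            fits[seg].append(fit)
    return {seg: sum(v) / len(v) for seg, v in fits.items()}


# Invented sample: two syllables containing "t" are overpredicted by 10%.
sample = [
    (("t", "a"), 220.0, 200.0),
    (("t", "i"), 165.0, 150.0),
    (("m", "a"), 180.0, 200.0),
]
result = mean_fit_by_segment(sample)
```

A segment whose mean fit stays well away from 100 across many syllables is a candidate for an adjusted inherent or minimum duration at the next iteration.</Paragraph>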
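    <Paragraph> The difference between the original assumption and the separately specified minima can be illustrated with a small data structure. This is a sketch only; the class and field names are assumptions, and only the /k/ values quoted above are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class SegmentDefaults:
    inhdur_ms: float             # inherent duration
    mindur_stressed_ms: float    # minimum duration in stressed position
    mindur_unstressed_ms: float  # absolute minimum in unstressed position


def original_style(inhdur_ms, mindur_ms):
    """Original assumption: the unstressed minimum is half the stressed one."""
    return SegmentDefaults(inhdur_ms, mindur_ms, mindur_ms / 2.0)


def minimum_for(defaults, stressed):
    """Select the minimum duration that applies to the given stress condition."""
    if stressed:
        return defaults.mindur_stressed_ms
    return defaults.mindur_unstressed_ms


K_ORIGINAL = original_style(65.0, 50.0)       # /k/ as in the Klatt rules
K_FITTED = SegmentDefaults(65.0, 35.0, 15.0)  # /k/ after fitting to the text
```

Storing the unstressed minimum explicitly allows it to be fitted independently of the stressed value rather than derived from it.</Paragraph>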
  </Section>
  <Section position="5" start_page="93" end_page="93" type="metho">
    <SectionTitle>
QUANTIFYING SPEECH-RATE
</SectionTitle>
    <Paragraph position="0"> With the rules matched to the text at a global level through statistical analysis of averaged results, differences in output can be examined at the local level. Since there is no speech-rate information in the rule-set, differences will contain a quantification of this, contaminated by noise from measurement and prediction error.</Paragraph>
    <Paragraph position="1"> There will inevitably be a certain amount of error in hand-measurement of several thousand syllables, no matter how precise the equipment, but since the totals are cumulative, and sums can be simply checked against overall durations for stretches of the text, it can be assumed that the majority of errors will lie in boundary determination. These can be overcome by smoothing with a three-syllable moving-average window, since any over- or under-measurement in an individual syllable should be compensated by a corresponding under- or over-measurement of its immediate neighbours.</Paragraph>
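    <Paragraph> A three-syllable moving average of this kind could be implemented as in the sketch below. The treatment of the first and last syllables, which have only one neighbour, is an assumption, since it is not specified here; the function name and the sample durations are likewise invented.

```python
def smooth3(values):
    """Centred three-point moving average over a sequence of durations.

    At either end of the sequence only the neighbours that exist are
    averaged (an assumption made for this sketch).
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        window = values[max(0, i - 1):min(n, i + 2)]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Four syllables of 200 ms each, with the boundary between the second and
# third misplaced by 60 ms during measurement (invented values).
measured = [200.0, 260.0, 140.0, 200.0]
smoothed = smooth3(measured)
```

In this invented case the two syllables sharing the misplaced boundary both return to 200 ms after smoothing, because the over-measurement of one is cancelled by the under-measurement of the other within the window.</Paragraph>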
    <Paragraph position="2"> Errors in prediction will be systematic by definition, and therefore susceptible to detection by statistical methods. They can be determined by examining the measures of fit according to criteria not included in the rule-set and implementing new rules to cover any regularities found.</Paragraph>
    <Paragraph position="3"> [Fig 1: Inherent, stressed and unstressed minimum durations for modified Klatt rules, with originals.]</Paragraph>
    <Paragraph position="4"> The quantification of speech-rate is thus not a single, simple process, but an iterative one, with accuracy (and therefore confidence) increasing at each iteration. It can be expressed as a ratio ('SPRATE') of predicted rate in syllables/second to observed rate in syllables/second, calculated from smoothed data. The above objection to syllables per second as a measure of speech-rate is overcome by comparing like with like in the present method. Thus
SPRATE(%) = (SMOOTHED PREDICTED RATE / SMOOTHED OBSERVED RATE) * 100</Paragraph>
  </Section>
  <Section position="6" start_page="93" end_page="94" type="metho">
    <SectionTitle>
PRELIMINARY RESULTS
</SectionTitle>
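    <Paragraph> The SPRATE ratio on which the following results are based can be sketched in Python. The function takes already smoothed rates in syllables/second and applies the ratio as stated; the function names, the use of the population standard deviation and the sample rates are assumptions made for illustration.

```python
from statistics import mean, pstdev


def sprate_percent(smoothed_predicted_rate, smoothed_observed_rate):
    """SPRATE(%) for one syllable, from smoothed rates in syllables/second."""
    return 100.0 * smoothed_predicted_rate / smoothed_observed_rate


def sprate_series(predicted_rates, observed_rates):
    """SPRATE(%) for each syllable of a text."""
    if len(predicted_rates) != len(observed_rates):
        raise ValueError("rate sequences must have the same length")
    return [sprate_percent(p, o) for p, o in zip(predicted_rates, observed_rates)]


# Invented rates for three syllables.
values = sprate_series([5.0, 4.0, 6.0], [5.0, 5.0, 5.0])
overall_mean = mean(values)
# Population standard deviation; whether the reported figure is a population
# or a sample statistic is not stated, so this choice is an assumption.
spread = pstdev(values)
```

A mean close to 100 with a sizeable spread, as in this invented case, corresponds to a good global fit with local variation remaining.</Paragraph>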
    <Paragraph position="0"> At the current iteration, the sprate mean is 100.2% for 3959 syllables [7], indicating an almost exact overall fit between the predicted and observed durations, but with a standard deviation of 19.18 that is partly accounted for by the lack of rate information. Of the other factors contributing to this variation, no significant effects could be found for, e.g., the type of syllable structure, the position of the syllable in the word, or the position of that word in the phrase or clause. Part of speech, however, appeared to be a significant factor, with sprate results for selected categories as below, which shows that while verbs and adverbs are slightly overpredicted by the rules, nouns and especially adjectives would generally appear to be spoken more slowly than the rules predict.</Paragraph>
    <Paragraph position="1"> An earlier iteration showed that polysyllabicity, instead of being in the domain of the word as the original rules predict, gives a better fit if measured in feet, with adjustments made according to the number of unstressed syllables that follow each stressed syllable.</Paragraph>
    <Paragraph position="2"> A category that needs further examination is that of stressed but unaccented syllables, which are 'prominent but have no pitch movement' [8]. By default, these are treated as stressed, but on examination of the results, sprate, which is 100.3 for unstressed syllables and 98.3 for stressed, is 106.7 for stressed-but-unaccented, showing slight underprediction of stressed syllables, but greater overprediction of the intermediate category.</Paragraph>
    <Paragraph position="3"> Of perhaps greater interest, though, is the fit of sprate to the perceived speeding up and slowing down in the presentation of the text by the reader. Taking the mean sprate values for all syllables in the tone-group, we find the following:</Paragraph>
    <Paragraph position="4">
 93.9 Walking down the path with her, he blurted out
112.1 'I'd like to go and have a look at those rocks down there.'
 86.4 She gave the idea her attention.</Paragraph>
    <Paragraph position="5">
102.8 The water was pushing him up against the roof
 98.7 The roof was sharp and pained his back.</Paragraph>
    <Paragraph position="6">
120.2 He pulled himself along with his hands, fast, fast.
 97.8 and used his legs as levers.</Paragraph>
    <Paragraph position="7">
 99.8 They sat down to lunch together.
128.2 'Mummy, I can stay under water for two minutes, three minutes at least.'
 84.1 It came blurting out of him.</Paragraph>
    <Paragraph position="8"> where an increased sprate indicates overprediction in the rules or, conversely, a speeding up in the text. Here, examples have had to be chosen to include textual clues to the rate, but listening to longer passages confirms that rate correlates well with sprate. Further iterations will allow more confident examination at levels lower than the sentence.</Paragraph>
  </Section>
</Paper>