<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2003">
  <Title>TIMING MODELS FOR PROSODY AND CROSS-WORD COARTICULATION IN CONNECTED SPEECH</Title>
  <Section position="3" start_page="13" end_page="17" type="metho">
    <SectionTitle>
CROSS-WORD LENITIONS AND THE GESTURAL SCORE
</SectionTitle>
    <Paragraph position="0"> One of the biggest problems in recognizing connected speech is coarticulation across word boundaries. This coarticulation can cause a drastic restructuring of the spectral characteristics of segments at the edges of words. Final segments can change by assimilation to the following word's initial segment, and they can even be seemingly deleted, as shown in the examples in (1), which are taken from Brown (1977) and Catford (1977).</Paragraph>
    <Paragraph position="1">  /muhst bi/ -&gt; \[muhsbi\] 'must be' c. deletion and assimilation /graUnd prEshR/ -&gt; \[graUmprEshR\] 'ground pressure' Such lenitions are ubiquitous in casual or fast speech and are not uncommon even in fluent read speech. They can occur within the word as well as at word boundaries, as in the assimilative devoicing or deletion of the first vowel in \[ptEIto\] for 'potato' or the apparent deletion of the medial \[t\] in \[twEni\] for 'twenty'.</Paragraph>
    <Paragraph position="2"> In these examples, we have described the lenitions as if they were discrete changes in the symbolic representation of the segment string. If the lenitions are approximated by an allophonic analysis in this way, the word-internal cases could be accounted for in isolated-word recognition systems by encoding all common patterns as variant pronunciations in the lexicon. This could be accomplished, for example, by providing separate spectral templates for each variant pronunciation or by listing alternate paths in an allophonic-segment-based HMM model (Kopec and Bush 1985). * Here and elsewhere, I use the following ARPABET-like substitutions for the standard phonetic symbols:</Paragraph>
    <Paragraph position="4"> Lenitions across word boundaries in connected speech can also be handled by pre-compiling alternate HMM paths for every possible transition (Bush and Kopec 1987), but this is feasible only when the vocabulary size is very small. Thus, cross-word segment lenitions cause a particular problem for large-vocabulary recognition systems even when explicit phonetic knowledge is incorporated in the form of allophonic variants for acoustic segments.</Paragraph>
    <Paragraph position="5"> A possible solution is to base the lexical representation of the allophones not on alternate paths through discrete phonologically unanalyzed acoustic intervals, but rather on alternate specifications of acoustic features in a feature-based recognition system (Stevens 1986). The assimilation of \[s\] to \[sh\] in 'this shop' could then be handled by an explicit assimilation rule that changes the acoustic features associated with the \[s\] segment from \[+anterior\] to \[-anterior\] in the context of a \[-anterior\] segment at the start of the following word. The apparent deletion of the \[t\] in 'must be', similarly, could be handled by a rule deleting the features associated with the \[t\] stop release in the context of a following obstruent segment. If this solution is adopted, the problem reduces to that of discovering the correct assimilation and deletion rules and the optimal acoustic feature system for stating these rules.</Paragraph>
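    <Paragraph> As an illustration only, the assimilation rule just described can be sketched in Python. The dictionary representation of a segment and the function name are assumptions of this sketch and are not taken from the feature system of Stevens (1986); only the feature \[anterior\] named in the text is modelled.

```python
def assimilate_anterior(segments):
    """Apply a discrete, all-or-none assimilation rule.

    A [+anterior] segment becomes [-anterior] when the segment that
    immediately follows it is [-anterior]. The input list is not modified.
    """
    result = [dict(segment) for segment in segments]
    for current, following in zip(result, result[1:]):
        if current.get("anterior") is True and following.get("anterior") is False:
            current["anterior"] = False
    return result


# Hypothetical feature bundles for the word-boundary segments of 'this shop'.
this_shop = [
    {"symbol": "s", "anterior": True},    # final segment of 'this'
    {"symbol": "sh", "anterior": False},  # initial segment of 'shop'
]
assimilated = assimilate_anterior(this_shop)
print(assimilated)
```

A rule of this kind either applies or does not apply, so it cannot by itself express partial assimilation.</Paragraph>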
    <Paragraph position="6"> A disadvantage of this approach is that these coarticulatory assimilations and deletions look like a motley array of discrete rules when described in terms of feature changes and deletions. Among the ways that models of articulatory kinematics might contribute to speech recognition is in providing a more explanatory account of these cross-word lenitions, an account that better predicts the patterns of assimilation and apparent deletion that are likely to occur in any given context. Browman and Goldstein (1987) have suggested an account of common lenition patterns that unifies assimilations and deletions into a single process.</Paragraph>
    <Paragraph position="7"> The basis for Browman and Goldstein's account is the gestural score. Browman and Goldstein, in conjunction with Saltzman and other colleagues at Haskins Laboratories, have developed a task-dynamic model in which utterances are represented as a principled orchestration of invariant articulatory gestures.</Paragraph>
    <Paragraph position="8"> The gestures are modeled as target-specific movements in a second-order linear spring-mass system. The orchestration specifies a given phasing for a gesture relative to the relevant surrounding gestures. The \[t\] of 'must be', for example, is represented as an overdamped gesture of a given stiffness and underlying amplitude specified for the task of making a complete closure with the tongue tip near the alveolar ridge. This alveolar closing gesture is specified as concurrent with either a ballistic abductive glottal gesture or a totally adductive glottal stop gesture, and as occurring at some time relative to the opening gesture from the word-initial \[m\] into the \[uh\] vowel. The \[b\], similarly, is composed of a labial closing gesture coupled to a glottal approximation gesture, with the two gestures specified to occur at some time relative to the oral and glottal gestures of the preceding \[t\].</Paragraph>
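    <Paragraph> A minimal numerical sketch of a single gesture as a second-order linear spring-mass system is given here for concreteness. It is not the Haskins task-dynamic implementation: the function name, the parameter values, and the use of semi-implicit Euler integration are all assumptions of the sketch. A damping ratio of 1 gives critical damping and larger values give overdamping; a lower value is included only to show the overshoot that such damping avoids.

```python
import math


def simulate_gesture(target, stiffness, damping_ratio=1.0, mass=1.0,
                     start=0.0, duration=0.5, dt=0.0005):
    """Integrate m*x'' + b*x' + k*(x - target) = 0 from rest.

    The damping coefficient is b = 2 * damping_ratio * sqrt(k * m).
    Returns the list of positions, one per time step (semi-implicit Euler).
    """
    damping = 2.0 * damping_ratio * math.sqrt(stiffness * mass)
    position, velocity = start, 0.0
    trajectory = [position]
    for _ in range(int(round(duration / dt))):
        acceleration = (-damping * velocity
                        - stiffness * (position - target)) / mass
        velocity += acceleration * dt
        position += velocity * dt
        trajectory.append(position)
    return trajectory


# Hypothetical values: unit displacement, stiffness chosen for illustration.
critically_damped = simulate_gesture(target=1.0, stiffness=400.0)
underdamped = simulate_gesture(target=1.0, stiffness=400.0, damping_ratio=0.3)
print(round(critically_damped[-1], 3), round(max(underdamped), 3))
```

With critical damping the simulated articulator approaches the target without passing it, which is the behaviour wanted for a closure gesture.</Paragraph>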
    <Paragraph position="9"> Under this account, the apparent deletion of the \[t\] can be modeled as the endpoint of a continuum of lesser to greater overlap between the tongue-tip gesture in the \[t\] and the labial gesture in the \[b\]. If the two gestures overlap to any extent, the release of the \[t\] tongue-tip closure will be masked by the \[b\] labial closure. That is, the usual aerodynamic consequence of the \[t\] release, namely the burst, will be prevented by the closure upstream.</Paragraph>
    <Paragraph position="10"> In extreme cases, not just the release of the \[t\] but the entire tongue-tip gesture can be hidden by the labial gesture, as Browman and Goldstein have shown in their examination of the movements of the tongue tip and lower lip and other movement traces recorded at the Tokyo X-ray microbeam system (Kiritani et al. 1975). Nolan (1989) shows similar cases of overlap between dental and velar gestures as evident in patterns of contact measured by an electro-palatograph. In sequences such as 'late calls', the tongue-tip contact for the word-final \[t\] can overlap to a greater or lesser extent with the tongue-body contact for the following word-initial \[k\].</Paragraph>
    <Paragraph position="11"> In Browman and Goldstein's task-dynamic model, assimilations such as the apparent substitution of \[sh\] for \[s\] in 'this shop' can also be specified as overlap. The two tongue-tip constriction gestures for the fricatives overlap in time in the same way as the \[t\] and \[b\] of 'must be'. In this case, however, the overlap involves the same vocal tract subsystem. Therefore, the kinematic consequence of the overlap is not a &quot;hiding&quot; of one gesture by the other, but a spatio-temporal &quot;blending&quot; of the two gestures, resulting in an uninterrupted \[sh\]-like spectral pattern.</Paragraph>
    <Paragraph position="12"> Thus, examination of the articulatory patterns provides a single explanatory account of the motley array of cross-word lenition patterns. Both the apparent segment deletions and the feature assimilations can be described by a common articulatory mechanism. It seems likely that the same mechanism will also account for various sorts of manner lenitions, such as the flapping of \[t\] and \[d\] and the production of stop consonants as fricatives. In the gestural score, these will probably be represented as undershoot of the temporal or spatial target for the consonant when the consonant's closing gesture is blended with the opening gesture for the following vowel. That is, flapping and frication are probably simply two more examples of gestural overlap.</Paragraph>
    <Paragraph position="13"> One advantage of this account is that the continuous phase settings of the gestural score correctly predict that there will be varying degrees of overlap, resulting in varying degrees of spectral masking by the following segment, unlike in the all-or-none segment deletion and assimilative feature-changing accounts. Since human listeners apparently can use the residual spectral information of the preceding vowel-formant transition to perceive the difference between a deleted \[t\] in 'late calls' and no \[t\] in 'lake calls' (Nolan 1989), this is a desirable outcome. In a recognition system based on all-or-none feature changes, by contrast, near minimal pairs such as these can only be distinguished if there is disambiguating syntactic or semantic information in the context.</Paragraph>
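    <Paragraph> The contrast between graded overlap and all-or-none deletion can be made concrete with a small sketch. Each closure is reduced here to an onset time and a release time; the class name, the function names, and all timing values are hypothetical and are not measurements from the studies cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Closure:
    onset: float    # time at which the constriction is formed (s)
    release: float  # time at which the constriction is released (s)


def overlap_fraction(first, second):
    """Fraction of the first closure's interval that the second one covers."""
    shared = min(first.release, second.release) - max(first.onset, second.onset)
    return max(0.0, shared) / (first.release - first.onset)


def release_is_masked(first, second):
    """True if the first closure is released while the second is in place."""
    return second.release >= first.release >= second.onset


# Hypothetical tongue-tip closure for [t], followed by a labial closure for
# [b] that is phased progressively earlier.
tongue_tip = Closure(onset=0.10, release=0.18)
for labial_onset in (0.20, 0.16, 0.12, 0.10):
    labial = Closure(onset=labial_onset, release=labial_onset + 0.10)
    print(labial_onset,
          round(overlap_fraction(tongue_tip, labial), 2),
          release_is_masked(tongue_tip, labial))
```

The overlap fraction varies continuously with the phasing of the second gesture, whereas a segment-deletion rule would offer only the two extremes.</Paragraph>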
    <Paragraph position="14"> Finally, the gestural score account makes all types of segmental lenition fall out from manipulations of the timing pattern, and when combined with a model of the articulatory correlates of tempo change and prosodic structure, should provide a better prediction of when lenitions will occur. That is, lenitions should occur more frequently at tempi and in prosodic contexts where articulatory gestures are phased more closely together.</Paragraph>
    <Paragraph position="15"> THE KINEMATICS OF TEMPO, PHRASING, AND ACCENT
While Browman and Goldstein have not yet provided an account of the articulatory correlates of prosodic structure within their task-dynamic model, there is other recent work that suggests how several effects can be described using the gestural score. Such a description is important for several reasons.</Paragraph>
    <Paragraph position="16"> A first obvious reason is that the cross-word assimilations and deletions discussed in the preceding section are blocked by certain sorts of prosodic phrase boundaries. For example, the word-final \[s\] in 'this' would not assimilate to the following \[sh\] in any typical intonational phrasing for 'So the question is this: should we do it or not?' An even more general reason for wanting a better description of the articulatory correlates of prosodic structure is that stress and phrasing interact with segmental duration patterns in ways that are very difficult to capture in computational models of acoustic interval durations (see, e.g., van Santen and Olive 1989; Riley 1989). Yet human perceivers clearly use the timing patterns of an utterance to parse the segments, stress pattern, prosodic structure, and overall tempo. It seems unlikely that in doing so, they perform the complicated computations that interval-based models use to predict the segment interval durations. A better model of speech timing could provide evidence as to what is actually being perceived when the timing patterns of an utterance are parsed to provide the perceptual cues to segmental and suprasegmental structures.</Paragraph>
    <Paragraph position="17"> Work by Beckman, Edwards, and Fletcher (1989) suggests that articulatory kinematics can differentiate global tempo change from phrase-final lengthening, and both of these from the lengthening effect of accent or stress. We looked at the durations, displacements, and peak velocities for opening gestures and closing gestures in the sentence-initial \[pap\] sequences in the sentences in (2):  (2) a. Pop, opposing the question strongly, refused to answer it.</Paragraph>
    <Paragraph position="18"> b. Poppa, posing the question loudly, refused to answer it.</Paragraph>
    <Paragraph position="19"> c. Poppa posed the question loudly, and then refused to answer it.</Paragraph>
    <Paragraph position="20">  The underlining in (2) indicates the test sequences. In (2a), the sequence is final to an intonation phrase, whereas in (2b) it is not final. The sequence in (2b), in turn, contrasts with the sequence in (2c) in bearing the nuclear accent in its phrase.</Paragraph>
    <Paragraph position="21"> We had several speakers repeat these utterances at three self-selected speaking rates, and measured the kinematics of the jaw-opening and closing gestures into and out of the low vowel \[a\]. We found that slowing down tempo overall works essentially by changing the stiffness of the articulatory system. Both the opening gestures and the closing gestures have smaller peak velocities at slower tempi, with essentially no change in displacement.</Paragraph>
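    <Paragraph> The link between stiffness and peak velocity can be worked out explicitly under one simplifying assumption that is not part of the experiment itself: that a gesture behaves as a critically damped spring-mass system starting from rest. The symbols are introduced here for illustration, with m the mass, k the stiffness, and A the displacement to the target.

```latex
\begin{gather*}
m\ddot{x} + 2\sqrt{km}\,\dot{x} + k\,(x - A) = 0, \qquad x(0) = 0,\ \dot{x}(0) = 0, \qquad \omega = \sqrt{k/m} \\
x(t) = A\left[1 - (1 + \omega t)\,e^{-\omega t}\right], \qquad \dot{x}(t) = A\,\omega^{2}\,t\,e^{-\omega t} \\
\ddot{x}(t^{*}) = 0 \;\Rightarrow\; t^{*} = \frac{1}{\omega}, \qquad \dot{x}_{\max} = \frac{A\,\omega}{e} = \frac{A}{e}\sqrt{\frac{k}{m}}
\end{gather*}
```

Under this assumption, lowering the stiffness k lowers the peak velocity and delays it, while the displacement A is left unchanged, which is the pattern reported for slower tempi.</Paragraph>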
    <Paragraph position="22"> Phrase-final lengthening looks like slowing down tempo, but localized to the closing gesture. The lengthening associated with accent, by contrast, did not significantly change the speed of either gesture. Instead it seemed that the accented vowel was longer because the closing gesture was later relative to the opening gesture. In terms of Browman and Goldstein's gestural score, accentual lengthening is a phase shift that lessens the overlap between the vowel gesture and the following \[p\] gesture.</Paragraph>
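    <Paragraph> The three lengthening mechanisms can be contrasted in a toy parameterization of one opening-closing sequence. This is a sketch under stated assumptions and not a model fitted to the data: each gesture is treated as critically damped, so that its peak velocity is the displacement times sqrt(k/m) divided by e, and the closing gesture is taken to begin at a given phase of the opening gesture's undamped cycle. The function name and all numerical values are hypothetical.

```python
import math


def syllable_kinematics(opening_stiffness, closing_stiffness,
                        closing_phase_deg, displacement=1.0, mass=1.0):
    """Return peak velocities and the onset lag of the closing gesture.

    Assumes critically damped gestures (peak velocity = A * omega / e) and a
    closing gesture that starts at closing_phase_deg of the opening cycle.
    """
    opening_freq = math.sqrt(opening_stiffness / mass)
    closing_freq = math.sqrt(closing_stiffness / mass)
    opening_period = 2.0 * math.pi / opening_freq
    return {
        "opening_peak_velocity": displacement * opening_freq / math.e,
        "closing_peak_velocity": displacement * closing_freq / math.e,
        "closing_onset_lag": closing_phase_deg / 360.0 * opening_period,
    }


baseline = syllable_kinematics(400.0, 400.0, 180.0)
slow_tempo = syllable_kinematics(200.0, 200.0, 180.0)    # both gestures less stiff
phrase_final = syllable_kinematics(400.0, 200.0, 180.0)  # closing gesture only
accented = syllable_kinematics(400.0, 400.0, 240.0)      # later phasing only

for name, result in [("baseline", baseline), ("slow tempo", slow_tempo),
                     ("phrase-final", phrase_final), ("accented", accented)]:
    print(name, {key: round(value, 3) for key, value in result.items()})
```

In this parameterization only the accented condition lengthens the sequence without changing either peak velocity, mirroring the distinction drawn in the text.</Paragraph>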
    <Paragraph position="23"> This last result confirms the findings of Summers (1987), who compared the articulatory kinematics of accentual lengthening with the effects of voicing in a following final stop. The duration and velocity patterns he found for accent are similar to those in our experiment, whereas the effect of voicing was more similar to those of our final lengthening; the closing gesture out of the vowel was slower before a voiced stop. Voicing differed from final lengthening in affecting displacement slightly as well as velocity; the jaw did not open as far before the voiced stop.</Paragraph>
    <Paragraph position="24"> This work has implications for the ways in which acoustic timing patterns can be used to recognize stress and prosodic phrasing. Other things being equal, jaw opening is correlated with first formant frequency and overall amplitude. Low vowels, with more open jaw positions, have higher first formants and greater amplitudes than high vowels, with less open jaw positions. In keeping with these correlations, Summers (1987) found that the first formant was lower in \[a\] and \[ae\] before \[b\], as expected from the lesser jaw opening there. In a later perception experiment involving syllables synthesized to mimic the first formant patterns in his production experiment, he found that first formant frequency and transition speed could cue the difference between a following voiced versus voiceless stop.</Paragraph>
    <Paragraph position="25"> Given our results concerning accent and final lengthening, then, we would expect that final lengthening should effect longer, slower first-formant transitions, whereas accent should not. Accent, on the other hand, should be associated with a greater average volume over the syllable nucleus, whereas final lengthening should result in gradually decreasing amplitude after an early loudness peak. We are testing these predictions in experiments presently underway. If they are borne out, then tracking formant kinematics and amplitude contours over a syllable should help interpret its overall duration pattern. A recognition system that incorporated these results would have much better recognition of the stress and phrasing pattern, with all the improvements in segmental recognition which that entails.</Paragraph>
  </Section>
</Paper>