<?xml version="1.0" standalone="yes"?>
<Paper uid="P83-1013">
  <Title>Automatic Recognition of Intonation Patterns</Title>
  <Section position="4" start_page="475" end_page="475" type="metho">
    <SectionTitle>
PHRASE BOUNDARY
ACCENT TONE
</SectionTitle>
    <Paragraph position="0"> (Figure caption fragment: ... phrasal tunes of English given in ...) In certain circumstances, a single tone gives rise to a flat stretch in the F0 contour. For example, the phrase accent in Figure 3A has spread over two words. This phenomenon could be treated either at a phonological level, by linking the tone to a large number of syllables, or at a phonetic level, by positing a sustained style of transition. There are some interesting theoretical points here, but they do not seem to affect the design of an intonation recognizer.</Paragraph>
    <Paragraph position="1"> Note that the rules just described all operate in a small window, as defined on the sequence of tonal units. To a good approximation, the realization of a given tonal element can be computed without look-ahead, and looking back no further than the previous one. Of course, the window size could never be stated so simply with respect to the segmental string; two pitch accents could, for example, be squeezed onto adjacent syllables or separated by many syllables. One of the crucial assumptions of the work, taken from autosegmental and metrical phonology, is that the tonal string can be projected off the segmental string. The recognition system will make strong use of the locality constraint that this projection makes possible.</Paragraph>
    <Section position="1" start_page="475" end_page="475" type="sub_section">
      <SectionTitle>
2.3 Summary
</SectionTitle>
      <Paragraph position="0"> The major theoretical innovations of the description just sketched have important computational consequences. The theory has only two tones, L and H, whereas earlier tone-level theories had four. In combination with expressive variation in pitch range, a four-tone system has too many degrees of freedom for a transcription to be recoverable, in general, from the F0 contour. Reducing the inventory to two tones raises the hope of reducing the level of ambiguity to that ordinarily found in natural language. The claim that implementation rules for tonal elements are local means that the quantitative evidence for the occurrence of a particular element is confined to a particular area of the F0 contour. This constraint will be used to simplify the control structure. A third claim, that phrasal tunes are constructed syntactically from a small number of elements, means that standard parsing methods are applicable to the recognition problem.</Paragraph>
      <Paragraph position="1"> 3. A recognition system</Paragraph>
      <Paragraph position="2"> The recognition system as currently implemented has three components, described in the next three sections. First, the F0 contour is preprocessed with a view to removing pitch tracking errors and minimizing the effects of the speech segments. Then, a schematization in terms of events is established, by finding crucial features of the smoothed contour through analysis of the derivatives. Events are the interface between the quantitative and symbolic levels of description; they are discrete and relatively sparse with respect to the original contour, but carry with them relevant quantitative information. Parsing of events is carried out top-down, with the aid of rules for matching the tonal elements to event sequences. Tonal elements may account for variable numbers of events, and different analyses of an ambiguous contour may divide up the event stream in different ways. Steps in the analysis of an example F0 contour are shown in Figure 5.</Paragraph>
    </Section>
    <Section position="2" start_page="475" end_page="475" type="sub_section">
      <SectionTitle>
3.1 Preprocessing
</SectionTitle>
      <Paragraph position="0"> The input to the system is an F0 contour computed by the Gold-Rabiner algorithm (Gold and Rabiner, 1969). Two difficulties with this input make it unsuitable for immediate prosodic analysis. First, the pitch tracker in some cases returns values which are related to the true values by an integer multiplier or divisor. These stray values are fatal to any prosodic analysis if they survive in the input to the smoothing of the contour. This problem is addressed by imposing continuity constraints on the F0 contour. When a stray value is located, an attempt is made to find a multiplier or divisor which will bring it into line; if this attempt fails, the stray value is deleted. In our experience, such continuity constraints are necessary to eliminate sporadic errors; without them, no amount of parameter tweaking suffices.</Paragraph>
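      <Paragraph> The continuity constraint just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the system: the function name fix_stray_values, the candidate factors (2, 3 and their reciprocals), and the 20% relative tolerance are all hypothetical, since the text does not specify them. Unvoiced frames are represented as None.

```python
def fix_stray_values(f0, tolerance=0.2, factors=(2.0, 0.5, 3.0, 1.0 / 3.0)):
    """Repair or delete F0 values that break continuity with the last accepted value.

    f0 is a list of F0 values in Hz, with None for unvoiced frames.
    The tolerance and the candidate factors are illustrative assumptions.
    The first voiced frame is trusted as a starting reference.
    """

    def continuous(value, reference):
        # True when value lies within the relative tolerance of reference.
        return tolerance * reference >= abs(value - reference)

    repaired = []
    previous = None  # last accepted voiced value
    for value in f0:
        if value is None:
            repaired.append(None)
            continue
        if previous is None or continuous(value, previous):
            repaired.append(value)
            previous = value
            continue
        # Stray value: look for a multiplier or divisor that restores continuity.
        candidates = [value * k for k in factors if continuous(value * k, previous)]
        if candidates:
            best = min(candidates, key=lambda c: abs(c - previous))
            repaired.append(best)
            previous = best
        else:
            repaired.append(None)  # no factor helps, so the stray value is deleted
    return repaired


track = [100.0, 102.0, 204.0, 101.0, None, 50.0, 140.0, 99.0]
print(fix_stray_values(track))
```

      In this sketch a repaired value replaces the stray one and becomes the reference for the next frame, while a value that no candidate factor can bring into line is deleted by replacing it with None.</Paragraph>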
      <Paragraph position="1"> A second problem arises because the speech segments perturb the F0 contour; here, consonantal effects are of particular concern. There are no F0 values during voiceless segments.</Paragraph>
      <Paragraph position="2"> Glottal stops and voiced obstruents depress the F0 on both sides.</Paragraph>
      <Paragraph position="3"> In addition, voiceless obstruents raise the F0 at the beginning of a following vowel. Because of these effects, an attempt was made to remove F0 values in the immediate vicinity of obstruents.</Paragraph>
      <Paragraph position="4"> Figure 5 (caption fragment): ... placement of lettering indicates roughly the alignment of tune and text. Parts of the F0 contour which survive the continuity constraints and the clipping are drawn with a heavier line. Panel B shows the connected and smoothed F0 contour, together with its event characterization. The two transcriptions of the contour are shown underneath. The alignment of tonal elements indicates what events each covers.</Paragraph>
      <Paragraph position="5"> An adapted version of the Fleck and Liberman (1982) syllable peak finder controlled this clipping. Our modification worked outward from the syllabic peaks to find sonorant regions, and then retained the F0 values found there. In Figure 5A, the portions of the F0 contour remaining after this procedure are indicated by a heavier line. The retained portions of the contour are connected by linear interpolation. Following Hildreth and Marr's work on vision, the connected contour is smoothed by convolution with a Gaussian in order to permit analysis of the derivatives. The smoothed contour for the example is shown in Figure 5B.</Paragraph>
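      <Paragraph> The last two preprocessing steps, connecting the retained stretches by linear interpolation and smoothing by convolution with a Gaussian, might look roughly as follows. This is a sketch under assumptions: the function names, the kernel width sigma (in frames), the truncation of the kernel at three standard deviations, and the renormalization at the edges of the contour are all hypothetical choices that the text does not specify. Clipped or unvoiced frames are represented as None.

```python
import math


def interpolate_gaps(f0):
    """Connect retained F0 stretches by linear interpolation across None gaps.

    Gaps before the first or after the last retained value are left as None.
    """
    known = [i for i, value in enumerate(f0) if value is not None]
    filled = list(f0)
    for left, right in zip(known, known[1:]):
        rise = f0[right] - f0[left]
        for i in range(left + 1, right):
            filled[i] = f0[left] + rise * (i - left) / (right - left)
    return filled


def gaussian_smooth(values, sigma=2.0):
    """Convolve a gap-free contour with a Gaussian of width sigma (in frames).

    Near the edges only the part of the kernel that overlaps the contour is
    used, and the weights are renormalized (an assumption of this sketch).
    """
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    valid = range(len(values))
    smoothed = []
    for i in valid:
        total = 0.0
        weight_sum = 0.0
        for k, weight in zip(offsets, kernel):
            if i + k in valid:
                total += weight * values[i + k]
                weight_sum += weight
        smoothed.append(total / weight_sum)
    return smoothed


connected = interpolate_gaps([100.0, None, None, 130.0, 128.0, None, 110.0])
print(gaussian_smooth(connected, sigma=1.0))
```

      A wider Gaussian suppresses more of the segmental perturbation but also blurs the timing of the prosodic features, which is the trade-off taken up again in the discussion of problems.</Paragraph>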
    </Section>
    <Section position="3" start_page="475" end_page="475" type="sub_section">
      <SectionTitle>
3.2 Schematization
</SectionTitle>
      <Paragraph position="0"> Events in the contour are found by analysis of the first and second derivatives. The events of ultimate interest are maxima, minima, plateaus, and points of inflection. Roughly speaking, peaks correspond to H tones, some valleys are L tones, and points of inflection can arise through downstep, upstep, or a disparity in prominence between adjacent H accents. Plateaus, or level parts of the contour, can arise from tone spreading or from a sequence of two like tones. Events are implemented as structures which store quantitative information, such as location, F0 value, and derivative values.</Paragraph>
      <Paragraph position="1"> Maxima and minima can be located as zeroes in the first derivative. Those which exhibit insufficient contrast with their local environment are suppressed; in regions of little change, such as that covered by the phrase accent in Figure 3A, this thresholding prevents minor fluctuations from being treated as prosodic. Plateaus are significant stretches of the contour which are as good as level. A plateau is created from a sequence of low contrast maxima and minima, or from a very broad peak or valley. In either case, the boundaries of the plateau are marked with events, whose type is relevant to the ultimate tonal analysis. These events are not located at absolute maxima or minima, which in nearly level stretches may fall a fair distance from points of prosodic significance. Instead, they are pushed outward to a near-maximum, or a near-minimum. The event locations in Figure 5B reflect this adjustment. Minima in the absolute slope (which form a subset of zero crossings in the second derivative) are retained as points of inflection if they contrast sufficiently in slope with the slope maxima on either side. In some cases, such points were engendered by smoothing from places where the original contour had a shelf. In many others, however, the shoulder in the original contour is a slope minimum, although a more prototypical realization of the same prosodic pattern would have a shelf. Presumably, this fact is due to the low-pass characteristics of the articulatory system itself.</Paragraph>
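      <Paragraph> The location of maxima and minima with contrast thresholding can be illustrated as follows. The sketch covers only these two event types and leaves out plateaus and points of inflection. It rests on several assumptions that are not taken from the text: sign changes in the first difference stand in for zeroes of the first derivative, contrast is measured against the flanking values within a fixed window of frames, and the names find_extrema, contrast, window, and min_contrast are hypothetical. Each event is stored as a small structure holding its type, location, and F0 value.

```python
def contrast(contour, i, kind, window):
    """Measure how far an extremum stands out from both of its flanks."""
    left = contour[max(0, i - window):i]
    right = contour[i + 1:i + 1 + window]
    if kind == "maximum":
        # Height above the higher of the two flanking low points.
        return contour[i] - max(min(left), min(right))
    # Depth below the lower of the two flanking high points.
    return min(max(left), max(right)) - contour[i]


def find_extrema(contour, window=2, min_contrast=5.0):
    """Return maximum and minimum events of a smoothed contour.

    Extrema are strict sign changes in the first difference; those with
    too little contrast against their local environment are suppressed.
    """
    events = []
    for i in range(1, len(contour) - 1):
        before = contour[i] - contour[i - 1]
        after = contour[i + 1] - contour[i]
        if before > 0 and 0 > after:
            kind = "maximum"
        elif 0 > before and after > 0:
            kind = "minimum"
        else:
            continue
        if contrast(contour, i, kind, window) >= min_contrast:
            events.append({"kind": kind, "location": i, "f0": contour[i]})
    return events


contour = [100, 110, 120, 110, 100, 101, 100, 90, 100]
for event in find_extrema(contour):
    print(event)
```

      On this toy contour the small wiggle around frames 4 and 5 is suppressed, leaving one maximum at frame 2 and one minimum at frame 7.</Paragraph>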
    </Section>
    <Section position="4" start_page="475" end_page="475" type="sub_section">
      <SectionTitle>
3.3 Parsing
</SectionTitle>
      <Paragraph position="0"> Tonal analysis of the event stream is carried out by a top-down nondeterministic finite-state parser, assisted by a set of verification rules. The grammar is a close relative of the transition network in Figure 1. (There is no effort to make distinctions which would require independent information about stress location, and provision is made for the case where the phrase accent and boundary tone collapse phonetically.) The verification rules relate tonal elements to sequences of events in the F0 contour. As each tonal element is hypothesized, it is checked against the event stream to see whether it plausibly extends the analysis hypothesized so far. The integration of successful local hypotheses into complete analyses is handled conventionally (see Woods 1973).</Paragraph>
      <Paragraph position="1"> The ontology of the verification rules is based on our understanding of the phonetic realization rules for tonal elements. Each rule characterizes the realization of a particular element or class of elements, given the immediate left context.</Paragraph>
      <Paragraph position="2"> Wider contexts are unnecessary, because the realization rules are claimed to be local. Correct management of chained computations, such as iterative downsteps, falls out automatically from the control structure. The verification rules refer both to the event types (e.g. &quot;maximum&quot;, &quot;inflection&quot;) and to values of a small vocabulary of predicates describing quantitative characteristics. The present system has five predicates, though a more detailed accounting of the F0 contour would require a few more. One returns a verdict on whether an event is in the correct relation to a preceding event to be considered downstepped. Another determines whether a minimum might be explained by a non-monotonic F0 transition, like that pointed out in Figure I. In general, relations between crucial points are considered, rather than their absolute values.</Paragraph>
      <Paragraph position="3"> Even for a single speaker, absolute values are not very relevant to melodic analysis, because of expressive variation in pitch range. Our experiments showed that local relations, when stated correctly, are much more stable.</Paragraph>
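      <Paragraph> To illustrate why a relational predicate is more stable than an absolute one, here is a hypothetical downstep check. It is not one of the five predicates of the system, whose definitions are not given in the text. The sketch assumes that peak heights are measured above a speaker baseline and that a downstepped peak stands in a roughly fixed ratio to the preceding peak; the name is_downstepped, the baseline argument, and the ratio bounds 0.4 and 0.8 are all assumptions made for the example.

```python
def is_downstepped(previous_peak, current_peak, baseline, low=0.4, high=0.8):
    """Judge downstep from the relation between two peaks, not their absolute values.

    All values are in Hz. The ratio bounds are illustrative assumptions.
    """
    previous_height = previous_peak - baseline
    current_height = current_peak - baseline
    if 0 >= previous_height:
        return False  # the preceding event is not above the baseline
    ratio = current_height / previous_height
    return high >= ratio >= low


# The same melodic relation in a narrow and in an expanded pitch range.
print(is_downstepped(200.0, 160.0, baseline=100.0))
print(is_downstepped(300.0, 220.0, baseline=100.0))
```

      Both calls return the same verdict, because the second peak is 0.6 times as high above the baseline as the first in each case, even though the absolute values differ considerably.</Paragraph>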
      <Paragraph position="4"> Timing differences result in multiple realizations for some tonal sequences. For example, the L* H H% sequence in Figure 5A comes out as a rise-plateau-rise. If the same sequence were compressed onto less segmental material, one would see a rise-inflection-rise, or even a single large rise. For this reason, the rules OR together several ways of accepting a given tonal hypothesis. As just indicated, these can involve different numbers of events.</Paragraph>
      <Paragraph position="5"> The transcription under Figure 5B indicates the two analyses returned by the system. Note that they differ in the total number of tonal elements, and in the number of events covered by the H phrase accent. The first analysis correctly reflects the speaker's intention. The second is consistent with the shape of the F0 contour, but would require a different phrasal stress pattern. Thus the location of the phrasal stress cannot be uniquely recovered from the F0 contour, although analysis of the F0 does constrain the possibilities.</Paragraph>
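      <Paragraph> The control structure described in this section can be sketched as a small top-down, nondeterministic parser. Everything specific in the sketch is hypothetical: the toy transition network in GRAMMAR, the event-type names, and the alternatives listed in RULES are invented for illustration and are not the grammar or verification rules of the system. The sketch also matches event types only, ignoring the left context and the quantitative predicates. What it does show is how alternative realizations are ORed together, how one tonal element may cover a variable number of events, and how an ambiguous event stream yields more than one analysis.

```python
# Hypothetical toy network: one or more pitch accents, a phrase accent, a boundary tone.
GRAMMAR = {
    "start": [("H*", "accent"), ("L*", "accent")],
    "accent": [("H*", "accent"), ("L*", "accent"), ("H", "phrase"), ("L", "phrase")],
    "phrase": [("H%", "end"), ("L%", "end")],
}

# Hypothetical verification rules: each tonal element is accepted by any one
# of several event-type sequences, which may differ in length.
RULES = {
    "H*": [("maximum",), ("plateau_start",)],
    "L*": [("minimum",)],
    "H": [("plateau_start", "plateau_end"), ("plateau_end",)],
    "L": [("minimum",)],
    "H%": [("maximum",)],
    "L%": [("minimum",)],
}


def parse(events, state="start", position=0):
    """Yield every analysis as a list of (tonal element, events covered) pairs."""
    if state == "end":
        if position == len(events):
            yield []  # accept only when the whole event stream is covered
        return
    for element, next_state in GRAMMAR[state]:
        for pattern in RULES[element]:
            stop = position + len(pattern)
            if tuple(events[position:stop]) == pattern:
                # The hypothesis is verified locally; try to extend it.
                for rest in parse(events, next_state, stop):
                    yield [(element, pattern)] + rest


events = ["minimum", "plateau_start", "plateau_end", "maximum"]
for analysis in parse(events):
    print(" ".join(element for element, _ in analysis))
```

      With these toy rules the event stream receives two analyses, one with four tonal elements and one with three, and the H phrase accent covers one event in the first and two in the second.</Paragraph>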
      <Paragraph position="6"> 4. Discussion and conclusions</Paragraph>
    </Section>
    <Section position="5" start_page="475" end_page="475" type="sub_section">
      <SectionTitle>
4.1 Intellectual antecedents
</SectionTitle>
      <Paragraph position="0"> The work described here has been greatly influenced by the work of Marr and his collaborators on vision. The schematization of the F0 contour has a family resemblance to their primal sketch, and I follow their suggestion that analysis of the derivatives is a useful step in making such a schematization.</Paragraph>
      <Paragraph position="1"> Lea (1979) argues that stressed syllables and phrase boundaries can be located by setting a threshold on F0 changes. This procedure uses no representation of different melodic types, which are the main object of interest here. Its assumptions are commonly met, but break down in many perfectly well-formed English intonation patterns.</Paragraph>
      <Paragraph position="2"> Vires et al. (1977) use F0 in French to screen lexical hypotheses, by placing restrictions on the location of word boundaries. This procedure is motivated by the observation that the F0 contour constrains but does not uniquely determine the boundary locations. In English, F0 does not mark word boundaries, but there are somewhat comparable situations in which it constrains but does not determine an analysis of how the utterance is organized. However, the English prosodic system is much more complex than that of French, and so an implementation of this idea is accordingly more difficult.</Paragraph>
    </Section>
    <Section position="6" start_page="475" end_page="475" type="sub_section">
      <SectionTitle>
4.2 Segmentation and labelling
</SectionTitle>
      <Paragraph position="0"> The approach to segmentation used here contrasts strongly with that used in the past in phonemic analysis. Whereas the HWIM system, for example, proposed segmental boundaries bottom up (Woods et al., 1976), the system described here never establishes boundaries. For example, there is no point on the rise between a L* and a H* which is ever designated as the boundary between the two pitch accents. Whereas phonetic segments ordinarily carry only categorical information, the events found here are hybrids, with both categorical and quantitative information. A kind of soft segmentation comes out, in the sense that a particular tonal element accounts for some particular sequence of events. Study of ambiguous contours indicates that this grouping of events cannot be carried out separately from labelling. Thus, there is no stage of analysis where the contour is segmented, even in this soft sense, but not labelled.</Paragraph>
      <Paragraph position="1"> It is not hard to find examples suggesting that the approach taken here is also relevant for phonemic analysis. Consider the word &quot;joy&quot;, shown in Figure 6. Here, the second formant falls from the palatal locus to a back vowel position, and then rises again for the off-glide. A different transcription involving two syllables might also be hypothesized; the second formant could be falling through a rather nondistinct vowel into a vocalized /l/, and then rising for a front vowel. Thus, we can only establish the correct segment count for this word by evaluating the hypothesis of a medial /l/. Even having done so, there is no argument for boundary locations. The multiple pass strategy used in the HWIM system appears to have been aimed at such problems, but does not really get at their root.</Paragraph>
    </Section>
    <Section position="7" start_page="475" end_page="475" type="sub_section">
      <SectionTitle>
4.3 Problems
</SectionTitle>
      <Paragraph position="0"> A number of defects in the current implementation have become apparent. In the example, the amount of clipping and smoothing needed to suppress segmental effects enough for parsing results in poor time alignment of the second transcription. The H* in this analysis is assigned to &quot;source&quot;, whereas the researcher looking at the raw F0 contour would be inclined to put it on &quot;gumes&quot;. In general, curves which are too smooth may still be insufficiently smooth to parse.</Paragraph>
      <Paragraph position="1"> Figure 6 (caption fragment): ... sentence &quot;We find joy in the simplest things.&quot; The example is taken from Zue et al. (1982).</Paragraph>
      <Paragraph position="2"> An alternative approach based on Hildreth's suggestions about integration of different scale channels in vision was also investigated (Hildreth, 1980). Most of the obstacles she mentions were actually encountered, and no way was found to surmount them. Thus, I view the separation of segmental and prosodic effects on F0 as an open problem. Adding verification rules for segmental effects appears to be the most promising course.</Paragraph>
      <Paragraph position="3"> Two classes of extraneous analyses generated by the system merit discussion. Some analyses, such as the second in Figure 5, violate the stress pattern. These are of interest, because they inform us about how much F0 by itself constrains the interpretation of stress. A second group, namely analyses which have too many tonal elements for the syllable count, is of less interest. A future implementation should eliminate these by referring to syllable peak locations.</Paragraph>
    </Section>
  </Section>
</Paper>