File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/94/j94-1002_abstr.xml
Size: 10,531 bytes
Last Modified: 2025-10-06 13:48:17
<?xml version="1.0" standalone="yes"?> <Paper uid="J94-1002"> <Title>A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location</Title> <Section position="2" start_page="0" end_page="29" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Prosodic phrase structure plays a role in both naturalness and intelligibility of speech.</Paragraph> <Paragraph position="1"> For example, prosodic phrase boundaries break the flow of a sentence, dividing it into smaller units for easier processing. In addition, researchers have shown that prosodic phrase break placement is important in syntactic disambiguation (Lehiste 1973; Price, Ostendorf, Shattuck-Hufnagel, and Fong 1991). For these reasons, computational modeling of prosodic phrases is important both for text-to-speech synthesis and speech understanding applications. In this work, we present a computational model that represents a hierarchy of prosodic constituents using a stochastic formalism to capture the natural variability allowable in prosodic phrasing. The model is useful for both analysis and synthesis applications; we focus on synthesis here, and present experimental results for predicting prosodic phrase structure from text.</Paragraph> <Paragraph position="2"> Prosodic phrase structure, or groupings of words in a sentence, can be equivalently represented by different phrase break markers. The location and relative size of these breaks define the prosodic phrase structure, which we will refer to here as a prosodic parse. 
Prosodic phrase breaks are discrete events that are associated with acoustic cues such as duration lengthening, pause insertion, and intonation markers.</Paragraph> <Paragraph position="3"> In this work, we are concerned only with the relationship between the abstract events (different levels of phrase breaks) and text. To be useful in synthesis or understanding applications, the results presented here need to be integrated with a component that models the acoustics associated with these abstract events (see, for example, Hirose and Fujisaki 1982). * ECS Department, 44 Cummington Street, Boston, MA 02215. © 1994 Association for Computational Linguistics. Computational Linguistics, Volume 20, Number 1.</Paragraph> <Paragraph position="4"> Several observations about prosodic phrase breaks raise issues to be considered in designing an algorithm to predict such breaks from text. First, there is a significant body of literature in linguistics concerning various hierarchies that specify the relationship among prosodic constituents, and the model should reflect this structure.</Paragraph> <Paragraph position="5"> Second, several different prosodic parses may all be acceptable for one sentence. This variability is particularly important to represent if the model is to be useful for analysis as well as synthesis. Third, prosodic phrase breaks do not always coincide with syntactic phrase boundaries, and the relationship between prosody and syntax is not well understood. This means that prosodic phrases cannot simply be predicted from syntactic structure. Finally, since most text-to-speech synthesis applications require a low cost implementation, there is the concern of computational complexity. 
We shall expand on these points separately below, to motivate the work described here.</Paragraph> <Paragraph position="6"> The various linguistic theories of prosodic phrase structure (e.g., Liberman and Prince 1977; Selkirk 1980, 1984; Beckman and Pierrehumbert 1986; Nespor and Vogel 1983; Ladd 1986) differ in the specific levels that they represent, but all have a similar hierarchical structure. Two levels of prosodic phrases are common to most proposals: the intonational phrase and the intermediate phrase, using the terminology of Beckman and Pierrehumbert. A sentence is composed of a sequence of intonational phrases, which in turn are composed of sequences of intermediate phrases. An intonational phrase break is therefore perceived as stronger or more salient than an intermediate phrase break. Intonational phrases are delimited by boundary tones, and intermediate phrases are theoretically marked with a phrase accent, where the pitch markers can be either high or low (Beckman and Pierrehumbert 1986). (In other theories of intonation, for example, 't Hart, Collier, and Cohen [1990], pitch markers also occur at phrase boundaries, but are identified with movement and referred to as either rising or falling.) Both types of constituents are also cued by segmental lengthening in the phrase final syllable (Wightman, Shattuck-Hufnagel, Ostendorf, and Price 1992).</Paragraph> <Paragraph position="7"> Since intonational and intermediate phrases are generally accepted, the experiments here will only address these two levels, referring to them as major and minor phrases, respectively. However, other types of prosodic constituents may be useful and, in fact, there is durational evidence for at least four levels (Wightman, Shattuck-Hufnagel, Ostendorf, and Price 1991; Ladd and Campbell 1991). We therefore propose a more general hierarchical model that can be extended to an arbitrary, but fixed, number of levels. 
In the examples given here, we will represent intonational phrases (I) using &quot;||&quot; to mark a major break and intermediate phrases (i) using &quot;|&quot; to mark a minor break. The example below illustrates how phrase breaks are used to represent prosodic phrase structure: Those on early release | must check in with correction officials || fifty times a week || according to Ash, || who says about half | the contacts for a select group || will now be made | by the computerized phone calls. || ((Those on early release)i (must check in with correction officials)i)I ((fifty times a week)i)I ((according to Ash,)i)I ((who says about half)i (the contacts for a select group)i)I ((will now be made)i (by the computerized phone calls.)i)I M. Ostendorf and N. Veilleux Hierarchical Stochastic Model for Automatic Prediction Another important consideration in modeling prosody (and evaluating the model) is that prosodic phrase structure is not deterministic. Speakers can produce a sentence in several ways without altering the naturalness or the meaning. Prosodic breaks can differ in size and/or placement because of differences in style, competence, or simply natural speaking variations. For example, the following sentence was said three ways by five speakers: They're in jail || for such things || as bad checks or stealing.</Paragraph> <Paragraph position="8"> They're in jail | for such things | as bad checks | or stealing.</Paragraph> <Paragraph position="9"> They're in jail || for such things as bad checks | or stealing.</Paragraph> <Paragraph position="10"> Although deterministic rules can be used to predict phrase breaks for speech synthesis applications, such a model will be limited in its usefulness in speech analysis. In addition, speech synthesis might be more natural if variability is included in the model. 
Here, a stochastic model is used to represent the natural variability in prosodic structure by deriving probabilities of phrase breaks, rather than predicting locations of phrase breaks by rule.</Paragraph> <Paragraph position="11"> The relationship between prosody and syntax is not fully understood, though it is generally accepted that there is such a relationship. For example, relatively higher syntactic attachment usually corresponds to relatively larger prosodic breaks, but there are many exceptions, as in: [[Mary]np [was amazed [Ann Dewey was angry]s']vp]s which was produced by four speakers as Mary was amazed || Ann Dewey was angry.</Paragraph> <Paragraph position="12"> In an analysis of the London-Lund corpus, Altenberg (1987) finds relative frequencies that describe the correspondence between prosodic constituents (tone units) and different syntactic units. This data supports the use of a probabilistic model, which also has an advantage in that it can be trained automatically, facilitating representation of a wide variety of speaking styles and allowing a means of discovering syntax-prosody relationships from a large corpus. One reason that the mapping between syntax and prosody is not simple is that, in speech, the constraints of syntactic structure and phrase length are balanced to produce a regular, roughly equal, sequence of prosodic phrases (Gee and Grosjean 1983). Consequently, we include constituent length as a factor in the model.</Paragraph> <Paragraph position="13"> The cost of obtaining a full and accurate syntactic parse can be high, which presents difficulties for text-to-speech synthesis systems. In addition, a full syntactic parse may not be necessary for predicting prosodic phrases, since prosody is not directly related to syntax. Consequently, we investigate computation/performance trade-offs associated with using a skeletal syntactic parse vs. 
simple part-of-speech (POS) assignments.</Paragraph> <Paragraph position="14"> To summarize, the model proposed here addresses several issues in modeling prosodic phrase structure. The model is a general formalism for an embedded hierarchy, which we specifically apply to represent sentences, major phrases, and minor phrases. In order to account for the allowable variability in prosodic parsing, the model is probabilistic. The structure of the model allows use of grammatical information such as part-of-speech labels, syntactic structure and constituent length, but the specific parameters are trained automatically. Finally, computational complexity trade-offs are investigated by evaluating the algorithm with and without syntactic cues.</Paragraph> <Paragraph position="15"> The remainder of the paper is organized as follows. We begin, in Section 2, by discussing past work in predicting prosodic phrase breaks from text for speech synthesis. In Section 3, we introduce the probabilistic formalism of the hierarchical model and outline the implementation: text pre-processing, parameter estimation, and phrase break prediction using a dynamic programming algorithm to obtain the most likely prosodic parse. In Section 4, we present experimental results for prediction of major and minor prosodic phrase breaks based on a corpus of FM radio news stories. Finally, we conclude in Section 5 by discussing possible implications and extensions of these results.</Paragraph> </Section> </Paper>