<?xml version="1.0" standalone="yes"?>
<Paper uid="J94-1002">
  <Title>A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location</Title>
  <Section position="4" start_page="30" end_page="37" type="metho">
    <SectionTitle>
3. Hierarchical Model of Prosodic Phrases
</SectionTitle>
    <Paragraph position="0"> A prosodic parse of a sentence can be represented by a sequence of break indices, one index following each word, which code the level of bracketing or attachment in a tree.</Paragraph>
    <Paragraph position="1"> A prosodic parse S is therefore given by S = (b1, b2, ..., bL),</Paragraph>
    <Paragraph position="3"> where bi is the break index after the ith word and L is the number of words in the sentence. A break is a random variable that can take on one of a finite number of values from "no break" (orthographic word boundary, but not a prosodic constituent boundary) to "sentence boundary," where the values form an ordered set that corresponds to the different levels of the hierarchy. Below we consider a stochastic model first for a general hierarchical prosodic parse (any specified number of levels), and then specifically for the three-level case that models a sentence as a sequence of major phrases, which are in turn modeled as a sequence of minor phrases. Although most phonological theories do not recognize the "sentence" as a unit, it is useful for both synthesis and recognition applications to model sentences separately, as sentence-final boundaries tend to be acoustically different from sentence-internal boundaries (e.g., a low boundary tone is much more likely).1
Computational Linguistics Volume 20, Number 1
We will begin by presenting the mathematical structure, first generally and then specifically as a three-level embedded hierarchy. Next, some pragmatic details of text processing are discussed, followed by a description of the parameter estimation and phrase break prediction algorithms.</Paragraph>
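The break-index representation described above can be sketched in a few lines of code. This is our illustration, not the authors' implementation; the level names and function are invented for the example.

```python
# Illustrative sketch of a prosodic parse as a sequence of break indices,
# one per word, drawn from an ordered set of levels; names are invented.
BREAK_LEVELS = ["no_break", "minor_break", "major_break", "sentence_break"]

def make_parse(words, breaks):
    """Pair each word with the break index that follows it."""
    assert len(words) == len(breaks)
    assert all(b in BREAK_LEVELS for b in breaks)
    assert breaks[-1] == "sentence_break"  # the final word ends the sentence
    return list(zip(words, breaks))

parse = make_parse(
    ["the", "cat", "sat", "down"],
    ["no_break", "minor_break", "no_break", "sentence_break"],
)
```

Each element of `parse` is a (word, break) pair, so the sequence of second components is exactly the prosodic parse S.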
    <Section position="1" start_page="31" end_page="33" type="sub_section">
      <SectionTitle>
3.1 Stochastic Model
</SectionTitle>
      <Paragraph position="0"> We assume the relationship between units and subunits will hold at any level of the hierarchy. Therefore, in describing the general case, we need only consider one level of embedding and will use Ui and uij when referring to units and subunits, respectively, at some unspecified level of the hierarchy. Using this notation, the probability of a unit Ui is parameterized in terms of the probability of the sequence of subunits uij (on the next lower level) and the length ni in subunits of that sequence, given the orthographic transcription of the sentence W:</Paragraph>
      <Paragraph position="2"> The specific hierarchy considered here involves representing the prosodic parse of a sentence S as an N-length sequence of major phrases Mi, S = (M1, ..., MN), where each major phrase Mi is in turn an ni-length sequence of minor phrases, Mi = (mi1, ..., mi,ni).</Paragraph>
      <Paragraph position="4"> Finally, a minor phrase mij is composed of a vij-length sequence of breaks bt starting at time t(i,j) and ending at time t(i,j+1) - 1, mij = (b_t(i,j), ..., b_t(i,j+1)-1), where t(i,j) is the time index of the first word of mij and t(i,j+1) - 1 is the time index of the final word of mij, vij = t(i,j+1) - t(i,j) is simply the number of words in the minor phrase, and the breaks bt take on values from the set {no break, minor break, major break, sentence break}.</Paragraph>
      <Paragraph position="5"> It might be useful to consider phonological words rather than orthographic words as possible sites for break indices. This could be accomplished, without using deterministic rules, by specifying the bottom of the hierarchy (e.g., break level 0) to represent locations internal to a phonological word and the next level of the hierarchy (e.g., break level 1) to represent phonological word boundaries. However, it is controversial as to whether phonological words can be larger or smaller than orthographic (lexical) words (Booij 1983; Nespor and Vogel 1983), so it is not clear how the lowest level should be defined relative to the orthographic words. In this work, we have chosen not to distinguish between these two levels, to reduce the complexity of implementation and performance evaluation. For similar reasons, we have limited this study to the more universally agreed-upon levels of major and minor prosodic phrases, although there is durational evidence that a more detailed hierarchy would be useful (Ladd and Campbell 1991; Wightman, Shattuck-Hufnagel, Ostendorf, and Price 1992). (Footnote 1: "... comprises syntactically well-formed sentences. The phrase prediction model may also be useful in speech recognition applications, in which case the term 'utterance' would clearly be more appropriate.")
M. Ostendorf and N. Veilleux Hierarchical Stochastic Model for Automatic Prediction
Table 1: Histograms of number of minor phrases in a major phrase and number of major phrases in a sentence, as a function of quantized length of the unit. The quantizer regions are indicated by the length ranges.</Paragraph>
      <Paragraph position="6"> [Table 1 body not recoverable from the extraction; the column headings give quantized lengths l(M) with counts of 1-5 minor phrases per major phrase, and l(S) with counts of 1-6 major phrases per sentence.]</Paragraph>
      <Paragraph position="7"> In the current work, we make several simplifying assumptions due to training limitations. First, the probability of the number of subunits in a unit p(ni|Ui) is assumed to depend only on the number of words in the unit l(Ui). Not surprisingly, our data indicate that units that span a larger number of words tend to comprise more subunits. Altenberg has noticed similar tendencies in the London-Lund corpus (Altenberg 1987, p. 81). (Alternatively, it has been suggested that either phonological word count or stressed syllable count, rather than orthographic word count, may be a useful measure of phrase length on the lowest level [Bachenko and Fitzpatrick 1990].) In addition, the probability distribution is approximated by conditioning on quantized lengths QU(l(Ui)). The quantizer varies as a function of the specific unit and is designed using a regression tree (Breiman, Friedman, Olshen, and Stone 1984). A regression tree partitions the data along intervals of a continuous variable, in this case the length of the unit, to decrease the variance of the response variable, the number of subunits in the unit. The resulting quantizer regions and the corresponding distribution of subunits in a unit are given in Table 1 for major phrase and sentence units.</Paragraph>
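The length-conditioned distribution can be estimated by simple relative frequencies over quantized lengths, as the following sketch shows. The quantizer edges, function names, and toy data are ours; the paper designs its quantizer with a regression tree rather than fixing the edges by hand.

```python
from collections import Counter, defaultdict

# Sketch of the simplifying assumption p(n_i | U_i) ~ p(n_i | Q(l(U_i))):
# the number of subunits depends only on the quantized word length of the
# unit. Quantizer edges here are invented for illustration; the paper
# derives them with a regression tree.
QUANTIZER_EDGES = [4, 8, 12]  # hypothetical length-range boundaries

def quantize(length):
    """Map a word length to a quantizer region index."""
    for region, edge in enumerate(QUANTIZER_EDGES):
        if length <= edge:
            return region
    return len(QUANTIZER_EDGES)

def estimate_length_model(units):
    """units: (num_words, num_subunits) pairs observed in training data.
    Returns relative-frequency estimates per quantizer region."""
    counts = defaultdict(Counter)
    for num_words, num_subunits in units:
        counts[quantize(num_words)][num_subunits] += 1
    return {q: {n: c / sum(ctr.values()) for n, c in ctr.items()}
            for q, ctr in counts.items()}

model = estimate_length_model([(3, 1), (5, 2), (6, 2), (10, 3), (11, 3)])
```

With real data, `model[quantize(l)]` gives the estimated distribution over subunit counts for a unit of length l.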
      <Paragraph position="8"> Using these simplifying assumptions, the constituent length probability distributions are then p(ni | Ui) = qU(ni | QU(l(Ui))).</Paragraph>
      <Paragraph position="10"> Next, major phrases in a sequence are assumed to be Markov given the number in the sequence: p(Mi | W, M1, ..., Mi-1) = p(Mi | W, Mi-1).</Paragraph>
      <Paragraph position="11"> Minor phrases are also assumed to be Markov, depending only on the previous minor phrase and the features of the major phrase it is contained in: p(mij | Wi, mi1, ..., mi(j-1), Mi-1) = p(mij | Wi, mi(j-1)) and p(mi1 | Wi, Mi-1) = p(mi1 | Wi, m(i-1)n(i-1)), where Wi is the sequence of feature vectors spanning the ith major phrase. For simplicity, we will abbreviate the notation to: p(mij | Wi, mi1, ..., mi(j-1), Mi-1) = p(mij | Wi, mprev). The conditional probability of a sequence of words within a minor phrase is assumed to depend on a state determined by the (variable-length) sequence of past words and the time of the last break, where the state is given by a decision tree as in Bahl, Brown, deSouza, and Mercer (1989): p(bk | Wi, bt(i,j), ..., bk-1, mprev) = p(bk | f(Wi, mprev)). Note that within a minor phrase, probabilities for only two cases, that of no break or that of any higher level break index, are used.</Paragraph>
      <Paragraph position="12"> Incorporating all of the above simplifying assumptions, the probability of a specific prosodic parse is given by Equations (4)-(6): p(S | W) = qS(N | QS(l(S))) times the product over i = 1, ..., N of p(Mi | W, Mi-1); p(Mi | W, Mi-1) = qM(ni | QM(l(Mi))) times the product over j = 1, ..., ni of p(mij | Wi, mprev); and p(mij | Wi, mprev) = the product over k = t(i,j), ..., t(i,j+1)-1 of p(bk | f(Wi, mprev)),</Paragraph>
      <Paragraph position="14"> where W is the sequence of feature vectors, one per word, extracted from the word sequence. Examples of possible features include part-of-speech labels and syntactic information such as bracketing labels or labels of an associated node in a syntactic tree. The decision tree f(Wi, mprev) used in determining the probabilities p(bk | f(Wi, mprev)) includes questions based on these features, attributes of the previous minor phrase and the current major phrase, and the length in words of the sentence. Details on our specific choice of features and questions are given in Section 4.</Paragraph>
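The per-boundary tree probabilities combine into a minor-phrase score by multiplying one factor per word boundary. The sketch below stands in for the decision tree with a dictionary from feature tuples to leaf distributions; the feature names and probabilities are invented, not from the paper.

```python
import math

# Sketch of scoring the boundaries inside a minor phrase: every internal
# boundary contributes its "no break" probability and the final boundary
# contributes its "break" probability, each read from the leaf reached by
# the tree state f(Wi, mprev). A dict stands in for the tree; the feature
# tuples and numbers are invented for illustration.
def minor_phrase_logp(boundary_features, leaf_probs):
    """boundary_features: one feature tuple per boundary in the phrase,
    with the last entry being the phrase-final boundary.
    leaf_probs: maps a feature tuple to (p_no_break, p_break)."""
    logp = 0.0
    last = len(boundary_features) - 1
    for i, feats in enumerate(boundary_features):
        p_none, p_break = leaf_probs[feats]
        logp += math.log(p_break if i == last else p_none)
    return logp

leaf_probs = {("cw_fw",): (0.5, 0.5), ("other",): (0.9, 0.1)}  # invented leaves
lp = minor_phrase_logp([("other",), ("cw_fw",)], leaf_probs)
```

In the full model this quantity is the log of Equation (6)'s product, which the dynamic programming search of Section 3.4 then combines across phrases.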
      <Paragraph position="15"> Our use of decision trees differs from the phrase break detection algorithm of Wang and Hirschberg (1992), although the tree design algorithm and choice of features are similar. The tree is not used to classify phrase breaks directly; instead, it is used to determine the probability of the occurrence of a minor break at some location, conditioned on the decision tree structure. This probability represents the lowest level in the hierarchical model.</Paragraph>
      <Paragraph position="16"> Previously, we mentioned two important factors affecting the placement of phrase breaks: (1) grammatical structure and (2) length constraints on the prosodic constituents, such as overall length and length relative to neighboring phrases. Grammatical information is incorporated in the tree f(Wi, mprev) through questions about the feature sequence Wi. Prosodic constituent length is modeled in two ways: through the constituent length probability distributions and through questions about the length of the previous phrase used in the tree f(Wi, mprev).</Paragraph>
    </Section>
    <Section position="2" start_page="33" end_page="35" type="sub_section">
      <SectionTitle>
3.2 Text Processing
</SectionTitle>
      <Paragraph position="0"> In the experiments reported here, the feature vectors include part-of-speech labels, punctuation and, optionally, information from a skeletal syntactic parse. The feature extraction is described in more detail below, and an example is given in Figure 1.</Paragraph>
      <Paragraph position="1"> Two levels of detail are considered for part-of-speech (POS) labeling. At the simplest level, a function word table look-up is used to categorize words either as one of six types of function words, as a proper name (P) if capitalized, or otherwise as a content word (c). Function words are divided into several classes: conjunctions (j) (such as and, but, if, because), auxiliary verbs and modals (v), determiners (d), prepositions (p), pronouns (n), and a general category (g), which includes the quantifiers and function-like adverbs such as not, no, ever, now. The POS labels given by the simple table look-up are referred to here as "word classes." A more detailed part-of-speech classification is given by Penn Treebank POS tags (Marcus and Santorini 1993), which were obtained automatically using the BBN tagger (Meteer, Schwartz, and Weischedel 1991). What we refer to here as POS labels is actually a grouping of these classes that includes the above function word categories, the proper name category (now determined by the tagger rather than from capitalization), plus categories for particles (pa), nouns (noun), verbs (verb), adjectives (adj), adverbs (adv), and all other content words (def).</Paragraph>
      <Paragraph position="2"> [Figure 1 not reproduced. Caption: Seven features are extracted for each word in a sentence to describe the boundary between that word and the following word. Syntactic information is based on a skeletal parse, as shown. Part-of-speech assignment is based on a table look-up from lists of function words. The feature rows shown in the figure are: word number in sentence, word class, part-of-speech, punctuation, left-dominated constituent, right-dominated constituent, both-dominated constituent, number of initiating constituents, and number of terminating constituents.]</Paragraph>
      <Paragraph position="8"> Contractions are not decomposed into separate words, since a phrase break cannot occur within a contraction. The contraction is treated as a single word in constituent length measures and feature extraction, and it is assigned the POS label of its base word (left component).</Paragraph>
      <Paragraph position="9"> Punctuation following a word is incorporated as a feature for that word. In our data, the only punctuation marks that appear are commas and periods. Periods and other sentence-final punctuation deterministically assign a sentence break. This implies some text preprocessing that distinguishes periods used for abbreviation from sentence-final periods. While commas often correspond to major breaks, there is a systematic exception: a series of the same syntactic units, such as a series of nouns (an apple, an orange, and a pear) or a series of adjectives (safe, cost-effective alternative ...), may or may not be associated with a major prosodic break. Therefore, we have chosen to use commas as a feature to determine the likelihood of a phrase break rather than as a deterministic cue to a prosodic break. Although using commas deterministically to assign a major phrase boundary yields better performance on our test set than using commas as a feature, we felt that using commas as a feature was a more extensible approach, and have used this strategy in the results reported here. Including commas as a feature does improve performance relative to not using commas, as will be discussed in Section 4. Commas (and other punctuation) can be very useful for prosodic boundary prediction when they are available, and they are used in other algorithms (e.g., Allen, Hunnicutt, and Klatt 1987; O'Shaughnessy 1989; Bachenko and Fitzpatrick 1990). However, commas are not reliably transcribed from spoken language and are not consistently used in written text, so it is important that the algorithm not depend too heavily on commas.</Paragraph>
      <Paragraph position="10"> Syntactic features were extracted from skeletal parses provided through a preliminary version of the Penn Treebank Corpus. Since these are hand-corrected parses, the results are indicative of the performance possible using syntactic information, but do not reflect performance achievable with an existing parser. Several researchers have investigated the relationship between prosody and syntax (e.g., Selkirk 1984; Gee and Grosjean 1983; Cooper and Paccia-Cooper 1980; Altenberg 1987). Our features have been motivated by some of these results, which suggest that some syntactic constituents are more likely to be separated by a phrase break than others. However, we have chosen to let the important constituents be determined automatically, similar to Wang and Hirschberg (1992), rather than by rule. One feature is the highest syntactic constituent dominating the left word but not dominating the right word, which describes potential locations for phrase breaks after a specific syntactic constituent. We also consider the similar case, the highest syntactic constituent dominating the right word but not the left, to allow for prosodic phrase breaks that may be associated with the beginning of a syntactic constituent. The lowest syntactic constituent that dominates both words is a feature that will provide information about which constituents are not likely to be divided by a phrase break. In addition, the number of terminating constituents and the number of initiating constituents between the two words were included as features to investigate the influence of relative strength of syntactic attachment. Eight categories of syntactic constituent were used: sentence (S), noun phrase (NP), verb phrase (VP), prepositional phrase (PP), wh-noun phrase (WHNP), adjective or adverbial phrase (AP), any other constituent (O), and both words in the same lowest level constituent (same).</Paragraph>
    </Section>
    <Section position="3" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
3.3 Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> An advantage of a stochastic model is that the parameters can be estimated automatically from a large corpus of data, which means that it is relatively straightforward to redesign the model to reflect a different speaking style. Here we describe a maximum likelihood approach to parameter estimation, where model parameters are chosen to maximize the likelihood of the training data.</Paragraph>
      <Paragraph position="1"> We will assume that sentences are independent and identically distributed to simplify parameter estimation and prediction, although the independence assumption precludes capturing any speaker-dependent or discourse effects. In this case, the likelihood of the prosodic parse of a corpus of sentences (S1, ..., ST) given parameters θ is, from Equations (4)-(6),</Paragraph>
      <Paragraph position="3"> Arranging terms, we have</Paragraph>
      <Paragraph position="5"> Since there are no cross dependencies between parameters, the four terms in Equation (7) can be maximized separately. The resulting parameter estimates for qS, qM, and qm are then simply relative frequency estimates. The last term is maximized jointly with the design of the state function f(·) using standard classification tree design techniques, as described below.</Paragraph>
      <Paragraph position="6"> The tree is grown using a greedy algorithm, which iteratively extends branches by choosing the parameters of a question, the question at a node, and the node in the tree that together maximize some criterion for reducing the impurity of the class distributions at the leaves of the tree. The tree is used to find the probability of a minor phrase boundary, so there are only two classes: "break" and "no break." In this work, we have used the Gini criterion, i.e., the node distribution impurity is given by i(t) = sum over i != j of p(i|t) p(j|t) (Breiman, Friedman, Olshen, and Stone 1984). Since the relative frequency of the "break" class is so low (8% of all breaks), we include different error costs in the design criterion. Generally, the cost of classifying a "break" as "no break" is chosen to be three to four times higher than the opposite error, and the specific costs for each tree are chosen to control the false prediction rate on the training set. Initially a tree is grown using two-thirds of the training data, and the remaining one-third of the data is used to determine a good complexity-performance trade-off point. The complexity criterion determined at this point is then used in pruning a second tree grown with the entire training set, in order to make better use of the available data. Each leaf t of the tree is associated with a conditional probability distribution of "break" vs. "no break" (actually, the relative frequency estimate). This probability distribution p(b|t) is used in the hierarchical model for computing the probability of a minor phrase, Equation (6), by running test data through the tree and using the probability distribution associated with the final leaf node t = f(Wi, mprev).</Paragraph>
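The two-class Gini criterion and the asymmetric error costs can be sketched as follows. Applying the cost by reweighting the "break" counts is one common implementation choice, and the cost value and counts here are illustrative, not taken from the paper.

```python
# Sketch of the two-class Gini impurity i(t) = sum_{i != j} p(i|t) p(j|t),
# with asymmetric error costs applied by inflating the "break" counts.
# The cost factor and example counts are invented for illustration.
def reweight(counts, break_cost=3.0):
    """Make a missed break cost ~break_cost times a false break."""
    return {"break": counts.get("break", 0) * break_cost,
            "no_break": counts.get("no_break", 0)}

def gini(counts):
    """Gini impurity of a node with the given class counts."""
    total = sum(counts.values())
    p_break = counts.get("break", 0) / total
    p_none = counts.get("no_break", 0) / total
    return 2 * p_break * p_none  # the two ordered pairs (i,j), (j,i)

def split_impurity(left, right):
    """Size-weighted impurity of a candidate binary split."""
    n = sum(left.values()) + sum(right.values())
    return (sum(left.values()) * gini(left) +
            sum(right.values()) * gini(right)) / n
```

The greedy tree grower would evaluate `split_impurity(reweight(left), reweight(right))` for each candidate question and keep the question that lowers it the most.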
    </Section>
    <Section position="4" start_page="36" end_page="37" type="sub_section">
      <SectionTitle>
3.4 Phrase Break Prediction Algorithm
</SectionTitle>
      <Paragraph position="0"> The stochastic model can be used to predict a prosodic parse for a sentence simply by finding the most probable prosodic parse for that sequence of words, where the probability of any given parse is determined by Equations (4)-(6). In other words, we hypothesize all possible prosodic parses, compute the probability of each, and choose the most probable. The most likely prosodic parse can be found efficiently using a dynamic programming (DP) algorithm that is similar to algorithms used in speech recognition, in particular that for the Stochastic Segment Model (Ostendorf and Roukos 1989), except that the dynamic programming routine is called recursively for successive levels in the hierarchy. Defining pt(ui1, ..., uin | Wi, Ui-1) as the probability of the most likely sequence of n subunits in, but not necessarily spanning, Ui and ending at location t, and uij(s, t) as a subunit that spans boundaries {bs, ..., bt}, the dynamic programming algorithm can be expressed generally in the subroutine that follows. This subroutine is called recursively for each level of the hierarchy, with the lowest level constituent probability being computed using probabilities given by the tree.</Paragraph>
      <Paragraph position="1"> Dynamic Programming Routine for Prosodic Parse Prediction. For each word t in unit Ui (t = 1, ..., li): Compute log pt(ui1(1, t) | Wi, Ui-1).</Paragraph>
      <Paragraph position="2"> For each n-length sequence of subunits spanning [1, t] (n = 2, ..., t): log pt(ui1, ..., uin | Wi, Ui-1) = max over s of [log p(s-1)(ui1, ..., ui(n-1) | Wi, Ui-1) + log p(uin(s, t) | Wi, ui(n-1))].</Paragraph>
      <Paragraph position="4"> Save pointers to the best previous break location s.</Paragraph>
      <Paragraph position="5"> To find the most likely sequence, log p(Ui | Wi, Ui-1) = max over n of [log pli(ui1, ..., uin | Wi, Ui-1) + log q(n | li)]. The final step is to decode the sequence of breaks once the value n* that maximizes the above equation is determined. Using the n* associated with any level unit, we can trace back to find the optimal segmentation of subunits that comprise that unit. The complete parse is found by tracing back at the highest level units and successively tracing back in each lower level.</Paragraph>
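The single-level search can be sketched concretely as follows. This is our illustration of the DP recursion, not the authors' code: `subunit_logp(s, t)` scores a subunit spanning word positions s..t (0-indexed, inclusive) and `length_logq(n, L)` is the length prior; both names and the toy example are invented.

```python
import math

# Sketch of the single-level dynamic programming search: find the most
# likely segmentation of a unit of L words into subunits, given a subunit
# score subunit_logp(s, t) and a length prior length_logq(n, L). Names
# are illustrative, not from the paper.
def best_segmentation(L, subunit_logp, length_logq):
    # best[t][n]: best log-prob of covering words 0..t with n subunits
    best = [dict() for _ in range(L)]
    back = [dict() for _ in range(L)]
    for t in range(L):
        best[t][1] = subunit_logp(0, t)
        back[t][1] = -1
        for n in range(2, t + 2):
            cands = [(best[s - 1].get(n - 1, -math.inf) + subunit_logp(s, t), s)
                     for s in range(1, t + 1)]
            best[t][n], back[t][n] = max(cands)
    # choose n* with the length prior folded in, then trace back the breaks
    n_star = max(best[L - 1], key=lambda n: best[L - 1][n] + length_logq(n, L))
    bounds, t, n = [], L - 1, n_star
    while n > 1:
        s = back[t][n]
        bounds.append(s)
        t, n = s - 1, n - 1
    return n_star, sorted(bounds)

# toy example: subunits spanning words 0-1 and 2-3 are strongly preferred
n_star, bounds = best_segmentation(
    4,
    lambda s, t: 0.0 if (s, t) in {(0, 1), (2, 3)} else -5.0,
    lambda n, L: 0.0,  # flat length prior
)
```

In the hierarchical model, `subunit_logp` at an upper level would itself call this routine on the next level down, which is the recursion described in the text.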
      <Paragraph position="6"> For the specific case of a three-level hierarchy, the most likely major phrase sequence in a sentence, maximizing p(S | W), and the most likely minor phrase sequence in a major phrase, maximizing p(Mi | W), are found by a dynamic programming algorithm, called recursively. The lowest level unit considered here is the minor phrase, and the probability of the minor phrase is computed as given in Equation (6) using the decision tree.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="37" end_page="49" type="metho">
    <SectionTitle>
4. Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
4.1 Corpus
</SectionTitle>
      <Paragraph position="0"> For our investigation of prosodic phrase structure, an FM radio news story corpus was used. The training data included ten stories from one announcer and another ten stories from a second announcer, both female, for a total of 312 sentences (6,157 words, or potential boundary locations). The stories were studio recordings of actual radio broadcasts, which were transcribed by a listener who did not have access to the original scripts. It is likely that the transcription of punctuation did not exactly match the original written text and may have been biased by the prosody of the utterance. However, the radio announcers tended to annotate the transcribed text before reading the test stories, so we conjecture that commas were more often omitted than inserted in our transcriptions. All of the training stories were used to estimate the probabilities of the number of subconstituents (Equations 1-3). In the first pass of tree design, two-thirds of the training data was used to grow the tree and one-third was used to determine the performance-complexity trade-off, but the final tree used was redesigned on the entire training set.</Paragraph>
      <Paragraph position="1"> For testing, we used five versions of a different story spoken by two female and two male announcers (one radio broadcast version and four radio-news-style lab recordings). One of the female announcers (two spoken versions) was the same as the speaker who provided roughly three-quarters of the training data. Multiple test versions are used in order to allow for some acceptable differences in phrasing in the context of the FM radio news style, and to investigate the possibility of speaker-dependent effects. On average, there were 3.3 different prosodic parses among the five versions. The test story contained 23 sentences (385 words) ranging in length from 3 to 36 words. For reference, the test sentences are included in an appendix with the phrase predictions of our best system.</Paragraph>
      <Paragraph position="2"> Prosodic phrase breaks were hand-labeled in the entire corpus; the training set labels were used for estimating the parameters of the model and test set labels were used for evaluating the performance of the model. The prosodic phrase labeling system used break indices marked between each pair of words, based on auditory perceptual judgments (that is, the labelers did not have access to spectrogram or pitch displays). The break indices ranged on a scale of 0 to 6, chosen to map to a superset of the prosodic hierarchies proposed in the literature. The labeling scheme is described in more detail in Price, Ostendorf, Shattuck-Hufnagel, and Fong (1991). Six of the stories were labeled by polling two listeners who discussed any discrepancies. The remaining stories were labeled by a third listener working independently. Comparing the labels of one story using both schemes showed that there was a high degree of consistency across labelers. For the full seven-level labeling system, the correlation between the two sets of labels was 0.93, where correlation is computed as the maximum likelihood estimate of the correlation coefficient based on the two sets of labels. Only 1% of the labels differed by 2, and these were at locations where the disagreement was actually over the location of the boundary rather than the relative strength of the boundary.</Paragraph>
      <Paragraph position="3"> In this work we considered only a three-level hierarchy and therefore mapped breaks 0-2 to "no break," 3 to a "minor break" (|), 4 and 5 to a "major break" (||), and 6 to a "sentence break."</Paragraph>
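The mapping from the seven-level break-index labels to the three-level hierarchy is a simple table, sketched below; the label strings are ours.

```python
# Sketch of the mapping from seven-level break indices (0-6) to the
# three-level hierarchy used in this work; label names are invented.
def map_break_index(b):
    if 0 <= b <= 2:
        return "no_break"
    if b == 3:
        return "minor_break"
    if b in (4, 5):
        return "major_break"
    if b == 6:
        return "sentence_break"
    raise ValueError(f"break index out of range: {b}")
```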
    </Section>
    <Section position="2" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
4.2 Evaluation Methods
</SectionTitle>
      <Paragraph position="0"> The goal of this algorithm is to predict placement of phrase breaks that sound natural to listeners and that communicate the intended meaning of the sentence. As mentioned above, many renditions of a sentence can fulfill this criterion. Therefore, we have attempted to estimate system performance by comparing the predicted breaks to parses observed in five spoken versions of the sentence. Although the ultimate test of the algorithm is in a speech synthesis system, a quantitative measure of system performance is useful in algorithm development and comparison. We have considered four performance measures in this work.</Paragraph>
      <Paragraph position="1"> Since one incorrectly assigned break could make a whole sentence or clause unacceptable, one measure of system performance is the number of sentences with a predicted parse that entirely matches a parse observed in any of the five spoken versions. When such a match occurs, we call the predicted parse "correct." The five spoken versions do not represent an exhaustive set of acceptable parses, however. Therefore, in a separate evaluation, the sentence is also judged subjectively to determine whether it is an "acceptable" parse. The numbers of sentences that fall into these two categories are reported separately, and for the best case system are marked separately in the results in the appendix.</Paragraph>
      <Paragraph position="2"> In order to better understand the system performance, we have chosen to compute additional error measures based on the prediction accuracy at individual break locations. A predicted sentence is compared to each of the five spoken versions, and the closest spoken version is used as the reference for that sentence. (The closeness of parses is measured using a Euclidean distance with 0 for no break, 1 for minor break and 2 for major break.) Then the correspondence between predicted and observed breaks is tabulated in a confusion matrix. Sentence breaks are deterministically assigned at periods, but these are included in the performance results reported here (as major breaks) to be consistent with results reported elsewhere. Also, note that confusion tables for different systems sometimes reflect different numbers of observed minor and major breaks because the predicted sentences may best match different versions of the test sentence.</Paragraph>
      <Paragraph position="3"> It is also useful to have a simple measure for comparing systems. One possible performance figure is the overall percent correct, but we have found this measure difficult to interpret because the overall figure is dominated by performance on the much more frequent "no break" locations. Instead, we compute the correct prediction and false prediction rates for breaks as a combined class (merging minor and major breaks). Using terminology from detection theory, these are also referred to as correct detection (CD) and false detection (FD) in the following sections. CD/FD results must be interpreted with some caution, because there is a trade-off between the two error rates: higher break detection rates are associated with a higher rate of false break insertion. If the insertion rate is too high, there will be few good parses at the sentence level. We have therefore tried to control the insertion rate as much as possible for the different systems evaluated. Two types of CD/FD results are reported. One figure is computed based on comparison to the nearest sentence of the five versions.</Paragraph>
      <Paragraph position="4"> In addition, since other research results have been reported based on comparison to only one spoken version, we include correct prediction and false prediction rates that correspond to the average rates over the five separate test versions. In general, the correct prediction rates using the single version comparison are roughly 10% lower than using the comparison to five versions, so comparison to one version significantly underestimates performance of the algorithms. The variation in error rate over the five versions is relatively small, as shown later in the discussion of speaker-dependent effects.</Paragraph>
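The CD/FD scoring with minor and major breaks merged can be sketched as follows. Note one assumption of ours: the paper does not spell out the normalization of the false detection rate, so here FD is taken relative to the no-break locations in the reference.

```python
# Sketch of correct-detection (CD) / false-detection (FD) scoring for
# breaks as a combined class (minor and major merged). Normalizing FD by
# the no-break locations is our assumption, not stated in the paper.
def cd_fd(predicted, reference):
    """Each argument is a list of break labels, one per word boundary."""
    is_break = lambda b: b != "no_break"
    pairs = list(zip(predicted, reference))
    ref_breaks = sum(is_break(r) for _, r in pairs)
    ref_none = len(pairs) - ref_breaks
    hits = sum(is_break(p) and is_break(r) for p, r in pairs)
    false_alarms = sum(is_break(p) and not is_break(r) for p, r in pairs)
    cd = hits / ref_breaks if ref_breaks else 0.0
    fd = false_alarms / ref_none if ref_none else 0.0
    return cd, fd

cd, fd = cd_fd(
    ["minor_break", "no_break", "major_break", "no_break"],
    ["minor_break", "no_break", "no_break", "no_break"],
)
```

To score against five spoken versions as in the text, one would run `cd_fd` against each version and keep the result for the closest reference.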
    </Section>
    <Section position="3" start_page="39" end_page="43" type="sub_section">
      <SectionTitle>
4.3 Tree Questions and Designs
</SectionTitle>
      <Paragraph position="0"> Several experiments using different sets of questions to train the embedded decision trees were performed in order to compare the relative merits of different information in the hierarchical model, as well as trade-offs associated with computational complexity.</Paragraph>
      <Paragraph position="1"> The entire set of questions is listed below. All experiments included questions 1-8, which were based on features that were relatively straightforward to extract from text, using a table look-up to assign part-of-speech labels. Experiments that made use of syntactic features also allowed questions 9-13. The syntax experiments were based on trees that were trained using only 14 of the 20 stories, since skeletal parses were only available for these stories. Another set of experiments included question 14, which tested the ratio of the current minor phrase length to the previous minor phrase length.</Paragraph>
      <Paragraph position="2"> Finally, experiments that made use of the more detailed POS classifications included question 15, and used the additional particle category in question 3. All questions were based on features derived from text information only.</Paragraph>
      <Paragraph position="3">  M. Ostendorf and N. Veilleux Hierarchical Stochastic Model for Automatic Prediction Below we enumerate the questions used in the different tree design experiments, together with the motivation for each question.</Paragraph>
      <Paragraph position="5"> 1. Is this a sentence or major phrase boundary? Assuming major breaks occur at qualitatively different locations than minor breaks, we effectively remove the major breaks and sentences from our training corpus with this question.</Paragraph>
      <Paragraph position="6"> 2. Is the left word a content word and the right word a function word? In the training data, 65% of the minor and major breaks combined occur at content word/function word (CW/FW) boundaries, and about half of the CW/FW boundaries are marked with breaks. The CW/FW boundaries also correspond to the prosodic group boundaries used deterministically in Sorin, Larreur, and Llorca (1987) and in Veilleux et al. (1990).</Paragraph>
      <Paragraph position="7"> 3. What is the function word type of the word to the right? Previous work in prosodic parsing with a small dictionary (Sorin, Larreur, and Llorca 1987) suggested that different types of function words may be more or less likely to signal a prosodic phrase break.</Paragraph>
      <Paragraph position="8"> 4. Is either adjacent word a proper name (capitalized)? Preliminary examination of our data suggested there was some relationship between proper nouns and phrase boundaries, probably related to the phrasing of complex nominals.</Paragraph>
      <Paragraph position="9"> 5. How many content words have occurred since the previous function word? Speakers seemed to insert phrase breaks when a string of content words became long, e.g., exceeded four or five words.</Paragraph>
      <Paragraph position="10"> 6. Is there a comma at this location? Usually, but not always, a major phrase break occurs at locations orthographically transcribed with commas.</Paragraph>
      <Paragraph position="11"> 7. What is the relative location in the sentence (in eighths)? Previous work (Gee and Grosjean 1983) has suggested that prosodic phrase boundaries tend to bisect a longer unit. Therefore, one of the questions used to partition the training data is the ratio of the word number over the sentence length, quantized to the nearest eighth.</Paragraph>
      <Paragraph position="12"> 8. What is the relative location in the proposed major phrase (in eighths)? This question is included following the same reasoning as the previous question.</Paragraph>
      <Paragraph position="13"> 9. What is the largest syntactic unit that dominates the word preceding the potential boundary location but does not dominate the succeeding word? Phrase breaks are known to co-occur with certain syntactic configurations. For example, phrase breaks often occur before subordinate clauses.</Paragraph>
      <Paragraph position="14"> 10. What is the largest syntactic unit that dominates the word succeeding the potential boundary location but does not dominate the preceding word? The rationale behind this question is similar to that of the previous question. 11. What is the smallest syntactic unit that dominates both words? Some syntactic units may be less likely to be broken up by a phrase break.</Paragraph>
      <Paragraph position="15"> 12. How many syntactic units end between the two words? This question provides information on the relative level of syntactic attachment between the two words, capturing the effect of constituent endings.</Paragraph>
      <Paragraph position="16"> 13. How many syntactic units begin between the two words? This question is similar to the previous one, except that it captures effects associated with the start of new constituents.</Paragraph>
      <Paragraph position="17"> 14. How large is the ratio of the current minor phrase length over the previous minor phrase length? This question incorporates the concept of balancing minor phrase lengths noted by other researchers (Gee and Grosjean 1983; Bachenko and Fitzpatrick 1990), and was found to be useful in phrase prediction trees investigated by Wang and Hirschberg (1992). In the beginning of a sentence where there is no previous minor phrase, the ratio is treated as missing data and handled using a surrogate variable (Breiman, Friedman, Olshen, and Stone 1984).</Paragraph>
      <Paragraph position="18"> 15. What is the part-of-speech label of the content word to the right? to the left? Wang and Hirschberg found that part-of-speech information is useful in phrase break prediction (Wang and Hirschberg 1992).</Paragraph>
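Several of the text-based features behind questions 1-8 are simple enough to sketch directly. The tag inventory, the `FUNCTION_TAGS` set, and all function names below are assumptions made for illustration; the paper itself assigns part-of-speech labels by table look-up.

```python
# Sketch (not the authors' code): a few of the text-based features used in
# questions 1-8, computed at the boundary after word i.

FUNCTION_TAGS = {"det", "prep", "conj", "pron", "aux"}  # assumed inventory

def is_content(tag):
    return tag not in FUNCTION_TAGS

def boundary_features(words, tags, i, sentence_len):
    feats = {}
    # Q2: content word on the left, function word on the right?
    feats["cw_fw"] = is_content(tags[i]) and not is_content(tags[i + 1])
    # Q4: is either adjacent word capitalized (proper-name heuristic)?
    feats["proper"] = words[i][0].isupper() or words[i + 1][0].isupper()
    # Q5: content words since the previous function word
    run = 0
    for t in reversed(tags[:i + 1]):
        if is_content(t):
            run += 1
        else:
            break
    feats["cw_run"] = run
    # Q6: comma at this location?
    feats["comma"] = words[i].endswith(",")
    # Q7: relative location in the sentence, quantized to eighths
    feats["rel_loc"] = round(8 * (i + 1) / sentence_len)
    return feats
```

The capitalization test is the crude heuristic the question implies; sentence-initial words would need special handling in practice.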
      <Paragraph position="19"> Questions 5, 12, 13, and 14 are based on numerical features, so the binary question asks whether the feature is greater than some threshold, where the threshold is determined automatically in tree design. All other questions are based on categorical variables, and the best binary groupings of the possible values are determined automatically (Breiman, Friedman, Olshen, and Stone 1984). Two of the questions (8, 14) require knowledge of major or minor phrase boundaries. This information is available in the training data or from a spoken utterance, but hypothesized locations of minor and major phrase breaks must be used in phrase prediction from text. Therefore, these features are calculated dynamically in the prediction algorithm for each hypothesized prosodic parse.</Paragraph>
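The automatic thresholding of numerical features can be illustrated with a CART-style impurity search over candidate split points. The Gini criterion and the function names are illustrative assumptions; CART as described by Breiman et al. admits several impurity measures, and the paper does not say which was used.

```python
# Sketch of CART-style threshold selection for a numerical feature
# (questions 5, 12, 13, 14): try each candidate threshold and keep the one
# that most reduces Gini impurity. Illustrative, not the authors' code.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n           # labels are 0 (no break) / 1 (break)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    best_t, best_drop = None, 0.0
    parent = gini(labels)
    for k in range(1, len(pairs)):
        # midpoint between adjacent sorted values
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if parent - child > best_drop:
            best_t, best_drop = t, parent - child
    return best_t, best_drop
```

For categorical questions the analogous search is over binary groupings of the category values rather than thresholds, as the text notes.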
      <Paragraph position="20"> The first tree was designed using only the very simple information represented by questions 1-8. The resulting tree is shown in Figure 2, with the relative frequency of a break in the training data included at each node. The first split trivially locates the sentence and major break boundaries. The second split utilized the content word/function word boundary question that we had used deterministically in previous work (Veilleux et al. 1990). The content/function word boundaries seem to be important in other algorithms as well: they correspond closely to the phi-phrase boundaries that would be predicted by the Bachenko-Fitzpatrick algorithm, and they seem to be captured in the Wang-Hirschberg text-only tree by a succession of questions about the part-of-speech labels of the words adjacent to the break. Of the boundaries that were preceded by a content word and followed by a function word, 30% were hand-labeled as minor breaks, whereas only 4% of other locations were labeled as minor breaks, and these were identified by the next question as coinciding with a comma. The complete tree was relatively small (9 nodes), and used almost all questions provided. On the training data, the resulting tree classified 89% of the nonbreaks correctly and 59% of the minor breaks correctly. All sentence and major breaks were given in the tree design.

The next stage was to incorporate syntactic information (questions 9-13) into the tree design algorithm to determine minor phrase probabilities. Syntactic parses were available for only 14 of the 20 training stories (217 sentences, 4,230 words), and the tree was designed using this subset. A very simple five-node tree was designed, as shown in Figure 3. Again the first nontrivial question concerned the content/function word boundaries, and the presence of a comma was again used to predict minor breaks at other locations.
The two other questions in the tree were based on which syntactic unit dominated one or the other of the words at the boundary site. The tree design algorithm chose the syntactic units that were less likely to contain a boundary as: words in the same constituent, words separated by a wh-noun phrase boundary, and words separated by a verb phrase initial boundary. (In their work on spontaneous speech, Wang and Hirschberg found that noun phrases in general tended to be less likely to contain boundaries.) The tree with syntactic information seemed to classify minor breaks with slightly higher accuracy than the previous tree: 90% correct classification of nonbreaks and 62% correct classification of minor breaks in the training data.

A third tree, illustrated in Figure 4, was grown using the first 8 baseline questions and question 14, which examines the ratio of current minor phrase length to previous minor phrase length. The motivation behind this question is constituent balancing, as mentioned earlier. The main difference between this tree and the first one is that the minor phrase length ratio test is chosen instead of the question about the position in a major phrase. These two questions served similar roles, as evidenced by the fact that the surrogate variable for the ratio test was the location of the current word within</Paragraph>
      <Paragraph position="21"> the major phrase, in terms of the ratio of the number of words up to the current position over the total length of the major phrase. Classification rates on the training data for this tree were 87% for nonbreaks and 66% for minor breaks. A fourth tree was designed using the first fourteen questions, and performance was similar to that for the third tree.</Paragraph>
      <Paragraph position="22"> The decision tree design algorithm's performance was not significantly changed by the introduction of additional features. New features can supplant previously used ones, as also found by Wang and Hirschberg (1992), because of the redundancy in information between features. For example, in the syntax trees, many of the baseline questions were no longer chosen, but the overall classification performance was similar.</Paragraph>
    </Section>
    <Section position="4" start_page="43" end_page="45" type="sub_section">
      <SectionTitle>
4.4 Phrase Prediction Results
</SectionTitle>
      <Paragraph position="0"> The trees were used in the hierarchical model, and the phrase break prediction algorithm was evaluated on the independent test set described in Section 4.1. A summary of the results is given in Table 2, and the corresponding confusion matrices are in Tables 3 and 4. The baseline system (questions 1-8) gave the best performance, with a correct prediction rate of 81% and a false prediction rate of 4%. The results indicate that syntactic information did not improve the performance of the algorithm, and in fact gave poorer phrase predictions by every measure of performance on the test data.</Paragraph>
      <Paragraph position="1"> The difference in performance cannot be attributed to the smaller amount of training data used in the experiments with syntax, because designing the model without syntax on this subset actually yielded slightly better performance on the test set than that</Paragraph>
      <Paragraph position="3"> designed on the full training set. We conjecture that the poorer performance associated with using syntax in our model may be due to the fact that syntax plays more of a role in location of major breaks as opposed to the minor breaks predicted in the tree. As we shall see later, syntactic cues were useful in our implementations of the Bachenko-Fitzpatrick and Wang-Hirschberg algorithms. We also found that the minor phrase length ratio test hurt performance, which is likely due to the fact that the ratios are based on hypothesized boundary locations in phrase prediction, as opposed to the known locations used in training. In examining the confusion matrices, we see the main effect of the additional syntactic and minor phrase length ratio questions is more errors at minor phrase boundary locations.</Paragraph>
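Because features such as the minor phrase length ratio depend on breaks that have only been hypothesized, prediction from text requires searching over candidate parses while recomputing those features along the way. The beam search below is a hedged sketch of that idea only: `tree_prob` is a toy stand-in for the trained decision tree's break probability, and the feature set, beam width, and scoring are invented for illustration, not the authors' algorithm.

```python
# Sketch (assumptions marked): searching for the best break sequence when
# some features (e.g., the minor phrase length ratio, question 14) depend on
# the breaks hypothesized so far.

import heapq
import math

def tree_prob(features):
    # Stand-in for the trained tree's P(break | features): favors a break
    # when the current minor phrase is getting long. Purely illustrative.
    return min(0.9, 0.1 + 0.15 * features["phrase_len"])

def predict_breaks(n_words, beam_width=5):
    # Each hypothesis: (negative log probability, break decisions so far).
    beam = [(0.0, [])]
    for i in range(n_words - 1):          # one decision per word boundary
        grown = []
        for cost, breaks in beam:
            # Dynamic feature: words since the last hypothesized break.
            last = max([-1] + [j for j, b in enumerate(breaks) if b])
            feats = {"phrase_len": i - last}
            p = tree_prob(feats)
            grown.append((cost - math.log(1 - p), breaks + [0]))
            grown.append((cost - math.log(p), breaks + [1]))
        beam = heapq.nsmallest(beam_width, grown, key=lambda h: h[0])
    return beam[0][1]
```

With the toy probability model, long unbroken stretches become expensive, so the best hypothesis inserts periodic breaks — the qualitative behavior the dynamic features are meant to capture.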
      <Paragraph position="4"> Computational Linguistics Volume 20, Number 1. Table 2: Performance of different break prediction algorithms, including variations of our hierarchical model, variations of a tree-based classifier, and the Bachenko-Fitzpatrick (B-F) algorithm, based on a test set of 23 sentences (386 words). &amp;quot;Questions Used&amp;quot; refers to those questions listed in Section 4.3 for the two tree-based algorithms. Although the B-F algorithm does not use these specific questions, it does utilize syntactic information as well as relative constituent length. Correct Detection/False Detection (CD/FD) rates are for the merged category of minor and major breaks, computing (a) error according to the closest utterance of five versions, and (b) the average error in comparing to a single utterance. Examining the sentence-level performance of the algorithms, we find that a phrase break was inserted between the verb and the particle in three of the six unacceptable</Paragraph>
      <Paragraph position="5"> Tables 3 and 4: Confusion matrices for predicted breaks using the hierarchical system with and without syntax, in both cases with the minor phrase length ratio test.</Paragraph>
    </Section>
    <Section position="5" start_page="45" end_page="49" type="sub_section">
      <SectionTitle>
Tables 3 and 4 (No Syntax / Syntax)
</SectionTitle>
      <Paragraph position="0">
No Syntax                              Syntax
                Actual                                 Actual
Predicted   major  minor  no-break     Predicted   major  minor  no-break
major          51     15         9     major          49     13         9
minor           5      1         1     minor           5      5         2
no-break        9      9       286     no-break        9     12       282

parses (e.g., tried | out, plugs | in, and check | in). This is not surprising since the simple POS labeling scheme labels the particle as a preposition. The trees using syntactic information were not able to overcome this effect because of the relative sparsity of particles in the training data (only 5% of the words labeled as prepositions are particles). Other mistakes included a misplaced minor phrase and a deleted major phrase where a comma occurs in the original text. Most of the sentences that were correct (had an exact match with one of the spoken versions) were shorter in length. However, there were several long sentences judged to have acceptable parses. Since many more variations in prosodic phrasing are allowable for longer sentences, it is not surprising that the predicted version was not one of the five spoken versions. The predicted breaks for the best system, the hierarchical model based on questions 1-8, are shown in the Appendix together with the closest spoken prosodic parse.

Table 5: Confusion matrices for predicted breaks using the simple classification trees (no hierarchical model) with and without syntax, in both cases without the minor phrase length ratio test.

In tree design, we chose to represent major breaks as a separate category that the tree was not explicitly designed to detect. A consequence of this choice is that there are fewer &amp;quot;break&amp;quot; data points for training the tree, since there are fewer than half as many minor breaks as major breaks in the training data. This choice is reasonable if the two breaks occur at qualitatively different locations, which we suspect.</Paragraph>
In fact, results using trees that were trained by merging major and minor breaks into a single category and then embedded in the hierarchical model had either lower prediction accuracy or a higher false prediction rate. Another consequence of using only minor breaks to train the tree is that features that are associated with major breaks are not represented in the model, which may explain the poor performance of the model with syntax. However, this problem could be addressed with an extension of the current model.</Paragraph>
      <Paragraph position="1"> In order to see if explicit modeling of a prosodic hierarchy was a useful aspect of the model, we conducted similar experiments using trees designed specifically for classification. A binary decision tree was trained using the baseline questions (1-8), and another tree was trained using syntax as well (1-13). In both cases, the trees were trained to predict three classes: major break, minor break, and no break. Including the minor phrase length ratio test using known break locations (from hand labels) did not improve prediction performance, so we did not implement a dynamic version based on hypothesized minor breaks. Results for these two trees are included in Table 2, with confusion matrices given in Table 5. The costs of different errors were chosen to obtain a false detection rate similar to that for the hierarchical model. Choosing good costs proved difficult, so the correct detection rate is lower than that for the other models primarily because the false detection rate was so low. The difference in false detection rates makes comparison to the hierarchical model difficult. However, experience with performance of the models at different false detection rates suggests that the baseline hierarchical model outperforms the classification tree that does not use syntax, but that the classification tree that uses syntax is at least as good as the hierarchical model. Since the complexity associated with obtaining a syntactic parse is significantly greater than that associated with the simple three-level hierarchy that we have proposed, we conclude that explicit modeling of a hierarchy is a useful feature of the model. In addition, the fact that syntactic information was useful for the classification tree but not for the hierarchical model suggests that syntactic features are more important for predicting major breaks than minor breaks, since major breaks are not represented as a class in tree design for the hierarchical model.</Paragraph>
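Choosing error costs to trade correct detection against false detection, as described above for the classification trees, can be sketched as minimum-expected-cost classification at a tree leaf. The cost matrices, probabilities, and function names below are invented for illustration; the paper does not publish its cost values.

```python
# Sketch: cost-weighted class assignment at a decision tree leaf.
# Raising the penalty for inserting a spurious break lowers the false
# detection rate at the expense of correct detection, which is how false
# detection rates can be roughly matched across systems.

def leaf_decision(leaf_probs, costs):
    """leaf_probs: P(true class | leaf); costs[pred][true] = penalty for
    predicting `pred` when the truth is `true` (zero on the diagonal)."""
    classes = list(leaf_probs)
    def expected_cost(pred):
        return sum(costs[pred][t] * p for t, p in leaf_probs.items() if t != pred)
    return min(classes, key=expected_cost)

# Equal costs vs. a tripled penalty for a false break (illustrative values):
cheap_fd = {"break": {"no-break": 1.0}, "no-break": {"break": 1.0}}
dear_fd = {"break": {"no-break": 3.0}, "no-break": {"break": 1.0}}
```

The same leaf can flip from predicting a break to predicting no break purely because the false-break penalty was raised, which is the knob the text describes as "difficult to choose."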
      <Paragraph position="2"> For both the hierarchical model and the simple classification trees, we also investigated the use of more detailed part-of-speech information both with and without the syntactic features. The more detailed part-of-speech information did not improve performance under any of these conditions. For the classification trees, correct detection improved slightly, but there was a corresponding increase in false detection. For the hierarchical model, performance actually degraded. These results suggest that the complexity associated with more detailed part-of-speech tagging may not be necessary; however, further research is needed to answer this question. It may be that other POS questions, such as testing a larger window of words around the break as in Wang and Hirschberg (1992), would yield better results.</Paragraph>
      <Paragraph position="3"> Finally, we thought it would be interesting to compare the results of our prediction algorithm to that of Bachenko and Fitzpatrick's algorithm on our corpus. Since the test set was relatively small, we were able to implement the algorithm by predicting the phrase boundaries from the rules by hand. To assign node indices to prosodic breaks (Bachenko and Fitzpatrick 1990), a critical value for separating major and minor phrase breaks is calculated based on an average of the indices associated with the prosodic phrase nodes, where the prosodic phrase nodes are all those created by the Bachenko-Fitzpatrick primary salience rules. Boundaries with an index greater than the critical value are assigned a major break, indices below 5 have no prosodic break, and intermediate indices map to a minor prosodic break. (Bachenko and Fitzpatrick include index 4 in the minor break category, but 5 was used here to obtain a lower insertion rate.) For multiple verb phrases in sequence, the verb balancing rule is applied left-to-right until all verb phrases are grouped before applying the verb adjacency rule or other processing. The confusion matrix for these results is shown in Table 6, and the performance summary is also included in Table 2. Although the correct break detection rate is significantly higher than that for the other algorithms, the false detection rate is also higher, and so the sentence accuracy is similar to that for the baseline hierarchical model. Unlike the other algorithms, the Bachenko-Fitzpatrick algorithm did not make the mistake of assigning a minor phrase break before a particle, but this relies on having a parser that can make that distinction. An advantage of both the classification tree and the hierarchical model over the Bachenko-Fitzpatrick model is that they can be automatically trained, and thus can be tuned to handle particular tasks.</Paragraph>
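Our reading of the index-to-break mapping just described can be sketched as follows. Note one simplifying assumption: in the actual procedure the critical value is an average over the indices of the prosodic phrase nodes created by the Bachenko-Fitzpatrick primary salience rules, whereas this sketch simply averages the supplied indices.

```python
# Sketch of the break-index mapping used in our implementation of the
# Bachenko-Fitzpatrick algorithm: indices above the critical value become
# major breaks, indices below the floor become no break, and intermediate
# indices become minor breaks. Bachenko and Fitzpatrick put index 4 in the
# minor category; a floor of 5 was used here to lower the insertion rate.

def assign_breaks(indices, minor_floor=5):
    # Assumption: average over the given indices stands in for the average
    # over prosodic-phrase-node indices.
    critical = sum(indices) / len(indices)
    labels = []
    for idx in indices:
        if idx > critical:
            labels.append("major")
        elif idx < minor_floor:
            labels.append("none")
        else:
            labels.append("minor")
    return labels
```

For example, indices [2, 5, 9, 3, 8] give a critical value of 5.4, so 9 and 8 map to major breaks, 5 to a minor break, and 2 and 3 to no break.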
      <Paragraph position="4"> Table 7 gives the correct detection and false detection rates calculated by comparing the predicted prosodic parses to each of the different spoken versions. Speaker f2b, whose speech made up roughly three-quarters of the training data, had performance similar to the average for the five versions, with slightly lower correct detection rates but also slightly lower false detection rates. These results suggest that the automatic algorithms are not particularly speaker-dependent, though we expect that it is important to have similar styles for both training and test data. There was no consistent difference in performance between male and female speakers, and the difference in error rates for different speakers was relatively small.</Paragraph>
      <Paragraph position="5"> Table 7: Correct detection/false detection rates for predicted phrase breaks compared to each of the five different test versions. With the exception of the Bachenko-Fitzpatrick algorithm, the systems included here did not use the minor ratio question. Speaker codes begin with &amp;quot;f&amp;quot; or &amp;quot;m&amp;quot; for female and male speakers, respectively. Speaker code f2b is annotated with &amp;quot;r&amp;quot; for the original radio recording and &amp;quot;l&amp;quot; for the subsequent lab recording.</Paragraph>
      <Paragraph position="6"> In all of the previous experiments, the presence of a comma was an important feature for predicting phrase boundaries for all algorithms implemented. While this is a valid feature in text-to-speech synthesis applications, it is not available in applications involving spoken language. (Although presence of a pause might be a useful alternative feature.) In addition, as we have mentioned earlier, commas are not reliably used even in written text. Therefore, it is interesting to determine the performance of the algorithm without the comma feature. As expected, performance degrades significantly, both for the hierarchical model and for the classification trees. In addition, for the hierarchical model, syntactic information and the minor phrase length ratio test now provide information that improves performance over the baseline system. To illustrate performance differences, some correct prediction/false prediction rates are given in Table 8.</Paragraph>
      <Paragraph position="7"> It is difficult to compare our performance figures with other reported results because of differences in corpora and speaking styles. However, the average single speaker correct detection and false detection rates reported here for our implementations of the Bachenko-Fitzpatrick and Wang-Hirschberg algorithms indicate the robustness of these algorithms to different types of data. Our results for the Bachenko-Fitzpatrick algorithm are somewhat higher than those that they report, .84/.09 vs. .78/.08. Using only the features inferable from text, Wang and Hirschberg use classification trees to predict prosodic boundaries in spontaneous speech, achieving phrase break prediction results of .66/.02. (Again, note that these results are not directly comparable because of the differences in false detection rates, and results for other trees in Wang and Hirschberg (1992) suggest that these two algorithms have similar performance.) Our classification trees used somewhat different features, though also based on POS and syntactic information, and achieved results on radio news speech that are surprisingly lower, i.e., .59/.02 for the tree that used syntax but did not use commas. Of course, the POS and syntactic information used here may not have been as detailed and/or reliable as that used by Wang and Hirschberg. (The comparisons here are based on the average error rates for single version comparisons.)</Paragraph>
    </Section>
  </Section>
</Paper>