PREDICTING INTONATIONAL PHRASING FROM TEXT 
Michelle Q. Wang 
Churchill College 
Cambridge University 
Cambridge UK 
Julia Hirschberg 
AT&T Bell Laboratories 
600 Mountain Avenue 
Murray Hill, NJ 07974 
Abstract 
Determining the relationship between the intona- 
tional characteristics of an utterance and other 
features inferable from its text is important both 
for speech recognition and for speech synthesis. 
This work investigates the use of text analysis 
in predicting the location of intonational phrase 
boundaries in natural speech, through analyzing 
298 utterances from the DARPA Air Travel In- 
formation Service database. For statistical model- 
ing, we employ Classification and Regression Tree 
(CART) techniques. We achieve success rates of 
just over 90%, representing a major improvement 
over other attempts at boundary prediction from 
unrestricted text. 1 
Introduction 
The relationship between the intonational phras- 
ing of an utterance and other features which can 
be inferred from its transcription represents an 
important source of information for speech syn- 
thesis and speech recognition. In synthesis, more 
natural intonational phrasing can be assigned if 
text analysis can predict human phrasing perfor- 
mance. In recognition, better calculation of prob- 
able word durations is possible if the phrase-final- 
lengthening that precedes boundary sites can be 
predicted. Furthermore, the association of intona- 
tional features with syntactic and acoustic infor- 
mation can also be used to reduce the number of 
sentence hypotheses under consideration. 
Previous research on the location of intonational 
boundaries has largely focussed on the relation- 
ship between these prosodic boundaries and syn- 
tactic constituent boundaries. While current re- 
search acknowledges the role that semantic and 
discourse-level information play in boundary as- 
I We thank Michael Riley for helpful discussions. Code 
implementing the CART techniques employed here was 
written by Michael Riley and Daryi Pregibon. Part-of- 
speech tagging employed Ken Church's tagger, and syn- 
tactic analysis used Don Hindle's parser, Fiddltch. 
signment, most authors assume that syntactic con- 
figuration provides the basis for prosodic 'defaults' 
that may be overridden by semantic or discourse 
considerations. While most interest in boundary 
prediction has been focussed on synthesis (Gee 
and Grosjean, 1983; Bachenko and Fitzpatrick, 
1990), currently there is considerable interest in 
predicting boundaries to aid recognition (Osten- 
doff et al., 1990; Steedman, 1990). The most 
successful empirical studies in boundary location 
have investigated how phrasing can disambiguate 
potentially syntactically ambiguous utterances in 
read speech (Lehiste, 1973; Ostendorf et al., 1990). 
Analysis based on corpora of natural speech (Ab 
tenberg, 1987) have so far reported very limited 
success and have assumed the availability of syn- 
tactic, semantic, and discourse-level information 
well beyond the capabilities of current NL systems 
to provide. 
To address the question of how boundaries are 
assigned in natural speech -- as well as the need 
for classifying boundaries from information that 
can be extracted automatically from text -- we 
examined a multi-speaker corpus of spontaneous 
elicited speech. We wanted to compare perfor- 
mance in the prediction of intonational bound- 
aries from information available through simple 
techniques of text analysis, to performance us- 
ing information currently available only come from 
hand labeling of transcriptions. To this end, 
we selected potential boundary predictors based 
upon hypotheses derived from our own observa- 
tions and from previous theoretical and practi- 
cal studies of boundary location. Our corpus for 
this investigation is 298 sentences from approxi- 
mately 770 sentences of the Texas Instruments- 
collected portion of the DARPA Air Travel In- 
formation Service (ATIS) database(DAR, 1990). 
For statistical modeling, we employ classification 
and regression tree techniques (CART) (Brieman 
et al., 1984), which provide cross-validated de- 
cision trees for boundary classification. We ob- 
tain (cross-validated) success rates of 90% for both 
automatically-generated information and hand- 
285 
labeled data on this sample, which represents 
a major improvement over previous attempts to 
predict intonational boundaries for spontaneous 
speech and equals or betters previous (hand- 
crafted) algorithms tested for read speech. 
Intonational Phrasing 
Intuitively, intonational phrasing divides an ut- 
terance into meaningful 'chunks' of information 
(Bolinger, 1989). Variation in phrasing can change 
the meaning hearers assign to tokens of a given 
sentence. For example, interpretation of a sen- 
tence like 'Bill doesn't drink because he's unhappy.' 
will change, depending upon whether it is uttered 
as one phrase or two. Uttered as a single phrase, 
this sentence is commonly interpreted as convey- 
ing that Bill does indeed drink -- but the cause 
of his drinking is not his unhappiness. Uttered as 
two phrases, it is more likely to convey that Bill 
does sot drink -- and the reason for his abstinence 
is his unhappiness. 
To characterize this phenomenon phonologi- 
cally, we adopt Pierrehumbert's theory of into- 
national description for English (Pierrehumbert, 
1980). In this view, two levels of phrasing are sig- 
nificant in English intonational structure. Both 
types are composed of sequences of high and low 
tones in the FUNDAMENTAL FREQUENCY (f0) con- 
tour. An INTERMEDIATE (or minor) PHRASE con- 
slats of one or more PITCH ACCENTS (local f0 min- 
ima or maxima) plus a PHRASE ACCENT (a simple 
high or low tone which controls the pitch from 
the last pitch accent of one intermediate phrase 
to the beginning of the next intermediate phrase 
or the end of the utterance). INTONATIONAL (or 
major) PHRASES consist of one or more intermedi- 
ate phrases plus a final BOUNDARY TONE, which 
may also be high or low, and which occurs at the 
end of the phrase. Thus, an intonational phrase 
boundary necessarily coincides with an intermedi- 
ate phrase boundary, but not vice versa. 
While phrase boundaries are perceptual cate- 
gories, they are generally associated with certain 
physical characteristics of the speech signal. In 
addition to the tonal features described above, 
phrases may be identified by one of more of the 
following features: pauses (which may be filled 
or not), changes in amplitude, and lengthening 
of the final syllable in the phrase (sometimes ac- 
companied by glottalization of that syllable and 
perhaps preceding syllables). In general, ma- 
jor phrase boundaries tend to be associated with 
longer pauses, greater tonal changes, and more fi- 
nal lengthening than minor boundaries. 
The Experiments 
The Corpus and Features Used in 
Analysis 
The corpus used in this analysis consists of 298 
utterances (24 minutes of speech from 26 speak- 
ers) from the speech data collected by Texas In- 
struments for the DARPA Air Travel Information 
System (ATIS) spoken language system evaluation 
task. In a Wizard-of-Oz simulation, subjects were 
asked to make travel plans for an assigned task, 
providing spoken input and receiving teletype out- 
put. The quality of the ATIS corpus is extremely 
diverse. Speaker performance ranges from close to 
isolated-word speech to exceptional fluency. Many 
utterances contain hesitations and other disfluen- 
cies, as well as long pauses (greater than 3 sec. in 
some cases). 
To prepare this data for analysis, we labeled the 
speech prosodically by hand, noting location and 
type of intonational boundaries and presence or 
absence of pitch accents. Labeling was done from 
both the waveform and pitchtracks of each utter- 
ance. Each label file was checked by several la- 
belers. Two levels of boundary were labeled; in 
the analysis presented below, however, these are 
collapsed to a single category. 
We define our data points to consist of all po- 
tential boundary locations in an utterance, de- 
fined as each pair of adjacent words in the ut- 
terance < wi, wj >, where wi represents the 
word to the left of the potential boundary site 
and wj represents the word to the right. 2 Given 
the variability in performance we observed among 
speakers, an obvious variable to include in our 
analysis is speaker identity. While for applica- 
tions to speaker-independent recognition this vari- 
able would be uninstantiable, we nonetheless need 
to determine how important speaker idiosyncracy 
may be in boundary location. We found no signif- 
icant increase in predictive power when this vari- 
able is used. Thus, results presented below are 
speaker-independent. 
One easily obtainable class of variable involves 
temporal information. Temporal variables include 
utterance and phrase duration, and distance of the 
2See the appendix for a partial list of variables em- 
ployed, which provides a key to the node labels for the 
prediction trees presented in Figures 1 and 2. 
286 
potential boundary from various strategic points 
in the utterance. Although it is tempting to as- 
sume that phrase boundaries represent a purely 
intonational phenomenon, it is possible that pro- 
cessing constraints help govern their occurrence. 
That is, longer utterances may tend to include 
more boundaries. Accordingly, we measure the 
length of each utterance both in seconds and in 
words. The distance of the boundary site from 
the beginning and end of the utterance is another 
variable which appears likely to be correlated with 
boundary location. The tendency to end a phrase 
may also be affected by the position of the poten- 
tial boundary site in the utterance. For example, 
it seems likely that positions very close to the be- 
ginning or end of an utterance might be unlikely 
positions for intonational boundaries. We measure 
this variable too, both in seconds and in words. 
The importance of phrase length has also been 
proposed (Gee and Grosjean, 1983; Bachenko and 
Fitzpatrick, 1990) as a determiner of boundary lo- 
cation. Simply put, it seems may be that consecu- 
tive phrases have roughly equal length. To capture 
this, we calculate the elapsed distance from the 
last boundary to the potential boundary site, di- 
vided by the length of the last phrase encountered, 
both in time and words. To obtain this informa- 
tion automatically would require us to factor prior 
boundary predictions into subsequent predictions. 
While this would be feasible, it is not straightfor- 
ward in our current classification strategy. So, to 
test the utility of this information, we have used 
observed boundary locations in our current anal- 
ysis. 
As noted above, syntactic constituency infor- 
mation is generally considered a good predictor 
of phrasing information (Gee and Grosjean, 1983; 
Selkirk, 1984; Marcus and Hindle, 1985; Steed- 
man, 1990). Intuitively, we want to test the notion 
that some constituents may be more or less likely 
than others to be internally separated by intona- 
tional boundaries, and that some syntactic con- 
stituent boundaries may be more or less likely to 
coincide with intonational boundaries. To test the 
former, we examine the class of the lowest node in 
the parse tree to dominate both wi and wj, using 
Hindle's parser, Fidditch (1989) To test the latter 
we determine the class of the highest node in the 
parse tree to dominate wi, but not wj, and the 
class of the highest node in the tree to dominate 
wj but not wi. Word class has also been used 
often to predict boundary location, particularly 
in text-to-speech. The belief that phrase bound- 
aries rarely occur after function words forms the 
basis for most algorithms used to assign intona- 
tional phrasing for text-to-speech. Furthermore, 
we might expect that some words, such as preposi- 
tions and determiners, for example, do not consti- 
tute the typical end to an intonational phrase. We 
test these possibilities by examining part-of-speech 
in a window of four words surrounding each poten- 
tial phrase break, using Church's part-of-speech 
tagger (1988). 
Recall that each intermediate phrase is com- 
posed of one or more pitch accents plus a phrase 
accent, and each intonational phrase is composed 
of one or more intermediate phrases plus a bound- 
ary tone. Informal observation suggests that 
phrase boundaries are more likely to occur in some 
accent contexts than in others. For example, 
phrase boundaries between words that are deac- 
cented seem to occur much less frequently than 
boundaries between two accented words. To test 
this, we look at the pitch accent values of wi and 
wj for each < wi, wj >, comparing observed values 
with predicted pitch accent information obtained 
from (Hirschberg, 1990). 
In the analyses described below, we employ 
varying combinations of these variables to pre- 
dict intonational boundaries. We use classification 
and regression tree techniques to generate decision 
trees automatically from variable values provided. 
Classification and Regression Tree 
Techniques 
Classification and regression tree (CART) analy- 
sis (Brieman et al., 1984) generates decision trees 
from sets of continuous and discrete variables by 
using set of splitting rules, stopping rules, and 
prediction rules. These rules affect the internal 
nodes, subtree height, and terminal nodes, re- 
spectively. At each internal node, CART deter- 
mines which factor should govern the forking of 
two paths from that node. Furthermore, CART 
must decide which values of the factor to associate 
with each path. Ideally, the splitting rules should 
choose the factor and value split which minimizes 
the prediction error rate. The splitting rules in 
the implementation employed for this study (Ri- 
ley, 1989) approximate optimality by choosing at 
each node the split which minimizes the prediction 
error rate on the training data. In this implemen- 
tation, all these decisions are binary, based upon 
consideration of each possible binary partition of 
values of categorical variables and consideration of 
different cut-points for values of continuous vari- 
ables. 
287 
Stopping rules terminate the splitting process 
at each internal node. To determine the best 
tree, this implementation uses two sets of stopping 
rules. The first set is extremely conservative, re- 
sulting in an overly large tree, which usually lacks 
the generality necessary to account for data out- 
side of the training set. To compensate, the second 
rule set forms a sequence of subtrees. Each tree 
is grown on a sizable fraction of the training data 
and tested on the remaining portion. This step is 
repeated until the tree has been grown and tested 
on all of the data. The stopping rules thus have ac- 
cess to cross-validated error rates for each subtree. 
The subtree with the lowest rates then defines the 
stopping points for each path in the full tree. Trees 
described below all represent cross-validated data. 
The prediction rules work in a straightforward 
manner to add the necessary labels to the termi- 
nal nodes. For continuous variables, the rules cal- 
culate the mean of the data points classified to- 
gether at that node. For categorical variables, the 
rules choose the class that occurs most frequently 
among the data points. The success of these rules 
can be measured through estimates of deviation. 
In this implementation, the deviation for continu- 
ous variables is the sum of the squared error for the 
observations. The deviation for categorical vari- 
ables is simply the number of misclassified obser- 
vations. 
Results 
In analyzing boundary locations in our data, we 
have two goals in mind. First, we want to dis- 
cover the extent to which boundaries can be pre- 
dicted, given information which can be gener- 
ated automatically from the text of an utter- 
ance. Second, we want to learn how much predic- 
tive power can be gained by including additional 
sources of information which, at least currently, 
cannot be generated automatically from text. In 
discussing our results below, we compare predic- 
tions based upon automatically inferable informa- 
tion with those based upon hand-labeled data. 
We employ four different sets of variables dur- 
ing the analysis. The first set includes observed 
phonological information about pitch accent and 
prior boundary location, as well as automati- 
cally obtainable information. The success rate of 
boundary prediction from the variable set is ex- 
tremely high, with correct cross-validated classi- 
fication of 3330 out of 3677 potential boundary 
sites -- an overall success rate of 90% (Figure 1). 
Furthermore, there are only five decision points in 
the tree. Thus, the tree represents a clean, sim- 
ple model of phrase boundary prediction, assum- 
ing accurate phonological information. 
Turning to the tree itself, we that the ratio of 
current phrase length to prior phrase length is very 
important in boundary location. This variable 
alone (assuming that the boundary site occurs be- 
fore the end of the utterance) permits correct clas- 
sification of 2403 out of 2556 potential boundary 
sites. Occurrence of a phrase boundary thus ap- 
pears extremely unlikely in cases where its pres- 
ence would result in a phrase less than half the 
length of the preceding phrase. The first and last 
decision points in the tree are the most trivial. 
The first split indicates that utterances virtually 
always end with a boundary -- rather unsurpris- 
ing news. The last split shows the importance of 
distance from the beginning of the utterance in 
boundary location; boundaries are more likely to 
occur when more than 2 ½ seconds have elapsed 
from the start of the utterance. 3 The third node in 
the tree indicates that noun phrases form a tightly 
bound intonational unit. The fourth split in 1 
shows the role of accent context in determining 
phrase boundary location. If wi is not accented, 
then it is unlikely that a phrase boundary will oc- 
cur after it. 
The significance of accenting in the phrase 
boundary classification tree leads to the question 
of whether or not predicted accents will have a 
similar impact on the paths of the tree. In the sec- 
ond analysis, we substituted predicted accent val- 
ues for observed values. Interestingly, the success 
rate of the classification remained approximately 
the same, at 90%. However, the number of splits 
in the resultant tree increased to nine and failed to 
include the accenting of wl as a factor in the clas- 
sification. A closer look at the accent predictions 
themselves reveals that the majority of misclas- 
sifications come from function words preceding a 
boundary. Although the accent prediction algo- 
rithm predicted that these words would be deac- 
cented, they were in fact accented. This appears 
to be an idiosyncracy of the corpus; such words 
generally occurred before relatively long pauses. 
Nevertheless, classification succeeds well in the ab- 
sence of accent information, perhaps suggesting 
that accent values may themselves be highly cor- 
related with other variables. For example, both 
pitch accent and boundary location appear sen- 
sitive to location of prior intonational boundaries 
and part-of-speech. 
3This fact may be idiosyncratic to our data, given the 
fact that we observed a trend towards initial hesitations. 
288 
In the third analysis, we eliminate the dynamic 
boundary percentage measure. The result remains 
nearly as good as before, with a success rate of 
89%. The proposed decision tree confirms the use- 
fulness of observed accent status of wi in bound- 
ary prediction. By itself (again assuming that the 
potential boundary site occurs before the end of 
the utterance), this factor accounts for 1590 out of 
1638 potential boundary site classifications. This 
analysis also confirms the strength of the intona- 
tional ties among the components of noun phrases. 
In this tree, 536 out of 606 potential boundary 
sites receive final classification from this feature. 
We conclude our analysis by producing a clas- 
sification tree that uses automatically-inferrable 
information alone. For this analysis we use pre- 
dicted accent values instead of observed values and 
omit boundary distance percentage measures. Us- 
ing binary-valued accented predictions (i.e., are 
< wl, wj > accented or not), we obtain a suc- 
cess rate for boundary prediction of 89%, and 
using a four-valued distinction for predicted ac- 
cented (cliticized, deaccented, accented, 'NA') we 
increased this to 90%. The tree in Figure 2) 
presents the latter analysis. 
Figure 2 contains more nodes than the trees 
discussed above; more variables are used to ob- 
tain a similar classification percentage. Note that 
accent predictions are used trivially, to indicate 
sentence-final boundaries (ra='NA'). In figure 1, 
this function was performed by distance of poten- 
tial boundary site from end of utterance (at). The 
second split in the new tree does rely upon tem- 
poral distance -- this time, distance of boundary 
site from the beginning of the utterance. Together 
these measurements correctly predict nearly forty 
percent of the data (38.2%). Th classifier next 
uses a variable which has not appeared in earlier 
classifications -- the part-of-speech of wj. In 2, 
in the majority of cases (88%) where wj is a func- 
tion word other than 'to,' 'in,' or a conjunction 
(true for about half of potential boundary sites), a 
boundary does not occur. Part-of-speech ofwi and 
type of constituent dominating wi but not wj are 
further used to classify these items. This portion 
of the classification is reminiscent of the notion of 
'function word group' used commonly in assigning 
prosody in text-to-speech, in which phrases are de- 
fined, roughly, from one function word to the next. 
Overall rate of the utterance and type of utterance 
appear in the tree, in addition to part-of-speech 
and constituency information, and distance of po- 
tential boundary site from beginning and end of 
utterance. In general, results of this first stage of 
analysis suggest -- encouragingly -- that there is 
considerable redundancy in the features predict- 
ing boundary location: when some features are 
unavailable, others can be used with similar rates 
of 8UCCe88. 
Discussion 
The application of CART techniques to the prob- 
lem of predicting and detecting phrasing bound- 
aries not only provides a classification procedure 
for predicting intonational boundaries from text, 
but it increases our understanding of the impor- 
tance of several among the numerous variables 
which might plausibly be related to boundary lo- 
cation. In future, we plan to extend the set of 
variables for analysis to include counts of stressed 
syllables, automatic NP-detection (Church, 1988), 
MUTUAL INFORMATION, GENERALIZED MUTUAL 
INFORMATION scores can serve as indicators of 
intonational phrase boundaries (Magerman and 
Marcus, 1990). 
We will also examine possible interactions 
among the statistically important variables which 
have emerged from our initial study. CART tech- 
niques have worked extremely well at classifying 
phrase boundaries and indicating which of a set of 
potential variables appear most important. How- 
ever, CART's step-wise treatment of variables, Ol>- 
timization heuristics, and dependency on binary 
splits obscure the possible relationships that ex- 
ist among the various factors. Now that we have 
discovered a set of variables which do well at pre- 
dicting intonational boundary location, we need to 
understand just how these variables interact. 

References 
Bengt Altenberg. 1987. Prosodic Patterns in Spo- 
ken English: Studies in the Correlation between 
Prosody and Grammar for Tezt-to-Speech Con- 
version, volume 76 of Land Studies in English. 
Lund University Press, Lund. 
J. Bachenko and E. Fitzpatrick. 1990. A compu- 
tational grammar of discourse-neutral prosodic 
phrasing in English. Computational Linguistics. 
To appear. 
Dwight Bolinger. 1989. Intonation and Its Uses: 
Melody in Grammar and Discourse. Edward 
Arnold, London. 
Leo Brieman, Jerome H. Friedman, Richard A. Ol- 
shen, and Charles J. Stone• 1984. Classification 
and Regression Trees. Wadsworth & Brooks, 
Monterrey CA. 
K. W. Church. 1988. A stochastic parts pro- 
gram and noun phrase parser for unrestricted 
text. In Proceedings of the Second Conference 
on Applied Natural Language Processing, pages 
136-143, Austin. Association for Computational 
Linguistics. 
DARPA. 1990. Proceedings of the DARPA Speech 
and Natural Language Workshop, Hidden Valley 
PA, June. 
J. P. Gee and F. Grosjean. 1983. Performance 
structures: A psycholinguistic and linguistic ap- 
praisal. Cognitive Psychology, 15:411-458. 
D. M. Hindle. 1989. Acquiring disambiguation 
rules from text. In Proceedings of the 27th An- 
nual Meeting, pages 118-125, Vancouver. Asso- 
ciation for Computational Linguistics. 
Julia Hirschberg. 1990. Assigning pitch accent 
in synthetic speech: The given/new distinc- 
tion and deaccentability. In Proceedings of the 
Seventh National Conference, pages 952-957, 
Boston. American Association for Artificial In- 
telligence. 
I. Lehiste. 1973. Phonetic disambiguation of syn- 
tactic ambiguity. Giossa, 7:197-222. 
David M. Magerman and Mitchel P. Marcus. 
1990. Parsing a natural language using mu- 
tual information statistics. In Proceedings of 
AAAI-90, pages 984-989. American Association 
for Artifical Intelligence. 
Mitchell P. Marc'us and Donald Hindle. 1985. A 
• computational account of extra categorial ele- 
ments in japanese. In Papers presented at the 
First SDF Workshop in Japanese Syntaz. Sys- 
tem Development Foundation. 
M. Ostendorf, P. Price, J. Bear, and C. W. Wight- 
man. 1990. The use of relative duration in 
syntactic disambiguation. In Proceedings of the 
DARPA Speech and Natural Language Work- 
shop. Morgan Kanfmann, June. 
Janet B. Pierrehumbert. 1980. The Phonology 
and Phonetics of English Intonation. Ph.D. 
thesis, Massachusetts Institute of Technology, 
September. 
Michael D. Riley. 1989. Some applications of tree- 
based modelling to speech and language. In 
Proceedings. DARPA Speech and Natural Lan- 
guage Workshop, October. 
E. Selkirk. 1984. Phonology and Syntaz. MIT 
Press, Cambridge MA. 
M. Steedman. 1990. Structure and intonation in 
spoken language understanding. In Proceedings 
of the ~Sth Annual Meeting of the Association 
for Computational Linguistics. 
