THE PARSODY SYSTEM : AUTOMATIC PREDICTION OF PROSODIC BOUNDARIES 
FOR TEXT-TO-SPEECH 
Stephen Minnis 
Natural Language Group, Systems Research Division, BT Laboratories, Martlesham Heath, Ipswich IP5 7RE, UK 
tel +44-473-645384, e-mail : sminnis@bt-sys.bt.co.uk@unet 
As commercial text-to-speech systems move above the 
word level to the sentence level, the prediction of the 
correct prosodic information becomes a significant 
factor in the perceived naturalness of the synthesised 
speech. 
This article reports on "Parsody", an experimental 
system which combines a partial parser with a prosodic 
marking component to predict the location and relative 
strength of prosodic boundaries in text. 
INTRODUCTION 
Modern text-to-speech (TTS) systems are quite good at 
word level synthesis, but tend to perform badly on 
connected word sequences. It has been suggested that 
the poor prosody of synthetic connected speech is the 
primary factor leading to difficulties in comprehension 
\[1,5\]. TTS systems must therefore incorporate better 
mechanisms for prosodic processing. For the purpose 
of this article, prosodic processing is narrowly 
interpreted as being the prediction of the location and of 
the relative strengths (salience) of prosodic boundaries 
(although, of course, there are several other important 
aspects to prosody). A prosodic boundary is a point in a 
spoken utterance associated with important acoustic 
prosodic phenomena, such as pauses and pitch change. 
There are two main approaches to the prosodic marking 
problem : the rule-based approach and the stochastic- 
based approach. 
The rule-based approach stems from Gee and Grosjean's 
work on performance structures 1 \[7\], which has been 
the focus of many extensions, such as that reported by 
Bachenko and Fitzpatrick \[2,4\]. Gee and Grosjean's 
work sought to account for the (then) disparity between 
linguistic phrase-structure theories and actual 
performance structures produced by humans, and 
focused on recreating the pause data of several analysed 
sentences from syntax (although they claim that their 
method could easily account for other prosodic 
features). The central tenet of their work was that 
prosodic phrasing is a compromise between the need to 
respect the linguistic structure of the sentence and the 
performance aspect (which manifests itself as a need to 
balance the length of the constituents in the output). 
More recent efforts have extended the Gee and Grosjean 
approach in various ways. Bachenko and Fitzpatrick 
take a similar rule-based approach but believe that 
syntax plays a lesser role in determining phrasing, and 
that certain prosodic performance constraints, such as 
length, override syntactic structure. They allow prosodic 
boundaries to cross syntactic boundaries (under certain 
conditions), whereas Gee and Grosjean viewed their 
1 Performance structures are "structures based on experimental 
data, such as pausing and parsing values" \[7\]. 
rules as acting within basic sentence clauses. Bachenko 
and Fitzpatrick made several other changes to the basic 
Gee and Grosjean algorithm, including counting 
phonological words 2 rather than actual words when 
determining node strengths. Wightman et al. have 
proposed some further interesting extensions to the 
Bachenko and Fitzpatrick method \[12\]. 
With the availability of large and accurately labelled 
prosodically annotated corpora, the stochastic-based 
approach will come more to the fore. Wang and 
Hirschberg \[11\], and Ostendorf et al. \[9\], have both 
described methods for automatically predicting prosodic 
information using decision tree models. Generally, 
decision trees are derived by associating a probability 
with each potential boundary site in the text, and 
relating various features with each boundary site (e.g. 
utterance and phrase duration, length of utterance (in 
syllables/words), positions relative to the start or end of 
the nearest boundary location etc.) \[11\]. The resulting 
decision tree provides, in effect, an algorithm for 
predicting prosodic boundaries on new input texts. 
It is interesting to note that Ostendorf et al. report 
similar results in their evaluations of the performance of 
both the rule-based and decision tree algorithms. 
THE PARSODY SYSTEM 
Our approach has been to implement a rule-based 
method on top of a chart parser. This was a purely 
practical decision, as an efficient chart parser had been 
developed in-house. Also the larger and more detailed 
descriptions of the rule-based methods in the literature 
provided an adequate starting point on which to build. 
The resulting system, the Parsody (from Parser + 
Prosody) system is designed to provide a test-bed for 
investigating the interface between syntactic parse 
structures and the performance structures of actual 
speech. Results from the Parsody system are directly 
fed into BT's TTS system, Laureate. 
The fact that Laureate will be a commercial TTS system 
places several requirements on the parser: it must be 
robust, it must be fast, and it must predict prosodic 
boundaries with a reasonable degree of accuracy. At 
present the emphasis is on the prediction of the location 
of prosodic boundaries rather than on the strength of the 
boundaries. 
The Parsody system is implemented in C, under X- 
windows on a Sun Sparc station, and allows for 
interactive editing of intermediate results throughout the 
parsing/prosodic marking process. This provides us 
with a useful tool for investigating our algorithms, as 
well as a debugging aid. 
2 A phonological word is one which effectively functions as one spoken item, as the internal word-word boundaries are 
resistant to pausing \[7\]. Typical examples are determiner-noun 
word groups, such as "the+man". 
A description of the main aspects of the parser and the 
prosodic marking components is now given. A typical 
parse tree is shown in Figure 1; the partition nodes 
introduced in the parser description are denoted by the 
two "s" labels in this tree. 
THE PARSER COMPONENT 
It is interesting to note that one of the sentences in the 
Bachenko and Fitzpatrick appendix of sentences was 
not parsed because of "too many parse problems". 
Obviously this is not acceptable for a commercial text- 
to-speech system. The Parsody parser is designed 
always to produce one result through a combination of 
stochastic word tagging and partial parsing with a 
minimal grammar. All processing is performed on a 
chart data structure back-bone incorporating packing. 
This overall approach results in a very fast and efficient 
parser. 
A word's part-of-speech is important for TTS as it may 
affect the word's pronunciation. Stochastic word tagging 
enables the parser always to choose one word tag, 
although this may or may not be the correct one (the 
current accuracy is approximately 95% correct - this 
figure being given on the Bachenko and Fitzpatrick 
sentences and on other test sentences). Fortunately for 
pronunciation purposes, the number of words having 
multiple pronunciations is quite small - between 1 and 
2% of words in our lexicon. Initial investigations have 
shown that there is less than a 0.3% chance of picking 
the wrong pronunciation for a word. 
Another important aspect of the Parsody word-tagging 
approach is that ill-formed input can be accommodated, 
and the prosodic marking component can still function 
to produce a result. Some speech is better than none, 
even if it sounds strange. 
The minimal grammar also helps the parser to produce 
only one, and always one, output. The grammar is a 
simple LNP/PP grammar augmented by special 
'partition' rules. An LNP is simply a 'longest noun 
phrase' which is an unambiguous interpretation of the 
longest NP in the parse result. A PP is a prepositional 
phrase. An example of a partition node is one which is 
inserted between two immediately adjacent 'longest 
NPs' in the parse structure 3. 
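The partition idea can be illustrated with a minimal sketch. It assumes the parser output is available as a flat list of (label, text) chunks, and that a new partition is opened whenever one LNP immediately follows another; the function name `split_partitions`, the chunk labels and the example chunks are hypothetical and are not taken from the Parsody implementation.

```python
def split_partitions(chunks):
    """Group a flat chunk sequence into partitions.

    chunks: list of (label, text) pairs, e.g. ("LNP", "her friend").
    Assumption: a partition boundary falls between two immediately
    adjacent LNP chunks, as in the example given in the text.
    """
    if not chunks:
        return []
    partitions = [[chunks[0]]]
    for previous, current in zip(chunks, chunks[1:]):
        if previous[0] == "LNP" and current[0] == "LNP":
            # Two adjacent longest noun phrases: open a new partition.
            partitions.append([])
        partitions[-1].append(current)
    return partitions


# Hypothetical chunk sequence used only for illustration.
example = [
    ("LNP", "the man"),
    ("V", "saw"),
    ("LNP", "the dog"),
    ("LNP", "his friend"),
    ("V", "laughed"),
]
parts = split_partitions(example)
```

In this sketch the example sequence is divided into two partitions, with the split falling between "the dog" and "his friend".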
Figure 1. Example Parse Tree produced by Parsody 
3 For this reason, each of the prosodic rules to be described 
works within the partition nodes. This is because the boundary 
between each partition node seems to mark the largest 
boundaries within the sentence, and the later in the analysis 
they are joined, the larger the prosodic boundary will be. It 
may well be that some analysis, perhaps verb adjacency, 
should take place across partition node boundaries. Further 
research will examine this. 
The partial syntactic tree is then passed to the prosodic 
marking system. 
THE PROSODY COMPONENT 
The prosodic marking algorithms are founded on the 
Bachenko and Fitzpatrick extensions to the Gee and 
Grosjean rules. 
There are essentially two main components in the 
Bachenko and Fitzpatrick model. The first, concerning 
boundary location, is basically adhered to in Parsody. 
Boundary location entails the grouping of words into 
phonological words, and then into phonological phrases. 
The boundaries separating prosodic phrases form 
potential prosodic boundary location sites. 
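As a rough illustration of the first grouping step, the sketch below attaches function words to the next content word, joining them with "+" in the notation used for phonological words in this article. The tag names, the set `FUNCTION_TAGS` and the function `phonological_words` are assumptions made for the example; the actual grouping rules used by Parsody, or by Bachenko and Fitzpatrick, are richer than this.

```python
# Hypothetical tag set: tags treated as function words in this sketch.
FUNCTION_TAGS = {"DET", "PREP", "PRON", "CONJ"}


def phonological_words(tagged_words):
    """Group (word, tag) pairs into phonological words.

    Assumption: every function word leans on the next content word,
    so "the" + "man" becomes the single item "the+man".
    """
    groups = []
    pending = []
    for word, tag in tagged_words:
        pending.append(word)
        if tag not in FUNCTION_TAGS:
            # A content word closes the current phonological word.
            groups.append("+".join(pending))
            pending = []
    if pending:
        # Trailing function words: attach to the last group if any.
        if groups:
            groups[-1] = groups[-1] + "+" + "+".join(pending)
        else:
            groups.append("+".join(pending))
    return groups


result = phonological_words(
    [("the", "DET"), ("man", "NOUN"), ("into", "PREP"), ("phonemes", "NOUN")]
)
```

Here `result` contains the two items "the+man" and "into+phonemes", each of which would count as one phonological word in the rules that follow.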
The second component seeks to determine the boundary 
strengths via a series of rules. Bachenko and Fitzpatrick 
describe a verb-balancing rule which attempts to 
balance material around a verb, and a verb adjacency 
rule which in effect extends the verb balancing rule, 
using 'bundling' (the adjoining of adjacent phrases) to 
continue to centre material round a verb. Here, Parsody 
employs two main departures from the Bachenko and 
Fitzpatrick rules. The first is in the domain of verb 
adjacency. Parsody's verb adjacency algorithm retains 
the notion of grouping nodes to form a balanced tree, 
but extends this rule to cover all nodes (with the 
exception of the very final PP). The basic algorithm is 
also different. 
By extending the grouping of nodes to cover all nodes, 
the confusion of Bachenko and Fitzpatrick's "general 
bundling rule" is avoided, since all nodes will have been 
grouped at completion. The change to the algorithm is 
more subtle, yielding the rule: 
if Count(X) + Count(Y) < Count(Z) 
then 
    Join to the Left(Y) 
else 
    Join to the Right(Y) 
where : 
    Count(a) = Number of Phonological Words beneath Node 'a' 
    X = Previous Node 
    Y = Current Node 
    Z = Next Node 
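The rule can be sketched in a few lines. The sketch assumes a tree is represented as nested lists whose leaves are phonological words written as strings; the names `count` and `join_direction` are illustrative only, and the sketch covers just the decision for a single node, not the full left-to-right application of the rule.

```python
def count(node):
    """Number of phonological words beneath a node.

    Assumption: a leaf is a string such as "the+method" and counts
    as one phonological word; an internal node is a list of children.
    """
    if isinstance(node, str):
        return 1
    return sum(count(child) for child in node)


def join_direction(previous_node, current_node, next_node):
    """Apply the rule: join left if Count(X) + Count(Y) < Count(Z)."""
    if count(previous_node) + count(current_node) < count(next_node):
        return "left"
    return "right"


# Hypothetical nodes: X and Y are single phonological words,
# Z already dominates three.
x = "the+method"
y = "by+which+one"
z = ["converts+a+word", ["into+phonemes", "very+quickly"]]
direction = join_direction(x, y, z)
```

With these hypothetical nodes the left-hand material is lighter (2 against 3), so the current node is joined to the left; if the next node held only one phonological word, the same call would return "right".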
This makes explicit the assumption in Bachenko and 
Fitzpatrick's algorithm that the adjoining of phrases 
produces a balanced tree. The above approach continues 
to balance the structure created so far, with the 
phonological phrases which have not yet been joined 
into the structure. By doing this, the boundary values 
(strengths/salience) remain dependent on the values of 
the constituent prosodic phrases. 
In the example shown in Figure 2, the left-to-right 
nature of application of this rule ensures that earlier 
material will generally be grouped lower in the structure 
than later material. This ties in with Gee and Grosjean's 
work on discourse semantics: the later in the sentence 
the information, the greater the prosodic offset. It is at 
this stage that PPs which precede verbs are added into 
the structure (assuming they haven't been already). The 
proviso is continued that PPs should always join to the 
left, rather than the right. The exception to this is the 
PP at the end of the sentence, if there is one, which 
remains untouched. 
Figure 2: Verb Adjacency example (terminal nodes: the+method, by+which+one, converts+a+word, into+phonemes ...) 
The second main change to the Bachenko and 
Fitzpatrick algorithm concerns boundary value 
assignment. Bachenko and Fitzpatrick choose to use the 
absolute boundary values as their reference. Parsody 
does not do this, since, according to the algorithm, the 
longer the sentence, the larger the values on each 
boundary 4 (varying from a maximum of 5 in small 
sentences, to 13 in the larger sample sentences). Does 
this mean that small sentences should have smaller 
boundaries, perhaps none? According to Gee and 
Grosjean \[7; footnote 10\], "It turns out (importantly) 
that the actual pause duration of the longest pause in 
each sentence does not correlate all that well (is not a 
factor of) the overall length of the sentence (for 
example, it is possible for a short, less complex sentence 
to have a longer main break than a longer, more 
complex sentence)". For this reason, in Parsody a 
normalisation algorithm is applied, so that sentences of 
varying lengths may have their boundaries mapped to 
reasonable values. 
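The normalisation algorithm itself is not specified in this article. The sketch below is therefore only one plausible reading: raw boundary values are computed as the number of phonological words plus one (as stated in the footnote on boundary values), and each sentence is then rescaled so that its largest boundary maps to a fixed ceiling. The function names, the `top` parameter and the choice of linear scaling are all assumptions.

```python
def boundary_value(num_phonological_words):
    """Raw boundary value: phonological words at the node, plus one."""
    return num_phonological_words + 1


def normalise(values, top=1.0):
    """Rescale one sentence's boundary values so the largest equals `top`.

    Assumption: simple linear scaling by the sentence maximum. This is
    a stand-in for the unspecified Parsody normalisation algorithm.
    """
    if not values:
        return []
    largest = max(values)
    return [top * value / largest for value in values]


# A short sentence (raw maximum 5) and a long one (raw maximum 13)
# end up with the same strength for their main break.
short_sentence = normalise([boundary_value(4), 3, 2])
long_sentence = normalise([boundary_value(12), 6, 4, 2])
```

Under this assumption the main break of a short sentence and that of a long sentence both receive the value `top`, which is in line with the Gee and Grosjean observation quoted above.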
EVALUATION 
Ultimately the success of a prosody component of a 
TTS system will be determined by perceptual tests on 
the naturalness, or the acceptability, of the synthesised 
speech. Such tests are subjective, as well as time 
consuming and costly to perform, so a more objective 
point of reference is required. 
4The values on a boundary node can be seen in Figure 2. 
Values are computed by taking the sum of the phonological 
words at each node and adding one. A phonological word is 
one joined by a "+" symbol. In Figure 2, each terminal node 
has only one phonological word. 
Our approach compares the prosodic boundaries 
assigned by Parsody with data provided in Bachenko 
and Fitzpatrick's paper \[2\]. This data comprises a set of 
35 sentences with the prosodic boundaries marked by 
hand 5. Of these sentences, 14 are the 'original' Gee and 
Grosjean sentences re-analysed by Bachenko and 
Fitzpatrick. Our evaluation was concerned only with the 
primary and secondary boundaries assigned by 
Bachenko and Fitzpatrick. 
Some points should be noted about these sentences. The 
Bachenko and Fitzpatrick sentences (excluding the 14 
Gee and Grosjean sentences) have a fairly simple 
sentence structure, and should therefore be handled well 
by the system (Parsody and Bachenko and Fitzpatrick's 
system). In our opinion they do not constitute a rigorous 
test for the prosodic component of a TTS system, but 
they are useful for evaluation nevertheless. 
The Gee and Grosjean sentences, however, have a 
complex sentence structure, although this is similar for 
each sentence. Experience would suggest that this is 
not a realistic sample of sentences from which to work. 
Bachenko and Fitzpatrick have converted these 
sentences to their notation. This results in each sentence 
having only one primary boundary, and all but one 
sentence having one secondary boundary. 
Furthermore, the primary boundary nearly always 
appears at the mid-point of the sentence. These results 
seem intuitively simple for such complex sentences, so 
in this evaluation the data-set "Gee and Grosjean 
Re-analysed" is a test against the Gee and Grosjean data, 
with the boundaries marked according to the 
normalisation algorithm employed by Parsody. 
Table 1 : Bachenko and Fitzpatrick Sentences 
                          Parsody    Bachenko & Fitzpatrick 
Correct Primary           21         16 
Close Primary             8          9 
Missed Primary            2          6 
Extra Primary             -2         -13 
Primary Score             0.778      0.490 
Correct Secondary         13         11 
Close Secondary           3          10 
Missed Secondary          8          3 
Extra Secondary           4          5 
Secondary Score           0.583      0.494 
Overgeneration Factor     0.965      0.740 
Overall Score             0.681      0.492 
Total Sentences - 21 
Total Primary Boundaries - 31 
Total Secondary Boundaries - 24 
Total Tertiary Boundaries - 2 
5 It is important to note that even though the system-assigned 
boundaries may be different to the human ones, the system boundaries may actually be better according to a majority 
expert view. 
Table 2 : Gee and Grosjean Sentences 
                          Parsody    Bachenko & Fitzpatrick 
Correct Primary           10         10 
Close Primary             2          0 
Missed Primary            2          4 
Extra Primary             8          -1 
Primary Score             0.495      0.498 
Correct Secondary         7          5 
Close Secondary           5          10 
Missed Secondary          3          0 
Extra Secondary           9          -3 
Secondary Score           0.399      0.465 
Overgeneration Factor     0.630      0.698 
Overall Score             0.447      0.482 
Total Sentences - 14 
Total Primary Boundaries - 14 
Total Secondary Boundaries - 15 
Total Tertiary Boundaries - 31 
Table 3 : Gee and Grosjean Re-analysed Sentences 
                          Parsody    Bachenko & Fitzpatrick 
Correct Primary           10         10 
Close Primary             3          1 
Missed Primary            2          4 
Extra Primary             7          -2 
Primary Score             0.700      0.342 
Correct Secondary         6          7 
Close Secondary           9          17 
Missed Secondary          12         3 
Extra Secondary           -3         -15 
Secondary Score           0.355      0.280 
Overgeneration Factor     0.913      0.488 
Overall Score             0.528      0.311 
Total Sentences - 14 
Total Primary Boundaries - 15 
Total Secondary Boundaries - 27 
Total Tertiary Boundaries - 0 
Most of the measurements given in the three tables are 
clear from their description. A "correct" boundary is a 
perfect match with the human annotation. A "close" 
boundary is one where another boundary appears in its 
place (e.g. a secondary instead of a primary). "Extra 
boundary" refers to the number of boundaries produced 
by the system greater than the actual number of 
boundaries (a negative figure indicating that fewer 
boundaries of that type were produced). 
The "scores" presented basically provide a figure by 
which systems can be compared (with each other, or 
with human-annotated results). A score of 1 would 
indicate a perfect comparison of results. The figure 
includes both the successes and failures (including 
overgeneration) of the system. The overall score given 
is the mean of the primary and secondary scores. 
The scores are calculated according to the following 
formula. 
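$$\mathrm{OverGenerationFactor\ (OGF)} = \frac{\mathrm{TotalBoundaries}}{\mathrm{TotalSystemBoundaries}}$$
$$\mathrm{Score} = \frac{(2 \times \mathrm{CorrectBoundary}) + \mathrm{CloseBoundary}}{2 \times \mathrm{ActualBoundary}} \times \mathrm{OGF}$$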
where : 
TotalBoundaries = Number of boundaries in text 
TotalSystemBoundaries = Number of boundaries produced by system 
CorrectBoundary = Number of boundaries matched exactly 
CloseBoundary = Number of boundaries matched closely (i.e. a Primary marked by a Secondary, or vice versa) 
Boundary = Primary or Secondary boundary 
The CorrectBoundary result is multiplied by 2 as a 
weighting factor. Obviously it is better to have correct 
boundaries than close boundaries. Accordingly, the 
ActualBoundary score is also doubled to maintain the 
scale. 
Note that the smaller the overgeneration factor, the 
larger the amount of overgeneration (a score greater 
than 1 indicates undergeneration). 
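As a check on the formula, the following minimal sketch recomputes the Parsody column of Table 1. The function names are illustrative. It assumes that the boundary totals used for the overgeneration factor count primary and secondary boundaries only, and that the number of system boundaries of each type is the actual number plus the "Extra" figure; this reading is an assumption, although it does reproduce the published factor for Parsody.

```python
def overgeneration_factor(total_boundaries, total_system_boundaries):
    """OGF = boundaries in the text / boundaries produced by the system."""
    return total_boundaries / total_system_boundaries


def score(correct, close, actual, ogf):
    """Score = ((2 * correct) + close) / (2 * actual) * OGF."""
    return (2 * correct + close) / (2 * actual) * ogf


# Table 1, Parsody column.
actual_primary, actual_secondary = 31, 24
extra_primary, extra_secondary = -2, 4

# Assumed reading: system boundaries = actual + extra, per type.
system_total = (actual_primary + extra_primary) + (actual_secondary + extra_secondary)
ogf = overgeneration_factor(actual_primary + actual_secondary, system_total)

primary = score(correct=21, close=8, actual=actual_primary, ogf=ogf)
secondary = score(correct=13, close=3, actual=actual_secondary, ogf=ogf)
overall = (primary + secondary) / 2

print(round(ogf, 3), round(primary, 3), round(secondary, 3), round(overall, 3))
```

Rounded to three places, this gives 0.965, 0.778, 0.583 and 0.681, matching the Parsody figures in Table 1.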
The results reported show that the Parsody system 
compares favourably, under this analysis, with the 
Bachenko and Fitzpatrick system - for example in Table 
1 the overall score is 68% for Parsody, and 49% for 
Bachenko and Fitzpatrick's system. What is encouraging 
is the better performance on the prediction of primary 
boundaries. The automatic scoring program also 
presents the results in a useful way. To relate these 
results to Bachenko and Fitzpatrick's evaluation in \[12\], 
they quote a figure of 80%, given the assumption that 
primary and secondary boundaries are basically similar 
from a comprehensibility, and acceptability, viewpoint 
on the synthesised speech. This score is given by 
summing the Correct Primary and Close Primary 
scores and dividing by the total number of Primary 
boundaries. Parsody scores 93% in this case (calculated 
from Table 1). 
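For reference, the arithmetic behind this figure, using the Parsody values for Correct Primary, Close Primary and Total Primary Boundaries from Table 1, works out as follows (truncated to two figures in the text):

```latex
\frac{\mathrm{CorrectPrimary} + \mathrm{ClosePrimary}}{\mathrm{TotalPrimaryBoundaries}}
  = \frac{21 + 8}{31} = \frac{29}{31} \approx 0.935
```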
As regards our evaluation method proper, it is clear that 
the method requires improvement. Future methods 
should concentrate on punishing the incorrect placement 
of boundaries, especially those that affect the perception 
of the synthesised speech, a viewpoint that Bachenko 
and Fitzpatrick also seem to hold. 
This short article outlined the Parsody system, the 
essentials of which form a component of BT's Laureate 
Text-to-Speech system. Key features of the Parsody 
system include its ability to provide accurate parses 
robustly, which allows it to handle ill-formed input with 
ease. Parsody also provides a robust rule-based 
prosodic annotation facility, that has been developed 
from algorithms presented in the literature, but which 
have been extended for greater performance. 
Most of the problems with the Parsody system currently 
lie with the parser. Despite the high performance of the 
word tagger, the effect of wrongly tagging a word is 
large, since the prosody component uses this 
information to construct a prosody tree in a bottom-up 
fashion. To improve the tagging performance we are 
considering including word collocation statistics. Also, 
it would be desirable to increase the range of syntactic 
structures produced by the parser. To improve the 
parser performance we are looking at extending the 
minimal grammar, but in such a way that processing 
speed is maintained. Future versions of the parser may 
also include special disambiguation rules concentrating 
on words having multiple pronunciations. Topic and 
focus marking will also be introduced at some stage. 
We also hope to investigate the stochastic approach to 
prosodic marking. Future work will focus on 
assembling a suitable corpus. It is likely that the best 
prosodic marking procedure is one which is a hybrid of 
both the rule-based and stochastic-based approaches. As 
was mentioned earlier, the immediate goal with respect 
to prosodic marking has been the prediction of prosodic 
boundary location and of the tx)undary strengths. Future 
work will concentrate on the interpretation of boundary 
slrengths, for cxample by investigating the corrclafion 
of our normalised (hence gradable) boundaries with 
acoustic phenomena at Ihcse boundaries. 
Finally, it is important to remember the intended goal of 
text-to-speech systems is to synthesise unrestricted text 
input. Initial work has begun on extending the 
evaluation of the system to more 'normal' sentences. For 
example, work in BT's Natural Language Group 
includes automatic text summarisation; in tests on the 
summarisation of newspaper articles the length of 
sentences often exceeds 100 words. Our text-to-speech 
system must be able to handle such sentences efficiently 
both at the parsing and prosody stage. The lessons 
learnt from more difficult input such as this, may serve 
to increase our understanding of the relationship 
between syntax and prosody. 
Acknowledgements 
The author would like to express his gratitude to the 
director of BT for permission to publish this paper. 
Thanks also go to my colleagues in the Natural 
Language Group and the Text-to-Speech Group for their 
comments on this report; Keith Preston, Andy Breen, 
Mike Edgington, Sandra Williams, Anna Cordon and 
Peter Wyard. Special thanks go to Eddy Kaneen for his 
invaluable assistance in getting the Parsody system 
operable and for his contributions to many of the 
original ideas implemented in the prosodic marking 
component. 
\[1\] Allen, J. 
Synthesis of Speech from Unrestricted Text 
Proceedings of the IEEE, Vol 4, pp. 433-442, 1976 
\[2\] Bachenko, J., and Fitzpatrick, E. 
A Computational Grammar of Discourse-neutral 
prosodic phrasing in English 
Computational Linguistics, Volume 16, No. 3, pp. 155-170, 1990 
\[3\] Bachenko, J., and Fitzpatrick, E. 
Parsing for Prosody: What a Text-to-speech system 
needs from syntax 
Proceedings of IEEE Artificial Intelligence Systems in 
Government (AISIG), 1989 
\[4\] Bachenko, J., Fitzpatrick, E., and Wright, C.E. 
The contribution of parsing to prosodic phrasing in an 
experimental text-to-speech system 
Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics, 
pp. 145-153, 1986 
\[5\] Cahn, J. 
From Sad to Glad: Emotional Computer Voices 
Proceedings of Speech Tech '88, pp. 35-36, 1988 
\[6\] Church, K.W. 
A stochastic parts program and noun phrase parser for 
unrestricted text 
Proceedings of the Second Conference on Applied Natural Language Processing (ACL), 
pp. 136-143, Austin, 1988 
\[7\] Gee, J.P., and Grosjean, F. 
Performance structures: A psycholinguistic and 
Linguistic Appraisal 
Cognitive Psychology, 15, pp. 411-458, 1983 
\[8\] Grosjean, F., Grosjean, L., and Lane, H. 
The patterns of silence: Performance structures in 
sentence production 
Cognitive Psychology, 11, pp. 58-81, 1979 
\[9\] Ostendorf, M., Wightman, C.W., and Veilleux, N.M. 
Parse scoring with prosodic information: An analysis / 
Synthesis approach 
Computer Speech and Language, 7, pp. 193-210, 1993 
\[10\] Selkirk, E.O. 
Phonology and Syntax: The Relation between Sound and Structure 
MIT Press, 1984 
\[11\] Wang, M., and Hirschberg, J. 
Predicting Intonational Boundaries Automatically from 
Text : the ATIS Domain 
Proceedings of the DARPA Speech and Natural Language Workshop, 
Feb 1991, pp. 378-383 
\[12\] Wightman, C.W., Veilleux, N.M., and Ostendorf, M. 
Use of Prosody in syntactic disambiguation: An 
Analysis-by-synthesis approach 
Proceedings of the DARPA Speech and Natural 
Language Workshop, Feb 1991, pp. 384-389 
