Matching a tone-based and tune-based approach to English 
intonation for concept-to-speech generation 
Elke Teich 
Universitgt des Saarlandes, Saarbr{icken & University of Sydney 
Catherine I. Watson and Cdcile Pereira 
Macquarie University, Sydney 
Abstract 
Tlle paper describes the results of a compari- 
son of two annotation systems for isstoslal;ion, 
the tone-based ToBI al)proach and the 1;une- 
based api)roach proposed by Systemic Func- 
ti(mal Grammar (SFO). The goal of this compar- 
ison is to detine a mapping between the two sys- 
tems tbr the purpose of concept-to-speech gen- 
eration of English. Since ToB: is widely used 
in Sl)eech synthesis and SFG is widely used in 
nal;ural  generation and oft~rs a lin- 
guistically motivated aecollnt of intonation, it; 
appears a promising step to comt)ine the two 
approaches for concept-to-speech. A corpus of 
English utterances has been analysed with both 
~\].~()13I and SFG categories; eomparison of the 
analysis results has lead to the identification of 
some basic equivalents between the two systems 
on which a mapping can be based. 
1 Introduction 
The pallet describes the main results of a con> 
parison of /;he ToB: (Tone-and-Break-Indices) 
ai)proach (Pierrehumbert, 1.9801 Silverman el; 
al.., 19961 to annotating English speech data 
with information about intonation and one of 
the British School approaches (e.g., Brazil et al. 
(1980)), Systenfie Fmmtional Grammar (SFO; 
(Halliday, 19671 Halliday, 1970)). The goal of 
this comparison is the definition of a mapping 
between the two systems. 
This attempt has a two-fbld motiw~tion. 
First, it is motivated by computational appli- 
cation in concept-to-si)eech systems, in which 
text in spoken mode is automatically generated 
from an underlying abstract lneaning represen- 
tation, it is widely acknowledged that in order 
for spoken  technology to gain wider 
acceptance, it has to improve on the quality of 
output considerably. Itere, appropriate intona- 
tion is one of the major factors (ct'. Cole et 
al. (1995)). The concrete goal we are pursu- 
ing is to connect an oil-the-shelf speech syn- 
thesizer for English (FESTIVAL; (Black et al., 
1998)) with an automatic text generation sys- 
tem tbr English based on SFO (Matthiessen & 
Bateman, 19911. Since in the SFO approach, in- 
tonation is accounted for as part of grammar 
rather than as an independent component, it is 
straightforward to extend the grammatical re- 
sources of a systemically based text generation 
system with an account of intonation (cf Teich 
et al. (1.997) iml)lenmnting such all approach for 
German concet/t-to-speech generation). Con- 
necting such a system to a speech synthesizer 
requires mapping the OUtl)ut of the generator 
to the input requirements of the st)eech synth(> 
sizer. In the FESTIVAL systei11, the intonation of 
the text to be synthesized can be manipulated 
1)y ~mnotation with TOBI labels. Therefore, a 
mapl)ing betweeIl the SFC and the ToBI anno- 
tation systems is required. 
Second, there is a theoretical lnotivation. 
With a mapping between tile ToBI and the slpo 
systems for intonation almotation, it will be 
possible to link the 1)honetic analysis of speech 
data to an interpretation of intonational mean- 
ing as it is proposed by SFO. Existing speech 
corpora that are acoustically analysed and an- 
notated with ToBI tail then be used to test 
some of the assumptions brought forward by 
SFO about the natm:e of intonation. Also, with 
a mapping between ~oBI and SFG annotations, 
an exchange of annotated corpora between ToBI 
and SFO users would be possible. 
We report on the analysis of a sl)eech cor- 
pus compiled fl'om Halliday (1970) with ToBI 
and SFO labels (See. 3). The intonation analy- 
sis is based on an acoustic analysis of the speech 
data in terms of fundamental frequency (F0). 
829 
The data are represented in EMU (Cassidy &; 
Harrington, 1996), a database system for stor- 
ing speech data that provides for a nmltiple- 
tier analysis of acoustic (e.g., F0 contour and 
speech wavetbrm) and phonological (segmental 
and suprasegmental) features. We present the 
major differences and commonalities between 
ToBI and SFO (See. 2). On the basis of the 
corpus analysis, we identify matches between 
the tunes assmned by Halliday and unique se- 
quences of To\]~I tones (See. 4). We conclude 
with a smmnary and a sketch of future work. 
2 Intonation Annotation 
The majority of text-to-speech systems that al- 
low for the manipulation of an input string so 
as to control intonation employ the ToBI system 
(Silverman et al., 19961, which is based on the 
autosegmental-metrical approach originally set 
up by Pierrehumbert (19801 to describe Amer- 
ican English intonation. Versions of ToBI for 
other s have been developed, e.g., Grice 
et al. (19961 for German, and are also widely 
used in computational contexts. One major the- 
oretical difference between the ToBI approach 
and the British School approaches, such as the 
one advocated by SFG, is that in the latter there 
is a built-in focus on the relation between into- 
mttion and nmaning. In spG, intonation con- 
tours are distinguished according to their di, f 
fcrcntial meanings, i.e., they label pitch move- 
ments that are commonly interpreted by the 
speakers of (British) English as having quite 
different pragmatic purport (cf. Teich et al. 
(1997)). This is what snakes the SFO approach 
attractive in the context of concept-to-speech 
generation, in which it is crucial to be able 
to represent criteria for selecting an intonation 
contour appropriate in a given context. TOBI, 
on the other hand, is a phonetic-phonological 
annotation scheme tbr intonation. Since it is 
widely used, there exist nmnerous tools sup- 
porting analysis with a high degree of analyt- 
ical rigor. It seems theretbre doubly significant 
to combine the two approaches in an attempt to 
achieve high-quality synthesized speech output. 
While clearly some fimdamental theoretical 
ditferences exist between the ToBI and SFG ap- 
proaches, more technically there is a basic com- 
mortality. Any annotation scheme tbr intonation 
nmst establish three principal constructs for the 
representation of intonation: the units of into- 
nation, a set of categories that describe the pitch 
movement occurring in that unit, and a set of 
labels that mark the nuclear stress oi1 which the 
pitch movement is realised. 
In the remainder of this section we briefly de- 
scribe how these constructs are realised in ToBI 
(Sec. 2.1) and in SFG (See. 2.2) and sketch the 
mQor differences between them. 
2.1 ToBI 
There are two tiers to the ToBI analysis, the 
tonal analysis and the analysis of the strength 
of the word boundaries, which is referred to as 
the "break index". The Tom tones are either 
high (H) or low (L). The break index gives the 
strength of a word's association with the tbl- 
lowing word, where 0 is the strongest perceived 
conjoining and 4 is the most disjoint (Beckman 
gc Ayers, 19971. In our analysis (See. 3), we 
only consider the tonal part of TOBI. 
The Tom intonational phonology model 
aligns a tune with the words of an utterance 
(cf. Harrington 8c Cassidy (1999)), wherc some 
of these words are accented. The words of an 
utterance are grouped into phrases. There are 
two types of phrases, intonational and inter- 
mediate ph, mses. Utterances always consist of 
one or more intonational phrases which iu tm:n 
consist of one or lnore intermediate phrases. 
The break between two intonational 1)hrases is 
greater than 1)etween two intermediate t)hrases, 
the bl'eak index being 4 in the former case and 
3 or 2 in the latter. 
Words that have prominence in a phrase or 
utterance m:e accented (sentence level stress). 
Unlike lexical stress which is usually fixed, sen- 
tence level stress is variable. When a word 
carries sentence level stress, a pitch accent is 
associated with the syllable of primary stress. 
Pitch accents are denoted by *. The most com- 
mon pitch accent is an H*, which is usually 
realised as a pitch peak near tim vowel in the 
primary stressed syllable, it is also possible to 
have pitch accents which are a combination of a 
pitch movement towards and including a peak 
or trough. One sudl bitonal accent is L+H*, 
which moves from a low in pitch towards a high. 
Intermediate and intonational phrases carry 
edge tones. Intermediate phrases carry phrase 
tones, indicated by -. The phrase tone L- is low 
pitch following the final pitch accent of a phrase. 
830 
H* 
Margaret's looking for you 
L% 
L- 
c) 
eq 
C) C) 
C) 
22600 22800 23000 23200 23400 
time (ms) 
(a) 
23600 23800 
is there any more news of the French 
o H* L* 
elections 
H% 
H- 
24500 25000 25500 26000 
time (ms) 
(b) 
in the far 
H* 
25000 
H% 
L- 
corner of that field the 
!1t*  FI.T tUK, 
25500 26000 26500 
footpath 
L% 
L- 
goes over a stile 
H* 
27000 27500 28000 
time (ms) 
(c) 
Figure 1: Examples of' the pitch contours of three utterances in the corpus, mid the associated ToI\]I 
labels 
831 
tone 1 
tone 2 
tone 3 
tone 4 
tone 5 
\ (fidl) 
conveys certainty 
/ (rico) 
conveys uncertainty 
-- (level/low rise) 
"continuation tone" 
\/ (fall-rise) 
seems certain (reservation) 
/\ (rise-fall) 
seems uncertain (strongly assertive) 
Figure 2: SFC tones and their meanings 
The phrase tone H- represents high pitdt follow- 
ing the last pitch accent. Tile tone associated 
with an intonation phrase is a boundary tone 
and is indicated by %. The boundary tone H% 
represents a final rise and the L% boundary tone 
is typically interpreted as the absence of a final 
rise (cf. Ladd (1996)). 
Every intermediate phrase must have at least 
one pitch accent. By definition, the last ac- 
cented word in any intermediate phrase is al- 
ways the nuclear accented word, and it is usu- 
ally perceived as more prominent than any other 
accented word. The utterance (a) in Fig. 1 is 
produced by an H'L-L% combination and typ- 
ically interpreted as a neutral declarative. The 
second utterance (b) has a H'L'H-H% combi- 
nation (yes/no question). The final example 
(c) illustrates a complex ntterance, made up of 
more than one intonation phrase. 
2.2 SFG 
According to SFG the unit to which intonation 
is attributed is the tone group. A tone group 
consists of.feet, and feet consist of syllables. A 
tone group carries a tune or tone, which can be 
falling (tone 1), rising (tone 2), level (tone 3), 
faning-risiug (tone 4), or rising-f~lling (tone 5). 
See Fig. 2 giving these five options with their 
approximate pragmatic meanings. The exam- 
ples in Fig. 3 show how tone is annotated in 
SFG: the nmnber gives the kind of tone, the 
double slashes snark the tone group boundaries 
and the single slashes mark feet. Also, there 
may be combinations of different; tones in one 
utterance, e.g., tone 4 followed by tone 1 (ex- 
ample (c) in Fig. 3). 
Each tone group contains an element which 
carries the nuclear stress, called Tonic. In the 
default case, the Tonic is placed on the last lex- 
(a) //1 Margaret's / looking for you // 
(b) //2 A is there / any more / news of the / Iq-ench 
e/leetions// 
(c) //4 A in the/ far corner of that/field the// 1 foot- 
path goes / over a / sq;i\]e_// 
Figure 3: Examples of SFG labelling 
ical elenmnt in tile tone group (unmarked nu- 
clear stress). In marked cases, the Tonic can 
be placed on other elements in the tone group. 
For an example of the tbrmer see (b) in Fig. 3 
(Tonic denoted by underlining); an example of 
the latter is (a) in Fig. 3. 
The Tonic represents the nuclear stress and 
is part of the tonic segment of the tone group. 
If the Tonic does not fall on the frst syllable of 
the tone group, there is an element preceding 
it, called the pretonic segment. It carries a so- 
called Pretonic stress (see (b) in Fig.3). 
2.3 Preliminary comparison 
On a technical level, the major differences we 
can observe between the ToBI and SFG annota- 
tion schemata of intonation are the following. 
Units. While there is a rough cor- 
respondence between the intonation 
phrase/intermediate phrase in ToB~ and 
the tone group in SFG (cf. Harrington & 
Cassidy (1999)), in Tom the refit of the foot is 
not acknowledged. 
Pitch movement. While in ToBI, the prim- 
itives of description of pitch movement are dis- 
tinct highs (It) and lows (L), where a particular 
pitch movement is described by a sequence of 
highs and/or lows in the pitch, in SFC the prim- 
itive of description is the tune, i.e., a relative 
concept, such as a rising, falling or level tune. 
Nuclear stress. While in ToBI, the mmlear 
stress is marked by the last starred tone in the 
sequence of tones and is thus only implicitly in- 
dicated in the annotation, SFG marks nuclear 
stress explicitly by marking up the Tonic) 
While there is a basic match in terms of ac- 
counting for the pitch movement and we cast 
thus expect to be able to recast ToBI tone se- 
quences as SFC tones, we may encounter some 
problems due to the non-acknowledgement of 
tile unit of foot in ToBI on the one hand, and 
due to ToBI marking up pitch accents other 
ICE Sec. 2.1, however: the nuclear stress in Tom is 
by definition the last starred tone. 
832 
than the nuclear stress, on the other hand. 
3 Method 
3.1 The Corpus 
The eorl)us was obtained from tlm recorded (lat~ 
which colnes with Italliday (1970). We inv(;sti- 
gated tones 1, 2, and/l, and tone sequen('es 1 & 
1, l& 2, 2 & l, 2 & 2, l & 4, mid4 & 1. A 
total of 290 utterances were analysed (= 1700 
words of text, approx. 350 tone groul)s). The 
utter~mces ranged fl:om inono- and polysyllabic 
words to sentences. The utterances varied in 
tone, number of feet, the position of the Tonic, 
and whether there were silent t)eats in the tone 
group. Also, some of the utteran(:es had a pre- 
tonic segmenl;, others did not. 
3.2 Labelling 
The labelling of the data a(:(:or(ling to SFG (:ri- 
teria was obtained from Halliday (1970). The 
labelling of the dater using ToBI was done l)y 
a trained acoustic l)honeti(:ian. 2 The exisl;ing 
recording was digitised at 20 kltz as 16 bit san> 
ples, and stored on a Unix machine. The pitch 
tracks were calculated using ESPS WAVES+. 
The labelling of the data was done in F, MU 
(Cassidy & Harrington, 1996). All the intona- 
tional and inl;ermedit~te l)hrases were marked, 
as', were the pit(:h ac(',ents, 1)hrasal and 1)oun(l- 
ary tones. 
4 Results 
The first l)art of the study estaMished that there 
is a basic eorresl)onden(:e l)etween the SFG tones 
mid particular sequences of ToBI lal)els tbr the 
simplest possible utterances, i.e., those consist- 
ing of a tonic segment only. As can be seen from 
~l~,l)le 1, tone 1 usually corresponds to H'L-L%, 
tone 2 to L'H-H% and tone 4 to II*L-H%. a 
These siml)le milts usually have one pitch a('- 
cent and (;oincide with one intonation t)hrase 
(:(resisting of one internmdiate 1)hrase. 
In a second step, we looked at the more com- 
plicated utterances, i.e., those with a pretonic 
segment, and those consisting of a sequence of 
tone groups. In these cases there is usually more 
2The phonetieimt was aware of the Sl.'(' analysis. How- 
ever, the ToBI analysis was done listening to the audio 
files and looking at the pitch plots. 
aThis confirms e.g., Ladd (1.996) stating that the 
British-style "nuclear-tones" are merely the specific cont- 
binations of accents and edge tones. 
than one l)itch accent per utterance. Further, if 
the utterance has a Pretonic, there is always a 
pitch accent in that segment. Also, what can 
lie seen here is that there is no more than one 
internlediate l)hrase per tone group, and more 
than (}lie tone group per intonation phrase. 
Table 2 gives |;lie ToBI seqllenee for the ut- 
t;eran(;es which include a pretonic segnle.nt. The 
results are essentially the same as for the sin> 
ph; utterances (~151)le 1). One small difference is 
that tone 1 and tone 4 can have either an H* or 
a !H* nuclear accent. This however is expected, 
|)ecause it simply means that although the nu- 
clear accent is high, it is down-stepped from an 
earlier It* accent. 
rl'al)le, 3 gives the TOBI S(Xluences for utter- 
anees consisting of SF(; tone groul) sequences. 
The Toni analysis tbr the final tone in a se- 
quence are essentially the same as tbr the utter- 
anees given in Table 2. The first tone group in 
a se(lllen(;e is more often than not an interme- 
(liate t)hrase rather than a separate intonation 
l)hrase. Itowevei', keel)ing in mind the dominat- 
ing intonation 1)hrase, the ToBI sequences for 
the first elenmnt in a sequence are essentially 
the same as t'omld for utterances with a 1)re- 
toni(: clement (Table 2). The results shown in 
Tal)les 1, 2, and 3 taken together show that tbr 
tones 1, 2, 4 there is one corresponding 'l.'oBI se- 
quence each tlmt characterizes tile interval 1)e- 
tween the nuclear accented word and the edge 
of the 1)hrase regardless of the complexity of 
the ul;terance. 
We also tbund a very close correspondence 
between the ~ibnic in SFG and the nuclear ac- 
cented syllable in the Tom analysis: In virtu- 
ally all cases they were in exactly the same place 
in the analyses. When the utteran(:es are more 
(:on lplex, e.g., they have a 1)retonic segment, or 
consist of sequences, in l;he ToBI analysis 1)itch 
accents are also lint in other places, not just 
on the mmlear accented syllaMe. ToBI analysis, 
unlike SFC, allows for more than just the nu- 
clear accented syllable to be marked up. The 
extra pitch accents from the ToBI analysis are 
potentially a problem for a ToBI-SFG mapping. 
However, closer examination of the placelnent of 
these other 1)itch accents revealed that they al- 
ways fall on the first syllable of a foot (also when 
that is not the one carrying the nuclear stress). 
This suggests that the SFG feet can give some 
833 
information about where these other pitch ac- 
cents are likely to tM1 or, that these other pitch 
accents may be an indication of toot boundaries. 
5 Conclusions 
In this paper we have presented the results 
of a comparison between the ToBI and the 
SFG systems for analysing intonation. The 
goal of this comparison has been to establish 
equivalents between them. The motivation be- 
hind this is to make the two systems collabo- 
rate in concept-to-speech generation: Tom is a 
phonetic-phonological approach to the deserip- 
tion of intonation, SFG offers a linguistic ap- 
proach to intonation, tbcusing on the meaning- 
ful intonation patterns. ToBI i8 widely used in 
speech synthesis, SFG is widely used in natu- 
ral  generation. It seems therefore a 
promising step to combine the two approaches 
tbr concept-to-speech generation. 
Through this study we have established some 
basic matches between SFG tones and ToBI se- 
quences of pitch accents and edge tones. Here, 
we have concentrated on the SFG tones 1, 2 and 
4. We have analysed tones 3 and 5 as well and 
identified their ToBI equiwdents using the same 
method (cf. Sections 3 and 4). In the next step 
we will integrate the SFG description of intona- 
tion for English in the existing SFG-based Pen- 
man generation system and then interface the 
FESTIVAL synthesizer with the generator using 
the correspendences established by our analy- 
ses. 
In another step of analysis we will look more 
closely at other kinds of realization of nuclear 
stresses, such as bitonal pitch accents, to es- 
tablish whether they reflect linguistic meanings. 
What also remains to be investigated is the as- 
signment of pitch accents other than the nuclear 
stress. Nuclear stress can be predicted on the 
basis of linguistic and pragmatic information, 
but it is not clear under which conditions other 
pitch accents should be placed. Our observation 
above (See. 4) that pitch accents other than the 
nuclear stress are typically placed on the first 
syllable of a foot may be a possible motivation. 
We are aware that there is controversy among 
researchers about rhythm. However, if it turns 
out that rhythm is a useful concept in the pre- 
diction of non-nuclear pitch accents, then we 
will consider including it in our approach. 
6 Acknowledgements 
We thank J. Harrington, C. Matthiessen, M. Hall- 
iday and the anonymous reviewers for their useful 
comments. 
834 
Hallid;wan descrit)tion Toll(~ ql.'olu descril)tion 
Tonic:l foot 1 H*I,-L% (20) 
and 1 or more 2 L*II-H% (20) 
syllables 4 H'L-H% (19) 
Tonic:l in- 1. H'L-L% (18) 
coml)lete foot 2 L'H-H% (9) 
,% 1. foot 4 H*I,-H% (10) 
Tonic:>1 foot (first might 1. H'L-L% (17), L+H*L-L% (1), 
be incomplete) II*L-H*L-L% (1), H*L-!II*L-L% (1) 
2 I1*It-It% (1), L'H-It% (10) 
4 tt*H-It% (1), IVL-It% (9) 
2~d)lc' 1: Simple tone groups 
Ibdlidayan descril)tion Tone ToBI description 
1 (4O) !H'L-L% (18), H'L-L% (22) 
Pretonic + tonic with 1. or > 1 feet 2 (20) L'H-H% (20) 
4 (19) !H'L-n% (12), H'L-H% (7) 
Table 2: tivOli(? groups with a Pr(~,tOlfi(: 
2bnes Tone & ToB\] descrit)tion 
1 & 1 Tone 1 
(20) !It*L- (4), H'L-L% (5), It*L-(11) 
2 & 1 Tone 2 
(10) L'It- (5), L*It-tt% (4), L+It*H- (1) 
1 ~Q, 2 Tone 1 
(9) H'L- (9) 
2 & 2 Tone 2 
(9) It*It- (1), II*L-It% (1), L*It- (8) 
1 & 4 Tone 1 
(10) It*L- (1.0) 
4 ck5 1. ~lbne 4 
(10) !H'H- (1), !H*L-II% (3), 1I*I:It% (6) 
Ton(; 1 
I~*L-L% (20) 
Tone 1 
H'L-L% (6), !H'L-L% (2), L'L-L% (2) 
Tone 2 
L*II-H% (9) 
Tone 2 
L'H-H% (10) 
Tone 4 
H'L-It% (10) 
Tone 1 
H'L-L% (9),!It*L-L% (1) 
2}d)lc 3: Tone group sequences 
835 

References 

M. E. Beckman & G. M. Ayers. 1997. Guidelines 
for ToBI labeling (Version 7.0). Ohio State Uni- 
versify. (ling.ohio-state.edu/Phoneties/E-ToBI). 

A. Black, P. Taylor, & R. Caley. 1998. The FESTI- 
VAL speech synthesis system; system documen- 
tation, (Version 1.3.1). University of Edinburgh. 
(www.cstr.ed.ac.uk/projects/festival/). 

D. Brazil, M. Coulthard, & C. Johns. 1980. Dis- 
course Intonation and Language Teaching. Long- 
lIlan, London. 

S. Cassidy & J. Harrington. 1996. Emu: An en- 
hanced hierarchical speech data management sys- 
tem. PTvceedings o\]" the 6th Australian Interna- 
tional Conference on Speech Science and Technol- 
ogy , pp. 361--366. 

R.A. Cole, J. Marimfi, H. Uszkoreit, A. Zae- 
hen, & V. Zue. 1995. Survey of the State 
of the Art in Human Language Technology. 
(c.sht.cse.ogi.edu/nI~Tsurvey/ItLTsurvey.html). 

M. Grice, M. Reyelt, R. Benzlniiller, J. Mayer, & 
A. Batliner. 1996. Consistency in transcription 
labelling of German intonation with GTom. Pw- 
ceedings of the 4th International Conference on 
Spoken Language Processing, pp. 1716-1719. 

M. A.K. Halliday. 1967. Intonation and Grammar 
in British English. Mouton, The Hague. 

M. A.K. Halliday. 1970. A Course in Spoken En- 
glish: Intonation. Oxford University Press, Ox- 
ford. 

J. Harrington & S. Cassidy. 1999. Tcch, niques in 
Speech Acoustics. Kluwer Academic Publishers, 
Dordrecht. 

D.R. Lad& 1996. Intonational Phonology. Cam- 
bridge University Press, Cmnbridge. 

C. M.I.M. Matthiessen & J. A. Bateman. 1991. Text 
Generation and Systemic Functional Linguistics: 
Experiences from English and Japanese. Pinter, 
London. 

J. B. Pierrehulnbert. 1980. The phonology and pho- 
netics of English intonation. Ph.D. thesis, MIT. 

K. Silverman, M. Beckman, J. Petrelli, M. Osten- 
dorf, C. Wightman, P. Price, J. Pierrehumbert, 
& J. Hirschberg. 1996. ToBI: A standard tbr 
labelling English prosody. Proceedings of ICSLP 
92, volmne 2, pp. 867-870. 

E. Teich, E. Hagen, B. Grote, &: J. Bateman. 1997. 
From communicative context to speech: Integrat- 
ing dialogue processing, speech production, and 
natural  generation. Speech Communici- 
ation, 21:73-99. 
