Message-to-Speech: high quality speech generation for messaging 
and dialogue systems 
P. Spyns (1), F. Deprez (1), L. Van Tichelen (1), B. Van Coile (1,2) 
(1) Lernout & Hauspie Speech Products, Sint Krispijnstraat 7, B-8900 Ieper, Belgium 
tel.: 32-57-22.88.88, fax: 32-57-20.84.89 
(2) E.L.I.S., University of Gent, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium 
tel.: 32-9-264.33.95, fax: 32-9-264.35.94 
{Peter.Spyns,Filip.Deprez,Luc.VanTichelen,Bert.VanCoile}@lhs.be 
Abstract 
In this paper, we present a Message-to- 
Speech (MTS) system that offers the lin- 
guistic flexibility desired for spoken di- 
alogue and message generating systems. 
The use of prosody transplantation and 
special purpose prosody models results in 
highly natural prosody for the synthesised 
speech. 
1 Introduction 
Many of the Natural Language Generation (NLG) 
systems that produce flexible output, i.e. sentences 
with variations on the syntactical and morphological 
levels, only aim at the production of written text and 
do not deal with spoken language. As a result, the 
important topic of generation of natural prosody is 
not touched upon (see e.g. (Elhadad, 1992; Reiter 
et al., 1995; Dalianis, 1996b; Somers et al., 1997)). 
Message generating systems (e.g. announcement 
systems, phone banking and voice mail applications) 
often combine fixed pieces of pre-recorded speech 
to provide speech of a natural quality. In practical 
applications, the linguistic flexibility of the spoken 
messages is usually kept very limited because of the 
high costs of recording and storing the fixed pieces 
of speech. 
The Message-to-Speech (MTS) system described 
below is specifically designed to generate high qual- 
ity speech output with the flexibility desired for spo- 
ken dialogue and message generating systems. Such 
systems typically generate speech for a predefined 
set of messages that consist of fixed and variable 
parts. High flexibility may be required for the vari- 
able parts in the messages only. 
Text-to-Speech (TTS) is an obvious technique for 
providing speech output with nearly unlimited flex- 
ibility. As full flexibility is only needed for the vari- 
able parts in the messages, the MTS system can 
make use of special purpose prosody models for the 
actual set of messages of the application. These 
models can lead to a prosodic quality that is su- 
perior to the one generated by TTS systems, which 
apply general prosody models for unrestricted text 
(see also (Hovy, 1995, p.161)). 
For the fixed parts of a message, the prosody 
transplantation technique (see section 2) is used to 
override the prosody that the general models of a TTS 
system would generate with specific prosody copied from 
natural speech. For the parts of a message where 
flexibility is needed, prosody is obtained by either a 
general model or by a model that is specifically de- 
veloped for those parts. The MTS system thus com- 
bines transplanted prosody with prosody by model 
in order to achieve highly natural prosody for partly 
variable messages (Van Coile et al., 1995). 
The key concepts of the MTS system are presented 
in section 3.1. The system consists of two main mod- 
ules: a generation module and a prosodic integration 
module. The generation module (see section 3.2) 
is template driven (canned "text" interspersed with 
slots), and accounts for the flexibility, including the 
linguistic variation, of the messages. For a discus- 
sion of template driven systems see (Reiter, 1995; 
van Deemter et al., 1994; van Deemter and Odijk, 
1997). The prosodic integration module (see sec- 
tion 3.3) takes care of the prosodic integration of 
the slot fillers with the rest of the template. 
In section 4 the Message-to-Speech system is 
briefly discussed, and section 5 compares the sys- 
tem with related research. To conclude, an overview 
of current developments to further enhance the MTS 
system is presented in section 6. 
2 Prosody Transplantation 
The idea behind Prosody Transplantation is that 
of copying intonation and duration values from a 
recorded donor message (human speech) to the phonetic 
transcription of the same message. The Enriched 
Phonetic Transcription (EPT) obtained in 
this manner can be fed to a TTS system whereby the 
normal linguistic and prosodic modules (based on 
general models) are by-passed (Phonetics-to-Speech 
-- PTS). Only the segmental synthesis and the syn- 
thesiser modules are used. 
# T[104] m[74(0,98)] N[47] k[107(10,81)] j[14(0,106)] 
u[44] f[93(0,91)] o[47(0,102)] r[29] j[68(0,98)(30,90)] 
0[50(0,96)] r[71] $[45(0,93)]-t[105] E[70(0,102)] n[68] 
-S[96] $[56] n[106(30,83)(100,83)] 
Figure 1: textual representation of an EPT for the 
sentence "Thank you for your attention" 
An example of an EPT is provided by figure 1. 
The first value between square brackets is the 
phoneme duration (in ms), optionally followed by 
one or more intonation breakpoints. Each break- 
point consists of a location value (in ms) relative to 
the beginning of the phoneme, followed by a pitch 
value (in ST/4; reference 50 Hz). 
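As an illustration of the notation just described, the following Python sketch parses the textual form of an EPT into phoneme records and converts a pitch value to Hz. It is not the authors' implementation: the names `Phoneme`, `parse_ept` and `pitch_to_hz` are hypothetical, the hyphen is assumed to be a boundary marker that can be skipped, and ST/4 is assumed to mean quarter-semitone steps above the 50 Hz reference.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Phoneme:
    symbol: str
    duration_ms: int
    # Intonation breakpoints: (offset in ms from phoneme start, pitch in ST/4).
    breakpoints: List[Tuple[int, int]] = field(default_factory=list)


# A token is a phoneme symbol followed by [duration(offset,pitch)...].
# Assumption: "-" is a boundary marker, so it is excluded from the symbol.
_TOKEN = re.compile(r"([^\s\[\]\-]+)\[(\d+)((?:\(\d+,\d+\))*)\]")
_BREAKPOINT = re.compile(r"\((\d+),(\d+)\)")


def parse_ept(text: str) -> List[Phoneme]:
    """Parse the textual EPT representation into a list of phonemes."""
    phonemes = []
    for symbol, duration, points in _TOKEN.findall(text):
        breakpoints = [(int(t), int(p)) for t, p in _BREAKPOINT.findall(points)]
        phonemes.append(Phoneme(symbol, int(duration), breakpoints))
    return phonemes


def pitch_to_hz(quarter_semitones: float, reference_hz: float = 50.0) -> float:
    """Assumed reading of ST/4: 48 quarter-semitone steps per octave."""
    return reference_hz * 2 ** (quarter_semitones / 48)
```

Under these assumptions a pitch value of 96 corresponds to 200 Hz, two octaves above the reference.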
A major asset of Prosody Transplantation is the 
combination of natural sounding speech with a low 
bit rate for storage (less than 300 bits per second). 
In addition, only the prosody and not the timbre of 
the speaker is retained. New donor messages can 
be recorded by new speakers and seamlessly inte- 
grated in existing applications. Specific tools have 
been developed to speed up the prosody transplanta- 
tion process (Van Coile et al., 1994). Although the 
EPTs as such do not support linguistic variation, 
the combination of PTS with a template driven system 
provides linguistic flexibility as well as natural 
prosody. 
3 The Message-to-Speech System 
The Message-to-Speech system described in this section 
takes as input a message specification and outputs 
synthetic speech with highly natural prosody. 
Below, we first define the key concepts and then focus 
on the two main modules of the system: the generation 
module and the prosodic integration module. 
3.1 Key Concepts 
A message can be seen as a complete sentence. It is 
specified as a concatenation of message units (build- 
ing blocks that constitute prosodic units). The flex- 
ibility of a message unit is guaranteed by the pres- 
ence of slots. A slot is a placeholder that can take 
an argument. A carrier is a template containing the 
enriched phonetic transcription of the canned text 
part, transplanted from an appropriate donor, and 
zero or more slots. For each slot, the carrier contains 
morpho-syntactic and prosodic information. By fill- 
ing out a slot of a carrier with different arguments, 
several variants can be derived from the same mes- 
sage unit at run-time. 
Figure 2 shows the wave and the prosody corre- 
sponding to the donor "in four miles". In order to 
obtain a flexible carrier, "four" is cut away and re- 
placed by a slot in which any argument of the type 
/number/ (see figure 3) can be filled out at run-time. 
Figure 3 illustrates that the message "In four 
miles, bear left." is realised as a concatenation of 
two message units: "in/number/mile(s)" and "bear 
left". The message unit "in/number/mile(s)" has 
one slot in which a numeric argument is to be filled 
out. The message unit "bear left" has no slots. 
message:                In four miles, bear left 
message specification:  (message unit1 4) (message unit2) 
message unit1:          in /number/ miles 
message unit2:          bear left 
carrier 1:              in /number/ miles 
                        #[952(952,101)] ?[18] I[66] n[92(4,98)] 
                        /number: ... ON=CO ... / 
                        m[138(10,103)(70,96)] 
                        Y[224(2,93)(132,92)] l[173(58,82)] 
                        z[352] #[411(231,82)] 
carrier 2:              bear left 
                        #[50(1,124)] b[141(104,91)] 
                        E[228(211,119)] r[50]-l[160(4,120)] 
                        E[205(156,82)] f[131] t[151] #[800(800,79)] 
Figure 3: example of message specification, message 
units and carriers for a message 
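The relation between message specification, message units and carriers in figure 3 can be sketched in Python as follows. This is only one possible representation, not the data model of the MTS system: the class and function names are hypothetical, EPT fragments are kept as plain strings, and the filler passed in for the slot is a made-up EPT standing in for the output of the prosodic integration module.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple, Union


@dataclass
class Slot:
    arg_type: str                                        # e.g. "number"
    info: Dict[str, str] = field(default_factory=dict)   # slot information as printed, e.g. ON=CO


@dataclass
class Carrier:
    parts: List[Union[str, Slot]]   # transplanted EPT fragments and slots, in order

    def fill(self, fillers: Sequence[str]) -> str:
        """Return the carrier's EPT with each slot replaced by a filler EPT."""
        slots = [p for p in self.parts if isinstance(p, Slot)]
        if len(slots) != len(fillers):
            raise ValueError("number of fillers must match number of slots")
        remaining = iter(fillers)
        return " ".join(next(remaining) if isinstance(p, Slot) else p
                        for p in self.parts)


# The two carriers of figure 3, keyed by (hypothetical) message unit names.
CARRIERS = {
    "unit1": Carrier([
        "#[952(952,101)] ?[18] I[66] n[92(4,98)]",
        Slot("number", {"ON": "CO"}),
        "m[138(10,103)(70,96)] Y[224(2,93)(132,92)] l[173(58,82)] z[352] #[411(231,82)]",
    ]),
    "unit2": Carrier([
        "#[50(1,124)] b[141(104,91)] E[228(211,119)] r[50]-l[160(4,120)] "
        "E[205(156,82)] f[131] t[151] #[800(800,79)]",
    ]),
}

MessageSpec = List[Tuple[str, List[str]]]   # (message unit, filler EPTs)


def realise(spec: MessageSpec) -> str:
    """A message is the concatenation of its filled message units."""
    return " ".join(CARRIERS[unit].fill(fillers) for unit, fillers in spec)
```

A call such as `realise([("unit1", ["f[90] o[120] r[60]"]), ("unit2", [])])` then yields a single EPT string for the whole message; the durations in the filler are invented for the example.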
3.2 Message-to-Speech Generation Module 
The generation module translates each message unit 
of the message specification into a carrier with op- 
tional arguments. This translation is guided by a 
two-fold mechanism: 
• argument dependent carrier selection consists in 
selecting a carrier as a function of (a characteristic 
of) an argument. Figure 4 shows that the 
message unit "in/number/mile(s)" can be realised 
by one of two carriers, depending on 
the numeric argument that is filled out. If the 
argument is "1", the message unit is mapped onto 
carrier 1a. In all other cases, the message unit 
is mapped onto carrier 1b. 
• carrier dependent argument realisation consists 
in determining the correct surface realisation of 
an argument, depending on properties of the 
slot in which it is inserted. Figure 5 illustrates 
that the argument "1" has a different surface 
realisation ("a" versus "an") depending on the 
phonetic onset of the word to the right of the 
slot. 
[Figure 2 graphic not reproduced: the intonation contour of the donor sentence, transplanted onto a carrier with a slot; phoneme labels ? I n m Y l z #] 
Figure 2: intonation contour for a carrier obtained from a donor sentence 
message unit: "in/number/mile(s)" 
mapping condition        carrier (represented orthographically) 
the argument = "1"       1a: "in /a/ mile" 
the argument ≠ "1"       1b: "in /number/ miles" 
Figure 4: example of argument dependent carrier selection 
For the arguments filled out in the slot of a carrier, 
prosody is calculated at run-time (see section 3.3). 
As prosody derived from human recordings is pre- 
ferred over prosody calculated at run-time, we try 
to keep the number of slots in a carrier as limited 
as possible. Therefore, the possibility is offered to 
delete arguments during the translation of message 
units into carriers. This functionality is exploited 
when the number of possible slot fillers is restricted. 
Figure 6 shows a message unit with one slot that is 
translated into one out of four carriers without slot, 
depending on the message unit argument. 
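The translation of message units into carriers described in this section can be sketched in Python as below. The sketch works on orthographic strings instead of EPTs and is an illustration under assumptions, not the MTS implementation: all names (`Carrier`, `MESSAGE_UNITS`, `generate`, ...) are hypothetical, and the onset of the word to the right of the slot is assumed to be stored with the carrier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Carrier:
    text: str                          # orthographic stand-in for the carrier's EPT
    right_onset: Optional[str] = None  # onset of the word right of the slot, if any


Condition = Callable[[Optional[str]], bool]


def equals(value: str) -> Condition:
    return lambda arg: arg == value


def otherwise(arg: Optional[str]) -> bool:
    return True


# Mapping conditions are tried in order; the first one that holds selects the carrier.
MESSAGE_UNITS: Dict[str, List[Tuple[Condition, Carrier]]] = {
    "in/number/mile(s)": [
        (equals("1"), Carrier("in {} mile", "consonantal")),    # carrier 1a
        (otherwise, Carrier("in {} miles", "consonantal")),     # carrier 1b
    ],
    "in/number/hour(s)": [
        (equals("1"), Carrier("in {} hour", "vocalic")),
        (otherwise, Carrier("in {} hours", "vocalic")),
    ],
    # Argument deletion: one slotless carrier per possible filler (figure 6).
    "go to the/direction/": [
        (equals(d), Carrier("go to the " + d))
        for d in ("west", "east", "north", "south")
    ],
    "bear left": [(otherwise, Carrier("bear left"))],
}


def realise_argument(arg: str, carrier: Carrier) -> str:
    """Carrier dependent argument realisation: "1" becomes "a" or "an"."""
    if arg == "1":
        return "an" if carrier.right_onset == "vocalic" else "a"
    return arg


def generate(unit: str, arg: Optional[str] = None) -> str:
    """Argument dependent carrier selection, then argument realisation."""
    for condition, carrier in MESSAGE_UNITS[unit]:
        if condition(arg):
            if "{}" in carrier.text:
                return carrier.text.format(realise_argument(arg, carrier))
            return carrier.text      # slotless carrier: the argument was deleted
    raise ValueError(f"no carrier for {unit!r} with argument {arg!r}")
```

Storing the onset with the carrier, instead of inspecting the spelling of the next word, is a deliberate choice in this sketch: "hour" is spelled with a consonant but has a vocalic onset.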
3.3 Message-to-Speech Prosodic 
Integration Module 
The purpose of the prosodic integration module is 
to calculate appropriate prosody for the arguments 
that are filled out in a slot of a carrier. To this end, 
a phonetic transcription of the argument needs to 
be available. This transcription can be obtained 
by a dictionary look-up or by using a grapheme-to- 
phoneme conversion routine. 
In a first step a duration is calculated for each of 
the phonemes in the argument. In a second step, an 
appropriate intonation contour is calculated. 
3.3.1 Duration module 
The input of the duration module is a phonetic 
transcription in which primary and secondary stress 
are indicated. The duration module has access to 
one or more duration models in order to produce a 
duration value for each phoneme in a phonetic tran- 
scription. 
A duration model is a rule-based system calculat- 
ing durations, taking into account parameters such 
as lexical stress, position of phonemes (word initial, 
word medial, word final, sentence final), length of the 
argument, phonetic context of phonemes (left/right 
neighbour, consonant cluster), etc. As speech rate 
can vary from one message to another, a slot spe- 
cific speech rate coefficient, provided by the carrier, 
is also taken into account. 
Two major strategies with respect to duration 
modelling can be distinguished: 
• As the most natural prosody is the one derived 
from human speech, the possibility is offered to 
feed the duration module with phonetic tran- 
scriptions enriched with duration information 
copied from natural speech. When customis- 
ing the MTS system, an argument dictionary 
containing this information can be built off-line 
by making use of the prosody transplantation 
tools (see section 2). If transplanted durations 
are available in the argument, they are taken 
over by the duration module and only modified 
in specific cases, e.g. changing a duration in 
order to cope with a phenomenon such as final 
lengthening. 
• For arguments without transplanted durations, 
a general purpose duration module is activated. 
It consists of a cascade of different duration 
models, each less specific than the previous one. 
Specific duration models exist for particular 
arguments such as numbers or date and time 
indications. The general purpose model is only 
used if a more specific model is not available. 
Special tools have been developed to speed up 
the creation of general and special purpose 
duration models. 
message unit           argument   surface realisation   condition 
"in/number/mile(s)"    1          a                     word to the right of the slot has a consonantal onset 
"in/number/hour(s)"    1          an                    word to the right of the slot has a vocalic onset 
Figure 5: example of carrier dependent argument realisation 
message unit with one slot: "go to the/direction/" 
mapping condition          carrier without slot (represented orthographically) 
the argument = "west"      go to the west 
the argument = "east"      go to the east 
the argument = "north"     go to the north 
the argument = "south"     go to the south 
Figure 6: example of message unit argument deletion 
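The two duration strategies above can be sketched in Python as a dictionary look-up followed by a cascade of models. Everything in the sketch is an assumption made for illustration: the function names, the constant durations returned by the two toy models (a real rule-based model would use stress, position and phonetic context), and the reading of the speech rate coefficient as a multiplicative factor on model durations.

```python
from typing import Callable, Dict, List, Optional, Sequence

DurationModel = Callable[[str, Sequence[str]], Optional[List[float]]]


def number_model(arg_type, phonemes):
    """Hypothetical special purpose model; only applies to number arguments."""
    if arg_type != "number":
        return None                       # not applicable: fall through the cascade
    return [70.0 for _ in phonemes]       # made-up duration in ms


def general_model(arg_type, phonemes):
    """Hypothetical general purpose model; applies to any argument."""
    return [80.0 for _ in phonemes]       # made-up duration in ms


# Most specific model first, general purpose model last.
CASCADE: List[DurationModel] = [number_model, general_model]


def durations(argument: str,
              arg_type: str,
              phonemes: Sequence[str],
              dictionary: Dict[str, List[float]],
              rate_coefficient: float = 1.0) -> List[float]:
    """Return one duration (ms) per phoneme of the argument."""
    if argument in dictionary:
        # Strategy 1: durations copied from natural speech are taken over as they are.
        return list(dictionary[argument])
    # Strategy 2: the first applicable model of the cascade supplies the durations.
    for model in CASCADE:
        result = model(arg_type, phonemes)
        if result is not None:
            return [d * rate_coefficient for d in result]
    raise LookupError("no duration model applies")
```

Because each model returns `None` when it does not apply, a new special purpose model can be added to the front of `CASCADE` without touching the others. Adjustments such as final lengthening are left out of the sketch.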
3.3.2 Intonation module 
The input of the intonation module is a phonetic 
transcription enriched with phoneme duration infor- 
mation. The output is a phonetic transcription de- 
scribing both duration and intonation. After tak- 
ing care of assimilation, this enriched phonetic tran- 
scription can be inserted without further action into 
the carrier. 
There are two ways to model the intonation on 
arguments: 
• The most natural intonation is obtained by 
transplanting part of an intonation contour as 
observed in a donor sentence onto the argument 
that is to be filled out in a carrier. It is indeed 
possible to reuse the intonation as realised on 
"4.6" in the donor phrase "in 4.6 miles" for the 
argument "9.5" that is to be inserted in the car- 
rier "in/number/miles". 
• If no appropriate donor contour is available, the 
intonation module calculates a piecewise linear 
intonation contour based on slot specific intona- 
tion models. Slot specific intonation parameters 
that are taken into account are among others 
the begin pitch, the end pitch, the declination 
rate and the intonation context (final fall, con- 
tinuation rise, etc.). 
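Both ways of modelling intonation produce a list of breakpoints, which the following Python sketch makes concrete. It is a simplification under stated assumptions rather than the module itself: the function names are hypothetical, a donor contour is assumed to be adapted to a new argument by linear time scaling, and the model-based contour uses only the begin and end pitch, leaving out the declination rate and the intonation context.

```python
from typing import List, Tuple

Contour = List[Tuple[float, float]]   # (time in ms, pitch) breakpoints, times increasing


def transplant_contour(donor: Contour, donor_ms: float, target_ms: float) -> Contour:
    """Stretch or compress a donor contour to the duration of the new argument."""
    scale = target_ms / donor_ms
    return [(t * scale, pitch) for t, pitch in donor]


def model_contour(duration_ms: float, begin_pitch: float, end_pitch: float) -> Contour:
    """Simplest slot specific model: one line from begin pitch to end pitch."""
    return [(0.0, begin_pitch), (duration_ms, end_pitch)]


def pitch_at(contour: Contour, t: float) -> float:
    """Evaluate a piecewise linear contour; it is held flat outside its range."""
    if t <= contour[0][0]:
        return contour[0][1]
    for (t0, p0), (t1, p1) in zip(contour, contour[1:]):
        if t <= t1:
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    return contour[-1][1]
```

In this representation the contour observed on "4.6" could be reused for "9.5" by calling `transplant_contour` with the two argument durations; the breakpoints would then still have to be distributed over the phonemes of the new argument.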
4 Discussion 
The Message-to-Speech system is designed to gen- 
erate high quality speech output with the flexibil- 
ity desired for spoken dialogue and message gener- 
ating systems. It produces high quality speech while 
morpho-syntactic variations are taken into account. 
More specifically, as the message units and under- 
lying carriers can take arguments, it is possible to 
generate several variants of the same basic message. 
• variations on the level of a carrier slot can be 
paradigmatic: a message ranges over all the ele- 
ments belonging to a certain semantic category 
(e.g. product name, cardinality, direction - see 
figure 6), but the actual message is not known 
beforehand. 
• variations on the level of a carrier slot can be 
syntagmatic: agreement of all kinds, liaison, 
contraction, etcetera (see figures 4 & 5). 
• variations on the level of the message units can 
be semantic: new combinations of message units 
lead to the creation of new messages. E.g. the 
message unit "in /number/ mile(s)" can not 
only be combined with a message unit "drive 
/slowly_fast/" but also with the message unit 
"bear /left_right/". 
Highly natural prosody for the carriers is obtained 
thanks to the prosody transplantation technique. 
The prosody transplantation technique can be used 
for the slot arguments as well. However, if no donor 
prosody is available for an argument, prosody is cal- 
culated at run-time on the basis of specific duration 
and intonation models. 
5 Related Research 
In what follows we try to relate the MTS system 
to the levels that are generally recognised to form 
part of an NLG system. A well known architectural 
scheme outlining the three basic levels of an NLG 
system has been proposed by Reiter (Reiter, 1994, 
p.164) 1: 
1. content determination and text planning: The 
content of the message to be communicated is 
mapped onto a semantic form, possibly anno- 
tated with rhetorical relations. On this level, 
reasoning takes place about the communicative 
goals of the text or message and the rhetorical 
relations between these goals. 
2. sentence planning: The information of the se- 
mantic form is distributed over sentences and 
paragraphs. The sentences are linked together. 
3. surface generation: The abstract specification 
of the linguistic structure is mapped to a sur- 
face form that communicates the information 
while syntactic and morphologic processing is 
done in order to generate a grammatically cor- 
rect surface form. 
If we compare our strategy with the classification 
proposed above, the mapping of a message unit onto 
carriers is to be situated on the surface generation 
level. The result of the mapping stage is a complete 
surface form (represented by an EPT): the precise 
wording of a message has been fixed in accordance 
with syntactic and morphologic restrictions. The 
prosodic integration phase has no explicit place in 
Reiter's architecture since he only studied text gen- 
eration systems. 
The content determination, text planning, and 
sentence planning levels are not provided by the 
MTS system. In a number of practical message 
generating systems, the content of a message cor- 
responds in a straightforward manner with the mes- 
sage units, which can therefore easily be generated 
by the back-end application. 
6 Current Developments 
We are currently enhancing the functionality of the 
MTS system in the following areas: 
(Footnote 1: a more recent and detailed description of this 
architecture can be found in (Reiter and Dale, 1997).) 
• The MTS system in its current state only comprises 
carriers with one slot or multiple unrelated 
slots. The slots of a carrier are filled 
in a fixed sequential way (left to right), so that 
the linguistic restrictions are also applied in the 
same order. This entails that no restrictions 
between related slots can be applied; with related 
slots, the selection mechanism risks entering a 
deadlock situation. E.g. consider the carrier 
"you have bought /number/ /item/" where 
/number/ indicates the number of items /item/. 
/number/ could be realised as "no, a(n), two, 
three" etc. In the case that "num = 1", the 
system blocks since the argument "1" cannot 
be realised as long as the phonetic onset of the 
following word is not known. But that word 
cannot be realised (singular vs. plural) either 
since the number slot is not yet filled in. 
• The MTS system in its current state only deals 
with atomic arguments. For some applications, 
it is also useful to support lists as arguments. A 
back-end application then could use the same 
message unit to have the MTS system gen- 
erate e.g. "You have new mail from Tom" 
(atomic argument) or "You have new mail from 
Tom, Paul and John" (list argument). In order 
to achieve this, the MTS system will have to 
deal with syntactic aggregation (Dalianis and 
Hovy, 1996; Dalianis, 1996a). 
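The list case can be illustrated on the orthographic level with a minimal Python sketch. The function name `realise_list` is hypothetical, and only the simplest form of aggregation, coordination with "and", is shown; the prosody of the resulting list is not addressed.

```python
from typing import Sequence


def realise_list(items: Sequence[str]) -> str:
    """Aggregate an atomic or list argument into one coordinated phrase."""
    items = list(items)
    if not items:
        raise ValueError("a list argument needs at least one element")
    if len(items) == 1:
        return items[0]                       # atomic argument: unchanged
    # Comma-separated elements, with "and" before the last one.
    return ", ".join(items[:-1]) + " and " + items[-1]


def new_mail_message(senders: Sequence[str]) -> str:
    """The same message unit serves both atomic and list arguments."""
    return "You have new mail from " + realise_list(senders)
```

With `["Tom"]` the message is "You have new mail from Tom"; with `["Tom", "Paul", "John"]` it is "You have new mail from Tom, Paul and John".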

References 
Hercules Dalianis and Eduard Hovy. 1996. Aggrega- 
tion in natural language generation. In Giovanni 
Adorni and Michael Zock, editors, Trends in Nat- 
ural Language Generation: An Artificial Intelli- 
gence Perspective, pages 88-105. Springer-Verlag. 
Hercules Dalianis. 1996a. Aggregation as a sub- 
task of text and sentence planning. In J.H. Stew- 
man, editor, Proceedings of the Artificial Intelli- 
gence Research Symposium. 
Hercules Dalianis. 1996b. Concise Natural Lan- 
guage Generation from Formal Specifications. 
Ph.D. thesis, The Royal Institute of Technology 
and Stockholm University, Department of Com- 
puter and Systems Science, Stockholm, Sweden. 
Michael Elhadad. 1992. Using argumentation to 
control lexical choice: A functional unification- 
based approach. Ph.D. thesis, Computer Science 
Department, Columbia University. 
Eduard Hovy. 1995. Overview. In Ronald Cole, 
Joseph Mariani, Hans Uszkoreit, Annie Zaenen, 
and Victor Zue, editors, Survey of the State of the 
Art in Human Language Technology, pages 161 - 
169. Cambridge University Press (in press). 
Ehud Reiter and Robert Dale. 1997. Building ap- 
plied natural language generation systems. Jour- 
nal of Natural Language Engineering, pages 1-38 
(submitted). 
Ehud Reiter, Chris Mellish, and John Levine. 1995. 
Automatic generation of technical documentation. 
Applied Artificial Intelligence, 9(3):259-287. 
Ehud Reiter. 1994. Has a consensus NL generation 
architecture appeared, and is it psycholinguistically 
plausible? In Proceedings of the Seventh International 
Workshop on Natural Language Generation, 
pages 163-170, Nonantum Inn, Kennebunkport, 
Maine, June 21-24. 
Ehud Reiter. 1995. NLG vs. templates. In Proceed- 
ings of the European NLG Workshop 95, pages 95 
- 106. 
Harold Somers, Bill Black, Joakim Nivre, Torbjörn 
Lager, Annarosa Multari, Luca Gilardoni, 
Jeremy Ellman, and Alex Rogers. 1997. Multilingual 
generation and summarization of job adverts: 
the TREE project. In Proceedings of the Fifth 
Conference on Applied Natural Language Processing, 
pages 269 - 276, Washington D.C. Morgan 
Kaufmann Publishers. 
B. Van Coile, L. Van Tichelen, A. Vorstermans, 
J.W. Jang, and M. Staessen. 1994. Protran: A 
prosody transplantation tool for Text-to-Speech 
applications. In Proceedings of the 1994 Interna- 
tional Conference on Spoken Language Processing 
(ICSLP-94), pages 423-426, Yokohama, Japan. 
B. Van Coile, H. Rühl, L. Vogten, M. Thoone, 
S. Goss, D. Delaey, E. Moons, J. Terken, J. de Pijper, 
M. Kugler, P. Kaufholz, R. Krüger, S. Leys, 
and S. Willems. 1995. Speech synthesis for the 
new pan-european traffic message control system 
RDS-TMC. In Proceedings of Eurospeech 1995, 
pages 145-148. 
K. van Deemter and J. Odijk. 1997. Context model- 
ing and the generation of spoken discourse. Speech 
Communication, 21:101 - 121. 
K. van Deemter, J. Landsbergen, R. Leermakers, 
and J. Odijk. 1994. Generation of spoken 
monologues by means of templates. In L. Boves 
and A. Nijholt, editors, Proceedings of the Eighth 
Twente Workshop on Language Technology, pages 
87 - 96, Twente. 
