Synthesis of Spoken Messages from Semantic Representations 
(Semantic-Representation-to-Speech System)
Laurence DANLOS, Eric LAPORTE
Laboratoire d'Automatique Documentaire et Linguistique
Université Paris 7
2, place Jussieu
75251 PARIS CEDEX 05
Françoise EMERARD
Centre National d'Etudes des Télécommunications
22301 LANNION CEDEX
Abstract 
A semantic-representation-to-speech system communicates
orally the information given in a semantic representation.
Such a system must integrate a text generation module, a
phonetic conversion module, a prosodic module and a speech
synthesizer. We will see how the syntactic information
elaborated by the text generation module is used for both
phonetic conversion and prosody, so as to produce the data
that must be supplied to the speech synthesizer, namely a
phonetic chain including prosodic information.
Introduction 
A spoken message can be produced either to utter a written
text (text-to-speech system), or to communicate orally the
information given in a semantic representation
(semantic-representation-to-speech system). In both cases,
the speech synthesizer must be provided with a phonetic
chain including prosodic information in order to reconstruct
the acoustic signal. As we will recall in Section 1,
syntactic knowledge is necessary to compute the phonetic
transcription of a written text and to include prosodic
information in it. Hence a text-to-speech system must
include a parsing module to get this syntactic knowledge. On
the other hand, a semantic-representation-to-speech system
can take advantage of the syntactic information elaborated
when expressing the semantic representation in natural
language. Therefore, we have designed a
semantic-representation-to-speech system that generates a
phonetic string with prosodic markers directly from the
semantic representation, without a written stage. Our system
has been designed for French but it could be extended to
other languages.
¹ In French, semantic features are needed to distinguish only a few
non-homophonic homographs, mostly technical words.
1. Knowledge needed in a text-to-speech system
1.1. Spelling-to-sound conversion
The first problem encountered in synthesizing speech from
written text is that of spelling-to-sound conversion.
Certain languages are much easier than others in this
respect. For example, about 50 rules are sufficient for the
conversion of written Spanish into phonetic symbols, with a
virtually zero error rate (Santos & Nombela 1982). For other
languages, such as French or English, the problem is much
greater. A phoneme does not generally correspond to only one
grapheme, and the reverse is also true. For instance, the
word oiseau is pronounced /wazo/: none of its graphemes is
pronounced as would be expected (i.e. /o/ for o, /i/ for i,
/s/ for s, schwa for e, /a/ for a and /y/ for u).
Spelling-to-sound conversion is further complicated by the
existence of non-homophonic homographs, i.e. words spelled
the same but pronounced differently. Distinguishing two
homographs requires knowing their grammatical categories
(record is pronounced ['rekɔ:d] if it is a noun and
[rɪ'kɔ:d] if it is a verb), their inflexional features (read
is pronounced [ri:d] in the infinitive form and [red] in the
preterite), or their semantic features (lead is pronounced
[led] when it is a noun or a verb related to the metal and
[li:d] otherwise)¹.
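The way such features select a pronunciation can be pictured as a small lookup. The following Python sketch is only an illustration and is not part of any system described in this paper: the table HOMOGRAPHS, the feature labels and the function pronounce are hypothetical names, and the transcriptions are the ones quoted in the paragraph above.

```python
# Hypothetical table: each homograph lists (required feature, transcription) pairs.
HOMOGRAPHS = {
    "record": [("noun", "'rekɔ:d"), ("verb", "rɪ'kɔ:d")],     # grammatical category
    "read": [("infinitive", "ri:d"), ("preterite", "red")],   # inflexional feature
    "lead": [("metal", "led"), ("other", "li:d")],            # semantic feature
}


def pronounce(word, features):
    """Return the transcription whose required feature is among `features`."""
    for feature, transcription in HOMOGRAPHS[word]:
        if feature in features:
            return transcription
    raise KeyError(f"no reading of {word!r} matches {sorted(features)}")


print(pronounce("record", {"noun"}))
print(pronounce("read", {"verb", "preterite"}))
```

A text-to-speech system has to recover these features by parsing, whereas a generator already knows them when it chooses the word.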
In French, words in context raise the additional problem of
liaison. A liaison occurs between a word ending in a mute
consonant and a word beginning with a vowel. For example,
the n in mon is pronounced in mon arrivée but mute in mon
départ. However, a liaison is made only if this phonological
condition is accompanied by syntactic conditions. For
example, a liaison is made between a determiner and a noun
as in mon arrivée (my arrival), but not between a subject
and a verb as in Le limon arrive (The silt is coming).
To sum up, the phonetic conversion of French 
texts relies on syntactic knowledge to deal with homo- 
graphs and liaisons. 
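The two conditions on liaison can be made concrete with a minimal Python sketch. It is an assumption-laden illustration, not the method of any system cited here: the names VOWELS, LATENT_FINALS, LIAISON_ALLOWED and liaison are hypothetical, the phonological test is approximated on spelling, and the syntactic table contains only the determiner-noun pair mentioned above.

```python
VOWELS = set("aeiouyàâéèêëîïôùû")
# Assumption: final letters that are usually mute and may surface in a liaison.
LATENT_FINALS = set("dnpstxz")
# Only the category pair explicitly allowed in the text; a real grammar needs many more.
LIAISON_ALLOWED = {("DET", "N")}


def liaison(word1, cat1, word2, cat2):
    """True when both the phonological and the syntactic condition hold."""
    phonological = word1[-1].lower() in LATENT_FINALS and word2[0].lower() in VOWELS
    syntactic = (cat1, cat2) in LIAISON_ALLOWED
    return phonological and syntactic


print(liaison("mon", "DET", "arrivée", "N"))   # determiner + vowel-initial noun
print(liaison("mon", "DET", "départ", "N"))    # next word starts with a consonant
print(liaison("limon", "N", "arrive", "V"))    # subject + verb: syntactically excluded
```

The third call shows why spelling alone is not enough: limon arrive satisfies the phonological condition but not the syntactic one.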
1.2. Prosody 
A text-to-speech system presupposes the storage of minimal
acoustic units that allow the reconstruction of the acoustic
signal for any sentence. One solution consists in choosing
diphones as acoustic units. A diphone is defined as a
segment that goes from the steady state of a phonetic
segment to the steady state of the following segment and
that therefore contains the whole transitional part between
two consecutive sounds (there are about 1,200 diphones for
French).
Furthermore, increasing the naturalness of synthetic speech
requires taking into account prosodic factors, namely
stress, timing (structuring of the utterance by pauses) and
intonation. Intonation is characterized by the interaction
of three parameters: the evolution of intensity and of
laryngeal frequency as functions of duration.
The prosodic behaviour of one speaker was therefore
subjected to a systematic study, and an acceptable model was
extracted from this behaviour. The prosodic processing
(Emerard 1977) is based on the allocation of prosodic
markers (e.g. [=], [#]) at different points in a sentence.
Fifteen prosodic markers were considered to be sufficient
for determining suitable prosodic contours for the synthesis
of French. Each marker assigns a melody and a rhythm to each
syllable of the preceding word. More precisely, each marker
may
- cause an interruption in the diphone concatenation,
- introduce a pause,
- affect to varying degrees the amplitude of laryngeal
frequency (F0) on the last vowel of the word,
- determine rising or falling F0 movements.
The choice of a marker after a constituent is determined
both by the syntactic category of the constituent (verbal
syntagm, subordinate clause) and by its location inside the
sentence, especially by the existence of a more or less
complex right context. In the simple declarative sentence
Jean part (John is leaving), the prosodic processing has to
give the following result: Jean [#] part [.]. Nevertheless,
it is not possible to derive the following prosodic rules
from this example:
[#] is the marker assigned to [end of subject noun phrase]
[.] is the marker assigned to [end of verbal syntagm]
because in the declarative sentence Jean part et Marie vient
(John is going away and Mary is coming), the prosodic
processing has to propose: Jean [:] part [,] et Marie [#]
vient [.]. A comparison of these two sentences clearly shows
that it is not possible to assign a specific marker after a
constituent only on the basis of its syntactic category. It
is necessary to take its right context into account.
Moreover, prosodic markers must be placed in a hierarchical
manner. For example, the marker between the preverbal phrase
and the verbal syntagm depends on the marker assigned at the
end of the clause containing them; this last marker depends
in turn on the marker assigned at the end of the sentence
containing the clause.
To sum up, the issue of prosody is handled by 
placing appropriate markers in appropriate locations. 
This can only be done when precise syntactic informa- 
tion is available. 
2. Production of a phonetic chain with prosodic 
markers 
The system which translates a semantic repre- 
sentation into a phonetic chain with prosodic markers 
has been built from a written text generation system 
(Danlos 1986) that has been modified and completed. Let 
us start with a brief description of this generator. 
2.1. The generator 
The generator is modularized into a strategic
component and a syntactic component. From a semantic 
representation such as 
(1) EVENT:ACT =: GIVE-PRESENT
      ACTOR = HUM1 =: HUMAN
                NAME = Jean
      OBJECT = TOK1 =: FLOWER
                TYPE = anemone
      DATIVE = HUM2 =: HUMAN
                NAME = Marie
      GOAL =: HAPPY
                OBJECT = HUM2
the strategic component makes conceptual decisions (e.g. the
decision about the order of the information) and linguistic
decisions (e.g. the decision about the number of sentences)
(Danlos 1984 a and b). The output of this component is a
"text template" (TT) that indicates
1) the splitting up of the text into sentences:
TT = (Sentence1. Sentence2.)
2) for each sentence, its structure in terms of main clause
and subordinate clauses:
Sentence1 = (Clause1 (SUB (CONJ pour que) Sentence3))
Sentence3 = Clause2
3) for each clause, its main verb with its complementation:
Clause1 = ((SUBJECT HUM1) (VERB offrir)
           (OBJECT TOK1) (A-OBJECT HUM2))
Clause2 = ((SUBJECT HUM1) (VERB rendre)
           (OBJECT HUM2) (ATTRIBUTE heureux))
A text template is turned into a text by the syntactic
component. This component applies grammar rules (e.g.
reduction of a subordinate clause to an infinitive form),
synthesizes the tokens and performs the morphological
routines. For these operations to be carried out, a text
template includes, for each sentence, syntactic information
that is represented in a tree whose nodes are syntactic
categories such as S (sentence), CL (clause), SUBJECT or
VERB. A text template may be made up of several sentences;
however, we will give an example with a single sentence
because phonetic conversion and the entering of prosodic
markers are performed within a sentence, independently of
the other sentences. From the semantic representation (1),
the text template may be:
(2) ((S (CL (SUBJECT HUM1) (VERB offrir)
            (OBJECT TOK1) (A-OBJECT HUM2))
        (SUB (CONJ pour que)
             (S (CL (SUBJECT HUM1) (VERB rendre)
                    (OBJECT HUM2)
                    (ATTRIBUTE heureux))))).)
The syntactic component turns it into a tree whose leaves
are words:
((S (CL (SUBJECT (NP (N Jean)))
        (VERB a offert)
        (OBJECT (NP (DET des) (N anémones)))
        (A-OBJECT (NP (PREP à) (N Marie))))
    (SUB (S (CL (CONJ pour) (PPV la)
                (VERB rendre)
                (ATTRIBUTE heureuse))))).)
The erasing of the auxiliary vocabulary leads to:
Jean a offert des anémones à Marie pour la rendre heureuse.
(John offered anemones to Mary to make her happy.)
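Erasing the auxiliary vocabulary amounts to reading the leaves of the tree from left to right. The Python sketch below illustrates this under stated assumptions: the tree is encoded as nested tuples whose first element is the category label, and the names WORD_TREE and leaves are hypothetical; the generator itself is written in Lisp and may proceed differently.

```python
# The word tree above, encoded as (label, child, child, ...); leaves are strings.
WORD_TREE = (
    "S",
    ("CL",
     ("SUBJECT", ("NP", ("N", "Jean"))),
     ("VERB", "a", "offert"),
     ("OBJECT", ("NP", ("DET", "des"), ("N", "anémones"))),
     ("A-OBJECT", ("NP", ("PREP", "à"), ("N", "Marie")))),
    ("SUB",
     ("S", ("CL",
            ("CONJ", "pour"), ("PPV", "la"),
            ("VERB", "rendre"),
            ("ATTRIBUTE", "heureuse")))),
)


def leaves(node):
    """Collect the words of a tree left to right, dropping category labels."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:          # node[0] is the category label
        words.extend(leaves(child))
    return words


sentence = " ".join(leaves(WORD_TREE)) + "."
print(sentence)
```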
The syntactic component contains a morphological module
(Courtois 1984) that works out an inflected form (e.g.
heureuse, the feminine singular of heureux) given a basic
form (e.g. heureux) and inflexional features (e.g. feminine,
singular). This module is based on a dictionary that
indicates an inflexion mode for each basic form. Each
inflexion mode is associated with a rule that computes
inflected forms.
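The organization just described (a dictionary giving an inflexion mode for each basic form, and one rule per mode) can be sketched as follows. This is a toy illustration only: the mode names "regular" and "eux", the two dictionary entries and the function names are invented for the example and are not taken from the dictionary of Courtois (1984).

```python
def mode_regular(base, gender, number):
    """Hypothetical mode: feminine adds -e, plural adds -s (petit, petite, petits)."""
    form = base + ("e" if gender == "feminine" else "")
    return form + ("s" if number == "plural" else "")


def mode_eux(base, gender, number):
    """Hypothetical mode for adjectives in -eux (heureux, heureuse, heureuses)."""
    if gender == "masculine":
        return base                   # heureux is unchanged in the plural
    form = base[:-1] + "se"           # heureux -> heureuse
    return form + ("s" if number == "plural" else "")


# Each inflexion mode is associated with a rule ...
INFLEXION_RULES = {"regular": mode_regular, "eux": mode_eux}
# ... and the dictionary indicates a mode for each basic form.
DICTIONARY = {"petit": "regular", "heureux": "eux"}


def inflect(base, gender, number):
    mode = DICTIONARY[base]
    return INFLEXION_RULES[mode](base, gender, number)


print(inflect("heureux", "feminine", "singular"))
```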
The only modification made to the text generation system was
to replace the morphological module with a morpho-phonetic
module that performs both inflexion and spelling-to-sound
conversion. With this modification, the syntactic component
produces a tree whose leaves are phonetic words.
2.2. Inflexion and phonetic conversion
A French morpho-phonetic system has been built to compute an
inflected phonetic form given an orthographic basic word and
inflexional features (Laporte 1986). This system uses an
intermediate phonological representation devised to optimize
not only word inflexion and phonetic conversion but also
liaison processing. The system works in the following way:
given a basic orthographic form (e.g. heureux), its
syntactic category and inflexional features (e.g. adjective,
feminine, singular), a phonological dictionary works out its
phonological representation (e.g. ørøz). The word is then
inflected (e.g. ørøz) by means of a set of rules. These
rules for phonological inflexion are much simpler than those
that would be required for inflecting orthographic or
phonetic words. By way of illustration, the feminine of the
following adjectives: bon, grand, gros, léger, petit, pris,
sot, vu can be obtained from their phonological
representation with only 1 rule, whereas 3 would be required
when starting from their orthographic representation and 8
from their phonetic representation (Laporte 1984). The shift
from phonological words to phonetic words entails knowing
where liaisons should take place. Recall that a liaison
takes place when both syntactic and phonological conditions
are satisfied. In the semantic-representation-to-speech
system, the syntactic tree of the sentence allows us to
place liaison markers at the points where a liaison is
syntactically allowed. The conversion of phonological words
into phonetic words is then performed by a set of
straightforward rules that check the phonological conditions
of liaisons at the points where a liaison marker is present.
Laporte's system is represented in Fig. 1.
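To give an idea of why an intermediate phonological level simplifies both inflexion and liaison, here is a deliberately small Python sketch. It does not reproduce Laporte's representation or rules: the notation (a latent final consonant written in parentheses), the two entries and the function names are assumptions made for the illustration.

```python
import re

# Hypothetical phonological entries: a latent final consonant is written in parentheses.
PHONOLOGICAL = {"petit": "pəti(t)", "heureux": "øRø(z)"}

LATENT = re.compile(r"\((\w)\)$")


def feminine(phonological):
    """Single toy rule: the latent final consonant becomes a stable consonant."""
    return LATENT.sub(r"\1", phonological)


def to_phonetic(phonological, liaison=False):
    """Realize a latent consonant only where a liaison takes place; drop it otherwise."""
    return LATENT.sub(r"\1" if liaison else "", phonological)


print(to_phonetic(PHONOLOGICAL["petit"]))                 # isolated masculine form
print(to_phonetic(PHONOLOGICAL["petit"], liaison=True))   # masculine form in liaison
print(to_phonetic(feminine(PHONOLOGICAL["heureux"])))     # feminine form
```

In this toy setting one rule yields the feminine of both adjectives, and the liaison decision is left to a later step that only needs to know whether a liaison marker is present.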
[Fig. 1: Laporte's morpho-phonetic system. The diagram
relates basic words and inflected words in their
orthographical, phonological and phonetic representations,
by means of a dictionary and of rules. Figures shown in the
diagram: 50,000; 150,000; 350,000.]
From the text template (2), the syntactic component with the
morpho-phonetic module outputs the following tree:
(3) ((S (CL (SUBJECT (NP (N ʒɑ̃)))
            (VERB a ɔfɛR)
            (OBJECT (NP (DET de) (N zanemɔn)))
            (A-OBJECT (NP (PREP a) (N maRi))))
        (SUB (S (CL (CONJ puR) (PPV la) (VERB Rɑ̃dR)
                    (ATTRIBUTE øRøz))))).)
All the segmental phenomena have been taken into 
account and the next operation consists in entering
prosodic markers in such a tree. 
2.3. The prosodic component²
Our prosodic system is based on syntax. However, the
relation between the syntax and the prosody of a sentence is
not isomorphic. For example, the syntactic structures of
Jean est parti à Paris (John went to Paris) and Il est parti
à Paris (He went to Paris) are nearly identical, whereas
there is a prosodic marker after the noun Jean and none
after the pronoun il. Conversely, the syntactic
representations of Jean a parlé de ce problème à Marie (John
spoke about this problem to Mary) and Jean a parlé de ce
problème à Paris (John spoke about this problem in Paris)
are different although their prosodic markers are identical.
As a consequence, we had to build a complete
syntactico-prosodic grammar for French³. This grammar
enables us to obtain a structure of a sentence that is
isomorphic to its prosodic structure and computable from its
syntactic structure. A syntactico-prosodic category
corresponds
- either to a syntactic category (e.g. the
syntactico-prosodic category S is equivalent to the
syntactic category S),
- or to a sequence of syntactic categories (e.g. the
prosodic category POV [post-verbal phrase] groups together
all the complements which appear after the verbal syntagm
[VS], and the prosodic category PRV [pre-verbal phrase]
groups all the complements which appear before the VS),
- or to several syntactic categories (e.g. the prosodic
category VC [verbal complement] corresponds to the following
syntactic categories: SUBJECT, OBJECT, A-OBJECT and
ATTRIBUTE).
² This work was supported by CNET under contract no. 857B068 with
LADL.
³ This solution was also considered by Martin (1979).
The first operation performed in the prosodic
component thus consists in transforming the syntactic 
tree produced by the syntactic component into a syntac- 
tico-prosodic tree. From (3), this operation produces 
the following tree, in which the leaves are written in 
spelling representation for readability: 
(4) ((S (CL (PRV (VC (NP (N Jean))))
            (VS a offert)
            (POV (VC (NP (DET des) (N anémones)))
                 (VC (NP (PREP à) (N Marie)))))
        (SUB (S (CL (CONJ pour) (VS la rendre)
                    (VC heureuse))))).)
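The regrouping of syntactic categories into syntactico-prosodic ones can be sketched for a single clause as follows. This Python fragment is a simplified illustration, not the grammar used in the system: it assumes the nested-tuple encoding (label first), handles only the categories that appear in the example, and always wraps post-verbal complements in POV, whereas tree (4) shows the complement of the subordinate clause directly under CL. The names VC_CATEGORIES and to_prosodic_clause are hypothetical.

```python
VC_CATEGORIES = {"SUBJECT", "OBJECT", "A-OBJECT", "ATTRIBUTE"}


def to_prosodic_clause(clause):
    """Rewrite one syntactic clause (label, child, ...) with prosodic categories."""
    children = clause[1:]
    verb_at = next(i for i, child in enumerate(children) if child[0] == "VERB")
    conj, pre, verbal, post = [], [], [], []
    for i, child in enumerate(children):
        category = child[0]
        if category == "CONJ":
            conj.append(child)
        elif category in ("PPV", "VERB"):
            verbal.extend(child[1:])             # preverbal pronoun + verb form the VS
        elif category in VC_CATEGORIES:
            vc = ("VC",) + child[1:]             # several categories map to VC
            (pre if i < verb_at else post).append(vc)
        else:
            raise ValueError(f"category not covered by this sketch: {category}")
    result = ["CL"] + conj
    if pre:
        result.append(("PRV",) + tuple(pre))     # complements before the VS
    result.append(("VS",) + tuple(verbal))
    if post:
        result.append(("POV",) + tuple(post))    # complements after the VS
    return tuple(result)


MAIN_CLAUSE = (
    "CL",
    ("SUBJECT", ("NP", ("N", "Jean"))),
    ("VERB", "a", "offert"),
    ("OBJECT", ("NP", ("DET", "des"), ("N", "anémones"))),
    ("A-OBJECT", ("NP", ("PREP", "à"), ("N", "Marie"))),
)
print(to_prosodic_clause(MAIN_CLAUSE))
```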
Besides the syntactico-prosodic grammar, a function SEG-C
has been designed for each syntactico-prosodic category C.
Such a function takes two arguments: a constituent [X] of
the category C and the prosodic marker x that is to appear
to the right of [X]. It computes the prosodic markers that
have to be entered in [X]. More precisely, if the
syntactico-prosodic analysis of [X] is:
[X] = ([X_1] [X_2] ... [X_n])
then:
(SEG-C [X] x) = ([X_1] x_1 [X_2] x_2 ... [X_{n-1}] x_{n-1} [X_n] x)
where x_1, x_2, ... x_{n-1} are the appropriate markers. As
an illustration, the grammar lays down that
[CL] = (CL [CONJ]¹ [PRV]¹ [VS] [POV]¹)
where the sign "¹" following an element means that the
element is either absent or present once. The function
(SEG-CL [CL] x) indicates that
- when [PRV] is present, a marker f(x) must be entered
after it;
- when [POV] is present, a marker g(x) must be entered
after [VS];
- in any case, x is after the last constituent, i.e. [POV]
when present, [VS] otherwise.
The algorithm for entering the markers works in a recursive
manner by means of a function SEG. Given a constituent [X]
and the marker x that is to appear to the right of [X], this
function figures out the category C of [X] and calls
(SEG-C [X] x). Next, the functions
(SEG-C_1 [X_1] x_1), (SEG-C_2 [X_2] x_2), ..., (SEG-C_n [X_n] x)
are called. For example, after (SEG-CL [CL] x) has been
called, the entering of the markers into [PRV], when
present, is executed by
(SEG [PRV] f(x)) = (SEG-PRV [PRV] f(x)).
When [POV] is present, the functions (SEG [VS] g(x)) and
(SEG [POV] x) are called; otherwise the function
(SEG [VS] x) is called. The function SEG is first applied to
the root of the syntactico-prosodic tree of the sentence
involved and to its final punctuation mark
("." "," "?" ";" ":"), which corresponds to a prosodic
marker. When the recursion is over, the auxiliary vocabulary
is erased, leaving a phonetic chain with prosodic markers.
As an example, the function SEG applied to (4) leads to the
following result:
(5) ʒɑ̃ [=] a ɔfɛR [$] de zanemɔn [=] a maRi [,] puR la
Rɑ̃dR [$] øRøz [.]
(Jean [=] a offert [$] des anémones [=] à Marie [,] pour la
rendre [$] heureuse [.])
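The recursive scheme can be sketched in Python as follows. The control structure follows the description of SEG and SEG-CL given above, but everything else is an assumption: the paper does not give the definitions of f and g nor the rules for the other categories, so the marker choices below are placeholders picked only so that the toy run yields a marked string of the same shape as (5). Orthographic words stand in for phonetic ones, and all Python names are hypothetical.

```python
def f(x):
    return "[=]"        # placeholder for the marker entered after [PRV]


def g(x):
    return "[$]"        # placeholder for the marker entered after [VS]


def default_rule(children, x):
    """Categories without a specific rule pass x on to their last constituent."""
    return [None] * (len(children) - 1) + [x]


def seg_s(children, x):
    # Placeholder: every non-final constituent of a sentence is followed by "[,]".
    return ["[,]"] * (len(children) - 1) + [x]


def seg_cl(children, x):
    markers = []
    for i, child in enumerate(children):
        if i == len(children) - 1:
            markers.append(x)            # x always follows the last constituent
        elif child[0] == "PRV":
            markers.append(f(x))
        elif child[0] == "VS":
            markers.append(g(x))
        else:
            markers.append(None)         # e.g. CONJ: no marker entered after it
    return markers


def seg_pov(children, x):
    # Placeholder: "[=]" between verbal complements, x after the last one.
    return ["[=]"] * (len(children) - 1) + [x]


SEG_RULES = {"S": seg_s, "CL": seg_cl, "POV": seg_pov}


def seg(node, x):
    """Return the words of `node` with markers entered, ending with marker x."""
    label, children = node[0], node[1:]
    if all(isinstance(child, str) for child in children):
        return list(children) + ([x] if x else [])
    markers = SEG_RULES.get(label, default_rule)(children, x)
    result = []
    for child, marker in zip(children, markers):
        result.extend(seg(child, marker))
    return result


TREE_4 = (
    "S",
    ("CL",
     ("PRV", ("VC", ("NP", ("N", "Jean")))),
     ("VS", "a", "offert"),
     ("POV",
      ("VC", ("NP", ("DET", "des"), ("N", "anémones"))),
      ("VC", ("NP", ("PREP", "à"), ("N", "Marie"))))),
    ("SUB", ("S", ("CL", ("CONJ", "pour"), ("VS", "la", "rendre"), ("VC", "heureuse")))),
)
print(" ".join(seg(TREE_4, "[.]")))
```

The point of the design is that each constituent is processed knowing the marker that will follow it, which is how the right context and the hierarchy of markers are taken into account.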
3. Algorithm and results
The phonetic chain with prosodic markers produced by the
system is forwarded to the speech synthesizer developed at
CNET (Courbon & Emerard 1982). The chart in Fig. 2 depicts
the whole algorithm for generating spoken messages from
semantic representations.
Fig. 2:
(1) semantic representation
      | strategic component
(2) text template
      | syntactic component with a morpho-phonetic module
(3) syntactic tree whose leaves are phonetic words
      | syntax-to-prosody module
(4) syntactico-prosodic tree whose leaves are phonetic words
      | prosodic marker module
(5) phonetic string with prosodic markers
      | speech synthesizer
    spoken message
An implementation of the system has been developed in
COMMON-LISP in the domain of newspaper reports of terrorist
crimes. It produces phonetic chains with prosodic markers
such as the ones shown below. Again, orthographic words
replace phonetic symbols for readability. The syntactic
conditioning of liaisons is marked with the sign [-]. We
present three syntactically different versions of the same
terrorist crime to emphasize the robustness of the syntactic
component and the entering of appropriate prosodic markers
according to syntax.
Version 1. Indira Gandhi [#] a été assassinée [$] mercredi à
New-Delhi [.] Des [-] extrémistes sikhs [=] ont tiré [@] sur
[-] le premier ministre indien [,] alors que [-] elle [-]
partait [$] de [-] son domicile [=] à pied [*] pour se
rendre [$] à [-] son bureau [.]
(Indira Gandhi was assassinated Wednesday in New Delhi. Sikh
extremists shot the Indian Prime Minister as she was leaving
her home on foot to go to her office.)
Version 2. Des [-] extrémistes sikhs [#] ont assassiné [@]
Indira Gandhi [*] mercredi à New-Delhi [.] Ils [-] ont tiré
[@] sur [-] le premier ministre indien [,] alors que [-]
elle [-] partait [$] de [-] son domicile [=] à pied [*] pour
se rendre [$] à [-] son bureau [.]
(Sikh extremists assassinated Indira Gandhi Wednesday in New
Delhi. They shot the Indian Prime Minister as she was
leaving her home on foot to go to her office.)
Version 3. Mercredi à New-Delhi [,] des [-] extrémistes
sikhs [=] ont assassiné [@] Indira Gandhi [,] en tirant [@]
sur [-] le premier ministre indien [*] alors que [-] elle
[-] partait [$] de [-] son domicile [=] à pied [*] pour se
rendre [$] à [-] son bureau [.]
(Wednesday in New Delhi, Sikh extremists assassinated Indira
Gandhi by shooting the Indian Prime Minister as she was
leaving her home on foot to go to her office.)
Conclusion 
The semantic-representation-to-speech system 
developed in COMMON-LISP produces a spoken message 
of about 35 words in less than 1 second. 
In our system, only the strategic component is 
domain dependent. The lexicon and discourse structures 
used to build the text templates are domain dependent 
linguistic data. The rest of the system is domain 
independent. Let us recapitulate the data and rules 
integrated in it:
- a syntactic component which can apply the French 
grammar rules whatever the structure of the texts and 
the syntax of the sentences; 
- a complete phonological dictionary of the 50,000 basic
forms of French and a set of rules for obtaining a 
phonetic text from a phonological text; 
- a complete syntactico-prosodic grammar of French and 
a set of rules that enable us to enter prosodic markers 
in a sentence whatever the syntax of the sentence; 
- a speech synthesizer and a synthesis software. 
Of course, these data and rules are only valid for French 
but it must be clear that the same kind of data is 
required for other languages and that the algorithm 
should be similar. 
Bibliography

COURBON, J. L., & EMERARD, F., 1982, "SPARTE: A
Text-to-Speech Machine Using Synthesis by Diphones", IEEE
Int. Conf. ASSP, pp. 1597-1600, Paris.

COURTOIS, B., 1984, "DELAS : Dictionnaire Electronique du
LADL, Mots Simples", Rapport technique du LADL, n° 12.

DANLOS, L., 1984 a, "Conceptual and Linguistic Decisions in
Generation", in Proceedings of COLING 84, Stanford
University, California.

DANLOS, L., 1984 b, "An Algorithm for Automatic
Generation", in Proceedings of ECAI 84, T. O'Shea ed.,
Elsevier Science Publishers BV, Amsterdam.

DANLOS, L., 1986, The Linguistic Bases of Text Generation,
Cambridge University Press, Cambridge.

EMERARD, F., 1977, Synthèse par diphones et traitement de la
prosodie, Thèse de troisième cycle, Université de
Grenoble III.

LAPORTE, E., 1984, "Transductions et phonologie", DEA,
Université de Paris 7.

LAPORTE, E., 1986, "Application de la morpho-phonologie à la
production de textes phonétiques", Actes du séminaire
"Lexiques et traitement automatique des langages", Toulouse.

MARTIN, Ph., 1979, "Un analyseur syntaxique pour la synthèse
du texte", Actes des 10e Journées d'Études sur la Parole,
pp. 227-236, Grenoble.

SANTOS, J. M., & NOMBELA, J. R., 1982, "Text-to-Speech
Conversion in Spanish: A Complete Rule-Based Synthesis
System", IEEE Int. Conf. ASSP, pp. 1593-1596, Paris.
