Segmentation and Labelling of Slovenian Diphone Inventories* 
Jerneja Gros, Ivo Ip~i~, Simon Dobri~ek, France Miheli~, Nikola Pavegi~ 
Faculty of Electrical Engineering 
University of Ljubljana 
Tr~agka cesta 25 
SI-1000 Ljubljana 
Slovenia 
jerneja.gros @ fe.uni-lj.si 
Abstract 
Preparation, recording, segmentation and pitch 
labelling of Slovenian diphone inventories are 
described. A special user friendly intert'ace 
package was developed in order to facilitate 
these operations. As acquisition of a labelled 
diphone inventory or adaptation of a speech 
synthesis system to synthesise further voices 
is manually intensive, an automatic procedure 
is required. A speech recogniser, based on 
Hidden Markov Models in forced segmenta- 
tion mode is used to outline phone boundaries 
within spoken logatoms. A statistical evalua- 
tion of manual and automatic segmentation dis- 
crepancies is performed so as to estinmte the 
reliability of automatically derived labels. Fi- 
nally, diphone boundaries are determined and 
pitch markers are assigned to voiced sections 
of the speech signal. 
1 Introduction 
For the Slovenian language, several attempts were 
made in the past, where different aspects of a Slove- 
nian text-to-speech synthesis (TI'S) system were co- 
vered (Dobnikar95). Nevertheless, none of them suc- 
ceeded in building a complete system, providing high 
quality synthetic speech. In the Laboratory of Artifi- 
cial Pemeption, we started on text-to-speech synthesis 
one year ago (Gros96). Here we describe the acquisi- 
tion of an appropriate diphone inventory in a first ver- 
sion of our Slovenian TTS system, which is supposed 
to serve as a reference system for future improvements. 
We start with a brief overview of the different mo- 
dules el' the Slovenian TTS system, then we go on to 
describe how the existing diphone inventory was ob- 
tained. The acquisition era labelled diphone inventory 
or adaptation of a speech synthesis system to synthe- 
sise new voices is manually intensive and prone to er- 
rors, therefore automatic procedures are required (Tay- 
lor91, Sehmidt93, Cosi91, Ottesen93). In section 4 we 
*This work was partly funded by the Commission of the 
European Community under COP-94 contract No 01634 
(SQEL) 
explain how we intend to automatically derive addi- 
tional diphone inventories for building new synthetic 
voices. 
2 Slovenian TTS system 
Tile different phases of the text-to-speech transforma- 
tion are performed by separate independent modules, 
operating sequentially, as shown in Figure 1. Thus in- 
put text is gradually transformed into its spoken equi- 
valent, 
Graphemc-to-phoneme transcription 
First, abbreviations are expanded to t'ornl equiva- 
lent full words using a special list of lexical en- 
tries. A text pre-processor converts further special for- 
mats, like numbers or dates, into standard graphemic 
strings. Next, word pronunciation is derived, based on 
a user extensible pronunciation dictionary and letter- 
to-sound rules. The dictionary is supposed to cover the 
most fl'equent words in a given hmguage and a second 
dictionary helps with pronouncing proper names. 
Prosody generation 
Prosody generation assigns the sequence of allo- 
phones with some of their prosodic parameters (pitch 
l'requency, duration). First, words are syllabitied by 
counting the nmnber of their vowel clusters and dura- 
tion of syllables is modelled according to the speaker's 
normal articulation rate, depending on the number of 
syllables within a word and on the word's position 
within a phrase. Then, segmental prosodic parame- 
te~w are determined tbr each allophone on the basis of 
the accent position within a word and its type. Finally, 
the global intonation contour of a phrase is determined 
(Sorin87). 
Diphone Concatenation 
Once the appropriate phonetic symbols and prosody 
markers are determined, the final step is to produce 
audible speech by assembling elemental speech units, 
computing pitch and duration contours, and synthe- 
sising the speech waveform. A concatenative TD- 
PSOLA diphone synthesis technique was used, allow- 
ing high-quality pitch and duration transformations di- 
rectly on the waveform (Moulines9(/). 
298 
ASCII text ~ 
"'°":72'""i- ........ / J 
pr( ,'ody lules 
nllcroprosody 
.... l'tl!es . 
nlacroprosody 
• . rules _ _ . 
dii;h,,,m ! 
,. !!!veIlt°!Y 
speech OUtlmt ~< " @','))~'} 
l"igure I: Slovenian text-to-speech xystem architec- 
tu re. 
diphone inventory cotnprising 955 pitch-labelled di- 
phones was created. In order to guarantee optimal 
synthesis quality, a neutral phonetic context in which 
the diphones needed to be located, was speeitied. Un- 
favourable positions, like inside stressed syllables or 
in over-a,ticulated contexts, were excluded. The di- 
phones were placed in the middle of logatoms, pro- 
nounced with a steady intonation. The exception is in 
the case where the silence phone is part of the required 
pair: there the diphone was word initial or word final. 
Speech signals were recorded by a close talking nil- 
crophone using a sampling rate of 16 kHz and 16 bit 
linear AID conversion. 
3 Slovenian Diphone Inventory 
In concatenation systems, both the choice and the 
proper segmentation of the traits to be concatenated 
play a key role. Acoustic differences between stored 
and requested segments, its well as acoustic disconti- 
nuities at the boundaries belween adjacent segments 
have to be minimised. Dipt,one units are most corn-. 
monly adopted as a compromise between the size of 
the unit inventory and tile quality of synthetic speech. 
A diphonc is, generally speaking, a unit which starts 
in tim middle o1: one phone, passes through the transi~ 
tion to the next phone and ends in the middle of this 
next phone. So the transition between two phones is 
encapsulated and does not ueed to be calculated. 
Yet it is not clear whether speech segments should 
be extracted from nonsense plurisyllabic words, called 
Iogatoms, existing isolated words or meaningful sen- 
teuces. Even the question of a bust positioning oI' 
the units within the spoken corpus is still widely de- 
bated. Stressed syllables are longer, thus less submit- 
ted to coarticulation, which results in easily chainable 
units; while unstressed ones are more ntnnerous in nat- 
ural speech, so that producing them efficiently would 
both increase segmental qt, ality and reduce lrlcmory 
requirements. Likewise, coarticulalions are strongly 
subject to speaker's lluency, so that imposing a slow 
speaking rate results in more intelligible units. To a 
large extent these issues are part of a necessary trade- 
off between intelligibility and naturalness. 
One diphone tor every allophone combination pos- 
sible in a given language is required. A Slovenian 
Iqgure 2: Wawqbrm (above) and spectral (below) re- 
presentation of the diphone ac. Markers 1, and R are 
set at the pitch periods ~" the left part of the do, hone 
and of the right part, respectively. 
After the recording phase, logatoms were hand- 
segmented and tile center of tile transition between 
tile phones was marked, using information from both 
temporal and spectral representation of tile speech sip 
nal. A special user-friendly interface was developed 
for this purpose, allowing editing, scaling, viewing, llt- 
belling and pitch-marking of the speech signal, t,'irst 
the approxiumte neighbourhood of a diphone was de- 
termined, then a fine labelling of its boundaries was 
performed and the center of the phoneme transition 
was marked, l"inally, pitch markers were manually set 
for voiced parts of tile corresponding speech signal. 
t;igure 2 gives an example of the diphone am along 
with its spcctrtuu. 
'lb phonetically transcribe the logatom words we 
299 
vowels 
ei~ (I,I,I) 
" (e, e, e) 
e (E, ~, E) 
(3, 3, 3) 
(o, o, o) 
(o, o, o) 
(u, u, u) 
sonorants 
n N 
v-2 I W 
I L 
phone\[ model 
nonsonorants 
p (-, P) 
t (_, T) 
k (_, K) 
b (=,B) 
d (:, D) 
g (:, G) 
f F 
h H 
s S 
z Z 
2 
c (_, c, s) 
(-,C~) 
phone I model 
silence 
silence \[ 
Table l: List of phones and their corresponding submodels used for Slovenian Iogatom segmentation. Symbol = 
represents a voiced closure while symbol _ represents an unvoiced closure. 
used a set of 34 symbols for allophones, which 
we adapted to the SAMPA standard requirements 
(Fourcin89) t . 
While concatenating diphones into words it suddenly 
turned out that there was a large discrepancy between 
the duration of allophones, as suggested by the prosody 
module, and the actual corresponding diphone dura- 
tion stored in the diphone inventory. This happened 
due to the exaggerated eagerness of the speaker try- 
ing to pronounce the meaningless logatoms in a cor- 
rect and clear way. Consequently, the quality of the 
synthetic speech was considerably affected and we are 
therefore planning to record another diphone inven- 
tory. As the transformation range for prosodic speech 
parameters needed for synthesising naturally sounding 
speech is large, the recording should thus be carefully 
controlled to achieve medium pitch and duration va- 
lues. 
4 Automatic Diphone Segmentation 
Automatic speech segmentation procedures are power- 
fill tools for including new synthetic voices and for 
updating and supplementing existing diphone libraries 
whereas manual diphone segmentation is a tedious, 
time consuming task, prone to errors. Therelbre, in or- 
der to be able to synthesise speech in a variety of differ- 
ent voices, we decided to use procedures for automatic 
segmentation and pitch marking of spoken logatoms. 
The extraction of diphones from the recorded words 
is performed in two stages. The first stage is the 
phoneme segmentation of logatoms, yielding a start 
point, transition center and end point for each phone. 
The second part of the diphone extraction procedure is 
to find the concatenation point of each phone. 
~A list of Slovenian SAMPA symbols together with their 
audio samples is available on the WWW on the address 
"http:#1uz.fer.uni-lj.si/english/SQEL/sampa-eng.htlnl". 
Finally, pitch markers are to be determined for 
voiced parts of the signal. We intend to apply 
the SRPD (Super Resolution Pitch Determination) 
algorithm as it allows precise pitch determination 
(Medan91). 
Hidden Markov Model Phone Segmentation 
To solve the segmentation problem, methods for 
stochastic modelling of speech are used. Hidden 
Markov Models (HMMs) are stochastic finite-state au- 
tomata that consist of a finite number of states, mo- 
delling the temporal structure of speech, and a prob- 
abilistic function for each of the states, modelling the 
emission and observation of acoustic feature vectors 
(Rabiner89). 
To perform logatom segmentation we used the 
Isadora system, developed at the University Erlangen- 
Nuremberg (Schukat92). The Isadora system is a tool 
used for modelling of one dimensional patterns, like 
speech. It consists of modules for speech signal fea- 
ture extraction, hard or soft vector quantization and 
beam-search driven Viterbi training and recognition. 
The ls'adora system builds a large network of nodes 
that correspond to different speech events like phones, 
phonemes, words or sentences. The nodes are pro- 
vided with a dedicated HMM in order to acoustically 
represent the corresponding speech event. 
For system training, approximately half an hour of 
continuous speech recorded from a single speaker is 
required along with its orthographic transcription. The 
acoustical analyser delivers every milisecond a set of 
Mel fi'equency cepstral coefficients along with their 
slopes plus the energy of each frame. A phone level de- 
scription is obtained using the orthographic transcrip- 
tion and a pronunciation dictionary. In the initiali- 
sation step the feature vectors are classified into 64 
classes using a soft vector quantization technique. Us- 
ing a phonetically labelled vocabulary a Baum-Welch 
training procedure is applied, and parameters of mono- 
300 
Phoneme 
A 
B 
C (: 
1) 
E 
e 
3 
14 
G 
H 
I 
J 
K 
I, 
M 
N 
O 
o 
P 
R 
S 
T 
U 
V 
Z 
2_ 
Manual segmentation 
\[ms\] Ims\] 
82.00 30.80 
21.70 7.06 
75.90 17.60 
60.80 14.90 
24.50 11.30 
67.10 27.80 
83.80 20.60 
61.20 16.80 
88.30 16.10 
20.80 7.54 
82.40 41).20 
62.20 23.30 
41.40 17.80 
45.80 22.50 
47.50 14.90 
60.90 18.60 
44.00 18.70 
65.80 27.80 
97.80 24.50 
coniidence 
interval 
\[ms\] 
2.08 
1,07 
4.88 
4.69 
1.38 
2.33 
3.51 
4.66 
5.44 
1.6 
9.36 
1.75 
2.05 
3.06 
1.4 
2.43 
1.88 
2.59 
4.34 
Automatic segmentation 
x 
\[ms\] 
(7 
\[msl 
30.20 
18.40 
25.30 
19.50 
16.80 
26.80 
25.00 
18.90 
32.70 
10.70 
56.80 
27.70 
24.80 
25.60 
19.10 
24.90 
18.40 
27.90 
28.40 
24.50 
44.30 
77.60 
76.20 
31.60 
65.40 
54.60 
58.20 
63.40 
38.90 
43.70 
5.17 
I1.70 
27.70 
22.30 
13.90 
22.90 
10.30 
15.10 
11.51) 
16.00 
15.90 
0.7 I 
1.14 
3.57 
6.65 
1.2 
3.1 
1.8 
2.58 
3.84 
1.41 
0.97 
84.10 
42.30 
86.30 
69.50 
32.60 
76.50 
92.40 
87.70 
79.20 
28.90 
66.60 
78.70 
35.00 
61.00 
43.90 
51.90 
36.70 
77.40 
107.00 
39.41) 21.70 
39.40 l 7.10 
63.20 30.00 
76.70 29.20 
36.70 19.80 
75.30 25.40 
48.90 19.50 
49.70 18.30 
52.40 16.90 
27.10 I1.10 
25.90 11.30 
confidence 
interval 
\[ms\] 
2.02 
2.79 
7.02 
6.14 
2.05 
2.25 
4.27 
5.22 
I1.01 
2.52 
13.2 
2.08 
2.86 
3.49 
1.79 
3.24 
1.84 
2.61 
5.04 
2.96 
1.67 
3.87 
8.71 
1.72 
3.43 
3.39 
3.11 
5.62 
0.98 
0.69 
Number 
of 
samples 
852 
17l 
52 
41 
260 
545 
136 
52 
36 
73 
74 
679 
293 
210 
437 
231 
383 
442 
125 
209 
404 
234 
46 
510 
213 
130 
136 
37 
495 
1040 
"Iable 2: Average l)honeme duration, cot~fidence interval and xtandard deviation of the population,\[br manual and 
automatic segmentation. 
phone models ark obtained. By applying the Viterbi 
alignment procedure, the training logatoms are auto- 
matically labelled using our monophone inventory. 
Due to the properties ot:the Slovenian language some 
phones are composed of several phone components, 
like the stop consonants k,p,b,d,t and the affricates c 
and & Such phones are described by multiple sub- 
models. Table I gives the Slovenian phones and their 
corresponding submodels as they are used li)r logatom 
segmentation. 
A preliminary statistical evaluation of mantml and 
automatic segmentation discrepancies was performed 
on a much larger speech database than the logatom in- 
ventory itself as proposed in (Schmidt93). 150 spoken 
sentences were extracted fi'om the Slovenian speech 
corpus GOPOI,IS (l)obrigek96) concerning airflight 
timetable inquiries in total duration of 25 minutes. Av- 
erage duration, conlidence interwtl and standard devi- 
ation of the population for both manual and automatic 
segmentation are presented in Table 2. 
The discrepancies between manual and automatic 
segmentation are considerable. Most of the problems 
arise when detecting bursts of plosives as tile automatic 
procedure tends to shorten their closures considerably. 
The situation improves when plosives are taken as a 
whole, closures and bursts together. 
As a result, a fully automatic seglnentation of speech 
segments is hardly conceivable in the context of con- 
catenation synthesis. As most phonological units orig- 
inate via phonological considerations rather than on 
acoustic grounds, isolating them requires a deep prior 
knowledge of their specilic features. Unsupervised 
segmentation, i.e. segmentation on acoustic princi- 
ples only, often results in segments and sub-segments 
boundaries being misplaced, or just missing, while un- 
defined ones appear. However, it can be used as a seg- 
mentation outline, the retinement of which has to be 
performed by a human expert. 
301 
Figure 3: Automatic (above) and manual (center and below) segmentation of the logatom inacu~. 
Thus automatic procedures can speed up the segmen- 
tation process, but they are not likely to suppress man- 
ual corrections, at least for obtaining highest synthesis 
quality with a given corpus. 
Diphone boundaries determination 
As the concatenation point of the diphones corre- 
sponds to the center of the phone, it is somewhere in 
the steady region of the phone. By studying the dis- 
tances from the signal to the target values, (Ottesen91) 
claims that minimal distances tend to be just before the 
middle of the phoneme. We decided to divide each 
phoneme duration in a fixed ratio, 40 and 60%. Plo- 
sives are exception to this rule: they are divided just in 
front of the opening burst. A diphone boundary detec- 
tion algorithm, minimising spectral discontinuities at 
concatenation points (Taylor91) may be investigated. 
5 Conclusion 
Diphone inventory acquisition for the Slovenian lan- 
guage was discussed. In order to avoid the tedious 
time consuming manual segmentation of logatoms, an 
automatic procedure, based on HMM models is consi- 
dered. Thus diphone sets for new synthetic voices are 
easier to produce. Results of the statistical evaluation 
of manual and automatic segmentation discrepancies 
are given. 
We expect the whole process of creating a new wfice 
to be semi-automatic (with manual correction of stop- 
consonant boundaries), allowing the synthesiser to be 
retrained on a new voice in less than 3 days. 
302 
Acknowledgement 
The authors wish to thank 3bma~ Erjavec tbr proof- 
reading of the text and his usefld comments on the ar- 
ticle. 
References 
A. Dobnikar and J. Bakran. 1995. A new approach 
for Slovene text-to-speech synthesis. I'roceedings 
Mipro95. Opatija, Croatia. 265-268. 
J. Gros et al. 1996. A text-to-speech system for the 
Slovenian language. Euxipco96. Trieste, Italy. Ac- 
cepted for presentation. 
P. A. %~ylor and S. D. Isard. 199l. Automatic 
diphone segmentation, l'roceedings Eurospeech91. 
Genova, Italy. 709-711. 
M. S. Schmidt and G. S. Watson. 1993. The eval- 
uation and optimization of automatic speech segmen- 
tation. Proceedings Eumspeech93 Berlin, Germany. 
701-704. 
E Cosi et al. 1991. A preliminary statistical evalua- 
tion of manual and automatic segmentation discrepan- 
cies. Proceedings Eurospeech91 Genova, Italy. 693- 
696. 
C. Sorin et al. 1987. A Rhythm-Based Prosodic 
Parser for Text-to-Speech Systems in French. Pro- 
ceedings Xlth ICPhS. Tallin, Estonia. 125-128. 
E. Moulines and F. Charpentier. 1990. Pitch - Syn- 
chronous Waveform Processing Techniques for Text- 
to-Speech Synthesis Using Diphones. Speech Com- 
munication. 9:453-467. 
Y. Medan et al. 1991. Super resolution pitch de- 
termination of speech signals. IEITE Transactions on 
Signal Ptvcessing. 39(l):40-48. 
E. G. Schukat-Talamazzini et al. 1992. Acoustic 
modelling of subword units in the ISADORA speech 
recognizer, l'roceedings ICASSP92. San Francisco, 
USA. 57'7-580. 
I~. Rabiner. 1989. A Tutorial on Iiidden Markov 
Models and SelEcted Applications in Speech Recogni- 
tion. Proceedings ~f the IEEE. 77(2):257-289. 
G. E. Ottesen. 1991. An automatic diphone segmen- 
tation system. Proceedings Eurospeech93. Berlin, 
Germany. 713-716. 
A. Fourcin et al. 1989. Speech Input and OutputAs- 
sessment: Multilingual Methods and Standards. Ellis 
Horwood Limited, John Wiley & Sons, New York - 
Chichester - Brisbane - ~Ibronto. 
303 
