<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1051">
  <Title>Segmentation and Labelling of Slovenian Diphone Inventories*</Title>
  <Section position="3" start_page="0" end_page="298" type="metho">
    <SectionTitle>
2 Slovenian TTS system
</SectionTitle>
    <Paragraph position="0"> The different phases of the text-to-speech transformation are performed by separate independent modules, operating sequentially, as shown in Figure 1. Thus input text is gradually transformed into its spoken equivalent. Grapheme-to-phoneme transcription: First, abbreviations are expanded to form equivalent full words using a special list of lexical entries. A text pre-processor further converts special formats, like numbers or dates, into standard graphemic strings. Next, word pronunciation is derived, based on a user-extensible pronunciation dictionary and letter-to-sound rules. The dictionary is supposed to cover the most frequent words in a given language, and a second dictionary helps with pronouncing proper names.</Paragraph>
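The dictionary-first lookup with a letter-to-sound fallback described above can be sketched as follows; all dictionary entries and rules here are invented placeholders, not the system's actual Slovenian data:

```python
# Sketch of dictionary-first grapheme-to-phoneme lookup with a
# letter-to-sound fallback. All entries are illustrative placeholders,
# not the actual Slovenian TTS data.
LEXICON = {"dan": "d a n"}          # frequent-word dictionary
NAME_LEXICON = {"bled": "b l e d"}  # proper-name dictionary

LETTER_TO_SOUND = {  # naive one-letter-per-phone fallback rules
    "a": "a", "b": "b", "d": "d", "e": "e", "l": "l", "n": "n", "s": "s",
}

def transcribe(word):
    """Return a space-separated phone string for `word`."""
    word = word.lower()
    # dictionaries are consulted first, in order
    for lexicon in (LEXICON, NAME_LEXICON):
        if word in lexicon:
            return lexicon[word]
    # otherwise fall back to letter-to-sound rules, skipping unknown letters
    return " ".join(LETTER_TO_SOUND[ch] for ch in word if ch in LETTER_TO_SOUND)
```

A real system would of course use context-sensitive letter-to-sound rules rather than a one-to-one letter map.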
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Prosody generation
</SectionTitle>
      <Paragraph position="0"> Prosody generation assigns to the sequence of allophones some of their prosodic parameters (pitch frequency, duration). First, words are syllabified by counting the number of their vowel clusters, and the duration of syllables is modelled according to the speaker's normal articulation rate, depending on the number of syllables within a word and on the word's position within a phrase. Then, segmental prosodic parameters are determined for each allophone on the basis of the accent position within a word and its type. Finally, the global intonation contour of a phrase is determined (Sorin87).</Paragraph>
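The vowel-cluster counting used for syllabification can be sketched like this (the vowel set is a simplification for illustration, not the actual Slovenian rule set):

```python
# Counting maximal vowel clusters as a proxy for syllable count,
# as in the prosody module described above. The vowel set is a
# deliberate simplification.
VOWELS = set("aeiou")

def count_syllables(word):
    """Count maximal runs of consecutive vowels in `word`."""
    count, in_cluster = 0, False
    for ch in word.lower():
        if ch in VOWELS:
            if not in_cluster:      # a new vowel cluster starts here
                count += 1
            in_cluster = True
        else:
            in_cluster = False
    return count
```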
    </Section>
    <Section position="2" start_page="0" end_page="298" type="sub_section">
      <SectionTitle>
Diphone Concatenation
</SectionTitle>
      <Paragraph position="0"> Once the appropriate phonetic symbols and prosody markers are determined, the final step is to produce audible speech by assembling elemental speech units, computing pitch and duration contours, and synthesising the speech waveform. A concatenative TD-PSOLA diphone synthesis technique was used, allowing high-quality pitch and duration transformations directly on the waveform (Moulines90).</Paragraph>
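The PSOLA idea of time-scaling a waveform by repeating or dropping whole pitch periods can be illustrated in a much-simplified form. A real TD-PSOLA system windows each period and overlap-adds; this sketch only duplicates or drops periods between given pitch marks:

```python
# Greatly simplified PSOLA-style duration modification: the signal is
# cut at pitch marks into pitch periods, and periods are repeated or
# dropped to approach the target duration. A real TD-PSOLA
# implementation would window each period and overlap-add them.
def stretch(samples, pitch_marks, factor):
    """Return `samples` time-scaled by `factor` (values above 1 lengthen)."""
    # slice the signal into pitch periods
    bounds = [0] + list(pitch_marks) + [len(samples)]
    periods = [samples[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
    n_out = max(1, round(factor * len(periods)))
    out = []
    for j in range(n_out):
        # pick the source period whose relative position matches output slot j
        i = min(len(periods) - 1, int(j / factor))
        out.extend(periods[i])
    return out
```

With `factor=1` the signal is returned unchanged; with `factor=2` each pitch period is emitted twice, doubling the duration without altering the pitch period length itself.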
      <Paragraph position="1"> [Figure 1: the TTS system architecture.]</Paragraph>
      <Paragraph position="2"> A Slovenian diphone inventory comprising 955 pitch-labelled diphones was created. In order to guarantee optimal synthesis quality, a neutral phonetic context in which the diphones needed to be located was specified. Unfavourable positions, like inside stressed syllables or in over-articulated contexts, were excluded. The diphones were placed in the middle of logatoms, pronounced with a steady intonation. The exception is the case where the silence phone is part of the required pair: there the diphone was word-initial or word-final. Speech signals were recorded by a close-talking microphone using a sampling rate of 16 kHz and 16-bit linear A/D conversion.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="298" end_page="299" type="metho">
    <SectionTitle>
3 Slovenian Diphone Inventory
</SectionTitle>
    <Paragraph position="0"> In concatenation systems, both the choice and the proper segmentation of the units to be concatenated play a key role. Acoustic differences between stored and requested segments, as well as acoustic discontinuities at the boundaries between adjacent segments, have to be minimised. Diphone units are most commonly adopted as a compromise between the size of the unit inventory and the quality of synthetic speech.</Paragraph>
    <Paragraph position="2"> A diphone is, generally speaking, a unit which starts in the middle of one phone, passes through the transition to the next phone, and ends in the middle of this next phone. So the transition between two phones is encapsulated and does not need to be calculated.</Paragraph>
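Under this definition, the required inventory is one unit per ordered allophone pair, plus the silence contexts mentioned in Section 2. A sketch, using a small illustrative phone subset rather than the full 34-symbol Slovenian set:

```python
# Enumerating a required diphone inventory: one unit per ordered
# allophone pair, plus silence-phone pairs for word-initial and
# word-final contexts. The phone list passed in is illustrative.
from itertools import product

def diphone_inventory(phones, silence="sil"):
    """Return the set of diphone names needed for `phones`."""
    units = {a + "-" + b for a, b in product(phones, repeat=2)}
    units |= {silence + "-" + p for p in phones}   # word-initial units
    units |= {p + "-" + silence for p in phones}   # word-final units
    return units
```

Note that not every pair actually occurs in Slovenian, which is consistent with the inventory holding 955 units rather than all 34 x 34 combinations.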
    <Paragraph position="3"> Yet it is not clear whether speech segments should be extracted from nonsense plurisyllabic words, called logatoms, from existing isolated words, or from meaningful sentences. Even the question of the best positioning of the units within the spoken corpus is still widely debated. Stressed syllables are longer, thus less subject to coarticulation, which results in easily chainable units; unstressed ones, however, are more numerous in natural speech, so that producing them efficiently would both increase segmental quality and reduce memory requirements. Likewise, coarticulations depend strongly on the speaker's fluency, so that imposing a slow speaking rate results in more intelligible units. To a large extent these issues are part of a necessary trade-off between intelligibility and naturalness.</Paragraph>
    <Paragraph position="4"> One diphone for every allophone combination possible in a given language is required. [Figure 2: Waveform (above) and spectral (below) representation of the diphone am. Markers L and R are set at the pitch periods of the left part of the diphone and of the right part, respectively.]</Paragraph>
    <Paragraph position="5"> After the recording phase, logatoms were hand-segmented and the center of the transition between the phones was marked, using information from both temporal and spectral representations of the speech signal. A special user-friendly interface was developed for this purpose, allowing editing, scaling, viewing, labelling and pitch-marking of the speech signal. First the approximate neighbourhood of a diphone was determined, then a fine labelling of its boundaries was performed and the center of the phoneme transition was marked. Finally, pitch markers were manually set for voiced parts of the corresponding speech signal.</Paragraph>
    <Paragraph position="6"> Figure 2 gives an example of the diphone am along with its spectrum.</Paragraph>
    <Paragraph position="7"> To phonetically transcribe the logatom words we used a set of 34 symbols for allophones, which we adapted to the SAMPA standard requirements (Fourcin89).</Paragraph>
    <Paragraph position="9"> [Table 1: List of phones and their corresponding submodels used for Slovenian logatom segmentation. Symbol = represents a voiced closure while symbol _ represents an unvoiced closure.]</Paragraph>
    <Paragraph position="10"> While concatenating diphones into words it turned out that there was a large discrepancy between the duration of allophones, as suggested by the prosody module, and the actual corresponding diphone durations stored in the diphone inventory. This happened due to the exaggerated eagerness of the speaker, who tried to pronounce the meaningless logatoms in a correct and clear way. Consequently, the quality of the synthetic speech was considerably affected, and we are therefore planning to record another diphone inventory. As the transformation range for prosodic speech parameters needed for synthesising naturally sounding speech is large, the recording should be carefully controlled to achieve medium pitch and duration values.</Paragraph>
  </Section>
  <Section position="5" start_page="299" end_page="301" type="metho">
    <SectionTitle>
4 Automatic Diphone Segmentation
</SectionTitle>
    <Paragraph position="0"> Automatic speech segmentation procedures are powerful tools for including new synthetic voices and for updating and supplementing existing diphone libraries, whereas manual diphone segmentation is a tedious, time-consuming task, prone to errors. Therefore, in order to be able to synthesise speech in a variety of different voices, we decided to use procedures for automatic segmentation and pitch marking of spoken logatoms.</Paragraph>
    <Paragraph position="1"> The extraction of diphones from the recorded words is performed in two stages. The first stage is the phoneme segmentation of logatoms, yielding a start point, transition center and end point for each phone.</Paragraph>
    <Paragraph position="2"> The second part of the diphone extraction procedure is to find the concatenation point of each phone.</Paragraph>
    <Paragraph position="3"> (Footnote 1: A list of Slovenian SAMPA symbols together with their audio samples is available on the WWW at "http://luz.fer.uni-lj.si/english/SQEL/sampa-eng.html".) Finally, pitch markers are to be determined for voiced parts of the signal. We intend to apply the SRPD (Super Resolution Pitch Determination) algorithm as it allows precise pitch determination (Medan91).</Paragraph>
    <Paragraph position="4"> Hidden Markov Model Phone Segmentation To solve the segmentation problem, methods for stochastic modelling of speech are used. Hidden Markov Models (HMMs) are stochastic finite-state automata that consist of a finite number of states, modelling the temporal structure of speech, and a probabilistic function for each of the states, modelling the emission and observation of acoustic feature vectors (Rabiner89).</Paragraph>
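Forced alignment along a fixed left-to-right state sequence is the core operation behind HMM phone segmentation. It can be sketched with a minimal Viterbi pass; emission log-scores are supplied directly here, whereas a real system computes them from HMM output densities over acoustic feature vectors:

```python
# Minimal Viterbi forced alignment: given a fixed left-to-right state
# sequence (one state per phone) and per-frame emission log-scores,
# find the most likely frame-to-state assignment. Transition costs
# are omitted for simplicity; only "stay" or "advance by one" moves
# are allowed, as in a left-to-right HMM topology.
import math

def align(emit, n_states):
    """emit[t][s] is the log-score of frame t in state s.
    Returns the best state index for every frame."""
    T = len(emit)
    NEG = -math.inf
    delta = [[NEG] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    delta[0][0] = emit[0][0]        # alignment must start in state 0
    for t in range(1, T):
        for s in range(n_states):
            stay = delta[t - 1][s]
            move = delta[t - 1][s - 1] if s else NEG
            if move > stay:
                delta[t][s], back[t][s] = move + emit[t][s], s - 1
            else:
                delta[t][s], back[t][s] = stay + emit[t][s], s
    # backtrack from the final state, which the alignment must reach
    path, s = [n_states - 1], n_states - 1
    for t in range(T - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]
```

The phone boundaries are then simply the frames at which the returned state index changes.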
    <Paragraph position="5"> To perform logatom segmentation we used the Isadora system, developed at the University Erlangen-Nuremberg (Schukat92). The Isadora system is a tool used for modelling of one dimensional patterns, like speech. It consists of modules for speech signal feature extraction, hard or soft vector quantization and beam-search driven Viterbi training and recognition.</Paragraph>
    <Paragraph position="6"> The ls'adora system builds a large network of nodes that correspond to different speech events like phones, phonemes, words or sentences. The nodes are provided with a dedicated HMM in order to acoustically represent the corresponding speech event.</Paragraph>
    <Paragraph position="7"> For system training, approximately half an hour of continuous speech recorded from a single speaker is required, along with its orthographic transcription. The acoustical analyser delivers every millisecond a set of Mel frequency cepstral coefficients along with their slopes, plus the energy of each frame. A phone-level description is obtained using the orthographic transcription and a pronunciation dictionary. In the initialisation step the feature vectors are classified into 64 classes using a soft vector quantization technique. Using a phonetically labelled vocabulary, a Baum-Welch training procedure is applied and the parameters of monophone models are obtained. By applying the Viterbi alignment procedure, the training logatoms are automatically labelled using our monophone inventory.</Paragraph>
    <Paragraph position="8"> Due to the properties of the Slovenian language, some phones are composed of several phone components, like the stop consonants k, p, b, d, t and the affricates c and č. Such phones are described by multiple submodels. Table 1 gives the Slovenian phones and their corresponding submodels as they are used for logatom segmentation.</Paragraph>
    <Paragraph position="9"> A preliminary statistical evaluation of manual and automatic segmentation discrepancies was performed on a much larger speech database than the logatom inventory itself, as proposed in (Schmidt93). 150 spoken sentences, with a total duration of 25 minutes, were extracted from the Slovenian speech corpus GOPOLIS (Dobrišek96), concerning flight timetable inquiries. Average duration, confidence interval and standard deviation of the population for both manual and automatic segmentation are presented in Table 2.</Paragraph>
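The statistics reported in Table 2 (average, standard deviation, confidence interval) can be computed from per-boundary discrepancies as follows. The normal-approximation 95% interval is an assumption for illustration, since the paper does not state which interval construction was used:

```python
# Summary statistics for manual-vs-automatic boundary discrepancies:
# mean, sample standard deviation, and a normal-approximation 95%
# confidence interval for the mean. Input values (in ms) are
# illustrative, not the paper's data.
import math

def summarise(diffs):
    """Return (mean, std dev, (ci_low, ci_high)) for `diffs`."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    sd = math.sqrt(var)
    half = 1.96 * sd / math.sqrt(n)   # 95% CI half-width, normal approximation
    return mean, sd, (mean - half, mean + half)
```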
    <Paragraph position="10"> The discrepancies between manual and automatic segmentation are considerable. Most of the problems arise when detecting bursts of plosives, as the automatic procedure tends to shorten their closures considerably.</Paragraph>
    <Paragraph position="11"> The situation improves when plosives are taken as a whole, closures and bursts together.</Paragraph>
    <Paragraph position="12"> As a result, a fully automatic segmentation of speech segments is hardly conceivable in the context of concatenation synthesis. As most phonological units originate from phonological considerations rather than acoustic grounds, isolating them requires deep prior knowledge of their specific features. Unsupervised segmentation, i.e. segmentation on acoustic principles only, often results in segment and sub-segment boundaries being misplaced or just missing, while undefined ones appear. However, it can be used as a segmentation outline, the refinement of which has to be performed by a human expert.</Paragraph>
    <Paragraph position="13"> Thus automatic procedures can speed up the segmentation process, but they are not likely to eliminate the need for manual corrections, at least for obtaining the highest synthesis quality with a given corpus.</Paragraph>
    <Paragraph position="14"> Diphone boundary determination As the concatenation point of the diphones corresponds to the center of the phone, it lies somewhere in the steady region of the phone. By studying the distances from the signal to the target values, (Ottesen91) claims that minimal distances tend to occur just before the middle of the phoneme. We decided to divide each phoneme duration in a fixed ratio, 40 and 60%. Plosives are an exception to this rule: they are divided just in front of the opening burst. A diphone boundary detection algorithm minimising spectral discontinuities at concatenation points (Taylor91) may be investigated.</Paragraph>
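The 40/60% split rule with the plosive exception can be sketched as follows; the phone labels and the `burst` parameter are illustrative conventions, not identifiers from the system:

```python
# Concatenation-point rule from the text: cut each phone at 40% of its
# duration, measured from the phone start; plosives are cut just before
# the opening burst instead. `burst` is the burst onset time for a
# plosive (an assumed input; a real system would detect it).
PLOSIVES = {"p", "b", "t", "d", "k", "g"}

def concat_point(phone, start, end, burst=None):
    """Return the concatenation point (in the same time units) for a
    phone spanning [start, end]."""
    if phone in PLOSIVES and burst is not None:
        return burst                    # cut just in front of the burst
    return start + 0.4 * (end - start)  # fixed 40/60 split otherwise
```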
  </Section>
</Paper>