<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0608">
  <Title>A practical Message-to-Speech strategy for dialogue systems</Title>
  <Section position="4" start_page="0" end_page="41" type="metho">
    <SectionTitle>
2 Prosody Transplantation
</SectionTitle>
    <Paragraph position="0"> The idea behind Prosody Transplantation is that of copying intonation and duration values from a recorded donor message (human speech) to the phonetic transcription of the same message. The specific Enriched Phonetic Transcription (EPT) obtained in this manner can be fed to a TTS system whereby the normal linguistic and prosodic modules (based on general models) are by-passed (Phonetics-to-Speech -- PTS). Only the segmental synthesis and the synthesiser modules are used.</Paragraph>
    <Paragraph position="1"> An example of an EPT is provided by figure 1.</Paragraph>
    <Paragraph position="2"> The first value between square brackets is the phoneme duration (in ms), optionally followed by one or more intonation breakpoints. Each breakpoint consists of a location value (in ms) relative to the beginning of the phoneme, followed by a pitch value (in ST/4; reference 50 Hz).</Paragraph>
    <Paragraph position="4"> [Figure 1: EPT of the sentence &quot;Thank you for your attention&quot;]</Paragraph>
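    <Paragraph> The EPT layout described above can be sketched as a small data structure. This is only an illustrative sketch: the class and field names, the phoneme symbols and the numeric values are hypothetical, and the reading of ST/4 as quarter semitones (48 steps per octave) relative to 50 Hz is an assumption rather than something the text spells out.

```python
from dataclasses import dataclass, field

REFERENCE_HZ = 50.0     # reference frequency stated in the text
STEPS_PER_OCTAVE = 48   # assumption: ST/4 = quarter semitones, 12 * 4 per octave


@dataclass
class Breakpoint:
    offset_ms: int      # location relative to the beginning of the phoneme
    pitch_qst: int      # pitch in quarter semitones above the reference


@dataclass
class EptPhoneme:
    symbol: str         # phoneme label (hypothetical notation)
    duration_ms: int    # first value between the square brackets
    breakpoints: list[Breakpoint] = field(default_factory=list)


def pitch_to_hz(pitch_qst: float) -> float:
    """Convert a quarter-semitone value to Hz relative to the reference."""
    return REFERENCE_HZ * 2.0 ** (pitch_qst / STEPS_PER_OCTAVE)


def total_duration_ms(ept: list[EptPhoneme]) -> int:
    """Sum the phoneme durations of an EPT."""
    return sum(p.duration_ms for p in ept)


# Invented values for illustration; they are not taken from figure 1.
example = [
    EptPhoneme("T", 90, [Breakpoint(0, 48)]),
    EptPhoneme("{", 110, [Breakpoint(20, 52), Breakpoint(100, 44)]),
]
```

Under this assumed scale, a pitch value of 48 lies one octave above the reference, i.e. at 100 Hz.</Paragraph>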
    <Paragraph position="5"> A major asset of Prosody Transplantation is the combination of natural-sounding speech with a low bit rate for storage (fewer than 300 bits per second). In addition, only the prosody and not the timbre of the speaker is retained. New donor messages can therefore be recorded by new speakers and seamlessly integrated into existing applications. Specific tools have been developed to speed up the prosody transplantation process (Van Coile et al., 1994). Although the EPTs as such do not support linguistic variation, the combination of PTS with a template-driven system provides linguistic flexibility as well as natural prosody.</Paragraph>
  </Section>
  <Section position="5" start_page="41" end_page="42" type="metho">
    <SectionTitle>
3 The Message-to-Speech System
</SectionTitle>
    <Paragraph position="0"> In the following sections, more details will be provided about the combination of fixed and variable information (templates and arguments). Once the appropriate surface form is selected (see section 3.1), the resulting EPT template with its arguments (phonetically transcribed) is integrated on the prosodic level (see section 3.2). Finally, the integrated EPT is fed into the TTS synthesis module (PTS).</Paragraph>
    <Section position="1" start_page="41" end_page="42" type="sub_section">
      <SectionTitle>
3.1 MTS Generation Module
</SectionTitle>
      <Paragraph position="0"> A message represents a complete sentence and is composed of one or more building blocks or message units (MU), which constitute the input of the MTS system. All MUs are prosodic units that cannot be combined in an arbitrary way to form messages: syntax specifies how to combine different MUs into a message. The flexibility of an MU is guaranteed by the presence of slots. By providing different arguments for a slot, several variants can be derived from the same MU at run-time. An entire message can thus be parameterised.</Paragraph>
      <Paragraph position="1"> Subsequently, the MUs are mapped into one or more carriers. A carrier is a template containing the enriched phonetic transcription of canned text, transplanted from an appropriate donor message (see above), together with the prosodic information for the free slot parts (see below).</Paragraph>
      <Paragraph position="2"> MU arguments are not necessarily passed on to a carrier slot in a straightforward way: the argument can be deleted, adapted, or swapped. Examples of MUs and carriers are given in figure 2 (footnote 1). The MTS generation part essentially provides a method that ensures the variability of a piece of information while taking the related linguistic variations into account (selection of the correct variant). The transformation of an MU into one or more carriers is guided by a two-fold mechanism:</Paragraph>
      <Paragraph position="3"> * argument dependent carrier selection: the carrier is selected depending on (a characteristic of) an argument, e.g. /a/car_ vs. /two/cars (singular vs. plural templates). In order to select the appropriate carrier, morpho-syntactic information about the argument must be available (in a dictionary).</Paragraph>
      <Paragraph position="4"> * carrier dependent argument realisation: the argument is realised differently depending on the selected carrier, e.g. /a/ car vs. /an/_automobile (vocalic onset or not for a singular noun). For the argument to be realised correctly, linguistic constraints on the slot must be taken into account.</Paragraph>
      <Paragraph position="5"> The arguments to be filled into a slot are phonetic transcriptions provided by a dictionary or a grapheme-to-phoneme (G2P) conversion module.</Paragraph>
      <Paragraph position="6"> E.g., the dictionary entry for the determiner is an;ON=VO | a, NB=SG: &quot;a&quot; is the default; &quot;an&quot; is used before nouns with a vocalic onset, and both forms are singular. The prosody of a carrier (EPT with slots), although better than that of plain TTS, risks being slightly inferior to that of an entire EPT (no slot). Therefore, a good and practical compromise has to be found in the trade-off between storage space on the one hand and flexibility and prosodic quality on the other.</Paragraph>
      <Paragraph position="7"> An example (see figure 3) gives an idea of how the system works. The transformation of MU 0001 into carrier 3551 is straightforward (no specific condition). Depending on the value of the argument (ARG), MU 0002 is mapped onto carrier 3561 or carrier 3562. This is an example of argument dependent carrier selection. [Footnote 1: ... representation, it must be stressed that a carrier is a very concise representation of a piece of recorded speech without segmental voice-specific features. Each phoneme also has duration and intonation characteristics (see figure 1).]</Paragraph>
      <Paragraph position="8"> Subsequently, if alternative surface forms co-exist, the restriction on the slot (see figure 4) is compared with the characteristics of its argument.</Paragraph>
      <Paragraph position="9"> As &quot;an&quot; is associated with &quot;ON=VO&quot; (vocalic onset) and the argument in the example does not satisfy this restriction, the default case &quot;a&quot; is selected (= carrier dependent argument realisation).</Paragraph>
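      <Paragraph> The two-fold mechanism walked through above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the dictionary layout, the function names and the feature value PL are hypothetical, and the text does not say which of carriers 3561 and 3562 is the singular one, so the mapping below is arbitrary.

```python
# Hypothetical noun dictionary with morpho-syntactic features. The feature
# names ON (onset) and NB (number) follow the text; "PL" is an assumption.
NOUNS = {
    "car": {"NB": "SG"},
    "automobile": {"NB": "SG", "ON": "VO"},
    "cars": {"NB": "PL"},
}

# Determiner variants: restricted forms first, the default form last.
DETERMINER = [
    ("an", {"ON": "VO"}),   # only before a vocalic onset
    ("a", {}),              # default, no restriction
]

# Arbitrary assignment: the text does not state which id is singular.
CARRIER_BY_NUMBER = {"SG": 3561, "PL": 3562}


def select_carrier(argument: str) -> int:
    """Argument dependent carrier selection: a feature of the argument
    decides which carrier is used."""
    return CARRIER_BY_NUMBER[NOUNS[argument]["NB"]]


def realise_determiner(argument: str) -> str:
    """Carrier dependent argument realisation: the first variant whose
    restriction is satisfied by the argument's features is chosen."""
    features = NOUNS[argument]
    for form, restriction in DETERMINER:
        if all(features.get(k) == v for k, v in restriction.items()):
            return form
    raise ValueError("no variant matches")
```

With these invented entries, the argument "car" fails the ON=VO restriction and falls through to the default "a", whereas "automobile" satisfies it and receives "an".</Paragraph>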
      <Paragraph position="11"/>
    </Section>
    <Section position="2" start_page="42" end_page="42" type="sub_section">
      <SectionTitle>
3.2 MTS Prosodic Integration Module
</SectionTitle>
      <Paragraph position="0"> The purpose of the prosodic integration module is to calculate appropriate prosody for all arguments that are to be filled into a carrier. In a first step, a duration is calculated for each of the phonemes in the argument (see section 3.2.1). In a second step, an appropriate intonation contour is calculated (see section 3.2.2).</Paragraph>
      <Paragraph position="1">  The input of the duration module is a phonetic transcription in which primary and secondary stress, provided by the dictionary or G2P module, are indicated. The duration module has access to one or more duration models in order to produce a phonetic transcription that is enriched with a duration value for each phoneme.</Paragraph>
      <Paragraph position="2"> A duration model is a rule-based system calculating durations, taking into account parameters such as lexical stress, position of phonemes (word initial, word medial, word final, sentence final), length of the argument, phonetic context of phonemes (left/right neighbour, consonant cluster, open/closed syllable) etc. As speech rate can vary from one message to another, a slot specific speech rate coefficient, provided by the carrier, can also be taken into account.</Paragraph>
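      <Paragraph> A rule-based duration model of this kind might look like the sketch below. Everything concrete in it is an assumption made for illustration: the base durations, the multiplicative form of the rules, the scaling factors and the idea that the speech rate coefficient simply multiplies the result are not taken from the paper, and only a few of the listed parameters are modelled.

```python
from dataclasses import dataclass


@dataclass
class PhonemeContext:
    symbol: str
    stressed: bool = False        # carries lexical stress
    word_final: bool = False
    sentence_final: bool = False


# All numbers below are invented placeholders, not values from the paper.
BASE_MS = {"k": 70, "A": 110, "r": 60}
DEFAULT_BASE_MS = 80


def phoneme_duration_ms(ctx: PhonemeContext, rate_coefficient: float = 1.0) -> int:
    """Apply multiplicative rules to a base duration, then scale by the
    slot specific speech rate coefficient supplied by the carrier."""
    duration = float(BASE_MS.get(ctx.symbol, DEFAULT_BASE_MS))
    if ctx.stressed:
        duration *= 1.2           # stressed phonemes are lengthened
    if ctx.word_final:
        duration *= 1.1
    if ctx.sentence_final:
        duration *= 1.3           # final lengthening
    return round(duration * rate_coefficient)
```

A fuller model would also condition on the length of the argument and on the phonetic context (neighbouring phonemes, consonant clusters, open or closed syllables), as the text describes.</Paragraph>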
      <Paragraph position="3"> Two major strategies with respect to duration modelling can be distinguished: * As the most natural prosody is the one derived from human speech, the duration module can be fed with phonetic transcriptions enriched with duration information copied from natural speech. When customising the MTS system, an argument dictionary containing this information can be built off-line by making use of the prosody transplantation tools (see section 2). If transplanted durations are available in the argument, they are taken over by the duration module and only modified in specific cases, e.g. to change a duration in order to cope with a phenomenon such as final lengthening.</Paragraph>
      <Paragraph position="4"> * For arguments without transplanted durations, a general purpose duration module is activated.</Paragraph>
      <Paragraph position="5"> It consists of a cascade of duration models of decreasing specificity.</Paragraph>
      <Paragraph position="6"> Specific duration models exist for particular arguments such as numbers or date and time indications. The general purpose model is only used if no more specific model is available. Special tools have been developed to speed up the creation of general and special purpose duration models.</Paragraph>
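      <Paragraph> The two strategies and the cascade can be sketched together as a simple fallback chain. The function names, the argument tag "number" and the constant durations are hypothetical; the sketch also leaves out the case where transplanted durations are adjusted, for instance for final lengthening.

```python
from typing import Callable, Optional

# A duration model maps (argument kind, phoneme list) to durations in ms,
# or returns None when it does not apply. Names and values are hypothetical.
DurationModel = Callable[[str, list[str]], Optional[list[int]]]


def number_model(kind: str, phonemes: list[str]) -> Optional[list[int]]:
    """Specific model: only handles arguments tagged as numbers."""
    if kind != "number":
        return None
    return [90 for _ in phonemes]


def general_model(kind: str, phonemes: list[str]) -> Optional[list[int]]:
    """General purpose model: always applies."""
    return [80 for _ in phonemes]


# Ordered from most to least specific.
CASCADE: list[DurationModel] = [number_model, general_model]


def assign_durations(kind: str, phonemes: list[str],
                     transplanted: Optional[list[int]] = None) -> list[int]:
    """Use transplanted durations when present, otherwise the first
    model in the cascade that applies."""
    if transplanted is not None:
        return list(transplanted)
    for model in CASCADE:
        result = model(kind, phonemes)
        if result is not None:
            return result
    raise ValueError("no duration model applies")
```

Because the general purpose model sits last in the list, it is only reached when no more specific model has returned a result.</Paragraph>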
      <Paragraph position="7">  The results after duration modelling are input to the intonation module, which produces phonetic transcriptions describing both duration and intonation. After assimilation has been taken care of, the resulting EPT for the argument can be inserted without any further action into the EPT of the carrier. For each argument, the intonation module calculates a piecewise linear intonation contour based on slot specific intonation models. The slot specific information provided by the carrier that can be taken into account includes, among other things, the begin pitch, the end pitch, the declination rate and the intonation context (final fall, continuation rise, etc.) of the argument.</Paragraph>
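      <Paragraph> A piecewise linear contour of this kind can be sketched as below. How the actual intonation models combine begin pitch, end pitch and declination rate is not described in the text, so the rule used here (derive the end pitch from the declination rate when none is given) is purely an assumption, as are the function names and the units (pitch in ST/4, declination in ST/4 per second). The intonation context is not modelled.

```python
from typing import Optional


def build_contour(total_ms: int, begin_pitch: float,
                  end_pitch: Optional[float] = None,
                  declination_per_s: float = 0.0) -> list[tuple[int, float]]:
    """Return (time_ms, pitch) breakpoints spanning the whole argument."""
    if end_pitch is None:
        end_pitch = begin_pitch - declination_per_s * total_ms / 1000.0
    return [(0, begin_pitch), (total_ms, end_pitch)]


def pitch_at(contour: list[tuple[int, float]], t_ms: int) -> float:
    """Evaluate the piecewise linear contour at time t_ms."""
    for (t0, p0), (t1, p1) in zip(contour, contour[1:]):
        if t1 >= t_ms:
            if t1 == t0:
                return p1
            return p0 + (p1 - p0) * (t_ms - t0) / (t1 - t0)
    return contour[-1][1]


def per_phoneme_breakpoints(durations_ms: list[int],
                            contour: list[tuple[int, float]]):
    """Attach one breakpoint at offset 0 of each phoneme, mirroring the
    EPT layout of a duration followed by (location, pitch) breakpoints."""
    result = []
    start = 0
    for duration in durations_ms:
        result.append((duration, [(0, pitch_at(contour, start))]))
        start += duration
    return result
```

The output pairs each phoneme duration with its breakpoints, which is the shape needed to insert the argument into the EPT of the carrier.</Paragraph>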
    </Section>
  </Section>
</Paper>