<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1202"> <Title>Message-to-Speech: high quality speech generation for messaging and dialogue systems</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Many Natural Language Generation (NLG) systems that produce flexible output, i.e. sentences with variation at the syntactic and morphological levels, aim only at producing written text and do not deal with spoken language. As a result, they do not address the important topic of generating natural prosody (see e.g. (Elhadad, 1992; Reiter et al., 1995; Dalianis, 1996b; Somers et al., 1997)).</Paragraph> <Paragraph position="1"> Message-generating systems (e.g. announcement systems, phone banking and voice mail applications) often combine fixed pieces of pre-recorded speech to provide speech of natural quality. In practical applications, the linguistic flexibility of the spoken messages is usually kept very limited because of the high cost of recording and storing the fixed pieces of speech.</Paragraph> <Paragraph position="2"> The Message-to-Speech (MTS) system described below is specifically designed to generate high-quality speech output with the flexibility desired for spoken dialogue and message-generating systems. Such systems typically generate speech for a predefined set of messages that consist of fixed and variable parts. High flexibility may be required only for the variable parts of the messages.</Paragraph> <Paragraph position="3"> Text-to-Speech (TTS) is an obvious technique for providing speech output with nearly unlimited flexibility. Because full flexibility is needed only for the variable parts of the messages, the MTS system can use special-purpose prosody models tailored to the application's actual set of messages. 
These models can lead to a prosodic quality superior to that produced by TTS systems, which apply general prosody models for unrestricted text (see also (Hovy, 1995, p.161)).</Paragraph> <Paragraph position="4"> For the fixed parts of a message, the prosody transplantation technique (see section 2) is used to override the prosody that general models would generate, as in TTS, with specific prosody copied from natural speech. For the parts of a message where flexibility is needed, prosody is obtained either from a general model or from a model developed specifically for those parts. The MTS system thus combines transplanted prosody with model-generated prosody in order to achieve highly natural prosody for partly variable messages (Van Coile et al., 1995).</Paragraph> <Paragraph position="5"> The key concepts of the MTS system are presented in section 3.1. The system consists of two main modules: a generation module and a prosodic integration module. The generation module (see section 3.2) is template-driven (canned &quot;text&quot; interspersed with slots) and accounts for the flexibility, including the linguistic variation, of the messages. For a discussion of template-driven systems see (Reiter, 1995; van Deemter et al., 1994; van Deemter and Odijk, 1997). The prosodic integration module (see section 3.3) handles the prosodic integration of the slot fillers with the rest of the template.</Paragraph> <Paragraph position="6"> In section 4 the Message-to-Speech system is briefly discussed, and section 5 compares the system with related research. To conclude, section 6 presents an overview of current developments to further enhance the MTS system.</Paragraph> </Section> </Paper>