<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1204"> <Title>Integrating Language Generation with Speech Synthesis in a Concept to Speech System</Title> <Section position="4" start_page="23" end_page="23" type="metho"> <SectionTitle> 3 System Architecture </SectionTitle> <Paragraph position="0"> The main new feature of the architecture (see Fig. 1) is the introduction of SIML. The system has three major components: the NLG component, the SIML To Prosody (STP) component, and the TTS component. Each can be designed and implemented independently. The NLG component first converts the input concepts into grammatical sentences with associated discourse, semantic, and syntactic information. Then the NLG-SIML converter transforms the system-specific NLG representation into the standard SIML format. The STP component computes the prosodic features based on the discourse, semantic, and syntactic information encoded in the SIML format. The STP component has three modules: the SIML parser, the STP algorithms, and the SIML generator. First the SIML parser analyzes the information in SIML. The STP algorithms then predict prosodic parameters based on the information derived from the markup language, and the SIML generator encodes the prosodic features in SIML format. The TTS component first extracts the prosodic parameters from the SIML representation and then translates them into specific, system-dependent TTS input. In this way various NLG tools, STP algorithms, and TTS systems can be integrated through a standard interface, SIML.</Paragraph>
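<Paragraph> To make the data flow concrete, the following minimal Python sketch walks one input through the pipeline. Every function body here is a hypothetical stand-in (the real components are full NLG, STP, and TTS systems); only the data flow follows the architecture: SIML strings are the sole interface between the components, so any one of them can be replaced independently.

def nlg_generate(concepts):
    # NLG component: concepts -> a sentence plus discourse, semantic,
    # and syntactic information, in a system-specific representation.
    return {"text": concepts, "semantics": "...", "syntax": "..."}

def nlg_to_siml(nlg_output):
    # NLG-SIML converter: system-specific representation -> standard SIML.
    return "<u.pro>" + nlg_output["text"] + "</u.pro>"

def stp(siml_text):
    # STP component: SIML parser -> STP algorithms -> SIML generator.
    rate = 1.0  # stand-in for the prosody predicted by the STP algorithms
    return siml_text.replace("<u.pro>", "<u.pro rate=%.1f>" % rate)

def siml_to_tts(siml_prosody):
    # SIML-TTS converter: prosody-annotated SIML -> system-dependent TTS input.
    return "TTS-INPUT: " + siml_prosody

def concept_to_speech(concepts):
    return siml_to_tts(stp(nlg_to_siml(nlg_generate(concepts))))

print(concept_to_speech("The patient is hypertensive."))
</Paragraph>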
</Section> <Section position="5" start_page="23" end_page="26" type="metho"> <SectionTitle> 4 MAGIC CTS system </SectionTitle> <Paragraph position="0"> Our CTS system is a component of the MAGIC system (Multimedia Abstract Generation for Intensive Care) (Dalal et al., 1996; Pan and McKeown, 1996). MAGIC's goal is to provide a temporally coordinated multimedia presentation of data in an online medical database. The graphics and speech generators communicate through a media coordinator to produce a synchronized multimedia presentation. Given that CTS takes place within a multimedia context, many of the parameters our CTS system addresses are those needed for aiding coordination between media. Currently, there are three components in the CTS system: an NLG component, a set of CTS algorithms, and a TTS. The NLG tools and TTS are application independent. We use the FUF/SURGE package (Elhadad, 1993) for generation. The speech synthesis system is AT&T Bell Labs' TTS system. The concept to speech algorithms, however, are not system independent: the input to these algorithms is in FUF/SURGE representation, and the output is designed specifically for AT&T TTS input. In this section we describe our current CTS system, and in the following section we discuss the extensions that we plan in order to adapt it to the general, proposed architecture.</Paragraph> <Paragraph position="1"> NLG component in MAGIC The NLG component in MAGIC consists of four modules: a general content planner, a micro planner, a lexical chooser, and a surface realizer. The general content planner groups and organizes the data items from the medical database into topic segments; each segment may consist of several sentences. Then the micro planner plans the content of the sentences within each topic segment; one or several sentences can be used to convey the information within a topic segment. The lexical chooser makes decisions on word selection and the semantic structure of the sentence. The output of the lexical chooser is an internal semantic representation of the intended utterance. For example, the internal semantic structure of &quot;The patient is hypertensive&quot; is represented in:</Paragraph> <Paragraph position="3"> In a semantic representation, a clause is defined by process type, participants, and circumstances. The process type could be simple, as in the example, or composite (e.g., using conjunction). Each participant or circumstance may consist of a head and one or more pre-modifiers or qualifiers. Words and phrases are used to realize each semantic unit.</Paragraph> <Paragraph position="4"> The surface realizer maps the lexicalized semantic representation to its corresponding syntactic structure. After linearizing the syntactic structure, which usually is the last step in a written language generation system, the internal semantic and syntactic structure as well as the words of the sentence are used as a rich and reliable knowledge source for speech synthesis.</Paragraph> [Figure 1: System architecture. The NLG system's output representation is transformed by the NLG-SIML converter into SIML format; the STP component (SIML parser, STP algorithms, SIML generator) takes the semantic, syntactic, and discourse information in SIML and produces prosody in SIML format; the SIML-TTS converter translates the result into the TTS input format.] <Paragraph position="6"> CTS Algorithms in MAGIC Due to the synchronization requirements, we are specifically interested in two features: pause and speaking rate. We want to increase or decrease the length of pauses or the speaking rate in such a way that speech actions begin and end at the same time as the corresponding graphical actions. Even a small drift can be noticed by human eyes and cause uncomfortable visual effects. In MAGIC, only pause and speaking rate are set by our CTS algorithms; all other prosodic features are set to the default values predicted by AT&T Bell Labs' TTS system.</Paragraph> <Paragraph position="7"> Currently, we use a simple strategy in adjusting the speaking rate. We define the relative speaking rate as the ratio of the real speaking rate to the default speaking rate. Through experiments, we determined that the relative speaking rate can vary from 0.5 to 1 without significantly affecting the speech quality.</Paragraph> <Paragraph position="8"> In the future, we plan to develop an algorithm where the adjustable range is not uniform everywhere but is decided by the underlying discourse, semantic, and syntactic structures.</Paragraph>
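<Paragraph> As an illustration of this strategy, the following minimal sketch picks the relative speaking rate that makes a speech action fill the same interval as a graphical action, clamped to the 0.5 to 1 range determined above. The duration values are hypothetical inputs from the media coordinator.

def relative_speaking_rate(default_dur, target_dur):
    # default_dur: seconds the utterance takes at the default TTS rate
    # target_dur: seconds the coordinated graphical action takes
    rate = default_dur / target_dur   # rate < 1 slows speech, stretching it
    return min(1.0, max(0.5, rate))   # experimentally usable range

# Stretching a 4-second utterance to accompany a 5-second graphical action:
print(relative_speaking_rate(4.0, 5.0))   # 0.8
</Paragraph>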
<Paragraph position="9"> In the following, we give more detail on the CTS algorithm used to predict prosodic phrase boundaries. It provides a reliable indication of where pauses can be inserted and how long they can be.</Paragraph> <Paragraph position="10"> We use semantic structures to derive the prosodic phrase boundaries. In our algorithm, we first identify the basic semantic units (BSUs), the smallest complete information units in the semantic structure. Then we define a closeness measurement between two adjacent BSUs. If two adjacent BSUs are loosely connected, we have reason to believe that speaking them separately will not significantly hurt intelligibility. Therefore, semantic closeness is an important knowledge source for prosodic phrase boundary prediction. Other factors which also affect the placement of prosodic phrase boundaries are breath length and the distance to the end of the utterance.</Paragraph> <Paragraph position="11"> A Basic Semantic Unit (BSU) is a leaf node in a semantic hierarchy. In the semantic hierarchy (see Fig. 2), the BSUs are indicated by dark blocks.</Paragraph> <Paragraph position="12"> We define the closeness between two adjacent BSUs as the level of their lowest common ancestor in the semantic hierarchy. If a node has only one child, then both the parent and the child are considered to be at the same level. The closeness indicates the semantic distance of two adjacent BSUs: 1 means they are semantically far apart, while higher numbers indicate they are semantically close.</Paragraph> <Paragraph position="13"> Breath length is defined as the typical number of words a human can speak comfortably without breathing. The value used in the algorithm is learned automatically from a corpus. The distance from the current place to the end of an utterance is simply defined by the number of words.</Paragraph> <Paragraph position="14"> Now we have three factors working together to determine the prosodic phrase boundaries. Basically, there will not be any prosodic phrase boundary within a BSU. For each place between two adjacent BSUs, we measure the possibility of inserting a prosodic phrase boundary using a combination of the three factors (a schematic sketch follows this list): 1. The closer two adjacent BSUs are semantically, the less the possibility of a boundary. 2. The closer the current breath length to the comfortable breath length, the more the possibility of a boundary.</Paragraph> <Paragraph position="15"> 3. The closer the current place to the end of the utterance, the less the possibility of a boundary.</Paragraph> <Paragraph position="16"> 4. The above factors are weighted, using a learning algorithm trained automatically on a small corpus (40 sentences).</Paragraph>
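<Paragraph> The following schematic sketch shows one way the three factors could be combined. The linear combination and the normalizations are illustrative assumptions; in our system the weights are trained automatically on the 40-sentence corpus rather than fixed by hand.

def boundary_score(closeness, max_closeness,
                   words_since_pause, breath_length,
                   words_to_end, utterance_length,
                   weights=(1.0, 1.0, 1.0)):
    # Possibility of a prosodic phrase boundary between two adjacent BSUs;
    # higher scores favor inserting a boundary.
    semantic = 1.0 - closeness / max_closeness             # factor 1
    breath = min(words_since_pause / breath_length, 1.0)   # factor 2
    position = words_to_end / utterance_length             # factor 3
    return (weights[0] * semantic + weights[1] * breath
            + weights[2] * position)

# Two loosely connected BSUs (closeness 1 out of at most 5), 12 words
# since the last pause with a learned breath length of 14, and 10 of the
# utterance's 20 words remaining:
print(boundary_score(1, 5, 12, 14, 10, 20))
</Paragraph>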
<Paragraph position="17"> The results are encouraging. When we tested on the set provided in (Bachenko and Fitzpatrick, 1990), we obtained 90% accuracy for primary phrase boundaries, and we obtained 82% accuracy on the utterances in (Gee and Grosjean, 1983). We did not formally measure the algorithm on secondary phrase boundaries, because we only consider inserting pauses at primary phrase boundaries.</Paragraph> <Paragraph position="18"> TTS in MAGIC Basically, we treat the TTS as a black box in MAGIC. We use the escape sequences of the TTS to override the TTS default values.</Paragraph> <Paragraph position="20"> 5 Extensions to MAGIC CTS Based on the New Architecture The current MAGIC CTS uses CTS algorithms that are closely integrated with both the NLG tools and the TTS. This makes it difficult to experiment with new tools, since doing so requires changes in all the input and output formats for the CTS algorithms. In the spirit of developing a portable language generation system such as FUF/SURGE, we are working on a portable spoken language generation system by using the new architecture.</Paragraph> <Paragraph position="21"> In order to extend the current CTS, we must define a prototype SIML. As a first step, we have designed a prototype SIML that covers the information needed for CTS in the multimedia context. For our CTS algorithms, only semantic and syntactic structure are used in predicting prosodic phrase boundaries, and these are represented in SIML. Speaking rate and pause are also included in SIML.</Paragraph> <Paragraph position="22"> We first describe how this information is represented in SIML, giving examples showing how to use SIML to tag pauses, speaking rate, and semantic and syntactic structure. Then part of the formal Document Type Definition (DTD) of the prototype SIML is presented, providing a grammar for SIML. See (Sperberg-McQueen and Burnard, 1993) for more information about SGML and DTDs.</Paragraph> <Paragraph position="23"> Example 1: Using SIML to tag speaking rate and pauses: <u.pro>Ms. Jones <pause dur=5 durunit=ms> is an <phrase rate=0.9> 80 year old </phrase> hypertensive, diabetic female patient of doctor Smith undergoing CABG. </u.pro> <u.pro> and </u.pro> above indicate the start and end of an utterance. <phrase> and </phrase> are the start and end tags of a phrase. Rate is an attribute associated with <phrase>, indicating the speaking rate of the phrase. <pause> is a tag with two associated attributes, dur and durunit, which together indicate the length of the pause.</Paragraph> <Paragraph position="24"> In the DTD specification of the prototype SIML, three elements and their associated attributes are defined: * u.pro and its attribute, rate; * phrase and its attribute, rate; * pause and its attributes, dur and durunit.</Paragraph> <Paragraph position="25"> The following is the element definition for &quot;u.pro&quot;: <!ELEMENT u.pro - - ((#PCDATA | phrase | pause)*)> ELEMENT is a reserved word for an element definition. &quot;u.pro&quot; is the element name. &quot;- -&quot; is the omitted-tag minimization, which means both the start and end tags are mandatory. The last part is the content model specification: (#PCDATA | phrase | pause)* means that only parsed character data, phrases, and pauses may appear between the start and end tags of &quot;u.pro&quot;.</Paragraph> <Paragraph position="26"> The associated attributes are defined in: <!ATTLIST u.pro rate NUMBER 1 > where ATTLIST is the reserved word for an attribute list definition, &quot;u.pro&quot; is the element name, &quot;rate&quot; is the attribute name, the type of &quot;rate&quot; is NUMBER, and the default value is &quot;1&quot;.</Paragraph> <Paragraph position="27"> The STP component is the core part of the architecture and deserves more explanation. There are three tasks for this component: parsing the input SIML, generating prosodic parameters from the information produced by the NLG, and transforming those parameters into the SIML format. The SIML parsing is straightforward; it can be done either by developing an SIML-specific parser for better efficiency or by using an SGML parser (several are publicly available). The output of this step is the semantic and syntactic information extracted from SIML. Generating the prosodic parameters requires a set of CTS algorithms; we need to change the input and output of our existing CTS algorithms to make them system independent. Since the performance of these algorithms directly affects the quality of the synthesized speech, much effort is required to develop good CTS algorithms. The good news is that the proposed design ensures that the markup-to-prosody algorithms are system independent; therefore, they can be reused in other applications. The output of the STP algorithms is then converted to the SIML format by the SIML generator. This procedure is straightforward and can be done very efficiently.</Paragraph>
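<Paragraph> As an illustration of the parsing and extraction step, the following sketch pulls the prosodic tags and their attributes out of the Example 1 markup. Because SIML is SGML-based and attribute values may be unquoted (e.g., dur=5), the sketch uses a simple regular expression rather than an off-the-shelf XML parser; a real implementation would use a full SGML parser, as noted above.

import re

SIML = ('<u.pro>Ms. Jones <pause dur=5 durunit=ms> is an '
        '<phrase rate=0.9> 80 year old </phrase> hypertensive, diabetic '
        'female patient of doctor Smith undergoing CABG. </u.pro>')

TAG = re.compile(r'<(/?(?:u\.pro|phrase|pause))([^>]*)>')
ATTR = re.compile(r'(\w+)=([\w.]+)')

for match in TAG.finditer(SIML):
    name, attrs = match.group(1), dict(ATTR.findall(match.group(2)))
    print(name, attrs)
# e.g. prints:  pause {'dur': '5', 'durunit': 'ms'}
#               phrase {'rate': '0.9'}
</Paragraph>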
</Section> <Section position="6" start_page="26" end_page="27" type="metho"> <SectionTitle> 6 Generalize SIML </SectionTitle> <Paragraph position="0"> Since the current prototype SIML is designed specifically for a multimedia application, it includes very limited semantic, syntactic, and prosodic information. Thus, it is currently too primitive to be used as a standard interface for other CTS applications.</Paragraph> <Paragraph position="1"> In the future, we must include other forms of information that are needed for speech synthesis and that can be generated by an NLG system. Some types of knowledge that we have identified include: 1. Discourse information (e.g., discourse structure, focus, rhetorical relations), semantic structure and its associated features (such as in the prototype SIML), and syntactic structure.</Paragraph> <Paragraph position="2"> 2. Pragmatic information, such as speaker-hearer goals, hearer background, hearer type, speaker type, and emotions.</Paragraph> <Paragraph position="3"> 3. Morphological information, such as root, prefix, and suffix.</Paragraph> <Paragraph position="4"> 4. Speech features, such as pronunciation, prosodic features, temporal information (such as duration, start, and end), and non-lexical features (such as clicks and coughs).</Paragraph> </Section> </Paper>