File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-1206_metho.xml
Size: 19,868 bytes
Last Modified: 2025-10-06 14:14:50
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1206"> <Title>Computing prosodic properties in a data-to-speech system</Title> <Section position="5" start_page="0" end_page="39" type="metho"> <SectionTitle> 2 Architecture of D2S </SectionTitle> <Paragraph position="0"> Generation Module (LGM), and the Speech Generation Module (SGM).</Paragraph> <Paragraph position="1"> The LGM takes data as input and produces enriched text, i.e., prosodically annotated text. For instance, it contains annotations to indicate accents and prosodic boundaries. This is input to the SGM, which turns it into a speech signal.</Paragraph> <Paragraph position="2"> Our example system GoalGetter (Klabbers et al., 1997b) takes data on a football match as input. The output of the system is a correctly pronounced, coherent monologue in Dutch which conveys the information on this match. An example of the input data is given in Figure 2, and one possible output text is given in Figure 3. In the enriched text, pitch accents are indicated by double quotes (&quot;) and phrase boundaries of varying strength are indicated by one to three slashes (/). The other symbols used in the text will be clarified in Section 4.</Paragraph> <Paragraph position="3"> Since we lack the space for a full description of the LGM, presented schematically in Figure 4, we only point out some important aspects which are relevant for the prosodic rules given in Section 3. For a more detailed description, see Klabbers et al. (1997a).</Paragraph> <Paragraph position="5"> The input for the LGM consists of data on specific football matches (see Figure 2) and on the domain, e.g., background information on the players and the teams. The information in the input can be expressed by means of templates in the form of a syntactic tree with variable slots.
Choice and ordering of the templates and the filling of their slots depend on conditions on (1) the Knowledge State, which keeps track of which information has been expressed, and (2) the Context State, in which various aspects of the context are represented (Deemter and Odijk, 1995).</Paragraph> <Paragraph position="6"> A central part of the Context State is the Discourse Model, which contains information about which linguistic expressions have been used in the preceding text. Rules formulated in terms of this Discourse Model make it possible to use various referential expressions (proper names, pronouns, definite descriptions, etc.) appropriately. For instance, in the fourth sentence of the example text given in Figure 3, Dertien minuten later liet de aanvaller zijn tweede doelpunt aantekenen ('Thirteen minutes later the forward had his second goal noted'), it was possible to use a definite description (de aanvaller, 'the forward') to refer to Kluivert, because the Discourse Model contained an appropriate unique antecedent (namely, the proper name Kluivert that was used in the third sentence). 
When a new sentence has been generated, the Discourse Model is updated accordingly, and then the sentence with its full parse tree and the updated Discourse Model are input to the Prosody module.</Paragraph> </Section> <Section position="6" start_page="39" end_page="42" type="metho"> <SectionTitle> 3 Computing prosody </SectionTitle> <Paragraph position="0"> In this section we present the rules that are used in the Prosody module of the LGM, which determines the location of accents and phrase boundaries in a generated sentence on the basis of both syntactic and semantic information.</Paragraph> <Paragraph position="1"> &quot;De &quot;wedstrijd tussen &quot;PSV en &quot;Ajax / eindigde in &quot;@een//- &quot;@drie /// &quot;Vijfentwintig duizend &quot;toeschouwers / bezochten het &quot;Philipsstadion /// &quot;Ajax nam na &quot;vijf &quot;minuten de &quot;leiding / door een &quot;treffer van &quot;Kluivert /// &quot;Dertien minuten &quot;later / liet de aanvaller zijn &quot;tweede doelpunt aantekenen /// De % &quot;verdediger &quot;Blind / verzilverde in de &quot;drieentachtigste minuut een &quot;strafschop voor Ajax /// Vlak voor het &quot;eindsignaal / bepaalde &quot;Nilis van &quot;PSV de &quot;eindstand / op &quot;@een//- &quot;@drie /// % &quot;Scheidsrechter van &quot;Dijk / &quot;leidde het duel /// &quot;Valckx van &quot;PSV kreeg een &quot;gele &quot;kaart ///</Paragraph> <Paragraph position="2"> Translation: The match between PSV and Ajax ended in 1-3. Twenty-five thousand spectators visited the Philips stadium. After five minutes, Ajax took the lead through a goal by Kluivert. Thirteen minutes later the forward had his second goal noted. The defender Blind kicked a penalty home for Ajax in the 83rd minute. Just before the end signal, Nilis of PSV brought the final score to 1-3. Referee Van Dijk led the match. Valckx of PSV received a yellow card.</Paragraph> <Paragraph position="3">
First we will discuss the accentuation algorithm, which is based on a version of Focus-Accent Theory proposed in (Dirksen, 1992) and (Dirksen and Quené, 1993). In Focus-Accent Theory, binary branching metrical trees are used to represent the relative prominence of nodes with respect to pitch accent.</Paragraph> <Paragraph position="4"> We will use our previous example sentence, Dertien minuten later liet de aanvaller zijn tweede doelpunt aantekenen, as an illustration. First, the accentuation algorithm constructs the sentence's metrical tree, shown in Figure 5 (simplified). In our implementation, this tree corresponds to the sentence's syntactic tree, 1 except that its nodes have focus markers and are labeled weak or strong. The focus properties of the nodes in the metrical tree are determined as follows.</Paragraph> <Paragraph position="5"> Initially, all maximal projections (NP, VP etc.) are assigned a positive focus value, indicated as \[+F\]. The other nodes are not specified for focus. These initial focus values can be changed by non-syntactic factors causing the focus value to become negative, indicated as \[-F\]. This happens in three cases: (1) a node dominates an unaccentable word; (2) a node represents given information; (3) a node dominates only nodes which are marked \[-F\]. that phrases expressing 'new' information are normally accented, while phrases expressing 'given' or 'old' information are usually deaccented.</Paragraph> <Paragraph position="6"> Unaccentable words, e.g., certain function words, are explicitly listed. Our example sentence contains only one such word, the determiner de ('the'). The rules for determining givenness are based on the theory proposed by van Deemter (1994), who distinguishes two kinds of givenness: object-givenness and concept-givenness.
A phrase is regarded as object-given if it refers to a discourse entity that has been referred to earlier in its local discourse domain, which in the present implementation consists of all preceding sentences in the same paragraph. In the example, checking the Discourse Model reveals that the phrases de aanvaller ('the forward') and zijn ('his') are object-given, because their referent (Kluivert) was referred to in the preceding sentence, which belongs to the same paragraph. This means that their dominating nodes in the metrical tree must be marked \[-F\]. This example illustrates that object-givenness does not depend on the surface form of the referring expression, but only on its referent. The expressions de aanvaller and zijn are object-given even though they were not used earlier in the text.</Paragraph> <Paragraph position="7"> The second kind of givenness, concept-givenness, occurs if the root of a word is synonymous (including identity) with the root of a preceding word in the local discourse domain, or if the concept expressed by the second word subsumes the concept expressed by the first word. Our example sentence contains two instances of the first case: the words minuten and doelpunt are concept-given, and therefore marked \[-F\], due to the presence in the preced- ... can be illustrated by the sequence Kluivert is een heel goede aanvaller; Hij is de beste speler van Ajax ('Kluivert is a very good forward; He is the best player of Ajax'). Since the concept speler ('player') subsumes the concept aanvaller ('forward'), the word speler in the second sentence will be defocused due to concept-givenness.</Paragraph> <Paragraph position="8"> Note that the first case of concept-givenness is the only kind of givenness distinguished in D2S which can also be determined in a relatively easy way in unrestricted Text-to-Speech systems, e.g., NewSpeak (Hirschberg, 1990; Hirschberg, 1992).
The second case of concept-givenness, subsumption, will be very difficult to detect in an unrestricted Text-to-Speech system because it requires the presence of a concept hierarchy, which is only feasible if the relevant concepts are known in advance. Finally, determining object-givenness will also be very difficult in Text-to-Speech, because it makes very high demands on text analysis.</Paragraph> <Paragraph position="9"> After the metrical tree nodes have been assigned focus markings, their weak/strong labelling can be determined. This labelling depends both on the structure of the tree and the focus properties of the nodes. In Dutch, the structural rule is that the left node of two sisters is weak and the right node is strong, unless the right node is a zero projection, like the V0 node dominating aantekenen in Figure 5. 3 This structural labelling can be changed under the influence of focus. If the structurally strong node is marked \[-F\] while the structurally weak node is not, the so-called Default Accent Rule applies and the labelling is switched. In Figure 5, this happened to the AP dominating tweede and the N' dominating doelpunt. The N' is marked \[-F\] because all the nodes it dominates are marked \[-F\]. (See defocusing rule (3) given above.) After the weak/strong labelling has been determined, accents are assigned according to the following algorithm: each node that is marked \[+F\] launches an accent, which trickles down the tree along a path of strong nodes until it lands on a terminal node (a word). In our example, the accents launched by CP, IP and VP all coincide with the accent launched by the NP node dominating zijn tweede doelpunt, finally landing on the word tweede. Note that if the word doelpunt had not been concept-given, then the N0 and the N' would not have been marked \[-F\] and the Default Accent Rule would not have applied.
The accent would then have landed on doelpunt. Since the NP node dominating de aanvaller is weak, no accent trickles down to it, and because it is marked \[-F\] it does not launch an accent itself. The AP node dominating the phrase dertien minuten later (its internal structure is not shown due to lack of space) does launch an accent, which trickles down to the word later. The NP dertien minuten, which is contained in the AP, also launches an accent; since this cannot land on the word minuten (which is defocused due to concept-givenness) it ends up on the word dertien.</Paragraph> <Paragraph position="10"> Recently, an algorithm for the generation of contrastive accent has been added to the GoalGetter system. This algorithm assigns a pitch accent to phrases which provide contrastive information, overriding deaccentuation due to givenness. For more details on the algorithm, see Theune (1997).</Paragraph> <Paragraph position="11"> 3 Evidence for this rule comes from constructions like the following: (i) Kluivert liep \[vP \[v0 voorbij\] \[NP het doel\]\] (ii) Kluivert liep \[vP \[NP het doel\] \[v0 voorbij\]\] Both (i) and (ii) can be translated as 'Kluivert walked past the goal'. Since voorbij is not accented in either case, the p0 node should be labeled weak. The fact that voorbij is unaccentable in these positions cannot be explained by claiming the word itself is unaccentable, since in Kluivert liep er voorbij ('Kluivert walked past it') the word does receive an accent.</Paragraph> <Paragraph position="12"> After accentuation, phrase boundaries are assigned. Three phrase boundary strengths are distinguished. 4 The sentence-final boundary (///) is the strongest one. Words which are clause final (i.e., the last word in a CP or IP) or which precede a punctuation symbol other than a comma (e.g., ';') are followed by a major boundary (//).
Minor boundaries (/) are assigned to other words preceding a comma.</Paragraph> <Paragraph position="13"> Additionally, constituents to the left of an I', a C' or a maximal projection are followed by a minor boundary, provided that both constituents are accessible for accent, and that the left one has sufficient length (more than four syllables). This is a slightly modified version of a structural condition proposed by Dirksen and Quené (1993). In our example only the AP dertien minuten later meets this condition and is therefore followed by a minor phrase boundary. Since the sentence contains no punctuation and consists of just one clause, the only other phrase boundary is the sentence-final one.</Paragraph> </Section> <Section position="7" start_page="42" end_page="43" type="metho"> <SectionTitle> 4 Speech Generation </SectionTitle> <Paragraph position="0"> The SGM has two output modes, phrase concatenation and phonetics-to-speech, each of which makes optimal use of the prosodic markers generated by the LGM. We start with a brief description of the two output modes, followed by a discussion of the prosodic realization in either output mode.</Paragraph> <Paragraph position="1"> Phrase concatenation - Phrase concatenation is a technique which tries to reconcile the high-fidelity quality and inherent naturalness of prerecorded speech with the flexibility of synthetic speech. Entire phrases and words are recorded, and played back in different orders to form complete utterances.</Paragraph> <Paragraph position="2"> In this way a large number of utterances can be pronounced on the basis of a limited number of prerecorded phrases, saving memory space and increasing flexibility.
This technique is best applied to a carrier-and-slot situation where there is a limited number of types of utterances (carriers) with variable information to be inserted in fixed positions (slots).</Paragraph> <Paragraph position="3"> The systems based on D2S fit this situation well.</Paragraph> <Paragraph position="4"> The carriers correspond to the syntactic templates and these have slots for variable information such as match results, player names, etc.</Paragraph> <Paragraph position="5"> Successful application of the phrase concatenation technique is not quite as trivial as it may seem at first sight. If all the phrases are recorded in isolation without taking their accentuation or their position in the sentence into account, the resulting speech will have discontinuities in duration, loudness and intonation. Our method is more sophisticated in that different prosodic variants for otherwise identical phrases have been recorded. To determine how many and what prosodic realizations should be recorded for each phrase, a thorough analysis of the material the system can generate is required.</Paragraph> <Paragraph position="6"> 4 In longer texts, containing more complicated constructions, it might be desirable to distinguish more levels. Sanderman (1996) proposes a boundary depth of five to achieve more natural phrasing.</Paragraph> <Section position="1" start_page="42" end_page="43" type="sub_section"> <SectionTitle> Phonetics-to-Speech - Synthetic speech is far </SectionTitle> <Paragraph position="0"> more flexible than any form of prerecorded speech.</Paragraph> <Paragraph position="1"> Since there is complete control over the realization it is very well suited to test the accentuation and phrasing rules.
In commercial applications synthetic speech is not used very often since the naturalness of the output speech still leaves a great deal to be desired.</Paragraph> <Paragraph position="2"> Because the LGM provides all relevant information there is no need for full-fledged text-to-speech synthesis. The LGM generates an orthographic representation which has a unique mapping to a phonetic representation. 5 This makes it possible to do errorless grapheme-to-phoneme conversion by looking up the words in a lexicon instead of using rules. Our phonetics-to-speech system, SPENGI (SPeech synthesis ENGIne) uses diphone concatenation in either LPC or PSOLA format. The rule formalism for intonation is an implementation based on the intonation theory of 't Hart et al. (1990).</Paragraph> <Paragraph position="3"> Realizing prosody in speech generation The enriched text that the LGM generates contains several prosodic markers. In the phrase concatenation component these markers trigger the choice of the appropriate prosodic variant from the phrase database and the pauses to be inserted at the appropriate positions.</Paragraph> <Paragraph position="4"> The carrier sentences have been recorded in just one prosodic version. The variable words that fill the slots have been recorded in six different prosodic variants to account for the place in the sentence where they occur and the accentuation they receive.</Paragraph> <Paragraph position="5"> A word can be either accented or deaccented. We did not instruct our speaker as to how to realize the accents in the carrier sentences. In the variables we just made sure that accents were realized consistently in each category. When a word occurs before a minor phrase boundary the word is realized with a continuation rise. A major phrase boundary triggers a pause and possibly a lengthening of the word preceding the boundary. Before a final phrase boundary, the word is realized with a final fall. 
Inserting the right words in the right contexts optimizes the prosody of the output speech, thus achieving fluency and a natural rendering.</Paragraph> <Paragraph position="6"> In Dutch, the score of a match is pronounced in a special way: the major boundary between the two numbers triggers lengthening of the first number and a pause between the two numbers, but the two accented numbers are realized with a so-called 'flat hat' pattern as if they were part of the same clause (see 't Hart et al. (1990) for a description of pitch movements).</Paragraph> <Paragraph position="7"> This is indicated by a special marker used only in the phrase concatenation component of GoalGetter (the @-sign). There is another special marker (the %-sign) to mark nouns functioning as an adjunct to another noun. The special nouns are always accented and shorter in duration than when they occur as a head noun. Figure 6 shows a stylized pitch contour of the opening sentence of Figure 3, which illustrates how the score is pronounced.</Paragraph> <Paragraph position="8"> In the phonetics-to-speech component the prosodic markers are used to trigger the intonation and duration rules. Intonation is represented as a series of pitch movements with restrictions on the possible combinations of movements. The words that are accented are given a prominence-lending pitch pattern (a pointed hat or a flat hat are most commonly used). At the boundaries a pause of some length can occur, where the length of the pause depends on the strength of the boundary. A boundary can also trigger a continuation rise or pre-boundary lengthening, as mentioned above. To allow for variation in the intonation, each rule has a number of weighted alternatives from which a random choice is made (taking the weights into account).
This also makes it possible to have some optional rules, for instance, for the melodic highlighting of syntactic boundaries which is not obligatory.</Paragraph> </Section> </Section></Paper>
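The weighted random choice among rule alternatives described at the end of Section 4 can be illustrated with a minimal Python sketch. The rule tables, pattern names and weights below are hypothetical and are not taken from SPENGI; the sketch only shows one way in which a rule with weighted alternatives, and an optional rule modelled as a "do nothing" alternative, might be represented.

```python
import random

# Hypothetical alternatives for realizing an accent; names and weights
# are illustrative assumptions, not values from the described system.
ACCENT_ALTERNATIVES = [
    ("pointed_hat", 0.6),
    ("flat_hat", 0.4),
]

# An optional rule is modelled here by adding an alternative that does
# nothing (None), so the rule is applied only part of the time.
BOUNDARY_HIGHLIGHT_ALTERNATIVES = [
    ("continuation_rise", 0.5),
    (None, 0.5),
]


def choose_alternative(alternatives, rng):
    """Pick one alternative at random, taking the weights into account."""
    options = [option for option, _ in alternatives]
    weights = [weight for _, weight in alternatives]
    # random.Random.choices returns a list of k picks; we need just one.
    return rng.choices(options, weights=weights, k=1)[0]


# A seeded generator makes a run repeatable, which is convenient when
# comparing different renderings of the same enriched text.
rng = random.Random(0)
accent_pattern = choose_alternative(ACCENT_ALTERNATIVES, rng)
boundary_pattern = choose_alternative(BOUNDARY_HIGHLIGHT_ALTERNATIVES, rng)
```

Under these assumptions, changing a weight only changes how often an alternative is selected, and setting the weight of the `None` alternative to zero would turn an optional rule into an obligatory one.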