<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1008"> <Title>Statistical Modeling for Unit Selection in Speech Synthesis</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Unit Selection Methods </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Overview of a Traditional Unit Selection System </SectionTitle> <Paragraph position="0"> This section describes in detail the cost functions used in the AT&amp;T Natural Voices Product, which we use as the baseline in our experiments; see (Beutnagel et al., 1999a) for more details about this system. In this system, unit selection is based on (Hunt and Black, 1996), but the units correspond to halfphones instead of phones. Let U be the set of recorded units. Two cost functions are defined: the target cost C_t(f_i, u_i) estimates the mismatch between the feature vector f_i and the unit u_i; the concatenation cost C_c(u_i, u_j) estimates the smoothness of the acoustic signal when concatenating the units u_i and u_j. Given a sequence f = f_1 ... f_n of feature vectors, unit selection can then be formulated as the problem of finding the sequence of units</Paragraph> <Paragraph position="2"> In practice, not all unit sequences of a given length are considered. A preselection method such as the one proposed by (Conkie et al., 2000) is used. The computation of the target cost can be split into two parts: the context cost C_p, which is the component of the target cost corresponding to the phonetic context, and the feature cost C_f, which corresponds to the other components of the target cost:</Paragraph> <Paragraph position="4"> For each phonetic context r of length 5, a list L(r) of the units most frequently used in the phonetic context r is computed. For each feature vector f_i in f, the candidate units for f_i are computed in the following way. 
Let r_i be the 5-phone context of f_i in f. The context costs between f_i and all the units in the preselection list L(r_i) of the phonetic context r_i are computed, and the M units with the best context cost are selected:</Paragraph> <Paragraph position="6"> The feature costs between f_i and the units in U_i are then computed, and the N units with the best target cost are selected:</Paragraph> <Paragraph position="8"> is determined using a classical Viterbi search. Thus, for each position i, the N^2 concatenation costs between the units in U'_i and U'_{i+1} need to be computed. The caching method for concatenation costs proposed in (Beutnagel et al., 1999b) can be used to improve the efficiency of the system.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Statistical Modeling Approach </SectionTitle> <Paragraph position="0"> Our statistical modeling approach was described in Section 1. As already mentioned, our general approach would consist of deriving both the target cost -log P(f|u) and the concatenation cost -log P(u) from appropriate training data using general statistical methods. To simplify the problem, we use the existing target cost provided by the traditional unit selection system and concentrate on the problem of estimating the concatenation cost.</Paragraph> <Paragraph position="1"> We used the unit selection system presented in the previous section to generate a large corpus of more than 8M unit sequences, each unit corresponding to a unique recorded halfphone. This corpus was used to build an n-gram statistical language model using the Katz backoff smoothing technique (Katz, 1987).</Paragraph> <Paragraph position="2"> This model provides us with a new cost function, the grammar cost C_g, defined by:</Paragraph> <Paragraph position="4"> where P is the probability distribution estimated by our model. 
We used this new cost function to replace both the concatenation and context costs used in the traditional approach. Unit selection then consists of finding the unit sequence u such that:</Paragraph> <Paragraph position="6"> In this approach, rather than using a preselection method such as that of (Conkie et al., 2000), we use the statistical language model to restrict the candidate space (see Section 4.2).</Paragraph> </Section> </Section> </Paper>