<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1008">
  <Title>Statistical Modeling for Unit Selection in Speech Synthesis</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Representation by Weighted Finite-State
Transducers
</SectionTitle>
    <Paragraph position="0"> An important advantage of the statistical framework we introduced for unit selection is that the resulting components can be naturally represented by weighted finite-state transducers. This casts unit selection into a familiar schema, that of a Viterbi decoder applied to a weighted transducer.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Weighted Finite-State Transducers
</SectionTitle>
      <Paragraph position="0"> We give a brief introduction to weighted finite-state transducers. We refer the reader to (Mohri, 2004; Mohri et al., 2000) for an extensive presentation of these devices and will use the definitions and notation introduced by these authors.</Paragraph>
      <Paragraph position="1"> A weighted finite-state transducer T is an 8-tuple T = (S,[?],Q,I,F,E,l,r) where S is the finite input alphabet of the transducer, [?] is the finite output alphabet, Q is a finite set of states, I [?] Q the set of initial states, F [?] Q the set of final states,</Paragraph>
      <Paragraph position="3"> nite set of transitions, l : I - R the initial weight function, and r : F - R the final weight function mapping F to R. In our statistical framework, the weights can be interpreted as log-likelihoods, thus there are added along a path. Since we use the standard Viterbi approximation, the weight associated by T to a pair of strings (x,y) [?] S[?] x [?][?] is given by:</Paragraph>
      <Paragraph position="5"> where R(I,x,y,F) denotes the set of paths from an initial state p [?] I to a final state q [?] F with input label x and output label y, w[pi] the weight of the path pi, l[p[pi]] the initial weight of the origin state of pi, and r[n[pi]] the final weight of its destination.</Paragraph>
      <Paragraph position="6"> A Weighted automaton A = (S,Q,I,F,E,l,r) is defined in a similar way by simply omitting the output (or input) labels. We denote by P2(T) the  transducer T2. (c) T1 *T2, the result of the composition of T1 and T2.</Paragraph>
      <Paragraph position="7"> weighted automaton obtained from T by removing its input labels.</Paragraph>
      <Paragraph position="8"> A general composition operation similar to the composition of relations can be defined for weighted finite-state transducers (Eilenberg, 1974; Berstel, 1979; Salomaa and Soittola, 1978; Kuich and Salomaa, 1986). The composition of two transducers T1 and T2 is a weighted transducer denoted</Paragraph>
      <Paragraph position="10"> There exists a simple algorithm for constructing</Paragraph>
      <Paragraph position="12"> 1997; Mohri et al., 1996). The states of T are identified as pairs of a state of T1 and a state of T2. A state (q1,q2) in T1*T2 is an initial (final) state if and only if q1 is an initial (resp. final) state of T1 and q2 is an initial (resp. final) state of T2. The transitions of T are the result of matching a transition of T1 and a transition of T2 as follows: (q1,a,b,w1,qprime1) and (q2,b,c,w2,qprime2) produce the transition</Paragraph>
      <Paragraph position="14"> in T. The efficiency of this algorithm was critical to that of our unit selection system. Thus, we designed an improved composition that we will describe later.</Paragraph>
      <Paragraph position="15"> Figure 1(c) gives the resulting of the composition of the weighted transducers given figure 2(a) and (b).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Language Model Weighted Transducer
</SectionTitle>
      <Paragraph position="0"> The n-gram statistical language model we construct for unit sequences can be represented by a weighted automaton G which assigns to each sequence u its log-likelihood:</Paragraph>
      <Paragraph position="2"> according to our probability estimate P. Since a unit sequence u uniquely determines the corresponding halfphone sequence x, the n-gram statistical model equivalently defines a model of the joint distribution of P(x,u). G can be augmented to define a weighted transducer ^G assigning to pairs (x,u) their log-likelihoods. For any halfphone sequence x and unit sequence u, we define ^G by:</Paragraph>
      <Paragraph position="4"> The weighted transducer ^G can be used to generate all the unit sequences corresponding to a specific halfphone sequence given by a finite automaton p, using composition: p* ^G. In our case, we also wish to use the language model transducer ^G to limit the number of candidate unit sequences considered. We will do that by giving a strong precedence to n-grams of units that occurred in the training corpus (see Section 4.2).</Paragraph>
      <Paragraph position="5"> Example Figure 2(a) shows the bigram model G estimated from the following corpus:</Paragraph>
      <Paragraph position="7"> where &lt;s&gt; and &lt;/s&gt; are the symbols marking the start and the end of an utterance. When the unit u1 is associated to the halfphone p1 and both units u1 and u2 are associated to the halfphone p2, the corresponding weighted halfphone-to-unit transducer ^G is the one shown in Figure 2(b).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Unit Selection with Weighted Finite-State
Transducers
</SectionTitle>
      <Paragraph position="0"> From each sequence f = f1...fn of feature vectors specified by the text analysis frontend, we can straightforwardly derive the halfphone sequence to be synthesized and represent it by a finite automaton p, since the first component of each feature vector fi is the corresponding halfphone. Let W be the weighted automaton obtained by composition of p with ^G and projection on the output:</Paragraph>
      <Paragraph position="2"> W represents the set of candidate unit sequences with their respective grammar costs. We can then use a speech recognition decoder to search for the best sequence u since W can be thought of as the  counterpart of a speech recognition transducer, f the equivalent of the acoustic features and Cf the analogue of the acoustic cost. Our decoder uses a standard beam search of W to determine the best path by computing on-the-fly the feature cost between each unit and its corresponding feature vector. null Composition constitutes the most costly operation in this framework. Section 4 presents several of the techniques that we used to speed up that algorithm in the context of unit selection.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Algorithms
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Composition with String Potentials
</SectionTitle>
      <Paragraph position="0"> In general, composition may create non-coaccessible states, i.e., states that do not admit a path to a final state. These states can be removed after composition using a standard connection (or trimming) algorithm that removes unnecessary states. However, our purpose here is to avoid the creation of such states to save computational time.</Paragraph>
      <Paragraph position="1"> To that end, we introduce the notion of string potential at each state.</Paragraph>
      <Paragraph position="2"> Let i[pi] (o[pi]) be the input (resp. output) label of a path pi, and denote by x [?] y the longest common prefix of two strings x and y. Let q be a state in a weighted transducer. The input (output) string potential of q is defined as the longest common prefix of the input (resp. output) labels of all the paths in T from q to a final state:</Paragraph>
      <Paragraph position="4"> The string potentials of the states of T can be computed using the generic shortest-distance algorithm of (Mohri, 2002) over the string semiring. They can be used in composition in the following way. We will say that two strings x and y are comparable if x is a prefix of y or y is a prefix of x.</Paragraph>
      <Paragraph position="5"> Let (q1,q2) be a state in T = T1 * T2. Note that (q1,q2) is a coaccessible state only if the output string potential of q1 in T1 and the input string potential of q2 in T2 are comparable, i.e., po(q1) is a prefix of pi(q2) or pi(q2) is a prefix of po(q1).</Paragraph>
      <Paragraph position="6"> Hence, composition can be modified to create only those states for which the string potentials are compatible. null As an example, state 2 = (1,5) of the transducer</Paragraph>
      <Paragraph position="8"> strings.</Paragraph>
      <Paragraph position="9"> The notion of string potentials can be extended to further reduce the number of non-coaccessible states created by composition. The extended input string potential of q in T, is denoted by -pi(q) and is the set of strings defined by:</Paragraph>
      <Paragraph position="11"> where zi(q) [?] S and is such that for every s [?] zi(q), there exist a path pi from q to a final state such that pi(q)s is a prefix of the input label of pi. The extended output string potential of q, -po(q), is defined similarly. A state (q1,q2) in T1 *T2 is coaccessible</Paragraph>
      <Paragraph position="13"> Using string potentials helped us substantially improve the efficiency of composition in unit selection.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Language Model Transducer - Backoff
</SectionTitle>
      <Paragraph position="0"> As mentioned before, the transducer ^G represents an n-gram backoff model for the joint probability distribution P(x,u). Thus, backoff transitions are used in a standard fashion when ^G is viewed as an automaton over paired sequences (x,u). Since we use ^G as a transducer mapping halfphone sequences to unit sequences to determine the most likely unit sequence u given a halfphone sequence x 1we need to clarify the use of the backoff transitions in the composition p* ^G.</Paragraph>
      <Paragraph position="1"> Denote by O(V ) the set of output labels of a set of transitions V . Then, the correct use derived from the definition of the backoff transitions in the joint model is as follows. At a given state s of ^G and for a given input halfphone a, the outgoing transitions with input a are the transitions V of s with input label a, and for each b negationslash[?] O(V), the transition of the first backoff state of s with input label a and output b.</Paragraph>
      <Paragraph position="2"> For the purpose of our unit selection system, we had to resort to an approximation. This is because in general, the backoff use just outlined leads to examining, for a given halfphone, the set of all units possible at each state, which is typically quite large.2 Instead, we restricted the inspection of the backoff states in the following way within the composition p* ^G. A state s1 in p corresponds in the composed transducer p* ^G to a set of states (s1,s2), s2 [?] S2, where S2 is a subset of the states of ^G. When computing the outgoing transitions of the states in (s1,s2) with input label a, the backoff transitions of a state s2 are inspected if and only if none of the states in S2 has an outgoing transition with input la- null tical language models, about 400,000, is quite large compared to the usual word-based models.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Language Model Transducer - Shrinking
</SectionTitle>
      <Paragraph position="0"> A classical algorithm for reducing the size of an n-gram language model is shrinking using the entropy-based method of (Stolcke, 1998) or the weighted difference method (Seymore and Rosenfeld, 1996), both quite similar in practice. In our experiments, we used a modified version of the weighted difference method. Let w be a unit and let h be its conditioning history within the n-gram model. For a given shrink factor g, the transition corresponding to the n-gram hw is removed from the weighted automaton if: log(tildewideP(w|h)) [?]log(ahtildewideP(w|hprime)) [?] gc(hw) (10) where hprime is the backoff sequence associated with h. Thus, a higher-order n-gram hw is pruned when it does not provide a probability estimate significantly different from the corresponding lower-order n-gram sequence hprimew.</Paragraph>
      <Paragraph position="1"> This standard shrinking method needs to be modified to be used in the case of our halfphone-to-unit weighted transducer model with the restriction on the traversal of the backoff transitions described in the previous section. The shrinking methods must take into account all the transitions sharing the same input label at the state identified with h and its back-off state hprime. Thus, at each state identified with h in ^G, a transition with input label x is pruned when the following condition holds: summationdisplay</Paragraph>
      <Paragraph position="3"> where hprime is the backoff sequence associate with h and Xxk is the set of output labels of all the outgoing transitions with input label x of the state identified with k.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>