<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0706">
  <Title>Architectures for speech-to-speech translation using finite-state models</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Finite-state transducers and speech translation
</SectionTitle>
    <Paragraph position="0"> The statistical framework allows us to formulate the speech translation problem as follows: Let x be an acoustic representation of a given utterance; typically, a sequence of acoustic vectors or "frames". The translation of x into a target-language sentence can be formulated as the search for a word sequence, t̂, from the target language such that:</Paragraph>
    <Paragraph position="1"> \hat{t} = \operatorname*{argmax}_{t} \Pr(t \mid x) \quad (1) </Paragraph>
    <Paragraph position="2"> Conceptually, the translation can be viewed as a two-step process (Ney, 1999; Ney et al., 2000):</Paragraph>
    <Paragraph position="4"> where s is a sequence of source-language words which would match the observed acoustic sequence x and t is a target-language word sequence associated with s. Consequently,</Paragraph>
    <Paragraph position="6"> and, with the natural assumption that Pr(x|s,t) does not depend on the target sentence t,</Paragraph>
    <Paragraph position="8"> Using a SFST as a model for Pr(s,t) and HMMs to model Pr(x|s), Eq. 3 is transformed in the optimization problem:</Paragraph>
    <Paragraph position="10"> where PrT (s,t) is the probability supplied by the SFST and PrM(x|s) is the density value supplied by the corresponding HMMs associated to s for the acoustic sequence x.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Finite-state transducers
</SectionTitle>
      <Paragraph position="0"> A SFST, T , is a tuple &lt;Q,S,[?],R,q0,F, P&gt; , where Q is a finite set of states; q0 is the initial state; S and [?] are finite sets of input symbols (source words) and output symbols (target words), respectively (S[?][?] = [?]); R is a set of transitions of the form (q,a,o,qprime) for q,qprime [?] Q, a [?] S, o [?] [?]star and1 P : R - IR+ (transition probabilities) and F : Q - IR+ (finalstate probabilities) are functions such that [?]q [?] Q:</Paragraph>
      <Paragraph position="2"> Fig. 1 shows a small fragment of a SFST for Spanish to English translation.</Paragraph>
      <Paragraph position="3"> A particular case of finite-state transducers are known as subsequential transducers (SSTs). These are finite-state transducers with the restriction of being deterministic (if (q,a,o,q), (q,a,oprime,qprime) [?] R, then o = oprime and q = qprime). SSTs also have output strings associated to the (final) states. This can fit well under the above formulation by simply adding an end-off-sentence marker to each input sentence.</Paragraph>
      <Paragraph position="4"> For a pair (s,t) [?] Sstar x [?]star, a translation form, ph, is a sequence of transitions in a SFST T :</Paragraph>
      <Paragraph position="6"> where ~tj denotes a substring of target words (the empty string for ~tj is also possible), such that ~t1 ~t2 ...~tI = t and I is the length of the source sentence s. The probability of ph is</Paragraph>
      <Paragraph position="8"> Finally, the probability of the pair (s,t) is</Paragraph>
      <Paragraph position="10"> where d(s,t) is the set of all translation forms for the pair (s,t).</Paragraph>
      <Paragraph position="11"> These models have implicit source and target language models embedded in their definitions, which are simply the marginal distributions of PrT . In practice, the source (target) language model can be obtained by removing the target (source) words from each transition of the model.</Paragraph>
      <Paragraph position="12"> 1By[?]star andSstar we denote the sets of finite-length strings on [?] and S, respectively</Paragraph>
      <Paragraph position="14"> doble / with two beds (1) doble / double room (0.3) individual / single room (0.7)  be translated to either &amp;quot;a double room&amp;quot; or &amp;quot;a room with two beds&amp;quot;. The most probable translation is the first one with probability of 0.09.</Paragraph>
      <Paragraph position="15"> The structural (states and transitions) and the probabilistic components of a SFST can be learned automatically from training pairs in a single process using the MGTI technique (Casacuberta, 2000). Alternatively, the structural component can be learned using the OMEGA technique (Vilar, 2000), while the probabilistic component is estimated in a second step using maximum likelihood or other possible criteria (Pic'o and Casacuberta, 2001). One of the main problems that appear during the learning process is the modelling of events that have not been seen in the training set. This problem can be confronted, in a similar way as in language modelling, by using smoothing techniques in the estimation process of the probabilistic components of the SFST (Llorens, 2000). Alternatively, smoothing can be applied in the process of learning both components (Casacuberta, 2000).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Architectures for speech translation
</SectionTitle>
      <Paragraph position="0"> Using Eq. 7 as a model for Pr(s,t) in Eq. 4,</Paragraph>
      <Paragraph position="2"> For the computation of PrM(x|s) in Eq. 8, let b be an arbitrary segmentation of x into I acoustic subsequences, each of which associated with a source word (therefore, I is the number of words in s). Then:</Paragraph>
      <Paragraph position="4"> where -xi is the i-th. acoustic segment of b, and each source word si has an associated HMM that supplies the density value PrM(-xi|si).</Paragraph>
      <Paragraph position="5"> Finally, by substituting Eq. 5 and Eq. 9 into Eq. 8 and approximating sums by maximisations:</Paragraph>
      <Paragraph position="7"> Solving this maximisation yields (an approximation to) the most likely target-language sentence ^t for the observed source-language acoustic sequence x.</Paragraph>
      <Paragraph position="8"> This computation can be accomplished using the well known Viterbi algorithm. It searches for an optimal sequence of states in an integrated network (integrated architecture) which is built by substituting each edge of the SFST by the corresponding HMM of the source word associated to the edge.</Paragraph>
      <Paragraph position="9"> This integration process is illustrated in Fig. 2. A small SFST is presented in the first panel (a) of this figure. In panel (b), the source words in each edge are substituted by the corresponding phonetic transcription. In panel (c) each phoneme is substituted by the corresponding HMM of the phone. Clearly, this direct integration approach often results in huge finite-state networks. Correspondingly, a straight-forward (dynamic-programming) search for an optimal target sentence may require a prohibitively high computational effort. Fortunately, this computational cost can be dramatically reduced by means of standard heuristic acceleration techniques such as beam search.</Paragraph>
      <Paragraph position="10"> An alternative, which sacrifices optimality more drastically, is to break the search down into two steps, leading to a so-called &amp;quot;serial architecture&amp;quot;. In the first step a conventional source-language speech decoding system (using just a source-language language model) is used to obtain a single (may be multiple) hypothesis for the sequence of uttered words.</Paragraph>
      <Paragraph position="11"> In the second step, this text sequence is translated into a target-language sentence.</Paragraph>
      <Paragraph position="12">  (figure c) in a FST (figure a). l denotes the empty string in panels a and b. In panel c, source symbols are typeset in small fonts, target strings are typeset in large fonts and edges with no symbols denote empty transitions.</Paragraph>
      <Paragraph position="13"> Using Pr(s,t) = Pr(t  |s) * Pr(s) in Eq. 3 and approximating the sum by the maximum, the optimization problem can be presented as</Paragraph>
      <Paragraph position="15"> In other words, the search for an optimal target-language sentence is now approximated as follows:  1. Word decoding of x. A source-language sentence ^s is searched for using a source language model, PrN(s), for Pr(s) and the corresponding HMMs, PrM(x|s), to model Pr(x|s): ^s [?] argmax s (PrN(s) * PrM(x|s)).</Paragraph>
      <Paragraph position="16"> 2. Translation of ^s. A target-language sentence ^t is searched for using a SFST, PrT (^s,t), as a</Paragraph>
      <Paragraph position="18"> A better alternative for this crude &amp;quot;two-step&amp;quot; approach is to use Pr(s,t) = Pr(s  |t)*Pr(t) in Eq. 3.</Paragraph>
      <Paragraph position="19"> Now, approximating the sum by the maximum, the optimization problem can be presented as</Paragraph>
      <Paragraph position="21"> The main problem of this approach is the term t that appears in the first maximisation (Eq. 16).</Paragraph>
      <Paragraph position="22"> A possible solution is to follow an iterative procedure where t, that is used for computing ^s, is the one obtained from argmaxt Pr(^s,t) in the previous iteration (Garc'ia-Varea et al., 2000). In this case, Pr(s  |t) can be modelled by a source language model that depends on a previously computed ~t: PrN,~t(s). In the first iteration no ^t is known, but PrN,~t(s) can be approximated by PrN(s). Following this idea, the search can be formulated as: Initialization: Let PrN,t(s) be approximated by a source language model PrN(s).</Paragraph>
      <Paragraph position="23"> while not convergence 1. Word decoding of x. A source-language sentence ^s is searched for using a source language model that depends on the target sentence, PrN,Vt(s), for Pr(s  |t) (Vt is the ^t computed in the previous iteration) and the corresponding HMMs, PrM(x  |s), to model Pr(x |</Paragraph>
      <Paragraph position="25"> The first iteration corresponds to the sequential architecture proposed above.</Paragraph>
      <Paragraph position="26"> While this seems a promising idea, only very preliminary experiments were carried out (Garc'ia-Varea et al., 2000) and it has not been considered in the experiments presented in the present paper.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>