<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1001">
  <Title>Parameter Estimation for Probabilistic Finite-State Transducers</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Estimation in Parameterized FSTs
</SectionTitle>
    <Paragraph position="0"> We are primarily concerned with the following training paradigm, novel in its generality. Let f_θ : Σ* × ∆* → R≥0 be a joint probabilistic relation that is computed by a weighted FST. The FST was built by some recipe that used the parameter vector θ.</Paragraph>
    <Paragraph position="1"> Changing θ may require us to rebuild the FST to get updated weights; this can involve composition, regexp compilation, multiplication of feature strengths, etc. (Lazy algorithms that compute arcs and states of f_θ on demand (Mohri et al., 1998) can pay off here, since only part of f_θ may be needed subsequently.) [Footnote fragment:] ... cannot be realized by any weighted FST, one can sometimes succeed by first intersecting g with a smaller regular set in which the input being considered is known to fall. In the extreme, if each input string is fully observed (not the case if the input is bound by composition to the output of a one-to-many FST), one can succeed by restricting g to each input string in turn; this amounts to manually dividing f_θ(x, y) by g(x).</Paragraph>
    <Paragraph position="2"> 10Traditionally log(strength) values are called weights, but this paper uses "weight" to mean something else.</Paragraph>
    <Paragraph position="4"> As training data we are given a set of observed (input, output) pairs, (x_i, y_i). These are assumed to be independent random samples from a joint distribution of the form f_θ̂(x, y); the goal is to recover the true θ̂. Samples need not be fully observed (partly supervised training): thus x_i, y_i may be given as regular sets in which the input and output were observed to fall. For example, in ordinary HMM training, x_i = Σ* and represents a completely hidden state sequence (cf. Ristad (1998), who allows any regular set), while y_i is a single string representing a completely observed emission sequence.11 What to optimize? Maximum-likelihood estimation guesses θ̂ to be the θ maximizing ∏_i f_θ(x_i, y_i). Maximum-posterior estimation tries to maximize P(θ)·∏_i f_θ(x_i, y_i), where P(θ) is a prior probability. In a log-linear parameterization, for example, a prior that penalizes feature strengths far from 1 can be used to do feature selection and avoid overfitting (Chen and Rosenfeld, 1999).</Paragraph>
    <Paragraph position="5"> The EM algorithm (Dempster et al., 1977) can maximize these functions. Roughly, the E step guesses hidden information: if (x_i, y_i) was generated from the current f_θ, which FST paths stand a chance of having been the path used? (Guessing the path also guesses the exact input and output.) The M step updates θ to make those paths more likely.</Paragraph>
    <Paragraph position="6"> EM alternates these steps and converges to a local optimum. The M step's form depends on the parameterization, and the E step serves the M step's needs. Let f_θ be Fig. 1a and suppose (x_i, y_i) = (a(a + b)*, xxz). During the E step, we restrict to paths compatible with this observation by computing x_i ∘ f_θ ∘ y_i, shown in Fig. 2. To find each path's posterior probability given the observation (x_i, y_i), just conditionalize: divide its raw probability by the total probability (≈ 0.1003) of all paths in Fig. 2.</Paragraph>
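    <Paragraph> To make the conditionalization concrete, here is a minimal Python sketch. The function name and toy path probabilities are illustrative, not from the paper, and it assumes the raw probabilities of a finite set of compatible paths are already in hand, whereas x_i ∘ f_θ ∘ y_i may have infinitely many paths; handling the infinite case algebraically is exactly what §4 provides.

def posterior_over_paths(raw_probs):
    """Divide each path's raw probability by the total probability of all
    compatible paths, yielding P(path | x_i, y_i)."""
    total = sum(raw_probs.values())          # e.g. about 0.1003 for Fig. 2
    if total == 0.0:
        raise ValueError("observation (x_i, y_i) has probability 0")
    return {path: p / total for path, p in raw_probs.items()}

if __name__ == "__main__":
    # Three hypothetical accepting paths and their raw probabilities.
    raw = {"path_A": 0.06, "path_B": 0.03, "path_C": 0.0103}
    print(posterior_over_paths(raw))
    </Paragraph>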
    <Paragraph position="7"> But that is not the full E step. The M step uses not individual path probabilities (Fig. 2 has infinitely many) but expected counts derived from the paths. 11To implement an HMM by an FST, compose a probabilistic FSA that generates a state sequence of the HMM with a conditional FST that transduces HMM states to emitted symbols.</Paragraph>
    <Paragraph position="8"> Crucially, §4 will show how the E step can accumulate these counts effortlessly. We first explain their use by the M step, repeating the presentation of §2: If the parameters are the 17 weights in Fig. 1a, the M step reestimates the probabilities of the arcs from each state to be proportional to the expected number of traversals of each arc (normalizing at each state to make the FST Markovian). So the E step must count traversals. This requires mapping Fig. 2 back onto Fig. 1a: to traverse either 8 --a:x--> 9 or 9 --a:x--> 10 in Fig. 2 is "really" to traverse 0 --a:x--> 0 in Fig. 1a. If Fig. 1a was built by composition, the M step is similar but needs the expected traversals of the arcs in Figs. 1b-c. This requires further unwinding of Fig. 1a's 0 --a:x--> 0: to traverse that arc is "really" to traverse Fig. 1b's 4 --a:p--> 4 and Fig. 1c's 6 --p:x--> 6.</Paragraph>
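    <Paragraph> The M step for this arc-probability parameterization is plain count normalization. A minimal sketch, assuming the E step has already produced expected traversal counts keyed by (source state, arc); the names and toy numbers are illustrative only.

from collections import defaultdict

def reestimate_arc_probs(expected_traversals):
    """M step for an FST whose parameters are its arc probabilities: make each
    arc's probability proportional to its expected number of traversals,
    normalizing over the arcs leaving the same state so the FST stays
    Markovian.  `expected_traversals` maps (source_state, arc_id) to the
    expected count produced by the E step."""
    totals = defaultdict(float)
    for (state, _arc), count in expected_traversals.items():
        totals[state] += count
    return {
        (state, arc): count / totals[state]
        for (state, arc), count in expected_traversals.items()
        if totals[state] > 0.0
    }

if __name__ == "__main__":
    # Hypothetical expected counts for two arcs leaving state 0.
    counts = {(0, "a:x->0"): 3.2, (0, "b:z->1"): 0.8}
    print(reestimate_arc_probs(counts))   # arc probabilities 0.8 and 0.2
    </Paragraph>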
    <Paragraph position="9"> If Fig. 1b was defined by the regexp given earlier, traversing 4 --a:p--> 4 is in turn "really" just evidence that the corresponding coin came up heads. To learn the four coin weights, count expected heads and tails for each coin.</Paragraph>
    <Paragraph position="10"> If arc probabilities (or even the four coin weights) have a log-linear parameterization, then the E step must compute c = ∑_i ec_f(x_i, y_i), where ec_f(x, y) denotes the expected vector of total feature counts along a random path in f whose (input, output) matches (x, y). The M step then treats c as fixed, observed data and adjusts θ until the predicted vector of total feature counts equals c, using Improved Iterative Scaling (Della Pietra et al., 1997; Chen and Rosenfeld, 1999).12 For globally normalized, joint models, the predicted vector is ec_f(Σ*, ∆*). If the log-linear probabilities are conditioned on the state and/or the input, the predicted vector is harder to describe (though usually much easier to compute).13 12IIS is itself iterative; to avoid nested loops, run only one iteration at each M step, giving a GEM algorithm (Riezler, 1999). Alternatively, discard EM and use gradient-based optimization. 13For per-state conditional normalization, let D_{j,a} be the set of arcs from state j with input symbol a ∈ Σ; their weights are normalized to sum to 1. Besides computing c, the E step must count the expected number d_{j,a} of traversals of arcs in each D_{j,a}. Then the predicted vector given θ is ∑_{j,a} d_{j,a} · (expected feature counts on a randomly chosen arc in D_{j,a}). Per-state joint normalization (Eisner, 2001b, §8.2) is similar but drops the dependence on a. The difficult case is global conditional normalization. It arises, for example, when training a joint model of the form f_θ = (g ∘ h_θ), where h is a conditional log-linear model of P(v | u) for u ∈ Σ_0*, v ∈ ∆_0*. Then the predicted count vector contributed by h is ∑_i ∑_{u ∈ Σ_0*} P(u | x_i, y_i) · ec_h(u, ∆_0*). The term ∑_i P(u | x_i, y_i) computes the expected count of each u ∈ Σ_0*. It may be found by a variant of §4 in which path values are regular expressions over Σ_0*. It is also possible to use this EM approach for discriminative training, where we wish to maximize ∏_i P(y_i | x_i) and f_θ(x, y) is a conditional FST that defines P(y | x). The trick is to instead train a joint model g ∘ f_θ, where g(x_i) defines P(x_i), thereby maximizing ∏_i P(x_i)·P(y_i | x_i). (Of course, the method of this paper can train such compositions.) If x_1, ..., x_n are fully observed, just define each g(x_i) = 1/n. But by choosing a more general model of g, we can also handle incompletely observed x_i: training g ∘ f_θ then forces g and f_θ to cooperatively reconstruct a distribution over the possible inputs and do discriminative training of f_θ given those inputs. (Any parameters of g may be either frozen before training or optimized along with the parameters of f_θ.) A final possibility is that each x_i is defined by a probabilistic FSA that already supplies a distribution over the inputs; then we consider x_i ∘ f_θ ∘ y_i directly, just as in the joint model.</Paragraph>
    <Paragraph position="11"> Finally, note that EM is not all-purpose. It only maximizes probabilistic objective functions, and even there it is not necessarily as fast as (say) conjugate gradient. For this reason, we will also show below how to compute the gradient of f_θ(x_i, y_i) with respect to θ, for an arbitrary parameterized FST f_θ.</Paragraph>
    <Paragraph position="12"> We remark without elaboration that this can help optimize task-related objective functions, such as ∑_i [formula lost in extraction].</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The E Step: Expectation Semirings
</SectionTitle>
    <Paragraph position="0"> It remains to devise appropriate E steps, which looks rather daunting. Each path in Fig. 2 weaves together parameters from other machines, which we must untangle and tally. In the 4-coin parameterization, the path 8 --a:x--> 9 --a:x--> 10 --a:ε--> 10 --a:ε--> 10 --b:z--> 12 must yield up a vector ⟨H, T, H, T, H, T, H, T⟩ (one heads count and one tails count per coin) that counts the observed heads and tails of the 4 coins. This nontrivially works out to ⟨4, 1, 0, 1, 1, 1, 1, 2⟩. For other parameterizations, the path must instead yield a vector of arc-traversal counts or feature counts.</Paragraph>
    <Paragraph position="1"> Computing a count vector for one path is hard enough, but it is the E step's job to find the expected value of this vector: an average over the infinitely many paths through Fig. 2, in proportion to their posterior probabilities P(π | x_i, y_i). The results for all (x_i, y_i) are summed and passed to the M step.</Paragraph>
    <Paragraph position="2"> Abstractly, let us say that each path π has not only a probability P(π) ∈ [0, 1] but also a value val(π) in a vector space V, which counts the arcs, features, or coin flips encountered along π. The value of a path is the sum of the values assigned to its arcs.</Paragraph>
    <Paragraph position="3"> The E step must return the expected value of the unknown path that generated (x_i, y_i). For example, if every arc had value 1, then the expected value would be the expected path length. Letting Π denote the set of paths in x_i ∘ f_θ ∘ y_i (Fig. 2), the expected value is14
(∑_{π∈Π} P(π) · val(π)) / (∑_{π∈Π} P(π))   (1)</Paragraph>
    <Paragraph position="5"> The denominator of equation (1) is the total probability of all accepting paths in x_i ∘ f_θ ∘ y_i. But while computing this, we will also compute the numerator.</Paragraph>
    <Paragraph position="6"> The idea is to augment the weight data structure with expectation information, so each weight records a probability and a vector counting the parameters that contributed to that probability. We will enforce an invariant: the weight of any pathset Π must be (∑_{π∈Π} P(π), ∑_{π∈Π} P(π) · val(π)) ∈ R≥0 × V, from which (1) is trivial to compute.</Paragraph>
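    <Paragraph> As a sanity check on the invariant, here is a brute-force Python sketch that computes the pair (∑ P(π), ∑ P(π)·val(π)) and then equation (1) from an explicitly enumerated finite pathset; the semiring developed below obtains the same pair without ever enumerating paths. Function names and toy data are illustrative only.

def pathset_weight(paths):
    """Return the pair (sum of P(pi), sum of P(pi) * val(pi)) for an explicit
    finite pathset, where each path is given as (probability, value_vector).
    Equation (1) is then the second component divided by the first."""
    dim = len(paths[0][1])
    total_p = 0.0
    total_pv = [0.0] * dim
    for p, val in paths:
        total_p += p
        for i, v in enumerate(val):
            total_pv[i] += p * v
    return total_p, total_pv

def expected_value(paths):
    total_p, total_pv = pathset_weight(paths)
    return [x / total_p for x in total_pv]

if __name__ == "__main__":
    # Two hypothetical paths with 3-dimensional count vectors.
    toy = [(0.375, [1, 0, 2]), (0.125, [0, 1, 1])]
    print(expected_value(toy))   # [0.75, 0.25, 1.75]
    </Paragraph>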
    <Paragraph position="7"> Berstel and Reutenauer (1988) give a sufficiently general finite-state framework to allow this: weights may fall in any set K (instead of R). Multiplication and addition are replaced by binary operations ⊗ and ⊕ on K. Thus ⊗ is used to combine arc weights into a path weight and ⊕ is used to combine the weights of alternative paths. To sum over infinite sets of cyclic paths we also need a closure operation *, interpreted as k* = ⊕_{i=0}^∞ k^i. The usual finite-state algorithms work if (K, ⊕, ⊗, *) has the structure of a closed semiring.15 Ordinary probabilities fall in the semiring (R≥0, +, ×, *).16</Paragraph>
    <Paragraph position="9"> 14[...] P(x_i, y_i | π) = 1 or 0 according to whether π ∈ Π.</Paragraph>
    <Paragraph position="10"> 15That is: (K, ⊗) is a monoid (i.e., ⊗ : K × K → K is associative) with identity 1. (K, ⊕) is a commutative monoid with identity 0. ⊗ distributes over ⊕ from both sides, 0 ⊗ k = k ⊗ 0 = 0, and k* = 1 ⊕ (k ⊗ k*) = 1 ⊕ (k* ⊗ k). For finite-state composition, commutativity of ⊗ is needed as well.</Paragraph>
    <Paragraph position="11"> 16The closure operation is defined for p ∈ [0, 1) as p* = 1/(1 − p), so cycles with weights in [0, 1) are allowed.</Paragraph>
    <Paragraph position="12"> We now define the V-expectation semiring, (R≥0 × V, ⊕, ⊗, *):
(p1, v1) ⊗ (p2, v2) def= (p1·p2, p1·v2 + v1·p2)   (2)
(p1, v1) ⊕ (p2, v2) def= (p1 + p2, v1 + v2)   (3)</Paragraph>
    <Paragraph position="14"> if p* defined, (p, v)* def= (p*, p*·v·p*)   (4) If an arc has probability p and value v, we give it the weight (p, p·v), so that our invariant (see above) holds if Π consists of a single length-0 or length-1 path. The above definitions are designed to preserve our invariant as we build up larger paths and pathsets. ⊗ lets us concatenate (e.g.) simple paths π1, π2 to get a longer path π with P(π) = P(π1)·P(π2) and val(π) = val(π1) + val(π2). The definition of ⊗ guarantees that path π's weight will be (P(π), P(π)·val(π)). ⊕ lets us take the union of two disjoint pathsets, and * computes infinite unions.</Paragraph>
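    <Paragraph> A minimal Python sketch of operations (2)-(4); the class name and the list-of-floats representation of V are choices made here for illustration, not the paper's.

class ExpWeight:
    """An element (p, v) of the V-expectation semiring, with V = R^n."""

    def __init__(self, p, v):
        self.p = p                      # probability component, in R>=0
        self.v = list(v)                # value component, a vector in V

    def otimes(self, other):            # eq. (2): combine weights along a path
        return ExpWeight(self.p * other.p,
                         [self.p * b + a * other.p
                          for a, b in zip(self.v, other.v)])

    def oplus(self, other):             # eq. (3): combine alternative paths
        return ExpWeight(self.p + other.p,
                         [a + b for a, b in zip(self.v, other.v)])

    def star(self):                     # eq. (4): closure, defined for p in [0, 1)
        ps = 1.0 / (1.0 - self.p)
        return ExpWeight(ps, [ps * a * ps for a in self.v])

    def __repr__(self):
        return f"({self.p}, {self.v})"

def arc_weight(p, value):
    """Give an arc of probability p and value `value` the weight (p, p*value),
    so the invariant holds for a single length-1 path."""
    return ExpWeight(p, [p * x for x in value])

if __name__ == "__main__":
    # Concatenating arcs with probabilities p1, p2 and values v1, v2 yields
    # (p1*p2, p1*p2*(v1 + v2)), i.e. the composed arc acts as if it had
    # value v1 + v2.
    a1 = arc_weight(0.5, [1.0, 0.0])
    a2 = arc_weight(0.4, [0.0, 1.0])
    print(a1.otimes(a2))   # (0.2, [0.2, 0.2])
    </Paragraph>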
    <Paragraph position="15"> To compute (1) now, we only need the total weight t_i of accepting paths in x_i ∘ f_θ ∘ y_i (Fig. 2).</Paragraph>
    <Paragraph position="16"> This can be computed with finite-state methods: the machine (ε:x_i) ∘ f_θ ∘ (y_i:ε) is a version that replaces all input:output labels with ε:ε, so it maps (ε, ε) to the same total weight t_i. Minimizing it yields a one-state FST from which t_i can be read directly! The other "magical" property of the expectation semiring is that it automatically keeps track of the tangled parameter counts. For instance, recall that traversing 0 --a:x--> 0 should have the same effect as traversing both the underlying arcs 4 --a:p--> 4 and 6 --p:x--> 6. And indeed, if the underlying arcs have values v1 and v2, then the composed arc 0 --a:x--> 0 gets weight (p1, p1·v1) ⊗ (p2, p2·v2) = (p1·p2, p1·p2·(v1 + v2)), just as if it had value v1 + v2.</Paragraph>
    <Paragraph position="17"> Some concrete examples of values may be useful: To count traversals of the arcs of Figs. 1b-c, number these arcs and let arc ℓ have value e_ℓ, the ℓth basis vector. Then the ℓth element of val(π) counts the appearances of arc ℓ in path π, or its underlying path.</Paragraph>
    <Paragraph position="18"> A regexp of the form E +_λ F = λE + (1 − λ)F should be weighted as (λ, λ·e_k)E + (1 − λ, (1 − λ)·e_{k+1})F in the new semiring. Then elements k and k + 1 of val(π) count the heads and tails of the λ-coin.</Paragraph>
    <Paragraph position="19"> For a global log-linear parameterization, an arc's value is a vector specifying the arc's features. Then val(π) counts all the features encountered along π.</Paragraph>
    <Paragraph position="20"> Really we are manipulating weighted relations, not FSTs. We may combine FSTs, or determinize or minimize them, with any variant of the semiring-weighted algorithms.17 As long as the resulting FST computes the right weighted relation, the arrangement of its states, arcs, and labels is unimportant. The same semiring may be used to compute gradients. We would like to find f_θ(x_i, y_i) and its gradient with respect to θ, where f_θ is real-valued but need not be probabilistic. Whatever procedures are used to evaluate f_θ(x_i, y_i) exactly or approximately (for example, FST operations to compile f_θ followed by minimization of (ε:x_i) ∘ f_θ ∘ (y_i:ε)) can simply be applied over the expectation semiring, replacing each weight p by (p, ∇p) and replacing the usual arithmetic operations with ⊕, ⊗, etc.18 Equations (2)-(4) preserve the gradient ((2) is the derivative product rule), so this computation yields (f_θ(x_i, y_i), ∇f_θ(x_i, y_i)).</Paragraph>
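    <Paragraph> The gradient use of the same semiring can be illustrated without any FST machinery: run ordinary arithmetic over (p, ∇p) pairs using (2) and (3). The toy function below is purely illustrative; in the paper the same replacement is made inside the weights of the compiled machine.

def otimes(a, b):   # eq. (2): (p1, v1) ⊗ (p2, v2) = (p1*p2, p1*v2 + v1*p2), the product rule
    (p1, v1), (p2, v2) = a, b
    return p1 * p2, [p1 * y + x * p2 for x, y in zip(v1, v2)]

def oplus(a, b):    # eq. (3): (p1, v1) ⊕ (p2, v2) = (p1 + p2, v1 + v2), the sum rule
    (p1, v1), (p2, v2) = a, b
    return p1 + p2, [x + y for x, y in zip(v1, v2)]

def param(value, index, dim):
    """A parameter theta_index lifted to the pair (theta_index, grad of theta_index)."""
    grad = [0.0] * dim
    grad[index] = 1.0
    return value, grad

if __name__ == "__main__":
    # f(theta) = theta0 * theta1 + theta2, evaluated at theta = (2, 3, 5).
    t0, t1, t2 = param(2.0, 0, 3), param(3.0, 1, 3), param(5.0, 2, 3)
    print(oplus(otimes(t0, t1), t2))   # (11.0, [3.0, 2.0, 1.0]) = (value, gradient)
    </Paragraph>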
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Removing Inefficiencies
</SectionTitle>
    <Paragraph position="0"> Now for some important remarks on efficiency: Computing t_i is an instance of the well-known algebraic path problem (Lehmann, 1977; Tarjan, 1981a). Let T_i = x_i ∘ f_θ ∘ y_i. Then t_i is the total semiring weight w_0n of paths in T_i from initial state 0 to final state n (assumed WLOG to be unique and unweighted). It is wasteful to compute t_i as suggested earlier, by minimizing (ε:x_i) ∘ f_θ ∘ (y_i:ε), since then the real work is done by an ε-closure step (Mohri, 2002) that implements the all-pairs version of algebraic path, whereas all we need is the single-source version. If n and m are the number of states and edges,19 then both problems are O(n³) in the worst case, but the single-source version can be solved in essentially O(m) time for acyclic graphs and other reducible flow graphs (Tarjan, 1981b). For a general graph T_i, Tarjan (1981b) shows how to partition it into "hard" subgraphs that localize the cyclicity or irreducibility, then run the O(n³) algorithm on each subgraph (thereby reducing n to as little as 1), and recombine the results. The overhead of partitioning and recombining is essentially only O(m).</Paragraph>
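    <Paragraph> For the acyclic case, the single-source computation is just a sweep over the states in topological order, combining weights with ⊕ and ⊗. A minimal sketch over a generic semiring; the graph encoding and function names are illustrative assumptions.

def single_source_algebraic_path(n_states, edges, oplus, otimes, zero, one):
    """Total semiring weight of all paths from state 0 to each state of an
    acyclic graph.  States are assumed to be numbered 0..n_states-1 in a
    topological order, so every edge goes from a lower-numbered state to a
    higher-numbered one.  `edges` maps j to a list of (k, weight) pairs."""
    w = [zero] * n_states
    w[0] = one
    for j in range(n_states):
        for k, weight in edges.get(j, []):
            w[k] = oplus(w[k], otimes(w[j], weight))
    return w

if __name__ == "__main__":
    # Ordinary-probability semiring (R>=0, +, x) on a tiny 4-state graph.
    edges = {0: [(1, 0.5), (2, 0.5)], 1: [(3, 0.4)], 2: [(3, 0.9)]}
    print(single_source_algebraic_path(4, edges, lambda a, b: a + b,
                                       lambda a, b: a * b, 0.0, 1.0))
    # w[3] = 0.5*0.4 + 0.5*0.9 = 0.65
    </Paragraph>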
    <Paragraph position="1"> For speeding up the O(n³) problem on subgraphs, one can use an approximate relaxation technique (Mohri, 2002). Efficient hardware implementation is also possible via chip-level parallelism (Rote, 1985). 17Eisner (submitted) develops fast minimization algorithms that work for the real and V-expectation semirings.</Paragraph>
    <Paragraph position="2"> 18Division and subtraction are also possible: −(p, v) = (−p, −v) and (p, v)⁻¹ = (p⁻¹, −p⁻¹·v·p⁻¹). Division is commonly used in defining f_θ (for normalization).</Paragraph>
    <Paragraph position="3"> 19Multiple edges from j to k are summed into a single edge.</Paragraph>
    <Paragraph position="4"> In many cases of interest, T_i is an acyclic graph.20 Then Tarjan's method computes w_0j for each j in topologically sorted order, thereby finding t_i in a linear number of ⊕ and ⊗ operations. For HMMs (footnote 11), T_i is the familiar trellis, and we would like this computation of t_i to reduce to the forward-backward algorithm (Baum, 1972). But notice that it has no backward pass. In place of pushing cumulative probabilities backward to the arcs, it pushes cumulative arcs (more generally, values in V) forward to the probabilities. This is slower because our ⊕ and ⊗ are vector operations, and the vectors rapidly lose sparsity as they are added together. We therefore reintroduce a backward pass that lets us avoid ⊕ and ⊗ when computing t_i (so they are needed only to construct T_i). This speedup also works for cyclic graphs and for any V. Write w_jk as (p_jk, v_jk), and let w¹_jk = (p¹_jk, v¹_jk) denote the weight of the edge from j to k.19 Then it can be shown that w_0n = (p_0n, ∑_{j,k} p_0j · v¹_jk · p_kn). The forward and backward probabilities, p_0j and p_kn, can be computed using single-source algebraic path for the simpler semiring (R, +, ×, *), or equivalently, by solving a sparse linear system of equations over R, a much-studied problem at O(n) space, O(nm) time, and faster approximations (Greenbaum, 1997).</Paragraph>
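    <Paragraph> A sketch of this backward-pass speedup for an acyclic T_i, with illustrative names and the assumption that states are numbered in topological order: the forward and backward sums use only real arithmetic, and each edge's value vector is touched exactly once.

def total_weight_forward_backward(n_states, edges, dim):
    """Compute w_0n = (p_0n, sum over j,k of p_0j * v1_jk * p_kn) for an
    acyclic graph whose states 0..n_states-1 are listed in topological order,
    with start state 0 and final state n_states-1.  `edges` maps j to a list
    of (k, p_jk, v1_jk), where (p_jk, v1_jk) is the expectation-semiring
    weight of the (summed) edge from j to k and v1_jk has length `dim`."""
    n = n_states - 1
    alpha = [0.0] * n_states          # forward probabilities p_0j
    alpha[0] = 1.0
    for j in range(n_states):
        for k, p, _v in edges.get(j, []):
            alpha[k] += alpha[j] * p
    beta = [0.0] * n_states           # backward probabilities p_kn
    beta[n] = 1.0
    for j in reversed(range(n_states)):
        for k, p, _v in edges.get(j, []):
            beta[j] += p * beta[k]
    value = [0.0] * dim               # vector work: a single pass over the edges
    for j in range(n_states):
        for k, _p, v in edges.get(j, []):
            for i in range(dim):
                value[i] += alpha[j] * v[i] * beta[k]
    return alpha[n], value

if __name__ == "__main__":
    # Arcs 0->1 and 1->2 carrying weights (0.5, 0.5*e_0) and (0.4, 0.4*e_1).
    edges = {0: [(1, 0.5, [0.5, 0.0])], 1: [(2, 0.4, [0.0, 0.4])]}
    print(total_weight_forward_backward(3, edges, 2))
    # (0.2, [0.2, 0.2]); dividing by 0.2 gives expected counts [1.0, 1.0]
    </Paragraph>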
    <Paragraph position="5"> A Viterbi variant of the expectation semiring exists: replace (3) with if(p1 &gt; p2, (p1, v1), (p2, v2)).</Paragraph>
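    <Paragraph> A sketch of the Viterbi variant's ⊕ (the naming is illustrative); ⊗ and * are unchanged.

def viterbi_oplus(a, b):
    """Viterbi variant of (3): instead of summing over alternative paths,
    keep the weight (probability and value vector) of the more probable one."""
    (p1, v1), (p2, v2) = a, b
    return (p1, v1) if p1 > p2 else (p2, v2)
    </Paragraph>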
    <Paragraph position="6"> Here, the forward and backward probabilities can be computed in time only O(m + n log n) (Fredman and Tarjan, 1987). k-best variants are also possible.</Paragraph>
  </Section>
</Paper>