<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1001">
  <Title>Parameter Estimation for Probabilistic Finite-State Transducers</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Transducers and Parameters
</SectionTitle>
    <Paragraph position="0"> Finite-state machines, including finite-state automata (FSAs) and transducers (FSTs), are a kind of labeled directed multigraph. For ease and brevity, we explain them by example. Fig. 1a shows a probabilistic FST with input alphabet = fa;bg, output alphabet = fx;zg, and all states final. It may be regarded as a device for generating a string pair in by a random walk from 0 . Two paths exist that generate both input aabb and output xz:</Paragraph>
    <Paragraph position="2"> Each of the paths has probability .0002646, so the probability of somehow generating the pair (aabb;xz) is :0002646 +:0002646 = :0005292.</Paragraph>
    <Paragraph position="3"> Abstracting away from the idea of random walks, arc weights need not be probabilities. Still, define a path's weight as the product of its arc weights and the stopping weight of its final state. Thus Fig. 1a defines a weighted relation f where f(aabb;xz) = :0005292. This particular relation does happen to be probabilistic (seex1). It represents a joint distribution (since Px;yf(x;y) = 1). Meanwhile, Fig. 1c defines a conditional one (8xPyf(x;y) = 1).</Paragraph>
    <Paragraph position="4"> This paper explains how to adjust probability distributions like that of Fig. 1a so as to model training data better. The algorithm improves an FST's numeric weights while leaving its topology fixed.</Paragraph>
    <Paragraph position="5"> How many parameters are there to adjust in Fig. 1a? That is up to the user who built it! An FST model with few parameters is more constrained, making optimization easier. Some possibilities: Most simply, the algorithm can be asked to tune the 17 numbers in Fig. 1a separately, subject to the constraint that the paths retain total probability 1. A more specific version of the constraint requires the FST to remain Markovian: each of the 4 states must present options with total probability 1 (at state 1 , 15+.7+.03.+.12=1). This preserves the random-walk interpretation and (we will show) entails no loss of generality. The 4 restrictions leave 13 free params.</Paragraph>
    <Paragraph position="6"> But perhaps Fig. 1a was actually obtained as the composition of Fig. 1b-c, effectively defin-</Paragraph>
    <Paragraph position="8"> main Markovian, they have 5 and 1 degrees of freedom respectively, so now Fig. 1a has only 6 parameters total.2 In general, composing machines multiplies their arc counts but only adds their parameter counts. We wish to optimize just the few underlying parameters, not independently optimize the many arc weights of the composed machine.</Paragraph>
    <Paragraph position="9"> Perhaps Fig. 1b was itself obtained by the probabilistic regular expression (a : p) (b : (p + q)) with the 3 parameters ( ; ; ) = (:7;:2;:5). With = :1 from footnote 2, the composed machine 2Why does Fig. 1c have only 1 degree of freedom? The Markovian requirement means something different in Fig. 1c, which defines a conditional relation P(output j mid) rather than a joint one. A random walk on Fig. 1c chooses among arcs with a given input label. So the arcs from state 6 with input p must have total probability 1 (currently .9+.1). All other arc choices are forced by the input label and so have probability 1. The only tunable value is .1 (denote it by ), with :9 = 1 .</Paragraph>
    <Paragraph position="10"> (Fig. 1a) has now been described with a total of just 4 parameters!3 Here, probabilistic union E + F def= E + (1 )F means &amp;quot;flip a -weighted coin and generateE if heads,F if tails.&amp;quot; E def= ( E) (1 ) means &amp;quot;repeatedly flip an -weighted coin and keep repeating E as long as it comes up heads.&amp;quot; These 4 parameters have global effects on Fig. 1a, thanks to complex parameter tying: arcs 4 b:p ! 5 , 5 b:q ! 5 in Fig. 1b get respective probabilities (1 ) and (1 ) , which covary with and vary oppositely with . Each of these probabilities in turn affects multiple arcs in the composed FST of Fig. 1a.</Paragraph>
    <Paragraph position="11"> We offer a theorem that highlights the broad applicability of these modeling techniques.4 If f(input;output) is a weighted regular relation, then the following statements are equivalent: (1)f is a joint probabilistic relation; (2) f can be computed by a Markovian FST that halts with probability 1; (3) f can be expressed as a probabilistic regexp, i.e., a regexp built up from atomic expressions a:b (for a2 [f g;b2 [f g) using concatenation, probabilistic union +p, and probabilistic closure p.</Paragraph>
    <Paragraph position="12"> For defining conditional relations, a good regexp language is unknown to us, but they can be defined in several other ways: (1) via FSTs as in Fig. 1c, (2) by compilation of weighted rewrite rules (Mohri and Sproat, 1996), (3) by compilation of decision trees (Sproat and Riley, 1996), (4) as a relation that performs contextual left-to-right replacement of input substrings by a smaller conditional relation (Gerdemann and van Noord, 1999),5 (5) by conditionalization of a joint relation as discussed below. A central technique is to define a joint relation as a noisy-channel model, by composing a joint relation with a cascade of one or more conditional relations as in Fig. 1 (Pereira and Riley, 1997; Knight and Graehl, 1998). The general form is illustrated by 3Conceptually, the parameters represent the probabilities of reading another a ( ); reading another b ( ); transducing b to p rather than q ( ); starting to transduce p to rather than x ( ). 4To prove (1))(3), express f as an FST and apply the well-known Kleene-Sch&amp;quot;utzenberger construction (Berstel and Reutenauer, 1988), taking care to write each regexp in the construction as a constant times a probabilistic regexp. A full proof is straightforward, as are proofs of (3))(2), (2))(1).</Paragraph>
    <Paragraph position="13"> 5In (4), the randomness is in the smaller relation's choice of how to replace a match. One can also get randomness through the choice of matches, ignoring match possibilities by randomly deleting markers in Gerdemann and van Noord's construction.</Paragraph>
    <Paragraph position="15"> implemented by composing 4 machines.6;7 There are also procedures for defining weighted FSTs that are not probabilistic (Berstel and Reutenauer, 1988). Arbitrary weights such as 2.7 may be assigned to arcs or sprinkled through a regexp (to be compiled into : =2:7 ! arcs). A more subtle example is weighted FSAs that approximate PCFGs (Nederhof, 2000; Mohri and Nederhof, 2001), or to extend the idea, weighted FSTs that approximate joint or conditional synchronous PCFGs built for translation. These are parameterized by the PCFG's parameters, but add or remove strings of the PCFG to leave an improper probability distribution.</Paragraph>
    <Paragraph position="16"> Fortunately for those techniques, an FST with positive arc weights can be normalized to make it jointly or conditionally probabilistic: An easy approach is to normalize the options at each state to make the FST Markovian. Unfortunately, the result may differ for equivalent FSTs that express the same weighted relation. Undesirable consequences of this fact have been termed &amp;quot;label bias&amp;quot; (Lafferty et al., 2001). Also, in the conditional case such per-state normalization is only correct if all states accept all input suffixes (since &amp;quot;dead ends&amp;quot; leak probability mass).8 A better-founded approach is global normalization, which simply divides each f(x;y) byP</Paragraph>
    <Paragraph position="18"> ditional case). To implement the joint case, just divide stopping weights by the total weight of all paths (whichx4 shows how to find), provided this is finite.</Paragraph>
    <Paragraph position="19"> In the conditional case, let g be a copy of f with the output labels removed, so that g(x) finds the desired divisor; determinize g if possible (but this fails for some weighted FSAs), replace all weights with their reciprocals, and compose the result with f.9 6P(w;x) defines the source model, and is often an &amp;quot;identity FST&amp;quot; that requires w = x, really just an FSA.</Paragraph>
    <Paragraph position="20">  &amp;quot;branching noisy channels&amp;quot; (a case of dendroid distributions). In Pw;xP(vjw)P(v0jw)P(w;x)P(yjx), the true transcription w can be triply constrained by observing speech y and two errorful transcriptions v;v0, which independently depend on w. 8A corresponding problem exists in the joint case, but may be easily avoided there by first pruning non-coaccessible states. 9It suffices to make g unambiguous (one accepting path per string), a weaker condition than determinism. When this is not possible (as in the inverse of Fig. 1b, whose conditionaliza-Normalization is particularly important because it enables the use of log-linear (maximum-entropy) parameterizations. Here one defines each arc weight, coin weight, or regexp weight in terms of meaningful features associated by hand with that arc, coin, etc. Each feature has a strength 2R&gt;0, and a weight is computed as the product of the strengths of its features.10 It is now the strengths that are the learnable parameters. This allows meaningful parameter tying: if certain arcs such as u:i !, o:e !, and a:ae ! share a contextual &amp;quot;vowel-fronting&amp;quot; feature, then their weights rise and fall together with the strength of that feature. The resulting machine must be normalized, either per-state or globally, to obtain a joint or a conditional distribution as desired. Such approaches have been tried recently in restricted cases (McCallum et al., 2000; Eisner, 2001b; Lafferty et al., 2001).</Paragraph>
    <Paragraph position="21"> Normalization may be postponed and applied instead to the result of combining the FST with other FSTs by composition, union, concatenation, etc. A simple example is a probabilistic FSA defined by normalizing the intersection of other probabilistic FSAs f1;f2;:::. (This is in fact a log-linear model in which the component FSAs define the features: string x has logfi(x) occurrences of feature i.) In short, weighted finite-state operators provide a language for specifying a wide variety of parameterized statistical models. Let us turn to their training.</Paragraph>
  </Section>
class="xml-element"></Paper>